From mst at mellanox.co.il Tue Nov 1 01:09:02 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 1 Nov 2005 11:09:02 +0200 Subject: [openib-general] Re: 2.6.14 patches In-Reply-To: References: Message-ID: <20051101090902.GG31134@mellanox.co.il> Quoting r. Sean Hefty : > Subject: RE: 2.6.14 patches > > >Sean, Hal, now that 2.6.14 is out, do you plan to apply > >the patches in > https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/? > >Once you do, I'll put reverted patches in the backport directory. > > I'll apply the patch to addr.c shortly. Thanks for the reminder. OK, I have removed linux-2.6.14-rc3-addr.diff since trunk does not need it anymore. Hal, could you please apply the patch linux-2.6.14-rc3-at.diff to at.c? Thanks, -- MST From mirko.benz at xiranet.com Tue Nov 1 01:45:17 2005 From: mirko.benz at xiranet.com (Mirko Benz) Date: Tue, 01 Nov 2005 10:45:17 +0100 Subject: [openib-general] [PATCH/RFC] IB: Add SCSI RDMA Protocol (SRP) initiator Message-ID: <4367392D.7080804@xiranet.com> Hello, We (Xiranet) are developing SRP targets / routers, too. We are testing against: - OpenIB Linux SRP initiator - OpenIB Windows SRP initiator - Mellanox Linux SRP initiator The OpenIB Linux initiator works already well with our system. We would like to see it in mainline. Some suggestions (not necessarily kernel related): - More status information: When parameter parsing for configuration fails no message is printed out (e.g. when a parameter name is misspelled like ioc_gguid instead of ioc_guid when following your Email). - OpenIB SRP Wiki Page should be updated. - Maybe auto discovery/connection of targets should be integrated as on option (as the Windows and Mellanox SRP initiators do). - We had problems with the configuration tool (dmcli) on 64 bit Linux. - An explanation for the pkey and service_id parameters should be added. 
- backport for Linux enterprise distributions with prebuild RPMs for drivers, config and documentation would be helpful Regards, Mirko From liran at mellanox.co.il Tue Nov 1 03:14:43 2005 From: liran at mellanox.co.il (Liran Sorani) Date: Tue, 1 Nov 2005 13:14:43 +0200 Subject: [openib-general] [PATCH] Osmtest - update command options + vapi fix Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E35AC09B@mtlexch01.mtl.com> Hi , Hal . We've decided to keep and maintain Osmtest in the main trunk , since it is not only a test but a tool to validate SA/SM. The following is a small patch for the follwoing : 1. Support old form of running osmtest , i.e instead of -g= , use -g and add '-p' option to display current available port guids. 2. Support Vapi stack. 3. Update Service flow (Update one of the service lease checks from 1 sec to 4 sec). 4. Ident switch-case) issues in main.c Thanks , Liran . Signed-off-by: Liran Sorani Index: osmt_mtl_regular_qp.c =================================================================== --- osmt_mtl_regular_qp.c (revision 3928) +++ osmt_mtl_regular_qp.c (working copy) @@ -73,7 +73,7 @@ #include #include #include - +#include /* * Initialize the QP etc. * Given in res: port_num, max_outs_sq, max_outs_rq Index: osmt_service.c =================================================================== --- osmt_service.c (revision 3928) +++ osmt_service.c (working copy) @@ -1266,7 +1266,7 @@ p_osmt, cl_ntoh64(id[1]), /* IN ib_net64_t service_id, */ IB_DEFAULT_PKEY,/* IN ib_net16_t service_pkey, */ - cl_hton32(0x00000001), /* IN ib_net32_t service_lease, */ + cl_hton32(0x00000004), /* IN ib_net32_t service_lease, */ 11, /* IN uint8_t service_key_lsb, */ (char*)service_name[1] /* IN char *service_name */ ); Index: main.c =================================================================== --- main.c (revision 3928) +++ main.c (working copy) @@ -128,9 +128,11 @@ "--guid \n" " This option specifies the local port GUID value\n" " with which osmtest should bind. 
osmtest may be\n" - " bound to 1 port at a time.\n" - " Without -g, osmtest displays a menu of possible\n" - " port GUIDs and waits for user input.\n\n" ); + " bound to 1 port at a time.\n\n"); + printf( "-p \n" + "--port\n" + " This option display menu of possible local port GUID values\n" + " with which osmtest could bind.\n\n"); printf( "-h\n" "--help\n" " Display this usage info then exit.\n\n" ); printf( "-i \n" @@ -160,9 +162,9 @@ " --- -----------------\n" " -M1 - Short Multicast Flow (default) - single mode.\n" " -M2 - Short Multicast Flow - multiple mode.\n" - " -M3 - Long Multicast Flow - single mode.\n" - " -M4 - Long Multicast Flow - mutiple mode.\n" - " Single mode - Osmtest is tested alone, with no other\n" + " -M3 - Long MultiCast Flow - single mode.\n" + " -M4 - Long MultiCast Flow - mutiple mode.\n" + " Single mode - Osmtest is tested alone , with no other \n" " apps that interact vs. OpenSM MC.\n" " Multiple mode - Could be run with other apps using MC vs.\n" " OpenSM." @@ -305,7 +307,7 @@ char flow_name[64]; boolean_t mem_track = FALSE; uint32_t next_option; - const char *const short_option = "f:l:m:M:d:g::s:t:i:cvVh"; + const char *const short_option = "f:l:m:M:d:g:s:t:i:pcvVh"; /* * In the array below, the 2nd parameter specified the number @@ -322,9 +324,10 @@ {"inventory", 1, NULL, 'i'}, {"max_lid", 1, NULL, 'm'}, {"guid", 2, NULL, 'g'}, + {"port", 0, NULL, 'p'}, {"help", 0, NULL, 'h'}, {"stress", 1, NULL, 's'}, - {"Multicast_Mode", 1, NULL, 'M'}, + {"MultiCast_Mode", 1, NULL, 'M'}, {"timeout", 1, NULL, 't'}, {"verbose", 0, NULL, 'v'}, {"log_file", 1, NULL, 'l'}, @@ -363,7 +366,6 @@ { next_option = getopt_long_only( argc, argv, short_option, long_option, NULL ); - switch ( next_option ) { case 'c': @@ -446,28 +448,30 @@ break; case 'g': - /* - Specifies port guid with which to bind. 
- */ - if (optarg) { - guid = cl_hton64( strtoull( optarg, NULL, 16 )); - printf(" Guid <0x%"PRIx64">\n", cl_hton64( guid )); - } else - guid = INVALID_GUID; - break; - + /* + * Specifies port guid with which to bind. + */ + guid = cl_hton64( strtoull( optarg, NULL, 16 )); + printf(" Guid <0x%"PRIx64">\n", cl_hton64( guid )); + break; + case 'p': + /* + * Display current port guids + */ + guid = INVALID_GUID; + break; case 't': - /* + /* * Specifies transaction timeout. - */ - opt.transaction_timeout = strtol( optarg, NULL, 0 ); - printf( "\tTransaction timeout = %d\n", opt.transaction_timeout ); - break; + */ + opt.transaction_timeout = strtol( optarg, NULL, 0 ); + printf( "\tTransaction timeout = %d\n", opt.transaction_timeout ); + break; case 'l': - opt.log_file = optarg; - printf("\tLog File:%s\n", opt.log_file ); - break; + opt.log_file = optarg; + printf("\tLog File:%s\n", opt.log_file ); + break; case 'v': /* @@ -510,32 +514,32 @@ } break; - case 'M': - /* - * Perform stress test. - */ - opt.mmode = strtol( optarg, NULL, 0 ); - printf( "\tMulticast test enabled: " ); - switch ( opt.mmode ) - { - case 1: - printf( "Short MC Flow - single mode (default)\n" ); - break; - case 2: - printf( "Short MC Flow - mutiple mode\n" ); - break; - case 3: - printf( "Long MC Flow - single mode\n" ); - break; - case 4: - printf( "Long MC Flow - mutiple mode\n" ); - break; - default: - printf( "Unknown value %u (ignored)\n", opt.stress ); - opt.mmode = 0; - break; - } - break; + case 'M': + /* + * Perform stress test. 
+ */ + opt.mmode = strtol( optarg, NULL, 0 ); + printf( "\tMultiCast test enabled: " ); + switch ( opt.mmode ) + { + case 1: + printf( "Short MC Flow - single mode (default)\n" ); + break; + case 2: + printf( "Short MC Flow - mutiple mode\n" ); + break; + case 3: + printf( "Long MC Flow - single mode\n" ); + break; + case 4: + printf( "Long MC Flow - mutiple mode\n" ); + break; + default: + printf( "Unknown value %u (ignored)\n", opt.stress ); + opt.mmode = 0; + break; + } + break; case 'd': /* Index: Makefile.am =================================================================== --- Makefile.am (revision 3928) +++ Makefile.am (working copy) @@ -13,9 +13,11 @@ bin_PROGRAMS = osmtest osmtest_SOURCES = main.c osmtest.c osmt_service.c osmt_slvl_vl_arb.c \ osmt_multicast.c osmt_inform.c - +if OSMV_VAPI +osmtest_SOURCES = osmt_mtl_regular_qp.c +endif osmtest_CFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) -osmtest_LDADD = -L../complib -L../libvendor -L../opensm -L$(libdir) \ +osmtest_LDADD = -L../complib -L../libvendor -L../opensm -L$(libdir) -L. \ $(OSMV_LDADD) -lopensm -losmcomp -losmvendor osmtest_LDFLAGS = -Wl,--rpath -Wl,$(libdir) -lpthread -L../opensm > Liran Sorani > Mellanox Technologies LTD. > mailto:liran at mellanox.co.il > Phone: +972(4)9097200 Ext: 214 > Israel, Yokneam P.O.B 586 ZIP 20692 > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL:
-----Forwarded Message-----
From: Hal Rosenstock
To: Itamar Rabenstein
Cc: openib-general at openib.org, Eitan Zahavi
Subject: Re: opensm problem ???
Date: 31 Oct 2005 16:49:58 -0500

Hi Itamar,

On Wed, 2005-10-26 at 11:25, Itamar Rabenstein wrote:
> Hi All,
> I am running openib gen2 svn rev 3872 (kernel + user).
> my system is EM64T (x86_64) + SUSE9.3 + k2.6.13.4

I've run Opterons with 2.6.13 and not quite as recent svn 3850. I'm in the process of updating to the latest now that I'm back. Do you still have this problem?

> I have arbel in memfree mode (fw 5.1.132) .

I don't have a memfree HCA (arbel or otherwise). It also appears you are using more recent firmware than is generally available. Are you sure it's unrelated to that?

> my 2 ports are connected in loopback.

Loopback configuration works in general.

> I am running opensm but the links are not getting into ACTIVE.
> in the osm.log i see
>
> Oct 26 16:59:25 366150 [43005960] -> __osm_vl15_poller: 1 QP0 MADs on
> wire, 1 outstanding, 0 unicasts sent, 1 total sent.
>
> Oct 26 16:59:33 937993 [44007960] -> umad_receiver: ERR 5404: recv
> error on MAD sized umad (Interrupted system call)

It looks to me like the code in osm_vendor_ibumad.c::umad_receiver() should handle this (just indicates this occurred) and reissue the umad_recv. It appears that the GetResp for NodeInfo is never received, yet this transaction doesn't time out either, which is what I would have expected.

-- Hal

>
> Does it works for others ?
>
> Itamar

From halr at voltaire.com Tue Nov 1 05:13:03 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Nov 2005 08:13:03 -0500
Subject: [openib-general] Re: [PATCH] Osmtest - update command options + vapi fix
Message-ID: <1130850782.15904.6103.camel@hal.voltaire.com>

Hi Liran,

On Tue, 2005-11-01 at 06:14, Liran Sorani wrote:
> Hi , Hal .
> We've decided to keep and maintain Osmtest in the main trunk , since
> it is not only a test but a tool to validate SA/SM.

I'm not sure I see the difference. It is a test tool which validates SA/SM. Other tools validate other components. Anyhow, I will apply the patch.

> The following is a small patch for the follwoing :
> 1. Support old form of running osmtest , i.e instead of -g= guid> , use -g and add '-p' option to display current
> available port guids.
>
> 2. Support Vapi stack.
> 3. Update Service flow (Update one of the service lease checks from 1
> sec to 4 sec).
> 4. Ident switch-case) issues in main.c
>
> Thanks , Liran .

The patch for main.c appears to be line wrapped:

patching file main.c
patch: **** malformed patch at line 34: value\n"

I only need that part of the patch.

> Signed-off-by: Liran Sorani

[snip...]
> Index: Makefile.am > =================================================================== > --- Makefile.am (revision 3928) > +++ Makefile.am (working copy) > @@ -13,9 +13,11 @@ > bin_PROGRAMS = osmtest > osmtest_SOURCES = main.c osmtest.c osmt_service.c osmt_slvl_vl_arb.c > \ > osmt_multicast.c osmt_inform.c > - > +if OSMV_VAPI > +osmtest_SOURCES = osmt_mtl_regular_qp.c Shouldn't this be: osmtest_SOURCES += osmt_mtl_regular_qp.c -- Hal > +endif > osmtest_CFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT > $(DBGFLAGS) > -osmtest_LDADD = -L../complib -L../libvendor -L../opensm -L$(libdir) \ > +osmtest_LDADD = -L../complib -L../libvendor -L../opensm -L$(libdir) > -L. \ > $(OSMV_LDADD) -lopensm -losmcomp -losmvendor > > osmtest_LDFLAGS = -Wl,--rpath -Wl,$(libdir) -lpthread -L../opensm > > > > > Liran Sorani > > Mellanox Technologies LTD. > > mailto:liran at mellanox.co.il > > Phone: +972(4)9097200 Ext: 214 > > Israel, Yokneam P.O.B 586 ZIP 20692 > > > > > From bardov at gmail.com Tue Nov 1 05:17:21 2005 From: bardov at gmail.com (Dan Bar Dov) Date: Tue, 1 Nov 2005 15:17:21 +0200 Subject: [openib-general] ppc64 compilation failure In-Reply-To: <20051031190340.GE6246@us.ibm.com> References: <20051031184924.GD6246@us.ibm.com> <20051031190340.GE6246@us.ibm.com> Message-ID: I fixed the iser compile warning r3929. 
Dan On 10/31/05, Nishanth Aravamudan wrote: > On 31.10.2005 [10:49:24 -0800], Nishanth Aravamudan wrote: > > Hi Roland, > > > > Looks like ppc64 build with 2.6.14-git3 and svn 3918 is busted: > > Only the ppc64 build had finished when I sent this mail, but the same > happens on x86, with an additional: > > drivers/infiniband/ulp/iser/iser_mod.c:59: warning: large integer implicitly truncated to unsigned type > > Thanks, > Nish > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From liran at mtl075.yok.mtl.com Tue Nov 1 05:38:51 2005 From: liran at mtl075.yok.mtl.com (Liran Sorani) Date: 01 Nov 2005 15:38:51 +0200 Subject: [openib-general] Re:[PATCH] Osmtest - update command option + vapi fix Message-ID: <30u0ewv66s.fsf@mtl066.yok.mtl.com> Hi Hal, 1. Regarding the osmtest_SOURCES , it works both ways (i.e compile all files required) , still the correct one is += 2. Following is the patch for main.c : Index: main.c =================================================================== --- main.c (revision 3928) +++ main.c (working copy) @@ -128,9 +128,11 @@ "--guid \n" " This option specifies the local port GUID value\n" " with which osmtest should bind. 
osmtest may be\n" - " bound to 1 port at a time.\n" - " Without -g, osmtest displays a menu of possible\n" - " port GUIDs and waits for user input.\n\n" ); + " bound to 1 port at a time.\n\n"); + printf( "-p \n" + "--port\n" + " This option display menu of possible local port GUID values\n" + " with which osmtest could bind.\n\n"); printf( "-h\n" "--help\n" " Display this usage info then exit.\n\n" ); printf( "-i \n" @@ -160,9 +162,9 @@ " --- -----------------\n" " -M1 - Short Multicast Flow (default) - single mode.\n" " -M2 - Short Multicast Flow - multiple mode.\n" - " -M3 - Long Multicast Flow - single mode.\n" - " -M4 - Long Multicast Flow - mutiple mode.\n" - " Single mode - Osmtest is tested alone, with no other\n" + " -M3 - Long MultiCast Flow - single mode.\n" + " -M4 - Long MultiCast Flow - mutiple mode.\n" + " Single mode - Osmtest is tested alone , with no other \n" " apps that interact vs. OpenSM MC.\n" " Multiple mode - Could be run with other apps using MC vs.\n" " OpenSM." @@ -305,7 +307,7 @@ char flow_name[64]; boolean_t mem_track = FALSE; uint32_t next_option; - const char *const short_option = "f:l:m:M:d:g::s:t:i:cvVh"; + const char *const short_option = "f:l:m:M:d:g:s:t:i:pcvVh"; /* * In the array below, the 2nd parameter specified the number @@ -322,9 +324,10 @@ {"inventory", 1, NULL, 'i'}, {"max_lid", 1, NULL, 'm'}, {"guid", 2, NULL, 'g'}, + {"port", 0, NULL, 'p'}, {"help", 0, NULL, 'h'}, {"stress", 1, NULL, 's'}, - {"Multicast_Mode", 1, NULL, 'M'}, + {"MultiCast_Mode", 1, NULL, 'M'}, {"timeout", 1, NULL, 't'}, {"verbose", 0, NULL, 'v'}, {"log_file", 1, NULL, 'l'}, @@ -363,7 +366,6 @@ { next_option = getopt_long_only( argc, argv, short_option, long_option, NULL ); - switch ( next_option ) { case 'c': @@ -446,28 +448,30 @@ break; case 'g': - /* - Specifies port guid with which to bind. 
- */ - if (optarg) { - guid = cl_hton64( strtoull( optarg, NULL, 16 )); - printf(" Guid <0x%"PRIx64">\n", cl_hton64( guid )); - } else - guid = INVALID_GUID; - break; - + /* + * Specifies port guid with which to bind. + */ + guid = cl_hton64( strtoull( optarg, NULL, 16 )); + printf(" Guid <0x%"PRIx64">\n", cl_hton64( guid )); + break; + case 'p': + /* + * Display current port guids + */ + guid = INVALID_GUID; + break; case 't': - /* + /* * Specifies transaction timeout. - */ - opt.transaction_timeout = strtol( optarg, NULL, 0 ); - printf( "\tTransaction timeout = %d\n", opt.transaction_timeout ); - break; + */ + opt.transaction_timeout = strtol( optarg, NULL, 0 ); + printf( "\tTransaction timeout = %d\n", opt.transaction_timeout ); + break; case 'l': - opt.log_file = optarg; - printf("\tLog File:%s\n", opt.log_file ); - break; + opt.log_file = optarg; + printf("\tLog File:%s\n", opt.log_file ); + break; case 'v': /* @@ -510,32 +514,32 @@ } break; - case 'M': - /* - * Perform stress test. - */ - opt.mmode = strtol( optarg, NULL, 0 ); - printf( "\tMulticast test enabled: " ); - switch ( opt.mmode ) - { - case 1: - printf( "Short MC Flow - single mode (default)\n" ); - break; - case 2: - printf( "Short MC Flow - mutiple mode\n" ); - break; - case 3: - printf( "Long MC Flow - single mode\n" ); - break; - case 4: - printf( "Long MC Flow - mutiple mode\n" ); - break; - default: - printf( "Unknown value %u (ignored)\n", opt.stress ); - opt.mmode = 0; - break; - } - break; + case 'M': + /* + * Perform stress test. 
+ */ + opt.mmode = strtol( optarg, NULL, 0 ); + printf( "\tMultiCast test enabled: " ); + switch ( opt.mmode ) + { + case 1: + printf( "Short MC Flow - single mode (default)\n" ); + break; + case 2: + printf( "Short MC Flow - mutiple mode\n" ); + break; + case 3: + printf( "Long MC Flow - single mode\n" ); + break; + case 4: + printf( "Long MC Flow - mutiple mode\n" ); + break; + default: + printf( "Unknown value %u (ignored)\n", opt.stress ); + opt.mmode = 0; + break; + } + break; case 'd': /*

From rolandd at cisco.com Tue Nov 1 07:08:55 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Tue, 01 Nov 2005 07:08:55 -0800
Subject: [openib-general] [PATCH/RFC] IB: Add SCSI RDMA Protocol (SRP) initiator
In-Reply-To: <4367392D.7080804@xiranet.com> (Mirko Benz's message of "Tue, 01 Nov 2005 10:45:17 +0100")
References: <4367392D.7080804@xiranet.com>
Message-ID: <52oe54flrs.fsf@cisco.com>

> We (Xiranet) are developing SRP targets / routers, too.

Great, welcome to the club!

> - More status information:
> When parameter parsing for configuration fails no message is printed
> out (e.g. when a parameter name is misspelled like ioc_gguid instead
> of ioc_guid when following your Email).

Good point, I'll add an error message to the configuration parsing code.

> - OpenIB SRP Wiki Page should be updated.

It's easy to create an account and edit Wiki pages...

> - Maybe auto discovery/connection of targets should be integrated as
> on option (as the Windows and Mellanox SRP initiators do).
There's lots of scope for userspace tools for discovery, connection, health monitoring and so on for SRP targets. It would be great if someone started working on something like that. Do you have any interest in contributing to this? > - We had problems with the configuration tool (dmcli) on 64 bit Linux. Let's not call dmcli "the configuration tool" -- it's a quick hack that I put together that needs to be replaced by real device management (see above). > - An explanation for the pkey and service_id parameters should be added. Care to supply some suggested text? Thanks, Roland From kingman at storagegear.com Tue Nov 1 07:25:42 2005 From: kingman at storagegear.com (John Kingman) Date: Tue, 1 Nov 2005 09:25:42 -0600 (CST) Subject: [openib-general] Re: [PATCH] [SRP] support for it_iu length negotiation In-Reply-To: <52ek61gdx7.fsf@cisco.com> References: <52ek61gdx7.fsf@cisco.com> Message-ID: On Mon, 31 Oct 2005, Roland Dreier wrote: >With that said I don't think I like this patch. I don't think it's a >win to allocate 1 KB IUs when we'll almost never have gather/scatter >lists that big. Even the 256 byte IUs that the current driver uses >seem on the borderline of being too big. > >Also, is it really a win to have the target fetch a large indirect >buffer list? It seems like it would be better for performance to give >the SCSI layer a limit on the size of the gather/scatter list we >support so that our indirect buffer lists always fit in the IUs we send. Without knowing what the optimal values should be, perhaps we should make some of these module parameters. 
John From mshefty at ichips.intel.com Tue Nov 1 09:23:21 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 01 Nov 2005 09:23:21 -0800 Subject: [openib-general] Re: [PATCH] fix umad object lifetime stuff In-Reply-To: <528xwdqn4x.fsf@cisco.com> References: <528xwdqn4x.fsf@cisco.com> Message-ID: <4367A489.5070803@ichips.intel.com> Roland Dreier wrote: > Something like this is probably required for ucm and anything else > that exports a character device, since everyone seems to have copied > my bad user_mad code. But I haven't had a chance to do anything > beyond user_mad and uverbs so far... Thanks for the info. I'll take a look at ucm. - Sean From ftillier at silverstorm.com Tue Nov 1 09:29:49 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Tue, 1 Nov 2005 09:29:49 -0800 Subject: [openib-general] Re: [PATCH] [SRP] support for it_iu length negotiation In-Reply-To: Message-ID: <001301c5df09$dd0529d0$9e5aa8c0@infiniconsys.com> > From: John Kingman [mailto:kingman at storagegear.com] > Sent: Tuesday, November 01, 2005 7:26 AM > > On Mon, 31 Oct 2005, Roland Dreier wrote: > > >With that said I don't think I like this patch. I don't think it's a > >win to allocate 1 KB IUs when we'll almost never have gather/scatter > >lists that big. Even the 256 byte IUs that the current driver uses > >seem on the borderline of being too big. > > > >Also, is it really a win to have the target fetch a large indirect > >buffer list? It seems like it would be better for performance to give > >the SCSI layer a limit on the size of the gather/scatter list we > >support so that our indirect buffer lists always fit in the IUs we send. > > Without knowing what the optimal values should be, perhaps we should > make some of these module parameters. The Windows SRP initiator sizes the IU to be capable of performing a 64KB I/O with all SGEs specified in the IU. It takes 350 bytes to be able to put the full SGL into an IDBD IU, assuming 4K pages. 
An alternative is to always force an RDMA read of the SGL, and just go with the minimum size IU. I don't know how this would affect performance, though - likely increased latencies.

In fact, in environments where each I/O buffer can be registered (via regular or fast MR) on the fly, DDBD should be used and the IU would be a constant 64 bytes. This should yield the best performance.

- Fab

From halr at voltaire.com Tue Nov 1 09:38:34 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Nov 2005 12:38:34 -0500
Subject: [openib-general] Re: [PATCH] fix umad object lifetime stuff
In-Reply-To: <4367A489.5070803@ichips.intel.com>
References: <528xwdqn4x.fsf@cisco.com> <4367A489.5070803@ichips.intel.com>
Message-ID: <1130866714.4381.505.camel@hal.voltaire.com>

On Tue, 2005-11-01 at 12:23, Sean Hefty wrote:
> Roland Dreier wrote:
> > Something like this is probably required for ucm and anything else
> > that exports a character device, since everyone seems to have copied
> > my bad user_mad code. But I haven't had a chance to do anything
> > beyond user_mad and uverbs so far...
>
> Thanks for the info. I'll take a look at ucm.

Should this be done for uat too, or doesn't it matter?

-- Hal

From yipeeyipeeyipeeyipee at yahoo.com Tue Nov 1 09:39:37 2005
From: yipeeyipeeyipeeyipee at yahoo.com (yipee)
Date: Tue, 1 Nov 2005 17:39:37 +0000 (UTC)
Subject: [openib-general] compilation platform dependencies
Message-ID:

Hi,

I think I've noticed a problem when user applications are compiled with a different compiler than the one used for the running kernel modules (x86 32-bit vs. 64-bit), for example when compiling an openib application on a 32-bit x86 machine and running it on a 64-bit AMD Opteron.
When compiling a program with a 32-bit gcc, sizeof(struct cm_abi_event_resp) was 184 bytes (written to the kernel from ib_cm_get_event()) vs. the 192 bytes resulting from an x86_64 compiler.
When ucm's ib_ucm_event() compares the size of the received cmd/buffer to sizeof(struct ib_ucm_event_resp), it finds a mismatch and returns -ENOSPC.

Note that 32-bit applications are allowed to run on x86_64.
I can see two fixes to this issue:
1. Disallow 32-bit applications from using 64-bit kernel modules and warn about it at run-time.
2. Specify gcc packing pragmas for user/kernel communication structures in header files.

Any comments?
Thanks,
y

From rolandd at cisco.com Tue Nov 1 09:51:59 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Tue, 01 Nov 2005 09:51:59 -0800
Subject: [openib-general] Re: [PATCH] fix umad object lifetime stuff
In-Reply-To: <1130866714.4381.505.camel@hal.voltaire.com> (Hal Rosenstock's message of "01 Nov 2005 12:38:34 -0500")
References: <528xwdqn4x.fsf@cisco.com> <4367A489.5070803@ichips.intel.com> <1130866714.4381.505.camel@hal.voltaire.com>
Message-ID: <52br14fe80.fsf@cisco.com>

>> Roland Dreier wrote:
>> > Something like this is probably required for ucm and anything else
>> > that exports a character device, since everyone seems to have copied
>> > my bad user_mad code. But I haven't had a chance to do anything
>> > beyond user_mad and uverbs so far...

Hal> Should this be done for uat too or doesn't it matter ?

The bugs definitely exist in uat. However, fixing things like passing kernel pointers to userspace would seem like a higher priority to me.

Or we could just deprecate uat in favor of Sean's work...

 - R.

From halr at voltaire.com Tue Nov 1 09:51:28 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Nov 2005 12:51:28 -0500
Subject: [openib-general] compilation platform dependencies
In-Reply-To:
References:
Message-ID: <1130867263.4381.537.camel@hal.voltaire.com>

On Tue, 2005-11-01 at 12:39, yipee wrote:
> Hi,
>
>
> I think that I've noticed a problem in compiling user applications with a
> different compiler than the running-kernel modules compiler (x86 32bit vs.
> 64bit). For compiling an openib application on a 32bit x86 and running it on a
> 64bit AMD Opteron.
> When compiling a program with a 32bit gcc, the sizeof(struct cm_abi_event_resp)
> was 184 bytes (written to the kernel from ib_cm_get_event()) vs. the 192 bytes
> resulting from a x86_64 compiler.
> When ucm's ib_ucm_event() compares the sizeof() of the received cmd/buffer to
> sizeof(struct ib_ucm_event_resp) it finds a mismatch and returns -ENOSPC.
> > Notice that 32bit applications are allowed to run on a x86_64. > I can see two fixes to this issue: > 1. Disallow 32bit applications to use 64bit kernel modules and warn about it at > run-time. It is a requirement to work in this mode so this would not be acceptable. -- Hal > 2. Specifiy gcc packing pragmas for user/kernel communication structures in > header files. > > Any comments? > Thanks, > y > > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mshefty at ichips.intel.com Tue Nov 1 09:55:40 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 01 Nov 2005 09:55:40 -0800 Subject: [openib-general] compilation platform dependencies In-Reply-To: References: Message-ID: <4367AC1C.5030600@ichips.intel.com> yipee wrote: > I think that I've noticed a problem in compiling user applications with a > different compiler than the running-kernel modules compiler (x86 32bit vs. > 64bit). For compiling an openib application on a 32bit x86 and running it on a > 64bit AMD Opteron. > When compiling a program with a 32bit gcc, the sizeof(struct cm_abi_event_resp) > was 184 bytes (written to the kernel from ib_cm_get_event()) vs. the 192 bytes > resulting from a x86_64 compiler. > When ucm's ib_ucm_event() compares the sizeof() of the received cmd/buffer to > sizeof(struct ib_ucm_event_resp) it finds a mismatch and returns -ENOSPC. I think that we can fix this by adding padding to the end of these structures to align them to a 64-bit boundary. Did you notice if any other data structures had this issue? 
- Sean From halr at voltaire.com Tue Nov 1 09:56:52 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 01 Nov 2005 12:56:52 -0500 Subject: [openib-general] Re: [PATCH] fix umad object lifetime stuff In-Reply-To: <52br14fe80.fsf@cisco.com> References: <528xwdqn4x.fsf@cisco.com> <4367A489.5070803@ichips.intel.com> <1130866714.4381.505.camel@hal.voltaire.com> <52br14fe80.fsf@cisco.com> Message-ID: <1130867811.4381.571.camel@hal.voltaire.com> On Tue, 2005-11-01 at 12:51, Roland Dreier wrote: > >> Roland Dreier wrote: > Something like this is probably required > >> for ucm and anything else > that exports a character device, > >> since everyone seems to have copied > my bad user_mad code. > >> But I haven't had a chance to do anything > beyond user_mad and > >> uverbs so far... > > Hal> Should this be done for uat too or doesn't it matter ? > > The bugs definitely exist in uat. However, fixing things like passing > kernel pointers to userspace would seem like a higher priority to me. > > Or we could just deprecate uat in favor of Sean's work... Isn't that in process as AT is being deprecated in favor of CMA ? That's part of why I was asking: given that, are these issues needed to be fixed in the short term ? (I would prefer not to unless this is really needed by someone). -- Hal From rolandd at cisco.com Tue Nov 1 10:07:17 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 01 Nov 2005 10:07:17 -0800 Subject: [openib-general] compilation platform dependencies In-Reply-To: (yipee's message of "Tue, 1 Nov 2005 17:39:37 +0000 (UTC)") References: Message-ID: <527jbsfdii.fsf@cisco.com> yipee> Hi, I think that I've noticed a problem in compiling user yipee> applications with a different compiler than the yipee> running-kernel modules compiler (x86 32bit vs. 64bit). For yipee> compiling an openib application on a 32bit x86 and running yipee> it on a 64bit AMD Opteron. 
When compiling a program with a yipee> 32bit gcc, the sizeof(struct cm_abi_event_resp) was 184 yipee> bytes (written to the kernel from ib_cm_get_event()) yipee> vs. the 192 bytes resulting from a x86_64 compiler. When yipee> ucm's ib_ucm_event() compares the sizeof() of the received yipee> cmd/buffer to sizeof(struct ib_ucm_event_resp) it finds a yipee> mismatch and returns -ENOSPC. Yes, this looks like a real bug. yipee> Notice that 32bit applications are allowed to run on a yipee> x86_64. I can see two fixes to this issue: 1. Disallow yipee> 32bit applications to use 64bit kernel modules and warn yipee> about it at run-time. 2. Specifiy gcc packing pragmas for yipee> user/kernel communication structures in header files. I think the real fix is just to fix the declaration so that the structure is laid out the same for all architectures, and bump the ABI version yet again. All structs more than 4 bytes in size have to be padded to a multiple of 8 bytes, or else they end up with a different size between 32-bit and 64-bit architectures. I think something like the patch below along with the corresponding userspace libibcm change is required. - R. 
--- infiniband/include/rdma/ib_user_cm.h (revision 3932)
+++ infiniband/include/rdma/ib_user_cm.h (working copy)
@@ -38,7 +38,7 @@
 #include
-#define IB_USER_CM_ABI_VERSION 3
+#define IB_USER_CM_ABI_VERSION 4
 enum {
     IB_USER_CM_CMD_CREATE_ID,
@@ -84,6 +84,7 @@ struct ib_ucm_create_id_resp {
 struct ib_ucm_destroy_id {
     __u64 response;
     __u32 id;
+    __u32 reserved;
 };
 struct ib_ucm_destroy_id_resp {
@@ -93,6 +94,7 @@ struct ib_ucm_destroy_id_resp {
 struct ib_ucm_attr_id {
     __u64 response;
     __u32 id;
+    __u32 reserved;
 };
 struct ib_ucm_attr_id_resp {
@@ -164,6 +166,7 @@ struct ib_ucm_listen {
     __be64 service_id;
     __be64 service_mask;
     __u32 id;
+    __u32 reserved;
 };
 struct ib_ucm_establish {
@@ -219,7 +222,7 @@ struct ib_ucm_req {
     __u8 rnr_retry_count;
     __u8 max_cm_retries;
     __u8 srq;
-    __u8 reserved[1];
+    __u8 reserved[5];
 };
 struct ib_ucm_rep {
@@ -236,6 +239,7 @@ struct ib_ucm_rep {
     __u8 flow_control;
     __u8 rnr_retry_count;
     __u8 srq;
+    __u32 reserved;
 };
 struct ib_ucm_info {
@@ -245,7 +249,7 @@ struct ib_ucm_info {
     __u64 data;
     __u8 info_len;
     __u8 data_len;
-    __u8 reserved[2];
+    __u8 reserved[6];
 };
 struct ib_ucm_mra {
@@ -273,6 +277,7 @@ struct ib_ucm_sidr_req {
     __u16 pkey;
     __u8 len;
     __u8 max_cm_retries;
+    __u32 reserved;
 };
 struct ib_ucm_sidr_rep {
@@ -284,7 +289,7 @@ struct ib_ucm_sidr_rep {
     __u64 data;
     __u8 info_len;
     __u8 data_len;
-    __u8 reserved[2];
+    __u8 reserved[6];
 };
 /*
  * event notification ABI structures.
@@ -295,7 +300,7 @@ struct ib_ucm_event_get {
     __u64 info;
     __u8 data_len;
     __u8 info_len;
-    __u8 reserved[2];
+    __u8 reserved[6];
 };
 struct ib_ucm_req_event_resp {
@@ -315,6 +320,7 @@ struct ib_ucm_req_event_resp {
     __u8 rnr_retry_count;
     __u8 srq;
     __u8 port;
+    __u8 reserved[7];
 };
 struct ib_ucm_rep_event_resp {
@@ -329,7 +335,7 @@ struct ib_ucm_rep_event_resp {
     __u8 flow_control;
     __u8 rnr_retry_count;
     __u8 srq;
-    __u8 reserved[1];
+    __u8 reserved[5];
 };
 struct ib_ucm_rej_event_resp {
@@ -374,6 +380,7 @@ struct ib_ucm_event_resp {
     __u32 id;
     __u32 event;
     __u32 present;
+    __u32 reserved;
     union {
         struct ib_ucm_req_event_resp req_resp;
         struct ib_ucm_rep_event_resp rep_resp;

From rolandd at cisco.com Tue Nov 1 10:09:26 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Tue, 01 Nov 2005 10:09:26 -0800
Subject: [openib-general] compilation platform dependencies
In-Reply-To: <4367AC1C.5030600@ichips.intel.com> (Sean Hefty's message of "Tue, 01 Nov 2005 09:55:40 -0800")
References: <4367AC1C.5030600@ichips.intel.com>
Message-ID: <523bmgfdex.fsf@cisco.com>

Sean> Did you notice if any other data structures had this issue?

I use the perl script below to check this. You feed it a header file, and it prints out a C program that prints the size of every struct. Then compile and run the program on both 32-bit and 64-bit architectures and diff the output.

With the patch I just sent, ib_user_cm.h is clean.

- R.
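For a single struct, the size difference discussed in this thread can also be reproduced by hand. The following is a minimal sketch, not the real header: `demo_unpadded` and `demo_padded` are hypothetical stand-ins modeled on ib_ucm_destroy_id before and after the patch, and the sizes in the comments assume the usual i386 and x86_64 ABIs (a 64-bit member is 4-byte aligned inside a struct on i386 and 8-byte aligned on x86_64).

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a command struct before the patch.
 * Under the assumed ABIs: 12 bytes on i386, 16 bytes on x86_64,
 * because only x86_64 adds 4 bytes of tail padding. */
struct demo_unpadded {
    uint64_t response;
    uint32_t id;
};

/* The same struct with an explicit reserved field. The tail padding is
 * now part of the declaration, so the size is 16 bytes on both ABIs. */
struct demo_padded {
    uint64_t response;
    uint32_t id;
    uint32_t reserved;
};

/* Compile-time guard: fails the build if the padded layout ever drifts. */
_Static_assert(sizeof(struct demo_padded) == 16,
               "padded layout must be 16 bytes on every ABI");

/* Print the sizes in the same style as the generated checker program. */
static inline void demo_report_sizes(FILE *out)
{
    fprintf(out, "Word size: %zu\n", sizeof(long));
    fprintf(out, "%-24s:\t%zu\n", "struct demo_unpadded",
            sizeof(struct demo_unpadded));
    fprintf(out, "%-24s:\t%zu\n", "struct demo_padded",
            sizeof(struct demo_padded));
}
```

Building the same file with gcc -m32 and gcc -m64 and calling demo_report_sizes(stdout) should show the unpadded struct changing size while the padded one stays the same, which is the property the reserved fields in the patch are meant to provide.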
#!/usr/bin/env perl

use English;
use strict;

my @structs;

# Echo the header, collecting the name of every struct it declares.
while (<>) {
    if (m/(struct [^\s]+) \{/) {
        push @structs, $1;
    }
    # Map __be types to __u types so the generated program compiles in userspace.
    s/__be/__u/;
    print;
}

print <<'EOT';
#include <stdio.h>

int main(int argc, char *argv[])
{
    printf("Word size: %zu\n", sizeof (long));
EOT

# Emit one printf per struct found above.
for my $s (@structs) {
    print <<"EOT";
    printf("%-40s:\\t%zu\\n", "$s", sizeof ($s));
EOT
}

print <<"EOT";
    return 0;
}
EOT

From mshefty at ichips.intel.com Tue Nov 1 10:53:04 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 01 Nov 2005 10:53:04 -0800
Subject: [openib-general] compilation platform dependencies
In-Reply-To: <527jbsfdii.fsf@cisco.com>
References: <527jbsfdii.fsf@cisco.com>
Message-ID: <4367B990.5040205@ichips.intel.com>

Roland Dreier wrote:
> I think the real fix is just to fix the declaration so that the
> structure is laid out the same for all architectures, and bump the ABI
> version yet again.

I think that we'll need a similar update to cm_abi.h too.

- Sean

From jbarker at lanl.gov Tue Nov 1 10:55:01 2005
From: jbarker at lanl.gov (James W.
Barker) Date: Tue, 01 Nov 2005 11:55:01 -0700 Subject: [openib-general] build fails on revision 3930 Message-ID: <6.2.3.4.2.20051101114151.0227ba00@cic-mail.lanl.gov> All, Following the instructions posted in your "installation cheetsheet" after I issue the command "make modules modules_install" the build fails with the error message below (this is revision 3930), the same procedure (not sure which revision number) was successful last week: CC [M] drivers/infiniband/core/addr.o drivers/infiniband/core/addr.c:330: warning: initialization from incompatible pointer type CC [M] drivers/infiniband/core/at.o In file included from drivers/infiniband/include/rdma/ib_sa.h:42, from drivers/infiniband/core/at.c:53: drivers/infiniband/include/rdma/ib_mad.h:601: error: syntax error before ‘gfp_t’ drivers/infiniband/include/rdma/ib_mad.h:601: warning: function declaration isn’t a prototype In file included from drivers/infiniband/core/at.c:53: drivers/infiniband/include/rdma/ib_sa.h:288: error: syntax error before ‘gfp_t’ drivers/infiniband/include/rdma/ib_sa.h:291: error: ‘ib_sa_path_rec_get’ declared as function returning a function drivers/infiniband/include/rdma/ib_sa.h:291: warning: function declaration isn’t a prototype drivers/infiniband/include/rdma/ib_sa.h:292: error: syntax error before ‘void’ drivers/infiniband/include/rdma/ib_sa.h:299: error: syntax error before ‘gfp_t’ drivers/infiniband/include/rdma/ib_sa.h:302: error: ‘ib_sa_mcmember_rec_query’ declared as function returning a function drivers/infiniband/include/rdma/ib_sa.h:302: warning: function declaration isn’t a prototype drivers/infiniband/include/rdma/ib_sa.h:303: error: syntax error before ‘void’ drivers/infiniband/include/rdma/ib_sa.h:310: error: syntax error before ‘gfp_t’ drivers/infiniband/include/rdma/ib_sa.h:313: error: ‘ib_sa_service_rec_query’ declared as function returning a function drivers/infiniband/include/rdma/ib_sa.h:313: warning: function declaration isn’t a prototype 
drivers/infiniband/include/rdma/ib_sa.h:314: error: syntax error before ‘void’ drivers/infiniband/include/rdma/ib_sa.h:345: error: syntax error before ‘gfp_t’ drivers/infiniband/include/rdma/ib_sa.h:348: error: ‘ib_sa_mcmember_rec_set’ declared as function returning a function drivers/infiniband/include/rdma/ib_sa.h:348: warning: function declaration isn’t a prototype drivers/infiniband/include/rdma/ib_sa.h:349: error: syntax error before ‘void’ drivers/infiniband/include/rdma/ib_sa.h:387: error: syntax error before ‘gfp_t’ drivers/infiniband/include/rdma/ib_sa.h:390: error: ‘ib_sa_mcmember_rec_delete’ declared as function returning a function drivers/infiniband/include/rdma/ib_sa.h:390: warning: function declaration isn’t a prototype drivers/infiniband/include/rdma/ib_sa.h:391: error: syntax error before ‘void’ make[3]: *** [drivers/infiniband/core/at.o] Error 1 make[2]: *** [drivers/infiniband/core] Error 2 make[1]: *** [drivers/infiniband] Error 2 make: *** [drivers] Error 2 James W. Barker, Ph.D. Los Alamos National Laboratory Computer and Computational Sciences Division Advanced Computing Laboratory - Resilient Technologies Team 505-665-9558 From kenjeffries at storagegear.com Tue Nov 1 11:19:31 2005 From: kenjeffries at storagegear.com (Ken Jeffries) Date: Tue, 1 Nov 2005 13:19:31 -0600 Subject: [openib-general] Re: [PATCH] [SRP] support for it_iu length negotiation References: <52ek61gdx7.fsf@cisco.com> Message-ID: <019501c5df19$30305ee0$0a97a8c0@blacktip> Roland, It's not clear to me which part(s) of the patch you don't like so I apologize if some of this is not relevant. The SRP-1 spec calls for iu size negotiation during login so not allowing iu size negotiation would be a bug in terms of spec compliance. I think there are valid reasons why iu size negotiation should be in the spec. I am sure that for a particular application and network that there is an optimum iu size (a Goldilocks size, neither too small nor too large). 
I suspect that a small iu will be better for small file or maybe small record database i/o and a large iu will be better for video serving. A server that has lots of cheap memory and perhaps an aversion to implementing the full indirect memory descriptor capability may be happy with very large iu's. An embedded system server with only modest memory may really need to not waste memory in permanently allocated big iu's that go largely unused. While I'm sure that there will be an optimum size iu for any particular application' and network I'm equally sure I don't know what that size is right now and I won't know what it is before we do considerable performance testing. Any particular srp client may connect to more than one srp server and those servers (and applications) may have different needs. One might be a video server and another might be a db server. Having the iu size set in a compile time variable in the srp client is less flexible than what we, at least, would like to see. When we were considering how to get both smaller iu's and to implement the real indirect memory descriptor capability it occurred to us that allowing the Linux side iu's to be sized by the existing compile time variable but making the on-the-wire iu size set by negotiation was an almost trivial extension to the existing code. By doing that applications can see a potentially large scatter/gather list length (a function of the client internal iu size) but the srp target also gets only what it wants. Since the indirect table memory descriptor just points to the descriptor list in the client side iu and since the "partial" list of descriptors in the on-the-wire iu is just a copy of the first descriptors in the client side iu, indirect descriptor setup and operation is easy. 
Regards, Ken Jeffries ----- Original Message ----- From: "Roland Dreier" To: "John Kingman" Cc: Sent: Monday, October 31, 2005 11:00 PM Subject: [openib-general] Re: [PATCH] [SRP] support for it_iu length negotiation > Thanks for the patch. However, I would like to hold off on new > features for the SRP driver to get it merged into 2.6.15. There's > about another week in the 2.6.15 merge window, so either way the delay > shouldn't be too long. > > With that said I don't think I like this patch. I don't think it's a > win to allocate 1 KB IUs when we'll almost never have gather/scatter > lists that big. Even the 256 byte IUs that the current driver uses > seem on the borderline of being too big. > > Also, is it really a win to have the target fetch a large indirect > buffer list? It seems like it would be better for performance to give > the SCSI layer a limit on the size of the gather/scatter list we > support so that our indirect buffer lists always fit in the IUs we send. > > - R. > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From iod00d at hp.com Tue Nov 1 11:19:05 2005 From: iod00d at hp.com (Grant Grundler) Date: Tue, 1 Nov 2005 11:19:05 -0800 Subject: [openib-general] compilation platform dependencies In-Reply-To: <527jbsfdii.fsf@cisco.com> References: <527jbsfdii.fsf@cisco.com> Message-ID: <20051101191905.GE6815@esmail.cup.hp.com> On Tue, Nov 01, 2005 at 10:07:17AM -0800, Roland Dreier wrote: > --- infiniband/include/rdma/ib_user_cm.h (revision 3932) > +++ infiniband/include/rdma/ib_user_cm.h (working copy) ... > @@ -84,6 +84,7 @@ struct ib_ucm_create_id_resp { > struct ib_ucm_destroy_id { > __u64 response; > __u32 id; > + __u32 reserved; I've seen use of this use of "data[0]": include/rdma/ib_user_verbs.h: __u64 driver_data[0]; isn't that for the same purpose? 
Apologies if I'm mixing things up... thanks, grant From mshefty at ichips.intel.com Tue Nov 1 11:26:49 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 01 Nov 2005 11:26:49 -0800 Subject: [openib-general] compilation platform dependencies In-Reply-To: References: Message-ID: <4367C179.5050102@ichips.intel.com> yipee wrote: > I think that I've noticed a problem in compiling user applications with a > different compiler than the running-kernel modules compiler (x86 32bit vs. > 64bit). For compiling an openib application on a 32bit x86 and running it on a > 64bit AMD Opteron. I've checked in Roland's patch, along with a similar one for userspace. Can you please pull the latest kernel and userspace and verify that your problem has been fixed? - Sean From mshefty at ichips.intel.com Tue Nov 1 11:30:57 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 01 Nov 2005 11:30:57 -0800 Subject: [openib-general] compilation platform dependencies In-Reply-To: <527jbsfdii.fsf@cisco.com> References: <527jbsfdii.fsf@cisco.com> Message-ID: <4367C271.3080208@ichips.intel.com> Roland Dreier wrote: > All structs more than 4 bytes in size have to be padded to a multiple > of 8 bytes, or else they end up with a different size between 32-bit > and 64-bit architectures. I think something like the patch below > along with the corresponding userspace libibcm change is required. Thanks - I committed this with a minor change to use __u8 reserved[4] in a couple places where __u32 were used. 
- Sean From mshefty at ichips.intel.com Tue Nov 1 12:06:10 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 01 Nov 2005 12:06:10 -0800 Subject: [openib-general] Re: [PATCH] fix umad object lifetime stuff In-Reply-To: <528xwdqn4x.fsf@cisco.com> References: <528xwdqn4x.fsf@cisco.com> Message-ID: <4367CAB2.6030600@ichips.intel.com> Roland Dreier wrote: > I just committed the following patch for user_mad.c, which fixes > various issues with possibly freeing various data structures before > the last reference is gone. For example, cdev_del() might return > before the last reference to the cdev is gone, so freeing a structure > containing the cdev is wrong at that point. (Side note: it's > essentially impossible to use cdev_init() safely unless the cdev in > question is statically allocated as part of the module). I can't say that I grasp the relationship between the cdev_* and class_* calls yet, but should umad and ucm create their own classes? I'm trying to add the ucma, and I'm wondering if we should add another infiniband_blah class, versus adding an rdma_cm entry somewhere else. - Sean From rolandd at cisco.com Tue Nov 1 12:10:56 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 01 Nov 2005 12:10:56 -0800 Subject: [openib-general] compilation platform dependencies In-Reply-To: <20051101191905.GE6815@esmail.cup.hp.com> (Grant Grundler's message of "Tue, 1 Nov 2005 11:19:05 -0800") References: <527jbsfdii.fsf@cisco.com> <20051101191905.GE6815@esmail.cup.hp.com> Message-ID: <52pspkdt7z.fsf@cisco.com> > I've seen use of this use of "data[0]": > include/rdma/ib_user_verbs.h: __u64 driver_data[0]; > > isn't that for the same purpose? > Apologies if I'm mixing things up... The driver_data[] in ib_user_verbs.h is really there to give a hint that extra device-dependent data could follow. Reserved members of structs are used to pad it up to a 64-bit boundary. I'm not sure if __u64 driver_data[0]; forces alignment to an 8-byte boundary on i386... 
does it? - R. From rolandd at cisco.com Tue Nov 1 12:27:19 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 01 Nov 2005 12:27:19 -0800 Subject: [openib-general] Re: [PATCH] [SRP] support for it_iu length negotiation In-Reply-To: <019501c5df19$30305ee0$0a97a8c0@blacktip> (Ken Jeffries's message of "Tue, 1 Nov 2005 13:19:31 -0600") References: <52ek61gdx7.fsf@cisco.com> <019501c5df19$30305ee0$0a97a8c0@blacktip> Message-ID: <52d5lkdsgo.fsf@cisco.com> Ken> The SRP-1 spec calls for iu size negotiation during login so Ken> not allowing iu size negotiation would be a bug in terms of Ken> spec compliance. I think there are valid reasons why iu size Ken> negotiation should be in the spec. Sure, no objection here. My objections are the following (as I said in my previous mail): - I don't like allocating a 1 KB IU for every send IU, since most of that memory will probably never be used. - I'm not convinced that it's _ever_ a win to have the target do another RDMA to fetch the indirect buffer list. You need to convince me that it's not better to simply tell the upper layers what the limit on s/g list length is to fit in the current IU size. - R. From rolandd at cisco.com Tue Nov 1 12:27:46 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 01 Nov 2005 12:27:46 -0800 Subject: [openib-general] Re: [PATCH] [SRP] support for it_iu length negotiation In-Reply-To: (John Kingman's message of "Tue, 1 Nov 2005 09:25:42 -0600 (CST)") References: <52ek61gdx7.fsf@cisco.com> Message-ID: <528xw8dsfx.fsf@cisco.com> John> Without knowing what the optimal values should be, perhaps John> we should make some of these module parameters. Yes, or make them per-target-port tunables. - R. 
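The trade-off argued over in this exchange, IU size versus scatter/gather list length, comes down to simple arithmetic. The sketch below is only an illustration under stated assumptions: the constants (a 48-byte SRP_CMD header with no additional CDB, a 20-byte indirect descriptor header, 16 bytes per memory descriptor) are assumptions about the SRP information unit layout that do not come from this thread and should be checked against the spec, and the helper names are hypothetical.

```c
#include <stddef.h>

/* Assumed sizes (not from this thread); verify against the SRP spec. */
enum {
    DEMO_SRP_CMD_LEN      = 48, /* SRP_CMD header, no additional CDB      */
    DEMO_INDIRECT_HDR_LEN = 20, /* table descriptor (16) + total len (4)  */
    DEMO_DESC_LEN         = 16  /* one memory descriptor: va, key, length */
};

/* Largest number of descriptors that fit inline in an IU of iu_len bytes.
 * This is the s/g list limit an initiator could report to upper layers. */
static inline size_t demo_max_descs(size_t iu_len)
{
    size_t fixed = DEMO_SRP_CMD_LEN + DEMO_INDIRECT_HDR_LEN;

    if (iu_len < fixed)
        return 0;
    return (iu_len - fixed) / DEMO_DESC_LEN;
}

/* IU length needed to carry ndesc descriptors inline. */
static inline size_t demo_iu_len(size_t ndesc)
{
    return DEMO_SRP_CMD_LEN + DEMO_INDIRECT_HDR_LEN + ndesc * DEMO_DESC_LEN;
}
```

Under these assumptions a 256-byte IU holds 11 descriptors inline and a 1 KB IU holds 59, while carrying 256 descriptors inline would need a 4164-byte IU. That gives a rough sense of the memory cost on one side and the s/g limit on the other.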
From rolandd at cisco.com Tue Nov 1 12:29:46 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 01 Nov 2005 12:29:46 -0800 Subject: [openib-general] Re: [PATCH] fix umad object lifetime stuff In-Reply-To: <4367CAB2.6030600@ichips.intel.com> (Sean Hefty's message of "Tue, 01 Nov 2005 12:06:10 -0800") References: <528xwdqn4x.fsf@cisco.com> <4367CAB2.6030600@ichips.intel.com> Message-ID: <524q6wdscl.fsf@cisco.com> Sean> I can't say that I grasp the relationship between the cdev_* Sean> and class_* calls yet, but should umad and ucm create their Sean> own classes? I'm trying to add the ucma, and I'm wondering Sean> if we should add another infiniband_blah class, versus Sean> adding an rdma_cm entry somewhere else. ucma is not really attached to a single device, is it? How may character devices are you going to create? - R. From yipeeyipeeyipeeyipee at yahoo.com Tue Nov 1 12:43:44 2005 From: yipeeyipeeyipeeyipee at yahoo.com (yipee) Date: Tue, 1 Nov 2005 20:43:44 +0000 (UTC) Subject: [openib-general] Re: compilation platform dependencies References: <4367C179.5050102@ichips.intel.com> Message-ID: Sean Hefty ichips.intel.com> writes: [snip] > I've checked in Roland's patch, along with a similar one for userspace. > Can you please pull the latest kernel and userspace and verify that > your problem has been fixed? I've already left work for today and I'll be back only on Thursday. I'll test these fixes first thing in the morning. Thanks for the quick response (from everyone). y From yipeeyipeeyipeeyipee at yahoo.com Tue Nov 1 12:55:14 2005 From: yipeeyipeeyipeeyipee at yahoo.com (yipee) Date: Tue, 1 Nov 2005 20:55:14 +0000 (UTC) Subject: [openib-general] Re: compilation platform dependencies References: <4367AC1C.5030600@ichips.intel.com> Message-ID: Sean Hefty ichips.intel.com> writes: [snip] > Did you notice if any other data structures had this issue? Nope, that's the first one that bit me. It took some time to verify this problem. 
I had to update the kernel to 2.6.14 on two platforms, install everything and recheck. If I'll bump into more problems I'll post them here. Thanks, y From kingman at storagegear.com Tue Nov 1 13:21:40 2005 From: kingman at storagegear.com (John Kingman) Date: Tue, 1 Nov 2005 15:21:40 -0600 (CST) Subject: [openib-general] Re: [PATCH] [SRP] support for it_iu length negotiation In-Reply-To: <52d5lkdsgo.fsf@cisco.com> References: <52ek61gdx7.fsf@cisco.com> <019501c5df19$30305ee0$0a97a8c0@blacktip> <52d5lkdsgo.fsf@cisco.com> Message-ID: On Tue, 1 Nov 2005, Roland Dreier wrote: >My objections are the following (as I said in my previous mail): > - I don't like allocating a 1 KB IU for every send IU, since most of > that memory will probably never be used. I have no problem with changing the 1K IU to some other value. I would rather see this max IU size as a module parameter, however, so that it may be changed without having to rebuild the module. > - I'm not convinced that it's _ever_ a win to have the target do > another RDMA to fetch the indirect buffer list. You need to > convince me that it's not better to simply tell the upper layers > what the limit on s/g list length is to fit in the current IU size. If you agree that it_iu size negotiation is OK, then the case where you connect to a target with a smaller it_iu size than ib_srp was built with leaves some number of indirect descriptors in the position of only being available to the target via RDMA. This would probably be considered a win compared to not talking to the target at all. 
:-) John From rolandd at cisco.com Tue Nov 1 13:28:59 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 01 Nov 2005 13:28:59 -0800 Subject: [openib-general] Re: [PATCH] [SRP] support for it_iu length negotiation In-Reply-To: (John Kingman's message of "Tue, 1 Nov 2005 15:21:40 -0600 (CST)") References: <52ek61gdx7.fsf@cisco.com> <019501c5df19$30305ee0$0a97a8c0@blacktip> <52d5lkdsgo.fsf@cisco.com> Message-ID: <52br14cb1g.fsf@cisco.com> John> I have no problem with changing the 1K IU to some other John> value. I would rather see this max IU size as a module John> parameter, however, so that it may be changed without having John> to rebuild the module. I guess that's OK for development but I'm not convinced we need to make this tunable for general end users. John> If you agree that it_iu size negotiation is OK, then the John> case where you connect to a target with a smaller it_iu size John> than ib_srp was built with leaves some number of indirect John> descriptors in the position of only being available to the John> target via RDMA. This would probably be considered a win John> compared to not talking to the target at all. :-) I'm missing something here. The SRP initiator registers itself with the SCSI midlayer after it has successfully connected to the target port, so I don't see why it can't pass exactly the right sg_tablesize value to the SCSI midlayer. - R. From mshefty at ichips.intel.com Tue Nov 1 13:40:18 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 01 Nov 2005 13:40:18 -0800 Subject: [openib-general] Re: [PATCH] fix umad object lifetime stuff In-Reply-To: <524q6wdscl.fsf@cisco.com> References: <528xwdqn4x.fsf@cisco.com> <4367CAB2.6030600@ichips.intel.com> <524q6wdscl.fsf@cisco.com> Message-ID: <4367E0C2.90704@ichips.intel.com> Roland Dreier wrote: > Sean> I can't say that I grasp the relationship between the cdev_* > Sean> and class_* calls yet, but should umad and ucm create their > Sean> own classes? 
I'm trying to add the ucma, and I'm wondering > Sean> if we should add another infiniband_blah class, versus > Sean> adding an rdma_cm entry somewhere else. > > ucma is not really attached to a single device, is it? How may > character devices are you going to create? No - it's not attached to a device. I was going to create just one character device, which is why I was wondering if creating a new class was the right approach. - Sean From brad.benton at us.ibm.com Tue Nov 1 14:27:32 2005 From: brad.benton at us.ibm.com (Brad Benton) Date: Tue, 1 Nov 2005 16:27:32 -0600 Subject: [openib-general] opensm errors with ehca In-Reply-To: <20051030235504.GT3275@kalmia.hozed.org> Message-ID: Troy Benjegerdes wrote on 10/30/2005 05:55:04 PM: > The firmware on the IBM eHCA causes opensm to spit out these kinds of > errors all the time.. > > Is there a way we can either not send P_KeyTable requests to any eHCA > guids, or figure out what (if anything) is broken in their firmware? > > Is this a spec violation, or just ambiguities in implementation? ... > Oct 30 17:49:46 053861 [43005960] -> SMP dump: > base_ver................0x1 > mgmt_class..............0x81 > class_ver...............0x1 > method..................0x1 (SubnGet) > D bit...................0x0 > status..................0x0 > hop_ptr.................0x0 > hop_count...............0x2 > trans_id................0x158c > attr_id.................0x16 (P_KeyTable) > resv....................0x0 > attr_mod................0x260000 Here is what is happening: The attribute modifier for the P_KeyTable attribute is divided into two, 16-bit halves. The most significant 16 bits is information that is only valid for switches. The problem here is that this SubnGet is for an HCA. The firmware currently sees that the upper bits are non-zero and since it is not a switch, throws the packet away. The proper response would be for it to ignore the upper bits and process the MAD. 
However, this is in firmware that won't be able to be changed quickly. So, in the meantime as a work around, would it be possible to have the opensm clear out the upper 16 bits of the attribute modifier when making a P_KeyTable request of an HCA? Thanks, --Brad From halr at voltaire.com Tue Nov 1 14:44:43 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 01 Nov 2005 17:44:43 -0500 Subject: [openib-general] opensm errors with ehca In-Reply-To: References: Message-ID: <1130885082.4381.1202.camel@hal.voltaire.com> Hi Brad, On Tue, 2005-11-01 at 17:27, Brad Benton wrote: > > Troy Benjegerdes wrote on 10/30/2005 05:55:04 PM: > > > The firmware on the IBM eHCA causes opensm to spit out these kinds > of > > errors all the time.. > > > > Is there a way we can either not send P_KeyTable requests to any > eHCA > > guids, or figure out what (if anything) is broken in their firmware? > > > > Is this a spec violation, or just ambiguities in implementation? > ...
> > Oct 30 17:49:46 053861 [43005960] -> SMP dump: > > base_ver................0x1 > > mgmt_class..............0x81 > > class_ver...............0x1 > > method..................0x1 > (SubnGet) > > D bit...................0x0 > > status..................0x0 > > hop_ptr.................0x0 > > hop_count...............0x2 > > trans_id................0x158c > > attr_id.................0x16 > (P_KeyTable) > > resv....................0x0 > > attr_mod................0x260000 > > Here is what is happening: The attribute modifier for the P_KeyTable > attribute is divided into two, 16-bit halves. The most significant 16 > bits > is information that is only valid for switches. The problem here is > that > this SubnGet is for an HCA. The firmware currently sees that the > upper > bits are non-zero and since it is not a switch, throws the packet > away. > The proper response would be for it to ignore the upper bits and > process > the MAD. However, this is in firmware that won't be able to be > changed > quickly. So, in the meantime as a work around, would it be possible > to > have the opensm clear out the upper 16 bits of the attribute modifier > when > making a P_KeyTable request of an HCA? I thought the IBM eHCA identified itself as both a switch and some number of HCAs behind it. Are you sure this is a SubnSet P_KeyTable to a HCA port ? If so, I will look at this and fix it so that even though this should be ignored for HCA and router ports, it will be set to 0. Troy, is there more of this log that can be sent ? 
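The workaround under discussion amounts to masking the attribute modifier before the P_KeyTable request is sent. A minimal sketch follows, assuming only what the thread states (the upper 16 bits of the attribute modifier are meaningful for switches only). The enum and function names are hypothetical and are not OpenSM's API, and the value is taken to be in host byte order.

```c
#include <stdint.h>

/* Hypothetical node types for this sketch (not OpenSM's definitions). */
enum demo_node_type {
    DEMO_NODE_CA,
    DEMO_NODE_SWITCH,
    DEMO_NODE_ROUTER
};

/* Return the attribute modifier to use for a P_KeyTable request.
 * For switches the value is passed through unchanged; for every other
 * node type the upper 16 bits are cleared, since per the thread they
 * carry information that is only valid for switches. */
static inline uint32_t demo_pkey_attr_mod(enum demo_node_type type,
                                          uint32_t attr_mod)
{
    if (type == DEMO_NODE_SWITCH)
        return attr_mod;
    return attr_mod & 0x0000FFFFu;
}
```

With the attr_mod value from the SMP dump quoted above, 0x260000 would become 0 for an HCA port and stay 0x260000 for a switch port.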
-- Hal > Thanks, > --Brad > > ______________________________________________________________________ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From kingman at storagegear.com Tue Nov 1 15:43:07 2005 From: kingman at storagegear.com (John Kingman) Date: Tue, 1 Nov 2005 17:43:07 -0600 (CST) Subject: [openib-general] Re: [PATCH] [SRP] support for it_iu length negotiation In-Reply-To: <52br14cb1g.fsf@cisco.com> References: <52ek61gdx7.fsf@cisco.com> <019501c5df19$30305ee0$0a97a8c0@blacktip> <52d5lkdsgo.fsf@cisco.com> <52br14cb1g.fsf@cisco.com> Message-ID: On Tue, 1 Nov 2005, Roland Dreier wrote: >I'm missing something here. The SRP initiator registers itself with >the SCSI midlayer after it has successfully connected to the target >port, so I don't see why it can't pass exactly the right sg_tablesize >value to the SCSI midlayer. The current code sets the sg_tablesize prior to the call to scsi_host_alloc() which is done at the time the target is added, not at the time of the connection to the target. If sg_tablesize can be modified/supplied with the scsi_add_host() call, then the SRP initiator could pass it at connect time. 
John From rolandd at cisco.com Tue Nov 1 15:50:01 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 01 Nov 2005 15:50:01 -0800 Subject: [openib-general] Re: [PATCH] [SRP] support for it_iu length negotiation In-Reply-To: (John Kingman's message of "Tue, 1 Nov 2005 17:43:07 -0600 (CST)") References: <52ek61gdx7.fsf@cisco.com> <019501c5df19$30305ee0$0a97a8c0@blacktip> <52d5lkdsgo.fsf@cisco.com> <52br14cb1g.fsf@cisco.com> Message-ID: <52y848apxy.fsf@cisco.com> John> The current code sets the sg_tablesize prior to the call to John> scsi_host_alloc() which is done at the time the target is John> added, not at the time of the connection to the target. There is an .sg_tablesize value in the host template we use, yes. But there's no reason that I know of that the sg_tablesize value in the actual SCSI host structure can't be modified after it's allocated, in exactly the same way as the existing code can modify the max_sectors value. - R. From rolandd at cisco.com Tue Nov 1 15:50:34 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 01 Nov 2005 15:50:34 -0800 Subject: [openib-general] Re: [PATCH] fix umad object lifetime stuff In-Reply-To: <4367E0C2.90704@ichips.intel.com> (Sean Hefty's message of "Tue, 01 Nov 2005 13:40:18 -0800") References: <528xwdqn4x.fsf@cisco.com> <4367CAB2.6030600@ichips.intel.com> <524q6wdscl.fsf@cisco.com> <4367E0C2.90704@ichips.intel.com> Message-ID: <52u0ewapx1.fsf@cisco.com> Sean> No - it's not attached to a device. I was going to create Sean> just one character device, which is why I was wondering if Sean> creating a new class was the right approach. It might just make sense to use the existing misc class I guess. - R. 
From mshefty at ichips.intel.com Tue Nov 1 16:15:16 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 01 Nov 2005 16:15:16 -0800 Subject: [openib-general] Re: [PATCH] fix umad object lifetime stuff In-Reply-To: <52u0ewapx1.fsf@cisco.com> References: <528xwdqn4x.fsf@cisco.com> <4367CAB2.6030600@ichips.intel.com> <524q6wdscl.fsf@cisco.com> <4367E0C2.90704@ichips.intel.com> <52u0ewapx1.fsf@cisco.com> Message-ID: <43680514.9090208@ichips.intel.com> Roland Dreier wrote: > Sean> No - it's not attached to a device. I was going to create > Sean> just one character device, which is why I was wondering if > Sean> creating a new class was the right approach. > > It might just make sense to use the existing misc class I guess. It appears that doing this requires using the misc MAJOR number. Do we want to do that or use the IB MAJOR? - Sean From rolandd at cisco.com Tue Nov 1 16:20:09 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 01 Nov 2005 16:20:09 -0800 Subject: [openib-general] Re: [PATCH] fix umad object lifetime stuff In-Reply-To: <43680514.9090208@ichips.intel.com> (Sean Hefty's message of "Tue, 01 Nov 2005 16:15:16 -0800") References: <528xwdqn4x.fsf@cisco.com> <4367CAB2.6030600@ichips.intel.com> <524q6wdscl.fsf@cisco.com> <4367E0C2.90704@ichips.intel.com> <52u0ewapx1.fsf@cisco.com> <43680514.9090208@ichips.intel.com> Message-ID: <52mzknc346.fsf@cisco.com> Sean> It appears that doing this requires using the misc MAJOR Sean> number. Do we want to do that or use the IB MAJOR? I don't think it really matters either way. Using misc is probably easier. - R. 
From kenjeffries at austin.rr.com Tue Nov 1 16:44:55 2005 From: kenjeffries at austin.rr.com (Kenneth L Jeffries) Date: Tue, 1 Nov 2005 18:44:55 -0600 Subject: [openib-general] Re: [PATCH] [SRP] support for it_iu length negotiation References: <52ek61gdx7.fsf@cisco.com> <019501c5df19$30305ee0$0a97a8c0@blacktip> <52d5lkdsgo.fsf@cisco.com> Message-ID: <020501c5df46$a58ec870$0a97a8c0@blacktip> > My objections are the following (as I said in my previous mail): > - I don't like allocating a 1 KB IU for every send IU, since most of > that memory will probably never be used. > - I'm not convinced that it's _ever_ a win to have the target do > another RDMA to fetch the indirect buffer list. You need to > convince me that it's not better to simply tell the upper layers > what the limit on s/g list length is to fit in the current IU size. I also don't want to allocate 1KB IU's. If IU's were fixed size, I'd want (probably, depending on performance testing) a fixed size of 350 bytes (from Fab Tillier's 64KB i/o, 4KB pages, Windows) or possibly even the minimum DDBD (as Fab Tillier also says). 1KB IU's with thousands of RC's cause me a lot of wasted space heartburn. [as an aside, it sure would be nice if we could do an SRP-3 (since SRP-2 is dead) where multiple direct descriptors would be allowed. The only way to get multiple descriptors now is with indirect descriptors.] I am pretty sure that someone doing a video server might want to do, say, 1MB i/o's. 1MB with 4KB pages means 256 descriptors and an iu of something over 4096 bytes. I definitely don't want to be told by the srp initiator that I need to use 4KB iu's. (So we agree there.) Your second point has a couple of parts. On the srp target side, if RDMA reads of additional indirect buffer descriptors are done only 1% of the time and the trade-off is much better memory utilization (i.e., smaller iu's) then from the target's point of view there probably is a big win in doing the extra descriptor fetches.
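The IU sizes quoted in this thread can be reproduced with a little arithmetic. The following is only a sketch under stated assumptions: a 48-byte fixed SRP_CMD header, 16-byte memory descriptors, and a 20-byte indirect-table header. The constants and helper names are illustrative and are not taken from the ib_srp source, and the buffer is assumed to be page-aligned.

```c
#include <stddef.h>

/* Assumed SRP wire sizes in bytes (illustrative, not from ib_srp). */
enum {
	SRP_CMD_HDR_LEN      = 48, /* fixed part of an SRP_CMD IU */
	SRP_DESC_LEN         = 16, /* one descriptor: 8-byte address, 4-byte key, 4-byte length */
	SRP_INDIRECT_HDR_LEN = 20  /* indirect table descriptor (16) + total length (4) */
};

/* Number of page-sized descriptors for an I/O of io_len bytes,
 * assuming the buffer starts on a page boundary. */
static size_t desc_count(size_t io_len, size_t page_size)
{
	return (io_len + page_size - 1) / page_size;
}

/* IU length when the whole buffer is described by one direct descriptor. */
static size_t iu_len_direct(void)
{
	return SRP_CMD_HDR_LEN + SRP_DESC_LEN;
}

/* IU length when ndesc descriptors are carried inline in an indirect list. */
static size_t iu_len_indirect(size_t ndesc)
{
	return SRP_CMD_HDR_LEN + SRP_INDIRECT_HDR_LEN + ndesc * SRP_DESC_LEN;
}
```

Under these assumptions a 1MB i/o with 4KB pages needs 256 descriptors, i.e. 4096 bytes of descriptors before any header is counted, while a single direct descriptor keeps the IU at 64 bytes regardless of i/o size.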
The other side is the number of trips from the application through the scsi and srp layers per i/o. But again, if extra trips are made only 1% of the time, then my guess is that smaller iu's would be better. I do find some appeal in having the internal initiator iu size be able to be larger (easily) than the on-the-wire iu size. If it were hard to do then the appeal would not outweigh the cost. As long as the target is able to set the iu size and the target can set the iu size to be fairly small, then I'm ok with just passing that size on to the scsi upper layer. I'm also ok with a per-connection internal initiator iu size if someone wants to code that. Ken Jeffries From rolandd at cisco.com Tue Nov 1 17:26:54 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 01 Nov 2005 17:26:54 -0800 Subject: [openib-general] [PATCH] kmalloc + memset(, 0, ) -> kzalloc conversions Message-ID: <524q6vc00x.fsf@cisco.com> Anyone have any objection to me committing the following patch? It has the following effect on an x86_64 build: text data bss dec hex filename 220354 7416 1336 229106 37ef2 drivers/infiniband/built-in.o-before 219826 7416 1336 228578 37ce2 drivers/infiniband/built-in.o-after Index: infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- infiniband/ulp/ipoib/ipoib_main.c (revision 3935) +++ infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -729,25 +729,21 @@ int ipoib_dev_init(struct net_device *de /* Allocate RX/TX "rings" to hold queued skbs */ - priv->rx_ring = kmalloc(IPOIB_RX_RING_SIZE * sizeof (struct ipoib_rx_buf), + priv->rx_ring = kzalloc(IPOIB_RX_RING_SIZE * sizeof (struct ipoib_rx_buf), GFP_KERNEL); if (!priv->rx_ring) { printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n", ca->name, IPOIB_RX_RING_SIZE); goto out; } - memset(priv->rx_ring, 0, - IPOIB_RX_RING_SIZE * sizeof (struct ipoib_rx_buf)); - priv->tx_ring = kmalloc(IPOIB_TX_RING_SIZE * sizeof (struct ipoib_tx_buf), +
priv->tx_ring = kzalloc(IPOIB_TX_RING_SIZE * sizeof (struct ipoib_tx_buf), GFP_KERNEL); if (!priv->tx_ring) { printk(KERN_WARNING "%s: failed to allocate TX ring (%d entries)\n", ca->name, IPOIB_TX_RING_SIZE); goto out_rx_ring_cleanup; } - memset(priv->tx_ring, 0, - IPOIB_TX_RING_SIZE * sizeof (struct ipoib_tx_buf)); /* priv->tx_head & tx_tail are already 0 */ Index: infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- infiniband/ulp/ipoib/ipoib_multicast.c (revision 3935) +++ infiniband/ulp/ipoib/ipoib_multicast.c (working copy) @@ -135,12 +135,10 @@ static struct ipoib_mcast *ipoib_mcast_a { struct ipoib_mcast *mcast; - mcast = kmalloc(sizeof (*mcast), can_sleep ? GFP_KERNEL : GFP_ATOMIC); + mcast = kzalloc(sizeof *mcast, can_sleep ? GFP_KERNEL : GFP_ATOMIC); if (!mcast) return NULL; - memset(mcast, 0, sizeof (*mcast)); - init_completion(&mcast->done); mcast->dev = dev; Index: infiniband/core/agent.c =================================================================== --- infiniband/core/agent.c (revision 3935) +++ infiniband/core/agent.c (working copy) @@ -155,13 +155,12 @@ int ib_agent_port_open(struct ib_device int ret; /* Create new device info */ - port_priv = kmalloc(sizeof *port_priv, GFP_KERNEL); + port_priv = kzalloc(sizeof *port_priv, GFP_KERNEL); if (!port_priv) { printk(KERN_ERR SPFX "No memory for ib_agent_port_private\n"); ret = -ENOMEM; goto error1; } - memset(port_priv, 0, sizeof *port_priv); /* Obtain send only MAD agent for SMI QP */ port_priv->agent[0] = ib_register_mad_agent(device, port_num, Index: infiniband/core/cm.c =================================================================== --- infiniband/core/cm.c (revision 3935) +++ infiniband/core/cm.c (working copy) @@ -544,11 +544,10 @@ struct ib_cm_id *ib_create_cm_id(struct struct cm_id_private *cm_id_priv; int ret; - cm_id_priv = kmalloc(sizeof *cm_id_priv, GFP_KERNEL); + cm_id_priv = kzalloc(sizeof *cm_id_priv, GFP_KERNEL); if 
(!cm_id_priv) return ERR_PTR(-ENOMEM); - memset(cm_id_priv, 0, sizeof *cm_id_priv); cm_id_priv->id.state = IB_CM_IDLE; cm_id_priv->id.device = device; cm_id_priv->id.cm_handler = cm_handler; @@ -621,10 +620,9 @@ static struct cm_timewait_info * cm_crea { struct cm_timewait_info *timewait_info; - timewait_info = kmalloc(sizeof *timewait_info, GFP_KERNEL); + timewait_info = kzalloc(sizeof *timewait_info, GFP_KERNEL); if (!timewait_info) return ERR_PTR(-ENOMEM); - memset(timewait_info, 0, sizeof *timewait_info); timewait_info->work.local_id = local_id; INIT_WORK(&timewait_info->work.work, cm_work_handler, Index: infiniband/core/uverbs_main.c =================================================================== --- infiniband/core/uverbs_main.c (revision 3935) +++ infiniband/core/uverbs_main.c (working copy) @@ -725,12 +725,10 @@ static void ib_uverbs_add_one(struct ib_ if (!device->alloc_ucontext) return; - uverbs_dev = kmalloc(sizeof *uverbs_dev, GFP_KERNEL); + uverbs_dev = kzalloc(sizeof *uverbs_dev, GFP_KERNEL); if (!uverbs_dev) return; - memset(uverbs_dev, 0, sizeof *uverbs_dev); - kref_init(&uverbs_dev->ref); spin_lock(&map_lock); Index: infiniband/core/device.c =================================================================== --- infiniband/core/device.c (revision 3935) +++ infiniband/core/device.c (working copy) @@ -161,17 +161,9 @@ static int alloc_name(char *name) */ struct ib_device *ib_alloc_device(size_t size) { - void *dev; - BUG_ON(size < sizeof (struct ib_device)); - dev = kmalloc(size, GFP_KERNEL); - if (!dev) - return NULL; - - memset(dev, 0, size); - - return dev; + return kzalloc(size, GFP_KERNEL); } EXPORT_SYMBOL(ib_alloc_device); Index: infiniband/core/mad.c =================================================================== --- infiniband/core/mad.c (revision 3935) +++ infiniband/core/mad.c (working copy) @@ -255,12 +255,11 @@ struct ib_mad_agent *ib_register_mad_age } /* Allocate structures */ - mad_agent_priv = kmalloc(sizeof *mad_agent_priv, 
GFP_KERNEL); + mad_agent_priv = kzalloc(sizeof *mad_agent_priv, GFP_KERNEL); if (!mad_agent_priv) { ret = ERR_PTR(-ENOMEM); goto error1; } - memset(mad_agent_priv, 0, sizeof *mad_agent_priv); mad_agent_priv->agent.mr = ib_get_dma_mr(port_priv->qp_info[qpn].qp->pd, IB_ACCESS_LOCAL_WRITE); @@ -448,14 +447,13 @@ struct ib_mad_agent *ib_register_mad_sno goto error1; } /* Allocate structures */ - mad_snoop_priv = kmalloc(sizeof *mad_snoop_priv, GFP_KERNEL); + mad_snoop_priv = kzalloc(sizeof *mad_snoop_priv, GFP_KERNEL); if (!mad_snoop_priv) { ret = ERR_PTR(-ENOMEM); goto error1; } /* Now, fill in the various structures */ - memset(mad_snoop_priv, 0, sizeof *mad_snoop_priv); mad_snoop_priv->qp_info = &port_priv->qp_info[qpn]; mad_snoop_priv->agent.device = device; mad_snoop_priv->agent.recv_handler = recv_handler; @@ -794,10 +792,9 @@ struct ib_mad_send_buf * ib_create_send_ (!rmpp_active && buf_size > sizeof(struct ib_mad))) return ERR_PTR(-EINVAL); - buf = kmalloc(sizeof *mad_send_wr + buf_size, gfp_mask); + buf = kzalloc(sizeof *mad_send_wr + buf_size, gfp_mask); if (!buf) return ERR_PTR(-ENOMEM); - memset(buf, 0, sizeof *mad_send_wr + buf_size); mad_send_wr = buf + buf_size; mad_send_wr->send_buf.mad = buf; @@ -1039,14 +1036,12 @@ static int method_in_use(struct ib_mad_m static int allocate_method_table(struct ib_mad_mgmt_method_table **method) { /* Allocate management method table */ - *method = kmalloc(sizeof **method, GFP_ATOMIC); + *method = kzalloc(sizeof **method, GFP_ATOMIC); if (!*method) { printk(KERN_ERR PFX "No memory for " "ib_mad_mgmt_method_table\n"); return -ENOMEM; } - /* Clear management method table */ - memset(*method, 0, sizeof **method); return 0; } @@ -1137,15 +1132,14 @@ static int add_nonoui_reg_req(struct ib_ class = &port_priv->version[mad_reg_req->mgmt_class_version].class; if (!*class) { /* Allocate management class table for "new" class version */ - *class = kmalloc(sizeof **class, GFP_ATOMIC); + *class = kzalloc(sizeof **class, 
GFP_ATOMIC); if (!*class) { printk(KERN_ERR PFX "No memory for " "ib_mad_mgmt_class_table\n"); ret = -ENOMEM; goto error1; } - /* Clear management class table */ - memset(*class, 0, sizeof(**class)); + /* Allocate method table for this management class */ method = &(*class)->method_table[mgmt_class]; if ((ret = allocate_method_table(method))) @@ -1209,25 +1203,24 @@ static int add_oui_reg_req(struct ib_mad mad_reg_req->mgmt_class_version].vendor; if (!*vendor_table) { /* Allocate mgmt vendor class table for "new" class version */ - vendor = kmalloc(sizeof *vendor, GFP_ATOMIC); + vendor = kzalloc(sizeof *vendor, GFP_ATOMIC); if (!vendor) { printk(KERN_ERR PFX "No memory for " "ib_mad_mgmt_vendor_class_table\n"); goto error1; } - /* Clear management vendor class table */ - memset(vendor, 0, sizeof(*vendor)); + *vendor_table = vendor; } if (!(*vendor_table)->vendor_class[vclass]) { /* Allocate table for this management vendor class */ - vendor_class = kmalloc(sizeof *vendor_class, GFP_ATOMIC); + vendor_class = kzalloc(sizeof *vendor_class, GFP_ATOMIC); if (!vendor_class) { printk(KERN_ERR PFX "No memory for " "ib_mad_mgmt_vendor_class\n"); goto error2; } - memset(vendor_class, 0, sizeof(*vendor_class)); + (*vendor_table)->vendor_class[vclass] = vendor_class; } for (i = 0; i < MAX_MGMT_OUI; i++) { @@ -2524,12 +2517,12 @@ static int ib_mad_port_open(struct ib_de char name[sizeof "ib_mad123"]; /* Create new device info */ - port_priv = kmalloc(sizeof *port_priv, GFP_KERNEL); + port_priv = kzalloc(sizeof *port_priv, GFP_KERNEL); if (!port_priv) { printk(KERN_ERR PFX "No memory for ib_mad_port_private\n"); return -ENOMEM; } - memset(port_priv, 0, sizeof *port_priv); + port_priv->device = device; port_priv->port_num = port_num; spin_lock_init(&port_priv->reg_lock); Index: infiniband/core/sysfs.c =================================================================== --- infiniband/core/sysfs.c (revision 3935) +++ infiniband/core/sysfs.c (working copy) @@ -307,14 +307,13 @@ 
static ssize_t show_pma_counter(struct i if (!p->ibdev->process_mad) return sprintf(buf, "N/A (no PMA)\n"); - in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); out_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); if (!in_mad || !out_mad) { ret = -ENOMEM; goto out; } - memset(in_mad, 0, sizeof *in_mad); in_mad->mad_hdr.base_version = 1; in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_PERF_MGMT; in_mad->mad_hdr.class_version = 1; @@ -508,10 +507,9 @@ static int add_port(struct ib_device *de if (ret) return ret; - p = kmalloc(sizeof *p, GFP_KERNEL); + p = kzalloc(sizeof *p, GFP_KERNEL); if (!p) return -ENOMEM; - memset(p, 0, sizeof *p); p->ibdev = device; p->port_num = port_num; Index: infiniband/core/ucm.c =================================================================== --- infiniband/core/ucm.c (revision 3935) +++ infiniband/core/ucm.c (working copy) @@ -172,11 +172,10 @@ static struct ib_ucm_context *ib_ucm_ctx struct ib_ucm_context *ctx; int result; - ctx = kmalloc(sizeof(*ctx), GFP_KERNEL); + ctx = kzalloc(sizeof *ctx, GFP_KERNEL); if (!ctx) return NULL; - memset(ctx, 0, sizeof *ctx); atomic_set(&ctx->ref, 1); init_waitqueue_head(&ctx->wait); ctx->file = file; @@ -386,11 +385,10 @@ static int ib_ucm_event_handler(struct i ctx = cm_id->context; - uevent = kmalloc(sizeof(*uevent), GFP_KERNEL); + uevent = kzalloc(sizeof *uevent, GFP_KERNEL); if (!uevent) goto err1; - memset(uevent, 0, sizeof(*uevent)); uevent->ctx = ctx; uevent->cm_id = cm_id; uevent->resp.uid = ctx->uid; @@ -1345,11 +1343,10 @@ static void ib_ucm_add_one(struct ib_dev if (!device->alloc_ucontext) return; - ucm_dev = kmalloc(sizeof *ucm_dev, GFP_KERNEL); + ucm_dev = kzalloc(sizeof *ucm_dev, GFP_KERNEL); if (!ucm_dev) return; - memset(ucm_dev, 0, sizeof *ucm_dev); ucm_dev->ib_dev = device; ucm_dev->devnum = find_first_zero_bit(dev_map, IB_UCM_MAX_DEVICES); Index: infiniband/hw/mthca/mthca_profile.c =================================================================== 
--- infiniband/hw/mthca/mthca_profile.c (revision 3935) +++ infiniband/hw/mthca/mthca_profile.c (working copy) @@ -80,12 +80,10 @@ u64 mthca_make_profile(struct mthca_dev struct mthca_resource tmp; int i, j; - profile = kmalloc(MTHCA_RES_NUM * sizeof *profile, GFP_KERNEL); + profile = kzalloc(MTHCA_RES_NUM * sizeof *profile, GFP_KERNEL); if (!profile) return -ENOMEM; - memset(profile, 0, MTHCA_RES_NUM * sizeof *profile); - profile[MTHCA_RES_QP].size = dev_lim->qpc_entry_sz; profile[MTHCA_RES_EEC].size = dev_lim->eec_entry_sz; profile[MTHCA_RES_SRQ].size = dev_lim->srq_entry_sz; Index: infiniband/hw/mthca/mthca_mr.c =================================================================== --- infiniband/hw/mthca/mthca_mr.c (revision 3935) +++ infiniband/hw/mthca/mthca_mr.c (working copy) @@ -140,13 +140,11 @@ static int __devinit mthca_buddy_init(st buddy->max_order = max_order; spin_lock_init(&buddy->lock); - buddy->bits = kmalloc((buddy->max_order + 1) * sizeof (long *), + buddy->bits = kzalloc((buddy->max_order + 1) * sizeof (long *), GFP_KERNEL); if (!buddy->bits) goto err_out; - memset(buddy->bits, 0, (buddy->max_order + 1) * sizeof (long *)); - for (i = 0; i <= buddy->max_order; ++i) { s = BITS_TO_LONGS(1 << (buddy->max_order - i)); buddy->bits[i] = kmalloc(s * sizeof (long), GFP_KERNEL); From halr at voltaire.com Tue Nov 1 18:26:56 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 01 Nov 2005 21:26:56 -0500 Subject: [openib-general] opensm errors with ehca In-Reply-To: <20051030235504.GT3275@kalmia.hozed.org> References: <20051030235504.GT3275@kalmia.hozed.org> Message-ID: <1130898415.4381.1784.camel@hal.voltaire.com> On Sun, 2005-10-30 at 18:55, Troy Benjegerdes wrote: > The firmware on the IBM eHCA causes opensm to spit out these kinds of > errors all the time.. > > Is there a way we can either not send P_KeyTable requests to any eHCA > guids, or figure out what (if anything) is broken in their firmware? 
> > Is this a spec violation, or just ambiguities in implementation? > > Oct 30 17:49:46 053820 [43005960] -> umad_receiver: ERR 5409: send > completed wit > h error (method=0x1 attr=0x16 trans_id=0x158c) -- dropping. > Oct 30 17:49:46 053830 [43005960] -> umad_receiver: ERR 5411: DR SMP hop > ptr 0 h > op count 2 DR SLID 0x0 DR DLID 0x0 > Oct 30 17:49:46 053839 [43005960] -> __osm_sm_mad_ctrl_send_err_cb: ERR > 3113: MA > D completed in error (IB_TIMEOUT). > Oct 30 17:49:46 053861 [43005960] -> SMP dump: > base_ver................0x1 > mgmt_class..............0x81 > class_ver...............0x1 > method..................0x1 (SubnGet) > D bit...................0x0 > status..................0x0 > hop_ptr.................0x0 > hop_count...............0x2 > trans_id................0x158c > attr_id.................0x16 (P_KeyTable) > resv....................0x0 > attr_mod................0x260000 > m_key...................0x0000000000000000 > dr_slid.................0xFFFF > dr_dlid.................0xFFFF > > Initial path: [0][1][16] > Return path: [0][0][0] > Reserved: [0][0][0][0][0][0][0] Can you try the following opensm patch and see if this eliminates those timeout messages ? This patch clears the high part of the attribute modifier when not a switch (when obtaining the PKeyTable). -- Hal Index: osm_port_info_rcv.c =================================================================== --- osm_port_info_rcv.c (revision 3906) +++ osm_port_info_rcv.c (working copy) @@ -430,6 +430,7 @@ void osm_pkey_get_tables( osm_dr_path_t path; uint8_t port_num; uint16_t block_num, max_blocks; + uint32_t attr_mod_ho; osm_switch_t* p_switch; OSM_LOG_ENTER( p_log, osm_physp_has_pkey ); @@ -455,7 +456,7 @@ void osm_pkey_get_tables( else { /* This is a switch, and not a management port. The maximum blocks is defined - on the switch info partition enforcement cap. */ + in the switch info partition enforcement cap. 
*/ p_switch = osm_get_switch_by_guid(p_subn, p_node->node_info.node_guid); if (! p_switch) @@ -472,10 +473,14 @@ void osm_pkey_get_tables( for (block_num = 0 ; block_num < max_blocks ; block_num++) { + if (osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH) + attr_mod_ho = block_num; + else + attr_mod_ho = block_num | (port_num << 16); status = osm_req_get( p_req, &path, IB_MAD_ATTR_P_KEY_TABLE, - cl_hton32(block_num | (port_num << 16) ), + cl_hton32(attr_mod_ho), CL_DISP_MSGID_NONE, &context ); From ftillier at silverstorm.com Tue Nov 1 20:22:14 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Tue, 1 Nov 2005 20:22:14 -0800 Subject: [openib-general] Re: [PATCH] [SRP] support for it_iu length negotiation In-Reply-To: <020501c5df46$a58ec870$0a97a8c0@blacktip> Message-ID: <000101c5df65$01a94a90$9e5aa8c0@infiniconsys.com> > From: Kenneth L Jeffries [mailto:kenjeffries at austin.rr.com] > Sent: Tuesday, November 01, 2005 4:45 PM > > > My objections are the following (as I said in my previous mail): > > - I don't like allocating a 1 KB IU for every send IU, since most of > > that memory will probably never be used. > > - I'm not convinced that it's _ever_ a win to have the target do > > another RDMA to fetch the indirect buffer list. You need to > > convince me that it's not better to simply tell the upper layers > > what the limit on s/g list length is to fit in the current IU size. > > I also don't want to allocate 1KB IU's. If IU's were fixed size, I'd want > (probably, depending on performance testing) a fixed size of 350 bytes > (from Fab Tillier's 64KB i/o, 4KB pages, Windows) or possibly even > the minimum DDBD (as Fab Tillier also says). 1KB IU's with thousands > of RC's cause me a lot of wasted space heartburn. Even 350 bytes is a burden - imagine a target that supports a queue depth of 1000 I/Os from a few dozen initiators.
Ideally, I'd like to see us use just DDBDs and the 64-byte IU, along with registering the data buffers on a per-I/O basis, either via FMR or regular MRs. > [as an aside, it sure would be nice if we could do an SRP-3 (since SRP-2 > is dead) where multiple direct descriptors would be allowed. The only > way to get multiple descriptors now is with indirect descriptors.] That saves you 20 bytes - not a huge gain. > I am pretty sure that someone doing a video server might want to do, say, > 1MB i/o's. 1MB with 4KB pages means 256 descriptors and an iu of > something over 4096 bytes. I definitely don't want to be told by the srp > initiator that I need to use 4KB iu's. (So we agree there.) For large I/O, doing a registration of the buffer and sending a DDBD with a single descriptor might well provide the best performance. If you look at the traffic on the wire, having the target do multiple page-sized RDMA operations is far less efficient than creating a virtually contiguous (to the target) region that a single RDMA operation can service. - Fab From info at dswrench.com Tue Nov 1 20:33:30 2005 From: info at dswrench.com (info at dswrench.com) Date: Tue, 01 Nov 2005 20:33:30 -0800 Subject: [openib-general] DS4000 Storage Server (Engenio based) Management Protocol Analyzer Message-ID: Dswrench.com announces a new debugging tool for the "DS4000 storage server (Engenio based disk storage array)." We are proud to introduce the DS4000 Management Protocol Analyzer, or DSMPA. DSMPA is freeware built by engineers for engineers, so we encourage you to use it often and spread the word! What it does: DSMPA captures management network traffic between SANtricity or other management software and disk storage arrays. It assembles network packets into DS4000 management objects. The captured objects can be viewed graphically and played back.
It is a powerful tool for DS4000 storage server support center, storage administrators, storage management software developers and quality assurance engineers. To view sample shots of DSMPA capabilities, click the following links: -To access the User guide, please click http://www.dswrench.com/documents/dsmpa.pdf. -To access the DSMPA main screen, please click http://www.dswrench.com/documents/CaptureAndView.html -DSMPA can save a communication session to html file. To view a sample session output, please click http://www.dswrench.com/documents/session_sample.html. -DSMPA also captures SANtricity generated network traffic. This page (http://www.dswrench.com/documents/snmp_sample.html) shows captured SNMP trap fired by the SANtricity management software. How to get it: DSMPA is now available for free download from http://www.dswrench.com For more information, please visit http://www.dswrench.com. Thank you for taking a few minutes to improve the operation of your DS4000 storage server. Please visit dswrench.com often for updates and new software. Use often and spread the word! Best Regards, The Dswrench.com Team From hozer at hozed.org Tue Nov 1 20:49:18 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Tue, 1 Nov 2005 22:49:18 -0600 Subject: [openib-general] opensm errors with ehca In-Reply-To: <1130898415.4381.1784.camel@hal.voltaire.com> References: <20051030235504.GT3275@kalmia.hozed.org> <1130898415.4381.1784.camel@hal.voltaire.com> Message-ID: <20051102044918.GZ3275@kalmia.hozed.org> > Can you try the following opensm patch and see if this eliminates those > timeout messages ? > > This patch clears the high part of the attribute modifier when not a > switch (when obtaining the PKeyTable). 
> > -- Hal > > Index: osm_port_info_rcv.c > =================================================================== > --- osm_port_info_rcv.c (revision 3906) > +++ osm_port_info_rcv.c (working copy) > @@ -430,6 +430,7 @@ void osm_pkey_get_tables( > osm_dr_path_t path; > uint8_t port_num; > uint16_t block_num, max_blocks; > + uint32_t attr_mod_ho; > osm_switch_t* p_switch; > > OSM_LOG_ENTER( p_log, osm_physp_has_pkey ); > @@ -455,7 +456,7 @@ void osm_pkey_get_tables( > else > { > /* This is a switch, and not a management port. The maximum blocks is defined > - on the switch info partition enforcement cap. */ > + in the switch info partition enforcement cap. */ > p_switch = osm_get_switch_by_guid(p_subn, p_node->node_info.node_guid); > > if (! p_switch) > @@ -472,10 +473,14 @@ void osm_pkey_get_tables( > > for (block_num = 0 ; block_num < max_blocks ; block_num++) > { > + if (osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH) > + attr_mod_ho = block_num; > + else > + attr_mod_ho = block_num | (port_num << 16); > status = osm_req_get( p_req, > &path, > IB_MAD_ATTR_P_KEY_TABLE, > - cl_hton32(block_num | (port_num << 16) ), > + cl_hton32(attr_mod_ho), > CL_DISP_MSGID_NONE, > &context ); > This seems to ignore the IBM logical HCA, but gives the same thing on the IBM Logical switch. Is there a way to ignore this as well? switchguids=0x2550000038580 Switch 63 "S-0002550000038580" # IBM Logical Switch 1 port 0 lid 21 [2] "H-0002550000038500"[1] [1] "S-0002c90200402917"[22] I still get: Nov 01 22:34:08 660205 [43005960] -> umad_receiver: ERR 5409: send completed wit h error (method=0x1 attr=0x16 trans_id=0x13c9) -- dropping. Nov 01 22:34:08 660213 [43005960] -> umad_receiver: ERR 5411: DR SMP hop ptr 0 h op count 2 DR SLID 0x0 DR DLID 0x0 Nov 01 22:34:08 660221 [43005960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MA D completed in error (IB_TIMEOUT). 
Nov 01 22:34:08 660243 [43005960] -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x2 trans_id................0x13c9 attr_id.................0x16 (P_KeyTable) resv....................0x0 attr_mod................0x10000 m_key...................0x0000000000000000 dr_slid.................0xFFFF dr_dlid.................0xFFFF Initial path: [0][1][16] Return path: [0][0][0] Reserved: [0][0][0][0][0][0][0] From panda at cse.ohio-state.edu Tue Nov 1 20:54:49 2005 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Tue, 1 Nov 2005 23:54:49 -0500 (EST) Subject: [openib-general] Announcing the release of MVAPICH2 0.9.0 (MPI-2 over InfiniBand and other RDMA Interconnects) Message-ID: <200511020454.jA24snOm029539@xi.cse.ohio-state.edu> The MVAPICH team is pleased to announce the release of MVAPICH2 0.9.0 for the following platforms, OS, compilers, and InfiniBand adapters: - Platforms: EM64T, Opteron, IA-32, and Mac G5 - Operating Systems: Linux, Solaris, and Mac OSX - Compilers: gcc, intel, and pgi - InfiniBand Adapters: Mellanox adapters with PCI-X and PCI-Express (SDR and DDR with mem-full and mem-free cards) In addition to delivering high performance with VAPI interface, MVAPICH2 0.9.0 also provides uDAPL support for portability across networks and platforms with highest performance. The uDAPL interface of this release has been tested with InfiniBand (OpenIB/Gen2 uDAPL, IBGD/uDAPL, and Solaris IBTL/uDAPL), Ammasso GigE (Ammasso uDAPL), and Myrinet (DAPL-GM beta). Starting with this release, MVAPICH2 enables InfiniBand support for Solaris environment through uDAPL support. MVAPICH2 0.9.0 is being distributed as a single integrated package (with MPICH2 1.0.2p1 and MVICH). It is available under BSD license. 
This new release has the following features: - MPI-2 functionalities (one-sided, collectives, datatype) - all MPI-1 functionalities - high performance and optimized support for all one-sided operations (Get, Put, and Accumulate) - support for active and passive synchronization - optimized two-sided operations with RDMA support - efficient memory registration/de-registration schemes for RDMA operations - optimized intra-node shared memory support (bus-based and NUMA) - shared library support - ROMIO support - uDAPL support (tested for InfiniBand on Linux and Solaris, Myrinet, and Ammasso GigE) - scalable job start-up - optimized and tuned for the above platforms and different network interfaces (PCI-X and PCI-Express with SDR and DDR) - support for multiple compilers (gcc, icc, and pgi) - single code base for all of the above platforms and OS - memory efficient scaling modes for medium and large clusters Other features of this release include: - Excellent performance: Sample performance numbers include: Two-sided operations on EM64T, PCI-Ex: - 3.47 microsec one-way latency with IBA-SDR - 1502 MB/sec unidirectional bandwidth with IBA-DDR - 2752 MB/sec bidirectional bandwidth with IBA-DDR One-sided operations on EM64T, PCI-Ex, IBA-DDR: - 5.96 microsec Put latency - 1503 MB/sec unidirectional PUT bandwidth - 2759 MB/sec bidirectional PUT bandwidth Two-sided operations with Solaris uDAPL/IBTL on Opteron, PCI-X, IBA-SDR: - 5.58 microsec one-way latency - 655 MB/sec unidirectional bandwidth - 799 MB/sec bidirectional bandwidth Two-sided operations with OpenIB/Gen2 uDAPL on Opteron, PCI-Ex IBA-SDR: - 3.63 microsec one-way latency - 962 MB/sec unidirectional bandwidth - 1869 MB/sec bidirectional bandwidth Performance numbers for all other platforms, system configurations and operations can be viewed by visiting `Performance Results' section of the project's web page. 
- Similar performance with MVAPICH: With the new ADI-3-level design, MVAPICH2 0.9.0 delivers similar performance for two-sided operations compared to MVAPICH 0.9.5. Organizations and users interested in getting the best performance for both two-sided and one-sided operations may migrate from the MVAPICH code base to the MVAPICH2 code base. - A set of benchmarks to evaluate both two-sided and one-sided operations (Put, Get, and Accumulate) - An enhanced and detailed `User Guide' to assist users: - to install this package on different platforms with both interfaces (VAPI and uDAPL) and different options - to vary different parameters of the MPI installation to extract maximum performance and achieve scalability, especially on large-scale systems. You are welcome to download the MVAPICH2 0.9.0 package and access relevant information from the following URL: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/ A successive version with support for OpenIB/Gen2 will be available soon. All feedback, including bug reports and hints for performance tuning, is welcome. Please send an e-mail to mvapich-help at cse.ohio-state.edu. Thanks, MVAPICH Team at OSU/NBCL ---------- PS: If you would like to be removed from this mailing list, please send an e-mail to mvapich_request at cse.ohio-state.edu. ====================================================================== MVAPICH/MVAPICH2 project is currently supported with funding from U.S. National Science Foundation, U.S. DOE Office of Science, Mellanox, Intel, Cisco Systems, Sun Microsystems, and Linux Networx; and with equipment support from AMD, Ammasso, Apple, IBM, Intel, Mellanox, Microway, PathScale, SilverStorm and Sun Microsystems. Other technology partners include Etnus.
====================================================================== From pradeep at us.ibm.com Tue Nov 1 22:02:20 2005 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Tue, 1 Nov 2005 22:02:20 -0800 Subject: [openib-general] Re: Questions about libibat, ib_uat, and ib_a In-Reply-To: Message-ID: openib-general-bounces at openib.org wrote on 10/18/2005 03:40:47 PM: > > > > > On Mon, 2005-10-18 at 10:07, Kevin Reilly wrote: > >On Mon, 2005-10-17 at 10:07, Hal Rosenstock wrote: > >> > Should this code work, because it seems that out_dev is a kernel > >> > address (platform: PPC64) which cannot accessed by a userspace > >> > program. Via GDB I can see that rt has the following content: > >> > > >> > The address is rt->out_dev = 0xc0000000cffaa800 which looks like a > >> > kernel address. > >> > >> Yes, this is a bug which has been previously pointed out on the list and > >> not fixed. > > Can some one point me to the previous discussions on this list (search did not yield any results)? The problem is because of a copy_to_user (in uat.c) between struct ib_at_ib_route which are different between user and kernel space causing this crash. What was the rationale of putting a pointer to struct ibv_device in the user space version of ib_at_ib_route? The out_dev field in user space is not really used as far as I could see. > >The fix for this involves an ABI change: it should return the GID of the > >outgoing IB device. > > Would a simple solution like adding a device_name field to both the ib_at_ib_route structures be acceptable? The out_dev field could be used as a "reserved" field in user space and not be used. That should not break anything as far as I can see. > >-- Hal Pradeep pradeep at us.ibm.com -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mst at mellanox.co.il Wed Nov 2 00:45:01 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 Nov 2005 10:45:01 +0200 Subject: [openib-general] scsi/srp.h Message-ID: <20051102084501.GR31134@mellanox.co.il> Roland, would you mind moving scsi/srp.h from ulp/srp to infiniband/include in subversion, please? The fact that its under ulp/srp breaks build of a tree linked to under drivers/infiniband drivers/infiniband/ulp/srp/ib_srp.c:49:22: scsi/srp.h: No such file or directory And I think it makes sense to keep includes in one place, right? -- MST From mst at mellanox.co.il Wed Nov 2 00:54:29 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 Nov 2005 10:54:29 +0200 Subject: [openib-general] Re: 2.6.14 patches In-Reply-To: <20051030123622.GD4769@mellanox.co.il> References: <20051030123622.GD4769@mellanox.co.il> Message-ID: <20051102085429.GS31134@mellanox.co.il> Quoting Michael S. Tsirkin : > Sean, Hal, now that 2.6.14 is out, do you plan to apply > the patches in > https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/?
> Once you do, I'll put reverted patches in the backport directory. Guys, I know there are plans for removing at.c, but since it is, for now, included in the makefile, I plan to apply linux-2.6.14-rc3-at.diff and check in, to avoid the warning for 2.6.14 builds. Does anyone have a problem with this? -- MST From mst at mellanox.co.il Wed Nov 2 01:03:10 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 Nov 2005 11:03:10 +0200 Subject: [openib-general] Re: build fails on revision 3930 In-Reply-To: <6.2.3.4.2.20051101114151.0227ba00@cic-mail.lanl.gov> References: <6.2.3.4.2.20051101114151.0227ba00@cic-mail.lanl.gov> Message-ID: <20051102090310.GT31134@mellanox.co.il> Quoting James W. Barker : > Subject: build fails on revision 3930 > > All, > > Following the instructions posted in your > "installation cheetsheet" after I issue the > command "make modules modules_install" the build > fails with the error message below (this is > revision 3930), the same procedure (not sure > which revision number) was successful last week: > > CC [M] drivers/infiniband/core/addr.o > drivers/infiniband/core/addr.c:330: warning: > initialization from incompatible pointer type Looks like you are using kernel 2.6.13 or older. The subversion trunk is for the latest kernels.org kernel only, which is 2.6.14 as of this writing. Please note, that to load sdp, at and addr modules in kernel 2.6.14, you have to apply the following kernel patch https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/linux-2.6.14-fib-frontend.diff If you want to work on older kernels, you need to apply the backport patches to the subversion repository. Find them here https://openib.org/svn/gen2/branches/backport/ -- MST From mst at mellanox.co.il Wed Nov 2 02:30:08 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 2 Nov 2005 12:30:08 +0200 Subject: [openib-general] [PATCH applied] remove side effects from kunmap_atomic Message-ID: <20051102103007.GU31134@mellanox.co.il> The following is already applied. --- On some platforms kunmap_atomic is an empty macro. Therefore it is unsafe for calls to kunmap_atomic to have side effects, such as incrementing a counter. Signed-off-by: Michael S. Tsirkin Index: linux-kernel/drivers/infiniband/ulp/sdp/sdp_send.c =================================================================== --- linux-kernel/drivers/infiniband/ulp/sdp/sdp_send.c (revision 3926) +++ linux-kernel/drivers/infiniband/ulp/sdp/sdp_send.c (working copy) @@ -727,7 +727,8 @@ static int sdp_send_iocb_buff_write(stru offset += copy; offset &= (~PAGE_MASK); - kunmap_atomic(iocb->page_array[counter++], KM_IRQ0); + kunmap_atomic(iocb->page_array[counter], KM_IRQ0); + ++counter; local_irq_restore(flags); } Index: linux-kernel/drivers/infiniband/ulp/sdp/sdp_recv.c =================================================================== --- linux-kernel/drivers/infiniband/ulp/sdp/sdp_recv.c (revision 3926) +++ linux-kernel/drivers/infiniband/ulp/sdp/sdp_recv.c (working copy) @@ -610,7 +610,8 @@ static int sdp_read_buff_iocb(struct sdp iocb->io_addr += copy; - kunmap_atomic(iocb->page_array[counter++], KM_IRQ0); + kunmap_atomic(iocb->page_array[counter], KM_IRQ0); + ++counter; local_irq_restore(flags); } -- MST From mst at mellanox.co.il Wed Nov 2 03:36:58 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 Nov 2005 13:36:58 +0200 Subject: [openib-general] Re: 2.6.14 patches In-Reply-To: <1130929425.4381.3327.camel@hal.voltaire.com> References: <1130929425.4381.3327.camel@hal.voltaire.com> Message-ID: <20051102113658.GV31134@mellanox.co.il> > > Guys, I know there are plans for removing at.c, but since it is, for > now, > > included in the makefile, I plan to apply linux-2.6.14-rc3-at.diff > > and check in, to avoid the warning for 2.6.14 builds. 
> > Does anyone have a problem with this? > > It's fine to do this. I haven't be able to upgrade to 2.6.14 yet. I was > going to do this during that process. > > -- Hal OK, I did this, removed the patch, and updated the backport directory appropriately. -- MST From halr at voltaire.com Wed Nov 2 03:40:39 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 2 Nov 2005 13:40:39 +0200 Subject: [openib-general] Re: 2.6.14 patches Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175CD8@taurus.voltaire.com> On Wed, 2005-11-02 at 03:54, Michael S. Tsirkin wrote: > Quoting Michael S. Tsirkin : > > Sean, Hal, now that 2.6.14 is out, do you plan to apply > > the patches in > > https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/? > > Once you do, I'll put reverted patches in the backport directory. > > Guys, I know there are plans for removing at.c, but since it is, for now, > included in the makefile, I plan to apply linux-2.6.14-rc3-at.diff > and check in, to avoid the warning for 2.6.14 builds. > Does anyone have a problem with this? If you can't wait for me to do it, it's fine to go ahead. I haven't found the time to upgrade to 2.6.14 yet. I was going to take care of this during that process.
-- Hal From halr at voltaire.com Wed Nov 2 03:44:55 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 2 Nov 2005 13:44:55 +0200 Subject: [openib-general] opensm errors with ehca Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175CD9@taurus.voltaire.com> On Tue, 2005-11-01 at 23:49, Troy Benjegerdes wrote: > > Can you try the following opensm patch and see if this eliminates those > > timeout messages ? > > > > This patch clears the high part of the attribute modifier when not a > > switch (when obtaining the PKeyTable). > > > > -- Hal > > > > Index: osm_port_info_rcv.c > > =================================================================== > > --- osm_port_info_rcv.c (revision 3906) > > +++ osm_port_info_rcv.c (working copy) > > @@ -430,6 +430,7 @@ void osm_pkey_get_tables( > > osm_dr_path_t path; > > uint8_t port_num; > > uint16_t block_num, max_blocks; > > + uint32_t attr_mod_ho; > > osm_switch_t* p_switch; > > > > OSM_LOG_ENTER( p_log, osm_physp_has_pkey ); > > @@ -455,7 +456,7 @@ void osm_pkey_get_tables( > > else > > { > > /* This is a switch, and not a management port. The maximum blocks is defined > > - on the switch info partition enforcement cap. */ > > + in the switch info partition enforcement cap. */ > > p_switch = osm_get_switch_by_guid(p_subn, p_node->node_info.node_guid); > > > > if (! p_switch) > > @@ -472,10 +473,14 @@ void osm_pkey_get_tables( > > > > for (block_num = 0 ; block_num < max_blocks ; block_num++) > > { > > + if (osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH) > > + attr_mod_ho = block_num; > > + else > > + attr_mod_ho = block_num | (port_num << 16); > > status = osm_req_get( p_req, > > &path, > > IB_MAD_ATTR_P_KEY_TABLE, > > - cl_hton32(block_num | (port_num << 16) ), > > + cl_hton32(attr_mod_ho), > > CL_DISP_MSGID_NONE, > > &context ); > > > > This seems to ignore the IBM logical HCA, but gives the same thing > on the IBM Logical switch. Is there a way to ignore this as well? It is correct for the logical switch. 
It needs to be handled there per the spec. The high 16 bits are required to be the port number whereas for HCAs and routers this was ignore. This _will_ require a firmware change. I'm unaware of a workaround for this unless we want to do it only for the IBM OUI only temporarily. Will they all have this OUI 000255 ? BTW, getting this error does not appear to cause any bad effects. Does this agree with what you are seeing ? -- Hal > switchguids=0x2550000038580 > Switch 63 "S-0002550000038580" # IBM Logical Switch 1 port 0 > lid 21 > [2] "H-0002550000038500"[1] > [1] "S-0002c90200402917"[22] > > > I still get: > > Nov 01 22:34:08 660205 [43005960] -> umad_receiver: ERR 5409: send > completed wit > h error (method=0x1 attr=0x16 trans_id=0x13c9) -- dropping. > Nov 01 22:34:08 660213 [43005960] -> umad_receiver: ERR 5411: DR SMP hop > ptr 0 h > op count 2 DR SLID 0x0 DR DLID 0x0 > Nov 01 22:34:08 660221 [43005960] -> __osm_sm_mad_ctrl_send_err_cb: ERR > 3113: MA > D completed in error (IB_TIMEOUT). 
> Nov 01 22:34:08 660243 [43005960] -> SMP dump: > base_ver................0x1 > mgmt_class..............0x81 > class_ver...............0x1 > method..................0x1 (SubnGet) > D bit...................0x0 > status..................0x0 > hop_ptr.................0x0 > hop_count...............0x2 > trans_id................0x13c9 > attr_id.................0x16 > (P_KeyTable) > resv....................0x0 > attr_mod................0x10000 > m_key...................0x0000000000000000 > dr_slid.................0xFFFF > dr_dlid.................0xFFFF > > Initial path: [0][1][16] > Return path: [0][0][0] > Reserved: [0][0][0][0][0][0][0] > > > From halr at voltaire.com Wed Nov 2 03:58:02 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Nov 2005 06:58:02 -0500 Subject: [openib-general] Re: Questions about libibat, ib_uat, and ib_a In-Reply-To: References: Message-ID: <1130932682.4381.3532.camel@hal.voltaire.com> On Wed, 2005-11-02 at 01:02, Pradeep Satyanarayana wrote: > openib-general-bounces at openib.org wrote on 10/18/2005 03:40:47 PM: > > > > > > > > > > > On Mon, 2005-10-18 at 10:07, Kevin Reilly wrote: > > >On Mon, 2005-10-17 at 10:07, Hal Rosenstock wrote: > > >> > Should this code work, because it seems that out_dev is a > kernel > > >> > address (platform: PPC64) which cannot accessed by a userspace > > >> > program. Via GDB I can see that rt has the following content: > > >> > > > >> > The address is rt->out_dev = 0xc0000000cffaa800 which looks > like a > > >> > kernel address. > > >> > > >> Yes, this is a bug which has been previously pointed out on the > list and > > >> not fixed. > > > > > Can some one point me to the previous discussions on this list (search > did not yield any results)? There were various posts from Heiko J Schick on 10/17 and subsequent ones from Kevin. > The problem is because of a copy_to_user (in uat.c) between struct > ib_at_ib_route > which are different between user and kernel space causing this crash. 
> What was the rationale of putting a pointer to struct ibv_device in > the user space version of > ib_at_ib_route? The out_dev field in user space is not really used as > far as I could see. > > > >The fix for this involves an ABI change: it should return the GID > of the > > >outgoing IB device. > > > > > Would a simple solution like adding a device_name field to both the > ib_at_ib_route structures > be acceptable? The out_dev field could be used as a "reserved" field > in user space and not be used. > That should not break anything as far as I can see. Ideally, the current out_dev field should be removed and all consumers should be converted over to the new structures/interfaces. Guess the question also is whether people want this by name, GID, or both ? -- Hal > > >-- Hal > > Pradeep > pradeep at us.ibm.com > > ______________________________________________________________________ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From kenjeffries at austin.rr.com Wed Nov 2 04:49:09 2005 From: kenjeffries at austin.rr.com (Kenneth L Jeffries) Date: Wed, 2 Nov 2005 06:49:09 -0600 Subject: [openib-general] Re: [PATCH] [SRP] support for it_iu length negotiation References: <000101c5df65$01a94a90$9e5aa8c0@infiniconsys.com> Message-ID: <025401c5dfab$d22c7de0$0a97a8c0@blacktip> From: "Fab Tillier" Sent: Tuesday, November 01, 2005 10:22 PM >Even 350 bytes is a burden - imagine a target that supports a queue depth of >1000 I/Os from a few dozen initators. Ideally, I'd like to see us use just >DDBDs and the 64-byte IU, along with registering the data buffers on a per-I/O >basis, either via FMR or regular MRs. Wouldn't a registering a MR per i/o kill performance? Right now, I believe, the srp initiator registers all memory in as one region. 
>> [as an aside, it sure would be nice if we could do an SRP-3 (since SRP-2 >> is dead) where multiple direct descriptors would be allowed. The only >> way to get multiple descriptors now is with indirect descriptors.] >That saves you 20 bytes - not a huge gain. Yes but I wasn't clear. Allowing multiple direct descriptors would make it reasonable for a target to not implement indirect descriptors at all. Presently target implementers may be tempted to only partially implement indirect descriptors by implementing partial descriptor list processing but not the actual indirect list. There is an argument that says that making iu's really big will eliminate real indirect descriptors ( that is, indirect descriptors beyond the partial list delivered in the iu) and make complete implementation (ie fetching the rest of the list) of indirect descriptors unnecessary. >> I am pretty sure that someone doing a video server might want to do, say, >> 1MB i/o's. 1MB with 4KB pages means 256 descriptors and an iu of >> something over 4096 bytes. I definitely don't want to be told by the srp >> initiator that I need to use 4KB iu's. (So we agree there.) >For large I/O, doing a registration of the buffer and sending a DDBD with a >single descriptor might well provide the best performance. If you look at the >traffic on the wire, having the target do multiple page-sized RDMA operations is >far less efficient than creating a virtual contiguous (to the target) region >that a single RDMA operation can service. Agreed. But I'm missing something (no doubt because I'm working on the embedded target side, not the Linux side). It looks like the srp initiator registers all of kernel memory and does i/o from there. I'm not sure that an application can cause an arbitrarily large address-contiguous payload to appear on the wire. Probably I just don't understand all of what is happening there. 
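As a rough check of the IU sizes mentioned in this exchange (the 64-byte IU, the 20 bytes saved, and "something over 4096 bytes" for a 1MB I/O), the arithmetic can be sketched in C. The constants are assumptions based on the usual SRP layout: a 48-byte SRP_CMD up to and including a 16-byte CDB, 16-byte memory descriptors, and a 20-byte indirect header (table descriptor plus total length). The function names are hypothetical and are not taken from any initiator or target code.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes, see the note above; not taken from a header in this tree. */
enum {
    SRP_CMD_HDR_LEN      = 48, /* SRP_CMD through a 16-byte CDB */
    SRP_DIRECT_DESC_LEN  = 16, /* 8-byte address, 4-byte key, 4-byte length */
    SRP_INDIRECT_HDR_LEN = 20  /* 16-byte table descriptor + 4-byte total length */
};

/* IU length when the command carries a single direct descriptor (DDBD). */
static size_t srp_iu_len_direct(void)
{
    return SRP_CMD_HDR_LEN + SRP_DIRECT_DESC_LEN;
}

/*
 * IU length when every page of the I/O gets its own descriptor in the
 * partial list carried inside the IU (indirect format).
 */
static size_t srp_iu_len_indirect(size_t io_bytes, size_t page_bytes)
{
    size_t ndesc;

    assert(page_bytes != 0);
    ndesc = (io_bytes + page_bytes - 1) / page_bytes; /* round up to whole pages */
    return SRP_CMD_HDR_LEN + SRP_INDIRECT_HDR_LEN + ndesc * SRP_DIRECT_DESC_LEN;
}
```

Under these assumptions, srp_iu_len_direct() gives 64 bytes, and srp_iu_len_indirect(1 << 20, 4096) gives 4164 bytes for 256 descriptors, which matches the figures quoted in the thread.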
Ken Jeffries From halr at voltaire.com Wed Nov 2 04:43:22 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Nov 2005 07:43:22 -0500 Subject: [openib-general] [PATCH] OpenSM: Clear port number in attribute modifier for P_KeyTable when not switch Message-ID: <1130935402.4381.3627.camel@hal.voltaire.com> Hi, Any objections to committing the patch below ? -- Hal When obtaining the P_KeyTable, clear the high 16 bits of the attribute modifier when node is not a switch. This is supposed to be an ignore field but not all implementations are conformant with this. Signed-off-by: Hal Rosenstock Index: osm_port_info_rcv.c =================================================================== --- osm_port_info_rcv.c (revision 3906) +++ osm_port_info_rcv.c (working copy) @@ -430,6 +430,7 @@ void osm_pkey_get_tables( osm_dr_path_t path; uint8_t port_num; uint16_t block_num, max_blocks; + uint32_t attr_mod_ho; osm_switch_t* p_switch; OSM_LOG_ENTER( p_log, osm_physp_has_pkey ); @@ -455,7 +456,7 @@ void osm_pkey_get_tables( else { /* This is a switch, and not a management port. The maximum blocks is defined - on the switch info partition enforcement cap. */ + in the switch info partition enforcement cap. */ p_switch = osm_get_switch_by_guid(p_subn, p_node->node_info.node_guid); if (! p_switch) @@ -472,10 +473,14 @@ void osm_pkey_get_tables( for (block_num = 0 ; block_num < max_blocks ; block_num++) { + if (osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH) + attr_mod_ho = block_num; + else + attr_mod_ho = block_num | (port_num << 16); status = osm_req_get( p_req, &path, IB_MAD_ATTR_P_KEY_TABLE, - cl_hton32(block_num | (port_num << 16) ), + cl_hton32(attr_mod_ho), CL_DISP_MSGID_NONE, &context ); From mst at mellanox.co.il Wed Nov 2 05:25:35 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 2 Nov 2005 15:25:35 +0200 Subject: [openib-general] Re: build fails on revision 3930 In-Reply-To: <20051102090310.GT31134@mellanox.co.il> References: <6.2.3.4.2.20051101114151.0227ba00@cic-mail.lanl.gov> <20051102090310.GT31134@mellanox.co.il> Message-ID: <20051102132535.GB31134@mellanox.co.il> Quoting r. Michael S. Tsirkin : > Subject: Re: build fails on revision 3930 > > Quoting James W. Barker : > > Subject: build fails on revision 3930 > > > > All, > > > > Following the instructions posted in your > > "installation cheetsheet" after I issue the > > command "make modules modules_install" the build > > fails with the error message below (this is > > revision 3930), the same procedure (not sure > > which revision number) was successful last week: > > Looks like you are using kernel 2.6.13 or older. > The subversion trunk is for the latest kernels.org kernel only, which > is 2.6.14 as of this writing. I have now updated the cheat sheet with this information. -- MST
From halr at voltaire.com Wed Nov 2 05:35:38 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Nov 2005 08:35:38 -0500 Subject: [openib-general] Re: [PATCH] Osmtest - update command options + vapi fix In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E35AC09B@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E35AC09B@mtlexch01.mtl.com> Message-ID: <1130938538.4381.3739.camel@hal.voltaire.com> On Tue, 2005-11-01 at 06:14, Liran Sorani wrote: > Hi , Hal . > We've decided to keep and maintain Osmtest in the main trunk , since > it is not only a test but a tool to validate SA/SM. > > The following is a small patch for the follwoing : > 1. Support old form of running osmtest , i.e instead of -g= guid> , use -g and add '-p' option to display current > available port guids. > > 2. Support Vapi stack. > 3. Update Service flow (Update one of the service lease checks from 1 > sec to 4 sec). > 4. Ident switch-case) issues in main.c I just applied the changes for 2 and 3 so far. I am working on the updated change to main.c (1 and 4). -- Hal From eitan at mellanox.co.il Wed Nov 2 05:46:49 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 2 Nov 2005 15:46:49 +0200 Subject: [openib-general] [PATCH] OpenSM: Clear port number in attribu te modifier for P_KeyTable when not switch Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3618891@mtlexch01.mtl.com> It's ok. Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Wednesday, November 02, 2005 2:43 PM > To: openib-general at openib.org > Subject: [openib-general] [PATCH] OpenSM: Clear port number in attribute modifier > for P_KeyTable when not switch > > Hi, > > Any objections to committing the patch below ? > > -- Hal > > When obtaining the P_KeyTable, clear the high 16 bits of the attribute > modifier when node is not a switch. 
This is supposed to be an ignore > field but not all implementations are conformant with this. > > Signed-off-by: Hal Rosenstock > > Index: osm_port_info_rcv.c > =================================================================== > --- osm_port_info_rcv.c (revision 3906) > +++ osm_port_info_rcv.c (working copy) > @@ -430,6 +430,7 @@ void osm_pkey_get_tables( > osm_dr_path_t path; > uint8_t port_num; > uint16_t block_num, max_blocks; > + uint32_t attr_mod_ho; > osm_switch_t* p_switch; > > OSM_LOG_ENTER( p_log, osm_physp_has_pkey ); > @@ -455,7 +456,7 @@ void osm_pkey_get_tables( > else > { > /* This is a switch, and not a management port. The maximum blocks is defined > - on the switch info partition enforcement cap. */ > + in the switch info partition enforcement cap. */ > p_switch = osm_get_switch_by_guid(p_subn, p_node->node_info.node_guid); > > if (! p_switch) > @@ -472,10 +473,14 @@ void osm_pkey_get_tables( > > for (block_num = 0 ; block_num < max_blocks ; block_num++) > { > + if (osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH) > + attr_mod_ho = block_num; > + else > + attr_mod_ho = block_num | (port_num << 16); > status = osm_req_get( p_req, > &path, > IB_MAD_ATTR_P_KEY_TABLE, > - cl_hton32(block_num | (port_num << 16) ), > + cl_hton32(attr_mod_ho), > CL_DISP_MSGID_NONE, > &context ); > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From halr at voltaire.com Wed Nov 2 05:46:26 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Nov 2005 08:46:26 -0500 Subject: [openib-general] Re: Re:[PATCH] Osmtest - update command option + vapi fix In-Reply-To: <30u0ewv66s.fsf@mtl066.yok.mtl.com> References: <30u0ewv66s.fsf@mtl066.yok.mtl.com> Message-ID: <1130939185.4381.3763.camel@hal.voltaire.com> Hi Liran, On Tue, 2005-11-01 at 08:38, Liran Sorani wrote: > Hi Hal, > 1. Regarding the osmtest_SOURCES , it works both ways (i.e compile all files required) , > still the correct one is += I understand. You only had = not += in your patch for this. I changed it so that it works and doesn't override osmtest_SOURCES but adds to it when VAPI is being built. > 2. Following is the patch for main.c : > > Index: main.c > =================================================================== > --- main.c (revision 3928) > +++ main.c (working copy) > @@ -128,9 +128,11 @@ > "--guid \n" > " This option specifies the local port GUID value\n" > " with which osmtest should bind. osmtest may be\n" > - " bound to 1 port at a time.\n" > - " Without -g, osmtest displays a menu of possible\n" > - " port GUIDs and waits for user input.\n\n" ); > + " bound to 1 port at a time.\n\n"); > + printf( "-p \n" > + "--port\n" > + " This option display menu of possible local port GUID values\n" > + " with which osmtest could bind.\n\n"); > printf( "-h\n" > "--help\n" " Display this usage info then exit.\n\n" ); > printf( "-i \n" > @@ -160,9 +162,9 @@ > " --- -----------------\n" > " -M1 - Short Multicast Flow (default) - single mode.\n" > " -M2 - Short Multicast Flow - multiple mode.\n" > - " -M3 - Long Multicast Flow - single mode.\n" > - " -M4 - Long Multicast Flow - mutiple mode.\n" > - " Single mode - Osmtest is tested alone, with no other\n" > + " -M3 - Long MultiCast Flow - single mode.\n" > + " -M4 - Long MultiCast Flow - mutiple mode.\n" Should it be Multicast or MultiCast ? 
-- Hal > + " Single mode - Osmtest is tested alone , with no other \n" > " apps that interact vs. OpenSM MC.\n" > " Multiple mode - Could be run with other apps using MC vs.\n" > " OpenSM." > @@ -305,7 +307,7 @@ > char flow_name[64]; > boolean_t mem_track = FALSE; > uint32_t next_option; > - const char *const short_option = "f:l:m:M:d:g::s:t:i:cvVh"; > + const char *const short_option = "f:l:m:M:d:g:s:t:i:pcvVh"; > > /* > * In the array below, the 2nd parameter specified the number > @@ -322,9 +324,10 @@ > {"inventory", 1, NULL, 'i'}, > {"max_lid", 1, NULL, 'm'}, > {"guid", 2, NULL, 'g'}, > + {"port", 0, NULL, 'p'}, > {"help", 0, NULL, 'h'}, > {"stress", 1, NULL, 's'}, > - {"Multicast_Mode", 1, NULL, 'M'}, > + {"MultiCast_Mode", 1, NULL, 'M'}, > {"timeout", 1, NULL, 't'}, > {"verbose", 0, NULL, 'v'}, > {"log_file", 1, NULL, 'l'}, > @@ -363,7 +366,6 @@ > { > next_option = getopt_long_only( argc, argv, short_option, > long_option, NULL ); > - > switch ( next_option ) > { > case 'c': > @@ -446,28 +448,30 @@ > break; > > case 'g': > - /* > - Specifies port guid with which to bind. > - */ > - if (optarg) { > - guid = cl_hton64( strtoull( optarg, NULL, 16 )); > - printf(" Guid <0x%"PRIx64">\n", cl_hton64( guid )); > - } else > - guid = INVALID_GUID; > - break; > - > + /* > + * Specifies port guid with which to bind. > + */ > + guid = cl_hton64( strtoull( optarg, NULL, 16 )); > + printf(" Guid <0x%"PRIx64">\n", cl_hton64( guid )); > + break; > + case 'p': > + /* > + * Display current port guids > + */ > + guid = INVALID_GUID; > + break; > case 't': > - /* > + /* > * Specifies transaction timeout. 
> - */ > - opt.transaction_timeout = strtol( optarg, NULL, 0 ); > - printf( "\tTransaction timeout = %d\n", opt.transaction_timeout ); > - break; > + */ > + opt.transaction_timeout = strtol( optarg, NULL, 0 ); > + printf( "\tTransaction timeout = %d\n", opt.transaction_timeout ); > + break; > > case 'l': > - opt.log_file = optarg; > - printf("\tLog File:%s\n", opt.log_file ); > - break; > + opt.log_file = optarg; > + printf("\tLog File:%s\n", opt.log_file ); > + break; > > case 'v': > /* > @@ -510,32 +514,32 @@ > } > break; > > - case 'M': > - /* > - * Perform stress test. > - */ > - opt.mmode = strtol( optarg, NULL, 0 ); > - printf( "\tMulticast test enabled: " ); > - switch ( opt.mmode ) > - { > - case 1: > - printf( "Short MC Flow - single mode (default)\n" ); > - break; > - case 2: > - printf( "Short MC Flow - mutiple mode\n" ); > - break; > - case 3: > - printf( "Long MC Flow - single mode\n" ); > - break; > - case 4: > - printf( "Long MC Flow - mutiple mode\n" ); > - break; > - default: > - printf( "Unknown value %u (ignored)\n", opt.stress ); > - opt.mmode = 0; > - break; > - } > - break; > + case 'M': > + /* > + * Perform stress test. 
> + */ > + opt.mmode = strtol( optarg, NULL, 0 ); > + printf( "\tMultiCast test enabled: " ); > + switch ( opt.mmode ) > + { > + case 1: > + printf( "Short MC Flow - single mode (default)\n" ); > + break; > + case 2: > + printf( "Short MC Flow - mutiple mode\n" ); > + break; > + case 3: > + printf( "Long MC Flow - single mode\n" ); > + break; > + case 4: > + printf( "Long MC Flow - mutiple mode\n" ); > + break; > + default: > + printf( "Unknown value %u (ignored)\n", opt.stress ); > + opt.mmode = 0; > + break; > + } > + break; > > case 'd': > /* > From halr at voltaire.com Wed Nov 2 05:54:49 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Nov 2005 08:54:49 -0500 Subject: [openib-general] Re: Re:[PATCH] Osmtest - update command option + vapi fix Message-ID: <1130939586.4381.3778.camel@hal.voltaire.com> Hi Liran, On Tue, 2005-11-01 at 08:38, Liran Sorani wrote: > Hi Hal, > 1. Regarding the osmtest_SOURCES , it works both ways (i.e compile all files required) , > still the correct one is += I understand. You only had = not += in your patch for this. I changed it so that it works and doesn't override osmtest_SOURCES but adds to it when VAPI is being built. > 2. Following is the patch for main.c : > > Index: main.c > =================================================================== > --- main.c (revision 3928) > +++ main.c (working copy) > @@ -128,9 +128,11 @@ > "--guid \n" > " This option specifies the local port GUID value\n" > " with which osmtest should bind. 
osmtest may be\n" > - " bound to 1 port at a time.\n" > - " Without -g, osmtest displays a menu of possible\n" > - " port GUIDs and waits for user input.\n\n" ); > + " bound to 1 port at a time.\n\n"); > + printf( "-p \n" > + "--port\n" > + " This option display menu of possible local port GUID values\n" > + " with which osmtest could bind.\n\n"); > printf( "-h\n" > "--help\n" " Display this usage info then exit.\n\n" ); > printf( "-i \n" > @@ -160,9 +162,9 @@ > " --- -----------------\n" > " -M1 - Short Multicast Flow (default) - single mode.\n" > " -M2 - Short Multicast Flow - multiple mode.\n" > - " -M3 - Long Multicast Flow - single mode.\n" > - " -M4 - Long Multicast Flow - mutiple mode.\n" > - " Single mode - Osmtest is tested alone, with no other\n" > + " -M3 - Long MultiCast Flow - single mode.\n" > + " -M4 - Long MultiCast Flow - mutiple mode.\n" Should it be MultiCast or Multicast ? -- Hal > + " Single mode - Osmtest is tested alone , with no other \n" > " apps that interact vs. OpenSM MC.\n" > " Multiple mode - Could be run with other apps using MC vs.\n" > " OpenSM." > @@ -305,7 +307,7 @@ > char flow_name[64]; > boolean_t mem_track = FALSE; > uint32_t next_option; > - const char *const short_option = "f:l:m:M:d:g::s:t:i:cvVh"; > + const char *const short_option = "f:l:m:M:d:g:s:t:i:pcvVh"; > > /* > * In the array below, the 2nd parameter specified the number > @@ -322,9 +324,10 @@ > {"inventory", 1, NULL, 'i'}, > {"max_lid", 1, NULL, 'm'}, > {"guid", 2, NULL, 'g'}, > + {"port", 0, NULL, 'p'}, > {"help", 0, NULL, 'h'}, > {"stress", 1, NULL, 's'}, > - {"Multicast_Mode", 1, NULL, 'M'}, > + {"MultiCast_Mode", 1, NULL, 'M'}, > {"timeout", 1, NULL, 't'}, > {"verbose", 0, NULL, 'v'}, > {"log_file", 1, NULL, 'l'}, > @@ -363,7 +366,6 @@ > { > next_option = getopt_long_only( argc, argv, short_option, > long_option, NULL ); > - > switch ( next_option ) > { > case 'c': > @@ -446,28 +448,30 @@ > break; > > case 'g': > - /* > - Specifies port guid with which to bind. 
> - */ > - if (optarg) { > - guid = cl_hton64( strtoull( optarg, NULL, 16 )); > - printf(" Guid <0x%"PRIx64">\n", cl_hton64( guid )); > - } else > - guid = INVALID_GUID; > - break; > - > + /* > + * Specifies port guid with which to bind. > + */ > + guid = cl_hton64( strtoull( optarg, NULL, 16 )); > + printf(" Guid <0x%"PRIx64">\n", cl_hton64( guid )); > + break; > + case 'p': > + /* > + * Display current port guids > + */ > + guid = INVALID_GUID; > + break; > case 't': > - /* > + /* > * Specifies transaction timeout. > - */ > - opt.transaction_timeout = strtol( optarg, NULL, 0 ); > - printf( "\tTransaction timeout = %d\n", opt.transaction_timeout ); > - break; > + */ > + opt.transaction_timeout = strtol( optarg, NULL, 0 ); > + printf( "\tTransaction timeout = %d\n", opt.transaction_timeout ); > + break; > > case 'l': > - opt.log_file = optarg; > - printf("\tLog File:%s\n", opt.log_file ); > - break; > + opt.log_file = optarg; > + printf("\tLog File:%s\n", opt.log_file ); > + break; > > case 'v': > /* > @@ -510,32 +514,32 @@ > } > break; > > - case 'M': > - /* > - * Perform stress test. > - */ > - opt.mmode = strtol( optarg, NULL, 0 ); > - printf( "\tMulticast test enabled: " ); > - switch ( opt.mmode ) > - { > - case 1: > - printf( "Short MC Flow - single mode (default)\n" ); > - break; > - case 2: > - printf( "Short MC Flow - mutiple mode\n" ); > - break; > - case 3: > - printf( "Long MC Flow - single mode\n" ); > - break; > - case 4: > - printf( "Long MC Flow - mutiple mode\n" ); > - break; > - default: > - printf( "Unknown value %u (ignored)\n", opt.stress ); > - opt.mmode = 0; > - break; > - } > - break; > + case 'M': > + /* > + * Perform stress test. 
> + */ > + opt.mmode = strtol( optarg, NULL, 0 ); > + printf( "\tMultiCast test enabled: " ); > + switch ( opt.mmode ) > + { > + case 1: > + printf( "Short MC Flow - single mode (default)\n" ); > + break; > + case 2: > + printf( "Short MC Flow - mutiple mode\n" ); > + break; > + case 3: > + printf( "Long MC Flow - single mode\n" ); > + break; > + case 4: > + printf( "Long MC Flow - mutiple mode\n" ); > + break; > + default: > + printf( "Unknown value %u (ignored)\n", opt.stress ); > + opt.mmode = 0; > + break; > + } > + break; > > case 'd': > /* > From mst at mellanox.co.il Wed Nov 2 06:14:28 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 Nov 2005 16:14:28 +0200 Subject: [openib-general] openib segfaults when openib is not loaded Message-ID: <20051102141428.GE31134@mellanox.co.il> Hi! If I try to load opensm without loading any of openib modules, opensm crashes on exit. Has anyone else seen this? # /usr/local/bin/opensm ------------------------------------------------- OpenSM Rev:openib-1.1.0 Command Line Arguments: Log File: /var/log/osm.log ------------------------------------------------- OpenSM Rev:openib-1.1.0 ibwarn: [8954] umad_init: can't read ABI version from /sys/class/infiniband_mad/abi_version (No such file or directory): is ib_umad module loaded? Error from osm_vendor_get_all_port_attr (ffffffff) Error: Could not get port guid Exiting SM Segmentation fault (core dumped) -- MST From halr at voltaire.com Wed Nov 2 06:20:18 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Nov 2005 09:20:18 -0500 Subject: [openib-general] Re: openib segfaults when openib is not loaded In-Reply-To: <20051102141428.GE31134@mellanox.co.il> References: <20051102141428.GE31134@mellanox.co.il> Message-ID: <1130941218.4381.3821.camel@hal.voltaire.com> On Wed, 2005-11-02 at 09:14, Michael S. Tsirkin wrote: > Hi! > If I try to load opensm without loading any of openib modules, > opensm crashes on exit. > Has anyone else seen this? 
>
> # /usr/local/bin/opensm
> -------------------------------------------------
> OpenSM Rev:openib-1.1.0
> Command Line Arguments:
> Log File: /var/log/osm.log
> -------------------------------------------------
> OpenSM Rev:openib-1.1.0
>
> ibwarn: [8954] umad_init: can't read ABI version from /sys/class/infiniband_mad/abi_version (No such file or directory): is ib_umad module loaded?
>
> Error from osm_vendor_get_all_port_attr (ffffffff)
> Error: Could not get port guid
> Exiting SM
>
> Segmentation fault (core dumped)

Yes, this seg fault is caused by the following:
osm_opensm_destroy shuts down the dispatcher, and after that
osm_vl15_destroy attempts to unregister with the dispatcher (although
this has already been done).

osm_opensm.c::osm_opensm_destroy

   /* shut down the dispatcher - so no new messages cross */
   cl_disp_shutdown( &p_osm->disp );

   /* cleanup all messages on VL15 fifo that were not sent yet */
   osm_vl15_shutdown( &p_osm->vl15, &p_osm->mad_pool );

   /* lock the whole thing so we do not get any requests etc */
   cl_plock_excl_acquire( &p_osm->lock );

   /* do the destruction in reverse order as init */
   updn_destroy( p_osm->p_updn_ucast_routing );
   osm_sa_destroy( &p_osm->sa );
   osm_sm_destroy( &p_osm->sm );
   osm_db_destroy( &p_osm->db );
   osm_vl15_destroy( &p_osm->vl15, &p_osm->mad_pool );

My workaround has been to remove the unregister call from
osm_vl15intf.c::osm_vl15_destroy, but I'm not sure yet that this is the
best long term fix. I haven't searched for other paths that differ from
this flow.

This seems lower priority to me than some other issues I'm still sorting
through, but I will get back to this unless someone else gets to it first
or thinks that the workaround I have should be made permanent.
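
The ordering problem described above can be sketched in isolation. This is only an illustration under stated assumptions: the `toy_*` types and functions are hypothetical stand-ins, not the complib dispatcher or OpenSM API, and the sketch assumes that shutdown already drops every registration, as the flow above suggests. Under that assumption, an unregister that runs later has to notice the shutdown and do nothing.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; NOT the real complib/OpenSM dispatcher API. */
#define TOY_MAX_CLIENTS 4

typedef struct toy_disp {
    bool running;                    /* false once the dispatcher is shut down */
    void *clients[TOY_MAX_CLIENTS];  /* registered client contexts */
} toy_disp_t;

typedef struct toy_handle {
    toy_disp_t *disp;                /* NULL means "not registered" */
    int slot;
} toy_handle_t;

void toy_disp_init(toy_disp_t *d)
{
    d->running = true;
    for (int i = 0; i < TOY_MAX_CLIENTS; i++)
        d->clients[i] = NULL;
}

/* Returns a handle; handle.disp is NULL if registration failed. */
toy_handle_t toy_disp_register(toy_disp_t *d, void *ctx)
{
    toy_handle_t h = { NULL, -1 };

    if (!d->running)
        return h;
    for (int i = 0; i < TOY_MAX_CLIENTS; i++) {
        if (d->clients[i] == NULL) {
            d->clients[i] = ctx;
            h.disp = d;
            h.slot = i;
            return h;
        }
    }
    return h;
}

/* Shutdown drops every registration (the assumption made in this sketch). */
void toy_disp_shutdown(toy_disp_t *d)
{
    for (int i = 0; i < TOY_MAX_CLIENTS; i++)
        d->clients[i] = NULL;
    d->running = false;
}

/*
 * Safe to call after shutdown or more than once.
 * Returns true only if a live registration was actually removed.
 */
bool toy_disp_unregister(toy_handle_t *h)
{
    bool removed = false;

    if (h->disp != NULL && h->disp->running && h->slot >= 0) {
        h->disp->clients[h->slot] = NULL;
        removed = true;
    }
    /* Invalidate the handle so a second call is a no-op. */
    h->disp = NULL;
    h->slot = -1;
    return removed;
}
```

With a guard of this kind the destroy path could keep its unregister call regardless of whether shutdown ran first; whether that is preferable to removing the call, as in the workaround, is a separate question.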
-- Hal From liran at mellanox.co.il Wed Nov 2 06:35:22 2005 From: liran at mellanox.co.il (Liran Sorani) Date: Wed, 2 Nov 2005 16:35:22 +0200 Subject: [openib-general] RE: Re:[PATCH] Osmtest - update command option + vapi fix Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E372A1E1@mtlexch01.mtl.com> Hi , Hal . PLS see below , search for [LS] -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Wednesday, November 02, 2005 3:55 PM To: Liran Sorani Cc: openib-general at openib.org Subject: Re: Re:[PATCH] Osmtest - update command option + vapi fix Hi Liran, On Tue, 2005-11-01 at 08:38, Liran Sorani wrote: > Hi Hal, > 1. Regarding the osmtest_SOURCES , it works both ways (i.e compile all files required) , > still the correct one is += I understand. You only had = not += in your patch for this. I changed it so that it works and doesn't override osmtest_SOURCES but adds to it when VAPI is being built. > 2. Following is the patch for main.c : > > Index: main.c > =================================================================== > --- main.c (revision 3928) > +++ main.c (working copy) > @@ -128,9 +128,11 @@ > "--guid \n" > " This option specifies the local port GUID value\n" > " with which osmtest should bind. 
osmtest may be\n" > - " bound to 1 port at a time.\n" > - " Without -g, osmtest displays a menu of possible\n" > - " port GUIDs and waits for user input.\n\n" ); > + " bound to 1 port at a time.\n\n"); > + printf( "-p \n" > + "--port\n" > + " This option display menu of possible local port GUID values\n" > + " with which osmtest could bind.\n\n"); > printf( "-h\n" > "--help\n" " Display this usage info then exit.\n\n" ); > printf( "-i \n" > @@ -160,9 +162,9 @@ > " --- -----------------\n" > " -M1 - Short Multicast Flow (default) - single mode.\n" > " -M2 - Short Multicast Flow - multiple mode.\n" > - " -M3 - Long Multicast Flow - single mode.\n" > - " -M4 - Long Multicast Flow - mutiple mode.\n" > - " Single mode - Osmtest is tested alone, with no other\n" > + " -M3 - Long MultiCast Flow - single mode.\n" > + " -M4 - Long MultiCast Flow - mutiple mode.\n" Should it be MultiCast or Multicast ? [LS] Lets set it to Multicast. -- Hal > + " Single mode - Osmtest is tested alone , with no other \n" > " apps that interact vs. OpenSM MC.\n" > " Multiple mode - Could be run with other apps using MC vs.\n" > " OpenSM." 
> @@ -305,7 +307,7 @@ > char flow_name[64]; > boolean_t mem_track = FALSE; > uint32_t next_option; > - const char *const short_option = "f:l:m:M:d:g::s:t:i:cvVh"; > + const char *const short_option = "f:l:m:M:d:g:s:t:i:pcvVh"; > > /* > * In the array below, the 2nd parameter specified the number > @@ -322,9 +324,10 @@ > {"inventory", 1, NULL, 'i'}, > {"max_lid", 1, NULL, 'm'}, > {"guid", 2, NULL, 'g'}, > + {"port", 0, NULL, 'p'}, > {"help", 0, NULL, 'h'}, > {"stress", 1, NULL, 's'}, > - {"Multicast_Mode", 1, NULL, 'M'}, > + {"MultiCast_Mode", 1, NULL, 'M'}, > {"timeout", 1, NULL, 't'}, > {"verbose", 0, NULL, 'v'}, > {"log_file", 1, NULL, 'l'}, > @@ -363,7 +366,6 @@ > { > next_option = getopt_long_only( argc, argv, short_option, > long_option, NULL ); > - > switch ( next_option ) > { > case 'c': > @@ -446,28 +448,30 @@ > break; > > case 'g': > - /* > - Specifies port guid with which to bind. > - */ > - if (optarg) { > - guid = cl_hton64( strtoull( optarg, NULL, 16 )); > - printf(" Guid <0x%"PRIx64">\n", cl_hton64( guid )); > - } else > - guid = INVALID_GUID; > - break; > - > + /* > + * Specifies port guid with which to bind. > + */ > + guid = cl_hton64( strtoull( optarg, NULL, 16 )); > + printf(" Guid <0x%"PRIx64">\n", cl_hton64( guid )); > + break; > + case 'p': > + /* > + * Display current port guids > + */ > + guid = INVALID_GUID; > + break; > case 't': > - /* > + /* > * Specifies transaction timeout. 
> - */ > - opt.transaction_timeout = strtol( optarg, NULL, 0 ); > - printf( "\tTransaction timeout = %d\n", opt.transaction_timeout ); > - break; > + */ > + opt.transaction_timeout = strtol( optarg, NULL, 0 ); > + printf( "\tTransaction timeout = %d\n", opt.transaction_timeout ); > + break; > > case 'l': > - opt.log_file = optarg; > - printf("\tLog File:%s\n", opt.log_file ); > - break; > + opt.log_file = optarg; > + printf("\tLog File:%s\n", opt.log_file ); > + break; > > case 'v': > /* > @@ -510,32 +514,32 @@ > } > break; > > - case 'M': > - /* > - * Perform stress test. > - */ > - opt.mmode = strtol( optarg, NULL, 0 ); > - printf( "\tMulticast test enabled: " ); > - switch ( opt.mmode ) > - { > - case 1: > - printf( "Short MC Flow - single mode (default)\n" ); > - break; > - case 2: > - printf( "Short MC Flow - mutiple mode\n" ); > - break; > - case 3: > - printf( "Long MC Flow - single mode\n" ); > - break; > - case 4: > - printf( "Long MC Flow - mutiple mode\n" ); > - break; > - default: > - printf( "Unknown value %u (ignored)\n", opt.stress ); > - opt.mmode = 0; > - break; > - } > - break; > + case 'M': > + /* > + * Perform stress test. > + */ > + opt.mmode = strtol( optarg, NULL, 0 ); > + printf( "\tMultiCast test enabled: " ); > + switch ( opt.mmode ) > + { > + case 1: > + printf( "Short MC Flow - single mode (default)\n" ); > + break; > + case 2: > + printf( "Short MC Flow - mutiple mode\n" ); > + break; > + case 3: > + printf( "Long MC Flow - single mode\n" ); > + break; > + case 4: > + printf( "Long MC Flow - mutiple mode\n" ); > + break; > + default: > + printf( "Unknown value %u (ignored)\n", opt.stress ); > + opt.mmode = 0; > + break; > + } > + break; > > case 'd': > /* > -------------- next part -------------- An HTML attachment was scrubbed... 
From halr at voltaire.com Wed Nov 2 06:40:28 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 02 Nov 2005 09:40:28 -0500
Subject: [openib-general] RE: Re:[PATCH] Osmtest - update command option + vapi fix
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E372A1E1@mtlexch01.mtl.com>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E372A1E1@mtlexch01.mtl.com>
Message-ID: <1130942427.4381.3866.camel@hal.voltaire.com>

On Wed, 2005-11-02 at 09:35, Liran Sorani wrote:
> Hi , Hal .
> PLS see below , search for [LS]
>
> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Wednesday, November 02, 2005 3:55 PM
> To: Liran Sorani
> Cc: openib-general at openib.org
> Subject: Re: Re:[PATCH] Osmtest - update command option + vapi fix
>
>
> Hi Liran,
>
> On Tue, 2005-11-01 at 08:38, Liran Sorani wrote:
> > Hi Hal,
> > 1. Regarding the osmtest_SOURCES , it works both ways (i.e compile all files required) ,
> > still the correct one is +=
>
> I understand. You only had = not += in your patch for this. I changed it
> so that it works and doesn't override osmtest_SOURCES but adds to it
> when VAPI is being built.
>
> > 2. Following is the patch for main.c :
>

Thanks. Applied with some minor format changes.

-- Hal

From eitan at mellanox.co.il Wed Nov 2 06:50:57 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 2 Nov 2005 16:50:57 +0200
Subject: [openib-general] Re: openib segfaults when openib is not loaded
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3618893@mtlexch01.mtl.com>

Hi Hal,

Yael is working on the exact same problem. She is probably going to
complete it tomorrow.

The issue was the vl15 cl_unregister, but we are also facing some
issues because the umad receiver never exits.

When MADs arrive after the dispatcher is destroyed they cause a
segfault.

Hope it will all be fixed by the weekend.

EZ

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O.
Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Wednesday, November 02, 2005 4:20 PM > To: Michael S. Tsirkin > Cc: openib-general at openib.org > Subject: [openib-general] Re: openib segfaults when openib is not loaded > > On Wed, 2005-11-02 at 09:14, Michael S. Tsirkin wrote: > > Hi! > > If I try to load opensm without loading any of openib modules, > > opensm crashes on exit. > > Has anyone else seen this? > > > > # /usr/local/bin/opensm > > ------------------------------------------------- > > OpenSM Rev:openib-1.1.0 > > Command Line Arguments: > > Log File: /var/log/osm.log > > ------------------------------------------------- > > OpenSM Rev:openib-1.1.0 > > > > ibwarn: [8954] umad_init: can't read ABI version from > /sys/class/infiniband_mad/abi_version (No such file or directory): is ib_umad module > loaded? > > > > Error from osm_vendor_get_all_port_attr (ffffffff) > > Error: Could not get port guid > > Exiting SM > > > > Segmentation fault (core dumped) > > Yes, this seg fault is caused due to the following: > osm_opensm_destroy shutdowns the dispatcher and subsequent to this > osm_vl15_destroy attempts to unregister with the dispatcher (although > this has already been done). 
> > osm_opensm.c::osm_opensm_destroy > > /* shut down the dispatcher - so no new messages cross */ > cl_disp_shutdown( &p_osm->disp ); > > /* cleanup all messages on VL15 fifo that were not sent yet */ > osm_vl15_shutdown( &p_osm->vl15, &p_osm->mad_pool ); > > /* lock the whole thing so we do not get any requests etc */ > cl_plock_excl_acquire( &p_osm->lock ); > > /* do the destruction in reverse order as init */ > updn_destroy( p_osm->p_updn_ucast_routing ); > osm_sa_destroy( &p_osm->sa ); > osm_sm_destroy( &p_osm->sm ); > osm_db_destroy( &p_osm->db ); > osm_vl15_destroy( &p_osm->vl15, &p_osm->mad_pool ); > > > My workaround has been to remove this from > osm_vl15intf.c::osm_vl15_destroy but I'm not sure this is the best long > term fix as yet. I hadn't searched out whether there were other paths > that were different from this flow. > > This seems lower priority to me than some other issues I'm still sorting > through but I will get back to this unless someone else gets to it first > or thinks that the workaround I have should be made permanent. > > -- Hal > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From rolandd at cisco.com Wed Nov 2 06:55:30 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 02 Nov 2005 06:55:30 -0800 Subject: [openib-general] Re: scsi/srp.h In-Reply-To: <20051102084501.GR31134@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 2 Nov 2005 10:45:01 +0200") References: <20051102084501.GR31134@mellanox.co.il> Message-ID: <52ek5z9k0t.fsf@cisco.com> Michael> Roland, would you mind moving scsi/srp.h from ulp/srp to Michael> infiniband/include in subversion, please? 
Michael> The fact that its under ulp/srp breaks build of a tree Michael> linked to under drivers/infiniband Michael> drivers/infiniband/ulp/srp/ib_srp.c:49:22: scsi/srp.h: No Michael> such file or directory It's not a big deal to move it but I don't understand why your build is breaking. I thought the kernel passed a "-I" option with the current source directory to gcc. I have lots of kernel trees with the svn linux-kernel/infiniband directory symlinked to drivers/infiniband, and the builds all work fine. What does make V=1 drivers/infiniband/ulp/srp/ib_srp.o show for you? - R. From mst at mellanox.co.il Wed Nov 2 07:07:49 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 2 Nov 2005 17:07:49 +0200 Subject: [openib-general] Re: scsi/srp.h In-Reply-To: <52ek5z9k0t.fsf@cisco.com> References: <52ek5z9k0t.fsf@cisco.com> Message-ID: <20051102150749.GG31134@mellanox.co.il> Quoting Roland Dreier : > Michael> Roland, would you mind moving scsi/srp.h from ulp/srp to > Michael> infiniband/include in subversion, please? > > Michael> The fact that its under ulp/srp breaks build of a tree > Michael> linked to under drivers/infiniband > Michael> drivers/infiniband/ulp/srp/ib_srp.c:49:22: scsi/srp.h: No > Michael> such file or directory > > It's not a big deal to move it but I don't understand why your build > is breaking. I thought the kernel passed a "-I" option with the > current source directory to gcc. I have lots of kernel trees with the > svn linux-kernel/infiniband directory symlinked to drivers/infiniband, > and the builds all work fine. What does > > make V=1 drivers/infiniband/ulp/srp/ib_srp.o > > show for you? > > - R. 
> # make V=1 drivers/infiniband/ulp/srp/ib_srp.o make -f scripts/Makefile.build obj=scripts/basic SPLIT include/linux/autoconf.h -> include/config/* make -f scripts/Makefile.build obj=scripts make -f scripts/Makefile.build obj=scripts/mod gcc -Wp,-MD,scripts/mod/.empty.o.d -nostdinc -isystem /usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/include -D__KERNEL__ -Iinclude -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -ffreestanding -O2 -fomit-frame-pointer -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -DKBUILD_BASENAME=empty -DKBUILD_MODNAME=empty -c -o scripts/mod/empty.o scripts/mod/empty.c scripts/mod/mk_elfconfig x86_64 < scripts/mod/empty.o > scripts/mod/elfconfig.h gcc -Wp,-MD,scripts/mod/.file2alias.o.d -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer -c -o scripts/mod/file2alias.o scripts/mod/file2alias.c gcc -Wp,-MD,scripts/mod/.modpost.o.d -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer -c -o scripts/mod/modpost.o scripts/mod/modpost.c gcc -Wp,-MD,scripts/mod/.sumversion.o.d -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer -c -o scripts/mod/sumversion.o scripts/mod/sumversion.c gcc -o scripts/mod/modpost scripts/mod/modpost.o scripts/mod/file2alias.o scripts/mod/sumversion.o make -f scripts/Makefile.build obj=drivers/infiniband/ulp/srp drivers/infiniband/ulp/srp/ib_srp.o gcc -Wp,-MD,drivers/infiniband/ulp/srp/.ib_srp.o.d -nostdinc -isystem /usr/lib64/gcc-lib/x86_64-suse-linux/3.3.5/include -D__KERNEL__ -Iinclude -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -ffreestanding -O2 -fomit-frame-pointer -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Idrivers/infiniband/include -DMODULE -DKBUILD_BASENAME=ib_srp -DKBUILD_MODNAME=ib_srp -c -o 
drivers/infiniband/ulp/srp/ib_srp.o drivers/infiniband/ulp/srp/ib_srp.c drivers/infiniband/ulp/srp/ib_srp.c:49:22: scsi/srp.h: No such file or directory -- MST From rolandd at cisco.com Wed Nov 2 07:11:56 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 02 Nov 2005 07:11:56 -0800 Subject: [openib-general] Re: scsi/srp.h In-Reply-To: <20051102150749.GG31134@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 2 Nov 2005 17:07:49 +0200") References: <52ek5z9k0t.fsf@cisco.com> <20051102150749.GG31134@mellanox.co.il> Message-ID: <52acgn9j9f.fsf@cisco.com> I see -- the build fails if you build directly in your kernel tree. I always build with O=xxx to keep my source tree clean (so I can build multiple targets from the same svn checkout). Anyway, I just moved the include in svn. - R. From halr at voltaire.com Wed Nov 2 07:29:15 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 2 Nov 2005 17:29:15 +0200 Subject: [openib-general] Re: openib segfaults when openib is not load ed Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175CDC@taurus.voltaire.com> On Wed, 2005-11-02 at 09:50, Eitan Zahavi wrote: > Hi Hal, > > Yael is working on the exact same problem. She is probably going to > complete it tomorrow. > > The issue was both the vl15 cl_unregister but we are also facing some > issues as the umad receiver never exists. Yes, I've also been working on making the umad receiver exit. This has also been a lower priority and I don't have a completed solution yet. -- Hal > When MADs are arriving after the dispatcher is destroyed they cause a > segfault. > > Hope it will be all fixed by the weekend. > > EZ > > Eitan Zahavi > Design Technology Director > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Wednesday, November 02, 2005 4:20 PM > > To: Michael S. 
Tsirkin > > Cc: openib-general at openib.org > > Subject: [openib-general] Re: openib segfaults when openib is not > loaded > > > > On Wed, 2005-11-02 at 09:14, Michael S. Tsirkin wrote: > > > Hi! > > > If I try to load opensm without loading any of openib modules, > > > opensm crashes on exit. > > > Has anyone else seen this? > > > > > > # /usr/local/bin/opensm > > > ------------------------------------------------- > > > OpenSM Rev:openib-1.1.0 > > > Command Line Arguments: > > > Log File: /var/log/osm.log > > > ------------------------------------------------- > > > OpenSM Rev:openib-1.1.0 > > > > > > ibwarn: [8954] umad_init: can't read ABI version from > > /sys/class/infiniband_mad/abi_version (No such file or directory): > is ib_umad module > > loaded? > > > > > > Error from osm_vendor_get_all_port_attr (ffffffff) > > > Error: Could not get port guid > > > Exiting SM > > > > > > Segmentation fault (core dumped) > > > > Yes, this seg fault is caused due to the following: > > osm_opensm_destroy shutdowns the dispatcher and subsequent to this > > osm_vl15_destroy attempts to unregister with the dispatcher > (although > > this has already been done). > > > > osm_opensm.c::osm_opensm_destroy > > > > /* shut down the dispatcher - so no new messages cross */ > > cl_disp_shutdown( &p_osm->disp ); > > > > /* cleanup all messages on VL15 fifo that were not sent yet */ > > osm_vl15_shutdown( &p_osm->vl15, &p_osm->mad_pool ); > > > > /* lock the whole thing so we do not get any requests etc */ > > cl_plock_excl_acquire( &p_osm->lock ); > > > > /* do the destruction in reverse order as init */ > > updn_destroy( p_osm->p_updn_ucast_routing ); > > osm_sa_destroy( &p_osm->sa ); > > osm_sm_destroy( &p_osm->sm ); > > osm_db_destroy( &p_osm->db ); > > osm_vl15_destroy( &p_osm->vl15, &p_osm->mad_pool ); > > > > > > My workaround has been to remove this from > > osm_vl15intf.c::osm_vl15_destroy but I'm not sure this is the best > long > > term fix as yet. 
> > I hadn't searched out whether there were other paths
> > that were different from this flow.
> >
> > This seems lower priority to me than some other issues I'm still sorting
> > through but I will get back to this unless someone else gets to it first
> > or thinks that the workaround I have should be made permanent.
> >
> > -- Hal
>

From mshefty at ichips.intel.com Wed Nov 2 08:40:45 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 02 Nov 2005 08:40:45 -0800
Subject: [openib-general] [PATCH] kmalloc + memset(, 0, ) -> kzalloc conversions
In-Reply-To: <524q6vc00x.fsf@cisco.com>
References: <524q6vc00x.fsf@cisco.com>
Message-ID: <4368EC0D.1000009@ichips.intel.com>

Roland Dreier wrote:
> Anyone have any objection to me committing the following patch?

I have no objection.

- Sean

From robert.j.woodruff at intel.com Wed Nov 2 09:27:26 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Wed, 2 Nov 2005 09:27:26 -0800
Subject: [openib-general] Problems with SDP on Itanium
In-Reply-To: <4368EC0D.1000009@ichips.intel.com>
Message-ID:

Has anyone tried using SDP on Itanium ?
I was trying to run a NetPIPE over SDP (svn Rev 3882).
It seems to run fine for small transfers, but
the application hangs when it gets to > 1 Megabyte
transfers.

woody

From bohra at cs.rutgers.edu Wed Nov 2 10:35:28 2005
From: bohra at cs.rutgers.edu (Aniruddha Bohra)
Date: Wed, 02 Nov 2005 13:35:28 -0500
Subject: [openib-general] uDAPL again
Message-ID: <436906F0.3050803@cs.rutgers.edu>

Hello,
The following is the log for a request I am sending,

The number of IOVs for req is 2.
And the iov is shown below :

REQ[0] = (0xb5f3f100, 48, 0xca88003b)
REQ[1] = (0xb5f3f2b8, 152, 0xca88003b)

dapl_ep_post_send (0x8087110, 2, 0x808b300, 0xb5f3f6b4, 0)
dapl_ep_post_send : LOCALIOV[0] = (0xb5f3f100, 48, 0xca88003b)
dapl_ep_post_send : LOCALIOV[1] = (0xb5f3f2b8, 152, 0xca88003b)
post_snd: ep 0x8087110 op 2 ck 0x8087374 sgs 2 l_iov 0x808b300 r_iov 0xbf964290 f 0
post_snd: ep 0x8087110 cookie 0x8087374 segs 2 l_iov 0x808b300
post_snd_localiov: lkey 0xca88003b va 0xb5f3f100 len 48
post_snd: lkey 0xca88003b va 0xb5f3f100 len 48
post_snd_localiov: lkey 0xca88003b va 0xb5f3f2b8 len 152
post_snd: lkey 0xca88003b va 0xb5f3f2b8 len 152
post_snd: op 0x2 flags 0x2 sglist 0xbf9641b0, 2
post_snd: returned
dapl_ep_post_send () returns 0x0
dapl_evd_wait (0x8083ca0, -1, 1, 0xbf9642d0, 0xbf9642cc)
dapl_evd_wait: EVD 0x8083ca0, CQ 0x8083da0
cq_object_wait: CQ channel 0x8081290 time -1
cq_object_wait: RET evd 0x8083ca0 ibv_cq 0x8083da0 ibv_ctx (nil) Success
>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<
dapl_evd_dto_callback : CQE
work_req_id 134771572
status 12
>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<
DTO completion ERROR: 12: op 0xff
disconnect(ep 0x8087110, conn 0x808a008, id 134774528 flags 0)
destroy_cm_id: conn 0x808a008 id 134774528
dapli_evd_post_event: Called with event # 4006

Any ideas how to proceed to even debug this ?

Thanks
Aniruddha

From iod00d at hp.com Wed Nov 2 10:42:40 2005
From: iod00d at hp.com (Grant Grundler)
Date: Wed, 2 Nov 2005 10:42:40 -0800
Subject: [openib-general] Problems with SDP on Itanium
In-Reply-To:
References: <4368EC0D.1000009@ichips.intel.com>
Message-ID: <20051102184240.GJ28222@esmail.cup.hp.com>

On Wed, Nov 02, 2005 at 09:27:26AM -0800, Bob Woodruff wrote:
> Has anyone tried using SDP on Itanium ?

Yes - but it's been 5-6 weeks since I have tried it (SVN r3547).

> I was trying to run a NetPIPE over SDP (svn Rev 3882).
> It seems to run fine for small transfers, but
> the application hangs when it gets to > 1 Megabyte
> transfers.

I haven't tested message sizes > 128KB.
I'll include 256/512/1024/2048 KB message sizes in the next round.

And I still owe Michael some investigation results from the
last round, where perf dropped off to near zero for medium-sized
messages.

grant

From jlentini at netapp.com Wed Nov 2 10:44:05 2005
From: jlentini at netapp.com (James Lentini)
Date: Wed, 2 Nov 2005 13:44:05 -0500 (EST)
Subject: [openib-general] Re: uDAPL again
In-Reply-To: <436906F0.3050803@cs.rutgers.edu>
References: <436906F0.3050803@cs.rutgers.edu>
Message-ID:

On Wed, 2 Nov 2005, Aniruddha Bohra wrote:

> Hello,
> The following is the log for a request I am sending,
>
> The number of IOVs for req is 2. And the iov is shown below :
>
> REQ[0] = (0xb5f3f100, 48, 0xca88003b)
> REQ[1] = (0xb5f3f2b8, 152, 0xca88003b)
>
> dapl_ep_post_send (0x8087110, 2, 0x808b300, 0xb5f3f6b4, 0)
> dapl_ep_post_send : LOCALIOV[0] = (0xb5f3f100, 48, 0xca88003b)
> dapl_ep_post_send : LOCALIOV[1] = (0xb5f3f2b8, 152, 0xca88003b)
> post_snd: ep 0x8087110 op 2 ck 0x8087374 sgs 2 l_iov 0x808b300 r_iov 0xbf964290 f 0
> post_snd: ep 0x8087110 cookie 0x8087374 segs 2 l_iov 0x808b300
> post_snd_localiov: lkey 0xca88003b va 0xb5f3f100 len 48
> post_snd: lkey 0xca88003b va 0xb5f3f100 len 48
> post_snd_localiov: lkey 0xca88003b va 0xb5f3f2b8 len 152
> post_snd: lkey 0xca88003b va 0xb5f3f2b8 len 152
> post_snd: op 0x2 flags 0x2 sglist 0xbf9641b0, 2
> post_snd: returned
> dapl_ep_post_send () returns 0x0
> dapl_evd_wait (0x8083ca0, -1, 1, 0xbf9642d0, 0xbf9642cc)
> dapl_evd_wait: EVD 0x8083ca0, CQ 0x8083da0
> cq_object_wait: CQ channel 0x8081290 time -1
> cq_object_wait: RET evd 0x8083ca0 ibv_cq 0x8083da0 ibv_ctx (nil) Success
> >>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<
> dapl_evd_dto_callback : CQE
> work_req_id 134771572
> status 12

Status 12 is IBV_WC_RETRY_EXC_ERR.
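
As a minimal sketch (not uDAPL or libibverbs code), a numeric completion status such as the 12 in the log can be turned into a name with a small lookup table. The table is hand-copied and covers only the first few codes, so it is an assumption that should be checked against `enum ibv_wc_status` in the installed `<infiniband/verbs.h>`; `wc_status_name` is a hypothetical helper introduced here for illustration.

```c
#include <stddef.h>

/*
 * Names for the first completion status codes, copied by hand.
 * Assumption: the order matches enum ibv_wc_status in the installed
 * <infiniband/verbs.h>; verify before relying on it.
 */
static const char *const wc_status_names[] = {
    "IBV_WC_SUCCESS",           /*  0 */
    "IBV_WC_LOC_LEN_ERR",       /*  1 */
    "IBV_WC_LOC_QP_OP_ERR",     /*  2 */
    "IBV_WC_LOC_EEC_OP_ERR",    /*  3 */
    "IBV_WC_LOC_PROT_ERR",      /*  4 */
    "IBV_WC_WR_FLUSH_ERR",      /*  5 */
    "IBV_WC_MW_BIND_ERR",       /*  6 */
    "IBV_WC_BAD_RESP_ERR",      /*  7 */
    "IBV_WC_LOC_ACCESS_ERR",    /*  8 */
    "IBV_WC_REM_INV_REQ_ERR",   /*  9 */
    "IBV_WC_REM_ACCESS_ERR",    /* 10 */
    "IBV_WC_REM_OP_ERR",        /* 11 */
    "IBV_WC_RETRY_EXC_ERR",     /* 12 */
    "IBV_WC_RNR_RETRY_EXC_ERR", /* 13 */
};

/* Hypothetical helper: name for a numeric status, or "UNKNOWN" if not in the table. */
const char *wc_status_name(int status)
{
    size_t count = sizeof(wc_status_names) / sizeof(wc_status_names[0]);

    if (status < 0 || (size_t)status >= count)
        return "UNKNOWN";
    return wc_status_names[status];
}
```

A retry-exceeded status generally means the sender ran out of transport retries without getting an acknowledgement from the remote side, which is why basic connectivity is the first thing to check.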
Are you sure you can communicate over IB? Do pings over IPoIB work,
etc.?

> >>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<
> DTO completion ERROR: 12: op 0xff
> disconnect(ep 0x8087110, conn 0x808a008, id 134774528 flags 0)
> destroy_cm_id: conn 0x808a008 id 134774528
> dapli_evd_post_event: Called with event # 4006
>
>
> Any ideas how to proceed to even debug this ?
>
> Thanks
> Aniruddha
>

From jlentini at netapp.com Wed Nov 2 10:51:52 2005
From: jlentini at netapp.com (James Lentini)
Date: Wed, 2 Nov 2005 13:51:52 -0500 (EST)
Subject: [openib-general] [OpenSM] SA database query tool
Message-ID:

Hal,

Is there an existing OpenIB tool that can query an SA's database using
MADs? Specifically, I want to retrieve all of the SA's service records.

If such a tool doesn't exist, where would you start writing one? Would
you layer it on top of libibmad?

james

From bohra at cs.rutgers.edu Wed Nov 2 10:59:33 2005
From: bohra at cs.rutgers.edu (Aniruddha Bohra)
Date: Wed, 02 Nov 2005 13:59:33 -0500
Subject: [openib-general] Re: uDAPL again
In-Reply-To:
References: <436906F0.3050803@cs.rutgers.edu>
Message-ID: <43690C95.3050009@cs.rutgers.edu>

James Lentini wrote:

>On Wed, 2 Nov 2005, Aniruddha Bohra wrote:
>
>
>
>>Hello,
>> The following is the log for a request I am sending,
>>
>>The number of IOVs for req is 2.
>>And the iov is shown below :
>>
>>REQ[0] = (0xb5f3f100, 48, 0xca88003b)
>>REQ[1] = (0xb5f3f2b8, 152, 0xca88003b)
>>
>>dapl_ep_post_send (0x8087110, 2, 0x808b300, 0xb5f3f6b4, 0)
>>dapl_ep_post_send : LOCALIOV[0] = (0xb5f3f100, 48, 0xca88003b)
>>dapl_ep_post_send : LOCALIOV[1] = (0xb5f3f2b8, 152, 0xca88003b)
>>post_snd: ep 0x8087110 op 2 ck 0x8087374 sgs 2 l_iov 0x808b300 r_iov 0xbf964290 f 0
>>post_snd: ep 0x8087110 cookie 0x8087374 segs 2 l_iov 0x808b300
>>post_snd_localiov: lkey 0xca88003b va 0xb5f3f100 len 48
>>post_snd: lkey 0xca88003b va 0xb5f3f100 len 48
>>post_snd_localiov: lkey 0xca88003b va 0xb5f3f2b8 len 152
>>post_snd: lkey 0xca88003b va 0xb5f3f2b8 len 152
>>post_snd: op 0x2 flags 0x2 sglist 0xbf9641b0, 2
>>post_snd: returned
>>dapl_ep_post_send () returns 0x0
>>dapl_evd_wait (0x8083ca0, -1, 1, 0xbf9642d0, 0xbf9642cc)
>>dapl_evd_wait: EVD 0x8083ca0, CQ 0x8083da0
>>cq_object_wait: CQ channel 0x8081290 time -1
>>cq_object_wait: RET evd 0x8083ca0 ibv_cq 0x8083da0 ibv_ctx (nil) Success
>>
>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<
>> dapl_evd_dto_callback : CQE
>> work_req_id 134771572
>> status 12
>>
>>
>
>Status 12 is IBV_WC_RETRY_EXC_ERR.
>
>Are you sure you can communicate over IB? Do pings over IPoIB work,
>etc.?
>
>
>
[bohra at hora-3 ~]$ ping -b 10.10.10.255
WARNING: pinging broadcast address
PING 10.10.10.255 (10.10.10.255) 56(84) bytes of data.
64 bytes from 10.10.10.12: icmp_seq=0 ttl=64 time=0.034 ms
64 bytes from 10.10.10.13: icmp_seq=0 ttl=64 time=8.98 ms (DUP!)
64 bytes from 10.10.10.12: icmp_seq=1 ttl=64 time=0.033 ms
64 bytes from 10.10.10.13: icmp_seq=1 ttl=64 time=0.095 ms (DUP!)
64 bytes from 10.10.10.12: icmp_seq=2 ttl=64 time=0.025 ms
64 bytes from 10.10.10.13: icmp_seq=2 ttl=64 time=0.096 ms (DUP!)
--- 10.10.10.255 ping statistics ---
3 packets transmitted, 3 received, +3 duplicates, 0% packet loss, time 2020ms
rtt min/avg/max/mdev = 0.025/1.544/8.986/3.328 ms, pipe 2
[bohra at hora-3 ~]$ ifconfig ib0
ib0       Link encap:UNSPEC  HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
          inet addr:10.10.10.12  Bcast:10.255.255.255  Mask:255.255.255.0
          inet6 addr: fe80::202:c901:81e:7471/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
          RX packets:4 errors:0 dropped:0 overruns:0 frame:0
          TX packets:77 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:308 (308.0 b)  TX bytes:4788 (4.6 KiB)

My target is the filer, which does not respond to pings (10.10.10.11).

Aniruddha

From mst at mellanox.co.il Wed Nov 2 10:58:39 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 2 Nov 2005 20:58:39 +0200
Subject: [openib-general] Re: Problems with SDP on Itanium
In-Reply-To: 
References: 
Message-ID: <20051102185838.GA26005@mellanox.co.il>

Quoting r. Bob Woodruff :
> Subject: Problems with SDP on Itanium
>
> Has anyone tried using SDP on Itanium ?
> I was trying to run a NetPIPE over SDP (svn Rev 3882).
> It seems to run fine for small transfers, but
> the applications hangs when it gets to > 1 Megabyte
> transfers.
>
>
> woody
>

No, I don't think I've seen that one, but it's been a while since I last ran
anything on Itanium.

Can you try to debug it a little? What does it mean that an application
"hangs"? Is some data sent from one side not received by the other?
-- MST From jlentini at netapp.com Wed Nov 2 11:01:11 2005 From: jlentini at netapp.com (James Lentini) Date: Wed, 2 Nov 2005 14:01:11 -0500 (EST) Subject: [openib-general] Re: uDAPL again In-Reply-To: <43690C95.3050009@cs.rutgers.edu> References: <436906F0.3050803@cs.rutgers.edu> <43690C95.3050009@cs.rutgers.edu> Message-ID: On Wed, 2 Nov 2005, Aniruddha Bohra wrote: > James Lentini wrote: > > > On Wed, 2 Nov 2005, Aniruddha Bohra wrote: > > > > > > > Hello, > > > The following is the log for a request I am sending, > > > > > > The number of IOVs for req is 2. And the iov is shown below : > > > > > > REQ[0] = (0xb5f3f100, 48, 0xca88003b)^M > > > REQ[1] = (0xb5f3f2b8, 152, 0xca88003b)^M > > > > > > dapl_ep_post_send (0x8087110, 2, 0x808b300, 0xb5f3f6b4, 0)^M > > > dapl_ep_post_send : LOCALIOV[0] = (0xb5f3f100, 48, 0xca88003b)^M > > > dapl_ep_post_send : LOCALIOV[1] = (0xb5f3f2b8, 152, 0xca88003b)^M > > > post_snd: ep 0x8087110 op 2 ck 0x8087374 sgs 2 l_iov 0x808b300 r_iov > > > 0xbf964290 f 0^M > > > post_snd: ep 0x8087110 cookie 0x8087374 segs 2 l_iov 0x808b300^M > > > post_snd_localiov: lkey 0xca88003b va 0xb5f3f100 len 48 ^M > > > post_snd: lkey 0xca88003b va 0xb5f3f100 len 48 ^M > > > post_snd_localiov: lkey 0xca88003b va 0xb5f3f2b8 len 152 ^M > > > post_snd: lkey 0xca88003b va 0xb5f3f2b8 len 152 ^M > > > post_snd: op 0x2 flags 0x2 sglist 0xbf9641b0, 2^M > > > post_snd: returned^M > > > dapl_ep_post_send () returns 0x0^M > > > dapl_evd_wait (0x8083ca0, -1, 1, 0xbf9642d0, 0xbf9642cc)^M > > > dapl_evd_wait: EVD 0x8083ca0, CQ 0x8083da0^M > > > cq_object_wait: CQ channel 0x8081290 time -1^M > > > cq_object_wait: RET evd 0x8083ca0 ibv_cq 0x8083da0 ibv_ctx (nil) Success^M > > > >>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<^M > > > dapl_evd_dto_callback : CQE ^M > > > work_req_id 134771572^M > > > status 12^M > > > > > > > Status 12 is IBV_WC_RETRY_EXC_ERR. > > > > Are you sure you can communicate over IB? Do pings over IPoIB work, etc.? 
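For reference, the numeric completion status in the quoted log can be decoded with a small lookup table. What follows is only a sketch: it assumes the numbering of `enum ibv_wc_status` as declared in libibverbs' `<infiniband/verbs.h>`, and both the table and the helper name `wc_status_name` are made up here for illustration. Code that already links against libibverbs would use the real enum (and, in current releases, `ibv_wc_status_str()`) rather than a private copy.

```c
#include <stddef.h>

/*
 * Names indexed by completion status value. This mirrors the order of
 * enum ibv_wc_status in <infiniband/verbs.h>; it is an illustrative copy,
 * not something exported by uDAPL or libibverbs.
 */
static const char *const wc_status_names[] = {
	"IBV_WC_SUCCESS",		/*  0 */
	"IBV_WC_LOC_LEN_ERR",		/*  1 */
	"IBV_WC_LOC_QP_OP_ERR",		/*  2 */
	"IBV_WC_LOC_EEC_OP_ERR",	/*  3 */
	"IBV_WC_LOC_PROT_ERR",		/*  4 */
	"IBV_WC_WR_FLUSH_ERR",		/*  5 */
	"IBV_WC_MW_BIND_ERR",		/*  6 */
	"IBV_WC_BAD_RESP_ERR",		/*  7 */
	"IBV_WC_LOC_ACCESS_ERR",	/*  8 */
	"IBV_WC_REM_INV_REQ_ERR",	/*  9 */
	"IBV_WC_REM_ACCESS_ERR",	/* 10 */
	"IBV_WC_REM_OP_ERR",		/* 11 */
	"IBV_WC_RETRY_EXC_ERR",		/* 12: transport retry counter exceeded */
	"IBV_WC_RNR_RETRY_EXC_ERR",	/* 13: receiver-not-ready retries exceeded */
};

/* Hypothetical helper: map a raw status number from a log to its name. */
static const char *wc_status_name(unsigned int status)
{
	size_t n = sizeof wc_status_names / sizeof wc_status_names[0];

	return status < n ? wc_status_names[status] : "unknown";
}
```

Calling `wc_status_name(12)` returns "IBV_WC_RETRY_EXC_ERR", matching the "status 12" line in the log. Values above 13 are left out of this sketch and are reported as "unknown".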
> > > > > bohra at hora-3 ~]$ ping -b 10.10.10.255 > WARNING: pinging broadcast address > PING 10.10.10.255 (10.10.10.255) 56(84) bytes of data. > 64 bytes from 10.10.10.12: icmp_seq=0 ttl=64 time=0.034 ms > 64 bytes from 10.10.10.13: icmp_seq=0 ttl=64 time=8.98 ms (DUP!) > 64 bytes from 10.10.10.12: icmp_seq=1 ttl=64 time=0.033 ms > 64 bytes from 10.10.10.13: icmp_seq=1 ttl=64 time=0.095 ms (DUP!) > 64 bytes from 10.10.10.12: icmp_seq=2 ttl=64 time=0.025 ms > 64 bytes from 10.10.10.13: icmp_seq=2 ttl=64 time=0.096 ms (DUP!) > > --- 10.10.10.255 ping statistics --- I don't see DUPs when I ping the broadcast address. Is it possible another machine is configured with the same IP address? Do you only have the one OpenIB node? > 3 packets transmitted, 3 received, +3 duplicates, 0% packet loss, time 2020ms > rtt min/avg/max/mdev = 0.025/1.544/8.986/3.328 ms, pipe 2 > [bohra at hora-3 ~]$ ifconfig ib0 > ib0 Link encap:UNSPEC HWaddr > 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 > inet addr:10.10.10.12 Bcast:10.255.255.255 Mask:255.255.255.0 > inet6 addr: fe80::202:c901:81e:7471/64 Scope:Link > UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1 > RX packets:4 errors:0 dropped:0 overruns:0 frame:0 > TX packets:77 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:128 > RX bytes:308 (308.0 b) TX bytes:4788 (4.6 KiB) > > My target is the filer, which does not respond to pings (10.10.10.11). > > Aniruddha > From halr at voltaire.com Wed Nov 2 11:02:06 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Nov 2005 14:02:06 -0500 Subject: [openib-general] Re: [OpenSM] SA database query tool In-Reply-To: References: Message-ID: <1130958126.4381.4109.camel@hal.voltaire.com> On Wed, 2005-11-02 at 13:51, James Lentini wrote: > Hal, > > Is there an existing OpenIB tool that can query an SA's database using > MADs? Specifically, I want to retrieve all of the SA's service > records. The only current way is via ibis. 
> If such a tool doesn't exist, where would you start writing one? > Would you layer it on top of libibmad? There are two approaches I can think of off the top of my head: 1. Support for this and other SA searches in the SM console 2. Be able to obtain these remotely via building on top of umad in some form (a real userspace SA client will likely be done in the not too distant future). What is the timeframe for this need ? How formal do you need something in the interim ? -- Hal From halr at voltaire.com Wed Nov 2 11:06:59 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Nov 2005 14:06:59 -0500 Subject: [openib-general] Re: uDAPL again In-Reply-To: References: <436906F0.3050803@cs.rutgers.edu> <43690C95.3050009@cs.rutgers.edu> Message-ID: <1130958417.4381.4118.camel@hal.voltaire.com> On Wed, 2005-11-02 at 14:01, James Lentini wrote: > On Wed, 2 Nov 2005, Aniruddha Bohra wrote: > > > James Lentini wrote: > > > > > On Wed, 2 Nov 2005, Aniruddha Bohra wrote: > > > > > > > > > > Hello, > > > > The following is the log for a request I am sending, > > > > > > > > The number of IOVs for req is 2. 
And the iov is shown below : > > > > > > > > REQ[0] = (0xb5f3f100, 48, 0xca88003b)^M > > > > REQ[1] = (0xb5f3f2b8, 152, 0xca88003b)^M > > > > > > > > dapl_ep_post_send (0x8087110, 2, 0x808b300, 0xb5f3f6b4, 0)^M > > > > dapl_ep_post_send : LOCALIOV[0] = (0xb5f3f100, 48, 0xca88003b)^M > > > > dapl_ep_post_send : LOCALIOV[1] = (0xb5f3f2b8, 152, 0xca88003b)^M > > > > post_snd: ep 0x8087110 op 2 ck 0x8087374 sgs 2 l_iov 0x808b300 r_iov > > > > 0xbf964290 f 0^M > > > > post_snd: ep 0x8087110 cookie 0x8087374 segs 2 l_iov 0x808b300^M > > > > post_snd_localiov: lkey 0xca88003b va 0xb5f3f100 len 48 ^M > > > > post_snd: lkey 0xca88003b va 0xb5f3f100 len 48 ^M > > > > post_snd_localiov: lkey 0xca88003b va 0xb5f3f2b8 len 152 ^M > > > > post_snd: lkey 0xca88003b va 0xb5f3f2b8 len 152 ^M > > > > post_snd: op 0x2 flags 0x2 sglist 0xbf9641b0, 2^M > > > > post_snd: returned^M > > > > dapl_ep_post_send () returns 0x0^M > > > > dapl_evd_wait (0x8083ca0, -1, 1, 0xbf9642d0, 0xbf9642cc)^M > > > > dapl_evd_wait: EVD 0x8083ca0, CQ 0x8083da0^M > > > > cq_object_wait: CQ channel 0x8081290 time -1^M > > > > cq_object_wait: RET evd 0x8083ca0 ibv_cq 0x8083da0 ibv_ctx (nil) Success^M > > > > >>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<^M > > > > dapl_evd_dto_callback : CQE ^M > > > > work_req_id 134771572^M > > > > status 12^M > > > > > > > > > > Status 12 is IBV_WC_RETRY_EXC_ERR. > > > > > > Are you sure you can communicate over IB? Do pings over IPoIB work, etc.? > > > > > > > > bohra at hora-3 ~]$ ping -b 10.10.10.255 > > WARNING: pinging broadcast address > > PING 10.10.10.255 (10.10.10.255) 56(84) bytes of data. > > 64 bytes from 10.10.10.12: icmp_seq=0 ttl=64 time=0.034 ms > > 64 bytes from 10.10.10.13: icmp_seq=0 ttl=64 time=8.98 ms (DUP!) > > 64 bytes from 10.10.10.12: icmp_seq=1 ttl=64 time=0.033 ms > > 64 bytes from 10.10.10.13: icmp_seq=1 ttl=64 time=0.095 ms (DUP!) 
> > 64 bytes from 10.10.10.12: icmp_seq=2 ttl=64 time=0.025 ms
> > 64 bytes from 10.10.10.13: icmp_seq=2 ttl=64 time=0.096 ms (DUP!)
> >
> > --- 10.10.10.255 ping statistics ---
>
>
> I don't see DUPs when I ping the broadcast address.

I get dups. This is ping behavior. When using the subnet broadcast address,
it does not distinguish that the replies come from different hosts; it only
sees that it got multiple replies for a single request. It may depend on the
version of ping.

-- Hal

> Is it possible
> another machine is configured with the same IP address?
>
> Do you only have the one OpenIB node?
>
> > 3 packets transmitted, 3 received, +3 duplicates, 0% packet loss, time 2020ms
> > rtt min/avg/max/mdev = 0.025/1.544/8.986/3.328 ms, pipe 2
> > [bohra at hora-3 ~]$ ifconfig ib0
> > ib0 Link encap:UNSPEC HWaddr
> > 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
> > inet addr:10.10.10.12 Bcast:10.255.255.255 Mask:255.255.255.0
> > inet6 addr: fe80::202:c901:81e:7471/64 Scope:Link
> > UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
> > RX packets:4 errors:0 dropped:0 overruns:0 frame:0
> > TX packets:77 errors:0 dropped:0 overruns:0 carrier:0
> > collisions:0 txqueuelen:128
> > RX bytes:308 (308.0 b) TX bytes:4788 (4.6 KiB)
> >
> > My target is the filer, which does not respond to pings (10.10.10.11).
> > > > Aniruddha > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From bohra at cs.rutgers.edu Wed Nov 2 11:16:12 2005 From: bohra at cs.rutgers.edu (Aniruddha Bohra) Date: Wed, 02 Nov 2005 14:16:12 -0500 Subject: [openib-general] Re: uDAPL again In-Reply-To: References: <436906F0.3050803@cs.rutgers.edu> <43690C95.3050009@cs.rutgers.edu> Message-ID: <4369107C.2070203@cs.rutgers.edu> James Lentini wrote: >On Wed, 2 Nov 2005, Aniruddha Bohra wrote: > > > >>James Lentini wrote: >> >> >> >>>On Wed, 2 Nov 2005, Aniruddha Bohra wrote: >>> >>> >>> >>> >>>>Hello, >>>> The following is the log for a request I am sending, >>>> >>>>The number of IOVs for req is 2. And the iov is shown below : >>>> >>>>REQ[0] = (0xb5f3f100, 48, 0xca88003b)^M >>>>REQ[1] = (0xb5f3f2b8, 152, 0xca88003b)^M >>>> >>>>dapl_ep_post_send (0x8087110, 2, 0x808b300, 0xb5f3f6b4, 0)^M >>>>dapl_ep_post_send : LOCALIOV[0] = (0xb5f3f100, 48, 0xca88003b)^M >>>>dapl_ep_post_send : LOCALIOV[1] = (0xb5f3f2b8, 152, 0xca88003b)^M >>>>post_snd: ep 0x8087110 op 2 ck 0x8087374 sgs 2 l_iov 0x808b300 r_iov >>>>0xbf964290 f 0^M >>>>post_snd: ep 0x8087110 cookie 0x8087374 segs 2 l_iov 0x808b300^M >>>>post_snd_localiov: lkey 0xca88003b va 0xb5f3f100 len 48 ^M >>>>post_snd: lkey 0xca88003b va 0xb5f3f100 len 48 ^M >>>>post_snd_localiov: lkey 0xca88003b va 0xb5f3f2b8 len 152 ^M >>>>post_snd: lkey 0xca88003b va 0xb5f3f2b8 len 152 ^M >>>>post_snd: op 0x2 flags 0x2 sglist 0xbf9641b0, 2^M >>>>post_snd: returned^M >>>>dapl_ep_post_send () returns 0x0^M >>>>dapl_evd_wait (0x8083ca0, -1, 1, 0xbf9642d0, 0xbf9642cc)^M >>>>dapl_evd_wait: EVD 0x8083ca0, CQ 0x8083da0^M >>>>cq_object_wait: CQ channel 0x8081290 time -1^M >>>>cq_object_wait: RET evd 0x8083ca0 ibv_cq 0x8083da0 ibv_ctx (nil) Success^M >>>> 
>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<^M >>>> dapl_evd_dto_callback : CQE ^M >>>> work_req_id 134771572^M >>>> status 12^M >>>> >>>> >>>> >>>Status 12 is IBV_WC_RETRY_EXC_ERR. >>> >>>Are you sure you can communicate over IB? Do pings over IPoIB work, etc.? >>> >>> >>> >>> >>bohra at hora-3 ~]$ ping -b 10.10.10.255 >>WARNING: pinging broadcast address >>PING 10.10.10.255 (10.10.10.255) 56(84) bytes of data. >>64 bytes from 10.10.10.12: icmp_seq=0 ttl=64 time=0.034 ms >>64 bytes from 10.10.10.13: icmp_seq=0 ttl=64 time=8.98 ms (DUP!) >>64 bytes from 10.10.10.12: icmp_seq=1 ttl=64 time=0.033 ms >>64 bytes from 10.10.10.13: icmp_seq=1 ttl=64 time=0.095 ms (DUP!) >>64 bytes from 10.10.10.12: icmp_seq=2 ttl=64 time=0.025 ms >>64 bytes from 10.10.10.13: icmp_seq=2 ttl=64 time=0.096 ms (DUP!) >> >>--- 10.10.10.255 ping statistics --- >> >> > > >I don't see DUPs when I ping the broadcast address. Is it possible >another machine is configured with the same IP address? > > > No. There are 3 nodes on the switch. Two openib nodes. I can login to the .13 node through the 10.10.10 subnet using IPoIB. bohra at hora-3 ~]$ ping 10.10.10.13 PING 10.10.10.13 (10.10.10.13) 56(84) bytes of data. 
64 bytes from 10.10.10.13: icmp_seq=0 ttl=64 time=0.112 ms
64 bytes from 10.10.10.13: icmp_seq=1 ttl=64 time=0.063 ms
64 bytes from 10.10.10.13: icmp_seq=2 ttl=64 time=0.073 ms
64 bytes from 10.10.10.13: icmp_seq=3 ttl=64 time=0.052 ms
64 bytes from 10.10.10.13: icmp_seq=4 ttl=64 time=0.069 ms
64 bytes from 10.10.10.13: icmp_seq=5 ttl=64 time=0.055 ms
64 bytes from 10.10.10.13: icmp_seq=6 ttl=64 time=0.076 ms
64 bytes from 10.10.10.13: icmp_seq=7 ttl=64 time=0.052 ms
64 bytes from 10.10.10.13: icmp_seq=8 ttl=64 time=0.070 ms

Aniruddha

From ardavis at ichips.intel.com Wed Nov 2 12:02:57 2005
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Wed, 02 Nov 2005 12:02:57 -0800
Subject: [openib-general] Re: uDAPL again
In-Reply-To: <436906F0.3050803@cs.rutgers.edu>
References: <436906F0.3050803@cs.rutgers.edu>
Message-ID: <43691B71.2040500@ichips.intel.com>

Aniruddha Bohra wrote:

> cq_object_wait: RET evd 0x8083ca0 ibv_cq 0x8083da0 ibv_ctx (nil)
> Success
> >>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<
> dapl_evd_dto_callback : CQE
> work_req_id 134771572
> status 12
> >>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<
> DTO completion ERROR: 12: op 0xff
> disconnect(ep 0x8087110, conn 0x808a008, id 134774528 flags 0)
> destroy_cm_id: conn 0x808a008 id 134774528
> dapli_evd_post_event: Called with event # 4006
>
>
> Any ideas how to proceed to even debug this ?

Are you using the uDAPL provider with socket CM (VERBS=openib_scm) or
the default one that uses uCM and uAT? For the socket_CM version the
timeout is set to 14 (~67ms) and the retries are set to 7, so the
receiving node would have to be delayed beyond ~469ms to get this
failure. For the default uCM/uAT version the retries are set to 7 and
the timeout is set to pktlifetime+1, so you would have to look at the
path-record for the timeout value for the connection.

Can you successfully run the IB verbs ibv_rc_pingpong test suite?
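As a rough check of the socket CM figures above (timeout 14, 7 retries), the sketch below assumes the usual InfiniBand encoding of the local ACK timeout, 4.096 usec * 2^timeout, and multiplies by the retry count the same way the paragraph above does. The helper names `ack_timeout_ns` and `retry_budget_ns` are hypothetical and are not part of uDAPL or libibverbs. Real HCAs may round the interval up, so the result is best read as a lower bound.

```c
#include <stdint.h>

/*
 * Local ACK timeout interval in nanoseconds, assuming the InfiniBand
 * encoding interval = 4.096 usec * 2^timeout. Meaningful for timeout
 * values 1..31; a value of 0 conventionally means "no timeout" and is
 * not handled here.
 */
static uint64_t ack_timeout_ns(unsigned int timeout)
{
	return 4096ULL << timeout;
}

/*
 * Approximate delay needed before the sender gives up, using the same
 * simplification as the message: per-attempt timeout times retry count.
 */
static uint64_t retry_budget_ns(unsigned int timeout, unsigned int retries)
{
	return ack_timeout_ns(timeout) * retries;
}
```

With these assumptions, `ack_timeout_ns(14)` is 67108864 ns (about 67 ms) and `retry_budget_ns(14, 7)` is 469762048 ns (about 469 ms), which agrees with the numbers quoted for the socket CM provider.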
Anything special about your fabric configuration that could induce this
kind of latency? Something on the fabric or in your remote system is
delaying ACKs beyond your total timeout/retry times.

If you had no buffers posted or attempted to send to unregistered memory
you would get different errors.

-arlin

>
> Thanks
> Aniruddha
>

From jerome.pioux at bull.com Wed Nov 2 12:31:16 2005
From: jerome.pioux at bull.com (Jerome Pioux)
Date: Wed, 2 Nov 2005 13:31:16 -0700
Subject: [openib-general] Problems with SDP on Itanium
References: <4368EC0D.1000009@ichips.intel.com> <20051102184240.GJ28222@esmail.cup.hp.com>
Message-ID: <000f01c5dfec$619ad4a0$0211708d@gpv.az05.bull.com>

I am running SDP on rev 3882 on ia64 (modified RHEL4 - 2.6.12 kernel).
I do not run NetPIPE but TTCP with options "-l 1048576 -b 1048576", which I
think means 1M.
I just tried a run with "-l 2097152 -b 2097152", which would then mean 2M, and
it seems to run okay:

ttcp-t: buflen=2097152, nbuf=45000, align=16384/0, port=5001, sockbufsize=2097152 tcp -> 192.168.0.100
ttcp-t: 94371840000 bytes in 148.11 real seconds = 607.65 MB/sec +++
ttcp-t: 45000 I/O calls, msec/call = 3.37, calls/sec = 303.82
ttcp-t: user: 41016 sys: 36393573 total: 36434589 real: 148112374

But maybe TTCP does not use message sizes > 1M even with these options set?

Jerome

----- Original Message -----
From: "Grant Grundler"
To: "Bob Woodruff"
Cc: 
Sent: Wednesday, November 02, 2005 11:42 AM
Subject: Re: [openib-general] Problems with SDP on Itanium

> On Wed, Nov 02, 2005 at 09:27:26AM -0800, Bob Woodruff wrote:
>> Has anyone tried using SDP on Itanium ?
>
> Yes - but it's been 5-6 weeks since I have tried it (SVN r3547).
>
>> I was trying to run a NetPIPE over SDP (svn Rev 3882).
>> It seems to run fine for small transfers, but
>> the applications hangs when it gets to > 1 Megabyte
>> transfers.
>
> I haven't tested message sizes > 128KB.
> I'll include 256/512/1024/2048 KB message sizes in the next round.
> > And I still owe Michael some investigation results from the > last round were perf dropped off to near zero for medium > sized messages. > > grant > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From bohra at cs.rutgers.edu Wed Nov 2 12:44:22 2005 From: bohra at cs.rutgers.edu (Aniruddha Bohra) Date: Wed, 02 Nov 2005 15:44:22 -0500 Subject: [openib-general] Re: uDAPL again In-Reply-To: <43691B71.2040500@ichips.intel.com> References: <436906F0.3050803@cs.rutgers.edu> <43691B71.2040500@ichips.intel.com> Message-ID: <43692526.3030003@cs.rutgers.edu> Arlin Davis wrote: > Aniruddha Bohra wrote: > >> cq_object_wait: RET evd 0x8083ca0 ibv_cq 0x8083da0 ibv_ctx (nil) >> Success^M >> >>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<^M >> dapl_evd_dto_callback : CQE ^M >> work_req_id 134771572^M >> status 12^M >> >>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<^M >> DTO completion ERROR: 12: op 0xff^M >> disconnect(ep 0x8087110, conn 0x808a008, id 134774528 flags 0)^M >> destroy_cm_id: conn 0x808a008 id 134774528^M >> dapli_evd_post_event: Called with event # 4006^M >> >> >> Any ideas how to proceed to even debug this ? > > > > Are you using the uDAPL provider with socket CM (VERBS=openib_scm) or > the default one that use's uCM and uAT? For the socket_CM version the > timeout is set to 14 (~67ms) and the retries are set to 7 so the > receiving node would have to be delayed beyond ~469ms to get this > failure. For the default uCM/uAT version the retries are set to 7 and > the timeout is set to pktlifetime+1 so you would have to look at the > path-record for the timeout value for the connection. > I am using the default one. Actually, even the dapl_ep_connect() takes a long time. I am not sure, but arent uCM and uAT simply for connection establishment? 
> Can you successfully run the IB verbs ibv_rc_pingpong test suite? Between the two OpenIB nodes, I can run the ibv_rc_pingpong. > Anything special about your fabric configuration that could induce > this kind of latencies? Something on the fabric or in your remote > system is delaying ACK's beyond your total timeout/retry times. It has 3 machines on the switch : one is a netapp filer, which might be the source of the problem :( > > If you had no buffers posted or attempted to send to unregistered > memory you would get different errors. This is good, as it seems my code is trying to DTRT :) Thanks Aniruddha From jlentini at netapp.com Wed Nov 2 12:57:45 2005 From: jlentini at netapp.com (James Lentini) Date: Wed, 2 Nov 2005 15:57:45 -0500 (EST) Subject: [openib-general] Re: [OpenSM] SA database query tool In-Reply-To: <1130958126.4381.4109.camel@hal.voltaire.com> References: <1130958126.4381.4109.camel@hal.voltaire.com> Message-ID: On Wed, 2 Nov 2005, Hal Rosenstock wrote: > On Wed, 2005-11-02 at 13:51, James Lentini wrote: > > Hal, > > > > Is there an existing OpenIB tool that can query an SA's database using > > MADs? Specifically, I want to retrieve all of the SA's service > > records. > > The only current way is via ibis. Is ibis in the OpenIB tree? I've seen it somewhere, but I can't remember where. > > If such a tool doesn't exist, where would you start writing one? > > Would you layer it on top of libibmad? > > There are two approaches I can think of off the top of my head: > > 1. Support for this and other SA searches in the SM console > 2. Be able to obtain these remotely via building on top of umad in some > form (a real userspace SA client will likely be done in the not too > distant future). This sounds like a good approach since the tool could be used with SM's other than OpenSM. Is there an application in the OpenIB tree that is a good example of how to use the umad library? > What is the timeframe for this need ? 
I'm thinking of debugging tools that would be useful for me at SC05.

> How formal do you need something in the interim ?

I don't need anything formal.

From eitan at mellanox.co.il Wed Nov 2 13:05:01 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 2 Nov 2005 23:05:01 +0200
Subject: [openib-general] Re: [OpenSM] SA database query tool
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3618896@mtlexch01.mtl.com>

Hi James,

Assuming you are asking for remote access to the data through the IB network
(In-Band); this will actually also work in loopback if run on the same IB port:

If you need the query tool for scripting, ibis is your best choice.
A quick example of what you need to do to get all NodeInfoRecords:
ibis -port_num 1 sacNodeQuery getTable 0 # the zero is the comp mask.

You could also read ibis.c and osm_vendor_sa_api.h to see how one can
implement that in C.
You could also use umad directly, or decide to rewrite the existing software.

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Wednesday, November 02, 2005 9:02 PM
> To: James Lentini
> Cc: openib-general
> Subject: [openib-general] Re: [OpenSM] SA database query tool
>
> On Wed, 2005-11-02 at 13:51, James Lentini wrote:
> > Hal,
> >
> > Is there an existing OpenIB tool that can query an SA's database using
> > MADs? Specifically, I want to retrieve all of the SA's service
> > records.
>
> The only current way is via ibis.
>
> > If such a tool doesn't exist, where would you start writing one?
> > Would you layer it on top of libibmad?
>
> There are two approaches I can think of off the top of my head:
>
> 1. Support for this and other SA searches in the SM console
> 2.
Be able to obtain these remotely via building on top of umad in some > form (a real userspace SA client will likely be done in the not too > distant future). > > What is the timeframe for this need ? How formal do you need something > in the interim ? > > -- Hal > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From eitan at mellanox.co.il Wed Nov 2 13:08:53 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 2 Nov 2005 23:08:53 +0200 Subject: [openib-general] Re: [OpenSM] SA database query tool Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3618897@mtlexch01.mtl.com> Hi Again, Ibis is currently under: https://openib.org/svn/gen2/utils/src/linux-user/ibis A doc regarding how to write SA client queries is available in the file: doc/ibis_wrap.html If you will need more info or examples I will be happy to provide them. Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: James Lentini [mailto:jlentini at netapp.com] > Sent: Wednesday, November 02, 2005 10:58 PM > To: Hal Rosenstock > Cc: openib-general > Subject: [openib-general] Re: [OpenSM] SA database query tool > > > > On Wed, 2 Nov 2005, Hal Rosenstock wrote: > > > On Wed, 2005-11-02 at 13:51, James Lentini wrote: > > > Hal, > > > > > > Is there an existing OpenIB tool that can query an SA's database using > > > MADs? Specifically, I want to retrieve all of the SA's service > > > records. > > > > The only current way is via ibis. > > Is ibis in the OpenIB tree? I've seen it somewhere, but I can't > remember where. > > > > If such a tool doesn't exist, where would you start writing one? 
> > > Would you layer it on top of libibmad? > > > > There are two approaches I can think of off the top of my head: > > > > 1. Support for this and other SA searches in the SM console > > 2. Be able to obtain these remotely via building on top of umad in some > > form (a real userspace SA client will likely be done in the not too > > distant future). > > This sounds like a good approach since the tool could be used with > SM's other than OpenSM. > > Is there an application in the OpenIB tree that is a good example > of how to use the umad library? > > > What is the timeframe for this need ? > > I'm thinking of debugging tools that would be useful for me at SC05. > > > How formal do you need something in the interim ? > > I don't need anything formal. > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Wed Nov 2 13:11:15 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Nov 2005 16:11:15 -0500 Subject: [openib-general] Re: [OpenSM] SA database query tool In-Reply-To: References: <1130958126.4381.4109.camel@hal.voltaire.com> Message-ID: <1130965874.4381.4249.camel@hal.voltaire.com> On Wed, 2005-11-02 at 15:57, James Lentini wrote: > On Wed, 2 Nov 2005, Hal Rosenstock wrote: > > > On Wed, 2005-11-02 at 13:51, James Lentini wrote: > > > Hal, > > > > > > Is there an existing OpenIB tool that can query an SA's database using > > > MADs? Specifically, I want to retrieve all of the SA's service > > > records. > > > > The only current way is via ibis. > > Is ibis in the OpenIB tree? I've seen it somewhere, but I can't > remember where. > > > > If such a tool doesn't exist, where would you start writing one? > > > Would you layer it on top of libibmad? 
> > > > There are two approaches I can think of off the top of my head: > > > > 1. Support for this and other SA searches in the SM console > > 2. Be able to obtain these remotely via building on top of umad in some > > form (a real userspace SA client will likely be done in the not too > > distant future). > > This sounds like a good approach since the tool could be used with > SM's other than OpenSM. > > Is there an application in the OpenIB tree that is a good example > of how to use the umad library? As I said there is likely a real SA client that will be developed. In the short term, you can use some diag as an example but these are SMP rather than GMP based (except for perfquery). There is some SA infrastructure in place but I'm not sure how well it works. Would you be using RMPP too as little has exercised it to date ? There's sa_call and just an ib_path_query right now (in libibmad/src/sa.c). A service query could be easily added. RMPP is not supported yet at this level. > > What is the timeframe for this need ? > > I'm thinking of debugging tools that would be useful for me at SC05. I was planning on using ibis at SC05 if this was needed. -- Hal > > How formal do you need something in the interim ? > > I don't need anything formal. From rolandd at cisco.com Wed Nov 2 13:34:24 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 02 Nov 2005 13:34:24 -0800 Subject: [openib-general] [PATCH/RFC v2] IB: Add SCSI RDMA Protocol (SRP) initiator In-Reply-To: <20051101002811.GD3107@esmail.cup.hp.com> (Grant Grundler's message of "Mon, 31 Oct 2005 16:28:11 -0800") References: <52wtjtk3d1.fsf@cisco.com> <20051101002811.GD3107@esmail.cup.hp.com> Message-ID: <52r79y91jz.fsf_-_@cisco.com> Here is an updated version of the patch to add an IB SRP initiator. I've incorporated Grant's suggestions, and also split off the SRP structures and constants into so that we can move ibmvscsi to sharing the same header file. 
Are there any more suggestions before I ask Linus to pull this code?
Positive votes and/or vetoes are also appreciated.

Thanks,
  Roland

Subject: [PATCH] IB: Add SCSI RDMA Protocol (SRP) initiator

Add an InfiniBand SCSI RDMA Protocol (SRP) initiator. This driver is
used to talk to InfiniBand SRP targets (storage devices).

Signed-off-by: Roland Dreier

---

 drivers/infiniband/Kconfig          |    2
 drivers/infiniband/Makefile         |    1
 drivers/infiniband/ulp/srp/Kbuild   |    1
 drivers/infiniband/ulp/srp/Kconfig  |   11
 drivers/infiniband/ulp/srp/ib_srp.c | 1696 +++++++++++++++++++++++++++++++++++
 drivers/infiniband/ulp/srp/ib_srp.h |  150 +++
 include/scsi/srp.h                  |  226 +++++
 7 files changed, 2087 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/srp/Kbuild
 create mode 100644 drivers/infiniband/ulp/srp/Kconfig
 create mode 100644 drivers/infiniband/ulp/srp/ib_srp.c
 create mode 100644 drivers/infiniband/ulp/srp/ib_srp.h
 create mode 100644 include/scsi/srp.h

applies-to: d918cd1ba0ef9afa692cef281afee2f6d6634a1e
c449fd3cc9e2194757c866cb1973fb98975331c8
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 325d502..bdf0891 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -33,4 +33,6 @@ source "drivers/infiniband/hw/mthca/Kcon
 
 source "drivers/infiniband/ulp/ipoib/Kconfig"
 
+source "drivers/infiniband/ulp/srp/Kconfig"
+
 endmenu
diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile
index d256cf7..a43fb34 100644
--- a/drivers/infiniband/Makefile
+++ b/drivers/infiniband/Makefile
@@ -1,3 +1,4 @@
 obj-$(CONFIG_INFINIBAND) += core/
 obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/
 obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/
+obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/
diff --git a/drivers/infiniband/ulp/srp/Kbuild b/drivers/infiniband/ulp/srp/Kbuild
new file mode 100644
index 0000000..a16c73c
--- /dev/null
+++ b/drivers/infiniband/ulp/srp/Kbuild
@@ -0,0 +1 @@
+obj-$(CONFIG_INFINIBAND_SRP) += ib_srp.o
diff --git
a/drivers/infiniband/ulp/srp/Kconfig b/drivers/infiniband/ulp/srp/Kconfig new file mode 100644 index 0000000..8fe3be4 --- /dev/null +++ b/drivers/infiniband/ulp/srp/Kconfig @@ -0,0 +1,11 @@ +config INFINIBAND_SRP + tristate "InfiniBand SCSI RDMA Protocol" + depends on INFINIBAND && SCSI + ---help--- + Support for the SCSI RDMA Protocol over InfiniBand. This + allows you to access storage devices that speak SRP over + InfiniBand. + + The SRP protocol is defined by the INCITS T10 technical + committee. See . + diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c new file mode 100644 index 0000000..502635a --- /dev/null +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -0,0 +1,1696 @@ +/* + * Copyright (c) 2005 Cisco Systems. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_srp.c 3932 2005-11-01 17:19:29Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include +#include +#include + +#include + +#include "ib_srp.h" + +#define DRV_NAME "ib_srp" +#define PFX DRV_NAME ": " +#define DRV_VERSION "0.2" +#define DRV_RELDATE "November 1, 2005" + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("InfiniBand SCSI RDMA Protocol initiator " + "v" DRV_VERSION " (" DRV_RELDATE ")"); +MODULE_LICENSE("Dual BSD/GPL"); + +static int topspin_workarounds = 1; + +module_param(topspin_workarounds, int, 0444); +MODULE_PARM_DESC(topspin_workarounds, + "Enable workarounds for Topspin/Cisco SRP target bugs if != 0"); + +static const u8 topspin_oui[3] = { 0x00, 0x05, 0xad }; + +static void srp_add_one(struct ib_device *device); +static void srp_remove_one(struct ib_device *device); +static void srp_completion(struct ib_cq *cq, void *target_ptr); +static int srp_cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event); + +static struct ib_client srp_client = { + .name = "srp", + .add = srp_add_one, + .remove = srp_remove_one +}; + +static inline struct srp_target_port *host_to_target(struct Scsi_Host *host) +{ + return (struct srp_target_port *) host->hostdata; +} + +static const char *srp_target_info(struct Scsi_Host *host) +{ + return host_to_target(host)->target_name; +} + +static struct srp_iu *srp_alloc_iu(struct srp_host *host, size_t size, + gfp_t gfp_mask, + enum dma_data_direction direction) +{ + struct srp_iu *iu; + + iu = kmalloc(sizeof *iu, gfp_mask); + if (!iu) + goto out; + + iu->buf = kzalloc(size, gfp_mask); + if (!iu->buf) + goto out_free_iu; + + iu->dma = 
dma_map_single(host->dev->dma_device, iu->buf, size, direction); + if (dma_mapping_error(iu->dma)) + goto out_free_buf; + + iu->size = size; + iu->direction = direction; + + return iu; + +out_free_buf: + kfree(iu->buf); +out_free_iu: + kfree(iu); +out: + return NULL; +} + +static void srp_free_iu(struct srp_host *host, struct srp_iu *iu) +{ + if (!iu) + return; + + dma_unmap_single(host->dev->dma_device, iu->dma, iu->size, iu->direction); + kfree(iu->buf); + kfree(iu); +} + +static void srp_qp_event(struct ib_event *event, void *context) +{ + printk(KERN_ERR PFX "QP event %d\n", event->event); +} + +static int srp_init_qp(struct srp_target_port *target, + struct ib_qp *qp) +{ + struct ib_qp_attr *attr; + int ret; + + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (!attr) + return -ENOMEM; + + ret = ib_find_cached_pkey(target->srp_host->dev, + target->srp_host->port, + be16_to_cpu(target->path.pkey), + &attr->pkey_index); + if (ret) + return ret; + + attr->qp_state = IB_QPS_INIT; + attr->qp_access_flags = (IB_ACCESS_REMOTE_READ | + IB_ACCESS_REMOTE_WRITE); + attr->port_num = target->srp_host->port; + + return ib_modify_qp(qp, attr, + IB_QP_STATE | + IB_QP_PKEY_INDEX | + IB_QP_ACCESS_FLAGS | + IB_QP_PORT); +} + +static int srp_create_target_ib(struct srp_target_port *target) +{ + struct ib_qp_init_attr *init_attr; + int ret; + + init_attr = kzalloc(sizeof *init_attr, GFP_KERNEL); + if (!init_attr) + return -ENOMEM; + + target->cq = ib_create_cq(target->srp_host->dev, srp_completion, + NULL, target, SRP_CQ_SIZE); + if (IS_ERR(target->cq)) { + ret = PTR_ERR(target->cq); + goto out; + } + + ib_req_notify_cq(target->cq, IB_CQ_NEXT_COMP); + + init_attr->event_handler = srp_qp_event; + init_attr->cap.max_send_wr = SRP_SQ_SIZE; + init_attr->cap.max_recv_wr = SRP_RQ_SIZE; + init_attr->cap.max_recv_sge = 1; + init_attr->cap.max_send_sge = 1; + init_attr->sq_sig_type = IB_SIGNAL_ALL_WR; + init_attr->qp_type = IB_QPT_RC; + init_attr->send_cq = target->cq; + init_attr->recv_cq = 
target->cq; + + target->qp = ib_create_qp(target->srp_host->pd, init_attr); + if (IS_ERR(target->qp)) { + ret = PTR_ERR(target->qp); + ib_destroy_cq(target->cq); + goto out; + } + + ret = srp_init_qp(target, target->qp); + if (ret) { + ib_destroy_qp(target->qp); + ib_destroy_cq(target->cq); + goto out; + } + +out: + kfree(init_attr); + return ret; +} + +static void srp_free_target_ib(struct srp_target_port *target) +{ + int i; + + ib_destroy_qp(target->qp); + ib_destroy_cq(target->cq); + + for (i = 0; i < SRP_RQ_SIZE; ++i) + srp_free_iu(target->srp_host, target->rx_ring[i]); + for (i = 0; i < SRP_SQ_SIZE + 1; ++i) + srp_free_iu(target->srp_host, target->tx_ring[i]); +} + +static void srp_path_rec_completion(int status, + struct ib_sa_path_rec *pathrec, + void *target_ptr) +{ + struct srp_target_port *target = target_ptr; + + target->status = status; + if (status) + printk(KERN_ERR PFX "Got failed path rec status %d\n", status); + else + target->path = *pathrec; + complete(&target->done); +} + +static int srp_lookup_path(struct srp_target_port *target) +{ + target->path.numb_path = 1; + + init_completion(&target->done); + + target->path_query_id = ib_sa_path_rec_get(target->srp_host->dev, + target->srp_host->port, + &target->path, + IB_SA_PATH_REC_DGID | + IB_SA_PATH_REC_SGID | + IB_SA_PATH_REC_NUMB_PATH | + IB_SA_PATH_REC_PKEY, + SRP_PATH_REC_TIMEOUT_MS, + GFP_KERNEL, + srp_path_rec_completion, + target, &target->path_query); + if (target->path_query_id < 0) + return target->path_query_id; + + wait_for_completion(&target->done); + + if (target->status < 0) + printk(KERN_WARNING PFX "Path record query failed\n"); + + return target->status; +} + +static int srp_send_req(struct srp_target_port *target) +{ + struct { + struct ib_cm_req_param param; + struct srp_login_req priv; + } *req = NULL; + int status; + + req = kzalloc(sizeof *req, GFP_KERNEL); + if (!req) + return -ENOMEM; + + req->param.primary_path = &target->path; + req->param.alternate_path = NULL; + 
req->param.service_id = target->service_id; + req->param.qp_num = target->qp->qp_num; + req->param.qp_type = target->qp->qp_type; + req->param.private_data = &req->priv; + req->param.private_data_len = sizeof req->priv; + req->param.flow_control = 1; + + get_random_bytes(&req->param.starting_psn, 4); + req->param.starting_psn &= 0xffffff; + + /* + * Pick some arbitrary defaults here; we could make these + * module parameters if anyone cared about setting them. + */ + req->param.responder_resources = 4; + req->param.remote_cm_response_timeout = 20; + req->param.local_cm_response_timeout = 20; + req->param.retry_count = 7; + req->param.rnr_retry_count = 7; + req->param.max_cm_retries = 15; + + req->priv.opcode = SRP_LOGIN_REQ; + req->priv.tag = 0; + req->priv.req_it_iu_len = cpu_to_be32(SRP_MAX_IU_LEN); + req->priv.req_buf_fmt = cpu_to_be16(SRP_BUF_FORMAT_DIRECT | + SRP_BUF_FORMAT_INDIRECT); + memcpy(req->priv.initiator_port_id, target->srp_host->initiator_port_id, 16); + /* + * Topspin/Cisco SRP targets will reject our login unless we + * zero out the first 8 bytes of our initiator port ID. The + * second 8 bytes must be our local node GUID, but we always + * use that anyway. 
+ */ + if (topspin_workarounds && !memcmp(&target->ioc_guid, topspin_oui, 3)) { + printk(KERN_DEBUG PFX "Topspin/Cisco initiator port ID workaround " + "activated for target GUID %016llx\n", + (unsigned long long) be64_to_cpu(target->ioc_guid)); + memset(req->priv.initiator_port_id, 0, 8); + } + memcpy(req->priv.target_port_id, &target->id_ext, 8); + memcpy(req->priv.target_port_id + 8, &target->ioc_guid, 8); + + status = ib_send_cm_req(target->cm_id, &req->param); + + kfree(req); + + return status; +} + +static void srp_disconnect_target(struct srp_target_port *target) +{ + /* XXX should send SRP_I_LOGOUT request */ + + init_completion(&target->done); + ib_send_cm_dreq(target->cm_id, NULL, 0); + wait_for_completion(&target->done); +} + +static void srp_remove_work(void *target_ptr) +{ + struct srp_target_port *target = target_ptr; + + spin_lock_irq(target->scsi_host->host_lock); + if (target->state != SRP_TARGET_DEAD) { + spin_unlock_irq(target->scsi_host->host_lock); + scsi_host_put(target->scsi_host); + return; + } + target->state = SRP_TARGET_REMOVED; + spin_unlock_irq(target->scsi_host->host_lock); + + down(&target->srp_host->target_mutex); + list_del(&target->list); + up(&target->srp_host->target_mutex); + + scsi_remove_host(target->scsi_host); + ib_destroy_cm_id(target->cm_id); + srp_free_target_ib(target); + scsi_host_put(target->scsi_host); + /* And another put to really free the target port... */ + scsi_host_put(target->scsi_host); +} + +static int srp_connect_target(struct srp_target_port *target) +{ + int ret; + + ret = srp_lookup_path(target); + if (ret) + return ret; + + while (1) { + init_completion(&target->done); + ret = srp_send_req(target); + if (ret) + return ret; + wait_for_completion(&target->done); + + /* + * The CM event handling code will set status to + * SRP_PORT_REDIRECT if we get a port redirect REJ + * back, or SRP_DLID_REDIRECT if we get a lid/qp + * redirect REJ back. 
+ */ + switch (target->status) { + case 0: + return 0; + + case SRP_PORT_REDIRECT: + ret = srp_lookup_path(target); + if (ret) + return ret; + break; + + case SRP_DLID_REDIRECT: + break; + + default: + return target->status; + } + } +} + +static int srp_reconnect_target(struct srp_target_port *target) +{ + struct ib_cm_id *new_cm_id; + struct ib_qp_attr qp_attr; + struct srp_request *req; + struct ib_wc wc; + int ret; + int i; + + spin_lock_irq(target->scsi_host->host_lock); + if (target->state != SRP_TARGET_LIVE) { + spin_unlock_irq(target->scsi_host->host_lock); + return -EAGAIN; + } + target->state = SRP_TARGET_CONNECTING; + spin_unlock_irq(target->scsi_host->host_lock); + + srp_disconnect_target(target); + /* + * Now get a new local CM ID so that we avoid confusing the + * target in case things are really fouled up. + */ + new_cm_id = ib_create_cm_id(target->srp_host->dev, + srp_cm_handler, target); + if (IS_ERR(new_cm_id)) { + ret = PTR_ERR(new_cm_id); + goto err; + } + ib_destroy_cm_id(target->cm_id); + target->cm_id = new_cm_id; + + qp_attr.qp_state = IB_QPS_RESET; + ret = ib_modify_qp(target->qp, &qp_attr, IB_QP_STATE); + if (ret) + goto err; + + ret = srp_init_qp(target, target->qp); + if (ret) + goto err; + + while (ib_poll_cq(target->cq, 1, &wc) > 0) + ; /* nothing */ + + list_for_each_entry(req, &target->req_queue, list) { + req->scmnd->result = DID_RESET << 16; + req->scmnd->scsi_done(req->scmnd); + } + + target->rx_head = 0; + target->tx_head = 0; + target->tx_tail = 0; + target->req_head = 0; + for (i = 0; i < SRP_SQ_SIZE - 1; ++i) + target->req_ring[i].next = i + 1; + target->req_ring[SRP_SQ_SIZE - 1].next = -1; + INIT_LIST_HEAD(&target->req_queue); + + ret = srp_connect_target(target); + if (ret) + goto err; + + spin_lock_irq(target->scsi_host->host_lock); + if (target->state == SRP_TARGET_CONNECTING) { + ret = 0; + target->state = SRP_TARGET_LIVE; + } else + ret = -EAGAIN; + spin_unlock_irq(target->scsi_host->host_lock); + + return ret; + +err: + 
printk(KERN_ERR PFX "reconnect failed (%d), removing target port.\n", ret); + + /* + * We couldn't reconnect, so kill our target port off. + * However, we have to defer the real removal because we might + * be in the context of the SCSI error handler now, which + * would deadlock if we call scsi_remove_host(). + */ + spin_lock_irq(target->scsi_host->host_lock); + if (target->state == SRP_TARGET_CONNECTING) { + target->state = SRP_TARGET_DEAD; + INIT_WORK(&target->work, srp_remove_work, target); + schedule_work(&target->work); + } + spin_unlock_irq(target->scsi_host->host_lock); + + return ret; +} + +static int srp_map_data(struct scsi_cmnd *scmnd, struct srp_target_port *target, + struct srp_request *req) +{ + struct srp_cmd *cmd = req->cmd->buf; + int len; + u8 fmt; + + if (!scmnd->request_buffer || scmnd->sc_data_direction == DMA_NONE) + return sizeof (struct srp_cmd); + + if (scmnd->sc_data_direction != DMA_FROM_DEVICE && + scmnd->sc_data_direction != DMA_TO_DEVICE) { + printk(KERN_WARNING PFX "Unhandled data direction %d\n", + scmnd->sc_data_direction); + return -EINVAL; + } + + if (scmnd->use_sg) { + struct scatterlist *scat = scmnd->request_buffer; + int n; + int i; + + n = dma_map_sg(target->srp_host->dev->dma_device, + scat, scmnd->use_sg, scmnd->sc_data_direction); + + if (n == 1) { + struct srp_direct_buf *buf = (void *) cmd->add_data; + + fmt = SRP_DATA_DESC_DIRECT; + + buf->va = cpu_to_be64(sg_dma_address(scat)); + buf->key = cpu_to_be32(target->srp_host->mr->rkey); + buf->len = cpu_to_be32(sg_dma_len(scat)); + + len = sizeof (struct srp_cmd) + + sizeof (struct srp_direct_buf); + } else { + struct srp_indirect_buf *buf = (void *) cmd->add_data; + u32 datalen = 0; + + fmt = SRP_DATA_DESC_INDIRECT; + + if (scmnd->sc_data_direction == DMA_TO_DEVICE) + cmd->data_out_desc_cnt = n; + else + cmd->data_in_desc_cnt = n; + + buf->table_desc.va = cpu_to_be64(req->cmd->dma + + sizeof *cmd + + sizeof *buf); + buf->table_desc.key = + 
cpu_to_be32(target->srp_host->mr->rkey); + buf->table_desc.len = + cpu_to_be32(n * sizeof (struct srp_direct_buf)); + + for (i = 0; i < n; ++i) { + buf->desc_list[i].va = cpu_to_be64(sg_dma_address(&scat[i])); + buf->desc_list[i].key = + cpu_to_be32(target->srp_host->mr->rkey); + buf->desc_list[i].len = cpu_to_be32(sg_dma_len(&scat[i])); + + datalen += sg_dma_len(&scat[i]); + } + + buf->len = cpu_to_be32(datalen); + + len = sizeof (struct srp_cmd) + + sizeof (struct srp_indirect_buf) + + n * sizeof (struct srp_direct_buf); + } + } else { + struct srp_direct_buf *buf = (void *) cmd->add_data; + dma_addr_t dma; + + dma = dma_map_single(target->srp_host->dev->dma_device, + scmnd->request_buffer, scmnd->request_bufflen, + scmnd->sc_data_direction); + if (dma_mapping_error(dma)) { + printk(KERN_WARNING PFX "unable to map %p/%d (dir %d)\n", + scmnd->request_buffer, (int) scmnd->request_bufflen, + scmnd->sc_data_direction); + return -EINVAL; + } + + pci_unmap_addr_set(req, direct_mapping, dma); + + buf->va = cpu_to_be64(dma); + buf->key = cpu_to_be32(target->srp_host->mr->rkey); + buf->len = cpu_to_be32(scmnd->request_bufflen); + + fmt = SRP_DATA_DESC_DIRECT; + + len = sizeof (struct srp_cmd) + sizeof (struct srp_direct_buf); + } + + if (scmnd->sc_data_direction == DMA_TO_DEVICE) + cmd->buf_fmt = fmt << 4; + else + cmd->buf_fmt = fmt; + + + return len; +} + +static void srp_unmap_data(struct scsi_cmnd *scmnd, + struct srp_target_port *target, + struct srp_request *req) +{ + if (!scmnd->request_buffer || + (scmnd->sc_data_direction != DMA_TO_DEVICE && + scmnd->sc_data_direction != DMA_FROM_DEVICE)) + return; + + if (scmnd->use_sg) + dma_unmap_sg(target->srp_host->dev->dma_device, + (struct scatterlist *) scmnd->request_buffer, + scmnd->use_sg, scmnd->sc_data_direction); + else + dma_unmap_single(target->srp_host->dev->dma_device, + pci_unmap_addr(req, direct_mapping), + scmnd->request_bufflen, + scmnd->sc_data_direction); +} + +static void srp_process_rsp(struct 
srp_target_port *target, struct srp_rsp *rsp) +{ + struct srp_request *req; + struct scsi_cmnd *scmnd; + unsigned long flags; + s32 delta; + + delta = (s32) be32_to_cpu(rsp->req_lim_delta); + + spin_lock_irqsave(target->scsi_host->host_lock, flags); + + target->req_lim += delta; + + req = &target->req_ring[rsp->tag & ~SRP_TAG_TSK_MGMT]; + + if (unlikely(rsp->tag & SRP_TAG_TSK_MGMT)) { + if (be32_to_cpu(rsp->resp_data_len) < 4) + req->tsk_status = -1; + else + req->tsk_status = rsp->data[3]; + complete(&req->done); + } else { + scmnd = req->scmnd; + if (!scmnd) + printk(KERN_ERR "Null scmnd for RSP w/tag %016llx\n", + (unsigned long long) rsp->tag); + scmnd->result = rsp->status; + + if (rsp->flags & SRP_RSP_FLAG_SNSVALID) { + memcpy(scmnd->sense_buffer, rsp->data + + be32_to_cpu(rsp->resp_data_len), + min_t(int, be32_to_cpu(rsp->sense_data_len), + SCSI_SENSE_BUFFERSIZE)); + } + + if (rsp->flags & (SRP_RSP_FLAG_DOOVER | SRP_RSP_FLAG_DOUNDER)) + scmnd->resid = be32_to_cpu(rsp->data_out_res_cnt); + else if (rsp->flags & (SRP_RSP_FLAG_DIOVER | SRP_RSP_FLAG_DIUNDER)) + scmnd->resid = be32_to_cpu(rsp->data_in_res_cnt); + + srp_unmap_data(scmnd, target, req); + + if (!req->tsk_mgmt) { + req->scmnd = NULL; + scmnd->host_scribble = (void *) -1L; + scmnd->scsi_done(scmnd); + + list_del(&req->list); + req->next = target->req_head; + target->req_head = rsp->tag & ~SRP_TAG_TSK_MGMT; + } else + req->cmd_done = 1; + } + + spin_unlock_irqrestore(target->scsi_host->host_lock, flags); +} + +static void srp_reconnect_work(void *target_ptr) +{ + struct srp_target_port *target = target_ptr; + + srp_reconnect_target(target); +} + +static void srp_handle_recv(struct srp_target_port *target, struct ib_wc *wc) +{ + struct srp_iu *iu; + u8 opcode; + + iu = target->rx_ring[wc->wr_id & ~SRP_OP_RECV]; + + dma_sync_single_for_cpu(target->srp_host->dev->dma_device, iu->dma, + target->max_ti_iu_len, DMA_FROM_DEVICE); + + opcode = *(u8 *) iu->buf; + + if (0) { + int i; + + printk(KERN_ERR PFX 
"recv completion, opcode 0x%02x\n", opcode); + + for (i = 0; i < wc->byte_len; ++i) { + if (i % 8 == 0) + printk(KERN_ERR " [%02x] ", i); + printk(" %02x", ((u8 *) iu->buf)[i]); + if ((i + 1) % 8 == 0) + printk("\n"); + } + + if (wc->byte_len % 8) + printk("\n"); + } + + switch (opcode) { + case SRP_RSP: + srp_process_rsp(target, iu->buf); + break; + + case SRP_T_LOGOUT: + /* XXX Handle target logout */ + printk(KERN_WARNING PFX "Got target logout request\n"); + break; + + default: + printk(KERN_WARNING PFX "Unhandled SRP opcode 0x%02x\n", opcode); + break; + } + + dma_sync_single_for_device(target->srp_host->dev->dma_device, iu->dma, + target->max_ti_iu_len, DMA_FROM_DEVICE); +} + +static void srp_completion(struct ib_cq *cq, void *target_ptr) +{ + struct srp_target_port *target = target_ptr; + struct ib_wc wc; + unsigned long flags; + + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + while (ib_poll_cq(cq, 1, &wc) > 0) { + if (wc.status) { + printk(KERN_ERR PFX "failed %s status %d\n", + wc.wr_id & SRP_OP_RECV ? 
"receive" : "send", + wc.status); + spin_lock_irqsave(target->scsi_host->host_lock, flags); + if (target->state == SRP_TARGET_LIVE) + schedule_work(&target->work); + spin_unlock_irqrestore(target->scsi_host->host_lock, flags); + break; + } + + if (wc.wr_id & SRP_OP_RECV) + srp_handle_recv(target, &wc); + else + ++target->tx_tail; + } +} + +static int __srp_post_recv(struct srp_target_port *target) +{ + struct srp_iu *iu; + struct ib_sge list; + struct ib_recv_wr wr, *bad_wr; + unsigned int next; + int ret; + + next = target->rx_head & (SRP_RQ_SIZE - 1); + wr.wr_id = next | SRP_OP_RECV; + iu = target->rx_ring[next]; + + list.addr = iu->dma; + list.length = iu->size; + list.lkey = target->srp_host->mr->lkey; + + wr.next = NULL; + wr.sg_list = &list; + wr.num_sge = 1; + + ret = ib_post_recv(target->qp, &wr, &bad_wr); + if (!ret) + ++target->rx_head; + + return ret; +} + +static int srp_post_recv(struct srp_target_port *target) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(target->scsi_host->host_lock, flags); + ret = __srp_post_recv(target); + spin_unlock_irqrestore(target->scsi_host->host_lock, flags); + + return ret; +} + +/* + * Must be called with target->scsi_host->host_lock held to protect + * req_lim and tx_head. + */ +static struct srp_iu *__srp_get_tx_iu(struct srp_target_port *target) +{ + if (target->tx_head - target->tx_tail >= SRP_SQ_SIZE) + return NULL; + + return target->tx_ring[target->tx_head & SRP_SQ_SIZE]; +} + +/* + * Must be called with target->scsi_host->host_lock held to protect + * req_lim and tx_head. 
+ */ +static int __srp_post_send(struct srp_target_port *target, + struct srp_iu *iu, int len) +{ + struct ib_sge list; + struct ib_send_wr wr, *bad_wr; + int ret = 0; + + if (target->req_lim < 1) { + printk(KERN_ERR PFX "Target has req_lim %d\n", target->req_lim); + return -EAGAIN; + } + + list.addr = iu->dma; + list.length = len; + list.lkey = target->srp_host->mr->lkey; + + wr.next = NULL; + wr.wr_id = target->tx_head & SRP_SQ_SIZE; + wr.sg_list = &list; + wr.num_sge = 1; + wr.opcode = IB_WR_SEND; + wr.send_flags = IB_SEND_SIGNALED; + + ret = ib_post_send(target->qp, &wr, &bad_wr); + + if (!ret) { + ++target->tx_head; + --target->req_lim; + } + + return ret; +} + +static int srp_queuecommand(struct scsi_cmnd *scmnd, + void (*done)(struct scsi_cmnd *)) +{ + struct srp_target_port *target = host_to_target(scmnd->device->host); + struct srp_request *req; + struct srp_iu *iu; + struct srp_cmd *cmd; + long req_index; + int len; + + if (target->state == SRP_TARGET_CONNECTING) + goto err; + + if (target->state == SRP_TARGET_DEAD || + target->state == SRP_TARGET_REMOVED) { + scmnd->result = DID_BAD_TARGET << 16; + done(scmnd); + return 0; + } + + iu = __srp_get_tx_iu(target); + if (!iu) + goto err; + + dma_sync_single_for_cpu(target->srp_host->dev->dma_device, iu->dma, + SRP_MAX_IU_LEN, DMA_TO_DEVICE); + + req_index = target->req_head; + + scmnd->scsi_done = done; + scmnd->result = 0; + scmnd->host_scribble = (void *) req_index; + + cmd = iu->buf; + memset(cmd, 0, sizeof *cmd); + + cmd->opcode = SRP_CMD; + cmd->lun = cpu_to_be64((u64) scmnd->device->lun << 48); + cmd->tag = req_index; + memcpy(cmd->cdb, scmnd->cmnd, scmnd->cmd_len); + + req = &target->req_ring[req_index]; + + req->scmnd = scmnd; + req->cmd = iu; + req->cmd_done = 0; + req->tsk_mgmt = NULL; + + len = srp_map_data(scmnd, target, req); + if (len < 0) { + printk(KERN_ERR PFX "Failed to map data\n"); + goto err; + } + + if (__srp_post_recv(target)) { + printk(KERN_ERR PFX "Recv failed\n"); + goto err_unmap; 
+ } + + dma_sync_single_for_device(target->srp_host->dev->dma_device, iu->dma, + SRP_MAX_IU_LEN, DMA_TO_DEVICE); + + if (__srp_post_send(target, iu, len)) { + printk(KERN_ERR PFX "Send failed\n"); + goto err_unmap; + } + + target->req_head = req->next; + list_add_tail(&req->list, &target->req_queue); + + return 0; + +err_unmap: + srp_unmap_data(scmnd, target, req); + +err: + return SCSI_MLQUEUE_HOST_BUSY; +} + +static int srp_alloc_iu_bufs(struct srp_target_port *target) +{ + int i; + + for (i = 0; i < SRP_RQ_SIZE; ++i) { + target->rx_ring[i] = srp_alloc_iu(target->srp_host, + target->max_ti_iu_len, + GFP_KERNEL, DMA_FROM_DEVICE); + if (!target->rx_ring[i]) + goto err; + } + + for (i = 0; i < SRP_SQ_SIZE + 1; ++i) { + target->tx_ring[i] = srp_alloc_iu(target->srp_host, + SRP_MAX_IU_LEN, + GFP_KERNEL, DMA_TO_DEVICE); + if (!target->tx_ring[i]) + goto err; + } + + return 0; + +err: + for (i = 0; i < SRP_RQ_SIZE; ++i) { + srp_free_iu(target->srp_host, target->rx_ring[i]); + target->rx_ring[i] = NULL; + } + + for (i = 0; i < SRP_SQ_SIZE + 1; ++i) { + srp_free_iu(target->srp_host, target->tx_ring[i]); + target->tx_ring[i] = NULL; + } + + return -ENOMEM; +} + +static void srp_cm_rej_handler(struct ib_cm_id *cm_id, + struct ib_cm_event *event, + struct srp_target_port *target) +{ + struct ib_class_port_info *cpi; + int opcode; + + switch (event->param.rej_rcvd.reason) { + case IB_CM_REJ_PORT_CM_REDIRECT: + cpi = event->param.rej_rcvd.ari; + target->path.dlid = cpi->redirect_lid; + target->path.pkey = cpi->redirect_pkey; + cm_id->remote_cm_qpn = be32_to_cpu(cpi->redirect_qp) & 0x00ffffff; + memcpy(target->path.dgid.raw, cpi->redirect_gid, 16); + + target->status = target->path.dlid ? + SRP_DLID_REDIRECT : SRP_PORT_REDIRECT; + break; + + case IB_CM_REJ_PORT_REDIRECT: + if (topspin_workarounds && + !memcmp(&target->ioc_guid, topspin_oui, 3)) { + /* + * Topspin/Cisco SRP gateways incorrectly send + * reject reason code 25 when they mean 24 + * (port redirect). 
+ */ + memcpy(target->path.dgid.raw, + event->param.rej_rcvd.ari, 16); + + printk(KERN_DEBUG PFX "Topspin/Cisco redirect to target port GID %016llx%016llx\n", + (unsigned long long) be64_to_cpu(target->path.dgid.global.subnet_prefix), + (unsigned long long) be64_to_cpu(target->path.dgid.global.interface_id)); + + target->status = SRP_PORT_REDIRECT; + } else { + printk(KERN_WARNING " REJ reason: IB_CM_REJ_PORT_REDIRECT\n"); + target->status = -ECONNRESET; + } + break; + + case IB_CM_REJ_DUPLICATE_LOCAL_COMM_ID: + printk(KERN_WARNING " REJ reason: IB_CM_REJ_DUPLICATE_LOCAL_COMM_ID\n"); + target->status = -ECONNRESET; + break; + + case IB_CM_REJ_CONSUMER_DEFINED: + opcode = *(u8 *) event->private_data; + if (opcode == SRP_LOGIN_REJ) { + struct srp_login_rej *rej = event->private_data; + u32 reason = be32_to_cpu(rej->reason); + + if (reason == SRP_LOGIN_REJ_REQ_IT_IU_LENGTH_TOO_LARGE) + printk(KERN_WARNING PFX + "SRP_LOGIN_REJ: requested max_it_iu_len too large\n"); + else + printk(KERN_WARNING PFX + "SRP LOGIN REJECTED, reason 0x%08x\n", reason); + } else + printk(KERN_WARNING " REJ reason: IB_CM_REJ_CONSUMER_DEFINED," + " opcode 0x%02x\n", opcode); + target->status = -ECONNRESET; + break; + + default: + printk(KERN_WARNING " REJ reason 0x%x\n", + event->param.rej_rcvd.reason); + target->status = -ECONNRESET; + } +} + +static int srp_cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) +{ + struct srp_target_port *target = cm_id->context; + struct ib_qp_attr *qp_attr = NULL; + int attr_mask = 0; + int comp = 0; + int opcode = 0; + + switch (event->event) { + case IB_CM_REQ_ERROR: + printk(KERN_DEBUG PFX "Sending CM REQ failed\n"); + comp = 1; + target->status = -ECONNRESET; + break; + + case IB_CM_REP_RECEIVED: + comp = 1; + opcode = *(u8 *) event->private_data; + + if (opcode == SRP_LOGIN_RSP) { + struct srp_login_rsp *rsp = event->private_data; + + target->max_ti_iu_len = be32_to_cpu(rsp->max_ti_iu_len); + target->req_lim = be32_to_cpu(rsp->req_lim_delta); 
+ + target->scsi_host->can_queue = min(target->req_lim, + target->scsi_host->can_queue); + } else { + printk(KERN_WARNING PFX "Unhandled RSP opcode %#x\n", opcode); + target->status = -ECONNRESET; + break; + } + + target->status = srp_alloc_iu_bufs(target); + if (target->status) + break; + + qp_attr = kmalloc(sizeof *qp_attr, GFP_KERNEL); + if (!qp_attr) { + target->status = -ENOMEM; + break; + } + + qp_attr->qp_state = IB_QPS_RTR; + target->status = ib_cm_init_qp_attr(cm_id, qp_attr, &attr_mask); + if (target->status) + break; + + target->status = ib_modify_qp(target->qp, qp_attr, attr_mask); + if (target->status) + break; + + target->status = srp_post_recv(target); + if (target->status) + break; + + qp_attr->qp_state = IB_QPS_RTS; + target->status = ib_cm_init_qp_attr(cm_id, qp_attr, &attr_mask); + if (target->status) + break; + + target->status = ib_modify_qp(target->qp, qp_attr, attr_mask); + if (target->status) + break; + + target->status = ib_send_cm_rtu(cm_id, NULL, 0); + if (target->status) + break; + + break; + + case IB_CM_REJ_RECEIVED: + printk(KERN_DEBUG PFX "REJ received\n"); + comp = 1; + + srp_cm_rej_handler(cm_id, event, target); + break; + + case IB_CM_MRA_RECEIVED: + printk(KERN_ERR PFX "MRA received\n"); + break; + + case IB_CM_DREP_RECEIVED: + break; + + case IB_CM_TIMEWAIT_EXIT: + printk(KERN_ERR PFX "connection closed\n"); + + comp = 1; + target->status = 0; + break; + + default: + printk(KERN_WARNING PFX "Unhandled CM event %d\n", event->event); + break; + } + + if (comp) + complete(&target->done); + + kfree(qp_attr); + + return 0; +} + +static int srp_send_tsk_mgmt(struct scsi_cmnd *scmnd, u8 func) +{ + struct srp_target_port *target = host_to_target(scmnd->device->host); + struct srp_request *req; + struct srp_iu *iu; + struct srp_tsk_mgmt *tsk_mgmt; + int req_index; + int ret = FAILED; + + spin_lock_irq(target->scsi_host->host_lock); + + if (scmnd->host_scribble == (void *) -1L) + goto out; + + req_index = (long) scmnd->host_scribble; + 
printk(KERN_ERR "Abort for req_index %d\n", req_index); + + req = &target->req_ring[req_index]; + init_completion(&req->done); + + iu = __srp_get_tx_iu(target); + if (!iu) + goto out; + + tsk_mgmt = iu->buf; + memset(tsk_mgmt, 0, sizeof *tsk_mgmt); + + tsk_mgmt->opcode = SRP_TSK_MGMT; + tsk_mgmt->lun = cpu_to_be64((u64) scmnd->device->lun << 48); + tsk_mgmt->tag = req_index | SRP_TAG_TSK_MGMT; + tsk_mgmt->tsk_mgmt_func = func; + tsk_mgmt->task_tag = req_index; + + if (__srp_post_send(target, iu, sizeof *tsk_mgmt)) + goto out; + + req->tsk_mgmt = iu; + + spin_unlock_irq(target->scsi_host->host_lock); + if (!wait_for_completion_timeout(&req->done, + msecs_to_jiffies(SRP_ABORT_TIMEOUT_MS))) + return FAILED; + spin_lock_irq(target->scsi_host->host_lock); + + if (req->cmd_done) { + list_del(&req->list); + req->next = target->req_head; + target->req_head = req_index; + + scmnd->scsi_done(scmnd); + } else if (!req->tsk_status) { + scmnd->result = DID_ABORT << 16; + ret = SUCCESS; + } + +out: + spin_unlock_irq(target->scsi_host->host_lock); + return ret; +} + +static int srp_abort(struct scsi_cmnd *scmnd) +{ + printk(KERN_ERR "SRP abort called\n"); + + return srp_send_tsk_mgmt(scmnd, SRP_TSK_ABORT_TASK); +} + +static int srp_reset_device(struct scsi_cmnd *scmnd) +{ + printk(KERN_ERR "SRP reset_device called\n"); + + return srp_send_tsk_mgmt(scmnd, SRP_TSK_LUN_RESET); +} + +static int srp_reset_host(struct scsi_cmnd *scmnd) +{ + struct srp_target_port *target = host_to_target(scmnd->device->host); + int ret = FAILED; + + printk(KERN_ERR PFX "SRP reset_host called\n"); + + if (!srp_reconnect_target(target)) + ret = SUCCESS; + + return ret; +} + +static struct scsi_host_template srp_template = { + .module = THIS_MODULE, + .name = DRV_NAME, + .info = srp_target_info, + .queuecommand = srp_queuecommand, + .eh_abort_handler = srp_abort, + .eh_device_reset_handler = srp_reset_device, + .eh_host_reset_handler = srp_reset_host, + .can_queue = SRP_SQ_SIZE, + .this_id = -1, + 
.sg_tablesize = SRP_MAX_INDIRECT, + .cmd_per_lun = SRP_SQ_SIZE, + .use_clustering = ENABLE_CLUSTERING +}; + +static int srp_add_target(struct srp_host *host, struct srp_target_port *target) +{ + sprintf(target->target_name, "SRP.T10:%016llX", + (unsigned long long) be64_to_cpu(target->id_ext)); + + if (scsi_add_host(target->scsi_host, host->dev->dma_device)) + return -ENODEV; + + down(&host->target_mutex); + list_add_tail(&target->list, &host->target_list); + up(&host->target_mutex); + + target->state = SRP_TARGET_LIVE; + + /* XXX: are we supposed to have a definition of SCAN_WILD_CARD ?? */ + scsi_scan_target(&target->scsi_host->shost_gendev, + 0, target->scsi_id, ~0, 0); + + return 0; +} + +static void srp_release_class_dev(struct class_device *class_dev) +{ + struct srp_host *host = + container_of(class_dev, struct srp_host, class_dev); + + complete(&host->released); +} + +static struct class srp_class = { + .name = "infiniband_srp", + .release = srp_release_class_dev +}; + +/* + * Target ports are added by writing + * + * id_ext=,ioc_guid=,dgid=, + * pkey=,service_id= + * + * to the add_target sysfs attribute. 
+ */ +enum { + SRP_OPT_ERR = 0, + SRP_OPT_ID_EXT = 1 << 0, + SRP_OPT_IOC_GUID = 1 << 1, + SRP_OPT_DGID = 1 << 2, + SRP_OPT_PKEY = 1 << 3, + SRP_OPT_SERVICE_ID = 1 << 4, + SRP_OPT_MAX_SECT = 1 << 5, + SRP_OPT_ALL = (SRP_OPT_ID_EXT | + SRP_OPT_IOC_GUID | + SRP_OPT_DGID | + SRP_OPT_PKEY | + SRP_OPT_SERVICE_ID), +}; + +static match_table_t srp_opt_tokens = { + { SRP_OPT_ID_EXT, "id_ext=%s" }, + { SRP_OPT_IOC_GUID, "ioc_guid=%s" }, + { SRP_OPT_DGID, "dgid=%s" }, + { SRP_OPT_PKEY, "pkey=%x" }, + { SRP_OPT_SERVICE_ID, "service_id=%s" }, + { SRP_OPT_MAX_SECT, "max_sect=%d" }, + { SRP_OPT_ERR, NULL } +}; + +static int srp_parse_options(const char *buf, struct srp_target_port *target) +{ + char *options, *sep_opt; + char *p; + char dgid[3]; + substring_t args[MAX_OPT_ARGS]; + int opt_mask = 0; + int token; + int ret = -EINVAL; + int i; + + options = kstrdup(buf, GFP_KERNEL); + if (!options) + return -ENOMEM; + + sep_opt = options; + while ((p = strsep(&sep_opt, ",")) != NULL) { + if (!*p) + continue; + + token = match_token(p, srp_opt_tokens, args); + opt_mask |= token; + + switch (token) { + case SRP_OPT_ID_EXT: + p = match_strdup(args); + target->id_ext = cpu_to_be64(simple_strtoull(p, NULL, 16)); + kfree(p); + break; + + case SRP_OPT_IOC_GUID: + p = match_strdup(args); + target->ioc_guid = cpu_to_be64(simple_strtoull(p, NULL, 16)); + kfree(p); + break; + + case SRP_OPT_DGID: + p = match_strdup(args); + if (strlen(p) != 32) { + printk(KERN_WARNING PFX "bad dest GID parameter '%s'\n", p); + goto out; + } + + for (i = 0; i < 16; ++i) { + strlcpy(dgid, p + i * 2, 3); + target->path.dgid.raw[i] = simple_strtoul(dgid, NULL, 16); + } + break; + + case SRP_OPT_PKEY: + if (match_hex(args, &token)) { + printk(KERN_WARNING PFX "bad P_Key parameter '%s'\n", p); + goto out; + } + target->path.pkey = cpu_to_be16(token); + break; + + case SRP_OPT_SERVICE_ID: + p = match_strdup(args); + target->service_id = cpu_to_be64(simple_strtoull(p, NULL, 16)); + kfree(p); + break; + + case 
SRP_OPT_MAX_SECT: + if (match_int(args, &token)) { + printk(KERN_WARNING PFX "bad max sect parameter '%s'\n", p); + goto out; + } + target->scsi_host->max_sectors = token; + break; + + default: + printk(KERN_WARNING PFX "unknown parameter or missing value " + "'%s' in target creation request\n", p); + goto out; + } + } + + if ((opt_mask & SRP_OPT_ALL) == SRP_OPT_ALL) + ret = 0; + else + for (i = 0; i < ARRAY_SIZE(srp_opt_tokens); ++i) + if ((srp_opt_tokens[i].token & SRP_OPT_ALL) && + !(srp_opt_tokens[i].token & opt_mask)) + printk(KERN_WARNING PFX "target creation request is " + "missing parameter '%s'\n", + srp_opt_tokens[i].pattern); + +out: + kfree(options); + return ret; +} + +static ssize_t srp_create_target(struct class_device *class_dev, + const char *buf, size_t count) +{ + struct srp_host *host = + container_of(class_dev, struct srp_host, class_dev); + struct Scsi_Host *target_host; + struct srp_target_port *target; + int ret; + int i; + + target_host = scsi_host_alloc(&srp_template, + sizeof (struct srp_target_port)); + if (!target_host) + return -ENOMEM; + + target = host_to_target(target_host); + memset(target, 0, sizeof *target); + + target->scsi_host = target_host; + target->srp_host = host; + + INIT_WORK(&target->work, srp_reconnect_work, target); + + for (i = 0; i < SRP_SQ_SIZE - 1; ++i) + target->req_ring[i].next = i + 1; + target->req_ring[SRP_SQ_SIZE - 1].next = -1; + INIT_LIST_HEAD(&target->req_queue); + + ret = srp_parse_options(buf, target); + if (ret) + goto err; + + ib_get_cached_gid(host->dev, host->port, 0, &target->path.sgid); + + printk(KERN_DEBUG PFX "new target: id_ext %016llx ioc_guid %016llx pkey %04x " + "service_id %016llx dgid %04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", + (unsigned long long) be64_to_cpu(target->id_ext), + (unsigned long long) be64_to_cpu(target->ioc_guid), + be16_to_cpu(target->path.pkey), + (unsigned long long) be64_to_cpu(target->service_id), + (int) be16_to_cpu(*(__be16 *) &target->path.dgid.raw[0]), + (int) 
be16_to_cpu(*(__be16 *) &target->path.dgid.raw[2]), + (int) be16_to_cpu(*(__be16 *) &target->path.dgid.raw[4]), + (int) be16_to_cpu(*(__be16 *) &target->path.dgid.raw[6]), + (int) be16_to_cpu(*(__be16 *) &target->path.dgid.raw[8]), + (int) be16_to_cpu(*(__be16 *) &target->path.dgid.raw[10]), + (int) be16_to_cpu(*(__be16 *) &target->path.dgid.raw[12]), + (int) be16_to_cpu(*(__be16 *) &target->path.dgid.raw[14])); + + ret = srp_create_target_ib(target); + if (ret) + goto err; + + target->cm_id = ib_create_cm_id(host->dev, srp_cm_handler, target); + if (IS_ERR(target->cm_id)) { + ret = PTR_ERR(target->cm_id); + goto err_free; + } + + ret = srp_connect_target(target); + if (ret) { + printk(KERN_ERR PFX "Connection failed\n"); + goto err_cm_id; + } + + ret = srp_add_target(host, target); + if (ret) + goto err_disconnect; + + return count; + +err_disconnect: + srp_disconnect_target(target); + +err_cm_id: + ib_destroy_cm_id(target->cm_id); + +err_free: + srp_free_target_ib(target); + +err: + scsi_host_put(target_host); + + return ret; +} + +static CLASS_DEVICE_ATTR(add_target, S_IWUSR, NULL, srp_create_target); + +static ssize_t show_ibdev(struct class_device *class_dev, char *buf) +{ + struct srp_host *host = + container_of(class_dev, struct srp_host, class_dev); + + return sprintf(buf, "%s\n", host->dev->name); +} + +static CLASS_DEVICE_ATTR(ibdev, S_IRUGO, show_ibdev, NULL); + +static ssize_t show_port(struct class_device *class_dev, char *buf) +{ + struct srp_host *host = + container_of(class_dev, struct srp_host, class_dev); + + return sprintf(buf, "%d\n", host->port); +} + +static CLASS_DEVICE_ATTR(port, S_IRUGO, show_port, NULL); + +static struct srp_host *srp_add_port(struct ib_device *device, + __be64 node_guid, u8 port) +{ + struct srp_host *host; + + host = kzalloc(sizeof *host, GFP_KERNEL); + if (!host) + return NULL; + + INIT_LIST_HEAD(&host->target_list); + init_MUTEX(&host->target_mutex); + init_completion(&host->released); + host->dev = device; + 
host->port = port; + + host->initiator_port_id[7] = port; + memcpy(host->initiator_port_id + 8, &node_guid, 8); + + host->pd = ib_alloc_pd(device); + if (IS_ERR(host->pd)) + goto err_free; + + host->mr = ib_get_dma_mr(host->pd, + IB_ACCESS_LOCAL_WRITE | + IB_ACCESS_REMOTE_READ | + IB_ACCESS_REMOTE_WRITE); + if (IS_ERR(host->mr)) + goto err_pd; + + host->class_dev.class = &srp_class; + host->class_dev.dev = device->dma_device; + snprintf(host->class_dev.class_id, BUS_ID_SIZE, "srp-%s-%d", + device->name, port); + + if (class_device_register(&host->class_dev)) + goto err_mr; + if (class_device_create_file(&host->class_dev, &class_device_attr_add_target)) + goto err_class; + if (class_device_create_file(&host->class_dev, &class_device_attr_ibdev)) + goto err_class; + if (class_device_create_file(&host->class_dev, &class_device_attr_port)) + goto err_class; + + return host; + +err_class: + class_device_unregister(&host->class_dev); + +err_mr: + ib_dereg_mr(host->mr); + +err_pd: + ib_dealloc_pd(host->pd); + +err_free: + kfree(host); + + return NULL; +} + +static void srp_add_one(struct ib_device *device) +{ + struct list_head *dev_list; + struct srp_host *host; + struct ib_device_attr *dev_attr; + int s, e, p; + + dev_attr = kmalloc(sizeof *dev_attr, GFP_KERNEL); + if (!dev_attr) + return; + + if (ib_query_device(device, dev_attr)) { + printk(KERN_WARNING PFX "Couldn't query node GUID for %s.\n", + device->name); + goto out; + } + + dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); + if (!dev_list) + goto out; + + INIT_LIST_HEAD(dev_list); + + if (device->node_type == IB_NODE_SWITCH) { + s = 0; + e = 0; + } else { + s = 1; + e = device->phys_port_cnt; + } + + for (p = s; p <= e; ++p) { + host = srp_add_port(device, dev_attr->node_guid, p); + if (host) + list_add_tail(&host->list, dev_list); + } + + ib_set_client_data(device, &srp_client, dev_list); + +out: + kfree(dev_attr); +} + +static void srp_remove_one(struct ib_device *device) +{ + struct list_head *dev_list; + 
struct srp_host *host, *tmp_host; + LIST_HEAD(target_list); + struct srp_target_port *target, *tmp_target; + unsigned long flags; + + dev_list = ib_get_client_data(device, &srp_client); + + list_for_each_entry_safe(host, tmp_host, dev_list, list) { + class_device_unregister(&host->class_dev); + /* + * Wait for the sysfs entry to go away, so that no new + * target ports can be created. + */ + wait_for_completion(&host->released); + + /* + * Mark all target ports as removed, so we stop queueing + * commands and don't try to reconnect. + */ + down(&host->target_mutex); + list_for_each_entry_safe(target, tmp_target, + &host->target_list, list) { + spin_lock_irqsave(target->scsi_host->host_lock, flags); + if (target->state != SRP_TARGET_REMOVED) + target->state = SRP_TARGET_REMOVED; + spin_unlock_irqrestore(target->scsi_host->host_lock, flags); + } + up(&host->target_mutex); + + /* + * Wait for any reconnection tasks that may have + * started before we marked our target ports as + * removed, and any target port removal tasks. 
+ */ + flush_scheduled_work(); + + list_for_each_entry_safe(target, tmp_target, + &host->target_list, list) { + scsi_remove_host(target->scsi_host); + srp_disconnect_target(target); + ib_destroy_cm_id(target->cm_id); + srp_free_target_ib(target); + scsi_host_put(target->scsi_host); + } + + ib_dereg_mr(host->mr); + ib_dealloc_pd(host->pd); + kfree(host); + } + + kfree(dev_list); +} + +static int __init srp_init_module(void) +{ + int ret; + + ret = class_register(&srp_class); + if (ret) { + printk(KERN_ERR PFX "couldn't register class infiniband_srp\n"); + return ret; + } + + ret = ib_register_client(&srp_client); + if (ret) { + printk(KERN_ERR PFX "couldn't register IB client\n"); + class_unregister(&srp_class); + return ret; + } + + return 0; +} + +static void __exit srp_cleanup_module(void) +{ + ib_unregister_client(&srp_client); + class_unregister(&srp_class); +} + +module_init(srp_init_module); +module_exit(srp_cleanup_module); diff --git a/drivers/infiniband/ulp/srp/ib_srp.h b/drivers/infiniband/ulp/srp/ib_srp.h new file mode 100644 index 0000000..4fec28a --- /dev/null +++ b/drivers/infiniband/ulp/srp/ib_srp.h @@ -0,0 +1,150 @@ +/* + * Copyright (c) 2005 Cisco Systems. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_srp.h 3932 2005-11-01 17:19:29Z roland $ + */ + +#ifndef IB_SRP_H +#define IB_SRP_H + +#include +#include + +#include + +#include +#include + +#include +#include +#include + +enum { + SRP_PATH_REC_TIMEOUT_MS = 1000, + SRP_ABORT_TIMEOUT_MS = 5000, + + SRP_PORT_REDIRECT = 1, + SRP_DLID_REDIRECT = 2, + + SRP_MAX_IU_LEN = 256, + + SRP_RQ_SHIFT = 6, + SRP_RQ_SIZE = 1 << SRP_RQ_SHIFT, + SRP_SQ_SIZE = SRP_RQ_SIZE - 1, + SRP_CQ_SIZE = SRP_SQ_SIZE + SRP_RQ_SIZE, + + SRP_TAG_TSK_MGMT = 1 << (SRP_RQ_SHIFT + 1) +}; + +#define SRP_OP_RECV (1 << 31) +#define SRP_MAX_INDIRECT ((SRP_MAX_IU_LEN - \ + sizeof (struct srp_cmd) - \ + sizeof (struct srp_indirect_buf)) / 16) + +enum srp_target_state { + SRP_TARGET_LIVE, + SRP_TARGET_CONNECTING, + SRP_TARGET_DEAD, + SRP_TARGET_REMOVED +}; + +struct srp_host { + u8 initiator_port_id[16]; + struct ib_device *dev; + u8 port; + struct ib_pd *pd; + struct ib_mr *mr; + struct class_device class_dev; + struct list_head target_list; + struct semaphore target_mutex; + struct completion released; + struct list_head list; +}; + +struct srp_request { + struct list_head list; + struct scsi_cmnd *scmnd; + struct srp_iu *cmd; + struct srp_iu *tsk_mgmt; + DECLARE_PCI_UNMAP_ADDR(direct_mapping) + struct completion done; + short 
next; + u8 cmd_done; + u8 tsk_status; +}; + +struct srp_target_port { + __be64 id_ext; + __be64 ioc_guid; + __be64 service_id; + struct srp_host *srp_host; + struct Scsi_Host *scsi_host; + char target_name[32]; + unsigned int scsi_id; + + struct ib_sa_path_rec path; + struct ib_sa_query *path_query; + int path_query_id; + + struct ib_cm_id *cm_id; + struct ib_cq *cq; + struct ib_qp *qp; + + int max_ti_iu_len; + s32 req_lim; + + unsigned rx_head; + struct srp_iu *rx_ring[SRP_RQ_SIZE]; + + unsigned tx_head; + unsigned tx_tail; + struct srp_iu *tx_ring[SRP_SQ_SIZE + 1]; + + int req_head; + struct list_head req_queue; + struct srp_request req_ring[SRP_SQ_SIZE]; + + struct work_struct work; + + struct list_head list; + struct completion done; + int status; + enum srp_target_state state; +}; + +struct srp_iu { + dma_addr_t dma; + void *buf; + size_t size; + enum dma_data_direction direction; +}; + +#endif /* IB_SRP_H */ diff --git a/include/scsi/srp.h b/include/scsi/srp.h new file mode 100644 index 0000000..6c2681d --- /dev/null +++ b/include/scsi/srp.h @@ -0,0 +1,226 @@ +/* + * Copyright (c) 2005 Cisco Systems. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#ifndef SCSI_SRP_H +#define SCSI_SRP_H + +/* + * Structures and constants for the SCSI RDMA Protocol (SRP) as + * defined by the INCITS T10 committee. This file was written using + * draft Revision 16a of the SRP standard. + */ + +#include + +enum { + SRP_LOGIN_REQ = 0x00, + SRP_TSK_MGMT = 0x01, + SRP_CMD = 0x02, + SRP_I_LOGOUT = 0x03, + SRP_LOGIN_RSP = 0xc0, + SRP_RSP = 0xc1, + SRP_LOGIN_REJ = 0xc2, + SRP_T_LOGOUT = 0x80, + SRP_CRED_REQ = 0x81, + SRP_AER_REQ = 0x82, + SRP_CRED_RSP = 0x41, + SRP_AER_RSP = 0x42 +}; + +enum { + SRP_BUF_FORMAT_DIRECT = 1 << 1, + SRP_BUF_FORMAT_INDIRECT = 1 << 2 +}; + +enum { + SRP_NO_DATA_DESC = 0, + SRP_DATA_DESC_DIRECT = 1, + SRP_DATA_DESC_INDIRECT = 2 +}; + +enum { + SRP_TSK_ABORT_TASK = 0x01, + SRP_TSK_ABORT_TASK_SET = 0x02, + SRP_TSK_CLEAR_TASK_SET = 0x04, + SRP_TSK_LUN_RESET = 0x08, + SRP_TSK_CLEAR_ACA = 0x40 +}; + +enum srp_login_rej_reason { + SRP_LOGIN_REJ_UNABLE_ESTABLISH_CHANNEL = 0x00010000, + SRP_LOGIN_REJ_INSUFFICIENT_RESOURCES = 0x00010001, + SRP_LOGIN_REJ_REQ_IT_IU_LENGTH_TOO_LARGE = 0x00010002, + SRP_LOGIN_REJ_UNABLE_ASSOCIATE_CHANNEL = 0x00010003, + SRP_LOGIN_REJ_UNSUPPORTED_DESCRIPTOR_FMT = 0x00010004, + SRP_LOGIN_REJ_MULTI_CHANNEL_UNSUPPORTED = 0x00010005, + SRP_LOGIN_REJ_CHANNEL_LIMIT_REACHED = 0x00010006 +}; + +struct srp_direct_buf { + __be64 va; + __be32 key; + __be32 len; +}; + +/* + * We need the packed attribute because the SRP spec puts the list of + * descriptors at an offset of 20, which is not 
aligned to the size + * of struct srp_direct_buf. + */ +struct srp_indirect_buf { + struct srp_direct_buf table_desc; + __be32 len; + struct srp_direct_buf desc_list[0] __attribute__((packed)); +}; + +enum { + SRP_MULTICHAN_SINGLE = 0, + SRP_MULTICHAN_MULTI = 1 +}; + +struct srp_login_req { + u8 opcode; + u8 reserved1[7]; + u64 tag; + __be32 req_it_iu_len; + u8 reserved2[4]; + __be16 req_buf_fmt; + u8 req_flags; + u8 reserved3[5]; + u8 initiator_port_id[16]; + u8 target_port_id[16]; +}; + +struct srp_login_rsp { + u8 opcode; + u8 reserved1[3]; + __be32 req_lim_delta; + u64 tag; + __be32 max_it_iu_len; + __be32 max_ti_iu_len; + __be16 buf_fmt; + u8 rsp_flags; + u8 reserved2[25]; +}; + +struct srp_login_rej { + u8 opcode; + u8 reserved1[3]; + __be32 reason; + u64 tag; + u8 reserved2[8]; + __be16 buf_fmt; + u8 reserved3[6]; +}; + +struct srp_i_logout { + u8 opcode; + u8 reserved[7]; + u64 tag; +}; + +struct srp_t_logout { + u8 opcode; + u8 sol_not; + u8 reserved[2]; + __be32 reason; + u64 tag; +}; + +/* + * We need the packed attribute because the SRP spec only aligns the + * 8-byte LUN field to 4 bytes. + */ +struct srp_tsk_mgmt { + u8 opcode; + u8 sol_not; + u8 reserved1[6]; + u64 tag; + u8 reserved2[4]; + __be64 lun __attribute__((packed)); + u8 reserved3[2]; + u8 tsk_mgmt_func; + u8 reserved4; + u64 task_tag; + u8 reserved5[8]; +}; + +/* + * We need the packed attribute because the SRP spec only aligns the + * 8-byte LUN field to 4 bytes. 
+ */
+struct srp_cmd {
+	u8 opcode;
+	u8 sol_not;
+	u8 reserved1[3];
+	u8 buf_fmt;
+	u8 data_out_desc_cnt;
+	u8 data_in_desc_cnt;
+	u64 tag;
+	u8 reserved2[4];
+	__be64 lun __attribute__((packed));
+	u8 reserved3;
+	u8 task_attr;
+	u8 reserved4;
+	u8 add_cdb_len;
+	u8 cdb[16];
+	u8 add_data[0];
+};
+
+enum {
+	SRP_RSP_FLAG_RSPVALID = 1 << 0,
+	SRP_RSP_FLAG_SNSVALID = 1 << 1,
+	SRP_RSP_FLAG_DOOVER = 1 << 2,
+	SRP_RSP_FLAG_DOUNDER = 1 << 3,
+	SRP_RSP_FLAG_DIOVER = 1 << 4,
+	SRP_RSP_FLAG_DIUNDER = 1 << 5
+};
+
+struct srp_rsp {
+	u8 opcode;
+	u8 sol_not;
+	u8 reserved1[2];
+	__be32 req_lim_delta;
+	u64 tag;
+	u8 reserved2[2];
+	u8 flags;
+	u8 status;
+	__be32 data_out_res_cnt;
+	__be32 data_in_res_cnt;
+	__be32 sense_data_len;
+	__be32 resp_data_len;
+	u8 data[0];
+};
+
+#endif /* SCSI_SRP_H */
---
0.99.9

From ftillier at silverstorm.com Wed Nov 2 13:48:28 2005
From: ftillier at silverstorm.com (Fab Tillier)
Date: Wed, 2 Nov 2005 13:48:28 -0800
Subject: [openib-general] Re: [OpenSM] SA database query tool
In-Reply-To: <1130965874.4381.4249.camel@hal.voltaire.com>
Message-ID: <000301c5dff7$29c5eb40$9e5aa8c0@infiniconsys.com>

> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Wednesday, November 02, 2005 1:11 PM
>
> As I said there is likely a real SA client that will be developed. In
> the short term, you can use some diag as an example but these are SMP
> rather than GMP based (except for perfquery). There is some SA
> infrastructure in place but I'm not sure how well it works. Would you be
> using RMPP too as little has exercised it to date ?

RMPP would be required for a query of all service registrations.

> There's sa_call and just an ib_path_query right now (in
> libibmad/src/sa.c). A service query could be easily added. RMPP is not
> supported yet at this level.
>
> > > What is the timeframe for this need ?
> >
> > I'm thinking of debugging tools that would be useful for me at SC05.
>
> I was planning on using ibis at SC05 if this was needed.

If there are Windows boxes on the same IB fabric, you could pretty easily
write a program to do the query for you. Windows supports user-mode SA
queries including RMPP. I don't know if this is practical for your SC05
needs.

- Fab

From mshefty at ichips.intel.com Wed Nov 2 13:48:11 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 02 Nov 2005 13:48:11 -0800
Subject: [openib-general] common userspace support
Message-ID: <4369341B.8080004@ichips.intel.com>

I'm implementing the userspace CMA and noticed that there are a couple of areas
where userspace support overlaps.

For example, both the CMA and IB CM need to copy path records between userspace
and the kernel. They also copy QP attributes, which would also be needed by
verbs at some point to support query QP. In these cases, the data structures
passed between userspace and the kernel are the same, as is the code to copy them.

Does anyone have a preference for how to deal with this issue on both the kernel
and userspace sides?

My thinking is that for the kernel, the kernel structures would be defined in a
common header, with functions exported to copy to/from them. This results in
additional dependencies between modules. (E.g. rdma_ucm would require ib_uverbs
and ib_usa modules. ib_user_verbs.h would define the QP attribute structure and
uverbs_?.c would export copy routines.)

For userspace, we can do something similar, which would build dependencies
between the different libraries.

- Sean

From robert.j.woodruff at intel.com Wed Nov 2 13:47:34 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Wed, 2 Nov 2005 13:47:34 -0800
Subject: [openib-general] RE: Problems with SDP on Itanium
In-Reply-To: <20051102185838.GA26005@mellanox.co.il>
Message-ID:

Michael wrote,
>No, dont think I've seen that one, but its been a while
>since I last run anything on Itanium.
>Can you try to debug it a little? What does it mean that
>an application "hangs"?
Is some data sent from one side not received
>by another one?

>--
>MST

Looks like it is stuck in the write() system call.

103: 1048573 bytes 21 times --> 3853.24 Mbps in 2076.17 usec
104: 1048576 bytes 24 times --> 3854.65 Mbps in 2075.42 usec
105: 1048579 bytes 24 times --> 3847.86 Mbps in 2079.08 usec
106: 1572861 bytes 24 times -->
Program received signal SIGINT, Interrupt.
0xa000000000010641 in ?? ()
(gdb) bt
#0 0xa000000000010641 in ?? ()
#1 0x20000000001bf9c0 in write () from /lib/tls/libc.so.6.1
#2 0x4000000000004920 in SendData ()
#3 0x40000000000036e0 in main ()

Here is the gdb traceback from the other side after it hangs.
It is blocked in a read() system call.

(gdb) run
Starting program: /home/exports/NetPIPE_3.5-SDP/NPtcp
Failed to read a valid object file image from memory.
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
Send and receive buffers are 135168 and 135168 bytes
(A bug in Linux doubles the requested buffer sizes)

Program received signal SIGINT, Interrupt.
0xa000000000010641 in ?? ()
(gdb) bt
#0 0xa000000000010641 in ?? ()
#1 0x20000000001bf8c0 in read () from /lib/tls/libc.so.6.1
#2 0x4000000000004a50 in RecvData ()
#3 0x4000000000003aa0 in main ()

From mst at mellanox.co.il Wed Nov 2 14:03:58 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 Nov 2005 00:03:58 +0200
Subject: [openib-general] Re: [PATCH/RFC v2] IB: Add SCSI RDMA Protocol (SRP) initiator
In-Reply-To: <52r79y91jz.fsf_-_@cisco.com>
References: <52r79y91jz.fsf_-_@cisco.com>
Message-ID: <20051102220358.GA27132@mellanox.co.il>

Hello, Roland!
Quoting Roland Dreier :
> +static int srp_init_qp(struct srp_target_port *target,
> +		       struct ib_qp *qp)
> +{
> +	struct ib_qp_attr *attr;
> +	int ret;
> +
> +	attr = kmalloc(sizeof *attr, GFP_KERNEL);
> +	if (!attr)
> +		return -ENOMEM;
> +
> +	ret = ib_find_cached_pkey(target->srp_host->dev,
> +				  target->srp_host->port,
> +				  be16_to_cpu(target->path.pkey),
> +				  &attr->pkey_index);
> +	if (ret)
> +		return ret;
> +
> +	attr->qp_state = IB_QPS_INIT;
> +	attr->qp_access_flags = (IB_ACCESS_REMOTE_READ |
> +				 IB_ACCESS_REMOTE_WRITE);
> +	attr->port_num = target->srp_host->port;
> +
> +	return ib_modify_qp(qp, attr,
> +			    IB_QP_STATE |
> +			    IB_QP_PKEY_INDEX |
> +			    IB_QP_ACCESS_FLAGS |
> +			    IB_QP_PORT);
> +}

This seems to leak sizeof *attr bytes if ib_find_cached_pkey
returns an error.

--
MST

From trimmer at silverstorm.com Wed Nov 2 14:04:07 2005
From: trimmer at silverstorm.com (Rimmer, Todd)
Date: Wed, 2 Nov 2005 17:04:07 -0500
Subject: [openib-general] Re: [PATCH/RFC v2] IB: Add SCSI RDMA Protocol(SRP) initiator
Message-ID: <5D78D28F88822E4D8702BB9EEF1A436773E947@mercury.infiniconsys.com>

also leaks it on success

> -----Original Message-----
> From: Michael S. Tsirkin [mailto:mst at mellanox.co.il]
> Sent: Wednesday, November 02, 2005 5:04 PM
> To: Roland Dreier
> Cc: openib-general at openib.org; linux-kernel at vger.kernel.org;
> linux-scsi at vger.kernel.org
> Subject: [openib-general] Re: [PATCH/RFC v2] IB: Add SCSI RDMA
> Protocol(SRP) initiator
>
>
> Hello, Roland!
> Quoting Roland Dreier :
> > +static int srp_init_qp(struct srp_target_port *target,
> > +		       struct ib_qp *qp)
> > +{
> > +	struct ib_qp_attr *attr;
> > +	int ret;
> > +
> > +	attr = kmalloc(sizeof *attr, GFP_KERNEL);
> > +	if (!attr)
> > +		return -ENOMEM;
> > +
> > +	ret = ib_find_cached_pkey(target->srp_host->dev,
> > +				  target->srp_host->port,
> > +				  be16_to_cpu(target->path.pkey),
> > +				  &attr->pkey_index);
> > +	if (ret)
> > +		return ret;
> > +
> > +	attr->qp_state = IB_QPS_INIT;
> > +	attr->qp_access_flags = (IB_ACCESS_REMOTE_READ |
> > +				 IB_ACCESS_REMOTE_WRITE);
> > +	attr->port_num = target->srp_host->port;
> > +
> > +	return ib_modify_qp(qp, attr,
> > +			    IB_QP_STATE |
> > +			    IB_QP_PKEY_INDEX |
> > +			    IB_QP_ACCESS_FLAGS |
> > +			    IB_QP_PORT);
> > +}
>
> This seems to leak sizeof *attr bytes if ib_find_cached_pkey
> returns an error.
>
> --
> MST
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

From rolandd at cisco.com Wed Nov 2 14:04:41 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 02 Nov 2005 14:04:41 -0800
Subject: [openib-general] Re: [PATCH/RFC v2] IB: Add SCSI RDMA Protocol (SRP) initiator
In-Reply-To: <20051102220358.GA27132@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 3 Nov 2005 00:03:58 +0200")
References: <52r79y91jz.fsf_-_@cisco.com> <20051102220358.GA27132@mellanox.co.il>
Message-ID: <52mzkm905i.fsf@cisco.com>

Michael> This seems to leak sizeof *attr bytes if
Michael> ib_find_cached_pkey returns an error.

Good catch. It actually seems to leak attr unconditionally...

I'll fix it up now.

- R.
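
For reference, a common way to close a leak like the one reported in this thread is a single exit label that frees the allocation on every path. What follows is a minimal userspace sketch of that idiom, not the driver code: demo_alloc, demo_free, demo_find_pkey and demo_modify_qp are hypothetical stand-ins for kmalloc/kfree and the IB calls, and the allocation counter exists only so the behaviour can be checked.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ib_qp_attr; only two fields are modelled. */
struct demo_qp_attr {
	int pkey_index;
	int qp_state;
};

/* Number of allocations not yet freed; used only to verify the idiom. */
static int live_allocs;

static void *demo_alloc(size_t size)
{
	void *p = malloc(size);

	if (p)
		++live_allocs;
	return p;
}

static void demo_free(void *p)
{
	if (p)
		--live_allocs;
	free(p);
}

/* Stand-in for the P_Key lookup: fails on request, otherwise yields index 0. */
static int demo_find_pkey(int fail, int *index)
{
	if (fail)
		return -ENOENT;
	*index = 0;
	return 0;
}

/* Stand-in for the QP modify call: accepts only the state set below. */
static int demo_modify_qp(const struct demo_qp_attr *attr)
{
	return attr->qp_state == 1 ? 0 : -EINVAL;
}

/* Every path after the allocation leaves through "out", so attr is freed. */
static int demo_init_qp(int fail)
{
	struct demo_qp_attr *attr;
	int ret;

	attr = demo_alloc(sizeof *attr);
	if (!attr)
		return -ENOMEM;

	ret = demo_find_pkey(fail, &attr->pkey_index);
	if (ret)
		goto out;

	attr->qp_state = 1;
	ret = demo_modify_qp(attr);

out:
	demo_free(attr);
	return ret;
}

/* Usage: neither the failure path nor the success path leaves memory behind. */
static void demo_check(void)
{
	assert(demo_init_qp(1) == -ENOENT);
	assert(demo_init_qp(0) == 0);
	assert(live_allocs == 0);
}
```

The point of the single label is that adding another early exit later cannot reintroduce the leak as long as it jumps to "out" rather than returning directly.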

From rolandd at cisco.com Wed Nov 2 14:08:35 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 02 Nov 2005 14:08:35 -0800
Subject: [openib-general] Re: [PATCH/RFC v2] IB: Add SCSI RDMA Protocol (SRP) initiator
In-Reply-To: <20051102220358.GA27132@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 3 Nov 2005 00:03:58 +0200")
References: <52r79y91jz.fsf_-_@cisco.com> <20051102220358.GA27132@mellanox.co.il>
Message-ID: <52irva8zz0.fsf@cisco.com>

OK, I replaced the function with:

+static int srp_init_qp(struct srp_target_port *target,
+		       struct ib_qp *qp)
+{
+	struct ib_qp_attr *attr;
+	int ret;
+
+	attr = kmalloc(sizeof *attr, GFP_KERNEL);
+	if (!attr)
+		return -ENOMEM;
+
+	ret = ib_find_cached_pkey(target->srp_host->dev,
+				  target->srp_host->port,
+				  be16_to_cpu(target->path.pkey),
+				  &attr->pkey_index);
+	if (ret)
+		goto out;
+
+	attr->qp_state = IB_QPS_INIT;
+	attr->qp_access_flags = (IB_ACCESS_REMOTE_READ |
+				 IB_ACCESS_REMOTE_WRITE);
+	attr->port_num = target->srp_host->port;
+
+	ret = ib_modify_qp(qp, attr,
+			   IB_QP_STATE |
+			   IB_QP_PKEY_INDEX |
+			   IB_QP_ACCESS_FLAGS |
+			   IB_QP_PORT);
+
+out:
+	kfree(attr);
+	return ret;
+}

From ardavis at ichips.intel.com Wed Nov 2 14:08:21 2005
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Wed, 02 Nov 2005 14:08:21 -0800
Subject: [openib-general] Re: uDAPL again
In-Reply-To: <43692526.3030003@cs.rutgers.edu>
References: <436906F0.3050803@cs.rutgers.edu> <43691B71.2040500@ichips.intel.com> <43692526.3030003@cs.rutgers.edu>
Message-ID: <436938D5.6030403@ichips.intel.com>

Aniruddha Bohra wrote:

> Arlin Davis wrote:
>
>> Aniruddha Bohra wrote:
>>
>>> cq_object_wait: RET evd 0x8083ca0 ibv_cq 0x8083da0 ibv_ctx (nil)
>>> Success
>>> >>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<
>>> dapl_evd_dto_callback : CQE
>>> work_req_id 134771572
>>> status 12
>>> >>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<
>>> DTO completion ERROR: 12: op 0xff
>>> disconnect(ep 0x8087110, conn 0x808a008, id 134774528 flags 0)
>>>
destroy_cm_id: conn 0x808a008 id 134774528
>>> dapli_evd_post_event: Called with event # 4006
>>>
>>>
>>> Any ideas how to proceed to even debug this ?
>>
>> Are you using the uDAPL provider with socket CM (VERBS=openib_scm) or
>> the default one that use's uCM and uAT? For the socket_CM version
>> the timeout is set to 14 (~67ms) and the retries are set to 7 so the
>> receiving node would have to be delayed beyond ~469ms to get this
>> failure. For the default uCM/uAT version the retries are set to 7 and
>> the timeout is set to pktlifetime+1 so you would have to look at the
>> path-record for the timeout value for the connection.
>>
> I am using the default one. Actually, even the dapl_ep_connect() takes
> a long time.

How long does it typically take to process your dapl_ep_connect? Your
time is most likely being spent resolving the remote IP address to a GID
and then resolving the path record. Both require SA queries.

> I am not sure, but arent uCM and uAT simply for connection establishment?
>
Yes, but they also set up many of the transfer attributes of the
connected QP. The uCM/uAT version uses path_records from the SA query
but the socket_CM version just builds them by hand similar to the way
ibv_rc_pingpong does. You would have to look at the
pathrecord->pktlifetime to see the actual timeout value being used.

>
>> Can you successfully run the IB verbs ibv_rc_pingpong test suite?
>
>
> Between the two OpenIB nodes, I can run the ibv_rc_pingpong.

I would suggest that you try the socket CM version and see if you get
different results. Just build with "make VERBS=openib_scm".
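
The ~67ms and ~469ms figures quoted in this message follow from how the InfiniBand local ACK timeout field is encoded: the wait is 4.096 usec times 2 to the power of the field value. Below is a minimal sketch of that arithmetic. The function names are hypothetical and are not part of uDAPL or libibverbs, and taking the total as retries times timeout simply mirrors the estimate in the message rather than any formula from the specification.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Local ACK timeout in microseconds for a given field value, using the
 * encoding 4.096 usec * 2^exponent. The field is 5 bits wide; special
 * field values defined by the specification are not modelled here.
 */
static double ack_timeout_usec(unsigned int exponent)
{
	assert(exponent < 32);
	return 4.096 * (double) (1UL << exponent);
}

/*
 * Rough time before the sender gives up, taken as retries * timeout to
 * mirror the estimate in the message (an assumption of this sketch).
 */
static double retry_budget_msec(unsigned int exponent, unsigned int retries)
{
	return retries * ack_timeout_usec(exponent) / 1000.0;
}

/* Hypothetical helper: print the two figures for timeout 14 and 7 retries. */
static void print_retry_budget(void)
{
	printf("timeout 14: %.1f ms per attempt\n",
	       ack_timeout_usec(14) / 1000.0);
	printf("7 retries:  %.1f ms in total\n",
	       retry_budget_msec(14, 7));
}
```

For the default uCM/uAT provider the same function could be fed pktlifetime+1 from the path record, which is why the message says the path record has to be inspected to know the effective timeout.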

-arlin

From jlentini at netapp.com Wed Nov 2 14:09:56 2005
From: jlentini at netapp.com (James Lentini)
Date: Wed, 2 Nov 2005 17:09:56 -0500 (EST)
Subject: [openib-general] Re: [OpenSM] SA database query tool
In-Reply-To: <1130965874.4381.4249.camel@hal.voltaire.com>
References: <1130958126.4381.4109.camel@hal.voltaire.com> <1130965874.4381.4249.camel@hal.voltaire.com>
Message-ID:

halr> > I'm thinking of debugging tools that would be useful for me at SC05.
halr>
halr> I was planning on using ibis at SC05 if this was needed.

I'll check out ibis. Based on Eitan's mail, it sounds perfect.

james

From mst at mellanox.co.il Wed Nov 2 14:15:22 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 Nov 2005 00:15:22 +0200
Subject: [openib-general] Re: common userspace support
In-Reply-To: <4369341B.8080004@ichips.intel.com>
References: <4369341B.8080004@ichips.intel.com>
Message-ID: <20051102221522.GA27731@mellanox.co.il>

Quoting r. Sean Hefty :
> Subject: common userspace support
>
> I'm implementing the userspace CMA and noticed that there are a couple of areas
> where userspace support overlaps.
>
> For example, both the CMA and IB CM need to copy path records between userspace
> and the kernel. They also copy QP attributes, which would also be needed by
> verbs at some point to support query QP. In these cases, the data structures
> passed between userspace and the kernel are the same, as is the code to copy them.
>
> Does anyone have a preference for how to deal with this issue on both the kernel
> and userspace sides?
>
> My thinking is that for the kernel, the kernel structures would be defined in a
> common header, with functions exported to copy to/from them. This results in
> additional dependencies between modules. (E.g. rdma_ucm would require ib_uverbs
> and ib_usa modules. ib_user_verbs.h would define the QP attribute structure and
> uverbs_?.c would export copy routines.)
>
> For userspace, we can do something similar, which would build dependencies
> between the different libraries.
>
> - Sean

Common header files/structures might make some sense, but what would
the routines do, besides copy to/from user?
Could you give an example?

--
MST

From mst at mellanox.co.il Wed Nov 2 14:18:50 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 Nov 2005 00:18:50 +0200
Subject: [openib-general] Re: Problems with SDP on Itanium
In-Reply-To:
References:
Message-ID: <20051102221850.GB27132@mellanox.co.il>

Quoting r. Bob Woodruff :
> Subject: RE: Problems with SDP on Itanium
>
> Michael wrote,
> >No, dont think I've seen that one, but its been a while
> >since I last run anything on Itanium.
> >Can you try to debug it a little? What does it mean that
> >an application "hangs"? Is some data sent from one side not received
> >by another one?
>
> >--
> >MST
>
> Looks like it is stuck in the write() system call.
>
> 103: 1048573 bytes 21 times --> 3853.24 Mbps in 2076.17 usec
> 104: 1048576 bytes 24 times --> 3854.65 Mbps in 2075.42 usec
> 105: 1048579 bytes 24 times --> 3847.86 Mbps in 2079.08 usec
> 106: 1572861 bytes 24 times -->
> Program received signal SIGINT, Interrupt.
> 0xa000000000010641 in ?? ()
> (gdb) bt
> #0 0xa000000000010641 in ?? ()
> #1 0x20000000001bf9c0 in write () from /lib/tls/libc.so.6.1
> #2 0x4000000000004920 in SendData ()
> #3 0x40000000000036e0 in main ()
>
> Here is the gdb traceback from the other side after it hangs.
> It is blocked in a read() system call.
>
> (gdb) run
> Starting program: /home/exports/NetPIPE_3.5-SDP/NPtcp
> Failed to read a valid object file image from memory.
> (no debugging symbols found)
> (no debugging symbols found)
> (no debugging symbols found)
> Send and receive buffers are 135168 and 135168 bytes
> (A bug in Linux doubles the requested buffer sizes)
>
> Program received signal SIGINT, Interrupt.
> 0xa000000000010641 in ?? ()
> (gdb) bt
> #0 0xa000000000010641 in ??
() > #1 0x20000000001bf8c0 in read () from /lib/tls/libc.so.6.1 > #2 0x4000000000004a50 in RecvData () > #3 0x4000000000003aa0 in main () > Interesting. I'll try to look at this next week - shouldnt be too hard to debug if I manage to reproduce it here. Meanwhile, could you please try to enable sdp data debugging, and post the resulting log if the problem reproduces there? -- MST From mshefty at ichips.intel.com Wed Nov 2 14:15:44 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 02 Nov 2005 14:15:44 -0800 Subject: [openib-general] Re: common userspace support In-Reply-To: <20051102221522.GA27731@mellanox.co.il> References: <4369341B.8080004@ichips.intel.com> <20051102221522.GA27731@mellanox.co.il> Message-ID: <43693A90.80400@ichips.intel.com> Michael S. Tsirkin wrote: > Common header files/structures might make some sense, but what would > the routines do, besides copy to/from user? > Could you give an example? The "copies" aren't memory copies, but field by field assignments. The function below is used by ib_ucm to copy QP attributes from the kernel to the userspace app. The same functionality is needed by rdma_ucm. 
- Sean static void ib_ucm_copy_qp_attr(struct ib_ucm_init_qp_attr_resp *dest_attr, struct ib_qp_attr *src_attr) { dest_attr->cur_qp_state = src_attr->cur_qp_state; dest_attr->path_mtu = src_attr->path_mtu; dest_attr->path_mig_state = src_attr->path_mig_state; dest_attr->qkey = src_attr->qkey; dest_attr->rq_psn = src_attr->rq_psn; dest_attr->sq_psn = src_attr->sq_psn; dest_attr->dest_qp_num = src_attr->dest_qp_num; dest_attr->qp_access_flags = src_attr->qp_access_flags; dest_attr->max_send_wr = src_attr->cap.max_send_wr; dest_attr->max_recv_wr = src_attr->cap.max_recv_wr; dest_attr->max_send_sge = src_attr->cap.max_send_sge; dest_attr->max_recv_sge = src_attr->cap.max_recv_sge; dest_attr->max_inline_data = src_attr->cap.max_inline_data; ib_ucm_copy_ah_attr(&dest_attr->ah_attr, &src_attr->ah_attr); ib_ucm_copy_ah_attr(&dest_attr->alt_ah_attr, &src_attr->alt_ah_attr); dest_attr->pkey_index = src_attr->pkey_index; dest_attr->alt_pkey_index = src_attr->alt_pkey_index; dest_attr->en_sqd_async_notify = src_attr->en_sqd_async_notify; dest_attr->sq_draining = src_attr->sq_draining; dest_attr->max_rd_atomic = src_attr->max_rd_atomic; dest_attr->max_dest_rd_atomic = src_attr->max_dest_rd_atomic; dest_attr->min_rnr_timer = src_attr->min_rnr_timer; dest_attr->port_num = src_attr->port_num; dest_attr->timeout = src_attr->timeout; dest_attr->retry_cnt = src_attr->retry_cnt; dest_attr->rnr_retry = src_attr->rnr_retry; dest_attr->alt_port_num = src_attr->alt_port_num; dest_attr->alt_timeout = src_attr->alt_timeout; } From mst at mellanox.co.il Wed Nov 2 14:28:24 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 3 Nov 2005 00:28:24 +0200 Subject: [openib-general] Re: common userspace support In-Reply-To: <4369341B.8080004@ichips.intel.com> References: <4369341B.8080004@ichips.intel.com> Message-ID: <20051102222824.GB27731@mellanox.co.il> Quoting r. 
Sean Hefty : > Subject: common userspace support > > I'm implementing the userspace CMA and noticed that there are a couple of areas > where userspace support overlaps. > > For example, both the CMA and IB CM need to copy path records between userspace > and the kernel. They also copy QP attributes, which would also be needed by > verbs at some point to support query QP. In these cases, the data structures > passed between userspace and the kernel are the same, as is the code to copy them. > > Does anyone have a preference for how to deal with this issue on both the kernel > and userspace sides? > > My thinking is that for the kernel, the kernel structures would be defined in a > common header, with functions exported to copy to/from them. This results in > additional dependencies between modules. (E.g. rdma_ucm would require ib_uverbs > and ib_usa modules. ib_user_verbs.h would define the QP attribute structure and > uverbs_?.c would export copy routines.) > > For userspace, we can do something similar, which would build dependencies > between the different libraries. > > - Sean I see what you mean now. In my opinion, given that cma is going to be used with uverbs anyway, it shouldnt be a problem to make cma depend on uverbs. -- MST From robert.j.woodruff at intel.com Wed Nov 2 15:16:30 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Wed, 2 Nov 2005 15:16:30 -0800 Subject: [openib-general] RE: Problems with SDP on Itanium In-Reply-To: <20051102221850.GB27132@mellanox.co.il> Message-ID: Michael wrote, >Interesting. I'll try to look at this next week - shouldnt be too hard >to debug if I manage to reproduce it here. >Meanwhile, could you please try to enable sdp data debugging, and post the >resulting log if the problem reproduces there? >-- >MST Yes, when I get some time, I will rebuild my kernel with debug and re-run it. 
woody From rolandd at cisco.com Wed Nov 2 15:27:51 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 02 Nov 2005 15:27:51 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS (Reliable Datagram Sockets) to OpenIB Message-ID: <52d5li8waw.fsf@cisco.com> What are your plans for porting the RDS code so that it works with the upstream Linux IB stack? I've only seen a couple of checkins, and the code that you've dropped so far doesn't look usable and needs a lot of cleanup. There's not even a Makefile there. Someone uncharitable might believe that the whole purpose of this exercise was just to be able to issue your press release (http://silverstorm.com/news/rel/092005.asp). - R. From robert.j.woodruff at intel.com Wed Nov 2 17:37:52 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Wed, 2 Nov 2005 17:37:52 -0800 Subject: [openib-general] Problems with SDP on Itanium In-Reply-To: <014f01c5e00b$e35fb6d0$0211708d@gpv.az05.bull.com> Message-ID: Jerome wrote, >I tried your package and I have the same "hang" that you have at test 106; >sender in write call and receiver in read call. Not sure why ttcp would not >have this problem also? I will rebuild my kernel tomorrow with debug turned on and see if that provides any clues. woody From troy at scl.ameslab.gov Wed Nov 2 19:24:00 2005 From: troy at scl.ameslab.gov (Troy Benjegerdes) Date: Wed, 2 Nov 2005 21:24:00 -0600 Subject: [openib-general] OpenSM errors question.. Message-ID: <20051103032400.GF8748@minbar.scl.ameslab.gov> What does the following mean? (the ERR 1B11, in particular) Nov 02 16:18:33 656702 [41001960] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x001B GID:0xfe80000000000000,0x0002c90200402789 Nov 02 16:18:33 674607 [41802960] -> osm_ucast_mgr_process: Min Hop Tables configured on all switches. Nov 02 16:18:34 197522 [41802960] -> osm_ucast_mgr_process: Min Hop Tables configured on all switches.
Nov 02 16:19:59 917207 [41802960] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x001B GID:0xfe80000000000000,0x0002c90200402789 Nov 02 16:19:59 917610 [41001960] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x001B GID:0xfe80000000000000,0x0002c90200402789 Nov 02 16:19:59 926829 [41001960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: method =SubnAdmSet,scope_state = 0x1, component mask = 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: 0xff12601bffff0000 : 0x0000000000000016 Nov 02 16:20:01 670893 [41802960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: method =SubnAdmSet,scope_state = 0x1, component mask = 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: 0xff12601bffff0000 : 0x0000000000000002 (I got this on an isolated subnet with 3 machines.. two opterons with mellanox cards and an IBM with the eHCA card) From halr at voltaire.com Wed Nov 2 19:58:51 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 3 Nov 2005 05:58:51 +0200 Subject: [openib-general] OpenSM errors question.. Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589A9DF@taurus.voltaire.com> On Wed, 2005-11-02 at 22:24, Troy Benjegerdes wrote: > What does the following mean? (the ERR 1B11, in particular) > > Nov 02 16:18:33 656702 [41001960] -> osm_report_notice: Reporting > Generic Notice type:4 num:144 from LID:0x001B > GID:0xfe80000000000000,0x0002c90200402789 > Nov 02 16:18:33 674607 [41802960] -> osm_ucast_mgr_process: Min Hop > Tables configured on all switches. > Nov 02 16:18:34 197522 [41802960] -> osm_ucast_mgr_process: Min Hop > Tables configured on all switches. 
> Nov 02 16:19:59 917207 [41802960] -> osm_report_notice: Reporting > Generic Notice type:3 num:66 from LID:0x001B > GID:0xfe80000000000000,0x0002c90200402789 > Nov 02 16:19:59 917610 [41001960] -> osm_report_notice: Reporting > Generic Notice type:3 num:66 from LID:0x001B > GID:0xfe80000000000000,0x0002c90200402789 > Nov 02 16:19:59 926829 [41001960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: > method =SubnAdmSet,scope_state = 0x1, component mask = > 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: > 0xff12601bffff0000 : 0x0000000000000016 > Nov 02 16:20:01 670893 [41802960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: > method =SubnAdmSet,scope_state = 0x1, component mask = > 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: > 0xff12601bffff0000 : 0x0000000000000002 > > > (I got this on an isolated subnet with 3 machines.. two opterons with > mellanox cards and an IBM with the eHCA card) That means a join is being attempted to a multicast group which is not yet created. These are typically groups that you can ignore. They are benign. The two above are both IPv6 multicast groups. The first one ends in 0x16 and the second one 0x2. I think those are IGMP and all routers multicast groups. -- Hal From rolandd at cisco.com Wed Nov 2 20:18:07 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 02 Nov 2005 20:18:07 -0800 Subject: [openib-general] [PATCH] umad: fix hotplug Message-ID: <52veza74ao.fsf@cisco.com> I just committed the patch below, which should fix hotplug handling in umad. The practical effect of this that you can do "modprobe -r ib_mthca" with opensm running and not get an oops. Comments and test results solicited.... 
Thanks, Roland --- infiniband/core/user_mad.c (revision 3945) +++ infiniband/core/user_mad.c (working copy) @@ -94,6 +94,9 @@ struct ib_umad_port { struct class_device *sm_class_dev; struct semaphore sm_sem; + struct rw_semaphore mutex; + struct list_head file_list; + struct ib_device *ib_dev; struct ib_umad_device *umad_dev; int dev_num; @@ -108,10 +111,10 @@ struct ib_umad_device { struct ib_umad_file { struct ib_umad_port *port; - spinlock_t recv_lock; struct list_head recv_list; + struct list_head port_list; + spinlock_t recv_lock; wait_queue_head_t recv_wait; - struct rw_semaphore agent_mutex; struct ib_mad_agent *agent[IB_UMAD_MAX_AGENTS]; struct ib_mr *mr[IB_UMAD_MAX_AGENTS]; }; @@ -148,7 +151,7 @@ static int queue_packet(struct ib_umad_f { int ret = 1; - down_read(&file->agent_mutex); + down_read(&file->port->mutex); for (packet->mad.hdr.id = 0; packet->mad.hdr.id < IB_UMAD_MAX_AGENTS; packet->mad.hdr.id++) @@ -161,7 +164,7 @@ static int queue_packet(struct ib_umad_f break; } - up_read(&file->agent_mutex); + up_read(&file->port->mutex); return ret; } @@ -322,7 +325,7 @@ static ssize_t ib_umad_write(struct file goto err; } - down_read(&file->agent_mutex); + down_read(&file->port->mutex); agent = file->agent[packet->mad.hdr.id]; if (!agent) { @@ -419,7 +422,7 @@ static ssize_t ib_umad_write(struct file if (ret) goto err_msg; - up_read(&file->agent_mutex); + up_read(&file->port->mutex); return count; @@ -430,7 +433,7 @@ err_ah: ib_destroy_ah(ah); err_up: - up_read(&file->agent_mutex); + up_read(&file->port->mutex); err: kfree(packet); @@ -460,7 +463,12 @@ static int ib_umad_reg_agent(struct ib_u int agent_id; int ret; - down_write(&file->agent_mutex); + down_write(&file->port->mutex); + + if (!file->port->ib_dev) { + ret = -EPIPE; + goto out; + } if (copy_from_user(&ureq, (void __user *) arg, sizeof ureq)) { ret = -EFAULT; @@ -522,7 +530,7 @@ err: ib_unregister_mad_agent(agent); out: - up_write(&file->agent_mutex); + up_write(&file->port->mutex); return ret; } 
@@ -531,7 +539,7 @@ static int ib_umad_unreg_agent(struct ib u32 id; int ret = 0; - down_write(&file->agent_mutex); + down_write(&file->port->mutex); if (get_user(id, (u32 __user *) arg)) { ret = -EFAULT; @@ -548,7 +556,7 @@ static int ib_umad_unreg_agent(struct ib file->agent[id] = NULL; out: - up_write(&file->agent_mutex); + up_write(&file->port->mutex); return ret; } @@ -569,6 +577,7 @@ static int ib_umad_open(struct inode *in { struct ib_umad_port *port; struct ib_umad_file *file; + int ret = 0; spin_lock(&port_lock); port = umad_port[iminor(inode) - IB_UMAD_MINOR_BASE]; @@ -579,21 +588,32 @@ static int ib_umad_open(struct inode *in if (!port) return -ENXIO; + down_write(&port->mutex); + + if (!port->ib_dev) { + ret = -ENXIO; + goto out; + } + file = kzalloc(sizeof *file, GFP_KERNEL); if (!file) { kref_put(&port->umad_dev->ref, ib_umad_release_dev); - return -ENOMEM; + ret = -ENOMEM; + goto out; } spin_lock_init(&file->recv_lock); - init_rwsem(&file->agent_mutex); INIT_LIST_HEAD(&file->recv_list); init_waitqueue_head(&file->recv_wait); file->port = port; filp->private_data = file; - return 0; + list_add_tail(&file->port_list, &port->file_list); + +out: + up_write(&port->mutex); + return ret; } static int ib_umad_close(struct inode *inode, struct file *filp) @@ -680,9 +700,13 @@ static int ib_umad_sm_close(struct inode struct ib_port_modify props = { .clr_port_cap_mask = IB_PORT_SM }; - int ret; + int ret = 0; + + down_write(&port->mutex); + if (port->ib_dev) + ret = ib_modify_port(port->ib_dev, port->port_num, 0, &props); + up_write(&port->mutex); - ret = ib_modify_port(port->ib_dev, port->port_num, 0, &props); up(&port->sm_sem); kref_put(&port->umad_dev->ref, ib_umad_release_dev); @@ -745,6 +769,8 @@ static int ib_umad_init_port(struct ib_d port->ib_dev = device; port->port_num = port_num; init_MUTEX(&port->sm_sem); + init_rwsem(&port->mutex); + INIT_LIST_HEAD(&port->file_list); port->dev = cdev_alloc(); if (!port->dev) @@ -813,6 +839,9 @@ err_cdev: static 
void ib_umad_kill_port(struct ib_umad_port *port) { + struct ib_umad_file *file; + int id; + class_set_devdata(port->class_dev, NULL); class_set_devdata(port->sm_class_dev, NULL); @@ -826,6 +855,21 @@ static void ib_umad_kill_port(struct ib_ umad_port[port->dev_num] = NULL; spin_unlock(&port_lock); + down_write(&port->mutex); + + port->ib_dev = NULL; + + list_for_each_entry(file, &port->file_list, port_list) + for (id = 0; id < IB_UMAD_MAX_AGENTS; ++id) { + if (!file->agent[id]) + continue; + ib_dereg_mr(file->mr[id]); + ib_unregister_mad_agent(file->agent[id]); + file->agent[id] = NULL; + } + + up_write(&port->mutex); + clear_bit(port->dev_num, dev_map); } From mst at mellanox.co.il Thu Nov 3 00:30:44 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 3 Nov 2005 10:30:44 +0200 Subject: [openib-general] Re: [PATCH] umad: fix hotplug In-Reply-To: <52veza74ao.fsf@cisco.com> References: <52veza74ao.fsf@cisco.com> Message-ID: <20051103083044.GJ31134@mellanox.co.il> Quoting Roland Dreier : > Subject: [PATCH] umad: fix hotplug > > I just committed the patch below, which should fix hotplug handling in > umad. The practical effect of this that you can do "modprobe -r ib_mthca" > with opensm running and not get an oops. > > Comments and test results solicited.... > > Thanks, > Roland I've just checked in the following obvious fix. Hope thats OK with everyone. Committed revision 3956. --- Protect file->mr changes by port->mutex, and remove it from port_list on close. Signed-off-by: Michael S. 
Tsirkin Index: linux-kernel/drivers/infiniband/core/user_mad.c =================================================================== --- linux-kernel/drivers/infiniband/core/user_mad.c (revision 3955) +++ linux-kernel/drivers/infiniband/core/user_mad.c (working copy) @@ -623,6 +623,7 @@ static int ib_umad_close(struct inode *i struct ib_umad_packet *packet, *tmp; int i; + down_write(&file->port->mutex); for (i = 0; i < IB_UMAD_MAX_AGENTS; ++i) if (file->agent[i]) { ib_dereg_mr(file->mr[i]); @@ -632,6 +633,9 @@ static int ib_umad_close(struct inode *i list_for_each_entry_safe(packet, tmp, &file->recv_list, list) kfree(packet); + list_del(&file->port_list); + up_write(&file->port->mutex); + kfree(file); kref_put(&dev->ref, ib_umad_release_dev); -- MST From yipeeyipeeyipeeyipee at yahoo.com Thu Nov 3 01:12:33 2005 From: yipeeyipeeyipeeyipee at yahoo.com (yipee) Date: Thu, 3 Nov 2005 09:12:33 +0000 (UTC) Subject: [openib-general] Re: compilation platform dependencies References: <4367C179.5050102@ichips.intel.com> Message-ID: yipee yahoo.com> writes: Hi again, I've updated my openib sources from the main trunk and verified that your fixes fixed the error I got. Now the call to ib_cm_get_event() returns a correct value. Thanks, y From halr at voltaire.com Thu Nov 3 03:22:31 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Nov 2005 06:22:31 -0500 Subject: [openib-general] [PATCH] OpenSM: Don't obtain PKeyTables on switch when option not supported Message-ID: <1131016951.4338.45.camel@hal.voltaire.com> OpenSM: Don't obtain PKeyTables on switch when partition enforcement option not supported.
Part of patch supplied by Brad Benton Signed-off-by: Hal Rosenstock Index: osm_port_info_rcv.c =================================================================== --- osm_port_info_rcv.c (revision 3942) +++ osm_port_info_rcv.c (working copy) @@ -467,6 +467,11 @@ void osm_pkey_get_tables( cl_ntoh64(p_node->node_info.node_guid) ); goto Exit; } + + /* bail out if this is a switch with no partition enforcement capability */ + if (cl_ntoh16(p_switch->switch_info.enforce_cap) == 0) + goto Exit; + max_blocks = (cl_ntoh16(p_switch->switch_info.enforce_cap)+IB_NUM_PKEY_ELEMENTS_IN_BLOCK -1) / IB_NUM_PKEY_ELEMENTS_IN_BLOCK ; } From yipeeyipeeyipeeyipee at yahoo.com Thu Nov 3 04:15:50 2005 From: yipeeyipeeyipeeyipee at yahoo.com (yipee) Date: Thu, 3 Nov 2005 12:15:50 +0000 (UTC) Subject: [openib-general] netstat Message-ID: Hi, Is there some way to view the list of current CM end points in their various states (listen,connection)? I'm looking for some utility that would provide me with information similar to what netstat provides about kernel sockets. For example: [yipee at yipee new_mini_host]$ netstat -nat Active Internet connections (servers and established) Proto Recv-Q Send-Q Local Address Foreign Address State tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN tcp 0 0 127.0.0.1:25 0.0.0.0:* LISTEN tcp 0 0 10.100.0.95:798 10.100.0.93:2049 ESTABLISHED tcp 0 0 10.100.0.95:800 10.100.0.93:2049 ESTABLISHED tcp 0 0 10.100.0.95:35148 10.100.0.93:111 TIME_WAIT The "Local Address" & "Foreign Address" fields can display the pair, the "State" field is meaningful too. thanks, y From yael at mellanox.co.il Thu Nov 3 05:07:42 2005 From: yael at mellanox.co.il (Yael Kalka) Date: 03 Nov 2005 15:07:42 +0200 Subject: [openib-general] [PATCH] Opensm - bug in osm_sa_path_record with 0 records Message-ID: <5zacglyj4x.fsf@mtl066.yok.mtl.com> Hi Hal, During some testing of path record we found a bug in the code. If the number of records returned is zero, then there is clearing of non allocated memory.
I've added some changes to the __osm_pr_rcv_respond function, to match other sa responses. Attached is a patch to fix it. Thanks, Yael Thanks, Yael Signed-off-by: Yael Kalka Index: opensm/osm_sa_path_record.c =================================================================== --- opensm/osm_sa_path_record.c (revision 3955) +++ opensm/osm_sa_path_record.c (working copy) @@ -1448,7 +1448,7 @@ __osm_pr_rcv_respond( osm_madw_t* p_resp_madw; const ib_sa_mad_t* p_sa_mad; ib_sa_mad_t* p_resp_sa_mad; - size_t num_rec, num_copied; + size_t num_rec, num_copied, pre_trim_num_rec; #ifndef VENDOR_RMPP_SUPPORT size_t trim_num_rec; #endif @@ -1456,6 +1456,7 @@ __osm_pr_rcv_respond( ib_api_status_t status; const ib_sa_mad_t* p_rcvd_mad = osm_madw_get_sa_mad_ptr( p_madw ); osm_pr_item_t* p_pr_item; + uint32_t i; OSM_LOG_ENTER( p_rcv->p_log, __osm_pr_rcv_respond ); @@ -1483,6 +1484,7 @@ __osm_pr_rcv_respond( goto Exit; } + pre_trim_num_rec = num_rec; #ifndef VENDOR_RMPP_SUPPORT trim_num_rec = (MAD_BLOCK_SIZE - IB_SA_MAD_HDR_SIZE) / sizeof(ib_path_rec_t); if (trim_num_rec < num_rec) @@ -1495,11 +1497,15 @@ __osm_pr_rcv_respond( } #endif - if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) - { osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_pr_rcv_respond: " "Generating response with %u records.\n", num_rec ); + + if ((p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec == 0)) + { + osm_sa_send_error( p_rcv->p_resp, p_madw, + IB_SA_MAD_STATUS_NO_RECORDS ); + goto Exit; } /* @@ -1514,6 +1520,16 @@ __osm_pr_rcv_respond( osm_log( p_rcv->p_log, OSM_LOG_ERROR, "__osm_pr_rcv_respond: ERR 1F14: " "Unable to allocate MAD.\n" ); + + for( i = 0; i < num_rec; i++ ) + { + p_pr_item = (osm_pr_item_t*)cl_qlist_remove_head( p_list ); + cl_qlock_pool_put( &p_rcv->pr_pool, &p_pr_item->pool_item ); + } + + osm_sa_send_error( p_rcv->p_resp, p_madw, + IB_SA_MAD_STATUS_NO_RESOURCES ); + goto Exit; } @@ -1528,6 +1544,8 @@ __osm_pr_rcv_respond( p_resp_sa_mad->attr_offset = ib_get_attr_offset( 
sizeof(ib_path_rec_t) ); + p_resp_pr = (ib_path_rec_t*)ib_sa_mad_get_payload_ptr( p_resp_sa_mad ); + #ifndef VENDOR_RMPP_SUPPORT /* we support only one packet RMPP - so we will set the first and last flags for gettable */ @@ -1542,37 +1560,19 @@ __osm_pr_rcv_respond( p_resp_sa_mad->rmpp_flags = IB_RMPP_FLAG_ACTIVE; #endif - p_resp_pr = (ib_path_rec_t*)ib_sa_mad_get_payload_ptr( p_resp_sa_mad ); - - if ( num_rec == 0 ) - { - if (p_resp_sa_mad->method == IB_MAD_METHOD_GET_RESP) - p_resp_sa_mad->status = IB_SA_MAD_STATUS_NO_RECORDS; - cl_memclr( p_resp_pr, sizeof(*p_resp_pr) ); - } - else + for ( i = 0; i < pre_trim_num_rec; i++ ) { p_pr_item = (osm_pr_item_t*)cl_qlist_remove_head( p_list ); - - /* we need to track the number of copied items so we can - * stop the copy - but clear them all - */ - num_copied = 0; - - while( p_pr_item != (osm_pr_item_t*)cl_qlist_end( p_list ) ) - { - /* Copy the Path Records from the list into the MAD */ - if (num_copied < num_rec) - { + /* copy only if not trimmed */ + if (i < num_rec) *p_resp_pr = p_pr_item->path_rec; - num_copied++; - } + cl_qlock_pool_put( &p_rcv->pr_pool, &p_pr_item->pool_item ); p_resp_pr++; - p_pr_item = (osm_pr_item_t*)cl_qlist_remove_head( p_list ); - } } + CL_ASSERT( cl_is_qlist_empty( p_list ) ); + status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); if( status != IB_SUCCESS ) From halr at voltaire.com Thu Nov 3 05:47:40 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Nov 2005 08:47:40 -0500 Subject: [openib-general] [PATCH] OpenSM: Workaround for IBM eHCA logical switch partition enforcement Message-ID: <1131024668.4338.178.camel@hal.voltaire.com> OpenSM: Workaround for IBM eHCA logical switch partition enforcement The problem is that the eHCA logical switches do not support partition enforcement. This *should* be reflected by a zero value in the PartitionEnforcementCap component of the switchinfo attribute. 
The IBM firmware bug is that it returns a one rather than a zero in this field. However, when subsequent requests to the switch port are received for the P_KeyTable, the firmware drops them on the floor and opensm thrashes timing out all the get P_KeyTable MADs it issues for all of the ports on the two logical switches. Remainder of patch supplied by Brad Benton Signed-off-by: Hal Rosenstock Index: osm_port_info_rcv.c =================================================================== --- osm_port_info_rcv.c (revision 3959) +++ osm_port_info_rcv.c (working copy) @@ -416,6 +416,7 @@ __osm_pi_rcv_process_router_port( OSM_LOG_EXIT( p_rcv->p_log ); } +#define IBM_VENDOR_ID (0x5076) /********************************************************************** **********************************************************************/ void osm_pkey_get_tables( @@ -431,6 +432,7 @@ void osm_pkey_get_tables( uint8_t port_num; uint16_t block_num, max_blocks; uint32_t attr_mod_ho; + uint32_t vendor_id; osm_switch_t* p_switch; OSM_LOG_ENTER( p_log, osm_physp_has_pkey ); @@ -468,7 +470,12 @@ void osm_pkey_get_tables( goto Exit; } - /* bail out if this is a switch with no partition enforcement capability */ + /* Check for IBM eHCA firmware defect in reporting partition enforcement cap */ + vendor_id = cl_ntoh32(ib_node_info_get_vendor_id( &p_node->node_info)); + if (vendor_id == IBM_VENDOR_ID && cl_ntoh16(p_switch->switch_info.enforce_cap) == 1) + p_switch->switch_info.enforce_cap = 0; + + /* Bail out if this is a switch with no partition enforcement capability */ if (cl_ntoh16(p_switch->switch_info.enforce_cap) == 0) goto Exit; From mst at mellanox.co.il Thu Nov 3 06:00:12 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 3 Nov 2005 16:00:12 +0200 Subject: [openib-general] [PATCH] support kernel-level sockets in sdp Message-ID: <20051103140011.GA31134@mellanox.co.il> Hi! I plan to commit the following. Comments? 
--- The following patch adds support for kernel-level sockets in SDP Zcopy (currently used with AIO). Signed-off-by: Michael S. Tsirkin Index: drivers/infiniband/ulp/sdp/sdp_iocb.c =================================================================== --- drivers/infiniband/ulp/sdp/sdp_iocb.c (revision 3958) +++ drivers/infiniband/ulp/sdp/sdp_iocb.c (working copy) @@ -176,30 +176,40 @@ if (!iocb->page_array) goto err_page; - down_read(&current->mm->mmap_sem); - - result = get_user_pages(current, current->mm, - iocb->addr, iocb->page_count, - !!(iocb->flags & SDP_IOCB_F_RECV), 0, - iocb->page_array, NULL); + if (segment_eq(get_fs(), get_ds())) { + /* Kernel request */ + for (i = 0; i< iocb->page_count; ++i) { + iocb->page_array[i] = virt_to_page(addr); + iocb->addr_array[i] = page_to_phys(iocb->page_array[i]); + addr += PAGE_SIZE; + } + } else { + /* User-level request */ + down_read(&current->mm->mmap_sem); - up_read(&current->mm->mmap_sem); + result = get_user_pages(current, current->mm, + iocb->addr, iocb->page_count, + !!(iocb->flags & SDP_IOCB_F_RECV), 0, + iocb->page_array, NULL); - if (result != iocb->page_count) { - sdp_dbg_err("unable to lock <%lx:%Zu> error <%d> <%d>", - iocb->addr, iocb->size, result, iocb->page_count); - goto err_get; + up_read(&current->mm->mmap_sem); + + if (result != iocb->page_count) { + sdp_dbg_err("unable to lock <%lx:%Zu> error <%d> <%d>", + iocb->addr, iocb->size, result, + iocb->page_count); + goto err_get; + } + + iocb->flags |= SDP_IOCB_F_LOCKED; + iocb->mm = current->mm; + iocb->tsk = current; + + + for (i = 0; i< iocb->page_count; ++i) { + iocb->addr_array[i] = page_to_phys(iocb->page_array[i]); + } } - - iocb->flags |= SDP_IOCB_F_LOCKED; - iocb->mm = current->mm; - iocb->tsk = current; - - - for (i = 0; i< iocb->page_count; ++i) { - iocb->addr_array[i] = page_to_phys(iocb->page_array[i]); - } - return 0; err_get: -- MST From halr at voltaire.com Thu Nov 3 06:11:35 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 3 Nov 2005 16:11:35 +0200
Subject: [openib-general] Re: [PATCH] Opensm - bug in osm_sa_path_record with 0 records Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589A9E1@taurus.voltaire.com> Hi Yael, On Thu, 2005-11-03 at 08:07, Yael Kalka wrote: > Hi Hal, > > During some testing of path record we found a bug in the code. > If the number of records return is zero, then there is clearing of > non allocated memory. > I've added some changes to the __osm_pr_rcv_respond function, to match > other sa responses. > Attached is a patch to fix it. A couple of minor comments below. -- Hal > Thanks, > Yael > > Thanks, > Yael > > Signed-off-by: Yael Kalka > > Index: opensm/osm_sa_path_record.c > =================================================================== > --- opensm/osm_sa_path_record.c (revision 3955) > +++ opensm/osm_sa_path_record.c (working copy) > @@ -1448,7 +1448,7 @@ __osm_pr_rcv_respond( > osm_madw_t* p_resp_madw; > const ib_sa_mad_t* p_sa_mad; > ib_sa_mad_t* p_resp_sa_mad; > - size_t num_rec, num_copied; > + size_t num_rec, num_copied, pre_trim_num_rec; > #ifndef VENDOR_RMPP_SUPPORT > size_t trim_num_rec; > #endif > @@ -1456,6 +1456,7 @@ __osm_pr_rcv_respond( > ib_api_status_t status; > const ib_sa_mad_t* p_rcvd_mad = osm_madw_get_sa_mad_ptr( p_madw ); > osm_pr_item_t* p_pr_item; > + uint32_t i; > > OSM_LOG_ENTER( p_rcv->p_log, __osm_pr_rcv_respond ); > > @@ -1483,6 +1484,7 @@ __osm_pr_rcv_respond( > goto Exit; > } > > + pre_trim_num_rec = num_rec; > #ifndef VENDOR_RMPP_SUPPORT > trim_num_rec = (MAD_BLOCK_SIZE - IB_SA_MAD_HDR_SIZE) / sizeof(ib_path_rec_t); > if (trim_num_rec < num_rec) > @@ -1495,11 +1497,15 @@ __osm_pr_rcv_respond( > } > #endif > > - if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > - { > osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > "__osm_pr_rcv_respond: " > "Generating response with %u records.\n", num_rec ); > + > + if ((p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec == 0)) > + { > + osm_sa_send_error( p_rcv->p_resp, p_madw, > + IB_SA_MAD_STATUS_NO_RECORDS ); 
> + goto Exit; > } This can be moved up immediately after the C15-0.1.30 clause, OK ? > /* > @@ -1514,6 +1520,16 @@ __osm_pr_rcv_respond( > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > "__osm_pr_rcv_respond: ERR 1F14: " > "Unable to allocate MAD.\n" ); > + > + for( i = 0; i < num_rec; i++ ) > + { > + p_pr_item = (osm_pr_item_t*)cl_qlist_remove_head( p_list ); > + cl_qlock_pool_put( &p_rcv->pr_pool, &p_pr_item->pool_item ); > + } > + > + osm_sa_send_error( p_rcv->p_resp, p_madw, > + IB_SA_MAD_STATUS_NO_RESOURCES ); > + osm_sa_send_error also attempts to get a MAD from the pool. Is there a chance this succeeds after the one in this routine fails ? (Should this be eliminated ?) > goto Exit; > } > > @@ -1528,6 +1544,8 @@ __osm_pr_rcv_respond( > p_resp_sa_mad->attr_offset = > ib_get_attr_offset( sizeof(ib_path_rec_t) ); > > + p_resp_pr = (ib_path_rec_t*)ib_sa_mad_get_payload_ptr( p_resp_sa_mad ); > + > #ifndef VENDOR_RMPP_SUPPORT > /* we support only one packet RMPP - so we will set the first and > last flags for gettable */ > @@ -1542,37 +1560,19 @@ __osm_pr_rcv_respond( > p_resp_sa_mad->rmpp_flags = IB_RMPP_FLAG_ACTIVE; > #endif > > - p_resp_pr = (ib_path_rec_t*)ib_sa_mad_get_payload_ptr( p_resp_sa_mad ); > - > - if ( num_rec == 0 ) > - { > - if (p_resp_sa_mad->method == IB_MAD_METHOD_GET_RESP) > - p_resp_sa_mad->status = IB_SA_MAD_STATUS_NO_RECORDS; > - cl_memclr( p_resp_pr, sizeof(*p_resp_pr) ); > - } > - else > + for ( i = 0; i < pre_trim_num_rec; i++ ) > { > p_pr_item = (osm_pr_item_t*)cl_qlist_remove_head( p_list ); > - > - /* we need to track the number of copied items so we can > - * stop the copy - but clear them all > - */ > - num_copied = 0; > - > - while( p_pr_item != (osm_pr_item_t*)cl_qlist_end( p_list ) ) > - { > - /* Copy the Path Records from the list into the MAD */ > - if (num_copied < num_rec) > - { > + /* copy only if not trimmed */ > + if (i < num_rec) > *p_resp_pr = p_pr_item->path_rec; > - num_copied++; > - } > + > cl_qlock_pool_put( &p_rcv->pr_pool, 
&p_pr_item->pool_item ); > p_resp_pr++; > - p_pr_item = (osm_pr_item_t*)cl_qlist_remove_head( p_list ); > - } > } > > + CL_ASSERT( cl_is_qlist_empty( p_list ) ); > + > status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > > if( status != IB_SUCCESS ) > From glebn at voltaire.com Thu Nov 3 06:19:24 2005 From: glebn at voltaire.com (Gleb Natapov) Date: Thu, 3 Nov 2005 16:19:24 +0200 Subject: [openib-general] [hugh@veritas.com: Re: Nick's core remove PageReserved broke vmware...] Message-ID: <20051103141924.GE22185@minantech.com> Hello Michael, It seems that it is time to resurrect your DONTCOPY patch. Can you do it? If you have no time now I can handle it. ----- Forwarded message from Hugh Dickins ----- From: Hugh Dickins To: Gleb Natapov Cc: Benjamin Herrenschmidt , Petr Vandrovec , Nick Piggin , "Michael S. Tsirkin" , Badari Pulavarty , Linux Kernel Mailing List Subject: Re: Nick's core remove PageReserved broke vmware... Date: Thu, 3 Nov 2005 14:11:46 +0000 (GMT) On Thu, 3 Nov 2005, Gleb Natapov wrote: > On Wed, Nov 02, 2005 at 10:02:49PM +0000, Hugh Dickins wrote: > > On Thu, 3 Nov 2005, Benjamin Herrenschmidt wrote: > > > On Wed, 2005-11-02 at 21:41 +0000, Hugh Dickins wrote: > > > > > > > The only extant problem here is if the pages are private, and you > > > > fork while this is going on, and the parent user process writes to the > > > > area before completion: then COW leaves the child with the page being > > > > DMAed into, giving the parent a copied page which may be incomplete. > > > > > > Won't happen, and if it does, it's a user error to rely on that working, > > > so it doesn't matter. > > > > I wish everyone else would see it that way! (But some people do > > have valid scenarios where it can't just be ruled out completely.) 
> > > I am one of those people :) > > Last discussion about this issue ended without resolution, but I remember > you mentioned the possibility to leave ptes writable in parent during fork > for private pages mapped for DMA. Is this approach acceptable? I was toying with that idea back then, but it leaves the pages in a peculiar limbo between being shared and private, such that it's hard to think through the consequences. We do already have a case rather like that (ptrace writing to a write-protected area), but some of us are a bit worried by that one, so I'd be foolish now to recommend another such subversion of the rules. In the time since we discussed before, I've rather come full circle round to my original position: abandoning such ideas of trying to handle it from get_user_pages itself, appreciating the simplicity of the original PROT_DONTCOPY idea from you guys; but sticking to my initial reaction that this is better done by madvise(MADV_DONTCOPY), not by the mmap/mprotect route in Michael's patch. (I never bought the "racy" argument advanced in favour of the mmap flag.) One of the factors which has swayed me to the DONTCOPY approach, is Nick's 2.6.14 optimization in fork's copy_page_range, where areas which can be safely faulted later are not copied pte by pte. But that doesn't apply to all areas, and in particular cannot apply to VM_NONLINEAR shared areas. It should be of benefit to apps which use large such areas, and also do a lot of forking children who don't need those areas, to be able to mark them VM_DONTCOPY. Or any other vmas the children won't need. (But there's one big distinction between the optimization and VM_DONTCOPY: the optimization copies vma but doesn't fill in its ptes, VM_DONTCOPY doesn't even copy the vma.) Two warnings if someone would like to post a MADV_DONTCOPY patch. 
It should include a matching MADV_DOCOPY to clear the condition, but that must not be allowed to clear VM_DONTCOPY set originally by driver: perhaps you'll end up with a VM_UDONTCOPY or something like that. And Badari has a MADV_REMOVE patch in the works, taking the next slot (just after MADV_DONTNEED in most of the arches): probably best for you to base yours on top of his (though yours is simpler and might jump ahead). Hugh ----- End forwarded message ----- -- Gleb. From schihei at de.ibm.com Thu Nov 3 06:28:38 2005 From: schihei at de.ibm.com (Heiko J Schick) Date: Thu, 03 Nov 2005 15:28:38 +0100 Subject: [openib-general] libehca causes segfault when not physically present.. In-Reply-To: <20051031071703.GU3275@kalmia.hozed.org> References: <20051031071703.GU3275@kalmia.hozed.org> Message-ID: <436A1E96.4050003@de.ibm.com> Hello Troy, this bug should be fixed in OpenIB trunk 3960. Many thanks for pointing out this problem. Regards, Heiko Troy Benjegerdes wrote: > On an Openpower720 system with a mellanox HCA (and no IBM ehca > installed), I get the following when trying to run ibv_rc_pingpong: > > Starting program: > /usr/src/openib-src/userspace/libibverbs/examples/.libs/ibv_rc_pingpong > [Thread debugging using libthread_db enabled] > [New Thread 4398046660640 (LWP 6167)] > > Program received signal SIGSEGV, Segmentation fault. > [Switching to Thread 4398046660640 (LWP 6167)] > hipz_galpa_store (galpa={fw_handle = 0}, offset=48, value=0) > at src/hcp_phyp.c:72 > 72 *(u64 *) addr = value; > (gdb) bt > #0 hipz_galpa_store (galpa={fw_handle = 0}, offset=48, value=0) > at src/hcp_phyp.c:72 > #1 0x0000000010001b7c in pp_post_recv (ctx=0x100177d0, n=-3807848) > at verbs.h:844 > #2 0x0000000010002364 in main (argc=Variable "argc" is not available. > ) at examples/rc_pingpong.c:566 > > > I assume this means something somewhere is not actually checking sysfs > to see if the driver is actually there and active. 
--
Mit freundlichen Gruessen / Kind Regards
Heiko Joerg Schick

----------------------------------------------------------------------
Heiko J Schick I/O Firmware Development II Linux InfiniBand Device Drivers IBM Deutschland Entwicklung GmbH external: 49-07031-16-0 x4219 Schoenaicher Str. 220 t/l: 120-4129 71032 Boeblingen email: schickhj at de.ibm.com
----------------------------------------------------------------------

From mst at mellanox.co.il Thu Nov 3 06:39:15 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 3 Nov 2005 16:39:15 +0200
Subject: [openib-general] Re: [hugh@veritas.com: Re: Nick's core remove PageReserved broke vmware...]
In-Reply-To: <20051103141924.GE22185@minantech.com>
References: <20051103141924.GE22185@minantech.com>
Message-ID: <20051103143915.GC31134@mellanox.co.il>

Hello Geb,
I expect so, unless more fires spring up.
I'll let you know if I need help.

Thanks for the offer,
MST

Quoting glebn at voltaire.com :
> Subject: [hugh at veritas.com: Re: Nick's core remove PageReserved broke vmware...]
>
> Hello Michael,
>
> It seems that it is time to resurrect your DONTCOPY patch. Can you do
> it?
> If you have no time now I can handle it.
>
> ----- Forwarded message from Hugh Dickins -----
>
> From: Hugh Dickins
> To: Gleb Natapov
> Cc: Benjamin Herrenschmidt ,
> Petr Vandrovec ,
> Nick Piggin ,
> "Michael S. Tsirkin" ,
> Badari Pulavarty ,
> Linux Kernel Mailing List
> Subject: Re: Nick's core remove PageReserved broke vmware...
> Date: Thu, 3 Nov 2005 14:11:46 +0000 (GMT) > > On Thu, 3 Nov 2005, Gleb Natapov wrote: > > On Wed, Nov 02, 2005 at 10:02:49PM +0000, Hugh Dickins wrote: > > > On Thu, 3 Nov 2005, Benjamin Herrenschmidt wrote: > > > > On Wed, 2005-11-02 at 21:41 +0000, Hugh Dickins wrote: > > > > > > > > > The only extant problem here is if the pages are private, and > you > > > > > fork while this is going on, and the parent user process writes > to the > > > > > area before completion: then COW leaves the child with the page > being > > > > > DMAed into, giving the parent a copied page which may be > incomplete. > > > > > > > > Won't happen, and if it does, it's a user error to rely on that > working, > > > > so it doesn't matter. > > > > > > I wish everyone else would see it that way! (But some people do > > > have valid scenarios where it can't just be ruled out completely.) > > > > > I am one of those people :) > > > > Last discussion about this issue ended without resolution, but I > remember > > you mentioned the possibility to leave ptes writable in parent during > fork > > for private pages mapped for DMA. Is this approach acceptable? > > I was toying with that idea back then, but it leaves the pages in a > peculiar limbo between being shared and private, such that it's hard > to think through the consequences. We do already have a case rather > like that (ptrace writing to a write-protected area), but some of us > are a bit worried by that one, so I'd be foolish now to recommend > another such subversion of the rules. > > In the time since we discussed before, I've rather come full circle > round to my original position: abandoning such ideas of trying to > handle it from get_user_pages itself, appreciating the simplicity > of the original PROT_DONTCOPY idea from you guys; but sticking to my > initial reaction that this is better done by madvise(MADV_DONTCOPY), > not by the mmap/mprotect route in Michael's patch. 
(I never bought > the "racy" argument advanced in favour of the mmap flag.) > > One of the factors which has swayed me to the DONTCOPY approach, is > Nick's 2.6.14 optimization in fork's copy_page_range, where areas > which can be safely faulted later are not copied pte by pte. But > that doesn't apply to all areas, and in particular cannot apply to > VM_NONLINEAR shared areas. It should be of benefit to apps which > use large such areas, and also do a lot of forking children who don't > need those areas, to be able to mark them VM_DONTCOPY. Or any other > vmas the children won't need. (But there's one big distinction between > the optimization and VM_DONTCOPY: the optimization copies vma but > doesn't fill in its ptes, VM_DONTCOPY doesn't even copy the vma.) > > Two warnings if someone would like to post a MADV_DONTCOPY patch. > It should include a matching MADV_DOCOPY to clear the condition, but > that must not be allowed to clear VM_DONTCOPY set originally by driver: > perhaps you'll end up with a VM_UDONTCOPY or something like that. > > And Badari has a MADV_REMOVE patch in the works, taking the next > slot (just after MADV_DONTNEED in most of the arches): probably > best for you to base yours on top of his (though yours is simpler > and might jump ahead). > > Hugh > > ----- End forwarded message ----- > > -- > Gleb. > -- MST From rolandd at cisco.com Thu Nov 3 07:09:55 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 03 Nov 2005 07:09:55 -0800 Subject: [openib-general] Re: [PATCH] umad: fix hotplug In-Reply-To: <20051103083044.GJ31134@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 3 Nov 2005 10:30:44 +0200") References: <52veza74ao.fsf@cisco.com> <20051103083044.GJ31134@mellanox.co.il> Message-ID: <527jbp7oos.fsf@cisco.com> Thanks, good catch. - R. From mst at mellanox.co.il Thu Nov 3 07:19:39 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 3 Nov 2005 17:19:39 +0200 Subject: [openib-general] Re: [hugh@veritas.com: Re: Nick's core remove PageReserved broke vmware...] In-Reply-To: <20051103143915.GC31134@mellanox.co.il> References: <20051103143915.GC31134@mellanox.co.il> Message-ID: <20051103151939.GF31134@mellanox.co.il> Quoting Michael S. Tsirkin : > Hello Geb, Gleb :) Sorry about a typo. -- MST From halr at voltaire.com Thu Nov 3 07:32:46 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Nov 2005 10:32:46 -0500 Subject: [openib-general] [PATCH] umad: fix hotplug In-Reply-To: <52veza74ao.fsf@cisco.com> References: <52veza74ao.fsf@cisco.com> Message-ID: <1131031773.4338.332.camel@hal.voltaire.com> On Wed, 2005-11-02 at 23:18, Roland Dreier wrote: > I just committed the patch below, which should fix hotplug handling in > umad. The practical effect of this that you can do "modprobe -r ib_mthca" > with opensm running and not get an oops. > > Comments and test results solicited.... How do umad based application get notified of removals and insertions ? Are the devices ready for use then ? -- Hal From mst at mellanox.co.il Thu Nov 3 07:57:28 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 3 Nov 2005 17:57:28 +0200 Subject: [openib-general] Re: [PATCH] umad: fix hotplug In-Reply-To: <1131031773.4338.332.camel@hal.voltaire.com> References: <1131031773.4338.332.camel@hal.voltaire.com> Message-ID: <20051103155728.GI31134@mellanox.co.il> Quoting r. Hal Rosenstock : > Subject: Re: [PATCH] umad: fix hotplug > > On Wed, 2005-11-02 at 23:18, Roland Dreier wrote: > > I just committed the patch below, which should fix hotplug handling in > > umad. The practical effect of this that you can do "modprobe -r > ib_mthca" > > with opensm running and not get an oops. > > > > Comments and test results solicited.... > > How do umad based application get notified of removals and insertions ? > Are the devices ready for use then ? 
>
> -- Hal

Would not a regular hotplug event for umad work?

--
MST

From eli at mellanox.co.il Thu Nov 3 08:01:12 2005
From: eli at mellanox.co.il (Eli Cohen)
Date: Thu, 3 Nov 2005 18:01:12 +0200
Subject: [openib-general] DHCP over Infiniband
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E306629F@mtlexch01.mtl.com>

Hi,
has anyone had the chance to run a DHCP server on an Infiniband interface? I checked this on Suse 10 kernel 2.6.13-15-bigsmp and I do not get responses from the server to DHCP requests. When running tcpdump on ib0 interface I can see the requests but the server does not respond. The server's version is isc-dhcpd-V3.0.3. I also tried version dhcp-3.0.4b1 but with no luck.

I checked on Suse SLES 9 with Mellanox's IBGD1.8 and the server responds to requests. I still had a problem that the server does not set the client identifier option in its responses although the client does set this option.

If you have any experience with this please let me know.

Thanks
Eli

From halr at voltaire.com Thu Nov 3 07:53:39 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 03 Nov 2005 10:53:39 -0500
Subject: [openib-general] [PATCHv2] OpenSM: Workaround for IBM eHCA logical switch partition enforcement
Message-ID: <1131032959.4338.369.camel@hal.voltaire.com>

OpenSM: Workaround for IBM eHCA logical switch partition enforcement

The problem is that the eHCA logical switches do not support partition enforcement. This *should* be reflected by a zero value in the PartitionEnforcementCap component of the switchinfo attribute. The IBM firmware bug is that it returns a one rather than a zero in this field. However, when subsequent requests to the switch port are received for the P_KeyTable, the firmware drops them on the floor and opensm thrashes timing out all the get P_KeyTable MADs it issues for all of the ports on the two logical switches.
Remainder of patch supplied by Brad Benton Signed-off-by: Hal Rosenstock Index: osm_port_info_rcv.c =================================================================== --- osm_port_info_rcv.c (revision 3959) +++ osm_port_info_rcv.c (working copy) @@ -416,6 +416,7 @@ __osm_pi_rcv_process_router_port( OSM_LOG_EXIT( p_rcv->p_log ); } +#define IBM_VENDOR_ID (0x5076) /********************************************************************** **********************************************************************/ void osm_pkey_get_tables( @@ -468,7 +469,11 @@ void osm_pkey_get_tables( goto Exit; } - /* bail out if this is a switch with no partition enforcement capability */ + /* Check for IBM eHCA firmware defect in reporting partition enforcement cap */ + if (cl_ntoh32(ib_node_info_get_vendor_id(&p_node->node_info)) == IBM_VENDOR_ID) + p_switch->switch_info.enforce_cap = 0; + + /* Bail out if this is a switch with no partition enforcement capability */ if (cl_ntoh16(p_switch->switch_info.enforce_cap) == 0) goto Exit; From halr at voltaire.com Thu Nov 3 08:06:49 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Nov 2005 11:06:49 -0500 Subject: [openib-general] DHCP over Infiniband In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E306629F@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E306629F@mtlexch01.mtl.com> Message-ID: <1131034009.4338.389.camel@hal.voltaire.com> On Thu, 2005-11-03 at 11:01, Eli Cohen wrote: > Hi, > has anyone had the chance to run a DHCP server on an Infiniband > interface? I checked this on Suse 10 kernel 2.6.13-15-bigsmp and I do > not get responses from the server to DHCP requests. When running > tcpdump on ib0 interface I can see the requests but the server does > not respond. The server's version is isc-dhcpd-V3.0.3. I also tried > version dhcp-3.0.4b1 but with no luck. I checked on Suse SLES 9 with > Mellanox's IBGD1.8 and the server responds to requests. 
I still had a > problem that the server does not set the client identifier option in > its responses although the client does set this option. If you have > any experience with this please let me know. What DHCP server and what client are you using ? This has been done with the ISC ones. It requires modifications due to the difference in hardware addresses (and there is the QPN issue). -- Hal From eli at mellanox.co.il Thu Nov 3 08:47:07 2005 From: eli at mellanox.co.il (Eli Cohen) Date: Thu, 3 Nov 2005 18:47:07 +0200 Subject: [openib-general] DHCP over Infiniband Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30662A1@mtlexch01.mtl.com> The client is Etherboot's client for configuring a client at boot time. The server is ISC. Can you explain what you mean by "the QPN issue"? -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Thursday, November 03, 2005 6:07 PM To: Eli Cohen Cc: 'openib-general at openib.org' Subject: Re: [openib-general] DHCP over Infiniband On Thu, 2005-11-03 at 11:01, Eli Cohen wrote: > Hi, > has anyone had the chance to run a DHCP server on an Infiniband > interface? I checked this on Suse 10 kernel 2.6.13-15-bigsmp and I do > not get responses from the server to DHCP requests. When running > tcpdump on ib0 interface I can see the requests but the server does > not respond. The server's version is isc-dhcpd-V3.0.3. I also tried > version dhcp-3.0.4b1 but with no luck. I checked on Suse SLES 9 with > Mellanox's IBGD1.8 and the server responds to requests. I still had a > problem that the server does not set the client identifier option in > its responses although the client does set this option. If you have > any experience with this please let me know. What DHCP server and what client are you using ? This has been done with the ISC ones. It requires modifications due to the difference in hardware addresses (and there is the QPN issue). 
-- Hal

From rolandd at cisco.com Thu Nov 3 08:44:10 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Thu, 03 Nov 2005 08:44:10 -0800
Subject: [openib-general] Re: [PATCH] umad: fix hotplug
In-Reply-To: <20051103155728.GI31134@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 3 Nov 2005 17:57:28 +0200")
References: <1131031773.4338.332.camel@hal.voltaire.com> <20051103155728.GI31134@mellanox.co.il>
Message-ID: <52r79x65r9.fsf@cisco.com>

    Michael> Would not a regular hotplug event for umad work?

Yes, and in fact they are generated -- that's how udev knows to
create/destroy the device nodes for example.

 - R.

From rolandd at cisco.com Thu Nov 3 08:44:39 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Thu, 03 Nov 2005 08:44:39 -0800
Subject: [openib-general] [PATCH] umad: fix hotplug
In-Reply-To: <1131031773.4338.332.camel@hal.voltaire.com> (Hal Rosenstock's message of "03 Nov 2005 10:32:46 -0500")
References: <52veza74ao.fsf@cisco.com> <1131031773.4338.332.camel@hal.voltaire.com>
Message-ID: <52mzkl65qg.fsf@cisco.com>

    Hal> How do umad based application get notified of removals and
    Hal> insertions ? Are the devices ready for use then ?

There is no notification beyond the usual hotplug events that the
kernel generates for all character devices.

 - R.
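
As a rough illustration of the hotplug route described in the two replies above, a userspace listener could pick umad add/remove events out of the kernel's uevent payloads. The sketch below is not part of umad or libibumad: the helper names uevent_get and umad_event are made up for this example, and it assumes the payload is a run of NUL-terminated KEY=VALUE strings carrying ACTION and DEVPATH keys, with the devices named umadN.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: look up `key` in a uevent payload, assumed to be
 * a sequence of NUL-terminated "KEY=VALUE" strings stored in buf[0..len).
 * Returns a pointer to VALUE, or NULL if the key is not present.
 */
static const char *uevent_get(const char *buf, size_t len, const char *key)
{
	size_t klen = strlen(key);
	size_t pos = 0;

	while (pos < len) {
		const char *s = buf + pos;
		const char *end = memchr(s, '\0', len - pos);
		size_t slen;

		if (!end)
			break;	/* unterminated trailing data: stop */
		slen = (size_t)(end - s);

		/* match "KEY=" exactly, so "ACTIONS=..." does not match "ACTION" */
		if (slen > klen && strncmp(s, key, klen) == 0 && s[klen] == '=')
			return s + klen + 1;

		pos += slen + 1;
	}
	return NULL;
}

/*
 * Hypothetical helper: classify a uevent payload.
 * Returns 1 if a umadN device was added, -1 if one was removed,
 * and 0 for anything else (other devices, other actions, bad payload).
 */
static int umad_event(const char *buf, size_t len)
{
	const char *action = uevent_get(buf, len, "ACTION");
	const char *devpath = uevent_get(buf, len, "DEVPATH");
	const char *name;

	if (!action || !devpath)
		return 0;

	/* the last path component is taken as the device name (assumption) */
	name = strrchr(devpath, '/');
	name = name ? name + 1 : devpath;
	if (strncmp(name, "umad", 4) != 0)
		return 0;

	if (strcmp(action, "add") == 0)
		return 1;
	if (strcmp(action, "remove") == 0)
		return -1;
	return 0;
}
```

A listener would typically read such payloads from an AF_NETLINK socket bound to NETLINK_KOBJECT_UEVENT and hand each datagram to umad_event(); that part is left out here. An application that sees a remove event would still have to close its umad file descriptors itself, since nothing beyond the hotplug event is delivered.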
From halr at voltaire.com Thu Nov 3 08:47:50 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Nov 2005 11:47:50 -0500 Subject: [openib-general] DHCP over Infiniband In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30662A1@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30662A1@mtlexch01.mtl.com> Message-ID: <1131036320.4338.441.camel@hal.voltaire.com> On Thu, 2005-11-03 at 11:47, Eli Cohen wrote: > The client is Etherboot's client for configuring a client at boot > time. The server is ISC. I think that client needs modifications. > Can you explain what you mean by "the QPN issue"? The QPN is part of the hardware address and is not fixed. -- Hal > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Thursday, November 03, 2005 6:07 PM > To: Eli Cohen > Cc: 'openib-general at openib.org' > Subject: Re: [openib-general] DHCP over Infiniband > > > On Thu, 2005-11-03 at 11:01, Eli Cohen wrote: > > Hi, > > has anyone had the chance to run a DHCP server on an Infiniband > > interface? I checked this on Suse 10 kernel 2.6.13-15-bigsmp and I > do > > not get responses from the server to DHCP requests. When running > > tcpdump on ib0 interface I can see the requests but the server does > > not respond. The server's version is isc-dhcpd-V3.0.3. I also tried > > version dhcp-3.0.4b1 but with no luck. I checked on Suse SLES 9 with > > Mellanox's IBGD1.8 and the server responds to requests. I still had > a > > problem that the server does not set the client identifier option in > > its responses although the client does set this option. If you have > > any experience with this please let me know. > > What DHCP server and what client are you using ? This has been done > with > the ISC ones. It requires modifications due to the difference in > hardware addresses (and there is the QPN issue). 
> > -- Hal > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From iod00d at hp.com Thu Nov 3 08:52:52 2005 From: iod00d at hp.com (Grant Grundler) Date: Thu, 3 Nov 2005 08:52:52 -0800 Subject: [openib-general] compilation platform dependencies In-Reply-To: <52pspkdt7z.fsf@cisco.com> References: <527jbsfdii.fsf@cisco.com> <20051101191905.GE6815@esmail.cup.hp.com> <52pspkdt7z.fsf@cisco.com> Message-ID: <20051103165252.GA32699@esmail.cup.hp.com> Hi Roland, since no one smarter touched this.... On Tue, Nov 01, 2005 at 12:10:56PM -0800, Roland Dreier wrote: > > I've seen use of this use of "data[0]": > > include/rdma/ib_user_verbs.h: __u64 driver_data[0]; > > > > isn't that for the same purpose? > > Apologies if I'm mixing things up... > > The driver_data[] in ib_user_verbs.h is really there to give a hint > that extra device-dependent data could follow. Reserved members of > structs are used to pad it up to a 64-bit boundary. Yeah, this is the right way to do it. I just wasn't sure. > I'm not sure if __u64 driver_data[0]; forces alignment to an 8-byte > boundary on i386... does it? I'm now convinced it doesn't on x86. See output below. 
thanks,
grant

grundler <481>uname -a
Linux ob500 2.6.13 #6 Sat Oct 1 23:58:35 PDT 2005 i686 GNU/Linux
grundler <482>cat alignment_test.c
#include <stdio.h>
#include <stddef.h>

struct foo {
	int y;
	unsigned long long x;
};

int main(void)
{
	return printf("offset of x is %d\n", offsetof(struct foo, x));
}
grundler <483>make alignment_test
cc alignment_test.c -o alignment_test
grundler <484>./alignment_test
offset of x is 4

From halr at voltaire.com Thu Nov 3 08:51:04 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 03 Nov 2005 11:51:04 -0500
Subject: [openib-general] Re: [PATCH] Opensm - bug in osm_sa_path_record with 0 records
In-Reply-To: <1131025659.4338.206.camel@hal.voltaire.com>
References: <5zacglyj4x.fsf@mtl066.yok.mtl.com> <1131025659.4338.206.camel@hal.voltaire.com>
Message-ID: <1131036469.4338.446.camel@hal.voltaire.com>

One additional comment on this:

On Thu, 2005-11-03 at 08:50, Hal Rosenstock wrote:
> On Thu, 2005-11-03 at 08:07, Yael Kalka wrote:
> > Hi Hal,
> >
> > During some testing of path record we found a bug in the code.
> > If the number of records return is zero, then there is clearing of
> > non allocated memory.
> > I've added some changes to the __osm_pr_rcv_respond function, to match
> > other sa responses.
> > Attached is a patch to fix it.
>
> A couple of minor comments below.
> > -- Hal > > > Thanks, > > Yael > > > > Thanks, > > Yael > > > > Signed-off-by: Yael Kalka > > > > Index: opensm/osm_sa_path_record.c > > =================================================================== > > --- opensm/osm_sa_path_record.c (revision 3955) > > +++ opensm/osm_sa_path_record.c (working copy) > > @@ -1448,7 +1448,7 @@ __osm_pr_rcv_respond( > > osm_madw_t* p_resp_madw; > > const ib_sa_mad_t* p_sa_mad; > > ib_sa_mad_t* p_resp_sa_mad; > > - size_t num_rec, num_copied; > > + size_t num_rec, num_copied, pre_trim_num_rec; > > #ifndef VENDOR_RMPP_SUPPORT > > size_t trim_num_rec; > > #endif > > @@ -1456,6 +1456,7 @@ __osm_pr_rcv_respond( > > ib_api_status_t status; > > const ib_sa_mad_t* p_rcvd_mad = osm_madw_get_sa_mad_ptr( p_madw ); > > osm_pr_item_t* p_pr_item; > > + uint32_t i; > > > > OSM_LOG_ENTER( p_rcv->p_log, __osm_pr_rcv_respond ); > > > > @@ -1483,6 +1484,7 @@ __osm_pr_rcv_respond( > > goto Exit; > > } > > > > + pre_trim_num_rec = num_rec; > > #ifndef VENDOR_RMPP_SUPPORT > > trim_num_rec = (MAD_BLOCK_SIZE - IB_SA_MAD_HDR_SIZE) / sizeof(ib_path_rec_t); > > if (trim_num_rec < num_rec) > > @@ -1495,11 +1497,15 @@ __osm_pr_rcv_respond( > > } > > #endif > > > > - if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > > - { > > osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > > "__osm_pr_rcv_respond: " > > "Generating response with %u records.\n", num_rec ); > > + > > + if ((p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec == 0)) > > + { > > + osm_sa_send_error( p_rcv->p_resp, p_madw, > > + IB_SA_MAD_STATUS_NO_RECORDS ); > > + goto Exit; > > } > > This can be moved up immediately after the C15-0.1.30 clause, OK ? 
> > > /* > > @@ -1514,6 +1520,16 @@ __osm_pr_rcv_respond( > > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > "__osm_pr_rcv_respond: ERR 1F14: " > > "Unable to allocate MAD.\n" ); > > + > > + for( i = 0; i < num_rec; i++ ) > > + { > > + p_pr_item = (osm_pr_item_t*)cl_qlist_remove_head( p_list ); > > + cl_qlock_pool_put( &p_rcv->pr_pool, &p_pr_item->pool_item ); > > + } > > + > > + osm_sa_send_error( p_rcv->p_resp, p_madw, > > + IB_SA_MAD_STATUS_NO_RESOURCES ); > > + > > osm_sa_send_error also attempts to get a MAD from the pool. Is there a > chance this succeeds after the one in this routine fails ? (Should this > be eliminated ?) > > > goto Exit; > > } > > > > @@ -1528,6 +1544,8 @@ __osm_pr_rcv_respond( > > p_resp_sa_mad->attr_offset = > > ib_get_attr_offset( sizeof(ib_path_rec_t) ); > > > > + p_resp_pr = (ib_path_rec_t*)ib_sa_mad_get_payload_ptr( p_resp_sa_mad ); > > + > > #ifndef VENDOR_RMPP_SUPPORT > > /* we support only one packet RMPP - so we will set the first and > > last flags for gettable */ > > @@ -1542,37 +1560,19 @@ __osm_pr_rcv_respond( > > p_resp_sa_mad->rmpp_flags = IB_RMPP_FLAG_ACTIVE; > > #endif > > > > - p_resp_pr = (ib_path_rec_t*)ib_sa_mad_get_payload_ptr( p_resp_sa_mad ); > > - > > - if ( num_rec == 0 ) > > - { > > - if (p_resp_sa_mad->method == IB_MAD_METHOD_GET_RESP) > > - p_resp_sa_mad->status = IB_SA_MAD_STATUS_NO_RECORDS; > > - cl_memclr( p_resp_pr, sizeof(*p_resp_pr) ); > > - } > > - else > > + for ( i = 0; i < pre_trim_num_rec; i++ ) > > { > > p_pr_item = (osm_pr_item_t*)cl_qlist_remove_head( p_list ); > > - > > - /* we need to track the number of copied items so we can > > - * stop the copy - but clear them all > > - */ > > - num_copied = 0; > > - > > - while( p_pr_item != (osm_pr_item_t*)cl_qlist_end( p_list ) ) > > - { > > - /* Copy the Path Records from the list into the MAD */ > > - if (num_copied < num_rec) > > - { > > + /* copy only if not trimmed */ > > + if (i < num_rec) > > *p_resp_pr = p_pr_item->path_rec; > > - num_copied++; > > - 
} > > + > > cl_qlock_pool_put( &p_rcv->pr_pool, &p_pr_item->pool_item ); > > p_resp_pr++; > > - p_pr_item = (osm_pr_item_t*)cl_qlist_remove_head( p_list ); > > - } > > } Should p_resp_pr only be incremented if i < num_recs ? Also, these comments apply to all the other SA record code as well. -- Hal > > + CL_ASSERT( cl_is_qlist_empty( p_list ) ); > > + > > status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > > > > if( status != IB_SUCCESS ) > > From robert.j.woodruff at intel.com Thu Nov 3 08:54:55 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Thu, 3 Nov 2005 08:54:55 -0800 Subject: [openib-general] RE: Problems with SDP on Itanium In-Reply-To: Message-ID: Woody wrote, >Yes, when I get some time, I will rebuild my kernel with debug >and re-run it. >woody Here are the dmesg logs when it hangs. woody Client side log: es. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240316> of <16384> bytes. ib_sdp DATA: <0> <1171> state <00001171> size <819939> pending <0> falgs <00000000> ib_sdp DATA: <0> <1171> read IOCB <-1> addr <200000000115bd20> users <1> flags <00000000> ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aaae:0003ab3d> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aaaf:0003ab3d> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aab0:0003ab3d> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aab1:0003ab3d> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aab2:0003ab3d> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240317> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240318> of <16384> bytes. 
ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240319> of <16384> bytes. ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240320> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240321> of <16384> bytes. ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aab3:0003ab3d> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aab4:0003ab3d> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aab5:0003ab3d> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> state <00001171> size <738099> pending <49104> falgs <00000000> ib_sdp DATA: <0> <1171> read IOCB <-1> addr <200000000116fcd0> users <1> flags <00000000> ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aab6:0003ab3d> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aab7:0003ab3e> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aab8:0003ab3e> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aab9:0003ab3e> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aaba:0003ab3e> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aabb:0003ab3e> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240322> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240323> of <16384> bytes. ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240324> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240325> of <16384> bytes. 
ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240326> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240327> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240328> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240329> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240330> of <16384> bytes. ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aabc:0003ab3e> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aabd:0003ab3e> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240331> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240332> of <16384> bytes. ib_sdp DATA: <0> <1171> state <00001171> size <558051> pending <0> falgs <00000000> ib_sdp DATA: <0> <1171> read IOCB <-1> addr <200000000119bc20> users <1> flags <00000000> ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aabe:0003ab3f> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aabf:0003ab3f> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aac0:0003ab3f> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aac1:0003ab3f> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aac2:0003ab3f> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240333> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240334> of <16384> bytes. ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240335> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240336> of <16384> bytes. 
ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240337> of <16384> bytes. ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aac3:0003ab3f> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aac4:0003ab3f> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aac5:0003ab3f> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aac6:0003ab3f> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> state <00001171> size <476211> pending <65472> falgs <00000000> ib_sdp DATA: <0> <1171> read IOCB <-1> addr <20000000011afbd0> users <1> flags <00000000> ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aac7:0003ab3f> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aac8:0003ab3f> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aac9:0003ab3f> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240338> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240339> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240340> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240341> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240342> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240343> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240344> of <16384> bytes. ib_sdp DATA: <0> <1171> state <00001171> size <361635> pending <0> falgs <00000000> ib_sdp DATA: <0> <1171> read IOCB <-1> addr <20000000011cbb60> users <1> flags <00000000> ib_sdp DATA: CQ event. 
hashent <0> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aaca:0003ab40> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aacb:0003ab40> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aacc:0003ab40> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aacd:0003ab40> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240345> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240346> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240347> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240348> of <16384> bytes. ib_sdp DATA: <0> <1171> state <00001171> size <296163> pending <0> falgs <00000000> ib_sdp DATA: <0> <1171> read IOCB <-1> addr <20000000011dbb20> users <1> flags <00000000> ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aace:0003ab40> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aacf:0003ab40> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aad0:0003ab40> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240349> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240350> of <16384> bytes. ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240351> of <16384> bytes. 
ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aad1:0003ab40> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aad2:0003ab40> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> state <00001171> size <247059> pending <32736> falgs <00000000> ib_sdp DATA: <0> <1171> read IOCB <-1> addr <20000000011e7af0> users <1> flags <00000000> ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aad3:0003ab40> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aad4:0003ab40> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aad5:0003ab40> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aad6:0003ab40> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aad7:0003ab40> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240352> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240353> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240354> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240355> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240356> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240357> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240358> of <16384> bytes. ib_sdp DATA: <0> <1171> state <00001171> size <132483> pending <0> falgs <00000000> ib_sdp DATA: <0> <1171> read IOCB <-1> addr <2000000001203a80> users <1> flags <00000000> ib_sdp DATA: CQ event. 
hashent <0> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aad8:0003ab41> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aad9:0003ab41> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aada:0003ab41> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240359> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240360> of <16384> bytes. ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240361> of <16384> bytes. ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aadb:0003ab41> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aadc:0003ab41> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aadd:0003ab41> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> state <00001171> size <83379> pending <49104> falgs <00000000> ib_sdp DATA: <0> <1171> read IOCB <-1> addr <200000000120fa50> users <1> flags <00000000> ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240362> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240363> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240364> of <16384> bytes. ib_sdp DATA: <0> <1171> state <00001171> size <34275> pending <0> falgs <00000000> ib_sdp DATA: <0> <1171> read IOCB <-1> addr <200000000121ba20> users <1> flags <00000000> ib_sdp DATA: CQ event. 
hashent <0> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aade:0003ab41> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00004000:0003aadf:0003ab41> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00000613:0003aae0:0003ab41> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <1539> ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240365> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240366> of <16384> bytes. ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240367> of <16384> bytes. ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00000016:0003aae1:0003ab41> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <6> ib_sdp DATA: <0> <1171> send state <1171> size <6> flags <00000000> ib_sdp DATA: <0> <1171> write IOCB <-1> addr <60000ffffffbcaf0> user <1> flag <00000000> ib_sdp DATA: <0> <1171> state <00001171> size <6> pending <6> falgs <00000000> ib_sdp DATA: <0> <1171> read IOCB <-1> addr <60000ffffffbcb00> users <1> flags <00000000> ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240368> of <16384> bytes. ib_sdp DATA: <0> <1171> send state <1171> size <4> flags <00000000> ib_sdp DATA: <0> <1171> write IOCB <-1> addr <60000ffffffbcb00> user <1> flag <00000000> ib_sdp DATA: <0> <1171> send state <1171> size <6> flags <00000000> ib_sdp DATA: <0> <1171> write IOCB <-1> addr <60000ffffffbcaf0> user <1> flag <00000000> ib_sdp DATA: <0> <1171> state <00001171> size <6> pending <0> falgs <00000000> ib_sdp DATA: <0> <1171> read IOCB <-1> addr <60000ffffffbcb00> users <1> flags <00000000> ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> RECV BSDH <000f:00:ff:00000016:0003aae2:0003ab45> ib_sdp DATA: <0> <1171> RECV BUFF, bytes <6> ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240369> of <16384> bytes. 
ib_sdp DATA: <0> <1171> send state <1171> size <2097149> flags <00000000> ib_sdp DATA: <0> <1171> write IOCB <-1> addr <20000000016c4000> user <1> flag <00000000> ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> SENT BSDH <1000:00:ff:10000000:3bab0300:8caa0300> ib_sdp DATA: <0> <1171> SENT BSDH <1000:00:ff:10000000:3cab0300:9aaa0300> ib_sdp DATA: <0> <1171> SENT BSDH <1000:00:ff:10000000:3dab0300:a8aa0300> ib_sdp DATA: <0> <1171> SENT BSDH <1000:00:ff:10000000:3eab0300:b2aa0300> ib_sdp DATA: <0> <1171> SENT BSDH <1000:00:ff:10000000:3fab0300:bbaa0300> ib_sdp DATA: <0> <1171> SENT BSDH <1000:00:ff:10000000:40ab0300:c9aa0300> ib_sdp DATA: <0> <1171> SENT BSDH <1000:00:ff:10000000:41ab0300:d7aa0300> ib_sdp DATA: <0> <1171> SENT BSDH <1000:00:ff:10000000:42ab0300:e0aa0300> ib_sdp DATA: <0> <1171> SENT BSDH <0f00:00:ff:16000000:43ab0300:e1aa0300> ib_sdp DATA: <0> <1171> SENT BSDH <1000:00:ff:14000000:44ab0300:e1aa0300> ib_sdp DATA: <0> <1171> SENT BSDH <1000:00:ff:16000000:45ab0300:e1aa0300> ib_sdp DATA: <0> <1171> SENT BSDH <1000:00:ff:00400000:46ab0300:e2aa0300> ib_sdp DATA: <0> <1171> SENT BSDH <1000:00:ff:00400000:47ab0300:e2aa0300> ib_sdp DATA: <0> <1171> SENT BSDH <1000:00:ff:00400000:48ab0300:e2aa0300> ib_sdp DATA: <0> <1171> SENT BSDH <1000:00:ff:00400000:49ab0300:e2aa0300> ib_sdp DATA: <0> <1171> SENT BSDH <1000:00:ff:00400000:4aab0300:e2aa0300> ib_sdp DATA: CQ event. hashent <0> ib_sdp DATA: <0> <1171> RECV BSDH <0010:00:ff:00000010:0003aae3:0003ab4e> ib_sdp DATA: <0> <1171> POST RECV BUFF wrid <240370> of <16384> bytes. ib_sdp CRTL: info delete <192.168.0.21> <4295552652:4295242151>

Server side log:

DATA: <2> <1171> read IOCB <-1> addr <20000000011f7d70> users <1> flags <00000000> ib_sdp DATA: CQ event.
hashent <2> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00004000:0003ab34:0003aa7e> ib_sdp DATA: <2> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00004000:0003ab35:0003aa7e> ib_sdp DATA: <2> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00004000:0003ab36:0003aa7e> ib_sdp DATA: <2> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: CQ event. hashent <2> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00004000:0003ab37:0003aa7e> ib_sdp DATA: <2> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00004000:0003ab38:0003aa7e> ib_sdp DATA: <2> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: CQ event. hashent <2> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00004000:0003ab39:0003aa7e> ib_sdp DATA: <2> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00000313:0003ab3a:0003aa7e> ib_sdp DATA: <2> <1171> RECV BUFF, bytes <771> ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240450> of <16384> bytes. ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240451> of <16384> bytes. ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240452> of <16384> bytes. ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240453> of <16384> bytes. ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240454> of <16384> bytes. ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240455> of <16384> bytes. ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240456> of <16384> bytes. ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240457> of <16384> bytes. ib_sdp DATA: <2> <1171> send state <1171> size <1572867> flags <00000000> ib_sdp DATA: <2> <1171> write IOCB <-1> addr <2000000001094000> user <1> flag <00000000> ib_sdp DATA: CQ event. hashent <2> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00000010:0003ab3b:0003aa8c> ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240458> of <16384> bytes. ib_sdp DATA: CQ event. 
hashent <2> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:10000000:7eaa0300:30ab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:10000000:7faa0300:3aab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:80aa0300:3aab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:81aa0300:3aab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:82aa0300:3aab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:83aa0300:3aab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:84aa0300:3aab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:85aa0300:3aab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:86aa0300:3aab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:87aa0300:3aab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:88aa0300:3aab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:89aa0300:3aab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:8aaa0300:3aab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:8baa0300:3aab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:8caa0300:3aab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:8daa0300:3bab0300> ib_sdp DATA: CQ event. hashent <2> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00000010:0003ab3c:0003aa9a> ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240459> of <16384> bytes. ib_sdp DATA: CQ event. 
hashent <2> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:8eaa0300:3bab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:8faa0300:3bab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:90aa0300:3bab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:91aa0300:3bab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:92aa0300:3bab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:93aa0300:3bab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:94aa0300:3bab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:95aa0300:3bab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:96aa0300:3bab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:97aa0300:3bab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:98aa0300:3bab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:99aa0300:3bab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:9aaa0300:3bab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:9baa0300:3cab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:9caa0300:3cab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:9daa0300:3cab0300> ib_sdp DATA: CQ event. hashent <2> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00000010:0003ab3d:0003aaa8> ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240460> of <16384> bytes. ib_sdp DATA: CQ event. 
hashent <2> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:9eaa0300:3cab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:9faa0300:3cab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:a0aa0300:3cab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:a1aa0300:3cab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:a2aa0300:3cab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:a3aa0300:3cab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:a4aa0300:3cab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:a5aa0300:3cab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:a6aa0300:3cab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:a7aa0300:3cab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:a8aa0300:3cab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:a9aa0300:3dab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:aaaa0300:3dab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:abaa0300:3dab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:acaa0300:3dab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:adaa0300:3dab0300> ib_sdp DATA: CQ event. hashent <2> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00000010:0003ab3e:0003aab2> ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240461> of <16384> bytes. ib_sdp DATA: CQ event. 
hashent <2> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:aeaa0300:3dab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:afaa0300:3dab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:b0aa0300:3dab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:b1aa0300:3dab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:b2aa0300:3dab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:b3aa0300:3dab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:b4aa0300:3dab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:b5aa0300:3dab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:b6aa0300:3dab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:b7aa0300:3eab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:b8aa0300:3eab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:b9aa0300:3eab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:baaa0300:3eab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:bbaa0300:3eab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:bcaa0300:3eab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:bdaa0300:3eab0300> ib_sdp DATA: CQ event. hashent <2> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00000010:0003ab3f:0003aabb> ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240462> of <16384> bytes. ib_sdp DATA: CQ event. hashent <2> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00000010:0003ab40:0003aac9> ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240463> of <16384> bytes. ib_sdp DATA: CQ event. 
hashent <2> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:beaa0300:3fab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:bfaa0300:3fab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:c0aa0300:3fab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:c1aa0300:3fab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:c2aa0300:3fab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:c3aa0300:3fab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:c4aa0300:3fab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:c5aa0300:3fab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:c6aa0300:3fab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:c7aa0300:3fab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:c8aa0300:3fab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:c9aa0300:3fab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:caaa0300:40ab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:cbaa0300:40ab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:ccaa0300:40ab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:cdaa0300:40ab0300> ib_sdp DATA: CQ event. hashent <2> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00000010:0003ab41:0003aad7> ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240464> of <16384> bytes. ib_sdp DATA: CQ event. 
hashent <2> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:ceaa0300:40ab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:cfaa0300:40ab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:d0aa0300:40ab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:d1aa0300:40ab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:d2aa0300:40ab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:d3aa0300:40ab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:d4aa0300:40ab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:d5aa0300:40ab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:d6aa0300:40ab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:d7aa0300:40ab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:d8aa0300:41ab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:d9aa0300:41ab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:daaa0300:41ab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:dbaa0300:41ab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:dcaa0300:41ab0300> ib_sdp DATA: <2> <1171> SENT BSDH <1000:00:ff:00400000:ddaa0300:41ab0300> ib_sdp DATA: <2> <1171> send state <1171> size <6> flags <00000000> ib_sdp DATA: <2> <1171> write IOCB <-1> addr <60000ffffffbcc40> user <1> flag <00000000> ib_sdp DATA: <2> <1171> state <00001171> size <6> pending <0> falgs <00000000> ib_sdp DATA: <2> <1171> read IOCB <-1> addr <60000ffffffbcc50> users <1> flags <00000000> ib_sdp DATA: CQ event. hashent <2> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00000010:0003ab42:0003aae0> ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240465> of <16384> bytes. ib_sdp DATA: CQ event. hashent <2> ib_sdp DATA: <2> <1171> RECV BSDH <000f:00:ff:00000016:0003ab43:0003aae1> ib_sdp DATA: <2> <1171> RECV BUFF, bytes <6> ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240466> of <16384> bytes. 
ib_sdp DATA: <2> <1171> state <00001171> size <4> pending <0> falgs <00000000> ib_sdp DATA: <2> <1171> read IOCB <-1> addr <60000ffffffbcc50> users <1> flags <00000000> ib_sdp DATA: CQ event. hashent <2> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00000014:0003ab44:0003aae1> ib_sdp DATA: <2> <1171> RECV BUFF, bytes <4> ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240467> of <16384> bytes. ib_sdp DATA: CQ event. hashent <2> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00000016:0003ab45:0003aae1> ib_sdp DATA: <2> <1171> RECV BUFF, bytes <6> ib_sdp DATA: <2> <1171> send state <1171> size <6> flags <00000000> ib_sdp DATA: <2> <1171> write IOCB <-1> addr <60000ffffffbcc40> user <1> flag <00000000> ib_sdp DATA: <2> <1171> state <00001171> size <6> pending <6> falgs <00000000> ib_sdp DATA: <2> <1171> read IOCB <-1> addr <60000ffffffbcc50> users <1> flags <00000000> ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240468> of <16384> bytes. ib_sdp DATA: <2> <1171> state <00001171> size <2097149> pending <0> falgs <00000000> ib_sdp DATA: <2> <1171> read IOCB <-1> addr <20000000016b4000> users <1> flags <00000000> ib_sdp DATA: CQ event. hashent <2> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00004000:0003ab46:0003aae2> ib_sdp DATA: <2> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00004000:0003ab47:0003aae2> ib_sdp DATA: <2> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: CQ event. hashent <2> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00004000:0003ab48:0003aae2> ib_sdp DATA: <2> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00004000:0003ab49:0003aae2> ib_sdp DATA: <2> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00004000:0003ab4a:0003aae2> ib_sdp DATA: <2> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: CQ event. 
hashent <2> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00004000:0003ab4b:0003aae2> ib_sdp DATA: <2> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00004000:0003ab4c:0003aae2> ib_sdp DATA: <2> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: CQ event. hashent <2> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00004000:0003ab4d:0003aae2> ib_sdp DATA: <2> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00004000:0003ab4e:0003aae2> ib_sdp DATA: <2> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: CQ event. hashent <2> ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240469> of <16384> bytes. ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240470> of <16384> bytes. ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240471> of <16384> bytes. ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240472> of <16384> bytes. ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240473> of <16384> bytes. ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240474> of <16384> bytes. ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240475> of <16384> bytes. ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240476> of <16384> bytes. ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240477> of <16384> bytes. ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00004000:0003ab4f:0003aae2> ib_sdp DATA: <2> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00004000:0003ab50:0003aae2> ib_sdp DATA: <2> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00004000:0003ab51:0003aae2> ib_sdp DATA: <2> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00004000:0003ab52:0003aae2> ib_sdp DATA: <2> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: CQ event. 
hashent <2> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00004000:0003ab53:0003aae3> ib_sdp DATA: <2> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00004000:0003ab54:0003aae3> ib_sdp DATA: <2> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <2> <1171> RECV BSDH <0010:00:ff:00004000:0003ab55:0003aae3> ib_sdp DATA: <2> <1171> RECV BUFF, bytes <16368> ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240478> of <16384> bytes. ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240479> of <16384> bytes. ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240480> of <16384> bytes. ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240481> of <16384> bytes. ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240482> of <16384> bytes. ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240483> of <16384> bytes. ib_sdp DATA: <2> <1171> POST RECV BUFF wrid <240484> of <16384> bytes. ib_sdp DATA: <2> <1171> state <00001171> size <1835261> pending <0> falgs <00000000> ib_sdp DATA: <2> <1171> read IOCB <-1> addr <20000000016f3f00> users <1> flags <00000000>

From robert.j.woodruff at intel.com Thu Nov 3 09:06:44 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Thu, 3 Nov 2005 09:06:44 -0800
Subject: [openib-general] RE: Problems with SDP on Itanium
In-Reply-To:
Message-ID:

Woody wrote,
>Here are the dmesg logs when it hangs.

This might have something to do with it.

>ib_sdp CRTL: info delete <192.168.0.21> <4295552652:4295242151>

From the code, it looks like this message gets printed when SDP calls
sdp_path_info_destroy(). The question is, why is it tearing down the
connection?

/*
 * sdp_link_sweep - periodic path information cleanup function
 */
static void sdp_link_sweep(void *data)
{
	struct sdp_path_info *info;
	struct sdp_path_info *sweep;

	sweep = info_list;
	while (sweep) {
		info = sweep;
		sweep = sweep->next;

		if (jiffies > (info->use + SDP_LINK_INFO_TIMEOUT)) {
			sdp_dbg_ctrl(NULL, "info delete <%d.%d.%d.%d> <%lu:%lu>",
				     info->dst & 0x000000ff,
				     (info->dst & 0x0000ff00) >> 8,
				     (info->dst & 0x00ff0000) >> 16,
				     (info->dst & 0xff000000) >> 24,
				     jiffies, info->use);
			sdp_path_info_destroy(info, -ETIMEDOUT);
		}
	}

From hch at lst.de Thu Nov 3 09:12:10 2005
From: hch at lst.de (Christoph Hellwig)
Date: Thu, 3 Nov 2005 18:12:10 +0100
Subject: [openib-general] compilation platform dependencies
In-Reply-To: <20051103165252.GA32699@esmail.cup.hp.com>
References: <527jbsfdii.fsf@cisco.com> <20051101191905.GE6815@esmail.cup.hp.com> <52pspkdt7z.fsf@cisco.com> <20051103165252.GA32699@esmail.cup.hp.com>
Message-ID: <20051103171210.GA12783@lst.de>

On Thu, Nov 03, 2005 at 08:52:52AM -0800, Grant Grundler wrote:
> > I'm not sure if __u64 driver_data[0]; forces alignment to an 8-byte
> > boundary on i386... does it?
>
> I'm now convinced it doesn't on x86.
> See output below.

Yes, alignment rules for x86 are different from every other architecture
in that respect. It causes a lot of problems with ioctl translations
for x86 binaries on ia64/x86_64.

For the private data I'd suggest you copy the network driver layer
approach; see alloc_netdev and netdev_priv for details.

From johnip at sgi.com Thu Nov 3 09:39:03 2005
From: johnip at sgi.com (John Partridge)
Date: Thu, 03 Nov 2005 11:39:03 -0600
Subject: [openib-general] mvapich-gen2 on 2 x 16 CPU SGI Altix 1330 cluster
Message-ID: <436A4B37.1080801@sgi.com>

Hi DK,

I just thought you would like to know that I have now tested the Pallas
benchmark on a two node SGI Altix 1330 cluster using OpenIB and
mvapich-gen2. Each node had 16 CPUs.
To do this I had to change SMPI_MAX_NUMLOCALNODES to be defined as 16
instead of the normal 4 for the test. I ran a 2x16 (32 total) CPU Pallas
benchmark several times with no hang ups or errors. I'm wondering if
there would be any more changes I would need to make for scaling to much
larger systems. I do plan at some point in the near future to test this
on a much larger system with a LOT more CPUs.

The test was conducted using a "kernel.org" 2.6.14 kernel and OpenIB svn
gen2 revision 3926, using Voltaire HCAs and switch.

We will be demonstrating OpenIB and mvapich-gen2 MPI at Supercomputing 05
(running smaller jobs though, because the 32-way jobs take so long to
complete). We will also demo rdma_lat, rdma_bw and IPoIB.

I can send you the Pallas results if you are interested.

Regards
John

--
John Partridge
Silicon Graphics Inc
Tel: 651-683-3428
Vnet: 233-3428
E-Mail: johnip at sgi.com

From mshefty at ichips.intel.com Thu Nov 3 09:44:18 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 03 Nov 2005 09:44:18 -0800
Subject: [openib-general] netstat
In-Reply-To:
References:
Message-ID: <436A4C71.3020007@ichips.intel.com>

yipee wrote:
> Is there some way to view the list of current CM end points in their various
> states (listen,connection)?

Nothing like this is available today. I can record this as something to
add in the future, but it's unlikely to be a high priority for at least
a few weeks.

- Sean

From ardavis at ichips.intel.com Thu Nov 3 09:51:14 2005
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Thu, 03 Nov 2005 09:51:14 -0800
Subject: [openib-general] [PATCH] Re: uDAPL again
In-Reply-To: <436938D5.6030403@ichips.intel.com>
References: <436906F0.3050803@cs.rutgers.edu> <43691B71.2040500@ichips.intel.com> <43692526.3030003@cs.rutgers.edu> <436938D5.6030403@ichips.intel.com>
Message-ID: <436A4E12.1060005@ichips.intel.com>

Arlin Davis wrote:
> Aniruddha Bohra wrote:
>
>> I am not sure, but aren't uCM and uAT simply for connection
>> establishment?
>>
> Yes, but they also set up many of the transfer attributes of the
> connected QP. The uCM/uAT version uses path_records from the SA query
> but the socket_CM version just builds them by hand similar to the way
> ibv_rc_pingpong does. You would have to look at the
> pathrecord->pktlifetime to see the actual timeout value being used.
>

Ok, I added some debug and the path record returned from uAT looks
suspect. Here are the results from tuAT and opensm running on my
cluster. Path record pktlife is 0 (uCM adds 1) so the ACK timeout value
for this connection will be very short.

path_comp_handler: ctxt 0x525fa0, req_id 90 rec_num 1
path_comp_handler: SRC GID subnet fe80000000000000 id 0002c9020000409d
path_comp_handler: DST GID subnet fe80000000000000 id 0002c90200004071
path_comp_handler: slid 5 dlid 2 mtu 120203(2) pktlife 0(0) <<< ???
path_comp_handler: hops 0 npaths 0 pkey ffff tclass 0 rate 0(0) <<< ???

Hal, can you take a look at uAT and see if the copy to user space is
working correctly?

Aniruddha, can you apply the following patch and send us the output from
your run?
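
To put a number on "very short": as far as the IB encoding goes, both PacketLifeTime and a nonzero Local ACK Timeout are exponents, and the time they stand for is 4.096 us * 2^value. The sketch below only does that arithmetic. The helper name ib_time_exp_to_ns is made up for this note and is not part of uDAPL, uCM or libibverbs; the ACK timeout value 0 is, as far as I know, a special case and is not handled here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of uDAPL, uCM or libibverbs.
 *
 * Convert an IB time exponent (PathRecord PacketLifeTime, or a nonzero
 * QP Local ACK Timeout) to nanoseconds: 4.096 us * 2^value.
 * 4.096 us is exactly 4096 ns, so integer math is enough.
 */
uint64_t ib_time_exp_to_ns(unsigned int value)
{
	assert(value <= 31);	/* assumed 5-bit field, so 0..31 */
	return UINT64_C(4096) << value;
}
```

With the values from the debug output, ib_time_exp_to_ns(0) is 4096 ns for the reported pktlife, and ib_time_exp_to_ns(1) is 8192 ns once uCM adds 1. For comparison, an exponent of 14 works out to roughly 67 ms.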

-arlin

Signed-off-by: Arlin Davis

Index: dapl/openib/dapl_ib_cm.c
===================================================================
--- dapl/openib/dapl_ib_cm.c (revision 3951)
+++ dapl/openib/dapl_ib_cm.c (working copy)
@@ -136,14 +136,27 @@
 	dapl_dbg_log(DAPL_DBG_TYPE_CM,
 		" path_comp_handler: SRC GID subnet %016llx id %016llx\n",
-		(unsigned long long)cpu_to_be64(conn->dapl_rt.sgid.global.subnet_prefix),
-		(unsigned long long)cpu_to_be64(conn->dapl_rt.sgid.global.interface_id) );
+		(unsigned long long)cpu_to_be64(conn->dapl_path.sgid.global.subnet_prefix),
+		(unsigned long long)cpu_to_be64(conn->dapl_path.sgid.global.interface_id) );
 	dapl_dbg_log(DAPL_DBG_TYPE_CM,
 		" path_comp_handler: DST GID subnet %016llx id %016llx\n",
-		(unsigned long long)cpu_to_be64(conn->dapl_rt.dgid.global.subnet_prefix),
-		(unsigned long long)cpu_to_be64(conn->dapl_rt.dgid.global.interface_id) );
+		(unsigned long long)cpu_to_be64(conn->dapl_path.dgid.global.subnet_prefix),
+		(unsigned long long)cpu_to_be64(conn->dapl_path.dgid.global.interface_id) );
+	dapl_dbg_log(DAPL_DBG_TYPE_CM,
+		" path_comp_handler: slid %x dlid %x mtu %x(%x) pktlife %x(%x)\n",
+		ntohs(conn->dapl_path.slid), ntohs(conn->dapl_path.dlid),
+		conn->dapl_path.mtu, conn->dapl_path.mtu_selector,
+		conn->dapl_path.packet_life_time,
+		conn->dapl_path.packet_life_time_selector );
+
+	dapl_dbg_log(DAPL_DBG_TYPE_CM,
+		" path_comp_handler: hops %x npaths %x pkey %x tclass %x rate %x(%x)\n",
+		conn->dapl_path.hop_limit, conn->dapl_path.numb_path,
+		conn->dapl_path.pkey, conn->dapl_path.traffic_class,
+		conn->dapl_path.rate, conn->dapl_path.rate_selector);
+
 	if (rec_num <= 0) {
 		dapl_dbg_log(DAPL_DBG_TYPE_CM,
 			" path_comp_handler: ERR %d retry %d\n",

From halr at voltaire.com Thu Nov 3 09:57:46 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 03 Nov 2005 12:57:46 -0500
Subject: [openib-general] [PATCH] Re: uDAPL again
In-Reply-To: <436A4E12.1060005@ichips.intel.com>
References: <436906F0.3050803@cs.rutgers.edu>
<43691B71.2040500@ichips.intel.com> <43692526.3030003@cs.rutgers.edu> <436938D5.6030403@ichips.intel.com> <436A4E12.1060005@ichips.intel.com> Message-ID: <1131040666.4340.12.camel@hal.voltaire.com> Hi Arlin, On Thu, 2005-11-03 at 12:51, Arlin Davis wrote: > Arlin Davis wrote: > > > Aniruddha Bohra wrote: > > > >> I am not sure, but arent uCM and uAT simply for connection > >> establishment? > >> > > Yes, but they also set up many of the transfer attributes of the > > connected QP. The uCM/uAT version uses path_records from the SA query > > but the socket_CM version just builds them by hand similiar to the way > > ibv_rc_pingpong does. You would have to look at the > > pathrecord->pktlifetime to see the actual timeout value being used. > > > Ok, I added some debug and it looks like the path record returned from > uAT looks suspect. Here are the results from tuAT and opensm running on > my cluster. Path record pktlife is 0 (uCM adds 1) so the ACK timeout > value for this connection will be very short. > > path_comp_handler: ctxt 0x525fa0, req_id 90 rec_num 1 > path_comp_handler: SRC GID subnet fe80000000000000 id 0002c9020000409d > path_comp_handler: DST GID subnet fe80000000000000 id 0002c90200004071 > path_comp_handler: slid 5 dlid 2 mtu 120203(2) pktlife > 0(0) <<< ??? > path_comp_handler: hops 0 npaths 0 pkey ffff tclass 0 rate > 0(0) <<< ??? > > Hal, can you take a look at uAT and see if the copy to user space is > working correctly. Just want to clarify what I should be looking for: So you suspect pktlife and rate being bad (and the rest of the SA PR look OK) ? Is OpenSM being used in Aniruddha's subnet ? 
-- Hal From sean.hefty at intel.com Thu Nov 3 10:01:12 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 3 Nov 2005 10:01:12 -0800 Subject: [openib-general] [RFC] patch to export userspace to kernel QP attribute structure Message-ID: The following patch would expose support in uverbs to define a QP attribute structure that could be used by other kernel modules (e.g. the IB CM and CMA) needing to exchange QP attribute information with a userspace library. I've only compile tested this patch, since I'm only looking for feedback on whether this approach is acceptable. Similar functionality would be added to libibverbs. As a side note, uverbs defines two structures that are almost the same: ib_uverbs_qp_dest and ib_uverbs_ah_attr. The only difference is that the reserved fields in each are in different locations, so eliminating one of the structures would result in an abi change. - Sean Index: core/uverbs_cmd.c =================================================================== --- core/uverbs_cmd.c (revision 3947) +++ core/uverbs_cmd.c (working copy) @@ -808,6 +808,75 @@ return ret ? ret : in_len; } +static void ib_copy_ah_attr_to_user(struct ib_uverbs_ah_attr *dst, + struct ib_ah_attr *src) +{ + memcpy(dst->grh.dgid, src->grh.dgid.raw, sizeof dst->grh.dgid); + dst->grh.flow_label = src->grh.flow_label; + dst->grh.sgid_index = src->grh.sgid_index; + dst->grh.hop_limit = src->grh.hop_limit; + dst->grh.traffic_class = src->grh.traffic_class; + dst->dlid = src->dlid; + dst->sl = src->sl; + dst->src_path_bits = src->src_path_bits; + dst->static_rate = src->static_rate; + dst->is_global = src->ah_flags & IB_AH_GRH ? 
1 : 0; + dst->port_num = src->port_num; +} + +static void ib_copy_ah_attr_from_user(struct ib_ah_attr *dst, + struct ib_uverbs_ah_attr *src) +{ + memcpy(dst->grh.dgid.raw, src->grh.dgid, sizeof dst->grh.dgid); + dst->grh.flow_label = src->grh.flow_label; + dst->grh.sgid_index = src->grh.sgid_index; + dst->grh.hop_limit = src->grh.hop_limit; + dst->grh.traffic_class = src->grh.traffic_class; + dst->dlid = src->dlid; + dst->sl = src->sl; + dst->src_path_bits = src->src_path_bits; + dst->static_rate = src->static_rate; + dst->ah_flags = src->is_global ? IB_AH_GRH : 0; + dst->port_num = src->port_num; +} + +void ib_copy_qp_attr_to_user(struct ib_uverbs_qp_attr *dst, + struct ib_qp_attr *src) +{ + dst->cur_qp_state = src->cur_qp_state; + dst->path_mtu = src->path_mtu; + dst->path_mig_state = src->path_mig_state; + dst->qkey = src->qkey; + dst->rq_psn = src->rq_psn; + dst->sq_psn = src->sq_psn; + dst->dest_qp_num = src->dest_qp_num; + dst->qp_access_flags = src->qp_access_flags; + + dst->max_send_wr = src->cap.max_send_wr; + dst->max_recv_wr = src->cap.max_recv_wr; + dst->max_send_sge = src->cap.max_send_sge; + dst->max_recv_sge = src->cap.max_recv_sge; + dst->max_inline_data = src->cap.max_inline_data; + + ib_copy_ah_attr_to_user(&dst->ah_attr, &src->ah_attr); + ib_copy_ah_attr_to_user(&dst->alt_ah_attr, &src->alt_ah_attr); + + dst->pkey_index = src->pkey_index; + dst->alt_pkey_index = src->alt_pkey_index; + dst->en_sqd_async_notify = src->en_sqd_async_notify; + dst->sq_draining = src->sq_draining; + dst->max_rd_atomic = src->max_rd_atomic; + dst->max_dest_rd_atomic = src->max_dest_rd_atomic; + dst->min_rnr_timer = src->min_rnr_timer; + dst->port_num = src->port_num; + dst->timeout = src->timeout; + dst->retry_cnt = src->retry_cnt; + dst->rnr_retry = src->rnr_retry; + dst->alt_port_num = src->alt_port_num; + dst->alt_timeout = src->alt_timeout; +} +EXPORT_SYMBOL(ib_copy_qp_attr_to_user); + ssize_t ib_uverbs_create_qp(struct ib_uverbs_file *file, const char __user *buf, 
int in_len, int out_len) @@ -1433,16 +1502,7 @@ uobj->user_handle = cmd.user_handle; uobj->context = file->ucontext; - attr.dlid = cmd.attr.dlid; - attr.sl = cmd.attr.sl; - attr.src_path_bits = cmd.attr.src_path_bits; - attr.static_rate = cmd.attr.static_rate; - attr.port_num = cmd.attr.port_num; - attr.grh.flow_label = cmd.attr.grh.flow_label; - attr.grh.sgid_index = cmd.attr.grh.sgid_index; - attr.grh.hop_limit = cmd.attr.grh.hop_limit; - attr.grh.traffic_class = cmd.attr.grh.traffic_class; - memcpy(attr.grh.dgid.raw, cmd.attr.grh.dgid, 16); + ib_copy_ah_attr_from_user(&attr, &cmd.attr); ah = ib_create_ah(pd, &attr); if (IS_ERR(ah)) { Index: include/rdma/ib_user_verbs.h =================================================================== --- include/rdma/ib_user_verbs.h (revision 3947) +++ include/rdma/ib_user_verbs.h (working copy) @@ -38,6 +38,7 @@ #define IB_USER_VERBS_H #include +#include /* * Increment this value if any changes that break userspace ABI @@ -311,6 +312,64 @@ __u32 async_events_reported; }; +struct ib_uverbs_global_route { + __u8 dgid[16]; + __u32 flow_label; + __u8 sgid_index; + __u8 hop_limit; + __u8 traffic_class; + __u8 reserved; +}; + +struct ib_uverbs_ah_attr { + struct ib_uverbs_global_route grh; + __u16 dlid; + __u8 sl; + __u8 src_path_bits; + __u8 static_rate; + __u8 is_global; + __u8 port_num; + __u8 reserved; +}; + +struct ib_uverbs_qp_attr { + __u32 qp_attr_mask; + __u32 qp_state; + __u32 cur_qp_state; + __u32 path_mtu; + __u32 path_mig_state; + __u32 qkey; + __u32 rq_psn; + __u32 sq_psn; + __u32 dest_qp_num; + __u32 qp_access_flags; + + struct ib_uverbs_ah_attr ah_attr; + struct ib_uverbs_ah_attr alt_ah_attr; + + /* ib_qp_cap */ + __u32 max_send_wr; + __u32 max_recv_wr; + __u32 max_send_sge; + __u32 max_recv_sge; + __u32 max_inline_data; + + __u16 pkey_index; + __u16 alt_pkey_index; + __u8 en_sqd_async_notify; + __u8 sq_draining; + __u8 max_rd_atomic; + __u8 max_dest_rd_atomic; + __u8 min_rnr_timer; + __u8 port_num; + __u8 timeout; 
+ __u8 retry_cnt; + __u8 rnr_retry; + __u8 alt_port_num; + __u8 alt_timeout; + __u8 reserved[5]; +}; + struct ib_uverbs_create_qp { __u64 response; __u64 user_handle; @@ -482,26 +541,6 @@ __u32 bad_wr; }; -struct ib_uverbs_global_route { - __u8 dgid[16]; - __u32 flow_label; - __u8 sgid_index; - __u8 hop_limit; - __u8 traffic_class; - __u8 reserved; -}; - -struct ib_uverbs_ah_attr { - struct ib_uverbs_global_route grh; - __u16 dlid; - __u8 sl; - __u8 src_path_bits; - __u8 static_rate; - __u8 is_global; - __u8 port_num; - __u8 reserved; -}; - struct ib_uverbs_create_ah { __u64 response; __u64 user_handle; @@ -568,4 +607,7 @@ __u32 events_reported; }; +void ib_copy_qp_attr_to_user(struct ib_uverbs_qp_attr *dst, + struct ib_qp_attr *src); + #endif /* IB_USER_VERBS_H */ From rolandd at cisco.com Thu Nov 3 10:12:21 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 03 Nov 2005 10:12:21 -0800 Subject: [openib-general] Re: [RFC] patch to export userspace to kernel QP attribute structure In-Reply-To: (Sean Hefty's message of "Thu, 3 Nov 2005 10:01:12 -0800") References: Message-ID: <524q6t61oa.fsf@cisco.com> Seems OK but maybe we should create a new file (uverbs_marshall.c?) rather than dumping more stuff into uverbs_cmd.c. That file is big enough as it is. Also: > --- include/rdma/ib_user_verbs.h (revision 3947) > +++ include/rdma/ib_user_verbs.h (working copy) > @@ -38,6 +38,7 @@ > #define IB_USER_VERBS_H > > #include > +#include [snip] > +void ib_copy_qp_attr_to_user(struct ib_uverbs_qp_attr *dst, > + struct ib_qp_attr *src); I very carefully made ib_user_verbs.h a file that did not include any kernel internals and could be safely included from userspace. So this needs to go in a different file (probably just ib_verbs.h is fine). - R. 
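The split discussed above (a header that is safe to include from userspace and carries only fixed-width ABI structures, with the copy routines kept in kernel-only code) can be sketched in plain C. This is a minimal illustration under stated assumptions, not the patch itself: every `demo_` name, the stand-in "kernel" structures, and the `DEMO_AH_GRH` flag are hypothetical and exist only so the example compiles outside the kernel. The user-side field layout follows the ib_uverbs_ah_attr definition quoted in the patch; zeroing the destination first is an addition in this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Part 1: what a userspace-safe header would hold (fixed-width types only). */
struct demo_user_global_route {
    uint8_t  dgid[16];
    uint32_t flow_label;
    uint8_t  sgid_index;
    uint8_t  hop_limit;
    uint8_t  traffic_class;
    uint8_t  reserved;
};

struct demo_user_ah_attr {
    struct demo_user_global_route grh;
    uint16_t dlid;
    uint8_t  sl;
    uint8_t  src_path_bits;
    uint8_t  static_rate;
    uint8_t  is_global;
    uint8_t  port_num;
    uint8_t  reserved;
};

/* Compile-time check that the ABI structure has no hidden padding. */
typedef char demo_ah_attr_size_check[
    (sizeof(struct demo_user_ah_attr) == 32) ? 1 : -1];

/* Part 2: kernel-only side (hypothetical stand-ins for kernel types). */
#define DEMO_AH_GRH 1u /* stand-in for the kernel's IB_AH_GRH flag */

struct demo_kernel_global_route {
    uint8_t  dgid_raw[16];
    uint32_t flow_label;
    uint8_t  sgid_index;
    uint8_t  hop_limit;
    uint8_t  traffic_class;
};

struct demo_kernel_ah_attr {
    struct demo_kernel_global_route grh;
    uint16_t dlid;
    uint8_t  sl;
    uint8_t  src_path_bits;
    uint8_t  static_rate;
    uint8_t  ah_flags;
    uint8_t  port_num;
};

/* Kernel format to user format; reserved fields are zeroed first. */
static void demo_copy_ah_attr_to_user(struct demo_user_ah_attr *dst,
                                      const struct demo_kernel_ah_attr *src)
{
    assert(dst != NULL && src != NULL);
    memset(dst, 0, sizeof *dst);
    memcpy(dst->grh.dgid, src->grh.dgid_raw, sizeof dst->grh.dgid);
    dst->grh.flow_label    = src->grh.flow_label;
    dst->grh.sgid_index    = src->grh.sgid_index;
    dst->grh.hop_limit     = src->grh.hop_limit;
    dst->grh.traffic_class = src->grh.traffic_class;
    dst->dlid          = src->dlid;
    dst->sl            = src->sl;
    dst->src_path_bits = src->src_path_bits;
    dst->static_rate   = src->static_rate;
    dst->is_global     = (src->ah_flags & DEMO_AH_GRH) ? 1 : 0;
    dst->port_num      = src->port_num;
}

/* User format back to kernel format. */
static void demo_copy_ah_attr_from_user(struct demo_kernel_ah_attr *dst,
                                        const struct demo_user_ah_attr *src)
{
    assert(dst != NULL && src != NULL);
    memset(dst, 0, sizeof *dst);
    memcpy(dst->grh.dgid_raw, src->grh.dgid, sizeof dst->grh.dgid_raw);
    dst->grh.flow_label    = src->grh.flow_label;
    dst->grh.sgid_index    = src->grh.sgid_index;
    dst->grh.hop_limit     = src->grh.hop_limit;
    dst->grh.traffic_class = src->grh.traffic_class;
    dst->dlid          = src->dlid;
    dst->sl            = src->sl;
    dst->src_path_bits = src->src_path_bits;
    dst->static_rate   = src->static_rate;
    dst->ah_flags      = src->is_global ? DEMO_AH_GRH : 0;
    dst->port_num      = src->port_num;
}
```

Only Part 1 would need to be visible to userspace; Part 2 depends on kernel-internal types, which is why the declarations of the copy helpers are better kept out of the shared header.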
From rpandit at silverstorm.com Thu Nov 3 10:15:57 2005 From: rpandit at silverstorm.com (Pandit, Ranjit) Date: Thu, 3 Nov 2005 13:15:57 -0500 Subject: [openib-general] [ANNOUNCE] Contribute RDS (Reliable Datagram Sockets) to OpenIB Message-ID: <5D78D28F88822E4D8702BB9EEF1A4367C2DD62@mercury.infiniconsys.com> I'm hoping to start the port soon. It should have started earlier, but unfortunately I got side tracked by some unforeseen issues. Ranjit > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Wednesday, November 02, 2005 3:28 PM > To: Pandit, Ranjit; openib-general at openib.org > Subject: Re: [openib-general] [ANNOUNCE] Contribute RDS (Reliable Datagram > Sockets) to OpenIB > > What are your plans for porting the RDS code so that it works with the > upstream Linux IB stack? I've only seen a couple of checkins, and the > code that you've dropped so far doesn't look usable and needs a lot of > cleanup. There's not even a Makefile there. > > Someone uncharitable might believe that the whole purpose of this > exercise was just to be able to issue your press release > (http://silverstorm.com/news/rel/092005.asp). > > - R. From mshefty at ichips.intel.com Thu Nov 3 10:29:20 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 03 Nov 2005 10:29:20 -0800 Subject: [openib-general] Re: [RFC] patch to export userspace to kernel QP attribute structure In-Reply-To: <524q6t61oa.fsf@cisco.com> References: <524q6t61oa.fsf@cisco.com> Message-ID: <436A5700.8090102@ichips.intel.com> Roland Dreier wrote: > Seems OK but maybe we should create a new file (uverbs_marshall.c?) Sounds fine, but see below... > I very carefully made ib_user_verbs.h a file that did not include any > kernel internals and could be safely included from userspace. So this > needs to go in a different file (probably just ib_verbs.h is fine). I have no problem moving this. 
If the declaration goes in ib_verbs.h, should the function just go in verbs.c (or a new file) and be included as part of ib_core? It just seems a little off to me to implement functions declared in a single header file in separate modules. Also, I'll need to do this with a path record as well. I can either include the routine as part of ib_sa, add a new module, ib_user_sa, or drop that call into one of the existing ib_user_* modules. - Sean From bohra at cs.rutgers.edu Thu Nov 3 10:38:15 2005 From: bohra at cs.rutgers.edu (Aniruddha Bohra) Date: Thu, 03 Nov 2005 13:38:15 -0500 Subject: [openib-general] [PATCH] Re: uDAPL again In-Reply-To: <436A4E12.1060005@ichips.intel.com> References: <436906F0.3050803@cs.rutgers.edu> <43691B71.2040500@ichips.intel.com> <43692526.3030003@cs.rutgers.edu> <436938D5.6030403@ichips.intel.com> <436A4E12.1060005@ichips.intel.com> Message-ID: <436A5917.1080306@cs.rutgers.edu> Aniruddha, can you apply the following patch and send us the output from your run? Hi Arlin The log is at http://www.cs.rutgers.edu/~bohra/dapl.log. Hal, OpenSM is running on our subnet. 
Aniruddha > > -arlin > > Signed-off by: Arlin Davis > > Index: dapl/openib/dapl_ib_cm.c > =================================================================== > --- dapl/openib/dapl_ib_cm.c (revision 3951) > +++ dapl/openib/dapl_ib_cm.c (working copy) > @@ -136,14 +136,27 @@ > > dapl_dbg_log(DAPL_DBG_TYPE_CM, > " path_comp_handler: SRC GID subnet %016llx id %016llx\n", > - (unsigned long > long)cpu_to_be64(conn->dapl_rt.sgid.global.subnet_prefix), > - (unsigned long > long)cpu_to_be64(conn->dapl_rt.sgid.global.interface_id) ); > + (unsigned long > long)cpu_to_be64(conn->dapl_path.sgid.global.subnet_prefix), > + (unsigned long > long)cpu_to_be64(conn->dapl_path.sgid.global.interface_id) ); > > dapl_dbg_log(DAPL_DBG_TYPE_CM, > " path_comp_handler: DST GID subnet %016llx id %016llx\n", > - (unsigned long > long)cpu_to_be64(conn->dapl_rt.dgid.global.subnet_prefix), > - (unsigned long > long)cpu_to_be64(conn->dapl_rt.dgid.global.interface_id) ); > + (unsigned long > long)cpu_to_be64(conn->dapl_path.dgid.global.subnet_prefix), > + (unsigned long > long)cpu_to_be64(conn->dapl_path.dgid.global.interface_id) ); > > + dapl_dbg_log(DAPL_DBG_TYPE_CM, > + " path_comp_handler: slid %x dlid %x mtu %x(%x) > pktlife %x(%x)\n", > + ntohs(conn->dapl_path.slid), ntohs(conn->dapl_path.dlid), > + conn->dapl_path.mtu, conn->dapl_path.mtu_selector, > + conn->dapl_path.packet_life_time, > + conn->dapl_path.packet_life_time_selector ); > + > + dapl_dbg_log(DAPL_DBG_TYPE_CM, > + " path_comp_handler: hops %x npaths %x pkey %x tclass > %x rate %x(%x)\n", > + conn->dapl_path.hop_limit, conn->dapl_path.numb_path, > + conn->dapl_path.pkey, conn->dapl_path.traffic_class, > + conn->dapl_path.rate, conn->dapl_path.rate_selector); > + > if (rec_num <= 0) { > dapl_dbg_log(DAPL_DBG_TYPE_CM, > " path_comp_handler: ERR %d retry %d\n", > > > From ardavis at ichips.intel.com Thu Nov 3 10:49:19 2005 From: ardavis at ichips.intel.com (Arlin Davis) Date: Thu, 03 Nov 2005 10:49:19 -0800 Subject: 
[openib-general] [PATCH] Re: uDAPL again In-Reply-To: <1131040666.4340.12.camel@hal.voltaire.com> References: <436906F0.3050803@cs.rutgers.edu> <43691B71.2040500@ichips.intel.com> <43692526.3030003@cs.rutgers.edu> <436938D5.6030403@ichips.intel.com> <436A4E12.1060005@ichips.intel.com> <1131040666.4340.12.camel@hal.voltaire.com> Message-ID: <436A5BAF.7060202@ichips.intel.com> Hal Rosenstock wrote: >Hi Arlin, > > >> >>Hal, can you take a look at uAT and see if the copy to user space is >>working correctly. >> >> > >Just want to clarify what I should be looking for: > >So you suspect pktlife and rate being bad (and the rest of the SA PR >look OK) ? > > Yes, you can see from the debug print that GIDs, LIDs, pkey, mtu look ok. Here is Aniruddha's latest output from a run with opensm: path_comp_handler: ctxt 0x808a008, req_id 292 rec_num 1 path_comp_handler: SRC GID subnet fe80000000000000 id 0002c901081e7471 path_comp_handler: DST GID subnet fe80000000000000 id 0001730000008461 path_comp_handler: slid 1 dlid 3 mtu 120203(2) pktlife 0(0) path_comp_handler: hops 0 npaths 0 pkey ffff tclass 0 rate 0(0) >Is OpenSM being used in Aniruddha's subnet ? > >-- Hal > > > From rolandd at cisco.com Thu Nov 3 11:07:14 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 03 Nov 2005 11:07:14 -0800 Subject: [openib-general] Re: [RFC] patch to export userspace to kernel QP attribute structure In-Reply-To: <436A5700.8090102@ichips.intel.com> (Sean Hefty's message of "Thu, 03 Nov 2005 10:29:20 -0800") References: <524q6t61oa.fsf@cisco.com> <436A5700.8090102@ichips.intel.com> Message-ID: <52vez94kkd.fsf@cisco.com> Sean> I have no problem moving this. If the declaration goes in Sean> ib_verbs.h, should the function just go in verbs.c (or a new Sean> file) and be included as part of ib_core? It just seems a Sean> little off to me to implement functions declared in a single Sean> header file in separate modules. We can easily create a new header for these declarations.
That's probably the cleanest thing to do, although I can't come up with a very good name for it right now. Sean> Also, I'll need to do this with a path record as well. I Sean> can either include the routine as part of ib_sa, add a new Sean> module, ib_user_sa, or drop that call into one of the Sean> existing ib_user_* modules. If it's just marshalling between user and kernel formats, I'd stick it in uverbs_marshall.c. But if there's going to be something substantial then maybe it makes sense to create a user SA module. - R. From rolandd at cisco.com Thu Nov 3 11:13:58 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 03 Nov 2005 11:13:58 -0800 Subject: [openib-general] libehca causes segfault when not physically present.. In-Reply-To: <436A1E96.4050003@de.ibm.com> (Heiko J. Schick's message of "Thu, 03 Nov 2005 15:28:38 +0100") References: <20051031071703.GU3275@kalmia.hozed.org> <436A1E96.4050003@de.ibm.com> Message-ID: <52r79x4k95.fsf@cisco.com> Heiko> this bug should be fixed in OpenIB trunk 3960. It's good to see this fixed and all the other cleanups in this checkin. I'll have to go back to my ehca code reviewing.... However, when this code moves upstream, you'll have to make your changes in smaller, digestible chunks. The diff between r3959 and r3960 is rather gigantic: 33 files changed, 945 insertions(+), 1163 deletions(-) And this piece: > -MODULE_VERSION("EHCA2_0035"); > +MODULE_VERSION("EHCA2_0037"); indicates that there was a 0036 that you never let anyone see. I would suggest you try to use the openib.org svn tree as your real development repository. This will be the way you will have to work once your driver is in the upstream kernel, and even now you will benefit from better patch review and from users being better able to pin down when a regression might have been introduced.
For your latest checkin, it would have been better to see a series of changesets with commit logs like: - remove asm_sync_mem() and mftb(), which duplicate existing definitions in include/asm-ppc64 - make sure device is an eHCA in libehca's openib_driver_init() - update Kconfig help text and so on... Thanks, Roland From ladros at gmail.com Thu Nov 3 11:24:15 2005 From: ladros at gmail.com (Josh Aune) Date: Thu, 3 Nov 2005 14:24:15 -0500 Subject: [openib-general] DHCP over Infiniband In-Reply-To: <1131036320.4338.441.camel@hal.voltaire.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30662A1@mtlexch01.mtl.com> <1131036320.4338.441.camel@hal.voltaire.com> Message-ID: <98a233180511031124v50347fat521cbf85091e2bf1@mail.gmail.com> On 03 Nov 2005 11:47:50 -0500, Hal Rosenstock wrote: > > On Thu, 2005-11-03 at 11:47, Eli Cohen wrote: > > The client is Etherboot's client for configuring a client at boot > > time. The server is ISC. > > I think that client needs modifications. No version of etherboot that I have used supports IB interfaces..
From mshefty at ichips.intel.com Thu Nov 3 12:45:04 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 03 Nov 2005 12:45:04 -0800 Subject: [openib-general] Re: [RFC] patch to export userspace to kernel QP attribute structure In-Reply-To: <52vez94kkd.fsf@cisco.com> References: <524q6t61oa.fsf@cisco.com> <436A5700.8090102@ichips.intel.com> <52vez94kkd.fsf@cisco.com> Message-ID: <436A76D0.5010506@ichips.intel.com> Roland Dreier wrote: > If it's just marshalling between user and kernel formats, I'd stick it > in uverbs_marshall.c. But if there's going to be something > substantial then maybe it make sense to create a user SA module. I added a three new files: ib_marshall.h - defines the copy functions (kernel only) ib_user_sa.h - defines the user path record (user/kernel) uverbs_marshall.c - implements the copy functions Any objection to doing something similar for libibverbs? This would move sa.h from libibat to libibverbs, which would allow libibcm and librdmacm to both depend only on libibverbs. - Sean From yipeeyipeeyipeeyipee at yahoo.com Thu Nov 3 14:19:10 2005 From: yipeeyipeeyipeeyipee at yahoo.com (yipee) Date: Thu, 3 Nov 2005 22:19:10 +0000 (UTC) Subject: [openib-general] Re: netstat References: <436A4C71.3020007@ichips.intel.com> Message-ID: Sean Hefty ichips.intel.com> writes: > > Is there some way to view the list of current CM end points in their various > > states (listen,connection)? > > Nothing like this is available today. I can record this as something to add in > the future, but it's unlikely to be a high priority for at least a few weeks. Maybe I can write some initial implementation. How do you think the data flow should be? libibcm.so issues a write(ucm_fd, buf, buf_len) with the request in buf (and enough extra space in buf for the reply). ib_ucm copies the information from ib_cm into buf. Maybe another command is needed inorder to know how large the reply buffer should be? any comments?
y From rolandd at cisco.com Thu Nov 3 14:27:03 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 03 Nov 2005 14:27:03 -0800 Subject: [openib-general] Re: netstat In-Reply-To: (yipee's message of "Thu, 3 Nov 2005 22:19:10 +0000 (UTC)") References: <436A4C71.3020007@ichips.intel.com> Message-ID: <52irv94bbc.fsf@cisco.com> yipee> Maybe I can write some initial implementation. How do you yipee> think the data flow should be? libibcm.so issues a yipee> write(ucm_fd, buf, buf_len) with the request in buf (and yipee> enough extra space in buf for the reply). ib_ucm copies yipee> the information from ib_cm into buf. It is probably easier just to use debugfs and seq_file for this sort of thing. - R. From mst at mellanox.co.il Thu Nov 3 14:33:14 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 4 Nov 2005 00:33:14 +0200 Subject: [openib-general] Re: netstat In-Reply-To: References: Message-ID: <20051103223314.GA6498@mellanox.co.il> Quoting r. yipee : > Subject: Re: netstat > > Sean Hefty ichips.intel.com> writes: > > > > Is there some way to view the list of current CM end points in their > various > > > states (listen,connection)? > > > > Nothing like this is available today. I can record this as something > to add in > > the future, but it's unlikely to be a high priority for at least a few > weeks. > > Maybe I can write some initial implementation. > How do you think the data flow should be? > libibcm.so issues a write(ucm_fd, buf, buf_len) with the request in > buf (and enough extra space in buf for the reply). > ib_ucm copies the information from ib_cm into buf. > > Maybe another command is needed inorder to know how large the reply > buffer > should be? > > > any comments? > > y I would imagine ib_cm would need to implement a file in /proc or sysfs. One place to start would be to look at how /proc/net/tcp is implemented (that's what netstat uses, I think).
-- MST From mshefty at ichips.intel.com Thu Nov 3 14:34:14 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 03 Nov 2005 14:34:14 -0800 Subject: [openib-general] Re: netstat In-Reply-To: References: <436A4C71.3020007@ichips.intel.com> Message-ID: <436A9066.5060200@ichips.intel.com> yipee wrote: > Maybe I can write some initial implementation. If you could do that, it'd be great. I think something like this would be useful. I just don't know if I'll have time to add it soon. - Sean From Richard.Frank at oracle.com Thu Nov 3 14:34:23 2005 From: Richard.Frank at oracle.com (Rick Frank) Date: Thu, 3 Nov 2005 17:34:23 -0500 Subject: [openib-general] [ANNOUNCE] Contribute RDS (Reliable DatagramSockets) to OpenIB References: <52d5li8waw.fsf@cisco.com> Message-ID: <000101c5e0c6$f7f54540$6401a8c0@YOURA11C73D0FD> It is very important to Oracle for RDS to be available in OpenIB in as many Linux distributions as possible. Is this going to happen and in what timeframe / what are the plans for Linux distributions to pick up OpenIB with RDS support ? How can we (Oracle) help ? ----- Original Message ----- From: "Roland Dreier" To: ; Sent: Wednesday, November 02, 2005 6:27 PM Subject: Re: [openib-general] [ANNOUNCE] Contribute RDS (Reliable DatagramSockets) to OpenIB > What are your plans for porting the RDS code so that it works with the > upstream Linux IB stack? I've only seen a couple of checkins, and the > code that you've dropped so far doesn't look usable and needs a lot of > cleanup. There's not even a Makefile there. > > Someone uncharitable might believe that the whole purpose of this > exercise was just to be able to issue your press release > (http://silverstorm.com/news/rel/092005.asp). > > - R. 
> _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From Richard.Frank at oracle.com Thu Nov 3 14:37:47 2005 From: Richard.Frank at oracle.com (Rick Frank) Date: Thu, 3 Nov 2005 17:37:47 -0500 Subject: [openib-general] [ANNOUNCE] Contribute RDS (Reliable DatagramSockets) to OpenIB References: <5D78D28F88822E4D8702BB9EEF1A4367C2DD62@mercury.infiniconsys.com> Message-ID: <001701c5e0c7$37d1b090$6401a8c0@YOURA11C73D0FD> It is very important to Oracle for RDS to be available in OpenIB in as many Linux distributions as possible. Is this going to happen and in what timeframe / what are the plans for Linux distributions to pick up OpenIB with RDS support ? How can we (Oracle) help ? ----- Original Message ----- From: "Pandit, Ranjit" To: "Roland Dreier" ; Sent: Thursday, November 03, 2005 1:15 PM Subject: RE: [openib-general] [ANNOUNCE] Contribute RDS (Reliable DatagramSockets) to OpenIB I'm hoping to start the port soon. It should have started earlier, but unfortunately I got side tracked by some unforeseen issues. Ranjit > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Wednesday, November 02, 2005 3:28 PM > To: Pandit, Ranjit; openib-general at openib.org > Subject: Re: [openib-general] [ANNOUNCE] Contribute RDS (Reliable Datagram > Sockets) to OpenIB > > What are your plans for porting the RDS code so that it works with the > upstream Linux IB stack? I've only seen a couple of checkins, and the > code that you've dropped so far doesn't look usable and needs a lot of > cleanup. There's not even a Makefile there. > > Someone uncharitable might believe that the whole purpose of this > exercise was just to be able to issue your press release > (http://silverstorm.com/news/rel/092005.asp). > > - R. 
From rolandd at cisco.com Thu Nov 3 14:58:50 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 03 Nov 2005 14:58:50 -0800 Subject: [openib-general] Re: [PATCH] mthca: fixes pkey_ix processing in mthca_modify_qp In-Reply-To: <20051002141043.GD9873@mellanox.co.il> (Jack Morgenstein's message of "Sun, 2 Oct 2005 16:10:44 +0200") References: <20051002141043.GD9873@mellanox.co.il> Message-ID: <52ek5x49ud.fsf@cisco.com> Thanks, applied. - R. From rolandd at cisco.com Thu Nov 3 15:05:37 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 03 Nov 2005 15:05:37 -0800 Subject: [openib-general] Re: [PATCH] fix page_size_cap value in ib_query_device for mellanox provider In-Reply-To: <20051020110443.GA7198@mellanox.co.il> (Jack Morgenstein's message of "Thu, 20 Oct 2005 13:04:44 +0200") References: <20051020110443.GA7198@mellanox.co.il> Message-ID: <527jbp49j2.fsf@cisco.com> Can we just use something like this instead? I don't think we need the comments talking about the semantics of page_size_cap, since we don't say what any other field means. And I don't see what casting mdev->limits.page_size_cap to u64 accomplishes -- it will get promoted to u64 anyway, since props->page_size_cap is a u64. - R.
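The promotion argument above can be checked in isolation. What follows is a small user-space sketch, not driver code: the `demo_` names are hypothetical, and `uint32_t`/`uint64_t` stand in for the kernel's u32/u64. It builds a 32-bit mask from a minimum page size and shows that storing it into a 64-bit field gives the same value with or without an explicit cast.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Build a 32-bit mask with every bit at and above log2(min_page_sz) set.
 * min_page_sz is assumed to be a nonzero power of two.
 */
static uint32_t demo_page_size_cap(uint32_t min_page_sz)
{
    assert(min_page_sz != 0 && (min_page_sz & (min_page_sz - 1)) == 0);
    return ~(uint32_t)(min_page_sz - 1);
}

/* Implicit conversion: the 32-bit value is zero-extended on assignment. */
static uint64_t demo_store_implicit(uint32_t cap)
{
    uint64_t props_page_size_cap = cap;
    return props_page_size_cap;
}

/* Explicit cast: yields the same 64-bit value as the implicit conversion. */
static uint64_t demo_store_cast(uint32_t cap)
{
    return (uint64_t)cap;
}
```

Because the mask is computed in 32 bits, the upper 32 bits of the widened value are zero in both cases.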
--- infiniband/hw/mthca/mthca_provider.c (revision 3965) +++ infiniband/hw/mthca/mthca_provider.c (working copy) @@ -94,6 +94,7 @@ static int mthca_query_device(struct ib_ memcpy(&props->node_guid, out_mad->data + 12, 8); props->max_mr_size = ~0ull; + props->page_size_cap = mdev->limits.page_size_cap; props->max_qp = mdev->limits.num_qps - mdev->limits.reserved_qps; props->max_qp_wr = mdev->limits.max_wqes; props->max_sge = mdev->limits.max_sg; --- infiniband/hw/mthca/mthca_main.c (revision 3965) +++ infiniband/hw/mthca/mthca_main.c (working copy) @@ -181,6 +181,7 @@ static int __devinit mthca_dev_lim(struc mdev->limits.reserved_uars = dev_lim->reserved_uars; mdev->limits.reserved_pds = dev_lim->reserved_pds; mdev->limits.port_width_cap = dev_lim->max_port_width; + mdev->limits.page_size_cap = ~(u32) (dev_lim->min_page_sz - 1); mdev->limits.flags = dev_lim->flags; /* IB_DEVICE_RESIZE_MAX_WR not supported by driver. From rolandd at cisco.com Thu Nov 3 15:10:59 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 03 Nov 2005 23:10:59 +0000 Subject: [openib-general] [git patch review 4/7] [IPoIB] don't compile debug code if debugging isn't enabled In-Reply-To: <1131059459423-3dc7f03665037bf0@cisco.com> Message-ID: <1131059459423-c39565dcb8db8aaa@cisco.com> Don't build ipoib_mcast_iter_ functions if CONFIG_INFINIBAND_IPOIB_DEBUG is not enabled -- their only callers will not be built either. Also move the prototype for ipoib_open() to ipoib.h to fix a sparse warning. 
Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib.h | 3 +++ drivers/infiniband/ulp/ipoib/ipoib_ib.c | 1 - drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 4 ++++ 3 files changed, 7 insertions(+), 1 deletions(-) applies-to: 3179960b8e0f3ccb4feff19eb5582298d48324a0 8ae5a8a24f7fe797027d481f88c1464b0e47eede diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index c994a91..0095acc 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -235,6 +235,7 @@ static inline void ipoib_put_ah(struct i kref_put(&ah->ref, ipoib_free_ah); } +int ipoib_open(struct net_device *dev); int ipoib_add_pkey_attr(struct net_device *dev); void ipoib_send(struct net_device *dev, struct sk_buff *skb, @@ -267,6 +268,7 @@ int ipoib_mcast_stop_thread(struct net_d void ipoib_mcast_dev_down(struct net_device *dev); void ipoib_mcast_dev_flush(struct net_device *dev); +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG struct ipoib_mcast_iter *ipoib_mcast_iter_init(struct net_device *dev); void ipoib_mcast_iter_free(struct ipoib_mcast_iter *iter); int ipoib_mcast_iter_next(struct ipoib_mcast_iter *iter); @@ -276,6 +278,7 @@ void ipoib_mcast_iter_read(struct ipoib_ unsigned int *queuelen, unsigned int *complete, unsigned int *send_only); +#endif int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 0a6f578..54ef2fe 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -636,7 +636,6 @@ void ipoib_ib_dev_cleanup(struct net_dev * Bug #2507. This implementation will probably be removed when the P_Key * change async notification is available. 
*/ -int ipoib_open(struct net_device *dev); static void ipoib_pkey_dev_check_presence(struct net_device *dev) { diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 022eec7..3ecf78a 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -917,6 +917,8 @@ void ipoib_mcast_restart_task(void *dev_ ipoib_mcast_start_thread(dev); } +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG + struct ipoib_mcast_iter *ipoib_mcast_iter_init(struct net_device *dev) { struct ipoib_mcast_iter *iter; @@ -989,3 +991,5 @@ void ipoib_mcast_iter_read(struct ipoib_ *complete = iter->complete; *send_only = iter->send_only; } + +#endif /* CONFIG_INFINIBAND_IPOIB_DEBUG */ --- 0.99.9 From rolandd at cisco.com Thu Nov 3 15:10:59 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 03 Nov 2005 23:10:59 +0000 Subject: [openib-general] [git patch review 1/7] [IB] ucm: 32/64 compatibility fixes Message-ID: <1131059459422-6013455baf532b88@cisco.com> Fix structure layouts to ensure same size on 32-bit and 64-bit architectures. This permits 32-bit userspace apps on a 64-bit kernel. 
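For readers following along, here is a rough userspace sketch of why the reserved fields matter. It assumes the usual x86 convention that a 64-bit integer inside a struct is 4-byte aligned on i386 and 8-byte aligned on x86_64. The struct mirrors the patched ib_ucm_destroy_id using stdint types; the helper names (round_up, destroy_id_size) are made up for illustration and are not part of the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirrors the patched ib_ucm_destroy_id layout, with fixed-width
 * userspace types standing in for __u64/__u32. */
struct destroy_id_padded {
	uint64_t response;
	uint32_t id;
	uint32_t reserved;	/* explicit padding, as added by the patch */
};

/* Round n up to the next multiple of align. */
static size_t round_up(size_t n, size_t align)
{
	return (n + align - 1) / align * align;
}

/*
 * Hypothetical helper: size of { u64 response; u32 id; [u32 reserved;] }
 * on an ABI where a 64-bit integer has alignment u64_align.
 */
static size_t destroy_id_size(size_t u64_align, int with_reserved)
{
	size_t struct_align = u64_align > 4 ? u64_align : 4;
	size_t off = 0;

	off = round_up(off, u64_align) + 8;	/* response */
	off = round_up(off, 4) + 4;		/* id */
	if (with_reserved)
		off = round_up(off, 4) + 4;	/* reserved */

	return round_up(off, struct_align);	/* tail padding */
}
```

Under these assumptions, destroy_id_size(4, 0) is 12 while destroy_id_size(8, 0) is 16, which is the mismatch a 32-bit app on a 64-bit kernel would hit. With the reserved field both come out as 16.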
Signed-off-by: Sean Hefty Signed-off-by: Roland Dreier --- include/rdma/ib_user_cm.h | 19 +++++++++++++------ 1 files changed, 13 insertions(+), 6 deletions(-) applies-to: ecb02f68e1055343bb45fc38350a8e33c827efc9 7b28b0d000eeb62d77add636f5d6eb0da04e48aa diff --git a/include/rdma/ib_user_cm.h b/include/rdma/ib_user_cm.h index 3037588..19be116 100644 --- a/include/rdma/ib_user_cm.h +++ b/include/rdma/ib_user_cm.h @@ -38,7 +38,7 @@ #include -#define IB_USER_CM_ABI_VERSION 3 +#define IB_USER_CM_ABI_VERSION 4 enum { IB_USER_CM_CMD_CREATE_ID, @@ -84,6 +84,7 @@ struct ib_ucm_create_id_resp { struct ib_ucm_destroy_id { __u64 response; __u32 id; + __u32 reserved; }; struct ib_ucm_destroy_id_resp { @@ -93,6 +94,7 @@ struct ib_ucm_destroy_id_resp { struct ib_ucm_attr_id { __u64 response; __u32 id; + __u32 reserved; }; struct ib_ucm_attr_id_resp { @@ -164,6 +166,7 @@ struct ib_ucm_listen { __be64 service_id; __be64 service_mask; __u32 id; + __u32 reserved; }; struct ib_ucm_establish { @@ -219,7 +222,7 @@ struct ib_ucm_req { __u8 rnr_retry_count; __u8 max_cm_retries; __u8 srq; - __u8 reserved[1]; + __u8 reserved[5]; }; struct ib_ucm_rep { @@ -236,6 +239,7 @@ struct ib_ucm_rep { __u8 flow_control; __u8 rnr_retry_count; __u8 srq; + __u8 reserved[4]; }; struct ib_ucm_info { @@ -245,7 +249,7 @@ struct ib_ucm_info { __u64 data; __u8 info_len; __u8 data_len; - __u8 reserved[2]; + __u8 reserved[6]; }; struct ib_ucm_mra { @@ -273,6 +277,7 @@ struct ib_ucm_sidr_req { __u16 pkey; __u8 len; __u8 max_cm_retries; + __u8 reserved[4]; }; struct ib_ucm_sidr_rep { @@ -284,7 +289,7 @@ struct ib_ucm_sidr_rep { __u64 data; __u8 info_len; __u8 data_len; - __u8 reserved[2]; + __u8 reserved[6]; }; /* * event notification ABI structures. 
@@ -295,7 +300,7 @@ struct ib_ucm_event_get { __u64 info; __u8 data_len; __u8 info_len; - __u8 reserved[2]; + __u8 reserved[6]; }; struct ib_ucm_req_event_resp { @@ -315,6 +320,7 @@ struct ib_ucm_req_event_resp { __u8 rnr_retry_count; __u8 srq; __u8 port; + __u8 reserved[7]; }; struct ib_ucm_rep_event_resp { @@ -329,7 +335,7 @@ struct ib_ucm_rep_event_resp { __u8 flow_control; __u8 rnr_retry_count; __u8 srq; - __u8 reserved[1]; + __u8 reserved[5]; }; struct ib_ucm_rej_event_resp { @@ -374,6 +380,7 @@ struct ib_ucm_event_resp { __u32 id; __u32 event; __u32 present; + __u32 reserved; union { struct ib_ucm_req_event_resp req_resp; struct ib_ucm_rep_event_resp rep_resp; --- 0.99.9 From rolandd at cisco.com Thu Nov 3 15:10:59 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 03 Nov 2005 23:10:59 +0000 Subject: [openib-general] [git patch review 3/7] [IPoIB] remove unneeded initializations to 0 In-Reply-To: <1131059459423-f6e7ac335ed94eef@cisco.com> Message-ID: <1131059459423-3dc7f03665037bf0@cisco.com> Shrink our source and .text a little by removing a few assignments of NULL and 0 to memory that is already cleared as part of the allocation. 
Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib_main.c | 11 ++--------- 1 files changed, 2 insertions(+), 9 deletions(-) applies-to: 7463446a05b5e9a5d2fc400da0be8d4a6c2ff6f1 21a384897d48c116b879924c3dd9e96f6f1e764b diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 8b67db8..ce02962 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -356,18 +356,15 @@ static struct ipoib_path *path_rec_creat struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_path *path; - path = kmalloc(sizeof *path, GFP_ATOMIC); + path = kzalloc(sizeof *path, GFP_ATOMIC); if (!path) return NULL; - path->dev = dev; - path->pathrec.dlid = 0; - path->ah = NULL; + path->dev = dev; skb_queue_head_init(&path->queue); INIT_LIST_HEAD(&path->neigh_list); - path->query = NULL; init_completion(&path->done); memcpy(path->pathrec.dgid.raw, gid->raw, sizeof (union ib_gid)); @@ -800,10 +797,6 @@ static void ipoib_setup(struct net_devic dev->watchdog_timeo = HZ; - dev->rebuild_header = NULL; - dev->set_mac_address = NULL; - dev->header_cache_update = NULL; - dev->flags |= IFF_BROADCAST | IFF_MULTICAST; /* --- 0.99.9 From rolandd at cisco.com Thu Nov 3 15:10:59 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 03 Nov 2005 23:10:59 +0000 Subject: [openib-general] [git patch review 2/7] [IB] kzalloc() conversions In-Reply-To: <1131059459422-6013455baf532b88@cisco.com> Message-ID: <1131059459423-f6e7ac335ed94eef@cisco.com> Replace kmalloc()+memset(,0,) with kzalloc(), for a net savings of 35 source lines and about 500 bytes of text. 
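As an illustration only, the shape of the conversion can be sketched in userspace C, with malloc()/calloc() standing in for kmalloc()/kzalloc(). The struct and function names below are invented for the example and do not come from the tree.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for a driver-private structure (hypothetical). */
struct port_priv {
	int   port_num;
	void *agent[2];
};

/* Before: allocate, check, then clear by hand. */
static struct port_priv *port_alloc_old(void)
{
	struct port_priv *p = malloc(sizeof *p);

	if (!p)
		return NULL;
	memset(p, 0, sizeof *p);
	return p;
}

/* After: a single call that hands back zeroed memory,
 * the userspace analogue of kzalloc(size, flags). */
static struct port_priv *port_alloc_new(void)
{
	return calloc(1, sizeof(struct port_priv));
}
```

Both variants return memory in the same state; the second drops the separate memset() and, in simple cases such as ib_alloc_device(), the temporary variable as well.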
Signed-off-by: Roland Dreier --- drivers/infiniband/core/agent.c | 3 +- drivers/infiniband/core/cm.c | 6 ++--- drivers/infiniband/core/device.c | 10 +------- drivers/infiniband/core/mad.c | 31 +++++++++--------------- drivers/infiniband/core/sysfs.c | 6 ++--- drivers/infiniband/core/ucm.c | 9 ++----- drivers/infiniband/core/uverbs_main.c | 4 +-- drivers/infiniband/hw/mthca/mthca_mr.c | 4 +-- drivers/infiniband/hw/mthca/mthca_profile.c | 4 +-- drivers/infiniband/ulp/ipoib/ipoib_main.c | 8 ++---- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 4 +-- 11 files changed, 27 insertions(+), 62 deletions(-) applies-to: 184c63c9358b790f4dd3288ea24b8d0c7973247f de6eb66b56d9df5ce6bd254994f05e065214e8cd diff --git a/drivers/infiniband/core/agent.c b/drivers/infiniband/core/agent.c index 0c3c695..7545775 100644 --- a/drivers/infiniband/core/agent.c +++ b/drivers/infiniband/core/agent.c @@ -155,13 +155,12 @@ int ib_agent_port_open(struct ib_device int ret; /* Create new device info */ - port_priv = kmalloc(sizeof *port_priv, GFP_KERNEL); + port_priv = kzalloc(sizeof *port_priv, GFP_KERNEL); if (!port_priv) { printk(KERN_ERR SPFX "No memory for ib_agent_port_private\n"); ret = -ENOMEM; goto error1; } - memset(port_priv, 0, sizeof *port_priv); /* Obtain send only MAD agent for SMI QP */ port_priv->agent[0] = ib_register_mad_agent(device, port_num, diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index 580c3a2..02110e0 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -544,11 +544,10 @@ struct ib_cm_id *ib_create_cm_id(struct struct cm_id_private *cm_id_priv; int ret; - cm_id_priv = kmalloc(sizeof *cm_id_priv, GFP_KERNEL); + cm_id_priv = kzalloc(sizeof *cm_id_priv, GFP_KERNEL); if (!cm_id_priv) return ERR_PTR(-ENOMEM); - memset(cm_id_priv, 0, sizeof *cm_id_priv); cm_id_priv->id.state = IB_CM_IDLE; cm_id_priv->id.device = device; cm_id_priv->id.cm_handler = cm_handler; @@ -621,10 +620,9 @@ static struct cm_timewait_info * cm_crea 
{ struct cm_timewait_info *timewait_info; - timewait_info = kmalloc(sizeof *timewait_info, GFP_KERNEL); + timewait_info = kzalloc(sizeof *timewait_info, GFP_KERNEL); if (!timewait_info) return ERR_PTR(-ENOMEM); - memset(timewait_info, 0, sizeof *timewait_info); timewait_info->work.local_id = local_id; INIT_WORK(&timewait_info->work.work, cm_work_handler, diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c index 5a6e449..e169e79 100644 --- a/drivers/infiniband/core/device.c +++ b/drivers/infiniband/core/device.c @@ -161,17 +161,9 @@ static int alloc_name(char *name) */ struct ib_device *ib_alloc_device(size_t size) { - void *dev; - BUG_ON(size < sizeof (struct ib_device)); - dev = kmalloc(size, GFP_KERNEL); - if (!dev) - return NULL; - - memset(dev, 0, size); - - return dev; + return kzalloc(size, GFP_KERNEL); } EXPORT_SYMBOL(ib_alloc_device); diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index 88f9f8c..3d8175e 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -255,12 +255,11 @@ struct ib_mad_agent *ib_register_mad_age } /* Allocate structures */ - mad_agent_priv = kmalloc(sizeof *mad_agent_priv, GFP_KERNEL); + mad_agent_priv = kzalloc(sizeof *mad_agent_priv, GFP_KERNEL); if (!mad_agent_priv) { ret = ERR_PTR(-ENOMEM); goto error1; } - memset(mad_agent_priv, 0, sizeof *mad_agent_priv); mad_agent_priv->agent.mr = ib_get_dma_mr(port_priv->qp_info[qpn].qp->pd, IB_ACCESS_LOCAL_WRITE); @@ -448,14 +447,13 @@ struct ib_mad_agent *ib_register_mad_sno goto error1; } /* Allocate structures */ - mad_snoop_priv = kmalloc(sizeof *mad_snoop_priv, GFP_KERNEL); + mad_snoop_priv = kzalloc(sizeof *mad_snoop_priv, GFP_KERNEL); if (!mad_snoop_priv) { ret = ERR_PTR(-ENOMEM); goto error1; } /* Now, fill in the various structures */ - memset(mad_snoop_priv, 0, sizeof *mad_snoop_priv); mad_snoop_priv->qp_info = &port_priv->qp_info[qpn]; mad_snoop_priv->agent.device = device; 
mad_snoop_priv->agent.recv_handler = recv_handler; @@ -794,10 +792,9 @@ struct ib_mad_send_buf * ib_create_send_ (!rmpp_active && buf_size > sizeof(struct ib_mad))) return ERR_PTR(-EINVAL); - buf = kmalloc(sizeof *mad_send_wr + buf_size, gfp_mask); + buf = kzalloc(sizeof *mad_send_wr + buf_size, gfp_mask); if (!buf) return ERR_PTR(-ENOMEM); - memset(buf, 0, sizeof *mad_send_wr + buf_size); mad_send_wr = buf + buf_size; mad_send_wr->send_buf.mad = buf; @@ -1039,14 +1036,12 @@ static int method_in_use(struct ib_mad_m static int allocate_method_table(struct ib_mad_mgmt_method_table **method) { /* Allocate management method table */ - *method = kmalloc(sizeof **method, GFP_ATOMIC); + *method = kzalloc(sizeof **method, GFP_ATOMIC); if (!*method) { printk(KERN_ERR PFX "No memory for " "ib_mad_mgmt_method_table\n"); return -ENOMEM; } - /* Clear management method table */ - memset(*method, 0, sizeof **method); return 0; } @@ -1137,15 +1132,14 @@ static int add_nonoui_reg_req(struct ib_ class = &port_priv->version[mad_reg_req->mgmt_class_version].class; if (!*class) { /* Allocate management class table for "new" class version */ - *class = kmalloc(sizeof **class, GFP_ATOMIC); + *class = kzalloc(sizeof **class, GFP_ATOMIC); if (!*class) { printk(KERN_ERR PFX "No memory for " "ib_mad_mgmt_class_table\n"); ret = -ENOMEM; goto error1; } - /* Clear management class table */ - memset(*class, 0, sizeof(**class)); + /* Allocate method table for this management class */ method = &(*class)->method_table[mgmt_class]; if ((ret = allocate_method_table(method))) @@ -1209,25 +1203,24 @@ static int add_oui_reg_req(struct ib_mad mad_reg_req->mgmt_class_version].vendor; if (!*vendor_table) { /* Allocate mgmt vendor class table for "new" class version */ - vendor = kmalloc(sizeof *vendor, GFP_ATOMIC); + vendor = kzalloc(sizeof *vendor, GFP_ATOMIC); if (!vendor) { printk(KERN_ERR PFX "No memory for " "ib_mad_mgmt_vendor_class_table\n"); goto error1; } - /* Clear management vendor class table 
*/ - memset(vendor, 0, sizeof(*vendor)); + *vendor_table = vendor; } if (!(*vendor_table)->vendor_class[vclass]) { /* Allocate table for this management vendor class */ - vendor_class = kmalloc(sizeof *vendor_class, GFP_ATOMIC); + vendor_class = kzalloc(sizeof *vendor_class, GFP_ATOMIC); if (!vendor_class) { printk(KERN_ERR PFX "No memory for " "ib_mad_mgmt_vendor_class\n"); goto error2; } - memset(vendor_class, 0, sizeof(*vendor_class)); + (*vendor_table)->vendor_class[vclass] = vendor_class; } for (i = 0; i < MAX_MGMT_OUI; i++) { @@ -2524,12 +2517,12 @@ static int ib_mad_port_open(struct ib_de char name[sizeof "ib_mad123"]; /* Create new device info */ - port_priv = kmalloc(sizeof *port_priv, GFP_KERNEL); + port_priv = kzalloc(sizeof *port_priv, GFP_KERNEL); if (!port_priv) { printk(KERN_ERR PFX "No memory for ib_mad_port_private\n"); return -ENOMEM; } - memset(port_priv, 0, sizeof *port_priv); + port_priv->device = device; port_priv->port_num = port_num; spin_lock_init(&port_priv->reg_lock); diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c index 7ce7a6c..b812065 100644 --- a/drivers/infiniband/core/sysfs.c +++ b/drivers/infiniband/core/sysfs.c @@ -307,14 +307,13 @@ static ssize_t show_pma_counter(struct i if (!p->ibdev->process_mad) return sprintf(buf, "N/A (no PMA)\n"); - in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); out_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); if (!in_mad || !out_mad) { ret = -ENOMEM; goto out; } - memset(in_mad, 0, sizeof *in_mad); in_mad->mad_hdr.base_version = 1; in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_PERF_MGMT; in_mad->mad_hdr.class_version = 1; @@ -508,10 +507,9 @@ static int add_port(struct ib_device *de if (ret) return ret; - p = kmalloc(sizeof *p, GFP_KERNEL); + p = kzalloc(sizeof *p, GFP_KERNEL); if (!p) return -ENOMEM; - memset(p, 0, sizeof *p); p->ibdev = device; p->port_num = port_num; diff --git a/drivers/infiniband/core/ucm.c 
b/drivers/infiniband/core/ucm.c index 2847756..6e15787 100644 --- a/drivers/infiniband/core/ucm.c +++ b/drivers/infiniband/core/ucm.c @@ -172,11 +172,10 @@ static struct ib_ucm_context *ib_ucm_ctx struct ib_ucm_context *ctx; int result; - ctx = kmalloc(sizeof(*ctx), GFP_KERNEL); + ctx = kzalloc(sizeof *ctx, GFP_KERNEL); if (!ctx) return NULL; - memset(ctx, 0, sizeof *ctx); atomic_set(&ctx->ref, 1); init_waitqueue_head(&ctx->wait); ctx->file = file; @@ -386,11 +385,10 @@ static int ib_ucm_event_handler(struct i ctx = cm_id->context; - uevent = kmalloc(sizeof(*uevent), GFP_KERNEL); + uevent = kzalloc(sizeof *uevent, GFP_KERNEL); if (!uevent) goto err1; - memset(uevent, 0, sizeof(*uevent)); uevent->ctx = ctx; uevent->cm_id = cm_id; uevent->resp.uid = ctx->uid; @@ -1345,11 +1343,10 @@ static void ib_ucm_add_one(struct ib_dev if (!device->alloc_ucontext) return; - ucm_dev = kmalloc(sizeof *ucm_dev, GFP_KERNEL); + ucm_dev = kzalloc(sizeof *ucm_dev, GFP_KERNEL); if (!ucm_dev) return; - memset(ucm_dev, 0, sizeof *ucm_dev); ucm_dev->ib_dev = device; ucm_dev->devnum = find_first_zero_bit(dev_map, IB_UCM_MAX_DEVICES); diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c index e58a7b2..de6581d 100644 --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -725,12 +725,10 @@ static void ib_uverbs_add_one(struct ib_ if (!device->alloc_ucontext) return; - uverbs_dev = kmalloc(sizeof *uverbs_dev, GFP_KERNEL); + uverbs_dev = kzalloc(sizeof *uverbs_dev, GFP_KERNEL); if (!uverbs_dev) return; - memset(uverbs_dev, 0, sizeof *uverbs_dev); - kref_init(&uverbs_dev->ref); spin_lock(&map_lock); diff --git a/drivers/infiniband/hw/mthca/mthca_mr.c b/drivers/infiniband/hw/mthca/mthca_mr.c index 1f97a44..e995e2a 100644 --- a/drivers/infiniband/hw/mthca/mthca_mr.c +++ b/drivers/infiniband/hw/mthca/mthca_mr.c @@ -140,13 +140,11 @@ static int __devinit mthca_buddy_init(st buddy->max_order = max_order; 
spin_lock_init(&buddy->lock); - buddy->bits = kmalloc((buddy->max_order + 1) * sizeof (long *), + buddy->bits = kzalloc((buddy->max_order + 1) * sizeof (long *), GFP_KERNEL); if (!buddy->bits) goto err_out; - memset(buddy->bits, 0, (buddy->max_order + 1) * sizeof (long *)); - for (i = 0; i <= buddy->max_order; ++i) { s = BITS_TO_LONGS(1 << (buddy->max_order - i)); buddy->bits[i] = kmalloc(s * sizeof (long), GFP_KERNEL); diff --git a/drivers/infiniband/hw/mthca/mthca_profile.c b/drivers/infiniband/hw/mthca/mthca_profile.c index 0576056..408cd55 100644 --- a/drivers/infiniband/hw/mthca/mthca_profile.c +++ b/drivers/infiniband/hw/mthca/mthca_profile.c @@ -80,12 +80,10 @@ u64 mthca_make_profile(struct mthca_dev struct mthca_resource tmp; int i, j; - profile = kmalloc(MTHCA_RES_NUM * sizeof *profile, GFP_KERNEL); + profile = kzalloc(MTHCA_RES_NUM * sizeof *profile, GFP_KERNEL); if (!profile) return -ENOMEM; - memset(profile, 0, MTHCA_RES_NUM * sizeof *profile); - profile[MTHCA_RES_QP].size = dev_lim->qpc_entry_sz; profile[MTHCA_RES_EEC].size = dev_lim->eec_entry_sz; profile[MTHCA_RES_SRQ].size = dev_lim->srq_entry_sz; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 273d5f4..8b67db8 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -729,25 +729,21 @@ int ipoib_dev_init(struct net_device *de /* Allocate RX/TX "rings" to hold queued skbs */ - priv->rx_ring = kmalloc(IPOIB_RX_RING_SIZE * sizeof (struct ipoib_rx_buf), + priv->rx_ring = kzalloc(IPOIB_RX_RING_SIZE * sizeof (struct ipoib_rx_buf), GFP_KERNEL); if (!priv->rx_ring) { printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n", ca->name, IPOIB_RX_RING_SIZE); goto out; } - memset(priv->rx_ring, 0, - IPOIB_RX_RING_SIZE * sizeof (struct ipoib_rx_buf)); - priv->tx_ring = kmalloc(IPOIB_TX_RING_SIZE * sizeof (struct ipoib_tx_buf), + priv->tx_ring = kzalloc(IPOIB_TX_RING_SIZE * sizeof (struct 
ipoib_tx_buf), GFP_KERNEL); if (!priv->tx_ring) { printk(KERN_WARNING "%s: failed to allocate TX ring (%d entries)\n", ca->name, IPOIB_TX_RING_SIZE); goto out_rx_ring_cleanup; } - memset(priv->tx_ring, 0, - IPOIB_TX_RING_SIZE * sizeof (struct ipoib_tx_buf)); /* priv->tx_head & tx_tail are already 0 */ diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 36ce298..022eec7 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -135,12 +135,10 @@ static struct ipoib_mcast *ipoib_mcast_a { struct ipoib_mcast *mcast; - mcast = kmalloc(sizeof (*mcast), can_sleep ? GFP_KERNEL : GFP_ATOMIC); + mcast = kzalloc(sizeof *mcast, can_sleep ? GFP_KERNEL : GFP_ATOMIC); if (!mcast) return NULL; - memset(mcast, 0, sizeof (*mcast)); - init_completion(&mcast->done); mcast->dev = dev; --- 0.99.9 From rolandd at cisco.com Thu Nov 3 15:10:59 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 03 Nov 2005 23:10:59 +0000 Subject: [openib-general] [git patch review 5/7] [IB] mthca: fix format of FW version In-Reply-To: <1131059459423-c39565dcb8db8aaa@cisco.com> Message-ID: <1131059459423-9ff5e95fb47caab0@cisco.com> Mellanox has decided that the components of the firmware version are really meant to be displayed in decimal, e.g. 0x000400070190 is version 4.7.400. Change the format we use from "%x.%x.%x" to "%d.%d.%d" to match this convention. 
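A small userspace sketch of the decoding, using the same shifts and masks as the driver. The helper names are made up for illustration and snprintf() stands in for the kernel's sprintf().

```c
#include <stdint.h>
#include <stdio.h>

/* Split a packed 64-bit firmware version into its three fields. */
static unsigned fw_major(uint64_t fw_ver)
{
	return (unsigned) (fw_ver >> 32);
}

static unsigned fw_minor(uint64_t fw_ver)
{
	return (unsigned) ((fw_ver >> 16) & 0xffff);
}

static unsigned fw_subminor(uint64_t fw_ver)
{
	return (unsigned) (fw_ver & 0xffff);
}

/* Format the version in decimal; returns the number of characters
 * that the full string needs, as snprintf() does. */
static int fw_ver_format(char *buf, size_t len, uint64_t fw_ver)
{
	return snprintf(buf, len, "%u.%u.%u",
			fw_major(fw_ver), fw_minor(fw_ver),
			fw_subminor(fw_ver));
}
```

For 0x000400070190 this gives 4.7.400. The old "%x.%x.%x" format would have printed the last field as 190.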
Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_main.c | 2 +- drivers/infiniband/hw/mthca/mthca_provider.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) applies-to: 389cecdfb0769cdddd0e901c1d60b9549b0a6322 87cfe32375e0b69b999b59bf8287f501df3e43f7 diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c index 883d1e5..45c6328 100644 --- a/drivers/infiniband/hw/mthca/mthca_main.c +++ b/drivers/infiniband/hw/mthca/mthca_main.c @@ -1057,7 +1057,7 @@ static int __devinit mthca_init_one(stru goto err_cmd; if (mdev->fw_ver < mthca_hca_table[id->driver_data].latest_fw) { - mthca_warn(mdev, "HCA FW version %x.%x.%x is old (%x.%x.%x is current).\n", + mthca_warn(mdev, "HCA FW version %d.%d.%d is old (%d.%d.%d is current).\n", (int) (mdev->fw_ver >> 32), (int) (mdev->fw_ver >> 16) & 0xffff, (int) (mdev->fw_ver & 0xffff), (int) (mthca_hca_table[id->driver_data].latest_fw >> 32), diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index 1b9477e..6b01666 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -1028,7 +1028,7 @@ static ssize_t show_rev(struct class_dev static ssize_t show_fw_ver(struct class_device *cdev, char *buf) { struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); - return sprintf(buf, "%x.%x.%x\n", (int) (dev->fw_ver >> 32), + return sprintf(buf, "%d.%d.%d\n", (int) (dev->fw_ver >> 32), (int) (dev->fw_ver >> 16) & 0xffff, (int) dev->fw_ver & 0xffff); } --- 0.99.9 From rolandd at cisco.com Thu Nov 3 15:10:59 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 03 Nov 2005 23:10:59 +0000 Subject: [openib-general] [git patch review 7/7] [IB] mthca: check P_Key index in modify QP In-Reply-To: <1131059459423-5367bfddb028b876@cisco.com> Message-ID: <1131059459423-4e378ef68b019b7e@cisco.com> Make sure that the P_Key index passed into mthca_modify_qp() is 
within the device's P_Key table. Signed-off-by: Jack Morgenstein Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_qp.c | 7 +++++++ 1 files changed, 7 insertions(+), 0 deletions(-) applies-to: b974a31452cb645f063589262bde09b6c5b05701 d09e32764176b61c4afee9fd5e7fe04713bfa56f diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index 62ff091..8b0b935 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -582,6 +582,13 @@ int mthca_modify_qp(struct ib_qp *ibqp, return -EINVAL; } + if ((attr_mask & IB_QP_PKEY_INDEX) && + attr->pkey_index >= dev->limits.pkey_table_len) { + mthca_dbg(dev, "PKey index (%u) too large. max is %d\n", + attr->pkey_index,dev->limits.pkey_table_len-1); + return -EINVAL; + } + mailbox = mthca_alloc_mailbox(dev, GFP_KERNEL); if (IS_ERR(mailbox)) return PTR_ERR(mailbox); --- 0.99.9 From rolandd at cisco.com Thu Nov 3 15:10:59 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 03 Nov 2005 23:10:59 +0000 Subject: [openib-general] [git patch review 6/7] [IB] umad: fix hot remove of IB devices In-Reply-To: <1131059459423-9ff5e95fb47caab0@cisco.com> Message-ID: <1131059459423-5367bfddb028b876@cisco.com> Fix hotplug of devices for ib_umad module: when a device goes away, kill off all MAD agents for open files associated with that device, and make sure that the device is not touched again after ib_umad returns from its remove_one function. 
Signed-off-by: Roland Dreier --- drivers/infiniband/core/user_mad.c | 80 +++++++++++++++++++++++++++++------- 1 files changed, 64 insertions(+), 16 deletions(-) applies-to: 2cbc1b1e7bb230afcf4903b6527e3238f689de89 0c99cb6d5fe77872c5a32cff837c05f70158ce15 diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c index 97128e2..aed5ca2 100644 --- a/drivers/infiniband/core/user_mad.c +++ b/drivers/infiniband/core/user_mad.c @@ -94,6 +94,9 @@ struct ib_umad_port { struct class_device *sm_class_dev; struct semaphore sm_sem; + struct rw_semaphore mutex; + struct list_head file_list; + struct ib_device *ib_dev; struct ib_umad_device *umad_dev; int dev_num; @@ -108,10 +111,10 @@ struct ib_umad_device { struct ib_umad_file { struct ib_umad_port *port; - spinlock_t recv_lock; struct list_head recv_list; + struct list_head port_list; + spinlock_t recv_lock; wait_queue_head_t recv_wait; - struct rw_semaphore agent_mutex; struct ib_mad_agent *agent[IB_UMAD_MAX_AGENTS]; struct ib_mr *mr[IB_UMAD_MAX_AGENTS]; }; @@ -148,7 +151,7 @@ static int queue_packet(struct ib_umad_f { int ret = 1; - down_read(&file->agent_mutex); + down_read(&file->port->mutex); for (packet->mad.hdr.id = 0; packet->mad.hdr.id < IB_UMAD_MAX_AGENTS; packet->mad.hdr.id++) @@ -161,7 +164,7 @@ static int queue_packet(struct ib_umad_f break; } - up_read(&file->agent_mutex); + up_read(&file->port->mutex); return ret; } @@ -322,7 +325,7 @@ static ssize_t ib_umad_write(struct file goto err; } - down_read(&file->agent_mutex); + down_read(&file->port->mutex); agent = file->agent[packet->mad.hdr.id]; if (!agent) { @@ -419,7 +422,7 @@ static ssize_t ib_umad_write(struct file if (ret) goto err_msg; - up_read(&file->agent_mutex); + up_read(&file->port->mutex); return count; @@ -430,7 +433,7 @@ err_ah: ib_destroy_ah(ah); err_up: - up_read(&file->agent_mutex); + up_read(&file->port->mutex); err: kfree(packet); @@ -460,7 +463,12 @@ static int ib_umad_reg_agent(struct ib_u int agent_id; int ret; - 
down_write(&file->agent_mutex); + down_write(&file->port->mutex); + + if (!file->port->ib_dev) { + ret = -EPIPE; + goto out; + } if (copy_from_user(&ureq, (void __user *) arg, sizeof ureq)) { ret = -EFAULT; @@ -522,7 +530,7 @@ err: ib_unregister_mad_agent(agent); out: - up_write(&file->agent_mutex); + up_write(&file->port->mutex); return ret; } @@ -531,7 +539,7 @@ static int ib_umad_unreg_agent(struct ib u32 id; int ret = 0; - down_write(&file->agent_mutex); + down_write(&file->port->mutex); if (get_user(id, (u32 __user *) arg)) { ret = -EFAULT; @@ -548,7 +556,7 @@ static int ib_umad_unreg_agent(struct ib file->agent[id] = NULL; out: - up_write(&file->agent_mutex); + up_write(&file->port->mutex); return ret; } @@ -569,6 +577,7 @@ static int ib_umad_open(struct inode *in { struct ib_umad_port *port; struct ib_umad_file *file; + int ret = 0; spin_lock(&port_lock); port = umad_port[iminor(inode) - IB_UMAD_MINOR_BASE]; @@ -579,21 +588,32 @@ static int ib_umad_open(struct inode *in if (!port) return -ENXIO; + down_write(&port->mutex); + + if (!port->ib_dev) { + ret = -ENXIO; + goto out; + } + file = kzalloc(sizeof *file, GFP_KERNEL); if (!file) { kref_put(&port->umad_dev->ref, ib_umad_release_dev); - return -ENOMEM; + ret = -ENOMEM; + goto out; } spin_lock_init(&file->recv_lock); - init_rwsem(&file->agent_mutex); INIT_LIST_HEAD(&file->recv_list); init_waitqueue_head(&file->recv_wait); file->port = port; filp->private_data = file; - return 0; + list_add_tail(&file->port_list, &port->file_list); + +out: + up_write(&port->mutex); + return ret; } static int ib_umad_close(struct inode *inode, struct file *filp) @@ -603,6 +623,7 @@ static int ib_umad_close(struct inode *i struct ib_umad_packet *packet, *tmp; int i; + down_write(&file->port->mutex); for (i = 0; i < IB_UMAD_MAX_AGENTS; ++i) if (file->agent[i]) { ib_dereg_mr(file->mr[i]); @@ -612,6 +633,9 @@ static int ib_umad_close(struct inode *i list_for_each_entry_safe(packet, tmp, &file->recv_list, list) kfree(packet); + 
list_del(&file->port_list); + up_write(&file->port->mutex); + kfree(file); kref_put(&dev->ref, ib_umad_release_dev); @@ -680,9 +704,13 @@ static int ib_umad_sm_close(struct inode struct ib_port_modify props = { .clr_port_cap_mask = IB_PORT_SM }; - int ret; + int ret = 0; + + down_write(&port->mutex); + if (port->ib_dev) + ret = ib_modify_port(port->ib_dev, port->port_num, 0, &props); + up_write(&port->mutex); - ret = ib_modify_port(port->ib_dev, port->port_num, 0, &props); up(&port->sm_sem); kref_put(&port->umad_dev->ref, ib_umad_release_dev); @@ -745,6 +773,8 @@ static int ib_umad_init_port(struct ib_d port->ib_dev = device; port->port_num = port_num; init_MUTEX(&port->sm_sem); + init_rwsem(&port->mutex); + INIT_LIST_HEAD(&port->file_list); port->dev = cdev_alloc(); if (!port->dev) @@ -813,6 +843,9 @@ err_cdev: static void ib_umad_kill_port(struct ib_umad_port *port) { + struct ib_umad_file *file; + int id; + class_set_devdata(port->class_dev, NULL); class_set_devdata(port->sm_class_dev, NULL); @@ -826,6 +859,21 @@ static void ib_umad_kill_port(struct ib_ umad_port[port->dev_num] = NULL; spin_unlock(&port_lock); + down_write(&port->mutex); + + port->ib_dev = NULL; + + list_for_each_entry(file, &port->file_list, port_list) + for (id = 0; id < IB_UMAD_MAX_AGENTS; ++id) { + if (!file->agent[id]) + continue; + ib_dereg_mr(file->mr[id]); + ib_unregister_mad_agent(file->agent[id]); + file->agent[id] = NULL; + } + + up_write(&port->mutex); + clear_bit(port->dev_num, dev_map); } --- 0.99.9 From mst at mellanox.co.il Thu Nov 3 15:23:18 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin)
Date: Fri, 4 Nov 2005 01:23:18 +0200
Subject: [openib-general] Re: [PATCH] fix page_size_cap value in ib_query_device for mellanox provider
In-Reply-To: <527jbp49j2.fsf@cisco.com>
References: <527jbp49j2.fsf@cisco.com>
Message-ID: <20051103232318.GC6498@mellanox.co.il>

Quoting Roland Dreier :
> Subject: Re: [PATCH] fix page_size_cap value in ib_query_device for mellanox provider
>
> Can we just use something like this instead? I don't think we need
> the comments talking about the semantics of page_size_cap, since we
> don't say what any other field means.

This was intended more as a clarification for you. I think it's fine to remove this comment if you think it's clear that the _cap name means that it's a bit mask.

> And I don't see what casting mdev->limits.page_size_cap to u64
> accomplishes -- it will get promoted to u64 anyway, since
> props->page_size_cap is a u64.
>
> - R.

Makes sense to me.

--
MST

From halr at voltaire.com Thu Nov 3 15:25:32 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: Fri, 4 Nov 2005 01:25:32 +0200
Subject: [openib-general] [PATCH] Re: uDAPL again
Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589A9E7@taurus.voltaire.com>

On Thu, 2005-11-03 at 13:49, Arlin Davis wrote:
> Yes, you can see from the debug print that GIDs, LIDs, pkey, mtu look ok.
>
> Here is Aniruddha's latest output from a run with opensm:
>
> path_comp_handler: ctxt 0x808a008, req_id 292 rec_num 1
> path_comp_handler: SRC GID subnet fe80000000000000 id 0002c901081e7471
> path_comp_handler: DST GID subnet fe80000000000000 id 0001730000008461
> path_comp_handler: slid 1 dlid 3 mtu 120203(2) pktlife 0(0)
> path_comp_handler: hops 0 npaths 0 pkey ffff tclass 0 rate 0(0)

The problem was in the uat library.

Aniruddha,
Can you update userspace/libibat, rebuild, and test?
You should get real rates and packet lifetimes now.
-- Hal

From iod00d at hp.com Thu Nov 3 16:21:01 2005
From: iod00d at hp.com (Grant Grundler)
Date: Thu, 3 Nov 2005 16:21:01 -0800
Subject: [openib-general] [ANNOUNCE] Contribute RDS (Reliable DatagramSockets) to OpenIB
In-Reply-To: <001701c5e0c7$37d1b090$6401a8c0@YOURA11C73D0FD>
References: <5D78D28F88822E4D8702BB9EEF1A4367C2DD62@mercury.infiniconsys.com> <001701c5e0c7$37d1b090$6401a8c0@YOURA11C73D0FD>
Message-ID: <20051104002101.GC1478@esmail.cup.hp.com>

On Thu, Nov 03, 2005 at 05:37:47PM -0500, Rick Frank wrote:
> It is very important to Oracle for RDS to be available in OpenIB in as many
> Linux distributions as possible.
>
> Is this going to happen and in what timeframe / what are the plans for Linux
> distributions to pick up OpenIB with RDS support ?

OpenIB doesn't have RDS support yet AFAICT. Some code is in
contrib/silverstorm/rds/
but not in the trunk where Roland can ship it to kernel.org
and where every distro will look for it.

But as Roland said, RDS doesn't even have a Makefile.
I've reviewed some of it shortly after it got dropped in but
still need to go through a lot more of the code.

> How can we (Oracle) help ?

1) Port contrib/silverstorm/rds/ to linux-kernel/infiniband/ulp/rds/
2) include some docs on its use and why RDS is better than SDP.
3) nag people to review the ported code
4) post functional test results

That's a prioritized list if that helps.

Regarding (2), I need to re-read your last OpenIB conf RDS presentation. ISTR there was a reason but the details escape me. You guys once explained it but the details didn't stick.
cheers,
grant

From mshefty at ichips.intel.com Thu Nov 3 16:50:26 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 03 Nov 2005 16:50:26 -0800
Subject: [openib-general] [ANNOUNCE] Contribute RDS (Reliable DatagramSockets) to OpenIB
In-Reply-To: <20051104002101.GC1478@esmail.cup.hp.com>
References: <5D78D28F88822E4D8702BB9EEF1A4367C2DD62@mercury.infiniconsys.com> <001701c5e0c7$37d1b090$6401a8c0@YOURA11C73D0FD> <20051104002101.GC1478@esmail.cup.hp.com>
Message-ID: <436AB052.9070507@ichips.intel.com>

Grant Grundler wrote:
> But as Roland said, RDS doesn't even have a Makefile.
> I've reviewed some of it shortly after it got dropped in but
> still need to go through a lot more of the code.

The code that I've looked at doesn't appear to be written to any of the openib code. A Makefile wouldn't help, since I don't think that any of it would compile anyway. Porting it to openib will be a major rewrite.

> 2) include some docs on its use and why RDS is better than SDP.

Does someone have a link to a doc or presentation on why RDS is better than SDP? Or better yet, some actual data showing how it provides better performance or scalability?

- Sean

From robert.j.woodruff at intel.com Thu Nov 3 16:53:26 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Thu, 3 Nov 2005 16:53:26 -0800
Subject: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
In-Reply-To: <20051104002101.GC1478@esmail.cup.hp.com>
Message-ID:

Grant wrote,
>2) include some docs on its use and why RDS is better than SDP.
>3) nag people to review the ported code
>4) post functional test results

Looking at the code that is in the contrib branch, it looks like RDS uses connected channels. Is that correct? If so, I do not see that it provides any value over SDP.
If it indeed were using datagrams over IB, then I see that it might
provide for better scaling than SDP, since with very large numbers of
connections, memory usage becomes an issue, but as it is currently coded,
I don't see the point.

I was unable to attend the RDS talk at the OpenIB workshop, so perhaps Rick
can provide some reason why this protocol is better than SDP.

woody

From Richard.Frank at oracle.com Thu Nov 3 16:55:56 2005
From: Richard.Frank at oracle.com (Rick Frank)
Date: Thu, 3 Nov 2005 19:55:56 -0500
Subject: [openib-general] [ANNOUNCE] Contribute RDS (Reliable DatagramSockets) to OpenIB
References: <5D78D28F88822E4D8702BB9EEF1A4367C2DD62@mercury.infiniconsys.com> <001701c5e0c7$37d1b090$6401a8c0@YOURA11C73D0FD> <20051104002101.GC1478@esmail.cup.hp.com> <436AB052.9070507@ichips.intel.com>
Message-ID: <00ec01c5e0da$846dd510$6401a8c0@YOURA11C73D0FD>

RDS is a UDP protocol with reliability added - it remains connectionless from
the consumer perspective.

SDP is connection based - at least currently.

----- Original Message -----
From: "Sean Hefty"
To: "Grant Grundler"
Cc: "Rick Frank" ; "Kothanda Umamageswaran (Kodi) (E-mail)" ; "Sumanta Chatterjee" ;
Sent: Thursday, November 03, 2005 7:50 PM
Subject: Re: [openib-general] [ANNOUNCE] Contribute RDS (Reliable DatagramSockets) to OpenIB

> Grant Grundler wrote:
>> But as Roland said, RDS doesn't even have a Makefile.
>> I've reviewed some of it shortly after it got dropped in but
>> still need to go through a lot more of the code.
>
> The code that I've looked at doesn't appear to be written to any of the
> openib code. A Makefile wouldn't help, since I don't think that any of it
> would compile anyway. Porting it to openib will be a major rewrite.
>
>> 2) include some docs on its use and why RDS is better than SDP.
>
> Does someone have a link to a doc or presentation on why RDS is better
> than SDP? Or better yet, some actual data showing how it provides better
> performance or scalability?
>
> - Sean
>

From rpandit at silverstorm.com Thu Nov 3 17:01:59 2005
From: rpandit at silverstorm.com (Ranjit Pandit)
Date: Thu, 3 Nov 2005 17:01:59 -0800
Subject: [openib-general] [ANNOUNCE] Contribute RDS (Reliable DatagramSockets) to OpenIB
In-Reply-To: <20051104002101.GC1478@esmail.cup.hp.com>
References: <5D78D28F88822E4D8702BB9EEF1A4367C2DD62@mercury.infiniconsys.com> <001701c5e0c7$37d1b090$6401a8c0@YOURA11C73D0FD> <20051104002101.GC1478@esmail.cup.hp.com>
Message-ID: <96f8e60e0511031701i7b9ce5a0gbdade306735695e6@mail.gmail.com>

On 11/3/05, Grant Grundler wrote:
> On Thu, Nov 03, 2005 at 05:37:47PM -0500, Rick Frank wrote:
> > It is very important to Oracle for RDS to be available in OpenIB in as many
> > Linux distributions as possible.
> >
> > Is this going to happen and in what timeframe / what are the plans for Linux
> > distributions to pick up OpenIB with RDS support ?
>
> OpenIB doesn't have RDS support yet AFAICT. Some code is in
> contrib/silverstorm/rds/

The code in contrib/silverstorm/rds is up-to-date but is written on the
SilverStorm IbAccess layer. As previously mentioned, this code is for
reference only and needs to be ported to OpenIB verbs and moved into trunk.

> but not in the trunk where Roland can ship it to kernel.org
> and where every distro will look for it.
>
> But as Roland said, RDS doesn't even have a Makefile.
> I've reviewed some of it shortly after it got dropped in but
> still need to go through a lot more of the code.
>

I will go ahead and post the Makefile...but it's currently specific to
the SST Access layer.

> > How can we (Oracle) help ?
>
> 1) Port contrib/silverstorm/rds/ to linux-kernel/infiniband/ulp/rds/

Grant, at the last OpenIB conference, you had volunteered to help port
the code or, at least, the coding style. :)

> 2) include some docs on its use and why RDS is better than SDP.

I will check in the RDS presentations shortly.
> 3) nag people to review the ported code
> 4) post functional test results
>
> That's a prioritized list if that helps.
>
> Regarding (2), I need to re-read your last OpenIB conf RDS presentation.
> ISTR there was a reason but the details escape me. You guys once explained
> it but the details didn't stick.
>
> cheers,
> grant
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

Ranjit

From Richard.Frank at oracle.com Thu Nov 3 17:06:21 2005
From: Richard.Frank at oracle.com (Rick Frank)
Date: Thu, 3 Nov 2005 20:06:21 -0500
Subject: [openib-general] [ANNOUNCE] Contribute RDS (Reliable DatagramSockets) to OpenIB
References: <5D78D28F88822E4D8702BB9EEF1A4367C2DD62@mercury.infiniconsys.com> <001701c5e0c7$37d1b090$6401a8c0@YOURA11C73D0FD> <20051104002101.GC1478@esmail.cup.hp.com> <96f8e60e0511031701i7b9ce5a0gbdade306735695e6@mail.gmail.com>
Message-ID: <010201c5e0db$f9ed3820$6401a8c0@YOURA11C73D0FD>

I've attached a draft proposal for RDS from Oracle which discusses some of
the motivation for RDS.

----- Original Message -----
From: "Ranjit Pandit"
To: "Grant Grundler"
Cc: "Rick Frank" ; "Kothanda Umamageswaran (Kodi) (E-mail)" ; "Sumanta Chatterjee" ;
Sent: Thursday, November 03, 2005 8:01 PM
Subject: Re: [openib-general] [ANNOUNCE] Contribute RDS (Reliable DatagramSockets) to OpenIB

On 11/3/05, Grant Grundler wrote:
> On Thu, Nov 03, 2005 at 05:37:47PM -0500, Rick Frank wrote:
> > It is very important to Oracle for RDS to be available in OpenIB in as
> > many
> > Linux distributions as possible.
> >
> > Is this going to happen and in what timeframe / what are the plans for
> > Linux
> > distributions to pick up OpenIB with RDS support ?
>
> OpenIB doesn't have RDS support yet AFAICT.
Some code is in
> contrib/silverstorm/rds/

The code in contrib/silverstorm/rds is up-to-date but is written on the
SilverStorm IbAccess layer. As previously mentioned, this code is for
reference only and needs to be ported to OpenIB verbs and moved into trunk.

> but not in the trunk where Roland can ship it to kernel.org
> and where every distro will look for it.
>
> But as Roland said, RDS doesn't even have a Makefile.
> I've reviewed some of it shortly after it got dropped in but
> still need to go through a lot more of the code.
>

I will go ahead and post the Makefile...but it's currently specific to
the SST Access layer.

> > How can we (Oracle) help ?
>
> 1) Port contrib/silverstorm/rds/ to linux-kernel/infiniband/ulp/rds/

Grant, at the last OpenIB conference, you had volunteered to help port
the code or, at least, the coding style. :)

> 2) include some docs on its use and why RDS is better than SDP.

I will check in the RDS presentations shortly.

> 3) nag people to review the ported code
> 4) post functional test results
>
> That's a prioritized list if that helps.
>
> Regarding (2), I need to re-read your last OpenIB conf RDS presentation.
> ISTR there was a reason but the details escape me. You guys once explained
> it but the details didn't stick.
>
> cheers,
> grant
>

Ranjit
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Proposal for a Reliable Datagram Socket Interface.doc
Type: application/msword
Size: 51200 bytes
Desc: not available
URL:

From hozer at hozed.org Thu Nov 3 21:39:15 2005
From: hozer at hozed.org (Troy Benjegerdes)
Date: Thu, 3 Nov 2005 23:39:15 -0600
Subject: [openib-general] libehca causes segfault when not physically present..
In-Reply-To: <52r79x4k95.fsf@cisco.com>
References: <20051031071703.GU3275@kalmia.hozed.org> <436A1E96.4050003@de.ibm.com> <52r79x4k95.fsf@cisco.com>
Message-ID: <20051104053915.GJ3275@kalmia.hozed.org>

On Thu, Nov 03, 2005 at 11:13:58AM -0800, Roland Dreier wrote:
> Heiko> this bug should be fixed in OpenIB trunk 3960.
>
> It's good to see this fixed and all the other cleanups in this
> checkin. I'll have to go back to my ehca code reviewing....
>
> However, when this code moves upstream, you'll have to make your
> changes in smaller digestible chunks. The diff between r3959 and
> r3960 is rather gigantic:
>
> 33 files changed, 945 insertions(+), 1163 deletions(-)
>
> And this piece:
>
> > -MODULE_VERSION("EHCA2_0035");
> > +MODULE_VERSION("EHCA2_0037");
>
> indicates that there was a 0036 that you never let anyone see.

I'll second the comment about smaller digestible chunks. A second thing
I don't completely understand is the vast size difference between the
ehca and mthca drivers. Is the ehca really that much more complex?

I also want to comment that EHCA is the only thing that's versioned that
is easy to tell what version of the module is actually loaded at the
moment. I'd rather have versions I don't see float by than see every
file in mthca get updated, but no version rev.

I tried adding some code to generate a version string from the output of
svnversion but it didn't work too well.

The same goes for OpenSM as well.. the only version string you get when
starting it is 'Opensm-1.1.0', which isn't very useful.

And once that's figured out, maybe we can start thinking about how to
make sure kernel module versions match userspace versions. Personally,
I'd like to see the ehca functions exported as a VDSO.

From halr at voltaire.com Thu Nov 3 21:46:05 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 04 Nov 2005 00:46:05 -0500
Subject: [openib-general] libehca causes segfault when not physically present..
In-Reply-To: <20051104053915.GJ3275@kalmia.hozed.org>
References: <20051031071703.GU3275@kalmia.hozed.org> <436A1E96.4050003@de.ibm.com> <52r79x4k95.fsf@cisco.com> <20051104053915.GJ3275@kalmia.hozed.org>
Message-ID: <1131083165.4340.1631.camel@hal.voltaire.com>

On Fri, 2005-11-04 at 00:39, Troy Benjegerdes wrote:
> On Thu, Nov 03, 2005 at 11:13:58AM -0800, Roland Dreier wrote:
> > Heiko> this bug should be fixed in OpenIB trunk 3960.
> >
> > It's good to see this fixed and all the other cleanups in this
> > checkin. I'll have to go back to my ehca code reviewing....
> >
> > However, when this code moves upstream, you'll have to make your
> > changes in smaller digestible chunks. The diff between r3959 and
> > r3960 is rather gigantic:
> >
> > 33 files changed, 945 insertions(+), 1163 deletions(-)
> >
> > And this piece:
> >
> > > -MODULE_VERSION("EHCA2_0035");
> > > +MODULE_VERSION("EHCA2_0037");
> >
> > indicates that there was a 0036 that you never let anyone see.
>
> I'll second the comment about smaller digestible chunks. A second thing
> I don't completely understand is the vast size difference between the
> ehca and mthca drivers. Is the ehca really that much more complex?
>
> I also want to comment that EHCA is the only thing that's versioned that
> is easy to tell what version of the module is actually loaded at the
> moment. I'd rather have versions I don't see float by than see every
> file in mthca get updated, but no version rev.
>
> I tried adding some code to generate a version string from the output of
> svnversion but it didn't work too well.
>
> The same goes for OpenSM as well.. the only version string you get when
> starting it is 'Opensm-1.1.0', which isn't very useful.

What version would you propose for OpenSM? Should I change the last
component with every checkin? Or perhaps append the svn revision, as in
1.1.0-<svn version>, until we hit rc1 for this?
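
The "append the svn revision" idea raised above could look roughly like the sketch below. It is an illustration only, not existing OpenSM code: OSM_BASE_VERSION, OSM_SVN_REVISION and osm_version_string() are made-up names, and it assumes the build passes the output of svnversion in as a string define (for example via -DOSM_SVN_REVISION in the Makefile).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical names: neither macro exists in OpenSM. */
#define OSM_BASE_VERSION "Opensm-1.1.0"

#ifndef OSM_SVN_REVISION
/* Assumed to be supplied by the build from the output of svnversion. */
#define OSM_SVN_REVISION ""
#endif

/*
 * Write "<base>-svn<rev>" into buf, or just "<base>" when no usable
 * revision is known. Returns buf.
 */
static const char *osm_version_string(const char *base, const char *rev,
				      char *buf, size_t len)
{
	/*
	 * An empty string means the build did not supply a revision;
	 * "exported" is treated here as svnversion's marker for a tree
	 * that is not a working copy (an assumption of this sketch).
	 */
	if (!rev || !*rev || !strcmp(rev, "exported"))
		snprintf(buf, len, "%s", base);
	else
		snprintf(buf, len, "%s-svn%s", base, rev);
	return buf;
}
```

A caller would print osm_version_string(OSM_BASE_VERSION, OSM_SVN_REVISION, buf, sizeof(buf)) at startup. Keeping the revision as a suffix leaves the base version untouched until a real release bumps it.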
-- Hal

> And once that's figured out, maybe we can start thinking about how to
> make sure kernel module versions match userspace versions. Personally,
> I'd like to see the ehca functions exported as a VDSO.

From RAISCH at de.ibm.com Fri Nov 4 02:32:27 2005
From: RAISCH at de.ibm.com (Christoph Raisch)
Date: Fri, 4 Nov 2005 11:32:27 +0100
Subject: [openib-general] libehca causes segfault when not physically present..
In-Reply-To: <52r79x4k95.fsf@cisco.com>
Message-ID:

The secret of 0036 was that we managed to build a driver which didn't work
in all cases.

I would guess in these changes
> 33 files changed, 945 insertions(+), 1163 deletions(-)
there are about 80 lines of new code; the rest are modifications
which don't change any algorithm but are desperately needed to be kernel
coding style compliant.

This is a mostly complete list of what we've changed:
- removing+renaming already existing assembly macros in ehca_asm.h
- changed the ehca_module pointer to an ehca_module struct
- removed EHCA_MEMPAGESIZE
- replaced quite a lot of typedef struct by struct
- capitalize DEFINES, changed most struct members to small letters
- replaced all ehca_retcode_t by u64
- replaced the ehca_sleep() by appropriate kernel function
- replaced the assert() by BUG_ON()
- replaced ntohd()
- some naming and comment cleanup on struct hcp_modify_qp_control_block

Roland, in case you're missing some changes in there, we'll add these
to one of the next releases to separate the coding style cleanups from the
functional changes.

Gruss / Regards . . . Christoph R.

Roland Dreier <> wrote on 03.11.2005 20:13:58:

> Heiko> this bug should be fixed in OpenIB trunk 3960.
>
> It's good to see this fixed and all the other cleanups in this
> checkin.
I'll have to go back to my ehca code reviewing....
>
> However, when this code moves upstream, you'll have to make your
> changes in smaller digestible chunks. The diff between r3959 and
> r3960 is rather gigantic:
>
> 33 files changed, 945 insertions(+), 1163 deletions(-)
>
> And this piece:
>
> > -MODULE_VERSION("EHCA2_0035");
> > +MODULE_VERSION("EHCA2_0037");
>
> indicates that there was a 0036 that you never let anyone see.
>
> I would suggest you try to use the openib.org svn tree as your real
> development repository. This will be the way you will have to work
> once your driver is in the upstream kernel, and even now you will get
> benefit from getting better patch review and having users better able
> to pin down when a regression might have been introduced.
>
> For your latest checkin, it would have been better to see a series of
> changesets with commit logs like:
>
> - remove asm_sync_mem() and mftb(), which duplicate existing
> definitions in include/asm-ppc64
> - make sure device is an eHCA in libehca's openib_driver_init()
> - update Kconfig help text
>
> and so on...
>
> Thanks,
> Roland

From mst at mellanox.co.il Fri Nov 4 04:23:31 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 4 Nov 2005 14:23:31 +0200
Subject: [openib-general] [PATCH] sdp zero copy support
Message-ID: <20051104122331.GB15158@mellanox.co.il>

Please review the following. I won't have performance numbers for this code
until next week.

---

Add zero copy support to synchronous socket operations (send_msg/recv_msg).
This patch also includes a couple of fixes for aio, which I'll split and
commit separately.

Signed-off-by: Michael S.
Tsirkin Index: drivers/infiniband/ulp/sdp/Kconfig =================================================================== --- drivers/infiniband/ulp/sdp/Kconfig (revision 3958) +++ drivers/infiniband/ulp/sdp/Kconfig (working copy) @@ -8,6 +8,20 @@ libsdp library from to have standard sockets applications use SDP. +config INFINIBAND_SDP_SEND_ZCOPY + bool "Sockets Direct Protocol Zero Copy Send support" + depends on INFINIBAND_SDP + default y + ---help--- + This option enables Zero Copy support for send_msg transactions. + +config INFINIBAND_SDP_RECV_ZCOPY + bool "Sockets Direct Protocol Zero Copy Receive support" + depends on INFINIBAND_SDP && INFINIBAND_SDP_SEND_ZCOPY + default y + ---help--- + This option enables Zero Copy support for recv_msg transactions. + config INFINIBAND_SDP_DEBUG bool "Sockets Direct Protocol debugging" depends on INFINIBAND_SDP Index: drivers/infiniband/ulp/sdp/sdp_rcvd.c =================================================================== --- drivers/infiniband/ulp/sdp/sdp_rcvd.c (revision 3958) +++ drivers/infiniband/ulp/sdp/sdp_rcvd.c (working copy) @@ -439,6 +439,11 @@ sdp_advt_destroy(advt); } + +#ifdef CONFIG_INFINIBAND_SDP_RECV_ZCOPY + /* There are no more src_pend, wake any waiting thread */ + sdp_iocb_wake(&conn->src_wait_list); +#endif /* * If there are active reads, mark the connection as being in * source cancel. 
Otherwise Index: drivers/infiniband/ulp/sdp/sdp_sock.h =================================================================== --- drivers/infiniband/ulp/sdp/sdp_sock.h (revision 3958) +++ drivers/infiniband/ulp/sdp/sdp_sock.h (working copy) @@ -61,7 +61,9 @@ #define SDP_ZCOPY_THRSH_SRC 257 /* Threshold for AIO write advertisments */ #define SDP_ZCOPY_THRSH_SNK 258 /* Threshold for AIO read advertisments */ #define SDP_ZCOPY_THRSH 256 /* Convenience for read and write */ - +#ifdef CONFIG_INFINIBAND_SDP_SEND_ZCOPY +#define SDP_ZCOPY_CANCEL_TIMEOUT (HZ * 60) /* Time before abortive close */ +#endif /* * Default values for SDP specific socket options. (for reference) */ Index: drivers/infiniband/ulp/sdp/sdp_proto.h =================================================================== --- drivers/infiniband/ulp/sdp/sdp_proto.h (revision 3958) +++ drivers/infiniband/ulp/sdp/sdp_proto.h (working copy) @@ -152,7 +152,10 @@ void sdp_iocb_q_put_tail(struct sdpc_iocb_q *table, struct sdpc_iocb *iocb); struct sdpc_iocb *sdp_iocb_q_lookup(struct sdpc_iocb_q *table, u32 key); +struct sdpc_iocb *sdp_iocb_q_lookup_req(struct sdpc_iocb_q *table, struct kiocb *req); +void sdp_iocb_q_mark_cancel(struct sdpc_iocb_q *table, struct kiocb *req); + void sdp_iocb_q_cancel(struct sdpc_iocb_q *table, u32 mask, ssize_t comp); void sdp_iocb_q_remove(struct sdpc_iocb *iocb); @@ -197,6 +200,8 @@ void *arg), void *arg); +int sdp_iocb_find_req(struct sdpc_desc *element, void *arg); + int sdp_desc_q_types_size(struct sdpc_desc_q *table, enum sdp_desc_type type); Index: drivers/infiniband/ulp/sdp/sdp_read.c =================================================================== --- drivers/infiniband/ulp/sdp/sdp_read.c (revision 3958) +++ drivers/infiniband/ulp/sdp/sdp_read.c (working copy) @@ -93,6 +93,12 @@ } } +#ifdef CONFIG_INFINIBAND_SDP_RECV_ZCOPY + /* If there are no more src_pend, wake any waiting thread */ + if (!sdp_advt_q_size(&conn->src_pend)) + sdp_iocb_wake(&conn->src_wait_list); + +#endif 
done: return 0; error: Index: drivers/infiniband/ulp/sdp/sdp_send.c =================================================================== --- drivers/infiniband/ulp/sdp/sdp_send.c (revision 3958) +++ drivers/infiniband/ulp/sdp/sdp_send.c (working copy) @@ -122,6 +122,10 @@ send_param.send_flags |= IB_SEND_SIGNALED; conn->send_cons = 0; } + + if (buff->bsdh_hdr->mid == SDP_MID_SRC_CANCEL) + sdp_dbg_ctrl(conn, "SRC_CANCEL bsdh_hdr->seq_num = %d conn->send_seq=%d\n", + buff->bsdh_hdr->seq_num, conn->send_seq); /* * post send */ @@ -1680,8 +1684,8 @@ static int sdp_inet_write_cancel(struct kiocb *req, struct io_event *ev) { struct sock_iocb *si = kiocb_to_siocb(req); - struct sdp_sock *conn; struct sdpc_iocb *iocb; + struct sdp_sock *conn; int result = 0; sdp_dbg_ctrl(NULL, "Cancel Write IOCB user <%d> key <%d> flag <%08lx>", @@ -1738,7 +1742,7 @@ /* * completion reference */ - aio_put_req(req); + aio_put_req(iocb->req); result = 0; } @@ -1797,9 +1801,8 @@ * no IOCB found. The cancel is probably in a race with a completion. * Assume the IOCB will be completed, return appropriate value. */ - sdp_warn("Cancel write with no IOCB. <%d:%d:%08lx>", - req->ki_users, req->ki_key, req->ki_flags); - + sdp_dbg_warn(conn, "Cancel write with no IOCB. 
<%d:%d:%08lx>", + req->ki_users, req->ki_key, req->ki_flags); result = -EAGAIN; unlock: @@ -1810,7 +1813,151 @@ return result; } +#ifdef CONFIG_INFINIBAND_SDP_SEND_ZCOPY +static int sdp_write_src_cancel(struct sdpc_desc *element, void *arg) +{ + struct sdpc_iocb *iocb = (struct sdpc_iocb *) element; + struct kiocb *req = (struct kiocb *)arg; + + if (element->type == SDP_DESC_TYPE_IOCB && iocb->req == req) + iocb->flags |= SDP_IOCB_F_CANCEL; + return -ERANGE; +} + +static int sdp_req_busy(struct sdp_sock *conn, struct sdpc_iocb_wait *wait) +{ + unsigned long flags; + int result = -EAGAIN; + + sdp_conn_lock(conn); + sdp_conn_unlock(conn); + + spin_lock_irqsave(&wait->lock, flags); + if (!wait->outstanding) + result = 0; + spin_unlock_irqrestore(&wait->lock, flags); + return result; +} /* + * sdp_write_cancel - cancel a synchronous IO operation + */ +static int sdp_write_cancel(struct kiocb *req, struct sdp_sock *conn, + struct sdpc_iocb_wait *wait) +{ + struct sdpc_iocb *iocb; + int result = 0; + + sdp_dbg_ctrl(NULL, "Cancel Write IOCB user <%d> key <%d> flag <%08lx>", + req->ki_users, req->ki_key, req->ki_flags); + + sdp_conn_lock(conn); + + sdp_dbg_ctrl(conn, "Cancel Write IOCB. <%08x:%04x> <%08x:%04x>", + conn->src_addr, conn->src_port, + conn->dst_addr, conn->dst_port); + /* + * attempt to find the IOCB for this key. we don't have an indication + * whether this is a read or write. + */ + + while ((iocb = (struct sdpc_iocb *) + sdp_desc_q_lookup(&conn->send_queue, sdp_iocb_find_req, req))) { + iocb->flags |= SDP_IOCB_F_CANCEL; + + /* + * always remove the IOCB. 
+ * If active, then place it into the correct active queue + */ + sdp_desc_q_remove((struct sdpc_desc *)iocb); + + if (iocb->flags & SDP_IOCB_F_ACTIVE) { + if (iocb->flags & SDP_IOCB_F_RDMA_W) + sdp_desc_q_put_tail(&conn->w_snk, + (struct sdpc_desc *)iocb); + else { + SDP_EXPECT((iocb->flags & SDP_IOCB_F_RDMA_R)); + + sdp_iocb_q_put_tail(&conn->w_src, iocb); + } + } else { + /* + * empty IOCBs can be deleted, while partials + * needs to be compelted. + */ + if (iocb->post > 0) { + sdp_iocb_complete(iocb, 0); + result = -EAGAIN; + } else { + sdp_iocb_destroy(iocb); + + /* + * completion reference + */ + if (!iocb->wait) + aio_put_req(iocb->req); + else { + unsigned long flags; + spin_lock_irqsave(&iocb->wait->lock, flags); + --iocb->wait->outstanding; + /* No need to wake up, + since we call sdp_req_busy + directly below */ + + spin_unlock_irqrestore(&iocb->wait->lock, flags); + } + } + } + } + + /* + * check the sink queue, not much to do, since the operation is + * already in flight. + */ + sdp_desc_q_lookup(&conn->w_snk, sdp_write_src_cancel, req); + + iocb = (struct sdpc_iocb *)sdp_desc_q_lookup(&conn->w_snk, + sdp_iocb_find_req, + req); + if (iocb) { + sdp_dbg_ctrl(conn, "Sink Queue busy\n"); + result = -EAGAIN; + } + + /* + * check source queue. If we're in the source queue, then a cancel + * needs to be issued. + */ + sdp_iocb_q_mark_cancel(&conn->w_src, req); + + iocb = sdp_iocb_q_lookup_req(&conn->w_src, req); + if (iocb) { + sdp_dbg_ctrl(conn, "Sending Src Cancel\n"); + + if (! (conn->flags & SDP_CONN_F_SRC_CANCEL_L)) { + sdp_desc_q_lookup(&conn->w_snk, sdp_write_src_cancel, req); + conn->flags |= SDP_CONN_F_SRC_CANCEL_L; + result = sdp_send_ctrl_src_cancel(conn); + SDP_EXPECT(result >= 0); + } + + result = -EAGAIN; + } + + if (!result) { + /* + * no IOCB found. Assume the IOCB will be completed. + */ + sdp_dbg_ctrl(conn, "Cancel IOCB done. 
<%d:%d:%08lx>", + req->ki_users, req->ki_key, req->ki_flags); + } + + sdp_conn_unlock(conn); + + return sdp_req_busy(conn, wait); +} +#endif + +/* * sdp_send_flush_advt - Flush passive sink advertisments */ static int sdp_send_flush_advt(struct sdp_sock *conn) @@ -1987,7 +2134,7 @@ return timeout; } -static inline int sdp_queue_iocb(struct kiocb *req, struct sdp_sock *conn, +static inline int sdp_queue_aio(struct kiocb *req, struct sdp_sock *conn, struct msghdr *msg, size_t size, size_t *copied) { @@ -2038,14 +2185,79 @@ return -EIOCBQUEUED; } +#ifdef CONFIG_INFINIBAND_SDP_SEND_ZCOPY +static inline int sdp_queue_sync(struct kiocb *req, struct sdp_sock *conn, + struct msghdr *msg, size_t size, + size_t *copied, + struct sdpc_iocb_wait *wait) +{ + struct sdpc_iocb *iocb; + struct iovec *msg_iov; + unsigned long flags; + size_t len; + int result; + /* + * create IOCB with remaining space + */ + iocb = sdp_iocb_create(); + if (!iocb) { + sdp_dbg_warn(conn, "Failed to allocate IOCB <%Zu:%ld>", + size, (long)*copied); + return -ENOMEM; + } + + for (msg_iov = msg->msg_iov; !msg->msg_iov->iov_len; ++msg_iov); + + /* FMR alignment can add an extra page. 
*/ + len = min(msg_iov->iov_len, (size_t)SDP_IOCB_SIZE_MAX - 4096); + iocb->len = len; + iocb->post = 0; + iocb->size = len; + iocb->req = req; + iocb->key = req->ki_key; + iocb->addr = (unsigned long)msg_iov->iov_base; + iocb->wait = wait; + + result = sdp_iocb_lock(iocb); + if (result < 0) { + sdp_dbg_warn(conn, "Error <%d> locking IOCB <%Zu:%ld>", + result, size, (long)copied); + + sdp_iocb_destroy(iocb); + return result; + } + + SDP_CONN_STAT_WQ_INC(conn, iocb->size); + + result = sdp_send_data_queue(conn, (struct sdpc_desc *)iocb); + if (result < 0) { + sdp_dbg_warn(conn, "Error <%d> queueing write IOCB", result); + sdp_iocb_destroy(iocb); + return result; + } + + spin_lock_irqsave(&wait->lock, flags); + ++wait->outstanding; + spin_unlock_irqrestore(&wait->lock, flags); + + conn->send_pipe += len; + *copied += len; /* copied amount was saved in IOCB. */ + msg_iov->iov_len -= len; + msg_iov->iov_base += len; + return 0; +} +#endif /* * sdp_inet_send - send data from user space to the network */ int sdp_inet_send(struct kiocb *req, struct socket *sock, struct msghdr *msg, size_t size) { - struct sock *sk; - struct sdp_sock *conn; +#ifdef CONFIG_INFINIBAND_SDP_SEND_ZCOPY + struct sdpc_iocb_wait wait; +#endif + struct sock *sk; + struct sdp_sock *conn; int result = 0; size_t copied = 0; int oob, zcopy; @@ -2074,6 +2286,7 @@ if (conn->state == SDP_CONN_ST_LISTEN || conn->state == SDP_CONN_ST_CLOSED) { result = -ENOTCONN; + sdp_conn_unlock(conn); goto done; } /* @@ -2082,13 +2295,24 @@ * they are smaller then the zopy threshold, but only if there is * no buffer write space. */ - zcopy = (size >= conn->src_zthresh && !is_sync_kiocb(req)); + zcopy = (size >= conn->src_zthresh && (!is_sync_kiocb(req) +#ifdef CONFIG_INFINIBAND_SDP_SEND_ZCOPY + || (!(msg->msg_flags & MSG_DONTWAIT) && !oob) +#endif + )); /* * clear ASYN space bit, it'll be reset if there is no space. 
*/ if (!zcopy) clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); +#ifdef CONFIG_INFINIBAND_SDP_SEND_ZCOPY + else if (is_sync_kiocb(req)) { + init_waitqueue_head(&wait.wait); + spin_lock_init(&wait.lock); + wait.outstanding = 0; + } +#endif /* * process data first if window is open, next check conditions, then * wait if there is more work to be done. The absolute window size is @@ -2143,14 +2367,45 @@ * completion. Wait on sync IO call create IOCB for async * call. */ +#ifdef CONFIG_INFINIBAND_SDP_SEND_ZCOPY + if (is_sync_kiocb(req) && zcopy) + result = sdp_queue_sync(req, conn, msg, size, &copied, + &wait); + /* TODO: limit the # of outstanding reqs */ + /* TODO: sleep on recoverable errors */ + else +#endif if (is_sync_kiocb(req)) timeout = sdp_wait_till_space(sk, conn, oob, timeout); else - result = sdp_queue_iocb(req, conn, msg, size, &copied); + result = sdp_queue_aio(req, conn, msg, size, &copied); } + sdp_conn_unlock(conn); + +#ifdef CONFIG_INFINIBAND_SDP_SEND_ZCOPY + if (!result && is_sync_kiocb(req) && zcopy) { + timeout = wait_event_interruptible_timeout(wait.wait, + !sdp_req_busy(conn, &wait), timeout); + if (!timeout) + result = -EAGAIN; + } + + if (signal_pending(current) && is_sync_kiocb(req) && zcopy) { + result = (timeout > 0) ? sock_intr_errno(timeout) : -EAGAIN; + + timeout = wait_event_timeout(wait.wait, + !sdp_write_cancel(req, conn, &wait), + SDP_ZCOPY_CANCEL_TIMEOUT); + if (!timeout) { + sdp_warn("sdp_write_cancel timed out. Abort.\n"); + sdp_conn_lock(conn); + sdp_conn_abort(conn); + sdp_conn_unlock(conn); + } + } +#endif done: - sdp_conn_unlock(conn); result = ((copied > 0) ? 
copied : result); if (result == -EPIPE && !(msg->msg_flags & MSG_NOSIGNAL)) Index: drivers/infiniband/ulp/sdp/sdp_conn.c =================================================================== --- drivers/infiniband/ulp/sdp/sdp_conn.c (revision 3958) +++ drivers/infiniband/ulp/sdp/sdp_conn.c (working copy) @@ -1279,7 +1279,15 @@ * connection lock */ sdp_conn_lock_init(conn); + +#ifdef CONFIG_INFINIBAND_SDP_RECV_ZCOPY /* + * Tasks to wake up when we finish all src avail + */ + INIT_LIST_HEAD(&conn->src_wait_list); + +#endif + /* * insert connection into lookup table */ result = sdp_conn_table_insert(conn); Index: drivers/infiniband/ulp/sdp/sdp_recv.c =================================================================== --- drivers/infiniband/ulp/sdp/sdp_recv.c (revision 3958) +++ drivers/infiniband/ulp/sdp/sdp_recv.c (working copy) @@ -327,6 +327,10 @@ iocb = sdp_iocb_q_look(&conn->r_pend); if (!iocb) return ENODEV; +#ifdef CONFIG_INFINIBAND_SDP_RECV_ZCOPY + if (iocb->flags & SDP_IOCB_F_WAITALL) + return ENODEV; +#endif /* * check zcopy threshold */ @@ -708,6 +712,9 @@ */ if (!iocb->len || (!conn->src_recv && +#ifdef CONFIG_INFINIBAND_SDP_RECV_ZCOPY + !(iocb->flags & SDP_IOCB_F_WAITALL) && +#endif !(sk_sdp(conn)->sk_rcvlowat > iocb->post))) { /* * complete IOCB @@ -1055,7 +1062,178 @@ return result; } +#ifdef CONFIG_INFINIBAND_SDP_RECV_ZCOPY /* + * sdp_read_cancel - cancel a synchronous IO operation + */ +static int sdp_read_cancel(struct kiocb *req, struct sdp_sock *conn, + struct sdpc_iocb_wait *wait, size_t *copied) +{ + struct sdpc_iocb *iocb; + int result = 0; + + sdp_dbg_ctrl(NULL, "Cancel Read IOCB. user <%d> key <%d> flag <%08lx>", + req->ki_users, req->ki_key, req->ki_flags); + + sdp_dbg_ctrl(conn, "Cancel Read IOCB. <%08x:%04x> <%08x:%04x>", + conn->src_addr, conn->src_port, + conn->dst_addr, conn->dst_port); + /* + * attempt to find the IOCB for this key. we don't have an indication + * whether this is a read or write. 
+ */ + while ((iocb = sdp_iocb_q_lookup_req(&conn->r_pend, req))) { + /* + * always remove the IOCB. If active, then place it into + * the correct active queue. Inactive empty IOCBs can be + * deleted, while inactive partials needs to be compelted. + */ + sdp_iocb_q_remove(iocb); + + if (!(iocb->flags & SDP_IOCB_F_ACTIVE)) { + *copied -= iocb->len; + if (iocb->post > 0) { + /* + * callback to complete IOCB, or drop reference + */ + sdp_iocb_complete(iocb, 0); + result = -EAGAIN; + } + else { + sdp_iocb_destroy(iocb); + /* + * completion reference + */ + if (iocb->wait) { + unsigned long flags; + spin_lock_irqsave(&iocb->wait->lock, flags); + if (!--iocb->wait->outstanding) { + wake_up(&iocb->wait->wait); + } + spin_unlock_irqrestore(&iocb->wait->lock, flags); + } else + aio_put_req(req); + + result = 0; + } + + goto out; + } + + if (iocb->flags & SDP_IOCB_F_RDMA_W) + sdp_iocb_q_put_tail(&conn->r_snk, iocb); + else { + SDP_EXPECT((iocb->flags & SDP_IOCB_F_RDMA_R)); + + sdp_desc_q_put_tail(&conn->r_src, + (struct sdpc_desc *)iocb); + } + } + /* + * check the source queue, not much to do, since the operation is + * already in flight. + */ + iocb = (struct sdpc_iocb *)sdp_desc_q_lookup(&conn->r_src, + sdp_iocb_find_req, req); + if (iocb) { + iocb->flags |= SDP_IOCB_F_CANCEL; + result = -EAGAIN; + + goto out; + } + /* + * check sink queue. If we're in the sink queue, then a cancel + * needs to be issued. + */ + iocb = sdp_iocb_q_lookup_req(&conn->r_snk, req); + if (iocb) { + /* + * Unfortunetly there is only a course grain cancel in SDP, so + * we have to cancel everything. + */ + if (!(conn->flags & SDP_CONN_F_SNK_CANCEL)) { + + result = sdp_send_ctrl_snk_cancel(conn); + SDP_EXPECT(result >= 0); + + conn->flags |= SDP_CONN_F_SNK_CANCEL; + } + + iocb->flags |= SDP_IOCB_F_CANCEL; + result = -EAGAIN; + + goto out; + } + /* + * no IOCB found. The cancel is probably in a race with a completion. + * Assume the IOCB will be completed, return appropriate value. 
+ */ + sdp_dbg_ctrl(NULL, "Cancel read with no IOCB. <%d:%d:%08lx>", + req->ki_users, req->ki_key, req->ki_flags); + + result = -EAGAIN; + +out: + return result; +} + +static int sdp_req_busy(struct kiocb *req, struct sdp_sock *conn, + struct sdpc_iocb_wait *wait, int waitall, + size_t *copied) +{ + struct sdpc_iocb *iocb; + unsigned long flags; + int result = -EAGAIN; + + for (;;) { + spin_lock_irqsave(&wait->lock, flags); + iocb = sdp_iocb_q_get_head(&wait->q); + if (!iocb) + break; + --wait->outstanding; + spin_unlock_irqrestore(&wait->lock, flags); + + sdp_iocb_release(iocb); + sdp_iocb_unlock(iocb); + sdp_iocb_destroy(iocb); + } + + if (!wait->outstanding) + result = 0; + + spin_unlock_irqrestore(&wait->lock, flags); + + sdp_conn_lock(conn); + + /* If WAITALL is clear, and there are no more src_pend, + remove all pending iocbs */ + if (!waitall && !sdp_advt_q_size(&conn->src_pend)) { + sdp_read_cancel(req, conn, wait, copied); + result = 0; + } + + if (!result) + list_del_init(&wait->src_wait_list); + + sdp_conn_unlock(conn); + + return result; +} +/* + * sdp_inet_read_cancel - cancel an IO operation + */ +static int sdp_cancel_read(struct kiocb *req, struct sdp_sock *conn, + struct sdpc_iocb_wait *wait, size_t *copied) +{ + sdp_conn_lock(conn); + sdp_read_cancel(req, conn, wait, copied); + sdp_conn_unlock(conn); + + return sdp_req_busy(req, conn, wait, 1, copied); +} +#endif + +/* * sdp_inet_recv - recv data from the network to user space */ int sdp_inet_recv(struct kiocb *req, struct socket *sock, struct msghdr *msg, @@ -1065,17 +1243,22 @@ struct sdp_sock *conn; struct sdpc_iocb *iocb; struct sdpc_buff *buff; - long timeout; + long timeout = 0 /*Turn off compiler warning */; size_t length; int result = 0; int expect; int low_water; - int copied = 0; + size_t copied = 0; int copy; int update; s8 oob = 0; s8 ack = 0; struct sdpc_buff_q peek_queue; +#ifdef CONFIG_INFINIBAND_SDP_RECV_ZCOPY + int zcopy = 0; + struct sdpc_iocb_wait wait; + unsigned long f; 
+#endif sk = sock->sk; conn = sdp_sk(sk); @@ -1293,6 +1476,76 @@ /* * Either wait or create IOCB for defered completion. */ +#ifdef CONFIG_INFINIBAND_SDP_RECV_ZCOPY + if (is_sync_kiocb(req) && !(flags & MSG_PEEK) && + (zcopy || size - copied >= conn->snk_zthresh) && + (conn->src_recv || + (low_water - copied >= conn->snk_zthresh))) { + struct iovec *msg_iov; + size_t len; + /* + * create IOCB with remaining space + */ + iocb = sdp_iocb_create(); + if (!iocb) { + sdp_dbg_warn(conn, + "Error allocating IOCB <%Zu:%Zd>", + size, copied); + result = -ENOMEM; + break; + } + + for (msg_iov = msg->msg_iov; !msg_iov->iov_len; ++msg_iov); + + /* FMR alignment can add an extra page. */ + len = min(msg_iov->iov_len, (size_t)SDP_IOCB_SIZE_MAX - 4096); + iocb->len = len; + iocb->post = 0; + iocb->size = len; + iocb->req = req; + iocb->key = req->ki_key; + iocb->addr = (unsigned long)msg_iov->iov_base; + iocb->wait = &wait; + + iocb->flags |= SDP_IOCB_F_RECV | SDP_IOCB_F_WAITALL; + + req->ki_cancel = sdp_inet_read_cancel; + + result = sdp_iocb_lock(iocb); + if (result < 0) { + sdp_dbg_warn(conn, + "Error <%d> IOCB lock <%Zu:%Zd>", + result, size, copied); + + sdp_iocb_destroy(iocb); + break; + } + + SDP_CONN_STAT_RQ_INC(conn, iocb->size); + + if (!zcopy) { + init_waitqueue_head(&wait.wait); + INIT_LIST_HEAD(&wait.src_wait_list); + spin_lock_init(&wait.lock); + sdp_iocb_q_init(&wait.q); + wait.outstanding = 0; + zcopy = 1; + } + + sdp_iocb_q_put_tail(&conn->r_pend, iocb); + + spin_lock_irqsave(&wait.lock, f); + ++wait.outstanding; + spin_unlock_irqrestore(&wait.lock, f); + + /* TODO: set it?
*/ + ack = 1; + copied += len; + msg_iov->iov_len -= len; + msg_iov->iov_base += len; + break; + } else +#endif if (is_sync_kiocb(req)) { DECLARE_WAITQUEUE(wait, current); @@ -1325,7 +1578,7 @@ iocb = sdp_iocb_create(); if (!iocb) { sdp_dbg_warn(conn, - "Error allocating IOCB <%Zu:%d>", + "Error allocating IOCB <%Zu:%Zd>", size, copied); result = -ENOMEM; break; @@ -1346,7 +1599,7 @@ result = sdp_iocb_lock(iocb); if (result < 0) { sdp_dbg_warn(conn, - "Error <%d> IOCB lock <%Zu:%d>", + "Error <%d> IOCB lock <%Zu:%Zd>", result, size, copied); sdp_iocb_destroy(iocb); @@ -1382,6 +1635,43 @@ while ((buff = sdp_buff_q_get_tail(&peek_queue))) sdp_buff_q_put_head(&conn->recv_pool, buff); +#ifdef CONFIG_INFINIBAND_SDP_RECV_ZCOPY + /* If WAITALL is clear, wake up also when we run out of src avail */ + if (!result && is_sync_kiocb(req) && zcopy && !(flags & MSG_WAITALL)) { + list_add_tail(&conn->src_wait_list, &wait.src_wait_list); + } +#endif sdp_conn_unlock(conn); +#ifdef CONFIG_INFINIBAND_SDP_RECV_ZCOPY + if (!result && is_sync_kiocb(req) && zcopy) { + timeout = wait_event_interruptible_timeout(wait.wait, + !sdp_req_busy(req, conn, &wait, + (flags & MSG_WAITALL), &copied), + timeout); + if (!timeout) { + result = -EAGAIN; + if (!(flags & MSG_WAITALL)) { + sdp_conn_lock(conn); + list_del_init(&wait.src_wait_list); + sdp_conn_unlock(conn); + } + } + } + + if (signal_pending(current) && is_sync_kiocb(req) && zcopy) { + result = (timeout > 0) ? sock_intr_errno(timeout) : -EAGAIN; + + timeout = wait_event_timeout(wait.wait, + !sdp_cancel_read(req, conn, &wait, &copied), + SDP_ZCOPY_CANCEL_TIMEOUT); + if (!timeout) { + sdp_warn("sdp_read_cancel timed out. Abort.\n"); + sdp_conn_lock(conn); + sdp_conn_abort(conn); + sdp_conn_unlock(conn); + } + } +#endif + return ((copied > 0) ? 
copied : result); } Index: drivers/infiniband/ulp/sdp/sdp_conn.h =================================================================== --- drivers/infiniband/ulp/sdp/sdp_conn.h (revision 3958) +++ drivers/infiniband/ulp/sdp/sdp_conn.h (working copy) @@ -377,6 +377,9 @@ #ifdef _SDP_CONN_STATE_REC struct sdp_conn_state state_rec; #endif +#ifdef CONFIG_INFINIBAND_SDP_RECV_ZCOPY + struct list_head src_wait_list; +#endif }; #define SDP_WRAP_GT(x, y) ((signed int)((x) - (y)) > 0) Index: drivers/infiniband/ulp/sdp/sdp_iocb.c =================================================================== --- drivers/infiniband/ulp/sdp/sdp_iocb.c (revision 3958) +++ drivers/infiniband/ulp/sdp/sdp_iocb.c (working copy) @@ -307,12 +307,23 @@ sdp_dbg_data(NULL, "IOCB complete. <%d:%d:%08lx> value <%ld>", iocb->req->ki_users, iocb->req->ki_key, iocb->req->ki_flags, value); + +#ifdef CONFIG_INFINIBAND_SDP_SEND_ZCOPY + if (iocb->wait) { + unsigned long flags; + spin_lock_irqsave(&iocb->wait->lock, flags); + if (!--iocb->wait->outstanding) { + wake_up(&iocb->wait->wait); + } + spin_unlock_irqrestore(&iocb->wait->lock, flags); + } else +#endif + /* + * valid result can be 0 or 1 for complete so + * we ignore the value. + */ + (void)aio_complete(iocb->req, value, 0); /* - * valid result can be 0 or 1 for complete so - * we ignore the value. 
- */ - (void)aio_complete(iocb->req, value, 0); - /* * delete IOCB */ sdp_iocb_destroy(iocb); @@ -325,7 +336,17 @@ { iocb->status = status; - if (in_atomic() || irqs_disabled()) { +#ifdef CONFIG_INFINIBAND_SDP_RECV_ZCOPY + if ((iocb->flags & SDP_IOCB_F_RECV) && iocb->wait) { + unsigned long flags; + spin_lock_irqsave(&iocb->wait->lock, flags); + sdp_iocb_q_put_tail(&iocb->wait->q, iocb); + wake_up(&iocb->wait->wait); + spin_unlock_irqrestore(&iocb->wait->lock, flags); + } else +#endif + if ((iocb->flags & SDP_IOCB_F_RECV) && + (in_atomic() || irqs_disabled())) { INIT_WORK(&iocb->completion, do_iocb_complete, (void *)iocb); schedule_work(&iocb->completion); } else @@ -382,6 +403,43 @@ return NULL; } +#ifdef CONFIG_INFINIBAND_SDP_SEND_ZCOPY +struct sdpc_iocb *sdp_iocb_q_lookup_req(struct sdpc_iocb_q *table, struct kiocb *req) +{ + struct sdpc_iocb *iocb = NULL; + int counter; + + for (counter = 0, iocb = table->head; counter < table->size; + counter++, iocb = iocb->next) + if (iocb->req == req) + return iocb; + + return NULL; +} + +void sdp_iocb_q_mark_cancel(struct sdpc_iocb_q *table, struct kiocb *req) +{ + struct sdpc_iocb *iocb = NULL; + int counter; + + for (counter = 0, iocb = table->head; counter < table->size; + counter++, iocb = iocb->next) + if (iocb->req == req) + iocb->flags |= SDP_IOCB_F_CANCEL; + +} + +int sdp_iocb_find_req(struct sdpc_desc *element, void *arg) +{ + struct sdpc_iocb *iocb = (struct sdpc_iocb *) element; + struct kiocb *req = (struct kiocb *)arg; + + if (element->type == SDP_DESC_TYPE_IOCB && iocb->req == req) + return 0; + return -ERANGE; +} +#endif + /* * sdp_iocb_create - create an IOCB object */ Index: drivers/infiniband/ulp/sdp/sdp_iocb.h =================================================================== --- drivers/infiniband/ulp/sdp/sdp_iocb.h (revision 3958) +++ drivers/infiniband/ulp/sdp/sdp_iocb.h (working copy) @@ -55,6 +55,9 @@ #define SDP_IOCB_F_LOCKED 0x00000040 /* IOCB is locked in memory */ #define SDP_IOCB_F_REG 
0x00000080 /* IOCB memory is registered */ #define SDP_IOCB_F_RECV 0x00000100 /* IOCB is for a receive request */ +#ifdef CONFIG_INFINIBAND_SDP_RECV_ZCOPY +#define SDP_IOCB_F_WAITALL 0x00000200 /* IOCB is for WAITALL request */ +#endif #define SDP_IOCB_F_ALL 0xFFFFFFFF /* IOCB all mask */ /* * zcopy constants. @@ -66,10 +69,12 @@ */ #define sdp_iocb_q_size(table) ((table)->size) +#ifdef CONFIG_INFINIBAND_SDP_SEND_ZCOPY +struct sdpc_iocb_wait; +#endif /* * INET read/write IOCBs */ - /* * save a kvec read/write for processing once data shows up. */ @@ -80,7 +85,7 @@ struct sdpc_iocb_q *table; /* table to which this iocb belongs */ void (*release)(struct sdpc_iocb *iocb); /* release the object */ /* - * iocb sepcific + * iocb specific */ int flags; /* usage flags */ /* @@ -112,6 +117,9 @@ int page_offset; /* offset into first page. */ struct work_struct completion; /* task for defered completion. */ +#ifdef CONFIG_INFINIBAND_SDP_SEND_ZCOPY + struct sdpc_iocb_wait *wait; +#endif /* * kernel iocb structure */ @@ -127,4 +135,26 @@ int size; /* current number of IOCBs in table */ }; +#ifdef CONFIG_INFINIBAND_SDP_SEND_ZCOPY +/* Report completions here */ +struct sdpc_iocb_wait { + spinlock_t lock; + int outstanding; + wait_queue_head_t wait; +#ifdef CONFIG_INFINIBAND_SDP_RECV_ZCOPY + struct sdpc_iocb_q q; /* Receive iocbs only */ + struct list_head src_wait_list; /* Receive only */ +#endif +}; + +#ifdef CONFIG_INFINIBAND_SDP_RECV_ZCOPY +static inline void sdp_iocb_wake(struct list_head *head) +{ + struct sdpc_iocb_wait *wait; + list_for_each_entry(wait, head, src_wait_list) + wake_up(&wait->wait); +} +#endif + +#endif #endif /* _SDP_IOCB_H */ -- MST From tom at opengridcomputing.com Fri Nov 4 09:02:41 2005 From: tom at opengridcomputing.com (Tom Tucker) Date: Fri, 04 Nov 2005 11:02:41 -0600 Subject: [openib-general][PATCH] local device search with source address wildcard Message-ID: <1131123761.3839.14.camel@trinity.austin.ammasso.com> Sean: I was looking through 
ip_resolve_local and it looks to me like if the source address is 0, it will end up getting set to the destination IP instead of the IP address of the local interface. Also if ip_dev_find can't find a local interface with connectivity to the specified remote peer, shouldn't the error be EHOSTUNREACH? Finally, if the user specifies a bogus source address, we don't compare it against the source address configured on the local interface found in the route. It will probably still fail later, but in some bizarre fashion. Here's a patch to show you what I mean. BTW, I think this brings up another issue: which locally configured IP address do we use if more than one is configured on the device (aliasing)? This patch just arbitrarily uses the first one. You could look for a key word in the ifname for example, i.e. eth0:rnic0 or something. I only compiled this in my branch and did not test it. It is just a conversation piece at this point. Signed-off-by: Tom Tucker Index: addr.c =================================================================== --- addr.c (revision 3860) +++ addr.c (working copy) @@ -216,17 +216,20 @@ struct ib_addr *addr) { struct net_device *dev; + struct in_device* indev; u32 src_ip = src_in->sin_addr.s_addr; u32 dst_ip = dst_in->sin_addr.s_addr; int ret = 0; dev = ip_dev_find(dst_ip); if (!dev) - return -EADDRNOTAVAIL; + return -EHOSTUNREACH; + indev = __in_dev_get(dev); + if (!src_ip) { - src_in->sin_family = dst_in->sin_family; - src_in->sin_addr.s_addr = dst_ip; + src_in->sin_family = AF_INET; + src_in->sin_addr.s_addr = indev->ifa_list->ifa_address; addr->sgid = *(union ib_gid *) (dev->dev_addr + 4); addr->pkey = addr_get_pkey(dev); } else { @@ -234,6 +237,11 @@ &addr->sgid, &addr->pkey); if (ret) goto out; + + if (src_in->sin_addr.s_addr != indev->ifa_list- >ifa_address) { + ret = -EINVAL; + goto out; + } } addr->dgid = *(union ib_gid *) (dev->dev_addr + 4); From iod00d at hp.com Fri Nov 4 08:32:51 2005 From: iod00d at hp.com (Grant Grundler) 
Date: Fri, 4 Nov 2005 08:32:51 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS (Reliable DatagramSockets) to OpenIB In-Reply-To: <96f8e60e0511031701i7b9ce5a0gbdade306735695e6@mail.gmail.com> References: <5D78D28F88822E4D8702BB9EEF1A4367C2DD62@mercury.infiniconsys.com> <001701c5e0c7$37d1b090$6401a8c0@YOURA11C73D0FD> <20051104002101.GC1478@esmail.cup.hp.com> <96f8e60e0511031701i7b9ce5a0gbdade306735695e6@mail.gmail.com> Message-ID: <20051104163251.GA4463@esmail.cup.hp.com> On Thu, Nov 03, 2005 at 05:01:59PM -0800, Ranjit Pandit wrote: > I will go ahead and post the Makefile...but it's currently specific to > SST Access layer. I wouldn't bother. > > 1) Port contrib/silverstorm/rds/ to linux-kernel/infiniband/ulp/rds/ > > Grant, at the last OpenIB conference, you had volunteered to help port > the code or, at the coding style. :) Yes, I offered to help port. But I can't do the "heavy lifting". I can test, code review, and fix up coding style nits. > > 2) include some docs on it's use and why RDS is better than SDP. > > I will checkin the RDS presentations shortly. That would be good. Just as a reminder, your slideset for the OpenIB "Data Center Workshop" is posted here: http://openib.org/docs/oib_wkshp_082205/Reliable_Datagram_Sockets.ppt thanks, grant From robert.j.woodruff at intel.com Fri Nov 4 08:35:10 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Fri, 4 Nov 2005 08:35:10 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB In-Reply-To: <010201c5e0db$f9ed3820$6401a8c0@YOURA11C73D0FD> Message-ID: Rick wrote, >I've atttached a draft proposal for RDS from Oracle which discusses some of >the motivation for RDS. I assume that you have a driver that uses TCP sockets, Correct ? If so, have you compared the performance of RDS to SDP ? 
woody From panda at cse.ohio-state.edu Fri Nov 4 09:00:34 2005 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Fri, 4 Nov 2005 12:00:34 -0500 (EST) Subject: [openib-general] mvapich-gen2 on 2 x 16 CPU SGI Altix 1330 cluster In-Reply-To: <436A4B37.1080801@sgi.com> from "John Partridge" at Nov 03, 2005 11:39:03 AM Message-ID: <200511041700.jA4H0YKp025723@xi.cse.ohio-state.edu> Hi John, > I just though you would like to know that I have now tested the > Pallas benchmark on a two node SGI Altix 1330 cluster using OpenIB > and mvapich-gen2. Each node had 16 CPU's. To do this I had to change > SMPI_MAX_NUMLOCALNODES to be defined as 16 instead of the normal 4 > for the test. I ran a 2x16 (32 total) CPU Pallas benchmark several > times with no hang ups or errors. Very glad to know that you are able to run Pallas with the above parameter change. We had put that parameter for such multi-way SMP systems (Altix) in mind. > I'm wondering if there would be any more changes I would need to > make for scaling to much larger systems. I do plan at some point in > the near future to test this on a much larger system with a LOT more > CPU's We have some ideas on scaling mvapich on multi-way SMP systems (like Altix). Unfortunately, we neither have information on the details of the memory hierarchy/organization of these systems nor access to such systems to try out these ideas. > The test was conducted using a "kernel.org" 2.6.14 kernel and an > OpenIB svn gen2 release of 3926 using Voltaire HCA's and switch > > We will be demonstrating OpenIB and mvapich-gen2 mpi at > Supercomputing 05 (running smaller jobs though because the 32 way > jobs take so long to complete). We will also demo rdma_lat, rdma_bw > and IpoIB. Good to know that you will be having the demo at SC '05. I will stop by your booth. If you will have some time, we can discuss about the memory hierarchy/organization of such systems and how to optimize mvapich on it. 
> I can send you the pallas results if you are interested. Yes, please send it to me. We will be happy to look at the results. Best Regards, DK > Regards > John > > -- > John Partridge > > Silicon Graphics Inc > Tel: 651-683-3428 > Vnet: 233-3428 > E-Mail: johnip at sgi.com > From mshefty at ichips.intel.com Fri Nov 4 09:36:16 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 04 Nov 2005 09:36:16 -0800 Subject: [openib-general][PATCH] local device search with source address wildcard In-Reply-To: <1131123761.3839.14.camel@trinity.austin.ammasso.com> References: <1131123761.3839.14.camel@trinity.austin.ammasso.com> Message-ID: <436B9C10.7050301@ichips.intel.com> Tom Tucker wrote: > Sean: > > I was looking through ip_resolve_local and it looks to me like > if the source address is 0, it will end up getting set to the > destination IP instead of the IP address of the local interface. The intent of ip_resolve_local() is to check if a given destination address is on the local system. If it is and no source address is specified, then the source address is set to the same address as the destination. > Also if ip_dev_find can't find a local interface with connectivity > to the specified remote peer, shouldn't the error be EHOSTUNREACH? If the address is not a local address, then a check is made to find a route to that address assuming that it exists somewhere remotely. See ib_resolve_addr() which calls ip_resolve_local() and ip_resolve_remote(). So, the return code from ip_resolve_local() returns that the given address is not available on the local system. The address may still be reachable as a remote address. > Finally, if the user specifies a bogus source address, we don't > compare it against the source address configured on the local > interface found in the route. It will probably still fail later, > but in some bizarre fashion. If the source address is bogus, then the call to ib_translate_addr() will fail with EADDRNOTAVAIL. 
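To make the two-step lookup described above concrete, here is a rough user-space sketch of the control flow: try the local step first, and only if that fails look for a remote route. This is not the addr.c code. Every name (toy_resolve_local, toy_resolve_remote, toy_resolve_addr), the address table, and the "only 10.0.0.0/8 is routable" rule are assumptions made up for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table of IPv4 addresses configured on this host. */
static const uint32_t toy_local_ips[] = { 0x0a000001u, 0x0a000101u };

static int toy_is_local(uint32_t ip)
{
	size_t i;

	for (i = 0; i < sizeof(toy_local_ips) / sizeof(toy_local_ips[0]); i++)
		if (toy_local_ips[i] == ip)
			return 1;
	return 0;
}

/* Local step: succeeds only if dst is one of our own addresses. */
int toy_resolve_local(uint32_t *src, uint32_t dst)
{
	if (!toy_is_local(dst))
		return -EADDRNOTAVAIL;	/* not local; may still be remote */
	if (*src == 0)
		*src = dst;		/* wildcard source: loopback, src == dst */
	else if (!toy_is_local(*src))
		return -EADDRNOTAVAIL;	/* bogus source address */
	return 0;
}

/* Remote step: pretend only 10.0.0.0/8 is routable (assumption). */
int toy_resolve_remote(uint32_t *src, uint32_t dst)
{
	if ((dst >> 24) != 0x0au)
		return -EHOSTUNREACH;
	if (*src == 0)
		*src = toy_local_ips[0];	/* arbitrary pick (assumption) */
	else if (!toy_is_local(*src))
		return -EADDRNOTAVAIL;	/* bogus source address */
	return 0;
}

/* Top level: try local first, fall back to remote on failure. */
int toy_resolve_addr(uint32_t *src, uint32_t dst)
{
	int ret = toy_resolve_local(src, dst);

	if (ret)
		ret = toy_resolve_remote(src, dst);
	return ret;
}
```

In this toy model a wildcard source only ends up equal to the destination when the destination itself is local, which is the loopback case. A failure of the local step by itself says nothing about whether the peer is reachable remotely.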
I'm not tied to the return codes, so if one of them works better than the other, I can change it. - Sean From iod00d at hp.com Fri Nov 4 10:19:52 2005 From: iod00d at hp.com (Grant Grundler) Date: Fri, 4 Nov 2005 10:19:52 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS (Reliable DatagramSockets) to OpenIB In-Reply-To: <010201c5e0db$f9ed3820$6401a8c0@YOURA11C73D0FD> References: <5D78D28F88822E4D8702BB9EEF1A4367C2DD62@mercury.infiniconsys.com> <001701c5e0c7$37d1b090$6401a8c0@YOURA11C73D0FD> <20051104002101.GC1478@esmail.cup.hp.com> <96f8e60e0511031701i7b9ce5a0gbdade306735695e6@mail.gmail.com> <010201c5e0db$f9ed3820$6401a8c0@YOURA11C73D0FD> Message-ID: <20051104181952.GB4463@esmail.cup.hp.com> On Thu, Nov 03, 2005 at 08:06:21PM -0500, Rick Frank wrote: > I've atttached a draft proposal for RDS from Oracle which discusses some of > the motivation for RDS. Thanks! Some questions/comments... o What is "GE" acronym for? o I'm seeing about 1/5th CPU load for SDP (vs IPoIB). The "50% less" number doesn't seem that impressive for RDP (vs IPoIB). Maybe this is a difference in the benchmark (I'm running netperf). o RDP wants to provide AF_INET_OFFLOAD. This doesn't exist in my source tree. I don't know who assigns these but it isn't lanana.org. Oracle would be wise to stick with what's in include/linux/sockets.h in order to avoid long term maintenance issues. ISTR OpenIB got flamed for wanting to use AF_INET_OFFLOAD name. If RDP is accepted, I would expect RDP to get AF_INET_RDP. And then use "LD_PRELOAD" and clone libsdp.so to take over AF_INET. ie follow a similar trajectory that SDP had. o Is access control to the RDP protocol something that applies to all protocols? I'm looking item #2 of "Additional Features". o Doesn't SDP meet the following requirement as well? | A goal of RDP should be to support all existing socket | functionality relevant to UDP with no changes to any | existing socket application - other than specifying | AF_INET_OFFLOAD. 
However, an RDP aware socket application | can take advantage of the RDP features. o I'm struggling with the "RDP is connectionless" comments made earlier. Later in this proposal, "RDP Interface" says packets will be delivered "in order". Doesn't that conflict with "connectionless"? Does UDP guarantee order? o The "crossover" value for zero copy vs inlining data is chipset specific. Ie even within the same architecture, different combinations of CPUs and chipsets will give wide variance. Things like cache size, cache replacement algorithm, available memory bandwidth, memory latency, et al, affect the choice. This value is normally define by/for each architecture since that's practical and lets each arch decide what the right tradeoff is. o The comments in "Recv operations" talk about "backpressure". Is this another way of saying the driver should drop packets once the "fairness threshold" is exceeded? o Does detecting the "death of a remote node" still fall within the "connectionless" definition? o I didn't look through the "config" and "statistics". o "RDP Information" section reminds me of the previous email thread about "netstat" support. Those probably want to be aligned so Oracle can leverage the same command as other users. ie reduce long term maintenance. And while researching the above, I found some nits with SDP: o I was expecting AF_INET_SDP to be in 2.6.14 and it's not. I hope it's part of 2.6.15-rc*. o The ulp/sdp/Kconfig comments say "AF_INET_SDP (address family 26)". AF_LLC uses 26 and sdp_sock.h defines 27. Michael - need a patch or is this trivial enough to fix by hand? 
thanks, grant From Richard.Frank at oracle.com Fri Nov 4 10:22:41 2005 From: Richard.Frank at oracle.com (Rick Frank) Date: Fri, 4 Nov 2005 13:22:41 -0500 Subject: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB References: Message-ID: <001401c5e16c$eefe06b0$6401a8c0@YOURA11C73D0FD> No we do not use TCP sockets - we use to many connections for this 100k+. ----- Original Message ----- From: "Bob Woodruff" To: "'Rick Frank'" ; "Ranjit Pandit" ; "Grant Grundler" Cc: Sent: Friday, November 04, 2005 11:35 AM Subject: RE: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB > Rick wrote, >>I've atttached a draft proposal for RDS from Oracle which discusses some >>of > >>the motivation for RDS. > > I assume that you have a driver that uses TCP sockets, Correct ? > If so, have you compared the performance of RDS to SDP ? > > woody > > From swise at opengridcomputing.com Fri Nov 4 10:22:34 2005 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 4 Nov 2005 12:22:34 -0600 Subject: [openib-general][PATCH] local device search with source addresswildcard References: <1131123761.3839.14.camel@trinity.austin.ammasso.com> <436B9C10.7050301@ichips.intel.com> Message-ID: <00fb01c5e16c$ba78be80$d5000a0a@STEVO> ----- Original Message ----- From: "Sean Hefty" To: "Tom Tucker" Cc: Sent: Friday, November 04, 2005 11:36 AM Subject: Re: [openib-general][PATCH] local device search with source addresswildcard > Tom Tucker wrote: >> Sean: >> >> I was looking through ip_resolve_local and it looks to me like >> if the source address is 0, it will end up getting set to the >> destination IP instead of the IP address of the local interface. > > The intent of ip_resolve_local() is to check if a given destination > address is on the local system. If it is and no source address is > specified, then the source address is set to the same address as the > destination. > This doesn't sound correct to me. 
The src ip address is supposed to be the local ip address to be used for establishing the connection. If you set it to the destination address, then you'd end up passing that address to the peer in the private data, and that is incorrect... >> Also if ip_dev_find can't find a local interface with connectivity to >> the specified remote peer, shouldn't the error be EHOSTUNREACH? > > If the address is not a local address, then a check is made to find a > route to that address assuming that it exists somewhere remotely. See > ib_resolve_addr() which calls ip_resolve_local() and > ip_resolve_remote(). So, the return code from ip_resolve_local() > returns that the given address is not available on the local system. > The address may still be reachable as a remote address. > >> Finally, if the user specifies a bogus source address, we don't >> compare it against the source address configured on the local >> interface found in the route. It will probably still fail later, but >> in some bizarre fashion. > > If the source address is bogus, then the call to ib_translate_addr() > will fail with EADDRNOTAVAIL. > > I'm not tied to the return codes, so if one of them works better than > the other, I can change it. > > - Sean > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From mshefty at ichips.intel.com Fri Nov 4 10:30:03 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 04 Nov 2005 10:30:03 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB In-Reply-To: <001401c5e16c$eefe06b0$6401a8c0@YOURA11C73D0FD> References: <001401c5e16c$eefe06b0$6401a8c0@YOURA11C73D0FD> Message-ID: <436BA8AB.1050602@ichips.intel.com> Rick Frank wrote: > No we do not use TCP sockets - we use to many connections for this 100k+. 
Isn't RDS implemented on top of reliable IB/RDMA connections anyway? - Sean From mshefty at ichips.intel.com Fri Nov 4 10:32:08 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 04 Nov 2005 10:32:08 -0800 Subject: [openib-general][PATCH] local device search with source addresswildcard In-Reply-To: <00fb01c5e16c$ba78be80$d5000a0a@STEVO> References: <1131123761.3839.14.camel@trinity.austin.ammasso.com> <436B9C10.7050301@ichips.intel.com> <00fb01c5e16c$ba78be80$d5000a0a@STEVO> Message-ID: <436BA928.9020403@ichips.intel.com> Steve Wise wrote: >> The intent of ip_resolve_local() is to check if a given destination >> address is on the local system. If it is and no source address is >> specified, then the source address is set to the same address as the >> destination. > > This doesn't sound correct to me. The src ip address is supposed to be > the local ip address to be used for establishing the connection. If you > set it to the destination address, then you'd end up passing that > address to the peer in the private data, and that is incorrect... If ip_resolve_local() succeeds, then the destination address is a local address. - Sean From ftillier at silverstorm.com Fri Nov 4 10:33:19 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Fri, 4 Nov 2005 10:33:19 -0800 Subject: [openib-general][PATCH] local device search with sourceaddresswildcard In-Reply-To: <00fb01c5e16c$ba78be80$d5000a0a@STEVO> Message-ID: <000801c5e16e$3b4a1490$9e5aa8c0@infiniconsys.com> > From: Steve Wise [mailto:swise at opengridcomputing.com] > Sent: Friday, November 04, 2005 10:23 AM > > ----- Original Message ----- > From: "Sean Hefty" > To: "Tom Tucker" > Cc: > > > Tom Tucker wrote: > >> Sean: > >> > >> I was looking through ip_resolve_local and it looks to me like > >> if the source address is 0, it will end up getting set to the > >> destination IP instead of the IP address of the local interface. 
> > > > The intent of ip_resolve_local() is to check if a given destination > > address is on the local system. If it is and no source address is > > specified, then the source address is set to the same address as the > > destination. > > > > This doesn't sound correct to me. The src ip address is supposed to be > the local ip address to be used for establishing the connection. If you > set it to the destination address, then you'd end up passing that > address to the peer in the private data, and that is incorrect... If the destination address is on the local system, then the user is establishing a loopback connection. I think that if the user didn't specify a source address, returning the same address as the destination should give the proper results. For loopback connections, source and destination can (and will likely) be the same. - Fab From swise at opengridcomputing.com Fri Nov 4 10:33:29 2005 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 4 Nov 2005 12:33:29 -0600 Subject: [openib-general][PATCH] local device search with sourceaddresswildcard References: <000801c5e16e$3b4a1490$9e5aa8c0@infiniconsys.com> Message-ID: <011e01c5e16e$409f21b0$d5000a0a@STEVO> i misunderstood. I didn't realize ip_resolve_local() will fail if the address it a remote address. nevermind... :-\ ----- Original Message ----- From: "Fab Tillier" To: "'Steve Wise'" ; "Sean Hefty" ; "Tom Tucker" Cc: Sent: Friday, November 04, 2005 12:33 PM Subject: RE: [openib-general][PATCH] local device search with sourceaddresswildcard > From: Steve Wise [mailto:swise at opengridcomputing.com] > Sent: Friday, November 04, 2005 10:23 AM > > ----- Original Message ----- > From: "Sean Hefty" > To: "Tom Tucker" > Cc: > > > Tom Tucker wrote: > >> Sean: > >> > >> I was looking through ip_resolve_local and it looks to me like > >> if the source address is 0, it will end up getting set to the > >> destination IP instead of the IP address of the local interface. 
> > > > The intent of ip_resolve_local() is to check if a given destination > > address is on the local system. If it is and no source address is > > specified, then the source address is set to the same address as the > > destination. > > > > This doesn't sound correct to me. The src ip address is supposed to > be > the local ip address to be used for establishing the connection. If > you > set it to the destination address, then you'd end up passing that > address to the peer in the private data, and that is incorrect... If the destination address is on the local system, then the user is establishing a loopback connection. I think that if the user didn't specify a source address, returning the same address as the destination should give the proper results. For loopback connections, source and destination can (and will likely) be the same. - Fab From tom at opengridcomputing.com Fri Nov 4 11:40:27 2005 From: tom at opengridcomputing.com (Tom Tucker) Date: Fri, 04 Nov 2005 13:40:27 -0600 Subject: [openib-general][PATCH] local device search with sourceaddresswildcard In-Reply-To: <011e01c5e16e$409f21b0$d5000a0a@STEVO> References: <000801c5e16e$3b4a1490$9e5aa8c0@infiniconsys.com> <011e01c5e16e$409f21b0$d5000a0a@STEVO> Message-ID: <1131133227.3839.44.camel@trinity.austin.ammasso.com> Well.... It won't necessarily fail will it? If you specified the source address as another port on the same machine, but NOT the one with connectivity to the remote peer, the routine will succeed, but the results are not what you expect...and you will fail further down the line (looking up the path record). This is one of the "bizarre" failures I was originally referring to. By the way, the function name is addr_resolve_local, not ip_xxx ... sorry. On Fri, 2005-11-04 at 12:33 -0600, Steve Wise wrote: > i misunderstood. > > I didn't realize ip_resolve_local() will fail if the address it a remote > address. > > nevermind... 
> > :-\ > > > > ----- Original Message ----- > From: "Fab Tillier" > To: "'Steve Wise'" ; "Sean Hefty" > ; "Tom Tucker" > Cc: > Sent: Friday, November 04, 2005 12:33 PM > Subject: RE: [openib-general][PATCH] local device search with > sourceaddresswildcard > > > > From: Steve Wise [mailto:swise at opengridcomputing.com] > > Sent: Friday, November 04, 2005 10:23 AM > > > > ----- Original Message ----- > > From: "Sean Hefty" > > To: "Tom Tucker" > > Cc: > > > > > Tom Tucker wrote: > > >> Sean: > > >> > > >> I was looking through ip_resolve_local and it looks to me like > > >> if the source address is 0, it will end up getting set to the > > >> destination IP instead of the IP address of the local interface. > > > > > > The intent of ip_resolve_local() is to check if a given destination > > > address is on the local system. If it is and no source address is > > > specified, then the source address is set to the same address as the > > > destination. > > > > > > > This doesn't sound correct to me. The src ip address is supposed to > > be > > the local ip address to be used for establishing the connection. If > > you > > set it to the destination address, then you'd end up passing that > > address to the peer in the private data, and that is incorrect... > > If the destination address is on the local system, then the user is > establishing > a loopback connection. I think that if the user didn't specify a source > address, returning the same address as the destination should give the > proper > results. > > For loopback connections, source and destination can (and will likely) > be the > same. 
> > - Fab From ftillier at silverstorm.com Fri Nov 4 10:42:27 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Fri, 4 Nov 2005 10:42:27 -0800 Subject: [openib-general] [ANNOUNCE] ContributeRDS (ReliableDatagramSockets) to OpenIB In-Reply-To: <436BA8AB.1050602@ichips.intel.com> Message-ID: <000901c5e16f$827a26b0$9e5aa8c0@infiniconsys.com> > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Friday, November 04, 2005 10:30 AM > > Rick Frank wrote: > > No we do not use TCP sockets - we use to many connections for this 100k+. > > Isn't RDS implemented on top of reliable IB/RDMA connections anyway? There is not a 1:1 relationship between a UDP application socket and an IB QP, rather there is a single IB connection between systems over which traffic from multiple UDP sockets flows. - Fab From mshefty at ichips.intel.com Fri Nov 4 10:44:13 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 04 Nov 2005 10:44:13 -0800 Subject: [openib-general] [ANNOUNCE] ContributeRDS (ReliableDatagramSockets) to OpenIB In-Reply-To: <000901c5e16f$827a26b0$9e5aa8c0@infiniconsys.com> References: <000901c5e16f$827a26b0$9e5aa8c0@infiniconsys.com> Message-ID: <436BABFD.2010704@ichips.intel.com> Fab Tillier wrote: > There is not a 1:1 relationship between a UDP application socket and an IB QP, > rather there is a single IB connection between systems over which traffic from > multiple UDP sockets flows. Sounds like software based IB RDD. - Sean From mshefty at ichips.intel.com Fri Nov 4 10:54:26 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 04 Nov 2005 10:54:26 -0800 Subject: [openib-general][PATCH] local device search with sourceaddresswildcard In-Reply-To: <1131133227.3839.44.camel@trinity.austin.ammasso.com> References: <000801c5e16e$3b4a1490$9e5aa8c0@infiniconsys.com> <011e01c5e16e$409f21b0$d5000a0a@STEVO> <1131133227.3839.44.camel@trinity.austin.ammasso.com> Message-ID: <436BAE62.20205@ichips.intel.com> Tom Tucker wrote: > Well.... 
It won't necessarily fail will it? If you specified the source > address as another port on the same machine, but NOT the one with > connectivity to the remote peer, the routine will succeed, but the > results are not what you expect...and you will fail further down the > line (looking up the path record). This is one of the "bizarre" failures > I was originally referring to. If a source address is NOT specified, then I don't think that there's any issue. If a source address is specified, a failure can occur during rdma_resolve_route() with an error that the source and destination addresses are not reachable. This seems reasonable to me. - Sean From robert.j.woodruff at intel.com Fri Nov 4 10:57:40 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Fri, 4 Nov 2005 10:57:40 -0800 Subject: [openib-general] [ANNOUNCE] ContributeRDS(ReliableDatagramSockets) to OpenIB In-Reply-To: <000901c5e16f$827a26b0$9e5aa8c0@infiniconsys.com> Message-ID: Fab wrote, >There is not a 1:1 relationship between a UDP application socket and an IB QP, >rather there is a single IB connection between systems over which traffic from >multiple UDP sockets flows. >- Fab That would probably provide better scalability, since there would not be a 1:1 mapping between UDP sockets and IB connections, however for large clusters there may still be a scalability issue if every node needs to have a connection to every other node. If you implemented it on top of datagrams instead, then each node would only need one QP, rather than one for every node in the cluster. 
woody From tom at opengridcomputing.com Fri Nov 4 12:00:58 2005 From: tom at opengridcomputing.com (Tom Tucker) Date: Fri, 04 Nov 2005 14:00:58 -0600 Subject: [openib-general][PATCH] local device search with sourceaddresswildcard In-Reply-To: <436BAE62.20205@ichips.intel.com> References: <000801c5e16e$3b4a1490$9e5aa8c0@infiniconsys.com> <011e01c5e16e$409f21b0$d5000a0a@STEVO> <1131133227.3839.44.camel@trinity.austin.ammasso.com> <436BAE62.20205@ichips.intel.com> Message-ID: <1131134458.3839.52.camel@trinity.austin.ammasso.com> Ok, so lets assume that all the other stuff is background noise about error codes... If you specify a 0 as a source address, won't the private data contain the destination address as the source address, or did I miss something? On Fri, 2005-11-04 at 10:54 -0800, Sean Hefty wrote: > Tom Tucker wrote: > > Well.... It won't necessarily fail will it? If you specified the source > > address as another port on the same machine, but NOT the one with > > connectivity to the remote peer, the routine will succeed, but the > > results are not what you expect...and you will fail further down the > > line (looking up the path record). This is one of the "bizarre" failures > > I was originally referring to. > > If a source address is NOT specified, then I don't think that there's any issue. > > If a source address is specified, a failure can occur during > rdma_resolve_route() with an error that the source and destination addresses are > not reachable. This seems reasonable to me. 
>
> - Sean

From mshefty at ichips.intel.com Fri Nov 4 11:05:44 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 04 Nov 2005 11:05:44 -0800
Subject: [openib-general][PATCH] local device search with sourceaddresswildcard
In-Reply-To: <1131134458.3839.52.camel@trinity.austin.ammasso.com>
References: <000801c5e16e$3b4a1490$9e5aa8c0@infiniconsys.com> <011e01c5e16e$409f21b0$d5000a0a@STEVO> <1131133227.3839.44.camel@trinity.austin.ammasso.com> <436BAE62.20205@ichips.intel.com> <1131134458.3839.52.camel@trinity.austin.ammasso.com>
Message-ID: <436BB108.7090009@ichips.intel.com>

Tom Tucker wrote:
> If you specify a 0 as a source address, won't the private data contain
> the destination address as the source address, or did I miss something?

The source and destination addresses will be the same if the destination is on
the local system.

E.g. The local system has address 192.168.0.101. The destination address is
192.168.0.101. The source address will also be set to 192.168.0.101.

- Sean

From tom at opengridcomputing.com Fri Nov 4 12:09:35 2005
From: tom at opengridcomputing.com (Tom Tucker)
Date: Fri, 04 Nov 2005 14:09:35 -0600
Subject: [openib-general][PATCH] local device search with sourceaddresswildcard
In-Reply-To: <436BB108.7090009@ichips.intel.com>
References: <000801c5e16e$3b4a1490$9e5aa8c0@infiniconsys.com> <011e01c5e16e$409f21b0$d5000a0a@STEVO> <1131133227.3839.44.camel@trinity.austin.ammasso.com> <436BAE62.20205@ichips.intel.com> <1131134458.3839.52.camel@trinity.austin.ammasso.com> <436BB108.7090009@ichips.intel.com>
Message-ID: <1131134975.3839.59.camel@trinity.austin.ammasso.com>

Sean:

I think I'm convinced I'm confused... which is another way of saying
"you're right".

Thanks,
Tom

On Fri, 2005-11-04 at 11:05 -0800, Sean Hefty wrote:
> Tom Tucker wrote:
> > If you specify a 0 as a source address, won't the private data contain
> > the destination address as the source address, or did I miss something?
> > The source and destination addresses will be the same if the destination is on > the local system. > > E.g. The local system has address 192.168.0.101. The destination address is > 192.168.0.101. The source address will also be set to 192.168.0.101. > > - Sean From ftillier at silverstorm.com Fri Nov 4 11:16:20 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Fri, 4 Nov 2005 11:16:20 -0800 Subject: [openib-general] [ANNOUNCE] ContributeRDS(ReliableDatagramSockets) to OpenIB In-Reply-To: Message-ID: <000a01c5e174$3f3549c0$9e5aa8c0@infiniconsys.com> > From: Bob Woodruff [mailto:robert.j.woodruff at intel.com] > Sent: Friday, November 04, 2005 10:58 AM > > Fab wrote, > >There is not a 1:1 relationship between a UDP application socket > >and an IB QP, rather there is a single IB connection between systems > >over which traffic from multiple UDP sockets flows. > > That would probably provide better scalability, since there > would not be a 1:1 mapping between UDP sockets and IB connections, > however for large clusters there may still be a scalability issue > if every node needs to have a connection to every other node. > If you implemented it on top of datagrams instead, then each node > would only need one QP, rather than one for every node in the cluster. Doing a UDP to IB-UD protocol is unlikely to buy you anything over just using IPoIB. I don't know about doing UDP to IB-RDD, but the complexity of supporting end to end contexts and RDD QPs seems to me to outweigh the complexity of doing SW multiplexing over multiple IB-RC QPs. I don't think software multiplexing over IB-RC costs much from both a system/HCA resource and performance perspective, especially compared to doing something like uDAPL or SDP where there's a 1:1 relationship between EP or socket to QP, respectively. 
- Fab

From caitlinb at broadcom.com Fri Nov 4 11:15:09 2005
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Fri, 4 Nov 2005 11:15:09 -0800
Subject: [openib-general] [ANNOUNCE] ContributeRDS(ReliableDatagramSockets) to OpenIB
Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020C40@NT-SJCA-0751.brcm.ad.broadcom.com>

> -----Original Message-----
> From: openib-general-bounces at openib.org
> [mailto:openib-general-bounces at openib.org] On Behalf Of Bob Woodruff
> Sent: Friday, November 04, 2005 10:58 AM
> To: 'Fab Tillier'; 'Sean Hefty'; Rick Frank
> Cc: openib-general at openib.org
> Subject: RE: [openib-general] [ANNOUNCE]
> ContributeRDS(ReliableDatagramSockets) to OpenIB
>
> Fab wrote,
> >There is not a 1:1 relationship between a UDP application socket and an IB QP,
> >rather there is a single IB connection between systems over which traffic from
> >multiple UDP sockets flows.
>
> >- Fab
>
> That would probably provide better scalability, since there
> would not be a 1:1 mapping between UDP sockets and IB
> connections, however for large clusters there may still be a
> scalability issue if every node needs to have a connection to
> every other node.
> If you implemented it on top of datagrams instead, then each
> node would only need one QP, rather than one for every node
> in the cluster.
>

But then the application would have to take responsibility for
congestion control and retries after network packet losses.

RDS allows an application all the benefits of a reliable
connection without the overhead, except for per-connection
back-pressure. Many applications do not need per-connection
back-pressure since they already have session-wide flow
control policies in place.

Going from one connection for each pair of application endpoints
to one connection for each pair of hosts is a major improvement.
For most applications going down to a single QP after that is
not sufficiently valuable to add the complexity of working over
a totally unreliable protocol.
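The three connection models compared in this exchange can be put side by side with a little arithmetic. The following is only a rough sketch under a worst-case assumption that is not stated in the thread: every local socket talks to every socket on every other host. The function names are made up for illustration and do not correspond to any RDS or OpenIB interface.

```python
def per_endpoint_connections(hosts: int, sockets_per_host: int) -> int:
    """Connections held by ONE host when each pair of application
    endpoints (local socket, remote socket) gets its own connection.
    Assumes every local socket talks to every remote socket."""
    remote_sockets = (hosts - 1) * sockets_per_host
    return sockets_per_host * remote_sockets


def per_host_connections(hosts: int) -> int:
    """Connections held by ONE host when all of its sockets share a
    single reliable connection to each remote host."""
    return hosts - 1


def datagram_qps() -> int:
    """QPs held by ONE host in the datagram model suggested by woody:
    one QP regardless of cluster size, with reliability left to software."""
    return 1


if __name__ == "__main__":
    hosts, sockets = 4, 3  # deliberately small illustrative values
    print(per_endpoint_connections(hosts, sockets))  # one per endpoint pair
    print(per_host_connections(hosts))               # one per host pair
    print(datagram_qps())                            # single datagram QP
```

The per-endpoint count grows with the square of the sockets per host, while the per-host count depends only on the number of hosts, which is the improvement described above.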

From swise at opengridcomputing.com Fri Nov 4 11:14:58 2005
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 4 Nov 2005 13:14:58 -0600
Subject: [openib-general] machine check stop
Message-ID: <015801c5e174$0bfea1a0$d5000a0a@STEVO>

All,

I'm running on X86_64 platforms with 2.6.13.3+kdb, the openib iwarp branch.

I'm running with 2 dual port mellanox cards hooked up point to point between
the two systems. So I have 4 IB subnets.

I'm running a kernel module that sets up connections and pounds them with
rdma writes. Intermittently, when I kick off these tests I get a machine
check stop on the client system. Something like this:

CPU 0: Machine Check Stop 4 Bank 4: b200000000070f0 TSC blah blah blah.

Does this ring a bell with anyone?

Thanks,

Steve.

From robert.j.woodruff at intel.com Fri Nov 4 11:31:04 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Fri, 4 Nov 2005 11:31:04 -0800
Subject: [openib-general] [ANNOUNCE] ContributeRDS(ReliableDatagramSockets) to OpenIB
In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1020C40@NT-SJCA-0751.brcm.ad.broadcom.com>
Message-ID:

Caitlin wrote,
>Going from one connection for each pair of application endpoints
>to one connection for each pair of hosts is a major improvement.
>For most applications going down to a single QP after that is
>not sufficiently valuable to add the complexity of working over
>a totally unreliable protocol.

I agree that there is some improvement in going from one QP per
UDP socket to one per node, but it still will likely not
scale to 10,000 node clusters, which is something that Oracle
probably does not care about, but others in HPC do.

If we are going to invent a Reliable Datagram Service, shouldn't
it be made to scale so that MPIs that currently use datagrams
could also benefit?
woody From rpandit at silverstorm.com Fri Nov 4 11:33:29 2005 From: rpandit at silverstorm.com (Ranjit Pandit) Date: Fri, 4 Nov 2005 11:33:29 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS (Reliable DatagramSockets) to OpenIB In-Reply-To: <20051104181952.GB4463@esmail.cup.hp.com> References: <5D78D28F88822E4D8702BB9EEF1A4367C2DD62@mercury.infiniconsys.com> <001701c5e0c7$37d1b090$6401a8c0@YOURA11C73D0FD> <20051104002101.GC1478@esmail.cup.hp.com> <96f8e60e0511031701i7b9ce5a0gbdade306735695e6@mail.gmail.com> <010201c5e0db$f9ed3820$6401a8c0@YOURA11C73D0FD> <20051104181952.GB4463@esmail.cup.hp.com> Message-ID: <96f8e60e0511041133i222fc7edt172b7b5b1e7ba9fa@mail.gmail.com> On 11/4/05, Grant Grundler wrote: > On Thu, Nov 03, 2005 at 08:06:21PM -0500, Rick Frank wrote: > > I've atttached a draft proposal for RDS from Oracle which discusses some of > > the motivation for RDS. > > Thanks! > > Some questions/comments... > > o What is "GE" acronym for? Gigabit Ethernet > > o I'm seeing about 1/5th CPU load for SDP (vs IPoIB). > The "50% less" number doesn't seem that impressive for RDP (vs IPoIB). > Maybe this is a difference in the benchmark (I'm running netperf). 1/5th CPU load with SDP is with zero copy or without? > > o RDP wants to provide AF_INET_OFFLOAD. This doesn't exist in my source tree. > I don't know who assigns these but it isn't lanana.org. > Oracle would be wise to stick with what's in include/linux/sockets.h > in order to avoid long term maintenance issues. > > ISTR OpenIB got flamed for wanting to use AF_INET_OFFLOAD name. > If RDP is accepted, I would expect RDP to get AF_INET_RDP. > And then use "LD_PRELOAD" and clone libsdp.so to take over AF_INET. > ie follow a similar trajectory that SDP had. > On SST stack, AF_INET_OFFLOAD is used for both SDP as well as RDS. The difference is in the socket_type, SOCK_STREAM Vs SOCK_DGRAM. Can something similar be done on OpenIB? 
> o Is access control to the RDP protocol something that applies to > all protocols? > I'm looking item #2 of "Additional Features". > In this particular case, Oracle had a specific requirement for access control on RDP. I don't know if other users will have similar requirement on other ULPs or not. > > o Doesn't SDP meet the following requirement as well? > > | A goal of RDP should be to support all existing socket > | functionality relevant to UDP with no changes to any > | existing socket application - other than specifying > | AF_INET_OFFLOAD. However, an RDP aware socket application > | can take advantage of the RDP features. > SDP does not support SOCK_DGRAM... ulp/sdp/sdp_inet.c if (SOCK_STREAM != sock->type || (IPPROTO_IP != protocol && IPPROTO_TCP != protocol)) { sdp_dbg_warn(NULL, "SOCKET: unsupported type/proto. <%d:%d>", sock->type, protocol); return -EPROTONOSUPPORT; } > > o I'm struggling with the "RDP is connectionless" comments made earlier. > Later in this proposal, "RDP Interface" says packets will be > delivered "in order". Doesn't that conflict with "connectionless"? > Does UDP guarantee order? > As Fab and Rick mentioned, RDS provides UDP like connectionless model to applications but it uses IB/RC to communicate to the remote node. So the application doesn't have to maintain a connection state, which is a problem with TCP/SDP when there are 100K odd connections involved. > o The "crossover" value for zero copy vs inlining data is chipset specific. > Ie even within the same architecture, different combinations of CPUs > and chipsets will give wide variance. Things like cache size, cache > replacement algorithm, available memory bandwidth, memory latency, > et al, affect the choice. This value is normally define by/for each > architecture since that's practical and lets each arch decide > what the right tradeoff is. > Agreed. > o The comments in "Recv operations" talk about "backpressure". 
> Is this another way of saying the driver should drop packets once > the "fairness threshold" is exceeded? > The driver cannot drop packets. When backpressure'd, RDS returns EWOULDBLOCK to the application and then the application can retry. > o Does detecting the "death of a remote node" still fall > within the "connectionless" definition? When a particular socket is "backpressure'd/stalled" the application gets EWOULDBLOCK. Meanwhile, if the destination node dies, that socket needs to be unblocked. Any subsequent sends will return an error so the application can take corrective measures or cleanup. > > o I didn't look through the "config" and "statistics". > > o "RDP Information" section reminds me of the previous email thread > about "netstat" support. Those probably want to be aligned so > Oracle can leverage the same command as other users. > ie reduce long term maintenance. > > > And while researching the above, I found some nits with SDP: > > o I was expecting AF_INET_SDP to be in 2.6.14 and it's not. > I hope it's part of 2.6.15-rc*. > > o The ulp/sdp/Kconfig comments say "AF_INET_SDP (address family 26)". > AF_LLC uses 26 and sdp_sock.h defines 27. > Michael - need a patch or is this trivial enough to fix by hand? 
>
> thanks,
> grant
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

From iod00d at hp.com Fri Nov 4 11:38:23 2005
From: iod00d at hp.com (Grant Grundler)
Date: Fri, 4 Nov 2005 11:38:23 -0800
Subject: [openib-general] [ANNOUNCE] Contribute RDS (Reliable DatagramSockets) to OpenIB
In-Reply-To: <20051104181952.GB4463@esmail.cup.hp.com>
References: <5D78D28F88822E4D8702BB9EEF1A4367C2DD62@mercury.infiniconsys.com> <001701c5e0c7$37d1b090$6401a8c0@YOURA11C73D0FD> <20051104002101.GC1478@esmail.cup.hp.com> <96f8e60e0511031701i7b9ce5a0gbdade306735695e6@mail.gmail.com> <010201c5e0db$f9ed3820$6401a8c0@YOURA11C73D0FD> <20051104181952.GB4463@esmail.cup.hp.com>
Message-ID: <20051104193823.GD4463@esmail.cup.hp.com>

On Fri, Nov 04, 2005 at 10:19:52AM -0800, Grant Grundler wrote:
...
> o The comments in "Recv operations" talk about "backpressure".
>   Is this another way of saying the driver should drop packets once
>   the "fairness threshold" is exceeded?

Ranjit's slideset answered this question (I think):
| o Slow receiver ports are stalled at sender side
|   - combination of activity (LRU) and memory utilization used
|     to detect slow receivers
|   - sendmsg() to stalled destination port returns
|     EWOULDBLOCK, application can retry
|   - recvmsg() on a stalled port un-stalls it

I'm having trouble reconciling previous "connectionless" and
"transparent to user space" comments with this slide.
Especially the "EWOULDBLOCK" return code.

If a receiver can cause a sender to stall, it implies the packets
will get dropped on the send side. This is a subtle change
in behavior that I don't think any UDP application can assume.
But I'm no networking protocol expert...
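The "returns EWOULDBLOCK, application can retry" behaviour quoted from the slide can be illustrated from the application side. What follows is a minimal sketch, not RDS code: it assumes an ordinary non-blocking datagram socket, and the helper names `wait_writable` and `send_with_retry`, the retry limit, and the timeout are all hypothetical choices made for the example.

```python
import select


def wait_writable(sock, timeout):
    """Block until the socket reports writable or the timeout expires."""
    select.select([], [sock], [], timeout)


def send_with_retry(sock, payload, addr, max_tries=5, timeout=0.1,
                    wait=wait_writable):
    """Send one datagram on a non-blocking socket, retrying on back-pressure.

    Python raises BlockingIOError when the underlying call fails with
    EWOULDBLOCK/EAGAIN. Nothing is dropped in that case: the payload is
    still in the caller's hands and is simply offered again.
    """
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    last_error = None
    for _ in range(max_tries):
        try:
            return sock.sendto(payload, addr)
        except BlockingIOError as exc:
            last_error = exc
            wait(sock, timeout)  # give the stalled destination time to drain
    raise last_error
```

A caller would create the socket with `socket.socket(socket.AF_INET, socket.SOCK_DGRAM)`, call `setblocking(False)` on it, and pass it to `send_with_retry`. Whether an existing UDP application already has such a retry path is exactly the question raised above.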

thanks,
grant

From ranjit.pandit.ib at gmail.com Fri Nov 4 11:54:27 2005
From: ranjit.pandit.ib at gmail.com (pandit ib)
Date: Fri, 4 Nov 2005 11:54:27 -0800
Subject: [openib-general] [ANNOUNCE] Contribute RDS (Reliable DatagramSockets) to OpenIB
In-Reply-To: <20051104193823.GD4463@esmail.cup.hp.com>
References: <5D78D28F88822E4D8702BB9EEF1A4367C2DD62@mercury.infiniconsys.com> <001701c5e0c7$37d1b090$6401a8c0@YOURA11C73D0FD> <20051104002101.GC1478@esmail.cup.hp.com> <96f8e60e0511031701i7b9ce5a0gbdade306735695e6@mail.gmail.com> <010201c5e0db$f9ed3820$6401a8c0@YOURA11C73D0FD> <20051104181952.GB4463@esmail.cup.hp.com> <20051104193823.GD4463@esmail.cup.hp.com>
Message-ID: <96f8e60e0511041154r305d89a0iac7942c0006b5cbb@mail.gmail.com>

On 11/4/05, Grant Grundler wrote:
> On Fri, Nov 04, 2005 at 10:19:52AM -0800, Grant Grundler wrote:
> ...
> > o The comments in "Recv operations" talk about "backpressure".
> >   Is this another way of saying the driver should drop packets once
> >   the "fairness threshold" is exceeded?
>
> Ranjit's slideset answered this question (I think):
> | o Slow receiver ports are stalled at sender side
> |   - combination of activity (LRU) and memory utilization used
> |     to detect slow receivers
> |   - sendmsg() to stalled destination port returns
> |     EWOULDBLOCK, application can retry
> |   - recvmsg() on a stalled port un-stalls it
>
> I'm having trouble reconciling previous "connectionless" and
> "transparent to user space" comments with this slide.
> Especially the "EWOULDBLOCK" return code.
>
> If a receiver can cause a sender to stall, it implies the packets
> will get dropped on the send side. This is a subtle change
> in behavior that I don't think any UDP application can assume.
> But I'm no networking protocol expert...

When the sender is stalled, the driver will backpressure the
application; no packets will be dropped.
Since a UDP application assumes the underlying transport is
unreliable, it should not have any problems running on RDS.
On getting EWOULDBLOCK it will simply retry.

>
> thanks,
> grant
>

From rolandd at cisco.com Fri Nov 4 12:05:31 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Fri, 04 Nov 2005 12:05:31 -0800
Subject: [openib-general] machine check stop
In-Reply-To: <015801c5e174$0bfea1a0$d5000a0a@STEVO> (Steve Wise's message of "Fri, 4 Nov 2005 13:14:58 -0600")
References: <015801c5e174$0bfea1a0$d5000a0a@STEVO>
Message-ID: <52wtjo2n78.fsf@cisco.com>

    Steve> CPU 0: Machine Check Stop 4 Bank 4: b200000000070f0 TSC
    Steve> blah blah blah.

The chipset and/or CPU detected "something bad" like a parity error or
something like that.

You'll need to know all the details of your system and find some
low-level documentation to decode the machine check output.

 - R.

From rolandd at cisco.com Fri Nov 4 12:10:04 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Fri, 04 Nov 2005 12:10:04 -0800
Subject: [openib-general] [PATCH] sdp zero copy support
In-Reply-To: <20051104122331.GB15158@mellanox.co.il> (Michael S. Tsirkin's message of "Fri, 4 Nov 2005 14:23:31 +0200")
References: <20051104122331.GB15158@mellanox.co.il>
Message-ID: <52oe502mzn.fsf@cisco.com>

I haven't read the code yet, but:

 > +config INFINIBAND_SDP_SEND_ZCOPY
 > +	bool "Sockets Direct Protocol Zero Copy Send support"
 > +	depends on INFINIBAND_SDP
 > +	default y
 > +	---help---
 > +	  This option enables Zero Copy support for send_msg transactions.
 > +
 > +config INFINIBAND_SDP_RECV_ZCOPY
 > +	bool "Sockets Direct Protocol Zero Copy Receive support"
 > +	depends on INFINIBAND_SDP && INFINIBAND_SDP_SEND_ZCOPY
 > +	default y
 > +	---help---
 > +	  This option enables Zero Copy support for recv_msg transactions.

Why would I ever say 'n'?
I think we should either get rid of these config options, or if there is a reason for them, explain it better in the help text. - R. From swise at opengridcomputing.com Fri Nov 4 12:20:38 2005 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 4 Nov 2005 14:20:38 -0600 Subject: [openib-general] machine check stop References: <015801c5e174$0bfea1a0$d5000a0a@STEVO> <52wtjo2n78.fsf@cisco.com> Message-ID: <01a201c5e17d$38b9a7e0$d5000a0a@STEVO> Searching on the web shows this particular "4 bank 4" check when there are memory problems. I was just wondering if bad kernel/module code could cause this, or if really indicates a HW system issue, and if anyone else has seen this running openib stress tests on X86_64... Thanx, Stevo. ----- Original Message ----- From: "Roland Dreier" To: "Steve Wise" Cc: Sent: Friday, November 04, 2005 2:05 PM Subject: Re: [openib-general] machine check stop > Steve> CPU 0: Machine Check Stop 4 Bank 4: b200000000070f0 TSC > Steve> blah blah blah. > > The chipset and/or CPU detected "something bad" like a parity error or > something like that. > > You'll need to know all the details of your system and find some > low-level documentation to decode the machine check output. > > - R. 
>

From halr at voltaire.com Fri Nov 4 12:43:07 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 04 Nov 2005 15:43:07 -0500
Subject: [openib-general] [PATCH] user_mad.c: Only allow ib_umad_open to succeed on IB node types
Message-ID: <1131136986.4340.4689.camel@hal.voltaire.com>

user_mad.c: Only allow ib_umad_open to succeed on IB node types

Signed-off-by: Hal Rosenstock

Index: user_mad.c
===================================================================
--- user_mad.c	(revision 3968)
+++ user_mad.c	(working copy)
@@ -595,6 +595,12 @@ static int ib_umad_open(struct inode *in
 		goto out;
 	}
 
+	if (port->ib_dev->node_type < IB_NODE_CA ||
+	    port->ib_dev->node_type > IB_NODE_ROUTER) {
+		ret = -ENODEV;
+		goto out;
+	}
+
 	file = kzalloc(sizeof *file, GFP_KERNEL);
 	if (!file) {
 		kref_put(&port->umad_dev->ref, ib_umad_release_dev);

From rpandit at silverstorm.com Fri Nov 4 12:59:13 2005
From: rpandit at silverstorm.com (Ranjit Pandit)
Date: Fri, 4 Nov 2005 12:59:13 -0800
Subject: [openib-general] [ANNOUNCE] ContributeRDS(ReliableDatagramSockets) to OpenIB
In-Reply-To:
References: <54AD0F12E08D1541B826BE97C98F99F1020C40@NT-SJCA-0751.brcm.ad.broadcom.com>
Message-ID: <96f8e60e0511041259v655a217anba925ae53f5c3dee@mail.gmail.com>

> I agree that there is some improvement in going from one QP per
> UDP socket to one per node, but it still will likely not
> scale to 10,000 node clusters, which is something that Oracle
> probably does not care about, but others in HPC do.
>

To put the improvement in perspective:

For MPI running on a 10,000 node cluster with 2 or 4 way nodes, here
are the QP/CM connection requirements:
(assuming intra node communication doesn't use IB)

Procs per node    uDapl/Sdp    Rds
2                 19996        9999
4                 39984        9999

Clearly, there is a tradeoff in performance as we go from uDapl/Sdp to Rds.
The choice will have to depend on the requirements of performance
vs. scalability.

Btw, for this large a cluster, there is a huge overhead in just
setting up the connections.
Rds connections are set up only once.

> If we are going to invent a Reliable Datagram Service, shouldn't
> it be made to scale so that MPIs that currently use datagrams
> could also benefit ?
>
> woody
>

From robert.j.woodruff at intel.com Fri Nov 4 13:09:59 2005
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Fri, 4 Nov 2005 13:09:59 -0800
Subject: [openib-general] [ANNOUNCE] ContributeRDS(ReliableDatagramSockets) to OpenIB
Message-ID: <1AC79F16F5C5284499BB9591B33D6F00060828EB@orsmsx408>

>To put the improvement in perspective:

And if RDS were implemented over a connectionless QP, the QP savings
are even more...

Procs per node    uDapl/Sdp    Rds/connection oriented    RDS/connectionless
2                 19996        9999                       1
4                 39984        9999                       1

Of course, connectionless would require the reliability to be done in S/W
in addition to the demuxing of packets.

woody

From dpilon at gmail.com Fri Nov 4 13:11:41 2005
From: dpilon at gmail.com (Denis Pilon)
Date: Fri, 4 Nov 2005 16:11:41 -0500
Subject: [openib-general] error compiling kernel...
Message-ID:

I am trying to compile but keep getting errors...

linux-2.6.14(vanilla) plus latest svn release 3972.

  LD      drivers/infiniband/built-in.o
  LD      drivers/infiniband/core/built-in.o
  CC [M]  drivers/infiniband/core/addr.o
  CC [M]  drivers/infiniband/core/at.o
  CC [M]  drivers/infiniband/core/cm.o
drivers/infiniband/core/cm.c: In function `cm_alloc_msg':
drivers/infiniband/core/cm.c:179: error: `IB_MGMT_MAD_HDR' undeclared (first use in this function)
drivers/infiniband/core/cm.c:179: error: (Each undeclared identifier is reported only once
drivers/infiniband/core/cm.c:179: error: for each function it appears in.)
drivers/infiniband/core/cm.c:180: error: too few arguments to function `ib_create_send_mad' drivers/infiniband/core/cm.c:187: error: structure has no member named `ah' drivers/infiniband/core/cm.c:188: error: structure has no member named `retries' drivers/infiniband/core/cm.c: In function `cm_alloc_response_msg': drivers/infiniband/core/cm.c:209: error: `IB_MGMT_MAD_HDR' undeclared (first use in this function) drivers/infiniband/core/cm.c:210: error: too few arguments to function `ib_create_send_mad' drivers/infiniband/core/cm.c:215: error: structure has no member named `ah' drivers/infiniband/core/cm.c: In function `cm_free_msg': drivers/infiniband/core/cm.c:222: error: structure has no member named `ah' drivers/infiniband/core/cm.c: In function `cm_insert_listen': drivers/infiniband/core/cm.c:371: error: structure has no member named `device' drivers/infiniband/core/cm.c:371: error: structure has no member named `device' drivers/infiniband/core/cm.c:374: error: structure has no member named `device' drivers/infiniband/core/cm.c:374: error: structure has no member named `device' drivers/infiniband/core/cm.c:376: error: structure has no member named `device' drivers/infiniband/core/cm.c:376: error: structure has no member named `device' drivers/infiniband/core/cm.c: In function `cm_find_listen': drivers/infiniband/core/cm.c:398: error: structure has no member named `device' drivers/infiniband/core/cm.c:401: error: structure has no member named `device' drivers/infiniband/core/cm.c:403: error: structure has no member named `device' drivers/infiniband/core/cm.c: At top level: drivers/infiniband/core/cm.c:543: error: conflicting types for 'ib_create_cm_id' include/rdma/ib_cm.h:306: error: previous declaration of 'ib_create_cm_id' was here drivers/infiniband/core/cm.c:543: error: conflicting types for 'ib_create_cm_id' include/rdma/ib_cm.h:306: error: previous declaration of 'ib_create_cm_id' was here drivers/infiniband/core/cm.c: In function `ib_create_cm_id': 
drivers/infiniband/core/cm.c:552: error: structure has no member named `device'
drivers/infiniband/core/cm.c: In function `ib_destroy_cm_id':
drivers/infiniband/core/cm.c:679: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast
drivers/infiniband/core/cm.c:690: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast
drivers/infiniband/core/cm.c:707: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast
drivers/infiniband/core/cm.c: In function `ib_send_cm_req':
drivers/infiniband/core/cm.c:933: error: structure has no member named `timeout_ms'
drivers/infiniband/core/cm.c:942: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:942: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `cm_issue_rej':
drivers/infiniband/core/cm.c:987: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:987: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `cm_dup_req_handler':
drivers/infiniband/core/cm.c:1195: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:1195: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `cm_match_req':
drivers/infiniband/core/cm.c:1235: error: structure has no member named `device'
drivers/infiniband/core/cm.c: In function `ib_send_cm_rep':
drivers/infiniband/core/cm.c:1381: error: structure has no member named `timeout_ms'
drivers/infiniband/core/cm.c:1384: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:1384: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `ib_send_cm_rtu':
drivers/infiniband/core/cm.c:1448: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:1448: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `cm_dup_rep_handler':
drivers/infiniband/core/cm.c:1520: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:1520: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `cm_rep_handler':
drivers/infiniband/core/cm.c:1588: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast
drivers/infiniband/core/cm.c: In function `cm_establish_handler':
drivers/infiniband/core/cm.c:1622: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast
drivers/infiniband/core/cm.c: In function `cm_rtu_handler':
drivers/infiniband/core/cm.c:1661: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast
drivers/infiniband/core/cm.c: In function `ib_send_cm_dreq':
drivers/infiniband/core/cm.c:1719: error: structure has no member named `timeout_ms'
drivers/infiniband/core/cm.c:1722: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:1722: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `ib_send_cm_drep':
drivers/infiniband/core/cm.c:1785: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:1785: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `cm_dreq_handler':
drivers/infiniband/core/cm.c:1820: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast
drivers/infiniband/core/cm.c:1834: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:1834: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `cm_drep_handler':
drivers/infiniband/core/cm.c:1881: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast
drivers/infiniband/core/cm.c: In function `ib_send_cm_rej':
drivers/infiniband/core/cm.c:1949: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:1949: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `cm_rej_handler':
drivers/infiniband/core/cm.c:2025: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast
drivers/infiniband/core/cm.c:2035: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast
drivers/infiniband/core/cm.c: In function `ib_send_cm_mra':
drivers/infiniband/core/cm.c:2093: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:2093: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c:2106: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:2106: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c:2119: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:2119: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `cm_mra_handler':
drivers/infiniband/core/cm.c:2181: warning: passing arg 2 of `ib_modify_mad' makes integer from pointer without a cast
drivers/infiniband/core/cm.c:2188: warning: passing arg 2 of `ib_modify_mad' makes integer from pointer without a cast
drivers/infiniband/core/cm.c:2196: warning: passing arg 2 of `ib_modify_mad' makes integer from pointer without a cast
drivers/infiniband/core/cm.c: In function `ib_send_cm_lap':
drivers/infiniband/core/cm.c:2279: error: structure has no member named `timeout_ms'
drivers/infiniband/core/cm.c:2282: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:2282: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `cm_lap_handler':
drivers/infiniband/core/cm.c:2359: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:2359: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `ib_send_cm_apr':
drivers/infiniband/core/cm.c:2437: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:2437: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `cm_apr_handler':
drivers/infiniband/core/cm.c:2476: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast
drivers/infiniband/core/cm.c: In function `ib_send_cm_sidr_req':
drivers/infiniband/core/cm.c:2573: error: structure has no member named `timeout_ms'
drivers/infiniband/core/cm.c:2578: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:2578: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `cm_sidr_req_handler':
drivers/infiniband/core/cm.c:2642: error: structure has no member named `device'
drivers/infiniband/core/cm.c: In function `ib_send_cm_sidr_rep':
drivers/infiniband/core/cm.c:2713: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:2713: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `cm_sidr_rep_handler':
drivers/infiniband/core/cm.c:2766: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast
drivers/infiniband/core/cm.c: In function `cm_send_handler':
drivers/infiniband/core/cm.c:2834: error: structure has no member named `send_buf'
make[3]: *** [drivers/infiniband/core/cm.o] Error 1
make[2]: ***
[drivers/infiniband/core] Error 2
make[1]: *** [drivers/infiniband] Error 2
make: *** [drivers] Error 2

Am I missing something?

DP

From caitlinb at broadcom.com Fri Nov 4 13:17:30 2005
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Fri, 4 Nov 2005 13:17:30 -0800
Subject: [openib-general] [ANNOUNCE] ContributeRDS(ReliableDatagramSockets) to OpenIB
Message-ID: <54AD0F12E08D1541B826BE97C98F99F10414A1@NT-SJCA-0751.brcm.ad.broadcom.com>

> -----Original Message-----
> From: Woodruff, Robert J [mailto:robert.j.woodruff at intel.com]
> Sent: Friday, November 04, 2005 1:10 PM
> To: Ranjit Pandit
> Cc: Caitlin Bestler; Fab Tillier; Sean Hefty; Rick Frank;
> Matt L. Leininger; openib-general at openib.org
> Subject: RE: [openib-general] [ANNOUNCE]
> ContributeRDS(ReliableDatagramSockets) to OpenIB
>
> >To put the improvement in perspective:
>
> And if RDS were implemented over a connectionless QP, the QP
> savings are even more...
>
> Procs per node   uDapl/Sdp   Rds/connection oriented   RDS/connectionless
> 2                19996       9999                      1
> 4                39984       9999                      1
>
> Of course, connectionless would require the reliability to be
> done in S/W in addition to the demuxing of packets.
>
> woody
>

It would also require the application to do SAR for all
packets that are larger than the PMTU. One of the benefits
of trying to ride on the SOCK_DGRAM interface is that it
already defines a larger guaranteed message size.

From robert.j.woodruff at intel.com Fri Nov 4 13:24:48 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Fri, 4 Nov 2005 13:24:48 -0800
Subject: [openib-general] [ANNOUNCE] ContributeRDS(ReliableDatagramSockets) to OpenIB
In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F10414A1@NT-SJCA-0751.brcm.ad.broadcom.com>
Message-ID:

Caitlin wrote,
>>
>> And if RDS were implemented over a connectionless QP, the QP
>> savings are even more...
>>
>> Procs per node   uDapl/Sdp   Rds/connection oriented   connectionless
>> 2                19996       9999                      1
>> 4                39984       9999                      1
>>
>> Of course, connectionless would require the reliability to be
>> done in S/W in addition to the demuxing of packets.
>>
>> woody
>>
>>
>It would also require the application to do SAR for all
>packets that are larger than the PMTU. One of the benefits
>of trying to ride on the SOCK_DGRAM interface is that it
>already defines a larger guaranteed message size.

I suppose the RDS driver could also handle the SAR (if it were needed)
in addition to any retries of lost packets.

woody

From mshefty at ichips.intel.com Fri Nov 4 13:34:35 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 04 Nov 2005 13:34:35 -0800
Subject: [openib-general] userverbs device node_type
Message-ID: <436BD3EB.9020407@ichips.intel.com>

Is there a way to get the node_type for an ibv_device?

- Sean

From lindahl at pathscale.com Fri Nov 4 13:46:53 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Fri, 4 Nov 2005 13:46:53 -0800
Subject: [openib-general] [ANNOUNCE] Contribute RDS (Reliable DatagramSockets) to OpenIB
In-Reply-To: <96f8e60e0511041154r305d89a0iac7942c0006b5cbb@mail.gmail.com>
References: <5D78D28F88822E4D8702BB9EEF1A4367C2DD62@mercury.infiniconsys.com> <001701c5e0c7$37d1b090$6401a8c0@YOURA11C73D0FD> <20051104002101.GC1478@esmail.cup.hp.com> <96f8e60e0511031701i7b9ce5a0gbdade306735695e6@mail.gmail.com> <010201c5e0db$f9ed3820$6401a8c0@YOURA11C73D0FD> <20051104181952.GB4463@esmail.cup.hp.com> <20051104193823.GD4463@esmail.cup.hp.com> <96f8e60e0511041154r305d89a0iac7942c0006b5cbb@mail.gmail.com>
Message-ID: <20051104214653.GC2013@greglaptop.internal.keyresearch.com>

On Fri, Nov 04, 2005 at 11:54:27AM -0800, pandit ib wrote:

> Since a UDP application assumes the underlying transport is
> unreliable it should not have any problems running on RDS.
> On getting EWOULDBLOCK it will simply retry.

Most existing UDP applications do not expect a return error code of
EWOULDBLOCK. To begin with, the Linux manpages say that you have to
specify non-blocking to get this error in the first place. Another
possibility is ENOBUFS, for which the manpage gives the advice "Normally,
this does not occur in Linux. Packets are silently dropped when a device
queue overflows."

There was a somewhat famous case showing lack of error handling in UDP
applications under Linux, where Alan Cox decided to read the RFCs
differently from everyone else, and caused an ICMP 'port unreach' to
later cause the same sending socket to return an error for a send to
some unrelated host. Many UDP-using apps considered this a fatal error.
This was ~ 7 years ago, and this misfeature caused enough anger that it
was corrected soon after Alan stopped owning the TCP/UDP stack.

In short, I'm not sure there would be much benefit in giving existing
UDP-expecting apps a reliable, ordered stream of datagrams. The only
apps which would see a benefit are those that know that they can turn
off their reliability and ordering code, and handle backpressure
explicitly. Those folks would benefit from a simpler programming
interface than verbs.

-- greg

From pradeep at us.ibm.com Fri Nov 4 14:06:32 2005
From: pradeep at us.ibm.com (Pradeep Satyanarayana)
Date: Fri, 4 Nov 2005 14:06:32 -0800
Subject: [openib-general] Data structure size mismatch
Message-ID:

I realize that address translation will be replaced shortly. However,
here are a few things that I observed which I believe are important.
I recently saw an e-mail thread about compilation problems and data
structure padding; this is in line with that.

So that the new incarnation does not face the same pitfalls as address
translation, I will describe them here.

When I tried running uatt it failed with -EFAULT. Debugging revealed
that the following copy_from_user() fails:

        ib_route = kmalloc(sizeof *ib_route, GFP_KERNEL);
        if (!ib_route) {
                result = -ENOMEM;
                goto err1;
        }

        if (copy_from_user(ib_route, cmd.ib_route, sizeof(ib_route))) {
                result = -EFAULT;
                goto err2;
        }

In fact I believe this copy_from_user() is unnecessary, since this will
actually be filled in by "address translation" and passed back to user
space later on.

So, if I eliminate this copy_from_user(), uatt again fails with EFAULT in:

        if (copy_to_user((void __user *)(unsigned long)cmd.response,
                         &resp, sizeof(resp))) {
                result = -EFAULT;
                goto err4;
        }

The environment I was using was a 32-bit app on a 64-bit kernel on Power.
The reason is that struct ib_uat_route_by_ip_req has pointers in it
(LP64 vs ILP32).

I am told a 64-bit app succeeded on a 64-bit kernel, which confirmed my
suspicions.

Given that, I took a quick look at all the places where copy_from_user()
is used (I did not do this exercise for copy_to_user(), which would be
the complete thing to do) and found that this (data structure size
mismatch) potentially also occurs in user_mad.c. I did not see any
anomalies in ucm and uverbs.

Comments from people who are more familiar with the code?

Pradeep
pradeep at us.ibm.com

From robert.j.woodruff at intel.com Fri Nov 4 14:14:58 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Fri, 4 Nov 2005 14:14:58 -0800
Subject: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
In-Reply-To: <010201c5e0db$f9ed3820$6401a8c0@YOURA11C73D0FD>
Message-ID:

Rick wrote,
>I've attached a draft proposal for RDS from Oracle which discusses some of
>the motivation for RDS.

Couple of questions/comments on the spec.

AF_INET_OFFLOAD should be renamed to something like AF_INET_RDS.

Would something like SCTP provide the same type of
capabilities (reliable datagrams) that you are suggesting to
add with RDP ?
http://www.networksorcery.com/enp/protocol/sctp.htm http://www.faqs.org/rfcs/rfc2960.html From caitlinb at broadcom.com Fri Nov 4 14:30:34 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 4 Nov 2005 14:30:34 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS ( ReliableDatagramSockets) to OpenIB Message-ID: <54AD0F12E08D1541B826BE97C98F99F10414A7@NT-SJCA-0751.brcm.ad.broadcom.com> > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Bob Woodruff > Sent: Friday, November 04, 2005 2:15 PM > To: 'Rick Frank'; Ranjit Pandit; Grant Grundler > Cc: openib-general at openib.org > Subject: RE: [openib-general] [ANNOUNCE] Contribute RDS ( > ReliableDatagramSockets) to OpenIB > > Rick wrote, > >I've atttached a draft proposal for RDS from Oracle which discusses > >some of > > >the motivation for RDS. > > Couple of questions/comments on the spec. > > > AF_INET_OFFLOAD should be renamed to something like AF_INET_RDS. > > Would something like SCTP provide the same type of > capabilities (relaible datagrams) that you are suggesting to > add with RDP ? > Each stream within an SCTP association provides a reliable, ordered service. There would be two primary constraints in using SCTP for this usage profile: 1) The Stream ID is 16 bits, and the natural mapping would be to have each stream represent a source/destination pairing. That would imply fewer than 256 endpoints per host. If the source were encoded by hand then the limitation would be 64K, but that's an awkard mix of application and transport layer encoding. 2) The network has to be composed of SCTP friendly equipment. When IP network equipment operated exclusively at L2/L3, and L4 was left to the endpoints, SCTP would have had no problem being deployed. But because of security and IPV4 address shortages there are a lot of middleboxes that are L4 aware, and generally that L4 awareness is limited to TCP and UDP. 
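[Editorial illustration] The endpoint limits in point 1 follow from simple arithmetic on the 16-bit Stream ID. A minimal sketch in C, assuming one stream per (source, destination) pairing and the same number of endpoints on each side; the function names are hypothetical and do not come from any SCTP implementation. Under that assumption the bound works out to at most 256 endpoints, in line with the "fewer than 256" figure above, and to 65536 (64K) when the Stream ID names only the destination.

```c
#include <stdint.h>

/* Number of distinct values a 16-bit SCTP stream identifier can take. */
#define SCTP_STREAM_IDS (1u << 16)

/*
 * Hypothetical helper: largest endpoint count n for which every
 * (source, destination) pair can get its own stream, that is, the
 * largest n with n * n <= streams.
 */
static uint32_t max_endpoints_paired(uint32_t streams)
{
    uint32_t n = 0;

    while ((uint64_t)(n + 1) * (n + 1) <= streams)
        n++;
    return n;
}

/*
 * Hypothetical helper: if the stream ID names only the destination and
 * the source is encoded by the application in the payload, every stream
 * ID is available for a destination endpoint.
 */
static uint32_t max_endpoints_dest_only(uint32_t streams)
{
    return streams;
}
```

Calling max_endpoints_paired(SCTP_STREAM_IDS) gives the pairing bound, and max_endpoints_dest_only(SCTP_STREAM_IDS) gives the 64K bound for the hand-encoded source case.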
SCTP support would also have to be part of the offload device. RDS enables reliable datagrams using existing offloaded RC services (IB RC, iWARP, TOE). No NIC enhancements are required. From trimmer at silverstorm.com Fri Nov 4 14:34:37 2005 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Fri, 4 Nov 2005 17:34:37 -0500 Subject: [openib-general] [ANNOUNCE]ContributeRDS(ReliableDatagramSockets) to OpenIB Message-ID: <5D78D28F88822E4D8702BB9EEF1A43670A08BD@mercury.infiniconsys.com> > Fab wrote, > >There is not a 1:1 relationship between a UDP application > socket and an IB > QP, > >rather there is a single IB connection between systems over > which traffic > from > >multiple UDP sockets flows. > > >- Fab > > Bob wrote, > That would probably provide better scalability, since there > would not be a 1:1 mapping between UDP sockets and IB connections, > however for large clusters there may still be a scalability issue > if every node needs to have a connection to every other node. > If you implemented it on top of datagrams instead, then each node > would only need one QP, rather than one for every node in the cluster. That is essentially what Oracle previously did when using UDP over IPoIB. Significant performance gains were realized with RDS (as compared to IPoIB) for a number of reasons: 1. use of RC connections allows for messages larger than IB MTU, which allows for more efficiency and better performance. 2. By using RC connections and flow control in the RDS socket mux, Oracle was able to remove the need for timeouts and retries in application space. Such algorithms in application space can get expensive, especially due to error handling which is inevitable when congestion and stress force the loss of packets by any unreliable datagram protocol (IB/UD or IPoIB/UDP or Ethernet/UDP). 
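[Editorial illustration] The application-space timeouts and retries described in point 2 might look roughly like the sketch below. It is not Oracle's or RDS's code: send_with_retry is a hypothetical helper, the socket is assumed to be a connected datagram socket, and any datagram received is treated as the acknowledgement. A real implementation would also need sequence numbers, duplicate detection and backoff, which is part of why this logic gets expensive in application space.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper: send one datagram on a connected datagram socket
 * and wait up to timeout_ms for an acknowledgement, resending on timeout.
 * Returns 0 once any datagram arrives in reply; returns -1 on a socket
 * error, or with errno set to ETIMEDOUT after max_tries attempts.
 */
static int send_with_retry(int fd, const void *buf, size_t len,
                           int max_tries, int timeout_ms)
{
    for (int attempt = 0; attempt < max_tries; attempt++) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        char ack;
        int ready;

        if (send(fd, buf, len, 0) < 0)
            return -1;

        ready = poll(&pfd, 1, timeout_ms);
        if (ready < 0) {
            if (errno == EINTR)
                continue;       /* interrupted: counts as a used attempt */
            return -1;
        }
        if (ready == 0)
            continue;           /* timed out: resend the datagram */

        /* Any reply is treated as the ack in this simplified sketch. */
        return recv(fd, &ack, sizeof ack, 0) >= 0 ? 0 : -1;
    }

    errno = ETIMEDOUT;
    return -1;
}
```

With a reliable datagram service underneath, a loop of this kind in the application is exactly what could be dropped.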
Todd Rimmer From halr at voltaire.com Fri Nov 4 14:30:49 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 04 Nov 2005 17:30:49 -0500 Subject: [openib-general] Data structure size mismatch In-Reply-To: References: Message-ID: <1131143449.4340.5204.camel@hal.voltaire.com> On Fri, 2005-11-04 at 17:06, Pradeep Satyanarayana wrote: > I realize that address translation will be replaced shortly. However, > here are a few things that > I observed which I believe are important. Important to fix in what time frame ? > I recently saw an e-mail thread about compilation problems and > data structure padding; this is in line with that. > > So that new incarnation does not face the same pitfalls of address > translation, I will describe them here. > > When I tried running uatt it fails with -EFAULT. Debug revealed that > it fails. The following > copy_from_user() fails. > > ib_route = kmalloc(sizeof *ib_route, GFP_KERNEL); > if (!ib_route) { > result = -ENOMEM; > goto err1; > } > > if (copy_from_user(ib_route, cmd.ib_route, sizeof(ib_route))) { > result = -EFAULT; > goto err2; > } > > In fact I believe this copy_from_user() is unnecessary since this will > be actually filled in by "address translation" and > passed back to user space later on. Not always. If I recall correctly, there is a case where this copy is needed. It is not in the mode that uatt uses AT right now though. > So, if I eliminate this copy_from_user(), uatt again fails with > EFAULT in: > > if (copy_to_user((void __user *)(unsigned long)cmd.response, > &resp, sizeof(resp))) { > result = -EFAULT; > goto err4; > } > > The environment I was using a 32-bit app and 64-bit kernel on Power. > The reason is > struct ib_uat_route_by_ip_req has pointers in them (LP64 vs ILP32). This needs to be replaced by the port GID. Another alternative is the name. This has been discussed before on the list. -- Hal > I am told a 64-bit app succeeded on a 64-bit kernel which confirmed my > suspicions. 
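[Editorial illustration] The LP64 vs ILP32 mismatch discussed above can be shown with a small sketch. The structures below are hypothetical stand-ins, not the real layout of struct ib_uat_route_by_ip_req: the field names are assumptions, and fixed-width integers are used to model how large the pointer members are under each ABI. The helper shows the usual alternative of carrying user addresses in fixed-width 64-bit fields so that 32-bit and 64-bit callers agree on the layout.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical layout as a 32-bit (ILP32) application sees it:
 * each pointer member occupies 4 bytes.
 */
struct route_req_ilp32 {
    uint32_t dst_ip;
    uint32_t src_ip;
    uint32_t ib_route;   /* stands in for a 4-byte user pointer */
    uint32_t response;   /* stands in for a 4-byte user pointer */
};

/*
 * The same members as a 64-bit (LP64) kernel sees them:
 * each pointer member occupies 8 bytes, so size and offsets differ.
 */
struct route_req_lp64 {
    uint32_t dst_ip;
    uint32_t src_ip;
    uint64_t ib_route;   /* stands in for an 8-byte user pointer */
    uint64_t response;   /* stands in for an 8-byte user pointer */
};

/*
 * Widen a user-space pointer into a fixed-width field so that both
 * ABIs can share one structure definition.
 */
static uint64_t ptr_to_u64(const void *p)
{
    return (uint64_t)(uintptr_t)p;
}
```

A 64-bit kernel that reads 24 bytes where the 32-bit caller wrote 16 picks up the wrong values for the pointer members, which is consistent with the -EFAULT results reported in this thread.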
> > Given that I took a quick look at all the places that copy_from_user() > is used (I did not > do this exercise for copy_to_user(), which would be the complete thing > to do) and found > that this (data structure size mismatch) potentially also occurs in > user_mad,c. I did not see any anomalies > in ucm and uverbs. > > Comments from people who are more familair with the code? > > Pradeep > pradeep at us.ibm.com > > ______________________________________________________________________ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From robert.j.woodruff at intel.com Fri Nov 4 14:53:04 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Fri, 4 Nov 2005 14:53:04 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS ( ReliableDatagramSockets) to OpenIB In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F10414A7@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: Catlin wrote, >SCTP support would also have to be part of the offload device. >RDS enables reliable datagrams using existing offloaded RC >services (IB RC, iWARP, TOE). No NIC enhancements are required. BTW. SCTP runs in Linux today without any NIC enhancements or offload support. Perhaps if tunneling udp packets over RC connections rather than UD connections provides better performance, as was seen in the RDS experiment, then why not just convert IPoIB to use a connected model (rather than datagrams) and then all existing IP upper level protocols would could benefit, TCP, UDP, SCTP, .... 
woody

From trimmer at silverstorm.com Fri Nov 4 15:02:04 2005
From: trimmer at silverstorm.com (Rimmer, Todd)
Date: Fri, 4 Nov 2005 18:02:04 -0500
Subject: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
Message-ID: <5D78D28F88822E4D8702BB9EEF1A436773E971@mercury.infiniconsys.com>

> Bob wrote,
> Perhaps if tunneling udp packets over RC connections rather than
> UD connections provides better performance, as was seen in the RDS
> experiment, then why not just convert
> IPoIB to use a connected model (rather than datagrams)
> and then all existing IP upper level
> protocols would could benefit, TCP, UDP, SCTP, ....

This would miss the second major improvement of RDS, namely removing the need for
the application to perform timeouts and retries on datagram packets. If Oracle
ran over UDP/IP/IPoIB it would not be guaranteed a loss-less reliable interface.
If UDP/IP/IPoIB provided a loss-less reliable interface it would likely break or
affect other UDP applications which are expecting a flow controlled interface.

Todd Rimmer

From robert.j.woodruff at intel.com Fri Nov 4 15:03:17 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Fri, 4 Nov 2005 15:03:17 -0800
Subject: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
In-Reply-To:
Message-ID:

Woody wrote,
>Perhaps if tunneling udp packets over RC connections rather than
>UD connections provides better performance, as was seen in the RDS
>experiment, then why not just convert
>IPoIB to use a connected model (rather than datagrams)
>and then all existing IP upper level
>protocols would could benefit, TCP, UDP, SCTP, ....

Saying this another way.
Make the hardware run the existing protocols better, don't
design a new protocol to work around the problems with a
specific hardware transport.
woody

From robert.j.woodruff at intel.com Fri Nov 4 15:10:54 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Fri, 4 Nov 2005 15:10:54 -0800
Subject: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
In-Reply-To: <5D78D28F88822E4D8702BB9EEF1A436773E971@mercury.infiniconsys.com>
Message-ID:

Todd wrote,
>This would miss the second major improvement of RDS, namely removing the need for
>the application to perform timeouts and retries on datagram packets. If Oracle
>ran over UDP/IP/IPoIB it would not be guaranteed a loss-less reliable interface.
>If UDP/IP/IPoIB provided a loss-less reliable interface it would likely break or
>affect other UDP applications which are expecting a flow controlled interface.

>Todd Rimmer

Then use SCTP instead of UDP, which already provides a loss-less reliable
interface.
If SCTP has problems with the number of endpoints it can currently support,
why not just fix that problem and fix IPoIB to use a connected model to
increase performance, rather than inventing a completely new protocol and/or
address family.

Just a thought.
woody

From rpandit at silverstorm.com Fri Nov 4 15:16:52 2005
From: rpandit at silverstorm.com (Ranjit Pandit)
Date: Fri, 4 Nov 2005 15:16:52 -0800
Subject: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
In-Reply-To:
References:
Message-ID: <96f8e60e0511041516v48a115a3m5025a2ffc3026f5a@mail.gmail.com>

On 11/4/05, Bob Woodruff wrote:
> Woody wrote,
> >Perhaps if tunneling udp packets over RC connections rather than
> >UD connections provides better performance, as was seen in the RDS
> >experiment, then why not just convert
> >IPoIB to use a connected model (rather than datagrams)
> >and then all existing IP upper level
> >protocols would could benefit, TCP, UDP, SCTP, ....
>
> Saying this another way.
> Make the hardware run the existing protocols better, don't
> design a new protocol to work around the problems with a
> specific hardware transport.
>

What about SDP? Isn't SDP bypassing the existing TCP protocol stack to
take advantage of a specific hardware transport - IB?
RDS is somewhat like SDP in that it offloads/accelerates SOCK_DGRAM
instead of SOCK_STREAM.

> woody

From Richard.Frank at oracle.com Fri Nov 4 15:27:42 2005
From: Richard.Frank at oracle.com (Rick Frank)
Date: Fri, 4 Nov 2005 18:27:42 -0500
Subject: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
References: <5D78D28F88822E4D8702BB9EEF1A436773E971@mercury.infiniconsys.com>
Message-ID: <004201c5e197$99ecc730$6401a8c0@YOURA11C73D0FD>

Folks, I just realized that the RDS proposal doc I posted was not the
latest - I have attached a newer doc.

BTW - Our goal at Oracle is to eventually replace our use of UDP with
RDS - we want to get out of the business of making UDP work for us (from
user mode) - each time we create new internal database clients with
corresponding new IPC requirements. I'm not proposing that we change our
use of the connectionless datagram model - we have too many dependencies
on this.

For example, very shortly we will need to support reliably (and
efficiently) moving 1meg msgs in our IPC - which will further complicate
the UDP implementation - and further reduce its performance compared to
RDP - which can support the 1meg MTU naturally for some interconnects -
and/or rely on a driver level implementation RDS / transport for those
that do not.

Basically it is very hard to do this stuff from user mode.

Note that we will still be using our existing IPC module for RDS - just
removing the remaining UDP vestiges.
Of course for this to work - we will need RDS to be ubiquitous - supported on all interconnects - to include simple Ethernet NICs. ----- Original Message ----- From: "Rimmer, Todd" To: "Bob Woodruff" ; "Caitlin Bestler" ; "Rick Frank" ; "Pandit, Ranjit" ; "Grant Grundler" Cc: Sent: Friday, November 04, 2005 6:02 PM Subject: RE: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB > Bob wrote, > Perhaps if tunneling udp packets over RC connections rather than > UD connections provides better performance, as was seen in the RDS > experiment, then why not just convert > IPoIB to use a connected model (rather than datagrams) > and then all existing IP upper level > protocols would could benefit, TCP, UDP, SCTP, .... This would miss the second major improvement of RDS, namely removing the need for the application to perform timeouts and retries on datagram packets. If Oracle ran over UDP/IP/IPoIB it would not be guaranteed a loss-less reliable interface. If UDP/IP/IPoIB provided a loss-less reliable interface it would likely break or affect other UDP applications which are expecting a flow controlled interface. Todd Rimmer -------------- next part -------------- A non-text attachment was scrubbed... Name: Proposal_for_a_Reliable_Datagram_Socket_Interface.doc Type: application/msword Size: 51712 bytes Desc: not available URL: From Richard.Frank at oracle.com Fri Nov 4 15:29:57 2005 From: Richard.Frank at oracle.com (Rick Frank) Date: Fri, 4 Nov 2005 18:29:57 -0500 Subject: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB References: Message-ID: <004701c5e197$ac1d2670$6401a8c0@YOURA11C73D0FD> SCTP is connection based - we have many dependencies on our connectionless datagram model. 

----- Original Message -----
From: "Bob Woodruff"
To: "'Rimmer, Todd'" ; "Caitlin Bestler" ; "Rick Frank" ; "Pandit, Ranjit" ; "Grant Grundler"
Cc:
Sent: Friday, November 04, 2005 6:10 PM
Subject: RE: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB

> Todd wrote,
>> This would miss the second major improvement of RDS, namely removing the
>> need for the application to perform timeouts and retries on datagram
>> packets. If Oracle ran over UDP/IP/IPoIB it would not be guaranteed a
>> loss-less reliable interface. If UDP/IP/IPoIB provided a loss-less
>> reliable interface it would likely break or affect other UDP applications
>> which are expecting a flow controlled interface.
>>
>> Todd Rimmer
>
> Then use SCTP instead of UDP, which already provides a loss-less reliable
> interface. If SCTP has problems with the number of endpoints it can
> currently support, why not just fix that problem and fix IPoIB to use a
> connected model to increase performance, rather than inventing a
> completely new protocol and/or address family.
>
> Just a thought.
>
> woody

From robert.j.woodruff at intel.com Fri Nov 4 15:49:33 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Fri, 4 Nov 2005 15:49:33 -0800
Subject: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB
In-Reply-To: <004701c5e197$ac1d2670$6401a8c0@YOURA11C73D0FD>
Message-ID:

Rick wrote,
>SCTP is connection based - we have many dependencies on our connectionless
>datagram model.

I think I get it now.
I was just talking with Roy about SCTP, and he said the same thing, SCTP is a connected rather than datagram model, so SCTP does not seem to solve the problem since it has the same FD scaling problems as TCP.

>Of course for this to work - we will need RDS to be ubiquitous - supported
>on all interconnects - to include simple Ethernet NICs.

Makes sense.

woody

From robert.j.woodruff at intel.com Fri Nov 4 15:58:13 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Fri, 4 Nov 2005 15:58:13 -0800
Subject: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
In-Reply-To: <96f8e60e0511041516v48a115a3m5025a2ffc3026f5a@mail.gmail.com>
Message-ID:

Ranjit wrote,
>RDS is somewhat like SDP in that it offloads/accelerates SOCK_DGRAM
>instead of SOCK_STREAM.

So back to the question from Roland that started this thread. When do you plan to re-work the code to use the OpenIB verbs and make it suitable for the kernel ? And do you plan to develop the code, or at least the infrastructure to allow multiple RDS providers to plug in so that it is ubiquitous - supported on all interconnects - to include simple Ethernet NICs ?

woody

From pradeep at us.ibm.com Fri Nov 4 15:59:38 2005
From: pradeep at us.ibm.com (Pradeep Satyanarayana)
Date: Fri, 4 Nov 2005 15:59:38 -0800
Subject: [openib-general] Data structure size mismatch
In-Reply-To: <1131143449.4340.5204.camel@hal.voltaire.com>
Message-ID:

Hal Rosenstock wrote on 11/04/2005 02:30:49 PM:

> On Fri, 2005-11-04 at 17:06, Pradeep Satyanarayana wrote:
> > I realize that address translation will be replaced shortly. However,
> > here are a few things that
> > I observed which I believe are important.
>
> Important to fix in what time frame ?
>
> > I recently saw an e-mail thread about compilation problems and
> > data structure padding; this is in line with that.
> >
> > So that new incarnation does not face the same pitfalls of address
> > translation, I will describe them here.
> >
> > When I tried running uatt it fails with -EFAULT. Debugging revealed
> > that the following copy_from_user() fails.
> >
> >         ib_route = kmalloc(sizeof *ib_route, GFP_KERNEL);
> >         if (!ib_route) {
> >                 result = -ENOMEM;
> >                 goto err1;
> >         }
> >
> >         if (copy_from_user(ib_route, cmd.ib_route, sizeof(ib_route))) {
> >                 result = -EFAULT;
> >                 goto err2;
> >         }
> >
> > In fact I believe this copy_from_user() is unnecessary since this will
> > be actually filled in by "address translation" and
> > passed back to user space later on.
>
> Not always. If I recall correctly, there is a case where this copy is
> needed. It is not in the mode that uatt uses AT right now though.

Maybe true, but there is still a 32-bit app 64-bit kernel issue that needs to be fixed, unless we agree to change the data structure to say incorporate a device_name as you suggest below.

>
> > So, if I eliminate this copy_from_user(), uatt again fails with
> > EFAULT in:
> >
> >         if (copy_to_user((void __user *)(unsigned long)cmd.response,
> >                          &resp, sizeof(resp))) {
> >                 result = -EFAULT;
> >                 goto err4;
> >         }
> >
> > The environment I was using was a 32-bit app and 64-bit kernel on Power.
> > The reason is
> > struct ib_uat_route_by_ip_req has pointers in it (LP64 vs ILP32).
>
> This needs to be replaced by the port GID. Another alternative is the
> name. This has been discussed before on the list.
>
> -- Hal
>
> > I am told a 64-bit app succeeded on a 64-bit kernel which confirmed my
> > suspicions.
> >
> > Given that I took a quick look at all the places that copy_from_user()
> > is used (I did not
> > do this exercise for copy_to_user(), which would be the complete thing
> > to do) and found
> > that this (data structure size mismatch) potentially also occurs in
> > user_mad.c. I did not see any anomalies

Even if we change struct ib_uat_route_by_ip_req, there still is user_mad.c that needs to be looked into.
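
The LP64 vs ILP32 mismatch described in this message can be illustrated with a minimal sketch, assuming a simplified request layout. The struct names, field names, and helper functions below are hypothetical and are not the real ib_uat_route_by_ip_req definition. The first struct embeds raw pointers, so its size differs between a 32-bit application and a 64-bit kernel; the second carries user addresses as 64-bit integers and orders its fields so that no implicit padding is needed, which should keep the layout the same on both ABIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical request that embeds raw pointers. Its size depends on the
 * ABI: pointers are 4 bytes under ILP32 and 8 bytes under LP64, so a
 * 32-bit application and a 64-bit kernel disagree about the layout.
 */
struct route_req_ptr {
	uint32_t dst_ip;
	uint32_t src_ip;
	void    *ib_route;   /* user buffer for the route */
	void    *response;   /* user buffer for the response */
};

/*
 * Hypothetical ABI-independent variant: user addresses travel as 64-bit
 * integers. The two 32-bit fields come first so the 64-bit fields start
 * at offset 8 without any compiler-inserted padding.
 */
struct route_req_fixed {
	uint32_t dst_ip;
	uint32_t src_ip;
	uint64_t ib_route;
	uint64_t response;
};

/* Compile-time checks that the fixed layout is what both sides expect. */
static_assert(sizeof(struct route_req_fixed) == 24,
	      "fixed layout must be 24 bytes");
static_assert(offsetof(struct route_req_fixed, ib_route) == 8,
	      "ib_route must start at offset 8");
static_assert(offsetof(struct route_req_fixed, response) == 16,
	      "response must start at offset 16");

/* Pack a pointer into the 64-bit wire field. */
static inline uint64_t ptr_to_u64(const void *p)
{
	return (uint64_t)(uintptr_t)p;
}

/* Recover a pointer from the 64-bit wire field. */
static inline void *u64_to_ptr(uint64_t v)
{
	return (void *)(uintptr_t)v;
}
```

With a layout like route_req_fixed, the kernel side would convert the 64-bit field back to a user pointer before copying, much as the quoted code already does with (void __user *)(unsigned long)cmd.response.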

Pradeep
pradeep at us.ibm.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rolandd at cisco.com Fri Nov 4 16:00:02 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Fri, 04 Nov 2005 16:00:02 -0800
Subject: [openib-general] userverbs device node_type
In-Reply-To: <436BD3EB.9020407@ichips.intel.com> (Sean Hefty's message of "Fri, 04 Nov 2005 13:34:35 -0800")
References: <436BD3EB.9020407@ichips.intel.com>
Message-ID: <52hdas2ccd.fsf@cisco.com>

    Sean> Is there a way to get the node_type for an ibv_device?

Not at the moment... it's probably safe to assume it's a CA for now.

 - R.

From rolandd at cisco.com Fri Nov 4 16:09:39 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Fri, 04 Nov 2005 16:09:39 -0800
Subject: [openib-general] Data structure size mismatch
In-Reply-To: (Pradeep Satyanarayana's message of "Fri, 4 Nov 2005 15:59:38 -0800")
References:
Message-ID: <52d5lg2bwc.fsf@cisco.com>

>>>>> "Pradeep" == Pradeep Satyanarayana writes:

    Pradeep> Even if we change struct ib_uat_route_by_ip_req, there
    Pradeep> still is user_mad.c that needs to be looked into.

Could you be specific? As far as I can tell, all of the structures copied to and from userspace in user_mad.c are laid out identically for 32-bit and 64-bit architectures.

Thanks,
  Roland

From rolandd at cisco.com Fri Nov 4 16:20:03 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Fri, 04 Nov 2005 16:20:03 -0800
Subject: [openib-general] [git pull] IB updates
Message-ID: <523bmc2bf0.fsf@cisco.com>

Linus, please pull from

    master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This tree is also available from kernel.org mirrors at:

    rsync://rsync.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

The pull will get the following changes:

Jack Morgenstein:
      [IB] mthca: check P_Key index in modify QP

Michael S. Tsirkin:
      [IB] mthca: report asynchronous CQ events

Roland Dreier:
      [IPoIB] use spin_trylock_irqsave()
      [IB] uverbs: Avoid NULL pointer deref on CQ async event
      [IB] mthca: Avoid SRQ free WQE list corruption
      [IPoIB] cleanups: fix comment, remove useless variables
      [IB] kzalloc() conversions
      [IPoIB] remove unneeded initializations to 0
      [IPoIB] don't compile debug code if debugging isn't enabled
      [IB] mthca: fix format of FW version
      [IB] umad: fix hot remove of IB devices

Sean Hefty:
      [IB] ucm: 32/64 compatibility fixes

 drivers/infiniband/core/agent.c                |    3 -
 drivers/infiniband/core/cm.c                   |    6 +-
 drivers/infiniband/core/device.c               |   10 ---
 drivers/infiniband/core/mad.c                  |   31 ++++-----
 drivers/infiniband/core/sysfs.c                |    6 +-
 drivers/infiniband/core/ucm.c                  |    9 +--
 drivers/infiniband/core/user_mad.c             |   80 +++++++++++++++++++-----
 drivers/infiniband/core/uverbs.h               |    1
 drivers/infiniband/core/uverbs_cmd.c           |    1
 drivers/infiniband/core/uverbs_main.c          |   13 +---
 drivers/infiniband/hw/mthca/mthca_cq.c         |   31 +++++++++
 drivers/infiniband/hw/mthca/mthca_dev.h        |    4 +
 drivers/infiniband/hw/mthca/mthca_eq.c         |    4 +
 drivers/infiniband/hw/mthca/mthca_main.c       |    2 -
 drivers/infiniband/hw/mthca/mthca_mr.c         |    4 -
 drivers/infiniband/hw/mthca/mthca_profile.c    |    4 -
 drivers/infiniband/hw/mthca/mthca_provider.c   |    2 -
 drivers/infiniband/hw/mthca/mthca_qp.c         |    7 ++
 drivers/infiniband/hw/mthca/mthca_srq.c        |   13 ++--
 drivers/infiniband/ulp/ipoib/ipoib.h           |    3 +
 drivers/infiniband/ulp/ipoib/ipoib_ib.c        |   13 ++--
 drivers/infiniband/ulp/ipoib/ipoib_main.c      |   24 ++-----
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |    8 ++
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c     |    4 -
 include/rdma/ib_user_cm.h                      |   19 ++++--
 25 files changed, 178 insertions(+), 124 deletions(-)

From rolandd at cisco.com Fri Nov 4 16:21:10 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Fri, 04 Nov 2005 16:21:10 -0800
Subject: [openib-general] [git pull] Add IB SCSI RDMA Protocol (storage) initiator
Message-ID: <52y8440wsp.fsf@cisco.com>

Linus, please pull from

    master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git srp

This tree is also available from kernel.org mirrors at:

    rsync://rsync.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git srp

The pull will get the following change, which adds an InfiniBand SCSI RDMA Protocol initiator (used to talk to IB storage devices).

Thanks,
  Roland

IB: Add SCSI RDMA Protocol (SRP) initiator

Add an InfiniBand SCSI RDMA Protocol (SRP) initiator. This driver is used to talk to InfiniBand SRP targets (storage devices).

Signed-off-by: Roland Dreier

---

 drivers/infiniband/Kconfig          |    2
 drivers/infiniband/Makefile         |    1
 drivers/infiniband/ulp/srp/Kbuild   |    1
 drivers/infiniband/ulp/srp/Kconfig  |   11
 drivers/infiniband/ulp/srp/ib_srp.c | 1700 +++++++++++++++++++++++++++++++++++
 drivers/infiniband/ulp/srp/ib_srp.h |  150 +++
 include/scsi/srp.h                  |  226 +++++
 7 files changed, 2091 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/srp/Kbuild
 create mode 100644 drivers/infiniband/ulp/srp/Kconfig
 create mode 100644 drivers/infiniband/ulp/srp/ib_srp.c
 create mode 100644 drivers/infiniband/ulp/srp/ib_srp.h
 create mode 100644 include/scsi/srp.h

applies-to: d918cd1ba0ef9afa692cef281afee2f6d6634a1e
aef9ec39c47f0cece886ddd6b53c440321e0b2a6
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 325d502..bdf0891 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -33,4 +33,6 @@ source "drivers/infiniband/hw/mthca/Kcon
 
 source "drivers/infiniband/ulp/ipoib/Kconfig"
 
+source "drivers/infiniband/ulp/srp/Kconfig"
+
 endmenu
diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile
index d256cf7..a43fb34 100644
--- a/drivers/infiniband/Makefile
+++ b/drivers/infiniband/Makefile
@@ -1,3 +1,4 @@
 obj-$(CONFIG_INFINIBAND)		+= core/
 obj-$(CONFIG_INFINIBAND_MTHCA)		+= hw/mthca/
 obj-$(CONFIG_INFINIBAND_IPOIB)		+= ulp/ipoib/
+obj-$(CONFIG_INFINIBAND_SRP)		+= ulp/srp/
diff --git a/drivers/infiniband/ulp/srp/Kbuild
b/drivers/infiniband/ulp/srp/Kbuild new file mode 100644 index 0000000..a16c73c --- /dev/null +++ b/drivers/infiniband/ulp/srp/Kbuild @@ -0,0 +1 @@ +obj-$(CONFIG_INFINIBAND_SRP) += ib_srp.o diff --git a/drivers/infiniband/ulp/srp/Kconfig b/drivers/infiniband/ulp/srp/Kconfig new file mode 100644 index 0000000..8fe3be4 --- /dev/null +++ b/drivers/infiniband/ulp/srp/Kconfig @@ -0,0 +1,11 @@ +config INFINIBAND_SRP + tristate "InfiniBand SCSI RDMA Protocol" + depends on INFINIBAND && SCSI + ---help--- + Support for the SCSI RDMA Protocol over InfiniBand. This + allows you to access storage devices that speak SRP over + InfiniBand. + + The SRP protocol is defined by the INCITS T10 technical + committee. See . + diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c new file mode 100644 index 0000000..2687e34 --- /dev/null +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -0,0 +1,1700 @@ +/* + * Copyright (c) 2005 Cisco Systems. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_srp.c 3932 2005-11-01 17:19:29Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include +#include +#include + +#include + +#include "ib_srp.h" + +#define DRV_NAME "ib_srp" +#define PFX DRV_NAME ": " +#define DRV_VERSION "0.2" +#define DRV_RELDATE "November 1, 2005" + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("InfiniBand SCSI RDMA Protocol initiator " + "v" DRV_VERSION " (" DRV_RELDATE ")"); +MODULE_LICENSE("Dual BSD/GPL"); + +static int topspin_workarounds = 1; + +module_param(topspin_workarounds, int, 0444); +MODULE_PARM_DESC(topspin_workarounds, + "Enable workarounds for Topspin/Cisco SRP target bugs if != 0"); + +static const u8 topspin_oui[3] = { 0x00, 0x05, 0xad }; + +static void srp_add_one(struct ib_device *device); +static void srp_remove_one(struct ib_device *device); +static void srp_completion(struct ib_cq *cq, void *target_ptr); +static int srp_cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event); + +static struct ib_client srp_client = { + .name = "srp", + .add = srp_add_one, + .remove = srp_remove_one +}; + +static inline struct srp_target_port *host_to_target(struct Scsi_Host *host) +{ + return (struct srp_target_port *) host->hostdata; +} + +static const char *srp_target_info(struct Scsi_Host *host) +{ + return host_to_target(host)->target_name; +} + +static struct srp_iu *srp_alloc_iu(struct srp_host *host, size_t size, + gfp_t gfp_mask, + enum 
dma_data_direction direction) +{ + struct srp_iu *iu; + + iu = kmalloc(sizeof *iu, gfp_mask); + if (!iu) + goto out; + + iu->buf = kzalloc(size, gfp_mask); + if (!iu->buf) + goto out_free_iu; + + iu->dma = dma_map_single(host->dev->dma_device, iu->buf, size, direction); + if (dma_mapping_error(iu->dma)) + goto out_free_buf; + + iu->size = size; + iu->direction = direction; + + return iu; + +out_free_buf: + kfree(iu->buf); +out_free_iu: + kfree(iu); +out: + return NULL; +} + +static void srp_free_iu(struct srp_host *host, struct srp_iu *iu) +{ + if (!iu) + return; + + dma_unmap_single(host->dev->dma_device, iu->dma, iu->size, iu->direction); + kfree(iu->buf); + kfree(iu); +} + +static void srp_qp_event(struct ib_event *event, void *context) +{ + printk(KERN_ERR PFX "QP event %d\n", event->event); +} + +static int srp_init_qp(struct srp_target_port *target, + struct ib_qp *qp) +{ + struct ib_qp_attr *attr; + int ret; + + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (!attr) + return -ENOMEM; + + ret = ib_find_cached_pkey(target->srp_host->dev, + target->srp_host->port, + be16_to_cpu(target->path.pkey), + &attr->pkey_index); + if (ret) + goto out; + + attr->qp_state = IB_QPS_INIT; + attr->qp_access_flags = (IB_ACCESS_REMOTE_READ | + IB_ACCESS_REMOTE_WRITE); + attr->port_num = target->srp_host->port; + + ret = ib_modify_qp(qp, attr, + IB_QP_STATE | + IB_QP_PKEY_INDEX | + IB_QP_ACCESS_FLAGS | + IB_QP_PORT); + +out: + kfree(attr); + return ret; +} + +static int srp_create_target_ib(struct srp_target_port *target) +{ + struct ib_qp_init_attr *init_attr; + int ret; + + init_attr = kzalloc(sizeof *init_attr, GFP_KERNEL); + if (!init_attr) + return -ENOMEM; + + target->cq = ib_create_cq(target->srp_host->dev, srp_completion, + NULL, target, SRP_CQ_SIZE); + if (IS_ERR(target->cq)) { + ret = PTR_ERR(target->cq); + goto out; + } + + ib_req_notify_cq(target->cq, IB_CQ_NEXT_COMP); + + init_attr->event_handler = srp_qp_event; + init_attr->cap.max_send_wr = SRP_SQ_SIZE; + 
init_attr->cap.max_recv_wr = SRP_RQ_SIZE; + init_attr->cap.max_recv_sge = 1; + init_attr->cap.max_send_sge = 1; + init_attr->sq_sig_type = IB_SIGNAL_ALL_WR; + init_attr->qp_type = IB_QPT_RC; + init_attr->send_cq = target->cq; + init_attr->recv_cq = target->cq; + + target->qp = ib_create_qp(target->srp_host->pd, init_attr); + if (IS_ERR(target->qp)) { + ret = PTR_ERR(target->qp); + ib_destroy_cq(target->cq); + goto out; + } + + ret = srp_init_qp(target, target->qp); + if (ret) { + ib_destroy_qp(target->qp); + ib_destroy_cq(target->cq); + goto out; + } + +out: + kfree(init_attr); + return ret; +} + +static void srp_free_target_ib(struct srp_target_port *target) +{ + int i; + + ib_destroy_qp(target->qp); + ib_destroy_cq(target->cq); + + for (i = 0; i < SRP_RQ_SIZE; ++i) + srp_free_iu(target->srp_host, target->rx_ring[i]); + for (i = 0; i < SRP_SQ_SIZE + 1; ++i) + srp_free_iu(target->srp_host, target->tx_ring[i]); +} + +static void srp_path_rec_completion(int status, + struct ib_sa_path_rec *pathrec, + void *target_ptr) +{ + struct srp_target_port *target = target_ptr; + + target->status = status; + if (status) + printk(KERN_ERR PFX "Got failed path rec status %d\n", status); + else + target->path = *pathrec; + complete(&target->done); +} + +static int srp_lookup_path(struct srp_target_port *target) +{ + target->path.numb_path = 1; + + init_completion(&target->done); + + target->path_query_id = ib_sa_path_rec_get(target->srp_host->dev, + target->srp_host->port, + &target->path, + IB_SA_PATH_REC_DGID | + IB_SA_PATH_REC_SGID | + IB_SA_PATH_REC_NUMB_PATH | + IB_SA_PATH_REC_PKEY, + SRP_PATH_REC_TIMEOUT_MS, + GFP_KERNEL, + srp_path_rec_completion, + target, &target->path_query); + if (target->path_query_id < 0) + return target->path_query_id; + + wait_for_completion(&target->done); + + if (target->status < 0) + printk(KERN_WARNING PFX "Path record query failed\n"); + + return target->status; +} + +static int srp_send_req(struct srp_target_port *target) +{ + struct { + 
struct ib_cm_req_param param; + struct srp_login_req priv; + } *req = NULL; + int status; + + req = kzalloc(sizeof *req, GFP_KERNEL); + if (!req) + return -ENOMEM; + + req->param.primary_path = &target->path; + req->param.alternate_path = NULL; + req->param.service_id = target->service_id; + req->param.qp_num = target->qp->qp_num; + req->param.qp_type = target->qp->qp_type; + req->param.private_data = &req->priv; + req->param.private_data_len = sizeof req->priv; + req->param.flow_control = 1; + + get_random_bytes(&req->param.starting_psn, 4); + req->param.starting_psn &= 0xffffff; + + /* + * Pick some arbitrary defaults here; we could make these + * module parameters if anyone cared about setting them. + */ + req->param.responder_resources = 4; + req->param.remote_cm_response_timeout = 20; + req->param.local_cm_response_timeout = 20; + req->param.retry_count = 7; + req->param.rnr_retry_count = 7; + req->param.max_cm_retries = 15; + + req->priv.opcode = SRP_LOGIN_REQ; + req->priv.tag = 0; + req->priv.req_it_iu_len = cpu_to_be32(SRP_MAX_IU_LEN); + req->priv.req_buf_fmt = cpu_to_be16(SRP_BUF_FORMAT_DIRECT | + SRP_BUF_FORMAT_INDIRECT); + memcpy(req->priv.initiator_port_id, target->srp_host->initiator_port_id, 16); + /* + * Topspin/Cisco SRP targets will reject our login unless we + * zero out the first 8 bytes of our initiator port ID. The + * second 8 bytes must be our local node GUID, but we always + * use that anyway. 
+ */ + if (topspin_workarounds && !memcmp(&target->ioc_guid, topspin_oui, 3)) { + printk(KERN_DEBUG PFX "Topspin/Cisco initiator port ID workaround " + "activated for target GUID %016llx\n", + (unsigned long long) be64_to_cpu(target->ioc_guid)); + memset(req->priv.initiator_port_id, 0, 8); + } + memcpy(req->priv.target_port_id, &target->id_ext, 8); + memcpy(req->priv.target_port_id + 8, &target->ioc_guid, 8); + + status = ib_send_cm_req(target->cm_id, &req->param); + + kfree(req); + + return status; +} + +static void srp_disconnect_target(struct srp_target_port *target) +{ + /* XXX should send SRP_I_LOGOUT request */ + + init_completion(&target->done); + ib_send_cm_dreq(target->cm_id, NULL, 0); + wait_for_completion(&target->done); +} + +static void srp_remove_work(void *target_ptr) +{ + struct srp_target_port *target = target_ptr; + + spin_lock_irq(target->scsi_host->host_lock); + if (target->state != SRP_TARGET_DEAD) { + spin_unlock_irq(target->scsi_host->host_lock); + scsi_host_put(target->scsi_host); + return; + } + target->state = SRP_TARGET_REMOVED; + spin_unlock_irq(target->scsi_host->host_lock); + + down(&target->srp_host->target_mutex); + list_del(&target->list); + up(&target->srp_host->target_mutex); + + scsi_remove_host(target->scsi_host); + ib_destroy_cm_id(target->cm_id); + srp_free_target_ib(target); + scsi_host_put(target->scsi_host); + /* And another put to really free the target port... */ + scsi_host_put(target->scsi_host); +} + +static int srp_connect_target(struct srp_target_port *target) +{ + int ret; + + ret = srp_lookup_path(target); + if (ret) + return ret; + + while (1) { + init_completion(&target->done); + ret = srp_send_req(target); + if (ret) + return ret; + wait_for_completion(&target->done); + + /* + * The CM event handling code will set status to + * SRP_PORT_REDIRECT if we get a port redirect REJ + * back, or SRP_DLID_REDIRECT if we get a lid/qp + * redirect REJ back. 
+ */ + switch (target->status) { + case 0: + return 0; + + case SRP_PORT_REDIRECT: + ret = srp_lookup_path(target); + if (ret) + return ret; + break; + + case SRP_DLID_REDIRECT: + break; + + default: + return target->status; + } + } +} + +static int srp_reconnect_target(struct srp_target_port *target) +{ + struct ib_cm_id *new_cm_id; + struct ib_qp_attr qp_attr; + struct srp_request *req; + struct ib_wc wc; + int ret; + int i; + + spin_lock_irq(target->scsi_host->host_lock); + if (target->state != SRP_TARGET_LIVE) { + spin_unlock_irq(target->scsi_host->host_lock); + return -EAGAIN; + } + target->state = SRP_TARGET_CONNECTING; + spin_unlock_irq(target->scsi_host->host_lock); + + srp_disconnect_target(target); + /* + * Now get a new local CM ID so that we avoid confusing the + * target in case things are really fouled up. + */ + new_cm_id = ib_create_cm_id(target->srp_host->dev, + srp_cm_handler, target); + if (IS_ERR(new_cm_id)) { + ret = PTR_ERR(new_cm_id); + goto err; + } + ib_destroy_cm_id(target->cm_id); + target->cm_id = new_cm_id; + + qp_attr.qp_state = IB_QPS_RESET; + ret = ib_modify_qp(target->qp, &qp_attr, IB_QP_STATE); + if (ret) + goto err; + + ret = srp_init_qp(target, target->qp); + if (ret) + goto err; + + while (ib_poll_cq(target->cq, 1, &wc) > 0) + ; /* nothing */ + + list_for_each_entry(req, &target->req_queue, list) { + req->scmnd->result = DID_RESET << 16; + req->scmnd->scsi_done(req->scmnd); + } + + target->rx_head = 0; + target->tx_head = 0; + target->tx_tail = 0; + target->req_head = 0; + for (i = 0; i < SRP_SQ_SIZE - 1; ++i) + target->req_ring[i].next = i + 1; + target->req_ring[SRP_SQ_SIZE - 1].next = -1; + INIT_LIST_HEAD(&target->req_queue); + + ret = srp_connect_target(target); + if (ret) + goto err; + + spin_lock_irq(target->scsi_host->host_lock); + if (target->state == SRP_TARGET_CONNECTING) { + ret = 0; + target->state = SRP_TARGET_LIVE; + } else + ret = -EAGAIN; + spin_unlock_irq(target->scsi_host->host_lock); + + return ret; + +err: + 
printk(KERN_ERR PFX "reconnect failed (%d), removing target port.\n", ret); + + /* + * We couldn't reconnect, so kill our target port off. + * However, we have to defer the real removal because we might + * be in the context of the SCSI error handler now, which + * would deadlock if we call scsi_remove_host(). + */ + spin_lock_irq(target->scsi_host->host_lock); + if (target->state == SRP_TARGET_CONNECTING) { + target->state = SRP_TARGET_DEAD; + INIT_WORK(&target->work, srp_remove_work, target); + schedule_work(&target->work); + } + spin_unlock_irq(target->scsi_host->host_lock); + + return ret; +} + +static int srp_map_data(struct scsi_cmnd *scmnd, struct srp_target_port *target, + struct srp_request *req) +{ + struct srp_cmd *cmd = req->cmd->buf; + int len; + u8 fmt; + + if (!scmnd->request_buffer || scmnd->sc_data_direction == DMA_NONE) + return sizeof (struct srp_cmd); + + if (scmnd->sc_data_direction != DMA_FROM_DEVICE && + scmnd->sc_data_direction != DMA_TO_DEVICE) { + printk(KERN_WARNING PFX "Unhandled data direction %d\n", + scmnd->sc_data_direction); + return -EINVAL; + } + + if (scmnd->use_sg) { + struct scatterlist *scat = scmnd->request_buffer; + int n; + int i; + + n = dma_map_sg(target->srp_host->dev->dma_device, + scat, scmnd->use_sg, scmnd->sc_data_direction); + + if (n == 1) { + struct srp_direct_buf *buf = (void *) cmd->add_data; + + fmt = SRP_DATA_DESC_DIRECT; + + buf->va = cpu_to_be64(sg_dma_address(scat)); + buf->key = cpu_to_be32(target->srp_host->mr->rkey); + buf->len = cpu_to_be32(sg_dma_len(scat)); + + len = sizeof (struct srp_cmd) + + sizeof (struct srp_direct_buf); + } else { + struct srp_indirect_buf *buf = (void *) cmd->add_data; + u32 datalen = 0; + + fmt = SRP_DATA_DESC_INDIRECT; + + if (scmnd->sc_data_direction == DMA_TO_DEVICE) + cmd->data_out_desc_cnt = n; + else + cmd->data_in_desc_cnt = n; + + buf->table_desc.va = cpu_to_be64(req->cmd->dma + + sizeof *cmd + + sizeof *buf); + buf->table_desc.key = + 
cpu_to_be32(target->srp_host->mr->rkey); + buf->table_desc.len = + cpu_to_be32(n * sizeof (struct srp_direct_buf)); + + for (i = 0; i < n; ++i) { + buf->desc_list[i].va = cpu_to_be64(sg_dma_address(&scat[i])); + buf->desc_list[i].key = + cpu_to_be32(target->srp_host->mr->rkey); + buf->desc_list[i].len = cpu_to_be32(sg_dma_len(&scat[i])); + + datalen += sg_dma_len(&scat[i]); + } + + buf->len = cpu_to_be32(datalen); + + len = sizeof (struct srp_cmd) + + sizeof (struct srp_indirect_buf) + + n * sizeof (struct srp_direct_buf); + } + } else { + struct srp_direct_buf *buf = (void *) cmd->add_data; + dma_addr_t dma; + + dma = dma_map_single(target->srp_host->dev->dma_device, + scmnd->request_buffer, scmnd->request_bufflen, + scmnd->sc_data_direction); + if (dma_mapping_error(dma)) { + printk(KERN_WARNING PFX "unable to map %p/%d (dir %d)\n", + scmnd->request_buffer, (int) scmnd->request_bufflen, + scmnd->sc_data_direction); + return -EINVAL; + } + + pci_unmap_addr_set(req, direct_mapping, dma); + + buf->va = cpu_to_be64(dma); + buf->key = cpu_to_be32(target->srp_host->mr->rkey); + buf->len = cpu_to_be32(scmnd->request_bufflen); + + fmt = SRP_DATA_DESC_DIRECT; + + len = sizeof (struct srp_cmd) + sizeof (struct srp_direct_buf); + } + + if (scmnd->sc_data_direction == DMA_TO_DEVICE) + cmd->buf_fmt = fmt << 4; + else + cmd->buf_fmt = fmt; + + + return len; +} + +static void srp_unmap_data(struct scsi_cmnd *scmnd, + struct srp_target_port *target, + struct srp_request *req) +{ + if (!scmnd->request_buffer || + (scmnd->sc_data_direction != DMA_TO_DEVICE && + scmnd->sc_data_direction != DMA_FROM_DEVICE)) + return; + + if (scmnd->use_sg) + dma_unmap_sg(target->srp_host->dev->dma_device, + (struct scatterlist *) scmnd->request_buffer, + scmnd->use_sg, scmnd->sc_data_direction); + else + dma_unmap_single(target->srp_host->dev->dma_device, + pci_unmap_addr(req, direct_mapping), + scmnd->request_bufflen, + scmnd->sc_data_direction); +} + +static void srp_process_rsp(struct 
srp_target_port *target, struct srp_rsp *rsp) +{ + struct srp_request *req; + struct scsi_cmnd *scmnd; + unsigned long flags; + s32 delta; + + delta = (s32) be32_to_cpu(rsp->req_lim_delta); + + spin_lock_irqsave(target->scsi_host->host_lock, flags); + + target->req_lim += delta; + + req = &target->req_ring[rsp->tag & ~SRP_TAG_TSK_MGMT]; + + if (unlikely(rsp->tag & SRP_TAG_TSK_MGMT)) { + if (be32_to_cpu(rsp->resp_data_len) < 4) + req->tsk_status = -1; + else + req->tsk_status = rsp->data[3]; + complete(&req->done); + } else { + scmnd = req->scmnd; + if (!scmnd) + printk(KERN_ERR "Null scmnd for RSP w/tag %016llx\n", + (unsigned long long) rsp->tag); + scmnd->result = rsp->status; + + if (rsp->flags & SRP_RSP_FLAG_SNSVALID) { + memcpy(scmnd->sense_buffer, rsp->data + + be32_to_cpu(rsp->resp_data_len), + min_t(int, be32_to_cpu(rsp->sense_data_len), + SCSI_SENSE_BUFFERSIZE)); + } + + if (rsp->flags & (SRP_RSP_FLAG_DOOVER | SRP_RSP_FLAG_DOUNDER)) + scmnd->resid = be32_to_cpu(rsp->data_out_res_cnt); + else if (rsp->flags & (SRP_RSP_FLAG_DIOVER | SRP_RSP_FLAG_DIUNDER)) + scmnd->resid = be32_to_cpu(rsp->data_in_res_cnt); + + srp_unmap_data(scmnd, target, req); + + if (!req->tsk_mgmt) { + req->scmnd = NULL; + scmnd->host_scribble = (void *) -1L; + scmnd->scsi_done(scmnd); + + list_del(&req->list); + req->next = target->req_head; + target->req_head = rsp->tag & ~SRP_TAG_TSK_MGMT; + } else + req->cmd_done = 1; + } + + spin_unlock_irqrestore(target->scsi_host->host_lock, flags); +} + +static void srp_reconnect_work(void *target_ptr) +{ + struct srp_target_port *target = target_ptr; + + srp_reconnect_target(target); +} + +static void srp_handle_recv(struct srp_target_port *target, struct ib_wc *wc) +{ + struct srp_iu *iu; + u8 opcode; + + iu = target->rx_ring[wc->wr_id & ~SRP_OP_RECV]; + + dma_sync_single_for_cpu(target->srp_host->dev->dma_device, iu->dma, + target->max_ti_iu_len, DMA_FROM_DEVICE); + + opcode = *(u8 *) iu->buf; + + if (0) { + int i; + + printk(KERN_ERR PFX 
"recv completion, opcode 0x%02x\n", opcode); + + for (i = 0; i < wc->byte_len; ++i) { + if (i % 8 == 0) + printk(KERN_ERR " [%02x] ", i); + printk(" %02x", ((u8 *) iu->buf)[i]); + if ((i + 1) % 8 == 0) + printk("\n"); + } + + if (wc->byte_len % 8) + printk("\n"); + } + + switch (opcode) { + case SRP_RSP: + srp_process_rsp(target, iu->buf); + break; + + case SRP_T_LOGOUT: + /* XXX Handle target logout */ + printk(KERN_WARNING PFX "Got target logout request\n"); + break; + + default: + printk(KERN_WARNING PFX "Unhandled SRP opcode 0x%02x\n", opcode); + break; + } + + dma_sync_single_for_device(target->srp_host->dev->dma_device, iu->dma, + target->max_ti_iu_len, DMA_FROM_DEVICE); +} + +static void srp_completion(struct ib_cq *cq, void *target_ptr) +{ + struct srp_target_port *target = target_ptr; + struct ib_wc wc; + unsigned long flags; + + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + while (ib_poll_cq(cq, 1, &wc) > 0) { + if (wc.status) { + printk(KERN_ERR PFX "failed %s status %d\n", + wc.wr_id & SRP_OP_RECV ? 
"receive" : "send", + wc.status); + spin_lock_irqsave(target->scsi_host->host_lock, flags); + if (target->state == SRP_TARGET_LIVE) + schedule_work(&target->work); + spin_unlock_irqrestore(target->scsi_host->host_lock, flags); + break; + } + + if (wc.wr_id & SRP_OP_RECV) + srp_handle_recv(target, &wc); + else + ++target->tx_tail; + } +} + +static int __srp_post_recv(struct srp_target_port *target) +{ + struct srp_iu *iu; + struct ib_sge list; + struct ib_recv_wr wr, *bad_wr; + unsigned int next; + int ret; + + next = target->rx_head & (SRP_RQ_SIZE - 1); + wr.wr_id = next | SRP_OP_RECV; + iu = target->rx_ring[next]; + + list.addr = iu->dma; + list.length = iu->size; + list.lkey = target->srp_host->mr->lkey; + + wr.next = NULL; + wr.sg_list = &list; + wr.num_sge = 1; + + ret = ib_post_recv(target->qp, &wr, &bad_wr); + if (!ret) + ++target->rx_head; + + return ret; +} + +static int srp_post_recv(struct srp_target_port *target) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(target->scsi_host->host_lock, flags); + ret = __srp_post_recv(target); + spin_unlock_irqrestore(target->scsi_host->host_lock, flags); + + return ret; +} + +/* + * Must be called with target->scsi_host->host_lock held to protect + * req_lim and tx_head. + */ +static struct srp_iu *__srp_get_tx_iu(struct srp_target_port *target) +{ + if (target->tx_head - target->tx_tail >= SRP_SQ_SIZE) + return NULL; + + return target->tx_ring[target->tx_head & SRP_SQ_SIZE]; +} + +/* + * Must be called with target->scsi_host->host_lock held to protect + * req_lim and tx_head. 
+ */ +static int __srp_post_send(struct srp_target_port *target, + struct srp_iu *iu, int len) +{ + struct ib_sge list; + struct ib_send_wr wr, *bad_wr; + int ret = 0; + + if (target->req_lim < 1) { + printk(KERN_ERR PFX "Target has req_lim %d\n", target->req_lim); + return -EAGAIN; + } + + list.addr = iu->dma; + list.length = len; + list.lkey = target->srp_host->mr->lkey; + + wr.next = NULL; + wr.wr_id = target->tx_head & SRP_SQ_SIZE; + wr.sg_list = &list; + wr.num_sge = 1; + wr.opcode = IB_WR_SEND; + wr.send_flags = IB_SEND_SIGNALED; + + ret = ib_post_send(target->qp, &wr, &bad_wr); + + if (!ret) { + ++target->tx_head; + --target->req_lim; + } + + return ret; +} + +static int srp_queuecommand(struct scsi_cmnd *scmnd, + void (*done)(struct scsi_cmnd *)) +{ + struct srp_target_port *target = host_to_target(scmnd->device->host); + struct srp_request *req; + struct srp_iu *iu; + struct srp_cmd *cmd; + long req_index; + int len; + + if (target->state == SRP_TARGET_CONNECTING) + goto err; + + if (target->state == SRP_TARGET_DEAD || + target->state == SRP_TARGET_REMOVED) { + scmnd->result = DID_BAD_TARGET << 16; + done(scmnd); + return 0; + } + + iu = __srp_get_tx_iu(target); + if (!iu) + goto err; + + dma_sync_single_for_cpu(target->srp_host->dev->dma_device, iu->dma, + SRP_MAX_IU_LEN, DMA_TO_DEVICE); + + req_index = target->req_head; + + scmnd->scsi_done = done; + scmnd->result = 0; + scmnd->host_scribble = (void *) req_index; + + cmd = iu->buf; + memset(cmd, 0, sizeof *cmd); + + cmd->opcode = SRP_CMD; + cmd->lun = cpu_to_be64((u64) scmnd->device->lun << 48); + cmd->tag = req_index; + memcpy(cmd->cdb, scmnd->cmnd, scmnd->cmd_len); + + req = &target->req_ring[req_index]; + + req->scmnd = scmnd; + req->cmd = iu; + req->cmd_done = 0; + req->tsk_mgmt = NULL; + + len = srp_map_data(scmnd, target, req); + if (len < 0) { + printk(KERN_ERR PFX "Failed to map data\n"); + goto err; + } + + if (__srp_post_recv(target)) { + printk(KERN_ERR PFX "Recv failed\n"); + goto err_unmap; 
+ } + + dma_sync_single_for_device(target->srp_host->dev->dma_device, iu->dma, + SRP_MAX_IU_LEN, DMA_TO_DEVICE); + + if (__srp_post_send(target, iu, len)) { + printk(KERN_ERR PFX "Send failed\n"); + goto err_unmap; + } + + target->req_head = req->next; + list_add_tail(&req->list, &target->req_queue); + + return 0; + +err_unmap: + srp_unmap_data(scmnd, target, req); + +err: + return SCSI_MLQUEUE_HOST_BUSY; +} + +static int srp_alloc_iu_bufs(struct srp_target_port *target) +{ + int i; + + for (i = 0; i < SRP_RQ_SIZE; ++i) { + target->rx_ring[i] = srp_alloc_iu(target->srp_host, + target->max_ti_iu_len, + GFP_KERNEL, DMA_FROM_DEVICE); + if (!target->rx_ring[i]) + goto err; + } + + for (i = 0; i < SRP_SQ_SIZE + 1; ++i) { + target->tx_ring[i] = srp_alloc_iu(target->srp_host, + SRP_MAX_IU_LEN, + GFP_KERNEL, DMA_TO_DEVICE); + if (!target->tx_ring[i]) + goto err; + } + + return 0; + +err: + for (i = 0; i < SRP_RQ_SIZE; ++i) { + srp_free_iu(target->srp_host, target->rx_ring[i]); + target->rx_ring[i] = NULL; + } + + for (i = 0; i < SRP_SQ_SIZE + 1; ++i) { + srp_free_iu(target->srp_host, target->tx_ring[i]); + target->tx_ring[i] = NULL; + } + + return -ENOMEM; +} + +static void srp_cm_rej_handler(struct ib_cm_id *cm_id, + struct ib_cm_event *event, + struct srp_target_port *target) +{ + struct ib_class_port_info *cpi; + int opcode; + + switch (event->param.rej_rcvd.reason) { + case IB_CM_REJ_PORT_CM_REDIRECT: + cpi = event->param.rej_rcvd.ari; + target->path.dlid = cpi->redirect_lid; + target->path.pkey = cpi->redirect_pkey; + cm_id->remote_cm_qpn = be32_to_cpu(cpi->redirect_qp) & 0x00ffffff; + memcpy(target->path.dgid.raw, cpi->redirect_gid, 16); + + target->status = target->path.dlid ? + SRP_DLID_REDIRECT : SRP_PORT_REDIRECT; + break; + + case IB_CM_REJ_PORT_REDIRECT: + if (topspin_workarounds && + !memcmp(&target->ioc_guid, topspin_oui, 3)) { + /* + * Topspin/Cisco SRP gateways incorrectly send + * reject reason code 25 when they mean 24 + * (port redirect). 
+ */ + memcpy(target->path.dgid.raw, + event->param.rej_rcvd.ari, 16); + + printk(KERN_DEBUG PFX "Topspin/Cisco redirect to target port GID %016llx%016llx\n", + (unsigned long long) be64_to_cpu(target->path.dgid.global.subnet_prefix), + (unsigned long long) be64_to_cpu(target->path.dgid.global.interface_id)); + + target->status = SRP_PORT_REDIRECT; + } else { + printk(KERN_WARNING " REJ reason: IB_CM_REJ_PORT_REDIRECT\n"); + target->status = -ECONNRESET; + } + break; + + case IB_CM_REJ_DUPLICATE_LOCAL_COMM_ID: + printk(KERN_WARNING " REJ reason: IB_CM_REJ_DUPLICATE_LOCAL_COMM_ID\n"); + target->status = -ECONNRESET; + break; + + case IB_CM_REJ_CONSUMER_DEFINED: + opcode = *(u8 *) event->private_data; + if (opcode == SRP_LOGIN_REJ) { + struct srp_login_rej *rej = event->private_data; + u32 reason = be32_to_cpu(rej->reason); + + if (reason == SRP_LOGIN_REJ_REQ_IT_IU_LENGTH_TOO_LARGE) + printk(KERN_WARNING PFX + "SRP_LOGIN_REJ: requested max_it_iu_len too large\n"); + else + printk(KERN_WARNING PFX + "SRP LOGIN REJECTED, reason 0x%08x\n", reason); + } else + printk(KERN_WARNING " REJ reason: IB_CM_REJ_CONSUMER_DEFINED," + " opcode 0x%02x\n", opcode); + target->status = -ECONNRESET; + break; + + default: + printk(KERN_WARNING " REJ reason 0x%x\n", + event->param.rej_rcvd.reason); + target->status = -ECONNRESET; + } +} + +static int srp_cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) +{ + struct srp_target_port *target = cm_id->context; + struct ib_qp_attr *qp_attr = NULL; + int attr_mask = 0; + int comp = 0; + int opcode = 0; + + switch (event->event) { + case IB_CM_REQ_ERROR: + printk(KERN_DEBUG PFX "Sending CM REQ failed\n"); + comp = 1; + target->status = -ECONNRESET; + break; + + case IB_CM_REP_RECEIVED: + comp = 1; + opcode = *(u8 *) event->private_data; + + if (opcode == SRP_LOGIN_RSP) { + struct srp_login_rsp *rsp = event->private_data; + + target->max_ti_iu_len = be32_to_cpu(rsp->max_ti_iu_len); + target->req_lim = be32_to_cpu(rsp->req_lim_delta); 
+ + target->scsi_host->can_queue = min(target->req_lim, + target->scsi_host->can_queue); + } else { + printk(KERN_WARNING PFX "Unhandled RSP opcode %#x\n", opcode); + target->status = -ECONNRESET; + break; + } + + target->status = srp_alloc_iu_bufs(target); + if (target->status) + break; + + qp_attr = kmalloc(sizeof *qp_attr, GFP_KERNEL); + if (!qp_attr) { + target->status = -ENOMEM; + break; + } + + qp_attr->qp_state = IB_QPS_RTR; + target->status = ib_cm_init_qp_attr(cm_id, qp_attr, &attr_mask); + if (target->status) + break; + + target->status = ib_modify_qp(target->qp, qp_attr, attr_mask); + if (target->status) + break; + + target->status = srp_post_recv(target); + if (target->status) + break; + + qp_attr->qp_state = IB_QPS_RTS; + target->status = ib_cm_init_qp_attr(cm_id, qp_attr, &attr_mask); + if (target->status) + break; + + target->status = ib_modify_qp(target->qp, qp_attr, attr_mask); + if (target->status) + break; + + target->status = ib_send_cm_rtu(cm_id, NULL, 0); + if (target->status) + break; + + break; + + case IB_CM_REJ_RECEIVED: + printk(KERN_DEBUG PFX "REJ received\n"); + comp = 1; + + srp_cm_rej_handler(cm_id, event, target); + break; + + case IB_CM_MRA_RECEIVED: + printk(KERN_ERR PFX "MRA received\n"); + break; + + case IB_CM_DREP_RECEIVED: + break; + + case IB_CM_TIMEWAIT_EXIT: + printk(KERN_ERR PFX "connection closed\n"); + + comp = 1; + target->status = 0; + break; + + default: + printk(KERN_WARNING PFX "Unhandled CM event %d\n", event->event); + break; + } + + if (comp) + complete(&target->done); + + kfree(qp_attr); + + return 0; +} + +static int srp_send_tsk_mgmt(struct scsi_cmnd *scmnd, u8 func) +{ + struct srp_target_port *target = host_to_target(scmnd->device->host); + struct srp_request *req; + struct srp_iu *iu; + struct srp_tsk_mgmt *tsk_mgmt; + int req_index; + int ret = FAILED; + + spin_lock_irq(target->scsi_host->host_lock); + + if (scmnd->host_scribble == (void *) -1L) + goto out; + + req_index = (long) scmnd->host_scribble; + 
printk(KERN_ERR "Abort for req_index %d\n", req_index); + + req = &target->req_ring[req_index]; + init_completion(&req->done); + + iu = __srp_get_tx_iu(target); + if (!iu) + goto out; + + tsk_mgmt = iu->buf; + memset(tsk_mgmt, 0, sizeof *tsk_mgmt); + + tsk_mgmt->opcode = SRP_TSK_MGMT; + tsk_mgmt->lun = cpu_to_be64((u64) scmnd->device->lun << 48); + tsk_mgmt->tag = req_index | SRP_TAG_TSK_MGMT; + tsk_mgmt->tsk_mgmt_func = func; + tsk_mgmt->task_tag = req_index; + + if (__srp_post_send(target, iu, sizeof *tsk_mgmt)) + goto out; + + req->tsk_mgmt = iu; + + spin_unlock_irq(target->scsi_host->host_lock); + if (!wait_for_completion_timeout(&req->done, + msecs_to_jiffies(SRP_ABORT_TIMEOUT_MS))) + return FAILED; + spin_lock_irq(target->scsi_host->host_lock); + + if (req->cmd_done) { + list_del(&req->list); + req->next = target->req_head; + target->req_head = req_index; + + scmnd->scsi_done(scmnd); + } else if (!req->tsk_status) { + scmnd->result = DID_ABORT << 16; + ret = SUCCESS; + } + +out: + spin_unlock_irq(target->scsi_host->host_lock); + return ret; +} + +static int srp_abort(struct scsi_cmnd *scmnd) +{ + printk(KERN_ERR "SRP abort called\n"); + + return srp_send_tsk_mgmt(scmnd, SRP_TSK_ABORT_TASK); +} + +static int srp_reset_device(struct scsi_cmnd *scmnd) +{ + printk(KERN_ERR "SRP reset_device called\n"); + + return srp_send_tsk_mgmt(scmnd, SRP_TSK_LUN_RESET); +} + +static int srp_reset_host(struct scsi_cmnd *scmnd) +{ + struct srp_target_port *target = host_to_target(scmnd->device->host); + int ret = FAILED; + + printk(KERN_ERR PFX "SRP reset_host called\n"); + + if (!srp_reconnect_target(target)) + ret = SUCCESS; + + return ret; +} + +static struct scsi_host_template srp_template = { + .module = THIS_MODULE, + .name = DRV_NAME, + .info = srp_target_info, + .queuecommand = srp_queuecommand, + .eh_abort_handler = srp_abort, + .eh_device_reset_handler = srp_reset_device, + .eh_host_reset_handler = srp_reset_host, + .can_queue = SRP_SQ_SIZE, + .this_id = -1, + 
.sg_tablesize = SRP_MAX_INDIRECT, + .cmd_per_lun = SRP_SQ_SIZE, + .use_clustering = ENABLE_CLUSTERING +}; + +static int srp_add_target(struct srp_host *host, struct srp_target_port *target) +{ + sprintf(target->target_name, "SRP.T10:%016llX", + (unsigned long long) be64_to_cpu(target->id_ext)); + + if (scsi_add_host(target->scsi_host, host->dev->dma_device)) + return -ENODEV; + + down(&host->target_mutex); + list_add_tail(&target->list, &host->target_list); + up(&host->target_mutex); + + target->state = SRP_TARGET_LIVE; + + /* XXX: are we supposed to have a definition of SCAN_WILD_CARD ?? */ + scsi_scan_target(&target->scsi_host->shost_gendev, + 0, target->scsi_id, ~0, 0); + + return 0; +} + +static void srp_release_class_dev(struct class_device *class_dev) +{ + struct srp_host *host = + container_of(class_dev, struct srp_host, class_dev); + + complete(&host->released); +} + +static struct class srp_class = { + .name = "infiniband_srp", + .release = srp_release_class_dev +}; + +/* + * Target ports are added by writing + * + * id_ext=,ioc_guid=,dgid=, + * pkey=,service_id= + * + * to the add_target sysfs attribute. 
+ */ +enum { + SRP_OPT_ERR = 0, + SRP_OPT_ID_EXT = 1 << 0, + SRP_OPT_IOC_GUID = 1 << 1, + SRP_OPT_DGID = 1 << 2, + SRP_OPT_PKEY = 1 << 3, + SRP_OPT_SERVICE_ID = 1 << 4, + SRP_OPT_MAX_SECT = 1 << 5, + SRP_OPT_ALL = (SRP_OPT_ID_EXT | + SRP_OPT_IOC_GUID | + SRP_OPT_DGID | + SRP_OPT_PKEY | + SRP_OPT_SERVICE_ID), +}; + +static match_table_t srp_opt_tokens = { + { SRP_OPT_ID_EXT, "id_ext=%s" }, + { SRP_OPT_IOC_GUID, "ioc_guid=%s" }, + { SRP_OPT_DGID, "dgid=%s" }, + { SRP_OPT_PKEY, "pkey=%x" }, + { SRP_OPT_SERVICE_ID, "service_id=%s" }, + { SRP_OPT_MAX_SECT, "max_sect=%d" }, + { SRP_OPT_ERR, NULL } +}; + +static int srp_parse_options(const char *buf, struct srp_target_port *target) +{ + char *options, *sep_opt; + char *p; + char dgid[3]; + substring_t args[MAX_OPT_ARGS]; + int opt_mask = 0; + int token; + int ret = -EINVAL; + int i; + + options = kstrdup(buf, GFP_KERNEL); + if (!options) + return -ENOMEM; + + sep_opt = options; + while ((p = strsep(&sep_opt, ",")) != NULL) { + if (!*p) + continue; + + token = match_token(p, srp_opt_tokens, args); + opt_mask |= token; + + switch (token) { + case SRP_OPT_ID_EXT: + p = match_strdup(args); + target->id_ext = cpu_to_be64(simple_strtoull(p, NULL, 16)); + kfree(p); + break; + + case SRP_OPT_IOC_GUID: + p = match_strdup(args); + target->ioc_guid = cpu_to_be64(simple_strtoull(p, NULL, 16)); + kfree(p); + break; + + case SRP_OPT_DGID: + p = match_strdup(args); + if (strlen(p) != 32) { + printk(KERN_WARNING PFX "bad dest GID parameter '%s'\n", p); + kfree(p); + goto out; + } + + for (i = 0; i < 16; ++i) { + strlcpy(dgid, p + i * 2, 3); + target->path.dgid.raw[i] = simple_strtoul(dgid, NULL, 16); + } + kfree(p); + break; + + case SRP_OPT_PKEY: + if (match_hex(args, &token)) { + printk(KERN_WARNING PFX "bad P_Key parameter '%s'\n", p); + goto out; + } + target->path.pkey = cpu_to_be16(token); + break; + + case SRP_OPT_SERVICE_ID: + p = match_strdup(args); + target->service_id = cpu_to_be64(simple_strtoull(p, NULL, 16)); + kfree(p); + break; + + case
SRP_OPT_MAX_SECT: + if (match_int(args, &token)) { + printk(KERN_WARNING PFX "bad max sect parameter '%s'\n", p); + goto out; + } + target->scsi_host->max_sectors = token; + break; + + default: + printk(KERN_WARNING PFX "unknown parameter or missing value " + "'%s' in target creation request\n", p); + goto out; + } + } + + if ((opt_mask & SRP_OPT_ALL) == SRP_OPT_ALL) + ret = 0; + else + for (i = 0; i < ARRAY_SIZE(srp_opt_tokens); ++i) + if ((srp_opt_tokens[i].token & SRP_OPT_ALL) && + !(srp_opt_tokens[i].token & opt_mask)) + printk(KERN_WARNING PFX "target creation request is " + "missing parameter '%s'\n", + srp_opt_tokens[i].pattern); + +out: + kfree(options); + return ret; +} + +static ssize_t srp_create_target(struct class_device *class_dev, + const char *buf, size_t count) +{ + struct srp_host *host = + container_of(class_dev, struct srp_host, class_dev); + struct Scsi_Host *target_host; + struct srp_target_port *target; + int ret; + int i; + + target_host = scsi_host_alloc(&srp_template, + sizeof (struct srp_target_port)); + if (!target_host) + return -ENOMEM; + + target = host_to_target(target_host); + memset(target, 0, sizeof *target); + + target->scsi_host = target_host; + target->srp_host = host; + + INIT_WORK(&target->work, srp_reconnect_work, target); + + for (i = 0; i < SRP_SQ_SIZE - 1; ++i) + target->req_ring[i].next = i + 1; + target->req_ring[SRP_SQ_SIZE - 1].next = -1; + INIT_LIST_HEAD(&target->req_queue); + + ret = srp_parse_options(buf, target); + if (ret) + goto err; + + ib_get_cached_gid(host->dev, host->port, 0, &target->path.sgid); + + printk(KERN_DEBUG PFX "new target: id_ext %016llx ioc_guid %016llx pkey %04x " + "service_id %016llx dgid %04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", + (unsigned long long) be64_to_cpu(target->id_ext), + (unsigned long long) be64_to_cpu(target->ioc_guid), + be16_to_cpu(target->path.pkey), + (unsigned long long) be64_to_cpu(target->service_id), + (int) be16_to_cpu(*(__be16 *) &target->path.dgid.raw[0]), + (int) 
be16_to_cpu(*(__be16 *) &target->path.dgid.raw[2]), + (int) be16_to_cpu(*(__be16 *) &target->path.dgid.raw[4]), + (int) be16_to_cpu(*(__be16 *) &target->path.dgid.raw[6]), + (int) be16_to_cpu(*(__be16 *) &target->path.dgid.raw[8]), + (int) be16_to_cpu(*(__be16 *) &target->path.dgid.raw[10]), + (int) be16_to_cpu(*(__be16 *) &target->path.dgid.raw[12]), + (int) be16_to_cpu(*(__be16 *) &target->path.dgid.raw[14])); + + ret = srp_create_target_ib(target); + if (ret) + goto err; + + target->cm_id = ib_create_cm_id(host->dev, srp_cm_handler, target); + if (IS_ERR(target->cm_id)) { + ret = PTR_ERR(target->cm_id); + goto err_free; + } + + ret = srp_connect_target(target); + if (ret) { + printk(KERN_ERR PFX "Connection failed\n"); + goto err_cm_id; + } + + ret = srp_add_target(host, target); + if (ret) + goto err_disconnect; + + return count; + +err_disconnect: + srp_disconnect_target(target); + +err_cm_id: + ib_destroy_cm_id(target->cm_id); + +err_free: + srp_free_target_ib(target); + +err: + scsi_host_put(target_host); + + return ret; +} + +static CLASS_DEVICE_ATTR(add_target, S_IWUSR, NULL, srp_create_target); + +static ssize_t show_ibdev(struct class_device *class_dev, char *buf) +{ + struct srp_host *host = + container_of(class_dev, struct srp_host, class_dev); + + return sprintf(buf, "%s\n", host->dev->name); +} + +static CLASS_DEVICE_ATTR(ibdev, S_IRUGO, show_ibdev, NULL); + +static ssize_t show_port(struct class_device *class_dev, char *buf) +{ + struct srp_host *host = + container_of(class_dev, struct srp_host, class_dev); + + return sprintf(buf, "%d\n", host->port); +} + +static CLASS_DEVICE_ATTR(port, S_IRUGO, show_port, NULL); + +static struct srp_host *srp_add_port(struct ib_device *device, + __be64 node_guid, u8 port) +{ + struct srp_host *host; + + host = kzalloc(sizeof *host, GFP_KERNEL); + if (!host) + return NULL; + + INIT_LIST_HEAD(&host->target_list); + init_MUTEX(&host->target_mutex); + init_completion(&host->released); + host->dev = device; + 
host->port = port; + + host->initiator_port_id[7] = port; + memcpy(host->initiator_port_id + 8, &node_guid, 8); + + host->pd = ib_alloc_pd(device); + if (IS_ERR(host->pd)) + goto err_free; + + host->mr = ib_get_dma_mr(host->pd, + IB_ACCESS_LOCAL_WRITE | + IB_ACCESS_REMOTE_READ | + IB_ACCESS_REMOTE_WRITE); + if (IS_ERR(host->mr)) + goto err_pd; + + host->class_dev.class = &srp_class; + host->class_dev.dev = device->dma_device; + snprintf(host->class_dev.class_id, BUS_ID_SIZE, "srp-%s-%d", + device->name, port); + + if (class_device_register(&host->class_dev)) + goto err_mr; + if (class_device_create_file(&host->class_dev, &class_device_attr_add_target)) + goto err_class; + if (class_device_create_file(&host->class_dev, &class_device_attr_ibdev)) + goto err_class; + if (class_device_create_file(&host->class_dev, &class_device_attr_port)) + goto err_class; + + return host; + +err_class: + class_device_unregister(&host->class_dev); + +err_mr: + ib_dereg_mr(host->mr); + +err_pd: + ib_dealloc_pd(host->pd); + +err_free: + kfree(host); + + return NULL; +} + +static void srp_add_one(struct ib_device *device) +{ + struct list_head *dev_list; + struct srp_host *host; + struct ib_device_attr *dev_attr; + int s, e, p; + + dev_attr = kmalloc(sizeof *dev_attr, GFP_KERNEL); + if (!dev_attr) + return; + + if (ib_query_device(device, dev_attr)) { + printk(KERN_WARNING PFX "Couldn't query node GUID for %s.\n", + device->name); + goto out; + } + + dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); + if (!dev_list) + goto out; + + INIT_LIST_HEAD(dev_list); + + if (device->node_type == IB_NODE_SWITCH) { + s = 0; + e = 0; + } else { + s = 1; + e = device->phys_port_cnt; + } + + for (p = s; p <= e; ++p) { + host = srp_add_port(device, dev_attr->node_guid, p); + if (host) + list_add_tail(&host->list, dev_list); + } + + ib_set_client_data(device, &srp_client, dev_list); + +out: + kfree(dev_attr); +} + +static void srp_remove_one(struct ib_device *device) +{ + struct list_head *dev_list; + 
struct srp_host *host, *tmp_host; + LIST_HEAD(target_list); + struct srp_target_port *target, *tmp_target; + unsigned long flags; + + dev_list = ib_get_client_data(device, &srp_client); + + list_for_each_entry_safe(host, tmp_host, dev_list, list) { + class_device_unregister(&host->class_dev); + /* + * Wait for the sysfs entry to go away, so that no new + * target ports can be created. + */ + wait_for_completion(&host->released); + + /* + * Mark all target ports as removed, so we stop queueing + * commands and don't try to reconnect. + */ + down(&host->target_mutex); + list_for_each_entry_safe(target, tmp_target, + &host->target_list, list) { + spin_lock_irqsave(target->scsi_host->host_lock, flags); + if (target->state != SRP_TARGET_REMOVED) + target->state = SRP_TARGET_REMOVED; + spin_unlock_irqrestore(target->scsi_host->host_lock, flags); + } + up(&host->target_mutex); + + /* + * Wait for any reconnection tasks that may have + * started before we marked our target ports as + * removed, and any target port removal tasks. 
+ */ + flush_scheduled_work(); + + list_for_each_entry_safe(target, tmp_target, + &host->target_list, list) { + scsi_remove_host(target->scsi_host); + srp_disconnect_target(target); + ib_destroy_cm_id(target->cm_id); + srp_free_target_ib(target); + scsi_host_put(target->scsi_host); + } + + ib_dereg_mr(host->mr); + ib_dealloc_pd(host->pd); + kfree(host); + } + + kfree(dev_list); +} + +static int __init srp_init_module(void) +{ + int ret; + + ret = class_register(&srp_class); + if (ret) { + printk(KERN_ERR PFX "couldn't register class infiniband_srp\n"); + return ret; + } + + ret = ib_register_client(&srp_client); + if (ret) { + printk(KERN_ERR PFX "couldn't register IB client\n"); + class_unregister(&srp_class); + return ret; + } + + return 0; +} + +static void __exit srp_cleanup_module(void) +{ + ib_unregister_client(&srp_client); + class_unregister(&srp_class); +} + +module_init(srp_init_module); +module_exit(srp_cleanup_module); diff --git a/drivers/infiniband/ulp/srp/ib_srp.h b/drivers/infiniband/ulp/srp/ib_srp.h new file mode 100644 index 0000000..4fec28a --- /dev/null +++ b/drivers/infiniband/ulp/srp/ib_srp.h @@ -0,0 +1,150 @@ +/* + * Copyright (c) 2005 Cisco Systems. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_srp.h 3932 2005-11-01 17:19:29Z roland $ + */ + +#ifndef IB_SRP_H +#define IB_SRP_H + +#include +#include + +#include + +#include +#include + +#include +#include +#include + +enum { + SRP_PATH_REC_TIMEOUT_MS = 1000, + SRP_ABORT_TIMEOUT_MS = 5000, + + SRP_PORT_REDIRECT = 1, + SRP_DLID_REDIRECT = 2, + + SRP_MAX_IU_LEN = 256, + + SRP_RQ_SHIFT = 6, + SRP_RQ_SIZE = 1 << SRP_RQ_SHIFT, + SRP_SQ_SIZE = SRP_RQ_SIZE - 1, + SRP_CQ_SIZE = SRP_SQ_SIZE + SRP_RQ_SIZE, + + SRP_TAG_TSK_MGMT = 1 << (SRP_RQ_SHIFT + 1) +}; + +#define SRP_OP_RECV (1 << 31) +#define SRP_MAX_INDIRECT ((SRP_MAX_IU_LEN - \ + sizeof (struct srp_cmd) - \ + sizeof (struct srp_indirect_buf)) / 16) + +enum srp_target_state { + SRP_TARGET_LIVE, + SRP_TARGET_CONNECTING, + SRP_TARGET_DEAD, + SRP_TARGET_REMOVED +}; + +struct srp_host { + u8 initiator_port_id[16]; + struct ib_device *dev; + u8 port; + struct ib_pd *pd; + struct ib_mr *mr; + struct class_device class_dev; + struct list_head target_list; + struct semaphore target_mutex; + struct completion released; + struct list_head list; +}; + +struct srp_request { + struct list_head list; + struct scsi_cmnd *scmnd; + struct srp_iu *cmd; + struct srp_iu *tsk_mgmt; + DECLARE_PCI_UNMAP_ADDR(direct_mapping) + struct completion done; + short 
next; + u8 cmd_done; + u8 tsk_status; +}; + +struct srp_target_port { + __be64 id_ext; + __be64 ioc_guid; + __be64 service_id; + struct srp_host *srp_host; + struct Scsi_Host *scsi_host; + char target_name[32]; + unsigned int scsi_id; + + struct ib_sa_path_rec path; + struct ib_sa_query *path_query; + int path_query_id; + + struct ib_cm_id *cm_id; + struct ib_cq *cq; + struct ib_qp *qp; + + int max_ti_iu_len; + s32 req_lim; + + unsigned rx_head; + struct srp_iu *rx_ring[SRP_RQ_SIZE]; + + unsigned tx_head; + unsigned tx_tail; + struct srp_iu *tx_ring[SRP_SQ_SIZE + 1]; + + int req_head; + struct list_head req_queue; + struct srp_request req_ring[SRP_SQ_SIZE]; + + struct work_struct work; + + struct list_head list; + struct completion done; + int status; + enum srp_target_state state; +}; + +struct srp_iu { + dma_addr_t dma; + void *buf; + size_t size; + enum dma_data_direction direction; +}; + +#endif /* IB_SRP_H */ diff --git a/include/scsi/srp.h b/include/scsi/srp.h new file mode 100644 index 0000000..6c2681d --- /dev/null +++ b/include/scsi/srp.h @@ -0,0 +1,226 @@ +/* + * Copyright (c) 2005 Cisco Systems. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#ifndef SCSI_SRP_H +#define SCSI_SRP_H + +/* + * Structures and constants for the SCSI RDMA Protocol (SRP) as + * defined by the INCITS T10 committee. This file was written using + * draft Revision 16a of the SRP standard. + */ + +#include + +enum { + SRP_LOGIN_REQ = 0x00, + SRP_TSK_MGMT = 0x01, + SRP_CMD = 0x02, + SRP_I_LOGOUT = 0x03, + SRP_LOGIN_RSP = 0xc0, + SRP_RSP = 0xc1, + SRP_LOGIN_REJ = 0xc2, + SRP_T_LOGOUT = 0x80, + SRP_CRED_REQ = 0x81, + SRP_AER_REQ = 0x82, + SRP_CRED_RSP = 0x41, + SRP_AER_RSP = 0x42 +}; + +enum { + SRP_BUF_FORMAT_DIRECT = 1 << 1, + SRP_BUF_FORMAT_INDIRECT = 1 << 2 +}; + +enum { + SRP_NO_DATA_DESC = 0, + SRP_DATA_DESC_DIRECT = 1, + SRP_DATA_DESC_INDIRECT = 2 +}; + +enum { + SRP_TSK_ABORT_TASK = 0x01, + SRP_TSK_ABORT_TASK_SET = 0x02, + SRP_TSK_CLEAR_TASK_SET = 0x04, + SRP_TSK_LUN_RESET = 0x08, + SRP_TSK_CLEAR_ACA = 0x40 +}; + +enum srp_login_rej_reason { + SRP_LOGIN_REJ_UNABLE_ESTABLISH_CHANNEL = 0x00010000, + SRP_LOGIN_REJ_INSUFFICIENT_RESOURCES = 0x00010001, + SRP_LOGIN_REJ_REQ_IT_IU_LENGTH_TOO_LARGE = 0x00010002, + SRP_LOGIN_REJ_UNABLE_ASSOCIATE_CHANNEL = 0x00010003, + SRP_LOGIN_REJ_UNSUPPORTED_DESCRIPTOR_FMT = 0x00010004, + SRP_LOGIN_REJ_MULTI_CHANNEL_UNSUPPORTED = 0x00010005, + SRP_LOGIN_REJ_CHANNEL_LIMIT_REACHED = 0x00010006 +}; + +struct srp_direct_buf { + __be64 va; + __be32 key; + __be32 len; +}; + +/* + * We need the packed attribute because the SRP spec puts the list of + * descriptors at an offset of 20, which is not 
aligned to the size + * of struct srp_direct_buf. + */ +struct srp_indirect_buf { + struct srp_direct_buf table_desc; + __be32 len; + struct srp_direct_buf desc_list[0] __attribute__((packed)); +}; + +enum { + SRP_MULTICHAN_SINGLE = 0, + SRP_MULTICHAN_MULTI = 1 +}; + +struct srp_login_req { + u8 opcode; + u8 reserved1[7]; + u64 tag; + __be32 req_it_iu_len; + u8 reserved2[4]; + __be16 req_buf_fmt; + u8 req_flags; + u8 reserved3[5]; + u8 initiator_port_id[16]; + u8 target_port_id[16]; +}; + +struct srp_login_rsp { + u8 opcode; + u8 reserved1[3]; + __be32 req_lim_delta; + u64 tag; + __be32 max_it_iu_len; + __be32 max_ti_iu_len; + __be16 buf_fmt; + u8 rsp_flags; + u8 reserved2[25]; +}; + +struct srp_login_rej { + u8 opcode; + u8 reserved1[3]; + __be32 reason; + u64 tag; + u8 reserved2[8]; + __be16 buf_fmt; + u8 reserved3[6]; +}; + +struct srp_i_logout { + u8 opcode; + u8 reserved[7]; + u64 tag; +}; + +struct srp_t_logout { + u8 opcode; + u8 sol_not; + u8 reserved[2]; + __be32 reason; + u64 tag; +}; + +/* + * We need the packed attribute because the SRP spec only aligns the + * 8-byte LUN field to 4 bytes. + */ +struct srp_tsk_mgmt { + u8 opcode; + u8 sol_not; + u8 reserved1[6]; + u64 tag; + u8 reserved2[4]; + __be64 lun __attribute__((packed)); + u8 reserved3[2]; + u8 tsk_mgmt_func; + u8 reserved4; + u64 task_tag; + u8 reserved5[8]; +}; + +/* + * We need the packed attribute because the SRP spec only aligns the + * 8-byte LUN field to 4 bytes. 
+ */ +struct srp_cmd { + u8 opcode; + u8 sol_not; + u8 reserved1[3]; + u8 buf_fmt; + u8 data_out_desc_cnt; + u8 data_in_desc_cnt; + u64 tag; + u8 reserved2[4]; + __be64 lun __attribute__((packed)); + u8 reserved3; + u8 task_attr; + u8 reserved4; + u8 add_cdb_len; + u8 cdb[16]; + u8 add_data[0]; +}; + +enum { + SRP_RSP_FLAG_RSPVALID = 1 << 0, + SRP_RSP_FLAG_SNSVALID = 1 << 1, + SRP_RSP_FLAG_DOOVER = 1 << 2, + SRP_RSP_FLAG_DOUNDER = 1 << 3, + SRP_RSP_FLAG_DIOVER = 1 << 4, + SRP_RSP_FLAG_DIUNDER = 1 << 5 +}; + +struct srp_rsp { + u8 opcode; + u8 sol_not; + u8 reserved1[2]; + __be32 req_lim_delta; + u64 tag; + u8 reserved2[2]; + u8 flags; + u8 status; + __be32 data_out_res_cnt; + __be32 data_in_res_cnt; + __be32 sense_data_len; + __be32 resp_data_len; + u8 data[0]; +}; + +#endif /* SCSI_SRP_H */ --- 0.99.9 From Richard.Frank at oracle.com Fri Nov 4 16:45:17 2005 From: Richard.Frank at oracle.com (Rick Frank) Date: Fri, 4 Nov 2005 19:45:17 -0500 Subject: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB References: Message-ID: <009401c5e1a2$9bb29210$6401a8c0@YOURA11C73D0FD> At this point we really need to get RDS on IB ported to Gen 2 so we can get this into Linux distributions ASAP. We (Oracle) are currently investigating / working on an RDS over Ethernet driver for Linux. Our current plans are to produce a new verbs provider that registers with the Gen 2 IB verbs layer. This new driver will bind to a standard Ethernet NIC driver and implement the RC semantics. This will allow us to use 100% of the ported RDS ULP. Note that RDS should also run over any other interconnect that registers with the verbs layer, such as iWARP.
----- Original Message ----- From: "Bob Woodruff" To: "'Ranjit Pandit'" Cc: "Rick Frank" ; Sent: Friday, November 04, 2005 6:58 PM Subject: RE: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB > Ranjit wrote, >>RDS is somewhat like SDP in that it offloads/accelerates SOCK_DGRAM >>instead of SOCK_STREAM. > > So back to the question from Roland that started this thread. > When do you plan to re-work the code to use the OpenIB > verbs and make it suitable for the kernel ? > > And do you plan to develop the code, or at least the infrastructure > to allow multiple RDS providers to plug in > so that it is ubiquitous - supported on all interconnects - to include > simple Ethernet NICs ? > > woody > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From rolandd at cisco.com Fri Nov 4 17:07:58 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 04 Nov 2005 17:07:58 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB In-Reply-To: <009401c5e1a2$9bb29210$6401a8c0@YOURA11C73D0FD> (Rick Frank's message of "Fri, 4 Nov 2005 19:45:17 -0500") References: <009401c5e1a2$9bb29210$6401a8c0@YOURA11C73D0FD> Message-ID: <52fyqb2975.fsf@cisco.com> Rick> We (Oracle) are currently investigating / working on an RDS Rick> over Ethernet driver for Linux. Our current plans are to Rick> produce a new verbs provider that registers with Gen 2 IB Rick> verbs layer. This new driver will bind to a standard Rick> ethernet nic driver and implement the RC semantics. This Rick> will allow us to use 100% of the ported RDS ULP. That seems rather an awkward way to go about it. Why not just use TCP? - R. 
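One way to read the "just use TCP" suggestion above: TCP is a byte stream, so a datagram service layered on it has to mark message boundaries itself. The following is a minimal sketch of length-prefixed framing in C, under stated assumptions. The 4-byte big-endian header and the names dgram_frame / dgram_unframe are hypothetical and chosen only for illustration; they are not taken from the RDS code or from any RDS wire format.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 4-byte big-endian length header; not part of any RDS spec. */
#define DGRAM_HDR_LEN 4

/*
 * Write one framed datagram (length header followed by payload) into out.
 * Returns the number of bytes written, or 0 if out is too small.
 */
static size_t dgram_frame(const void *payload, uint32_t len,
                          uint8_t *out, size_t out_size)
{
    if (out_size < DGRAM_HDR_LEN || len > out_size - DGRAM_HDR_LEN)
        return 0;

    out[0] = (uint8_t)(len >> 24);
    out[1] = (uint8_t)(len >> 16);
    out[2] = (uint8_t)(len >> 8);
    out[3] = (uint8_t)len;
    memcpy(out + DGRAM_HDR_LEN, payload, len);

    return DGRAM_HDR_LEN + (size_t)len;
}

/*
 * Try to extract one datagram from the bytes received so far.
 * Returns the number of stream bytes consumed, or 0 if the frame is
 * still incomplete. On success, *payload points into stream and *len
 * holds the payload length.
 */
static size_t dgram_unframe(const uint8_t *stream, size_t avail,
                            const uint8_t **payload, uint32_t *len)
{
    uint32_t n;

    if (avail < DGRAM_HDR_LEN)
        return 0;

    n = ((uint32_t)stream[0] << 24) | ((uint32_t)stream[1] << 16) |
        ((uint32_t)stream[2] << 8)  |  (uint32_t)stream[3];

    if (avail - DGRAM_HDR_LEN < n)
        return 0;

    *payload = stream + DGRAM_HDR_LEN;
    *len = n;

    return DGRAM_HDR_LEN + (size_t)n;
}
```

A receiver would call dgram_unframe in a loop on its accumulated receive buffer, dropping the returned byte count each time, and wait for more data whenever it returns 0.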
From rolandd at cisco.com Fri Nov 4 18:49:19 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Fri, 04 Nov 2005 18:49:19 -0800
Subject: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB
In-Reply-To: <00b901c5e1a6$3eb1af20$6401a8c0@YOURA11C73D0FD> (Rick Frank's message of "Fri, 4 Nov 2005 20:14:16 -0500")
References: <009401c5e1a2$9bb29210$6401a8c0@YOURA11C73D0FD> <52fyqb2975.fsf@cisco.com> <00b901c5e1a6$3eb1af20$6401a8c0@YOURA11C73D0FD>
Message-ID: <5264r724i8.fsf@cisco.com>

    Rick> Do you mean use TCP and the RC transport in the ethernet
    Rick> verbs provider ?

No, I mean just write RDS for ethernet on top of sockets. I don't
think it's worth implementing a whole RDMA provider on top of ethernet
just so you can use the same RDS code. The SilverStorm RDS code is
only about 10K lines of code, and I think a sane implementation would
probably be less than 5K, so you're not getting much benefit from all
the effort of writing an RDMA provider.

In fact I'm not sure that it doesn't make sense to implement RDS as a
library + daemon completely in userspace.

 - R.

From halr at voltaire.com Sat Nov 5 07:40:14 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 05 Nov 2005 10:40:14 -0500
Subject: [openib-general] Re: [PATCH] Opensm - bug in osm_sa_path_record with 0 records
In-Reply-To: <1131036469.4338.446.camel@hal.voltaire.com>
References: <5zacglyj4x.fsf@mtl066.yok.mtl.com> <1131025659.4338.206.camel@hal.voltaire.com> <1131036469.4338.446.camel@hal.voltaire.com>
Message-ID: <1131204055.4407.1262.camel@hal.voltaire.com>

Hi Yael,

On Thu, 2005-11-03 at 11:51, Hal Rosenstock wrote:

I committed this with minor format changes. Any changes (based on
questions in my emails) should be incremental to this.

Thanks.

-- Hal

> One additional comment on this:
>
> On Thu, 2005-11-03 at 08:50, Hal Rosenstock wrote:
> > On Thu, 2005-11-03 at 08:07, Yael Kalka wrote:
> > > Hi Hal,
> > >
> > > During some testing of path record we found a bug in the code.
> > > If the number of records return is zero, then there is clearing of > > > non allocated memory. > > > I've added some changes to the __osm_pr_rcv_respond function, to match > > > other sa responses. > > > Attached is a patch to fix it. > > > > A couple of minor comments below. > > > > -- Hal > > > > > Thanks, > > > Yael > > > > > > Thanks, > > > Yael > > > > > > Signed-off-by: Yael Kalka > > > > > > Index: opensm/osm_sa_path_record.c > > > =================================================================== > > > --- opensm/osm_sa_path_record.c (revision 3955) > > > +++ opensm/osm_sa_path_record.c (working copy) > > > @@ -1448,7 +1448,7 @@ __osm_pr_rcv_respond( > > > osm_madw_t* p_resp_madw; > > > const ib_sa_mad_t* p_sa_mad; > > > ib_sa_mad_t* p_resp_sa_mad; > > > - size_t num_rec, num_copied; > > > + size_t num_rec, num_copied, pre_trim_num_rec; > > > #ifndef VENDOR_RMPP_SUPPORT > > > size_t trim_num_rec; > > > #endif > > > @@ -1456,6 +1456,7 @@ __osm_pr_rcv_respond( > > > ib_api_status_t status; > > > const ib_sa_mad_t* p_rcvd_mad = osm_madw_get_sa_mad_ptr( p_madw ); > > > osm_pr_item_t* p_pr_item; > > > + uint32_t i; > > > > > > OSM_LOG_ENTER( p_rcv->p_log, __osm_pr_rcv_respond ); > > > > > > @@ -1483,6 +1484,7 @@ __osm_pr_rcv_respond( > > > goto Exit; > > > } > > > > > > + pre_trim_num_rec = num_rec; > > > #ifndef VENDOR_RMPP_SUPPORT > > > trim_num_rec = (MAD_BLOCK_SIZE - IB_SA_MAD_HDR_SIZE) / sizeof(ib_path_rec_t); > > > if (trim_num_rec < num_rec) > > > @@ -1495,11 +1497,15 @@ __osm_pr_rcv_respond( > > > } > > > #endif > > > > > > - if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) > > > - { > > > osm_log( p_rcv->p_log, OSM_LOG_DEBUG, > > > "__osm_pr_rcv_respond: " > > > "Generating response with %u records.\n", num_rec ); > > > + > > > + if ((p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec == 0)) > > > + { > > > + osm_sa_send_error( p_rcv->p_resp, p_madw, > > > + IB_SA_MAD_STATUS_NO_RECORDS ); > > > + goto Exit; > > > } > > > > This can be 
moved up immediately after the C15-0.1.30 clause, OK ? > > > > > /* > > > @@ -1514,6 +1520,16 @@ __osm_pr_rcv_respond( > > > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > > "__osm_pr_rcv_respond: ERR 1F14: " > > > "Unable to allocate MAD.\n" ); > > > + > > > + for( i = 0; i < num_rec; i++ ) > > > + { > > > + p_pr_item = (osm_pr_item_t*)cl_qlist_remove_head( p_list ); > > > + cl_qlock_pool_put( &p_rcv->pr_pool, &p_pr_item->pool_item ); > > > + } > > > + > > > + osm_sa_send_error( p_rcv->p_resp, p_madw, > > > + IB_SA_MAD_STATUS_NO_RESOURCES ); > > > + > > > > osm_sa_send_error also attempts to get a MAD from the pool. Is there a > > chance this succeeds after the one in this routine fails ? (Should this > > be eliminated ?) > > > > > goto Exit; > > > } > > > > > > @@ -1528,6 +1544,8 @@ __osm_pr_rcv_respond( > > > p_resp_sa_mad->attr_offset = > > > ib_get_attr_offset( sizeof(ib_path_rec_t) ); > > > > > > + p_resp_pr = (ib_path_rec_t*)ib_sa_mad_get_payload_ptr( p_resp_sa_mad ); > > > + > > > #ifndef VENDOR_RMPP_SUPPORT > > > /* we support only one packet RMPP - so we will set the first and > > > last flags for gettable */ > > > @@ -1542,37 +1560,19 @@ __osm_pr_rcv_respond( > > > p_resp_sa_mad->rmpp_flags = IB_RMPP_FLAG_ACTIVE; > > > #endif > > > > > > - p_resp_pr = (ib_path_rec_t*)ib_sa_mad_get_payload_ptr( p_resp_sa_mad ); > > > - > > > - if ( num_rec == 0 ) > > > - { > > > - if (p_resp_sa_mad->method == IB_MAD_METHOD_GET_RESP) > > > - p_resp_sa_mad->status = IB_SA_MAD_STATUS_NO_RECORDS; > > > - cl_memclr( p_resp_pr, sizeof(*p_resp_pr) ); > > > - } > > > - else > > > + for ( i = 0; i < pre_trim_num_rec; i++ ) > > > { > > > p_pr_item = (osm_pr_item_t*)cl_qlist_remove_head( p_list ); > > > - > > > - /* we need to track the number of copied items so we can > > > - * stop the copy - but clear them all > > > - */ > > > - num_copied = 0; > > > - > > > - while( p_pr_item != (osm_pr_item_t*)cl_qlist_end( p_list ) ) > > > - { > > > - /* Copy the Path Records from the list into the 
MAD */ > > > - if (num_copied < num_rec) > > > - { > > > + /* copy only if not trimmed */ > > > + if (i < num_rec) > > > *p_resp_pr = p_pr_item->path_rec; > > > - num_copied++; > > > - } > > > + > > > cl_qlock_pool_put( &p_rcv->pr_pool, &p_pr_item->pool_item ); > > > p_resp_pr++; > > > - p_pr_item = (osm_pr_item_t*)cl_qlist_remove_head( p_list ); > > > - } > > > } > > Should p_resp_pr only be incremented if i < num_recs ? > > Also, these comments apply to all the other SA record code as well. > > -- Hal > > > > + CL_ASSERT( cl_is_qlist_empty( p_list ) ); > > > + > > > status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); > > > > > > if( status != IB_SUCCESS ) > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mst at mellanox.co.il Sun Nov 6 02:43:18 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 6 Nov 2005 12:43:18 +0200 Subject: [openib-general] question : skb->dev versus dev_kfree_skb_any Message-ID: <20051106104318.GO31134@mellanox.co.il> Hello, Roland!
Why is ipoib always setting skb->dev before calling dev_kfree_skb_any?
E.g. ipoib_multicast.c, line 126, has

skb->dev = dev;
dev_kfree_skb_any(skb);

I went through dev_kfree_skb_any and couldn't see where it uses the
dev member. What am I missing?

Thanks,

--
MST

From mst at mellanox.co.il Sun Nov 6 02:48:45 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 6 Nov 2005 12:48:45 +0200
Subject: [openib-general] Re: [ANNOUNCE] Contribute RDS (Reliable DatagramSockets) to OpenIB
In-Reply-To: <20051104181952.GB4463@esmail.cup.hp.com>
References: <20051104181952.GB4463@esmail.cup.hp.com>
Message-ID: <20051106104845.GP31134@mellanox.co.il>

Quoting r. Grant Grundler :
> o The ulp/sdp/Kconfig comments say "AF_INET_SDP (address family 26)".
> AF_LLC uses 26 and sdp_sock.h defines 27.
> Michael - need a patch or is this trivial enough to fix by hand?

Fixed, thanks!

--
MST

From mst at mellanox.co.il Sun Nov 6 02:53:43 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 6 Nov 2005 12:53:43 +0200
Subject: [openib-general] Re: [PATCH] sdp zero copy support
In-Reply-To: <52oe502mzn.fsf@cisco.com>
References: <52oe502mzn.fsf@cisco.com>
Message-ID: <20051106105342.GQ31134@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: [PATCH] sdp zero copy support
>
> I haven't read the code yet, but:
>
> > +config INFINIBAND_SDP_SEND_ZCOPY
> > + bool "Sockets Direct Protocol Zero Copy Send support"
> > + depends on INFINIBAND_SDP
> > + default y
> > + ---help---
> > + This option enables Zero Copy support for send_msg transactions.
> > +
> > +config INFINIBAND_SDP_RECV_ZCOPY
> > + bool "Sockets Direct Protocol Zero Copy Receive support"
> > + depends on INFINIBAND_SDP && INFINIBAND_SDP_SEND_ZCOPY
> > + default y
> > + ---help---
> > + This option enables Zero Copy support for recv_msg transactions.
>
> Why would I ever say 'n'? I think we should either get rid of these
> config options, or if there is a reason for them, explain it better in
> > - R. > Actually, zero copy still does not use the DMA API, so I should make it dependent on x86/x86_64. I guess this option is also useful for debug purposes. I'll clarify this in the help text. -- MST From mst at mellanox.co.il Sun Nov 6 02:54:49 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 6 Nov 2005 12:54:49 +0200 Subject: [openib-general] Re: error compiling kernel... In-Reply-To: References: Message-ID: <20051106105449.GR31134@mellanox.co.il> Quoting r. Denis Pilon : > Subject: error compiling kernel... > > I am trying to compile but keep getting errors... > > linux-2.6.14(vanilla) plus latest svn release 3972. > > > LD drivers/infiniband/built-in.o > LD drivers/infiniband/core/built-in.o > CC [M] drivers/infiniband/core/addr.o > CC [M] drivers/infiniband/core/at.o > CC [M] drivers/infiniband/core/cm.o > drivers/infiniband/core/cm.c: In function `cm_alloc_msg': > drivers/infiniband/core/cm.c:179: error: `IB_MGMT_MAD_HDR' undeclared (first use in this function) > drivers/infiniband/core/cm.c:179: error: (Each undeclared identifier is reported only once > drivers/infiniband/core/cm.c:179: error: for each function it appears in.) 
> drivers/infiniband/core/cm.c:180: error: too few arguments to function `ib_create_send_mad' > drivers/infiniband/core/cm.c:187: error: structure has no member named `ah' > drivers/infiniband/core/cm.c:188: error: structure has no member named `retries' > drivers/infiniband/core/cm.c: In function `cm_alloc_response_msg': > drivers/infiniband/core/cm.c:209: error: `IB_MGMT_MAD_HDR' undeclared (first use in this function) > drivers/infiniband/core/cm.c:210: error: too few arguments to function `ib_create_send_mad' > drivers/infiniband/core/cm.c:215: error: structure has no member named `ah' > drivers/infiniband/core/cm.c: In function `cm_free_msg': > drivers/infiniband/core/cm.c:222: error: structure has no member named `ah' > drivers/infiniband/core/cm.c: In function `cm_insert_listen': > drivers/infiniband/core/cm.c:371: error: structure has no member named `device' > drivers/infiniband/core/cm.c:371: error: structure has no member named `device' > drivers/infiniband/core/cm.c:374: error: structure has no member named `device' > drivers/infiniband/core/cm.c:374: error: structure has no member named `device' > drivers/infiniband/core/cm.c:376: error: structure has no member named `device' > drivers/infiniband/core/cm.c:376: error: structure has no member named `device' > drivers/infiniband/core/cm.c: In function `cm_find_listen': > drivers/infiniband/core/cm.c:398: error: structure has no member named `device' > drivers/infiniband/core/cm.c:401: error: structure has no member named `device' > drivers/infiniband/core/cm.c:403: error: structure has no member named `device' > drivers/infiniband/core/cm.c: At top level: > drivers/infiniband/core/cm.c:543: error: conflicting types for 'ib_create_cm_id' > include/rdma/ib_cm.h:306: error: previous declaration of 'ib_create_cm_id' was here > drivers/infiniband/core/cm.c:543: error: conflicting types for 'ib_create_cm_id' > include/rdma/ib_cm.h:306: error: previous declaration of 'ib_create_cm_id' was here > 
drivers/infiniband/core/cm.c: In function `ib_create_cm_id': > drivers/infiniband/core/cm.c:552: error: structure has no member named `device' > drivers/infiniband/core/cm.c: In function `ib_destroy_cm_id': > drivers/infiniband/core/cm.c:679: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast > drivers/infiniband/core/cm.c:690: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast > drivers/infiniband/core/cm.c:707: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast > drivers/infiniband/core/cm.c: In function `ib_send_cm_req': > drivers/infiniband/core/cm.c:933: error: structure has no member named `timeout_ms' > drivers/infiniband/core/cm.c:942: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type > drivers/infiniband/core/cm.c:942: error: too few arguments to function `ib_post_send_mad' > drivers/infiniband/core/cm.c: In function `cm_issue_rej': > drivers/infiniband/core/cm.c:987: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type > drivers/infiniband/core/cm.c:987: error: too few arguments to function `ib_post_send_mad' > drivers/infiniband/core/cm.c: In function `cm_dup_req_handler': > drivers/infiniband/core/cm.c:1195: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type > drivers/infiniband/core/cm.c:1195: error: too few arguments to function `ib_post_send_mad' > drivers/infiniband/core/cm.c: In function `cm_match_req': > drivers/infiniband/core/cm.c:1235: error: structure has no member named `device' > drivers/infiniband/core/cm.c: In function `ib_send_cm_rep': > drivers/infiniband/core/cm.c:1381: error: structure has no member named `timeout_ms' > drivers/infiniband/core/cm.c:1384: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type > drivers/infiniband/core/cm.c:1384: error: too few arguments to function `ib_post_send_mad' > drivers/infiniband/core/cm.c: In function 
`ib_send_cm_rtu': > drivers/infiniband/core/cm.c:1448: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type > drivers/infiniband/core/cm.c:1448: error: too few arguments to function `ib_post_send_mad' > drivers/infiniband/core/cm.c: In function `cm_dup_rep_handler': > drivers/infiniband/core/cm.c:1520: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type > drivers/infiniband/core/cm.c:1520: error: too few arguments to function `ib_post_send_mad' > drivers/infiniband/core/cm.c: In function `cm_rep_handler': > drivers/infiniband/core/cm.c:1588: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast > drivers/infiniband/core/cm.c: In function `cm_establish_handler': > drivers/infiniband/core/cm.c:1622: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast > drivers/infiniband/core/cm.c: In function `cm_rtu_handler': > drivers/infiniband/core/cm.c:1661: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast > drivers/infiniband/core/cm.c: In function `ib_send_cm_dreq': > drivers/infiniband/core/cm.c:1719: error: structure has no member named `timeout_ms' > drivers/infiniband/core/cm.c:1722: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type > drivers/infiniband/core/cm.c:1722: error: too few arguments to function `ib_post_send_mad' > drivers/infiniband/core/cm.c: In function `ib_send_cm_drep': > drivers/infiniband/core/cm.c:1785: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type > drivers/infiniband/core/cm.c:1785: error: too few arguments to function `ib_post_send_mad' > drivers/infiniband/core/cm.c: In function `cm_dreq_handler': > drivers/infiniband/core/cm.c:1820: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast > drivers/infiniband/core/cm.c:1834: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type > 
drivers/infiniband/core/cm.c:1834: error: too few arguments to function `ib_post_send_mad' > drivers/infiniband/core/cm.c: In function `cm_drep_handler': > drivers/infiniband/core/cm.c:1881: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast > drivers/infiniband/core/cm.c: In function `ib_send_cm_rej': > drivers/infiniband/core/cm.c:1949: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type > drivers/infiniband/core/cm.c:1949: error: too few arguments to function `ib_post_send_mad' > drivers/infiniband/core/cm.c: In function `cm_rej_handler': > drivers/infiniband/core/cm.c:2025: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast > drivers/infiniband/core/cm.c:2035: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast > drivers/infiniband/core/cm.c: In function `ib_send_cm_mra': > drivers/infiniband/core/cm.c:2093: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type > drivers/infiniband/core/cm.c:2093: error: too few arguments to function `ib_post_send_mad' > drivers/infiniband/core/cm.c:2106: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type > drivers/infiniband/core/cm.c:2106: error: too few arguments to function `ib_post_send_mad' > drivers/infiniband/core/cm.c:2119: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type > drivers/infiniband/core/cm.c:2119: error: too few arguments to function `ib_post_send_mad' > drivers/infiniband/core/cm.c: In function `cm_mra_handler': > drivers/infiniband/core/cm.c:2181: warning: passing arg 2 of `ib_modify_mad' makes integer from pointer without a cast > drivers/infiniband/core/cm.c:2188: warning: passing arg 2 of `ib_modify_mad' makes integer from pointer without a cast > drivers/infiniband/core/cm.c:2196: warning: passing arg 2 of `ib_modify_mad' makes integer from pointer without a cast > drivers/infiniband/core/cm.c: In function 
`ib_send_cm_lap': > drivers/infiniband/core/cm.c:2279: error: structure has no member named `timeout_ms' > drivers/infiniband/core/cm.c:2282: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type > drivers/infiniband/core/cm.c:2282: error: too few arguments to function `ib_post_send_mad' > drivers/infiniband/core/cm.c: In function `cm_lap_handler': > drivers/infiniband/core/cm.c:2359: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type > drivers/infiniband/core/cm.c:2359: error: too few arguments to function `ib_post_send_mad' > drivers/infiniband/core/cm.c: In function `ib_send_cm_apr': > drivers/infiniband/core/cm.c:2437: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type > drivers/infiniband/core/cm.c:2437: error: too few arguments to function `ib_post_send_mad' > drivers/infiniband/core/cm.c: In function `cm_apr_handler': > drivers/infiniband/core/cm.c:2476: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast > drivers/infiniband/core/cm.c: In function `ib_send_cm_sidr_req': > drivers/infiniband/core/cm.c:2573: error: structure has no member named `timeout_ms' > drivers/infiniband/core/cm.c:2578: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type > drivers/infiniband/core/cm.c:2578: error: too few arguments to function `ib_post_send_mad' > drivers/infiniband/core/cm.c: In function `cm_sidr_req_handler': > drivers/infiniband/core/cm.c:2642: error: structure has no member named `device' > drivers/infiniband/core/cm.c: In function `ib_send_cm_sidr_rep': > drivers/infiniband/core/cm.c:2713: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type > drivers/infiniband/core/cm.c:2713: error: too few arguments to function `ib_post_send_mad' > drivers/infiniband/core/cm.c: In function `cm_sidr_rep_handler': > drivers/infiniband/core/cm.c:2766: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a 
cast > drivers/infiniband/core/cm.c: In function `cm_send_handler': > drivers/infiniband/core/cm.c:2834: error: structure has no member named `send_buf' > make[3]: *** [drivers/infiniband/core/cm.o] Error 1 > make[2]: *** [drivers/infiniband/core] Error 2 > make[1]: *** [drivers/infiniband] Error 2 > make: *** [drivers] Error 2 > > > Am I missing something ? > > DP Move include/rdma, it's in the way. -- MST From mst at mellanox.co.il Sun Nov 6 05:27:02 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 6 Nov 2005 15:27:02 +0200 Subject: [openib-general] [PATCH] user_mad: set kobject name properly Message-ID: <20051106132702.GZ31134@mellanox.co.il> The kobject name for the issm device wasn't being set properly. Signed-off-by: Michael S. Tsirkin Index: linux-kernel/infiniband/core/user_mad.c =================================================================== --- linux-kernel/infiniband/core/user_mad.c (revision 3956) +++ linux-kernel/infiniband/core/user_mad.c (working copy) @@ -801,7 +801,7 @@ static int ib_umad_init_port(struct ib_d goto err_class; port->sm_dev->owner = THIS_MODULE; port->sm_dev->ops = &umad_sm_fops; - kobject_set_name(&port->dev->kobj, "issm%d", port->dev_num); + kobject_set_name(&port->sm_dev->kobj, "issm%d", port->dev_num); if (cdev_add(port->sm_dev, base_dev + port->dev_num + IB_UMAD_MAX_PORTS, 1)) goto err_sm_cdev; -- MST From mst at mellanox.co.il Sun Nov 6 05:31:47 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 6 Nov 2005 15:31:47 +0200 Subject: [openib-general] [PATCH] user_mad: fix error handling Message-ID: <20051106133147.GA31134@mellanox.co.il> Fix off-by-one bug in error handling in ib_umad_add_one: we are storing the port pointer in port[i - s], so we should use that for ib_umad_kill_port. Signed-off-by: Michael S.
Tsirkin Index: linux-kernel/drivers/infiniband/core/user_mad.c =================================================================== --- linux-kernel/drivers/infiniband/core/user_mad.c (revision 3956) +++ linux-kernel/drivers/infiniband/core/user_mad.c (working copy) @@ -913,7 +913,7 @@ static void ib_umad_add_one(struct ib_de err: while (--i >= s) - ib_umad_kill_port(&umad_dev->port[i]); + ib_umad_kill_port(&umad_dev->port[i - s]); kref_put(&umad_dev->ref, ib_umad_release_dev); } -- MST From caitlinb at broadcom.com Sun Nov 6 08:40:37 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Sun, 6 Nov 2005 08:40:37 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB Message-ID: <54AD0F12E08D1541B826BE97C98F99F1025E0B@NT-SJCA-0751.brcm.ad.broadcom.com> -----Original Message----- From: openib-general-bounces at openib.org on behalf of Roland Dreier Sent: Fri 11/4/2005 6:49 PM To: Rick Frank Cc: openib-general at openib.org Subject: Re: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB Rick> Do you mean useTCP and the RC transport in the ethernet Rick> verbs provider ? No, I mean just write RDS for ethernet on top of sockets. I don't think it's worth implementing a whole RDMA provider on top of ethernet just so you can use the same RDS code. The SilverStorm RDS code is only about 10K lines of code, and I think a sane implementation would probably be less than 5K, so you're not getting much benefit from from all the effort of writing an RDMA provider. In fact I'm not sure that it doesn't make sense to implement RDS as a library + daemon completely in userspace. - R. ------------ [Caitlin] Correct, the idea of providing Reliable Datagram service over reliable point-to-point tunnels enables userspace solutions as long as they have access to high-throughput reliable connection service. Whether a TCP service that provides no stateful acceleration qualifies is a topic that we do not need to take up here. 
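
The error-handling fix to ib_umad_add_one posted above rests on one invariant: ports are numbered s..e but stored at port[i - s], so the unwind loop has to apply the same offset. Below is a stand-alone sketch of that pattern, not the kernel code. The types and helpers (umad_like_dev, init_port, kill_port, add_ports, count_live) are made up for illustration, and the fail_at parameter exists only to inject a failure so the unwind path can be exercised.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PORTS 4

/* Illustrative stand-ins; these are not the kernel's structures. */
struct port {
    int live;
};

struct umad_like_dev {
    int start_port;
    int end_port;
    struct port port[MAX_PORTS];    /* slot 0 holds port number start_port */
};

/* Pretend initialisation; fails when port_num == fail_at. */
static int init_port(struct port *p, int port_num, int fail_at)
{
    if (port_num == fail_at)
        return -1;
    p->live = 1;
    return 0;
}

static void kill_port(struct port *p)
{
    p->live = 0;
}

/* Bring up ports s..e; on failure undo the ones already brought up. */
int add_ports(struct umad_like_dev *dev, int s, int e, int fail_at)
{
    int i;

    if (e < s || e - s + 1 > MAX_PORTS)
        return -1;

    dev->start_port = s;
    dev->end_port = e;

    for (i = s; i <= e; ++i)
        if (init_port(&dev->port[i - s], i, fail_at))
            goto err;
    return 0;

err:
    /*
     * i is the port number that failed. Walk back over the ports that
     * succeeded, using the same i - s offset as the setup loop. Indexing
     * with plain i here would tear down the wrong slots whenever s != 0.
     */
    while (--i >= s)
        kill_port(&dev->port[i - s]);
    return -1;
}

/* Number of slots still marked live. */
int count_live(const struct umad_like_dev *dev)
{
    int n = 0;
    size_t k;

    for (k = 0; k < MAX_PORTS; ++k)
        n += dev->port[k].live;
    return n;
}
```

With s = 1, e = 3 and a failure injected at port 3, slots 0 and 1 are set up and then both torn down again, so count_live() reports no survivors. With the unshifted index, slot 0 would be left live and an untouched slot would be "killed" instead.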
[/Caitlin] From pradeep at us.ibm.com Sun Nov 6 15:03:30 2005 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Sun, 6 Nov 2005 15:03:30 -0800 Subject: [openib-general] Data structure size mismatch In-Reply-To: <52d5lg2bwc.fsf@cisco.com> Message-ID: Roland Dreier wrote on 11/04/2005 04:09:39 PM: > >>>>> "Pradeep" == Pradeep Satyanarayana writes: > > Pradeep> Even if we change struct ib_uat_route_by_ip_req, there > Pradeep> still is user_mad.c that needs to be looked into. > > Could you be specific? As far as I can tell, all of the structures > copied to and from userspace in user_mad.c are laid out identically > for 32-bit and 64-bit architectures. I looked at this from the copy_from_user() side only. I do not know which user app uses this. Here is an example of the code from ib_umad_write() that illustrates this : packet = kmalloc(sizeof *packet + IB_MGMT_RMPP_HDR, GFP_KERNEL); if (!packet) return -ENOMEM; if (copy_from_user(&packet->mad, buf, sizeof (struct ib_user_mad) + IB_MGMT_RMPP_HDR)) { ret = -EFAULT; goto err; } struct ib_umad_packet { struct ib_mad_send_buf *msg; struct list_head list; int length; struct ib_user_mad mad; }; Now, sizeof *packet will be different between 32-bit and 64-bit because of the pointers. Because of this, the offset of packet->mad will be incorrect and one might find unexpected data. Would you agree? Once I saw this, I did not look further. There may be other cases of this mismatch, and I have not had a chance to take a close look at all the code. Pradeep pradeep at us.ibm.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From yael at mellanox.co.il Mon Nov 7 04:47:54 2005 From: yael at mellanox.co.il (Yael Kalka) Date: 07 Nov 2005 14:47:54 +0200 Subject: [openib-general] [PATCH] Opensm - modifying uninitialized memory Message-ID: <5z8xw0y685.fsf@mtl066.yok.mtl.com> Hi Hal, While running opensm with valgrind we found out that there is a problem with osm_req_set function. 
It clears the madw.data by size of IB_SMP_DATA_SIZE, but the function doesn't require a payload of this size. In osm_ucast_mgr there was a call to the function with a smaller payload. To fix it, I've added a payload_size parameter to osm_req_set. That seems more correct than fixing only the specific call in osm_ucast_mgr. The attached patch fixes it. Thanks, Yael Signed-off-by: Yael Kalka Index: include/opensm/osm_req.h =================================================================== --- include/opensm/osm_req.h (revision 3975) +++ include/opensm/osm_req.h (working copy) @@ -308,6 +308,7 @@ osm_req_set( IN const osm_req_t* const p_req, IN const osm_dr_path_t* const p_path, IN const uint8_t* const p_payload, + IN const size_t payload_size, IN const uint16_t attr_id, IN const uint32_t attr_mod, IN const cl_disp_msgid_t err_msg, @@ -323,6 +324,9 @@ osm_req_set( * p_payload * [in] Pointer to the SMP payload to send. * +* payload_size +* [in] The size of the payload to be copied to the SMP data field. +* * attr_id * [in] Attribute ID to request.
* Index: opensm/osm_state_mgr.c =================================================================== --- opensm/osm_state_mgr.c (revision 3975) +++ opensm/osm_state_mgr.c (working copy) @@ -1667,6 +1667,7 @@ __osm_state_mgr_send_handover( status = osm_req_set( p_mgr->p_req, osm_physp_get_dr_path_ptr ( osm_port_get_default_phys_ptr( p_port ) ), payload, + sizeof(payload), IB_MAD_ATTR_SM_INFO, IB_SMINFO_ATTR_MOD_HANDOVER, CL_DISP_MSGID_NONE, &context ); Index: opensm/osm_req.c =================================================================== --- opensm/osm_req.c (revision 3975) +++ opensm/osm_req.c (working copy) @@ -210,6 +210,7 @@ osm_req_set( IN const osm_req_t* const p_req, IN const osm_dr_path_t* const p_path, IN const uint8_t* const p_payload, + IN const size_t payload_size, IN const uint16_t attr_id, IN const uint32_t attr_mod, IN const cl_disp_msgid_t err_msg, @@ -286,7 +287,7 @@ osm_req_set( p_madw->context = *p_context; cl_memcpy( osm_madw_get_smp_ptr( p_madw )->data, - p_payload, IB_SMP_DATA_SIZE ); + p_payload, payload_size ); osm_vl15_post( p_req->p_vl15, p_madw ); Index: opensm/osm_mcast_mgr.c =================================================================== --- opensm/osm_mcast_mgr.c (revision 3975) +++ opensm/osm_mcast_mgr.c (working copy) @@ -488,6 +488,7 @@ __osm_mcast_mgr_set_tbl( status = osm_req_set( p_mgr->p_req, p_path, (void*)block, + sizeof(block), IB_MAD_ATTR_MCAST_FWD_TBL, cl_hton32( block_id_ho ), CL_DISP_MSGID_NONE, Index: opensm/osm_ucast_mgr.c =================================================================== --- opensm/osm_ucast_mgr.c (revision 3975) +++ opensm/osm_ucast_mgr.c (working copy) @@ -830,6 +830,7 @@ __osm_ucast_mgr_set_table( status = osm_req_set( p_mgr->p_req, p_path, (uint8_t*)&si, + sizeof(si), IB_MAD_ATTR_SWITCH_INFO, 0, CL_DISP_MSGID_NONE, @@ -864,6 +865,7 @@ __osm_ucast_mgr_set_table( status = osm_req_set( p_mgr->p_req, p_path, block, + sizeof(block), IB_MAD_ATTR_LIN_FWD_TBL, cl_hton32( block_id_ho ), 
CL_DISP_MSGID_NONE, Index: opensm/osm_link_mgr.c =================================================================== --- opensm/osm_link_mgr.c (revision 3975) +++ opensm/osm_link_mgr.c (working copy) @@ -355,6 +355,7 @@ osm_link_mgr_set_physp_pi( status = osm_req_set( p_mgr->p_req, osm_physp_get_dr_path_ptr( p_physp ), payload, + sizeof(payload), IB_MAD_ATTR_PORT_INFO, cl_hton32(port_num), CL_DISP_MSGID_NONE, Index: opensm/osm_sw_info_rcv.c =================================================================== --- opensm/osm_sw_info_rcv.c (revision 3975) +++ opensm/osm_sw_info_rcv.c (working copy) @@ -84,6 +84,7 @@ __osm_si_rcv_clear_sc_bit( status = osm_req_set( p_rcv->p_req, osm_node_get_any_dr_path_ptr( p_node ), payload, + sizeof(payload), IB_MAD_ATTR_SWITCH_INFO, 0, CL_DISP_MSGID_NONE, Index: opensm/osm_lid_mgr.c =================================================================== --- opensm/osm_lid_mgr.c (revision 3975) +++ opensm/osm_lid_mgr.c (working copy) @@ -1154,6 +1154,7 @@ __osm_lid_mgr_set_physp_pi( status = osm_req_set( p_mgr->p_req, osm_physp_get_dr_path_ptr( p_physp ), payload, + sizeof(payload), IB_MAD_ATTR_PORT_INFO, cl_hton32(osm_physp_get_port_num( p_physp )), CL_DISP_MSGID_NONE, From halr at voltaire.com Mon Nov 7 05:18:32 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Nov 2005 08:18:32 -0500 Subject: [openib-general] Re: [PATCH] Opensm - modifying uninitialized memory In-Reply-To: <5z8xw0y685.fsf@mtl066.yok.mtl.com> References: <5z8xw0y685.fsf@mtl066.yok.mtl.com> Message-ID: <1131369511.4382.6375.camel@hal.voltaire.com> On Mon, 2005-11-07 at 07:47, Yael Kalka wrote: > Hi Hal, > > While running opensm with valgrind we found out that there is a > problem with osm_req_set function. It clears the madw.data by size of > IB_SMP_DATA_SIZE, but the function doesn't require a payload of this > size. In osm_ucast_mgr there was a call to the function with a payload > of smaller size. 
> For fixing it - I've added a payload_size to the osm_req_set.
> It seems more correct then to just fix the specific call in the
> osm_ucast_mgr.
> The attached patch fixes it.

Thanks. Applied.

From yael at mellanox.co.il Mon Nov 7 05:25:07 2005
From: yael at mellanox.co.il (Yael Kalka)
Date: 07 Nov 2005 15:25:07 +0200
Subject: [openib-general] [PATCH] Opensm - exiting issues
Message-ID: <5z7jbky4i4.fsf@mtl066.yok.mtl.com>

Hi Hal,

There was a problem when running opensm with the -o option that caused
opensm to always exit with a segfault, due to object destruction
ordering. There is also the known issue of exiting opensm. We've done
some cleanup of the exit code, and the following patch fixes most of it.

With the current code we saw that opensm sometimes gets "stuck" on exit
and causes the machine to get stuck too, so that a reboot is needed.
The following patch fixes most of these cases. With the patch applied we
did run into rare cases where opensm exits with an error, but at least
it exits without hanging the machine...

Thanks,
Yael

Signed-off-by: Yael Kalka

Index: libvendor/osm_vendor_ibumad.c
===================================================================
--- libvendor/osm_vendor_ibumad.c (revision 3975)
+++ libvendor/osm_vendor_ibumad.c (working copy)
@@ -542,9 +542,14 @@ osm_vendor_delete(
   int agent_id;
 
   /* unregister UMAD agents */
+  /* This sometimes causes errors on exit, that cause kernel errors
+     and result in need to reboot machine. Currently - do not call
+     the umad_unregister. */
+  /*
   for (agent_id = 0; agent_id < UMAD_CA_MAX_AGENTS; agent_id++)
     if ( (*pp_vend)->agents[agent_id] )
       umad_unregister( (*pp_vend)->umad_port_id, agent_id );
+  */
   clear_madw( *pp_vend );
   /* make sure all ports are closed */
   umad_done();
@@ -839,7 +844,7 @@ osm_vendor_bind(
              "osm_vendor_bind: ERR 5426: "
              "Unable to register class %u version %u.\n",
              p_user_bind->mad_class, p_user_bind->class_version);
-    free(p_bind);
+    cl_free(p_bind);
     p_bind = 0;
     goto Exit;
   }
@@ -851,7 +856,7 @@ osm_vendor_bind(
              "bad agent id %u or duplicate agent for class %u vers %u\n",
              p_bind->agent_id, p_user_bind->mad_class,
              p_user_bind->class_version);
-    free(p_bind);
+    cl_free(p_bind);
     p_bind = 0;
     goto Exit;
   }
@@ -868,7 +873,7 @@ osm_vendor_bind(
              "osm_vendor_bind: ERR 5428: "
              "Unable to register class 1 version %u.\n",
              p_user_bind->class_version);
-    free(p_bind);
+    cl_free(p_bind);
     p_bind = 0;
     goto Exit;
   }
@@ -879,7 +884,7 @@ osm_vendor_bind(
              "osm_vendor_bind: ERR 5429: "
              "bad agent id %u or duplicate agent for class 1 vers %u\n",
              p_bind->agent_id1, p_user_bind->class_version);
-    free(p_bind);
+    cl_free(p_bind);
     p_bind = 0;
     goto Exit;
   }
@@ -892,6 +897,19 @@ Exit:
 
   return( (osm_bind_handle_t)p_bind );
 }
+
+
+/**********************************************************************
+ **********************************************************************/
+void
+__osm_vendor_dummy_callback(
+  IN osm_madw_t *p_madw,
+  IN void *bind_context,
+  IN osm_madw_t *p_req_madw )
+{
+  printf("Ignoring received/sent mads after unbind\n");
+}
+
 /**********************************************************************
  **********************************************************************/
 void
@@ -903,6 +921,11 @@ osm_vendor_unbind(
 
   OSM_LOG_ENTER( p_vend->p_log, osm_vendor_unbind );
 
+  cl_spinlock_acquire( &p_vend->cb_lock );
+  p_bind->mad_recv_callback = __osm_vendor_dummy_callback;
+  p_bind->send_err_callback = __osm_vendor_dummy_callback;
+  cl_spinlock_release( &p_vend->cb_lock );
+
   OSM_LOG_EXIT( p_vend->p_log);
 }
 
Index: opensm/osm_opensm.c
===================================================================
--- opensm/osm_opensm.c (revision 3975)
+++ opensm/osm_opensm.c (working copy)
@@ -108,14 +108,11 @@ osm_opensm_destroy(
    */
   osm_sm_shutdown( &p_osm->sm );
 
-  /* shut down the dispatcher - so no new messages cross */
-  cl_disp_shutdown( &p_osm->disp );
-
   /* cleanup all messages on VL15 fifo that were not sent yet */
   osm_vl15_shutdown( &p_osm->vl15, &p_osm->mad_pool );
 
-  /* lock the whole thing so we do not get any requests etc */
-  cl_plock_excl_acquire( &p_osm->lock );
+  /* shut down the dispatcher - so no new messages cross */
+  cl_disp_shutdown( &p_osm->disp );
 
   /* do the destruction in reverse order as init */
   updn_destroy( p_osm->p_updn_ucast_routing );
@@ -128,7 +125,6 @@ osm_opensm_destroy(
   osm_subn_destroy( &p_osm->subn );
   cl_disp_destroy( &p_osm->disp );
 
-  cl_plock_release( &p_osm->lock );
   cl_plock_destroy( &p_osm->lock );
 
   cl_mem_display( );
Index: opensm/osm_vl15intf.c
===================================================================
--- opensm/osm_vl15intf.c (revision 3975)
+++ opensm/osm_vl15intf.c (working copy)
@@ -334,8 +334,6 @@ osm_vl15_destroy(
   p_vl->state = OSM_VL15_STATE_INIT;
   cl_spinlock_destroy( &p_vl->lock );
 
-  cl_disp_unregister( p_vl->h_disp );
-
   OSM_LOG_EXIT( p_vl->p_log );
 }
 
@@ -500,6 +498,8 @@ osm_vl15_shutdown(
   /* grap a lock on the object */
   cl_spinlock_acquire( &p_vl->lock );
 
+  cl_disp_unregister( p_vl->h_disp );
+
   /* go over all outstanding MADs and retire their transactions */
 
   /* first we handle the list of response MADs */

From halr at voltaire.com Mon Nov 7 06:20:56 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: Mon, 7 Nov 2005 16:20:56 +0200
Subject: [openib-general] Re: [PATCH] Opensm - exiting issues
Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AA0F@taurus.voltaire.com>

Hi Yael,

On Mon, 2005-11-07 at 08:25, Yael Kalka wrote:
> Hi Hal,
>
> There was a problem when running opensm with -o option, that caused
> the opensm to always exit with segfault, due to object destruction
> ordering. Also - there is the known issue of exiting opensm. We've
> done some clearing to the exiting code. The following patch fixes most
> of it.

I applied this part of the patch with some cosmetic changes in
osm_vendor_ibumad.c.

> In the current code we saw that sometimes opensm gets "stuck" on exit,
> and causes the machine to get stuck too - resulting in need for
> rebooting. In the following patch fixes most of it.
> We did run (in the patch) into rare cases where opensm exits with an
> error, but at least it exits without stucking the machine...

Is there a reliable way to recreate the machine "stuck" state? What
exactly do you mean by this?

All umad_unregister does is some validation, a table lookup, and issuing
the ioctl to unregister the MAD agent. Not explicitly unregistering the
agent(s) does not cause any harm because, when the fd is closed, this
will occur as part of the cleanup.

-- Hal

From rolandd at cisco.com Mon Nov 7 06:33:28 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 07 Nov 2005 06:33:28 -0800
Subject: [openib-general] Re: question : skb->dev versus dev_kfree_skb_any
References: <20051106104318.GO31134@mellanox.co.il>
Message-ID: <52vez4wmrr.fsf@cisco.com>

    Michael> Hello, Roland! Why is ipoib always setting skb->dev
    Michael> before calling dev_kfree_skb_any?
    Michael> E.g. ipoib_multicast.c, line 126, has

    Michael> skb-> dev = dev;
    Michael> dev_kfree_skb_any(skb);

    Michael> I went through dev_kfree_skb_any and couldnt see where
    Michael> does it use the dev member. What am I missing?

I suspect all these loops throwing away packets in ipoib_multicast.c
were cut-and-pasted from the loop that does dev_queue_xmit(), which
does require the device to be set correctly. So there's no good
reason to set skb->dev before freeing skb.

 - R.

From eitan at mellanox.co.il Mon Nov 7 06:42:29 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 7 Nov 2005 16:42:29 +0200
Subject: [openib-general] RE: [PATCH] Opensm - exiting issues
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E36188CD@mtlexch01.mtl.com>

Hi Hal,

I will answer for Yael as she already left the office.

The way to reproduce the "stuck" case is to run in bash:
% while test $? = 0; do opensm -V -o; done

The symptom we see is that OpenSM sort of exits but the process stays
active (not even defunct). There is no way to kill it. It seems like one
of the threads gets caught in the middle of an ioctl or something. To be
able to run OpenSM after this we need to reboot the machine.

We avoid it by not issuing umad_unregister and umad_close_port.

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Monday, November 07, 2005 4:21 PM
> To: yael at mellanox.co.il
> Cc: openib-general at openib.org; eitan at mellanox.co.il
> Subject: Re: [PATCH] Opensm - exiting issues
>
> Hi Yael,
>
> On Mon, 2005-11-07 at 08:25, Yael Kalka wrote:
> > Hi Hal,
> >
> > There was a problem when running opensm with -o option, that caused
> > the opensm to always exit with segfault, due to object destruction
> > ordering. Also - there is the known issue of exiting opensm. We've
> > done some clearing to the exiting code. The following patch fixes most
> > of it.
>
> I applied this part of the patch with some cosmetic changes in
> osm_vendor_ibumad.c.
>
> > In the current code we saw that sometimes opensm gets "stuck" on exit,
> > and causes the machine to get stuck too - resulting in need for
> > rebooting. In the following patch fixes most of it.
> > We did run (in the patch) into rare cases where opensm exits with an
> > error, but at least it exits without stucking the machine...
>
> Is there a reliable way to recreate machine "stuck" ? What exactly do
> you mean by this ?
>
> All umad_unregister does is some validation, a table lookup, and issue
> the ioctl to unregister the MAD agent. Not explictly unregistering the
> agent(s) does not cause any harm as when the fd is closed, this will
> occur as part of the cleanup.
>
> -- Hal

From jlentini at netapp.com Mon Nov 7 06:51:48 2005
From: jlentini at netapp.com (James Lentini)
Date: Mon, 7 Nov 2005 09:51:48 -0500 (EST)
Subject: [openib-general] [ANNOUNCE] ContributeRDS(ReliableDatagramSockets) to OpenIB
In-Reply-To: <96f8e60e0511041259v655a217anba925ae53f5c3dee@mail.gmail.com>
References: <54AD0F12E08D1541B826BE97C98F99F1020C40@NT-SJCA-0751.brcm.ad.broadcom.com>
	<96f8e60e0511041259v655a217anba925ae53f5c3dee@mail.gmail.com>
Message-ID:

On Fri, 4 Nov 2005, Ranjit Pandit wrote:

> For Mpi running on a 10,000 node cluster with 2 or 4 way nodes, here
> are the QP/ CM connection requirements: (assuming intra node
> communication doesn't use IB)
>
> Procs per node    uDapl/Sdp    Rds
> 2                 19996        9999
> 4                 39984        9999
>
> Clearly, there is tradeoff in performance as we go from uDapl/Sdp to
> Rds. The choice will have to depend on the requirements of performance
> Vs Scalability.
> Btw, for this large a cluster, there is a huge overhead in just
> setting up the connections. Rds connections are setup only once.

This isn't an apples to apples comparison. uDAPL is an API and RDS is
a protocol. You could implement RDS using a DAPL API, OpenIB verbs,
etc.
james From halr at voltaire.com Mon Nov 7 06:55:07 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Nov 2005 09:55:07 -0500 Subject: [openib-general] RE: [PATCH] Opensm - exiting issues In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E36188CD@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E36188CD@mtlexch01.mtl.com> Message-ID: <1131375306.4382.6952.camel@hal.voltaire.com> On Mon, 2005-11-07 at 09:42, Eitan Zahavi wrote: > Hi Hal, > > I will answer for Yael as she already left the office. > > The way to reproduce the "stuck" case is to run in bash: > % while test $? = 0; do opensm -V -o; done > > The symptom we see is that OpenSM sort of exists but the process stay > active (not even defunct). No way to kill it. It seems like one of the > threads gets caught in the middle of ioctl or something. To be able to > run OpenSM after this we need to reboot the machine. > > We avoid it by not issuing umad_unregister and umad_close_port I saw the change to not call umad_unregister in the patch. Where is the change for umad_close_port ? -- Hal > Eitan Zahavi > Design Technology Director > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Monday, November 07, 2005 4:21 PM > > To: yael at mellanox.co.il > > Cc: openib-general at openib.org; eitan at mellanox.co.il > > Subject: Re: [PATCH] Opensm - exiting issues > > > > Hi Yael, > > > > On Mon, 2005-11-07 at 08:25, Yael Kalka wrote: > > > Hi Hal, > > > > > > There was a problem when running opensm with -o option, that caused > > > the opensm to always exit with segfault, due to object destruction > > > ordering. Also - there is the known issue of exiting opensm. We've > > > done some clearing to the exiting code. The following patch fixes > most > > > of it. > > > > I applied this part of the patch with some cosmetic changes in > > osm_vendor_ibumad.c. 
> > > > > In the current code we saw that sometimes opensm gets "stuck" on > exit, > > > and causes the machine to get stuck too - resulting in need for > > > rebooting. In the following patch fixes most of it. > > > We did run (in the patch) into rare cases where opensm exits with an > > > error, but at least it exits without stucking the machine... > > > > Is there a reliable way to recreate machine "stuck" ? What exactly do > > you mean by this ? > > > > All umad_unregister does is some validation, a table lookup, and issue > > the ioctl to unregister the MAD agent. Not explictly unregistering the > > agent(s) does not cause any harm as when the fd is closed, this will > > occur as part of the cleanup. > > > > -- Hal > From mst at mellanox.co.il Mon Nov 7 07:02:38 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 7 Nov 2005 17:02:38 +0200 Subject: [openib-general] Re: question : skb->dev versus dev_kfree_skb_any In-Reply-To: <52vez4wmrr.fsf@cisco.com> References: <52vez4wmrr.fsf@cisco.com> Message-ID: <20051107150238.GQ31134@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: question : skb->dev versus dev_kfree_skb_any > > Michael> Hello, Roland! Why is ipoib always setting skb->dev > Michael> before calling dev_kfree_skb_any? > Michael> E.g. ipoib_multicast.c, line 126, has > > Michael> skb-> dev = dev; > Michael> dev_kfree_skb_any(skb); > > Michael> I went through dev_kfree_skb_any and couldnt see where > Michael> does it use the dev member. What am I missing? > > I suspect all these loops throwing away packets in ipoib_multicast.c > were cut-and-pasted from the loop that does dev_queue_xmit(), which > does require the device to be set correctly. So there's no good > reason to set skb->dev before freeing skb. > > - R. > Okay, do you want a patch to kill them? 
-- MST

From rolandd at cisco.com Mon Nov 7 07:28:33 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 07 Nov 2005 07:28:33 -0800
Subject: [openib-general] Data structure size mismatch
References:
Message-ID: <52oe4wwk7y.fsf@cisco.com>

    Pradeep> Now, sizeof *packet will be different between 32-bit and
    Pradeep> 64-bit because of the pointers. Because of this, the
    Pradeep> offset of packet->mad will be incorrect and one might
    Pradeep> find unexpected data. Would you agree?

I don't understand your point. packet is a kernel data structure, and
it doesn't matter that the layout changes if I compile the kernel for
a different architecture. What is being copied from userspace is a
struct ib_user_mad, whose layout does not depend on the word size.

Are you perhaps getting confused about the order of the parameters to
copy_from_user()? They are ordered the same as memcpy(), that is, the
destination is first, followed by the source. So in

	if (copy_from_user(&packet->mad, buf,
			   sizeof (struct ib_user_mad) + IB_MGMT_RMPP_HDR)) {

the source of the data is the userspace pointer buf, and the exact
location of the destination packet->mad does not matter to userspace
at all -- it is purely kernel internal.

 - R.

From mst at mellanox.co.il Mon Nov 7 07:49:19 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 7 Nov 2005 17:49:19 +0200
Subject: [openib-general] ipoib oops
Message-ID: <20051107154919.GU31134@mellanox.co.il>

Hi!
I saw this in /var/log/messages recently.
Unfortunately I can't say exactly what I did to trigger this problem.
Roland, it's the same thing we were seeing a couple of months ago that
went unresolved, isn't it?
Unable to handle kernel NULL pointer dereference at 0000000000000488 RIP: {:ib_ipoib:ipoib_mcast_join_finish+100} PGD 1775cc067 PUD 177a21067 PMD 0 Oops: 0000 [1] SMP CPU 0 Modules linked in: ib_sdp ib_cm ib_ipoib ib_sa ib_umad ib_mthca ib_mad ib_core Pid: 12176, comm: ib_mad1 Not tainted 2.6.14 #4 RIP: 0010:[] {:ib_ipoib:ipoib_mcast_join_finish+100} RSP: 0000:ffff810178d59c58 EFLAGS: 00010282 RAX: 0000000052010000 RBX: 0000000000000000 RCX: 0000000000000010 RDX: ffff8101536b77c0 RSI: ffff8101536b77c0 RDI: ffff8101536b77c0 RBP: ffff8101536b77c0 R08: 0000000000000000 R09: ffff810178d59d38 R10: ffff810178d59df8 R11: 0000000000000001 R12: 0000000000000000 R13: 0000000000000480 R14: 0000000000000000 R15: ffff810152a32298 FS: 0000000000000000(0000) GS:ffffffff805ff800(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000000000000488 CR3: 0000000154c92000 CR4: 00000000000006e0 Process ib_mad1 (pid: 12176, threadinfo ffff810178d58000, task ffff81017c587580) Stack: ffff810178d59c90 0000000000000282 ffff810178ae6840 0000000000000286 ffff81017ca7b400 0000000000000286 ffff81017e825e10 ffff81017560a3c0 ffff81017e825e10 ffff810176c01000 Call Trace:{:ib_ipoib:ipoib_mcast_join_complete+56} {:ib_core:ib_unpack+200} {:ib_sa:ib_sa_mcmember_rec_callback+76} {:ib_sa:recv_handler+66} {:ib_mad:ib_mad_completion_handler+957} {:ib_mad:ib_mad_completion_handler+0} {worker_thread+476} {default_wake_function+0} {__wake_up_common+67} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+217} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} -- MST From mst at mellanox.co.il Mon Nov 7 07:57:56 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 7 Nov 2005 17:57:56 +0200 Subject: [openib-general] Re: ipoib oops In-Reply-To: <20051107154919.GU31134@mellanox.co.il> References: <20051107154919.GU31134@mellanox.co.il> Message-ID: <20051107155756.GV31134@mellanox.co.il> > Quoting Michael S. Tsirkin : > Subject: ipoib oops > > Hi! > I saw this in /var/log/messages recently. > Unfortunately I cant say exactly what I did to trigger this problem. Oops, I left out part of the log. Here it is in full. Actually, I had opensm running on the same node, and it appears stuck in defunc state currently - I wander whether we have the umad module crashing and corrupting the ipoib data structures, or the reverse. ------------------------------ Unable to handle kernel NULL pointer dereference at 0000000000000488 RIP: {:ib_ipoib:ipoib_mcast_join_finish+100} PGD 1775cc067 PUD 177a21067 PMD 0 Oops: 0000 [1] SMP CPU 0 Modules linked in: ib_sdp ib_cm ib_ipoib ib_sa ib_umad ib_mthca ib_mad ib_core Pid: 12176, comm: ib_mad1 Not tainted 2.6.14 #4 RIP: 0010:[] {:ib_ipoib:ipoib_mcast_join_finish+100} RSP: 0000:ffff810178d59c58 EFLAGS: 00010282 RAX: 0000000052010000 RBX: 0000000000000000 RCX: 0000000000000010 RDX: ffff8101536b77c0 RSI: ffff8101536b77c0 RDI: ffff8101536b77c0 RBP: ffff8101536b77c0 R08: 0000000000000000 R09: ffff810178d59d38 R10: ffff810178d59df8 R11: 0000000000000001 R12: 0000000000000000 R13: 0000000000000480 R14: 0000000000000000 R15: ffff810152a32298 FS: 0000000000000000(0000) GS:ffffffff805ff800(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000000000000488 CR3: 0000000154c92000 CR4: 00000000000006e0 Process ib_mad1 (pid: 12176, threadinfo ffff810178d58000, task ffff81017c587580) Stack: ffff810178d59c90 0000000000000282 ffff810178ae6840 0000000000000286 ffff81017ca7b400 0000000000000286 ffff81017e825e10 ffff81017560a3c0 ffff81017e825e10 ffff810176c01000 Call Trace:{:ib_ipoib:ipoib_mcast_join_complete+56} {:ib_core:ib_unpack+200} 
{:ib_sa:ib_sa_mcmember_rec_callback+76} {:ib_sa:recv_handler+66} {:ib_mad:ib_mad_completion_handler+957} {:ib_mad:ib_mad_completion_handler+0} {worker_thread+476} {default_wake_function+0} {__wake_up_common+67} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+217} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} Code: 49 8b 7d 08 48 81 c7 cc 01 00 00 f3 a6 75 17 49 8b 45 70 8b RIP {:ib_ipoib:ipoib_mcast_join_finish+100} RSP CR2: 0000000000000488 <1>Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: {:ib_umad:send_handler+41} PGD 0 Oops: 0000 [2] SMP CPU 1 Modules linked in: ib_sdp ib_cm ib_ipoib ib_sa ib_umad ib_mthca ib_mad ib_core Pid: 12203, comm: opensm Not tainted 2.6.14 #4 RIP: 0010:[] {:ib_umad:send_handler+41} RSP: 0018:ffff81017bae7bb8 EFLAGS: 00010296 RAX: ffff810178970d98 RBX: ffff81017bae7c78 RCX: ffff810178970d68 RDX: ffff81017bae7c68 RSI: ffff81017bae7c78 RDI: ffff81017e825c10 RBP: 0000000000000000 R08: ffff81017bae6000 R09: 0000000000000100 R10: 0000000000000000 R11: 0000000000000000 R12: ffff81017e825c10 R13: ffff810176c01000 R14: ffff81017bae7c68 R15: ffff81017bae7ef8 FS: 0000000000000000(0000) GS:ffffffff805ff880(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0 Process opensm (pid: 12203, threadinfo ffff81017bae6000, task ffff81017bbad0c0) Stack: ffff810178e780e0 ffff81017bae7c50 ffff81017e825c10 ffff81017e825c00 ffff81017bae7c78 ffffffff880109c6 0100000000000000 0000000000000435 0000000000000000 ffff810152d71000 Call Trace:{:ib_mad:ib_unregister_mad_agent+406} {:ib_mthca:mthca_cmd_box+72} {:ib_umad:ib_umad_close+70} {__fput+178} {filp_close+110} {put_files_struct+115} {do_exit+507} {__dequeue_signal+501} {do_group_exit+236} {get_signal_to_deliver+1431} {do_signal+159} {kill_proc_info+85} {sys_kill+348} {sysret_signal+28} {ptregscall_common+103} Code: 48 8b 
45 00 48 8b 78 18 e8 9a 01 fd ff 48 8b 7d 00 e8 51 ce RIP {:ib_umad:send_handler+41} RSP CR2: 0000000000000000 <1>Fixing recursive fault but reboot is needed! -- MST From mst at mellanox.co.il Mon Nov 7 08:14:52 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 7 Nov 2005 18:14:52 +0200 Subject: [openib-general] user_mad.c: deadlock? Message-ID: <20051107161452.GX31134@mellanox.co.il> Hi, Roland! This is not directly related to the last oops report that I sent. I noticed that unregister_mad_agent in mad.c flushes the port wq. This has the potential to block until a work is finished. However, one of the things done on the work queue is calling handlers for existing agents. Looking at user_mad.c, ib_umad_close calls ib_unregister_mad_agent with port mutex taken, while send_handler calls queue_packet which in turn takes the port mutex. It seems, therefore, that we can have a deadlock inside user_mad, where ib_umad_close calls ib_unregister_mad_agent which blocks until send_handler runs which is blocked by the port mutex. A possible solution would be to move ib_unregister_mad_agent outside the code section protected by the mutex. Does this make sense? -- MST From eitan at mellanox.co.il Mon Nov 7 08:28:04 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 7 Nov 2005 18:28:04 +0200 Subject: [openib-general] RE: [PATCH] Opensm - exiting issues Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E36188CE@mtlexch01.mtl.com> We added it temporarily and removed it due to these problems. Sorry for the misleading information regarding the close_port. Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. 
Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Monday, November 07, 2005 4:55 PM > To: Eitan Zahavi > Cc: Yael Kalka; openib-general at openib.org > Subject: RE: [PATCH] Opensm - exiting issues > > On Mon, 2005-11-07 at 09:42, Eitan Zahavi wrote: > > Hi Hal, > > > > I will answer for Yael as she already left the office. > > > > The way to reproduce the "stuck" case is to run in bash: > > % while test $? = 0; do opensm -V -o; done > > > > The symptom we see is that OpenSM sort of exists but the process stay > > active (not even defunct). No way to kill it. It seems like one of the > > threads gets caught in the middle of ioctl or something. To be able to > > run OpenSM after this we need to reboot the machine. > > > > We avoid it by not issuing umad_unregister and umad_close_port > > I saw the change to not call umad_unregister in the patch. Where is the > change for umad_close_port ? > > -- Hal > > > Eitan Zahavi > > Design Technology Director > > Mellanox Technologies LTD > > Tel:+972-4-9097208 > > Fax:+972-4-9593245 > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > -----Original Message----- > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > Sent: Monday, November 07, 2005 4:21 PM > > > To: yael at mellanox.co.il > > > Cc: openib-general at openib.org; eitan at mellanox.co.il > > > Subject: Re: [PATCH] Opensm - exiting issues > > > > > > Hi Yael, > > > > > > On Mon, 2005-11-07 at 08:25, Yael Kalka wrote: > > > > Hi Hal, > > > > > > > > There was a problem when running opensm with -o option, that caused > > > > the opensm to always exit with segfault, due to object destruction > > > > ordering. Also - there is the known issue of exiting opensm. We've > > > > done some clearing to the exiting code. The following patch fixes > > most > > > > of it. > > > > > > I applied this part of the patch with some cosmetic changes in > > > osm_vendor_ibumad.c. 
> > > > > > > In the current code we saw that sometimes opensm gets "stuck" on > > exit, > > > > and causes the machine to get stuck too - resulting in need for > > > > rebooting. In the following patch fixes most of it. > > > > We did run (in the patch) into rare cases where opensm exits with an > > > > error, but at least it exits without stucking the machine... > > > > > > Is there a reliable way to recreate machine "stuck" ? What exactly do > > > you mean by this ? > > > > > > All umad_unregister does is some validation, a table lookup, and issue > > > the ioctl to unregister the MAD agent. Not explictly unregistering the > > > agent(s) does not cause any harm as when the fd is closed, this will > > > occur as part of the cleanup. > > > > > > -- Hal > > From mst at mellanox.co.il Mon Nov 7 08:27:35 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 7 Nov 2005 18:27:35 +0200 Subject: [openib-general] Re: ipoib oops In-Reply-To: <20051107155756.GV31134@mellanox.co.il> References: <20051107155756.GV31134@mellanox.co.il> Message-ID: <20051107162735.GY31134@mellanox.co.il> Quoting r. Michael S. Tsirkin : > Subject: Re: ipoib oops > > > Quoting Michael S. Tsirkin : > > Subject: ipoib oops > > > > Hi! > > I saw this in /var/log/messages recently. > > Unfortunately I cant say exactly what I did to trigger this problem. > > Oops, I left out part of the log. > Here it is in full. > Actually, I had opensm running on the same node, and it appears > stuck in defunc state currently - I wander whether > we have the umad module crashing and corrupting the ipoib data > structures, or the reverse. 
>
> ------------------------------
> <1>Unable to handle kernel NULL pointer dereference at 0000000000000000
> RIP:
> {:ib_umad:send_handler+41}
> PGD 0
> Oops: 0000 [2] SMP
> CPU 1
> Modules linked in: ib_sdp ib_cm ib_ipoib ib_sa ib_umad ib_mthca ib_mad
> ib_core
> Pid: 12203, comm: opensm Not tainted 2.6.14 #4
> RIP: 0010:[]
> {:ib_umad:send_handler+41}
> RSP: 0018:ffff81017bae7bb8 EFLAGS: 00010296
> RAX: ffff810178970d98 RBX: ffff81017bae7c78 RCX: ffff810178970d68
> RDX: ffff81017bae7c68 RSI: ffff81017bae7c78 RDI: ffff81017e825c10
> RBP: 0000000000000000 R08: ffff81017bae6000 R09: 0000000000000100
> R10: 0000000000000000 R11: 0000000000000000 R12: ffff81017e825c10
> R13: ffff810176c01000 R14: ffff81017bae7c68 R15: ffff81017bae7ef8
> FS: 0000000000000000(0000) GS:ffffffff805ff880(0000)
> knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0

This one seems to be at user_mad.c line 179:

---------------------
static void send_handler(struct ib_mad_agent *agent,
			 struct ib_mad_send_wc *send_wc)
{
	struct ib_umad_file *file = agent->context;
	struct ib_umad_packet *timeout;
	struct ib_umad_packet *packet = send_wc->send_buf->context[0];

	ib_destroy_ah(packet->msg->ah);   <----------------------------- here
	ib_free_send_mad(packet->msg);
---------------------

Looks like send_wc is NULL.
And given that the send handler seems to be always called with wc
on the stack, it now appears that it was actually ipoib that triggered
some data corruption for umad.

Right?

-- MST

From mst at mellanox.co.il Mon Nov 7 09:02:44 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 7 Nov 2005 19:02:44 +0200
Subject: [openib-general] IPoIB question/problem
Message-ID: <20051107170244.GZ31134@mellanox.co.il>

Hello, Roland!
While debugging a (gen1) problem with IPoIB, I have noticed the
following code in function neigh_update:

net/core/neighbour.c:1015

	if (lladdr != neigh->ha) {
		memcpy(&neigh->ha, lladdr, dev->addr_len);
		neigh_update_hhs(neigh);
		if (!(new & NUD_CONNECTED))
			neigh->confirmed = jiffies -
				(neigh->parms->base_reachable_time << 1);
#ifdef CONFIG_ARPD
		notify = 1;
#endif
	}

It appears, therefore, that the neighbour ha field may get updated
without destroying the neighbour.

Assuming that a remote node is replaced and its address changes
(e.g. gid change), it seems that the ha field will get out of sync with
the address handle stored in ipoib_neigh->ah, with the result that the
ah field would point to an incorrect path and all packets would be lost.

Does this analysis make sense? If yes, what would be the best way to
fix this?

Thanks,
MST

From rolandd at cisco.com Mon Nov 7 09:29:55 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 07 Nov 2005 09:29:55 -0800
Subject: [openib-general] Re: question : skb->dev versus dev_kfree_skb_any
In-Reply-To: <20051107150238.GQ31134@mellanox.co.il> (Michael S.
	Tsirkin's message of "Mon, 7 Nov 2005 17:02:38 +0200")
References: <52vez4wmrr.fsf@cisco.com>
	<20051107150238.GQ31134@mellanox.co.il>
Message-ID: <528xw0welo.fsf@cisco.com>

    Michael> Okay, do you want a patch to kill them?

That's OK, I can do it myself.

Thanks,
  Roland

From rolandd at cisco.com Mon Nov 7 09:30:03 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 07 Nov 2005 09:30:03 -0800
Subject: [openib-general] Re: [PATCH] user_mad: set kobject name properly
In-Reply-To: <20051106132702.GZ31134@mellanox.co.il> (Michael S.
	Tsirkin's message of "Sun, 6 Nov 2005 15:27:02 +0200")
References: <20051106132702.GZ31134@mellanox.co.il>
Message-ID: <524q6owelg.fsf@cisco.com>

thanks, applied.

From rolandd at cisco.com Mon Nov 7 09:30:08 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 07 Nov 2005 09:30:08 -0800
Subject: [openib-general] Re: [PATCH] user_mad: fix error handling
In-Reply-To: <20051106133147.GA31134@mellanox.co.il> (Michael S.
	Tsirkin's message of "Sun, 6 Nov 2005 15:31:47 +0200")
References: <20051106133147.GA31134@mellanox.co.il>
Message-ID: <52zmogv00v.fsf@cisco.com>

thanks, applied.

From rolandd at cisco.com Mon Nov 7 09:32:33 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 07 Nov 2005 09:32:33 -0800
Subject: [openib-general] Re: user_mad.c: deadlock?
In-Reply-To: <20051107161452.GX31134@mellanox.co.il> (Michael S.
	Tsirkin's message of "Mon, 7 Nov 2005 18:14:52 +0200")
References: <20051107161452.GX31134@mellanox.co.il>
Message-ID: <52vez4uzwu.fsf@cisco.com>

    Michael> It seems, therefore, that we can have a deadlock inside
    Michael> user_mad, where ib_umad_close calls
    Michael> ib_unregister_mad_agent which blocks until send_handler
    Michael> runs which is blocked by the port mutex.

It certainly looks that way, and it also looks like
ib_umad_unreg_agent() has had the same potential deadlock for a while.

In any case, I don't see any reason to hold the port mutex while
unregistering agents in ib_umad_close() (the file is already gone, so
it can't race against userspace registering or unregistering MAD
agents via ioctl). So something like this should be good enough.

Does anyone see anything wrong with this?

 - R.
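
A minimal userspace sketch of the locking pattern discussed in this
message, assuming pthreads in place of the kernel rwsem. All names here
(file_ctx, handler, unregister_agent, unreg_agent, MAX_AGENTS) are
hypothetical stand-ins and not the user_mad.c code: the agent pointer is
detached while the mutex is held, and the potentially blocking
unregister runs only after the mutex has been dropped, so a handler that
needs the same mutex can still make progress.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_AGENTS 4	/* hypothetical stand-in for IB_UMAD_MAX_AGENTS */

struct agent {
	int unregistered;
};

struct file_ctx {
	pthread_mutex_t mutex;		/* plays the role of port->mutex */
	struct agent *agents[MAX_AGENTS];
	int queued;			/* packets queued by the handler */
};

/* Stand-in for a completion handler that takes the mutex to queue a packet. */
void handler(struct file_ctx *f)
{
	pthread_mutex_lock(&f->mutex);
	f->queued++;
	pthread_mutex_unlock(&f->mutex);
}

/*
 * Stand-in for an unregister call that does not return until pending
 * handlers have run. It must be called WITHOUT f->mutex held.
 */
void unregister_agent(struct file_ctx *f, struct agent *a)
{
	assert(a != NULL);
	handler(f);		/* flush: the handler needs the mutex */
	a->unregistered = 1;
}

/* Detach the agent under the lock, tear it down after unlocking. */
int unreg_agent(struct file_ctx *f, unsigned int id)
{
	struct agent *a = NULL;
	int ret = 0;

	pthread_mutex_lock(&f->mutex);
	if (id >= MAX_AGENTS || !f->agents[id]) {
		ret = -1;
		goto out;
	}
	a = f->agents[id];
	f->agents[id] = NULL;	/* slot is free for re-registration */
out:
	pthread_mutex_unlock(&f->mutex);

	if (a)
		unregister_agent(f, a);	/* safe: mutex no longer held */
	return ret;
}
```

Calling unreg_agent(&f, id) on a populated slot returns 0 and leaves the
slot NULL; a second call with the same id returns -1. If
unregister_agent() were called before pthread_mutex_unlock(), the
handler would block on a mutex held by its own caller, which is the
deadlock described above.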

Index: infiniband/core/user_mad.c
===================================================================
--- infiniband/core/user_mad.c (revision 3971)
+++ infiniband/core/user_mad.c (working copy)
@@ -505,8 +505,6 @@ found:
 		goto out;
 	}
 
-	file->agent[agent_id] = agent;
-
 	file->mr[agent_id] = ib_get_dma_mr(agent->qp->pd, IB_ACCESS_LOCAL_WRITE);
 	if (IS_ERR(file->mr[agent_id])) {
 		ret = -ENOMEM;
@@ -519,14 +517,15 @@ found:
 		goto err_mr;
 	}
 
+	file->agent[agent_id] = agent;
 	ret = 0;
+	goto out;
 
 err_mr:
 	ib_dereg_mr(file->mr[agent_id]);
 
 err:
-	file->agent[agent_id] = NULL;
 	ib_unregister_mad_agent(agent);
 
 out:
@@ -536,27 +535,33 @@ out:
 
 static int ib_umad_unreg_agent(struct ib_umad_file *file, unsigned long arg)
 {
+	struct ib_mad_agent *agent = NULL;
+	struct ib_mr *mr = NULL;
 	u32 id;
 	int ret = 0;
 
-	down_write(&file->port->mutex);
+	if (get_user(id, (u32 __user *) arg))
+		return -EFAULT;
 
-	if (get_user(id, (u32 __user *) arg)) {
-		ret = -EFAULT;
-		goto out;
-	}
+	down_write(&file->port->mutex);
 
 	if (id < 0 || id >= IB_UMAD_MAX_AGENTS || !file->agent[id]) {
 		ret = -EINVAL;
 		goto out;
 	}
 
-	ib_dereg_mr(file->mr[id]);
-	ib_unregister_mad_agent(file->agent[id]);
+	agent = file->agent[id];
+	mr = file->mr[id];
 	file->agent[id] = NULL;
 
 out:
 	up_write(&file->port->mutex);
+
+	if (agent) {
+		ib_unregister_mad_agent(agent);
+		ib_dereg_mr(mr);
+	}
+
 	return ret;
 }
 
@@ -623,16 +628,16 @@ static int ib_umad_close(struct inode *i
 	struct ib_umad_packet *packet, *tmp;
 	int i;
 
-	down_write(&file->port->mutex);
 	for (i = 0; i < IB_UMAD_MAX_AGENTS; ++i)
 		if (file->agent[i]) {
-			ib_dereg_mr(file->mr[i]);
 			ib_unregister_mad_agent(file->agent[i]);
+			ib_dereg_mr(file->mr[i]);
 		}
 
 	list_for_each_entry_safe(packet, tmp, &file->recv_list, list)
 		kfree(packet);
 
+	down_write(&file->port->mutex);
 	list_del(&file->port_list);
 	up_write(&file->port->mutex);
 
@@ -801,7 +806,7 @@ static int ib_umad_init_port(struct ib_d
 		goto err_class;
 	port->sm_dev->owner = THIS_MODULE;
 	port->sm_dev->ops = &umad_sm_fops;
-	kobject_set_name(&port->dev->kobj, "issm%d", port->dev_num);
+	kobject_set_name(&port->sm_dev->kobj, "issm%d", port->dev_num);
 	if (cdev_add(port->sm_dev, base_dev + port->dev_num + IB_UMAD_MAX_PORTS, 1))
 		goto err_sm_cdev;
 
@@ -913,7 +918,7 @@ static void ib_umad_add_one(struct ib_de
 
 err:
 	while (--i >= s)
-		ib_umad_kill_port(&umad_dev->port[i]);
+		ib_umad_kill_port(&umad_dev->port[i - s]);
 
 	kref_put(&umad_dev->ref, ib_umad_release_dev);
 }

From rolandd at cisco.com Mon Nov 7 09:34:42 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 07 Nov 2005 09:34:42 -0800
Subject: [openib-general] Re: IPoIB question/problem
In-Reply-To: <20051107170244.GZ31134@mellanox.co.il> (Michael S.
	Tsirkin's message of "Mon, 7 Nov 2005 19:02:44 +0200")
References: <20051107170244.GZ31134@mellanox.co.il>
Message-ID: <52r79suzt9.fsf@cisco.com>

    Michael> Does this analysis make sense? If yes, what would be the
    Michael> best way to fix this?

I'm not sure if this could really happen or not. But could we add a
header_cache_update() method to the IPoIB struct net_device to handle
this situation if it does occur?

 - R.

From rolandd at cisco.com Mon Nov 7 09:37:11 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 07 Nov 2005 09:37:11 -0800
Subject: [openib-general] Re: [RFC] patch to export userspace to kernel QP attribute structure
In-Reply-To: <436A76D0.5010506@ichips.intel.com> (Sean Hefty's message of
	"Thu, 03 Nov 2005 12:45:04 -0800")
References: <524q6t61oa.fsf@cisco.com> <436A5700.8090102@ichips.intel.com>
	<52vez94kkd.fsf@cisco.com> <436A76D0.5010506@ichips.intel.com>
Message-ID: <52mzkguzp4.fsf@cisco.com>

    Sean> Any objection to doing something similar for libibverbs?
    Sean> This would move sa.h from libibat to libibverbs, which would
    Sean> allow libibcm and librdmacm to both depend only on
    Sean> libibverbs.

No, that seems like a good thing to do.
The only caveat is that I would like to freeze the libibverbs 1.0 API sooner rather than later, so we should try to get all this worked out. - R. From halr at voltaire.com Mon Nov 7 09:41:44 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Nov 2005 12:41:44 -0500 Subject: [openib-general] Re: [RFC] patch to export userspace to kernel QP attribute structure In-Reply-To: <436A76D0.5010506@ichips.intel.com> References: <524q6t61oa.fsf@cisco.com> <436A5700.8090102@ichips.intel.com> <52vez94kkd.fsf@cisco.com> <436A76D0.5010506@ichips.intel.com> Message-ID: <1131385304.4478.10.camel@hal.voltaire.com> On Thu, 2005-11-03 at 15:45, Sean Hefty wrote: > Roland Dreier wrote: > > If it's just marshalling between user and kernel formats, I'd stick it > > in uverbs_marshall.c. But if there's going to be something > > substantial then maybe it make sense to create a user SA module. > > I added a three new files: > > ib_marshall.h - defines the copy functions (kernel only) > ib_user_sa.h - defines the user path record (user/kernel) > uverbs_marshall.c - implements the copy functions > > Any objection to doing something similar for libibverbs? This would move sa.h > from libibat to libibverbs, which would allow libibcm and librdmacm to both > depend only on libibverbs. Ultimately I suspect there will be something like libusa where sa.h would reside. -- Hal From rolandd at cisco.com Mon Nov 7 09:46:22 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 07 Nov 2005 09:46:22 -0800 Subject: [openib-general] Re: ipoib oops In-Reply-To: <20051107162735.GY31134@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 7 Nov 2005 18:27:35 +0200") References: <20051107155756.GV31134@mellanox.co.il> <20051107162735.GY31134@mellanox.co.il> Message-ID: <52ek5suz9t.fsf@cisco.com> Michael> Looks like send_wc is NULL. 
And given that the send Michael> handler seems to be always called with wc on the stack, Michael> it now appears that it was actually ipoib that triggered Michael> some data corruption for umad. If send_wc is NULL how could we access send_wc->send_buf->context[0]? I'm not sure the two crashes are necessarily directly related. - R. From boris at mellanox.com Mon Nov 7 09:56:55 2005 From: boris at mellanox.com (Boris Shpolyansky) Date: Mon, 7 Nov 2005 09:56:55 -0800 Subject: [openib-general] Re: IPoIB question/problem Message-ID: <25AE7F432672D511B8DC00B0D0DF11DA054D0054@MTIEX01> Roland, The issue came up during gen1 IPoIB debug. As far as I was able to track it, header_cache_update() was not called upon MAC address update, so I had to add an explicit check in the IPoIB code verifying the data in neighbor->ha and in path->... are consistent and if not - invalidating the old AH and starting get_path_record flow. Boris Shpolyansky Application Engineer Mellanox Technologies Inc. 2900 Stender Way Santa Clara, CA 95054 Tel.: (408) 916 0014 Fax: (408) 970 3403 Cell: (408) 834 9365 www.mellanox.com -----Original Message----- From: Roland Dreier [mailto:rolandd at cisco.com] Sent: Monday, November 07, 2005 9:35 AM To: Michael S. Tsirkin Cc: openib-general at openib.org Subject: [openib-general] Re: IPoIB question/problem Michael> Does this analysis make sense? If yes, what would be the Michael> best way to fix this? I'm not sure if this could really happen or not. But could we add a header_cache_update() method to the IPoIB struct net_device to handle this situation if it does occur? - R. _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mst at mellanox.co.il Mon Nov 7 10:06:34 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 7 Nov 2005 20:06:34 +0200 Subject: [openib-general] Re: user_mad.c: deadlock? In-Reply-To: <52vez4uzwu.fsf@cisco.com> References: <52vez4uzwu.fsf@cisco.com> Message-ID: <20051107180634.GB31134@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: user_mad.c: deadlock? > > Michael> It seems, therefore, that we can have a deadlock inside > Michael> user_mad, where ib_umad_close calls > Michael> ib_unregister_mad_agent which blocks until send_handler > Michael> runs which is blocked by the port mutex. > > It certainly looks that way, and it also looks like > ib_umad_unreg_agent() has had the same potential deadlock for a > while. In any case, I don't see any reason to hold the port mutex > while unregistering agents in ib_umad_close() (the file is already > gone, so it can't race against userspace registering or unregistering > MAD agents via ioctl). So something like this should be good enough. > > Does anyone see anything wrong with this? > > - R. 
> > Index: infiniband/core/user_mad.c > =================================================================== > --- infiniband/core/user_mad.c (revision 3971) > +++ infiniband/core/user_mad.c (working copy) > @@ -505,8 +505,6 @@ found: > goto out; > } > > - file->agent[agent_id] = agent; > - > file->mr[agent_id] = ib_get_dma_mr(agent->qp->pd, IB_ACCESS_LOCAL_WRITE); > if (IS_ERR(file->mr[agent_id])) { > ret = -ENOMEM; > @@ -519,14 +517,15 @@ found: > goto err_mr; > } > > + file->agent[agent_id] = agent; > ret = 0; > + > goto out; > > err_mr: > ib_dereg_mr(file->mr[agent_id]); > > err: > - file->agent[agent_id] = NULL; > ib_unregister_mad_agent(agent); > > out: > @@ -536,27 +535,33 @@ out: > > static int ib_umad_unreg_agent(struct ib_umad_file *file, unsigned long arg) > { > + struct ib_mad_agent *agent = NULL; > + struct ib_mr *mr = NULL; > u32 id; > int ret = 0; > > - down_write(&file->port->mutex); > + if (get_user(id, (u32 __user *) arg)) > + return -EFAULT; > > - if (get_user(id, (u32 __user *) arg)) { > - ret = -EFAULT; > - goto out; > - } > + down_write(&file->port->mutex); > > if (id < 0 || id >= IB_UMAD_MAX_AGENTS || !file->agent[id]) { > ret = -EINVAL; > goto out; > } > > - ib_dereg_mr(file->mr[id]); > - ib_unregister_mad_agent(file->agent[id]); > + agent = file->agent[id]; > + mr = file->mr[id]; > file->agent[id] = NULL; > > out: > up_write(&file->port->mutex); > + > + if (agent) { > + ib_unregister_mad_agent(agent); > + ib_dereg_mr(mr); > + } > + > return ret; > } > > @@ -623,16 +628,16 @@ static int ib_umad_close(struct inode *i > struct ib_umad_packet *packet, *tmp; > int i; > > - down_write(&file->port->mutex); > for (i = 0; i < IB_UMAD_MAX_AGENTS; ++i) > if (file->agent[i]) { > - ib_dereg_mr(file->mr[i]); > ib_unregister_mad_agent(file->agent[i]); > + ib_dereg_mr(file->mr[i]); > } > > list_for_each_entry_safe(packet, tmp, &file->recv_list, list) > kfree(packet); > > + down_write(&file->port->mutex); > list_del(&file->port_list); > 
up_write(&file->port->mutex); > > @@ -801,7 +806,7 @@ static int ib_umad_init_port(struct ib_d > goto err_class; > port->sm_dev->owner = THIS_MODULE; > port->sm_dev->ops = &umad_sm_fops; > - kobject_set_name(&port->dev->kobj, "issm%d", port->dev_num); > + kobject_set_name(&port->sm_dev->kobj, "issm%d", port->dev_num); > if (cdev_add(port->sm_dev, base_dev + port->dev_num + IB_UMAD_MAX_PORTS, 1)) > goto err_sm_cdev; > > @@ -913,7 +918,7 @@ static void ib_umad_add_one(struct ib_de > > err: > while (--i >= s) > - ib_umad_kill_port(&umad_dev->port[i]); > + ib_umad_kill_port(&umad_dev->port[i - s]); > > kref_put(&umad_dev->ref, ib_umad_release_dev); > } > Looks fine except that it includes two of my patches which you said you have applied. -- MST From mst at mellanox.co.il Mon Nov 7 10:14:34 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 7 Nov 2005 20:14:34 +0200 Subject: [openib-general] Re: IPoIB question/problem In-Reply-To: <52r79suzt9.fsf@cisco.com> References: <52r79suzt9.fsf@cisco.com> Message-ID: <20051107181434.GC31134@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: IPoIB question/problem > > Michael> Does this analysis make sense? If yes, what would be the > Michael> best way to fix this? > > I'm not sure if this could really happen or not. This seems to be happening on some gen1 installations. > But could we add a > header_cache_update() method to the IPoIB struct net_device to handle > this situation if it does occur? > > - R. > It seems that we'll need to define hard_header_cache as well, then. And having this appears to have performance implications, affecting TCP code flow in a major way (for better or worse). I was thinking that using the header cache may be a good replacement for using the free space in the ha field, but this would be a major change. What do you think? 
A more modest approach that I was considering: keep a copy of the gid as
part of ipoib_neigh structure, and make sure that the gid didn't change
before posting a packet. This seems to work for some of our gen1 clients.

--
MST

From mst at mellanox.co.il Mon Nov 7 10:17:36 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 7 Nov 2005 20:17:36 +0200
Subject: [openib-general] Re: ipoib oops
In-Reply-To: <52ek5suz9t.fsf@cisco.com>
References: <52ek5suz9t.fsf@cisco.com>
Message-ID: <20051107181736.GD31134@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: ipoib oops
>
> Michael> Looks like send_wc is NULL. And given that the send
> Michael> handler seems to be always called with wc on the stack,
> Michael> it now appears that it was actually ipoib that triggered
> Michael> some data corruption for umad.
>
> If send_wc is NULL how could we access send_wc->send_buf->context[0]?

Right, I wasn't thinking straight.

static void send_handler(struct ib_mad_agent *agent,
                         struct ib_mad_send_wc *send_wc)
{
        struct ib_umad_file *file = agent->context;
        struct ib_umad_packet *timeout;
        struct ib_umad_packet *packet = send_wc->send_buf->context[0];

        ib_destroy_ah(packet->msg->ah); <----------------------------- here
        ib_free_send_mad(packet->msg);

Means that packet is NULL. Hmm.

--
MST

From mst at mellanox.co.il Mon Nov 7 10:20:49 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 7 Nov 2005 20:20:49 +0200
Subject: [openib-general] Re: Re: [RFC] patch to export userspace to kernel QP attribute structure
In-Reply-To: <52mzkguzp4.fsf@cisco.com>
References: <52mzkguzp4.fsf@cisco.com>
Message-ID: <20051107182049.GF31134@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: Re: [RFC] patch to export userspace to kernel QP attribute structure
>
> Sean> Any objection to doing something similar for libibverbs?
> Sean> This would move sa.h from libibat to libibverbs, which would
> Sean> allow libibcm and librdmacm to both depend only on
> Sean> libibverbs.
> > No, that seems like a good thing to do. > > The only caveat is that I would like to freeze the libibverbs 1.0 API > sooner rather than later, so we should try to get all this worked out. > > - R. What are the plans for libibverbs 1.0? When do you plan it? -- MST From rolandd at cisco.com Mon Nov 7 10:17:51 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 07 Nov 2005 10:17:51 -0800 Subject: [openib-general] Re: user_mad.c: deadlock? In-Reply-To: <20051107180634.GB31134@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 7 Nov 2005 20:06:34 +0200") References: <52vez4uzwu.fsf@cisco.com> <20051107180634.GB31134@mellanox.co.il> Message-ID: <5264r4uxtc.fsf@cisco.com> Michael> Looks fine except that it includes two of my patches Michael> which you said you have applied. Sorry, I wrote the patch before I committed those patches, and then realized I needed to apply them when I generated that diff. But I forgot to update the diff. Anyway, the situation is under control ;) - R. From rolandd at cisco.com Mon Nov 7 10:23:24 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 07 Nov 2005 10:23:24 -0800 Subject: [openib-general] Re: IPoIB question/problem In-Reply-To: <20051107181434.GC31134@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 7 Nov 2005 20:14:34 +0200") References: <52r79suzt9.fsf@cisco.com> <20051107181434.GC31134@mellanox.co.il> Message-ID: <521x1suxk3.fsf@cisco.com> Michael> It seems that we'll need to define hard_header_cache as Michael> well, then. And having this appears to have performance Michael> implications, affecting TCP code flow in a major way (for Michael> better or worse). I was thinking that using the header Michael> cache may be a good replacement for using the free space Michael> in the ha field, but this would be a major change. What Michael> do you think? You're right, I misread the neighbour.c code a little bit. We can't just use the update method without the whole hard header cache thing. 
I looked at using that stuff originally, but it didn't seem quite suitable for what IPoIB needed. I can't remember why right now unfortunately. Michael> A more modest approach that I was considering: keep a Michael> copy of the gid as part of ipoib_neigh structure, and Michael> make sure that the gid didnt change before posting a Michael> packet. This seems to work for some of our gen1 clients. How does adding the 16-byte memcmp to the fast path affect performance? It's probably OK for at least a short-term fix, because it's better to be a little slower than to completely lose packets. - R. From rolandd at cisco.com Mon Nov 7 10:24:47 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 07 Nov 2005 10:24:47 -0800 Subject: [openib-general] Re: IPoIB question/problem In-Reply-To: <521x1suxk3.fsf@cisco.com> (Roland Dreier's message of "Mon, 07 Nov 2005 10:23:24 -0800") References: <52r79suzt9.fsf@cisco.com> <20051107181434.GC31134@mellanox.co.il> <521x1suxk3.fsf@cisco.com> Message-ID: <52wtjktixc.fsf@cisco.com> By the way, I was just about to commit the patch below to add the ability to get at the IPoIB path cache for debugging: --- infiniband/ulp/ipoib/ipoib_vlan.c (revision 3971) +++ infiniband/ulp/ipoib/ipoib_vlan.c (working copy) @@ -113,8 +113,7 @@ int ipoib_vlan_add(struct net_device *pd priv->parent = ppriv->dev; - if (ipoib_create_debug_file(priv->dev)) - goto debug_failed; + ipoib_create_debug_files(priv->dev); if (ipoib_add_pkey_attr(priv->dev)) goto sysfs_failed; @@ -130,9 +129,7 @@ int ipoib_vlan_add(struct net_device *pd return 0; sysfs_failed: - ipoib_delete_debug_file(priv->dev); - -debug_failed: + ipoib_delete_debug_files(priv->dev); unregister_netdev(priv->dev); register_failed: --- infiniband/ulp/ipoib/ipoib_main.c (revision 3971) +++ infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -58,6 +58,11 @@ module_param_named(debug_level, ipoib_de MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0"); #endif +struct ipoib_path_iter { + struct 
net_device *dev; + struct ipoib_path path; +}; + static const u8 ipv4_bcast_addr[] = { 0x00, 0xff, 0xff, 0xff, 0xff, 0x12, 0x40, 0x1b, 0x00, 0x00, 0x00, 0x00, @@ -250,6 +255,64 @@ static void path_free(struct net_device kfree(path); } +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG + +struct ipoib_path_iter *ipoib_path_iter_init(struct net_device *dev) +{ + struct ipoib_path_iter *iter; + + iter = kmalloc(sizeof *iter, GFP_KERNEL); + if (!iter) + return NULL; + + iter->dev = dev; + memset(iter->path.pathrec.dgid.raw, 0, 16); + + if (ipoib_path_iter_next(iter)) { + kfree(iter); + return NULL; + } + + return iter; +} + +int ipoib_path_iter_next(struct ipoib_path_iter *iter) +{ + struct ipoib_dev_priv *priv = netdev_priv(iter->dev); + struct rb_node *n; + struct ipoib_path *path; + int ret = 1; + + spin_lock_irq(&priv->lock); + + n = rb_first(&priv->path_tree); + + while (n) { + path = rb_entry(n, struct ipoib_path, rb_node); + + if (memcmp(iter->path.pathrec.dgid.raw, path->pathrec.dgid.raw, + sizeof (union ib_gid)) < 0) { + iter->path = *path; + ret = 0; + break; + } + + n = rb_next(n); + } + + spin_unlock_irq(&priv->lock); + + return ret; +} + +void ipoib_path_iter_read(struct ipoib_path_iter *iter, + struct ipoib_path *path) +{ + *path = iter->path; +} + +#endif /* CONFIG_INFINIBAND_IPOIB_DEBUG */ + void ipoib_flush_paths(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -763,7 +826,7 @@ void ipoib_dev_cleanup(struct net_device { struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv, *tcpriv; - ipoib_delete_debug_file(dev); + ipoib_delete_debug_files(dev); /* Delete any child interfaces first */ list_for_each_entry_safe(cpriv, tcpriv, &priv->child_intfs, list) { @@ -972,8 +1035,7 @@ static struct net_device *ipoib_add_port goto register_failed; } - if (ipoib_create_debug_file(priv->dev)) - goto debug_failed; + ipoib_create_debug_files(priv->dev); if (ipoib_add_pkey_attr(priv->dev)) goto sysfs_failed; @@ -987,9 +1049,7 @@ static struct net_device 
*ipoib_add_port return priv->dev; sysfs_failed: - ipoib_delete_debug_file(priv->dev); - -debug_failed: + ipoib_delete_debug_files(priv->dev); unregister_netdev(priv->dev); register_failed: --- infiniband/ulp/ipoib/ipoib_multicast.c (revision 3971) +++ infiniband/ulp/ipoib/ipoib_multicast.c (working copy) @@ -928,21 +928,16 @@ struct ipoib_mcast_iter *ipoib_mcast_ite return NULL; iter->dev = dev; - memset(iter->mgid.raw, 0, sizeof iter->mgid); + memset(iter->mgid.raw, 0, 16); if (ipoib_mcast_iter_next(iter)) { - ipoib_mcast_iter_free(iter); + kfree(iter); return NULL; } return iter; } -void ipoib_mcast_iter_free(struct ipoib_mcast_iter *iter) -{ - kfree(iter); -} - int ipoib_mcast_iter_next(struct ipoib_mcast_iter *iter) { struct ipoib_dev_priv *priv = netdev_priv(iter->dev); --- infiniband/ulp/ipoib/ipoib.h (revision 3971) +++ infiniband/ulp/ipoib/ipoib.h (working copy) @@ -179,6 +179,7 @@ struct ipoib_dev_priv { #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG struct list_head fs_list; struct dentry *mcg_dentry; + struct dentry *path_dentry; #endif }; @@ -270,7 +271,6 @@ void ipoib_mcast_dev_flush(struct net_de #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG struct ipoib_mcast_iter *ipoib_mcast_iter_init(struct net_device *dev); -void ipoib_mcast_iter_free(struct ipoib_mcast_iter *iter); int ipoib_mcast_iter_next(struct ipoib_mcast_iter *iter); void ipoib_mcast_iter_read(struct ipoib_mcast_iter *iter, union ib_gid *gid, @@ -278,6 +278,11 @@ void ipoib_mcast_iter_read(struct ipoib_ unsigned int *queuelen, unsigned int *complete, unsigned int *send_only); + +struct ipoib_path_iter *ipoib_path_iter_init(struct net_device *dev); +int ipoib_path_iter_next(struct ipoib_path_iter *iter); +void ipoib_path_iter_read(struct ipoib_path_iter *iter, + struct ipoib_path *path); #endif int ipoib_mcast_attach(struct net_device *dev, u16 mlid, @@ -299,13 +304,13 @@ void ipoib_pkey_poll(void *dev); int ipoib_pkey_dev_delay_open(struct net_device *dev); #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG -int 
ipoib_create_debug_file(struct net_device *dev); -void ipoib_delete_debug_file(struct net_device *dev); +void ipoib_create_debug_files(struct net_device *dev); +void ipoib_delete_debug_files(struct net_device *dev); int ipoib_register_debugfs(void); void ipoib_unregister_debugfs(void); #else -static inline int ipoib_create_debug_file(struct net_device *dev) { return 0; } -static inline void ipoib_delete_debug_file(struct net_device *dev) { } +static inline void ipoib_create_debug_files(struct net_device *dev) { } +static inline void ipoib_delete_debug_files(struct net_device *dev) { } static inline int ipoib_register_debugfs(void) { return 0; } static inline void ipoib_unregister_debugfs(void) { } #endif --- infiniband/ulp/ipoib/ipoib_fs.c (revision 3971) +++ infiniband/ulp/ipoib/ipoib_fs.c (working copy) @@ -43,6 +43,18 @@ struct file_operations; static struct dentry *ipoib_root; +static void format_gid(union ib_gid *gid, char *buf) +{ + int i, n; + + for (n = 0, i = 0; i < 8; ++i) { + n += sprintf(buf + n, "%x", + be16_to_cpu(((__be16 *) gid->raw)[i])); + if (i < 7) + buf[n++] = ':'; + } +} + static void *ipoib_mcg_seq_start(struct seq_file *file, loff_t *pos) { struct ipoib_mcast_iter *iter; @@ -54,7 +66,7 @@ static void *ipoib_mcg_seq_start(struct while (n--) { if (ipoib_mcast_iter_next(iter)) { - ipoib_mcast_iter_free(iter); + kfree(iter); return NULL; } } @@ -70,7 +82,7 @@ static void *ipoib_mcg_seq_next(struct s (*pos)++; if (ipoib_mcast_iter_next(iter)) { - ipoib_mcast_iter_free(iter); + kfree(iter); return NULL; } @@ -91,28 +103,29 @@ static int ipoib_mcg_seq_show(struct seq unsigned long created; unsigned int queuelen, complete, send_only; - if (iter) { - ipoib_mcast_iter_read(iter, &mgid, &created, &queuelen, - &complete, &send_only); - - for (n = 0, i = 0; i < sizeof mgid / 2; ++i) { - n += sprintf(gid_buf + n, "%x", - be16_to_cpu(((__be16 *) mgid.raw)[i])); - if (i < sizeof mgid / 2 - 1) - gid_buf[n++] = ':'; - } - } + if (!iter) + return 0; + + 
ipoib_mcast_iter_read(iter, &mgid, &created, &queuelen, + &complete, &send_only); - seq_printf(file, "GID: %*s", -(1 + (int) sizeof gid_buf), gid_buf); + format_git(&mgid, gid_buf); seq_printf(file, - " created: %10ld queuelen: %4d complete: %d send_only: %d\n", - created, queuelen, complete, send_only); + "GID: %s\n" + " created: %10ld\n" + " queuelen: %9d\n" + " complete: %9s\n" + " send_only: %8s\n" + "\n", + gid_buf, created, queuelen, + complete ? "yes" : "no", + send_only ? "yes" : "no"); return 0; } -static struct seq_operations ipoib_seq_ops = { +static struct seq_operations ipoib_mcg_seq_ops = { .start = ipoib_mcg_seq_start, .next = ipoib_mcg_seq_next, .stop = ipoib_mcg_seq_stop, @@ -124,7 +137,7 @@ static int ipoib_mcg_open(struct inode * struct seq_file *seq; int ret; - ret = seq_open(file, &ipoib_seq_ops); + ret = seq_open(file, &ipoib_mcg_seq_ops); if (ret) return ret; @@ -134,7 +147,7 @@ static int ipoib_mcg_open(struct inode * return 0; } -static struct file_operations ipoib_fops = { +static struct file_operations ipoib_mcg_fops = { .owner = THIS_MODULE, .open = ipoib_mcg_open, .read = seq_read, @@ -142,25 +155,139 @@ static struct file_operations ipoib_fops .release = seq_release }; -int ipoib_create_debug_file(struct net_device *dev) +static void *ipoib_path_seq_start(struct seq_file *file, loff_t *pos) +{ + struct ipoib_path_iter *iter; + loff_t n = *pos; + + iter = ipoib_path_iter_init(file->private); + if (!iter) + return NULL; + + while (n--) { + if (ipoib_path_iter_next(iter)) { + kfree(iter); + return NULL; + } + } + + return iter; +} + +static void *ipoib_path_seq_next(struct seq_file *file, void *iter_ptr, + loff_t *pos) +{ + struct ipoib_path_iter *iter = iter_ptr; + + (*pos)++; + + if (ipoib_path_iter_next(iter)) { + kfree(iter); + return NULL; + } + + return iter; +} + +static void ipoib_path_seq_stop(struct seq_file *file, void *iter_ptr) +{ + /* nothing for now */ +} + +static int ipoib_path_seq_show(struct seq_file *file, void 
*iter_ptr) +{ + struct ipoib_path_iter *iter = iter_ptr; + char gid_buf[sizeof "ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff"]; + struct ipoib_path path; + int i, n; + int rate; + + if (!iter) + return 0; + + ipoib_path_iter_read(iter, &path); + + format_git(&path.pathrec.dgid, gid_buf); + + seq_printf(file, + "GID: %s\n" + " complete: %6s\n", + gid_buf, path.pathrec.dlid ? "yes" : "no"); + + if (path.pathrec.dlid) { + rate = ib_sa_rate_enum_to_int(path.pathrec.rate) * 25; + + seq_printf(file, + " DLID: 0x%04x\n" + " SL: %12d\n" + " rate: %*d%s Gb/sec\n", + be16_to_cpu(path.pathrec.dlid), + path.pathrec.sl, + 10 - ((rate % 10) ? 2 : 0), + rate / 10, rate % 10 ? ".5" : ""); + } + + seq_putc(file, '\n'); + + return 0; +} + +static struct seq_operations ipoib_path_seq_ops = { + .start = ipoib_path_seq_start, + .next = ipoib_path_seq_next, + .stop = ipoib_path_seq_stop, + .show = ipoib_path_seq_show, +}; + +static int ipoib_path_open(struct inode *inode, struct file *file) +{ + struct seq_file *seq; + int ret; + + ret = seq_open(file, &ipoib_path_seq_ops); + if (ret) + return ret; + + seq = file->private_data; + seq->private = inode->u.generic_ip; + + return 0; +} + +static struct file_operations ipoib_path_fops = { + .owner = THIS_MODULE, + .open = ipoib_path_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release +}; + +void ipoib_create_debug_files(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); - char name[IFNAMSIZ + sizeof "_mcg"]; + char name[IFNAMSIZ + sizeof "_path"]; snprintf(name, sizeof name, "%s_mcg", dev->name); - priv->mcg_dentry = debugfs_create_file(name, S_IFREG | S_IRUGO, - ipoib_root, dev, &ipoib_fops); - - return priv->mcg_dentry ? 
0 : -ENOMEM; + ipoib_root, dev, &ipoib_mcg_fops); + if (!priv->mcg_dentry) + ipoib_warn(priv, "failed to create mcg debug file\n"); + + snprintf(name, sizeof name, "%s_path", dev->name); + priv->path_dentry = debugfs_create_file(name, S_IFREG | S_IRUGO, + ipoib_root, dev, &ipoib_path_fops); + if (!priv->path_dentry) + ipoib_warn(priv, "failed to create path debug file\n"); } -void ipoib_delete_debug_file(struct net_device *dev) +void ipoib_delete_debug_files(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); if (priv->mcg_dentry) debugfs_remove(priv->mcg_dentry); + if (priv->path_dentry) + debugfs_remove(priv->path_dentry); } int ipoib_register_debugfs(void) From rolandd at cisco.com Mon Nov 7 10:30:22 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 07 Nov 2005 10:30:22 -0800 Subject: [openib-general] Re: [RFC] patch to export userspace to kernel QP attribute structure In-Reply-To: <20051107182049.GF31134@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 7 Nov 2005 20:20:49 +0200") References: <52mzkguzp4.fsf@cisco.com> <20051107182049.GF31134@mellanox.co.il> Message-ID: <52slu8tio1.fsf@cisco.com> Michael> What are the plans for libibverbs 1.0? When do you plan it? I don't have any definite plans but I am definitely thinking in terms of stabilization and starting to freeze. I would like to add at least the API and kernel ABI for CQ resize before 1.0, and maybe MWs. - R. 
From jlentini at netapp.com Mon Nov 7 10:59:37 2005
From: jlentini at netapp.com (James Lentini)
Date: Mon, 7 Nov 2005 13:59:37 -0500 (EST)
Subject: [openib-general] [ANNOUNCE] ContributeRDS(ReliableDatagramSockets) to OpenIB
In-Reply-To: <20051107170901.GB16345@esmail.cup.hp.com>
References: <54AD0F12E08D1541B826BE97C98F99F1020C40@NT-SJCA-0751.brcm.ad.broadcom.com> <96f8e60e0511041259v655a217anba925ae53f5c3dee@mail.gmail.com> <20051107170901.GB16345@esmail.cup.hp.com>
Message-ID:

On Mon, 7 Nov 2005, Grant Grundler wrote:

> On Mon, Nov 07, 2005 at 09:51:48AM -0500, James Lentini wrote:
> ...
> > > Procs per node   uDapl/Sdp   Rds
> > > 2                19996       9999
> > > 4                39984       9999
> > >
> > > Clearly, there is tradeoff in performance as we go from uDapl/Sdp to
> > > Rds.
> ...
> > This isn't an apples to apples comparison. uDAPL is an API and RDS is
> > a protocol.
>
> I thought he was comparing SDP (using uDAPL) vs RDS.
> Did I read that wrong?

I think so. I'll let Ranjit clarify.

To the best of my knowledge, SDP has never been implemented on uDAPL.

From rpandit at silverstorm.com Mon Nov 7 11:29:25 2005
From: rpandit at silverstorm.com (Ranjit Pandit)
Date: Mon, 7 Nov 2005 11:29:25 -0800
Subject: [openib-general] [ANNOUNCE] ContributeRDS(ReliableDatagramSockets) to OpenIB
In-Reply-To:
References: <54AD0F12E08D1541B826BE97C98F99F1020C40@NT-SJCA-0751.brcm.ad.broadcom.com> <96f8e60e0511041259v655a217anba925ae53f5c3dee@mail.gmail.com> <20051107170901.GB16345@esmail.cup.hp.com>
Message-ID: <96f8e60e0511071129i644f0295id61e87a0d7ed7d23@mail.gmail.com>

I mentioned uDapl or SDP specifically in terms of the typical MPI usage
today - connection oriented. With both uDapl and SDP the connection
requirements were the same so I included them in the same column.
I did not intend to indicate an Sdp implementation on uDapl.

On 11/7/05, James Lentini wrote:
>
>
> On Mon, 7 Nov 2005, Grant Grundler wrote:
>
> > On Mon, Nov 07, 2005 at 09:51:48AM -0500, James Lentini wrote:
> > ...
> > > > Procs per node   uDapl/Sdp   Rds
> > > > 2                19996       9999
> > > > 4                39984       9999
> > > >
> > > > Clearly, there is tradeoff in performance as we go from uDapl/Sdp to
> > > > Rds.
> > ...
> > > This isn't an apples to apples comparison. uDAPL is an API and RDS is
> > > a protocol.
> >
> > I thought he was comparing SDP (using uDAPL) vs RDS.
> > Did I read that wrong?
>
> I think so. I'll let Ranjit clarify.
>
> To the best of my knowledge, SDP has never been implemented on uDAPL.
>

From mst at mellanox.co.il Mon Nov 7 11:44:13 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 7 Nov 2005 21:44:13 +0200
Subject: [openib-general] Re: IPoIB question/problem
In-Reply-To: <521x1suxk3.fsf@cisco.com>
References: <521x1suxk3.fsf@cisco.com>
Message-ID: <20051107194413.GA20544@mellanox.co.il>

Quoting r. Roland Dreier :
> Michael> A more modest approach that I was considering: keep a
> Michael> copy of the gid as part of ipoib_neigh structure, and
> Michael> make sure that the gid didn't change before posting a
> Michael> packet. This seems to work for some of our gen1 clients.
>
> How does adding the 16-byte memcmp to the fast path affect
> performance? It's probably OK for at least a short-term fix, because
> it's better to be a little slower than to completely lose packets.

I wouldn't expect there to be any noticeable impact.
Let's do it this way then.
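
For illustration, the check discussed above could look roughly like the sketch below. It is a userspace approximation under stated assumptions, not the IPoIB driver code: `struct neigh_cache`, `neigh_cache_set()` and `neigh_gid_current()` are hypothetical names, and the only detail taken from the thread is that a copy of the 16-byte GID is kept next to the cached address handle and compared with memcmp before a packet is posted.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GID_LEN 16	/* an InfiniBand GID is 128 bits */

/* Hypothetical per-neighbour state; NOT the real struct ipoib_neigh. */
struct neigh_cache {
	uint8_t dgid[GID_LEN];	/* GID the cached address handle was built for */
	int     ah_valid;	/* nonzero while the cached handle may be used */
};

/* Record the GID a freshly resolved path / address handle belongs to. */
static void neigh_cache_set(struct neigh_cache *nc, const uint8_t *gid)
{
	assert(nc && gid);
	memcpy(nc->dgid, gid, GID_LEN);
	nc->ah_valid = 1;
}

/*
 * Fast-path check before posting a send: returns 1 if the cached handle
 * still matches the GID currently stored in the neighbour's hardware
 * address, 0 if it is stale. A stale entry is invalidated so the caller
 * can start a new path record lookup instead of using the old handle.
 */
static int neigh_gid_current(struct neigh_cache *nc, const uint8_t *cur_gid)
{
	assert(nc && cur_gid);
	if (nc->ah_valid && memcmp(nc->dgid, cur_gid, GID_LEN) == 0)
		return 1;
	nc->ah_valid = 0;
	return 0;
}
```

The cost on the send path is one 16-byte compare per packet. When the compare fails the caller would drop back to the slow path and re-resolve, which is the trade-off described above: slightly slower, but no packets sent to a stale destination.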
--
MST

From halr at voltaire.com Mon Nov 7 12:03:08 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 07 Nov 2005 15:03:08 -0500
Subject: [openib-general] RE: [PATCH] Opensm - exiting issues
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E36188CD@mtlexch01.mtl.com>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E36188CD@mtlexch01.mtl.com>
Message-ID: <1131393787.4324.8.camel@hal.voltaire.com>

On Mon, 2005-11-07 at 09:42, Eitan Zahavi wrote:
> Hi Hal,
>
> I will answer for Yael as she already left the office.
>
> The way to reproduce the "stuck" case is to run in bash:
> % while test $? = 0; do opensm -V -o; done
>
> The symptom we see is that OpenSM sort of exits but the process stays
> active (not even defunct). No way to kill it. It seems like one of the
> threads gets caught in the middle of ioctl or something. To be able to
> run OpenSM after this we need to reboot the machine.
>
> We avoid it by not issuing umad_unregister and umad_close_port

This part of the patch is not needed with the fix to user_mad put in by
Roland based on the issue (and patch) from Michael on user_mad deadlock.

-- Hal

>
> Eitan Zahavi
> Design Technology Director
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
>
>
> > -----Original Message-----
> > From: Hal Rosenstock [mailto:halr at voltaire.com]
> > Sent: Monday, November 07, 2005 4:21 PM
> > To: yael at mellanox.co.il
> > Cc: openib-general at openib.org; eitan at mellanox.co.il
> > Subject: Re: [PATCH] Opensm - exiting issues
> >
> > Hi Yael,
> >
> > On Mon, 2005-11-07 at 08:25, Yael Kalka wrote:
> > > Hi Hal,
> > >
> > > There was a problem when running opensm with -o option, that caused
> > > the opensm to always exit with segfault, due to object destruction
> > > ordering. Also - there is the known issue of exiting opensm. We've
> > > done some clearing to the exiting code. The following patch fixes
> > > most of it.
> >
> > I applied this part of the patch with some cosmetic changes in
> > osm_vendor_ibumad.c.
> >
> > > In the current code we saw that sometimes opensm gets "stuck" on
> > > exit, and causes the machine to get stuck too - resulting in need for
> > > rebooting. The following patch fixes most of it.
> > > We did run (in the patch) into rare cases where opensm exits with an
> > > error, but at least it exits without hanging the machine...
> >
> > Is there a reliable way to recreate machine "stuck" ? What exactly do
> > you mean by this ?
> >
> > All umad_unregister does is some validation, a table lookup, and issue
> > the ioctl to unregister the MAD agent. Not explicitly unregistering the
> > agent(s) does not cause any harm as when the fd is closed, this will
> > occur as part of the cleanup.
> >
> > -- Hal
>

From halr at voltaire.com Mon Nov 7 12:06:34 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 07 Nov 2005 15:06:34 -0500
Subject: [openib-general] RE: [PATCH] Opensm - exiting issues
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E36188CD@mtlexch01.mtl.com>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E36188CD@mtlexch01.mtl.com>
Message-ID: <1131393898.4324.11.camel@hal.voltaire.com>

On Mon, 2005-11-07 at 09:42, Eitan Zahavi wrote:
> Hi Hal,
>
> I will answer for Yael as she already left the office.
>
> The way to reproduce the "stuck" case is to run in bash:
> % while test $? = 0; do opensm -V -o; done
>
> The symptom we see is that OpenSM sort of exits but the process stays
> active (not even defunct). No way to kill it. It seems like one of the
> threads gets caught in the middle of ioctl or something. To be able to
> run OpenSM after this we need to reboot the machine.
>
> We avoid it by not issuing umad_unregister and umad_close_port

This part of the patch is not needed with the fix to user_mad put in by
Roland based on the issue (and patch) from Michael on user_mad deadlock.

I've been running your test for over 30 minutes now without a hiccup.
It used to fail pretty quickly.

-- Hal

>
> Eitan Zahavi
> Design Technology Director
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
>
>
> > -----Original Message-----
> > From: Hal Rosenstock [mailto:halr at voltaire.com]
> > Sent: Monday, November 07, 2005 4:21 PM
> > To: yael at mellanox.co.il
> > Cc: openib-general at openib.org; eitan at mellanox.co.il
> > Subject: Re: [PATCH] Opensm - exiting issues
> >
> > Hi Yael,
> >
> > On Mon, 2005-11-07 at 08:25, Yael Kalka wrote:
> > > Hi Hal,
> > >
> > > There was a problem when running opensm with -o option, that caused
> > > the opensm to always exit with segfault, due to object destruction
> > > ordering. Also - there is the known issue of exiting opensm. We've
> > > done some clearing to the exiting code. The following patch fixes
> > > most of it.
> >
> > I applied this part of the patch with some cosmetic changes in
> > osm_vendor_ibumad.c.
> >
> > > In the current code we saw that sometimes opensm gets "stuck" on
> > > exit, and causes the machine to get stuck too - resulting in need for
> > > rebooting. The following patch fixes most of it.
> > > We did run (in the patch) into rare cases where opensm exits with an
> > > error, but at least it exits without hanging the machine...
> >
> > Is there a reliable way to recreate machine "stuck" ? What exactly do
> > you mean by this ?
> >
> > All umad_unregister does is some validation, a table lookup, and issue
> > the ioctl to unregister the MAD agent. Not explicitly unregistering the
> > agent(s) does not cause any harm as when the fd is closed, this will
> > occur as part of the cleanup.
> > > > -- Hal > From rolandd at cisco.com Mon Nov 7 16:52:55 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 07 Nov 2005 16:52:55 -0800 Subject: [openib-general] [PATCH/RFC] Message-ID: <52vez4q7tk.fsf@cisco.com> Any comments on changing the signature of the struct ib_device resize_cq method to take the new CQ size rather than a pointer to the new CQ size? The low-level driver would then be responsible for updating the cq->cqe member itself (possibly with proper locking). This would also make the prototype match the create_cq method and the actual ib_resize_cq() function.
No device drivers implement this interface yet (except ehca, which had a commented-out resize_cq method with a prototype that matches the new changed signature, rather than the current signature). - R. --- infiniband/include/rdma/ib_verbs.h (revision 3979) +++ infiniband/include/rdma/ib_verbs.h (working copy) @@ -881,7 +881,7 @@ struct ib_device { struct ib_ucontext *context, struct ib_udata *udata); int (*destroy_cq)(struct ib_cq *cq); - int (*resize_cq)(struct ib_cq *cq, int *cqe); + int (*resize_cq)(struct ib_cq *cq, int cqe); int (*poll_cq)(struct ib_cq *cq, int num_entries, struct ib_wc *wc); int (*peek_cq)(struct ib_cq *cq, int wc_cnt); --- infiniband/core/verbs.c (revision 3979) +++ infiniband/core/verbs.c (working copy) @@ -329,11 +329,7 @@ int ib_resize_cq(struct ib_cq *cq, if (!cq->device->resize_cq) return -ENOSYS; - ret = cq->device->resize_cq(cq, &cqe); - if (!ret) - cq->cqe = cqe; - - return ret; + return cq->device->resize_cq(cq, cqe); } EXPORT_SYMBOL(ib_resize_cq); From pradeep at us.ibm.com Mon Nov 7 17:14:28 2005 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Mon, 7 Nov 2005 17:14:28 -0800 Subject: [openib-general] Data structure size mismatch In-Reply-To: <52oe4wwk7y.fsf@cisco.com> Message-ID: Roland Dreier wrote on 11/07/2005 07:28:33 AM: > Pradeep> Now, sizeof *packet will be different between 32-bit and > Pradeep> 64-bit because of the pointers. Because of this, the > Pradeep> offset of packet->mad will be incorrect and one might > Pradeep> find unexpected data. Would you agree? > > I don't understand your point. packet is a kernel data structure, and > it doesn't matter that the layout changes if I compile the kernel for > a different architecture. What is being copied from userspace is a > struct ib_user_mad whose does not depend on the word size. > I have seen applications duplicate the data structure in the kernel. And this is copied across the user-kernel space. 
If a data structure like packet that contains a pointer element were copied across the user-kernel boundary one runs into the problem that I was mentioning. In this particular case since ib_user_mad is what is copied across the user-kernel boundary it is not an issue. Given that I had already seen this problem in uat.c, I was trying to sound a note of caution about this as a potential problem, looking at the kernel data structures. Pradeep pradeep at us.ibm.com From sean.hefty at intel.com Mon Nov 7 17:08:47 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 7 Nov 2005 17:08:47 -0800 Subject: [openib-general] [PATCH/RFC] In-Reply-To: <52vez4q7tk.fsf@cisco.com> Message-ID: >Any comments on changing the signature of the struct ib_device >resize_cq method to take the new CQ size rather than a pointer to the >new CQ size? The low-level driver would then be responsible for >updating the cq->cqe member itself (possibly with proper locking). >This would also make the prototype match the create_cq method and the >actual ib_resize_cq() function. Sounds good to me. - Sean From ftillier at silverstorm.com Mon Nov 7 17:27:27 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Mon, 7 Nov 2005 17:27:27 -0800 Subject: [openib-general] [PATCH/RFC] In-Reply-To: <52vez4q7tk.fsf@cisco.com> Message-ID: <001401c5e403$9544b530$9e5aa8c0@infiniconsys.com> > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Monday, November 07, 2005 4:53 PM > > Any comments on changing the signature of the struct ib_device > resize_cq method to take the new CQ size rather than a pointer to the > new CQ size? The low-level driver would then be responsible for > updating the cq->cqe member itself (possibly with proper locking). > This would also make the prototype match the create_cq method and the > actual ib_resize_cq() function. This matches what I plan on doing in Windows, so it sounds good to me.
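For readers following the resize_cq thread, the split of responsibilities being proposed might look roughly like the user-space sketch below. All types and names here (`example_cq`, `example_device`, `example_driver_resize_cq`, `example_resize_cq`) are hypothetical models, not the kernel definitions; locking is only noted in a comment, and the `cqe < 1` check is an assumption added for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct example_cq;

/* Hypothetical model of the device method table: resize_cq takes the
 * new size by value, as in the proposed signature. */
struct example_device {
    int (*resize_cq)(struct example_cq *cq, int cqe);
};

/* Hypothetical model of a CQ: only the fields discussed are kept. */
struct example_cq {
    struct example_device *device;
    int cqe;
};

/* Driver side: with the by-value signature, the driver itself records
 * the new size (a real driver would resize its buffer and take
 * whatever lock protects cq->cqe). */
static int example_driver_resize_cq(struct example_cq *cq, int cqe)
{
    if (cqe < 1)
        return -EINVAL;   /* illustrative validity check, an assumption */
    cq->cqe = cqe;
    return 0;
}

/* Core side: the wrapper only forwards the call and no longer touches
 * cq->cqe on success. */
static int example_resize_cq(struct example_cq *cq, int cqe)
{
    assert(cq != NULL && cq->device != NULL);
    if (!cq->device->resize_cq)
        return -ENOSYS;
    return cq->device->resize_cq(cq, cqe);
}
```

The design point, as far as the thread states it, is that the core wrapper and the device method end up with the same prototype, and the update of `cq->cqe` moves to the code that knows which lock guards it.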
- Fab From surs at cse.ohio-state.edu Mon Nov 7 18:05:22 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Mon, 07 Nov 2005 21:05:22 -0500 Subject: [openib-general] OpenSM unable to bring up subnet Message-ID: <437007E2.6010704@cse.ohio-state.edu> Hi, I am using OpenSM (svn rev 3984 and with 3882). It is unable to bring up the subnet and "hangs". This behavior is observed with machines are connected back-to-back as well as with any switch. My kernel version is 2.6.13.1, machines are Opteron (on Tyan S295 motherboard). I have included the log file. Maybe someone can tell if I am doing anything wrong? [surs at ro0:~] lsmod | grep ^ib ib_ucm 22280 0 ib_cm 37616 1 ib_ucm ib_uverbs 40984 0 ib_umad 17824 2 ib_mthca 124320 0 ib_mad 42660 3 ib_cm,ib_umad,ib_mthca ib_core 56320 6 ib_ucm,ib_cm,ib_uverbs,ib_umad,ib_mthca,ib_mad [surs at ro0:tmp] ls -l /dev/infiniband/ total 0 crw-rw---- 1 root root 231, 64 2005-11-08 02:23 issm0 crw-rw---- 1 root root 231, 65 2005-11-08 02:23 issm1 crw-rw-rw- 1 root root 231, 224 2005-11-08 02:23 ucm0 crw-rw---- 1 root root 231, 0 2005-11-08 02:23 umad0 crw-rw---- 1 root root 231, 1 2005-11-08 02:23 umad1 crw-rw-rw- 1 root root 231, 192 2005-11-08 02:23 uverbs0 <==== Nov 08 02:59:33 576837 [AB454D00] -> OpenSM Rev:openib-1.1.0 Nov 08 02:59:33 576979 [0000] -> OpenSM Rev:openib-1.1.0 Nov 08 02:59:33 577953 [AB454D00] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000 Nov 08 02:59:33 578017 [AB454D00] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000 Nov 08 02:59:33 581289 [AB454D00] -> osm_vendor_get_all_port_attr: assign CA mthca0 port 1 guid (0x2c902004002e9) as the default port. Nov 08 02:59:33 581326 [AB454D00] -> osm_vendor_bind: Binding to port 0x2c902004002e9. Nov 08 02:59:33 583680 [AB454D00] -> osm_vendor_bind: Binding to port 0x2c902004002e9. 
Nov 08 02:59:33 987191 [40C05960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x1234) -- dropping. Nov 08 02:59:33 987227 [40C05960] -> umad_receiver: ERR 5411: DR SMP hop ptr 0 hop count 0 DR SLID 0x0 DR DLID 0x0 Nov 08 02:59:33 987243 [40C05960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT). Nov 08 02:59:33 987303 [40C05960] -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x0 trans_id................0x1234 attr_id.................0x11 (NodeInfo) resv....................0x0 attr_mod................0x0 m_key...................0x0000000000000000 dr_slid.................0xFFFF dr_dlid.................0xFFFF Initial path: [0] Return path: [0] Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Nov 08 02:59:33 987391 [40401960] -> __osm_state_mgr_is_sm_port_down: ERR 3308: SM port GUID unknown. Nov 08 02:59:33 987408 [0000] -> SM port is down. Nov 08 02:59:33 987485 [40401960] -> __osm_sm_state_mgr_signal_error: ERR 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state IB_SMINFO_STATE_DISCOVERING ===> -- http://www.cse.ohio-state.edu/~surs From halr at voltaire.com Mon Nov 7 18:27:27 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Nov 2005 21:27:27 -0500 Subject: [openib-general] OpenSM unable to bring up subnet In-Reply-To: <437007E2.6010704@cse.ohio-state.edu> References: <437007E2.6010704@cse.ohio-state.edu> Message-ID: <1131416846.4324.1013.camel@hal.voltaire.com> On Mon, 2005-11-07 at 21:05, Sayantan Sur wrote: > Hi, > > I am using OpenSM (svn rev 3984 and with 3882). It is unable to bring up > the subnet and "hangs". 
This behavior is observed with machines are > connected back-to-back as well as with any switch. My kernel version is > 2.6.13.1, machines are Opteron (on Tyan S295 motherboard). I have > included the log file. Maybe someone can tell if I am doing anything wrong? Is the infiniband support from 2.6.13.1 or has it been replaced with OpenIB svn of the revs indicated (or is that only OpenSM) ? If it is only OpenSM, I would recommend trying to update at least user_mad.c as there have been a number of problems which have been fixed in this. There will be some backport issues to 2.6.13.1 to deal with but they have all been discussed on the list. > [surs at ro0:~] lsmod | grep ^ib > ib_ucm 22280 0 > ib_cm 37616 1 ib_ucm > ib_uverbs 40984 0 > ib_umad 17824 2 > ib_mthca 124320 0 > ib_mad 42660 3 ib_cm,ib_umad,ib_mthca > ib_core 56320 6 > ib_ucm,ib_cm,ib_uverbs,ib_umad,ib_mthca,ib_mad > > [surs at ro0:tmp] ls -l /dev/infiniband/ > total 0 > crw-rw---- 1 root root 231, 64 2005-11-08 02:23 issm0 > crw-rw---- 1 root root 231, 65 2005-11-08 02:23 issm1 > crw-rw-rw- 1 root root 231, 224 2005-11-08 02:23 ucm0 > crw-rw---- 1 root root 231, 0 2005-11-08 02:23 umad0 > crw-rw---- 1 root root 231, 1 2005-11-08 02:23 umad1 > crw-rw-rw- 1 root root 231, 192 2005-11-08 02:23 uverbs0 > > > <==== Was opensm started with -V ? > Nov 08 02:59:33 576837 [AB454D00] -> OpenSM Rev:openib-1.1.0 > Nov 08 02:59:33 576979 [0000] -> OpenSM Rev:openib-1.1.0 > > Nov 08 02:59:33 577953 [AB454D00] -> osm_report_notice: Reporting > Generic Notice type:3 num:66 from LID:0x0000 > GID:0xfe80000000000000,0x0000000000000000 > Nov 08 02:59:33 578017 [AB454D00] -> osm_report_notice: Reporting > Generic Notice type:3 num:66 from LID:0x0000 > GID:0xfe80000000000000,0x0000000000000000 > Nov 08 02:59:33 581289 [AB454D00] -> osm_vendor_get_all_port_attr: > assign CA mthca0 port 1 guid (0x2c902004002e9) as the default port. > Nov 08 02:59:33 581326 [AB454D00] -> osm_vendor_bind: Binding to port > 0x2c902004002e9. 
> Nov 08 02:59:33 583680 [AB454D00] -> osm_vendor_bind: Binding to port > 0x2c902004002e9. > Nov 08 02:59:33 987191 [40C05960] -> umad_receiver: ERR 5409: send > completed with error (method=0x1 attr=0x11 trans_id=0x1234) -- dropping. > Nov 08 02:59:33 987227 [40C05960] -> umad_receiver: ERR 5411: DR SMP hop > ptr 0 hop count 0 DR SLID 0x0 DR DLID 0x0 > Nov 08 02:59:33 987243 [40C05960] -> __osm_sm_mad_ctrl_send_err_cb: ERR > 3113: MAD completed in error (IB_TIMEOUT). > Nov 08 02:59:33 987303 [40C05960] -> SMP dump: > base_ver................0x1 > mgmt_class..............0x81 > class_ver...............0x1 > method..................0x1 (SubnGet) > D bit...................0x0 > status..................0x0 > hop_ptr.................0x0 > hop_count...............0x0 > trans_id................0x1234 > attr_id.................0x11 (NodeInfo) > resv....................0x0 > attr_mod................0x0 > m_key...................0x0000000000000000 > dr_slid.................0xFFFF > dr_dlid.................0xFFFF > > Initial path: [0] > Return path: [0] > Reserved: [0][0][0][0][0][0][0] > > 00 00 00 00 00 00 00 00 00 00 00 00 00 > 00 00 00 > > 00 00 00 00 00 00 00 00 00 00 00 00 00 > 00 00 00 > > 00 00 00 00 00 00 00 00 00 00 00 00 00 > 00 00 00 > > 00 00 00 00 00 00 00 00 00 00 00 00 00 > 00 00 00 > > Nov 08 02:59:33 987391 [40401960] -> __osm_state_mgr_is_sm_port_down: > ERR 3308: SM port GUID unknown. Since gets are timing out, there is no response to SubnGet NodeInfo for the local node which sets the SM port GUID. -- Hal > Nov 08 02:59:33 987408 [0000] -> SM port is down. 
> > Nov 08 02:59:33 987485 [40401960] -> __osm_sm_state_mgr_signal_error: > ERR 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state > IB_SMINFO_STATE_DISCOVERING > ===> From halr at voltaire.com Mon Nov 7 18:30:53 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 8 Nov 2005 04:30:53 +0200 Subject: [openib-general] Re: OpenSM unable to bring up subnet Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AA3B@taurus.voltaire.com> On Mon, 2005-11-07 at 21:05, Sayantan Sur wrote: > Hi, > > I am using OpenSM (svn rev 3984 and with 3882). It is unable to bring up > the subnet and "hangs". This behavior is observed with machines are > connected back-to-back as well as with any switch. My kernel version is > 2.6.13.1, machines are Opteron (on Tyan S295 motherboard). I have > included the log file. Maybe someone can tell if I am doing anything wrong? Is the infiniband support from 2.6.13.1 or has it been replaced with OpenIB svn of the revs indicated (or is that only OpenSM) ? If it is only OpenSM, I would recommend trying to update at least user_mad.c as there have been a number of problems which have been fixed in this. There will be some backport issues to 2.6.13.1 to deal with but they have all been discussed on the list. > [surs at ro0:~] lsmod | grep ^ib > ib_ucm 22280 0 > ib_cm 37616 1 ib_ucm > ib_uverbs 40984 0 > ib_umad 17824 2 > ib_mthca 124320 0 > ib_mad 42660 3 ib_cm,ib_umad,ib_mthca > ib_core 56320 6 > ib_ucm,ib_cm,ib_uverbs,ib_umad,ib_mthca,ib_mad > > [surs at ro0:tmp] ls -l /dev/infiniband/ > total 0 > crw-rw---- 1 root root 231, 64 2005-11-08 02:23 issm0 > crw-rw---- 1 root root 231, 65 2005-11-08 02:23 issm1 > crw-rw-rw- 1 root root 231, 224 2005-11-08 02:23 ucm0 > crw-rw---- 1 root root 231, 0 2005-11-08 02:23 umad0 > crw-rw---- 1 root root 231, 1 2005-11-08 02:23 umad1 > crw-rw-rw- 1 root root 231, 192 2005-11-08 02:23 uverbs0 > > > <==== Was opensm started with -V ? 
> Nov 08 02:59:33 576837 [AB454D00] -> OpenSM Rev:openib-1.1.0 > Nov 08 02:59:33 576979 [0000] -> OpenSM Rev:openib-1.1.0 > > Nov 08 02:59:33 577953 [AB454D00] -> osm_report_notice: Reporting > Generic Notice type:3 num:66 from LID:0x0000 > GID:0xfe80000000000000,0x0000000000000000 > Nov 08 02:59:33 578017 [AB454D00] -> osm_report_notice: Reporting > Generic Notice type:3 num:66 from LID:0x0000 > GID:0xfe80000000000000,0x0000000000000000 > Nov 08 02:59:33 581289 [AB454D00] -> osm_vendor_get_all_port_attr: > assign CA mthca0 port 1 guid (0x2c902004002e9) as the default port. > Nov 08 02:59:33 581326 [AB454D00] -> osm_vendor_bind: Binding to port > 0x2c902004002e9. > Nov 08 02:59:33 583680 [AB454D00] -> osm_vendor_bind: Binding to port > 0x2c902004002e9. > Nov 08 02:59:33 987191 [40C05960] -> umad_receiver: ERR 5409: send > completed with error (method=0x1 attr=0x11 trans_id=0x1234) -- dropping. > Nov 08 02:59:33 987227 [40C05960] -> umad_receiver: ERR 5411: DR SMP hop > ptr 0 hop count 0 DR SLID 0x0 DR DLID 0x0 > Nov 08 02:59:33 987243 [40C05960] -> __osm_sm_mad_ctrl_send_err_cb: ERR > 3113: MAD completed in error (IB_TIMEOUT). 
> Nov 08 02:59:33 987303 [40C05960] -> SMP dump: > base_ver................0x1 > mgmt_class..............0x81 > class_ver...............0x1 > method..................0x1 (SubnGet) > D bit...................0x0 > status..................0x0 > hop_ptr.................0x0 > hop_count...............0x0 > trans_id................0x1234 > attr_id.................0x11 (NodeInfo) > resv....................0x0 > attr_mod................0x0 > m_key...................0x0000000000000000 > dr_slid.................0xFFFF > dr_dlid.................0xFFFF > > Initial path: [0] > Return path: [0] > Reserved: [0][0][0][0][0][0][0] > > 00 00 00 00 00 00 00 00 00 00 00 00 00 > 00 00 00 > > 00 00 00 00 00 00 00 00 00 00 00 00 00 > 00 00 00 > > 00 00 00 00 00 00 00 00 00 00 00 00 00 > 00 00 00 > > 00 00 00 00 00 00 00 00 00 00 00 00 00 > 00 00 00 > > Nov 08 02:59:33 987391 [40401960] -> __osm_state_mgr_is_sm_port_down: > ERR 3308: SM port GUID unknown. Since gets are timing out, there is no response to SubnGet NodeInfo for the local node which sets the SM port GUID. Anyrhing relevant in dmesg ? -- Hal > Nov 08 02:59:33 987408 [0000] -> SM port is down. > > Nov 08 02:59:33 987485 [40401960] -> __osm_sm_state_mgr_signal_error: > ERR 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state > IB_SMINFO_STATE_DISCOVERING > ===> From surs at cse.ohio-state.edu Mon Nov 7 19:51:40 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Mon, 07 Nov 2005 22:51:40 -0500 Subject: [openib-general] Re: OpenSM unable to bring up subnet In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F589AA3B@taurus.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F589AA3B@taurus.voltaire.com> Message-ID: <437020CC.5090903@cse.ohio-state.edu> Hi, Thanks for your reply! >Is the infiniband support from 2.6.13.1 or has it been replaced with >OpenIB svn of the revs indicated (or is that only OpenSM) ? 
If it is >only OpenSM, I would recommend trying to update at least user_mad.c as >there have been a number of problems which have been fixed in this. >There will be some backport issues to 2.6.13.1 to deal with but they >have all been discussed on the list. > > Yes, the IB support is from 2.6.13.1 (kernel drivers at rev 3882). I have updated the userland stuff. user_mad.c is currently at the latest revision. Do I really need to update my kernel to 2.6.14 and get the latest drivers? > >Was opensm started with -V ? > > No, here is what I get with -V: [surs at ro0:tmp] sudo opensm -V Password: OpenSM Rev:openib-1.1.0 Using default guid 0x2c902004002e9 Error from osm_opensm_bind (0x2A) Exiting SM >Since gets are timing out, there is no response to SubnGet NodeInfo for >the local node which sets the SM port GUID. > >Anyrhing relevant in dmesg ? > > Whoa! I found this: ===> Modules linked in: ib_ucm ib_cm ib_uverbs ib_umad ib_mthca ib_mad ib_core usbserial usbcore freq_table thermal processor fan button snd_pcm_oss battery ac snd_mixer_oss ipv6 evdev floppy joydev snd_intel8x0 snd_ac97_codec snd_pcm snd_timer snd st soundcore sr_mod snd_page_alloc edd sg parport_pc lp parport video1394 ohci1394 raw1394 ieee1394 capability commoncap dm_mod reiserfs ide_cd cdrom ide_disk sata_nv libata amd74xx ide_core sd_mod scsi_mod Pid: 7025, comm: ib_mad1 Not tainted 2.6.13.1-smp RIP: 0010:[] {kfree+193} RSP: 0018:ffff81003b52fdb8 EFLAGS: 00010086 RAX: 0000000000000000 RBX: 28ffff81000124c0 RCX: ffff81000000d000 RDX: 000000000004d000 RSI: ffff81003c63db80 RDI: ffff81000125a029 RBP: ffff81000b000000 R08: ffff81003b52e000 R09: 0000000000000000 R10: 00000000ffffffff R11: 0000000000000000 R12: ffff81007f963a10 R13: ffff810079558000 R14: ffff81007f963a78 R15: ffffffff882c1ad0 FS: 0000000040803960(0000) GS:ffffffff804ee800(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000000040600ed8 CR3: 000000007c6ef000 CR4: 00000000000006e0 Process ib_mad1 (pid: 7025, 
threadinfo ffff81003b52e000, task ffff810001fe5510) Stack: 0000000000000286 ffff81007f963a10 ffff810079c80380 ffffffff882bf52e ffff81003b52fe28 ffffffff882ea93f ffff810001fe5728 ffff81003c7b2d00 ffff81007f963a00 0000000000000292 Call Trace:{:ib_mad:ib_free_send_mad+14} {:ib_umad:send_handler+63} {:ib_mad:timeout_sends+379} {__wake_up+67} {worker_thread+478} {default_wake_function+0} {__wake_up_common+67} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+217} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} Code: 8b 03 3b 43 04 73 04 89 c0 eb 0a 48 89 de e8 4c fe ff ff 8b RIP {kfree+193} RSP <==== -- http://www.cse.ohio-state.edu/~surs From surs at cse.ohio-state.edu Mon Nov 7 20:19:41 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Mon, 07 Nov 2005 23:19:41 -0500 Subject: [openib-general] Re: error compiling kernel... In-Reply-To: <20051106105449.GR31134@mellanox.co.il> References: <20051106105449.GR31134@mellanox.co.il> Message-ID: <4370275D.9060001@cse.ohio-state.edu> Michael S. Tsirkin wrote: >>drivers/infiniband/core/cm.c: In function `cm_alloc_msg': >>drivers/infiniband/core/cm.c:179: error: `IB_MGMT_MAD_HDR' undeclared (first use in this function) >> >> >[... snip ...] > > >Move include/rdma, its in the way. > > I ran into the same problem while compiling with 2.6.14 kernel. I tried removing infniband/include/rdma, but still the same :-( Has anyone been able to compile gen2 with 2.6.14? Maybe I am missing something? Thanks, Sayantan. -- http://www.cse.ohio-state.edu/~surs From surs at cse.ohio-state.edu Mon Nov 7 20:35:56 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Mon, 07 Nov 2005 23:35:56 -0500 Subject: [openib-general] Re: error compiling kernel... 
In-Reply-To: <4370275D.9060001@cse.ohio-state.edu> References: <20051106105449.GR31134@mellanox.co.il> <4370275D.9060001@cse.ohio-state.edu> Message-ID: <43702B2C.4060002@cse.ohio-state.edu> Hello, > >> Move include/rdma, its in the way. >> >> > I ran into the same problem while compiling with 2.6.14 kernel. I > tried removing infniband/include/rdma, but still the same :-( > > Has anyone been able to compile gen2 with 2.6.14? Maybe I am missing > something? Just adding that if I remove infiniband/include/rdma, I get this error instead: <=== CC [M] drivers/infiniband/core/addr.o drivers/infiniband/core/addr.c:37:26: rdma/ib_addr.h: No such file or directory drivers/infiniband/core/addr.c:67: warning: `union ib_gid' declared inside parameter list drivers/infiniband/core/addr.c:67: warning: its scope is only this definition or declaration, which is probably not what you want drivers/infiniband/core/addr.c: In function `ib_translate_addr': drivers/infiniband/core/addr.c:76: error: dereferencing pointer to incomplete type drivers/infiniband/core/addr.c:76: error: dereferencing pointer to incomplete type ===> My InfiniBand config is this: # # InfiniBand support # CONFIG_INFINIBAND=m CONFIG_INFINIBAND_USER_MAD=m CONFIG_INFINIBAND_USER_ACCESS=m # CONFIG_IPATH_CORE is not set CONFIG_INFINIBAND_MTHCA=m # CONFIG_INFINIBAND_MTHCA_DEBUG is not set CONFIG_INFINIBAND_IPOIB=m # CONFIG_INFINIBAND_IPOIB_DEBUG is not set CONFIG_INFINIBAND_SDP=m # CONFIG_INFINIBAND_SDP_DEBUG is not set # CONFIG_INFINIBAND_SRP is not set Thanks, Sayantan. -- http://www.cse.ohio-state.edu/~surs From iod00d at hp.com Mon Nov 7 20:54:08 2005 From: iod00d at hp.com (Grant Grundler) Date: Mon, 7 Nov 2005 20:54:08 -0800 Subject: [openib-general] Re: error compiling kernel... 
In-Reply-To: <4370275D.9060001@cse.ohio-state.edu> References: <20051106105449.GR31134@mellanox.co.il> <4370275D.9060001@cse.ohio-state.edu> Message-ID: <20051108045408.GB18222@esmail.cup.hp.com> On Mon, Nov 07, 2005 at 11:19:41PM -0500, Sayantan Sur wrote: > I ran into the same problem while compiling with 2.6.14 kernel. I tried > removing infiniband/include/rdma, but still the same :-( Try this:
    cd /usr/src/linux-2.6.14
    mv include/rdma include/rdma-ORIG
    ln -s ../drivers/infiniband/include/rdma include/rdma
    make mrproper
    cp /boot/config-2.6.13 .config
    make oldconfig
    make -j4
Can the openib/drivers/infiniband/Makefile include a hack that checks the inodes in include/rdma/*.h == drivers/infiniband/include/rdma/*.h ? If so, it could spew the appropriate error message and halt. A FAQ entry won't be sufficient to prevent people from hitting this regularly until 2.6.15 integrates an updated include/rdma. > Has anyone been able to compile gen2 with 2.6.14? Yes - but I've run into a different problem. I'll send a separate email describing it. grant From halr at voltaire.com Mon Nov 7 21:35:05 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 8 Nov 2005 07:35:05 +0200 Subject: [openib-general] Re: OpenSM unable to bring up subnet Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AA3D@taurus.voltaire.com> On Mon, 2005-11-07 at 22:51, Sayantan Sur wrote: > Hi, > > Thanks for your reply! > > >Is the infiniband support from 2.6.13.1 or has it been replaced with > >OpenIB svn of the revs indicated (or is that only OpenSM) ? If it is > >only OpenSM, I would recommend trying to update at least user_mad.c as > >there have been a number of problems which have been fixed in this. > >There will be some backport issues to 2.6.13.1 to deal with but they > >have all been discussed on the list. > > > > > Yes, the IB support is from 2.6.13.1 (kernel drivers at rev 3882). Can you update to the latest ? I think there may have been some problems there. > I > have updated the userland stuff.
user_mad.c is currently at the latest > revision. > Do I really need to update my kernel to 2.6.14 and get the > latest drivers? I'm not sure. Mine was working with 2.6.13 and then I upgraded to 2.6.14. I saw a lot of problems but this may have been based on OpenIB svn versions during the time frame of various mad and user_mad changes which started I think with r3867 so you are definitely in that area. > > > >Was opensm started with -V ? > > > > > No, here is what I get with -V: > > [surs at ro0:tmp] sudo opensm -V > Password: > OpenSM Rev:openib-1.1.0 > > Using default guid 0x2c902004002e9 > > Error from osm_opensm_bind (0x2A) > Exiting SM > > >Since gets are timing out, there is no response to SubnGet NodeInfo for > >the local node which sets the SM port GUID. > > > >Anyrhing relevant in dmesg ? > > > > > Whoa! I found this: > > ===> > Modules linked in: ib_ucm ib_cm ib_uverbs ib_umad ib_mthca ib_mad > ib_core usbserial usbcore freq_table thermal processor fan button > snd_pcm_oss battery ac snd_mixer_oss ipv6 evdev floppy joydev > snd_intel8x0 snd_ac97_codec snd_pcm snd_timer snd st soundcore sr_mod > snd_page_alloc edd sg parport_pc lp parport video1394 ohci1394 raw1394 > ieee1394 capability commoncap dm_mod reiserfs ide_cd cdrom ide_disk > sata_nv libata amd74xx ide_core sd_mod scsi_mod > Pid: 7025, comm: ib_mad1 Not tainted 2.6.13.1-smp > RIP: 0010:[] {kfree+193} > RSP: 0018:ffff81003b52fdb8 EFLAGS: 00010086 > RAX: 0000000000000000 RBX: 28ffff81000124c0 RCX: ffff81000000d000 > RDX: 000000000004d000 RSI: ffff81003c63db80 RDI: ffff81000125a029 > RBP: ffff81000b000000 R08: ffff81003b52e000 R09: 0000000000000000 > R10: 00000000ffffffff R11: 0000000000000000 R12: ffff81007f963a10 > R13: ffff810079558000 R14: ffff81007f963a78 R15: ffffffff882c1ad0 > FS: 0000000040803960(0000) GS:ffffffff804ee800(0000) knlGS:0000000000000000 > CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b > CR2: 0000000040600ed8 CR3: 000000007c6ef000 CR4: 00000000000006e0 > Process ib_mad1 (pid: 
7025, threadinfo ffff81003b52e000, task > ffff810001fe5510) > Stack: 0000000000000286 ffff81007f963a10 ffff810079c80380 ffffffff882bf52e > ffff81003b52fe28 ffffffff882ea93f ffff810001fe5728 ffff81003c7b2d00 > ffff81007f963a00 0000000000000292 > Call Trace:{:ib_mad:ib_free_send_mad+14} > {:ib_umad:send_handler+63} > {:ib_mad:timeout_sends+379} > {__wake_up+67} > {worker_thread+478} > {default_wake_function+0} > {__wake_up_common+67} > {default_wake_function+0} > {keventd_create_kthread+0} > {worker_thread+0} > {keventd_create_kthread+0} > {kthread+217} > {child_rip+8} > {keventd_create_kthread+0} > {kthread+0} {child_rip+0} > > > Code: 8b 03 3b 43 04 73 04 89 c0 eb 0a 48 89 de e8 4c fe ff ff 8b > RIP {kfree+193} RSP > <==== That explains why there were no responses. The kernel stuff is not working right. Please update. -- Hal From johann at pathscale.com Mon Nov 7 21:38:06 2005 From: johann at pathscale.com (Johann George) Date: Mon, 7 Nov 2005 21:38:06 -0800 Subject: [openib-general] Re: error compiling kernel... In-Reply-To: <4370275D.9060001@cse.ohio-state.edu> References: <20051106105449.GR31134@mellanox.co.il> <4370275D.9060001@cse.ohio-state.edu> Message-ID: <20051108053806.GB6448@cuprite.internal.keyresearch.com> > I ran into the same problem while compiling with 2.6.14 kernel. I tried > removing infniband/include/rdma, but still the same :-( > > Has anyone been able to compile gen2 with 2.6.14? Maybe I am missing > something? Yes. We just figure out how to; thanks to Roland. Assume that KERNEL_SOURCE_TREE is where you keep the 2.6.14 kernel sources you are building and GEN2 is the location of the gen2 directory of your OpenIB repository. Here is what we did: (1) Replace the infiniband directory on 2.6.14 with the infiniband directory from the repository (as you always do). 
(2) Update the include directory in your kernel tree: cd $KERNEL_SOURCE_TREE/drivers/infiniband/include cp -p -r * ../../../include This just updated the scsi and rdma directories in include. (3) Apply the fib-frontend patch to your kernel cd $KERNEL_SOURCE_TREE patch -p1 \ <$GEN2/trunk/src/linux-kernel/patches/linux-2.6.14-fib-frontend.diff This exports a needed symbol: ip_dev_find Now if you compile the kernel the usual way, all should work. Johann From iod00d at hp.com Mon Nov 7 21:59:49 2005 From: iod00d at hp.com (Grant Grundler) Date: Mon, 7 Nov 2005 21:59:49 -0800 Subject: [openib-general] error compiling kernel... In-Reply-To: <4370275D.9060001@cse.ohio-state.edu> References: <20051106105449.GR31134@mellanox.co.il> <4370275D.9060001@cse.ohio-state.edu> Message-ID: <20051108055949.GC18222@esmail.cup.hp.com> On Mon, Nov 07, 2005 at 11:19:41PM -0500, Sayantan Sur wrote: ... > Has anyone been able to compile gen2 with 2.6.14? Yes - but I've run into a different problem. After fixing up include/rdma to point at drivers/infiniband/include/rdma, I get a symbol missing from the modules. iota:/usr/src/linux-2.6.14# make modules_install ... if [ -r System.map -a -x /sbin/depmod ]; then /sbin/depmod -ae -F System.map 2.6.14; fi WARNING: /lib/modules/2.6.14/kernel/drivers/infiniband/ulp/sdp/ib_sdp.ko needs unknown symbol ip_dev_find WARNING: /lib/modules/2.6.14/kernel/drivers/infiniband/core/ib_at.ko needs unknown symbol ip_dev_find WARNING: /lib/modules/2.6.14/kernel/drivers/infiniband/core/ib_addr.ko needs unknown symbol ip_dev_find iota:/usr/src/linux-2.6.14# ip_dev_find() is not exported by net/ipv4/fib_frontend.c. However, drivers/infiniband is the only module that needs this. CONFIG_IP_MROUTE is another configurable user but cannot be enabled as a module. Patch below adds EXPORT_SYMBOL() to fib_frontend.c. I'm not trying to assert this is the Right Thing. It's just the first obvious solution to the immediate problem. 
thanks, grant Signed-off-by: Grant Grundler --- linux-2.6.14-ORIG/net/ipv4/fib_frontend.c 2005-10-27 17:02:08.000000000 -0700 +++ linux-2.6.14/net/ipv4/fib_frontend.c 2005-11-07 21:29:22.000000000 -0800 @@ -662,3 +662,4 @@ void __init ip_fib_init(void) EXPORT_SYMBOL(inet_addr_type); EXPORT_SYMBOL(ip_rt_ioctl); +EXPORT_SYMBOL(ip_dev_find); From surs at cse.ohio-state.edu Mon Nov 7 22:06:11 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Tue, 08 Nov 2005 01:06:11 -0500 Subject: [openib-general] Re: error compiling kernel... In-Reply-To: <20051108053806.GB6448@cuprite.internal.keyresearch.com> References: <20051106105449.GR31134@mellanox.co.il> <4370275D.9060001@cse.ohio-state.edu> <20051108053806.GB6448@cuprite.internal.keyresearch.com> Message-ID: <43704053.9060107@cse.ohio-state.edu> Hello, Grant/Johann: Thanks for your replies!! >(1) Replace the infiniband directory on 2.6.14 with the infiniband directory > from the repository (as you always do). >(2) Update the include directory in your kernel tree: > cd $KERNEL_SOURCE_TREE/drivers/infiniband/include > cp -p -r * ../../../include > This just updated the scsi and rdma directories in include. >(3) Apply the fib-frontend patch to your kernel > cd $KERNEL_SOURCE_TREE > patch -p1 \ > <$GEN2/trunk/src/linux-kernel/patches/linux-2.6.14-fib-frontend.diff > This exports a needed symbol: ip_dev_find > >Now if you compile the kernel the usual way, all should work. > > Works perfectly. "$GEN2/trunk/src/linux-kernel/patches/linux-2.6.14-fib-frontend.diff" solves the symbol export problem. Thanks, Sayantan. 
>Johann > > -- http://www.cse.ohio-state.edu/~surs From surs at cse.ohio-state.edu Mon Nov 7 22:06:48 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Tue, 08 Nov 2005 01:06:48 -0500 Subject: [openib-general] Re: OpenSM unable to bring up subnet In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F589AA3D@taurus.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F589AA3D@taurus.voltaire.com> Message-ID: <43704078.6000608@cse.ohio-state.edu> Hal, > >That explains why there were no responses. The kernel stuff is not >working right. Please update. > > Thanks, it works after the update. Sayantan. -- http://www.cse.ohio-state.edu/~surs From iod00d at hp.com Mon Nov 7 22:08:41 2005 From: iod00d at hp.com (Grant Grundler) Date: Mon, 7 Nov 2005 22:08:41 -0800 Subject: [openib-general] error compiling kernel... In-Reply-To: <20051108055949.GC18222@esmail.cup.hp.com> References: <20051106105449.GR31134@mellanox.co.il> <4370275D.9060001@cse.ohio-state.edu> <20051108055949.GC18222@esmail.cup.hp.com> Message-ID: <20051108060841.GD18222@esmail.cup.hp.com> On Mon, Nov 07, 2005 at 09:59:49PM -0800, Grant Grundler wrote: > WARNING: /lib/modules/2.6.14/kernel/drivers/infiniband/ulp/sdp/ib_sdp.ko needs unknown symbol ip_dev_find > Patch below adds EXPORT_SYMBOL() to fib_frontend.c. ...never mind. Johann George just pointed out someone already has added this diff to the openib.org repository: https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/linux-2.6.14-fib-frontend.diff sorry, grant From rolandd at cisco.com Mon Nov 7 22:30:19 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 08 Nov 2005 06:30:19 +0000 Subject: [openib-general] [git patch review 6/6] [IB] mthca: fix typo in catastrophic error polling In-Reply-To: <1131431419061-26662c4d4f27ac0a@cisco.com> Message-ID: <1131431419061-a91722e21245ff50@cisco.com> Fix a typo in the rearming of the catastrophic error polling timer: we should rearm the timer as long as the stop flag is _not_ set. 
Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_catas.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) applies-to: 35085d3edccfc4f18930f45c4a1c896d041e7856 b523cfbb7bab356f66525e518f5b8c25cf0357d7 diff --git a/drivers/infiniband/hw/mthca/mthca_catas.c b/drivers/infiniband/hw/mthca/mthca_catas.c index 7ac52af..3447cd7 100644 --- a/drivers/infiniband/hw/mthca/mthca_catas.c +++ b/drivers/infiniband/hw/mthca/mthca_catas.c @@ -94,7 +94,7 @@ static void poll_catas(unsigned long dev } spin_lock_irqsave(&catas_lock, flags); - if (dev->catas_err.stop) + if (!dev->catas_err.stop) mod_timer(&dev->catas_err.timer, jiffies + MTHCA_CATAS_POLL_INTERVAL); spin_unlock_irqrestore(&catas_lock, flags); --- 0.99.9e From rolandd at cisco.com Mon Nov 7 22:30:19 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 08 Nov 2005 06:30:19 +0000 Subject: [openib-general] [git patch review 1/6] [IB] mthca: report page size capability Message-ID: <1131431419060-378986988cf168d2@cisco.com> Report the device's real page size capability in mthca_query_device(). 
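As a rough userspace illustration of the mask this patch computes (a sketch, not driver code): assuming min_page_sz is a nonzero power of two, and assuming page_size_cap is read as a bitmask with one bit per supported page size, the expression ~(u32) (min_page_sz - 1) sets every bit from the minimum page size upward. The helper names below are hypothetical and do not exist in mthca.

```c
#include <stdint.h>

/*
 * Hypothetical helper mirroring the expression used in the patch:
 * clear every bit below min_page_sz and keep every bit at or above it.
 * Assumes min_page_sz is a nonzero power of two.
 */
static uint32_t page_size_cap_from_min(uint32_t min_page_sz)
{
	return ~(uint32_t)(min_page_sz - 1);
}

/*
 * Hypothetical check, assuming the capability is interpreted as one
 * bit per supported power-of-two page size.
 */
static int page_size_supported(uint32_t cap, uint32_t page_size)
{
	return (cap & page_size) != 0;
}
```

With a 4 KB minimum page size the mask comes out as 0xfffff000, so 4096 and 8192 test as supported while 2048 does not.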
Signed-off-by: Jack Morgenstein Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_dev.h | 1 + drivers/infiniband/hw/mthca/mthca_main.c | 1 + drivers/infiniband/hw/mthca/mthca_provider.c | 1 + 3 files changed, 3 insertions(+), 0 deletions(-) applies-to: c403b29783de27e290b5d12f12054d03f96ce8b2 0f69ce1e4474e5d5e266457e8a1f4166cf71f6c7 diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h index e7e5d3b..808037f 100644 --- a/drivers/infiniband/hw/mthca/mthca_dev.h +++ b/drivers/infiniband/hw/mthca/mthca_dev.h @@ -154,6 +154,7 @@ struct mthca_limits { int reserved_mcgs; int num_pds; int reserved_pds; + u32 page_size_cap; u32 flags; u8 port_width_cap; }; diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c index 45c6328..16594d1 100644 --- a/drivers/infiniband/hw/mthca/mthca_main.c +++ b/drivers/infiniband/hw/mthca/mthca_main.c @@ -181,6 +181,7 @@ static int __devinit mthca_dev_lim(struc mdev->limits.reserved_uars = dev_lim->reserved_uars; mdev->limits.reserved_pds = dev_lim->reserved_pds; mdev->limits.port_width_cap = dev_lim->max_port_width; + mdev->limits.page_size_cap = ~(u32) (dev_lim->min_page_sz - 1); mdev->limits.flags = dev_lim->flags; /* IB_DEVICE_RESIZE_MAX_WR not supported by driver. 
diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index 6b01666..e78259b 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -90,6 +90,7 @@ static int mthca_query_device(struct ib_ memcpy(&props->node_guid, out_mad->data + 12, 8); props->max_mr_size = ~0ull; + props->page_size_cap = mdev->limits.page_size_cap; props->max_qp = mdev->limits.num_qps - mdev->limits.reserved_qps; props->max_qp_wr = mdev->limits.max_wqes; props->max_sge = mdev->limits.max_sg; --- 0.99.9e From rolandd at cisco.com Mon Nov 7 22:30:19 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 08 Nov 2005 06:30:19 +0000 Subject: [openib-general] [git patch review 2/6] [IB] umad: two small fixes In-Reply-To: <1131431419060-378986988cf168d2@cisco.com> Message-ID: <1131431419060-1bf7238b3fe2f830@cisco.com> Two small fixes for the umad module: - set kobject name for issm device properly - in ib_umad_add_one(), s is subtracted from the index i when initializing ports, so s should be subtracted from the index when freeing ports in the error path as well. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/core/user_mad.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) applies-to: ea857a2670e77bd8e8e8538f42504bcaa1a515d5 8b37b94721533f2729c79bcb6fa0bb3e2bc2f400 diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c index aed5ca2..6aefeed 100644 --- a/drivers/infiniband/core/user_mad.c +++ b/drivers/infiniband/core/user_mad.c @@ -801,7 +801,7 @@ static int ib_umad_init_port(struct ib_d goto err_class; port->sm_dev->owner = THIS_MODULE; port->sm_dev->ops = &umad_sm_fops; - kobject_set_name(&port->dev->kobj, "issm%d", port->dev_num); + kobject_set_name(&port->sm_dev->kobj, "issm%d", port->dev_num); if (cdev_add(port->sm_dev, base_dev + port->dev_num + IB_UMAD_MAX_PORTS, 1)) goto err_sm_cdev; @@ -913,7 +913,7 @@ static void ib_umad_add_one(struct ib_de err: while (--i >= s) - ib_umad_kill_port(&umad_dev->port[i]); + ib_umad_kill_port(&umad_dev->port[i - s]); kref_put(&umad_dev->ref, ib_umad_release_dev); } --- 0.99.9e From rolandd at cisco.com Mon Nov 7 22:30:19 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 08 Nov 2005 06:30:19 +0000 Subject: [openib-general] [git patch review 5/6] [IPoIB] no need to set skb->dev right before freeing skb In-Reply-To: <1131431419060-bf0b43a20ac24b6a@cisco.com> Message-ID: <1131431419061-26662c4d4f27ac0a@cisco.com> For cut-and-paste reasons, the IPoIB driver was setting skb->dev right before calling dev_kfree_skb_any(). Get rid of this. 
Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 17 ++++------------- 1 files changed, 4 insertions(+), 13 deletions(-) applies-to: 8b16a6a547ff0459044c3698ff9ac1d33c84eaf4 6277da1d7c70d2fcaeeb74c3b20fc1645da0f1fe diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 8709693..c33ed87 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -120,12 +120,8 @@ static void ipoib_mcast_free(struct ipoi if (mcast->ah) ipoib_put_ah(mcast->ah); - while (!skb_queue_empty(&mcast->pkt_queue)) { - struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); - - skb->dev = dev; - dev_kfree_skb_any(skb); - } + while (!skb_queue_empty(&mcast->pkt_queue)) + dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue)); kfree(mcast); } @@ -317,13 +313,8 @@ ipoib_mcast_sendonly_join_complete(int s IPOIB_GID_ARG(mcast->mcmember.mgid), status); /* Flush out any queued packets */ - while (!skb_queue_empty(&mcast->pkt_queue)) { - struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); - - skb->dev = dev; - - dev_kfree_skb_any(skb); - } + while (!skb_queue_empty(&mcast->pkt_queue)) + dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue)); /* Clear the busy flag so we try again */ clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); --- 0.99.9e From rolandd at cisco.com Mon Nov 7 22:30:19 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 08 Nov 2005 06:30:19 +0000 Subject: [openib-general] [git patch review 4/6] [IB] umad: avoid potential deadlock when unregistering MAD agents In-Reply-To: <1131431419060-dda308f068edb6f9@cisco.com> Message-ID: <1131431419060-bf0b43a20ac24b6a@cisco.com> ib_unregister_mad_agent() completes all pending MAD sends and waits for the agent's send_handler routine to return. umad's send_handler() calls queue_packet(), which does down_read() on the port mutex to look up the agent ID. 
This means that the port mutex cannot be held for writing while calling ib_unregister_mad_agent(), or else it will deadlock. This patch fixes all the calls to ib_unregister_mad_agent() in the umad module to avoid this deadlock. Signed-off-by: Roland Dreier --- drivers/infiniband/core/user_mad.c | 29 +++++++++++++++++------------ 1 files changed, 17 insertions(+), 12 deletions(-) applies-to: 69df1437f2ab467ff8bdd28a9282a3af19daa119 85e28883ef956d40b4aac3fc81ff39ac19713541 diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c index 6aefeed..f5ed36c 100644 --- a/drivers/infiniband/core/user_mad.c +++ b/drivers/infiniband/core/user_mad.c @@ -505,8 +505,6 @@ found: goto out; } - file->agent[agent_id] = agent; - file->mr[agent_id] = ib_get_dma_mr(agent->qp->pd, IB_ACCESS_LOCAL_WRITE); if (IS_ERR(file->mr[agent_id])) { ret = -ENOMEM; @@ -519,14 +517,15 @@ found: goto err_mr; } + file->agent[agent_id] = agent; ret = 0; + goto out; err_mr: ib_dereg_mr(file->mr[agent_id]); err: - file->agent[agent_id] = NULL; ib_unregister_mad_agent(agent); out: @@ -536,27 +535,33 @@ out: static int ib_umad_unreg_agent(struct ib_umad_file *file, unsigned long arg) { + struct ib_mad_agent *agent = NULL; + struct ib_mr *mr = NULL; u32 id; int ret = 0; - down_write(&file->port->mutex); + if (get_user(id, (u32 __user *) arg)) + return -EFAULT; - if (get_user(id, (u32 __user *) arg)) { - ret = -EFAULT; - goto out; - } + down_write(&file->port->mutex); if (id < 0 || id >= IB_UMAD_MAX_AGENTS || !file->agent[id]) { ret = -EINVAL; goto out; } - ib_dereg_mr(file->mr[id]); - ib_unregister_mad_agent(file->agent[id]); + agent = file->agent[id]; + mr = file->mr[id]; file->agent[id] = NULL; out: up_write(&file->port->mutex); + + if (agent) { + ib_unregister_mad_agent(agent); + ib_dereg_mr(mr); + } + return ret; } @@ -623,16 +628,16 @@ static int ib_umad_close(struct inode *i struct ib_umad_packet *packet, *tmp; int i; - down_write(&file->port->mutex); for (i = 0; i < 
IB_UMAD_MAX_AGENTS; ++i) if (file->agent[i]) { - ib_dereg_mr(file->mr[i]); ib_unregister_mad_agent(file->agent[i]); + ib_dereg_mr(file->mr[i]); } list_for_each_entry_safe(packet, tmp, &file->recv_list, list) kfree(packet); + down_write(&file->port->mutex); list_del(&file->port_list); up_write(&file->port->mutex); --- 0.99.9e From rolandd at cisco.com Mon Nov 7 22:30:19 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 08 Nov 2005 06:30:19 +0000 Subject: [openib-general] [git patch review 3/6] [IPoIB] add path record information in debugfs In-Reply-To: <1131431419060-1bf7238b3fe2f830@cisco.com> Message-ID: <1131431419060-dda308f068edb6f9@cisco.com> Add ibX_path files to debugfs that contain information about the IPoIB path cache. IPoIB ARP only gives GIDs, which the IPoIB driver must resolve to real IB paths through the ib_sa module. For debugging, when the ARP table looks OK but traffic isn't flowing, it's useful to be able to see if the resolution from GID to path worked. Also clean up the formatting of the existing _mcg debugfs files. 
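For readers following along outside the kernel, here is a minimal userspace sketch of the GID formatting used by these debugfs files: the 16 raw GID bytes are printed as eight colon-separated 16-bit groups, read in big-endian order, in hex without zero padding. format_gid_sketch is a hypothetical stand-in for the static helper in the patch; it assembles each group from two bytes directly instead of calling be16_to_cpu().

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Userspace sketch of the GID formatting (hypothetical name).
 * buf must hold at least sizeof "ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff"
 * bytes, i.e. 40 including the terminating NUL.
 */
static void format_gid_sketch(const uint8_t raw[16], char *buf)
{
	int i, n;

	for (n = 0, i = 0; i < 8; ++i) {
		/* Combine two bytes into one big-endian 16-bit group. */
		unsigned int group = ((unsigned int) raw[2 * i] << 8) | raw[2 * i + 1];

		/* sprintf() NUL-terminates after every group. */
		n += sprintf(buf + n, "%x", group);
		if (i < 7)
			buf[n++] = ':';	/* overwritten NUL is rewritten by the next sprintf() */
	}
}
```

For example, the raw bytes fe 80 00 00 00 00 00 00 00 02 c9 02 00 40 02 e9 are rendered as fe80:0:0:0:2:c902:40:2e9.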
Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib.h | 15 +- drivers/infiniband/ulp/ipoib/ipoib_fs.c | 179 +++++++++++++++++++++--- drivers/infiniband/ulp/ipoib/ipoib_main.c | 72 +++++++++- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 9 - drivers/infiniband/ulp/ipoib/ipoib_vlan.c | 7 - 5 files changed, 233 insertions(+), 49 deletions(-) applies-to: f681c9c9ea858c5b14f593077e7cadf9e93ad255 3924a6898a88a805d6297029f0705743cbfff587 diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 0095acc..9923a15 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -179,6 +179,7 @@ struct ipoib_dev_priv { #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG struct list_head fs_list; struct dentry *mcg_dentry; + struct dentry *path_dentry; #endif }; @@ -270,7 +271,6 @@ void ipoib_mcast_dev_flush(struct net_de #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG struct ipoib_mcast_iter *ipoib_mcast_iter_init(struct net_device *dev); -void ipoib_mcast_iter_free(struct ipoib_mcast_iter *iter); int ipoib_mcast_iter_next(struct ipoib_mcast_iter *iter); void ipoib_mcast_iter_read(struct ipoib_mcast_iter *iter, union ib_gid *gid, @@ -278,6 +278,11 @@ void ipoib_mcast_iter_read(struct ipoib_ unsigned int *queuelen, unsigned int *complete, unsigned int *send_only); + +struct ipoib_path_iter *ipoib_path_iter_init(struct net_device *dev); +int ipoib_path_iter_next(struct ipoib_path_iter *iter); +void ipoib_path_iter_read(struct ipoib_path_iter *iter, + struct ipoib_path *path); #endif int ipoib_mcast_attach(struct net_device *dev, u16 mlid, @@ -299,13 +304,13 @@ void ipoib_pkey_poll(void *dev); int ipoib_pkey_dev_delay_open(struct net_device *dev); #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG -int ipoib_create_debug_file(struct net_device *dev); -void ipoib_delete_debug_file(struct net_device *dev); +void ipoib_create_debug_files(struct net_device *dev); +void ipoib_delete_debug_files(struct net_device *dev); int 
ipoib_register_debugfs(void); void ipoib_unregister_debugfs(void); #else -static inline int ipoib_create_debug_file(struct net_device *dev) { return 0; } -static inline void ipoib_delete_debug_file(struct net_device *dev) { } +static inline void ipoib_create_debug_files(struct net_device *dev) { } +static inline void ipoib_delete_debug_files(struct net_device *dev) { } static inline int ipoib_register_debugfs(void) { return 0; } static inline void ipoib_unregister_debugfs(void) { } #endif diff --git a/drivers/infiniband/ulp/ipoib/ipoib_fs.c b/drivers/infiniband/ulp/ipoib/ipoib_fs.c index 38b150f..a5b8bb4 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_fs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_fs.c @@ -43,6 +43,18 @@ struct file_operations; static struct dentry *ipoib_root; +static void format_gid(union ib_gid *gid, char *buf) +{ + int i, n; + + for (n = 0, i = 0; i < 8; ++i) { + n += sprintf(buf + n, "%x", + be16_to_cpu(((__be16 *) gid->raw)[i])); + if (i < 7) + buf[n++] = ':'; + } +} + static void *ipoib_mcg_seq_start(struct seq_file *file, loff_t *pos) { struct ipoib_mcast_iter *iter; @@ -54,7 +66,7 @@ static void *ipoib_mcg_seq_start(struct while (n--) { if (ipoib_mcast_iter_next(iter)) { - ipoib_mcast_iter_free(iter); + kfree(iter); return NULL; } } @@ -70,7 +82,7 @@ static void *ipoib_mcg_seq_next(struct s (*pos)++; if (ipoib_mcast_iter_next(iter)) { - ipoib_mcast_iter_free(iter); + kfree(iter); return NULL; } @@ -91,28 +103,29 @@ static int ipoib_mcg_seq_show(struct seq unsigned long created; unsigned int queuelen, complete, send_only; - if (iter) { - ipoib_mcast_iter_read(iter, &mgid, &created, &queuelen, - &complete, &send_only); - - for (n = 0, i = 0; i < sizeof mgid / 2; ++i) { - n += sprintf(gid_buf + n, "%x", - be16_to_cpu(((__be16 *) mgid.raw)[i])); - if (i < sizeof mgid / 2 - 1) - gid_buf[n++] = ':'; - } - } + if (!iter) + return 0; + + ipoib_mcast_iter_read(iter, &mgid, &created, &queuelen, + &complete, &send_only); - seq_printf(file, "GID: %*s", 
-(1 + (int) sizeof gid_buf), gid_buf); + format_gid(&mgid, gid_buf); seq_printf(file, - " created: %10ld queuelen: %4d complete: %d send_only: %d\n", - created, queuelen, complete, send_only); + "GID: %s\n" + " created: %10ld\n" + " queuelen: %9d\n" + " complete: %9s\n" + " send_only: %8s\n" + "\n", + gid_buf, created, queuelen, + complete ? "yes" : "no", + send_only ? "yes" : "no"); return 0; } -static struct seq_operations ipoib_seq_ops = { +static struct seq_operations ipoib_mcg_seq_ops = { .start = ipoib_mcg_seq_start, .next = ipoib_mcg_seq_next, .stop = ipoib_mcg_seq_stop, @@ -124,7 +137,7 @@ static int ipoib_mcg_open(struct inode * struct seq_file *seq; int ret; - ret = seq_open(file, &ipoib_seq_ops); + ret = seq_open(file, &ipoib_mcg_seq_ops); if (ret) return ret; @@ -134,7 +147,7 @@ static int ipoib_mcg_open(struct inode * return 0; } -static struct file_operations ipoib_fops = { +static struct file_operations ipoib_mcg_fops = { .owner = THIS_MODULE, .open = ipoib_mcg_open, .read = seq_read, @@ -142,25 +155,139 @@ static struct file_operations ipoib_fops .release = seq_release }; -int ipoib_create_debug_file(struct net_device *dev) +static void *ipoib_path_seq_start(struct seq_file *file, loff_t *pos) +{ + struct ipoib_path_iter *iter; + loff_t n = *pos; + + iter = ipoib_path_iter_init(file->private); + if (!iter) + return NULL; + + while (n--) { + if (ipoib_path_iter_next(iter)) { + kfree(iter); + return NULL; + } + } + + return iter; +} + +static void *ipoib_path_seq_next(struct seq_file *file, void *iter_ptr, + loff_t *pos) +{ + struct ipoib_path_iter *iter = iter_ptr; + + (*pos)++; + + if (ipoib_path_iter_next(iter)) { + kfree(iter); + return NULL; + } + + return iter; +} + +static void ipoib_path_seq_stop(struct seq_file *file, void *iter_ptr) +{ + /* nothing for now */ +} + +static int ipoib_path_seq_show(struct seq_file *file, void *iter_ptr) +{ + struct ipoib_path_iter *iter = iter_ptr; + char gid_buf[sizeof 
"ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff"]; + struct ipoib_path path; + int i, n; + int rate; + + if (!iter) + return 0; + + ipoib_path_iter_read(iter, &path); + + format_gid(&path.pathrec.dgid, gid_buf); + + seq_printf(file, + "GID: %s\n" + " complete: %6s\n", + gid_buf, path.pathrec.dlid ? "yes" : "no"); + + if (path.pathrec.dlid) { + rate = ib_sa_rate_enum_to_int(path.pathrec.rate) * 25; + + seq_printf(file, + " DLID: 0x%04x\n" + " SL: %12d\n" + " rate: %*d%s Gb/sec\n", + be16_to_cpu(path.pathrec.dlid), + path.pathrec.sl, + 10 - ((rate % 10) ? 2 : 0), + rate / 10, rate % 10 ? ".5" : ""); + } + + seq_putc(file, '\n'); + + return 0; +} + +static struct seq_operations ipoib_path_seq_ops = { + .start = ipoib_path_seq_start, + .next = ipoib_path_seq_next, + .stop = ipoib_path_seq_stop, + .show = ipoib_path_seq_show, +}; + +static int ipoib_path_open(struct inode *inode, struct file *file) +{ + struct seq_file *seq; + int ret; + + ret = seq_open(file, &ipoib_path_seq_ops); + if (ret) + return ret; + + seq = file->private_data; + seq->private = inode->u.generic_ip; + + return 0; +} + +static struct file_operations ipoib_path_fops = { + .owner = THIS_MODULE, + .open = ipoib_path_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release +}; + +void ipoib_create_debug_files(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); - char name[IFNAMSIZ + sizeof "_mcg"]; + char name[IFNAMSIZ + sizeof "_path"]; snprintf(name, sizeof name, "%s_mcg", dev->name); - priv->mcg_dentry = debugfs_create_file(name, S_IFREG | S_IRUGO, - ipoib_root, dev, &ipoib_fops); - - return priv->mcg_dentry ? 
0 : -ENOMEM; + ipoib_root, dev, &ipoib_mcg_fops); + if (!priv->mcg_dentry) + ipoib_warn(priv, "failed to create mcg debug file\n"); + + snprintf(name, sizeof name, "%s_path", dev->name); + priv->path_dentry = debugfs_create_file(name, S_IFREG | S_IRUGO, + ipoib_root, dev, &ipoib_path_fops); + if (!priv->path_dentry) + ipoib_warn(priv, "failed to create path debug file\n"); } -void ipoib_delete_debug_file(struct net_device *dev) +void ipoib_delete_debug_files(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); if (priv->mcg_dentry) debugfs_remove(priv->mcg_dentry); + if (priv->path_dentry) + debugfs_remove(priv->path_dentry); } int ipoib_register_debugfs(void) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index ce02962..2fa3075 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -58,6 +58,11 @@ module_param_named(debug_level, ipoib_de MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0"); #endif +struct ipoib_path_iter { + struct net_device *dev; + struct ipoib_path path; +}; + static const u8 ipv4_bcast_addr[] = { 0x00, 0xff, 0xff, 0xff, 0xff, 0x12, 0x40, 0x1b, 0x00, 0x00, 0x00, 0x00, @@ -250,6 +255,64 @@ static void path_free(struct net_device kfree(path); } +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG + +struct ipoib_path_iter *ipoib_path_iter_init(struct net_device *dev) +{ + struct ipoib_path_iter *iter; + + iter = kmalloc(sizeof *iter, GFP_KERNEL); + if (!iter) + return NULL; + + iter->dev = dev; + memset(iter->path.pathrec.dgid.raw, 0, 16); + + if (ipoib_path_iter_next(iter)) { + kfree(iter); + return NULL; + } + + return iter; +} + +int ipoib_path_iter_next(struct ipoib_path_iter *iter) +{ + struct ipoib_dev_priv *priv = netdev_priv(iter->dev); + struct rb_node *n; + struct ipoib_path *path; + int ret = 1; + + spin_lock_irq(&priv->lock); + + n = rb_first(&priv->path_tree); + + while (n) { + path = rb_entry(n, struct ipoib_path, 
rb_node); + + if (memcmp(iter->path.pathrec.dgid.raw, path->pathrec.dgid.raw, + sizeof (union ib_gid)) < 0) { + iter->path = *path; + ret = 0; + break; + } + + n = rb_next(n); + } + + spin_unlock_irq(&priv->lock); + + return ret; +} + +void ipoib_path_iter_read(struct ipoib_path_iter *iter, + struct ipoib_path *path) +{ + *path = iter->path; +} + +#endif /* CONFIG_INFINIBAND_IPOIB_DEBUG */ + void ipoib_flush_paths(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -763,7 +826,7 @@ void ipoib_dev_cleanup(struct net_device { struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv, *tcpriv; - ipoib_delete_debug_file(dev); + ipoib_delete_debug_files(dev); /* Delete any child interfaces first */ list_for_each_entry_safe(cpriv, tcpriv, &priv->child_intfs, list) { @@ -972,8 +1035,7 @@ static struct net_device *ipoib_add_port goto register_failed; } - if (ipoib_create_debug_file(priv->dev)) - goto debug_failed; + ipoib_create_debug_files(priv->dev); if (ipoib_add_pkey_attr(priv->dev)) goto sysfs_failed; @@ -987,9 +1049,7 @@ static struct net_device *ipoib_add_port return priv->dev; sysfs_failed: - ipoib_delete_debug_file(priv->dev); - -debug_failed: + ipoib_delete_debug_files(priv->dev); unregister_netdev(priv->dev); register_failed: diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 3ecf78a..8709693 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -928,21 +928,16 @@ struct ipoib_mcast_iter *ipoib_mcast_ite return NULL; iter->dev = dev; - memset(iter->mgid.raw, 0, sizeof iter->mgid); + memset(iter->mgid.raw, 0, 16); if (ipoib_mcast_iter_next(iter)) { - ipoib_mcast_iter_free(iter); + kfree(iter); return NULL; } return iter; } -void ipoib_mcast_iter_free(struct ipoib_mcast_iter *iter) -{ - kfree(iter); -} - int ipoib_mcast_iter_next(struct ipoib_mcast_iter *iter) { struct ipoib_dev_priv *priv = netdev_priv(iter->dev); 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c index 332d730..d280b34 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c @@ -113,8 +113,7 @@ int ipoib_vlan_add(struct net_device *pd priv->parent = ppriv->dev; - if (ipoib_create_debug_file(priv->dev)) - goto debug_failed; + ipoib_create_debug_files(priv->dev); if (ipoib_add_pkey_attr(priv->dev)) goto sysfs_failed; @@ -130,9 +129,7 @@ int ipoib_vlan_add(struct net_device *pd return 0; sysfs_failed: - ipoib_delete_debug_file(priv->dev); - -debug_failed: + ipoib_delete_debug_files(priv->dev); unregister_netdev(priv->dev); register_failed: --- 0.99.9e From yael at mellanox.co.il Mon Nov 7 22:37:44 2005 From: yael at mellanox.co.il (Yael Kalka) Date: Tue, 8 Nov 2005 08:37:44 +0200 Subject: [openib-general] RE: [PATCH] Opensm - exiting issues Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30E23B8@mtlexch01.mtl.com> Hi Hal, Just another comment - when running: % while test $? = 0; do opensm -V -o; done Try to run from a different port: % osmtest -f f This causes flooding of MADs to opensm, and that is usually the cause of the exiting problem. Yael -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Monday, November 07, 2005 10:07 PM To: Eitan Zahavi Cc: Yael Kalka; openib-general at openib.org Subject: RE: [PATCH] Opensm - exiting issues On Mon, 2005-11-07 at 09:42, Eitan Zahavi wrote: > Hi Hal, > > I will answer for Yael as she already left the office. > > The way to reproduce the "stuck" case is to run in bash: > % while test $? = 0; do opensm -V -o; done > > The symptom we see is that OpenSM sort of exits but the process stays > active (not even defunct). No way to kill it. It seems like one of the > threads gets caught in the middle of ioctl or something. To be able to > run OpenSM after this we need to reboot the machine. 
> > We avoid it by not issuing umad_unregister and umad_close_port This part of the patch is not needed with the fix to user_mad put in by Roland based on the issue (and patch) from Michael on user_mad deadlock. I've been running your test from over 30 minutes now without a hiccup. It used to fail pretty quickly. -- Hal > > Eitan Zahavi > Design Technology Director > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Monday, November 07, 2005 4:21 PM > > To: yael at mellanox.co.il > > Cc: openib-general at openib.org; eitan at mellanox.co.il > > Subject: Re: [PATCH] Opensm - exiting issues > > > > Hi Yael, > > > > On Mon, 2005-11-07 at 08:25, Yael Kalka wrote: > > > Hi Hal, > > > > > > There was a problem when running opensm with -o option, that caused > > > the opensm to always exit with segfault, due to object destruction > > > ordering. Also - there is the known issue of exiting opensm. We've > > > done some clearing to the exiting code. The following patch fixes > most > > > of it. > > > > I applied this part of the patch with some cosmetic changes in > > osm_vendor_ibumad.c. > > > > > In the current code we saw that sometimes opensm gets "stuck" on > exit, > > > and causes the machine to get stuck too - resulting in need for > > > rebooting. In the following patch fixes most of it. > > > We did run (in the patch) into rare cases where opensm exits with an > > > error, but at least it exits without stucking the machine... > > > > Is there a reliable way to recreate machine "stuck" ? What exactly do > > you mean by this ? > > > > All umad_unregister does is some validation, a table lookup, and issue > > the ioctl to unregister the MAD agent. Not explictly unregistering the > > agent(s) does not cause any harm as when the fd is closed, this will > > occur as part of the cleanup. 
> > > > -- Hal > From yael at mellanox.co.il Tue Nov 8 02:12:03 2005 From: yael at mellanox.co.il (Yael Kalka) Date: Tue, 8 Nov 2005 12:12:03 +0200 Subject: [openib-general] RE: [PATCH] Opensm - exiting issues Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30E23BB@mtlexch01.mtl.com> Hi Hal, It seems that there is still another race somewhere. The situation is much better. I had to run the testing for ~45 minutes in order to see the problem. I ran on a loopback machine the following: a) from port #2 % while test $? = 0; do opensm -o -e; done b) from port #1 % while test 1 = 1; do osmtest -f f; done The process is hang. When getting the process with ps -efww I get: root 27939 27938 0 11:40 pts/0 00:00:00 [opensm] root 27938 8001 0 11:40 pts/0 00:00:00 usr/bin/opensm -o -e -g 0x2c902000017a2 Machine description: SuSE Linux 9.3 (i586) 2.6.11.4-20a-smp lsmod reports the following: Module Size Used by subfs 12416 1 nvram 13576 0 usbserial 34024 0 autofs4 23556 2 speedstep_lib 8324 0 freq_table 8832 0 thermal 18184 0 processor 28648 1 thermal ipv6 273920 20 fan 8836 0 button 11024 0 battery 14084 0 ac 9220 0 edd 14560 0 evdev 12928 0 joydev 13888 0 st 43676 0 sr_mod 21284 0 ib_ipoib 44804 0 ib_sa 16652 1 ib_ipoib ib_uverbs 37416 0 ib_umad 19376 2 af_packet 26760 4 sg 42912 0 ib_mthca 119452 0 ib_mad 41620 3 ib_sa,ib_umad,ib_mthca ib_core 48000 6 ib_ipoib,ib_sa,ib_uverbs,ib_umad,ib_mthca,ib_mad e1000 91316 0 e100 43392 0 mii 9088 1 e100 i2c_i801 12556 0 i2c_core 26624 1 i2c_i801 uhci_hcd 37008 0 usbcore 121688 3 usbserial,uhci_hcd parport_pc 44356 0 lp 15396 0 parport 40392 2 parport_pc,lp video1394 22860 0 ohci1394 37508 1 video1394 raw1394 34540 0 ieee1394 108472 3 video1394,ohci1394,raw1394 capability 7224 0 nls_iso8859_1 8064 1 nls_cp437 9728 1 vfat 17792 1 fat 43804 1 vfat dm_mod 64768 0 ext3 145032 2 jbd 73764 1 ext3 ide_cd 44036 0 cdrom 42784 2 sr_mod,ide_cd ide_disk 22400 0 aic7xxx 200632 4 piix 14468 0 [permanent] ide_core 131904 3 ide_cd,ide_disk,piix sd_mod 23168 5 
scsi_mod 136008 5 st,sr_mod,sg,aic7xxx,sd_mod Thanks, Yael -----Original Message----- From: Yael Kalka Sent: Tuesday, November 08, 2005 8:38 AM To: 'Hal Rosenstock'; Eitan Zahavi Cc: Yael Kalka; openib-general at openib.org Subject: RE: [PATCH] Opensm - exiting issues Hi Hal, Just another comment - when running: % while test $? = 0; do opensm -V -o; done Try to run from a different port: % osmtest -f f This causes fludding of mads to the opensm, and that usually is the cause for the exiting problem. Yael -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Monday, November 07, 2005 10:07 PM To: Eitan Zahavi Cc: Yael Kalka; openib-general at openib.org Subject: RE: [PATCH] Opensm - exiting issues On Mon, 2005-11-07 at 09:42, Eitan Zahavi wrote: > Hi Hal, > > I will answer for Yael as she already left the office. > > The way to reproduce the "stuck" case is to run in bash: > % while test $? = 0; do opensm -V -o; done > > The symptom we see is that OpenSM sort of exists but the process stay > active (not even defunct). No way to kill it. It seems like one of the > threads gets caught in the middle of ioctl or something. To be able to > run OpenSM after this we need to reboot the machine. > > We avoid it by not issuing umad_unregister and umad_close_port This part of the patch is not needed with the fix to user_mad put in by Roland based on the issue (and patch) from Michael on user_mad deadlock. I've been running your test from over 30 minutes now without a hiccup. It used to fail pretty quickly. -- Hal > > Eitan Zahavi > Design Technology Director > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. 
Box 586 Yokneam 20692 ISRAEL > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Monday, November 07, 2005 4:21 PM > > To: yael at mellanox.co.il > > Cc: openib-general at openib.org; eitan at mellanox.co.il > > Subject: Re: [PATCH] Opensm - exiting issues > > > > Hi Yael, > > > > On Mon, 2005-11-07 at 08:25, Yael Kalka wrote: > > > Hi Hal, > > > > > > There was a problem when running opensm with -o option, that caused > > > the opensm to always exit with segfault, due to object destruction > > > ordering. Also - there is the known issue of exiting opensm. We've > > > done some clearing to the exiting code. The following patch fixes > most > > > of it. > > > > I applied this part of the patch with some cosmetic changes in > > osm_vendor_ibumad.c. > > > > > In the current code we saw that sometimes opensm gets "stuck" on > exit, > > > and causes the machine to get stuck too - resulting in need for > > > rebooting. In the following patch fixes most of it. > > > We did run (in the patch) into rare cases where opensm exits with an > > > error, but at least it exits without stucking the machine... > > > > Is there a reliable way to recreate machine "stuck" ? What exactly do > > you mean by this ? > > > > All umad_unregister does is some validation, a table lookup, and issue > > the ioctl to unregister the MAD agent. Not explictly unregistering the > > agent(s) does not cause any harm as when the fd is closed, this will > > occur as part of the cleanup. > > > > -- Hal > From halr at voltaire.com Tue Nov 8 03:52:54 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 8 Nov 2005 13:52:54 +0200 Subject: [openib-general] RE: [PATCH] Opensm - exiting issues Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AA3F@taurus.voltaire.com> Hi Yael, On Tue, 2005-11-08 at 05:12, Yael Kalka wrote: > Hi Hal, > > It seems that there is still another race somewhere. > The situation is much better. 
I had to run the testing for > ~45 minutes in order to see the problem. Is your filesystem full ? What is the file size of the log when you hit this ? Is this a max file size issue ? -- Hal > I ran on a loopback machine the following: > a) from port #2 > % while test $? = 0; do opensm -o -e; done > b) from port #1 > % while test 1 = 1; do osmtest -f f; done > > The process is hang. When getting the process with ps -efww I get: > root 27939 27938 0 11:40 pts/0 00:00:00 [opensm] > root 27938 8001 0 11:40 pts/0 00:00:00 usr/bin/opensm -o -e -g > 0x2c902000017a2 > > Machine description: SuSE Linux 9.3 (i586) 2.6.11.4-20a-smp > > lsmod reports the following: > Module Size Used by > subfs 12416 1 > nvram 13576 0 > usbserial 34024 0 > autofs4 23556 2 > speedstep_lib 8324 0 > freq_table 8832 0 > thermal 18184 0 > processor 28648 1 thermal > ipv6 273920 20 > fan 8836 0 > button 11024 0 > battery 14084 0 > ac 9220 0 > edd 14560 0 > evdev 12928 0 > joydev 13888 0 > st 43676 0 > sr_mod 21284 0 > ib_ipoib 44804 0 > ib_sa 16652 1 ib_ipoib > ib_uverbs 37416 0 > ib_umad 19376 2 > af_packet 26760 4 > sg 42912 0 > ib_mthca 119452 0 > ib_mad 41620 3 ib_sa,ib_umad,ib_mthca > ib_core 48000 6 > ib_ipoib,ib_sa,ib_uverbs,ib_umad,ib_mthca,ib_mad > e1000 91316 0 > e100 43392 0 > mii 9088 1 e100 > i2c_i801 12556 0 > i2c_core 26624 1 i2c_i801 > uhci_hcd 37008 0 > usbcore 121688 3 usbserial,uhci_hcd > parport_pc 44356 0 > lp 15396 0 > parport 40392 2 parport_pc,lp > video1394 22860 0 > ohci1394 37508 1 video1394 > raw1394 34540 0 > ieee1394 108472 3 video1394,ohci1394,raw1394 > capability 7224 0 > nls_iso8859_1 8064 1 > nls_cp437 9728 1 > vfat 17792 1 > fat 43804 1 vfat > dm_mod 64768 0 > ext3 145032 2 > jbd 73764 1 ext3 > ide_cd 44036 0 > cdrom 42784 2 sr_mod,ide_cd > ide_disk 22400 0 > aic7xxx 200632 4 > piix 14468 0 [permanent] > ide_core 131904 3 ide_cd,ide_disk,piix > sd_mod 23168 5 > scsi_mod 136008 5 st,sr_mod,sg,aic7xxx,sd_mod > > Thanks, > Yael > > > > -----Original Message----- > 
From: Yael Kalka > Sent: Tuesday, November 08, 2005 8:38 AM > To: 'Hal Rosenstock'; Eitan Zahavi > Cc: Yael Kalka; openib-general at openib.org > Subject: RE: [PATCH] Opensm - exiting issues > > > Hi Hal, > > Just another comment - when running: > % while test $? = 0; do opensm -V -o; done > Try to run from a different port: > % osmtest -f f > This causes fludding of mads to the opensm, and that usually is > the cause for the exiting problem. > > Yael > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Monday, November 07, 2005 10:07 PM > To: Eitan Zahavi > Cc: Yael Kalka; openib-general at openib.org > Subject: RE: [PATCH] Opensm - exiting issues > > > On Mon, 2005-11-07 at 09:42, Eitan Zahavi wrote: > > Hi Hal, > > > > I will answer for Yael as she already left the office. > > > > The way to reproduce the "stuck" case is to run in bash: > > % while test $? = 0; do opensm -V -o; done > > > > The symptom we see is that OpenSM sort of exists but the process stay > > active (not even defunct). No way to kill it. It seems like one of the > > threads gets caught in the middle of ioctl or something. To be able to > > run OpenSM after this we need to reboot the machine. > > > > We avoid it by not issuing umad_unregister and umad_close_port > > This part of the patch is not needed with the fix to user_mad put in by > Roland based on the issue (and patch) from Michael on user_mad deadlock. > > I've been running your test from over 30 minutes now without a hiccup. > It used to fail pretty quickly. > > -- Hal > > > > > Eitan Zahavi > > Design Technology Director > > Mellanox Technologies LTD > > Tel:+972-4-9097208 > > Fax:+972-4-9593245 > > P.O. 
Box 586 Yokneam 20692 ISRAEL > > > > > > > -----Original Message----- > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > Sent: Monday, November 07, 2005 4:21 PM > > > To: yael at mellanox.co.il > > > Cc: openib-general at openib.org; eitan at mellanox.co.il > > > Subject: Re: [PATCH] Opensm - exiting issues > > > > > > Hi Yael, > > > > > > On Mon, 2005-11-07 at 08:25, Yael Kalka wrote: > > > > Hi Hal, > > > > > > > > There was a problem when running opensm with -o option, that > caused > > > > the opensm to always exit with segfault, due to object destruction > > > > ordering. Also - there is the known issue of exiting opensm. We've > > > > done some clearing to the exiting code. The following patch fixes > > most > > > > of it. > > > > > > I applied this part of the patch with some cosmetic changes in > > > osm_vendor_ibumad.c. > > > > > > > In the current code we saw that sometimes opensm gets "stuck" on > > exit, > > > > and causes the machine to get stuck too - resulting in need for > > > > rebooting. In the following patch fixes most of it. > > > > We did run (in the patch) into rare cases where opensm exits with > an > > > > error, but at least it exits without stucking the machine... > > > > > > Is there a reliable way to recreate machine "stuck" ? What exactly > do > > > you mean by this ? > > > > > > All umad_unregister does is some validation, a table lookup, and > issue > > > the ioctl to unregister the MAD agent. Not explictly unregistering > the > > > agent(s) does not cause any harm as when the fd is closed, this will > > > occur as part of the cleanup. > > > > > > -- Hal > > From yael at mellanox.co.il Tue Nov 8 04:02:17 2005 From: yael at mellanox.co.il (Yael Kalka) Date: Tue, 8 Nov 2005 14:02:17 +0200 Subject: [openib-general] RE: [PATCH] Opensm - exiting issues Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30E23C2@mtlexch01.mtl.com> Hi Hal, The filesystem is not full, since I am using opensm with -e and with no verbosity. 
swlab53:~ # df -k /var/log/ Filesystem 1K-blocks Used Available Use% Mounted on /dev/sda3 8262068 4705692 3136680 61% / Yael -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Tuesday, November 08, 2005 1:53 PM To: yael at mellanox.co.il Cc: openib-general at openib.org; eitan at mellanox.co.il Subject: RE: [PATCH] Opensm - exiting issues Hi Yael, On Tue, 2005-11-08 at 05:12, Yael Kalka wrote: > Hi Hal, > > It seems that there is still another race somewhere. > The situation is much better. I had to run the testing for > ~45 minutes in order to see the problem. Is your filesystem full ? What is the file size of the log when you hit this ? Is this a max file size issue ? -- Hal > I ran on a loopback machine the following: > a) from port #2 > % while test $? = 0; do opensm -o -e; done > b) from port #1 > % while test 1 = 1; do osmtest -f f; done > > The process is hang. When getting the process with ps -efww I get: > root 27939 27938 0 11:40 pts/0 00:00:00 [opensm] > root 27938 8001 0 11:40 pts/0 00:00:00 usr/bin/opensm -o -e -g > 0x2c902000017a2 > > Machine description: SuSE Linux 9.3 (i586) 2.6.11.4-20a-smp > > lsmod reports the following: > Module Size Used by > subfs 12416 1 > nvram 13576 0 > usbserial 34024 0 > autofs4 23556 2 > speedstep_lib 8324 0 > freq_table 8832 0 > thermal 18184 0 > processor 28648 1 thermal > ipv6 273920 20 > fan 8836 0 > button 11024 0 > battery 14084 0 > ac 9220 0 > edd 14560 0 > evdev 12928 0 > joydev 13888 0 > st 43676 0 > sr_mod 21284 0 > ib_ipoib 44804 0 > ib_sa 16652 1 ib_ipoib > ib_uverbs 37416 0 > ib_umad 19376 2 > af_packet 26760 4 > sg 42912 0 > ib_mthca 119452 0 > ib_mad 41620 3 ib_sa,ib_umad,ib_mthca > ib_core 48000 6 > ib_ipoib,ib_sa,ib_uverbs,ib_umad,ib_mthca,ib_mad > e1000 91316 0 > e100 43392 0 > mii 9088 1 e100 > i2c_i801 12556 0 > i2c_core 26624 1 i2c_i801 > uhci_hcd 37008 0 > usbcore 121688 3 usbserial,uhci_hcd > parport_pc 44356 0 > lp 15396 0 > parport 40392 2 parport_pc,lp > 
video1394 22860 0 > ohci1394 37508 1 video1394 > raw1394 34540 0 > ieee1394 108472 3 video1394,ohci1394,raw1394 > capability 7224 0 > nls_iso8859_1 8064 1 > nls_cp437 9728 1 > vfat 17792 1 > fat 43804 1 vfat > dm_mod 64768 0 > ext3 145032 2 > jbd 73764 1 ext3 > ide_cd 44036 0 > cdrom 42784 2 sr_mod,ide_cd > ide_disk 22400 0 > aic7xxx 200632 4 > piix 14468 0 [permanent] > ide_core 131904 3 ide_cd,ide_disk,piix > sd_mod 23168 5 > scsi_mod 136008 5 st,sr_mod,sg,aic7xxx,sd_mod > > Thanks, > Yael > > > > -----Original Message----- > From: Yael Kalka > Sent: Tuesday, November 08, 2005 8:38 AM > To: 'Hal Rosenstock'; Eitan Zahavi > Cc: Yael Kalka; openib-general at openib.org > Subject: RE: [PATCH] Opensm - exiting issues > > > Hi Hal, > > Just another comment - when running: > % while test $? = 0; do opensm -V -o; done > Try to run from a different port: > % osmtest -f f > This causes fludding of mads to the opensm, and that usually is > the cause for the exiting problem. > > Yael > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Monday, November 07, 2005 10:07 PM > To: Eitan Zahavi > Cc: Yael Kalka; openib-general at openib.org > Subject: RE: [PATCH] Opensm - exiting issues > > > On Mon, 2005-11-07 at 09:42, Eitan Zahavi wrote: > > Hi Hal, > > > > I will answer for Yael as she already left the office. > > > > The way to reproduce the "stuck" case is to run in bash: > > % while test $? = 0; do opensm -V -o; done > > > > The symptom we see is that OpenSM sort of exists but the process stay > > active (not even defunct). No way to kill it. It seems like one of the > > threads gets caught in the middle of ioctl or something. To be able to > > run OpenSM after this we need to reboot the machine. > > > > We avoid it by not issuing umad_unregister and umad_close_port > > This part of the patch is not needed with the fix to user_mad put in by > Roland based on the issue (and patch) from Michael on user_mad deadlock. 
> > I've been running your test from over 30 minutes now without a hiccup. > It used to fail pretty quickly. > > -- Hal > > > > > Eitan Zahavi > > Design Technology Director > > Mellanox Technologies LTD > > Tel:+972-4-9097208 > > Fax:+972-4-9593245 > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > -----Original Message----- > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > Sent: Monday, November 07, 2005 4:21 PM > > > To: yael at mellanox.co.il > > > Cc: openib-general at openib.org; eitan at mellanox.co.il > > > Subject: Re: [PATCH] Opensm - exiting issues > > > > > > Hi Yael, > > > > > > On Mon, 2005-11-07 at 08:25, Yael Kalka wrote: > > > > Hi Hal, > > > > > > > > There was a problem when running opensm with -o option, that > caused > > > > the opensm to always exit with segfault, due to object destruction > > > > ordering. Also - there is the known issue of exiting opensm. We've > > > > done some clearing to the exiting code. The following patch fixes > > most > > > > of it. > > > > > > I applied this part of the patch with some cosmetic changes in > > > osm_vendor_ibumad.c. > > > > > > > In the current code we saw that sometimes opensm gets "stuck" on > > exit, > > > > and causes the machine to get stuck too - resulting in need for > > > > rebooting. In the following patch fixes most of it. > > > > We did run (in the patch) into rare cases where opensm exits with > an > > > > error, but at least it exits without stucking the machine... > > > > > > Is there a reliable way to recreate machine "stuck" ? What exactly > do > > > you mean by this ? > > > > > > All umad_unregister does is some validation, a table lookup, and > issue > > > the ioctl to unregister the MAD agent. Not explictly unregistering > the > > > agent(s) does not cause any harm as when the fd is closed, this will > > > occur as part of the cleanup. 
> > > > > > -- Hal > >

From dotanb at mellanox.co.il Tue Nov 8 04:24:03 2005
From: dotanb at mellanox.co.il (Dotan Barak)
Date: Tue, 8 Nov 2005 14:24:03 +0200
Subject: [openib-general] some bugs that can be found using the gen2_basic in the contrib/mellanox folder
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E372A9CA@mtlexch01.mtl.com>

Hi Roland,

Please see my comments below.

> - Several of the tests are buggy. See the patch below at least.

You are absolutely right: the tests should be free of bugs. We will
check the issues you raised in the patch and make sure that there are
no more bugs in the tests.

> - It would be much more useful if the COMPARE() macro printed the
> expected and actual value on failure.

This macro will be changed to print the values in case of failure.

> - Similarly, other macros should probably also print more context.
> For example, in something like:
>
> CHECK_PTR("ibv_create_qp", qp[i], goto cleanup);
>
> I would probably want to know the value of i on failure.

We will see how we can add debug prints in those cases.

> - I don't believe some of the tests are really valid. For example,
> the max number of QPs doesn't have to be precisely correct -- no
> valid app is going to depend on being able to create exactly that
> number of QPs and no more.

This is true, but as black-box testing we wanted to check that there
isn't any array overrun or any other bug when one tries to create more
resources than the HCA/driver supports.

> - In any case, I'm not convinced that this sort of negative testing
> is the most valuable thing to focus on right now. I think it would
> be better to have regression tests of basic functionality (sends,
> receives, RDMA, CQ polling, etc) and stress tests before testing
> whether a buggy app will get the right error value when passing
> invalid parameters.
in this test (gen2_basic) we check many tests cases (good + bad flows) to check the driver (to check that the return value indicates that there is/isn't any error and to check that there isn't any seg fault or kernel crash). We started to write several tests (basic functionality + stress tests) and soon we will check in the tests to the svn (and the changes that you suggested). thank you, feedback is always welcome ... Dotan From halr at voltaire.com Tue Nov 8 04:49:28 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 8 Nov 2005 14:49:28 +0200 Subject: [openib-general] RE: [PATCH] Opensm - exiting issues Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AA40@taurus.voltaire.com> On Tue, 2005-11-08 at 07:02, Yael Kalka wrote: > Hi Hal, > > The filesystem is not full, since I am using opensm with -e and with no verbosity. > > swlab53:~ # df -k /var/log/ > Filesystem 1K-blocks Used Available Use% Mounted on > /dev/sda3 8262068 4705692 3136680 61% / How large is the osm.log file (ls -lasg) when this occurs ? -- Hal > > Yael > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Tuesday, November 08, 2005 1:53 PM > To: yael at mellanox.co.il > Cc: openib-general at openib.org; eitan at mellanox.co.il > Subject: RE: [PATCH] Opensm - exiting issues > > > Hi Yael, > > On Tue, 2005-11-08 at 05:12, Yael Kalka wrote: > > Hi Hal, > > > > It seems that there is still another race somewhere. > > The situation is much better. I had to run the testing for > > ~45 minutes in order to see the problem. > > Is your filesystem full ? What is the file size of the log when you hit > this ? Is this a max file size issue ? > > -- Hal > > > I ran on a loopback machine the following: > > a) from port #2 > > % while test $? = 0; do opensm -o -e; done > > b) from port #1 > > % while test 1 = 1; do osmtest -f f; done > > > > The process is hang. 
When getting the process with ps -efww I get: > > root 27939 27938 0 11:40 pts/0 00:00:00 [opensm] > > root 27938 8001 0 11:40 pts/0 00:00:00 usr/bin/opensm -o -e -g > > 0x2c902000017a2 > > > > Machine description: SuSE Linux 9.3 (i586) 2.6.11.4-20a-smp > > > > lsmod reports the following: > > Module Size Used by > > subfs 12416 1 > > nvram 13576 0 > > usbserial 34024 0 > > autofs4 23556 2 > > speedstep_lib 8324 0 > > freq_table 8832 0 > > thermal 18184 0 > > processor 28648 1 thermal > > ipv6 273920 20 > > fan 8836 0 > > button 11024 0 > > battery 14084 0 > > ac 9220 0 > > edd 14560 0 > > evdev 12928 0 > > joydev 13888 0 > > st 43676 0 > > sr_mod 21284 0 > > ib_ipoib 44804 0 > > ib_sa 16652 1 ib_ipoib > > ib_uverbs 37416 0 > > ib_umad 19376 2 > > af_packet 26760 4 > > sg 42912 0 > > ib_mthca 119452 0 > > ib_mad 41620 3 ib_sa,ib_umad,ib_mthca > > ib_core 48000 6 > > ib_ipoib,ib_sa,ib_uverbs,ib_umad,ib_mthca,ib_mad > > e1000 91316 0 > > e100 43392 0 > > mii 9088 1 e100 > > i2c_i801 12556 0 > > i2c_core 26624 1 i2c_i801 > > uhci_hcd 37008 0 > > usbcore 121688 3 usbserial,uhci_hcd > > parport_pc 44356 0 > > lp 15396 0 > > parport 40392 2 parport_pc,lp > > video1394 22860 0 > > ohci1394 37508 1 video1394 > > raw1394 34540 0 > > ieee1394 108472 3 video1394,ohci1394,raw1394 > > capability 7224 0 > > nls_iso8859_1 8064 1 > > nls_cp437 9728 1 > > vfat 17792 1 > > fat 43804 1 vfat > > dm_mod 64768 0 > > ext3 145032 2 > > jbd 73764 1 ext3 > > ide_cd 44036 0 > > cdrom 42784 2 sr_mod,ide_cd > > ide_disk 22400 0 > > aic7xxx 200632 4 > > piix 14468 0 [permanent] > > ide_core 131904 3 ide_cd,ide_disk,piix > > sd_mod 23168 5 > > scsi_mod 136008 5 st,sr_mod,sg,aic7xxx,sd_mod > > > > Thanks, > > Yael > > > > > > > > -----Original Message----- > > From: Yael Kalka > > Sent: Tuesday, November 08, 2005 8:38 AM > > To: 'Hal Rosenstock'; Eitan Zahavi > > Cc: Yael Kalka; openib-general at openib.org > > Subject: RE: [PATCH] Opensm - exiting issues > > > > > > Hi Hal, > > > > Just another 
comment - when running: > > % while test $? = 0; do opensm -V -o; done > > Try to run from a different port: > > % osmtest -f f > > This causes fludding of mads to the opensm, and that usually is > > the cause for the exiting problem. > > > > Yael > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Monday, November 07, 2005 10:07 PM > > To: Eitan Zahavi > > Cc: Yael Kalka; openib-general at openib.org > > Subject: RE: [PATCH] Opensm - exiting issues > > > > > > On Mon, 2005-11-07 at 09:42, Eitan Zahavi wrote: > > > Hi Hal, > > > > > > I will answer for Yael as she already left the office. > > > > > > The way to reproduce the "stuck" case is to run in bash: > > > % while test $? = 0; do opensm -V -o; done > > > > > > The symptom we see is that OpenSM sort of exists but the process stay > > > active (not even defunct). No way to kill it. It seems like one of the > > > threads gets caught in the middle of ioctl or something. To be able to > > > run OpenSM after this we need to reboot the machine. > > > > > > We avoid it by not issuing umad_unregister and umad_close_port > > > > This part of the patch is not needed with the fix to user_mad put in by > > Roland based on the issue (and patch) from Michael on user_mad deadlock. > > > > I've been running your test from over 30 minutes now without a hiccup. > > It used to fail pretty quickly. > > > > -- Hal > > > > > > > > Eitan Zahavi > > > Design Technology Director > > > Mellanox Technologies LTD > > > Tel:+972-4-9097208 > > > Fax:+972-4-9593245 > > > P.O. 
Box 586 Yokneam 20692 ISRAEL > > > > > > > > > > -----Original Message----- > > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > > Sent: Monday, November 07, 2005 4:21 PM > > > > To: yael at mellanox.co.il > > > > Cc: openib-general at openib.org; eitan at mellanox.co.il > > > > Subject: Re: [PATCH] Opensm - exiting issues > > > > > > > > Hi Yael, > > > > > > > > On Mon, 2005-11-07 at 08:25, Yael Kalka wrote: > > > > > Hi Hal, > > > > > > > > > > There was a problem when running opensm with -o option, that > > caused > > > > > the opensm to always exit with segfault, due to object destruction > > > > > ordering. Also - there is the known issue of exiting opensm. We've > > > > > done some clearing to the exiting code. The following patch fixes > > > most > > > > > of it. > > > > > > > > I applied this part of the patch with some cosmetic changes in > > > > osm_vendor_ibumad.c. > > > > > > > > > In the current code we saw that sometimes opensm gets "stuck" on > > > exit, > > > > > and causes the machine to get stuck too - resulting in need for > > > > > rebooting. In the following patch fixes most of it. > > > > > We did run (in the patch) into rare cases where opensm exits with > > an > > > > > error, but at least it exits without stucking the machine... > > > > > > > > Is there a reliable way to recreate machine "stuck" ? What exactly > > do > > > > you mean by this ? > > > > > > > > All umad_unregister does is some validation, a table lookup, and > > issue > > > > the ioctl to unregister the MAD agent. Not explictly unregistering > > the > > > > agent(s) does not cause any harm as when the fd is closed, this will > > > > occur as part of the cleanup. 
> > > > > > > > -- Hal > > > > From yael at mellanox.co.il Tue Nov 8 04:56:19 2005 From: yael at mellanox.co.il (Yael Kalka) Date: Tue, 8 Nov 2005 14:56:19 +0200 Subject: [openib-general] RE: [PATCH] Opensm - exiting issues Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30E23C5@mtlexch01.mtl.com> Nothing: swlab53:~ # ls -lasg /var/log/osm.log 4 -rw-r--r-- 1 root 724 Nov 8 11:40 /var/log/osm.log -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Tuesday, November 08, 2005 2:49 PM To: yael at mellanox.co.il Cc: openib-general at openib.org; eitan at mellanox.co.il Subject: RE: [PATCH] Opensm - exiting issues On Tue, 2005-11-08 at 07:02, Yael Kalka wrote: > Hi Hal, > > The filesystem is not full, since I am using opensm with -e and with no verbosity. > > swlab53:~ # df -k /var/log/ > Filesystem 1K-blocks Used Available Use% Mounted on > /dev/sda3 8262068 4705692 3136680 61% / How large is the osm.log file (ls -lasg) when this occurs ? -- Hal > > Yael > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Tuesday, November 08, 2005 1:53 PM > To: yael at mellanox.co.il > Cc: openib-general at openib.org; eitan at mellanox.co.il > Subject: RE: [PATCH] Opensm - exiting issues > > > Hi Yael, > > On Tue, 2005-11-08 at 05:12, Yael Kalka wrote: > > Hi Hal, > > > > It seems that there is still another race somewhere. > > The situation is much better. I had to run the testing for > > ~45 minutes in order to see the problem. > > Is your filesystem full ? What is the file size of the log when you hit > this ? Is this a max file size issue ? > > -- Hal > > > I ran on a loopback machine the following: > > a) from port #2 > > % while test $? = 0; do opensm -o -e; done > > b) from port #1 > > % while test 1 = 1; do osmtest -f f; done > > > > The process is hang. 
When getting the process with ps -efww I get: > > root 27939 27938 0 11:40 pts/0 00:00:00 [opensm] > > root 27938 8001 0 11:40 pts/0 00:00:00 usr/bin/opensm -o -e -g > > 0x2c902000017a2 > > > > Machine description: SuSE Linux 9.3 (i586) 2.6.11.4-20a-smp > > > > lsmod reports the following: > > Module Size Used by > > subfs 12416 1 > > nvram 13576 0 > > usbserial 34024 0 > > autofs4 23556 2 > > speedstep_lib 8324 0 > > freq_table 8832 0 > > thermal 18184 0 > > processor 28648 1 thermal > > ipv6 273920 20 > > fan 8836 0 > > button 11024 0 > > battery 14084 0 > > ac 9220 0 > > edd 14560 0 > > evdev 12928 0 > > joydev 13888 0 > > st 43676 0 > > sr_mod 21284 0 > > ib_ipoib 44804 0 > > ib_sa 16652 1 ib_ipoib > > ib_uverbs 37416 0 > > ib_umad 19376 2 > > af_packet 26760 4 > > sg 42912 0 > > ib_mthca 119452 0 > > ib_mad 41620 3 ib_sa,ib_umad,ib_mthca > > ib_core 48000 6 > > ib_ipoib,ib_sa,ib_uverbs,ib_umad,ib_mthca,ib_mad > > e1000 91316 0 > > e100 43392 0 > > mii 9088 1 e100 > > i2c_i801 12556 0 > > i2c_core 26624 1 i2c_i801 > > uhci_hcd 37008 0 > > usbcore 121688 3 usbserial,uhci_hcd > > parport_pc 44356 0 > > lp 15396 0 > > parport 40392 2 parport_pc,lp > > video1394 22860 0 > > ohci1394 37508 1 video1394 > > raw1394 34540 0 > > ieee1394 108472 3 video1394,ohci1394,raw1394 > > capability 7224 0 > > nls_iso8859_1 8064 1 > > nls_cp437 9728 1 > > vfat 17792 1 > > fat 43804 1 vfat > > dm_mod 64768 0 > > ext3 145032 2 > > jbd 73764 1 ext3 > > ide_cd 44036 0 > > cdrom 42784 2 sr_mod,ide_cd > > ide_disk 22400 0 > > aic7xxx 200632 4 > > piix 14468 0 [permanent] > > ide_core 131904 3 ide_cd,ide_disk,piix > > sd_mod 23168 5 > > scsi_mod 136008 5 st,sr_mod,sg,aic7xxx,sd_mod > > > > Thanks, > > Yael > > > > > > > > -----Original Message----- > > From: Yael Kalka > > Sent: Tuesday, November 08, 2005 8:38 AM > > To: 'Hal Rosenstock'; Eitan Zahavi > > Cc: Yael Kalka; openib-general at openib.org > > Subject: RE: [PATCH] Opensm - exiting issues > > > > > > Hi Hal, > > > > Just another 
comment - when running: > > % while test $? = 0; do opensm -V -o; done > > Try to run from a different port: > > % osmtest -f f > > This causes fludding of mads to the opensm, and that usually is > > the cause for the exiting problem. > > > > Yael > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Monday, November 07, 2005 10:07 PM > > To: Eitan Zahavi > > Cc: Yael Kalka; openib-general at openib.org > > Subject: RE: [PATCH] Opensm - exiting issues > > > > > > On Mon, 2005-11-07 at 09:42, Eitan Zahavi wrote: > > > Hi Hal, > > > > > > I will answer for Yael as she already left the office. > > > > > > The way to reproduce the "stuck" case is to run in bash: > > > % while test $? = 0; do opensm -V -o; done > > > > > > The symptom we see is that OpenSM sort of exists but the process stay > > > active (not even defunct). No way to kill it. It seems like one of the > > > threads gets caught in the middle of ioctl or something. To be able to > > > run OpenSM after this we need to reboot the machine. > > > > > > We avoid it by not issuing umad_unregister and umad_close_port > > > > This part of the patch is not needed with the fix to user_mad put in by > > Roland based on the issue (and patch) from Michael on user_mad deadlock. > > > > I've been running your test from over 30 minutes now without a hiccup. > > It used to fail pretty quickly. > > > > -- Hal > > > > > > > > Eitan Zahavi > > > Design Technology Director > > > Mellanox Technologies LTD > > > Tel:+972-4-9097208 > > > Fax:+972-4-9593245 > > > P.O. 
Box 586 Yokneam 20692 ISRAEL > > > > > > > > > > -----Original Message----- > > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > > Sent: Monday, November 07, 2005 4:21 PM > > > > To: yael at mellanox.co.il > > > > Cc: openib-general at openib.org; eitan at mellanox.co.il > > > > Subject: Re: [PATCH] Opensm - exiting issues > > > > > > > > Hi Yael, > > > > > > > > On Mon, 2005-11-07 at 08:25, Yael Kalka wrote: > > > > > Hi Hal, > > > > > > > > > > There was a problem when running opensm with -o option, that > > caused > > > > > the opensm to always exit with segfault, due to object destruction > > > > > ordering. Also - there is the known issue of exiting opensm. We've > > > > > done some clearing to the exiting code. The following patch fixes > > > most > > > > > of it. > > > > > > > > I applied this part of the patch with some cosmetic changes in > > > > osm_vendor_ibumad.c. > > > > > > > > > In the current code we saw that sometimes opensm gets "stuck" on > > > exit, > > > > > and causes the machine to get stuck too - resulting in need for > > > > > rebooting. In the following patch fixes most of it. > > > > > We did run (in the patch) into rare cases where opensm exits with > > an > > > > > error, but at least it exits without stucking the machine... > > > > > > > > Is there a reliable way to recreate machine "stuck" ? What exactly > > do > > > > you mean by this ? > > > > > > > > All umad_unregister does is some validation, a table lookup, and > > issue > > > > the ioctl to unregister the MAD agent. Not explictly unregistering > > the > > > > agent(s) does not cause any harm as when the fd is closed, this will > > > > occur as part of the cleanup. > > > > > > > > -- Hal > > > > From mst at mellanox.co.il Tue Nov 8 09:46:44 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 8 Nov 2005 19:46:44 +0200 Subject: [openib-general] [PATCH 1 of 2] mthca: qp size calculations Message-ID: <20051108174644.GD30664@mellanox.co.il> Hello, Roland! 
So here's the kernel part of the patch that fixes QP creation by moving
the common part of the math into the kernel, where we know the device
limits. This is built along the lines that we agreed upon: the kernel
does all the checks and returns the actual capabilities to userspace.

With this patch in place, it should now be quite easy to add inline data
support for the kernel, but that's a separate issue.

Please comment.

---

1. Check that the descriptor size does not exceed the value supported by
   the hardware.
2. Set max_send_sge/max_recv_wr (and max_inline_data for userspace)
   in the create QP attributes structure to the maximum values supported
   by the QP, as opposed to the (safe, but smaller) values requested by
   the user.

Signed-off-by: Michael S. Tsirkin
Signed-off-by: Jack MorgensteinMichael

Index: linux-2.6.14/drivers/infiniband/include/rdma/ib_user_verbs.h
===================================================================
--- linux-2.6.14.orig/drivers/infiniband/include/rdma/ib_user_verbs.h 2005-11-02 14:24:11.000000000 +0200
+++ linux-2.6.14/drivers/infiniband/include/rdma/ib_user_verbs.h 2005-11-08 19:29:37.000000000 +0200
@@ -43,7 +43,7 @@
  * Increment this value if any changes that break userspace ABI
  * compatibility are made.
  */
-#define IB_USER_VERBS_ABI_VERSION	3
+#define IB_USER_VERBS_ABI_VERSION	4

 enum {
 	IB_USER_VERBS_CMD_GET_CONTEXT,
@@ -333,6 +333,11 @@ struct ib_uverbs_create_qp {
 struct ib_uverbs_create_qp_resp {
 	__u32 qp_handle;
 	__u32 qpn;
+	__u32 max_send_wr;
+	__u32 max_recv_wr;
+	__u32 max_send_sge;
+	__u32 max_recv_sge;
+	__u32 max_inline_data;
 };

 /*
Index: linux-2.6.14/drivers/infiniband/core/uverbs_cmd.c
===================================================================
--- linux-2.6.14.orig/drivers/infiniband/core/uverbs_cmd.c	2005-11-02 14:24:11.000000000 +0200
+++ linux-2.6.14/drivers/infiniband/core/uverbs_cmd.c	2005-11-08 19:29:37.000000000 +0200
@@ -908,7 +908,12 @@ retry:
 	if (ret)
 		goto err_destroy;

-	resp.qp_handle = uobj->uobject.id;
+	resp.qp_handle = uobj->uobject.id;
+	resp.max_recv_sge = attr.cap.max_recv_sge;
+	resp.max_send_sge = attr.cap.max_send_sge;
+	resp.max_recv_wr = attr.cap.max_recv_wr;
+	resp.max_send_wr = attr.cap.max_send_wr;
+	resp.max_inline_data = attr.cap.max_inline_data;

 	if (copy_to_user((void __user *) (unsigned long) cmd.response,
 			 &resp, sizeof resp)) {
Index: linux-2.6.14/drivers/infiniband/hw/mthca/mthca_provider.c
===================================================================
--- linux-2.6.14.orig/drivers/infiniband/hw/mthca/mthca_provider.c	2005-11-06 10:30:43.000000000 +0200
+++ linux-2.6.14/drivers/infiniband/hw/mthca/mthca_provider.c	2005-11-08 19:29:37.000000000 +0200
@@ -604,11 +604,11 @@ static struct ib_qp *mthca_create_qp(str
 		return ERR_PTR(err);
 	}

-	init_attr->cap.max_inline_data = 0;
 	init_attr->cap.max_send_wr = qp->sq.max;
 	init_attr->cap.max_recv_wr = qp->rq.max;
 	init_attr->cap.max_send_sge = qp->sq.max_gs;
 	init_attr->cap.max_recv_sge = qp->rq.max_gs;
+	init_attr->cap.max_inline_data = qp->max_inline_data;

 	return &qp->ibqp;
 }
Index: linux-2.6.14/drivers/infiniband/hw/mthca/mthca_provider.h
===================================================================
--- linux-2.6.14.orig/drivers/infiniband/hw/mthca/mthca_provider.h	2005-09-12 10:50:00.000000000 +0300
+++ linux-2.6.14/drivers/infiniband/hw/mthca/mthca_provider.h	2005-11-08 19:29:37.000000000 +0200
@@ -251,6 +251,7 @@ struct mthca_qp {
 	struct mthca_wq sq;
 	enum ib_sig_type sq_policy;
 	int send_wqe_offset;
+	int max_inline_data;

 	u64 *wrid;
 	union mthca_buf queue;
Index: linux-2.6.14/drivers/infiniband/hw/mthca/mthca_cmd.c
===================================================================
--- linux-2.6.14.orig/drivers/infiniband/hw/mthca/mthca_cmd.c	2005-11-02 14:24:16.000000000 +0200
+++ linux-2.6.14/drivers/infiniband/hw/mthca/mthca_cmd.c	2005-11-08 19:29:37.000000000 +0200
@@ -1060,6 +1060,8 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev
 		dev_lim->hca.arbel.resize_srq = field & 1;
 		MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SG_RQ_OFFSET);
 		dev_lim->max_sg = min_t(int, field, dev_lim->max_sg);
+		MTHCA_GET(size, outbox, QUERY_DEV_LIM_MAX_DESC_SZ_RQ_OFFSET);
+		dev_lim->max_desc_sz = min_t(int, size, dev_lim->max_desc_sz);
 		MTHCA_GET(size, outbox, QUERY_DEV_LIM_MPT_ENTRY_SZ_OFFSET);
 		dev_lim->mpt_entry_sz = size;
 		MTHCA_GET(field, outbox, QUERY_DEV_LIM_PBL_SZ_OFFSET);
Index: linux-2.6.14/drivers/infiniband/hw/mthca/mthca_dev.h
===================================================================
--- linux-2.6.14.orig/drivers/infiniband/hw/mthca/mthca_dev.h	2005-11-06 10:30:43.000000000 +0200
+++ linux-2.6.14/drivers/infiniband/hw/mthca/mthca_dev.h	2005-11-08 19:29:37.000000000 +0200
@@ -131,6 +131,7 @@ struct mthca_limits {
 	int max_sg;
 	int num_qps;
 	int max_wqes;
+	int max_desc_sz;
 	int max_qp_init_rdma;
 	int reserved_qps;
 	int num_srqs;
Index: linux-2.6.14/drivers/infiniband/hw/mthca/mthca_main.c
===================================================================
--- linux-2.6.14.orig/drivers/infiniband/hw/mthca/mthca_main.c	2005-11-06 10:30:43.000000000 +0200
+++ linux-2.6.14/drivers/infiniband/hw/mthca/mthca_main.c	2005-11-08 19:29:37.000000000 +0200
@@ -168,6 +168,7 @@ static int __devinit mthca_dev_lim(struc
 	mdev->limits.max_srq_wqes = dev_lim->max_srq_sz;
 	mdev->limits.reserved_srqs = dev_lim->reserved_srqs;
 	mdev->limits.reserved_eecs = dev_lim->reserved_eecs;
+	mdev->limits.max_desc_sz = dev_lim->max_desc_sz;
 	/*
 	 * Subtract 1 from the limit because we need to allocate a
 	 * spare CQE so the HCA HW can tell the difference between an
Index: linux-2.6.14/drivers/infiniband/hw/mthca/mthca_qp.c
===================================================================
--- linux-2.6.14.orig/drivers/infiniband/hw/mthca/mthca_qp.c	2005-11-06 10:30:43.000000000 +0200
+++ linux-2.6.14/drivers/infiniband/hw/mthca/mthca_qp.c	2005-11-08 19:36:37.000000000 +0200
@@ -883,6 +883,46 @@ int mthca_modify_qp(struct ib_qp *ibqp,
 	return err;
 }

+static void mthca_adjust_qp_caps(struct mthca_dev *dev,
+				 struct mthca_pd *pd,
+				 struct mthca_qp *qp)
+{
+	int max_data_size;
+	/*
+	 * Calculate the maximum size of WQE s/g segments, excluding
+	 * the next segment and other non-data segments.
+	 */
+
+	max_data_size = min(dev->limits.max_desc_sz, 1 << qp->sq.wqe_shift) -
+		sizeof (struct mthca_next_seg);
+
+	switch (qp->transport) {
+	case MLX:
+		max_data_size -= 2 * sizeof (struct mthca_data_seg);
+		break;
+	case UD:
+		if (mthca_is_memfree(dev))
+			max_data_size -= sizeof (struct mthca_arbel_ud_seg);
+		else
+			max_data_size -= sizeof (struct mthca_tavor_ud_seg);
+		break;
+
+	default:
+		max_data_size -= sizeof (struct mthca_bind_seg);
+		break;
+	}
+
+	if (!pd->ibpd.uobject)
+		qp->max_inline_data = 0;
+	else
+		qp->max_inline_data = max_data_size - MTHCA_INLINE_HEADER_SIZE;
+
+	qp->sq.max_gs = max_data_size / sizeof (struct mthca_data_seg);
+	qp->rq.max_gs = (min(dev->limits.max_desc_sz, 1 << qp->rq.wqe_shift) -
+			 sizeof (struct mthca_next_seg)) /
+			sizeof (struct mthca_data_seg);
+}
+
 /*
  * Allocate and register buffer for WQEs. qp->rq.max, sq.max,
  * rq.max_gs and sq.max_gs must all be assigned.
@@ -900,6 +940,9 @@ static int mthca_alloc_wqe_buf(struct mt
 	size = sizeof (struct mthca_next_seg) +
 		qp->rq.max_gs * sizeof (struct mthca_data_seg);

+	if (size > dev->limits.max_desc_sz)
+		return -EINVAL;
+
 	for (qp->rq.wqe_shift = 6; 1 << qp->rq.wqe_shift < size;
 	     qp->rq.wqe_shift++)
 		; /* nothing */
@@ -921,6 +964,9 @@ static int mthca_alloc_wqe_buf(struct mt
 		size += sizeof (struct mthca_bind_seg);
 	}

+	if (size > dev->limits.max_desc_sz)
+		return -EINVAL;
+
 	for (qp->sq.wqe_shift = 6; 1 << qp->sq.wqe_shift < size;
 	     qp->sq.wqe_shift++)
 		; /* nothing */
@@ -1064,6 +1110,7 @@ static int mthca_alloc_qp_common(struct
 		return ret;
 	}

+	mthca_adjust_qp_caps(dev, pd, qp);
 	/*
 	 * If this is a userspace QP, we're done now. The doorbells
 	 * will be allocated and buffers will be initialized in

--
MST

From mst at mellanox.co.il Tue Nov 8 09:51:11 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 8 Nov 2005 19:51:11 +0200
Subject: [openib-general] [PATCH 2 of 2] libmthca: qp capability calculations
Message-ID: <20051108175111.GE30664@mellanox.co.il>

Move all qp capability calculations to kernel, where we can compare the
capabilities to hardware supported limits.

Signed-off-by: Michael S. Tsirkin
Signed-off-by: Jack MorgensteinMichael

Index: userspace/libibverbs/include/infiniband/kern-abi.h
===================================================================
--- userspace.orig/libibverbs/include/infiniband/kern-abi.h	2005-11-02 15:30:06.000000000 +0200
+++ userspace/libibverbs/include/infiniband/kern-abi.h	2005-11-08 19:49:38.000000000 +0200
@@ -48,7 +48,7 @@
  * The minimum and maximum kernel ABI that we can handle.
  */
 #define IB_USER_VERBS_MIN_ABI_VERSION	1
-#define IB_USER_VERBS_MAX_ABI_VERSION	3
+#define IB_USER_VERBS_MAX_ABI_VERSION	4

 enum {
 	IB_USER_VERBS_CMD_GET_CONTEXT,
@@ -382,6 +382,11 @@ struct ibv_create_qp {
 struct ibv_create_qp_resp {
 	__u32 qp_handle;
 	__u32 qpn;
+	__u32 max_send_wr;
+	__u32 max_recv_wr;
+	__u32 max_send_sge;
+	__u32 max_recv_sge;
+	__u32 max_inline_data;
 };

 struct ibv_qp_dest {
Index: userspace/libibverbs/src/cmd.c
===================================================================
--- userspace.orig/libibverbs/src/cmd.c	2005-11-02 15:30:06.000000000 +0200
+++ userspace/libibverbs/src/cmd.c	2005-11-08 19:48:58.000000000 +0200
@@ -501,6 +501,13 @@ int ibv_cmd_create_qp(struct ibv_pd *pd,

 	qp->handle = resp.qp_handle;
 	qp->qp_num = resp.qpn;
+	if (abi_ver > 3) {
+		attr->cap.max_recv_sge = resp.max_recv_sge;
+		attr->cap.max_send_sge = resp.max_send_sge;
+		attr->cap.max_recv_wr = resp.max_recv_wr;
+		attr->cap.max_send_wr = resp.max_send_wr;
+		attr->cap.max_inline_data = resp.max_inline_data;
+	}

 	return 0;
 }
Index: userspace/libmthca/src/qp.c
===================================================================
--- userspace.orig/libmthca/src/qp.c	2005-11-08 19:48:38.000000000 +0200
+++ userspace/libmthca/src/qp.c	2005-11-08 19:48:58.000000000 +0200
@@ -216,7 +216,7 @@ int mthca_tavor_post_send(struct ibv_qp

 		if (wr->send_flags & IBV_SEND_INLINE) {
 			struct mthca_inline_seg *seg = wqe;
-			int max_size = (1 << qp->sq.wqe_shift) - sizeof *seg - size * 16;
+			int max_size = qp->max_inline_data;
 			int s = 0;

 			wqe += sizeof *seg;
@@ -515,7 +515,7 @@ int mthca_arbel_post_send(struct ibv_qp

 		if (wr->send_flags & IBV_SEND_INLINE) {
 			struct mthca_inline_seg *seg = wqe;
-			int max_size = (1 << qp->sq.wqe_shift) - sizeof *seg - size * 16;
+			int max_size = qp->max_inline_data;
 			int s = 0;

 			wqe += sizeof *seg;
@@ -683,12 +683,14 @@ int mthca_alloc_qp_buf(struct ibv_pd *pd
 		       enum ibv_qp_type type, struct mthca_qp *qp)
 {
 	int size;
+	int max_sq_sge;

 	qp->rq.max_gs = cap->max_recv_sge;
-	qp->sq.max_gs = align(cap->max_inline_data + sizeof (struct mthca_inline_seg),
+	qp->sq.max_gs = cap->max_send_sge;
+	max_sq_sge = align(cap->max_inline_data + sizeof (struct mthca_inline_seg),
 			      sizeof (struct mthca_data_seg)) / sizeof (struct mthca_data_seg);
-	if (qp->sq.max_gs < cap->max_send_sge)
-		qp->sq.max_gs = cap->max_send_sge;
+	if (max_sq_sge < cap->max_send_sge)
+		max_sq_sge = cap->max_send_sge;

 	qp->wrid = malloc((qp->rq.max + qp->sq.max) * sizeof (uint64_t));
 	if (!qp->wrid)
@@ -702,7 +704,7 @@ int mthca_alloc_qp_buf(struct ibv_pd *pd
 		; /* nothing */

 	size = sizeof (struct mthca_next_seg) +
-		qp->sq.max_gs * sizeof (struct mthca_data_seg);
+		max_sq_sge * sizeof (struct mthca_data_seg);
 	switch (type) {
 	case IBV_QPT_UD:
 		if (mthca_is_memfree(pd->context))
@@ -767,36 +769,6 @@ int mthca_alloc_qp_buf(struct ibv_pd *pd
 	return 0;
 }

-void mthca_return_cap(struct ibv_pd *pd, struct mthca_qp *qp,
-		      enum ibv_qp_type type, struct ibv_qp_cap *cap)
-{
-	/*
-	 * Maximum inline data size is the full WQE size less the size
-	 * of the next segment, inline segment and other non-data segments.
-	 */
-	cap->max_inline_data = (1 << qp->sq.wqe_shift) -
-		sizeof (struct mthca_next_seg) -
-		sizeof (struct mthca_inline_seg);
-
-	switch (type) {
-	case IBV_QPT_UD:
-		if (mthca_is_memfree(pd->context))
-			cap->max_inline_data -= sizeof (struct mthca_arbel_ud_seg);
-		else
-			cap->max_inline_data -= sizeof (struct mthca_tavor_ud_seg);
-		break;
-
-	default:
-		cap->max_inline_data -= sizeof (struct mthca_raddr_seg);
-		break;
-	}
-
-	cap->max_send_wr = qp->sq.max;
-	cap->max_recv_wr = qp->rq.max;
-	cap->max_send_sge = qp->sq.max_gs;
-	cap->max_recv_sge = qp->rq.max_gs;
-}
-
 struct mthca_qp *mthca_find_qp(struct mthca_context *ctx, uint32_t qpn)
 {
 	int tind = (qpn & (ctx->num_qps - 1)) >> ctx->qp_table_shift;
Index: userspace/libmthca/src/verbs.c
===================================================================
--- userspace.orig/libmthca/src/verbs.c	2005-11-02 15:30:06.000000000 +0200
+++ userspace/libmthca/src/verbs.c	2005-11-08 19:50:23.000000000 +0200
@@ -471,8 +471,11 @@ struct ibv_qp *mthca_create_qp(struct ib
 	if (ret)
 		goto err_destroy;

-	mthca_return_cap(pd, qp, attr->qp_type, &attr->cap);
-
+	qp->sq.max = attr->cap.max_send_wr;
+	qp->rq.max = attr->cap.max_recv_wr;
+	qp->sq.max_gs = attr->cap.max_send_sge;
+	qp->rq.max_gs = attr->cap.max_recv_sge;
+	qp->max_inline_data = attr->cap.max_inline_data;
 	return &qp->ibv_qp;

 err_destroy:
Index: userspace/libmthca/src/mthca.h
===================================================================
--- userspace.orig/libmthca/src/mthca.h	2005-11-02 15:30:06.000000000 +0200
+++ userspace/libmthca/src/mthca.h	2005-11-08 19:48:58.000000000 +0200
@@ -177,6 +177,7 @@ struct mthca_qp {
 	void *buf;
 	uint64_t *wrid;
 	int send_wqe_offset;
+	int max_inline_data;
 	int buf_size;
 	struct mthca_wq sq;
 	struct mthca_wq rq;
@@ -317,8 +318,6 @@ extern int mthca_arbel_post_recv(struct
 				 struct ibv_recv_wr **bad_wr);
 extern int mthca_alloc_qp_buf(struct ibv_pd *pd, struct ibv_qp_cap *cap,
 			      enum ibv_qp_type type, struct mthca_qp *qp);
-extern void mthca_return_cap(struct ibv_pd *pd, struct mthca_qp *qp,
-			     enum ibv_qp_type type, struct ibv_qp_cap *cap);
 extern struct mthca_qp *mthca_find_qp(struct mthca_context *ctx, uint32_t qpn);
 extern int mthca_store_qp(struct mthca_context *ctx, uint32_t qpn, struct mthca_qp *qp);
 extern void mthca_clear_qp(struct mthca_context *ctx, uint32_t qpn);

--
MST

From iod00d at hp.com Tue Nov 8 10:42:26 2005
From: iod00d at hp.com (Grant Grundler)
Date: Tue, 8 Nov 2005 10:42:26 -0800
Subject: [openib-general] ib_mthca segfault with SVN r3984
Message-ID: <20051108184226.GA21242@esmail.cup.hp.com>

Hi Roland,
I'm also trying to catch up on 2.6.14 + openib.org r3984.
"modprobe ib_mthca" is segfaulting.
Console output appended from ia64-linux on HP rx2600.

I saw the firmware update message and will take care of that.
I don't expect it's the root cause of the segfault though.

thanks,
grant

Userspace output:
iota:~# reload_ib
+ IPoIB=30
+ ifconfig ib0 down
ib0: ERROR while getting interface flags: No such device
+ ifconfig ib0 down
ib0: ERROR while getting interface flags: No such device
+ rmmod ib_ipoib ib_uverbs ib_sdp ib_cm ib_sa ib_mthca ib_mad ib_core
ERROR: Module ib_ipoib does not exist in /proc/modules
ERROR: Module ib_uverbs does not exist in /proc/modules
ERROR: Module ib_sdp does not exist in /proc/modules
ERROR: Module ib_cm does not exist in /proc/modules
ERROR: Module ib_sa does not exist in /proc/modules
ERROR: Module ib_mthca does not exist in /proc/modules
ERROR: Module ib_mad does not exist in /proc/modules
ERROR: Module ib_core does not exist in /proc/modules
+ modprobe ib_mthca msi_x=1
/usr/local/bin/reload_ib: line 7: 1590 Segmentation fault modprobe ib_mthca msi_x=1
+ modprobe ib_ipoib
+ modprobe ib_sdp
+ modprobe ib_uverbs
+ ifconfig ib0 10.0.0.30 netmask 255.255.255.0 broadcast 10.0.0.255
SIOCSIFADDR: No such device
ib0: ERROR while getting interface flags: No such device
SIOCSIFNETMASK: No such device
SIOCSIFBRDADDR: No such device
ib0: ERROR while getting
interface flags: No such device + ifconfig ib1 10.0.1.30 netmask 255.255.255.0 broadcast 10.0.1.255 SIOCSIFADDR: No such device ib1: ERROR while getting interface flags: No such device SIOCSIFNETMASK: No such device SIOCSIFBRDADDR: No such device ib1: ERROR while getting interface flags: No such device iota:~# Console output: Linux version 2.6.14 (grundler at gsyprf3) (gcc version 4.0.2 (Debian 4.0.2-2)) #3 SMP Mon Nov 7 20:56:52 PST 2005 EFI v1.10 by HP: SALsystab=0x3fb38000 ACPI 2.0=0x3fb2e000 SMBIOS=0x3fb3a000 HCDP=0x3fb2c000 ACPI: RSDP (v002 HP ) @ 0x000000003fb2e000 ACPI: XSDT (v001 HP rx2600 0x00000000 HP 0x00000000) @ 0x000000003fb2e02c ACPI: FADT (v003 HP rx2600 0x00000000 HP 0x00000000) @ 0x000000003fb36800 ACPI: SPCR (v001 HP rx2600 0x00000000 HP 0x00000000) @ 0x000000003fb36938 ACPI: DBGP (v001 HP rx2600 0x00000000 HP 0x00000000) @ 0x000000003fb36988 ACPI: MADT (v001 HP rx2600 0x00000000 HP 0x00000000) @ 0x000000003fb36a48 ACPI: SPMI (v004 HP rx2600 0x00000000 HP 0x00000000) @ 0x000000003fb369c0 ACPI: CPEP (v001 HP rx2600 0x00000000 HP 0x00000000) @ 0x000000003fb36a10 ACPI: SSDT (v001 HP rx2600 0x00000006 INTL 0x02012044) @ 0x000000003fb33870 ACPI: SSDT (v001 HP rx2600 0x00000006 INTL 0x02012044) @ 0x000000003fb33a50 ACPI: SSDT (v001 HP rx2600 0x00000006 INTL 0x02012044) @ 0x000000003fb33da0 ACPI: SSDT (v001 HP rx2600 0x00000006 INTL 0x02012044) @ 0x000000003fb347c0 ACPI: SSDT (v001 HP rx2600 0x00000006 INTL 0x02012044) @ 0x000000003fb351e0 ACPI: SSDT (v001 HP rx2600 0x00000006 INTL 0x02012044) @ 0x000000003fb35c00 ACPI: SSDT (v001 HP rx2600 0x00000006 INTL 0x02012044) @ 0x000000003fb36620 ACPI: SSDT (v001 HP rx2600 0x00000006 INTL 0x02012044) @ 0x000000003fb36710 ACPI: DSDT (v001 HP rx2600 0x00000007 INTL 0x02012044) @ 0x0000000000000000 warning: skipping physical page 0 SAL 3.1: HP version 2.21 SAL Platform features: None SAL: AP wakeup using external interrupt vector 0xff No logical to physical processor mapping available ACPI: Local APIC address 
c0000000fee00000 GSI 36 (level, low) -> CPU 0 (0x0000) vector 48 2 CPUs available, 2 CPUs total MCA related initialization done On node 0 totalpages: 130372 DMA zone: 130372 pages, LIFO batch:7 Normal zone: 0 pages, LIFO batch:1 HighMem zone: 0 pages, LIFO batch:1 Virtual mem_map starts at 0xa0007fffc7900000 Built 1 zonelists Kernel command line: BOOT_IMAGE=scsi0:/EFI/debian/boot/vmlinuz-2.6.14 root=/dev/sda3 console=ttyS0,115200 ro PID hash table entries: 4096 (order: 12, 131072 bytes) CPU 0: base freq=200.000MHz, ITC ratio=15/2, ITC freq=1500.000MHz+/-750ppm Console: colour dummy device 80x25 Dentry cache hash table entries: 262144 (order: 7, 2097152 bytes) Inode-cache hash table entries: 131072 (order: 6, 1048576 bytes) Memory: 2062672k/2085952k available (7764k code, 22672k reserved, 3329k data, 272k init) McKinley Errata 9 workaround not needed; disabling it Calibrating delay loop... 2244.60 BogoMIPS (lpj=4489216) Mount-cache hash table entries: 1024 Boot processor id 0x0/0x0 softlockup thread 0 started up. CPU 1: synchronized ITC with CPU 0 (last diff 5 cycles, maxerr 605 cycles) CPU 1: base freq=200.000MHz, ITC ratio=15/2, ITC freq=1500.000MHz+/-750ppm Calibrating delay loop... 2244.60 BogoMIPS (lpj=4489216) Brought up 2 CPUs Total of 2 processors activated (4489.21 BogoMIPS). softlockup thread 1 started up. 
NET: Registered protocol family 16 ACPI: bus type pci registered ACPI: Subsystem revision 20050902 ACPI: Interpreter enabled ACPI: Using IOSAPIC for interrupt routing ACPI: PCI Root Bridge [PCI0] (0000:00) ACPI: PCI Interrupt Routing Table [\_SB_.SBA0.PCI0._PRT] ACPI: PCI Root Bridge [PCI1] (0000:20) ACPI: PCI Interrupt Routing Table [\_SB_.SBA0.PCI1._PRT] ACPI: PCI Root Bridge [PCI2] (0000:40) ACPI: PCI Interrupt Routing Table [\_SB_.SBA0.PCI2._PRT] ACPI: PCI Root Bridge [PCI3] (0000:60) ACPI: PCI Interrupt Routing Table [\_SB_.SBA0.PCI3._PRT] ACPI: PCI Root Bridge [PCI4] (0000:80) ACPI: PCI Interrupt Routing Table [\_SB_.SBA0.PCI4._PRT] ACPI: PCI Root Bridge [PCI6] (0000:c0) ACPI: PCI Interrupt Routing Table [\_SB_.SBA0.PCI6._PRT] Generic PHY: Registered new driver SCSI subsystem initialized usbcore: registered new driver usbfs usbcore: registered new driver hub IOC: zx1 2.2 HPA 0xfed01000 IOVA space 1024Mb at 0x40000000 perfmon: version 2.0 IRQ 238 perfmon: Itanium 2 PMU detected, 16 PMCs, 18 PMDs, 4 counters (47 bits) PAL Information Facility v0.5 perfmon: added sampling format default_format perfmon_default_smpl: default_format v2.0 registered Total HugeTLB memory allocated, 0 Installing knfsd (copyright (C) 1996 okir at monad.swb.de). 
Initializing Cryptographic API pci_hotplug: PCI Hot Plug PCI Core version: 0.5 acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5 ACPI: Power Button (FF) [PWRF] ACPI: Sleep Button (FF) [SLPF] ACPI: CPU0 (power states: C1[C1]) ACPI: CPU1 (power states: C1[C1]) ACPI: Thermal Zone [THM0] (27 C) EFI Time Services Driver v0.4 Linux agpgart interface v0.101 (c) Dave Jones Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing disabled GSI 34 (edge, high) -> CPU 0 (0x0000) vector 49 ttyS0 at MMIO 0xff5e0000 (irq = 49) is a 16550A GSI 35 (edge, high) -> CPU 1 (0x0100) vector 50 ttyS1 at MMIO 0xff5e2000 (irq = 50) is a 16550A io scheduler noop registered io scheduler anticipatory registered io scheduler deadline registered io scheduler cfq registered RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize loop: loaded (max 8 devices) pktcdvd: v0.2.0a 2004-07-14 Jens Axboe (axboe at suse.de) and petero2 at telia.com Ethernet Channel Bonding Driver: v2.6.4 (September 26, 2005) bonding: Warning: either miimon or arp_interval and arp_ip_target module parameters must be specified, otherwise bonding will not detect link failures! see bonding.txt for details. Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2 ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx CMD649: IDE controller at PCI slot 0000:00:02.0 GSI 21 (level, low) -> CPU 0 (0x0000) vector 51 ACPI: PCI Interrupt 0000:00:02.0[A] -> GSI 21 (level, low) -> IRQ 51 CMD649: chipset revision 2 CMD649: 100% native mode on irq 51 ide0: BM-DMA at 0x0d40-0x0d47, BIOS settings: hda:pio, hdb:pio ide1: BM-DMA at 0x0d48-0x0d4f, BIOS settings: hdc:pio, hdd:pio Probing IDE interface ide0... hda: DV-28E-A, ATAPI CD/DVD-ROM drive ide0 at 0xd58-0xd5f,0xd66 on irq 51 Probing IDE interface ide1... 
hda: ATAPI 24X DVD-ROM drive, 512kB Cache, UDMA(33) Uniform CD-ROM driver Revision: 3.20 Fusion MPT base driver 3.03.03 Copyright (c) 1999-2005 LSI Logic Corporation Fusion MPT SPI Host driver 3.03.03 GSI 27 (level, low) -> CPU 1 (0x0100) vector 52 ACPI: PCI Interrupt 0000:20:01.0[A] -> GSI 27 (level, low) -> IRQ 52 mptbase: Initiating ioc0 bringup ioc0: 53C1030: Capabilities={Initiator} scsi0 : ioc0: LSI53C1030, FwRev=01030600h, Ports=1, MaxQ=255, IRQ=52 Vendor: HP 73.4G Model: ST373453LC Rev: HPC3 Type: Direct-Access ANSI SCSI revision: 03 SCSI device sda: 143374738 512-byte hdwr sectors (73408 MB) SCSI device sda: drive cache: write through SCSI device sda: 143374738 512-byte hdwr sectors (73408 MB) SCSI device sda: drive cache: write through sda: sda1 sda2 sda3 Attached scsi disk sda at scsi0, channel 0, id 0, lun 0 Vendor: HP 73.4G Model: ST373453LC Rev: HPC3 Type: Direct-Access ANSI SCSI revision: 03 SCSI device sdb: 143374738 512-byte hdwr sectors (73408 MB) SCSI device sdb: drive cache: write through SCSI device sdb: 143374738 512-byte hdwr sectors (73408 MB) SCSI device sdb: drive cache: write through sdb: sdb1 sdb2 Attached scsi disk sdb at scsi0, channel 0, id 1, lun 0 GSI 28 (level, low) -> CPU 0 (0x0000) vector 53 ACPI: PCI Interrupt 0000:20:01.1[B] -> GSI 28 (level, low) -> IRQ 53 mptbase: Initiating ioc1 bringup ioc1: 53C1030: Capabilities={Initiator} scsi1 : ioc1: LSI53C1030, FwRev=01030600h, Ports=1, MaxQ=255, IRQ=53 Vendor: HP 73.4G Model: ST373453LC Rev: HPC3 Type: Direct-Access ANSI SCSI revision: 03 SCSI device sdc: 143374738 512-byte hdwr sectors (73408 MB) SCSI device sdc: drive cache: write through SCSI device sdc: 143374738 512-byte hdwr sectors (73408 MB) SCSI device sdc: drive cache: write through sdc: unknown partition table Attached scsi disk sdc at scsi1, channel 0, id 2, lun 0 GSI 18 (level, low) -> CPU 1 (0x0100) vector 54 ACPI: PCI Interrupt 0000:00:01.2[C] -> GSI 18 (level, low) -> IRQ 54 ehci_hcd 0000:00:01.2: EHCI Host Controller 
ehci_hcd 0000:00:01.2: new USB bus registered, assigned bus number 1 ehci_hcd 0000:00:01.2: irq 54, io mem 0x80021000 ehci_hcd 0000:00:01.2: USB 2.0 initialized, EHCI 0.95, driver 10 Dec 2004 hub 1-0:1.0: USB hub found hub 1-0:1.0: 5 ports detected ohci_hcd: 2005 April 22 USB 1.1 'Open' Host Controller (OHCI) Driver (PCI) GSI 16 (level, low) -> CPU 0 (0x0000) vector 55 ACPI: PCI Interrupt 0000:00:01.0[A] -> GSI 16 (level, low) -> IRQ 55 ohci_hcd 0000:00:01.0: OHCI Host Controller ohci_hcd 0000:00:01.0: new USB bus registered, assigned bus number 2 ohci_hcd 0000:00:01.0: irq 55, io mem 0x80023000 hub 2-0:1.0: USB hub found hub 2-0:1.0: 3 ports detected GSI 17 (level, low) -> CPU 1 (0x0100) vector 56 ACPI: PCI Interrupt 0000:00:01.1[B] -> GSI 17 (level, low) -> IRQ 56 ohci_hcd 0000:00:01.1: OHCI Host Controller ohci_hcd 0000:00:01.1: new USB bus registered, assigned bus number 3 ohci_hcd 0000:00:01.1: irq 56, io mem 0x80022000 hub 3-0:1.0: USB hub found hub 3-0:1.0: 2 ports detected USB Universal Host Controller Interface driver v2.3 usbcore: registered new driver hiddev usbcore: registered new driver usbhid drivers/usb/input/hid-core.c: v2.6:USB HID core driver mice: PS/2 mouse device common for all mice EFI Variables Facility v0.08 2004-May-17 NET: Registered protocol family 2 IP route cache hash table entries: 65536 (order: 5, 524288 bytes) TCP established hash table entries: 262144 (order: 8, 4194304 bytes) TCP bind hash table entries: 65536 (order: 6, 1048576 bytes) TCP: Hash tables configured (established 262144 bind 65536) TCP reno registered TCP bic registered NET: Registered protocol family 1 NET: Registered protocol family 17 SCTP: Hash tables configured (established 65536 bind 65536) kjournald starting. Commit interval 5 seconds EXT3-fs: mounted filesystem with ordered data mode. VFS: Mounted root (ext3 filesystem) readonly. Freeing unused kernel memory: 272kB freed Adding 1022944k swap on /dev/sda2. 
Priority:-1 extents:1 across:1022944k EXT3 FS on sda3, internal journal e100: Intel(R) PRO/100 Network Driver, 3.4.14-k2-NAPI e100: Copyright(c) 1999-2005 Intel Corporation GSI 20 (level, low) -> CPU 0 (0x0000) vector 57 ACPI: PCI Interrupt 0000:00:03.0[A] -> GSI 20 (level, low) -> IRQ 57 e100: eth0: e100_probe: addr 0x80020000, irq 57, MAC addr 00:30:6E:1E:CE:16 tg3.c:v3.42 (Oct 3, 2005) GSI 29 (level, low) -> CPU 1 (0x0100) vector 58 ACPI: PCI Interrupt 0000:20:02.0[A] -> GSI 29 (level, low) -> IRQ 58 eth1: Tigon3 [partno(BCM95700A6) rev 0105 PHY(5701)] (PCI:66MHz:64-bit) 10/100/1000BaseT Ethernet 00:30:6e:1e:de:01 eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1] TSOcap[0] eth1: dma_rwctrl[76ff2d0f] GSI 71 (level, low) -> CPU 0 (0x0000) vector 59 ACPI: PCI Interrupt 0000:c0:01.0[A] -> GSI 71 (level, low) -> IRQ 59 eth2: Tigon3 [partno(A6847-60001) rev 0105 PHY(serdes)] (PCI:66MHz:64-bit) 10/100/1000BaseT Ethernet 00:04:76:df:ba:99 eth2: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[0] TSOcap[0] eth2: dma_rwctrl[76ff2d0f] Intel(R) PRO/1000 Network Driver - version 6.0.60-k2-NAPI Copyright (c) 1999-2005 Intel Corporation. Linux Tulip driver version 1.1.13-NAPI (May 11, 2002) e100: eth0: e100_watchdog: link up, 100Mbps, full-duplex tg3: eth1: Link is up at 1000 Mbps, full duplex. tg3: eth1: Flow control is off for TX and off for RX. 
tar(1403): unaligned access to 0x60000fffffac34c4, ip=0x200000000028ba01 tar(1403): unaligned access to 0x60000fffffac34c4, ip=0xa0000001006426f0 tar(1403): unaligned access to 0x60000fffffac34c4, ip=0x200000000028bad1 tar(1403): unaligned access to 0x60000fffffac34b4, ip=0x200000000028ba01 sshd(1569): unaligned access to 0x60000fffffcceb44, ip=0x20000000007ffa01 sshd(1569): unaligned access to 0x60000fffffcceb44, ip=0xa0000001006426f0 sshd(1569): unaligned access to 0x60000fffffcceb44, ip=0x20000000007ffad1 sshd(1569): unaligned access to 0x60000fffffccc184, ip=0x20000000007ffa01 su(1575): unaligned access to 0x60000fffffa537a4, ip=0x20000000002e7a01 su(1575): unaligned access to 0x60000fffffa537a4, ip=0xa0000001006426f0 su(1575): unaligned access to 0x60000fffffa537a4, ip=0x20000000002e7ad1 su(1575): unaligned access to 0x60000fffffa52a04, ip=0x20000000002e7a01 cron(1580): unaligned access to 0x60000fffffcb65a4, ip=0x20000000002bfa01 cron(1580): unaligned access to 0x60000fffffcb65a4, ip=0xa0000001006426f0 cron(1580): unaligned access to 0x60000fffffcb65a4, ip=0x20000000002bfad1 cron(1581): unaligned access to 0x60000fffffcb7374, ip=0x20000000002bfa01 ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005) ib_mthca: Initializing 0000:81:00.0 GSI 60 (level, low) -> CPU 1 (0x0100) vector 60 ACPI: PCI Interrupt 0000:81:00.0[A] -> GSI 60 (level, low) -> IRQ 60 ib_mthca 0000:81:00.0: HCA FW version 3.3.2 is old (3.3.3 is current). ib_mthca 0000:81:00.0: If you have problems, try updating your HCA FW. 
Unable to handle kernel paging request at virtual address 2e414348000003a1 modprobe[1590]: Oops 11012296146944 [1] Modules linked in: ib_mthca ib_mad ib_core tulip e1000 tg3 e100 Pid: 1590, CPU 1, comm: modprobe psr : 0000121008526030 ifs : 8000000000000286 ip : [] Not tainted ip is at mthca_uar_alloc+0x81/0xc0 [ib_mthca] unat: 0000000000000000 pfs : 0000000000000614 rsc : 0000000000000003 rnat: 000000000000000a bsps: a000000100ad8bd0 pr : 55a1404a4659a965 ldrs: 0000000000000000 ccv : 0000000000000007 fpsr: 0009804c8a70433f csd : 0000000000000000 ssd : 0000000000000000 b0 : a0000002000cc4a0 b6 : a0000001000d53a0 b7 : a00000010000b1c0 f6 : 1003e0000000000000090 f7 : 1003e20c49ba5e353f7cf f8 : 1003e000000000000046a f9 : 1003e000000000e251ad8 f10 : 1003e0000000035f56b37 f11 : 1003e431bde82d7b634db r1 : a0000002000e9100 r2 : 000000000000038a r3 : 0000000000000004 r8 : 0000000000000000 r9 : 0000000000000003 r10 : 0000000000000001 r11 : e00000003f542b40 r12 : e000000005757dd0 r13 : e000000005750000 r14 : 2e414348000003a1 r15 : 0000000000000003 r16 : 0000000000000000 r17 : 0000000000000008 r18 : 0000000000000007 r19 : 0000000000000007 r20 : 000000000000000f r21 : 0000000000000007 r22 : 0000000000000000 r23 : e00000407ee3059c r24 : 0000000000000000 r25 : 0000000000000004 r26 : 0000000000000002 r27 : 0000000000000000 r28 : 0000000000000000 r29 : 0000000000000003 r30 : 0000000000000000 r31 : e00000407ee305a8 Call Trace: [] show_stack+0x80/0xa0 sp=e000000005757970 bsp=e0000000057511f0 [] show_regs+0x900/0x940 sp=e000000005757b40 bsp=e000000005751198 [] die+0x150/0x200 sp=e000000005757b50 bsp=e000000005751150 [] ia64_do_page_fault+0xd30/0xdc0 sp=e000000005757b50 bsp=e000000005751100 [] ia64_leave_kernel+0x0/0x280 sp=e000000005757c00 bsp=e000000005751100 [] mthca_uar_alloc+0x80/0xc0 [ib_mthca] sp=e000000005757dd0 bsp=e0000000057510d0 [] mthca_init_one+0x1180/0x2080 [ib_mthca] sp=e000000005757dd0 bsp=e000000005751070 [] pci_device_probe+0xf0/0x160 sp=e000000005757df0 
bsp=e000000005751030 [] driver_probe_device+0xd0/0x1a0 sp=e000000005757df0 bsp=e000000005750ff0 [] __driver_attach+0x70/0xc0 sp=e000000005757df0 bsp=e000000005750fc0 [] bus_for_each_dev+0xb0/0x120 sp=e000000005757df0 bsp=e000000005750f88 [] driver_attach+0x40/0x60 sp=e000000005757e10 bsp=e000000005750f68 [] bus_add_driver+0xc0/0x2e0 sp=e000000005757e10 bsp=e000000005750f30 [] driver_register+0xa0/0xc0 sp=e000000005757e10 bsp=e000000005750f10 [] pci_register_driver+0x100/0x160 sp=e000000005757e20 bsp=e000000005750ef0 [] mthca_init+0x30/0x70 [ib_mthca] sp=e000000005757e30 bsp=e000000005750ed8 [] sys_init_module+0x260/0x600 sp=e000000005757e30 bsp=e000000005750e60 [] ia64_ret_from_syscall+0x0/0x20 sp=e000000005757e30 bsp=e000000005750e60 [] __kernel_syscall_via_break+0x0/0x20 sp=e000000005758000 bsp=e000000005750e60 sdp_sk_proto.name = SDP sdp_sk_proto.obj_size = 1744 sdp_init in_interrupt = 0 sdp_init prot->rsk_prot = 0000000000000000 NET: Registered protocol family 27 From rolandd at cisco.com Tue Nov 8 11:03:57 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 08 Nov 2005 11:03:57 -0800 Subject: [openib-general] ib_mthca segfault with SVN r3984 In-Reply-To: <20051108184226.GA21242@esmail.cup.hp.com> (Grant Grundler's message of "Tue, 8 Nov 2005 10:42:26 -0800") References: <20051108184226.GA21242@esmail.cup.hp.com> Message-ID: <52k6fjotb6.fsf@cisco.com> Very odd -- mthca_uar_alloc() is an innocuous function and I'm not sure why it would be oopsing now. Is it possible that some old object files hung around and screwed things up? Does this still happen with an absolutely clean build? - R. From mst at mellanox.co.il Tue Nov 8 11:19:25 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 8 Nov 2005 21:19:25 +0200 Subject: [openib-general] Re: SDP close / linger problem In-Reply-To: <00bd01c5e3c9$81ab5c70$0211708d@gpv.az05.bull.com> References: <00bd01c5e3c9$81ab5c70$0211708d@gpv.az05.bull.com> Message-ID: <20051108191925.GB30918@mellanox.co.il> Hmm. Is the app small enough to post it here and use to reproduce the problem? If not, the traces with debug enabled might prove useful. Quoting Jerome Pioux : > Subject: SDP close / linger problem > > Hi Michael, > > Not sure that my post earlier made it to the forum? ... so, I thought that I will send it directly to you since you are the SDP person. > > I have a problem at closure time with client-server apps when using SDP. > > The problem is that it ALWAYS takes the "LINGER" time value for one or the other side to complete the close() and regardless of what the value was set at (2s, 10s 30s, 60s...) > > >From my understanding, if the LINGER option is set, the close is queued up on your send queue behind all other data potentially already queued up at the time. But, if there are no data queued up, the close should be immediate. > > On "my" app, this is always the receiver side that experiences the problem (the app is symetrical this is why there is a LINGER on the receiver side). > > I added a close in ttcp (ttcp does not use explicit close) with a linger time and the sender is now the one that always experiences this problem. > > I believe that, for both apps, all data have been sent (and received) correctly before the close - nothing (at least from the app view) is in the "pipe" (I had the app to report that before the closes). > > Finally, both app work fine using IPoIB - I meant that for the same tests, the closes are immediate, regardless of the LINGER values. > > Any idea please? > I can provide traces if needed - please tell me what is needed and how to get them. 
>
> Thank you,
> Jerome
>
> ps: ia64 / RHEL4 / 2.6.12 / sn rev 3882
>
>

--
MST

From Don.Dhondt at Bull.com Tue Nov 8 11:40:05 2005
From: Don.Dhondt at Bull.com (Don.Dhondt at Bull.com)
Date: Tue, 8 Nov 2005 12:40:05 -0700
Subject: [openib-general] SDP close / linger problem
Message-ID:

I am posting this for Jerome Pioux. He is still having problems
posting to this list.

Don Dhondt

I have a problem at closure time with client-server apps when using SDP.

The problem is that it ALWAYS takes the "LINGER" time value for one or the
other side to complete the close() and regardless of what the value was set
at (2s, 10s 30s, 60s...)

From my understanding, if the LINGER option is set, the close is queued up
on your send queue behind all other data potentially already queued up at
the time. But, if there are no data queued up, the close should be immediate.

On "my" app, this is always the receiver side that experiences the problem
(the app is symetrical this is why there is a LINGER on the receiver side).

I added a close in ttcp (ttcp does not use explicit close) with a linger
time and the sender is now the one that always experiences this problem.

I believe that, for both apps, all data have been sent (and received)
correctly before the close - nothing (at least from the app view) is in the
"pipe" (I had the app to report that before the closes).

Finally, both app work fine using IPoIB - I meant that for the same tests,
the closes are immediate, regardless of the LINGER values.

Any idea please?
I can provide traces if needed - please tell me what is needed and how to
get them.

Thank you,
Jerome

ps: ia64 / RHEL4 / 2.6.12 / sn rev 3882
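
The close-with-linger pattern described above can be sketched roughly as
follows. This is only a minimal illustration over plain TCP on loopback,
not the ttcp change or the application under discussion, and the helper
name timed_linger_close() is hypothetical. It sets SO_LINGER on a
connected socket, makes sure the one byte that was sent has been read by
the peer, and then times close(). With nothing left queued, close() would
be expected to return well before the linger timeout, which matches the
behaviour reported for IPoIB.

```c
#define _POSIX_C_SOURCE 200809L

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from ttcp): connect a TCP socket to a
 * loopback listener, send one byte, let the peer read it, then close the
 * sending socket with SO_LINGER enabled.
 * Returns the seconds spent inside close(), or -1.0 on any error.
 */
double timed_linger_close(int linger_secs)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof addr;
    struct linger lg;
    struct timespec t0, t1;
    char byte = 'x';
    double elapsed = -1.0;
    int lsock, csock = -1, ssock = -1;

    lsock = socket(AF_INET, SOCK_STREAM, 0);
    if (lsock < 0)
        return -1.0;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                  /* let the kernel pick a port */

    if (bind(lsock, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(lsock, 1) < 0 ||
        getsockname(lsock, (struct sockaddr *)&addr, &len) < 0)
        goto out;

    csock = socket(AF_INET, SOCK_STREAM, 0);
    if (csock < 0 ||
        connect(csock, (struct sockaddr *)&addr, sizeof addr) < 0)
        goto out;

    ssock = accept(lsock, NULL, NULL);
    if (ssock < 0)
        goto out;

    /* Enable lingering on the sending side. */
    lg.l_onoff = 1;
    lg.l_linger = linger_secs;
    if (setsockopt(csock, SOL_SOCKET, SO_LINGER, &lg, sizeof lg) < 0)
        goto out;

    /* Send one byte and have the peer read it, so nothing stays queued. */
    if (write(csock, &byte, 1) != 1 || read(ssock, &byte, 1) != 1)
        goto out;

    /* Time the close() call itself. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    close(csock);
    csock = -1;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    elapsed = (double)(t1.tv_sec - t0.tv_sec) +
              (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;

out:
    if (csock >= 0)
        close(csock);
    if (ssock >= 0)
        close(ssock);
    close(lsock);
    return elapsed;
}
```

Calling timed_linger_close(10) returns the time spent in close() in
seconds, or -1.0 on error. A result close to the linger value would
correspond to the delay reported with SDP.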

From krause at cup.hp.com Tue Nov 8 11:52:07 2005
From: krause at cup.hp.com (Michael Krause)
Date: Tue, 08 Nov 2005 11:52:07 -0800
Subject: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
In-Reply-To: <5D78D28F88822E4D8702BB9EEF1A436773E971@mercury.infiniconsys.com>
References: <5D78D28F88822E4D8702BB9EEF1A436773E971@mercury.infiniconsys.com>
Message-ID: <6.2.0.14.2.20051108112907.02345e80@esmail.cup.hp.com>

At 03:02 PM 11/4/2005, Rimmer, Todd wrote:
> > Bob wrote,
> > Perhaps if tunneling udp packets over RC connections rather than
> > UD connections provides better performance, as was seen in the RDS
> > experiment, then why not just convert
> > IPoIB to use a connected model (rather than datagrams)
> > and then all existing IP upper level
> > protocols would could benefit, TCP, UDP, SCTP, ....
>
>This would miss the second major improvement of RDS, namely removing the
>need for the application to perform timeouts and retries on datagram
>packets. If Oracle ran over UDP/IP/IPoIB it would not be guaranteed a
>loss-less reliable interface. If UDP/IP/IPoIB provided a loss-less
>reliable interface it would likely break or affect other UDP applications
>which are expecting a flow controlled interface.

The entire discussion might be distilled into the following:

- Datagram applications trade reliability for flexibility and resource
savings.

- Datagram applications that require reliability have to re-invent the
wheel and given it is non-trivial, they often get it variable quality and
can suffer performance loss if done poorly or the network is very lossy.
Given networks are a lot less lossy today than years past, sans congestion
drops, one might argue about whether there is still a significant problem
or not.

- The reliable datagram model isn't new - been there, done that on earlier
interconnects - but it isn't free.
IB could have done something like RDS but the people who pushed the original requirements (some who are advocating RDS now) did not want to take on the associated software enablement thus it was subsumed into hardware and made slightly more restrictive as a result - perhaps more than some people may like. The only real delta between RDS one sense and the current IB RD is the number of outstanding messages in flight on a given EEC. If RD were re-defined to allow software to recover some types of failures much like UC, then one could simply use RD. - RDS does not solve a set of failure models. For example, if a RNIC / HCA were to fail, then one cannot simply replay the operations on another RNIC / HCA without extracting state, etc. and providing some end-to-end sync of what was really sent / received by the application. Yes, one can recover from cable or switch port failure by using APM style recovery but that is only one class of faults. The harder faults either result in the end node being cast out of the cluster or see silent data corruption unless additional steps are taken to transparently recover - again app writers don't want to solve the hard problems; they want that done for them. - RNIC / HCA provide hardware acceleration and reliable delivery to the remote RNIC / HCA (not to the application since that is in a separate fault domain). Doing software multiplexing over such an interconnect as envisioned for IB RD is relatively straight in many respects but not a trivial exercise as some might contend. Yes, people can point to a small number of lines of code but that is just for the initial offering and is not an indication of what it might have to become long-term to add all of the bells-n-whistles that people have envisioned. - RDS is not an API but a ULP. It really uses a set of physical connections and which are then used to set up logical application associations (often referred to as connections but really are not in terms of the interconnect). 
These associations can be quickly established as they are just control messages over the existing physical connections. Again, builds on concepts already shipping in earlier interconnects / solutions from a number of years back. Hence, for large scale applications which are association intensive, RDS is able to improve the performance of establishing these associations. While RDS improves the performance in this regard, its impacts on actual performance stem more from avoiding some operations thus nearly all of the performance numbers quoted are really an apple-to-orange comparison. Nothing wrong with this but people need to keep in mind that things are not being compared with one another on the same level thus the results can look more dramatic. - One thing to keep in mind is that RDS is about not doing work to gain performance and to potentially improve code by eliminating software that was too complex / difficult to get clean when it was invoked to recover from fabric-related issues. This is somewhat the same logic as used by NFS when migrating to TCP from UDP. Could not get clean software so change the underlying comms to push the problem to a place where it is largely solved. Now, whether you believe RDS is great or not, it is an attempt to solve a problem plaguing one class of applications who'd rather not spend their resources on the problem. That is a fair thing to consider if someone else has already done it better using another technology. One could also consider having IB change the RD semantics to see if that would solve the problem since it would not require a new ULP to make it work when you think about it though there is no analog with iWARP. 
The discussion so far has been interesting and I think there is fair push back to avoid re-inventing the wheel especially on the idea of trying to do this directly on Ethernet (that seems like just re-inventing all of that buggy code people stated they could not get right at the app layer in the first place and largely goes against the logic used to create IB and as well as iWARP's use of TCP in the first place). Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpandit at silverstorm.com Tue Nov 8 12:33:35 2005 From: rpandit at silverstorm.com (Ranjit Pandit) Date: Tue, 8 Nov 2005 12:33:35 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB In-Reply-To: <6.2.0.14.2.20051108112907.02345e80@esmail.cup.hp.com> References: <5D78D28F88822E4D8702BB9EEF1A436773E971@mercury.infiniconsys.com> <6.2.0.14.2.20051108112907.02345e80@esmail.cup.hp.com> Message-ID: <96f8e60e0511081233y2e248a3fxfe5b46e05cfcdea6@mail.gmail.com> > Mike wrote: > - RDS does not solve a set of failure models. For example, if a RNIC / HCA > were to fail, then one cannot simply replay the operations on another RNIC / > HCA without extracting state, etc. and providing some end-to-end sync of > what was really sent / received by the application. Yes, one can recover > from cable or switch port failure by using APM style recovery but that is > only one class of faults. The harder faults either result in the end node > being cast out of the cluster or see silent data corruption unless > additional steps are taken to transparently recover - again app writers > don't want to solve the hard problems; they want that done for them. The current reference implementation of RDS solves the HCA failure case as well. Since applications don't need to keep connection states, it's easier to handle cases like HCA and intermediate path failures. 
As far as application is concerned, every sendmsg 'could' result in a new connection setup in the driver. If the current path fails, RDS reestablishes a connection, if available, on a different port or a different HCA , and replays the failed messages. Using APM is not useful because it doesn't provide failover across HCA's. > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > From halr at voltaire.com Tue Nov 8 12:49:24 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 8 Nov 2005 22:49:24 +0200 Subject: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AA4D@taurus.voltaire.com> On Tue, 2005-11-08 at 15:33, Ranjit Pandit wrote: > Using APM is not useful because it doesn't provide failover across HCA's. Can't APM be made to work across HCAs ? -- Hal From jerome.pioux at bull.com Tue Nov 8 12:55:27 2005 From: jerome.pioux at bull.com (Jerome Pioux) Date: Tue, 8 Nov 2005 13:55:27 -0700 Subject: [openib-general] Re: SDP close / linger problem References: <00bd01c5e3c9$81ab5c70$0211708d@gpv.az05.bull.com> <20051108191925.GB30918@mellanox.co.il> Message-ID: <007601c5e4a6$c19cf400$0211708d@gpv.az05.bull.com> I have attached ttcp.c I put the culprit code under #ifdef LINGER Thank you Jerome ----- Original Message ----- From: "Michael S. Tsirkin" To: "Jerome Pioux" Cc: Sent: Tuesday, November 08, 2005 12:19 PM Subject: Re: SDP close / linger problem > Hmm. > Is the app small enough to post it here and use to reproduce the > problem? > If not, the traces with debug enabled might prove useful. > > Quoting Jerome Pioux : >> Subject: SDP close / linger problem >> >> Hi Michael, >> >> Not sure that my post earlier made it to the forum? ... 
so, I thought >> that I will send it directly to you since you are the SDP person. >> >> I have a problem at closure time with client-server apps when using SDP. >> >> The problem is that it ALWAYS takes the "LINGER" time value for one or >> the other side to complete the close() and regardless of what the value >> was set at (2s, 10s 30s, 60s...) >> >> >From my understanding, if the LINGER option is set, the close is queued >> >up on your send queue behind all other data potentially already queued >> >up at the time. But, if there are no data queued up, the close should be >> >immediate. >> >> On "my" app, this is always the receiver side that experiences the >> problem (the app is symetrical this is why there is a LINGER on the >> receiver side). >> >> I added a close in ttcp (ttcp does not use explicit close) with a linger >> time and the sender is now the one that always experiences this problem. >> >> I believe that, for both apps, all data have been sent (and received) >> correctly before the close - nothing (at least from the app view) is in >> the "pipe" (I had the app to report that before the closes). >> >> Finally, both app work fine using IPoIB - I meant that for the same >> tests, the closes are immediate, regardless of the LINGER values. >> >> Any idea please? >> I can provide traces if needed - please tell me what is needed and how to >> get them. >> >> Thank you, >> Jerome >> >> ps: ia64 / RHEL4 / 2.6.12 / sn rev 3882 >> >> > > -- > MST -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: ttcp.c URL: From iod00d at hp.com Tue Nov 8 13:02:38 2005 From: iod00d at hp.com (Grant Grundler) Date: Tue, 8 Nov 2005 13:02:38 -0800 Subject: [openib-general] ib_mthca segfault with SVN r3984 In-Reply-To: <52k6fjotb6.fsf@cisco.com> References: <20051108184226.GA21242@esmail.cup.hp.com> <52k6fjotb6.fsf@cisco.com> Message-ID: <20051108210238.GA21734@esmail.cup.hp.com> On Tue, Nov 08, 2005 at 11:03:57AM -0800, Roland Dreier wrote: > Very odd -- mthca_uar_alloc() is an innocuous function and I'm not > sure why it would be oopsing now. I found it odd too. > Is it possible that some old object files hung around and screwed > things up? Does this still happen with an absolutely clean build? This is an absolutely clean build. I also rm -rf /lib/modules/2.6.14 and re-installed. I'll rebuild from scratch to make sure it's repeatable. thanks, grant From caitlinb at broadcom.com Tue Nov 8 13:04:11 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 8 Nov 2005 13:04:11 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS ( ReliableDatagramSockets) to OpenIB Message-ID: <54AD0F12E08D1541B826BE97C98F99F1041568@NT-SJCA-0751.brcm.ad.broadcom.com> ________________________________ From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Michael Krause Sent: Tuesday, November 08, 2005 11:52 AM To: Rimmer, Todd Cc: openib-general at openib.org Subject: RE: [openib-general] [ANNOUNCE] Contribute RDS ( ReliableDatagramSockets) to OpenIB The entire discussion might be distilled into the following: - Datagram applications trade reliability for flexibility and resource savings. Reliable Datagram applications have endpoints that accept messages from multiple known sources, rather than from a single known source (TCP, RC) or multiple unknown sources (UDP, RD). This does save resources, but perhaps just as importantly it may reflect how the application truly thinks of its communication endpoints. 
Oracle is not unique in this communication requirement. This is essentially the interface MPI presents to its users as well. - Datagram applications that require reliability have to re-invent the wheel and given it is non-trivial, they often get it variable quality and can suffer performance loss if done poorly or the network is very lossy. Given networks are a lot less lossy today than years past, sans congestion drops, one might argue about whether there is still a significant problem or not. [cait] Standardized congestion control that is not dependent on application specific control is highly desirable. In the IP world new ULPs based upon UDP are heavily discouraged for exactly this reason. - The reliable datagram model isn't new - been there, done that on earlier interconnects - but it isn't free. IB could have done something like RDS but the people who pushed the original requirements (some who are advocating RDS now) did not want to take on the associated software enablement thus it was subsumed into hardware and made slightly more restrictive as a result - perhaps more than some people may like. The only real delta between RDS one sense and the current IB RD is the number of outstanding messages in flight on a given EEC. If RD were re-defined to allow software to recover some types of failures much like UC, then one could simply use RD. [cait] The RDS API should definitely be compatiable with IB RD service, especially any later one that solves the crippling limitation on in-flight messages. Similarly the API should be compatible with IP based solutions, which since it is derived from SOCK_DGRAM isn't much of a challenge. - RDS does not solve a set of failure models. For example, if a RNIC / HCA were to fail, then one cannot simply replay the operations on another RNIC / HCA without extracting state, etc. and providing some end-to-end sync of what was really sent / received by the application. 
Yes, one can recover from cable or switch port failure by using APM style recovery but that is only one class of faults. The harder faults either result in the end node being cast out of the cluster or see silent data corruption unless additional steps are taken to transparently recover - again app writers don't want to solve the hard problems; they want that done for them. [cait] This goes to the question of where the Reliable Datagram Service is implemented. When done as middleware over existing reliable connection services then the middleware does have a few issues on handling flushed buffers after an RNIC failure. These issues make implementation of a zero-copy strategy more of an issue. But if the endpoint is truly a datagram endpoint then these issues are the same as for failover of connection-oriented endpoints between two RNICs/HCAs. - RNIC / HCA provide hardware acceleration and reliable delivery to the remote RNIC / HCA (not to the application since that is in a separate fault domain). Doing software multiplexing over such an interconnect as envisioned for IB RD is relatively straight in many respects but not a trivial exercise as some might contend. Yes, people can point to a small number of lines of code but that is just for the initial offering and is not an indication of what it might have to become long-term to add all of the bells-n-whistles that people have envisioned. [cait] IB RD is not transport neutral, and has the problem of severe in-flight limitations that would make it unacceptable to most applications that would benefit from RDS even if they were There is no way that iWARP vendors would ever implement a service designed to match IB RD. An RDS service could be implemented over TCP, MPA, MS-MPA or SCTP. - RDS is not an API but a ULP. It really uses a set of physical connections and which are then used to set up logical application associations (often referred to as connections but really are not in terms of the interconnect). 
These associations can be quickly established as they are just control messages over the existing physical connections. Again, builds on concepts already shipping in earlier interconnects / solutions from a number of years back. Hence, for large scale applications which are association intensive, RDS is able to improve the performance of establishing these associations. While RDS improves the performance in this regard, its impacts on actual performance stem more from avoiding some operations thus nearly all of the performance numbers quoted are really an apple-to-orange comparison. Nothing wrong with this but people need to keep in mind that things are not being compared with one another on the same level thus the results can look more dramatic. [cait] All correct. The real issue with RDS is whether it makes sense to present this a pseudo-transport service, or if its just a suggested strategy that each application should implement on its own. From a wire perspective there isn't much different. From a development perspective it makes sense as long as the pseudo-transport definition is indeed defined as though it were a transport. I believe that is the case here. - One thing to keep in mind is that RDS is about not doing work to gain performance and to potentially improve code by eliminating software that was too complex / difficult to get clean when it was invoked to recover from fabric-related issues. This is somewhat the same logic as used by NFS when migrating to TCP from UDP. Could not get clean software so change the underlying comms to push the problem to a place where it is largely solved. Now, whether you believe RDS is great or not, it is an attempt to solve a problem plaguing one class of applications who'd rather not spend their resources on the problem. That is a fair thing to consider if someone else has already done it better using another technology. 
One could also consider having IB change the RD semantics to see if that would solve the problem since it would not require a new ULP to make it work when you think about it though there is no analog with iWARP. The discussion so far has been interesting and I think there is fair push back to avoid re-inventing the wheel especially on the idea of trying to do this directly on Ethernet (that seems like just re-inventing all of that buggy code people stated they could not get right at the app layer in the first place and largely goes against the logic used to create IB and as well as iWARP's use of TCP in the first place). [cait] If there were a definition of a usable RD service from IB then porting it to iWARP could be considered as well. The key characteristics are a) message orientation, b) reliable delivery c) multiple known source (not unlimited unknown) and d) multiple in-flight messages with the ULP being responsible for flow control. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jerome.pioux at bull.com Tue Nov 8 13:04:31 2005 From: jerome.pioux at bull.com (Jerome Pioux) Date: Tue, 8 Nov 2005 14:04:31 -0700 Subject: [openib-general] Re: SDP close / linger problem References: <00bd01c5e3c9$81ab5c70$0211708d@gpv.az05.bull.com><20051108191925.GB30918@mellanox.co.il> <007601c5e4a6$c19cf400$0211708d@gpv.az05.bull.com> Message-ID: <00a201c5e4a8$049bb880$0211708d@gpv.az05.bull.com> I forgot: this is the way I run it: On the receiver side: LD_PRELOAD="/lib/libsdp.so" ttcp -r -p 5001 -l 1048576 -b 1048576 -f M -n 5000 On the sender side: LD_PRELOAD="/lib/libsdp.so" ttcp -t -p 5001 -l 1048576 -b 1048576 -f M -n 5000 192.168.0.100 The linger is configured at 10s. If you monitor the processes, you should see 10s delay for the sender process after all xfers were completed (or if you prefer, the receiver process should be gone 10s before the sender) Jerome ----- Original Message ----- From: "Jerome Pioux" To: "Michael S. 
Tsirkin" Cc: Sent: Tuesday, November 08, 2005 1:55 PM Subject: [openib-general] Re: SDP close / linger problem >I have attached ttcp.c > I put the culprit code under #ifdef LINGER > > Thank you > Jerome > > > ----- Original Message ----- > From: "Michael S. Tsirkin" > To: "Jerome Pioux" > Cc: > Sent: Tuesday, November 08, 2005 12:19 PM > Subject: Re: SDP close / linger problem > > >> Hmm. >> Is the app small enough to post it here and use to reproduce the >> problem? >> If not, the traces with debug enabled might prove useful. >> >> Quoting Jerome Pioux : >>> Subject: SDP close / linger problem >>> >>> Hi Michael, >>> >>> Not sure that my post earlier made it to the forum? ... so, I thought >>> that I will send it directly to you since you are the SDP person. >>> >>> I have a problem at closure time with client-server apps when using SDP. >>> >>> The problem is that it ALWAYS takes the "LINGER" time value for one or >>> the other side to complete the close() and regardless of what the value >>> was set at (2s, 10s 30s, 60s...) >>> >>> >From my understanding, if the LINGER option is set, the close is queued >>> >up on your send queue behind all other data potentially already queued >>> >up at the time. But, if there are no data queued up, the close should >>> >be >>> >immediate. >>> >>> On "my" app, this is always the receiver side that experiences the >>> problem (the app is symetrical this is why there is a LINGER on the >>> receiver side). >>> >>> I added a close in ttcp (ttcp does not use explicit close) with a linger >>> time and the sender is now the one that always experiences this problem. >>> >>> I believe that, for both apps, all data have been sent (and received) >>> correctly before the close - nothing (at least from the app view) is in >>> the "pipe" (I had the app to report that before the closes). >>> >>> Finally, both app work fine using IPoIB - I meant that for the same >>> tests, the closes are immediate, regardless of the LINGER values. 
>>> >>> Any idea please? >>> I can provide traces if needed - please tell me what is needed and how >>> to >>> get them. >>> >>> Thank you, >>> Jerome >>> >>> ps: ia64 / RHEL4 / 2.6.12 / sn rev 3882 >>> >>> >> >> -- >> MST > -------------------------------------------------------------------------------- > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Tue Nov 8 13:01:30 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Nov 2005 16:01:30 -0500 Subject: [openib-general] [PATCH] OpenSM: Support SMKey command line option Message-ID: <1131483690.4451.233.camel@hal.voltaire.com> OpenSM: Support SMKey command line option IBA requires (in C14-37.1.1) an out of band method to set SM_Key. A command line option is added for this. (There already is an option for SM_Priority. 
Signed-off-by: Hal Rosenstock

Index: main.c
===================================================================
--- main.c (revision 3991)
+++ main.c (working copy)
@@ -118,6 +118,9 @@ show_usage(void)
           " This option specifies the SM's PRIORITY.\n"
           " This will effect the handover cases, where master\n"
           " is chosen by priority and GUID.\n" );
+  printf( "-smkey \n"
+          " This option specifies the SM's SM_Key (64 bits).\n"
+          " This will effect SM authentication.\n" );
   printf( "-r\n"
           "--reassign_lids\n"
           " This option causes OpenSM to reassign LIDs to all\n"
@@ -443,6 +446,7 @@ main(
 {
   osm_subn_opt_t opt;
   ib_net64_t guid = 0;
+  ib_net64_t sm_key = 0;
   ib_api_status_t status;
   uint32_t log_flags = OSM_LOG_DEFAULT_LEVEL;
   uint32_t temp, dbg_lvl;
@@ -484,6 +488,7 @@ main(
       { "once", 0, NULL, 'o'},
       { "reassign_lids", 0, NULL, 'r'},
       { "priority", 1, NULL, 'p'},
+      { "smkey", 1, NULL, 'k'},
       { "updn", 0, NULL, 'u'},
       { "add_guid_file", 1, NULL, 'a'},
       { "cache-options", 0, NULL, 'c'},
@@ -681,6 +686,12 @@ main(
       printf(" Priority = %d\n", temp);
       break;
 
+    case 'k':
+      sm_key = cl_hton64( strtoull( optarg, NULL, 16 ));
+      printf(" SM Key <0x%"PRIx64">\n", cl_hton64( sm_key ));
+      opt.sm_key = sm_key;
+      break;
+
     case 'u':
       opt.updn_activate = TRUE;
       printf(" Activate UPDN algorithm\n");

From halr at voltaire.com Tue Nov 8 13:08:47 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 08 Nov 2005 16:08:47 -0500
Subject: [openib-general] OpenSM and Wrong SM_Key
Message-ID: <1131484127.4451.260.camel@hal.voltaire.com>

Hi,

Currently, when OpenSM receives SMInfo with a different SM_Key, it exits
as follows:

void
__osm_sminfo_rcv_process_get_response(
  IN const osm_sminfo_rcv_t* const p_rcv,
  IN const osm_madw_t* const p_madw )
{
...
  /*
     Check that the sm_key of the found SM is the same as ours,
     or is zero. If not - OpenSM cannot continue with configuration!.
  */
  if ( p_smi->sm_key != 0 &&
       p_smi->sm_key != p_rcv->p_subn->opt.sm_key )
  {
    osm_log( p_rcv->p_log, OSM_LOG_ERROR,
             "__osm_sminfo_rcv_process_get_response: ERR 2F18: "
             "Got SM with sm_key that doesn't match our "
             "local key. Exiting\n" );
    osm_log( p_rcv->p_log, OSM_LOG_SYS,
             "Found remote SM with non-matching sm_key. Exiting\n" );
    osm_exit_flag = TRUE;
    goto Exit;
  }

C14-61.2.1 states that:
A master SM which finds a higher priority master SM with the wrong SM_Key
should not relinquish the subnet.

Exiting OpenSM relinquishes the subnet.

So it appears to me that perhaps this behavior of exiting OpenSM should
be at least contingent on the SM state and relative priority of the
SMInfo received.

Make sense ? If so, I will work on a patch for this.

-- Hal

From rolandd at cisco.com Tue Nov 8 13:44:57 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Tue, 08 Nov 2005 13:44:57 -0800
Subject: [openib-general] Re: [PATCH 1 of 2] mthca: qp size calculations
In-Reply-To: <20051108174644.GD30664@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 8 Nov 2005 19:46:44 +0200")
References: <20051108174644.GD30664@mellanox.co.il>
Message-ID: <52fyq6q0fa.fsf@cisco.com>

Thanks, this looks pretty good. I updated the computation of the WQE
sizes to be more accurate (I hope). We know that a bind request will
not have any s/g entries, and an atomic request will have only one s/g
entry, so we don't have to add together all of our worst cases.

Since this is bumping the uverbs ABI anyway, I also took the
opportunity to get rid of the max_sge member in the modify SRQ
command. It's not a valid parameter, so there's no reason to pass it
from userspace.

Comments?

--- infiniband/include/rdma/ib_user_verbs.h (revision 3989)
+++ infiniband/include/rdma/ib_user_verbs.h (working copy)
@@ -43,7 +43,7 @@
  * Increment this value if any changes that break userspace ABI
  * compatibility are made.
  */
-#define IB_USER_VERBS_ABI_VERSION 3
+#define IB_USER_VERBS_ABI_VERSION 4
 
 enum {
 	IB_USER_VERBS_CMD_GET_CONTEXT,
@@ -333,6 +333,11 @@ struct ib_uverbs_create_qp {
 struct ib_uverbs_create_qp_resp {
 	__u32 qp_handle;
 	__u32 qpn;
+	__u32 max_send_wr;
+	__u32 max_recv_wr;
+	__u32 max_send_sge;
+	__u32 max_recv_sge;
+	__u32 max_inline_data;
 };
 
 /*
@@ -552,9 +557,7 @@ struct ib_uverbs_modify_srq {
 	__u32 srq_handle;
 	__u32 attr_mask;
 	__u32 max_wr;
-	__u32 max_sge;
 	__u32 srq_limit;
-	__u32 reserved;
 	__u64 driver_data[0];
 };
 
--- infiniband/core/uverbs_cmd.c (revision 3989)
+++ infiniband/core/uverbs_cmd.c (working copy)
@@ -908,7 +908,12 @@ retry:
 	if (ret)
 		goto err_destroy;
 
-	resp.qp_handle = uobj->uobject.id;
+	resp.qp_handle       = uobj->uobject.id;
+	resp.max_recv_sge    = attr.cap.max_recv_sge;
+	resp.max_send_sge    = attr.cap.max_send_sge;
+	resp.max_recv_wr     = attr.cap.max_recv_wr;
+	resp.max_send_wr     = attr.cap.max_send_wr;
+	resp.max_inline_data = attr.cap.max_inline_data;
 
 	if (copy_to_user((void __user *) (unsigned long) cmd.response,
 			 &resp, sizeof resp)) {
@@ -1701,7 +1706,6 @@ ssize_t ib_uverbs_modify_srq(struct ib_u
 	}
 
 	attr.max_wr = cmd.max_wr;
-	attr.max_sge = cmd.max_sge;
 	attr.srq_limit = cmd.srq_limit;
 
 	ret = ib_modify_srq(srq, &attr, cmd.attr_mask);
--- infiniband/hw/mthca/mthca_provider.c (revision 3989)
+++ infiniband/hw/mthca/mthca_provider.c (working copy)
@@ -604,11 +604,11 @@ static struct ib_qp *mthca_create_qp(str
 		return ERR_PTR(err);
 	}
 
-	init_attr->cap.max_inline_data = 0;
 	init_attr->cap.max_send_wr = qp->sq.max;
 	init_attr->cap.max_recv_wr = qp->rq.max;
 	init_attr->cap.max_send_sge = qp->sq.max_gs;
 	init_attr->cap.max_recv_sge = qp->rq.max_gs;
+	init_attr->cap.max_inline_data = qp->max_inline_data;
 
 	return &qp->ibqp;
 }
--- infiniband/hw/mthca/mthca_provider.h (revision 3989)
+++ infiniband/hw/mthca/mthca_provider.h (working copy)
@@ -251,6 +251,7 @@ struct mthca_qp {
 	struct mthca_wq sq;
 	enum ib_sig_type sq_policy;
 	int send_wqe_offset;
+	int max_inline_data;
 
 	u64 *wrid;
 	union mthca_buf queue;
--- infiniband/hw/mthca/mthca_cmd.c (revision 3989)
+++ infiniband/hw/mthca/mthca_cmd.c (working copy)
@@ -1060,6 +1060,8 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev
 		dev_lim->hca.arbel.resize_srq = field & 1;
 		MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SG_RQ_OFFSET);
 		dev_lim->max_sg = min_t(int, field, dev_lim->max_sg);
+		MTHCA_GET(size, outbox, QUERY_DEV_LIM_MAX_DESC_SZ_RQ_OFFSET);
+		dev_lim->max_desc_sz = min_t(int, size, dev_lim->max_desc_sz);
 		MTHCA_GET(size, outbox, QUERY_DEV_LIM_MPT_ENTRY_SZ_OFFSET);
 		dev_lim->mpt_entry_sz = size;
 		MTHCA_GET(field, outbox, QUERY_DEV_LIM_PBL_SZ_OFFSET);
--- infiniband/hw/mthca/mthca_dev.h (revision 3989)
+++ infiniband/hw/mthca/mthca_dev.h (working copy)
@@ -131,6 +131,7 @@ struct mthca_limits {
 	int max_sg;
 	int num_qps;
 	int max_wqes;
+	int max_desc_sz;
 	int max_qp_init_rdma;
 	int reserved_qps;
 	int num_srqs;
--- infiniband/hw/mthca/mthca_main.c (revision 3989)
+++ infiniband/hw/mthca/mthca_main.c (working copy)
@@ -168,6 +168,7 @@ static int __devinit mthca_dev_lim(struc
 	mdev->limits.max_srq_wqes = dev_lim->max_srq_sz;
 	mdev->limits.reserved_srqs = dev_lim->reserved_srqs;
 	mdev->limits.reserved_eecs = dev_lim->reserved_eecs;
+	mdev->limits.max_desc_sz = dev_lim->max_desc_sz;
 	/*
 	 * Subtract 1 from the limit because we need to allocate a
 	 * spare CQE so the HCA HW can tell the difference between an
--- infiniband/hw/mthca/mthca_qp.c (revision 3989)
+++ infiniband/hw/mthca/mthca_qp.c (working copy)
@@ -883,6 +883,48 @@ int mthca_modify_qp(struct ib_qp *ibqp,
 	return err;
 }
 
+static void mthca_adjust_qp_caps(struct mthca_dev *dev,
+				 struct mthca_pd *pd,
+				 struct mthca_qp *qp)
+{
+	int max_data_size;
+
+	/*
+	 * Calculate the maximum size of WQE s/g segments, excluding
+	 * the next segment and other non-data segments.
+ */ + max_data_size = min(dev->limits.max_desc_sz, 1 << qp->sq.wqe_shift) - + sizeof (struct mthca_next_seg); + + switch (qp->transport) { + case MLX: + max_data_size -= 2 * sizeof (struct mthca_data_seg); + break; + + case UD: + if (mthca_is_memfree(dev)) + max_data_size -= sizeof (struct mthca_arbel_ud_seg); + else + max_data_size -= sizeof (struct mthca_tavor_ud_seg); + break; + + default: + max_data_size -= sizeof (struct mthca_raddr_seg); + break; + } + + /* We don't support inline data for kernel QPs (yet). */ + if (!pd->ibpd.uobject) + qp->max_inline_data = 0; + else + qp->max_inline_data = max_data_size - MTHCA_INLINE_HEADER_SIZE; + + qp->sq.max_gs = max_data_size / sizeof (struct mthca_data_seg); + qp->rq.max_gs = (min(dev->limits.max_desc_sz, 1 << qp->rq.wqe_shift) - + sizeof (struct mthca_next_seg)) / + sizeof (struct mthca_data_seg); +} + /* * Allocate and register buffer for WQEs. qp->rq.max, sq.max, * rq.max_gs and sq.max_gs must all be assigned. @@ -900,27 +942,53 @@ static int mthca_alloc_wqe_buf(struct mt size = sizeof (struct mthca_next_seg) + qp->rq.max_gs * sizeof (struct mthca_data_seg); + if (size > dev->limits.max_desc_sz) + return -EINVAL; + for (qp->rq.wqe_shift = 6; 1 << qp->rq.wqe_shift < size; qp->rq.wqe_shift++) ; /* nothing */ - size = sizeof (struct mthca_next_seg) + - qp->sq.max_gs * sizeof (struct mthca_data_seg); + size = qp->sq.max_gs * sizeof (struct mthca_data_seg); switch (qp->transport) { case MLX: size += 2 * sizeof (struct mthca_data_seg); break; + case UD: - if (mthca_is_memfree(dev)) - size += sizeof (struct mthca_arbel_ud_seg); - else - size += sizeof (struct mthca_tavor_ud_seg); + size += mthca_is_memfree(dev) ? + sizeof (struct mthca_arbel_ud_seg) : + sizeof (struct mthca_tavor_ud_seg); + break; + + case UC: + size += sizeof (struct mthca_raddr_seg); + break; + + case RC: + size += sizeof (struct mthca_raddr_seg); + /* + * An atomic op will require an atomic segment, a + * remote address segment and one scatter entry. 
+ */ + size = max_t(int, size, + sizeof (struct mthca_atomic_seg) + + sizeof (struct mthca_raddr_seg) + + sizeof (struct mthca_data_seg)); break; + default: - /* bind seg is as big as atomic + raddr segs */ - size += sizeof (struct mthca_bind_seg); + break; } + /* Make sure that we have enough space for a bind request */ + size = max_t(int, size, sizeof (struct mthca_bind_seg)); + + size += sizeof (struct mthca_next_seg); + + if (size > dev->limits.max_desc_sz) + return -EINVAL; + for (qp->sq.wqe_shift = 6; 1 << qp->sq.wqe_shift < size; qp->sq.wqe_shift++) ; /* nothing */ @@ -1064,6 +1132,8 @@ static int mthca_alloc_qp_common(struct return ret; } + mthca_adjust_qp_caps(dev, pd, qp); + /* * If this is a userspace QP, we're done now. The doorbells * will be allocated and buffers will be initialized in From rolandd at cisco.com Tue Nov 8 13:46:13 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 08 Nov 2005 13:46:13 -0800 Subject: [openib-general] Re: [PATCH 2 of 2] libmthca: qp capability calculations In-Reply-To: <20051108175111.GE30664@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 8 Nov 2005 19:51:11 +0200") References: <20051108175111.GE30664@mellanox.co.il> Message-ID: <52br0uq0d6.fsf@cisco.com> Similar minor changes (add ChangeLog entries, better WQE size computation)... comments? --- libibverbs/include/infiniband/kern-abi.h (revision 3989) +++ libibverbs/include/infiniband/kern-abi.h (working copy) @@ -48,7 +48,7 @@ * The minimum and maximum kernel ABI that we can handle. 
*/ #define IB_USER_VERBS_MIN_ABI_VERSION 1 -#define IB_USER_VERBS_MAX_ABI_VERSION 3 +#define IB_USER_VERBS_MAX_ABI_VERSION 4 enum { IB_USER_VERBS_CMD_GET_CONTEXT, @@ -382,6 +382,11 @@ struct ibv_create_qp { struct ibv_create_qp_resp { __u32 qp_handle; __u32 qpn; + __u32 max_send_wr; + __u32 max_recv_wr; + __u32 max_send_sge; + __u32 max_recv_sge; + __u32 max_inline_data; }; struct ibv_qp_dest { @@ -615,9 +620,7 @@ struct ibv_modify_srq { __u32 srq_handle; __u32 attr_mask; __u32 max_wr; - __u32 max_sge; __u32 srq_limit; - __u32 reserved; __u64 driver_data[0]; }; @@ -726,4 +729,22 @@ struct ibv_create_cq_v2 { __u64 driver_data[0]; }; +struct ibv_modify_srq_v3 { + __u32 command; + __u16 in_words; + __u16 out_words; + __u32 srq_handle; + __u32 attr_mask; + __u32 max_wr; + __u32 max_sge; + __u32 srq_limit; + __u32 reserved; + __u64 driver_data[0]; +}; + +struct ibv_create_qp_resp_v3 { + __u32 qp_handle; + __u32 qpn; +}; + #endif /* KERN_ABI_H */ --- libibverbs/ChangeLog (revision 3989) +++ libibverbs/ChangeLog (working copy) @@ -1,3 +1,11 @@ +2005-11-08 Roland Dreier + + * src/cmd.c (ibv_cmd_create_qp): Add handling for new create QP + interface, which has the kernel return QP capabilities. + + * src/cmd.c (ibv_cmd_modify_srq): Split off handling of modify SRQ + for ABI versions 3 and older, which passed max_sge as part of command. 
+ 2005-10-30 Roland Dreier * examples/srq_pingpong.c (pp_init_ctx): Create CQ with rx_depth + --- libibverbs/src/cmd.c (revision 3989) +++ libibverbs/src/cmd.c (working copy) @@ -420,19 +420,49 @@ int ibv_cmd_create_srq(struct ibv_pd *pd return 0; } +static int ibv_cmd_modify_srq_v3(struct ibv_srq *srq, + struct ibv_srq_attr *srq_attr, + enum ibv_srq_attr_mask srq_attr_mask, + struct ibv_modify_srq *new_cmd, + size_t new_cmd_size) +{ + struct ibv_modify_srq_v3 *cmd; + size_t cmd_size; + + cmd_size = sizeof *cmd + new_cmd_size - sizeof *new_cmd; + cmd = alloca(cmd_size); + memcpy(cmd->driver_data, new_cmd->driver_data, new_cmd_size - sizeof *new_cmd); + + IBV_INIT_CMD(cmd, cmd_size, MODIFY_SRQ); + + cmd->srq_handle = srq->handle; + cmd->attr_mask = srq_attr_mask; + cmd->max_wr = srq_attr->max_wr; + cmd->srq_limit = srq_attr->srq_limit; + cmd->max_sge = 0; + cmd->reserved = 0; + + if (write(srq->context->cmd_fd, cmd, cmd_size) != cmd_size) + return errno; + + return 0; +} + int ibv_cmd_modify_srq(struct ibv_srq *srq, struct ibv_srq_attr *srq_attr, enum ibv_srq_attr_mask srq_attr_mask, struct ibv_modify_srq *cmd, size_t cmd_size) { + if (abi_ver == 3) + return ibv_cmd_modify_srq_v3(srq, srq_attr, srq_attr_mask, + cmd, cmd_size); + IBV_INIT_CMD(cmd, cmd_size, MODIFY_SRQ); cmd->srq_handle = srq->handle; cmd->attr_mask = srq_attr_mask; cmd->max_wr = srq_attr->max_wr; - cmd->max_sge = srq_attr->max_sge; cmd->srq_limit = srq_attr->srq_limit; - cmd->reserved = 0; if (write(srq->context->cmd_fd, cmd, cmd_size) != cmd_size) return errno; @@ -479,9 +509,15 @@ int ibv_cmd_create_qp(struct ibv_pd *pd, struct ibv_qp *qp, struct ibv_qp_init_attr *attr, struct ibv_create_qp *cmd, size_t cmd_size) { - struct ibv_create_qp_resp resp; - - IBV_INIT_CMD_RESP(cmd, cmd_size, CREATE_QP, &resp, sizeof resp); + union { + struct ibv_create_qp_resp resp; + struct ibv_create_qp_resp_v3 resp_v3; + } r; + + if (abi_ver > 3) + IBV_INIT_CMD_RESP(cmd, cmd_size, CREATE_QP, &r.resp, sizeof r.resp); + 
else + IBV_INIT_CMD_RESP(cmd, cmd_size, CREATE_QP, &r.resp_v3, sizeof r.resp_v3); cmd->user_handle = (uintptr_t) qp; cmd->pd_handle = pd->handle; cmd->send_cq_handle = attr->send_cq->handle; @@ -499,8 +535,18 @@ int ibv_cmd_create_qp(struct ibv_pd *pd, if (write(pd->context->cmd_fd, cmd, cmd_size) != cmd_size) return errno; - qp->handle = resp.qp_handle; - qp->qp_num = resp.qpn; + if (abi_ver > 3) { + qp->handle = r.resp.qp_handle; + qp->qp_num = r.resp.qpn; + attr->cap.max_recv_sge = r.resp.max_recv_sge; + attr->cap.max_send_sge = r.resp.max_send_sge; + attr->cap.max_recv_wr = r.resp.max_recv_wr; + attr->cap.max_send_wr = r.resp.max_send_wr; + attr->cap.max_inline_data = r.resp.max_inline_data; + } else { + qp->handle = r.resp_v3.qp_handle; + qp->qp_num = r.resp_v3.qpn; + } return 0; } --- libmthca/src/qp.c (revision 3989) +++ libmthca/src/qp.c (working copy) @@ -216,7 +216,6 @@ int mthca_tavor_post_send(struct ibv_qp if (wr->send_flags & IBV_SEND_INLINE) { struct mthca_inline_seg *seg = wqe; - int max_size = (1 << qp->sq.wqe_shift) - sizeof *seg - size * 16; int s = 0; wqe += sizeof *seg; @@ -225,7 +224,7 @@ int mthca_tavor_post_send(struct ibv_qp s += sge->length; - if (s > max_size) { + if (s > qp->max_inline_data) { ret = -1; *bad_wr = wr; goto out; @@ -515,7 +514,6 @@ int mthca_arbel_post_send(struct ibv_qp if (wr->send_flags & IBV_SEND_INLINE) { struct mthca_inline_seg *seg = wqe; - int max_size = (1 << qp->sq.wqe_shift) - sizeof *seg - size * 16; int s = 0; wqe += sizeof *seg; @@ -524,7 +522,7 @@ int mthca_arbel_post_send(struct ibv_qp s += sge->length; - if (s > max_size) { + if (s > qp->max_inline_data) { ret = -1; *bad_wr = wr; goto out; @@ -683,12 +681,14 @@ int mthca_alloc_qp_buf(struct ibv_pd *pd enum ibv_qp_type type, struct mthca_qp *qp) { int size; + int max_sq_sge; qp->rq.max_gs = cap->max_recv_sge; - qp->sq.max_gs = align(cap->max_inline_data + sizeof (struct mthca_inline_seg), + qp->sq.max_gs = cap->max_send_sge; + max_sq_sge = 
align(cap->max_inline_data + sizeof (struct mthca_inline_seg), sizeof (struct mthca_data_seg)) / sizeof (struct mthca_data_seg); - if (qp->sq.max_gs < cap->max_send_sge) - qp->sq.max_gs = cap->max_send_sge; + if (max_sq_sge < cap->max_send_sge) + max_sq_sge = cap->max_send_sge; qp->wrid = malloc((qp->rq.max + qp->sq.max) * sizeof (uint64_t)); if (!qp->wrid) @@ -701,20 +701,42 @@ int mthca_alloc_qp_buf(struct ibv_pd *pd qp->rq.wqe_shift++) ; /* nothing */ - size = sizeof (struct mthca_next_seg) + - qp->sq.max_gs * sizeof (struct mthca_data_seg); + size = max_sq_sge * sizeof (struct mthca_data_seg); switch (type) { case IBV_QPT_UD: - if (mthca_is_memfree(pd->context)) - size += sizeof (struct mthca_arbel_ud_seg); - else - size += sizeof (struct mthca_tavor_ud_seg); + size += mthca_is_memfree(pd->context) ? + sizeof (struct mthca_arbel_ud_seg) : + sizeof (struct mthca_tavor_ud_seg); + break; + + case IBV_QPT_UC: + size += sizeof (struct mthca_raddr_seg); + break; + + case IBV_QPT_RC: + size += sizeof (struct mthca_raddr_seg); + /* + * An atomic op will require an atomic segment, a + * remote address segment and one scatter entry. 
+ */ + if (size < (sizeof (struct mthca_atomic_seg) + + sizeof (struct mthca_raddr_seg) + + sizeof (struct mthca_data_seg))) + size = (sizeof (struct mthca_atomic_seg) + + sizeof (struct mthca_raddr_seg) + + sizeof (struct mthca_data_seg)); break; + default: - /* bind seg is as big as atomic + raddr segs */ - size += sizeof (struct mthca_bind_seg); + break; } + /* Make sure that we have enough space for a bind request */ + if (size < sizeof (struct mthca_bind_seg)) + size = sizeof (struct mthca_bind_seg); + + size += sizeof (struct mthca_next_seg); + for (qp->sq.wqe_shift = 6; 1 << qp->sq.wqe_shift < size; qp->sq.wqe_shift++) ; /* nothing */ @@ -767,36 +789,6 @@ int mthca_alloc_qp_buf(struct ibv_pd *pd return 0; } -void mthca_return_cap(struct ibv_pd *pd, struct mthca_qp *qp, - enum ibv_qp_type type, struct ibv_qp_cap *cap) -{ - /* - * Maximum inline data size is the full WQE size less the size - * of the next segment, inline segment and other non-data segments. - */ - cap->max_inline_data = (1 << qp->sq.wqe_shift) - - sizeof (struct mthca_next_seg) - - sizeof (struct mthca_inline_seg); - - switch (type) { - case IBV_QPT_UD: - if (mthca_is_memfree(pd->context)) - cap->max_inline_data -= sizeof (struct mthca_arbel_ud_seg); - else - cap->max_inline_data -= sizeof (struct mthca_tavor_ud_seg); - break; - - default: - cap->max_inline_data -= sizeof (struct mthca_raddr_seg); - break; - } - - cap->max_send_wr = qp->sq.max; - cap->max_recv_wr = qp->rq.max; - cap->max_send_sge = qp->sq.max_gs; - cap->max_recv_sge = qp->rq.max_gs; -} - struct mthca_qp *mthca_find_qp(struct mthca_context *ctx, uint32_t qpn) { int tind = (qpn & (ctx->num_qps - 1)) >> ctx->qp_table_shift; --- libmthca/src/verbs.c (revision 3989) +++ libmthca/src/verbs.c (working copy) @@ -476,7 +476,11 @@ struct ibv_qp *mthca_create_qp(struct ib if (ret) goto err_destroy; - mthca_return_cap(pd, qp, attr->qp_type, &attr->cap); + qp->sq.max = attr->cap.max_send_wr; + qp->rq.max = attr->cap.max_recv_wr; + 
qp->sq.max_gs = attr->cap.max_send_sge; + qp->rq.max_gs = attr->cap.max_recv_sge; + qp->max_inline_data = attr->cap.max_inline_data; return &qp->ibv_qp; --- libmthca/src/mthca.h (revision 3989) +++ libmthca/src/mthca.h (working copy) @@ -177,6 +177,7 @@ struct mthca_qp { void *buf; uint64_t *wrid; int send_wqe_offset; + int max_inline_data; int buf_size; struct mthca_wq sq; struct mthca_wq rq; @@ -319,8 +320,6 @@ extern int mthca_arbel_post_recv(struct struct ibv_recv_wr **bad_wr); extern int mthca_alloc_qp_buf(struct ibv_pd *pd, struct ibv_qp_cap *cap, enum ibv_qp_type type, struct mthca_qp *qp); -extern void mthca_return_cap(struct ibv_pd *pd, struct mthca_qp *qp, - enum ibv_qp_type type, struct ibv_qp_cap *cap); extern struct mthca_qp *mthca_find_qp(struct mthca_context *ctx, uint32_t qpn); extern int mthca_store_qp(struct mthca_context *ctx, uint32_t qpn, struct mthca_qp *qp); extern void mthca_clear_qp(struct mthca_context *ctx, uint32_t qpn); --- libmthca/ChangeLog (revision 3989) +++ libmthca/ChangeLog (working copy) @@ -1,3 +1,8 @@ +2005-11-08 Roland Dreier + + * src/qp.c, src/verbs.c, src/mthca.h: Delegate setting of QP + capabilities (max_sge, max_inline_data, etc) to kernel. + 2005-11-04 Roland Dreier * src/verbs.c (mthca_destroy_qp): Clean CQEs when we destroy a QP. From kingman at austin.rr.com Tue Nov 8 15:28:17 2005 From: kingman at austin.rr.com (John Kingman) Date: Tue, 8 Nov 2005 17:28:17 -0600 (CST) Subject: [openib-general] [PATCH] kernel 2.6.14 problems Message-ID: It appears that class_device_create has changed for some recent levels of 2.6.14 (e.g., 2.6.14-mm1). The symptoms include build warning messages about class_device_create parameters on compiles of uat.c, user_mad.c, and uverbs_main.c. The following patches work for me. 
Signed-off-by: John Kingman Index: uat.c =================================================================== --- uat.c (revision 3991) +++ uat.c (working copy) @@ -834,7 +834,7 @@ static int __init ib_uat_init(void) goto err_class; } - class_device_create(ib_uat_class, IB_UAT_DEV, NULL, "uat"); + class_device_create(ib_uat_class, NULL, IB_UAT_DEV, NULL, "uat"); idr_init(&ctx_id_table); init_MUTEX(&ctx_id_mutex); Index: user_mad.c =================================================================== --- user_mad.c (revision 3991) +++ user_mad.c (working copy) @@ -790,7 +790,7 @@ static int ib_umad_init_port(struct ib_d if (cdev_add(port->dev, base_dev + port->dev_num, 1)) goto err_cdev; - port->class_dev = class_device_create(umad_class, port->dev->dev, + port->class_dev = class_device_create(umad_class, NULL, port->dev->dev, device->dma_device, "umad%d", port->dev_num); if (IS_ERR(port->class_dev)) @@ -810,7 +810,7 @@ static int ib_umad_init_port(struct ib_d if (cdev_add(port->sm_dev, base_dev + port->dev_num + IB_UMAD_MAX_PORTS, 1)) goto err_sm_cdev; - port->sm_class_dev = class_device_create(umad_class, port->sm_dev->dev, + port->sm_class_dev = class_device_create(umad_class, NULL, port->sm_dev->dev, device->dma_device, "issm%d", port->dev_num); if (IS_ERR(port->sm_class_dev)) Index: uverbs_main.c =================================================================== --- uverbs_main.c (revision 3991) +++ uverbs_main.c (working copy) @@ -752,7 +752,7 @@ static void ib_uverbs_add_one(struct ib_ if (cdev_add(uverbs_dev->dev, IB_UVERBS_BASE_DEV + uverbs_dev->devnum, 1)) goto err_cdev; - uverbs_dev->class_dev = class_device_create(uverbs_class, uverbs_dev->dev->dev, + uverbs_dev->class_dev = class_device_create(uverbs_class, NULL, uverbs_dev->dev->dev, device->dma_device, "uverbs%d", uverbs_dev->devnum); if (IS_ERR(uverbs_dev->class_dev)) From iod00d at hp.com Tue Nov 8 16:16:04 2005 From: iod00d at hp.com (Grant Grundler) Date: Tue, 8 Nov 2005 16:16:04 -0800 Subject: 
[openib-general] ib_mthca segfault with SVN r3984 In-Reply-To: <52k6fjotb6.fsf@cisco.com> References: <20051108184226.GA21242@esmail.cup.hp.com> <52k6fjotb6.fsf@cisco.com> Message-ID: <20051109001604.GB21734@esmail.cup.hp.com> On Tue, Nov 08, 2005 at 11:03:57AM -0800, Roland Dreier wrote: > Very odd -- mthca_uar_alloc() is an innocuous function and I'm not > sure why it would be oopsing now. > > Is it possible that some old object files hung around and screwed > things up? Does this still happen with an absolutely clean build? Ok - it's not reproducible. This was the third reboot/build sequence and it's working - ie no segfault. I must not have cleaned up properly on the previous rounds. thanks, grant From ralphc at pathscale.com Tue Nov 8 16:56:23 2005 From: ralphc at pathscale.com (Ralph Campbell) Date: Tue, 08 Nov 2005 16:56:23 -0800 Subject: [openib-general] Question about QP access flags (struct ib_qp_attr.qp_access_flags) Message-ID: <1131497784.19444.94.camel@brick.internal.keyresearch.com> Any comments from OpenIB developers about the following? Ralph> When ib_modify_qp() is called with the IB_QP_ACCESS_FLAGS Ralph> set in the mask, what values should be used in struct Ralph> ib_qp_attr.qp_access_flags? The IB spec. seems to indicate Ralph> that RDMA and atomic operations are all enabled or disabled Ralph> as a group but all I see in ib_verbs.h is the enum Ralph> ib_access_flags which is used for memory region access. Ralph> These are more fine grained than the IB spec. implies for Ralph> QPs. So I can see qp_access_flags being either a boolean Ralph> or perhaps a new enum defined for the values for Ralph> qp_access_flags. Roland> I think the IB spec is at best ambiguous as to whether RDMA Roland> and atomics are enabled as a group or not. Roland> The values are IB_ACCESS_REMOTE_ATOMIC, IB_ACCESS_REMOTE_WRITE, Roland> and IB_ACCESS_REMOTE_READ or-ed together I think. 
Roland> Roland> You should probably ask on openib-general to find out what Roland> the consensus is. Sean Hefty is the guy who originally Roland> defined this interface. -- Ralph Campbell From mshefty at ichips.intel.com Tue Nov 8 17:11:30 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 08 Nov 2005 17:11:30 -0800 Subject: [openib-general] Question about QP access flags (struct ib_qp_attr.qp_access_flags) In-Reply-To: <1131497784.19444.94.camel@brick.internal.keyresearch.com> References: <1131497784.19444.94.camel@brick.internal.keyresearch.com> Message-ID: <43714CC2.1020203@ichips.intel.com> Ralph Campbell wrote: > Ralph> When ib_modify_qp() is called with the IB_QP_ACCESS_FLAGS > Ralph> set in the mask, what values should be used in struct > Ralph> ib_qp_attr.qp_access_flags? The IB spec. seems to indicate > Ralph> that RDMA and atomic operations are all enabled or disabled > Ralph> as a group but all I see in ib_verbs.h is the enum > Ralph> ib_access_flags which is used for memory region access. > Ralph> These are more fine grained than the IB spec. implies for > Ralph> QPs. So I can see qp_access_flags being either a boolean > Ralph> or perhaps a new enum defined for the values for > Ralph> qp_access_flags. > > Roland> I think the IB spec is at best ambiguous as to whether RDMA > Roland> and atomics are enabled as a group or not. > Roland> The values are IB_ACCESS_REMOTE_ATOMIC, IB_ACCESS_REMOTE_WRITE, > Roland> and IB_ACCESS_REMOTE_READ or-ed together I think. Roland's response is correct. Atomics and RDMA reads are enabled separately. (See page 573 of release 1.2 of the spec. I interpreted the separate bullets to mean that they are set separately.) I think it makes sense to keep this distinction, since atomics are also an optional feature of an HCA. If you look in cm.c for init_qp_attr, you can see how the IB CM sets the mask and QP attributes. 
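To make the "set separately" point concrete, here is a minimal userspace sketch, not taken from cm.c or ib_verbs.h. It only shows the idea that each remote operation has its own bit and that qp_access_flags is the OR of the ones you want. The enum names, the bit positions, and both helpers are assumptions made up for this illustration.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Local stand-ins for the remote-access bits named in this thread.
 * The names and bit positions are assumptions for this sketch and
 * are not copied from ib_verbs.h.
 */
enum sketch_access_flags {
	SKETCH_ACCESS_REMOTE_WRITE  = 1 << 1,
	SKETCH_ACCESS_REMOTE_READ   = 1 << 2,
	SKETCH_ACCESS_REMOTE_ATOMIC = 1 << 3
};

/*
 * Hypothetical helper: build a qp_access_flags value by OR-ing
 * together only the remote operations the caller wants to allow.
 */
static int sketch_qp_access_flags(int allow_write, int allow_read,
				  int allow_atomic)
{
	int flags = 0;

	if (allow_write)
		flags |= SKETCH_ACCESS_REMOTE_WRITE;
	if (allow_read)
		flags |= SKETCH_ACCESS_REMOTE_READ;
	if (allow_atomic)
		flags |= SKETCH_ACCESS_REMOTE_ATOMIC;

	return flags;
}

/* Print which remote operations a flags value enables. */
static void sketch_print_access(int flags)
{
	printf("write=%d read=%d atomic=%d\n",
	       !!(flags & SKETCH_ACCESS_REMOTE_WRITE),
	       !!(flags & SKETCH_ACCESS_REMOTE_READ),
	       !!(flags & SKETCH_ACCESS_REMOTE_ATOMIC));
}
```

For example, sketch_qp_access_flags(1, 1, 0) would allow RDMA writes and reads while leaving atomics off, which is the case that matters on an HCA that does not implement atomics.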
- Sean From Liang.Peng at Sun.COM Tue Nov 8 23:01:14 2005 From: Liang.Peng at Sun.COM (Liang Peng) Date: Wed, 09 Nov 2005 15:01:14 +0800 Subject: [openib-general] Problem in patching 2.6.9 kernel Message-ID: <43719EBA.9090406@Sun.COM> Hi there, I have run into a problem patching the 2.6.9 kernel and would appreciate some help. Thanks. The details are as follows: 1) I wanted to try OpenIB on our AMD64 Opteron machines, with a Cisco/Topspin IB switch and Silverstorm HCA and cables with PCI-X slots. 2) I installed RHEL4U2 with kernel 2.6.9-22ELsmp and am using gcc to compile. 3) I downloaded the OpenIB package from http://www.openib.org/downloads.html, which is openib-userspace-svn3279-2.x86_64.rpm . 3.5) I am following OpenIB's installation guide at https://openib.org/tiki/tiki-index.php?page=Installation+Cheat+Sheet 4) I got the patches from https://openib.org/svn/gen2/branches/backport-to-2.6.9/ and took the *svn3279* patches because the OpenIB RPM that I downloaded is svn3279. 5) I applied the patches to the Linux kernel without any problems (at least I can't see any problems here). 6) I ran "make menuconfig", selected the corresponding InfiniBand support modules, saved the configuration and exited. 7) I tried to add udev rules, but found that the file /etc/udev/udev.rules doesn't exist (however, the directory /etc/udev exists with rules.d/ and some other subdirectories). So I just created a new file /etc/udev/rules.d/90-ib.rules as the installation guide suggests. 8) The problem occurs when I try to compile and load the modules: [root@thebe linux-2.6.9]# make modules ... CHK include/linux/version.h make[1]: `arch/x86_64/kernel/asm-offsets.s' is up to date. Building modules, stage 2. MODPOST *** Warning: "get_sb_pseudo" [drivers/infiniband/core/ib_uverbs.ko] undefined! [root@thebe linux-2.6.9]# make modules_install ...... 
if [ -r System.map ]; then /sbin/depmod -ae -F System.map 2.6.9; fi WARNING: /lib/modules/2.6.9/kernel/drivers/infiniband/core/ib_uverbs.ko needs unknown symbol get_sb_pseudo So the modules cannot be installed, and when I try to load them with "modprobe" it says the module cannot be found. Can someone point out what I did wrong? Any hints will be appreciated very much. Regards, Liang -- Large Scale Computing and Grid Computing Asia Pacific Science & Technology Center Sun Microsystems Inc. Homepage: http://apstc.sun.com.sg Email: Liang.Peng at Sun.com and Nanyang Technological University, Singapore Email: MLPeng at ntu.edu.sg From mst at mellanox.co.il Tue Nov 8 23:43:16 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 Nov 2005 09:43:16 +0200 Subject: [openib-general] Re: [PATCH 1 of 2] mthca: qp size calculations In-Reply-To: <52fyq6q0fa.fsf@cisco.com> References: <52fyq6q0fa.fsf@cisco.com> Message-ID: <20051109074316.GG31134@mellanox.co.il> Quoting r. Roland Dreier : > Since this is bumping the uverbs ABI anyway, I also took the > opportunity to get rid of the max_sge member in the modify SRQ > command. It's not a valid parameter, so there's no reason to pass it > from userspace. By the way, wouldn't it be nice to be able to add extra parameters to verbs without bumping the uverbs ABI? What I had in mind is: let's pad all resp and cmd structures in userspace (but not in kernel) with some 0-initialized extra space at the tail. The time to do it would be now, before we bump the ABI. Comments? -- MST From eitan at mellanox.co.il Tue Nov 8 23:46:06 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 9 Nov 2005 09:46:06 +0200 Subject: [openib-general] OpenSM and Wrong SM_Key Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E36188FB@mtlexch01.mtl.com> Hi Hal, I would like to bring this to MgtWG before we change anything. 
IMO the situation in which this happens is really not "legal": if the SMs are not coordinated, at least in their SM_Key, there will be two masters on the subnet. From our experience it is always better to cause a fatal flow and exit the SM rather than report the event in some log, where normally it will not be seen ... I know this is a controversial issue. BTW: Another feature I would like to bring up is the SM behavior when it recognizes a duplicated GUID on the subnet. Currently it will just issue an error in the log file. I would propose to make it abort after sending a log event describing the DR paths to these two devices. What do you say? EZ Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Tuesday, November 08, 2005 11:09 PM > To: openib-general at openib.org > Subject: [openib-general] OpenSM and Wrong SM_Key > > Hi, > > Currently, when OpenSM receives SMInfo with a different SM_Key, it exits > as follows: > > > void > __osm_sminfo_rcv_process_get_response( > IN const osm_sminfo_rcv_t* const p_rcv, > IN const osm_madw_t* const p_madw ) > { > ... > > > > /* > Check that the sm_key of the found SM is the same as ours, > or is zero. If not - OpenSM cannot continue with configuration!. */ > if ( p_smi->sm_key != 0 && > p_smi->sm_key != p_rcv->p_subn->opt.sm_key ) > { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > "__osm_sminfo_rcv_process_get_response: ERR 2F18: " > "Got SM with sm_key that doesn't match our " > "local key. Exiting\n" ); > osm_log( p_rcv->p_log, OSM_LOG_SYS, > "Found remote SM with non-matching sm_key. Exiting\n" ); > osm_exit_flag = TRUE; > goto Exit; > } > > C14-61.2.1 states that: > A master SM which finds a higher priority master SM with the wrong > SM_Key should not relinquish the subnet. > > Exiting OpenSM relinquishes the subnet. 
> > So it appears to me that perhaps this behavior of exiting OpenSM should > be at least contingent on the SM state and relative priority of the > SMInfo received. Make sense ? If so, I will work on a patch for this. > > -- Hal > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From jackm at mellanox.co.il Tue Nov 8 23:56:16 2005 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Wed, 9 Nov 2005 09:56:16 +0200 Subject: [openib-general] Problem in patching 2.6.9 kernel In-Reply-To: <43719EBA.9090406@Sun.COM> References: <43719EBA.9090406@Sun.COM> Message-ID: <20051109075616.GA18789@mellanox.co.il> at URL https://openib.org/svn/gen2/branches/backport/2.6.9/ look at file core_3935_to_2_6_9.patch, and take the patch hunk for: latest/drivers/infiniband/core/uverbs_main.c This hunk addresses the "get_sb_pseudo" issue. The hunk should be OK at your SVN revision, too. Note that you may get a "symbol redefined" warning as a result of the patch -- ignore it. (the warning arises from kernel include files, which do not define the function as "static"). The basic problem was that in kernel 2.6.9, this function was not exported to the kernel's symbol table. Jack On Wed, Nov 09, 2005 at 09:01:14AM +0200, Liang Peng wrote: > Hi there, > > Got some problem in patch 2.6.9 kernel and someone please help me out. > Thanks. > > The details are as follows: > > 1) I wanted to try OpenIB on our AMD 64 Opteron machines, with > Cisco/Topspin IB switch and Silverstorm HCA and cables with PCI-X slots. > > 2) I installed RHEL4U2 with kernel 2.6.9-22ELsmp and am using gcc to > compile. > > 3) I downloaded the OpenIB packages from > http://www.openib.org/downloads.html, which is openib-userspace-svn3279-2.x86_64.rpm . 
> 3.5) I am following OpenIB's installation guide at > https://openib.org/tiki/tiki-index.php?page=Installation+Cheat+Sheet > > 4) I got the patches from > https://openib.org/svn/gen2/branches/backport-to-2.6.9/ > > and got the *svn3279* patches because the OpenIB rpms that I downloaded > is svn3279. > > 5) I applied the patches Linux kernel without any problems (at least I > can't see any problems here). > > 6) I did "make menuconfig" and selected the corresponding infiniband support modules, > saved configuration and exited. > > 7) I tried to add udev rules, but found that the file /etc/udev/udev.rules doesn't > exist (however the directory /etc/udev exists with rules.d/ and some other subdirectories). > So I just created a new file /etc/udev/rules.d/90-ib.rules as the installation guide suggests. > > 8) The problems occurs when I tried to compile and load the modules: > > [root at thebe linux-2.6.9]# make modules > ... > CHK include/linux/version.h > > make[1]: `arch/x86_64/kernel/asm-offsets.s' is up to date. > > Building modules, stage 2. > > MODPOST > > *** Warning: "get_sb_pseudo" [drivers/infiniband/core/ib_uverbs.ko] > undefined! > > > [root at thebe linux-2.6.9]# make modules_install > > ...... > > if [ -r System.map ]; then /sbin/depmod -ae -F System.map 2.6.9; fi > > WARNING: /lib/modules/2.6.9/kernel/drivers/infiniband/core/ib_uverbs.ko > needs unknown symbol get_sb_pseudo > > > So the modules cannot be installed and I can't load them with "modprobe" > and it says the module cannot be found. > > Can someone point out what I did wrongly? Any hints will be appreciated > very much. > > > Regards, > Liang > > -- > Large Scale Computing and Grid Computing > Asia Pacific Science & Technology Center > Sun Microsystems Inc. 
> Homepage: http://apstc.sun.com.sg > Email: Liang.Peng at Sun.com > and > Nanyang Technological University, Singapore > Email: MLPeng at ntu.edu.sg From Liang.Peng at Sun.COM Wed Nov 9 02:31:16 2005 From: Liang.Peng at Sun.COM (Liang Peng) Date: Wed, 09 Nov 2005 18:31:16 +0800 Subject: [openib-general] Problem in patching 2.6.9 kernel In-Reply-To: <20051109075616.GA18789@mellanox.co.il> References: <43719EBA.9090406@Sun.COM> <20051109075616.GA18789@mellanox.co.il> Message-ID: <4371CFF4.5000608@Sun.COM> Thanks Jack. After applying the patch, it seems that the problem no longer occurs. However, after rebooting I still can't find the modules: [root@thebe]# modprobe ib_mthca FATAL: Module ib_mthca not found. FATAL: Error running install command for ib_mthca Could it be that my new kernel image is not taking effect? What needs to be done with GRUB after recompiling the kernel? Sorry for the newbie-like questions and thanks a lot. Regards, Liang Peng Jack Morgenstein wrote On 11/09/05 15:56,: >at URL >https://openib.org/svn/gen2/branches/backport/2.6.9/ > >look at file core_3935_to_2_6_9.patch, and take the patch hunk for: > latest/drivers/infiniband/core/uverbs_main.c > >This hunk addresses the "get_sb_pseudo" issue. The hunk should be OK at your SVN revision, too. >Note that you may get a "symbol redefined" warning as a result of the patch -- ignore it. >(the warning arises from kernel include files, which do not define the function as "static"). > >The basic problem was that in kernel 2.6.9, this function was not exported to the kernel's symbol table. > >Jack > >On Wed, Nov 09, 2005 at 09:01:14AM +0200, Liang Peng wrote: > > >>Hi there, >> >>Got some problem in patch 2.6.9 kernel and someone please help me out. 
>>Thanks. >> >>The details are as follows: >> >>1) I wanted to try OpenIB on our AMD 64 Opteron machines, with >>Cisco/Topspin IB switch and Silverstorm HCA and cables with PCI-X slots. >> >>2) I installed RHEL4U2 with kernel 2.6.9-22ELsmp and am using gcc to >>compile. >> >>3) I downloaded the OpenIB packages from >>http://www.openib.org/downloads.html, which is openib-userspace-svn3279-2.x86_64.rpm . >>3.5) I am following OpenIB's installation guide at >>https://openib.org/tiki/tiki-index.php?page=Installation+Cheat+Sheet >> >>4) I got the patches from >>https://openib.org/svn/gen2/branches/backport-to-2.6.9/ >> >>and got the *svn3279* patches because the OpenIB rpms that I downloaded >>is svn3279. >> >>5) I applied the patches Linux kernel without any problems (at least I >>can't see any problems here). >> >>6) I did "make menuconfig" and selected the corresponding infiniband support modules, >>saved configuration and exited. >> >>7) I tried to add udev rules, but found that the file /etc/udev/udev.rules doesn't >>exist (however the directory /etc/udev exists with rules.d/ and some other subdirectories). >>So I just created a new file /etc/udev/rules.d/90-ib.rules as the installation guide suggests. >> >>8) The problems occurs when I tried to compile and load the modules: >> >>[root at thebe linux-2.6.9]# make modules >>... >> CHK include/linux/version.h >> >>make[1]: `arch/x86_64/kernel/asm-offsets.s' is up to date. >> >> Building modules, stage 2. >> >> MODPOST >> >>*** Warning: "get_sb_pseudo" [drivers/infiniband/core/ib_uverbs.ko] >>undefined! >> >> >>[root at thebe linux-2.6.9]# make modules_install >> >>...... >> >>if [ -r System.map ]; then /sbin/depmod -ae -F System.map 2.6.9; fi >> >>WARNING: /lib/modules/2.6.9/kernel/drivers/infiniband/core/ib_uverbs.ko >>needs unknown symbol get_sb_pseudo >> >> >>So the modules cannot be installed and I can't load them with "modprobe" >>and it says the module cannot be found. 
>> >>Can someone point out what I did wrongly? Any hints will be appreciated >>very much. >> >> >>Regards, >>Liang >> >>-- >>Large Scale Computing and Grid Computing >>Asia Pacific Science & Technology Center >>Sun Microsystems Inc. >>Homepage: http://apstc.sun.com.sg >>Email: Liang.Peng at Sun.com >>and >>Nanyang Technological University, Singapore >>Email: MLPeng at ntu.edu.sg >> >> > > -- Large Scale Computing and Grid Computing Asia Pacific Science & Technology Center Sun Microsystems Inc. Homepage: http://apstc.sun.com.sg Email: Liang.Peng at Sun.com and Nanyang Technological University, Singapore Email: MLPeng at ntu.edu.sg From mst at mellanox.co.il Wed Nov 9 04:14:22 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 Nov 2005 14:14:22 +0200 Subject: [openib-general] [PATCH] mthca: fix atomic operations In-Reply-To: <25AE7F432672D511B8DC00B0D0DF11DA05E9C458@MTIEX01> References: <25AE7F432672D511B8DC00B0D0DF11DA05E9C458@MTIEX01> Message-ID: <20051109121422.GJ31134@mellanox.co.il> Quoting r. Eitan Rabin : > Subject: atomic operations > > Hi Michael, > There is a bug in the atomic flow of gen2 > mthca_qp.c line 1488 should be also divided by 16. > Once that is done atomics work. Indeed. --- Fix posting atomic work requests in mthca. Signed-off-by: Michael S.
Tsirkin Index: linux-kernel/drivers/infiniband/hw/mthca/mthca_qp.c =================================================================== --- linux-kernel/drivers/infiniband/hw/mthca/mthca_qp.c (revision 3992) +++ linux-kernel/drivers/infiniband/hw/mthca/mthca_qp.c (working copy) @@ -1484,8 +1484,8 @@ int mthca_tavor_post_send(struct ib_qp * } wqe += sizeof (struct mthca_atomic_seg); - size += sizeof (struct mthca_raddr_seg) / 16 + - sizeof (struct mthca_atomic_seg); + size += (sizeof (struct mthca_raddr_seg) + + sizeof (struct mthca_atomic_seg)) / 16; break; case IB_WR_RDMA_WRITE: @@ -1804,8 +1804,8 @@ int mthca_arbel_post_send(struct ib_qp * } wqe += sizeof (struct mthca_atomic_seg); - size += sizeof (struct mthca_raddr_seg) / 16 + - sizeof (struct mthca_atomic_seg); + size += (sizeof (struct mthca_raddr_seg) + + sizeof (struct mthca_atomic_seg)) / 16; break; case IB_WR_RDMA_READ: From mst at mellanox.co.il Wed Nov 9 04:17:39 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 Nov 2005 14:17:39 +0200 Subject: [openib-general] [PATCH] libmthca: fix posting atomic work requests Message-ID: <20051109121739.GK31134@mellanox.co.il> Same thing as previous patch for userspace. --- Fix posting atomic work requests in libmthca. Signed-off-by: Michael S. 
Tsirkin Index: src/userspace/libmthca/src/qp.c =================================================================== --- src/userspace/libmthca/src/qp.c (revision 3994) +++ src/userspace/libmthca/src/qp.c (working copy) @@ -147,8 +147,8 @@ int mthca_tavor_post_send(struct ibv_qp } wqe += sizeof (struct mthca_atomic_seg); - size += sizeof (struct mthca_raddr_seg) / 16 + - sizeof (struct mthca_atomic_seg); + size += (sizeof (struct mthca_raddr_seg) + + sizeof (struct mthca_atomic_seg)) / 16; break; case IBV_WR_RDMA_WRITE: @@ -448,8 +448,8 @@ int mthca_arbel_post_send(struct ibv_qp } wqe += sizeof (struct mthca_atomic_seg); - size += sizeof (struct mthca_raddr_seg) / 16 + - sizeof (struct mthca_atomic_seg); + size += (sizeof (struct mthca_raddr_seg) + + sizeof (struct mthca_atomic_seg)) /16; break; case IBV_WR_RDMA_WRITE: -- MST From mst at mellanox.co.il Wed Nov 9 04:56:03 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 Nov 2005 14:56:03 +0200 Subject: [openib-general] [PATCH] libmthca: fix double free condition Message-ID: <20051109125603.GM31134@mellanox.co.il> It seems that on error, mthca_alloc_av frees memory it didn't allocate, which can theoretically lead to a double free (I haven't seen this in practice). Does the following patch make sense? --- Fix double free condition in libmthca. Signed-off-by: Michael S.
Tsirkin Index: src/userspace/libmthca/src/ah.c =================================================================== --- src/userspace/libmthca/src/ah.c (revision 3994) +++ src/userspace/libmthca/src/ah.c (working copy) @@ -111,7 +111,6 @@ int mthca_alloc_av(struct mthca_pd *pd, page = __add_page(pd, ps, pp); if (!page) { - free(ah); pthread_mutex_unlock(&pd->ah_mutex); return -1; } -- MST References: <52vez4uzwu.fsf@cisco.com> Message-ID: <20051109135635.GQ31134@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: user_mad.c: deadlock? > > Michael> It seems, therefore, that we can have a deadlock inside > Michael> user_mad, where ib_umad_close calls > Michael> ib_unregister_mad_agent which blocks until send_handler > Michael> runs which is blocked by the port mutex. > > It certainly looks that way, and it also looks like > ib_umad_unreg_agent() has had the same potential deadlock for a > while. In any case, I don't see any reason to hold the port mutex > while unregistering agents in ib_umad_close() (the file is already > gone, so it can't race against userspace registering or unregistering > MAD agents via ioctl).
So something like this should be good enough. > > Does anyone see anything wrong with this? > > - R. What about the error handling code in ib_umad_reg_agent? That still seems to call ib_umad_unreg_agent from under the down_write. And what about ib_umad_kill_port, which also does this? What prevents the deadlock in these two cases? -- MST From mst at mellanox.co.il Wed Nov 9 06:15:58 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 Nov 2005 16:15:58 +0200 Subject: [openib-general] [PATCH] libibverbs: protect device list initialization Message-ID: <20051109141558.GR31134@mellanox.co.il> Hello, Roland! The following patch solves a problem I'm seeing when multiple threads try to call ibv_get_devices at the same time. Which brings me to another issue: our code examples call the non-reentrant dlist_for_each variants of the dlist scanning routines, which will create strange problems for multi-threaded users who might copy them. I propose returning const struct dlist instead of struct dlist from ibv_get_devices, which I think will trigger a warning on such code, and converting all users to the reentrant dlist_for_each_nomark. Comments? --- Make ibv_get_devices reentrant. Signed-off-by: Michael S.
Tsirkin Index: src/userspace/libibverbs/src/device.c =================================================================== --- src/userspace/libibverbs/src/device.c (revision 3914) +++ src/userspace/libibverbs/src/device.c (working copy) @@ -48,13 +48,17 @@ #include "ibverbs.h" +static pthread_mutex_t device_list_lock = PTHREAD_MUTEX_INITIALIZER; static struct dlist *device_list; struct dlist *ibv_get_devices(void) { + struct dlist *l; + pthread_mutex_lock(&device_list_lock); if (!device_list) - device_list = ibverbs_init(); - return device_list; + l = device_list = ibverbs_init(); + pthread_mutex_unlock(&device_list_lock); + return l; } const char *ibv_get_device_name(struct ibv_device *device) -- MST From mst at mellanox.co.il Wed Nov 9 07:36:30 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 Nov 2005 17:36:30 +0200 Subject: [openib-general] Re: [PATCH] libibverbs: protect device list initialization In-Reply-To: <20051109141558.GR31134@mellanox.co.il> References: <20051109141558.GR31134@mellanox.co.il> Message-ID: <20051109153630.GC8633@mellanox.co.il> Quoting Michael S. Tsirkin : > The following patch solves a problem I'm seeing when multiple > threads try to call ibv_get_devices at the same time. > > Which brings me to another issue: our code examples call non-reentrant > dlist_for_each variants of dlist scanning routines, which will > create strange problems for multi-threaded users who might copy this. > > I propose returning const struct dlist instead of struct dlist > from ibv_get_devices, which I think will trigger a warning > on such code, and converting all users to the reentrant > dlist_for_each_nomark. > Comments? Oops, the patch I posted previously is broken. Here's a correct version. Sorry. --- Make ibv_get_devices reentrant. Signed-off-by: Michael S. 
Tsirkin Index: src/userspace/libibverbs/src/device.c =================================================================== --- src/userspace/libibverbs/src/device.c (revision 3914) +++ src/userspace/libibverbs/src/device.c (working copy) @@ -48,13 +48,19 @@ #include "ibverbs.h" +static pthread_mutex_t device_list_lock = PTHREAD_MUTEX_INITIALIZER; static struct dlist *device_list; struct dlist *ibv_get_devices(void) { + struct dlist *l; + pthread_mutex_lock(&device_list_lock); if (!device_list) device_list = ibverbs_init(); - return device_list; + + l = device_list; + pthread_mutex_unlock(&device_list_lock); + return l; } const char *ibv_get_device_name(struct ibv_device *device) -- MST From krause at cup.hp.com Tue Nov 8 13:03:45 2005 From: krause at cup.hp.com (Michael Krause) Date: Tue, 08 Nov 2005 13:03:45 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB In-Reply-To: <1131482271.4451.159.camel@hal.voltaire.com> References: <5D78D28F88822E4D8702BB9EEF1A436773E971@mercury.infiniconsys.com> <6.2.0.14.2.20051108112907.02345e80@esmail.cup.hp.com> <96f8e60e0511081233y2e248a3fxfe5b46e05cfcdea6@mail.gmail.com> <1131482271.4451.159.camel@hal.voltaire.com> Message-ID: <6.2.0.14.2.20051108130301.025cabf8@esmail.cup.hp.com> At 12:37 PM 11/8/2005, Hal Rosenstock wrote: >On Tue, 2005-11-08 at 15:33, Ranjit Pandit wrote: > > Using APM is not useful because it doesn't provide failover across HCAs. > >Can't APM be made to work across HCAs? No. It requires state that is only within the HCA and there are other aspects that prevent this, e.g. no single unified QP space across all HCAs, etc. Mike
From krause at cup.hp.com Tue Nov 8 13:08:13 2005 From: krause at cup.hp.com (Michael Krause) Date: Tue, 08 Nov 2005 13:08:13 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB In-Reply-To: <96f8e60e0511081233y2e248a3fxfe5b46e05cfcdea6@mail.gmail.com> References: <5D78D28F88822E4D8702BB9EEF1A436773E971@mercury.infiniconsys.com> <6.2.0.14.2.20051108112907.02345e80@esmail.cup.hp.com> <96f8e60e0511081233y2e248a3fxfe5b46e05cfcdea6@mail.gmail.com> Message-ID: <6.2.0.14.2.20051108130355.0263def8@esmail.cup.hp.com> At 12:33 PM 11/8/2005, Ranjit Pandit wrote: > > Mike wrote: > > - RDS does not solve a set of failure models. For example, if a RNIC / HCA > > were to fail, then one cannot simply replay the operations on another RNIC / > > HCA without extracting state, etc. and providing some end-to-end sync of > > what was really sent / received by the application. Yes, one can recover > > from cable or switch port failure by using APM style recovery but that is > > only one class of faults. The harder faults either result in the end node > > being cast out of the cluster or see silent data corruption unless > > additional steps are taken to transparently recover - again app writers > > don't want to solve the hard problems; they want that done for them. > >The current reference implementation of RDS solves the HCA failure case as >well. >Since applications don't need to keep connection states, it's easier >to handle cases like HCA and intermediate path failures. >As far as the application is concerned, every sendmsg 'could' result in a >new connection setup in the driver. >If the current path fails, RDS reestablishes a connection, if >available, on a different port or a different HCA, and replays the >failed messages. >Using APM is not useful because it doesn't provide failover across HCAs. I think others may disagree about whether RDS solves the problem.
You have no way of knowing whether something was received into the other node's coherency domain without some intermediary's or the application's involvement to confirm that the data arrived. As such, you might see many hardware level acks occur and not know there is a real failure. If an application takes any action assuming that send complete means the data was delivered, then it is subject to silent data corruption. Hence, RDS can replay to its heart's content, but until there is an application or middleware level of acknowledgement, you have not solved the fault domain issues. Some may be happy with this as they just cast out the endnode from the cluster / database, but others see the loss of a server as a big deal and so may not be happy to see this occur. It really comes down to whether you believe losing a server is worthwhile just for a local failure event which is not fatal to the rest of the server. APM's value is the ability to recover from link failure. It has the same value for any other ULP in that it recovers transparently to the ULP. Mike From mst at mellanox.co.il Wed Nov 9 07:51:31 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 Nov 2005 17:51:31 +0200 Subject: [openib-general] [PATCH] libmthca: fix posting long work request lists Message-ID: <20051109155131.GD8633@mellanox.co.il> Hello, Roland! Tavor requires ringing a receive doorbell at least once every 256 WQEs. The WQE count in the doorbell must be set to 0 in this case. Here's a patch. --- Fix posting work request lists of length > 255 for Tavor. Signed-off-by: Michael S.
Tsirkin Index: src/userspace/libmthca/src/qp.c =================================================================== --- src/userspace/libmthca/src/qp.c (revision 3994) +++ src/userspace/libmthca/src/qp.c (working copy) @@ -45,6 +45,8 @@ #include "doorbell.h" #include "wqe.h" +#define MTHCA_TAVOR_WQES_PER_RECV_DOORBELL 256 + static const uint8_t mthca_opcode[] = { [IBV_WR_SEND] = MTHCA_OPCODE_SEND, [IBV_WR_SEND_WITH_IMM] = MTHCA_OPCODE_SEND_IMM, @@ -313,6 +315,18 @@ int mthca_tavor_post_recv(struct ibv_qp ind = qp->rq.next_ind; for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (nreq == MTHCA_TAVOR_WQES_PER_RECV_DOORBELL) { + uint32_t doorbell[2]; + + doorbell[0] = htonl((qp->rq.next_ind << qp->rq.wqe_shift) | size0); + doorbell[1] = htonl(ibqp->qp_num << 8); + + mthca_write64(doorbell, to_mctx(ibqp->context), MTHCA_RECV_DOORBELL); + qp->rq.head += nreq; + nreq = 0; + size0 = 0; + } + if (wq_overflow(&qp->rq, nreq, to_mcq(qp->ibv_qp.send_cq))) { ret = -1; *bad_wr = wr; -- MST From jackm at mellanox.co.il Wed Nov 9 07:59:44 2005 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Wed, 9 Nov 2005 17:59:44 +0200 Subject: [openib-general] Re: [PATCH 2 of 2] libmthca: qp capability calculations In-Reply-To: <52br0uq0d6.fsf@cisco.com> References: <52br0uq0d6.fsf@cisco.com> Message-ID: <20051109155944.GA19949@mellanox.co.il> Your patch is fine (code reviewed, and also installed and checked). Jack On Tue, Nov 08, 2005 at 11:46:13PM +0200, Roland Dreier wrote: > Similar minor changes (add ChangeLog entries, better WQE size > computation)... comments? > > --- libibverbs/include/infiniband/kern-abi.h (revision 3989) > +++ libibverbs/include/infiniband/kern-abi.h (working copy) > @@ -48,7 +48,7 @@ > * The minimum and maximum kernel ABI that we can handle. 
> */ > #define IB_USER_VERBS_MIN_ABI_VERSION 1 > -#define IB_USER_VERBS_MAX_ABI_VERSION 3 > +#define IB_USER_VERBS_MAX_ABI_VERSION 4 > > enum { > IB_USER_VERBS_CMD_GET_CONTEXT, > @@ -382,6 +382,11 @@ struct ibv_create_qp { > struct ibv_create_qp_resp { > __u32 qp_handle; > __u32 qpn; > + __u32 max_send_wr; > + __u32 max_recv_wr; > + __u32 max_send_sge; > + __u32 max_recv_sge; > + __u32 max_inline_data; > }; > > struct ibv_qp_dest { > @@ -615,9 +620,7 @@ struct ibv_modify_srq { > __u32 srq_handle; > __u32 attr_mask; > __u32 max_wr; > - __u32 max_sge; > __u32 srq_limit; > - __u32 reserved; > __u64 driver_data[0]; > }; > > @@ -726,4 +729,22 @@ struct ibv_create_cq_v2 { > __u64 driver_data[0]; > }; > > +struct ibv_modify_srq_v3 { > + __u32 command; > + __u16 in_words; > + __u16 out_words; > + __u32 srq_handle; > + __u32 attr_mask; > + __u32 max_wr; > + __u32 max_sge; > + __u32 srq_limit; > + __u32 reserved; > + __u64 driver_data[0]; > +}; > + > +struct ibv_create_qp_resp_v3 { > + __u32 qp_handle; > + __u32 qpn; > +}; > + > #endif /* KERN_ABI_H */ > --- libibverbs/ChangeLog (revision 3989) > +++ libibverbs/ChangeLog (working copy) > @@ -1,3 +1,11 @@ > +2005-11-08 Roland Dreier > + > + * src/cmd.c (ibv_cmd_create_qp): Add handling for new create QP > + interface, which has the kernel return QP capabilities. > + > + * src/cmd.c (ibv_cmd_modify_srq): Split off handling of modify SRQ > + for ABI versions 3 and older, which passed max_sge as part of command. 
> + > 2005-10-30 Roland Dreier > > * examples/srq_pingpong.c (pp_init_ctx): Create CQ with rx_depth + > --- libibverbs/src/cmd.c (revision 3989) > +++ libibverbs/src/cmd.c (working copy) > @@ -420,19 +420,49 @@ int ibv_cmd_create_srq(struct ibv_pd *pd > return 0; > } > > +static int ibv_cmd_modify_srq_v3(struct ibv_srq *srq, > + struct ibv_srq_attr *srq_attr, > + enum ibv_srq_attr_mask srq_attr_mask, > + struct ibv_modify_srq *new_cmd, > + size_t new_cmd_size) > +{ > + struct ibv_modify_srq_v3 *cmd; > + size_t cmd_size; > + > + cmd_size = sizeof *cmd + new_cmd_size - sizeof *new_cmd; > + cmd = alloca(cmd_size); > + memcpy(cmd->driver_data, new_cmd->driver_data, new_cmd_size - sizeof *new_cmd); > + > + IBV_INIT_CMD(cmd, cmd_size, MODIFY_SRQ); > + > + cmd->srq_handle = srq->handle; > + cmd->attr_mask = srq_attr_mask; > + cmd->max_wr = srq_attr->max_wr; > + cmd->srq_limit = srq_attr->srq_limit; > + cmd->max_sge = 0; > + cmd->reserved = 0; > + > + if (write(srq->context->cmd_fd, cmd, cmd_size) != cmd_size) > + return errno; > + > + return 0; > +} > + > int ibv_cmd_modify_srq(struct ibv_srq *srq, > struct ibv_srq_attr *srq_attr, > enum ibv_srq_attr_mask srq_attr_mask, > struct ibv_modify_srq *cmd, size_t cmd_size) > { > + if (abi_ver == 3) > + return ibv_cmd_modify_srq_v3(srq, srq_attr, srq_attr_mask, > + cmd, cmd_size); > + > IBV_INIT_CMD(cmd, cmd_size, MODIFY_SRQ); > > cmd->srq_handle = srq->handle; > cmd->attr_mask = srq_attr_mask; > cmd->max_wr = srq_attr->max_wr; > - cmd->max_sge = srq_attr->max_sge; > cmd->srq_limit = srq_attr->srq_limit; > - cmd->reserved = 0; > > if (write(srq->context->cmd_fd, cmd, cmd_size) != cmd_size) > return errno; > @@ -479,9 +509,15 @@ int ibv_cmd_create_qp(struct ibv_pd *pd, > struct ibv_qp *qp, struct ibv_qp_init_attr *attr, > struct ibv_create_qp *cmd, size_t cmd_size) > { > - struct ibv_create_qp_resp resp; > - > - IBV_INIT_CMD_RESP(cmd, cmd_size, CREATE_QP, &resp, sizeof resp); > + union { > + struct ibv_create_qp_resp resp; > + 
struct ibv_create_qp_resp_v3 resp_v3; > + } r; > + > + if (abi_ver > 3) > + IBV_INIT_CMD_RESP(cmd, cmd_size, CREATE_QP, &r.resp, sizeof r.resp); > + else > + IBV_INIT_CMD_RESP(cmd, cmd_size, CREATE_QP, &r.resp_v3, sizeof r.resp_v3); > cmd->user_handle = (uintptr_t) qp; > cmd->pd_handle = pd->handle; > cmd->send_cq_handle = attr->send_cq->handle; > @@ -499,8 +535,18 @@ int ibv_cmd_create_qp(struct ibv_pd *pd, > if (write(pd->context->cmd_fd, cmd, cmd_size) != cmd_size) > return errno; > > - qp->handle = resp.qp_handle; > - qp->qp_num = resp.qpn; > + if (abi_ver > 3) { > + qp->handle = r.resp.qp_handle; > + qp->qp_num = r.resp.qpn; > + attr->cap.max_recv_sge = r.resp.max_recv_sge; > + attr->cap.max_send_sge = r.resp.max_send_sge; > + attr->cap.max_recv_wr = r.resp.max_recv_wr; > + attr->cap.max_send_wr = r.resp.max_send_wr; > + attr->cap.max_inline_data = r.resp.max_inline_data; > + } else { > + qp->handle = r.resp_v3.qp_handle; > + qp->qp_num = r.resp_v3.qpn; > + } > > return 0; > } > --- libmthca/src/qp.c (revision 3989) > +++ libmthca/src/qp.c (working copy) > @@ -216,7 +216,6 @@ int mthca_tavor_post_send(struct ibv_qp > > if (wr->send_flags & IBV_SEND_INLINE) { > struct mthca_inline_seg *seg = wqe; > - int max_size = (1 << qp->sq.wqe_shift) - sizeof *seg - size * 16; > int s = 0; > > wqe += sizeof *seg; > @@ -225,7 +224,7 @@ int mthca_tavor_post_send(struct ibv_qp > > s += sge->length; > > - if (s > max_size) { > + if (s > qp->max_inline_data) { > ret = -1; > *bad_wr = wr; > goto out; > @@ -515,7 +514,6 @@ int mthca_arbel_post_send(struct ibv_qp > > if (wr->send_flags & IBV_SEND_INLINE) { > struct mthca_inline_seg *seg = wqe; > - int max_size = (1 << qp->sq.wqe_shift) - sizeof *seg - size * 16; > int s = 0; > > wqe += sizeof *seg; > @@ -524,7 +522,7 @@ int mthca_arbel_post_send(struct ibv_qp > > s += sge->length; > > - if (s > max_size) { > + if (s > qp->max_inline_data) { > ret = -1; > *bad_wr = wr; > goto out; > @@ -683,12 +681,14 @@ int 
mthca_alloc_qp_buf(struct ibv_pd *pd > enum ibv_qp_type type, struct mthca_qp *qp) > { > int size; > + int max_sq_sge; > > qp->rq.max_gs = cap->max_recv_sge; > - qp->sq.max_gs = align(cap->max_inline_data + sizeof (struct mthca_inline_seg), > + qp->sq.max_gs = cap->max_send_sge; > + max_sq_sge = align(cap->max_inline_data + sizeof (struct mthca_inline_seg), > sizeof (struct mthca_data_seg)) / sizeof (struct mthca_data_seg); > - if (qp->sq.max_gs < cap->max_send_sge) > - qp->sq.max_gs = cap->max_send_sge; > + if (max_sq_sge < cap->max_send_sge) > + max_sq_sge = cap->max_send_sge; > > qp->wrid = malloc((qp->rq.max + qp->sq.max) * sizeof (uint64_t)); > if (!qp->wrid) > @@ -701,20 +701,42 @@ int mthca_alloc_qp_buf(struct ibv_pd *pd > qp->rq.wqe_shift++) > ; /* nothing */ > > - size = sizeof (struct mthca_next_seg) + > - qp->sq.max_gs * sizeof (struct mthca_data_seg); > + size = max_sq_sge * sizeof (struct mthca_data_seg); > switch (type) { > case IBV_QPT_UD: > - if (mthca_is_memfree(pd->context)) > - size += sizeof (struct mthca_arbel_ud_seg); > - else > - size += sizeof (struct mthca_tavor_ud_seg); > + size += mthca_is_memfree(pd->context) ? > + sizeof (struct mthca_arbel_ud_seg) : > + sizeof (struct mthca_tavor_ud_seg); > + break; > + > + case IBV_QPT_UC: > + size += sizeof (struct mthca_raddr_seg); > + break; > + > + case IBV_QPT_RC: > + size += sizeof (struct mthca_raddr_seg); > + /* > + * An atomic op will require an atomic segment, a > + * remote address segment and one scatter entry. 
> + */ > + if (size < (sizeof (struct mthca_atomic_seg) + > + sizeof (struct mthca_raddr_seg) + > + sizeof (struct mthca_data_seg))) > + size = (sizeof (struct mthca_atomic_seg) + > + sizeof (struct mthca_raddr_seg) + > + sizeof (struct mthca_data_seg)); > break; > + > default: > - /* bind seg is as big as atomic + raddr segs */ > - size += sizeof (struct mthca_bind_seg); > + break; > } > > + /* Make sure that we have enough space for a bind request */ > + if (size < sizeof (struct mthca_bind_seg)) > + size = sizeof (struct mthca_bind_seg); > + > + size += sizeof (struct mthca_next_seg); > + > for (qp->sq.wqe_shift = 6; 1 << qp->sq.wqe_shift < size; > qp->sq.wqe_shift++) > ; /* nothing */ > @@ -767,36 +789,6 @@ int mthca_alloc_qp_buf(struct ibv_pd *pd > return 0; > } > > -void mthca_return_cap(struct ibv_pd *pd, struct mthca_qp *qp, > - enum ibv_qp_type type, struct ibv_qp_cap *cap) > -{ > - /* > - * Maximum inline data size is the full WQE size less the size > - * of the next segment, inline segment and other non-data segments. 
> - */ > - cap->max_inline_data = (1 << qp->sq.wqe_shift) - > - sizeof (struct mthca_next_seg) - > - sizeof (struct mthca_inline_seg); > - > - switch (type) { > - case IBV_QPT_UD: > - if (mthca_is_memfree(pd->context)) > - cap->max_inline_data -= sizeof (struct mthca_arbel_ud_seg); > - else > - cap->max_inline_data -= sizeof (struct mthca_tavor_ud_seg); > - break; > - > - default: > - cap->max_inline_data -= sizeof (struct mthca_raddr_seg); > - break; > - } > - > - cap->max_send_wr = qp->sq.max; > - cap->max_recv_wr = qp->rq.max; > - cap->max_send_sge = qp->sq.max_gs; > - cap->max_recv_sge = qp->rq.max_gs; > -} > - > struct mthca_qp *mthca_find_qp(struct mthca_context *ctx, uint32_t qpn) > { > int tind = (qpn & (ctx->num_qps - 1)) >> ctx->qp_table_shift; > --- libmthca/src/verbs.c (revision 3989) > +++ libmthca/src/verbs.c (working copy) > @@ -476,7 +476,11 @@ struct ibv_qp *mthca_create_qp(struct ib > if (ret) > goto err_destroy; > > - mthca_return_cap(pd, qp, attr->qp_type, &attr->cap); > + qp->sq.max = attr->cap.max_send_wr; > + qp->rq.max = attr->cap.max_recv_wr; > + qp->sq.max_gs = attr->cap.max_send_sge; > + qp->rq.max_gs = attr->cap.max_recv_sge; > + qp->max_inline_data = attr->cap.max_inline_data; > > return &qp->ibv_qp; > > --- libmthca/src/mthca.h (revision 3989) > +++ libmthca/src/mthca.h (working copy) > @@ -177,6 +177,7 @@ struct mthca_qp { > void *buf; > uint64_t *wrid; > int send_wqe_offset; > + int max_inline_data; > int buf_size; > struct mthca_wq sq; > struct mthca_wq rq; > @@ -319,8 +320,6 @@ extern int mthca_arbel_post_recv(struct > struct ibv_recv_wr **bad_wr); > extern int mthca_alloc_qp_buf(struct ibv_pd *pd, struct ibv_qp_cap *cap, > enum ibv_qp_type type, struct mthca_qp *qp); > -extern void mthca_return_cap(struct ibv_pd *pd, struct mthca_qp *qp, > - enum ibv_qp_type type, struct ibv_qp_cap *cap); > extern struct mthca_qp *mthca_find_qp(struct mthca_context *ctx, uint32_t qpn); > extern int mthca_store_qp(struct mthca_context *ctx, 
uint32_t qpn, struct mthca_qp *qp); > extern void mthca_clear_qp(struct mthca_context *ctx, uint32_t qpn); > --- libmthca/ChangeLog (revision 3989) > +++ libmthca/ChangeLog (working copy) > @@ -1,3 +1,8 @@ > +2005-11-08 Roland Dreier > + > + * src/qp.c, src/verbs.c, src/mthca.h: Delegate setting of QP > + capabilities (max_sge, max_inline_data, etc) to kernel. > + > 2005-11-04 Roland Dreier > > * src/verbs.c (mthca_destroy_qp): Clean CQEs when we destroy a QP. > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From jackm at mellanox.co.il Wed Nov 9 08:00:29 2005 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Wed, 9 Nov 2005 18:00:29 +0200 Subject: [openib-general] Re: [PATCH 1 of 2] mthca: qp size calculations In-Reply-To: <52fyq6q0fa.fsf@cisco.com> References: <52fyq6q0fa.fsf@cisco.com> Message-ID: <20051109160029.GB19949@mellanox.co.il> Your patch is fine (code reviewed, and also installed and checked). Jack On Tue, Nov 08, 2005 at 11:44:57PM +0200, Roland Dreier wrote: > Thanks, this looks pretty good. I updated the computation of the WQE > sizes to be more accurate (I hope). We know that a bind request will > not have any s/g entries, and an atomic request will have only one s/g > entry, so we don't have to add together all of our worst cases. > > Since this is bumping the uverbs ABI anyway, I also took the > opportunity to get rid of the max_sge member in the modify SRQ > command. It's not a valid parameter, so there's no reason to pass it > from userspace. > > Comments? > > --- infiniband/include/rdma/ib_user_verbs.h (revision 3989) > +++ infiniband/include/rdma/ib_user_verbs.h (working copy) > @@ -43,7 +43,7 @@ > * Increment this value if any changes that break userspace ABI > * compatibility are made. 
> */ > -#define IB_USER_VERBS_ABI_VERSION 3 > +#define IB_USER_VERBS_ABI_VERSION 4 > > enum { > IB_USER_VERBS_CMD_GET_CONTEXT, > @@ -333,6 +333,11 @@ struct ib_uverbs_create_qp { > struct ib_uverbs_create_qp_resp { > __u32 qp_handle; > __u32 qpn; > + __u32 max_send_wr; > + __u32 max_recv_wr; > + __u32 max_send_sge; > + __u32 max_recv_sge; > + __u32 max_inline_data; > }; > > /* > @@ -552,9 +557,7 @@ struct ib_uverbs_modify_srq { > __u32 srq_handle; > __u32 attr_mask; > __u32 max_wr; > - __u32 max_sge; > __u32 srq_limit; > - __u32 reserved; > __u64 driver_data[0]; > }; > > --- infiniband/core/uverbs_cmd.c (revision 3989) > +++ infiniband/core/uverbs_cmd.c (working copy) > @@ -908,7 +908,12 @@ retry: > if (ret) > goto err_destroy; > > - resp.qp_handle = uobj->uobject.id; > + resp.qp_handle = uobj->uobject.id; > + resp.max_recv_sge = attr.cap.max_recv_sge; > + resp.max_send_sge = attr.cap.max_send_sge; > + resp.max_recv_wr = attr.cap.max_recv_wr; > + resp.max_send_wr = attr.cap.max_send_wr; > + resp.max_inline_data = attr.cap.max_inline_data; > > if (copy_to_user((void __user *) (unsigned long) cmd.response, > &resp, sizeof resp)) { > @@ -1701,7 +1706,6 @@ ssize_t ib_uverbs_modify_srq(struct ib_u > } > > attr.max_wr = cmd.max_wr; > - attr.max_sge = cmd.max_sge; > attr.srq_limit = cmd.srq_limit; > > ret = ib_modify_srq(srq, &attr, cmd.attr_mask); > --- infiniband/hw/mthca/mthca_provider.c (revision 3989) > +++ infiniband/hw/mthca/mthca_provider.c (working copy) > @@ -604,11 +604,11 @@ static struct ib_qp *mthca_create_qp(str > return ERR_PTR(err); > } > > - init_attr->cap.max_inline_data = 0; > init_attr->cap.max_send_wr = qp->sq.max; > init_attr->cap.max_recv_wr = qp->rq.max; > init_attr->cap.max_send_sge = qp->sq.max_gs; > init_attr->cap.max_recv_sge = qp->rq.max_gs; > + init_attr->cap.max_inline_data = qp->max_inline_data; > > return &qp->ibqp; > } > --- infiniband/hw/mthca/mthca_provider.h (revision 3989) > +++ infiniband/hw/mthca/mthca_provider.h (working copy) > 
@@ -251,6 +251,7 @@ struct mthca_qp { > struct mthca_wq sq; > enum ib_sig_type sq_policy; > int send_wqe_offset; > + int max_inline_data; > > u64 *wrid; > union mthca_buf queue; > --- infiniband/hw/mthca/mthca_cmd.c (revision 3989) > +++ infiniband/hw/mthca/mthca_cmd.c (working copy) > @@ -1060,6 +1060,8 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev > dev_lim->hca.arbel.resize_srq = field & 1; > MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SG_RQ_OFFSET); > dev_lim->max_sg = min_t(int, field, dev_lim->max_sg); > + MTHCA_GET(size, outbox, QUERY_DEV_LIM_MAX_DESC_SZ_RQ_OFFSET); > + dev_lim->max_desc_sz = min_t(int, size, dev_lim->max_desc_sz); > MTHCA_GET(size, outbox, QUERY_DEV_LIM_MPT_ENTRY_SZ_OFFSET); > dev_lim->mpt_entry_sz = size; > MTHCA_GET(field, outbox, QUERY_DEV_LIM_PBL_SZ_OFFSET); > --- infiniband/hw/mthca/mthca_dev.h (revision 3989) > +++ infiniband/hw/mthca/mthca_dev.h (working copy) > @@ -131,6 +131,7 @@ struct mthca_limits { > int max_sg; > int num_qps; > int max_wqes; > + int max_desc_sz; > int max_qp_init_rdma; > int reserved_qps; > int num_srqs; > --- infiniband/hw/mthca/mthca_main.c (revision 3989) > +++ infiniband/hw/mthca/mthca_main.c (working copy) > @@ -168,6 +168,7 @@ static int __devinit mthca_dev_lim(struc > mdev->limits.max_srq_wqes = dev_lim->max_srq_sz; > mdev->limits.reserved_srqs = dev_lim->reserved_srqs; > mdev->limits.reserved_eecs = dev_lim->reserved_eecs; > + mdev->limits.max_desc_sz = dev_lim->max_desc_sz; > /* > * Subtract 1 from the limit because we need to allocate a > * spare CQE so the HCA HW can tell the difference between an > --- infiniband/hw/mthca/mthca_qp.c (revision 3989) > +++ infiniband/hw/mthca/mthca_qp.c (working copy) > @@ -883,6 +883,48 @@ int mthca_modify_qp(struct ib_qp *ibqp, > return err; > } > > +static void mthca_adjust_qp_caps(struct mthca_dev *dev, > + struct mthca_pd *pd, > + struct mthca_qp *qp) > +{ > + int max_data_size; > + > + /* > + * Calculate the maximum size of WQE s/g segments, excluding > + * the 
next segment and other non-data segments. > + */ > + max_data_size = min(dev->limits.max_desc_sz, 1 << qp->sq.wqe_shift) - > + sizeof (struct mthca_next_seg); > + > + switch (qp->transport) { > + case MLX: > + max_data_size -= 2 * sizeof (struct mthca_data_seg); > + break; > + > + case UD: > + if (mthca_is_memfree(dev)) > + max_data_size -= sizeof (struct mthca_arbel_ud_seg); > + else > + max_data_size -= sizeof (struct mthca_tavor_ud_seg); > + break; > + > + default: > + max_data_size -= sizeof (struct mthca_raddr_seg); > + break; > + } > + > + /* We don't support inline data for kernel QPs (yet). */ > + if (!pd->ibpd.uobject) > + qp->max_inline_data = 0; > + else > + qp->max_inline_data = max_data_size - MTHCA_INLINE_HEADER_SIZE; > + > + qp->sq.max_gs = max_data_size / sizeof (struct mthca_data_seg); > + qp->rq.max_gs = (min(dev->limits.max_desc_sz, 1 << qp->rq.wqe_shift) - > + sizeof (struct mthca_next_seg)) / > + sizeof (struct mthca_data_seg); > +} > + > /* > * Allocate and register buffer for WQEs. qp->rq.max, sq.max, > * rq.max_gs and sq.max_gs must all be assigned. > @@ -900,27 +942,53 @@ static int mthca_alloc_wqe_buf(struct mt > size = sizeof (struct mthca_next_seg) + > qp->rq.max_gs * sizeof (struct mthca_data_seg); > > + if (size > dev->limits.max_desc_sz) > + return -EINVAL; > + > for (qp->rq.wqe_shift = 6; 1 << qp->rq.wqe_shift < size; > qp->rq.wqe_shift++) > ; /* nothing */ > > - size = sizeof (struct mthca_next_seg) + > - qp->sq.max_gs * sizeof (struct mthca_data_seg); > + size = qp->sq.max_gs * sizeof (struct mthca_data_seg); > switch (qp->transport) { > case MLX: > size += 2 * sizeof (struct mthca_data_seg); > break; > + > case UD: > - if (mthca_is_memfree(dev)) > - size += sizeof (struct mthca_arbel_ud_seg); > - else > - size += sizeof (struct mthca_tavor_ud_seg); > + size += mthca_is_memfree(dev) ? 
> + sizeof (struct mthca_arbel_ud_seg) : > + sizeof (struct mthca_tavor_ud_seg); > + break; > + > + case UC: > + size += sizeof (struct mthca_raddr_seg); > + break; > + > + case RC: > + size += sizeof (struct mthca_raddr_seg); > + /* > + * An atomic op will require an atomic segment, a > + * remote address segment and one scatter entry. > + */ > + size = max_t(int, size, > + sizeof (struct mthca_atomic_seg) + > + sizeof (struct mthca_raddr_seg) + > + sizeof (struct mthca_data_seg)); > break; > + > default: > - /* bind seg is as big as atomic + raddr segs */ > - size += sizeof (struct mthca_bind_seg); > + break; > } > > + /* Make sure that we have enough space for a bind request */ > + size = max_t(int, size, sizeof (struct mthca_bind_seg)); > + > + size += sizeof (struct mthca_next_seg); > + > + if (size > dev->limits.max_desc_sz) > + return -EINVAL; > + > for (qp->sq.wqe_shift = 6; 1 << qp->sq.wqe_shift < size; > qp->sq.wqe_shift++) > ; /* nothing */ > @@ -1064,6 +1132,8 @@ static int mthca_alloc_qp_common(struct > return ret; > } > > + mthca_adjust_qp_caps(dev, pd, qp); > + > /* > * If this is a userspace QP, we're done now. The doorbells > * will be allocated and buffers will be initialized in > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mst at mellanox.co.il Wed Nov 9 08:04:00 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 Nov 2005 18:04:00 +0200 Subject: [openib-general] [PATCH] mthca: fix posting long work request lists Message-ID: <20051109160400.GE8633@mellanox.co.il> Same thing for kernel level mthca. --- mthca: fix posting work request lists of length > 255 for Tavor. Signed-off-by: Michael S. 
Tsirkin Index: linux-kernel/drivers/infiniband/hw/mthca/mthca_qp.c =================================================================== --- linux-kernel/drivers/infiniband/hw/mthca/mthca_qp.c (revision 3992) +++ linux-kernel/drivers/infiniband/hw/mthca/mthca_qp.c (working copy) @@ -176,6 +176,8 @@ enum { MTHCA_QP_OPTPAR_SCHED_QUEUE = 1 << 16 }; +#define MTHCA_TAVOR_WQES_PER_RECV_DOORBELL 256 + static const u8 mthca_opcode[] = { [IB_WR_SEND] = MTHCA_OPCODE_SEND, [IB_WR_SEND_WITH_IMM] = MTHCA_OPCODE_SEND_IMM, @@ -1652,6 +1654,23 @@ int mthca_tavor_post_receive(struct ib_q ind = qp->rq.next_ind; for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (unlikely(nreq == MTHCA_TAVOR_WQES_PER_RECV_DOORBELL)) { + __be32 doorbell[2]; + + doorbell[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0); + doorbell[1] = cpu_to_be32(qp->qpn << 8); + + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_RECEIVE_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + + qp->rq.head += nreq; + nreq = 0; + size0 = 0; + } + if (mthca_wq_overflow(&qp->rq, nreq, qp->ibqp.recv_cq)) { mthca_err(dev, "RQ %06x full (%u head, %u tail," " %d max, %d nreq)\n", qp->qpn, -- MST From halr at voltaire.com Wed Nov 9 08:26:04 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Nov 2005 11:26:04 -0500 Subject: [openib-general] OpenSM and Wrong SM_Key In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E36188FB@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E36188FB@mtlexch01.mtl.com> Message-ID: <1131553563.4451.2208.camel@hal.voltaire.com> Hi Eitan, On Wed, 2005-11-09 at 02:46, Eitan Zahavi wrote: > Hi Hal, > > I would like to bring this to MgtWG before we change anything. > IMO the situation when this happens is really not "legal" since if the > SM's are not coordinated at least in their SM_Key it will cause the two > masters on the subnet. Correct. That's what the current compliance says. 
> >From our experience it is always better to cause a fatal flow and exit > the SM rather then report the event in some log - normally it will not > be seen ... To upper layer management too (not just in a log). It's more than just reporting in a log; it's the exiting which relinquishes the subnet. > I know this is a controversial issue. Feel free to bring this up at the MgtWG. > BTW: Another feature I would like to bring up is the SM behavior when it > recognizes duplicated GUID on the subnet. Currently it will just issue > an error in the log file. > I would propose to make it abort after sending a log event describing > the DR paths to these two devices. > > What do you say? If aborting means exiting (terminating) the SM in this case, I think that is not a good thing and should be avoided. In the case of a duplicated GUID (which should not occur), a choice needs to be made as to which one to honor. The other should be ignored. The two ways I can envision this is: (1) duplication of GUID in multiple nodes (bad manufacturing process), and (2) SM bug of some sort. -- Hal > EZ > > Eitan Zahavi > Design Technology Director > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Tuesday, November 08, 2005 11:09 PM > > To: openib-general at openib.org > > Subject: [openib-general] OpenSM and Wrong SM_Key > > > > Hi, > > > > Currently, when OpenSM receives SMInfo with a different SM_Key, it > exits > > as follows: > > > > > > void > > __osm_sminfo_rcv_process_get_response( > > IN const osm_sminfo_rcv_t* const p_rcv, > > IN const osm_madw_t* const p_madw ) > > { > > ... > > > > > > > > /* > > Check that the sm_key of the found SM is the same as ours, > > or is zero. If not - OpenSM cannot continue with configuration!. 
> */ > > if ( p_smi->sm_key != 0 && > > p_smi->sm_key != p_rcv->p_subn->opt.sm_key ) > > { > > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > "__osm_sminfo_rcv_process_get_response: ERR 2F18: " > > "Got SM with sm_key that doesn't match our " > > "local key. Exiting\n" ); > > osm_log( p_rcv->p_log, OSM_LOG_SYS, > > "Found remote SM with non-matching sm_key. Exiting\n" ); > > osm_exit_flag = TRUE; > > goto Exit; > > } > > > > C14-61.2.1 states that: > > A master SM which finds a higher priority master SM with the wrong > > SM_Key should not relinquish the subnet. > > > > Exiting OpenSM relinquishes the subnet. > > > > So it appears to me that perhaps this behavior of exiting OpenSM > should > > be at least contingent on the SM state and relative priority of the > > SMInfo received. Make sense ? If so, I will work on a patch for this. > > > > -- Hal > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From caitlinb at broadcom.com Wed Nov 9 08:46:37 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 9 Nov 2005 08:46:37 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS ( ReliableDatagramSockets) to OpenIB Message-ID: <54AD0F12E08D1541B826BE97C98F99F10415A1@NT-SJCA-0751.brcm.ad.broadcom.com> ________________________________ From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Michael Krause Sent: Tuesday, November 08, 2005 1:08 PM To: Ranjit Pandit Cc: openib-general at openib.org Subject: Re: [openib-general] [ANNOUNCE] Contribute RDS ( ReliableDatagramSockets) to OpenIB At 12:33 PM 11/8/2005, Ranjit Pandit wrote: > Mike wrote: > - RDS does not solve a set of failure models. 
For example, if a RNIC / HCA > were to fail, then one cannot simply replay the operations on another RNIC / > HCA without extracting state, etc. and providing some end-to-end sync of > what was really sent / received by the application. Yes, one can recover > from cable or switch port failure by using APM style recovery but that is > only one class of faults. The harder faults either result in the end node > being cast out of the cluster or see silent data corruption unless > additional steps are taken to transparently recover - again app writers > don't want to solve the hard problems; they want that done for them. The current reference implementation of RDS solves the HCA failure case as well. Since applications don't need to keep connection states, it's easier to handle cases like HCA and intermediate path failures. As far as application is concerned, every sendmsg 'could' result in a new connection setup in the driver. If the current path fails, RDS reestablishes a connection, if available, on a different port or a different HCA , and replays the failed messages. Using APM is not useful because it doesn't provide failover across HCA's. I think others may disagree about whether RDS solves the problem. You have no way of knowing whether something was received or not into the other node's coherency domain without some intermediary or application's involvement to see the data arrived. As such, you might see many hardware level acks occur and not know there is a real failure. If an application takes any action assuming that send complete means it is delivered, then it is subject to silent data corruption. Hence, RDS can replay to its heart content but until there is an application or middleware level of acknowledgement, you have not solve the fault domain issues. Some may be happy with this as they just cast out the endnode from the cluster / database but others see the loss of a server as a big deal so may not be happy to see this occur. 
It really comes down to whether you believe losing a server is worthwhile just for a local failure event which is not fatal to the rest of the server.

[cait]
Applications should not infer anything from send completion other than that their source buffer is no longer required for the transmit to complete. That is the only assumption that can be supported in a transport-neutral way. I'll also point out that even under InfiniBand the fact that a send or write has completed does NOT guarantee that the remote peer has *noticed* the data. The remote peer could fail *after* the data has been delivered to it and before it has had a chance to act upon it. A well-designed robust application should never rely on anything other than a peer ack to indicate that the peer has truly taken ownership of transmitted information.

The essence of RDS, or any similar solution, is the delivery of messages with datagram semantics reliably over point-to-point reliable connections. So whatever reliability and fault-tolerance benefits the reliable connections provide are inherited by the RDS layer. After that it is mostly a matter of how you avoid head-of-line blocking problems when there is no receive buffer. You don't want to send an RNR (or drop the DDP Segment under iWARP) because *one* endpoint does not have available buffers. Other than that, any reliable datagram service should be just as reliable as the underlying RC service.

From robert.j.woodruff at intel.com Wed Nov 9 08:55:45 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Wed, 9 Nov 2005 08:55:45 -0800
Subject: [openib-general] Problem in patching 2.6.9 kernel
In-Reply-To: <43719EBA.9090406@Sun.COM>
Message-ID:

Liang Peng wrote,
>*** Warning: "get_sb_pseudo" [drivers/infiniband/core/ib_uverbs.ko]
>undefined!

Looks like the backport patch for infiniband-backport-svn3279-to-2.6.9-kernel-fixups-01.diff is missing the patch to export that symbol.
Not sure why, but I will fix it. Try this patch, diff -Naurp linux-2.6.9/fs/libfs.c linux-2.6.9-kernel-fixups/fs/libfs.c --- linux-2.6.9/fs/libfs.c 2004-10-18 14:53:06.000000000 -0700 +++ linux-2.6.9-kernel-fixups/fs/libfs.c 2005-05-03 10:51:48.000000000 -0700 @@ -526,10 +526,12 @@ EXPORT_SYMBOL(dcache_dir_lseek); EXPORT_SYMBOL(dcache_dir_open); EXPORT_SYMBOL(dcache_readdir); EXPORT_SYMBOL(generic_read_dir); +EXPORT_SYMBOL(get_sb_pseudo); EXPORT_SYMBOL(simple_commit_write); EXPORT_SYMBOL(simple_dir_inode_operations); EXPORT_SYMBOL(simple_dir_operations); EXPORT_SYMBOL(simple_empty); +EXPORT_SYMBOL(d_alloc_name); EXPORT_SYMBOL(simple_fill_super); EXPORT_SYMBOL(simple_getattr); EXPORT_SYMBOL(simple_link); From rolandd at cisco.com Wed Nov 9 09:08:43 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 09 Nov 2005 09:08:43 -0800 Subject: [openib-general] Re: [PATCH 1 of 2] mthca: qp size calculations In-Reply-To: <20051109074316.GG31134@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 9 Nov 2005 09:43:16 +0200") References: <52fyq6q0fa.fsf@cisco.com> <20051109074316.GG31134@mellanox.co.il> Message-ID: <521x1poijo.fsf@cisco.com> Michael> What I had in mind is: lets pad all resp and cmd Michael> structures in userspace (but not in kernel) with some Michael> 0-initialized extra space at the tail. The time to do it Michael> would be now before we bump the ABI. Comments? I'm not convinced we can do a good enough job of this to make it worth it. How do we know how much padding to leave? If we're just adding parameters, it's easy to put the compat code in userspace anyway. It's other changes that are hard to deal with. - R. From rolandd at cisco.com Wed Nov 9 09:13:15 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 09 Nov 2005 09:13:15 -0800 Subject: [openib-general] Re: user_mad.c: deadlock? In-Reply-To: <20051109135635.GQ31134@mellanox.co.il> (Michael S. 
Tsirkin's message of "Wed, 9 Nov 2005 15:56:35 +0200") References: <52vez4uzwu.fsf@cisco.com> <20051109135635.GQ31134@mellanox.co.il> Message-ID: <52wtjhn3ro.fsf@cisco.com> Michael> What about ib_umad_reg_agent error handling code in Michael> ib_umad_reg_agent? That seems to still call Michael> ib_umad_unreg_agent from under the down_write. I'm pretty sure that's OK, because it's not possible for any send requests to have been posted to that agent if we hit that error path. Michael> And what about ib_umad_kill_port, which also does this? That looks like a problem. I think we need the following (the locking should still be OK, there's no need to do all the cleanup atomically). --- infiniband/core/user_mad.c (revision 3989) +++ infiniband/core/user_mad.c (working copy) @@ -849,6 +849,7 @@ err_cdev: static void ib_umad_kill_port(struct ib_umad_port *port) { struct ib_umad_file *file; + struct ib_mad_agent *agent; int id; class_set_devdata(port->class_dev, NULL); @@ -865,19 +866,21 @@ static void ib_umad_kill_port(struct ib_ spin_unlock(&port_lock); down_write(&port->mutex); - port->ib_dev = NULL; + up_write(&port->mutex); list_for_each_entry(file, &port->file_list, port_list) for (id = 0; id < IB_UMAD_MAX_AGENTS; ++id) { - if (!file->agent[id]) - continue; - ib_dereg_mr(file->mr[id]); - ib_unregister_mad_agent(file->agent[id]); + down_write(&port->mutex); + agent = file->agent[id]; file->agent[id] = NULL; - } + up_write(&port->mutex); - up_write(&port->mutex); + if (agent) { + ib_unregister_mad_agent(agent); + ib_dereg_mr(file->mr[id]); + } + } clear_bit(port->dev_num, dev_map); } From rolandd at cisco.com Wed Nov 9 09:14:58 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 09 Nov 2005 09:14:58 -0800 Subject: [openib-general] Re: [PATCH] libibverbs: protect device list initialization In-Reply-To: <20051109153630.GC8633@mellanox.co.il> (Michael S. 
Tsirkin's message of "Wed, 9 Nov 2005 17:36:30 +0200") References: <20051109141558.GR31134@mellanox.co.il> <20051109153630.GC8633@mellanox.co.il> Message-ID: <52slu5n3ot.fsf@cisco.com> > The following patch solves a problem I'm seeing when multiple > threads try to call ibv_get_devices at the same time. Looks good, thanks. > Which brings me to another issue: our code examples call non-reentrant > dlist_for_each variants of dlist scanning routines, which will > create strange problems for multi-threaded users who might copy this. I was thinking recently that it would be better to just kill the dlist use in the libibverbs API entirely. It was a mistake to be lazy and use the code for sysfs, because I don't think dlist is designed very well. Returning something like a simple singly-linked list of devices would be better. What do you think? - R. From rolandd at cisco.com Wed Nov 9 09:21:02 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 09 Nov 2005 09:21:02 -0800 Subject: [openib-general] Re: [PATCH] libmthca: fix posting long work request lists In-Reply-To: <20051109155131.GD8633@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 9 Nov 2005 17:51:31 +0200") References: <20051109155131.GD8633@mellanox.co.il> Message-ID: <52oe4tn3ep.fsf@cisco.com> Thanks, is something similar needed for SRQ as well? How about for posting sends? The Tavor PRM seems to indicate that a doorbell must always be rung every 256 WQEs, but I don't see any handling of this for send queues in the Mellanox VAPI source. - R. From mst at mellanox.co.il Wed Nov 9 09:30:14 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 Nov 2005 19:30:14 +0200 Subject: [openib-general] Re: [PATCH] libibverbs: protect device list initialization In-Reply-To: <52slu5n3ot.fsf@cisco.com> References: <52slu5n3ot.fsf@cisco.com> Message-ID: <20051109173013.GA16091@mellanox.co.il> Quoting r. 
Roland Dreier :
> Subject: Re: [PATCH] libibverbs: protect device list initialization
>
> > The following patch solves a problem I'm seeing when multiple
> > threads try to call ibv_get_devices at the same time.
>
> Looks good, thanks.

BTW I think this, or something like it, will be needed even if we change the library API.

> > Which brings me to another issue: our code examples call non-reentrant
> > dlist_for_each variants of dlist scanning routines, which will
> > create strange problems for multi-threaded users who might copy this.
>
> I was thinking recently that it would be better to just kill the dlist
> use in the libibverbs API entirely. It was a mistake to be lazy and
> use the code for sysfs, because I don't think dlist is designed very
> well. Returning something like a simple singly-linked list of devices
> would be better.
>
> What do you think?
>
> - R.
>

Basically I agree.
The problem I see with the API is with the re-entrancy of ibv_get_devices: for hotplug to work, it seems clear that we'll need to rescan the device list on each call to ibv_get_devices, so we will need something like ibv_put_devices to let the library know the user is not walking the list anymore.

Given this assumption, it would seem better to let ibv_get_devices just malloc and return an array of devices, and let ibv_put_devices free it, than to force everyone to use the sysfs dlist.

--
MST

From mst at mellanox.co.il Wed Nov 9 09:40:57 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 9 Nov 2005 19:40:57 +0200
Subject: [openib-general] Re: [PATCH 1 of 2] mthca: qp size calculations
In-Reply-To: <521x1poijo.fsf@cisco.com>
References: <521x1poijo.fsf@cisco.com>
Message-ID: <20051109174057.GB16091@mellanox.co.il>

Quoting r.
Roland Dreier :
> Subject: Re: [PATCH 1 of 2] mthca: qp size calculations
>
> Michael> What I had in mind is: lets pad all resp and cmd
> Michael> structures in userspace (but not in kernel) with some
> Michael> 0-initialized extra space at the tail. The time to do it
> Michael> would be now before we bump the ABI. Comments?
>
> I'm not convinced we can do a good enough job of this to make it worth
> it.

Neither am I, it was just an idea to ponder.
Still, people do like upgrading kernels without touching userspace.

> How do we know how much padding to leave?

I guess if we run out of space we can always force the user to upgrade as we do now, or add a pointer into a structure.

> If we're just adding parameters, it's easy to put the compat code in
> userspace anyway. It's other changes that are hard to deal with.
>
> - R.
>

Well, it's your call.

--
MST

From jlentini at netapp.com Wed Nov 9 09:42:59 2005
From: jlentini at netapp.com (James Lentini)
Date: Wed, 9 Nov 2005 12:42:59 -0500 (EST)
Subject: [openib-general] Re: [OpenSM] SA database query tool
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3618897@mtlexch01.mtl.com>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E3618897@mtlexch01.mtl.com>
Message-ID:

On Wed, 2 Nov 2005, Eitan Zahavi wrote:

> Hi Again,
>
> Ibis is currently under:
> https://openib.org/svn/gen2/utils/src/linux-user/ibis
> A doc regarding how to write SA client queries is available in the file:
> doc/ibis_wrap.html
>
> If you will need more info or examples I will be happy to provide them.

I'll take you up on that offer. Using IBIS, how would you query for all the SA service records for a particular service id?

james

From mst at mellanox.co.il Wed Nov 9 09:50:51 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 9 Nov 2005 19:50:51 +0200
Subject: [openib-general] Re: user_mad.c: deadlock?
In-Reply-To: <52wtjhn3ro.fsf@cisco.com>
References: <52wtjhn3ro.fsf@cisco.com>
Message-ID: <20051109175051.GC16091@mellanox.co.il>

Quoting r.
Roland Dreier : > Subject: Re: user_mad.c: deadlock? > > Michael> What about ib_umad_reg_agent error handling code in > Michael> ib_umad_reg_agent? That seems to still call > Michael> ib_umad_unreg_agent from under the down_write. > > I'm pretty sure that's OK, because it's not possible for any send > requests to have been posted to that agent if we hit that error path. > > Michael> And what about ib_umad_kill_port, which also does this? > > That looks like a problem. I think we need the following (the locking > should still be OK, there's no need to do all the cleanup atomically). > > --- infiniband/core/user_mad.c (revision 3989) > +++ infiniband/core/user_mad.c (working copy) > @@ -849,6 +849,7 @@ err_cdev: > static void ib_umad_kill_port(struct ib_umad_port *port) > { > struct ib_umad_file *file; > + struct ib_mad_agent *agent; > int id; > > class_set_devdata(port->class_dev, NULL); > @@ -865,19 +866,21 @@ static void ib_umad_kill_port(struct ib_ > spin_unlock(&port_lock); > > down_write(&port->mutex); > - > port->ib_dev = NULL; > + up_write(&port->mutex); > > list_for_each_entry(file, &port->file_list, port_list) > for (id = 0; id < IB_UMAD_MAX_AGENTS; ++id) { > - if (!file->agent[id]) > - continue; > - ib_dereg_mr(file->mr[id]); > - ib_unregister_mad_agent(file->agent[id]); > + down_write(&port->mutex); > + agent = file->agent[id]; > file->agent[id] = NULL; > - } > + up_write(&port->mutex); > > - up_write(&port->mutex); > + if (agent) { > + ib_unregister_mad_agent(agent); > + ib_dereg_mr(file->mr[id]); > + } > + } > > clear_bit(port->dev_num, dev_map); > } > I'm not convinced. What would prevent ib_umad_close from touching the list if we release the mutex? -- MST From mst at mellanox.co.il Wed Nov 9 09:53:27 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 Nov 2005 19:53:27 +0200 Subject: [openib-general] Re: user_mad.c: deadlock? 
In-Reply-To: <52wtjhn3ro.fsf@cisco.com> References: <52wtjhn3ro.fsf@cisco.com> Message-ID: <20051109175326.GD16091@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: user_mad.c: deadlock? > > Michael> What about ib_umad_reg_agent error handling code in > Michael> ib_umad_reg_agent? That seems to still call > Michael> ib_umad_unreg_agent from under the down_write. > > I'm pretty sure that's OK, because it's not possible for any send > requests to have been posted to that agent if we hit that error path. What about recv_handler? That calls queue_packet too ... -- MST From rolandd at cisco.com Wed Nov 9 09:50:19 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 09 Nov 2005 09:50:19 -0800 Subject: [openib-general] Re: user_mad.c: deadlock? In-Reply-To: <20051109175051.GC16091@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 9 Nov 2005 19:50:51 +0200") References: <52wtjhn3ro.fsf@cisco.com> <20051109175051.GC16091@mellanox.co.il> Message-ID: <52k6fhn21w.fsf@cisco.com> Michael> I'm not convinced. What would prevent ib_umad_close from Michael> touching the list if we release the mutex? Good point, ib_umad_close() could race against device removal. But we had that problem before: ib_umad_close() doesn't hold the mutex while going through the agent list. - R. From rolandd at cisco.com Wed Nov 9 09:58:36 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 09 Nov 2005 09:58:36 -0800 Subject: [openib-general] Re: user_mad.c: deadlock? In-Reply-To: <20051109175326.GD16091@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 9 Nov 2005 19:53:27 +0200") References: <52wtjhn3ro.fsf@cisco.com> <20051109175326.GD16091@mellanox.co.il> Message-ID: <52fyq5n1o3.fsf@cisco.com> Michael> What about recv_handler? That calls queue_packet too ... But not from the same thread. So it will just sleep until the mutex is released. - R. From mst at mellanox.co.il Wed Nov 9 10:07:31 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin)
Date: Wed, 9 Nov 2005 20:07:31 +0200
Subject: [openib-general] Re: user_mad.c: deadlock?
In-Reply-To: <52k6fhn21w.fsf@cisco.com>
References: <52k6fhn21w.fsf@cisco.com>
Message-ID: <20051109180731.GE16091@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: user_mad.c: deadlock?
>
> Michael> I'm not convinced. What would prevent ib_umad_close from
> Michael> touching the list if we release the mutex?
>
> Good point, ib_umad_close() could race against device removal. But we
> had that problem before: ib_umad_close() doesn't hold the mutex while
> going through the agent list.
>
> - R.
>

To fix ib_umad_kill_port properly, let's clean out port->file_list under the mutex. Something using list_del_init, like

	struct list_head *list;

	for (;;) {
		down_write(&port->mutex);
		if (list_empty(&port->file_list)) {
			up_write(&port->mutex);
			break;
		}
		list = port->file_list.next;
		list_del_init(list);
		up_write(&port->mutex);

		file = list_entry(list, struct ib_umad_file, port_list);
		/* ... Deregister .... */
	}

Makes sense?

--
MST

From mst at mellanox.co.il Wed Nov 9 10:08:28 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 9 Nov 2005 20:08:28 +0200
Subject: [openib-general] Re: user_mad.c: deadlock?
In-Reply-To: <52fyq5n1o3.fsf@cisco.com>
References: <52fyq5n1o3.fsf@cisco.com>
Message-ID: <20051109180827.GF16091@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: user_mad.c: deadlock?
>
> Michael> What about recv_handler? That calls queue_packet too ...
>
> But not from the same thread. So it will just sleep until the mutex
> is released.

That's right, I see it now. Might be worth a comment though.

--
MST

From mst at mellanox.co.il Wed Nov 9 10:13:03 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 9 Nov 2005 20:13:03 +0200
Subject: [openib-general] Re: [PATCH] libmthca: fix posting long work request lists
In-Reply-To: <52oe4tn3ep.fsf@cisco.com>
References: <52oe4tn3ep.fsf@cisco.com>
Message-ID: <20051109181303.GG16091@mellanox.co.il>

Quoting r.
Roland Dreier :
> Subject: Re: [PATCH] libmthca: fix posting long work request lists
>
> Thanks, is something similar needed for SRQ as well?

Yes, I missed that. Again, Tavor only.
Can you fix it, or do you prefer to wait for me to do it?
That could be an independent patch though, hope it doesn't delay applying other fixes.

> How about for
> posting sends? The Tavor PRM seems to indicate that a doorbell must
> always be rung every 256 WQEs, but I don't see any handling of this
> for send queues in the Mellanox VAPI source.

No, that's just unclear wording in the PRM. The limit stems from the limited number of bits in the RQ doorbell. I added a note for us to clarify this in the next revision.

--
MST

From rolandd at cisco.com Wed Nov 9 10:15:17 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 09 Nov 2005 10:15:17 -0800
Subject: [openib-general] Re: [PATCH] libmthca: fix posting long work request lists
In-Reply-To: <20051109181303.GG16091@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 9 Nov 2005 20:13:03 +0200")
References: <52oe4tn3ep.fsf@cisco.com> <20051109181303.GG16091@mellanox.co.il>
Message-ID: <528xvxn0wa.fsf@cisco.com>

    Michael> Yes, I missed that. Again, Tavor only. Can you fix it, or do you
    Michael> prefer to wait for me to do it? That could be an
    Michael> independent patch though, hope it doesn't delay applying
    Michael> other fixes.

I can fix it up.

 - R.

From lindahl at pathscale.com Wed Nov 9 10:15:29 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Wed, 9 Nov 2005 10:15:29 -0800
Subject: [openib-general] [ANNOUNCE] Contribute RDS ( ReliableDatagramSockets) to OpenIB
In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F10415A1@NT-SJCA-0751.brcm.ad.broadcom.com>
References: <54AD0F12E08D1541B826BE97C98F99F10415A1@NT-SJCA-0751.brcm.ad.broadcom.com>
Message-ID: <20051109181529.GA3936@greglaptop.hsd1.ca.comcast.net>

Caitlin,

Can you please use the standard quoting style? I can't tell which comments are yours. Thanks.
-- greg From rolandd at cisco.com Wed Nov 9 10:16:16 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 09 Nov 2005 10:16:16 -0800 Subject: [openib-general] Re: user_mad.c: deadlock? In-Reply-To: <20051109180731.GE16091@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 9 Nov 2005 20:07:31 +0200") References: <52k6fhn21w.fsf@cisco.com> <20051109180731.GE16091@mellanox.co.il> Message-ID: <527jbhn0un.fsf@cisco.com> Michael> Makes sense? No, I think you have to make a copy of the full list of agents and clean out the file's list while holding the mutex. Otherwise ib_umad_close() could run while ib_umad_kill_port() is dealing with the same file. And also we need ib_umad_kill_port() to wait for any in-progress ib_umad_close() calls, since we don't want to call ib_unregister_mad_agent() after we've returned from the device removal call. The locking is a little tricky, I'll work something out. - R. From rolandd at cisco.com Wed Nov 9 10:25:19 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 09 Nov 2005 10:25:19 -0800 Subject: [openib-general] Re: [PATCH] libmthca: fix double free condition In-Reply-To: <20051109125603.GM31134@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 9 Nov 2005 14:56:03 +0200") References: <20051109125603.GM31134@mellanox.co.il> Message-ID: <523bm5n0fk.fsf@cisco.com> Thanks, applied. 
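For readers following the "fix posting long work request lists" thread above, the batching rule in that patch (ring a receive doorbell whenever 256 WQEs have accumulated, then once more for any remainder) can be sketched in plain user-space C. This is only an illustration under stated assumptions: `struct wr`, `count_doorbells()` and `doorbells_for()` are hypothetical names, no hardware is touched, and the final "ring once more if anything is pending" step is assumed from the usual shape of the post-receive path rather than shown in the quoted diff.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a receive work request list node. */
struct wr {
	struct wr *next;
};

/* Mirrors MTHCA_TAVOR_WQES_PER_RECV_DOORBELL from the patch. */
#define WQES_PER_RECV_DOORBELL 256

/*
 * Walk a work request list the way the patched loop does and return
 * how many doorbells would be rung. A doorbell is counted whenever
 * `limit` requests have accumulated, plus one at the end if any
 * requests are still pending (assumed behaviour, see lead-in).
 */
static int count_doorbells(const struct wr *wr, int limit)
{
	int nreq;
	int doorbells = 0;

	assert(limit > 0);

	for (nreq = 0; wr; ++nreq, wr = wr->next) {
		if (nreq == limit) {
			++doorbells;	/* flush the full batch */
			nreq = 0;	/* start counting the next batch */
		}
		/* ... one receive WQE would be built here ... */
	}

	if (nreq)
		++doorbells;	/* flush whatever is left over */

	return doorbells;
}

/* Helper: build a list of n requests from a static pool and count. */
static int doorbells_for(size_t n, int limit)
{
	static struct wr pool[1024];
	size_t i;

	assert(n <= sizeof pool / sizeof pool[0]);

	for (i = 0; i < n; ++i)
		pool[i].next = (i + 1 < n) ? &pool[i + 1] : NULL;

	return count_doorbells(n ? &pool[0] : NULL, limit);
}
```

With a limit of `WQES_PER_RECV_DOORBELL`, a list of exactly 256 requests needs a single doorbell, while 257 requests need two: the check sits at the top of the loop, so the batch is flushed only when a further request is about to be posted.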
From Richard.Frank at oracle.com Wed Nov 9 10:28:06 2005 From: Richard.Frank at oracle.com (Rick Frank) Date: Wed, 9 Nov 2005 13:28:06 -0500 Subject: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB References: <5D78D28F88822E4D8702BB9EEF1A436773E971@mercury.infiniconsys.com><6.2.0.14.2.20051108112907.02345e80@esmail.cup.hp.com><96f8e60e0511081233y2e248a3fxfe5b46e05cfcdea6@mail.gmail.com> <6.2.0.14.2.20051108130355.0263def8@esmail.cup.hp.com> Message-ID: <00e101c5e55b$54e537e0$6401a8c0@YOURA11C73D0FD> Yes, the application is responsible for detecting lost msgs at the application level - the transport can not do this. RDS does not guarantee that a message has been delivered to the application - just that once the transport has accepted a msg it will deliver the msg to the remote node in order without duplication - dealing with retransmissions, etc due to sporadic / intermittent msg loss over the interconnect. If after accepting the send - the current path fails - then RDS will transparently fail over to another path - and if required will resend / send any already queued msgs to the remote node - again insuring that no msg is duplicated and they are in order. This is no different than APM - with the exception that RDS can do this across HCAs. The application - Oracle in this case - will deal with detecting a catastrophic path failure - either due to a send that does not arrive and or a timedout response or send failure returned from the transport. If there is no network path to a remote node - it is required that we remove the remote node from the operating cluster to avoid what is commonly termed as a "split brain" condition - otherwise known as a "partition in time". BTW - in our case - the application failure domain logic is the same whether we are using UDP / uDAPL / iTAPI / TCP / SCTP / etc. Basically, if we can not talk to a remote node - after some defined period of time - we will remove the remote node from the cluster. 
In this case the database will recover all the interesting state that may have been maintained on the removed node - allowing the remaining nodes to continue. If later on, communication to the remote node is restored - it will be allowed to rejoin the cluster and take on application load. ----- Original Message ----- From: Michael Krause To: Ranjit Pandit Cc: openib-general at openib.org Sent: Tuesday, November 08, 2005 4:08 PM Subject: Re: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB At 12:33 PM 11/8/2005, Ranjit Pandit wrote: > Mike wrote: > - RDS does not solve a set of failure models. For example, if a RNIC / HCA > were to fail, then one cannot simply replay the operations on another RNIC / > HCA without extracting state, etc. and providing some end-to-end sync of > what was really sent / received by the application. Yes, one can recover > from cable or switch port failure by using APM style recovery but that is > only one class of faults. The harder faults either result in the end node > being cast out of the cluster or see silent data corruption unless > additional steps are taken to transparently recover - again app writers > don't want to solve the hard problems; they want that done for them. The current reference implementation of RDS solves the HCA failure case as well. Since applications don't need to keep connection states, it's easier to handle cases like HCA and intermediate path failures. As far as application is concerned, every sendmsg 'could' result in a new connection setup in the driver. If the current path fails, RDS reestablishes a connection, if available, on a different port or a different HCA , and replays the failed messages. Using APM is not useful because it doesn't provide failover across HCA's. I think others may disagree about whether RDS solves the problem. 
You have no way of knowing whether something was received or not into the other node's coherency domain without some intermediary or application's involvement to see the data arrived. As such, you might see many hardware level acks occur and not know there is a real failure. If an application takes any action assuming that send complete means it is delivered, then it is subject to silent data corruption. Hence, RDS can replay to its heart content but until there is an application or middleware level of acknowledgement, you have not solve the fault domain issues. Some may be happy with this as they just cast out the endnode from the cluster / database but others see the loss of a server as a big deal so may not be happy to see this occur. It really comes down to whether you believe loosing a server is worth while just for a local failure event which is not fatal to the rest of the server. APM's value is the ability to recover from link failure. It has the same value for any other ULP in that it recovers transparently to the ULP. Mike ------------------------------------------------------------------------------ _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at mellanox.co.il Wed Nov 9 11:02:46 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 Nov 2005 21:02:46 +0200 Subject: [openib-general] Re: user_mad.c: deadlock? In-Reply-To: <527jbhn0un.fsf@cisco.com> References: <527jbhn0un.fsf@cisco.com> Message-ID: <20051109190246.GB25508@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: user_mad.c: deadlock? > > Michael> Makes sense? > > No, I think you have to make a copy of the full list of agents and > clean out the file's list while holding the mutex. 
Otherwise
> ib_umad_close() could run while ib_umad_kill_port() is dealing with
> the same file.

And that should be fine since it's safe to call list_del after
list_del_init.

> And also we need ib_umad_kill_port() to wait for any
> in-progress ib_umad_close() calls, since we don't want to call
> ib_unregister_mad_agent() after we've returned from the device removal
> call.

This should work fine too since the last down_write that detects that
the list is empty will flush these guys out.

> The locking is a little tricky, I'll work something out.

OK, if you don't manage to, I'll look into it tomorrow.

-- 
MST

From rolandd at cisco.com Wed Nov 9 11:19:53 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 09 Nov 2005 11:19:53 -0800
Subject: [openib-general] Re: user_mad.c: deadlock?
In-Reply-To: <20051109190246.GB25508@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 9 Nov 2005 21:02:46 +0200")
References: <527jbhn0un.fsf@cisco.com> <20051109190246.GB25508@mellanox.co.il>
Message-ID: <52vez1ljc6.fsf@cisco.com>

    Roland> And also we need ib_umad_kill_port() to wait for any
    Roland> in-progress ib_umad_close() calls, since we don't want to
    Roland> call ib_unregister_mad_agent() after we've returned from
    Roland> the device removal call.

    Michael> This should work fine too since the last down_write that
    Michael> detects that the list is empty will flush these guys
    Michael> out.

The problem I run into trying to implement this is that both
ib_umad_close() and ib_umad_kill_port() need to do something like:

	down_write(&port->mutex);
	agent = file->agent[id];
	file->agent[id] = NULL;
	up_write(&port->mutex);

	if (agent)
		ib_unregister_mad_agent(agent);

but ib_umad_close() could pause arbitrarily long right before the
ib_unregister_mad_agent() call and then end up calling the function
after the device is already gone.

 - R.
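
[Editorial sketch, not part of the original thread.] The detach-then-unregister
pattern in the message above can be tried out in user space. The block below is
only an illustration under stated assumptions: `struct file_state`,
`detach_agent()`, `release_agent()` and `unregister_agent()` are names invented
here, and a pthread mutex stands in for the kernel's port->mutex rw-semaphore.
It shows that swapping the slot to NULL under the lock means only one caller
ever unregisters a given agent; it does not close the window Roland describes,
which is marked in a comment.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_AGENTS 4	/* hypothetical table size, for illustration only */

/* Stand-in for a MAD agent; only tracks whether it is still registered. */
struct agent {
	int registered;
};

/* Stand-in for the per-file state; the mutex plays the role of port->mutex. */
struct file_state {
	pthread_mutex_t mutex;
	struct agent *agent[MAX_AGENTS];
};

/* Stand-in for ib_unregister_mad_agent(); here it only clears a flag. */
void unregister_agent(struct agent *a)
{
	a->registered = 0;
}

/*
 * Detach slot `id` while holding the lock and return whatever was there.
 * For a given registration at most one caller gets a non-NULL pointer back.
 */
struct agent *detach_agent(struct file_state *f, int id)
{
	struct agent *a;

	assert(id >= 0 && id < MAX_AGENTS);

	pthread_mutex_lock(&f->mutex);
	a = f->agent[id];
	f->agent[id] = NULL;
	pthread_mutex_unlock(&f->mutex);

	return a;
}

/* What both the close path and the kill-port path would do for one slot. */
int release_agent(struct file_state *f, int id)
{
	struct agent *a = detach_agent(f, id);

	/*
	 * Race window: the caller can be delayed right here for any length
	 * of time, so the unregister call below may run arbitrarily late.
	 */
	if (a) {
		unregister_agent(a);
		return 1;	/* this caller did the unregistration */
	}
	return 0;		/* another caller already detached the agent */
}

/* Usage example: the second release of the same slot is a no-op. */
void demo(void)
{
	struct agent a = { 1 };
	struct file_state f = { PTHREAD_MUTEX_INITIALIZER, { NULL } };

	f.agent[1] = &a;
	assert(release_agent(&f, 1) == 1);
	assert(release_agent(&f, 1) == 0);
	assert(a.registered == 0);
}
```

Closing the window needs some way for the removal path to wait for callers that
have detached an agent but not yet unregistered it; which mechanism to use is
exactly the open question in this thread, so none is shown here.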
From rolandd at cisco.com Wed Nov 9 11:26:32 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 09 Nov 2005 11:26:32 -0800
Subject: [openib-general] Re: [PATCH 1 of 2] mthca: qp size calculations
In-Reply-To: <52fyq6q0fa.fsf@cisco.com> (Roland Dreier's message of "Tue, 08 Nov 2005 13:44:57 -0800")
References: <20051108174644.GD30664@mellanox.co.il> <52fyq6q0fa.fsf@cisco.com>
Message-ID: <52r79plj13.fsf@cisco.com>

OK, I committed both halves of the change.

From rolandd at cisco.com Wed Nov 9 11:30:21 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 09 Nov 2005 11:30:21 -0800
Subject: [openib-general] Re: [PATCH] mthca: fix atomic operations
In-Reply-To: <20051109121422.GJ31134@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 9 Nov 2005 14:14:22 +0200")
References: <25AE7F432672D511B8DC00B0D0DF11DA05E9C458@MTIEX01> <20051109121422.GJ31134@mellanox.co.il>
Message-ID: <52mzkdliuq.fsf@cisco.com>

Thanks, applied.

 - R.

From rolandd at cisco.com Wed Nov 9 11:32:18 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 09 Nov 2005 11:32:18 -0800
Subject: [openib-general] Re: [PATCH] libmthca: fix posting atomic work requests
In-Reply-To: <20051109121739.GK31134@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 9 Nov 2005 14:17:39 +0200")
References: <20051109121739.GK31134@mellanox.co.il>
Message-ID: <52irv1lirh.fsf@cisco.com>

Thanks, applied.

 - R.

From rolandd at cisco.com Wed Nov 9 11:33:54 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 09 Nov 2005 11:33:54 -0800
Subject: [openib-general] Re: [PATCH] libibverbs: protect device list initialization
In-Reply-To: <20051109173013.GA16091@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 9 Nov 2005 19:30:14 +0200")
References: <52slu5n3ot.fsf@cisco.com> <20051109173013.GA16091@mellanox.co.il>
Message-ID: <52ek5pliot.fsf@cisco.com>

    Michael> BTW I think this will or something like this will be
    Michael> needed even if we change the library API.

Yes, we definitely need to be thread-safe here.

    Michael> The problem I see with the API is with the re-entrancy of
    Michael> ibv_get_devices: for hotplug to work, it seems clear that
    Michael> we'll need to rescan the device list on each call to
    Michael> ibv_get_devices, so we will need something like
    Michael> ibv_put_devices to let the library know the user is not
    Michael> walking the list anymore.

    Michael> Given this assumption, it would seems better to let
    Michael> ibv_get_devices to just malloc and return an array of
    Michael> devices, and ibv_put_devices return it, than force
    Michael> everyone to use the sysfs dlist.

Yeah, the reason I've been avoiding this is that having
ibv_put_devices seems ugly to me. And having the caller allocate a
buffer (and having to deal with resizing it etc) also seems ugly. So
I'm somewhat paralyzed ;)

 - R.

From lindahl at pathscale.com Wed Nov 9 11:42:53 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Wed, 9 Nov 2005 11:42:53 -0800
Subject: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
In-Reply-To: <6.2.0.14.2.20051108130355.0263def8@esmail.cup.hp.com>
References: <5D78D28F88822E4D8702BB9EEF1A436773E971@mercury.infiniconsys.com> <6.2.0.14.2.20051108112907.02345e80@esmail.cup.hp.com> <96f8e60e0511081233y2e248a3fxfe5b46e05cfcdea6@mail.gmail.com> <6.2.0.14.2.20051108130355.0263def8@esmail.cup.hp.com>
Message-ID: <20051109194253.GF1377@greglaptop.internal.keyresearch.com>

On Tue, Nov 08, 2005 at 01:08:13PM -0800, Michael Krause wrote:

> If an application takes any action assuming that send complete means
> it is delivered, then it is subject to silent data corruption.

Right. That's the same as pretty much all other *transport* layers. I
don't think anyone's asserting RDS is any different: you can't assume
the other side's application received and acted on your message until
the other side's application tells you that it did.
-- greg

From rolandd at cisco.com Wed Nov 9 11:45:02 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 09 Nov 2005 11:45:02 -0800
Subject: [openib-general] Re: [PATCH] mthca: fix posting long work request lists
In-Reply-To: <20051109160400.GE8633@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 9 Nov 2005 18:04:00 +0200")
References: <20051109160400.GE8633@mellanox.co.il>
Message-ID: <52acgdli69.fsf@cisco.com>

Does this look OK? I added SRQ support and also moved the doorbell[2]
declaration to the top of the functions (since I'm not convinced all
versions of gcc are smart enough to see that two copies of doorbell[]
can share stack slots).

 - R.

Index: infiniband/hw/mthca/mthca_srq.c
===================================================================
--- infiniband/hw/mthca/mthca_srq.c	(revision 3989)
+++ infiniband/hw/mthca/mthca_srq.c	(working copy)
@@ -414,6 +414,7 @@ int mthca_tavor_post_srq_recv(struct ib_
 {
 	struct mthca_dev *dev = to_mdev(ibsrq->device);
 	struct mthca_srq *srq = to_msrq(ibsrq);
+	__be32 doorbell[2];
 	unsigned long flags;
 	int err = 0;
 	int first_ind;
@@ -429,6 +430,23 @@ int mthca_tavor_post_srq_recv(struct ib_
 	first_ind = srq->first_free;

 	for (nreq = 0; wr; ++nreq, wr = wr->next) {
+		if (unlikely(nreq == MTHCA_TAVOR_MAX_WQES_PER_RECV_DB)) {
+			doorbell[0] = cpu_to_be32(first_ind << srq->wqe_shift);
+			doorbell[1] = cpu_to_be32((srq->srqn << 8) | nreq);
+
+			/*
+			 * Make sure that descriptors are written
+			 * before doorbell is rung.
+			 */
+			wmb();
+
+			mthca_write64(doorbell,
+				      dev->kar + MTHCA_RECEIVE_DOORBELL,
+				      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
+
+			first_ind = srq->first_free;
+		}
+
 		ind = srq->first_free;

 		if (ind < 0) {
@@ -491,8 +509,6 @@ int mthca_tavor_post_srq_recv(struct ib_
 	}

 	if (likely(nreq)) {
-		__be32 doorbell[2];
-
 		doorbell[0] = cpu_to_be32(first_ind << srq->wqe_shift);
 		doorbell[1] = cpu_to_be32((srq->srqn << 8) | nreq);

Index: infiniband/hw/mthca/mthca_wqe.h
===================================================================
--- infiniband/hw/mthca/mthca_wqe.h	(revision 3989)
+++ infiniband/hw/mthca/mthca_wqe.h	(working copy)
@@ -49,7 +49,8 @@ enum {
 };

 enum {
-	MTHCA_INVAL_LKEY = 0x100
+	MTHCA_INVAL_LKEY = 0x100,
+	MTHCA_TAVOR_MAX_WQES_PER_RECV_DB = 256
 };

 struct mthca_next_seg {
Index: infiniband/hw/mthca/mthca_qp.c
===================================================================
--- infiniband/hw/mthca/mthca_qp.c	(revision 4003)
+++ infiniband/hw/mthca/mthca_qp.c	(working copy)
@@ -1705,6 +1705,7 @@ int mthca_tavor_post_receive(struct ib_q
 {
 	struct mthca_dev *dev = to_mdev(ibqp->device);
 	struct mthca_qp *qp = to_mqp(ibqp);
+	__be32 doorbell[2];
 	unsigned long flags;
 	int err = 0;
 	int nreq;
@@ -1722,6 +1723,21 @@ int mthca_tavor_post_receive(struct ib_q
 	ind = qp->rq.next_ind;

 	for (nreq = 0; wr; ++nreq, wr = wr->next) {
+		if (unlikely(nreq == MTHCA_TAVOR_MAX_WQES_PER_RECV_DB)) {
+			doorbell[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0);
+			doorbell[1] = cpu_to_be32((qp->qpn << 8) | nreq);
+
+			wmb();
+
+			mthca_write64(doorbell,
+				      dev->kar + MTHCA_RECEIVE_DOORBELL,
+				      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
+
+			qp->rq.head += nreq;
+			nreq = 0;
+			size0 = 0;
+		}
+
 		if (mthca_wq_overflow(&qp->rq, nreq, qp->ibqp.recv_cq)) {
 			mthca_err(dev, "RQ %06x full (%u head, %u tail,"
 					" %d max, %d nreq)\n", qp->qpn,
@@ -1779,8 +1795,6 @@ int mthca_tavor_post_receive(struct ib_q

 out:
 	if (likely(nreq)) {
-		__be32 doorbell[2];
-
 		doorbell[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0);
 		doorbell[1] = cpu_to_be32((qp->qpn << 8) | nreq);

From rolandd at cisco.com Wed Nov 9 11:51:46 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 09 Nov 2005 11:51:46 -0800
Subject: [openib-general] Re: [PATCH] libmthca: fix posting long work request lists
In-Reply-To: <20051109155131.GD8633@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 9 Nov 2005 17:51:31 +0200")
References: <20051109155131.GD8633@mellanox.co.il>
Message-ID: <5264r1lhv1.fsf@cisco.com>

Same for libmthca....

Index: infiniband/hw/mthca/mthca_srq.c
===================================================================
--- infiniband/hw/mthca/mthca_srq.c	(revision 3989)
+++ infiniband/hw/mthca/mthca_srq.c	(working copy)
@@ -414,6 +414,7 @@ int mthca_tavor_post_srq_recv(struct ib_
 {
 	struct mthca_dev *dev = to_mdev(ibsrq->device);
 	struct mthca_srq *srq = to_msrq(ibsrq);
+	__be32 doorbell[2];
 	unsigned long flags;
 	int err = 0;
 	int first_ind;
@@ -429,6 +430,23 @@ int mthca_tavor_post_srq_recv(struct ib_
 	first_ind = srq->first_free;

 	for (nreq = 0; wr; ++nreq, wr = wr->next) {
+		if (unlikely(nreq == MTHCA_TAVOR_MAX_WQES_PER_RECV_DB)) {
+			doorbell[0] = cpu_to_be32(first_ind << srq->wqe_shift);
+			doorbell[1] = cpu_to_be32((srq->srqn << 8) | nreq);
+
+			/*
+			 * Make sure that descriptors are written
+			 * before doorbell is rung.
+			 */
+			wmb();
+
+			mthca_write64(doorbell,
+				      dev->kar + MTHCA_RECEIVE_DOORBELL,
+				      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
+
+			first_ind = srq->first_free;
+		}
+
 		ind = srq->first_free;

 		if (ind < 0) {
@@ -491,8 +509,6 @@ int mthca_tavor_post_srq_recv(struct ib_
 	}

 	if (likely(nreq)) {
-		__be32 doorbell[2];
-
 		doorbell[0] = cpu_to_be32(first_ind << srq->wqe_shift);
 		doorbell[1] = cpu_to_be32((srq->srqn << 8) | nreq);

Index: infiniband/hw/mthca/mthca_wqe.h
===================================================================
--- infiniband/hw/mthca/mthca_wqe.h	(revision 3989)
+++ infiniband/hw/mthca/mthca_wqe.h	(working copy)
@@ -49,7 +49,8 @@ enum {
 };

 enum {
-	MTHCA_INVAL_LKEY = 0x100
+	MTHCA_INVAL_LKEY = 0x100,
+	MTHCA_TAVOR_MAX_WQES_PER_RECV_DB = 256
 };

 struct mthca_next_seg {
Index: infiniband/hw/mthca/mthca_qp.c
===================================================================
--- infiniband/hw/mthca/mthca_qp.c	(revision 4003)
+++ infiniband/hw/mthca/mthca_qp.c	(working copy)
@@ -1705,6 +1705,7 @@ int mthca_tavor_post_receive(struct ib_q
 {
 	struct mthca_dev *dev = to_mdev(ibqp->device);
 	struct mthca_qp *qp = to_mqp(ibqp);
+	__be32 doorbell[2];
 	unsigned long flags;
 	int err = 0;
 	int nreq;
@@ -1722,6 +1723,21 @@ int mthca_tavor_post_receive(struct ib_q
 	ind = qp->rq.next_ind;

 	for (nreq = 0; wr; ++nreq, wr = wr->next) {
+		if (unlikely(nreq == MTHCA_TAVOR_MAX_WQES_PER_RECV_DB)) {
+			doorbell[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0);
+			doorbell[1] = cpu_to_be32((qp->qpn << 8) | nreq);
+
+			wmb();
+
+			mthca_write64(doorbell,
+				      dev->kar + MTHCA_RECEIVE_DOORBELL,
+				      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
+
+			qp->rq.head += nreq;
+			nreq = 0;
+			size0 = 0;
+		}
+
 		if (mthca_wq_overflow(&qp->rq, nreq, qp->ibqp.recv_cq)) {
 			mthca_err(dev, "RQ %06x full (%u head, %u tail,"
 					" %d max, %d nreq)\n", qp->qpn,
@@ -1779,8 +1795,6 @@ int mthca_tavor_post_receive(struct ib_q

 out:
 	if (likely(nreq)) {
-		__be32 doorbell[2];
-
 		doorbell[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0);
 		doorbell[1] = cpu_to_be32((qp->qpn << 8) | nreq);

From mshefty at ichips.intel.com Wed Nov 9 11:57:33 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 09 Nov 2005 11:57:33 -0800
Subject: [openib-general] usermode hang in mthca_cq_clean
Message-ID: <437254AD.3070908@ichips.intel.com>

I'm seeing an issue trying to recover from an error in userspace.
Basically, I allocate a PD, a CQ, and a QP, then destroy the QP because
of an unrelated error. The destroy call takes several seconds to
complete, and appears to be hung in mthca_cq_clean: line 551. Stepping
through the while loop there, I'm not falling into the if or else if
cases. The call does eventually complete.

I will look into this more this afternoon, but thought I'd mention it
in case there's an obvious issue that someone can see.

- Sean

From rolandd at cisco.com Wed Nov 9 12:12:03 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 09 Nov 2005 12:12:03 -0800
Subject: [openib-general] Re: usermode hang in mthca_cq_clean
In-Reply-To: <437254AD.3070908@ichips.intel.com> (Sean Hefty's message of "Wed, 09 Nov 2005 11:57:33 -0800")
References: <437254AD.3070908@ichips.intel.com>
Message-ID: <521x1plgx8.fsf@cisco.com>

    Sean> I'm seeing an issue trying to recover from an error in
    Sean> userspace. Basically, I allocate a PD, a CQ, and a QP, then
    Sean> destroy the QP because of an unrelated error. The destroy
    Sean> call takes several seconds to complete, and appears to be
    Sean> hung in mthca_cq_clean: line 551. Stepping through the
    Sean> while loop there, I'm not falling into the if or else if
    Sean> cases. The call does eventually complete.

I think I see the problem. Does this patch fix it for you?
(basically you're doing a benchmark seeing how fast your CPU can go
through the loop 4 billion times ;)

 - R.
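
[Editorial sketch, not part of the original thread.] The "4 billion times"
remark refers to the loop test in mthca_cq_clean, which the patch that follows
changes. The block below isolates just the comparison. It is an illustration
under assumptions: `naive_keep_going()` and `wrap_safe_keep_going()` are names
invented here, the naive version spells out the signed-to-unsigned conversion
that C applies implicitly when an `int` is compared with a `uint32_t`, and the
safe version uses the common `(int32_t)(a - b)` idiom rather than the exact
expression in the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Old-style test: the producer index is a signed int and the consumer
 * index is unsigned.  The comparison is carried out in unsigned
 * arithmetic (made explicit by the cast), so with an empty queue the
 * decremented index -1 becomes 0xFFFFFFFF and the test stays true.
 */
int naive_keep_going(int prod_index, uint32_t cons_index)
{
	--prod_index;
	return (uint32_t) prod_index > cons_index;
}

/*
 * Wrap-safe test: subtract in unsigned arithmetic (which wraps modulo
 * 2^32) and look at the sign of the difference.  The conversion to
 * int32_t of a value above INT32_MAX is implementation-defined in C,
 * but yields the two's-complement result on mainstream compilers.
 */
int wrap_safe_keep_going(uint32_t prod_index, uint32_t cons_index)
{
	--prod_index;
	return (int32_t) (prod_index - cons_index) >= 0;
}

/* Usage example covering the empty-queue and wrapped-counter cases. */
void self_check(void)
{
	/* Empty queue at index 0: the naive test wrongly keeps looping. */
	assert(naive_keep_going(0, 0) == 1);
	assert(wrap_safe_keep_going(0, 0) == 0);

	/* Two entries outstanding, no wraparound involved. */
	assert(wrap_safe_keep_going(5, 3) == 1);

	/* Producer has wrapped past 2^32 while the consumer has not. */
	assert(wrap_safe_keep_going(2, 0xFFFFFFFEu) == 1);
}
```

In the empty-queue case the naive loop would only stop once the index has
counted all the way down through the 32-bit range, which matches the several
seconds Sean observed.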
--- libmthca/src/cq.c	(revision 3989)
+++ libmthca/src/cq.c	(working copy)
@@ -524,7 +524,7 @@ void mthca_arbel_cq_event(struct ibv_cq
 void mthca_cq_clean(struct mthca_cq *cq, uint32_t qpn, struct mthca_srq *srq)
 {
 	struct mthca_cqe *cqe;
-	int prod_index;
+	uint32_t prod_index;
 	int nfreed = 0;

 	pthread_spin_lock(&cq->lock);
@@ -546,7 +546,7 @@ void mthca_cq_clean(struct mthca_cq *cq,
 	 * Now sweep backwards through the CQ, removing CQ entries
 	 * that match our QP by copying older entries on top of them.
 	 */
-	while (--prod_index > cq->cons_index) {
+	while ((int) --prod_index - (int) cq->cons_index >= 0) {
 		cqe = get_cqe(cq, prod_index & cq->ibv_cq.cqe);
 		if (cqe->my_qpn == htonl(qpn)) {
 			if (srq)

From eitan at mellanox.co.il Wed Nov 9 12:26:23 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 9 Nov 2005 22:26:23 +0200
Subject: [openib-general] Re: [OpenSM] SA database query tool
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E361890E@mtlexch01.mtl.com>

To query for service record using a particular service ID:
(when the IB cable is connected to port 1)
Make sure the port is up using ibstat

EXAMPLE 1
# simple example to show all the fields of the required service ID:
Unix > ibis -port_num 1
% sacServiceQuery configure -service_id 0x00000000643c909d
% set sids [sacServiceQuery getTable $IB_SR_COMPMASK_SID]
% foreach sid $sids {sacServiceRec s -this $sid; puts [s dump]; rename s ""}
-service_id 0x00000000643c909d -service_gid 0x0000000000000000:0x0003c8010ab1d192 -service_pkey 0 -resv 0 -service_lease 4294967295 -service_key -service_name osmt.srvc.1681692777.1996 ID=643c909d 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x00000000 0x00000000 0x00000000 0x00000000 0x0000000000000000 0x0000000000000000
% exit

EXAMPLE 2
# making a script of it (assuming you provide service Id as parameter (attach to the first active port)
cat > getServiceIdRecords < [\$c \$sid]" } } exit EOF
chmod 755
./getServiceIdRecords ./getServiceIdRecords 0x643c499e ------------------------------------------------------------------------ ------------- Service Record: sacServiceRec_service_key_get -> 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 sacServiceRec_service_data8_get -> 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 sacServiceRec_service_data32_get -> 0x00000000 0x00000000 0x00000000 0x00000000 sacServiceRec_service_gid_get -> 0x0000000000000000:0x0002c902000017a1 sacServiceRec_service_pkey_get -> 0 sacServiceRec_service_data16_get -> 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 sacServiceRec_service_lease_get -> 4294967295 sacServiceRec_resv_get -> 0 sacServiceRec_service_id_get -> 0x00000000643c499e sacServiceRec_service_name_get -> osmt.srvc.1681692775.20169 sacServiceRec_service_data64_get -> 0x0000000000000000 0x0000000000000000 MORE DATA # More options: Unix > ibis -port_num 1 % # to see all possible fields: % sacServiceQuery configure { -service_id -service_gid -service_pkey -resv -service_lease -service_key -service_name -service_data8 -service_data16 -service_data32 -service_data64 } % # to set the service id: % sacServiceQuery configure -service_id 0x1234567812345678 %# The list of component masks: IB_SR_COMPMASK_RES1 IB_SR_COMPMASK_SDATA64_0 IB_SR_COMPMASK_SDATA64_1 IB_SR_COMPMASK_SPKEY IB_SR_COMPMASK_SGID IB_SR_COMPMASK_SLEASE IB_SR_COMPMASK_SKEY IB_SR_COMPMASK_SNAME IB_SR_COMPMASK_SDATA16_0 IB_SR_COMPMASK_SDATA16_1 IB_SR_COMPMASK_SDATA16_2 IB_SR_COMPMASK_SDATA16_3 IB_SR_COMPMASK_SDATA16_4 IB_SR_COMPMASK_SDATA16_5 IB_SR_COMPMASK_SDATA16_6 IB_SR_COMPMASK_SDATA16_7 IB_SR_COMPMASK_SID IB_SR_COMPMASK_SDATA8_10 IB_SR_COMPMASK_SDATA8_11 IB_SR_COMPMASK_SDATA8_12 IB_SR_COMPMASK_SDATA8_13 IB_SR_COMPMASK_SDATA8_14 IB_SR_COMPMASK_SDATA8_15 IB_SR_COMPMASK_SDATA32_0 IB_SR_COMPMASK_SDATA32_1 IB_SR_COMPMASK_SDATA32_2 IB_SR_COMPMASK_SDATA8_0 IB_SR_COMPMASK_SDATA32_3 IB_SR_COMPMASK_SDATA8_1 
IB_SR_COMPMASK_SDATA8_2 IB_SR_COMPMASK_SDATA8_3 IB_SR_COMPMASK_SDATA8_4 IB_SR_COMPMASK_SDATA8_5 IB_SR_COMPMASK_SDATA8_6 IB_SR_COMPMASK_SDATA8_7 IB_SR_COMPMASK_SDATA8_8 IB_SR_COMPMASK_SDATA8_9 Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: James Lentini [mailto:jlentini at netapp.com] > Sent: Wednesday, November 09, 2005 7:43 PM > To: Eitan Zahavi > Cc: openib-general > Subject: RE: [openib-general] Re: [OpenSM] SA database query tool > > > > On Wed, 2 Nov 2005, Eitan Zahavi wrote: > > > Hi Again, > > > > Ibis is currently under: > > https://openib.org/svn/gen2/utils/src/linux-user/ibis > > A doc regarding how to write SA client queries is available in the file: > > doc/ibis_wrap.html > > > > If you will need more info or examples I will be happy to provide them. > > I'll take you up on that offer. > > Using IBIS, how would you query for all the SA service records for a > particular service id? > > james From krause at cup.hp.com Wed Nov 9 12:18:28 2005 From: krause at cup.hp.com (Michael Krause) Date: Wed, 09 Nov 2005 12:18:28 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB In-Reply-To: <20051109194253.GF1377@greglaptop.internal.keyresearch.com> References: <5D78D28F88822E4D8702BB9EEF1A436773E971@mercury.infiniconsys.com> <6.2.0.14.2.20051108112907.02345e80@esmail.cup.hp.com> <96f8e60e0511081233y2e248a3fxfe5b46e05cfcdea6@mail.gmail.com> <6.2.0.14.2.20051108130355.0263def8@esmail.cup.hp.com> <20051109194253.GF1377@greglaptop.internal.keyresearch.com> Message-ID: <6.2.0.14.2.20051109121220.026a6318@esmail.cup.hp.com> At 11:42 AM 11/9/2005, Greg Lindahl wrote: >On Tue, Nov 08, 2005 at 01:08:13PM -0800, Michael Krause wrote: > > > If an application takes any action assuming that send complete means > > it is delivered, then it is subject to silent data corruption. > >Right. 
That's the same as pretty much all other *transport* layers. I
>don't think anyone's asserting RDS is any different: you can't assume
>the other side's application received and acted on your message until
>the other side's application tells you that it did.

So, things like HCA failure are not transparent and one cannot simply
replay the operations since you don't know what was really seen by the
other side unless the application performs the resync itself. Hence,
while RDS can attempt to retransmit, the application must deal with
duplicates, etc. or note the error, resync, and retransmit to avoid
duplicates.

BTW, host-based transport implementations can transparently recover
from device failure on behalf of applications since their state is in
the host and not in the failed device - this is true for networking,
storage, etc. HCA / RNIC / TOE / FC / etc. all lose state or cannot be
trusted, and thus must rely upon upper level software to perform the
recovery, resync, retransmission, etc. Unless RDS has implemented its
own state checkpoint between endnodes, this class of failures must be
solved by the application since it cannot be solved in the hardware.

Hence, RDS may push some of its reliability requirements to the
interconnect but it does not eliminate all reliability requirements
from the application or RDS itself.

Mike
-------------- next part -------------- An HTML attachment was scrubbed...
URL: From krause at cup.hp.com Wed Nov 9 12:21:06 2005 From: krause at cup.hp.com (Michael Krause) Date: Wed, 09 Nov 2005 12:21:06 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB In-Reply-To: <00e101c5e55b$54e537e0$6401a8c0@YOURA11C73D0FD> References: <5D78D28F88822E4D8702BB9EEF1A436773E971@mercury.infiniconsys.com> <6.2.0.14.2.20051108112907.02345e80@esmail.cup.hp.com> <96f8e60e0511081233y2e248a3fxfe5b46e05cfcdea6@mail.gmail.com> <6.2.0.14.2.20051108130355.0263def8@esmail.cup.hp.com> <00e101c5e55b$54e537e0$6401a8c0@YOURA11C73D0FD> Message-ID: <6.2.0.14.2.20051109121858.026a6c10@esmail.cup.hp.com> At 10:28 AM 11/9/2005, Rick Frank wrote: >Yes, the application is responsible for detecting lost msgs at the >application level - the transport can not do this. > >RDS does not guarantee that a message has been delivered to the >application - just that once the transport has accepted a msg it will >deliver the msg to the remote node in order without duplication - dealing >with retransmissions, etc due to sporadic / intermittent msg loss over the >interconnect. If after accepting the send - the current path fails - then >RDS will transparently fail over to another path - and if required will >resend / send any already queued msgs to the remote node - again insuring >that no msg is duplicated and they are in order. This is no different >than APM - with the exception that RDS can do this across HCAs. > >The application - Oracle in this case - will deal with detecting a >catastrophic path failure - either due to a send that does not arrive and >or a timedout response or send failure returned from the transport. If >there is no network path to a remote node - it is required that we remove >the remote node from the operating cluster to avoid what is commonly >termed as a "split brain" condition - otherwise known as a "partition in time". 
> >BTW - in our case - the application failure domain logic is the same >whether we are using UDP / uDAPL / iTAPI / TCP / SCTP / etc. Basically, >if we can not talk to a remote node - after some defined period of time - >we will remove the remote node from the cluster. In this case the database >will recover all the interesting state that may have been maintained on >the removed node - allowing the remaining nodes to continue. If later on, >communication to the remote node is restored - it will be allowed to >rejoin the cluster and take on application load. One could be able to talk to the remote node across other HCA but that does not mean one has an understanding of the state at the remote node unless the failure is noted and a resync of state occurs or the remote is able to deal with duplicates, etc. This has nothing to do with API or the transport involved but, as Caitlin noted, the difference between knowing a send buffer is free vs. knowing that the application received the data requested. Therefore, one has only reduced the reliability / robustness problem space to some extent but has not solved it by the use of RDS. Mike > > >----- Original Message ----- >From: Michael Krause >To: Ranjit Pandit >Cc: openib-general at openib.org >Sent: Tuesday, November 08, 2005 4:08 PM >Subject: Re: [openib-general] [ANNOUNCE] Contribute >RDS(ReliableDatagramSockets) to OpenIB > >At 12:33 PM 11/8/2005, Ranjit Pandit wrote: >> > Mike wrote: >> > - RDS does not solve a set of failure models. For example, if a RNIC >> / HCA >> > were to fail, then one cannot simply replay the operations on another >> RNIC / >> > HCA without extracting state, etc. and providing some end-to-end sync of >> > what was really sent / received by the application. Yes, one can recover >> > from cable or switch port failure by using APM style recovery but that is >> > only one class of faults. 
The harder faults either result in the end node >> > being cast out of the cluster or see silent data corruption unless >> > additional steps are taken to transparently recover - again app writers >> > don't want to solve the hard problems; they want that done for them. >> >>The current reference implementation of RDS solves the HCA failure case >>as well. >>Since applications don't need to keep connection states, it's easier >>to handle cases like HCA and intermediate path failures. >>As far as application is concerned, every sendmsg 'could' result in a >>new connection setup in the driver. >>If the current path fails, RDS reestablishes a connection, if >>available, on a different port or a different HCA , and replays the >>failed messages. >>Using APM is not useful because it doesn't provide failover across HCA's. > >I think others may disagree about whether RDS solves the problem. You >have no way of knowing whether something was received or not into the >other node's coherency domain without some intermediary or application's >involvement to see the data arrived. As such, you might see many hardware >level acks occur and not know there is a real failure. If an application >takes any action assuming that send complete means it is delivered, then >it is subject to silent data corruption. Hence, RDS can replay to its >heart content but until there is an application or middleware level of >acknowledgement, you have not solve the fault domain issues. Some may be >happy with this as they just cast out the endnode from the cluster / >database but others see the loss of a server as a big deal so may not be >happy to see this occur. It really comes down to whether you believe >loosing a server is worth while just for a local failure event which is >not fatal to the rest of the server. > >APM's value is the ability to recover from link failure. It has the same >value for any other ULP in that it recovers transparently to the ULP. 
> >Mike > > >---------- >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From caitlinb at broadcom.com Wed Nov 9 12:45:17 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 9 Nov 2005 12:45:17 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB Message-ID: <54AD0F12E08D1541B826BE97C98F99F10415BC@NT-SJCA-0751.brcm.ad.broadcom.com> ________________________________ From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Michael Krause Sent: Wednesday, November 09, 2005 12:21 PM To: Rick Frank; Ranjit Pandit Cc: openib-general at openib.org Subject: Re: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB One could be able to talk to the remote node across other HCA but that does not mean one has an understanding of the state at the remote node unless the failure is noted and a resync of state occurs or the remote is able to deal with duplicates, etc. This has nothing to do with API or the transport involved but, as Caitlin noted, the difference between knowing a send buffer is free vs. knowing that the application received the data requested. Therefore, one has only reduced the reliability / robustness problem space to some extent but has not solved it by the use of RDS. Correct. When there are point-to-point credits (even if only enforced/understood at the ULP) then the application can correctly infer that message N was successfully processed because the matching credit was restored. A transport neutral application can only communicate restoration of credits via ULP messaging. 
When credits are shared across sessions then the ULP has a much more complex task to properly communicate credits. The proposal I presented at RAIT for multistreamed MPA had a non-highlighted option for a "wildcard" endpoint. Without the option multistream MPA is essentially the SCTP adaptation for RDMA running over plain MPA/TCP. It achieves the same reduction in reliable transport layer connections that RDS does, but does not reduce the number of RDMA endpoints. The wildcard option reduces the number of RDMA endpoints as well, but greatly complicates the RDMA state machines. RDS over IB faces similar problems, but solved them slightly differently. Over iWARP I believe these complexities favor keeping the point-to-point logical connection between QP and only reducing the number of L4 connections (from many TCP connections to a single TCP connection or SCTP association). The advantage of that approach is that the API from application to RDMA endpoint (QP) can be left totally unchanged. But I do not see any such option over IB, unless RD is improved or a new SCTP-like connection mode is defined. In my opinion the multi-streaming is the most important feature here, but over IB I do not think there is a natural adaptation that provides multi-streaming without also adding the any-to-any endpoint semantics. Multistream MPA and SCTP can both support the any-to-any endpoint semantics by moving the source to payload information rather than transport information (by invoking "wildcard status" in MS-MPA or by duplicating the field for SCTP). So the RDS API strikes me as the best option for a transport neutral application. MS-MPA and SCTP reductions in transport overhead would be available without special API support. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From lindahl at pathscale.com Wed Nov 9 13:24:37 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed, 9 Nov 2005 13:24:37 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB In-Reply-To: <6.2.0.14.2.20051109121220.026a6318@esmail.cup.hp.com> References: <5D78D28F88822E4D8702BB9EEF1A436773E971@mercury.infiniconsys.com> <6.2.0.14.2.20051108112907.02345e80@esmail.cup.hp.com> <96f8e60e0511081233y2e248a3fxfe5b46e05cfcdea6@mail.gmail.com> <6.2.0.14.2.20051108130355.0263def8@esmail.cup.hp.com> <20051109194253.GF1377@greglaptop.internal.keyresearch.com> <6.2.0.14.2.20051109121220.026a6318@esmail.cup.hp.com> Message-ID: <20051109212437.GH1377@greglaptop.internal.keyresearch.com> On Wed, Nov 09, 2005 at 12:18:28PM -0800, Michael Krause wrote: > So, things like HCA failure are not transparent and one cannot simply > replay the operations since you don't know what was really seen by the > other side unless the application performs the resync itself. I think you are over-stating the case. On the remote end, the kernel piece of RDS knows what it presented to the remote application, ditto on the local end. If only an HCA fails, and not the sending and receiving kernels or applications, that knowledge is not lost. Perhaps you were assuming that RDS would be implemented only in firmware on the HCA, and there is no kernel piece that knows what's going on. I hadn't seen that stated by anyone, and of course there are several existing and contemplated OpenIB devices that are considerably different from the usual offload engine. You could also choose to implement RDS using an offload engine and still keep enough state in the kernel to recover. -- greg From mst at mellanox.co.il Wed Nov 9 13:29:54 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 9 Nov 2005 23:29:54 +0200 Subject: [openib-general] Re: [PATCH] libibverbs: protect device list initialization In-Reply-To: <52ek5pliot.fsf@cisco.com> References: <52ek5pliot.fsf@cisco.com> Message-ID: <20051109212954.GC25508@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] libibverbs: protect device list initialization > > Michael> BTW I think this will or something like this will be > Michael> needed even if we change the library API. > > Yes, we definitely need to be thread-safe here. Do you plan to check in the thread safety patch, meanwhile? > Michael> The problem I see with the API is with the re-entrancy of > Michael> ibv_get_devices: for hotplug to work, it seems clear that > Michael> we'll need to rescan the device list on each call to > Michael> ibv_get_devices, so we will need something like > Michael> ibv_put_devices to let the library know the user is not > Michael> walking the list anymore. > > Michael> Given this assumption, it would seems better to let > Michael> ibv_get_devices to just malloc and return an array of > Michael> devices, and ibv_put_devices return it, than force > Michael> everyone to use the sysfs dlist. > > Yeah, the reason I've been avoiding this is that having > ibv_put_devices seems ugly to me. And having the caller allocate a > buffer (and having to deal with resizing it etc) also seems ugly. > So I'm somewhat paralyzed ;) Maybe its a naming thing? We can call the list "iterator", does this make it less ugly? ibv_device_iter_init(iter); ibv_for_each_device(iter) { } ibv_device_iter_cleanup(iter); -- MST From mst at mellanox.co.il Wed Nov 9 13:31:16 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 Nov 2005 23:31:16 +0200 Subject: [openib-general] Re: [PATCH] libmthca: fix posting long work request lists In-Reply-To: <5264r1lhv1.fsf@cisco.com> References: <5264r1lhv1.fsf@cisco.com> Message-ID: <20051109213116.GD25508@mellanox.co.il> Quoting r. 
Roland Dreier : > Subject: Re: [PATCH] libmthca: fix posting long work request lists > > Same for libmthca.... > > Index: infiniband/hw/mthca/mthca_srq.c You mean mthca -- MST From mst at mellanox.co.il Wed Nov 9 13:40:00 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 Nov 2005 23:40:00 +0200 Subject: [openib-general] Re: [PATCH] mthca: fix posting long work request lists In-Reply-To: <52acgdli69.fsf@cisco.com> References: <52acgdli69.fsf@cisco.com> Message-ID: <20051109214000.GE25508@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] mthca: fix posting long work request lists > > Does this look OK? Nope. comments below. > I added SRQ support and also moved the doorbell[2] > declaration to the top of the functions (since I'm not convinved all > versions of gcc are smart enough to see that two copies of doorbell[] > can share stack slots). True. AFAIK no existing gcc version is smart enough for this. > - R. > > Index: infiniband/hw/mthca/mthca_srq.c > =================================================================== > --- infiniband/hw/mthca/mthca_srq.c (revision 3989) > +++ infiniband/hw/mthca/mthca_srq.c (working copy) > @@ -414,6 +414,7 @@ int mthca_tavor_post_srq_recv(struct ib_ > { > struct mthca_dev *dev = to_mdev(ibsrq->device); > struct mthca_srq *srq = to_msrq(ibsrq); > + __be32 doorbell[2]; > unsigned long flags; > int err = 0; > int first_ind; > @@ -429,6 +430,23 @@ int mthca_tavor_post_srq_recv(struct ib_ > first_ind = srq->first_free; > > for (nreq = 0; wr; ++nreq, wr = wr->next) { > + if (unlikely(nreq == MTHCA_TAVOR_MAX_WQES_PER_RECV_DB)) { > + doorbell[0] = cpu_to_be32(first_ind << srq->wqe_shift); > + doorbell[1] = cpu_to_be32((srq->srqn << 8) | nreq); this one's wrong. nreq will overflow into bit 8. You want to put 0 there since we know nreq is 256: doorbell[1] = cpu_to_be32(qp->qpn << 8); Thats what my patch does for RQ, check it out. 
You also need nreq = 0 here, I think (better here where we've just read it than after the barrier). > + > + /* > + * Make sure that descriptors are written > + * before doorbell is rung. > + */ > + wmb(); > + > + mthca_write64(doorbell, > + dev->kar + MTHCA_RECEIVE_DOORBELL, > + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); > + > + first_ind = srq->first_free; > + } > + > ind = srq->first_free; > > if (ind < 0) { > @@ -491,8 +509,6 @@ int mthca_tavor_post_srq_recv(struct ib_ > } > > if (likely(nreq)) { > - __be32 doorbell[2]; > - > doorbell[0] = cpu_to_be32(first_ind << srq->wqe_shift); > doorbell[1] = cpu_to_be32((srq->srqn << 8) | nreq); > > Index: infiniband/hw/mthca/mthca_wqe.h > =================================================================== > --- infiniband/hw/mthca/mthca_wqe.h (revision 3989) > +++ infiniband/hw/mthca/mthca_wqe.h (working copy) > @@ -49,7 +49,8 @@ enum { > }; > > enum { > - MTHCA_INVAL_LKEY = 0x100 > + MTHCA_INVAL_LKEY = 0x100, > + MTHCA_TAVOR_MAX_WQES_PER_RECV_DB = 256 > }; > > struct mthca_next_seg { > Index: infiniband/hw/mthca/mthca_qp.c > =================================================================== > --- infiniband/hw/mthca/mthca_qp.c (revision 4003) > +++ infiniband/hw/mthca/mthca_qp.c (working copy) > @@ -1705,6 +1705,7 @@ int mthca_tavor_post_receive(struct ib_q > { > struct mthca_dev *dev = to_mdev(ibqp->device); > struct mthca_qp *qp = to_mqp(ibqp); > + __be32 doorbell[2]; > unsigned long flags; > int err = 0; > int nreq; > @@ -1722,6 +1723,21 @@ int mthca_tavor_post_receive(struct ib_q > ind = qp->rq.next_ind; > > for (nreq = 0; wr; ++nreq, wr = wr->next) { > + if (unlikely(nreq == MTHCA_TAVOR_MAX_WQES_PER_RECV_DB)) { > + doorbell[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0); > + doorbell[1] = cpu_to_be32((qp->qpn << 8) | nreq); This one's wrong too. nreq will overflow into bit 8. 
You want to put 0 there: doorbell[1] = cpu_to_be32(qp->qpn << 8); > + > + wmb(); > + > + mthca_write64(doorbell, > + dev->kar + MTHCA_RECEIVE_DOORBELL, > + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); > + > + qp->rq.head += nreq; I know I did it differently, but I now think its better to replace nreq with MTHCA_TAVOR_MAX_WQES_PER_RECV_DB in the line above since wmb should prevent gcc from knowing the value of nreq here or storing it in a register. > + nreq = 0; > + size0 = 0; > + } > + > if (mthca_wq_overflow(&qp->rq, nreq, qp->ibqp.recv_cq)) { > mthca_err(dev, "RQ %06x full (%u head, %u tail," > " %d max, %d nreq)\n", qp->qpn, > @@ -1779,8 +1795,6 @@ int mthca_tavor_post_receive(struct ib_q > > out: > if (likely(nreq)) { > - __be32 doorbell[2]; > - > doorbell[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0); > doorbell[1] = cpu_to_be32((qp->qpn << 8) | nreq); > > -- MST From mst at mellanox.co.il Wed Nov 9 13:48:20 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 9 Nov 2005 23:48:20 +0200 Subject: [openib-general] Re: user_mad.c: deadlock? In-Reply-To: <52vez1ljc6.fsf@cisco.com> References: <52vez1ljc6.fsf@cisco.com> Message-ID: <20051109214820.GI25508@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: user_mad.c: deadlock? > > Roland> And also we need ib_umad_kill_port() to wait for any > Roland> in-progress ib_umad_close() calls, since we don't want to > Roland> call ib_unregister_mad_agent() after we've returned from > Roland> the device removal call. > > Michael> This should work fine too since the last down_write that > Michael> detects that list list is empty will flush these guys > Michael> out. 
> > The problem I run into trying to implement this is that both > ib_umad_close() and ib_umad_kill_port() need to do something like: > > down_write(&port->mutex); > agent = file->agent[id]; > file->agent[id] = NULL; > up_write(&port->mutex); > > if (agent) > ib_unregister_mad_agent(agent); > > but ib_umad_close() could pause arbitrarily long right before the > ib_unregister_mad_agent() call and then end up calling the function > after the device is already gone. > > - R. > I think I see a solution: replace up_write with downgrade_write. This way ib_umad_close has a read lock most of the time, and write lock only while it is changing the list. -- MST From krause at cup.hp.com Wed Nov 9 13:57:06 2005 From: krause at cup.hp.com (Michael Krause) Date: Wed, 09 Nov 2005 13:57:06 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB In-Reply-To: <20051109212437.GH1377@greglaptop.internal.keyresearch.com> References: <5D78D28F88822E4D8702BB9EEF1A436773E971@mercury.infiniconsys.com> <6.2.0.14.2.20051108112907.02345e80@esmail.cup.hp.com> <96f8e60e0511081233y2e248a3fxfe5b46e05cfcdea6@mail.gmail.com> <6.2.0.14.2.20051108130355.0263def8@esmail.cup.hp.com> <20051109194253.GF1377@greglaptop.internal.keyresearch.com> <6.2.0.14.2.20051109121220.026a6318@esmail.cup.hp.com> <20051109212437.GH1377@greglaptop.internal.keyresearch.com> Message-ID: <6.2.0.14.2.20051109135424.02926c30@esmail.cup.hp.com> At 01:24 PM 11/9/2005, Greg Lindahl wrote: >On Wed, Nov 09, 2005 at 12:18:28PM -0800, Michael Krause wrote: > > > So, things like HCA failure are not transparent and one cannot simply > > replay the operations since you don't know what was really seen by the > > other side unless the application performs the resync itself. > >I think you are over-stating the case. On the remote end, the kernel >piece of RDS knows what it presented to the remote application, ditto >on the local end. 
If only an HCA fails, and not the sending and
>receiving kernels or applications, that knowledge is not lost.
>
>Perhaps you were assuming that RDS would be implemented only in
>firmware on the HCA, and there is no kernel piece that knows what's
>going on. I hadn't seen that stated by anyone, and of course there are
>several existing and contemplated OpenIB devices that are considerably
>different from the usual offload engine. You could also choose to
>implement RDS using an offload engine and still keep enough state in
>the kernel to recover.

I hadn't assumed anything. I'm simply trying to understand the assertions
concerning availability and recovery. What you indicate above is that RDS
will implement a resync of the two sides of the association to determine
what has been successfully sent. It will then retransmit what has not been,
transparently to the application. This then implies that the reliability of
the underlying interconnect isn't as critical per se, as the end-to-end RDS
protocol will assure that data is delivered to the RDS components in the
face of hardware failures. Correct?

Mike

From rolandd at cisco.com Wed Nov 9 14:02:06 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 09 Nov 2005 14:02:06 -0800
Subject: [openib-general] Re: [PATCH] mthca: fix posting long work request lists
In-Reply-To: <20051109214000.GE25508@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 9 Nov 2005 23:40:00 +0200")
References: <52acgdli69.fsf@cisco.com> <20051109214000.GE25508@mellanox.co.il>
Message-ID: <52oe4tjx9d.fsf@cisco.com>

    Michael> this one's wrong. nreq will overflow into bit 8. You
    Michael> want to put 0 there since we know nreq is 256:
    Michael> doorbell[1] = cpu_to_be32(qp->qpn << 8);

Thanks, fixed both places.
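The overflow Michael points out is easy to reproduce outside the kernel. The sketch below is a user-space illustration only: it drops `cpu_to_be32()` and the real mthca structures, and the helper names `db_word_naive` and `db_word_fixed` are made up for the example. It assumes, as the thread implies, that the low 8 bits of the second doorbell word carry the request count and that a full batch of 256 is written as 0.

```c
#include <assert.h>
#include <stdint.h>

/* Same value as MTHCA_TAVOR_MAX_WQES_PER_RECV_DB in the patch. */
enum { MAX_WQES_PER_RECV_DB = 256 };

/*
 * Naive encoding: OR the full count into the word. With nreq == 256
 * the count sets bit 8, which belongs to the queue number field.
 */
static uint32_t db_word_naive(uint32_t qpn, uint32_t nreq)
{
	return (qpn << 8) | nreq;
}

/*
 * Encoding along the lines of the fix: only the low 8 bits of the
 * count are written, so a full batch of 256 is encoded as 0 and the
 * queue number is left intact.
 */
static uint32_t db_word_fixed(uint32_t qpn, uint32_t nreq)
{
	assert(nreq <= MAX_WQES_PER_RECV_DB);
	return (qpn << 8) | (nreq & 0xff);
}
```

For queue number 0x12 and a batch of 256, the naive word comes out as 0x1300, which would be read as queue 0x13 with a count field of 0, while the fixed word is 0x1200.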
    Michael> I know I did it differently, but I now think it's better
    Michael> to replace nreq with MTHCA_TAVOR_MAX_WQES_PER_RECV_DB in
    Michael> the line above since wmb should prevent gcc from knowing
    Michael> the value of nreq here or storing it in a register.

Good idea.

How's this:

--- infiniband/hw/mthca/mthca_srq.c	(revision 3989)
+++ infiniband/hw/mthca/mthca_srq.c	(working copy)
@@ -414,6 +414,7 @@ int mthca_tavor_post_srq_recv(struct ib_
 {
 	struct mthca_dev *dev = to_mdev(ibsrq->device);
 	struct mthca_srq *srq = to_msrq(ibsrq);
+	__be32 doorbell[2];
 	unsigned long flags;
 	int err = 0;
 	int first_ind;
@@ -429,6 +430,25 @@ int mthca_tavor_post_srq_recv(struct ib_
 	first_ind = srq->first_free;
 
 	for (nreq = 0; wr; ++nreq, wr = wr->next) {
+		if (unlikely(nreq == MTHCA_TAVOR_MAX_WQES_PER_RECV_DB)) {
+			nreq = 0;
+
+			doorbell[0] = cpu_to_be32(first_ind << srq->wqe_shift);
+			doorbell[1] = cpu_to_be32(srq->srqn << 8);
+
+			/*
+			 * Make sure that descriptors are written
+			 * before doorbell is rung.
+			 */
+			wmb();
+
+			mthca_write64(doorbell,
+				      dev->kar + MTHCA_RECEIVE_DOORBELL,
+				      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
+
+			first_ind = srq->first_free;
+		}
+
 		ind = srq->first_free;
 
 		if (ind < 0) {
@@ -491,8 +511,6 @@ int mthca_tavor_post_srq_recv(struct ib_
 	}
 
 	if (likely(nreq)) {
-		__be32 doorbell[2];
-
 		doorbell[0] = cpu_to_be32(first_ind << srq->wqe_shift);
 		doorbell[1] = cpu_to_be32((srq->srqn << 8) | nreq);
 
--- infiniband/hw/mthca/mthca_wqe.h	(revision 3989)
+++ infiniband/hw/mthca/mthca_wqe.h	(working copy)
@@ -49,7 +49,8 @@ enum {
 };
 
 enum {
-	MTHCA_INVAL_LKEY = 0x100
+	MTHCA_INVAL_LKEY = 0x100,
+	MTHCA_TAVOR_MAX_WQES_PER_RECV_DB = 256
 };
 
 struct mthca_next_seg {
--- infiniband/hw/mthca/mthca_qp.c	(revision 4003)
+++ infiniband/hw/mthca/mthca_qp.c	(working copy)
@@ -1705,6 +1705,7 @@ int mthca_tavor_post_receive(struct ib_q
 {
 	struct mthca_dev *dev = to_mdev(ibqp->device);
 	struct mthca_qp *qp = to_mqp(ibqp);
+	__be32 doorbell[2];
 	unsigned long flags;
 	int err = 0;
 	int nreq;
@@ -1722,6 +1723,22 @@ int mthca_tavor_post_receive(struct ib_q
 	ind = qp->rq.next_ind;
 
 	for (nreq = 0; wr; ++nreq, wr = wr->next) {
+		if (unlikely(nreq == MTHCA_TAVOR_MAX_WQES_PER_RECV_DB)) {
+			nreq = 0;
+
+			doorbell[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0);
+			doorbell[1] = cpu_to_be32(qp->qpn << 8);
+
+			wmb();
+
+			mthca_write64(doorbell,
+				      dev->kar + MTHCA_RECEIVE_DOORBELL,
+				      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));
+
+			qp->rq.head += MTHCA_TAVOR_MAX_WQES_PER_RECV_DB;
+			size0 = 0;
+		}
+
 		if (mthca_wq_overflow(&qp->rq, nreq, qp->ibqp.recv_cq)) {
 			mthca_err(dev, "RQ %06x full (%u head, %u tail,"
 					" %d max, %d nreq)\n", qp->qpn,
@@ -1779,8 +1796,6 @@ int mthca_tavor_post_receive(struct ib_q
 
 out:
 	if (likely(nreq)) {
-		__be32 doorbell[2];
-
 		doorbell[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0);
 		doorbell[1] = cpu_to_be32((qp->qpn << 8) | nreq);
 

From mshefty at ichips.intel.com Wed Nov 9 14:03:28 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 09 Nov 2005 14:03:28 -0800
Subject: [openib-general] Re: usermode hang in mthca_cq_clean
In-Reply-To: <521x1plgx8.fsf@cisco.com>
References: <437254AD.3070908@ichips.intel.com> <521x1plgx8.fsf@cisco.com>
Message-ID: <43727230.7020409@ichips.intel.com>

Roland Dreier wrote:
> I think I see the problem. Does this patch fix it for you?
> (basically you're doing a benchmark seeing how fast your CPU can go
> through the loop 4 billion times ;)

The changes that you checked in fixed this. Thanks.

(I need a faster CPU...)
- Sean

From lindahl at pathscale.com Wed Nov 9 14:09:10 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Wed, 9 Nov 2005 14:09:10 -0800
Subject: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
In-Reply-To: <6.2.0.14.2.20051109135424.02926c30@esmail.cup.hp.com>
References: <5D78D28F88822E4D8702BB9EEF1A436773E971@mercury.infiniconsys.com> <6.2.0.14.2.20051108112907.02345e80@esmail.cup.hp.com> <96f8e60e0511081233y2e248a3fxfe5b46e05cfcdea6@mail.gmail.com> <6.2.0.14.2.20051108130355.0263def8@esmail.cup.hp.com> <20051109194253.GF1377@greglaptop.internal.keyresearch.com> <6.2.0.14.2.20051109121220.026a6318@esmail.cup.hp.com> <20051109212437.GH1377@greglaptop.internal.keyresearch.com> <6.2.0.14.2.20051109135424.02926c30@esmail.cup.hp.com>
Message-ID: <20051109220910.GA2720@greglaptop.internal.keyresearch.com>

On Wed, Nov 09, 2005 at 01:57:06PM -0800, Michael Krause wrote:

> What you indicate above is that RDS
> will implement a resync of the two sides of the association to determine
> what has been successfully sent.

More accurate to say that it "could" implement that. I'm just
kibitzing on someone else's proposal.

> This then implies that the reliability of the underlying
> interconnect isn't as critical per se as the end-to-end RDS protocol
> will assure that data is delivered to the RDS components in the face
> of hardware failures. Correct?

Yes. That's the intent that I see in the proposal. The implementation
required to actually support this may not be what the proposers had in
mind.

This sort of message service, by the way, has a long history in
distributed computing.

-- greg

From mst at mellanox.co.il Wed Nov 9 14:22:01 2005
From: mst at mellanox.co.il (Michael S.
Tsirkin)
Date: Thu, 10 Nov 2005 00:22:01 +0200
Subject: [openib-general] Re: [PATCH] mthca: fix posting long work request lists
In-Reply-To: <52oe4tjx9d.fsf@cisco.com>
References: <52oe4tjx9d.fsf@cisco.com>
Message-ID: <20051109222201.GA3139@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: [PATCH] mthca: fix posting long work request lists
>
>     Michael> this one's wrong. nreq will overflow into bit 8. You
>     Michael> want to put 0 there since we know nreq is 256:
>     Michael> doorbell[1] = cpu_to_be32(qp->qpn << 8);
>
> Thanks, fixed both places.
>
>     Michael> I know I did it differently, but I now think it's better
>     Michael> to replace nreq with MTHCA_TAVOR_MAX_WQES_PER_RECV_DB in
>     Michael> the line above since wmb should prevent gcc from knowing
>     Michael> the value of nreq here or storing it in a register.
>
> Good idea.
>
> How's this:

Looks good.
You didn't post the userspace part; I guess it would be the same.
I'm going to sleep now.

-- 
MST

From iod00d at hp.com Wed Nov 9 14:35:37 2005
From: iod00d at hp.com (Grant Grundler)
Date: Wed, 9 Nov 2005 14:35:37 -0800
Subject: [openib-general] Re: [PATCH] mthca: fix posting long work request lists
In-Reply-To: <52oe4tjx9d.fsf@cisco.com>
References: <52acgdli69.fsf@cisco.com> <20051109214000.GE25508@mellanox.co.il> <52oe4tjx9d.fsf@cisco.com>
Message-ID: <20051109223537.GE25815@esmail.cup.hp.com>

On Wed, Nov 09, 2005 at 02:02:06PM -0800, Roland Dreier wrote:
>  enum {
> -	MTHCA_INVAL_LKEY = 0x100
> +	MTHCA_INVAL_LKEY = 0x100,
> +	MTHCA_TAVOR_MAX_WQES_PER_RECV_DB = 256
>  };

Why would we use two different constants with the same value in one enum?
i.e., why not declare them both as "0x100" *or* "256"?
grant

From iod00d at hp.com Wed Nov 9 14:40:55 2005
From: iod00d at hp.com (Grant Grundler)
Date: Wed, 9 Nov 2005 14:40:55 -0800
Subject: [openib-general] Re: [PATCH] mthca: fix posting long work request lists
In-Reply-To: <20051109214000.GE25508@mellanox.co.il>
References: <52acgdli69.fsf@cisco.com> <20051109214000.GE25508@mellanox.co.il>
Message-ID: <20051109224055.GF25815@esmail.cup.hp.com>

On Wed, Nov 09, 2005 at 11:40:00PM +0200, Michael S. Tsirkin wrote:
> > I added SRQ support and also moved the doorbell[2]
> > declaration to the top of the functions (since I'm not convinced all
> > versions of gcc are smart enough to see that two copies of doorbell[]
> > can share stack slots).
>
> True. AFAIK no existing gcc version is smart enough for this.

I thought gcc tracks register usage and not stack usage per se.
And I don't understand why we should worry about "stack slots"
given the "scope" of usage in each case is very short.

My preference is to always declare variables in the smallest
scope possible...it makes maintaining the code easier and lets
gcc make more efficient use of registers (i.e., don't allocate stack
at all if it can be done in registers).

thanks,
grant

From mshefty at ichips.intel.com Wed Nov 9 14:48:54 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 09 Nov 2005 14:48:54 -0800
Subject: [openib-general] Re: [PATCH] mthca: fix posting long work request lists
In-Reply-To: <20051109224055.GF25815@esmail.cup.hp.com>
References: <52acgdli69.fsf@cisco.com> <20051109214000.GE25508@mellanox.co.il> <20051109224055.GF25815@esmail.cup.hp.com>
Message-ID: <43727CD6.1050103@ichips.intel.com>

Grant Grundler wrote:
> My preference is to always declare variables in the smallest
> scope possible...it makes maintaining the code easier and lets
> gcc make more efficient use of registers (i.e., don't allocate stack
> at all if it can be done in registers).
My personal preference is to keep functions small, with all variable
declarations at the top of the function, where they're easier to locate. I find
declarations scattered throughout code makes the code more difficult to maintain.

- Sean

From rolandd at cisco.com Wed Nov 9 14:52:14 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 09 Nov 2005 14:52:14 -0800
Subject: [openib-general] Re: [PATCH] mthca: fix posting long work request lists
In-Reply-To: <43727CD6.1050103@ichips.intel.com> (Sean Hefty's message of "Wed, 09 Nov 2005 14:48:54 -0800")
References: <52acgdli69.fsf@cisco.com> <20051109214000.GE25508@mellanox.co.il> <20051109224055.GF25815@esmail.cup.hp.com> <43727CD6.1050103@ichips.intel.com>
Message-ID: <52br0tjuxt.fsf@cisco.com>

The issue in this case is that if you do

	void foo(void)
	{
		{
			int x[2];
			/* stuff */
		}

		{
			int x[2];
			/* other stuff */
		}
	}

then gcc allocates stack space for two copies of x even though they
can never both be in scope. I think gcc 4.1 might be better in this
area but that's a little futuristic.

So it's better to declare x once at the top of the function.

 - R.

From rolandd at cisco.com Wed Nov 9 14:53:31 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 09 Nov 2005 14:53:31 -0800
Subject: [openib-general] Re: [PATCH] mthca: fix posting long work request lists
In-Reply-To: <20051109223537.GE25815@esmail.cup.hp.com> (Grant Grundler's message of "Wed, 9 Nov 2005 14:35:37 -0800")
References: <52acgdli69.fsf@cisco.com> <20051109214000.GE25508@mellanox.co.il> <52oe4tjx9d.fsf@cisco.com> <20051109223537.GE25815@esmail.cup.hp.com>
Message-ID: <527jbhjuvo.fsf@cisco.com>

    > +	MTHCA_INVAL_LKEY = 0x100,
    > +	MTHCA_TAVOR_MAX_WQES_PER_RECV_DB = 256

    Grant> Why would we use two different contants with the same value
    Grant> in one enum? ie why not declare them both at "0x100" *or*
    Grant> "256"?

I thought about that when I was writing it.
Both are kind of random
constants that could just as easily be in two enums, but I don't see
much benefit in that.

The reason I left one in hex and one in decimal is that the first
constant is really just a bit pattern, so I think it's clearer in
hex. The second constant really represents the number 256, so I think
it's clearer in decimal.

 - R.

From rpandit at silverstorm.com Wed Nov 9 14:59:58 2005
From: rpandit at silverstorm.com (Ranjit Pandit)
Date: Wed, 9 Nov 2005 14:59:58 -0800
Subject: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
In-Reply-To: <6.2.0.14.2.20051109135424.02926c30@esmail.cup.hp.com>
References: <5D78D28F88822E4D8702BB9EEF1A436773E971@mercury.infiniconsys.com> <6.2.0.14.2.20051108112907.02345e80@esmail.cup.hp.com> <96f8e60e0511081233y2e248a3fxfe5b46e05cfcdea6@mail.gmail.com> <6.2.0.14.2.20051108130355.0263def8@esmail.cup.hp.com> <20051109194253.GF1377@greglaptop.internal.keyresearch.com> <6.2.0.14.2.20051109121220.026a6318@esmail.cup.hp.com> <20051109212437.GH1377@greglaptop.internal.keyresearch.com> <6.2.0.14.2.20051109135424.02926c30@esmail.cup.hp.com>
Message-ID: <96f8e60e0511091459s238e4f38h1f54b84dac2aa8f9@mail.gmail.com>

On 11/9/05, Michael Krause wrote:
> I hadn't assumed anything. I'm simply trying to understand the assertions
> concerning availability and recovery. What you indicate above is that RDS
> will implement a resync of the two sides of the association to determine
> what has been successfully sent. It will then retransmit what has not been,
> transparently to the application. This then implies that the reliability of
> the underlying interconnect isn't as critical per se, as the end-to-end RDS
> protocol will assure that data is delivered to the RDS components in the
> face of hardware failures. Correct?
>
> Mike

Correct.
Ranjit

From rpandit at silverstorm.com Wed Nov 9 15:07:07 2005
From: rpandit at silverstorm.com (Ranjit Pandit)
Date: Wed, 9 Nov 2005 15:07:07 -0800
Subject: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
In-Reply-To: <20051109220910.GA2720@greglaptop.internal.keyresearch.com>
References: <5D78D28F88822E4D8702BB9EEF1A436773E971@mercury.infiniconsys.com> <6.2.0.14.2.20051108112907.02345e80@esmail.cup.hp.com> <96f8e60e0511081233y2e248a3fxfe5b46e05cfcdea6@mail.gmail.com> <6.2.0.14.2.20051108130355.0263def8@esmail.cup.hp.com> <20051109194253.GF1377@greglaptop.internal.keyresearch.com> <6.2.0.14.2.20051109121220.026a6318@esmail.cup.hp.com> <20051109212437.GH1377@greglaptop.internal.keyresearch.com> <6.2.0.14.2.20051109135424.02926c30@esmail.cup.hp.com> <20051109220910.GA2720@greglaptop.internal.keyresearch.com>
Message-ID: <96f8e60e0511091507v59ac8625h9f4b13957705a8e7@mail.gmail.com>

On 11/9/05, Greg Lindahl wrote:
> On Wed, Nov 09, 2005 at 01:57:06PM -0800, Michael Krause wrote:
>
> > What you indicate above is that RDS
> > will implement a resync of the two sides of the association to determine
> > what has been successfully sent.
>
> More accurate to say that it "could" implement that. I'm just
> kibitzing on someone else's proposal.
>
> > This then implies that the reliability of the underlying
> > interconnect isn't as critical per se as the end-to-end RDS protocol
> > will assure that data is delivered to the RDS components in the face
> > of hardware failures. Correct?
>
> Yes. That's the intent that I see in the proposal. The implementation
> required to actually support this may not be what the proposers had in
> mind.

The reference implementation of RDS already supports this.
It supports failover across HCAs just like APM does across ports within an HCA.

>
> This sort of message service, by the way, has a long history in
> distributed computing.
>
> -- greg

From rolandd at cisco.com Wed Nov 9 15:07:04 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 09 Nov 2005 15:07:04 -0800
Subject: [openib-general] Re: user_mad.c: deadlock?
In-Reply-To: <20051109214820.GI25508@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 9 Nov 2005 23:48:20 +0200")
References: <52vez1ljc6.fsf@cisco.com> <20051109214820.GI25508@mellanox.co.il>
Message-ID: <5264r1ju93.fsf@cisco.com>

    Michael> I think I see a solution: replace up_write with
    Michael> downgrade_write. This way ib_umad_close has a read lock
    Michael> most of the time, and write lock only while it is
    Michael> changing the list.

Yes, excellent idea. I wasn't familiar with that API, but that's
almost exactly what we need. It's still a little ugly but I think
this works:
It's still a little ugly but I think this works: --- infiniband/core/user_mad.c (revision 4008) +++ infiniband/core/user_mad.c (working copy) @@ -110,12 +110,13 @@ struct ib_umad_device { }; struct ib_umad_file { - struct ib_umad_port *port; - struct list_head recv_list; - struct list_head port_list; - spinlock_t recv_lock; - wait_queue_head_t recv_wait; - struct ib_mad_agent *agent[IB_UMAD_MAX_AGENTS]; + struct ib_umad_port *port; + struct list_head recv_list; + struct list_head port_list; + spinlock_t recv_lock; + wait_queue_head_t recv_wait; + struct ib_mad_agent *agent[IB_UMAD_MAX_AGENTS]; + int agents_dead; }; struct ib_umad_packet { @@ -144,6 +145,12 @@ static void ib_umad_release_dev(struct k kfree(dev); } +/* caller must hold port->mutex at least for reading */ +static struct ib_mad_agent *__get_agent(struct ib_umad_file *file, int id) +{ + return file->agents_dead ? NULL : file->agent[id]; +} + static int queue_packet(struct ib_umad_file *file, struct ib_mad_agent *agent, struct ib_umad_packet *packet) @@ -151,10 +158,11 @@ static int queue_packet(struct ib_umad_f int ret = 1; down_read(&file->port->mutex); + for (packet->mad.hdr.id = 0; packet->mad.hdr.id < IB_UMAD_MAX_AGENTS; packet->mad.hdr.id++) - if (agent == file->agent[packet->mad.hdr.id]) { + if (agent == __get_agent(file, packet->mad.hdr.id)) { spin_lock_irq(&file->recv_lock); list_add_tail(&packet->list, &file->recv_list); spin_unlock_irq(&file->recv_lock); @@ -326,7 +334,7 @@ static ssize_t ib_umad_write(struct file down_read(&file->port->mutex); - agent = file->agent[packet->mad.hdr.id]; + agent = __get_agent(file, packet->mad.hdr.id); if (!agent) { ret = -EINVAL; goto err_up; @@ -480,7 +488,7 @@ static int ib_umad_reg_agent(struct ib_u } for (agent_id = 0; agent_id < IB_UMAD_MAX_AGENTS; ++agent_id) - if (!file->agent[agent_id]) + if (!__get_agent(file, agent_id)) goto found; ret = -ENOMEM; @@ -530,7 +538,7 @@ static int ib_umad_unreg_agent(struct ib down_write(&file->port->mutex); - if (id < 0 
|| id >= IB_UMAD_MAX_AGENTS || !file->agent[id]) { + if (id < 0 || id >= IB_UMAD_MAX_AGENTS || !__get_agent(file, id)) { ret = -EINVAL; goto out; } @@ -608,21 +616,29 @@ static int ib_umad_close(struct inode *i struct ib_umad_file *file = filp->private_data; struct ib_umad_device *dev = file->port->umad_dev; struct ib_umad_packet *packet, *tmp; + int already_dead; int i; - for (i = 0; i < IB_UMAD_MAX_AGENTS; ++i) - if (file->agent[i]) - ib_unregister_mad_agent(file->agent[i]); + down_write(&file->port->mutex); + + already_dead = file->agents_dead; + file->agents_dead = 1; list_for_each_entry_safe(packet, tmp, &file->recv_list, list) kfree(packet); - down_write(&file->port->mutex); list_del(&file->port_list); - up_write(&file->port->mutex); - kfree(file); + downgrade_write(&file->port->mutex); + + if (!already_dead) + for (i = 0; i < IB_UMAD_MAX_AGENTS; ++i) + if (file->agent[i]) + ib_unregister_mad_agent(file->agent[i]); + + up_read(&file->port->mutex); + kfree(file); kref_put(&dev->ref, ib_umad_release_dev); return 0; @@ -829,7 +845,6 @@ err_cdev: static void ib_umad_kill_port(struct ib_umad_port *port) { struct ib_umad_file *file; - struct ib_mad_agent *agent; int id; class_set_devdata(port->class_dev, NULL); @@ -849,16 +864,26 @@ static void ib_umad_kill_port(struct ib_ port->ib_dev = NULL; up_write(&port->mutex); - list_for_each_entry(file, &port->file_list, port_list) - for (id = 0; id < IB_UMAD_MAX_AGENTS; ++id) { - down_write(&port->mutex); - agent = file->agent[id]; - file->agent[id] = NULL; - up_write(&port->mutex); + down_write(&port->mutex); - if (agent) - ib_unregister_mad_agent(agent); - } + while (!list_empty(&port->file_list)) { + file = list_entry(port->file_list.next, struct ib_umad_file, + port_list); + + file->agents_dead = 1; + list_del_init(&file->port_list); + + downgrade_write(&port->mutex); + + for (id = 0; id < IB_UMAD_MAX_AGENTS; ++id) + if (file->agent[id]) + ib_unregister_mad_agent(file->agent[id]); + + up_read(&port->mutex); + 
down_write(&port->mutex); + } + + up_write(&port->mutex); clear_bit(port->dev_num, dev_map); } From iod00d at hp.com Wed Nov 9 15:23:19 2005 From: iod00d at hp.com (Grant Grundler) Date: Wed, 9 Nov 2005 15:23:19 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F10415BC@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F10415BC@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <20051109232319.GG25815@esmail.cup.hp.com> On Wed, Nov 09, 2005 at 12:45:17PM -0800, Caitlin Bestler wrote: > ... Caitlin, I'm having problems reading the quoting "style" too. Please, can you take a look at "quotefix"? http://home.in.tum.de/~jain/software/outlook-quotefix/ thanks, grant > ________________________________ > > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Michael Krause > Sent: Wednesday, November 09, 2005 12:21 PM > To: Rick Frank; Ranjit Pandit > Cc: openib-general at openib.org > Subject: Re: [openib-general] [ANNOUNCE] Contribute > RDS(ReliableDatagramSockets) to OpenIB > > > > One could be able to talk to the remote node across > other HCA but that does not mean one has an understanding of the state > at the remote node unless the failure is noted and a resync of state > occurs or the remote is able to deal with duplicates, etc. This has > nothing to do with API or the transport involved but, as Caitlin noted, > the difference between knowing a send buffer is free vs. knowing that > the application received the data requested. Therefore, one has only > reduced the reliability / robustness problem space to some extent but > has not solved it by the use of RDS. > > > > Correct. When there are point-to-point credits (even if only > enforced/understood > at the ULP) then the application can correctly infer that message N was > successfully processed because the matching credit was restored. 
A > transport > neutral application can only communicate restoration of credits via ULP > messaging. When credits are shared across sessions then the ULP > has a much more complex task to properly communicate credits. > > The proposal I presented at RAIT for multistreamed MPA had a > non-highlighted > option for a "wildcard" endpoint. Without the option multistream MPA is > essentially > the SCTP adaptation for RDMA running over plain MPA/TCP. It achieves the > same reduction in reliable transport layer connections that RDS does, > but > does not reduce the number of RDMA endpoints. The wildcard option > reduces the number of RDMA endpoints as well, but greatly complicates > the RDMA state machines. RDS over IB faces similar problems, but solved > them slightly differently. > > Over iWARP I believe these complexities favor keeping the point-to-point > logical connection between QP and only reducing the number of L4 > connections (from many TCP connections to a single TCP connection > or SCTP association). The advantage of that approach is that the API > from application to RDMA endpoint (QP) can be left totally unchanged. > But I do not see any such option over IB, unless RD is improved or a > new SCTP-like connection mode is defined. > > In my opinion the multi-streaming is the most important feature here, > but over IB I do not think there is a natural adaptation that provides > multi-streaming without also adding the any-to-any endpoint semantics. > Multistream MPA and SCTP can both support the any-to-any endpoint > semantics by moving the source to payload information rather than > transport information (by invoking "wildcard status" in MS-MPA or > by duplicating the field for SCTP). So the RDS API strikes me as > the best option for a transport neutral application. MS-MPA and SCTP > reductions in transport overhead would be available without special > API support. 
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From rolandd at cisco.com Wed Nov 9 15:34:11 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 09 Nov 2005 15:34:11 -0800
Subject: [openib-general] Re: [PATCH] libibverbs: protect device list initialization
In-Reply-To: <20051109153630.GC8633@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 9 Nov 2005 17:36:30 +0200")
References: <20051109141558.GR31134@mellanox.co.il> <20051109153630.GC8633@mellanox.co.il>
Message-ID: <52r79piefg.fsf@cisco.com>

thanks, I applied this as a good first step.

 - R.

From rolandd at cisco.com Wed Nov 9 15:35:38 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 09 Nov 2005 15:35:38 -0800
Subject: [openib-general] Re: [PATCH] libibverbs: protect device list initialization
In-Reply-To: <20051109212954.GC25508@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 9 Nov 2005 23:29:54 +0200")
References: <52ek5pliot.fsf@cisco.com> <20051109212954.GC25508@mellanox.co.il>
Message-ID: <52mzkdied1.fsf@cisco.com>

    Michael> Maybe it's a naming thing? We can call the list
    Michael> "iterator", does this make it less ugly?

Hmm, that might be a good idea. On the other hand maybe it's too
complicated for such a simple thing. I'll sleep on it tonight and
try to code something tomorrow.

 - R.
From rolandd at cisco.com Wed Nov 9 15:37:47 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 09 Nov 2005 15:37:47 -0800
Subject: [openib-general] some bugs that can be found using the gen2_basic in the contrib/mellanox folder
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E372A9CA@mtlexch01.mtl.com> (Dotan Barak's message of "Tue, 8 Nov 2005 14:24:03 +0200")
References: <6AB138A2AB8C8E4A98B9C0C3D52670E372A9CA@mtlexch01.mtl.com>
Message-ID: <52irv1ie9g.fsf@cisco.com>

    Dotan> We started to write several tests (basic functionality +
    Dotan> stress tests) and soon we will check in the tests to the
    Dotan> svn (and the changes that you suggested).

Great. Some regression tests that might be useful would be tests that
exercise things where we recently tried to fix bugs:

 - posting long (more than 256 entries) lists of receives to both QP
   and SRQ.
 - atomic operations

 - R.

From iod00d at hp.com Wed Nov 9 15:45:56 2005
From: iod00d at hp.com (Grant Grundler)
Date: Wed, 9 Nov 2005 15:45:56 -0800
Subject: [openib-general] Re: [PATCH] mthca: fix posting long work request lists
In-Reply-To: <527jbhjuvo.fsf@cisco.com>
References: <52acgdli69.fsf@cisco.com> <20051109214000.GE25508@mellanox.co.il> <52oe4tjx9d.fsf@cisco.com> <20051109223537.GE25815@esmail.cup.hp.com> <527jbhjuvo.fsf@cisco.com>
Message-ID: <20051109234556.GI25815@esmail.cup.hp.com>

On Wed, Nov 09, 2005 at 02:53:31PM -0800, Roland Dreier wrote:
> > +	MTHCA_INVAL_LKEY = 0x100,
> > +	MTHCA_TAVOR_MAX_WQES_PER_RECV_DB = 256
>
> Grant> Why would we use two different constants with the same value
> Grant> in one enum? ie why not declare them both at "0x100" *or*
> Grant> "256"?
>
> I thought about that when I was writing it. Both are kind of random
> constants that could just as easily be in two enums, but I don't see
> much benefit in that.

Putting them in one enum implies they are related.
That's what confused me. Can they just be #defines ?
> The reason I left one in hex and one in decimal is that the first
> constant is really just a bit pattern, so I think it's clearer in
> hex. The second constant really represents the number 256, so I think
> it's clearer in decimal.

yeah - makes sense.

grant

From dotanb at mellanox.co.il Wed Nov 9 22:15:44 2005
From: dotanb at mellanox.co.il (Dotan Barak)
Date: Thu, 10 Nov 2005 08:15:44 +0200
Subject: [openib-general] some bugs that can be found using the gen2_basic in the contrib/mellanox folder
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E372ACA3@mtlexch01.mtl.com>

>
> Dotan> We started to write several tests (basic functionality +
> Dotan> stress tests) and soon we will check in the tests to the
> Dotan> svn (and the changes that you suggested).
>
> Great. Some regression tests that might be useful would be tests that
> exercise things where we recently tried to fix bugs:
>
> - posting long (more than 256 entries) lists of receives to
> both QP and SRQ.
> - atomic operations
>

The test that we are developing right now found those bugs (at user
level) ... by the end of next week we will commit it to the open_ib svn.

Dotan

From mst at mellanox.co.il Thu Nov 10 08:24:00 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 10 Nov 2005 18:24:00 +0200
Subject: [openib-general] [PATCH] libmthca: fix posting long wqe lists for srq
Message-ID: <20051110162400.GM16589@mellanox.co.il>

Fix posting long WQE lists for SRQ.

Signed-off-by: Michael S.
Tsirkin

Index: src/userspace/libmthca/src/srq.c
===================================================================
--- src/userspace/libmthca/src/srq.c (revision 4016)
+++ src/userspace/libmthca/src/srq.c (working copy)
@@ -99,6 +99,7 @@ int mthca_tavor_post_srq_recv(struct ibv
 
 	for (nreq = 0; wr; ++nreq, wr = wr->next) {
 		if (nreq == MTHCA_TAVOR_MAX_WQES_PER_RECV_DB) {
+			nreq = 0;
 			doorbell[0] = htonl(first_ind << srq->wqe_shift);
 			doorbell[1] = htonl((srq->srqn << 8) | nreq);
 
-- 
MST

From rolandd at cisco.com Thu Nov 10 09:19:23 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Thu, 10 Nov 2005 09:19:23 -0800
Subject: [openib-general] Re: [PATCH] libmthca: fix posting long wqe lists for srq
In-Reply-To: <20051110162400.GM16589@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 10 Nov 2005 18:24:00 +0200")
References: <20051110162400.GM16589@mellanox.co.il>
Message-ID: <52vez01kv8.fsf@cisco.com>

Thanks -- I had basically the same thing in my local working directory
but forgot to commit it.

 - R.
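
The batching rule behind this fix (in Tavor mode a receive doorbell may cover at most 256 work requests, so the per-doorbell counter has to be reset each time one is rung) can be sketched in plain C. This is only an illustrative simulation, not libmthca code: struct db_stats, ring_doorbell() and post_recv_list() are hypothetical names, and the sketch counts doorbells without modelling how the count is encoded in the doorbell word.

```c
/* Limit taken from the patch: a Tavor receive doorbell covers at most 256 WQEs. */
#define MAX_WQES_PER_RECV_DB 256

/* Hypothetical bookkeeping for this sketch; not a libmthca structure. */
struct db_stats {
	int doorbells;	/* how many times the doorbell was rung */
	int wqes;	/* total WQEs announced to the simulated HCA */
};

/* Hypothetical stand-in for writing the doorbell register. */
static void ring_doorbell(struct db_stats *st, int count)
{
	st->doorbells++;
	st->wqes += count;
}

/* Walk a list of 'total' receive requests, ringing a doorbell per full batch. */
static struct db_stats post_recv_list(int total)
{
	struct db_stats st = { 0, 0 };
	int nreq, i;

	for (nreq = 0, i = 0; i < total; ++nreq, ++i) {
		if (nreq == MAX_WQES_PER_RECV_DB) {
			/* A full batch is pending: announce it and start counting again. */
			ring_doorbell(&st, nreq);
			nreq = 0;
		}
		/* A real driver would build one receive WQE here. */
	}

	/* Announce whatever is left over after the last full batch. */
	if (nreq)
		ring_doorbell(&st, nreq);

	return st;
}
```

Under these assumptions, a list of 600 requests rings the doorbell three times (256, 256 and 88 entries). Without the counter reset, the check against the limit would fire only once and every later entry would be lumped into the final doorbell.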
From krause at cup.hp.com Thu Nov 10 10:13:47 2005
From: krause at cup.hp.com (Michael Krause)
Date: Thu, 10 Nov 2005 10:13:47 -0800
Subject: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
In-Reply-To: <20051109220910.GA2720@greglaptop.internal.keyresearch.com>
References: <5D78D28F88822E4D8702BB9EEF1A436773E971@mercury.infiniconsys.com> <6.2.0.14.2.20051108112907.02345e80@esmail.cup.hp.com> <96f8e60e0511081233y2e248a3fxfe5b46e05cfcdea6@mail.gmail.com> <6.2.0.14.2.20051108130355.0263def8@esmail.cup.hp.com> <20051109194253.GF1377@greglaptop.internal.keyresearch.com> <6.2.0.14.2.20051109121220.026a6318@esmail.cup.hp.com> <20051109212437.GH1377@greglaptop.internal.keyresearch.com> <6.2.0.14.2.20051109135424.02926c30@esmail.cup.hp.com> <20051109220910.GA2720@greglaptop.internal.keyresearch.com>
Message-ID: <6.2.0.14.2.20051110101019.02725638@esmail.cup.hp.com>

At 02:09 PM 11/9/2005, Greg Lindahl wrote:
>On Wed, Nov 09, 2005 at 01:57:06PM -0800, Michael Krause wrote:
>
> > What you indicate above is that RDS
> > will implement a resync of the two sides of the association to determine
> > what has been successfully sent.
>
>More accurate to say that it "could" implement that. I'm just
>kibitzing on someone else's proposal.
>
> > This then implies that the reliability of the underlying
> > interconnect isn't as critical per se as the end-to-end RDS protocol
> > will assure that data is delivered to the RDS components in the face
> > of hardware failures. Correct?
>
>Yes. That's the intent that I see in the proposal. The implementation
>required to actually support this may not be what the proposers had in
>mind.

If it is to be reasonably robust, then RDS should be required to support
the resync between the two sides of the communication. This aligns with
the stated objective of implementing reliability in one location in
software and one location in hardware.
Without such resync being required in the ULP, one ends up with a ULP
that falls short of its stated objectives and pushes complexity back up
to the application, which is where the advocates have stated it is too
complex or expensive to get it correct.

>This sort of message service, by the way, has a long history in
>distributed computing.

Yep.

Mike

From rolandd at cisco.com Thu Nov 10 10:31:55 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Thu, 10 Nov 2005 18:31:55 +0000
Subject: [openib-general] [git patch review 1/7] [IB] Have cq_resize() method take an int, not int*
Message-ID: <1131647515831-039f6ac6e65cc7ed@cisco.com>

Change the struct ib_device.resize_cq() method to take a plain integer
that holds the new CQ size, rather than a pointer to an integer that
it uses to return the new size. This makes the interface match the
exported ib_resize_cq() signature, and allows the low-level driver to
update the CQ size with proper locking if necessary.

No in-tree drivers are exporting this method yet.

Signed-off-by: Roland Dreier

---

 drivers/infiniband/core/verbs.c |   12 ++----------
 include/rdma/ib_verbs.h         |    2 +-
 2 files changed, 3 insertions(+), 11 deletions(-)

applies-to: 08d94f59d6f80937db5d87f0bb60eafcedd811d1
40de2e548c225e3ef859e3c60de9785e37e1b5b1
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 72d3ef7..4f51d79 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -324,16 +324,8 @@ EXPORT_SYMBOL(ib_destroy_cq);
 
 int ib_resize_cq(struct ib_cq *cq, int cqe)
 {
-	int ret;
-
-	if (!cq->device->resize_cq)
-		return -ENOSYS;
-
-	ret = cq->device->resize_cq(cq, &cqe);
-	if (!ret)
-		cq->cqe = cqe;
-
-	return ret;
+	return cq->device->resize_cq ?
+		cq->device->resize_cq(cq, cqe) : -ENOSYS;
 }
 EXPORT_SYMBOL(ib_resize_cq);
 
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index f72d46d..a7f4c35 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -881,7 +881,7 @@ struct ib_device {
 						struct ib_ucontext *context,
 						struct ib_udata *udata);
 	int                        (*destroy_cq)(struct ib_cq *cq);
-	int                        (*resize_cq)(struct ib_cq *cq, int *cqe);
+	int                        (*resize_cq)(struct ib_cq *cq, int cqe);
 	int                        (*poll_cq)(struct ib_cq *cq, int num_entries,
 					      struct ib_wc *wc);
 	int                        (*peek_cq)(struct ib_cq *cq, int wc_cnt);
---
0.99.9e

From rolandd at cisco.com Thu Nov 10 10:31:55 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Thu, 10 Nov 2005 18:31:55 +0000
Subject: [openib-general] [git patch review 3/7] [IB] uverbs: have kernel return QP capabilities
In-Reply-To: <1131647515831-cd68db8e19d165ad@cisco.com>
Message-ID: <1131647515831-b1175b20ec8fd319@cisco.com>

Move the computation of QP capabilities (max scatter/gather entries,
max inline data, etc) into the kernel, and have the uverbs module
return the values as part of the create QP response. This keeps
precise knowledge of device limits in the low-level kernel driver.

This requires an ABI bump, so while we're making changes, get rid of
the max_sge parameter for the modify SRQ command -- it's not used and
shouldn't be there.

Signed-off-by: Jack Morgenstein
Signed-off-by: Michael S.
Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/core/uverbs_cmd.c | 12 ++-- drivers/infiniband/hw/mthca/mthca_cmd.c | 2 + drivers/infiniband/hw/mthca/mthca_dev.h | 1 drivers/infiniband/hw/mthca/mthca_main.c | 1 drivers/infiniband/hw/mthca/mthca_provider.c | 2 - drivers/infiniband/hw/mthca/mthca_provider.h | 1 drivers/infiniband/hw/mthca/mthca_qp.c | 86 ++++++++++++++++++++++++-- include/rdma/ib_user_verbs.h | 9 ++- 8 files changed, 98 insertions(+), 16 deletions(-) applies-to: 2741f22c820fb664f6958becc4f3d415eea0e61b 77369ed31daac51f4827c50d30f233c45480235a diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 63a7415..ed45da8 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -708,7 +708,7 @@ ssize_t ib_uverbs_poll_cq(struct ib_uver resp->wc[i].opcode = wc[i].opcode; resp->wc[i].vendor_err = wc[i].vendor_err; resp->wc[i].byte_len = wc[i].byte_len; - resp->wc[i].imm_data = wc[i].imm_data; + resp->wc[i].imm_data = (__u32 __force) wc[i].imm_data; resp->wc[i].qp_num = wc[i].qp_num; resp->wc[i].src_qp = wc[i].src_qp; resp->wc[i].wc_flags = wc[i].wc_flags; @@ -908,7 +908,12 @@ retry: if (ret) goto err_destroy; - resp.qp_handle = uobj->uobject.id; + resp.qp_handle = uobj->uobject.id; + resp.max_recv_sge = attr.cap.max_recv_sge; + resp.max_send_sge = attr.cap.max_send_sge; + resp.max_recv_wr = attr.cap.max_recv_wr; + resp.max_send_wr = attr.cap.max_send_wr; + resp.max_inline_data = attr.cap.max_inline_data; if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) { @@ -1135,7 +1140,7 @@ ssize_t ib_uverbs_post_send(struct ib_uv next->num_sge = user_wr->num_sge; next->opcode = user_wr->opcode; next->send_flags = user_wr->send_flags; - next->imm_data = user_wr->imm_data; + next->imm_data = (__be32 __force) user_wr->imm_data; if (qp->qp_type == IB_QPT_UD) { next->wr.ud.ah = idr_find(&ib_uverbs_ah_idr, @@ -1701,7 +1706,6 @@ ssize_t 
ib_uverbs_modify_srq(struct ib_u } attr.max_wr = cmd.max_wr; - attr.max_sge = cmd.max_sge; attr.srq_limit = cmd.srq_limit; ret = ib_modify_srq(srq, &attr, cmd.attr_mask); diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.c b/drivers/infiniband/hw/mthca/mthca_cmd.c index 49f211d..9ed3458 100644 --- a/drivers/infiniband/hw/mthca/mthca_cmd.c +++ b/drivers/infiniband/hw/mthca/mthca_cmd.c @@ -1060,6 +1060,8 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev dev_lim->hca.arbel.resize_srq = field & 1; MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SG_RQ_OFFSET); dev_lim->max_sg = min_t(int, field, dev_lim->max_sg); + MTHCA_GET(size, outbox, QUERY_DEV_LIM_MAX_DESC_SZ_RQ_OFFSET); + dev_lim->max_desc_sz = min_t(int, size, dev_lim->max_desc_sz); MTHCA_GET(size, outbox, QUERY_DEV_LIM_MPT_ENTRY_SZ_OFFSET); dev_lim->mpt_entry_sz = size; MTHCA_GET(field, outbox, QUERY_DEV_LIM_PBL_SZ_OFFSET); diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h index 808037f..497ff79 100644 --- a/drivers/infiniband/hw/mthca/mthca_dev.h +++ b/drivers/infiniband/hw/mthca/mthca_dev.h @@ -131,6 +131,7 @@ struct mthca_limits { int max_sg; int num_qps; int max_wqes; + int max_desc_sz; int max_qp_init_rdma; int reserved_qps; int num_srqs; diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c index 16594d1..147f248 100644 --- a/drivers/infiniband/hw/mthca/mthca_main.c +++ b/drivers/infiniband/hw/mthca/mthca_main.c @@ -168,6 +168,7 @@ static int __devinit mthca_dev_lim(struc mdev->limits.max_srq_wqes = dev_lim->max_srq_sz; mdev->limits.reserved_srqs = dev_lim->reserved_srqs; mdev->limits.reserved_eecs = dev_lim->reserved_eecs; + mdev->limits.max_desc_sz = dev_lim->max_desc_sz; /* * Subtract 1 from the limit because we need to allocate a * spare CQE so the HCA HW can tell the difference between an diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index e78259b..4cc7e28 100644 --- 
a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -616,11 +616,11 @@ static struct ib_qp *mthca_create_qp(str return ERR_PTR(err); } - init_attr->cap.max_inline_data = 0; init_attr->cap.max_send_wr = qp->sq.max; init_attr->cap.max_recv_wr = qp->rq.max; init_attr->cap.max_send_sge = qp->sq.max_gs; init_attr->cap.max_recv_sge = qp->rq.max_gs; + init_attr->cap.max_inline_data = qp->max_inline_data; return &qp->ibqp; } diff --git a/drivers/infiniband/hw/mthca/mthca_provider.h b/drivers/infiniband/hw/mthca/mthca_provider.h index bcd4b01..1e73947 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.h +++ b/drivers/infiniband/hw/mthca/mthca_provider.h @@ -251,6 +251,7 @@ struct mthca_qp { struct mthca_wq sq; enum ib_sig_type sq_policy; int send_wqe_offset; + int max_inline_data; u64 *wrid; union mthca_buf queue; diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index 8852ea4..7f39af4 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -885,6 +885,48 @@ int mthca_modify_qp(struct ib_qp *ibqp, return err; } +static void mthca_adjust_qp_caps(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_qp *qp) +{ + int max_data_size; + + /* + * Calculate the maximum size of WQE s/g segments, excluding + * the next segment and other non-data segments. + */ + max_data_size = min(dev->limits.max_desc_sz, 1 << qp->sq.wqe_shift) - + sizeof (struct mthca_next_seg); + + switch (qp->transport) { + case MLX: + max_data_size -= 2 * sizeof (struct mthca_data_seg); + break; + + case UD: + if (mthca_is_memfree(dev)) + max_data_size -= sizeof (struct mthca_arbel_ud_seg); + else + max_data_size -= sizeof (struct mthca_tavor_ud_seg); + break; + + default: + max_data_size -= sizeof (struct mthca_raddr_seg); + break; + } + + /* We don't support inline data for kernel QPs (yet). 
*/ + if (!pd->ibpd.uobject) + qp->max_inline_data = 0; + else + qp->max_inline_data = max_data_size - MTHCA_INLINE_HEADER_SIZE; + + qp->sq.max_gs = max_data_size / sizeof (struct mthca_data_seg); + qp->rq.max_gs = (min(dev->limits.max_desc_sz, 1 << qp->rq.wqe_shift) - + sizeof (struct mthca_next_seg)) / + sizeof (struct mthca_data_seg); +} + /* * Allocate and register buffer for WQEs. qp->rq.max, sq.max, * rq.max_gs and sq.max_gs must all be assigned. @@ -902,27 +944,53 @@ static int mthca_alloc_wqe_buf(struct mt size = sizeof (struct mthca_next_seg) + qp->rq.max_gs * sizeof (struct mthca_data_seg); + if (size > dev->limits.max_desc_sz) + return -EINVAL; + for (qp->rq.wqe_shift = 6; 1 << qp->rq.wqe_shift < size; qp->rq.wqe_shift++) ; /* nothing */ - size = sizeof (struct mthca_next_seg) + - qp->sq.max_gs * sizeof (struct mthca_data_seg); + size = qp->sq.max_gs * sizeof (struct mthca_data_seg); switch (qp->transport) { case MLX: size += 2 * sizeof (struct mthca_data_seg); break; + case UD: - if (mthca_is_memfree(dev)) - size += sizeof (struct mthca_arbel_ud_seg); - else - size += sizeof (struct mthca_tavor_ud_seg); + size += mthca_is_memfree(dev) ? + sizeof (struct mthca_arbel_ud_seg) : + sizeof (struct mthca_tavor_ud_seg); + break; + + case UC: + size += sizeof (struct mthca_raddr_seg); + break; + + case RC: + size += sizeof (struct mthca_raddr_seg); + /* + * An atomic op will require an atomic segment, a + * remote address segment and one scatter entry. 
+ */ + size = max_t(int, size, + sizeof (struct mthca_atomic_seg) + + sizeof (struct mthca_raddr_seg) + + sizeof (struct mthca_data_seg)); break; + default: - /* bind seg is as big as atomic + raddr segs */ - size += sizeof (struct mthca_bind_seg); + break; } + /* Make sure that we have enough space for a bind request */ + size = max_t(int, size, sizeof (struct mthca_bind_seg)); + + size += sizeof (struct mthca_next_seg); + + if (size > dev->limits.max_desc_sz) + return -EINVAL; + for (qp->sq.wqe_shift = 6; 1 << qp->sq.wqe_shift < size; qp->sq.wqe_shift++) ; /* nothing */ @@ -1066,6 +1134,8 @@ static int mthca_alloc_qp_common(struct return ret; } + mthca_adjust_qp_caps(dev, pd, qp); + /* * If this is a userspace QP, we're done now. The doorbells * will be allocated and buffers will be initialized in diff --git a/include/rdma/ib_user_verbs.h b/include/rdma/ib_user_verbs.h index 072f3a2..5ff1490 100644 --- a/include/rdma/ib_user_verbs.h +++ b/include/rdma/ib_user_verbs.h @@ -43,7 +43,7 @@ * Increment this value if any changes that break userspace ABI * compatibility are made. 
*/ -#define IB_USER_VERBS_ABI_VERSION 3 +#define IB_USER_VERBS_ABI_VERSION 4 enum { IB_USER_VERBS_CMD_GET_CONTEXT, @@ -333,6 +333,11 @@ struct ib_uverbs_create_qp { struct ib_uverbs_create_qp_resp { __u32 qp_handle; __u32 qpn; + __u32 max_send_wr; + __u32 max_recv_wr; + __u32 max_send_sge; + __u32 max_recv_sge; + __u32 max_inline_data; }; /* @@ -552,9 +557,7 @@ struct ib_uverbs_modify_srq { __u32 srq_handle; __u32 attr_mask; __u32 max_wr; - __u32 max_sge; __u32 srq_limit; - __u32 reserved; __u64 driver_data[0]; }; --- 0.99.9e From rolandd at cisco.com Thu Nov 10 10:31:55 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 10 Nov 2005 18:31:55 +0000 Subject: [openib-general] [git patch review 2/7] [IB] umad: get rid of unused mr array In-Reply-To: <1131647515831-039f6ac6e65cc7ed@cisco.com> Message-ID: <1131647515831-cd68db8e19d165ad@cisco.com> Now that ib_umad uses the new MAD sending interface, it no longer needs its own L_Key. So just delete the array of MRs that it keeps. Signed-off-by: Roland Dreier --- drivers/infiniband/core/user_mad.c | 29 ++++------------------------- 1 files changed, 4 insertions(+), 25 deletions(-) applies-to: e7b9ffe6fca9246f29a0a3cdf6417770f5821cef ec914c52d6208d8752dfd85b48a9aff304911434 diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c index f5ed36c..d61f544 100644 --- a/drivers/infiniband/core/user_mad.c +++ b/drivers/infiniband/core/user_mad.c @@ -116,7 +116,6 @@ struct ib_umad_file { spinlock_t recv_lock; wait_queue_head_t recv_wait; struct ib_mad_agent *agent[IB_UMAD_MAX_AGENTS]; - struct ib_mr *mr[IB_UMAD_MAX_AGENTS]; }; struct ib_umad_packet { @@ -505,29 +504,16 @@ found: goto out; } - file->mr[agent_id] = ib_get_dma_mr(agent->qp->pd, IB_ACCESS_LOCAL_WRITE); - if (IS_ERR(file->mr[agent_id])) { - ret = -ENOMEM; - goto err; - } - if (put_user(agent_id, (u32 __user *) (arg + offsetof(struct ib_user_mad_reg_req, id)))) { ret = -EFAULT; - goto err_mr; + ib_unregister_mad_agent(agent); + goto 
out; } file->agent[agent_id] = agent; ret = 0; - goto out; - -err_mr: - ib_dereg_mr(file->mr[agent_id]); - -err: - ib_unregister_mad_agent(agent); - out: up_write(&file->port->mutex); return ret; @@ -536,7 +522,6 @@ out: static int ib_umad_unreg_agent(struct ib_umad_file *file, unsigned long arg) { struct ib_mad_agent *agent = NULL; - struct ib_mr *mr = NULL; u32 id; int ret = 0; @@ -551,16 +536,13 @@ static int ib_umad_unreg_agent(struct ib } agent = file->agent[id]; - mr = file->mr[id]; file->agent[id] = NULL; out: up_write(&file->port->mutex); - if (agent) { + if (agent) ib_unregister_mad_agent(agent); - ib_dereg_mr(mr); - } return ret; } @@ -629,10 +611,8 @@ static int ib_umad_close(struct inode *i int i; for (i = 0; i < IB_UMAD_MAX_AGENTS; ++i) - if (file->agent[i]) { + if (file->agent[i]) ib_unregister_mad_agent(file->agent[i]); - ib_dereg_mr(file->mr[i]); - } list_for_each_entry_safe(packet, tmp, &file->recv_list, list) kfree(packet); @@ -872,7 +852,6 @@ static void ib_umad_kill_port(struct ib_ for (id = 0; id < IB_UMAD_MAX_AGENTS; ++id) { if (!file->agent[id]) continue; - ib_dereg_mr(file->mr[id]); ib_unregister_mad_agent(file->agent[id]); file->agent[id] = NULL; } --- 0.99.9e From rolandd at cisco.com Thu Nov 10 10:31:55 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 10 Nov 2005 18:31:55 +0000 Subject: [openib-general] [git patch review 6/7] [IB] mthca: fix posting long lists of receive work requests In-Reply-To: <1131647515831-fb6db61e0af75c4b@cisco.com> Message-ID: <1131647515831-7161f73f404fbe76@cisco.com> In Tavor mode, when posting a long list of receive work requests, a doorbell must be rung every 256 requests. Add code to do this when required. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_qp.c | 19 +++++++++++++++++-- drivers/infiniband/hw/mthca/mthca_srq.c | 22 ++++++++++++++++++++-- drivers/infiniband/hw/mthca/mthca_wqe.h | 3 ++- 3 files changed, 39 insertions(+), 5 deletions(-) applies-to: 984d2fc62c548af3d01450135f33b5b97aecf00b ae57e24a4006fd46b73d842ee99db9580ef74a02 diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index 190c1dc..760c418 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -1707,6 +1707,7 @@ int mthca_tavor_post_receive(struct ib_q { struct mthca_dev *dev = to_mdev(ibqp->device); struct mthca_qp *qp = to_mqp(ibqp); + __be32 doorbell[2]; unsigned long flags; int err = 0; int nreq; @@ -1724,6 +1725,22 @@ int mthca_tavor_post_receive(struct ib_q ind = qp->rq.next_ind; for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (unlikely(nreq == MTHCA_TAVOR_MAX_WQES_PER_RECV_DB)) { + nreq = 0; + + doorbell[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0); + doorbell[1] = cpu_to_be32(qp->qpn << 8); + + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_RECEIVE_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + + qp->rq.head += MTHCA_TAVOR_MAX_WQES_PER_RECV_DB; + size0 = 0; + } + if (mthca_wq_overflow(&qp->rq, nreq, qp->ibqp.recv_cq)) { mthca_err(dev, "RQ %06x full (%u head, %u tail," " %d max, %d nreq)\n", qp->qpn, @@ -1781,8 +1798,6 @@ int mthca_tavor_post_receive(struct ib_q out: if (likely(nreq)) { - __be32 doorbell[2]; - doorbell[0] = cpu_to_be32((qp->rq.next_ind << qp->rq.wqe_shift) | size0); doorbell[1] = cpu_to_be32((qp->qpn << 8) | nreq); diff --git a/drivers/infiniband/hw/mthca/mthca_srq.c b/drivers/infiniband/hw/mthca/mthca_srq.c index 292f55b..c3c0331 100644 --- a/drivers/infiniband/hw/mthca/mthca_srq.c +++ b/drivers/infiniband/hw/mthca/mthca_srq.c @@ -414,6 +414,7 @@ int mthca_tavor_post_srq_recv(struct ib_ { struct mthca_dev *dev = 
to_mdev(ibsrq->device); struct mthca_srq *srq = to_msrq(ibsrq); + __be32 doorbell[2]; unsigned long flags; int err = 0; int first_ind; @@ -429,6 +430,25 @@ int mthca_tavor_post_srq_recv(struct ib_ first_ind = srq->first_free; for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (unlikely(nreq == MTHCA_TAVOR_MAX_WQES_PER_RECV_DB)) { + nreq = 0; + + doorbell[0] = cpu_to_be32(first_ind << srq->wqe_shift); + doorbell[1] = cpu_to_be32(srq->srqn << 8); + + /* + * Make sure that descriptors are written + * before doorbell is rung. + */ + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_RECEIVE_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + + first_ind = srq->first_free; + } + ind = srq->first_free; if (ind < 0) { @@ -491,8 +511,6 @@ int mthca_tavor_post_srq_recv(struct ib_ } if (likely(nreq)) { - __be32 doorbell[2]; - doorbell[0] = cpu_to_be32(first_ind << srq->wqe_shift); doorbell[1] = cpu_to_be32((srq->srqn << 8) | nreq); diff --git a/drivers/infiniband/hw/mthca/mthca_wqe.h b/drivers/infiniband/hw/mthca/mthca_wqe.h index 1f4c0ff..73f1c0b 100644 --- a/drivers/infiniband/hw/mthca/mthca_wqe.h +++ b/drivers/infiniband/hw/mthca/mthca_wqe.h @@ -49,7 +49,8 @@ enum { }; enum { - MTHCA_INVAL_LKEY = 0x100 + MTHCA_INVAL_LKEY = 0x100, + MTHCA_TAVOR_MAX_WQES_PER_RECV_DB = 256 }; struct mthca_next_seg { --- 0.99.9e From rolandd at cisco.com Thu Nov 10 10:31:55 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 10 Nov 2005 18:31:55 +0000 Subject: [openib-general] [git patch review 4/7] [IB] mthca: fix posting of atomic operations In-Reply-To: <1131647515831-b1175b20ec8fd319@cisco.com> Message-ID: <1131647515831-8ba0b803b8214b97@cisco.com> The size of work requests for atomic operations was computed incorrectly in mthca: all sizeofs need to be divided by 16. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_qp.c | 8 ++++---- 1 files changed, 4 insertions(+), 4 deletions(-) applies-to: 308dce81364b1cbb563942a1a57146c1808e8911 62abb8416f1923f4cef50ce9ce841b919275e3fb diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index 7f39af4..190c1dc 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -1556,8 +1556,8 @@ int mthca_tavor_post_send(struct ib_qp * } wqe += sizeof (struct mthca_atomic_seg); - size += sizeof (struct mthca_raddr_seg) / 16 + - sizeof (struct mthca_atomic_seg); + size += (sizeof (struct mthca_raddr_seg) + + sizeof (struct mthca_atomic_seg)) / 16; break; case IB_WR_RDMA_WRITE: @@ -1876,8 +1876,8 @@ int mthca_arbel_post_send(struct ib_qp * } wqe += sizeof (struct mthca_atomic_seg); - size += sizeof (struct mthca_raddr_seg) / 16 + - sizeof (struct mthca_atomic_seg); + size += (sizeof (struct mthca_raddr_seg) + + sizeof (struct mthca_atomic_seg)) / 16; break; case IB_WR_RDMA_READ: --- 0.99.9e From rolandd at cisco.com Thu Nov 10 10:31:55 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 10 Nov 2005 18:31:55 +0000 Subject: [openib-general] [git patch review 5/7] [IB] mthca: fix wraparound handling in mthca_cq_clean() In-Reply-To: <1131647515831-8ba0b803b8214b97@cisco.com> Message-ID: <1131647515831-fb6db61e0af75c4b@cisco.com> Handle case where prod_index has wrapped around and become less than cq->cons_index by checking that their difference as a signed int is positive rather than comparing directly. 
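The signed-difference test described above can be sketched in isolation. This is a minimal userspace illustration, not the mthca code: the helper names `naive_ahead`, `wrap_safe_ahead` and `entries_to_sweep` are made up for the example, and it assumes 32-bit indices and a compiler that converts out-of-range unsigned values to signed by wrapping (as gcc does).

```c
#include <stdint.h>

/* Direct comparison: gives the wrong answer once prod has wrapped past
 * 2^32 while cons has not. */
static int naive_ahead(uint32_t prod, uint32_t cons)
{
	return prod > cons;
}

/* Wraparound-safe: take the difference modulo 2^32 and look at its sign. */
static int wrap_safe_ahead(uint32_t prod, uint32_t cons)
{
	return (int32_t)(prod - cons) > 0;
}

/* Count how many entries a backwards sweep from prod down to cons visits,
 * using the same loop shape as the patch (pre-decrement, then test). */
static unsigned entries_to_sweep(uint32_t prod, uint32_t cons)
{
	unsigned n = 0;

	while ((int32_t)(--prod - cons) >= 0)
		++n;
	return n;
}
```

With prod = 2 and cons = 0xFFFFFFFE the producer has wrapped: the direct comparison reports nothing to sweep, while the signed difference still sees the four outstanding entries.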
Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_cq.c | 16 ++++++---------- 1 files changed, 6 insertions(+), 10 deletions(-) applies-to: 704990abeb22a51ed2722e92536d22135f60957f 64044bcf75063cb5a6d42712886a712449df2ce3 diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c index f98e235..4a8adce 100644 --- a/drivers/infiniband/hw/mthca/mthca_cq.c +++ b/drivers/infiniband/hw/mthca/mthca_cq.c @@ -258,7 +258,7 @@ void mthca_cq_clean(struct mthca_dev *de { struct mthca_cq *cq; struct mthca_cqe *cqe; - int prod_index; + u32 prod_index; int nfreed = 0; spin_lock_irq(&dev->cq_table.lock); @@ -293,19 +293,15 @@ void mthca_cq_clean(struct mthca_dev *de * Now sweep backwards through the CQ, removing CQ entries * that match our QP by copying older entries on top of them. */ - while (prod_index > cq->cons_index) { - cqe = get_cqe(cq, (prod_index - 1) & cq->ibcq.cqe); + while ((int) --prod_index - (int) cq->cons_index >= 0) { + cqe = get_cqe(cq, prod_index & cq->ibcq.cqe); if (cqe->my_qpn == cpu_to_be32(qpn)) { if (srq) mthca_free_srq_wqe(srq, be32_to_cpu(cqe->wqe)); ++nfreed; - } - else if (nfreed) - memcpy(get_cqe(cq, (prod_index - 1 + nfreed) & - cq->ibcq.cqe), - cqe, - MTHCA_CQ_ENTRY_SIZE); - --prod_index; + } else if (nfreed) + memcpy(get_cqe(cq, (prod_index + nfreed) & cq->ibcq.cqe), + cqe, MTHCA_CQ_ENTRY_SIZE); } if (nfreed) { --- 0.99.9e From rolandd at cisco.com Thu Nov 10 10:31:55 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 10 Nov 2005 18:31:55 +0000 Subject: [openib-general] [git patch review 7/7] [IB] umad: further ib_unregister_mad_agent() deadlock fixes In-Reply-To: <1131647515831-7161f73f404fbe76@cisco.com> Message-ID: <1131647515831-6e04eecf9c835e20@cisco.com> The previous umad deadlock fix left ib_umad_kill_port() still vulnerable to deadlocking. This patch fixes that by downgrading our lock to a read lock when we might end up trying to reacquire the lock for reading. 
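The downgrade pattern can be illustrated with a toy userspace lock. This is only a sketch under stated assumptions: `struct toy_rwsem` and every `toy_*` function are hypothetical stand-ins built on pthreads, not the kernel rwsem API, and the callback is invoked directly on the same thread for simplicity, whereas in the driver the read lock is taken from the MAD receive path.

```c
#include <pthread.h>
#include <stddef.h>

/* Toy reader/writer lock with a downgrade operation (not the kernel rwsem). */
struct toy_rwsem {
	pthread_mutex_t mu;
	pthread_cond_t  cv;
	int             readers;  /* active read holders */
	int             writer;   /* 1 while held for writing */
};

static void toy_init(struct toy_rwsem *s)
{
	pthread_mutex_init(&s->mu, NULL);
	pthread_cond_init(&s->cv, NULL);
	s->readers = 0;
	s->writer = 0;
}

static void toy_down_write(struct toy_rwsem *s)
{
	pthread_mutex_lock(&s->mu);
	while (s->writer || s->readers)
		pthread_cond_wait(&s->cv, &s->mu);
	s->writer = 1;
	pthread_mutex_unlock(&s->mu);
}

static void toy_down_read(struct toy_rwsem *s)
{
	pthread_mutex_lock(&s->mu);
	while (s->writer)
		pthread_cond_wait(&s->cv, &s->mu);
	s->readers++;
	pthread_mutex_unlock(&s->mu);
}

static void toy_up_read(struct toy_rwsem *s)
{
	pthread_mutex_lock(&s->mu);
	if (--s->readers == 0)
		pthread_cond_broadcast(&s->cv);
	pthread_mutex_unlock(&s->mu);
}

/* Turn a held write lock into a read lock without ever releasing it. */
static void toy_downgrade_write(struct toy_rwsem *s)
{
	pthread_mutex_lock(&s->mu);
	s->writer = 0;
	s->readers = 1;
	pthread_cond_broadcast(&s->cv);
	pthread_mutex_unlock(&s->mu);
}

/* Stand-in for a callback such as queue_packet() that takes the read lock. */
static int toy_callback(struct toy_rwsem *s)
{
	toy_down_read(s);
	toy_up_read(s);
	return 1;
}

/* Update state under the write lock, downgrade, then make the call that
 * may take the read lock again. Returns the number of completed callbacks. */
static int toy_teardown(struct toy_rwsem *s)
{
	int done;

	toy_down_write(s);
	/* ... mark agents dead, unlink the file from the list ... */
	toy_downgrade_write(s);
	done = toy_callback(s);  /* would block forever under the write lock */
	toy_up_read(s);
	return done;
}
```

Calling `toy_init()` followed by `toy_teardown()` returns 1 and leaves the lock free; skipping the downgrade would leave `toy_callback()` waiting on a writer that never goes away.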
Signed-off-by: Roland Dreier --- drivers/infiniband/core/user_mad.c | 87 ++++++++++++++++++++++++++---------- 1 files changed, 63 insertions(+), 24 deletions(-) applies-to: 17115437026be55dcd74641be21561fecf33dcdb 94382f3562e350ed7c8f7dcd6fc968bdece31328 diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c index d61f544..5ea741f 100644 --- a/drivers/infiniband/core/user_mad.c +++ b/drivers/infiniband/core/user_mad.c @@ -31,7 +31,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: user_mad.c 2814 2005-07-06 19:14:09Z halr $ + * $Id: user_mad.c 4010 2005-11-09 23:11:56Z roland $ */ #include @@ -110,12 +110,13 @@ struct ib_umad_device { }; struct ib_umad_file { - struct ib_umad_port *port; - struct list_head recv_list; - struct list_head port_list; - spinlock_t recv_lock; - wait_queue_head_t recv_wait; - struct ib_mad_agent *agent[IB_UMAD_MAX_AGENTS]; + struct ib_umad_port *port; + struct list_head recv_list; + struct list_head port_list; + spinlock_t recv_lock; + wait_queue_head_t recv_wait; + struct ib_mad_agent *agent[IB_UMAD_MAX_AGENTS]; + int agents_dead; }; struct ib_umad_packet { @@ -144,6 +145,12 @@ static void ib_umad_release_dev(struct k kfree(dev); } +/* caller must hold port->mutex at least for reading */ +static struct ib_mad_agent *__get_agent(struct ib_umad_file *file, int id) +{ + return file->agents_dead ? 
NULL : file->agent[id]; +} + static int queue_packet(struct ib_umad_file *file, struct ib_mad_agent *agent, struct ib_umad_packet *packet) @@ -151,10 +158,11 @@ static int queue_packet(struct ib_umad_f int ret = 1; down_read(&file->port->mutex); + for (packet->mad.hdr.id = 0; packet->mad.hdr.id < IB_UMAD_MAX_AGENTS; packet->mad.hdr.id++) - if (agent == file->agent[packet->mad.hdr.id]) { + if (agent == __get_agent(file, packet->mad.hdr.id)) { spin_lock_irq(&file->recv_lock); list_add_tail(&packet->list, &file->recv_list); spin_unlock_irq(&file->recv_lock); @@ -326,7 +334,7 @@ static ssize_t ib_umad_write(struct file down_read(&file->port->mutex); - agent = file->agent[packet->mad.hdr.id]; + agent = __get_agent(file, packet->mad.hdr.id); if (!agent) { ret = -EINVAL; goto err_up; @@ -480,7 +488,7 @@ static int ib_umad_reg_agent(struct ib_u } for (agent_id = 0; agent_id < IB_UMAD_MAX_AGENTS; ++agent_id) - if (!file->agent[agent_id]) + if (!__get_agent(file, agent_id)) goto found; ret = -ENOMEM; @@ -530,7 +538,7 @@ static int ib_umad_unreg_agent(struct ib down_write(&file->port->mutex); - if (id < 0 || id >= IB_UMAD_MAX_AGENTS || !file->agent[id]) { + if (id < 0 || id >= IB_UMAD_MAX_AGENTS || !__get_agent(file, id)) { ret = -EINVAL; goto out; } @@ -608,21 +616,29 @@ static int ib_umad_close(struct inode *i struct ib_umad_file *file = filp->private_data; struct ib_umad_device *dev = file->port->umad_dev; struct ib_umad_packet *packet, *tmp; + int already_dead; int i; - for (i = 0; i < IB_UMAD_MAX_AGENTS; ++i) - if (file->agent[i]) - ib_unregister_mad_agent(file->agent[i]); + down_write(&file->port->mutex); + + already_dead = file->agents_dead; + file->agents_dead = 1; list_for_each_entry_safe(packet, tmp, &file->recv_list, list) kfree(packet); - down_write(&file->port->mutex); list_del(&file->port_list); - up_write(&file->port->mutex); - kfree(file); + downgrade_write(&file->port->mutex); + + if (!already_dead) + for (i = 0; i < IB_UMAD_MAX_AGENTS; ++i) + if 
(file->agent[i]) + ib_unregister_mad_agent(file->agent[i]); + up_read(&file->port->mutex); + + kfree(file); kref_put(&dev->ref, ib_umad_release_dev); return 0; @@ -848,13 +864,36 @@ static void ib_umad_kill_port(struct ib_ port->ib_dev = NULL; - list_for_each_entry(file, &port->file_list, port_list) - for (id = 0; id < IB_UMAD_MAX_AGENTS; ++id) { - if (!file->agent[id]) - continue; - ib_unregister_mad_agent(file->agent[id]); - file->agent[id] = NULL; - } + /* + * Now go through the list of files attached to this port and + * unregister all of their MAD agents. We need to hold + * port->mutex while doing this to avoid racing with + * ib_umad_close(), but we can't hold the mutex for writing + * while calling ib_unregister_mad_agent(), since that might + * deadlock by calling back into queue_packet(). So we + * downgrade our lock to a read lock, and then drop and + * reacquire the write lock for the next iteration. + * + * We do list_del_init() on the file's list_head so that the + * list_del in ib_umad_close() is still OK, even after the + * file is removed from the list. + */ + while (!list_empty(&port->file_list)) { + file = list_entry(port->file_list.next, struct ib_umad_file, + port_list); + + file->agents_dead = 1; + list_del_init(&file->port_list); + + downgrade_write(&port->mutex); + + for (id = 0; id < IB_UMAD_MAX_AGENTS; ++id) + if (file->agent[id]) + ib_unregister_mad_agent(file->agent[id]); + + up_read(&port->mutex); + down_write(&port->mutex); + } up_write(&port->mutex); --- 0.99.9e From eeb at bartonsoftware.com Thu Nov 10 10:32:30 2005 From: eeb at bartonsoftware.com (Eric Barton) Date: Thu, 10 Nov 2005 18:32:30 -0000 Subject: [openib-general] Lustre over OpenIB Gen2 In-Reply-To: <6.2.0.14.2.20051110101019.02725638@esmail.cup.hp.com> Message-ID: <00b801c5e625$1c4ffdf0$0281a8c0@ebpc> Hi, I'm working with Cluster File Systems on lustre network drivers, including IB drivers for the Voltaire, Infinicon and Topspin stacks. 
These are kernel drivers which use RC QPs with VERBS for small message
queueing and RDMA for bulk transfers.

We're obviously looking at OpenIB Gen2, and I wonder if people could be so
kind as to answer some questions for me.

1. How stable is the CM API and is it supported by all OpenIB affiliated
   vendors?

2. I'd like to scale to >= 10,000 peer nodes; 1 RC QP per peer. Is this going
   to get me into trouble?

   For example, I currently create a single PD and CQ for everything, however
   the example I've seen (cmatose.c) appears to create these separately for
   each peer. Is that what I should be doing too?

3. Is contiguous memory allocation an issue in Gen2? Since this is such a
   scarce resource in the kernel (and particular CQ usage with one vendor's
   stack relied heavily on it) what red flags should I be aware of?

4. Are RDMA reads still deprecated? Which resources hit the spotlight if I
   chose to use them?

5. Should I pre-map all physical memory and do RDMA in page-sized fragments?
   This avoids any mapping overhead at the expense of having much larger
   numbers of queued RDMAs. Since I try to keep up to 8 (by default) 1MByte
   RDMAs active concurrently to any individual peer, with 4k pages I can have
   up to 2048 RDMA work items queued at a time per peer.

   And if I pre-map, can I be guaranteed that if I put the CQ into the error
   state, all remote access to my memory is revoked (e.g. could a CQ I create
   after I destroy the one I just shut down somehow alias with it such that a
   pathologically delayed RDMA could write my memory)?

   Or is it better to use FMR pools and take the map/unmap overhead? If so,
   is there a way to know when the unmap actually hits the hardware and my
   memory is safe?

6. Does Gen2 present substantially the same APIs as the kernel in userspace?
   So if I wrote a userspace equivalent of my kernel driver, could I have pure
   userspace clients talk to kernel servers?

Thanks in advance...
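To make the figures in question 5 concrete, here is a rough sketch of the arithmetic. The function names are illustrative only and not part of any OpenIB API, and the sketch assumes 1 MByte means 2^20 bytes and that every one of the 10,000 peers from question 2 is fully loaded at once.

```c
#include <stdint.h>

/* Work items needed per peer when each RDMA is split into page-sized pieces. */
static uint64_t work_items_per_peer(uint64_t rdmas_in_flight,
				    uint64_t rdma_bytes,
				    uint64_t page_bytes)
{
	return rdmas_in_flight * (rdma_bytes / page_bytes);
}

/* Worst case across all peers if every peer is fully loaded at once. */
static uint64_t work_items_total(uint64_t peers, uint64_t per_peer)
{
	return peers * per_peer;
}
```

Eight 1 MByte transfers in 4 KByte fragments give 8 * 256 = 2048 work items per peer, and 10,000 such peers would need 20,480,000 in the worst case, which is why a cap on the total is worth considering.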
-- 
Cheers,
Eric

---------------------------------------------------
|Eric Barton        Barton Software               |
|9 York Gardens     Tel:    +44 (117) 330 1575    |
|Clifton            Mobile: +44 (7909) 680 356    |
|Bristol BS8 4LL    Fax:    call first            |
|United Kingdom     E-Mail: ----------------------|
---------------------------------------------------

From caitlinb at broadcom.com Thu Nov 10 10:48:40 2005
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Thu, 10 Nov 2005 10:48:40 -0800
Subject: [openib-general] [ANNOUNCE] Contribute RDS ( ReliableDatagramSockets) to OpenIB
Message-ID: <54AD0F12E08D1541B826BE97C98F99F1041618@NT-SJCA-0751.brcm.ad.broadcom.com>

Mike Krause wrote in response to Greg Lindahl:

> If it is to be reasonably robust, then RDS should be required to support
> the resync between the two sides of the communication. This aligns with the
> stated objective of implementing reliability in one location in software and
> one location in hardware. Without such resync being required in the ULP,
> then one ends up with a ULP that falls short of its stated objectives and
> pushes complexity back up to the application which is where the advocates
> have stated it is too complex or expensive to get it correct.

>> This sort of message service, by the way, has a long history in distributed computing.

> Yep.

I haven't reread all of RDS fine print to double-check this, but my
impression is that RDS semantics exactly match the subset of MPI
point-to-point communications where the receiving rank is required to
have pre-posted buffers before the send is allowed.

From mshefty at ichips.intel.com Thu Nov 10 10:55:24 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 10 Nov 2005 10:55:24 -0800
Subject: [openib-general] Lustre over OpenIB Gen2
In-Reply-To: <00b801c5e625$1c4ffdf0$0281a8c0@ebpc>
References: <00b801c5e625$1c4ffdf0$0281a8c0@ebpc>
Message-ID: <4373979C.9040604@ichips.intel.com>

Eric Barton wrote:
> 1. How stable is the CM API and is it supported by all OpenIB affiliated
> vendors?

The IB CM API is stable. Changes might occur as a result of changes to the CM
protocol itself, but that effect is not limited to just the openib API.

The RDMA CMA API is fairly stable, but could still see minor changes. This
would be the better connection API to use if you want to connect using IP
addresses.

> 2. I'd like to scale to >= 10,000 peer nodes; 1 RC QP per peer. Is this going
> to get me into trouble?
>
> For example, I currently create a single PD and CQ for everything, however
> the example I've seen (cmatose.c) appears to create these separately for
> each peer. Is that what I should be doing too?

Cmatose is just a simple example program that I use for testing. If you're
trying to scale out to 10,000 nodes, you'll want to limit your resources. For
example, I've never been able to run cmatose with 10,000 connections without
running out of resources on my system.

Note that the IB CM does not implement a peer to peer connection model yet, so
you would need to establish your connections using the client/server model.

> 4. Are RDMA reads still deprecated? Which resources hit the spotlight if I
> chose to use them?

RDMA reads are fully supported. Not sure what led you to think that they were
deprecated.

> 6. Does Gen2 present substantially the same APIs as the kernel in userspace?
> So if I wrote a userspace equivalent of my kernel driver, could I have pure
> userspace clients talk to kernel servers?

Most of the APIs are similar. There shouldn't be any issues talking between
userspace clients and kernel servers.
- Sean

From Richard.Frank at oracle.com Thu Nov 10 10:56:53 2005
From: Richard.Frank at oracle.com (Rick Frank)
Date: Thu, 10 Nov 2005 13:56:53 -0500
Subject: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
References: <54AD0F12E08D1541B826BE97C98F99F1041618@NT-SJCA-0751.brcm.ad.broadcom.com>
Message-ID: <008a01c5e628$846c0840$6401a8c0@YOURA11C73D0FD>

Yes, this is the case.

----- Original Message -----
From: "Caitlin Bestler"
To:
Sent: Thursday, November 10, 2005 1:48 PM
Subject: RE: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB

Mike Krause wrote in response to Greg Lindahl:

> If it is to be reasonably robust, then RDS should be required to support
> the resync between the two sides of the communication. This aligns with the
> stated objective of implementing reliability in one location in software and
> one location in hardware. Without such resync being required in the ULP,
> then one ends up with a ULP that falls short of its stated objectives and
> pushes complexity back up to the application which is where the advocates
> have stated it is too complex or expensive to get it correct.

>> This sort of message service, by the way, has a long history in distributed computing.

> Yep.

I haven't reread all of RDS fine print to double-check this, but my
impression is that RDS semantics exactly match the subset of MPI
point-to-point communications where the receiving rank is required to
have pre-posted buffers before the send is allowed.

From mshefty at ichips.intel.com Thu Nov 10 11:12:26 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 10 Nov 2005 11:12:26 -0800
Subject: [openib-general] Lustre over OpenIB Gen2
In-Reply-To: <00b801c5e625$1c4ffdf0$0281a8c0@ebpc>
References: <00b801c5e625$1c4ffdf0$0281a8c0@ebpc>
Message-ID: <43739B9A.3050302@ichips.intel.com>

Eric Barton wrote:
> 5. Should I pre-map all physical memory and do RDMA in page-sized fragments?
> This avoids any mapping overhead at the expense of having much larger
> numbers of queued RDMAs. Since I try to keep up to 8 (by default) 1MByte
> RDMAs active concurrently to any individual peer, with 4k pages I can have
> up to 2048 RDMA work items queued at a time per peer.

This is 20 million outstanding RDMA work requests per node.

> And if I pre-map, can I be guaranteed that if I put the CQ into the error
> state, all remote access to my memory is revoked (e.g. could a CQ I create
> after I destroy the one I just shut down somehow alias with it such that a
> pathologically delayed RDMA could write my memory)?

I think that you mean QP into the error state. If the QP is in the error state,
then further access from a remote system should be impossible.

- Sean

From eeb at bartonsoftware.com Thu Nov 10 11:20:04 2005
From: eeb at bartonsoftware.com (Eric Barton)
Date: Thu, 10 Nov 2005 19:20:04 -0000
Subject: [openib-general] Lustre over OpenIB Gen2
In-Reply-To: <43739B9A.3050302@ichips.intel.com>
Message-ID: <00c401c5e62c$2392f430$0281a8c0@ebpc>

Yes, of course; I meant the QP.
Regarding the total number of outstanding RDMA work requests, I can keep a
separate cap on that, so if relatively few peers are active, I push the
maximum number of RDMAs at them, but if many peers are active the number of
active RDMAs per peer reduces.

However I guess this still means that CQ resources sufficient for the maximum
number of RDMAs I _could_ queue have to be allocated...

> -----Original Message-----
> From: Sean Hefty [mailto:mshefty at ichips.intel.com]
> Sent: Thursday, November 10, 2005 7:12 PM
> To: Eric Barton
> Cc: openib-general at openib.org
> Subject: Re: [openib-general] Lustre over OpenIB Gen2
>
>
> Eric Barton wrote:
> > 5. Should I pre-map all physical memory and do RDMA in page-sized fragments?
> > This avoids any mapping overhead at the expense of having much larger
> > numbers of queued RDMAs. Since I try to keep up to 8 (by default) 1MByte
> > RDMAs active concurrently to any individual peer, with 4k pages I can have
> > up to 2048 RDMA work items queued at a time per peer.
>
> This is 20 million outstanding RDMA work requests per node.
>
> > And if I pre-map, can I be guaranteed that if I put the CQ into the error
> > state, all remote access to my memory is revoked (e.g. could a CQ I create
> > after I destroy the one I just shut down somehow alias with it such that a
> > pathologically delayed RDMA could write my memory)?
>
> I think that you mean QP into the error state. If the QP is in the error state,
> then further access from a remote system should be impossible.
>
> - Sean
>

From rolandd at cisco.com Thu Nov 10 11:44:04 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Thu, 10 Nov 2005 11:44:04 -0800
Subject: [openib-general] Lustre over OpenIB Gen2
In-Reply-To: <00c401c5e62c$2392f430$0281a8c0@ebpc> (Eric Barton's message of "Thu, 10 Nov 2005 19:20:04 -0000")
References: <00c401c5e62c$2392f430$0281a8c0@ebpc>
Message-ID: <52fyq41e63.fsf@cisco.com>

    Eric> However I guess this still means that CQ resources
    Eric> sufficient for the maximum number of RDMAs I _could_ queue
    Eric> have to be allocated...

In general there will be a relatively low limit on the maximum CQ size. For
example, the maximum CQ size on Mellanox HCAs is ~128K entries.

 - R.

From arlin.r.davis at intel.com Thu Nov 10 11:48:08 2005
From: arlin.r.davis at intel.com (Arlin Davis)
Date: Thu, 10 Nov 2005 11:48:08 -0800
Subject: [openib-general] [PATCH] uDAPL free build issues cleaned up, print path records returned from uAT
Message-ID:

James,

I fixed some problems with the free build openib_scm version. Also turned down
some debugging and added some debug prints for uAT path records.
-arlin Signed-off by: Arlin Davis Index: dapl/openib/dapl_ib_cm.c =================================================================== --- dapl/openib/dapl_ib_cm.c (revision 3990) +++ dapl/openib/dapl_ib_cm.c (working copy) @@ -136,14 +136,27 @@ static void dapli_path_comp_handler(uint dapl_dbg_log(DAPL_DBG_TYPE_CM, " path_comp_handler: SRC GID subnet %016llx id %016llx\n", - (unsigned long long)cpu_to_be64(conn->dapl_rt.sgid.global.subnet_prefix), - (unsigned long long)cpu_to_be64(conn->dapl_rt.sgid.global.interface_id) ); + (unsigned long long)cpu_to_be64(conn->dapl_path.sgid.global.subnet_prefix), + (unsigned long long)cpu_to_be64(conn->dapl_path.sgid.global.interface_id) ); dapl_dbg_log(DAPL_DBG_TYPE_CM, " path_comp_handler: DST GID subnet %016llx id %016llx\n", - (unsigned long long)cpu_to_be64(conn->dapl_rt.dgid.global.subnet_prefix), - (unsigned long long)cpu_to_be64(conn->dapl_rt.dgid.global.interface_id) ); + (unsigned long long)cpu_to_be64(conn->dapl_path.dgid.global.subnet_prefix), + (unsigned long long)cpu_to_be64(conn->dapl_path.dgid.global.interface_id) ); + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " path_comp_handler: slid %x dlid %x mtu %x(%x) pktlife %x(%x)\n", + ntohs(conn->dapl_path.slid), ntohs(conn->dapl_path.dlid), + conn->dapl_path.mtu, conn->dapl_path.mtu_selector, + conn->dapl_path.packet_life_time, + conn->dapl_path.packet_life_time_selector ); + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " path_comp_handler: hops %x npaths %x pkey %x tclass %x rate %x(%x)\n", + conn->dapl_path.hop_limit, conn->dapl_path.numb_path, + conn->dapl_path.pkey, conn->dapl_path.traffic_class, + conn->dapl_path.rate, conn->dapl_path.rate_selector); + if (rec_num <= 0) { dapl_dbg_log(DAPL_DBG_TYPE_CM, " path_comp_handler: ERR %d retry %d\n", Index: dapl/openib_scm/dapl_ib_cm.c =================================================================== --- dapl/openib_scm/dapl_ib_cm.c (revision 3990) +++ dapl/openib_scm/dapl_ib_cm.c (working copy) @@ -285,7 +285,7 @@ dapli_socket_listen ( 
DAPL_IA *ia_ptr, if (( bind( cm_ptr->l_socket,(struct sockaddr*)&addr, sizeof(addr) ) < 0) || (listen( cm_ptr->l_socket, 128 ) < 0) ) { - dapl_dbg_log( DAPL_DBG_TYPE_ERR, + dapl_dbg_log( DAPL_DBG_TYPE_CM, " listen: ERROR %s on conn_qual 0x%x\n", strerror(errno),serviceID); @@ -313,7 +313,7 @@ dapli_socket_listen ( DAPL_IA *ia_ptr, return dat_status; bail: - dapl_dbg_log( DAPL_DBG_TYPE_ERR, + dapl_dbg_log( DAPL_DBG_TYPE_CM, " listen: ERROR on conn_qual 0x%x\n",serviceID); if ( cm_ptr->l_socket >= 0 ) close( cm_ptr->l_socket ); Index: dapl/openib_scm/dapl_ib_cq.c =================================================================== --- dapl/openib_scm/dapl_ib_cq.c (revision 3990) +++ dapl/openib_scm/dapl_ib_cq.c (working copy) @@ -569,7 +569,6 @@ dapls_ib_wait_object_wait ( { struct dapl_evd *evd_ptr; struct ibv_cq *ibv_cq = NULL; - void *ibv_ctx = NULL; int status = 0; int timeout_ms = -1; struct pollfd cq_fd = { @@ -602,7 +601,7 @@ dapls_ib_wait_object_wait ( dapl_dbg_log (DAPL_DBG_TYPE_CM, " cq_object_wait: RET evd %p ibv_cq %p ibv_ctx %p %s\n", - evd_ptr, ibv_cq,ibv_ctx,strerror(errno)); + evd_ptr, ibv_cq,strerror(errno)); return(dapl_convert_errno(status,"cq_wait_object_wait")); From rolandd at cisco.com Thu Nov 10 11:50:07 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 10 Nov 2005 11:50:07 -0800 Subject: [openib-general] Lustre over OpenIB Gen2 In-Reply-To: <00b801c5e625$1c4ffdf0$0281a8c0@ebpc> (Eric Barton's message of "Thu, 10 Nov 2005 18:32:30 -0000") References: <00b801c5e625$1c4ffdf0$0281a8c0@ebpc> Message-ID: <52br0s1dw0.fsf@cisco.com> Hi Eric... writing YAN (yet another NAL) I see :) Eric> 2. I'd like to scale to >= 10,000 peer nodes; 1 RC QP per Eric> peer. Is this going to get me into trouble? Eric> For example, I currently create a single PD and CQ for Eric> everything, however the example I've seen (cmatose.c) Eric> appears to create these separately for each peer. Is that Eric> what I should be doing too? I don't think you want 10K PDs. 
But having a single CQ big enough to handle 10K QPs might be a problem.

    Eric> 3. Is contiguous memory allocation an issue in Gen2? Since
    Eric> this is such a scarce resource in the kernel (and particular
    Eric> CQ usage with one vendor's stack relied heavily on it) what
    Eric> red flags should I be aware of?

There are still a few places where you can get in trouble (for example, with
the mthca driver, extremely large QP work queues might be a problem, because
the driver allocates contiguous memory for the array used to track work
request IDs -- not the work queues themselves though). But CQs in particular
should be fine.

    Eric> 4. Are RDMA reads still deprecated? Which resources hit the
    Eric> spotlight if I chose to use them?

I don't think RDMA reads were ever really deprecated. But RDMA writes
probably pipeline better.

    Eric> 5. Should I pre-map all physical memory and do RDMA in
    Eric> page-sized fragments? This avoids any mapping overhead at
    Eric> the expense of having much larger numbers of queued RDMAs.
    Eric> Since I try to keep up to 8 (by default) 1MByte RDMAs active
    Eric> concurrently to any individual peer, with 4k pages I can
    Eric> have up to 2048 RDMA work items queued at a time per peer.

    Eric> And if I pre-map, can I be guaranteed that if I put the
    Eric> CQ into the error state, all remote access to my memory is
    Eric> revoked (e.g. could a CQ I create after I destroy the one I
    Eric> just shut down somehow alias with it such that a
    Eric> pathologically delayed RDMA could write my memory)?

s/CQ/QP/ ... anyway, if you choose your receive queue sequence numbers
randomly, then the probability of a QP number/sequence number collision
allowing a stray RDMA is astronomically low (effectively 0).

    Eric> Or is it better to use FMR pools and take the map/unmap
    Eric> overhead? If so, is there a way to know when the unmap
    Eric> actually hits the hardware and my memory is safe?

FMRs are only supported on Mellanox HCAs at the moment.
But they do have some advantages, like allowing you to convert a bunch of pages into a single virtually contiguous region. You can use the ib_flush_fmr_pool() function to make sure that all unmapped FMRs are really and truly flushed, but that is a slow operation (since it incurs the penalty of flushing all in-flight operations in the HCA). Eric> 6. Does Gen2 present substantially the same APIs as the Eric> kernel in userspace? So if I wrote a userspace equivalent Eric> of my kernel driver, could I have pure userspace clients Eric> talk to kernel servers? Pretty much so, except of course userspace doesn't have access to physical memory or FMRs. - R. From rolandd at cisco.com Thu Nov 10 12:18:45 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 10 Nov 2005 12:18:45 -0800 Subject: [openib-general] [git pull] IB updates for 2.6.15 Message-ID: <52wtjgjly2.fsf@cisco.com> Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: rsync://rsync.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus The pull will get the following changes: Jack Morgenstein: [IB] mthca: report page size capability [IB] uverbs: have kernel return QP capabilities Michael S. 
Tsirkin: [IB] umad: two small fixes [IB] mthca: fix posting of atomic operations [IB] mthca: fix posting long lists of receive work requests Roland Dreier: [IPoIB] add path record information in debugfs [IB] umad: avoid potential deadlock when unregistering MAD agents [IPoIB] no need to set skb->dev right before freeing skb [IB] mthca: fix typo in catastrophic error polling [IB] Have cq_resize() method take an int, not int* [IB] umad: get rid of unused mr array [IB] mthca: fix wraparound handling in mthca_cq_clean() [IB] umad: further ib_unregister_mad_agent() deadlock fixes drivers/infiniband/core/user_mad.c | 129 ++++++++++------- drivers/infiniband/core/uverbs_cmd.c | 12 +- drivers/infiniband/core/verbs.c | 12 -- drivers/infiniband/hw/mthca/mthca_catas.c | 2 drivers/infiniband/hw/mthca/mthca_cmd.c | 2 drivers/infiniband/hw/mthca/mthca_cq.c | 16 +- drivers/infiniband/hw/mthca/mthca_dev.h | 2 drivers/infiniband/hw/mthca/mthca_main.c | 2 drivers/infiniband/hw/mthca/mthca_provider.c | 3 drivers/infiniband/hw/mthca/mthca_provider.h | 1 drivers/infiniband/hw/mthca/mthca_qp.c | 113 +++++++++++++-- drivers/infiniband/hw/mthca/mthca_srq.c | 22 +++ drivers/infiniband/hw/mthca/mthca_wqe.h | 3 drivers/infiniband/ulp/ipoib/ipoib.h | 15 +- drivers/infiniband/ulp/ipoib/ipoib_fs.c | 179 ++++++++++++++++++++---- drivers/infiniband/ulp/ipoib/ipoib_main.c | 72 +++++++++- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 26 +-- drivers/infiniband/ulp/ipoib/ipoib_vlan.c | 7 - include/rdma/ib_user_verbs.h | 9 + include/rdma/ib_verbs.h | 2 20 files changed, 466 insertions(+), 163 deletions(-) From sean.hefty at intel.com Thu Nov 10 12:37:48 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 10 Nov 2005 12:37:48 -0800 Subject: [openib-general] [PATCHv1] userspace CMA Message-ID: The following patch adds support for the userspace RDMA CMA. More specifically: - Adds common user/kernel data structures and copy routines in ib_uverbs And libibverbs. 
- Updates ib_ucm and libibcm to use common structures and copy routines. - Adds a new kernel module, rdma_ucm, to support userspace CMA. - Adds a userspace CMA library, librdmacm. - Adds a userspace CMA test program, ucmatose. - Updates the kernel CMA to support a userspace CMA library. - Adds support for user's to transition QP states. - Adds backlog parameter to rdma_listen API. (not used yet) Please respond with any feedback. Thanks. Signed-off-by: Sean Hefty Index: userspace/libibverbs/include/infiniband/sa.h =================================================================== --- userspace/libibverbs/include/infiniband/sa.h (revision 0) +++ userspace/libibverbs/include/infiniband/sa.h (revision 0) @@ -0,0 +1,130 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Voltaire, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: sa.h 2616 2005-06-15 15:22:39Z halr $ + */ + +#ifndef IB_SA_H +#define IB_SA_H + +#include + +enum ib_sa_rate { + IB_SA_RATE_2_5_GBPS = 2, + IB_SA_RATE_5_GBPS = 5, + IB_SA_RATE_10_GBPS = 3, + IB_SA_RATE_20_GBPS = 6, + IB_SA_RATE_30_GBPS = 4, + IB_SA_RATE_40_GBPS = 7, + IB_SA_RATE_60_GBPS = 8, + IB_SA_RATE_80_GBPS = 9, + IB_SA_RATE_120_GBPS = 10 +}; + +static inline int ib_sa_rate_enum_to_int(enum ib_sa_rate rate) +{ + switch (rate) { + case IB_SA_RATE_2_5_GBPS: return 1; + case IB_SA_RATE_5_GBPS: return 2; + case IB_SA_RATE_10_GBPS: return 4; + case IB_SA_RATE_20_GBPS: return 8; + case IB_SA_RATE_30_GBPS: return 12; + case IB_SA_RATE_40_GBPS: return 16; + case IB_SA_RATE_60_GBPS: return 24; + case IB_SA_RATE_80_GBPS: return 32; + case IB_SA_RATE_120_GBPS: return 48; + default: return -1; + } +} + +struct ib_sa_path_rec { + /* reserved */ + /* reserved */ + union ibv_gid dgid; + union ibv_gid sgid; + uint16_t dlid; + uint16_t slid; + int raw_traffic; + /* reserved */ + uint32_t flow_label; + uint8_t hop_limit; + uint8_t traffic_class; + int reversible; + uint8_t numb_path; + uint16_t pkey; + /* reserved */ + uint8_t sl; + uint8_t mtu_selector; + uint8_t mtu; + uint8_t rate_selector; + uint8_t rate; + uint8_t packet_life_time_selector; + uint8_t packet_life_time; + uint8_t preference; +}; + +struct ib_sa_mcmember_rec { + union ibv_gid mgid; + union ibv_gid port_gid; + uint32_t qkey; + uint16_t mlid; + uint8_t mtu_selector; + uint8_t mtu; + uint8_t traffic_class; + uint16_t pkey; + uint8_t rate_selector; + uint8_t rate; + uint8_t packet_life_time_selector; + uint8_t packet_life_time; + uint8_t sl; + uint32_t flow_label; + uint8_t hop_limit; + uint8_t scope; + uint8_t join_state; + int 
proxy_join; +}; + +struct ib_sa_service_rec { + uint64_t id; + union ibv_gid gid; + uint16_t pkey; + /* uint16_t resv; */ + uint32_t lease; + uint8_t key[16]; + uint8_t name[64]; + uint8_t data8[16]; + uint16_t data16[8]; + uint32_t data32[4]; + uint64_t data64[2]; +}; + +#endif /* IB_SA_H */ Index: userspace/libibverbs/include/infiniband/marshall.h =================================================================== --- userspace/libibverbs/include/infiniband/marshall.h (revision 0) +++ userspace/libibverbs/include/infiniband/marshall.h (revision 0) @@ -0,0 +1,62 @@ +/* + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef INFINIBAND_MARSHALL_H +#define INFINIBAND_MARSHALL_H + +#include +#include +#include +#include + +#ifdef __cplusplus +# define BEGIN_C_DECLS extern "C" { +# define END_C_DECLS } +#else /* !__cplusplus */ +# define BEGIN_C_DECLS +# define END_C_DECLS +#endif /* __cplusplus */ + +BEGIN_C_DECLS + +void ib_copy_qp_attr_from_kern(struct ibv_qp_attr *dst, + struct ibv_kern_qp_attr *src); + +void ib_copy_path_rec_from_kern(struct ib_sa_path_rec *dst, + struct ib_kern_path_rec *src); + +void ib_copy_path_rec_to_kern(struct ib_kern_path_rec *dst, + struct ib_sa_path_rec *src); + +END_C_DECLS + +#endif /* INFINIBAND_MARSHALL_H */ Index: userspace/libibverbs/include/infiniband/kern-abi.h =================================================================== --- userspace/libibverbs/include/infiniband/kern-abi.h (revision 4017) +++ userspace/libibverbs/include/infiniband/kern-abi.h (working copy) @@ -357,6 +357,64 @@ __u32 async_events_reported; }; +struct ibv_kern_global_route { + __u8 dgid[16]; + __u32 flow_label; + __u8 sgid_index; + __u8 hop_limit; + __u8 traffic_class; + __u8 reserved; +}; + +struct ibv_kern_ah_attr { + struct ibv_kern_global_route grh; + __u16 dlid; + __u8 sl; + __u8 src_path_bits; + __u8 static_rate; + __u8 is_global; + __u8 port_num; + __u8 reserved; +}; + +struct ibv_kern_qp_attr { + __u32 qp_attr_mask; + __u32 qp_state; + __u32 cur_qp_state; + __u32 path_mtu; + __u32 path_mig_state; + __u32 qkey; + __u32 rq_psn; + __u32 sq_psn; + __u32 dest_qp_num; + __u32 qp_access_flags; + + struct ibv_kern_ah_attr ah_attr; + struct ibv_kern_ah_attr alt_ah_attr; + + /* ib_qp_cap */ + __u32 max_send_wr; + __u32 max_recv_wr; + __u32 max_send_sge; + __u32 max_recv_sge; + __u32 max_inline_data; + + __u16 pkey_index; + __u16 alt_pkey_index; + __u8 en_sqd_async_notify; + __u8 sq_draining; + __u8 max_rd_atomic; + __u8 max_dest_rd_atomic; + __u8 min_rnr_timer; + __u8 port_num; + __u8 timeout; + __u8 retry_cnt; + __u8 rnr_retry; + __u8 alt_port_num; + __u8 
alt_timeout; + __u8 reserved[5]; +}; + struct ibv_create_qp { __u32 command; __u16 in_words; @@ -532,26 +590,6 @@ __u32 bad_wr; }; -struct ibv_kern_global_route { - __u8 dgid[16]; - __u32 flow_label; - __u8 sgid_index; - __u8 hop_limit; - __u8 traffic_class; - __u8 reserved; -}; - -struct ibv_kern_ah_attr { - struct ibv_kern_global_route grh; - __u16 dlid; - __u8 sl; - __u8 src_path_bits; - __u8 static_rate; - __u8 is_global; - __u8 port_num; - __u8 reserved; -}; - struct ibv_create_ah { __u32 command; __u16 in_words; Index: userspace/libibverbs/src/libibverbs.map =================================================================== --- userspace/libibverbs/src/libibverbs.map (revision 4017) +++ userspace/libibverbs/src/libibverbs.map (working copy) @@ -57,5 +57,8 @@ ibv_cmd_destroy_ah; ibv_cmd_attach_mcast; ibv_cmd_detach_mcast; + ib_copy_qp_attr_from_kern; + ib_copy_path_rec_from_kern; + ib_copy_path_rec_to_kern; local: *; }; Index: userspace/libibverbs/src/marshall.c =================================================================== --- userspace/libibverbs/src/marshall.c (revision 0) +++ userspace/libibverbs/src/marshall.c (revision 0) @@ -0,0 +1,140 @@ +/* + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include + +static void ib_copy_ah_attr_from_kern(struct ibv_ah_attr *dst, + struct ibv_kern_ah_attr *src) +{ + memcpy(dst->grh.dgid.raw, src->grh.dgid, sizeof dst->grh.dgid); + dst->grh.flow_label = src->grh.flow_label; + dst->grh.sgid_index = src->grh.sgid_index; + dst->grh.hop_limit = src->grh.hop_limit; + dst->grh.traffic_class = src->grh.traffic_class; + + dst->dlid = src->dlid; + dst->sl = src->sl; + dst->src_path_bits = src->src_path_bits; + dst->static_rate = src->static_rate; + dst->is_global = src->is_global; + dst->port_num = src->port_num; +} + +void ib_copy_qp_attr_from_kern(struct ibv_qp_attr *dst, + struct ibv_kern_qp_attr *src) +{ + dst->cur_qp_state = src->cur_qp_state; + dst->path_mtu = src->path_mtu; + dst->path_mig_state = src->path_mig_state; + dst->qkey = src->qkey; + dst->rq_psn = src->rq_psn; + dst->sq_psn = src->sq_psn; + dst->dest_qp_num = src->dest_qp_num; + dst->qp_access_flags = src->qp_access_flags; + + dst->cap.max_send_wr = src->max_send_wr; + dst->cap.max_recv_wr = src->max_recv_wr; + dst->cap.max_send_sge = src->max_send_sge; + dst->cap.max_recv_sge = src->max_recv_sge; + dst->cap.max_inline_data = src->max_inline_data; + + 
ib_copy_ah_attr_from_kern(&dst->ah_attr, &src->ah_attr); + ib_copy_ah_attr_from_kern(&dst->alt_ah_attr, &src->alt_ah_attr); + + dst->pkey_index = src->pkey_index; + dst->alt_pkey_index = src->alt_pkey_index; + dst->en_sqd_async_notify = src->en_sqd_async_notify; + dst->sq_draining = src->sq_draining; + dst->max_rd_atomic = src->max_rd_atomic; + dst->max_dest_rd_atomic = src->max_dest_rd_atomic; + dst->min_rnr_timer = src->min_rnr_timer; + dst->port_num = src->port_num; + dst->timeout = src->timeout; + dst->retry_cnt = src->retry_cnt; + dst->rnr_retry = src->rnr_retry; + dst->alt_port_num = src->alt_port_num; + dst->alt_timeout = src->alt_timeout; +} + +void ib_copy_path_rec_from_kern(struct ib_sa_path_rec *dst, + struct ib_kern_path_rec *src) +{ + memcpy(dst->dgid.raw, src->dgid, sizeof dst->dgid); + memcpy(dst->sgid.raw, src->sgid, sizeof dst->sgid); + + dst->dlid = src->dlid; + dst->slid = src->slid; + dst->raw_traffic = src->raw_traffic; + dst->flow_label = src->flow_label; + dst->hop_limit = src->hop_limit; + dst->traffic_class = src->traffic_class; + dst->reversible = src->reversible; + dst->numb_path = src->numb_path; + dst->pkey = src->pkey; + dst->sl = src->sl; + dst->mtu_selector = src->mtu_selector; + dst->mtu = src->mtu; + dst->rate_selector = src->rate_selector; + dst->rate = src->rate; + dst->packet_life_time = src->packet_life_time; + dst->preference = src->preference; + dst->packet_life_time_selector = src->packet_life_time_selector; +} + +void ib_copy_path_rec_to_kern(struct ib_kern_path_rec *dst, + struct ib_sa_path_rec *src) +{ + memcpy(dst->dgid, src->dgid.raw, sizeof src->dgid); + memcpy(dst->sgid, src->sgid.raw, sizeof src->sgid); + + dst->dlid = src->dlid; + dst->slid = src->slid; + dst->raw_traffic = src->raw_traffic; + dst->flow_label = src->flow_label; + dst->hop_limit = src->hop_limit; + dst->traffic_class = src->traffic_class; + dst->reversible = src->reversible; + dst->numb_path = src->numb_path; + dst->pkey = src->pkey; + dst->sl = 
src->sl; + dst->mtu_selector = src->mtu_selector; + dst->mtu = src->mtu; + dst->rate_selector = src->rate_selector; + dst->rate = src->rate; + dst->packet_life_time = src->packet_life_time; + dst->preference = src->preference; + dst->packet_life_time_selector = src->packet_life_time_selector; +} Index: userspace/libibverbs/Makefile.am =================================================================== --- userspace/libibverbs/Makefile.am (revision 4017) +++ userspace/libibverbs/Makefile.am (working copy) @@ -14,7 +14,8 @@ libibverbs_version_script = endif -src_libibverbs_la_SOURCES = src/cmd.c src/device.c src/init.c src/memory.c src/verbs.c +src_libibverbs_la_SOURCES = src/cmd.c src/device.c src/init.c src/marshall.c \ + src/memory.c src/verbs.c src_libibverbs_la_LDFLAGS = -version-info 1 -export-dynamic \ $(libibverbs_version_script) src_libibverbs_la_DEPENDENCIES = $(srcdir)/src/libibverbs.map @@ -40,7 +41,8 @@ libibverbsincludedir = $(includedir)/infiniband libibverbsinclude_HEADERS = include/infiniband/arch.h include/infiniband/driver.h \ - include/infiniband/kern-abi.h include/infiniband/opcode.h include/infiniband/verbs.h + include/infiniband/kern-abi.h include/infiniband/opcode.h include/infiniband/verbs.h \ + include/infiniband/sa_kern-abi.h include/infiniband/sa.h include/infiniband/marshall.h man_MANS = man/ibv_asyncwatch.1 man/ibv_devices.1 man/ibv_devinfo.1 \ man/ibv_rc_pingpong.1 man/ibv_uc_pingpong.1 man/ibv_ud_pingpong.1 \ @@ -53,6 +55,8 @@ EXTRA_DIST = include/infiniband/driver.h include/infiniband/kern-abi.h \ include/infiniband/opcode.h include/infiniband/verbs.h src/ibverbs.h \ + include/infiniband/marshall.h include/infiniband/sa_kern-abi.h \ + include/infiniband/sa.h \ src/libibverbs.map libibverbs.spec.in $(man_MANS) $(DEBIAN) dist-hook: libibverbs.spec Index: userspace/librdmacm/include/rdma/rdma_cma_abi.h =================================================================== --- userspace/librdmacm/include/rdma/rdma_cma_abi.h (revision 0) +++
userspace/librdmacm/include/rdma/rdma_cma_abi.h (revision 0) @@ -0,0 +1,186 @@ +/* + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef RDMA_CMA_ABI_H +#define RDMA_CMA_ABI_H + +#include + +/* + * This file must be kept in sync with the kernel's version of rdma_user_cm.h + */ + +#define RDMA_USER_CM_MIN_ABI_VERSION 1 +#define RDMA_USER_CM_MAX_ABI_VERSION 1 + +#define RDMA_MAX_PRIVATE_DATA 256 + +enum { + UCMA_CMD_CREATE_ID, + UCMA_CMD_DESTROY_ID, + UCMA_CMD_BIND_ADDR, + UCMA_CMD_RESOLVE_ADDR, + UCMA_CMD_RESOLVE_ROUTE, + UCMA_CMD_QUERY_ROUTE, + UCMA_CMD_CONNECT, + UCMA_CMD_LISTEN, + UCMA_CMD_ACCEPT, + UCMA_CMD_REJECT, + UCMA_CMD_DISCONNECT, + UCMA_CMD_INIT_QP_ATTR, + UCMA_CMD_GET_EVENT +}; + +struct ucma_abi_cmd_hdr { + __u32 cmd; + __u16 in; + __u16 out; +}; + +struct ucma_abi_create_id { + __u64 uid; + __u64 response; +}; + +struct ucma_abi_create_id_resp { + __u32 id; +}; + +struct ucma_abi_destroy_id { + __u64 response; + __u32 id; + __u32 reserved; +}; + +struct ucma_abi_destroy_id_resp { + __u32 events_reported; +}; + +struct ucma_abi_bind_addr { + __u64 response; + struct sockaddr_in6 addr; + __u32 id; +}; + +struct ucma_abi_bind_addr_resp { + __u64 node_guid; +}; + +struct ucma_abi_resolve_addr { + struct sockaddr_in6 src_addr; + struct sockaddr_in6 dst_addr; + __u32 id; + __u32 timeout_ms; +}; + +struct ucma_abi_resolve_route { + __u32 id; + __u32 timeout_ms; +}; + +struct ucma_abi_query_route { + __u64 response; + __u32 id; + __u32 reserved; +}; + +struct ucma_abi_query_route_resp { + __u64 node_guid; + struct ib_kern_path_rec ib_route[2]; + struct sockaddr_in6 src_addr; + __u32 num_paths; +}; + +struct ucma_abi_conn_param { + __u32 qp_num; + __u32 qp_type; + __u8 private_data[RDMA_MAX_PRIVATE_DATA]; + __u8 private_data_len; + __u8 srq; + __u8 responder_resources; + __u8 initiator_depth; + __u8 flow_control; + __u8 retry_count; + __u8 rnr_retry_count; + __u8 valid; +}; + +struct ucma_abi_connect { + struct ucma_abi_conn_param conn_param; + __u32 id; + __u32 reserved; +}; + +struct ucma_abi_listen { + __u32 id; + __u32 backlog; +}; + +struct ucma_abi_accept { + __u64 uid; + 
struct ucma_abi_conn_param conn_param; + __u32 id; + __u32 reserved; +}; + +struct ucma_abi_reject { + __u32 id; + __u8 private_data_len; + __u8 reserved[3]; + __u8 private_data[RDMA_MAX_PRIVATE_DATA]; +}; + +struct ucma_abi_disconnect { + __u32 id; +}; + +struct ucma_abi_init_qp_attr { + __u64 response; + __u32 id; + __u32 qp_state; +}; + +struct ucma_abi_get_event { + __u64 response; +}; + +struct ucma_abi_event_resp { + __u64 uid; + __u32 id; + __u32 event; + __u32 status; + __u8 private_data_len; + __u8 reserved[3]; + __u8 private_data[RDMA_MAX_PRIVATE_DATA]; +}; + +#endif /* RDMA_CMA_ABI_H */ Index: userspace/librdmacm/include/rdma/rdma_cma.h =================================================================== --- userspace/librdmacm/include/rdma/rdma_cma.h (revision 0) +++ userspace/librdmacm/include/rdma/rdma_cma.h (revision 0) @@ -0,0 +1,219 @@ +/* + * Copyright (c) 2005 Voltaire Inc. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. 
+ * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + * + */ + +#if !defined(RDMA_CMA_H) +#define RDMA_CMA_H + +#include +#include +#include +#include + +/* + * Upon receiving a device removal event, users must destroy the associated + * RDMA identifier and release all resources allocated with the device. + */ +enum rdma_cm_event_type { + RDMA_CM_EVENT_ADDR_RESOLVED, + RDMA_CM_EVENT_ADDR_ERROR, + RDMA_CM_EVENT_ROUTE_RESOLVED, + RDMA_CM_EVENT_ROUTE_ERROR, + RDMA_CM_EVENT_CONNECT_REQUEST, + RDMA_CM_EVENT_CONNECT_RESPONSE, + RDMA_CM_EVENT_CONNECT_ERROR, + RDMA_CM_EVENT_UNREACHABLE, + RDMA_CM_EVENT_REJECTED, + RDMA_CM_EVENT_ESTABLISHED, + RDMA_CM_EVENT_DISCONNECTED, + RDMA_CM_EVENT_DEVICE_REMOVAL, +}; + +struct ib_addr { + union ibv_gid sgid; + union ibv_gid dgid; + uint16_t pkey; +}; + +struct rdma_addr { + struct sockaddr_in6 src_addr; + struct sockaddr_in6 dst_addr; + union { + struct ib_addr ibaddr; + } addr; +}; + +struct rdma_route { + struct rdma_addr addr; + struct ib_sa_path_rec *path_rec; + int num_paths; +}; + +struct rdma_cm_id { + struct ibv_context *verbs; + void *context; + struct ibv_qp *qp; + struct rdma_route route; +}; + +struct rdma_cm_event { + struct rdma_cm_id *id; + struct rdma_cm_id *listen_id; + enum rdma_cm_event_type event; + int status; + void *private_data; + uint8_t private_data_len; +}; + +int rdma_create_id(struct rdma_cm_id **id, void *context); + +int rdma_destroy_id(struct rdma_cm_id *id); + +/** + * rdma_bind_addr - Bind an RDMA identifier to a source address and + * associated RDMA device, if needed. + * + * @id: RDMA identifier. + * @addr: Local address information. Wildcard values are permitted. + * + * This associates a source address with the RDMA identifier before calling + * rdma_listen. 
If a specific local address is given, the RDMA identifier will + * be bound to a local RDMA device. + */ +int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr); + +/** + * rdma_resolve_addr - Resolve destination and optional source addresses + * from IP addresses to an RDMA address. If successful, the specified + * rdma_cm_id will be bound to a local device. + * + * @id: RDMA identifier. + * @src_addr: Source address information. This parameter may be NULL. + * @dst_addr: Destination address information. + * @timeout_ms: Time to wait for resolution to complete. + */ +int rdma_resolve_addr(struct rdma_cm_id *id, struct sockaddr *src_addr, + struct sockaddr *dst_addr, int timeout_ms); + +/** + * rdma_resolve_route - Resolve the RDMA address bound to the RDMA identifier + * into route information needed to establish a connection. + * + * This is called on the client side of a connection. + * Users must have first called rdma_resolve_addr to resolve a dst_addr + * into an RDMA address before calling this routine. + */ +int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms); + +/** + * rdma_create_qp - Allocate a QP and associate it with the specified RDMA + * identifier. + * + * QPs allocated to an rdma_cm_id will automatically be transitioned by the CMA + * through their states. + */ +int rdma_create_qp(struct rdma_cm_id *id, struct ibv_pd *pd, + struct ibv_qp_init_attr *qp_init_attr); + +/** + * rdma_destroy_qp - Deallocate the QP associated with the specified RDMA + * identifier. + * + * Users must destroy any QP associated with an RDMA identifier before + * destroying the RDMA ID. + */ +void rdma_destroy_qp(struct rdma_cm_id *id); + +struct rdma_conn_param { + const void *private_data; + uint8_t private_data_len; + uint8_t responder_resources; + uint8_t initiator_depth; + uint8_t flow_control; + uint8_t retry_count; /* ignored when accepting */ + uint8_t rnr_retry_count; +}; + +/** + * rdma_connect - Initiate an active connection request. 
+ * + * Users must have resolved a route for the rdma_cm_id to connect with + * by having called rdma_resolve_route before calling this routine. + */ +int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param); + +/** + * rdma_listen - This function is called by the passive side to + * listen for incoming connection requests. + * + * Users must have bound the rdma_cm_id to a local address by calling + * rdma_bind_addr before calling this routine. + */ +int rdma_listen(struct rdma_cm_id *id, int backlog); + +/** + * rdma_accept - Called to accept a connection request. + * @id: Connection identifier associated with the request. + * @conn_param: Information needed to establish the connection. + */ +int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param); + +/** + * rdma_reject - Called on the passive side to reject a connection request. + */ +int rdma_reject(struct rdma_cm_id *id, const void *private_data, + uint8_t private_data_len); + +/** + * rdma_disconnect - This function disconnects the associated QP. + */ +int rdma_disconnect(struct rdma_cm_id *id); + +/** + * rdma_get_cm_event - Retrieves the next pending communication event, + * or, if no event is pending, waits for one. + * @event: Allocated information about the next communication event. + * The event should be freed using rdma_ack_cm_event(). + * + * An RDMA_CM_EVENT_CONNECT_REQUEST communication event results + * in the allocation of a new @rdma_cm_id. + * Clients are responsible for destroying the new @rdma_cm_id. + */ +int rdma_get_cm_event(struct rdma_cm_event **event); + +/** + * rdma_ack_cm_event - Free a communication event. + * @event: Event to be released. + * + * All events that are allocated by rdma_get_cm_event() must be released; + * there should be a one-to-one correspondence between successful gets + * and acks. 
+ */ +int rdma_ack_cm_event(struct rdma_cm_event *event); + +#endif /* RDMA_CMA_H */ Index: userspace/librdmacm/AUTHORS =================================================================== --- userspace/librdmacm/AUTHORS (revision 0) +++ userspace/librdmacm/AUTHORS (revision 0) @@ -0,0 +1 @@ +Sean Hefty Index: userspace/librdmacm/configure.in =================================================================== --- userspace/librdmacm/configure.in (revision 0) +++ userspace/librdmacm/configure.in (revision 0) @@ -0,0 +1,50 @@ +dnl Process this file with autoconf to produce a configure script. + +AC_PREREQ(2.57) +AC_INIT(librdmacm, 0.9.0, openib-general at openib.org) +AC_CONFIG_SRCDIR([src/cma.c]) +AC_CONFIG_AUX_DIR(config) +AM_CONFIG_HEADER(config.h) +AM_INIT_AUTOMAKE(librdmacm, 0.9.0) +AC_DISABLE_STATIC +AM_PROG_LIBTOOL + +AC_ARG_ENABLE(libcheck, [ --disable-libcheck do not test for presence of ib libraries], +[ if test x$enableval = xno ; then + disable_libcheck=yes + fi +]) + +dnl Checks for programs +AC_PROG_CC + +dnl Checks for typedefs, structures, and compiler characteristics. +AC_C_CONST +AC_CHECK_SIZEOF(long) + +dnl Checks for libraries +if test "$disable_libcheck" != "yes" +then +AC_CHECK_LIB(ibverbs, ibv_get_devices, [], + AC_MSG_ERROR([ibv_get_devices() not found. librdmacm requires libibverbs.])) +fi + +dnl Checks for header files. +if test "$disable_libcheck" != "yes" +then +AC_CHECK_HEADER(infiniband/verbs.h, [], + AC_MSG_ERROR([ not found. 
Is libibverbs installed?])) +fi +AC_HEADER_STDC + +AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script, + if test -n "`$LD --help < /dev/null 2>/dev/null | grep version-script`"; then + ac_cv_version_script=yes + else + ac_cv_version_script=no + fi) + +AM_CONDITIONAL(HAVE_LD_VERSION_SCRIPT, test "$ac_cv_version_script" = "yes") + +AC_CONFIG_FILES([Makefile librdmacm.spec]) +AC_OUTPUT Index: userspace/librdmacm/INSTALL =================================================================== Index: userspace/librdmacm/src/cma.c =================================================================== --- userspace/librdmacm/src/cma.c (revision 0) +++ userspace/librdmacm/src/cma.c (revision 0) @@ -0,0 +1,887 @@ +/* + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: cm.c 3453 2005-09-15 21:43:21Z sean.hefty $ + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#define PFX "librdmacm: " + +#if __BYTE_ORDER == __LITTLE_ENDIAN +static inline uint64_t htonll(uint64_t x) { return bswap_64(x); } +static inline uint64_t ntohll(uint64_t x) { return bswap_64(x); } +#else +static inline uint64_t htonll(uint64_t x) { return x; } +static inline uint64_t ntohll(uint64_t x) { return x; } +#endif + +#define CMA_CREATE_MSG_CMD_RESP(msg, cmd, resp, type, size) \ +do { \ + struct ucma_abi_cmd_hdr *hdr; \ + \ + size = sizeof(*hdr) + sizeof(*cmd); \ + msg = alloca(size); \ + if (!msg) \ + return -ENOMEM; \ + hdr = msg; \ + cmd = msg + sizeof(*hdr); \ + hdr->cmd = type; \ + hdr->in = sizeof(*cmd); \ + hdr->out = sizeof(*resp); \ + memset(cmd, 0, sizeof(*cmd)); \ + resp = alloca(sizeof(*resp)); \ + if (!resp) \ + return -ENOMEM; \ + cmd->response = (uintptr_t)resp;\ +} while (0) + +#define CMA_CREATE_MSG_CMD(msg, cmd, type, size) \ +do { \ + struct ucma_abi_cmd_hdr *hdr; \ + \ + size = sizeof(*hdr) + sizeof(*cmd); \ + msg = alloca(size); \ + if (!msg) \ + return -ENOMEM; \ + hdr = msg; \ + cmd = msg + sizeof(*hdr); \ + hdr->cmd = type; \ + hdr->in = sizeof(*cmd); \ + hdr->out = 0; \ + memset(cmd, 0, sizeof(*cmd)); \ +} while (0) + +struct cma_device { + struct ibv_context *verbs; + uint64_t guid; + int port_cnt; +}; + +struct cma_id_private { + struct rdma_cm_id id; + struct cma_device *cma_dev; + int events_completed; + pthread_cond_t cond; + pthread_mutex_t mut; + uint32_t handle; +}; + +static 
struct dlist *dev_list; +static struct dlist *cma_dev_list; +int cma_fd; + +#define container_of(ptr, type, field) \ + ((type *) ((void *)ptr - offsetof(type, field))) + +static void __attribute__((constructor)) rdma_cma_init(void) +{ + struct ibv_device *dev; + struct cma_device *cma_dev; + struct ibv_device_attr attr; + int ret; + + cma_fd = open("/dev/infiniband/rdma_cm", O_RDWR); + if (cma_fd < 0) + abort(); + + cma_dev_list = dlist_new(sizeof *cma_dev); + dev_list = ibv_get_devices(); + if (!cma_dev_list || !dev_list) + abort(); + + dlist_for_each_data(dev_list, dev, struct ibv_device) { + cma_dev = malloc(sizeof *cma_dev); + if (!cma_dev) + abort(); + + cma_dev->guid = ibv_get_device_guid(dev); + cma_dev->verbs = ibv_open_device(dev); + if (!cma_dev->verbs) + abort(); + + ret = ibv_query_device(cma_dev->verbs, &attr); + if (ret) + abort(); + + cma_dev->port_cnt = attr.phys_port_cnt; + dlist_push(cma_dev_list, cma_dev); + } +} + +static void __attribute__((destructor)) rdma_cma_fini(void) +{ + struct cma_device *cma_dev; + + if (!cma_dev_list) + return; + + dlist_for_each_data(cma_dev_list, cma_dev, struct cma_device) + ibv_close_device(cma_dev->verbs); + + dlist_destroy(cma_dev_list); + close(cma_fd); +} + +static int ucma_get_device(struct cma_id_private *id_priv, uint64_t guid) +{ + struct cma_device *cma_dev; + + dlist_for_each_data(cma_dev_list, cma_dev, struct cma_device) + if (cma_dev->guid == guid) { + id_priv->cma_dev = cma_dev; + id_priv->id.verbs = cma_dev->verbs; + return 0; + } + + return -ENODEV; +} + +static void ucma_free_id(struct cma_id_private *id_priv) +{ + pthread_cond_destroy(&id_priv->cond); + pthread_mutex_destroy(&id_priv->mut); + if (id_priv->id.route.path_rec) + free(id_priv->id.route.path_rec); + free(id_priv); +} + +static struct cma_id_private *ucma_alloc_id(void *context) +{ + struct cma_id_private *id_priv; + + id_priv = malloc(sizeof *id_priv); + if (!id_priv) + return NULL; + + memset(id_priv, 0, sizeof *id_priv); + 
id_priv->id.context = context; + pthread_mutex_init(&id_priv->mut, NULL); + if (pthread_cond_init(&id_priv->cond, NULL)) + goto err; + + return id_priv; + +err: ucma_free_id(id_priv); + return NULL; +} + +int rdma_create_id(struct rdma_cm_id **id, void *context) +{ + struct ucma_abi_create_id_resp *resp; + struct ucma_abi_create_id *cmd; + struct cma_id_private *id_priv; + void *msg; + int ret, size; + + id_priv = ucma_alloc_id(context); + if (!id_priv) + return -ENOMEM; + + CMA_CREATE_MSG_CMD_RESP(msg, cmd, resp, UCMA_CMD_CREATE_ID, size); + cmd->uid = (uintptr_t) id_priv; + + ret = write(cma_fd, msg, size); + if (ret != size) + goto err; + + id_priv->handle = resp->id; + *id = &id_priv->id; + return 0; + +err: ucma_free_id(id_priv); + return (ret >= 0) ? -ENODATA : ret; +} + +static int ucma_destroy_kern_id(uint32_t handle) +{ + struct ucma_abi_destroy_id_resp *resp; + struct ucma_abi_destroy_id *cmd; + void *msg; + int ret, size; + + CMA_CREATE_MSG_CMD_RESP(msg, cmd, resp, UCMA_CMD_DESTROY_ID, size); + cmd->id = handle; + + ret = write(cma_fd, msg, size); + if (ret != size) + return (ret > 0) ? 
-ENODATA : ret; + + return resp->events_reported; +} + +int rdma_destroy_id(struct rdma_cm_id *id) +{ + struct cma_id_private *id_priv; + int ret; + + id_priv = container_of(id, struct cma_id_private, id); + ret = ucma_destroy_kern_id(id_priv->handle); + if (ret < 0) + return ret; + + pthread_mutex_lock(&id_priv->mut); + while (id_priv->events_completed < ret) + pthread_cond_wait(&id_priv->cond, &id_priv->mut); + pthread_mutex_unlock(&id_priv->mut); + + ucma_free_id(id_priv); + return 0; +} + +static int ucma_addrlen(struct sockaddr *addr) +{ + if (!addr) + return 0; + + switch (addr->sa_family) { + case PF_INET: + return sizeof(struct sockaddr_in); + case PF_INET6: + return sizeof(struct sockaddr_in6); + default: + return 0; + } +} + +int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr) +{ + struct ucma_abi_bind_addr_resp *resp; + struct ucma_abi_bind_addr *cmd; + struct cma_id_private *id_priv; + void *msg; + int ret, size, addrlen; + + addrlen = ucma_addrlen(addr); + if (!addrlen) + return -EINVAL; + + CMA_CREATE_MSG_CMD_RESP(msg, cmd, resp, UCMA_CMD_BIND_ADDR, size); + id_priv = container_of(id, struct cma_id_private, id); + cmd->id = id_priv->handle; + memcpy(&cmd->addr, addr, addrlen); + + ret = write(cma_fd, msg, size); + if (ret != size) + return (ret > 0) ? 
-ENODATA : ret; + + if (resp->node_guid) { + ret = ucma_get_device(id_priv, resp->node_guid); + if (ret) + return ret; + } + + memcpy(&id->route.addr.src_addr, addr, addrlen); + return 0; +} + +int rdma_resolve_addr(struct rdma_cm_id *id, struct sockaddr *src_addr, + struct sockaddr *dst_addr, int timeout_ms) +{ + struct ucma_abi_resolve_addr *cmd; + struct cma_id_private *id_priv; + void *msg; + int ret, size, daddrlen; + + daddrlen = ucma_addrlen(dst_addr); + if (!daddrlen) + return -EINVAL; + + CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_RESOLVE_ADDR, size); + id_priv = container_of(id, struct cma_id_private, id); + cmd->id = id_priv->handle; + if (src_addr) + memcpy(&cmd->src_addr, src_addr, ucma_addrlen(src_addr)); + memcpy(&cmd->dst_addr, dst_addr, daddrlen); + cmd->timeout_ms = timeout_ms; + + ret = write(cma_fd, msg, size); + if (ret != size) + return (ret > 0) ? -ENODATA : ret; + + memcpy(&id->route.addr.dst_addr, dst_addr, daddrlen); + return 0; +} + +int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms) +{ + struct ucma_abi_resolve_route *cmd; + struct cma_id_private *id_priv; + void *msg; + int ret, size; + + CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_RESOLVE_ROUTE, size); + id_priv = container_of(id, struct cma_id_private, id); + cmd->id = id_priv->handle; + cmd->timeout_ms = timeout_ms; + + ret = write(cma_fd, msg, size); + if (ret != size) + return (ret > 0) ? -ENODATA : ret; + + return 0; +} + +static int rdma_init_qp_attr(struct rdma_cm_id *id, struct ibv_qp_attr *qp_attr, + int *qp_attr_mask) +{ + struct ucma_abi_init_qp_attr *cmd; + struct ibv_kern_qp_attr *resp; + struct cma_id_private *id_priv; + void *msg; + int ret, size; + + CMA_CREATE_MSG_CMD_RESP(msg, cmd, resp, UCMA_CMD_INIT_QP_ATTR, size); + id_priv = container_of(id, struct cma_id_private, id); + cmd->id = id_priv->handle; + cmd->qp_state = qp_attr->qp_state; + + ret = write(cma_fd, msg, size); + if (ret != size) + return (ret > 0) ? 
-ENODATA : ret; + + ib_copy_qp_attr_from_kern(qp_attr, resp); + *qp_attr_mask = resp->qp_attr_mask; + return 0; +} + +static int ucma_modify_qp_rtr(struct rdma_cm_id *id) +{ + struct ibv_qp_attr qp_attr; + int qp_attr_mask, ret; + + if (!id->qp) + return -EINVAL; + + /* Need to update QP attributes from default values. */ + qp_attr.qp_state = IBV_QPS_INIT; + ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask); + if (ret) + return ret; + + ret = ibv_modify_qp(id->qp, &qp_attr, qp_attr_mask); + if (ret) + return ret; + + qp_attr.qp_state = IBV_QPS_RTR; + ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask); + if (ret) + return ret; + + return ibv_modify_qp(id->qp, &qp_attr, qp_attr_mask); +} + +static int ucma_modify_qp_rts(struct rdma_cm_id *id) +{ + struct ibv_qp_attr qp_attr; + int qp_attr_mask, ret; + + qp_attr.qp_state = IBV_QPS_RTS; + ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask); + if (ret) + return ret; + + return ibv_modify_qp(id->qp, &qp_attr, qp_attr_mask); +} + +static int ucma_modify_qp_err(struct rdma_cm_id *id) +{ + struct ibv_qp_attr qp_attr; + + if (!id->qp) + return 0; + + qp_attr.qp_state = IBV_QPS_ERR; + return ibv_modify_qp(id->qp, &qp_attr, IBV_QP_STATE); +} + +static int ucma_find_gid(struct cma_device *cma_dev, union ibv_gid *gid, + uint8_t *port_num) +{ + int port, ret, i; + union ibv_gid chk_gid; + + for (port = 1; port <= cma_dev->port_cnt; port++) + for (i = 0, ret = 0; !ret; i++) { + ret = ibv_query_gid(cma_dev->verbs, port, i, &chk_gid); + if (!ret && !memcmp(gid, &chk_gid, sizeof *gid)) { + *port_num = port; + return 0; + } + } + + return -EINVAL; +} + +static int ucma_find_pkey(struct cma_device *cma_dev, uint8_t port_num, + uint16_t pkey, uint16_t *pkey_index) +{ + int ret, i; + uint16_t chk_pkey; + + for (i = 0, ret = 0; !ret; i++) { + ret = ibv_query_pkey(cma_dev->verbs, port_num, i, &chk_pkey); + if (!ret && pkey == chk_pkey) { + *pkey_index = (uint16_t) i; + return 0; + } + } + + return -EINVAL; +} + +static int 
ucma_init_ib_qp(struct cma_id_private *id_priv, struct ibv_qp *qp) +{ + struct ibv_qp_attr qp_attr; + struct ib_addr *ibaddr; + int ret; + + qp_attr.qp_state = IBV_QPS_INIT; + qp_attr.qp_access_flags = IBV_ACCESS_LOCAL_WRITE; + + ibaddr = &id_priv->id.route.addr.addr.ibaddr; + ret = ucma_find_gid(id_priv->cma_dev, &ibaddr->sgid, &qp_attr.port_num); + if (ret) + return ret; + + ret = ucma_find_pkey(id_priv->cma_dev, qp_attr.port_num, ibaddr->pkey, + &qp_attr.pkey_index); + if (ret) + return ret; + + return ibv_modify_qp(qp, &qp_attr, IBV_QP_STATE | IBV_QP_ACCESS_FLAGS | + IBV_QP_PKEY_INDEX | IBV_QP_PORT); +} + +int rdma_create_qp(struct rdma_cm_id *id, struct ibv_pd *pd, + struct ibv_qp_init_attr *qp_init_attr) +{ + struct cma_id_private *id_priv; + struct ibv_qp *qp; + int ret; + + id_priv = container_of(id, struct cma_id_private, id); + if (id->verbs != pd->context) + return -EINVAL; + + qp = ibv_create_qp(pd, qp_init_attr); + if (!qp) + return -ENOMEM; + + ret = ucma_init_ib_qp(id_priv, qp); + if (ret) + goto err; + + id->qp = qp; + return 0; +err: + ibv_destroy_qp(qp); + return ret; +} + +void rdma_destroy_qp(struct rdma_cm_id *id) +{ + ibv_destroy_qp(id->qp); +} + +static int ucma_query_route(struct rdma_cm_id *id) +{ + struct ucma_abi_query_route_resp *resp; + struct ucma_abi_query_route *cmd; + struct cma_id_private *id_priv; + void *msg; + int ret, size, i; + + CMA_CREATE_MSG_CMD_RESP(msg, cmd, resp, UCMA_CMD_QUERY_ROUTE, size); + id_priv = container_of(id, struct cma_id_private, id); + cmd->id = id_priv->handle; + + ret = write(cma_fd, msg, size); + if (ret != size) + return (ret > 0) ? 
-ENODATA : ret; + + if (resp->num_paths) { + id->route.path_rec = malloc(sizeof *id->route.path_rec * + resp->num_paths); + if (!id->route.path_rec) + return -ENOMEM; + + id->route.num_paths = resp->num_paths; + for (i = 0; i < resp->num_paths; i++) + ib_copy_path_rec_from_kern(&id->route.path_rec[i], + &resp->ib_route[i]); + } + + memcpy(id->route.addr.addr.ibaddr.sgid.raw, resp->ib_route[0].sgid, + sizeof id->route.addr.addr.ibaddr.sgid); + memcpy(id->route.addr.addr.ibaddr.dgid.raw, resp->ib_route[0].dgid, + sizeof id->route.addr.addr.ibaddr.dgid); + id->route.addr.addr.ibaddr.pkey = resp->ib_route[0].pkey; + memcpy(&id->route.addr.src_addr, &resp->src_addr, + sizeof id->route.addr.src_addr); + + if (!id_priv->cma_dev) { + ret = ucma_get_device(id_priv, resp->node_guid); + if (ret) + return ret; + } + + return 0; +} + +static void ucma_copy_conn_param_to_kern(struct ucma_abi_conn_param *dst, + struct rdma_conn_param *src, + struct ibv_qp *qp) +{ + dst->qp_num = qp->qp_num; + dst->qp_type = qp->qp_type; + dst->srq = (qp->srq != NULL); + dst->responder_resources = src->responder_resources; + dst->initiator_depth = src->initiator_depth; + dst->flow_control = src->flow_control; + dst->retry_count = src->retry_count; + dst->rnr_retry_count = src->rnr_retry_count; + dst->valid = 1; + + if (src->private_data && src->private_data_len) { + memcpy(dst->private_data, src->private_data, + src->private_data_len); + dst->private_data_len = src->private_data_len; + } else + dst->private_data_len = 0; +} + +int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) +{ + struct ucma_abi_connect *cmd; + struct cma_id_private *id_priv; + void *msg; + int ret, size; + + CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_CONNECT, size); + id_priv = container_of(id, struct cma_id_private, id); + cmd->id = id_priv->handle; + ucma_copy_conn_param_to_kern(&cmd->conn_param, conn_param, id->qp); + + ret = write(cma_fd, msg, size); + if (ret != size) + return (ret > 0) ? 
-ENODATA : ret; + + return 0; +} + +int rdma_listen(struct rdma_cm_id *id, int backlog) +{ + struct ucma_abi_listen *cmd; + struct cma_id_private *id_priv; + void *msg; + int ret, size; + + CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_LISTEN, size); + id_priv = container_of(id, struct cma_id_private, id); + cmd->id = id_priv->handle; + cmd->backlog = backlog; + + ret = write(cma_fd, msg, size); + if (ret != size) + return (ret > 0) ? -ENODATA : ret; + + return 0; +} + +int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) +{ + struct ucma_abi_accept *cmd; + struct cma_id_private *id_priv; + void *msg; + int ret, size; + + ret = ucma_modify_qp_rtr(id); + if (ret) + return ret; + + CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_ACCEPT, size); + id_priv = container_of(id, struct cma_id_private, id); + cmd->id = id_priv->handle; + cmd->uid = (uintptr_t) id_priv; + ucma_copy_conn_param_to_kern(&cmd->conn_param, conn_param, id->qp); + + ret = write(cma_fd, msg, size); + if (ret != size) { + ucma_modify_qp_err(id); + return (ret > 0) ? -ENODATA : ret; + } + + return 0; +} + +int rdma_reject(struct rdma_cm_id *id, const void *private_data, + uint8_t private_data_len) +{ + struct ucma_abi_reject *cmd; + struct cma_id_private *id_priv; + void *msg; + int ret, size; + + CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_REJECT, size); + + id_priv = container_of(id, struct cma_id_private, id); + cmd->id = id_priv->handle; + if (private_data && private_data_len) { + memcpy(cmd->private_data, private_data, private_data_len); + cmd->private_data_len = private_data_len; + } else + cmd->private_data_len = 0; + + ret = write(cma_fd, msg, size); + if (ret != size) + return (ret > 0) ? 
-ENODATA : ret; + + return 0; +} + +int rdma_disconnect(struct rdma_cm_id *id) +{ + struct ucma_abi_disconnect *cmd; + struct cma_id_private *id_priv; + void *msg; + int ret, size; + + ret = ucma_modify_qp_err(id); + if (ret) + return ret; + + CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_DISCONNECT, size); + id_priv = container_of(id, struct cma_id_private, id); + cmd->id = id_priv->handle; + + ret = write(cma_fd, msg, size); + if (ret != size) + return (ret > 0) ? -ENODATA : ret; + + return 0; +} + +static void ucma_copy_event_from_kern(struct rdma_cm_event *dst, + struct ucma_abi_event_resp *src) +{ + dst->event = src->event; + dst->status = src->status; + dst->private_data_len = src->private_data_len; + if (src->private_data_len) { + dst->private_data = dst + 1; + memcpy(dst->private_data, src->private_data, + src->private_data_len); + } else + dst->private_data = NULL; +} + +static void ucma_complete_event(struct cma_id_private *id_priv) +{ + pthread_mutex_lock(&id_priv->mut); + id_priv->events_completed++; + pthread_cond_signal(&id_priv->cond); + pthread_mutex_unlock(&id_priv->mut); +} + +int rdma_ack_cm_event(struct rdma_cm_event *event) +{ + struct rdma_cm_id *id; + + if (!event) + return -EINVAL; + + id = (event->event == RDMA_CM_EVENT_CONNECT_REQUEST) ? 
+ event->listen_id : event->id; + + ucma_complete_event(container_of(id, struct cma_id_private, id)); + free(event); + return 0; +} + +static int ucma_process_conn_req(struct rdma_cm_event *event, + uint32_t handle) +{ + struct cma_id_private *listen_id_priv, *id_priv; + int ret; + + listen_id_priv = container_of(event->id, struct cma_id_private, id); + id_priv = ucma_alloc_id(event->id->context); + if (!id_priv) { + ucma_destroy_kern_id(handle); + ret = -ENOMEM; + goto err; + } + + event->listen_id = event->id; + event->id = &id_priv->id; + id_priv->handle = handle; + + ret = ucma_query_route(&id_priv->id); + if (ret) { + rdma_destroy_id(&id_priv->id); + goto err; + } + + return 0; +err: + ucma_complete_event(listen_id_priv); + return ret; +} + +static int ucma_process_conn_resp(struct cma_id_private *id_priv) +{ + struct ucma_abi_accept *cmd; + void *msg; + int ret, size; + + ret = ucma_modify_qp_rtr(&id_priv->id); + if (ret) + goto err; + + ret = ucma_modify_qp_rts(&id_priv->id); + if (ret) + goto err; + + CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_ACCEPT, size); + cmd->id = id_priv->handle; + + ret = write(cma_fd, msg, size); + if (ret != size) { + ret = (ret > 0) ? -ENODATA : ret; + goto err; + } + + return 0; +err: + ucma_modify_qp_err(&id_priv->id); + return ret; +} + +static int ucma_process_establish(struct rdma_cm_id *id) +{ + int ret; + + ret = ucma_modify_qp_rts(id); + if (ret) + ucma_modify_qp_err(id); + + return ret; +} + +int rdma_get_cm_event(struct rdma_cm_event **event) +{ + struct ucma_abi_event_resp *resp; + struct ucma_abi_get_event *cmd; + struct cma_id_private *id_priv; + struct rdma_cm_event *evt; + void *msg; + int ret, size; + + if (!event) + return -EINVAL; + + evt = malloc(sizeof *evt + RDMA_MAX_PRIVATE_DATA); + if (!evt) + return -ENOMEM; + +retry: + CMA_CREATE_MSG_CMD_RESP(msg, cmd, resp, UCMA_CMD_GET_EVENT, size); + ret = write(cma_fd, msg, size); + if (ret != size) { + ret = (ret > 0) ? 
-ENODATA : ret; + goto err; + } + + id_priv = (void *) (uintptr_t) resp->uid; + evt->id = &id_priv->id; + ucma_copy_event_from_kern(evt, resp); + + switch (evt->event) { + case RDMA_CM_EVENT_ADDR_RESOLVED: + evt->status = ucma_query_route(&id_priv->id); + if (evt->status) + evt->event = RDMA_CM_EVENT_ADDR_ERROR; + break; + case RDMA_CM_EVENT_ROUTE_RESOLVED: + evt->status = ucma_query_route(&id_priv->id); + if (evt->status) + evt->event = RDMA_CM_EVENT_ROUTE_ERROR; + break; + case RDMA_CM_EVENT_CONNECT_REQUEST: + ret = ucma_process_conn_req(evt, resp->id); + if (ret) + goto retry; + break; + case RDMA_CM_EVENT_CONNECT_RESPONSE: + evt->status = ucma_process_conn_resp(id_priv); + if (!evt->status) + evt->event = RDMA_CM_EVENT_ESTABLISHED; + else + evt->event = RDMA_CM_EVENT_CONNECT_ERROR; + break; + case RDMA_CM_EVENT_ESTABLISHED: + evt->status = ucma_process_establish(&id_priv->id); + if (evt->status) + evt->event = RDMA_CM_EVENT_CONNECT_ERROR; + break; + case RDMA_CM_EVENT_REJECTED: + ucma_modify_qp_err(evt->id); + break; + default: + break; + } + + *event = evt; + return 0; +err: + free(evt); + return ret; +} Index: userspace/librdmacm/src/librdmacm.map =================================================================== --- userspace/librdmacm/src/librdmacm.map (revision 0) +++ userspace/librdmacm/src/librdmacm.map (revision 0) @@ -0,0 +1,18 @@ +RDMACM_1.0 { + global: + rdma_create_id; + rdma_destroy_id; + rdma_bind_addr; + rdma_resolve_addr; + rdma_resolve_route; + rdma_create_qp; + rdma_destroy_qp; + rdma_connect; + rdma_listen; + rdma_accept; + rdma_reject; + rdma_disconnect; + rdma_get_cm_event; + rdma_ack_cm_event; + local: *; +}; Index: userspace/librdmacm/ChangeLog =================================================================== Index: userspace/librdmacm/COPYING =================================================================== --- userspace/librdmacm/COPYING (revision 0) +++ userspace/librdmacm/COPYING (revision 0) @@ -0,0 +1,378 @@ +This software is 
available to you under a choice of one of two +licenses. You may choose to be licensed under the terms of the +OpenIB.org BSD license or the GNU General Public License (GPL) Version +2, both included below. + +Copyright (c) 2005 Intel Corporation. All rights reserved. + +================================================================== + + OpenIB.org BSD license + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions +are met: + + * Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + + * Redistributions in binary form must reproduce the above + copyright notice, this list of conditions and the following + disclaimer in the documentation and/or other materials provided + with the distribution. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS +"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT +LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS +FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE +COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, +INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, +BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; +LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER +CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN +ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE +POSSIBILITY OF SUCH DAMAGE. + +================================================================== + + GNU GENERAL PUBLIC LICENSE + Version 2, June 1991 + + Copyright (C) 1989, 1991 Free Software Foundation, Inc. + 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. 
+ + Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +License is intended to guarantee your freedom to share and change free +software--to make sure the software is free for all its users. This +General Public License applies to most of the Free Software +Foundation's software and to any other program whose authors commit to +using it. (Some other Free Software Foundation software is covered by +the GNU Library General Public License instead.) You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +this service if you wish), that you receive source code or can get it +if you want it, that you can change the software or use pieces of it +in new free programs; and that you know you can do these things. + + To protect your rights, we need to make restrictions that forbid +anyone to deny you these rights or to ask you to surrender the rights. +These restrictions translate to certain responsibilities for you if you +distribute copies of the software, or if you modify it. + + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must give the recipients all the rights that +you have. You must make sure that they, too, receive or can get the +source code. And you must show them these terms so they know their +rights. + + We protect your rights with two steps: (1) copyright the software, and +(2) offer you this license which gives you legal permission to copy, +distribute and/or modify the software. + + Also, for each author's protection and ours, we want to make certain +that everyone understands that there is no warranty for this free +software. 
If the software is modified by someone else and passed on, we +want its recipients to know that what they have is not the original, so +that any problems introduced by others will not reflect on the original +authors' reputations. + + Finally, any free program is threatened constantly by software +patents. We wish to avoid the danger that redistributors of a free +program will individually obtain patent licenses, in effect making the +program proprietary. To prevent this, we have made it clear that any +patent must be licensed for everyone's free use or not licensed at all. + + The precise terms and conditions for copying, distribution and +modification follow. + + GNU GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License applies to any program or other work which contains +a notice placed by the copyright holder saying it may be distributed +under the terms of this General Public License. The "Program", below, +refers to any such program or work, and a "work based on the Program" +means either the Program or any derivative work under copyright law: +that is to say, a work containing the Program or a portion of it, +either verbatim or with modifications and/or translated into another +language. (Hereinafter, translation is included without limitation in +the term "modification".) Each licensee is addressed as "you". + +Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. The act of +running the Program is not restricted, and the output from the Program +is covered only if its contents constitute a work based on the +Program (independent of having been made by running the Program). +Whether that is true depends on what the Program does. + + 1. 
You may copy and distribute verbatim copies of the Program's +source code as you receive it, in any medium, provided that you +conspicuously and appropriately publish on each copy an appropriate +copyright notice and disclaimer of warranty; keep intact all the +notices that refer to this License and to the absence of any warranty; +and give any other recipients of the Program a copy of this License +along with the Program. + +You may charge a fee for the physical act of transferring a copy, and +you may at your option offer warranty protection in exchange for a fee. + + 2. You may modify your copy or copies of the Program or any portion +of it, thus forming a work based on the Program, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) You must cause the modified files to carry prominent notices + stating that you changed the files and the date of any change. + + b) You must cause any work that you distribute or publish, that in + whole or in part contains or is derived from the Program or any + part thereof, to be licensed as a whole at no charge to all third + parties under the terms of this License. + + c) If the modified program normally reads commands interactively + when run, you must cause it, when started running for such + interactive use in the most ordinary way, to print or display an + announcement including an appropriate copyright notice and a + notice that there is no warranty (or else, saying that you provide + a warranty) and that users may redistribute the program under + these conditions, and telling the user how to view a copy of this + License. (Exception: if the Program itself is interactive but + does not normally print such an announcement, your work based on + the Program is not required to print an announcement.) + +These requirements apply to the modified work as a whole. 
If +identifiable sections of that work are not derived from the Program, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. But when you +distribute the same sections as part of a whole which is a work based +on the Program, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Program. + +In addition, mere aggregation of another work not based on the Program +with the Program (or with a work based on the Program) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. You may copy and distribute the Program (or a work based on it, +under Section 2) in object code or executable form under the terms of +Sections 1 and 2 above provided that you also do one of the following: + + a) Accompany it with the complete corresponding machine-readable + source code, which must be distributed under the terms of Sections + 1 and 2 above on a medium customarily used for software interchange; or, + + b) Accompany it with a written offer, valid for at least three + years, to give any third party, for a charge no more than your + cost of physically performing source distribution, a complete + machine-readable copy of the corresponding source code, to be + distributed under the terms of Sections 1 and 2 above on a medium + customarily used for software interchange; or, + + c) Accompany it with the information you received as to the offer + to distribute corresponding source code. 
(This alternative is + allowed only for noncommercial distribution and only if you + received the program in object code or executable form with such + an offer, in accord with Subsection b above.) + +The source code for a work means the preferred form of the work for +making modifications to it. For an executable work, complete source +code means all the source code for all modules it contains, plus any +associated interface definition files, plus the scripts used to +control compilation and installation of the executable. However, as a +special exception, the source code distributed need not include +anything that is normally distributed (in either source or binary +form) with the major components (compiler, kernel, and so on) of the +operating system on which the executable runs, unless that component +itself accompanies the executable. + +If distribution of executable or object code is made by offering +access to copy from a designated place, then offering equivalent +access to copy the source code from the same place counts as +distribution of the source code, even though third parties are not +compelled to copy the source along with the object code. + + 4. You may not copy, modify, sublicense, or distribute the Program +except as expressly provided under this License. Any attempt +otherwise to copy, modify, sublicense or distribute the Program is +void, and will automatically terminate your rights under this License. +However, parties who have received copies, or rights, from you under +this License will not have their licenses terminated so long as such +parties remain in full compliance. + + 5. You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Program or its derivative works. These actions are +prohibited by law if you do not accept this License. 
Therefore, by +modifying or distributing the Program (or any work based on the +Program), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Program or works based on it. + + 6. Each time you redistribute the Program (or any work based on the +Program), the recipient automatically receives a license from the +original licensor to copy, distribute or modify the Program subject to +these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. +You are not responsible for enforcing compliance by third parties to +this License. + + 7. If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Program at all. For example, if a patent +license would not permit royalty-free redistribution of the Program by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Program. + +If any portion of this section is held invalid or unenforceable under +any particular circumstance, the balance of the section is intended to +apply and the section as a whole is intended to apply in other +circumstances. 
+ +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system, which is +implemented by public license practices. Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 8. If the distribution and/or use of the Program is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Program under this License +may add an explicit geographical distribution limitation excluding +those countries, so that distribution is permitted only in or among +countries not thus excluded. In such case, this License incorporates +the limitation as if written in the body of this License. + + 9. The Free Software Foundation may publish revised and/or new versions +of the General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + +Each version is given a distinguishing version number. If the Program +specifies a version number of this License which applies to it and "any +later version", you have the option of following the terms and conditions +either of that version or of any later version published by the Free +Software Foundation. If the Program does not specify a version number of +this License, you may choose any version ever published by the Free Software +Foundation. + + 10. 
If you wish to incorporate parts of the Program into other free +programs whose distribution conditions are different, write to the author +to ask for permission. For software which is copyrighted by the Free +Software Foundation, write to the Free Software Foundation; we sometimes +make exceptions for this. Our decision will be guided by the two goals +of preserving the free status of all derivatives of our free software and +of promoting the sharing and reuse of software generally. + + NO WARRANTY + + 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY +FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN +OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES +PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED +OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS +TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE +PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, +REPAIR OR CORRECTION. + + 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR +REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, +INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING +OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED +TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY +YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER +PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE +POSSIBILITY OF SUCH DAMAGES. 
+ + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +convey the exclusion of warranty; and each file should have at least +the "copyright" line and a pointer to where the full notice is found. + + + Copyright (C) + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + + +Also add information on how to contact you by electronic and paper mail. + +If the program is interactive, make it output a short notice like this +when it starts in an interactive mode: + + Gnomovision version 69, Copyright (C) year name of author + Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. + This is free software, and you are welcome to redistribute it + under certain conditions; type `show c' for details. + +The hypothetical commands `show w' and `show c' should show the appropriate +parts of the General Public License. 
Of course, the commands you use may +be called something other than `show w' and `show c'; they could even be +mouse-clicks or menu items--whatever suits your program. + +You should also get your employer (if you work as a programmer) or your +school, if any, to sign a "copyright disclaimer" for the program, if +necessary. Here is a sample; alter the names: + + Yoyodyne, Inc., hereby disclaims all copyright interest in the program + `Gnomovision' (which makes passes at compilers) written by James Hacker. + + , 1 April 1989 + Ty Coon, President of Vice + +This General Public License does not permit incorporating your program into +proprietary programs. If your program is a subroutine library, you may +consider it more useful to permit linking proprietary applications with the +library. If this is what you want to do, use the GNU Library General +Public License instead of this License. Index: userspace/librdmacm/librdmacm.spec.in =================================================================== --- userspace/librdmacm/librdmacm.spec.in (revision 0) +++ userspace/librdmacm/librdmacm.spec.in (revision 0) @@ -0,0 +1,40 @@ +# $Id: $ + +%define prefix /usr +%define ver @VERSION@ +%define RELEASE 1 +%define rel %{?CUSTOM_RELEASE} %{!?CUSTOM_RELEASE:%RELEASE} + +Summary: Userspace RDMA Connection Manager. +Name: librdmacm +Version: %ver +Release: %rel +Copyright: Dual GPL/BSD +Group: System Environment/Libraries +BuildRoot: %{_tmppath}/%{name}-%{version}-root +Source: http://openib.org/downloads/%{name}-%{version}.tar.gz +Url: http://openib.org/ + +%description +Along with the OpenIB kernel drivers, librdmacm provides a userspace +RDMA Connection Management API.
+ +%prep +%setup -q + +%build +%configure +make + +%install +make DESTDIR=${RPM_BUILD_ROOT} install +# remove unpackaged files from the buildroot +rm -f $RPM_BUILD_ROOT%{_libdir}/*.la + +%clean +rm -rf $RPM_BUILD_ROOT + +%files +%defattr(-,root,root) +%{_libdir}/librdmacm*.so.* +%doc AUTHORS COPYING ChangeLog NEWS README Index: userspace/librdmacm/Makefile.am =================================================================== --- userspace/librdmacm/Makefile.am (revision 0) +++ userspace/librdmacm/Makefile.am (revision 0) @@ -0,0 +1,36 @@ +# $Id: Makefile.am 3373 2005-09-12 16:34:20Z roland $ +INCLUDES = -I$(srcdir)/include + +AM_CFLAGS = -g -Wall -D_GNU_SOURCE + +rdmacmlibdir = $(libdir) + +rdmacmlib_LTLIBRARIES = src/librdmacm.la + +src_rdmacm_la_CFLAGS = -g -Wall -D_GNU_SOURCE + +if HAVE_LD_VERSION_SCRIPT + rdmacm_version_script = -Wl,--version-script=$(srcdir)/src/librdmacm.map +else + rdmacm_version_script = +endif + +src_librdmacm_la_SOURCES = src/cma.c +src_librdmacm_la_LDFLAGS = -avoid-version $(rdmacm_version_script) + +bin_PROGRAMS = examples/ucmatose +examples_ucmatose_SOURCES = examples/cmatose.c +examples_ucmatose_LDADD = $(top_builddir)/src/librdmacm.la + +librdmacmincludedir = $(includedir)/rdma + +librdmacminclude_HEADERS = include/rdma/rdma_cma_abi.h \ + include/rdma/rdma_cma.h + +EXTRA_DIST = include/rdma/rdma_cma_abi.h \ + include/rdma/rdma_cma.h \ + src/librdmacm.map \ + librdmacm.spec.in + +dist-hook: librdmacm.spec + cp librdmacm.spec $(distdir) Index: userspace/librdmacm/autogen.sh =================================================================== --- userspace/librdmacm/autogen.sh (revision 0) +++ userspace/librdmacm/autogen.sh (revision 0) @@ -0,0 +1,8 @@ +#! 
/bin/sh + +set -x +aclocal -I config +libtoolize --force --copy +autoheader +automake --foreign --add-missing --copy +autoconf Property changes on: userspace/librdmacm/autogen.sh ___________________________________________________________________ Name: svn:executable + * Index: userspace/librdmacm/NEWS =================================================================== Index: userspace/librdmacm/README =================================================================== --- userspace/librdmacm/README (revision 0) +++ userspace/librdmacm/README (revision 0) @@ -0,0 +1,29 @@ +This README is for the userspace RDMA CM library. + +Building + +To make this directory, run: +./autogen.sh && ./configure && make && make install + +Typically the autogen and configure steps only need be done the first +time unless configure.in or Makefile.am changes. + +Libraries are installed by default at /usr/local/lib. + +Device files + +The userspace CMA uses a single device file regardless of the number +of adapters or ports present. + +To create the appropriate character device file automatically with +udev, a rule like + + KERNEL="ucma", NAME="infiniband/%k", MODE="0666" + +can be used. This will create the device node named + + /dev/infiniband/ucma + +or you can create it manually + + mknod /dev/infiniband/ucma c 231 255 Index: userspace/librdmacm/examples/cmatose.c =================================================================== --- userspace/librdmacm/examples/cmatose.c (revision 0) +++ userspace/librdmacm/examples/cmatose.c (revision 0) @@ -0,0 +1,583 @@ +/* + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses.
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#if __BYTE_ORDER == __BIG_ENDIAN +static inline uint64_t cpu_to_be64(uint64_t x) { return x; } +static inline uint32_t cpu_to_be32(uint32_t x) { return x; } +#else +static inline uint64_t cpu_to_be64(uint64_t x) { return bswap_64(x); } +static inline uint32_t cpu_to_be32(uint32_t x) { return bswap_32(x); } +#endif + +/* + * To execute: + * Server: rdma_cmatose + * Client: rdma_cmatose "dst_ip=ip" + */ + +struct cmatest_node { + int id; + struct rdma_cm_id *cma_id; + int connected; + struct ibv_pd *pd; + struct ibv_cq *cq; + struct ibv_mr *mr; + void *mem; +}; + +struct cmatest { + struct cmatest_node *nodes; + int conn_index; + int connects_left; + int disconnects_left; + + struct sockaddr_in addr_in; + struct sockaddr *addr; +}; + +static struct cmatest test; +static int connections = 1; +static int message_size = 100; +static int message_count = 10; +static int is_server; + +static int create_message(struct cmatest_node *node) +{ + if (!message_size) + message_count = 0; + + if (!message_count) + return 0; + + node->mem = malloc(message_size); + if (!node->mem) { + printf("failed message allocation\n"); + return -1; + } + node->mr = ibv_reg_mr(node->pd, node->mem, message_size, + IBV_ACCESS_LOCAL_WRITE); + if (!node->mr) { + printf("failed to reg MR\n"); + goto err; + } + return 0; +err: + free(node->mem); + return -1; +} + +static int init_node(struct cmatest_node *node) +{ + struct ibv_qp_init_attr init_qp_attr; + int cqe, ret; + + node->pd = ibv_alloc_pd(node->cma_id->verbs); + if (!node->pd) { + ret = -ENOMEM; + printf("cmatose: unable to allocate PD\n"); + goto out; + } + + cqe = message_count ? 
message_count * 2 : 2; + node->cq = ibv_create_cq(node->cma_id->verbs, cqe, node, 0, 0); + if (!node->cq) { + ret = -ENOMEM; + printf("cmatose: unable to create CQ\n"); + goto out; + } + + memset(&init_qp_attr, 0, sizeof init_qp_attr); + init_qp_attr.cap.max_send_wr = message_count ? message_count : 1; + init_qp_attr.cap.max_recv_wr = message_count ? message_count : 1; + init_qp_attr.cap.max_send_sge = 1; + init_qp_attr.cap.max_recv_sge = 1; + init_qp_attr.qp_context = node; + init_qp_attr.sq_sig_all = 1; + init_qp_attr.qp_type = IBV_QPT_RC; + init_qp_attr.send_cq = node->cq; + init_qp_attr.recv_cq = node->cq; + ret = rdma_create_qp(node->cma_id, node->pd, &init_qp_attr); + if (ret) { + printf("cmatose: unable to create QP: %d\n", ret); + goto out; + } + + ret = create_message(node); + if (ret) { + printf("cmatose: failed to create messages: %d\n", ret); + goto out; + } +out: + return ret; +} + +static int post_recvs(struct cmatest_node *node) +{ + struct ibv_recv_wr recv_wr, *recv_failure; + struct ibv_sge sge; + int i, ret = 0; + + if (!message_count) + return 0; + + recv_wr.next = NULL; + recv_wr.sg_list = &sge; + recv_wr.num_sge = 1; + recv_wr.wr_id = (uintptr_t) node; + + sge.length = message_size; + sge.lkey = node->mr->lkey; + sge.addr = (uintptr_t) node->mem; + + for (i = 0; i < message_count && !ret; i++ ) { + ret = ibv_post_recv(node->cma_id->qp, &recv_wr, &recv_failure); + if (ret) { + printf("failed to post receives: %d\n", ret); + break; + } + } + return ret; +} + +static int post_sends(struct cmatest_node *node) +{ + struct ibv_send_wr send_wr, *bad_send_wr; + struct ibv_sge sge; + int i, ret = 0; + + if (!node->connected || !message_count) + return 0; + + send_wr.next = NULL; + send_wr.sg_list = &sge; + send_wr.num_sge = 1; + send_wr.opcode = IBV_WR_SEND; + send_wr.send_flags = 0; + send_wr.wr_id = (unsigned long)node; + + sge.length = message_size; + sge.lkey = node->mr->lkey; + sge.addr = (uintptr_t) node->mem; + + for (i = 0; i < message_count && 
!ret; i++) + ret = ibv_post_send(node->cma_id->qp, &send_wr, &bad_send_wr); + + return ret; +} + +static void connect_error(void) +{ + test.disconnects_left--; + test.connects_left--; +} + +static void addr_handler(struct cmatest_node *node) +{ + int ret; + + ret = rdma_resolve_route(node->cma_id, 2000); + if (ret) { + printf("cmatose: resolve route failed: %d\n", ret); + connect_error(); + } +} + +static void route_handler(struct cmatest_node *node) +{ + struct rdma_conn_param conn_param; + int ret; + + ret = init_node(node); + if (ret) + goto err; + + ret = post_recvs(node); + if (ret) + goto err; + + memset(&conn_param, 0, sizeof conn_param); + conn_param.responder_resources = 1; + conn_param.initiator_depth = 1; + conn_param.retry_count = 5; + ret = rdma_connect(node->cma_id, &conn_param); + if (ret) { + printf("cmatose: failure connecting: %d\n", ret); + goto err; + } + return; +err: + connect_error(); +} + +static int connect_handler(struct rdma_cm_id *cma_id) +{ + struct cmatest_node *node; + struct rdma_conn_param conn_param; + int ret; + + if (test.conn_index == connections) { + ret = -ENOMEM; + goto err1; + } + node = &test.nodes[test.conn_index++]; + + node->cma_id = cma_id; + cma_id->context = node; + + ret = init_node(node); + if (ret) + goto err2; + + ret = post_recvs(node); + if (ret) + goto err2; + + memset(&conn_param, 0, sizeof conn_param); + conn_param.responder_resources = 1; + conn_param.initiator_depth = 1; + ret = rdma_accept(node->cma_id, &conn_param); + if (ret) { + printf("cmatose: failure accepting: %d\n", ret); + goto err2; + } + return 0; + +err2: + node->cma_id = NULL; + connect_error(); +err1: + printf("cmatose: failing connection request\n"); + rdma_reject(cma_id, NULL, 0); + return ret; +} + +static int cma_handler(struct rdma_cm_id *cma_id, struct rdma_cm_event *event) +{ + int ret = 0; + + switch (event->event) { + case RDMA_CM_EVENT_ADDR_RESOLVED: + addr_handler(cma_id->context); + break; + case RDMA_CM_EVENT_ROUTE_RESOLVED: + 
route_handler(cma_id->context); + break; + case RDMA_CM_EVENT_CONNECT_REQUEST: + ret = connect_handler(cma_id); + break; + case RDMA_CM_EVENT_ESTABLISHED: + ((struct cmatest_node *) cma_id->context)->connected = 1; + test.connects_left--; + break; + case RDMA_CM_EVENT_ADDR_ERROR: + case RDMA_CM_EVENT_ROUTE_ERROR: + case RDMA_CM_EVENT_CONNECT_ERROR: + case RDMA_CM_EVENT_UNREACHABLE: + case RDMA_CM_EVENT_REJECTED: + printf("cmatose: event: %d, error: %d\n", event->event, + event->status); + connect_error(); + break; + case RDMA_CM_EVENT_DISCONNECTED: + rdma_disconnect(cma_id); + test.disconnects_left--; + break; + case RDMA_CM_EVENT_DEVICE_REMOVAL: + /* Cleanup will occur after test completes. */ + break; + default: + break; + } + return ret; +} + +static void destroy_node(struct cmatest_node *node) +{ + if (!node->cma_id) + return; + + if (node->cma_id->qp) + rdma_destroy_qp(node->cma_id); + + if (node->cq) + ibv_destroy_cq(node->cq); + + if (node->mem) { + ibv_dereg_mr(node->mr); + free(node->mem); + } + + if (node->pd) + ibv_dealloc_pd(node->pd); + + /* Destroy the RDMA ID after all device resources */ + rdma_destroy_id(node->cma_id); +} + +static int alloc_nodes(void) +{ + int ret, i; + + test.nodes = malloc(sizeof *test.nodes * connections); + if (!test.nodes) { + printf("cmatose: unable to allocate memory for test nodes\n"); + return -ENOMEM; + } + memset(test.nodes, 0, sizeof *test.nodes * connections); + + for (i = 0; i < connections; i++) { + test.nodes[i].id = i; + if (!is_server) { + ret = rdma_create_id(&test.nodes[i].cma_id, + &test.nodes[i]); + if (ret) + goto err; + } + } + return 0; +err: + while (--i >= 0) + rdma_destroy_id(test.nodes[i].cma_id); + free(test.nodes); + return ret; +} + +static void destroy_nodes(void) +{ + int i; + + for (i = 0; i < connections; i++) + destroy_node(&test.nodes[i]); + free(test.nodes); +} + +static int poll_cqs(void) +{ + struct ibv_wc wc[8]; + int done, i, ret; + + for (i = 0; i < connections; i++) { + if 
(!test.nodes[i].connected) + continue; + + for (done = 0; done < message_count; done += ret) { + ret = ibv_poll_cq(test.nodes[i].cq, 8, wc); + if (ret < 0) { + printf("cmatose: failed polling CQ: %d\n", ret); + return ret; + } + } + } + return 0; +} + +static void connect_events(void) +{ + struct rdma_cm_event *event; + int err = 0; + + while (test.connects_left && !err) { + err = rdma_get_cm_event(&event); + if (!err) { + cma_handler(event->id, event); + rdma_ack_cm_event(event); + } + } +} + +static void disconnect_events(void) +{ + struct rdma_cm_event *event; + int err = 0; + + while (test.disconnects_left && !err) { + err = rdma_get_cm_event(&event); + if (!err) { + cma_handler(event->id, event); + rdma_ack_cm_event(event); + } + } +} + +static void run_server(void) +{ + struct rdma_cm_id *listen_id; + int i, ret; + + printf("cmatose: starting server\n"); + ret = rdma_create_id(&listen_id, &test); + if (ret) { + printf("cmatose: listen request failed\n"); + return; + } + + test.addr_in.sin_family = PF_INET; + test.addr_in.sin_port = 7471; + ret = rdma_bind_addr(listen_id, test.addr); + if (ret) { + printf("cmatose: bind address failed: %d\n", ret); + return; + } + + ret = rdma_listen(listen_id, 0); + if (ret) { + printf("cmatose: failure trying to listen: %d\n", ret); + goto out; + } + + connect_events(); + + if (message_count) { + printf("initiating data transfers\n"); + for (i = 0; i < connections; i++) + if (post_sends(&test.nodes[i])) + goto out; + + printf("receiving data transfers\n"); + if (poll_cqs()) + goto out; + printf("data transfers complete\n"); + + } + printf("cmatose: disconnecting\n"); + for (i = 0; i < connections; i++) { + if (!test.nodes[i].connected) + continue; + + test.nodes[i].connected = 0; + rdma_disconnect(test.nodes[i].cma_id); + } + + disconnect_events(); + printf("disconnected\n"); +out: + rdma_destroy_id(listen_id); +} + +static int get_dst_addr(char *dst) +{ + struct addrinfo *res; + int ret; + + ret = getaddrinfo(dst, NULL, 
NULL, &res); + if (ret) { + printf("getaddrinfo failed - invalid hostname or IP address\n"); + return ret; + } + + if (res->ai_family != PF_INET) { + ret = -1; + goto out; + } + + test.addr_in = *(struct sockaddr_in *) res->ai_addr; + test.addr_in.sin_port = 7471; +out: + freeaddrinfo(res); + return ret; +} + +static void run_client(char *dst) +{ + int i, ret; + + printf("cmatose: starting client\n"); + ret = get_dst_addr(dst); + if (ret) + return; + + printf("cmatose: connecting\n"); + for (i = 0; i < connections; i++) { + ret = rdma_resolve_addr(test.nodes[i].cma_id, NULL, + test.addr, 2000); + if (ret) { + printf("cmatose: failure getting addr: %d\n", ret); + connect_error(); + } + } + + connect_events(); + + if (message_count) { + printf("receiving data transfers\n"); + if (poll_cqs()) + goto out; + + printf("sending replies\n"); + for (i = 0; i < connections; i++) + if (post_sends(&test.nodes[i])) + goto out; + + printf("data transfers complete\n"); + + } +out: + disconnect_events(); +} + +int main(int argc, char **argv) +{ + if (argc != 1 && argc != 2) { + printf("usage: %s [server_addr]\n", argv[0]); + exit(1); + } + is_server = (argc == 1); + + test.addr = (struct sockaddr *) &test.addr_in; + test.connects_left = connections; + test.disconnects_left = connections; + if (alloc_nodes()) + exit(1); + + if (is_server) + run_server(); + else + run_client(argv[1]); + + printf("test complete\n"); + destroy_nodes(); + return 0; +} Index: userspace/libibcm/include/infiniband/cm_abi.h =================================================================== --- userspace/libibcm/include/infiniband/cm_abi.h (revision 4017) +++ userspace/libibcm/include/infiniband/cm_abi.h (working copy) @@ -37,6 +37,9 @@ #define CM_ABI_H #include +#include +#include + /* * This file must be kept in sync with the kernel's version of ib_user_cm.h */ @@ -114,58 +117,6 @@ __u32 qp_state; }; -struct cm_abi_ah_attr { - __u8 grh_dgid[16]; - __u32 grh_flow_label; - __u16 dlid; - __u16 reserved; - 
__u8 grh_sgid_index; - __u8 grh_hop_limit; - __u8 grh_traffic_class; - __u8 sl; - __u8 src_path_bits; - __u8 static_rate; - __u8 is_global; - __u8 port_num; -}; - -struct cm_abi_init_qp_attr_resp { - __u32 qp_attr_mask; - __u32 qp_state; - __u32 cur_qp_state; - __u32 path_mtu; - __u32 path_mig_state; - __u32 qkey; - __u32 rq_psn; - __u32 sq_psn; - __u32 dest_qp_num; - __u32 qp_access_flags; - - struct cm_abi_ah_attr ah_attr; - struct cm_abi_ah_attr alt_ah_attr; - - /* ibv_qp_cap */ - __u32 max_send_wr; - __u32 max_recv_wr; - __u32 max_send_sge; - __u32 max_recv_sge; - __u32 max_inline_data; - - __u16 pkey_index; - __u16 alt_pkey_index; - __u8 en_sqd_async_notify; - __u8 sq_draining; - __u8 max_rd_atomic; - __u8 max_dest_rd_atomic; - __u8 min_rnr_timer; - __u8 port_num; - __u8 timeout; - __u8 retry_cnt; - __u8 rnr_retry; - __u8 alt_port_num; - __u8 alt_timeout; -}; - struct cm_abi_listen { __u64 service_id; __u64 service_mask; @@ -184,28 +135,6 @@ __u8 reserved[3]; }; -struct cm_abi_path_rec { - __u8 dgid[16]; - __u8 sgid[16]; - __u16 dlid; - __u16 slid; - __u32 raw_traffic; - __u32 flow_label; - __u32 reversible; - __u32 mtu; - __u16 pkey; - __u8 hop_limit; - __u8 traffic_class; - __u8 numb_path; - __u8 sl; - __u8 mtu_selector; - __u8 rate_selector; - __u8 rate; - __u8 packet_life_time_selector; - __u8 packet_life_time; - __u8 preference; -}; - struct cm_abi_req { __u32 id; __u32 qpn; @@ -308,8 +237,8 @@ }; struct cm_abi_req_event_resp { - struct cm_abi_path_rec primary_path; - struct cm_abi_path_rec alternate_path; + struct ib_kern_path_rec primary_path; + struct ib_kern_path_rec alternate_path; __u64 remote_ca_guid; __u32 remote_qkey; __u32 remote_qpn; @@ -353,7 +282,7 @@ }; struct cm_abi_lap_event_resp { - struct cm_abi_path_rec path; + struct ib_kern_path_rec path; }; struct cm_abi_apr_event_resp { Index: userspace/libibcm/include/infiniband/cm.h =================================================================== --- userspace/libibcm/include/infiniband/cm.h 
(revision 4017) +++ userspace/libibcm/include/infiniband/cm.h (working copy) @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004, 2005 Intel Corporation. All rights reserved. * Copyright (c) 2004 Topspin Corporation. All rights reserved. * Copyright (c) 2004 Voltaire Corporation. All rights reserved. * Index: userspace/libibcm/configure.in =================================================================== --- userspace/libibcm/configure.in (revision 4017) +++ userspace/libibcm/configure.in (working copy) @@ -27,8 +27,6 @@ then AC_CHECK_LIB(ibverbs, ibv_get_devices, [], AC_MSG_ERROR([ibv_get_devices() not found. libibcm requires libibverbs.])) -AC_CHECK_LIB(ibat, ib_at_route_by_ip, [], - AC_MSG_ERROR([ib_at_route_by_ip() not found. libibcm requires libibat.])) fi dnl Checks for header files. @@ -36,8 +34,8 @@ then AC_CHECK_HEADER(infiniband/verbs.h, [], AC_MSG_ERROR([ not found. Is libibverbs installed?])) -AC_CHECK_HEADER(infiniband/at.h, [], - AC_MSG_ERROR([ not found. Is libibat installed?])) +AC_CHECK_HEADER(infiniband/marshall.h, [], + AC_MSG_ERROR([ not found. 
Is libibverbs installed?])) fi AC_HEADER_STDC Index: userspace/libibcm/src/cm.c =================================================================== --- userspace/libibcm/src/cm.c (revision 4017) +++ userspace/libibcm/src/cm.c (working copy) @@ -52,6 +52,7 @@ #include #include +#include #define PFX "libibcm: " @@ -266,33 +267,6 @@ return NULL; } -static void cm_param_path_get(struct cm_abi_path_rec *abi, - struct ib_sa_path_rec *sa) -{ - memcpy(abi->dgid, sa->dgid.raw, sizeof(union ibv_gid)); - memcpy(abi->sgid, sa->sgid.raw, sizeof(union ibv_gid)); - - abi->dlid = sa->dlid; - abi->slid = sa->slid; - - abi->raw_traffic = sa->raw_traffic; - abi->flow_label = sa->flow_label; - abi->reversible = sa->reversible; - abi->mtu = sa->mtu; - abi->pkey = sa->pkey; - - abi->hop_limit = sa->hop_limit; - abi->traffic_class = sa->traffic_class; - abi->numb_path = sa->numb_path; - abi->sl = sa->sl; - abi->mtu_selector = sa->mtu_selector; - abi->rate_selector = sa->rate_selector; - abi->rate = sa->rate; - abi->packet_life_time_selector = sa->packet_life_time_selector; - abi->packet_life_time = sa->packet_life_time; - abi->preference = sa->preference; -} - static void ib_cm_free_id(struct cm_id_private *cm_id_priv) { pthread_cond_destroy(&cm_id_priv->cond); @@ -407,65 +381,11 @@ return 0; } -static void ib_cm_copy_ah_attr(struct ibv_ah_attr *dest_attr, - struct cm_abi_ah_attr *src_attr) -{ - memcpy(dest_attr->grh.dgid.raw, src_attr->grh_dgid, - sizeof dest_attr->grh.dgid); - dest_attr->grh.flow_label = src_attr->grh_flow_label; - dest_attr->grh.sgid_index = src_attr->grh_sgid_index; - dest_attr->grh.hop_limit = src_attr->grh_hop_limit; - dest_attr->grh.traffic_class = src_attr->grh_traffic_class; - - dest_attr->dlid = src_attr->dlid; - dest_attr->sl = src_attr->sl; - dest_attr->src_path_bits = src_attr->src_path_bits; - dest_attr->static_rate = src_attr->static_rate; - dest_attr->is_global = src_attr->is_global; - dest_attr->port_num = src_attr->port_num; -} - -static void 
ib_cm_copy_qp_attr(struct ibv_qp_attr *dest_attr, - struct cm_abi_init_qp_attr_resp *src_attr) -{ - dest_attr->cur_qp_state = src_attr->cur_qp_state; - dest_attr->path_mtu = src_attr->path_mtu; - dest_attr->path_mig_state = src_attr->path_mig_state; - dest_attr->qkey = src_attr->qkey; - dest_attr->rq_psn = src_attr->rq_psn; - dest_attr->sq_psn = src_attr->sq_psn; - dest_attr->dest_qp_num = src_attr->dest_qp_num; - dest_attr->qp_access_flags = src_attr->qp_access_flags; - - dest_attr->cap.max_send_wr = src_attr->max_send_wr; - dest_attr->cap.max_recv_wr = src_attr->max_recv_wr; - dest_attr->cap.max_send_sge = src_attr->max_send_sge; - dest_attr->cap.max_recv_sge = src_attr->max_recv_sge; - dest_attr->cap.max_inline_data = src_attr->max_inline_data; - - ib_cm_copy_ah_attr(&dest_attr->ah_attr, &src_attr->ah_attr); - ib_cm_copy_ah_attr(&dest_attr->alt_ah_attr, &src_attr->alt_ah_attr); - - dest_attr->pkey_index = src_attr->pkey_index; - dest_attr->alt_pkey_index = src_attr->alt_pkey_index; - dest_attr->en_sqd_async_notify = src_attr->en_sqd_async_notify; - dest_attr->sq_draining = src_attr->sq_draining; - dest_attr->max_rd_atomic = src_attr->max_rd_atomic; - dest_attr->max_dest_rd_atomic = src_attr->max_dest_rd_atomic; - dest_attr->min_rnr_timer = src_attr->min_rnr_timer; - dest_attr->port_num = src_attr->port_num; - dest_attr->timeout = src_attr->timeout; - dest_attr->retry_cnt = src_attr->retry_cnt; - dest_attr->rnr_retry = src_attr->rnr_retry; - dest_attr->alt_port_num = src_attr->alt_port_num; - dest_attr->alt_timeout = src_attr->alt_timeout; -} - int ib_cm_init_qp_attr(struct ib_cm_id *cm_id, struct ibv_qp_attr *qp_attr, int *qp_attr_mask) { - struct cm_abi_init_qp_attr_resp *resp; + struct ibv_kern_qp_attr *resp; struct cm_abi_init_qp_attr *cmd; void *msg; int result; @@ -483,7 +403,7 @@ return (result > 0) ? 
-ENODATA : result; *qp_attr_mask = resp->qp_attr_mask; - ib_cm_copy_qp_attr(qp_attr, resp); + ib_copy_qp_attr_from_kern(qp_attr, resp); return 0; } @@ -511,8 +431,8 @@ int ib_cm_send_req(struct ib_cm_id *cm_id, struct ib_cm_req_param *param) { - struct cm_abi_path_rec *p_path; - struct cm_abi_path_rec *a_path; + struct ib_kern_path_rec *p_path; + struct ib_kern_path_rec *a_path; struct cm_abi_req *cmd; void *msg; int result; @@ -543,7 +463,7 @@ if (!p_path) return -ENOMEM; - cm_param_path_get(p_path, param->primary_path); + ib_copy_path_rec_to_kern(p_path, param->primary_path); cmd->primary_path = (uintptr_t) p_path; } @@ -552,7 +472,7 @@ if (!a_path) return -ENOMEM; - cm_param_path_get(a_path, param->alternate_path); + ib_copy_path_rec_to_kern(a_path, param->alternate_path); cmd->alternate_path = (uintptr_t) a_path; } @@ -758,7 +678,7 @@ void *private_data, uint8_t private_data_len) { - struct cm_abi_path_rec *abi_path; + struct ib_kern_path_rec *abi_path; struct cm_abi_lap *cmd; void *msg; int result; @@ -772,7 +692,7 @@ if (!abi_path) return -ENOMEM; - cm_param_path_get(abi_path, alternate_path); + ib_copy_path_rec_to_kern(abi_path, alternate_path); cmd->path = (uintptr_t) abi_path; } @@ -791,7 +711,7 @@ int ib_cm_send_sidr_req(struct ib_cm_id *cm_id, struct ib_cm_sidr_req_param *param) { - struct cm_abi_path_rec *abi_path; + struct ib_kern_path_rec *abi_path; struct cm_abi_sidr_req *cmd; void *msg; int result; @@ -812,7 +732,7 @@ if (!abi_path) return -ENOMEM; - cm_param_path_get(abi_path, param->path); + ib_copy_path_rec_to_kern(abi_path, param->path); cmd->path = (uintptr_t) abi_path; } @@ -862,39 +782,6 @@ return 0; } -/* - * event processing - */ -static void cm_event_path_get(struct ib_sa_path_rec *upath, - struct cm_abi_path_rec *kpath) -{ - if (!kpath || !upath) - return; - - memcpy(upath->dgid.raw, kpath->dgid, sizeof upath->dgid); - memcpy(upath->sgid.raw, kpath->sgid, sizeof upath->sgid); - - upath->dlid = kpath->dlid; - upath->slid = kpath->slid; - 
upath->raw_traffic = kpath->raw_traffic; - upath->flow_label = kpath->flow_label; - upath->hop_limit = kpath->hop_limit; - upath->traffic_class = kpath->traffic_class; - upath->reversible = kpath->reversible; - upath->numb_path = kpath->numb_path; - upath->pkey = kpath->pkey; - upath->sl = kpath->sl; - upath->mtu_selector = kpath->mtu_selector; - upath->mtu = kpath->mtu; - upath->rate_selector = kpath->rate_selector; - upath->rate = kpath->rate; - upath->packet_life_time = kpath->packet_life_time; - upath->preference = kpath->preference; - - upath->packet_life_time_selector = - kpath->packet_life_time_selector; -} - static void cm_event_req_get(struct ib_cm_req_event_param *ureq, struct cm_abi_req_event_resp *kreq) { @@ -913,8 +800,10 @@ ureq->srq = kreq->srq; ureq->port = kreq->port; - cm_event_path_get(ureq->primary_path, &kreq->primary_path); - cm_event_path_get(ureq->alternate_path, &kreq->alternate_path); + ib_copy_path_rec_from_kern(ureq->primary_path, &kreq->primary_path); + if (ureq->alternate_path) + ib_copy_path_rec_from_kern(ureq->alternate_path, + &kreq->alternate_path); } static void cm_event_rep_get(struct ib_cm_rep_event_param *urep, @@ -1058,8 +947,8 @@ case IB_CM_LAP_RECEIVED: evt->param.lap_rcvd.alternate_path = path_b; path_b = NULL; - cm_event_path_get(evt->param.lap_rcvd.alternate_path, - &resp->u.lap_resp.path); + ib_copy_path_rec_from_kern(evt->param.lap_rcvd.alternate_path, + &resp->u.lap_resp.path); break; case IB_CM_APR_RECEIVED: evt->param.apr_rcvd.ap_status = resp->u.apr_resp.status; Index: userspace/libibcm/src/libibcm.map =================================================================== --- userspace/libibcm/src/libibcm.map (revision 4017) +++ userspace/libibcm/src/libibcm.map (working copy) @@ -1,9 +1,9 @@ -IBCM_1.0 { +IBCM_4.0 { global: - ib_cm_event_get; - ib_cm_event_put; - ib_cm_get_fd; + ib_cm_get_event; + ib_cm_ack_event; + ib_cm_get_device; ib_cm_create_id; ib_cm_destroy_id; ib_cm_attr_id; @@ -20,5 +20,6 @@ ib_cm_send_apr; 
 		ib_cm_send_sidr_req;
 		ib_cm_send_sidr_rep;
+		ib_cm_init_qp_attr;
 	local: *;
 };
Index: userspace/libibcm/Makefile.am
===================================================================
--- userspace/libibcm/Makefile.am	(revision 4017)
+++ userspace/libibcm/Makefile.am	(working copy)
@@ -18,10 +18,6 @@
 src_libibcm_la_SOURCES = src/cm.c
 src_libibcm_la_LDFLAGS = -avoid-version $(ucm_version_script)
 
-bin_PROGRAMS = examples/ucmpost
-examples_ucmpost_SOURCES = examples/cmpost.c
-examples_ucmpost_LDADD = $(top_builddir)/src/libibcm.la
-
 libibcmincludedir = $(includedir)/infiniband
 
 libibcminclude_HEADERS = include/infiniband/cm_abi.h \
Index: userspace/libibcm/README
===================================================================
--- userspace/libibcm/README	(revision 4017)
+++ userspace/libibcm/README	(working copy)
@@ -12,18 +12,17 @@
 
 Device files
 
-The userspace CM uses a single device file regardless of the number
-of adapters or ports present.
+The userspace CM uses a device file per adapter present.
 
 To create the appropriate character device file automatically with
 udev, a rule like
 
-    KERNEL="ucm", NAME="infiniband/%k", MODE="0666"
+    KERNEL="ucm*", NAME="infiniband/%k", MODE="0666"
 
 can be used. This will create the device node named
 
-    /dev/infiniband/ucm
+    /dev/infiniband/ucm0
 
-or you can create it manually
+for the first HCA in the system, or you can create it manually
 
-    mknod /dev/infiniband/ucm c 231 255
+    mknod /dev/infiniband/ucm0 c 231 255
Index: linux-kernel/infiniband/include/rdma/ib_user_verbs.h
===================================================================
--- linux-kernel/infiniband/include/rdma/ib_user_verbs.h	(revision 4017)
+++ linux-kernel/infiniband/include/rdma/ib_user_verbs.h	(working copy)
@@ -311,6 +311,64 @@
 	__u32 async_events_reported;
 };
 
+struct ib_uverbs_global_route {
+	__u8 dgid[16];
+	__u32 flow_label;
+	__u8 sgid_index;
+	__u8 hop_limit;
+	__u8 traffic_class;
+	__u8 reserved;
+};
+
+struct ib_uverbs_ah_attr {
+	struct ib_uverbs_global_route grh;
+	__u16 dlid;
+	__u8 sl;
+	__u8 src_path_bits;
+	__u8 static_rate;
+	__u8 is_global;
+	__u8 port_num;
+	__u8 reserved;
+};
+
+struct ib_uverbs_qp_attr {
+	__u32 qp_attr_mask;
+	__u32 qp_state;
+	__u32 cur_qp_state;
+	__u32 path_mtu;
+	__u32 path_mig_state;
+	__u32 qkey;
+	__u32 rq_psn;
+	__u32 sq_psn;
+	__u32 dest_qp_num;
+	__u32 qp_access_flags;
+
+	struct ib_uverbs_ah_attr ah_attr;
+	struct ib_uverbs_ah_attr alt_ah_attr;
+
+	/* ib_qp_cap */
+	__u32 max_send_wr;
+	__u32 max_recv_wr;
+	__u32 max_send_sge;
+	__u32 max_recv_sge;
+	__u32 max_inline_data;
+
+	__u16 pkey_index;
+	__u16 alt_pkey_index;
+	__u8 en_sqd_async_notify;
+	__u8 sq_draining;
+	__u8 max_rd_atomic;
+	__u8 max_dest_rd_atomic;
+	__u8 min_rnr_timer;
+	__u8 port_num;
+	__u8 timeout;
+	__u8 retry_cnt;
+	__u8 rnr_retry;
+	__u8 alt_port_num;
+	__u8 alt_timeout;
+	__u8 reserved[5];
+};
+
 struct ib_uverbs_create_qp {
 	__u64 response;
 	__u64 user_handle;
@@ -487,26 +545,6 @@
 	__u32 bad_wr;
 };
 
-struct ib_uverbs_global_route {
-	__u8 dgid[16];
-	__u32 flow_label;
-	__u8 sgid_index;
-	__u8 hop_limit;
-	__u8 traffic_class;
-	__u8 reserved;
-};
-
-struct ib_uverbs_ah_attr {
-	struct ib_uverbs_global_route grh;
-	__u16 dlid;
-	__u8 sl;
-	__u8 src_path_bits;
-	__u8 static_rate;
-	__u8 is_global;
-	__u8 port_num;
-	__u8 reserved;
-};
-
 struct ib_uverbs_create_ah {
 	__u64 response;
 	__u64 user_handle;
Index: linux-kernel/infiniband/include/rdma/rdma_user_cm.h
===================================================================
--- linux-kernel/infiniband/include/rdma/rdma_user_cm.h	(revision 0)
+++ linux-kernel/infiniband/include/rdma/rdma_user_cm.h	(revision 0)
@@ -0,0 +1,187 @@
+/*
+ * Copyright (c) 2005 Intel Corporation. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef RDMA_USER_CM_H
+#define RDMA_USER_CM_H
+
+#include
+#include
+#include
+#include
+
+#define RDMA_USER_CM_ABI_VERSION 1
+
+#define RDMA_MAX_PRIVATE_DATA 256
+
+enum {
+	RDMA_USER_CM_CMD_CREATE_ID,
+	RDMA_USER_CM_CMD_DESTROY_ID,
+	RDMA_USER_CM_CMD_BIND_ADDR,
+	RDMA_USER_CM_CMD_RESOLVE_ADDR,
+	RDMA_USER_CM_CMD_RESOLVE_ROUTE,
+	RDMA_USER_CM_CMD_QUERY_ROUTE,
+	RDMA_USER_CM_CMD_CONNECT,
+	RDMA_USER_CM_CMD_LISTEN,
+	RDMA_USER_CM_CMD_ACCEPT,
+	RDMA_USER_CM_CMD_REJECT,
+	RDMA_USER_CM_CMD_DISCONNECT,
+	RDMA_USER_CM_CMD_INIT_QP_ATTR,
+	RDMA_USER_CM_CMD_GET_EVENT
+};
+
+/*
+ * command ABI structures.
+ */
+struct rdma_ucm_cmd_hdr {
+	__u32 cmd;
+	__u16 in;
+	__u16 out;
+};
+
+struct rdma_ucm_create_id {
+	__u64 uid;
+	__u64 response;
+};
+
+struct rdma_ucm_create_id_resp {
+	__u32 id;
+};
+
+struct rdma_ucm_destroy_id {
+	__u64 response;
+	__u32 id;
+	__u32 reserved;
+};
+
+struct rdma_ucm_destroy_id_resp {
+	__u32 events_reported;
+};
+
+struct rdma_ucm_bind_addr {
+	__u64 response;
+	struct sockaddr_in6 addr;
+	__u32 id;
+};
+
+struct rdma_ucm_bind_addr_resp {
+	__u64 node_guid;
+};
+
+struct rdma_ucm_resolve_addr {
+	struct sockaddr_in6 src_addr;
+	struct sockaddr_in6 dst_addr;
+	__u32 id;
+	__u32 timeout_ms;
+};
+
+struct rdma_ucm_resolve_route {
+	__u32 id;
+	__u32 timeout_ms;
+};
+
+struct rdma_ucm_query_route {
+	__u64 response;
+	__u32 id;
+	__u32 reserved;
+};
+
+struct rdma_ucm_query_route_resp {
+	__u64 node_guid;
+	struct ib_user_path_rec ib_route[2];
+	struct sockaddr_in6 src_addr;
+	__u32 num_paths;
+};
+
+struct rdma_ucm_conn_param {
+	__u32 qp_num;
+	__u32 qp_type;
+	__u8 private_data[RDMA_MAX_PRIVATE_DATA];
+	__u8 private_data_len;
+	__u8 srq;
+	__u8 responder_resources;
+	__u8 initiator_depth;
+	__u8 flow_control;
+	__u8 retry_count;
+	__u8 rnr_retry_count;
+	__u8 valid;
+};
+
+struct rdma_ucm_connect {
+	struct rdma_ucm_conn_param conn_param;
+	__u32 id;
+	__u32 reserved;
+};
+
+struct rdma_ucm_listen {
+	__u32 id;
+	__u32 backlog;
+};
+
+struct rdma_ucm_accept {
+	__u64 uid;
+	struct rdma_ucm_conn_param conn_param;
+	__u32 id;
+	__u32 reserved;
+};
+
+struct rdma_ucm_reject {
+	__u32 id;
+	__u8 private_data_len;
+	__u8 reserved[3];
+	__u8 private_data[RDMA_MAX_PRIVATE_DATA];
+};
+
+struct rdma_ucm_disconnect {
+	__u32 id;
+};
+
+struct rdma_ucm_init_qp_attr {
+	__u64 response;
+	__u32 id;
+	__u32 qp_state;
+};
+
+struct rdma_ucm_get_event {
+	__u64 response;
+};
+
+struct rdma_ucm_event_resp {
+	__u64 uid;
+	__u32 id;
+	__u32 event;
+	__u32 status;
+	__u8 private_data_len;
+	__u8 reserved[3];
+	__u8 private_data[RDMA_MAX_PRIVATE_DATA];
+};
+
+#endif /* RDMA_USER_CM_H */
Index: linux-kernel/infiniband/include/rdma/rdma_cm.h
===================================================================
--- linux-kernel/infiniband/include/rdma/rdma_cm.h	(revision 4017)
+++ linux-kernel/infiniband/include/rdma/rdma_cm.h	(working copy)
@@ -44,6 +44,7 @@
 	RDMA_CM_EVENT_ROUTE_RESOLVED,
 	RDMA_CM_EVENT_ROUTE_ERROR,
 	RDMA_CM_EVENT_CONNECT_REQUEST,
+	RDMA_CM_EVENT_CONNECT_RESPONSE,
 	RDMA_CM_EVENT_CONNECT_ERROR,
 	RDMA_CM_EVENT_UNREACHABLE,
 	RDMA_CM_EVENT_REJECTED,
@@ -137,6 +138,9 @@
 /**
  * rdma_create_qp - Allocate a QP and associate it with the specified RDMA
  * identifier.
+ *
+ * QPs allocated to an rdma_cm_id will automatically be transitioned by the CMA
+ * through their states.
  */
 int rdma_create_qp(struct rdma_cm_id *id, struct ib_pd *pd,
 		   struct ib_qp_init_attr *qp_init_attr);
@@ -150,6 +154,28 @@
  */
 void rdma_destroy_qp(struct rdma_cm_id *id);
 
+/**
+ * rdma_init_qp_attr - Initializes the QP attributes for use in transitioning
+ *   to a specified QP state.
+ * @id: Communication identifier associated with the QP attributes to
+ *   initialize.
+ * @qp_attr: On input, specifies the desired QP state. On output, the
+ *   mandatory and desired optional attributes will be set in order to
+ *   modify the QP to the specified state.
+ * @qp_attr_mask: The QP attribute mask that may be used to transition the
+ *   QP to the specified state.
+ *
+ * Users must set the @qp_attr->qp_state to the desired QP state. This call
+ * will set all required attributes for the given transition, along with
+ * known optional attributes. Users may override the attributes returned from
+ * this call before calling ib_modify_qp.
+ *
+ * Users that wish to have their QP automatically transitioned through its
+ * states can associate a QP with the rdma_cm_id by calling rdma_create_qp().
+ */
+int rdma_init_qp_attr(struct rdma_cm_id *id, struct ib_qp_attr *qp_attr,
+		       int *qp_attr_mask);
+
 struct rdma_conn_param {
 	const void *private_data;
 	u8 private_data_len;
@@ -158,6 +184,10 @@
 	u8 flow_control;
 	u8 retry_count;		/* ignored when accepting */
 	u8 rnr_retry_count;
+	/* Fields below ignored if a QP is created on the rdma_cm_id. */
+	u8 srq;
+	u32 qp_num;
+	enum ib_qp_type qp_type;
 };
 
 /**
@@ -175,10 +205,18 @@
  * Users must have bound the rdma_cm_id to a local address by calling
  * rdma_bind_addr before calling this routine.
  */
-int rdma_listen(struct rdma_cm_id *id);
+int rdma_listen(struct rdma_cm_id *id, int backlog);
 
 /**
- * rdma_accept - Called on the passive side to accept a connection request
+ * rdma_accept - Called to accept a connection request or response.
+ * @id: Connection identifier associated with the request.
+ * @conn_param: Information needed to establish the connection. This must be
+ *   provided if accepting a connection request. If accepting a connection
+ *   response, this parameter must be NULL.
+ *
+ * Typically, this routine is only called by the listener to accept a connection
+ * request. It must also be called on the active side of a connection if the
+ * user is performing their own QP transitions.
*/ int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param); Index: linux-kernel/infiniband/include/rdma/ib_user_cm.h =================================================================== --- linux-kernel/infiniband/include/rdma/ib_user_cm.h (revision 4017) +++ linux-kernel/infiniband/include/rdma/ib_user_cm.h (working copy) @@ -36,7 +36,7 @@ #ifndef IB_USER_CM_H #define IB_USER_CM_H -#include +#include #define IB_USER_CM_ABI_VERSION 4 @@ -110,58 +110,6 @@ __u32 qp_state; }; -struct ib_ucm_ah_attr { - __u8 grh_dgid[16]; - __u32 grh_flow_label; - __u16 dlid; - __u16 reserved; - __u8 grh_sgid_index; - __u8 grh_hop_limit; - __u8 grh_traffic_class; - __u8 sl; - __u8 src_path_bits; - __u8 static_rate; - __u8 is_global; - __u8 port_num; -}; - -struct ib_ucm_init_qp_attr_resp { - __u32 qp_attr_mask; - __u32 qp_state; - __u32 cur_qp_state; - __u32 path_mtu; - __u32 path_mig_state; - __u32 qkey; - __u32 rq_psn; - __u32 sq_psn; - __u32 dest_qp_num; - __u32 qp_access_flags; - - struct ib_ucm_ah_attr ah_attr; - struct ib_ucm_ah_attr alt_ah_attr; - - /* ib_qp_cap */ - __u32 max_send_wr; - __u32 max_recv_wr; - __u32 max_send_sge; - __u32 max_recv_sge; - __u32 max_inline_data; - - __u16 pkey_index; - __u16 alt_pkey_index; - __u8 en_sqd_async_notify; - __u8 sq_draining; - __u8 max_rd_atomic; - __u8 max_dest_rd_atomic; - __u8 min_rnr_timer; - __u8 port_num; - __u8 timeout; - __u8 retry_cnt; - __u8 rnr_retry; - __u8 alt_port_num; - __u8 alt_timeout; -}; - struct ib_ucm_listen { __be64 service_id; __be64 service_mask; @@ -180,28 +128,6 @@ __u8 reserved[3]; }; -struct ib_ucm_path_rec { - __u8 dgid[16]; - __u8 sgid[16]; - __be16 dlid; - __be16 slid; - __u32 raw_traffic; - __be32 flow_label; - __u32 reversible; - __u32 mtu; - __be16 pkey; - __u8 hop_limit; - __u8 traffic_class; - __u8 numb_path; - __u8 sl; - __u8 mtu_selector; - __u8 rate_selector; - __u8 rate; - __u8 packet_life_time_selector; - __u8 packet_life_time; - __u8 preference; -}; - struct ib_ucm_req { __u32 
id; __u32 qpn; @@ -304,8 +230,8 @@ }; struct ib_ucm_req_event_resp { - struct ib_ucm_path_rec primary_path; - struct ib_ucm_path_rec alternate_path; + struct ib_user_path_rec primary_path; + struct ib_user_path_rec alternate_path; __be64 remote_ca_guid; __u32 remote_qkey; __u32 remote_qpn; @@ -349,7 +275,7 @@ }; struct ib_ucm_lap_event_resp { - struct ib_ucm_path_rec path; + struct ib_user_path_rec path; }; struct ib_ucm_apr_event_resp { Index: linux-kernel/infiniband/include/rdma/ib_user_sa.h =================================================================== --- linux-kernel/infiniband/include/rdma/ib_user_sa.h (revision 0) +++ linux-kernel/infiniband/include/rdma/ib_user_sa.h (revision 0) @@ -0,0 +1,60 @@ +/* + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef IB_USER_SA_H +#define IB_USER_SA_H + +#include + +struct ib_user_path_rec { + __u8 dgid[16]; + __u8 sgid[16]; + __be16 dlid; + __be16 slid; + __u32 raw_traffic; + __be32 flow_label; + __u32 reversible; + __u32 mtu; + __be16 pkey; + __u8 hop_limit; + __u8 traffic_class; + __u8 numb_path; + __u8 sl; + __u8 mtu_selector; + __u8 rate_selector; + __u8 rate; + __u8 packet_life_time_selector; + __u8 packet_life_time; + __u8 preference; +}; + +#endif /* IB_USER_SA_H */ Index: linux-kernel/infiniband/include/rdma/ib_marshall.h =================================================================== --- linux-kernel/infiniband/include/rdma/ib_marshall.h (revision 0) +++ linux-kernel/infiniband/include/rdma/ib_marshall.h (revision 0) @@ -0,0 +1,50 @@ +/* + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#if !defined(IB_USER_MARSHALL_H) +#define IB_USER_MARSHALL_H + +#include +#include +#include +#include + +void ib_copy_qp_attr_to_user(struct ib_uverbs_qp_attr *dst, + struct ib_qp_attr *src); + +void ib_copy_path_rec_to_user(struct ib_user_path_rec *dst, + struct ib_sa_path_rec *src); + +void ib_copy_path_rec_from_user(struct ib_sa_path_rec *dst, + struct ib_user_path_rec *src); + +#endif /* IB_USER_MARSHALL_H */ Index: linux-kernel/infiniband/core/uverbs_marshall.c =================================================================== --- linux-kernel/infiniband/core/uverbs_marshall.c (revision 0) +++ linux-kernel/infiniband/core/uverbs_marshall.c (revision 0) @@ -0,0 +1,138 @@ +/* + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include + +static void ib_copy_ah_attr_to_user(struct ib_uverbs_ah_attr *dst, + struct ib_ah_attr *src) +{ + memcpy(dst->grh.dgid, src->grh.dgid.raw, sizeof src->grh.dgid); + dst->grh.flow_label = src->grh.flow_label; + dst->grh.sgid_index = src->grh.sgid_index; + dst->grh.hop_limit = src->grh.hop_limit; + dst->grh.traffic_class = src->grh.traffic_class; + dst->dlid = src->dlid; + dst->sl = src->sl; + dst->src_path_bits = src->src_path_bits; + dst->static_rate = src->static_rate; + dst->is_global = src->ah_flags & IB_AH_GRH ? 
1 : 0; + dst->port_num = src->port_num; +} + +void ib_copy_qp_attr_to_user(struct ib_uverbs_qp_attr *dst, + struct ib_qp_attr *src) +{ + dst->cur_qp_state = src->cur_qp_state; + dst->path_mtu = src->path_mtu; + dst->path_mig_state = src->path_mig_state; + dst->qkey = src->qkey; + dst->rq_psn = src->rq_psn; + dst->sq_psn = src->sq_psn; + dst->dest_qp_num = src->dest_qp_num; + dst->qp_access_flags = src->qp_access_flags; + + dst->max_send_wr = src->cap.max_send_wr; + dst->max_recv_wr = src->cap.max_recv_wr; + dst->max_send_sge = src->cap.max_send_sge; + dst->max_recv_sge = src->cap.max_recv_sge; + dst->max_inline_data = src->cap.max_inline_data; + + ib_copy_ah_attr_to_user(&dst->ah_attr, &src->ah_attr); + ib_copy_ah_attr_to_user(&dst->alt_ah_attr, &src->alt_ah_attr); + + dst->pkey_index = src->pkey_index; + dst->alt_pkey_index = src->alt_pkey_index; + dst->en_sqd_async_notify = src->en_sqd_async_notify; + dst->sq_draining = src->sq_draining; + dst->max_rd_atomic = src->max_rd_atomic; + dst->max_dest_rd_atomic = src->max_dest_rd_atomic; + dst->min_rnr_timer = src->min_rnr_timer; + dst->port_num = src->port_num; + dst->timeout = src->timeout; + dst->retry_cnt = src->retry_cnt; + dst->rnr_retry = src->rnr_retry; + dst->alt_port_num = src->alt_port_num; + dst->alt_timeout = src->alt_timeout; +} +EXPORT_SYMBOL(ib_copy_qp_attr_to_user); + +void ib_copy_path_rec_to_user(struct ib_user_path_rec *dst, + struct ib_sa_path_rec *src) +{ + memcpy(dst->dgid, src->dgid.raw, sizeof src->dgid); + memcpy(dst->sgid, src->sgid.raw, sizeof src->sgid); + + dst->dlid = src->dlid; + dst->slid = src->slid; + dst->raw_traffic = src->raw_traffic; + dst->flow_label = src->flow_label; + dst->hop_limit = src->hop_limit; + dst->traffic_class = src->traffic_class; + dst->reversible = src->reversible; + dst->numb_path = src->numb_path; + dst->pkey = src->pkey; + dst->sl = src->sl; + dst->mtu_selector = src->mtu_selector; + dst->mtu = src->mtu; + dst->rate_selector = src->rate_selector; + dst->rate = 
src->rate; + dst->packet_life_time = src->packet_life_time; + dst->preference = src->preference; + dst->packet_life_time_selector = src->packet_life_time_selector; +} +EXPORT_SYMBOL(ib_copy_path_rec_to_user); + +void ib_copy_path_rec_from_user(struct ib_sa_path_rec *dst, + struct ib_user_path_rec *src) +{ + memcpy(dst->dgid.raw, src->dgid, sizeof dst->dgid); + memcpy(dst->sgid.raw, src->sgid, sizeof dst->sgid); + + dst->dlid = src->dlid; + dst->slid = src->slid; + dst->raw_traffic = src->raw_traffic; + dst->flow_label = src->flow_label; + dst->hop_limit = src->hop_limit; + dst->traffic_class = src->traffic_class; + dst->reversible = src->reversible; + dst->numb_path = src->numb_path; + dst->pkey = src->pkey; + dst->sl = src->sl; + dst->mtu_selector = src->mtu_selector; + dst->mtu = src->mtu; + dst->rate_selector = src->rate_selector; + dst->rate = src->rate; + dst->packet_life_time = src->packet_life_time; + dst->preference = src->preference; + dst->packet_life_time_selector = src->packet_life_time_selector; +} +EXPORT_SYMBOL(ib_copy_path_rec_from_user); Index: linux-kernel/infiniband/core/Makefile =================================================================== --- linux-kernel/infiniband/core/Makefile (revision 4017) +++ linux-kernel/infiniband/core/Makefile (working copy) @@ -3,7 +3,7 @@ obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_ping.o ib_cm.o \ ib_sa.o ib_at.o ib_addr.o rdma_cm.o obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o -obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o ib_uat.o +obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o ib_uat.o rdma_ucm.o ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ device.o fmr_pool.o cache.o @@ -16,13 +16,16 @@ rdma_cm-y := cma.o +rdma_ucm-y := ucma.o + ib_addr-y := addr.o ib_sa-y := sa_query.o ib_umad-y := user_mad.o -ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_mem.o +ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_mem.o \ + uverbs_marshall.o ib_ucm-y := ucm.o Index: 
linux-kernel/infiniband/core/cma.c =================================================================== --- linux-kernel/infiniband/core/cma.c (revision 4017) +++ linux-kernel/infiniband/core/cma.c (working copy) @@ -101,6 +101,10 @@ struct ib_sa_query *query; int query_id; struct ib_cm_id *cm_id; + + u32 qp_num; + enum ib_qp_type qp_type; + u8 srq; }; struct cma_addr { @@ -294,6 +298,9 @@ goto err; id->qp = qp; + id_priv->qp_num = qp->qp_num; + id_priv->qp_type = qp->qp_type; + id_priv->srq = (qp->srq != NULL); return 0; err: ib_destroy_qp(qp); @@ -307,51 +314,82 @@ } EXPORT_SYMBOL(rdma_destroy_qp); -static int cma_modify_ib_qp_rtr(struct rdma_id_private *id_priv) +static int cma_modify_qp_rtr(struct rdma_cm_id *id) { struct ib_qp_attr qp_attr; int qp_attr_mask, ret; + if (!id->qp) + return 0; + /* Need to update QP attributes from default values. */ qp_attr.qp_state = IB_QPS_INIT; - ret = ib_cm_init_qp_attr(id_priv->cm_id, &qp_attr, &qp_attr_mask); + ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask); if (ret) return ret; - ret = ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask); + ret = ib_modify_qp(id->qp, &qp_attr, qp_attr_mask); if (ret) return ret; qp_attr.qp_state = IB_QPS_RTR; - ret = ib_cm_init_qp_attr(id_priv->cm_id, &qp_attr, &qp_attr_mask); + ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask); if (ret) return ret; - qp_attr.rq_psn = id_priv->id.qp->qp_num; - return ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask); + return ib_modify_qp(id->qp, &qp_attr, qp_attr_mask); } -static int cma_modify_ib_qp_rts(struct rdma_id_private *id_priv) +static int cma_modify_qp_rts(struct rdma_cm_id *id) { struct ib_qp_attr qp_attr; int qp_attr_mask, ret; + if (!id->qp) + return 0; + qp_attr.qp_state = IB_QPS_RTS; - ret = ib_cm_init_qp_attr(id_priv->cm_id, &qp_attr, &qp_attr_mask); + ret = rdma_init_qp_attr(id, &qp_attr, &qp_attr_mask); if (ret) return ret; - return ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask); + return ib_modify_qp(id->qp, &qp_attr, 
qp_attr_mask); } static int cma_modify_qp_err(struct rdma_cm_id *id) { struct ib_qp_attr qp_attr; + if (!id->qp) + return 0; + qp_attr.qp_state = IB_QPS_ERR; return ib_modify_qp(id->qp, &qp_attr, IB_QP_STATE); } +int rdma_init_qp_attr(struct rdma_cm_id *id, struct ib_qp_attr *qp_attr, + int *qp_attr_mask) +{ + struct rdma_id_private *id_priv; + int ret; + + id_priv = container_of(id, struct rdma_id_private, id); + switch (id_priv->id.device->node_type) { + case IB_NODE_CA: + ret = ib_cm_init_qp_attr(id_priv->cm_id, qp_attr, + qp_attr_mask); + if (qp_attr->qp_state == IB_QPS_RTR) + qp_attr->rq_psn = id_priv->qp_num; + break; + default: + ret = -ENOSYS; + break; + } + + return ret; +} +EXPORT_SYMBOL(rdma_init_qp_attr); + static int cma_verify_addr(struct cma_addr *addr, struct sockaddr_in *ip_addr) { @@ -497,14 +535,14 @@ { int ret; - ret = cma_modify_ib_qp_rtr(id_priv); + ret = cma_modify_qp_rtr(&id_priv->id); if (ret) goto reject; - ret = cma_modify_ib_qp_rts(id_priv); + ret = cma_modify_qp_rts(&id_priv->id); if (ret) goto reject; - + ret = ib_send_cm_rtu(id_priv->cm_id, NULL, 0); if (ret) goto reject; @@ -521,7 +559,7 @@ { int ret; - ret = cma_modify_ib_qp_rts(id_priv); + ret = cma_modify_qp_rts(&id_priv->id); if (ret) goto reject; @@ -551,9 +589,12 @@ status = -ETIMEDOUT; break; case IB_CM_REP_RECEIVED: - status = cma_rep_recv(id_priv); - event = status ? RDMA_CM_EVENT_CONNECT_ERROR : - RDMA_CM_EVENT_ESTABLISHED; + if (id_priv->id.qp) { + status = cma_rep_recv(id_priv); + event = status ? 
RDMA_CM_EVENT_CONNECT_ERROR : + RDMA_CM_EVENT_ESTABLISHED; + } else + event = RDMA_CM_EVENT_CONNECT_RESPONSE; private_data_len = IB_CM_REP_PRIVATE_DATA_SIZE; break; case IB_CM_RTU_RECEIVED: @@ -765,7 +806,7 @@ cma_attach_to_dev(dev_id_priv, cma_dev); list_add_tail(&dev_id_priv->listen_list, &id_priv->listen_list); - ret = rdma_listen(id); + ret = rdma_listen(id, 0); if (ret) goto err; @@ -792,7 +833,7 @@ return ret; } -int rdma_listen(struct rdma_cm_id *id) +int rdma_listen(struct rdma_cm_id *id, int backlog) { struct rdma_id_private *id_priv; int ret; @@ -877,7 +918,7 @@ memset(&path_rec, 0, sizeof path_rec); path_rec.sgid = addr->sgid; path_rec.dgid = addr->dgid; - path_rec.pkey = addr->pkey; + path_rec.pkey = cpu_to_be16(addr->pkey); path_rec.numb_path = 1; id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device, @@ -1062,8 +1103,8 @@ req.alternate_path = &route->path_rec[1]; req.service_id = cma_get_service_id(&route->addr.dst_addr); - req.qp_num = id_priv->id.qp->qp_num; - req.qp_type = IB_QPT_RC; + req.qp_num = id_priv->qp_num; + req.qp_type = id_priv->qp_type; req.starting_psn = req.qp_num; req.responder_resources = conn_param->responder_resources; req.initiator_depth = conn_param->initiator_depth; @@ -1073,7 +1114,7 @@ req.remote_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT; req.local_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT; req.max_cm_retries = CMA_MAX_CM_RETRIES; - req.srq = id_priv->id.qp->srq ? 1 : 0; + req.srq = id_priv->srq ? 
1 : 0; ret = ib_send_cm_req(id_priv->cm_id, &req); out: @@ -1090,6 +1131,12 @@ if (!cma_comp_exch(id_priv, CMA_ROUTE_RESOLVED, CMA_CONNECT)) return -EINVAL; + if (!id->qp) { + id_priv->qp_num = conn_param->qp_num; + id_priv->qp_type = conn_param->qp_type; + id_priv->srq = conn_param->srq; + } + switch (id->device->node_type) { case IB_NODE_CA: ret = cma_connect_ib(id_priv, conn_param); @@ -1114,12 +1161,12 @@ struct ib_cm_rep_param rep; int ret; - ret = cma_modify_ib_qp_rtr(id_priv); + ret = cma_modify_qp_rtr(&id_priv->id); if (ret) return ret; memset(&rep, 0, sizeof rep); - rep.qp_num = id_priv->id.qp->qp_num; + rep.qp_num = id_priv->qp_num; rep.starting_psn = rep.qp_num; rep.private_data = conn_param->private_data; rep.private_data_len = conn_param->private_data_len; @@ -1129,7 +1176,7 @@ rep.failover_accepted = 0; rep.flow_control = conn_param->flow_control; rep.rnr_retry_count = conn_param->rnr_retry_count; - rep.srq = id_priv->id.qp->srq ? 1 : 0; + rep.srq = id_priv->srq ? 1 : 0; return ib_send_cm_rep(id_priv->cm_id, &rep); } @@ -1143,9 +1190,18 @@ if (!cma_comp(id_priv, CMA_CONNECT)) return -EINVAL; + if (!id->qp && conn_param) { + id_priv->qp_num = conn_param->qp_num; + id_priv->qp_type = conn_param->qp_type; + id_priv->srq = conn_param->srq; + } + switch (id->device->node_type) { case IB_NODE_CA: - ret = cma_accept_ib(id_priv, conn_param); + if (conn_param) + ret = cma_accept_ib(id_priv, conn_param); + else + ret = cma_rep_recv(id_priv); break; default: ret = -ENOSYS; Index: linux-kernel/infiniband/core/ucm.c =================================================================== --- linux-kernel/infiniband/core/ucm.c (revision 4017) +++ linux-kernel/infiniband/core/ucm.c (working copy) @@ -47,6 +47,7 @@ #include #include +#include MODULE_AUTHOR("Libor Michalek"); MODULE_DESCRIPTION("InfiniBand userspace Connection Manager access"); @@ -202,36 +203,6 @@ return NULL; } -static void ib_ucm_event_path_get(struct ib_ucm_path_rec *upath, - struct ib_sa_path_rec 
*kpath) -{ - if (!kpath || !upath) - return; - - memcpy(upath->dgid, kpath->dgid.raw, sizeof *upath->dgid); - memcpy(upath->sgid, kpath->sgid.raw, sizeof *upath->sgid); - - upath->dlid = kpath->dlid; - upath->slid = kpath->slid; - upath->raw_traffic = kpath->raw_traffic; - upath->flow_label = kpath->flow_label; - upath->hop_limit = kpath->hop_limit; - upath->traffic_class = kpath->traffic_class; - upath->reversible = kpath->reversible; - upath->numb_path = kpath->numb_path; - upath->pkey = kpath->pkey; - upath->sl = kpath->sl; - upath->mtu_selector = kpath->mtu_selector; - upath->mtu = kpath->mtu; - upath->rate_selector = kpath->rate_selector; - upath->rate = kpath->rate; - upath->packet_life_time = kpath->packet_life_time; - upath->preference = kpath->preference; - - upath->packet_life_time_selector = - kpath->packet_life_time_selector; -} - static void ib_ucm_event_req_get(struct ib_ucm_req_event_resp *ureq, struct ib_cm_req_event_param *kreq) { @@ -250,8 +221,10 @@ ureq->srq = kreq->srq; ureq->port = kreq->port; - ib_ucm_event_path_get(&ureq->primary_path, kreq->primary_path); - ib_ucm_event_path_get(&ureq->alternate_path, kreq->alternate_path); + ib_copy_path_rec_to_user(&ureq->primary_path, kreq->primary_path); + if (kreq->alternate_path) + ib_copy_path_rec_to_user(&ureq->alternate_path, + kreq->alternate_path); } static void ib_ucm_event_rep_get(struct ib_ucm_rep_event_resp *urep, @@ -321,8 +294,8 @@ info = evt->param.rej_rcvd.ari; break; case IB_CM_LAP_RECEIVED: - ib_ucm_event_path_get(&uvt->resp.u.lap_resp.path, - evt->param.lap_rcvd.alternate_path); + ib_copy_path_rec_to_user(&uvt->resp.u.lap_resp.path, + evt->param.lap_rcvd.alternate_path); uvt->data_len = IB_CM_LAP_PRIVATE_DATA_SIZE; uvt->resp.present = IB_UCM_PRES_ALTERNATE; break; @@ -634,65 +607,11 @@ return result; } -static void ib_ucm_copy_ah_attr(struct ib_ucm_ah_attr *dest_attr, - struct ib_ah_attr *src_attr) -{ - memcpy(dest_attr->grh_dgid, src_attr->grh.dgid.raw, - sizeof src_attr->grh.dgid); - 
dest_attr->grh_flow_label = src_attr->grh.flow_label; - dest_attr->grh_sgid_index = src_attr->grh.sgid_index; - dest_attr->grh_hop_limit = src_attr->grh.hop_limit; - dest_attr->grh_traffic_class = src_attr->grh.traffic_class; - - dest_attr->dlid = src_attr->dlid; - dest_attr->sl = src_attr->sl; - dest_attr->src_path_bits = src_attr->src_path_bits; - dest_attr->static_rate = src_attr->static_rate; - dest_attr->is_global = (src_attr->ah_flags & IB_AH_GRH); - dest_attr->port_num = src_attr->port_num; -} - -static void ib_ucm_copy_qp_attr(struct ib_ucm_init_qp_attr_resp *dest_attr, - struct ib_qp_attr *src_attr) -{ - dest_attr->cur_qp_state = src_attr->cur_qp_state; - dest_attr->path_mtu = src_attr->path_mtu; - dest_attr->path_mig_state = src_attr->path_mig_state; - dest_attr->qkey = src_attr->qkey; - dest_attr->rq_psn = src_attr->rq_psn; - dest_attr->sq_psn = src_attr->sq_psn; - dest_attr->dest_qp_num = src_attr->dest_qp_num; - dest_attr->qp_access_flags = src_attr->qp_access_flags; - - dest_attr->max_send_wr = src_attr->cap.max_send_wr; - dest_attr->max_recv_wr = src_attr->cap.max_recv_wr; - dest_attr->max_send_sge = src_attr->cap.max_send_sge; - dest_attr->max_recv_sge = src_attr->cap.max_recv_sge; - dest_attr->max_inline_data = src_attr->cap.max_inline_data; - - ib_ucm_copy_ah_attr(&dest_attr->ah_attr, &src_attr->ah_attr); - ib_ucm_copy_ah_attr(&dest_attr->alt_ah_attr, &src_attr->alt_ah_attr); - - dest_attr->pkey_index = src_attr->pkey_index; - dest_attr->alt_pkey_index = src_attr->alt_pkey_index; - dest_attr->en_sqd_async_notify = src_attr->en_sqd_async_notify; - dest_attr->sq_draining = src_attr->sq_draining; - dest_attr->max_rd_atomic = src_attr->max_rd_atomic; - dest_attr->max_dest_rd_atomic = src_attr->max_dest_rd_atomic; - dest_attr->min_rnr_timer = src_attr->min_rnr_timer; - dest_attr->port_num = src_attr->port_num; - dest_attr->timeout = src_attr->timeout; - dest_attr->retry_cnt = src_attr->retry_cnt; - dest_attr->rnr_retry = src_attr->rnr_retry; - 
dest_attr->alt_port_num = src_attr->alt_port_num; - dest_attr->alt_timeout = src_attr->alt_timeout; -} - static ssize_t ib_ucm_init_qp_attr(struct ib_ucm_file *file, const char __user *inbuf, int in_len, int out_len) { - struct ib_ucm_init_qp_attr_resp resp; + struct ib_uverbs_qp_attr resp; struct ib_ucm_init_qp_attr cmd; struct ib_ucm_context *ctx; struct ib_qp_attr qp_attr; @@ -715,7 +634,7 @@ if (result) goto out; - ib_ucm_copy_qp_attr(&resp, &qp_attr); + ib_copy_qp_attr_to_user(&resp, &qp_attr); if (copy_to_user((void __user *)(unsigned long)cmd.response, &resp, sizeof(resp))) @@ -790,7 +709,7 @@ static int ib_ucm_path_get(struct ib_sa_path_rec **path, u64 src) { - struct ib_ucm_path_rec ucm_path; + struct ib_user_path_rec upath; struct ib_sa_path_rec *sa_path; *path = NULL; @@ -802,36 +721,14 @@ if (!sa_path) return -ENOMEM; - if (copy_from_user(&ucm_path, (void __user *)(unsigned long)src, - sizeof(ucm_path))) { + if (copy_from_user(&upath, (void __user *)(unsigned long)src, + sizeof(upath))) { kfree(sa_path); return -EFAULT; } - memcpy(sa_path->dgid.raw, ucm_path.dgid, sizeof sa_path->dgid); - memcpy(sa_path->sgid.raw, ucm_path.sgid, sizeof sa_path->sgid); - - sa_path->dlid = ucm_path.dlid; - sa_path->slid = ucm_path.slid; - sa_path->raw_traffic = ucm_path.raw_traffic; - sa_path->flow_label = ucm_path.flow_label; - sa_path->hop_limit = ucm_path.hop_limit; - sa_path->traffic_class = ucm_path.traffic_class; - sa_path->reversible = ucm_path.reversible; - sa_path->numb_path = ucm_path.numb_path; - sa_path->pkey = ucm_path.pkey; - sa_path->sl = ucm_path.sl; - sa_path->mtu_selector = ucm_path.mtu_selector; - sa_path->mtu = ucm_path.mtu; - sa_path->rate_selector = ucm_path.rate_selector; - sa_path->rate = ucm_path.rate; - sa_path->packet_life_time = ucm_path.packet_life_time; - sa_path->preference = ucm_path.preference; - - sa_path->packet_life_time_selector = - ucm_path.packet_life_time_selector; - + ib_copy_path_rec_from_user(sa_path, &upath); *path = sa_path; 
return 0; } Index: linux-kernel/infiniband/core/ucma.c =================================================================== --- linux-kernel/infiniband/core/ucma.c (revision 0) +++ linux-kernel/infiniband/core/ucma.c (revision 0) @@ -0,0 +1,788 @@ +/* + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include +#include +#include +#include + +#include +#include +#include + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("RDMA Userspace Connection Manager Access"); +MODULE_LICENSE("Dual BSD/GPL"); + +struct ucma_file { + struct semaphore mutex; + struct file *filp; + struct list_head ctxs; + struct list_head events; + wait_queue_head_t poll_wait; +}; + +struct ucma_context { + int id; + wait_queue_head_t wait; + atomic_t ref; + int events_reported; + + struct ucma_file *file; + struct rdma_cm_id *cm_id; + __u64 uid; + + struct list_head events; /* list of pending events. */ + struct list_head file_list; /* member in file ctx list */ +}; + +struct ucma_event { + struct ucma_context *ctx; + struct list_head file_list; /* member in file event list */ + struct list_head ctx_list; /* member in ctx event list */ + struct rdma_cm_id *cm_id; + struct rdma_ucm_event_resp resp; +}; + +static DECLARE_MUTEX(ctx_mutex); +static DEFINE_IDR(ctx_idr); + +static struct ucma_context* ucma_get_ctx(struct ucma_file *file, int id) +{ + struct ucma_context *ctx; + + down(&ctx_mutex); + ctx = idr_find(&ctx_idr, id); + if (!ctx) + ctx = ERR_PTR(-ENOENT); + else if (ctx->file != file) + ctx = ERR_PTR(-EINVAL); + else + atomic_inc(&ctx->ref); + up(&ctx_mutex); + + return ctx; +} + +static void ucma_put_ctx(struct ucma_context *ctx) +{ + if (atomic_dec_and_test(&ctx->ref)) + wake_up(&ctx->wait); +} + +static inline int ucma_new_cm_id(int event) +{ + return event == RDMA_CM_EVENT_CONNECT_REQUEST; +} + +static void ucma_cleanup_events(struct ucma_context *ctx) +{ + struct ucma_event *uevent; + + down(&ctx->file->mutex); + list_del(&ctx->file_list); + while (!list_empty(&ctx->events)) { + + uevent = list_entry(ctx->events.next, struct ucma_event, + ctx_list); + list_del(&uevent->file_list); + list_del(&uevent->ctx_list); + + /* clear incoming connections. 
*/ + if (ucma_new_cm_id(uevent->resp.event)) + rdma_destroy_id(uevent->cm_id); + + kfree(uevent); + } + up(&ctx->file->mutex); +} + +static struct ucma_context* ucma_alloc_ctx(struct ucma_file *file) +{ + struct ucma_context *ctx; + int ret; + + ctx = kmalloc(sizeof(*ctx), GFP_KERNEL); + if (!ctx) + return NULL; + + memset(ctx, 0, sizeof *ctx); + atomic_set(&ctx->ref, 1); + init_waitqueue_head(&ctx->wait); + ctx->file = file; + INIT_LIST_HEAD(&ctx->events); + + do { + ret = idr_pre_get(&ctx_idr, GFP_KERNEL); + if (!ret) + goto error; + + down(&ctx_mutex); + ret = idr_get_new(&ctx_idr, ctx, &ctx->id); + up(&ctx_mutex); + } while (ret == -EAGAIN); + + if (ret) + goto error; + + list_add_tail(&ctx->file_list, &file->ctxs); + return ctx; + +error: + kfree(ctx); + return NULL; +} + +static int ucma_event_handler(struct rdma_cm_id *cm_id, + struct rdma_cm_event *event) +{ + struct ucma_event *uevent; + struct ucma_context *ctx = cm_id->context; + + uevent = kmalloc(sizeof(*uevent), GFP_KERNEL); + if (!uevent) + return ucma_new_cm_id(event->event); /* Destroy new IDs. 
*/ + + memset(uevent, 0, sizeof(*uevent)); + uevent->ctx = ctx; + uevent->cm_id = cm_id; + uevent->resp.uid = ctx->uid; + uevent->resp.id = ctx->id; + uevent->resp.event = event->event; + uevent->resp.status = event->status; + if ((uevent->resp.private_data_len = event->private_data_len)) + memcpy(uevent->resp.private_data, event->private_data, + event->private_data_len); + + down(&ctx->file->mutex); + list_add_tail(&uevent->file_list, &ctx->file->events); + list_add_tail(&uevent->ctx_list, &ctx->events); + wake_up_interruptible(&ctx->file->poll_wait); + up(&ctx->file->mutex); + return 0; +} + +static ssize_t ucma_get_event(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct ucma_context *ctx; + struct rdma_ucm_get_event cmd; + struct ucma_event *uevent; + int ret = 0; + DEFINE_WAIT(wait); + + if (out_len < sizeof(struct rdma_ucm_event_resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + down(&file->mutex); + while (list_empty(&file->events)) { + if (file->filp->f_flags & O_NONBLOCK) { + ret = -EAGAIN; + break; + } + + if (signal_pending(current)) { + ret = -ERESTARTSYS; + break; + } + + prepare_to_wait(&file->poll_wait, &wait, TASK_INTERRUPTIBLE); + up(&file->mutex); + schedule(); + down(&file->mutex); + finish_wait(&file->poll_wait, &wait); + } + + if (ret) + goto done; + + uevent = list_entry(file->events.next, struct ucma_event, file_list); + + if (ucma_new_cm_id(uevent->resp.event)) { + ctx = ucma_alloc_ctx(file); + if (!ctx) { + ret = -ENOMEM; + goto done; + } + + ctx->cm_id = uevent->cm_id; + ctx->cm_id->context = ctx; + uevent->resp.id = ctx->id; + } + + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &uevent->resp, sizeof(uevent->resp))) { + ret = -EFAULT; + goto done; + } + + list_del(&uevent->file_list); + list_del(&uevent->ctx_list); + uevent->ctx->events_reported++; + kfree(uevent); +done: + up(&file->mutex); + return ret; +} + +static ssize_t 
ucma_create_id(struct ucma_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_create_id cmd; + struct rdma_ucm_create_id_resp resp; + struct ucma_context *ctx; + int ret; + + if (out_len < sizeof(resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + down(&file->mutex); + ctx = ucma_alloc_ctx(file); + up(&file->mutex); + if (!ctx) + return -ENOMEM; + + ctx->uid = cmd.uid; + ctx->cm_id = rdma_create_id(ucma_event_handler, ctx); + if (IS_ERR(ctx->cm_id)) { + ret = PTR_ERR(ctx->cm_id); + goto err1; + } + + resp.id = ctx->id; + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &resp, sizeof(resp))) { + ret = -EFAULT; + goto err2; + } + return 0; + +err2: + rdma_destroy_id(ctx->cm_id); +err1: + down(&ctx_mutex); + idr_remove(&ctx_idr, ctx->id); + up(&ctx_mutex); + kfree(ctx); + return ret; +} + +static ssize_t ucma_destroy_id(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_destroy_id cmd; + struct rdma_ucm_destroy_id_resp resp; + struct ucma_context *ctx; + int ret = 0; + + if (out_len < sizeof(resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + down(&ctx_mutex); + ctx = idr_find(&ctx_idr, cmd.id); + if (!ctx) + ctx = ERR_PTR(-ENOENT); + else if (ctx->file != file) + ctx = ERR_PTR(-EINVAL); + else + idr_remove(&ctx_idr, ctx->id); + up(&ctx_mutex); + + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + atomic_dec(&ctx->ref); + wait_event(ctx->wait, !atomic_read(&ctx->ref)); + + /* No new events will be generated after destroying the id. */ + rdma_destroy_id(ctx->cm_id); + /* Cleanup events not yet reported to the user. 
*/ + ucma_cleanup_events(ctx); + + resp.events_reported = ctx->events_reported; + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &resp, sizeof(resp))) + ret = -EFAULT; + + kfree(ctx); + return ret; +} + +static ssize_t ucma_bind_addr(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_bind_addr cmd; + struct rdma_ucm_bind_addr_resp resp; + struct ucma_context *ctx; + int ret = 0; + + if (out_len < sizeof(resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + ret = rdma_bind_addr(ctx->cm_id, (struct sockaddr *) &cmd.addr); + if (ret) + goto out; + + if (ctx->cm_id->device) + resp.node_guid = ctx->cm_id->device->node_guid; + else + resp.node_guid = 0; + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &resp, sizeof(resp))) + ret = -EFAULT; + +out: + ucma_put_ctx(ctx); + return ret; +} + +static ssize_t ucma_resolve_addr(struct ucma_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_resolve_addr cmd; + struct ucma_context *ctx; + int ret = 0; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + ret = rdma_resolve_addr(ctx->cm_id, (struct sockaddr *) &cmd.src_addr, + (struct sockaddr *) &cmd.dst_addr, + cmd.timeout_ms); + ucma_put_ctx(ctx); + return ret; +} + +static ssize_t ucma_resolve_route(struct ucma_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_resolve_route cmd; + struct ucma_context *ctx; + int ret = 0; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + ret = rdma_resolve_route(ctx->cm_id, cmd.timeout_ms); + ucma_put_ctx(ctx); + return ret; +} + +static void ucma_copy_ib_route(struct 
rdma_ucm_query_route_resp *resp, + struct rdma_route *route) +{ + struct ib_addr *ibaddr; + + resp->num_paths = route->num_paths; + switch (route->num_paths) { + case 0: + ibaddr = &route->addr.addr.ibaddr; + memcpy(&resp->ib_route[0].dgid, ibaddr->dgid.raw, + sizeof ibaddr->dgid); + memcpy(&resp->ib_route[0].sgid, ibaddr->sgid.raw, + sizeof ibaddr->sgid); + resp->ib_route[0].pkey = cpu_to_be16(ibaddr->pkey); + break; + case 2: + ib_copy_path_rec_to_user(&resp->ib_route[1], + &route->path_rec[1]); + /* fall through */ + case 1: + ib_copy_path_rec_to_user(&resp->ib_route[0], + &route->path_rec[0]); + break; + default: + break; + } +} + +static ssize_t ucma_query_route(struct ucma_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_query_route cmd; + struct rdma_ucm_query_route_resp resp; + struct ucma_context *ctx; + struct sockaddr *addr; + int ret = 0; + + if (out_len < sizeof(resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + if (!ctx->cm_id->device) { + ret = -ENODEV; + goto out; + } + + addr = &ctx->cm_id->route.addr.src_addr; + memcpy(&resp.src_addr, addr, addr->sa_family == AF_INET ? 
+ sizeof(struct sockaddr_in) : + sizeof(struct sockaddr_in6)); + resp.node_guid = ctx->cm_id->device->node_guid; + switch (ctx->cm_id->device->node_type) { + case IB_NODE_CA: + ucma_copy_ib_route(&resp, &ctx->cm_id->route); + default: + break; + } + + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &resp, sizeof(resp))) + ret = -EFAULT; + +out: + ucma_put_ctx(ctx); + return ret; +} + +static void ucma_copy_conn_param(struct rdma_conn_param *dst_conn, + struct rdma_ucm_conn_param *src_conn) +{ + dst_conn->private_data = src_conn->private_data; + dst_conn->private_data_len = src_conn->private_data_len; + dst_conn->responder_resources =src_conn->responder_resources; + dst_conn->initiator_depth = src_conn->initiator_depth; + dst_conn->flow_control = src_conn->flow_control; + dst_conn->retry_count = src_conn->retry_count; + dst_conn->rnr_retry_count = src_conn->rnr_retry_count; + dst_conn->srq = src_conn->srq; + dst_conn->qp_num = src_conn->qp_num; + dst_conn->qp_type = src_conn->qp_type; +} + +static ssize_t ucma_connect(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_connect cmd; + struct rdma_conn_param conn_param; + struct ucma_context *ctx; + int ret = 0; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + if (!cmd.conn_param.valid) + return -EINVAL; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + ucma_copy_conn_param(&conn_param, &cmd.conn_param); + ret = rdma_connect(ctx->cm_id, &conn_param); + ucma_put_ctx(ctx); + return ret; +} + +static ssize_t ucma_listen(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_listen cmd; + struct ucma_context *ctx; + int ret; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + ret = rdma_listen(ctx->cm_id, cmd.backlog); + ucma_put_ctx(ctx); + return ret; +} + +static 
ssize_t ucma_accept(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_accept cmd; + struct rdma_conn_param conn_param; + struct ucma_context *ctx; + int ret = 0; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + if (cmd.conn_param.valid) { + ctx->uid = cmd.uid; + ucma_copy_conn_param(&conn_param, &cmd.conn_param); + ret = rdma_accept(ctx->cm_id, &conn_param); + } else + ret = rdma_accept(ctx->cm_id, NULL); + + ucma_put_ctx(ctx); + return ret; +} + +static ssize_t ucma_reject(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_reject cmd; + struct ucma_context *ctx; + int ret = 0; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + ret = rdma_reject(ctx->cm_id, cmd.private_data, cmd.private_data_len); + ucma_put_ctx(ctx); + return ret; +} + +static ssize_t ucma_disconnect(struct ucma_file *file, const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_disconnect cmd; + struct ucma_context *ctx; + int ret = 0; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + ret = rdma_disconnect(ctx->cm_id); + ucma_put_ctx(ctx); + return ret; +} + +static ssize_t ucma_init_qp_attr(struct ucma_file *file, + const char __user *inbuf, + int in_len, int out_len) +{ + struct rdma_ucm_init_qp_attr cmd; + struct ib_uverbs_qp_attr resp; + struct ucma_context *ctx; + struct ib_qp_attr qp_attr; + int ret = 0; + + if (out_len < sizeof(resp)) + return -ENOSPC; + + if (copy_from_user(&cmd, inbuf, sizeof(cmd))) + return -EFAULT; + + ctx = ucma_get_ctx(file, cmd.id); + if (IS_ERR(ctx)) + return PTR_ERR(ctx); + + resp.qp_attr_mask = 0; + memset(&qp_attr, 0, sizeof qp_attr); + 
qp_attr.qp_state = cmd.qp_state; + ret = rdma_init_qp_attr(ctx->cm_id, &qp_attr, &resp.qp_attr_mask); + if (ret) + goto out; + + ib_copy_qp_attr_to_user(&resp, &qp_attr); + if (copy_to_user((void __user *)(unsigned long)cmd.response, + &resp, sizeof(resp))) + ret = -EFAULT; + +out: + ucma_put_ctx(ctx); + return ret; +} + +static ssize_t (*ucma_cmd_table[])(struct ucma_file *file, + const char __user *inbuf, + int in_len, int out_len) = { + [RDMA_USER_CM_CMD_CREATE_ID] = ucma_create_id, + [RDMA_USER_CM_CMD_DESTROY_ID] = ucma_destroy_id, + [RDMA_USER_CM_CMD_BIND_ADDR] = ucma_bind_addr, + [RDMA_USER_CM_CMD_RESOLVE_ADDR] = ucma_resolve_addr, + [RDMA_USER_CM_CMD_RESOLVE_ROUTE]= ucma_resolve_route, + [RDMA_USER_CM_CMD_QUERY_ROUTE] = ucma_query_route, + [RDMA_USER_CM_CMD_CONNECT] = ucma_connect, + [RDMA_USER_CM_CMD_LISTEN] = ucma_listen, + [RDMA_USER_CM_CMD_ACCEPT] = ucma_accept, + [RDMA_USER_CM_CMD_REJECT] = ucma_reject, + [RDMA_USER_CM_CMD_DISCONNECT] = ucma_disconnect, + [RDMA_USER_CM_CMD_INIT_QP_ATTR] = ucma_init_qp_attr, + [RDMA_USER_CM_CMD_GET_EVENT] = ucma_get_event +}; + +static ssize_t ucma_write(struct file *filp, const char __user *buf, + size_t len, loff_t *pos) +{ + struct ucma_file *file = filp->private_data; + struct rdma_ucm_cmd_hdr hdr; + ssize_t ret; + + if (len < sizeof(hdr)) + return -EINVAL; + + if (copy_from_user(&hdr, buf, sizeof(hdr))) + return -EFAULT; + + if (hdr.cmd < 0 || hdr.cmd >= ARRAY_SIZE(ucma_cmd_table)) + return -EINVAL; + + if (hdr.in + sizeof(hdr) > len) + return -EINVAL; + + ret = ucma_cmd_table[hdr.cmd](file, buf + sizeof(hdr), hdr.in, hdr.out); + if (!ret) + ret = len; + + return ret; +} + +static unsigned int ucma_poll(struct file *filp, struct poll_table_struct *wait) +{ + struct ucma_file *file = filp->private_data; + unsigned int mask = 0; + + poll_wait(filp, &file->poll_wait, wait); + + if (!list_empty(&file->events)) + mask = POLLIN | POLLRDNORM; + + return mask; +} + +static int ucma_open(struct inode *inode, struct file 
*filp) +{ + struct ucma_file *file; + + file = kmalloc(sizeof *file, GFP_KERNEL); + if (!file) + return -ENOMEM; + + INIT_LIST_HEAD(&file->events); + INIT_LIST_HEAD(&file->ctxs); + init_waitqueue_head(&file->poll_wait); + init_MUTEX(&file->mutex); + + filp->private_data = file; + file->filp = filp; + return 0; +} + +static int ucma_close(struct inode *inode, struct file *filp) +{ + struct ucma_file *file = filp->private_data; + struct ucma_context *ctx; + + down(&file->mutex); + while (!list_empty(&file->ctxs)) { + ctx = list_entry(file->ctxs.next, struct ucma_context, + file_list); + up(&file->mutex); + + down(&ctx_mutex); + idr_remove(&ctx_idr, ctx->id); + up(&ctx_mutex); + + rdma_destroy_id(ctx->cm_id); + ucma_cleanup_events(ctx); + kfree(ctx); + + down(&file->mutex); + } + up(&file->mutex); + kfree(file); + return 0; +} + +static struct file_operations ucma_fops = { + .owner = THIS_MODULE, + .open = ucma_open, + .release = ucma_close, + .write = ucma_write, + .poll = ucma_poll, +}; + +static struct miscdevice ucma_misc = { + .minor = MISC_DYNAMIC_MINOR, + .name = "rdma_cm", + .fops = &ucma_fops, +}; + +static int __init ucma_init(void) +{ + return misc_register(&ucma_misc); +} + +static void __exit ucma_cleanup(void) +{ + misc_deregister(&ucma_misc); + idr_destroy(&ctx_idr); +} + +module_init(ucma_init); +module_exit(ucma_cleanup); From iod00d at hp.com Thu Nov 10 12:44:14 2005 From: iod00d at hp.com (Grant Grundler) Date: Thu, 10 Nov 2005 12:44:14 -0800 Subject: [openib-general] netperf over SDP bug In-Reply-To: <20050928011700.GA22427@esmail.cup.hp.com> References: <20050928011700.GA22427@esmail.cup.hp.com> Message-ID: <20051110204414.GF29939@esmail.cup.hp.com> On Tue, Sep 27, 2005 at 06:17:00PM -0700, Grant Grundler wrote: > Hi Michael, > I'm trying to collect a full set of netperf TCP_STREAM over SDP for > SVN r3547 on 2.6.13 kernel. But some netperf runs get no throughput. Michael, I was able to reproduce this problem with SVN r3984. 
I've posted the graphs for r3547 and r3984 on:
http://iou.parisc-linux.org/openib-perf-2005/r3547/
http://iou.parisc-linux.org/openib-perf-2005/r3984/

See sdpstream.png in each location.

I'll pursue collecting information you asked for a few weeks ago as time permits.

The above data was collected with "netserver" bound to the same CPU as the one taking IB MSI-X interrupts. This is bad for IPoIB (CPU bound) and good for SDP (CPU cache). I'll rerun the r3984 data and bind the netperf process as well.

BTW, in case I haven't mentioned this before, I set up a parisc-linux box so netperf maintainer Rick Jones could manage his releases using something better than tarballs. netperf 2.x and netperf 4.x (under development) source is available from:
svn co http://www.netperf.org/svn/netperf2/
svn co http://www.netperf.org/svn/netperf4/

thanks,
grant

> Usually when sending 1k to 4k messages. The same netperf parameters
> using IPoIB seem to be working fine - just a lot slower of course.
> Summary of all netperf over SDP runs is appended.
>
> Sample commandline that got < 1Mb/s throughput is:
> LD_PRELOAD=/usr/local/lib/libsdp.so /usr/local/bin/netperf -p 12866 -l 60 -H 10.0.0.30 -t TCP_STREAM -T 1 -- -m 1024 -s 16384 -S 16384
>
> I tried with some smaller -m parameters:
> 512 -> ~270-280 Mb/s
> 640 -> ~200-2100 Mb/s
> 768 -> ~30-50 Mb/s
> 896 -> ~2-6 Mb/s
>
> CPU is essentially idle in the above 512-896 byte cases.
...
From krause at cup.hp.com Thu Nov 10 12:44:50 2005
From: krause at cup.hp.com (Michael Krause)
Date: Thu, 10 Nov 2005 12:44:50 -0800
Subject: [openib-general] [ANNOUNCE] Contribute RDS ( ReliableDatagramSockets) to OpenIB
In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1041618@NT-SJCA-0751.brcm.ad.broadcom.com>
References: <54AD0F12E08D1541B826BE97C98F99F1041618@NT-SJCA-0751.brcm.ad.broadcom.com>
Message-ID: <6.2.0.14.2.20051110124256.02532460@esmail.cup.hp.com>

At 10:48 AM 11/10/2005, Caitlin Bestler wrote:

>Mike Krause wrote in response to Greg Lindahl:
>
> > If it is to be reasonably robust, then RDS should be required to support
> > the resync between the two sides of the communication. This aligns with the
> > stated objective of implementing reliability in one location in software and
> > one location in hardware. Without such resync being required in the ULP,
> > then one ends up with a ULP that falls short of its stated objectives and
> > pushes complexity back up to the application which is where the advocates
> > have stated it is too complex or expensive to get it correct.
>
>I haven't reread all of RDS fine print to double-check this, but my
>impression is that RDS semantics exactly match the subset of MPI
>point-to-point communications where the receiving rank is required
>to have pre-posted buffers before the send is allowed.

My concern is the requirement that RDS resync the structures in the face of failure and know whether to re-transmit or will deal with duplicates. Having pre-posted buffers will help enable the resync to be accomplished but should not be equated to pre-post equals one can deal with duplicates or will verify to prevent duplicates from occurring.

Mike

From caitlinb at broadcom.com Thu Nov 10 13:07:17 2005
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Thu, 10 Nov 2005 13:07:17 -0800
Subject: [openib-general] [ANNOUNCE] Contribute RDS ( ReliableDatagramSockets) to OpenIB
Message-ID: <54AD0F12E08D1541B826BE97C98F99F104162D@NT-SJCA-0751.brcm.ad.broadcom.com>

My concern is the requirement that RDS resync the structures in the face of failure and know whether to re-transmit or will deal with duplicates. Having pre-posted buffers will help enable the resync to be accomplished but should not be equated to pre-post equals one can deal with duplicates or will verify to prevent duplicates from occurring.

Mike

The semantics should be that barring an error the flow between any two endpoints is reliable and ordered. The difference versus a normal point-to-point definition of reliable is that a) lack of a receive buffer is an error, b) the endpoint communicates with many known remote peers (as opposed to one known remote peer, or many unknown).

Having an API with those semantics, particularly as an upgrade in semantics from SOCK_DGRAM while preserving SOCK_DGRAM syntax, is something that I believe is of distinct value to many cluster based applications.

Further the API can be implemented in an offload device (IB or IP) more efficiently than if it is simply implemented on top of SOCK_STREAM sockets by the application.

Documenting and clarifying the semantics to make its general applicability clearer should definitely be done, however.

From Arkady.Kanevsky at netapp.com Thu Nov 10 13:36:33 2005
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Thu, 10 Nov 2005 16:36:33 -0500
Subject: [openib-general] socket based connection model for IB proposal - round 3
Message-ID:

It will be discussed at IBTA SWG meeting next week Tu.
Please, post your comments before that.
Thanks,

Arkady Kanevsky            email: arkady at netapp.com
Network Appliance Inc.     phone: 781-768-5395
275 Totten Pond Rd.        Fax: 781-895-1195
Waltham, MA 02451-2010     central phone: 781-768-5300

-------------- next part --------------
A non-text attachment was scrubbed...
Name: IP Address Support by InfiniBand CM_v3.pdf
Type: application/octet-stream
Size: 23181 bytes
Desc: IP Address Support by InfiniBand CM_v3.pdf
URL:

From jlentini at netapp.com Thu Nov 10 13:39:01 2005
From: jlentini at netapp.com (James Lentini)
Date: Thu, 10 Nov 2005 16:39:01 -0500 (EST)
Subject: [openib-general] Re: [PATCH] uDAPL free build issues cleaned up, print path records returned from uAT
In-Reply-To:
References:
Message-ID:

On Thu, 10 Nov 2005, Arlin Davis wrote:

> James,
>
> I fixed some problems with the free build openib_scm version. Also
> turned down some debugging and added some debug prints for uAT path
> records.
>
> -arlin

Thanks Arlin. Committed in revision 4018.

From Arkady.Kanevsky at netapp.com Thu Nov 10 13:39:47 2005
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Thu, 10 Nov 2005 16:39:47 -0500
Subject: [openib-general] RE: [dat-discussions] socket based connection model for IB proposal - round 3
Message-ID:

Fixed the bit value for formatting indicator.

Arkady Kanevsky            email: arkady at netapp.com
Network Appliance Inc.     phone: 781-768-5395
275 Totten Pond Rd.        Fax: 781-895-1195
Waltham, MA 02451-2010     central phone: 781-768-5300

-------------- next part --------------
A non-text attachment was scrubbed...
Name: IP Address Support by InfiniBand CM_v3.pdf
Type: application/octet-stream
Size: 23167 bytes
Desc: IP Address Support by InfiniBand CM_v3.pdf
URL:

From ftillier at silverstorm.com Thu Nov 10 13:52:44 2005
From: ftillier at silverstorm.com (Fab Tillier)
Date: Thu, 10 Nov 2005 13:52:44 -0800
Subject: [openib-general] socket based connection model for IB proposal -round 3
In-Reply-To:
Message-ID: <002001c5e641$15c03d30$9e5aa8c0@infiniconsys.com>

> From: Kanevsky, Arkady [mailto:Arkady.Kanevsky at netapp.com]
> Sent: Thursday, November 10, 2005 1:37 PM
>
> It will be discussed at IBTA SWG meeting next week Tu.
> Please, post your comments before that.

Looks fine to me overall. The only thing I would change is make the version field 4 bits rather than just 2, and shift the IP version down 2 bits, eliminating the reserved bits. That way, the first byte is split evenly between protocol version and IP version.

Do we even need to indicate the IP version, or can IPv4 addresses be expressed as IPv6 addresses just by zeroing the first 12 bytes?

I don't understand the relevance of the 0-based VA or Send with Invalidate discussion points. They seem orthogonal to the socket-based CM proposal, and IMO should be moved to a separate proposal.

I have no opinion one way or another on the presence of the protocol field. It could just as well be left as "flags" for the consumer to do with what they please.

- Fab

From rolandd at cisco.com Thu Nov 10 13:52:34 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Thu, 10 Nov 2005 13:52:34 -0800
Subject: [openib-general] Re: [PATCHv1] userspace CMA
In-Reply-To: (Sean Hefty's message of "Thu, 10 Nov 2005 12:37:48 -0800")
References:
Message-ID: <52k6fgjhlp.fsf@cisco.com>

The libibverbs bits look mostly OK but:

 . I don't see a file sa_kern-abi.h anywhere -- I think you forgot to
   add it. Also, please name it sa-kern-abi.h (ie all '-'s) -- mixed
   underscores and dashes are just too hard to type and look weird.
   Or did you just put everything in kern-abi.h? That's fine too,
   just remove the sa_kern-abi.h references.

 . Please add a ChangeLog entry covering the libibverbs changes.

 - R.

From mshefty at ichips.intel.com Thu Nov 10 14:09:59 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 10 Nov 2005 14:09:59 -0800
Subject: [openib-general] Re: [PATCHv1] userspace CMA
In-Reply-To: <52k6fgjhlp.fsf@cisco.com>
References: <52k6fgjhlp.fsf@cisco.com>
Message-ID: <4373C537.80405@ichips.intel.com>

Roland Dreier wrote:
> . I don't see a file sa_kern-abi.h anywhere -- I think you forgot to
> add it. Also, please name it sa-kern-abi.h (ie all '-'s) -- mixed
> underscores and dashes are just too hard to type and look weird.

I did forget to add this file. I've added it and renamed it to sa-kern-abi.h. The file only contains a definition for struct ib_kern_path_rec at the moment.

> . Please add a ChangeLog entry covering the libibverbs changes.

Will do.

Thanks,
Sean

From Arkady.Kanevsky at netapp.com Thu Nov 10 14:25:02 2005
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Thu, 10 Nov 2005 17:25:02 -0500
Subject: [dat-discussions] RE: [openib-general] socket based connection model for IB proposal -round 3
Message-ID:

If others agree I am happy to make the version a 4-bit field.
We will use IPv4 encapsulation into IPv6 as defined by IETF.

0-based VA and remote invalidate are not relevant to IP addressing. But we are proposing a change to IB CM so we need to address all the differences between IB and iWARP. This is why these are addressed in the discussion.

If we have a protocol field then CM will populate this based on the 5-tuple of socket_addr.

Arkady

Arkady Kanevsky            email: arkady at netapp.com
Network Appliance Inc.     phone: 781-768-5395
275 Totten Pond Rd.        Fax: 781-895-1195
Waltham, MA 02451-2010     central phone: 781-768-5300

> -----Original Message-----
> From: Fab Tillier [mailto:ftillier at silverstorm.com]
> Sent: Thursday, November 10, 2005 4:53 PM
> To: Kanevsky, Arkady; openib-general at openib.org;
> swg at infinibandta.org; dat-discussions at yahoogroups.com
> Subject: [dat-discussions] RE: [openib-general] socket based
> connection model for IB proposal -round 3
>
> > From: Kanevsky, Arkady [mailto:Arkady.Kanevsky at netapp.com]
> > Sent: Thursday, November 10, 2005 1:37 PM
> >
> > It will be discussed at IBTA SWG meeting next week Tu.
> > Please, post your comments before that.
>
> Looks fine to me overall. The only thing I would change is
> make the version field 4 bits rather than just 2, and shift
> the IP version down 2 bits, eliminating the reserved bits.
> That way, the first byte is split evenly between protocol
> version and IP version.
>
> Do we even need to indicate the IP version, or can IPv4
> addresses be expressed as IPv6 addresses just by zeroing the
> first 12 bytes?
>
> I don't understand the relevance of the 0-based VA or Send
> with Invalidate discussion points. They seem orthogonal to
> the socket-based CM proposal, and IMO should be moved to a
> separate proposal.
>
> I have no opinion one way or another on the presence of the
> protocol field. It could just as well be left as "flags" for
> the consumer to do with what they please.
>
> - Fab

From rolandd at cisco.com Thu Nov 10 14:35:21 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Thu, 10 Nov 2005 14:35:21 -0800
Subject: [openib-general] [RFC] new ibv_get_devices() API -- avoid dlists
In-Reply-To: <20051109212954.GC25508@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 9 Nov 2005 23:29:54 +0200")
References: <52ek5pliot.fsf@cisco.com> <20051109212954.GC25508@mellanox.co.il>
Message-ID: <527jbgjfme.fsf_-_@cisco.com>

    Michael> Maybe it's a naming thing? We can call the list
    Michael> "iterator", does this make it less ugly?

I thought about this, but it feels like overkill for something pretty simple. So how about just doing

	/* put list of devices in list and return length of list */
	extern int ibv_get_device_list(struct ibv_device * const **list);

	/* free a list of devices from ibv_get_device_list */
	extern void ibv_free_device_list(struct ibv_device * const *list);

which could be used as:

	struct ibv_device * const *list;
	int list_len;

	list_len = ibv_get_device_list(&list);

	/* ... */

	ibv_free_device_list(list);

Or are the consts too confusing? Should we be a little less safe but make it nice and simple and just do

	extern int ibv_get_device_list(struct ibv_device ***list);

and so on?

 - R.

From sean.hefty at intel.com Thu Nov 10 14:36:43 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 10 Nov 2005 14:36:43 -0800
Subject: [openib-general] RE: [PATCHv1] userspace CMA
In-Reply-To: <52k6fgjhlp.fsf@cisco.com>
Message-ID:

> . I don't see a file sa_kern-abi.h anywhere -- I think you forgot to
> add it. Also, please name it sa-kern-abi.h (ie all '-'s) -- mixed
> underscores and dashes are just too hard to type and look weird.
>
> . Please add a ChangeLog entry covering the libibverbs changes.

Here's an updated patch for just libibverbs.
Signed-off-by: Sean Hefty Index: include/infiniband/sa-kern-abi.h =================================================================== --- include/infiniband/sa-kern-abi.h (revision 0) +++ include/infiniband/sa-kern-abi.h (revision 0) @@ -0,0 +1,60 @@ +/* + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef SA_KERN_ABI_H +#define SA_KERN_ABI_H + +#include + +struct ib_kern_path_rec { + __u8 dgid[16]; + __u8 sgid[16]; + __u16 dlid; + __u16 slid; + __u32 raw_traffic; + __u32 flow_label; + __u32 reversible; + __u32 mtu; + __u16 pkey; + __u8 hop_limit; + __u8 traffic_class; + __u8 numb_path; + __u8 sl; + __u8 mtu_selector; + __u8 rate_selector; + __u8 rate; + __u8 packet_life_time_selector; + __u8 packet_life_time; + __u8 preference; +}; + +#endif /* SA_KERN_ABI_H */ Index: include/infiniband/sa.h =================================================================== --- include/infiniband/sa.h (revision 0) +++ include/infiniband/sa.h (revision 0) @@ -0,0 +1,130 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Voltaire, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: sa.h 2616 2005-06-15 15:22:39Z halr $ + */ + +#ifndef IB_SA_H +#define IB_SA_H + +#include + +enum ib_sa_rate { + IB_SA_RATE_2_5_GBPS = 2, + IB_SA_RATE_5_GBPS = 5, + IB_SA_RATE_10_GBPS = 3, + IB_SA_RATE_20_GBPS = 6, + IB_SA_RATE_30_GBPS = 4, + IB_SA_RATE_40_GBPS = 7, + IB_SA_RATE_60_GBPS = 8, + IB_SA_RATE_80_GBPS = 9, + IB_SA_RATE_120_GBPS = 10 +}; + +static inline int ib_sa_rate_enum_to_int(enum ib_sa_rate rate) +{ + switch (rate) { + case IB_SA_RATE_2_5_GBPS: return 1; + case IB_SA_RATE_5_GBPS: return 2; + case IB_SA_RATE_10_GBPS: return 4; + case IB_SA_RATE_20_GBPS: return 8; + case IB_SA_RATE_30_GBPS: return 12; + case IB_SA_RATE_40_GBPS: return 16; + case IB_SA_RATE_60_GBPS: return 24; + case IB_SA_RATE_80_GBPS: return 32; + case IB_SA_RATE_120_GBPS: return 48; + default: return -1; + } +} + +struct ib_sa_path_rec { + /* reserved */ + /* reserved */ + union ibv_gid dgid; + union ibv_gid sgid; + uint16_t dlid; + uint16_t slid; + int raw_traffic; + /* reserved */ + uint32_t flow_label; + uint8_t hop_limit; + uint8_t traffic_class; + int reversible; + uint8_t numb_path; + uint16_t pkey; + /* reserved */ + uint8_t sl; + uint8_t mtu_selector; + uint8_t mtu; + uint8_t rate_selector; + uint8_t rate; + uint8_t packet_life_time_selector; + uint8_t packet_life_time; + uint8_t preference; +}; + +struct ib_sa_mcmember_rec { + union ibv_gid mgid; + union ibv_gid port_gid; + uint32_t qkey; + uint16_t mlid; + uint8_t mtu_selector; + uint8_t mtu; + uint8_t traffic_class; + uint16_t pkey; + uint8_t rate_selector; + uint8_t rate; + uint8_t packet_life_time_selector; + uint8_t packet_life_time; + uint8_t sl; + uint32_t flow_label; + uint8_t hop_limit; + uint8_t scope; + uint8_t join_state; + int 
proxy_join; +}; + +struct ib_sa_service_rec { + uint64_t id; + union ibv_gid gid; + uint16_t pkey; + /* uint16_t resv; */ + uint32_t lease; + uint8_t key[16]; + uint8_t name[64]; + uint8_t data8[16]; + uint16_t data16[8]; + uint32_t data32[4]; + uint64_t data64[2]; +}; + +#endif /* IB_SA_H */ Index: include/infiniband/marshall.h =================================================================== --- include/infiniband/marshall.h (revision 0) +++ include/infiniband/marshall.h (revision 0) @@ -0,0 +1,62 @@ +/* + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef INFINIBAND_MARSHALL_H +#define INFINIBAND_MARSHALL_H + +#include +#include +#include +#include + +#ifdef __cplusplus +# define BEGIN_C_DECLS extern "C" { +# define END_C_DECLS } +#else /* !__cplusplus */ +# define BEGIN_C_DECLS +# define END_C_DECLS +#endif /* __cplusplus */ + +BEGIN_C_DECLS + +void ib_copy_qp_attr_from_kern(struct ibv_qp_attr *dst, + struct ibv_kern_qp_attr *src); + +void ib_copy_path_rec_from_kern(struct ib_sa_path_rec *dst, + struct ib_kern_path_rec *src); + +void ib_copy_path_rec_to_kern(struct ib_kern_path_rec *dst, + struct ib_sa_path_rec *src); + +END_C_DECLS + +#endif /* INFINIBAND_MARSHALL_H */ Index: include/infiniband/kern-abi.h =================================================================== --- include/infiniband/kern-abi.h (revision 4017) +++ include/infiniband/kern-abi.h (working copy) @@ -357,6 +357,64 @@ __u32 async_events_reported; }; +struct ibv_kern_global_route { + __u8 dgid[16]; + __u32 flow_label; + __u8 sgid_index; + __u8 hop_limit; + __u8 traffic_class; + __u8 reserved; +}; + +struct ibv_kern_ah_attr { + struct ibv_kern_global_route grh; + __u16 dlid; + __u8 sl; + __u8 src_path_bits; + __u8 static_rate; + __u8 is_global; + __u8 port_num; + __u8 reserved; +}; + +struct ibv_kern_qp_attr { + __u32 qp_attr_mask; + __u32 qp_state; + __u32 cur_qp_state; + __u32 path_mtu; + __u32 path_mig_state; + __u32 qkey; + __u32 rq_psn; + __u32 sq_psn; + __u32 dest_qp_num; + __u32 qp_access_flags; + + struct ibv_kern_ah_attr ah_attr; + struct ibv_kern_ah_attr alt_ah_attr; + + /* ib_qp_cap */ + __u32 max_send_wr; + __u32 max_recv_wr; + __u32 max_send_sge; + __u32 max_recv_sge; + __u32 max_inline_data; + + __u16 pkey_index; + __u16 alt_pkey_index; + __u8 en_sqd_async_notify; + __u8 sq_draining; + __u8 max_rd_atomic; + __u8 max_dest_rd_atomic; + __u8 min_rnr_timer; + __u8 port_num; + __u8 timeout; + __u8 retry_cnt; + __u8 rnr_retry; + __u8 alt_port_num; + __u8 alt_timeout; + __u8 reserved[5]; +}; + struct ibv_create_qp { __u32 
command; __u16 in_words; @@ -532,26 +590,6 @@ __u32 bad_wr; }; -struct ibv_kern_global_route { - __u8 dgid[16]; - __u32 flow_label; - __u8 sgid_index; - __u8 hop_limit; - __u8 traffic_class; - __u8 reserved; -}; - -struct ibv_kern_ah_attr { - struct ibv_kern_global_route grh; - __u16 dlid; - __u8 sl; - __u8 src_path_bits; - __u8 static_rate; - __u8 is_global; - __u8 port_num; - __u8 reserved; -}; - struct ibv_create_ah { __u32 command; __u16 in_words; Index: src/libibverbs.map =================================================================== --- src/libibverbs.map (revision 4017) +++ src/libibverbs.map (working copy) @@ -57,5 +57,8 @@ ibv_cmd_destroy_ah; ibv_cmd_attach_mcast; ibv_cmd_detach_mcast; + ib_copy_qp_attr_from_kern; + ib_copy_path_rec_from_kern; + ib_copy_path_rec_to_kern; local: *; }; Index: src/marshall.c =================================================================== --- src/marshall.c (revision 0) +++ src/marshall.c (revision 0) @@ -0,0 +1,140 @@ +/* + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include + +static void ib_copy_ah_attr_from_kern(struct ibv_ah_attr *dst, + struct ibv_kern_ah_attr *src) +{ + memcpy(dst->grh.dgid.raw, src->grh.dgid, sizeof dst->grh.dgid); + dst->grh.flow_label = src->grh.flow_label; + dst->grh.sgid_index = src->grh.sgid_index; + dst->grh.hop_limit = src->grh.hop_limit; + dst->grh.traffic_class = src->grh.traffic_class; + + dst->dlid = src->dlid; + dst->sl = src->sl; + dst->src_path_bits = src->src_path_bits; + dst->static_rate = src->static_rate; + dst->is_global = src->is_global; + dst->port_num = src->port_num; +} + +void ib_copy_qp_attr_from_kern(struct ibv_qp_attr *dst, + struct ibv_kern_qp_attr *src) +{ + dst->cur_qp_state = src->cur_qp_state; + dst->path_mtu = src->path_mtu; + dst->path_mig_state = src->path_mig_state; + dst->qkey = src->qkey; + dst->rq_psn = src->rq_psn; + dst->sq_psn = src->sq_psn; + dst->dest_qp_num = src->dest_qp_num; + dst->qp_access_flags = src->qp_access_flags; + + dst->cap.max_send_wr = src->max_send_wr; + dst->cap.max_recv_wr = src->max_recv_wr; + dst->cap.max_send_sge = src->max_send_sge; + dst->cap.max_recv_sge = src->max_recv_sge; + dst->cap.max_inline_data = src->max_inline_data; + + ib_copy_ah_attr_from_kern(&dst->ah_attr, &src->ah_attr); + ib_copy_ah_attr_from_kern(&dst->alt_ah_attr, &src->alt_ah_attr); + + dst->pkey_index = src->pkey_index; + dst->alt_pkey_index = src->alt_pkey_index; + dst->en_sqd_async_notify = 
src->en_sqd_async_notify; + dst->sq_draining = src->sq_draining; + dst->max_rd_atomic = src->max_rd_atomic; + dst->max_dest_rd_atomic = src->max_dest_rd_atomic; + dst->min_rnr_timer = src->min_rnr_timer; + dst->port_num = src->port_num; + dst->timeout = src->timeout; + dst->retry_cnt = src->retry_cnt; + dst->rnr_retry = src->rnr_retry; + dst->alt_port_num = src->alt_port_num; + dst->alt_timeout = src->alt_timeout; +} + +void ib_copy_path_rec_from_kern(struct ib_sa_path_rec *dst, + struct ib_kern_path_rec *src) +{ + memcpy(dst->dgid.raw, src->dgid, sizeof dst->dgid); + memcpy(dst->sgid.raw, src->sgid, sizeof dst->sgid); + + dst->dlid = src->dlid; + dst->slid = src->slid; + dst->raw_traffic = src->raw_traffic; + dst->flow_label = src->flow_label; + dst->hop_limit = src->hop_limit; + dst->traffic_class = src->traffic_class; + dst->reversible = src->reversible; + dst->numb_path = src->numb_path; + dst->pkey = src->pkey; + dst->sl = src->sl; + dst->mtu_selector = src->mtu_selector; + dst->mtu = src->mtu; + dst->rate_selector = src->rate_selector; + dst->rate = src->rate; + dst->packet_life_time = src->packet_life_time; + dst->preference = src->preference; + dst->packet_life_time_selector = src->packet_life_time_selector; +} + +void ib_copy_path_rec_to_kern(struct ib_kern_path_rec *dst, + struct ib_sa_path_rec *src) +{ + memcpy(dst->dgid, src->dgid.raw, sizeof src->dgid); + memcpy(dst->sgid, src->sgid.raw, sizeof src->sgid); + + dst->dlid = src->dlid; + dst->slid = src->slid; + dst->raw_traffic = src->raw_traffic; + dst->flow_label = src->flow_label; + dst->hop_limit = src->hop_limit; + dst->traffic_class = src->traffic_class; + dst->reversible = src->reversible; + dst->numb_path = src->numb_path; + dst->pkey = src->pkey; + dst->sl = src->sl; + dst->mtu_selector = src->mtu_selector; + dst->mtu = src->mtu; + dst->rate_selector = src->rate_selector; + dst->rate = src->rate; + dst->packet_life_time = src->packet_life_time; + dst->preference = src->preference; + 
dst->packet_life_time_selector = src->packet_life_time_selector; +} Index: ChangeLog =================================================================== --- ChangeLog (revision 4017) +++ ChangeLog (working copy) @@ -1,3 +1,17 @@ +2005-11-10 Sean Hefty + + * include/infiniband/sa-kern-abi.h: New include file to contain + definitions of SA structures passed between userspace and kernel. + + * include/infiniband/sa.h: New include file for definitions of + SA structures used by multiple libraries. + + * include/infiniband/marshall.h src/marshall.c: New files to define + routines used to exchange data with kernel modules. + + * include/infiniband/kern-abi.h: Added data structures used to exchange + QP attribute with kernel modules. + 2005-11-09 Michael S. Tsirkin * src/device.c (ibv_get_devices): Make function reentrant by using Index: Makefile.am =================================================================== --- Makefile.am (revision 4017) +++ Makefile.am (working copy) @@ -14,7 +14,8 @@ libibverbs_version_script = endif -src_libibverbs_la_SOURCES = src/cmd.c src/device.c src/init.c src/memory.c src/verbs.c +src_libibverbs_la_SOURCES = src/cmd.c src/device.c src/init.c src/marshall.c \ + src/memory.c src/verbs.c src_libibverbs_la_LDFLAGS = -version-info 1 -export-dynamic \ $(libibverbs_version_script) src_libibverbs_la_DEPENDENCIES = $(srcdir)/src/libibverbs.map @@ -40,7 +41,8 @@ libibverbsincludedir = $(includedir)/infiniband libibverbsinclude_HEADERS = include/infiniband/arch.h include/infiniband/driver.h \ - include/infiniband/kern-abi.h include/infiniband/opcode.h include/infiniband/verbs.h + include/infiniband/kern-abi.h include/infiniband/opcode.h include/infiniband/verbs.h \ + include/infiniband/sa-kern-abi.h include/infiniband/sa.h include/infiniband/marshall.h man_MANS = man/ibv_asyncwatch.1 man/ibv_devices.1 man/ibv_devinfo.1 \ man/ibv_rc_pingpong.1 man/ibv_uc_pingpong.1 man/ibv_ud_pingpong.1 \ @@ -53,6 +55,8 @@ EXTRA_DIST = include/infiniband/driver.h 
include/infiniband/kern-abi.h \ include/infiniband/opcode.h include/infiniband/verbs.h src/ibverbs.h \ + include/infiniband/marshall.h include/infiniband/sa-kern-abi.h \ + include/infiniband/sa.h \ src/libibverbs.map libibverbs.spec.in $(man_MANS) $(DEBIAN) dist-hook: libibverbs.spec From mshefty at ichips.intel.com Thu Nov 10 14:45:47 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 10 Nov 2005 14:45:47 -0800 Subject: (SPAM?) [openib-general] [RFC] new ibv_get_devices() API -- avoid dlists In-Reply-To: <527jbgjfme.fsf_-_@cisco.com> References: <527jbgjfme.fsf_-_@cisco.com> Message-ID: <4373CD9B.60502@ichips.intel.com> Roland Dreier wrote: > Or are the consts too confusing? Should we be a little less safe but > make it nice and simple and just do The const confuses me somewhat. > extern int ibv_get_device_list(struct ibv_device ***list); Is ***list really what we want here? Can we just get away with **list? Would something like: struct ibv_device * ibv_get_device(index); work as well? - Sean From johann at pathscale.com Thu Nov 10 14:54:51 2005 From: johann at pathscale.com (Johann George) Date: Thu, 10 Nov 2005 14:54:51 -0800 Subject: [openib-general] [RFC] new ibv_get_devices() API -- avoid dlists In-Reply-To: <527jbgjfme.fsf_-_@cisco.com> References: <52ek5pliot.fsf@cisco.com> <20051109212954.GC25508@mellanox.co.il> <527jbgjfme.fsf_-_@cisco.com> Message-ID: <20051110225451.GA30441@cuprite.internal.keyresearch.com> > So how about just doing > > /* put list of devices in list and return length of list */ > extern int ibv_get_device_list(struct ibv_device * const **list); > > /* free a list of devices from ibv_get_device_list */ > extern void ibv_free_device_list(struct ibv_device * const *list); I like it much better than what we have now. Clean, simple and easy to understand. > Or are the consts too confusing?
Should we be a little less safe but > make it nice and simple and just do I often find consts a bit of a nuisance for what they give me; but am fine either way. Both are simple enough. Johann From mshefty at ichips.intel.com Thu Nov 10 15:00:47 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 10 Nov 2005 15:00:47 -0800 Subject: [openib-general] RE: [dat-discussions] socket based connection model for IB proposal - round 3 In-Reply-To: References: Message-ID: <4373D11F.8080308@ichips.intel.com> If you want to maximize consumer usable private data, then you can move the version, IP version, protocol, source and destination ports into the service ID. Separately, if there's any defined mapping to a service ID or set of service IDs, then the service ID indicates the format of the private data. No additional information is needed in the CM REQ, such as using a reserve bit. To be clear, the CM REQ _carries_ the IP address. There should be no requirement that the CM performs the mapping, and I see no reason why it should even care. - Sean From johann at pathscale.com Thu Nov 10 15:02:15 2005 From: johann at pathscale.com (Johann George) Date: Thu, 10 Nov 2005 15:02:15 -0800 Subject: (SPAM?) [openib-general] [RFC] new ibv_get_devices() API -- avoid dlists In-Reply-To: <4373CD9B.60502@ichips.intel.com> References: <527jbgjfme.fsf_-_@cisco.com> <4373CD9B.60502@ichips.intel.com> Message-ID: <20051110230215.GB30441@cuprite.internal.keyresearch.com> > Is ***list really what we want here? Can we just get away with **list? > > Would something like: > > struct ibv_device * ibv_get_device(index); I would prefer one call to get the entire structure. Another option might be: struct ibv_device ** ibv_get_device() where it returns a list which is null terminated so you do not need to return the length. Johann From rolandd at cisco.com Thu Nov 10 15:11:48 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 10 Nov 2005 15:11:48 -0800 Subject: (SPAM?) 
[openib-general] [RFC] new ibv_get_devices() API -- avoid dlists In-Reply-To: <4373CD9B.60502@ichips.intel.com> (Sean Hefty's message of "Thu, 10 Nov 2005 14:45:47 -0800") References: <527jbgjfme.fsf_-_@cisco.com> <4373CD9B.60502@ichips.intel.com> Message-ID: <523bm4jdxn.fsf@cisco.com> > The const confuses me somewhat. Yeah, and thinking about it more, the memory really belongs to the consumer of the function. So I don't think the const is even correct. > > extern int ibv_get_device_list(struct ibv_device ***list); > Is ***list really what we want here? Can we just get away with **list? Yes -- a single device is represented by a struct ibv_device *. So an array of devices is represented by a struct ibv_device **. And a pointer to such an array is struct ibv_device ***. But the following is OK too I think: extern int ibv_get_device_list(struct ibv_device **list[]); extern void ibv_free_device_list(struct ibv_device *list[]); is that clearer? (a pointer to an array of pointers to struct ibv_device). > Would something like: > > struct ibv_device * ibv_get_device(index); > > work as well? That could work as well. But it doesn't handle hotplug quite as well. By returning a snapshot of all the known devices at a given moment, we at least have a chance at doing something sensible with devices appearing or disappearing. - R. From rolandd at cisco.com Thu Nov 10 15:16:27 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 10 Nov 2005 15:16:27 -0800 Subject: (SPAM?) [openib-general] [RFC] new ibv_get_devices() API -- avoid dlists In-Reply-To: <20051110230215.GB30441@cuprite.internal.keyresearch.com> (Johann George's message of "Thu, 10 Nov 2005 15:02:15 -0800") References: <527jbgjfme.fsf_-_@cisco.com> <4373CD9B.60502@ichips.intel.com> <20051110230215.GB30441@cuprite.internal.keyresearch.com> Message-ID: <52y83whz5g.fsf@cisco.com> > I would prefer one call to get the entire structure. 
Another option might > be: > > struct ibv_device ** ibv_get_device() > > where it returns a list which is null terminated so you do not need to > return the length. Yes, I thought of that too. It seemed faintly preferable to tell the caller how big the array was rather than forcing the caller to count for itself. But I can't really think of a use case where it makes a difference, so perhaps your simpler version is better. - R. From mshefty at ichips.intel.com Thu Nov 10 15:23:07 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 10 Nov 2005 15:23:07 -0800 Subject: (SPAM?) Re: (SPAM?) [openib-general] [RFC] new ibv_get_devices() API -- avoid dlists In-Reply-To: <523bm4jdxn.fsf@cisco.com> References: <523bm4jdxn.fsf@cisco.com> Message-ID: <4373D65B.3060808@ichips.intel.com> Roland Dreier wrote: > > Is ***list really what we want here? Can we just get away with **list? > > Yes -- a single device is represented by a struct ibv_device *. > So an array of devices is represented by a struct ibv_device **. > And a pointer to such an array is struct ibv_device ***. I understand. This is just API that I've seen that used '***'. Why not just return a copy of the array? > > Would something like: > > > > struct ibv_device * ibv_get_device(index); > > > > work as well? > > That could work as well. But it doesn't handle hotplug quite as well. > By returning a snapshot of all the known devices at a given moment, we > at least have a chance at doing something sensible with devices > appearing or disappearing. This doesn't seem any worse to me. The user can reference device_array[i] or call ibv_get_device(i). I need to spend more time understanding how userspace hotplug will work. 
- Sean From Arkady.Kanevsky at netapp.com Thu Nov 10 15:31:18 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Thu, 10 Nov 2005 18:31:18 -0500 Subject: [openib-general] RE: [dat-discussions] socket based connection model for IB proposal - round 3 Message-ID: Sean, comments inline. Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 275 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Thursday, November 10, 2005 6:01 PM > To: Kanevsky, Arkady > Cc: dat-discussions at yahoogroups.com; > openib-general at openib.org; swg at infinibandta.org > Subject: Re: [openib-general] RE: [dat-discussions] socket > based connection model for IB proposal - round 3 > > If you want to maximize consumer usable private data, then > you can move the version, IP version, protocol, source and > destination ports into the service ID. Not at the expense of redefining what Service ID is. How do you propose to move all these fields into Service ID without violating IBTA spec Annex A3.2? Remember, the Service ID is what the responder advertises and what the requestor sends communication requests to. It may be possible for a server to advertise multiple service IDs to cover version and IP version variations, but it will not be symmetrical to iWARP. Port is port (service ID) and address is address. Port does not encode IP version. > > Separately, if there's any defined mapping to a service ID or > set of service IDs, then the service ID indicates the format > of the private data. No additional information is needed in > the CM REQ, such as using a reserve bit. That is a good point. But this restricts the usage of IP addressing only to these ports. The question is which is easier to check: 1 bit or the Service ID. Of course, service ID will have to be checked anyhow to direct the request.
While this overloads the semantic meaning of Service ID it is a viable method. > > To be clear, the CM REQ _carries_ the IP address. There > should be no requirement that the CM performs the mapping, > and I see no reason why it should even care. > Can you elaborate on this? Does this address who populates the formatted portion of the private data? > - Sean > From caitlinb at broadcom.com Thu Nov 10 15:40:27 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 10 Nov 2005 15:40:27 -0800 Subject: [openib-general] RE: [dat-discussions] socket based connection model for IB proposal - round 3 Message-ID: <54AD0F12E08D1541B826BE97C98F99F1041642@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > If you want to maximize consumer usable private data, then > you can move the version, IP version, protocol, source and > destination ports into the service ID. > > Separately, if there's any defined mapping to a service ID or > set of service IDs, then the service ID indicates the format > of the private data. No additional information is needed in > the CM REQ, such as using a reserve bit. > Current CM software could generate the Service ID. Therefore the fact that the Private Data is in the "new format" cannot be part of the Service ID. Otherwise I agree with your analysis that data can be moved to the Service ID. Which is more valuable, 4 more bytes of private data or a much larger number of Service IDs, is another topic. > To be clear, the CM REQ _carries_ the IP address. There > should be no requirement that the CM performs the mapping, > and I see no reason why it should even care. > The CM needs to have at least the capability of validating the local IP address supplied.
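For readers following the parallel ibv_get_devices() thread, here is a minimal sketch of the "return a NULL-terminated snapshot, with an optional length" shape being discussed there. It is only an illustration: struct demo_device, demo_get_device_list(), demo_free_device_list() and demo_count_devices() are hypothetical names invented for this sketch, the device table is fake, and nothing here is the libibverbs implementation.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct ibv_device; not the libibverbs type. */
struct demo_device {
	const char *name;
};

/* Fake table of "known devices", used only for this illustration. */
static struct demo_device demo_devices[] = { { "dev0" }, { "dev1" } };

/*
 * Return a malloc'ed, NULL-terminated snapshot of device pointers.
 * If num is non-NULL, the number of entries is also stored there,
 * so callers that do not care about the length can pass NULL.
 */
static struct demo_device **demo_get_device_list(int *num)
{
	int n = (int) (sizeof demo_devices / sizeof demo_devices[0]);
	struct demo_device **list = malloc((n + 1) * sizeof *list);
	int i;

	if (!list)
		return NULL;

	for (i = 0; i < n; ++i)
		list[i] = &demo_devices[i];
	list[n] = NULL;		/* terminator, so no length is required */

	if (num)
		*num = n;
	return list;
}

/* Free only the array; the devices themselves are not owned by it. */
static void demo_free_device_list(struct demo_device **list)
{
	free(list);
}

/* Count entries by walking to the NULL terminator. */
static int demo_count_devices(struct demo_device **list)
{
	int n = 0;

	while (list[n])
		++n;
	return n;
}
```

Because the caller owns the returned array, the snapshot stays stable even if the underlying device set changes later, which is the hotplug argument made in the thread.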
From johann at pathscale.com Thu Nov 10 15:40:56 2005 From: johann at pathscale.com (Johann George) Date: Thu, 10 Nov 2005 15:40:56 -0800 Subject: [openib-general] [RFC] new ibv_get_devices() API -- avoid dlists In-Reply-To: <52y83whz5g.fsf@cisco.com> References: <527jbgjfme.fsf_-_@cisco.com> <4373CD9B.60502@ichips.intel.com> <20051110230215.GB30441@cuprite.internal.keyresearch.com> <52y83whz5g.fsf@cisco.com> Message-ID: <20051110234056.GC30441@cuprite.internal.keyresearch.com> > It seemed faintly preferable to tell the caller how big the array was > rather than forcing the caller to count for itself. If you really wanted that, I would be more inclined towards: struct ibv_device ** ibv_get_device(*length_ptr) and if you do not want length, you could pass a null length_ptr. But since I also cannot think of a strong case for it, I prefer the cleaner interface of leaving it out. Johann From sean.hefty at intel.com Thu Nov 10 15:52:45 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 10 Nov 2005 15:52:45 -0800 Subject: [openib-general] RE: [dat-discussions] socket based connectionmodel for IB proposal - round 3 In-Reply-To: Message-ID: >> If you want to maximize consumer usable private data, then >> you can move the version, IP version, protocol, source and >> destination ports into the service ID. > >Not at the expense of redefining what Service ID is. >How do you propose to move all these fields into Service ID without >violating IBTA spec Annex A3.2? Remember, the Service ID is what the responder >advertises and what the requestor sends communication requests to. It may be >possible for a server to advertise multiple service IDs to cover version and IP >version variations, but it will not be symmetrical to iWARP. Port is port >(service ID) and address is address. Port does not encode IP version. The service ID could be formatted as: Set ID: 24 Version: 4 IP version: 4 Src port: 16 Dst port: 16 I don't see how this violates the spec.
Beyond the set ID, the rest is defined as "any". It's not necessary, but it does save 4 bytes of private data for the user. >> Separately, if there's any defined mapping to a service ID or >> set of service IDs, then the service ID indicates the format >> of the private data. No additional information is needed in >> the CM REQ, such as using a reserve bit. > >That is a good point. >But this restricts the usage of IP addressing only to these ports. It doesn't restrict the usage at all. It defines a portion of the private data for a specific range of service IDs, the same way it is done for SDP. There's no restriction that other service IDs not use the same format. Even with the proposal to use a reserved bit in the CM, a particular service could format its private data this way, not set the bit, and still be spec compliant. >The question is what is easier to check 1 bit or Service ID. >Of course, service ID will have to be checked anyhow to direct the >request. Exactly. If the service ID is checked anyway, why set the bit? >While this overloads the semantic meaning of Service ID it is a viable >method. How is this not viable? There's a _working_ implementation today for both userspace and kernel mode clients to connect using IP addressing that didn't require any modifications to the IB CM. >> To be clear, the CM REQ _carries_ the IP address. There >> should be no requirement that the CM performs the mapping, >> and I see no reason why it should even care. >> >Can you elaborate on this? Is this addresses who populates the formated >portion of >the provate data? 
I'm referring to who formats the private data and performs the mapping to the service IDs (slide 13) - Sean From mshefty at ichips.intel.com Thu Nov 10 16:00:31 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 10 Nov 2005 16:00:31 -0800 Subject: [openib-general] RE: [dat-discussions] socket based connection model for IB proposal - round 3 In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1041642@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F1041642@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <4373DF1F.6000908@ichips.intel.com> Caitlin Bestler wrote: > Current CM software could generate the Serive ID. Therefore > the fact that the Private Data is in the "new format" cannot > be part of the Service ID. Otherwise I agree with your analysis > that data can be moved to the Serivce ID. Which is more valuable, > 4 more bytes of private data or a very larger number of Service > IDS, is another topic. The CM would still need to know what range of service IDs can be generated. I don't believe that the range can overlap with an existing range that is already defined without needing to redefine service records and other items. The extra bit in essence becomes a 65th bit for the service ID in such cases. The additional 4 bytes of private data come at an expense of consuming something like .0000006% of the service ID space. >>To be clear, the CM REQ _carries_ the IP address. There >>should be no requirement that the CM performs the mapping, >>and I see no reason why it should even care. > > The CM needs to have at least the capability of validating > the local IP address supplied. Validation can be done outside of the CM in a separate module. 
- Sean From caitlinb at broadcom.com Thu Nov 10 16:07:59 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 10 Nov 2005 16:07:59 -0800 Subject: [openib-general] RE: [dat-discussions] socket based connection model for IB proposal - round 3 Message-ID: <54AD0F12E08D1541B826BE97C98F99F1041648@NT-SJCA-0751.brcm.ad.broadcom.com> Sean Hefty wrote: > Caitlin Bestler wrote: >> Current CM software could generate the Serive ID. Therefore the fact >> that the Private Data is in the "new format" cannot be part of the >> Service ID. Otherwise I agree with your analysis that data can be >> moved to the Serivce ID. Which is more valuable, >> 4 more bytes of private data or a very larger number of Service IDS, >> is another topic. > > The CM would still need to know what range of service IDs can > be generated. I don't believe that the range can overlap > with an existing range that is already defined without > needing to redefine service records and other items. The > extra bit in essence becomes a 65th bit for the service ID in such > cases. > How would you prevent someone using old CM software from forging their IP address in user mode and requesting the Service ID from an old CM implementation that did not know to check newly standardized portion of what it thinks of as entirely "private" data? By comparison, an RDMA application on an iWARP system cannot receive a "connection established" event until the IP Address has been validated by kernels at both end and by the ability to round-trip with said IP address. > The additional 4 bytes of private data come at an expense of > consuming something like .0000006% of the service ID space. > >>> To be clear, the CM REQ _carries_ the IP address. There should be >>> no requirement that the CM performs the mapping, and I see no >>> reason why it should even care. >> >> The CM needs to have at least the capability of validating the local >> IP address supplied. 
> > Validation can be done outside of the CM in a separate module. > That's fine. Just as long as an application that wants to cheat has to consider the possibility that the kernel might validate. Similarly ingress validation *might* be done in an IP network. From mst at mellanox.co.il Thu Nov 10 21:42:33 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 11 Nov 2005 07:42:33 +0200 Subject: (SPAM?) [openib-general] [RFC] new ibv_get_devices() API -- avoid dlists In-Reply-To: <523bm4jdxn.fsf@cisco.com> References: <523bm4jdxn.fsf@cisco.com> Message-ID: <20051111054233.GB27969@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: (SPAM?) [openib-general] [RFC] new ibv_get_devices() API -- avoid dlists > > > The const confuses me somewhat. > > Yeah, and thinking about it more, the memory really belongs to the > consumer of the function. So I don't think the const is even correct. > > > > extern int ibv_get_device_list(struct ibv_device ***list); > > > Is ***list really what we want here? Can we just get away with **list? > > Yes -- a single device is represented by a struct ibv_device *. > So an array of devices is represented by a struct ibv_device **. > And a pointer to such an array is struct ibv_device ***. > > But the following is OK too I think: > > extern int ibv_get_device_list(struct ibv_device **list[]); > extern void ibv_free_device_list(struct ibv_device *list[]); > > is that clearer? (a pointer to an array of pointers to struct ibv_device). Yes, this looks good. > > Would something like: > > > > struct ibv_device * ibv_get_device(index); > > > > work as well? > > That could work as well. But it doesn't handle hotplug quite as well. > By returning a snapshot of all the known devices at a given moment, we > at least have a chance at doing something sensible with devices > appearing or disappearing. > > - R. I agree. 
With ibv_free_device_list we just need to document that the application is supposed to close devices it doesnt listen for hotplug on. -- MST From hozer at hozed.org Thu Nov 10 22:31:49 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Fri, 11 Nov 2005 00:31:49 -0600 Subject: [openib-general] OpenSM and Wrong SM_Key In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E36188FB@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E36188FB@mtlexch01.mtl.com> Message-ID: <20051111063149.GT3275@kalmia.hozed.org> On Wed, Nov 09, 2005 at 09:46:06AM +0200, Eitan Zahavi wrote: > Hi Hal, > > I would like to bring this to MgtWG before we change anything. > IMO the situation when this happens is really not "legal" since if the > SM's are not coordinated at least in their SM_Key it will cause the two > masters on the subnet. > > From our experience it is always better to cause a fatal flow and exit > the SM rather then report the event in some log - normally it will not > be seen ... > > I know this is a controversial issue. Okay, so you're telling me you *WANT* behavior where a rogue node can trivially cause the running subnet manager to exit and take over management of the network? Opensm needs to have a well documented config file, instead of 3 pages of command line options, and different levels of logging. 
What to do in the above situation is a site-local policy config decision, not something that should be hard-coded in the SM source code. The logs might actually get looked at if there wasn't junk in the log every time something timed out. The linux kernel has 'WARN, NOTICE, and CRITICAL' level log messages. From Arkady.Kanevsky at netapp.com Fri Nov 11 05:21:13 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Fri, 11 Nov 2005 08:21:13 -0500 Subject: [openib-general] RE: [dat-discussions] socket based connectionmodel for IB proposal - round 3 Message-ID: So what you are proposing is that Listener will specify IETF port (2 bytes). CM will generate an IB SID to listen on. That SID will have wildcarding for 24 bits. The requestor will specify: version, IP version, SRC port and DST port. Based on that CM will generate the SID to send request to. It will also encode IP addresses into Private data based on IP version. This makes IP addresses, SIDs and private data format interdependent and not orthogonal which it is now. It also changes the meaning of SID which currently has a meaning of TCP port. It also does not allow to use the private data formating for other SIDs. It looks like a big hack. Is it worth it for extra 4 bytes of private data for Consumers? Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 275 Totten Pond Rd. 
Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Sean Hefty [mailto:sean.hefty at intel.com] > Sent: Thursday, November 10, 2005 6:53 PM > To: Kanevsky, Arkady; Sean Hefty > Cc: swg at infinibandta.org; dat-discussions at yahoogroups.com; > openib-general at openib.org > Subject: RE: [openib-general] RE: [dat-discussions] socket > based connectionmodel for IB proposal - round 3 > > >> If you want to maximize consumer usable private data, then you can > >> move the version, IP version, protocol, source and > destination ports > >> into the service ID. > > > >Not at the expense of redefining what Service ID is. > >How do you propose to move all these fields into Service ID without > >violating IBTA spec Annex A3.2.? Remember Service ID is what > responder > >advertize and requestor sends communucation requests to. It may be > >possible to server to advertize multiple service IDs to > cover version > >and IP version variations but it will not be symmetrical to > iWARP. Port > >is port (service ID) and address is address. Port does not encode IP > >version. > > The service ID could be formatted as: > > Set ID: 24 > Version: 4 > IP version: 4 > Src port: 16 > Dst port: 16 > > I don't see how this violates the spec. Beyond the set ID, > the rest is defined as "any". It's not necessary, but it > does save 4 bytes of private data for the user. > > >> Separately, if there's any defined mapping to a service ID > or set of > >> service IDs, then the service ID indicates the format of > the private > >> data. No additional information is needed in the CM REQ, such as > >> using a reserve bit. > > > >That is a good point. > >But this restricts the usage of IP addressing only to these ports. > > It doesn't restrict the usage at all. It defines a portion > of the private data for a specific range of service IDs, the > same way it is done for SDP. There's no restriction that > other service IDs not use the same format. 
> > Even with the proposal to use a reserved bit in the CM, a > particular service could format its private data this way, > not set the bit, and still be spec compliant. > > >The question is what is easier to check 1 bit or Service ID. > >Of course, service ID will have to be checked anyhow to direct the > >request. > > Exactly. If the service ID is checked anyway, why set the bit? > > >While this overloads the semantic meaning of Service ID it > is a viable > >method. > > How is this not viable? There's a _working_ implementation > today for both userspace and kernel mode clients to connect > using IP addressing that didn't require any modifications to > the IB CM. > > >> To be clear, the CM REQ _carries_ the IP address. There > should be no > >> requirement that the CM performs the mapping, and I see no > reason why > >> it should even care. > >> > >Can you elaborate on this? Is this addresses who populates > the formated > >portion of the provate data? > > I'm referring to who formats the private data and performs > the mapping to the service IDs (slide 13) > > - Sean > From mshefty at ichips.intel.com Fri Nov 11 09:42:32 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 11 Nov 2005 09:42:32 -0800 Subject: [openib-general] RE: [dat-discussions] socket based connectionmodel for IB proposal - round 3 In-Reply-To: References: Message-ID: <4374D808.405@ichips.intel.com> Kanevsky, Arkady wrote: > So what you are proposing is that Listener will specify IETF port (2 > bytes). > CM will generate an IB SID to listen on. That SID will have wildcarding > for 24 bits. > The requestor will specify: version, IP version, SRC port and DST port. > Based on that CM will generate the SID to send request to. No, the listener or requester generate the SID, not the IB CM - the same way SDP works today. > It will also encode IP addresses into Private data based on IP version. 
> > This makes IP addresses, SIDs and private data format interdependent and > not > orthogonal which it is now. > It also changes the meaning of SID which currently has a meaning of TCP > port. I'm not proposing this. I'm merely stating that it is a valid option to consider. The private data format and SIDs are not orthogonal anyway. The port number's embedded in the SID, and the SID indicates the format of the private data. They are interdependent by definition. If it's okay to put the destination port number in the SID, why not the protocol type, or IP version? > It also does not allow to use the private data formating for other SIDs. Private data is private. It should not be owned, set, interpreted, modified, or touched by the CM. It's up to the service to define and use. What this proposal defines is basically a 65th bit for the service ID. If the new 65 bit SID is: 1 - private data has this format 0 - private data format is unknown Why do we need this 65th bit? > It looks like a big hack. Is it worth it for extra 4 bytes of private > data > for Consumers? It's a trade off between SID space and private data. Consumers need to decide how important those extra 4 bytes are. - Sean From caitlinb at broadcom.com Fri Nov 11 09:56:47 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 11 Nov 2005 09:56:47 -0800 Subject: [openib-general] RE: [dat-discussions] socket based connectionmodel for IB proposal - round 3 Message-ID: <54AD0F12E08D1541B826BE97C98F99F104166D@NT-SJCA-0751.brcm.ad.broadcom.com> Sean Hefty wrote in response to Arkady Kanevsky: > > What this proposal defines is basically a 65th bit for the > service ID. If the new 65 bit SID is: > > 1 - private data has this format > 0 - private data format is unknown > > Why do we need this 65th bit? > Because current software can set any of the 64 bits. 
There is no assurance that any bit within the current 64 being set means that privileged software on the remote side is vouching for the standardized portion of the private data. An NFS daemon today knows that the remote IP address of an established connection is consistent with the local routing configured at the privileged layer in at least two locations (the local machine and the remote peer). Because an IP address is not used for routing we are already losing some of that. Having *neither* end be validated by the network stack is a serious change in the semantics of a connection request. Daemons frequently use the remote IP address to validate at least some clients as being "local" and hence trusted. On an IP network this is amazingly effective if combined with ingress filtering. But even if there is no ingress filtering, an established connection with a local IP address is inherently local. It is at most single packets, or those outside of a connection, that are suspect. Unless there is information in the REQ that cannot be set by a current CM in response to a non-privileged request then the Daemon loses this ability of inferring "neighbor status" based on the remote IP address. That is simply no longer "IP compatible" connection establishment. From mshefty at ichips.intel.com Fri Nov 11 10:42:36 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 11 Nov 2005 10:42:36 -0800 Subject: [openib-general] RE: [dat-discussions] socket based connectionmodel for IB proposal - round 3 In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F104166D@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F104166D@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <4374E61C.2030002@ichips.intel.com> Caitlin Bestler wrote: > Because current software can set any of the 64 bits. 
> There is no assurance that any bit within the current 64 > being set means that privileged software on the remote > side is vouching for the standardized portion of the > private data. The current software did not require changes to the CM protocol, and does provide assurance that the private data was formatted by a kernel entity. This is not, nor has ever been an issue with the current implementation, despite your endless attempts to make it seem that way. - Sean From rolandd at cisco.com Fri Nov 11 10:49:09 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 11 Nov 2005 10:49:09 -0800 Subject: [openib-general] RE: [dat-discussions] socket based connectionmodel for IB proposal - round 3 In-Reply-To: <4374E61C.2030002@ichips.intel.com> (Sean Hefty's message of "Fri, 11 Nov 2005 10:42:36 -0800") References: <54AD0F12E08D1541B826BE97C98F99F104166D@NT-SJCA-0751.brcm.ad.broadcom.com> <4374E61C.2030002@ichips.intel.com> Message-ID: <527jbfgguy.fsf@cisco.com> Sean> The current software did not require changes to the CM Sean> protocol, and does provide assurance that the private data Sean> was formatted by a kernel entity. How do you prevent a userspace process from using the current ucm module to connect to one of the CMA services on a remote system? If you allow that, then the process can use the old simple direct IB CM interface to put whatever it wants into the REQ private data. I don't see any handling of the service ID in the ucm or cm code beyond taking what the consumer passes in and formatting the CM messages with it. - R. 
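One way a user-mode CM interface could refuse such requests is to reject any consumer-supplied service ID that falls inside a range reserved for services whose private data is formatted by the kernel. The sketch below assumes the reserved range is identified by a fixed 24-bit prefix. The prefix value, the mask and the function names are all hypothetical: no standard SID has been defined for this purpose, and this is not code from the ucm or cm modules.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical reserved range: top 24 bits fixed, low 40 bits "any". */
#define RESERVED_SID_PREFIX 0x1234560000000000ULL
#define RESERVED_SID_MASK   0xFFFFFF0000000000ULL

/* Nonzero if the service ID lies inside the reserved range. */
static int sid_is_reserved(uint64_t sid)
{
	/* Sanity check: the prefix must not have bits outside the mask. */
	assert((RESERVED_SID_PREFIX & ~RESERVED_SID_MASK) == 0);

	return (sid & RESERVED_SID_MASK) == RESERVED_SID_PREFIX;
}

/*
 * Gate a service ID passed in by a consumer.
 * Returns 0 if the request may proceed, -EACCES if an unprivileged
 * caller tries to use a service ID from the reserved range.
 */
static int check_user_sid(uint64_t sid, int caller_is_privileged)
{
	if (sid_is_reserved(sid) && !caller_is_privileged)
		return -EACCES;
	return 0;
}
```

A check like this only helps if every path by which an unprivileged process can issue a REQ goes through it, which is the crux of the disagreement in this thread.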
From mshefty at ichips.intel.com Fri Nov 11 10:58:27 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 11 Nov 2005 10:58:27 -0800 Subject: [openib-general] RE: [dat-discussions] socket based connectionmodel for IB proposal - round 3 In-Reply-To: <527jbfgguy.fsf@cisco.com> References: <54AD0F12E08D1541B826BE97C98F99F104166D@NT-SJCA-0751.brcm.ad.broadcom.com> <4374E61C.2030002@ichips.intel.com> <527jbfgguy.fsf@cisco.com> Message-ID: <4374E9D3.3050002@ichips.intel.com> Roland Dreier wrote: > Sean> The current software did not require changes to the CM > Sean> protocol, and does provide assurance that the private data > Sean> was formatted by a kernel entity. > > How do you prevent a userspace process from using the current ucm > module to connect to one of the CMA services on a remote system? If > you allow that, then the process can use the old simple direct IB CM > interface to put whatever it wants into the REQ private data. The kernel uCM module needs to verify that the SID is not used, but I'm waiting until a standard SID is defined before adding it. SDP has a similar issue, except its check could be added now, since the SID is well known. - Sean From caitlinb at broadcom.com Fri Nov 11 10:58:26 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 11 Nov 2005 10:58:26 -0800 Subject: [openib-general] RE: [dat-discussions] socket based connectionmodel for IB proposal - round 3 Message-ID: <54AD0F12E08D1541B826BE97C98F99F1041676@NT-SJCA-0751.brcm.ad.broadcom.com> Roland Dreier wrote: > Sean> The current software did not require changes to the CM > Sean> protocol, and does provide assurance that the private data > Sean> was formatted by a kernel entity. > > How do you prevent a userspace process from using the current > ucm module to connect to one of the CMA services on a remote > system? If you allow that, then the process can use the old > simple direct IB CM interface to put whatever it wants into > the REQ private data. 
> > I don't see any handling of the service ID in the ucm or cm > code beyond taking what the consumer passes in and formatting > the CM messages with it. > > - R. That's exactly my point. As far as I can see, the only way to prevent that is to have a bit that a current CM implementation does not set, meaning it is in the header not the private data. From recio at us.ibm.com Fri Nov 11 11:00:46 2005 From: recio at us.ibm.com (Renato Recio) Date: Fri, 11 Nov 2005 13:00:46 -0600 Subject: [swg] RE: [openib-general] RE: [dat-discussions] socket based connectionmodel for IB proposal - round 3 Message-ID: The CM cannot get a message from a non-privileged requestor, because a non-privileged requestor cannot insert the privileged Q_Key into the packet. Renato J Recio Chief Architect, eServer I/O IBM Distinguished Engineer Member IBM Academy of Technology Tel 512-838-3685, T/L 678-3685 "Caitlin Bestler" , "Kanevsky, Arkady" m.com> cc: swg at infinibandta.org, openib-general at openib.org, 11/11/2005 11:56 dat-discussions at yahoogroups.com AM Subject: [swg] RE: [openib-general] RE: [dat-discussions] socket based connectionmodel for IB proposal - round 3 Sean Hefty wrote in response to Arkady Kanevsky: > > What this proposal defines is basically a 65th bit for the > service ID. If the new 65 bit SID is: > > 1 - private data has this format > 0 - private data format is unknown > > Why do we need this 65th bit? > Because current software can set any of the 64 bits. There is no assurance that any bit within the current 64 being set means that privileged software on the remote side is vouching for the standardized portion of the private data. An NFS daemon today knows that the remote IP address of an established connection is consistent with the local routing configured at the privileged layer in at least two locations (the local machine and the remote peer). Because an IP address is not used for routing we are already losing some of that. 
Having *neither* end be validated by the network stack is a serious change in the semantics of a connection request. Daemons frequently use the remote IP address to validate at least some clients as being "local" and hence trusted. On an IP network this is amazingly effective if combined with ingress filtering. But even if there is no ingress filtering, an established connection with a local IP address is inherently local. It is at most single packets, or those outside of a connection, that are suspect. Unless there is information in the REQ that cannot be set by a current CM in response to a non-privileged request then the Daemon loses this ability of inferring "neighbor status" based on the remote IP address. That is simply no longer "IP compatible" connection establishment. 
From caitlinb at broadcom.com Fri Nov 11 11:12:15 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 11 Nov 2005 11:12:15 -0800 Subject: [swg] RE: [openib-general] RE: [dat-discussions] socket based connectionmodel for IB proposal - round 3 Message-ID: <54AD0F12E08D1541B826BE97C98F99F1041679@NT-SJCA-0751.brcm.ad.broadcom.com> ________________________________ From: Renato Recio [mailto:recio at us.ibm.com] Sent: Friday, November 11, 2005 11:01 AM To: Caitlin Bestler Cc: Kanevsky, Arkady; dat-discussions at yahoogroups.com; Sean Hefty; openib-general at openib.org; swg at infinibandta.org Subject: Re: [swg] RE: [openib-general] RE: [dat-discussions] socket based connectionmodel for IB proposal - round 3 The CM cannot get a message from a non-privileged requestor, because a non-privileged requestor cannot insert the privileged Q_Key into the packet. But a non-privileged remote consumer could make a request of an existing CM. That existing CM would consider the entire "private data" field to be, well, private. It would obviously not validate any of it. So getting the Q_Key does not guarantee that the private data is validated. There has to be a field outside of the private data that can only be set by privileged code that means "I am aware of the expectation that I have validated the standardized portion of the private data in this optional format." And yes, the Q-Key is how we know that assertion is coming from privileged remote software. From sean.hefty at intel.com Fri Nov 11 11:19:24 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 11 Nov 2005 11:19:24 -0800 Subject: [openib-general] [PATCH] [CMA] add port_num to rdma_cm_id Message-ID: This patch adds a port_num field to struct rdma_cm_id to make it easier for clients to identify which port they are currently using. 
It also removes a few unneeded variable initializations, and fixes a bug where the IB address for an rdma_bind_addr() was not set. Since the userspace CMA was just checked in, I didn't bother updating the ABI. Signed-off-by: Sean Hefty Index: userspace/librdmacm/include/rdma/rdma_cma_abi.h =================================================================== --- userspace/librdmacm/include/rdma/rdma_cma_abi.h (revision 4019) +++ userspace/librdmacm/include/rdma/rdma_cma_abi.h (working copy) @@ -91,10 +91,6 @@ __u32 id; }; -struct ucma_abi_bind_addr_resp { - __u64 node_guid; -}; - struct ucma_abi_resolve_addr { struct sockaddr_in6 src_addr; struct sockaddr_in6 dst_addr; @@ -118,6 +114,8 @@ struct ib_kern_path_rec ib_route[2]; struct sockaddr_in6 src_addr; __u32 num_paths; + __u8 port_num; + __u8 reserved[7]; }; struct ucma_abi_conn_param { Index: userspace/librdmacm/include/rdma/rdma_cma.h =================================================================== --- userspace/librdmacm/include/rdma/rdma_cma.h (revision 4019) +++ userspace/librdmacm/include/rdma/rdma_cma.h (working copy) @@ -79,6 +79,7 @@ void *context; struct ibv_qp *qp; struct rdma_route route; + uint8_t port_num; }; struct rdma_cm_event { Index: userspace/librdmacm/src/cma.c =================================================================== --- userspace/librdmacm/src/cma.c (revision 4019) +++ userspace/librdmacm/src/cma.c (working copy) @@ -291,9 +291,54 @@ } } +static int ucma_query_route(struct rdma_cm_id *id) +{ + struct ucma_abi_query_route_resp *resp; + struct ucma_abi_query_route *cmd; + struct cma_id_private *id_priv; + void *msg; + int ret, size, i; + + CMA_CREATE_MSG_CMD_RESP(msg, cmd, resp, UCMA_CMD_QUERY_ROUTE, size); + id_priv = container_of(id, struct cma_id_private, id); + cmd->id = id_priv->handle; + + ret = write(cma_fd, msg, size); + if (ret != size) + return (ret > 0) ? 
-ENODATA : ret; + + if (resp->num_paths) { + id->route.path_rec = malloc(sizeof *id->route.path_rec * + resp->num_paths); + if (!id->route.path_rec) + return -ENOMEM; + + id->route.num_paths = resp->num_paths; + for (i = 0; i < resp->num_paths; i++) + ib_copy_path_rec_from_kern(&id->route.path_rec[i], + &resp->ib_route[i]); + } + + memcpy(id->route.addr.addr.ibaddr.sgid.raw, resp->ib_route[0].sgid, + sizeof id->route.addr.addr.ibaddr.sgid); + memcpy(id->route.addr.addr.ibaddr.dgid.raw, resp->ib_route[0].dgid, + sizeof id->route.addr.addr.ibaddr.dgid); + id->route.addr.addr.ibaddr.pkey = resp->ib_route[0].pkey; + memcpy(&id->route.addr.src_addr, &resp->src_addr, + sizeof id->route.addr.src_addr); + + if (!id_priv->cma_dev && resp->node_guid) { + ret = ucma_get_device(id_priv, resp->node_guid); + if (ret) + return ret; + id_priv->id.port_num = resp->port_num; + } + + return 0; +} + int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr) { - struct ucma_abi_bind_addr_resp *resp; struct ucma_abi_bind_addr *cmd; struct cma_id_private *id_priv; void *msg; @@ -303,7 +348,7 @@ if (!addrlen) return -EINVAL; - CMA_CREATE_MSG_CMD_RESP(msg, cmd, resp, UCMA_CMD_BIND_ADDR, size); + CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_BIND_ADDR, size); id_priv = container_of(id, struct cma_id_private, id); cmd->id = id_priv->handle; memcpy(&cmd->addr, addr, addrlen); @@ -312,11 +357,9 @@ if (ret != size) return (ret > 0) ? 
-ENODATA : ret; - if (resp->node_guid) { - ret = ucma_get_device(id_priv, resp->node_guid); - if (ret) - return ret; - } + ret = ucma_query_route(id); + if (ret) + return ret; memcpy(&id->route.addr.src_addr, addr, addrlen); return 0; @@ -442,24 +485,6 @@ return ibv_modify_qp(id->qp, &qp_attr, IBV_QP_STATE); } -static int ucma_find_gid(struct cma_device *cma_dev, union ibv_gid *gid, - uint8_t *port_num) -{ - int port, ret, i; - union ibv_gid chk_gid; - - for (port = 1; port <= cma_dev->port_cnt; port++) - for (i = 0, ret = 0; !ret; i++) { - ret = ibv_query_gid(cma_dev->verbs, port, i, &chk_gid); - if (!ret && !memcmp(gid, &chk_gid, sizeof *gid)) { - *port_num = port; - return 0; - } - } - - return -EINVAL; -} - static int ucma_find_pkey(struct cma_device *cma_dev, uint8_t port_num, uint16_t pkey, uint16_t *pkey_index) { @@ -483,19 +508,15 @@ struct ib_addr *ibaddr; int ret; - qp_attr.qp_state = IBV_QPS_INIT; - qp_attr.qp_access_flags = IBV_ACCESS_LOCAL_WRITE; - ibaddr = &id_priv->id.route.addr.addr.ibaddr; - ret = ucma_find_gid(id_priv->cma_dev, &ibaddr->sgid, &qp_attr.port_num); - if (ret) - return ret; - - ret = ucma_find_pkey(id_priv->cma_dev, qp_attr.port_num, ibaddr->pkey, - &qp_attr.pkey_index); + ret = ucma_find_pkey(id_priv->cma_dev, id_priv->id.port_num, + ibaddr->pkey, &qp_attr.pkey_index); if (ret) return ret; + qp_attr.port_num = id_priv->id.port_num; + qp_attr.qp_state = IBV_QPS_INIT; + qp_attr.qp_access_flags = IBV_ACCESS_LOCAL_WRITE; return ibv_modify_qp(qp, &qp_attr, IBV_QP_STATE | IBV_QP_ACCESS_FLAGS | IBV_QP_PKEY_INDEX | IBV_QP_PORT); } @@ -531,51 +552,6 @@ ibv_destroy_qp(id->qp); } -static int ucma_query_route(struct rdma_cm_id *id) -{ - struct ucma_abi_query_route_resp *resp; - struct ucma_abi_query_route *cmd; - struct cma_id_private *id_priv; - void *msg; - int ret, size, i; - - CMA_CREATE_MSG_CMD_RESP(msg, cmd, resp, UCMA_CMD_QUERY_ROUTE, size); - id_priv = container_of(id, struct cma_id_private, id); - cmd->id = id_priv->handle; - - ret = 
write(cma_fd, msg, size); - if (ret != size) - return (ret > 0) ? -ENODATA : ret; - - if (resp->num_paths) { - id->route.path_rec = malloc(sizeof *id->route.path_rec * - resp->num_paths); - if (!id->route.path_rec) - return -ENOMEM; - - id->route.num_paths = resp->num_paths; - for (i = 0; i < resp->num_paths; i++) - ib_copy_path_rec_from_kern(&id->route.path_rec[i], - &resp->ib_route[i]); - } - - memcpy(id->route.addr.addr.ibaddr.sgid.raw, resp->ib_route[0].sgid, - sizeof id->route.addr.addr.ibaddr.sgid); - memcpy(id->route.addr.addr.ibaddr.dgid.raw, resp->ib_route[0].dgid, - sizeof id->route.addr.addr.ibaddr.dgid); - id->route.addr.addr.ibaddr.pkey = resp->ib_route[0].pkey; - memcpy(&id->route.addr.src_addr, &resp->src_addr, - sizeof id->route.addr.src_addr); - - if (!id_priv->cma_dev) { - ret = ucma_get_device(id_priv, resp->node_guid); - if (ret) - return ret; - } - - return 0; -} - static void ucma_copy_conn_param_to_kern(struct ucma_abi_conn_param *dst, struct rdma_conn_param *src, struct ibv_qp *qp) Index: linux-kernel/infiniband/include/rdma/rdma_user_cm.h =================================================================== --- linux-kernel/infiniband/include/rdma/rdma_user_cm.h (revision 4019) +++ linux-kernel/infiniband/include/rdma/rdma_user_cm.h (working copy) @@ -92,10 +92,6 @@ __u32 id; }; -struct rdma_ucm_bind_addr_resp { - __u64 node_guid; -}; - struct rdma_ucm_resolve_addr { struct sockaddr_in6 src_addr; struct sockaddr_in6 dst_addr; @@ -119,6 +115,8 @@ struct ib_user_path_rec ib_route[2]; struct sockaddr_in6 src_addr; __u32 num_paths; + __u8 port_num; + __u8 reserved[7]; }; struct rdma_ucm_conn_param { Index: linux-kernel/infiniband/include/rdma/rdma_cm.h =================================================================== --- linux-kernel/infiniband/include/rdma/rdma_cm.h (revision 4019) +++ linux-kernel/infiniband/include/rdma/rdma_cm.h (working copy) @@ -92,6 +92,7 @@ struct ib_qp *qp; rdma_cm_event_handler event_handler; struct rdma_route route; 
+ u8 port_num; }; struct rdma_cm_id* rdma_create_id(rdma_cm_event_handler event_handler, Index: linux-kernel/infiniband/core/cma.c =================================================================== --- linux-kernel/infiniband/core/cma.c (revision 4019) +++ linux-kernel/infiniband/core/cma.c (working copy) @@ -196,11 +196,11 @@ { struct cma_device *cma_dev; int ret = -ENODEV; - u8 port; down(&mutex); list_for_each_entry(cma_dev, &dev_list, list) { - ret = ib_find_cached_gid(cma_dev->device, gid, &port, NULL); + ret = ib_find_cached_gid(cma_dev->device, gid, + &id_priv->id.port_num, NULL); if (!ret) { cma_attach_to_dev(id_priv, cma_dev); break; @@ -249,23 +249,17 @@ static int cma_init_ib_qp(struct rdma_id_private *id_priv, struct ib_qp *qp) { struct ib_qp_attr qp_attr; - struct ib_addr *ibaddr; + struct ib_addr *ibaddr = &id_priv->id.route.addr.addr.ibaddr; int ret; - qp_attr.qp_state = IB_QPS_INIT; - qp_attr.qp_access_flags = IB_ACCESS_LOCAL_WRITE; - - ibaddr = &id_priv->id.route.addr.addr.ibaddr; - ret = ib_find_cached_gid(id_priv->id.device, &ibaddr->sgid, - &qp_attr.port_num, NULL); - if (ret) - return ret; - - ret = ib_find_cached_pkey(id_priv->id.device, qp_attr.port_num, + ret = ib_find_cached_pkey(id_priv->id.device, id_priv->id.port_num, ibaddr->pkey, &qp_attr.pkey_index); if (ret) return ret; + qp_attr.qp_state = IB_QPS_INIT; + qp_attr.qp_access_flags = IB_ACCESS_LOCAL_WRITE; + qp_attr.port_num = id_priv->id.port_num; return ib_modify_qp(qp, &qp_attr, IB_QP_STATE | IB_QP_ACCESS_FLAGS | IB_QP_PKEY_INDEX | IB_QP_PORT); } @@ -908,12 +902,6 @@ { struct ib_addr *addr = &id_priv->id.route.addr.addr.ibaddr; struct ib_sa_path_rec path_rec; - int ret; - u8 port; - - ret = ib_find_cached_gid(id_priv->id.device, &addr->sgid, &port, NULL); - if (ret) - return -ENODEV; memset(&path_rec, 0, sizeof path_rec); path_rec.sgid = addr->sgid; @@ -922,7 +910,7 @@ path_rec.numb_path = 1; id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device, - port, &path_rec, + 
id_priv->id.port_num, &path_rec, IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH, timeout_ms, GFP_KERNEL, Index: linux-kernel/infiniband/core/ucma.c =================================================================== --- linux-kernel/infiniband/core/ucma.c (revision 4019) +++ linux-kernel/infiniband/core/ucma.c (working copy) @@ -352,12 +352,8 @@ int in_len, int out_len) { struct rdma_ucm_bind_addr cmd; - struct rdma_ucm_bind_addr_resp resp; struct ucma_context *ctx; - int ret = 0; - - if (out_len < sizeof(resp)) - return -ENOSPC; + int ret; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; @@ -367,18 +363,6 @@ return PTR_ERR(ctx); ret = rdma_bind_addr(ctx->cm_id, (struct sockaddr *) &cmd.addr); - if (ret) - goto out; - - if (ctx->cm_id->device) - resp.node_guid = ctx->cm_id->device->node_guid; - else - resp.node_guid = 0; - if (copy_to_user((void __user *)(unsigned long)cmd.response, - &resp, sizeof(resp))) - ret = -EFAULT; - -out: ucma_put_ctx(ctx); return ret; } @@ -389,7 +373,7 @@ { struct rdma_ucm_resolve_addr cmd; struct ucma_context *ctx; - int ret = 0; + int ret; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; @@ -411,7 +395,7 @@ { struct rdma_ucm_resolve_route cmd; struct ucma_context *ctx; - int ret = 0; + int ret; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; @@ -520,7 +504,7 @@ struct rdma_ucm_connect cmd; struct rdma_conn_param conn_param; struct ucma_context *ctx; - int ret = 0; + int ret; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; @@ -563,7 +547,7 @@ struct rdma_ucm_accept cmd; struct rdma_conn_param conn_param; struct ucma_context *ctx; - int ret = 0; + int ret; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; @@ -588,7 +572,7 @@ { struct rdma_ucm_reject cmd; struct ucma_context *ctx; - int ret = 0; + int ret; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; @@ -607,7 +591,7 @@ { struct rdma_ucm_disconnect cmd; struct 
ucma_context *ctx; - int ret = 0; + int ret; if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; @@ -629,7 +613,7 @@ struct ib_uverbs_qp_attr resp; struct ucma_context *ctx; struct ib_qp_attr qp_attr; - int ret = 0; + int ret; if (out_len < sizeof(resp)) return -ENOSPC; From krause at cup.hp.com Fri Nov 11 12:00:02 2005 From: krause at cup.hp.com (Michael Krause) Date: Fri, 11 Nov 2005 12:00:02 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB Message-ID: <6.2.0.14.2.20051111115707.0254a7e0@esmail.cup.hp.com> At 10:28 AM 11/9/2005, Rick Frank wrote: >Yes, the application is responsible for detecting lost msgs at the >application level - the transport can not do this. > >RDS does not guarantee that a message has been delivered to the >application - just that once the transport has accepted a msg it will >deliver the msg to the remote node in order without duplication - dealing >with retransmissions, etc due to sporadic / intermittent msg loss over the >interconnect. If after accepting the send - the current path fails - then >RDS will transparently fail over to another path - and if required will >resend / send any already queued msgs to the remote node - again insuring >that no msg is duplicated and they are in order. This is no different >than APM - with the exception that RDS can do this across HCAs. > >The application - Oracle in this case - will deal with detecting a >catastrophic path failure - either due to a send that does not arrive and >or a timedout response or send failure returned from the transport. If >there is no network path to a remote node - it is required that we remove >the remote node from the operating cluster to avoid what is commonly >termed as a "split brain" condition - otherwise known as a "partition in time". > >BTW - in our case - the application failure domain logic is the same >whether we are using UDP / uDAPL / iTAPI / TCP / SCTP / etc. 
Basically, >if we can not talk to a remote node - after some defined period of time - >we will remove the remote node from the cluster. In this case the database >will recover all the interesting state that may have been maintained on >the removed node - allowing the remaining nodes to continue. If later on, >communication to the remote node is restored - it will be allowed to >rejoin the cluster and take on application load. Please clarify the following which was in the document provided by Oracle. On page 3 of the RDS document, under the section "RDP Interface", the 2nd and 3rd paragraphs are state: * RDP does not guarantee that a datagram is delivered to the remote application. * It is up to the RDP client to deal with datagrams lost due to transport failure or remote application failure. The HCA is still a fault domain with RDS - it does not address flushing data out of the HCA fault domain, nor does it sound like it ensures that CQE loss is recoverable. I do believe RDS will replay all of the sendmsg's that it believes are pending, but it has no way to determine if already sent sendmsgs were actually successfully delivered to the remote application unless it provides some level of resync of the outstanding sends not completed from an application's perspective as well as any state updated via RDMA operations which may occur without an explicit send operation to flush to a known state. I'm still trying to ascertain whether RDS completely recovers from HCA failure (assuming there is another HCA / path available) between the two endnodes. Mike -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ftillier at silverstorm.com Fri Nov 11 12:18:18 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Fri, 11 Nov 2005 12:18:18 -0800 Subject: [openib-general] RE: [dat-discussions] socket basedconnectionmodel for IB proposal - round 3 In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F104166D@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <000001c5e6fd$0ed48410$9e5aa8c0@infiniconsys.com> > From: Caitlin Bestler [mailto:caitlinb at broadcom.com] > Sent: Friday, November 11, 2005 9:57 AM > > Sean Hefty wrote in response to Arkady Kanevsky: > > > What's this proposal defines is basically a 65th bit for the > > service ID. If the new 65 bit SID is: > > > > 1 - private data has this format > > 0 - private data format is unknown > > > > Why do we need this 65th bit? > > Because current software can set any of the 64 bits. > There is no assurance that any bit within the current 64 > being set means that privileged software on the remote > side is vouching for the standardized portion of the > private data. Do we need the remote side to vouch for that portion of the private data? The recipient of a CM REQ can validate fully that the GIDs in the path record match the IP addresses. That was the whole point of this proposal - eliminate the need to do some reverse lookup of GID to IP based on the source GID in the path record. With the source IP provided in the private data, the recipient of the CM REQ can do a forward lookup of that IP address and validate that the GID returned matches the one in the CM REQ path. Thus, all address translation can use forward lookups and we eliminate the flaws of the reverse lookup schemes that are currently in use. It doesn't matter one bit if the CM REQ private data was formatted by a privileged entity or not - garbage in the private data can be detected by the receiving entity, even one that sits above the IB CM. 
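The forward-lookup check described above can be sketched roughly as follows. This is a minimal illustration, not OpenIB code: the `struct gid` and `struct ip_gid_entry` types, the in-memory table, and both function names are hypothetical stand-ins for whatever address translation service the recipient actually uses.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical types for illustration only; these are not OpenIB definitions. */
struct gid {
	uint8_t raw[16];
};

struct ip_gid_entry {
	uint32_t ipv4;   /* IPv4 address in host byte order */
	struct gid gid;  /* GID that the address resolves to */
};

/*
 * Forward lookup (IP -> GID). A stand-in for real address translation.
 * Returns 0 and fills *out on success, -1 if the address is unknown.
 */
int forward_lookup(const struct ip_gid_entry *tbl, size_t n,
		   uint32_t ipv4, struct gid *out)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (tbl[i].ipv4 == ipv4) {
			*out = tbl[i].gid;
			return 0;
		}
	}
	return -1;
}

/*
 * Returns 1 if the source IP claimed in the REQ private data resolves to
 * the source GID carried in the REQ path record, 0 otherwise.
 */
int req_src_ip_matches_gid(const struct ip_gid_entry *tbl, size_t n,
			   uint32_t claimed_ipv4, const struct gid *req_sgid)
{
	struct gid resolved;

	if (forward_lookup(tbl, n, claimed_ipv4, &resolved))
		return 0;
	return memcmp(resolved.raw, req_sgid->raw, sizeof resolved.raw) == 0;
}
```

Under these assumptions the check needs no privilege: any recipient that can perform the forward lookup can reject a REQ whose private data does not match its path.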
- Fab From recio at us.ibm.com Fri Nov 11 12:24:35 2005 From: recio at us.ibm.com (Renato Recio) Date: Fri, 11 Nov 2005 14:24:35 -0600 Subject: [swg] RE: [openib-general] RE: [dat-discussions] socket based connectionmodel for IB proposal - round 3 Message-ID: Any active side QP can target a passive side CM QP (QP1 or redirected QPN). However, due to the use of priviliged Q_Keys, only an active side priviliged QP can target the passive side CM QP. It seems to me that our proposal of having the Service ID be generated by priviliged mode code, having a Service ID associated with RDMA Services (e.g. iSER, NFSeR, ...), and having priviliged mode code generate the first N bytes of the private data field (i.e. the bytes in question); allows the passive side: - Transport to validate an incoming CM message was generated by a priviliged consumer; and - CM to know the Service ID and first N-bytes of the private data field were generated by a priviliged consumer. Thanks, Renato J Recio Chief Architect, eServer I/O IBM Distinguished Engineer Member IBM Academy of Technology Tel 512-838-3685, T/L 678-3685 "Caitlin Bestler" cc: "Kanevsky, Arkady" , dat-discussions at yahoogroups.com, "Sean Hefty" , 11/11/2005 01:12 openib-general at openib.org, swg at infinibandta.org PM Subject: RE: [swg] RE: [openib-general] RE: [dat-discussions] socket based connectionmodel for IB proposal - round 3 From: Renato Recio [mailto:recio at us.ibm.com] Sent: Friday, November 11, 2005 11:01 AM To: Caitlin Bestler Cc: Kanevsky, Arkady; dat-discussions at yahoogroups.com; Sean Hefty; openib-general at openib.org; swg at infinibandta.org Subject: Re: [swg] RE: [openib-general] RE: [dat-discussions] socket based connectionmodel for IB proposal - round 3 The CM cannot get a message from a non-priviliged requestor, because a non-privilited requestor cannot insert the priviliged Q_Key into the packet. But a non-privileged remote consumer could make a request of an existing CM. 
That existing CM would consider the entire "private data" field to be, well, private. It would obviously not validate any of it. So getting the Q_Key does not guarantee that the private data is validated. There has to be a field outside of the private data that can only be set by privileged codes that means "I am aware of the expectation that I have validated the standardized portion of the private data in this optional format." And yes, the Q-Key is how we know that assertion is coming from privileged remote software. From Nitin.Hande at Sun.COM Fri Nov 11 13:01:17 2005 From: Nitin.Hande at Sun.COM (Nitin Hande) Date: Fri, 11 Nov 2005 13:01:17 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB In-Reply-To: <6.2.0.14.2.20051111115707.0254a7e0@esmail.cup.hp.com> References: <6.2.0.14.2.20051111115707.0254a7e0@esmail.cup.hp.com> Message-ID: <4375069D.7040106@sun.com> Michael Krause wrote: > At 10:28 AM 11/9/2005, Rick Frank wrote: > >> Yes, the application is responsible for detecting lost msgs at the >> application level - the transport can not do this. >> >> RDS does not guarantee that a message has been delivered to the >> application - just that once the transport has accepted a msg it will >> deliver the msg to the remote node in order without duplication - >> dealing with retransmissions, etc due to sporadic / intermittent msg >> loss over the interconnect. 
If after accepting the send - the current >> path fails - then RDS will transparently fail over to another path - >> and if required will resend / send any already queued msgs to the >> remote node - again insuring that no msg is duplicated and they are in >> order. This is no different than APM - with the exception that RDS >> can do this across HCAs. >> >> The application - Oracle in this case - will deal with detecting a >> catastrophic path failure - either due to a send that does not arrive >> and or a timedout response or send failure returned from the >> transport. If there is no network path to a remote node - it is >> required that we remove the remote node from the operating cluster to >> avoid what is commonly termed as a "split brain" condition - otherwise >> known as a "partition in time". >> >> BTW - in our case - the application failure domain logic is the same >> whether we are using UDP / uDAPL / iTAPI / TCP / SCTP / etc. >> Basically, if we can not talk to a remote node - after some defined >> period of time - we will remove the remote node from the cluster. In >> this case the database will recover all the interesting state that may >> have been maintained on the removed node - allowing the remaining >> nodes to continue. If later on, communication to the remote node is >> restored - it will be allowed to rejoin the cluster and take on >> application load. > > > > Please clarify the following which was in the document provided by Oracle. > > On page 3 of the RDS document, under the section "RDP Interface", the > 2nd and 3rd paragraphs are state: > > * RDP does not guarantee that a datagram is delivered to the remote > application. > * It is up to the RDP client to deal with datagrams lost due to > transport failure or remote application failure. > > The HCA is still a fault domain with RDS - it does not address flushing > data out of the HCA fault domain, nor does it sound like it ensures that > CQE loss is recoverable. 
> > I do believe RDS will replay all of the sendmsg's that it believes are > pending, but it has no way to determine if already sent sendmsgs were > actually successfully delivered to the remote application unless it > provides some level of resync of the outstanding sends not completed > from an application's perspective as well as any state updated via RDMA > operations which may occur without an explicit send operation to flush > to a known state. If RDS could define a mechanism that the application could use to inform the sender to resync and replay on catastrophic failure, is that a correct understanding of your suggestion ? I'm still trying to ascertain whether RDS completely > recovers from HCA failure (assuming there is another HCA / path > available) between the two endnodes Reading at the doc and the thread, it looks like we need src/dst port for multiplexing connections, we need seq/ack# for resyncing, we need some kind of window availability for flow control. Are'nt we very close to tcp header ? .. Nitin . > > Mike > > > ------------------------------------------------------------------------ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rpandit at silverstorm.com Fri Nov 11 13:02:14 2005 From: rpandit at silverstorm.com (Ranjit Pandit) Date: Fri, 11 Nov 2005 13:02:14 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB In-Reply-To: <6.2.0.14.2.20051111115707.0254a7e0@esmail.cup.hp.com> References: <6.2.0.14.2.20051111115707.0254a7e0@esmail.cup.hp.com> Message-ID: <96f8e60e0511111302i8f6db3anfc404b0998c8885@mail.gmail.com> On 11/11/05, Michael Krause wrote: > Please clarify the following which was in the document provided by Oracle. 
> > On page 3 of the RDS document, under the section "RDP Interface", the 2nd > and 3rd paragraphs are state: > > * RDP does not guarantee that a datagram is delivered to the remote > application. > * It is up to the RDP client to deal with datagrams lost due to transport > failure or remote application failure. > > The HCA is still a fault domain with RDS - it does not address flushing data > out of the HCA fault domain, nor does it sound like it ensures that CQE loss > is recoverable. > > I do believe RDS will replay all of the sendmsg's that it believes are > pending, but it has no way to determine if already sent sendmsgs were > actually successfully delivered to the remote application unless it provides > some level of resync of the outstanding sends not completed from an > application's perspective as well as any state updated via RDMA operations > which may occur without an explicit send operation to flush to a known > state. I'm still trying to ascertain whether RDS completely recovers from > HCA failure (assuming there is another HCA / path available) between the two > endnodes. RDS will replay the sends that are completed in error by the HCA, which typically would happen if the current path fails or the remote node/HCA dies. In case of a catastrophic error on the local HCA, subsequent sends will fail (for a certain time (session_time_wait ) ) as if there was no alternate path available at that time. On getting an error the application should discard any sends unacknowledged by it's peer and take corrective action. After the time_wait is over, subsequent sends will initiate a brand new connection which could use the alternate HCA ( if the path is available). 
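The behaviour described above for a catastrophic local HCA error can be sketched as a small decision function. This is only an illustration of the description in this message, not the RDS implementation: the `struct session` layout, the enum, and the function name are hypothetical, and `time_wait_ms` stands in for the session_time_wait value mentioned above.

```c
#include <stdint.h>

/* What the transport does with the next send; names are hypothetical. */
enum send_action {
	SEND_ON_CURRENT,   /* no failure: use the existing connection */
	SEND_FAIL,         /* inside the time-wait window: fail the send */
	SEND_RECONNECT     /* window over: start a brand new connection */
};

/* Hypothetical per-session state, for illustration only. */
struct session {
	int hca_failed;         /* local HCA reported a catastrophic error */
	uint64_t fail_time_ms;  /* when the error was observed */
	uint64_t time_wait_ms;  /* stand-in for session_time_wait */
};

enum send_action next_send_action(const struct session *s, uint64_t now_ms)
{
	if (!s->hca_failed)
		return SEND_ON_CURRENT;
	if (now_ms - s->fail_time_ms < s->time_wait_ms)
		return SEND_FAIL;
	/* The new connection may use the alternate HCA if a path exists. */
	return SEND_RECONNECT;
}
```

In this sketch the application sees errors only during the time-wait window, which is when it is expected to discard sends its peer has not acknowledged.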
> > Mike > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > From caitlinb at broadcom.com Fri Nov 11 13:12:27 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 11 Nov 2005 13:12:27 -0800 Subject: [swg] RE: [openib-general] RE: [dat-discussions] socket based connectionmodel for IB proposal - round 3 Message-ID: <54AD0F12E08D1541B826BE97C98F99F1041687@NT-SJCA-0751.brcm.ad.broadcom.com> ________________________________ From: Renato Recio [mailto:recio at us.ibm.com] Sent: Friday, November 11, 2005 12:25 PM To: Caitlin Bestler Cc: Kanevsky, Arkady; dat-discussions at yahoogroups.com; Sean Hefty; openib-general at openib.org; swg at infinibandta.org Subject: RE: [swg] RE: [openib-general] RE: [dat-discussions] socket based connectionmodel for IB proposal - round 3 Any active side QP can target a passive side CM QP (QP1 or redirected QPN). However, due to the use of priviliged Q_Keys, only an active side priviliged QP can target the passive side CM QP. It seems to me that our proposal of having the Service ID be generated by priviliged mode code, having a Service ID associated with RDMA Services (e.g. iSER, NFSeR, ...), and having priviliged mode code generate the first N bytes of the private data field (i.e. the bytes in question); allows the passive side: - Transport to validate an incoming CM message was generated by a priviliged consumer; and - CM to know the Service ID and first N-bytes of the private data field were generated by a priviliged consumer. How does this prevent a non-privileged client running on a remote host with current CM software from generating a connection request to the targeted Service ID with the entire private data coming from the non-privileged consumer. 
A current CM does not know that the Service ID requires it to generate/validate any portion of the private data. A current CM does not know how to use a later version number or to set a bit that is currently defined as reserved. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ftillier at silverstorm.com Fri Nov 11 14:53:10 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Fri, 11 Nov 2005 14:53:10 -0800 Subject: [swg] RE: [openib-general] RE: [dat-discussions] socketbased connectionmodel for IB proposal - round 3 In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1041687@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <000101c5e712$b4350b90$9e5aa8c0@infiniconsys.com> > From: Caitlin Bestler [mailto:caitlinb at broadcom.com] > Sent: Friday, November 11, 2005 1:12 PM > > How does this prevent a non-privileged client running on a remote host with > current > CM software from generating a connection request to the targeted Service ID > with the entire private data coming from the non-privileged consumer. There is no need to prevent a non-privileged client from generating connection requests. Where does this requirement come from? Who cares where the private data comes from as long as the recipient, whether privileged or not, has a way of validating that it matches the path record information? Specifically, adding the logic in the low level IB CM to validate the private data will tie the IB CM to address translation for IPoIB, which I think is better done at a higher level (like the CMA). If a higher level entity is going to be responsible for validating the private data, the low level IB CM doesn't do squat with the reserved bit. The low level CM API must now expose the bit to allow clients to specify it so that REQs can be routed to them, so that two requests with the same SID can be distinguished form one another by this reserved bit. 
Thus if the bit has to be exposed through the low-level IB CM it is no more than a 65th bit for a service ID. > A current CM does not know that the Service ID requires it to > generate/validate > any portion of the private data. The CM doesn't need to validate any private data. The CM only needs to pass the incoming REQ to a client that listened on that particular SID. The client that listened on the particular SID is expected to know the private data format and to validate it as it sees fit. > A current CM does not know how to use a later version number or to set a > bit that is currently defined as reserved. I don't think we need the reserved bit at all. I agree with Sean it just adds a 65th bit to the SID that is unnecessary. We don't need a privileged-only implementation, either. As long as we have forward lookups of IP to GID available through address translation, any recipient of a CM REQ with the IP-address in the private data can validate that the IP addresses are appropriate for the IB path specified in the CM REQ. - Fab From caitlinb at broadcom.com Fri Nov 11 15:01:31 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 11 Nov 2005 15:01:31 -0800 Subject: [swg] RE: [openib-general] RE: [dat-discussions] socketbased connectionmodel for IB proposal - round 3 Message-ID: <54AD0F12E08D1541B826BE97C98F99F1041699@NT-SJCA-0751.brcm.ad.broadcom.com> Fab Tillier wrote: >> From: Caitlin Bestler [mailto:caitlinb at broadcom.com] >> Sent: Friday, November 11, 2005 1:12 PM >> >> How does this prevent a non-privileged client running on a remote >> host with current CM software from generating a connection request >> to the targeted Service ID with the entire private data coming from >> the non-privileged consumer. > > There is no need to prevent a non-privileged client from > generating connection requests. Where does this requirement > come from? 
Who cares where the private data comes from as > long as the recipient, whether privileged or not, has a way > of validating that it matches the path record information? > > Specifically, adding the logic in the low level IB CM to > validate the private data will tie the IB CM to address > translation for IPoIB, which I think is better done at a > higher level (like the CMA). > > If a higher level entity is going to be responsible for > validating the private data, the low level IB CM doesn't do > squat with the reserved bit. The low level CM API must now > expose the bit to allow clients to specify it so that REQs > can be routed to them, so that two requests with the same SID > can be distinguished form one another by this reserved bit. > Thus if the bit has to be exposed through the low-level IB CM > it is no more than a 65th bit for a service ID. > By the time the connection request is passed to the application the remote IP address needs to be validated. I don't care whether the remote CM validated it (and is known to be privileged software) or if the local CM validates it with a reverse lookup. What I do not want is to kick this problem up to the application. If it is kicked up to the application it is no longer TCP-compatible connection setup, because that responsibility does not exist over TCP. 
From ftillier at silverstorm.com Fri Nov 11 15:53:28 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Fri, 11 Nov 2005 15:53:28 -0800 Subject: [swg] RE: [openib-general] RE: [dat-discussions] socketbased connectionmodel for IB proposal - round 3 In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1041699@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <000201c5e71b$1e914fa0$9e5aa8c0@infiniconsys.com> > From: Caitlin Bestler [mailto:caitlinb at broadcom.com] > > Fab Tillier wrote: > >> From: Caitlin Bestler [mailto:caitlinb at broadcom.com] > >> Sent: Friday, November 11, 2005 1:12 PM > >> > >> How does this prevent a non-privileged client running on a remote > >> host with current CM software from generating a connection request > >> to the targeted Service ID with the entire private data coming from > >> the non-privileged consumer. > > > > There is no need to prevent a non-privileged client from > > generating connection requests. Where does this requirement > > come from? Who cares where the private data comes from as > > long as the recipient, whether privileged or not, has a way > > of validating that it matches the path record information? > > > > Specifically, adding the logic in the low level IB CM to > > validate the private data will tie the IB CM to address > > translation for IPoIB, which I think is better done at a > > higher level (like the CMA). > > > > If a higher level entity is going to be responsible for > > validating the private data, the low level IB CM doesn't do > > squat with the reserved bit. The low level CM API must now > > expose the bit to allow clients to specify it so that REQs > > can be routed to them, so that two requests with the same SID > > can be distinguished form one another by this reserved bit. > > Thus if the bit has to be exposed through the low-level IB CM > > it is no more than a 65th bit for a service ID. 
> > > By the time the connection request is passed to the application > the remote IP address needs to be validated. I agree - by the time the upper-most, IP addressing aware application gets it, whoever is sending the connection request up must have done the validation. > I don't care whether the remote CM validated it (and is known > to be privileged software) or if the local CM validates it > with a reverse lookup. A reverse lookup isn't needed - a forward lookup is. The whole point of passing the IP addresses in the private data was to solve the ambiguity of reverse lookups. > What I do not want is to kick this problem up to the application. > If it is kicked up to the application it is no longer TCP-compatible > connection setup, because that responsibility does not exist over TCP. I agree. I'm just pointing out that the validation of the private data does not have to be done by a privileged entity, so trying to put a bunch of bits in the protocol to require enforcement by privileged code is unnecessary. That means that the CMA functionality could (not should) be implemented in user-mode. - Fab From krause at cup.hp.com Fri Nov 11 15:55:53 2005 From: krause at cup.hp.com (Michael Krause) Date: Fri, 11 Nov 2005 15:55:53 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB In-Reply-To: <4375069D.7040106@sun.com> References: <6.2.0.14.2.20051111115707.0254a7e0@esmail.cup.hp.com> <4375069D.7040106@sun.com> Message-ID: <6.2.0.14.2.20051111155127.02542120@esmail.cup.hp.com> At 01:01 PM 11/11/2005, Nitin Hande wrote: >Michael Krause wrote: >>At 10:28 AM 11/9/2005, Rick Frank wrote: >> >>>Yes, the application is responsible for detecting lost msgs at the >>>application level - the transport can not do this. 
>>> >>>RDS does not guarantee that a message has been delivered to the >>>application - just that once the transport has accepted a msg it will >>>deliver the msg to the remote node in order without duplication - >>>dealing with retransmissions, etc due to sporadic / intermittent msg >>>loss over the interconnect. If after accepting the send - the current >>>path fails - then RDS will transparently fail over to another path - and >>>if required will resend / send any already queued msgs to the remote >>>node - again insuring that no msg is duplicated and they are in >>>order. This is no different than APM - with the exception that RDS can >>>do this across HCAs. >>> >>>The application - Oracle in this case - will deal with detecting a >>>catastrophic path failure - either due to a send that does not arrive >>>and or a timedout response or send failure returned from the transport. >>>If there is no network path to a remote node - it is required that we >>>remove the remote node from the operating cluster to avoid what is >>>commonly termed as a "split brain" condition - otherwise known as a >>>"partition in time". >>> >>>BTW - in our case - the application failure domain logic is the same >>>whether we are using UDP / uDAPL / iTAPI / TCP / SCTP / etc. Basically, >>>if we can not talk to a remote node - after some defined period of time >>>- we will remove the remote node from the cluster. In this case the >>>database will recover all the interesting state that may have been >>>maintained on the removed node - allowing the remaining nodes to >>>continue. If later on, communication to the remote node is restored - it >>>will be allowed to rejoin the cluster and take on application load. >> >>Please clarify the following which was in the document provided by Oracle. >>On page 3 of the RDS document, under the section "RDP Interface", the 2nd >>and 3rd paragraphs are state: >> * RDP does not guarantee that a datagram is delivered to the remote >> application. 
>> * It is up to the RDP client to deal with datagrams lost due to >> transport failure or remote application failure. >>The HCA is still a fault domain with RDS - it does not address flushing >>data out of the HCA fault domain, nor does it sound like it ensures that >>CQE loss is recoverable. >>I do believe RDS will replay all of the sendmsg's that it believes are >>pending, but it has no way to determine if already sent sendmsgs were >>actually successfully delivered to the remote application unless it >>provides some level of resync of the outstanding sends not completed from >>an application's perspective as well as any state updated via RDMA >>operations which may occur without an explicit send operation to flush to >>a known state. >If RDS could define a mechanism that the application could use to inform >the sender to resync and replay on catastrophic failure, is that a correct >understanding of your suggestion ? I'm not suggesting anything at this point. I'm trying to reconcile the documentation with the e-mail statements made by its proponents. >I'm still trying to ascertain whether RDS completely >>recovers from HCA failure (assuming there is another HCA / path >>available) between the two endnodes >Reading at the doc and the thread, it looks like we need src/dst port for >multiplexing connections, we need seq/ack# for resyncing, we need some >kind of window availability for flow control. Are'nt we very close to tcp >header ? .. TCP does not provide end-to-end to the application as implemented by most OS. Unless one ties TCP ACK to the application's consumption of the receive data, there is no method to ascertain that the application really received the data. The application would be required to send its own application-level acknowledgement. I believe the intent is for applications to remain responsible for the end-to-end receipt of data and that RDS and the interconnect are simply responsible for the exchange at the lower levels. 
Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From krause at cup.hp.com Fri Nov 11 16:03:27 2005 From: krause at cup.hp.com (Michael Krause) Date: Fri, 11 Nov 2005 16:03:27 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB In-Reply-To: <96f8e60e0511111302i8f6db3anfc404b0998c8885@mail.gmail.com> References: <6.2.0.14.2.20051111115707.0254a7e0@esmail.cup.hp.com> <96f8e60e0511111302i8f6db3anfc404b0998c8885@mail.gmail.com> Message-ID: <6.2.0.14.2.20051111155559.02542268@esmail.cup.hp.com> At 01:02 PM 11/11/2005, Ranjit Pandit wrote: >On 11/11/05, Michael Krause wrote: > > Please clarify the following which was in the document provided by Oracle. > > > > On page 3 of the RDS document, under the section "RDP Interface", the 2nd > > and 3rd paragraphs are state: > > > > * RDP does not guarantee that a datagram is delivered to the remote > > application. > > * It is up to the RDP client to deal with datagrams lost due to > transport > > failure or remote application failure. > > > > The HCA is still a fault domain with RDS - it does not address flushing > data > > out of the HCA fault domain, nor does it sound like it ensures that CQE > loss > > is recoverable. > > > > I do believe RDS will replay all of the sendmsg's that it believes are > > pending, but it has no way to determine if already sent sendmsgs were > > actually successfully delivered to the remote application unless it > provides > > some level of resync of the outstanding sends not completed from an > > application's perspective as well as any state updated via RDMA operations > > which may occur without an explicit send operation to flush to a known > > state. I'm still trying to ascertain whether RDS completely recovers from > > HCA failure (assuming there is another HCA / path available) between > the two > > endnodes. 
> >RDS will replay the sends that are completed in error by the HCA, >which typically would happen if the current path fails or the remote >node/HCA dies. Does this mean that the receiving RDS entity is responsible for dealing with duplicates? A Send completion error does not mean that the receiving endnode did not receive the data for either IB or iWARP; it only indicates that the Send operation failed which could be just a loss of the receive ACK with the Send completing on the receiver. Such a scenario would imply that RDS would have to comprehend what buffers have actually been consumed before retransmission, i.e. a resync is performed, else one could receive duplicate data at the application layer which can cause corruption or other problems as a function of the application (tolerance will vary by application thus the ULP must present consistent semantics to enable a broader set of applications than perhaps the initial targeted application to be supported). >In case of a catastrophic error on the local HCA, subsequent sends will >fail (for a certain time (session_time_wait ) ) as if there was no >alternate path available at that time. On getting an error the application >should discard any sends unacknowledged by it's peer and take corrective >action. Unacknowledged by the peer means at the interconnect or the application level? Again, how is the receive buffer management handled? >After the time_wait is over, subsequent sends will initiate a brand new >connection which could use the alternate HCA ( if the path is available). This is understood. Mike -------------- next part -------------- An HTML attachment was scrubbed... 
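[Editorial sketch, not taken from the RDS code or document.] The duplicate-delivery concern raised in the message above can be illustrated with a receiver that tracks the next expected sequence number per sender, plus a resync step that tells the sender where to resume after a failed send completion. The names `rx_state`, `rx_accept` and `tx_resync` are hypothetical, and the sketch assumes each message carries a monotonically increasing sequence number, which the thread does not confirm RDS provides.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-sender receive state; not part of RDS. */
struct rx_state {
    uint64_t next_seq;   /* lowest sequence number not yet delivered */
};

/*
 * Decide whether an arriving message is handed to the application.
 * Returns true and advances the state for the next in-order message.
 * Returns false for a replayed duplicate (seq < next_seq) or for a
 * gap (seq > next_seq); neither is delivered.
 */
static bool rx_accept(struct rx_state *rx, uint64_t seq)
{
    if (seq != rx->next_seq)
        return false;
    rx->next_seq++;
    return true;
}

/*
 * Resync after failover: the receiver reports its next_seq and the
 * sender skips messages that were in fact delivered even though their
 * send completions failed. Returns the sequence number of the first
 * message the sender should resend.
 */
static uint64_t tx_resync(uint64_t first_pending_seq, uint64_t peer_next_seq)
{
    return peer_next_seq > first_pending_seq ? peer_next_seq
                                             : first_pending_seq;
}
```

With this shape, a send that completed in error locally but was consumed remotely is either skipped by `tx_resync` or rejected by `rx_accept`, so the application does not see it twice. Whether the acknowledgement point is the interconnect or the application is exactly the open question in the thread, and the sketch does not answer it.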
From recio at us.ibm.com Fri Nov 11 16:48:35 2005 From: recio at us.ibm.com (Renato Recio) Date: Fri, 11 Nov 2005 18:48:35 -0600 Subject: [swg] RE: [openib-general] RE: [dat-discussions] socket based connectionmodel for IB proposal - round 3 Message-ID: A current passive side CM will reject the incoming CM REQ, because the Service ID will not be recognized. Renato J Recio Chief Architect, eServer I/O IBM Distinguished Engineer Member IBM Academy of Technology Tel 512-838-3685, T/L 678-3685 "Caitlin Bestler" 11/11/2005 03:12 PM cc: "Kanevsky, Arkady", dat-discussions at yahoogroups.com, "Sean Hefty", openib-general at openib.org, swg at infinibandta.org Subject: RE: [swg] RE: [openib-general] RE: [dat-discussions] socket based connectionmodel for IB proposal - round 3 From: Renato Recio [mailto:recio at us.ibm.com] Sent: Friday, November 11, 2005 12:25 PM To: Caitlin Bestler Cc: Kanevsky, Arkady; dat-discussions at yahoogroups.com; Sean Hefty; openib-general at openib.org; swg at infinibandta.org Subject: RE: [swg] RE: [openib-general] RE: [dat-discussions] socket based connectionmodel for IB proposal - round 3 Any active side QP can target a passive side CM QP (QP1 or redirected QPN). However, due to the use of privileged Q_Keys, only an active side privileged QP can target the passive side CM QP. It seems to me that our proposal of having the Service ID be generated by privileged mode code, having a Service ID associated with RDMA Services (e.g. iSER, NFSeR, ...), and having privileged mode code generate the first N bytes of the private data field (i.e. the bytes in question), allows the passive side: - Transport to validate that an incoming CM message was generated by a privileged consumer; and - CM to know that the Service ID and first N bytes of the private data field were generated by a privileged consumer.
How does this prevent a non-privileged client running on a remote host with current CM software from generating a connection request to the targeted Service ID with the entire private data coming from the non-privileged consumer? A current CM does not know that the Service ID requires it to generate/validate any portion of the private data. A current CM does not know how to use a later version number or to set a bit that is currently defined as reserved. From eitan at mellanox.co.il Sat Nov 12 09:34:44 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sat, 12 Nov 2005 19:34:44 +0200 Subject: [openib-general] OpenSM and Wrong SM_Key Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E361893A@mtlexch01.mtl.com> Hi Troy, Good to get a straightforward message. What I hear you saying is: 1. There needs to be a parameter to control the SM behavior if it finds another SM with a non-matching SM Key: -> Either to ignore it or to die. We can do that. No problem! 2. The SM log file has too many errors. -> Are you aware of the messages the SM sends to syslog? We try to make these the most important ones. If you feel some messages are missing there - just let us know and we will fix it. -> The osm.log is intended for OpenSM error reporting. This includes any error. We try our best to clean it of unneeded events. But as I say, it is NOT the log you should use for getting the major SM events. You should look at /var/log/messages instead.
Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Troy Benjegerdes [mailto:hozer at hozed.org] > Sent: Friday, November 11, 2005 8:32 AM > To: Eitan Zahavi > Cc: Hal Rosenstock; openib-general at openib.org > Subject: Re: [openib-general] OpenSM and Wrong SM_Key > > On Wed, Nov 09, 2005 at 09:46:06AM +0200, Eitan Zahavi wrote: > > Hi Hal, > > > > I would like to bring this to MgtWG before we change anything. > > IMO the situation when this happens is really not "legal" since if the > > SM's are not coordinated at least in their SM_Key it will cause the two > > masters on the subnet. > > > > >From our experience it is always better to cause a fatal flow and exit > > the SM rather then report the event in some log - normally it will not > > be seen ... > > > > I know this is a controversial issue. > > Okay, so you're telling me you *WANT* behavior where a rogue node can > trivially cause the running subnet manager to exit and take over > management of the network? > > Opensm needs to have a well documented config file, instead of 3 pages > of command line options, and different levels of logging. What to do in > the above situation is a site-local policy config decision, not something > that should be hard-coded in the SM source code. > > The logs might actually get looked at if there wasn't junk in the log > every time something timed out. > > The linux kernel has 'WARN, NOTICE, and CRITICAL' level log messages. From mst at mellanox.co.il Sat Nov 12 13:40:29 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sat, 12 Nov 2005 23:40:29 +0200 Subject: [openib-general] Re: (SPAM?) [RFC] new ibv_get_devices() API -- avoid dlists In-Reply-To: <52y83whz5g.fsf@cisco.com> References: <52y83whz5g.fsf@cisco.com> Message-ID: <20051112214029.GB4941@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: (SPAM?) 
[RFC] new ibv_get_devices() API -- avoid dlists > > > I would prefer one call to get the entire structure. Another option might > > be: > > > > struct ibv_device ** ibv_get_device() > > > > where it returns a list which is null terminated so you do not need to > > return the length. > > Yes, I thought of that too. It seemed faintly preferable to tell the > caller how big the array was rather than forcing the caller to count > for itself. But I can't really think of a use case where it makes a > difference, so perhaps your simpler version is better. > > - R. Me, I like the int ibv_get_device_list(struct ibv_device **list[]); better. NULL-terminated arrays are, to me, inconvenient. -- MST From hozer at hozed.org Sat Nov 12 15:26:52 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Sat, 12 Nov 2005 17:26:52 -0600 Subject: [openib-general] OpenSM and Wrong SM_Key In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E361893A@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E361893A@mtlexch01.mtl.com> Message-ID: <20051112232652.GV3275@kalmia.hozed.org> On Sat, Nov 12, 2005 at 07:34:44PM +0200, Eitan Zahavi wrote: > Hi Troy, > > Good to get a straight forward message. > > What I hear you saying is: > 1. There needs to be a parameter to control the SM behavior if it finds > another SM with non matching SM Key: > -> Either to ignore it or to die. We can do that. No problem! Is it possible to have another option as well, to attempt to disable the port the SM with the non-matching key is connected to? From rolandd at cisco.com Sat Nov 12 20:46:40 2005 From: rolandd at cisco.com (Roland Dreier) Date: Sat, 12 Nov 2005 20:46:40 -0800 Subject: [openib-general] please test kernel 2.6.15-rc1 Message-ID: <52lkztduj3.fsf@cisco.com> As you might have noticed, Linus just released kernel 2.6.15-rc1. This is the end of the free-for-all merge window for kernel 2.6.15 and only fixes are supposed to be merged from now on, so it's a good time to start testing the kernel. 
It would be extremely useful if lots of people tested the IB drivers in the upstream kernel. In other words, please try 2.6.15-rc1 out without replacing drivers/infiniband with a subversion tree -- just build yourself a stock kernel and see how it works. I'd like to make sure that the in-tree IB support works well, and that I didn't leave out a fixes or screw up a merge. You can report any IB problems to openib-general, and report all the other problems to linux-kernel at vger.kernel.org ;) On a related note, I just committed the following change to subversion, which allows the svn tree to be compiled against both 2.6.14 and 2.6.15-rc1 (or later) trees -- now that 2.6.15-rc1 is out, LINUX_VERSION_CODE has changed from 2.6.14 so this sort of hack is now possible. Thanks, Roland --- infiniband/include/rdma/ib_verbs.h (revision 4024) +++ infiniband/include/rdma/ib_verbs.h (working copy) @@ -48,6 +48,14 @@ #include #include +/* XXX remove this compatibility hack when 2.6.15 is released */ +#include + +#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,15) +#define class_device_create(cls, parent, devt, device, fmt, arg...) 
\ + class_device_create(cls, devt, device, fmt, ## arg) +#endif /* XXX end of hack */ + union ib_gid { u8 raw[16]; struct { --- infiniband/core/uverbs_main.c (revision 4024) +++ infiniband/core/uverbs_main.c (working copy) @@ -448,7 +448,6 @@ void ib_uverbs_cq_event_handler(struct i ib_uverbs_async_handler(uobj->uverbs_file, uobj->uobject.user_handle, event->event, &uobj->async_list, &uobj->async_events_reported); - } void ib_uverbs_qp_event_handler(struct ib_event *event, void *context_ptr) @@ -752,7 +751,8 @@ static void ib_uverbs_add_one(struct ib_ if (cdev_add(uverbs_dev->dev, IB_UVERBS_BASE_DEV + uverbs_dev->devnum, 1)) goto err_cdev; - uverbs_dev->class_dev = class_device_create(uverbs_class, uverbs_dev->dev->dev, + uverbs_dev->class_dev = class_device_create(uverbs_class, NULL, + uverbs_dev->dev->dev, device->dma_device, "uverbs%d", uverbs_dev->devnum); if (IS_ERR(uverbs_dev->class_dev)) --- infiniband/core/user_mad.c (revision 4028) +++ infiniband/core/user_mad.c (working copy) @@ -786,7 +786,7 @@ static int ib_umad_init_port(struct ib_d if (cdev_add(port->dev, base_dev + port->dev_num, 1)) goto err_cdev; - port->class_dev = class_device_create(umad_class, port->dev->dev, + port->class_dev = class_device_create(umad_class, NULL, port->dev->dev, device->dma_device, "umad%d", port->dev_num); if (IS_ERR(port->class_dev)) @@ -806,7 +806,7 @@ static int ib_umad_init_port(struct ib_d if (cdev_add(port->sm_dev, base_dev + port->dev_num + IB_UMAD_MAX_PORTS, 1)) goto err_sm_cdev; - port->sm_class_dev = class_device_create(umad_class, port->sm_dev->dev, + port->sm_class_dev = class_device_create(umad_class, NULL, port->sm_dev->dev, device->dma_device, "issm%d", port->dev_num); if (IS_ERR(port->sm_class_dev)) --- infiniband/core/uat.c (revision 4024) +++ infiniband/core/uat.c (working copy) @@ -831,7 +831,7 @@ static int __init ib_uat_init(void) goto err_class; } - class_device_create(ib_uat_class, IB_UAT_DEV, NULL, "uat"); + class_device_create(ib_uat_class, NULL, 
IB_UAT_DEV, NULL, "uat"); idr_init(&ctx_id_table); init_MUTEX(&ctx_id_mutex); From rolandd at cisco.com Sat Nov 12 20:53:49 2005 From: rolandd at cisco.com (Roland Dreier) Date: Sat, 12 Nov 2005 20:53:49 -0800 Subject: [openib-general] Re: (SPAM?) [RFC] new ibv_get_devices() API -- avoid dlists In-Reply-To: <20051112214029.GB4941@mellanox.co.il> (Michael S. Tsirkin's message of "Sat, 12 Nov 2005 23:40:29 +0200") References: <52y83whz5g.fsf@cisco.com> <20051112214029.GB4941@mellanox.co.il> Message-ID: <52hdahdu76.fsf@cisco.com> Michael> Me, I like the Michael> int ibv_get_device_list(struct ibv_device **list[]); Michael> better. NULL-terminated arrays are, to me, inconvenient. I played around with a few things and came up with the interface below. Basically, I like struct ibv_device **ibv_get_device_list(int *num_devices); with num_devices allowed to be NULL best. It makes it possible to get the number of devices if you want, but puts num_devices in a less prominent place, which makes sense because the number of devices is secondary to the actual list of devices. And I think having a parameter that's an "array of pointer to pointer to struct ibv_device" is just too weird for me, in the end. I'll cook up patches for the various users of libibverbs that I know about (MVAPICH, Open MPI, uDAPL, ibtp) before committing this. - R. --- libibverbs/include/infiniband/verbs.h (revision 4024) +++ libibverbs/include/infiniband/verbs.h (working copy) @@ -585,9 +585,21 @@ struct ibv_context { }; /** - * ibv_get_devices - Return list of IB devices + * ibv_get_device_list - Get list of IB devices currently available + * @num_devices: optional. if non-NULL, set to the number of devices + * returned in the array. + * + * Return a NULL-terminated array of IB devices. The array can be + * released with ibv_free_device_list(). 
+ */ +extern struct ibv_device **ibv_get_device_list(int *num_devices); + +/** + * ibv_free_device_list - Free list from ibv_get_device_list() + * + * Free an array of devices returned from ibv_get_device_list() */ -extern struct dlist *ibv_get_devices(void); +extern void ibv_free_device_list(struct ibv_device **list); /** * ibv_get_device_name - Return kernel device name --- libibverbs/ChangeLog (revision 4024) +++ libibverbs/ChangeLog (working copy) @@ -1,4 +1,15 @@ -2005-11-10 Sean Hefty +2005-11-11 Roland Dreier + + * examples/asyncwatch.c, examples/rc_pingpong.c, + examples/srq_pingpong.c, examples/uc_pingpong.c, + examples/ud_pingpong.c, examples/device_list.c, + examples/devinfo.c: Update examples to match new API. + + * include/infiniband/verbs.h, src/device.c, src/init.c, + src/ibverbs.h: Change from dlist-based ibv_get_devices() API to + simpler ibv_get_device_list() and ibv_free_device_list() API. + +2005-11-10 Sean Hefty * include/infiniband/sa-kern-abi.h: New include file to contain definitions of SA structures passed between userspace and kernel. 
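[Editorial sketch, not part of the patch.] The interface documented above returns a NULL-terminated array of device pointers. The sketch below shows three idioms that such an interface relies on, using a stand-in type rather than libibverbs itself: growing an array through a pointer-to-array argument, copying into a NULL-terminated array, and walking the result. `struct fake_device`, `list_append`, `list_copy` and `list_find` are hypothetical names chosen for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for struct ibv_device; only a name is modelled here. */
struct fake_device {
    const char *name;
};

/*
 * Append dev to a growable array held in *list, doubling the capacity
 * when it is full. Note the parentheses in (*list)[n]: without them,
 * *list[n] parses as *(list[n]) and indexes the wrong pointer.
 * Returns the new element count, or -1 if realloc fails.
 */
static int list_append(struct fake_device ***list, int *cap, int n,
                       struct fake_device *dev)
{
    if (*cap <= n) {
        int new_cap = *cap ? *cap * 2 : 1;
        struct fake_device **tmp = realloc(*list, new_cap * sizeof *tmp);
        if (!tmp)
            return -1;
        *list = tmp;
        *cap = new_cap;
    }
    (*list)[n] = dev;
    return n + 1;
}

/* Copy n entries into a fresh NULL-terminated array; release with free(). */
static struct fake_device **list_copy(struct fake_device **src, int n)
{
    /* One extra zeroed slot serves as the NULL terminator. */
    struct fake_device **l = calloc(n + 1, sizeof *l);
    if (!l)
        return NULL;
    for (int i = 0; i < n; ++i)
        l[i] = src[i];
    return l;
}

/* Walk a NULL-terminated array looking for a device by name. */
static struct fake_device *list_find(struct fake_device **list,
                                     const char *name)
{
    struct fake_device *dev;

    /* Reload dev from the list on every iteration, not just once. */
    for (; (dev = *list); ++list)
        if (!strcmp(dev->name, name))
            return dev;
    return NULL;
}
```

A caller that only wants the first device can test `*list` directly; a caller that wants the count can keep the value returned by `list_append`, which mirrors the optional `num_devices` argument in the proposed API.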
--- libibverbs/src/libibverbs.map (revision 4024) +++ libibverbs/src/libibverbs.map (working copy) @@ -1,6 +1,7 @@ IBVERBS_1.0 { global: - ibv_get_devices; + ibv_get_device_list; + ibv_free_device_list; ibv_get_device_name; ibv_get_device_guid; ibv_open_device; --- libibverbs/src/device.c (revision 4024) +++ libibverbs/src/device.c (working copy) @@ -49,21 +49,36 @@ #include "ibverbs.h" static pthread_mutex_t device_list_lock = PTHREAD_MUTEX_INITIALIZER; -static struct dlist *device_list; +static int num_devices; +static struct ibv_device **device_list; -struct dlist *ibv_get_devices(void) +struct ibv_device **ibv_get_device_list(int *num) { - struct dlist *l; + struct ibv_device **l; + int i; pthread_mutex_lock(&device_list_lock); - if (!device_list) - device_list = ibverbs_init(); - l = device_list; + + if (!num_devices) + num_devices = ibverbs_init(&device_list); + + l = calloc(num_devices + 1, sizeof (struct ibv_device *)); + for (i = 0; i < num_devices; ++i) + l[i] = device_list[i]; + pthread_mutex_unlock(&device_list_lock); + if (num) + *num = l ?
num_devices : 0; + return l; } +void ibv_free_device_list(struct ibv_device **list) +{ + free(list); +} + const char *ibv_get_device_name(struct ibv_device *device) { return device->ibdev->name; --- libibverbs/src/ibverbs.h (revision 4024) +++ libibverbs/src/ibverbs.h (working copy) @@ -47,7 +47,8 @@ #define PFX "libibverbs: " struct ibv_driver { - ibv_driver_init_func init_func; + ibv_driver_init_func init_func; + struct ibv_driver *next; }; struct ibv_abi_compat_v2 { @@ -57,11 +58,11 @@ struct ibv_abi_compat_v2 { extern HIDDEN int abi_ver; -extern struct dlist *ibverbs_init(void); +extern HIDDEN int ibverbs_init(struct ibv_device ***list); -extern int ibv_init_mem_map(void); -extern int ibv_lock_range(void *base, size_t size); -extern int ibv_unlock_range(void *base, size_t size); +extern HIDDEN int ibv_init_mem_map(void); +extern HIDDEN int ibv_lock_range(void *base, size_t size); +extern HIDDEN int ibv_unlock_range(void *base, size_t size); #define IBV_INIT_CMD(cmd, size, opcode) \ do { \ --- libibverbs/src/init.c (revision 4024) +++ libibverbs/src/init.c (working copy) @@ -55,7 +55,7 @@ HIDDEN int abi_ver; static char default_path[] = DRIVER_PATH; static const char *user_path; -static struct dlist *driver_list; +static struct ibv_driver *driver_list; static void load_driver(char *so_path) { @@ -82,7 +82,8 @@ static void load_driver(char *so_path) } driver->init_func = init_func; - dlist_push(driver_list, driver); + driver->next = driver_list; + driver_list = driver; } static void find_drivers(char *dir) @@ -112,8 +113,7 @@ static void find_drivers(char *dir) load_driver(so_glob.gl_pathv[i]); } -static void init_drivers(struct sysfs_class_device *verbs_dev, - struct dlist *device_list) +static struct ibv_device *init_drivers(struct sysfs_class_device *verbs_dev) { struct sysfs_class_device *ib_dev; struct sysfs_attribute *attr; @@ -125,7 +125,7 @@ static void init_drivers(struct sysfs_cl if (!attr) { fprintf(stderr, PFX "Warning: no ibdev class attr for %s\n", 
verbs_dev->name); - return; + return NULL; } sscanf(attr->value, "%63s", ibdev_name); @@ -134,19 +134,17 @@ static void init_drivers(struct sysfs_cl if (!ib_dev) { fprintf(stderr, PFX "Warning: no infiniband class device %s for %s\n", attr->value, verbs_dev->name); - return; + return NULL; } - dlist_for_each_data(driver_list, driver, struct ibv_driver) { + for (driver = driver_list; driver; driver = driver->next) { dev = driver->init_func(verbs_dev); if (dev) { dev->dev = verbs_dev; dev->ibdev = ib_dev; dev->driver = driver; - dlist_push(device_list, dev); - - return; + return dev; } } @@ -155,6 +153,8 @@ static void init_drivers(struct sysfs_cl if (user_path) fprintf(stderr, "%s:", user_path); fprintf(stderr, "%s\n", default_path); + + return NULL; } static int check_abi_version(void) @@ -188,28 +188,23 @@ static int check_abi_version(void) } -struct dlist *ibverbs_init(void) +HIDDEN int ibverbs_init(struct ibv_device ***list) { char *wr_path, *dir; struct sysfs_class *cls; struct dlist *verbs_dev_list; - struct dlist *device_list; struct sysfs_class_device *verbs_dev; + struct ibv_device *device; + struct ibv_device **new_list; + int num_devices = 0; + int list_size = 0; - driver_list = dlist_new(sizeof (struct ibv_driver)); - device_list = dlist_new(sizeof (struct ibv_device)); - if (!driver_list || !device_list) { - fprintf(stderr, PFX "Fatal: couldn't allocate device/driver list.\n"); - abort(); - } + *list = NULL; if (ibv_init_mem_map()) - return NULL; + return 0; - /* - * Check if a driver is statically linked, and if so load it first. - */ - load_driver(NULL); + find_drivers(default_path); /* * Only follow the path passed in through the calling user's @@ -224,25 +219,42 @@ struct dlist *ibverbs_init(void) } } - find_drivers(default_path); + /* + * Now check if a driver is statically linked. Since we push + * drivers onto our driver list, the last driver we find will + * be the first one we try. 
+ */ + load_driver(NULL); cls = sysfs_open_class("infiniband_verbs"); if (!cls) { fprintf(stderr, PFX "Fatal: couldn't open sysfs class 'infiniband_verbs'.\n"); - return NULL; + return 0; } if (check_abi_version()) - return NULL; + return 0; verbs_dev_list = sysfs_get_class_devices(cls); if (!verbs_dev_list) { fprintf(stderr, PFX "Fatal: no infiniband class devices found.\n"); - return NULL; + return 0; } - dlist_for_each_data(verbs_dev_list, verbs_dev, struct sysfs_class_device) - init_drivers(verbs_dev, device_list); + dlist_for_each_data(verbs_dev_list, verbs_dev, struct sysfs_class_device) { + device = init_drivers(verbs_dev); + if (device) { + if (list_size <= num_devices) { + list_size = list_size ? list_size * 2 : 1; + new_list = realloc(*list, list_size * sizeof (struct ibv_device *)); + if (!new_list) + goto out; + *list = new_list; + } + (*list)[num_devices++] = device; + } + } - return device_list; +out: + return num_devices; } --- libibverbs/examples/asyncwatch.c (revision 4024) +++ libibverbs/examples/asyncwatch.c (working copy) @@ -50,34 +50,30 @@ static inline uint64_t be64_to_cpu(uint6 int main(int argc, char *argv[]) { - struct dlist *dev_list; - struct ibv_device *ib_dev; + struct ibv_device **dev_list; struct ibv_context *context; struct ibv_async_event event; - dev_list = ibv_get_devices(); + dev_list = ibv_get_device_list(NULL); if (!dev_list) { fprintf(stderr, "No IB devices found\n"); return 1; } - dlist_start(dev_list); - ib_dev = dlist_next(dev_list); - - if (!ib_dev) { + if (!*dev_list) { fprintf(stderr, "No IB devices found\n"); return 1; } - context = ibv_open_device(ib_dev); + context = ibv_open_device(*dev_list); if (!context) { fprintf(stderr, "Couldn't get context for %s\n", - ibv_get_device_name(ib_dev)); + ibv_get_device_name(*dev_list)); return 1; } printf("%s: async event FD %d\n", - ibv_get_device_name(ib_dev), context->async_fd); + ibv_get_device_name(*dev_list), context->async_fd); while (1) { if (ibv_get_async_event(context,
&event)) --- libibverbs/examples/rc_pingpong.c (revision 4024) +++ libibverbs/examples/rc_pingpong.c (working copy) @@ -447,7 +447,7 @@ static void usage(const char *argv0) int main(int argc, char *argv[]) { - struct dlist *dev_list; + struct ibv_device **dev_list; struct ibv_device *ib_dev; struct pingpong_context *ctx; struct pingpong_dest my_dest; @@ -536,21 +536,20 @@ int main(int argc, char *argv[]) page_size = sysconf(_SC_PAGESIZE); - dev_list = ibv_get_devices(); + dev_list = ibv_get_device_list(NULL); if (!dev_list) { fprintf(stderr, "No IB devices found\n"); return 1; } - dlist_start(dev_list); if (!ib_devname) { - ib_dev = dlist_next(dev_list); + ib_dev = *dev_list; if (!ib_dev) { fprintf(stderr, "No IB devices found\n"); return 1; } } else { - dlist_for_each_data(dev_list, ib_dev, struct ibv_device) + for (; (ib_dev = *dev_list); ++dev_list) if (!strcmp(ibv_get_device_name(ib_dev), ib_devname)) break; if (!ib_dev) { --- libibverbs/examples/srq_pingpong.c (revision 4024) +++ libibverbs/examples/srq_pingpong.c (working copy) @@ -509,7 +509,7 @@ static void usage(const char *argv0) int main(int argc, char *argv[]) { - struct dlist *dev_list; + struct ibv_device **dev_list; struct ibv_device *ib_dev; struct pingpong_context *ctx; struct pingpong_dest my_dest[MAX_QP]; @@ -605,21 +605,20 @@ int main(int argc, char *argv[]) page_size = sysconf(_SC_PAGESIZE); - dev_list = ibv_get_devices(); + dev_list = ibv_get_device_list(NULL); if (!dev_list) { fprintf(stderr, "No IB devices found\n"); return 1; } - dlist_start(dev_list); if (!ib_devname) { - ib_dev = dlist_next(dev_list); + ib_dev = *dev_list; if (!ib_dev) { fprintf(stderr, "No IB devices found\n"); return 1; } } else { - dlist_for_each_data(dev_list, ib_dev, struct ibv_device) + for (; (ib_dev = *dev_list); ++dev_list) if (!strcmp(ibv_get_device_name(ib_dev), ib_devname)) break; if (!ib_dev) { --- libibverbs/examples/uc_pingpong.c (revision 4024) +++ libibverbs/examples/uc_pingpong.c (working copy)
@@ -435,7 +435,7 @@ static void usage(const char *argv0) int main(int argc, char *argv[]) { - struct dlist *dev_list; + struct ibv_device **dev_list; struct ibv_device *ib_dev; struct pingpong_context *ctx; struct pingpong_dest my_dest; @@ -524,21 +524,20 @@ int main(int argc, char *argv[]) page_size = sysconf(_SC_PAGESIZE); - dev_list = ibv_get_devices(); + dev_list = ibv_get_device_list(NULL); if (!dev_list) { fprintf(stderr, "No IB devices found\n"); return 1; } - dlist_start(dev_list); if (!ib_devname) { - ib_dev = dlist_next(dev_list); + ib_dev = *dev_list; if (!ib_dev) { fprintf(stderr, "No IB devices found\n"); return 1; } } else { - dlist_for_each_data(dev_list, ib_dev, struct ibv_device) + for (; (ib_dev = *dev_list); ++dev_list) if (!strcmp(ibv_get_device_name(ib_dev), ib_devname)) break; if (!ib_dev) { --- libibverbs/examples/ud_pingpong.c (revision 4024) +++ libibverbs/examples/ud_pingpong.c (working copy) @@ -443,7 +443,7 @@ static void usage(const char *argv0) int main(int argc, char *argv[]) { - struct dlist *dev_list; + struct ibv_device **dev_list; struct ibv_device *ib_dev; struct pingpong_context *ctx; struct pingpong_dest my_dest; @@ -532,21 +532,20 @@ int main(int argc, char *argv[]) page_size = sysconf(_SC_PAGESIZE); - dev_list = ibv_get_devices(); + dev_list = ibv_get_device_list(NULL); if (!dev_list) { fprintf(stderr, "No IB devices found\n"); return 1; } - dlist_start(dev_list); if (!ib_devname) { - ib_dev = dlist_next(dev_list); + ib_dev = *dev_list; if (!ib_dev) { fprintf(stderr, "No IB devices found\n"); return 1; } } else { - dlist_for_each_data(dev_list, ib_dev, struct ibv_device) + for (; (ib_dev = *dev_list); ++dev_list) if (!strcmp(ibv_get_device_name(ib_dev), ib_devname)) break; if (!ib_dev) { --- libibverbs/examples/device_list.c (revision 4024) +++ libibverbs/examples/device_list.c (working copy) @@ -51,10 +51,9 @@ static inline uint64_t be64_to_cpu(uint6 int main(int argc, char *argv[]) { - struct dlist *dev_list; -
struct ibv_device *ib_dev; + struct ibv_device **dev_list; - dev_list = ibv_get_devices(); + dev_list = ibv_get_device_list(NULL); if (!dev_list) { fprintf(stderr, "No IB devices found\n"); return 1; @@ -63,10 +62,12 @@ int main(int argc, char *argv[]) printf(" %-16s\t node GUID\n", "device"); printf(" %-16s\t----------------\n", "------"); - dlist_for_each_data(dev_list, ib_dev, struct ibv_device) + while (*dev_list) { printf(" %-16s\t%016llx\n", - ibv_get_device_name(ib_dev), - (unsigned long long) be64_to_cpu(ibv_get_device_guid(ib_dev))); + ibv_get_device_name(*dev_list), + (unsigned long long) be64_to_cpu(ibv_get_device_guid(*dev_list))); + ++dev_list; + } return 0; } --- libibverbs/examples/devinfo.c (revision 4024) +++ libibverbs/examples/devinfo.c (working copy) @@ -312,8 +312,7 @@ int main(int argc, char *argv[]) { char *ib_devname = NULL; int ret = 0; - struct dlist *dev_list; - struct ibv_device *ib_dev; + struct ibv_device **dev_list; int num_of_hcas; int ib_port = 0; @@ -350,22 +349,19 @@ int main(int argc, char *argv[]) break; case 'l': - dev_list = ibv_get_devices(); + dev_list = ibv_get_device_list(&num_of_hcas); if (!dev_list) { fprintf(stderr, "Failed to get IB devices list"); return -1; } - num_of_hcas = 0; - dlist_for_each_data(dev_list, ib_dev, struct ibv_device) - num_of_hcas ++; - printf("%d HCA%s found:\n", num_of_hcas, num_of_hcas != 1 ? 
"s" : ""); - dlist_start(dev_list); - dlist_for_each_data(dev_list, ib_dev, struct ibv_device) - printf("\t%s\n", ibv_get_device_name(ib_dev)); + while (*dev_list) { + printf("\t%s\n", ibv_get_device_name(*dev_list)); + ++dev_list; + } printf("\n"); return 0; @@ -376,28 +372,31 @@ int main(int argc, char *argv[]) } } - dev_list = ibv_get_devices(); + dev_list = ibv_get_device_list(NULL); if (!dev_list) { fprintf(stderr, "Failed to get IB device list\n"); return -1; } - dlist_start(dev_list); + if (ib_devname) { - dlist_for_each_data(dev_list, ib_dev, struct ibv_device) - if (!strcmp(ibv_get_device_name(ib_dev), ib_devname)) + while (*dev_list) { + if (!strcmp(ibv_get_device_name(*dev_list), ib_devname)) break; - if (!ib_dev) { + ++dev_list; + } + + if (!*dev_list) { fprintf(stderr, "IB device '%s' wasn't found\n", ib_devname); return -1; } - ret |= print_hca_cap(ib_dev, ib_port); + + ret |= print_hca_cap(*dev_list, ib_port); } else { - ib_dev = dlist_next(dev_list); - if (!ib_dev) { + if (!*dev_list) { fprintf(stderr, "No IB devices found\n"); return -1; } - ret |= print_hca_cap(ib_dev, ib_port); + ret |= print_hca_cap(*dev_list, ib_port); } if (ib_devname) From rolandd at cisco.com Sat Nov 12 21:00:21 2005 From: rolandd at cisco.com (Roland Dreier) Date: Sat, 12 Nov 2005 21:00:21 -0800 Subject: [openib-general] please test kernel 2.6.15-rc1 In-Reply-To: <52lkztduj3.fsf@cisco.com> (Roland Dreier's message of "Sat, 12 Nov 2005 20:46:40 -0800") References: <52lkztduj3.fsf@cisco.com> Message-ID: <52d5l5dtwa.fsf@cisco.com> By the way, I'll be at SC'05 in Seattle Monday through Wednesday of this week. And I'll be on vacation the whole week after that, and the week after that, we're moving from the old Topspin building to the main Cisco campus. Basically, don't expect me to fix anything until January or so (but test 2.6.15-rc1 anyway ;) And if you're at SC'05 as well and want to meet up in person, just drop me a line or look for me on the show floor... - R. 
From eitan at mellanox.co.il Sat Nov 12 22:38:40 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 13 Nov 2005 08:38:40 +0200 Subject: [openib-general] OpenSM and Wrong SM_Key Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3618948@mtlexch01.mtl.com> Hi Troy, > > > > What I hear you saying is: > > 1. There needs to be a parameter to control the SM behavior if it finds > > another SM with non matching SM Key: > > -> Either to ignore it or to die. We can do that. No problem! > > Is it possible to have another option as well, to attempt to disable the > port the SM with the non-matching key is connected to? [EZ] I think that due to the race condition built into this option we might end up with cases where the two SMs ports are shut down (close each other's port) or when both of them ignore the shut down and restart their local port. From yael at mellanox.co.il Sun Nov 13 02:18:05 2005 From: yael at mellanox.co.il (Yael Kalka) Date: 13 Nov 2005 12:18:05 +0200 Subject: [openib-general] [PATCH] Opensm - lid assignment issues Message-ID: <5z64qwyhpe.fsf@mtl066.yok.mtl.com> Hi Hal, During some windows tests we've discovered that there is still another problem in the lid_mgr. The problem happend when 2 HCAs had the same lid - opensm entered an infinite loop. The following patch fixes this. Thanks, Yael Signed-off-by: Yael Kalka Index: opensm/osm_lid_mgr.c =================================================================== --- opensm/osm_lid_mgr.c (revision 4032) +++ opensm/osm_lid_mgr.c (working copy) @@ -550,6 +550,9 @@ __osm_lid_mgr_init_sweep( { /* This port will use its local lid, and consume the entire required lid range. Thus we can skip that range. */ + /* If the disc_max_lid is greater then lid - we can skip right to it, + since we've done all neccessary checks on the lids in between. 
*/ + if (disc_max_lid > lid) lid = disc_max_lid; } } @@ -593,7 +596,14 @@ __osm_lid_mgr_init_sweep( { p_range = (osm_lid_mgr_range_t *)cl_malloc(sizeof(osm_lid_mgr_range_t)); - p_range->min_lid = 1; + /* + The p_range can be NULL in one of 2 cases: + 1. If max_defined_lid == 0. In this case, we want the entire range. + 2. If all lids discovered in the loop where mapped. In this case + no free range exists, and we want to define it after the last + mapped lid. + */ + p_range->min_lid = lid; } p_range->max_lid = p_mgr->p_subn->max_unicast_lid_ho - 1; cl_qlist_insert_tail( &p_mgr->free_ranges, &p_range->item ); From johann at pathscale.com Sun Nov 13 03:01:09 2005 From: johann at pathscale.com (Johann George) Date: Sun, 13 Nov 2005 03:01:09 -0800 Subject: [openib-general] Re: (SPAM?) [RFC] new ibv_get_devices() API -- avoid dlists In-Reply-To: <52hdahdu76.fsf@cisco.com> References: <52y83whz5g.fsf@cisco.com> <20051112214029.GB4941@mellanox.co.il> <52hdahdu76.fsf@cisco.com> Message-ID: <20051113110109.GA10238@cuprite.internal.keyresearch.com> > I played around with a few things and came up with the interface > below. Basically, I like > > struct ibv_device **ibv_get_device_list(int *num_devices); Works for me. I assume the list is null terminated so we do not have to get num_devices? Johann From dotanb at mellanox.co.il Sun Nov 13 04:15:14 2005 From: dotanb at mellanox.co.il (Dotan Barak) Date: Sun, 13 Nov 2005 14:15:14 +0200 Subject: [openib-general] changing a UC QP to support RDMA Write is not working Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E372B035@mtlexch01.mtl.com> Hi. I'm using Mellanox HCA on a machine with AS4 with kernel 2.6.9-5.ELsmp. I'm working with gen2 driver, svn revision 4032. 
I wrote a two-sided test with the following flow:

common: create a UC QP
        modify the QP to RTS (in RESET->INIT: enable RDMA Read)
side A: post recv request
common: sync with the other side
side B: post send request (RDMA Write with immediate)
side A: poll CQ (here is the bug: there isn't any completion in the CQ)

When I checked the QP context on side A, I noticed that RDMA Write wasn't
enabled for this QP. For an RC QP, the test passes.

(Even if side B posts a send request with RDMA Write only (without
immediate), there is still a failure: the packet is dropped in the
responder QP.)

I looked in the file mthca_qp.c and saw the following code:

        if (attr_mask & IB_QP_ACCESS_FLAGS) {
                /*
                 * Only enable RDMA/atomics if we have responder
                 * resources set to a non-zero value.
                 */
                if (qp->resp_depth) {
                        qp_context->params2 |=
                                cpu_to_be32(attr->qp_access_flags & IB_ACCESS_REMOTE_WRITE ?
                                            MTHCA_QP_BIT_RWE : 0);
                        qp_context->params2 |=
                                cpu_to_be32(attr->qp_access_flags & IB_ACCESS_REMOTE_READ ?
                                            MTHCA_QP_BIT_RRE : 0);
                        qp_context->params2 |=
                                cpu_to_be32(attr->qp_access_flags & IB_ACCESS_REMOTE_ATOMIC ?
                                            MTHCA_QP_BIT_RAE : 0);
                }

                qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RWE |
                                                        MTHCA_QP_OPTPAR_RRE |
                                                        MTHCA_QP_OPTPAR_RAE);

                qp->atomic_rd_en = attr->qp_access_flags;
        }

The value of the attribute resp_depth changes starting at the INIT->RTR
transition, but the value of the attribute atomic_rd_en changes starting
at the RESET->INIT transition.

Has anyone else seen this behaviour?

Thanks,
Dotan Barak
Software Verification Engineer
Mellanox Technologies LTD
Tel: +972-4-9097200 Ext: 231
Fax: +972-4-9593245
P.O. Box 86 Yokneam 20692 ISRAEL.
Home: +972-77-8841095
Cell: 052-4222383

[ May the fork be with you ]

From yipeeyipeeyipeeyipee at yahoo.com Sun Nov 13 08:25:25 2005
From: yipeeyipeeyipeeyipee at yahoo.com (yipee)
Date: Sun, 13 Nov 2005 16:25:25 +0000 (UTC)
Subject: [openib-general] mellanox sdr/ddr hca interoperability
Message-ID:

Hi,

Can I build a fabric with a ddr switch and some hosts with sdr hcas, and
other hosts with ddr hcas?
Are there known interoperability issues with such a fabric? Do all
transports (ud/rc) work as expected? For example, can a ddr host establish
a connection with an sdr host and send/receive data?
Should I configure anything specifically for such a fabric?

Thanks,
y

From halr at voltaire.com Sun Nov 13 08:33:01 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: Sun, 13 Nov 2005 18:33:01 +0200
Subject: [openib-general] mellanox sdr/ddr hca interoperability
Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AAC5@taurus.voltaire.com>

Yes, this works and has been deployed. SM should take care of setting up
the links properly and the PathRecords returned should set parameters for
the paths for the QP connections between the HCA endpoints.

-- Hal

________________________________

From: openib-general-bounces at openib.org on behalf of yipee
Sent: Sun 11/13/2005 11:25 AM
To: openib-general at openib.org
Subject: [openib-general] mellanox sdr/ddr hca interoperability

Hi,

Can I build a fabric with a ddr switch and some hosts with sdr hcas, and
other hosts with ddr hcas?
Are there known interoperability issues with such a fabric? Do all
transports (ud/rc) work as expected? For example, can a ddr host establish
a connection with an sdr host and send/receive data?
Should I configure anything specifically for such a fabric?
Thanks, y _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rolandd at cisco.com Sun Nov 13 08:59:49 2005 From: rolandd at cisco.com (Roland Dreier) Date: Sun, 13 Nov 2005 08:59:49 -0800 Subject: [openib-general] Re: changing a UC QP to support RDMA Write is not working In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E372B035@mtlexch01.mtl.com> (Dotan Barak's message of "Sun, 13 Nov 2005 14:15:14 +0200") References: <6AB138A2AB8C8E4A98B9C0C3D52670E372B035@mtlexch01.mtl.com> Message-ID: <52lkzscwl6.fsf@cisco.com> I think I see the problem. As you said, mthca was incorrectly checking the responder resources to see if it should enable RDMA writes on the receive queue. However, responder resources only apply to RDMA reads and atomics and are never set for UC QPs, so this was never set. Does the patch below for the kernel fix things for you? - R. Index: infiniband/hw/mthca/mthca_qp.c =================================================================== --- infiniband/hw/mthca/mthca_qp.c (revision 4024) +++ infiniband/hw/mthca/mthca_qp.c (working copy) @@ -728,15 +728,16 @@ int mthca_modify_qp(struct ib_qp *ibqp, } if (attr_mask & IB_QP_ACCESS_FLAGS) { + qp_context->params2 |= + cpu_to_be32(attr->qp_access_flags & IB_ACCESS_REMOTE_WRITE ? + MTHCA_QP_BIT_RWE : 0); + /* - * Only enable RDMA/atomics if we have responder - * resources set to a non-zero value. + * Only enable RDMA reads and atomics if we have + * responder resources set to a non-zero value. */ if (qp->resp_depth) { qp_context->params2 |= - cpu_to_be32(attr->qp_access_flags & IB_ACCESS_REMOTE_WRITE ? - MTHCA_QP_BIT_RWE : 0); - qp_context->params2 |= cpu_to_be32(attr->qp_access_flags & IB_ACCESS_REMOTE_READ ? 
MTHCA_QP_BIT_RRE : 0); qp_context->params2 |= @@ -757,31 +758,27 @@ int mthca_modify_qp(struct ib_qp *ibqp, if (qp->resp_depth && !attr->max_dest_rd_atomic) { /* * Lowering our responder resources to zero. - * Turn off RDMA/atomics as responder. - * (RWE/RRE/RAE in params2 already zero) + * Turn off reads RDMA and atomics as responder. + * (RRE/RAE in params2 already zero) */ - qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RWE | - MTHCA_QP_OPTPAR_RRE | + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RRE | MTHCA_QP_OPTPAR_RAE); } if (!qp->resp_depth && attr->max_dest_rd_atomic) { /* * Increasing our responder resources from - * zero. Turn on RDMA/atomics as appropriate. + * zero. Turn on RDMA reads and atomics as + * appropriate. */ qp_context->params2 |= - cpu_to_be32(qp->atomic_rd_en & IB_ACCESS_REMOTE_WRITE ? - MTHCA_QP_BIT_RWE : 0); - qp_context->params2 |= cpu_to_be32(qp->atomic_rd_en & IB_ACCESS_REMOTE_READ ? MTHCA_QP_BIT_RRE : 0); qp_context->params2 |= cpu_to_be32(qp->atomic_rd_en & IB_ACCESS_REMOTE_ATOMIC ? MTHCA_QP_BIT_RAE : 0); - qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RWE | - MTHCA_QP_OPTPAR_RRE | + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RRE | MTHCA_QP_OPTPAR_RAE); } From yipeeyipeeyipeeyipee at yahoo.com Sun Nov 13 10:50:58 2005 From: yipeeyipeeyipeeyipee at yahoo.com (yipee) Date: Sun, 13 Nov 2005 18:50:58 +0000 (UTC) Subject: [openib-general] Re: mellanox sdr/ddr hca interoperability References: <5CE025EE7D88BA4599A2C8FEFCF226F589AAC5@taurus.voltaire.com> Message-ID: Hal Rosenstock voltaire.com> writes: > Yes, this works and has been deployed. SM should take care of setting up the links properly and the > PathRecords returned should set parameters for the paths for the QP connections between the HCA endpoints. Do RC & UD sends behave the same on a mixed fabric? What about MADs usen by the CM module? 
And also who matches the different speeds of two different hosts that are members of the same multicast group, as there aren't any PathRecords in this case? For example if a ddr host sends a datagram to a multicast group, does the switch keep the incoming (fast) datagrams until the port to the slow hosts is free? Thanks, x From nacc at us.ibm.com Sun Nov 13 11:48:42 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Sun, 13 Nov 2005 11:48:42 -0800 Subject: [openib-general] Infiniband compilation testing Message-ID: <20051113194842.GF13904@us.ibm.com> Hi all, Latest results (now on three archs: x86, ppc64, x86_64): kernel.org: =y x86 --> OK ppc64 --> OK x86_64 --> OK =m x86 --> OK ppc64 --> OK x86_64 --> OK svn: =y x86 --> OK ppc64 --> Build Warnings drivers/infiniband/ulp/iser/iser_memory.c: In function `iser_all_mem_reg': drivers/infiniband/ulp/iser/iser_memory.c:176: warning: long long unsigned int format, u64 arg (arg 3) drivers/infiniband/ulp/iser/iser_memory.c:176: warning: long long int format, u64 arg (arg 4) drivers/infiniband/ulp/iser/iser_memory.c:187: warning: long long unsigned int format, u64 arg (arg 5) drivers/infiniband/ulp/iser/iser_memory.c:187: warning: long long unsigned int format, u64 arg (arg 6) drivers/infiniband/ulp/iser/iser_memory.c: In function `iser_all_mem_clone': drivers/infiniband/ulp/iser/iser_memory.c:218: warning: long long unsigned int format, long unsigned int arg (arg 4) x86_64 --> OK =m x86 --> OK ppc64 --> Build Warnings drivers/infiniband/ulp/iser/iser_memory.c: In function `iser_all_mem_reg': drivers/infiniband/ulp/iser/iser_memory.c:176: warning: long long unsigned int format, u64 arg (arg 3) drivers/infiniband/ulp/iser/iser_memory.c:176: warning: long long int format, u64 arg (arg 4) drivers/infiniband/ulp/iser/iser_memory.c:187: warning: long long unsigned int format, u64 arg (arg 5) drivers/infiniband/ulp/iser/iser_memory.c:187: warning: long long unsigned int format, u64 arg (arg 6) 
drivers/infiniband/ulp/iser/iser_memory.c: In function `iser_all_mem_clone': drivers/infiniband/ulp/iser/iser_memory.c:218: warning: long long unsigned int format, long unsigned int arg (arg 4) x86_64 --> OK Thanks, Nish From robert.j.woodruff at intel.com Sun Nov 13 13:06:20 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Sun, 13 Nov 2005 13:06:20 -0800 Subject: [openib-general] please test kernel 2.6.15-rc1 In-Reply-To: <52lkztduj3.fsf@cisco.com> Message-ID: Roland wrote, >It would be extremely useful if lots of people tested the IB drivers >in the upstream kernel. In other words, please try 2.6.15-rc1 out >without replacing drivers/infiniband with a subversion tree -- just >build yourself a stock kernel and see how it works. I'd like to make >sure that the in-tree IB support works well, and that I didn't leave >out a fixes or screw up a merge. I loaded the 2.6.15-rc1 onto a couple of EM64T boxes with the SVN4016 userspace code. I have been running Intel MPI on DAPL/user-verbs and also MPI on IPoIB. So far, looks good. Will try some IPF and IA32 machines next week. woody From troy at scl.ameslab.gov Sun Nov 13 13:45:44 2005 From: troy at scl.ameslab.gov (Troy Benjegerdes) Date: Sun, 13 Nov 2005 13:45:44 -0800 Subject: [openib-general] [PATCH] Opensm - lid assignment issues In-Reply-To: <5z64qwyhpe.fsf@mtl066.yok.mtl.com> References: <5z64qwyhpe.fsf@mtl066.yok.mtl.com> Message-ID: <4377B408.6020903@scl.ameslab.gov> Yael Kalka wrote: >Hi Hal, > >During some windows tests we've discovered that there is still another >problem in the lid_mgr. The problem happend when 2 HCAs had the same >lid - opensm entered an infinite loop. >The following patch fixes this. 
> >Thanks, >Yael > >Signed-off-by: Yael Kalka > >Index: opensm/osm_lid_mgr.c >=================================================================== >--- opensm/osm_lid_mgr.c (revision 4032) >+++ opensm/osm_lid_mgr.c (working copy) >@@ -550,6 +550,9 @@ __osm_lid_mgr_init_sweep( > { > /* This port will use its local lid, and consume the entire required lid range. > Thus we can skip that range. */ >+ /* If the disc_max_lid is greater then lid - we can skip right to it, >+ since we've done all neccessary checks on the lids in between. */ >+ if (disc_max_lid > lid) > lid = disc_max_lid; > } > } >@@ -593,7 +596,14 @@ __osm_lid_mgr_init_sweep( > { > p_range = > (osm_lid_mgr_range_t *)cl_malloc(sizeof(osm_lid_mgr_range_t)); >- p_range->min_lid = 1; >+ /* >+ The p_range can be NULL in one of 2 cases: >+ 1. If max_defined_lid == 0. In this case, we want the entire range. >+ 2. If all lids discovered in the loop where mapped. In this case >+ no free range exists, and we want to define it after the last >+ mapped lid. 
>+    */
>+    p_range->min_lid = lid;
>   }
>   p_range->max_lid = p_mgr->p_subn->max_unicast_lid_ho - 1;
>   cl_qlist_insert_tail( &p_mgr->free_ranges, &p_range->item );
>

The opensm on the show floor is showing the following in oprofile:

with a unit mask of 0x01 (mandatory) count 100000
samples  %        app name              symbol name
5970354  51.7020  libpthread-2.3.4.so   pthread_cond_timedwait@@GLIBC_2.3.2
5037621  43.6247  libosmcomp.so.1.0.0   __cl_timer_prov_cb
66241     0.5736  libosmcomp.so.1.0.0   anonymous symbol from section .plt
55929     0.4843  oprofiled             (no symbols)
49918     0.4323  opensm                __osm_ucast_mgr_process_neighbors
39585     0.3428  vmlinux               hpet_readl
25333     0.2194  oprofile              (no symbols)
22734     0.1969  opreport              (no symbols)
14724     0.1275  libcrypto.so.0.9.7a   (no symbols)
14296     0.1238  libc-2.3.4.so         __tzfile_compute
13901     0.1204  vmlinux               __copy_to_user_ll

Is this the same loop?

From pradeep at us.ibm.com Sun Nov 13 15:30:41 2005
From: pradeep at us.ibm.com (Pradeep Satyanarayana)
Date: Sun, 13 Nov 2005 15:30:41 -0800
Subject: [openib-general] Data structure size mismatch
In-Reply-To:
Message-ID:

I am seeking suggestions with respect to the 32-bit/64-bit issues. Is the
macro test_thread_flag(TIF_32BIT), used to ascertain whether we are in the
kernel in the context of a 32-bit app, generic enough that it will work
across all the 64-bit platforms of interest? It should work for
ppc64/sparc64/x86_64/parisc. Is that good enough?

Also, do I use __LP64__ to find out if this is a 64-bit kernel? My research
has revealed that this macro will work for gcc versions greater than 3.4.
Now, Sles9sp2 uses gcc 3.3.3, so the macro will not work on Sles9sp2.
However, RHEL4U2 uses gcc version 3.4.4.
So, even though one may be using the same kernel version, it will work on
one distro and not the other. How does one address such issues?

Pradeep
pradeep at us.ibm.com

From mst at mellanox.co.il Mon Nov 14 07:02:02 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 14 Nov 2005 17:02:02 +0200
Subject: [openib-general] Re: [PATCH] support kernel-level sockets in sdp
In-Reply-To: <20051103140011.GA31134@mellanox.co.il>
References: <20051103140011.GA31134@mellanox.co.il>
Message-ID: <20051114150202.GV20871@mellanox.co.il>

Quoting r. Michael S. Tsirkin :
> Subject: [PATCH] support kernel-level sockets in sdp
>
> Hi!
> I plan to commit the following. Comments?
>
> ---
>
> The following patch adds support for kernel-level sockets in SDP Zcopy
> (currently used with AIO).
>
> Signed-off-by: Michael S. Tsirkin

In it went. Committed revision 4043.

-- 
MST

From rolandd at cisco.com Mon Nov 14 07:34:14 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 14 Nov 2005 07:34:14 -0800
Subject: [openib-general] Re: (SPAM?) [RFC] new ibv_get_devices() API -- avoid dlists
References: <52y83whz5g.fsf@cisco.com> <20051112214029.GB4941@mellanox.co.il> <52hdahdu76.fsf@cisco.com> <20051113110109.GA10238@cuprite.internal.keyresearch.com>
Message-ID: <523blzb5vt.fsf@cisco.com>

    Johann> Works for me. I assume the list is null terminated so we
    Johann> do not have to get num_devices?

Yes, that's right.

 - R.

From mst at mellanox.co.il Mon Nov 14 07:42:02 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 14 Nov 2005 17:42:02 +0200
Subject: [openib-general] ipoib oops
Message-ID: <20051114154202.GW20871@mellanox.co.il>

Hello, Roland!
I am still seeing IPoIB oopsing about once a week around
ipoib_mcast_join_complete (oops below).
While looking at it, a question occurred to me: what protects the following
code in ipoib_mcast_stop_thread

        list_for_each_entry(mcast, &priv->multicast_list, list) {
                if (mcast->query) {
                        ib_sa_cancel_query(mcast->query_id, mcast->query);
                        mcast->query = NULL;
                        ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n",
                                        IPOIB_GID_ARG(mcast->mcmember.mgid));
                        wait_for_completion(&mcast->done);
                }
        }

from walking the list while an entry is being added to or deleted from it?

I wonder whether this could be the reason we are oopsing.

Thanks,
MST

Unable to handle kernel NULL pointer dereference at 0000000000000488 RIP:
{:ib_ipoib:ipoib_mcast_join_finish+100}
PGD 178b11067 PUD 174bb7067 PMD 0
Oops: 0000 [1] SMP
CPU 0
Modules linked in: ib_ipoib ib_sdp ib_cm ib_sa ib_umad ib_mthca ib_mad ib_core
Pid: 2433, comm: ib_mad1 Not tainted 2.6.14 #4
RIP: 0010:[] {:ib_ipoib:ipoib_mcast_join_finish+100}
RSP: 0018:ffff81017ada7c58  EFLAGS: 00010282
RAX: 000000007a010000 RBX: 0000000000000000 RCX: 0000000000000010
RDX: ffff810177847a80 RSI: ffff810177847a80 RDI: ffff810177847a80
RBP: ffff810177847a80 R08: 0000000000000000 R09: ffff81017ada7d38
R10: ffff81017ada7df8 R11: 0000000000000001 R12: 0000000000000000
R13: 0000000000000480 R14: 0000000000000000 R15: ffff81017adca898
FS:  0000000000000000(0000) GS:ffffffff805ff800(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000488 CR3: 00000001791a4000 CR4: 00000000000006e0
Process ib_mad1 (pid: 2433, threadinfo ffff81017ada6000, task ffff81017e66d780)
Stack: ffff8100064266e0 0000000000000286 ffff810171566700 0000000000000286
       ffff81017dbb9c00 0000000000000286 ffff81017be8fe10 ffff810179f45440
       ffff81017be8fe10 ffff810178282c00
Call Trace:{:ib_ipoib:ipoib_mcast_join_complete+56}
       {:ib_core:ib_unpack+200}
       {:ib_sa:ib_sa_mcmember_rec_callback+76}
       {:ib_sa:recv_handler+66}
       {:ib_mad:ib_mad_completion_handler+957}
       {:ib_mad:ib_mad_completion_handler+0}
       {worker_thread+476}
       {default_wake_function+0}
       {__wake_up_common+67}
{default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+217} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} Code: 49 8b 7d 08 48 81 c7 cc 01 00 00 f3 a6 75 17 49 8b 45 70 8b RIP {:ib_ipoib:ipoib_mcast_join_finish+100} RSP CR2: 0000000000000488 -- MST From mst at mellanox.co.il Mon Nov 14 07:47:22 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 14 Nov 2005 17:47:22 +0200 Subject: [openib-general] Re: IPoIB question/problem In-Reply-To: <20051107170244.GZ31134@mellanox.co.il> References: <20051107170244.GZ31134@mellanox.co.il> Message-ID: <20051114154722.GX20871@mellanox.co.il> How does the following strike you? I didnt notice any performance impact - could more people test this please? --- Assuming that a remote node is replaced and its address changes (e.g. gid change), it seems that the ha field will gets out of sync with the address handle stored in ipoib_neigh->ah, with the result that the ah field would point to an incorrect path, resulting in all packets being lost. Signed-off-by: Michael S. 
Tsirkin Index: linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- linux-2.6.14.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2005-11-14 18:29:40.000000000 +0200 +++ linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib_main.c 2005-11-14 20:26:43.000000000 +0200 @@ -394,6 +394,7 @@ static void path_rec_completion(int stat list_for_each_entry(neigh, &path->neigh_list, list) { kref_get(&path->ah->ref); neigh->ah = path->ah; + memcpy(neigh->dgid.raw, path->pathrec.dgid.raw, sizeof (union ib_gid)); while ((skb = __skb_dequeue(&neigh->queue))) __skb_queue_tail(&skqueue, skb); @@ -503,6 +504,7 @@ static void neigh_add_path(struct sk_buf if (path->pathrec.dlid) { kref_get(&path->ah->ref); neigh->ah = path->ah; + memcpy(neigh->dgid.raw, path->pathrec.dgid.raw, sizeof (union ib_gid)); ipoib_send(dev, skb, path->ah, be32_to_cpup((__be32 *) skb->dst->neighbour->ha)); @@ -633,6 +635,17 @@ static int ipoib_start_xmit(struct sk_bu neigh = *to_ipoib_neigh(skb->dst->neighbour); if (likely(neigh->ah)) { + if (unlikely(memcmp(neigh->dgid.raw, + skb->dst->neighbour->ha + 4, + sizeof (union ib_gid)))) { + ipoib_put_ah(neigh->ah); + *to_ipoib_neigh(skb->dst->neighbour) = NULL; + skb->dst->neighbour->ops->destructor = NULL; + list_del(&neigh->list); + kfree(neigh); + ipoib_path_lookup(skb, dev); + goto out; + } ipoib_send(dev, skb, neigh->ah, be32_to_cpup((__be32 *) skb->dst->neighbour->ha)); goto out; Index: linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- linux-2.6.14.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2005-11-14 18:29:40.000000000 +0200 +++ linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib.h 2005-11-14 18:29:42.000000000 +0200 @@ -209,6 +209,7 @@ struct ipoib_path { struct ipoib_neigh { struct ipoib_ah *ah; + union ib_gid dgid; struct sk_buff_head queue; struct neighbour *neighbour; -- MST From rolandd at cisco.com Mon Nov 14 07:49:14 
2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 14 Nov 2005 07:49:14 -0800 Subject: [openib-general] Data structure size mismatch References: Message-ID: <52wtjb9qmd.fsf@cisco.com> Pradeep> I am seeking suggestions with respect to the Pradeep> 32-bit/64-bit issues. Is the macro Pradeep> test_thread_flag(TIF_32BIT), to ascertain if we are in Pradeep> the kernel in the context of a 32-bit app or not, generic Pradeep> enought that it will work across all the 64-bit platforms Pradeep> of interest. It should work for Pradeep> ppc64/sparc64/x86_64/parisc. Is that good enough? What are you trying to do? It would be best to make your ABI the same for both 32-bit and 64-bit kernels, so no compatibility code is required. Failing that, just hook into the existing compatibility support (ie compat_ioctl and friends). Pradeep> Also, do I use __LP64__ to find out if this is a 64-bit Pradeep> kernel? My research has revealed that this macro will Pradeep> work for gcc versions greater than 3.4. Now, Sles9sp2 Pradeep> uses gcc 3.3.3 and so will not work on Sles9sp2. However, Pradeep> RHEL4U2 uses gcc version 3.4.4. So, even though one may Pradeep> be using the same kernel version, it will work on one Pradeep> distro and not the other. How does one address such Pradeep> issues? BITS_PER_LONG should be sufficient I think. But again, what are you trying to do? It would be better to write your code so that it doesn't matter whether it is being built for a 32-bit kernel or a 64-bit kernel. - R. From bardov at gmail.com Mon Nov 14 09:03:20 2005 From: bardov at gmail.com (Dan Bar Dov) Date: Mon, 14 Nov 2005 19:03:20 +0200 Subject: [openib-general] Infiniband compilation testing In-Reply-To: <20051113194842.GF13904@us.ibm.com> References: <20051113194842.GF13904@us.ibm.com> Message-ID: Hi Nishanth, I committed fixes to the ppc64 compile warnings in iser: Committed revision 4044. 
Dan On 11/13/05, Nishanth Aravamudan wrote: > Hi all, > > Latest results (now on three archs: x86, ppc64, x86_64): > > kernel.org: > =y > x86 --> OK > ppc64 --> OK > x86_64 --> OK > =m > x86 --> OK > ppc64 --> OK > x86_64 --> OK > > svn: > =y > x86 --> OK > ppc64 --> Build Warnings > drivers/infiniband/ulp/iser/iser_memory.c: In function `iser_all_mem_reg': > drivers/infiniband/ulp/iser/iser_memory.c:176: warning: long long unsigned int format, u64 arg (arg 3) > drivers/infiniband/ulp/iser/iser_memory.c:176: warning: long long int format, u64 arg (arg 4) > drivers/infiniband/ulp/iser/iser_memory.c:187: warning: long long unsigned int format, u64 arg (arg 5) > drivers/infiniband/ulp/iser/iser_memory.c:187: warning: long long unsigned int format, u64 arg (arg 6) > drivers/infiniband/ulp/iser/iser_memory.c: In function `iser_all_mem_clone': > drivers/infiniband/ulp/iser/iser_memory.c:218: warning: long long unsigned int format, long unsigned int arg (arg 4) > x86_64 --> OK > =m > x86 --> OK > ppc64 --> Build Warnings > drivers/infiniband/ulp/iser/iser_memory.c: In function `iser_all_mem_reg': > drivers/infiniband/ulp/iser/iser_memory.c:176: warning: long long unsigned int format, u64 arg (arg 3) > drivers/infiniband/ulp/iser/iser_memory.c:176: warning: long long int format, u64 arg (arg 4) > drivers/infiniband/ulp/iser/iser_memory.c:187: warning: long long unsigned int format, u64 arg (arg 5) > drivers/infiniband/ulp/iser/iser_memory.c:187: warning: long long unsigned int format, u64 arg (arg 6) > drivers/infiniband/ulp/iser/iser_memory.c: In function `iser_all_mem_clone': > drivers/infiniband/ulp/iser/iser_memory.c:218: warning: long long unsigned int format, long unsigned int arg (arg 4) > x86_64 --> OK > > Thanks, > Nish > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > 
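A note on the iser_memory.c warnings quoted above: they are the usual
symptom of passing a u64 to a "long long" conversion on a build where u64
is not itself a long long type, which is what the compiler reports for this
ppc64 build. Revision 4044 is not shown in this thread, so the following is
only a sketch of the common remedy, an explicit cast at the call site. The
typedef and the helper name format_addr are made up for the illustration,
and snprintf stands in for the kernel's printk.

```c
#include <stdio.h>
#include <stdint.h>

/* Assumption: stand-in for the kernel's u64, which may be unsigned long
 * or unsigned long long depending on the architecture. */
typedef uint64_t u64;

/*
 * Hypothetical helper: format a 64-bit address as 0x plus 16 hex digits.
 * The cast makes the argument type match the %llx conversion on every
 * architecture, whatever u64 is defined as underneath.
 * Returns the number of characters snprintf would write.
 */
static int format_addr(char *buf, size_t len, u64 addr)
{
        return snprintf(buf, len, "0x%016llx", (unsigned long long) addr);
}
```

The same cast works unchanged on x86, x86_64 and ppc64, which is why it is
generally preferred over picking a different conversion per architecture.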
From troy at scl.ameslab.gov Mon Nov 14 10:09:05 2005 From: troy at scl.ameslab.gov (Troy Benjegerdes) Date: Mon, 14 Nov 2005 10:09:05 -0800 Subject: [openib-general] another opensm crash Message-ID: <4378D2C1.6090103@scl.ameslab.gov> (gdb) bt #0 0x08071ff3 in osm_si_rcv_process (p_rcv=0x8090138, p_madw=0x80a1de0) at osm_sw_info_rcv.c:679 #1 0xb7fb0213 in __cl_disp_worker (context=0x8090da4) at cl_dispatcher.c:108 #2 0xb7fb8557 in __cl_thread_pool_routine (context=0x8090de4) at cl_threadpool.c:78 #3 0xb7fb834d in __cl_thread_wrapper (arg=0x8091408) at cl_thread.c:61 #4 0x46cde341 in start_thread () from /lib/tls/libpthread.so.0 #5 0x46b6e6fe in clone () from /lib/tls/libc.so.6 From pradeep at us.ibm.com Mon Nov 14 11:00:14 2005 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Mon, 14 Nov 2005 11:00:14 -0800 Subject: [openib-general] Data structure size mismatch In-Reply-To: <52wtjb9qmd.fsf@cisco.com> Message-ID: Roland Dreier wrote on 11/14/2005 07:49:14 AM: > Pradeep> I am seeking suggestions with respect to the > Pradeep> 32-bit/64-bit issues. Is the macro > Pradeep> test_thread_flag(TIF_32BIT), to ascertain if we are in > Pradeep> the kernel in the context of a 32-bit app or not, generic > Pradeep> enought that it will work across all the 64-bit platforms > Pradeep> of interest. It should work for > Pradeep> ppc64/sparc64/x86_64/parisc. Is that good enough? > > What are you trying to do? It would be best to make your ABI the same > for both 32-bit and 64-bit kernels, so no compatibility code is > required. Failing that, just hook into the existing compatibility > support (ie compat_ioctl and friends). I am trying to use copy_from_user()/copy_to_user of data structures that contains pointers. 
To address the differences in pointer sizes, here are the possible
combinations that I am trying to address:

user  kernel
 32     32   -----> this is simple and copy_*_user() works as expected
 64     32   -----> not supported
 32     64   -----> compat_ptr functions provide the necessary support.
                    This function takes a compat_uptr_t, which is a u32,
                    as input
 64     64   -----> This case still needs to be addressed. One could do a
                    copy_*_user() without using compat_ptr() if it is
                    known that this is a 64-bit app.

Unless there is a guarantee that only 32-bit apps are supported, there is
still one case that needs to be addressed, and that is the last case above.
Hence the need to ascertain whether the invoking app is 32-bit or 64-bit.

>
>     Pradeep> Also, do I use __LP64__ to find out if this is a 64-bit
>     Pradeep> kernel? My research has revealed that this macro will
>     Pradeep> work for gcc versions greater than 3.4. Now, Sles9sp2
>     Pradeep> uses gcc 3.3.3 and so will not work on Sles9sp2. However,
>     Pradeep> RHEL4U2 uses gcc version 3.4.4. So, even though one may
>     Pradeep> be using the same kernel version, it will work on one
>     Pradeep> distro and not the other. How does one address such
>     Pradeep> issues?
>
> BITS_PER_LONG should be sufficient I think. But again, what are you
> trying to do? It would be better to write your code so that it
> doesn't matter whether it is being built for a 32-bit kernel or a
> 64-bit kernel.
>

Ok, this will do nicely. Thanks for pointing it out.

Pradeep
pradeep at us.ibm.com

From yipeeyipeeyipeeyipee at yahoo.com Mon Nov 14 11:36:14 2005
From: yipeeyipeeyipeeyipee at yahoo.com (yipee)
Date: Mon, 14 Nov 2005 19:36:14 +0000 (UTC)
Subject: [openib-general] OpenSM size
Message-ID:

Hi,

Is there some way to compile-out parts of OpenSM so I could fit it inside a
small flash rom?
Maybe I can modify the Makefile to exclude some of the object files?
What functionality isn't mandatory for fabric management? Thanks, y From rolandd at cisco.com Mon Nov 14 11:47:13 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 14 Nov 2005 11:47:13 -0800 Subject: [openib-general] Data structure size mismatch In-Reply-To: (Pradeep Satyanarayana's message of "Mon, 14 Nov 2005 11:00:14 -0800") References: Message-ID: <52hdaf9flq.fsf@cisco.com> Pradeep> I am trying to use copy_from_user()/copy_to_user of data Pradeep> structures that contains pointers. If you are defining a new interface, then the simplest thing is not to do that: always put pointers in a 64-bit field. If it is an existing interface that can't be changed, then there should be an existing compat wrapper for the system call. - R. From sean.hefty at intel.com Mon Nov 14 11:48:08 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 14 Nov 2005 11:48:08 -0800 Subject: [openib-general] [PATCH] [CMA] pad address information to handle IPv6 addresses Message-ID: This patch provides padding beyond the source and destination addresses to handle IPv6 address sizes. Does anyone know of a better way to handle this? 
Signed-off-by: Sean Hefty

Index: include/rdma/rdma_cm.h
===================================================================
--- include/rdma/rdma_cm.h (revision 4022)
+++ include/rdma/rdma_cm.h (working copy)
@@ -31,6 +31,7 @@
 #define RDMA_CM_H
 #include
+#include
 #include
 #include
@@ -55,7 +56,11 @@
 struct rdma_addr {
 	struct sockaddr src_addr;
+	u8 src_pad[sizeof(struct sockaddr_in6) -
+		   sizeof(struct sockaddr)];
 	struct sockaddr dst_addr;
+	u8 dst_pad[sizeof(struct sockaddr_in6) -
+		   sizeof(struct sockaddr)];
 	union {
 		struct ib_addr ibaddr;
 	} addr;

From eitan at mellanox.co.il Mon Nov 14 11:53:06 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 14 Nov 2005 21:53:06 +0200
Subject: [openib-general] OpenSM size
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3618975@mtlexch01.mtl.com>

Hi Yipee,

It would be nice (or I should say just dandy) if we could skip some of
the functionality of OpenSM but actually for Gen2 to work right most of
it is required...

The only parts that are not absolutely required are some SA request
handlers. Mostly ServiceRecords, NodeInfoRecord, SwitchInfoRecord,
LinkRecord. But there is no compile time flag to leave those out.

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

> -----Original Message-----
> From: yipee [mailto:yipeeyipeeyipeeyipee at yahoo.com]
> Sent: Monday, November 14, 2005 9:36 PM
> To: openib-general at openib.org
> Subject: [openib-general] OpenSM size
>
> Hi,
>
> Is there some way to compile-out parts of OpenSM so I could fit it inside a
> small flash rom?
> Maybe I can modify the Makefile to exclude some of the object files? What
> functionality isn't mandatory for fabric management?
> > > Thanks, > y > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From eitan at mellanox.co.il Mon Nov 14 11:54:28 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 14 Nov 2005 21:54:28 +0200 Subject: [openib-general] another opensm crash Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3618976@mtlexch01.mtl.com> Hi Troy Try to move aside your /lib/tls directory and see if you still get these crashes. We have issues with TLS pthread and glibc Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Troy Benjegerdes [mailto:troy at scl.ameslab.gov] > Sent: Monday, November 14, 2005 8:09 PM > To: openib-general at openib.org > Subject: [openib-general] another opensm crash > > (gdb) bt > #0 0x08071ff3 in osm_si_rcv_process (p_rcv=0x8090138, p_madw=0x80a1de0) > at osm_sw_info_rcv.c:679 > #1 0xb7fb0213 in __cl_disp_worker (context=0x8090da4) at > cl_dispatcher.c:108 > #2 0xb7fb8557 in __cl_thread_pool_routine (context=0x8090de4) > at cl_threadpool.c:78 > #3 0xb7fb834d in __cl_thread_wrapper (arg=0x8091408) at cl_thread.c:61 > #4 0x46cde341 in start_thread () from /lib/tls/libpthread.so.0 > #5 0x46b6e6fe in clone () from /lib/tls/libc.so.6 > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Mon Nov 14 12:29:32 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 14 Nov 2005 22:29:32 +0200 Subject: [openib-general] OpenSM size Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AADC@taurus.voltaire.com> How small does it need to 
be ? What processor architecture ? -- Hal ________________________________ From: openib-general-bounces at openib.org on behalf of yipee Sent: Mon 11/14/2005 2:36 PM To: openib-general at openib.org Subject: [openib-general] OpenSM size Hi, Is there some way to compile-out parts of OpenSM so I could fit it inside a small flash rom? Maybe I can modify the Makefile to exclude some of the object files? What functionality isn't mandatory for fabric management? Thanks, y _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From Nitin.Hande at Sun.COM Mon Nov 14 12:49:39 2005 From: Nitin.Hande at Sun.COM (Nitin Hande) Date: Mon, 14 Nov 2005 12:49:39 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB In-Reply-To: <6.2.0.14.2.20051111155559.02542268@esmail.cup.hp.com> References: <6.2.0.14.2.20051111115707.0254a7e0@esmail.cup.hp.com> <96f8e60e0511111302i8f6db3anfc404b0998c8885@mail.gmail.com> <6.2.0.14.2.20051111155559.02542268@esmail.cup.hp.com> Message-ID: <4378F863.2070707@sun.com> Michael Krause wrote: > At 01:02 PM 11/11/2005, Ranjit Pandit wrote: > >> On 11/11/05, Michael Krause wrote: >> > Please clarify the following which was in the document provided by >> Oracle. >> > >> > On page 3 of the RDS document, under the section "RDP Interface", >> the 2nd >> > and 3rd paragraphs are state: >> > >> > * RDP does not guarantee that a datagram is delivered to the remote >> > application. >> > * It is up to the RDP client to deal with datagrams lost due to >> transport >> > failure or remote application failure. >> > >> > The HCA is still a fault domain with RDS - it does not address >> flushing data >> > out of the HCA fault domain, nor does it sound like it ensures that >> CQE loss >> > is recoverable. 
>> > >> > I do believe RDS will replay all of the sendmsg's that it believes are >> > pending, but it has no way to determine if already sent sendmsgs were >> > actually successfully delivered to the remote application unless it >> provides >> > some level of resync of the outstanding sends not completed from an >> > application's perspective as well as any state updated via RDMA >> operations >> > which may occur without an explicit send operation to flush to a known >> > state. I'm still trying to ascertain whether RDS completely >> recovers from >> > HCA failure (assuming there is another HCA / path available) between >> the two >> > endnodes. >> >> RDS will replay the sends that are completed in error by the HCA, >> which typically would happen if the current path fails or the remote >> node/HCA dies. > > > Does this mean that the receiving RDS entity is responsible for dealing > with duplicates? I believe so... A Send completion error does not mean that the > receiving endnode did not receive the data for either IB or iWARP; it > only indicates that the Send operation failed which could be just a loss > of the receive ACK with the Send completing on the receiver. Such a > scenario would imply that RDS would have to comprehend what buffers have > actually been consumed before retransmission, i.e. a resync is > performed, else one could receive duplicate data at the application > layer which can cause corruption or other problems as a function of the > application (tolerance will vary by application thus the ULP must > present consistent semantics to enable a broader set of applications > than perhaps the initial targeted application to be supported). In absence of any protocol level ack (and regardless of protocol level ack), it is the application which has to implement its own reliability. RDS becomes a passive channel passing packet back and forth including duplicate packets. 
The responsibility then shifts to the application to figure out what is missing, duplicate's etc. Thanks Nitin > >> In case of a catastrophic error on the local HCA, subsequent sends >> will fail (for a certain time (session_time_wait ) ) as if there was >> no alternate path available at that time. On getting an error the >> application should discard any sends unacknowledged by it's peer and >> take corrective action. > > > Unacknowledged by the peer means at the interconnect or the application > level? Again, how is the receive buffer management handled? > >> After the time_wait is over, subsequent sends will initiate a brand >> new connection which could use the alternate HCA ( if the path is >> available). > > > This is understood. > > Mike > > > ------------------------------------------------------------------------ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From Nitin.Hande at Sun.COM Mon Nov 14 12:49:44 2005 From: Nitin.Hande at Sun.COM (Nitin Hande) Date: Mon, 14 Nov 2005 12:49:44 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB In-Reply-To: <6.2.0.14.2.20051111155127.02542120@esmail.cup.hp.com> References: <6.2.0.14.2.20051111115707.0254a7e0@esmail.cup.hp.com> <4375069D.7040106@sun.com> <6.2.0.14.2.20051111155127.02542120@esmail.cup.hp.com> Message-ID: <4378F868.5080205@sun.com> Michael Krause wrote: > At 01:01 PM 11/11/2005, Nitin Hande wrote: > >> Michael Krause wrote: >> >>> At 10:28 AM 11/9/2005, Rick Frank wrote: >>> >>>> Yes, the application is responsible for detecting lost msgs at the >>>> application level - the transport can not do this. 
>>>> >>>> RDS does not guarantee that a message has been delivered to the >>>> application - just that once the transport has accepted a msg it >>>> will deliver the msg to the remote node in order without duplication >>>> - dealing with retransmissions, etc due to sporadic / intermittent >>>> msg loss over the interconnect. If after accepting the send - the >>>> current path fails - then RDS will transparently fail over to >>>> another path - and if required will resend / send any already queued >>>> msgs to the remote node - again insuring that no msg is duplicated >>>> and they are in order. This is no different than APM - with the >>>> exception that RDS can do this across HCAs. >>>> >>>> The application - Oracle in this case - will deal with detecting a >>>> catastrophic path failure - either due to a send that does not >>>> arrive and or a timedout response or send failure returned from the >>>> transport. If there is no network path to a remote node - it is >>>> required that we remove the remote node from the operating cluster >>>> to avoid what is commonly termed as a "split brain" condition - >>>> otherwise known as a "partition in time". >>>> >>>> BTW - in our case - the application failure domain logic is the same >>>> whether we are using UDP / uDAPL / iTAPI / TCP / SCTP / etc. >>>> Basically, if we can not talk to a remote node - after some defined >>>> period of time - we will remove the remote node from the cluster. In >>>> this case the database will recover all the interesting state that >>>> may have been maintained on the removed node - allowing the >>>> remaining nodes to continue. If later on, communication to the >>>> remote node is restored - it will be allowed to rejoin the cluster >>>> and take on application load. >>> >>> >>> Please clarify the following which was in the document provided by >>> Oracle. 
>>> On page 3 of the RDS document, under the section "RDP Interface", the >>> 2nd and 3rd paragraphs are state: >>> * RDP does not guarantee that a datagram is delivered to the >>> remote application. >>> * It is up to the RDP client to deal with datagrams lost due to >>> transport failure or remote application failure. >>> The HCA is still a fault domain with RDS - it does not address >>> flushing data out of the HCA fault domain, nor does it sound like it >>> ensures that CQE loss is recoverable. >>> I do believe RDS will replay all of the sendmsg's that it believes >>> are pending, but it has no way to determine if already sent sendmsgs >>> were actually successfully delivered to the remote application unless >>> it provides some level of resync of the outstanding sends not >>> completed from an application's perspective as well as any state >>> updated via RDMA operations which may occur without an explicit send >>> operation to flush to a known state. >> >> If RDS could define a mechanism that the application could use to >> inform the sender to resync and replay on catastrophic failure, is >> that a correct understanding of your suggestion ? > > > I'm not suggesting anything at this point. I'm trying to reconcile the > documentation with the e-mail statements made by its proponents. > >> I'm still trying to ascertain whether RDS completely >> >>> recovers from HCA failure (assuming there is another HCA / path >>> available) between the two endnodes >> >> Reading at the doc and the thread, it looks like we need src/dst port >> for multiplexing connections, we need seq/ack# for resyncing, we need >> some kind of window availability for flow control. Are'nt we very >> close to tcp header ? .. > > > TCP does not provide end-to-end to the application as implemented by > most OS. Unless one ties TCP ACK to the application's consumption of the > receive data, there is no method to ascertain that the application > really received the data. 
The application would be required to send
> its own application-level acknowledgement. I believe the intent is for
> applications to remain responsible for the end-to-end receipt of data
> and that RDS and the interconnect are simply responsible for the
> exchange at the lower levels.

Yes, a TCP ack only implies that the peer's TCP stack has received the
data; it means nothing to the application. It is the application which has
to send an application-level ack to its peer.

Nitin

>
> Mike
>

From mst at mellanox.co.il Mon Nov 14 13:11:17 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 14 Nov 2005 23:11:17 +0200
Subject: [openib-general] [PATCH] mthca: fix qp max_send/recv_sge calculation
Message-ID: <20051114211117.GB3603@mellanox.co.il>

Roland, I think I see a problem in mthca, where the qp capability values
we return aren't safe.
How does the following look (compile tested only)?

---

Calculation of QP capabilities still isn't exactly right in mthca:
max_send_sge/max_recv_sge fields returned in create_qp can exceed the
hardware supported limits.

Signed-off-by: Michael S.
Tsirkin

Index: linux-2.6.14/drivers/infiniband/hw/mthca/mthca_qp.c
===================================================================
--- linux-2.6.14/drivers/infiniband/hw/mthca/mthca_qp.c (revision 4042)
+++ linux-2.6.14/drivers/infiniband/hw/mthca/mthca_qp.c (working copy)
@@ -919,10 +919,12 @@ static void mthca_adjust_qp_caps(struct
 	else
 		qp->max_inline_data = max_data_size - MTHCA_INLINE_HEADER_SIZE;
 
-	qp->sq.max_gs = max_data_size / sizeof (struct mthca_data_seg);
-	qp->rq.max_gs = (min(dev->limits.max_desc_sz, 1 << qp->rq.wqe_shift) -
-		sizeof (struct mthca_next_seg)) /
-		sizeof (struct mthca_data_seg);
+	qp->sq.max_gs = min_t(int, dev->limits.max_sg,
+		max_data_size / sizeof (struct mthca_data_seg));
+	qp->rq.max_gs = min_t(int, dev->limits.max_sg,
+		(min(dev->limits.max_desc_sz, 1 << qp->rq.wqe_shift) -
+		sizeof (struct mthca_next_seg)) /
+		sizeof (struct mthca_data_seg));
 }
 
 /*

--
MST

From xma at us.ibm.com Mon Nov 14 13:17:40 2005
From: xma at us.ibm.com (Shirley Ma)
Date: Mon, 14 Nov 2005 13:17:40 -0800
Subject: [openib-general] current SVN is boken
Message-ID:

Just downloaded SVN 4044; it's broken, see the compile error below:

In file included from drivers/infiniband/include/rdma/ib_sa.h:42,
                 from drivers/infiniband/core/at.c:53:
drivers/infiniband/include/rdma/ib_mad.h:601: error: parse error before "gfp_t"

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638

From mst at mellanox.co.il Mon Nov 14 13:26:00 2005
From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Mon, 14 Nov 2005 23:26:00 +0200 Subject: [openib-general] Re: current SVN is boken In-Reply-To: References: Message-ID: <20051114212600.GE3603@mellanox.co.il> Quoting Shirley Ma : > Subject: current SVN is boken > > > Just downloaded the SVN 4044, it's boken, see below compile error: > > In file included from drivers/infiniband/include/rdma/ib_sa.h:42, > from drivers/infiniband/core/at.c:53: > drivers/infiniband/include/rdma/ib_mad.h:601: error: parse error before > "gfp_t" > > Thanks > Shirley Ma Weird, it compiles fine for me. Which kernel are you working on? -- MST From xma at us.ibm.com Mon Nov 14 13:51:22 2005 From: xma at us.ibm.com (Shirley Ma) Date: Mon, 14 Nov 2005 13:51:22 -0800 Subject: [openib-general] Re: current SVN is boken In-Reply-To: <20051114212600.GE3603@mellanox.co.il> Message-ID: These errors are on 2.6.13 kernel. I am moving to 2.6.14/15-rc1 kernel now. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 "Michael S. Tsirkin" 11/14/2005 01:26 PM Please respond to "Michael S. Tsirkin" To Shirley Ma/Beaverton/IBM at IBMUS cc openib-general at openib.org Subject Re: current SVN is boken Quoting Shirley Ma : > Subject: current SVN is boken > > > Just downloaded the SVN 4044, it's boken, see below compile error: > > In file included from drivers/infiniband/include/rdma/ib_sa.h:42, > from drivers/infiniband/core/at.c:53: > drivers/infiniband/include/rdma/ib_mad.h:601: error: parse error before > "gfp_t" > > Thanks > Shirley Ma Weird, it compiles fine for me. Which kernel are you working on? -- MST -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From atorrez at lanl.gov Mon Nov 14 14:35:59 2005
From: atorrez at lanl.gov (Alfred Torrez)
Date: Mon, 14 Nov 2005 15:35:59 -0700
Subject: [openib-general] 2.6.14 Compile Error
Message-ID: <6.0.0.22.2.20051114153015.01e2a728@cic-mail.lanl.gov>

I am seeing the following errors using svn ver 4044 and 2.6.14 kernel.
Have I missed a patch?

drivers/infiniband/core/uat.c: In function `ib_uat_init':
drivers/infiniband/core/uat.c:834: warning: passing arg 2 of `class_device_create' makes integer from pointer without a cast
drivers/infiniband/core/uat.c:834: warning: passing arg 3 of `class_device_create' makes pointer from integer without a cast
drivers/infiniband/core/uat.c:834: warning: too many arguments for format
  CC [M] drivers/infiniband/core/ucm.o
In file included from drivers/infiniband/core/ucm.c:50:
drivers/infiniband/include/rdma/ib_marshall.h:42: warning: `struct ib_uverbs_qp_attr' declared inside parameter list
drivers/infiniband/include/rdma/ib_marshall.h:42: warning: its scope is only this definition or declaration, which is probably not what you want
drivers/infiniband/core/ucm.c: In function `ib_ucm_event_req_get':
drivers/infiniband/core/ucm.c:222: error: structure has no member named `port'
drivers/infiniband/core/ucm.c:224: warning: passing arg 1 of `ib_copy_path_rec_to_user' from incompatible pointer type
drivers/infiniband/core/ucm.c:227: warning: passing arg 1 of `ib_copy_path_rec_to_user' from incompatible pointer type
drivers/infiniband/core/ucm.c: In function `ib_ucm_event_process':
drivers/infiniband/core/ucm.c:298: warning: passing arg 1 of `ib_copy_path_rec_to_user' from incompatible pointer type
drivers/infiniband/core/ucm.c:311: error: structure has no member named `port'
drivers/infiniband/core/ucm.c: In function `ib_ucm_create_id':
drivers/infiniband/core/ucm.c:509: warning: passing arg 1 of `ib_create_cm_id' from incompatible pointer type
drivers/infiniband/core/ucm.c:509: error: too many arguments to function `ib_create_cm_id'
drivers/infiniband/core/ucm.c: In function `ib_ucm_init_qp_attr':
drivers/infiniband/core/ucm.c:614: error: storage size of `resp' isn't known
drivers/infiniband/core/ucm.c:614: warning: unused variable `resp'
make[3]: *** [drivers/infiniband/core/ucm.o] Error 1
make[2]: *** [drivers/infiniband/core] Error 2
make[1]: *** [drivers/infiniband] Error 2
make: *** [drivers] Error 2

Alfred

From nacc at us.ibm.com Mon Nov 14 14:40:51 2005
From: nacc at us.ibm.com (Nishanth Aravamudan)
Date: Mon, 14 Nov 2005 14:40:51 -0800
Subject: [openib-general] 2.6.14 Compile Error
In-Reply-To: <6.0.0.22.2.20051114153015.01e2a728@cic-mail.lanl.gov>
References: <6.0.0.22.2.20051114153015.01e2a728@cic-mail.lanl.gov>
Message-ID: <20051114224051.GA30972@us.ibm.com>

On 14.11.2005 [15:35:59 -0700], Alfred Torrez wrote:
> I am seeing the following errors using svn ver 4044 and 2.6.14 kernel.
> Have I missed a patch?

Which gcc and which arch? Can you try against 2.6.15-rc1 (my latest
compilations for x86, x86_64 and ppc64 all built fine with latest svn
versus latest git (2.6.15-rc1 based)).

Thanks,
Nish

From nacc at us.ibm.com Mon Nov 14 10:25:34 2005
From: nacc at us.ibm.com (Nishanth Aravamudan)
Date: Mon, 14 Nov 2005 10:25:34 -0800
Subject: [openib-general] Infiniband compilation testing
In-Reply-To:
References: <20051113194842.GF13904@us.ibm.com>
Message-ID: <20051114182534.GB4965@us.ibm.com>

On 14.11.2005 [19:03:20 +0200], Dan Bar Dov wrote:
> Hi Nishanth,
>
> I committed fixes to the ppc64 compile warnings in iser:
> Committed revision 4044.

Just wanted to confirm that 4044 compiles w/o infiniband warnings now.
Thanks again, Nish From nacc at us.ibm.com Mon Nov 14 09:54:31 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Mon, 14 Nov 2005 09:54:31 -0800 Subject: [openib-general] Infiniband compilation testing In-Reply-To: References: <20051113194842.GF13904@us.ibm.com> Message-ID: <20051114175431.GA4965@us.ibm.com> On 14.11.2005 [19:03:20 +0200], Dan Bar Dov wrote: > Hi Nishanth, > > I committed fixes to the ppc64 compile warnings in iser: > Committed revision 4044. Great, thanks! -Nish From rolandd at cisco.com Mon Nov 14 14:47:37 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 14 Nov 2005 14:47:37 -0800 Subject: [openib-general] 2.6.14 Compile Error In-Reply-To: <6.0.0.22.2.20051114153015.01e2a728@cic-mail.lanl.gov> (Alfred Torrez's message of "Mon, 14 Nov 2005 15:35:59 -0700") References: <6.0.0.22.2.20051114153015.01e2a728@cic-mail.lanl.gov> Message-ID: <523blyalti.fsf@cisco.com> Alfred> I am seeing the following errors using svn ver 4044 and Alfred> 2.6.14 kernel. Have I missed a patch? You have not replaced include/rdma in the kernel with the svn copy. The easiest thing to do is to rm -rf include/rdma and just link linux-kernel/infiniband from a svn tree to drivers/infiniband in your kernel tree. - R. From mshefty at ichips.intel.com Mon Nov 14 15:22:51 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 14 Nov 2005 15:22:51 -0800 Subject: [openib-general] 2.6.14 Compile Error In-Reply-To: <523blyalti.fsf@cisco.com> References: <6.0.0.22.2.20051114153015.01e2a728@cic-mail.lanl.gov> <523blyalti.fsf@cisco.com> Message-ID: <43791C4B.7060601@ichips.intel.com> Roland Dreier wrote: > Alfred> I am seeing the following errors using svn ver 4044 and > Alfred> 2.6.14 kernel. Have I missed a patch? > > You have not replaced include/rdma in the kernel with the svn copy. > The easiest thing to do is to rm -rf include/rdma and just link > linux-kernel/infiniband from a svn tree to drivers/infiniband in your > kernel tree. 
We may want to add a "common issues" area to the wiki pages that address this. - Sean From atorrez at lanl.gov Mon Nov 14 15:53:01 2005 From: atorrez at lanl.gov (Alfred Torrez) Date: Mon, 14 Nov 2005 16:53:01 -0700 Subject: [openib-general] 2.6.14 Compile Error In-Reply-To: <523blyalti.fsf@cisco.com> References: <6.0.0.22.2.20051114153015.01e2a728@cic-mail.lanl.gov> <523blyalti.fsf@cisco.com> Message-ID: <6.0.0.22.2.20051114165013.01e423a8@cic-mail.lanl.gov> At 03:47 PM 11/14/2005, Roland Dreier wrote: > Alfred> I am seeing the following errors using svn ver 4044 and > Alfred> 2.6.14 kernel. Have I missed a patch? > >You have not replaced include/rdma in the kernel with the svn copy. >The easiest thing to do is to rm -rf include/rdma and just link >linux-kernel/infiniband from a svn tree to drivers/infiniband in your >kernel tree. > > - R. Thanks Roland. I did look at the Installation Cheat Sheet but missed this particular item. Alfred From rolandd at cisco.com Mon Nov 14 17:18:49 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 14 Nov 2005 17:18:49 -0800 Subject: [openib-general] SRP device management client (and a few opensm glitches) Message-ID: <52y83q9092.fsf@cisco.com> Given that people are trying to use SRP on real networks (eg scinet), I hacked up a slightly easier to use method for discovering SRP storage. This program does SA queries to find ports with the device management capability set, and then prints out all the required information to have the SRP initiator connect to any targets it finds. I did discover a few glitches (see below for details). 
For example, on my fabric with one Engenio target, I get:

# ./ibsrpdm
IO Unit Info:
    port LID:        0003
    port GID:        fe800000000000002c902004000e6
    change ID:       0002
    max controllers: 0x10

    controller[ 1]: present
    controller[ 2]: not installed
    controller[ 3]: not installed
    controller[ 4]: not installed
    controller[ 5]: not installed
    controller[ 6]: not installed
    controller[ 7]: not installed
    controller[ 8]: not installed
    controller[ 9]: not installed
    controller[ 10]: not installed
    controller[ 11]: not installed
    controller[ 12]: not installed
    controller[ 13]: not installed
    controller[ 14]: not installed
    controller[ 15]: not installed
    controller[ 16]: not installed

    controller[ 1]
        GUID:      0002c902004000e4
        vendor ID: 0002c9
        device ID: 005a44
        ID:        LSI Storage Systems SRP Driver
        service entries: 1
            service[ 0]: e400400002c90200 / SRP.T10:0002c902004000e4

The opensm issues I saw were:

 - GUIDInfoRecord SA queries are not implemented (I think), so by
   default my code does a (non-compliant) SM class query to get ports'
   GUIDs.
 - opensm sometimes sends a PortInfoRecord with a lot of zeroed-out
   entries (eg I got 30+ entries in a reply that only covered 3 real
   ports). To save on looking at a lot of switch ports, my code does
   a query for PortInfoRecords with a component mask that selects by
   local port number and gets all the port 1s followed by all the port
   2s in the fabric.

 - R.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: srptools-0.0.1.tar.gz
Type: application/x-compressed-tar
Size: 94443 bytes
Desc: not available
URL:

From jackm at mellanox.co.il Mon Nov 14 22:48:09 2005
From: jackm at mellanox.co.il (Jack Morgenstein)
Date: Tue, 15 Nov 2005 08:48:09 +0200
Subject: [openib-general] current SVN is boken
In-Reply-To:
References:
Message-ID: <20051115064809.GA1784@mellanox.co.il>

I'll answer your question, anyway (your next post indicated that you're
moving to 2.6.14).
You need to apply patch: https://openib.org/svn/gen2/branches/backport/2.6.13/verbs_malloc_3926_to_2_6_13.patch to fix the above problem. Jack On Mon, Nov 14, 2005 at 11:17:40PM +0200, Shirley Ma wrote: > > Just downloaded the SVN 4044, it's boken, see below compile error: > > In file included from drivers/infiniband/include/rdma/ib_sa.h:42, > from drivers/infiniband/core/at.c:53: > drivers/infiniband/include/rdma/ib_mad.h:601: error: parse error before > "gfp_t" > > Thanks > Shirley Ma > IBM Linux Technology Center > 15300 SW Koll Parkway > Beaverton, OR 97006-6063 > Phone(Fax): (503) 578-7638 Content-Description: ATT85926.txt > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From yipeeyipeeyipeeyipee at yahoo.com Mon Nov 14 23:03:52 2005 From: yipeeyipeeyipeeyipee at yahoo.com (yipee) Date: Tue, 15 Nov 2005 07:03:52 +0000 (UTC) Subject: [openib-general] Re: OpenSM size References: <5CE025EE7D88BA4599A2C8FEFCF226F589AADC@taurus.voltaire.com> Message-ID: Hal Rosenstock voltaire.com> writes: > How small does it need to be ? What processor architecture ? The processor I'm using is a x86_64, but I guess I can compile OpenSM for x86 (32bits). The binary size I'm seeing is about 2MB. After bzip'ing it shrinks to 670KB. If I could reduce it to 200KB or smaller it would be much better, but even taking off 20% would be good. Thanks From dotanb at mellanox.co.il Mon Nov 14 23:29:52 2005 From: dotanb at mellanox.co.il (Dotan Barak) Date: Tue, 15 Nov 2005 09:29:52 +0200 Subject: [openib-general] RE: changing a UC QP to support RDMA Write is not working Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3878D7E@mtlexch01.mtl.com> > > I think I see the problem. As you said, mthca was incorrectly > checking the responder resources to see if it should enable RDMA > writes on the receive queue. 
However, responder resources only apply
> to RDMA reads and atomics and are never set for UC QPs, so this was
> never set.
>
> Does the patch below for the kernel fix things for you?
>
> - R.

I'm sorry for the delay, but thanks: the patch fixes the problem.

Dotan

From rolandd at cisco.com Mon Nov 14 23:49:16 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 14 Nov 2005 23:49:16 -0800
Subject: [openib-general] SRP device management client (and a few opensm glitches)
In-Reply-To: <52y83q9092.fsf@cisco.com> (Roland Dreier's message of "Mon, 14 Nov 2005 17:18:49 -0800")
References: <52y83q9092.fsf@cisco.com>
Message-ID: <52u0ee8i6b.fsf@cisco.com>

OK here's a new version of the srp DM tool. Changes:

 - actually includes all files needed to compile
 - less verbose by default (new '-v' flag to get original output)
 - new '-c' flag to print the exact thing to echo to the 'add_target'
   file to connect to each target.

For example:

# ./ibsrpdm
IO Unit Info:
    port LID:        0003
    port GID:        fe800000000000000002c902004000e6
    change ID:       0002
    max controllers: 0x10

    controller[ 1]
        GUID:      0002c902004000e4
        vendor ID: 0002c9
        device ID: 005a44
        ID:        LSI Storage Systems SRP Driver
        service entries: 1
            service[ 0]: e400400002c90200 / SRP.T10:0002c902004000e4

# ./ibsrpdm -c
id_ext=0002c902004000e4,ioc_guid=0002c902004000e4,dgid=fe800000000000000002c902004000e6,pkey=ffff,service_id=e400400002c90200

With the last output, I can just do:

echo id_ext=0002c902004000e4,ioc_guid=0002c902004000e4,dgid=fe800000000000000002c902004000e6,pkey=ffff,service_id=e400400002c90200 >> /sys/class/infiniband_srp/srp-mthca0-1/add_target

to get to that target.

 - R.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: srptools-0.0.2.tar.gz Type: application/x-compressed-tar Size: 97614 bytes Desc: not available URL: From eitan at mellanox.co.il Mon Nov 14 23:58:41 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 15 Nov 2005 09:58:41 +0200 Subject: [openib-general] SRP device management client (and a few opensm glitches) In-Reply-To: <52y83q9092.fsf@cisco.com> References: <52y83q9092.fsf@cisco.com> Message-ID: <43799531.1010700@mellanox.co.il> Roland Dreier wrote: > > The opensm issues I saw were: > - GUIDInfoRecord SA queries are not implemented (I think), so by > default my code does a (non-compliant) SM class query to get ports' > GUIDs. Yes this is correct we never got requested for that query. If you are only interested in obtaining the guid of the port you can simply use NodeInfoRecord and you get the guid in the NodeInfo. But you probably know that. Is there anything more you expect to get from the GUIDInfo? Are you using/having multiple GUIDs for port? > - opensm sometimes sends a PortInfoRecord with a lot of zeroed-out > entries (eg I got 30+ entries in a reply that only covered 3 real > ports). This must be an RMPP bug of some sort. How easy is it to reproduce? Please give us some hints. > To save on looking at a lot of switch ports, my code does > a query for PortInfoRecords with a component mask that selects by > local port number and gets all the port 1s followed by all the port > 2s in the fabric. > > - R. > > > > ------------------------------------------------------------------------ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mst at mellanox.co.il Tue Nov 15 00:11:10 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 15 Nov 2005 10:11:10 +0200 Subject: [openib-general] Re: OpenSM size In-Reply-To: References: Message-ID: <20051115081110.GF20871@mellanox.co.il> Hi! 1. Did you strip it? # ls -l /usr/local/bin/opensm -rwxr-xr-x 1 root root 2124734 Nov 15 10:07 /usr/local/bin/opensm # strip /usr/local/bin/opensm # ls -l /usr/local/bin/opensm -rwxr-xr-x 1 root root 333024 Nov 15 10:22 /usr/local/bin/opensm 2. Compile with -Os: edit the Makefile in /usr/src/openib/trunk/src/userspace/management/osm/opensm, replace -O2 with -Os. # ls -l /usr/local/bin/opensm -rwxr-xr-x 1 root root 2119341 Nov 15 10:23 /usr/local/bin/opensm # ls -l /usr/local/bin/opensm -rwxr-xr-x 1 root root 316608 Nov 15 10:24 /usr/local/bin/opensm 3. Do you bzip with -9? # bzip2 -9 /usr/local/bin/opensm # ls -l /usr/local/bin/opensm.bz2 -rwxr-xr-x 1 root root 115566 Nov 15 10:24 /usr/local/bin/opensm.bz2 That's on x86_64. 4. Go 32 bit (needs 32 bit libraries, so it's a bit trickier). You need to add -m32 to the cflags and ldflags of all management libraries. But this is sure to cut the size down even more. Hope this helps, MST Quoting yipee : > Subject: Re: OpenSM size > > Hal Rosenstock voltaire.com> writes: > > > How small does it need to be ? What processor architecture ? > > The processor I'm using is a x86_64, but I guess I can compile OpenSM > for x86 > (32bits). > The binary size I'm seeing is about 2MB. After bzip'ing it shrinks to > 670KB. If > I could reduce it to 200KB or smaller it would be much better, but even > taking > off 20% would be good.
> > Thanks > -- MST From rolandd at cisco.com Tue Nov 15 00:27:21 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 15 Nov 2005 00:27:21 -0800 Subject: [openib-general] SRP device management client (and a few opensm glitches) In-Reply-To: <43799531.1010700@mellanox.co.il> (Eitan Zahavi's message of "Tue, 15 Nov 2005 09:58:41 +0200") References: <52y83q9092.fsf@cisco.com> <43799531.1010700@mellanox.co.il> Message-ID: <52psp28geu.fsf@cisco.com> Eitan> Yes this is correct we never got requested for that query. Eitan> If you are only interested in obtaining the guid of the Eitan> port you can simply use NodeInfoRecord and you get the guid Eitan> in the NodeInfo. But you probably know that. Is there Eitan> anything more you expect to get from the GUIDInfo? Are you Eitan> using/having multiple GUIDs for port? Good point -- I'll just switch to getting the NodeInfoRecord. Eitan> This must be an RMPP bug of some sort. How easy is it to Eitan> reproduce? Please give us some hints. Quite easy in my setup -- it seems to happen every time on my fabric when I do a get table for PortInfoRecords with local port num 2. - R.
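One defensive way for a query client to cope with zeroed-out entries like these is to drop any record whose base LID is 0 before using the results. The following is only a minimal sketch under stated assumptions: `struct example_port_rec` and `example_drop_unassigned()` are hypothetical names with a simplified layout, not the SA PortInfoRecord wire format and not code from ibsrpdm or opensm, and treating a base LID of 0 as "skip this record" is itself an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a PortInfoRecord (hypothetical layout,
 * host byte order; the real record carries many more fields). */
struct example_port_rec {
	uint16_t base_lid;	/* 0 is assumed to mean "no LID assigned" */
	uint8_t  port_num;
};

/*
 * Compact recs[] in place, keeping only entries with a non-zero
 * base LID.  Returns the number of entries kept; the order of the
 * surviving entries is preserved.
 */
static size_t example_drop_unassigned(struct example_port_rec *recs, size_t n)
{
	size_t i, kept = 0;

	for (i = 0; i < n; i++)
		if (recs[i].base_lid != 0)
			recs[kept++] = recs[i];

	return kept;
}
```

Filtering on the client side hides the symptom only; it does not tell you whether the extra records are legitimate or the result of a bug in the reply.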
From rolandd at cisco.com Tue Nov 15 00:30:42 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 15 Nov 2005 00:30:42 -0800 Subject: [openib-general] SRP device management client (and a few opensm glitches) In-Reply-To: <52psp28geu.fsf@cisco.com> (Roland Dreier's message of "Tue, 15 Nov 2005 00:27:21 -0800") References: <52y83q9092.fsf@cisco.com> <43799531.1010700@mellanox.co.il> <52psp28geu.fsf@cisco.com> Message-ID: <52lkzq8g99.fsf@cisco.com> Roland> Quite easy in my setup -- it seems to happen every time on Roland> my fabric when I do a get table for PortInfoRecords with Roland> local port num 2. And running ibsrpdm on the scinet fabric at SC'05 I see hundreds of PortInfoRecords with a base LID of 0... - R. From rolandd at cisco.com Tue Nov 15 00:38:06 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 15 Nov 2005 00:38:06 -0800 Subject: [openib-general] Re: ipoib oops In-Reply-To: <20051114154202.GW20871@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 14 Nov 2005 17:42:02 +0200") References: <20051114154202.GW20871@mellanox.co.il> Message-ID: <52hdae8fwx.fsf@cisco.com> Sorry I haven't been able to look at this immediately, since I've been busy with SC05-related stuff. I hope to sit down and think about this in detail tomorrow... - R. From rolandd at cisco.com Tue Nov 15 00:38:30 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 15 Nov 2005 08:38:30 +0000 Subject: [openib-general] [git patch review 1/3] [IB] srp: increase max_luns Message-ID: <1132043910918-ccc552298b599865@cisco.com> Increase SRP max_luns to 512 to match the kernel's default, since SRP storage targets can have lots of LUNs and the SRP initiator itself doesn't have any particular limit. 
Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/srp/ib_srp.c | 2 ++ drivers/infiniband/ulp/srp/ib_srp.h | 1 + 2 files changed, 3 insertions(+), 0 deletions(-) applies-to: 84a581820bff0fa9830f18138da02d929e4edcb9 5f068992a1bccda5574b4f6d33458ef806686d7f diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 321a3a1..a364530 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -1417,6 +1417,8 @@ static ssize_t srp_create_target(struct if (!target_host) return -ENOMEM; + target_host->max_lun = SRP_MAX_LUN; + target = host_to_target(target_host); memset(target, 0, sizeof *target); diff --git a/drivers/infiniband/ulp/srp/ib_srp.h b/drivers/infiniband/ulp/srp/ib_srp.h index 4fec28a..b564f18 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.h +++ b/drivers/infiniband/ulp/srp/ib_srp.h @@ -54,6 +54,7 @@ enum { SRP_PORT_REDIRECT = 1, SRP_DLID_REDIRECT = 2, + SRP_MAX_LUN = 512, SRP_MAX_IU_LEN = 256, SRP_RQ_SHIFT = 6, --- 0.99.9g From rolandd at cisco.com Tue Nov 15 00:38:30 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 15 Nov 2005 08:38:30 +0000 Subject: [openib-general] [git patch review 2/3] [IB] srp: don't post receive if no send buf available In-Reply-To: <1132043910918-ccc552298b599865@cisco.com> Message-ID: <1132043910918-5d38e36f350b7b00@cisco.com> Have __srp_get_tx_iu() fail if the target port's request limit will not allow the initiator to post a send. This avoids continuing on and posting a receive, and then failing to post a corresponding send. If that happens, then the initiator will end up with an extra receive posted, and if this happens too much, the receive queue will overflow.
Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/srp/ib_srp.c | 15 +++++++++------ 1 files changed, 9 insertions(+), 6 deletions(-) applies-to: a5f8266c59f39f0a1f3dc3d71a00da7276ac1a80 47f2bce9021b4974ed33b072ebb8348c8145c946 diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index a364530..ee9fe22 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -802,13 +802,21 @@ static int srp_post_recv(struct srp_targ /* * Must be called with target->scsi_host->host_lock held to protect - * req_lim and tx_head. + * req_lim and tx_head. Lock cannot be dropped between call here and + * call to __srp_post_send(). */ static struct srp_iu *__srp_get_tx_iu(struct srp_target_port *target) { if (target->tx_head - target->tx_tail >= SRP_SQ_SIZE) return NULL; + if (unlikely(target->req_lim < 1)) { + if (printk_ratelimit()) + printk(KERN_DEBUG PFX "Target has req_lim %d\n", + target->req_lim); + return NULL; + } + return target->tx_ring[target->tx_head & SRP_SQ_SIZE]; } @@ -823,11 +831,6 @@ static int __srp_post_send(struct srp_ta struct ib_send_wr wr, *bad_wr; int ret = 0; - if (target->req_lim < 1) { - printk(KERN_ERR PFX "Target has req_lim %d\n", target->req_lim); - return -EAGAIN; - } - list.addr = iu->dma; list.length = len; list.lkey = target->srp_host->mr->lkey; --- 0.99.9g From rolandd at cisco.com Tue Nov 15 00:38:30 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 15 Nov 2005 08:38:30 +0000 Subject: [openib-general] [git patch review 3/3] [IB] mthca: don't disable RDMA writes if no responder resources In-Reply-To: <1132043910918-5d38e36f350b7b00@cisco.com> Message-ID: <1132043910918-bea399454f2ff33e@cisco.com> Responder resources are only required to handle RDMA reads and atomic operations, not RDMA writes. So the driver should allow RDMA writes even if responder resources are set to 0. 
This is especially important for the UC transport -- with the old code, it was impossible to enable RDMA writes for UC QPs. Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_qp.c | 27 ++++++++++++--------------- 1 files changed, 12 insertions(+), 15 deletions(-) applies-to: 2d17f2cdc77646d07cb2a598e3d2bcbdf94675ad cbc5b2bb9e226c2b2b981836d2289912e2ef3c1c diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index 760c418..5899f0c 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -730,15 +730,16 @@ int mthca_modify_qp(struct ib_qp *ibqp, } if (attr_mask & IB_QP_ACCESS_FLAGS) { + qp_context->params2 |= + cpu_to_be32(attr->qp_access_flags & IB_ACCESS_REMOTE_WRITE ? + MTHCA_QP_BIT_RWE : 0); + /* - * Only enable RDMA/atomics if we have responder - * resources set to a non-zero value. + * Only enable RDMA reads and atomics if we have + * responder resources set to a non-zero value. */ if (qp->resp_depth) { qp_context->params2 |= - cpu_to_be32(attr->qp_access_flags & IB_ACCESS_REMOTE_WRITE ? - MTHCA_QP_BIT_RWE : 0); - qp_context->params2 |= cpu_to_be32(attr->qp_access_flags & IB_ACCESS_REMOTE_READ ? MTHCA_QP_BIT_RRE : 0); qp_context->params2 |= @@ -759,31 +760,27 @@ int mthca_modify_qp(struct ib_qp *ibqp, if (qp->resp_depth && !attr->max_dest_rd_atomic) { /* * Lowering our responder resources to zero. - * Turn off RDMA/atomics as responder. - * (RWE/RRE/RAE in params2 already zero) + * Turn off reads RDMA and atomics as responder. + * (RRE/RAE in params2 already zero) */ - qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RWE | - MTHCA_QP_OPTPAR_RRE | + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RRE | MTHCA_QP_OPTPAR_RAE); } if (!qp->resp_depth && attr->max_dest_rd_atomic) { /* * Increasing our responder resources from - * zero. Turn on RDMA/atomics as appropriate. + * zero. Turn on RDMA reads and atomics as + * appropriate. 
*/ qp_context->params2 |= - cpu_to_be32(qp->atomic_rd_en & IB_ACCESS_REMOTE_WRITE ? - MTHCA_QP_BIT_RWE : 0); - qp_context->params2 |= cpu_to_be32(qp->atomic_rd_en & IB_ACCESS_REMOTE_READ ? MTHCA_QP_BIT_RRE : 0); qp_context->params2 |= cpu_to_be32(qp->atomic_rd_en & IB_ACCESS_REMOTE_ATOMIC ? MTHCA_QP_BIT_RAE : 0); - qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RWE | - MTHCA_QP_OPTPAR_RRE | + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RRE | MTHCA_QP_OPTPAR_RAE); } --- 0.99.9g From eitan at mellanox.co.il Tue Nov 15 01:30:04 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 15 Nov 2005 11:30:04 +0200 Subject: [openib-general] SRP device management client (and a few opensm glitches) Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3618981@mtlexch01.mtl.com> Thanks. We will try and reproduce it here. Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Tuesday, November 15, 2005 10:31 AM > To: Eitan Zahavi > Cc: openib-general at openib.org > Subject: Re: [openib-general] SRP device management client (and a few opensm > glitches) > > Roland> Quite easy in my setup -- it seems to happen every time on > Roland> my fabric when I do a get table for PortInfoRecords with > Roland> local port num 2. > > And running ibsrpdm on the scinet fabric at SC'05 I see hundreds of > PortInfoRecords with a base LID of 0... > > - R. From yael at mellanox.co.il Tue Nov 15 02:06:27 2005 From: yael at mellanox.co.il (Yael Kalka) Date: Tue, 15 Nov 2005 12:06:27 +0200 Subject: [openib-general] SRP device management client (and a few opensm glitches) Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30E23EA@mtlexch01.mtl.com> Hello Roland, When turning on only the comp_mask for the local_port_num you will get all relevant PortInfo records from the switches. 
These records do have many fields zeroed out (e.g. subnet_prefix), but they are still valid records. Is this what you are seeing? Thanks, Yael -----Original Message----- From: Eitan Zahavi [mailto:eitan at mellanox.co.il] Sent: Tuesday, November 15, 2005 11:30 AM To: Roland Dreier; Eitan Zahavi Cc: openib-general at openib.org Subject: RE: [openib-general] SRP device management client (and a few opensm glitches) Thanks. We will try and reproduce it here. Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Tuesday, November 15, 2005 10:31 AM > To: Eitan Zahavi > Cc: openib-general at openib.org > Subject: Re: [openib-general] SRP device management client (and a few opensm > glitches) > > Roland> Quite easy in my setup -- it seems to happen every time on > Roland> my fabric when I do a get table for PortInfoRecords with > Roland> local port num 2. > > And running ibsrpdm on the scinet fabric at SC'05 I see hundreds of > PortInfoRecords with a base LID of 0... > > - R. From mst at mellanox.co.il Tue Nov 15 06:19:47 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 15 Nov 2005 16:19:47 +0200 Subject: [openib-general] Re: ipoib oops In-Reply-To: <52hdae8fwx.fsf@cisco.com> References: <52hdae8fwx.fsf@cisco.com> Message-ID: <20051115141947.GL20871@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: ipoib oops > > Sorry I haven't been able to look at this immediately, since I've been > busy with SC05-related stuff. > > I hope to sit down and think about this in detail tomorrow... > > - R. > OK.
Meanwhile, there's another possible issue that I think I see: if an entry is inserted in mcast_list after ipoib_mcast_restart_task calls ipoib_mcast_stop_thread but before spin_lock_irqsave(&priv->lock, flags);, then ipoib_mcast_restart_task can remove this mcast entry from the list, without waiting for the query to complete. -- MST From halr at voltaire.com Tue Nov 15 06:23:45 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 15 Nov 2005 16:23:45 +0200 Subject: [openib-general] SRP device management client (and a few opensmglitches) Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AAE8@taurus.voltaire.com> Hi, It's not necessarily an RMPP bug. A lot of the port 2s on SCinet are not plugged in. -- Hal ________________________________ From: openib-general-bounces at openib.org on behalf of Roland Dreier Sent: Tue 11/15/2005 3:27 AM To: Eitan Zahavi Cc: openib-general at openib.org Subject: Re: [openib-general] SRP device management client (and a few opensmglitches) Eitan> Yes this is correct we never got requested for that query. Eitan> If you are only interested in obtaining the guid of the Eitan> port you can simply use NodeInfoRecord and you get the guid Eitan> in the NodeInfo. But you probably know that. Is there Eitan> anything more you expect to get from the GUIDInfo? Are you Eitan> using/having multiple GUIDs for port? Good point -- I'll just switch to getting the NodeInfoRecord. Eitan> This must be an RMPP bug of some sort. How easy is it to Eitan> reproduce? Please give us some hints. Quite easy in my setup -- it seems to happen every time on my fabric when I do a get table for PortInfoRecords with local port num 2. - R. 
From eitan at mellanox.co.il Tue Nov 15 06:41:58 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 15 Nov 2005 16:41:58 +0200 Subject: [openib-general] SRP device management client (and a few opensmglitches) Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3618985@mtlexch01.mtl.com> I think Yael figured it out: Looking at Roland's code it seems like it will not filter out the PortRecords coming from switch physical ports. So actually he gets many records that all have base lid = 0 and gid = 0 from these ports... I assume this is the case. There is no trivial way to know from the PortInfo in the PortRecord to which type of node the port belongs. Only a combination of NodeRecord and PortRecord (by base lid) can tell you if the PortRecord is an HCA port or switch port. EZ Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Tuesday, November 15, 2005 4:24 PM > To: Roland Dreier; Eitan Zahavi > Cc: openib-general at openib.org > Subject: RE: [openib-general] SRP device management client (and a few > opensmglitches) > > Hi, > > It's not necessarily an RMPP bug. A lot of the port 2s on SCinet are not plugged in. > > -- Hal > > ________________________________ > > From: openib-general-bounces at openib.org on behalf of Roland Dreier > Sent: Tue 11/15/2005 3:27 AM > To: Eitan Zahavi > Cc: openib-general at openib.org > Subject: Re: [openib-general] SRP device management client (and a few > opensmglitches) > > > > Eitan> Yes this is correct we never got requested for that query.
> Eitan> If you are only interested in obtaining the guid of the > Eitan> port you can simply use NodeInfoRecord and you get the guid > Eitan> in the NodeInfo. But you probably know that. Is there > Eitan> anything more you expect to get from the GUIDInfo? Are you > Eitan> using/having multiple GUIDs for port? > > Good point -- I'll just switch to getting the NodeInfoRecord. > > Eitan> This must be an RMPP bug of some sort. How easy is it to > Eitan> reproduce? Please give us some hints. > > Quite easy in my setup -- it seems to happen every time on my fabric > when I do a get table for PortInfoRecords with local port num 2. > > - R. From krause at cup.hp.com Tue Nov 15 06:43:18 2005 From: krause at cup.hp.com (Michael Krause) Date: Tue, 15 Nov 2005 06:43:18 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB In-Reply-To: <4378F863.2070707@sun.com> References: <6.2.0.14.2.20051111115707.0254a7e0@esmail.cup.hp.com> <96f8e60e0511111302i8f6db3anfc404b0998c8885@mail.gmail.com> <6.2.0.14.2.20051111155559.02542268@esmail.cup.hp.com> <4378F863.2070707@sun.com> Message-ID: <6.2.0.14.2.20051115064030.026fe118@esmail.cup.hp.com> At 12:49 PM 11/14/2005, Nitin Hande wrote: >Michael Krause wrote: >>At 01:02 PM 11/11/2005, Ranjit Pandit wrote: >> >>>On 11/11/05, Michael Krause wrote: >>> > Please clarify the following which was in the document provided by >>> Oracle. >>> > >>> > On page 3 of the RDS document, under the section "RDP Interface", the 2nd >>> > and 3rd paragraphs state: >>> > >>> > * RDP does not guarantee that a datagram is delivered to the remote >>> > application.
>>> > * It is up to the RDP client to deal with datagrams lost due to >>> transport >>> > failure or remote application failure. >>> > >>> > The HCA is still a fault domain with RDS - it does not address >>> flushing data >>> > out of the HCA fault domain, nor does it sound like it ensures that >>> CQE loss >>> > is recoverable. >>> > >>> > I do believe RDS will replay all of the sendmsg's that it believes are >>> > pending, but it has no way to determine if already sent sendmsgs were >>> > actually successfully delivered to the remote application unless it >>> provides >>> > some level of resync of the outstanding sends not completed from an >>> > application's perspective as well as any state updated via RDMA >>> operations >>> > which may occur without an explicit send operation to flush to a known >>> > state. I'm still trying to ascertain whether RDS completely recovers >>> from >>> > HCA failure (assuming there is another HCA / path available) between >>> the two >>> > endnodes. >>> >>>RDS will replay the sends that are completed in error by the HCA, >>>which typically would happen if the current path fails or the remote >>>node/HCA dies. >> >>Does this mean that the receiving RDS entity is responsible for dealing >>with duplicates? >I believe so... > >A Send completion error does not mean that the >>receiving endnode did not receive the data for either IB or iWARP; it >>only indicates that the Send operation failed which could be just a loss >>of the receive ACK with the Send completing on the receiver. Such a >>scenario would imply that RDS would have to comprehend what buffers have >>actually been consumed before retransmission, i.e. 
a resync is performed, >>else one could receive duplicate data at the application layer which can >>cause corruption or other problems as a function of the application >>(tolerance will vary by application thus the ULP must present consistent >>semantics to enable a broader set of applications than perhaps the >>initial targeted application to be supported). >In the absence of any protocol level ack (and regardless of protocol level >ack), it is the application which has to implement its own reliability. >RDS becomes a passive channel passing packets back and forth including >duplicate packets. The responsibility then shifts to the application to >figure out what is missing, duplicates, etc. This would seem at odds with earlier assertions that as long as there were another path to the endnode, RDS would transparently recover on behalf of the application. I thought Oracle stated for their application that send failure would be interpreted as endnode failure and cast out the peer - perhaps I misread their usage model. Other applications that might want to use RDS could be designed to deal with the associated faults, but if one has to deal with recovery / resync at the application layer, then that is quite a bit of work to perform in every application and is again at odds with the purpose of RDS, which is to move reliability to the interconnect to the extent possible and to RDS so that the UDP application does not need to take on this complex code and attempt to get it right. Mike >Thanks >Nitin > > >> >>>In case of a catastrophic error on the local HCA, subsequent sends will >>>fail (for a certain time (session_time_wait)) as if there were no >>>alternate path available at that time. On getting an error the >>>application should discard any sends unacknowledged by its peer and >>>take corrective action. >> >>Unacknowledged by the peer means at the interconnect or the application >>level? Again, how is the receive buffer management handled?
>> >>>After the time_wait is over, subsequent sends will initiate a brand new >>>connection which could use the alternate HCA ( if the path is available). >> >>This is understood. >>Mike From krause at cup.hp.com Tue Nov 15 06:46:36 2005 From: krause at cup.hp.com (Michael Krause) Date: Tue, 15 Nov 2005 06:46:36 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB In-Reply-To: <4378F868.5080205@sun.com> References: <6.2.0.14.2.20051111115707.0254a7e0@esmail.cup.hp.com> <4375069D.7040106@sun.com> <6.2.0.14.2.20051111155127.02542120@esmail.cup.hp.com> <4378F868.5080205@sun.com> Message-ID: <6.2.0.14.2.20051115064339.02912820@esmail.cup.hp.com> At 12:49 PM 11/14/2005, Nitin Hande wrote: >Michael Krause wrote: >>At 01:01 PM 11/11/2005, Nitin Hande wrote: >> >>>Michael Krause wrote: >>> >>>>At 10:28 AM 11/9/2005, Rick Frank wrote: >>>> >>>>>Yes, the application is responsible for detecting lost msgs at the >>>>>application level - the transport can not do this. >>>>> >>>>>RDS does not guarantee that a message has been delivered to the >>>>>application - just that once the transport has accepted a msg it will >>>>>deliver the msg to the remote node in order without duplication - >>>>>dealing with retransmissions, etc due to sporadic / intermittent msg >>>>>loss over the interconnect.
If after accepting the send - the current >>>>>path fails - then RDS will transparently fail over to another path - >>>>>and if required will resend / send any already queued msgs to the >>>>>remote node - again ensuring that no msg is duplicated and they are in >>>>>order. This is no different than APM - with the exception that RDS >>>>>can do this across HCAs. >>>>> >>>>>The application - Oracle in this case - will deal with detecting a >>>>>catastrophic path failure - either due to a send that does not arrive >>>>>and/or a timed-out response or send failure returned from the >>>>>transport. If there is no network path to a remote node - it is >>>>>required that we remove the remote node from the operating cluster to >>>>>avoid what is commonly termed as a "split brain" condition - otherwise >>>>>known as a "partition in time". >>>>> >>>>>BTW - in our case - the application failure domain logic is the same >>>>>whether we are using UDP / uDAPL / iTAPI / TCP / SCTP / etc. >>>>>Basically, if we cannot talk to a remote node - after some defined >>>>>period of time - we will remove the remote node from the cluster. In >>>>>this case the database will recover all the interesting state that may >>>>>have been maintained on the removed node - allowing the remaining >>>>>nodes to continue. If later on, communication to the remote node is >>>>>restored - it will be allowed to rejoin the cluster and take on >>>>>application load. >>>> >>>> >>>>Please clarify the following which was in the document provided by Oracle. >>>>On page 3 of the RDS document, under the section "RDP Interface", the >>>>2nd and 3rd paragraphs state: >>>> * RDP does not guarantee that a datagram is delivered to the remote >>>> application. >>>> * It is up to the RDP client to deal with datagrams lost due to >>>> transport failure or remote application failure.
>>>>The HCA is still a fault domain with RDS - it does not address flushing >>>>data out of the HCA fault domain, nor does it sound like it ensures >>>>that CQE loss is recoverable. >>>>I do believe RDS will replay all of the sendmsgs that it believes are >>>>pending, but it has no way to determine if already sent sendmsgs were >>>>actually successfully delivered to the remote application unless it >>>>provides some level of resync of the outstanding sends not completed >>>>from an application's perspective as well as any state updated via RDMA >>>>operations which may occur without an explicit send operation to flush >>>>to a known state. >>> >>>If RDS could define a mechanism that the application could use to inform >>>the sender to resync and replay on catastrophic failure, is that a >>>correct understanding of your suggestion ? >> >>I'm not suggesting anything at this point. I'm trying to reconcile the >>documentation with the e-mail statements made by its proponents. >> >>>I'm still trying to ascertain whether RDS completely >>> >>>>recovers from HCA failure (assuming there is another HCA / path >>>>available) between the two endnodes >>> >>>Reading the doc and the thread, it looks like we need src/dst port >>>for multiplexing connections, we need seq/ack# for resyncing, we need >>>some kind of window availability for flow control. Aren't we very close >>>to a TCP header ? .. >> >>TCP does not provide end-to-end to the application as implemented by most >>OSes. Unless one ties TCP ACK to the application's consumption of the >>receive data, there is no method to ascertain that the application really >>received the data. The application would be required to send its own >>application-level acknowledgement. I believe the intent is for >>applications to remain responsible for the end-to-end receipt of data and >>that RDS and the interconnect are simply responsible for the exchange at >>the lower levels.
>Yes, a TCP ack only implies that it has received the data, and means >nothing to the application. It is the application which has to send an >application level ack to its peer. TCP ACK was intended to be an end-to-end ACK but implementations took it to a lower level ACK only. A TCP stack linked into an application, as demonstrated by multiple IHVs and research, does provide an end-to-end ACK and considerable performance improvements over the traditional network stack implementations. Some claim it is more than good enough to eliminate the need for protocol off-load / RDMA, which is true for many applications (certainly for most Sockets, etc.) but not true when one takes advantage of the RDMA comms paradigm, which has benefit for a number of applications. Mike From halr at voltaire.com Tue Nov 15 06:48:02 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 15 Nov 2005 16:48:02 +0200 Subject: [openib-general] SRP device management client (and a few opensmglitches) Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AAEC@taurus.voltaire.com> Roland, Just to close the loop on this: Are those the only component fields which were zeroed out? Thanks. -- Hal ________________________________ From: Eitan Zahavi [mailto:eitan at mellanox.co.il] Sent: Tue 11/15/2005 9:41 AM To: Hal Rosenstock; Roland Dreier Cc: openib-general at openib.org Subject: RE: [openib-general] SRP device management client (and a few opensmglitches) I think Yael figured it out: Looking at Roland's code it seems like it will not filter out the PortRecords coming from switch physical ports. So actually he gets many records that all have base lid = 0 and gid = 0 from these ports... I assume this is the case. There is no trivial way to know from the PortInfo in the PortRecord to which type of node the port belongs.
Only a combination of NodeRecord PortRecord (by base lid) can tell you if the PortRecord is an HCA port or switch port. EZ Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Tuesday, November 15, 2005 4:24 PM > To: Roland Dreier; Eitan Zahavi > Cc: openib-general at openib.org > Subject: RE: [openib-general] SRP device management client (and a few > opensmglitches) > > Hi, > > It's not necessarily an RMPP bug. A lot of the port 2s on SCinet are not plugged in. > > -- Hal > > ________________________________ > > From: openib-general-bounces at openib.org on behalf of Roland Dreier > Sent: Tue 11/15/2005 3:27 AM > To: Eitan Zahavi > Cc: openib-general at openib.org > Subject: Re: [openib-general] SRP device management client (and a few > opensmglitches) > > > > Eitan> Yes this is correct we never got requested for that query. > Eitan> If you are only interested in obtaining the guid of the > Eitan> port you can simply use NodeInfoRecord and you get the guid > Eitan> in the NodeInfo. But you probably know that. Is there > Eitan> anything more you expect to get from the GUIDInfo? Are you > Eitan> using/having multiple GUIDs for port? > > Good point -- I'll just switch to getting the NodeInfoRecord. > > Eitan> This must be an RMPP bug of some sort. How easy is it to > Eitan> reproduce? Please give us some hints. > > Quite easy in my setup -- it seems to happen every time on my fabric > when I do a get table for PortInfoRecords with local port num 2. > > - R. 
> _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From Arkady.Kanevsky at netapp.com Tue Nov 15 06:51:46 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Tue, 15 Nov 2005 09:51:46 -0500 Subject: [openib-general] RE: [dat-discussions] socket based connectionmodel for IB proposal - round 3 Message-ID: The goal that this proposal is to provide underpinning for common RDMA transport CM. Thus, the API ULP (both user space and kernel space) use socket addressing. For ULP addressing this means 5 tuple: protocol, src IP addr, src port, dst IP addr, and dst port. Port is 16 bit entity. The proposal just provide a mechanism for exchanging this 5-tuple between two sides. Which entity is responsible to "use" the proposed protocol is an interesting one. I was assuming that this will be CM. After all the proposed protocol is CM extension protocol. But it can be another entity module between CM and ULP. Its job will be taking 5 tuple and populating private data and converting dst port to SID. Since OpenIB addr.c already deals with IP to IB address translations it is a logical candidate for it. On remote side it extracts info from private data and populates socket info for Consumer and passes Consumer a pointer to Consumer private data. Another interesting place to deal with is listening point. Since it is common RDMA API, 16 bit port should be use for it also. This means that the same module should locally convert port to IB SID before passing it to CM. CM just ensures that incoming connection request which matches listening SID. While it is possible to do wildcarding on the whole SID, I had not seen it is used selectively on individual bits of a SID or a port. 
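For illustration only (not part of the original message): wildcarding selected bits of a SID can be sketched as a mask-and-compare, assuming a listener registers a (service ID, mask) pair. The struct, the helper names, and the prefix value below are hypothetical and chosen for this sketch; they are not the OpenIB CM interface.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical listen entry: a service ID plus a mask of the bits that must match. */
struct sid_listen {
    uint64_t service_id;
    uint64_t service_mask;  /* 1 bits are compared, 0 bits are wildcards */
};

/* Assumed layout for this sketch: fixed prefix, 16-bit port in the low bits. */
#define SKETCH_SID_PREFIX 0x0000000000010000ULL
#define SKETCH_PORT_MASK  0x000000000000FFFFULL

/* Build a SID from a 16-bit port. */
static uint64_t sid_from_port(uint16_t port)
{
    return SKETCH_SID_PREFIX | (uint64_t)port;
}

/* Return 1 if an incoming SID matches the listen entry under its mask. */
static int sid_matches(const struct sid_listen *l, uint64_t incoming)
{
    return (incoming & l->service_mask) == (l->service_id & l->service_mask);
}

/* Listen on exactly one port: every bit of the SID is compared. */
static struct sid_listen listen_on_port(uint16_t port)
{
    struct sid_listen l = { sid_from_port(port), ~0ULL };
    return l;
}

/* Listen on any port under the prefix: the low 16 bits are wildcarded. */
static struct sid_listen listen_on_any_port(void)
{
    struct sid_listen l = { SKETCH_SID_PREFIX, ~SKETCH_PORT_MASK };
    return l;
}
```

In this sketch, wildcarding a source port would simply mean clearing the mask bits of whichever SID field carried it.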
While SDP does the conversion to IB SID from Ethernet port, this proposal shift the responsibility for port and IP address conversion from ULP down. Now lets look at each field proposed to be moved from protocol private data to SID. Protocol version. This mean that in the future if protocol version will be bumped up we will have to change the SID on which Consumer listens on and requests sent to. Not sure how to do that without changing ULP. Does not look like a good idea. IP version. This can be incorporated into SID. But if HCA has multiple IP addresses assigned to it the listening point need to specify its IP address(es). The current verbs and/or API will have to be changed to support it. But if socket is passed to listen on it does have all the needed info. Looks fine. Ethernet Protocol. The same as the one above. Src port. Very questionable. For that listening SID must have wild card for portion of SID where SRC port is incorporated. Since ULP is not aware or ever see it, it is possible. But this pushes the definition of SID beyond it current IBTA spec statement of "similar to TCP port number". The query of listen point should also hide the wildcarded SID in this case. DAPL APIs (uDAPL and kDAPL) does not expose local IP address for listen point. An additional API can be added to support passing local socket to listen on instead of Connection Qualifier. Since it is addition no backwards compatibility issues. The current ULPs/Apps will still use "the default API address" and the protocol assigned SID as connection qualifier. The new API ensures that locally SID conversion takes place. The use of protocol defined range of SIDs ensures that remote side knows to parse private data according to proposed protocol format. Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 275 Totten Pond Rd. 
Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Friday, November 11, 2005 12:43 PM > To: Kanevsky, Arkady > Cc: Sean Hefty; swg at infinibandta.org; > dat-discussions at yahoogroups.com; openib-general at openib.org > Subject: Re: [openib-general] RE: [dat-discussions] socket > based connectionmodel for IB proposal - round 3 > > Kanevsky, Arkady wrote: > > So what you are proposing is that Listener will specify > IETF port (2 > > bytes). > > CM will generate an IB SID to listen on. That SID will have > > wildcarding for 24 bits. > > The requestor will specify: version, IP version, SRC port > and DST port. > > Based on that CM will generate the SID to send request to. > > No, the listener or requester generate the SID, not the IB CM > - the same way SDP works today. > > > It will also encode IP addresses into Private data based on > IP version. > > > > This makes IP addresses, SIDs and private data format > interdependent > > and not orthogonal which it is now. > > It also changes the meaning of SID which currently has a meaning of > > TCP port. > > I'm not proposing this. I'm merely stating that is is a > valid option to consider. The private data format and SIDs > are not orthogonal anyway. The port number's embedded in the > SID, and the SID indicates the format of the private data. > They are interdependent by definition. > > If it's okay to put the destination port number in the SID, > why not the protocol type, or IP version? > > > It also does not allow to use the private data formating > for other SIDs. > > Private data is private. It should not be owned, set, > interpreted, modified, or touched by the CM. It's up to the > service to define and use. > > What's this proposal defines is basically a 65th bit for the > service ID. 
If the new 65 bit SID is: > > 1 - private data has this format > 0 - private data format is unknown > > Why do we need this 65th bit? > > > It looks like a big hack. Is it worth it for extra 4 bytes > of private > > data for Consumers? > > It's a trade off between SID space and private data. > Consumers need to decide how important those extra 4 bytes are. > > - Sean > > From bboas at llnl.gov Tue Nov 15 06:52:30 2005 From: bboas at llnl.gov (Bill Boas) Date: Tue, 15 Nov 2005 06:52:30 -0800 Subject: [openib-general] Fwd: Invitation to OpenIB BOF at SC05 Wednesday 11-12 Room 205 in Convention Center Message-ID: <6.2.3.4.2.20051115065215.02f82630@mail-lc.llnl.gov> >Date: Mon, 14 Nov 2005 15:16:51 -0800 >To: openib-promoters at openib.org, openib-general at openib.org, >sc05-ib at lists.scl.ameslab.gov, Eric Lantz, openib-windows at openib.org >From: Bill Boas >Subject: Invitation to OpenIB BOF at SC05 Wednesday 11-12 Room 205 >in Convention Center >Cc: seager at llnl.gov, gorda1 at llnl.gov, tdhooge at llnl.gov, >robp at ncsa.uiuc.edu, Jacques-Charles Lafoucriere >, Peter Hass , Andy >Bechtolsheim, Linden Mercer, Rick Cecil, Jason Gunthorpe >, David Southwell, >langer1 at llnl.gov, Steve Lyness, Steve Poole, Scot Schultz, Chet >Mehta, Kyril Faenov > >Please attend the OpenIB Birds of Feather at Sc|05 on Wednesday >November 16 at 11.00 AM in Room 205 in the Convention Center > >Agenda and Goals of BOF Bill Bio > >SCinet05-IB team experience and Exhibitor Booth Feedback on >Infiniband Network at SC05 - speakers from team who did it > >Integrated OpenIB and iWARP stack demo and Direct Access Network >naming idea - Tom Tucker > >Long distance Links at SC05 results and feedback - Linden Mercer, Rick Cecil, > >Microsoft Feedback - Eric lantz > >New members > >Next Events Content, location and schedule > - Proposed Interoperability Event - Thad Omura > - Sonoma Workshop - TBD > - HLRS Workshop - Peter Haas > >Other items and Wrap Up - Bill Boas > > >Bill Boas bboas at 
llnl.gov >ICCD LLNL, B-453, R-2018 Wk: 925-422-4110 >7000 East Ave, L-555 Cell: 925-337-2224 >Livermore, CA 94551 Pgr: 877-203-2248 Bill Boas bboas at llnl.gov ICCD LLNL, B-453, R-2018 Wk: 925-422-4110 7000 East Ave, L-555 Cell: 925-337-2224 Livermore, CA 94551 Pgr: 877-203-2248 From mst at mellanox.co.il Tue Nov 15 07:09:24 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 15 Nov 2005 17:09:24 +0200 Subject: [openib-general] Re: [PATCH] mthca: fix qp max_send/recv_sge calculation In-Reply-To: <20051114211117.GB3603@mellanox.co.il> References: <20051114211117.GB3603@mellanox.co.il> Message-ID: <20051115150924.GO20871@mellanox.co.il> Quoting r. Michael S. Tsirkin : > Subject: [PATCH] mthca: fix qp max_send/recv_sge calculation > > Roland, I think I see a problem in mthca, where qp capability values > we return arent safe. > How does the following look (compile tested only)? This is tested now, please review/apply. -- MST From rolandd at cisco.com Tue Nov 15 07:20:36 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 15 Nov 2005 07:20:36 -0800 Subject: [openib-general] SRP device management client (and a few opensm glitches) In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30E23EA@mtlexch01.mtl.com> (Yael Kalka's message of "Tue, 15 Nov 2005 12:06:27 +0200") References: <6AB138A2AB8C8E4A98B9C0C3D52670E30E23EA@mtlexch01.mtl.com> Message-ID: <52d5l27xa3.fsf@cisco.com> Yael> Hello Roland, When turning on only the comp_mask for the Yael> local_port_num you will get all relevant PortInfo records Yael> from the switches. These records do have many fields zeroed Yael> out (e.g subnet_prefix), but they are still valid records. Yael> Is this what you are seeing? I guess the bug is related, because in my test fabric which has a single 24-port switch chip, I see 24 entries with a zero base_lid and a local_port_num of 2. - R. 
From rolandd at cisco.com Tue Nov 15 07:31:26 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 15 Nov 2005 07:31:26 -0800 Subject: [openib-general] SRP device management client (and a few opensmglitches) In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F589AAEC@taurus.voltaire.com> (Hal Rosenstock's message of "Tue, 15 Nov 2005 16:48:02 +0200") References: <5CE025EE7D88BA4599A2C8FEFCF226F589AAEC@taurus.voltaire.com> Message-ID: <528xvq7ws1.fsf@cisco.com> Hal> Are those the only component fields which were zeroed out ? No, for example the capability mask is all 0 as well. It seems to be something a little bit more complicated. In my test fabric, which has a single 24-port switch with hosts connected to both port 1 and port 2 (of the switch), I see no extra entries for my port 1 query. For my port 2 query, I get exactly 24 entries with a base_lid of zero and a local port number of 2. The PortNum in the SA header goes from 1 through 24. - R. From eitan at mellanox.co.il Tue Nov 15 07:41:27 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 15 Nov 2005 17:41:27 +0200 Subject: [openib-general] SRP device management client (and a few opensm glitches) Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3618989@mtlexch01.mtl.com> Hi Roland, If you only got single 24port switch you should only see 1 record with base lid = 0 and port num = 2. But maybe we have a bug not comparing port num. On our test today we have seen only one record for port 2 from each switch (we had two switches so got 2 recodrs). Could you dump out the content of the PortRecords that you get as response? Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. 
Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Tuesday, November 15, 2005 5:21 PM > To: Yael Kalka > Cc: Eitan Zahavi; openib-general at openib.org > Subject: Re: [openib-general] SRP device management client (and a few opensm > glitches) > > Yael> Hello Roland, When turning on only the comp_mask for the > Yael> local_port_num you will get all relevant PortInfo records > Yael> from the switches. These records do have many fields zeroed > Yael> out (e.g subnet_prefix), but they are still valid records. > Yael> Is this what you are seeing? > > I guess the bug is related, because in my test fabric which has a > single 24-port switch chip, I see 24 entries with a zero base_lid and > a local_port_num of 2. > > - R. From eitan at mellanox.co.il Tue Nov 15 07:46:52 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 15 Nov 2005 17:46:52 +0200 Subject: [openib-general] SRP device management client (and a few opensmglitches) Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E361898A@mtlexch01.mtl.com> You should not get more then one SA header. I assumed you are doing GetTable of PortInfoRecord. If this is correct you should only get one SA header in the resulting RMPP (reassembled MAD). EZ Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Tuesday, November 15, 2005 5:31 PM > To: Hal Rosenstock > Cc: Eitan Zahavi; Roland Dreier; openib-general at openib.org > Subject: Re: [openib-general] SRP device management client (and a few > opensmglitches) > > Hal> Are those the only component fields which were zeroed out ? > > No, for example the capability mask is all 0 as well. > > It seems to be something a little bit more complicated. 
> In my test fabric, which has a single 24-port switch with hosts connected to
> both port 1 and port 2 (of the switch), I see no extra entries for my
> port 1 query. For my port 2 query, I get exactly 24 entries with a
> base_lid of zero and a local port number of 2. The PortNum in the SA
> header goes from 1 through 24.
>
> - R.

From rolandd at cisco.com Tue Nov 15 07:48:34 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Tue, 15 Nov 2005 07:48:34 -0800
Subject: [openib-general] SRP device management client (and a few opensmglitches)
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E361898A@mtlexch01.mtl.com> (Eitan Zahavi's message of "Tue, 15 Nov 2005 17:46:52 +0200")
References: <6AB138A2AB8C8E4A98B9C0C3D52670E361898A@mtlexch01.mtl.com>
Message-ID: <524q6d9ajx.fsf@cisco.com>

    Eitan> You should not get more then one SA header. I assumed you
    Eitan> are doing GetTable of PortInfoRecord. If this is correct
    Eitan> you should only get one SA header in the resulting RMPP
    Eitan> (reassembled MAD).

Yes, I only have one SA header. I just meant the SA wrapper of the
PortInfo attribute. (There are two port numbers and two LIDs in the
PortInfoRecord)

 - R.

From rolandd at cisco.com Tue Nov 15 07:52:18 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Tue, 15 Nov 2005 07:52:18 -0800
Subject: [openib-general] SRP device management client (and a few opensm glitches)
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3618989@mtlexch01.mtl.com> (Eitan Zahavi's message of "Tue, 15 Nov 2005 17:41:27 +0200")
References: <6AB138A2AB8C8E4A98B9C0C3D52670E3618989@mtlexch01.mtl.com>
Message-ID: <52zmo57vt9.fsf@cisco.com>

    Eitan> Could you dump out the content of the PortRecords that you
    Eitan> get as response?

They look like valid records for switch ports, except the local port
number field doesn't match the port number field in the SA record
identifier wrapper.

 - R.
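The mismatch described above can be sketched as follows. This is a simplified, hypothetical layout for illustration only: the real SA PortInfoRecord carries many more fields, and the struct, field, and function names here are assumptions rather than the opensm definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical view of one SA PortInfoRecord. */
struct sketch_portinfo_record {
    uint16_t rid_lid;        /* LID in the SA record identifier (RID) wrapper */
    uint8_t  rid_port_num;   /* port number in the RID wrapper */
    uint16_t base_lid;       /* base LID inside the PortInfo attribute */
    uint8_t  local_port_num; /* local port number inside the PortInfo attribute */
};

/*
 * Fill records for switch ports 1..n as observed in the thread: the RID
 * carries each port's own number, while the attribute's local port number
 * is the same value (seen_local_port) in every record and base_lid is zero.
 */
static void fill_switch_records(struct sketch_portinfo_record *recs, size_t n,
                                uint16_t switch_lid, uint8_t seen_local_port)
{
    size_t i;

    for (i = 0; i < n; i++) {
        recs[i].rid_lid = switch_lid;
        recs[i].rid_port_num = (uint8_t)(i + 1);
        recs[i].base_lid = 0;
        recs[i].local_port_num = seen_local_port;
    }
}

/* Count records whose RID port number differs from the attribute's local port number. */
static size_t count_port_num_mismatches(const struct sketch_portinfo_record *recs, size_t n)
{
    size_t i, mismatches = 0;

    for (i = 0; i < n; i++)
        if (recs[i].rid_port_num != recs[i].local_port_num)
            mismatches++;
    return mismatches;
}

/* Count records whose PortInfo base LID is zero. */
static size_t count_zero_base_lid(const struct sketch_portinfo_record *recs, size_t n)
{
    size_t i, zeros = 0;

    for (i = 0; i < n; i++)
        if (recs[i].base_lid == 0)
            zeros++;
    return zeros;
}
```

With 24 such records and a local port number of 2 in each, only the record for switch port 2 has matching fields, which is consistent with a query on local port number 2 returning all 24.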
From rolandd at cisco.com Tue Nov 15 08:04:29 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 15 Nov 2005 08:04:29 -0800 Subject: [openib-general] SRP device management client (and a few opensm glitches) In-Reply-To: <52zmo57vt9.fsf@cisco.com> (Roland Dreier's message of "Tue, 15 Nov 2005 07:52:18 -0800") References: <6AB138A2AB8C8E4A98B9C0C3D52670E3618989@mtlexch01.mtl.com> <52zmo57vt9.fsf@cisco.com> Message-ID: <52veyt7v8y.fsf@cisco.com> I just noticed that the host port that the SM is running on is connected to switch port 2. What seems to be happening is that all of the switch's ports (except port 0) are seen as having local port number 2 in the actual PortInfo attribute information, even though the PortNum field in the SA record wrapper has the real port number of the switch port. I haven't been able to follow the opensm code to see where the table used in osm_port_get_phys_ptr() is filled in yet, so I don't know why this is happening. - R. From mshefty at ichips.intel.com Tue Nov 15 10:22:17 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 15 Nov 2005 10:22:17 -0800 Subject: [openib-general] RE: [dat-discussions] socket based connectionmodel for IB proposal - round 3 In-Reply-To: References: Message-ID: <437A2759.5050701@ichips.intel.com> Kanevsky, Arkady wrote: > Which entity is responsible to "use" the proposed protocol is an > interesting one. > I was assuming that this will be CM. After all the proposed protocol is > CM extension > protocol. > But it can be another entity module between CM and ULP. The use of a reserved bit in the CM message indicates that the CM itself needs to set this data. This requires communication between the CM and IPoIB - in essence making IPoIB part of the CM. Removal of the reserved bit is what permits another entity between the CM and the ULP to perform this task. > While it is possible to do wildcarding on the whole SID, I had not seen > it is used selectively on individual bits of a SID or a port. 
This is what is done/needed to support SDP. A simple mask is applied to the SID before comparing against a local listen. > While SDP does the conversion to IB SID from Ethernet port, this > proposal > shift the responsibility for port and IP address conversion from ULP > down. Ideally, SDP should use this same mechanism, but requires changes to SDP. > Protocol version. This mean that in the future if protocol version will > be > bumped up we will have to change the SID on which Consumer listens on > and > requests sent to. Not sure how to do that without changing ULP. Does not > look > like a good idea. As a note, I'm not saying that I prefer a more complex SID, just that there is trade-off to be made that could provide for more ULP private data. If the version changes, then the addressing has changed. The ULP may need to change anyway in order to know how to interpret the address. If they don't care about the protocol version or don't need to know how to interpret the address, they can still wildcard the version. > IP version. This can be incorporated into SID. But if HCA has multiple > IP addresses > assigned to it the listening point need to specify its IP address(es). > The current verbs and/or API will have to be changed to support it. > But if socket is passed to listen on it does have all the needed info. > Looks fine. I didn't follow this. > DAPL APIs (uDAPL and kDAPL) does not expose local IP address for listen > point. > An additional API can be added to support passing local socket to listen > on > instead of Connection Qualifier. Since it is addition no backwards > compatibility issues. > The current ULPs/Apps will still use "the default API address" and the > protocol assigned SID as connection qualifier. The issue with DAPL is that it assumes that addressing has been resolved to a specific device before communication between the client and server have even occurred. I.e. 
the server must be clairvoyant and know which device a connection request will be received on. Likewise, a client must assume which local device is needed to connect to a given remote address. - Sean From halr at voltaire.com Tue Nov 15 10:23:06 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 15 Nov 2005 20:23:06 +0200 Subject: [openib-general] SRP device management client (and a few opensmglitches) Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AAEE@taurus.voltaire.com> Are you referring to SCinet ? It is definitely running off HCA port 1. -- Hal ________________________________ From: openib-general-bounces at openib.org on behalf of Roland Dreier Sent: Tue 11/15/2005 11:04 AM To: Eitan Zahavi Cc: openib-general at openib.org Subject: Re: [openib-general] SRP device management client (and a few opensmglitches) I just noticed that the host port that the SM is running on is connected to switch port 2. What seems to be happening is that all of the switch's ports (except port 0) are seen as having local port number 2 in the actual PortInfo attribute information, even though the PortNum field in the SA record wrapper has the real port number of the switch port. I haven't been able to follow the opensm code to see where the table used in osm_port_get_phys_ptr() is filled in yet, so I don't know why this is happening. - R. 
_______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From caitlinb at broadcom.com Tue Nov 15 10:33:32 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 15 Nov 2005 10:33:32 -0800 Subject: [openib-general] RE: [dat-discussions] socket based connectionmodel for IB proposal - round 3 Message-ID: <54AD0F12E08D1541B826BE97C98F99F1041781@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > Kanevsky, Arkady wrote: >> Which entity is responsible to "use" the proposed protocol is an >> interesting one. I was assuming that this will be CM. After all the >> proposed protocol is CM extension protocol. But it can be another >> entity module between CM and ULP. > > The use of a reserved bit in the CM message indicates that > the CM itself needs to set this data. This requires > communication between the CM and IPoIB - in essence making > IPoIB part of the CM. Removal of the reserved bit is what > permits another entity between the CM and the ULP to perform > this task. > The use of a reserved bit allows the receiver to know that the IP address has some validity without having to do a translation of the IP address itself. As Fab pointed out, if the receiving software takes this step then there is no need for any additional CM bit. Because the receiver can rely on the translation being authentic. It's an extra round trip, but it does leave the local CM interface totally intact. > >> DAPL APIs (uDAPL and kDAPL) does not expose local IP address for >> listen point. An additional API can be added to support passing >> local socket to listen on instead of Connection Qualifier. Since it >> is addition no backwards compatibility issues. >> The current ULPs/Apps will still use "the default API address" and >> the protocol assigned SID as connection qualifier. 
> > The issue with DAPL is that it assumes that addressing has > been resolved to a specific device before communication > between the client and server have even occurred. I.e. the > server must be clairvoyant and know which device a connection > request will be received on. Likewise, a client must assume > which local device is needed to connect to a given remote address. > > - Sean That is actually inherent in any RDMA model. You have to pre-post receive buffers before you connect. Therefore you need to know which device to register memory to. You don't need to know the external physical port, but you do need to know the device that is responsible for memory registration. And this is not require clairvoyance. It requires integration with the host local routing tables. Something that would be easy if the GID were treated as an IPv6 address. But even with some form of translation it is easy, as long as the IP addresses are integrated with the local routing tables. Given a remote IP address (or a local one that you want to use) you know what egress port will be used (and which ones could be used), and you know that RDMA device(s) associated with those egress points. The last step is simple, but has been overlooked all too often. From eitan at mellanox.co.il Tue Nov 15 13:14:58 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 15 Nov 2005 23:14:58 +0200 Subject: [openib-general] SRP device management client (and a few opensmglitches) Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E361898D@mtlexch01.mtl.com> Hi Roland, Now I get it ! In the port info record there is a field named LocalPortNumber. This field is NOT the port number the data is about. It is the port number the packet of the query came from. (see table 145 p823 l-38). When OpenSM obtains the PortInfo associated with that particular port it did so through port 2 of the switch. So the LocalPortNum is set to 2. 
If you are interested with the real port number, it is located in the RID second field, of the PortInfoRecord returned by the SA GetTable query. Hope this helps. Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Tuesday, November 15, 2005 8:23 PM > To: Roland Dreier; Eitan Zahavi > Cc: openib-general at openib.org > Subject: RE: [openib-general] SRP device management client (and a few > opensmglitches) > > Are you referring to SCinet ? It is definitely running off HCA port 1. > > -- Hal > > ________________________________ > > From: openib-general-bounces at openib.org on behalf of Roland Dreier > Sent: Tue 11/15/2005 11:04 AM > To: Eitan Zahavi > Cc: openib-general at openib.org > Subject: Re: [openib-general] SRP device management client (and a few > opensmglitches) > > > > I just noticed that the host port that the SM is running on is > connected to switch port 2. > > What seems to be happening is that all of the switch's ports (except > port 0) are seen as having local port number 2 in the actual PortInfo > attribute information, even though the PortNum field in the SA record > wrapper has the real port number of the switch port. > > I haven't been able to follow the opensm code to see where the table > used in osm_port_get_phys_ptr() is filled in yet, so I don't know why > this is happening. > > - R. 
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From caitlinb at broadcom.com Tue Nov 15 14:16:13 2005
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Tue, 15 Nov 2005 14:16:13 -0800
Subject: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB
Message-ID: <54AD0F12E08D1541B826BE97C98F99F10417A9@NT-SJCA-0751.brcm.ad.broadcom.com>

In absence of any protocol level ack (and regardless of protocol level ack), it is the application which has to implement its own reliability. RDS becomes a passive channel passing packets back and forth including duplicate packets. The responsibility then shifts to the application to figure out what is missing, duplicates, etc.

This would seem at odds with earlier assertions that as long as there were another path to the endnode, RDS would transparently recover on behalf of the application. I thought Oracle stated for their application that send failure would be interpreted as endnode failure and cast out the peer - perhaps I misread their usage model. Other applications who might want to use RDS could be designed to deal with the associated faults but if one has to deal with recovery / resync at the application layer, then that is quite a bit of work to perform in every application and is again at odds with the purpose of RDS which is to move reliability to the interconnect to the extent possible and to RDS so that the UDP application does not need to take on this complex code and attempt to get it right.

[cait] I would agree that there isn't much point in defining a "reliable" datagram service unless it is more reliable than unreliable. To me that means that the transport should deal with all networking problems other than a *total* failure to re-establish contact with the remote end.
That makes it basically equivalent of a point-to-point Reliable Connection. The biggest difference, and justification, for having something like RDS is to eliminate point-to-point flow control and allow it to be replaced with ULP based flow control that is not point-to-point. The resources associated with tracking credits is where a lot of the overhead inherent in multiple point-to-point connections come from (that, and the synchronization of that data over the network). -------------- next part -------------- An HTML attachment was scrubbed... URL: From rolandd at cisco.com Tue Nov 15 15:18:40 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 15 Nov 2005 15:18:40 -0800 Subject: [openib-general] SRP device management client (and a few opensmglitches) In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E361898D@mtlexch01.mtl.com> (Eitan Zahavi's message of "Tue, 15 Nov 2005 23:14:58 +0200") References: <6AB138A2AB8C8E4A98B9C0C3D52670E361898D@mtlexch01.mtl.com> Message-ID: <52hdad7b5b.fsf@cisco.com> Eitan> Hi Roland, Now I get it ! In the port info record there is Eitan> a field named LocalPortNumber. This field is NOT the port Eitan> number the data is about. It is the port number the packet Eitan> of the query came from. (see table 145 p823 l-38). When Eitan> OpenSM obtains the PortInfo associated with that particular Eitan> port it did so through port 2 of the switch. So the Eitan> LocalPortNum is set to 2. Eitan> If you are interested with the real port number, it is Eitan> located in the RID second field, of the PortInfoRecord Eitan> returned by the SA GetTable query. OK, I guess that makes sense (although how to interpret the LocalPortNumber of the attribute when it is contained in the SA record seems slightly unclear). 
It seems that if I want to get a list of (say) all the HCA port 2s in the network, I have to do a get table query of the SA PortInfoRecord with component mask set so that I get ports with LocalPortNumber 2, and then filter out switch ports (since the port number field in the RID is not defined for HCA ports). Do you think this is really what was intended? - R. From dledford at redhat.com Tue Nov 15 17:04:18 2005 From: dledford at redhat.com (Doug Ledford) Date: Tue, 15 Nov 2005 20:04:18 -0500 Subject: [openib-general] Announce: preview RPMs for FC-4 and RHEL-4 available Message-ID: <437A8592.1000408@redhat.com> I have initial RPM support for both of these releases available for use/testing. For Fedora Core 4, I didn't compile a new kernel since the current FC4 kernel is 2.6.14 based and includes the upstream Infiniband support. For RHEL4 I obviously compiled a new kernel, but it used the code pulled from OpenIB's svn trunk as opposed to upstream. As a result, there is actually more functionality in the RHEL4 kernel than in the FC4 kernel. The FC4 kernel includes the Mellanox host adapter driver, what parts of the core stack that have been submitted upstream, and ipoib. The RHEL4 kernel includes similar core features, the Mellanox host driver, ipoib, sdp, and srp support. I did not include kDAPL nor iSER because of the apparent rejection of kDAPL by upstream and the current dependency of iSER on kDAPL. All of the user land tools were also built from the same svn trunk pull as the kernel support. So far, I've put libmthca, libibverbs, and a package I termed opensm but really includes the entire management directory out of the user space portion of the tree. I anticipate adding libsdp, udapl, and the user space components to go along with srp (persistent bindings at boot up support) over the next couple weeks. I've had requests for the mvapich-gen2 support, but I'm not sure if that will make it. All of this was done using the 3965 version of the svn trunk. 
For the most part, I don't plan to rebase again prior to release, so from here on out it's likely to be bug fixes only that go in. In addition to the actual IB rpms, there have been several updated base RPMs, such as module-init-tools to pick up the right device aliases and such as part of the modprobe.conf.dist and udev to get the device naming rules correct. There will likely be a few more base package updates before things are finished (for instance, system-config-network still doesn't quite do the right thing with ipoib interfaces, nor does the ifup-eth script work even with statically configured IP addresses due to the default usage of arping to check to see if the IP address is already in use segfaulting). For the kernel, libmthca, and libibverbs, support is limited to x86, x86_64, and ia64. For the opensm package we support all arches. Prior to things being released, we will obviously get the kernel and others working on all arches or note that support for those arches will be coming later (all ppc based arches: s390, s390x, ppc, ppc64, ppc64iseries, are the excluded ones at the moment). Anyway, they're available on my web page at: http://people.redhat.com/dledford/Infiniband If you try these out and have any problems, please email me directly (and feel free to Cc: the list) for more immediate responses. -- Doug Ledford http://people.redhat.com/dledford From pradeep at us.ibm.com Tue Nov 15 17:06:50 2005 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Tue, 15 Nov 2005 17:06:50 -0800 Subject: [openib-general] compile error -libibverbs (64-bit) Message-ID: I am trying to compile some of the userspace utilities as 64-bit apps on a ppc64 machine with a sles9sp2 distribution. I am getting the following compile errors (appended below). Yes, I am using some older bits, but I do not think that is the issue here. 
I have exported LDFLAGS=-L /lib64 -m64 and also LD_LIBRARY_PATH=/lib64

I have modified Makefile.am to add -m64 to both AM_CFLAGS and src_libibverbs_la_CFLAGS too.

ldconfig -p | grep sysfs gives me the following:

libsysfs.so.1 (libc6,64bit) => /lib64/libsysfs.so.1
libsysfs.so.1 (libc6) => /lib/libsysfs.so.1
libsysfs.so (libc6) => /usr/lib/libsysfs.so

which appears to be right. Yet, libtool seems to be picking up the wrong libsysfs.so (hard coded path from somewhere). What am I missing?

Pradeep
pradeep at us.ibm.com

-------------------------------------------------------------------------------------------------------------------------
make all-am
make[1]: Entering directory `/home/pradeep/trunk-3675/src/userspace/libibverbs'
/bin/sh ./libtool --mode=link gcc -g -Wall -D_GNU_SOURCE -m64 -g -O2 -L /lib64 -m64 -o src/libibverbs.la -rpath /usr/local/lib -version-info 1 -export-dynamic -Wl,--version-script=./src/libibverbs.map src_libibverbs_la-cmd.lo src_libibverbs_la-device.lo src_libibverbs_la-init.lo src_libibverbs_la-memory.lo src_libibverbs_la-verbs.lo -lsysfs -lpthread -ldl
gcc -shared .libs/src_libibverbs_la-cmd.o .libs/src_libibverbs_la-device.o .libs/src_libibverbs_la-init.o .libs/src_libibverbs_la-memory.o .libs/src_libibverbs_la-verbs.o -L/home/pradeep/trunk-3675/src/userspace/libibverbs /usr/lib/libsysfs.so <------error here -lpthread -ldl -m64 -m64 -Wl,--version-script=./src/libibverbs.map -Wl,-soname -Wl,libibverbs.so.1 -o src/.libs/libibverbs.so.1.0.0
/usr/lib/libsysfs.so: could not read symbols: Invalid operation
collect2: ld returned 1 exit status
make[1]: *** [src/libibverbs.la] Error 1
make[1]: Leaving directory `/home/pradeep/trunk-3675/src/userspace/libibverbs'
make: *** [all] Error 2
~
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From hozer at hozed.org Tue Nov 15 18:58:40 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Tue, 15 Nov 2005 20:58:40 -0600 Subject: [openib-general] another opensm crash In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3618976@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3618976@mtlexch01.mtl.com> Message-ID: <20051116025840.GW3275@kalmia.hozed.org> On Mon, Nov 14, 2005 at 09:54:28PM +0200, Eitan Zahavi wrote: > Hi Troy > > Try to move aside your /lib/tls directory and see if you still get these > crashes. > We have issues with TLS pthread and glibc We still have issues with -maxsmps=8. And no, running with maxsmps=1 is not an option on this network. Is there some testcase that can trigger this? How are we going to get the thread locking stuff fixed? From pradeep at us.ibm.com Tue Nov 15 21:46:11 2005 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Tue, 15 Nov 2005 21:46:11 -0800 Subject: [openib-general] Data structure size mismatch In-Reply-To: <52hdaf9flq.fsf@cisco.com> Message-ID: Roland Dreier wrote on 11/14/2005 11:47:13 AM: > Pradeep> I am trying to use copy_from_user()/copy_to_user of data > Pradeep> structures that contains pointers. > > If you are defining a new interface, then the simplest thing is not to > do that: always put pointers in a 64-bit field. > > If it is an existing interface that can't be changed, then there > should be an existing compat wrapper for the system call. > This is for address translation. Address translations use "write()" to a character device. write() does not have a compat wrapper and so there are two possible approaches: 1) convert the write() to ioctl() and use the compat wrapper approach. 2) use the test_thread_flag(TIF_32BIT) or an alternative to ascertain if it is a 32-bit app on a 64 bit kernel and do the compat_ptr(), else use the regular copy_*_user(). I am leaning towards the first option, since that is the correct way to approach this and will work across all platforms. 
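To illustrate the first suggestion quoted above (always carry pointers in a 64-bit field), here is a minimal user-space sketch. The structure and helper names (at_req, ptr_to_u64, u64_to_ptr, at_req_fill) are hypothetical and are not the actual address translation interface. The point is only that the layout is identical for 32-bit and 64-bit callers, so neither a compat wrapper nor a TIF_32BIT test would be needed for such a structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical request layout (NOT the real address translation ABI).
 * The user buffer address travels as a fixed-width 64-bit integer and
 * the padding is explicit, so sizeof() and all field offsets are the
 * same whether the caller is built with -m32 or -m64.
 */
struct at_req {
	uint64_t buf;      /* user-space buffer address, widened to 64 bits */
	uint32_t len;      /* length of the buffer in bytes */
	uint32_t reserved; /* explicit padding, keeps the size a multiple of 8 */
};

/* Widen a pointer into the 64-bit field. */
static inline uint64_t ptr_to_u64(const void *p)
{
	return (uint64_t)(uintptr_t)p;
}

/* Recover a pointer from the 64-bit field (same address space only). */
static inline void *u64_to_ptr(uint64_t v)
{
	return (void *)(uintptr_t)v;
}

/* Fill a request that describes the buffer buf of len bytes. */
static inline void at_req_fill(struct at_req *req, void *buf, uint32_t len)
{
	assert(req != NULL);
	req->buf = ptr_to_u64(buf);
	req->len = len;
	req->reserved = 0;
}
```

The explicit reserved field matters: without it, some 32-bit ABIs (i386, for example) align a 64-bit integer to 4 bytes inside a structure, so the structure size would differ between 32-bit and 64-bit builds even though no pointer type appears in it.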
Any other suggestions?

Pradeep
pradeep at us.ibm.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From eitan at mellanox.co.il Wed Nov 16 02:45:04 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 16 Nov 2005 12:45:04 +0200
Subject: [openib-general] SRP device management client (and a few opensmglitches)
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3618990@mtlexch01.mtl.com>

>
> It seems that if I want to get a list of (say) all the HCA port 2s in
> the network, I have to do a get table query of the SA PortInfoRecord
> with component mask set so that I get ports with LocalPortNumber 2,
> and then filter out switch ports (since the port number field in the
> RID is not defined for HCA ports). Do you think this is really what
> was intended?

[EZ] Hi Roland,
What you have to do to get the HCA ports PortInfo is:
1. Get all NodeInfoRecord for node_type == HCA: comp_mask = 0x5
2. For each such node send a PortInfoRecord query by the port number: comp_mask = 0x33
   PortInfoRecord *p_pir;
   p_pir->lid = lid_no;
   p_pir->port_num = port_num;
   Do GetTable(p_pir)

From vjhyze at tpnet.pl Wed Nov 16 04:03:10 2005
From: vjhyze at tpnet.pl (Elmo Huerta)
Date: Wed, 16 Nov 2005 13:03:10 +0100
Subject: [openib-general] Your request.
Message-ID: <37c501c5eaae$6f3c9ea0$53491353@dysp>

We are happy to present you with six deals from four different brokers.
-------------- next part -------------- An HTML attachment was scrubbed... URL: From shubbell at dbresearch.net Wed Nov 16 06:14:10 2005 From: shubbell at dbresearch.net (Sean Hubbell) Date: Wed, 16 Nov 2005 08:14:10 -0600 Subject: [openib-general] IPoIB Message-ID: <437B3EB2.7000000@dbresearch.net> Hello, I ran across something that continues to puzzle me. We upgraded to the latest infiniband source code tree as of yesterday and I tried to run my program that has been working for months using the new infiniband modules. Here is what I am seeing: 1) I can ping and ibping the head node from at least two of my client nodes. 2) I can ping the head node from the head node. 3) I can see that messages are being sent from my sender application (x86_64) via multicast by using ifconfig and viewing the number of Tx packets sent. 4) I do NOT receive the messages in my second receiver application (x86_64) Now, normally I would say that the code that I am using does not work, so I took that code and recompiled it under x32 and ran. Everything works great. So, here I am, very puzzled. Does anyone have some advice? Thanks, Sean From halr at voltaire.com Wed Nov 16 06:18:40 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 16 Nov 2005 16:18:40 +0200 Subject: [openib-general] IPoIB Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AB0A@taurus.voltaire.com> Hi Sean, Sounds like you have partial multicast connectivity. What SM are you using ? What switch(es) ? Scan the network for ports which come up at 1x. That could cause this. Try recycling your switches and waiting a minute or 2 and seeing if this is better. -- Hal ________________________________ From: openib-general-bounces at openib.org on behalf of Sean Hubbell Sent: Wed 11/16/2005 9:14 AM To: openib-general at openib.org Subject: [openib-general] IPoIB Hello, I ran across something that continues to puzzle me. 
We upgraded to the latest infiniband source code tree as of yesterday and I tried to run my program that has been working for months using the new infiniband modules. Here is what I am seeing: 1) I can ping and ibping the head node from at least two of my client nodes. 2) I can ping the head node from the head node. 3) I can see that messages are being sent from my sender application (x86_64) via multicast by using ifconfig and viewing the number of Tx packets sent. 4) I do NOT receive the messages in my second receiver application (x86_64) Now, normally I would say that the code that I am using does not work, so I took that code and recompiled it under x32 and ran. Everything works great. So, here I am, very puzzled. Does anyone have some advice? Thanks, Sean _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From yipeeyipeeyipeeyipee at yahoo.com Wed Nov 16 06:19:50 2005 From: yipeeyipeeyipeeyipee at yahoo.com (yipee) Date: Wed, 16 Nov 2005 14:19:50 +0000 (UTC) Subject: [openib-general] canceling a post_recv wr Message-ID: Hi, How can I cancel or destroy a WR (work request) that was previously posted to a qp with ibv_post_recv()? Can I do this without destroying the qp? Does the Mellanox hardware support this? Thanks, y From mst at mellanox.co.il Wed Nov 16 06:33:15 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 16 Nov 2005 16:33:15 +0200 Subject: [openib-general] Re: canceling a post_recv wr In-Reply-To: References: Message-ID: <20051116143315.GT20871@mellanox.co.il> Quoting yipee : > Subject: canceling a post_recv wr > > Hi, > > How can I cancel or destroy a WR (work request) that was previously > posted to a > qp with ibv_post_recv()? Can I do this without destroying the qp? > Does the Mellanox hardware support this? 
> > > Thanks, > y AFAIK you have to close the qp to do this. -- MST From shubbell at dbresearch.net Wed Nov 16 07:15:20 2005 From: shubbell at dbresearch.net (Sean Hubbell) Date: Wed, 16 Nov 2005 09:15:20 -0600 Subject: [openib-general] IPoIB In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F589AB0A@taurus.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F589AB0A@taurus.voltaire.com> Message-ID: <437B4D08.5060101@dbresearch.net> Hal Rosenstock wrote: >Hi Sean, > >Sounds like you have partial multicast connectivity. What SM are you using ? What switch(es) ? > >Scan the network for ports which come up at 1x. That could cause this. > >Try recycling your switches and waiting a minute or 2 and seeing if this is better. > >-- Hal > >________________________________ > >From: openib-general-bounces at openib.org on behalf of Sean Hubbell >Sent: Wed 11/16/2005 9:14 AM >To: openib-general at openib.org >Subject: [openib-general] IPoIB > > > >Hello, > > I ran across something that continues to puzzle me. We upgraded to the >latest infiniband source code tree as of yesterday and I tried to run my >program that has been working for months using the new infiniband >modules. Here is what I am seeing: > >1) I can ping and ibping the head node from at least two of my client nodes. >2) I can ping the head node from the head node. >3) I can see that messages are being sent from my sender application >(x86_64) via multicast by using ifconfig and viewing the number of Tx >packets sent. >4) I do NOT receive the messages in my second receiver application (x86_64) > >Now, normally I would say that the code that I am using does not work, >so I took that code and recompiled it under x32 and ran. Everything >works great. So, here I am, very puzzled. Does anyone have some advice? 
> >Thanks, > >Sean > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > Thanks Hal, I'll give that a try. Sean From moschny at ipd.uni-karlsruhe.de Wed Nov 16 08:58:38 2005 From: moschny at ipd.uni-karlsruhe.de (Thomas Moschny) Date: Wed, 16 Nov 2005 17:58:38 +0100 Subject: [openib-general] Announce: preview RPMs for FC-4 and RHEL-4 available In-Reply-To: <437A8592.1000408@redhat.com> References: <437A8592.1000408@redhat.com> Message-ID: <200511161758.46379.moschny@ipd.uni-karlsruhe.de> On Wednesday 16 November 2005 02:04, Doug Ledford wrote: > I have initial RPM support for both of these releases available for > use/testing. Thanks for providing the rpms! > If you try these out and have any problems, please email me directly > (and feel free to Cc: the list) for more immediate responses. Unfortunately, we got an kernel-oops on ia64 (rhel4) ... The boot log is attached. - Thomas -------------- next part -------------- A non-text attachment was scrubbed... Name: boot.log.bz2 Type: application/x-bzip2 Size: 4520 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From Arkady.Kanevsky at netapp.com Wed Nov 16 08:59:27 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Wed, 16 Nov 2005 11:59:27 -0500 Subject: [openib-general] socket based connectionmodel for IB proposal - round 4 Message-ID: This version incorporate the feedback on 3 reflectors and yesterday's SWG meeting. Major changes from previous version are: no REQ bit to identify private data formaing - SID range used instead port mapping uses IBTA space and IETF protocol # is encoded in SID protocol version is 4 bits. 
Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 275 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: IP Address Support by InfiniBand CM_v4.ppt Type: application/vnd.ms-powerpoint Size: 49152 bytes Desc: IP Address Support by InfiniBand CM_v4.ppt URL: From caitlinb at broadcom.com Wed Nov 16 09:37:14 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 16 Nov 2005 09:37:14 -0800 Subject: [openib-general] socket based connectionmodel for IB proposal - round 4 Message-ID: <54AD0F12E08D1541B826BE97C98F99F10417FE@NT-SJCA-0751.brcm.ad.broadcom.com> The SID range indicates the format, but it does not vouch for the data. It should be noted that the middleware handling this new format SHOULD/MUST validate the remote IP address using a privileged method (such as ARP) before passing the remote IP address to the ULP. ________________________________ From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Kanevsky, Arkady Sent: Wednesday, November 16, 2005 8:59 AM To: swg at infinibandta.org; openib-general at openib.org; dat-discussions at yahoogroups.com Subject: [openib-general] socket based connectionmodel for IB proposal - round 4 This version incorporate the feedback on 3 reflectors and yesterday's SWG meeting. Major changes from previous version are: no REQ bit to identify private data formaing - SID range used instead port mapping uses IBTA space and IETF protocol # is encoded in SID protocol version is 4 bits. Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 275 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mst at mellanox.co.il Wed Nov 16 10:06:29 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 16 Nov 2005 20:06:29 +0200 Subject: [openib-general] [PATCH 1 of 2] Re: ipoib oops In-Reply-To: <20051107154919.GU31134@mellanox.co.il> References: <20051107154919.GU31134@mellanox.co.il> Message-ID: <20051116180629.GA25683@mellanox.co.il> Quoting r. Michael S. Tsirkin : > Subject: ipoib oops > > Hi! > I saw this in /var/log/messages recently. > Unfortunately I cant say exactly what I did to trigger this problem. The following is for review only: this triggers another problem in ipoib which I'm working on now. Comments? --- Make sure all users of priv->mcast_list and priv_broadcast are protected by priv->lock. I had to add another list_head to the mcast structure to avoid using mcast_list outside the lock in stop_thread. Signed-off-by: Michael S. Tsirkin Index: linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- linux-2.6.14.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2005-11-15 15:53:09.000000000 +0200 +++ linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2005-11-15 17:31:35.000000000 +0200 @@ -62,6 +62,7 @@ struct ipoib_mcast { struct rb_node rb_node; struct list_head list; + struct list_head cancel_list; struct completion done; int query_id; @@ -587,10 +588,12 @@ int ipoib_mcast_start_thread(struct net_ return 0; } -int ipoib_mcast_stop_thread(struct net_device *dev, int flush) +int ipoib_mcast_stop_thread(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ipoib_mcast *mcast; + struct ipoib_mcast *mcast, *tmcast; + unsigned long flags; + LIST_HEAD(cancel_list); ipoib_dbg_mcast(priv, "stopping multicast thread\n"); @@ -599,17 +602,20 @@ int ipoib_mcast_stop_thread(struct net_d cancel_delayed_work(&priv->mcast_task); up(&mcast_mutex); - if (flush) - flush_workqueue(ipoib_workqueue); + flush_workqueue(ipoib_workqueue); - 
if (priv->broadcast && priv->broadcast->query) { - ib_sa_cancel_query(priv->broadcast->query_id, priv->broadcast->query); - priv->broadcast->query = NULL; - ipoib_dbg_mcast(priv, "waiting for bcast\n"); - wait_for_completion(&priv->broadcast->done); - } + spin_lock_irqsave(&priv->lock, flags); + + if (priv->broadcast && priv->broadcast->query) + list_add_tail(&priv->broadcast->cancel_list, &cancel_list); + + list_for_each_entry(mcast, &priv->multicast_list, list) + if (mcast->query) + list_add_tail(&mcast->cancel_list, &cancel_list); - list_for_each_entry(mcast, &priv->multicast_list, list) { + spin_unlock_irqrestore(&priv->lock, flags); + + list_for_each_entry_safe(mcast, tmcast, &cancel_list, cancel_list) { if (mcast->query) { ib_sa_cancel_query(mcast->query_id, mcast->query); mcast->query = NULL; @@ -617,6 +623,7 @@ int ipoib_mcast_stop_thread(struct net_d IPOIB_GID_ARG(mcast->mcmember.mgid)); wait_for_completion(&mcast->done); } + list_del_init(&mcast->cancel_list); } return 0; @@ -741,12 +748,14 @@ void ipoib_mcast_dev_flush(struct net_de { struct ipoib_dev_priv *priv = netdev_priv(dev); LIST_HEAD(remove_list); + LIST_HEAD(cancel_list); struct ipoib_mcast *mcast, *tmcast, *nmcast; unsigned long flags; ipoib_dbg_mcast(priv, "flushing multicast list\n"); spin_lock_irqsave(&priv->lock, flags); + list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list) { nmcast = ipoib_mcast_alloc(dev, 0); if (nmcast) { @@ -780,14 +789,24 @@ void ipoib_mcast_dev_flush(struct net_de &priv->multicast_tree); list_add_tail(&priv->broadcast->list, &remove_list); + priv->broadcast = nmcast; + } else { + ipoib_warn(priv, "could not reallocate multicast group " + IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(priv->broadcast->mcmember.mgid)); } - - priv->broadcast = nmcast; } spin_unlock_irqrestore(&priv->lock, flags); list_for_each_entry_safe(mcast, tmcast, &remove_list, list) { + if (mcast->query) { + ib_sa_cancel_query(mcast->query_id, mcast->query); + mcast->query = NULL; + 
ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + wait_for_completion(&mcast->done); + } ipoib_mcast_leave(dev, mcast); ipoib_mcast_free(mcast); } @@ -821,8 +840,12 @@ void ipoib_mcast_restart_task(void *dev_ unsigned long flags; ipoib_dbg_mcast(priv, "restarting multicast task\n"); + ipoib_dbg_mcast(priv, "stopping multicast thread\n"); - ipoib_mcast_stop_thread(dev, 0); + down(&mcast_mutex); + clear_bit(IPOIB_MCAST_RUN, &priv->flags); + cancel_delayed_work(&priv->mcast_task); + up(&mcast_mutex); spin_lock_irqsave(&priv->lock, flags); @@ -900,6 +923,13 @@ void ipoib_mcast_restart_task(void *dev_ /* We have to cancel outside of the spinlock */ list_for_each_entry_safe(mcast, tmcast, &remove_list, list) { + if (mcast->query) { + ib_sa_cancel_query(mcast->query_id, mcast->query); + mcast->query = NULL; + ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + wait_for_completion(&mcast->done); + } ipoib_mcast_leave(mcast->dev, mcast); ipoib_mcast_free(mcast); } Index: linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- linux-2.6.14.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2005-11-15 15:53:09.000000000 +0200 +++ linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2005-11-15 17:20:47.000000000 +0200 @@ -452,7 +452,7 @@ int ipoib_ib_dev_down(struct net_device flush_workqueue(ipoib_workqueue); } - ipoib_mcast_stop_thread(dev, 1); + ipoib_mcast_stop_thread(dev); /* * Flush the multicast groups first so we stop any multicast joins. 
The @@ -619,7 +619,7 @@ void ipoib_ib_dev_cleanup(struct net_dev ipoib_dbg(priv, "cleaning up ib_dev\n"); - ipoib_mcast_stop_thread(dev, 1); + ipoib_mcast_stop_thread(dev); /* Delete the broadcast address and the local address */ ipoib_mcast_dev_down(dev); Index: linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- linux-2.6.14.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2005-11-14 17:22:18.000000000 +0200 +++ linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib.h 2005-11-15 17:02:04.000000000 +0200 @@ -264,7 +264,7 @@ void ipoib_mcast_send(struct net_device void ipoib_mcast_restart_task(void *dev_ptr); int ipoib_mcast_start_thread(struct net_device *dev); -int ipoib_mcast_stop_thread(struct net_device *dev, int flush); +int ipoib_mcast_stop_thread(struct net_device *dev); void ipoib_mcast_dev_down(struct net_device *dev); void ipoib_mcast_dev_flush(struct net_device *dev); -- MST From Arkady.Kanevsky at netapp.com Wed Nov 16 10:29:15 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Wed, 16 Nov 2005 13:29:15 -0500 Subject: [openib-general] socket based connectionmodel for IB proposal - round 4 Message-ID: Correct. But this is already a requirement for OS. Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 275 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 ________________________________ From: Caitlin Bestler [mailto:caitlinb at broadcom.com] Sent: Wednesday, November 16, 2005 12:37 PM To: Kanevsky, Arkady; swg at infinibandta.org; openib-general at openib.org; dat-discussions at yahoogroups.com Subject: RE: [openib-general] socket based connectionmodel for IB proposal - round 4 The SID range indicates the format, but it does not vouch for the data. 
It should be noted that the middleware handling this new format SHOULD/MUST validate the remote IP address using a privileged method (such as ARP) before passing the remote IP address to the ULP. ________________________________ From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Kanevsky, Arkady Sent: Wednesday, November 16, 2005 8:59 AM To: swg at infinibandta.org; openib-general at openib.org; dat-discussions at yahoogroups.com Subject: [openib-general] socket based connectionmodel for IB proposal - round 4 This version incorporate the feedback on 3 reflectors and yesterday's SWG meeting. Major changes from previous version are: no REQ bit to identify private data formaing - SID range used instead port mapping uses IBTA space and IETF protocol # is encoded in SID protocol version is 4 bits. Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 275 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Arkady.Kanevsky at netapp.com Wed Nov 16 10:36:26 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Wed, 16 Nov 2005 13:36:26 -0500 Subject: [openib-general] socket based connectionmodel for IB proposal -round 4 Message-ID: pdf version of the proposal. Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 275 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 ________________________________ From: Kanevsky, Arkady Sent: Wednesday, November 16, 2005 11:59 AM To: swg at infinibandta.org; openib-general at openib.org; dat-discussions at yahoogroups.com Subject: [openib-general] socket based connectionmodel for IB proposal -round 4 This version incorporate the feedback on 3 reflectors and yesterday's SWG meeting. 
Major changes from previous version are: no REQ bit to identify private data formaing - SID range used instead port mapping uses IBTA space and IETF protocol # is encoded in SID protocol version is 4 bits. Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 275 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: IP Address Support by InfiniBand CM_v4.pdf Type: application/octet-stream Size: 24564 bytes Desc: IP Address Support by InfiniBand CM_v4.pdf URL: From dledford at redhat.com Wed Nov 16 11:28:29 2005 From: dledford at redhat.com (Doug Ledford) Date: Wed, 16 Nov 2005 14:28:29 -0500 Subject: [openib-general] Announce: preview RPMs for FC-4 and RHEL-4 available In-Reply-To: <200511161758.46379.moschny@ipd.uni-karlsruhe.de> References: <437A8592.1000408@redhat.com> <200511161758.46379.moschny@ipd.uni-karlsruhe.de> Message-ID: <437B885D.7090402@redhat.com> Thomas Moschny wrote: > On Wednesday 16 November 2005 02:04, Doug Ledford wrote: > >>I have initial RPM support for both of these releases available for >>use/testing. > > > Thanks for providing the rpms! > > >>If you try these out and have any problems, please email me directly >>(and feel free to Cc: the list) for more immediate responses. > > > Unfortunately, we got an kernel-oops on ia64 (rhel4) ... > The boot log is attached. > > - Thomas OK, so it oopsed in ib_umad_init. I'll look into it. It works fine on i686 and x86_64 (in my testing anyway). Right now, the modprobe.conf.dist file uses a big hammer to put all the core modules into the kernel whenever we load anything. You might be able to get further in testing by either excluding the umad module until after mthca is loaded or something similar to that. 
If you try any of that, let me know the results (I currently don't have ia64 hardware to test on, working on rectifying that). -- Doug Ledford http://people.redhat.com/dledford From sean.hefty at intel.com Wed Nov 16 11:54:54 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 16 Nov 2005 11:54:54 -0800 Subject: [openib-general] rdma_listen() backlog parameter Message-ID: I've been trying to implement support for a backlog parameter in the CMA's rdma_listen() call. My original goal was to push the backlog parameter down into the kernel CMA with the following definition: Backlog - The number of unprocessed connection requests that a listener can have. A connection request is considered processed once it has either been accepted or rejected. To keep things simple, is there any reason for the kernel CMA to have a backlog parameter? Would iWarp require this? The kernel CMA is callback driven, which makes it trivial for clients to manage their own backlog. I'm considering maintaining this within the kernel uCMA only, unless it would need to be pushed down for iWarp. - Sean From mshefty at ichips.intel.com Wed Nov 16 12:49:00 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 16 Nov 2005 12:49:00 -0800 Subject: [openib-general] Announce: preview RPMs for FC-4 and RHEL-4 available In-Reply-To: <437A8592.1000408@redhat.com> References: <437A8592.1000408@redhat.com> Message-ID: <437B9B3C.1040202@ichips.intel.com> Doug Ledford wrote: > All of the user land tools were also built from the same svn trunk pull > as the kernel support. So far, I've put libmthca, libibverbs, and a > package I termed opensm but really includes the entire management > directory out of the user space portion of the tree. I anticipate > adding libsdp, udapl, and the user space components to go along with srp > (persistent bindings at boot up support) over the next couple weeks. > I've had requests for the mvapich-gen2 support, but I'm not sure if that > will make it. 
Do you know which connection layer uDAPL will use? - Sean From dledford at redhat.com Wed Nov 16 13:37:01 2005 From: dledford at redhat.com (Doug Ledford) Date: Wed, 16 Nov 2005 16:37:01 -0500 Subject: [openib-general] Announce: preview RPMs for FC-4 and RHEL-4 available In-Reply-To: <437B9B3C.1040202@ichips.intel.com> References: <437A8592.1000408@redhat.com> <437B9B3C.1040202@ichips.intel.com> Message-ID: <437BA67D.80400@redhat.com> Sean Hefty wrote: > Doug Ledford wrote: > >> All of the user land tools were also built from the same svn trunk >> pull as the kernel support. So far, I've put libmthca, libibverbs, >> and a package I termed opensm but really includes the entire >> management directory out of the user space portion of the tree. I >> anticipate adding libsdp, udapl, and the user space components to go >> along with srp (persistent bindings at boot up support) over the next >> couple weeks. I've had requests for the mvapich-gen2 support, but I'm >> not sure if that will make it. > > > Do you know which connection layer uDAPL will use? > > - Sean The suggestion from Bob Woodruff was to use the verbs-cm support since kDAPL isn't included in the kernel. -- Doug Ledford http://people.redhat.com/dledford From robert.j.woodruff at intel.com Wed Nov 16 13:37:46 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Wed, 16 Nov 2005 13:37:46 -0800 Subject: [openib-general] Announce: preview RPMs for FC-4 andRHEL-4 available In-Reply-To: <437BA67D.80400@redhat.com> Message-ID: Doug Ledford wrote, >The suggestion from Bob Woodruff was to use the verbs-cm support since >kDAPL isn't included in the kernel. Yes, for now I would suggest to use the socket CM version of uDAPL. Arlin is starting to debug the CMA version, but for now the socket CM version of uDAPL seems stable. We ran it on a 128 node cluster in Dupont for SC'05 with no problems. 
For example, to build and install the socket CM version of uDAPL on a Red Hat EL 4.0 kernel, from the userspace directory,

    cd dapl/dapl/udapl
    make VERBS=openib_scm
    cp -f ./Target/libdapl.a /usr/local/lib/libdapl-openib.a
    cp -f ./Target/libdapl.so /usr/local/lib/libdapl-openib.so
    cd ../../dat/udat
    make OS_VENDOR=REDHAT_EL4
    cp -f ./Target/x86_64/libdat.a /usr/lib64
    cp -f ./Target/x86_64/libdat.so /usr/lib64

Not sure why we needed the REDHAT_EL4 option on building libdat.so, but Arlin Davis can provide details if needed.

woody

From dledford at redhat.com Wed Nov 16 14:12:16 2005
From: dledford at redhat.com (Doug Ledford)
Date: Wed, 16 Nov 2005 17:12:16 -0500
Subject: [openib-general] Announce: preview RPMs for FC-4 andRHEL-4 available
In-Reply-To: 
References: 
Message-ID: <437BAEC0.9090909@redhat.com>

Bob Woodruff wrote:
> Doug Ledford wrote,
>
>>The suggestion from Bob Woodruff was to use the verbs-cm support since
>>kDAPL isn't included in the kernel.
>
>
> Yes, for now I would suggest to use the socket CM version of uDAPL.
> Arlin is starting to debug the CMA version, but for now the socket
> CM version of uDAPL seems stable. We ran it on a 128 node cluster in
> Dupont for SC'05 with no problems.
> For example, to build and install the socket CM
> version of uDAPL on a Red Hat EL 4.0 kernel,
> from the userspace directory,
>
> cd dapl/dapl/udapl
> make VERBS=openib_scm
> cp -f ./Target/libdapl.a /usr/local/lib/libdapl-openib.a
> cp -f ./Target/libdapl.so /usr/local/lib/libdapl-openib.so
> cd ../../dat/udat
> make OS_VENDOR=REDHAT_EL4
> cp -f ./Target/x86_64/libdat.a /usr/lib64
> cp -f ./Target/x86_64/libdat.so /usr/lib64
>
> Not sure why we needed the REDHAT_EL4 option on building libdat.so,
> but Arlin Davis can provide details if needed.

I'll keep that in mind, but to be honest, I've had my best luck so far allowing rpmbuild to do the right thing (using things like the %configure macro instead of trying to hand configure the installation, etc).
In addition, my kernel's include support that the REDHAT_EL4 flag may be assuming is missing. My goal is to get everything compiling as though this were an upstream kernel and up to date user space components, so I'm hoping the REDHAT_EL4 flag won't be necessary by the time I'm done. But, user space deadlines are after kernel space deadlines, so I'm off working on kernel things right now. -- Doug Ledford http://people.redhat.com/dledford From robert.j.woodruff at intel.com Wed Nov 16 13:56:22 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Wed, 16 Nov 2005 13:56:22 -0800 Subject: [openib-general] Announce: preview RPMs for FC-4 andRHEL-4 available In-Reply-To: <437BAEC0.9090909@redhat.com> Message-ID: Doug wrote, >I'll keep that in mind, but to be honest, I've had my best luck so far >allowing rpmbuild to do the right thing (using things like the >%configure macro instead of trying to hand configure the installation, >etc). In addition, my kernel's include support that the REDHAT_EL4 flag >may be assuming is missing. My goal is to get everything compiling as >though this were an upstream kernel and up to date user space >components, so I'm hoping the REDHAT_EL4 flag won't be necessary by the >time I'm done. But, user space deadlines are after kernel space >deadlines, so I'm off working on kernel things right now. Understood. The only reason I have been hand building/copying things for uDAPL is uDAPL does not support autogen, configure, and make with alternate destination target specified as the rest of the userspace libraries provide. Perhaps Arlin and James can fix the build environment for uDAPL to make it more consistent with the rest of the userspace components. I think this may make building RPMS easier.
woody From krause at cup.hp.com Wed Nov 16 14:15:29 2005 From: krause at cup.hp.com (Michael Krause) Date: Wed, 16 Nov 2005 14:15:29 -0800 Subject: [swg] Re: [openib-general] RE: [dat-discussions] socket based connectionmodel for IB proposal - round 3 In-Reply-To: <437A2759.5050701@ichips.intel.com> References: <437A2759.5050701@ichips.intel.com> Message-ID: <6.2.0.14.2.20051116140954.023480c8@esmail.cup.hp.com> At 10:22 AM 11/15/2005, Sean Hefty wrote: >Kanevsky, Arkady wrote: >>Which entity is responsible to "use" the proposed protocol is an >>interesting one. >>I was assuming that this will be CM. After all the proposed protocol is >>CM extension >>protocol. >>But it can be another entity module between CM and ULP. > >The use of a reserved bit in the CM message indicates that the CM itself >needs to set this data. This requires communication between the CM and >IPoIB - in essence making IPoIB part of the CM. Removal of the reserved >bit is what permits another entity between the CM and the ULP to perform >this task. > >>While it is possible to do wildcarding on the whole SID, I had not seen >>it is used selectively on individual bits of a SID or a port. > >This is what is done/needed to support SDP. A simple mask is applied to >the SID before comparing against a local listen. > >>While SDP does the conversion to IB SID from Ethernet port, this >>proposal >>shift the responsibility for port and IP address conversion from ULP >>down. > >Ideally, SDP should use this same mechanism, but requires changes to SDP. Given SDP is an existing standard and implemented on a variety of platforms, it would be best to not modify SDP in order to insure interoperability. Hence, the new annex should only apply to ULP that do not already communicate the IP address, etc. as part of the CM exchange on IB and as part of the SDP port mapper on iWARP (the port mapper enables completely transparent SDP usage underneath an application with no source code changes required). 
>>Protocol version. This mean that in the future if protocol version will be >>bumped up we will have to change the SID on which Consumer listens on and >>requests sent to. Not sure how to do that without changing ULP. Does not look >>like a good idea. > >As a note, I'm not saying that I prefer a more complex SID, just that >there is trade-off to be made that could provide for more ULP private data. > >If the version changes, then the addressing has changed. The ULP may need >to change anyway in order to know how to interpret the address. If they >don't care about the protocol version or don't need to know how to >interpret the address, they can still wildcard the version. > >>IP version. This can be incorporated into SID. But if HCA has multiple IP >>addresses >>assigned to it the listening point need to specify its IP address(es). >>The current verbs and/or API will have to be changed to support it. >>But if socket is passed to listen on it does have all the needed info. >>Looks fine. > >I didn't follow this. The SDP port mapper protocol for iWARP already solves the multiple IP address issue. The same concept / principles could be leveraged for IB. I don't think the verbs need to be changed to support it as it is just payload within CM messages for IB. >>DAPL APIs (uDAPL and kDAPL) does not expose local IP address for listen >>point. >>An additional API can be added to support passing local socket to listen >>on >>instead of Connection Qualifier. Since it is addition no backwards >>compatibility issues. >>The current ULPs/Apps will still use "the default API address" and the >>protocol assigned SID as connection qualifier. > >The issue with DAPL is that it assumes that addressing has been resolved >to a specific device before communication between the client and server >have even occurred. I.e. the server must be clairvoyant and know which >device a connection request will be received on. 
Likewise, a client must >assume which local device is needed to connect to a given remote address. The nice thing about the SDP port mapper is that it allows the server and client to dynamically determine the resolution prior to connection establishment as well as gauge how many resources, preferred CA / port, etc. to extend to the client. Mike From mshefty at ichips.intel.com Wed Nov 16 14:29:26 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 16 Nov 2005 14:29:26 -0800 Subject: [swg] Re: [openib-general] RE: [dat-discussions] socket based connectionmodel for IB proposal - round 3 In-Reply-To: <6.2.0.14.2.20051116140954.023480c8@esmail.cup.hp.com> References: <437A2759.5050701@ichips.intel.com> <6.2.0.14.2.20051116140954.023480c8@esmail.cup.hp.com> Message-ID: <437BB2C6.1030504@ichips.intel.com> Michael Krause wrote: > Given SDP is an existing standard and implemented on a variety of > platforms, it would be best to not modify SDP in order to insure > interoperability. Hence, the new annex should only apply to ULP that do > not already communicate the IP address, etc. as part of the CM exchange > on IB and as part of the SDP port mapper on iWARP (the port mapper > enables completely transparent SDP usage underneath an application with > no source code changes required). From an implementation viewpoint, my plan is to make the CMA aware of both the proposed private data format, as well as SDP's format. This should eliminate most of the duplicated functionality found in the CMA and SDP implementation, without requiring changes to SDP.
- Sean From recio at us.ibm.com Wed Nov 16 15:52:00 2005 From: recio at us.ibm.com (Renato Recio) Date: Wed, 16 Nov 2005 17:52:00 -0600 Subject: [swg] Re: [openib-general] RE: [dat-discussions] socket based connectionmodel for IB proposal - round 3 Message-ID: The SDP port mapper protocol is required, because the well known SDP port is "not" the targeted port. In the case of the RDMA Service, the targetted port may very well be a well known port (e.g. one of the two iSCSI ports), but the only point of the RDMA Service is to let the listening end know "I want the iSCSI well known port, that is RDMA aware". It seems to me that for ULPs that are RDMA aware, there is no need for a port mapper. However, over IB there is a need to know that 1) its the RDMA aware service being requested; and 2) to leverage the IP stack, the IP address info. More comments bounded by below. Thanks, Renato J Recio Chief Architect, eServer I/O IBM Distinguished Engineer Member IBM Academy of Technology Tel 512-838-3685, T/L 678-3685 Michael Krause m> cc: swg at infinibandta.org, openib-general at openib.org Subject: Re: [swg] Re: [openib-general] RE: [dat-discussions] socket based 11/16/2005 04:15 connectionmodel for IB proposal - round 3 PM At 10:22 AM 11/15/2005, Sean Hefty wrote: Kanevsky, Arkady wrote: Which entity is responsible to "use" the proposed protocol is an interesting one. I was assuming that this will be CM. After all the proposed protocol is CM extension protocol. But it can be another entity module between CM and ULP. The use of a reserved bit in the CM message indicates that the CM itself needs to set this data. This requires communication between the CM and IPoIB - in essence making IPoIB part of the CM. Removal of the reserved bit is what permits another entity between the CM and the ULP to perform this task. While it is possible to do wildcarding on the whole SID, I had not seen it is used selectively on individual bits of a SID or a port. 
This is what is done/needed to support SDP. A simple mask is applied to the SID before comparing against a local listen. While SDP does the conversion to IB SID from Ethernet port, this proposal shift the responsibility for port and IP address conversion from ULP down. Ideally, SDP should use this same mechanism, but requires changes to SDP. Given SDP is an existing standard and implemented on a variety of platforms, it would be best to not modify SDP in order to insure interoperability. Hence, the new annex should only apply to ULP that do not already communicate the IP address, etc. as part of the CM exchange on IB and as part of the SDP port mapper on iWARP (the port mapper enables completely transparent SDP usage underneath an application with no source code changes required). No. Mike, the intent is not to use this protocol for a ULP that is not RDMA aware and runs over IPoIB. Similarly, the intent is not to use this protocol fo a ULP that is not RDMA aware, but depends on SDP to help a little by leveraging RDMA as much as possible without exposing RDMA primitives (RDMA Write/Read) to the ULP. Instead, the intent is to use this protocol for a ULP that is RDMA aware. Think of it as... IPoIB uses IP stack for port management (reaching well known ports, port mapping for dynamic ports, ...). SDP uses the IB SDP Service for managing ULP ports that are not RDMA aware, but we want to help a little. We need a similar IB RDMA Service for managing ULP ports that RDMA aware. I brought this up in the RDMAC many moons ago. It seems to me that for iWARP the solution is a little more straight forward, because the stack stays IP through out. That is, if the active side (in IB terms :{) wants to connect to a well known port on the passive side, the RDMA negotiation can occur in band (e.g. as it does for iSER). 
However, given we don't have the IP stack through out, we need an RDMA service that allows an active side RDMA aware ULPs to ask the passive side ULP "are you RDMA aware?". That Service is what is being proposed. Protocol version. This mean that in the future if protocol version will be bumped up we will have to change the SID on which Consumer listens on and requests sent to. Not sure how to do that without changing ULP. Does not look like a good idea. As a note, I'm not saying that I prefer a more complex SID, just that there is trade-off to be made that could provide for more ULP private data. If the version changes, then the addressing has changed. The ULP may need to change anyway in order to know how to interpret the address. If they don't care about the protocol version or don't need to know how to interpret the address, they can still wildcard the version. IP version. This can be incorporated into SID. But if HCA has multiple IP addresses assigned to it the listening point need to specify its IP address(es). The current verbs and/or API will have to be changed to support it. But if socket is passed to listen on it does have all the needed info. Looks fine. I didn't follow this. The SDP port mapper protocol for iWARP already solves the multiple IP address issue. The same concept / principles could be leveraged for IB. I don't think the verbs need to be changed to support it as it is just payload within CM messages for IB. But they can't. Because for IB, we will still need a Service ID (see CM spec chapter) and the SDP Service ID is not applicable (e.g. we will not be running SDP between iSCSI and iSER). DAPL APIs (uDAPL and kDAPL) does not expose local IP address for listen point. An additional API can be added to support passing local socket to listen on instead of Connection Qualifier. Since it is addition no backwards compatibility issues. The current ULPs/Apps will still use "the default API address" and the protocol assigned SID as connection qualifier. 
The issue with DAPL is that it assumes that addressing has been resolved to a specific device before communication between the client and server have even occurred. I.e. the server must be clairvoyant and know which device a connection request will be received on. Likewise, a client must assume which local device is needed to connect to a given remote address. The nice thing about the SDP port mapper is that it allows the server and client to dynamically determine the resolution prior to connection establishment as well as gauge how many resources, preferred CA / port, etc. to extend to the client. Read my comments above. Mike From swise at opengridcomputing.com Wed Nov 16 16:31:31 2005 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 16 Nov 2005 18:31:31 -0600 Subject: [openib-general] rdma_listen() backlog parameter In-Reply-To: References: Message-ID: <1132187491.9338.4.camel@wifi-230-130.sc05.org> The rnic might need to reserve resources based on the listen backlog, so I think the kernel iwarp cma will need this. This is rnic-dependent, but at least the Ammasso rnic needs to know what the backlog should be. For OpenIB, we _could_ lock this down to some fixed value on the iwarp side. By the way, this listen backlog for an iwarp rnic is actually the backlog of outstanding TCP connection requests as opposed to iwarp connections. Steve.
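For readers following the backlog discussion in this thread, here is a minimal sketch of the definition Sean proposed: the backlog is the number of unprocessed connection requests a listener can have, and a request counts as processed once it has been accepted or rejected. The struct and function names below are hypothetical, chosen only for illustration; this is not the CMA's actual data structure or API, and it ignores locking and any RNIC-specific resource reservation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical listener state (an assumption for illustration only). */
struct listener {
    size_t backlog;      /* max number of unprocessed connection requests */
    size_t unprocessed;  /* requests seen but not yet accepted or rejected */
};

static void listener_init(struct listener *l, size_t backlog)
{
    l->backlog = backlog;
    l->unprocessed = 0;
}

/*
 * Called when a new connection request arrives.
 * Returns true if the request is queued for the client,
 * false if it is dropped because the backlog is already full.
 */
static bool listener_on_request(struct listener *l)
{
    if (l->unprocessed >= l->backlog)
        return false;
    l->unprocessed++;
    return true;
}

/* Called once the client has accepted or rejected a queued request. */
static void listener_on_processed(struct listener *l)
{
    if (l->unprocessed > 0)
        l->unprocessed--;
}
```

With a backlog of 2, a third request is refused until one of the first two has been accepted or rejected. Whether such a counter belongs in the kernel CMA, the uCMA, or the device driver is exactly the open question in the thread.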
On Wed, 2005-11-16 at 11:54 -0800, Sean Hefty wrote: > I've been trying to implement support for a backlog parameter in the CMA's > rdma_listen() call. My original goal was to push the backlog parameter down > into the kernel CMA with the following definition: > > Backlog - The number of unprocessed connection requests that a listener can > have. A connection request is considered processed once it has either been > accepted or rejected. > > To keep things simple, is there any reason for the kernel CMA to have a backlog > parameter? Would iWarp require this? The kernel CMA is callback driven, which > makes it trivial for clients to manage their own backlog. I'm considering > maintaining this within the kernel uCMA only, unless it would need to be pushed > down for iWarp. > > - Sean > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From tom at opengridcomputing.com Wed Nov 16 16:43:03 2005 From: tom at opengridcomputing.com (Tom Tucker) Date: Wed, 16 Nov 2005 18:43:03 -0600 Subject: [openib-general] rdma_listen() backlog parameter In-Reply-To: References: Message-ID: <1132188184.7598.5.camel@wifi-228-213.sc05.org> Sean: I was just speaking with one of the Mellanox guys at the booth and he said that there is a parameter that performs a similar function to the TCP backlog that is set in the MAD message used to advertise the service point. He couldn't remember the name, but said I should post it to this group as a question. So the question is "what is the method used to manage the number of outstanding, uncompleted connection requests coming in to a single service point? Is there some parameter you can specify in the MAD messages used to set up the connection?" 
WRT the CMA: I think this parameter is just an attribute of the service/listening endpoint and not a queuing depth of outstanding, unaccepted/rejected connections in the CMA. On Wed, 2005-11-16 at 11:54 -0800, Sean Hefty wrote: > I've been trying to implement support for a backlog parameter in the CMA's > rdma_listen() call. My original goal was to push the backlog parameter down > into the kernel CMA with the following definition: > > Backlog - The number of unprocessed connection requests that a listener can > have. A connection request is considered processed once it has either been > accepted or rejected. > > To keep things simple, is there any reason for the kernel CMA to have a backlog > parameter? Would iWarp require this? The kernel CMA is callback driven, which > makes it trivial for clients to manage their own backlog. I'm considering > maintaining this within the kernel uCMA only, unless it would need to be pushed > down for iWarp. > > - Sean > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mshefty at ichips.intel.com Wed Nov 16 16:49:22 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 16 Nov 2005 16:49:22 -0800 Subject: [openib-general] rdma_listen() backlog parameter In-Reply-To: <1132187491.9338.4.camel@wifi-230-130.sc05.org> References: <1132187491.9338.4.camel@wifi-230-130.sc05.org> Message-ID: <437BD392.4040507@ichips.intel.com> Steve Wise wrote: > The rnic might need to reserve resources based on the listen backlog, so > I think the kernel iwarp cma will need this. This is rnic-dependent, > but at least the Ammasso rnic needs to know what the backlog should be. > For OpenIB, we _could_ lock this down to some fixed value on the iwarp > side. 
I just committed a patch that removed this from the kernel CMA, but it's easy enough to put back. I was having issues trying to push the backlog down into the kernel CMA, versus maintaining it in the uCMA (kernel module to support the userspace CMA library). The issues surrounded trying to define something usable for IB that didn't result in potential system hangs. I considered pushing the backlog down into the IB CM, but the IB CM doesn't really need a backlog, plus it didn't fix my system hang issues... Right now, there's a backlog parameter for userspace that is used to restrict the number of outstanding connect request events waiting to be retrieved by the user. (Think of it as sizing a mythical connection event queue maintained in the kernel.) This works regardless of the underlying transport, but doesn't pass the backlog information down to lower-level drivers. - Sean From mshefty at ichips.intel.com Wed Nov 16 16:55:44 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 16 Nov 2005 16:55:44 -0800 Subject: [openib-general] rdma_listen() backlog parameter In-Reply-To: <1132188184.7598.5.camel@wifi-228-213.sc05.org> References: <1132188184.7598.5.camel@wifi-228-213.sc05.org> Message-ID: <437BD510.8090103@ichips.intel.com> Tom Tucker wrote: > I was just speaking with one of the Mellanox guys at the booth and he > said that there is a parameter that performs a similar function to the > TCP backlog that is set in the MAD message used to advertise the service > point. He couldn't remember the name, but said I should post it to this > group as a question. I'm not familiar with this, but if it's only for advertising, then the implementation would still need to enforce it. > WRT the CMA: I think this parameter is just an attribute of the > service/listening endpoint and not a queuing depth of outstanding, > unaccepted/rejected connections in the CMA. We could view it this way, where the backlog is simply a listening attribute. 
I was wanting to enforce some sort of limit on the number of events queued in the kernel for a single client, however, which is what it is being used for now. - Sean From liangs at cse.ohio-state.edu Wed Nov 16 21:33:22 2005 From: liangs at cse.ohio-state.edu (Shuang Liang) Date: Thu, 17 Nov 2005 00:33:22 -0500 Subject: [openib-general] High memory Message-ID: <437C1622.5070505@cse.ohio-state.edu> Hi, I am new here with some problem of gen2 programming, hope somebody can help me. I was trying to send a message from a kernel buffer to a remote userland program on IA-32 machines. Basically, what happened was I used get_dma_mr to get memory registered. And I noticed if a buffer is allocated from high memory (address >f8000000), then the data can not be delivered correctly to the receiver side(both send recv completes successfully, but with wrong data). I thought the problem could have been that I used virt_to_phys for address translation. But I can't find any appropriate ones for high memory address translation. I wondering if somebody could give me some suggestions on this. Thanks a lot! -- Shuang Liang, Graduate Administration Assistant, Department of Computer Science & Engineering The Ohio State University, 374 Dreese Labs, 2015 Neil Ave Columbus, Ohio 43210 614-292-1900, 614-292-2911 (fax) From iod00d at hp.com Wed Nov 16 22:00:50 2005 From: iod00d at hp.com (Grant Grundler) Date: Wed, 16 Nov 2005 22:00:50 -0800 Subject: [openib-general] High memory In-Reply-To: <437C1622.5070505@cse.ohio-state.edu> References: <437C1622.5070505@cse.ohio-state.edu> Message-ID: <20051117060050.GF24146@esmail.cup.hp.com> On Thu, Nov 17, 2005 at 12:33:22AM -0500, Shuang Liang wrote: > Hi, > I am new here with some problem of gen2 programming, hope somebody > can help me. I was trying to send a message from a kernel buffer to a > remote userland program on IA-32 machines. > Basically, what happened was I used get_dma_mr to get memory > registered. 
And I noticed if a buffer is allocated from high memory > (address >f8000000), then the data can not be delivered correctly to the > receiver side(both send recv completes successfully, but with wrong > data). I thought the problem could have been that I used virt_to_phys > for address translation. Using "virt_to_phys" is always wrong when trying to get a DMA mapping. > But I can't find any appropriate ones for high > memory address translation. I wondering if somebody could give me some > suggestions on this. I don't know enough context. Are you trying to write an openib kernel driver? If so you want to read "Documentation/DMA-API.txt" for hints on the available DMA mapping interfaces and then look at how SDP or IPoIB use the described DMA interfaces. hth, grant From yael at mellanox.co.il Wed Nov 16 23:00:33 2005 From: yael at mellanox.co.il (Yael Kalka) Date: Thu, 17 Nov 2005 09:00:33 +0200 Subject: [openib-general] RE: [PATCH] Opensm - lid assignment issues Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30E23F1@mtlexch01.mtl.com> Hello Hal, I see that you haven't applied this patch yet. Just making sure it is not lost between the mails... Thanks, Yael -----Original Message----- From: Yael Kalka [mailto:yael at mellanox.co.il] Sent: Sunday, November 13, 2005 12:18 PM To: halr at voltaire.com Cc: openib-general at openib.org; eitan at mellanox.co.il; yael at mellanox.co.il Subject: [PATCH] Opensm - lid assignment issues Hi Hal, During some windows tests we've discovered that there is still another problem in the lid_mgr. The problem happend when 2 HCAs had the same lid - opensm entered an infinite loop. The following patch fixes this. 
Thanks, Yael Signed-off-by: Yael Kalka Index: opensm/osm_lid_mgr.c =================================================================== --- opensm/osm_lid_mgr.c (revision 4032) +++ opensm/osm_lid_mgr.c (working copy) @@ -550,6 +550,9 @@ __osm_lid_mgr_init_sweep( { /* This port will use its local lid, and consume the entire required lid range. Thus we can skip that range. */ + /* If the disc_max_lid is greater then lid - we can skip right to it, + since we've done all neccessary checks on the lids in between. */ + if (disc_max_lid > lid) lid = disc_max_lid; } } @@ -593,7 +596,14 @@ __osm_lid_mgr_init_sweep( { p_range = (osm_lid_mgr_range_t *)cl_malloc(sizeof(osm_lid_mgr_range_t)); - p_range->min_lid = 1; + /* + The p_range can be NULL in one of 2 cases: + 1. If max_defined_lid == 0. In this case, we want the entire range. + 2. If all lids discovered in the loop where mapped. In this case + no free range exists, and we want to define it after the last + mapped lid. + */ + p_range->min_lid = lid; } p_range->max_lid = p_mgr->p_subn->max_unicast_lid_ho - 1; cl_qlist_insert_tail( &p_mgr->free_ranges, &p_range->item ); From yael at mellanox.co.il Thu Nov 17 04:30:37 2005 From: yael at mellanox.co.il (Yael Kalka) Date: 17 Nov 2005 14:30:37 +0200 Subject: [openib-general] [PATCH] Opensm - add info in vendor error printing Message-ID: <5z4q6bxxqq.fsf@mtl066.yok.mtl.com> Hi Hal, We are encountering problems in umad_send with large sized mads. I will send a different mail regarding this issue to the group. This patch adds the mad size to the error message when umad_send failed. 
Thanks, Yael Signed-off-by: Yael Kalka Index: libvendor/osm_vendor_ibumad.c =================================================================== --- libvendor/osm_vendor_ibumad.c (revision 4069) +++ libvendor/osm_vendor_ibumad.c (working copy) @@ -1010,6 +1010,7 @@ osm_vendor_send( ib_sa_mad_t* const p_sa = (ib_sa_mad_t *)p_mad; int ret = -1; int is_rmpp = 0; + uint32_t sent_mad_size; #ifndef VENDOR_RMPP_SUPPORT uint32_t paylen = 0; #endif @@ -1067,20 +1068,23 @@ osm_vendor_send( if (resp_expected) put_madw(p_vend, p_madw, &p_mad->trans_id); - if ((ret = umad_send(p_bind->port_id, p_bind->agent_id, p_vw->umad, #ifdef VENDOR_RMPP_SUPPORT - p_madw->mad_size, + sent_mad_size = p_madw->mad_size; #else - is_rmpp ? p_madw->mad_size - IB_SA_MAD_HDR_SIZE : + sent_mad_size = is_rmpp ? p_madw->mad_size - IB_SA_MAD_HDR_SIZE : p_madw->mad_size, #endif + + if ((ret = umad_send(p_bind->port_id, p_bind->agent_id, p_vw->umad, + sent_mad_size, resp_expected ? p_vend->timeout : 0, p_vend->max_retries)) < 0) { if (resp_expected) get_madw(p_vend, &p_mad->trans_id); /* remove from aging table */ osm_log(p_vend->p_log, OSM_LOG_ERROR, "osm_vendor_send: ERR 5430: " - "Send p_madw = %p failed %d (%m)\n", p_madw, ret); + "Send p_madw = %p of size %d failed %d (%m)\n", + p_madw, sent_mad_size, ret); p_madw->status = IB_ERROR; cl_spinlock_acquire( &p_vend->cb_lock ); (*p_bind->send_err_callback)(p_bind->client_context, p_madw); /* cb frees madw */ From yael at mellanox.co.il Thu Nov 17 04:49:52 2005 From: yael at mellanox.co.il (Yael Kalka) Date: Thu, 17 Nov 2005 14:49:52 +0200 Subject: [openib-general] RE: [PATCH] Opensm - add info in vendor error printing Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30E23F3@mtlexch01.mtl.com> Hello all, During some opensm tests we've encountered a situation where a call to umad_send() fails on very large mads (on mads of size ~90,000). Is there some limitation on the size of the mads that can be sent? 
Reproduction of the issue is simple: Run opensm Then run osmtest -f m -M 3 What the osmtest does in this case is try to create as many multicast groups as possible (for all the possible MC_lids), and then does an SA query on all the multicast groups that exist. OpenSM will then try to answer with a huge group of multicast records, and when opensm does umad_send() - it fails. Any ideas why this happens? Thanks, Yael From dledford at redhat.com Thu Nov 17 06:14:59 2005 From: dledford at redhat.com (Doug Ledford) Date: Thu, 17 Nov 2005 09:14:59 -0500 Subject: [openib-general] Announce: preview RPMs for FC-4 and RHEL-4 available In-Reply-To: <200511161758.46379.moschny@ipd.uni-karlsruhe.de> References: <437A8592.1000408@redhat.com> <200511161758.46379.moschny@ipd.uni-karlsruhe.de> Message-ID: <437C9063.70607@redhat.com> Thomas Moschny wrote: > On Wednesday 16 November 2005 02:04, Doug Ledford wrote: > >>I have initial RPM support for both of these releases available for >>use/testing. > > > Thanks for providing the rpms!
> > >>If you try these out and have any problems, please email me directly >>(and feel free to Cc: the list) for more immediate responses. > > > Unfortunately, we got an kernel-oops on ia64 (rhel4) ... > The boot log is attached. > > - Thomas I think I know what this is. On any arch where sizeof(int) != sizeof(void *) the kernel version on my site would oops. The kernel you tested was one where I was in the process of adding kzalloc() to slab.c so that driver backports would be a bit less painful in the future. However, I missed adding the function prototype to the include files. This resulted in a lot of "assignment makes pointer from integer without a cast" errors anywhere kzalloc was used. Well, on 64bit arches, you can't make a full pointer from an integer. So, I'm thinking that the oops is the result of a partial conversion from a 32bit int to a 64bit pointer, even though kzalloc actually returned a 64bit pointer anyway. The attached patch should be able to be dropped into the existing srpm in place of the patch with the same name and a rebuild should then solve the problem, although in the process of creating this patch I had to move it from the 2700 section of the patch list down to the 10002 position because it touches things added after the infiniband code. -- Doug Ledford http://people.redhat.com/dledford -------------- next part -------------- A non-text attachment was scrubbed... Name: linux-2.6.9-slab-update.patch Type: text/x-patch Size: 3605 bytes Desc: not available URL: From mst at mellanox.co.il Thu Nov 17 07:52:09 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 17 Nov 2005 17:52:09 +0200 Subject: [openib-general] [PATCH] ipoib: protect child list access Message-ID: <20051117155209.GN20871@mellanox.co.il> race condition: ipoib_ib_dev_flush is accessing child list without locks. Signed-off-by: Michael S. 
Tsirkin Index: linux-2.6.14-dbg/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- linux-2.6.14-dbg/drivers/infiniband/ulp/ipoib/ipoib_ib.c (revision 4042) +++ linux-2.6.14-dbg/drivers/infiniband/ulp/ipoib/ipoib_ib.c (working copy) @@ -608,9 +608,13 @@ void ipoib_ib_dev_flush(void *_dev) if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) ipoib_ib_dev_up(dev); + down(&priv->vlan_mutex); + /* Flush any child interfaces too */ list_for_each_entry(cpriv, &priv->child_intfs, list) ipoib_ib_dev_flush(&cpriv->dev); + + up(&priv->vlan_mutex); } void ipoib_ib_dev_cleanup(struct net_device *dev) -- MST From caitlinb at broadcom.com Thu Nov 17 08:36:25 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 17 Nov 2005 08:36:25 -0800 Subject: [openib-general] High memory Message-ID: <54AD0F12E08D1541B826BE97C98F99F104186E@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > On Thu, Nov 17, 2005 at 12:33:22AM -0500, Shuang Liang wrote: >> Hi, >> I am new here with some problem of gen2 programming, hope somebody >> can help me. I was trying to send a message from a kernel buffer to a >> remote userland program on IA-32 machines. >> Basically, what happened was I used get_dma_mr to get memory >> registered. And I noticed if a buffer is allocated from high memory >> (address >f8000000), then the data can not be delivered correctly to >> the receiver side(both send recv completes successfully, but with >> wrong data). I thought the problem could have been that I used >> virt_to_phys for address translation. > > Using "virt_to_phys" is always wrong when trying to get a DMA mapping. > >> But I can't find any appropriate ones for high memory address >> translation. I wondering if somebody could give me some suggestions >> on this. > > I don't know enough context. > Are you trying to write an openib kernel driver? 
> If so you want to read "Documentation/DMA-API.txt" for hints > on the available DMA mapping interfaces and then look at how > SDP or IPoIB use the described DMA interfaces. > That applies even if you are trying to write a kernel daemon and wish to use high memory. You either make it part of your virtual memory map at least temporarily, so you can register it, or you register the memory "physically". But "physical" registration for RDMA is never really physical addresses, it is always bus addresses, which means reading the stuff meant for driver developers. From tom at opengridcomputing.com Thu Nov 17 09:20:34 2005 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 17 Nov 2005 11:20:34 -0600 Subject: [openib-general] RE: [dat-discussions] socket based connectionmodel for IB proposal - round 3 In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1041781@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F1041781@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <1132248035.8059.8.camel@wifi-228-113.sc05.org> On Tue, 2005-11-15 at 10:33 -0800, Caitlin Bestler wrote: > openib-general-bounces at openib.org wrote: > > Kanevsky, Arkady wrote: > >> Which entity is responsible to "use" the proposed protocol is an > >> interesting one. I was assuming that this will be CM. After all the > >> proposed protocol is CM extension protocol. But it can be another > >> entity module between CM and ULP. > > > > The use of a reserved bit in the CM message indicates that > > the CM itself needs to set this data. This requires > > communication between the CM and IPoIB - in essence making > > IPoIB part of the CM. Removal of the reserved bit is what > > permits another entity between the CM and the ULP to perform > > this task. > > > > The use of a reserved bit allows the receiver to know that the > IP address has some validity without having to do a translation > of the IP address itself. 
> > As Fab pointed out, if the receiving software takes this step > then there is no need for any additional CM bit. Because the > receiver can rely on the translation being authentic. It's > an extra round trip, but it does leave the local CM interface > totally intact. > > > > >> DAPL APIs (uDAPL and kDAPL) does not expose local IP address for > >> listen point. An additional API can be added to support passing > >> local socket to listen on instead of Connection Qualifier. Since it > >> is addition no backwards compatibility issues. > >> The current ULPs/Apps will still use "the default API address" and > >> the protocol assigned SID as connection qualifier. > > > > The issue with DAPL is that it assumes that addressing has > > been resolved to a specific device before communication > > between the client and server have even occurred. I.e. the > > server must be clairvoyant and know which device a connection > > request will be received on. Likewise, a client must assume > > which local device is needed to connect to a given remote address. > > > > - Sean > > That is actually inherent in any RDMA model. You have to pre-post > receive buffers before you connect. Therefore you need to know > which device to register memory to. You don't need to know the > external physical port, but you do need to know the device that > is responsible for memory registration. Er, I think the current CMA implementation is a valid "RDMA model" and specifically doesn't require that you determine the local device before connecting. BTW, assuming the semantics specified in RNIC Verbs, you don't have to post recv buffers before connect because the active side must send the first RDMA message. 
Therefore, if the passive side posts recv buffers prior to completing the LLP transition to RDMA mode and the active side posts recv buffers prior to sending the first RDMA message, there is no timing hole where a valid RDMA message can be received prior to having a posted recv buffer to handle it. > > And this is not require clairvoyance. It requires integration > with the host local routing tables. Something that would be > easy if the GID were treated as an IPv6 address. But even with > some form of translation it is easy, as long as the IP addresses > are integrated with the local routing tables. The current CMA does this. > > Given a remote IP address (or a local one that you want to use) > you know what egress port will be used (and which ones could be > used), and you know that RDMA device(s) associated with those > egress points. The last step is simple, but has been overlooked > all too often. The CMA does this too. > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sean.hefty at intel.com Thu Nov 17 09:39:36 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 17 Nov 2005 09:39:36 -0800 Subject: [openib-general] socket based connection model for IB proposal -round 4 In-Reply-To: Message-ID: If the proposal will include UDP, should the definition extend beyond connections to include UD QPs as well (i.e. SIDR REQ)? 
- Sean From halr at voltaire.com Thu Nov 17 09:35:18 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Nov 2005 12:35:18 -0500 Subject: [openib-general] RE: [PATCH] Opensm - lid assignment issues In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30E23F1@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30E23F1@mtlexch01.mtl.com> Message-ID: <1132248917.26731.5.camel@hal.voltaire.com> Hi Yael, On Thu, 2005-11-17 at 02:00, Yael Kalka wrote: > Hello Hal, > I see that you haven't applied this patch yet. > Just making sure it is not lost between the mails... It is not lost. I was at SC05 and ran it there. I will commit it soon now that I am back and just now digging out from 8 days of being away. -- Hal > Thanks, > Yael > > -----Original Message----- > From: Yael Kalka [mailto:yael at mellanox.co.il] > Sent: Sunday, November 13, 2005 12:18 PM > To: halr at voltaire.com > Cc: openib-general at openib.org; eitan at mellanox.co.il; yael at mellanox.co.il > Subject: [PATCH] Opensm - lid assignment issues > > > Hi Hal, > > During some windows tests we've discovered that there is still another > problem in the lid_mgr. The problem happend when 2 HCAs had the same > lid - opensm entered an infinite loop. > The following patch fixes this. > > Thanks, > Yael > > Signed-off-by: Yael Kalka > > Index: opensm/osm_lid_mgr.c > =================================================================== > --- opensm/osm_lid_mgr.c (revision 4032) > +++ opensm/osm_lid_mgr.c (working copy) > @@ -550,6 +550,9 @@ __osm_lid_mgr_init_sweep( > { > /* This port will use its local lid, and consume the > entire required lid range. > Thus we can skip that range. */ > + /* If the disc_max_lid is greater then lid - we can skip > right to it, > + since we've done all neccessary checks on the lids in > between. 
*/ > + if (disc_max_lid > lid) > lid = disc_max_lid; > } > } > @@ -593,7 +596,14 @@ __osm_lid_mgr_init_sweep( > { > p_range = > (osm_lid_mgr_range_t *)cl_malloc(sizeof(osm_lid_mgr_range_t)); > - p_range->min_lid = 1; > + /* > + The p_range can be NULL in one of 2 cases: > + 1. If max_defined_lid == 0. In this case, we want the entire > range. > + 2. If all lids discovered in the loop where mapped. In this case > + no free range exists, and we want to define it after the last > > + mapped lid. > + */ > + p_range->min_lid = lid; > } > p_range->max_lid = p_mgr->p_subn->max_unicast_lid_ho - 1; > cl_qlist_insert_tail( &p_mgr->free_ranges, &p_range->item ); > From kingman at austin.rr.com Thu Nov 17 10:45:39 2005 From: kingman at austin.rr.com (John Kingman) Date: Thu, 17 Nov 2005 12:45:39 -0600 (CST) Subject: [openib-general] [PATCH] ibsrpdm: fix service record range problem Message-ID: The start and end values are reversed in the attribute modifier for the Service Entries attribute. Tested with our target. Signed-off-by: John Kingman --- srp-dm.c 2005-11-15 01:35:40.000000000 -0600 +++ srp-dm.c 2005-11-17 12:09:28.000000000 -0600 @@ -335,7 +335,7 @@ struct srp_dm_mad *in_dm_mad; init_srp_dm_mad(&out_mad, agent[1], dlid, SRP_DM_ATTR_SERVICE_ENTRIES, - (ioc << 16) | (start << 8) | end); + (ioc << 16) | (end << 8) | start); again: if (write(fd, &out_mad, sizeof out_mad) != sizeof out_mad) { From mshefty at ichips.intel.com Thu Nov 17 11:18:30 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 17 Nov 2005 11:18:30 -0800 Subject: [openib-general] rdma_listen() backlog parameter In-Reply-To: <1132187491.9338.4.camel@wifi-230-130.sc05.org> References: <1132187491.9338.4.camel@wifi-230-130.sc05.org> Message-ID: <437CD786.2010305@ichips.intel.com> Steve Wise wrote: > By the way, this listen backlog for an iwarp rnic is actually the > backlog of outstanding TCP connection requests as opposed to iwarp > connections. 
Giving this more thought, I think there are at least two separate issues. One is defining what backlog means from the perspective of the rdma_listen() API. The userspace and kernel APIs don't necessarily need to have the same definition. The second issue that I see is mapping backlog to underlying CMs. A single rdma_listen call can result in separate listen calls to underlying devices. (Obviously, a simple mapping is to just pass the value down.) - Sean From caitlinb at broadcom.com Thu Nov 17 11:26:02 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 17 Nov 2005 11:26:02 -0800 Subject: [openib-general] rdma_listen() backlog parameter Message-ID: <54AD0F12E08D1541B826BE97C98F99F104189D@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > Steve Wise wrote: >> By the way, this listen backlog for an iwarp rnic is actually the >> backlog of outstanding TCP connection requests as opposed to iwarp >> connections. > > Giving this more thought, I think there are at least two > separate issues. One is defining what backlog means from the > perspective of the rdma_listen() API. > The userspace and kernel APIs don't necessarily need to have > the same definition. > > The second issue that I see is mapping backlog to underlying > CMs. A single rdma_listen call can result in separate listen calls > to underlying devices. (Obviously, a simple mapping is to just pass > the value down.) > Agreed, although I would think of it first in terms of how many connection requests the ULP is willing to accept, and only secondly in terms of what resources the CM/driver/device needs.
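The "just pass the value down" mapping discussed above can be sketched in a few lines of C. This is only an illustration under stated assumptions: struct dev_listener, dev_listen() and fan_out_listen() are hypothetical names made up for this sketch, not part of the CMA or of any underlying CM, and a real implementation would allocate per-device resources where the sketch only records the value.

```c
#include <stddef.h>

/* Hypothetical stand-in for one underlying CM/device listener. */
struct dev_listener {
    const char *name;   /* illustrative device name only */
    int backlog;        /* backlog this device was asked to honor */
    int listening;      /* nonzero once a listen has been requested */
};

/* Hypothetical per-device listen call; rejects a negative backlog. */
static int dev_listen(struct dev_listener *dev, int backlog)
{
    if (backlog < 0)
        return -1;
    dev->backlog = backlog;
    dev->listening = 1;
    return 0;
}

/*
 * Fan a single listen request out to every underlying device, passing
 * the caller's backlog down unchanged. Returns the number of devices
 * now listening, or -1 as soon as one device refuses the request.
 */
static int fan_out_listen(struct dev_listener *devs, size_t ndevs, int backlog)
{
    size_t i;

    for (i = 0; i < ndevs; i++)
        if (dev_listen(&devs[i], backlog) != 0)
            return -1;
    return (int)ndevs;
}
```

With this mapping every device may queue up to backlog requests on its own, so the aggregate across devices can exceed what the caller asked for. That is one reason the API-level meaning of backlog needs to be defined separately from what each CM does with it.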
From bboas at llnl.gov Mon Nov 14 15:16:51 2005 From: bboas at llnl.gov (Bill Boas) Date: Mon, 14 Nov 2005 15:16:51 -0800 Subject: [openib-general] Invitation to OpenIB BOF at SC05 Wednesday 11-12 Room 205 in Convention Center Message-ID: <6.2.3.4.2.20051114123415.02f263c0@mail-lc.llnl.gov> Please attend the OpenIB Birds of Feather at Sc|05 on Wednesday November 16 at 11.00 AM in Room 205 in the Convention Center Agenda and Goals of BOF Bill Bio SCinet05-IB team experience and Exhibitor Booth Feedback on Infiniband Network at SC05 - speakers from team who did it Integrated OpenIB and iWARP stack demo and Direct Access Network naming idea - Tom Tucker Long distance Links at SC05 results and feedback - Linden Mercer, Rick Cecil, Microsoft Feedback - Eric lantz New members Next Events Content, location and schedule - Proposed Interoperability Event - Thad Omura - Sonoma Workshop - TBD - HLRS Workshop - Peter Haas Other items and Wrap Up - Bill Boas Bill Boas bboas at llnl.gov ICCD LLNL, B-453, R-2018 Wk: 925-422-4110 7000 East Ave, L-555 Cell: 925-337-2224 Livermore, CA 94551 Pgr: 877-203-2248 From bboas at llnl.gov Mon Nov 14 15:28:11 2005 From: bboas at llnl.gov (Bill Boas) Date: Mon, 14 Nov 2005 15:28:11 -0800 Subject: [openib-general] Re: Invitation to OpenIB BOF at SC05 Wednesday 11-12 Room 205 in Convention Center In-Reply-To: <6.2.3.4.2.20051114123415.02f263c0@mail-lc.llnl.gov> References: <6.2.3.4.2.20051114123415.02f263c0@mail-lc.llnl.gov> Message-ID: <6.2.3.4.2.20051114151803.02f22220@mail-lc.llnl.gov> At 03:16 PM 11/14/2005, Bill Boas wrote: >Please attend the OpenIB Birds of Feather at Sc|05 on Wednesday >November 16 at 11.00 AM in Room 205 in the Convention Center > >Agenda and Goals of BOF Bill Boas > >SCinet05-IB team experience and Exhibitor Booth Feedback on >Infiniband Network at SC05 - speakers from team who did it > >Integrated OpenIB and iWARP stack demo and Direct Access Network >naming idea - Tom Tucker > >Long distance Links at SC05 results and 
feedback - Linden Mercer, Rick Cecil, > >Microsoft Feedback - Eric lantz > >New members > >Next Events Content, location and schedule > - Proposed Interoperability Event - Thad Omura > - Sonoma Workshop - TBD > - HLRS Workshop - Peter Haas > >Other items and Wrap Up - Bill Boas > > >Bill Boas bboas at llnl.gov >ICCD LLNL, B-453, R-2018 Wk: 925-422-4110 >7000 East Ave, L-555 Cell: 925-337-2224 >Livermore, CA 94551 Pgr: 877-203-2248 Bill Boas bboas at llnl.gov ICCD LLNL, B-453, R-2018 Wk: 925-422-4110 7000 East Ave, L-555 Cell: 925-337-2224 Livermore, CA 94551 Pgr: 877-203-2248 From halr at voltaire.com Thu Nov 17 12:02:16 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Nov 2005 15:02:16 -0500 Subject: [openib-general] RE: IPoIB In-Reply-To: <437CD294.2050802@dbresearch.net> References: <437CD294.2050802@dbresearch.net> Message-ID: <1132257736.26731.441.camel@hal.voltaire.com> Hi Sean, On Thu, 2005-11-17 at 13:57, Sean Hubbell wrote: > Hal, > > I tried cycling the power to the switches with no luck. > I am running opensm (source code as of 11/15/2005). > We are running with 12 SBS 24 Port Switches. How can I see what ports came up at 1x? Do you have the management tools installed ? Find the LIDs of the switches with ibnetdiscover and then use smpquery portinfo to see what LinkWidthActive says. You would lose multicast connectivity if some ports were 1x and others were 4x so this is a good thing to check. -- Hal From liangs at cse.ohio-state.edu Thu Nov 17 12:13:10 2005 From: liangs at cse.ohio-state.edu (Shuang Liang) Date: Thu, 17 Nov 2005 15:13:10 -0500 Subject: [openib-general] High memory In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F104186E@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F104186E@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <437CE456.3010807@cse.ohio-state.edu> Thanks for your reply! I am trying to write a gen2 based kernel module here. 
Basically, what I am trying to do is to send a one-byte kernel buffer to a remote user process. I used get_dma_mr to get the MR beforehand on the sender side, so that I don't have to register memory each time I send the kernel buffer. Since I am working on an IA-32 machine, I assumed the bus address and the physical address are the same (I actually tried both virt_to_phys and dma_map_single() for address translation). What I noticed was that if the buffer is in high memory, the data is not received correctly, while if it is in low memory, things work. Does this mean the HCA is not able to address high memory, or am I missing something here? Thanks Shuang, Caitlin Bestler wrote: >openib-general-bounces at openib.org wrote: > > >>On Thu, Nov 17, 2005 at 12:33:22AM -0500, Shuang Liang wrote: >> >> >>>Hi, >>> I am new here with some problem of gen2 programming, hope somebody >>>can help me. I was trying to send a message from a kernel buffer to a >>>remote userland program on IA-32 machines. >>> Basically, what happened was I used get_dma_mr to get memory >>>registered. And I noticed if a buffer is allocated from high memory >>>(address >f8000000), then the data can not be delivered correctly to >>>the receiver side(both send recv completes successfully, but with >>>wrong data). I thought the problem could have been that I used >>>virt_to_phys for address translation. >>> >>> >>Using "virt_to_phys" is always wrong when trying to get a DMA mapping. >> >> >> >>>But I can't find any appropriate ones for high memory address >>>translation. I wondering if somebody could give me some suggestions >>>on this. >>> >>> >>I don't know enough context. >>Are you trying to write an openib kernel driver? >>If so you want to read "Documentation/DMA-API.txt" for hints >>on the available DMA mapping interfaces and then look at how >>SDP or IPoIB use the described DMA interfaces. >> >> >> > >That applies even if you are trying to write a kernel daemon >and wish to use high memory.
You either make it part of your >virtual memory map at least temporarily, so you can register >it, or you register the memory "physically". But "physical" >registration for RDMA is never really physical addresses, it >is always bus addresses, which means reading the stuff meant >for driver developers. > > > > > -- Shuang Liang, Graduate Administration Assistant, Department of Computer Science & Engineering The Ohio State University, 374 Dreese Labs, 2015 Neil Ave Columbus, Ohio 43210 614-292-1900, 614-292-2911 (fax) From halr at voltaire.com Thu Nov 17 12:37:08 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Nov 2005 15:37:08 -0500 Subject: [openib-general] Lustre over OpenIB Gen2 In-Reply-To: <52br0s1dw0.fsf@cisco.com> References: <00b801c5e625$1c4ffdf0$0281a8c0@ebpc> <52br0s1dw0.fsf@cisco.com> Message-ID: <1132259828.26731.588.camel@hal.voltaire.com> On Thu, 2005-11-10 at 14:50, Roland Dreier wrote: > Eric> Or is it better to use FMR pools and take the map/unmap > Eric> overhead? If so, is there a way to know when the unmap > Eric> actually hits the hardware and my memory is safe? > > FMRs are only supported on Mellanox HCAs at the moment.
But they do > have some advantages, like allowing you to convert a bunch of pages > into a single virtually contiguous region. You can use the > ib_flush_fmr_pool() function to make sure that all unmapped FMRs are > really and truly flushed, but that is a slow operation (since it > incurs the penalty of flushing all in-flight operations in the HCA). Assuming one had a machine with a mix of vendor HCAs (Mellanox, PathScale), how would one determine which hardware/driver is in use (say, to take advantage of a proprietary feature that is supported on only one of them)? Is it that EINVAL or the like is returned by the ones that don't support the feature? -- Hal From halr at voltaire.com Thu Nov 17 13:44:27 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Nov 2005 16:44:27 -0500 Subject: [openib-general] OpenSM and Wrong SM_Key In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E361893A@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E361893A@mtlexch01.mtl.com> Message-ID: <1132263866.26731.994.camel@hal.voltaire.com> On Sat, 2005-11-12 at 12:34, Eitan Zahavi wrote: > Hi Troy, > > Good to get a straight forward message. > > What I hear you saying is: > 1. There needs to be a parameter to control the SM behavior if it finds > another SM with non matching SM Key: > -> Either to ignore it or to die. We can do that. No problem! > > 2. The SM log file has too many errors. > -> Are you aware of the messages the SM sends to syslog ? We try and > make these the most important ones. If you feel some messages are > missing there - just let us know and we will fix it. I think that exiting is noncompliant and this should not be made into a command line option. -- Hal > -> The osm.log is intended for OpenSM errors reporting. These include > any error. We try our best to clean it up from un-needed events. But as > I say it is NOT the log you should use for getting the SM major events. > You should look at the /var/log/messages instead.
> > Eitan Zahavi > Design Technology Director > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > > -----Original Message----- > > From: Troy Benjegerdes [mailto:hozer at hozed.org] > > Sent: Friday, November 11, 2005 8:32 AM > > To: Eitan Zahavi > > Cc: Hal Rosenstock; openib-general at openib.org > > Subject: Re: [openib-general] OpenSM and Wrong SM_Key > > > > On Wed, Nov 09, 2005 at 09:46:06AM +0200, Eitan Zahavi wrote: > > > Hi Hal, > > > > > > I would like to bring this to MgtWG before we change anything. > > > IMO the situation when this happens is really not "legal" since if > the > > > SM's are not coordinated at least in their SM_Key it will cause the > two > > > masters on the subnet. > > > > > > >From our experience it is always better to cause a fatal flow and > exit > > > the SM rather then report the event in some log - normally it will > not > > > be seen ... > > > > > > I know this is a controversial issue. > > > > Okay, so you're telling me you *WANT* behavior where a rogue node can > > trivially cause the running subnet manager to exit and take over > > management of the network? > > > > Opensm needs to have a well documented config file, instead of 3 pages > > of command line options, and different levels of logging. What to do > in > > the above situation is a site-local policy config decision, not > something > > that should be hard-coded in the SM source code. > > > > The logs might actually get looked at if there wasn't junk in the log > > every time something timed out. > > > > The linux kernel has 'WARN, NOTICE, and CRITICAL' level log messages. 
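The request quoted above for different levels of logging can be illustrated with the standard syslog priority masks. This is a minimal sketch, not OpenSM code: passes_threshold() and report_event() are hypothetical helper names, and the threshold is assumed to come from a site-local configuration setting of the kind the message asks for. LOG_UPTO() and LOG_MASK() are the usual <syslog.h> macros (LOG_UPTO is provided by glibc and the BSDs).

```c
#include <syslog.h>

/*
 * Nonzero if a message at 'priority' is at least as severe as the
 * configured threshold 'max_priority'. Lower numbers are more severe,
 * so LOG_UPTO(LOG_NOTICE) covers LOG_EMERG through LOG_NOTICE.
 */
static int passes_threshold(int max_priority, int priority)
{
    return (LOG_UPTO(max_priority) & LOG_MASK(priority)) != 0;
}

/* Hypothetical wrapper: forward only events that pass the threshold. */
static void report_event(int max_priority, int priority, const char *msg)
{
    if (passes_threshold(max_priority, priority))
        syslog(priority, "%s", msg);
}
```

A daemon could get the same filtering process-wide by calling setlogmask(LOG_UPTO(LOG_NOTICE)) once at startup. The explicit helper is shown here only because it makes the threshold decision easy to test in isolation.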
From iod00d at hp.com Thu Nov 17 13:53:13 2005 From: iod00d at hp.com (Grant Grundler) Date: Thu, 17 Nov 2005 13:53:13 -0800 Subject: [openib-general] High memory In-Reply-To: <437CE456.3010807@cse.ohio-state.edu> References: <54AD0F12E08D1541B826BE97C98F99F104186E@NT-SJCA-0751.brcm.ad.broadcom.com> <437CE456.3010807@cse.ohio-state.edu> Message-ID: <20051117215313.GC27551@esmail.cup.hp.com> On Thu, Nov 17, 2005 at 03:13:10PM -0500, Shuang Liang wrote: > Thanks for your reply! > I am trying to write a gen2 based kernel module here. Basically, what > I am trying to do is to send a one byte kernel buffer to a remote user > process. I used get_dma_mr to get the mr before hand at the sender side, > so that I don't have to bother to register when sending the kernel > buffer. Since I am working on IA-32 machine, so I assume bus addr and > phys addr are the same(I tried both virt_to_phys and dma_map_single() > for addresss translation actually). dma_map_single() is the correct interface to get a bus address. > What I noticed was that if the buffer is above in high memory, the > data was not received correctly; while if in low memory, things work. > Does this mean HCA is not able to address high memory or maybe I am > missing sth here? Are you using a dma_addr_t (a 64-bit type) to store the return value? oh...you have CONFIG_HIGHMEM64G enabled? (See include/asm-i386/types.h and look for dma_addr_t) grant From halr at voltaire.com Thu Nov 17 13:49:31 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Nov 2005 16:49:31 -0500 Subject: [openib-general] OpenSM and Wrong SM_Key In-Reply-To: <20051112232652.GV3275@kalmia.hozed.org> References: <6AB138A2AB8C8E4A98B9C0C3D52670E361893A@mtlexch01.mtl.com> <20051112232652.GV3275@kalmia.hozed.org> Message-ID: <1132264171.26731.1040.camel@hal.voltaire.com> On Sat, 2005-11-12 at 18:26, Troy Benjegerdes wrote: > On Sat, Nov 12, 2005 at 07:34:44PM +0200, Eitan Zahavi wrote: > > Hi Troy, > > > > Good to get a straight forward message. 
> > > > What I hear you saying is: > > 1. There needs to be a parameter to control the SM behavior if it finds > > another SM with non matching SM Key: > > -> Either to ignore it or to die. We can do that. No problem! > > Is it possible to have another option as well, to attempt to disable the > port the SM with the non-matching key is connected to? Not sure you need an option for this. This is beyond the spec but even if you disabled the switch port across the link from the non matching SM, you might still have other nodes claimed by that SM which you couldn't access. You would still need to do something to clear that up. Also, if it is an embedded SM (on a switch) then there are numerous ports to disable.
-- Hal From liangs at cse.ohio-state.edu Thu Nov 17 14:34:06 2005 From: liangs at cse.ohio-state.edu (Shuang Liang) Date: Thu, 17 Nov 2005 17:34:06 -0500 Subject: [openib-general] High memory In-Reply-To: <20051117215313.GC27551@esmail.cup.hp.com> References: <54AD0F12E08D1541B826BE97C98F99F104186E@NT-SJCA-0751.brcm.ad.broadcom.com> <437CE456.3010807@cse.ohio-state.edu> <20051117215313.GC27551@esmail.cup.hp.com> Message-ID: <437D055E.1090503@cse.ohio-state.edu> >Are you using a dma_addr_t (a 64-bit type) to store the return value? > > Yes, ib_sge.addr is a u64 same as dma_addr_t. >oh...you have CONFIG_HIGHMEM64G enabled? >(See include/asm-i386/types.h and look for dma_addr_t) > > > Just found out this section in .config # CONFIG_NOHIGHMEM is not set # CONFIG_HIGHMEM4G is not set CONFIG_HIGHMEM64G=y CONFIG_HIGHMEM=y CONFIG_X86_PAE=y Maybe I should turn CONFIG_HIGHMEM4G to yes? My system is with 1G memory. Thanks! -- Shuang Liang, Graduate Administration Assistant, Department of Computer Science & Engineering The Ohio State University, 374 Dreese Labs, 2015 Neil Ave Columbus, Ohio 43210 614-292-1900, 614-292-2911 (fax) From halr at voltaire.com Thu Nov 17 14:36:50 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Nov 2005 17:36:50 -0500 Subject: [openib-general] another opensm crash In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3618976@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3618976@mtlexch01.mtl.com> Message-ID: <1132267009.26731.1427.camel@hal.voltaire.com> On Mon, 2005-11-14 at 14:54, Eitan Zahavi wrote: > Hi Troy > > Try to move aside your /lib/tls directory and see if you still get these > crashes. > We have issues with TLS pthread and glibc There are still strange crashes like this which appear to be memory scribbling issues. -- Hal > > Eitan Zahavi > Design Technology Director > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. 
Box 586 Yokneam 20692 ISRAEL > > > > -----Original Message----- > > From: Troy Benjegerdes [mailto:troy at scl.ameslab.gov] > > Sent: Monday, November 14, 2005 8:09 PM > > To: openib-general at openib.org > > Subject: [openib-general] another opensm crash > > > > (gdb) bt > > #0 0x08071ff3 in osm_si_rcv_process (p_rcv=0x8090138, > p_madw=0x80a1de0) > > at osm_sw_info_rcv.c:679 > > #1 0xb7fb0213 in __cl_disp_worker (context=0x8090da4) at > > cl_dispatcher.c:108 > > #2 0xb7fb8557 in __cl_thread_pool_routine (context=0x8090de4) > > at cl_threadpool.c:78 > > #3 0xb7fb834d in __cl_thread_wrapper (arg=0x8091408) at > cl_thread.c:61 > > #4 0x46cde341 in start_thread () from /lib/tls/libpthread.so.0 > > #5 0x46b6e6fe in clone () from /lib/tls/libc.so.6 > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From iod00d at hp.com Thu Nov 17 15:33:11 2005 From: iod00d at hp.com (Grant Grundler) Date: Thu, 17 Nov 2005 15:33:11 -0800 Subject: [openib-general] High memory In-Reply-To: <437D055E.1090503@cse.ohio-state.edu> References: <54AD0F12E08D1541B826BE97C98F99F104186E@NT-SJCA-0751.brcm.ad.broadcom.com> <437CE456.3010807@cse.ohio-state.edu> <20051117215313.GC27551@esmail.cup.hp.com> <437D055E.1090503@cse.ohio-state.edu> Message-ID: <20051117233311.GF27551@esmail.cup.hp.com> On Thu, Nov 17, 2005 at 05:34:06PM -0500, Shuang Liang wrote: > Yes, ib_sge.addr is a u64 same as dma_addr_t. I agree u64 should be ok for this particular case. But it should be dma_addr_t, not u64. You are using a mellanox (ib_mthca) device? 
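The reason to prefer dma_addr_t over a hand-picked integer type can be shown with a small userspace sketch. It is not kernel code and makes one assumption worth stating: that the width of a DMA handle can differ between kernel configurations (the pointer to include/asm-i386/types.h and CONFIG_HIGHMEM64G in this thread suggests so for i386). The typedefs bus_addr32_t and bus_addr64_t and the helper fits_in_32bit_handle() are hypothetical names used only here.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the two widths a DMA handle might have. */
typedef uint32_t bus_addr32_t;
typedef uint64_t bus_addr64_t;

/*
 * Nonzero if 'addr' survives a round trip through a 32-bit handle.
 * An address at or above 4 GB loses its upper bits when it is stored
 * in the narrow type, and the device would then be given the wrong
 * location.
 */
static int fits_in_32bit_handle(bus_addr64_t addr)
{
    bus_addr32_t narrow = (bus_addr32_t)addr;

    return (bus_addr64_t)narrow == addr;
}
```

Code that keeps the handle in whatever type the mapping interface returns never has to make this check, which is the practical argument for using dma_addr_t throughout.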
> Just found out this section in .config > # CONFIG_NOHIGHMEM is not set > # CONFIG_HIGHMEM4G is not set > CONFIG_HIGHMEM64G=y > CONFIG_HIGHMEM=y > CONFIG_X86_PAE=y > Maybe I should turn CONFIG_HIGHMEM4G to yes? Only if you want to ignore mem physically located above 4GB. I thought only one of the models (NOHIGHMEM, 4G or 64G) could be selected. > My system is with 1G memory. Can you post /proc/iomem and /proc/meminfo output? thanks, grant From moschny at ipd.uni-karlsruhe.de Thu Nov 17 15:14:44 2005 From: moschny at ipd.uni-karlsruhe.de (Thomas Moschny) Date: Fri, 18 Nov 2005 00:14:44 +0100 Subject: [openib-general] Announce: preview RPMs for FC-4 and RHEL-4 available In-Reply-To: <437C9063.70607@redhat.com> References: <437A8592.1000408@redhat.com> <200511161758.46379.moschny@ipd.uni-karlsruhe.de> <437C9063.70607@redhat.com> Message-ID: <200511180014.54566.moschny@ipd.uni-karlsruhe.de> On Thursday 17 November 2005 15:14, Doug Ledford wrote: > Thomas Moschny wrote: > > Unfortunately, we got an kernel-oops on ia64 (rhel4) ... > > The boot log is attached. > > I think I know what this is. [...] > The attached patch should be able to be dropped into the existing srpm > in place of the patch with the same name and a rebuild should then solve > the problem, although in the process of creating this patch I had to > move it from the 2700 section of the patch list down to the 10002 > position because it touches things added after the infiniband code. The patch seems to work here, thanks. The machines are up now, and at least IPoIB is working. There seems to be a (minor?) 
problem with opensm -o, it aborts: ------------------------------------------------- OpenSM Rev:openib-1.1.0 Command Line Arguments: Run Once Log File: /var/log/osm.log ------------------------------------------------- OpenSM Rev:openib-1.1.0 Using default guid 0xxxxxxxxxxxxxxx Entering MASTER state SUBNET UP Exiting SM *** glibc detected *** double free or corruption (!prev): 0x6000000000067970 *** Aborted Subsequent runs of opensm hang in flush_cpu_workqueue or rwsem_down_failed_common. - Thomas -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL:
From liangs at cse.ohio-state.edu Thu Nov 17 16:16:50 2005 From: liangs at cse.ohio-state.edu (Shuang Liang) Date: Thu, 17 Nov 2005 19:16:50 -0500 Subject: [openib-general] High memory In-Reply-To: <20051117233311.GF27551@esmail.cup.hp.com> References: <54AD0F12E08D1541B826BE97C98F99F104186E@NT-SJCA-0751.brcm.ad.broadcom.com> <437CE456.3010807@cse.ohio-state.edu> <20051117215313.GC27551@esmail.cup.hp.com> <437D055E.1090503@cse.ohio-state.edu> <20051117233311.GF27551@esmail.cup.hp.com> Message-ID: <437D1D72.5030100@cse.ohio-state.edu> >I agree u64 should be ok for this particular case. >But it should be dma_addr_t, not u64. > >You are using a mellanox (ib_mthca) device? > > > Yes, part of lspci output: 02:02.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev a1) 03:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1) >Can you post /proc/iomem and /proc/meminfo output?
> > # cat /proc/iomem 00000000-0009fbff : System RAM 0009fc00-0009ffff : reserved 000a0000-000bffff : Video RAM area 000c0000-000c7fff : Video ROM 000c8000-000c8fff : Adapter ROM 000c9000-000d0fff : Adapter ROM 000d1000-000d65ff : Adapter ROM 000f0000-000fffff : System ROM 00100000-3ffeffff : System RAM 00100000-00358542 : Kernel code 00358543-0041ee07 : Kernel data 3fff0000-3fffefff : ACPI Tables 3ffff000-3fffffff : ACPI Non-volatile Storage a0000000-bfffffff : 0000:01:02.0 c0c00000-c0dfffff : 0000:01:02.0 c0efc000-c0efffff : 0000:01:02.0 c0f00000-d1ffffff : PCI Bus #03 c8000000-cfffffff : 0000:03:00.0 c8000000-cfffffff : ib_mthca d1800000-d1ffffff : 0000:03:00.0 d1800000-d1ffffff : ib_mthca e0000000-efffffff : 0000:04:02.0 f4000000-f7ffffff : 0000:04:02.0 fb000000-fbffffff : 0000:00:02.0 fc3a0000-fc3bffff : 0000:00:04.0 fc3a0000-fc3bffff : e100 fc3c0000-fc3dffff : 0000:00:02.0 fc3e0000-fc3effff : 0000:00:04.0 fc3fd000-fc3fdfff : 0000:00:04.0 fc3fd000-fc3fdfff : e100 fc3fe000-fc3fefff : 0000:00:0f.2 fc3fe000-fc3fefff : ohci_hcd fc3ff000-fc3fffff : 0000:00:02.0 fc4f0000-fc4fffff : 0000:01:02.0 fc500000-fc7fffff : PCI Bus #03 fc700000-fc7fffff : 0000:03:00.0 fc780680-fc78069b : ib_mthca fc780700-fc78070f : ib_mthca fc7f00d8-fc7f00df : ib_mthca fd000000-fdffffff : 0000:05:02.0 feb00000-feb7ffff : 0000:05:02.0 feba0000-febbffff : 0000:05:03.0 febc0000-febdffff : 0000:05:03.1 febfe000-febfefff : 0000:05:03.0 febfe000-febfefff : aic7xxx febff000-febfffff : 0000:05:03.1 febff000-febfffff : aic7xxx fec00000-fec03fff : reserved fee00000-fee00fff : reserved fff80000-ffffffff : reserved # cat /proc/meminfo MemTotal: 1034396 kB MemFree: 922376 kB Buffers: 15056 kB Cached: 55404 kB SwapCached: 0 kB Active: 37032 kB Inactive: 43756 kB HighTotal: 131008 kB HighFree: 60780 kB LowTotal: 903388 kB LowFree: 861596 kB SwapTotal: 2048276 kB SwapFree: 2048276 kB Dirty: 140 kB Writeback: 0 kB Mapped: 15596 kB Slab: 15136 kB CommitLimit: 2565472 kB Committed_AS: 42328 kB PageTables: 956 
kB VmallocTotal: 116728 kB VmallocUsed: 37164 kB VmallocChunk: 79364 kB HugePages_Total: 0 HugePages_Free: 0 Hugepagesize: 2048 kB Thanks -Shuang, -- Shuang Liang, Graduate Administration Assistant, Department of Computer Science & Engineering The Ohio State University, 374 Dreese Labs, 2015 Neil Ave Columbus, Ohio 43210 614-292-1900, 614-292-2911 (fax) From xma at us.ibm.com Thu Nov 17 16:20:57 2005 From: xma at us.ibm.com (Shirley Ma) Date: Thu, 17 Nov 2005 16:20:57 -0800 Subject: [openib-general] [PATCH 1 of 2] Re: ipoib oops In-Reply-To: <20051116180629.GA25683@mellanox.co.il> Message-ID: Hi, Michael, Does the patch you are working on address below problem? rmmod hcad_mod hung in ipoib. This is the task stacks I got on linux-2.6.14 + SVN 4044. rmmod D 000000000ff7d030 11248 15681 14007 14061 (NOTLB) Call Trace: [c00000006dd9af10] [c00000006dd9b0f0] 0xc00000006dd9b0f0 (unreliable) [c00000006dd9b0e0] [c00000000000e468] .__switch_to+0x104/0x180 [c00000006dd9b170] [c000000000394528] .schedule+0x774/0xf34 [c00000006dd9b2b0] [c000000000069238] .flush_cpu_workqueue+0xdc/0x2c0 [c00000006dd9b3c0] [d00000000041f77c] .ipoib_mcast_stop_thread+0x1cc/0x274 [ib_ipoib] [c00000006dd9b480] [d00000000041c880] .ipoib_ib_dev_down+0x124/0x194 [ib_ipoib] [c00000006dd9b520] [d00000000041aed4] .ipoib_stop+0x98/0x188 [ib_ipoib] [c00000006dd9b5b0] [c0000000002e79e4] .dev_close+0x110/0x118 [c00000006dd9b630] [c0000000002e7afc] .unregister_netdevice+0x110/0x320 [c00000006dd9b6c0] [c0000000002e7d30] .unregister_netdev+0x24/0x40 [c00000006dd9b750] [d00000000041b104] .ipoib_remove_one+0x64/0xc0 [ib_ipoib] [c00000006dd9b7e0] [d00000000024e8fc] .ib_unregister_device+0xb4/0x1b4 [ib_core][c00000006dd9b880] [d00000000054a1fc] .ehca_remove+0x9c/0x714 [hcad_mod] ipoib D 0000000000000000 13552 13326 19 13712 3575 (L-TLB) Call Trace: [c0000000f679f760] [c0000000f679f870] 0xc0000000f679f870 (unreliable) [c0000000f679f930] [c00000000000e468] .__switch_to+0x104/0x180 [c0000000f679f9c0] [c000000000394528] 
.schedule+0x774/0xf34 [c0000000f679fb00] [c000000000395310] .wait_for_completion+0xc8/0x140 [c0000000f679fbf0] [d00000000041f7c8] .ipoib_mcast_stop_thread+0x218/0x274 [ib_ipoib] [c0000000f679fcb0] [d00000000041f9c0] .ipoib_mcast_restart_task+0x90/0x410 [ib_ipoib] [c0000000f679fdb0] [c000000000068a8c] .worker_thread+0x244/0x320 [c0000000f679fed0] [c00000000006fbf8] .kthread+0x178/0x1c8 [c0000000f679ff90] [c0000000000100b8] .kernel_thread+0x4c/0x68 Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 From mshefty at ichips.intel.com Thu Nov 17 16:24:00 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 17 Nov 2005 16:24:00 -0800 Subject: [openib-general] rdma_listen() backlog parameter In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F104189D@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F104189D@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <437D1F20.4010902@ichips.intel.com> > Agreed, although I would think of it in terms of how many > connection requests the ULP is willing to accept and then > secondly as to what resources the CM/driver/device needs. I re-added the backlog parameter to the kernel CMA. From userspace, the backlog is the maximum number of connection request events that will be queued for the user. From the kernel, the backlog is the number of resources needed by the underlying device or transport.
- Sean From halr at voltaire.com Thu Nov 17 16:23:32 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Nov 2005 19:23:32 -0500 Subject: [openib-general] another opensm crash In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3618976@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3618976@mtlexch01.mtl.com> Message-ID: <1132267112.26731.1448.camel@hal.voltaire.com> On Mon, 2005-11-14 at 14:54, Eitan Zahavi wrote: > Hi Troy > > Try to move aside your /lib/tls directory and see if you still get these > crashes. > We have issues with TLS pthread and glibc There are still strange crashes like this which appear to be memory scribbling issues. Moving tls aside changes the threads into processes. Does that indicate that threading issues are suspected ? -- Hal > > Eitan Zahavi > Design Technology Director > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > > -----Original Message----- > > From: Troy Benjegerdes [mailto:troy at scl.ameslab.gov] > > Sent: Monday, November 14, 2005 8:09 PM > > To: openib-general at openib.org > > Subject: [openib-general] another opensm crash > > > > (gdb) bt > > #0 0x08071ff3 in osm_si_rcv_process (p_rcv=0x8090138, > p_madw=0x80a1de0) > > at osm_sw_info_rcv.c:679 > > #1 0xb7fb0213 in __cl_disp_worker (context=0x8090da4) at > > cl_dispatcher.c:108 > > #2 0xb7fb8557 in __cl_thread_pool_routine (context=0x8090de4) > > at cl_threadpool.c:78 > > #3 0xb7fb834d in __cl_thread_wrapper (arg=0x8091408) at > cl_thread.c:61 > > #4 0x46cde341 in start_thread () from /lib/tls/libpthread.so.0 > > #5 0x46b6e6fe in clone () from /lib/tls/libc.so.6 > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > _______________________________________________ > openib-general mailing 
list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From caitlinb at broadcom.com Thu Nov 17 16:31:17 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 17 Nov 2005 16:31:17 -0800 Subject: [openib-general] rdma_listen() backlog parameter Message-ID: <54AD0F12E08D1541B826BE97C98F99F1025E1F@NT-SJCA-0751.brcm.ad.broadcom.com> -----Original Message----- From: Sean Hefty [mailto:mshefty at ichips.intel.com] Sent: Thu 11/17/2005 4:24 PM To: Caitlin Bestler Cc: Steve Wise; openib-general at openib.org Subject: Re: [openib-general] rdma_listen() backlog parameter > Agreed, although I would think of it in terms of how many > connection requests the ULP is willing to accept and then > secondly as to what resources the CM/driver/device needs. I re-added the backlog parameter to the kernel CMA. From userspace, the backlog is the maximum number of connection request events that will be queued for the user. From the kernel, the backlog is the number of resources needed by the underlying device or transport. - Sean ------ start reply Even in kernel sticking with the "largest number of connection requests I'm willing to handle" provides for transport neutral semantics. The "number of resources" assumes we know what those resources actually are. For example, over IP, you have two distinct set of resources: the TCP connection backlog and the number of open connections on which MPA mode is being negotiated. Those differences can be swept under the rug if a backlog of "N" means that the CM should have sufficient resources to generate N connection requests, and promises not to inflict the N+1st request on the application. I suppose a case could be made for this being a range (from I want at least X but no more than Y), but there's a lot of tradition behind having a single number to represent the connection backlog. 
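To make the "backlog of N" semantics above concrete, here is a minimal sketch in plain C. It is not CMA code and does not reflect how the kernel CMA stores pending requests; `conn_req_queue`, `crq_init`, `crq_offer`, `crq_accept` and `BACKLOG_MAX` are hypothetical names invented for this illustration. The only property it tries to show is that up to N connection requests are held for the consumer and the N+1st is turned away rather than delivered.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical compile-time cap on the backlog, for this sketch only. */
#define BACKLOG_MAX 64

/* Pending connection requests, kept in a small ring buffer. */
struct conn_req_queue {
    int    ids[BACKLOG_MAX]; /* ids of queued connection requests        */
    size_t head;             /* index of the oldest queued request       */
    size_t count;            /* number of requests currently queued      */
    size_t backlog;          /* most requests the consumer will hold (N) */
};

/* Set up an empty queue that holds at most 'backlog' pending requests. */
static void crq_init(struct conn_req_queue *q, size_t backlog)
{
    assert(backlog <= BACKLOG_MAX);
    q->head = 0;
    q->count = 0;
    q->backlog = backlog;
}

/*
 * Offer a new connection request to the listener.
 * Returns 0 if it was queued, -1 if the backlog is already full,
 * i.e. the N+1st request is not inflicted on the application.
 */
static int crq_offer(struct conn_req_queue *q, int req_id)
{
    if (q->count == q->backlog)
        return -1;
    q->ids[(q->head + q->count) % BACKLOG_MAX] = req_id;
    q->count++;
    return 0;
}

/*
 * Hand the oldest pending request to the consumer.
 * Returns 0 and stores the id in *req_id, or -1 if nothing is pending.
 */
static int crq_accept(struct conn_req_queue *q, int *req_id)
{
    if (q->count == 0)
        return -1;
    *req_id = q->ids[q->head];
    q->head = (q->head + 1) % BACKLOG_MAX;
    q->count--;
    return 0;
}
```

With `crq_init(&q, 2)`, two calls to `crq_offer()` succeed and a third fails until `crq_accept()` has drained one entry. A transport that needs more than one kind of resource per request (for example the TCP backlog plus connections still negotiating MPA) could presumably hide that behind the same single number, which is the point being argued above.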
From halr at voltaire.com Thu Nov 17 16:29:53 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Nov 2005 19:29:53 -0500 Subject: [openib-general] OpenSM size In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3618975@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3618975@mtlexch01.mtl.com> Message-ID: <1132273793.26731.2326.camel@hal.voltaire.com> On Mon, 2005-11-14 at 14:53, Eitan Zahavi wrote: > The only parts that are not absolutely required are some SA request > handlers. > Mostly ServiceRecords, NodeInfoRecord, SwitchInfoRecord,LinkRecord. It really depends on your specific requirements but if IBA spec compliance is one of them then ServiceRecord, NodeRecord, and SwitchInfoRecord are required and LinkRecord is optional but if you do one optional attribute, they are all required (except MCMemberRecord, TraceRecord, and MultiPathRecord). -- Hal From dledford at redhat.com Thu Nov 17 17:09:23 2005 From: dledford at redhat.com (Doug Ledford) Date: Thu, 17 Nov 2005 20:09:23 -0500 Subject: [openib-general] Announce: preview RPMs for FC-4 and RHEL-4 available In-Reply-To: <200511180014.54566.moschny@ipd.uni-karlsruhe.de> References: <437A8592.1000408@redhat.com> <200511161758.46379.moschny@ipd.uni-karlsruhe.de> <437C9063.70607@redhat.com> <200511180014.54566.moschny@ipd.uni-karlsruhe.de> Message-ID: <437D29C3.7000408@redhat.com> Thomas Moschny wrote: > On Thursday 17 November 2005 15:14, Doug Ledford wrote: > >>Thomas Moschny wrote: >> >>>Unfortunately, we got an kernel-oops on ia64 (rhel4) ... >>>The boot log is attached. >> >>I think I know what this is. [...] >>The attached patch should be able to be dropped into the existing srpm >>in place of the patch with the same name and a rebuild should then solve >>the problem, although in the process of creating this patch I had to >>move it from the 2700 section of the patch list down to the 10002 >>position because it touches things added after the infiniband code. 
> > > The patch seems to work here, thanks. The machines are up now, and at least > IPoIB is working. > > There seems to be a (minor?) problem with opensm -o, it aborts: > > ------------------------------------------------- > OpenSM Rev:openib-1.1.0 > Command Line Arguments: > Run Once > Log File: /var/log/osm.log > ------------------------------------------------- > OpenSM Rev:openib-1.1.0 > > Using default guid 0xxxxxxxxxxxxxxx > Entering MASTER state > > SUBNET UP > > Exiting SM > > *** glibc detected *** double free or corruption (!prev): 0x6000000000067970 > *** > Aborted There is actually an init script for opensm that can be enabled on one machine in the subnet (I suppose you could do more if you assigned priorities to the machines). It seems to run fine, but issues this same message on shutdown. So, at least on x86_64, that much is similar, opensm issues this warning on shutdown. > Subsequent runs of opensm hang in flush_cpu_workqueue or > rwsem_down_failed_common. However, I don't see this on x86_64. -- Doug Ledford http://people.redhat.com/dledford From halr at voltaire.com Thu Nov 17 17:05:06 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Nov 2005 20:05:06 -0500 Subject: [openib-general] SRP device management client (and a few opensmglitches) In-Reply-To: <528xvq7ws1.fsf@cisco.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F589AAEC@taurus.voltaire.com> <528xvq7ws1.fsf@cisco.com> Message-ID: <1132275905.26731.2649.camel@hal.voltaire.com> On Tue, 2005-11-15 at 10:31, Roland Dreier wrote: > Hal> Are those the only component fields which were zeroed out ? > > No, for example the capability mask is all 0 as well. Capability mask does not apply to a switch physical (external) port so this could well be the case. > It seems to be something a little bit more complicated. In my test > fabric, which has a single 24-port switch with hosts connected to > both port 1 and port 2 (of the switch), I see no extra entries for my > port 1 query. 
For my port 2 query, I get exactly 24 entries with a > base_lid of zero and a local port number of 2. The PortNum in the SA > header goes from 1 through 24. Local port number is just the port which received the request. What exactly is the query? What is its component mask? -- Hal From halr at voltaire.com Thu Nov 17 17:15:05 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Nov 2005 20:15:05 -0500 Subject: [openib-general] SRP device management client (and a few opensmglitches) In-Reply-To: <52hdad7b5b.fsf@cisco.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E361898D@mtlexch01.mtl.com> <52hdad7b5b.fsf@cisco.com> Message-ID: <1132276504.26731.2750.camel@hal.voltaire.com> On Tue, 2005-11-15 at 18:18, Roland Dreier wrote: > Eitan> Hi Roland, Now I get it ! In the port info record there is > Eitan> a field named LocalPortNumber. This field is NOT the port > Eitan> number the data is about. It is the port number the packet > Eitan> of the query came from. (see table 145 p823 l-38). When > Eitan> OpenSM obtains the PortInfo associated with that particular > Eitan> port it did so through port 2 of the switch. So the > Eitan> LocalPortNum is set to 2. > > Eitan> If you are interested with the real port number, it is > Eitan> located in the RID second field, of the PortInfoRecord > Eitan> returned by the SA GetTable query. > > OK, I guess that makes sense (although how to interpret the > LocalPortNumber of the attribute when it is contained in the SA record > seems slightly unclear). > > It seems that if I want to get a list of (say) all the HCA port 2s in > the network, I have to do a get table query of the SA PortInfoRecord > with component mask set so that I get ports with LocalPortNumber 2, > and then filter out switch ports (since the port number field in the > RID is not defined for HCA ports). Do you think this is really what > was intended? It's not LocalPortNum.
I don't see any easy way to query all port 2s in the network as for HCAs and routers the PortNum component is reserved (and you need to supply port LID). Doing that would only get you all the switch port 2s. Perhaps the spec should be amended if this is useful. If the spec were amended like this, you would get all the port 2s in the network and would then need to determine whether they were on a switch on xCA. Do you think this is useful ? Shall I file a comment on this ? We need to move fast to get this into 1.2 errata. I think OpenSM is actually doing what you did ask (with LocalPortNum = 2 and that component mask bit on). -- Hal From halr at voltaire.com Thu Nov 17 17:28:16 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Nov 2005 20:28:16 -0500 Subject: [openib-general] SRP device management client (and a few opensmglitches) In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3618990@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3618990@mtlexch01.mtl.com> Message-ID: <1132277296.26731.2871.camel@hal.voltaire.com> On Wed, 2005-11-16 at 05:45, Eitan Zahavi wrote: > > > > It seems that if I want to get a list of (say) all the HCA port 2s in > > the network, I have to do a get table query of the SA PortInfoRecord > > with component mask set so that I get ports with LocalPortNumber 2, > > and then filter out switch ports (since the port number field in the > > RID is not defined for HCA ports). Do you think this is really what > > was intended? > [EZ] Hi Roland, > > What you have to do to get the HCA ports PortInfo is: > 1. Get all NodeInfoRecord for node_type == HCA: > comp_mask = 0x5 LID should not be turned on; only NodeType in NodeInfo so component mask should be 0x10 (bit 4).. > 2. For each such node send a PortInfoRecord query by the port number: > comp_mask = 0x33 The component mask should be 0x3 for EndportLID and PortNum. > PortInfoRecord *p_pir; > p_pir->lid = lid_no; > p_pir->port_num = port_num; > Do GetTable(p_pir) Why GetTable ? 
I think it could just be a Get as there is only 1 expected. -- Hal From halr at voltaire.com Thu Nov 17 18:00:58 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 18 Nov 2005 04:00:58 +0200 Subject: [openib-general] Re: [Sc05-ib] Opensm crash.. Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AB1B@taurus.voltaire.com> On Wed, 2005-11-16 at 00:10, Troy Benjegerdes wrote: > This was running with -maxsmps=32 > > Program received signal SIGSEGV, Segmentation fault. > [Switching to Thread 98311 (LWP 31196)] > 0xb7f71f29 in osm_log (p_log=0x0, verbosity=16 '\020', > p_str=0x80787cd "%s: [\n") at osm_log.c:137 > 137 if (p_log->level & verbosity) > (gdb) bt > #0 0xb7f71f29 in osm_log (p_log=0x0, verbosity=16 '\020', > p_str=0x80787cd "%s: [\n") at osm_log.c:137 > #1 0x0807755b in osm_vl15_poll (p_vl=0x8090ca4) at osm_vl15intf.c:410 > #2 0x0806b1fc in __osm_sm_mad_ctrl_update_wire_stats (p_ctrl=0x8090110) > at osm_sm_mad_ctrl.c:228 > #3 0x0806b6e0 in __osm_sm_mad_ctrl_rcv_callback (p_madw=0xb4c29390, > bind_context=0x8090110, p_req_madw=0x89cf1a8) at osm_sm_mad_ctrl.c:270 > #4 0xb7f3f821 in umad_receiver (p_ptr=0x80ccce8) at osm_vendor_ibumad.c:401 > #5 0xb7f6c617 in __cl_thread_wrapper (arg=0x0) at cl_thread.c:61 > #6 0x46d86ce1 in pthread_start_thread () from /lib/i686/libpthread.so.0 > #7 0x46d86e51 in pthread_start_thread_event () from > /lib/i686/libpthread.so.0 > #8 0x46c16d3a in clone () from /lib/i686/libc.so.6 > (gdb) print p_log > $1 = (osm_log_t * const) 0x0 > (gdb) up > #1 0x0807755b in osm_vl15_poll (p_vl=0x8090ca4) at osm_vl15intf.c:410 > 410 OSM_LOG_ENTER( p_vl->p_log, osm_vl15_poll ); > (gdb) print p_vl > $2 = (osm_vl15_t * const) 0x8090ca4 > (gdb) print p_vl->p_log > $3 = (osm_log_t *) 0x0 > (gdb) print *p_vl > $4 = {thread_state = OSM_THREAD_STATE_RUN, state = OSM_VL15_STATE_READY, > max_wire_smps = 32, signal = {condvar = {__c_lock = {__status = 0, > __spinlock = 0}, __c_waiting =
0x80a9940, > __padding = '\0' , __align = 0}, signaled = 0, > manual_reset = 0, spinlock = {mutex = {__m_reserved = 0, __m_count = 0, > __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, > __spinlock = 0}}, state = CL_INITIALIZED}, state = > CL_INITIALIZED}, > poller = {osd = {id = 65541, state = CL_INITIALIZED}, > pfn_callback = 0x8076d6c <__osm_vl15_poller>, context = 0x8090ca4, > name = '\0' }, rfifo = {end = {p_next = 0xab44b760, > p_prev = 0x8090c6c}, count = 135010200, state = 0}, ufifo = {end = { > p_next = 0x100, p_prev = 0x0}, count = 148497440, state = 135010200}, > lock = {mutex = {__m_reserved = 0, __m_count = 0, __m_owner = 0x0, > __m_kind = 0, __m_lock = {__status = 0, __spinlock = 0}}, state = 0}, > p_vend = 0x0, p_log = 0x0, p_stats = 0x0, p_subn = 0x0, h_disp = 0x0, > p_lock = 0x0} > (gdb) up > #2 0x0806b1fc in __osm_sm_mad_ctrl_update_wire_stats (p_ctrl=0x8090110) > at osm_sm_mad_ctrl.c:228 > 228 osm_vl15_poll( p_ctrl->p_vl15 ); > This looks like another memory scribbling issue. This time p_log was cleared. -- Hal From rolandd at cisco.com Thu Nov 17 20:46:53 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 17 Nov 2005 20:46:53 -0800 Subject: [openib-general] High memory In-Reply-To: <437C1622.5070505@cse.ohio-state.edu> (Shuang Liang's message of "Thu, 17 Nov 2005 00:33:22 -0500") References: <437C1622.5070505@cse.ohio-state.edu> Message-ID: <52acg25zr6.fsf@cisco.com> Shuang> Hi, I am new here with some problem of gen2 programming, Shuang> hope somebody can help me. I was trying to send a message Shuang> from a kernel buffer to a remote userland program on IA-32 Shuang> machines. Basically, what happened was I used get_dma_mr Shuang> to get memory registered. And I noticed if a buffer is Shuang> allocated from high memory (address >f8000000), then the Shuang> data can not be delivered correctly to the receiver Shuang> side(both send recv completes successfully, but with wrong Shuang> data). 
I thought the problem could have been that I used Shuang> virt_to_phys for address translation. But I can't find any Shuang> appropriate ones for high memory address translation. I Shuang> wondering if somebody could give me some suggestions on Shuang> this. Yes, virt_to_phys is essentially always wrong. I'm not sure I really understand what you're doing. By high memory do you mean "highmem"? If so your comment about address > f8000000 doesn't make sense to me -- the definition of highmem is that it does not have any kernel mapping at all. The right way to use the MR from get_dma_mr() is to use "bus addresses" from the DMA mapping API. For highmem, the right way to get those addresses is with dma_map_sg() or dma_map_page(). - R. From rolandd at cisco.com Thu Nov 17 20:50:39 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 17 Nov 2005 20:50:39 -0800 Subject: [openib-general] SRP device management client (and a few opensmglitches) In-Reply-To: <1132276504.26731.2750.camel@hal.voltaire.com> (Hal Rosenstock's message of "17 Nov 2005 20:15:05 -0500") References: <6AB138A2AB8C8E4A98B9C0C3D52670E361898D@mtlexch01.mtl.com> <52hdad7b5b.fsf@cisco.com> <1132276504.26731.2750.camel@hal.voltaire.com> Message-ID: <5264qq5zkw.fsf@cisco.com> Hal> It's not LocalPortNum. I don't see any easy way to query all Hal> port 2s in the network as for HCAs and routers the PortNum Hal> component is reserved (and you need to supply port Hal> LID). Doing that would only get you all the switch port Hal> 2s. Perhaps the spec should be amended if this is useful. If Hal> the spec were amended like this, you would get all the port Hal> 2s in the network and would then need to determine whether Hal> they were on a switch on xCA. Do you think this is useful ? Hal> Shall I file a comment on this ? We need to move fast to get Hal> this into 1.2 errata. I think there are two comments that I would file against the spec: - LocalPortNum for SA PortInfoRecord queries is not well defined. 
It doesn't make sense to me that it should be whichever port the SM happened to discover a switch through, since that could change essentially at random if a switch is multiply connected to the fabric. - It would be nice to be able to do an SA GetTable() query to get all PortInfo records with the IsDeviceManagement bit set in the capability mask (that's what I'm really after in this case). As it stands now, all that it is possible to do is to set the component mask so that all ports with an exact match for the full capability mask field are returned, and that's pretty useless. - R. From halr at voltaire.com Thu Nov 17 22:23:42 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 18 Nov 2005 01:23:42 -0500 Subject: [openib-general] SRP device management client (and a few opensmglitches) In-Reply-To: <5264qq5zkw.fsf@cisco.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E361898D@mtlexch01.mtl.com> <52hdad7b5b.fsf@cisco.com> <1132276504.26731.2750.camel@hal.voltaire.com> <5264qq5zkw.fsf@cisco.com> Message-ID: <1132295021.26731.5318.camel@hal.voltaire.com> On Thu, 2005-11-17 at 23:50, Roland Dreier wrote: > Hal> It's not LocalPortNum. I don't see any easy way to query all > Hal> port 2s in the network as for HCAs and routers the PortNum > Hal> component is reserved (and you need to supply port > Hal> LID). Doing that would only get you all the switch port > Hal> 2s. Perhaps the spec should be amended if this is useful. If > Hal> the spec were amended like this, you would get all the port > Hal> 2s in the network and would then need to determine whether > Hal> they were on a switch on xCA. Do you think this is useful ? > Hal> Shall I file a comment on this ? We need to move fast to get > Hal> this into 1.2 errata. > > I think there are two comments that I would file against the spec: > > - LocalPortNum for SA PortInfoRecord queries is not well defined. 
Rather than not well defined, it's not very useful as an SA search field > It > doesn't make sense to me that it should be whichever port the SM > happened to discover a switch through, since that could change > essentially at random if a switch is multiply connected to the fabric. I think that's why you would want to use PortNum in the RID rather than this. However, this is a reserved field for HCAs (and routers). Hence, my original comment to make this useful as a search field. > - It would be nice to be able to do an SA GetTable() query to get all > PortInfo records with the IsDeviceManagement bit set in the > capability mask (that's what I'm really after in this case). As it > stands now, all that it is possible to do is to set the component > mask so that all ports with an exact match for the full capability > mask field are returned, and that's pretty useless. Yes, that's a better way to get this and would be a perfect query for this. I will file both of these tomorrow. If there are others you know about that would be useful, now's a good time to discuss them with Ted and/or me. -- Hal From mst at mellanox.co.il Fri Nov 18 02:37:55 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 18 Nov 2005 12:37:55 +0200 Subject: [openib-general] [PATCH 1 of 2] Re: ipoib oops In-Reply-To: References: Message-ID: <20051118103755.GA12546@mellanox.co.il> Could be. Quoting Shirley Ma : > Subject: Re: [openib-general] [PATCH 1 of 2] Re: ipoib oops > > > Hi, Michael, > > Does the patch you are working on address below problem? rmmod hcad_mod > hung in ipoib. This is the task stacks I got on linux-2.6.14 + SVN 4044. 
> > > rmmod D 000000000ff7d030 11248 15681 14007 14061 > (NOTLB) > Call Trace: > [c00000006dd9af10] [c00000006dd9b0f0] 0xc00000006dd9b0f0 (unreliable) > [c00000006dd9b0e0] [c00000000000e468] .__switch_to+0x104/0x180 > [c00000006dd9b170] [c000000000394528] .schedule+0x774/0xf34 > [c00000006dd9b2b0] [c000000000069238] .flush_cpu_workqueue+0xdc/0x2c0 > [c00000006dd9b3c0] [d00000000041f77c] > .ipoib_mcast_stop_thread+0x1cc/0x274 > [ib_ipoib] > [c00000006dd9b480] [d00000000041c880] .ipoib_ib_dev_down+0x124/0x194 > [ib_ipoib] > [c00000006dd9b520] [d00000000041aed4] .ipoib_stop+0x98/0x188 [ib_ipoib] > [c00000006dd9b5b0] [c0000000002e79e4] .dev_close+0x110/0x118 > [c00000006dd9b630] [c0000000002e7afc] .unregister_netdevice+0x110/0x320 > [c00000006dd9b6c0] [c0000000002e7d30] .unregister_netdev+0x24/0x40 > [c00000006dd9b750] [d00000000041b104] .ipoib_remove_one+0x64/0xc0 > [ib_ipoib] > [c00000006dd9b7e0] [d00000000024e8fc] .ib_unregister_device+0xb4/0x1b4 > [ib_core][c00000006dd9b880] [d00000000054a1fc] .ehca_remove+0x9c/0x714 > [hcad_mod] > > ipoib D 0000000000000000 13552 13326 19 13712 3575 > (L-TLB) > Call Trace: > [c0000000f679f760] [c0000000f679f870] 0xc0000000f679f870 (unreliable) > [c0000000f679f930] [c00000000000e468] .__switch_to+0x104/0x180 > [c0000000f679f9c0] [c000000000394528] .schedule+0x774/0xf34 > [c0000000f679fb00] [c000000000395310] .wait_for_completion+0xc8/0x140 > [c0000000f679fbf0] [d00000000041f7c8] > .ipoib_mcast_stop_thread+0x218/0x274 > [ib_ipoib] > [c0000000f679fcb0] [d00000000041f9c0] > .ipoib_mcast_restart_task+0x90/0x410 > [ib_ipoib] > [c0000000f679fdb0] [c000000000068a8c] .worker_thread+0x244/0x320 > [c0000000f679fed0] [c00000000006fbf8] .kthread+0x178/0x1c8 > [c0000000f679ff90] [c0000000000100b8] .kernel_thread+0x4c/0x68 > > Thanks > Shirley Ma > IBM Linux Technology Center > 15300 SW Koll Parkway > Beaverton, OR 97006-6063 > Phone(Fax): (503) 578-7638 > > -- MST From halr at voltaire.com Fri Nov 18 03:57:59 2005 From: halr at voltaire.com 
(Hal Rosenstock) Date: 18 Nov 2005 06:57:59 -0500 Subject: [openib-general] Re: [PATCH] Opensm - lid assignment issues In-Reply-To: <5z64qwyhpe.fsf@mtl066.yok.mtl.com> References: <5z64qwyhpe.fsf@mtl066.yok.mtl.com> Message-ID: <1132314951.26731.8249.camel@hal.voltaire.com> Hi Yael, On Sun, 2005-11-13 at 05:18, Yael Kalka wrote: > Hi Hal, > > During some windows tests we've discovered that there is still another > problem in the lid_mgr. The problem happend when 2 HCAs had the same > lid - opensm entered an infinite loop. > The following patch fixes this. Thanks. Applied. -- Hal From halr at voltaire.com Fri Nov 18 04:01:19 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 18 Nov 2005 07:01:19 -0500 Subject: [openib-general] Announce: preview RPMs for FC-4 and RHEL-4 available In-Reply-To: <200511180014.54566.moschny@ipd.uni-karlsruhe.de> References: <437A8592.1000408@redhat.com> <200511161758.46379.moschny@ipd.uni-karlsruhe.de> <437C9063.70607@redhat.com> <200511180014.54566.moschny@ipd.uni-karlsruhe.de> Message-ID: <1132315079.26731.8272.camel@hal.voltaire.com> On Thu, 2005-11-17 at 18:14, Thomas Moschny wrote: > On Thursday 17 November 2005 15:14, Doug Ledford wrote: > > Thomas Moschny wrote: > > > Unfortunately, we got an kernel-oops on ia64 (rhel4) ... > > > The boot log is attached. > > > > I think I know what this is. [...] > > The attached patch should be able to be dropped into the existing srpm > > in place of the patch with the same name and a rebuild should then solve > > the problem, although in the process of creating this patch I had to > > move it from the 2700 section of the patch list down to the 10002 > > position because it touches things added after the infiniband code. > > The patch seems to work here, thanks. The machines are up now, and at least > IPoIB is working. > > There seems to be a (minor?) 
problem with opensm -o, it aborts: > > ------------------------------------------------- > OpenSM Rev:openib-1.1.0 > Command Line Arguments: > Run Once > Log File: /var/log/osm.log > ------------------------------------------------- > OpenSM Rev:openib-1.1.0 > > Using default guid 0xxxxxxxxxxxxxxx > Entering MASTER state > > SUBNET UP > > Exiting SM > > *** glibc detected *** double free or corruption (!prev): 0x6000000000067970 > *** > Aborted > > Subsequent runs of opensm hang in flush_cpu_workqueue or > rwsem_down_failed_common. Any idea what svn or how recent the OpenSM being used is ? -- Hal From halr at voltaire.com Fri Nov 18 04:16:39 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 18 Nov 2005 07:16:39 -0500 Subject: [openib-general] Re: [PATCH] Opensm - add info in vendor error printing In-Reply-To: <5z4q6bxxqq.fsf@mtl066.yok.mtl.com> References: <5z4q6bxxqq.fsf@mtl066.yok.mtl.com> Message-ID: <1132316198.26731.8439.camel@hal.voltaire.com> Hi Yael, On Thu, 2005-11-17 at 07:30, Yael Kalka wrote: > Hi Hal, > > We are encountering problems in umad_send with large sized mads. I > will send a different mail regarding this issue to the group. > This patch adds the mad size to the error message when umad_send > failed. Thanks. Applied. -- Hal From halr at voltaire.com Fri Nov 18 04:39:22 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 18 Nov 2005 14:39:22 +0200 Subject: [openib-general] RE: [PATCH] Opensm - add info in vendor errorprinting Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AB1E@taurus.voltaire.com> On Thu, 2005-11-17 at 07:49, Yael Kalka wrote: > Hello all, > > During some opensm tests we've encountered a situation where a call > to umad_send() fails on very large mads (on mads of size ~90,000). > Is there some limitation on the size of the mads that can be sent? Those errors are the result of a kmalloc in user_mad.c (ib_umad module) when sending. 
> Reproduction of the issue is simple: > Run opensm > Then run osmtest -f m -M 3 > What the osmtest does in this case is try to create as many multicast > groups as possible (for all the possible MC_lids), and then > does an SA query on all the multicast groups that exist. > OpenSM will then try to answer with a huge group of multicast records, > and when opensm does umad_send() - it fails. > > Any ideas why this happens? Not yet. I now have this on my list but it is not top priority. I am still digging out from SC05. -- Hal From halr at voltaire.com Fri Nov 18 04:41:01 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 18 Nov 2005 14:41:01 +0200 Subject: [openib-general] RE: [PATCH] Opensm - add info in vendor errorprinting Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AB1F@taurus.voltaire.com> On Thu, 2005-11-17 at 07:49, Yael Kalka wrote: > Hello all, > > During some opensm tests we've encountered a situation where a call > to umad_send() fails on very large mads (on mads of size ~90,000). > Is there some limitation on the size of the mads that can be sent? Those errors are the result of a kmalloc in user_mad.c (ib_umad module) when sending. If a contiguous block of that size can't be obtained, then it would fail. > Reproduction of the issue is simple: > Run opensm > Then run osmtest -f m -M 3 > What the osmtest does in this case is try to create as many multicast > groups as possible (for all the possible MC_lids), and then > does an SA query on all the multicast groups that exist. > OpenSM will then try to answer with a huge group of multicast records, > and when opensm does umad_send() - it fails. > > Any ideas why this happens? Not yet. I now have this on my list but it is not top priority. I am still digging out from SC05. 
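On the kmalloc point above, a rough sketch of why a send buffer of about 90,000 bytes is hard to satisfy. It assumes 4 KiB pages and an allocator that rounds a request up to a power-of-two number of physically contiguous pages; both are assumptions for illustration, not something stated in this thread, and the helper names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: number of whole pages needed to hold `bytes`. */
static size_t pages_needed(size_t bytes, size_t page_size)
{
    return (bytes + page_size - 1) / page_size;
}

/*
 * Hypothetical helper: smallest order n such that (page_size << n) >= bytes,
 * i.e. the power-of-two run of contiguous pages that would have to be found.
 */
static unsigned int alloc_order(size_t bytes, size_t page_size)
{
    unsigned int order = 0;

    while ((page_size << order) < bytes)
        order++;
    return order;
}
```

Under those assumptions a 90,000-byte MAD needs 22 pages of data but an order-5 run (32 contiguous pages, 128 KiB), which becomes harder to find as memory fragments.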
-- Hal

From halr at voltaire.com Fri Nov 18 04:55:10 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: Fri, 18 Nov 2005 14:55:10 +0200
Subject: [openib-general] Re: [Sc05-ib] OpenSM segfault
Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AB20@taurus.voltaire.com>

On Thu, 2005-11-10 at 10:08, Troy Benjegerdes wrote:
> Entering MASTER state
>
> SUBNET UP
>
>
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread -1225770064 (LWP 10892)]
> 0x0805ec4c in __osm_sa_mcm_by_comp_mask_cb (p_map_item=0x80bc338,
> context=0xb6f031e0) at osm_sa_mcmember_record.c:1873
> 1873 scope_state = p_mcm_port->scope_state;
> (gdb)
> (gdb) bt
> #0 0x0805ec4c in __osm_sa_mcm_by_comp_mask_cb (p_map_item=0x80bc338,
> context=0xb6f031e0) at osm_sa_mcmember_record.c:1873
> #1 0xb7f3ec7e in cl_qmap_apply_func (p_map=0x808ec68,
> pfn_func=0x805e838 <__osm_sa_mcm_by_comp_mask_cb>, context=0xb6f031e0)
> at cl_map.c:299
> #2 0x0805ed50 in osm_mcmr_query_mgrp (p_rcv=0x808f7c4, p_madw=0x80a0d30)
> at osm_sa_mcmember_record.c:2004
> #3 0x0805f824 in osm_mcmr_rcv_process (p_rcv=0x808f7c4, p_madw=0x80a0d30)
> at osm_sa_mcmember_record.c:2239
> #4 0xb7f3cfd8 in __cl_disp_worker (context=0x808fda4) at
> cl_dispatcher.c:108
> #5 0xb7f427bd in __cl_thread_pool_routine (context=0x808fde4)
> at cl_threadpool.c:78
> #6 0xb7f42617 in __cl_thread_wrapper (arg=0x0) at cl_thread.c:61
> #7 0x46cde341 in start_thread () from /lib/tls/libpthread.so.0
> #8 0x46b6e6fe in clone () from /lib/tls/libc.so.6
>
>
> the associated log is at :
> http://scl.ameslab.gov/~troy/opensm-gdb.log

I see what is occurring here. I will send out a patch for this shortly
which was from SC05.
-- Hal

From dledford at redhat.com Fri Nov 18 05:47:49 2005
From: dledford at redhat.com (Doug Ledford)
Date: Fri, 18 Nov 2005 08:47:49 -0500
Subject: [openib-general] Announce: preview RPMs for FC-4 and RHEL-4 available
In-Reply-To: <200511180014.54566.moschny@ipd.uni-karlsruhe.de>
References: <437A8592.1000408@redhat.com> <200511161758.46379.moschny@ipd.uni-karlsruhe.de> <437C9063.70607@redhat.com> <200511180014.54566.moschny@ipd.uni-karlsruhe.de>
Message-ID: <437DDB85.1020403@redhat.com>

Thomas Moschny wrote:

> The patch seems to work here, thanks. The machines are up now, and at least
> IPoIB is working.

I should have new kernels on the site sometime today (version
OpenIB_3965.3) that fix this. I only have enough quota space for one
set of kernel rpms, so once they are up, the others are gone.

> There seems to be a (minor?) problem with opensm -o, it aborts:

[ Snip ]

> Exiting SM
>
> *** glibc detected *** double free or corruption (!prev): 0x6000000000067970
> ***
> Aborted
>
> Subsequent runs of opensm hang in flush_cpu_workqueue or
> rwsem_down_failed_common.

BTW, can you try forcing opensm to run single threaded on its first
invocation and see if that fixes this?

Also, do people generally feel that opensm is stable enough to start
converting it to a proper system daemon? By that I mean things like not
having it spew a bunch of informational messages to stdout when in
daemon mode, putting in an actual daemon option, ability to write and
handle a pid file, handling of putting itself in the background and
disassociating from the controlling tty, etc. If so, I'll start coding
that up and send through a patch. The current init.d startup script has
some rather ugly hackery to get around the current opensm's very
daemon-unfriendly behavior...
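To make the daemon items listed above concrete, here is a minimal sketch of the usual POSIX steps (background the process, detach from the controlling tty, silence stdio, record a pid file). It is only an illustration under stated assumptions: the function names are hypothetical, nothing here is taken from opensm, and the actual patch may well look different.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write the calling process's pid to `path`. */
static int write_pid_file(const char *path)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (!f)
        return -1;
    ok = fprintf(f, "%ld\n", (long)getpid()) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Hypothetical helper: classic fork/setsid detach; returns 0 in the child. */
static int become_daemon(void)
{
    pid_t pid = fork();
    int fd;

    if (pid < 0)
        return -1;
    if (pid > 0)
        _exit(0);               /* parent returns control to the shell */
    if (setsid() < 0)           /* new session: no controlling tty */
        return -1;
    if (chdir("/") < 0)
        return -1;
    fd = open("/dev/null", O_RDWR);
    if (fd < 0)
        return -1;
    dup2(fd, STDIN_FILENO);     /* stop writing informational output to the tty */
    dup2(fd, STDOUT_FILENO);
    dup2(fd, STDERR_FILENO);
    if (fd > STDERR_FILENO)
        close(fd);
    return 0;
}
```

A caller would run become_daemon() first and write_pid_file() afterwards, so that the recorded pid is that of the detached process; the pid file location is left to the init script.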
-- Doug Ledford http://people.redhat.com/dledford From halr at voltaire.com Fri Nov 18 05:44:47 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 18 Nov 2005 08:44:47 -0500 Subject: [openib-general] Announce: preview RPMs for FC-4 and RHEL-4 available In-Reply-To: <437DDB85.1020403@redhat.com> References: <437A8592.1000408@redhat.com> <200511161758.46379.moschny@ipd.uni-karlsruhe.de> <437C9063.70607@redhat.com> <200511180014.54566.moschny@ipd.uni-karlsruhe.de> <437DDB85.1020403@redhat.com> Message-ID: <1132321485.26731.9346.camel@hal.voltaire.com> Hi Doug, On Fri, 2005-11-18 at 08:47, Doug Ledford wrote: > Thomas Moschny wrote: > > > The patch seems to work here, thanks. The machines are up now, and at least > > IPoIB is working. > > I should have new kernels on the site sometime today (version > OpenIB_3965.3) that fix this. I only have enough quota space for one > set of kernel rpms, so once they are up, the others are gone. > > > There seems to be a (minor?) problem with opensm -o, it aborts: > > [ Snip ] > > > Exiting SM > > > > *** glibc detected *** double free or corruption (!prev): 0x6000000000067970 > > *** > > Aborted > > > > Subsequent runs of opensm hang in flush_cpu_workqueue or > > rwsem_down_failed_common. > > BTW, can you try forcing opensm to run single threaded on it's first > invocation and see if that fixes this? > > Also, do people generally feel that opensm is stable enough to start > converting it to a proper system daemon? At this point, my opinion is that it's good for small networks and more stable larger networks (I will get you the scale in a subsequent email if that is of interest). It still has a little ways to go before I would say it is ready for prime time. That's just my assesment after coming back from SC05 but there were a lot of flaky links and a lot of new equipment as well as a lot of different equipment never before put together on that scale. More will follow on the list. However, see comment below... 
> By that I mean things like not > having it spew a bunch of informational messages to stdout when in > daemon mode, putting in an actual daemon option, ability to write and > handle a pid file, handling of putting itself in the background and > disassociating from the controlling tty, etc. If so, I'll start coding > that up and send through a patch. The current init.d startup script has > some rather ugly hackery to get around the current opensm's very daemon > unfriendly behavior... I think these would all be good improvements (and some were mentioned at SC05) so I would appreciate patches for these to move OpenSM forward as quickly as possible. Thanks. -- Hal From halr at voltaire.com Fri Nov 18 06:01:13 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 18 Nov 2005 16:01:13 +0200 Subject: [openib-general] Announce: preview RPMs for FC-4 and RHEL-4available Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AB21@taurus.voltaire.com> Hi Doug, On Fri, 2005-11-18 at 08:47, Doug Ledford wrote: > Thomas Moschny wrote: > > > The patch seems to work here, thanks. The machines are up now, and at least > > IPoIB is working. > > I should have new kernels on the site sometime today (version > OpenIB_3965.3) that fix this. I only have enough quota space for one > set of kernel rpms, so once they are up, the others are gone. > > > There seems to be a (minor?) problem with opensm -o, it aborts: > > [ Snip ] > > > Exiting SM > > > > *** glibc detected *** double free or corruption (!prev): 0x6000000000067970 > > *** > > Aborted > > > > Subsequent runs of opensm hang in flush_cpu_workqueue or > > rwsem_down_failed_common. > > BTW, can you try forcing opensm to run single threaded on it's first > invocation and see if that fixes this? > > Also, do people generally feel that opensm is stable enough to start > converting it to a proper system daemon? 
At this point, my opinion is that it's good for small networks and more stable larger networks (I will get you the scale in a subsequent email if that is of interest). It still has a little ways to go before I would say it is ready for prime time. That's just my assesment after coming back from SC05 but there were a lot of flaky links and a lot of new equipment as well as a lot of different equipment never before put together on that scale. More will follow on the list. However, see comment below... > By that I mean things like not > having it spew a bunch of informational messages to stdout when in > daemon mode, putting in an actual daemon option, ability to write and > handle a pid file, handling of putting itself in the background and > disassociating from the controlling tty, etc. If so, I'll start coding > that up and send through a patch. The current init.d startup script has > some rather ugly hackery to get around the current opensm's very daemon > unfriendly behavior... I think these would all be good improvements (and some were mentioned at SC05) so I would appreciate patches for these to move OpenSM forward as quickly as possible. Thanks. -- Hal From halr at voltaire.com Fri Nov 18 06:07:50 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 18 Nov 2005 09:07:50 -0500 Subject: [openib-general] [PATCH] OpenSM: Fix logic error in osm_sa_mcmember_record.c::__osm_sa_mcm_by_comp_mask_cb Message-ID: <1132322051.26731.9460.camel@hal.voltaire.com> In osm_sa_mcmember_record.c::__osm_sa_mcm_by_comp_mask_cb, fix negative logic error which cause seg fault here reported by Troy Benjegerdes Signed-off-by: Hal Rosenstock Index: opensm/osm_sa_mcmember_record.c =================================================================== --- opensm/osm_sa_mcmember_record.c (revision 4077) +++ opensm/osm_sa_mcmember_record.c (working copy) @@ -1868,7 +1868,7 @@ __osm_sa_mcm_by_comp_mask_cb( if (IB_MCR_COMPMASK_PORT_GID & comp_mask) { /* try to find this port */ - if (! 
osm_mgrp_is_port_present(p_mgrp, portguid, &p_mcm_port)) + if (osm_mgrp_is_port_present(p_mgrp, portguid, &p_mcm_port)) { scope_state = p_mcm_port->scope_state; } From halr at voltaire.com Fri Nov 18 06:11:09 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 18 Nov 2005 09:11:09 -0500 Subject: [openib-general] [PATCH] OpenSM: Message-ID: <1132322869.26731.9610.camel@hal.voltaire.com> Hi Yael, While investigating the SA MCMemberRecord segfault just patched, I stumbled across the following and thought it would be better handled as follows since the consumer does not check the returned contents of *pp_mcm_port. Let me know what you think about the below patch. Thanks. -- Hal In osm_multicast.c::osm_mgrp_is_port_present, also determine return value based on whether p_map_item is NULL. Signed-off-by: Hal Rosenstock Index: opensm/osm_multicast.c =================================================================== --- opensm/osm_multicast.c (revision 4077) +++ opensm/osm_multicast.c (working copy) @@ -257,14 +257,16 @@ osm_mgrp_is_port_present( { if (pp_mcm_port) *pp_mcm_port = (osm_mcm_port_t *)p_map_item; - return TRUE; + if (p_map_item) + return TRUE; + else + return FALSE; } if (pp_mcm_port) *pp_mcm_port = NULL; return FALSE; } - /********************************************************************** **********************************************************************/ static void From halr at voltaire.com Fri Nov 18 06:39:27 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 18 Nov 2005 16:39:27 +0200 Subject: [openib-general] [PATCH] OpenSM component library: Incl_timer.c::__cl_timer_prov_cb, handle EINVAL properly Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AB23@taurus.voltaire.com> In cl_timer.c::__cl_timer_prov_cb, handle EINVAL This occurs when there is too much work to do and OpenSM falls behind so the callback occurs later than the expiration time. This causes OpenSM to peg the CPU (consume 100% of the CPU). 
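A minimal sketch of the wait logic this change is about, taking the description above at face value. If only ETIMEDOUT ends the wait, a call that keeps returning EINVAL is retried immediately with the same timeout and the loop spins. The helper names are hypothetical and the real loop in cl_timer.c does more than this.

```c
#include <errno.h>
#include <pthread.h>
#include <time.h>

/*
 * Hypothetical predicate: nonzero when the return value of
 * pthread_cond_timedwait() should be treated as "deadline reached".
 * EINVAL is included on the assumption stated in the patch description.
 */
static int wait_result_means_expired(int ret)
{
    return ret == ETIMEDOUT || ret == EINVAL;
}

/*
 * Hypothetical stand-in for the provider loop. The caller must hold `mutex`.
 * Any other return (a signal on the condition variable, for example)
 * goes back to waiting.
 */
static int wait_until_expired(pthread_cond_t *cond, pthread_mutex_t *mutex,
                              const struct timespec *deadline)
{
    int ret;

    do {
        ret = pthread_cond_timedwait(cond, mutex, deadline);
    } while (!wait_result_means_expired(ret));
    return ret;
}
```

Folding the test into one predicate keeps the exit condition in a single place if further return codes ever need the same treatment.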
Signed-off-by: Troy Benjegardes Signed-off-by: Hal Rosenstock Index: cl_timer.c =================================================================== --- cl_timer.c (revision 4077) +++ cl_timer.c (working copy) @@ -180,8 +180,12 @@ __cl_timer_prov_cb( ret = pthread_cond_timedwait( &gp_timer_prov->cond, &gp_timer_prov->mutex, &p_timer->timeout ); - /* Sleep again on every event other than timeout */ - if( ret != ETIMEDOUT ) + /* + Sleep again on every event other than timeout and invalid + Note: EINVAL means that we got behind. This can occur when + we are very busy... + */ + if( ret != ETIMEDOUT && ret != EINVAL ) continue; /* From halr at voltaire.com Fri Nov 18 07:15:29 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 18 Nov 2005 10:15:29 -0500 Subject: [openib-general] [PATCH] OpenSM: Add vendor decode support some new IB hardware vendors Message-ID: <1132326928.26731.10341.camel@hal.voltaire.com> OpenSM: Add vendor decode support some new IB hardware vendors (PathScale and IBM) Signed-off-by: Hal Rosenstock Index: opensm/osm_helper.c =================================================================== --- opensm/osm_helper.c (revision 4077) +++ opensm/osm_helper.c (working copy) @@ -1851,6 +1851,8 @@ osm_get_node_type_str_fixed_width( #define OSM_VENDOR_ID_FUJITSU2 0x000B5D #define OSM_VENDOR_ID_VOLTAIRE 0x0008F1 #define OSM_VENDOR_ID_YOTTAYOTTA 0x000453 +#define OSM_VENDOR_ID_PATHSCALE 0x001175 +#define OSM_VENDOR_ID_IBM 0x000255 /********************************************************************** **********************************************************************/ @@ -1866,6 +1868,8 @@ osm_get_manufacturer_str( static const char* fujitsu_str = "Fujitsu "; static const char* voltaire_str = "Voltaire "; static const char* yotta_str = "YottaYotta "; + static const char* pathscale_str = "PathScale "; + static const char* ibm_str = "IBM "; static const char* unknown_str = "Unknown "; switch( (uint32_t)(guid_ho >> (5 * 8)) ) @@ -1887,6 +1891,10 @@ 
osm_get_manufacturer_str( return( voltaire_str ); case OSM_VENDOR_ID_YOTTAYOTTA: return( yotta_str ); + case OSM_VENDOR_ID_PATHSCALE: + return( pathscale_str ); + case OSM_VENDOR_ID_IBM: + return( ibm_str ); default: return( unknown_str ); } From halr at voltaire.com Fri Nov 18 08:12:40 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 18 Nov 2005 18:12:40 +0200 Subject: [openib-general] Announce: preview RPMs for FC-4 andRHEL-4available Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AB29@taurus.voltaire.com> On Fri, 2005-11-18 at 09:01, Hal Rosenstock wrote: > At this point, my opinion is that it's good for small networks and more > stable larger networks (I will get you the scale in a subsequent email > if that is of interest). OpenIB OpenSM has been run successfully on a cluster size of 256 nodes to date. I believe Mellanox Gold (IBGD) OpenSM has been run on a cluster of 512 nodes so it is likely OpenIB OpenSM would scale to that size too. -- Hal From viswa.krish at gmail.com Fri Nov 18 10:13:51 2005 From: viswa.krish at gmail.com (Viswanath Krishnamurthy) Date: Fri, 18 Nov 2005 10:13:51 -0800 Subject: [openib-general] mthca and non-MSI system Message-ID: <4df28be40511181013w151318aajfec7d6dd6aa9784b@mail.gmail.com> Has the mthca driver been tested on non-MSI (interrupt) system. I seem to have a problem where interrupts are not generated on non-MSI system with the following message "NOP command failed to generate interrupt (IRQ 9), aborting." BIOS or ACPI interrupt routing problem? -Viswa -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From shubbell at dbresearch.net Fri Nov 18 10:13:26 2005 From: shubbell at dbresearch.net (Sean Hubbell) Date: Fri, 18 Nov 2005 12:13:26 -0600 Subject: [openib-general] Re: IPoIB In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F589AB18@taurus.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F589AB18@taurus.voltaire.com> Message-ID: <437E19C6.4020007@dbresearch.net> Hal Rosenstock wrote: >Hi Sean, > >On Thu, 2005-11-17 at 13:57, Sean Hubbell wrote: > > >>Hal, >> >> I tried cycling the power to the switches with no luck. >>I am running opensm (source code as of 11/15/2005). >>We are running with 12 SBS 24 Port Switches. How can I see what ports came up at 1x? >> >> > >Do you have the management tools installed ? > >Find the LIDs of the switches with ibnetdiscover and then use smpquery >portinfo to see what LinkWidthActive says. > >You would lose multicast connectivity if some ports were 1x and others >were 4x so this is a good thing to check. > >-- Hal > > > > > > Yes, I have those built, just forgot about the smpquery. I'll let you know what I find. Thanks for the help (again). Sean -- Sean Hubbell Senior Software Engineer deciBel Research, Inc. (256) 426-8957 From jeff at abbatech.com Fri Nov 18 10:40:34 2005 From: jeff at abbatech.com (Jeff Sadowski) Date: Fri, 18 Nov 2005 11:40:34 -0700 Subject: [openib-general] RE: [Sc05-ib] Opensm crash.. Message-ID: <7C37D8C9E149DE4090CB87199139E765035DA5@abba-server-5.abbatech.com> Hey Hal maybe valgrind could be of some use? -----Original Message----- From: sc05-ib-bounces at lists.scl.ameslab.gov on behalf of Hal Rosenstock Sent: Thu 11/17/2005 7:00 PM To: troy at scl.ameslab.gov Cc: sc05-ib at scl.ameslab.gov; openib-general at openib.org Subject: Re: [Sc05-ib] Opensm crash.. On Wed, 2005-11-16 at 00:10, Troy Benjegerdes wrote: > This was running with -maxsmps=32 > > Program received signal SIGSEGV, Segmentation fault. 
> [Switching to Thread 98311 (LWP 31196)] > 0xb7f71f29 in osm_log (p_log=0x0, verbosity=16 '\020', > p_str=0x80787cd "%s: [\n") at osm_log.c:137 > 137 if (p_log->level & verbosity) > (gdb) bt > #0 0xb7f71f29 in osm_log (p_log=0x0, verbosity=16 '\020', > p_str=0x80787cd "%s: [\n") at osm_log.c:137 > #1 0x0807755b in osm_vl15_poll (p_vl=0x8090ca4) at osm_vl15intf.c:410 > #2 0x0806b1fc in __osm_sm_mad_ctrl_update_wire_stats (p_ctrl=0x8090110) > at osm_sm_mad_ctrl.c:228 > #3 0x0806b6e0 in __osm_sm_mad_ctrl_rcv_callback (p_madw=0xb4c29390, > bind_context=0x8090110, p_req_madw=0x89cf1a8) at osm_sm_mad_ctrl.c:270 > #4 0xb7f3f821 in umad_receiver (p_ptr=0x80ccce8) at osm_vendor_ibumad.c:401 > #5 0xb7f6c617 in __cl_thread_wrapper (arg=0x0) at cl_thread.c:61 > #6 0x46d86ce1 in pthread_start_thread () from /lib/i686/libpthread.so.0 > #7 0x46d86e51 in pthread_start_thread_event () from > /lib/i686/libpthread.so.0 > #8 0x46c16d3a in clone () from /lib/i686/libc.so.6 > (gdb) print p_log > $1 = (osm_log_t * const) 0x0 > (gdb) up > #1 0x0807755b in osm_vl15_poll (p_vl=0x8090ca4) at osm_vl15intf.c:410 > 410 OSM_LOG_ENTER( p_vl->p_log, osm_vl15_poll ); > (gdb) print p_vl > $2 = (osm_vl15_t * const) 0x8090ca4 > (gdb) print p_vl->p_log > $3 = (osm_log_t *) 0x0 > (gdb) print *p_vl > $4 = {thread_state = OSM_THREAD_STATE_RUN, state = OSM_VL15_STATE_READY, > max_wire_smps = 32, signal = {condvar = {__c_lock = {__status = 0, > __spinlock = 0}, __c_waiting = 0x80a9940, > __padding = '\0' , __align = 0}, signaled = 0, > manual_reset = 0, spinlock = {mutex = {__m_reserved = 0, __m_count = 0, > __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, > __spinlock = 0}}, state = CL_INITIALIZED}, state = > CL_INITIALIZED}, > poller = {osd = {id = 65541, state = CL_INITIALIZED}, > pfn_callback = 0x8076d6c <__osm_vl15_poller>, context = 0x8090ca4, > name = '\0' }, rfifo = {end = {p_next = 0xab44b760, > p_prev = 0x8090c6c}, count = 135010200, state = 0}, ufifo = {end = { > p_next = 0x100, p_prev = 
0x0}, count = 148497440, state = 135010200}, > lock = {mutex = {__m_reserved = 0, __m_count = 0, __m_owner = 0x0, > __m_kind = 0, __m_lock = {__status = 0, __spinlock = 0}}, state = 0}, > p_vend = 0x0, p_log = 0x0, p_stats = 0x0, p_subn = 0x0, h_disp = 0x0, > p_lock = 0x0} > (gdb) up > #2 0x0806b1fc in __osm_sm_mad_ctrl_update_wire_stats (p_ctrl=0x8090110) > at osm_sm_mad_ctrl.c:228 > 228 osm_vl15_poll( p_ctrl->p_vl15 ); > This looks like another memory scribbling issue. This time p_log was cleared. -- Hal _______________________________________________ Sc05-ib mailing list Sc05-ib at lists.scl.ameslab.gov https://lists.scl.ameslab.gov/cgi-bin/mailman/listinfo/sc05-ib From rolandd at cisco.com Fri Nov 18 12:41:44 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 18 Nov 2005 12:41:44 -0800 Subject: [openib-general] mthca and non-MSI system In-Reply-To: <4df28be40511181013w151318aajfec7d6dd6aa9784b@mail.gmail.com> (Viswanath Krishnamurthy's message of "Fri, 18 Nov 2005 10:13:51 -0800") References: <4df28be40511181013w151318aajfec7d6dd6aa9784b@mail.gmail.com> Message-ID: <52ek5d4rjr.fsf@cisco.com> Viswanath> Has the mthca driver been tested on non-MSI (interrupt) Viswanath> system. I seem to have a problem where interrupts are Viswanath> not generated on non-MSI system with the following Viswanath> message Yes, many people are running without MSI. Viswanath> "NOP command failed to generate interrupt (IRQ 9), Viswanath> aborting." BIOS or ACPI interrupt routing problem? The last part of the message lists the most likely source of problems. - R. 
From rolandd at cisco.com Fri Nov 18 13:14:21 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Fri, 18 Nov 2005 13:14:21 -0800
Subject: [openib-general] [PATCH] ibsrpdm: fix service record range problem
In-Reply-To: (John Kingman's message of "Thu, 17 Nov 2005 12:45:39 -0600 (CST)")
References:
Message-ID: <5264qp4q1e.fsf@cisco.com>

    John> The start and end values are reversed in the attribute
    John> modifier for the Service Entries attribute.

Thanks, I thought I had fixed that but it snuck back in. I'll make
sure it's right in my next release.

 - R.

From dledford at redhat.com Fri Nov 18 15:05:44 2005
From: dledford at redhat.com (Doug Ledford)
Date: Fri, 18 Nov 2005 18:05:44 -0500
Subject: [openib-general] Announce: preview RPMs for FC-4 and RHEL-4 available
In-Reply-To: <437A8592.1000408@redhat.com>
References: <437A8592.1000408@redhat.com>
Message-ID: <437E5E48.4030505@redhat.com>

Doug Ledford wrote:

> For the kernel, libmthca, and libibverbs, support is limited to x86,
> x86_64, and ia64.

OK, I've uploaded some new kernel rpms to the same site as before. This
should fix the oops problem on 64 bit arches. There still isn't ppc
support and I don't think we are going to implement that for this update
due to testing issues (namely, I don't know what ppc hardware would even
work with mthca cards and I can't test it, I'm just not a ppc
knowledgeable guy). If anyone out there is using ppc hardware and mthca
controllers and would like support for this enabled, please drop me an
email and let me know what arch variant you need support for (the
ppc64iseries in particular is the real killer since it doesn't allow
__raw_readl/__raw_writel that the mthca driver uses and since I don't
have an iseries machine I can't test possible alternatives for doing the
io).
-- 
Doug Ledford
http://people.redhat.com/dledford

From halr at voltaire.com Fri Nov 18 14:02:53 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 18 Nov 2005 17:02:53 -0500
Subject: [openib-general] Announce: preview RPMs for FC-4 and RHEL-4 available
In-Reply-To: <437E5E48.4030505@redhat.com>
References: <437A8592.1000408@redhat.com> <437E5E48.4030505@redhat.com>
Message-ID: <1132351372.26731.13759.camel@hal.voltaire.com>

Hi Doug,

On Fri, 2005-11-18 at 18:05, Doug Ledford wrote:
> Doug Ledford wrote:
>
> > For the kernel, libmthca, and libibverbs, support is limited to x86,
> > x86_64, and ia64.
>
> OK, I've uploaded some new kernel rpms to the same site as before. This
> should fix the oops problem on 64 bit arches. There still isn't ppc
> support and I don't think we are going to implement that for this update
> due to testing issues (namely, I don't know what ppc hardware would even
> work with mthca cards and I can't test it, I'm just not a ppc
> knowledgeable guy). If anyone out there is using ppc hardware and mthca
> controllers and would like support for this enabled, please drop me an
> email and let me know what arch variant you need support for (the
> ppc64iseries in particular is the real killer since it doesn't allow
> __raw_readl/__raw_writel that the mthca driver uses and since I don't
> have an iseries machine I can't test possible alternatives for doing the
> io).

You can use IBM p630 and JS20 systems with mthca.

-- Hal

From rolandd at cisco.com Fri Nov 18 14:10:34 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Fri, 18 Nov 2005 14:10:34 -0800
Subject: [openib-general] Re: [PATCH] ipoib: protect child list access
In-Reply-To: <20051117155209.GN20871@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 17 Nov 2005 17:52:09 +0200")
References: <20051117155209.GN20871@mellanox.co.il>
Message-ID: <52r79d38v9.fsf@cisco.com>

Thanks, applied.
From rolandd at cisco.com Fri Nov 18 14:15:57 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Fri, 18 Nov 2005 14:15:57 -0800
Subject: [openib-general] Re: [PATCH] ipoib: protect child list access
In-Reply-To: <52r79d38v9.fsf@cisco.com> (Roland Dreier's message of "Fri, 18 Nov 2005 14:10:34 -0800")
References: <20051117155209.GN20871@mellanox.co.il> <52r79d38v9.fsf@cisco.com>
Message-ID: <52k6f538ma.fsf@cisco.com>

err, not really applied yet... I applied the mthca sge calculation
patch and then replied to the wrong email.

 - R.

From rolandd at cisco.com Fri Nov 18 14:23:04 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Fri, 18 Nov 2005 14:23:04 -0800
Subject: [openib-general] [git pull] IB updates for 2.6.15
Message-ID: <52fypt38af.fsf@cisco.com>

Linus, please pull from

    master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This tree is also available from kernel.org mirrors at:

    git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

The pull will get the following changes:

Michael S. Tsirkin:
      IB/mthca: Safer max_send_sge/max_recv_sge calculation

Roland Dreier:
      [IB] srp: increase max_luns
      [IB] srp: don't post receive if no send buf available
      [IB] mthca: don't disable RDMA writes if no responder resources
      IB/umad: make sure write()s have sufficient data

 drivers/infiniband/core/user_mad.c     |    2 +-
 drivers/infiniband/hw/mthca/mthca_qp.c |   37 ++++++++++++++++----------------
 drivers/infiniband/ulp/srp/ib_srp.c    |   17 ++++++++++-----
 drivers/infiniband/ulp/srp/ib_srp.h    |    1 +
 4 files changed, 31 insertions(+), 26 deletions(-)

diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c
index 5ea741f..e73f81c 100644
--- a/drivers/infiniband/core/user_mad.c
+++ b/drivers/infiniband/core/user_mad.c
@@ -312,7 +312,7 @@ static ssize_t ib_umad_write(struct file
 	int ret, length, hdr_len, copy_offset;
 	int rmpp_active = 0;

-	if (count < sizeof (struct ib_user_mad))
+	if (count < sizeof (struct ib_user_mad) + IB_MGMT_RMPP_HDR)
 		return -EINVAL;

 	length = count - sizeof (struct ib_user_mad);
diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c
index 760c418..dd4e133 100644
--- a/drivers/infiniband/hw/mthca/mthca_qp.c
+++ b/drivers/infiniband/hw/mthca/mthca_qp.c
@@ -730,15 +730,16 @@ int mthca_modify_qp(struct ib_qp *ibqp,
 	}

 	if (attr_mask & IB_QP_ACCESS_FLAGS) {
+		qp_context->params2 |=
+			cpu_to_be32(attr->qp_access_flags & IB_ACCESS_REMOTE_WRITE ?
+				    MTHCA_QP_BIT_RWE : 0);
+
 		/*
-		 * Only enable RDMA/atomics if we have responder
-		 * resources set to a non-zero value.
+		 * Only enable RDMA reads and atomics if we have
+		 * responder resources set to a non-zero value.
 		 */
 		if (qp->resp_depth) {
 			qp_context->params2 |=
-				cpu_to_be32(attr->qp_access_flags & IB_ACCESS_REMOTE_WRITE ?
-					    MTHCA_QP_BIT_RWE : 0);
-			qp_context->params2 |=
 				cpu_to_be32(attr->qp_access_flags & IB_ACCESS_REMOTE_READ ?
 					    MTHCA_QP_BIT_RRE : 0);
 			qp_context->params2 |=
@@ -759,31 +760,27 @@ int mthca_modify_qp(struct ib_qp *ibqp,
 		if (qp->resp_depth && !attr->max_dest_rd_atomic) {
 			/*
 			 * Lowering our responder resources to zero.
-			 * Turn off RDMA/atomics as responder.
-			 * (RWE/RRE/RAE in params2 already zero)
+			 * Turn off reads RDMA and atomics as responder.
+			 * (RRE/RAE in params2 already zero)
 			 */
-			qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RWE |
-								MTHCA_QP_OPTPAR_RRE |
+			qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RRE |
 								MTHCA_QP_OPTPAR_RAE);
 		}

 		if (!qp->resp_depth && attr->max_dest_rd_atomic) {
 			/*
 			 * Increasing our responder resources from
-			 * zero. Turn on RDMA/atomics as appropriate.
+			 * zero. Turn on RDMA reads and atomics as
+			 * appropriate.
 			 */
 			qp_context->params2 |=
-				cpu_to_be32(qp->atomic_rd_en & IB_ACCESS_REMOTE_WRITE ?
-					    MTHCA_QP_BIT_RWE : 0);
-			qp_context->params2 |=
 				cpu_to_be32(qp->atomic_rd_en & IB_ACCESS_REMOTE_READ ?
 					    MTHCA_QP_BIT_RRE : 0);
 			qp_context->params2 |=
 				cpu_to_be32(qp->atomic_rd_en & IB_ACCESS_REMOTE_ATOMIC ?
 					    MTHCA_QP_BIT_RAE : 0);

-			qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RWE |
-								MTHCA_QP_OPTPAR_RRE |
+			qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RRE |
 								MTHCA_QP_OPTPAR_RAE);
 		}

@@ -921,10 +918,12 @@ static void mthca_adjust_qp_caps(struct
 	else
 		qp->max_inline_data = max_data_size - MTHCA_INLINE_HEADER_SIZE;

-	qp->sq.max_gs = max_data_size / sizeof (struct mthca_data_seg);
-	qp->rq.max_gs = (min(dev->limits.max_desc_sz, 1 << qp->rq.wqe_shift) -
-			 sizeof (struct mthca_next_seg)) /
-			sizeof (struct mthca_data_seg);
+	qp->sq.max_gs = min_t(int, dev->limits.max_sg,
+			      max_data_size / sizeof (struct mthca_data_seg));
+	qp->rq.max_gs = min_t(int, dev->limits.max_sg,
+			      (min(dev->limits.max_desc_sz, 1 << qp->rq.wqe_shift) -
+			       sizeof (struct mthca_next_seg)) /
+			      sizeof (struct mthca_data_seg));
 }

 /*
diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 321a3a1..ee9fe22 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -802,13 +802,21 @@ static int srp_post_recv(struct srp_targ

 /*
  * Must be called with target->scsi_host->host_lock held to protect
- * req_lim and tx_head.
+ * req_lim and tx_head.  Lock cannot be dropped between call here and
+ * call to __srp_post_send().
  */
 static struct srp_iu *__srp_get_tx_iu(struct srp_target_port *target)
 {
 	if (target->tx_head - target->tx_tail >= SRP_SQ_SIZE)
 		return NULL;

+	if (unlikely(target->req_lim < 1)) {
+		if (printk_ratelimit())
+			printk(KERN_DEBUG PFX "Target has req_lim %d\n",
+			       target->req_lim);
+		return NULL;
+	}
+
 	return target->tx_ring[target->tx_head & SRP_SQ_SIZE];
 }

@@ -823,11 +831,6 @@ static int __srp_post_send(struct srp_ta
 	struct ib_send_wr wr, *bad_wr;
 	int ret = 0;

-	if (target->req_lim < 1) {
-		printk(KERN_ERR PFX "Target has req_lim %d\n", target->req_lim);
-		return -EAGAIN;
-	}
-
 	list.addr   = iu->dma;
 	list.length = len;
 	list.lkey   = target->srp_host->mr->lkey;
@@ -1417,6 +1420,8 @@ static ssize_t srp_create_target(struct
 	if (!target_host)
 		return -ENOMEM;

+	target_host->max_lun = SRP_MAX_LUN;
+
 	target = host_to_target(target_host);
 	memset(target, 0, sizeof *target);

diff --git a/drivers/infiniband/ulp/srp/ib_srp.h b/drivers/infiniband/ulp/srp/ib_srp.h
index 4fec28a..b564f18 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.h
+++ b/drivers/infiniband/ulp/srp/ib_srp.h
@@ -54,6 +54,7 @@ enum {
 	SRP_PORT_REDIRECT	= 1,
 	SRP_DLID_REDIRECT	= 2,

+	SRP_MAX_LUN		= 512,
 	SRP_MAX_IU_LEN		= 256,

 	SRP_RQ_SHIFT		= 6,

From halr at voltaire.com Fri Nov 18 14:26:02 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: Sat, 19 Nov 2005 00:26:02 +0200
Subject: [openib-general] Announce: preview RPMs for FC-4 andRHEL-4 available
Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AB3C@taurus.voltaire.com>

On Fri, 2005-11-18 at 17:02, Hal Rosenstock wrote:
> You can use IBM p630 and JS20 systems with mthca.

I think you may be able to do Power 5 as well. This would have the
advantage of being able to run both mthca and ehca without getting yet
another machine.
-- Hal

From robert.j.woodruff at intel.com Fri Nov 18 14:30:07 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Fri, 18 Nov 2005 14:30:07 -0800
Subject: [openib-general] Announce: preview RPMs for FC-4 andRHEL-4 available
In-Reply-To: <437E5E48.4030505@redhat.com>
Message-ID:

Doug Ledford wrote:
>> For the kernel, libmthca, and libibverbs, support is limited to x86,
>> x86_64, and ia64.

>OK, I've uploaded some new kernel rpms to the same site as before. This
>should fix the oops problem on 64 bit arches.

Yep. I loaded it up on an x86_64 machine and it did indeed fix the oops.
I can also give it a try on some IPF machines.
Will let you know if I see any other issues.

woody

From halr at voltaire.com Fri Nov 18 18:36:03 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 18 Nov 2005 21:36:03 -0500
Subject: [openib-general] Re: [Sc05-ib] OpenSM (lack of) error handling
In-Reply-To: <4372EA72.8070609@scl.ameslab.gov>
References: <4372EA72.8070609@scl.ameslab.gov>
Message-ID: <1132367763.26731.15413.camel@hal.voltaire.com>

On Thu, 2005-11-10 at 01:36, Troy Benjegerdes wrote:
> OpenSM does NOT handle links that generate errors very well at ALL. We
> have several flakey links on the SC05 show floor, and opensm is
> segfaulting and generally not very happy about it.
>
> Is there a reasonable way to partition off links that generate lots of
> errors without physically unplugging them?

The port on the other end of the link can be physically disabled. A
management command to do this can be added. If the SM were embedded on a
switch, then all ports on the links opposite the switch ports would need
to be disabled.

The harder part is detecting that this needs to be done (an SM Key
mismatch might be one indication, but the other SM might not play by the
rules and respond properly in the real rogue case). Also, the policy for
doing this is hard, especially if that policy is built into the SM rather
than left to the network administrator (a person) issuing a manual command.
It does not take care of the nodes which were claimed by that other SM. That can be problematic if the other SM set MKeyProtect bits to 2 or 3 and an infinite lease period. There is no way to reclaim them in that particular case other than rebooting those nodes. > Also, what is to prevent any random IB client that plugs in from using > MAD packets to reset port counters? Anyone in your partition(s) can do this. -- Hal From moschny at ipd.uni-karlsruhe.de Fri Nov 18 19:45:08 2005 From: moschny at ipd.uni-karlsruhe.de (Thomas Moschny) Date: Sat, 19 Nov 2005 04:45:08 +0100 Subject: [openib-general] Announce: preview RPMs for FC-4 and RHEL-4 available In-Reply-To: <437E5E48.4030505@redhat.com> References: <437A8592.1000408@redhat.com> <437E5E48.4030505@redhat.com> Message-ID: <437E9FC4.8080408@ipd.uni-karlsruhe.de> Doug Ledford wrote: > OK, I've uploaded some new kernel rpms to the same site as before. This > should fix the oops problem on 64 bit arches. Tried the kernel here on rhel4/ia64, oops fixed indeed. - Thomas From moschny at ipd.uni-karlsruhe.de Fri Nov 18 20:12:44 2005 From: moschny at ipd.uni-karlsruhe.de (Thomas Moschny) Date: Sat, 19 Nov 2005 05:12:44 +0100 Subject: [openib-general] Announce: preview RPMs for FC-4 and RHEL-4 available In-Reply-To: <437DDB85.1020403@redhat.com> References: <437A8592.1000408@redhat.com> <200511161758.46379.moschny@ipd.uni-karlsruhe.de> <437C9063.70607@redhat.com> <200511180014.54566.moschny@ipd.uni-karlsruhe.de> <437DDB85.1020403@redhat.com> Message-ID: <437EA63C.1080303@ipd.uni-karlsruhe.de> Thomas Moschny wrote: > Exiting SM > > *** glibc detected *** double free or corruption (!prev): 0x6000000000067970 *** > Aborted > > Subsequent runs of opensm hang in flush_cpu_workqueue or rwsem_down_failed_common. Doug Ledford wrote: > BTW, can you try forcing opensm to run single threaded on it's first > invocation and see if that fixes this? Did you mean calling opensm with -d1? 
Well, currently I can't see any consistent behavior, but if called with -d1 on the *second* -o run, it doesn't seem to hang (unless there are already some unkillable instances on this machine from earlier runs). - Thomas From tom at ipperformance.com Sat Nov 19 05:56:30 2005 From: tom at ipperformance.com (Tom Tucker) Date: Sat, 19 Nov 2005 07:56:30 -0600 Subject: [openib-general] mthca and non-MSI system In-Reply-To: <4df28be40511181013w151318aajfec7d6dd6aa9784b@mail.gmail.com> References: <4df28be40511181013w151318aajfec7d6dd6aa9784b@mail.gmail.com> Message-ID: <1132408590.1445.13.camel@mail.es335.com> We have run mthca on non-MSI systems and on MSI systems with MSI disabled. In both cases, the driver seemed to work. Tom On Fri, 2005-11-18 at 10:13 -0800, Viswanath Krishnamurthy wrote: > Has the mthca driver been tested on non-MSI (interrupt) system. I seem > to have a problem where > interrupts are not generated on non-MSI system with the following > message > > "NOP command failed to generate interrupt (IRQ 9), aborting." > BIOS or ACPI interrupt routing problem? > > -Viswa > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Sat Nov 19 06:39:43 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Sat, 19 Nov 2005 16:39:43 +0200 Subject: [openib-general] Announce: preview RPMs for FC-4 and RHEL-4available Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AB44@taurus.voltaire.com> On Fri, 2005-11-18 at 23:12, Thomas Moschny wrote: > Thomas Moschny wrote: > > Exiting SM > > > > *** glibc detected *** double free or corruption (!prev): 0x6000000000067970 *** > > Aborted On what processor architecture is opensm running ? Note that some better handling of opensm exiting went in at r3977 which is slightly past this (r3965). 
> > Subsequent runs of opensm hang in flush_cpu_workqueue or rwsem_down_failed_common.

Sounds like something isn't cleaned up properly when the previous
instance exits. After the error, is there an opensm instance still
around? If so, it wouldn't clean up some MAD registrations.

> Doug Ledford wrote:
> > BTW, can you try forcing opensm to run single threaded on it's first
> > invocation and see if that fixes this?
>
> Did you mean calling opensm with -d1?

That would force single thread mode. You should see something like this
when opensm starts up:

opensm -d1
-------------------------------------------------
OpenSM Rev:openib-1.1.0
Command Line Arguments:
 d level = 0x1
 Debug mode: Forcing Single Thread
 Log File: /var/log/osm.log
-------------------------------------------------
OpenSM Rev:openib-1.1.0

> Well, currently I can't see any
> consistent behavior, but if called with -d1 on the *second* -o run,

What is the state of the subnet?

> it doesn't seem to hang (unless there are already some unkillable instances
> on this machine from earlier runs).

Did you check with ps for opensm instances?

-- Hal

From halr at voltaire.com Sat Nov 19 08:33:03 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: Sat, 19 Nov 2005 18:33:03 +0200
Subject: [openib-general] OpenSM Debug
Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AB47@taurus.voltaire.com>

Hi,

The following code snippet is in opensm/main.c:

  if ( osm_is_debug() != cl_is_debug() )
  {
    fprintf(stderr, "ERROR: OpenSM and Complib were compiled using different modes\n");
    fprintf(stderr, "ERROR: OpenSM debug:%d Complib debug:%d \n", osm_is_debug(), cl_is_debug() );
    exit(1);
  }

Is there a reason debug can't be turned on independently in OpenSM and
the component library? It seems to me that you might want debug in any
combination of the three places (these 2 and the vendor library). If so,
those messages should be changed to warnings rather than errors, and
something should perhaps be added for the vendor library too.
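
To make the suggestion concrete, here is a minimal sketch of the check turned into a warning. It only illustrates the control flow: `check_debug_modes` is a hypothetical helper that takes the two flags as arguments, standing in for the real osm_is_debug()/cl_is_debug() calls, and it says nothing about whether a mixed debug/release build is actually safe to run.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not OpenSM code): report a debug-mode mismatch
 * as a warning and let the caller keep running instead of exiting.
 * Returns 1 if the modes differ, 0 if they match.
 */
static int check_debug_modes(int osm_debug, int cl_debug)
{
	if (osm_debug != cl_debug) {
		fprintf(stderr,
			"WARNING: OpenSM and Complib were compiled using different modes\n");
		fprintf(stderr,
			"WARNING: OpenSM debug:%d Complib debug:%d\n",
			osm_debug, cl_debug);
		return 1;	/* mismatch reported, no exit(1) */
	}
	return 0;
}
```

A caller would pass osm_is_debug() and cl_is_debug() and decide from the return value whether to continue; a third flag for the vendor library could be added the same way.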
-- Hal From sean.hefty at intel.com Sat Nov 19 08:37:37 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Sat, 19 Nov 2005 08:37:37 -0800 Subject: [openib-general] OpenSM Debug In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F589AB47@taurus.voltaire.com> Message-ID: >The following code snippet is in opensm/main.c: > > if ( osm_is_debug() != cl_is_debug() ) > { > fprintf(stderr, "ERROR: OpenSM and Complib were compiled using different >modes\n"); > fprintf(stderr, "ERROR: OpenSM debug:%d Complib debug:%d \n", >osm_is_debug(), cl_is_debug() ); > exit(1); > } > >Is there a reason debug can't be turned on independently in OpenSM and >the component library ? There used to be a restriction that you couldn't mix a free/release version of the component library with a debug version of a client, and vice-versa. The debug version of complib added fields to structures that were not needed in the release version, resulting in different structure sizes between free and debug versions. This is probably still the case. - Sean From dledford at redhat.com Sat Nov 19 11:24:48 2005 From: dledford at redhat.com (Doug Ledford) Date: Sat, 19 Nov 2005 14:24:48 -0500 Subject: [openib-general] Announce: preview RPMs for FC-4 and RHEL-4available In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F589AB44@taurus.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F589AB44@taurus.voltaire.com> Message-ID: <437F7C00.9000305@redhat.com> Hal Rosenstock wrote: > On Fri, 2005-11-18 at 23:12, Thomas Moschny wrote: > >>Thomas Moschny wrote: >> >>>Exiting SM >>> >>>*** glibc detected *** double free or corruption (!prev): 0x6000000000067970 *** >>>Aborted > > > On what processor architecture is opensm running ? > > Note that some better handling of opensm exiting went in at r3977 which > is slightly past this (r3965). > > >>>Subsequent runs of opensm hang in flush_cpu_workqueue or rwsem_down_failed_common. > > > Sounds like something isn't cleanup up properly when the previous > instance exits. 
After the error, is there an opensm instance still > around ? If so, it wouldn't clean up some MAD registrations. That's the only time I've seen opensm fail to run is when I already had one running that I wasn't aware of. -- Doug Ledford http://people.redhat.com/dledford From ftillier at silverstorm.com Sat Nov 19 10:50:51 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Sat, 19 Nov 2005 10:50:51 -0800 Subject: [openib-general] OpenSM Debug In-Reply-To: Message-ID: <002701c5ed3a$2bb3d8e0$9e5aa8c0@infiniconsys.com> > From: Sean Hefty [mailto:sean.hefty at intel.com] > Sent: Saturday, November 19, 2005 8:38 AM > > >The following code snippet is in opensm/main.c: > > > > if ( osm_is_debug() != cl_is_debug() ) > > { > > fprintf(stderr, "ERROR: OpenSM and Complib were compiled using > > different modes\n"); > > fprintf(stderr, "ERROR: OpenSM debug:%d Complib debug:%d \n", > >osm_is_debug(), cl_is_debug() ); > > exit(1); > > } > > > >Is there a reason debug can't be turned on independently in OpenSM and > >the component library ? > > There used to be a restriction that you couldn't mix a free/release > version of the component library with a debug version of a client, > and vice-versa. The debug version of complib added fields to > structures that were not needed in the release version, resulting > in different structure sizes between free and debug versions. > This is probably still the case. That's correct - structure definitions change between the debug and release builds of complib. The code above is there because in Linux, the library created by complib has the same name in debug and release builds, so it is possible to have a mismatch between the type of build for opensm and complib. In Windows, I solved this by adding a debug-only suffix to the library name (complibd vs. complib) so that the risk of linkage errors is eliminated. 
I have suggested in the past that the Linux complib adopt a similar naming scheme and that doing runtime checks for linkage errors was indicative of a poor design. This has been the basis for me pushing back on adding the cl_is_debug function to the Windows version of complib. - Fab From mst at mellanox.co.il Sat Nov 19 11:05:48 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sat, 19 Nov 2005 21:05:48 +0200 Subject: [openib-general] error handling in ipoib_open Message-ID: <20051119190548.GA22412@mellanox.co.il> Roland, ipoib_open error handling looks strange. dont we need to e.g. call ipoib_ib_dev_stop if ipoib_ib_dev_up returns an error? MST -- MST From yael at mellanox.co.il Sun Nov 20 01:20:00 2005 From: yael at mellanox.co.il (Yael Kalka) Date: Sun, 20 Nov 2005 11:20:00 +0200 Subject: [openib-general] opensm - umad_receiver break on alloc errors Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30E23F8@mtlexch01.mtl.com> Hi Hal, While reviewing the umad_receiver function in osm_vendor_ibumad.c we've noticed that when umad_alloc() calls fail, the receiver breaks. What happens then is that SM continues to live, though the umad_receiver thread doesn't exist anymore. I think that there is no use in keeping the SM alive in this case. As a result, I think we should do one of the following when umad_alloc() failes: 1. If umad_alloc() fails - issue an error to the syslog, and exit SM. This is a fatal case. 2. Use continue instead of break. Assuming that if umad_alloc() fails this time - doesn't mean it'll fail again. What do you think? Yael From eitan at mellanox.co.il Sun Nov 20 01:52:24 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 20 Nov 2005 11:52:24 +0200 Subject: [openib-general] another opensm crash Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E36189A7@mtlexch01.mtl.com> Hi Hal, > > > > Try to move aside your /lib/tls directory and see if you still get these > > crashes. 
> > We have issues with TLS pthread and glibc > > There are still strange crashes like this which appear to be memory > scribbling issues. [EZ] OK we need to trace those. But TLS has some bugs too. We had cases where we could see cond wait events not being picked up. > > Moving tls aside changes the threads into processes. Does that indicate > that threading issues are suspected ? [EZ] In old Pthread the threads seems like processes and in TLS they do not. This is not the issue. I suspect that in gen1 we see the cond wait issue more frequently as the vendor uses cl_timer more often (which uses cond wait ...) > > -- Hal > > > > > Eitan Zahavi > > Design Technology Director > > Mellanox Technologies LTD > > Tel:+972-4-9097208 > > Fax:+972-4-9593245 > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > -----Original Message----- > > > From: Troy Benjegerdes [mailto:troy at scl.ameslab.gov] > > > Sent: Monday, November 14, 2005 8:09 PM > > > To: openib-general at openib.org > > > Subject: [openib-general] another opensm crash > > > > > > (gdb) bt > > > #0 0x08071ff3 in osm_si_rcv_process (p_rcv=0x8090138, > > p_madw=0x80a1de0) > > > at osm_sw_info_rcv.c:679 > > > #1 0xb7fb0213 in __cl_disp_worker (context=0x8090da4) at > > > cl_dispatcher.c:108 > > > #2 0xb7fb8557 in __cl_thread_pool_routine (context=0x8090de4) > > > at cl_threadpool.c:78 > > > #3 0xb7fb834d in __cl_thread_wrapper (arg=0x8091408) at > > cl_thread.c:61 > > > #4 0x46cde341 in start_thread () from /lib/tls/libpthread.so.0 > > > #5 0x46b6e6fe in clone () from /lib/tls/libc.so.6 > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > 
http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Sun Nov 20 04:59:09 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Sun, 20 Nov 2005 14:59:09 +0200 Subject: [openib-general] OpenSM Debug Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AB49@taurus.voltaire.com> Hi Fab, On Sat, 2005-11-19 at 13:50, Fab Tillier wrote: > > From: Sean Hefty [mailto:sean.hefty at intel.com] > > Sent: Saturday, November 19, 2005 8:38 AM > > > > >The following code snippet is in opensm/main.c: > > > > > > if ( osm_is_debug() != cl_is_debug() ) > > > { > > > fprintf(stderr, "ERROR: OpenSM and Complib were compiled using > > > different modes\n"); > > > fprintf(stderr, "ERROR: OpenSM debug:%d Complib debug:%d \n", > > >osm_is_debug(), cl_is_debug() ); > > > exit(1); > > > } > > > > > >Is there a reason debug can't be turned on independently in OpenSM and > > >the component library ? > > > > There used to be a restriction that you couldn't mix a free/release > > version of the component library with a debug version of a client, > > and vice-versa. The debug version of complib added fields to > > structures that were not needed in the release version, resulting > > in different structure sizes between free and debug versions. > > This is probably still the case. > > That's correct - structure definitions change between the debug and release > builds of complib. The code above is there because in Linux, the library > created by complib has the same name in debug and release builds, so it is > possible to have a mismatch between the type of build for opensm and complib. > In Windows, I solved this by adding a debug-only suffix to the library name > (complibd vs. complib) so that the risk of linkage errors is eliminated. 
I have > suggested in the past that the Linux complib adopt a similar naming scheme and > that doing runtime checks for linkage errors was indicative of a poor design. > > This has been the basis for me pushing back on adding the cl_is_debug function > to the Windows version of complib. Is there a convention for naming debug libraries in Linux ? Is there any reason why the 2 versions of the libraries (with different names) shouldn't be allowed concurrently to exist and just link with the desired one ? -- Hal > - Fab > From halr at voltaire.com Sun Nov 20 05:05:01 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Nov 2005 08:05:01 -0500 Subject: [openib-general] another opensm crash In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E36189A7@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E36189A7@mtlexch01.mtl.com> Message-ID: <1132491520.26731.23778.camel@hal.voltaire.com> On Sun, 2005-11-20 at 04:52, Eitan Zahavi wrote: > Hi Hal, > > > > > > > Try to move aside your /lib/tls directory and see if you still get > these > > > crashes. > > > We have issues with TLS pthread and glibc > > > > There are still strange crashes like this which appear to be memory > > scribbling issues. > [EZ] OK we need to trace those. The problem will be recreating it now :-( This type of crash appeared numerous and varied as to where the scribbling occurred and how OpenSM crashed. -- Hal > But TLS has some bugs too. > We had cases where we could see cond wait events not being picked up. > > > > Moving tls aside changes the threads into processes. Does that > indicate > > that threading issues are suspected ? > [EZ] In old Pthread the threads seems like processes and in TLS they do > not. This is not the issue. I suspect that in gen1 we see the cond wait > issue more frequently as the vendor uses cl_timer more often (which uses > cond wait ...) 
> > > > -- Hal > > > > > > > > Eitan Zahavi > > > Design Technology Director > > > Mellanox Technologies LTD > > > Tel:+972-4-9097208 > > > Fax:+972-4-9593245 > > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > > > > -----Original Message----- > > > > From: Troy Benjegerdes [mailto:troy at scl.ameslab.gov] > > > > Sent: Monday, November 14, 2005 8:09 PM > > > > To: openib-general at openib.org > > > > Subject: [openib-general] another opensm crash > > > > > > > > (gdb) bt > > > > #0 0x08071ff3 in osm_si_rcv_process (p_rcv=0x8090138, > > > p_madw=0x80a1de0) > > > > at osm_sw_info_rcv.c:679 > > > > #1 0xb7fb0213 in __cl_disp_worker (context=0x8090da4) at > > > > cl_dispatcher.c:108 > > > > #2 0xb7fb8557 in __cl_thread_pool_routine (context=0x8090de4) > > > > at cl_threadpool.c:78 > > > > #3 0xb7fb834d in __cl_thread_wrapper (arg=0x8091408) at > > > cl_thread.c:61 > > > > #4 0x46cde341 in start_thread () from /lib/tls/libpthread.so.0 > > > > #5 0x46b6e6fe in clone () from /lib/tls/libc.so.6 > > > > > > > > _______________________________________________ > > > > openib-general mailing list > > > > openib-general at openib.org > > > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > To unsubscribe, please visit > > > http://openib.org/mailman/listinfo/openib-general > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From eitan at mellanox.co.il Sun Nov 20 05:31:20 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 20 Nov 2005 15:31:20 +0200 Subject: [openib-general] another opensm crash Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E36189B0@mtlexch01.mtl.com> Hi Hal, To reproduce the problems we see in large subnets we have to revive the simulator project. 
Yael will spend some time evolving the packet dropper test on the simulator and I hope we will be able to reproduce this kind of bugs. The limit of the current test is that it only runs the standard sweep without having any client doing path record, multicast and traps in parallel. EZ Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Sunday, November 20, 2005 3:05 PM > To: Eitan Zahavi > Cc: Troy Benjegerdes; openib-general at openib.org > Subject: RE: [openib-general] another opensm crash > > On Sun, 2005-11-20 at 04:52, Eitan Zahavi wrote: > > Hi Hal, > > > > > > > > > > Try to move aside your /lib/tls directory and see if you still get > > these > > > > crashes. > > > > We have issues with TLS pthread and glibc > > > > > > There are still strange crashes like this which appear to be memory > > > scribbling issues. > > [EZ] OK we need to trace those. > > The problem will be recreating it now :-( This type of crash appeared > numerous and varied as to where the scribbling occurred and how OpenSM > crashed. > > -- Hal > > > But TLS has some bugs too. > > We had cases where we could see cond wait events not being picked up. > > > > > > Moving tls aside changes the threads into processes. Does that > > indicate > > > that threading issues are suspected ? > > [EZ] In old Pthread the threads seems like processes and in TLS they do > > not. This is not the issue. I suspect that in gen1 we see the cond wait > > issue more frequently as the vendor uses cl_timer more often (which uses > > cond wait ...) > > > > > > -- Hal > > > > > > > > > > > Eitan Zahavi > > > > Design Technology Director > > > > Mellanox Technologies LTD > > > > Tel:+972-4-9097208 > > > > Fax:+972-4-9593245 > > > > P.O. 
Box 586 Yokneam 20692 ISRAEL > > > > > > > > > > > > > -----Original Message----- > > > > > From: Troy Benjegerdes [mailto:troy at scl.ameslab.gov] > > > > > Sent: Monday, November 14, 2005 8:09 PM > > > > > To: openib-general at openib.org > > > > > Subject: [openib-general] another opensm crash > > > > > > > > > > (gdb) bt > > > > > #0 0x08071ff3 in osm_si_rcv_process (p_rcv=0x8090138, > > > > p_madw=0x80a1de0) > > > > > at osm_sw_info_rcv.c:679 > > > > > #1 0xb7fb0213 in __cl_disp_worker (context=0x8090da4) at > > > > > cl_dispatcher.c:108 > > > > > #2 0xb7fb8557 in __cl_thread_pool_routine (context=0x8090de4) > > > > > at cl_threadpool.c:78 > > > > > #3 0xb7fb834d in __cl_thread_wrapper (arg=0x8091408) at > > > > cl_thread.c:61 > > > > > #4 0x46cde341 in start_thread () from /lib/tls/libpthread.so.0 > > > > > #5 0x46b6e6fe in clone () from /lib/tls/libc.so.6 > > > > > > > > > > _______________________________________________ > > > > > openib-general mailing list > > > > > openib-general at openib.org > > > > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > > > To unsubscribe, please visit > > > > http://openib.org/mailman/listinfo/openib-general > > > > _______________________________________________ > > > > openib-general mailing list > > > > openib-general at openib.org > > > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > From halr at voltaire.com Sun Nov 20 05:43:57 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Nov 2005 08:43:57 -0500 Subject: [openib-general] Re: opensm - umad_receiver break on alloc errors In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30E23F8@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30E23F8@mtlexch01.mtl.com> Message-ID: <1132494034.26731.23966.camel@hal.voltaire.com> Hi Yael, On Sun, 2005-11-20 at 04:20, Yael Kalka wrote: > Hi Hal, > > While reviewing the umad_receiver function in 
osm_vendor_ibumad.c we've > noticed > that when umad_alloc() calls fail, the receiver breaks. > What happens then is that SM continues to live, though the umad_receiver > thread > doesn't exist anymore. > I think that there is no use in keeping the SM alive in this case. > As a result, I think we should do one of the following when umad_alloc() > failes: > 1. If umad_alloc() fails - issue an error to the syslog, and exit SM. > This is a > fatal case. > 2. Use continue instead of break. Assuming that if umad_alloc() fails > this time - > doesn't mean it'll fail again. In general, I was afraid of a tight loop with this failing and just retrying over and over. I thought about some other strategies to dial this back (some artificial timeout before the next alloc was retried). There are 2 calls to umad_alloc in the umad_receiver. The first one is just to allocate a normal sized MAD. This is the one which has the issue above IMO. The second call is for a larger send. That one should definitely be changed from a break to a continue. Either you can issue a patch for this or I can fix it. This part is a one liner :-) Should we do something about the first alloc failure ? Thanks. -- Hal > What do you think? 
> Yael > From halr at voltaire.com Sun Nov 20 06:28:56 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Sun, 20 Nov 2005 16:28:56 +0200 Subject: [openib-general] [PATCH] OpenSM: Change debug build options for better debug Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AB4D@taurus.voltaire.com> OpenSM: Change debug build options for better debug Signed-off-by: Hal Rosenstock Index: complib/Makefile.am =================================================================== --- complib/Makefile.am (revision 4088) +++ complib/Makefile.am (working copy) @@ -4,7 +4,7 @@ lib_LTLIBRARIES = libosmcomp.la if DEBUG -DBGFLAGS = -g -O0 -D_DEBUG_ +DBGFLAGS = -ggdb -D_DEBUG_ else DBGFLAGS = -g -O2 endif Index: libvendor/Makefile.am =================================================================== --- libvendor/Makefile.am (revision 4088) +++ libvendor/Makefile.am (working copy) @@ -2,7 +2,7 @@ SUBDIRS = . if DEBUG -DBGFLAGS = -g -O0 -D_DEBUG_ +DBGFLAGS = -ggdb -D_DEBUG_ else DBGFLAGS = -g -O2 endif Index: opensm/Makefile.am =================================================================== --- opensm/Makefile.am (revision 4088) +++ opensm/Makefile.am (working copy) @@ -6,7 +6,7 @@ lib_LTLIBRARIES = libopensm.la if DEBUG -DBGFLAGS = -g -O0 -D_DEBUG_ +DBGFLAGS = -ggdb -D_DEBUG_ else DBGFLAGS = -g -O2 endif Index: osmtest/Makefile.am =================================================================== --- osmtest/Makefile.am (revision 4088) +++ osmtest/Makefile.am (working copy) @@ -1,6 +1,6 @@ if DEBUG -DBGFLAGS = -g -O0 -D_DEBUG_ +DBGFLAGS = -ggdb -D_DEBUG_ else DBGFLAGS = -g -O2 endif From troy at scl.ameslab.gov Sun Nov 20 07:38:53 2005 From: troy at scl.ameslab.gov (Troy Benjegerdes) Date: Sun, 20 Nov 2005 09:38:53 -0600 Subject: [openib-general] another opensm crash In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E36189B0@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E36189B0@mtlexch01.mtl.com> Message-ID: <4380988D.6090909@scl.ameslab.gov> Eitan Zahavi wrote: 
>Hi Hal, > >To reproduce the problems we see in large subnets we have to revive the >simulator project. Yael will spend some time evolving the packet dropper >test on the simulator and I hope we will be able to reproduce this kind >of bugs. > >The limit of the current test is that it only runs the standard sweep >without having any client doing path record, multicast and traps in >parallel. > >EZ > >Eitan Zahavi >Design Technology Director >Mellanox Technologies LTD >Tel:+972-4-9097208 >Fax:+972-4-9593245 >P.O. Box 586 Yokneam 20692 ISRAEL > > I would strongly suggest adding some sort of packet error injection tool as well. We had a lot of links generating symbol errors on that network, and didn't have time to track them down. From mst at mellanox.co.il Sun Nov 20 08:33:10 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 20 Nov 2005 18:33:10 +0200 Subject: [openib-general] user_mad: large rmpp length problem Message-ID: <20051120163310.GS20871@mellanox.co.il> Hello! ib_umad_write currently accepts a count parameter from the user and attempts to allocate a mad of size count - sizeof (struct ib_user_mad) in kernel memory. For large counts this, obviously, fails with -ENOMEM, which means that we can't send large transactions with RMPP. The proper fix appears to be to transfer the data by chunks, waking the user process and copying a fixed number of bytes each time. A simpler fix would be to allocate all of the memory in one go, but by chunks, and make it possible to pass a list of buffers to ib_post_send_mad. This would, however, open us up to a DoS scenario when the user wants to send a huge RMPP transaction - not sure how serious that is. Both ways would require API changes, mainly to the ib_create_send_mad and ib_post_send_mad functions. Comments? What would be the best solution to this problem?
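To make the second option concrete, here is a rough userspace sketch of storing one large payload as a list of fixed-size chunks, so that no single allocation grows with the message size. It is an illustration only, not kernel code and not the ib_mad API: SEG_SIZE, struct seg, seg_list_from_buf, seg_list_total and seg_list_free are all made-up names, and a real version would presumably also cap the total length to limit the DoS exposure mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SEG_SIZE 256	/* hypothetical fixed chunk size, in bytes */

/* One fixed-size chunk in a singly linked list of buffers. */
struct seg {
	struct seg *next;
	size_t len;			/* bytes used in data[] */
	unsigned char data[SEG_SIZE];
};

/* Free every chunk in the list. */
static void seg_list_free(struct seg *head)
{
	while (head) {
		struct seg *next = head->next;

		free(head);
		head = next;
	}
}

/*
 * Copy len bytes from src into a list of SEG_SIZE chunks.
 * No single allocation exceeds sizeof(struct seg), however large len is.
 * Returns the list head, or NULL on allocation failure or when len == 0.
 */
static struct seg *seg_list_from_buf(const void *src, size_t len)
{
	const unsigned char *p = src;
	struct seg *head = NULL, **tail = &head;

	assert(src != NULL || len == 0);

	while (len > 0) {
		size_t n = len < SEG_SIZE ? len : SEG_SIZE;
		struct seg *s = malloc(sizeof *s);

		if (!s) {
			/* Undo the partial list so the caller sees all-or-nothing. */
			seg_list_free(head);
			return NULL;
		}
		s->next = NULL;
		s->len = n;
		memcpy(s->data, p, n);
		*tail = s;
		tail = &s->next;
		p += n;
		len -= n;
	}
	return head;
}

/* Return the total payload bytes in the list; store the chunk count in *nsegs. */
static size_t seg_list_total(const struct seg *head, size_t *nsegs)
{
	size_t total = 0, count = 0;

	for (; head; head = head->next) {
		total += head->len;
		count++;
	}
	if (nsegs)
		*nsegs = count;
	return total;
}
```

A sender would then walk the list and hand each chunk to the transport in turn, instead of passing one contiguous buffer.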
-- MST From halr at voltaire.com Sun Nov 20 08:38:40 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Sun, 20 Nov 2005 18:38:40 +0200 Subject: [openib-general] Re: user_mad: large rmpp length problem Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AB4E@taurus.voltaire.com> On Sun, 2005-11-20 at 11:33, Michael S. Tsirkin wrote: > Hello! > ib_umad_write currently accepts a count parameter from user > and attempts to allocate mad of size count - sizeof (struct ib_user_mad) > in kernel memory. > > This, obviously, fails with -ENOMEM, which means that we cant > send large transactions with RMPP. > > The proper fix appears to be to transfer the data by chunks, > waking the user process and copying a fixed number of bytes each time. > > A simpler fix would be allocate all of the memory in one go, > but by chunks, and make it possible to pass a list of buffers > to ib_post_send_mad. This would, however, open us up for a DOS > scenario when the user want to send a huge RMPP transaction - > not sure how serious that is. > > Both ways would require API changed, mainly to ib_create_send_mad > and ib_post_send_mad functions. > > Comments? What would be the best solution to this problem? Yes, we are planning to address this shortly along the lines you outlined (second approach). -- Hal From ftillier at silverstorm.com Sun Nov 20 09:18:27 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Sun, 20 Nov 2005 09:18:27 -0800 Subject: [openib-general] OpenSM Debug In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F589AB49@taurus.voltaire.com> Message-ID: <002801c5edf6$6cff8530$9e5aa8c0@infiniconsys.com> > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Sunday, November 20, 2005 4:59 AM > > Hi Fab, > > On Sat, 2005-11-19 at 13:50, Fab Tillier wrote: > > > > That's correct - structure definitions change between the debug and > > release builds of complib. 
The code above is there because in Linux, > > the library created by complib has the same name in debug and release > > builds, so it is possible to have a mismatch between the type of > > build for opensm and complib. In Windows, I solved this by adding a > > debug-only suffix to the library name (complibd vs. complib) so that > > the risk of linkage errors is eliminated. I have suggested in the > > past that the Linux complib adopt a similar naming scheme and > > that doing runtime checks for linkage errors was indicative of a > > poor design. > > > > This has been the basis for me pushing back on adding the > > cl_is_debug function to the Windows version of complib. > > Is there a convention for naming debug libraries in Linux ? I'm no Linux expert, so I have no clue here. Perhaps the C libraries already have some method? > Is there any reason why the 2 versions of the libraries (with different > names) shouldn't be allowed concurrently to exist and just link with the > desired one ? There is none that I can think of. In fact, the Windows drivers allow both the debug and release versions of the user-mode components to co-exist, as well as mixing debug and release kernel drivers. This makes it easy to debug a single component without affecting timings in the whole stack. - Fab From mst at mellanox.co.il Sun Nov 20 10:06:58 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 20 Nov 2005 20:06:58 +0200 Subject: [openib-general] Re: OpenSM Debug In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F589AB49@taurus.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F589AB49@taurus.voltaire.com> Message-ID: <20051120180658.GB7064@mellanox.co.il> Quoting Hal Rosenstock : > Is there a convention for naming debug libraries in Linux ? > > Is there any reason why the 2 versions of the libraries (with different > names) shouldn't be allowed concurrently to exist and just link with the > desired one ? 
The best approach is to make all library versions binary compatible. This way you won't have this problem. -- MST From bunk at stusta.de Sun Nov 20 15:14:11 2005 From: bunk at stusta.de (Adrian Bunk) Date: Mon, 21 Nov 2005 00:14:11 +0100 Subject: [openib-general] [2.6 patch] drivers/infiniband/core/mad.c: fix a NULL pointer dereference Message-ID: <20051120231411.GF16060@stusta.de> The Coverity checker spotted this obvious NULL pointer dereference caused by a wrong order of the cleanups. Signed-off-by: Adrian Bunk --- linux-2.6.15-rc1-mm2-full/drivers/infiniband/core/mad.c.old 2005-11-20 22:04:36.000000000 +0100 +++ linux-2.6.15-rc1-mm2-full/drivers/infiniband/core/mad.c 2005-11-20 22:05:17.000000000 +0100 @@ -355,9 +355,9 @@ spin_unlock_irqrestore(&port_priv->reg_lock, flags); kfree(reg_req); error3: - kfree(mad_agent_priv); -error2: ib_dereg_mr(mad_agent_priv->agent.mr); +error2: + kfree(mad_agent_priv); error1: return ret; } From halr at voltaire.com Mon Nov 21 05:43:05 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 21 Nov 2005 15:43:05 +0200 Subject: [openib-general] [PATCH] OpenSM vendor layer: Don't exit from receiver thread when large umad alloc fails Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AB51@taurus.voltaire.com> OpenSM vendor layer: Don't exit from receiver thread when large umad alloc fails Signed-off-by: Hal Rosenstock Index: libvendor/osm_vendor_ibumad.c =================================================================== --- libvendor/osm_vendor_ibumad.c (revision 4088) +++ libvendor/osm_vendor_ibumad.c (working copy) @@ -273,7 +273,7 @@ umad_receiver(void *p_ptr) osm_log(p_ur->p_log, OSM_LOG_ERROR, "umad_receiver: ERR 5405: " "can't alloc umad length %d\n", length); - break; + continue; } if ((mad_agent = umad_recv(p_vend->umad_port_id, From dotanb at mellanox.co.il Mon Nov 21 06:03:14 2005 From: dotanb at mellanox.co.il (Dotan Barak) Date: Mon, 21 Nov 2005 16:03:14 +0200 Subject: [openib-general] can i post a send request with 0
bytes with the inline bit enabled? Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3879836@mtlexch01.mtl.com> Hi. i tried to do following thing in user level: i posted a send request with 0 bytes (the gather list length is 0) with the send opcode and the inline bit was enabled and i got completion with error. should the driver handle it? (and post send of 0 bytes with inline enabled should generate good completion) or the user should know that this scenario is illegal? what do you think about it? thanks Dotan Barak Software Verification Engineer Mellanox Technologies LTD Tel: +972-4-9097200 Ext: 231 Fax: +972-4-9593245 P.O. Box 86 Yokneam 20692 ISRAEL. Home: +972-77-8841095 Cell: 052-4222383 [ May the fork be with you ] From jlentini at netapp.com Mon Nov 21 07:25:30 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 21 Nov 2005 10:25:30 -0500 (EST) Subject: [openib-general] can i post a send request with 0 bytes with the inline bit enabled? In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3879836@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3879836@mtlexch01.mtl.com> Message-ID: Mellanox hardware interprets an sge with length 0 as a 2 GB buffer. Your memory region is probably too small and therefore you receive a protection error. james On Mon, 21 Nov 2005, Dotan Barak wrote: > Hi. > > i tried to do following thing in user level: i posted a send request > with 0 bytes (the gather list length is 0) with the send opcode and > the inline bit was enabled and i got completion with error. > > should the driver handle it? (and post send of 0 bytes with inline > enabled should generate good completion) or the user should know > that this scenario is illegal? > > what do you think about it? > > thanks > Dotan Barak > Software Verification Engineer > Mellanox Technologies LTD > Tel: +972-4-9097200 Ext: 231 Fax: +972-4-9593245 > P.O. Box 86 Yokneam 20692 ISRAEL.
> Home: +972-77-8841095 Cell: 052-4222383 > > [ May the fork be with you ] From dotanb at mellanox.co.il Mon Nov 21 07:33:19 2005 From: dotanb at mellanox.co.il (Dotan Barak) Date: Mon, 21 Nov 2005 17:33:19 +0200 Subject: [openib-general] can i post a send request with 0 bytes with the inline bit enabled? Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E38798A0@mtlexch01.mtl.com> > > > Mellanox hardware interprets an sge with length 0 as a 2 GB buffer. > > Your memory region is probably too small and therefore you receive a > protection error. > but i don't use any sge (my sge list length is 0) Dotan From itamar at mellanox.co.il Mon Nov 21 07:38:03 2005 From: itamar at mellanox.co.il (Itamar Rabenstein) Date: Mon, 21 Nov 2005 17:38:03 +0200 Subject: [openib-general] can i post a send request with 0 bytes with the inline bit enabled? Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E38798A1@mtlexch01.mtl.com> No. Mellanox hardware interprets a sge with length 1 that the value of the lenght field in the first element is 0 as a 2 GB buffer Mellanox hardware interprets a sge with length 0 as 0 byte lenght send . Itamar > -----Original Message----- > From: James Lentini [mailto:jlentini at netapp.com] > Sent: Monday, November 21, 2005 5:26 PM > To: Dotan Barak > Cc: openib-general at openib.org > Subject: Re: [openib-general] can i post a send request with 0 bytes > with the inline bit enabled? > > > > Mellanox hardware interprets an sge with length 0 as a 2 GB buffer. > > Your memory region is probably too small and therefore you receive a > protection error. > > james > > On Mon, 21 Nov 2005, Dotan Barak wrote: > > > Hi. > > > > i tried to do following thing in user level: i posted a > send request > > with 0 bytes (the gather list length is 0) with the send opcode and > > the inline bit was enabled and i got completion with error. > > > > should the driver handle it? 
(and post send of 0 bytes with inline > > enabled should generate good completion) or the user should know > > that this scenario is illegal? > > > > what do you think about it? > > > > thanks > > Dotan Barak > > Software Verification Engineer > > Mellanox Technologies LTD > > Tel: +972-4-9097200 Ext: 231 Fax: +972-4-9593245 > > P.O. Box 86 Yokneam 20692 ISRAEL. > > Home: +972-77-8841095 Cell: 052-4222383 > > > > [ May the fork be with you ] > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From jlentini at netapp.com Mon Nov 21 07:36:28 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 21 Nov 2005 10:36:28 -0500 (EST) Subject: [openib-general] can i post a send request with 0 bytes with the inline bit enabled? In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E38798A0@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E38798A0@mtlexch01.mtl.com> Message-ID: > > Mellanox hardware interprets an sge with length 0 as a 2 GB buffer. > > > > Your memory region is probably too small and therefore you receive a > > protection error. > > > > but i don't use any sge (my sge list length is 0) How do you indicate that the sge list has 0 length? Are you posting an ibv_send_wr? What value do you set the sg_list field to? From mst at mellanox.co.il Mon Nov 21 07:39:13 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 21 Nov 2005 17:39:13 +0200 Subject: [openib-general] [PATCH] mthca: missing cleanup in reset state Message-ID: <20051121153913.GC20871@mellanox.co.il> Hello, Roland! The following patch fixes system hangs I am sometimes seeing when ipoib is brought down and back up. Should something similiar be done in libmthca? Thanks, MST --- mthca: last pointer was not updated when qp was modified to reset state. 
This causes data corruption if wqes were already posted on the queue. Signed-off-by: Michael S. Tsirkin Index: latest/drivers/infiniband/hw/mthca/mthca_qp.c =================================================================== --- latest.orig/drivers/infiniband/hw/mthca/mthca_qp.c +++ latest/drivers/infiniband/hw/mthca/mthca_qp.c @@ -869,7 +869,10 @@ int mthca_modify_qp(struct ib_qp *ibqp, qp->ibqp.srq ? to_msrq(qp->ibqp.srq) : NULL); mthca_wq_init(&qp->sq); + qp->sq.last = get_send_wqe(qp, qp->sq.max - 1); + mthca_wq_init(&qp->rq); + qp->rq.last = get_recv_wqe(qp, qp->rq.max - 1); if (mthca_is_memfree(dev)) { *qp->sq.db = 0; -- MST From jlentini at netapp.com Mon Nov 21 07:41:46 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 21 Nov 2005 10:41:46 -0500 (EST) Subject: [openib-general] can i post a send request with 0 bytes with the inline bit enabled? In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E38798A1@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E38798A1@mtlexch01.mtl.com> Message-ID: On Mon, 21 Nov 2005, Itamar Rabenstein wrote: > No. > > Mellanox hardware interprets a sge with length 1 that the value of the > lenght field in the first element is 0 as a 2 GB buffer > Mellanox hardware interprets a sge with length 0 as 0 byte lenght send . sge stands for scatter gather element. What you say is correct if you replace sge with sge list above.
From tziporet at mellanox.co.il Mon Nov 21 08:10:26 2005 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 21 Nov 2005 18:10:26 +0200 Subject: [openib-general] Re: user_mad: large rmpp length problem Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E366C354@mtlexch01.mtl.com> > Yes, we are planning to address this shortly along the lines you > outlined (second approach). > > -- Hal Hi Hal, This is very important to us since its limit the cluster size that we can query. Since you all go for holiday vacation - can you define the API and we will implement it?
Tziporet From dotanb at mellanox.co.il Mon Nov 21 08:16:39 2005 From: dotanb at mellanox.co.il (Dotan Barak) Date: Mon, 21 Nov 2005 18:16:39 +0200 Subject: [openib-general] can i post a send request with 0 bytes with the inline bit enabled? Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E38798C0@mtlexch01.mtl.com> > > How do you indicate that the sge list has 0 length? > > Are you posting an ibv_send_wr? What value do you set the sg_list > field to? > i post a SR using the ibv_send_wr. in the struct ibv_send_wr: num_sge <-- 0 i don't put any value in sg_list because no one should check this value ... Dotan From greg at kroah.com Sun Nov 20 16:24:35 2005 From: greg at kroah.com (Greg KH) Date: Sun, 20 Nov 2005 16:24:35 -0800 Subject: [openib-general] Re: [stable] [2.6 patch] drivers/infiniband/core/mad.c: fix a NULL pointer dereference In-Reply-To: <20051120230050.GB16060@stusta.de> <20051120230826.GD16060@stusta.de> <20051120231411.GF16060@stusta.de> References: <20051120230050.GB16060@stusta.de> <20051120230826.GD16060@stusta.de> <20051120231411.GF16060@stusta.de> Message-ID: <20051121002435.GB9749@kroah.com> Please send these again to the stable@ address when they have been accepted into upstream.
thanks, greg k-h From halr at voltaire.com Mon Nov 21 08:18:45 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Nov 2005 11:18:45 -0500 Subject: [openib-general] Re: user_mad: large rmpp length problem In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E366C354@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E366C354@mtlexch01.mtl.com> Message-ID: <1132589923.26731.32658.camel@hal.voltaire.com> Hi Tziporet, On Mon, 2005-11-21 at 11:10, Tziporet Koren wrote: > Hi Hal, > This is very important to us since its limit the cluster size that we can query. > Since you all go for holiday vacation - can you define the API and we will implement it? I also don't think an API change is required in the approach I am planning to take. It's in user_mad in how it allocates memory and copies into it as well as the underlying mad layer for supporting more than 1 sgl entry. If you want to take the ball on this, you/Michael are welcome to submit patches to both of these along these lines. Sean can comment for himself on this. -- Hal From mst at mellanox.co.il Mon Nov 21 08:31:10 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 21 Nov 2005 18:31:10 +0200 Subject: [openib-general] [PATCH series] ipoib: fix multicast related issues Message-ID: <20051121163110.GD20871@mellanox.co.il> Multicast handling was susceptible to several race conditions, where different sources would try to start/stop multicast thread and create/destroy multicast groups in parallel. The following two patches (sent separately) solve these races by 1. Forcing all entries into ipoib_multicast.c to go through the single-threaded ipoib workqueue. 2. Protecting each access to mcast->queue, priv->broadcast and priv->mcast_list by priv->lock. With these two patches applied (and with the mthca patch that I've sent previously), I've been running the up/down test for more than 24 hours without failures now. 
-- MST From mst at mellanox.co.il Mon Nov 21 08:34:58 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 21 Nov 2005 18:34:58 +0200 Subject: [openib-general] [PATCH 1 of 2] ipoib: pass all of multicast.c through ipoib_workqueue Message-ID: <20051121163458.GE20871@mellanox.co.il> There appear to be several races in IPoIB multicast code: for example when a MAD event may start the multicast thread, while ipoib_stop tries to stop it, leaving a thread running after the device is removed. This patch forces all external callers of multicast.c into ipoib_workqueue, which avoids the need for explicit synchronisation. Signed-off-by: Michael S. Tsirkin Index: linux-2.6.14-dbg/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- linux-2.6.14-dbg.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2005-11-20 11:21:39.000000000 +0200 +++ linux-2.6.14-dbg/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2005-11-20 12:04:32.000000000 +0200 @@ -431,7 +431,8 @@ int ipoib_ib_dev_up(struct net_device *d set_bit(IPOIB_FLAG_OPER_UP, &priv->flags); - return ipoib_mcast_start_thread(dev); + queue_work(ipoib_workqueue, &priv->start_task); + return 0; } int ipoib_ib_dev_down(struct net_device *dev) @@ -452,19 +453,8 @@ int ipoib_ib_dev_down(struct net_device flush_workqueue(ipoib_workqueue); } - ipoib_mcast_stop_thread(dev, 1); - - /* - * Flush the multicast groups first so we stop any multicast joins. The - * completion thread may have already died and we may deadlock waiting - * for the completion thread to finish some multicast joins. 
- */ - ipoib_mcast_dev_flush(dev); - - /* Delete broadcast and local addresses since they will be recreated */ - ipoib_mcast_dev_down(dev); - - ipoib_flush_paths(dev); + queue_work(ipoib_workqueue, &priv->down_task); + flush_workqueue(ipoib_workqueue); return 0; } @@ -619,11 +609,8 @@ void ipoib_ib_dev_cleanup(struct net_dev ipoib_dbg(priv, "cleaning up ib_dev\n"); - ipoib_mcast_stop_thread(dev, 1); - - /* Delete the broadcast address and the local address */ - ipoib_mcast_dev_down(dev); - + queue_work(ipoib_workqueue, &priv->cleanup_task); + flush_workqueue(ipoib_workqueue); ipoib_transport_dev_cleanup(dev); } Index: linux-2.6.14-dbg/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- linux-2.6.14-dbg.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2005-11-20 11:21:39.000000000 +0200 +++ linux-2.6.14-dbg/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2005-11-20 12:19:03.000000000 +0200 @@ -573,7 +573,7 @@ void ipoib_mcast_join_task(void *dev_ptr netif_carrier_on(dev); } -int ipoib_mcast_start_thread(struct net_device *dev) +static void ipoib_mcast_start_thread(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -583,11 +583,9 @@ int ipoib_mcast_start_thread(struct net_ if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags)) queue_work(ipoib_workqueue, &priv->mcast_task); up(&mcast_mutex); - - return 0; } -int ipoib_mcast_stop_thread(struct net_device *dev, int flush) +static void ipoib_mcast_stop_thread(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_mcast *mcast; @@ -599,9 +597,6 @@ int ipoib_mcast_stop_thread(struct net_d cancel_delayed_work(&priv->mcast_task); up(&mcast_mutex); - if (flush) - flush_workqueue(ipoib_workqueue); - if (priv->broadcast && priv->broadcast->query) { ib_sa_cancel_query(priv->broadcast->query_id, priv->broadcast->query); priv->broadcast->query = NULL; @@ -618,8 +613,6 @@ int ipoib_mcast_stop_thread(struct 
net_d wait_for_completion(&mcast->done); } } - - return 0; } static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast) @@ -737,7 +730,7 @@ out: spin_unlock(&priv->lock); } -void ipoib_mcast_dev_flush(struct net_device *dev) +static void ipoib_mcast_dev_flush(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); LIST_HEAD(remove_list); @@ -793,7 +786,7 @@ void ipoib_mcast_dev_flush(struct net_de } } -void ipoib_mcast_dev_down(struct net_device *dev) +static void ipoib_mcast_dev_down(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); unsigned long flags; @@ -822,7 +815,7 @@ void ipoib_mcast_restart_task(void *dev_ ipoib_dbg_mcast(priv, "restarting multicast task\n"); - ipoib_mcast_stop_thread(dev, 0); + ipoib_mcast_stop_thread(dev); spin_lock_irqsave(&priv->lock, flags); @@ -908,6 +901,41 @@ void ipoib_mcast_restart_task(void *dev_ ipoib_mcast_start_thread(dev); } +void ipoib_mcast_down_task(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + + ipoib_mcast_stop_thread(dev); + + /* + * Flush the multicast groups first so we stop any multicast joins. The + * completion thread may have already died and we may deadlock waiting + * for the completion thread to finish some multicast joins. 
+ */ + ipoib_mcast_dev_flush(dev); + + /* Delete broadcast and local addresses since they will be recreated */ + ipoib_mcast_dev_down(dev); + + ipoib_flush_paths(dev); +} + +void ipoib_mcast_cleanup_task(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + ipoib_mcast_stop_thread(dev); + + /* Delete the broadcast address and the local address */ + ipoib_mcast_dev_down(dev); +} + +void ipoib_mcast_start_task(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + ipoib_mcast_start_thread(dev); +} + + #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG struct ipoib_mcast_iter *ipoib_mcast_iter_init(struct net_device *dev) Index: linux-2.6.14-dbg/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- linux-2.6.14-dbg.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2005-11-20 11:21:39.000000000 +0200 +++ linux-2.6.14-dbg/drivers/infiniband/ulp/ipoib/ipoib_main.c 2005-11-20 11:57:00.000000000 +0200 @@ -899,6 +899,9 @@ static void ipoib_setup(struct net_devic INIT_WORK(&priv->mcast_task, ipoib_mcast_join_task, priv->dev); INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush, priv->dev); INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task, priv->dev); + INIT_WORK(&priv->down_task, ipoib_mcast_down_task, priv->dev); + INIT_WORK(&priv->cleanup_task, ipoib_mcast_cleanup_task, priv->dev); + INIT_WORK(&priv->start_task, ipoib_mcast_start_task, priv->dev); INIT_WORK(&priv->ah_reap_task, ipoib_reap_ah, priv->dev); } Index: linux-2.6.14-dbg/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- linux-2.6.14-dbg.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2005-11-20 11:21:39.000000000 +0200 +++ linux-2.6.14-dbg/drivers/infiniband/ulp/ipoib/ipoib.h 2005-11-20 12:18:43.000000000 +0200 @@ -137,6 +137,9 @@ struct ipoib_dev_priv { struct work_struct mcast_task; struct work_struct flush_task; struct work_struct restart_task; + struct work_struct down_task; + struct work_struct cleanup_task; 
+ struct work_struct start_task; struct work_struct ah_reap_task; struct ib_device *ca; @@ -263,11 +266,9 @@ void ipoib_mcast_send(struct net_device struct sk_buff *skb); void ipoib_mcast_restart_task(void *dev_ptr); -int ipoib_mcast_start_thread(struct net_device *dev); -int ipoib_mcast_stop_thread(struct net_device *dev, int flush); - -void ipoib_mcast_dev_down(struct net_device *dev); -void ipoib_mcast_dev_flush(struct net_device *dev); +void ipoib_mcast_start_task(void *dev_ptr); +void ipoib_mcast_down_task(void *dev_ptr); +void ipoib_mcast_cleanup_task(void *dev_ptr); #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG struct ipoib_mcast_iter *ipoib_mcast_iter_init(struct net_device *dev); -- MST From mst at mellanox.co.il Mon Nov 21 08:36:53 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 21 Nov 2005 18:36:53 +0200 Subject: [openib-general] [PATCH 2 of 2] ipoib: ipoib_multicast.c cleanup Message-ID: <20051121163653.GF20871@mellanox.co.il> Fix several race conditions in ipoib_multicast.c: 1. Make sure mcast->query is set to NULL if, and only if, no query is outstanding. 2. Make sure mcast->done is initialized to uncompleted value before we submit a new query, so that its safe to wait on. 4. Protect all accesses to priv->broadcast, priv->multicast_list, mcast->query and mcast->done by priv->lock. I had to change mcast_mutex to ipoib_mcast_lock to make the last bit work. Signed-off-by: Michael S. 
Tsirkin Index: linux-2.6.14-dbg/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- linux-2.6.14-dbg.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2005-11-20 11:57:00.000000000 +0200 +++ linux-2.6.14-dbg/drivers/infiniband/ulp/ipoib/ipoib_main.c 2005-11-20 14:57:18.000000000 +0200 @@ -1146,6 +1146,8 @@ static int __init ipoib_init_module(void if (ret) goto err_wq; + spin_lock_init(&ipoib_mcast_lock); + return 0; err_wq: Index: linux-2.6.14-dbg/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- linux-2.6.14-dbg.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2005-11-20 12:34:04.000000000 +0200 +++ linux-2.6.14-dbg/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2005-11-20 14:57:18.000000000 +0200 @@ -53,7 +53,7 @@ MODULE_PARM_DESC(mcast_debug_level, "Enable multicast debug tracing if > 0"); #endif -static DECLARE_MUTEX(mcast_mutex); +spinlock_t ipoib_mcast_lock; /* Used for all multicast joins (broadcast, IPv4 mcast and IPv6 mcast) */ struct ipoib_mcast { @@ -126,17 +126,14 @@ static void ipoib_mcast_free(struct ipoi kfree(mcast); } -static struct ipoib_mcast *ipoib_mcast_alloc(struct net_device *dev, - int can_sleep) +static struct ipoib_mcast *ipoib_mcast_alloc(struct net_device *dev) { struct ipoib_mcast *mcast; - mcast = kzalloc(sizeof *mcast, can_sleep ? 
GFP_KERNEL : GFP_ATOMIC); + mcast = kzalloc(sizeof *mcast, GFP_ATOMIC); if (!mcast) return NULL; - init_completion(&mcast->done); - mcast->dev = dev; mcast->created = jiffies; mcast->backoff = 1; @@ -209,17 +206,23 @@ static int ipoib_mcast_join_finish(struc { struct net_device *dev = mcast->dev; struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned long flags; int ret; mcast->mcmember = *mcmember; + spin_lock_irqsave(&priv->lock, flags); + /* Set the cached Q_Key before we attach if it's the broadcast group */ - if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4, + if (priv->broadcast && + !memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4, sizeof (union ib_gid))) { priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey); priv->tx_wr.wr.ud.remote_qkey = priv->qkey; } + spin_unlock_irqrestore(&priv->lock, flags); + if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { if (test_and_set_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { ipoib_warn(priv, "multicast group " IPOIB_GID_FMT @@ -303,6 +306,12 @@ ipoib_mcast_sendonly_join_complete(int s { struct ipoib_mcast *mcast = mcast_ptr; struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned long flags; + + ipoib_dbg_mcast(priv, "sendonly join completion for " IPOIB_GID_FMT + " (status %d)\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), status); if (!status) ipoib_mcast_join_finish(mcast, mcmember); @@ -320,7 +329,11 @@ ipoib_mcast_sendonly_join_complete(int s clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); } + spin_lock_irqsave(&priv->lock, flags); + mcast->query = NULL; + complete(&mcast->done); + spin_unlock_irqrestore(&priv->lock, flags); } static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast) @@ -350,6 +363,7 @@ static int ipoib_mcast_sendonly_join(str rec.port_gid = priv->local_gid; rec.pkey = cpu_to_be16(priv->pkey); + init_completion(&mcast->done); ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, IB_SA_MCMEMBER_REC_MGID | 
IB_SA_MCMEMBER_REC_PORT_GID | @@ -379,23 +393,31 @@ static void ipoib_mcast_join_complete(in struct ipoib_mcast *mcast = mcast_ptr; struct net_device *dev = mcast->dev; struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned long flags; ipoib_dbg_mcast(priv, "join completion for " IPOIB_GID_FMT " (status %d)\n", IPOIB_GID_ARG(mcast->mcmember.mgid), status); + if (!status && !ipoib_mcast_join_finish(mcast, mcmember)) { mcast->backoff = 1; - down(&mcast_mutex); + spin_lock(&ipoib_mcast_lock); if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) queue_work(ipoib_workqueue, &priv->mcast_task); - up(&mcast_mutex); + spin_unlock(&ipoib_mcast_lock); + spin_lock_irqsave(&priv->lock, flags); + mcast->query = NULL; complete(&mcast->done); + spin_unlock_irqrestore(&priv->lock, flags); return; } if (status == -EINTR) { + spin_lock_irqsave(&priv->lock, flags); + mcast->query = NULL; complete(&mcast->done); + spin_unlock_irqrestore(&priv->lock, flags); return; } @@ -417,20 +439,21 @@ static void ipoib_mcast_join_complete(in if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; - mcast->query = NULL; + spin_lock_irqsave(&priv->lock, flags); - down(&mcast_mutex); + spin_lock(&ipoib_mcast_lock); if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) { if (status == -ETIMEDOUT) queue_work(ipoib_workqueue, &priv->mcast_task); else queue_delayed_work(ipoib_workqueue, &priv->mcast_task, mcast->backoff * HZ); - } else - complete(&mcast->done); - up(&mcast_mutex); + } + spin_unlock(&ipoib_mcast_lock); - return; + mcast->query = NULL; + complete(&mcast->done); + spin_unlock_irqrestore(&priv->lock, flags); } static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast, @@ -469,6 +492,7 @@ static void ipoib_mcast_join(struct net_ rec.traffic_class = priv->broadcast->mcmember.traffic_class; } + init_completion(&mcast->done); ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, comp_mask, mcast->backoff * 1000, GFP_ATOMIC, 
ipoib_mcast_join_complete, @@ -481,12 +505,12 @@ static void ipoib_mcast_join(struct net_ if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; - down(&mcast_mutex); + spin_lock(&ipoib_mcast_lock); if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) queue_delayed_work(ipoib_workqueue, &priv->mcast_task, mcast->backoff * HZ); - up(&mcast_mutex); + spin_unlock(&ipoib_mcast_lock); } else mcast->query_id = ret; } @@ -515,44 +539,44 @@ void ipoib_mcast_join_task(void *dev_ptr ipoib_warn(priv, "ib_query_port failed\n"); } + spin_lock_irq(&priv->lock); + if (!priv->broadcast) { - priv->broadcast = ipoib_mcast_alloc(dev, 1); + priv->broadcast = ipoib_mcast_alloc(dev); if (!priv->broadcast) { ipoib_warn(priv, "failed to allocate broadcast group\n"); - down(&mcast_mutex); + spin_lock(&ipoib_mcast_lock); if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) queue_delayed_work(ipoib_workqueue, &priv->mcast_task, HZ); - up(&mcast_mutex); - return; + spin_unlock(&ipoib_mcast_lock); + goto unlock; } memcpy(priv->broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4, sizeof (union ib_gid)); - spin_lock_irq(&priv->lock); __ipoib_mcast_add(dev, priv->broadcast); - spin_unlock_irq(&priv->lock); } - if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) { + if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags) && + !priv->broadcast->query) { ipoib_mcast_join(dev, priv->broadcast, 0); - return; + goto unlock; } while (1) { struct ipoib_mcast *mcast = NULL; - spin_lock_irq(&priv->lock); list_for_each_entry(mcast, &priv->multicast_list, list) { if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) && !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) - && !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { + && !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags) + && !mcast->query) { /* Found the next unjoined group */ break; } } - spin_unlock_irq(&priv->lock); if (&mcast->list == &priv->multicast_list) { /* All done */ @@ -560,7 +584,7 @@ 
void ipoib_mcast_join_task(void *dev_ptr } ipoib_mcast_join(dev, mcast, 1); - return; + goto unlock; } priv->mcast_mtu = ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu) - @@ -571,48 +595,59 @@ void ipoib_mcast_join_task(void *dev_ptr clear_bit(IPOIB_MCAST_RUN, &priv->flags); netif_carrier_on(dev); + +unlock: + spin_unlock_irq(&priv->lock); } static void ipoib_mcast_start_thread(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned long flags; ipoib_dbg_mcast(priv, "starting multicast thread\n"); - down(&mcast_mutex); + spin_lock_irqsave(&ipoib_mcast_lock, flags); if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags)) queue_work(ipoib_workqueue, &priv->mcast_task); - up(&mcast_mutex); + spin_unlock_irqrestore(&ipoib_mcast_lock, flags); } static void ipoib_mcast_stop_thread(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_mcast *mcast; + unsigned long flags; ipoib_dbg_mcast(priv, "stopping multicast thread\n"); - down(&mcast_mutex); + spin_lock_irqsave(&priv->lock, flags); + + spin_lock(&ipoib_mcast_lock); clear_bit(IPOIB_MCAST_RUN, &priv->flags); cancel_delayed_work(&priv->mcast_task); - up(&mcast_mutex); + spin_unlock(&ipoib_mcast_lock); if (priv->broadcast && priv->broadcast->query) { ib_sa_cancel_query(priv->broadcast->query_id, priv->broadcast->query); - priv->broadcast->query = NULL; + spin_unlock_irqrestore(&priv->lock, flags); ipoib_dbg_mcast(priv, "waiting for bcast\n"); wait_for_completion(&priv->broadcast->done); + spin_lock_irqsave(&priv->lock, flags); } list_for_each_entry(mcast, &priv->multicast_list, list) { if (mcast->query) { ib_sa_cancel_query(mcast->query_id, mcast->query); - mcast->query = NULL; + spin_unlock_irqrestore(&priv->lock, flags); ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mcast->mcmember.mgid)); wait_for_completion(&mcast->done); + spin_lock_irqsave(&priv->lock, flags); } } + + spin_unlock_irqrestore(&priv->lock, flags); } static int 
ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast) @@ -621,6 +656,7 @@ static int ipoib_mcast_leave(struct net_ struct ib_sa_mcmember_rec rec = { .join_state = 1 }; + struct ib_sa_query *query; int ret = 0; if (!test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) @@ -629,6 +665,8 @@ static int ipoib_mcast_leave(struct net_ ipoib_dbg_mcast(priv, "leaving MGID " IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mcast->mcmember.mgid)); + BUG_ON(mcast->query); + rec.mgid = mcast->mcmember.mgid; rec.port_gid = priv->local_gid; rec.pkey = cpu_to_be16(priv->pkey); @@ -649,7 +687,7 @@ static int ipoib_mcast_leave(struct net_ IB_SA_MCMEMBER_REC_PKEY | IB_SA_MCMEMBER_REC_JOIN_STATE, 0, GFP_ATOMIC, NULL, - mcast, &mcast->query); + mcast, &query); if (ret < 0) ipoib_warn(priv, "ib_sa_mcmember_rec_delete failed " "for leave (result = %d)\n", ret); @@ -675,7 +713,7 @@ void ipoib_mcast_send(struct net_device ipoib_dbg_mcast(priv, "setting up send only multicast group for " IPOIB_GID_FMT "\n", IPOIB_GID_ARG(*mgid)); - mcast = ipoib_mcast_alloc(dev, 0); + mcast = ipoib_mcast_alloc(dev); if (!mcast) { ipoib_warn(priv, "unable to allocate memory for " "multicast structure\n"); @@ -741,7 +779,7 @@ static void ipoib_mcast_dev_flush(struct spin_lock_irqsave(&priv->lock, flags); list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list) { - nmcast = ipoib_mcast_alloc(dev, 0); + nmcast = ipoib_mcast_alloc(dev); if (nmcast) { nmcast->flags = mcast->flags & (1 << IPOIB_MCAST_FLAG_SENDONLY); @@ -764,17 +802,16 @@ static void ipoib_mcast_dev_flush(struct } if (priv->broadcast) { - nmcast = ipoib_mcast_alloc(dev, 0); + nmcast = ipoib_mcast_alloc(dev); if (nmcast) { nmcast->mcmember.mgid = priv->broadcast->mcmember.mgid; rb_replace_node(&priv->broadcast->rb_node, &nmcast->rb_node, &priv->multicast_tree); - - list_add_tail(&priv->broadcast->list, &remove_list); } + list_add_tail(&priv->broadcast->list, &remove_list); priv->broadcast = nmcast; } @@ -789,19 +826,23 @@ static 
void ipoib_mcast_dev_flush(struct static void ipoib_mcast_dev_down(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_mcast *mcast; unsigned long flags; + spin_lock_irqsave(&priv->lock, flags); + /* Delete broadcast since it will be recreated */ if (priv->broadcast) { ipoib_dbg_mcast(priv, "deleting broadcast group\n"); - spin_lock_irqsave(&priv->lock, flags); rb_erase(&priv->broadcast->rb_node, &priv->multicast_tree); - spin_unlock_irqrestore(&priv->lock, flags); - ipoib_mcast_leave(dev, priv->broadcast); - ipoib_mcast_free(priv->broadcast); + mcast = priv->broadcast; priv->broadcast = NULL; - } + spin_unlock_irqrestore(&priv->lock, flags); + ipoib_mcast_leave(dev, mcast); + ipoib_mcast_free(mcast); + } else + spin_unlock_irqrestore(&priv->lock, flags); } void ipoib_mcast_restart_task(void *dev_ptr) @@ -847,7 +888,7 @@ void ipoib_mcast_restart_task(void *dev_ ipoib_dbg_mcast(priv, "adding multicast entry for mgid " IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mgid)); - nmcast = ipoib_mcast_alloc(dev, 0); + nmcast = ipoib_mcast_alloc(dev); if (!nmcast) { ipoib_warn(priv, "unable to allocate memory for multicast structure\n"); continue; Index: linux-2.6.14-dbg/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- linux-2.6.14-dbg.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2005-11-20 12:18:43.000000000 +0200 +++ linux-2.6.14-dbg/drivers/infiniband/ulp/ipoib/ipoib.h 2005-11-20 14:56:53.000000000 +0200 @@ -226,6 +226,7 @@ static inline struct ipoib_neigh **to_ip } extern struct workqueue_struct *ipoib_workqueue; +extern spinlock_t ipoib_mcast_lock; /* functions */ -- MST From jackm at mellanox.co.il Mon Nov 21 08:48:04 2005 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Mon, 21 Nov 2005 18:48:04 +0200 Subject: [openib-general] No resource tracking per qp for multicast groups Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E38798D3@mtlexch01.mtl.com> Hi, I noticed that there is no per-qp 
tracking of multicast groups of which it is a member. Thus, for example,
if a user-space app dies without a chance to perform its cleanup, the
(nonexistent) QP will still be listed in the HCA firmware as a member of
its multicast groups.

This has 2 effects:
1. The effective number of qp's which can join that multicast group has
   been reduced (since it has zombie entries).
2. If the above QP gets re-used, it will still be a member of the
   multicast groups (and therefore receive packets which were not
   intended for the new user of the QP).

I suggest tracking mcast group membership in kernel-space only. If we
don't wish to change the verbs layer behavior, we can just detach a qp
from all its multicast groups (if any) in ib_destroy_qp (although this is
not IB compliant -- see IB Spec 11.2.4.4 (we should return an error in
this case)). Otherwise, I think we'll need something messy (such as an
ib_verbs layer function requesting a QP to detach from all its multicast
groups).

My preference is to leave the verbs layer alone as much as possible.
Track the multicast group membership per qp (gid and lid) in struct
ib_qp, and make calls in ib_destroy_qp() to ib_mcast_detach().

Jack

From robert.j.woodruff at intel.com  Mon Nov 21 08:50:31 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Mon, 21 Nov 2005 08:50:31 -0800
Subject: [openib-general] Announce: preview RPMs for FC-4and RHEL-4 available
In-Reply-To: <437E9FC4.8080408@ipd.uni-karlsruhe.de>
Message-ID:

Thomas wrote,
>Doug Ledford wrote:
>> OK, I've uploaded some new kernel rpms to the same site as before. This
>> should fix the oops problem on 64 bit arches.

>Tried the kernel here on rhel4/ia64, oops fixed indeed.

>- Thomas

Ditto. I also ran this over the weekend on a small cluster in my lab,
running MPI over uDAPL, MPI over SDP, and MPI over IPoIB all
simultaneously and it ran all weekend with no errors.
I have only tested the kernel RPM thus far. Will test the userspace RPMS
also, once you have one for uDAPL, which I need for MPI.

woody

From mshefty at ichips.intel.com  Mon Nov 21 10:45:52 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 21 Nov 2005 10:45:52 -0800
Subject: [openib-general] Re: user_mad: large rmpp length problem
In-Reply-To: <1132589923.26731.32658.camel@hal.voltaire.com>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E366C354@mtlexch01.mtl.com> <1132589923.26731.32658.camel@hal.voltaire.com>
Message-ID: <438215E0.6080305@ichips.intel.com>

Hal Rosenstock wrote:
> I also don't think an API change is required in the approach I am
> planning to take. It's in user_mad in how it allocates memory and copies
> into it as well as the underlying mad layer for supporting more than 1
> sgl entry. If you want to take the ball on this, you/Michael are welcome
> to submit patches to both of these along these lines. Sean can comment
> for himself on this.

An API change shouldn't be necessary. The plan to support large sends was to
change how the void *mad pointer in struct ib_mad_send_buf is used. Currently,
it references a single data buffer. The fix is to have it reference a list of
buffers for large transfers. See:

http://openib.org/pipermail/openib-general/2005-October/012723.html

for more details.

And yes, a patch for this would be great. Otherwise, I will eventually get to
this, but it may be a few weeks before it bumps up high enough on my priority list.

- Sean

From mst at mellanox.co.il  Mon Nov 21 11:55:31 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 21 Nov 2005 21:55:31 +0200
Subject: [openib-general] Re: Re: user_mad: large rmpp length problem
In-Reply-To: <438215E0.6080305@ichips.intel.com>
References: <438215E0.6080305@ichips.intel.com>
Message-ID: <20051121195531.GA16794@mellanox.co.il>

Quoting r.
Sean Hefty :
> Subject: Re: Re: user_mad: large rmpp length problem
>
> Hal Rosenstock wrote:
> > I also don't think an API change is required in the approach I am
> > planning to take. It's in user_mad in how it allocates memory and copies
> > into it as well as the underlying mad layer for supporting more than 1
> > sgl entry. If you want to take the ball on this, you/Michael are welcome
> > to submit patches to both of these along these lines. Sean can comment
> > for himself on this.
>
> An API change shouldn't be necessary. The plan to support large sends was to
> change how the void *mad pointer in struct ib_mad_send_buf is used. Currently,
> it references a single data buffer. The fix is to have it reference a list of
> buffers for large transfers. See:
>
> http://openib.org/pipermail/openib-general/2005-October/012723.html
>
> for more details.
>
> And yes, a patch for this would be great. Otherwise, I will eventually get to
> this, but it may be a few weeks before it bumps up high enough on my priority list.
>
> - Sean

This approach still means that we have to allocate a potentially
huge amount of memory to copy all of the rmpp mad into kernel
in one go.

What I had in mind was: add a get_next_segment callback to mad,
which would copy the data (by copy_from_user) incrementally,
making forward progress without requiring us to have all of the mad
in kernel memory.

--
MST

From halr at voltaire.com  Mon Nov 21 12:19:56 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 21 Nov 2005 15:19:56 -0500
Subject: [openib-general] Better Diagnostics
Message-ID: <1132604307.26731.33678.camel@hal.voltaire.com>

Hi,

In order to locate the physical switch port on a chassis based switch,
can the switch vendors (Cisco/Topspin, SilverStorm, Voltaire, Mellanox)
who supply switch boxes where the physical port on the chassis does not
match the port number on the switch document this so a diagnostic can
indicate the actual port number ? Also, is there a way to determine slot ?
Is SystemImageGuid used as an indicator that a set of boards are in the
same chassis or is chassis management supported or none of the above ?

I look forward to hearing back in order to improve the OpenIB
diagnostics. Thanks in advance.

-- Hal

From mshefty at ichips.intel.com  Mon Nov 21 12:29:09 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 21 Nov 2005 12:29:09 -0800
Subject: [openib-general] Re: user_mad: large rmpp length problem
In-Reply-To: <20051121195531.GA16794@mellanox.co.il>
References: <438215E0.6080305@ichips.intel.com> <20051121195531.GA16794@mellanox.co.il>
Message-ID: <43822E15.5050007@ichips.intel.com>

Michael S. Tsirkin wrote:
> This approach still means that we have to allocate a potentially
> huge amount of memory to copy all of the rmpp mad into kernel
> in one go.
>
> What I had in mind was: add a get_next_segment callback to mad,
> which would copy the data (by copy_from_user) incrementally,
> making forward progress without requiring us to have all of the mad
> in kernel memory.

This may work. I think that there may be issues with respect to providing
proper synchronization and handling retries, but without looking at an
implementation, I can't be sure.

Is support for large RMPP transfers needed for kernel clients? If not, then we
may be able to implement this entirely in userspace, provided an option were
given to disable it in the kernel.

- Sean

From halr at voltaire.com  Mon Nov 21 12:33:32 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 21 Nov 2005 15:33:32 -0500
Subject: [openib-general] Re: user_mad: large rmpp length problem
In-Reply-To: <43822E15.5050007@ichips.intel.com>
References: <438215E0.6080305@ichips.intel.com> <20051121195531.GA16794@mellanox.co.il> <43822E15.5050007@ichips.intel.com>
Message-ID: <1132605188.26731.33719.camel@hal.voltaire.com>

On Mon, 2005-11-21 at 15:29, Sean Hefty wrote:
> Michael S.
Tsirkin wrote:
> > This approach still means that we have to allocate a potentially
> > huge amount of memory to copy all of the rmpp mad into kernel
> > in one go.
> >
> > What I had in mind was: add a get_next_segment callback to mad,
> > which would copy the data (by copy_from_user) incrementally,
> > making forward progress without requiring us to have all of the mad
> > in kernel memory.
>
> This may work. I think that there may be issues with respect to providing
> proper synchronization and handling retries, but without looking at an
> implementation, I can't be sure.

Possibly timeouts too. I would prefer to take this a step at a time. I
think that allocating some number of page sized chunks is likely to work
and IMO we should first go down this direction. If this proves
problematic, we can then adjust the strategy.

> Is support for large RMPP transfers needed for kernel clients?

Currently no, but why preclude this? Kernel clients ultimately will use
MultiPathRecords which require RMPP.

> If not, then we
> may be able to implement this entirely in userspace, provided an option were
> given to disable it in the kernel.

See above.

-- Hal

From btmiller at helix.nih.gov  Mon Nov 21 13:05:22 2005
From: btmiller at helix.nih.gov (Tim Miller)
Date: Mon, 21 Nov 2005 16:05:22 -0500
Subject: [openib-general] InfiniPath + OpenIB question
Message-ID:

Hello,

I've been playing around with PathScale InfiniPath and OpenIB with the
gen2 code checked out of subversion last Friday, and I'm having a bit of
trouble with it. I can build and install everything OK, and the ipath
driver appears to work, but I'm having trouble with the mvapich included.
I can compile the example programs just fine, but when I run:

mpirun_rsh -rsh -np 4 -hostfile nodes ./cpip

I get the following error:

[0] Abort: Error getting HCA context
at line 94 in file viainit.c
mpirun: executable version 0 does not match our version 2.
I am running an older version of opensm -- thinking that may be the
problem, I compiled the opensm in the gen2 tree, but it fails. Looking at
the log, the problem seems to be that osm_vendor_bind fails with an unable
to open port error.

I'm new to openib in general. I've dug around in the Wiki and googled a
bit, but didn't find much useful, so I'm not sure how to debug this
further.

Best Wishes,
Tim

From jlentini at netapp.com  Mon Nov 21 13:19:27 2005
From: jlentini at netapp.com (James Lentini)
Date: Mon, 21 Nov 2005 16:19:27 -0500 (EST)
Subject: [openib-general] can i post a send request with 0 bytes with the inline bit enabled?
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E38798C0@mtlexch01.mtl.com>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E38798C0@mtlexch01.mtl.com>
Message-ID:

> i post a SR using the ibv_send_wr.
>
> in the struct ibv_send_wr:
> num_sge <-- 0
> i don't put any value in sg_list because no one should check
> this value ...

Is the error reported by the return code from ibv_post_send() or
is it in a work completion? What is the error code?

Have you followed the code path from ibv_post_send() down through the
mthca library to make sure that an sge list of size 0 is handled
correctly?

From halr at voltaire.com  Mon Nov 21 13:17:25 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 21 Nov 2005 16:17:25 -0500
Subject: [openib-general] InfiniPath + OpenIB question
In-Reply-To:
References:
Message-ID: <1132607844.26731.33860.camel@hal.voltaire.com>

Hi Tim,

On Mon, 2005-11-21 at 16:05, Tim Miller wrote:
> Hello,
>
> I've been playing around with PathScale InfiniPath and OpenIB with the
> gen2 code checked out of subversion last Friday, and I'm having a bit of
> trouble with it. I can build and install everything OK, and the ipath
> driver appears to work, but I'm having trouble with the mvapich included.
> I can compile the example programs just fine, but when I run:
>
> mpirun_rsh -rsh -np 4 -hostfile nodes ./cpip
>
> I get the following error:
>
> [0] Abort: Error getting HCA context
> at line 94 in file viainit.c
> mpirun: executable version 0 does not match our version 2.
>
> I am running an older version of opensm -- thinking that may be the
> problem, I compiled the opensm in the gen2 tree, but it fails. Looking at
> the log, the problem seems to be that osm_vendor_bind fails with an unable
> to open port error.

Are you sure there is no instance of OpenSM that was running when you
started this ?

Did you also rebuild the other management libraries (libibcommon,
libibmad, and libibumad) ?

-- Hal

> I'm new to openib in general. I've dug around in the Wiki and googled a
> bit, but didn't find much useful, so I'm not sure how to debug this
> further.
>
> Best Wishes,
> Tim

From bos at pathscale.com  Mon Nov 21 13:43:20 2005
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Mon, 21 Nov 2005 13:43:20 -0800
Subject: [openib-general] InfiniPath + OpenIB question
In-Reply-To:
References:
Message-ID: <1132609400.29428.16.camel@serpentine.pathscale.com>

On Mon, 2005-11-21 at 16:05 -0500, Tim Miller wrote:
> I can build and install everything OK, and the ipath
> driver appears to work, but I'm having trouble with the mvapich included.
> I can compile the example programs just fine, but when I run:
>
> mpirun_rsh -rsh -np 4 -hostfile nodes ./cpip
>
> I get the following error:
>
> [0] Abort: Error getting HCA context
> at line 94 in file viainit.c
> mpirun: executable version 0 does not match our version 2.

The problem you are experiencing is most likely because you have the
PathScale MPI libraries installed.
The major version of the MPI shared libraries we ship is 2, while it's 0
for mvapich. You appear to be mixing code compiled against one set of
MPI libraries with code compiled against the other.

It is best to use the InfiniPath MPI libraries unless you specifically
need to be running an MPI app over Infiniband for some reason.

Regards,

References: <438215E0.6080305@ichips.intel.com> <20051121195531.GA16794@mellanox.co.il> <43822E15.5050007@ichips.intel.com> <1132605188.26731.33719.camel@hal.voltaire.com>
Message-ID: <4382578B.1050707@ichips.intel.com>

Hal Rosenstock wrote:
>>This may work. I think that there may be issues with respect to providing
>>proper synchronization and handling retries, but without looking at an
>>implementation, I can't be sure.
>
> Possibly timeouts too. I would prefer to take this a step at a time. I
> think that allocating some number of page sized chunks is likely to work
> and IMO we should first go down this direction. If this proves
> problematic, we can then adjust the strategy.

Even if the copy to the kernel were deferred on the send side, the receive side
will still fully reassemble the MAD before giving it to a client. The receive
MAD will consume multiple data buffers, rather than a single buffer. But it
will still result in consuming kernel memory, and more than that required on the
send side because of duplicated headers.

- Sean

From jlentini at netapp.com  Mon Nov 21 18:42:58 2005
From: jlentini at netapp.com (James Lentini)
Date: Mon, 21 Nov 2005 21:42:58 -0500 (EST)
Subject: [openib-general] Announce: preview RPMs for FC-4 and RHEL-4 available
In-Reply-To: <437BA67D.80400@redhat.com>
References: <437A8592.1000408@redhat.com> <437B9B3C.1040202@ichips.intel.com> <437BA67D.80400@redhat.com>
Message-ID:

On Wed, 16 Nov 2005, Doug Ledford wrote:
> The suggestion from Bob Woodruff was to use the verbs-cm support
> since kDAPL isn't included in the kernel.
There's no interaction between kDAPL and uDAPL, so that shouldn't
affect your decision.

uDAPL can be built to use either the OpenIB verbs-cm or socket-cm
(cma) interfaces.

I would recommend the socket-cm (cma) interface as that is the
interface that will support both IB and iWARP providers.

From sean.hefty at intel.com  Mon Nov 21 19:29:06 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 21 Nov 2005 19:29:06 -0800
Subject: [openib-general] Announce: preview RPMs for FC-4 and RHEL-4available
In-Reply-To:
Message-ID:

>uDAPL can be built to use either the OpenIB verbs-cm or socket-cm
>(cma) interfaces.
>
>I would recommend the socket-cm (cma) interface as that is the
>interface that will support both IB and iWARP providers.

There are three options for uDAPL: the IB CM, a socket-based CM, or the
connection manager abstraction (CMA). Long term, I believe that you'll want the
CMA, but short term the socket-based CM has been tested more.

The IB CM requires an address translation service, such as ib_at. Ib_at has
some stability issues, but its needed functionality was incorporated into
components used by the CMA. The uDAPL port to the CMA is being tested now, and
should be available within a couple of days.

The socket-based CM exchanges QP information over a TCP/IP connection, but is
limited in its functionality. Most of the QP attributes are hard-coded, and its
implementation is limited to a single port on the HCA.

- Sean

From mst at mellanox.co.il  Mon Nov 21 21:11:32 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 22 Nov 2005 07:11:32 +0200
Subject: [openib-general] Re: user_mad: large rmpp length problem
In-Reply-To: <4382578B.1050707@ichips.intel.com>
References: <4382578B.1050707@ichips.intel.com>
Message-ID: <20051122051131.GA24579@mellanox.co.il>

Quoting r. Sean Hefty :
> Subject: Re: user_mad: large rmpp length problem
>
> Hal Rosenstock wrote:
> >>This may work.
I think that there may be issues with respect to providing
> >>proper synchronization and handling retries, but without looking at an
> >>implementation, I can't be sure.
> >
> > Possibly timeouts too. I would prefer to take this a step at a time. I
> > think that allocating some number of page sized chunks is likely to work
> > and IMO we should first go down this direction. If this proves
> > problematic, we can then adjust the strategy.
>
> Even if the copy to the kernel were deferred on the send side, the receive side
> will still fully reassemble the MAD before giving it to a client. The receive
> MAD will consume multiple data buffers, rather than a single buffer. But it
> will still result in consuming kernel memory, and more than that required on the
> send side because of duplicated headers.
>
> - Sean
>

Right, that would have to be handled too, at least at some point.

--
MST

From dotanb at mellanox.co.il  Mon Nov 21 23:58:21 2005
From: dotanb at mellanox.co.il (Dotan Barak)
Date: Tue, 22 Nov 2005 09:58:21 +0200
Subject: [openib-general] can i post a send request with 0 bytes with the inline bit enabled?
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3879983@mtlexch01.mtl.com>

> >
> > i post a SR using the ibv_send_wr.
> >
> > in the struct ibv_send_wr:
> > num_sge <-- 0
> > i don't put any value in sg_list because no one should check
> > this value ...
>
> Is the error reported by the return code from ibv_post_send() or
> is it in a work completion? What is the error code?
>
> Have you followed the code path from ibv_post_send() down through the
> mthca library to make sure that an sge list of size 0 is handled
> correctly?
i can post the send request without any problems, but i get the
following completion with error:

local QP operation err (QPN 8a0406, WQE @ 00000002, CQN 9d0084, index 0)
[ 0] 008a0406
[ 4] 00000000
[ 8] 00000000
[ c] 00000000
[10] 02700000
[14] 00000000
[18] 00000002
[1c] ff000000

i looked into the code of the mthca library:

in regular post (without inline) with sge list of size 0:
the variables wqe and size are not being changed

in post with inline enabled and with sge list of size 0:
the variables wqe and size were changed:
wqe += sizeof(struct mthca_inline_seg) - the size is 4
size += 1

Dotan

From yael at mellanox.co.il  Tue Nov 22 04:19:05 2005
From: yael at mellanox.co.il (Yael Kalka)
Date: 22 Nov 2005 14:19:05 +0200
Subject: [openib-general] [PATCH] Opensm - wrong assertions in error callbacks
Message-ID: <5z3bloyix2.fsf@mtl066.yok.mtl.com>

Hi Hal,

I saw that the assertions in the send_err_callback functions, both in
osm_sa_mad_ctrl and in osm_sm_mad_ctrl, are wrong. I assume we haven't
encountered these problems since we haven't encountered send errors on
debug mode...
In osm_sa_mad_ctrl - all mads should be with resp_expected == FALSE
(and not TRUE), as these are all responses.
In osm_sm_mad_ctrl - we can send both requests and responses (for
SMInfo, for example), so no use in checking the resp_expected flag.
This patch fixes these issues.

Thanks,
Yael

Signed-off-by: Yael Kalka

Index: opensm/osm_sa_mad_ctrl.c
===================================================================
--- opensm/osm_sa_mad_ctrl.c (revision 4109)
+++ opensm/osm_sa_mad_ctrl.c (working copy)
@@ -423,7 +423,7 @@ __osm_sa_mad_ctrl_send_err_callback(
     Unless we generated a Report(Notice)
   */
   CL_ASSERT( p_madw );
-  CL_ASSERT( p_madw->resp_expected == TRUE);
+  CL_ASSERT( p_madw->resp_expected == FALSE);
 
   /*
     An error occurred. No response was received to a request MAD.
Index: opensm/osm_sm_mad_ctrl.c
===================================================================
--- opensm/osm_sm_mad_ctrl.c (revision 4109)
+++ opensm/osm_sm_mad_ctrl.c (working copy)
@@ -815,7 +815,6 @@ __osm_sm_mad_ctrl_send_err_cb(
            ib_get_err_str( p_madw->status ) );
 
   CL_ASSERT( p_madw );
-  CL_ASSERT( p_madw->resp_expected == TRUE );
 
   /*
     If this was a SubnSet MAD, then this error might indicate a problem

From halr at voltaire.com  Tue Nov 22 05:13:49 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 22 Nov 2005 08:13:49 -0500
Subject: [openib-general] Re: [PATCH] Opensm - wrong assertions in error callbacks
In-Reply-To: <5z3bloyix2.fsf@mtl066.yok.mtl.com>
References: <5z3bloyix2.fsf@mtl066.yok.mtl.com>
Message-ID: <1132665227.26731.37674.camel@hal.voltaire.com>

Hi Yael,

On Tue, 2005-11-22 at 07:19, Yael Kalka wrote:
> Hi Hal,
>
> I saw that the assertions in the send_err_callback functions, both in
> osm_sa_mad_ctrl and in osm_sm_mad_ctrl, are wrong. I assume we haven't
> encountered these problems since we haven't encountered send errors on
> debug mode...

One question about this below:

> In osm_sa_mad_ctrl - all mads should be with resp_expected == FALSE
> (and not TRUE), as these are all responses.

What about SA Notices ? Aren't they requests from the perspective of the
SA ?

> In osm_sm_mad_ctrl - we can send both requests and responses (for
> SMInfo, for example), so no use in checking the resp_expected flag.
> This patch fixes these issues.
-- Hal From halr at voltaire.com Tue Nov 22 06:28:42 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 22 Nov 2005 09:28:42 -0500 Subject: [openib-general] Re: [PATCH] Opensm - wrong assertions in error callbacks In-Reply-To: <5z3bloyix2.fsf@mtl066.yok.mtl.com> References: <5z3bloyix2.fsf@mtl066.yok.mtl.com> Message-ID: <1132669720.26731.38006.camel@hal.voltaire.com> On Tue, 2005-11-22 at 07:19, Yael Kalka wrote: > Hi Hal, > > I saw that in the send_err_callback functions both in osm_sa_mad_ctrl > and in osm_sm_mad_ctrl are wrong. I assume we haven't encountered > these problems since we haven't encountered send errors on debug > mode... > In osm_sa_mad_ctrl - all mads should be with resp_expected == FALSE > (and not TRUE), as these are all responses. > In osm_sm_mad_ctrl - we can send both requests and responses (for > SMInfo, for example), so no use in checking the resp_expected flag. > This patch fixes these issues. Thanks. I applied the osm_sm_mad_ctrl.c part pending the response to the osm_sa_mad_ctrl.c portion. -- Hal > Thanks, > Yael > > Signed-off-by: Yael Kalka > > Index: opensm/osm_sa_mad_ctrl.c > =================================================================== > --- opensm/osm_sa_mad_ctrl.c (revision 4109) > +++ opensm/osm_sa_mad_ctrl.c (working copy) > @@ -423,7 +423,7 @@ __osm_sa_mad_ctrl_send_err_callback( > Unless we generated a Report(Notice) > */ > CL_ASSERT( p_madw ); > - CL_ASSERT( p_madw->resp_expected == TRUE); > + CL_ASSERT( p_madw->resp_expected == FALSE); > > /* > An error occurred. No response was received to a request MAD. 
> Index: opensm/osm_sm_mad_ctrl.c > =================================================================== > --- opensm/osm_sm_mad_ctrl.c (revision 4109) > +++ opensm/osm_sm_mad_ctrl.c (working copy) > @@ -815,7 +815,6 @@ __osm_sm_mad_ctrl_send_err_cb( > ib_get_err_str( p_madw->status ) ); > > CL_ASSERT( p_madw ); > - CL_ASSERT( p_madw->resp_expected == TRUE ); > > /* > If this was a SubnSet MAD, then this error might indicate a problem > From halr at voltaire.com Tue Nov 22 06:35:20 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 22 Nov 2005 09:35:20 -0500 Subject: [openib-general] [PATCH] OpenSM: Reorder assert in __osm_sm_mad_ctrl_send_err_cb Message-ID: <1132670110.26731.38032.camel@hal.voltaire.com> OpenSM: Reorder assert in __osm_sm_mad_ctrl_send_err_cb in osm_sm_mad_ctrl.c Signed-off-by: Hal Rosenstock Index: osm_sm_mad_ctrl.c =================================================================== --- osm_sm_mad_ctrl.c (revision 4115) +++ osm_sm_mad_ctrl.c (working copy) @@ -809,13 +809,13 @@ __osm_sm_mad_ctrl_send_err_cb( OSM_LOG_ENTER( p_ctrl->p_log, __osm_sm_mad_ctrl_send_err_cb ); + CL_ASSERT( p_madw ); + osm_log( p_ctrl->p_log, OSM_LOG_ERROR, "__osm_sm_mad_ctrl_send_err_cb: ERR 3113: " "MAD completed in error (%s).\n", ib_get_err_str( p_madw->status ) ); - CL_ASSERT( p_madw ); - /* If this was a SubnSet MAD, then this error might indicate a problem in configuring the subnet. In this case - need to mark that there was From halr at voltaire.com Tue Nov 22 06:52:04 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 22 Nov 2005 09:52:04 -0500 Subject: [openib-general] [RFC] OpenSM: include svn version in build string Message-ID: <1132670997.26731.38092.camel@hal.voltaire.com> Hi, It has been requested (several times now :-) that the svn version be included in the OpenSM build string version info (as 1.1.0 isn't nearly descriptive enough). This needs to work both in and out of the OpenIB svn tree as builds will occur both ways. 
So one way to accomplish this which has additional overhead on each checkin is to maintain something like an osm_svnversion file at the osm level of the tree and display that in opensm/main.c when OpenSM starts up. Another way would be to modify include/opensm/osm_base.h OSM_VERSION with this info. Again. manual. Anyone see how to get away from the manual overhead of this ? Anyone have other ideas on how to solve this ? (Are there other examples of this out there ?) Thanks. -- Hal From mst at mellanox.co.il Tue Nov 22 08:00:47 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 22 Nov 2005 18:00:47 +0200 Subject: [openib-general] Re: [RFC] OpenSM: include svn version in build string In-Reply-To: <1132670997.26731.38092.camel@hal.voltaire.com> References: <1132670997.26731.38092.camel@hal.voltaire.com> Message-ID: <20051122160047.GY20871@mellanox.co.il> Quoting r. Hal Rosenstock : > Subject: [RFC] OpenSM: include svn version in build string > > Hi, > > It has been requested (several times now :-) that the svn version be > included in the OpenSM build string version info (as 1.1.0 isn't nearly > descriptive enough). This needs to work both in and out of the OpenIB > svn tree as builds will occur both ways. > > So one way to accomplish this which has additional overhead on each > checkin is to maintain something like an osm_svnversion file at the osm > level of the tree and display that in opensm/main.c when OpenSM starts > up. > > Another way would be to modify include/opensm/osm_base.h OSM_VERSION > with this info. Again. manual. > > Anyone see how to get away from the manual overhead of this ? > > Anyone have other ideas on how to solve this ? (Are there other examples > of this out there ?) > > Thanks. > > -- Hal You also want a url in there to make subversion revision a unique identifier. 
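As a rough illustration of why the URL matters, the sketch below builds a version banner from three pieces: the release string, the repository URL, and the svn revision. The macro names OSM_SVN_URL and OSM_SVN_REVISION and the function format_build_version() are hypothetical, not existing OpenSM symbols; a build would be expected to define the macros itself (for example with -D flags or a generated header), and the fallback values apply when it does not.

```c
#include <stdio.h>

/* Hypothetical macros: a real build would define these from svn metadata. */
#ifndef OSM_SVN_URL
#define OSM_SVN_URL "unknown-url"
#endif
#ifndef OSM_SVN_REVISION
#define OSM_SVN_REVISION "unknown"
#endif

/*
 * Write "<release> (<url>@<revision>)" into buf.
 * Returns the length snprintf() reports, so a caller can detect truncation.
 */
static int format_build_version(char *buf, size_t len, const char *release,
				const char *url, const char *revision)
{
	return snprintf(buf, len, "%s (%s@%s)", release, url, revision);
}
```

Calling format_build_version(buf, sizeof buf, "1.1.0", OSM_SVN_URL, OSM_SVN_REVISION) at startup would then give a string that tells apart two checkouts with the same revision number but different branches.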
For in-tree build, you can have Makefile generate a header from .svn/entries something like version.h: .svn/entries grep -e revision= -e url= $^ | head 2> $@ in the makefile should be sufficient. For an out of tree build, you just stick something into version.h. after you do svn export. Its also important to have a default version identifier by keeping a backup version.h under include directory, have that picked up after the local version if that doesnt exist. -- MST From mst at mellanox.co.il Tue Nov 22 08:11:07 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 22 Nov 2005 18:11:07 +0200 Subject: [openib-general] [PATCH] multicast resource tracking Message-ID: <20051122161107.GA20871@mellanox.co.il> uverbs needs to track which multicast groups is each qp attached to, in order to properly detach when cleanup is performed on device file close. Signed-off-by: Michael S. Tsirkin Signed-off-by: Jack Morgenstein Index: linux-kernel/drivers/infiniband/core/uverbs_main.c =================================================================== --- linux-kernel/drivers/infiniband/core/uverbs_main.c (revision 4111) +++ linux-kernel/drivers/infiniband/core/uverbs_main.c (working copy) @@ -160,6 +160,18 @@ void ib_uverbs_release_uevent(struct ib_ spin_unlock_irq(&file->async_file->lock); } +static void ib_uverbs_detach_umcast(struct ib_qp *qp, + struct ib_uqp_object *uobj) +{ + struct ib_uverbs_mcast_entry *mcast, *tmp; + + list_for_each_entry_safe(mcast, tmp, &uobj->mcast_list, list) { + ib_detach_mcast(qp,&mcast->gid, mcast->lid); + list_del(&mcast->list); + kfree(mcast); + } +} + static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file, struct ib_ucontext *context) { @@ -180,13 +192,14 @@ static int ib_uverbs_cleanup_ucontext(st list_for_each_entry_safe(uobj, tmp, &context->qp_list, list) { struct ib_qp *qp = idr_find(&ib_uverbs_qp_idr, uobj->id); - struct ib_uevent_object *uevent = - container_of(uobj, struct ib_uevent_object, uobject); + struct 
ib_uqp_object *uqp = + container_of(uobj, struct ib_uqp_object, uevent.uobject); idr_remove(&ib_uverbs_qp_idr, uobj->id); + ib_uverbs_detach_umcast(qp, uqp); ib_destroy_qp(qp); list_del(&uobj->list); - ib_uverbs_release_uevent(file, uevent); - kfree(uevent); + ib_uverbs_release_uevent(file, &uqp->uevent); + kfree(uqp); } list_for_each_entry_safe(uobj, tmp, &context->cq_list, list) { Index: linux-kernel/drivers/infiniband/core/uverbs.h =================================================================== --- linux-kernel/drivers/infiniband/core/uverbs.h (revision 4111) +++ linux-kernel/drivers/infiniband/core/uverbs.h (working copy) @@ -105,12 +105,23 @@ struct ib_uverbs_event { u32 *counter; }; +struct ib_uverbs_mcast_entry { + struct list_head list; + union ib_gid gid; + u16 lid; +}; + struct ib_uevent_object { struct ib_uobject uobject; struct list_head event_list; u32 events_reported; }; +struct ib_uqp_object { + struct ib_uevent_object uevent; + struct list_head mcast_list; +}; + struct ib_ucq_object { struct ib_uobject uobject; struct ib_uverbs_file *uverbs_file; Index: linux-kernel/drivers/infiniband/core/uverbs_cmd.c =================================================================== --- linux-kernel/drivers/infiniband/core/uverbs_cmd.c (revision 4111) +++ linux-kernel/drivers/infiniband/core/uverbs_cmd.c (working copy) @@ -815,7 +815,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uv struct ib_uverbs_create_qp cmd; struct ib_uverbs_create_qp_resp resp; struct ib_udata udata; - struct ib_uevent_object *uobj; + struct ib_uqp_object *uobj; struct ib_pd *pd; struct ib_cq *scq, *rcq; struct ib_srq *srq; @@ -866,10 +866,11 @@ ssize_t ib_uverbs_create_qp(struct ib_uv attr.cap.max_recv_sge = cmd.max_recv_sge; attr.cap.max_inline_data = cmd.max_inline_data; - uobj->uobject.user_handle = cmd.user_handle; - uobj->uobject.context = file->ucontext; - uobj->events_reported = 0; - INIT_LIST_HEAD(&uobj->event_list); + uobj->uevent.uobject.user_handle = cmd.user_handle; + 
uobj->uevent.uobject.context = file->ucontext; + uobj->uevent.events_reported = 0; + INIT_LIST_HEAD(&uobj->uevent.event_list); + INIT_LIST_HEAD(&uobj->mcast_list); qp = pd->device->create_qp(pd, &attr, &udata); if (IS_ERR(qp)) { @@ -882,7 +883,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uv qp->send_cq = attr.send_cq; qp->recv_cq = attr.recv_cq; qp->srq = attr.srq; - qp->uobject = &uobj->uobject; + qp->uobject = &uobj->uevent.uobject; qp->event_handler = attr.event_handler; qp->qp_context = attr.qp_context; qp->qp_type = attr.qp_type; @@ -901,14 +902,14 @@ retry: goto err_destroy; } - ret = idr_get_new(&ib_uverbs_qp_idr, qp, &uobj->uobject.id); + ret = idr_get_new(&ib_uverbs_qp_idr, qp, &uobj->uevent.uobject.id); if (ret == -EAGAIN) goto retry; if (ret) goto err_destroy; - resp.qp_handle = uobj->uobject.id; + resp.qp_handle = uobj->uevent.uobject.id; resp.max_recv_sge = attr.cap.max_recv_sge; resp.max_send_sge = attr.cap.max_send_sge; resp.max_recv_wr = attr.cap.max_recv_wr; @@ -922,7 +923,7 @@ retry: } down(&file->mutex); - list_add_tail(&uobj->uobject.list, &file->ucontext->qp_list); + list_add_tail(&uobj->uevent.uobject.list, &file->ucontext->qp_list); up(&file->mutex); up(&ib_uverbs_idr_mutex); @@ -930,7 +931,7 @@ retry: return in_len; err_idr: - idr_remove(&ib_uverbs_qp_idr, uobj->uobject.id); + idr_remove(&ib_uverbs_qp_idr, uobj->uevent.uobject.id); err_destroy: ib_destroy_qp(qp); @@ -1032,7 +1033,7 @@ ssize_t ib_uverbs_destroy_qp(struct ib_u struct ib_uverbs_destroy_qp cmd; struct ib_uverbs_destroy_qp_resp resp; struct ib_qp *qp; - struct ib_uevent_object *uobj; + struct ib_uqp_object *uobj; int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) @@ -1046,7 +1047,12 @@ ssize_t ib_uverbs_destroy_qp(struct ib_u if (!qp || qp->uobject->context != file->ucontext) goto out; - uobj = container_of(qp->uobject, struct ib_uevent_object, uobject); + uobj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); + + if (!list_empty(&uobj->mcast_list)) { + 
ret = -EBUSY; + goto out; + } ret = ib_destroy_qp(qp); if (ret) @@ -1055,12 +1061,12 @@ ssize_t ib_uverbs_destroy_qp(struct ib_u idr_remove(&ib_uverbs_qp_idr, cmd.qp_handle); down(&file->mutex); - list_del(&uobj->uobject.list); + list_del(&uobj->uevent.uobject.list); up(&file->mutex); - ib_uverbs_release_uevent(file, uobj); + ib_uverbs_release_uevent(file, &uobj->uevent); - resp.events_reported = uobj->events_reported; + resp.events_reported = uobj->uevent.events_reported; kfree(uobj); @@ -1542,6 +1548,8 @@ ssize_t ib_uverbs_attach_mcast(struct ib { struct ib_uverbs_attach_mcast cmd; struct ib_qp *qp; + struct ib_uqp_object *uobj; + struct ib_uverbs_mcast_entry *mcast, *tmp; int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) @@ -1550,9 +1558,34 @@ ssize_t ib_uverbs_attach_mcast(struct ib down(&ib_uverbs_idr_mutex); qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (qp && qp->uobject->context == file->ucontext) - ret = ib_attach_mcast(qp, (union ib_gid *) cmd.gid, cmd.mlid); + if (!qp || qp->uobject->context != file->ucontext) + goto out; + + uobj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); + list_for_each_entry_safe(mcast, tmp, &uobj->mcast_list, list) + if (cmd.mlid == mcast->lid && + !memcmp(((union ib_gid *)cmd.gid)->raw, + mcast->gid.raw, sizeof mcast->gid.raw)) { + ret = 0; + goto out; + } + mcast = kmalloc(sizeof *mcast, GFP_KERNEL); + if (!mcast) { + ret = -ENOMEM; + goto out; + } + mcast->lid = cmd.mlid; + mcast->gid = *(union ib_gid *)cmd.gid; + ret = ib_attach_mcast(qp, &mcast->gid, cmd.mlid); + if (!ret) { + uobj = container_of(qp->uobject, struct ib_uqp_object, + uevent.uobject); + list_add_tail(&mcast->list, &uobj->mcast_list); + } else + kfree(mcast); + +out: up(&ib_uverbs_idr_mutex); return ret ? 
ret : in_len; @@ -1563,6 +1596,7 @@ ssize_t ib_uverbs_detach_mcast(struct ib int out_len) { struct ib_uverbs_detach_mcast cmd; + struct ib_uqp_object *uobj; struct ib_qp *qp; int ret = -EINVAL; @@ -1572,9 +1606,28 @@ ssize_t ib_uverbs_detach_mcast(struct ib down(&ib_uverbs_idr_mutex); qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (qp && qp->uobject->context == file->ucontext) - ret = ib_detach_mcast(qp, (union ib_gid *) cmd.gid, cmd.mlid); + if (!qp || qp->uobject->context != file->ucontext) + goto out; + + ret = ib_detach_mcast(qp, (union ib_gid *) cmd.gid, cmd.mlid); + if (!ret) { + struct ib_uverbs_mcast_entry *mcast, *tmp; + + uobj = container_of(qp->uobject, struct ib_uqp_object, + uevent.uobject); + list_for_each_entry_safe(mcast, tmp, &uobj->mcast_list, list) + if (cmd.mlid == mcast->lid && + !memcmp(((union ib_gid *)cmd.gid)->raw, + mcast->gid.raw, sizeof mcast->gid.raw)) { + list_del(&mcast->list); + kfree(mcast); + break; + } + + } + +out: up(&ib_uverbs_idr_mutex); return ret ? ret : in_len; -- MST From yipeeyipeeyipeeyipee at yahoo.com Tue Nov 22 09:00:19 2005 From: yipeeyipeeyipeeyipee at yahoo.com (yipee) Date: Tue, 22 Nov 2005 17:00:19 +0000 (UTC) Subject: [openib-general] userspace physical memory Message-ID: Hi, Is there a way to do physical memory registration from user-space? Thanks, y
From surs at cse.ohio-state.edu Tue Nov 22 09:39:21 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Tue, 22 Nov 2005 12:39:21 -0500 Subject: [openib-general] InfiniPath + OpenIB question In-Reply-To: References: Message-ID: <20051122173920.GA14228@cse.ohio-state.edu> Hi, I sent a message about an hour back, but it hasn't seem to have made it to the list. Anyways, here is my reply (again). Apologies for multiple copies. ===== > > >mpirun_rsh -rsh -np 4 -hostfile nodes ./cpip > >I get the following error: > >[0] Abort: Error getting HCA context Thanks for trying out MVAPICH on Pathscale machines with Gen2. This seems to be the major problem, not the version mismatch. The version "0" usually pops up when the remote processes have died and could not communicate their version number to the master "mpirun_rsh" process. There are several things you need to make sure that MVAPICH works smoothly. 1) Load all IB related modules (ib_ipath, ib_uverbs, ib_mad, ib_umad), as Hal has pointed out. 2) Make sure your /etc/udev/rules.d/90-ib.rules is present and the contents look more or less like in the cheat-sheet posted on OpenIB Wiki by Michael Tsirkin (https://openib.org/tiki/tiki-index.php?page=Installation+Cheat+Sheet) 3) If you don't have the udev rules and are making devices by hand, make sure that the "user" you are running as has 666 perms on the devices. 4) If you are to run MPI programs as a user (not as root), then make sure that the "ulimit -l" shows "unlimited" or some high number. This is the amount of lockable memory (in kB).
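The limit in item 4 can also be checked from inside a program before any memory is registered. The following is a minimal sketch and not part of MVAPICH: it relies on getrlimit() with RLIMIT_MEMLOCK as found on Linux, and the helper names limit_covers() and memlock_sufficient() are made up for illustration. Note that getrlimit() reports the limit in bytes, whereas "ulimit -l" prints kB.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Return 1 if a soft limit of `soft` bytes allows locking `needed` bytes. */
static int limit_covers(rlim_t soft, rlim_t needed)
{
	return soft == RLIM_INFINITY || soft >= needed;
}

/*
 * Return 1 if the calling process may lock at least `needed` bytes,
 * 0 if the limit is too low, and -1 if the limit cannot be read.
 */
static int memlock_sufficient(rlim_t needed)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
		perror("getrlimit(RLIMIT_MEMLOCK)");
		return -1;
	}
	return limit_covers(rl.rlim_cur, needed);
}
```

A program could call memlock_sufficient() with the number of bytes it expects to register and, when the result is 0, print a hint that the memlock limit needs to be raised.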
In order to change this value, you need to edit /etc/security/limits.conf and put a line like: * soft memlock unlimited Then edit /etc/init.d/sshd and put a line: ulimit -l unlimited All subsequent SSH sessions will have "unlimited" lockable memory. Please let us know if you were able to get past this error. Thanks, Sayantan. -- http://www.cse.ohio-state.edu/~surs From ftillier at silverstorm.com Tue Nov 22 09:42:17 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Tue, 22 Nov 2005 09:42:17 -0800 Subject: [openib-general] userspace physical memory In-Reply-To: Message-ID: <002c01c5ef8c$1575f8c0$9e5aa8c0@infiniconsys.com> > From: yipee [mailto:yipeeyipeeyipeeyipee at yahoo.com] > Sent: Tuesday, November 22, 2005 9:00 AM > > Hi, > > Is there a way to do physical memory registration from user-space? No, there is not. The IB spec specifically calls out physical registration as a privileged-only operation. - Fab From hozer at hozed.org Tue Nov 22 10:20:32 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Tue, 22 Nov 2005 12:20:32 -0600 Subject: [openib-general] Re: [RFC] OpenSM: include svn version in build string In-Reply-To: <20051122160047.GY20871@mellanox.co.il> References: <1132670997.26731.38092.camel@hal.voltaire.com> <20051122160047.GY20871@mellanox.co.il> Message-ID: <20051122182032.GF3275@kalmia.hozed.org> On Tue, Nov 22, 2005 at 06:00:47PM +0200, Michael S. Tsirkin wrote: > Quoting r. Hal Rosenstock : > > Subject: [RFC] OpenSM: include svn version in build string > > > > Hi, > > > > It has been requested (several times now :-) that the svn version be > > included in the OpenSM build string version info (as 1.1.0 isn't nearly > > descriptive enough). This needs to work both in and out of the OpenIB > > svn tree as builds will occur both ways. 
> > > > So one way to accomplish this which has additional overhead on each > > checkin is to maintain something like an osm_svnversion file at the osm > > level of the tree and display that in opensm/main.c when OpenSM starts > > up. > > > > Another way would be to modify include/opensm/osm_base.h OSM_VERSION > > with this info. Again. manual. > > > > Anyone see how to get away from the manual overhead of this ? > > > > Anyone have other ideas on how to solve this ? (Are there other examples > > of this out there ?) > > > > Thanks. > > > > -- Hal > > You also want a url in there to make subversion revision a unique identifier. > > For in-tree build, you can have Makefile generate a header from > .svn/entries > > something like > > version.h: .svn/entries > grep -e revision= -e url= $^ | head 2> $@ > > in the makefile should be sufficient. > > For an out of tree build, you just stick something into version.h. > after you do svn export. > > Its also important to have a default version identifier by > keeping a backup version.h under include directory, have that > picked up after the local version if that doesnt exist. I'd suggest using the command 'svnversion' if it exists, since that will tell you if you have a locally modified version as well. I tried to get this working for the kernel modules, and came up with this: ibversion.sh: #!/bin/sh SVN=`svnversion $1` echo \#define IBVERSION \"$SVN\" > $1.tmp diff -q $1.tmp $1 if [ $? 
!= '0' ] ; then mv $1.tmp $1 else rm $1.tmp fi -------------- next part -------------- Index: linux-kernel/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- linux-kernel/infiniband/ulp/ipoib/ipoib_main.c (revision 3968) +++ linux-kernel/infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -51,6 +51,9 @@ MODULE_DESCRIPTION("IP-over-InfiniBand net driver"); MODULE_LICENSE("Dual BSD/GPL"); +#include +MODULE_VERSION(IBVERSION); + #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG int ipoib_debug_level; @@ -917,6 +920,8 @@ struct ipoib_dev_priv *priv; int result = -ENOMEM; + printk(KERN_WARNING "hca->node_type = %d\n", hca->node_type); + priv = ipoib_intf_alloc(format); if (!priv) goto alloc_mem_failed; Index: linux-kernel/infiniband/ulp/ipoib/Makefile =================================================================== --- linux-kernel/infiniband/ulp/ipoib/Makefile (revision 3968) +++ linux-kernel/infiniband/ulp/ipoib/Makefile (working copy) @@ -9,3 +9,6 @@ ipoib_vlan.o ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_DEBUG) += ipoib_fs.o +drivers/infiniband/include/ibversion.h: FORCE + sh drivers/infiniband/ibversion.sh $@ + Index: linux-kernel/infiniband/hw/mthca/Makefile =================================================================== --- linux-kernel/infiniband/hw/mthca/Makefile (revision 3968) +++ linux-kernel/infiniband/hw/mthca/Makefile (working copy) @@ -1,5 +1,8 @@ EXTRA_CFLAGS += -Idrivers/infiniband/include +drivers/infiniband/include/ibversion.h: FORCE + sh drivers/infiniband/ibversion.sh $@ + ifdef CONFIG_INFINIBAND_MTHCA_DEBUG EXTRA_CFLAGS += -DDEBUG endif @@ -11,3 +14,4 @@ mthca_mr.o mthca_qp.o mthca_av.o mthca_mcg.o mthca_mad.o \ mthca_provider.o mthca_memfree.o mthca_uar.o mthca_srq.o \ mthca_catas.o + Index: linux-kernel/infiniband/hw/mthca/mthca_dev.h =================================================================== --- linux-kernel/infiniband/hw/mthca/mthca_dev.h (revision 3968) +++ 
linux-kernel/infiniband/hw/mthca/mthca_dev.h (working copy) @@ -44,13 +44,15 @@ #include #include #include +#include #include "mthca_provider.h" #include "mthca_doorbell.h" #define DRV_NAME "ib_mthca" #define PFX DRV_NAME ": " -#define DRV_VERSION "0.06" +//#define DRV_VERSION "0.06" +#define DRV_VERSION IBVERSION #define DRV_RELDATE "June 23, 2005" enum { From hozer at hozed.org Tue Nov 22 10:29:42 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Tue, 22 Nov 2005 12:29:42 -0600 Subject: [openib-general] OpenSM Debug In-Reply-To: <002801c5edf6$6cff8530$9e5aa8c0@infiniconsys.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F589AB49@taurus.voltaire.com> <002801c5edf6$6cff8530$9e5aa8c0@infiniconsys.com> Message-ID: <20051122182942.GG3275@kalmia.hozed.org> On Sun, Nov 20, 2005 at 09:18:27AM -0800, Fab Tillier wrote: > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Sunday, November 20, 2005 4:59 AM > > > > Hi Fab, > > > > On Sat, 2005-11-19 at 13:50, Fab Tillier wrote: > > > > > > That's correct - structure definitions change between the debug and > > > release builds of complib. The code above is there because in Linux, > > > the library created by complib has the same name in debug and release > > > builds, so it is possible to have a mismatch between the type of > > > build for opensm and complib. In Windows, I solved this by adding a > > > debug-only suffix to the library name (complibd vs. complib) so that > > > the risk of linkage errors is eliminated. I have suggested in the > > > past that the Linux complib adopt a similar naming scheme and > > > that doing runtime checks for linkage errors was indicative of a > > > poor design. > > > > > > This has been the basis for me pushing back on adding the > > > cl_is_debug function to the Windows version of complib. > > > > Is there a convention for naming debug libraries in Linux ? > > I'm no Linux expert, so I have no clue here. Perhaps the C libraries already > have some method? 
> > > Is there any reason why the 2 versions of the libraries (with different > > names) shouldn't be allowed concurrently to exist and just link with the > > desired one ? > > There is none that I can think of. In fact, the Windows drivers allow both the > debug and release versions of the user-mode components to co-exist, as well as > mixing debug and release kernel drivers. This makes it easy to debug a single > component without affecting timings in the whole stack. How much timing overhead does debug add anyway? Based on what I saw at supercomputing, OpenSM spent more time in the kernel and doing context /thread switches than actually doing a lot of computations. At the moment, I'd prefer that debug was enabled by default, and we had a way to dump a stack trace and restart if something asserted. I'm going to speculate that in 99% of the cases, a debug build on a PC will have no trouble. For those really large clusters, people that know what they are doing can enable optimizations. From mst at mellanox.co.il Tue Nov 22 11:36:43 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 22 Nov 2005 21:36:43 +0200 Subject: [openib-general] Re: [RFC] OpenSM: include svn version in build string In-Reply-To: <20051122182032.GF3275@kalmia.hozed.org> References: <20051122182032.GF3275@kalmia.hozed.org> Message-ID: <20051122193643.GC26894@mellanox.co.il> Quoting r. Troy Benjegerdes : > Subject: Re: [openib-general] Re: [RFC] OpenSM: include svn version in build string > > On Tue, Nov 22, 2005 at 06:00:47PM +0200, Michael S. Tsirkin wrote: > > Quoting r. Hal Rosenstock : > > > Subject: [RFC] OpenSM: include svn version in build string > > > > > > Hi, > > > > > > It has been requested (several times now :-) that the svn version be > > > included in the OpenSM build string version info (as 1.1.0 isn't > nearly > > > descriptive enough). This needs to work both in and out of the > OpenIB > > > svn tree as builds will occur both ways. 
> > > > > > So one way to accomplish this which has additional overhead on each > > > checkin is to maintain something like an osm_svnversion file at the > osm > > > level of the tree and display that in opensm/main.c when OpenSM > starts > > > up. > > > > > > Another way would be to modify include/opensm/osm_base.h OSM_VERSION > > > with this info. Again. manual. > > > > > > Anyone see how to get away from the manual overhead of this ? > > > > > > Anyone have other ideas on how to solve this ? (Are there other > examples > > > of this out there ?) > > > > > > Thanks. > > > > > > -- Hal > > > > You also want a url in there to make subversion revision a unique > identifier. > > > > For in-tree build, you can have Makefile generate a header from > > .svn/entries > > > > something like > > > > version.h: .svn/entries > > grep -e revision= -e url= $^ | head 2> $@ > > > > in the makefile should be sufficient. > > > > For an out of tree build, you just stick something into version.h. > > after you do svn export. > > > > Its also important to have a default version identifier by > > keeping a backup version.h under include directory, have that > > picked up after the local version if that doesnt exist. > > I'd suggest using the command 'svnversion' if it exists, since that will > tell you if you have a locally modified version as well. > > I tried to get this working for the kernel modules, and came up with > this: > > ibversion.sh: > > #!/bin/sh > > SVN=`svnversion $1` > > echo \#define IBVERSION \"$SVN\" > $1.tmp > > diff -q $1.tmp $1 > if [ $? 
!= '0' ] ; then > mv $1.tmp $1 > else > rm $1.tmp > fi > > > > > Content-Description: svn-version-test.diff > Index: linux-kernel/infiniband/ulp/ipoib/ipoib_main.c > =================================================================== > --- linux-kernel/infiniband/ulp/ipoib/ipoib_main.c (revision 3968) > +++ linux-kernel/infiniband/ulp/ipoib/ipoib_main.c (working copy) > @@ -51,6 +51,9 @@ > MODULE_DESCRIPTION("IP-over-InfiniBand net driver"); > MODULE_LICENSE("Dual BSD/GPL"); > > +#include > +MODULE_VERSION(IBVERSION); > + > #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG > int ipoib_debug_level; > > @@ -917,6 +920,8 @@ > struct ipoib_dev_priv *priv; > int result = -ENOMEM; > > + printk(KERN_WARNING "hca->node_type = %d\n", hca->node_type); > + > priv = ipoib_intf_alloc(format); > if (!priv) > goto alloc_mem_failed; > Index: linux-kernel/infiniband/ulp/ipoib/Makefile > =================================================================== > --- linux-kernel/infiniband/ulp/ipoib/Makefile (revision 3968) > +++ linux-kernel/infiniband/ulp/ipoib/Makefile (working copy) > @@ -9,3 +9,6 @@ > ipoib_vlan.o > ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_DEBUG) += ipoib_fs.o > > +drivers/infiniband/include/ibversion.h: FORCE > + sh drivers/infiniband/ibversion.sh $@ > + > Index: linux-kernel/infiniband/hw/mthca/Makefile > =================================================================== > --- linux-kernel/infiniband/hw/mthca/Makefile (revision 3968) > +++ linux-kernel/infiniband/hw/mthca/Makefile (working copy) > @@ -1,5 +1,8 @@ > EXTRA_CFLAGS += -Idrivers/infiniband/include > > +drivers/infiniband/include/ibversion.h: FORCE > + sh drivers/infiniband/ibversion.sh $@ > + > ifdef CONFIG_INFINIBAND_MTHCA_DEBUG > EXTRA_CFLAGS += -DDEBUG > endif > @@ -11,3 +14,4 @@ > mthca_mr.o mthca_qp.o mthca_av.o mthca_mcg.o mthca_mad.o \ > mthca_provider.o mthca_memfree.o mthca_uar.o mthca_srq.o \ > mthca_catas.o > + > Index: linux-kernel/infiniband/hw/mthca/mthca_dev.h > 
=================================================================== > --- linux-kernel/infiniband/hw/mthca/mthca_dev.h (revision 3968) > +++ linux-kernel/infiniband/hw/mthca/mthca_dev.h (working copy) > @@ -44,13 +44,15 @@ > #include > #include > #include > +#include > > #include "mthca_provider.h" > #include "mthca_doorbell.h" > > #define DRV_NAME "ib_mthca" > #define PFX DRV_NAME ": " > -#define DRV_VERSION "0.06" > +//#define DRV_VERSION "0.06" > +#define DRV_VERSION IBVERSION > #define DRV_RELDATE "June 23, 2005" > > enum { > > I dont think its a good idea to add dependency on svnversion to the makefile. Lets just add --with-version= option to configure. Then the user can run it --with-version=`svnversion` -- MST From yipeeyipeeyipeeyipee at yahoo.com Tue Nov 22 11:51:44 2005 From: yipeeyipeeyipeeyipee at yahoo.com (yipee) Date: Tue, 22 Nov 2005 19:51:44 +0000 (UTC) Subject: [openib-general] Re: userspace physical memory References: <002c01c5ef8c$1575f8c0$9e5aa8c0@infiniconsys.com> Message-ID: Fab Tillier silverstorm.com> writes: [snip] > No, there is not. The IB spec specifically calls out physical > registration as a privileged-only operation. And what about super-user? Isn't he privileged enough? If 'root' can't do it then I'd have to write a small kernel module to do it? Thanks, x From hozer at hozed.org Tue Nov 22 12:23:24 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Tue, 22 Nov 2005 14:23:24 -0600 Subject: [openib-general] Re: [RFC] OpenSM: include svn version in build string In-Reply-To: <20051122193643.GC26894@mellanox.co.il> References: <20051122182032.GF3275@kalmia.hozed.org> <20051122193643.GC26894@mellanox.co.il> Message-ID: <20051122202323.GH3275@kalmia.hozed.org> > > > > I dont think its a good idea to add dependency on svnversion to > the makefile. > > Lets just add --with-version= option to configure. 
> Then the user can run it --with-version=`svnversion` If you remove the automatic generation of the version info from svnversion, you defeat the whole purpose of having it in the first place. If I have to remember to add '--with-version', the one time that it matters I'll forget and then when Hal asks me what version I have I'll tell him the wrong one. OpenMPI generates a version string from svnversion in the configure script.. This would be a good way to do it if it's not in the makefile directly. From btmiller at helix.nih.gov Tue Nov 22 12:00:12 2005 From: btmiller at helix.nih.gov (Tim Miller) Date: Tue, 22 Nov 2005 15:00:12 -0500 Subject: [openib-general] InfiniPath + OpenIB question In-Reply-To: References: Message-ID: On Mon, 21 Nov 2005, Tim Miller wrote: > Hello, > > I've been playing around with PathScale InfiniPath and OpenIB with the gen2 > code checked out of subversion last Friday, and I'm having a bit of trouble > with it. (snip) Thanks to everybody who responded. I believe that I have it working now. The problem was that I had somehow missed one of the modules to load and that udev was not creating the device nodes in /dev/infiniband with world read/write permissions. Thanks again, Tim -- Tim Miller System Administrator -- Laboratory of Computational Biology National Institutes of Health -- Bldg. 50 Rm. 3309 -- 301-402-0618 From dledford at redhat.com Tue Nov 22 13:41:23 2005 From: dledford at redhat.com (Doug Ledford) Date: Tue, 22 Nov 2005 16:41:23 -0500 Subject: [openib-general] Announce: more updated files on my site Message-ID: <43839083.7030803@redhat.com> I've uploaded more files. There's a kernel update for RHEL4 (added ppc64 Infiniband support), rebuilt libibverbs, libmthca, and opensm for RHEL4/FC4/FC5 and added ppc64 support to RHEL4 and both ppc64 and ppc support to FC4/FC5 for these three RPMs, and I've added libsdp to all of the above. 
Here is the RPM listing currently:

FC-4/libibverbs/1.0-0.3965.1.FC4: i386 ia64 SRPMS tests x86_64
FC-4/libibverbs/1.0.rc4-0.3965.2.FC4: i386 ia64 ppc ppc64 SRPMS tests x86_64
FC-4/libmthca/1.0-0.3965.1.FC4: i386 ia64 SRPMS tests x86_64
FC-4/libmthca/1.0.rc4-0.3965.2.FC4: i386 ia64 ppc ppc64 SRPMS tests x86_64
FC-4/libsdp/0.90-0.3965.2.FC4: i386 ia64 ppc ppc64 SRPMS tests x86_64
FC-4/module-init-tools/3.1-4.Infiniband: i386 ia64 SRPMS tests x86_64
FC-4/opensm/1.0-0.3965.1.FC4: i386 ia64 ppc ppc64 s390 s390x SRPMS tests x86_64
FC-4/opensm/1.0-0.3965.2.FC4: i386 ia64 ppc ppc64 SRPMS tests x86_64
FC-4/udev/058-1.0.FC4.1.Infiniband: i386 ia64 SRPMS tests x86_64
FC-5/libibverbs/1.0.rc4-0.3965.2.FC5: i386 ia64 ppc ppc64 SRPMS tests x86_64
FC-5/libmthca/1.0.rc4-0.3965.2.FC5: i386 ia64 ppc ppc64 SRPMS tests x86_64
FC-5/libsdp/0.90-0.3965.2.FC5: i386 ia64 ppc ppc64 SRPMS tests x86_64
FC-5/opensm/1.0-0.3965.2.FC5: i386 ia64 ppc ppc64 SRPMS tests x86_64
RHEL-4/kernel/2.6.9-22.14.EL.OpenIB_3965.3: i686 ia64 noarch ppc64 ppc64iseries SRPMS tests x86_64
RHEL-4/libibverbs/1.0-0.3965.1.EL4: i386 ia64 SRPMS tests x86_64
RHEL-4/libibverbs/1.0.rc4-0.3965.2.EL4: i386 ia64 ppc64 SRPMS tests x86_64
RHEL-4/libmthca/1.0-0.3965.1.EL4: i386 ia64 SRPMS tests x86_64
RHEL-4/libmthca/1.0.rc4-0.3965.2.EL4: i386 ia64 ppc64 SRPMS tests x86_64
RHEL-4/libsdp/0.90-0.3965.2.EL4: i386 ia64 ppc64 SRPMS tests x86_64
RHEL-4/module-init-tools/3.1-0.pre5.3.Infiniband: i386 ia64 ppc ppc64 s390 s390x SRPMS tests x86_64
RHEL-4/opensm/1.0-0.3965.1.EL4: i386 ia64 ppc ppc64 s390 s390x SRPMS tests x86_64
RHEL-4/opensm/1.0-0.3965.2.EL4: i386 ia64 ppc ppc64 SRPMS tests x86_64
RHEL-4/udev/039-10.11.EL4.Infiniband: i386 ia64 ppc ppc64 s390 s390x SRPMS tests x86_64

And in case anyone needs it, the url again is:

http://people.redhat.com/dledford/Infiniband

Also, I want to thank the people that have been downloading and testing these packages.
If I miss anything that you see needs doing, feel free to let me know ;-)

--
Doug Ledford
http://people.redhat.com/dledford

From iod00d at hp.com Tue Nov 22 14:47:12 2005
From: iod00d at hp.com (Grant Grundler)
Date: Tue, 22 Nov 2005 14:47:12 -0800
Subject: [openib-general] Re: userspace physical memory
In-Reply-To:
References: <002c01c5ef8c$1575f8c0$9e5aa8c0@infiniconsys.com>
Message-ID: <20051122224712.GA26978@esmail.cup.hp.com>

On Tue, Nov 22, 2005 at 07:51:44PM +0000, yipee wrote:
> Fab Tillier silverstorm.com> writes:
>
> [snip]
>
> > No, there is not. The IB spec specifically calls out physical
> > registration as a privileged-only operation.
>
> And what about super-user? Isn't he privileged enough?

This isn't just about "user privileges" - it's about correct operation in an environment with asynchronous operations (IO and other CPUs). The kernel needs to coordinate access to physical resources.

> If 'root' can't do it then I'd have to write a small kernel
> module to do it?

If you explained better what you are trying to do, then maybe more experienced folks can advise if that's a good or bad idea.

grant

From mst at mellanox.co.il Tue Nov 22 18:54:45 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 23 Nov 2005 04:54:45 +0200
Subject: [openib-general] Re: [PATCH] Allow setting of NodeDescription
In-Reply-To: <20050915073629.GQ28025@mellanox.co.il>
References: <52ek7rxljj.fsf@cisco.com> <20050915073629.GQ28025@mellanox.co.il>
Message-ID: <20051123025445.GA6560@mellanox.co.il>

Quoting Michael S. Tsirkin :
> > This patch does a few things:
> > - Adds node_guid and node_desc fields to struct ib_device
> > - Has mthca set these fields on startup
> > - Extends modify_device method to handle setting node_desc
> > - Exposes node_desc in sysfs
> > - Allows userspace to set node_desc by writing into sysfs file, eg.
> > echo -n `hostname` >> /sys/class/infiniband/mthca0/node_desc
> >
> > This should probably be combined with Sean's work to get rid of
> > node_guid queries in ULPs.
> >
> > Comments?
> >
> > - R.
> >
> Good stuff.

Roland, do you plan to check this in?
-- MST

From rjwalsh at pathscale.com Tue Nov 22 16:33:51 2005
From: rjwalsh at pathscale.com (Robert Walsh)
Date: Tue, 22 Nov 2005 16:33:51 -0800
Subject: [openib-general] Multipathing
Message-ID: <1132706031.9516.61.camel@hematite.internal.keyresearch.com>

Hi all,

Is multipathing implemented/working in OpenIB?

Regards,
Robert.

--
Robert Walsh                     Email: rjwalsh at pathscale.com
PathScale, Inc.                  Phone: +1 650 934 8117
2071 Stierlin Court, Suite 200   Fax: +1 650 428 1969
Mountain View, CA 94043
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 481 bytes
Desc: This is a digitally signed message part
URL:

From sean.hefty at intel.com Tue Nov 22 16:55:09 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 22 Nov 2005 16:55:09 -0800
Subject: [openib-general] OpenSM Debug
In-Reply-To: <20051122182942.GG3275@kalmia.hozed.org>
Message-ID:

>How much timing overhead does debug add anyway? Based on what I saw at
>supercomputing, OpenSM spent more time in the kernel and doing context
>/thread switches than actually doing a lot of computations.

Depending on the actual interfaces used by opensm, debug used to add considerable overhead.

- Sean

From yael at mellanox.co.il Tue Nov 22 23:06:09 2005
From: yael at mellanox.co.il (Yael Kalka)
Date: Wed, 23 Nov 2005 09:06:09 +0200
Subject: [openib-general] RE: [PATCH] Opensm - wrong assertions in error callbacks
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30E240C@mtlexch01.mtl.com>

Hi Hal,
You are correct. The Notices are requests. Then the assertion on the resp_expected should be removed from the osm_sa_mad_ctrl.c.
Thanks,
Yael

-----Original Message-----
From: Hal Rosenstock [mailto:halr at voltaire.com]
Sent: Tuesday, November 22, 2005 3:14 PM
To: Yael Kalka
Cc: openib-general at openib.org; Eitan Zahavi
Subject: Re: [PATCH] Opensm - wrong assertions in error callbacks

Hi Yael,

On Tue, 2005-11-22 at 07:19, Yael Kalka wrote:
> Hi Hal,
>
> I saw that in the send_err_callback functions both in osm_sa_mad_ctrl
> and in osm_sm_mad_ctrl are wrong. I assume we haven't encountered
> these problems since we haven't encountered send errors on debug
> mode...

One question about this below:

> In osm_sa_mad_ctrl - all mads should be with resp_expected == FALSE
> (and not TRUE), as these are all responses.

What about SA Notices ? Aren't they requests from the perspective of the SA ?

> In osm_sm_mad_ctrl - we can send both requests and responses (for
> SMInfo, for example), so no use in checking the resp_expected flag.
> This patch fixes these issues.

-- Hal

From eitan at mellanox.co.il Tue Nov 22 23:25:06 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 23 Nov 2005 09:25:06 +0200
Subject: [openib-general] First Multicast Leave disconnects all other clients
Message-ID: <43841952.901@mellanox.co.il>

Hi All,

This is an old issue that I think we might eventually need to fix.
I would appreciate the group's thoughts on it:

IBTA defines Multicast Group membership (for routing purposes) to be requested from and maintained by the SA.
The "membership" from the SA point of view is of "IB end-ports" not "clients" (but you know all that already).

As the SA keeps track of "end-ports" registrations, any "end-port" requesting to leave the group will
result in the port being disconnected from the multicast group (packets will not be forwarded to it).

The following sequence is very possible:
1. Two clients (A and B) on the same machine requested to join group G
2. Suddenly client B decided to withdraw its membership (maybe it just cleans up before it leaves).
3. The SA gets B's request to leave the group and disconnects the port
4. Client A got disconnected...

It seems the IBTA intent was that the IB driver will be responsible for maintaining the list of clients
registered to each group.

But the IB core does not track what clients registered (through SA requests) to a particular multicast group.
The first client to leave the group causes the rest (of the clients) to be disconnected.

My proposal is to provide an API for such registrations at both user and kernel and track the requesting processes.
Cleanup is also required both by process and kernel module granularity.

BTW: The same API could also handle "Client Reregistration" for multicast groups,
such that we could avoid the need to have that code duplicated by every client.
But this refers to yet another API that is missing: Report dispatching which deserves its own mail...

From yael at mellanox.co.il Wed Nov 23 01:37:22 2005
From: yael at mellanox.co.il (Yael Kalka)
Date: 23 Nov 2005 11:37:22 +0200
Subject: [openib-general] [PATCH] Opensm - client reregistration bit
Message-ID: <5z1x17yab1.fsf@mtl066.yok.mtl.com>

Hi Hal,

Currently when sending mads with PortInfo Set we set client_rereg to
be zero (if we are in first_time_master_sweep and the relevant capability
mask is on). In other cases, we send in the client_reregister bit the
data we have saved. If this data is 1, then we will keep on sending
it.
This patch assures that we send 0 in the client_reregister bit,
unless we specifically want to send 1.

I think there is still a bug in the client_reregistration. If we
merge subnets, the new ports will not be set with client_rereg=1
(since the master SM is not in first sweep).
I plan to continue working on the client_reregistration issue next
week, and fix this issue as well.
Thanks,
Yael

Signed-off-by: Yael Kalka

Index: opensm/osm_lid_mgr.c
===================================================================
--- opensm/osm_lid_mgr.c (revision 4119)
+++ opensm/osm_lid_mgr.c (working copy)
@@ -1146,6 +1146,8 @@ __osm_lid_mgr_set_physp_pi(
 if ( ( p_mgr->p_subn->first_time_master_sweep == TRUE ) &&
 ( (p_old_pi->capability_mask & IB_PORT_CAP_HAS_CLIENT_REREG) != 0 ) )
 ib_port_info_set_client_rereg( p_pi, 1 );
+ else
+ ib_port_info_set_client_rereg( p_pi, 0 );

 /* We need to send the PortInfoSet request with the new sm_lid
 in the following cases:

From yipeeyipeeyipeeyipee at yahoo.com Wed Nov 23 02:07:32 2005
From: yipeeyipeeyipeeyipee at yahoo.com (yipee)
Date: Wed, 23 Nov 2005 10:07:32 +0000 (UTC)
Subject: [openib-general] Re: userspace physical memory
References: <002c01c5ef8c$1575f8c0$9e5aa8c0@infiniconsys.com> <20051122224712.GA26978@esmail.cup.hp.com>
Message-ID:

Grant Grundler hp.com> writes:

[snip]
> If you explained better what you are trying to do, then maybe more
> experienced folks can advise if that's a good or bad idea.

OK, I work on some code in a closed Linux environment that requires the ability to RDMA data to/from every address of the available physical memory. This Linux environment is not general purpose; it is very specialized. We have total control of the IO & CPU operations, so it is safe for us to have this physical memory registration.

thanks,
y

From halr at voltaire.com Wed Nov 23 02:27:06 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 23 Nov 2005 05:27:06 -0500
Subject: [openib-general] RE: [PATCH] Opensm - wrong assertions in error callbacks
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30E240C@mtlexch01.mtl.com>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E30E240C@mtlexch01.mtl.com>
Message-ID: <1132741624.26731.42437.camel@hal.voltaire.com>

On Wed, 2005-11-23 at 02:06, Yael Kalka wrote:
> Hi Hal,
> You are correct. The Notices are requests.
> Then the assertion on the resp_expected
> should be removed from the osm_sa_mad_ctrl.c.

Thanks. Applied.

> Thanks,
> Yael
>
> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Tuesday, November 22, 2005 3:14 PM
> To: Yael Kalka
> Cc: openib-general at openib.org; Eitan Zahavi
> Subject: Re: [PATCH] Opensm - wrong assertions in error callbacks
>
>
> Hi Yael,
>
> On Tue, 2005-11-22 at 07:19, Yael Kalka wrote:
> > Hi Hal,
> >
> > I saw that in the send_err_callback functions both in osm_sa_mad_ctrl
> > and in osm_sm_mad_ctrl are wrong. I assume we haven't encountered
> > these problems since we haven't encountered send errors on debug
> > mode...
>
> One question about this below:
>
> > In osm_sa_mad_ctrl - all mads should be with resp_expected == FALSE
> > (and not TRUE), as these are all responses.
>
> What about SA Notices ? Aren't they requests from the perspective of the
> SA ?
>
> > In osm_sm_mad_ctrl - we can send both requests and responses (for
> > SMInfo, for example), so no use in checking the resp_expected flag.
> > This patch fixes these issues.
>
> -- Hal

From halr at voltaire.com Wed Nov 23 02:49:33 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 23 Nov 2005 05:49:33 -0500
Subject: [openib-general] First Multicast Leave disconnects all other clients
In-Reply-To: <43841952.901@mellanox.co.il>
References: <43841952.901@mellanox.co.il>
Message-ID: <1132742810.26731.42532.camel@hal.voltaire.com>

Hi Eitan,

On Wed, 2005-11-23 at 02:25, Eitan Zahavi wrote:
> Hi All,
>
> This is an old issue that I think we might eventually need to fix.
> I would appreciate the group's thoughts on it:
>
> IBTA defines Multicast Group membership (for routing purposes) to be requested from and maintained by the SA.
> The "membership" from the SA point of view is of "IB end-ports" not "clients" (but you know all that already).
>
> As the SA keeps track of "end-ports" registrations, any "end-port" requesting to leave the group will
> result in the port being disconnected from the multicast group (packets will not be forwarded to it).
>
> The following sequence is very possible:
> 1. Two clients (A and B) on the same machine requested to join group G
> 2. Suddenly client B decided to withdraw its membership (maybe it just cleans up before it leaves).
> 3. The SA gets B's request to leave the group and disconnects the port
> 4. Client A got disconnected...
>
> It seems the IBTA intent was that the IB driver will be responsible for maintaining the list of clients
> registered to each group.

Yes, the end node is responsible for tracking the registrations within the node and fabricating responses when the node does not want to leave. Is delete a different case though ?

> But the IB core does not track what clients registered (through SA requests) to a particular multicast group.
> The first client to leave the group causes the rest (of the clients) to be disconnected.

This is an implementation issue IMO and applies to other subscriptions too (not just limited to multicast).

> My proposal is to provide an API for such registrations at both user and kernel and track the requesting processes.
> Cleanup is also required both by process and kernel module granularity.

Is the API the SA client request itself for this ? Shouldn't the tracking be done there (within sa_query.c) ?

> BTW: The same API could also handle "Client Reregistration" for multicast groups,

Client reregistration is for all subscriptions (including ServiceRecords and events as well).

> such that we could avoid the need to have that code duplicated by every client.

I'm missing how client reregistration would help here. Can you elaborate ?

> But this refers to yet another API that is missing: Report dispatching which deserves its own
> mail...

I'm missing the connection between reregistration and report dispatching.
-- Hal

From halr at voltaire.com Wed Nov 23 03:00:44 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 23 Nov 2005 06:00:44 -0500
Subject: [openib-general] Re: [PATCH] Opensm - client reregistration bit
In-Reply-To: <5z1x17yab1.fsf@mtl066.yok.mtl.com>
References: <5z1x17yab1.fsf@mtl066.yok.mtl.com>
Message-ID: <1132743639.26731.42598.camel@hal.voltaire.com>

On Wed, 2005-11-23 at 04:37, Yael Kalka wrote:
> Hi Hal,
>
> Currently when sending mads with PortInfo Set we set client_rereg to
> be zero

I think you mean one here.

> (if we are in first_time_master_sweep and the relevant capability
> mask is on). In other cases, we send in the client_reregister bit the
> data we have saved. If this data is 1, then we will keep on sending
> it.
> This patch assures that we send 0 in the client_reregister bit,
> unless we specifically want to send 1.

Thanks. Applied.

> I think there is still a bug in the client_reregistration. If we
> merge subnets, the new ports will not be set with client_rereg=1
> (since the master SM is not in first sweep).
> I plan to continue working on the client_reregistration issue next
> week, and fix this issue as well.

OK. Thanks.

There is also the issue that client_reregistration is set too early and often the SA is not ready as the subscriptions are received before the first sweep completes. So this currently relies on the client timeout and retransmit strategy. The SM should only request this when it is really ready.
-- Hal

> Thanks,
> Yael
>
> Signed-off-by: Yael Kalka
>
> Index: opensm/osm_lid_mgr.c
> ===================================================================
> --- opensm/osm_lid_mgr.c (revision 4119)
> +++ opensm/osm_lid_mgr.c (working copy)
> @@ -1146,6 +1146,8 @@ __osm_lid_mgr_set_physp_pi(
> if ( ( p_mgr->p_subn->first_time_master_sweep == TRUE ) &&
> ( (p_old_pi->capability_mask & IB_PORT_CAP_HAS_CLIENT_REREG) != 0 ) )
> ib_port_info_set_client_rereg( p_pi, 1 );
> + else
> + ib_port_info_set_client_rereg( p_pi, 0 );
>
> /* We need to send the PortInfoSet request with the new sm_lid
> in the following cases:

From yael at mellanox.co.il Wed Nov 23 03:48:59 2005
From: yael at mellanox.co.il (Yael Kalka)
Date: Wed, 23 Nov 2005 13:48:59 +0200
Subject: [openib-general] RE: [PATCH] Opensm - client reregistration bit
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30E2417@mtlexch01.mtl.com>

Hal Rosenstock wrote:
> On Wed, 2005-11-23 at 04:37, Yael Kalka wrote:
> > Hi Hal,
> >
> > Currently when sending mads with PortInfo Set we set client_rereg to
> > be zero
>
> I think you mean one here.

You are right.

> > (if we are in first_time_master_sweep and the relevant capability
> > mask is on). In other cases, we send in the client_reregister bit the
> > data we have saved. If this data is 1, then we will keep on sending
> > it.
> > This patch assures that we send 0 in the client_reregister bit,
> > unless we specifically want to send 1.
>
> Thanks. Applied.
>
> > I think there is still a bug in the client_reregistration. If we
> > merge subnets, the new ports will not be set with client_rereg=1
> > (since the master SM is not in first sweep).
> > I plan to continue working on the client_reregistration issue next
> > week, and fix this issue as well.
>
> OK. Thanks.
>
> There is also the issue that client_reregistration is set too early and
> often the SA is not ready as the subscriptions are received before the
> first sweep completes.
> So this currently relies on the client timeout
> and retransmit strategy. The SM should only request this when it is
> really ready.

Ok. I will have to think how to handle this. Thanks for mentioning the issue.

Yael

> -- Hal
>
> > Thanks,
> > Yael
> >
> > Signed-off-by: Yael Kalka
> >
> > Index: opensm/osm_lid_mgr.c
> > ===================================================================
> > --- opensm/osm_lid_mgr.c (revision 4119)
> > +++ opensm/osm_lid_mgr.c (working copy)
> > @@ -1146,6 +1146,8 @@ __osm_lid_mgr_set_physp_pi(
> > if ( ( p_mgr->p_subn->first_time_master_sweep == TRUE ) &&
> > ( (p_old_pi->capability_mask & IB_PORT_CAP_HAS_CLIENT_REREG) != 0 ) )
> > ib_port_info_set_client_rereg( p_pi, 1 );
> > + else
> > + ib_port_info_set_client_rereg( p_pi, 0 );
> >
> > /* We need to send the PortInfoSet request with the new sm_lid
> > in the following cases:

From harris_arthur at emc.com Wed Nov 23 09:58:28 2005
From: harris_arthur at emc.com (harris, arthur)
Date: Wed, 23 Nov 2005 12:58:28 -0500
Subject: [openib-general] Warnings while compiling with kernel 2.6.14.2...
Message-ID: <66BA8F805B5BEB469B7FA8F00B80C34B0DB892@srplunkett.eng.emc.com>

Any suggestions as to how I might resolve the following problem?

I am unable to insmod the modules ib_iser.ko and kdapl_ib.ko.
While compiling the latest Linux kernel (Linux-2.6.14.2) with the latest openib trunk, I encountered the following WARNINGS:

WARNING: /lib/modules/2.6.14.2/kernel/drivers/infiniband/ulp/iser/ib_iser.ko needs unknown symbol dat_registry_list_providers
WARNING: /lib/modules/2.6.14.2/kernel/drivers/infiniband/ulp/iser/ib_iser.ko needs unknown symbol dat_ia_open
WARNING: /lib/modules/2.6.14.2/kernel/drivers/infiniband/ulp/iser/ib_iser.ko needs unknown symbol dat_ia_close
WARNING: /lib/modules/2.6.14.2/kernel/drivers/infiniband/ulp/kdapl/ib/kdapl_ib.ko needs unknown symbol dat_registry_remove_provider
WARNING: /lib/modules/2.6.14.2/kernel/drivers/infiniband/ulp/kdapl/ib/kdapl_ib.ko needs unknown symbol dat_registry_add_provider

I did locate each of these symbols in the file /usr/src/kernels/linux-2.6.14.2/drivers/infiniband/ulp/kdapl/api.c

Additionally, I was also able to install the following ib modules:

> lsmod | grep ib_
ib_srp                 26368  0
ib_sdp                141440  0
ib_ping                 9616  0
ib_umad                18864  0
ib_uat                 15436  0
ib_at                  24000  1 ib_uat
ib_ucm                 20868  0
ib_uverbs              38568  2 rdma_ucm,ib_ucm
ib_cm                  37604  4 ib_srp,ib_sdp,ib_ucm,rdma_cm
ib_addr                11268  1 rdma_cm
ib_ipoib               56464  0
ib_sa                  16676  5 ib_srp,ib_sdp,ib_at,rdma_cm,ib_ipoib
ib_mthca              116764  0
ib_mad                 39328  5 ib_ping,ib_umad,ib_cm,ib_sa,ib_mthca
ib_core                46848  12 ib_srp,ib_sdp,ib_ping,ib_umad,ib_ucm,ib_uverbs,rdma_cm,ib_cm,ib_ipoib,ib_sa,ib_mthca,ib_mad
scsi_mod              139432  3 ib_srp,libata,sd_mod

Any Thoughts?
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jlentini at netapp.com Wed Nov 23 10:53:27 2005
From: jlentini at netapp.com (James Lentini)
Date: Wed, 23 Nov 2005 13:53:27 -0500 (EST)
Subject: [openib-general] Warnings while compiling with kernel 2.6.14.2...
In-Reply-To: <66BA8F805B5BEB469B7FA8F00B80C34B0DB892@srplunkett.eng.emc.com>
References: <66BA8F805B5BEB469B7FA8F00B80C34B0DB892@srplunkett.eng.emc.com>
Message-ID:

On Wed, 23 Nov 2005, harris, arthur wrote:

>
> Any suggestions as to how I might resolve the following problem?
>
> I am unable to insmod the modules ib_iser.ko and kdapl_ib.ko.
>
> While compiling the latest Linux kernel (Linux-2.6.14.2) with the latest
> openib trunk, I encountered the following WARNINGS:
>
> WARNING: /lib/modules/2.6.14.2/kernel/drivers/infiniband/ulp/iser/ib_iser.ko
> needs unknown symbol dat_registry_list_providers
> WARNING: /lib/modules/2.6.14.2/kernel/drivers/infiniband/ulp/iser/ib_iser.ko
> needs unknown symbol dat_ia_open
> WARNING: /lib/modules/2.6.14.2/kernel/drivers/infiniband/ulp/iser/ib_iser.ko
> needs unknown symbol dat_ia_close
> WARNING:
> /lib/modules/2.6.14.2/kernel/drivers/infiniband/ulp/kdapl/ib/kdapl_ib.ko
> needs unknown symbol dat_registry_remove_provider
> WARNING:
> /lib/modules/2.6.14.2/kernel/drivers/infiniband/ulp/kdapl/ib/kdapl_ib.ko
> needs unknown symbol dat_registry_add_provider
>
>
> I did locate each of these symbols in the file
> /usr/src/kernels/linux-2.6.14.2/drivers/infiniband/ulp/kdapl/api.c
>
> Additionally, I was also able to install the following ib modules:
>
> > lsmod | grep ib_
> ib_srp                 26368  0
> ib_sdp                141440  0
> ib_ping                 9616  0
> ib_umad                18864  0
> ib_uat                 15436  0
> ib_at                  24000  1 ib_uat
> ib_ucm                 20868  0
> ib_uverbs              38568  2 rdma_ucm,ib_ucm
> ib_cm                  37604  4 ib_srp,ib_sdp,ib_ucm,rdma_cm
> ib_addr                11268  1 rdma_cm
> ib_ipoib               56464  0
> ib_sa                  16676  5 ib_srp,ib_sdp,ib_at,rdma_cm,ib_ipoib
> ib_mthca              116764  0
> ib_mad                 39328  5 ib_ping,ib_umad,ib_cm,ib_sa,ib_mthca
> ib_core                46848  12
> ib_srp,ib_sdp,ib_ping,ib_umad,ib_ucm,ib_uverbs,rdma_cm,ib_cm,ib_ipoib,ib_sa,
> ib_mthca,ib_mad
> scsi_mod              139432  3 ib_srp,libata,sd_mod
>
>
> Any Thoughts?

First modprobe kdapl, then inserting kdapl_ib and ib_iser should work.
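
The load order suggested in the reply above can be sketched as a small shell helper. This is only an illustration, assuming the modules are installed where modprobe can find them and that kdapl is the module providing the dat_* symbols, as the reply implies. The function name and the DRY_RUN switch are hypothetical and not from the thread.

```shell
#!/bin/sh
# Sketch only: load the DAT registry module (kdapl) before the modules that
# need its dat_* symbols. Module names come from the thread; the function
# name and the DRY_RUN switch are hypothetical additions for illustration.
load_kdapl_stack() {
    for mod in kdapl kdapl_ib ib_iser; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "modprobe $mod"          # only show what would be run
        else
            modprobe "$mod" || return 1   # stop at the first failure (needs root)
        fi
    done
}
```

Calling `DRY_RUN=1 load_kdapl_stack` prints the three modprobe commands in order without touching the kernel; calling it without DRY_RUN as root attempts the actual loads.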

From Arkady.Kanevsky at netapp.com Wed Nov 23 12:16:11 2005
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Wed, 23 Nov 2005 15:16:11 -0500
Subject: [openib-general] socket based connection model for IB proposal -round 4
Message-ID:

Yes.
The private data format is not RC or UC specific.
I will add a comment that the format covers both EE and C.

Is this sufficient?
Arkady

Arkady Kanevsky                  email: arkady at netapp.com
Network Appliance Inc.           phone: 781-768-5395
275 Totten Pond Rd.              Fax: 781-895-1195
Waltham, MA 02451-2010           central phone: 781-768-5300

> -----Original Message-----
> From: Sean Hefty [mailto:sean.hefty at intel.com]
> Sent: Thursday, November 17, 2005 12:40 PM
> To: Kanevsky, Arkady; swg at infinibandta.org;
> openib-general at openib.org; dat-discussions at yahoogroups.com
> Subject: RE: [openib-general] socket based connection model
> for IB proposal -round 4
>
>
> If the proposal will include UDP, should the definition
> extend beyond connections to include UD QPs as well (i.e. SIDR REQ)?
>
> - Sean
>

From sean.hefty at intel.com Wed Nov 23 12:19:40 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 23 Nov 2005 12:19:40 -0800
Subject: [openib-general] socket based connection model for IB proposal -round 4
In-Reply-To:
Message-ID:

>Yes.
>The private data format is not RC or UC specific.
>I will add a comment that the format covers both EE and C.
>
>Is this sufficient?

I was referring to UD, which is the SIDR REQ message.

- Sean

From ted.kim at sun.com Wed Nov 23 12:28:31 2005
From: ted.kim at sun.com (Ted H. Kim)
Date: Wed, 23 Nov 2005 12:28:31 -0800
Subject: [swg] RE: [openib-general] socket based connectionmodel for IB proposal -round 4
In-Reply-To:
References:
Message-ID: <4384D0EF.1060309@sun.com>

Arkady,

I know we originally set out to compress everything down to the minimum to preserve as much ULP specific private data as possible.
But it seems to me in the current proposal we have reserved space now which could be used to re-expand the version to major 4-bits and minor 4-bits without harming anything else.

Can we entertain that as an option?
My rationale is to err on the side of perhaps a little too much version room than too little. This will put it in line with the precedent of SDP.

-ted

Kanevsky, Arkady wrote:
> pdf version of the proposal.
>
>
> Arkady Kanevsky email: arkady at netapp.com
>
>
> Network Appliance Inc. phone: 781-768-5395
>
> 275 Totten Pond Rd. Fax: 781-895-1195
>
> Waltham, MA 02451-2010 central phone: 781-768-5300
>
>
>
> ------------------------------------------------------------------------
> *From:* Kanevsky, Arkady
> *Sent:* Wednesday, November 16, 2005 11:59 AM
> *To:* swg at infinibandta.org; openib-general at openib.org;
> dat-discussions at yahoogroups.com
> *Subject:* [openib-general] socket based connectionmodel for IB
> proposal -round 4
>
> This version incorporates the feedback on 3 reflectors and
> yesterday's SWG meeting.
>
> Major changes from previous version are:
> no REQ bit to identify private data formatting - SID range used instead
> port mapping uses IBTA space and IETF protocol # is encoded in SID
> protocol version is 4 bits.
>
> Arkady
>
>
> Arkady Kanevsky email: arkady at netapp.com
>
>
> Network Appliance Inc. phone: 781-768-5395
>
> 275 Totten Pond Rd. Fax: 781-895-1195
>
> Waltham, MA 02451-2010 central phone: 781-768-5300
>
>

--
Ted H. Kim
Sun Microsystems, Inc.                  ted.kim at sun.com
222 North Sepulveda Blvd., 10th Floor   (310) 341-1116
El Segundo, CA 90245                    (310) 341-1120 FAX

From ted.kim at sun.com Wed Nov 23 12:31:17 2005
From: ted.kim at sun.com (Ted H. Kim)
Date: Wed, 23 Nov 2005 12:31:17 -0800
Subject: [swg] RE: [openib-general] socket based connectionmodel for IB proposal -round 4
In-Reply-To: <4384D0EF.1060309@sun.com>
References: <4384D0EF.1060309@sun.com>
Message-ID: <4384D195.9050705@sun.com>

Err, badly phrased -- what I mean by "can we entertain that option" is can we either revise the proposal or put it in as a voteable. I don't actually want an "optional" format.

Sorry about that.

-ted

Ted H. Kim wrote:
> Arkady,
>
> I know we originally set out to compress everything down to
> the minimum to preserve as much ULP specific private data as
> possible. But it seems to me in the current proposal we have
> reserved space now which could be used to re-expand the
> version to major 4-bits and minor 4-bits without harming
> anything else.
>
> Can we entertain that as an option?
> My rationale is to err on the side of perhaps a little
> too much version room than too little. This will put it
> in line with the precedent of SDP.
>
> -ted
>
>
>
> Kanevsky, Arkady wrote:
>
>> pdf version of the proposal.
>>
>>
>> Arkady Kanevsky email: arkady at netapp.com
>>
>>
>> Network Appliance Inc. phone: 781-768-5395
>>
>> 275 Totten Pond Rd. Fax: 781-895-1195
>>
>> Waltham, MA 02451-2010 central phone: 781-768-5300
>>
>>
>>
>>
>> ------------------------------------------------------------------------
>> *From:* Kanevsky, Arkady
>> *Sent:* Wednesday, November 16, 2005 11:59 AM
>> *To:* swg at infinibandta.org; openib-general at openib.org;
>> dat-discussions at yahoogroups.com
>> *Subject:* [openib-general] socket based connectionmodel for IB
>> proposal -round 4
>>
>> This version incorporates the feedback on 3 reflectors and
>> yesterday's SWG meeting.
>> Major changes from previous version are:
>> no REQ bit to identify private data formatting - SID range used instead
>> port mapping uses IBTA space and IETF protocol # is encoded in SID
>> protocol version is 4 bits.
>> Arkady
>>
>> Arkady Kanevsky email: arkady at netapp.com
>>
>>
>> Network Appliance Inc. phone: 781-768-5395
>>
>> 275 Totten Pond Rd. Fax: 781-895-1195
>>
>> Waltham, MA 02451-2010 central phone: 781-768-5300
>>
>>
>
>

--
Ted H. Kim
Sun Microsystems, Inc.                  ted.kim at sun.com
222 North Sepulveda Blvd., 10th Floor   (310) 341-1116
El Segundo, CA 90245                    (310) 341-1120 FAX

From Arkady.Kanevsky at netapp.com Wed Nov 23 12:35:37 2005
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Wed, 23 Nov 2005 15:35:37 -0500
Subject: [swg] RE: [openib-general] socket based connectionmodel for IB proposal -round 4
Message-ID:

This is fine with me.
I will update the proposal with this for the next version.
Arkady

Arkady Kanevsky                  email: arkady at netapp.com
Network Appliance Inc.           phone: 781-768-5395
275 Totten Pond Rd.              Fax: 781-895-1195
Waltham, MA 02451-2010           central phone: 781-768-5300

> -----Original Message-----
> From: Ted H. Kim [mailto:ted.kim at sun.com]
> Sent: Wednesday, November 23, 2005 3:29 PM
> To: Kanevsky, Arkady
> Cc: swg at infinibandta.org; openib-general at openib.org;
> dat-discussions at yahoogroups.com
> Subject: Re: [swg] RE: [openib-general] socket based
> connectionmodel for IB proposal -round 4
>
> Arkady,
>
> I know we originally set out to compress everything down to
> the minimum to preserve as much ULP specific private data as
> possible. But it seems to me in the current proposal we have
> reserved space now which could be used to re-expand the
> version to major 4-bits and minor 4-bits without harming
> anything else.
>
> Can we entertain that as an option?
> My rationale is to err on the side of perhaps a little too
> much version room than too little. This will put it in line
> with the precedent of SDP.
>
> -ted
>
>
>
> Kanevsky, Arkady wrote:
> > pdf version of the proposal.
> >
> >
> > Arkady Kanevsky email: arkady at netapp.com
> >
> >
> > Network Appliance Inc. phone: 781-768-5395
> >
> > 275 Totten Pond Rd. Fax: 781-895-1195
> >
> > Waltham, MA 02451-2010 central phone: 781-768-5300
> >
> >
> >
> >
> --------------------------------------------------------------
> ----------
> > *From:* Kanevsky, Arkady
> > *Sent:* Wednesday, November 16, 2005 11:59 AM
> > *To:* swg at infinibandta.org; openib-general at openib.org;
> > dat-discussions at yahoogroups.com
> > *Subject:* [openib-general] socket based connectionmodel for IB
> > proposal -round 4
> >
> > This version incorporates the feedback on 3 reflectors and
> > yesterday's SWG meeting.
> >
> > Major changes from previous version are:
> > no REQ bit to identify private data formatting - SID
> range used instead
> > port mapping uses IBTA space and IETF protocol # is
> encoded in SID
> > protocol version is 4 bits.
> >
> > Arkady
> >
> >
> > Arkady Kanevsky email: arkady at netapp.com
> >
> >
> > Network Appliance Inc. phone: 781-768-5395
> >
> > 275 Totten Pond Rd. Fax: 781-895-1195
> >
> > Waltham, MA 02451-2010 central phone: 781-768-5300
> >
> >
>
> --
> Ted H. Kim
> Sun Microsystems, Inc. ted.kim at sun.com
> 222 North Sepulveda Blvd., 10th Floor (310) 341-1116
> El Segundo, CA 90245 (310) 341-1120 FAX
>

From mshefty at ichips.intel.com Wed Nov 23 12:40:36 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 23 Nov 2005 12:40:36 -0800
Subject: [swg] RE: [openib-general] socket based connectionmodel for IB proposal -round 4
In-Reply-To: <4384D0EF.1060309@sun.com>
References: <4384D0EF.1060309@sun.com>
Message-ID: <4384D3C4.8070105@ichips.intel.com>

Ted H. Kim wrote:
> I know we originally set out to compress everything down to
> the minimum to preserve as much ULP specific private data as
> possible. But it seems to me in the current proposal we have
> reserved space now which could be used to re-expand the
> version to major 4-bits and minor 4-bits without harming
> anything else.

I don't see any benefit to having 2 4-bit version numbers over a single 8-bit number.
A single 4-bit version number should suffice. If all version numbers are ever consumed, then version 15 can define an extended version field. IMO, multiple version fields simply complicate the implementation. I would rather see the reserved space used to define the size of carried user-private data. - Sean From bunk at stusta.de Wed Nov 23 14:34:56 2005 From: bunk at stusta.de (Adrian Bunk) Date: Wed, 23 Nov 2005 23:34:56 +0100 Subject: [openib-general] [2.6 patch] drivers/infiniband/core/mad.c: fix a NULL pointer dereference Message-ID: <20051123223456.GD3963@stusta.de> The Coverity checker spotted this obvious NULL pointer dereference caused by a wrong order of the cleanups. Signed-off-by: Adrian Bunk --- This patch was already sent on: - 21 Nov 2005 --- linux-2.6.15-rc1-mm2-full/drivers/infiniband/core/mad.c.old 2005-11-20 22:04:36.000000000 +0100 +++ linux-2.6.15-rc1-mm2-full/drivers/infiniband/core/mad.c 2005-11-20 22:05:17.000000000 +0100 @@ -355,9 +355,9 @@ spin_unlock_irqrestore(&port_priv->reg_lock, flags); kfree(reg_req); error3: - kfree(mad_agent_priv); -error2: ib_dereg_mr(mad_agent_priv->agent.mr); +error2: + kfree(mad_agent_priv); error1: return ret; } From mst at mellanox.co.il Thu Nov 24 00:03:52 2005 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Thu, 24 Nov 2005 10:03:52 +0200 Subject: [openib-general] Re: user_mad: large rmpp length problem In-Reply-To: <20051120163310.GS20871@mellanox.co.il> References: <20051120163310.GS20871@mellanox.co.il> Message-ID: <20051124080352.GE3255@mellanox.co.il> Quoting r. Michael S. Tsirkin : > Subject: user_mad: large rmpp length problem > > Hello! > ib_umad_write currently accepts a count parameter from user > and attempts to allocate mad of size count - sizeof (struct ib_user_mad) > in kernel memory. > > This, obviously, fails with -ENOMEM, which means that we cant > send large transactions with RMPP. > > The proper fix appears to be to transfer the data by chunks, > waking the user process and copying a fixed number of bytes each time. Here's a very simple patch which, while not ideal, let us go up to 512KB. --- Allocate memory for large MAD buffers with __get_free_pages, making it possible to get buffers up to 512KB in size. Signed-off-by: Michael S. Tsirkin Signed-off-by: Jack Morgenstein Index: linux-kernel/drivers/infiniband/core/user_mad.c =================================================================== --- linux-kernel.orig/drivers/infiniband/core/user_mad.c +++ linux-kernel/drivers/infiniband/core/user_mad.c @@ -204,6 +204,34 @@ out: kfree(packet); } +static struct ib_umad_packet *alloc_packet(int buf_size) +{ + struct ib_umad_packet *packet; + int length = sizeof *packet + buf_size; + + if (length >= PAGE_SIZE) + packet = (void *)__get_free_pages(GFP_KERNEL, long_log2(roundup_pow_of_two(length)) - PAGE_SHIFT); + else + packet = kmalloc(length, GFP_KERNEL); + + if (!packet) + return NULL; + + memset(packet, 0, length); + return packet; +} + +static void free_packet(struct ib_umad_packet *packet) +{ + int length = packet->length + sizeof *packet; + if (length >= PAGE_SIZE) + free_pages((unsigned long) packet, long_log2(roundup_pow_of_two(length)) - PAGE_SHIFT); + else + kfree(packet); +} + + + static void recv_handler(struct ib_mad_agent *agent, 
struct ib_mad_recv_wc *mad_recv_wc) { @@ -215,7 +243,7 @@ static void recv_handler(struct ib_mad_a goto out; length = mad_recv_wc->mad_len; - packet = kzalloc(sizeof *packet + length, GFP_KERNEL); + packet = alloc_packet(length); if (!packet) goto out; @@ -240,7 +268,7 @@ static void recv_handler(struct ib_mad_a } if (queue_packet(file, agent, packet)) - kfree(packet); + free_packet(packet); out: ib_free_recv_mad(mad_recv_wc); @@ -294,7 +322,7 @@ static ssize_t ib_umad_read(struct file list_add(&packet->list, &file->recv_list); spin_unlock_irq(&file->recv_lock); } else - kfree(packet); + free_packet(packet); return ret; } Index: linux-kernel/drivers/infiniband/core/mad.c =================================================================== --- linux-kernel.orig/drivers/infiniband/core/mad.c +++ linux-kernel/drivers/infiniband/core/mad.c @@ -779,7 +779,7 @@ struct ib_mad_send_buf * ib_create_send_ { struct ib_mad_agent_private *mad_agent_priv; struct ib_mad_send_wr_private *mad_send_wr; - int buf_size; + int length, buf_size; void *buf; mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, @@ -791,10 +791,17 @@ struct ib_mad_send_buf * ib_create_send_ (!rmpp_active && buf_size > sizeof(struct ib_mad))) return ERR_PTR(-EINVAL); - buf = kzalloc(sizeof *mad_send_wr + buf_size, gfp_mask); + length = sizeof *mad_send_wr + buf_size; + if (length >= PAGE_SIZE) + buf = (void *)__get_free_pages(gfp_mask, long_log2(roundup_pow_of_two(length)) - PAGE_SHIFT); + else + buf = kmalloc(length, gfp_mask); + if (!buf) return ERR_PTR(-ENOMEM); + memset(buf, 0, length); + mad_send_wr = buf + buf_size; mad_send_wr->send_buf.mad = buf; @@ -830,10 +837,19 @@ EXPORT_SYMBOL(ib_create_send_mad); void ib_free_send_mad(struct ib_mad_send_buf *send_buf) { struct ib_mad_agent_private *mad_agent_priv; + void *mad_send_wr; + int length; mad_agent_priv = container_of(send_buf->mad_agent, struct ib_mad_agent_private, agent); - kfree(send_buf->mad); + mad_send_wr = 
container_of(send_buf, struct ib_mad_send_wr_private, + send_buf); + + length = sizeof(struct ib_mad_send_wr_private) + (mad_send_wr - send_buf->mad); + if (length >= PAGE_SIZE) + free_pages((unsigned long)send_buf->mad, long_log2(roundup_pow_of_two(length)) - PAGE_SHIFT); + else + kfree(send_buf->mad); if (atomic_dec_and_test(&mad_agent_priv->refcount)) wake_up(&mad_agent_priv->wait); -- MST
From bunk at stusta.de Sat Nov 26 15:37:36 2005 From: bunk at stusta.de (Adrian Bunk) Date: Sun, 27 Nov 2005 00:37:36 +0100 Subject: [openib-general] [2.6 patch] drivers/infiniband/core/mad.c: fix a NULL pointer dereference Message-ID: <20051126233736.GE3988@stusta.de> The Coverity checker spotted this obvious NULL pointer dereference caused by a wrong order of the cleanups. Signed-off-by: Adrian Bunk --- This patch was already sent on: - 23 Nov 2005 - 21 Nov 2005 --- linux-2.6.15-rc1-mm2-full/drivers/infiniband/core/mad.c.old 2005-11-20 22:04:36.000000000 +0100 +++ linux-2.6.15-rc1-mm2-full/drivers/infiniband/core/mad.c 2005-11-20 22:05:17.000000000 +0100 @@ -355,9 +355,9 @@ spin_unlock_irqrestore(&port_priv->reg_lock, flags); kfree(reg_req); error3: - kfree(mad_agent_priv); -error2: ib_dereg_mr(mad_agent_priv->agent.mr); +error2: + kfree(mad_agent_priv); error1: return ret; } From halr at voltaire.com Sun Nov 27 09:32:21 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Nov 2005 12:32:21 -0500 Subject: [openib-general] Re: user_mad: large rmpp length problem In-Reply-To: <20051124080352.GE3255@mellanox.co.il> References: <20051120163310.GS20871@mellanox.co.il> <20051124080352.GE3255@mellanox.co.il> Message-ID: <1133112741.12652.42.camel@hal.voltaire.com> On Thu, 2005-11-24 at 03:03, Michael S.
Tsirkin wrote: > Here's a very simple patch which, while not ideal, let us go up to 512KB. > > --- > > Allocate memory for large MAD buffers with __get_free_pages, > making it possible to get buffers up to 512KB in size. Looks good to me. Applied. Thanks. Should an error message be added in when an allocation fails ? -- Hal From mst at mellanox.co.il Sun Nov 27 09:55:29 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 27 Nov 2005 19:55:29 +0200 Subject: [openib-general] Re: user_mad: large rmpp length problem In-Reply-To: <1133112741.12652.42.camel@hal.voltaire.com> References: <1133112741.12652.42.camel@hal.voltaire.com> Message-ID: <20051127175529.GA7919@mellanox.co.il> Quoting r. Hal Rosenstock : > Subject: Re: user_mad: large rmpp length problem > > On Thu, 2005-11-24 at 03:03, Michael S. Tsirkin wrote: > > Here's a very simple patch which, while not ideal, let us go up to > 512KB. > > > > --- > > > > Allocate memory for large MAD buffers with __get_free_pages, > > making it possible to get buffers up to 512KB in size. > > Looks good to me. Applied. Thanks. > > Should an error message be added in when an allocation fails ? > > -- Hal > Given that we pass an error up to userspace, probably not. -- MST From mst at mellanox.co.il Sun Nov 27 10:26:03 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 27 Nov 2005 20:26:03 +0200 Subject: [openib-general] [PATCH] opensm: respect DESTDIR in install hook Message-ID: <20051127182603.GD7919@mellanox.co.il> osm makefile adds an install hook which doesnt respect the DESTDIR variable, making it hard to build RPMs. Signed-off-by: Michael S. 
Tsirkin Index: trunk/src/userspace/management/osm/Makefile.am =================================================================== --- trunk/src/userspace/management/osm/Makefile.am 2005-08-15 19:10:22.000000000 +0300 +++ trunk/src/userspace/management/osm/Makefile.am 2005-11-16 15:32:26.000000000 +0200 @@ -9,8 +9,9 @@ # we should provide a hint for other apps about the build mode of this project install-exec-hook: + mkdir -p $(DESTDIR)/$(includedir) if DEBUG - echo "define osm_build_type \"debug\"" > $(includedir)/infiniband/opensm/osm_build_id.h + echo "define osm_build_type \"debug\"" > $(DESTDIR)/$(includedir)/infiniband/opensm/osm_build_id.h else - echo "define osm_build_type \"free\"" > $(includedir)/infiniband/opensm/osm_build_id.h + echo "define osm_build_type \"free\"" > $(DESTDIR)/$(includedir)/infiniband/opensm/osm_build_id.h endif -- MST
From rolandd at cisco.com Sun Nov 27 15:51:41 2005 From: rolandd at cisco.com (Roland Dreier) Date: Sun, 27 Nov 2005 15:51:41 -0800 Subject: [openib-general] Re: [2.6 patch] drivers/infiniband/core/mad.c: fix a NULL pointer dereference References: <20051126233736.GE3988@stusta.de> Message-ID: <52irud4pki.fsf@cisco.com> Thanks, I already have this in my git tree of pending changes (I found it by actually hitting the crash it causes with CONFIG_DEBUG_SLAB=y). - R. From rolandd at cisco.com Sun Nov 27 16:17:42 2005 From: rolandd at cisco.com (Roland Dreier) Date: Sun, 27 Nov 2005 16:17:42 -0800 Subject: [openib-general] Re: user_mad: large rmpp length problem In-Reply-To: <20051124080352.GE3255@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 24 Nov 2005 10:03:52 +0200") References: <20051120163310.GS20871@mellanox.co.il> <20051124080352.GE3255@mellanox.co.il> Message-ID: <52wtit39sp.fsf@cisco.com> > Allocate memory for large MAD buffers with __get_free_pages, > making it possible to get buffers up to 512KB in size. Ugh, why is this an improvement?! What are the chances of an order-9 allocation succeeding on a system that's been running for a while?
> + if (length >= PAGE_SIZE) > + buf = (void *)__get_free_pages(gfp_mask, long_log2(roundup_pow_of_two(length)) - PAGE_SHIFT); > + else > + buf = kmalloc(length, gfp_mask); this is extra-gross too. I don't think this is worth it -- much better to fix the MAD API to handle gather/scatter lists. - R. From rolandd at cisco.com Sun Nov 27 16:21:05 2005 From: rolandd at cisco.com (Roland Dreier) Date: Sun, 27 Nov 2005 16:21:05 -0800 Subject: [openib-general] Re: [PATCH] Allow setting of NodeDescription In-Reply-To: <20051123025445.GA6560@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 23 Nov 2005 04:54:45 +0200") References: <52ek7rxljj.fsf@cisco.com> <20050915073629.GQ28025@mellanox.co.il> <20051123025445.GA6560@mellanox.co.il> Message-ID: <52slth39n2.fsf@cisco.com> Michael> Roland, do you plan to check this in? No, I dropped it for the moment because I couldn't come up with a good way to handle the window from when the IB port is brought up until userspace has a chance to set the node description. - R. From bunk at stusta.de Sun Nov 27 16:25:23 2005 From: bunk at stusta.de (Adrian Bunk) Date: Mon, 28 Nov 2005 01:25:23 +0100 Subject: [openib-general] Re: [2.6 patch] drivers/infiniband/core/mad.c: fix a NULL pointer dereference In-Reply-To: <52irud4pki.fsf@cisco.com> References: <20051126233736.GE3988@stusta.de> <52irud4pki.fsf@cisco.com> Message-ID: <20051128002523.GA31395@stusta.de> On Sun, Nov 27, 2005 at 03:51:41PM -0800, Roland Dreier wrote: > Thanks, I already have this in my git tree of pending changes > (I found it by actually hitting the crash it causes with CONFIG_DEBUG_SLAB=y). Can you Cc me when forwarding it to Linus? After it's in Linus' tree, Greg will accept it for the 2.6.14 stable tree. > - R. TIA Adrian -- "Is there not promise of rain?" Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. "Only a promise," Lao Er said. Pearl S. 
Buck - Dragon Seed From halr at voltaire.com Sun Nov 27 17:54:31 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Nov 2005 20:54:31 -0500 Subject: [openib-general] Re: [PATCH] Allow setting of NodeDescription In-Reply-To: <52slth39n2.fsf@cisco.com> References: <52ek7rxljj.fsf@cisco.com> <20050915073629.GQ28025@mellanox.co.il> <20051123025445.GA6560@mellanox.co.il> <52slth39n2.fsf@cisco.com> Message-ID: <1133142666.12652.2983.camel@hal.voltaire.com> On Sun, 2005-11-27 at 19:21, Roland Dreier wrote: > Michael> Roland, do you plan to check this in? > > No, I dropped it for the moment because I couldn't come up with a good > way to handle the window from when the IB port is brought up until > userspace has a chance to set the node description. Also, ideally there should be a way for an SM to know that this has changed. -- Hal From halr at voltaire.com Sun Nov 27 17:57:45 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Nov 2005 20:57:45 -0500 Subject: [openib-general] Multipathing In-Reply-To: <1132706031.9516.61.camel@hematite.internal.keyresearch.com> References: <1132706031.9516.61.camel@hematite.internal.keyresearch.com> Message-ID: <1133142838.12652.3004.camel@hal.voltaire.com> Hi Robert, On Tue, 2005-11-22 at 19:33, Robert Walsh wrote: > Hi all, > > Is multipathing implemented/working in OpenIB? What in specific are you looking for here ? LMC > 0 is supported in OpenSM (with -l/--lmc option) but has not (to my knowledge) been tested with the OpenIB stack. Also, OpenSM does not yet support SA MultiPathRecord but there are plans for this. -- Hal > Regards, > Robert. From mst at mellanox.co.il Sun Nov 27 21:35:28 2005 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Mon, 28 Nov 2005 07:35:28 +0200 Subject: [openib-general] Re: [PATCH] Allow setting of NodeDescription In-Reply-To: <52slth39n2.fsf@cisco.com> References: <52slth39n2.fsf@cisco.com> Message-ID: <20051128053528.GA15811@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] Allow setting of NodeDescription > > Michael> Roland, do you plan to check this in? > > No, I dropped it for the moment because I couldn't come up with a good > way to handle the window from when the IB port is brought up until > userspace has a chance to set the node description. What about simply initializing this to some string like "uninitialized"? This indicates in a pretty clear way that user needs to recheck the node later. -- MST From yael at mellanox.co.il Sun Nov 27 23:11:38 2005 From: yael at mellanox.co.il (Yael Kalka) Date: Mon, 28 Nov 2005 09:11:38 +0200 Subject: [openib-general] Multipathing Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30E242C@mtlexch01.mtl.com> Hi Hal, LMC > 0 has been tested, though not heavily, on the OpenIB stack. Yael -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Monday, November 28, 2005 3:58 AM To: Robert Walsh Cc: openib-general at openib.org Subject: Re: [openib-general] Multipathing Hi Robert, On Tue, 2005-11-22 at 19:33, Robert Walsh wrote: Hi all, Is multipathing implemented/working in OpenIB? What in specific are you looking for here ? LMC > 0 is supported in OpenSM (with -l/--lmc option) but has not (to my knowledge) been tested with the OpenIB stack. Also, OpenSM does not yet support SA MultiPathRecord but there are plans for this. -- Hal Regards, Robert.
From ianjiang.ict at gmail.com Sun Nov 27 23:23:27 2005 From: ianjiang.ict at gmail.com (Ian Jiang) Date: Mon, 28 Nov 2005 15:23:27 +0800 Subject: [openib-general] [udapl]How to build the udapl Message-ID: <7b2fa1820511272323t545e8040l258da18642000a77@mail.gmail.com> I am trying to use the udapl. I have got the source at https://openib.org/svn/gen2/trunk/src/userspace/dapl/ but could not find an instruction of building and installation. Should I start with the https://openib.org/svn/gen2/trunk/src/userspace/dapl/dapl/udapl/Makefile? I am using the redhat AS 4 with kernel 2.6.12.5. Any suggestion is appriciated! -- Ian Jiang ianjiang.ict at gmail.com Institute of Computing Technology, Chinese Academy of Sciences. From mst at mellanox.co.il Mon Nov 28 03:12:32 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 28 Nov 2005 13:12:32 +0200 Subject: [openib-general] Re: user_mad: large rmpp length problem In-Reply-To: <52wtit39sp.fsf@cisco.com> References: <52wtit39sp.fsf@cisco.com> Message-ID: <20051128111232.GV3255@mellanox.co.il> Quoting Roland Dreier : > > Allocate memory for large MAD buffers with __get_free_pages, > > making it possible to get buffers up to 512KB in size. > > Ugh, why is this an improvement?! I use this patch mainly as a diagnostic vehicle for various tools until we get it fixed in a better way. > What are the chances of an order-9 > allocation succeeding on a system that's been running for a while? Probably low. But reboot should help ;). > > + if (length >= PAGE_SIZE) > > + buf = (void *)__get_free_pages(gfp_mask, long_log2(roundup_pow_of_two(length)) - PAGE_SHIFT); > > + else > > + buf = kmalloc(length, gfp_mask); > > this is extra-gross too. I agree here. > I don't think this is worth it -- much better to fix the MAD API to > handle gather/scatter lists. We are looking into that.
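For anyone following the allocation-order discussion, here is a rough userspace sketch of what the quoted expression long_log2(roundup_pow_of_two(length)) - PAGE_SHIFT evaluates to. It is only an illustration: the sketch_* helpers are hypothetical stand-ins written for this note, not the kernel functions, and a 4 KB page size (a shift of 12) is assumed, which is architecture dependent.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4 KB pages. The real PAGE_SHIFT depends on the architecture. */
#define SKETCH_PAGE_SHIFT 12

/* Hypothetical stand-in for roundup_pow_of_two(): smallest power of two >= n. */
static size_t sketch_roundup_pow_of_two(size_t n)
{
    size_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/* Hypothetical stand-in for long_log2(): floor(log2(n)) for n > 0. */
static int sketch_log2(size_t n)
{
    int r = 0;

    while (n >>= 1)
        r++;
    return r;
}

/*
 * Page allocation order for a buffer of `length` bytes, mirroring the
 * expression in the quoted patch. Only meaningful for length >= one page,
 * which is the only case in which the patch takes this path.
 */
static int sketch_alloc_order(size_t length)
{
    assert(length >= ((size_t)1 << SKETCH_PAGE_SHIFT));
    return sketch_log2(sketch_roundup_pow_of_two(length)) - SKETCH_PAGE_SHIFT;
}
```

Under these assumptions a request of exactly 512 KB comes out as order 7, and a request slightly larger than that (for example the buffer plus a small header) rounds up to 1 MB, which is order 8. With a different page size the numbers shift accordingly.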
-- MST From halr at voltaire.com Mon Nov 28 03:25:10 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 28 Nov 2005 06:25:10 -0500 Subject: [openib-general] Re: [PATCH] opensm: respect DESTDIR in install hook In-Reply-To: <20051127182603.GD7919@mellanox.co.il> References: <20051127182603.GD7919@mellanox.co.il> Message-ID: <1133177027.12652.5776.camel@hal.voltaire.com> On Sun, 2005-11-27 at 13:26, Michael S. Tsirkin wrote: > osm makefile adds an install hook which doesnt respect the > DESTDIR variable, making it hard to build RPMs. Thanks. Applied. From mukul at pantasys.com Mon Nov 28 04:12:13 2005 From: mukul at pantasys.com (Mukul Kumar) Date: Mon, 28 Nov 2005 17:42:13 +0530 Subject: [openib-general] Status of SRP initiator & Target on OpenIB gen2 Message-ID: <20051128120929.3E1271622B@ox.pantasys.com> Hi, What is the status of SRP initiator & target on OpenIB gen2? Thanks, Mukul. From halr at voltaire.com Mon Nov 28 04:21:48 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 28 Nov 2005 07:21:48 -0500 Subject: [openib-general] Status of SRP initiator & Target on OpenIB gen2 In-Reply-To: <20051128120929.3E1271622B@ox.pantasys.com> References: <20051128120929.3E1271622B@ox.pantasys.com> Message-ID: <1133180507.12652.6037.camel@hal.voltaire.com> On Mon, 2005-11-28 at 07:12, Mukul Kumar wrote: > Hi, > > What is the status of SRP initiator & target on OpenIB gen2? The OpenIB SRP initiator has been pushed upstream for inclusion in 2.6.15. There is currently no OpenIB SRP target. -- Hal > Thanks, > Mukul. > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mst at mellanox.co.il Mon Nov 28 05:50:54 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 28 Nov 2005 15:50:54 +0200 Subject: [openib-general] Re: [PATCH] opensm: respect DESTDIR in install hook In-Reply-To: <1133177027.12652.5776.camel@hal.voltaire.com> References: <1133177027.12652.5776.camel@hal.voltaire.com> Message-ID: <20051128135054.GB25206@mellanox.co.il> Quoting Hal Rosenstock : > > osm makefile adds an install hook which doesnt respect the > > DESTDIR variable, making it hard to build RPMs. > > Thanks. Applied. Here's another one: --- Make osm makefiles respect DESTDIR. Signed-off-by: Michael S. Tsirkin Index: trunk/src/userspace/management/osm/complib/Makefile.am =================================================================== --- trunk/src/userspace/management/osm/complib/Makefile.am 2005-09-12 23:10:17.000000000 +0300 +++ trunk/src/userspace/management/osm/complib/Makefile.am 2005-11-17 17:16:05.000000000 +0200 @@ -97,6 +97,6 @@ # that it will mark it with an alias... # we find the new lib by traversing the links install-exec-hook: - if test -L $(libdir)/libosmcomp-$(VERSION).so; then rm $(libdir)/libosmcomp-$(VERSION).so; fi; \ - lname=`\ls -l $(libdir)/libosmcomp.so | awk '{print $$NF}'`; \ - ln -s $$lname $(libdir)/libosmcomp-$(VERSION).so + if test -L $(DESTDIR)/$(libdir)/libosmcomp-$(VERSION).so; then rm $(DESTDIR)/$(libdir)/libosmcomp-$(VERSION).so; fi; \ + lname=`\ls -l $(DESTDIR)/$(libdir)/libosmcomp.so | awk '{print $$NF}'`; \ + ln -s $$lname $(DESTDIR)/$(libdir)/libosmcomp-$(VERSION).so Index: trunk/src/userspace/management/osm/libvendor/Makefile.am =================================================================== --- trunk/src/userspace/management/osm/libvendor/Makefile.am 2005-09-12 23:10:17.000000000 +0300 +++ trunk/src/userspace/management/osm/libvendor/Makefile.am 2005-11-17 17:16:43.000000000 +0200 @@ -91,6 +91,6 @@ # that it will mark it with an alias... 
# we find the new lib by traversing the links install-exec-hook: - if test -L $(libdir)/libosmvendor-$(VERSION).so; then rm $(libdir)/libosmvendor-$(VERSION).so; fi; \ - lname=`\ls -l $(libdir)/libosmvendor.so | awk '{print $$NF}'`; \ - ln -s $$lname $(libdir)/libosmvendor-$(VERSION).so + if test -L $(DESTDIR)/$(libdir)/libosmvendor-$(VERSION).so; then rm $(DESTDIR)/$(libdir)/libosmvendor-$(VERSION).so; fi; \ + lname=`\ls -l $(DESTDIR)/$(libdir)/libosmvendor.so | awk '{print $$NF}'`; \ + ln -s $$lname $(DESTDIR)/$(libdir)/libosmvendor-$(VERSION).so Index: trunk/src/userspace/management/osm/opensm/Makefile.am =================================================================== --- trunk/src/userspace/management/osm/opensm/Makefile.am 2005-10-25 00:56:34.000000000 +0200 +++ trunk/src/userspace/management/osm/opensm/Makefile.am 2005-11-17 17:17:38.000000000 +0200 @@ -102,6 +102,6 @@ # that it will mark it with an alias... # we find the new lib by traversing the links install-exec-hook: - if test -L $(libdir)/libopensm-$(VERSION).so; then rm $(libdir)/libopensm-$(VERSION).so; fi; \ - lname=`\ls -l $(libdir)/libopensm.so | awk '{print $$NF}'`; \ - ln -s $$lname $(libdir)/libopensm-$(VERSION).so + if test -L $(DESTDIR)/$(libdir)/libopensm-$(VERSION).so; then rm $(DESTDIR)/$(libdir)/libopensm-$(VERSION).so; fi; \ + lname=`\ls -l $(DESTDIR)/$(libdir)/libopensm.so | awk '{print $$NF}'`; \ + ln -s $$lname $(DESTDIR)/$(libdir)/libopensm-$(VERSION).so -- MST
From jlentini at netapp.com Mon Nov 28 07:32:37 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 28 Nov 2005 10:32:37 -0500 (EST) Subject: [openib-general] [udapl]How to build the udapl In-Reply-To: <7b2fa1820511272323t545e8040l258da18642000a77@mail.gmail.com> References: <7b2fa1820511272323t545e8040l258da18642000a77@mail.gmail.com> Message-ID: On Mon, 28 Nov 2005, Ian Jiang wrote: > I am trying to use the udapl. > I have got the source at > https://openib.org/svn/gen2/trunk/src/userspace/dapl/ > but could not find an instruction of building and installation. > Should I start with the > https://openib.org/svn/gen2/trunk/src/userspace/dapl/dapl/udapl/Makefile? > I am using the redhat AS 4 with kernel 2.6.12.5. > > Any suggestion is appriciated! You have the correct makefile. You'll need to have the OpenIB userspace libraries installed and working first. Then you can build uDAPL for OpenIB with the following command: make VERBS=openib From halr at voltaire.com Mon Nov 28 08:10:36 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 28 Nov 2005 11:10:36 -0500 Subject: [openib-general] Re: [PATCH] opensm: respect DESTDIR in install hook In-Reply-To: <20051128135054.GB25206@mellanox.co.il> References: <1133177027.12652.5776.camel@hal.voltaire.com> <20051128135054.GB25206@mellanox.co.il> Message-ID: <1133194236.12652.6723.camel@hal.voltaire.com> On Mon, 2005-11-28 at 08:50, Michael S. Tsirkin wrote: > Make osm makefiles respect DESTDIR. [For some reason, I had to apply this manually.] Thanks. Applied.
From yipeeyipeeyipeeyipee at yahoo.com Mon Nov 28 08:56:55 2005 From: yipeeyipeeyipeeyipee at yahoo.com (yipee) Date: Mon, 28 Nov 2005 16:56:55 +0000 (UTC) Subject: [openib-general] rtr & post_recv Message-ID: Hi, When connecting two reliable-connection qp's should I post receive buffers (using ibv_post_recv()) before modifying the qp's state to rtr (with ibv_modify_qp()) ? Or should I first modify the qp and then post the buffers? I've stumbled across a problem that on some HCAs I must first do ibv_modify_qp() and then do ibv_post_recv() (a) or otherwise the connection never passes data sent with ibv_post_send() (b). a. working ibv_modify_qp(qp, rtr) ibv_post_recv(qp, wr) b. non-working (on some hcas) ibv_post_recv(qp, wr) ibv_modify_qp(qp, rtr) Is this sequence defined anywhere? Or have I found a bug? thanks, y From rolandd at cisco.com Mon Nov 28 09:10:14 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 28 Nov 2005 09:10:14 -0800 Subject: [openib-general] rtr & post_recv In-Reply-To: (yipee's message of "Mon, 28 Nov 2005 16:56:55 +0000 (UTC)") References: Message-ID: <523blg3dhl.fsf@cisco.com> yipee> Hi, When connecting two reliable-connection qp's should I yipee> post receive buffers (using ibv_post_recv()) before yipee> modifying the qp's state to rtr (with ibv_modify_qp()) ? Or yipee> should I first modify the qp and then post the buffers? It should work fine to post receives when the QP is in the INIT state. yipee> I've stumbled across a problem that on some HCAs I must yipee> first do ibv_modify_qp() and then do ibv_post_recv() (a) or yipee> otherwise the connection never passes data sent with yipee> ibv_post_send() (b). What do you mean by "some HCAs"? Can you give details about which HCAs it works with and which it doesn't? - R. 
From rolandd at cisco.com Mon Nov 28 09:10:43 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 28 Nov 2005 09:10:43 -0800 Subject: [openib-general] Re: [PATCH] Allow setting of NodeDescription In-Reply-To: <20051128053528.GA15811@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 28 Nov 2005 07:35:28 +0200") References: <52slth39n2.fsf@cisco.com> <20051128053528.GA15811@mellanox.co.il> Message-ID: <52y8381ywc.fsf@cisco.com> Michael> What about simply initializing this to some string like Michael> "uninitialized"? This indicates in a pretty clear way Michael> that user needs to recheck the node later. That's kind of sucky if userspace never gets around to setting the node description :) - R. From mst at mellanox.co.il Mon Nov 28 09:36:05 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 28 Nov 2005 19:36:05 +0200 Subject: [openib-general] Re: [PATCH] Allow setting of NodeDescription In-Reply-To: <52y8381ywc.fsf@cisco.com> References: <52y8381ywc.fsf@cisco.com> Message-ID: <20051128173605.GH25751@mellanox.co.il> Quoting Roland Dreier : > Subject: Re: [PATCH] Allow setting of NodeDescription > > Michael> What about simply initializing this to some string like > Michael> "uninitialized"? This indicates in a pretty clear way > Michael> that user needs to recheck the node later. > > That's kind of sucky if userspace never gets around to setting the > node description :) So we know the system is stuck. Looks like a feature :) -- MST From mst at mellanox.co.il Mon Nov 28 09:40:12 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 28 Nov 2005 19:40:12 +0200 Subject: [openib-general] [PATCH] user_mad: fix mad header parsing Message-ID: <20051128174012.GI25751@mellanox.co.il> Looks like there's a bug in user_mad.c. Does the following make sense? --- ib_umad_write in user_mad.c is looking at rmpp_hdr field in MAD before checking that the MAD actually has the RMPP header. 
So for a MAD without RMPP header it looks like we are actually checking a bit inside mkey, or something. Signed-off-by: Michael S. Tsirkin Signed-off-by: Jack Morgenstein Index: linux-kernel/drivers/infiniband/core/user_mad.c =================================================================== --- linux-kernel/drivers/infiniband/core/user_mad.c (revision 4158) +++ linux-kernel/drivers/infiniband/core/user_mad.c (working copy) @@ -338,7 +340,7 @@ static ssize_t ib_umad_write(struct file u8 method; __be64 *tid; int ret, length, hdr_len, copy_offset; - int rmpp_active = 0; + int rmpp_active, has_rmpp_header; if (count < sizeof (struct ib_user_mad) + IB_MGMT_RMPP_HDR) return -EINVAL; @@ -388,28 +390,31 @@ static ssize_t ib_umad_write(struct file } rmpp_mad = (struct ib_rmpp_mad *) packet->mad.data; - if (ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr) & IB_MGMT_RMPP_FLAG_ACTIVE) { - /* RMPP active */ - if (!agent->rmpp_version) { - ret = -EINVAL; - goto err_ah; - } - - /* Validate that the management class can support RMPP */ - if (rmpp_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_ADM) { - hdr_len = IB_MGMT_SA_HDR; - } else if ((rmpp_mad->mad_hdr.mgmt_class >= IB_MGMT_CLASS_VENDOR_RANGE2_START) && - (rmpp_mad->mad_hdr.mgmt_class <= IB_MGMT_CLASS_VENDOR_RANGE2_END)) { - hdr_len = IB_MGMT_VENDOR_HDR; - } else { - ret = -EINVAL; - goto err_ah; - } - rmpp_active = 1; + if (rmpp_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_ADM) { + hdr_len = IB_MGMT_SA_HDR; copy_offset = IB_MGMT_RMPP_HDR; + has_rmpp_header = 1; + } else if (rmpp_mad->mad_hdr.mgmt_class >= IB_MGMT_CLASS_VENDOR_RANGE2_START && + rmpp_mad->mad_hdr.mgmt_class <= IB_MGMT_CLASS_VENDOR_RANGE2_END) { + hdr_len = IB_MGMT_VENDOR_HDR; + copy_offset = IB_MGMT_RMPP_HDR; + has_rmpp_header = 1; } else { hdr_len = IB_MGMT_MAD_HDR; copy_offset = IB_MGMT_MAD_HDR; + has_rmpp_header = 0; + } + + if (has_rmpp_header) + rmpp_active = ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr) & + IB_MGMT_RMPP_FLAG_ACTIVE; + else + rmpp_active = 0; + + /* 
Validate that the management class can support RMPP */ + if (rmpp_active && !agent->rmpp_version) { + ret = -EINVAL; + goto err_ah; } packet->msg = ib_create_send_mad(agent, -- MST From rolandd at cisco.com Mon Nov 28 09:58:23 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 28 Nov 2005 09:58:23 -0800 Subject: [openib-general] Re: [PATCH] user_mad: fix mad header parsing In-Reply-To: <20051128174012.GI25751@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 28 Nov 2005 19:40:12 +0200") References: <20051128174012.GI25751@mellanox.co.il> Message-ID: <52u0dw1wow.fsf@cisco.com> > Looks like there's a bug in user_mad.c. > Does the following make sense? Seems right to me. Sean? > Signed-off-by: Michael S. Tsirkin > Signed-off-by: Jack Morgenstein I think you are passing on a patch that Jack wrote. If that is true, then these lines are in the wrong order -- you always add your Signed-off-by: line at the bottom of the chain, so I think it should really be: Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. Tsirkin and then when I send it upstream, it becomes Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. Tsirkin Signed-off-by: Roland Dreier and so on. - R. From rolandd at cisco.com Mon Nov 28 09:59:17 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 28 Nov 2005 09:59:17 -0800 Subject: [openib-general] Re: [2.6 patch] drivers/infiniband/core/mad.c: fix a NULL pointer dereference In-Reply-To: <20051128002523.GA31395@stusta.de> (Adrian Bunk's message of "Mon, 28 Nov 2005 01:25:23 +0100") References: <20051126233736.GE3988@stusta.de> <52irud4pki.fsf@cisco.com> <20051128002523.GA31395@stusta.de> Message-ID: <52psok1wne.fsf@cisco.com> Adrian> Can you Cc me when forwarding it to Linus? Looks like it went into Linus's tree directly from you (which is fine). Adrian> After it's in Linus' tree, Greg will accept it for the Adrian> 2.6.14 stable tree. Is this really important enough for the stable tree? - R. 
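The idea behind the user_mad fix above (decide from the management class whether an RMPP header exists before looking at the RMPP flags) can be sketched outside the kernel. This is a minimal stand-alone model, not the kernel code: struct toy_mad and the toy_ functions are hypothetical, and the TOY_ constants are assumed to mirror the kernel's IB_MGMT_CLASS_SUBN_ADM, IB_MGMT_CLASS_VENDOR_RANGE2_START/END and IB_MGMT_RMPP_FLAG_ACTIVE values, which should be checked against ib_mad.h.

```c
#include <stdint.h>

/* Assumed to mirror the kernel's IB_MGMT_* values; verify against ib_mad.h. */
#define TOY_CLASS_SUBN_ADM       0x03
#define TOY_CLASS_VENDOR2_START  0x30
#define TOY_CLASS_VENDOR2_END    0x4f
#define TOY_RMPP_FLAG_ACTIVE     (1 << 1)

/*
 * Hypothetical, simplified view of a MAD: only the management class and
 * the byte that would hold the RMPP flags if an RMPP header is present.
 */
struct toy_mad {
    uint8_t mgmt_class;
    uint8_t rmpp_flags_byte;  /* for other classes this byte is unrelated data */
};

/* Only SA and vendor range 2 classes are treated as carrying an RMPP header. */
static int toy_has_rmpp_header(const struct toy_mad *mad)
{
    return mad->mgmt_class == TOY_CLASS_SUBN_ADM ||
           (mad->mgmt_class >= TOY_CLASS_VENDOR2_START &&
            mad->mgmt_class <= TOY_CLASS_VENDOR2_END);
}

/* Look at the flags only when the class says the header is really there. */
static int toy_rmpp_active(const struct toy_mad *mad)
{
    if (!toy_has_rmpp_header(mad))
        return 0;
    return (mad->rmpp_flags_byte & TOY_RMPP_FLAG_ACTIVE) != 0;
}
```

With this ordering, a MAD of a class without an RMPP header is never reported as RMPP-active, whatever bits happen to sit where the flags would be. That is the case the patch description calls "checking a bit inside mkey, or something".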
From mst at mellanox.co.il Mon Nov 28 10:12:49 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 28 Nov 2005 20:12:49 +0200 Subject: [openib-general] Re: [PATCH] user_mad: fix mad header parsing In-Reply-To: <52u0dw1wow.fsf@cisco.com> References: <52u0dw1wow.fsf@cisco.com> Message-ID: <20051128181248.GJ25751@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] user_mad: fix mad header parsing > > > Looks like there's a bug in user_mad.c. > > Does the following make sense? > > Seems right to me. Sean? > > > Signed-off-by: Michael S. Tsirkin > > Signed-off-by: Jack Morgenstein > > I think you are passing on a patch that Jack wrote. Yes. > If that is true, > then these lines are in the wrong order -- you always add your > Signed-off-by: line at the bottom of the chain, so I think it should > really be: > > Signed-off-by: Jack Morgenstein > Signed-off-by: Michael S. Tsirkin > > and then when I send it upstream, it becomes > > Signed-off-by: Jack Morgenstein > Signed-off-by: Michael S. Tsirkin > Signed-off-by: Roland Dreier > > and so on. > > - R. > OK, good to know. -- MST From mshefty at ichips.intel.com Mon Nov 28 10:11:51 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 28 Nov 2005 10:11:51 -0800 Subject: [openib-general] Re: [PATCH] user_mad: fix mad header parsing In-Reply-To: <52u0dw1wow.fsf@cisco.com> References: <20051128174012.GI25751@mellanox.co.il> <52u0dw1wow.fsf@cisco.com> Message-ID: <438B4867.2000908@ichips.intel.com> Roland Dreier wrote: > > Looks like there's a bug in user_mad.c. > > Does the following make sense? > > Seems right to me. Sean? This looks right. 
- Sean From halr at voltaire.com Mon Nov 28 10:05:25 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 28 Nov 2005 13:05:25 -0500 Subject: [openib-general] [PATCH] user_mad: fix mad header parsing In-Reply-To: <20051128174012.GI25751@mellanox.co.il> References: <20051128174012.GI25751@mellanox.co.il> Message-ID: <1133201124.12652.7246.camel@hal.voltaire.com> On Mon, 2005-11-28 at 12:40, Michael S. Tsirkin wrote: > Looks like there's a bug in user_mad.c. > Does the following make sense? Yes. > > --- > > ib_umad_write in user_mad.c is looking at rmpp_hdr field in MAD before > checking that the MAD actually has the RMPP header. > So for a MAD without RMPP header it looks like we are actually > checking a bit inside mkey, or something. MKey for SMPs; something else for GMPs incapable of RMPP. -- Hal From rolandd at cisco.com Mon Nov 28 11:26:41 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 28 Nov 2005 11:26:41 -0800 Subject: [openib-general] Re: [PATCH] mthca: missing cleanup in reset state In-Reply-To: <20051121153913.GC20871@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 21 Nov 2005 17:39:13 +0200") References: <20051121153913.GC20871@mellanox.co.il> Message-ID: <521x101slq.fsf@cisco.com> Michael> The following patch fixes system hangs I am sometimes Michael> seeing when ipoib is brought down and back up. Thanks, applied and queued for 2.6.15. Michael> Should something similiar be done in libmthca? I think so -- like this? 
--- libmthca/src/qp.c (revision 4128) +++ libmthca/src/qp.c (working copy) @@ -65,6 +65,21 @@ static void *get_send_wqe(struct mthca_q return qp->buf + qp->send_wqe_offset + (n << qp->sq.wqe_shift); } +void mthca_init_qp_indices(struct mthca_qp *qp) +{ + qp->sq.next_ind = 0; + qp->sq.last_comp = qp->sq.max - 1; + qp->sq.head = 0; + qp->sq.tail = 0; + qp->sq.last = get_send_wqe(qp, qp->sq.max - 1); + + qp->rq.next_ind = 0; + qp->rq.last_comp = qp->rq.max - 1; + qp->rq.head = 0; + qp->rq.tail = 0; + qp->rq.last = get_recv_wqe(qp, qp->rq.max - 1); +} + static inline int wq_overflow(struct mthca_wq *wq, int nreq, struct mthca_cq *cq) { unsigned cur; --- libmthca/src/verbs.c (revision 4128) +++ libmthca/src/verbs.c (working copy) @@ -394,19 +394,6 @@ int mthca_destroy_srq(struct ibv_srq *sr return 0; } -static void mthca_init_qp_indices(struct mthca_qp *qp) -{ - qp->sq.next_ind = 0; - qp->sq.last_comp = qp->sq.max - 1; - qp->sq.head = 0; - qp->sq.tail = 0; - - qp->rq.next_ind = 0; - qp->rq.last_comp = qp->rq.max - 1; - qp->rq.head = 0; - qp->rq.tail = 0; -} - struct ibv_qp *mthca_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr) { struct mthca_create_qp cmd; @@ -427,11 +414,12 @@ struct ibv_qp *mthca_create_qp(struct ib qp->sq.max = align_queue_size(pd->context, attr->cap.max_send_wr, 0); qp->rq.max = align_queue_size(pd->context, attr->cap.max_recv_wr, 0); - mthca_init_qp_indices(qp); if (mthca_alloc_qp_buf(pd, &attr->cap, attr->qp_type, qp)) goto err; + mthca_init_qp_indices(qp); + if (pthread_spin_init(&qp->sq.lock, PTHREAD_PROCESS_PRIVATE) || pthread_spin_init(&qp->rq.lock, PTHREAD_PROCESS_PRIVATE)) goto err_free; --- libmthca/src/mthca.h (revision 4128) +++ libmthca/src/mthca.h (working copy) @@ -310,6 +310,7 @@ extern struct ibv_qp *mthca_create_qp(st extern int mthca_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, enum ibv_qp_attr_mask attr_mask); extern int mthca_destroy_qp(struct ibv_qp *qp); +extern void mthca_init_qp_indices(struct mthca_qp 
*qp); extern int mthca_tavor_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, struct ibv_send_wr **bad_wr); extern int mthca_tavor_post_recv(struct ibv_qp *ibqp, struct ibv_recv_wr *wr, --- libmthca/ChangeLog (revision 4128) +++ libmthca/ChangeLog (working copy) @@ -1,3 +1,9 @@ +2005-11-28 Roland Dreier + + * src/qp.c (mthca_init_qp_indices): Set qp->sq.last and + qp->rq.last so that QP is fully reset when the indices are + reinited on transition to RESET state. + 2005-11-09 Roland Dreier * src/srq.c (mthca_tavor_post_srq_recv), src/qp.c From yipeeyipeeyipeeyipee at yahoo.com Mon Nov 28 11:48:45 2005 From: yipeeyipeeyipeeyipee at yahoo.com (yipee) Date: Mon, 28 Nov 2005 19:48:45 +0000 (UTC) Subject: [openib-general] Re: rtr & post_recv References: <523blg3dhl.fsf@cisco.com> Message-ID: Roland Dreier <rolandd at cisco.com> writes: [snip] > It should work fine to post receives when the QP is in the INIT state. That's what I thought too. I even think that it's preferable to do post_recv() before the state change to RTR, because the qp might (try to) receive packets a moment after the state change. [snip] > What do you mean by "some HCAs"? On DDR memfree HCAs everything works as expected (i.e. the order of the two calls doesn't matter). When running this code on older SDR HCAs the problem appears. All links use a DDR switch. > Can you give details about which HCAs it works with and which it doesn't? These HCAs are Mellanox HCAs (i.e. not Topspin or Voltaire). I don't remember the exact revision of the HCAs and I'm not at work now; tomorrow I can post the output from 'lspci' or the firmware revisions here. I do remember that the DDR guid has 0x0400 (or 0x0040) in its second word, one SDR guid has 0x0200 and the other SDR guid has 0x0000 in its second word. thanks, x From mst at mellanox.co.il Mon Nov 28 13:01:35 2005 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Mon, 28 Nov 2005 23:01:35 +0200 Subject: [openib-general] Re: [PATCH] mthca: missing cleanup in reset state In-Reply-To: <521x101slq.fsf@cisco.com> References: <521x101slq.fsf@cisco.com> Message-ID: <20051128210135.GB27028@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] mthca: missing cleanup in reset state > > Michael> The following patch fixes system hangs I am sometimes > Michael> seeing when ipoib is brought down and back up. > > Thanks, applied and queued for 2.6.15. > > Michael> Should something similiar be done in libmthca? > > I think so -- like this? Looks reasonable. -- MST From mst at mellanox.co.il Mon Nov 28 13:03:16 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 28 Nov 2005 23:03:16 +0200 Subject: [openib-general] Re: [PATCH] user_mad: fix mad header parsing In-Reply-To: <438B4867.2000908@ichips.intel.com> References: <438B4867.2000908@ichips.intel.com> Message-ID: <20051128210316.GC27028@mellanox.co.il> Quoting Sean Hefty : > Subject: Re: [openib-general] Re: [PATCH] user_mad: fix mad header parsing > > Roland Dreier wrote: > > > Looks like there's a bug in user_mad.c. > > > Does the following make sense? > > > > Seems right to me. Sean? > > This looks right. I'll commit to trunk then? Roland, IMHO this is 2.6.15 material. -- MST From rolandd at cisco.com Mon Nov 28 13:07:47 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 28 Nov 2005 13:07:47 -0800 Subject: [openib-general] Re: [PATCH] user_mad: fix mad header parsing In-Reply-To: <20051128210316.GC27028@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 28 Nov 2005 23:03:16 +0200") References: <438B4867.2000908@ichips.intel.com> <20051128210316.GC27028@mellanox.co.il> Message-ID: <52wtiszdjw.fsf@cisco.com> Michael> I'll commit to trunk then? Roland, IMHO this is 2.6.15 Michael> material. I've already committed it and queued it for 2.6.15. - R. 
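The libmthca change discussed earlier in the "missing cleanup in reset state" thread makes mthca_init_qp_indices() reset the sq.last/rq.last pointers along with the indices, and calls it only after the QP buffer has been allocated. A toy work queue shows why that call order matters. The sketch below is an illustration under stated assumptions, not libmthca code: struct toy_wq, toy_get_wqe() and toy_wq_reset() are hypothetical names, and the queue is assumed to hold max slots of (1 << wqe_shift) bytes each in one contiguous buffer.

```c
#include <stddef.h>

/*
 * Toy stand-in for a work queue. Field names loosely follow the patch,
 * but the type itself is hypothetical.
 */
struct toy_wq {
    unsigned char *buf;   /* start of the WQE buffer */
    int max;              /* number of WQE slots */
    int wqe_shift;        /* log2 of the slot size in bytes */
    int next_ind;
    int last_comp;
    unsigned head;
    unsigned tail;
    void *last;           /* pointer kept into the buffer; final slot after reset */
};

/* Address of slot n inside the buffer. */
static void *toy_get_wqe(struct toy_wq *wq, int n)
{
    return wq->buf + ((size_t) n << wq->wqe_shift);
}

/*
 * Reset every index and the 'last' pointer together. Because 'last' is
 * computed from buf, this must run after the buffer has been allocated.
 */
static void toy_wq_reset(struct toy_wq *wq)
{
    wq->next_ind  = 0;
    wq->last_comp = wq->max - 1;
    wq->head      = 0;
    wq->tail      = 0;
    wq->last      = toy_get_wqe(wq, wq->max - 1);
}
```

Resetting head and tail while leaving last at a stale slot is the inconsistency the patch removes. In the toy model that would mean skipping the final assignment in toy_wq_reset().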
From rolandd at cisco.com Mon Nov 28 13:19:43 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 28 Nov 2005 13:19:43 -0800 Subject: [openib-general] Re: can i post a send request with 0 bytes with the inline bit enabled? In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3879836@mtlexch01.mtl.com> (Dotan Barak's message of "Mon, 21 Nov 2005 16:03:14 +0200") References: <6AB138A2AB8C8E4A98B9C0C3D52670E3879836@mtlexch01.mtl.com> Message-ID: <52sltgzd00.fsf@cisco.com> Dotan> should the driver handle it? (and post send of 0 bytes with Dotan> inline enabled should generate good completion) or the user Dotan> should know that this scenario is illegal? I guess we might as well fix it. I checked in the following patch. - R. Index: libmthca/src/qp.c =================================================================== --- libmthca/src/qp.c (revision 4182) +++ libmthca/src/qp.c (working copy) @@ -230,27 +230,30 @@ int mthca_tavor_post_send(struct ibv_qp } if (wr->send_flags & IBV_SEND_INLINE) { - struct mthca_inline_seg *seg = wqe; - int s = 0; + if (wr->num_sge) { + struct mthca_inline_seg *seg = wqe; + int s = 0; - wqe += sizeof *seg; - for (i = 0; i < wr->num_sge; ++i) { - struct ibv_sge *sge = &wr->sg_list[i]; + wqe += sizeof *seg; + for (i = 0; i < wr->num_sge; ++i) { + struct ibv_sge *sge = &wr->sg_list[i]; - s += sge->length; + s += sge->length; - if (s > qp->max_inline_data) { - ret = -1; - *bad_wr = wr; - goto out; + if (s > qp->max_inline_data) { + ret = -1; + *bad_wr = wr; + goto out; + } + + memcpy(wqe, (void *) (intptr_t) sge->addr, + sge->length); + wqe += sge->length; } - memcpy(wqe, (void*) (intptr_t) sge->addr, sge->length); - wqe += sge->length; + seg->byte_count = htonl(MTHCA_INLINE_SEG | s); + size += align(s + sizeof *seg, 16) / 16; } - - seg->byte_count = htonl(MTHCA_INLINE_SEG | s); - size += align(s + sizeof *seg, 16) / 16; } else { struct mthca_data_seg *seg; @@ -551,27 +554,30 @@ int mthca_arbel_post_send(struct ibv_qp } if (wr->send_flags & 
IBV_SEND_INLINE) { - struct mthca_inline_seg *seg = wqe; - int s = 0; + if (wr->num_sge) { + struct mthca_inline_seg *seg = wqe; + int s = 0; - wqe += sizeof *seg; - for (i = 0; i < wr->num_sge; ++i) { - struct ibv_sge *sge = &wr->sg_list[i]; + wqe += sizeof *seg; + for (i = 0; i < wr->num_sge; ++i) { + struct ibv_sge *sge = &wr->sg_list[i]; - s += sge->length; + s += sge->length; - if (s > qp->max_inline_data) { - ret = -1; - *bad_wr = wr; - goto out; + if (s > qp->max_inline_data) { + ret = -1; + *bad_wr = wr; + goto out; + } + + memcpy(wqe, (void *) (uintptr_t) sge->addr, + sge->length); + wqe += sge->length; } - memcpy(wqe, (void*) (uintptr_t) sge->addr, sge->length); - wqe += sge->length; + seg->byte_count = htonl(MTHCA_INLINE_SEG | s); + size += align(s + sizeof *seg, 16) / 16; } - - seg->byte_count = htonl(MTHCA_INLINE_SEG | s); - size += align(s + sizeof *seg, 16) / 16; } else { struct mthca_data_seg *seg; Index: libmthca/ChangeLog =================================================================== --- libmthca/ChangeLog (revision 4182) +++ libmthca/ChangeLog (working copy) @@ -3,6 +3,9 @@ * src/qp.c (mthca_init_qp_indices): Set qp->sq.last and qp->rq.last so that QP is fully reset when the indices are reinited on transition to RESET state. + (mthca_tavor_post_send, mthca_arbel_post_send): Don't create an + inline send segment when a work request is posted that has the + inline flag set but no gather entries included. 2005-11-09 Roland Dreier From rolandd at cisco.com Mon Nov 28 15:56:40 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 28 Nov 2005 23:56:40 +0000 Subject: [openib-general] [git patch review 2/2] IB/umad: fix RMPP handling In-Reply-To: <1133222200079-44d9989c7d031b8b@cisco.com> Message-ID: <1133222200079-c74b250d96363b53@cisco.com> ib_umad_write in user_mad.c is looking at rmpp_hdr field in MAD before checking that the MAD actually has the RMPP header. 
So for a MAD without RMPP header it looks like we are actually checking a bit inside M_Key, or something. Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/core/user_mad.c | 41 +++++++++++++++++++----------------- 1 files changed, 22 insertions(+), 19 deletions(-) applies-to: 918111360e352d128126bb338227ec4fb6e8afbc bf6d9e23a36c8a01bf6fbb945387d8ca3870ff71 diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c index e73f81c..eb7f525 100644 --- a/drivers/infiniband/core/user_mad.c +++ b/drivers/infiniband/core/user_mad.c @@ -310,7 +310,7 @@ static ssize_t ib_umad_write(struct file u8 method; __be64 *tid; int ret, length, hdr_len, copy_offset; - int rmpp_active = 0; + int rmpp_active, has_rmpp_header; if (count < sizeof (struct ib_user_mad) + IB_MGMT_RMPP_HDR) return -EINVAL; @@ -360,28 +360,31 @@ static ssize_t ib_umad_write(struct file } rmpp_mad = (struct ib_rmpp_mad *) packet->mad.data; - if (ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr) & IB_MGMT_RMPP_FLAG_ACTIVE) { - /* RMPP active */ - if (!agent->rmpp_version) { - ret = -EINVAL; - goto err_ah; - } - - /* Validate that the management class can support RMPP */ - if (rmpp_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_ADM) { - hdr_len = IB_MGMT_SA_HDR; - } else if ((rmpp_mad->mad_hdr.mgmt_class >= IB_MGMT_CLASS_VENDOR_RANGE2_START) && - (rmpp_mad->mad_hdr.mgmt_class <= IB_MGMT_CLASS_VENDOR_RANGE2_END)) { - hdr_len = IB_MGMT_VENDOR_HDR; - } else { - ret = -EINVAL; - goto err_ah; - } - rmpp_active = 1; + if (rmpp_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_ADM) { + hdr_len = IB_MGMT_SA_HDR; copy_offset = IB_MGMT_RMPP_HDR; + has_rmpp_header = 1; + } else if (rmpp_mad->mad_hdr.mgmt_class >= IB_MGMT_CLASS_VENDOR_RANGE2_START && + rmpp_mad->mad_hdr.mgmt_class <= IB_MGMT_CLASS_VENDOR_RANGE2_END) { + hdr_len = IB_MGMT_VENDOR_HDR; + copy_offset = IB_MGMT_RMPP_HDR; + has_rmpp_header = 1; } else { hdr_len = IB_MGMT_MAD_HDR; 
copy_offset = IB_MGMT_MAD_HDR; + has_rmpp_header = 0; + } + + if (has_rmpp_header) + rmpp_active = ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr) & + IB_MGMT_RMPP_FLAG_ACTIVE; + else + rmpp_active = 0; + + /* Validate that the management class can support RMPP */ + if (rmpp_active && !agent->rmpp_version) { + ret = -EINVAL; + goto err_ah; } packet->msg = ib_create_send_mad(agent, --- 0.99.9k From rolandd at cisco.com Mon Nov 28 15:56:40 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 28 Nov 2005 23:56:40 +0000 Subject: [openib-general] [git patch review 1/2] IB/mthca: reset QP's last pointers when transitioning to reset state Message-ID: <1133222200079-44d9989c7d031b8b@cisco.com> last pointer is not updated when QP is modified to reset state. This causes data corruption if WQEs are already posted on the queue. Signed-off-by: Michael S. Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_qp.c | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) applies-to: 1e8504d2a91579756c89ef2d65ebd526f973cde8 187a25863fe014486ee834164776b2a587d6934d diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index dd4e133..f9c8eb9 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -871,7 +871,10 @@ int mthca_modify_qp(struct ib_qp *ibqp, qp->ibqp.srq ? to_msrq(qp->ibqp.srq) : NULL); mthca_wq_init(&qp->sq); + qp->sq.last = get_send_wqe(qp, qp->sq.max - 1); + mthca_wq_init(&qp->rq); + qp->rq.last = get_recv_wqe(qp, qp->rq.max - 1); if (mthca_is_memfree(dev)) { *qp->sq.db = 0; --- 0.99.9k From rolandd at cisco.com Mon Nov 28 20:28:49 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 28 Nov 2005 20:28:49 -0800 Subject: [openib-general] [PATCH] ehca: remove dead code Message-ID: <52zmnoxeke.fsf@cisco.com> ib2ehca_mask() is not used anywhere that I can find. 
Signed-off-by: Roland Dreier --- Index: infiniband/hw/ehca/ehca_qp.c =================================================================== --- infiniband/hw/ehca/ehca_qp.c (revision 4128) +++ infiniband/hw/ehca/ehca_qp.c (working copy) @@ -490,87 +490,6 @@ static inline int ibqptype2servicetype(e } } -/** @brief returns ehca bit mask corresponding to given ib attr mask - * as parameter of modify qp - */ -static inline u64 ib2ehca_mask(enum ib_qp_attr_mask attr_mask) -{ - u64 update_mask = 0; - if (attr_mask & IB_QP_PKEY_INDEX) { - update_mask |= EHCA_BMASK_SET(MQPCB_MASK_PRIM_P_KEY_IDX, 1); - } - if (attr_mask & IB_QP_PORT) { - update_mask |= EHCA_BMASK_SET(MQPCB_MASK_PRIM_PHYS_PORT, 1); - } - if (attr_mask & IB_QP_QKEY) { - update_mask |= EHCA_BMASK_SET(MQPCB_MASK_QKEY, 1); - } - if (attr_mask & IB_QP_AV) { - update_mask |= EHCA_BMASK_SET(MQPCB_MASK_DLID, 1); - update_mask |= EHCA_BMASK_SET(MQPCB_MASK_SOURCE_PATH_BITS, 1); - update_mask |= EHCA_BMASK_SET(MQPCB_MASK_SERVICE_LEVEL, 1); - update_mask |= EHCA_BMASK_SET(MQPCB_MASK_MAX_STATIC_RATE, 1); - update_mask |= EHCA_BMASK_SET(MQPCB_MASK_SEND_GRH_FLAG, 1); - update_mask |= EHCA_BMASK_SET(MQPCB_MASK_SOURCE_GID_IDX, 1); - update_mask |= EHCA_BMASK_SET(MQPCB_MASK_DEST_GID, 1); - } - - if (attr_mask & IB_QP_PATH_MTU) { - update_mask |= EHCA_BMASK_SET(MQPCB_MASK_PATH_MTU, 1); - } - if (attr_mask & IB_QP_TIMEOUT) { - update_mask |= EHCA_BMASK_SET(MQPCB_MASK_TIMEOUT, 1); - } - if (attr_mask & IB_QP_RETRY_CNT) { - update_mask |= EHCA_BMASK_SET(MQPCB_MASK_RETRY_COUNT, 1); - } - if (attr_mask & IB_QP_RNR_RETRY) { - update_mask |= EHCA_BMASK_SET(MQPCB_MASK_RNR_RETRY_COUNT, 1); - } - if (attr_mask & IB_QP_RQ_PSN) { - update_mask |= EHCA_BMASK_SET(MQPCB_MASK_RECEIVE_PSN, 1); - } - if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC) { - update_mask |= - EHCA_BMASK_SET - (MQPCB_MASK_RDMA_ATOMIC_OUTST_DEST_QP, 1); - } - if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC) { - update_mask |= - EHCA_BMASK_SET(MQPCB_MASK_RDMA_NR_ATOMIC_RESP_RES, 1); - } - 
if (attr_mask & IB_QP_ALT_PATH) { - update_mask |= EHCA_BMASK_SET(MQPCB_MASK_DLID, 1); - update_mask |= EHCA_BMASK_SET(MQPCB_MASK_SOURCE_PATH_BITS, 1); - update_mask |= EHCA_BMASK_SET(MQPCB_MASK_SERVICE_LEVEL, 1); - update_mask |= EHCA_BMASK_SET(MQPCB_MASK_MAX_STATIC_RATE, 1); - update_mask |= EHCA_BMASK_SET(MQPCB_MASK_SEND_GRH_FLAG, 1); - update_mask |= EHCA_BMASK_SET(MQPCB_MASK_SOURCE_GID_IDX, 1); - update_mask |= EHCA_BMASK_SET(MQPCB_MASK_DEST_GID, 1); - } - if (attr_mask & IB_QP_MIN_RNR_TIMER) { - update_mask |= - EHCA_BMASK_SET(MQPCB_MASK_MIN_RNR_NAK_TIMER_FIELD, 1); - } - if (attr_mask & IB_QP_SQ_PSN) { - update_mask |= EHCA_BMASK_SET(MQPCB_MASK_SEND_PSN, 1); - } - if (attr_mask & IB_QP_DEST_QPN) { - update_mask |= EHCA_BMASK_SET(MQPCB_MASK_DEST_QP_NR, 1); - } - if (attr_mask & IB_QP_PATH_MIG_STATE) { - update_mask |= - EHCA_BMASK_SET(MQPCB_MASK_PATH_MIGRATION_STATE, 1); - } - if (attr_mask & IB_QP_CAP) { - update_mask |= - EHCA_BMASK_SET(MQPCB_MASK_MAX_NR_OUTST_SEND_WR, 1); - update_mask |= - EHCA_BMASK_SET(MQPCB_MASK_MAX_NR_OUTST_RECV_WR, 1); - } - return update_mask; -} - /* init_qp_queues - Initializes/constructs r/squeue and registers queue pages. * returns 0 if successful, * -EXXXX if not From rolandd at cisco.com Mon Nov 28 21:14:58 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 28 Nov 2005 21:14:58 -0800 Subject: [openib-general] Re: [PATCH 1 of 2] ipoib: pass all of multicast.c through ipoib_workqueue In-Reply-To: <20051121163458.GE20871@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 21 Nov 2005 18:34:58 +0200") References: <20051121163458.GE20871@mellanox.co.il> Message-ID: <52sltgxcfh.fsf@cisco.com> [resending, forgot to reply to all...] I started reviewing these patches, and I found in my testing I also needed the following patch for your stress test to be stable. Does this seem correct and necessary? Thanks, Roland --- It's possible that IPoIB will issue multiple SA queries for the same path struct. 
Therefore a path struct's completion needs to be initialized for each query rather than only once when the struct is allocated, or else we might not wait long enough for later queries to finish and free the path struct too soon. Signed-off-by: Roland Dreier --- --- infiniband/ulp/ipoib/ipoib_main.c (revision 4180) +++ infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -428,7 +428,6 @@ static struct ipoib_path *path_rec_creat skb_queue_head_init(&path->queue); INIT_LIST_HEAD(&path->neigh_list); - init_completion(&path->done); memcpy(path->pathrec.dgid.raw, gid->raw, sizeof (union ib_gid)); path->pathrec.sgid = priv->local_gid; @@ -446,6 +445,8 @@ static int path_rec_start(struct net_dev ipoib_dbg(priv, "Start path record lookup for " IPOIB_GID_FMT "\n", IPOIB_GID_ARG(path->pathrec.dgid)); + init_completion(&path->done); + path->query_id = ib_sa_path_rec_get(priv->ca, priv->port, &path->pathrec, From rolandd at cisco.com Mon Nov 28 21:26:27 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 28 Nov 2005 21:26:27 -0800 Subject: [openib-general] [PATCH] iSER: add dma-mapping.h include Message-ID: <52oe44xbwc.fsf@cisco.com> iSER uses enum dma_data_direction and other DMA API things without including . Therefore, the compilation fails for architectures (eg sparc64) where that include does not get pulled in implicitly because of other includes. Signed-off-by: Roland Dreier --- infiniband/ulp/iser/iser.h (revision 4186) +++ infiniband/ulp/iser/iser.h (working copy) @@ -1,5 +1,6 @@ /* * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2005 Cisco Systems. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -38,6 +39,7 @@ #include #include #include +#include #include "iser_api.h" #include "iser_header.h" From mst at mellanox.co.il Tue Nov 29 02:30:42 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 29 Nov 2005 12:30:42 +0200 Subject: [openib-general] [PATCH] mthca: fix posting long send lists on arbel Message-ID: <20051129103041.GT25751@mellanox.co.il> mthca: fix posting send work request lists of length >= 255 on Arbel. Signed-off-by: Michael S. Tsirkin Index: linux-kernel/drivers/infiniband/hw/mthca/mthca_wqe.h =================================================================== --- linux-kernel/drivers/infiniband/hw/mthca/mthca_wqe.h (revision 4088) +++ linux-kernel/drivers/infiniband/hw/mthca/mthca_wqe.h (working copy) @@ -50,7 +50,8 @@ enum { enum { MTHCA_INVAL_LKEY = 0x100, - MTHCA_TAVOR_MAX_WQES_PER_RECV_DB = 256 + MTHCA_TAVOR_MAX_WQES_PER_RECV_DB = 256, + MTHCA_ARBEL_MAX_WQES_PER_SEND_DB = 255 }; struct mthca_next_seg { Index: linux-kernel/drivers/infiniband/hw/mthca/mthca_qp.c =================================================================== --- linux-kernel/drivers/infiniband/hw/mthca/mthca_qp.c (revision 4088) +++ linux-kernel/drivers/infiniband/hw/mthca/mthca_qp.c (working copy) @@ -1836,6 +1881,34 @@ int mthca_arbel_post_send(struct ib_qp * ind = qp->sq.head & (qp->sq.max - 1); for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (unlikely(nreq == MTHCA_ARBEL_MAX_WQES_PER_SEND_DB)) { + __be32 doorbell[2]; + nreq = 0; + doorbell[0] = cpu_to_be32((MTHCA_ARBEL_MAX_WQES_PER_SEND_DB << 24) | + ((qp->sq.head & 0xffff) << 8) | + f0 | op0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0); + size0 = 0; + + qp->sq.head += MTHCA_ARBEL_MAX_WQES_PER_SEND_DB; + + /* + * Make sure that descriptors are written before + * doorbell record. + */ + wmb(); + *qp->sq.db = cpu_to_be32(qp->sq.head & 0xffff); + + /* + * Make sure doorbell record is written before we + * write MMIO send doorbell. 
+ */ + wmb(); + mthca_write64(doorbell, + dev->kar + MTHCA_SEND_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + if (mthca_wq_overflow(&qp->sq, nreq, qp->ibqp.send_cq)) { mthca_err(dev, "SQ %06x full (%u head, %u tail," " %d max, %d nreq)\n", qp->qpn, -- MST From mst at mellanox.co.il Tue Nov 29 02:30:59 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 29 Nov 2005 12:30:59 +0200 Subject: [openib-general] [PATCH] libmthca: fix posting long send lists on arbel Message-ID: <20051129103059.GU25751@mellanox.co.il> libmthca: fix posting send work request lists of length >= 255 on Arbel. Signed-off-by: Michael S. Tsirkin Index: trunk/src/userspace/libmthca/src/qp.c =================================================================== --- trunk/src/userspace/libmthca/src/qp.c (revision 4126) +++ trunk/src/userspace/libmthca/src/qp.c (working copy) @@ -422,6 +422,31 @@ int mthca_arbel_post_send(struct ibv_qp ind = qp->sq.head & (qp->sq.max - 1); for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (nreq == MTHCA_ARBEL_MAX_WQES_PER_SEND_DB) { + uint32_t doorbell[2]; + nreq = 0; + doorbell[0] = htonl((MTHCA_ARBEL_MAX_WQES_PER_SEND_DB << 24) | + ((qp->sq.head & 0xffff) << 8) | f0 | op0); + doorbell[1] = htonl((ibqp->qp_num << 8) | size0); + size0 = 0; + + qp->sq.head += MTHCA_ARBEL_MAX_WQES_PER_SEND_DB; + + /* + * Make sure that descriptors are written before + * doorbell record. + */ + mb(); + *qp->sq.db = htonl(qp->sq.head & 0xffff); + + /* + * Make sure doorbell record is written before we + * write MMIO send doorbell. 
+ */ + mb(); + mthca_write64(doorbell, to_mctx(ibqp->context), MTHCA_SEND_DOORBELL); + } + if (wq_overflow(&qp->sq, nreq, to_mcq(qp->ibv_qp.send_cq))) { ret = -1; *bad_wr = wr; Index: trunk/src/userspace/libmthca/src/wqe.h =================================================================== --- trunk/src/userspace/libmthca/src/wqe.h (revision 4126) +++ trunk/src/userspace/libmthca/src/wqe.h (working copy) @@ -55,7 +55,8 @@ enum { enum { MTHCA_INVAL_LKEY = 0x100, - MTHCA_TAVOR_MAX_WQES_PER_RECV_DB = 256 + MTHCA_TAVOR_MAX_WQES_PER_RECV_DB = 256, + MTHCA_ARBEL_MAX_WQES_PER_SEND_DB = 255 }; struct mthca_next_seg { -- MST From mst at mellanox.co.il Tue Nov 29 02:31:32 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 29 Nov 2005 12:31:32 +0200 Subject: [openib-general] [PATCH] libmthca: fix posting long receive lists on tavor Message-ID: <20051129103132.GV25751@mellanox.co.il> libmthca: fix posting receive work request lists of length > 255 on Tavor. Signed-off-by: Michael S. Tsirkin Index: trunk/src/userspace/libmthca/src/qp.c =================================================================== --- trunk/src/userspace/libmthca/src/qp.c (revision 4126) +++ trunk/src/userspace/libmthca/src/qp.c (working copy) @@ -327,7 +327,7 @@ int mthca_tavor_post_recv(struct ibv_qp mthca_write64(doorbell, to_mctx(ibqp->context), MTHCA_RECV_DOORBELL); - qp->rq.head += nreq; + qp->rq.head += MTHCA_TAVOR_MAX_WQES_PER_RECV_DB; size0 = 0; } -- MST From mst at mellanox.co.il Tue Nov 29 02:32:52 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 29 Nov 2005 12:32:52 +0200 Subject: [openib-general] outstanding patches: mst patchset Message-ID: <20051129103252.GW25751@mellanox.co.il> Hi! 
FYI the list of patches that I am currently using is under https://openib.org/svn/trunk/contrib/mellanox/patches What is there currently and ready for merging: mcast_cleanup.patch - add missing resource cleanup for mcast object multicast_clean.patch - two patches in one file: IPoIB multicast cleanup mthca_longsend.patch libmthca_longsend.patch libmthca_longrecv.patch - fixes for handling long wqe lists (>= 255 entries) Other things of interest node_desc_updated.patch - Roland's patch for setting node description from userspace, updated to latest bits -- MST From danb at voltaire.com Tue Nov 29 03:23:17 2005 From: danb at voltaire.com (Dan Bar Dov) Date: Tue, 29 Nov 2005 13:23:17 +0200 Subject: [openib-general] RE: [PATCH] iSER: add dma-mapping.h include Message-ID: Applied (r4196). Dan > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Tuesday, November 29, 2005 7:26 AM > To: Dan Bar Dov > Cc: openib-general at openib.org > Subject: [PATCH] iSER: add dma-mapping.h include > > iSER uses enum dma_data_direction and other DMA API things without > including dma-mapping.h. Therefore, the compilation fails for > architectures (e.g. sparc64) where that include does not get pulled in > implicitly because of other includes. > > Signed-off-by: Roland Dreier > > --- infiniband/ulp/iser/iser.h (revision 4186) > +++ infiniband/ulp/iser/iser.h (working copy) > @@ -1,5 +1,6 @@ > /* > * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. > + * Copyright (c) 2005 Cisco Systems. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses.
You may choose to be licensed under the terms > of the GNU > @@ -38,6 +39,7 @@ > #include > #include > #include > +#include > > #include "iser_api.h" > #include "iser_header.h" > From bunk at stusta.de Tue Nov 29 04:30:52 2005 From: bunk at stusta.de (Adrian Bunk) Date: Tue, 29 Nov 2005 13:30:52 +0100 Subject: [openib-general] Re: [2.6 patch] drivers/infiniband/core/mad.c: fix a NULL pointer dereference In-Reply-To: <52psok1wne.fsf@cisco.com> References: <20051126233736.GE3988@stusta.de> <52irud4pki.fsf@cisco.com> <20051128002523.GA31395@stusta.de> <52psok1wne.fsf@cisco.com> Message-ID: <20051129123052.GF31395@stusta.de> On Mon, Nov 28, 2005 at 09:59:17AM -0800, Roland Dreier wrote: > Adrian> Can you Cc me when forwarding it to Linus? > > Looks like it went into Linus's tree directly from you (which is fine). It went through Andrew. > Adrian> After it's in Linus' tree, Greg will accept it for the > Adrian> 2.6.14 stable tree. > > Is this really important enough for the stable tree? You said it fixed a crash for you. Besides this, it's a small and easy to verify change. > - R. cu Adrian -- "Is there not promise of rain?" Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. "Only a promise," Lao Er said. Pearl S. Buck - Dragon Seed From mst at mellanox.co.il Tue Nov 29 04:39:32 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 29 Nov 2005 14:39:32 +0200 Subject: [openib-general] [PATCH] libibverbs: fix make dist Message-ID: <20051129123932.GA25751@mellanox.co.il> Fix EXTRA_DIST: sa-kern-abi.h path is wrong. Signed-off-by: Michael S. 
Tsirkin Index: libibverbs/Makefile.am =================================================================== --- libibverbs/Makefile.am (revision 4198) +++ libibverbs/Makefile.am (working copy) @@ -55,7 +55,7 @@ EXTRA_DIST = include/infiniband/driver.h include/infiniband/kern-abi.h \ include/infiniband/opcode.h include/infiniband/verbs.h src/ibverbs.h \ - include/infiniband/marshall.h include/sa-kern-abi.h include/infiniband/sa.h \ + include/infiniband/marshall.h include/infiniband/sa-kern-abi.h include/infiniband/sa.h \ src/libibverbs.map libibverbs.spec.in $(man_MANS) $(DEBIAN) dist-hook: libibverbs.spec -- MST From yael at mellanox.co.il Tue Nov 29 05:04:27 2005 From: yael at mellanox.co.il (Yael Kalka) Date: 29 Nov 2005 15:04:27 +0200 Subject: [openib-general] [PATCH] Opensm - duplicated guids error message Message-ID: <5zek4zy59g.fsf@mtl066.yok.mtl.com> Currently the error message generated if there are duplicated guids on the subnet is not clear, and is divided into several error messages. The following patch fixes the error message to be clearer. Thanks, Yael Signed-off-by: Yael Kalka Index: opensm/osm_node_info_rcv.c =================================================================== --- opensm/osm_node_info_rcv.c (revision 4198) +++ opensm/osm_node_info_rcv.c (working copy) @@ -133,43 +133,72 @@ __osm_ni_rcv_set_links( { /* Uh oh... + This means that we found 2 nodes with the same guid, + or a 12x link with lane reversal that is not configured correctly.
*/ + char buf[BUF_SIZE]; + char line[BUF_SIZE]; + char dr_new_path[BUF_SIZE]; + char dr_old_path[BUF_SIZE]; + uint32_t i; + osm_dr_path_t *p_path = NULL, *p_old_path = NULL; + + p_physp = osm_node_get_physp_ptr( p_node, port_num ); + sprintf(dr_new_path, "no_path_available"); + if (p_physp) + { + p_path = osm_physp_get_dr_path_ptr( p_physp ); + if ( p_path ) + { + sprintf( dr_new_path, "new path:"); + for (i = 0; i <= p_path->hop_count; i++ ) + { + sprintf( line, "[%X]", p_path->path[i] ); + strcat( dr_new_path, line ); + } + } + } + p_old_neighbor_node = osm_node_get_remote_node( p_node, port_num, &old_neighbor_port_num ); + p_old_physp = osm_node_get_physp_ptr( + p_old_neighbor_node, + old_neighbor_port_num); + sprintf(dr_old_path, "no_path_available"); + if (p_old_physp) + { + p_old_path = osm_physp_get_dr_path_ptr( p_old_physp ); + if ( p_old_path ) + { + sprintf( dr_old_path, "old_path:"); + for (i = 0; i <= p_old_path->hop_count; i++ ) + { + sprintf( line, "[%X]", p_old_path->path[i] ); + strcat( dr_old_path, line ); + } + } + } osm_log( p_rcv->p_log, OSM_LOG_ERROR, "__osm_ni_rcv_set_links: ERR 0D01: " + "Found duplicated guids or 12x link " + "with lane reversal badly configured.\n" "Overriding existing link to:" "node 0x%" PRIx64 ", port number 0x%X connected to:\n" "\t\t\t\told node 0x%" PRIx64 ", " - "port number 0x%X\n" + "port number 0x%X %s\n" "\t\t\t\tnew node 0x%" PRIx64 ", " - "port number 0x%X\n", + "port number 0x%X %s\n", cl_ntoh64( osm_node_get_node_guid( p_node ) ), port_num, cl_ntoh64( osm_node_get_node_guid( p_old_neighbor_node ) ), old_neighbor_port_num , + dr_old_path, cl_ntoh64( p_ni_context->node_guid ), - p_ni_context->port_num + p_ni_context->port_num, + dr_new_path ); - - if (osm_log_is_active(p_rcv->p_log, OSM_LOG_ERROR)) - { - p_physp = osm_node_get_physp_ptr( p_node, port_num ); - if (p_physp) - osm_dump_dr_path(p_rcv->p_log, - osm_physp_get_dr_path_ptr( p_physp ), - OSM_LOG_ERROR); - - p_old_physp = osm_node_get_physp_ptr( - 
p_old_neighbor_node, - old_neighbor_port_num); - if (p_old_physp) - osm_dump_dr_path(p_rcv->p_log, - osm_physp_get_dr_path_ptr( p_old_physp ), - OSM_LOG_ERROR); - } } /* From yipeeyipeeyipeeyipee at yahoo.com Tue Nov 29 03:54:24 2005 From: yipeeyipeeyipeeyipee at yahoo.com (yipee) Date: Tue, 29 Nov 2005 11:54:24 +0000 (UTC) Subject: [openib-general] Re: rtr & post_recv References: <523blg3dhl.fsf@cisco.com> Message-ID: Roland Dreier cisco.com> writes: > What do you mean by "some HCAs"? Can you give details about which > HCAs it works with and which it doesn't? Ok, I have a bit more details about the HCAs. **** First node info: yip1 ] lspci | grep Mell 6:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (rev a0) yip1 ] cat /sys/class/infiniband/mthca0/fw_ver 4.7.0 yip1 ] cat /sys/class/infiniband/mthca0/node_guid 0002:c902:0000:56ac **** Second node info: yip2 ] lspci | grep Mell 06:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (rev a0) yip2 ] cat /sys/class/infiniband/mthca0/fw_ver 4.6.2 yip2 ] cat /sys/class/infiniband/mthca0/node_guid 0002:c902:0020:3ed8 **** Working nodes info: yip3 ] lspci | grep Mell 06:00.0 InfiniBand: Mellanox Technology MT25208 InfiniHost III Ex (rev 20) yip3 ] cat /sys/class/infiniband/mthca0/node_guid 0002:c902:0040:2764 yip3 ] cat /sys/class/infiniband/mthca0/fw_ver 5.1.0 yip4 ] lspci | grep Mell 06:00.0 InfiniBand: Mellanox Technology MT25208 InfiniHost III Ex (rev 20) yip4 ] cat /sys/class/infiniband/mthca0/node_guid 0002:c902:0040:2768 yip4 ] cat /sys/class/infiniband/mthca0/fw_ver 5.1.0 **** Is there any more info that can help you in resolving this issue? I tried reproducing this problem using the user-space cmpost.c. In the function rep_handler() I moved the post_recvs() call before the modify_to_rtr() and the problem reproduced itself. Any idea? 
thanks, y References: <5zek4zy59g.fsf@mtl066.yok.mtl.com> Message-ID: <1133276441.4411.234.camel@hal.voltaire.com> On Tue, 2005-11-29 at 08:04, Yael Kalka wrote: > Currently the error message generated if there are duplicated guids on > the subnet is not clear, and is divided into several error messages. > The following patch fixes the error message to be clearer. Thanks. Applied. From halr at voltaire.com Tue Nov 29 07:11:32 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Nov 2005 10:11:32 -0500 Subject: [openib-general] [PATCH] [TRIVIAL] osm_node_info_rcv.c:osm_ni_rcv_init Move assert before variable used Message-ID: <1133277091.4411.324.camel@hal.voltaire.com> osm_node_info_rcv.c:osm_ni_rcv_init Move assert before the variable is used [I think there are numerous instances of this if OpenSM were to be inspected for this. This could cause issues with --enable-debug (which have been seen but it is unclear to me whether this is the cause or the effect). Should more of this be done?]
Some other cosmetic changes Signed-off-by: Hal Rosenstock Index: osm_node_info_rcv.c =================================================================== --- osm_node_info_rcv.c (revision 4205) +++ osm_node_info_rcv.c (working copy) @@ -986,6 +986,7 @@ osm_ni_rcv_init( IN cl_plock_t* const p_lock ) { ib_api_status_t status = IB_SUCCESS; + OSM_LOG_ENTER( p_log, osm_ni_rcv_init ); osm_ni_rcv_construct( p_rcv ); @@ -1013,12 +1014,12 @@ osm_ni_rcv_process( osm_node_t *p_node; boolean_t process_new_flag = FALSE; + CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( p_rcv->p_log, osm_ni_rcv_process ); - CL_ASSERT( p_rcv ); CL_ASSERT( p_madw ); - p_guid_tbl = &p_rcv->p_subn->node_guid_tbl; p_smp = osm_madw_get_smp_ptr( p_madw ); p_ni = (ib_node_info_t*)ib_smp_get_payload_ptr( p_smp ); @@ -1041,6 +1042,9 @@ osm_ni_rcv_process( osm_dump_smp_dr_path(p_rcv->p_log, p_smp, OSM_LOG_ERROR); goto Exit; } + + p_guid_tbl = &p_rcv->p_subn->node_guid_tbl; + /* Determine if this node has already been discovered, and process accordingly. From danb at voltaire.com Tue Nov 29 07:48:23 2005 From: danb at voltaire.com (Dan Bar Dov) Date: Tue, 29 Nov 2005 17:48:23 +0200 Subject: [openib-general] [ISER] kDAPL no longer used Message-ID: I committed a new version of iSER. This version replaced kDAPL with CMA and ib_verbs. kDAPL is no longer used by iSER. As a part of this rewrite, all memory access was normalized as well. One more major change is forthcoming - directly interfacing ib_iser with open_iscsi. Dan From rolandd at cisco.com Tue Nov 29 08:59:10 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 29 Nov 2005 08:59:10 -0800 Subject: [openib-general] Re: [PATCH] libibverbs: fix make dist In-Reply-To: <20051129123932.GA25751@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 29 Nov 2005 14:39:32 +0200") References: <20051129123932.GA25751@mellanox.co.il> Message-ID: <528xv7xue9.fsf@cisco.com> Thanks, applied. 
From rolandd at cisco.com Tue Nov 29 09:01:00 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 29 Nov 2005 09:01:00 -0800 Subject: [openib-general] Re: [PATCH] libmthca: fix posting long receive lists on tavor In-Reply-To: <20051129103132.GV25751@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 29 Nov 2005 12:31:32 +0200") References: <20051129103132.GV25751@mellanox.co.il> Message-ID: <524q5vxub7.fsf@cisco.com> Thanks, applied. From rolandd at cisco.com Tue Nov 29 09:02:26 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 29 Nov 2005 09:02:26 -0800 Subject: [openib-general] Re: rtr & post_recv In-Reply-To: (yipee's message of "Tue, 29 Nov 2005 11:54:24 +0000 (UTC)") References: <523blg3dhl.fsf@cisco.com> Message-ID: <52zmnnwfod.fsf@cisco.com> Seems like you see the failure with PCIe HCAs in compatibility mode (FW 4.x.y) and not in mem-free mode (FW 5.x.y). I'll investigate here to see what happens on a PCI-X HCA but my first guess is that this is a FW bug. - R. From rolandd at cisco.com Tue Nov 29 09:07:58 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 29 Nov 2005 09:07:58 -0800 Subject: [openib-general] Re: [2.6 patch] drivers/infiniband/core/mad.c: fix a NULL pointer dereference In-Reply-To: <20051129123052.GF31395@stusta.de> (Adrian Bunk's message of "Tue, 29 Nov 2005 13:30:52 +0100") References: <20051126233736.GE3988@stusta.de> <52irud4pki.fsf@cisco.com> <20051128002523.GA31395@stusta.de> <52psok1wne.fsf@cisco.com> <20051129123052.GF31395@stusta.de> Message-ID: <52veybwff5.fsf@cisco.com> Roland> Is this really important enough for the stable tree? Adrian> You said it fixed a crash for you. To trigger the patch, you have to hit the error path, which in practical terms requires buggy code calling into the function. And you also have to either be running with CONFIG_DEBUG_SLAB=y or be extremely unlucky. 
So I don't think anyone who's not developing IB driver code could ever hit the crash, and any developers are going to be running the latest tree anyway. Adrian> Besides this, it's a small and easy to verify change. Sure, I don't mind it going into the stable tree. I'm just not sure it's worth spending everyone's time on it. - R. From rolandd at cisco.com Tue Nov 29 09:40:17 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 29 Nov 2005 09:40:17 -0800 Subject: [openib-general] Re: [PATCH] libmthca: fix posting long send lists on arbel In-Reply-To: <20051129103059.GU25751@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 29 Nov 2005 12:30:59 +0200") References: <20051129103059.GU25751@mellanox.co.il> Message-ID: <52r78zwdxa.fsf@cisco.com> Thanks, applied. Is the limit of 255 send requests per doorbell documented anywhere? I didn't look very hard but I didn't see it in the 0.86 PRM. - R. From rolandd at cisco.com Tue Nov 29 10:31:17 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 29 Nov 2005 10:31:17 -0800 Subject: [openib-general] Re: [PATCH 2 of 2] ipoib: ipoib_multicast.c cleanup In-Reply-To: <20051121163653.GF20871@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 21 Nov 2005 18:36:53 +0200") References: <20051121163653.GF20871@mellanox.co.il> Message-ID: <52ek4zwbka.fsf@cisco.com> I applied this bit as obviously correct: > 2. Make sure mcast->done is initialized to uncompleted value > before we submit a new query, so that its safe to wait on. but I couldn't convince myself that > 1. Make sure mcast->query is set to NULL if, and only if, > no query is outstanding. > 4. Protect all accesses to priv->broadcast, priv->multicast_list, > mcast->query and mcast->done by priv->lock. > I had to change mcast_mutex to ipoib_mcast_lock to make the last bit work. were improvements (and there was no 3. ;) - R. 
From rolandd at cisco.com Tue Nov 29 10:50:28 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 29 Nov 2005 10:50:28 -0800 Subject: [openib-general] [PATCH] iSER: fix cast warning on 32-bit archs Message-ID: <528xv7waob.fsf@cisco.com> Fix the warning iser_verbs.c:692: warning: cast to pointer from integer of different size on 32-bit architectures -- work request ID needs to be cast to long before being cast to a pointer. Signed-off-by: Roland Dreier --- infiniband/ulp/iser/iser_verbs.c (revision 4210) +++ infiniband/ulp/iser/iser_verbs.c (working copy) @@ -1,5 +1,6 @@ /* * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2005 Cisco Systems. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -689,7 +690,7 @@ static void iser_cq_callback(struct ib_c unsigned long xfer_len; while (ib_poll_cq(cq, 1, &wc) == 1) { - p_dto = (struct iser_dto *)wc.wr_id; + p_dto = (struct iser_dto *) (unsigned long) wc.wr_id; if (p_dto == NULL || p_dto->type >= ISER_DTO_PASSIVE) iser_bug("NULL p_dto %p or unexpected type\n", p_dto); From rolandd at cisco.com Tue Nov 29 10:53:40 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 29 Nov 2005 10:53:40 -0800 Subject: [openib-general] Re: [PATCH] ipoib: protect child list access In-Reply-To: <20051117155209.GN20871@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 17 Nov 2005 17:52:09 +0200") References: <20051117155209.GN20871@mellanox.co.il> Message-ID: <524q5vwaiz.fsf@cisco.com> Thanks, applied. From rolandd at cisco.com Tue Nov 29 10:56:28 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 29 Nov 2005 10:56:28 -0800 Subject: [openib-general] Re: error handling in ipoib_open In-Reply-To: <20051119190548.GA22412@mellanox.co.il> (Michael S. 
Tsirkin's message of "Sat, 19 Nov 2005 21:05:48 +0200") References: <20051119190548.GA22412@mellanox.co.il> Message-ID: <52zmnnuvtv.fsf@cisco.com> Michael> Roland, ipoib_open error handling looks strange. dont we Michael> need to e.g. call ipoib_ib_dev_stop if ipoib_ib_dev_up Michael> returns an error? You're right. I just committed this: --- infiniband/ulp/ipoib/ipoib_main.c (revision 4215) +++ infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -94,8 +94,10 @@ int ipoib_open(struct net_device *dev) if (ipoib_ib_dev_open(dev)) return -EINVAL; - if (ipoib_ib_dev_up(dev)) + if (ipoib_ib_dev_up(dev)) { + ipoib_ib_dev_stop(dev); return -EINVAL; + } if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { struct ipoib_dev_priv *cpriv; From mst at mellanox.co.il Tue Nov 29 11:36:08 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 29 Nov 2005 21:36:08 +0200 Subject: [openib-general] Re: [PATCH] libmthca: fix posting long send lists on arbel In-Reply-To: <52r78zwdxa.fsf@cisco.com> References: <52r78zwdxa.fsf@cisco.com> Message-ID: <20051129193608.GA8576@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] libmthca: fix posting long send lists on arbel > > Thanks, applied. > > Is the limit of 255 send requests per doorbell documented anywhere? I > didn't look very hard but I didn't see it in the 0.86 PRM. > > - R. > There are 8 bits and it must be > 0, so ... -- MST From rolandd at cisco.com Tue Nov 29 11:33:57 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 29 Nov 2005 11:33:57 -0800 Subject: [openib-general] Re: [PATCH] mthca: fix posting long send lists on arbel In-Reply-To: <20051129103041.GT25751@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 29 Nov 2005 12:30:42 +0200") References: <20051129103041.GT25751@mellanox.co.il> Message-ID: <52psojuu3e.fsf@cisco.com> Thanks, applied. 
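The 255-entry limit in the long-send-list patches follows from Michael's remark above that the count field has 8 bits and must be nonzero, so one send doorbell can announce at most 255 work requests and a longer list has to be split across several doorbells. A minimal sketch in C of that batching arithmetic, assuming only this limit; the helper names doorbells_needed and batch_size and the constant MAX_WQES_PER_SEND_DB are hypothetical illustrations and are not taken from mthca or libmthca:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constant: largest nonzero value of an 8-bit count field. */
enum { MAX_WQES_PER_SEND_DB = 255 };

/* Number of doorbells needed to announce nwqe work requests
 * when each doorbell can cover at most MAX_WQES_PER_SEND_DB. */
static size_t doorbells_needed(size_t nwqe)
{
	return (nwqe + MAX_WQES_PER_SEND_DB - 1) / MAX_WQES_PER_SEND_DB;
}

/* Number of work requests covered by doorbell number idx (0-based).
 * Every doorbell except possibly the last one carries a full batch. */
static size_t batch_size(size_t nwqe, size_t idx)
{
	size_t done = idx * MAX_WQES_PER_SEND_DB;

	/* idx must refer to a doorbell that is actually rung. */
	assert(done < nwqe);

	size_t left = nwqe - done;
	return left < MAX_WQES_PER_SEND_DB ? left : MAX_WQES_PER_SEND_DB;
}
```

Under these assumptions a list of exactly 255 requests still fits in one doorbell, while 256 requests need a full batch of 255 followed by a second doorbell for the remaining one.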
From rolandd at cisco.com Tue Nov 29 11:37:43 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 29 Nov 2005 11:37:43 -0800 Subject: [openib-general] Re: [PATCH] libmthca: fix posting long send lists on arbel In-Reply-To: <20051129193608.GA8576@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 29 Nov 2005 21:36:08 +0200") References: <52r78zwdxa.fsf@cisco.com> <20051129193608.GA8576@mellanox.co.il> Message-ID: <52fypfutx4.fsf@cisco.com> Michael> There are 8 bits and it must be > 0, so ... I'm probably being dense, but I don't see any 8-bit field that counts descriptors in either the send doorbell descriptor or the MMIO register. What am I missing? - R. From rolandd at cisco.com Tue Nov 29 11:38:52 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 29 Nov 2005 11:38:52 -0800 Subject: [openib-general] Re: [PATCH] libmthca: fix posting long send lists on arbel In-Reply-To: <20051129193608.GA8576@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 29 Nov 2005 21:36:08 +0200") References: <52r78zwdxa.fsf@cisco.com> <20051129193608.GA8576@mellanox.co.il> Message-ID: <52br03utv7.fsf@cisco.com> Never mind, I see the wqe_cnt field in the send doorbell register. - R. From rolandd at cisco.com Tue Nov 29 11:44:02 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 29 Nov 2005 11:44:02 -0800 Subject: [openib-general] Re: [PATCH 1 of 2] ipoib: pass all of multicast.c through ipoib_workqueue In-Reply-To: <20051121163458.GE20871@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 21 Nov 2005 18:34:58 +0200") References: <20051121163458.GE20871@mellanox.co.il> Message-ID: <527jarutml.fsf@cisco.com> Michael> There appear to be several races in IPoIB multicast code: Michael> for example when a MAD event may start the multicast Michael> thread, while ipoib_stop tries to stop it, leaving a Michael> thread running after the device is removed. I need help understanding how moving the work queue helps with this. 
If one context queues work to stop the thread and then another context queues work to start it, can't we still end up with the thread running when the device is removed? - R. From mst at mellanox.co.il Tue Nov 29 12:14:10 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 29 Nov 2005 22:14:10 +0200 Subject: [openib-general] Re: [PATCH 1 of 2] ipoib: pass all of multicast.c through ipoib_workqueue In-Reply-To: <527jarutml.fsf@cisco.com> References: <527jarutml.fsf@cisco.com> Message-ID: <20051129201409.GB8576@mellanox.co.il> Quoting Roland Dreier : > Subject: Re: [PATCH 1 of 2] ipoib: pass all of multicast.c through ipoib_workqueue > > Michael> There appear to be several races in IPoIB multicast code: > Michael> for example when a MAD event may start the multicast > Michael> thread, while ipoib_stop tries to stop it, leaving a > Michael> thread running after the device is removed. > > I need help understanding how moving the work queue helps with this. > If one context queues work to stop the thread and then another context > queues work to start it, can't we still end up with the thread running > when the device is removed? > > - R. > I think we can have: context 1: check ADMIN_UP, it is set so we'll start thread context 2: clear ADMIN_UP context 2: stop thread context 1: start thread -- MST From mst at mellanox.co.il Tue Nov 29 12:15:21 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 29 Nov 2005 22:15:21 +0200 Subject: [openib-general] Re: [PATCH 2 of 2] ipoib: ipoib_multicast.c cleanup In-Reply-To: <52ek4zwbka.fsf@cisco.com> References: <52ek4zwbka.fsf@cisco.com> Message-ID: <20051129201521.GC8576@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH 2 of 2] ipoib: ipoib_multicast.c cleanup > > I applied this bit as obviously correct: > > > 2. Make sure mcast->done is initialized to uncompleted value > > before we submit a new query, so that its safe to wait on. > > but I couldn't convince myself that > > > 1. 
Make sure mcast->query is set to NULL if, and only if, > > no query is outstanding. > > 4. Protect all accesses to priv->broadcast, priv->multicast_list, > > mcast->query and mcast->done by priv->lock. > > I had to change mcast_mutex to ipoib_mcast_lock to make the last bit work. > > were improvements (and there was no 3. ;) > > - R. > For example, mcast_restart_thread performs a list walk with no locking. Isn't this a problem? -- MST From mst at mellanox.co.il Tue Nov 29 12:40:45 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 29 Nov 2005 22:40:45 +0200 Subject: [openib-general] [PATCH repost] uverbs: mcast group cleanup Message-ID: <20051129204045.GA9125@mellanox.co.il> Roland, just wanted to make sure this doesn't get missed: > uverbs needs to track which multicast groups each QP is > attached to, in order to properly detach when cleanup > is performed on device file close. > > Signed-off-by: Jack Morgenstein > Signed-off-by: Michael S. Tsirkin You can get it here: https://openib.org/svn/trunk/contrib/mellanox/patches/mcast_cleanup.patch This is relatively straightforward; maybe it's good to get it out of the way before we go back to discussing ipoib races. -- MST From sean.hefty at intel.com Tue Nov 29 12:45:24 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 29 Nov 2005 12:45:24 -0800 Subject: [openib-general] [PATCH] [CM] add private data comparison to match REQs with listens Message-ID: The following patch modifies the kernel CM API to support matching private data in received REQs against listen requests. This allows the CM to support multiple listeners on the same service ID if a discriminator is carried in the private data. This will be used by the CMA to distinguish listen requests on the same port, but on different IP addresses. Updates were made to SDP and kDAPL to support this change.
Signed-off-by: Sean Hefty Index: ulp/sdp/sdp_conn.c =================================================================== --- ulp/sdp/sdp_conn.c (revision 4186) +++ ulp/sdp/sdp_conn.c (working copy) @@ -1817,7 +1817,7 @@ result = ib_cm_listen(hca->listen_id, cpu_to_be64(SDP_MSG_SERVICE_ID_VALUE), - cpu_to_be64(SDP_MSG_SERVICE_ID_MASK)); + cpu_to_be64(SDP_MSG_SERVICE_ID_MASK), NULL); if (result) { sdp_warn("Error <%d> listening for SDP connections", result); goto err4; Index: ulp/kdapl/ib/dapl_openib_cm.c =================================================================== --- ulp/kdapl/ib/dapl_openib_cm.c (revision 4186) +++ ulp/kdapl/ib/dapl_openib_cm.c (working copy) @@ -691,7 +691,7 @@ } status = ib_cm_listen(sp->cm_srvc_handle, cpu_to_be64(sp->conn_qual), - 0); + 0, NULL); if (status) { ib_destroy_cm_id(sp->cm_srvc_handle); sp->cm_srvc_handle = NULL; Index: include/rdma/ib_cm.h =================================================================== --- include/rdma/ib_cm.h (revision 4186) +++ include/rdma/ib_cm.h (working copy) @@ -102,7 +102,8 @@ IB_CM_APR_INFO_LENGTH = 72, IB_CM_SIDR_REQ_PRIVATE_DATA_SIZE = 216, IB_CM_SIDR_REP_PRIVATE_DATA_SIZE = 136, - IB_CM_SIDR_REP_INFO_LENGTH = 72 + IB_CM_SIDR_REP_INFO_LENGTH = 72, + IB_CM_PRIVATE_DATA_COMPARE_SIZE = 64 }; struct ib_cm_id; @@ -238,7 +239,6 @@ u32 qpn; void *info; u8 info_len; - }; struct ib_cm_event { @@ -318,6 +318,11 @@ #define IB_SERVICE_ID_AGN_MASK __constant_cpu_to_be64(0xFF00000000000000ULL) #define IB_CM_ASSIGN_SERVICE_ID __constant_cpu_to_be64(0x0200000000000000ULL) +struct ib_cm_private_data_compare { + u8 data[IB_CM_PRIVATE_DATA_COMPARE_SIZE]; + u8 mask[IB_CM_PRIVATE_DATA_COMPARE_SIZE]; +}; + /** * ib_cm_listen - Initiates listening on the specified service ID for * connection and service ID resolution requests. @@ -330,10 +335,12 @@ * range of service IDs. If set to 0, the service ID is matched * exactly. This parameter is ignored if %service_id is set to * IB_CM_ASSIGN_SERVICE_ID. 
+ * @compare_data: This parameter is optional. It specifies data that must + * appear in the private data of a connection request for the specified + * listen request. */ -int ib_cm_listen(struct ib_cm_id *cm_id, - __be64 service_id, - __be64 service_mask); +int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, + struct ib_cm_private_data_compare *compare_data); struct ib_cm_req_param { struct ib_sa_path_rec *primary_path; Index: core/cm.c =================================================================== --- core/cm.c (revision 4186) +++ core/cm.c (working copy) @@ -130,6 +130,7 @@ /* todo: use alternate port on send failure */ struct cm_av av; struct cm_av alt_av; + struct ib_cm_private_data_compare *compare_data; void *private_data; __be64 tid; @@ -355,6 +356,48 @@ return cm_id_priv; } +static void cm_mask_compare_data(u8 *dst, u8 *src, u8 *mask) +{ + int i; + + for (i = 0; i < IB_CM_PRIVATE_DATA_COMPARE_SIZE; i++) + dst[i] = src[i] & mask[i]; +} + +static int cm_compare_data(struct ib_cm_private_data_compare *src_data, + struct ib_cm_private_data_compare *dst_data) +{ + u8 src[IB_CM_PRIVATE_DATA_COMPARE_SIZE]; + u8 dst[IB_CM_PRIVATE_DATA_COMPARE_SIZE]; + + if (!src_data || !dst_data) + return 0; + + cm_mask_compare_data(src, src_data->data, dst_data->mask); + cm_mask_compare_data(dst, dst_data->data, src_data->mask); + if (!memcmp(src, dst, IB_CM_PRIVATE_DATA_COMPARE_SIZE)) + return 0; + + return memcmp(src_data->data, dst_data->data, + IB_CM_PRIVATE_DATA_COMPARE_SIZE); +} + +static int cm_compare_private_data(u8 *private_data, + struct ib_cm_private_data_compare *dst_data) +{ + u8 dst[IB_CM_PRIVATE_DATA_COMPARE_SIZE]; + + if (!dst_data) + return 0; + + cm_mask_compare_data(dst, private_data, dst_data->mask); + if (!memcmp(dst_data->data, dst, IB_CM_PRIVATE_DATA_COMPARE_SIZE)) + return 0; + + return memcmp(private_data, dst_data->data, + IB_CM_PRIVATE_DATA_COMPARE_SIZE); +} + static struct cm_id_private * cm_insert_listen(struct 
cm_id_private *cm_id_priv) { struct rb_node **link = &cm.listen_service_table.rb_node; @@ -362,14 +405,18 @@ struct cm_id_private *cur_cm_id_priv; __be64 service_id = cm_id_priv->id.service_id; __be64 service_mask = cm_id_priv->id.service_mask; + int data_cmp; while (*link) { parent = *link; cur_cm_id_priv = rb_entry(parent, struct cm_id_private, service_node); + data_cmp = cm_compare_data(cm_id_priv->compare_data, + cur_cm_id_priv->compare_data); if ((cur_cm_id_priv->id.service_mask & service_id) == (service_mask & cur_cm_id_priv->id.service_id) && - (cm_id_priv->id.device == cur_cm_id_priv->id.device)) + (cm_id_priv->id.device == cur_cm_id_priv->id.device) && + !data_cmp) return cur_cm_id_priv; if (cm_id_priv->id.device < cur_cm_id_priv->id.device) @@ -378,6 +425,10 @@ link = &(*link)->rb_right; else if (service_id < cur_cm_id_priv->id.service_id) link = &(*link)->rb_left; + else if (service_id > cur_cm_id_priv->id.service_id) + link = &(*link)->rb_right; + else if (data_cmp < 0) + link = &(*link)->rb_left; else link = &(*link)->rb_right; } @@ -387,16 +438,20 @@ } static struct cm_id_private * cm_find_listen(struct ib_device *device, - __be64 service_id) + __be64 service_id, + void *private_data) { struct rb_node *node = cm.listen_service_table.rb_node; struct cm_id_private *cm_id_priv; + int data_cmp; while (node) { cm_id_priv = rb_entry(node, struct cm_id_private, service_node); + data_cmp = cm_compare_private_data(private_data, + cm_id_priv->compare_data); if ((cm_id_priv->id.service_mask & service_id) == cm_id_priv->id.service_id && - (cm_id_priv->id.device == device)) + (cm_id_priv->id.device == device) && !data_cmp) return cm_id_priv; if (device < cm_id_priv->id.device) @@ -405,6 +460,10 @@ node = node->rb_right; else if (service_id < cm_id_priv->id.service_id) node = node->rb_left; + else if (service_id > cm_id_priv->id.service_id) + node = node->rb_right; + else if (data_cmp < 0) + node = node->rb_left; else node = node->rb_right; } @@ -728,15 +787,14 @@ 
wait_event(cm_id_priv->wait, !atomic_read(&cm_id_priv->refcount)); while ((work = cm_dequeue_work(cm_id_priv)) != NULL) cm_free_work(work); - if (cm_id_priv->private_data && cm_id_priv->private_data_len) - kfree(cm_id_priv->private_data); + kfree(cm_id_priv->compare_data); + kfree(cm_id_priv->private_data); kfree(cm_id_priv); } EXPORT_SYMBOL(ib_destroy_cm_id); -int ib_cm_listen(struct ib_cm_id *cm_id, - __be64 service_id, - __be64 service_mask) +int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, + struct ib_cm_private_data_compare *compare_data) { struct cm_id_private *cm_id_priv, *cur_cm_id_priv; unsigned long flags; @@ -750,7 +808,19 @@ return -EINVAL; cm_id_priv = container_of(cm_id, struct cm_id_private, id); - BUG_ON(cm_id->state != IB_CM_IDLE); + if (cm_id->state != IB_CM_IDLE) + return -EINVAL; + + if (compare_data) { + cm_id_priv->compare_data = kzalloc(sizeof *compare_data, + GFP_KERNEL); + if (!cm_id_priv->compare_data) + return -ENOMEM; + cm_mask_compare_data(cm_id_priv->compare_data->data, + compare_data->data, compare_data->mask); + memcpy(cm_id_priv->compare_data->mask, compare_data->mask, + IB_CM_PRIVATE_DATA_COMPARE_SIZE); + } cm_id->state = IB_CM_LISTEN; @@ -767,6 +837,8 @@ if (cur_cm_id_priv) { cm_id->state = IB_CM_IDLE; + kfree(cm_id_priv->compare_data); + cm_id_priv->compare_data = NULL; ret = -EBUSY; } return ret; @@ -1239,7 +1311,8 @@ /* Find matching listen request. */ listen_cm_id_priv = cm_find_listen(cm_id_priv->id.device, - req_msg->service_id); + req_msg->service_id, + req_msg->private_data); if (!listen_cm_id_priv) { spin_unlock_irqrestore(&cm.lock, flags); cm_issue_rej(work->port, work->mad_recv_wc, @@ -2646,7 +2719,8 @@ goto out; /* Duplicate message. 
*/ } cur_cm_id_priv = cm_find_listen(cm_id->device, - sidr_req_msg->service_id); + sidr_req_msg->service_id, + sidr_req_msg->private_data); if (!cur_cm_id_priv) { rb_erase(&cm_id_priv->sidr_id_node, &cm.remote_sidr_table); spin_unlock_irqrestore(&cm.lock, flags); Index: core/cma.c =================================================================== --- core/cma.c (revision 4186) +++ core/cma.c (working copy) @@ -755,7 +755,7 @@ return PTR_ERR(id_priv->cm_id); svc_id = cma_get_service_id(&id_priv->id.route.addr.src_addr); - ret = ib_cm_listen(id_priv->cm_id, svc_id, 0); + ret = ib_cm_listen(id_priv->cm_id, svc_id, 0, NULL); if (ret) { ib_destroy_cm_id(id_priv->cm_id); id_priv->cm_id = NULL; Index: core/ucm.c =================================================================== --- core/ucm.c (revision 4186) +++ core/ucm.c (working copy) @@ -660,7 +660,8 @@ if (IS_ERR(ctx)) return PTR_ERR(ctx); - result = ib_cm_listen(ctx->cm_id, cmd.service_id, cmd.service_mask); + result = ib_cm_listen(ctx->cm_id, cmd.service_id, cmd.service_mask, + NULL); ib_ucm_ctx_put(ctx); return result; } From mst at mellanox.co.il Tue Nov 29 12:54:33 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 29 Nov 2005 22:54:33 +0200 Subject: [openib-general] Re: [PATCH] [CM] add private data comparison to match REQs with listens In-Reply-To: References: Message-ID: <20051129205433.GB9125@mellanox.co.il> Quoting r. Sean Hefty : > Subject: [PATCH] [CM] add private data comparison to match REQs with listens > > The following patch modifies the kernel CM API to support matching > private data in received REQs against listen requests. This allows the > CM to support multiple listeners on the same service ID if a > discriminator is carried in the private data. > > This will be used by the CMA to distinguish listen requests on the same > port, but on different IP addresses. Updates were made to SDP and kDAPL > to support this change. 
> > Signed-off-by: Sean Hefty By the way, we are not yet using all bits in the service id, can't we use some of them for demux, as well? -- MST From caitlinb at broadcom.com Tue Nov 29 13:04:12 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 29 Nov 2005 13:04:12 -0800 Subject: [openib-general] [PATCH] [CM] add private data comparison to match REQs with listens Message-ID: <54AD0F12E08D1541B826BE97C98F99F10C24DF@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > The following patch modifies the kernel CM API to support > matching private data in received REQs against listen > requests. This allows the CM to support multiple listeners > on the same service ID if a discriminator is carried in the > private data. > > This will be used by the CMA to distinguish listen requests > on the same port, but on different IP addresses. Updates > were made to SDP and kDAPL to support this change. > A DAT Consumer does not issue multiple listens for the same Interface Adapter differentiated on the IP address (source or destination). Only a single listen per port/service ID is allowed per Interface Adapter. The resulting Connection Request events carry the remote and local addresses to allow the DAT *consumer* to differentiate based on the address. There are several factors as to why this differentiation is best done by the DAT Consumer, and not by the Connection Manager or DAT Provider: 1) While supporting multiple addresses on the same Interface is an extremely common practice, typically it is the *same* service being supported for port X no matter what the local IP address is. 2) Virtual hosts may support *thousands* of IP addresses, and differentiate the content served for each of them. Such a large fan out is best supported by the daemon's own tables, rather than by general purpose Connection Manager tables. 
3) The backlog is best expressed as a common limit (HTTPD will accept up to 50 pending connections) rather than a per IP address limit (since there are *thousands* of virtual sites, the aggregate limit simply disappears). Keep in mind that this is site virtualization where the daemon is aware that it is supporting multiple sites using a *single* Interface Adapter (HCA/RNIC). Apache is the prime example of this strategy. Xen-style virtualization can have each GOS supplying a totally different daemon for its virtual Interface Adapter on the same TCP port/Service ID. But in that case each daemon/GOS thinks that it has a distinct device. The listens are segregated by the virtual device, not by the destination IP address (unless that is how the virtualizer splits the traffic, but that is done *before* the Connection Manager is involved). From mshefty at ichips.intel.com Tue Nov 29 13:08:39 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 29 Nov 2005 13:08:39 -0800 Subject: [openib-general] Re: [PATCH] [CM] add private data comparison to match REQs with listens In-Reply-To: <20051129205433.GB9125@mellanox.co.il> References: <20051129205433.GB9125@mellanox.co.il> Message-ID: <438CC357.1040106@ichips.intel.com> Michael S. Tsirkin wrote: > By the way, we are not yet using all bits in the service id, > can't we use some of them for demux, as well? The service ID is not sufficient, since we may need to demux based on an IP address. SDP could also use this feature to match listen requests, rather than performing a second level of demultiplexing, which it currently does. - Sean From mst at mellanox.co.il Tue Nov 29 13:18:42 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 29 Nov 2005 23:18:42 +0200 Subject: [openib-general] Re: [PATCH] [CM] add private data comparison to match REQs with listens In-Reply-To: <438CC357.1040106@ichips.intel.com> References: <438CC357.1040106@ichips.intel.com> Message-ID: <20051129211842.GB10295@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [openib-general] Re: [PATCH] [CM] add private data comparison to match REQs with listens > > Michael S. Tsirkin wrote: > > By the way, we are not yet using all bits in the service id, > > can't we use some of them for demux, as well? > > The service ID is not sufficient, since we may need to demux based on an IP > address. I agree. I simply suggested using spare service ID bits for an additional protocol demux field, so that we can have two protocols share the same port. > SDP could also use this feature to match listen requests, rather than > performing a second level of demultiplexing, which it currently does. Sounds good. BTW, with respect to cma - is there a chance that sdp will be able to use it? -- MST From mshefty at ichips.intel.com Tue Nov 29 13:29:05 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 29 Nov 2005 13:29:05 -0800 Subject: [openib-general] Re: [PATCH] [CM] add private data comparison to match REQs with listens In-Reply-To: <20051129211842.GB10295@mellanox.co.il> References: <438CC357.1040106@ichips.intel.com> <20051129211842.GB10295@mellanox.co.il> Message-ID: <438CC821.3030906@ichips.intel.com> Michael S. Tsirkin wrote: > BTW, with respect to cma - is there a chance that sdp will be able to use it? My intent is that SDP will be able to use the CMA. So the CMA will support both the new protocol that's being proposed, along with the existing SDP protocol. 
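The private-data matching discussed in this thread can be sketched in userspace C roughly as follows. This is only an illustration, not the code from the patch: the struct, the sketch_* helper names and the 64-byte comparison size are assumptions, chosen to mirror the way the patch calls cm_mask_compare_data() and cm_compare_private_data().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed size; stands in for IB_CM_PRIVATE_DATA_COMPARE_SIZE. */
#define SKETCH_COMPARE_SIZE 64

/* Hypothetical stand-in for struct ib_cm_private_data_compare. */
struct sketch_compare {
	uint8_t data[SKETCH_COMPARE_SIZE]; /* expected bytes, stored pre-masked */
	uint8_t mask[SKETCH_COMPARE_SIZE]; /* bits of the private data that matter */
};

/* Store (src & mask) in dst, so the listener's pattern is kept pre-masked. */
static void sketch_mask_compare_data(uint8_t *dst, const uint8_t *src,
				     const uint8_t *mask)
{
	size_t i;

	for (i = 0; i < SKETCH_COMPARE_SIZE; i++)
		dst[i] = src[i] & mask[i];
}

/*
 * Compare the private data of an incoming request against a listener.
 * Returns 0 on a match and a nonzero value otherwise; the sign can be
 * used to pick a branch when walking an ordered tree of listeners.
 * A listener without compare data (NULL) matches any request.
 */
static int sketch_compare_private_data(const uint8_t *private_data,
				       const struct sketch_compare *cmp)
{
	uint8_t masked[SKETCH_COMPARE_SIZE];

	if (!cmp)
		return 0;

	sketch_mask_compare_data(masked, private_data, cmp->mask);
	return memcmp(masked, cmp->data, SKETCH_COMPARE_SIZE);
}
```

With a mask that covers only the bytes carrying the destination IP address, two listeners on the same service ID can be told apart by address, while a listener registered with no compare data keeps the old match-everything behaviour.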
- Sean From Arkady.Kanevsky at netapp.com Tue Nov 29 14:32:51 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Tue, 29 Nov 2005 17:32:51 -0500 Subject: [swg] RE: [openib-general] socket based connectionmodel for IB proposal -round 4 Message-ID: Sean, SWG discussed today the proposal to extend the private data format to SIDR_REQ. The group does not see the need for it since the ULP is not RDMA aware. That is, the ULP does not use RDMA operations. Do you have some specific ULP in mind for this functionality? For UDP a different IP address can be used for each message. There is no persistent connection. Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 275 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Wednesday, November 23, 2005 3:41 PM > To: Ted H. Kim > Cc: Kanevsky, Arkady; swg at infinibandta.org; openib-general at openib.org > Subject: Re: [swg] RE: [openib-general] socket based > connectionmodel for IB proposal -round 4 > > Ted H. Kim wrote: > > I know we originally set out to compress everything down to the > > minimum to preserve as much ULP specific private data as > possible. But > > it seems to me in the current proposal we have reserved space now > > which could be used to re-expand the version to major 4-bits and > > minor-4 bits without harming anything else. > > I don't see any benefit to having 2 4-bit version numbers > over a single 8-bit number. A single 4-bit version number > should suffice. If all version numbers are ever consumed, > then version 15 can define an extended version field. IMO, > multiple version fields simply complicate the implementation. > > I would rather see the reserved space used to define the size > of carried user-private data. 
> > - Sean > From rolandd at cisco.com Tue Nov 29 15:12:15 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 29 Nov 2005 15:12:15 -0800 Subject: [openib-general] Re: [PATCH 1 of 2] ipoib: pass all of multicast.c through ipoib_workqueue In-Reply-To: <20051129201409.GB8576@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 29 Nov 2005 22:14:10 +0200") References: <527jarutml.fsf@cisco.com> <20051129201409.GB8576@mellanox.co.il> Message-ID: <523blfujzk.fsf@cisco.com> Michael> I think we can have: Michael> context 1: check ADMIN_UP, it is set so we'll start thread Michael> context 2: clear ADMIN_UP Michael> context 2: stop thread Michael> context 1: start thread But doesn't that end up with the thread running, even if we put the "stop thread" followed by the "start thread" into a work queue? - R. From rolandd at cisco.com Tue Nov 29 15:20:15 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 29 Nov 2005 15:20:15 -0800 Subject: [openib-general] Re: [PATCH 2 of 2] ipoib: ipoib_multicast.c cleanup In-Reply-To: <20051129201521.GC8576@mellanox.co.il> (Michael S. 
Tsirkin's message of "Tue, 29 Nov 2005 22:15:21 +0200") References: <52ek4zwbka.fsf@cisco.com> <20051129201521.GC8576@mellanox.co.il> Message-ID: <52y837t51s.fsf@cisco.com> Michael> For example, mcast_restart_thread performs list walk with Michael> no locking. Isn't this a problem? Do you mean ipoib_mcast_restart_task()? If so it seems all priv->multicast_list accesses are inside priv->lock. - R. From rolandd at cisco.com Tue Nov 29 15:23:46 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 29 Nov 2005 15:23:46 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS (Reliable Datagram Sockets) to OpenIB In-Reply-To: <52d5li8waw.fsf@cisco.com> (Roland Dreier's message of "Wed, 02 Nov 2005 15:27:51 -0800") References: <52d5li8waw.fsf@cisco.com> Message-ID: <52u0dvt4vx.fsf@cisco.com> Any progress to report on the port of RDS from the SilverStorm proprietary stack to the standard Linux stack? I think it would really move the discussion forward if there were some code that people could build and use. - R. From mshefty at ichips.intel.com Tue Nov 29 15:30:04 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 29 Nov 2005 15:30:04 -0800 Subject: [swg] RE: [openib-general] socket based connectionmodel for IB proposal -round 4 In-Reply-To: References: Message-ID: <438CE47C.2070501@ichips.intel.com> Kanevsky, Arkady wrote: > Sean, > SWG discussed today the proposal to extend the private data format to > SIDR_REQ. > The group does not see the need for it since the ULP is not RDMA aware. > That is, the ULP does not use RDMA operations. > Do you have some specific ULP in mind for this functionality? > For UDP a different IP address can be used for each message. There is no > persistent connection. I didn't have any particular ULP in mind. I was thinking more of a generic application that wanted to use UDP style addressing over IB, similar to what's being discussed for using TCP style addressing over IB. 
It seems that there needs to be a way to map a given destination address to a remote QP/qkey. Regardless of whether the IP address is carried in each ULP message, it would still need to be in the SIDR REQ in order to locate the correct QP. Such a ULP won't use RDMA or atomic operations, but still benefits from having QP and CQ semantics, such as direct hardware access from userspace and pre-posted receive buffers, while avoiding the overhead of IP (or IPoIB). So, I would consider the application as being "RDMA aware", with "RDMA" hardware defined as that which provides QP semantics. I view it as similar to applications that sit over DAPL and do only sends and receives. In any case, this can be treated as a separate issue from what's being defined for connections. It could just be beneficial that, if it is ever defined, it use the same private data format. - Sean From rolandd at cisco.com Tue Nov 29 15:32:27 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 29 Nov 2005 15:32:27 -0800 Subject: [openib-general] Re: [PATCH repost] uverbs: mcast group cleanup In-Reply-To: <20051129204045.GA9125@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 29 Nov 2005 22:40:45 +0200") References: <20051129204045.GA9125@mellanox.co.il> Message-ID: <52oe43t4hg.fsf@cisco.com> Michael> Roland, just wanted to make sure this doesn't get missed: Yes, it's in my queue. I'm still trying to work through all the patches you sent during my vacation... - R. From rolandd at cisco.com Tue Nov 29 16:57:25 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 30 Nov 2005 00:57:25 +0000 Subject: [openib-general] [git patch review 1/8] IPoIB: reinitialize path struct's completion for every query Message-ID: <1133312245796-394e1098d722d830@cisco.com> It's possible that IPoIB will issue multiple SA queries for the same path struct. 
Therefore the struct's completion needs to be initialized for each query rather than only once when the struct is allocated, or else we might not wait long enough for later queries to finish and free the path struct too soon. Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib_main.c | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) applies-to: 9fd732ebc6b85090b64862c4ee3af7078ba1f822 65c7eddaba33995e013ef3c04718f6dc8fdf2335 diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 2fa3075..cd58b3d 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -428,7 +428,6 @@ static struct ipoib_path *path_rec_creat skb_queue_head_init(&path->queue); INIT_LIST_HEAD(&path->neigh_list); - init_completion(&path->done); memcpy(path->pathrec.dgid.raw, gid->raw, sizeof (union ib_gid)); path->pathrec.sgid = priv->local_gid; @@ -446,6 +445,8 @@ static int path_rec_start(struct net_dev ipoib_dbg(priv, "Start path record lookup for " IPOIB_GID_FMT "\n", IPOIB_GID_ARG(path->pathrec.dgid)); + init_completion(&path->done); + path->query_id = ib_sa_path_rec_get(priv->ca, priv->port, &path->pathrec, --- 0.99.9k From rolandd at cisco.com Tue Nov 29 16:57:25 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 30 Nov 2005 00:57:25 +0000 Subject: [openib-general] [git patch review 2/8] IPoIB: always set path->query to NULL when query finishes In-Reply-To: <1133312245796-394e1098d722d830@cisco.com> Message-ID: <1133312245796-b000d53a5b61afe0@cisco.com> Always set path->query to NULL when the SA path record query completes, rather than only when we don't have an address handle. 
Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib_main.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) applies-to: 5f96a0797d23e7cee4f2a6c4770bacadee31a261 5872a9fc28e6cd3a4e51479a50970d19a01573b3 diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index cd58b3d..826d7a7 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -398,9 +398,9 @@ static void path_rec_completion(int stat while ((skb = __skb_dequeue(&neigh->queue))) __skb_queue_tail(&skqueue, skb); } - } else - path->query = NULL; + } + path->query = NULL; complete(&path->done); spin_unlock_irqrestore(&priv->lock, flags); --- 0.99.9k From rolandd at cisco.com Tue Nov 29 16:57:25 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 30 Nov 2005 00:57:25 +0000 Subject: [openib-general] [git patch review 3/8] IPoIB: reinitialize mcast structs' completions for every query In-Reply-To: <1133312245796-b000d53a5b61afe0@cisco.com> Message-ID: <1133312245796-cb4f80534d10c1b9@cisco.com> Make sure mcast->done is initialized to uncompleted value before we submit a new query, so that it's safe to wait on. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 6 ++++-- 1 files changed, 4 insertions(+), 2 deletions(-) applies-to: 993e77f7c00b7bc296e96f0cec1c98ea28a0436a de922487890936470660e89f9095aee980637989 diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index c33ed87..10404e0 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -135,8 +135,6 @@ static struct ipoib_mcast *ipoib_mcast_a if (!mcast) return NULL; - init_completion(&mcast->done); - mcast->dev = dev; mcast->created = jiffies; mcast->backoff = 1; @@ -350,6 +348,8 @@ static int ipoib_mcast_sendonly_join(str rec.port_gid = priv->local_gid; rec.pkey = cpu_to_be16(priv->pkey); + init_completion(&mcast->done); + ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, IB_SA_MCMEMBER_REC_MGID | IB_SA_MCMEMBER_REC_PORT_GID | @@ -469,6 +469,8 @@ static void ipoib_mcast_join(struct net_ rec.traffic_class = priv->broadcast->mcmember.traffic_class; } + init_completion(&mcast->done); + ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, comp_mask, mcast->backoff * 1000, GFP_ATOMIC, ipoib_mcast_join_complete, --- 0.99.9k From rolandd at cisco.com Tue Nov 29 16:57:25 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 30 Nov 2005 00:57:25 +0000 Subject: [openib-general] [git patch review 4/8] IPoIB: don't zero members after we allocate with kzalloc In-Reply-To: <1133312245796-cb4f80534d10c1b9@cisco.com> Message-ID: <1133312245797-2006ed5b68ef5482@cisco.com> ipoib_mcast_alloc() uses kzalloc(), so there's no need to zero out members of the mcast struct after it's allocated. 
Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 4 ---- 1 files changed, 0 insertions(+), 4 deletions(-) applies-to: bbb88a18ee78fa43c0f887c138011a055a9c8045 2e86541ec878de9ec5771600a77f451a80bebfc4 diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 10404e0..ef3ee03 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -138,15 +138,11 @@ static struct ipoib_mcast *ipoib_mcast_a mcast->dev = dev; mcast->created = jiffies; mcast->backoff = 1; - mcast->logcount = 0; INIT_LIST_HEAD(&mcast->list); INIT_LIST_HEAD(&mcast->neigh_list); skb_queue_head_init(&mcast->pkt_queue); - mcast->ah = NULL; - mcast->query = NULL; - return mcast; } --- 0.99.9k From rolandd at cisco.com Tue Nov 29 16:57:25 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 30 Nov 2005 00:57:25 +0000 Subject: [openib-general] [git patch review 5/8] IPoIB: protect child list in ipoib_ib_dev_flush In-Reply-To: <1133312245797-2006ed5b68ef5482@cisco.com> Message-ID: <1133312245797-01403858bd5f112a@cisco.com> race condition: ipoib_ib_dev_flush is accessing child list without locks. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 4 ++++ 1 files changed, 4 insertions(+), 0 deletions(-) applies-to: e58f418dd8e64675b1dbaa6db92d7c1e606d1506 4f71055a45a503273c039d80db8ba9b13cb17549 diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 54ef2fe..2388580 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -608,9 +608,13 @@ void ipoib_ib_dev_flush(void *_dev) if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) ipoib_ib_dev_up(dev); + down(&priv->vlan_mutex); + /* Flush any child interfaces too */ list_for_each_entry(cpriv, &priv->child_intfs, list) ipoib_ib_dev_flush(&cpriv->dev); + + up(&priv->vlan_mutex); } void ipoib_ib_dev_cleanup(struct net_device *dev) --- 0.99.9k From rolandd at cisco.com Tue Nov 29 16:57:25 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 30 Nov 2005 00:57:25 +0000 Subject: [openib-general] [git patch review 6/8] IPoIB: fix error handling in ipoib_open In-Reply-To: <1133312245797-01403858bd5f112a@cisco.com> Message-ID: <1133312245799-51b50fe9f024aec5@cisco.com> If ipoib_ib_dev_up() fails after ipoib_ib_dev_open() is called, then ipoib_ib_dev_stop() needs to be called to clean up. 
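The unwind rule described above (undo the first bring-up stage when the second one fails) can be sketched in plain C as below. The fake_* names and struct fake_dev are hypothetical stand-ins for ipoib_ib_dev_open(), ipoib_ib_dev_up() and ipoib_ib_dev_stop(); this is an illustration of the pattern, not the driver code.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical device with two bring-up stages. */
struct fake_dev {
	bool opened;   /* set by stage one, cleared by fake_dev_stop() */
	bool fail_up;  /* test knob: make stage two fail */
};

/* Stage one: acquire resources. Always succeeds in this sketch. */
static int fake_dev_open(struct fake_dev *dev)
{
	dev->opened = true;
	return 0;
}

/* Stage two: bring the device up. Fails when fail_up is set. */
static int fake_dev_up(struct fake_dev *dev)
{
	return dev->fail_up ? -1 : 0;
}

/* Undo stage one. */
static void fake_dev_stop(struct fake_dev *dev)
{
	dev->opened = false;
}

/* Open path: a failure in stage two must not leave stage one behind. */
static int fake_open(struct fake_dev *dev)
{
	if (fake_dev_open(dev))
		return -EINVAL;

	if (fake_dev_up(dev)) {
		fake_dev_stop(dev); /* clean up before reporting the error */
		return -EINVAL;
	}

	return 0;
}
```

Without the fake_dev_stop() call in the failure branch, the caller would see an error while the device still held whatever stage one had set up.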
Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib_main.c | 4 +++- 1 files changed, 3 insertions(+), 1 deletions(-) applies-to: bb4b6f10197addff1af91368f916904eb4404edf 267ee88ed34c76dc527eeb3d95f9f9558ac99973 diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 826d7a7..475d98f 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -94,8 +94,10 @@ int ipoib_open(struct net_device *dev) if (ipoib_ib_dev_open(dev)) return -EINVAL; - if (ipoib_ib_dev_up(dev)) + if (ipoib_ib_dev_up(dev)) { + ipoib_ib_dev_stop(dev); return -EINVAL; + } if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { struct ipoib_dev_priv *cpriv; --- 0.99.9k From rolandd at cisco.com Tue Nov 29 16:57:25 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 30 Nov 2005 00:57:25 +0000 Subject: [openib-general] [git patch review 8/8] IB/uverbs: track multicast group membership for userspace QPs In-Reply-To: <1133312245799-fb80cd19aa5b232b@cisco.com> Message-ID: <1133312245799-86be2f2daf6d00d2@cisco.com> uverbs needs to track which multicast groups each QP is attached to, in order to detach properly when cleanup is performed on device file close. Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/core/uverbs.h | 11 ++++ drivers/infiniband/core/uverbs_cmd.c | 90 ++++++++++++++++++++++++++------- drivers/infiniband/core/uverbs_main.c | 21 ++++++-- 3 files changed, 99 insertions(+), 23 deletions(-) applies-to: 9896d3c1093ded78c62da6b9a52b71e282c763e0 f4e401562c11c7ca65592ebd749353cf0b19af7b diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h index ecb8301..7114e3f 100644 --- a/drivers/infiniband/core/uverbs.h +++ b/drivers/infiniband/core/uverbs.h @@ -105,12 +105,23 @@ struct ib_uverbs_event { u32 *counter; }; +struct ib_uverbs_mcast_entry { + struct list_head list; + union ib_gid gid; + u16 lid; +}; + struct ib_uevent_object { struct ib_uobject uobject; struct list_head event_list; u32 events_reported; }; +struct ib_uqp_object { + struct ib_uevent_object uevent; + struct list_head mcast_list; +}; + struct ib_ucq_object { struct ib_uobject uobject; struct ib_uverbs_file *uverbs_file; diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index ed45da8..a57d021 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -815,7 +815,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uv struct ib_uverbs_create_qp cmd; struct ib_uverbs_create_qp_resp resp; struct ib_udata udata; - struct ib_uevent_object *uobj; + struct ib_uqp_object *uobj; struct ib_pd *pd; struct ib_cq *scq, *rcq; struct ib_srq *srq; @@ -866,10 +866,11 @@ ssize_t ib_uverbs_create_qp(struct ib_uv attr.cap.max_recv_sge = cmd.max_recv_sge; attr.cap.max_inline_data = cmd.max_inline_data; - uobj->uobject.user_handle = cmd.user_handle; - uobj->uobject.context = file->ucontext; - uobj->events_reported = 0; - INIT_LIST_HEAD(&uobj->event_list); + uobj->uevent.uobject.user_handle = cmd.user_handle; + uobj->uevent.uobject.context = file->ucontext; + uobj->uevent.events_reported = 0; + INIT_LIST_HEAD(&uobj->uevent.event_list); + 
INIT_LIST_HEAD(&uobj->mcast_list); qp = pd->device->create_qp(pd, &attr, &udata); if (IS_ERR(qp)) { @@ -882,7 +883,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uv qp->send_cq = attr.send_cq; qp->recv_cq = attr.recv_cq; qp->srq = attr.srq; - qp->uobject = &uobj->uobject; + qp->uobject = &uobj->uevent.uobject; qp->event_handler = attr.event_handler; qp->qp_context = attr.qp_context; qp->qp_type = attr.qp_type; @@ -901,14 +902,14 @@ retry: goto err_destroy; } - ret = idr_get_new(&ib_uverbs_qp_idr, qp, &uobj->uobject.id); + ret = idr_get_new(&ib_uverbs_qp_idr, qp, &uobj->uevent.uobject.id); if (ret == -EAGAIN) goto retry; if (ret) goto err_destroy; - resp.qp_handle = uobj->uobject.id; + resp.qp_handle = uobj->uevent.uobject.id; resp.max_recv_sge = attr.cap.max_recv_sge; resp.max_send_sge = attr.cap.max_send_sge; resp.max_recv_wr = attr.cap.max_recv_wr; @@ -922,7 +923,7 @@ retry: } down(&file->mutex); - list_add_tail(&uobj->uobject.list, &file->ucontext->qp_list); + list_add_tail(&uobj->uevent.uobject.list, &file->ucontext->qp_list); up(&file->mutex); up(&ib_uverbs_idr_mutex); @@ -930,7 +931,7 @@ retry: return in_len; err_idr: - idr_remove(&ib_uverbs_qp_idr, uobj->uobject.id); + idr_remove(&ib_uverbs_qp_idr, uobj->uevent.uobject.id); err_destroy: ib_destroy_qp(qp); @@ -1032,7 +1033,7 @@ ssize_t ib_uverbs_destroy_qp(struct ib_u struct ib_uverbs_destroy_qp cmd; struct ib_uverbs_destroy_qp_resp resp; struct ib_qp *qp; - struct ib_uevent_object *uobj; + struct ib_uqp_object *uobj; int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) @@ -1046,7 +1047,12 @@ ssize_t ib_uverbs_destroy_qp(struct ib_u if (!qp || qp->uobject->context != file->ucontext) goto out; - uobj = container_of(qp->uobject, struct ib_uevent_object, uobject); + uobj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); + + if (!list_empty(&uobj->mcast_list)) { + ret = -EBUSY; + goto out; + } ret = ib_destroy_qp(qp); if (ret) @@ -1055,12 +1061,12 @@ ssize_t ib_uverbs_destroy_qp(struct ib_u 
idr_remove(&ib_uverbs_qp_idr, cmd.qp_handle); down(&file->mutex); - list_del(&uobj->uobject.list); + list_del(&uobj->uevent.uobject.list); up(&file->mutex); - ib_uverbs_release_uevent(file, uobj); + ib_uverbs_release_uevent(file, &uobj->uevent); - resp.events_reported = uobj->events_reported; + resp.events_reported = uobj->uevent.events_reported; kfree(uobj); @@ -1542,6 +1548,8 @@ ssize_t ib_uverbs_attach_mcast(struct ib { struct ib_uverbs_attach_mcast cmd; struct ib_qp *qp; + struct ib_uqp_object *uobj; + struct ib_uverbs_mcast_entry *mcast; int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) @@ -1550,9 +1558,36 @@ ssize_t ib_uverbs_attach_mcast(struct ib down(&ib_uverbs_idr_mutex); qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (qp && qp->uobject->context == file->ucontext) - ret = ib_attach_mcast(qp, (union ib_gid *) cmd.gid, cmd.mlid); + if (!qp || qp->uobject->context != file->ucontext) + goto out; + + uobj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); + + list_for_each_entry(mcast, &uobj->mcast_list, list) + if (cmd.mlid == mcast->lid && + !memcmp(cmd.gid, mcast->gid.raw, sizeof mcast->gid.raw)) { + ret = 0; + goto out; + } + mcast = kmalloc(sizeof *mcast, GFP_KERNEL); + if (!mcast) { + ret = -ENOMEM; + goto out; + } + + mcast->lid = cmd.mlid; + memcpy(mcast->gid.raw, cmd.gid, sizeof mcast->gid.raw); + + ret = ib_attach_mcast(qp, &mcast->gid, cmd.mlid); + if (!ret) { + uobj = container_of(qp->uobject, struct ib_uqp_object, + uevent.uobject); + list_add_tail(&mcast->list, &uobj->mcast_list); + } else + kfree(mcast); + +out: up(&ib_uverbs_idr_mutex); return ret ? 
ret : in_len; @@ -1563,7 +1598,9 @@ ssize_t ib_uverbs_detach_mcast(struct ib int out_len) { struct ib_uverbs_detach_mcast cmd; + struct ib_uqp_object *uobj; struct ib_qp *qp; + struct ib_uverbs_mcast_entry *mcast; int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) @@ -1572,9 +1609,24 @@ ssize_t ib_uverbs_detach_mcast(struct ib down(&ib_uverbs_idr_mutex); qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (qp && qp->uobject->context == file->ucontext) - ret = ib_detach_mcast(qp, (union ib_gid *) cmd.gid, cmd.mlid); + if (!qp || qp->uobject->context != file->ucontext) + goto out; + + ret = ib_detach_mcast(qp, (union ib_gid *) cmd.gid, cmd.mlid); + if (ret) + goto out; + uobj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); + + list_for_each_entry(mcast, &uobj->mcast_list, list) + if (cmd.mlid == mcast->lid && + !memcmp(cmd.gid, mcast->gid.raw, sizeof mcast->gid.raw)) { + list_del(&mcast->list); + kfree(mcast); + break; + } + +out: up(&ib_uverbs_idr_mutex); return ret ? 
ret : in_len; diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c index de6581d..81737bd 100644 --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -160,6 +160,18 @@ void ib_uverbs_release_uevent(struct ib_ spin_unlock_irq(&file->async_file->lock); } +static void ib_uverbs_detach_umcast(struct ib_qp *qp, + struct ib_uqp_object *uobj) +{ + struct ib_uverbs_mcast_entry *mcast, *tmp; + + list_for_each_entry_safe(mcast, tmp, &uobj->mcast_list, list) { + ib_detach_mcast(qp, &mcast->gid, mcast->lid); + list_del(&mcast->list); + kfree(mcast); + } +} + static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file, struct ib_ucontext *context) { @@ -180,13 +192,14 @@ static int ib_uverbs_cleanup_ucontext(st list_for_each_entry_safe(uobj, tmp, &context->qp_list, list) { struct ib_qp *qp = idr_find(&ib_uverbs_qp_idr, uobj->id); - struct ib_uevent_object *uevent = - container_of(uobj, struct ib_uevent_object, uobject); + struct ib_uqp_object *uqp = + container_of(uobj, struct ib_uqp_object, uevent.uobject); idr_remove(&ib_uverbs_qp_idr, uobj->id); + ib_uverbs_detach_umcast(qp, uqp); ib_destroy_qp(qp); list_del(&uobj->list); - ib_uverbs_release_uevent(file, uevent); - kfree(uevent); + ib_uverbs_release_uevent(file, &uqp->uevent); + kfree(uqp); } list_for_each_entry_safe(uobj, tmp, &context->cq_list, list) { --- 0.99.9k From rolandd at cisco.com Tue Nov 29 16:57:25 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 30 Nov 2005 00:57:25 +0000 Subject: [openib-general] [git patch review 7/8] IB/mthca: fix posting of send lists of length >= 255 on mem-free HCAs In-Reply-To: <1133312245799-51b50fe9f024aec5@cisco.com> Message-ID: <1133312245799-fb80cd19aa5b232b@cisco.com> On mem-free HCAs, when posting a long list of send requests, a doorbell must be rung every 255 requests. Add code to handle this. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_qp.c | 31 +++++++++++++++++++++++++++++-- drivers/infiniband/hw/mthca/mthca_wqe.h | 3 ++- 2 files changed, 31 insertions(+), 3 deletions(-) applies-to: 1f53cd0db55372192cc088788dadbed102845a17 e0ae9ecf469fdd3c1ad999efbf4fe6b782f49900 diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index f9c8eb9..7450550 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -1822,6 +1822,7 @@ int mthca_arbel_post_send(struct ib_qp * { struct mthca_dev *dev = to_mdev(ibqp->device); struct mthca_qp *qp = to_mqp(ibqp); + __be32 doorbell[2]; void *wqe; void *prev_wqe; unsigned long flags; @@ -1841,6 +1842,34 @@ int mthca_arbel_post_send(struct ib_qp * ind = qp->sq.head & (qp->sq.max - 1); for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (unlikely(nreq == MTHCA_ARBEL_MAX_WQES_PER_SEND_DB)) { + nreq = 0; + + doorbell[0] = cpu_to_be32((MTHCA_ARBEL_MAX_WQES_PER_SEND_DB << 24) | + ((qp->sq.head & 0xffff) << 8) | + f0 | op0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0); + + qp->sq.head += MTHCA_ARBEL_MAX_WQES_PER_SEND_DB; + size0 = 0; + + /* + * Make sure that descriptors are written before + * doorbell record. + */ + wmb(); + *qp->sq.db = cpu_to_be32(qp->sq.head & 0xffff); + + /* + * Make sure doorbell record is written before we + * write MMIO send doorbell. 
+ */ + wmb(); + mthca_write64(doorbell, + dev->kar + MTHCA_SEND_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + if (mthca_wq_overflow(&qp->sq, nreq, qp->ibqp.send_cq)) { mthca_err(dev, "SQ %06x full (%u head, %u tail," " %d max, %d nreq)\n", qp->qpn, @@ -2017,8 +2046,6 @@ int mthca_arbel_post_send(struct ib_qp * out: if (likely(nreq)) { - __be32 doorbell[2]; - doorbell[0] = cpu_to_be32((nreq << 24) | ((qp->sq.head & 0xffff) << 8) | f0 | op0); diff --git a/drivers/infiniband/hw/mthca/mthca_wqe.h b/drivers/infiniband/hw/mthca/mthca_wqe.h index 73f1c0b..e7d2c1e 100644 --- a/drivers/infiniband/hw/mthca/mthca_wqe.h +++ b/drivers/infiniband/hw/mthca/mthca_wqe.h @@ -50,7 +50,8 @@ enum { enum { MTHCA_INVAL_LKEY = 0x100, - MTHCA_TAVOR_MAX_WQES_PER_RECV_DB = 256 + MTHCA_TAVOR_MAX_WQES_PER_RECV_DB = 256, + MTHCA_ARBEL_MAX_WQES_PER_SEND_DB = 255 }; struct mthca_next_seg { --- 0.99.9k From rolandd at cisco.com Tue Nov 29 16:58:48 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 29 Nov 2005 16:58:48 -0800 Subject: [openib-general] Re: [PATCH] multicast resource tracking In-Reply-To: <20051122161107.GA20871@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 22 Nov 2005 18:11:07 +0200") References: <20051122161107.GA20871@mellanox.co.il> Message-ID: <52ek4zt0hj.fsf@cisco.com> Thanks, I applied this in the version below. I made a few cleanups, so I hope I didn't break anything. For example, I didn't see any reason for either of the list_for_each_entry loops in uverbs_cmd.c to be list_for_each_entry_safe -- the attach loop doesn't delete anything, and the detach loop breaks as soon as it deletes an entry. - R. 
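The point about the two loops can be sketched outside the kernel. What follows is a minimal illustration only: the list type and helpers are hypothetical stand-ins written for this note (they are not the kernel's list.h API), and the lid field is just an illustrative key. It shows why a walk that stops right after deleting one entry can use plain iteration, while a walk that deletes every entry has to save the next pointer first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a kernel list entry; not the real list_head API. */
struct node {
    struct node *prev, *next;
    int lid;                    /* illustrative key only */
};

/* Circular doubly linked list with a sentinel head. */
static void list_init(struct node *head)
{
    head->prev = head->next = head;
}

static void list_add_tail_node(struct node *n, struct node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

static void list_del_node(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

static int list_len(const struct node *head)
{
    int len = 0;
    for (const struct node *p = head->next; p != head; p = p->next)
        ++len;
    return len;
}

/*
 * Detach-style walk: delete the first match and stop at once.
 * Plain iteration is enough, because p is never touched after free().
 */
static int remove_first(struct node *head, int lid)
{
    for (struct node *p = head->next; p != head; p = p->next) {
        if (p->lid == lid) {
            list_del_node(p);
            free(p);
            return 1;
        }
    }
    return 0;
}

/*
 * Cleanup-style walk: delete every entry. The successor is saved in tmp
 * before p is freed, which is the job the _safe iterator variant does.
 */
static void remove_all(struct node *head)
{
    struct node *p, *tmp;

    for (p = head->next, tmp = p->next; p != head; p = tmp, tmp = p->next) {
        list_del_node(p);
        free(p);
    }
}

/* Returns 0 when both walks behave as described above. */
static int demo_list_iteration(void)
{
    struct node head;
    int removed, after_one;

    list_init(&head);
    for (int i = 1; i <= 3; ++i) {
        struct node *n = malloc(sizeof *n);
        if (!n) {
            remove_all(&head);
            return -1;
        }
        n->lid = i;
        list_add_tail_node(n, &head);
    }

    removed = remove_first(&head, 2);   /* one entry goes, walk stops */
    after_one = list_len(&head);
    remove_all(&head);                  /* everything else goes */

    return (removed == 1 && after_one == 2 && list_len(&head) == 0) ? 0 : 1;
}
```

Calling demo_list_iteration() from a small main() and checking for a zero return is enough to exercise both walks.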
--- infiniband/core/uverbs_cmd.c (revision 4210) +++ infiniband/core/uverbs_cmd.c (working copy) @@ -815,7 +815,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uv struct ib_uverbs_create_qp cmd; struct ib_uverbs_create_qp_resp resp; struct ib_udata udata; - struct ib_uevent_object *uobj; + struct ib_uqp_object *uobj; struct ib_pd *pd; struct ib_cq *scq, *rcq; struct ib_srq *srq; @@ -866,10 +866,11 @@ ssize_t ib_uverbs_create_qp(struct ib_uv attr.cap.max_recv_sge = cmd.max_recv_sge; attr.cap.max_inline_data = cmd.max_inline_data; - uobj->uobject.user_handle = cmd.user_handle; - uobj->uobject.context = file->ucontext; - uobj->events_reported = 0; - INIT_LIST_HEAD(&uobj->event_list); + uobj->uevent.uobject.user_handle = cmd.user_handle; + uobj->uevent.uobject.context = file->ucontext; + uobj->uevent.events_reported = 0; + INIT_LIST_HEAD(&uobj->uevent.event_list); + INIT_LIST_HEAD(&uobj->mcast_list); qp = pd->device->create_qp(pd, &attr, &udata); if (IS_ERR(qp)) { @@ -882,7 +883,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uv qp->send_cq = attr.send_cq; qp->recv_cq = attr.recv_cq; qp->srq = attr.srq; - qp->uobject = &uobj->uobject; + qp->uobject = &uobj->uevent.uobject; qp->event_handler = attr.event_handler; qp->qp_context = attr.qp_context; qp->qp_type = attr.qp_type; @@ -901,14 +902,14 @@ retry: goto err_destroy; } - ret = idr_get_new(&ib_uverbs_qp_idr, qp, &uobj->uobject.id); + ret = idr_get_new(&ib_uverbs_qp_idr, qp, &uobj->uevent.uobject.id); if (ret == -EAGAIN) goto retry; if (ret) goto err_destroy; - resp.qp_handle = uobj->uobject.id; + resp.qp_handle = uobj->uevent.uobject.id; resp.max_recv_sge = attr.cap.max_recv_sge; resp.max_send_sge = attr.cap.max_send_sge; resp.max_recv_wr = attr.cap.max_recv_wr; @@ -922,7 +923,7 @@ retry: } down(&file->mutex); - list_add_tail(&uobj->uobject.list, &file->ucontext->qp_list); + list_add_tail(&uobj->uevent.uobject.list, &file->ucontext->qp_list); up(&file->mutex); up(&ib_uverbs_idr_mutex); @@ -930,7 +931,7 @@ retry: return in_len; 
err_idr: - idr_remove(&ib_uverbs_qp_idr, uobj->uobject.id); + idr_remove(&ib_uverbs_qp_idr, uobj->uevent.uobject.id); err_destroy: ib_destroy_qp(qp); @@ -1032,7 +1033,7 @@ ssize_t ib_uverbs_destroy_qp(struct ib_u struct ib_uverbs_destroy_qp cmd; struct ib_uverbs_destroy_qp_resp resp; struct ib_qp *qp; - struct ib_uevent_object *uobj; + struct ib_uqp_object *uobj; int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) @@ -1046,7 +1047,12 @@ ssize_t ib_uverbs_destroy_qp(struct ib_u if (!qp || qp->uobject->context != file->ucontext) goto out; - uobj = container_of(qp->uobject, struct ib_uevent_object, uobject); + uobj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); + + if (!list_empty(&uobj->mcast_list)) { + ret = -EBUSY; + goto out; + } ret = ib_destroy_qp(qp); if (ret) @@ -1055,12 +1061,12 @@ ssize_t ib_uverbs_destroy_qp(struct ib_u idr_remove(&ib_uverbs_qp_idr, cmd.qp_handle); down(&file->mutex); - list_del(&uobj->uobject.list); + list_del(&uobj->uevent.uobject.list); up(&file->mutex); - ib_uverbs_release_uevent(file, uobj); + ib_uverbs_release_uevent(file, &uobj->uevent); - resp.events_reported = uobj->events_reported; + resp.events_reported = uobj->uevent.events_reported; kfree(uobj); @@ -1542,6 +1548,8 @@ ssize_t ib_uverbs_attach_mcast(struct ib { struct ib_uverbs_attach_mcast cmd; struct ib_qp *qp; + struct ib_uqp_object *uobj; + struct ib_uverbs_mcast_entry *mcast; int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) @@ -1550,9 +1558,36 @@ ssize_t ib_uverbs_attach_mcast(struct ib down(&ib_uverbs_idr_mutex); qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (qp && qp->uobject->context == file->ucontext) - ret = ib_attach_mcast(qp, (union ib_gid *) cmd.gid, cmd.mlid); + if (!qp || qp->uobject->context != file->ucontext) + goto out; + + uobj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); + + list_for_each_entry(mcast, &uobj->mcast_list, list) + if (cmd.mlid == mcast->lid && + !memcmp(cmd.gid, 
mcast->gid.raw, sizeof mcast->gid.raw)) { + ret = 0; + goto out; + } + mcast = kmalloc(sizeof *mcast, GFP_KERNEL); + if (!mcast) { + ret = -ENOMEM; + goto out; + } + + mcast->lid = cmd.mlid; + memcpy(mcast->gid.raw, cmd.gid, sizeof mcast->gid.raw); + + ret = ib_attach_mcast(qp, &mcast->gid, cmd.mlid); + if (!ret) { + uobj = container_of(qp->uobject, struct ib_uqp_object, + uevent.uobject); + list_add_tail(&mcast->list, &uobj->mcast_list); + } else + kfree(mcast); + +out: up(&ib_uverbs_idr_mutex); return ret ? ret : in_len; @@ -1563,7 +1598,9 @@ ssize_t ib_uverbs_detach_mcast(struct ib int out_len) { struct ib_uverbs_detach_mcast cmd; + struct ib_uqp_object *uobj; struct ib_qp *qp; + struct ib_uverbs_mcast_entry *mcast; int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) @@ -1572,9 +1609,24 @@ ssize_t ib_uverbs_detach_mcast(struct ib down(&ib_uverbs_idr_mutex); qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (qp && qp->uobject->context == file->ucontext) - ret = ib_detach_mcast(qp, (union ib_gid *) cmd.gid, cmd.mlid); + if (!qp || qp->uobject->context != file->ucontext) + goto out; + + ret = ib_detach_mcast(qp, (union ib_gid *) cmd.gid, cmd.mlid); + if (ret) + goto out; + uobj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); + + list_for_each_entry(mcast, &uobj->mcast_list, list) + if (cmd.mlid == mcast->lid && + !memcmp(cmd.gid, mcast->gid.raw, sizeof mcast->gid.raw)) { + list_del(&mcast->list); + kfree(mcast); + break; + } + +out: up(&ib_uverbs_idr_mutex); return ret ? 
ret : in_len; --- infiniband/core/uverbs.h (revision 4210) +++ infiniband/core/uverbs.h (working copy) @@ -105,12 +105,23 @@ struct ib_uverbs_event { u32 *counter; }; +struct ib_uverbs_mcast_entry { + struct list_head list; + union ib_gid gid; + u16 lid; +}; + struct ib_uevent_object { struct ib_uobject uobject; struct list_head event_list; u32 events_reported; }; +struct ib_uqp_object { + struct ib_uevent_object uevent; + struct list_head mcast_list; +}; + struct ib_ucq_object { struct ib_uobject uobject; struct ib_uverbs_file *uverbs_file; --- infiniband/core/uverbs_main.c (revision 4210) +++ infiniband/core/uverbs_main.c (working copy) @@ -160,6 +160,18 @@ void ib_uverbs_release_uevent(struct ib_ spin_unlock_irq(&file->async_file->lock); } +static void ib_uverbs_detach_umcast(struct ib_qp *qp, + struct ib_uqp_object *uobj) +{ + struct ib_uverbs_mcast_entry *mcast, *tmp; + + list_for_each_entry_safe(mcast, tmp, &uobj->mcast_list, list) { + ib_detach_mcast(qp, &mcast->gid, mcast->lid); + list_del(&mcast->list); + kfree(mcast); + } +} + static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file, struct ib_ucontext *context) { @@ -180,13 +192,14 @@ static int ib_uverbs_cleanup_ucontext(st list_for_each_entry_safe(uobj, tmp, &context->qp_list, list) { struct ib_qp *qp = idr_find(&ib_uverbs_qp_idr, uobj->id); - struct ib_uevent_object *uevent = - container_of(uobj, struct ib_uevent_object, uobject); + struct ib_uqp_object *uqp = + container_of(uobj, struct ib_uqp_object, uevent.uobject); idr_remove(&ib_uverbs_qp_idr, uobj->id); + ib_uverbs_detach_umcast(qp, uqp); ib_destroy_qp(qp); list_del(&uobj->list); - ib_uverbs_release_uevent(file, uevent); - kfree(uevent); + ib_uverbs_release_uevent(file, &uqp->uevent); + kfree(uqp); } list_for_each_entry_safe(uobj, tmp, &context->cq_list, list) { From rjwalsh at pathscale.com Tue Nov 29 17:45:47 2005 From: rjwalsh at pathscale.com (Robert Walsh) Date: Tue, 29 Nov 2005 17:45:47 -0800 Subject: [openib-general] Changes to 
core drivers to support RD
Message-ID: <1133315147.28746.26.camel@hematite.internal.keyresearch.com>

Hi all,

Here are some infrastructure changes to the core drivers and libibverbs to get the ball rolling on supporting RD. Note: no drivers support this yet, and there are no changes to any of the driver code; these are purely infrastructure changes.

Please let us know what you think.

Regards,
Robert.

--
Robert Walsh                     Email: rjwalsh at pathscale.com
PathScale, Inc.                  Phone: +1 650 934 8117
2071 Stierlin Court, Suite 200   Fax: +1 650 428 1969
Mountain View, CA 94043

-------------- next part --------------
A non-text attachment was scrubbed...
Name: rd_kernel_bits.diff
Type: text/x-patch
Size: 31673 bytes
-------------- next part --------------
A non-text attachment was scrubbed...
Name: rd_user_bits.diff
Type: text/x-patch
Size: 20478 bytes

From yaronh at voltaire.com Tue Nov 29 19:08:48 2005
From: yaronh at voltaire.com (Yaron Haviv)
Date: Wed, 30 Nov 2005 05:08:48 +0200
Subject: [swg] RE: [openib-general] socket based connectionmodel for IBproposal -round 4
Message-ID: <35EA21F54A45CB47B879F21A91F4862F8FEE06@taurus.voltaire.com>

> -----Original Message-----
> From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Sean Hefty
> Sent: Tuesday, November 29, 2005 6:30 PM
> To: Kanevsky, Arkady
> Cc: Ted H. Kim; swg at infinibandta.org; openib-general at openib.org
> Subject: Re: [swg] RE: [openib-general] socket based connectionmodel for IBproposal -round 4
>
> Kanevsky, Arkady wrote:
> > Sean,
> > SWG discussed today the extending private data format proposal to SIDR_REQ.
> > The group does not see the need for it since ULP is no RDMA aware.
> > That is ULP does not use RDMA operations.
> > Do you have some specific ULP in mind for this functionality?
> > For UDP a different IP address can be used for each message. There is no
> > persistent connection.
>
> I didn't have any particular ULP in mind. I was thinking more of a generic
> application that wanted to use UDP style addressing over IB, similar to what's
> being discussed for using TCP style addressing over IB.
>
> It seems that there needs to be a way to map a given destination address to a
> remote QP/qkey. Regardless if the IP address is carried in each ULP message, it
> would still need to be in the SIDR REQ in order to locate the correct QP.
>

Sean,

How about using ARP to get from IP to DGID+Partition, followed by a SIDR to map DGID+PKey+Service to QKey & QP?

It is the same concept as the CMA, which first uses the IP stack (ARP etc.) to get to the remote end-point (in that case a GID+PKey combination), followed by SA-PR and CM REQ; we just substitute a SIDR REQ for the CM REQ.

It may not solve all the cases, but probably most of the practical ones.

Anyway, the packets will need to carry some header (since it's not a connected model), and you can add more stuff in that header (e.g. the IPoIB header can be used as is, since it already contains the src/dst IP).

Yaron

From yael at mellanox.co.il Wed Nov 30 03:55:25 2005
From: yael at mellanox.co.il (Yael Kalka)
Date: Wed, 30 Nov 2005 13:55:25 +0200
Subject: [openib-general] RE: [PATCH] [TRIVIAL] osm_node_info_rcv.c:osm_ni_rcv_init Move assert before variable used
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30E2448@mtlexch01.mtl.com>

Good catch...
-----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Tuesday, November 29, 2005 5:12 PM To: Yael Kalka Cc: openib-general at openib.org; Eitan Zahavi Subject: [PATCH] [TRIVIAL] osm_node_info_rcv.c:osm_ni_rcv_init Move assert before variable used osm_node_info_rcv.c:osm_ni_rcv_init Move assert before the variable is used [I think there are numerous instances of this if OpenSM were to be inspected for this. This could cause issues with --enable-debug (which have been seen but it is unclear to me whether this is the cause or the effect). Should more of this be done ?] Some other cosmetic changes Signed-off-by: Hal Rosenstock Index: osm_node_info_rcv.c =================================================================== --- osm_node_info_rcv.c (revision 4205) +++ osm_node_info_rcv.c (working copy) @@ -986,6 +986,7 @@ osm_ni_rcv_init( IN cl_plock_t* const p_lock ) { ib_api_status_t status = IB_SUCCESS; + OSM_LOG_ENTER( p_log, osm_ni_rcv_init ); osm_ni_rcv_construct( p_rcv ); @@ -1013,12 +1014,12 @@ osm_ni_rcv_process( osm_node_t *p_node; boolean_t process_new_flag = FALSE; + CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( p_rcv->p_log, osm_ni_rcv_process ); - CL_ASSERT( p_rcv ); CL_ASSERT( p_madw ); - p_guid_tbl = &p_rcv->p_subn->node_guid_tbl; p_smp = osm_madw_get_smp_ptr( p_madw ); p_ni = (ib_node_info_t*)ib_smp_get_payload_ptr( p_smp ); @@ -1041,6 +1042,9 @@ osm_ni_rcv_process( osm_dump_smp_dr_path(p_rcv->p_log, p_smp, OSM_LOG_ERROR); goto Exit; } + + p_guid_tbl = &p_rcv->p_subn->node_guid_tbl; + /* Determine if this node has already been discovered, and process accordingly. 
From yael at mellanox.co.il Wed Nov 30 04:11:27 2005 From: yael at mellanox.co.il (Yael Kalka) Date: 30 Nov 2005 14:11:27 +0200 Subject: [openib-general] [PATCH] Opensm - fix PathRecord get --text follows this line-- Message-ID: <5zd5kixrm8.fsf@mtl066.yok.mtl.com> During some tests I've noticed that in PathRecord queries the SA doesn't compare the packetLifeTime component, if relevant comp_mask is turned on. This patch fixes this. Thanks, Yael Signed-off-by: Yael Kalka Index: opensm/osm_sa_path_record.c =================================================================== --- opensm/osm_sa_path_record.c (revision 4231) +++ opensm/osm_sa_path_record.c (working copy) @@ -175,8 +175,10 @@ __osm_pr_rcv_get_path_parms( ib_api_status_t status = IB_SUCCESS; uint8_t mtu; uint8_t rate; + uint8_t pkt_life; uint8_t required_mtu; uint8_t required_rate; + uint8_t required_pkt_life; ib_net16_t dest_lid; OSM_LOG_ENTER( p_rcv->p_log, __osm_pr_rcv_get_path_parms ); @@ -510,10 +512,54 @@ __osm_pr_rcv_get_path_parms( } } + /* Verify the pkt_life_time */ + /* According to spec definition Table 171 PacketLifeTime description, + for loopback paths, packetLifeTime shall be zero. 
*/
+  if ( p_src_port == p_dest_port )
+    /* This is loopback */
+    pkt_life = 0;
+  else
+    pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT;
+
+  /* we silently ignore cases where only the PktLife selector is defined */
+  if ((comp_mask & IB_PR_COMPMASK_PKTLIFETIMESELEC) &&
+      (comp_mask & IB_PR_COMPMASK_PKTLIFETIME))
+  {
+    required_pkt_life = ib_path_rec_pkt_life( p_pr );
+    switch( ib_path_rec_pkt_life_sel( p_pr ) )
+    {
+    case 0: /* must be greater than */
+      if( pkt_life <= required_pkt_life )
+        status = IB_NOT_FOUND;
+      break;
+
+    case 1: /* must be less than */
+      if( pkt_life >= required_pkt_life )
+        status = IB_NOT_FOUND;
+      break;
+
+    case 2: /* exact match */
+      if( pkt_life != required_pkt_life )
+        status = IB_NOT_FOUND;
+      break;
+
+    case 3: /* smallest available */
+      /* can't be disqualified by this one */
+      break;
+
+    default:
+      /* if we're here, there's a bug in ib_path_rec_pkt_life_sel() */
+      CL_ASSERT( FALSE );
+      status = IB_ERROR;
+      break;
+    }
+  }
+
   p_parms->mtu = mtu;
   p_parms->rate = rate;
   p_parms->pkey = IB_DEFAULT_PKEY;
-  p_parms->pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT;
+  /* the pkt_life */
+  p_parms->pkt_life = pkt_life;
   p_parms->sl = OSM_DEFAULT_SL;

 Exit:

From ogerlitz at voltaire.com Wed Nov 30 04:22:17 2005
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 30 Nov 2005 14:22:17 +0200 (IST)
Subject: [openib-general] spinlock wrong CPU on CPU#1, ib_addr
Message-ID:

Sean,

I ran into this while running iSER (now ported from kdapl to the verbs/cma), with the rest of gen2 at svn r4052 on AMD x86_64 / kernel 2.6.14 (compiled with debug; see the exact flags below).

I am not sure whether the bug is in the ib_addr or the ib_iser module, since I see some ib_iser symbols in the BUG stack trace.

From the first iser print below the stack dump, I gather that the first and second stages of the address translation went fine, since we create the QP only once the route is resolved, and that the cma connect failed (please don't worry about that one for now).

Or.
BUG: spinlock wrong CPU on CPU#1, ib_addr/3866 lock: ffffffff88073428, .magic: dead4ead, .owner: ib_addr/3866, .owner_cpu: 0 Call Trace:{_raw_spin_unlock+112} {:ib_iser:iser_adaptor_find_by_device+188} {:ib_iser:iser_cma_handler+83} {:rdma_cm:cma_notify_user+27} {:rdma_cm:addr_handler+167} {:ib_addr:process_req+316} {worker_thread+476} {default_wake_function+0} {__wake_up_common+67} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+217} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} iser:iser_create_qp setting conn ffff8100328bfe48 qp cma_id ffff81003291b000 qp ffff810032fb4600 iser:iser_cma_handler event: 7, error: -110 iser:iser_free_qp_and_id free-ing conn ffff8100328bfe48 conn->qp ffff810032fb4600 conn->cma_id ffff81003291b000 # # Kernel hacking # # CONFIG_PRINTK_TIME is not set CONFIG_DEBUG_KERNEL=y CONFIG_MAGIC_SYSRQ=y CONFIG_LOG_BUF_SHIFT=18 CONFIG_DETECT_SOFTLOCKUP=y # CONFIG_SCHEDSTATS is not set # CONFIG_DEBUG_SLAB is not set CONFIG_DEBUG_SPINLOCK=y CONFIG_DEBUG_SPINLOCK_SLEEP=y # CONFIG_DEBUG_KOBJECT is not set # CONFIG_DEBUG_INFO is not set CONFIG_DEBUG_FS=y # CONFIG_FRAME_POINTER is not set CONFIG_INIT_DEBUG=y # CONFIG_IOMMU_DEBUG is not set CONFIG_KPROBES=y From yael at mellanox.co.il Wed Nov 30 04:35:27 2005 From: yael at mellanox.co.il (Yael Kalka) Date: 30 Nov 2005 14:35:27 +0200 Subject: [openib-general] [PATCH] Opensm - fix LinkRecord get Message-ID: <5zacfmxqi8.fsf@mtl066.yok.mtl.com> Hi Hal, During some tests I've noticed that in LinkRecord queries there are some bugs: 1. Trying to ensure the two physical ports are connected comparison isn't done correctly. 2. When __osm_lr_rcv_get_physp_link is called with physical ports not null - there is no check that the value returned is actually different than null. As a result we can get several links with the same value. This patch fixes both issues. 
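To make the first issue concrete, here is a rough, self-contained sketch of the connectivity test. The struct and helper names are simplified stand-ins invented for this note (they are not the OpenSM types or functions); only the shape of the comparison is taken from the description above. Comparing the source port's remote against the source port itself rejects every genuine link, whereas comparing it against the destination port accepts exactly the pairs that are cabled together.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplification of a physical port: only its peer pointer. */
struct physp {
    struct physp *remote;
};

static struct physp *physp_get_remote(const struct physp *p)
{
    return p->remote;
}

/* Cable two ports together (illustrative helper). */
static void physp_link(struct physp *a, struct physp *b)
{
    a->remote = b;
    b->remote = a;
}

/*
 * Shape of the original test: the source's remote is compared with the
 * source itself, so dest is never consulted and a real link is rejected.
 */
static int connected_buggy(const struct physp *src, const struct physp *dest)
{
    (void)dest;
    return physp_get_remote(src) == src;
}

/* Shape of the corrected test: the source's remote must be the destination. */
static int connected_fixed(const struct physp *src, const struct physp *dest)
{
    return physp_get_remote(src) == dest;
}

/* Returns 0 when the two checks behave as described above. */
static int demo_link_check(void)
{
    struct physp a = { NULL }, b = { NULL }, c = { NULL };
    int ok = 1;

    physp_link(&a, &b);                     /* a <-> b linked, c unconnected */

    ok &= connected_fixed(&a, &b) == 1;     /* real link accepted */
    ok &= connected_fixed(&a, &c) == 0;     /* unrelated port rejected */
    ok &= connected_buggy(&a, &b) == 0;     /* buggy form rejects a real link */

    return ok ? 0 : 1;
}
```

Calling demo_link_check() from a small main() and checking for a zero return exercises both forms of the comparison.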
Thanks, Yael Signed-off-by: Yael Kalka Index: opensm/osm_sa_link_record.c =================================================================== --- opensm/osm_sa_link_record.c (revision 4231) +++ opensm/osm_sa_link_record.c (working copy) @@ -235,7 +235,7 @@ __osm_lr_rcv_get_physp_link( Ensure the two physp's are actually connected. If not, bail out. */ - if( osm_physp_get_remote( p_src_physp ) != p_src_physp ) + if( osm_physp_get_remote( p_src_physp ) != p_dest_physp ) goto Exit; } else @@ -393,12 +393,16 @@ __osm_lr_rcv_get_port_links( { p_dest_physp = osm_port_get_phys_ptr( p_dest_port, dest_port_num ); + /* both physical ports should be with data */ + if (p_src_physp && p_dest_physp) + { __osm_lr_rcv_get_physp_link( p_rcv, p_lr, p_src_physp, p_dest_physp, comp_mask, p_list, p_req_physp ); } } } + } else { /* @@ -412,17 +416,22 @@ __osm_lr_rcv_get_port_links( if (port_num < p_src_port->physp_tbl_size) { p_src_physp = osm_port_get_phys_ptr( p_src_port, port_num ); + if (p_src_physp) + { __osm_lr_rcv_get_physp_link( p_rcv, p_lr, p_src_physp, NULL, comp_mask, p_list, p_req_physp ); } } + } else { num_ports = osm_port_get_num_physp( p_src_port ); for( port_num = 1; port_num < num_ports; port_num++ ) { p_src_physp = osm_port_get_phys_ptr( p_src_port, port_num ); + if (p_src_physp) + { __osm_lr_rcv_get_physp_link( p_rcv, p_lr, p_src_physp, NULL, comp_mask, p_list, p_req_physp ); @@ -430,6 +439,7 @@ __osm_lr_rcv_get_port_links( } } } + } else { if( p_dest_port ) @@ -446,11 +456,14 @@ __osm_lr_rcv_get_port_links( { p_dest_physp = osm_port_get_phys_ptr( p_dest_port, port_num ); + if (p_dest_physp) + { __osm_lr_rcv_get_physp_link( p_rcv, p_lr, NULL, p_dest_physp, comp_mask, p_list, p_req_physp ); } } + } else { num_ports = osm_port_get_num_physp( p_dest_port ); @@ -458,12 +471,15 @@ __osm_lr_rcv_get_port_links( { p_dest_physp = osm_port_get_phys_ptr( p_dest_port, port_num ); + if (p_dest_physp) + { __osm_lr_rcv_get_physp_link( p_rcv, p_lr, NULL, p_dest_physp, comp_mask, 
p_list, p_req_physp ); } } } + } else { /* From halr at voltaire.com Wed Nov 30 05:48:24 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Nov 2005 08:48:24 -0500 Subject: [openib-general] Re: [PATCH] Opensm - fix PathRecord get --text follows this line-- In-Reply-To: <5zd5kixrm8.fsf@mtl066.yok.mtl.com> References: <5zd5kixrm8.fsf@mtl066.yok.mtl.com> Message-ID: <1133358504.2984.9320.camel@hal.voltaire.com> On Wed, 2005-11-30 at 07:11, Yael Kalka wrote: > During some tests I've noticed that in PathRecord queries the SA > doesn't compare the packetLifeTime component, if relevant comp_mask is > turned on. This patch fixes this. Thanks. Applied. A couple of comments below. -- Hal > Signed-off-by: Yael Kalka > > Index: opensm/osm_sa_path_record.c > =================================================================== > --- opensm/osm_sa_path_record.c (revision 4231) > +++ opensm/osm_sa_path_record.c (working copy) > @@ -175,8 +175,10 @@ __osm_pr_rcv_get_path_parms( > ib_api_status_t status = IB_SUCCESS; > uint8_t mtu; > uint8_t rate; > + uint8_t pkt_life; > uint8_t required_mtu; > uint8_t required_rate; > + uint8_t required_pkt_life; > ib_net16_t dest_lid; > > OSM_LOG_ENTER( p_rcv->p_log, __osm_pr_rcv_get_path_parms ); > @@ -510,10 +512,54 @@ __osm_pr_rcv_get_path_parms( > } > } > > + /* Verify the pkt_life_time */ > + /* According to spec definition Table 171 PacketLifeTime description, Is this IBA 1.1 (rather than 1.2) ? > + for loopback paths, packetLifeTime shall be zero. 
*/ > + if ( p_src_port == p_dest_port ) > + /* This is loopback */ > + pkt_life = 0; > + else > + pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT; > + > + /* we silently ignore cases where only the PktLife selector is defined */ > + if ((comp_mask & IB_PR_COMPMASK_PKTLIFETIMESELEC) && > + (comp_mask & IB_PR_COMPMASK_PKTLIFETIME)) > + { > + required_pkt_life = ib_path_rec_pkt_life( p_pr ); > + switch( ib_path_rec_pkt_life_sel( p_pr ) ) > + { > + case 0: /* must be greater than */ > + if( pkt_life <= required_pkt_life ) > + status = IB_NOT_FOUND; > + break; > + > + case 1: /* must be less than */ > + if( pkt_life >= required_pkt_life ) > + status = IB_NOT_FOUND; > + break; For all the selector code (not just packet life time), the less than and greater than comparisons include =. Is that right ? > + case 2: /* exact match */ > + if( pkt_life != required_pkt_life ) > + status = IB_NOT_FOUND; > + break; > + > + case 3: /* smallest available */ > + /* can't be disqualified by this one */ > + break; > + > + default: > + /* if we're here, there's a bug in ib_path_rec_pkt_life_sel() */ > + CL_ASSERT( FALSE ); > + status = IB_ERROR; > + break; > + } > + } > + > p_parms->mtu = mtu; > p_parms->rate = rate; > p_parms->pkey = IB_DEFAULT_PKEY; > - p_parms->pkt_life = OSM_DEFAULT_SUBNET_TIMEOUT; > + /* the pkt_life */ > + p_parms->pkt_life = pkt_life; > p_parms->sl = OSM_DEFAULT_SL; > > Exit: > From mst at mellanox.co.il Wed Nov 30 06:29:17 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 30 Nov 2005 16:29:17 +0200 Subject: [openib-general] [PATCH] mthca: fix max wr for arbel Message-ID: <20051130142917.GT25751@mellanox.co.il> Unlike tavor, max QP/SRQ size is a power of 2 for arbel mode, despite what the documentation (of the query dev lim command) says. Without this patch, on Arbel, we can start with a qp of a legal size and get above the device limit after rounding to the next power of two. Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. 
Tsirkin Index: linux-kernel/drivers/infiniband/hw/mthca/mthca_cmd.c =================================================================== --- linux-kernel/drivers/infiniband/hw/mthca/mthca_cmd.c (revision 4153) +++ linux-kernel/drivers/infiniband/hw/mthca/mthca_cmd.c (working copy) @@ -937,10 +937,6 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev if (err) goto out; - MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET); - dev_lim->max_srq_sz = (1 << field) - 1; - MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_SZ_OFFSET); - dev_lim->max_qp_sz = (1 << field) - 1; MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_QP_OFFSET); dev_lim->reserved_qps = 1 << (field & 0xf); MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_OFFSET); @@ -1056,6 +1052,10 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev mthca_dbg(dev, "Flags: %08x\n", dev_lim->flags); if (mthca_is_memfree(dev)) { + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET); + dev_lim->max_srq_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_SZ_OFFSET); + dev_lim->max_qp_sz = 1 << field; MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSZ_SRQ_OFFSET); dev_lim->hca.arbel.resize_srq = field & 1; MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SG_RQ_OFFSET); @@ -1087,6 +1087,10 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev mthca_dbg(dev, "Max ICM size %lld MB\n", (unsigned long long) dev_lim->hca.arbel.max_icm_sz >> 20); } else { + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET); + dev_lim->max_srq_sz = (1 << field) - 1; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_SZ_OFFSET); + dev_lim->max_qp_sz = (1 << field) - 1; MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_AV_OFFSET); dev_lim->hca.tavor.max_avs = 1 << (field & 0x3f); dev_lim->mpt_entry_sz = MTHCA_MPT_ENTRY_SIZE; -- MST From halr at voltaire.com Wed Nov 30 06:19:52 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Nov 2005 09:19:52 -0500 Subject: [openib-general] Re: [PATCH] Opensm - fix LinkRecord get In-Reply-To: <5zacfmxqi8.fsf@mtl066.yok.mtl.com> 
References: <5zacfmxqi8.fsf@mtl066.yok.mtl.com> Message-ID: <1133360391.2984.9635.camel@hal.voltaire.com> On Wed, 2005-11-30 at 07:35, Yael Kalka wrote: > Hi Hal, > > During some tests I've noticed that in LinkRecord queries there are > some bugs: > 1. Trying to ensure the two physical ports are connected comparison > isn't done correctly. > 2. When __osm_lr_rcv_get_physp_link is called with physical ports not > null - there is no check that the value returned is actually different > than null. As a result we can get several links with the same value. > > This patch fixes both issues. Thanks. Applied. Some minor comments below. -- Hal > Thanks, > Yael > > Signed-off-by: Yael Kalka > > Index: opensm/osm_sa_link_record.c > =================================================================== > --- opensm/osm_sa_link_record.c (revision 4231) > +++ opensm/osm_sa_link_record.c (working copy) > @@ -235,7 +235,7 @@ __osm_lr_rcv_get_physp_link( > Ensure the two physp's are actually connected. > If not, bail out. > */ > - if( osm_physp_get_remote( p_src_physp ) != p_src_physp ) > + if( osm_physp_get_remote( p_src_physp ) != p_dest_physp ) > goto Exit; Should there be an error message here ? > } > else > @@ -393,12 +393,16 @@ __osm_lr_rcv_get_port_links( > { > p_dest_physp = osm_port_get_phys_ptr( p_dest_port, > dest_port_num ); > + /* both physical ports should be with data */ > + if (p_src_physp && p_dest_physp) > + { > __osm_lr_rcv_get_physp_link( p_rcv, p_lr, p_src_physp, > p_dest_physp, comp_mask, > p_list, p_req_physp ); > } > } > } > + } Formatting was off here (and similarly below)... I fixed it in the change that was just committed. 
> else > { > /* > @@ -412,17 +416,22 @@ __osm_lr_rcv_get_port_links( > if (port_num < p_src_port->physp_tbl_size) > { > p_src_physp = osm_port_get_phys_ptr( p_src_port, port_num ); > + if (p_src_physp) > + { > __osm_lr_rcv_get_physp_link( p_rcv, p_lr, p_src_physp, > NULL, comp_mask, p_list, > p_req_physp ); > } > } > + } > else > { > num_ports = osm_port_get_num_physp( p_src_port ); > for( port_num = 1; port_num < num_ports; port_num++ ) > { > p_src_physp = osm_port_get_phys_ptr( p_src_port, port_num ); > + if (p_src_physp) > + { > __osm_lr_rcv_get_physp_link( p_rcv, p_lr, p_src_physp, > NULL, comp_mask, p_list, > p_req_physp ); > @@ -430,6 +439,7 @@ __osm_lr_rcv_get_port_links( > } > } > } > + } > else > { > if( p_dest_port ) > @@ -446,11 +456,14 @@ __osm_lr_rcv_get_port_links( > { > p_dest_physp = osm_port_get_phys_ptr( > p_dest_port, port_num ); > + if (p_dest_physp) > + { > __osm_lr_rcv_get_physp_link( p_rcv, p_lr, NULL, > p_dest_physp, comp_mask, > p_list, p_req_physp ); > } > } > + } > else > { > num_ports = osm_port_get_num_physp( p_dest_port ); > @@ -458,12 +471,15 @@ __osm_lr_rcv_get_port_links( > { > p_dest_physp = osm_port_get_phys_ptr( > p_dest_port, port_num ); > + if (p_dest_physp) > + { > __osm_lr_rcv_get_physp_link( p_rcv, p_lr, NULL, > p_dest_physp, comp_mask, > p_list, p_req_physp ); > } > } > } > + } > else > { > /* > From halr at voltaire.com Wed Nov 30 06:28:58 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Nov 2005 09:28:58 -0500 Subject: [openib-general] [PATCH] [TRIVIAL] osm_req.c: Move assert before variable used Message-ID: <1133360937.2984.9729.camel@hal.voltaire.com> osm_req.c: Move assert before variable used Also, added additional assert into osm_req_construct Signed-off-by: Hal Rosenstock Index: osm_req.c =================================================================== --- osm_req.c (revision 4237) +++ osm_req.c (working copy) @@ -74,6 +74,8 @@ void osm_req_construct( IN osm_req_t* const p_req ) { + CL_ASSERT( p_req ); + 
cl_memclr( p_req, sizeof(*p_req) ); } @@ -129,9 +131,10 @@ osm_req_get( ib_api_status_t status = IB_SUCCESS; ib_net64_t tid; + CL_ASSERT( p_req ); + OSM_LOG_ENTER( p_req->p_log, osm_req_get ); - CL_ASSERT( p_req ); CL_ASSERT( p_path ); CL_ASSERT( attr_id ); @@ -219,9 +222,10 @@ osm_req_set( ib_api_status_t status = IB_SUCCESS; ib_net64_t tid; + CL_ASSERT( p_req ); + OSM_LOG_ENTER( p_req->p_log, osm_req_set ); - CL_ASSERT( p_req ); CL_ASSERT( p_path ); CL_ASSERT( attr_id ); CL_ASSERT( p_payload ); @@ -295,7 +299,3 @@ osm_req_set( return( status ); } - - - - From eitan at mellanox.co.il Wed Nov 30 07:03:03 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 30 Nov 2005 17:03:03 +0200 Subject: [openib-general] First Multicast Leave disconnects all other clients Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3618A3C@mtlexch01.mtl.com> Sorry for the late response. The bottom line: We are missing 3 agents in the OpenIB stack: InformInfo - handling registrations and Report dispatching ServiceRecord - tracks registrations Multicast Join/Leave - tracking registrations to multicast groups and ref-counting All these agents should be able to cleanup dead client registrations and also provide re-registration in case of SM ClientReregistration event. Please see below > > > > It seems the IBTA intent was that the IB driver will be responsible for maintaining > the list of clients > > registered to each group. > > Yes, the end node is responsible for tracking the registrations within > the node and fabricating responses when the node does not want to leave. > Is delete a different case though ? [EZ] No it is not. Delete of multicast group is really the last leave. > > > But the IB core does not track what clients registered (through SA requests) to a > particular multicast group. > > The first client to leave the group causes the rest (of the clients) to be disconnected. 
> > This is an implementation issue IMO and applies to other subscriptions > too (not just limited to multicast). [EZ] I agree it is an implementation issue. I hope it will get implemented in OpenIB. > > > My proposal is to provide an API for such registrations at both user and kernel and > track the requesting processes. > > Cleanup is also required both by process and kernel module granularity. > > Is the API the SA client request itself for this ? Shouldn't the > tracking be done there (within sa_query.c) ? [EZ] It will be hard to sniff the MADs (especially at user level) for all the registration flows. So I propose we should have ib_join/ib_leave/ib_reg_svc/ib_unreg_svc/ib_reg_inform/ib_unreg_inform, both in user land and in the kernel. > > > BTW: The same API could also handle "Client Reregistration" for multicast groups, > > Client reregistration is for all subscriptions (including ServiceRecords > and events as well). [EZ] Yes, exactly. I believe a similar problem exists for all registrations. > > > such that we could avoid the need to have that code duplicated by every client. > > I'm missing how client reregistration would help here. Can you elaborate > ? [EZ] It is related to the reference tracking: if a kernel module tracks all registrations in order to refcount them and perform cleanup, it could with similar effort also send the re-registration in the event of an SM change ... > > > But this refers to yet another API that is missing: Report dispatching which deserves > its own > > mail... > > I'm missing the connection between reregistration and report > dispatching. [EZ] Sorry for not being verbose. The need for an event dispatcher is based on the fact that only one client should respond to a Report with a ReportResp. Reports are "unsolicited" MADs coming into the device. In umad the implementation prevents any "multiple" client registration for receiving any "unsolicited" MAD - only one class-agent needs to be there handling "unsolicited" messages.
This is fine - but what it means is that when two clients want to be notified about events, they should register with that agent, and the agent should be able to dispatch the message to all registered clients as well as send only one response back. From dotanb at mellanox.co.il Wed Nov 30 08:10:16 2005 From: dotanb at mellanox.co.il (Dotan Barak) Date: Wed, 30 Nov 2005 18:10:16 +0200 Subject: [openib-general] the max_send/recv_sge value in the QP creation init attribute can't accept max s/g value in memfree devices for UD QPs Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3B8C9E0@mtlexch01.mtl.com> In memfree Mellanox HCAs (for example: 25204) there is a problem with the s/g size in the mthca driver. The reported value (from the query HCA verb at user level) is 30. When I try to create an RC QP with 30 s/g entries it fails (with 29 s/g the QP can be created). When I try to create a UD QP with 28 s/g entries it fails (with 27 s/g the QP can be created). There are 2 issues: 1) the query HCA verb reports a higher value of s/g entries than the HCA can support 2) which value should be reported by the query HCA verb as the maximum supported value of s/g entries in a QP? 29: which is supported only by RC/UC QPs or 28: which is supported by all transport types (but decreases the number of s/g entries that an RC QP can support) What do you think? Dotan Barak Software Verification Engineer Mellanox Technologies LTD Tel: +972-4-9097200 Ext: 231 Fax: +972-4-9593245 P.O. Box 86 Yokneam 20692 ISRAEL. Home: +972-77-8841095 Cell: 052-4222383 [ May the fork be with you ] From rolandd at cisco.com Wed Nov 30 09:21:17 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 30 Nov 2005 09:21:17 -0800 Subject: [openib-general] Re: [PATCH] mthca: fix max wr for arbel In-Reply-To: <20051130142917.GT25751@mellanox.co.il> (Michael S.
Tsirkin's message of "Wed, 30 Nov 2005 16:29:17 +0200") References: <20051130142917.GT25751@mellanox.co.il> Message-ID: <52iruarr02.fsf@cisco.com> Michael> Unlike tavor, max QP/SRQ size is a power of 2 for arbel Michael> mode, despite what the documentation (of the query dev Michael> lim command) says. Without this patch, on Arbel, we can Michael> start with a qp of a legal size and get above the device Michael> limit after rounding to the next power of two. This looks fine but I'm a little confused by the explanation. If the firmware reports a limit of say 14, then the current code will turn that into a limit of 16383 work queue entries, but the actual hardware limit is 16384 entries. So the code in mthca_set_qp_size() will reject attempts to create a QP with more than 16383 entries. If someone tries to create a QP with more than 8192 entries, the size will get rounded up to 16384, which will still work with the hardware. Or did you just mean that sometimes a QP will be created with a size bigger than the limit reported by the driver? I agree that's not ideal but it doesn't seem to be a major problem. - R. From rolandd at cisco.com Wed Nov 30 09:27:41 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 30 Nov 2005 09:27:41 -0800 Subject: [openib-general] [git pull] IB fixes for 2.6.15 Message-ID: <52acfmrqpe.fsf@cisco.com> Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus The pull will get the following changes: Jack Morgenstein: IB/uverbs: track multicast group membership for userspace QPs Michael S. 
Tsirkin: IB/mthca: reset QP's last pointers when transitioning to reset state IB/umad: fix RMPP handling IPoIB: reinitialize mcast structs' completions for every query IPoIB: protect child list in ipoib_ib_dev_flush IB/mthca: fix posting of send lists of length >= 255 on mem-free HCAs Roland Dreier: IPoIB: reinitialize path struct's completion for every query IPoIB: always set path->query to NULL when query finishes IPoIB: don't zero members after we allocate with kzalloc IPoIB: fix error handling in ipoib_open drivers/infiniband/core/user_mad.c | 41 ++++++----- drivers/infiniband/core/uverbs.h | 11 +++ drivers/infiniband/core/uverbs_cmd.c | 90 +++++++++++++++++++----- drivers/infiniband/core/uverbs_main.c | 21 +++++- drivers/infiniband/hw/mthca/mthca_qp.c | 34 +++++++++ drivers/infiniband/hw/mthca/mthca_wqe.h | 3 + drivers/infiniband/ulp/ipoib/ipoib_ib.c | 4 + drivers/infiniband/ulp/ipoib/ipoib_main.c | 11 ++- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 10 +-- 9 files changed, 170 insertions(+), 55 deletions(-) diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c index e73f81c..eb7f525 100644 --- a/drivers/infiniband/core/user_mad.c +++ b/drivers/infiniband/core/user_mad.c @@ -310,7 +310,7 @@ static ssize_t ib_umad_write(struct file u8 method; __be64 *tid; int ret, length, hdr_len, copy_offset; - int rmpp_active = 0; + int rmpp_active, has_rmpp_header; if (count < sizeof (struct ib_user_mad) + IB_MGMT_RMPP_HDR) return -EINVAL; @@ -360,28 +360,31 @@ static ssize_t ib_umad_write(struct file } rmpp_mad = (struct ib_rmpp_mad *) packet->mad.data; - if (ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr) & IB_MGMT_RMPP_FLAG_ACTIVE) { - /* RMPP active */ - if (!agent->rmpp_version) { - ret = -EINVAL; - goto err_ah; - } - - /* Validate that the management class can support RMPP */ - if (rmpp_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_ADM) { - hdr_len = IB_MGMT_SA_HDR; - } else if ((rmpp_mad->mad_hdr.mgmt_class >= 
IB_MGMT_CLASS_VENDOR_RANGE2_START) && - (rmpp_mad->mad_hdr.mgmt_class <= IB_MGMT_CLASS_VENDOR_RANGE2_END)) { - hdr_len = IB_MGMT_VENDOR_HDR; - } else { - ret = -EINVAL; - goto err_ah; - } - rmpp_active = 1; + if (rmpp_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_ADM) { + hdr_len = IB_MGMT_SA_HDR; copy_offset = IB_MGMT_RMPP_HDR; + has_rmpp_header = 1; + } else if (rmpp_mad->mad_hdr.mgmt_class >= IB_MGMT_CLASS_VENDOR_RANGE2_START && + rmpp_mad->mad_hdr.mgmt_class <= IB_MGMT_CLASS_VENDOR_RANGE2_END) { + hdr_len = IB_MGMT_VENDOR_HDR; + copy_offset = IB_MGMT_RMPP_HDR; + has_rmpp_header = 1; } else { hdr_len = IB_MGMT_MAD_HDR; copy_offset = IB_MGMT_MAD_HDR; + has_rmpp_header = 0; + } + + if (has_rmpp_header) + rmpp_active = ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr) & + IB_MGMT_RMPP_FLAG_ACTIVE; + else + rmpp_active = 0; + + /* Validate that the management class can support RMPP */ + if (rmpp_active && !agent->rmpp_version) { + ret = -EINVAL; + goto err_ah; } packet->msg = ib_create_send_mad(agent, diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h index ecb8301..7114e3f 100644 --- a/drivers/infiniband/core/uverbs.h +++ b/drivers/infiniband/core/uverbs.h @@ -105,12 +105,23 @@ struct ib_uverbs_event { u32 *counter; }; +struct ib_uverbs_mcast_entry { + struct list_head list; + union ib_gid gid; + u16 lid; +}; + struct ib_uevent_object { struct ib_uobject uobject; struct list_head event_list; u32 events_reported; }; +struct ib_uqp_object { + struct ib_uevent_object uevent; + struct list_head mcast_list; +}; + struct ib_ucq_object { struct ib_uobject uobject; struct ib_uverbs_file *uverbs_file; diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index ed45da8..a57d021 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -815,7 +815,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uv struct ib_uverbs_create_qp cmd; struct ib_uverbs_create_qp_resp resp; struct ib_udata udata; - 
struct ib_uevent_object *uobj; + struct ib_uqp_object *uobj; struct ib_pd *pd; struct ib_cq *scq, *rcq; struct ib_srq *srq; @@ -866,10 +866,11 @@ ssize_t ib_uverbs_create_qp(struct ib_uv attr.cap.max_recv_sge = cmd.max_recv_sge; attr.cap.max_inline_data = cmd.max_inline_data; - uobj->uobject.user_handle = cmd.user_handle; - uobj->uobject.context = file->ucontext; - uobj->events_reported = 0; - INIT_LIST_HEAD(&uobj->event_list); + uobj->uevent.uobject.user_handle = cmd.user_handle; + uobj->uevent.uobject.context = file->ucontext; + uobj->uevent.events_reported = 0; + INIT_LIST_HEAD(&uobj->uevent.event_list); + INIT_LIST_HEAD(&uobj->mcast_list); qp = pd->device->create_qp(pd, &attr, &udata); if (IS_ERR(qp)) { @@ -882,7 +883,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uv qp->send_cq = attr.send_cq; qp->recv_cq = attr.recv_cq; qp->srq = attr.srq; - qp->uobject = &uobj->uobject; + qp->uobject = &uobj->uevent.uobject; qp->event_handler = attr.event_handler; qp->qp_context = attr.qp_context; qp->qp_type = attr.qp_type; @@ -901,14 +902,14 @@ retry: goto err_destroy; } - ret = idr_get_new(&ib_uverbs_qp_idr, qp, &uobj->uobject.id); + ret = idr_get_new(&ib_uverbs_qp_idr, qp, &uobj->uevent.uobject.id); if (ret == -EAGAIN) goto retry; if (ret) goto err_destroy; - resp.qp_handle = uobj->uobject.id; + resp.qp_handle = uobj->uevent.uobject.id; resp.max_recv_sge = attr.cap.max_recv_sge; resp.max_send_sge = attr.cap.max_send_sge; resp.max_recv_wr = attr.cap.max_recv_wr; @@ -922,7 +923,7 @@ retry: } down(&file->mutex); - list_add_tail(&uobj->uobject.list, &file->ucontext->qp_list); + list_add_tail(&uobj->uevent.uobject.list, &file->ucontext->qp_list); up(&file->mutex); up(&ib_uverbs_idr_mutex); @@ -930,7 +931,7 @@ retry: return in_len; err_idr: - idr_remove(&ib_uverbs_qp_idr, uobj->uobject.id); + idr_remove(&ib_uverbs_qp_idr, uobj->uevent.uobject.id); err_destroy: ib_destroy_qp(qp); @@ -1032,7 +1033,7 @@ ssize_t ib_uverbs_destroy_qp(struct ib_u struct ib_uverbs_destroy_qp cmd; struct 
ib_uverbs_destroy_qp_resp resp; struct ib_qp *qp; - struct ib_uevent_object *uobj; + struct ib_uqp_object *uobj; int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) @@ -1046,7 +1047,12 @@ ssize_t ib_uverbs_destroy_qp(struct ib_u if (!qp || qp->uobject->context != file->ucontext) goto out; - uobj = container_of(qp->uobject, struct ib_uevent_object, uobject); + uobj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); + + if (!list_empty(&uobj->mcast_list)) { + ret = -EBUSY; + goto out; + } ret = ib_destroy_qp(qp); if (ret) @@ -1055,12 +1061,12 @@ ssize_t ib_uverbs_destroy_qp(struct ib_u idr_remove(&ib_uverbs_qp_idr, cmd.qp_handle); down(&file->mutex); - list_del(&uobj->uobject.list); + list_del(&uobj->uevent.uobject.list); up(&file->mutex); - ib_uverbs_release_uevent(file, uobj); + ib_uverbs_release_uevent(file, &uobj->uevent); - resp.events_reported = uobj->events_reported; + resp.events_reported = uobj->uevent.events_reported; kfree(uobj); @@ -1542,6 +1548,8 @@ ssize_t ib_uverbs_attach_mcast(struct ib { struct ib_uverbs_attach_mcast cmd; struct ib_qp *qp; + struct ib_uqp_object *uobj; + struct ib_uverbs_mcast_entry *mcast; int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) @@ -1550,9 +1558,36 @@ ssize_t ib_uverbs_attach_mcast(struct ib down(&ib_uverbs_idr_mutex); qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (qp && qp->uobject->context == file->ucontext) - ret = ib_attach_mcast(qp, (union ib_gid *) cmd.gid, cmd.mlid); + if (!qp || qp->uobject->context != file->ucontext) + goto out; + + uobj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); + + list_for_each_entry(mcast, &uobj->mcast_list, list) + if (cmd.mlid == mcast->lid && + !memcmp(cmd.gid, mcast->gid.raw, sizeof mcast->gid.raw)) { + ret = 0; + goto out; + } + mcast = kmalloc(sizeof *mcast, GFP_KERNEL); + if (!mcast) { + ret = -ENOMEM; + goto out; + } + + mcast->lid = cmd.mlid; + memcpy(mcast->gid.raw, cmd.gid, sizeof mcast->gid.raw); + + ret = 
ib_attach_mcast(qp, &mcast->gid, cmd.mlid); + if (!ret) { + uobj = container_of(qp->uobject, struct ib_uqp_object, + uevent.uobject); + list_add_tail(&mcast->list, &uobj->mcast_list); + } else + kfree(mcast); + +out: up(&ib_uverbs_idr_mutex); return ret ? ret : in_len; @@ -1563,7 +1598,9 @@ ssize_t ib_uverbs_detach_mcast(struct ib int out_len) { struct ib_uverbs_detach_mcast cmd; + struct ib_uqp_object *uobj; struct ib_qp *qp; + struct ib_uverbs_mcast_entry *mcast; int ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) @@ -1572,9 +1609,24 @@ ssize_t ib_uverbs_detach_mcast(struct ib down(&ib_uverbs_idr_mutex); qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); - if (qp && qp->uobject->context == file->ucontext) - ret = ib_detach_mcast(qp, (union ib_gid *) cmd.gid, cmd.mlid); + if (!qp || qp->uobject->context != file->ucontext) + goto out; + + ret = ib_detach_mcast(qp, (union ib_gid *) cmd.gid, cmd.mlid); + if (ret) + goto out; + uobj = container_of(qp->uobject, struct ib_uqp_object, uevent.uobject); + + list_for_each_entry(mcast, &uobj->mcast_list, list) + if (cmd.mlid == mcast->lid && + !memcmp(cmd.gid, mcast->gid.raw, sizeof mcast->gid.raw)) { + list_del(&mcast->list); + kfree(mcast); + break; + } + +out: up(&ib_uverbs_idr_mutex); return ret ? 
ret : in_len; diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c index de6581d..81737bd 100644 --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -160,6 +160,18 @@ void ib_uverbs_release_uevent(struct ib_ spin_unlock_irq(&file->async_file->lock); } +static void ib_uverbs_detach_umcast(struct ib_qp *qp, + struct ib_uqp_object *uobj) +{ + struct ib_uverbs_mcast_entry *mcast, *tmp; + + list_for_each_entry_safe(mcast, tmp, &uobj->mcast_list, list) { + ib_detach_mcast(qp, &mcast->gid, mcast->lid); + list_del(&mcast->list); + kfree(mcast); + } +} + static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file, struct ib_ucontext *context) { @@ -180,13 +192,14 @@ static int ib_uverbs_cleanup_ucontext(st list_for_each_entry_safe(uobj, tmp, &context->qp_list, list) { struct ib_qp *qp = idr_find(&ib_uverbs_qp_idr, uobj->id); - struct ib_uevent_object *uevent = - container_of(uobj, struct ib_uevent_object, uobject); + struct ib_uqp_object *uqp = + container_of(uobj, struct ib_uqp_object, uevent.uobject); idr_remove(&ib_uverbs_qp_idr, uobj->id); + ib_uverbs_detach_umcast(qp, uqp); ib_destroy_qp(qp); list_del(&uobj->list); - ib_uverbs_release_uevent(file, uevent); - kfree(uevent); + ib_uverbs_release_uevent(file, &uqp->uevent); + kfree(uqp); } list_for_each_entry_safe(uobj, tmp, &context->cq_list, list) { diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index dd4e133..7450550 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -871,7 +871,10 @@ int mthca_modify_qp(struct ib_qp *ibqp, qp->ibqp.srq ? 
to_msrq(qp->ibqp.srq) : NULL); mthca_wq_init(&qp->sq); + qp->sq.last = get_send_wqe(qp, qp->sq.max - 1); + mthca_wq_init(&qp->rq); + qp->rq.last = get_recv_wqe(qp, qp->rq.max - 1); if (mthca_is_memfree(dev)) { *qp->sq.db = 0; @@ -1819,6 +1822,7 @@ int mthca_arbel_post_send(struct ib_qp * { struct mthca_dev *dev = to_mdev(ibqp->device); struct mthca_qp *qp = to_mqp(ibqp); + __be32 doorbell[2]; void *wqe; void *prev_wqe; unsigned long flags; @@ -1838,6 +1842,34 @@ int mthca_arbel_post_send(struct ib_qp * ind = qp->sq.head & (qp->sq.max - 1); for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (unlikely(nreq == MTHCA_ARBEL_MAX_WQES_PER_SEND_DB)) { + nreq = 0; + + doorbell[0] = cpu_to_be32((MTHCA_ARBEL_MAX_WQES_PER_SEND_DB << 24) | + ((qp->sq.head & 0xffff) << 8) | + f0 | op0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0); + + qp->sq.head += MTHCA_ARBEL_MAX_WQES_PER_SEND_DB; + size0 = 0; + + /* + * Make sure that descriptors are written before + * doorbell record. + */ + wmb(); + *qp->sq.db = cpu_to_be32(qp->sq.head & 0xffff); + + /* + * Make sure doorbell record is written before we + * write MMIO send doorbell. 
+ */ + wmb(); + mthca_write64(doorbell, + dev->kar + MTHCA_SEND_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + if (mthca_wq_overflow(&qp->sq, nreq, qp->ibqp.send_cq)) { mthca_err(dev, "SQ %06x full (%u head, %u tail," " %d max, %d nreq)\n", qp->qpn, @@ -2014,8 +2046,6 @@ int mthca_arbel_post_send(struct ib_qp * out: if (likely(nreq)) { - __be32 doorbell[2]; - doorbell[0] = cpu_to_be32((nreq << 24) | ((qp->sq.head & 0xffff) << 8) | f0 | op0); diff --git a/drivers/infiniband/hw/mthca/mthca_wqe.h b/drivers/infiniband/hw/mthca/mthca_wqe.h index 73f1c0b..e7d2c1e 100644 --- a/drivers/infiniband/hw/mthca/mthca_wqe.h +++ b/drivers/infiniband/hw/mthca/mthca_wqe.h @@ -50,7 +50,8 @@ enum { enum { MTHCA_INVAL_LKEY = 0x100, - MTHCA_TAVOR_MAX_WQES_PER_RECV_DB = 256 + MTHCA_TAVOR_MAX_WQES_PER_RECV_DB = 256, + MTHCA_ARBEL_MAX_WQES_PER_SEND_DB = 255 }; struct mthca_next_seg { diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 54ef2fe..2388580 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -608,9 +608,13 @@ void ipoib_ib_dev_flush(void *_dev) if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) ipoib_ib_dev_up(dev); + down(&priv->vlan_mutex); + /* Flush any child interfaces too */ list_for_each_entry(cpriv, &priv->child_intfs, list) ipoib_ib_dev_flush(&cpriv->dev); + + up(&priv->vlan_mutex); } void ipoib_ib_dev_cleanup(struct net_device *dev) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 2fa3075..475d98f 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -94,8 +94,10 @@ int ipoib_open(struct net_device *dev) if (ipoib_ib_dev_open(dev)) return -EINVAL; - if (ipoib_ib_dev_up(dev)) + if (ipoib_ib_dev_up(dev)) { + ipoib_ib_dev_stop(dev); return -EINVAL; + } if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { struct ipoib_dev_priv *cpriv; @@ -398,9 +400,9 
@@ static void path_rec_completion(int stat while ((skb = __skb_dequeue(&neigh->queue))) __skb_queue_tail(&skqueue, skb); } - } else - path->query = NULL; + } + path->query = NULL; complete(&path->done); spin_unlock_irqrestore(&priv->lock, flags); @@ -428,7 +430,6 @@ static struct ipoib_path *path_rec_creat skb_queue_head_init(&path->queue); INIT_LIST_HEAD(&path->neigh_list); - init_completion(&path->done); memcpy(path->pathrec.dgid.raw, gid->raw, sizeof (union ib_gid)); path->pathrec.sgid = priv->local_gid; @@ -446,6 +447,8 @@ static int path_rec_start(struct net_dev ipoib_dbg(priv, "Start path record lookup for " IPOIB_GID_FMT "\n", IPOIB_GID_ARG(path->pathrec.dgid)); + init_completion(&path->done); + path->query_id = ib_sa_path_rec_get(priv->ca, priv->port, &path->pathrec, diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index c33ed87..ef3ee03 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -135,20 +135,14 @@ static struct ipoib_mcast *ipoib_mcast_a if (!mcast) return NULL; - init_completion(&mcast->done); - mcast->dev = dev; mcast->created = jiffies; mcast->backoff = 1; - mcast->logcount = 0; INIT_LIST_HEAD(&mcast->list); INIT_LIST_HEAD(&mcast->neigh_list); skb_queue_head_init(&mcast->pkt_queue); - mcast->ah = NULL; - mcast->query = NULL; - return mcast; } @@ -350,6 +344,8 @@ static int ipoib_mcast_sendonly_join(str rec.port_gid = priv->local_gid; rec.pkey = cpu_to_be16(priv->pkey); + init_completion(&mcast->done); + ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, IB_SA_MCMEMBER_REC_MGID | IB_SA_MCMEMBER_REC_PORT_GID | @@ -469,6 +465,8 @@ static void ipoib_mcast_join(struct net_ rec.traffic_class = priv->broadcast->mcmember.traffic_class; } + init_completion(&mcast->done); + ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, comp_mask, mcast->backoff * 1000, GFP_ATOMIC, ipoib_mcast_join_complete, From mst at 
mellanox.co.il Wed Nov 30 09:38:14 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 30 Nov 2005 19:38:14 +0200 Subject: [openib-general] Re: [PATCH] mthca: fix max wr for arbel In-Reply-To: <52iruarr02.fsf@cisco.com> References: <52iruarr02.fsf@cisco.com> Message-ID: <20051130173814.GF25751@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] mthca: fix max wr for arbel > > Michael> Unlike tavor, max QP/SRQ size is a power of 2 for arbel > Michael> mode, despite what the documentation (of the query dev > Michael> lim command) says. Without this patch, on Arbel, we can > Michael> start with a qp of a legal size and get above the device > Michael> limit after rounding to the next power of two. > > This looks fine but I'm a little confused by the explanation. If the > firmware reports a limit of say 14, then the current code will turn > that into a limit of 16383 work queue entries, but the actual hardware > limit is 16384 entries. > > So the code in mthca_set_qp_size() will reject attempts to create a QP > with more than 16383 entries. If someone tries to create a QP with > more than 8192 entries, the size will get rounded up to 16384, which > will still work with the hardware. Yes, but if you then try to use the attribute that you get from creating this QP to create a new QP, you get an error, since the driver checks the device limits first thing. > > Or did you just mean that sometimes a QP will be created with a size > bigger than the limit reported by the driver? I agree that's not > ideal but it doesn't seem to be a major problem. > > - R. > I don't claim it's a major problem, but it's a bit ugly and the fix is easy. MST -- MST From rolandd at cisco.com Wed Nov 30 09:55:59 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 30 Nov 2005 09:55:59 -0800 Subject: [openib-general] Re: [PATCH] mthca: fix max wr for arbel In-Reply-To: <20051130173814.GF25751@mellanox.co.il> (Michael S.
Tsirkin's message of "Wed, 30 Nov 2005 19:38:14 +0200") References: <52iruarr02.fsf@cisco.com> <20051130173814.GF25751@mellanox.co.il> Message-ID: <521x0yrpe8.fsf@cisco.com> Michael> I don't claim it's a major problem, but it's a bit ugly and Michael> the fix is easy. Sure -- I just wanted to make sure I understood the issue. Anyway, I applied it and queued it for 2.6.15. - R. From mshefty at ichips.intel.com Wed Nov 30 09:58:33 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 30 Nov 2005 09:58:33 -0800 Subject: [swg] RE: [openib-general] socket based connectionmodel for IBproposal -round 4 In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F8FEE06@taurus.voltaire.com> References: <35EA21F54A45CB47B879F21A91F4862F8FEE06@taurus.voltaire.com> Message-ID: <438DE849.6040506@ichips.intel.com> Yaron Haviv wrote: > How about using ARP to get from IP to DGID+Partition > Followed by an SIDR to map DGID+PKey+Service to QKey & QP > > It is the same concept as CMA that first uses IP stack (ARP etc.) to get > to the remote end-point (in that case GID+PKey combination) followed by > SA-PR and CM REQ, we just substitute the CM REQ with a SIDR REQ > It may not solve all the cases but probably most of the practical ones This was my thought as well. > Anyway the packets will need to carry some header (since it's not a > connected model), you can add more stuff in that header (e.g. can use > IPoIB header as is which contains already the src/dst IP) I was assuming that each packet would need to carry some sort of header. At this point, we may want to defer defining anything for UDP until there's a better understanding of what an application would want. My guess is that such an application will need new APIs for posting sends based on UDP addressing.
- Sean From mshefty at ichips.intel.com Wed Nov 30 10:10:56 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 30 Nov 2005 10:10:56 -0800 Subject: [openib-general] Re: spinlock wrong CPU on CPU#1, ib_addr In-Reply-To: References: Message-ID: <438DEB30.4000309@ichips.intel.com> Or Gerlitz wrote: > Or. > BUG: spinlock wrong CPU on CPU#1, ib_addr/3866 > lock: ffffffff88073428, .magic: dead4ead, .owner: ib_addr/3866, .owner_cpu: 0 I haven't seen this bug before, so I'm not sure what to make of it. The ib_addr code doesn't even acquire spinlocks. > Call Trace:{_raw_spin_unlock+112} {:ib_iser:iser_adaptor_find_by_device+188} > {:ib_iser:iser_cma_handler+83} {:rdma_cm:cma_notify_user+27} > {:rdma_cm:addr_handler+167} {:ib_addr:process_req+316} > {worker_thread+476} {default_wake_function+0} > {__wake_up_common+67} {default_wake_function+0} > {keventd_create_kthread+0} {worker_thread+0} > {keventd_create_kthread+0} {kthread+217} > {child_rip+8} {keventd_create_kthread+0} > {kthread+0} {child_rip+0} > > iser:iser_create_qp setting conn ffff8100328bfe48 qp cma_id ffff81003291b000 qp ffff810032fb4600 > iser:iser_cma_handler event: 7, error: -110 > iser:iser_free_qp_and_id free-ing conn ffff8100328bfe48 conn->qp ffff810032fb4600 conn->cma_id ffff81003291b000 Can you describe more what iSER is doing in the callback or post the code? - Sean From danb at voltaire.com Wed Nov 30 10:14:29 2005 From: danb at voltaire.com (Dan Bar Dov) Date: Wed, 30 Nov 2005 20:14:29 +0200 Subject: [openib-general] RE: [PATCH] iSER: fix cast warning on 32-bit archs Message-ID: Applied. 
Thanks, Dan > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Tuesday, November 29, 2005 8:50 PM > To: Dan Bar Dov > Cc: openib-general at openib.org > Subject: [PATCH] iSER: fix cast warning on 32-bit archs > > Fix the warning > > iser_verbs.c:692: warning: cast to pointer from integer > of different size > > on 32-bit architectures -- work request ID needs to be cast to long > before being cast to a pointer. > > Signed-off-by: Roland Dreier > > --- infiniband/ulp/iser/iser_verbs.c (revision 4210) > +++ infiniband/ulp/iser/iser_verbs.c (working copy) > @@ -1,5 +1,6 @@ > /* > * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. > + * Copyright (c) 2005 Cisco Systems. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. You may choose to be licensed under the terms > of the GNU > @@ -689,7 +690,7 @@ static void iser_cq_callback(struct ib_c > unsigned long xfer_len; > > while (ib_poll_cq(cq, 1, &wc) == 1) { > - p_dto = (struct iser_dto *)wc.wr_id; > + p_dto = (struct iser_dto *) (unsigned long) wc.wr_id; > > if (p_dto == NULL || p_dto->type >= ISER_DTO_PASSIVE) > iser_bug("NULL p_dto %p or unexpected > type\n", p_dto); > From rolandd at cisco.com Wed Nov 30 11:36:05 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 30 Nov 2005 11:36:05 -0800 Subject: [openib-general] Changes to core drivers to support RD In-Reply-To: <1133315147.28746.26.camel@hematite.internal.keyresearch.com> (Robert Walsh's message of "Tue, 29 Nov 2005 17:45:47 -0800") References: <1133315147.28746.26.camel@hematite.internal.keyresearch.com> Message-ID: <5264qaq66y.fsf@cisco.com> A couple of quick comments: - this breaks the API at least for the create QP command (since it adds an rdd_handle member to the marshalled struct), so the ABI version needs to be bumped, and compatibility code needs to be added to libibverbs to handle the old create QP struct. 
- It seems strange to use the abbreviation "ee" for EE contexts -- I think "eec" would be clearer. Also, for some reason, your attachments got corrupted with lots of carriage returns (^M). - R. From halr at voltaire.com Wed Nov 30 11:34:42 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Nov 2005 14:34:42 -0500 Subject: [openib-general] [PATCH] [TRIVIAL] OpenSM: Move more asserts before variable is used Message-ID: <1133379280.2984.12674.camel@hal.voltaire.com> OpenSM: Move more asserts before variable is used Signed-off-by: Hal Rosenstock Index: osm_pkey_rcv.c =================================================================== --- osm_pkey_rcv.c (revision 4257) +++ osm_pkey_rcv.c (working copy) @@ -71,9 +71,10 @@ void osm_pkey_rcv_destroy( IN osm_pkey_rcv_t* const p_rcv ) { + CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( p_rcv->p_log, osm_pkey_rcv_destroy ); - CL_ASSERT( p_rcv ); OSM_LOG_EXIT( p_rcv->p_log ); } @@ -124,9 +125,10 @@ osm_pkey_rcv_process( uint8_t port_num; uint16_t block_num; + CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( p_rcv->p_log, osm_pkey_rcv_process ); - CL_ASSERT( p_rcv ); CL_ASSERT( p_madw ); p_smp = osm_madw_get_smp_ptr( p_madw ); Index: osm_sm_state_mgr.c =================================================================== --- osm_sm_state_mgr.c (revision 4257) +++ osm_sm_state_mgr.c (working copy) @@ -406,10 +406,10 @@ void osm_sm_state_mgr_destroy( IN osm_sm_state_mgr_t * const p_sm_mgr ) { - OSM_LOG_ENTER( p_sm_mgr->p_log, osm_sm_state_mgr_destroy ); - CL_ASSERT( p_sm_mgr ); + OSM_LOG_ENTER( p_sm_mgr->p_log, osm_sm_state_mgr_destroy ); + cl_spinlock_destroy( &p_sm_mgr->state_lock ); cl_timer_destroy( &p_sm_mgr->polling_timer ); @@ -500,10 +500,10 @@ osm_sm_state_mgr_process( { ib_api_status_t status = IB_SUCCESS; - OSM_LOG_ENTER( p_sm_mgr->p_log, osm_sm_state_mgr_process ); - CL_ASSERT( p_sm_mgr ); + OSM_LOG_ENTER( p_sm_mgr->p_log, osm_sm_state_mgr_process ); + /* * The state lock prevents many race conditions from screwing * up the state 
transition process. @@ -760,10 +760,10 @@ osm_sm_state_mgr_check_legality( { ib_api_status_t status = IB_SUCCESS; - OSM_LOG_ENTER( p_sm_mgr->p_log, osm_sm_state_mgr_check_legality ); - CL_ASSERT( p_sm_mgr ); + OSM_LOG_ENTER( p_sm_mgr->p_log, osm_sm_state_mgr_check_legality ); + /* * The state lock prevents many race conditions from screwing * up the state transition process. Index: osm_state_mgr.c =================================================================== --- osm_state_mgr.c (revision 4257) +++ osm_state_mgr.c (working copy) @@ -86,10 +86,10 @@ void osm_state_mgr_destroy( IN osm_state_mgr_t * const p_mgr ) { - OSM_LOG_ENTER( p_mgr->p_log, osm_state_mgr_destroy ); - CL_ASSERT( p_mgr ); + OSM_LOG_ENTER( p_mgr->p_log, osm_state_mgr_destroy ); + /* destroy the locks */ cl_spinlock_destroy( &p_mgr->state_lock ); cl_spinlock_destroy( &p_mgr->idle_lock ); @@ -1881,9 +1881,10 @@ osm_state_mgr_process( ib_api_status_t status; osm_remote_sm_t *p_remote_sm; - OSM_LOG_ENTER( p_mgr->p_log, osm_state_mgr_process ); CL_ASSERT( p_mgr ); + OSM_LOG_ENTER( p_mgr->p_log, osm_state_mgr_process ); + /* if we are exiting do nothing */ if( osm_exit_flag ) signal = OSM_SIGNAL_NONE; Index: osm_sa_vlarb_record.c =================================================================== --- osm_sa_vlarb_record.c (revision 4261) +++ osm_sa_vlarb_record.c (working copy) @@ -348,6 +348,8 @@ osm_vlarb_rec_rcv_process( ib_net64_t comp_mask; osm_physp_t* p_req_physp; + CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( p_rcv->p_log, osm_vlarb_rec_rcv_process ); /* update the requestor physical port. 
*/ @@ -362,7 +364,6 @@ osm_vlarb_rec_rcv_process( goto Exit; } - CL_ASSERT( p_rcv ); CL_ASSERT( p_madw ); p_rcvd_mad = osm_madw_get_sa_mad_ptr( p_madw ); Index: osm_sa_lft_record.c =================================================================== --- osm_sa_lft_record.c (revision 4258) +++ osm_sa_lft_record.c (working copy) @@ -329,9 +329,10 @@ osm_lftr_rcv_process( ib_api_status_t status; osm_physp_t* p_req_physp; + CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( p_rcv->p_log, osm_lftr_rcv_process ); - CL_ASSERT( p_rcv ); CL_ASSERT( p_madw ); p_rcvd_mad = osm_madw_get_sa_mad_ptr( p_madw ); Index: osm_sa_portinfo_record.c =================================================================== --- osm_sa_portinfo_record.c (revision 4260) +++ osm_sa_portinfo_record.c (working copy) @@ -551,9 +551,10 @@ osm_pir_rcv_process( osm_physp_t* p_req_physp; boolean_t trusted_req = TRUE; + CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( p_rcv->p_log, osm_pir_rcv_process ); - CL_ASSERT( p_rcv ); CL_ASSERT( p_madw ); p_rcvd_mad = osm_madw_get_sa_mad_ptr( p_madw ); Index: osm_sa_pkey_record.c =================================================================== --- osm_sa_pkey_record.c (revision 4260) +++ osm_sa_pkey_record.c (working copy) @@ -344,9 +344,10 @@ osm_pkey_rec_rcv_process( ib_net64_t comp_mask; osm_physp_t* p_req_physp; + CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( p_rcv->p_log, osm_pkey_rec_rcv_process ); - CL_ASSERT( p_rcv ); CL_ASSERT( p_madw ); p_rcvd_mad = osm_madw_get_sa_mad_ptr( p_madw ); Index: osm_lin_fwd_rcv.c =================================================================== --- osm_lin_fwd_rcv.c (revision 4257) +++ osm_lin_fwd_rcv.c (working copy) @@ -75,9 +75,10 @@ void osm_lft_rcv_destroy( IN osm_lft_rcv_t* const p_rcv ) { + CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( p_rcv->p_log, osm_lft_rcv_destroy ); - CL_ASSERT( p_rcv ); OSM_LOG_EXIT( p_rcv->p_log ); } @@ -120,9 +121,10 @@ osm_lft_rcv_process( ib_net64_t node_guid; ib_api_status_t status; + CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( 
p_rcv->p_log, osm_lft_rcv_process ); - CL_ASSERT( p_rcv ); CL_ASSERT( p_madw ); p_sw_tbl = &p_rcv->p_subn->sw_guid_tbl; Index: osm_sa_slvl_record.c =================================================================== --- osm_sa_slvl_record.c (revision 4261) +++ osm_sa_slvl_record.c (working copy) @@ -324,9 +324,10 @@ osm_slvl_rec_rcv_process( ib_net64_t comp_mask; osm_physp_t* p_req_physp; + CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( p_rcv->p_log, osm_slvl_rec_rcv_process ); - CL_ASSERT( p_rcv ); CL_ASSERT( p_madw ); p_rcvd_mad = osm_madw_get_sa_mad_ptr( p_madw ); Index: osm_sminfo_rcv.c =================================================================== --- osm_sminfo_rcv.c (revision 4257) +++ osm_sminfo_rcv.c (working copy) @@ -80,9 +80,10 @@ void osm_sminfo_rcv_destroy( IN osm_sminfo_rcv_t* const p_rcv ) { + CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( p_rcv->p_log, osm_sminfo_rcv_destroy ); - CL_ASSERT( p_rcv ); OSM_LOG_EXIT( p_rcv->p_log ); } Index: osm_node_info_rcv.c =================================================================== --- osm_node_info_rcv.c (revision 4257) +++ osm_node_info_rcv.c (working copy) @@ -287,8 +287,6 @@ __osm_ni_rcv_process_new_node( p_ni = (ib_node_info_t*)ib_smp_get_payload_ptr( p_smp ); port_num = ib_node_info_get_local_port_num( p_ni ); - CL_ASSERT( p_node ); - /* Request PortInfo & NodeDescription attributes for the port that responded to the NodeInfo attribute. @@ -354,7 +352,6 @@ __osm_ni_rcv_get_node_desc( p_ni = (ib_node_info_t*)ib_smp_get_payload_ptr( p_smp ); port_num = ib_node_info_get_local_port_num( p_ni ); - /* Request PortInfo & NodeDescription attributes for the port that responded to the NodeInfo attribute. 
@@ -968,9 +965,10 @@ void osm_ni_rcv_destroy( IN osm_ni_rcv_t* const p_rcv ) { + CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( p_rcv->p_log, osm_ni_rcv_destroy ); - CL_ASSERT( p_rcv ); OSM_LOG_EXIT( p_rcv->p_log ); } Index: osm_mcast_mgr.c =================================================================== --- osm_mcast_mgr.c (revision 4257) +++ osm_mcast_mgr.c (working copy) @@ -394,9 +394,10 @@ void osm_mcast_mgr_destroy( IN osm_mcast_mgr_t* const p_mgr ) { + CL_ASSERT( p_mgr ); + OSM_LOG_ENTER( p_mgr->p_log, osm_mcast_mgr_destroy ); - CL_ASSERT( p_mgr ); OSM_LOG_EXIT( p_mgr->p_log ); } @@ -448,9 +449,10 @@ __osm_mcast_mgr_set_tbl( ib_net16_t block[IB_MCAST_BLOCK_SIZE]; osm_signal_t signal = OSM_SIGNAL_DONE; + CL_ASSERT( p_mgr ); + OSM_LOG_ENTER( p_mgr->p_log, __osm_mcast_mgr_set_tbl ); - CL_ASSERT( p_mgr ); CL_ASSERT( p_sw ); p_node = osm_switch_get_node_ptr( p_sw ); Index: osm_sa_sminfo_record.c =================================================================== --- osm_sa_sminfo_record.c (revision 4261) +++ osm_sa_sminfo_record.c (working copy) @@ -89,9 +89,10 @@ void osm_smir_rcv_destroy( IN osm_smir_rcv_t* const p_rcv ) { + CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( p_rcv->p_log, osm_smir_rcv_destroy ); - CL_ASSERT( p_rcv ); OSM_LOG_EXIT( p_rcv->p_log ); } @@ -141,9 +142,10 @@ osm_smir_rcv_process( ib_net64_t local_guid; osm_port_t* local_port; + CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( p_rcv->p_log, osm_smir_rcv_process ); - CL_ASSERT( p_rcv ); CL_ASSERT( p_madw ); p_sa_mad = osm_madw_get_sa_mad_ptr( p_madw ); Index: osm_trap_rcv.c =================================================================== --- osm_trap_rcv.c (revision 4257) +++ osm_trap_rcv.c (working copy) @@ -189,11 +189,12 @@ void osm_trap_rcv_destroy( IN osm_trap_rcv_t* const p_rcv ) { - OSM_LOG_ENTER( p_rcv->p_log, osm_trap_rcv_destroy ); - CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( p_rcv->p_log, osm_trap_rcv_destroy ); + cl_event_wheel_destroy( &p_rcv->trap_aging_tracker ); + OSM_LOG_EXIT( p_rcv->p_log ); } Index: 
osm_ucast_mgr.c =================================================================== --- osm_ucast_mgr.c (revision 4257) +++ osm_ucast_mgr.c (working copy) @@ -90,9 +90,10 @@ void osm_ucast_mgr_destroy( IN osm_ucast_mgr_t* const p_mgr ) { + CL_ASSERT( p_mgr ); + OSM_LOG_ENTER( p_mgr->p_log, osm_ucast_mgr_destroy ); - CL_ASSERT( p_mgr ); OSM_LOG_EXIT( p_mgr->p_log ); } @@ -784,9 +785,10 @@ __osm_ucast_mgr_set_table( uint32_t block_id_ho = 0; uint8_t block[IB_SMP_DATA_SIZE]; + CL_ASSERT( p_mgr ); + OSM_LOG_ENTER( p_mgr->p_log, __osm_ucast_mgr_set_table ); - CL_ASSERT( p_mgr ); CL_ASSERT( p_sw ); p_node = osm_switch_get_node_ptr( p_sw ); Index: osm_sa_node_record.c =================================================================== --- osm_sa_node_record.c (revision 4260) +++ osm_sa_node_record.c (working copy) @@ -434,9 +434,10 @@ osm_nr_rcv_process( ib_api_status_t status; osm_physp_t* p_req_physp; + CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( p_rcv->p_log, osm_nr_rcv_process ); - CL_ASSERT( p_rcv ); CL_ASSERT( p_madw ); p_rcvd_mad = osm_madw_get_sa_mad_ptr( p_madw ); Index: osm_sw_info_rcv.c =================================================================== --- osm_sw_info_rcv.c (revision 4257) +++ osm_sw_info_rcv.c (working copy) @@ -363,9 +363,10 @@ __osm_si_rcv_process_new( ib_smp_t *p_smp; cl_qmap_t *p_sw_guid_tbl; + CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( p_rcv->p_log, __osm_si_rcv_process_new ); - CL_ASSERT( p_rcv ); CL_ASSERT( p_madw ); p_sw_guid_tbl = &p_rcv->p_subn->sw_guid_tbl; @@ -581,9 +582,10 @@ void osm_si_rcv_destroy( IN osm_si_rcv_t* const p_rcv ) { + CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( p_rcv->p_log, osm_si_rcv_destroy ); - CL_ASSERT( p_rcv ); OSM_LOG_EXIT( p_rcv->p_log ); } @@ -629,9 +631,10 @@ osm_si_rcv_process( ib_net64_t node_guid; osm_si_context_t *p_context; + CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( p_rcv->p_log, osm_si_rcv_process ); - CL_ASSERT( p_rcv ); CL_ASSERT( p_madw ); p_node_guid_tbl = &p_rcv->p_subn->node_guid_tbl; Index: osm_mcast_fwd_rcv.c 
=================================================================== --- osm_mcast_fwd_rcv.c (revision 4257) +++ osm_mcast_fwd_rcv.c (working copy) @@ -77,9 +77,10 @@ void osm_mft_rcv_destroy( IN osm_mft_rcv_t* const p_rcv ) { + CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( p_rcv->p_log, osm_mft_rcv_destroy ); - CL_ASSERT( p_rcv ); OSM_LOG_EXIT( p_rcv->p_log ); } @@ -123,9 +124,10 @@ osm_mft_rcv_process( ib_net64_t node_guid; ib_api_status_t status; + CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( p_rcv->p_log, osm_mft_rcv_process ); - CL_ASSERT( p_rcv ); CL_ASSERT( p_madw ); p_sw_tbl = &p_rcv->p_subn->sw_guid_tbl; Index: osm_slvl_map_rcv.c =================================================================== --- osm_slvl_map_rcv.c (revision 4257) +++ osm_slvl_map_rcv.c (working copy) @@ -83,9 +83,10 @@ void osm_slvl_rcv_destroy( IN osm_slvl_rcv_t* const p_rcv ) { + CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( p_rcv->p_log, osm_slvl_rcv_destroy ); - CL_ASSERT( p_rcv ); OSM_LOG_EXIT( p_rcv->p_log ); } @@ -135,9 +136,10 @@ osm_slvl_rcv_process( ib_net64_t node_guid; uint8_t out_port_num, in_port_num; + CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( p_rcv->p_log, osm_slvl_rcv_process ); - CL_ASSERT( p_rcv ); CL_ASSERT( p_madw ); p_smp = osm_madw_get_smp_ptr( p_madw ); Index: osm_node_desc_rcv.c =================================================================== --- osm_node_desc_rcv.c (revision 4257) +++ osm_node_desc_rcv.c (working copy) @@ -109,9 +109,10 @@ void osm_nd_rcv_destroy( IN osm_nd_rcv_t* const p_rcv ) { + CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( p_rcv->p_log, osm_nd_rcv_destroy ); - CL_ASSERT( p_rcv ); OSM_LOG_EXIT( p_rcv->p_log ); } @@ -151,9 +152,10 @@ osm_nd_rcv_process( osm_node_t *p_node; ib_net64_t node_guid; + CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( p_rcv->p_log, osm_nd_rcv_process ); - CL_ASSERT( p_rcv ); CL_ASSERT( p_madw ); p_guid_tbl = &p_rcv->p_subn->node_guid_tbl; Index: osm_sa_mcmember_record.c =================================================================== --- 
osm_sa_mcmember_record.c (revision 4259) +++ osm_sa_mcmember_record.c (working copy) @@ -109,10 +109,12 @@ void osm_mcmr_rcv_destroy( IN osm_mcmr_recv_t* const p_rcv ) { + CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( p_rcv->p_log, osm_mcmr_rcv_destroy ); - CL_ASSERT( p_rcv ); cl_qlock_pool_destroy( &p_rcv->pool ); + OSM_LOG_EXIT( p_rcv->p_log ); } @@ -1965,9 +1967,10 @@ osm_mcmr_query_mgrp(IN osm_mcmr_recv_t* osm_physp_t* p_req_physp; boolean_t trusted_req = TRUE; + CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( p_rcv->p_log, osm_mcmr_query_mgrp ); - CL_ASSERT( p_rcv ); CL_ASSERT( p_madw ); p_rcvd_mad = osm_madw_get_sa_mad_ptr( p_madw ); @@ -2170,9 +2173,10 @@ osm_mcmr_rcv_process( ib_member_rec_t *p_recvd_mcmember_rec; boolean_t valid; + CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( p_rcv->p_log, osm_mcmr_rcv_process ); - CL_ASSERT( p_rcv ); CL_ASSERT( p_madw ); p_sa_mad = osm_madw_get_sa_mad_ptr( p_madw ); Index: osm_drop_mgr.c =================================================================== --- osm_drop_mgr.c (revision 4257) +++ osm_drop_mgr.c (working copy) @@ -81,9 +81,10 @@ void osm_drop_mgr_destroy( IN osm_drop_mgr_t* const p_mgr ) { + CL_ASSERT( p_mgr ); + OSM_LOG_ENTER( p_mgr->p_log, osm_drop_mgr_destroy ); - CL_ASSERT( p_mgr ); OSM_LOG_EXIT( p_mgr->p_log ); } @@ -596,10 +597,10 @@ osm_drop_mgr_process( uint8_t port_num; osm_physp_t *p_physp; - OSM_LOG_ENTER( p_mgr->p_log, osm_drop_mgr_process ); - CL_ASSERT( p_mgr ); + OSM_LOG_ENTER( p_mgr->p_log, osm_drop_mgr_process ); + p_node_guid_tbl = &p_mgr->p_subn->node_guid_tbl; p_port_guid_tbl = &p_mgr->p_subn->port_guid_tbl; p_lsweep_ports = &p_mgr->p_subn->light_sweep_physp_list; Index: osm_lid_mgr.c =================================================================== --- osm_lid_mgr.c (revision 4257) +++ osm_lid_mgr.c (working copy) @@ -1312,10 +1312,10 @@ osm_lid_mgr_process_subnet( osm_physp_t *p_physp; int lid_changed; - OSM_LOG_ENTER( p_mgr->p_log, osm_lid_mgr_process_subnet ); - CL_ASSERT( p_mgr ); + OSM_LOG_ENTER( p_mgr->p_log, 
osm_lid_mgr_process_subnet ); + CL_PLOCK_EXCL_ACQUIRE( p_mgr->p_lock ); CL_ASSERT( p_mgr->p_subn->sm_port_guid ); Index: osm_vl_arb_rcv.c =================================================================== --- osm_vl_arb_rcv.c (revision 4257) +++ osm_vl_arb_rcv.c (working copy) @@ -83,9 +83,10 @@ void osm_vla_rcv_destroy( IN osm_vla_rcv_t* const p_rcv ) { + CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( p_rcv->p_log, osm_vla_rcv_destroy ); - CL_ASSERT( p_rcv ); OSM_LOG_EXIT( p_rcv->p_log ); } @@ -135,9 +136,10 @@ osm_vla_rcv_process( ib_net64_t node_guid; uint8_t port_num, block_num; + CL_ASSERT( p_rcv ); + OSM_LOG_ENTER( p_rcv->p_log, osm_vla_rcv_process ); - CL_ASSERT( p_rcv ); CL_ASSERT( p_madw ); p_smp = osm_madw_get_smp_ptr( p_madw ); From rjwalsh at pathscale.com Wed Nov 30 11:50:51 2005 From: rjwalsh at pathscale.com (Robert Walsh) Date: Wed, 30 Nov 2005 11:50:51 -0800 Subject: [openib-general] Changes to core drivers to support RD In-Reply-To: <5264qaq66y.fsf@cisco.com> References: <1133315147.28746.26.camel@hematite.internal.keyresearch.com> <5264qaq66y.fsf@cisco.com> Message-ID: <1133380251.19761.9.camel@phosphene.durables.org> > - this breaks the API at least for the create QP command (since it > adds an rdd_handle member to the marshalled struct), so the ABI > version needs to be bumped, and compatibility code needs to be > added to libibverbs to handle the old create QP struct. Yup. I just wanted to get the main idea out, first. Compatibility stuff will definitely be added before this is committed. > - It seems strange to use the abbreviation "ee" for EE contexts -- I > think "eec" would be clearer. Ralph sent me this then disappeared on vacation :-) Johann and I spent a few minutes scratching our head over that and just assumed he had a reason. I'll ask him when he gets back, but we're OK with fixing it. > Also, for some reason, your attachments got corrupted with lots of > carriage returns (^M). Odd. That was just the output from svn diff. Hmm. 
Next patch, I'll check beforehand. Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. From halr at voltaire.com Wed Nov 30 13:04:16 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Nov 2005 16:04:16 -0500 Subject: [openib-general] [PATCH] [TRIVIAL] OpenSM/complib: Move assert before variable is used Message-ID: <1133384655.2984.13655.camel@hal.voltaire.com> OpenSM/complib: Move assert before variable is used Signed-off-by: Hal Rosenstock Index: cl_dispatcher.c =================================================================== --- cl_dispatcher.c (revision 4257) +++ cl_dispatcher.c (working copy) @@ -344,8 +344,8 @@ cl_disp_post( cl_dispatcher_t *p_disp; cl_disp_msg_t *p_msg; - p_disp = handle->p_disp; CL_ASSERT( p_disp ); + p_disp = handle->p_disp; CL_ASSERT( msg_id != CL_DISP_MSGID_NONE ); cl_spinlock_acquire( &p_disp->lock ); From halr at voltaire.com Wed Nov 30 13:19:41 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Nov 2005 16:19:41 -0500 Subject: [openib-general] [[PATCH] [TRIVIAL] OpenSM/libvendor: Move assert before variable is used Message-ID: <1133385581.2984.13829.camel@hal.voltaire.com> OpenSM/libvendor: Move assert before variable is used Signed-off-by: Hal Rosenstock Index: osm_vendor_ibumad.c =================================================================== --- osm_vendor_ibumad.c (revision 4265) +++ osm_vendor_ibumad.c (working copy) @@ -666,10 +666,11 @@ osm_vendor_open_port( int i = 0, umad_port_id = -1, found = 0; int ca, r; + CL_ASSERT( p_vend ); + OSM_LOG_ENTER( p_vend->p_log, osm_vendor_open_port ); CL_ASSERT( port_guid ); - CL_ASSERT( p_vend ); if (p_vend->umad_port_id >= 0) { umad_port_id = p_vend->umad_port_id; From halr at voltaire.com Wed Nov 30 13:44:25 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Nov 2005 16:44:25 -0500 Subject: [openib-general] First Multicast Leave 
disconnects all other clients
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3618A3C@mtlexch01.mtl.com>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E3618A3C@mtlexch01.mtl.com>
Message-ID: <1133387063.2984.14121.camel@hal.voltaire.com>

Hi Eitan,

On Wed, 2005-11-30 at 10:03, Eitan Zahavi wrote:
> Sorry for the late response.
>
> The bottom line:
> We are missing 3 agents in the OpenIB stack:
> InformInfo - handling registrations and Report dispatching

These are not currently used.

> ServiceRecord - tracks registrations

ServiceRecord is implemented in sa_query (and was used by AT/uAT but
that is largely historical now)

> Multicast Join/Leave - tracking registrations to multicast groups and
> ref-counting
>
> All these agents should be able to cleanup dead client registrations and
> also provide re-registration in case of SM ClientReregistration event.

In OpenIB, any Set of PortInfo (which includes ClientReregister)
currently causes a (coarse) event (LID change) which causes IPoIB client
to reregister its multicasts registrations with the SA.

> Please see below
>
> >
> > > It seems the IBTA intent was that the IB driver will be responsible
> > > for maintaining the list of clients registered to each group.
> >
> > Yes, the end node is responsible for tracking the registrations within
> > the node and fabricating responses when the node does not want to
> > leave.
> > Is delete a different case though ?
> [EZ] No it is not. Delete of multicast group is really the last leave.

There is an explicit delete. While it shouldn't be needed to be forced,
there is always some scenario where this is useful.

> > > But the IB core does not track what clients registered (through SA
> > > requests) to a particular multicast group.
> > > The first client to leave the group causes the rest (of the clients)
> > > to be disconnected.
> >
> > This is an implementation issue IMO and applies to other subscriptions
> > too (not just limited to multicast).
> [EZ] I agree it is an implementation issue. I hope it will get
> implemented in OpenIB.

It will. It's a question of priorities and timing.

> > > My proposal is to provide an API for such registrations at both user
> > > and kernel and track the requesting processes.
> > > Cleanup is also required both by process and kernel module
> > > granularity.
> >
> > Is the API the SA client request itself for this ? Shouldn't the
> > tracking be done there (within sa_query.c) ?
> [EZ] It will be hard to sniff the MADs (especially user level) for all
> the registration flows.

It's not the sniffing which is hard but perhaps identifying which client
(and reference counting).

> So I propose we should have
> ib_join/ib_leave/ib_reg_svc/ib_unreg_svc/ib_reg_inform/ib_unreg_inform.
> Both in user land and in kernel.

I think this is TBD and the API would be discussed on this list first
prior to any implementation.

> > > BTW: The same API could also handle "Client Reregistration" for
> > > multicast groups,
> >
> > Client reregistration is for all subscriptions (including
> > ServiceRecords and events as well).
> [EZ] Yes exactly. I believe similar problem exists for all
> registrations.
> >
> > > such that we could avoid the need to have that code duplicated by
> > > every client.
> >
> > I'm missing how client reregistration would help here. Can you
> > elaborate ?
> [EZ] It is related to the reference tracking:
> If a kernel module tracks all registrations to refcount them and perform
> cleanup, it could with similar effort also send the - re-registration in
> the event of SM change ...

Sure, there are multiple ways to skin the same cat.

> >
> > > But this refers to yet another API that is missing: Report
> > > dispatching which deserves its own mail...
> >
> > I'm missing the connection between reregistration and report
> > dispatching.
> [EZ] Sorry for not being verbose. The need for Events dispatcher is
> based on the fact that only one client should respond to Report with
> ReportRepress. Reports are "unsolicited" MADs coming into the device. In
> umad the implementation prevents any "multiple" client registration for
> receiving any "unsolicited" MAD - only one class-agent needs to be there
> handling "unsolicited" messages. This is fine - but what it means is
> that when two clients wants to be notified about events they should
> register with that agent and the agent should be able to dispatch the
> message to all registered clients as well as send only one response
> back.

Wouldn't report represses be reference counted and only actually sent on
the wire when all subscribed clients within the node indicated repress ?

-- Hal

From halr at voltaire.com Wed Nov 30 13:57:20 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 30 Nov 2005 16:57:20 -0500
Subject: [Fwd: [openib-general] OpenSM and Wrong SM_Key]
Message-ID: <1133387839.2984.14241.camel@hal.voltaire.com>

Hi Yael & Eitan,

Based on the recent MgtWG discussions, are you still holding your
position in terms of exiting OpenSM when a non matching SM Key is
discovered ?

Just wondering if I can issue a patch for this and clear this issue so
OpenSM can be compliant for this aspect.

Thanks.

-- Hal

-----Forwarded Message-----
From: Hal Rosenstock
To: openib-general at openib.org
Subject: [openib-general] OpenSM and Wrong SM_Key
Date: 08 Nov 2005 16:08:47 -0500

Hi,

Currently, when OpenSM receives SMInfo with a different SM_Key, it exits
as follows:

void
__osm_sminfo_rcv_process_get_response(
  IN const osm_sminfo_rcv_t* const p_rcv,
  IN const osm_madw_t* const p_madw )
{
...
  /*
    Check that the sm_key of the found SM is the same as ours,
    or is zero. If not - OpenSM cannot continue with configuration!.
*/ if ( p_smi->sm_key != 0 && p_smi->sm_key != p_rcv->p_subn->opt.sm_key ) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, "__osm_sminfo_rcv_process_get_response: ERR 2F18: " "Got SM with sm_key that doesn't match our " "local key. Exiting\n" ); osm_log( p_rcv->p_log, OSM_LOG_SYS, "Found remote SM with non-matching sm_key. Exiting\n" ); osm_exit_flag = TRUE; goto Exit; } C14-61.2.1 states that: A master SM which finds a higher priority master SM with the wrong SM_Key should not relinquish the subnet. Exiting OpenSM relinquishes the subnet. So it appears to me that perhaps this behavior of exiting OpenSM should be at least contingent on the SM state and relative priority of the SMInfo received. Make sense ? If so, I will work on a patch for this. -- Hal From johannes at erdfelt.com Wed Nov 30 14:32:37 2005 From: johannes at erdfelt.com (Johannes Erdfelt) Date: Wed, 30 Nov 2005 14:32:37 -0800 Subject: [openib-general] [PATCH] [TRIVIAL] OpenSM/complib: Move assert before variable is used In-Reply-To: <1133384655.2984.13655.camel@hal.voltaire.com> References: <1133384655.2984.13655.camel@hal.voltaire.com> Message-ID: <20051130223237.GN31093@sventech.com> On Wed, Nov 30, 2005, Hal Rosenstock wrote: > OpenSM/complib: Move assert before variable is used > > Signed-off-by: Hal Rosenstock > > Index: cl_dispatcher.c > =================================================================== > --- cl_dispatcher.c (revision 4257) > +++ cl_dispatcher.c (working copy) > @@ -344,8 +344,8 @@ cl_disp_post( > cl_dispatcher_t *p_disp; > cl_disp_msg_t *p_msg; > > - p_disp = handle->p_disp; > CL_ASSERT( p_disp ); > + p_disp = handle->p_disp; > CL_ASSERT( msg_id != CL_DISP_MSGID_NONE ); > > cl_spinlock_acquire( &p_disp->lock ); It can't be correct to check the value
before it is set, right? JE From halr at voltaire.com Wed Nov 30 14:34:05 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 1 Dec 2005 00:34:05 +0200 Subject: [openib-general] [PATCH] [TRIVIAL] OpenSM/complib: Move assert before variable is used Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AB6A@taurus.voltaire.com> Right. I've been looking at this stuff too long :-( ________________________________ From: Johannes Erdfelt [mailto:johannes at erdfelt.com] Sent: Wed 11/30/2005 5:32 PM To: Hal Rosenstock Cc: Yael Kalka; openib-general at openib.org Subject: Re: [openib-general] [PATCH] [TRIVIAL] OpenSM/complib: Move assert before variable is used On Wed, Nov 30, 2005, Hal Rosenstock wrote: > OpenSM/complib: Move assert before variable is used > > Signed-off-by: Hal Rosenstock > > Index: cl_dispatcher.c > =================================================================== > --- cl_dispatcher.c (revision 4257) > +++ cl_dispatcher.c (working copy) > @@ -344,8 +344,8 @@ cl_disp_post( > cl_dispatcher_t *p_disp; > cl_disp_msg_t *p_msg; > > - p_disp = handle->p_disp; > CL_ASSERT( p_disp ); > + p_disp = handle->p_disp; > CL_ASSERT( msg_id != CL_DISP_MSGID_NONE ); > > cl_spinlock_acquire( &p_disp->lock ); It can't be correct to check the value before it is set, right? JE From arlin.r.davis at intel.com Wed Nov 30 15:39:20 2005 From: arlin.r.davis at intel.com (Arlin Davis) Date: Wed, 30 Nov 2005 15:39:20 -0800 Subject: [openib-general] [PATCH][uDAPL] new provider with uCMA (librdmacm) support Message-ID: James, Here is a provider for the latest uCMA that is tested with dapltest, dtest, and Intel MPI. I also added a top level README with instructions. Default build is set for uCMA. See README for build and configuration details. 
Modifications: test/dtest/makefile dapl/udapl/Makefile doc/dat.conf New files: README dapl/openib_cma/README dapl/openib_cma/dapl_ib_dto.h dapl/openib_cma/dapl_ib_util.c dapl/openib_cma/dapl_ib_mem.c dapl/openib_cma/dapl_ib_cm.c dapl/openib_cma/dapl_ib_qp.c dapl/openib_cma/dapl_ib_util.h dapl/openib_cma/dapl_ib_cq.c Thanks, -arlin Signed-off by: Arlin Davis Index: test/dtest/makefile =================================================================== --- test/dtest/makefile (revision 4178) +++ test/dtest/makefile (working copy) @@ -2,7 +2,7 @@ CC = gcc CFLAGS = -O2 -g DAT_INC = ../../dat/include -DAT_LIB = /usr/lib64 +DAT_LIB = /usr/local/lib all: dtest @@ -11,6 +11,6 @@ clean: dtest: ./dtest.c $(CC) $(CFLAGS) ./dtest.c -o dtest \ - -DDAPL_PROVIDER='"OpenIB-ib0"' \ + -DDAPL_PROVIDER='"OpenIB-cma-ip"' \ -I $(DAT_INC) -L $(DAT_LIB) -ldat Index: dapl/udapl/Makefile =================================================================== --- dapl/udapl/Makefile (revision 4178) +++ dapl/udapl/Makefile (working copy) @@ -53,7 +53,7 @@ OSRELEASE=$(shell expr `uname -r | cut - # Set up the default provider # ifndef $VERBS -VERBS=openib +VERBS=openib_cma endif # @@ -149,6 +149,16 @@ CFLAGS += -I/usr/local/include/infinib endif # +# OpenIB provider with IB CMA +# +ifeq ($(VERBS),openib_cma) +PROVIDER = $(TOPDIR)/../openib_cma +CFLAGS += -DOPENIB +CFLAGS += -DCQ_WAIT_OBJECT +CFLAGS += -I/usr/local/include/infiniband +endif + +# # If an implementation supports CM and DTO completions on the same EVD # then DAPL_MERGE_CM_DTO should be set # CFLAGS += -DDAPL_MERGE_CM_DTO=1 @@ -268,6 +278,13 @@ PROVIDER_SRCS = dapl_ib_util.c dapl_ib_ dapl_ib_cm.c dapl_ib_mem.c endif +ifeq ($(VERBS),openib_cma) +LDFLAGS += -libverbs -lrdmacm +LDFLAGS += -rpath /usr/local/lib -L /usr/local/lib +PROVIDER_SRCS = dapl_ib_util.c dapl_ib_cq.c dapl_ib_qp.c \ + dapl_ib_cm.c dapl_ib_mem.c +endif + UDAPL_SRCS = dapl_init.c \ dapl_evd_create.c \ dapl_evd_query.c \ Index: dapl/openib_cma/dapl_ib_dto.h 
=================================================================== --- dapl/openib_cma/dapl_ib_dto.h (revision 0) +++ dapl/openib_cma/dapl_ib_dto.h (revision 0) @@ -0,0 +1,266 @@ +/* + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + */ + +/*************************************************************************** + * + * Module: uDAPL + * + * Filename: dapl_ib_dto.h + * + * Author: Arlin Davis + * + * Created: 3/10/2005 + * + * Description: + * + * The uDAPL openib provider - DTO operations and CQE macros + * + **************************************************************************** + * Source Control System Information + * + * $Id: $ + * + * Copyright (c) 2005 Intel Corporation. All rights reserved. 
+ * + **************************************************************************/ +#ifndef _DAPL_IB_DTO_H_ +#define _DAPL_IB_DTO_H_ + +#include "dapl_ib_util.h" + +#define DEFAULT_DS_ENTRIES 8 + +STATIC _INLINE_ int dapls_cqe_opcode(ib_work_completion_t *cqe_p); + +/* + * dapls_ib_post_recv + * + * Provider specific Post RECV function + */ +STATIC _INLINE_ DAT_RETURN +dapls_ib_post_recv ( + IN DAPL_EP *ep_ptr, + IN DAPL_COOKIE *cookie, + IN DAT_COUNT segments, + IN DAT_LMR_TRIPLET *local_iov ) +{ + ib_data_segment_t ds_array[DEFAULT_DS_ENTRIES]; + ib_data_segment_t *ds_array_p; + struct ibv_recv_wr wr; + struct ibv_recv_wr *bad_wr; + DAT_COUNT i, total_len; + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_rcv: ep %p cookie %p segs %d l_iov %p\n", + ep_ptr, cookie, segments, local_iov); + + if (segments <= DEFAULT_DS_ENTRIES) + ds_array_p = ds_array; + else + ds_array_p = + dapl_os_alloc(segments * sizeof(ib_data_segment_t)); + + if (NULL == ds_array_p) + return (DAT_INSUFFICIENT_RESOURCES); + + /* setup work request */ + total_len = 0; + wr.next = 0; + wr.num_sge = 0; + wr.wr_id = (uint64_t)(uintptr_t)cookie; + wr.sg_list = ds_array_p; + + for (i = 0; i < segments; i++) { + if (!local_iov[i].segment_length) + continue; + + ds_array_p->addr = (uint64_t) local_iov[i].virtual_address; + ds_array_p->length = local_iov[i].segment_length; + ds_array_p->lkey = local_iov[i].lmr_context; + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_rcv: l_key 0x%x va %p len %d\n", + ds_array_p->lkey, ds_array_p->addr, + ds_array_p->length ); + + total_len += ds_array_p->length; + wr.num_sge++; + ds_array_p++; + } + + if (cookie != NULL) + cookie->val.dto.size = total_len; + + if (ibv_post_recv(ep_ptr->qp_handle->cm_id->qp, &wr, &bad_wr)) + return( dapl_convert_errno(EFAULT,"ibv_recv") ); + + return DAT_SUCCESS; +} + + +/* + * dapls_ib_post_send + * + * Provider specific Post SEND function + */ +STATIC _INLINE_ DAT_RETURN +dapls_ib_post_send ( + IN DAPL_EP *ep_ptr, + IN ib_send_op_type_t op_type, 
+ IN DAPL_COOKIE *cookie, + IN DAT_COUNT segments, + IN DAT_LMR_TRIPLET *local_iov, + IN const DAT_RMR_TRIPLET *remote_iov, + IN DAT_COMPLETION_FLAGS completion_flags) +{ + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: ep %p op %d ck %p sgs", + "%d l_iov %p r_iov %p f %d\n", + ep_ptr, op_type, cookie, segments, local_iov, + remote_iov, completion_flags); + + ib_data_segment_t ds_array[DEFAULT_DS_ENTRIES]; + ib_data_segment_t *ds_array_p; + struct ibv_send_wr wr; + struct ibv_send_wr *bad_wr; + ib_hca_transport_t *ibt_ptr = + &ep_ptr->header.owner_ia->hca_ptr->ib_trans; + DAT_COUNT i, total_len; + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: ep %p cookie %p segs %d l_iov %p\n", + ep_ptr, cookie, segments, local_iov); + + if(segments <= DEFAULT_DS_ENTRIES) + ds_array_p = ds_array; + else + ds_array_p = + dapl_os_alloc(segments * sizeof(ib_data_segment_t)); + + if (NULL == ds_array_p) + return (DAT_INSUFFICIENT_RESOURCES); + + /* setup the work request */ + wr.next = 0; + wr.opcode = op_type; + wr.num_sge = 0; + wr.send_flags = 0; + wr.wr_id = (uint64_t)(uintptr_t)cookie; + wr.sg_list = ds_array_p; + total_len = 0; + + for (i = 0; i < segments; i++ ) { + if ( !local_iov[i].segment_length ) + continue; + + ds_array_p->addr = (uint64_t) local_iov[i].virtual_address; + ds_array_p->length = local_iov[i].segment_length; + ds_array_p->lkey = local_iov[i].lmr_context; + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: lkey 0x%x va %p len %d\n", + ds_array_p->lkey, ds_array_p->addr, + ds_array_p->length ); + + total_len += ds_array_p->length; + wr.num_sge++; + ds_array_p++; + } + + if (cookie != NULL) + cookie->val.dto.size = total_len; + + if ((op_type == OP_RDMA_WRITE) || (op_type == OP_RDMA_READ)) { + wr.wr.rdma.remote_addr = remote_iov->target_address; + wr.wr.rdma.rkey = remote_iov->rmr_context; + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd_rdma: rkey 0x%x va %#016Lx\n", + wr.wr.rdma.rkey, wr.wr.rdma.remote_addr); + } + + /* inline data for send or write ops */ + if 
((total_len <= ibt_ptr->max_inline_send) && + ((op_type == OP_SEND) || (op_type == OP_RDMA_WRITE))) + wr.send_flags |= IBV_SEND_INLINE; + + /* set completion flags in work request */ + wr.send_flags |= (DAT_COMPLETION_SUPPRESS_FLAG & + completion_flags) ? 0 : IBV_SEND_SIGNALED; + wr.send_flags |= (DAT_COMPLETION_BARRIER_FENCE_FLAG & + completion_flags) ? IBV_SEND_FENCE : 0; + wr.send_flags |= (DAT_COMPLETION_SOLICITED_WAIT_FLAG & + completion_flags) ? IBV_SEND_SOLICITED : 0; + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: op 0x%x flags 0x%x sglist %p, %d\n", + wr.opcode, wr.send_flags, wr.sg_list, wr.num_sge); + + if (ibv_post_send(ep_ptr->qp_handle->cm_id->qp, &wr, &bad_wr)) + return( dapl_convert_errno(EFAULT,"ibv_send") ); + + dapl_dbg_log(DAPL_DBG_TYPE_EP," post_snd: returned\n"); + return DAT_SUCCESS; +} + +STATIC _INLINE_ DAT_RETURN +dapls_ib_optional_prv_dat( + IN DAPL_CR *cr_ptr, + IN const void *event_data, + OUT DAPL_CR **cr_pp) +{ + return DAT_SUCCESS; +} + +STATIC _INLINE_ int dapls_cqe_opcode(ib_work_completion_t *cqe_p) +{ + switch (cqe_p->opcode) { + case IBV_WC_SEND: + return (OP_SEND); + case IBV_WC_RDMA_WRITE: + return (OP_RDMA_WRITE); + case IBV_WC_RDMA_READ: + return (OP_RDMA_READ); + case IBV_WC_COMP_SWAP: + return (OP_COMP_AND_SWAP); + case IBV_WC_FETCH_ADD: + return (OP_FETCH_AND_ADD); + case IBV_WC_BIND_MW: + return (OP_BIND_MW); + case IBV_WC_RECV: + return (OP_RECEIVE); + case IBV_WC_RECV_RDMA_WITH_IMM: + return (OP_RECEIVE_IMM); + default: + return (OP_INVALID); + } +} + +#define DAPL_GET_CQE_OPTYPE(cqe_p) dapls_cqe_opcode(cqe_p) +#define DAPL_GET_CQE_WRID(cqe_p) ((ib_work_completion_t*)cqe_p)->wr_id +#define DAPL_GET_CQE_STATUS(cqe_p) ((ib_work_completion_t*)cqe_p)->status +#define DAPL_GET_CQE_BYTESNUM(cqe_p) ((ib_work_completion_t*)cqe_p)->byte_len +#define DAPL_GET_CQE_IMMED_DATA(cqe_p) ((ib_work_completion_t*)cqe_p)->imm_data + +#endif /* _DAPL_IB_DTO_H_ */ Index: dapl/openib_cma/dapl_ib_util.c 
=================================================================== --- dapl/openib_cma/dapl_ib_util.c (revision 0) +++ dapl/openib_cma/dapl_ib_util.c (revision 0) @@ -0,0 +1,791 @@ +/* + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + */ + +/*************************************************************************** + * + * Module: uDAPL + * + * Filename: dapl_ib_util.c + * + * Author: Arlin Davis + * + * Created: 3/10/2005 + * + * Description: + * + * The uDAPL openib provider - init, open, close, utilities, work thread + * + **************************************************************************** + * Source Control System Information + * + * $Id: $ + * + * Copyright (c) 2005 Intel Corporation. All rights reserved. 
+ * + **************************************************************************/ +#ifdef RCSID +static const char rcsid[] = "$Id: $"; +#endif + +#include "dapl.h" +#include "dapl_adapter_util.h" +#include "dapl_ib_util.h" + +#include +#include +#include +#include + +int g_dapl_loopback_connection = 0; +int g_ib_pipe[2]; +ib_thread_state_t g_ib_thread_state = 0; +DAPL_OS_THREAD g_ib_thread; +DAPL_OS_LOCK g_hca_lock; +struct dapl_llist_entry *g_hca_list; + +/* Get IP address */ +static int getipaddr(char *name, char *addr, int len) +{ + struct addrinfo *res; + int ret; + + ret = getaddrinfo(name, NULL, NULL, &res); + if (ret) { + dapl_dbg_log(DAPL_DBG_TYPE_WARN, + " getipaddr: invalid name or address (%s)\n", + name); + return ret; + } + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " getipaddr: family %d port %d addr %d.%d.%d.%d\n", + ((struct sockaddr_in *)res->ai_addr)->sin_family, + ((struct sockaddr_in *)res->ai_addr)->sin_port, + ((struct sockaddr_in *) + res->ai_addr)->sin_addr.s_addr >> 0 & 0xff, + ((struct sockaddr_in *) + res->ai_addr)->sin_addr.s_addr >> 8 & 0xff, + ((struct sockaddr_in *) + res->ai_addr)->sin_addr.s_addr >> 16 & 0xff, + ((struct sockaddr_in *) + res->ai_addr)->sin_addr.s_addr >> 24 & 0xff ); + + if (len >= res->ai_addrlen) + memcpy(addr, res->ai_addr, res->ai_addrlen); + else + return EINVAL; + + freeaddrinfo(res); + + return 0; +} + +/* + * dapls_ib_init, dapls_ib_release + * + * Initialize Verb related items for device open + * + * Input: + * none + * + * Output: + * none + * + * Returns: + * 0 success, -1 error + * + */ +int32_t dapls_ib_init(void) +{ + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " dapl_ib_init: \n" ); + + /* initialize hca_list lock */ + dapl_os_lock_init(&g_hca_lock); + + /* initialize hca list for CQ events */ + dapl_llist_init_head(&g_hca_list); + + /* create pipe for waking up work thread */ + if (pipe(g_ib_pipe)) + return 1; + + return 0; +} + +int32_t dapls_ib_release(void) +{ + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " dapl_ib_release: 
\n"); + dapli_ib_thread_destroy(); + return 0; +} + +/* + * dapls_ib_open_hca + * + * Open HCA + * + * Input: + * *hca_name pointer to provider device name + * *ib_hca_handle_p pointer to provide HCA handle + * + * Output: + * none + * + * Return: + * DAT_SUCCESS + * dapl_convert_errno + * + */ +DAT_RETURN dapls_ib_open_hca(IN IB_HCA_NAME hca_name, IN DAPL_HCA *hca_ptr) +{ + long opts; + struct rdma_cm_id *cm_id; + union ibv_gid *gid; + int ret; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " open_hca: %s - %p\n", hca_name, hca_ptr); + + if (dapli_ib_thread_init()) + return DAT_INTERNAL_ERROR; + + + /* HCA name will be hostname or IP address */ + if (getipaddr((char*)hca_name, + (char*)&hca_ptr->hca_address, + sizeof(DAT_SOCK_ADDR6))) + return DAT_INVALID_ADDRESS; + + + /* cm_id will bind local device/GID based on IP address */ + if (rdma_create_id(&cm_id, (void*)hca_ptr)) + return DAT_INTERNAL_ERROR; + + ret = rdma_bind_addr(cm_id, + (struct sockaddr *)&hca_ptr->hca_address); + if (ret) { + rdma_destroy_id(cm_id); + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " open_hca: ERR bind (%d) %s \n", + ret, strerror(-ret)); + return DAT_INVALID_ADDRESS; + } + + /* keep reference to IB device and cm_id */ + hca_ptr->ib_trans.cm_id = cm_id; + hca_ptr->ib_hca_handle = cm_id->verbs; + hca_ptr->port_num = cm_id->port_num; + gid = &cm_id->route.addr.addr.ibaddr.sgid; + + dapl_dbg_log( + DAPL_DBG_TYPE_UTIL, + " open_hca: ctx=%p port=%d GID subnet %016llx id %016llx\n", + cm_id->verbs,cm_id->port_num, + (unsigned long long)bswap_64(gid->global.subnet_prefix), + (unsigned long long)bswap_64(gid->global.interface_id)); + + /* set inline max with env or default, get local lid and gid 0 */ + hca_ptr->ib_trans.max_inline_send = + dapl_os_get_env_val("DAPL_MAX_INLINE", INLINE_SEND_DEFAULT); + + /* EVD events without direct CQ channels, non-blocking */ + hca_ptr->ib_trans.ib_cq = + ibv_create_comp_channel(hca_ptr->ib_hca_handle); + opts = fcntl(hca_ptr->ib_trans.ib_cq->fd, F_GETFL); /* uCQ */ + if 
(opts < 0 || fcntl(hca_ptr->ib_trans.ib_cq->fd, + F_SETFL, opts | O_NONBLOCK) < 0) { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, + " open_hca: ERR with CQ FD\n" ); + goto bail; + } + + /* + * Put new hca_transport on list for async and CQ event processing + * Wakeup work thread to add to polling list + */ + dapl_llist_init_entry((DAPL_LLIST_ENTRY*)&hca_ptr->ib_trans.entry); + dapl_os_lock( &g_hca_lock ); + dapl_llist_add_tail(&g_hca_list, + (DAPL_LLIST_ENTRY*)&hca_ptr->ib_trans.entry, + &hca_ptr->ib_trans.entry); + write(g_ib_pipe[1], "w", sizeof "w"); + dapl_os_unlock(&g_hca_lock); + + dapl_dbg_log( + DAPL_DBG_TYPE_UTIL, + " open_hca: %s, %s %d.%d.%d.%d INLINE_MAX=%d\n", hca_name, + ((struct sockaddr_in *) + &hca_ptr->hca_address)->sin_family == AF_INET ? + "AF_INET":"AF_INET6", + ((struct sockaddr_in *) + &hca_ptr->hca_address)->sin_addr.s_addr >> 0 & 0xff, + ((struct sockaddr_in *) + &hca_ptr->hca_address)->sin_addr.s_addr >> 8 & 0xff, + ((struct sockaddr_in *) + &hca_ptr->hca_address)->sin_addr.s_addr >> 16 & 0xff, + ((struct sockaddr_in *) + &hca_ptr->hca_address)->sin_addr.s_addr >> 24 & 0xff, + hca_ptr->ib_trans.max_inline_send ); + + hca_ptr->ib_trans.d_hca = hca_ptr; + return DAT_SUCCESS; +bail: + rdma_destroy_id(hca_ptr->ib_trans.cm_id); + hca_ptr->ib_hca_handle = IB_INVALID_HANDLE; + return DAT_INTERNAL_ERROR; +} + + +/* + * dapls_ib_close_hca + * + * Close HCA + * + * Input: + * DAPL_HCA provide CA handle + * + * Output: + * none + * + * Return: + * DAT_SUCCESS + * dapl_convert_errno + * + */ +DAT_RETURN dapls_ib_close_hca(IN DAPL_HCA *hca_ptr) +{ + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," close_hca: %p->%p\n", + hca_ptr,hca_ptr->ib_hca_handle); + + if (hca_ptr->ib_hca_handle != IB_INVALID_HANDLE) { + if (rdma_destroy_id(hca_ptr->ib_trans.cm_id)) + return(dapl_convert_errno(errno,"ib_close_device")); + hca_ptr->ib_hca_handle = IB_INVALID_HANDLE; + } + + /* + * Remove hca from async and CQ event processing list + * Wakeup work thread to remove from polling list + */ + 
hca_ptr->ib_trans.destroy = 1; + write(g_ib_pipe[1], "w", sizeof "w"); + + /* wait for thread to remove HCA references */ + while (hca_ptr->ib_trans.destroy != 2) { + struct timespec sleep, remain; + sleep.tv_sec = 0; + sleep.tv_nsec = 10000000; /* 10 ms */ + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " ib_thread_destroy: wait on hca %p destroy\n", hca_ptr); + nanosleep (&sleep, &remain); + } + return (DAT_SUCCESS); +} + +/* + * dapls_ib_query_hca + * + * Query the hca attribute + * + * Input: + * hca_handle hca handle + * ia_attr attribute of the ia + * ep_attr attribute of the ep + * ip_addr ip address of DET NIC + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_HANDLE + */ + +DAT_RETURN dapls_ib_query_hca(IN DAPL_HCA *hca_ptr, + OUT DAT_IA_ATTR *ia_attr, + OUT DAT_EP_ATTR *ep_attr, + OUT DAT_SOCK_ADDR6 *ip_addr) +{ + struct ibv_device_attr dev_attr; + struct ibv_port_attr port_attr; + + if (hca_ptr->ib_hca_handle == NULL) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR," query_hca: BAD handle\n"); + return (DAT_INVALID_HANDLE); + } + + /* local IP address of device, set during ia_open */ + if (ip_addr != NULL) + memcpy(ip_addr, &hca_ptr->hca_address, sizeof(DAT_SOCK_ADDR6)); + + if (ia_attr == NULL && ep_attr == NULL) + return DAT_SUCCESS; + + /* query verbs for this device and port attributes */ + if (ibv_query_device(hca_ptr->ib_hca_handle, &dev_attr) || + ibv_query_port(hca_ptr->ib_hca_handle, + hca_ptr->port_num, &port_attr)) + return(dapl_convert_errno(errno,"ib_query_hca")); + + if (ia_attr != NULL) { + ia_attr->adapter_name[DAT_NAME_MAX_LENGTH - 1] = '\0'; + ia_attr->vendor_name[DAT_NAME_MAX_LENGTH - 1] = '\0'; + ia_attr->ia_address_ptr = + (DAT_IA_ADDRESS_PTR)&hca_ptr->hca_address; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " query_hca: %s %s %d.%d.%d.%d\n", hca_ptr->name, + ((struct sockaddr_in *) + ia_attr->ia_address_ptr)->sin_family == AF_INET ? 
+ "AF_INET":"AF_INET6", + ((struct sockaddr_in *) + ia_attr->ia_address_ptr)->sin_addr.s_addr >> 0 & 0xff, + ((struct sockaddr_in *) + ia_attr->ia_address_ptr)->sin_addr.s_addr >> 8 & 0xff, + ((struct sockaddr_in *) + ia_attr->ia_address_ptr)->sin_addr.s_addr >> 16 & 0xff, + ((struct sockaddr_in *) + ia_attr->ia_address_ptr)->sin_addr.s_addr >> 24 & 0xff); + + ia_attr->hardware_version_major = dev_attr.hw_ver; + ia_attr->max_eps = dev_attr.max_qp; + ia_attr->max_dto_per_ep = dev_attr.max_qp_wr; + ia_attr->max_rdma_read_per_ep = dev_attr.max_qp_rd_atom; + ia_attr->max_evds = dev_attr.max_cq; + ia_attr->max_evd_qlen = dev_attr.max_cqe; + ia_attr->max_iov_segments_per_dto = dev_attr.max_sge; + ia_attr->max_lmrs = dev_attr.max_mr; + ia_attr->max_lmr_block_size = dev_attr.max_mr_size; + ia_attr->max_rmrs = dev_attr.max_mw; + ia_attr->max_lmr_virtual_address = dev_attr.max_mr_size; + ia_attr->max_rmr_target_address = dev_attr.max_mr_size; + ia_attr->max_pzs = dev_attr.max_pd; + ia_attr->max_mtu_size = port_attr.max_msg_sz; + ia_attr->max_rdma_size = port_attr.max_msg_sz; + ia_attr->num_transport_attr = 0; + ia_attr->transport_attr = NULL; + ia_attr->num_vendor_attr = 0; + ia_attr->vendor_attr = NULL; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " query_hca: (ver=%x) ep %d ep_q %d evd %d evd_q %d\n", + ia_attr->hardware_version_major, + ia_attr->max_eps, ia_attr->max_dto_per_ep, + ia_attr->max_evds, ia_attr->max_evd_qlen ); + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " query_hca: msg %llu rdma %llu iov %d lmr %d rmr %d\n", + ia_attr->max_mtu_size, ia_attr->max_rdma_size, + ia_attr->max_iov_segments_per_dto, ia_attr->max_lmrs, + ia_attr->max_rmrs ); + } + + if (ep_attr != NULL) { + ep_attr->max_mtu_size = port_attr.max_msg_sz; + ep_attr->max_rdma_size = port_attr.max_msg_sz; + ep_attr->max_recv_dtos = dev_attr.max_qp_wr; + ep_attr->max_request_dtos = dev_attr.max_qp_wr; + ep_attr->max_recv_iov = dev_attr.max_sge; + ep_attr->max_request_iov = dev_attr.max_sge; + 
ep_attr->max_rdma_read_in = dev_attr.max_qp_rd_atom; + ep_attr->max_rdma_read_out= dev_attr.max_qp_rd_atom; + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " query_hca: MAX msg %llu dto %d iov %d rdma i%d,o%d\n", + ep_attr->max_mtu_size, + ep_attr->max_recv_dtos, ep_attr->max_recv_iov, + ep_attr->max_rdma_read_in, ep_attr->max_rdma_read_out); + } + return DAT_SUCCESS; +} + +/* + * dapls_ib_setup_async_callback + * + * Set up an asynchronous callbacks of various kinds + * + * Input: + * ia_handle IA handle + * handler_type type of handler to set up + * callback_handle handle param for completion callbacks + * callback callback routine pointer + * context argument for callback routine + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * DAT_INVALID_PARAMETER + * + */ +DAT_RETURN dapls_ib_setup_async_callback(IN DAPL_IA *ia_ptr, + IN DAPL_ASYNC_HANDLER_TYPE type, + IN DAPL_EVD *evd_ptr, + IN ib_async_handler_t callback, + IN void *context) + +{ + ib_hca_transport_t *hca_ptr; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " setup_async_cb: ia %p type %d hdl %p cb %p ctx %p\n", + ia_ptr, type, evd_ptr, callback, context); + + hca_ptr = &ia_ptr->hca_ptr->ib_trans; + switch(type) + { + case DAPL_ASYNC_UNAFILIATED: + hca_ptr->async_unafiliated = + (ib_async_handler_t)callback; + hca_ptr->async_un_ctx = context; + break; + case DAPL_ASYNC_CQ_ERROR: + hca_ptr->async_cq_error = + (ib_async_cq_handler_t)callback; + break; + case DAPL_ASYNC_CQ_COMPLETION: + hca_ptr->async_cq = + (ib_async_dto_handler_t)callback; + break; + case DAPL_ASYNC_QP_ERROR: + hca_ptr->async_qp_error = + (ib_async_qp_handler_t)callback; + break; + default: + break; + } + return DAT_SUCCESS; +} + +int dapli_ib_thread_init(void) +{ + long opts; + DAT_RETURN ret; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " ib_thread_init(%d)\n", getpid()); + + dapl_os_lock(&g_hca_lock); + if (g_ib_thread_state != IB_THREAD_INIT) { + dapl_os_unlock(&g_hca_lock); + return 0; + } + + /* uCMA events 
non-blocking */ + opts = fcntl(rdma_get_fd(), F_GETFL); /* uCMA */ + if (opts < 0 || fcntl(rdma_get_fd(), + F_SETFL, opts | O_NONBLOCK) < 0) { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, + " dapl_ib_init: ERR with uCMA FD\n" ); + dapl_os_unlock(&g_hca_lock); + return 1; + } + + g_ib_thread_state = IB_THREAD_CREATE; + dapl_os_unlock(&g_hca_lock); + + /* create thread to process inbound connect request */ + ret = dapl_os_thread_create(dapli_thread, NULL, &g_ib_thread); + if (ret != DAT_SUCCESS) + { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " ib_thread_init: failed to create thread\n"); + return 1; + } + + /* wait for thread to start */ + dapl_os_lock(&g_hca_lock); + while (g_ib_thread_state != IB_THREAD_RUN) { + struct timespec sleep, remain; + sleep.tv_sec = 0; + sleep.tv_nsec = 20000000; /* 20 ms */ + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " ib_thread_init: waiting for ib_thread\n"); + dapl_os_unlock(&g_hca_lock); + nanosleep (&sleep, &remain); + dapl_os_lock(&g_hca_lock); + } + dapl_os_unlock(&g_hca_lock); + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " ib_thread_init(%d) exit\n",getpid()); + return 0; +} + +void dapli_ib_thread_destroy(void) +{ + int retries = 10; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " ib_thread_destroy(%d)\n", getpid()); + /* + * wait for async thread to terminate. 
+ * pthread_join would be the correct method + * but some applications have some issues + */ + + /* destroy ib_thread, wait for termination, if not already */ + dapl_os_lock(&g_hca_lock); + if (g_ib_thread_state != IB_THREAD_RUN) + goto bail; + + g_ib_thread_state = IB_THREAD_CANCEL; + write(g_ib_pipe[1], "w", sizeof "w"); + while ((g_ib_thread_state != IB_THREAD_EXIT) && (retries--)) { + struct timespec sleep, remain; + sleep.tv_sec = 0; + sleep.tv_nsec = 20000000; /* 20 ms */ + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " ib_thread_destroy: waiting for ib_thread\n"); + write(g_ib_pipe[1], "w", sizeof "w"); + dapl_os_unlock( &g_hca_lock ); + nanosleep(&sleep, &remain); + dapl_os_lock( &g_hca_lock ); + } + +bail: + dapl_os_unlock( &g_hca_lock ); + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " ib_thread_destroy(%d) exit\n",getpid()); +} + +void dapli_async_event_cb(struct _ib_hca_transport *hca) +{ + struct ibv_async_event event; + struct pollfd async_fd = { + .fd = hca->cm_id->verbs->async_fd, + .events = POLLIN, + .revents = 0 + }; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " async_event(%p)\n",hca); + + if (hca->destroy) + return; + + if ((poll(&async_fd, 1, 0)==1) && + (!ibv_get_async_event(hca->cm_id->verbs, &event))) { + + switch (event.event_type) { + case IBV_EVENT_CQ_ERR: + { + struct dapl_ep *evd_ptr = + event.element.cq->cq_context; + + dapl_dbg_log( + DAPL_DBG_TYPE_WARN, + " async_event CQ (%p) ERR %d\n", + evd_ptr, event.event_type); + + /* report up if async callback still setup */ + if (hca->async_cq_error) + hca->async_cq_error(hca->cm_id->verbs, + event.element.cq, + &event, + (void*)evd_ptr); + break; + } + case IBV_EVENT_COMM_EST: + { + /* Received msgs on connected QP before RTU */ + dapl_dbg_log( + DAPL_DBG_TYPE_UTIL, + " async_event COMM_EST(%p) rdata beat RTU\n", + event.element.qp); + + break; + } + case IBV_EVENT_QP_FATAL: + case IBV_EVENT_QP_REQ_ERR: + case IBV_EVENT_QP_ACCESS_ERR: + case IBV_EVENT_QP_LAST_WQE_REACHED: + case IBV_EVENT_SRQ_ERR: + case 
IBV_EVENT_SRQ_LIMIT_REACHED: + case IBV_EVENT_SQ_DRAINED: + { + struct dapl_ep *ep_ptr = + event.element.qp->qp_context; + + dapl_dbg_log( + DAPL_DBG_TYPE_WARN, + " async_event QP (%p) ERR %d\n", + ep_ptr, event.event_type); + + /* report up if async callback still setup */ + if (hca->async_qp_error) + hca->async_qp_error(hca->cm_id->verbs, + ep_ptr->qp_handle, + &event, + (void*)ep_ptr); + break; + } + case IBV_EVENT_PATH_MIG: + case IBV_EVENT_PATH_MIG_ERR: + case IBV_EVENT_DEVICE_FATAL: + case IBV_EVENT_PORT_ACTIVE: + case IBV_EVENT_PORT_ERR: + case IBV_EVENT_LID_CHANGE: + case IBV_EVENT_PKEY_CHANGE: + case IBV_EVENT_SM_CHANGE: + { + dapl_dbg_log(DAPL_DBG_TYPE_WARN, + " async_event: DEV ERR %d\n", + event.event_type); + + /* report up if async callback still setup */ + if (hca->async_unafiliated) + hca->async_unafiliated( + hca->cm_id->verbs, + &event, + hca->async_un_ctx); + break; + } + default: + dapl_dbg_log (DAPL_DBG_TYPE_WARN, + " async_event: UNKNOWN\n"); + break; + + } + ibv_ack_async_event(&event); + } +} + +/* work thread for uAT, uCM, CQ, and async events */ +void dapli_thread(void *arg) +{ + struct pollfd ufds[__FD_SETSIZE]; + struct _ib_hca_transport *uhca[__FD_SETSIZE]={NULL}; + struct _ib_hca_transport *hca; + int ret,idx,fds; + char rbuf[2]; + + dapl_dbg_log (DAPL_DBG_TYPE_CM, + " ib_thread(%d,0x%x): ENTER: pipe %d ucma %d\n", + getpid(), g_ib_thread, g_ib_pipe[0], rdma_get_fd()); + + /* Poll across pipe, CM, AT never changes */ + dapl_os_lock( &g_hca_lock ); + g_ib_thread_state = IB_THREAD_RUN; + + ufds[0].fd = g_ib_pipe[0]; /* pipe */ + ufds[0].events = POLLIN; + ufds[1].fd = rdma_get_fd(); /* uCMA */ + ufds[1].events = POLLIN; + + while (g_ib_thread_state == IB_THREAD_RUN) { + + /* build ufds after pipe and uCMA events */ + ufds[0].revents = 0; + ufds[1].revents = 0; + idx=1; + + /* Walk HCA list and setup async and CQ events */ + if (!dapl_llist_is_empty(&g_hca_list)) + hca = dapl_llist_peek_head(&g_hca_list); + else + hca = NULL; + + 
while(hca) { + + /* uASYNC events */ + ufds[++idx].fd = hca->cm_id->verbs->async_fd; + ufds[idx].events = POLLIN; + ufds[idx].revents = 0; + uhca[idx] = hca; + + /* uCQ, non-direct events */ + ufds[++idx].fd = hca->ib_cq->fd; + ufds[idx].events = POLLIN; + ufds[idx].revents = 0; + uhca[idx] = hca; + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " ib_thread(%d) poll_fd: hca[%d]=%p, async=%d" + " pipe=%d cm=%d cq=d\n", + getpid(), hca, ufds[idx-1].fd, + ufds[0].fd, ufds[1].fd, ufds[idx].fd); + + hca = dapl_llist_next_entry( + &g_hca_list, + (DAPL_LLIST_ENTRY*)&hca->entry); + } + + /* unlock, and setup poll */ + fds = idx+1; + dapl_os_unlock(&g_hca_lock); + ret = poll(ufds, fds, -1); + if (ret <= 0) { + dapl_dbg_log(DAPL_DBG_TYPE_WARN, + " ib_thread(%d): ERR %s poll\n", + getpid(),strerror(errno)); + dapl_os_lock(&g_hca_lock); + continue; + } + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " ib_thread(%d) poll_event: " + " async=0x%x pipe=0x%x cm=0x%x cq=0x%x\n", + getpid(), ufds[idx-1].revents, ufds[0].revents, + ufds[1].revents, ufds[idx].revents); + + /* uCMA events */ + if (ufds[1].revents == POLLIN) + dapli_cma_event_cb(); + + /* check and process CQ and ASYNC events, per device */ + for(idx=2;idxdestroy == 1) { + dapl_os_lock(&g_hca_lock); + dapl_llist_remove_entry( + &g_hca_list, + (DAPL_LLIST_ENTRY*) + &uhca[idx]->entry); + dapl_os_unlock(&g_hca_lock); + uhca[idx]->destroy = 2; + } + } + } + dapl_os_lock(&g_hca_lock); + } + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," ib_thread(%d) EXIT\n",getpid()); + g_ib_thread_state = IB_THREAD_EXIT; + dapl_os_unlock(&g_hca_lock); +} + Index: dapl/openib_cma/dapl_ib_mem.c =================================================================== --- dapl/openib_cma/dapl_ib_mem.c (revision 0) +++ dapl/openib_cma/dapl_ib_mem.c (revision 0) @@ -0,0 +1,391 @@ +/* + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * 
http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + */ + +/********************************************************************** + * + * MODULE: dapl_det_mem.c + * + * PURPOSE: Intel DET APIs: Memory windows, registration, + * and protection domain + * + * $Id: $ + * + **********************************************************************/ + +#include /* for IOCTL's */ +#include /* for socket(2) and related bits and pieces */ +#include /* for socket(2) */ +#include /* for struct ifreq */ +#include /* for ARPHRD_ETHER */ +#include /* for _SC_CLK_TCK */ + +#include "dapl.h" +#include "dapl_adapter_util.h" +#include "dapl_lmr_util.h" + +/* + * dapls_convert_privileges + * + * Convert LMR privileges to provider + * + * Input: + * DAT_MEM_PRIV_FLAGS + * + * Output: + * none + * + * Returns: + * ibv_access_flags + * + */ +STATIC _INLINE_ int +dapls_convert_privileges(IN DAT_MEM_PRIV_FLAGS privileges) +{ + int access = 0; + + /* + * if (DAT_MEM_PRIV_LOCAL_READ_FLAG & privileges) do nothing + */ + if (DAT_MEM_PRIV_LOCAL_WRITE_FLAG & privileges) + access |= IBV_ACCESS_LOCAL_WRITE; + if (DAT_MEM_PRIV_REMOTE_WRITE_FLAG & privileges) + access |= IBV_ACCESS_REMOTE_WRITE; + if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) + access |= 
IBV_ACCESS_REMOTE_READ; + if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) + access |= IBV_ACCESS_REMOTE_READ; + if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) + access |= IBV_ACCESS_REMOTE_READ; + + return access; +} + +/* + * dapl_ib_pd_alloc + * + * Alloc a PD + * + * Input: + * ia_handle IA handle + * pz pointer to PZ struct + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * + */ +DAT_RETURN +dapls_ib_pd_alloc(IN DAPL_IA *ia_ptr, IN DAPL_PZ *pz) +{ + /* get a protection domain */ + pz->pd_handle = ibv_alloc_pd(ia_ptr->hca_ptr->ib_hca_handle); + if (!pz->pd_handle) + return(dapl_convert_errno(ENOMEM,"alloc_pd")); + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " pd_alloc: pd_handle=%p\n", + pz->pd_handle ); + + return DAT_SUCCESS; +} + +/* + * dapl_ib_pd_free + * + * Free a PD + * + * Input: + * ia_handle IA handle + * PZ_ptr pointer to PZ struct + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_STATE + * + */ +DAT_RETURN +dapls_ib_pd_free(IN DAPL_PZ *pz ) +{ + if (pz->pd_handle != IB_INVALID_HANDLE) { + if (ibv_dealloc_pd(pz->pd_handle)) + return(dapl_convert_errno(errno,"dealloc_pd")); + pz->pd_handle = IB_INVALID_HANDLE; + } + return DAT_SUCCESS; +} + +/* + * dapl_ib_mr_register + * + * Register a virtual memory region + * + * Input: + * ia_handle IA handle + * lmr pointer to dapl_lmr struct + * virt_addr virtual address of beginning of mem region + * length length of memory region + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * + */ +DAT_RETURN +dapls_ib_mr_register(IN DAPL_IA *ia_ptr, + IN DAPL_LMR *lmr, + IN DAT_PVOID virt_addr, + IN DAT_VLEN length, + IN DAT_MEM_PRIV_FLAGS privileges) +{ + ib_pd_handle_t ib_pd_handle; + + ib_pd_handle = ((DAPL_PZ *)lmr->param.pz_handle)->pd_handle; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " mr_register: ia=%p, lmr=%p va=%p ln=%d pv=0x%x\n", + ia_ptr, lmr, virt_addr, length, privileges ); + + /* TODO: shared memory */ + if 
(lmr->param.mem_type == DAT_MEM_TYPE_SHARED_VIRTUAL) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " mr_register_shared: NOT IMPLEMENTED\n"); + return DAT_ERROR (DAT_NOT_IMPLEMENTED, DAT_NO_SUBTYPE); + } + + /* local read is default on IB */ + lmr->mr_handle = + ibv_reg_mr(((DAPL_PZ *)lmr->param.pz_handle)->pd_handle, + virt_addr, + length, + dapls_convert_privileges(privileges)); + + if (!lmr->mr_handle) + return(dapl_convert_errno(ENOMEM,"reg_mr")); + + lmr->param.lmr_context = lmr->mr_handle->lkey; + lmr->param.rmr_context = lmr->mr_handle->rkey; + lmr->param.registered_size = length; + lmr->param.registered_address = (DAT_VADDR)(uintptr_t)virt_addr; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " mr_register: mr=%p h %x pd %p ctx %p " + "lkey=0x%x rkey=0x%x priv=%x\n", + lmr->mr_handle, lmr->mr_handle->handle, + lmr->mr_handle->pd, lmr->mr_handle->context, + lmr->mr_handle->lkey, lmr->mr_handle->rkey, + length, dapls_convert_privileges(privileges)); + + return DAT_SUCCESS; +} + +/* + * dapl_ib_mr_deregister + * + * Free a memory region + * + * Input: + * lmr pointer to dapl_lmr struct + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_STATE + * + */ +DAT_RETURN +dapls_ib_mr_deregister(IN DAPL_LMR *lmr) +{ + if (lmr->mr_handle != IB_INVALID_HANDLE) { + if (ibv_dereg_mr(lmr->mr_handle)) + return(dapl_convert_errno(errno,"dereg_pd")); + lmr->mr_handle = IB_INVALID_HANDLE; + } + return DAT_SUCCESS; +} + + +/* + * dapl_ib_mr_register_shared + * + * Register a virtual memory region + * + * Input: + * ia_ptr IA handle + * lmr pointer to dapl_lmr struct + * virt_addr virtual address of beginning of mem region + * length length of memory region + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * + */ +DAT_RETURN +dapls_ib_mr_register_shared(IN DAPL_IA *ia_ptr, + IN DAPL_LMR *lmr, + IN DAT_MEM_PRIV_FLAGS privileges) +{ + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " mr_register_shared: NOT IMPLEMENTED\n"); + + return 
DAT_ERROR(DAT_NOT_IMPLEMENTED, DAT_NO_SUBTYPE); +} + +/* + * dapls_ib_mw_alloc + * + * Bind a protection domain to a memory window + * + * Input: + * rmr Initialized rmr to hold binding handles + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * + */ +DAT_RETURN +dapls_ib_mw_alloc (IN DAPL_RMR *rmr) +{ + + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " mw_alloc: NOT IMPLEMENTED\n"); + + return DAT_ERROR(DAT_NOT_IMPLEMENTED, DAT_NO_SUBTYPE); +} + +/* + * dapls_ib_mw_free + * + * Release bindings of a protection domain to a memory window + * + * Input: + * rmr Initialized rmr to hold binding handles + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_STATE + * + */ +DAT_RETURN +dapls_ib_mw_free(IN DAPL_RMR *rmr) +{ + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " mw_free: NOT IMPLEMENTED\n"); + + return DAT_ERROR(DAT_NOT_IMPLEMENTED, DAT_NO_SUBTYPE); +} + +/* + * dapls_ib_mw_bind + * + * Bind a protection domain to a memory window + * + * Input: + * rmr Initialized rmr to hold binding handles + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_PARAMETER; + * DAT_INSUFFICIENT_RESOURCES + * + */ +DAT_RETURN +dapls_ib_mw_bind(IN DAPL_RMR *rmr, + IN DAPL_LMR *lmr, + IN DAPL_EP *ep, + IN DAPL_COOKIE *cookie, + IN DAT_VADDR virtual_address, + IN DAT_VLEN length, + IN DAT_MEM_PRIV_FLAGS mem_priv, + IN DAT_BOOLEAN is_signaled) +{ + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " mw_bind: NOT IMPLEMENTED\n"); + + return DAT_ERROR (DAT_NOT_IMPLEMENTED, DAT_NO_SUBTYPE); +} + +/* + * dapls_ib_mw_unbind + * + * Unbind a protection domain from a memory window + * + * Input: + * rmr Initialized rmr to hold binding handles + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_PARAMETER; + * DAT_INVALID_STATE; + * DAT_INSUFFICIENT_RESOURCES + * + */ +DAT_RETURN +dapls_ib_mw_unbind(IN DAPL_RMR *rmr, + IN DAPL_EP *ep, + IN DAPL_COOKIE *cookie, + IN DAT_BOOLEAN is_signaled ) +{ + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " 
mw_unbind: NOT IMPLEMENTED\n"); + + return DAT_ERROR(DAT_NOT_IMPLEMENTED, DAT_NO_SUBTYPE); +} + +/* + * Local variables: + * c-indent-level: 4 + * c-basic-offset: 4 + * tab-width: 8 + * End: + */ + Index: dapl/openib_cma/dapl_ib_cm.c =================================================================== --- dapl/openib_cma/dapl_ib_cm.c (revision 0) +++ dapl/openib_cma/dapl_ib_cm.c (revision 0) @@ -0,0 +1,1107 @@ +/* + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + */ + +/*************************************************************************** + * + * Module: uDAPL + * + * Filename: dapl_ib_cm.c + * + * Author: Arlin Davis + * + * Created: 3/10/2005 + * + * Description: + * + * The uDAPL openib provider - connection management + * + **************************************************************************** + * Source Control System Information + * + * $Id: $ + * + * Copyright (c) 2005 Voltaire Inc. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. 
+ * Copyright (c) 2004-2005, Mellanox Technologies, Inc. All rights reserved. + * Copyright (c) 2003 Topspin Corporation. All rights reserved. + * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. + * + **************************************************************************/ + +#include "dapl.h" +#include "dapl_adapter_util.h" +#include "dapl_evd_util.h" +#include "dapl_cr_util.h" +#include "dapl_name_service.h" +#include "dapl_ib_util.h" +#include +#include + +/* local prototypes */ +static struct dapl_cm_id * dapli_req_recv(struct dapl_cm_id *conn, + struct rdma_cm_event *event); +static int dapli_cm_active_cb(struct dapl_cm_id *conn, + struct rdma_cm_event *event); +static int dapli_cm_passive_cb(struct dapl_cm_id *conn, + struct rdma_cm_event *event); +static void dapli_addr_resolve(struct dapl_cm_id *conn); +static void dapli_route_resolve(struct dapl_cm_id *conn); + +#if __BYTE_ORDER == __LITTLE_ENDIAN +static inline uint64_t cpu_to_be64(uint64_t x) { return bswap_64(x); } +#elif __BYTE_ORDER == __BIG_ENDIAN +static inline uint64_t cpu_to_be64(uint64_t x) { return x; } +#endif + +/* cma requires 16 bit SID */ +#define IB_PORT_MOD 32001 +#define IB_PORT_BASE (65535 - IB_PORT_MOD) +#define MAKE_PORT(SID) \ + (SID > 0xffff ? 
\ + (unsigned short)((SID % IB_PORT_MOD) + IB_PORT_BASE) :\ + (unsigned short)SID) + + +static void dapli_addr_resolve(struct dapl_cm_id *conn) +{ + int ret; + struct rdma_addr *ipaddr = &conn->cm_id->route.addr; + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " addr_resolve: cm_id %p SRC %x DST %x\n", + conn->cm_id, + ntohl(((struct sockaddr_in *) + &ipaddr->src_addr)->sin_addr.s_addr), + ntohl(((struct sockaddr_in *) + &ipaddr->dst_addr)->sin_addr.s_addr)); + + ret = rdma_resolve_route(conn->cm_id, 2000); + if (ret) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " rdma_resolve_route failed: %s\n",strerror(errno)); + + dapl_evd_connection_callback(conn, + IB_CME_LOCAL_FAILURE, + NULL, conn->ep); + } +} + +static void dapli_route_resolve(struct dapl_cm_id *conn) +{ + int ret; + struct rdma_cm_id *cm_id = conn->cm_id; + struct rdma_addr *ipaddr = &cm_id->route.addr; + struct ib_addr *ibaddr = &cm_id->route.addr.addr.ibaddr; + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " route_resolve: cm_id %p SRC %x DST %x PORT %d\n", + conn->cm_id, + ntohl(((struct sockaddr_in *) + &ipaddr->src_addr)->sin_addr.s_addr), + ntohl(((struct sockaddr_in *) + &ipaddr->dst_addr)->sin_addr.s_addr), + ntohs(((struct sockaddr_in *) + &ipaddr->dst_addr)->sin_port) ); + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " route_resolve: SRC GID subnet %016llx id %016llx\n", + (unsigned long long) + cpu_to_be64(ibaddr->sgid.global.subnet_prefix), + (unsigned long long) + cpu_to_be64(ibaddr->sgid.global.interface_id)); + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " route_resolve: DST GID subnet %016llx id %016llx\n", + (unsigned long long) + cpu_to_be64(ibaddr->dgid.global.subnet_prefix), + (unsigned long long) + cpu_to_be64(ibaddr->dgid.global.interface_id)); + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " rdma_connect: cm_id %p pdata %p plen %d rr %d ind %d\n", + conn->cm_id, + conn->params.private_data, + conn->params.private_data_len, + conn->params.responder_resources, + conn->params.initiator_depth ); + + ret = rdma_connect(conn->cm_id, &conn->params); +
if (ret) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, " rdma_connect failed: %s\n", + strerror(errno)); + goto bail; + } + return; + +bail: + dapl_evd_connection_callback(conn, + IB_CME_LOCAL_FAILURE, + NULL, conn->ep); +} + +void dapli_destroy_conn(struct dapl_cm_id *conn) +{ + int in_callback; + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " destroy_conn: conn %p id %d\n", + conn,conn->cm_id); + + dapl_os_lock(&conn->lock); + conn->destroy = 1; + in_callback = conn->in_callback; + dapl_os_unlock(&conn->lock); + + if (!in_callback) { + if (conn->ep) + conn->ep->cm_handle = IB_INVALID_HANDLE; + if (conn->cm_id) { + if (conn->cm_id->qp) + rdma_destroy_qp(conn->cm_id); + rdma_destroy_id(conn->cm_id); + } + + conn->cm_id = NULL; + dapl_os_free(conn, sizeof(*conn)); + } +} + +static struct dapl_cm_id * dapli_req_recv(struct dapl_cm_id *conn, + struct rdma_cm_event *event) +{ + struct dapl_cm_id *new_conn; + struct rdma_addr *ipaddr = &event->id->route.addr; + + if (conn->sp == NULL) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " dapli_rep_recv: on invalid listen " + "handle\n"); + return NULL; + } + + /* allocate new cm_id and merge listen parameters */ + new_conn = dapl_os_alloc(sizeof(*new_conn)); + if (new_conn) { + (void)dapl_os_memzero(new_conn, sizeof(*new_conn)); + new_conn->cm_id = event->id; /* provided by uCMA */ + event->id->context = new_conn; /* update CM_ID context */ + new_conn->sp = conn->sp; + new_conn->hca = conn->hca; + + /* save private data */ + if (event->private_data_len) { + dapl_os_memcpy(new_conn->p_data, + event->private_data, + event->private_data_len); + new_conn->params.private_data = new_conn->p_data; + new_conn->params.private_data_len = + event->private_data_len; + } + + dapl_dbg_log(DAPL_DBG_TYPE_CM, " passive_cb: " + "REQ: SP %p PORT %d LID %d " + "NEW CONN %p ID %p pD %p,%d\n", + new_conn->sp, + ntohs(((struct sockaddr_in *) + &ipaddr->src_addr)->sin_port), + event->listen_id, new_conn, event->id, + event->private_data, event->private_data_len); + + 
dapl_dbg_log(DAPL_DBG_TYPE_CM, " passive_cb: " + "REQ: IP SRC %x PORT %d DST %x PORT %d\n", + ntohl(((struct sockaddr_in *) + &ipaddr->src_addr)->sin_addr.s_addr), + ntohs(((struct sockaddr_in *) + &ipaddr->src_addr)->sin_port), + ntohl(((struct sockaddr_in *) + &ipaddr->dst_addr)->sin_addr.s_addr), + ntohs(((struct sockaddr_in *) + &ipaddr->dst_addr)->sin_port)); + } + return new_conn; +} + +static int dapli_cm_active_cb(struct dapl_cm_id *conn, + struct rdma_cm_event *event) +{ + int destroy; + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " active_cb: conn %p id %d event %d\n", + conn, conn->cm_id, event->event ); + + dapl_os_lock(&conn->lock); + if (conn->destroy) { + dapl_os_unlock(&conn->lock); + return 0; + } + conn->in_callback = 1; + dapl_os_unlock(&conn->lock); + + switch (event->event) { + case RDMA_CM_EVENT_UNREACHABLE: + case RDMA_CM_EVENT_CONNECT_ERROR: + dapl_evd_connection_callback(conn, + IB_CME_DESTINATION_UNREACHABLE, + NULL, conn->ep); + break; + case RDMA_CM_EVENT_REJECTED: + dapl_evd_connection_callback(conn, IB_CME_DESTINATION_REJECT, + NULL, conn->ep); + break; + + case RDMA_CM_EVENT_ESTABLISHED: + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " active_cb: cm_id %d PORT %d CONNECTED to 0x%x!\n", + conn->cm_id, + ntohs(((struct sockaddr_in *) + &conn->cm_id->route.addr.dst_addr)->sin_port), + ntohl(((struct sockaddr_in *) + &conn->cm_id->route.addr.dst_addr)->sin_addr.s_addr)); + + dapl_evd_connection_callback(conn, IB_CME_CONNECTED, + event->private_data, conn->ep); + break; + + case RDMA_CM_EVENT_DISCONNECTED: + break; + default: + dapl_dbg_log( + DAPL_DBG_TYPE_ERR, + " dapli_cm_active_cb_handler: Unexpected CM " + "event %d on ID 0x%p\n", event->event, conn->cm_id); + break; + } + + dapl_os_lock(&conn->lock); + destroy = conn->destroy; + conn->in_callback = conn->destroy; + dapl_os_unlock(&conn->lock); + if (destroy) { + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " active_cb: DESTROY conn %p id %d \n", + conn, conn->cm_id ); + if (conn->ep) + conn->ep->cm_handle = 
IB_INVALID_HANDLE; + + dapl_os_free(conn, sizeof(*conn)); + } + return(destroy); +} + +static int dapli_cm_passive_cb(struct dapl_cm_id *conn, + struct rdma_cm_event *event) +{ + int destroy; + struct dapl_cm_id *new_conn; + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " passive_cb: conn %p id %d event %d\n", + conn, event->id, event->event); + + dapl_os_lock(&conn->lock); + if (conn->destroy) { + dapl_os_unlock(&conn->lock); + return 0; + } + conn->in_callback = 1; + dapl_os_unlock(&conn->lock); + + switch (event->event) { + case RDMA_CM_EVENT_CONNECT_REQUEST: + /* create new conn object with new conn_id from event */ + new_conn = dapli_req_recv(conn,event); + + if (new_conn) + dapls_cr_callback(new_conn, + IB_CME_CONNECTION_REQUEST_PENDING, + event->private_data, new_conn->sp); + break; + case RDMA_CM_EVENT_UNREACHABLE: + case RDMA_CM_EVENT_CONNECT_ERROR: + dapls_cr_callback(conn, IB_CME_DESTINATION_UNREACHABLE, + NULL, conn->sp); + break; + case RDMA_CM_EVENT_REJECTED: + dapls_cr_callback(conn, IB_CME_DESTINATION_REJECT, NULL, + conn->sp); + break; + case RDMA_CM_EVENT_ESTABLISHED: + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " passive_cb: cm_id %p PORT %d CONNECTED from 0x%x!\n", + conn->cm_id, + ntohs(((struct sockaddr_in *) + &conn->cm_id->route.addr.src_addr)->sin_port), + ntohl(((struct sockaddr_in *) + &conn->cm_id->route.addr.dst_addr)->sin_addr.s_addr)); + + dapls_cr_callback(conn, IB_CME_CONNECTED, + NULL, conn->sp); + + break; + case RDMA_CM_EVENT_DISCONNECTED: + break; + default: + dapl_dbg_log(DAPL_DBG_TYPE_ERR, " passive_cb: " + "Unexpected CM event %d on ID 0x%p\n", + event->event, conn->cm_id); + break; + } + + dapl_os_lock(&conn->lock); + destroy = conn->destroy; + conn->in_callback = conn->destroy; + dapl_os_unlock(&conn->lock); + if (destroy) { + if (conn->ep) + conn->ep->cm_handle = IB_INVALID_HANDLE; + + dapl_os_free(conn, sizeof(*conn)); + } + return(destroy); +} + + +/************************ DAPL provider entry points **********************/ + +/* + * 
dapls_ib_connect + * + * Initiate a connection with the passive listener on another node + * + * Input: + * ep_handle, + * remote_ia_address, + * remote_conn_qual, + * prd_size size of private data and structure + * prd_prt pointer to private data structure + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * DAT_INVALID_PARAMETER + * + */ +DAT_RETURN dapls_ib_connect(IN DAT_EP_HANDLE ep_handle, + IN DAT_IA_ADDRESS_PTR r_addr, + IN DAT_CONN_QUAL r_qual, + IN DAT_COUNT p_size, + IN void *p_data) +{ + struct dapl_ep *ep_ptr = ep_handle; + + /* Sanity check */ + if (NULL == ep_ptr) + return DAT_SUCCESS; + + dapl_dbg_log(DAPL_DBG_TYPE_CM, " connect: rSID %d, pdata %p, ln %d\n", + r_qual,p_data,p_size); + + /* rdma conn and cm_id pre-bound; reference via qp_handle */ + ep_ptr->cm_handle = ep_ptr->qp_handle; + + /* Setup QP/CM parameters and private data in cm_id */ + (void)dapl_os_memzero(&ep_ptr->cm_handle->params, + sizeof(ep_ptr->cm_handle->params)); + ep_ptr->cm_handle->params.responder_resources = IB_TARGET_MAX; + ep_ptr->cm_handle->params.initiator_depth = IB_INITIATOR_DEPTH; + ep_ptr->cm_handle->params.flow_control = 1; + ep_ptr->cm_handle->params.rnr_retry_count = IB_RNR_RETRY_COUNT; + ep_ptr->cm_handle->params.retry_count = IB_RC_RETRY_COUNT; + if (p_size) { + dapl_os_memcpy(ep_ptr->cm_handle->p_data, p_data, p_size); + ep_ptr->cm_handle->params.private_data = + ep_ptr->cm_handle->p_data; + ep_ptr->cm_handle->params.private_data_len = p_size; + } + + /* Resolve remote address, src already bound during QP create */ + ((struct sockaddr_in*)r_addr)->sin_port = htons(MAKE_PORT(r_qual)); + if (rdma_resolve_addr(ep_ptr->cm_handle->cm_id, + NULL, (struct sockaddr *)r_addr, 2000)) + return dapl_convert_errno(errno,"ib_connect"); + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " connect: resolve_addr: cm_id %p SRC %x DST %x port %d\n", + ep_ptr->cm_handle->cm_id, + ntohl(((struct sockaddr_in *) + 
&ep_ptr->cm_handle->hca->hca_address)->sin_addr.s_addr), + ntohl(((struct sockaddr_in *)r_addr)->sin_addr.s_addr), + MAKE_PORT(r_qual) ); + + return DAT_SUCCESS; +} + +/* + * dapls_ib_disconnect + * + * Disconnect an EP + * + * Input: + * ep_handle, + * disconnect_flags + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * + */ +DAT_RETURN +dapls_ib_disconnect(IN DAPL_EP *ep_ptr, + IN DAT_CLOSE_FLAGS close_flags) +{ + ib_cm_handle_t conn = ep_ptr->cm_handle; + int ret; + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " disconnect(ep %p, conn %p, id %d flags %x)\n", + ep_ptr,conn, (conn?conn->cm_id:0),close_flags); + + if (conn == IB_INVALID_HANDLE) + return DAT_SUCCESS; + + /* no graceful half-pipe disconnect option */ + ret = rdma_disconnect(conn->cm_id); + if (ret) + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " disconnect: ID %p ret %d\n", + ep_ptr->cm_handle, ret); + + /* + * uDAPL does NOT expect disconnect callback from provider + * with abrupt close. uDAPL will callback with DISC event when + * from provider returns. So, if callback is expected from + * rdma_cma then block and don't post the event during callback. + */ + if (close_flags != DAT_CLOSE_ABRUPT_FLAG) + { + if (ep_ptr->cr_ptr) + dapls_cr_callback(conn, IB_CME_DISCONNECTED, NULL, + ((DAPL_CR *)ep_ptr->cr_ptr)->sp_ptr); + else + dapl_evd_connection_callback(conn, IB_CME_DISCONNECTED, + NULL, ep_ptr); + } + return DAT_SUCCESS; +} + +/* + * dapls_ib_disconnect_clean + * + * Clean up outstanding connection data. This routine is invoked + * after the final disconnect callback has occurred. Only on the + * ACTIVE side of a connection. 
+ * + * Input: + * ep_ptr DAPL_EP + * active Indicates active side of connection + * + * Output: + * none + * + * Returns: + * void + * + */ +void +dapls_ib_disconnect_clean(IN DAPL_EP *ep_ptr, + IN DAT_BOOLEAN active, + IN const ib_cm_events_t ib_cm_event) +{ + /* + * Clean up outstanding connection state + */ + dapls_ib_disconnect(ep_ptr, DAT_CLOSE_ABRUPT_FLAG); + +} + +/* + * dapl_ib_setup_conn_listener + * + * Have the CM set up a connection listener. + * + * Input: + * ibm_hca_handle HCA handle + * qp_handle QP handle + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * DAT_INTERNAL_ERROR + * DAT_CONN_QUAL_UNAVAILBLE + * DAT_CONN_QUAL_IN_USE + * + */ +DAT_RETURN +dapls_ib_setup_conn_listener(IN DAPL_IA *ia_ptr, + IN DAT_UINT64 ServiceID, + IN DAPL_SP *sp_ptr ) +{ + DAT_RETURN dat_status = DAT_SUCCESS; + int status; + ib_cm_srvc_handle_t conn; + + /* Allocate CM and initialize lock */ + if ((conn = dapl_os_alloc(sizeof(*conn))) == NULL) + return DAT_INSUFFICIENT_RESOURCES; + + dapl_os_memzero(conn, sizeof(*conn)); + dapl_os_lock_init(&conn->lock); + + /* create CM_ID, bind to local device, create QP */ + if (rdma_create_id(&conn->cm_id, (void*)conn)) { + dapl_os_free(conn, sizeof(*conn)); + return(dapl_convert_errno(errno,"setup_listener")); + } + + /* open identifies the local device; per DAT specification */ + ((struct sockaddr_in *)&ia_ptr->hca_ptr->hca_address)->sin_port = + htons(MAKE_PORT(ServiceID)); + + if (rdma_bind_addr(conn->cm_id, + (struct sockaddr *)&ia_ptr->hca_ptr->hca_address)) { + dat_status = dapl_convert_errno(errno,"setup_listener"); + goto bail; + } + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " listen(ia_ptr %p SID %d sp %p conn %p id %d)\n", + ia_ptr, MAKE_PORT(ServiceID), + sp_ptr, conn, conn->cm_id); + + sp_ptr->cm_srvc_handle = conn; + conn->sp = sp_ptr; + conn->hca = ia_ptr->hca_ptr; + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " listen(conn=%p cm_id=%d)\n", + sp_ptr->cm_srvc_handle,conn->cm_id); + + status = 
rdma_listen(conn->cm_id,64); /* backlog to 64 */ + + if (status) { + if (status == -EBUSY) + dat_status = DAT_CONN_QUAL_IN_USE; + else + dat_status = + dapl_convert_errno(errno,"setup_listener"); + + goto bail; + } + + /* success */ + return DAT_SUCCESS; + +bail: + rdma_destroy_id(conn->cm_id); + dapl_os_free(conn, sizeof(*conn)); + return dat_status; +} + + +/* + * dapl_ib_remove_conn_listener + * + * Have the CM remove a connection listener. + * + * Input: + * ia_handle IA handle + * ServiceID IB Channel Service ID + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_STATE + * + */ +DAT_RETURN +dapls_ib_remove_conn_listener(IN DAPL_IA *ia_ptr, IN DAPL_SP *sp_ptr) +{ + ib_cm_srvc_handle_t conn = sp_ptr->cm_srvc_handle; + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " remove_listen(ia_ptr %p sp_ptr %p cm_ptr %p)\n", + ia_ptr, sp_ptr, conn ); + + if (conn != IB_INVALID_HANDLE) { + sp_ptr->cm_srvc_handle = NULL; + dapli_destroy_conn(conn); + } + return DAT_SUCCESS; +} + +/* + * dapls_ib_accept_connection + * + * Perform necessary steps to accept a connection + * + * Input: + * cr_handle + * ep_handle + * private_data_size + * private_data + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * DAT_INTERNAL_ERROR + * + */ +DAT_RETURN +dapls_ib_accept_connection(IN DAT_CR_HANDLE cr_handle, + IN DAT_EP_HANDLE ep_handle, + IN DAT_COUNT p_size, + IN const DAT_PVOID p_data) +{ + DAPL_CR *cr_ptr = (DAPL_CR *)cr_handle; + DAPL_EP *ep_ptr = (DAPL_EP *)ep_handle; + DAPL_IA *ia_ptr = ep_ptr->header.owner_ia; + struct dapl_cm_id *cr_conn = cr_ptr->ib_cm_handle; + int ret; + DAT_RETURN dat_status; + struct rdma_conn_param conn_params; + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " accept(cr %p conn %p, id %p, p_data %p, p_sz=%d)\n", + cr_ptr, cr_conn, cr_conn->cm_id, p_data, p_size ); + + /* Obtain size of private data structure & contents */ + if (p_size > IB_MAX_REP_PDATA_SIZE) { + dat_status = DAT_ERROR(DAT_LENGTH_ERROR, DAT_NO_SUBTYPE); 
+ goto bail; + } + + if (ep_ptr->qp_state == DAPL_QP_STATE_UNATTACHED) { + /* + * If we are lazy attaching the QP then we may need to + * hook it up here. Typically, we run this code only for + * DAT_PSP_PROVIDER_FLAG + */ + dat_status = dapls_ib_qp_alloc(ia_ptr, ep_ptr, NULL); + if (dat_status != DAT_SUCCESS) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " accept: ib_qp_alloc failed: %d\n", + dat_status); + goto bail; + } + } + + /* + * Validate device and port in EP cm_id against inbound + * CR cm_id. The pre-allocated EP cm_id is already bound to + * a local device (cm_id and QP) when created. Move the QP + * to the new cm_id only if device and port numbers match. + */ + if (ep_ptr->qp_handle->cm_id->verbs == cr_conn->cm_id->verbs && + ep_ptr->qp_handle->cm_id->port_num == cr_conn->cm_id->port_num) { + /* move QP to new cr_conn, remove QP ref in EP cm_id */ + cr_conn->cm_id->qp = ep_ptr->qp_handle->cm_id->qp; + ep_ptr->qp_handle->cm_id->qp = NULL; + dapli_destroy_conn(ep_ptr->qp_handle); + } else { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " accept: ERR dev(%p!=%p) or port mismatch(%d!=%d)\n", + ep_ptr->qp_handle->cm_id->verbs,cr_conn->cm_id->verbs, + ep_ptr->qp_handle->cm_id->port_num, + cr_conn->cm_id->port_num ); + dat_status = DAT_INTERNAL_ERROR; + goto bail; + } + + cr_ptr->param.local_ep_handle = ep_handle; + ep_ptr->qp_handle = cr_conn; + ep_ptr->cm_handle = cr_conn; + cr_conn->ep = ep_ptr; + + memset(&conn_params, 0, sizeof(conn_params)); + conn_params.private_data = p_data; + conn_params.private_data_len = p_size; + conn_params.responder_resources = IB_TARGET_MAX; + conn_params.initiator_depth = IB_INITIATOR_DEPTH; + conn_params.flow_control = 1; + conn_params.rnr_retry_count = IB_RNR_RETRY_COUNT; + + ret = rdma_accept(cr_conn->cm_id, &conn_params); + if (ret) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR," accept: ERROR %d\n", ret); + dat_status = dapl_convert_errno(ret, "accept"); + goto bail; + } + + return DAT_SUCCESS; +bail: + rdma_reject(cr_conn->cm_id, NULL, 0); + 
dapli_destroy_conn(cr_conn); + return dat_status; +} + + +/* + * dapls_ib_reject_connection + * + * Reject a connection + * + * Input: + * cr_handle + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INTERNAL_ERROR + * + */ +DAT_RETURN +dapls_ib_reject_connection(IN ib_cm_handle_t cm_handle, IN int reason) +{ + int ret; + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " reject(cm_handle %p reason %x)\n", + cm_handle, reason ); + + if (cm_handle == IB_INVALID_HANDLE) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " reject: invalid handle: reason %d\n", + reason); + return DAT_SUCCESS; + } + + ret = rdma_reject(cm_handle->cm_id, NULL, 0); + + dapli_destroy_conn(cm_handle); + return dapl_convert_errno(ret, "reject"); +} + +/* + * dapls_ib_cm_remote_addr + * + * Obtain the remote IP address given a connection + * + * Input: + * cr_handle + * + * Output: + * remote_ia_address: where to place the remote address + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_HANDLE + * + */ +DAT_RETURN +dapls_ib_cm_remote_addr(IN DAT_HANDLE dat_handle, OUT DAT_SOCK_ADDR6 *raddr) +{ + DAPL_HEADER *header; + ib_cm_handle_t ib_cm_handle; + struct rdma_addr *ipaddr; + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " remote_addr(cm_handle=%p, r_addr=%p)\n", + dat_handle, raddr); + + header = (DAPL_HEADER *)dat_handle; + + if (header->magic == DAPL_MAGIC_EP) + ib_cm_handle = ((DAPL_EP *)dat_handle)->cm_handle; + else if (header->magic == DAPL_MAGIC_CR) + ib_cm_handle = ((DAPL_CR *)dat_handle)->ib_cm_handle; + else + return DAT_INVALID_HANDLE; + + /* get remote IP address from cm_id route */ + ipaddr = &ib_cm_handle->cm_id->route.addr; + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " remote_addr: conn %p id %p SRC %x DST %x PORT %d\n", + ib_cm_handle, ib_cm_handle->cm_id, + ntohl(((struct sockaddr_in *) + &ipaddr->src_addr)->sin_addr.s_addr), + ntohl(((struct sockaddr_in *) + &ipaddr->dst_addr)->sin_addr.s_addr), + ntohs(((struct sockaddr_in *) + &ipaddr->dst_addr)->sin_port)); + + 
dapl_os_memcpy(raddr,&ipaddr->dst_addr,sizeof(DAT_SOCK_ADDR)); + return DAT_SUCCESS; +} + +/* + * dapls_ib_private_data_size + * + * Return the size of private data given a connection op type + * + * Input: + * prd_ptr private data pointer + * conn_op connection operation type + * + * If prd_ptr is NULL, this is a query for the max size supported by + * the provider, otherwise it is the actual size of the private data + * contained in prd_ptr. + * + * + * Output: + * None + * + * Returns: + * length of private data + * + */ +int dapls_ib_private_data_size(IN DAPL_PRIVATE *prd_ptr, + IN DAPL_PDATA_OP conn_op) +{ + int size; + + switch(conn_op) { + + case DAPL_PDATA_CONN_REQ: + size = IB_MAX_REQ_PDATA_SIZE; + break; + case DAPL_PDATA_CONN_REP: + size = IB_MAX_REP_PDATA_SIZE; + break; + case DAPL_PDATA_CONN_REJ: + size = IB_MAX_REJ_PDATA_SIZE; + break; + case DAPL_PDATA_CONN_DREQ: + size = IB_MAX_DREQ_PDATA_SIZE; + break; + case DAPL_PDATA_CONN_DREP: + size = IB_MAX_DREP_PDATA_SIZE; + break; + default: + size = 0; + + } /* end case */ + + return size; +} + +/* + * Map all socket CM event codes to the DAT equivalent.
+ */ +#define DAPL_IB_EVENT_CNT 12 + +static struct ib_cm_event_map +{ + const ib_cm_events_t ib_cm_event; + DAT_EVENT_NUMBER dat_event_num; + } ib_cm_event_map[DAPL_IB_EVENT_CNT] = { + /* 00 */ { IB_CME_CONNECTED, + DAT_CONNECTION_EVENT_ESTABLISHED}, + /* 01 */ { IB_CME_DISCONNECTED, + DAT_CONNECTION_EVENT_DISCONNECTED}, + /* 02 */ { IB_CME_DISCONNECTED_ON_LINK_DOWN, + DAT_CONNECTION_EVENT_DISCONNECTED}, + /* 03 */ { IB_CME_CONNECTION_REQUEST_PENDING, + DAT_CONNECTION_REQUEST_EVENT}, + /* 04 */ { IB_CME_CONNECTION_REQUEST_PENDING_PRIVATE_DATA, + DAT_CONNECTION_REQUEST_EVENT}, + /* 05 */ { IB_CME_CONNECTION_REQUEST_ACKED, + DAT_CONNECTION_REQUEST_EVENT}, + /* 06 */ { IB_CME_DESTINATION_REJECT, + DAT_CONNECTION_EVENT_NON_PEER_REJECTED}, + /* 07 */ { IB_CME_DESTINATION_REJECT_PRIVATE_DATA, + DAT_CONNECTION_EVENT_PEER_REJECTED}, + /* 08 */ { IB_CME_DESTINATION_UNREACHABLE, + DAT_CONNECTION_EVENT_UNREACHABLE}, + /* 09 */ { IB_CME_TOO_MANY_CONNECTION_REQUESTS, + DAT_CONNECTION_EVENT_NON_PEER_REJECTED}, + /* 10 */ { IB_CME_LOCAL_FAILURE, + DAT_CONNECTION_EVENT_BROKEN}, + /* 11 */ { IB_CME_BROKEN, + DAT_CONNECTION_EVENT_BROKEN} +}; + +/* + * dapls_ib_get_dat_event + * + * Return a DAT connection event given a provider CM event. + * + * Input: + * ib_cm_event event provided to the dapl callback routine + * active switch indicating active or passive connection + * + * Output: + * none + * + * Returns: + * DAT_EVENT_NUMBER of translated provider value + */ +DAT_EVENT_NUMBER +dapls_ib_get_dat_event(IN const ib_cm_events_t ib_cm_event, + IN DAT_BOOLEAN active) +{ + DAT_EVENT_NUMBER dat_event_num; + int i; + + active = active; + + if (ib_cm_event > IB_CME_BROKEN) + return (DAT_EVENT_NUMBER) 0; + + dat_event_num = 0; + for(i = 0; i < DAPL_IB_EVENT_CNT; i++) { + if (ib_cm_event == ib_cm_event_map[i].ib_cm_event) { + dat_event_num = ib_cm_event_map[i].dat_event_num; + break; + } + } + dapl_dbg_log(DAPL_DBG_TYPE_CALLBACK, + "dapls_ib_get_dat_event: event(%s) ib=0x%x dat=0x%x\n", + active ? 
"active" : "passive", ib_cm_event, dat_event_num); + + return dat_event_num; +} + + +/* + * dapls_ib_get_cm_event + * + * Return a provider CM event given a DAT connection event. + * + * Input: + * dat_event_num DAT event we need an equivalent CM event for + * + * Output: + * none + * + * Returns: + * ib_cm_event of translated DAPL value + */ +ib_cm_events_t +dapls_ib_get_cm_event(IN DAT_EVENT_NUMBER dat_event_num) +{ + ib_cm_events_t ib_cm_event; + int i; + + ib_cm_event = 0; + for(i = 0; i < DAPL_IB_EVENT_CNT; i++) { + if (dat_event_num == ib_cm_event_map[i].dat_event_num) { + ib_cm_event = ib_cm_event_map[i].ib_cm_event; + break; + } + } + return ib_cm_event; +} + + +void dapli_cma_event_cb(void) +{ + struct rdma_cm_event *event; + int ret; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " cm_event()\n"); + + ret = rdma_get_cm_event(&event); + + /* process one CM event, fairness */ + if(!ret) { + struct dapl_cm_id *conn; + int ret; + + /* set proper conn from cm_id context*/ + if (event->event == RDMA_CM_EVENT_CONNECT_REQUEST) + conn = (struct dapl_cm_id *)event->listen_id->context; + else + conn = (struct dapl_cm_id *)event->id->context; + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " cm_event: EVENT=%d ID=%p LID=%p CTX=%p\n", + event->event, event->id, event->listen_id, conn); + + switch (event->event) { + case RDMA_CM_EVENT_ADDR_RESOLVED: + dapli_addr_resolve(conn); + break; + case RDMA_CM_EVENT_ROUTE_RESOLVED: + dapli_route_resolve(conn); + break; + case RDMA_CM_EVENT_ADDR_ERROR: + case RDMA_CM_EVENT_ROUTE_ERROR: + dapl_evd_connection_callback(conn, + IB_CME_LOCAL_FAILURE, + NULL, conn->ep); + break; + case RDMA_CM_EVENT_DEVICE_REMOVAL: + dapl_evd_connection_callback(conn, + IB_CME_LOCAL_FAILURE, + NULL, conn->ep); + break; + case RDMA_CM_EVENT_CONNECT_REQUEST: + case RDMA_CM_EVENT_CONNECT_ERROR: + case RDMA_CM_EVENT_UNREACHABLE: + case RDMA_CM_EVENT_REJECTED: + case RDMA_CM_EVENT_ESTABLISHED: + case 
RDMA_CM_EVENT_DISCONNECTED: + /* passive or active */ + if (conn->sp) + ret = dapli_cm_passive_cb(conn,event); + else + ret = dapli_cm_active_cb(conn,event); + + if (ret) + rdma_destroy_id(conn->cm_id); + + break; + case RDMA_CM_EVENT_CONNECT_RESPONSE: + default: + dapl_dbg_log(DAPL_DBG_TYPE_WARN, + " cm_event: UNEXPECTED EVENT=%d ID=%p CTX=%p\n", + event->event, event->id, + event->id->context); + break; + } + rdma_ack_cm_event(event); + } else { + dapl_dbg_log(DAPL_DBG_TYPE_WARN, + " cm_event: ERROR: rdma_get_cm_event() %d %d %s\n", + ret, errno, strerror(errno)); + } +} + +/* + * Local variables: + * c-indent-level: 4 + * c-basic-offset: 4 + * tab-width: 8 + * End: + */ Index: dapl/openib_cma/dapl_ib_qp.c =================================================================== --- dapl/openib_cma/dapl_ib_qp.c (revision 0) +++ dapl/openib_cma/dapl_ib_qp.c (revision 0) @@ -0,0 +1,305 @@ +/* + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. 
+ */ + +/********************************************************************** + * + * MODULE: dapl_det_qp.c + * + * PURPOSE: QP routines for access to DET Verbs + * + * $Id: $ + **********************************************************************/ + +#include "dapl.h" +#include "dapl_adapter_util.h" + +/* + * dapl_ib_qp_alloc + * + * Alloc a QP + * + * Input: + * *ep_ptr pointer to EP INFO + * ib_hca_handle provider HCA handle + * ib_pd_handle provider protection domain handle + * cq_recv provider recv CQ handle + * cq_send provider send CQ handle + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * DAT_INTERNAL_ERROR + * + */ +DAT_RETURN dapls_ib_qp_alloc(IN DAPL_IA *ia_ptr, + IN DAPL_EP *ep_ptr, + IN DAPL_EP *ep_ctx_ptr) +{ + DAT_EP_ATTR *attr; + DAPL_EVD *rcv_evd, *req_evd; + ib_cq_handle_t rcv_cq, req_cq; + ib_pd_handle_t ib_pd_handle; + struct ibv_qp_init_attr qp_create; + ib_cm_handle_t conn; + struct rdma_cm_id *cm_id; + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " qp_alloc: ia_ptr %p ep_ptr %p ep_ctx_ptr %p\n", + ia_ptr, ep_ptr, ep_ctx_ptr); + + attr = &ep_ptr->param.ep_attr; + ib_pd_handle = ((DAPL_PZ *)ep_ptr->param.pz_handle)->pd_handle; + rcv_evd = (DAPL_EVD *) ep_ptr->param.recv_evd_handle; + req_evd = (DAPL_EVD *) ep_ptr->param.request_evd_handle; + + /* + * DAT allows usage model of EP's with no EVD's but IB does not. + * Create a CQ with zero entries under the covers to support and + * catch any invalid posting. 
+ */ + if (rcv_evd != DAT_HANDLE_NULL) + rcv_cq = rcv_evd->ib_cq_handle; + else if (!ia_ptr->hca_ptr->ib_trans.ib_cq_empty) + rcv_cq = ia_ptr->hca_ptr->ib_trans.ib_cq_empty; + else { + struct ibv_comp_channel *channel = + ia_ptr->hca_ptr->ib_trans.ib_cq; +#ifdef CQ_WAIT_OBJECT + if (rcv_evd->cq_wait_obj_handle) + channel = rcv_evd->cq_wait_obj_handle; +#endif + /* Call IB verbs to create CQ */ + rcv_cq = ibv_create_cq(ia_ptr->hca_ptr->ib_hca_handle, + 0, NULL, channel, 0); + + if (rcv_cq == IB_INVALID_HANDLE) + return(dapl_convert_errno(ENOMEM, "create_cq")); + + ia_ptr->hca_ptr->ib_trans.ib_cq_empty = rcv_cq; + } + if (req_evd != DAT_HANDLE_NULL) + req_cq = req_evd->ib_cq_handle; + else + req_cq = ia_ptr->hca_ptr->ib_trans.ib_cq_empty; + + /* + * IMPLEMENTATION NOTE: + * uDAPL allows consumers to post buffers on the EP after creation + * and before a connect request (outbound and inbound). This forces + * a binding to a device during the hca_open call and requires the + * consumer to predetermine which device to listen on or connect from. + * This restriction eliminates any option of listening or connecting + * over multiple devices. uDAPL should add API's to resolve addresses + * and bind to the device at the appropriate time (before connect + * and after CR arrives). Discovery should happen at connection time + * based on addressing and not on static configuration during open. 
+ */ + + /* Allocate CM and initialize lock */ + if ((conn = dapl_os_alloc(sizeof(*conn))) == NULL) + return(dapl_convert_errno(ENOMEM, "create_cq")); + + dapl_os_memzero(conn, sizeof(*conn)); + dapl_os_lock_init(&conn->lock); + + /* create CM_ID, bind to local device, create QP */ + if (rdma_create_id(&cm_id, (void*)conn)) { + dapl_os_free(conn, sizeof(*conn)); + return(dapl_convert_errno(errno, "create_qp")); + } + + /* open identifies the local device; per DAT specification */ + if (rdma_bind_addr(cm_id, + (struct sockaddr *)&ia_ptr->hca_ptr->hca_address)) + goto bail; + + /* Setup attributes and create qp */ + dapl_os_memzero((void*)&qp_create, sizeof(qp_create)); + qp_create.cap.max_send_wr = attr->max_request_dtos; + qp_create.cap.max_recv_wr = attr->max_recv_dtos; + qp_create.cap.max_send_sge = attr->max_request_iov; + qp_create.cap.max_recv_sge = attr->max_recv_iov; + qp_create.cap.max_inline_data = + ia_ptr->hca_ptr->ib_trans.max_inline_send; + qp_create.send_cq = req_cq; + qp_create.recv_cq = rcv_cq; + qp_create.qp_type = IBV_QPT_RC; + qp_create.qp_context = (void*)ep_ptr; + + /* Let uCMA transition QP states */ + if (rdma_create_qp(cm_id, ib_pd_handle, &qp_create)) + goto bail; + + conn->cm_id = cm_id; + conn->ep = ep_ptr; + conn->hca = ia_ptr->hca_ptr; + ep_ptr->qp_handle = conn; + ep_ptr->qp_state = IB_QP_STATE_INIT; + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " qp_alloc: qpn %p sq %d,%d rq %d,%d\n", + ep_ptr->qp_handle->cm_id->qp->qp_num, + qp_create.cap.max_send_wr,qp_create.cap.max_send_sge, + qp_create.cap.max_recv_wr,qp_create.cap.max_recv_sge); + + return DAT_SUCCESS; +bail: + rdma_destroy_id(cm_id); + dapl_os_free(conn, sizeof(*conn)); + return(dapl_convert_errno(errno, "create_qp")); +} + +/* + * dapl_ib_qp_free + * + * Free a QP + * + * Input: + * ia_handle IA handle + * *ep_ptr pointer to EP INFO + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * dapl_convert_errno + * + */ +DAT_RETURN dapls_ib_qp_free(IN DAPL_IA *ia_ptr, IN DAPL_EP 
*ep_ptr) +{ + dapl_dbg_log(DAPL_DBG_TYPE_EP, " qp_free: ep_ptr %p qp %p\n", + ep_ptr, ep_ptr->qp_handle); + + if (ep_ptr->qp_handle != IB_INVALID_HANDLE) { + /* qp_handle is conn object with reference to cm_id and qp */ + dapli_destroy_conn(ep_ptr->qp_handle); + ep_ptr->qp_handle = IB_INVALID_HANDLE; + ep_ptr->qp_state = IB_QP_STATE_ERROR; + } + return DAT_SUCCESS; +} + +/* + * dapl_ib_qp_modify + * + * Set the QP to the parameters specified in an EP_PARAM + * + * The EP_PARAM structure that is provided has been + * sanitized such that only non-zero values are valid. + * + * Input: + * ib_hca_handle HCA handle + * qp_handle QP handle + * ep_attr Sanitized EP Params + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * DAT_INVALID_PARAMETER + * + */ +DAT_RETURN dapls_ib_qp_modify(IN DAPL_IA *ia_ptr, + IN DAPL_EP *ep_ptr, + IN DAT_EP_ATTR *attr) +{ + struct ibv_qp_attr qp_attr; + + if (ep_ptr->qp_handle == IB_INVALID_HANDLE) + return DAT_INVALID_PARAMETER; + + /* + * Check if we have the right qp_state to modify attributes + */ + if ((ep_ptr->qp_handle->cm_id->qp->state != IBV_QPS_RTR) && + (ep_ptr->qp_handle->cm_id->qp->state != IBV_QPS_RTS)) + return DAT_INVALID_STATE; + + /* Adjust to current EP attributes */ + dapl_os_memzero((void*)&qp_attr, sizeof(qp_attr)); + qp_attr.cap.max_send_wr = attr->max_request_dtos; + qp_attr.cap.max_recv_wr = attr->max_recv_dtos; + qp_attr.cap.max_send_sge = attr->max_request_iov; + qp_attr.cap.max_recv_sge = attr->max_recv_iov; + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + "modify_qp: qp %p sq %d,%d, rq %d,%d\n", + ep_ptr->qp_handle->cm_id->qp, + qp_attr.cap.max_send_wr, qp_attr.cap.max_send_sge, + qp_attr.cap.max_recv_wr, qp_attr.cap.max_recv_sge); + + if (ibv_modify_qp(ep_ptr->qp_handle->cm_id->qp, &qp_attr, IBV_QP_CAP)) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + "modify_qp: modify ep %p qp %p failed\n", + ep_ptr, ep_ptr->qp_handle->cm_id->qp); + return(dapl_convert_errno(errno,"modify_qp_state")); + } + + 
return DAT_SUCCESS; +} + +/* + * dapls_ib_reinit_ep + * + * Move the QP to INIT state again. + * + * Input: + * ep_ptr DAPL_EP + * + * Output: + * none + * + * Returns: + * void + * + */ +void dapls_ib_reinit_ep(IN DAPL_EP *ep_ptr) +{ + /* uCMA does not allow reuse of CM_ID, destroy and create new one */ + if (ep_ptr->qp_handle != IB_INVALID_HANDLE) { + + /* destroy */ + dapli_destroy_conn(ep_ptr->qp_handle); + + /* create new CM_ID and QP */ + ep_ptr->qp_handle = IB_INVALID_HANDLE; + dapls_ib_qp_alloc(ep_ptr->header.owner_ia, ep_ptr, ep_ptr); + } +} + + +/* + * Local variables: + * c-indent-level: 4 + * c-basic-offset: 4 + * tab-width: 8 + * End: + */ Index: dapl/openib_cma/README =================================================================== --- dapl/openib_cma/README (revision 0) +++ dapl/openib_cma/README (revision 0) @@ -0,0 +1,40 @@ + +OpenIB uDAPL provider using rdma cma and openib verbs interfaces + +to build: + +cd dapl/udapl +make VERBS=openib_cma clean +make VERBS=openib_cma + + +Modifications to common code: + +- added dapl/openib_cma directory + + dapl/udapl/Makefile + +New files for openib_cma provider + + dapl/openib_cma/dapl_ib_cq.c + dapl/openib_cma/dapl_ib_dto.h + dapl/openib_cma/dapl_ib_mem.c + dapl/openib_cma/dapl_ib_qp.c + dapl/openib_cma/dapl_ib_util.c + dapl/openib_cma/dapl_ib_util.h + dapl/openib_cma/dapl_ib_cm.c + +A simple dapl test just for openib_cma testing... + + test/dtest/dtest.c + test/dtest/makefile + + server: dtest -s + client: dtest -h hostname + +known issues: + + no memory windows support in ibverbs, dat_create_rmr fails.
+ + + Index: dapl/openib_cma/dapl_ib_util.h =================================================================== --- dapl/openib_cma/dapl_ib_util.h (revision 0) +++ dapl/openib_cma/dapl_ib_util.h (revision 0) @@ -0,0 +1,333 @@ +/* + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + */ + +/*************************************************************************** + * + * Module: uDAPL + * + * Filename: dapl_ib_util.h + * + * Author: Arlin Davis + * + * Created: 3/10/2005 + * + * Description: + * + * The uDAPL openib provider - definitions, prototypes, + * + **************************************************************************** + * Source Control System Information + * + * $Id: $ + * + * Copyright (c) 2005 Intel Corporation. All rights reserved. 
+ * + **************************************************************************/ + +#ifndef _DAPL_IB_UTIL_H_ +#define _DAPL_IB_UTIL_H_ + +#include "verbs.h" +#include +#include + +/* Typedefs to map common DAPL provider types to IB verbs */ +typedef struct dapl_cm_id *ib_qp_handle_t; +typedef struct ibv_cq *ib_cq_handle_t; +typedef struct ibv_pd *ib_pd_handle_t; +typedef struct ibv_mr *ib_mr_handle_t; +typedef struct ibv_mw *ib_mw_handle_t; +typedef struct ibv_wc ib_work_completion_t; + +/* HCA context type maps to IB verbs */ +typedef struct ibv_context *ib_hca_handle_t; +typedef ib_hca_handle_t dapl_ibal_ca_t; + +#define IB_RC_RETRY_COUNT 7 +#define IB_RNR_RETRY_COUNT 7 +#define IB_CM_RESPONSE_TIMEOUT 18 /* 1 sec */ +#define IB_MAX_CM_RETRIES 7 +#define IB_REQ_MRA_TIMEOUT 27 /* a little over 9 minutes */ +#define IB_MAX_AT_RETRY 3 +#define IB_TARGET_MAX 4 /* max_qp_ous_rd_atom */ +#define IB_INITIATOR_DEPTH 4 /* max_qp_init_rd_atom */ + +typedef enum { + IB_CME_CONNECTED, + IB_CME_DISCONNECTED, + IB_CME_DISCONNECTED_ON_LINK_DOWN, + IB_CME_CONNECTION_REQUEST_PENDING, + IB_CME_CONNECTION_REQUEST_PENDING_PRIVATE_DATA, + IB_CME_CONNECTION_REQUEST_ACKED, + IB_CME_DESTINATION_REJECT, + IB_CME_DESTINATION_REJECT_PRIVATE_DATA, + IB_CME_DESTINATION_UNREACHABLE, + IB_CME_TOO_MANY_CONNECTION_REQUESTS, + IB_CME_LOCAL_FAILURE, + IB_CME_BROKEN +} ib_cm_events_t; + +/* CQ notifications */ +typedef enum +{ + IB_NOTIFY_ON_NEXT_COMP, + IB_NOTIFY_ON_SOLIC_COMP + +} ib_notification_type_t; + +/* other mappings */ +typedef int ib_bool_t; +typedef union ibv_gid GID; +typedef char *IB_HCA_NAME; +typedef uint16_t ib_hca_port_t; +typedef uint32_t ib_comp_handle_t; + +#ifdef CQ_WAIT_OBJECT +typedef struct ibv_comp_channel *ib_wait_obj_handle_t; +#endif + +/* Definitions */ +#define IB_INVALID_HANDLE NULL + +/* inline send rdma threshold */ +#define INLINE_SEND_DEFAULT 128 + +/* CM private data areas */ +#define IB_MAX_REQ_PDATA_SIZE 48 +#define IB_MAX_REP_PDATA_SIZE 196 +#define 
IB_MAX_REJ_PDATA_SIZE 148 +#define IB_MAX_DREQ_PDATA_SIZE 220 +#define IB_MAX_DREP_PDATA_SIZE 224 + +/* DTO OPs, ordered for DAPL ENUM definitions */ +#define OP_RDMA_WRITE IBV_WR_RDMA_WRITE +#define OP_RDMA_WRITE_IMM IBV_WR_RDMA_WRITE_WITH_IMM +#define OP_SEND IBV_WR_SEND +#define OP_SEND_IMM IBV_WR_SEND_WITH_IMM +#define OP_RDMA_READ IBV_WR_RDMA_READ +#define OP_COMP_AND_SWAP IBV_WR_ATOMIC_CMP_AND_SWP +#define OP_FETCH_AND_ADD IBV_WR_ATOMIC_FETCH_AND_ADD +#define OP_RECEIVE 7 /* internal op */ +#define OP_RECEIVE_IMM 8 /* internal op */ +#define OP_BIND_MW 9 /* internal op */ +#define OP_INVALID 0xff + +/* Definitions to map QP state */ +#define IB_QP_STATE_RESET IBV_QPS_RESET +#define IB_QP_STATE_INIT IBV_QPS_INIT +#define IB_QP_STATE_RTR IBV_QPS_RTR +#define IB_QP_STATE_RTS IBV_QPS_RTS +#define IB_QP_STATE_SQD IBV_QPS_SQD +#define IB_QP_STATE_SQE IBV_QPS_SQE +#define IB_QP_STATE_ERROR IBV_QPS_ERR + +typedef enum +{ + IB_THREAD_INIT, + IB_THREAD_CREATE, + IB_THREAD_RUN, + IB_THREAD_CANCEL, + IB_THREAD_EXIT + +} ib_thread_state_t; + +/* + * dapl_llist_entry in dapl.h but dapl.h depends on provider + * typedef's in this file first.
move dapl_llist_entry out of dapl.h + */ +struct ib_llist_entry +{ + struct dapl_llist_entry *flink; + struct dapl_llist_entry *blink; + void *data; + struct dapl_llist_entry *list_head; +}; + +struct dapl_cm_id { + DAPL_OS_LOCK lock; + int destroy; + int in_callback; + struct rdma_cm_id *cm_id; + struct dapl_hca *hca; + struct dapl_sp *sp; + struct dapl_ep *ep; + struct rdma_conn_param params; + int p_len; + unsigned char p_data[IB_MAX_DREP_PDATA_SIZE]; +}; + +typedef struct dapl_cm_id *ib_cm_handle_t; +typedef struct dapl_cm_id *ib_cm_srvc_handle_t; + +/* Operation and state mappings */ +typedef enum ibv_send_flags ib_send_op_type_t; +typedef struct ibv_sge ib_data_segment_t; +typedef enum ibv_qp_state ib_qp_state_t; +typedef enum ibv_event_type ib_async_event_type; +typedef struct ibv_async_event ib_error_record_t; + +/* Definitions for ibverbs/mthca return codes, should be defined in verbs.h */ +/* some are errno and some are -n values */ + +/** + * ibv_get_device_name - Return kernel device name + * ibv_get_device_guid - Return device's node GUID + * ibv_open_device - Return ibv_context or NULL + * ibv_close_device - Return 0, (errno?) + * ibv_get_async_event - Return 0, -1 + * ibv_alloc_pd - Return ibv_pd, NULL + * ibv_dealloc_pd - Return 0, errno + * ibv_reg_mr - Return ibv_mr, NULL + * ibv_dereg_mr - Return 0, errno + * ibv_create_cq - Return ibv_cq, NULL + * ibv_destroy_cq - Return 0, errno + * ibv_get_cq_event - Return 0 & ibv_cq/context, int + * ibv_poll_cq - Return n & ibv_wc, 0 ok, -1 empty, -2 error + * ibv_req_notify_cq - Return 0 (void?) 
+ * ibv_create_qp - Return ibv_qp, NULL + * ibv_modify_qp - Return 0, errno + * ibv_destroy_qp - Return 0, errno + * ibv_post_send - Return 0, -1 & bad_wr + * ibv_post_recv - Return 0, -1 & bad_wr + */ + +/* async handlers for DTO, CQ, QP, and unafiliated */ +typedef void (*ib_async_dto_handler_t)( + IN ib_hca_handle_t ib_hca_handle, + IN ib_error_record_t *err_code, + IN void *context); + +typedef void (*ib_async_cq_handler_t)( + IN ib_hca_handle_t ib_hca_handle, + IN ib_cq_handle_t ib_cq_handle, + IN ib_error_record_t *err_code, + IN void *context); + +typedef void (*ib_async_qp_handler_t)( + IN ib_hca_handle_t ib_hca_handle, + IN ib_qp_handle_t ib_qp_handle, + IN ib_error_record_t *err_code, + IN void *context); + +typedef void (*ib_async_handler_t)( + IN ib_hca_handle_t ib_hca_handle, + IN ib_error_record_t *err_code, + IN void *context); + + +/* ib_hca_transport_t, specific to this implementation */ +typedef struct _ib_hca_transport +{ + struct ib_llist_entry entry; + int destroy; + struct dapl_hca *d_hca; + struct rdma_cm_id *cm_id; + struct ibv_comp_channel *ib_cq; + ib_cq_handle_t ib_cq_empty; + int max_inline_send; + ib_async_handler_t async_unafiliated; + void *async_un_ctx; + ib_async_cq_handler_t async_cq_error; + ib_async_dto_handler_t async_cq; + ib_async_qp_handler_t async_qp_error; + +} ib_hca_transport_t; + +/* provider specfic fields for shared memory support */ +typedef uint32_t ib_shm_transport_t; + +/* prototypes */ +int32_t dapls_ib_init (void); +int32_t dapls_ib_release (void); +void dapli_thread(void *arg); +int dapli_ib_thread_init(void); +void dapli_ib_thread_destroy(void); +void dapli_cma_event_cb(void); +void dapli_cq_event_cb(struct _ib_hca_transport *hca); +void dapli_async_event_cb(struct _ib_hca_transport *hca); +void dapli_destroy_conn(struct dapl_cm_id *conn); + +DAT_RETURN +dapls_modify_qp_state ( IN ib_qp_handle_t qp_handle, + IN ib_qp_state_t qp_state, + IN struct dapl_cm_id *conn ); + +/* inline functions */ +STATIC _INLINE_ 
IB_HCA_NAME dapl_ib_convert_name (IN char *name) +{ + /* use ascii; name of local device */ + return dapl_os_strdup(name); +} + +STATIC _INLINE_ void dapl_ib_release_name (IN IB_HCA_NAME name) +{ + return; +} + +/* + * Convert errno to DAT_RETURN values + */ +STATIC _INLINE_ DAT_RETURN +dapl_convert_errno( IN int err, IN const char *str ) +{ + if (!err) return DAT_SUCCESS; + +#if DAPL_DBG + if ((err != EAGAIN) && (err != ETIME) && (err != ETIMEDOUT)) + dapl_dbg_log (DAPL_DBG_TYPE_ERR," %s %s\n", str, strerror(err)); +#endif + + switch( err ) + { + case EOVERFLOW : return DAT_LENGTH_ERROR; + case EACCES : return DAT_PRIVILEGES_VIOLATION; + case ENXIO : + case ERANGE : + case EPERM : return DAT_PROTECTION_VIOLATION; + case EINVAL : + case EBADF : + case ENOENT : + case ENOTSOCK : return DAT_INVALID_HANDLE; + case EISCONN : return DAT_INVALID_STATE | DAT_INVALID_STATE_EP_CONNECTED; + case ECONNREFUSED : return DAT_INVALID_STATE | DAT_INVALID_STATE_EP_NOTREADY; + case ETIME : + case ETIMEDOUT : return DAT_TIMEOUT_EXPIRED; + case ENETUNREACH: return DAT_INVALID_ADDRESS | DAT_INVALID_ADDRESS_UNREACHABLE; + case EBUSY : return DAT_PROVIDER_IN_USE; + case EADDRINUSE : return DAT_CONN_QUAL_IN_USE; + case EALREADY : return DAT_INVALID_STATE | DAT_INVALID_STATE_EP_ACTCONNPENDING; + case ENOSPC : + case ENOMEM : + case E2BIG : + case EDQUOT : return DAT_INSUFFICIENT_RESOURCES; + case EAGAIN : return DAT_QUEUE_EMPTY; + case EINTR : return DAT_INTERRUPTED_CALL; + case EAFNOSUPPORT : return DAT_INVALID_ADDRESS | DAT_INVALID_ADDRESS_MALFORMED; + case EFAULT : + default : return DAT_INTERNAL_ERROR; + } + } + +#endif /* _DAPL_IB_UTIL_H_ */ Index: dapl/openib_cma/dapl_ib_cq.c =================================================================== --- dapl/openib_cma/dapl_ib_cq.c (revision 0) +++ dapl/openib_cma/dapl_ib_cq.c (revision 0) @@ -0,0 +1,531 @@ +/* + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a 
copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + */ + +/*************************************************************************** + * + * Module: uDAPL + * + * Filename: dapl_ib_cq.c + * + * Author: Arlin Davis + * + * Created: 3/10/2005 + * + * Description: + * + * The uDAPL openib provider - completion queue + * + **************************************************************************** + * Source Control System Information + * + * $Id: $ + * + * Copyright (c) 2005 Intel Corporation. All rights reserved. 
+ * + **************************************************************************/ + +#include "dapl.h" +#include "dapl_adapter_util.h" +#include "dapl_lmr_util.h" +#include "dapl_evd_util.h" +#include "dapl_ring_buffer_util.h" +#include + +/* One CQ event channel per HCA */ +void dapli_cq_event_cb(struct _ib_hca_transport *hca) +{ + /* check all comp events on this device */ + struct dapl_evd *evd_ptr = NULL; + struct ibv_cq *ibv_cq = NULL; + struct pollfd cq_fd = { + .fd = hca->ib_cq->fd, + .events = POLLIN, + .revents = 0 + }; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," dapli_cq_event_cb(%p)\n", hca); + + if ((poll(&cq_fd, 1, 0) == 1) && + (!ibv_get_cq_event(hca->ib_cq, + &ibv_cq, (void*)&evd_ptr))) { + + if (DAPL_BAD_HANDLE(evd_ptr, DAPL_MAGIC_EVD)) { + ibv_ack_cq_events(ibv_cq, 1); + return; + } + + /* process DTO event via callback */ + dapl_evd_dto_callback ( hca->cm_id->verbs, + evd_ptr->ib_cq_handle, + (void*)evd_ptr ); + + ibv_ack_cq_events(ibv_cq, 1); + } +} + +/* + * Map all verbs DTO completion codes to the DAT equivalent.
+ * + * Not returned by verbs: DAT_DTO_ERR_PARTIAL_PACKET + */ +static struct ib_status_map +{ + int ib_status; + DAT_DTO_COMPLETION_STATUS dat_status; +} ib_status_map[] = { +/* 00 */ { IBV_WC_SUCCESS, DAT_DTO_SUCCESS}, +/* 01 */ { IBV_WC_LOC_LEN_ERR, DAT_DTO_ERR_LOCAL_LENGTH}, +/* 02 */ { IBV_WC_LOC_QP_OP_ERR, DAT_DTO_ERR_LOCAL_EP}, +/* 03 */ { IBV_WC_LOC_EEC_OP_ERR, DAT_DTO_ERR_TRANSPORT}, +/* 04 */ { IBV_WC_LOC_PROT_ERR, DAT_DTO_ERR_LOCAL_PROTECTION}, +/* 05 */ { IBV_WC_WR_FLUSH_ERR, DAT_DTO_ERR_FLUSHED}, +/* 06 */ { IBV_WC_MW_BIND_ERR, DAT_RMR_OPERATION_FAILED}, +/* 07 */ { IBV_WC_BAD_RESP_ERR, DAT_DTO_ERR_BAD_RESPONSE}, +/* 08 */ { IBV_WC_LOC_ACCESS_ERR, DAT_DTO_ERR_LOCAL_PROTECTION}, +/* 09 */ { IBV_WC_REM_INV_REQ_ERR, DAT_DTO_ERR_REMOTE_RESPONDER}, +/* 10 */ { IBV_WC_REM_ACCESS_ERR, DAT_DTO_ERR_REMOTE_ACCESS}, +/* 11 */ { IBV_WC_REM_OP_ERR, DAT_DTO_ERR_REMOTE_RESPONDER}, +/* 12 */ { IBV_WC_RETRY_EXC_ERR, DAT_DTO_ERR_TRANSPORT}, +/* 13 */ { IBV_WC_RNR_RETRY_EXC_ERR, DAT_DTO_ERR_RECEIVER_NOT_READY}, +/* 14 */ { IBV_WC_LOC_RDD_VIOL_ERR, DAT_DTO_ERR_LOCAL_PROTECTION}, +/* 15 */ { IBV_WC_REM_INV_RD_REQ_ERR, DAT_DTO_ERR_REMOTE_RESPONDER}, +/* 16 */ { IBV_WC_REM_ABORT_ERR, DAT_DTO_ERR_REMOTE_RESPONDER}, +/* 17 */ { IBV_WC_INV_EECN_ERR, DAT_DTO_ERR_TRANSPORT}, +/* 18 */ { IBV_WC_INV_EEC_STATE_ERR, DAT_DTO_ERR_TRANSPORT}, +/* 19 */ { IBV_WC_FATAL_ERR, DAT_DTO_ERR_TRANSPORT}, +/* 20 */ { IBV_WC_RESP_TIMEOUT_ERR, DAT_DTO_ERR_RECEIVER_NOT_READY}, +/* 21 */ { IBV_WC_GENERAL_ERR, DAT_DTO_ERR_TRANSPORT}, +}; + +/* + * dapls_ib_get_dto_status + * + * Return the DAT status of a DTO operation + * + * Input: + * cqe_ptr pointer to completion queue entry + * + * Output: + * none + * + * Returns: + * Value from ib_status_map table above + */ + +DAT_DTO_COMPLETION_STATUS +dapls_ib_get_dto_status (IN ib_work_completion_t *cqe_ptr) +{ + uint32_t ib_status; + int i; + + ib_status = DAPL_GET_CQE_STATUS (cqe_ptr); + + /* + * Due to the implementation of verbs completion code, we need 
to + * search the table for the correct value rather than assuming + * linear distribution. + */ + for (i=0; i <= IBV_WC_GENERAL_ERR; i++) { + if (ib_status == ib_status_map[i].ib_status) { + if (ib_status != IBV_WC_SUCCESS) { + dapl_dbg_log(DAPL_DBG_TYPE_DTO_COMP_ERR, + " DTO completion ERROR: %d: op %#x\n", + ib_status, + DAPL_GET_CQE_OPTYPE (cqe_ptr)); + } + return ib_status_map[i].dat_status; + } + } + + dapl_dbg_log(DAPL_DBG_TYPE_DTO_COMP_ERR, + " DTO completion ERROR: %d: op %#x\n", + ib_status, + DAPL_GET_CQE_OPTYPE (cqe_ptr)); + + return DAT_DTO_FAILURE; +} + +DAT_RETURN dapls_ib_get_async_event(IN ib_error_record_t *err_record, + OUT DAT_EVENT_NUMBER *async_event) +{ + DAT_RETURN dat_status = DAT_SUCCESS; + int err_code = err_record->event_type; + + switch (err_code) { + /* OVERFLOW error */ + case IBV_EVENT_CQ_ERR: + *async_event = DAT_ASYNC_ERROR_EVD_OVERFLOW; + break; + /* INTERNAL errors */ + case IBV_EVENT_DEVICE_FATAL: + *async_event = DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR; + break; + /* CATASTROPHIC errors */ + case IBV_EVENT_PORT_ERR: + *async_event = DAT_ASYNC_ERROR_IA_CATASTROPHIC; + break; + /* BROKEN QP error */ + case IBV_EVENT_SQ_DRAINED: + case IBV_EVENT_QP_FATAL: + case IBV_EVENT_QP_REQ_ERR: + case IBV_EVENT_QP_ACCESS_ERR: + *async_event = DAT_ASYNC_ERROR_EP_BROKEN; + break; + /* connection completion */ + case IBV_EVENT_COMM_EST: + *async_event = DAT_CONNECTION_EVENT_ESTABLISHED; + break; + /* TODO: process HW state changes */ + case IBV_EVENT_PATH_MIG: + case IBV_EVENT_PATH_MIG_ERR: + case IBV_EVENT_PORT_ACTIVE: + case IBV_EVENT_LID_CHANGE: + case IBV_EVENT_PKEY_CHANGE: + case IBV_EVENT_SM_CHANGE: + default: + dat_status = DAT_ERROR (DAT_NOT_IMPLEMENTED, 0); + } + return dat_status; +} + +/* + * dapl_ib_cq_alloc + * + * Alloc a CQ + * + * Input: + * ia_handle IA handle + * evd_ptr pointer to EVD struct + * cqlen minimum QLen + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * + */ +DAT_RETURN 
+dapls_ib_cq_alloc(IN DAPL_IA *ia_ptr, + IN DAPL_EVD *evd_ptr, + IN DAT_COUNT *cqlen ) +{ + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + "dapls_ib_cq_alloc: evd %p cqlen=%d \n", evd_ptr, *cqlen); + + struct ibv_comp_channel *channel = ia_ptr->hca_ptr->ib_trans.ib_cq; + +#ifdef CQ_WAIT_OBJECT + if (evd_ptr->cq_wait_obj_handle) + channel = evd_ptr->cq_wait_obj_handle; +#endif + + /* Call IB verbs to create CQ */ + evd_ptr->ib_cq_handle = ibv_create_cq(ia_ptr->hca_ptr->ib_hca_handle, + *cqlen, + evd_ptr, + channel, 0); + + if (evd_ptr->ib_cq_handle == IB_INVALID_HANDLE) + return DAT_INSUFFICIENT_RESOURCES; + + /* arm cq for events */ + dapls_set_cq_notify(ia_ptr, evd_ptr); + + /* update with returned cq entry size */ + *cqlen = evd_ptr->ib_cq_handle->cqe; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + "dapls_ib_cq_alloc: new_cq %p cqlen=%d \n", + evd_ptr->ib_cq_handle, *cqlen ); + + return DAT_SUCCESS; +} + + +/* + * dapl_ib_cq_resize + * + * Resize a CQ + * + * Input: + * ia_handle IA handle + * evd_ptr pointer to EVD struct + * cqlen minimum QLen + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_PARAMETER + * + */ +DAT_RETURN +dapls_ib_cq_resize(IN DAPL_IA *ia_ptr, + IN DAPL_EVD *evd_ptr, + IN DAT_COUNT *cqlen ) +{ + ib_cq_handle_t new_cq; + struct ibv_comp_channel *channel = ia_ptr->hca_ptr->ib_trans.ib_cq; + + /* IB verbs does not support resize. Try to re-create CQ + * with new size. Can only be done if QP is not attached. + * destroy EBUSY == QP still attached.
+ */ + +#ifdef CQ_WAIT_OBJECT + if (evd_ptr->cq_wait_obj_handle) + channel = evd_ptr->cq_wait_obj_handle; +#endif + + /* Call IB verbs to create CQ */ + new_cq = ibv_create_cq(ia_ptr->hca_ptr->ib_hca_handle, *cqlen, + evd_ptr, channel, 0); + + if (new_cq == IB_INVALID_HANDLE) + return DAT_INSUFFICIENT_RESOURCES; + + /* destroy the original and replace if successful */ + if (ibv_destroy_cq(evd_ptr->ib_cq_handle)) { + ibv_destroy_cq(new_cq); + return(dapl_convert_errno(errno,"resize_cq")); + } + + /* update EVD with new cq handle and size */ + evd_ptr->ib_cq_handle = new_cq; + *cqlen = new_cq->cqe; + + /* arm cq for events */ + dapls_set_cq_notify (ia_ptr, evd_ptr); + + return DAT_SUCCESS; +} + +/* + * dapls_ib_cq_free + * + * destroy a CQ + * + * Input: + * ia_handle IA handle + * evd_ptr pointer to EVD struct + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_PARAMETER + * + */ +DAT_RETURN dapls_ib_cq_free(IN DAPL_IA *ia_ptr, IN DAPL_EVD *evd_ptr) +{ + if (evd_ptr->ib_cq_handle != IB_INVALID_HANDLE) { + /* copy all entries on CQ to EVD before destroying */ + dapls_evd_copy_cq(evd_ptr); + if (ibv_destroy_cq(evd_ptr->ib_cq_handle)) + return(dapl_convert_errno(errno,"destroy_cq")); + evd_ptr->ib_cq_handle = IB_INVALID_HANDLE; + } + return DAT_SUCCESS; +} + +/* + * dapls_set_cq_notify + * + * Set the CQ notification for next + * + * Input: + * hca_handl hca handle + * DAPL_EVD evd handle + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * dapl_convert_errno + */ +DAT_RETURN dapls_set_cq_notify(IN DAPL_IA *ia_ptr, IN DAPL_EVD *evd_ptr) +{ + if (ibv_req_notify_cq(evd_ptr->ib_cq_handle, 0)) + return(dapl_convert_errno(errno,"notify_cq")); + else + return DAT_SUCCESS; +} + +/* + * dapls_ib_completion_notify + * + * Set the CQ notification type + * + * Input: + * hca_handl hca handle + * evd_ptr evd handle + * type notification type + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * dapl_convert_errno + */ +DAT_RETURN 
dapls_ib_completion_notify(IN ib_hca_handle_t hca_handle, + IN DAPL_EVD *evd_ptr, + IN ib_notification_type_t type) +{ + if (ibv_req_notify_cq( evd_ptr->ib_cq_handle, type )) + return(dapl_convert_errno(errno,"notify_cq_type")); + else + return DAT_SUCCESS; +} + +/* + * dapls_ib_completion_poll + * + * CQ poll for completions + * + * Input: + * hca_handl hca handle + * evd_ptr evd handle + * wc_ptr work completion + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_QUEUE_EMPTY + * + */ +DAT_RETURN dapls_ib_completion_poll(IN DAPL_HCA *hca_ptr, + IN DAPL_EVD *evd_ptr, + IN ib_work_completion_t *wc_ptr) +{ + if (ibv_poll_cq(evd_ptr->ib_cq_handle, 1, wc_ptr) == 1) + return DAT_SUCCESS; + + return DAT_QUEUE_EMPTY; +} + +#ifdef CQ_WAIT_OBJECT + +/* NEW common wait objects for providers with direct CQ wait objects */ +DAT_RETURN +dapls_ib_wait_object_create(IN DAPL_EVD *evd_ptr, + IN ib_wait_obj_handle_t *p_cq_wait_obj_handle) +{ + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " cq_object_create: (%p,%p)\n", + evd_ptr, p_cq_wait_obj_handle ); + + /* set cq_wait object to evd_ptr */ + *p_cq_wait_obj_handle = + ibv_create_comp_channel( + evd_ptr->header.owner_ia->hca_ptr->ib_hca_handle); + + return DAT_SUCCESS; +} + +DAT_RETURN +dapls_ib_wait_object_destroy(IN ib_wait_obj_handle_t p_cq_wait_obj_handle) +{ + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " cq_object_destroy: wait_obj=%p\n", + p_cq_wait_obj_handle ); + + ibv_destroy_comp_channel(p_cq_wait_obj_handle); + + return DAT_SUCCESS; +} + +DAT_RETURN +dapls_ib_wait_object_wakeup (IN ib_wait_obj_handle_t p_cq_wait_obj_handle) +{ + dapl_dbg_log ( DAPL_DBG_TYPE_UTIL, + " cq_object_wakeup: wait_obj=%p\n", + p_cq_wait_obj_handle ); + + /* no wake up mechanism */ + return DAT_SUCCESS; +} + +DAT_RETURN +dapls_ib_wait_object_wait(IN ib_wait_obj_handle_t p_cq_wait_obj_handle, + IN u_int32_t timeout) +{ + struct dapl_evd *evd_ptr; + struct ibv_cq *ibv_cq = NULL; + void *ibv_ctx = NULL; + int status = 0; + int timeout_ms = -1; + struct 
pollfd cq_fd = { + .fd = p_cq_wait_obj_handle->fd, + .events = POLLIN, + .revents = 0 + }; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " cq_object_wait: CQ channel %p time %d\n", + p_cq_wait_obj_handle, timeout ); + + /* uDAPL timeout values in usecs */ + if (timeout != DAT_TIMEOUT_INFINITE) + timeout_ms = timeout/1000; + + status = poll(&cq_fd, 1, timeout_ms); + + /* returned event */ + if (status > 0) { + if (!ibv_get_cq_event(p_cq_wait_obj_handle, + &ibv_cq, (void*)&evd_ptr)) { + ibv_ack_cq_events(ibv_cq, 1); + } + status = 0; + + /* timeout */ + } else if (status == 0) + status = ETIMEDOUT; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " cq_object_wait: RET evd %p ibv_cq %p ibv_ctx %p %s\n", + evd_ptr, ibv_cq,ibv_ctx,strerror(errno)); + + return(dapl_convert_errno(status,"cq_wait_object_wait")); + +} +#endif + +/* + * Local variables: + * c-indent-level: 4 + * c-basic-offset: 4 + * tab-width: 8 + * End: + */ + Index: doc/dat.conf =================================================================== --- doc/dat.conf (revision 4178) +++ doc/dat.conf (working copy) @@ -1,15 +1,17 @@ # -# DAT 1.1 configuration file +# DAT 1.2 configuration file # # Each entry should have the following fields: # -# \ +# \ # # - -ia0 u1.1 nonthreadsafe default /usr/lib/libdapl.so ri.1.1 "ia_params" "pd_params" - -# Example for openib using the first Mellanox adapter, port 1 and port 2 -OpenIB1 u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "mthca0 1" "" -OpenIB2 u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "mthca0 2" "" - +# Example for openib_cma and openib_scm +# +# For scm version you specify as actual device name and port +# For cma version you specify as the ib device network address or network hostname and 0 for port +# +OpenIB-scm1 u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "mthca0 1" "" +OpenIB-scm2 u1.2 nonthreadsafe default 
/usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "mthca0 2" "" +OpenIB-cma-ip u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "192.168.0.22 0" "" +OpenIB-cma-name u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "svr1-ib0 0" "" Index: README =================================================================== --- README (revision 0) +++ README (revision 0) @@ -0,0 +1,94 @@ +There are now 3 uDAPL providers for openib (openib,openib_scm,openib_cma). + + +========== +1.0 BUILD: +========== + +The default build is now set to openib_cma. This version requires librdmacm +installation, IPoIB installation, and IPoIB configuration with an IP address. + +Building DAT library: +-------------------- +cd dat/udat +make clean +make + +the dat library target directory is dat/udat/$(ARCH)/Target + +Building default DAPL library: +---------------------------- + +cd dapl/udapl +make clean +make + +the dapl library target directory is dapl/udapl/Target + +NOTE: to link these libraries you must either use libtool and +specify the full pathname of the library, or use the `-LLIBDIR' +flag during linking and do at least one of the following: + - add LIBDIR to the `LD_LIBRARY_PATH' environment variable + during execution + - add LIBDIR to the `LD_RUN_PATH' environment variable + during linking + - use the `-Wl,--rpath -Wl,LIBDIR' linker flag + - have your system administrator add LIBDIR to `/etc/ld.so.conf' + +See any operating system documentation about shared libraries for +more information, such as the ld(1) and ld.so(8) manual pages. 
+ +Building specific verbs provider set VERBS to provider name: +----------------------------------------------------------- +example for socket cm version (openib_scm): + +cd dapl/udapl +make VERBS=openib_scm clean +make VERBS=openib_scm + +=================== +2.0 CONFIGURATION: +=================== + +sample /etc/dat.conf + +# +# DAT 1.2 configuration file +# +# Each entry should have the following fields: +# +# <ia_name> <api_version> <threadsafety> <default> <lib_path> \ +# <provider_version> <ia_params> <platform_params> +# +# Example for openib_cma and openib_scm +# +# For scm version you specify <ia_params> as actual device name and port +# For cma version you specify <ia_params> as the ib device network address or network hostname and 0 for port +# +OpenIB-scm1 u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "mthca0 1" "" +OpenIB-scm2 u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "mthca0 2" "" +OpenIB-cma-ip u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "192.168.0.22 0" "" +OpenIB-cma-name u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "svr1-ib0 0" "" + +============================= +3.0 SAMPLE uDAPL APPLICATION: +============================= + +sample makefile for simple application (dapl/test/dtest.c) based on previous dat.conf settings: + +CC = gcc +CFLAGS = -O2 -g + +DAT_INC = ../../dat/include +DAT_LIB = /usr/local/lib + +all: dtest + +clean: + rm -f *.o;touch *.c;rm -f dtest + +dtest: ./dtest.c + $(CC) $(CFLAGS) ./dtest.c -o dtest \ + -DDAPL_PROVIDER='"OpenIB-cma-name"' \ + -I $(DAT_INC) -L $(DAT_LIB) -ldat + From hozer at hozed.org Wed Nov 30 17:33:44 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Wed, 30 Nov 2005 19:33:44 -0600 Subject: [openib-general] OpenSM not coming out of standby state.. Message-ID: <20051201013344.GW3275@kalmia.hozed.org> A couple of days ago I started up two instances of opensm on my network, and set one with priority 11, the other with the default 10.
I could kill one and the other would become master a few minutes later. Well, today, I found that there are no active links anywhere in the network.. But both SM's still appeared to be running. then I killed them both, and restarted one with 'opensm -V -p 11', it is still staying in STANDBY state, and produced the 4MB log available at http://scl.ameslab.gov/~troy/osm.log-nomaster (Hal, if you want access to this system, let me know) -- -------------------------------------------------------------------------- Troy Benjegerdes 'da hozer' hozer at hozed.org Somone asked me why I work on this free (http://www.fsf.org/philosophy/) software stuff and not get a real job. Charles Shultz had the best answer: "Why do musicians compose symphonies and poets write poems? They do it because life wouldn't have any meaning for them if they didn't. That's why I draw cartoons. It's my life." -- Charles Shultz From hozer at hozed.org Wed Nov 30 17:54:45 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Wed, 30 Nov 2005 19:54:45 -0600 Subject: [openib-general] OpenSM not coming out of standby state.. In-Reply-To: <20051201013344.GW3275@kalmia.hozed.org> References: <20051201013344.GW3275@kalmia.hozed.org> Message-ID: <20051201015445.GY3275@kalmia.hozed.org> On Wed, Nov 30, 2005 at 07:33:44PM -0600, Troy Benjegerdes wrote: > A couple of days ago I started up two instances of opensm on my network, > and set one with priority 11, the other with the default 10. > > I could kill one and the other would become master a few minutes later. > > Well, today, I found that there are no active links anywhere in the > network.. But both SM's still appeared to be running. > > then I killed them both, and restarted one with 'opensm -V -p 11', > > it is still staying in STANDBY state, and produced the 4MB log available at > > http://scl.ameslab.gov/~troy/osm.log-nomaster > > (Hal, if you want access to this system, let me know) And the rest of the story.. 

This happened after I cross-connected two networks, and had opensm
running on two nodes that had back-to-back connections (with no switch).
I didn't think anything of it at the time since the 'active' lights were
off on the cards in the machines that were connected (they had physical
link, but no logical link).

I've since killed opensm on the 'new' nodes, but there is still some
state somewhere that prevents opensm from 'nicely' becoming the master..

If I run with 'opensm -d 0 -p 11', it becomes master just fine. How does
one go about tracking down a broken rogue SM that isn't bringing up the
network?

From ianjiang.ict at gmail.com Wed Nov 30 19:00:03 2005
From: ianjiang.ict at gmail.com (Ian Jiang)
Date: Thu, 1 Dec 2005 11:00:03 +0800
Subject: [openib-general] [uDAPL]Linking error of dapltest for uDAPL
Message-ID: <7b2fa1820511301900q485c8990t40d3876c45d7d0b8@mail.gmail.com>

Thanks very much to James Lentini for his reply; I have now built my uDAPL.
I want to confirm that my uDAPL runs correctly, so I tried to build
dapltest, but I got a linking error. It seems that there is something
wrong with "-ldat", but I could not find where it is referenced.
I am using the default Makefile.
Any suggestion is appreciated. Here are the details:

]# pwd
/usr/src/openib_userspace/dapl/test/dapltest/udapl
# make
--- Compiling ..
.........
.........
--- Linking Target/dapltest ---
gcc -L/usr/src/openib_userspace/dapl/test/dapltest/udapl/../../../dat/udat/Target/i686 -Wl,-R/usr/src/openib_userspace/dapl/test/dapltest/udapl/../../../dat/udat/Target/i686 -ldat -lpthread ./Obj/dapl_fft_cmd.o ./Obj/dapl_getopt.o ./Obj/dapl_limit_cmd.o ./Obj/dapl_main.o ./Obj/dapl_netaddr.o ./Obj/dapl_params.o ./Obj/dapl_performance_cmd.o ./Obj/dapl_qos_util.o ./Obj/dapl_quit_cmd.o ./Obj/dapl_server_cmd.o ./Obj/dapl_transaction_cmd.o ./Obj/dapl_bpool.o ./Obj/dapl_client.o ./Obj/dapl_client_info.o ./Obj/dapl_cnxn.o ./Obj/dapl_execute.o ./Obj/dapl_fft_connmgt.o ./Obj/dapl_fft_endpoint.o ./Obj/dapl_fft_hwconn.o ./Obj/dapl_fft_mem.o ./Obj/dapl_fft_pz.o ./Obj/dapl_fft_queryinfo.o ./Obj/dapl_fft_test.o ./Obj/dapl_fft_util.o ./Obj/dapl_limit.o ./Obj/dapl_memlist.o ./Obj/dapl_performance_client.o ./Obj/dapl_performance_server.o ./Obj/dapl_performance_stats.o ./Obj/dapl_performance_util.o ./Obj/dapl_quit_util.o ./Obj/dapl_server.o ./Obj/dapl_server_info.o ./Obj/dapl_test_data.o ./Obj/dapl_test_util.o ./Obj/dapl_thread.o ./Obj/dapl_transaction_stats.o ./Obj/dapl_transaction_test.o ./Obj/dapl_transaction_util.o ./Obj/dapl_util.o ./Obj/dapl_endian.o ./Obj/dapl_global.o ./Obj/dapl_performance_cmd_util.o ./Obj/dapl_quit_cmd_util.o ./Obj/dapl_transaction_cmd_util.o ./Obj/udapl_tdep.o ./Obj/dapl_mdep_user.o -o Target/dapltest
/usr/bin/ld: cannot find -ldat
collect2: ld returned 1 exit status
make: *** [Target/dapltest] Error 1

--
Ian Jiang
ianjiang.ict at gmail.com
Institute of Computing Technology, Chinese Academy of Sciences.

From mahesh at tifrpune.res.in Wed Nov 30 20:29:57 2005
From: mahesh at tifrpune.res.in (mahesh at tifrpune.res.in)
Date: Thu, 1 Dec 2005 10:29:57 +0600 (LKT)
Subject: [openib-general] Information Request regarding membership to OpenIB
In-Reply-To: <20051130234501.4D44E228461@openib.ca.sandia.gov>
Message-ID:

Hi All,
Ours is an academic and research organisation with latest
facilities in the field of high performance computing. We would like to
participate in the ongoing research efforts in IB.
Can someone please let me know about the procedures and fees required to
be part of this consortium. It would be of great help for us.

Sincerely,
-Mahesh

From iod00d at hp.com Wed Nov 30 21:19:36 2005
From: iod00d at hp.com (Grant Grundler)
Date: Wed, 30 Nov 2005 21:19:36 -0800
Subject: [openib-general] Information Request regarding membership to OpenIB
In-Reply-To:
References: <20051130234501.4D44E228461@openib.ca.sandia.gov>
Message-ID: <20051201051936.GE27137@esmail.cup.hp.com>

On Thu, Dec 01, 2005 at 10:29:57AM +0600, mahesh at tifrpune.res.in wrote:
> Hi All,
> Ours is an academic and research organisation with latest
> facilities in the field of high performance computing. We would like to
> participate in the ongoing research efforts in IB.

Excellent!

> Can someone please let me know about the procedures, fees required to
> be part of this consortium. It would be of great help for us.

The closest thing I can find to a charter is here:
http://www.openib.org/members.html

Can someone please post the official charter someplace under "Contact Us"?

You can contribute code/experience without being a paying member.
I expect that if you want to be part of the steering committee
and/or involved with press releases, then you'd need to pay membership fees.

My advice is to start by going to "Development Tools" on www.openib.org
and grabbing the SVN source tree. Then go to "Documentation" and
possibly visit the wiki referenced on that page.

hth, grant From eitan at mellanox.co.il Wed Nov 30 22:07:24 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 1 Dec 2005 08:07:24 +0200 Subject: [openib-general] First Multicast Leave disconnects all other clients Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3618A41@mtlexch01.mtl.com> > > > > The bottom line: > > We are missing 3 agents in the OpenIB stack: > > InformInfo - handling registrations and Report dispatching > > These are not currently used. [EZ] They are by SRP initiator. > > > ServiceRecord - tracks registrations > > ServiceRecord is implemented in sa_query (and was used by AT/uAT but > that is largely historical now) > > > Multicast Join/Leave - tracking registrations to multicast groups and > > ref-counting > > > > All these agents should be able to cleanup dead client registrations and > > also provide re-registration in case of SM ClientReregistration event. > > In OpenIB, any Set of PortInfo (which includes ClientReregister) > currently causes a (coarse) event (LID change) which causes IPoIB client > to reregister its multicasts registrations with the SA. > > > Please see below > > > > > > > > It seems the IBTA intent was that the IB driver will be responsible > > for maintaining > > > the list of clients > > > > registered to each group. > > > > > > Yes, the end node is responsible for tracking the registrations within > > > the node and fabricating responses when the node does not want to > > leave. > > > Is delete a different case though ? > > [EZ] No it is not. Delete of multicast group is really the last leave. > > There is an explicit delete. While it shouldn't be needed to be forced, > there is always some scenario where this is useful. [EZ] To my best knowledge any leave is a "delete" so there is no way for any client to force other members out of a group. It can only leave itself. The delete will happen when the last will leave. 
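
A minimal sketch of the per-node reference counting discussed here, in plain C. The table and the names mcast_join() and mcast_leave() are hypothetical, chosen for illustration only; they are not OpenIB APIs. The only point is that the SA should see a join for the first local client of a group and a leave for the last one, so one client leaving does not disconnect the others.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_GROUPS 16   /* arbitrary limit for this sketch */

/* Hypothetical per-node record: one entry per multicast group (MGID). */
struct mcast_ref {
    uint8_t  mgid[16];
    unsigned refcount;  /* number of local clients currently joined */
};

static struct mcast_ref groups[MAX_GROUPS];

/* Look up an in-use entry by MGID; unused slots have refcount 0. */
static struct mcast_ref *find_group(const uint8_t mgid[16])
{
    for (int i = 0; i < MAX_GROUPS; i++)
        if (groups[i].refcount && !memcmp(groups[i].mgid, mgid, 16))
            return &groups[i];
    return NULL;
}

/* Returns 1 if a join must be sent to the SA (first local member),
 * 0 if the node is already a member, -1 if the table is full. */
int mcast_join(const uint8_t mgid[16])
{
    struct mcast_ref *g = find_group(mgid);

    if (g) {
        g->refcount++;
        return 0;
    }
    for (int i = 0; i < MAX_GROUPS; i++) {
        if (!groups[i].refcount) {
            memcpy(groups[i].mgid, mgid, 16);
            groups[i].refcount = 1;
            return 1;
        }
    }
    return -1;
}

/* Returns 1 if a leave must be sent to the SA (last local member left),
 * 0 if other local clients remain, -1 if the group is not known. */
int mcast_leave(const uint8_t mgid[16])
{
    struct mcast_ref *g = find_group(mgid);

    if (!g)
        return -1;
    return --g->refcount == 0 ? 1 : 0;
}
```

Cleanup of dead clients would need an owner recorded per reference, which this sketch leaves out.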
> > > > > But the IB core does not track what clients registered (through SA > > requests) to a > > > particular multicast group. > > > > The first client to leave the group causes the rest (of the clients) > > to be disconnected. > > > > > > This is an implementation issue IMO and applies to other subscriptions > > > too (not just limited to multicast). > > [EZ] I agree it is an implementation issue. I hope it will get > > implemented in OpenIB. > > It will. It's a question of priorities and timing. > > > > > My proposal is to provide an API for such registrations at both user > > and kernel and > > > track the requesting processes. > > > > Cleanup is also required both by process and kernel module > > granularity. > > > > > > Is the API the SA client request itself for this ? Shouldn't the > > > tracking be done there (within sa_query.c) ? > > [EZ] It will be hard to sniff the MADs (especially user level) for all > > the registration flows. > > It's not the sniffing which is hard but perhaps identifying which client > (and reference counting). > > > So I propose we should have > > ib_join/ib_leave/ib_reg_svc/ib_unreg_svc/ib_reg_inform/ib_unreg_inform. > > Both in user land and in kernel. > > I think this is TBD and the API would be discussed on this list first > prior to any implementation. > > > > > BTW: The same API could also handle "Client Reregistration" for > > multicast groups, > > > > > > Client reregistration is for all subscriptions (including > > ServiceRecords > > > and events as well). > > [EZ] Yes exactly. I believe similar problem exists for all > > registrations. > > > > > > > such that we could avoid the need to have that code duplicated by > > every client. > > > > > > I'm missing how client reregistration would help here. Can you > > elaborate > > > ? 
> > [EZ] It is related to the reference tracking: > > If a kernel module tracks all registrations to refcount them and perform > > cleanup, it could with similar effort also send the - re-registration in > > the event of SM change ... > > Sure, there are multiple ways to skin the same cat. > > > > > > > > But this refers to yet another API that is missing: Report > > dispatching which deserves > > > its own > > > > mail... > > > > > > I'm missing the connection between reregistration and report > > > dispatching. > > [EZ] Sorry for not being verbose. The need for Events dispatcher is > > based on the fact that only one client should respond to Report with > > ReportRepress. Reports are "unsolicited" MADs coming into the device. In > > umad the implementation prevents any "multiple" client registration for > > receiving any "unsolicited" MAD - only one class-agent needs to be there > > handling "unsolicited" messages. This is fine - but what it means is > > that when two clients wants to be notified about events they should > > register with that agent and the agent should be able to dispatch the > > message to all registered clients as well as send only one response > > back. > > Wouldn't report represses be reference counted and only actually sent on > the wire when all subscribed clients within the node indicated repress ? [EZ] As you say there are many ways to skin a cat. I am not sure we need to wait for all clients as they are located on the same node and will be surely notified. > > -- Hal [EZ] From halr at voltaire.com Wed Nov 30 22:04:49 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 01 Dec 2005 01:04:49 -0500 Subject: [openib-general] OpenSM not coming out of standby state.. 
In-Reply-To: <20051201015445.GY3275@kalmia.hozed.org>
References: <20051201013344.GW3275@kalmia.hozed.org> <20051201015445.GY3275@kalmia.hozed.org>
Message-ID: <1133417086.2984.18206.camel@hal.voltaire.com>

Hi Troy,

On Wed, 2005-11-30 at 20:54, Troy Benjegerdes wrote:
> On Wed, Nov 30, 2005 at 07:33:44PM -0600, Troy Benjegerdes wrote:
> > A couple of days ago I started up two instances of opensm on my network,
> > and set one with priority 11, the other with the default 10.
> >
> > I could kill one and the other would become master a few minutes later.
> >
> > Well, today, I found that there are no active links anywhere in the
> > network.. But both SM's still appeared to be running.

OK. It seems weird that the master would respond to polls from the
standby but not bring links to active. Do you have a log of the master ?
That might be more informative.

> > then I killed them both, and restarted one with 'opensm -V -p 11',
> >
> > it is still staying in STANDBY state, and produced the 4MB log available at
> >
> > http://scl.ameslab.gov/~troy/osm.log-nomaster

There is another master out there:

Nov 30 19:24:54 421835 [41802960] -> SMInfo dump:
    guid....................0x0002c90108cd8ba1
    sm_key..................0x0000000000000000
    act_count...............7389948
    priority................0
    sm_state................3

which shows the other SM (state 3 = master) at priority 0. What node is
that ? What's weird is that the higher priority SM does not appear to
take over from it. Not sure why right now.
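
For reference, a small sketch of the ordering rule as I understand it from the IBA SM election description: the higher priority wins, and on equal priority the lower port GUID wins. The struct and function names are made up for this example and are not OpenSM code; the local GUID in the usage note is a placeholder, since the log excerpt does not show it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical summary of the SMInfo fields that matter for election. */
struct sm_info {
    uint64_t guid;      /* port GUID, host byte order */
    uint8_t  priority;  /* SMInfo priority */
};

/* Returns 1 if SM a should win mastership over SM b, 0 otherwise.
 * Assumed rule: higher priority first, then lower GUID as tie-break. */
int sm_outranks(const struct sm_info *a, const struct sm_info *b)
{
    if (a->priority != b->priority)
        return a->priority > b->priority;
    return a->guid < b->guid;
}
```

By that rule an SM started with '-p 11' outranks the priority 0 master at 0x0002c90108cd8ba1, which is why the standby not taking over looks odd. The comparison alone does not model the handover itself.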

Another weird thing is the following at the start of the log:

Nov 30 19:22:41 420334 [AB44FCE0] -> osm_db_restore: Got key:0x0002c90108cd85f1 value:0x000c 0x000c
Nov 30 19:22:41 420342 [AB44FCE0] -> osm_db_restore: Got key:0x00066a00a0000458 value:0x0007 0x0007
Nov 30 19:22:41 420349 [AB44FCE0] -> osm_db_restore: Got key:0x0002550000039e00 value:0x0003 0x0003
Nov 30 19:22:41 420355 [AB44FCE0] -> osm_db_restore: Got key:0x00066a00a0000444 value:0x0006 0x0006
Nov 30 19:22:41 420362 [AB44FCE0] -> osm_db_restore: Got key:0x00066a00a000043c value:0x0004 0x0004
Nov 30 19:22:41 420369 [AB44FCE0] -> osm_db_restore: Got key:0x67609ef000040000 value:0x0017 0x0017
Nov 30 19:22:41 420375 [AB44FCE0] -> osm_db_restore: Got key:0x0002c90200402915 value:0x0002 0x0002
Nov 30 19:22:41 420382 [AB44FCE0] -> osm_db_restore: Got key:0x0002c90108cd9bd1 value:0x000b 0x000b
Nov 30 19:22:41 420388 [AB44FCE0] -> osm_db_restore: Got key:0x0002c90108cd98c1 value:0x000a 0x000a
Nov 30 19:22:41 420395 [AB44FCE0] -> osm_db_restore: Got key:0x0002c90108cd84a1 value:0x000d 0x000d
Nov 30 19:22:41 420402 [AB44FCE0] -> osm_db_restore: Got key:0x6760a0f000040080 value:0x0021 0x0021
Nov 30 19:22:41 420408 [AB44FCE0] -> osm_db_restore: Got key:0x0002c90200007b3d value:0x0014 0x0014
Nov 30 19:22:41 420415 [AB44FCE0] -> osm_db_restore: Got key:0x0002c9020040272d value:0x001d 0x001d
Nov 30 19:22:41 420421 [AB44FCE0] -> osm_db_restore: Got key:0x0002c90200402917 value:0x000e 0x000e
Nov 30 19:22:41 420428 [AB44FCE0] -> osm_db_restore: Got key:0x0002c90200402781 value:0x0001 0x0001
Nov 30 19:22:41 420435 [AB44FCE0] -> osm_db_restore: Got key:0x0002c90200402782 value:0x0009 0x0009
Nov 30 19:22:41 420445 [AB44FCE0] -> osm_db_restore: Got key:0x6760cef000040080 value:0x000f 0x000f
Nov 30 19:22:41 420452 [AB44FCE0] -> osm_db_restore: Got key:0x0002c9020040272e value:0x001e 0x001e
Nov 30 19:22:41 420459 [AB44FCE0] -> osm_db_restore: Got key:0x6760cef000040000 value:0x0012 0x0012
Nov 30 19:22:41 420465 [AB44FCE0] -> osm_db_restore: Got key:0x6760a0f000040000 value:0x0022 0x0022
Nov 30 19:22:41 420472 [AB44FCE0] -> osm_db_restore: Got key:0x0002c90108cd0b71 value:0x0010 0x0010
Nov 30 19:22:41 420479 [AB44FCE0] -> osm_db_restore: Got key:0x00066a00a000044e value:0x0011 0x0011
Nov 30 19:22:41 420485 [AB44FCE0] -> osm_db_restore: Got key:0x0002550000039e80 value:0x0013 0x0013
Nov 30 19:22:41 420492 [AB44FCE0] -> osm_db_restore: Got key:0x67609ef000040080 value:0x0015 0x0015
Nov 30 19:22:41 420498 [AB44FCE0] -> osm_db_restore: Got key:0x00066a00a0000441 value:0x0005 0x0005
Nov 30 19:22:41 420505 [AB44FCE0] -> osm_db_restore: Got key:0x0002c90200007bd5 value:0x0016 0x0016
Nov 30 19:22:41 420512 [AB44FCE0] -> osm_db_restore: Got key:0x0002c90108ccc571 value:0x0008 0x0008

There are a number of funky OUIs here (the ones starting with 0x6760xx).
For example, 0x6760a0 and 0x67609e do not appear to be valid OUIs. Any
idea on what equipment that is ? (Perhaps there is some endian problem
with these).

> > (Hal, if you want access to this system, let me know)

I may.

> And the rest of the story..
>
> This happened after I cross-connected two networks, and had opensm
> running on two nodes that had back-to-back connections (with no switch).
> I didn't think anything of it at the time since the 'active' lights were
> off on the cards machines that were connected (they had physical link,
> but no logical link).

I'm not sure I have a good picture of your network topology. You cross
connected 2 subnets but there were no switches between 2 SMs. I'm
missing something here.

> I've since killed opensm on the 'new' nodes, but there is still some
> state somewhere that prevents opensm from 'nicely' becoming the master..

There appears to be another master out there but with priority 0.

> If I run with 'opensm -d 0 -p 11', it becomes master just fine. How does
> one go about tracking down a broken rogue SM that isn't bringing up the
> network?

Can you find the following GUID 0x0002c90108cd8ba1 ?
I would pull that one off the network and see. I don't see that in the
original database. Can you see it with ibnetdiscover ?

Here's its NodeInfo from the standby log:

Nov 30 19:22:44 238782 [41001960] -> NodeInfo dump:
    base_version............0x1
    class_version...........0x1
    node_type...............Channel Adapter
    num_ports...............0x2
    sys_guid................0x0002c9000100d050
    node_guid...............0x0002c90108cd8ba0
    port_guid...............0x0002c90108cd8ba1
    partition_cap...........0x40
    device_id...............0x5A44
    revision................0xA1
    port_num................0x1
    vendor_id...............0x2C9

and its PortInfo shows:

Nov 30 19:22:44 301946 [41001960] -> Capabilities Mask:
    IB_PORT_CAP_IS_SM
    IB_PORT_CAP_HAS_TRAP
    IB_PORT_CAP_HAS_AUTO_MIG
    IB_PORT_CAP_HAS_SL_MAP
    IB_PORT_CAP_HAS_LED_INFO
    IB_PORT_CAP_HAS_SYS_IMG_GUID
    IB_PORT_CAP_HAS_VEND_CLS
    IB_PORT_CAP_HAS_CAP_NTC

that it is running an SM and this node has a LID of 4:

    port number.............0x1
    node_guid...............0x0002c90108cd8ba0
    port_guid...............0x0002c90108cd8ba1
    m_key...................0x0000000000000000
    subnet_prefix...........0xfe80000000000000
    base_lid................0x4
    master_sm_base_lid......0x4

That LID does conflict with one from the database:

Nov 30 19:22:41 420362 [AB44FCE0] -> osm_db_restore: Got key:0x00066a00a000043c value:0x0004 0x0004

Subsequent to this, the standby reports:

Nov 30 19:22:44 301995 [41001960] -> __osm_pi_rcv_process_endport: Detected another SM. Requesting SMInfo. Port 0x2c90108cd8ba1.

I think there is some issue with conflict resolution of duplicated LIDs
when subnets are merged.
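
To make the duplicate LID case concrete, here is a minimal sketch of the check, assuming a flat GUID-to-LID table like the one osm_db_restore prints above. The struct and function names are hypothetical and this is not how OpenSM stores or resolves LIDs; it only shows how the LID 4 reported by port 0x0002c90108cd8ba1 collides with the persisted entry for 0x00066a00a000043c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical persisted mapping, one entry per port GUID. */
struct guid2lid {
    uint64_t guid;
    uint16_t lid;
};

/* Return the entry that already owns `lid` under a different GUID,
 * or NULL if the (guid, lid) pair does not conflict with the table. */
const struct guid2lid *find_lid_conflict(const struct guid2lid *db, size_t n,
                                         uint64_t guid, uint16_t lid)
{
    for (size_t i = 0; i < n; i++)
        if (db[i].lid == lid && db[i].guid != guid)
            return &db[i];
    return NULL;
}
```

With the two values from the log, the lookup for (0x0002c90108cd8ba1, 0x4) returns the 0x00066a00a000043c entry; what to do with the conflict after a subnet merge is the open question.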
-- Hal From halr at voltaire.com Wed Nov 30 22:19:59 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 01 Dec 2005 01:19:59 -0500 Subject: [openib-general] First Multicast Leave disconnects all other clients In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3618A41@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3618A41@mtlexch01.mtl.com> Message-ID: <1133417933.2984.18321.camel@hal.voltaire.com> On Thu, 2005-12-01 at 01:07, Eitan Zahavi wrote: > > > > > > The bottom line: > > > We are missing 3 agents in the OpenIB stack: > > > InformInfo - handling registrations and Report dispatching > > > > These are not currently used. > [EZ] They are by SRP initiator. Not the SRP initiator in OpenIB svn as far as I can tell. > > > ServiceRecord - tracks registrations > > > > ServiceRecord is implemented in sa_query (and was used by AT/uAT but > > that is largely historical now) > > > > > Multicast Join/Leave - tracking registrations to multicast groups > and > > > ref-counting > > > > > > All these agents should be able to cleanup dead client registrations > and > > > also provide re-registration in case of SM ClientReregistration > event. > > > > In OpenIB, any Set of PortInfo (which includes ClientReregister) > > currently causes a (coarse) event (LID change) which causes IPoIB > client > > to reregister its multicasts registrations with the SA. > > > > > Please see below > > > > > > > > > > It seems the IBTA intent was that the IB driver will be > responsible > > > for maintaining > > > > the list of clients > > > > > registered to each group. > > > > > > > > Yes, the end node is responsible for tracking the registrations > within > > > > the node and fabricating responses when the node does not want to > > > leave. > > > > Is delete a different case though ? > > > [EZ] No it is not. Delete of multicast group is really the last > leave. > > > > There is an explicit delete. 
While it shouldn't be needed to be > forced, > > there is always some scenario where this is useful. > [EZ] To my best knowledge any leave is a "delete" so there is no way for > any client to force other members out of a group. It can only leave > itself. The delete will happen when the last will leave. Yes, you are right, other than the last full member (join state) rule. > > > > > But the IB core does not track what clients registered (through > SA > > > requests) to a > > > > particular multicast group. > > > > > The first client to leave the group causes the rest (of the > clients) > > > to be disconnected. > > > > > > > > This is an implementation issue IMO and applies to other > subscriptions > > > > too (not just limited to multicast). > > > [EZ] I agree it is an implementation issue. I hope it will get > > > implemented in OpenIB. > > > > It will. It's a question of priorities and timing. > > > > > > > My proposal is to provide an API for such registrations at both > user > > > and kernel and > > > > track the requesting processes. > > > > > Cleanup is also required both by process and kernel module > > > granularity. > > > > > > > > Is the API the SA client request itself for this ? Shouldn't the > > > > tracking be done there (within sa_query.c) ? > > > [EZ] It will be hard to sniff the MADs (especially user level) for > all > > > the registration flows. > > > > It's not the sniffing which is hard but perhaps identifying which > client > > (and reference counting). > > > > > So I propose we should have > > > > ib_join/ib_leave/ib_reg_svc/ib_unreg_svc/ib_reg_inform/ib_unreg_inform. > > > Both in user land and in kernel. > > > > I think this is TBD and the API would be discussed on this list first > > prior to any implementation. > > > > > > > BTW: The same API could also handle "Client Reregistration" for > > > multicast groups, > > > > > > > > Client reregistration is for all subscriptions (including > > > ServiceRecords > > > > and events as well). 
> > > [EZ] Yes exactly. I believe similar problem exists for all > > > registrations. > > > > > > > > > such that we could avoid the need to have that code duplicated > by > > > every client. > > > > > > > > I'm missing how client reregistration would help here. Can you > > > elaborate > > > > ? > > > [EZ] It is related to the reference tracking: > > > If a kernel module tracks all registrations to refcount them and > perform > > > cleanup, it could with similar effort also send the - > re-registration in > > > the event of SM change ... > > > > Sure, there are multiple ways to skin the same cat. > > > > > > > > > > > But this refers to yet another API that is missing: Report > > > dispatching which deserves > > > > its own > > > > > mail... > > > > > > > > I'm missing the connection between reregistration and report > > > > dispatching. > > > [EZ] Sorry for not being verbose. The need for Events dispatcher is > > > based on the fact that only one client should respond to Report with > > > ReportRepress. Reports are "unsolicited" MADs coming into the > device. In > > > umad the implementation prevents any "multiple" client registration > for > > > receiving any "unsolicited" MAD - only one class-agent needs to be > there > > > handling "unsolicited" messages. This is fine - but what it means is > > > that when two clients wants to be notified about events they should > > > register with that agent and the agent should be able to dispatch > the > > > message to all registered clients as well as send only one response > > > back. > > > > Wouldn't report represses be reference counted and only actually sent > on > > the wire when all subscribed clients within the node indicated repress > ? > [EZ] As you say there are many ways to skin a cat. I am not sure we need > to wait for all clients as they are located on the same node and will be > surely notified. Right, it just needs to be done once whether it was actually delivered to any client, clients, or none at all. 
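
A rough sketch of such a dispatching agent, under the assumptions stated in this thread: one agent per node receives the unsolicited Report, fans it out to every registered client, and answers exactly once. All names here are hypothetical; the counter stands in for actually sending the response MAD.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CLIENTS 8   /* arbitrary limit for this sketch */

/* Client callback invoked for each incoming Report. */
typedef void (*report_cb)(const void *notice, void *ctx);

/* Hypothetical single class agent owning the unsolicited Reports. */
struct report_agent {
    struct {
        report_cb cb;
        void *ctx;
    } clients[MAX_CLIENTS];
    size_t   nclients;
    unsigned responses_sent;   /* stand-in for sending one response MAD */
};

/* Register a client; returns 0 on success, -1 if the table is full. */
int agent_register(struct report_agent *a, report_cb cb, void *ctx)
{
    if (a->nclients == MAX_CLIENTS)
        return -1;
    a->clients[a->nclients].cb = cb;
    a->clients[a->nclients].ctx = ctx;
    a->nclients++;
    return 0;
}

/* Deliver one Report to all clients, then respond once on the wire,
 * whether it reached one client, several, or none. */
void agent_handle_report(struct report_agent *a, const void *notice)
{
    for (size_t i = 0; i < a->nclients; i++)
        a->clients[i].cb(notice, a->clients[i].ctx);
    a->responses_sent++;
}

/* Example client: counts the reports it was handed. */
void count_report(const void *notice, void *ctx)
{
    (void)notice;
    ++*(int *)ctx;
}
```

Two clients registered with count_report each see the report once, while responses_sent goes up by one per report.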

-- Hal

From yael at mellanox.co.il Wed Nov 30 22:41:15 2005
From: yael at mellanox.co.il (Yael Kalka)
Date: Thu, 1 Dec 2005 08:41:15 +0200
Subject: [openib-general] RE: [PATCH] [TRIVIAL] OpenSM/complib: Move assert before variable is used
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30E244B@mtlexch01.mtl.com>

Hi Hal,
This fix isn't correct, since you are asserting on a variable not yet initialized.
Yael

-----Original Message-----
From: Hal Rosenstock [mailto:halr at voltaire.com]
Sent: Wednesday, November 30, 2005 11:04 PM
To: Yael Kalka
Cc: openib-general at openib.org
Subject: [PATCH] [TRIVIAL] OpenSM/complib: Move assert before variable is used

OpenSM/complib: Move assert before variable is used

Signed-off-by: Hal Rosenstock

Index: cl_dispatcher.c
===================================================================
--- cl_dispatcher.c (revision 4257)
+++ cl_dispatcher.c (working copy)
@@ -344,8 +344,8 @@ cl_disp_post(
 	cl_dispatcher_t *p_disp;
 	cl_disp_msg_t *p_msg;
 
-	p_disp = handle->p_disp;
 	CL_ASSERT( p_disp );
+	p_disp = handle->p_disp;
 	CL_ASSERT( msg_id != CL_DISP_MSGID_NONE );
 
 	cl_spinlock_acquire( &p_disp->lock );

From manpreet at gmail.com Wed Nov 30 23:06:42 2005
From: manpreet at gmail.com (Manpreet Singh)
Date: Wed, 30 Nov 2005 23:06:42 -0800
Subject: [openib-general] Kernel panic on RHEL4 (svn version 4016)
Message-ID: <67897d690511302306q16c35434vdbe4306c6a51e09d@mail.gmail.com>

Hi,

I was wondering if anyone has seen this or knows the cause.
I am using the redhat RPM from:
https://openib.org/svn/gen2/branches/backport-to-2.6.9/RPMS/i686/kernel-smp-2.6.9-11.OpenIB.4016.EL.root.i686.rpm .

I did a 'ib_register_mad_agent' which was successful, then tried to send a
MAD out and I get:

Kernel panic - not syncing: drivers/infiniband/core/mad.c:948:
spin_is_locked on uninitialized spinlock f88e9d76.

The spinlock seems to be initialized in the ib_register_mad_agent function.

Thanks,
Manpreet.

From ogerlitz at voltaire.com Wed Nov 30 23:20:11 2005
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 01 Dec 2005 09:20:11 +0200
Subject: [openib-general] Re: spinlock wrong CPU on CPU#1, ib_addr
In-Reply-To: <438DEB30.4000309@ichips.intel.com>
References: <438DEB30.4000309@ichips.intel.com>
Message-ID: <438EA42B.50607@voltaire.com>

Sean Hefty wrote:
> Can you describe more what iSER is doing in the callback or post the
> code?

As of Monday, the code is posted under ulp/iser in the trunk.

When iser_cma_handler gets an ADDR_RESOLVED event, it calls
iser_adaptor_find_by_device which, if needed, allocates all the required
IB resources associated with the device (PD, CQ, DMA MR, FMR pool) and
then calls cma_resolve_route.

> I haven't seen this bug before, so I'm not sure what to make of it.
> The ib_addr code doesn't even acquire spinlocks.

I see. Chances are good that the issue is within iser and the kernel debug
print is somehow misleading. Any idea what "spinlock wrong CPU" means?

Or.