From mst at mellanox.co.il Thu Dec 1 02:02:52 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 1 Dec 2005 12:02:52 +0200 Subject: [openib-general] [PATCH] libibverbs: document immediate data ordering Message-ID: <20051201100252.GS25751@mellanox.co.il> verbs.h documents ordering for immediate data in completion, but not in send work request. Signed-off-by: Michael S. Tsirkin Index: libibverbs/include/infiniband/verbs.h =================================================================== --- libibverbs/include/infiniband/verbs.h (revision 4031) +++ libibverbs/include/infiniband/verbs.h (working copy) @@ -441,7 +441,7 @@ struct ibv_send_wr { int num_sge; enum ibv_wr_opcode opcode; enum ibv_send_flags send_flags; - uint32_t imm_data; + uint32_t imm_data; /* in network byte order */ union { struct { uint64_t remote_addr; -- MST From yipeeyipeeyipeeyipee at yahoo.com Thu Dec 1 02:13:33 2005 From: yipeeyipeeyipeeyipee at yahoo.com (yipee) Date: Thu, 1 Dec 2005 10:13:33 +0000 (UTC) Subject: [openib-general] static OpenSM Message-ID: Hi, How can I compile OpenSM to be statically linked? I tried configuring it with '--enable-static' but that was ignored. thanks, x From halr at voltaire.com Thu Dec 1 03:33:22 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 01 Dec 2005 06:33:22 -0500 Subject: [openib-general] RE: [PATCH] [TRIVIAL] OpenSM/complib: Move assert before variable is used In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30E244B@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30E244B@mtlexch01.mtl.com> Message-ID: <1133436608.2984.20781.camel@hal.voltaire.com> On Thu, 2005-12-01 at 01:41, Yael Kalka wrote: > Hi Hal, > This fix isn't correct, since you are asserting on a variable not yet > initialized. Right. My bad. This was pointed out last night by Johannes Erdfelt. Thanks. -- Hal > Yael > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Wednesday, November 30, 2005 11:04 PM > To: Yael Kalka > Cc: openib-general at openib.org > Subject: [PATCH] [TRIVIAL] OpenSM/complib: Move assert before variable > is used > > > OpenSM/complib: Move assert before variable is used > > Signed-off-by: Hal Rosenstock > > Index: cl_dispatcher.c > =================================================================== > --- cl_dispatcher.c (revision 4257) > +++ cl_dispatcher.c (working copy) > @@ -344,8 +344,8 @@ cl_disp_post( > cl_dispatcher_t *p_disp; > cl_disp_msg_t *p_msg; > > - p_disp = handle->p_disp; > CL_ASSERT( p_disp ); > + p_disp = handle->p_disp; > CL_ASSERT( msg_id != CL_DISP_MSGID_NONE ); > > cl_spinlock_acquire( &p_disp->lock ); From yael at mellanox.co.il Thu Dec 1 04:17:11 2005 From: yael at mellanox.co.il (Yael Kalka) Date: Thu, 1 Dec 2005 14:17:11 +0200 Subject: [Fwd: [openib-general] OpenSM and Wrong SM_Key] Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30E2451@mtlexch01.mtl.com> Hi Hal, Eitan, I think the best option is to add an OpenSM option flag - exit_on_fatal. This flag can decide on the action on fatal cases: 1. Exit or not when seeing SM with different SM_Key. 2. Exit or not when there is a fatal link error (e.g - multiple guids). etc. I tried to run 2 SMs just now with different SM_keys, and I see that none of them exit, since both receive SM_Key=0 on SMInfo GetResp. The reason for that is that in the SMInfo Get request (as in all other requests) we do not send anything in the mad data. Meaning - all fields are clear. 
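For illustration, a condensed sketch of the check described next (hypothetical code - the names follow OpenSM's ib_types.h conventions, this is not the exact source):

/* The SMInfo attribute arrives in the Get request payload, but the
 * requester sent it all-zero, so the state read here is always 0 and
 * the master branch can never be taken. */
static void sminfo_rcv_check_requester_state( IN const ib_sm_info_t* const p_smi )
{
  if( ib_sminfo_get_state( p_smi ) == IB_SMINFO_STATE_MASTER )
  {
    /* requester is a master SM - unreachable with a zeroed payload */
  }
}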
In the __osm_sminfo_rcv_process_get_request function we are checking the state according to the payload data. This is always zero! Thus - the SM will never know that the SMInfo request was sent from an SM that is master. I will work on a fix for that. Yael -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Wednesday, November 30, 2005 11:57 PM To: Yael Kalka; Eitan Zahavi Cc: openib-general at openib.org Subject: [Fwd: [openib-general] OpenSM and Wrong SM_Key] Hi Yael & Eitan, Based on the recent MgtWG discussions, are you still holding your position in terms of exiting OpenSM when a non-matching SM Key is discovered ? Just wondering if I can issue a patch for this and clear this issue so OpenSM can be compliant for this aspect. Thanks. -- Hal -----Forwarded Message----- From: Hal Rosenstock To: openib-general at openib.org Subject: [openib-general] OpenSM and Wrong SM_Key Date: 08 Nov 2005 16:08:47 -0500 Hi, Currently, when OpenSM receives SMInfo with a different SM_Key, it exits as follows: void __osm_sminfo_rcv_process_get_response( IN const osm_sminfo_rcv_t* const p_rcv, IN const osm_madw_t* const p_madw ) { ... /* Check that the sm_key of the found SM is the same as ours, or is zero. If not - OpenSM cannot continue with configuration!. */ if ( p_smi->sm_key != 0 && p_smi->sm_key != p_rcv->p_subn->opt.sm_key ) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, "__osm_sminfo_rcv_process_get_response: ERR 2F18: " "Got SM with sm_key that doesn't match our " "local key. Exiting\n" ); osm_log( p_rcv->p_log, OSM_LOG_SYS, "Found remote SM with non-matching sm_key. Exiting\n" ); osm_exit_flag = TRUE; goto Exit; } C14-61.2.1 states that: A master SM which finds a higher priority master SM with the wrong SM_Key should not relinquish the subnet. Exiting OpenSM relinquishes the subnet. So it appears to me that perhaps this behavior of exiting OpenSM should be at least contingent on the SM state and relative priority of the SMInfo received. Make sense ? If so, I will work on a patch for this. -- Hal _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From yael at mellanox.co.il Thu Dec 1 04:40:49 2005 From: yael at mellanox.co.il (Yael Kalka) Date: Thu, 1 Dec 2005 14:40:49 +0200 Subject: [openib-general] RE: [PATCH] Opensm - fix LinkRecord get Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30E2452@mtlexch01.mtl.com> Hi Hal, Regarding your question - the answer is that there shouldn't be an error message in this case. Assume the following: A LinkRecord request is received with FromLid and ToLid, both Lids of switches. In this case the __osm_lr_rcv_get_port_links function will be called with both src and dest port objects not NULL (the osm_port_t depends on the Lid). The __osm_lr_rcv_get_physp_link function will then be called with every possible pair of such physical ports (over all port numbers) - including pairs that are not connected. This call is not an error, but part of the flow. Yael -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Wednesday, November 30, 2005 4:20 PM To: Yael Kalka Cc: openib-general at openib.org; Eitan Zahavi Subject: Re: [PATCH] Opensm - fix LinkRecord get On Wed, 2005-11-30 at 07:35, Yael Kalka wrote: > Hi Hal, > > During some tests I've noticed that in LinkRecord queries there are > some bugs: > 1.
Trying to ensure the two physical ports are connected comparison > isn't done correctly. > 2. When __osm_lr_rcv_get_physp_link is called with physical ports not > null - there is no check that the value returned is actually different > than null. As a result we can get several links with the same value. > > This patch fixes both issues. Thanks. Applied. Some minor comments below. -- Hal > Thanks, > Yael > > Signed-off-by: Yael Kalka > > Index: opensm/osm_sa_link_record.c > =================================================================== > --- opensm/osm_sa_link_record.c (revision 4231) > +++ opensm/osm_sa_link_record.c (working copy) > @@ -235,7 +235,7 @@ __osm_lr_rcv_get_physp_link( > Ensure the two physp's are actually connected. > If not, bail out. > */ > - if( osm_physp_get_remote( p_src_physp ) != p_src_physp ) > + if( osm_physp_get_remote( p_src_physp ) != p_dest_physp ) > goto Exit; Should there be an error message here ? > } > else > @@ -393,12 +393,16 @@ __osm_lr_rcv_get_port_links( > { > p_dest_physp = osm_port_get_phys_ptr( p_dest_port, > dest_port_num ); > + /* both physical ports should be with data */ > + if (p_src_physp && p_dest_physp) > + { > __osm_lr_rcv_get_physp_link( p_rcv, p_lr, p_src_physp, > p_dest_physp, comp_mask, > p_list, p_req_physp ); > } > } > } > + } Formatting was off here (and similarly below)... I fixed it in the change that was just committed. > else > { > /* > @@ -412,17 +416,22 @@ __osm_lr_rcv_get_port_links( > if (port_num < p_src_port->physp_tbl_size) > { > p_src_physp = osm_port_get_phys_ptr( p_src_port, port_num ); > + if (p_src_physp) > + { > __osm_lr_rcv_get_physp_link( p_rcv, p_lr, p_src_physp, > NULL, comp_mask, p_list, > p_req_physp ); > } > } > + } > else > { > num_ports = osm_port_get_num_physp( p_src_port ); > for( port_num = 1; port_num < num_ports; port_num++ ) > { > p_src_physp = osm_port_get_phys_ptr( p_src_port, port_num ); > + if (p_src_physp) > + { > __osm_lr_rcv_get_physp_link( p_rcv, p_lr, p_src_physp, > NULL, comp_mask, p_list, > p_req_physp ); > @@ -430,6 +439,7 @@ __osm_lr_rcv_get_port_links( > } > } > } > + } > else > { > if( p_dest_port ) > @@ -446,11 +456,14 @@ __osm_lr_rcv_get_port_links( > { > p_dest_physp = osm_port_get_phys_ptr( > p_dest_port, port_num ); > + if (p_dest_physp) > + { > __osm_lr_rcv_get_physp_link( p_rcv, p_lr, NULL, > p_dest_physp, comp_mask, > p_list, p_req_physp ); > } > } > + } > else > { > num_ports = osm_port_get_num_physp( p_dest_port ); > @@ -458,12 +471,15 @@ __osm_lr_rcv_get_port_links( > { > p_dest_physp = osm_port_get_phys_ptr( > p_dest_port, port_num ); > + if (p_dest_physp) > + { > __osm_lr_rcv_get_physp_link( p_rcv, p_lr, NULL, > p_dest_physp, comp_mask, > p_list, p_req_physp ); > } > } > } > + } > else > { > /* > From mst at mellanox.co.il Thu Dec 1 05:04:29 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 1 Dec 2005 15:04:29 +0200 Subject: [openib-general] [PATCH applied] sdp: fix aio completion on cancel Message-ID: <20051201130429.GA25751@mellanox.co.il> SDP AIO : Turn a warning message on completed iocb cancel to debug. SDP AIO : Put the proper AIO req on cancellation. Signed-off-by: Michael S. 
Tsirkin Index: linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_send.c =================================================================== --- linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_send.c (revision 4198) +++ linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_send.c (working copy) @@ -1738,7 +1742,7 @@ /* * completion reference */ - aio_put_req(req); + aio_put_req(iocb->req); result = 0; } @@ -1797,9 +1801,8 @@ * no IOCB found. The cancel is probably in a race with a completion. * Assume the IOCB will be completed, return appropriate value. */ - sdp_warn("Cancel write with no IOCB. <%d:%d:%08lx>", - req->ki_users, req->ki_key, req->ki_flags); - + sdp_dbg_warn(conn, "Cancel write with no IOCB. <%d:%d:%08lx>", + req->ki_users, req->ki_key, req->ki_flags); result = -EAGAIN; unlock: -- MST From halr at voltaire.com Thu Dec 1 05:19:04 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 01 Dec 2005 08:19:04 -0500 Subject: [Fwd: [openib-general] OpenSM and Wrong SM_Key] In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30E2451@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30E2451@mtlexch01.mtl.com> Message-ID: <1133443143.2984.21769.camel@hal.voltaire.com> On Thu, 2005-12-01 at 07:17, Yael Kalka wrote: > Hi Hal, Eitan, > I think the best option is to add an OpenSM option flag - exit_on_fatal. > This flag can decide on the action on fatal cases: > 1. Exit or not when seeing SM with different SM_Key. Still not sure why this would be an option. The compliance seems to me to be pretty clear on this. > 2. Exit or not when there is a fatal link error (e.g - multiple guids). > etc. I think the second issue is separable from the first. I would prefer to keep the discussion of this issue separate from SM Key. > I tried to run 2 SMs just now with different SM_keys, and I see that > none of them > exit, since both receive SM_Key=0 on SMInfo GetResp. > The reason for that is that in the SMInfo Get request (as in all other > requests) > we do not send anything in the mad data. Meaning - all fields are clear. The SM needs a way to know whether the other SM(s) (and which ones) are trusted or not so the SM_Key can be filled in. > In the __osm_sminfo_rcv_process_get_request function we are checking the > state according > to the payload data. This is always zero! Thus - SM will never know that > the SMInfo > request is sent from an SM that is master. Right, on the get side, SMState is reserved as it is a RO component (of SMInfo). > I will work on a fix for that. Thanks. -- Hal > Yael > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Wednesday, November 30, 2005 11:57 PM > To: Yael Kalka; Eitan Zahavi > Cc: openib-general at openib.org > Subject: [Fwd: [openib-general] OpenSM and Wrong SM_Key] > > > Hi Yael & Eitan, > > Based on the recent MgtWG discussions, are you still holding your > position in terms of exiting OpenSM when a non matching SM Key is > discovered ? Just wondering if I can issue a patch for this and clear > this issue so OpenSM can be compliant for this aspect. Thanks. > > -- Hal > > -----Forwarded Message----- > > From: Hal Rosenstock > To: openib-general at openib.org > Subject: [openib-general] OpenSM and Wrong SM_Key > Date: 08 Nov 2005 16:08:47 -0500 > > Hi, > > Currently, when OpenSM receives SMInfo with a different SM_Key, it exits > as follows: > > > void > __osm_sminfo_rcv_process_get_response( > IN const osm_sminfo_rcv_t* const p_rcv, > IN const osm_madw_t* const p_madw ) > { > ... 
> > > > /* > Check that the sm_key of the found SM is the same as ours, > or is zero. If not - OpenSM cannot continue with configuration!. */ > if ( p_smi->sm_key != 0 && > p_smi->sm_key != p_rcv->p_subn->opt.sm_key ) > { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > "__osm_sminfo_rcv_process_get_response: ERR 2F18: " > "Got SM with sm_key that doesn't match our " > "local key. Exiting\n" ); > osm_log( p_rcv->p_log, OSM_LOG_SYS, > "Found remote SM with non-matching sm_key. Exiting\n" ); > osm_exit_flag = TRUE; > goto Exit; > } > > C14-61.2.1 states that: > A master SM which finds a higher priority master SM with the wrong > SM_Key should not relinquish the subnet. > > Exiting OpenSM relinquishes the subnet. > > So it appears to me that perhaps this behavior of exiting OpenSM should > be at least contingent on the SM state and relative priority of the > SMInfo received. Make sense ? If so, I will work on a patch for this. > > -- Hal > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From mst at mellanox.co.il Thu Dec 1 05:47:08 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 1 Dec 2005 15:47:08 +0200 Subject: [openib-general] [PATCH applied] sdp zcopy support for send_msg/recv_msd Message-ID: <20051201134708.GB25751@mellanox.co.il> I have added zcopy option to trunk. With this enabled I am getting good bandwidth with multiple sockets, but typically worse that bcopy bandwidth for a single socket. swlab155:~ # ( export SIMPLE_LIBSDP=1 ; export LD_PRELOAD=/usr/local/lib/libsdp.so; iperf -c 11.4.8.156 -P 4 -l 64000 -f M ) ------------------------------------------------------------ Client connecting to 11.4.8.156, TCP port 5001 TCP window size: 0.11 MByte (default) ------------------------------------------------------------ [ 7] local 11.4.8.155 port 32812 connected with 11.4.8.156 port 5001 [ 5] local 11.4.8.155 port 32810 connected with 11.4.8.156 port 5001 [ 8] local 11.4.8.155 port 32813 connected with 11.4.8.156 port 5001 [ 6] local 11.4.8.155 port 32811 connected with 11.4.8.156 port 5001 [ 7] 0.0-10.0 sec 2309 MBytes 231 MBytes/sec [ 5] 0.0-10.0 sec 2309 MBytes 231 MBytes/sec [ 8] 0.0-10.0 sec 2309 MBytes 231 MBytes/sec [ 6] 0.0-10.0 sec 2309 MBytes 231 MBytes/sec [SUM] 0.0-10.0 sec 9235 MBytes 924 MBytes/sec swlab155:~ # ( export SIMPLE_LIBSDP=1 ; export LD_PRELOAD=/usr/local/lib/libsdp.so; iperf -c 11.4.8.156 -P 2 -l 64000 -f M ) ------------------------------------------------------------ Client connecting to 11.4.8.156, TCP port 5001 TCP window size: 0.11 MByte (default) ------------------------------------------------------------ [ 5] local 11.4.8.155 port 32814 connected with 11.4.8.156 port 5001 [ 6] local 11.4.8.155 port 32815 connected with 11.4.8.156 port 5001 [ 5] 0.0-10.0 sec 4233 MBytes 423 MBytes/sec [ 6] 0.0-10.0 sec 4233 MBytes 423 MBytes/sec [SUM] 0.0-10.0 sec 8466 MBytes 847 MBytes/sec swlab155:~ # ( export SIMPLE_LIBSDP=1 ; export LD_PRELOAD=/usr/local/lib/libsdp.so; iperf -c 11.4.8.156 -l 64000 -f M ) ------------------------------------------------------------ Client connecting to 11.4.8.156, TCP port 5001 TCP window size: 0.11 MByte (default) ------------------------------------------------------------ [ 5] local 11.4.8.155 port 32816 connected with 11.4.8.156 port 5001 [ 5] 0.0-10.0 sec 5092 MBytes 509 MBytes/sec --- Add zero copy support to synchronous 
socket operations (send_msg/recv_msg). Signed-off-by: Michael S. Tsirkin Index: linux-2.6.14/drivers/infiniband/ulp/sdp/Kconfig =================================================================== --- linux-2.6.14/drivers/infiniband/ulp/sdp/Kconfig (revision 4198) +++ linux-2.6.14/drivers/infiniband/ulp/sdp/Kconfig (working copy) @@ -8,6 +8,20 @@ libsdp library from to have standard sockets applications use SDP. +config INFINIBAND_SDP_SEND_ZCOPY + bool "Sockets Direct Protocol Zero Copy Send support" + depends on INFINIBAND_SDP + default n + ---help--- + This option enables Zero Copy support for send_msg transactions. + +config INFINIBAND_SDP_RECV_ZCOPY + bool "Sockets Direct Protocol Zero Copy Receive support" + depends on INFINIBAND_SDP && INFINIBAND_SDP_SEND_ZCOPY + default n + ---help--- + This option enables Zero Copy support for recv_msg transactions. + config INFINIBAND_SDP_DEBUG bool "Sockets Direct Protocol debugging" depends on INFINIBAND_SDP Index: linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_rcvd.c =================================================================== --- linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_rcvd.c (revision 4198) +++ linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_rcvd.c (working copy) @@ -439,6 +439,11 @@ sdp_advt_destroy(advt); } + +#ifdef CONFIG_INFINIBAND_SDP_RECV_ZCOPY + /* There are no more src_avail, wake up any waiting thread */ + sdp_iocb_q_wakeup_complete(&conn->r_pend); +#endif /* * If there are active reads, mark the connection as being in * source cancel. Otherwise Index: linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_sock.h =================================================================== --- linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_sock.h (revision 4198) +++ linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_sock.h (working copy) @@ -61,7 +61,9 @@ #define SDP_ZCOPY_THRSH_SRC 257 /* Threshold for AIO write advertisments */ #define SDP_ZCOPY_THRSH_SNK 258 /* Threshold for AIO read advertisments */ #define SDP_ZCOPY_THRSH 256 /* Convenience for read and write */ - +#ifdef CONFIG_INFINIBAND_SDP_SEND_ZCOPY +#define SDP_ZCOPY_CANCEL_TIMEOUT (HZ * 60) /* Time before abortive close */ +#endif /* * Default values for SDP specific socket options. 
(for reference) */ Index: linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_proto.h =================================================================== --- linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_proto.h (revision 4198) +++ linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_proto.h (working copy) @@ -152,7 +152,13 @@ void sdp_iocb_q_put_tail(struct sdpc_iocb_q *table, struct sdpc_iocb *iocb); struct sdpc_iocb *sdp_iocb_q_lookup(struct sdpc_iocb_q *table, u32 key); +struct sdpc_iocb *sdp_iocb_q_lookup_req(struct sdpc_iocb_q *table, struct kiocb *req); +struct sdpc_iocb *sdp_iocb_q_lookup_complete(struct sdpc_iocb_q *table, struct kiocb *req); +struct sdpc_iocb *sdp_iocb_q_wakeup_complete(struct sdpc_iocb_q *table); + +void sdp_iocb_q_mark_cancel(struct sdpc_iocb_q *table, struct kiocb *req); + void sdp_iocb_q_cancel(struct sdpc_iocb_q *table, u32 mask, ssize_t comp); void sdp_iocb_q_remove(struct sdpc_iocb *iocb); @@ -197,6 +203,8 @@ void *arg), void *arg); +int sdp_iocb_find_req(struct sdpc_desc *element, void *arg); + int sdp_desc_q_types_size(struct sdpc_desc_q *table, enum sdp_desc_type type); Index: linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_read.c =================================================================== --- linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_read.c (revision 4198) +++ linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_read.c (working copy) @@ -93,6 +93,12 @@ } } +#ifdef CONFIG_INFINIBAND_SDP_RECV_ZCOPY + /* If there are no more src_avail, wake up any waiting thread */ + if (!conn->src_recv) + sdp_iocb_q_wakeup_complete(&conn->r_pend); + +#endif done: return 0; error: @@ -222,14 +228,23 @@ iocb->flags &= ~(SDP_IOCB_F_ACTIVE | SDP_IOCB_F_RDMA_R); - if (sk_sdp(conn)->sk_rcvlowat > iocb->post) - break; +#ifdef CONFIG_INFINIBAND_SDP_RECV_ZCOPY + if (!iocb->len || (!conn->src_recv && iocb->post >= iocb->lowat)) +#else + if (iocb->post >= iocb->lowat) +#endif + { + /* + * complete IOCB + */ + SDP_CONN_STAT_READ_INC(conn, iocb->post); + SDP_CONN_STAT_RQ_DEC(conn, iocb->size); + /* + * callback to complete IOCB + */ + sdp_iocb_complete(sdp_iocb_q_get_head(&conn->r_pend), 0); + } - SDP_CONN_STAT_READ_INC(conn, iocb->post); - SDP_CONN_STAT_RQ_DEC(conn, iocb->size); - - sdp_iocb_complete(sdp_iocb_q_get_head(&conn->r_pend), 0); - break; default: sdp_warn("Unknown type <%d> at head of READ SRC queue. 
<%d>", Index: linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_send.c =================================================================== --- linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_send.c (revision 4198) +++ linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_send.c (working copy) @@ -122,6 +122,10 @@ send_param.send_flags |= IB_SEND_SIGNALED; conn->send_cons = 0; } + + if (buff->bsdh_hdr->mid == SDP_MID_SRC_CANCEL) + sdp_dbg_ctrl(conn, "SRC_CANCEL bsdh_hdr->seq_num = %d conn->send_seq=%d\n", + buff->bsdh_hdr->seq_num, conn->send_seq); /* * post send */ @@ -1680,8 +1684,8 @@ static int sdp_inet_write_cancel(struct kiocb *req, struct io_event *ev) { struct sock_iocb *si = kiocb_to_siocb(req); - struct sdp_sock *conn; struct sdpc_iocb *iocb; + struct sdp_sock *conn; int result = 0; sdp_dbg_ctrl(NULL, "Cancel Write IOCB user <%d> key <%d> flag <%08lx>", @@ -1810,7 +1813,151 @@ return result; } +#ifdef CONFIG_INFINIBAND_SDP_SEND_ZCOPY +static int sdp_write_src_cancel(struct sdpc_desc *element, void *arg) +{ + struct sdpc_iocb *iocb = (struct sdpc_iocb *) element; + struct kiocb *req = (struct kiocb *)arg; + + if (element->type == SDP_DESC_TYPE_IOCB && iocb->req == req) + iocb->flags |= SDP_IOCB_F_CANCEL; + return -ERANGE; +} + +static int sdp_req_busy(struct sdp_sock *conn, struct sdpc_iocb_wait *wait) +{ + unsigned long flags; + int result = -EAGAIN; + + sdp_conn_lock(conn); + sdp_conn_unlock(conn); + + spin_lock_irqsave(&wait->lock, flags); + if (!wait->outstanding) + result = 0; + spin_unlock_irqrestore(&wait->lock, flags); + return result; +} /* + * sdp_write_cancel - cancel a synchronous IO operation + */ +static int sdp_write_cancel(struct kiocb *req, struct sdp_sock *conn, + struct sdpc_iocb_wait *wait) +{ + struct sdpc_iocb *iocb; + int result = 0; + + sdp_dbg_ctrl(NULL, "Cancel Write IOCB user <%d> key <%d> flag <%08lx>", + req->ki_users, req->ki_key, req->ki_flags); + + sdp_conn_lock(conn); + + sdp_dbg_ctrl(conn, "Cancel Write IOCB. <%08x:%04x> <%08x:%04x>", + conn->src_addr, conn->src_port, + conn->dst_addr, conn->dst_port); + /* + * attempt to find the IOCB for this key. we don't have an indication + * whether this is a read or write. + */ + + while ((iocb = (struct sdpc_iocb *) + sdp_desc_q_lookup(&conn->send_queue, sdp_iocb_find_req, req))) { + iocb->flags |= SDP_IOCB_F_CANCEL; + + /* + * always remove the IOCB. + * If active, then place it into the correct active queue + */ + sdp_desc_q_remove((struct sdpc_desc *)iocb); + + if (iocb->flags & SDP_IOCB_F_ACTIVE) { + if (iocb->flags & SDP_IOCB_F_RDMA_W) + sdp_desc_q_put_tail(&conn->w_snk, + (struct sdpc_desc *)iocb); + else { + SDP_EXPECT((iocb->flags & SDP_IOCB_F_RDMA_R)); + + sdp_iocb_q_put_tail(&conn->w_src, iocb); + } + } else { + /* + * empty IOCBs can be deleted, while partials + * needs to be compelted. + */ + if (iocb->post > 0) { + sdp_iocb_complete(iocb, 0); + result = -EAGAIN; + } else { + sdp_iocb_destroy(iocb); + + /* + * completion reference + */ + if (!iocb->wait) + aio_put_req(iocb->req); + else { + unsigned long flags; + spin_lock_irqsave(&iocb->wait->lock, flags); + --iocb->wait->outstanding; + /* No need to wake up, + since we call sdp_req_busy + directly below */ + + spin_unlock_irqrestore(&iocb->wait->lock, flags); + } + } + } + } + + /* + * check the sink queue, not much to do, since the operation is + * already in flight. 
+ */ + sdp_desc_q_lookup(&conn->w_snk, sdp_write_src_cancel, req); + + iocb = (struct sdpc_iocb *)sdp_desc_q_lookup(&conn->w_snk, + sdp_iocb_find_req, + req); + if (iocb) { + sdp_dbg_ctrl(conn, "Sink Queue busy\n"); + result = -EAGAIN; + } + + /* + * check source queue. If we're in the source queue, then a cancel + * needs to be issued. + */ + sdp_iocb_q_mark_cancel(&conn->w_src, req); + + iocb = sdp_iocb_q_lookup_req(&conn->w_src, req); + if (iocb) { + sdp_dbg_ctrl(conn, "Sending Src Cancel\n"); + + if (! (conn->flags & SDP_CONN_F_SRC_CANCEL_L)) { + sdp_desc_q_lookup(&conn->w_snk, sdp_write_src_cancel, req); + conn->flags |= SDP_CONN_F_SRC_CANCEL_L; + result = sdp_send_ctrl_src_cancel(conn); + SDP_EXPECT(result >= 0); + } + + result = -EAGAIN; + } + + if (!result) { + /* + * no IOCB found. Assume the IOCB will be completed. + */ + sdp_dbg_ctrl(conn, "Cancel IOCB done. <%d:%d:%08lx>", + req->ki_users, req->ki_key, req->ki_flags); + } + + sdp_conn_unlock(conn); + + return sdp_req_busy(conn, wait); +} +#endif + +/* * sdp_send_flush_advt - Flush passive sink advertisments */ static int sdp_send_flush_advt(struct sdp_sock *conn) @@ -1987,7 +2134,7 @@ return timeout; } -static inline int sdp_queue_iocb(struct kiocb *req, struct sdp_sock *conn, +static inline int sdp_queue_aio(struct kiocb *req, struct sdp_sock *conn, struct msghdr *msg, size_t size, size_t *copied) { @@ -2038,14 +2185,79 @@ return -EIOCBQUEUED; } +#ifdef CONFIG_INFINIBAND_SDP_SEND_ZCOPY +static inline int sdp_queue_sync(struct kiocb *req, struct sdp_sock *conn, + struct msghdr *msg, size_t size, + size_t *copied, + struct sdpc_iocb_wait *wait) +{ + struct sdpc_iocb *iocb; + struct iovec *msg_iov; + unsigned long flags; + size_t len; + int result; + /* + * create IOCB with remaining space + */ + iocb = sdp_iocb_create(); + if (!iocb) { + sdp_dbg_warn(conn, "Failed to allocate IOCB <%Zu:%ld>", + size, (long)*copied); + return -ENOMEM; + } + + for (msg_iov = msg->msg_iov; !msg_iov->iov_len; ++msg_iov); + + /* FMR alignment can add an extra page. */ + len = min(msg_iov->iov_len, (size_t)SDP_IOCB_SIZE_MAX - 4096); + iocb->len = len; + iocb->post = 0; + iocb->size = len; + iocb->req = req; + iocb->key = req->ki_key; + iocb->addr = (unsigned long)msg_iov->iov_base; + iocb->wait = wait; + + result = sdp_iocb_lock(iocb); + if (result < 0) { + sdp_dbg_warn(conn, "Error <%d> locking IOCB <%Zu:%ld>", + result, size, (long)copied); + + sdp_iocb_destroy(iocb); + return result; + } + + SDP_CONN_STAT_WQ_INC(conn, iocb->size); + + result = sdp_send_data_queue(conn, (struct sdpc_desc *)iocb); + if (result < 0) { + sdp_dbg_warn(conn, "Error <%d> queueing write IOCB", result); + sdp_iocb_destroy(iocb); + return result; + } + + spin_lock_irqsave(&wait->lock, flags); + ++wait->outstanding; + spin_unlock_irqrestore(&wait->lock, flags); + + conn->send_pipe += len; + *copied += len; /* copied amount was saved in IOCB. 
*/ + msg_iov->iov_len -= len; + msg_iov->iov_base += len; + return 0; +} +#endif /* * sdp_inet_send - send data from user space to the network */ int sdp_inet_send(struct kiocb *req, struct socket *sock, struct msghdr *msg, size_t size) { - struct sock *sk; - struct sdp_sock *conn; +#ifdef CONFIG_INFINIBAND_SDP_SEND_ZCOPY + struct sdpc_iocb_wait wait; +#endif + struct sock *sk; + struct sdp_sock *conn; int result = 0; size_t copied = 0; int oob, zcopy; @@ -2074,6 +2286,7 @@ if (conn->state == SDP_CONN_ST_LISTEN || conn->state == SDP_CONN_ST_CLOSED) { result = -ENOTCONN; + sdp_conn_unlock(conn); goto done; } /* @@ -2082,13 +2295,24 @@ * they are smaller then the zopy threshold, but only if there is * no buffer write space. */ - zcopy = (size >= conn->src_zthresh && !is_sync_kiocb(req)); + zcopy = (size >= conn->src_zthresh && (!is_sync_kiocb(req) +#ifdef CONFIG_INFINIBAND_SDP_SEND_ZCOPY + || (!(msg->msg_flags & MSG_DONTWAIT) && !oob) +#endif + )); /* * clear ASYN space bit, it'll be reset if there is no space. */ if (!zcopy) clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); +#ifdef CONFIG_INFINIBAND_SDP_SEND_ZCOPY + else if (is_sync_kiocb(req)) { + init_waitqueue_head(&wait.wait); + spin_lock_init(&wait.lock); + wait.outstanding = 0; + } +#endif /* * process data first if window is open, next check conditions, then * wait if there is more work to be done. The absolute window size is @@ -2143,14 +2367,45 @@ * completion. Wait on sync IO call create IOCB for async * call. */ +#ifdef CONFIG_INFINIBAND_SDP_SEND_ZCOPY + if (is_sync_kiocb(req) && zcopy) + result = sdp_queue_sync(req, conn, msg, size, &copied, + &wait); + /* TODO: limit the # of outstanding reqs */ + /* TODO: sleep on recoverable errors */ + else +#endif if (is_sync_kiocb(req)) timeout = sdp_wait_till_space(sk, conn, oob, timeout); else - result = sdp_queue_iocb(req, conn, msg, size, &copied); + result = sdp_queue_aio(req, conn, msg, size, &copied); } + sdp_conn_unlock(conn); + +#ifdef CONFIG_INFINIBAND_SDP_SEND_ZCOPY + if (!result && is_sync_kiocb(req) && zcopy) { + timeout = wait_event_interruptible_timeout(wait.wait, + !sdp_req_busy(conn, &wait), timeout); + if (!timeout) + result = -EAGAIN; + } + + if (signal_pending(current) && is_sync_kiocb(req) && zcopy) { + result = (timeout > 0) ? sock_intr_errno(timeout) : -EAGAIN; + + timeout = wait_event_timeout(wait.wait, + !sdp_write_cancel(req, conn, &wait), + SDP_ZCOPY_CANCEL_TIMEOUT); + if (!timeout) { + sdp_warn("sdp_write_cancel timed out. Abort.\n"); + sdp_conn_lock(conn); + sdp_conn_abort(conn); + sdp_conn_unlock(conn); + } + } +#endif done: - sdp_conn_unlock(conn); result = ((copied > 0) ? copied : result); if (result == -EPIPE && !(msg->msg_flags & MSG_NOSIGNAL)) Index: linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_recv.c =================================================================== --- linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_recv.c (revision 4198) +++ linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_recv.c (working copy) @@ -327,6 +327,10 @@ iocb = sdp_iocb_q_look(&conn->r_pend); if (!iocb) return ENODEV; +#ifdef CONFIG_INFINIBAND_SDP_RECV_ZCOPY + if (iocb->flags & SDP_IOCB_F_RO) + return ENODEV; +#endif /* * check zcopy threshold */ @@ -414,7 +418,11 @@ * loop posting RDMA reads, if there is room. 
*/ if (!sdp_iocb_q_size(&conn->r_pend)) - while (sdp_advt_q_size(&conn->src_pend) > 0 && + while( +#ifdef CONFIG_INFINIBAND_SDP_RECV_ZCOPY + !sdp_desc_q_size(&conn->r_src) && +#endif + sdp_advt_q_size(&conn->src_pend) > 0 && conn->recv_max > sdp_buff_q_size(&conn->recv_pool) && conn->rwin_max > conn->byte_strm) { @@ -706,9 +714,8 @@ * b) the amount of data moved into the IOCB is greater then the * socket recv low water mark. */ - if (!iocb->len || - (!conn->src_recv && - !(sk_sdp(conn)->sk_rcvlowat > iocb->post))) { + if (!iocb->len || (!conn->src_recv && iocb->post >= iocb->lowat)) + { /* * complete IOCB */ @@ -1055,7 +1062,151 @@ return result; } +#ifdef CONFIG_INFINIBAND_SDP_RECV_ZCOPY +static int sdp_req_busy(struct kiocb *req, struct sdp_sock *conn, + struct sdpc_iocb_wait *wait, size_t *copied) +{ + struct sdpc_iocb *iocb; + unsigned long flags; + int result = -EAGAIN; + int lowat_reached = 0; + + sdp_conn_lock(conn); + /* Unlock polls cqs */ + sdp_conn_unlock(conn); + + for (;;) { + spin_lock_irqsave(&wait->lock, flags); + iocb = sdp_iocb_q_get_head(&wait->q); + if (!iocb) + break; + --wait->outstanding; + + if (iocb->post >= iocb->lowat) + wait->lowat_reached = 1; + + lowat_reached = wait->lowat_reached; + + spin_unlock_irqrestore(&wait->lock, flags); + + *copied -= iocb->len; + sdp_iocb_release(iocb); + sdp_iocb_unlock(iocb); + sdp_iocb_destroy(iocb); + } + + if (!wait->outstanding) + result = 0; + + spin_unlock_irqrestore(&wait->lock, flags); + + /* Remove any outstanding iocbs which have their low watermark + satisfied */ + if (lowat_reached && result) { + sdp_conn_lock(conn); + if (!conn->src_recv) + while ((iocb = sdp_iocb_q_lookup_complete(&conn->r_pend, + req))) { + sdp_iocb_q_remove(iocb); + SDP_CONN_STAT_READ_INC(conn, iocb->post); + SDP_CONN_STAT_RQ_DEC(conn, iocb->size); + sdp_iocb_complete(iocb, 0); + } + sdp_conn_unlock(conn); + } + + return result; +} + /* + * sdp_read_cancel - cancel a synchronous IO operation + */ +static int sdp_read_cancel(struct kiocb *req, struct sdp_sock *conn, + struct sdpc_iocb_wait *wait, size_t *copied) +{ + struct sdpc_iocb *iocb; + sdp_dbg_ctrl(NULL, "Cancel Read IOCBs. user <%d> req <%p> flag <%08lx>", + req->ki_users, req, req->ki_flags); + + sdp_conn_lock(conn); + + sdp_dbg_ctrl(conn, "Cancel Read IOCBs. <%08x:%04x> <%08x:%04x>", + conn->src_addr, conn->src_port, + conn->dst_addr, conn->dst_port); + /* + * attempt to find the IOCB for this req. + */ + while ((iocb = sdp_iocb_q_lookup_req(&conn->r_pend, req))) { + /* + * always remove the IOCB. If active, then place it into + * the correct active queue. Inactive empty IOCBs can be + * deleted, while inactive partials needs to be compelted. + */ + sdp_iocb_q_remove(iocb); + + if (!(iocb->flags & SDP_IOCB_F_ACTIVE)) { + sdp_iocb_complete(iocb, 0); + goto unlock; + } + + if (iocb->flags & SDP_IOCB_F_RDMA_W) + sdp_iocb_q_put_tail(&conn->r_snk, iocb); + else { + SDP_EXPECT((iocb->flags & SDP_IOCB_F_RDMA_R)); + + sdp_desc_q_put_tail(&conn->r_src, + (struct sdpc_desc *)iocb); + } + } + /* + * check the source queue, not much to do, since the operation is + * already in flight. + */ + iocb = (struct sdpc_iocb *)sdp_desc_q_lookup(&conn->r_src, + sdp_iocb_find_req, req); + if (iocb) { + iocb->flags |= SDP_IOCB_F_CANCEL; + goto unlock; + } + /* + * check sink queue. If we're in the sink queue, then a cancel + * needs to be issued. 
+ */ + iocb = sdp_iocb_q_lookup_req(&conn->r_snk, req); + if (iocb) { + /* + * Unfortunetly there is only a course grain cancel in SDP, so + * we have to cancel everything. + */ + if (!(conn->flags & SDP_CONN_F_SNK_CANCEL)) { + int result; + + result = sdp_send_ctrl_snk_cancel(conn); + SDP_EXPECT(result >= 0); + + conn->flags |= SDP_CONN_F_SNK_CANCEL; + } + + iocb->flags |= SDP_IOCB_F_CANCEL; + + goto unlock; + } + /* + * no IOCB found. The cancel is probably in a race with a completion. + */ + sdp_dbg_ctrl(NULL, "Cancel read with no IOCB. <%d:%d:%08lx>", + req->ki_users, req->ki_key, req->ki_flags); + + +unlock: + sdp_conn_unlock(conn); + + return sdp_req_busy(req, conn, wait, copied); +} + +#endif + +/* * sdp_inet_recv - recv data from the network to user space */ int sdp_inet_recv(struct kiocb *req, struct socket *sock, struct msghdr *msg, @@ -1065,17 +1216,22 @@ struct sdp_sock *conn; struct sdpc_iocb *iocb; struct sdpc_buff *buff; - long timeout; + long timeout = 0 /*Turn off compiler warning */; size_t length; int result = 0; int expect; int low_water; - int copied = 0; + size_t copied = 0; int copy; int update; s8 oob = 0; s8 ack = 0; struct sdpc_buff_q peek_queue; +#ifdef CONFIG_INFINIBAND_SDP_RECV_ZCOPY + int zcopy = 0; + struct sdpc_iocb_wait wait; + unsigned long f; +#endif sk = sock->sk; conn = sdp_sk(sk); @@ -1293,6 +1449,80 @@ /* * Either wait or create IOCB for defered completion. */ +#ifdef CONFIG_INFINIBAND_SDP_RECV_ZCOPY + if (is_sync_kiocb(req) && !(flags & MSG_PEEK) && + (zcopy || size - copied >= conn->snk_zthresh) + /* && (conn->src_recv || + (low_water - copied >= conn->snk_zthresh)) */ ) { + struct iovec *msg_iov; + size_t len; + /* + * create IOCB with remaining space + */ + iocb = sdp_iocb_create(); + if (!iocb) { + sdp_dbg_warn(conn, + "Error allocating IOCB <%Zu:%Zd>", + size, copied); + result = -ENOMEM; + break; + } + + for (msg_iov = msg->msg_iov; !msg_iov->iov_len; ++msg_iov); + + /* FMR alignment can add an extra page. */ + len = min(msg_iov->iov_len, (size_t)SDP_IOCB_SIZE_MAX - 4096); + iocb->len = len; + iocb->post = 0; + iocb->size = len; + iocb->req = req; + iocb->key = req->ki_key; + iocb->addr = (unsigned long)msg_iov->iov_base; + if (copied >= low_water) + iocb->lowat = 0; + else + iocb->lowat = min_t(size_t, len, low_water - copied); + iocb->wait = &wait; + + iocb->flags |= SDP_IOCB_F_RECV | SDP_IOCB_F_RO; + + req->ki_cancel = sdp_inet_read_cancel; + + result = sdp_iocb_lock(iocb); + if (result < 0) { + sdp_dbg_warn(conn, + "Error <%d> IOCB lock <%Zu:%Zd>", + result, size, copied); + + sdp_iocb_destroy(iocb); + break; + } + + SDP_CONN_STAT_RQ_INC(conn, iocb->size); + + if (!zcopy) { + init_waitqueue_head(&wait.wait); + spin_lock_init(&wait.lock); + sdp_iocb_q_init(&wait.q); + wait.outstanding = 0; + wait.lowat_reached = copied >= low_water; + zcopy = 1; + } + + sdp_iocb_q_put_tail(&conn->r_pend, iocb); + + spin_lock_irqsave(&wait.lock, f); + ++wait.outstanding; + spin_unlock_irqrestore(&wait.lock, f); + + /* TODO: set it? 
*/ + ack = 1; + copied += len; + msg_iov->iov_len -= len; + msg_iov->iov_base += len; + break; + } else +#endif if (is_sync_kiocb(req)) { DECLARE_WAITQUEUE(wait, current); @@ -1325,7 +1555,7 @@ iocb = sdp_iocb_create(); if (!iocb) { sdp_dbg_warn(conn, - "Error allocating IOCB <%Zu:%d>", + "Error allocating IOCB <%Zu:%Zd>", size, copied); result = -ENOMEM; break; @@ -1338,7 +1568,7 @@ iocb->key = req->ki_key; iocb->addr = ((unsigned long)msg->msg_iov->iov_base - copied); - + iocb->lowat = low_water; iocb->flags |= SDP_IOCB_F_RECV; req->ki_cancel = sdp_inet_read_cancel; @@ -1346,7 +1576,7 @@ result = sdp_iocb_lock(iocb); if (result < 0) { sdp_dbg_warn(conn, - "Error <%d> IOCB lock <%Zu:%d>", + "Error <%d> IOCB lock <%Zu:%Zd>", result, size, copied); sdp_iocb_destroy(iocb); @@ -1383,5 +1613,28 @@ sdp_buff_q_put_head(&conn->recv_pool, buff); sdp_conn_unlock(conn); +#ifdef CONFIG_INFINIBAND_SDP_RECV_ZCOPY + if (!result && is_sync_kiocb(req) && zcopy) { + timeout = wait_event_interruptible_timeout(wait.wait, + !sdp_req_busy(req, conn, &wait, &copied), timeout); + if (!timeout) + result = -EAGAIN; + } + + if (signal_pending(current) && is_sync_kiocb(req) && zcopy) { + result = (timeout > 0) ? sock_intr_errno(timeout) : -EAGAIN; + + timeout = wait_event_timeout(wait.wait, + !sdp_read_cancel(req, conn, &wait, &copied), + SDP_ZCOPY_CANCEL_TIMEOUT); + if (!timeout) { + sdp_warn("sdp_read_cancel timed out. Abort.\n"); + sdp_conn_lock(conn); + sdp_conn_abort(conn); + sdp_conn_unlock(conn); + } + } +#endif + return ((copied > 0) ? copied : result); } Index: linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_iocb.c =================================================================== --- linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_iocb.c (revision 4198) +++ linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_iocb.c (working copy) @@ -317,12 +317,23 @@ sdp_dbg_data(NULL, "IOCB complete. <%d:%d:%08lx> value <%ld>", iocb->req->ki_users, iocb->req->ki_key, iocb->req->ki_flags, value); + +#ifdef CONFIG_INFINIBAND_SDP_SEND_ZCOPY + if (iocb->wait) { + unsigned long flags; + spin_lock_irqsave(&iocb->wait->lock, flags); + if (!--iocb->wait->outstanding) { + wake_up(&iocb->wait->wait); + } + spin_unlock_irqrestore(&iocb->wait->lock, flags); + } else +#endif + /* + * valid result can be 0 or 1 for complete so + * we ignore the value. + */ + (void)aio_complete(iocb->req, value, 0); /* - * valid result can be 0 or 1 for complete so - * we ignore the value. 
- */ - (void)aio_complete(iocb->req, value, 0); - /* * delete IOCB */ sdp_iocb_destroy(iocb); @@ -335,7 +346,19 @@ { iocb->status = status; - if (in_atomic() || irqs_disabled()) { +#ifdef CONFIG_INFINIBAND_SDP_RECV_ZCOPY + if ((iocb->flags & SDP_IOCB_F_RECV) && iocb->wait) { + unsigned long flags; + spin_lock_irqsave(&iocb->wait->lock, flags); + sdp_iocb_q_put_tail(&iocb->wait->q, iocb); + /* Possible optimization: only wake + if no more outstanding iocbs or low watermark reached */ + wake_up(&iocb->wait->wait); + spin_unlock_irqrestore(&iocb->wait->lock, flags); + } else +#endif + if ((iocb->flags & SDP_IOCB_F_RECV) && + (in_atomic() || irqs_disabled())) { INIT_WORK(&iocb->completion, do_iocb_complete, (void *)iocb); schedule_work(&iocb->completion); } else @@ -392,6 +415,75 @@ return NULL; } +#ifdef CONFIG_INFINIBAND_SDP_SEND_ZCOPY +struct sdpc_iocb *sdp_iocb_q_lookup_req(struct sdpc_iocb_q *table, struct kiocb *req) +{ + struct sdpc_iocb *iocb; + int counter; + + for (counter = 0, iocb = table->head; counter < table->size; + counter++, iocb = iocb->next) + if (iocb->req == req) + return iocb; + + return NULL; +} + +void sdp_iocb_q_mark_cancel(struct sdpc_iocb_q *table, struct kiocb *req) +{ + struct sdpc_iocb *iocb = NULL; + int counter; + + for (counter = 0, iocb = table->head; counter < table->size; + counter++, iocb = iocb->next) + if (iocb->req == req) + iocb->flags |= SDP_IOCB_F_CANCEL; + +} + +int sdp_iocb_find_req(struct sdpc_desc *element, void *arg) +{ + struct sdpc_iocb *iocb = (struct sdpc_iocb *) element; + struct kiocb *req = (struct kiocb *)arg; + + if (element->type == SDP_DESC_TYPE_IOCB && iocb->req == req) + return 0; + return -ERANGE; +} +#endif + +#ifdef CONFIG_INFINIBAND_SDP_RECV_ZCOPY +struct sdpc_iocb *sdp_iocb_q_lookup_complete(struct sdpc_iocb_q *table, struct kiocb *req) +{ + struct sdpc_iocb *iocb; + int counter; + + for (counter = 0, iocb = table->head; counter < table->size; + counter++, iocb = iocb->next) + if (iocb->req == req && iocb->post >= iocb->lowat) + return iocb; + + return NULL; +} +struct sdpc_iocb *sdp_iocb_q_wakeup_complete(struct sdpc_iocb_q *table) +{ + struct sdpc_iocb *iocb; + unsigned long flags; + int counter; + + for (counter = 0, iocb = table->head; counter < table->size; + counter++, iocb = iocb->next) + if (iocb->wait && iocb->post >= iocb->lowat) { + spin_lock_irqsave(&iocb->wait->lock, flags); + iocb->wait->lowat_reached = 1; + spin_unlock_irqrestore(&iocb->wait->lock, flags); + wake_up(&iocb->wait->wait); + } + + return NULL; +} +#endif + /* * sdp_iocb_create - create an IOCB object */ Index: linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_iocb.h =================================================================== --- linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_iocb.h (revision 4198) +++ linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_iocb.h (working copy) @@ -55,6 +55,9 @@ #define SDP_IOCB_F_LOCKED 0x00000040 /* IOCB is locked in memory */ #define SDP_IOCB_F_REG 0x00000080 /* IOCB memory is registered */ #define SDP_IOCB_F_RECV 0x00000100 /* IOCB is for a receive request */ +#ifdef CONFIG_INFINIBAND_SDP_RECV_ZCOPY +#define SDP_IOCB_F_RO 0x00000200 /* Suppress SinkAvail for this IOCB */ +#endif #define SDP_IOCB_F_ALL 0xFFFFFFFF /* IOCB all mask */ /* * zcopy constants. @@ -66,10 +69,12 @@ */ #define sdp_iocb_q_size(table) ((table)->size) +#ifdef CONFIG_INFINIBAND_SDP_SEND_ZCOPY +struct sdpc_iocb_wait; +#endif /* * INET read/write IOCBs */ - /* * save a kvec read/write for processing once data shows up. 
*/ @@ -80,7 +85,7 @@ struct sdpc_iocb_q *table; /* table to which this iocb belongs */ void (*release)(struct sdpc_iocb *iocb); /* release the object */ /* - * iocb sepcific + * iocb specific */ int flags; /* usage flags */ /* @@ -89,6 +94,7 @@ u32 key; /* matches kiocb key for lookups */ int len; /* space left in the user buffer */ int post; /* amount of data requested so far. */ + int lowat; /* when to complete this IOCB (receive only). */ u64 wrid; /* work request completing this IOCB */ ssize_t status; /* status of completed iocb */ /* @@ -112,6 +118,9 @@ int page_offset; /* offset into first page. */ struct work_struct completion; /* task for defered completion. */ +#ifdef CONFIG_INFINIBAND_SDP_SEND_ZCOPY + struct sdpc_iocb_wait *wait; +#endif /* * kernel iocb structure */ @@ -127,4 +136,17 @@ int size; /* current number of IOCBs in table */ }; +#ifdef CONFIG_INFINIBAND_SDP_SEND_ZCOPY +/* Report completions here */ +struct sdpc_iocb_wait { + spinlock_t lock; + int outstanding; + wait_queue_head_t wait; +#ifdef CONFIG_INFINIBAND_SDP_RECV_ZCOPY + struct sdpc_iocb_q q; /* Receive iocbs only */ + int lowat_reached; +#endif +}; + +#endif #endif /* _SDP_IOCB_H */ -- MST From halr at voltaire.com Thu Dec 1 06:20:17 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 01 Dec 2005 09:20:17 -0500 Subject: [openib-general] [PATCH] [TRIVIAL] OpenSM: osm_sm_state_mgr.c Fix some error codes Message-ID: <1133446817.4325.44.camel@hal.voltaire.com> osm_sm_state_mgr.c Eliminate some error codes Signed-off-by: Hal Rosenstock Index: osm_sm_state_mgr.c =================================================================== --- osm_sm_state_mgr.c (revision 4278) +++ osm_sm_state_mgr.c (working copy) @@ -300,7 +300,7 @@ __osm_sm_state_mgr_start_polling( if( cl_status != CL_SUCCESS ) { osm_log( p_sm_mgr->p_log, OSM_LOG_ERROR, - "__osm_sm_state_mgr_start_polling : ERROR 1000: " + "__osm_sm_state_mgr_start_polling : ERR 3210: " "Failed to start timer\n" ); } @@ -379,7 +379,7 @@ __osm_sm_state_mgr_polling_callback( if( cl_status != CL_SUCCESS ) { osm_log( p_sm_mgr->p_log, OSM_LOG_ERROR, - "__osm_sm_state_mgr_polling_callback : ERROR 1000: " + "__osm_sm_state_mgr_polling_callback : ERR 3211: " "Failed to re-start timer\n" ); } From jlentini at netapp.com Thu Dec 1 06:51:58 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 1 Dec 2005 09:51:58 -0500 (EST) Subject: [openib-general] [uDAPL]Linking error of dapltest for uDAPL In-Reply-To: <7b2fa1820511301900q485c8990t40d3876c45d7d0b8@mail.gmail.com> References: <7b2fa1820511301900q485c8990t40d3876c45d7d0b8@mail.gmail.com> Message-ID: > I want to confirm that my uDAPL could run correctly, so I try to build the > dapltest. But I have got a linking error. It seems that there is something > wrong with the "ldat", but I failed to find where this "ldat" is referenced. > I am using the default Makefile. Did you type make in svn/gen2/trunk/src/userspace/dapl/dat/udat? This will produce the dat registry library file in the Obj subdirectory. From halr at voltaire.com Thu Dec 1 06:52:57 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 01 Dec 2005 09:52:57 -0500 Subject: [openib-general] OpenSM: search_mgrp_by_mgid questions Message-ID: <1133448776.4325.179.camel@hal.voltaire.com> Hi Yael, osm_sa_path_record.c::__search_mgrp_by_mgid has the following: p_recvd_mgid = p_ctxt->p_mgid; p_rcv = p_ctxt->p_rcv; /* Why not compare the entire MGID ???? */ /* different scope can sneak in for the same MGID ? */ /* EZ: I changed it to full compare ! 
*/ if (cl_memcmp(&p_mgrp->mcmember_rec.mgid, p_recvd_mgid, sizeof(ib_gid_t))) return; whereas osm_sa_mcmember_record.c::__search_mgrp_by_mgid has the following: p_recvd_mcmember_rec = p_ctxt->p_mcmember_rec; p_rcv = p_ctxt->p_rcv; /* ignore groups marked for deletion */ if (p_mgrp->to_be_deleted) return; /* compare entire MGID so different scope will not sneak in for the same MGID */ if (cl_memcmp(&p_mgrp->mcmember_rec.mgid, &p_recvd_mcmember_rec->mgid, sizeof(ib_gid_t))) return; Shouldn't the SA PR code also check for "to be deleted" ? It also seems like the comments on the MGID comparison should also be made the same. -- Hal From jlentini at netapp.com Thu Dec 1 07:02:19 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 1 Dec 2005 10:02:19 -0500 (EST) Subject: [openib-general] Re: [PATCH][uDAPL] new provider with uCMA (librdmacm) support In-Reply-To: References: Message-ID: arlin> Here is a provider for the latest uCMA that is tested arlin> with dapltest, dtest, and Intel MPI. I also added a top arlin> level README with instructions. Default build is arlin> set for uCMA. Committed in revision 4279. From eitan at mellanox.co.il Thu Dec 1 07:28:20 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 1 Dec 2005 17:28:20 +0200 Subject: [openib-general] OpenSM: search_mgrp_by_mgid questions Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3618A47@mtlexch01.mtl.com> Hi Hal, You are very right. Thanks. Can you patch it? Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Thursday, December 01, 2005 4:53 PM > To: Yael Kalka > Cc: openib-general at openib.org > Subject: [openib-general] OpenSM: search_mgrp_by_mgid questions > > Hi Yael, > > osm_sa_path_record.c::__search_mgrp_by_mgid has the following: > > p_recvd_mgid = p_ctxt->p_mgid; > p_rcv = p_ctxt->p_rcv; > > /* Why not compare the entire MGID ???? */ > /* different scope can sneak in for the same MGID ? */ > /* EZ: I changed it to full compare ! */ > if (cl_memcmp(&p_mgrp->mcmember_rec.mgid, > p_recvd_mgid, > sizeof(ib_gid_t))) > return; > > whereas osm_sa_mcmember_record.c::__search_mgrp_by_mgid has the > following: > > p_recvd_mcmember_rec = p_ctxt->p_mcmember_rec; > p_rcv = p_ctxt->p_rcv; > > /* ignore groups marked for deletion */ > if (p_mgrp->to_be_deleted) > return; > > /* compare entire MGID so different scope will not sneak in for > the same MGID */ > if (cl_memcmp(&p_mgrp->mcmember_rec.mgid, > &p_recvd_mcmember_rec->mgid, > sizeof(ib_gid_t))) > return; > > Shouldn't the SA PR code also check for "to be deleted" ? It also seems > like the comments on the MGID comparison should also be made the same. > > -- Hal > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From eitan at mellanox.co.il Thu Dec 1 07:35:03 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 1 Dec 2005 17:35:03 +0200 Subject: [Fwd: [openib-general] OpenSM and Wrong SM_Key] Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3618A48@mtlexch01.mtl.com> Hi Yael, As I read through the MgtWg mails I get the impression that an out of spec mechanism is required to know if the other SM is trusted. 
In that case and since OpenSM does not currently provide any such mechanism, I would prefer never to send out the SM_Key on the request and always send zero. Sending our SM_Key to a non - trusted SM is not a good idea in my mind. OpenSM behavior should be to always trust any other SM. So any discovered SM that deserves to be the master should be granted that right. Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Yael Kalka > Sent: Thursday, December 01, 2005 2:17 PM > To: 'Hal Rosenstock'; Eitan Zahavi > Cc: openib-general at openib.org > Subject: RE: [Fwd: [openib-general] OpenSM and Wrong SM_Key] > > Hi Hal, Eitan, > I think the best option is to add an OpenSM option flag - exit_on_fatal. > This flag can decide on the action on fatal cases: > 1. Exit or not when seeing SM with different SM_Key. > 2. Exit or not when there is a fatal link error (e.g - multiple guids). > etc. > > I tried to run 2 SMs just now with different SM_keys, and I see that none of them > exit, since both receive SM_Key=0 on SMInfo GetResp. > The reason for that is that in the SMInfo Get request (as in all other requests) > we do not send anything in the mad data. Meaning - all fields are clear. > In the __osm_sminfo_rcv_process_get_request function we are checking the state > according > to the payload data. This is always zero! Thus - SM will never know that the SMInfo > request is sent from an SM that is master. > > I will work on a fix for that. > Yael > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Wednesday, November 30, 2005 11:57 PM > To: Yael Kalka; Eitan Zahavi > Cc: openib-general at openib.org > Subject: [Fwd: [openib-general] OpenSM and Wrong SM_Key] > > > Hi Yael & Eitan, > > Based on the recent MgtWG discussions, are you still holding your > position in terms of exiting OpenSM when a non matching SM Key is > discovered ? Just wondering if I can issue a patch for this and clear > this issue so OpenSM can be compliant for this aspect. Thanks. > > -- Hal > > -----Forwarded Message----- > > From: Hal Rosenstock > To: openib-general at openib.org > Subject: [openib-general] OpenSM and Wrong SM_Key > Date: 08 Nov 2005 16:08:47 -0500 > > Hi, > > Currently, when OpenSM receives SMInfo with a different SM_Key, it exits > as follows: > > > void > __osm_sminfo_rcv_process_get_response( > IN const osm_sminfo_rcv_t* const p_rcv, > IN const osm_madw_t* const p_madw ) > { > ... > > > > /* > Check that the sm_key of the found SM is the same as ours, > or is zero. If not - OpenSM cannot continue with configuration!. */ > if ( p_smi->sm_key != 0 && > p_smi->sm_key != p_rcv->p_subn->opt.sm_key ) > { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > "__osm_sminfo_rcv_process_get_response: ERR 2F18: " > "Got SM with sm_key that doesn't match our " > "local key. Exiting\n" ); > osm_log( p_rcv->p_log, OSM_LOG_SYS, > "Found remote SM with non-matching sm_key. Exiting\n" ); > osm_exit_flag = TRUE; > goto Exit; > } > > C14-61.2.1 states that: > A master SM which finds a higher priority master SM with the wrong > SM_Key should not relinquish the subnet. > > Exiting OpenSM relinquishes the subnet. > > So it appears to me that perhaps this behavior of exiting OpenSM should > be at least contingent on the SM state and relative priority of the > SMInfo received. Make sense ? If so, I will work on a patch for this. 
> > -- Hal > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From Arkady.Kanevsky at netapp.com Thu Dec 1 07:29:56 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Thu, 1 Dec 2005 10:29:56 -0500 Subject: [swg] RE: [openib-general] socket based connection model for IB proposal - round 4 Message-ID: agreed. Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 275 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Wednesday, November 30, 2005 12:59 PM > To: Yaron Haviv > Cc: Kanevsky, Arkady; Ted H. Kim; swg at infinibandta.org; > openib-general at openib.org > Subject: Re: [swg] RE: [openib-general] socket based > connection model for IB proposal - round 4 > > Yaron Haviv wrote: > > How about using ARP to get from IP to DGID+Partition, followed by an > > SIDR to map DGID+PKey+Service to QKey & QP > > > > It is the same concept as CMA that first uses the IP stack (ARP > etc') to > > get to the remote end-point (in that case a GID+PKey combination) > > followed by SA-PR and CM REQ, we just substitute the CM REQ with a > > SIDR REQ. It may not solve all the cases but probably most of the > > practical ones > > This was my thought as well. > > > Anyway the packets will need to carry some header (since it's not a > > connected model), you can add more stuff in that header > (e.g. can use > > the IPoIB header as is, which already contains the src/dst IP) > > I was assuming that each packet would need to carry some sort > of header. > > At this point, we may want to defer defining anything for UDP > until there's a better understanding of what an application > would want. My guess is that such an application will need > new APIs for posting sends based on UDP addressing. > > - Sean > From Arkady.Kanevsky at netapp.com Thu Dec 1 07:33:16 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Thu, 1 Dec 2005 10:33:16 -0500 Subject: [openib-general] socket based connection model for IB - round 5 Message-ID: Here is the fifth and, I hope, the final version of the proposal. The changes from the previous version: 1. IBTA bit numbering scheme (reverse order) 2. Protocol version is split into major and minor with 4 bits each. Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 275 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: IP Address Support by InfiniBand CM_v5.pdf Type: application/octet-stream Size: 24635 bytes Desc: IP Address Support by InfiniBand CM_v5.pdf URL: From eitan at mellanox.co.il Thu Dec 1 07:41:20 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 1 Dec 2005 17:41:20 +0200 Subject: [openib-general] First Multicast Leave disconnects all other clients Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3618A49@mtlexch01.mtl.com> Hi Hal, SRP uses InformInfo to get notification about new or lost ports (trap 64/65) so that new targets are recognized without a periodic SA query. I do not know if that code already found its way to OpenIB.
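As a sketch of the missing piece, the registration call SRP wants might look like the following (hypothetical signatures only - no such API exists in the tree today; struct ib_device and union ib_gid are assumed from <rdma/ib_verbs.h>):

struct ib_inform_reg;	/* opaque handle returned by the agent */

/* Called for each Report matching the subscription;
 * trap 64 = GID in service, trap 65 = GID out of service. */
typedef void (*ib_report_handler_t)(void *context, u16 trap_num,
				    union ib_gid *source_gid);

/* The agent behind these calls would track registrations per client,
 * refcount duplicate subscriptions, answer each Report with a single
 * ReportRepress, and replay the InformInfo Set after a
 * ClientReregistration event. */
struct ib_inform_reg *ib_reg_inform(struct ib_device *device, u8 port_num,
				    u16 trap_num, ib_report_handler_t handler,
				    void *context);
void ib_unreg_inform(struct ib_inform_reg *reg);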
I do not think it is relevant to that discussion about missing APIs. Maybe to the priority of implementation. But IMO - until we do provide that missing capabilities we are actually preventing SRP and other ulps from doing the right thing and causing them to duplicate "Client Reregistration" handlers and periodic queries . The bottom line: Do you agree we are missing these API's? When can we get those done? By whom? EZ Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Thursday, December 01, 2005 8:20 AM > To: Eitan Zahavi > Cc: OPENIB GENERAL; Yael Kalka; Aviram Gutman; Tziporet Koren; Roland Dreier; > sean.hefty at intel.com > Subject: RE: [openib-general] First Multicast Leave disconnects all other clients > > On Thu, 2005-12-01 at 01:07, Eitan Zahavi wrote: > > > > > > > > The bottom line: > > > > We are missing 3 agents in the OpenIB stack: > > > > InformInfo - handling registrations and Report dispatching > > > > > > These are not currently used. > > [EZ] They are by SRP initiator. > > Not the SRP initiator in OpenIB svn as far as I can tell. > > > > > ServiceRecord - tracks registrations > > > > > > ServiceRecord is implemented in sa_query (and was used by AT/uAT but > > > that is largely historical now) > > > > > > > Multicast Join/Leave - tracking registrations to multicast groups > > and > > > > ref-counting > > > > > > > > All these agents should be able to cleanup dead client registrations > > and > > > > also provide re-registration in case of SM ClientReregistration > > event. > > > > > > In OpenIB, any Set of PortInfo (which includes ClientReregister) > > > currently causes a (coarse) event (LID change) which causes IPoIB > > client > > > to reregister its multicasts registrations with the SA. > > > > > > > Please see below > > > > > > > > > > > > It seems the IBTA intent was that the IB driver will be > > responsible > > > > for maintaining > > > > > the list of clients > > > > > > registered to each group. > > > > > > > > > > Yes, the end node is responsible for tracking the registrations > > within > > > > > the node and fabricating responses when the node does not want to > > > > leave. > > > > > Is delete a different case though ? > > > > [EZ] No it is not. Delete of multicast group is really the last > > leave. > > > > > > There is an explicit delete. While it shouldn't be needed to be > > forced, > > > there is always some scenario where this is useful. > > [EZ] To my best knowledge any leave is a "delete" so there is no way for > > any client to force other members out of a group. It can only leave > > itself. The delete will happen when the last will leave. > > Yes, you are right, other than the last full member (join state) rule. > > > > > > > But the IB core does not track what clients registered (through > > SA > > > > requests) to a > > > > > particular multicast group. > > > > > > The first client to leave the group causes the rest (of the > > clients) > > > > to be disconnected. > > > > > > > > > > This is an implementation issue IMO and applies to other > > subscriptions > > > > > too (not just limited to multicast). > > > > [EZ] I agree it is an implementation issue. I hope it will get > > > > implemented in OpenIB. > > > > > > It will. It's a question of priorities and timing. 
> > > > > > > > > My proposal is to provide an API for such registrations at both > > user > > > > and kernel and > > > > > track the requesting processes. > > > > > > Cleanup is also required both by process and kernel module > > > > granularity. > > > > > > > > > > Is the API the SA client request itself for this ? Shouldn't the > > > > > tracking be done there (within sa_query.c) ? > > > > [EZ] It will be hard to sniff the MADs (especially user level) for > > all > > > > the registration flows. > > > > > > It's not the sniffing which is hard but perhaps identifying which > > client > > > (and reference counting). > > > > > > > So I propose we should have > > > > > > ib_join/ib_leave/ib_reg_svc/ib_unreg_svc/ib_reg_inform/ib_unreg_inform. > > > > Both in user land and in kernel. > > > > > > I think this is TBD and the API would be discussed on this list first > > > prior to any implementation. > > > > > > > > > BTW: The same API could also handle "Client Reregistration" for > > > > multicast groups, > > > > > > > > > > Client reregistration is for all subscriptions (including > > > > ServiceRecords > > > > > and events as well). > > > > [EZ] Yes exactly. I believe similar problem exists for all > > > > registrations. > > > > > > > > > > > such that we could avoid the need to have that code duplicated > > by > > > > every client. > > > > > > > > > > I'm missing how client reregistration would help here. Can you > > > > elaborate > > > > > ? > > > > [EZ] It is related to the reference tracking: > > > > If a kernel module tracks all registrations to refcount them and > > perform > > > > cleanup, it could with similar effort also send the - > > re-registration in > > > > the event of SM change ... > > > > > > Sure, there are multiple ways to skin the same cat. > > > > > > > > > > > > > > But this refers to yet another API that is missing: Report > > > > dispatching which deserves > > > > > its own > > > > > > mail... > > > > > > > > > > I'm missing the connection between reregistration and report > > > > > dispatching. > > > > [EZ] Sorry for not being verbose. The need for Events dispatcher is > > > > based on the fact that only one client should respond to Report with > > > > ReportRepress. Reports are "unsolicited" MADs coming into the > > device. In > > > > umad the implementation prevents any "multiple" client registration > > for > > > > receiving any "unsolicited" MAD - only one class-agent needs to be > > there > > > > handling "unsolicited" messages. This is fine - but what it means is > > > > that when two clients wants to be notified about events they should > > > > register with that agent and the agent should be able to dispatch > > the > > > > message to all registered clients as well as send only one response > > > > back. > > > > > > Wouldn't report represses be reference counted and only actually sent > > on > > > the wire when all subscribed clients within the node indicated repress > > ? > > [EZ] As you say there are many ways to skin a cat. I am not sure we need > > to wait for all clients as they are located on the same node and will be > > surely notified. > > Right, it just needs to be done once whether it was actually delivered > to any client, clients, or none at all. 
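To make the dispatch model being discussed concrete, a minimal sketch follows — all names are hypothetical, since this is exactly the agent the thread says does not yet exist: one class agent owns the unsolicited Report MADs, fans them out to every locally registered client, and represses exactly once on the wire:

    /* hypothetical event dispatcher; struct ib_mad is real, the rest
     * of the names are made up for illustration */
    static void notice_report_handler(struct notice_agent *agent,
                                      struct ib_mad *report)
    {
        struct notice_client *c;

        spin_lock(&agent->lock);
        list_for_each_entry(c, &agent->clients, list)
            c->callback(c->context, report);  /* local fan-out (a real
                                                 agent would defer this
                                                 outside the lock) */
        spin_unlock(&agent->lock);

        send_report_repress(agent, report);   /* once, regardless of
                                                 how many clients saw it */
    }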
> > -- Hal From iod00d at hp.com Thu Dec 1 08:16:00 2005 From: iod00d at hp.com (Grant Grundler) Date: Thu, 1 Dec 2005 08:16:00 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS (Reliable Datagram Sockets) to OpenIB In-Reply-To: <52u0dvt4vx.fsf@cisco.com> References: <52d5li8waw.fsf@cisco.com> <52u0dvt4vx.fsf@cisco.com> Message-ID: <20051201161600.GA32308@esmail.cup.hp.com> On Tue, Nov 29, 2005 at 03:23:46PM -0800, Roland Dreier wrote: > Any progress to report on the port of RDS from the SilverStorm > proprietary stack to the standard Linux stack? I think it would > really move the discussion forward if there were some code that people > could build and use. As primary consumer of RDS, I think Oracle first needs to decide if the deficiencies that Mike Krause pointed out are acceptable or not. grant From halr at voltaire.com Thu Dec 1 09:02:25 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 01 Dec 2005 12:02:25 -0500 Subject: [Fwd: [openib-general] OpenSM and Wrong SM_Key] In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3618A48@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3618A48@mtlexch01.mtl.com> Message-ID: <1133456350.4325.766.camel@hal.voltaire.com> Hi Eitan, On Thu, 2005-12-01 at 10:35, Eitan Zahavi wrote: > Hi Yael, > > As I read through the MgtWg mails I get the impression that an out of > spec mechanism is required to know if the other SM is trusted. Yes, that was what I was proposing (in http://openib.org/pipermail/openib-general/2005-December/014186.html where I wrote "The SM needs a way to know whether the other SM(s) (and which ones) are trusted or not so the SM_Key can be filled in."): that OpenSM have a list of trusted SMs and OpenSM would use that information. > In that case and since OpenSM does not currently provide any such > mechanism, I would prefer never to send out the SM_Key on the request > and always send zero. Sending our SM_Key to a non - trusted SM is not a > good idea in my mind. > > OpenSM behavior should be to always trust any other SM. Above you said no other SM was trusted so do you mean not trust rather than trust other SMs ? > So any discovered SM that deserves to be the master should be granted > that right. Only if it were trusted and had the correct SM Key. -- Hal > Eitan Zahavi > Design Technology Director > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > > -----Original Message----- > > From: Yael Kalka > > Sent: Thursday, December 01, 2005 2:17 PM > > To: 'Hal Rosenstock'; Eitan Zahavi > > Cc: openib-general at openib.org > > Subject: RE: [Fwd: [openib-general] OpenSM and Wrong SM_Key] > > > > Hi Hal, Eitan, > > I think the best option is to add an OpenSM option flag - > exit_on_fatal. > > This flag can decide on the action on fatal cases: > > 1. Exit or not when seeing SM with different SM_Key. > > 2. Exit or not when there is a fatal link error (e.g - multiple > guids). > > etc. > > > > I tried to run 2 SMs just now with different SM_keys, and I see that > none of them > > exit, since both receive SM_Key=0 on SMInfo GetResp. > > The reason for that is that in the SMInfo Get request (as in all other > requests) > > we do not send anything in the mad data. Meaning - all fields are > clear. > > In the __osm_sminfo_rcv_process_get_request function we are checking > the state > > according > > to the payload data. This is always zero! Thus - SM will never know > that the SMInfo > > request is sent from an SM that is master. > > > > I will work on a fix for that. 
> > Yael > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Wednesday, November 30, 2005 11:57 PM > > To: Yael Kalka; Eitan Zahavi > > Cc: openib-general at openib.org > > Subject: [Fwd: [openib-general] OpenSM and Wrong SM_Key] > > > > > > Hi Yael & Eitan, > > > > Based on the recent MgtWG discussions, are you still holding your > > position in terms of exiting OpenSM when a non matching SM Key is > > discovered ? Just wondering if I can issue a patch for this and clear > > this issue so OpenSM can be compliant for this aspect. Thanks. > > > > -- Hal > > > > -----Forwarded Message----- > > > > From: Hal Rosenstock > > To: openib-general at openib.org > > Subject: [openib-general] OpenSM and Wrong SM_Key > > Date: 08 Nov 2005 16:08:47 -0500 > > > > Hi, > > > > Currently, when OpenSM receives SMInfo with a different SM_Key, it > exits > > as follows: > > > > > > void > > __osm_sminfo_rcv_process_get_response( > > IN const osm_sminfo_rcv_t* const p_rcv, > > IN const osm_madw_t* const p_madw ) > > { > > ... > > > > > > > > /* > > Check that the sm_key of the found SM is the same as ours, > > or is zero. If not - OpenSM cannot continue with configuration!. > */ > > if ( p_smi->sm_key != 0 && > > p_smi->sm_key != p_rcv->p_subn->opt.sm_key ) > > { > > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > "__osm_sminfo_rcv_process_get_response: ERR 2F18: " > > "Got SM with sm_key that doesn't match our " > > "local key. Exiting\n" ); > > osm_log( p_rcv->p_log, OSM_LOG_SYS, > > "Found remote SM with non-matching sm_key. Exiting\n" ); > > osm_exit_flag = TRUE; > > goto Exit; > > } > > > > C14-61.2.1 states that: > > A master SM which finds a higher priority master SM with the wrong > > SM_Key should not relinquish the subnet. > > > > Exiting OpenSM relinquishes the subnet. > > > > So it appears to me that perhaps this behavior of exiting OpenSM > should > > be at least contingent on the SM state and relative priority of the > > SMInfo received. Make sense ? If so, I will work on a patch for this. > > > > -- Hal > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Thu Dec 1 09:19:31 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 01 Dec 2005 12:19:31 -0500 Subject: [openib-general] OpenSM: search_mgrp_by_mgid questions In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3618A47@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3618A47@mtlexch01.mtl.com> Message-ID: <1133457475.4325.888.camel@hal.voltaire.com> Hi Eitan, On Thu, 2005-12-01 at 10:28, Eitan Zahavi wrote: > Hi Hal, > > You are very right. Thanks. Can you patch it? Sure. Any prefereance for which way should the comment be (like PR or MCM) ? -- Hal > Eitan Zahavi > Design Technology Director > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Thursday, December 01, 2005 4:53 PM > > To: Yael Kalka > > Cc: openib-general at openib.org > > Subject: [openib-general] OpenSM: search_mgrp_by_mgid questions > > > > Hi Yael, > > > > osm_sa_path_record.c::__search_mgrp_by_mgid has the following: > > > > p_recvd_mgid = p_ctxt->p_mgid; > > p_rcv = p_ctxt->p_rcv; > > > > /* Why not compare the entire MGID ???? 
*/ > > /* different scope can sneak in for the same MGID ? */ > > /* EZ: I changed it to full compare ! */ > > if (cl_memcmp(&p_mgrp->mcmember_rec.mgid, > > p_recvd_mgid, > > sizeof(ib_gid_t))) > > return; > > > > whereas osm_sa_mcmember_record.c::__search_mgrp_by_mgid has the > > following: > > > > p_recvd_mcmember_rec = p_ctxt->p_mcmember_rec; > > p_rcv = p_ctxt->p_rcv; > > > > /* ignore groups marked for deletion */ > > if (p_mgrp->to_be_deleted) > > return; > > > > /* compare entire MGID so different scope will not sneak in for > > the same MGID */ > > if (cl_memcmp(&p_mgrp->mcmember_rec.mgid, > > &p_recvd_mcmember_rec->mgid, > > sizeof(ib_gid_t))) > > return; > > > > Shouldn't the SA PR code also check for "to be deleted" ? It also > seems > > like the comments on the MGID comparison should also be made the same. > > > > -- Hal > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From Richard.Frank at oracle.com Thu Dec 1 10:02:37 2005 From: Richard.Frank at oracle.com (Richard Frank) Date: Thu, 01 Dec 2005 13:02:37 -0500 Subject: [openib-general] [ANNOUNCE] Contribute RDS (Reliable Datagram Sockets) to OpenIB In-Reply-To: <20051201161600.GA32308@esmail.cup.hp.com> References: <52d5li8waw.fsf@cisco.com> <52u0dvt4vx.fsf@cisco.com> <20051201161600.GA32308@esmail.cup.hp.com> Message-ID: <1133460157.6456.44.camel@localhost.localdomain> We do not see any deficiencies - the RDS specification and current implementation so far meet our requirements and is working very well. There is more we will want to do further down the road - such as access the RDS sockets via AIO so we can add zero copy support. On Thu, 2005-12-01 at 08:16 -0800, Grant Grundler wrote: > On Tue, Nov 29, 2005 at 03:23:46PM -0800, Roland Dreier wrote: > > Any progress to report on the port of RDS from the SilverStorm > > proprietary stack to the standard Linux stack? I think it would > > really move the discussion forward if there were some code that people > > could build and use. > > As primary consumer of RDS, I think Oracle first needs to decide if > the deficiencies that Mike Krause pointed out are acceptable or not. > > grant > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mshefty at ichips.intel.com Thu Dec 1 10:28:00 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 01 Dec 2005 10:28:00 -0800 Subject: [openib-general] Re: spinlock wrong CPU on CPU#1, ib_addr In-Reply-To: References: Message-ID: <438F40B0.4010000@ichips.intel.com> Or Gerlitz wrote: > BUG: spinlock wrong CPU on CPU#1, ib_addr/3866 lock: ffffffff88073428, > .magic: dead4ead, .owner: ib_addr/3866, .owner_cpu: 0 > > Call Trace:{_raw_spin_unlock+112} > {:ib_iser:iser_adaptor_find_by_device+188} Based on my efforts trying to decipher the code, it looks like the adaptor_list_lock in iser_adaptor_find_device() was acquired while running on CPU 0, but an attempt was made to release it on CPU 1. I think that the .owner field above is simply referring to the fact that the thread was created by the ib_addr module. Maybe someone more familiar with the spinlock code can confirm this? 
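For readers following along: the debug check fires when a lock is released on a different CPU than the one recorded at acquire time, which can happen if the task sleeps and migrates while holding the lock. A contrived sketch of that pattern — the names below are made up, this is not the actual iSER code:

    struct iser_adaptor *adaptor;   /* hypothetical type */

    spin_lock(&adaptor_list_lock);              /* acquired on CPU 0 */
    list_for_each_entry(adaptor, &adaptor_list, list) {
        if (do_something_that_sleeps(adaptor))  /* schedule() may run and
                                                   migrate us to CPU 1 */
            break;
    }
    spin_unlock(&adaptor_list_lock);   /* CONFIG_DEBUG_SPINLOCK: "wrong CPU" */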
Also, what happens if an interrupt occurs on the processor while this spinlock is held? A couple of things to try would be: changing from spin_lock to the spin_lock_irqsave, or using a mutex. I didn't notice where there were any errors in ib_addr, the CMA, or iSER. Maybe this is an issue in the spinlock debugging code? How easily is this reproduced? Does the system continue to operate after you see this? - Sean From mshefty at ichips.intel.com Thu Dec 1 11:39:28 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 01 Dec 2005 11:39:28 -0800 Subject: [openib-general] [PATCH] [CM] add private data comparison to match REQs with listens In-Reply-To: References: Message-ID: <438F5170.3000208@ichips.intel.com> Sean Hefty wrote: > The following patch modifies the kernel CM API to support matching > private data in received REQs against listen requests. This allows the > CM to support multiple listeners on the same service ID if a > discriminator is carried in the private data. As an update: further testing revealed that there is an issue with this implementation that is also found in the original code. The issue deals with how listen requests that rely on a data mask are inserted and located in the red/black tree. I'm trying to come up with a fix for this. - Sean From rpandit at silverstorm.com Thu Dec 1 15:16:06 2005 From: rpandit at silverstorm.com (Ranjit Pandit) Date: Thu, 1 Dec 2005 15:16:06 -0800 Subject: [openib-general] [ANNOUNCE] Contribute RDS (Reliable Datagram Sockets) to OpenIB In-Reply-To: <1133460157.6456.44.camel@localhost.localdomain> References: <52d5li8waw.fsf@cisco.com> <52u0dvt4vx.fsf@cisco.com> <20051201161600.GA32308@esmail.cup.hp.com> <1133460157.6456.44.camel@localhost.localdomain> Message-ID: <96f8e60e0512011516g784d7740j1032e05604519d93@mail.gmail.com> I'm shooting to get something out by mid December. Ranjit On 12/1/05, Richard Frank wrote: > We do not see any deficiencies - the RDS specification and current > implementation so far meet our requirements and is working very well. > > There is more we will want to do further down the road - such as access > the RDS sockets via AIO so we can add zero copy support. > > > On Thu, 2005-12-01 at 08:16 -0800, Grant Grundler wrote: > > On Tue, Nov 29, 2005 at 03:23:46PM -0800, Roland Dreier wrote: > > > Any progress to report on the port of RDS from the SilverStorm > > > proprietary stack to the standard Linux stack? I think it would > > > really move the discussion forward if there were some code that people > > > could build and use. > > > > As primary consumer of RDS, I think Oracle first needs to decide if > > the deficiencies that Mike Krause pointed out are acceptable or not. > > > > grant > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From dledford at redhat.com Thu Dec 1 18:50:06 2005 From: dledford at redhat.com (Doug Ledford) Date: Thu, 01 Dec 2005 21:50:06 -0500 Subject: [openib-general] Announce: Updated packages available Message-ID: <438FB65E.50406@redhat.com> I've added to the list of available packages. 
In addition to libibverbs, libmthca, libsdp, and opensm, we now have udapl compiled. We also have an update initscripts package for RHEL-4 that enables static IP setups on ipoib interfaces and works at boot time. In addition, all the user space tools have been revved up to svn rev 4265. The kernel has not been recompiled since the last one and is still at 3965. I hope to get an updated kernel sometime tomorrow. http://people.redhat.com/dledford/Infiniband From yaronh at voltaire.com Fri Dec 2 05:49:00 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Fri, 2 Dec 2005 15:49:00 +0200 Subject: [openib-general] [ANNOUNCE] Contribute RDS (Reliable DatagramSockets) to OpenIB Message-ID: <35EA21F54A45CB47B879F21A91F4862F8FF0B6@taurus.voltaire.com> > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Richard Frank > Sent: Thursday, December 01, 2005 1:03 PM > To: Grant Grundler > Cc: openib-general at openib.org > Subject: Re: [openib-general] [ANNOUNCE] Contribute RDS (Reliable > DatagramSockets) to OpenIB > > We do not see any deficiencies - the RDS specification and current > implementation so far meet our requirements and is working very well. > > There is more we will want to do further down the road - such as access > the RDS sockets via AIO so we can add zero copy support. > Richard, In the document you published few weeks ago you listed latency and CPU% as key goals I assume to really get the latency down you need a user space implementation that can leverage on pooling, any plans to work in user space ? Several other comments/suggestions if I may add (may already took them into account): As a UDP consumer isn't there a need to support Multicast as well, and potentially leverage on IB multicast for scalability ? I feel that there is not much benefit in eliminating the reliability checks in the upper (UDP) consumer, since its negligible in CPU or latency overhead, you may even just go with a UC implementation, also UDP consumers may want to use RDS without modifying the application, or may accept dropped packets or over subscription (since they are interested in the most recent data). And it is very important to tie the RDS implementation to the IP stack for routing information/resolution, ARPs, etc' So it would become transparent from the mng/configuration side as well, not requiring separate configuration files, or dealing better with dynamic environments and failures like a real UDP would. Yaron > > On Thu, 2005-12-01 at 08:16 -0800, Grant Grundler wrote: > > On Tue, Nov 29, 2005 at 03:23:46PM -0800, Roland Dreier wrote: > > > Any progress to report on the port of RDS from the SilverStorm > > > proprietary stack to the standard Linux stack? I think it would > > > really move the discussion forward if there were some code that people > > > could build and use. > > > > As primary consumer of RDS, I think Oracle first needs to decide if > > the deficiencies that Mike Krause pointed out are acceptable or not. 
> > > > grant > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib- > general > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib- > general From halr at voltaire.com Fri Dec 2 06:46:55 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Dec 2005 09:46:55 -0500 Subject: [openib-general] [PATCH] OpenSM: In osm_sa_path_record.c::__search_mgrp_by_mgid, ignore groups marked for deletion Message-ID: <1133534815.4337.57.camel@hal.voltaire.com> In osm_sa_path_record.c::__search_mgrp_by_mgid, ignore groups marked for deletion Signed-off-by: Hal Rosenstock Index: osm_sa_path_record.c =================================================================== --- osm_sa_path_record.c (revision 4280) +++ osm_sa_path_record.c (working copy) @@ -1230,6 +1230,10 @@ __search_mgrp_by_mgid( p_recvd_mgid = p_ctxt->p_mgid; p_rcv = p_ctxt->p_rcv; + /* ignore groups marked for deletion */ + if (p_mgrp->to_be_deleted) + return; + /* Why not compare the entire MGID ???? */ /* different scope can sneak in for the same MGID ? */ /* EZ: I changed it to full compare ! */ From halr at voltaire.com Fri Dec 2 08:10:51 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Dec 2005 11:10:51 -0500 Subject: [openib-general] First Multicast Leave disconnects all other clients In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3618A49@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3618A49@mtlexch01.mtl.com> Message-ID: <1133539562.4337.454.camel@hal.voltaire.com> On Thu, 2005-12-01 at 10:41, Eitan Zahavi wrote: > Hi Hal, > > SRP uses InformInfo to get notification about new or lost ports (trap > 64/65) such that new targets are recognized without periodic SA query. > I do not know if that code already found its way to OpenIB. It hasn't. > I do not think it is relevant to that discussion about missing APIs. > Maybe to the priority of implementation. But IMO - until we do provide > that missing capabilities we are actually preventing SRP and other ulps > from doing the right thing and causing them to duplicate "Client > Reregistration" handlers and periodic queries . > The bottom line: Do you agree we are missing these API's? Yes, OpenIB is missing this functionality. I vaguely recall having this discussion with you on the list a while ago... What shape the API would take is a discussion for this list. Is it an extension to the existing SA client API ? > When can we get those done? By whom? That's also a discussion for this list. Anyone else care to comment ? -- Hal > EZ > > Eitan Zahavi > Design Technology Director > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. 
Box 586 Yokneam 20692 ISRAEL > > > > --Original Message-- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Thursday, December 01, 2005 8:20 AM > > To: Eitan Zahavi > > Cc: OPENIB GENERAL; Yael Kalka; Aviram Gutman; Tziporet Koren; Roland > Dreier; > > sean.hefty at intel.com > > Subject: RE: [openib-general] First Multicast Leave disconnects all > other clients > > > > On Thu, 2005-12-01 at 01:07, Eitan Zahavi wrote: > > > > > > > > > > The bottom line: > > > > > We are missing 3 agents in the OpenIB stack: > > > > > InformInfo - handling registrations and Report dispatching > > > > > > > > These are not currently used. > > > [EZ] They are by SRP initiator. > > > > Not the SRP initiator in OpenIB svn as far as I can tell. > > > > > > > ServiceRecord - tracks registrations > > > > > > > > ServiceRecord is implemented in sa_query (and was used by AT/uAT > but > > > > that is largely historical now) > > > > > > > > > Multicast Join/Leave - tracking registrations to multicast > groups > > > and > > > > > ref-counting > > > > > > > > > > All these agents should be able to cleanup dead client > registrations > > > and > > > > > also provide re-registration in case of SM ClientReregistration > > > event. > > > > > > > > In OpenIB, any Set of PortInfo (which includes ClientReregister) > > > > currently causes a (coarse) event (LID change) which causes IPoIB > > > client > > > > to reregister its multicasts registrations with the SA. > > > > > > > > > Please see below > > > > > > > > > > > > > > It seems the IBTA intent was that the IB driver will be > > > responsible > > > > > for maintaining > > > > > > the list of clients > > > > > > > registered to each group. > > > > > > > > > > > > Yes, the end node is responsible for tracking the > registrations > > > within > > > > > > the node and fabricating responses when the node does not want > to > > > > > leave. > > > > > > Is delete a different case though ? > > > > > [EZ] No it is not. Delete of multicast group is really the last > > > leave. > > > > > > > > There is an explicit delete. While it shouldn't be needed to be > > > forced, > > > > there is always some scenario where this is useful. > > > [EZ] To my best knowledge any leave is a "delete" so there is no way > for > > > any client to force other members out of a group. It can only leave > > > itself. The delete will happen when the last will leave. > > > > Yes, you are right, other than the last full member (join state) rule. > > > > > > > > > But the IB core does not track what clients registered > (through > > > SA > > > > > requests) to a > > > > > > particular multicast group. > > > > > > > The first client to leave the group causes the rest (of the > > > clients) > > > > > to be disconnected. > > > > > > > > > > > > This is an implementation issue IMO and applies to other > > > subscriptions > > > > > > too (not just limited to multicast). > > > > > [EZ] I agree it is an implementation issue. I hope it will get > > > > > implemented in OpenIB. > > > > > > > > It will. It's a question of priorities and timing. > > > > > > > > > > > My proposal is to provide an API for such registrations at > both > > > user > > > > > and kernel and > > > > > > track the requesting processes. > > > > > > > Cleanup is also required both by process and kernel module > > > > > granularity. > > > > > > > > > > > > Is the API the SA client request itself for this ? Shouldn't > the > > > > > > tracking be done there (within sa_query.c) ? 
> > > > > [EZ] It will be hard to sniff the MADs (especially user level) > for > > > all > > > > > the registration flows. > > > > > > > > It's not the sniffing which is hard but perhaps identifying which > > > client > > > > (and reference counting). > > > > > > > > > So I propose we should have > > > > > > > > > ib_join/ib_leave/ib_reg_svc/ib_unreg_svc/ib_reg_inform/ib_unreg_inform. > > > > > Both in user land and in kernel. > > > > > > > > I think this is TBD and the API would be discussed on this list > first > > > > prior to any implementation. > > > > > > > > > > > BTW: The same API could also handle "Client Reregistration" > for > > > > > multicast groups, > > > > > > > > > > > > Client reregistration is for all subscriptions (including > > > > > ServiceRecords > > > > > > and events as well). > > > > > [EZ] Yes exactly. I believe similar problem exists for all > > > > > registrations. > > > > > > > > > > > > > such that we could avoid the need to have that code > duplicated > > > by > > > > > every client. > > > > > > > > > > > > I'm missing how client reregistration would help here. Can you > > > > > elaborate > > > > > > ? > > > > > [EZ] It is related to the reference tracking: > > > > > If a kernel module tracks all registrations to refcount them and > > > perform > > > > > cleanup, it could with similar effort also send the - > > > re-registration in > > > > > the event of SM change ... > > > > > > > > Sure, there are multiple ways to skin the same cat. > > > > > > > > > > > > > > > > > But this refers to yet another API that is missing: Report > > > > > dispatching which deserves > > > > > > its own > > > > > > > mail... > > > > > > > > > > > > I'm missing the connection between reregistration and report > > > > > > dispatching. > > > > > [EZ] Sorry for not being verbose. The need for Events dispatcher > is > > > > > based on the fact that only one client should respond to Report > with > > > > > ReportRepress. Reports are "unsolicited" MADs coming into the > > > device. In > > > > > umad the implementation prevents any "multiple" client > registration > > > for > > > > > receiving any "unsolicited" MAD - only one class-agent needs to > be > > > there > > > > > handling "unsolicited" messages. This is fine - but what it > means is > > > > > that when two clients wants to be notified about events they > should > > > > > register with that agent and the agent should be able to > dispatch > > > the > > > > > message to all registered clients as well as send only one > response > > > > > back. > > > > > > > > Wouldn't report represses be reference counted and only actually > sent > > > on > > > > the wire when all subscribed clients within the node indicated > repress > > > ? > > > [EZ] As you say there are many ways to skin a cat. I am not sure we > need > > > to wait for all clients as they are located on the same node and > will be > > > surely notified. > > > > Right, it just needs to be done once whether it was actually delivered > > to any client, clients, or none at all. 
> > -- Hal
From mshefty at ichips.intel.com Fri Dec 2 10:52:13 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 02 Dec 2005 10:52:13 -0800 Subject: [openib-general] [PATCH] [CM] add private data comparison to match REQs with listens In-Reply-To: <438F5170.3000208@ichips.intel.com> References: <438F5170.3000208@ichips.intel.com> Message-ID: <439097DD.2050300@ichips.intel.com> Sean Hefty wrote: > As an update: further testing revealed that there is an issue with this > implementation that is also found in the original code. The issue deals > with how listen requests that rely on a data mask are inserted and > located in the red/black tree. I'm trying to come up with a fix for this. After researching into this, I'm coming to the conclusion that there does not exist an efficient way to sort/search for listens without adding some restrictions. For example, a client listens on id1 with mask1. A request is matched with the listen if its serviceid & mask1 = id1. If a second client listens on id2 with mask2, then a request must check against both requests for a match, or until a match is found. There's no method that I can find that can be used to filter checks that works in a generic fashion, resulting in requests needing to walk a linear list of listens. There are several potential fixes for this, with only a couple mentioned below. One solution around this is to have the IB CM only listen on service IDs, and remove the mask parameter from the API. This requires SDP to change to only listen on ports that have a listener. Another alternative is to restrict the type of masks that are supported. If masks are restricted to a series of most significant bits, then the existing algorithm can be used. For instance, we can support masks 0xFF00 and 0xFFF0, but not 0x00FF or 0xFF0F. This restriction would work for both SDP and the CMA. To be clear, the API could change from a mask to the number of bits to match. Matching on private data can either be done by clients, or restrictions can be placed on it as well. For private data, I believe that a restriction that all listen requests on the same service ID use the same mask is sufficient. Hopefully this makes sense to people. Thoughts? - Sean From caitlinb at broadcom.com Fri Dec 2 10:58:54 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 2 Dec 2005 10:58:54 -0800 Subject: [openib-general] [PATCH] [CM] add private data comparison to match REQs with listens Message-ID: <54AD0F12E08D1541B826BE97C98F99F10C27C8@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > Sean Hefty wrote: >> As an update: further testing revealed that there is an issue with >> this implementation that is also found in the original code. The >> issue deals with how listen requests that rely on a data mask are >> inserted and located in the red/black tree. I'm trying to come up >> with a fix for this. > > After researching into this, I'm coming to the conclusion > that there does not exist an efficient way to sort/search for > listens without adding some restrictions. > > For example, a client listens on id1 with mask1. A request > is matched with the listen if its serviceid & mask1 = id1.
> If a second client listens on id2 with mask2, then a request > must check against both requests for a match, or until a > match is found. There's no method that I can find that can > be used to filter checks that works in a generic fashion, > resulting in requests needing to walk a linear list of > listens. There are several potential fixes for this, with > only a couple mentioned below. > > One solution around this is to have the IB CM only listen on > service IDs, and remove the mask parameter from the API. > This requires SDP to change to only listen on ports that have a > listener. > > Another alternative is to restrict the type of masks that are > supported. If masks are restricted to a series of most > significant bits, then the existing algorithm can be used. > For instance, we can support masks 0xFF00 and 0xFFF0, but not > 0x00FF or 0xFF0F. This restriction would work for both SDP and the > CMA. To be clear, the API could change from a mask to the number of > bits to match. > > Matching on private data can either be done by clients, or > restrictions can be placed on it as well. For private data, > I believe that a restriction that all listen requests on the > same service ID use the same mask is sufficient. > > Hopefully this makes sense to people. Thoughts? > Just listen on the Service ID / Port and let the ULP sort them out by destination IP address. From ftillier at silverstorm.com Fri Dec 2 11:39:15 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Fri, 2 Dec 2005 11:39:15 -0800 Subject: [openib-general] [PATCH] [CM] add private data comparison to match REQs with listens In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F10C27C8@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <001901c5f778$180fc4f0$9e5aa8c0@infiniconsys.com> > From: Caitlin Bestler [mailto:caitlinb at broadcom.com] > Sent: Friday, December 02, 2005 10:59 AM > > Just listen on the Service ID / Port and let the ULP sort them > out by destination IP address. That only works if there is a single kernel module providing the extra checks. Multiple user-mode ULPs cannot do the checking in user-mode - the checking must be done in the kernel to figure out which user-mode client to hand the request to. I think putting in restrictions to the comparisons possible is fine, as the functionality of having the CM facilitate some sort of filtering is useful. - Fab From caitlinb at broadcom.com Fri Dec 2 11:45:34 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 2 Dec 2005 11:45:34 -0800 Subject: [openib-general] [PATCH] [CM] add private data comparison to match REQs with listens Message-ID: <54AD0F12E08D1541B826BE97C98F99F10C27DB@NT-SJCA-0751.brcm.ad.broadcom.com> Fab Tillier wrote: >> From: Caitlin Bestler [mailto:caitlinb at broadcom.com] >> Sent: Friday, December 02, 2005 10:59 AM >> >> Just listen on the Service ID / Port and let the ULP sort them out by >> destination IP address. > > That only works if there is a single kernel module providing the > extra checks. Multiple user-mode ULPs cannot do the checking in > user-mode - the checking must be done in the kernel to figure out > which user-mode client to hand the request to. > > I think putting in restrictions to the comparisons possible > is fine, as the functionality of having the CM facilitate > some sort of filtering is useful. > > - Fab Filtering between multiple kernels is fine, but it does not involve the API. 
Basically if you are filtering amongst multiple kernels then the Hypervisor is doing it, not any user-mode client or even the individual kernels. Within the context of a single Guest OS, I do not see the need to have multiple listens on the same Service ID / TCP Port number for a given network interface. In that context the more common usage is to have the destination address select the *content* that will be served, not the service that will be selected. (Or put another way, the class of the daemon object being connected with is a constant -- only the instance data is different). For those uses the number of virtual servers can be very large, literally thousands. The filtering/selection is best left to the content-serving daemon itself. From mshefty at ichips.intel.com Fri Dec 2 11:52:31 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 02 Dec 2005 11:52:31 -0800 Subject: [openib-general] [PATCH] [CM] add private data comparison to match REQs with listens In-Reply-To: <001901c5f778$180fc4f0$9e5aa8c0@infiniconsys.com> References: <001901c5f778$180fc4f0$9e5aa8c0@infiniconsys.com> Message-ID: <4390A5FF.9090404@ichips.intel.com> Fab Tillier wrote: >>Just listen on the Service ID / Port and let the ULP sort them >>out by destination IP address. > > That only works if there is a single kernel module providing the extra checks. > Multiple user-mode ULPs cannot do the checking in user-mode - the checking must > be done in the kernel to figure out which user-mode client to hand the request > to. > > I think putting in restrictions to the comparisons possible is fine, as the > functionality of having the CM facilitate some sort of filtering is useful. My concern with pushing this to the ULP is that it requires the ULP to track service IDs for reference counting purposes and adds additional synchronization to the ULP that could have been handled by the CM. I'm looking at what the full effect of implementing this in the ULP would be. - Sean From caitlinb at broadcom.com Fri Dec 2 12:13:00 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 2 Dec 2005 12:13:00 -0800 Subject: [openib-general] [PATCH] [CM] add private data comparison to match REQs with listens Message-ID: <54AD0F12E08D1541B826BE97C98F99F10C27E4@NT-SJCA-0751.brcm.ad.broadcom.com> Sean Hefty wrote: > Fab Tillier wrote: >>> Just listen on the Service ID / Port and let the ULP sort them out >>> by destination IP address. >> >> That only works if there is a single kernel module providing the >> extra checks. Multiple user-mode ULPs cannot do the checking in >> user-mode - the checking must be done in the kernel to figure out >> which user-mode client to hand the request to. >> >> I think putting in restrictions to the comparisons possible is fine, >> as the functionality of having the CM facilitate some sort of >> filtering is useful. > > My concern with pushing this to the ULP is that it requires > the ULP to track service IDs for reference counting purposes > and adds additional synchronization to the ULP that could have been > handled by the CM. > > I'm looking at what the full effect of implementing this in the ULP > would be. > > - Sean I'm still missing something. My understanding is that there are two scenarios where differentiating on the Destination IP address to subqualify a listen is required: 1) When virtualization is in effect and the device is shared by multiple kernels that are not aware of each other.
In this case a Destination Address (which could be the IP Address, or the Ethernet MAC, or the GID) determines which kernel is the destination for all packets, and which connections can be set up for which kernels. 2) When the daemon itself is virtualizing multiple instances of the same service, such as a virtual web or ftp sites. The same httpd/ftpd is reached in all cases, but the virtual root maps to a different root in the local file system (i.e., the instance data is different). I don't see how filtering in the CM is of benefit in either case. The work either belongs in the Hypervisor or in the Daemon, not the CM. From ftillier at silverstorm.com Fri Dec 2 12:22:47 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Fri, 2 Dec 2005 12:22:47 -0800 Subject: [openib-general] [PATCH] [CM] add private data comparison to match REQs with listens In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F10C27E4@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <001a01c5f77e$2a9c3f30$9e5aa8c0@infiniconsys.com> > From: Caitlin Bestler [mailto:caitlinb at broadcom.com] > Sent: Friday, December 02, 2005 12:13 PM > > Sean Hefty wrote: > > Fab Tillier wrote: > >>> Just listen on the Service ID / Port and let the ULP sort them out > >>> by destination IP address. > >> > >> That only works if there is a single kernel module providing the > >> extra checks. Multiple user-mode ULPs cannot do the checking in > >> user-mode - the checking must be done in the kernel to figure out > >> which user-mode client to hand the request to. > >> > >> I think putting in restrictions to the comparisons possible is fine, > >> as the functionality of having the CM facilitate some sort of > >> filtering is useful. > > > > My concern with pushing this to the ULP is that it requires > > the ULP to track service IDs for reference counting purposes > > and adds additional synchronization to the ULP that could have been > > handled by the CM. > > > > I'm looking at what the full effect of implementing this in the ULP > > would be. > > I'm still missing something. > > I don't see how filtering in the CM is of benefit in either case. The > work either belongs in the Hypervisor or in the Daemon, not the CM. Your focus is strictly on TCP socket semantics, but we're talking about IB CM functionality - the IB CM does more than just provide TCP socket semantics. Imagine a user-mode IB application (not virtualization mind you, but just an app) that wants to listen on a given SID (because the SID defines the application), but wants to discriminate incoming requests based on some content in the private data. Multiple instances of that application can only work properly if the CM performs the private data comparison to properly dispatch the incoming requests to the right user-mode process. If the CM doesn't provide the private data compare functionality, then the app developer needs to create a kernel agent to perform this functionality for the app. The functionality is simple enough, and has potential value to multiple clients, that it makes sense to have the IB CM provide it. 
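As an illustration of the usage Fab describes, and assuming the shape of the listen interface being proposed in this thread (the structure and argument names below are assumptions, not a settled API): two instances of the same application share a SID and are told apart by a 4-byte discriminator at the start of the REQ private data.

    /* sketch only: listen on an exact SID, matching the first 4 bytes
     * of REQ private data against this instance's discriminator */
    struct ib_cm_compare_data cmp;
    __be32 instance_id = cpu_to_be32(2);        /* app-defined value */

    memset(&cmp, 0, sizeof cmp);
    memcpy(cmp.data, &instance_id, sizeof instance_id);  /* value     */
    memset(cmp.mask, 0xFF, sizeof instance_id);          /* bytes to
                                                            compare   */
    ret = ib_cm_listen(cm_id, app_service_id, 0, &cmp);  /* 0 = exact
                                                            SID match */

The CM would then hand each REQ only to the instance whose discriminator matches, with no per-app kernel agent needed.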
- Fab From caitlinb at broadcom.com Fri Dec 2 12:27:41 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 2 Dec 2005 12:27:41 -0800 Subject: [openib-general] [PATCH] [CM] add private data comparison to match REQs with listens Message-ID: <54AD0F12E08D1541B826BE97C98F99F10C27E9@NT-SJCA-0751.brcm.ad.broadcom.com> Fab Tillier wrote: >> From: Caitlin Bestler [mailto:caitlinb at broadcom.com] >> Sent: Friday, December 02, 2005 12:13 PM >> >> Sean Hefty wrote: >>> Fab Tillier wrote: >>>>> Just listen on the Service ID / Port and let the ULP sort them out >>>>> by destination IP address. >>>> >>>> That only works if there is a single kernel module providing the >>>> extra checks. Multiple user-mode ULPs cannot do the checking in >>>> user-mode - the checking must be done in the kernel to figure out >>>> which user-mode client to hand the request to. >>>> >>>> I think putting in restrictions to the comparisons possible is >>>> fine, as the functionality of having the CM facilitate some sort of >>>> filtering is useful. >>> >>> My concern with pushing this to the ULP is that it requires the ULP >>> to track service IDs for reference counting purposes and adds >>> additional synchronization to the ULP that could have been handled >>> by the CM. >>> >>> I'm looking at what the full effect of implementing this in the ULP >>> would be. >> >> I'm still missing something. >> >> I don't see how filtering in the CM is of benefit in either case. The >> work either belongs in the Hypervisor or in the Daemon, not the CM. > > Your focus is strictly on TCP socket semantics, but we're > talking about IB CM functionality - the IB CM does more than > just provide TCP socket semantics. > > Imagine a user-mode IB application (not virtualization mind > you, but just an > app) that wants to listen on a given SID (because the SID > defines the application), but wants to discriminate incoming > requests based on some content in the private data. Multiple > instances of that application can only work properly if the > CM performs the private data comparison to properly dispatch > the incoming requests to the right user-mode process. > > If the CM doesn't provide the private data compare > functionality, then the app developer needs to create a > kernel agent to perform this functionality for the app. The > functionality is simple enough, and has potential value to > multiple clients, that it makes sense to have the IB CM provide it. > > - Fab You are proposing that the API be made more complex and you do not have any justification other than something some user-mode application *might* want to do. Why are these different user-mode applications sharing a Service ID in the first place? On what basis do they trust each other? How do they co-ordinate their filtering? Couldn't they use CM redirection to share the Service ID? The goal was supposed to be providing TCP-compatible connection setup, but this is describing something that is decidedly un-TCP-like. TCP applications differentiate within the daemon, or redirect connections. If they split connections based upon packet content it is only done by very sophisticated L7 load balancers that identify cookies or other HTTP content.
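To make the restricted-mask alternative from earlier in the thread concrete: if every service-ID mask is a run of most-significant bits, each listen covers one contiguous range of service IDs, so an ordinary ordered comparison still works for the red/black tree. A sketch, with hypothetical names:

    /* assumes listen_mask is of the form 0xFF..F00..0 (high bits only) */
    static int cm_sid_compare(uint64_t listen_id, uint64_t listen_mask,
                              uint64_t req_sid)
    {
        uint64_t prefix = req_sid & listen_mask;  /* request's prefix */

        if (prefix < listen_id)
            return -1;
        if (prefix > listen_id)
            return 1;
        return 0;   /* request falls within this listen's range */
    }

The open issue Sean mentions remains even under this restriction: when listens with different prefix lengths nest, the lookup has to decide which mask to apply before it knows which listen matched.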
From trimmer at silverstorm.com Fri Dec 2 13:22:19 2005 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Fri, 2 Dec 2005 16:22:19 -0500 Subject: [openib-general] [PATCH] [CM] add private data comparison to match REQs with listens Message-ID: <5D78D28F88822E4D8702BB9EEF1A43670A0902@mercury.infiniconsys.com> > -----Original Message----- > From: Caitlin Bestler [mailto:caitlinb at broadcom.com] > Sent: Friday, December 02, 2005 1:59 PM > To: Sean Hefty > Cc: openib-general at openib.org > Subject: RE: [openib-general] [PATCH] [CM] add private data > comparison to > match REQs with listens > > > openib-general-bounces at openib.org wrote: > > Sean Hefty wrote: > >> As an update: further testing revealed that there is an issue with > >> this implementation that is also found in the original code. The > >> issue deals with how listen requests that rely on a data mask are > >> inserted and located in the red/black tree. I'm trying to come up > >> with a fix for this. > > > > After researching into this, I'm coming to the conclusion > > that there does not exist an efficient way to sort/search for > > listens without adding some restrictions. > > > > For example, a client listens on id1 with mask1. A request > > is matched with the listen if its serviceid & mask1 = id1. > > If a second client listens on id2 with mask2, then a request > > must check against both requests for a match, or until a > > match is found. There's no method that I can find that can > > be used to filter checks that works in a generic fashion, > > resulting in requests needing to walk a linear list of > > listens. There are several potential fixes for this, with > > only a couple mentioned below. > > > > One solution around this is to have the IB CM only listen on > > service IDs, and remove the mask parameter from the API. > > This requires SDP to change to only listen on ports that have a > > listener. > > > > Another alternative is to restrict the type of masks that are > > supported. If masks are restricted to a series of most > > significant bits, then the existing algorithm can be used. > > For instance, we can support masks 0xFF00 and 0xFFF0, but not > > 0x00FF or 0xFF0F. This restriction would work for both SDP and the > > CMA. To be clear, the API could change from a mask to the > number of > > bits to match. > > > > Matching on private data can either be done by clients, or > > restrictions can be placed on it as well. For private data, > > I believe that a restriction that all listen requests on the > > same service ID use the same mask is sufficient. > > > > Hopefully this makes sense to people. Thoughts? > > > > Just listen on the Service ID / Port and let the ULP sort them > out by destination IP address. One approach is to make the sort criteria of the tree dependent on a comparison function. For example the sort could have a multi-faceted compare. We solved this problem in our stack (which allows listen by SID, sender GUID, receiver Port, private data, etc) by the following set of functions. These were called per red/black tree comparison (both inserts and searches used functions, potentially different). I realize these would not be used exactly as given, but they can provide some ideas on how to do it. ListenMap is the red/black tree our stack used to keep track of all listening CEPs in the system.
// ListenMap Key Compare functions
// Three functions are provided:
//
// CepListenAddrCompare - is used to insert cep entries into the ListenMap and
// is the primary key_compare function for the ListenMap
//
// CepReqAddrCompare - is used to search the ListenMap as part of processing
// an inbound REQ
//
// CepSidrReqAddrCompare - is used to search the ListenMap as part of
// processing an inbound SIDR_REQ
//
// To provide the maximum flexibility, the key for a CEP bound address is
// sophisticated and allows wildcarded/optional fields. This allows
// a listener to simply bind for all traffic of a given SID or to refine the
// scope by binding for traffic to/from specific addresses, or specific
// private data. The QPN/EECN/CaGUID aspect is used to allow multiple
// outbound Peer Connects to still be considered unique.
//
// The result of this approach is a very flexible CM bind. The same SID
// can be used on different ports or between different node pairs for
// completely different meanings. However a SID used between a given
// pair of nodes must be used for a single model (Listen, Peer, Sidr).
// In addition, for Peer connects, each connect must have a unique
// QPN/EECN/CaGUID.
//
// Comparison allows for wildcarding in all but SID.
// A value of 0 is a wildcard. See ib_helper.h:WildcardGidCompare for
// the rules of GID comparison, which are more involved due to multiple Gid
// formats
//
// Field is used by models as follows:
// Collating order is:                 Listen    Peer Connect  Sidr Register
// SID                                 Y         Y             Y
// local GID                           option    Y             future option
// local LID                           option    Y             future option
// QPN                                 wildcard  Y             wildcard
// EECN                                wildcard  Y             wildcard
// CaGUID                              wildcard  Y             wildcard
// remote GID                          option    Y             future option
// remote LID                          option    Y             future option
// private data discriminator length   option    option        option
// private data discriminator value    option    option        option
//
// if bPeer is 0 for either CEP, the QPN, EECN and CaGUID are treated as a match
//
// FUTURE: add a sid masking option so can easily listen on a group
// of SIDs with 1 listen (such as if low bits of sid have a private meaning)
//
// FUTURE: add a pkey option so can easily listen on a partition
//
// FUTURE: for SIDR to support GID/LID they will have to come from the LRH
// and GRH headers to the CM mad. local GID and lid could be used to merely
// select the local port number

// A qmap key_compare function to compare the bound address for
// two listener, SIDR or Peer Connect CEPs
//
// key1 - CEP1 pointer
// key2 - CEP2 pointer
//
// Returns:
// -1: cep1 bind address < cep2 bind address
//  0: cep1 bind address = cep2 bind address (accounting for wildcards)
//  1: cep1 bind address > cep2 bind address
int
CepListenAddrCompare(uint64 key1, uint64 key2)
{
	IN CM_CEP_OBJECT* pCEP1 = (CM_CEP_OBJECT*)(uintn)key1;
	IN CM_CEP_OBJECT* pCEP2 = (CM_CEP_OBJECT*)(uintn)key2;
	int res;

	if (pCEP1->SID < pCEP2->SID)
		return -1;
	else if (pCEP1->SID > pCEP2->SID)
		return 1;
	res = WildcardGidCompare(&pCEP1->PrimaryPath.LocalGID, &pCEP2->PrimaryPath.LocalGID);
	if (res != 0)
		return res;
	res = WildcardCompareU64(pCEP1->PrimaryPath.LocalLID, pCEP2->PrimaryPath.LocalLID);
	if (res != 0)
		return res;
	if (pCEP1->bPeer && pCEP2->bPeer)
	{
		res = CompareU64(pCEP1->LocalEndPoint.QPN, pCEP2->LocalEndPoint.QPN);
		if (res != 0)
			return res;
		res = CompareU64(pCEP1->LocalEndPoint.EECN, pCEP2->LocalEndPoint.EECN);
		if (res != 0)
			return res;
		res = CompareU64(pCEP1->LocalEndPoint.CaGUID, pCEP2->LocalEndPoint.CaGUID);
		if (res != 0)
			return res;
	}
	res = WildcardGidCompare(&pCEP1->PrimaryPath.RemoteGID, &pCEP2->PrimaryPath.RemoteGID);
	if (res != 0)
		return res;
	res = WildcardCompareU64(pCEP1->PrimaryPath.RemoteLID, pCEP2->PrimaryPath.RemoteLID);
	if (res != 0)
		return res;
	// a length of 0 matches any private data, so this too is a wildcard compare
	if (pCEP1->DiscriminatorLen == 0 || pCEP2->DiscriminatorLen == 0)
		return 0;
	res = CompareU64(pCEP1->DiscriminatorLen, pCEP2->DiscriminatorLen);
	if (res != 0)
		return res;
	res = MemoryCompare(pCEP1->Discriminator, pCEP2->Discriminator, pCEP1->DiscriminatorLen);
	return res;
}

// A qmap key_compare function to search the ListenMap for a match with
// a given REQ
//
// key1 - CEP pointer
// key2 - REQ pointer
//
// Returns:
// -1: cep1 bind address < req remote address
//  0: cep1 bind address = req remote address (accounting for wildcards)
//  1: cep1 bind address > req remote address
//
// The QPN/EECN/CaGUID are not part of the search, hence multiple Peer Connects
// could be matched (and one which was started earliest should be then linearly
// searched for among the neighbors of the matching CEP)
int
CepReqAddrCompare(uint64 key1, uint64 key2)
{
	IN CM_CEP_OBJECT* pCEP = (CM_CEP_OBJECT*)(uintn)key1;
	IN CMM_REQ* pREQ = (CMM_REQ*)(uintn)key2;
	int res;

	if (pCEP->SID < pREQ->ServiceID)
		return -1;
	else if (pCEP->SID > pREQ->ServiceID)
		return 1;
	// local and remote is from the perspective of the sender (the remote
	// node in this case), so we compare local to remote and vice versa
	res = WildcardGidCompare(&pCEP->PrimaryPath.LocalGID, &pREQ->PrimaryRemoteGID);
	if (res != 0)
		return res;
	res = WildcardCompareU64(pCEP->PrimaryPath.LocalLID, pREQ->PrimaryRemoteLID);
	if (res != 0)
		return res;
	// do not compare QPN/EECN/CaGUID
	res = WildcardGidCompare(&pCEP->PrimaryPath.RemoteGID, &pREQ->PrimaryLocalGID);
	if (res != 0)
		return res;
	res = WildcardCompareU64(pCEP->PrimaryPath.RemoteLID, pREQ->PrimaryLocalLID);
	if (res != 0)
		return res;
	// a length of 0 matches any private data, so this too is a wildcard compare
	if (pCEP->DiscriminatorLen == 0)
		return 0;
	res = MemoryCompare(pCEP->Discriminator,
			pREQ->PrivateData+pCEP->DiscrimPrivateDataOffset,
			pCEP->DiscriminatorLen);
	return res;
}

// A qmap key_compare function to search the ListenMap for a match with
// a given SIDR_REQ
//
// key1 - CEP pointer
// key2 - SIDR_REQ pointer
//
// Returns:
// -1: cep bind address < sidr_req address
//  0: cep bind address = sidr_req address (accounting for wildcards)
//  1: cep bind address > sidr_req address
//
// The QPN/EECN/CaGUID are not part of the search.
int
CepSidrReqAddrCompare(uint64 key1, uint64 key2)
{
	IN CM_CEP_OBJECT* pCEP = (CM_CEP_OBJECT*)(uintn)key1;
	IN CMM_SIDR_REQ* pSIDR_REQ = (CMM_SIDR_REQ*)(uintn)key2;
	int res;

	if (pCEP->SID < pSIDR_REQ->ServiceID)
		return -1;
	else if (pCEP->SID > pSIDR_REQ->ServiceID)
		return 1;
	// GID and LIDs are wildcarded/not available at this time
	// do not compare QPN/EECN/CaGUID
	// a length of 0 matches any private data, so this too is a wildcard compare
	if (pCEP->DiscriminatorLen == 0)
		return 0;
	res = MemoryCompare(pCEP->Discriminator,
			pSIDR_REQ->PrivateData+pCEP->DiscrimPrivateDataOffset,
			pCEP->DiscriminatorLen);
	return res;
}

/* non-Wildcarded compare of 2 64 bit values
 * Return:
 *  0 : v1 == v2
 * -1 : v1 < v2
 *  1 : v1 > v2
 */
static __inline int
CompareU64(uint64 v1, uint64 v2)
{
	if (v1 == v2)
		return 0;
	else if (v1 < v2)
		return -1;
	else
		return 1;
}

/* Wildcarded compare of 2 64 bit values
 * Return:
 *  0 : v1 == v2
 * -1 : v1 < v2
 *  1 : v1 > v2
 * if v1 or v2 is 0, they are considered wildcards and match any value
 */
static __inline int
WildcardCompareU64(uint64 v1, uint64 v2)
{
	if (v1 == 0 || v2 == 0 || v1 == v2)
		return 0;
	else if (v1 < v2)
		return -1;
	else
		return 1;
}

/* Compare Gid1 to Gid2 (host byte order)
 * Return:
 *  0 : Gid1 == Gid2
 * -1 : Gid1 < Gid2
 *  1 : Gid1 > Gid2
 * This also allows for Wildcarded compare.
 * A MC Gid with the lower 56 bits all 0, will match any MC gid
 * A SubnetPrefix of 0 will match any top 64 bits of a non-MC gid
 * A InterfaceID of 0 will match any low 64 bits of a non-MC gid
 * Collating order:
 *  non-MC Subnet Prefix (0 is wildcard and comes first)
 *  non-MC Interface ID (0 is wildcard and comes first)
 *  MC wildcard
 *  MC by value of low 56 bits (0 is wildcard and comes first)
 */
static __inline int
WildcardGidCompare(IN const IB_GID* const pGid1, IN const IB_GID* const pGid2 )
{
	if (pGid1->Type.Multicast.s.FormatPrefix == IPV6_MULTICAST_PREFIX
		&& pGid2->Type.Multicast.s.FormatPrefix == IPV6_MULTICAST_PREFIX)
	{
		/* Multicast compare: compare low 120 bits, 120 bits of 0 is wildcard */
		uint64 h1 = pGid1->AsReg64s.H & ~IB_GID_MCAST_FORMAT_MASK_H;
		uint64 h2 = pGid2->AsReg64s.H & ~IB_GID_MCAST_FORMAT_MASK_H;
		/* check for 120 bits of wildcard */
		if ((h1 == 0 && pGid1->AsReg64s.L == 0)
			|| (h2 == 0 && pGid2->AsReg64s.L == 0))
		{
			return 0;
		} else if (h1 < h2) {
			return -1;
		} else if (h1 > h2) {
			return 1;
		} else {
			/* compare Gid1's low bits against Gid2's, not against Gid1's own */
			return CompareU64(pGid1->AsReg64s.L, pGid2->AsReg64s.L);
		}
	} else if (pGid1->Type.Multicast.s.FormatPrefix == IPV6_MULTICAST_PREFIX) {
		/* Gid1 is MC, Gid2 is other, treat MC as > others */
		return 1;
	} else if (pGid2->Type.Multicast.s.FormatPrefix == IPV6_MULTICAST_PREFIX) {
		/* Gid1 is other, Gid2 is MC, treat other as < MC */
		return -1;
	} else {
		/* Non-Multicast compare: compare high 64 bits */
		/* Note all other GID formats are essentially a prefix in the upper */
		/* 64 bits and an identifier in the low 64 bits, */
		/* so this covers link local, site local, global formats */
		int res = WildcardCompareU64(pGid1->AsReg64s.H, pGid2->AsReg64s.H);
		if (res == 0)
		{
			return WildcardCompareU64(pGid1->AsReg64s.L, pGid2->AsReg64s.L);
		} else {
			return res;
		}
	}
}

From tom at opengridcomputing.com Fri Dec 2 14:14:10 2005 From: tom at opengridcomputing.com (Tom Tucker) Date: Fri, 02 Dec 2005 16:14:10 -0600 Subject: [openib-general] [PATCH] [CM] add private data comparisonto match REQs
with listens In-Reply-To: <5D78D28F88822E4D8702BB9EEF1A43670A0902@mercury.infiniconsys.com> References: <5D78D28F88822E4D8702BB9EEF1A43670A0902@mercury.infiniconsys.com> Message-ID: <1133561650.21815.124.camel@trinity.austin.ammasso.com> Am I correct to assume that this functionality is unique to the IB CM and is not going to be exposed through the CMA? On Fri, 2005-12-02 at 16:22 -0500, Rimmer, Todd wrote: > > -----Original Message----- > > From: Caitlin Bestler [mailto:caitlinb at broadcom.com] > > Sent: Friday, December 02, 2005 1:59 PM > > To: Sean Hefty > > Cc: openib-general at openib.org > > Subject: RE: [openib-general] [PATCH] [CM] add private data > > comparisonto > > match REQs with listens > > > > > > openib-general-bounces at openib.org wrote: > > > Sean Hefty wrote: > > >> As an update: further testing revealed that there is an issue with > > >> this implementation that is also found in the original code. The > > >> issue deals with how listen requests that rely on a data mask are > > >> inserted and located in the red/black tree. I'm trying to come up > > >> with a fix for this. > > > > > > After researching into this, I'm coming to the conclusion > > > that there does not exist an efficient way to sort/search for > > > listens without adding some restrictions. > > > > > > For example, a client listens on id1 with mask1. A request > > > is matched with the listen if its serviceid & mask1 = id1. > > > If a second client listens on id2 with mask2, then a request > > > must check against both requests for a match, or until a > > > match is found. There's no method that I can find that can > > > be used to filter checks that works in a generic fashion, > > > resulting in requests needing to walk a linear list of > > > listens. There are several potential fixes for this, with > > > only a couple mentioned below. > > > > > > One solution around this is to have the IB CM only listen on > > > service IDs, and remove the mask parameter from the API. > > > This requires SDP to change to only listen on ports that have a > > > listener. > > > > > > Another alternative is to restrict the type of masks that are > > > supported. If masks are restricted to a series of most > > > significant bits, then the existing algorithm can be used. > > > For instance, we can support masks 0xFF00 and 0xFFF0, but not > > > 0x00FF or 0xFF0F. This restriction would work for both SDP and the > > > CMA. To be clear, the API could change from a mask to the > > number of > > > bits to match. > > > > > > Matching on private data can either be done by clients, or > > > restrictions can be placed on it as well. For private data, > > > I believe that a restriction that all listen requests on the > > > same service ID use the same mask is sufficient. > > > > > > Hopefully this makes sense to people. Thoughts? > > > > > > > Just listen on the Service ID / Port and let the ULP sort them > > out by destination IP address. > > On approach is to make the sort criteria of the tree dependent on a comparison function. > > For example the sort could have a multi-faceted compare. > > We solved this problem in our stack (which allows listen by SID, sender GUID, receiver Port, private data, etc) by the following set of functions. These were called per red/black tree comparison (both inserts and searches used functions, potentially different). I realize these would not be used exactly as given, but they can provide some ideas on how to do it. 
ListenMap is the red/black tree our stack used to keep track of all listening CEPs in the system.
> [Todd's CepListenAddrCompare/CepReqAddrCompare/CepSidrReqAddrCompare code was quoted here in full; snipped -- see the complete listing in Todd's message above.]
From ftillier at silverstorm.com Fri Dec 2 14:21:01 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Fri, 2 Dec 2005 14:21:01 -0800 Subject: [openib-general] [PATCH] [CM] add private data comparison to match REQs with listens In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F10C27E4@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <001b01c5f78e$af5706a0$9e5aa8c0@infiniconsys.com> > From: Caitlin Bestler [mailto:caitlinb at broadcom.com] > Sent: Friday, December 02, 2005 12:13 PM > > Sean Hefty wrote: > > Fab Tillier wrote: > >>> Just listen on the Service ID / Port and let the ULP sort them out > >>> by destination IP address. > >> > >> That only works if there is a single kernel module providing the > >> extra checks. Multiple user-mode ULPs cannot do the checking in > >> user-mode - the checking must be done in the kernel to figure out > >> which user-mode client to hand the request to. > >> > >> I think putting in restrictions to the comparisons possible is > >> fine, as the functionality of having the CM facilitate some sort of > >> filtering is useful. > > > > My concern with pushing this to the ULP is that it requires the ULP > > to track service IDs for reference counting purposes and adds > > additional synchronization to the ULP that could have been handled > > by the CM. > > > > I'm looking at what the full effect of implementing this in the ULP > > would be.
> > I'm still missing something. > > I don't see how filtering in the CM is of benefit in either case. The > work either belongs in the Hypervisor or in the Daemon, not the CM. Your focus is strictly on TCP socket semantics, but we're talking about IB CM functionality - the IB CM does more than just provide TCP socket semantics. Imagine a user-mode IB application (not virtualization mind you, but just an app) that wants to listen on a given SID (because the SID defines the application), but wants to discriminate incoming requests based on some content in the private data. Multiple instances of that application can only work properly if the CM performs the private data comparison to properly dispatch the incoming requests to the right user-mode process. If the CM doesn't provide the private data compare functionality, then the app developer needs to create a kernel agent to perform this functionality for the app. The functionality is simple enough, and has potential value to multiple clients, that it makes sense to have the IB CM provide it. - Fab From ftillier at silverstorm.com Fri Dec 2 14:37:29 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Fri, 2 Dec 2005 14:37:29 -0800 Subject: [openib-general] [PATCH] [CM] add private data comparison to match REQs with listens In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F10C27E9@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <001c01c5f790$fbba2980$9e5aa8c0@infiniconsys.com> > From: Caitlin Bestler [mailto:caitlinb at broadcom.com] > Sent: Friday, December 02, 2005 12:28 PM > > Fab Tillier wrote: > > > > Your focus is strictly on TCP socket semantics, but we're > > talking about IB CM functionality - the IB CM does more than > > just provide TCP socket semantics. > > > > Imagine a user-mode IB application (not virtualization mind > > you, but just an > > app) that wants to listen on a given SID (because the SID > > defines the application), but wants to discriminate incoming > > requests based on some content in the private data. Multiple > > instances of that application can only work properly if the > > CM performs the private data comparison to properly dispatch > > the incoming requests to the right user-mode process. > > > > If the CM doesn't provide the private data compare > > functionality, then the app developer needs to create a > > kernel agent to perform this functionality for the app. The > > functionality is simple enough, and has potential value to > > multiple clients, that it makes sense to have the IB CM provide it. > > You are proposing that the API be made more complex and > you do not have any justification other that something > some user-mode application *might* want to do. In Windows, the Winsock Direct provider does exactly this, and would require a kernel component if the IB CM wasn't providing this functionality. WSD uses the private data to carry the IP address of the client, but uses its own private data format. I believe some native-IB MPI implementations make use of similar functionality, using the rank of the process in the private data. This allows such implementations to limit the size of their SID range to a single value or a single value per job. > Why are these different user-mode applications sharing > a Service ID in the first place? On what basis do they > trust each other? How do they co-ordinate their filtering? > Couldn't they use CM redirection to share the Service ID? The world is larger than just TCP-compatible applications. 
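For the WSD and MPI uses Fab mentions above, the demultiplexing key sits in the leading bytes of the REQ private data. A hypothetical layout for the MPI case -- not taken from WSD's (private) format or from any MPI implementation:

#include <stdint.h>

/* Hypothetical REQ private data for an MPI job: one SID identifies
 * the job, the leading bytes carry the destination rank. A rank's
 * listen would then use a discriminator of offset 0, length 4,
 * value = htonl(my_rank), and the CM hands each REQ only to the
 * process whose rank matches. */
struct mpi_pdata {
	uint32_t dst_rank;     /* target rank, network byte order */
	uint8_t  app_data[88]; /* remainder of the 92 private data bytes */
};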
I'm not talking about two applications sharing a SID, but two instances of one application sharing a SID. Imagine processes in a larger MPI job - the SID can be used to differentiate jobs, and the private data comparison can be used to differentiate different processes within that job. Alternatively, the SID could be constant, and the job ID and rank could be expressed in the private data, with the IB-level CM performing all the proper dispatching. I don't think CM redirection would work since both apps are on the same system, and share the same CM. There can only be a single connection ID namespace per HCA GUID or things quickly become ambiguous. > The goal was supposed to be providing TCP-compatible > connection setup, but this is describing something that > is decidedly un-TCP-like. TCP applications differentiate > within the daemon, or redirect connections. If they split > connections based upon packet content it is only done by > very sophisticated L7 load balancers that identify cookies > or other HTTP content. The goal of the CMA *is* to support TCP-compatible semantics, but that is not the goal of the IB CM. The IB CM already keeps track of listens and performs lookups when a REQ comes in based on service ID. Extending it to do some fairly basic extra checking is far simpler than adding duplicate lookup functionality to the CMA. This allows the IB CM to do all the filtering at once as part of REQ matching, and thus simplifies the CMA. It also allows user-mode apps to use similar functionality without requiring a kernel agent. Anyhow, do you have an objection to the CM enabling simple comparisons on private data? If so, what are your objections (aside from it not being TCP-like)? Thanks, - Fab From ftillier at silverstorm.com Fri Dec 2 14:37:29 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Fri, 2 Dec 2005 14:37:29 -0800 Subject: [openib-general] [PATCH] [CM] add private data comparison to match REQs with listens In-Reply-To: <1133561650.21815.124.camel@trinity.austin.ammasso.com> Message-ID: <001d01c5f790$fd0a5030$9e5aa8c0@infiniconsys.com> > From: Tom Tucker [mailto:tom at opengridcomputing.com] > Sent: Friday, December 02, 2005 2:14 PM > > Am I correct to assume that this functionality is unique to the IB CM > and is not going to be exposed through the CMA? My understanding is that the CMA would make use of that functionality, but it would not be exposed to users of the CMA. - Fab From trimmer at silverstorm.com Fri Dec 2 14:54:05 2005 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Fri, 2 Dec 2005 17:54:05 -0500 Subject: [openib-general] [PATCH] [CM] add private data comparison tomatch REQs with listens Message-ID: <5D78D28F88822E4D8702BB9EEF1A43670A0903@mercury.infiniconsys.com> > -----Original Message----- > From: Tillier, Fabian > Sent: Friday, December 02, 2005 5:21 PM > To: 'Caitlin Bestler'; 'Sean Hefty' > Cc: openib-general at openib.org > Subject: RE: [openib-general] [PATCH] [CM] add private data comparison > tomatch REQs with listens > > > > From: Caitlin Bestler [mailto:caitlinb at broadcom.com] > > Sent: Friday, December 02, 2005 12:13 PM > > > > Sean Hefty wrote: > > > Fab Tillier wrote: > > >>> Just listen on the Service ID / Port and let the ULP > sort them out > > >>> by destination IP address. > > >> > > >> That only works if there is a single kernel module providing the > > >> extra checks. 
Multiple user-mode ULPs cannot do the checking in > > >> user-mode - the checking must be done in the kernel to figure out > > >> which user-mode client to hand the request to. > > >> > > >> I think putting in restrictions to the comparisons > possible is fine, > > >> as the functionality of having the CM facilitate some sort of > > >> filtering is useful. > > > > > > My concern with pushing this to the ULP is that it requires > > > the ULP to track service IDs for reference counting purposes > > > and adds additional synchronization to the ULP that could > have been > > > handled by the CM. > > > > > > I'm looking at what the full effect of implementing this > in the ULP > > > would be. > > > > I'm still missing something. > > > > I don't see how filtering in the CM is of benefit in either > case. The > > work either belongs in the Hypervisor or in the Daemon, not the CM. > > Your focus is strictly on TCP socket semantics, but we're > talking about IB CM > functionality - the IB CM does more than just provide TCP > socket semantics. > > Imagine a user-mode IB application (not virtualization mind > you, but just an > app) that wants to listen on a given SID (because the SID defines the > application), but wants to discriminate incoming requests > based on some content > in the private data. Multiple instances of that application > can only work > properly if the CM performs the private data comparison to > properly dispatch the > incoming requests to the right user-mode process. > > If the CM doesn't provide the private data compare > functionality, then the app > developer needs to create a kernel agent to perform this > functionality for the > app. The functionality is simple enough, and has potential > value to multiple > clients, that it makes sense to have the IB CM provide it. > > - Fab I agree; to give you a good practical example: MPI needs to listen for incoming connections. It is wasteful to have MPI create separate SIDs for each rank (especially when there can be thousands of ranks in many jobs all running in the same cluster, parts of which run on the same node) and then listen on 1000s of SIDs in each process. Instead it makes sense to use a single SID for the entire job (possibly using the global Job ID as part of the SID), and have the private data of the REQ indicate the destination rank of the request. Then each rank in the MPI job can listen for the combination of the global Job ID's SID and private data where the destination rank matches itself (using 1 listening CEP per process) and let the CM filter by both criteria and deliver the REQs to the appropriate processes. The above scheme works very well and minimizes CM resource use for large MPI jobs. I'm sure other interesting and useful examples can be found as well.
Todd Rimmer From caitlinb at broadcom.com Fri Dec 2 15:01:47 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 2 Dec 2005 15:01:47 -0800 Subject: [openib-general] [PATCH] [CM] add private data comparison tomatch REQs with listens Message-ID: <54AD0F12E08D1541B826BE97C98F99F10C2810@NT-SJCA-0751.brcm.ad.broadcom.com> Rimmer, Todd wrote: >> -----Original Message----- >> From: Tillier, Fabian >> Sent: Friday, December 02, 2005 5:21 PM >> To: 'Caitlin Bestler'; 'Sean Hefty' >> Cc: openib-general at openib.org >> Subject: RE: [openib-general] [PATCH] [CM] add private data >> comparison tomatch REQs with listens >> >> >>> From: Caitlin Bestler [mailto:caitlinb at broadcom.com] >>> Sent: Friday, December 02, 2005 12:13 PM >>> >>> Sean Hefty wrote: >>>> Fab Tillier wrote: >>>>>> Just listen on the Service ID / Port and let the ULP sort them >>>>>> out by destination IP address. >>>>> >>>>> That only works if there is a single kernel module providing the >>>>> extra checks. Multiple user-mode ULPs cannot do the checking in >>>>> user-mode - the checking must be done in the kernel to figure out >>>>> which user-mode client to hand the request to. >>>>> >>>>> I think putting in restrictions to the comparisons possible is >>>>> fine, as the functionality of having the CM facilitate some sort >>>>> of filtering is useful. >>>> >>>> My concern with pushing this to the ULP is that it requires the >>>> ULP to track service IDs for reference counting purposes and adds >>>> additional synchronization to the ULP that could have been handled >>>> by the CM. >>>> >>>> I'm looking at what the full effect of implementing this in the ULP >>>> would be. >>> >>> I'm still missing something. >>> >>> I don't see how filtering in the CM is of benefit in either case. >>> The work either belongs in the Hypervisor or in the Daemon, not the >>> CM. >> >> Your focus is strictly on TCP socket semantics, but we're talking >> about IB CM functionality - the IB CM does more than just provide >> TCP socket semantics. >> >> Imagine a user-mode IB application (not virtualization mind you, but >> just an app) that wants to listen on a given SID (because the SID >> defines the application), but wants to discriminate incoming >> requests based on some content in the private data. Multiple >> instances of that application can only work properly if the CM >> performs the private data comparison to properly dispatch the >> incoming requests to the right user-mode process. >> >> If the CM doesn't provide the private data compare functionality, >> then the app developer needs to create a kernel agent to perform this >> functionality for the app. The functionality is simple enough, and >> has potential value to multiple clients, that it makes sense to have >> the IB CM provide it. >> >> - Fab > > I agree, to give you a good practical example, MPI needs to > listen for incoming connections. > > It is wasteful to have MPI create separate SIDs for each rank > (especially when there can be thousands of ranks in many jobs > all running in the same cluster parts of which on the same > node) and then listen on 1000s of SIDs in each process. > > Instead it makes sense to use a single SID for the entire job > (possibly using the global Job ID as part of the SID), and > have the private data of the REQ indicate the destination > rank of the request. 
Then each rank in the MPI job can > listen for the combination of the global Job ID's SID and > private data where the destination rank matches itself (using > 1 listening CEP per process) and let the CM filter by both > criteria and deliver the REQs to the appropriate processes. > > The above scheme works very well and minimizes CM resource > use for large MPI jobs. > > I'm sure other interesting and useful examples can be found as well. > MPI works over plain TCP right now, and yet there is no such feature in INETD or in current socket listens. And they do not allocate a TCP Port to listen for each connection. Rather, the same listen just accepts each connection and either creates the process or passes the handle to a process. There are many reasons why an established RDMA connection cannot be passed between processes, but I know of no reason why a Connection Request cannot be passed to a child or third process where it can be accepted. Why not emulate the existing solution rather than creating a new interface that is transport specific? Or conversely, if you truly think this is of general utility, why not implement it in INETD as well? From trimmer at silverstorm.com Fri Dec 2 15:11:59 2005 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Fri, 2 Dec 2005 18:11:59 -0500 Subject: [openib-general] [PATCH] [CM] add private data comparisonto match REQs with listens Message-ID: <5D78D28F88822E4D8702BB9EEF1A4367D12AE2@mercury.infiniconsys.com> Sean wrote: > This is similar in concept to what I have in my latest patch. > A difference is > that your discriminator is located at the start of the > private data, whereas I > was trying to use a mask. Actually, our discriminator is an offset/len into the private data; adding an optional mask to that concept would be useful. This allowed a contiguous portion of the private data to be tested, but it did not need to occur at the start of the private data. > > Did you find a use for listening on the sender GUID? We have not yet used that feature, but having the ability to key off of all the assorted addressing info in the REQ seemed sensible. I'm sure some applications can come up with a use, perhaps as a security feature? Todd Rimmer From ftillier at silverstorm.com Fri Dec 2 15:17:10 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Fri, 2 Dec 2005 15:17:10 -0800 Subject: [openib-general] [PATCH] [CM] add private data comparison tomatch REQs with listens In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F10C2810@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <001e01c5f796$86f22b60$9e5aa8c0@infiniconsys.com> > From: Caitlin Bestler [mailto:caitlinb at broadcom.com] > Sent: Friday, December 02, 2005 3:02 PM > > There are many reasons why an established RDMA connection > cannot be passed between processes, but I know of no > reason why a Connection Request cannot be passed to a child > or third process where it can be accepted. > > Why not emulate the existing solution rather than creating > a new interface that is transport specific? Allowing a connection request to come in on one CID (which is associated with the listening process) and letting that connection be accepted by a different process requires making changes to the user-mode CM infrastructure to allow CIDs to be migrated safely between processes. This is very likely to be more difficult than adding private data comparison to the IB CM. This is all under the covers for socket applications.
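Todd's offset/len discriminator, extended with the optional mask he says would be useful, can be sketched as below. The names and signature are hypothetical -- this is not code from the SilverStorm stack or from Sean's patch:

#include <stdint.h>

/* Match a discriminator at (offset, len) into the REQ private data.
 * mask == NULL compares all bits; len == 0 is a wildcard that
 * matches any private data. */
static int discrim_match(const uint8_t *pdata, unsigned offset,
                         unsigned len, const uint8_t *value,
                         const uint8_t *mask)
{
	unsigned i;

	for (i = 0; i < len; i++) {
		uint8_t m = mask ? mask[i] : 0xff;
		if ((pdata[offset + i] & m) != (value[i] & m))
			return 0;
	}
	return 1;
}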
It avoids the need for the CMA to keep an efficiently searchable tree of listen requests to perform private data comparison when the IB CM already does 90% of the work. To sum up, it is simpler to add the private data compare functionality to the IB CM than to add it to every client that wants it. The changes required don't complicate the API significantly, certainly within the grasp of someone interfacing to verbs. I know this from experience because I've done it before. > Or conversely, if you truly think this is of general utility, > why not implement it in INETD as well? I wasn't making the case that it has general utility, just that it has utility within the realm of IB connection management. Someone else is welcome to expand the scope if they see fit, but that's not what I'm advocating. - Fab From caitlinb at broadcom.com Fri Dec 2 15:25:48 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 2 Dec 2005 15:25:48 -0800 Subject: [openib-general] [PATCH] [CM] add private data comparison tomatch REQs with listens Message-ID: <54AD0F12E08D1541B826BE97C98F99F10C281B@NT-SJCA-0751.brcm.ad.broadcom.com> Fab Tillier wrote: > > To sum up, it is simpler to add the private data compare > functionality to the IB CM than to add it to every client > that wants it. The changes required don't complicate the API > significantly, certainly within the grasp of someone > interfacing to verbs. I know this from experience because > I've done it before. > >> Or conversely, if you truly think this is of general utility, why not >> implement it in INETD as well? > > I wasn't making the case that it has general utility, just > that it has utility within the realm of IB connection > management. Someone else is welcome to expand the scope if > they see fit, but that's not what I'm advocating. > But if your justification is MPI Ranks then you have already exceeded the scope "IB connection management". There is an *existing* solution on how the remote end establishes multiple connections to the same service but with different instances. That solution has been around for a very long time, literally decades. Needing to restructure your code slightly to preserve an existing interface that has been around that long does not seem inappropriate. Are you claiming that there is something in the definition of the protocol that *requires* IB to handle this differently than other networks do? The only IB specific issue that I can think of is that IB actually can afford to waste Service IDs more than IP can afford to waste TCP Ports.
> > > > I wasn't making the case that it has general utility, just > > that it has utility within the realm of IB connection > > management. Someone else is welcome to expand the scope if > > they see fit, but that's not what I'm advocating. > > > > But if your justification is MPI Ranks then you have already > exceed the scope "IB connection management". Why shouldn't an MPI implementation that interfaces directly to IB verbs use the IB CM functionality? Why should it restrict itself to TCP connection semantics when IB can provide it with something richer? > There is an *existing* solution on how the remote end > establishes multiple connections to the same service > but with different instances. > > That solution has been around for a very long time, > literally decades. Needing to restructure your code > slightly to preserve an existing interface that has > been around that long does not seem inapropriate. The IB CM is not an existing interface. I'M TALKING ABOUT I-N-F-I-N-I-B-A-N-D. INFINIBAND. Not IP, not TCP, not sockets, not iWarp. Show me the IB CM API that has existed for decades. > Are you claiming that there is something in the > definition of the protocol that *requires* IB to > handle this differently than other networks do? IB listen semantics are different from socket listen semantics. Again, IB is not Ethernet, not iWarp, not IP, not TCP. This is an important point that I feel you keep missing. - Fab From caitlinb at broadcom.com Fri Dec 2 16:06:28 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 2 Dec 2005 16:06:28 -0800 Subject: [openib-general] [PATCH] [CM] add private data comparison tomatch REQs with listens Message-ID: <54AD0F12E08D1541B826BE97C98F99F10C2828@NT-SJCA-0751.brcm.ad.broadcom.com> ftillier.sst at gmail.com wrote: > On 12/2/05, Caitlin Bestler wrote: >> Fab Tillier wrote: >> >>> To sum up, it is simpler to add the private data compare >>> functionality to the IB CM than to add it to every client that wants >>> it. The changes required don't complicate the API significantly, >>> certainly within the grasp of someone interfacing to verbs. I know >>> this from experience because I've done it before. >>> >>>> Or conversely, if you truly think this is of general utility, why >>>> not implement it in INETD as well? >>> >>> I wasn't making the case that it has general utility, just that it >>> has utility within the realm of IB connection management. > Someone >>> else is welcome to expand the scope if they see fit, but that's not >>> what I'm advocating. >>> >> >> But if your justification is MPI Ranks then you have already exceed >> the scope "IB connection management". > > Why shouldn't an MPI implementation that interfaces directly > to IB verbs use the IB CM functionality? Why should it > restrict itself to TCP connection semantics when IB can > provide it with something richer? > >> There is an *existing* solution on how the remote end establishes >> multiple connections to the same service but with different >> instances. >> >> That solution has been around for a very long time, literally >> decades. Needing to restructure your code slightly to preserve an >> existing interface that has been around that long does not seem >> inapropriate. > > The IB CM is not an existing interface. I'M TALKING ABOUT > I-N-F-I-N-I-B-A-N-D. INFINIBAND. Not IP, not TCP, not > sockets, not iWarp. Show me the IB CM API that has existed for > decades. 
> >> Are you claiming that there is something in the definition of the >> protocol that *requires* IB to handle this differently than other >> networks do? > > IB listen semantics are different from socket listen semantics. > Again, IB is not Ethernet, not iWarp, not IP, not TCP. This > is an important point that I feel you keep missing. > > - Fab Socket listen semantics have nothing to do with Ethernet. They are Unix/POSIX. In fact, a major point of socket semantics is that they worked over multiple networks. Sockets are part of the problem when it comes to transferring data once a connection is established, which is why we have QPs and CQs. But there is a very simple transport neutral definition of passive side connection setup. The server issues a listen. The server receives connection requests. The server can optionally hand off the connection request, accept it or reject it. That model is a natural extension of both TCP connection setup and the InfiniBand CM. It allows the server to deal with destination multiplexing. DAPL and IT-API both already work this way. Are you opposed to transport neutral connection establishment? From ftillier at silverstorm.com Fri Dec 2 16:57:37 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Fri, 2 Dec 2005 16:57:37 -0800 Subject: [openib-general] [PATCH] [CM] add private data comparison tomatch REQs with listens In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F10C2828@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <001f01c5f7a4$8e9c6ac0$9e5aa8c0@infiniconsys.com> > From: Caitlin Bestler [mailto:caitlinb at broadcom.com] > Sent: Friday, December 02, 2005 4:06 PM > > Socket listen semantics have nothing to do with Ethernet. > They are Unix/POSIX. In fact, a major point of socket semantics > is that they worked over multiple networks. The IB CM doesn't provide socket semantics. Period, end of story. Providing socket semantics is higher level functionality (the CMA), and outside the scope of the IB CM and this email thread. > Sockets are part of the problem when it comes to transferring > data once a connection is established, which is why we have > QPs and CQs. Irrelevant. > But there is a very simple transport neutral definition of > passive side connection setup. The server issues a listen. > The server receives connection requests. The server can > optionally hand off the connection request, accept it > or reject it. There is no notion of per-request handoff in IB - you either accept or reject - that's it. The reject can cause a redirect, but that requires a new connection request from the client. > That model is a natural extension of both TCP connection > setup and the InfiniBand CM. How does the IB CM protocol support hand off? > It allows the server to deal > with destination multiplexing. DAPL and IT-API both already > work this way. > > Are you opposed to transport neutral connection establishment? I don't give a hoot about transport neutral connection establishment, DAPL, or IT-API in the scope of this email thread. They just aren't relevant whatsoever. This thread is about adding private data comparison functionality to the IB CM. The IB CM is the module to which the CMA interfaces. The CMA is a separate module providing higher level functionality, and is designed to provide transport neutral connection establishment, specifically IP addressing over IB. As Sean originally stated in the mail that started this thread, the CMA will make use of the private data comparison functionality.
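(Concretely, such CMA use might look like the sketch below; the header layout is hypothetical, since the thread does not give the CMA's actual private data format.)

#include <stdint.h>

/* Hypothetical CMA-style REQ private data header. With CM-side
 * comparison, rdma_listen() on (port, addr) could become one CM
 * listen on the port's service ID with discriminator offset =
 * offsetof(struct cma_pdata_hdr, dst_addr), len = 4, value = addr. */
struct cma_pdata_hdr {
	uint8_t  ip_version;  /* 4 or 6 */
	uint8_t  reserved[3];
	uint32_t src_addr;    /* IPv4 source, network byte order */
	uint32_t dst_addr;    /* IPv4 destination, network byte order */
};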
Adding this functionality to the IB CM is simpler than implementing it in the CMA, while at the same time providing additional flexibility to future users of the IB CM that wish to have similar functionality. - Fab From ianjiang.ict at gmail.com Sat Dec 3 09:45:52 2005 From: ianjiang.ict at gmail.com (Ian Jiang) Date: Sun, 4 Dec 2005 01:45:52 +0800 Subject: [openib-general] [kDAPL]How to register a vmalloc() allocated buffer Message-ID: <7b2fa1820512030945j22e205d9j86a3b8e7bd709182@mail.gmail.com> I am doing a simple rdma-read test using kDAPL. My test is running in kernel mode. When I allocate both the data source and sink buffers using kmalloc() and register the buffers using dat_lmr_kcreate() with memory type DAT_MEM_TYPE_PHYSICAL, everything goes well. If the sink buffer is allocated with vmalloc() and registered as before, no registration error or rdma read DTO completion error occurs, but ... My questions: 1) Could a buffer allocated with vmalloc() be used for a kDAPL rdma reading? If so, 2) should a buffer of this kind be registered in the same way as a buffer allocated with kmalloc()? Could anyone give some suggestions? Thanks very much! -- Ian Jiang ianjiang.ict at gmail.com Laboratory of Spatial Information Technology Division of System Architecture Institute of Computing Technology Chinese Academy of Sciences From yael at mellanox.co.il Sun Dec 4 01:44:10 2005 From: yael at mellanox.co.il (Yael Kalka) Date: 04 Dec 2005 11:44:10 +0200 Subject: [openib-general] [PATCH] Opensm - fix segfault on exit Message-ID: <5z8xv1xklx.fsf@mtl066.yok.mtl.com> Hi Hal, If the driver isn't loaded, opensm exits with a segfault. This is because it tries to destroy the signal event in the osm_vendor but, due to the failure, this event was never created. The following patch fixes this. Thanks, Yael Signed-off-by: Yael Kalka

Index: libvendor/osm_vendor_ibumad.c
===================================================================
--- libvendor/osm_vendor_ibumad.c	(revision 4281)
+++ libvendor/osm_vendor_ibumad.c	(working copy)
@@ -552,6 +552,7 @@ osm_vendor_delete(
   /* umad receiver thread ? */
   p_ur = (*pp_vend)->receiver;
+  if (&p_ur->signal) cl_event_destroy( &p_ur->signal );
   cl_spinlock_destroy( &(*pp_vend)->cb_lock );
   cl_spinlock_destroy( &(*pp_vend)->match_tbl_lock );

From ogerlitz at voltaire.com Sun Dec 4 03:35:38 2005 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 04 Dec 2005 13:35:38 +0200 Subject: [openib-general] Re: spinlock wrong CPU on CPU#1, ib_addr In-Reply-To: <438F40B0.4010000@ichips.intel.com> References: <438F40B0.4010000@ichips.intel.com> Message-ID: <4392D48A.3090201@voltaire.com> Sean Hefty wrote: > it looks like the adaptor_list_lock in iser_adaptor_find_device() was > acquired while running on CPU 0, but an attempt was made to release it on CPU 1. Indeed. The problem did not reproduce; moreover, I can't see why a thread that was interrupted on one CPU would resume running on another CPU. For now, I will not change this code to lock IRQs. Or. From yael at mellanox.co.il Sun Dec 4 04:10:15 2005 From: yael at mellanox.co.il (Yael Kalka) Date: Sun, 4 Dec 2005 14:10:15 +0200 Subject: [Fwd: [openib-general] OpenSM and Wrong SM_Key] Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30E245C@mtlexch01.mtl.com> Hi Eitan, Hal, I agree that currently we do not have an authentication mechanism, and thus we cannot decide that an SM is not trusted.
I think that in the current situation the option of always sending our true SM_Key when receiving SMInfo SubnGet request is a good one. In this case - there is no need to update anything in the SMInfo SubnGet request. Any objections? Yael -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Thursday, December 01, 2005 7:02 PM To: Eitan Zahavi Cc: Yael Kalka; openib-general at openib.org Subject: RE: [Fwd: [openib-general] OpenSM and Wrong SM_Key] Hi Eitan, On Thu, 2005-12-01 at 10:35, Eitan Zahavi wrote: > Hi Yael, > > As I read through the MgtWg mails I get the impression that an out of > spec mechanism is required to know if the other SM is trusted. Yes, that was what I was proposing (in http://openib.org/pipermail/openib-general/2005-December/014186.html where I wrote "The SM needs a way to know whether the other SM(s) (and which ones) are trusted or not so the SM_Key can be filled in."): that OpenSM have a list of trusted SMs and OpenSM would use that information. > In that case and since OpenSM does not currently provide any such > mechanism, I would prefer never to send out the SM_Key on the request > and always send zero. Sending our SM_Key to a non - trusted SM is not a > good idea in my mind. > > OpenSM behavior should be to always trust any other SM. Above you said no other SM was trusted so do you mean not trust rather than trust other SMs ? > So any discovered SM that deserves to be the master should be granted > that right. Only if it were trusted and had the correct SM Key. -- Hal > Eitan Zahavi > Design Technology Director > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > > -----Original Message----- > > From: Yael Kalka > > Sent: Thursday, December 01, 2005 2:17 PM > > To: 'Hal Rosenstock'; Eitan Zahavi > > Cc: openib-general at openib.org > > Subject: RE: [Fwd: [openib-general] OpenSM and Wrong SM_Key] > > > > Hi Hal, Eitan, > > I think the best option is to add an OpenSM option flag - > exit_on_fatal. > > This flag can decide on the action on fatal cases: > > 1. Exit or not when seeing SM with different SM_Key. > > 2. Exit or not when there is a fatal link error (e.g - multiple > guids). > > etc. > > > > I tried to run 2 SMs just now with different SM_keys, and I see that > none of them > > exit, since both receive SM_Key=0 on SMInfo GetResp. > > The reason for that is that in the SMInfo Get request (as in all other > requests) > > we do not send anything in the mad data. Meaning - all fields are > clear. > > In the __osm_sminfo_rcv_process_get_request function we are checking > the state > > according > > to the payload data. This is always zero! Thus - SM will never know > that the SMInfo > > request is sent from an SM that is master. > > > > I will work on a fix for that. > > Yael > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Wednesday, November 30, 2005 11:57 PM > > To: Yael Kalka; Eitan Zahavi > > Cc: openib-general at openib.org > > Subject: [Fwd: [openib-general] OpenSM and Wrong SM_Key] > > > > > > Hi Yael & Eitan, > > > > Based on the recent MgtWG discussions, are you still holding your > > position in terms of exiting OpenSM when a non matching SM Key is > > discovered ? Just wondering if I can issue a patch for this and clear > > this issue so OpenSM can be compliant for this aspect. Thanks. 
> > > > -- Hal
> > > >
> > > > -----Forwarded Message-----
> > > >
> > > > From: Hal Rosenstock
> > To: openib-general at openib.org
> > Subject: [openib-general] OpenSM and Wrong SM_Key
> > Date: 08 Nov 2005 16:08:47 -0500
> >
> > Hi,
> >
> > Currently, when OpenSM receives SMInfo with a different SM_Key, it
> exits
> > as follows:
> >
> > void
> > __osm_sminfo_rcv_process_get_response(
> >   IN const osm_sminfo_rcv_t* const p_rcv,
> >   IN const osm_madw_t* const p_madw )
> > {
> > ...
> >
> >   /*
> >     Check that the sm_key of the found SM is the same as ours,
> >     or is zero. If not - OpenSM cannot continue with
> configuration!. */
> >   if ( p_smi->sm_key != 0 &&
> >        p_smi->sm_key != p_rcv->p_subn->opt.sm_key )
> >   {
> >     osm_log( p_rcv->p_log, OSM_LOG_ERROR,
> >              "__osm_sminfo_rcv_process_get_response: ERR 2F18: "
> >              "Got SM with sm_key that doesn't match our "
> >              "local key. Exiting\n" );
> >     osm_log( p_rcv->p_log, OSM_LOG_SYS,
> >              "Found remote SM with non-matching sm_key. Exiting\n" );
> >     osm_exit_flag = TRUE;
> >     goto Exit;
> >   }
> >
> > C14-61.2.1 states that:
> > A master SM which finds a higher priority master SM with the wrong
> > SM_Key should not relinquish the subnet.
> >
> > Exiting OpenSM relinquishes the subnet.
> >
> > So it appears to me that perhaps this behavior of exiting OpenSM should
> > be at least contingent on the SM state and relative priority of the
> > SMInfo received. Make sense ? If so, I will work on a patch for this.
> >
> > -- Hal
> >
> > _______________________________________________
> > openib-general mailing list
> > openib-general at openib.org
> > http://openib.org/mailman/listinfo/openib-general
> >
> > To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general

From yael at mellanox.co.il  Sun Dec  4 05:02:50 2005
From: yael at mellanox.co.il (Yael Kalka)
Date: 04 Dec 2005 15:02:50 +0200
Subject: [openib-general] [PATCH] Opensm - duplicated guids issue
Message-ID: <5z7jalxbet.fsf@mtl066.yok.mtl.com>

Hi Hal,

Currently, if OpenSM discovers duplicated GUIDs, or a 12x link with lane
reversal badly configured, it only issues an error to the log file. This
issue, though, is much more problematic, since it will cause part of the
subnet to be un-initialized.
The following patch includes a fuller handling of the issue - first, issue
an error message to the /var/log/messages file as well. Second - add an
option flag to the SM that will define whether or not to exit in such
cases.

Thanks,
Yael

Signed-off-by: Yael Kalka

Index: include/opensm/osm_subnet.h
===================================================================
--- include/opensm/osm_subnet.h	(revision 4288)
+++ include/opensm/osm_subnet.h	(working copy)
@@ -235,6 +235,7 @@ typedef struct _osm_subn_opt
   osm_testability_modes_t testability_mode;
   boolean_t updn_activate;
   char * updn_guid_file;
+  boolean_t exit_on_fatal;
 } osm_subn_opt_t;
 /*
 * FIELDS
@@ -372,6 +373,13 @@ typedef struct _osm_subn_opt
 * updn_guid_file
 *   Pointer to name of the UPDN guid file given by User
 *
+* exit_on_fatal
+*   If TRUE (default) - SM will exit on fatal subnet initialization issues.
+*   If FALSE - SM will not exit.
+*   Fatal initialization issues:
+*   a. SM recognizes 2 different nodes with the same guid, or 12x link with
+*      lane reversal badly configured.
+* * SEE ALSO * Subnet object *********/ Index: opensm/osm_subnet.c =================================================================== --- opensm/osm_subnet.c (revision 4288) +++ opensm/osm_subnet.c (working copy) @@ -440,6 +440,7 @@ osm_subn_set_default_opt( p_opt->testability_mode = OSM_TEST_MODE_NONE; p_opt->updn_activate = FALSE; p_opt->updn_guid_file = NULL; + p_opt->exit_on_fatal = TRUE; } /********************************************************************** @@ -765,6 +766,10 @@ osm_subn_parse_conf_file( __osm_subn_opts_unpack_charp( "updn_guid_file" , p_key, p_val, &p_opts->updn_guid_file); + + __osm_subn_opts_unpack_boolean( + "exit_on_fatal", + p_key, p_val, &p_opts->exit_on_fatal); } } fclose(opts_file); @@ -930,14 +935,17 @@ osm_subn_write_conf_file( "# If TRUE if OpenSM should disable multicast support\n" "no_multicast_option %s\n\n" "# No multicast routing is performed if TRUE\n" - "disable_multicast %s\n\n", + "disable_multicast %s\n\n" + "# If TRUE opensm will exit on fatal initialization issues\n" + "exit_on_fatal %s\n\n", p_opts->log_flags, p_opts->force_log_flush ? "TRUE" : "FALSE", p_opts->log_file, p_opts->accum_log_file ? "TRUE" : "FALSE", p_opts->dump_files_dir, p_opts->no_multicast_option ? "TRUE" : "FALSE", - p_opts->disable_multicast ? "TRUE" : "FALSE" + p_opts->disable_multicast ? "TRUE" : "FALSE", + p_opts->exit_on_fatal ? "TRUE" : "FALSE" ); /* optional string attributes ... */ Index: opensm/osm_node_info_rcv.c =================================================================== --- opensm/osm_node_info_rcv.c (revision 4288) +++ opensm/osm_node_info_rcv.c (working copy) @@ -198,6 +198,14 @@ __osm_ni_rcv_set_links( p_ni_context->port_num, dr_new_path ); + + osm_log( p_rcv->p_log, OSM_LOG_SYS, + "Errors on subnet. SM found duplicated guids or 12x " + "link with lane reversal badly configured. " + "Use osm log for more details.\n"); + + if ( p_rcv->p_subn->opt.exit_on_fatal == TRUE ) + exit( 1 ); } /* Index: opensm/main.c =================================================================== --- opensm/main.c (revision 4288) +++ opensm/main.c (working copy) @@ -178,6 +178,12 @@ show_usage(void) " This option will cause deletion of the log file\n" " (if it previously exists). 
By default, the log file\n"
          " is accumulative.\n\n");
+  printf( "-y\n"
+          "--stay_on_fatal\n"
+          " This option will cause SM not to exit on fatal initialization\n"
+          " issues: If SM discovers duplicated guids or 12x link with\n"
+          " lane reversal badly configured.\n"
+          " By default, the SM will exit.\n\n");
   printf( "-v\n"
           "--verbose\n"
           " This option increases the log verbosity level.\n"
@@ -460,7 +466,7 @@ main(
   boolean_t cache_options = FALSE;
   char *ignore_guids_file_name = NULL;
   uint32_t val;
-  const char * const short_option = "i:f:ed:g:l:s:t:a:uvVhorc";
+  const char * const short_option = "i:f:ed:g:l:s:t:a:uvVhorcy";
 
   /*
     In the array below, the 2nd parameter specified the number
@@ -492,6 +498,7 @@ main(
     { "updn", 0, NULL, 'u'},
     { "add_guid_file", 1, NULL, 'a'},
     { "cache-options", 0, NULL, 'c'},
+    { "stay_on_fatal", 0, NULL, 'y'},
     { NULL, 0, NULL, 0 } /* Required at the end of the array */
   };
 
@@ -665,6 +672,11 @@ main(
       printf(" Creating new log file\n");
       break;
 
+    case 'y':
+      opt.exit_on_fatal = FALSE;
+      printf(" Staying on fatal initialization\n");
+      break;
+
     case 'v':
       log_flags = (log_flags <<1 )|1;
       printf(" Verbose option -v (log flags = 0x%X)\n", log_flags );

From halr at voltaire.com  Sun Dec  4 09:11:36 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: Sun, 4 Dec 2005 19:11:36 +0200
Subject: [Fwd: [openib-general] OpenSM and Wrong SM_Key]
Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AB79@taurus.voltaire.com>

Hi Yael,

On Sun, 2005-12-04 at 07:10, Yael Kalka wrote:
> Hi Eitan, Hal,
>
> I agree that currently we do not have an authentication mechanism,
> thus - we cannot decide that an SM is not trusted.
> I think that in the current situation the option of always sending our
> true SM_Key when receiving SMInfo SubnGet request is a good one.
> In this case - there is no need to update anything in the SMInfo SubnGet
> request.
> Any objections?

IMO this is a first step (assuming a subnet with only OpenSMs, and hence
all are trusted). What needs to be done is:
The SM needs a way to know whether the other SM(s) (and which ones) are
trusted or not so the SM_Key can be filled in. To accomplish this,
OpenSM needs to have a list of trusted SMs (e.g. additional
configuration).

Also, given that the current default SM Key is 0, there is no difference
here (so perhaps the default SM Key should be changed to a non-zero
value). There is some ambiguity in the spec currently around this (and a
comment has been filed with the MgtWG on this).

-- Hal

> Yael
>
>
> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Thursday, December 01, 2005 7:02 PM
> To: Eitan Zahavi
> Cc: Yael Kalka; openib-general at openib.org
> Subject: RE: [Fwd: [openib-general] OpenSM and Wrong SM_Key]
>
>
> Hi Eitan,
>
> On Thu, 2005-12-01 at 10:35, Eitan Zahavi wrote:
> > Hi Yael,
> >
> > As I read through the MgtWg mails I get the impression that an out of
> > spec mechanism is required to know if the other SM is trusted.
>
> Yes, that was what I was proposing (in
> http://openib.org/pipermail/openib-general/2005-December/014186.html
> where I wrote "The SM needs a way to know whether the other SM(s) (and
> which ones) are trusted or not so the SM_Key can be filled in."): that
> OpenSM have a list of trusted SMs and OpenSM would use that information.
>
> > In that case and since OpenSM does not currently provide any such
> > mechanism, I would prefer never to send out the SM_Key on the request
> > and always send zero.
Sending our SM_Key to a non - trusted SM is not > a > > good idea in my mind. > > > > OpenSM behavior should be to always trust any other SM. > > Above you said no other SM was trusted so do you mean not trust rather > than trust other SMs ? > > > So any discovered SM that deserves to be the master should be granted > > that right. > > Only if it were trusted and had the correct SM Key. > > -- Hal > > > Eitan Zahavi > > Design Technology Director > > Mellanox Technologies LTD > > Tel:+972-4-9097208 > > Fax:+972-4-9593245 > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > -----Original Message----- > > > From: Yael Kalka > > > Sent: Thursday, December 01, 2005 2:17 PM > > > To: 'Hal Rosenstock'; Eitan Zahavi > > > Cc: openib-general at openib.org > > > Subject: RE: [Fwd: [openib-general] OpenSM and Wrong SM_Key] > > > > > > Hi Hal, Eitan, > > > I think the best option is to add an OpenSM option flag - > > exit_on_fatal. > > > This flag can decide on the action on fatal cases: > > > 1. Exit or not when seeing SM with different SM_Key. > > > 2. Exit or not when there is a fatal link error (e.g - multiple > > guids). > > > etc. > > > > > > I tried to run 2 SMs just now with different SM_keys, and I see that > > none of them > > > exit, since both receive SM_Key=0 on SMInfo GetResp. > > > The reason for that is that in the SMInfo Get request (as in all > other > > requests) > > > we do not send anything in the mad data. Meaning - all fields are > > clear. > > > In the __osm_sminfo_rcv_process_get_request function we are checking > > the state > > > according > > > to the payload data. This is always zero! Thus - SM will never know > > that the SMInfo > > > request is sent from an SM that is master. > > > > > > I will work on a fix for that. > > > Yael > > > > > > -----Original Message----- > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > Sent: Wednesday, November 30, 2005 11:57 PM > > > To: Yael Kalka; Eitan Zahavi > > > Cc: openib-general at openib.org > > > Subject: [Fwd: [openib-general] OpenSM and Wrong SM_Key] > > > > > > > > > Hi Yael & Eitan, > > > > > > Based on the recent MgtWG discussions, are you still holding your > > > position in terms of exiting OpenSM when a non matching SM Key is > > > discovered ? Just wondering if I can issue a patch for this and > clear > > > this issue so OpenSM can be compliant for this aspect. Thanks. > > > > > > -- Hal > > > > > > -----Forwarded Message----- > > > > > > From: Hal Rosenstock > > > To: openib-general at openib.org > > > Subject: [openib-general] OpenSM and Wrong SM_Key > > > Date: 08 Nov 2005 16:08:47 -0500 > > > > > > Hi, > > > > > > Currently, when OpenSM receives SMInfo with a different SM_Key, it > > exits > > > as follows: > > > > > > > > > void > > > __osm_sminfo_rcv_process_get_response( > > > IN const osm_sminfo_rcv_t* const p_rcv, > > > IN const osm_madw_t* const p_madw ) > > > { > > > ... > > > > > > > > > > > > /* > > > Check that the sm_key of the found SM is the same as ours, > > > or is zero. If not - OpenSM cannot continue with > configuration!. > > */ > > > if ( p_smi->sm_key != 0 && > > > p_smi->sm_key != p_rcv->p_subn->opt.sm_key ) > > > { > > > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > > "__osm_sminfo_rcv_process_get_response: ERR 2F18: " > > > "Got SM with sm_key that doesn't match our " > > > "local key. Exiting\n" ); > > > osm_log( p_rcv->p_log, OSM_LOG_SYS, > > > "Found remote SM with non-matching sm_key. 
Exiting\n"
> );
> > > osm_exit_flag = TRUE;
> > > goto Exit;
> > > }
> > >
> > > C14-61.2.1 states that:
> > > A master SM which finds a higher priority master SM with the wrong
> > > SM_Key should not relinquish the subnet.
> > >
> > > Exiting OpenSM relinquishes the subnet.
> > >
> > > So it appears to me that perhaps this behavior of exiting OpenSM
> should
> > > be at least contingent on the SM state and relative priority of the
> > > SMInfo received. Make sense ? If so, I will work on a patch for
> this.
> > >
> > > -- Hal
> > >
> > >
> > > _______________________________________________
> > > openib-general mailing list
> > > openib-general at openib.org
> > > http://openib.org/mailman/listinfo/openib-general
> > >
> > > To unsubscribe, please visit
> > http://openib.org/mailman/listinfo/openib-general

From mst at mellanox.co.il  Sun Dec  4 10:57:26 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 4 Dec 2005 20:57:26 +0200
Subject: [openib-general] libibcommon: fix make dist
Message-ID: <20051204185726.GA27549@mellanox.co.il>

fix make dist for libibcommon

Signed-off-by: Michael S. Tsirkin

Index: trunk/src/userspace/management/libibcommon/Makefile.am
===================================================================
--- trunk.orig/src/userspace/management/libibcommon/Makefile.am
+++ trunk/src/userspace/management/libibcommon/Makefile.am
@@ -22,7 +22,8 @@ libibcommonincludedir = $(includedir)/in
 
 libibcommoninclude_HEADERS = $(srcdir)/include/infiniband/common.h
 
-EXTRA_DIST = $(srcdir)/include/infiniband/common.h libibcommon.spec.in
+EXTRA_DIST = $(srcdir)/include/infiniband/common.h libibcommon.spec.in \
+	$(srcdir)/src/libibcommon.map
 
 dist-hook: libibcommon.spec
 	cp libibcommon.spec $(distdir)

-- 
MST

From halr at voltaire.com  Sun Dec  4 14:57:24 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 04 Dec 2005 17:57:24 -0500
Subject: [openib-general] Re: libibcommon: fix make dist
In-Reply-To: <20051204185726.GA27549@mellanox.co.il>
References: <20051204185726.GA27549@mellanox.co.il>
Message-ID: <1133736913.4587.8485.camel@hal.voltaire.com>

On Sun, 2005-12-04 at 13:57, Michael S. Tsirkin wrote:
> fix make dist for libibcommon

Thanks. Applied.
From eitan at mellanox.co.il  Sun Dec  4 23:46:56 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 5 Dec 2005 09:46:56 +0200
Subject: [openib-general] OpenSM: search_mgrp_by_mgid questions
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3618A56@mtlexch01.mtl.com>

Hi Hal,

I thought we all agree that a full MGID compare is required.
Also we should not deal with MGRPs marked "to be deleted".
For all purposes but MGRP re-route they should not exist...

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O.
Box 586 Yokneam 20692 ISRAEL

> > > -----Original Message-----
> > > From: Hal Rosenstock [mailto:halr at voltaire.com]
> > > Sent: Thursday, December 01, 2005 4:53 PM
> > > To: Yael Kalka
> > > Cc: openib-general at openib.org
> > > Subject: [openib-general] OpenSM: search_mgrp_by_mgid questions
> > >
> > > Hi Yael,
> > >
> > > osm_sa_path_record.c::__search_mgrp_by_mgid has the following:
> > >
> > >   p_recvd_mgid = p_ctxt->p_mgid;
> > >   p_rcv = p_ctxt->p_rcv;
> > >
> > >   /* Why not compare the entire MGID ???? */
> > >   /* different scope can sneak in for the same MGID ? */
> > >   /* EZ: I changed it to full compare ! */
> > >   if (cl_memcmp(&p_mgrp->mcmember_rec.mgid,
> > >                 p_recvd_mgid,
> > >                 sizeof(ib_gid_t)))
> > >     return;
> > >
> > > whereas osm_sa_mcmember_record.c::__search_mgrp_by_mgid has the
> > > following:
> > >
> > >   p_recvd_mcmember_rec = p_ctxt->p_mcmember_rec;
> > >   p_rcv = p_ctxt->p_rcv;
> > >
> > >   /* ignore groups marked for deletion */
> > >   if (p_mgrp->to_be_deleted)
> > >     return;
> > >
> > >   /* compare entire MGID so different scope will not sneak in for
> > >      the same MGID */
> > >   if (cl_memcmp(&p_mgrp->mcmember_rec.mgid,
> > >                 &p_recvd_mcmember_rec->mgid,
> > >                 sizeof(ib_gid_t)))
> > >     return;
> > >
> > > Shouldn't the SA PR code also check for "to be deleted" ? It also
> > seems
> > > like the comments on the MGID comparison should also be made the same.
> > >
> > > -- Hal
> > >
> > > _______________________________________________
> > > openib-general mailing list
> > > openib-general at openib.org
> > > http://openib.org/mailman/listinfo/openib-general
> > >
> > > To unsubscribe, please visit
> > http://openib.org/mailman/listinfo/openib-general

From eitan at mellanox.co.il  Sun Dec  4 23:52:21 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 5 Dec 2005 09:52:21 +0200
Subject: [Fwd: [openib-general] OpenSM and Wrong SM_Key]
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3618A57@mtlexch01.mtl.com>

> >
> > I agree that currently we do not have an authentication mechanism,
> > thus - we cannot decide that an SM is not trusted.
> > I think that in the current situation the option of always sending our
> > true SM_Key when receiving SMInfo SubnGet request is a good one.
> > In this case - there is no need to update anything in the SMInfo SubnGet
> > request.
> > Any objections?
>
> IMO this is a first step (assuming in subnet with only OpenSMs and hence
> all are trusted). What needs to be done is:
> The SM needs a way to know whether the other SM(s) (and which ones) are
> trusted or not so the SM_Key can be filled in. To accomplish this,
> OpenSM needs to have a list of trusted SMs (e.g. additional
> configuration).
[EZ] I guess what you mean is that a list of trusted SM's port guids will
be provided to the SM? We can do that.
>
> Also, given that the current default SM Key is 0. there is no difference
> here (so perhaps the default SM Key should be changed to a non 0 value).
> There is some ambiguity in the spec currently around this (and a comment
> has been filed with the MgtWG on this).
> > -- Hal > > > Yael > > > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Thursday, December 01, 2005 7:02 PM > > To: Eitan Zahavi > > Cc: Yael Kalka; openib-general at openib.org > > Subject: RE: [Fwd: [openib-general] OpenSM and Wrong SM_Key] > > > > > > Hi Eitan, > > > > On Thu, 2005-12-01 at 10:35, Eitan Zahavi wrote: > > > Hi Yael, > > > > > > As I read through the MgtWg mails I get the impression that an out of > > > spec mechanism is required to know if the other SM is trusted. > > > > Yes, that was what I was proposing (in > > http://openib.org/pipermail/openib-general/2005-December/014186.html > > where I wrote "The SM needs a way to know whether the other SM(s) (and > > which ones) are trusted or not so the SM_Key can be filled in."): that > > OpenSM have a list of trusted SMs and OpenSM would use that information. > > > > > In that case and since OpenSM does not currently provide any such > > > mechanism, I would prefer never to send out the SM_Key on the request > > > and always send zero. Sending our SM_Key to a non - trusted SM is not > > a > > > good idea in my mind. > > > > > > OpenSM behavior should be to always trust any other SM. > > > > Above you said no other SM was trusted so do you mean not trust rather > > than trust other SMs ? > > > > > So any discovered SM that deserves to be the master should be granted > > > that right. > > > > Only if it were trusted and had the correct SM Key. > > > > -- Hal > > > > > Eitan Zahavi > > > Design Technology Director > > > Mellanox Technologies LTD > > > Tel:+972-4-9097208 > > > Fax:+972-4-9593245 > > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > > > > -----Original Message----- > > > > From: Yael Kalka > > > > Sent: Thursday, December 01, 2005 2:17 PM > > > > To: 'Hal Rosenstock'; Eitan Zahavi > > > > Cc: openib-general at openib.org > > > > Subject: RE: [Fwd: [openib-general] OpenSM and Wrong SM_Key] > > > > > > > > Hi Hal, Eitan, > > > > I think the best option is to add an OpenSM option flag - > > > exit_on_fatal. > > > > This flag can decide on the action on fatal cases: > > > > 1. Exit or not when seeing SM with different SM_Key. > > > > 2. Exit or not when there is a fatal link error (e.g - multiple > > > guids). > > > > etc. > > > > > > > > I tried to run 2 SMs just now with different SM_keys, and I see that > > > none of them > > > > exit, since both receive SM_Key=0 on SMInfo GetResp. > > > > The reason for that is that in the SMInfo Get request (as in all > > other > > > requests) > > > > we do not send anything in the mad data. Meaning - all fields are > > > clear. > > > > In the __osm_sminfo_rcv_process_get_request function we are checking > > > the state > > > > according > > > > to the payload data. This is always zero! Thus - SM will never know > > > that the SMInfo > > > > request is sent from an SM that is master. > > > > > > > > I will work on a fix for that. > > > > Yael > > > > > > > > -----Original Message----- > > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > > Sent: Wednesday, November 30, 2005 11:57 PM > > > > To: Yael Kalka; Eitan Zahavi > > > > Cc: openib-general at openib.org > > > > Subject: [Fwd: [openib-general] OpenSM and Wrong SM_Key] > > > > > > > > > > > > Hi Yael & Eitan, > > > > > > > > Based on the recent MgtWG discussions, are you still holding your > > > > position in terms of exiting OpenSM when a non matching SM Key is > > > > discovered ? 
Just wondering if I can issue a patch for this and > > clear > > > > this issue so OpenSM can be compliant for this aspect. Thanks. > > > > > > > > -- Hal > > > > > > > > -----Forwarded Message----- > > > > > > > > From: Hal Rosenstock > > > > To: openib-general at openib.org > > > > Subject: [openib-general] OpenSM and Wrong SM_Key > > > > Date: 08 Nov 2005 16:08:47 -0500 > > > > > > > > Hi, > > > > > > > > Currently, when OpenSM receives SMInfo with a different SM_Key, it > > > exits > > > > as follows: > > > > > > > > > > > > void > > > > __osm_sminfo_rcv_process_get_response( > > > > IN const osm_sminfo_rcv_t* const p_rcv, > > > > IN const osm_madw_t* const p_madw ) > > > > { > > > > ... > > > > > > > > > > > > > > > > /* > > > > Check that the sm_key of the found SM is the same as ours, > > > > or is zero. If not - OpenSM cannot continue with > > configuration!. > > > */ > > > > if ( p_smi->sm_key != 0 && > > > > p_smi->sm_key != p_rcv->p_subn->opt.sm_key ) > > > > { > > > > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > > > "__osm_sminfo_rcv_process_get_response: ERR 2F18: " > > > > "Got SM with sm_key that doesn't match our " > > > > "local key. Exiting\n" ); > > > > osm_log( p_rcv->p_log, OSM_LOG_SYS, > > > > "Found remote SM with non-matching sm_key. Exiting\n" > > ); > > > > osm_exit_flag = TRUE; > > > > goto Exit; > > > > } > > > > > > > > C14-61.2.1 states that: > > > > A master SM which finds a higher priority master SM with the wrong > > > > SM_Key should not relinquish the subnet. > > > > > > > > Exiting OpenSM relinquishes the subnet. > > > > > > > > So it appears to me that perhaps this behavior of exiting OpenSM > > > should > > > > be at least contingent on the SM state and relative priority of the > > > > SMInfo received. Make sense ? If so, I will work on a patch for > > this. > > > > > > > > -- Hal > > > > > > > > > > > > _______________________________________________ > > > > openib-general mailing list > > > > openib-general at openib.org > > > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > To unsubscribe, please visit > > > http://openib.org/mailman/listinfo/openib-general From eitan at mellanox.co.il Sun Dec 4 23:58:43 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 5 Dec 2005 09:58:43 +0200 Subject: [openib-general] OpenSM: search_mgrp_by_mgid questions Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3618A58@mtlexch01.mtl.com> Hi Hal, Yael tells me I did not understand your question. I also see you have provided the patch implementing exactly what I want. Please ignore my previous mail. (Maybe the zillion mails long inbox can serve as a poor excuse for my previous one ...) Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Eitan Zahavi > Sent: Monday, December 05, 2005 9:47 AM > To: Hal Rosenstock > Cc: openib-general at openib.org > Subject: RE: [openib-general] OpenSM: search_mgrp_by_mgid questions > > Hi Hal, > > I thought we all agree that a full MGID compare is required. > Also we should not deal with MGRPs marked "to be deleted". > For all purposes but MGRP re-route they should not exist... > > Eitan Zahavi > Design Technology Director > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. 
Box 586 Yokneam 20692 ISRAEL > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Thursday, December 01, 2005 7:20 PM > > To: Eitan Zahavi > > Cc: Yael Kalka; openib-general at openib.org > > Subject: RE: [openib-general] OpenSM: search_mgrp_by_mgid questions > > > > Hi Eitan, > > > > On Thu, 2005-12-01 at 10:28, Eitan Zahavi wrote: > > > Hi Hal, > > > > > > You are very right. Thanks. Can you patch it? > > > > Sure. Any prefereance for which way should the comment be (like PR or > > MCM) ? > > > > -- Hal > > > > > Eitan Zahavi > > > Design Technology Director > > > Mellanox Technologies LTD > > > Tel:+972-4-9097208 > > > Fax:+972-4-9593245 > > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > > > > -----Original Message----- > > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > > Sent: Thursday, December 01, 2005 4:53 PM > > > > To: Yael Kalka > > > > Cc: openib-general at openib.org > > > > Subject: [openib-general] OpenSM: search_mgrp_by_mgid questions > > > > > > > > Hi Yael, > > > > > > > > osm_sa_path_record.c::__search_mgrp_by_mgid has the following: > > > > > > > > p_recvd_mgid = p_ctxt->p_mgid; > > > > p_rcv = p_ctxt->p_rcv; > > > > > > > > /* Why not compare the entire MGID ???? */ > > > > /* different scope can sneak in for the same MGID ? */ > > > > /* EZ: I changed it to full compare ! */ > > > > if (cl_memcmp(&p_mgrp->mcmember_rec.mgid, > > > > p_recvd_mgid, > > > > sizeof(ib_gid_t))) > > > > return; > > > > > > > > whereas osm_sa_mcmember_record.c::__search_mgrp_by_mgid has the > > > > following: > > > > > > > > p_recvd_mcmember_rec = p_ctxt->p_mcmember_rec; > > > > p_rcv = p_ctxt->p_rcv; > > > > > > > > /* ignore groups marked for deletion */ > > > > if (p_mgrp->to_be_deleted) > > > > return; > > > > > > > > /* compare entire MGID so different scope will not sneak in for > > > > the same MGID */ > > > > if (cl_memcmp(&p_mgrp->mcmember_rec.mgid, > > > > &p_recvd_mcmember_rec->mgid, > > > > sizeof(ib_gid_t))) > > > > return; > > > > > > > > Shouldn't the SA PR code also check for "to be deleted" ? It also > > > seems > > > > like the comments on the MGID comparison should also be made the > same. > > > > > > > > -- Hal > > > > > > > > > > > > _______________________________________________ > > > > openib-general mailing list > > > > openib-general at openib.org > > > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > To unsubscribe, please visit > > > http://openib.org/mailman/listinfo/openib-general > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Mon Dec 5 03:19:21 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Dec 2005 06:19:21 -0500 Subject: [Fwd: [openib-general] OpenSM and Wrong SM_Key] In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3618A57@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3618A57@mtlexch01.mtl.com> Message-ID: <1133781560.4587.11462.camel@hal.voltaire.com> On Mon, 2005-12-05 at 02:52, Eitan Zahavi wrote: > > > > > > I agree that currently we do not have an authentication mechanism, > > > thus - we cannot decide that an SM is not trusted. > > > I think that in the current situation the option of always sending > our > > > true SM_Key when receiving SMInfo SubnGet request is a good one. 
> > > In this case - there is no need to update anything in the SMInfo > SubnGet > > > request. > > > Any objections? > > > > IMO this is a first step (assuming in subnet with only OpenSMs and > hence > > all are trusted). What needs to be done is: > > The SM needs a way to know whether the other SM(s) (and which ones) > are > > trusted or not so the SM_Key can be filled in. To accomplish this, > > OpenSM needs to have a list of trusted SMs (e.g. additional > > configuration). > [EZ] I guess what you mean is that a list of trusted SM's port guids > will provided to the SM? We can do that. Yes, that's what I mean/meant. -- Hal > > > > Also, given that the current default SM Key is 0. there is no > difference > > here (so perhaps the default SM Key should be changed to a non 0 > value). > > There is some ambiguity in the spec currently around this (and a > comment > > has been filed with the MgtWG on this). > > > > -- Hal > > > > > Yael > > > > > > > > > -----Original Message----- > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > Sent: Thursday, December 01, 2005 7:02 PM > > > To: Eitan Zahavi > > > Cc: Yael Kalka; openib-general at openib.org > > > Subject: RE: [Fwd: [openib-general] OpenSM and Wrong SM_Key] > > > > > > > > > Hi Eitan, > > > > > > On Thu, 2005-12-01 at 10:35, Eitan Zahavi wrote: > > > > Hi Yael, > > > > > > > > As I read through the MgtWg mails I get the impression that an out > of > > > > spec mechanism is required to know if the other SM is trusted. > > > > > > Yes, that was what I was proposing (in > > > http://openib.org/pipermail/openib-general/2005-December/014186.html > > > where I wrote "The SM needs a way to know whether the other SM(s) > (and > > > which ones) are trusted or not so the SM_Key can be filled in."): > that > > > OpenSM have a list of trusted SMs and OpenSM would use that > information. > > > > > > > In that case and since OpenSM does not currently provide any such > > > > mechanism, I would prefer never to send out the SM_Key on the > request > > > > and always send zero. Sending our SM_Key to a non - trusted SM is > not > > > a > > > > good idea in my mind. > > > > > > > > OpenSM behavior should be to always trust any other SM. > > > > > > Above you said no other SM was trusted so do you mean not trust > rather > > > than trust other SMs ? > > > > > > > So any discovered SM that deserves to be the master should be > granted > > > > that right. > > > > > > Only if it were trusted and had the correct SM Key. > > > > > > -- Hal > > > > > > > Eitan Zahavi > > > > Design Technology Director > > > > Mellanox Technologies LTD > > > > Tel:+972-4-9097208 > > > > Fax:+972-4-9593245 > > > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > > > > > > > > > > -----Original Message----- > > > > > From: Yael Kalka > > > > > Sent: Thursday, December 01, 2005 2:17 PM > > > > > To: 'Hal Rosenstock'; Eitan Zahavi > > > > > Cc: openib-general at openib.org > > > > > Subject: RE: [Fwd: [openib-general] OpenSM and Wrong SM_Key] > > > > > > > > > > Hi Hal, Eitan, > > > > > I think the best option is to add an OpenSM option flag - > > > > exit_on_fatal. > > > > > This flag can decide on the action on fatal cases: > > > > > 1. Exit or not when seeing SM with different SM_Key. > > > > > 2. Exit or not when there is a fatal link error (e.g - multiple > > > > guids). > > > > > etc. > > > > > > > > > > I tried to run 2 SMs just now with different SM_keys, and I see > that > > > > none of them > > > > > exit, since both receive SM_Key=0 on SMInfo GetResp. 
> > > > > The reason for that is that in the SMInfo Get request (as in all > > > other > > > > requests) > > > > > we do not send anything in the mad data. Meaning - all fields > are > > > > clear. > > > > > In the __osm_sminfo_rcv_process_get_request function we are > checking > > > > the state > > > > > according > > > > > to the payload data. This is always zero! Thus - SM will never > know > > > > that the SMInfo > > > > > request is sent from an SM that is master. > > > > > > > > > > I will work on a fix for that. > > > > > Yael > > > > > > > > > > -----Original Message----- > > > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > > > Sent: Wednesday, November 30, 2005 11:57 PM > > > > > To: Yael Kalka; Eitan Zahavi > > > > > Cc: openib-general at openib.org > > > > > Subject: [Fwd: [openib-general] OpenSM and Wrong SM_Key] > > > > > > > > > > > > > > > Hi Yael & Eitan, > > > > > > > > > > Based on the recent MgtWG discussions, are you still holding > your > > > > > position in terms of exiting OpenSM when a non matching SM Key > is > > > > > discovered ? Just wondering if I can issue a patch for this and > > > clear > > > > > this issue so OpenSM can be compliant for this aspect. Thanks. > > > > > > > > > > -- Hal > > > > > > > > > > -----Forwarded Message----- > > > > > > > > > > From: Hal Rosenstock > > > > > To: openib-general at openib.org > > > > > Subject: [openib-general] OpenSM and Wrong SM_Key > > > > > Date: 08 Nov 2005 16:08:47 -0500 > > > > > > > > > > Hi, > > > > > > > > > > Currently, when OpenSM receives SMInfo with a different SM_Key, > it > > > > exits > > > > > as follows: > > > > > > > > > > > > > > > void > > > > > __osm_sminfo_rcv_process_get_response( > > > > > IN const osm_sminfo_rcv_t* const p_rcv, > > > > > IN const osm_madw_t* const p_madw ) > > > > > { > > > > > ... > > > > > > > > > > > > > > > > > > > > /* > > > > > Check that the sm_key of the found SM is the same as ours, > > > > > or is zero. If not - OpenSM cannot continue with > > > configuration!. > > > > */ > > > > > if ( p_smi->sm_key != 0 && > > > > > p_smi->sm_key != p_rcv->p_subn->opt.sm_key ) > > > > > { > > > > > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > > > > "__osm_sminfo_rcv_process_get_response: ERR 2F18: " > > > > > "Got SM with sm_key that doesn't match our " > > > > > "local key. Exiting\n" ); > > > > > osm_log( p_rcv->p_log, OSM_LOG_SYS, > > > > > "Found remote SM with non-matching sm_key. > Exiting\n" > > > ); > > > > > osm_exit_flag = TRUE; > > > > > goto Exit; > > > > > } > > > > > > > > > > C14-61.2.1 states that: > > > > > A master SM which finds a higher priority master SM with the > wrong > > > > > SM_Key should not relinquish the subnet. > > > > > > > > > > Exiting OpenSM relinquishes the subnet. > > > > > > > > > > So it appears to me that perhaps this behavior of exiting OpenSM > > > > should > > > > > be at least contingent on the SM state and relative priority of > the > > > > > SMInfo received. Make sense ? If so, I will work on a patch for > > > this. 
> > > > > > > > > > -- Hal > > > > > > > > > > > > > > > _______________________________________________ > > > > > openib-general mailing list > > > > > openib-general at openib.org > > > > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > > > To unsubscribe, please visit > > > > http://openib.org/mailman/listinfo/openib-general > From halr at voltaire.com Mon Dec 5 05:19:27 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Dec 2005 08:19:27 -0500 Subject: [openib-general] Re: [PATCH] Opensm - fix segfault on exit In-Reply-To: <5z8xv1xklx.fsf@mtl066.yok.mtl.com> References: <5z8xv1xklx.fsf@mtl066.yok.mtl.com> Message-ID: <1133788766.4587.12025.camel@hal.voltaire.com> On Sun, 2005-12-04 at 04:44, Yael Kalka wrote: > Hi Hal, > > If the driver isn't loaded, opensm exits with segfault. This is since > it tries destroying the signal event in the osm_vendor, but due to the > failure - this event wasn't created. > The following patch fixes this. Thanks. Applied. > Thanks, > Yael From halr at voltaire.com Mon Dec 5 06:14:01 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 5 Dec 2005 16:14:01 +0200 Subject: [openib-general] Re: [PATCH] Opensm - duplicated guids issue Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AB7A@taurus.voltaire.com> Hi Yael, On Sun, 2005-12-04 at 08:02, Yael Kalka wrote: > Hi Hal, > > Currently if OpenSM discovers duplicated guids What is the cause of a duplicated GUID ? Is it a misconfiguration of someone's firmware (rather than some error on the part of OpenSM) ? If so, I'm not sure exiting SM is the best option. IMO the policy is to decide which GUID to "honor" (either the original one or the new one). > or 12x link with lane reversal badly configured What does badly configured mean ? Does it mean the link does not come up at all or just in some non desired mode ? How is "bad lane reversal" reconfigured ? Can't this also occur on a 4x link as well ? > it only issues an error to the log > file. This issue, though, is much more problematic, since it will cause > part of the subnet to be un-initialized. > The following patch includes a fuller handling of the issue - first, > issue an error message to the /var/log/messeges file as well. I am incorporating this part of the patch. > Second - add an option flag to the SM that will define wether or not > to exit on such case. Also, there are other scenarios which mark the subnet initialization as failed (but don't exit the SM). This seems inconsistent to me. These cases also do not put errors out on syslog. Should they ? IMO, in general, exiting out of OpenSM should be avoided at all costs. The admin can always cause this to occur if desired and operating part of the subnet is better than none. Are these cases where the admin would not want to run the SM until the issues were resolved ? -- Hal > Thanks, > Yael > > Signed-off-by: Yael Kalka > > Index: include/opensm/osm_subnet.h > =================================================================== > --- include/opensm/osm_subnet.h (revision 4288) > +++ include/opensm/osm_subnet.h (working copy) > @@ -235,6 +235,7 @@ typedef struct _osm_subn_opt > osm_testability_modes_t testability_mode; > boolean_t updn_activate; > char * updn_guid_file; > + boolean_t exit_on_fatal; > } osm_subn_opt_t; > /* > * FIELDS > @@ -372,6 +373,13 @@ typedef struct _osm_subn_opt > * updn_guid_file > * Pointer to name of the UPDN guid file given by User > * > +* exit_on_fatal > +* If TRUE (default) - SM will exit on fatal subnet initialization issues. 
> +* If FALSE - SM will not exit. > +* Fatal initialization issues: > +* a. SM recognizes 2 different nodes with the same guid, or 12x link with > +* lane reversal badly configured. > +* > * SEE ALSO > * Subnet object > *********/ > Index: opensm/osm_subnet.c > =================================================================== > --- opensm/osm_subnet.c (revision 4288) > +++ opensm/osm_subnet.c (working copy) > @@ -440,6 +440,7 @@ osm_subn_set_default_opt( > p_opt->testability_mode = OSM_TEST_MODE_NONE; > p_opt->updn_activate = FALSE; > p_opt->updn_guid_file = NULL; > + p_opt->exit_on_fatal = TRUE; > } > > /********************************************************************** > @@ -765,6 +766,10 @@ osm_subn_parse_conf_file( > __osm_subn_opts_unpack_charp( > "updn_guid_file" , > p_key, p_val, &p_opts->updn_guid_file); > + > + __osm_subn_opts_unpack_boolean( > + "exit_on_fatal", > + p_key, p_val, &p_opts->exit_on_fatal); > } > } > fclose(opts_file); > @@ -930,14 +935,17 @@ osm_subn_write_conf_file( > "# If TRUE if OpenSM should disable multicast support\n" > "no_multicast_option %s\n\n" > "# No multicast routing is performed if TRUE\n" > - "disable_multicast %s\n\n", > + "disable_multicast %s\n\n" > + "# If TRUE opensm will exit on fatal initialization issues\n" > + "exit_on_fatal %s\n\n", > p_opts->log_flags, > p_opts->force_log_flush ? "TRUE" : "FALSE", > p_opts->log_file, > p_opts->accum_log_file ? "TRUE" : "FALSE", > p_opts->dump_files_dir, > p_opts->no_multicast_option ? "TRUE" : "FALSE", > - p_opts->disable_multicast ? "TRUE" : "FALSE" > + p_opts->disable_multicast ? "TRUE" : "FALSE", > + p_opts->exit_on_fatal ? "TRUE" : "FALSE" > ); > > /* optional string attributes ... */ > Index: opensm/osm_node_info_rcv.c > =================================================================== > --- opensm/osm_node_info_rcv.c (revision 4288) > +++ opensm/osm_node_info_rcv.c (working copy) > @@ -198,6 +198,14 @@ __osm_ni_rcv_set_links( > p_ni_context->port_num, > dr_new_path > ); > + > + osm_log( p_rcv->p_log, OSM_LOG_SYS, > + "Errors on subnet. SM found duplicated guids or 12x " > + "link with lane reversal badly configured. " > + "Use osm log for more details.\n"); > + > + if ( p_rcv->p_subn->opt.exit_on_fatal == TRUE ) > + exit( 1 ); > } > > /* > Index: opensm/main.c > =================================================================== > --- opensm/main.c (revision 4288) > +++ opensm/main.c (working copy) > @@ -178,6 +178,12 @@ show_usage(void) > " This option will cause deletion of the log file\n" > " (if it previously exists). 
By default, the log file\n"
> " is accumulative.\n\n");
> + printf( "-y\n"
> + "--stay_on_fatal\n"
> + " This option will cause SM not to exit on fatal initialization\n"
> + " issues: If SM discovers duplicated guids or 12x link with\n"
> + " lane reversal badly configured.\n"
> + " By default, the SM will exit.\n\n");
> printf( "-v\n"
> "--verbose\n"
> " This option increases the log verbosity level.\n"
> @@ -460,7 +466,7 @@ main(
> boolean_t cache_options = FALSE;
> char *ignore_guids_file_name = NULL;
> uint32_t val;
> - const char * const short_option = "i:f:ed:g:l:s:t:a:uvVhorc";
> + const char * const short_option = "i:f:ed:g:l:s:t:a:uvVhorcy";
>
> /*
> In the array below, the 2nd parameter specified the number
> @@ -492,6 +498,7 @@ main(
> { "updn", 0, NULL, 'u'},
> { "add_guid_file", 1, NULL, 'a'},
> { "cache-options", 0, NULL, 'c'},
> + { "stay_on_fatal", 0, NULL, 'y'},
> { NULL, 0, NULL, 0 } /* Required at the end of the array */
> };
>
> @@ -665,6 +672,11 @@ main(
> printf(" Creating new log file\n");
> break;
>
> + case 'y':
> + opt.exit_on_fatal = FALSE;
> + printf(" Staying on fatal initialization\n");
> + break;
> +
> case 'v':
> log_flags = (log_flags <<1 )|1;
> printf(" Verbose option -v (log flags = 0x%X)\n", log_flags );

From eitan at mellanox.co.il  Mon Dec  5 07:32:06 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 5 Dec 2005 17:32:06 +0200
Subject: [openib-general] RE: [PATCH] Opensm - duplicated guids issue
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3618A70@mtlexch01.mtl.com>

Hi Hal,

Please see my response below

> > Currently if OpenSM discovers duplicated guids
>
> What is the cause of a duplicated GUID ? Is it a misconfiguration of
> someone's firmware (rather than some error on the part of OpenSM) ? If
> so, I'm not sure exiting SM is the best option. IMO the policy is to
> decide which GUID to "honor" (either the original one or the new one).
[EZ] There is no way to know which GUID to honor if this is the first
sweep. Moreover, the cause of a duplicated GUID is bad firmware burning.
Currently the last GUID found is honored but the fabric behind the first
one is ignored.
>
> > or 12x link with lane reversal badly configured
>
> What does badly configured mean ? Does it mean the link does not come up
> at all or just in some non desired mode ? How is "bad lane reversal"
> reconfigured ?
[EZ] Bad FW configuration. The details are provided in the IS3 PRM. But
if one routes the board with the lanes swizzled, automatic lane reversal
detection has to be enabled in the INI file.
>
> Can't this also occur on a 4x link as well ?
[EZ] No.
>
> > it only issues an error to the log
> > file. This issue, though, is much more problematic, since it will cause
> > part of the subnet to be un-initialized.
> > The following patch includes a fuller handling of the issue - first,
> > issue an error message to the /var/log/messeges file as well.
>
> I am incorporating this part of the patch.
>
> > Second - add an option flag to the SM that will define wether or not
> > to exit on such case.
>
> Also, there are other scenarios which mark the subnet initialization as
> failed (but don't exit the SM). This seems inconsistent to me. These
> cases also do not put errors out on syslog. Should they ?
>
> IMO, in general, exiting out of OpenSM should be avoided at all costs.
> The admin can always cause this to occur if desired and operating part
> of the subnet is better than none. Are these cases where the admin would
> not want to run the SM until the issues were resolved ?
[EZ] The case of "bad connectivity" is different from "initialization
failure": "bad connectivity" is a static problem caused by bad firmware
options used or even bad hardware. "initialization failure" can be caused
by management packet dropping, which may happen due to flaky links or even
a reasonable bit error rate.
The proposal is to provide an option for the sake of exiting the SM on
such "bad hardware/firmware" conditions. If one wants to keep going, all
he has to do is set that option to 0.

Needless to say, we have proposed this "exit condition" based on our
experience, where such cases have happened and the log message was
ignored; many man-hours could have been saved had the SM insisted on not
running under such conditions.

>
> -- Hal
>
> > Thanks,
> > Yael
> >
> > Signed-off-by: Yael Kalka
> >
> > Index: include/opensm/osm_subnet.h
> > ===================================================================
> > --- include/opensm/osm_subnet.h (revision 4288)
> > +++ include/opensm/osm_subnet.h (working copy)
> > @@ -235,6 +235,7 @@ typedef struct _osm_subn_opt
> > osm_testability_modes_t testability_mode;
> > boolean_t updn_activate;
> > char * updn_guid_file;
> > + boolean_t exit_on_fatal;
> > } osm_subn_opt_t;
> > /*
> > * FIELDS
> > @@ -372,6 +373,13 @@ typedef struct _osm_subn_opt
> > * updn_guid_file
> > * Pointer to name of the UPDN guid file given by User
> > *
> > +* exit_on_fatal
> > +* If TRUE (default) - SM will exit on fatal subnet initialization issues.
> > +* If FALSE - SM will not exit.
> > +* Fatal initialization issues:
> > +* a. SM recognizes 2 different nodes with the same guid, or 12x link with
> > +* lane reversal badly configured.
> > +*
> > * SEE ALSO
> > * Subnet object
> > *********/
> > Index: opensm/osm_subnet.c
> > ===================================================================
> > --- opensm/osm_subnet.c (revision 4288)
> > +++ opensm/osm_subnet.c (working copy)
> > @@ -440,6 +440,7 @@ osm_subn_set_default_opt(
> > p_opt->testability_mode = OSM_TEST_MODE_NONE;
> > p_opt->updn_activate = FALSE;
> > p_opt->updn_guid_file = NULL;
> > + p_opt->exit_on_fatal = TRUE;
> > }
> >
> > /**********************************************************************
> > @@ -765,6 +766,10 @@ osm_subn_parse_conf_file(
> > __osm_subn_opts_unpack_charp(
> > "updn_guid_file" ,
> > p_key, p_val, &p_opts->updn_guid_file);
> > +
> > + __osm_subn_opts_unpack_boolean(
> > + "exit_on_fatal",
> > + p_key, p_val, &p_opts->exit_on_fatal);
> > }
> > }
> > fclose(opts_file);
> > @@ -930,14 +935,17 @@ osm_subn_write_conf_file(
> > "# If TRUE if OpenSM should disable multicast support\n"
> > "no_multicast_option %s\n\n"
> > "# No multicast routing is performed if TRUE\n"
> > - "disable_multicast %s\n\n",
> > + "disable_multicast %s\n\n"
> > + "# If TRUE opensm will exit on fatal initialization issues\n"
> > + "exit_on_fatal %s\n\n",
> > p_opts->log_flags,
> > p_opts->force_log_flush ? "TRUE" : "FALSE",
> > p_opts->log_file,
> > p_opts->accum_log_file ? "TRUE" : "FALSE",
> > p_opts->dump_files_dir,
> > p_opts->no_multicast_option ? "TRUE" : "FALSE",
> > - p_opts->disable_multicast ? "TRUE" : "FALSE"
> > + p_opts->disable_multicast ? "TRUE" : "FALSE",
> > + p_opts->exit_on_fatal ? "TRUE" : "FALSE"
> > );
> >
> > /* optional string attributes ...
*/ > > Index: opensm/osm_node_info_rcv.c > > =================================================================== > > --- opensm/osm_node_info_rcv.c (revision 4288) > > +++ opensm/osm_node_info_rcv.c (working copy) > > @@ -198,6 +198,14 @@ __osm_ni_rcv_set_links( > > p_ni_context->port_num, > > dr_new_path > > ); > > + > > + osm_log( p_rcv->p_log, OSM_LOG_SYS, > > + "Errors on subnet. SM found duplicated guids or 12x " > > + "link with lane reversal badly configured. " > > + "Use osm log for more details.\n"); > > + > > + if ( p_rcv->p_subn->opt.exit_on_fatal == TRUE ) > > + exit( 1 ); > > } > > > > /* > > Index: opensm/main.c > > =================================================================== > > --- opensm/main.c (revision 4288) > > +++ opensm/main.c (working copy) > > @@ -178,6 +178,12 @@ show_usage(void) > > " This option will cause deletion of the log file\n" > > " (if it previously exists). By default, the log file\n" > > " is accumulative.\n\n"); > > + printf( "-y\n" > > + "--stay_on_fatal\n" > > + " This option will cause SM not to exit on fatal initialization\n" > > + " issues: If SM discovers duplicated guids or 12x link with\n" > > + " lane reversal badly configured.\n" > > + " By default, the SM will exit.\n\n"); > > printf( "-v\n" > > "--verbose\n" > > " This option increases the log verbosity level.\n" > > @@ -460,7 +466,7 @@ main( > > boolean_t cache_options = FALSE; > > char *ignore_guids_file_name = NULL; > > uint32_t val; > > - const char * const short_option = "i:f:ed:g:l:s:t:a:uvVhorc"; > > + const char * const short_option = "i:f:ed:g:l:s:t:a:uvVhorcy"; > > > > /* > > In the array below, the 2nd parameter specified the number > > @@ -492,6 +498,7 @@ main( > > { "updn", 0, NULL, 'u'}, > > { "add_guid_file", 1, NULL, 'a'}, > > { "cache-options", 0, NULL, 'c'}, > > + { "stay_on_fatal", 0, NULL, 'y'}, > > { NULL, 0, NULL, 0 } /* Required at the end of the array */ > > }; > > > > @@ -665,6 +672,11 @@ main( > > printf(" Creating new log file\n"); > > break; > > > > + case 'y': > > + opt.exit_on_fatal = FALSE; > > + printf(" Staying on fatal initialization\n"); > > + break; > > + > > case 'v': > > log_flags = (log_flags <<1 )|1; > > printf(" Verbose option -v (log flags = 0x%X)\n", log_flags ); > > From jlentini at netapp.com Mon Dec 5 08:11:53 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 5 Dec 2005 11:11:53 -0500 (EST) Subject: [openib-general] [kDAPL]How to register a vmalloc() allocated buffer In-Reply-To: <7b2fa1820512030945j22e205d9j86a3b8e7bd709182@mail.gmail.com> References: <7b2fa1820512030945j22e205d9j86a3b8e7bd709182@mail.gmail.com> Message-ID: > I am doing a simple rdma-read test using the kDAPL. My test is running in > the kernel model. > When I allocate both the data source and sink buffers using kmalloc() and > register the buffers using dat_lmr_kcreate() with memory type > DAT_MEM_TYPE_PHYSICAL, everything goes well. If the sink buffer is allocated > with vmalloc() and registered as before, no registering error or rdma read > DTO completion error occours but > My questions: > 1) Could a buffer allocated with vmalloc() be used for a kDAPL rdma reading? > If so, > 2) should a buffer of this kind be registered in the same as a buffer > allocated with kmalloc()? > > Could anyone give some suggestion? > Thanks very much! Hi Ian, An IB HCA needs to be able to DMA the memory used for RDMA read. Since vmalloc does not guarantee that the memory it returns can be accessed via DMA, you should not use vmalloc. 
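To make the distinction concrete, here is a minimal sketch, assuming
2.6-era kernel APIs (BUF_SIZE and the function name are hypothetical):
kmalloc() memory is physically contiguous, so a single physical segment can
describe the whole buffer, which is what a DAT_MEM_TYPE_PHYSICAL
registration expects; vmalloc() memory is only virtually contiguous, must
be resolved one page at a time with vmalloc_to_page(), and each resulting
page would still need to be DMA mapped before the HCA could access it.

#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/mm.h>
#include <asm/io.h>

#define BUF_SIZE (4 * PAGE_SIZE)	/* hypothetical buffer size */

static void contiguity_sketch(void)
{
	void *kbuf = kmalloc(BUF_SIZE, GFP_KERNEL);	/* physically contiguous */
	void *vbuf = vmalloc(BUF_SIZE);	/* virtually contiguous only */
	int i;

	if (kbuf && vbuf) {
		/* kmalloc: one base physical address covers the whole buffer */
		unsigned long phys = virt_to_phys(kbuf);

		/* vmalloc: virt_to_phys() is NOT valid here; each page must
		 * be looked up individually, and the pages are in general
		 * scattered across physical memory */
		for (i = 0; i < BUF_SIZE / PAGE_SIZE; i++) {
			struct page *pg = vmalloc_to_page(vbuf + i * PAGE_SIZE);
			/* page_to_phys(pg) yields one page-sized segment */
			(void)pg;
		}
		(void)phys;
	}
	kfree(kbuf);
	vfree(vbuf);
}

So a vmalloc() buffer cannot be described by the single contiguous region
the physical-memory registration path assumes, which is consistent with the
registration appearing to succeed while the transfer does not behave as
expected.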
james From halr at voltaire.com Mon Dec 5 10:40:08 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 5 Dec 2005 20:40:08 +0200 Subject: [openib-general] RE: [PATCH] Opensm - duplicated guids issue Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AB7E@taurus.voltaire.com> Hi Eitan, On Mon, 2005-12-05 at 10:32, Eitan Zahavi wrote: > Hi Hal, > > Please see my response below > > > Currently if OpenSM discovers duplicated guids > > > > What is the cause of a duplicated GUID ? Is it a misconfiguration of > > someone's firmware (rather than some error on the part of OpenSM) ? If > > so, I'm not sure exiting SM is the best option. IMO the policy is to > > decide which GUID to "honor" (either the original one or the new one). > [EZ] There is no way to know which GUID to honor if this is the first > sweep. More over the cause for duplicated GUID is from bad firmware > burning. IMO, leaving the configuration of a globally unique ID to firmware configuration is a poor choice as it lends itself to being error prone. It should be done at manufacturing time in something like an EEPROM. I know this increases the cost, etc. but also reduces the chances of this being an issue. > Currently the last GUID found is honored but the fabric behind > the first one is ignored. > > > > > or 12x link with lane reversal badly configured > > > > What does badly configured mean ? Does it mean the link does not come > up > > at all or just in some non desired mode ? How is "bad lane reversal" > > reconfigured ? > [EZ] Bad FW configuration. The details are provided in the IS3 PRM. But > if one route the board and swizzle the lanes it has to enable automatic > lane reversal detection in the INI file. > > > > Can't this also occur on a 4x link as well ? > [EZ] No. > > > > > it only issues an error to the log > > > file. This issue, though, is much more problematic, since it will > cause > > > part of the subnet to be un-initialized. > > > The following patch includes a fuller handling of the issue - first, > > > issue an error message to the /var/log/messeges file as well. > > > > I am incorporating this part of the patch. > > > > > Second - add an option flag to the SM that will define wether or not > > > to exit on such case. > > > > Also, there are other scenarios which mark the subnet initialization > as > > failed (but don't exit the SM). This seems inconsistent to me. These > > cases also do not put errors out on syslog. Should they ? > > > > IMO, in general, exiting out of OpenSM should be avoided at all costs. > > The admin can always cause this to occur if desired and operating part > > of the subnet is better than none. Are these cases where the admin > would > > not want to run the SM until the issues were resolved ? > [EZ] The case of "bad connectivity" is different then "initialization > failure": > "bad connectivity" is a static problem caused by bad firmware options > used or even bad hardware. "initialization failure" can be caused by > management packet dropping which may happen due to flaky links or even > reasonable bit error rate. I think there are other cases aside from the "bad connectivity" cases you cite (as was seen at SC05). > The proposal is to provide an option for the sake of exiting the SM on > such "bad hardware/firmware" conditions. If one wants to keep going all > he has to do is to set that option to 0. > > Needless to say we have proposed this "exit condition" based on our > experience where such cases have happened and the log message ignored. 
> Such that many man hours could have been saved if the SM would insist on > not running under such conditions. I think there is a chance that there will be support calls this way too since the OpenSM won't come up at all in this case. We can always change the default for this (for exiting on these errors) from TRUE to FALSE if and when this becomes an issue... Anyone else have an opinion on this ? -- Hal > > > > -- Hal > > > > > Thanks, > > > Yael > > > > > > Signed-off-by: Yael Kalka > > > > > > Index: include/opensm/osm_subnet.h > > > =================================================================== > > > --- include/opensm/osm_subnet.h (revision 4288) > > > +++ include/opensm/osm_subnet.h (working copy) > > > @@ -235,6 +235,7 @@ typedef struct _osm_subn_opt > > > osm_testability_modes_t testability_mode; > > > boolean_t updn_activate; > > > char * updn_guid_file; > > > + boolean_t exit_on_fatal; > > > } osm_subn_opt_t; > > > /* > > > * FIELDS > > > @@ -372,6 +373,13 @@ typedef struct _osm_subn_opt > > > * updn_guid_file > > > * Pointer to name of the UPDN guid file given by User > > > * > > > +* exit_on_fatal > > > +* If TRUE (default) - SM will exit on fatal subnet > initialization issues. > > > +* If FALSE - SM will not exit. > > > +* Fatal initialization issues: > > > +* a. SM recognizes 2 different nodes with the same guid, or 12x > link with > > > +* lane reversal badly configured. > > > +* > > > * SEE ALSO > > > * Subnet object > > > *********/ > > > Index: opensm/osm_subnet.c > > > =================================================================== > > > --- opensm/osm_subnet.c (revision 4288) > > > +++ opensm/osm_subnet.c (working copy) > > > @@ -440,6 +440,7 @@ osm_subn_set_default_opt( > > > p_opt->testability_mode = OSM_TEST_MODE_NONE; > > > p_opt->updn_activate = FALSE; > > > p_opt->updn_guid_file = NULL; > > > + p_opt->exit_on_fatal = TRUE; > > > } > > > > > > > /********************************************************************** > > > @@ -765,6 +766,10 @@ osm_subn_parse_conf_file( > > > __osm_subn_opts_unpack_charp( > > > "updn_guid_file" , > > > p_key, p_val, &p_opts->updn_guid_file); > > > + > > > + __osm_subn_opts_unpack_boolean( > > > + "exit_on_fatal", > > > + p_key, p_val, &p_opts->exit_on_fatal); > > > } > > > } > > > fclose(opts_file); > > > @@ -930,14 +935,17 @@ osm_subn_write_conf_file( > > > "# If TRUE if OpenSM should disable multicast support\n" > > > "no_multicast_option %s\n\n" > > > "# No multicast routing is performed if TRUE\n" > > > - "disable_multicast %s\n\n", > > > + "disable_multicast %s\n\n" > > > + "# If TRUE opensm will exit on fatal initialization issues\n" > > > + "exit_on_fatal %s\n\n", > > > p_opts->log_flags, > > > p_opts->force_log_flush ? "TRUE" : "FALSE", > > > p_opts->log_file, > > > p_opts->accum_log_file ? "TRUE" : "FALSE", > > > p_opts->dump_files_dir, > > > p_opts->no_multicast_option ? "TRUE" : "FALSE", > > > - p_opts->disable_multicast ? "TRUE" : "FALSE" > > > + p_opts->disable_multicast ? "TRUE" : "FALSE", > > > + p_opts->exit_on_fatal ? "TRUE" : "FALSE" > > > ); > > > > > > /* optional string attributes ... */ > > > Index: opensm/osm_node_info_rcv.c > > > =================================================================== > > > --- opensm/osm_node_info_rcv.c (revision 4288) > > > +++ opensm/osm_node_info_rcv.c (working copy) > > > @@ -198,6 +198,14 @@ __osm_ni_rcv_set_links( > > > p_ni_context->port_num, > > > dr_new_path > > > ); > > > + > > > + osm_log( p_rcv->p_log, OSM_LOG_SYS, > > > + "Errors on subnet. 
SM found duplicated guids > or 12x " > > > + "link with lane reversal badly configured. " > > > + "Use osm log for more details.\n"); > > > + > > > + if ( p_rcv->p_subn->opt.exit_on_fatal == TRUE ) > > > + exit( 1 ); > > > } > > > > > > /* > > > Index: opensm/main.c > > > =================================================================== > > > --- opensm/main.c (revision 4288) > > > +++ opensm/main.c (working copy) > > > @@ -178,6 +178,12 @@ show_usage(void) > > > " This option will cause deletion of the log > file\n" > > > " (if it previously exists). By default, the log > file\n" > > > " is accumulative.\n\n"); > > > + printf( "-y\n" > > > + "--stay_on_fatal\n" > > > + " This option will cause SM not to exit on fatal > initialization\n" > > > + " issues: If SM discovers duplicated guids or > 12x link with\n" > > > + " lane reversal badly configured.\n" > > > + " By default, the SM will exit.\n\n"); > > > printf( "-v\n" > > > "--verbose\n" > > > " This option increases the log verbosity > level.\n" > > > @@ -460,7 +466,7 @@ main( > > > boolean_t cache_options = FALSE; > > > char *ignore_guids_file_name = NULL; > > > uint32_t val; > > > - const char * const short_option = "i:f:ed:g:l:s:t:a:uvVhorc"; > > > + const char * const short_option = "i:f:ed:g:l:s:t:a:uvVhorcy"; > > > > > > /* > > > In the array below, the 2nd parameter specified the number > > > @@ -492,6 +498,7 @@ main( > > > { "updn", 0, NULL, 'u'}, > > > { "add_guid_file", 1, NULL, 'a'}, > > > { "cache-options", 0, NULL, 'c'}, > > > + { "stay_on_fatal", 0, NULL, 'y'}, > > > { NULL, 0, NULL, 0 } /* Required at the end of > the array */ > > > }; > > > > > > @@ -665,6 +672,11 @@ main( > > > printf(" Creating new log file\n"); > > > break; > > > > > > + case 'y': > > > + opt.exit_on_fatal = FALSE; > > > + printf(" Staying on fatal initialization\n"); > > > + break; > > > + > > > case 'v': > > > log_flags = (log_flags <<1 )|1; > > > printf(" Verbose option -v (log flags = 0x%X)\n", log_flags > ); > > > > From halr at voltaire.com Mon Dec 5 10:47:07 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 5 Dec 2005 20:47:07 +0200 Subject: [openib-general] Re: [PATCH] Opensm - duplicated guids issue Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AB7F@taurus.voltaire.com> On Sun, 2005-12-04 at 08:02, Yael Kalka wrote: > Hi Hal, > > Currently if OpenSM discovers duplicated guids or 12x link with lane > reversal badly configured it only issues an error to the log > file. This issue, though, is much more problematic, since it will cause > part of the subnet to be un-initialized. > The following patch includes a fuller handling of the issue - first, > issue an error message to the /var/log/messeges file as well. > Second - add an option flag to the SM that will define wether or not > to exit on such case. Thanks. Applied. From mshefty at ichips.intel.com Mon Dec 5 11:02:38 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 05 Dec 2005 11:02:38 -0800 Subject: [openib-general] possible CMA bug In-Reply-To: <013701c5de5a$9124ad10$d5000a0a@STEVO> References: <013701c5de5a$9124ad10$d5000a0a@STEVO> Message-ID: <43948ECE.1010502@ichips.intel.com> Steve Wise wrote: > I'm using the new rdma cma interface and i've perhaps stumbled onto a > bug. I'm trying to bind to port 9999 on both IB ports of a mthca > device. The IPoIB interfaces for the HCA are configured as two seperate > subnets. The second rdma_listen() always fails with EBUSY. 
Maybe this > is a limitation in the CMA design, but TCP stacks allow binding to the > same port on different ip addresses. And the CMA interface allows it > too as long as the two ip addresses map to different IB devices. > Whether this should work or not, I am seeing a crash when I try to > destroy the cm_id after the rdma_listen() failure. This has been fixed in the latest check-in (revision 4313). - Sean From sean.hefty at intel.com Mon Dec 5 12:36:35 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 5 Dec 2005 12:36:35 -0800 Subject: [openib-general] SDP use of CMA Message-ID: I'd like to start a discussion of having the CMA support SDP. To do this, I'd like to propose adding the following optional call to the CMA: enum rdma_cm_protocol { RDMA_CM_DEFAULT, RDMA_CM_SDP }; /** * rdma_set_protocol - Associate a specific connection protocol with the * RDMA identifier. * * @id: RDMA identifier. * @protocol: Connection protocol. * @version: Connection protocol version, or 0 to use the default. * * This is an optional call that may be made to specify that the indicated * connection protocol be used when establishing connections to the * associated RDMA identifier. */ int rdma_set_protocol(struct rdma_cm_id *id, enum rdma_cm_protocol protocol, int version); For most users, this routine would not need to be called. If called by SDP, then the CMA would use SDP's service IDs and private data format. The CMA would be responsible for setting the local port number and source/destination IP addresses. All other fields would be set by the caller. Comments? - Sean From caitlin.bestler at gmail.com Mon Dec 5 13:30:56 2005 From: caitlin.bestler at gmail.com (Caitlin Bestler) Date: Mon, 5 Dec 2005 13:30:56 -0800 Subject: [openib-general] SDP use of CMA In-Reply-To: References: Message-ID: <469958e00512051330o7ef9af94jb5684f04e5e76185@mail.gmail.com> On 12/5/05, Sean Hefty wrote: > > I'd like to start a discussion of having the CMA support SDP. To do this, > I'd > like to propose adding the following optional call to the CMA: > > enum rdma_cm_protocol { > RDMA_CM_DEFAULT, > RDMA_CM_SDP > }; > > /** > * rdma_set_protocol - Associate a specific connection protocol with the > * RDMA identifier. > * > * @id: RDMA identifier. > * @protocol: Connection protocol. > * @version: Connection protocol version, or 0 to use the default. > * > * This is an optional call that may be made to specify that the indicated > * connection protocol be used when establishing connections to the > * associated RDMA identifier. > */ > int rdma_set_protocol(struct rdma_cm_id *id, enum rdma_cm_protocol > protocol, > int version); > > > For most users, this routine would not need to be called. If called by > SDP, > then the CMA would use SDP's service IDs and private data format. The CMA > would > be responsible for setting the local port number and source/destination IP > addresses. All other fields would be set by the caller. > > Comments? > > - Sean > > Who is the intended consumer of this API? My understanding is that there are few to zero end applications that use SDP knowingly. They use the sockets API, which is intercepted at one layer or another by a middleware library, and it is that middleware library that uses SDP. If SDP middleware libraries are the only users of SDP-style connection setup then it would make more sense to have a distinct method to serve that purpose rather than having an enum/option flag on the main method.
In particular I would not want end applications to be expected to "request SDP" merely to get an offloaded SOCK_STREAM connection. On an IB network the advantage of SDP over TCP/IP over IPoIB is a no-brainer. But the tradeoff between the host TCP/IP stack, an offload TCP/IP stack and SDP/iWARP is a much more complex tradeoff. Depending on who the envisioned user is we may need to distinguish between 'definitely use SDP, because I know my peer is using SDP' and 'offloaded by whatever mutually available methods'. Those questions are irrelevant if the call is made from the intercept library itself, as they were decided by controlling the intercept. But if the intercept library is the primary user of this option then I definitely think that a separate method is better than an option param. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Mon Dec 5 14:00:42 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 05 Dec 2005 14:00:42 -0800 Subject: [openib-general] SDP use of CMA In-Reply-To: <469958e00512051330o7ef9af94jb5684f04e5e76185@mail.gmail.com> References: <469958e00512051330o7ef9af94jb5684f04e5e76185@mail.gmail.com> Message-ID: <4394B88A.2090605@ichips.intel.com> Caitlin Bestler wrote: > Who is the intended consumer of this API? The SDP kernel module is the intended consumer. Currently SDP duplicates most of the functionality found in the CMA. > If SDP middleware libraries are the only users of SDP-style connection setup > then it would make more sense to have a distinct method to serve that > purpose > rather than having an enum/option flag on the main method. I'm open to alternate proposals. Please provide specific details. - Sean From mshefty at ichips.intel.com Mon Dec 5 14:27:20 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 05 Dec 2005 14:27:20 -0800 Subject: [openib-general] SDP use of CMA In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F10C2929@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F10C2929@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <4394BEC8.3020104@ichips.intel.com> Caitlin Bestler wrote: > Generally, I was advocating adding an extra method > that appends "_sdp" to the name rather than inserting > an "is_sdp" param. An option that supports this format would be to add a new call similar to: struct rdma_cm_id* sdp_create_id(rdma_cm_event_handler event_handler, void *context); - Sean From mst at mellanox.co.il Mon Dec 5 21:54:58 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 6 Dec 2005 07:54:58 +0200 Subject: [openib-general] Re: SDP use of CMA In-Reply-To: References: Message-ID: <20051206055457.GA13071@mellanox.co.il> Quoting r. Sean Hefty : > Subject: SDP use of CMA > > I'd like to start a discussion of having the CMA support SDP. To do > this, I'd > like to propose adding the following optional call to the CMA: > > enum rdma_cm_protocol { > RDMA_CM_DEFAULT, > RDMA_CM_SDP > }; > > /** > * rdma_set_protocol - Associate a specific connection protocol with the > * RDMA identifier. > * > * @id: RDMA identifier. > * @protocol: Connection protocol. > * @version: Connection protocol version, or 0 to use the default. > * > * This is an optional call that may be made to specify that the > indicated > * connection protocol be used when establishing connections to the > * associated RDMA identifier.
> */ > int rdma_set_protocol(struct rdma_cm_id *id, enum rdma_cm_protocol > protocol, > int version); > > > For most users, this routine would not need to be called. If called by > SDP, > then the CMA would use SDP's service IDs and private data format. The > CMA would > be responsible for setting the local port number and source/destination > IP > addresses. All other fields would be set by the caller. > > Comments? > > - Sean Fine with me. -- MST From mst at mellanox.co.il Mon Dec 5 21:55:28 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 6 Dec 2005 07:55:28 +0200 Subject: [openib-general] Re: SDP use of CMA In-Reply-To: <4394BEC8.3020104@ichips.intel.com> References: <4394BEC8.3020104@ichips.intel.com> Message-ID: <20051206055528.GB13071@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: SDP use of CMA > > Caitlin Bestler wrote: > > Generally, I was advocating adding an extra method > > that appends "_sdp" to the name rather than inserting > > an "is_sdp" param. > > An option that supports this format would be to add a new call similar to: > > struct rdma_cm_id* sdp_create_id(rdma_cm_event_handler event_handler, > void *context); > > - Sean That's fine with me, too. -- MST From rjwalsh at pathscale.com Mon Dec 5 21:59:13 2005 From: rjwalsh at pathscale.com (Robert Walsh) Date: Mon, 05 Dec 2005 21:59:13 -0800 Subject: [openib-general] ip_dev_find resolution? Message-ID: <1133848753.15727.11.camel@phosphene.durables.org> Hi all, There was some discussion back in Sep/Oct about ip_dev_find. Was there ever a resolution to this? Are we just waiting to get the modules that use it put into the kernel so we can justify getting it re-exported once again? Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. From ianjiang.ict at gmail.com Tue Dec 6 00:00:36 2005 From: ianjiang.ict at gmail.com (Ian Jiang) Date: Tue, 6 Dec 2005 16:00:36 +0800 Subject: [openib-general] Want to learn more about the FMR Message-ID: <7b2fa1820512060000n6ce66e7eo2ab1278fa30a1358@mail.gmail.com> It is said in "Zero Copy Sockets Direct Protocol over InfiniBand - Preliminary Implementation and Performance Analysis" that the FMR is a Mellanox feature extending the 1.1 InfiniBand specification, and a similar feature was added later on to the 1.2 InfiniBand specification. But I found nothing about the FMR in "InfiniBand™ Architecture Specification Release 1.2" (http://www.infinibandta.org/specs/register/publicspec/). I read the description of FMR related verbs in "Mellanox IB-Verbs API (VAPI) 1.00", however I am not very clear about the difference between FMR registration and ordinary registration. Is there a more detailed description? Any suggestion is appreciated! -- Ian Jiang ianjiang.ict at gmail.com Laboratory of Spatial Information Technology Division of System Architecture Institute of Computing Technology Chinese Academy of Sciences -------------- next part -------------- An HTML attachment was scrubbed... URL: From eitan at mellanox.co.il Tue Dec 6 00:31:42 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 06 Dec 2005 10:31:42 +0200 Subject: [openib-general] [PATCH] osm: change info message to debug Message-ID: <861x0qsk29.fsf@mtl066.yok.mtl.com> Hi Hal The umad vendor provides an info message every time the osm_vendor_get_all_port_attr is invoked. This patch makes it a debug message.
Thanks Eitan Signed-off-by: Eitan Zahavi Index: osm/libvendor/osm_vendor_ibumad.c =================================================================== --- osm/libvendor/osm_vendor_ibumad.c (revision 4317) +++ osm/libvendor/osm_vendor_ibumad.c (working copy) @@ -630,7 +630,7 @@ osm_vendor_get_all_port_attr( lids[0] = def_port.base_lid; linkstates[0] = def_port.state; - osm_log( p_vend->p_log, OSM_LOG_INFO, + osm_log( p_vend->p_log, OSM_LOG_DEBUG, "osm_vendor_get_all_port_attr: " "assign CA %s port %d guid (0x%"PRIx64") as the default port\n", def_port.ca_name, def_port.portnum, From halr at voltaire.com Tue Dec 6 03:27:49 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Dec 2005 06:27:49 -0500 Subject: [openib-general] ip_dev_find resolution? In-Reply-To: <1133848753.15727.11.camel@phosphene.durables.org> References: <1133848753.15727.11.camel@phosphene.durables.org> Message-ID: <1133868031.4587.20000.camel@hal.voltaire.com> Hi Robert, On Tue, 2005-12-06 at 00:59, Robert Walsh wrote: > Hi all, > > There was some discussion back in Sep/Oct about ip_dev_find. Was there > ever a resolution to this? Are we just waiting to get the modules that > use it put into the kernel so we can justify getting it re-exported once > again? Yes. At one point, Grant had indicated that IPmc might need that but I'm not sure how that was resolved. -- Hal > > Regards, > Robert.
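For readers following the thread, a sketch of what a consumer such as the CMA wants from the export under discussion, assuming the 2.6-era signature struct net_device *ip_dev_find(u32 addr), which returns the device with a reference held; if the export stays private, the fallback is walking dev_base under dev_base_lock, as comes up later in this thread.

#include <linux/errno.h>
#include <linux/netdevice.h>
#include <net/route.h>	/* ip_dev_find() declaration; exact header varied by tree */

/* Sketch only: resolve a local IPv4 address (network byte order) to
 * its net_device; the IB-device translation is a placeholder. */
static int bind_to_local_ip(u32 ipaddr)
{
	struct net_device *dev;

	dev = ip_dev_find(ipaddr);	/* reference held on success */
	if (!dev)
		return -EADDRNOTAVAIL;

	/* ... map dev to an IB device and port here, e.g. via its GID ... */

	dev_put(dev);			/* drop the reference */
	return 0;
}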

From halr at voltaire.com Tue Dec 6 03:37:57 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Dec 2005 06:37:57 -0500 Subject: [openib-general] Re: [PATCH] osm: change info message to debug In-Reply-To: <861x0qsk29.fsf@mtl066.yok.mtl.com> References: <861x0qsk29.fsf@mtl066.yok.mtl.com> Message-ID: <1133869075.4587.20098.camel@hal.voltaire.com> On Tue, 2005-12-06 at 03:31, Eitan Zahavi wrote: > Hi Hal > > The umad vendor provides an info message every time the > osm_vendor_get_all_port_attr is invoked. This patch makes it a debug > message. Thanks. Applied. From yael at mellanox.co.il Tue Dec 6 04:02:49 2005 From: yael at mellanox.co.il (Yael Kalka) Date: 06 Dec 2005 14:02:49 +0200 Subject: [openib-general] [PATCH] Opensm - add node record dumping Message-ID: <5zacfeh1qu.fsf@mtl066.yok.mtl.com> Hi Hal, The following code exists at least in several of the osm_sa_*_record.c files, but is missing in the osm_sa_node_record.c. When running with debug level - add a dump of the node record sent in the request. Thanks, Yael Signed-off-by: Yael Kalka Index: opensm/osm_sa_node_record.c =================================================================== --- opensm/osm_sa_node_record.c (revision 4319) +++ opensm/osm_sa_node_record.c (working copy) @@ -467,6 +467,9 @@ osm_nr_rcv_process( goto Exit; } + if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) + osm_dump_node_record( p_rcv->p_log, p_rcvd_rec, OSM_LOG_DEBUG ); + cl_qlist_init( &rec_list ); context.p_rcvd_rec = p_rcvd_rec; From halr at voltaire.com Tue Dec 6 04:15:19 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Dec 2005 07:15:19 -0500 Subject: [openib-general] Re: [PATCH] Opensm - add node record dumping In-Reply-To: <5zacfeh1qu.fsf@mtl066.yok.mtl.com> References: <5zacfeh1qu.fsf@mtl066.yok.mtl.com> Message-ID: <1133871318.4587.20286.camel@hal.voltaire.com> Hi Yael, On Tue, 2005-12-06 at 07:02, Yael Kalka wrote: > Hi Hal, > > The following code exists at least in several of the osm_sa_*_record.c > files, but is missing in the osm_sa_node_record.c. > When running with debug level - add a dump of the node record sent > in the request. Thanks. Applied. Does the same thing apply to any other SA records which are not currently dumped on debug ? -- Hal From yael at mellanox.co.il Tue Dec 6 04:44:02 2005 From: yael at mellanox.co.il (Yael Kalka) Date: Tue, 6 Dec 2005 14:44:02 +0200 Subject: [openib-general] RE: [PATCH] Opensm - add node record dumping Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30E246C@mtlexch01.mtl.com> Didn't look at all of them, but I know at least some of them include the record dumping. -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Tuesday, December 06, 2005 2:15 PM To: Yael Kalka Cc: openib-general at openib.org; Eitan Zahavi Subject: Re: [PATCH] Opensm - add node record dumping Hi Yael, On Tue, 2005-12-06 at 07:02, Yael Kalka wrote: > Hi Hal, > > The following code exists at least in several of the osm_sa_*_record.c > files, but is missing in the osm_sa_node_record.c. > When running with debug level - add a dump of the node record sent > in the request. Thanks. Applied. Does the same thing apply to any other SA records which are not currently dumped on debug ?
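To illustrate the question, the same two-line guard from the patch above could be dropped into each handler that doesn't dump yet. Sketch only: the dump helper below is a hypothetical name, since each record type has its own dump routine (check osm_helper.h for the real ones):

  /* e.g. in osm_sa_sminfo_record.c, mirroring the node record change;
   * osm_dump_sm_info_record is hypothetical */
  if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) )
    osm_dump_sm_info_record( p_rcv->p_log, p_rcvd_rec, OSM_LOG_DEBUG );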
-- Hal From ianjiang.ict at gmail.com Tue Dec 6 04:52:13 2005 From: ianjiang.ict at gmail.com (Ian Jiang) Date: Tue, 6 Dec 2005 20:52:13 +0800 Subject: [openib-general] [kDAPL]How to register a vmalloc() allocated buffer In-Reply-To: References: <7b2fa1820512030945j22e205d9j86a3b8e7bd709182@mail.gmail.com> Message-ID: <7b2fa1820512060452p28a7a552w3b68b57513b3c80d@mail.gmail.com> Hi James, You are always so kind! Now I have a question about reading a buffer of an application in user space. Is using the uDAPL the only way to do this? I used to have an idea like this: The application in user space gives the virtual start address and length of its data buffer to a kernel module program. This kernel program acts as an application of the kDAPL and registers the user space data buffer with the kDAPL, then requests an RDMA read operation to complete the data transfer. But I think it is not feasible after getting your last reply. Am I right? Please give some suggestion and thanks very much! On 12/6/05, James Lentini wrote: > Hi Ian, > An IB HCA needs to be able to DMA the memory used for RDMA read. Since > vmalloc does not guarantee that the memory it returns can be accessed > via DMA, you should not use vmalloc. > james -- Ian Jiang ianjiang.ict at gmail.com Laboratory of Spatial Information Technology Division of System Architecture Institute of Computing Technology Chinese Academy of Sciences -------------- next part -------------- An HTML attachment was scrubbed... URL: From caitlinb at broadcom.com Tue Dec 6 05:24:39 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 6 Dec 2005 05:24:39 -0800 Subject: [openib-general] [kDAPL]How to register a vmalloc() allocated buffer Message-ID: <54AD0F12E08D1541B826BE97C98F99F10C29C3@NT-SJCA-0751.brcm.ad.broadcom.com> An option like that was discussed in RNIC-PI, but is not generally explicitly supported. What the kernel daemon must do is map the user-space to bus/io addresses and then physically register that as a Memory Region (or LMR if working at the DAT layer). If the user-mode application is not explicitly involved in identifying what buffers are going to be used (i.e., registering the memory) then you won't achieve the full efficiency of RDMA. Creating memory regions per operation at the verb layer is expensive and interferes with pipelining. Full fast-memory-register work requests are defined in iWARP and the latest IBTA spec, but have not made it into wide deployment yet. Unless the user-kernel daemon interface is something you are inheriting I would not recommend this approach. ________________________________ From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Ian Jiang Sent: Tuesday, December 06, 2005 4:52 AM To: James Lentini Cc: openib-general Subject: Re: [openib-general] [kDAPL]How to register a vmalloc() allocated buffer Hi James, You are always so kind! Now I have a question about reading a buffer of an application in user space. Is using the uDAPL the only way to do this? I used to have an idea like this: The application in user space gives the virtual start address and length of its data buffer to a kernel module program. This kernel program acts as an application of the kDAPL and registers the user space data buffer with the kDAPL, then requests an RDMA read operation to complete the data transfer. But I think it is not feasible after getting your last reply. Am I right? Please give some suggestion and thanks very much!
On 12/6/05, James Lentini wrote: Hi Ian, An IB HCA needs to be able to DMA the memory used for RDMA read. Since vmalloc does not guarantee that the memory it returns can be accessed via DMA, you should not use vmalloc. james -- Ian Jiang ianjiang.ict at gmail.com Laboratory of Spatial Information Technology Division of System Architecture Institute of Computing Technology Chinese Academy of Sciences -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Tue Dec 6 06:05:36 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Dec 2005 09:05:36 -0500 Subject: [openib-general] [PATCH] OpenSM: SA SMInfoRecord should support GetTable as well as Get method Message-ID: <1133877936.4587.20821.camel@hal.voltaire.com> OpenSM: SA SMInfoRecord should support GetTable as well as Get method Signed-off-by: Hal Rosenstock Index: osm_sa_sminfo_record.c =================================================================== --- osm_sa_sminfo_record.c (revision 4323) +++ osm_sa_sminfo_record.c (working copy) @@ -165,7 +165,8 @@ osm_smir_rcv_process( CL_ASSERT( p_sa_mad->attr_id == IB_MAD_ATTR_SMINFO_RECORD ); - if (p_sa_mad->method != IB_MAD_METHOD_GET) + if ( (p_sa_mad->method != IB_MAD_METHOD_GET) && + (p_sa_mad->method != IB_MAD_METHOD_GETTABLE) ) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, "osm_smir_rcv_process: ERR 2804: " From halr at voltaire.com Tue Dec 6 06:11:40 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Dec 2005 09:11:40 -0500 Subject: [openib-general] RE: [PATCH] Opensm - add node record dumping In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30E246C@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30E246C@mtlexch01.mtl.com> Message-ID: <1133877977.4587.20828.camel@hal.voltaire.com> On Tue, 2005-12-06 at 07:44, Yael Kalka wrote: > Didn't look at all of them, but I know at least some of them include the > record dumping. It looks to me like the following supported SA records don't currently do this: sminfo vlarb slvl pkey lft -- Hal > > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Tuesday, December 06, 2005 2:15 PM > To: Yael Kalka > Cc: openib-general at openib.org; Eitan Zahavi > Subject: Re: [PATCH] Opensm - add node record dumping > > > Hi Yael, > > On Tue, 2005-12-06 at 07:02, Yael Kalka wrote: > > Hi Hal, > > > > The following code exists at least in several of the osm_sa_*_record.c > > > files, but is missing in the osm_sa_node_record.c. > > When running with debug level - add a dump of the node record sent > > in the request. > > Thanks. Applied. > > Does the same thing apply to any other SA records which are not > currently dumped on debug ? > > -- Hal From tom at opengridcomputing.com Tue Dec 6 07:34:43 2005 From: tom at opengridcomputing.com (Tom Tucker) Date: Tue, 06 Dec 2005 09:34:43 -0600 Subject: [openib-general] SDP use of CMA In-Reply-To: <4394BEC8.3020104@ichips.intel.com> References: <54AD0F12E08D1541B826BE97C98F99F10C2929@NT-SJCA-0751.brcm.ad.broadcom.com> <4394BEC8.3020104@ichips.intel.com> Message-ID: <1133883283.11138.6.camel@trinity.austin.ammasso.com> Not to jump in late on this, but why couldn't we just add a protocol parameter to the create_id call. 
Then it is arbitrarily extensible a la socket(AF_INET, SOCK_STREAM, ) So what I'm specifically suggesting is: struct rdma_cm_id* rdma_create_id(rdma_cm_event_handler* cm_handler, void* context, rdma_cm_proto proto); Then we don't need a new call, it's extensible to new protocols, and it's not a single purpose is_sdp parameter. On Mon, 2005-12-05 at 14:27 -0800, Sean Hefty wrote: > Caitlin Bestler wrote: > > Generally, I was advocating adding an extra method > > that appends "_sdp" to the name rather than inserting > > an "is_sdp" param. > > An option that supports this format would be to add a new call similar to: > > struct rdma_cm_id* sdp_create_id(rdma_cm_event_handler event_handler, > void *context); > > - Sean > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mst at mellanox.co.il Tue Dec 6 07:38:55 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 6 Dec 2005 17:38:55 +0200 Subject: [openib-general] Re: SDP use of CMA In-Reply-To: <1133883283.11138.6.camel@trinity.austin.ammasso.com> References: <1133883283.11138.6.camel@trinity.austin.ammasso.com> Message-ID: <20051206153855.GG21035@mellanox.co.il> Quoting r. Tom Tucker : > Subject: Re: SDP use of CMA > > Not to jump in late on this, but why couldn't we just add a protocol > parameter to the create_id call. Then it is arbitrarily extensible a la > socket(AF_INET, SOCK_STREAM, ) > > > So what I'm specifically suggesting is: > > struct rdma_cm_id* rdma_create_id(rdma_cm_event_handler* cm_handler, > void* context, > rdma_cm_proto proto); Makes sense. We'd need to define rdma_cm_proto values I guess: SDP, default, .... -- MST From dotanb at mellanox.co.il Tue Dec 6 07:48:13 2005 From: dotanb at mellanox.co.il (Dotan Barak) Date: Tue, 6 Dec 2005 17:48:13 +0200 Subject: [openib-general] RE: can i post a send request with 0 bytes with the inline bit enabled? Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3B8D436@mtlexch01.mtl.com> > > I guess we might as well fix it. I checked in the following patch. > > - R. > Sorry about the delay. I checked this patch: posting SR with 0 s/g list length with the inline flag enabled works only for tavor. For memfree devices I still get a completion with error. thanks Dotan From steve.apo at googlemail.com Tue Dec 6 08:05:26 2005 From: steve.apo at googlemail.com (Steven Wooding) Date: Tue, 6 Dec 2005 16:05:26 +0000 Subject: [openib-general] UC connection server Message-ID: <2cfcf21e0512060805u466c9d83m@mail.gmail.com> Hi, I wonder if anybody could give some advice about an idea for making UC connections with a device that doesn't support a CM (it's a custom-made embedded device). The idea is to use a PC-based stack that does use the standard CM interface. I can then make a connection with that. The PC then gets the info about the real QP from the embedded device via some proprietary method. The problem with this idea is that in the standard CM protocol, it forms the connection using the LID that the REQ was sent to. But I need to change this to the LID of the embedded device. I've looked at doing path migration which looked like it might do this, but I could do with some advice. For example, in path migration, does the original connection remain? Any other suggestions are welcome (I know I could do an Ethernet connection with the PC and exchange the info that way, but that's a last resort at the moment).
Thanks for your time. Cheers, Steve. From steve.apo at googlemail.com Tue Dec 6 08:16:17 2005 From: steve.apo at googlemail.com (Steven Wooding) Date: Tue, 6 Dec 2005 16:16:17 +0000 Subject: [openib-general] Relaying data through an HCA card Message-ID: <2cfcf21e0512060816r6ea2083fr@mail.gmail.com> Hi, I have the requirement for a PC that acts as a data relay. I basically need to pass data from an input QP connection to an output QP connection on the PC. Could this be done entirely within the HCA card, without touching system memory or using a userspace application to supervise the data? This is so the data throughput remains as high as possible. Thanks for your time. Cheers, Steve. From dotanb at mellanox.co.il Tue Dec 6 08:23:40 2005 From: dotanb at mellanox.co.il (Dotan Barak) Date: Tue, 6 Dec 2005 18:23:40 +0200 Subject: [openib-general] RE: can i post a send request with 0 bytes with the inline bit enabled? Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3B8D453@mtlexch01.mtl.com> > > > > > I guess we might as well fix it. I checked in the following patch. > > > > - R. > > > > Sorry about the delay. > > I checked this patch: > posting SR with 0 s/g list length with the inline flag > enabled works only for tavor. > For memfree devices I still get a completion with error. > > > thanks > Dotan > I'm sorry about my last email (I had a mess in my sources): I checked the patch and it works for tavor and memfree devices as well. thanks Dotan From caitlinb at broadcom.com Tue Dec 6 08:10:12 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 6 Dec 2005 08:10:12 -0800 Subject: [openib-general] SDP use of CMA Message-ID: <54AD0F12E08D1541B826BE97C98F99F10C29E5@NT-SJCA-0751.brcm.ad.broadcom.com> Tom Tucker wrote: > Not to jump in late on this, but why couldn't we just add a > protocol parameter to the create_id call. Then it is > arbitrarily extensible a la socket(AF_INET, SOCK_STREAM, ) > > > So what I'm specifically suggesting is: > > struct rdma_cm_id* rdma_create_id(rdma_cm_event_handler* cm_handler, > void* context, rdma_cm_proto proto); > > > Then we don't need a new call, it's extensible to new > protocols, and it's not a single purpose is_sdp parameter. > That's an excellent suggestion, if we think that this is an area that will be extensible. So far in iWARP we have RDMAC MPA, SDP, IETF MPA using unstructured private data and IETF MPA using the IT-API structured private data. And in IB we have unstructured private data and TCP compatible connection setup private data. If we're confident that there is a dominant one for each transport, and the others are merely transitional relics, then the extra methods make the most sense. If we don't think things are really settled then the 'proto' argument makes a lot of sense. The crux question remains though, will there ever be a caller that does not specify the 'proto' as a constant? If there's a scenario for that, then having a parameter in the call rather than a case statement in the caller makes a lot of sense. But if every actual use will select a constant value then what is gained by having a single method?
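To make the crux concrete, here is how the two shapes proposed in this thread would look at a call site. Sketch only: neither signature is merged API (both rdma_create_id-with-proto and sdp_create_id are proposals from this thread), and sdp_event_handler is a hypothetical callback. In both shapes the protocol is a compile-time constant, which is exactly the point above.

static struct rdma_cm_id *listen_id;

static void create_sdp_listen_id(void)
{
	/* Shape (a): protocol passed as a parameter -- in practice the
	 * caller always supplies a constant. */
	listen_id = rdma_create_id(sdp_event_handler, NULL, RDMA_CM_SDP);

	/* Shape (b): a distinct per-protocol entry point -- the constant
	 * moves into the function name instead. (One shape or the other
	 * would exist, not both.) */
	listen_id = sdp_create_id(sdp_event_handler, NULL);
}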
From mshefty at ichips.intel.com Tue Dec 6 09:31:46 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 06 Dec 2005 09:31:46 -0800 Subject: [openib-general] SDP use of CMA In-Reply-To: <1133883283.11138.6.camel@trinity.austin.ammasso.com> References: <54AD0F12E08D1541B826BE97C98F99F10C2929@NT-SJCA-0751.brcm.ad.broadcom.com> <4394BEC8.3020104@ichips.intel.com> <1133883283.11138.6.camel@trinity.austin.ammasso.com> Message-ID: <4395CB02.9010602@ichips.intel.com> Tom Tucker wrote: > Not to jump in late on this, but why couldn't we just add a protocol > parameter to the create_id call. Then it is arbitrarily extensible a la > socket(AF_INET, SOCK_STREAM, ) > > So what I'm specifically suggesting is: > > struct rdma_cm_id* rdma_create_id(rdma_cm_event_handler* cm_handler, > void* context, > rdma_cm_proto proto); This isn't much different than having a separate call to set the protocol. I went with a separate protocol API to add in version information as well. Changing just the create_id call makes sense. - Sean From iod00d at hp.com Tue Dec 6 09:43:55 2005 From: iod00d at hp.com (Grant Grundler) Date: Tue, 6 Dec 2005 09:43:55 -0800 Subject: [openib-general] ip_dev_find resolution? In-Reply-To: <1133868031.4587.20000.camel@hal.voltaire.com> References: <1133848753.15727.11.camel@phosphene.durables.org> <1133868031.4587.20000.camel@hal.voltaire.com> Message-ID: <20051206174355.GB21980@esmail.cup.hp.com> On Tue, Dec 06, 2005 at 06:27:49AM -0500, Hal Rosenstock wrote: > Hi Robert, > > On Tue, 2005-12-06 at 00:59, Robert Walsh wrote: > > Hi all, > > > > There was some discussion back in Sep/Oct about ip_dev_find. Was there > > ever a resolution to this? Not that I'm aware of. > > Are we just waiting to get the modules that > > use it put into the kernel so we can justify getting it re-exported once > > again? That would be a good approach. > Yes. At one point, Grant had indicated that IPmc might need that but I'm > not sure how that was resolved. IPmc? Oh! IP_MROUTE. But IP_MROUTE doesn't need ip_dev_find exported since IP_MROUTE code can't be built as a module. My original email is here: http://openib.org/pipermail/openib-general/2005-November/013563.html Original email thread starts here: http://openib.org/pipermail/openib-general/2005-November/013471.html thanks, grant From mshefty at ichips.intel.com Tue Dec 6 09:43:50 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 06 Dec 2005 09:43:50 -0800 Subject: [openib-general] UC connection server In-Reply-To: <2cfcf21e0512060805u466c9d83m@mail.gmail.com> References: <2cfcf21e0512060805u466c9d83m@mail.gmail.com> Message-ID: <4395CDD6.1080601@ichips.intel.com> Steven Wooding wrote: > The idea is to use a PC-based stack that does use the standard CM > interface. I can then make a connection with that. The PC then gets > the info about the real QP from the embedded device via some > proprietary method. You can use the existing connection to exchange the QP information, similar to how you would exchange the information over Ethernet. Once the other QPs are set up, you can either tear down the existing connection or use it to connect other QPs. > The problem with this idea is that in the standard CM protocol, it > forms the connection using the LID that the REQ was sent to. But I > need to change this to the LID of the embedded device. I've looked at > doing path migration which looked like it might do this, but I could > do with some advice.
Just a thought, and likely not a very good one, but you could reject the REQ, then issue a new REQ from the PC that contained the new LID. The CM might need some changes to support this. I think that CM port redirection (reject code 25) may also give you what you want. Supporting this in the CM is a little tricky, however. - Sean From mst at mellanox.co.il Tue Dec 6 09:56:52 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 6 Dec 2005 19:56:52 +0200 Subject: [openib-general] Re: ip_dev_find resolution? In-Reply-To: <20051206174355.GB21980@esmail.cup.hp.com> References: <20051206174355.GB21980@esmail.cup.hp.com> Message-ID: <20051206175652.GC22860@mellanox.co.il> Quoting r. Grant Grundler : > Subject: Re: ip_dev_find resolution? > > On Tue, Dec 06, 2005 at 06:27:49AM -0500, Hal Rosenstock wrote: > > Hi Robert, > > > > On Tue, 2005-12-06 at 00:59, Robert Walsh wrote: > > > Hi all, > > > > > > There was some discussion back in Sep/Oct about ip_dev_find. Was > there > > > ever a resolution to this? > > Not that I'm aware of. > > > > Are we just waiting to get the modules that > > > use it put into the kernel so we can justify getting it re-exported > once > > > again? > > That would be a good approach. > > > Yes. At one point, Grant had indicated that IPmc might need that but > I'm > > not sure how that was resolved. > > IPmc? Oh! IP_MROUTE. But IP_MROUTE doesn't need ip_dev_find exported > since IP_MROUTE code can't be built as a module. > > My original email is here: > http://openib.org/pipermail/openib-general/2005-November/013563.html > > Original email thread starts here: > http://openib.org/pipermail/openib-general/2005-November/013471.html Actually, I wonder whether instead of ip_dev_find we can just read_lock(&dev_base_lock); for (dev = dev_base; dev; dev = dev->next) { and check the ip address? If this works, this has the advantage of supporting IPv6 as well. MST -- MST From ftillier at silverstorm.com Tue Dec 6 10:14:47 2005 From: ftillier at silverstorm.com (Fabian Tillier) Date: Tue, 6 Dec 2005 10:14:47 -0800 Subject: [openib-general] UC connection server In-Reply-To: <2cfcf21e0512060805u466c9d83m@mail.gmail.com> References: <2cfcf21e0512060805u466c9d83m@mail.gmail.com> Message-ID: <79ae2f320512061014u31811f8bwafc5ed37acd4f97c@mail.gmail.com> Hi Steve, On 12/6/05, Steven Wooding wrote: > Hi, > > The idea is to use a PC-based stack that does use the standard CM > interface. I can then make a connection with that. The PC then gets > the info about the real QP from the embedded device via some > proprietary method. > > The problem with this idea is that in the standard CM protocol, it > forms the connection using the LID that the REQ was sent to. Actually, it's the other way around. The CM uses the LID from the path information in the REQ as the destination of the MAD. There is no way currently to send a CM REQ to another LID than the one indicated in the REQ's path information.
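A sketch of why that is, using the gen2 CM request structure (field names as in ib_cm.h of this period; treat as illustrative): the REQ's path record supplies both the connection's destination and the address the CM MAD itself is sent to, so there is no separate knob for the MAD's target.

#include <rdma/ib_cm.h>

/* cm_id and path are assumed to be set up by the caller. */
static int send_req(struct ib_cm_id *cm_id, struct ib_sa_path_rec *path)
{
	struct ib_cm_req_param param;

	memset(&param, 0, sizeof param);
	param.primary_path = path;	/* path->dlid doubles as the REQ
					 * MAD's destination LID */
	/* ... fill in service_id, qp_num, etc., then: */
	return ib_send_cm_req(cm_id, &param);
}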
- Fab From ftillier at silverstorm.com Tue Dec 6 10:18:18 2005 From: ftillier at silverstorm.com (Fabian Tillier) Date: Tue, 6 Dec 2005 10:18:18 -0800 Subject: [openib-general] UC connection server In-Reply-To: <4395CDD6.1080601@ichips.intel.com> References: <2cfcf21e0512060805u466c9d83m@mail.gmail.com> <4395CDD6.1080601@ichips.intel.com> Message-ID: <79ae2f320512061018m61a445d9xeee4e37bfb326974@mail.gmail.com> On 12/6/05, Sean Hefty wrote: > Steven Wooding wrote: > > The idea is to use a PC-based stack that does use the standard CM > > interface. I can then make a connection with that. The PC then gets > > the info about the real QP from the embedded device via some > > proprietary method. > > You can use the existing connection to exchange the QP information, similar to > how you would exchange the information over Ethernet. Once the other QPs are > setup, you can either teardown the existing connection or use it to connect > other QPs. I think what Steve wants to do is issue a REQ, send it to the PC, but have the path record go to his embedded device (which has a different LID). The CM protocol supports this, but the implementation of the CM looks at the path record to determine the destination of the CM MADs. Supporting this would require some way for the user to set the target of the CM MADs independently of the path information contained in the REQ. Adding an optional extra path record for the CM path might do the trick. - Fab From iod00d at hp.com Tue Dec 6 10:23:38 2005 From: iod00d at hp.com (Grant Grundler) Date: Tue, 6 Dec 2005 10:23:38 -0800 Subject: [openib-general] [kDAPL]How to register a vmalloc() allocated buffer In-Reply-To: <7b2fa1820512060452p28a7a552w3b68b57513b3c80d@mail.gmail.com> References: <7b2fa1820512030945j22e205d9j86a3b8e7bd709182@mail.gmail.com> <7b2fa1820512060452p28a7a552w3b68b57513b3c80d@mail.gmail.com> Message-ID: <20051206182338.GC21980@esmail.cup.hp.com> On Tue, Dec 06, 2005 at 08:52:13PM +0800, Ian Jiang wrote: > Hi James, > You are always so kind! > Now I have a question about reading a buffer of a application in user space. > Is it the only way to use the uDAPL? > I used to have an idea like this: > The application in user space gives the virtual start address and length of > its data buffer to a kernel module program. This kernel program acts as a > application of the kDAPL and registers the user space data buffer with the > kDAPl, Ian, If you are doing this with OpenIB, my advice is to NOT start with kDAPL. AFAICT, kDAPL is going away once any dependencies on it are resolved. And it's clearly not going to be pushed to kernel.org source trees. ISTR Dan Bar Dov wrote iSER was no longer dependent on kDAPL but not sure if that was the only module. > then request a RDMA read operation to complete the data transferring. > But I think it is not feasible after getting your last reply. Am I right? > Please give some suggestion and thanks very much! In general, a kernel module can map a user space address to a "DMA Address". OpenIB code has interfaces to register the "DMA Address" with the IB card. grant From mst at mellanox.co.il Tue Dec 6 10:57:52 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 6 Dec 2005 20:57:52 +0200 Subject: [openib-general] Re: SDP use of CMA In-Reply-To: <4395CB02.9010602@ichips.intel.com> References: <4395CB02.9010602@ichips.intel.com> Message-ID: <20051206185752.GD23088@mellanox.co.il> Quoting r. 
Sean Hefty : > Subject: Re: SDP use of CMA > > Tom Tucker wrote: > > Not to jump in late on this, but why couldn't we just add a protocol > > parameter to the create_id call. Then it is arbitrarily extensible ala > > socket(AF_INET, SOCK_STREAM, ) > > > > So what I'm specifically suggesting is: > > > > struct rdma_cm_id* rdma_create_id(rdma_cm_event_handler* cm_handler, > > void* context, > > rdma_cm_proto proto); > > This isn't much different that having a separate call to set the protocol. I > went with a separate protocol API to add in version information as well. > Changing just the create_id call makes sense. > > - Sean I also have this notion that it might be a good idea to put the protocol in the reserved bits in the service id. Makes sense? -- MST From mst at mellanox.co.il Tue Dec 6 10:48:14 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 6 Dec 2005 20:48:14 +0200 Subject: [openib-general] [PATCH] core: segmented rmpp sends Message-ID: <20051206184814.GC23088@mellanox.co.il> With the following in place we are able to perform very large RMPP transfers. Please comment. --- Modify the rmpp mad support to accept a linked list of segments instead of a large physically contigious buffer. The list is kept in mad_send_wr private data and constructed with new ib_append_to_multipacket_mad API call. Modify user_mad.c to allocate large MADs for send/receive by chunks. Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. Tsirkin Index: latest/drivers/infiniband/core/mad_rmpp.c =================================================================== --- latest.orig/drivers/infiniband/core/mad_rmpp.c +++ latest/drivers/infiniband/core/mad_rmpp.c @@ -433,44 +433,6 @@ static struct ib_mad_recv_wc * complete_ return rmpp_wc; } -void ib_coalesce_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, void *buf) -{ - struct ib_mad_recv_buf *seg_buf; - struct ib_rmpp_mad *rmpp_mad; - void *data; - int size, len, offset; - u8 flags; - - len = mad_recv_wc->mad_len; - if (len <= sizeof(struct ib_mad)) { - memcpy(buf, mad_recv_wc->recv_buf.mad, len); - return; - } - - offset = data_offset(mad_recv_wc->recv_buf.mad->mad_hdr.mgmt_class); - - list_for_each_entry(seg_buf, &mad_recv_wc->rmpp_list, list) { - rmpp_mad = (struct ib_rmpp_mad *)seg_buf->mad; - flags = ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr); - - if (flags & IB_MGMT_RMPP_FLAG_FIRST) { - data = rmpp_mad; - size = sizeof(*rmpp_mad); - } else { - data = (void *) rmpp_mad + offset; - if (flags & IB_MGMT_RMPP_FLAG_LAST) - size = len; - else - size = sizeof(*rmpp_mad) - offset; - } - - memcpy(buf, data, size); - len -= size; - buf += size; - } -} -EXPORT_SYMBOL(ib_coalesce_recv_mad); - static struct ib_mad_recv_wc * continue_rmpp(struct ib_mad_agent_private *agent, struct ib_mad_recv_wc *mad_recv_wc) @@ -570,16 +532,26 @@ start_rmpp(struct ib_mad_agent_private * return mad_recv_wc; } -static inline u64 get_seg_addr(struct ib_mad_send_wr_private *mad_send_wr) +static inline void * get_seg_addr(struct ib_mad_send_wr_private *mad_send_wr) { - return mad_send_wr->sg_list[0].addr + mad_send_wr->data_offset + - (sizeof(struct ib_rmpp_mad) - mad_send_wr->data_offset) * - (mad_send_wr->seg_num - 1); + struct ib_mad_multipacket_seg *seg; + int i = 2; + + if (list_empty(&mad_send_wr->multipacket_list)) + return NULL; + + list_for_each_entry(seg, &mad_send_wr->multipacket_list, list) { + if (i == mad_send_wr->seg_num) + return seg->data; + i++; + } + return NULL; } -static int send_next_seg(struct ib_mad_send_wr_private *mad_send_wr) +int send_next_seg(struct 
ib_mad_send_wr_private *mad_send_wr) { struct ib_rmpp_mad *rmpp_mad; + void *next_data; int timeout; u32 paylen; @@ -594,12 +566,14 @@ static int send_next_seg(struct ib_mad_s rmpp_mad->rmpp_hdr.paylen_newwin = cpu_to_be32(paylen); mad_send_wr->sg_list[0].length = sizeof(struct ib_rmpp_mad); } else { - mad_send_wr->send_wr.num_sge = 2; - mad_send_wr->sg_list[0].length = mad_send_wr->data_offset; - mad_send_wr->sg_list[1].addr = get_seg_addr(mad_send_wr); - mad_send_wr->sg_list[1].length = sizeof(struct ib_rmpp_mad) - - mad_send_wr->data_offset; - mad_send_wr->sg_list[1].lkey = mad_send_wr->sg_list[0].lkey; + next_data = get_seg_addr(mad_send_wr); + if (!next_data) { + printk(KERN_ERR PFX "send_next_seg: " + "could not find next segment\n"); + return -EINVAL; + } + memcpy((void *)rmpp_mad + mad_send_wr->data_offset, next_data, + sizeof(struct ib_rmpp_mad) - mad_send_wr->data_offset); rmpp_mad->rmpp_hdr.paylen_newwin = 0; } Index: latest/drivers/infiniband/include/rdma/ib_mad.h =================================================================== --- latest.orig/drivers/infiniband/include/rdma/ib_mad.h +++ latest/drivers/infiniband/include/rdma/ib_mad.h @@ -141,6 +141,11 @@ struct ib_rmpp_hdr { __be32 paylen_newwin; }; +struct ib_mad_multipacket_seg { + struct list_head list; + u8 data[0]; +}; + typedef u64 __bitwise ib_sa_comp_mask; #define IB_SA_COMP_MASK(n) ((__force ib_sa_comp_mask) cpu_to_be64(1ull << n)) @@ -485,17 +490,6 @@ int ib_unregister_mad_agent(struct ib_ma int ib_post_send_mad(struct ib_mad_send_buf *send_buf, struct ib_mad_send_buf **bad_send_buf); -/** - * ib_coalesce_recv_mad - Coalesces received MAD data into a single buffer. - * @mad_recv_wc: Work completion information for a received MAD. - * @buf: User-provided data buffer to receive the coalesced buffers. The - * referenced buffer should be at least the size of the mad_len specified - * by @mad_recv_wc. - * - * This call copies a chain of received MAD segments into a single data buffer, - * removing duplicated headers. - */ -void ib_coalesce_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, void *buf); /** * ib_free_recv_mad - Returns data buffers used to receive a MAD. @@ -601,6 +595,18 @@ struct ib_mad_send_buf * ib_create_send_ gfp_t gfp_mask); /** + * ib_append_to_multipacket_mad - Append a segment of an RMPP multipacket mad send + * to the send buffer. + * @send_buf: Previously allocated send data buffer. + * @seg: segment to append to linked list (already filled with data). + * + * This routine appends a segment of a multipacket RMPP message + * (copied from user space) to a MAD for sending. + */ +void ib_append_to_multipacket_mad(struct ib_mad_send_buf * send_buf, + struct ib_mad_multipacket_seg *seg); + +/** * ib_free_send_mad - Returns data buffers used to send a MAD. * @send_buf: Previously allocated send data buffer. 
*/ Index: latest/drivers/infiniband/core/mad.c =================================================================== --- latest.orig/drivers/infiniband/core/mad.c +++ latest/drivers/infiniband/core/mad.c @@ -792,17 +792,13 @@ struct ib_mad_send_buf * ib_create_send_ return ERR_PTR(-EINVAL); length = sizeof *mad_send_wr + buf_size; - if (length >= PAGE_SIZE) - buf = (void *)__get_free_pages(gfp_mask, long_log2(roundup_pow_of_two(length)) - PAGE_SHIFT); - else - buf = kmalloc(length, gfp_mask); + buf = kzalloc(sizeof *mad_send_wr + sizeof(struct ib_mad), gfp_mask); if (!buf) return ERR_PTR(-ENOMEM); - memset(buf, 0, length); - - mad_send_wr = buf + buf_size; + mad_send_wr = buf + sizeof(struct ib_mad); + INIT_LIST_HEAD(&mad_send_wr->multipacket_list); mad_send_wr->send_buf.mad = buf; mad_send_wr->mad_agent_priv = mad_agent_priv; @@ -834,23 +830,33 @@ struct ib_mad_send_buf * ib_create_send_ } EXPORT_SYMBOL(ib_create_send_mad); +void ib_append_to_multipacket_mad(struct ib_mad_send_buf * send_buf, + struct ib_mad_multipacket_seg *seg) +{ + struct ib_mad_send_wr_private *mad_send_wr; + + mad_send_wr = container_of(send_buf, struct ib_mad_send_wr_private, + send_buf); + list_add_tail(&seg->list, &mad_send_wr->multipacket_list); +} +EXPORT_SYMBOL(ib_append_to_multipacket_mad); + void ib_free_send_mad(struct ib_mad_send_buf *send_buf) { struct ib_mad_agent_private *mad_agent_priv; - void *mad_send_wr; - int length; + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_multipacket_seg *seg, *tmp; mad_agent_priv = container_of(send_buf->mad_agent, struct ib_mad_agent_private, agent); mad_send_wr = container_of(send_buf, struct ib_mad_send_wr_private, send_buf); - length = sizeof(struct ib_mad_send_wr_private) + (mad_send_wr - send_buf->mad); - if (length >= PAGE_SIZE) - free_pages((unsigned long)send_buf->mad, long_log2(roundup_pow_of_two(length)) - PAGE_SHIFT); - else - kfree(send_buf->mad); - + list_for_each_entry_safe(seg, tmp, &mad_send_wr->multipacket_list, list) { + list_del(&seg->list); + kfree(seg); + } + kfree(send_buf->mad); if (atomic_dec_and_test(&mad_agent_priv->refcount)) wake_up(&mad_agent_priv->wait); } Index: latest/drivers/infiniband/core/mad_priv.h =================================================================== --- latest.orig/drivers/infiniband/core/mad_priv.h +++ latest/drivers/infiniband/core/mad_priv.h @@ -130,6 +130,7 @@ struct ib_mad_send_wr_private { enum ib_wc_status status; /* RMPP control */ + struct list_head multipacket_list; int last_ack; int seg_num; int newwin; Index: latest/drivers/infiniband/core/user_mad.c =================================================================== --- latest.orig/drivers/infiniband/core/user_mad.c +++ latest/drivers/infiniband/core/user_mad.c @@ -123,6 +123,7 @@ struct ib_umad_packet { struct ib_mad_send_buf *msg; struct list_head list; int length; + struct list_head seg_list; struct ib_user_mad mad; }; @@ -176,6 +177,87 @@ static int queue_packet(struct ib_umad_f return ret; } +static int data_offset(u8 mgmt_class) +{ + if (mgmt_class == IB_MGMT_CLASS_SUBN_ADM) + return IB_MGMT_SA_HDR; + else if ((mgmt_class >= IB_MGMT_CLASS_VENDOR_RANGE2_START) && + (mgmt_class <= IB_MGMT_CLASS_VENDOR_RANGE2_END)) + return IB_MGMT_VENDOR_HDR; + else + return IB_MGMT_RMPP_HDR; +} + +static int copy_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, + struct ib_umad_packet *packet) +{ + struct ib_mad_recv_buf *seg_buf; + struct ib_rmpp_mad *rmpp_mad; + void *data; + struct ib_mad_multipacket_seg *seg; + int size, len, offset; + u8 flags; + + len = 
mad_recv_wc->mad_len; + if (len <= sizeof(struct ib_mad)) { + memcpy(&packet->mad.data, mad_recv_wc->recv_buf.mad, len); + return 0; + } + + offset = data_offset(mad_recv_wc->recv_buf.mad->mad_hdr.mgmt_class); + + list_for_each_entry(seg_buf, &mad_recv_wc->rmpp_list, list) { + rmpp_mad = (struct ib_rmpp_mad *)seg_buf->mad; + flags = ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr); + + if (flags & IB_MGMT_RMPP_FLAG_FIRST) { + size = sizeof(*rmpp_mad); + memcpy(&packet->mad.data, rmpp_mad, size); + } else { + data = (void *) rmpp_mad + offset; + if (flags & IB_MGMT_RMPP_FLAG_LAST) + size = len; + else + size = sizeof(*rmpp_mad) - offset; + seg = kmalloc(sizeof(struct ib_mad_multipacket_seg) + + sizeof(struct ib_rmpp_mad) - offset, + GFP_KERNEL); + if (!seg) + return -ENOMEM; + memcpy(seg->data, data, size); + list_add_tail(&seg->list, &packet->seg_list); + } + len -= size; + } + return 0; +} + +static struct ib_umad_packet *alloc_packet(void) +{ + struct ib_umad_packet *packet; + int length = sizeof *packet + sizeof(struct ib_mad); + + packet = kzalloc(length, GFP_KERNEL); + if (!packet) { + printk(KERN_ERR "alloc_packet: mem alloc failed for length %d\n", + length); + return NULL; + } + INIT_LIST_HEAD(&packet->seg_list); + return packet; +} + +static void free_packet(struct ib_umad_packet *packet) +{ + struct ib_mad_multipacket_seg *seg, *tmp; + + list_for_each_entry_safe(seg, tmp, &packet->seg_list, list) { + list_del(&seg->list); + kfree(seg); + } + kfree(packet); +} + static void send_handler(struct ib_mad_agent *agent, struct ib_mad_send_wc *send_wc) { @@ -187,7 +269,7 @@ static void send_handler(struct ib_mad_a ib_free_send_mad(packet->msg); if (send_wc->status == IB_WC_RESP_TIMEOUT_ERR) { - timeout = kzalloc(sizeof *timeout + IB_MGMT_MAD_HDR, GFP_KERNEL); + timeout = alloc_packet(); if (!timeout) goto out; @@ -198,40 +280,14 @@ static void send_handler(struct ib_mad_a sizeof (struct ib_mad_hdr)); if (!queue_packet(file, agent, timeout)) - return; + return; + else + free_packet(timeout); } out: kfree(packet); } -static struct ib_umad_packet *alloc_packet(int buf_size) -{ - struct ib_umad_packet *packet; - int length = sizeof *packet + buf_size; - - if (length >= PAGE_SIZE) - packet = (void *)__get_free_pages(GFP_KERNEL, long_log2(roundup_pow_of_two(length)) - PAGE_SHIFT); - else - packet = kmalloc(length, GFP_KERNEL); - - if (!packet) - return NULL; - - memset(packet, 0, length); - return packet; -} - -static void free_packet(struct ib_umad_packet *packet) -{ - int length = packet->length + sizeof *packet; - if (length >= PAGE_SIZE) - free_pages((unsigned long) packet, long_log2(roundup_pow_of_two(length)) - PAGE_SHIFT); - else - kfree(packet); -} - - - static void recv_handler(struct ib_mad_agent *agent, struct ib_mad_recv_wc *mad_recv_wc) { @@ -243,13 +299,16 @@ static void recv_handler(struct ib_mad_a goto out; length = mad_recv_wc->mad_len; - packet = alloc_packet(length); + packet = alloc_packet(); if (!packet) goto out; packet->length = length; - ib_coalesce_recv_mad(mad_recv_wc, packet->mad.data); + if (copy_recv_mad(mad_recv_wc, packet)) { + free_packet(packet); + goto out; + } packet->mad.hdr.status = 0; packet->mad.hdr.length = length + sizeof (struct ib_user_mad); @@ -278,6 +337,7 @@ static ssize_t ib_umad_read(struct file size_t count, loff_t *pos) { struct ib_umad_file *file = filp->private_data; + struct ib_mad_multipacket_seg *seg; struct ib_umad_packet *packet; ssize_t ret; @@ -304,18 +364,42 @@ static ssize_t ib_umad_read(struct file spin_unlock_irq(&file->recv_lock); - if 
(count < packet->length + sizeof (struct ib_user_mad)) { - /* Return length needed (and first RMPP segment) if too small */ - if (copy_to_user(buf, &packet->mad, - sizeof (struct ib_user_mad) + sizeof (struct ib_mad))) - ret = -EFAULT; - else - ret = -ENOSPC; - } else if (copy_to_user(buf, &packet->mad, - packet->length + sizeof (struct ib_user_mad))) + if (copy_to_user(buf, &packet->mad, + sizeof(struct ib_user_mad) + sizeof(struct ib_mad))) { ret = -EFAULT; - else + goto err; + } + + if (count < packet->length + sizeof (struct ib_user_mad)) + /* User buffer too small. Return first RMPP segment (which + * includes RMPP message length). + */ + ret = -ENOSPC; + else if (packet->length <= sizeof(struct ib_mad)) + ret = packet->length + sizeof(struct ib_user_mad); + else { + int len = packet->length - sizeof(struct ib_mad); + struct ib_rmpp_mad *rmpp_mad = + (struct ib_rmpp_mad *) packet->mad.data; + int max_seg_payload = sizeof(struct ib_mad) - + data_offset(rmpp_mad->mad_hdr.mgmt_class); + int seg_payload; + /* multipacket RMPP MAD message. Copy remainder of message. + * Note that last segment may have a shorter payload. + */ + buf += sizeof(struct ib_user_mad) + sizeof(struct ib_mad); + list_for_each_entry(seg, &packet->seg_list, list) { + seg_payload = min_t(int, len, max_seg_payload); + if (copy_to_user(buf, seg->data, seg_payload)) { + ret = -EFAULT; + goto err; + } + buf += seg_payload; + len -= seg_payload; + } ret = packet->length + sizeof (struct ib_user_mad); + } +err: if (ret < 0) { /* Requeue packet */ spin_lock_irq(&file->recv_lock); @@ -339,6 +423,8 @@ static ssize_t ib_umad_write(struct file __be64 *tid; int ret, length, hdr_len, copy_offset; int rmpp_active, has_rmpp_header; + int max_seg_payload; + struct ib_mad_multipacket_seg *seg; if (count < sizeof (struct ib_user_mad) + IB_MGMT_RMPP_HDR) return -EINVAL; @@ -415,6 +501,11 @@ static ssize_t ib_umad_write(struct file goto err_ah; } + if (!rmpp_active && length > sizeof(struct ib_mad)) { + ret = -EINVAL; + goto err_ah; + } + packet->msg = ib_create_send_mad(agent, be32_to_cpu(packet->mad.hdr.qpn), 0, rmpp_active, @@ -432,12 +523,39 @@ static ssize_t ib_umad_write(struct file /* Copy MAD headers (RMPP header in place) */ memcpy(packet->msg->mad, packet->mad.data, IB_MGMT_MAD_HDR); - /* Now, copy rest of message from user into send buffer */ + /* complete copying first 256 bytes of message into send buffer */ if (copy_from_user(packet->msg->mad + copy_offset, buf + sizeof (struct ib_user_mad) + copy_offset, - length - copy_offset)) { + min_t(int, length, sizeof(struct ib_mad)) - copy_offset)) { ret = -EFAULT; - goto err_msg; + goto err_ah; + } + + /* if multipacket, copy remainder of send message from user to multipacket list */ + length -= sizeof(struct ib_mad); + buf += sizeof (struct ib_user_mad) + sizeof(struct ib_mad); + max_seg_payload = sizeof(struct ib_mad) - + data_offset(rmpp_mad->mad_hdr.mgmt_class); + while (length > 0) { + int seg_payload = min_t(int, length, max_seg_payload); + seg = kzalloc(sizeof(struct ib_mad_multipacket_seg) + + max_seg_payload, GFP_KERNEL); + if (!seg) { + printk(KERN_ERR "ib_umad_write: " + "mem alloc failed for length %d\n", + sizeof(struct ib_mad_multipacket_seg) + + max_seg_payload); + ret = -ENOMEM; + goto err_msg; + } + + if (copy_from_user(seg->data, buf, seg_payload)) { + ret = -EFAULT; + goto err_msg; + } + ib_append_to_multipacket_mad(packet->msg, seg); + buf += seg_payload; + length -= seg_payload; } /* -- MST From swise at opengridcomputing.com Tue Dec 6 11:05:48 2005 From: 
swise at opengridcomputing.com (Steve Wise) Date: Tue, 06 Dec 2005 13:05:48 -0600 Subject: [openib-general] ISER question Message-ID: <1133895948.27598.41.camel@linux.site> Is there iscsi initiator code somewhere that uses the infiniband/ulp/iser module? Thanks, Steve. From caitlinb at broadcom.com Tue Dec 6 11:13:29 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 6 Dec 2005 11:13:29 -0800 Subject: [openib-general] [kDAPL]How to register a vmalloc() allocated buffer Message-ID: <54AD0F12E08D1541B826BE97C98F99F10C2A1A@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > On Tue, Dec 06, 2005 at 08:52:13PM +0800, Ian Jiang wrote: >> Hi James, >> You are always so kind! >> Now I have a question about reading a buffer of a application in >> user space. Is it the only way to use the uDAPL? >> I used to have an idea like this: >> The application in user space gives the virtual start address and >> length of its data buffer to a kernel module program. This kernel >> program acts as a application of the kDAPL and registers the user >> space data buffer with the kDAPl, > > Ian, > If you are doing this with OpenIB, my advice is to NOT start > with kDAPL. > AFAICT, kDAPL is going away once any dependencies on it are resolved. > And it's clearly not going to be pushed to kernel.org source trees. > ISTR Dan Bar Dov wrote iSER was no longer dependent on kDAPL > but not sure if that was the only module. > > >> then request a RDMA read operation to complete the data transferring. >> But I think it is not feasible after getting your last reply. Am I >> right? Please give some suggestion and thanks very much! > > In general, a kernel module can map a user space address to a > "DMA Address". OpenIB code has interfaces to register the > "DMA Address" with the IB card. > kDAPL will still be of value for applications that want to minimize their dependencies on the OS while still operating in kernel space (but obviously not as part of *the* kernel). However, agenting user-mode buffers is going to get very OS specific, so this application doesn't seem to be one that would benefit from kDAPL. From ftillier at silverstorm.com Tue Dec 6 11:15:35 2005 From: ftillier at silverstorm.com (Fabian Tillier) Date: Tue, 6 Dec 2005 11:15:35 -0800 Subject: [openib-general] Re: SDP use of CMA In-Reply-To: <20051206185752.GD23088@mellanox.co.il> References: <4395CB02.9010602@ichips.intel.com> <20051206185752.GD23088@mellanox.co.il> Message-ID: <79ae2f320512061115w58304228ib6c7003c71c34b19@mail.gmail.com> On 12/6/05, Michael S. Tsirkin wrote: > Quoting r. Sean Hefty : > > Subject: Re: SDP use of CMA > > > > Tom Tucker wrote: > > > Not to jump in late on this, but why couldn't we just add a protocol > > > parameter to the create_id call. Then it is arbitrarily extensible ala > > > socket(AF_INET, SOCK_STREAM, ) > > > > > > So what I'm specifically suggesting is: > > > > > > struct rdma_cm_id* rdma_create_id(rdma_cm_event_handler* cm_handler, > > > void* context, > > > rdma_cm_proto proto); > > > > This isn't much different that having a separate call to set the protocol. I > > went with a separate protocol API to add in version information as well. > > Changing just the create_id call makes sense. > > > > - Sean > > I also have this notion that it might be a good idea to put the > protocol in the reserved bits in the service id. > > Makes sense? The protocol defines the private data format. 
SDP already defines its private data format, which is different from the CMA's native private data format. The protocol input to the CMA would dictate what SID range would be used (SDP's or the CMA's) and what the private data format would be, so that the CMA can properly process incoming requests.

- Fab

From mst at mellanox.co.il Tue Dec 6 11:25:58 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 6 Dec 2005 21:25:58 +0200
Subject: [openib-general] Re: SDP use of CMA
In-Reply-To: <20051206185752.GD23088@mellanox.co.il>
References: <20051206185752.GD23088@mellanox.co.il>
Message-ID: <20051206192558.GF23088@mellanox.co.il>

Quoting r. Michael S. Tsirkin :
> Subject: Re: SDP use of CMA
>
> Quoting r. Sean Hefty :
> > Subject: Re: SDP use of CMA
> >
> > Tom Tucker wrote:
> > > Not to jump in late on this, but why couldn't we just add a protocol
> > > parameter to the create_id call. Then it is arbitrarily extensible ala
> > > socket(AF_INET, SOCK_STREAM, )
> > >
> > > So what I'm specifically suggesting is:
> > >
> > > struct rdma_cm_id* rdma_create_id(rdma_cm_event_handler* cm_handler,
> > >                                   void* context,
> > >                                   rdma_cm_proto proto);
> >
> > This isn't much different from having a separate call to set the protocol. I
> > went with a separate protocol API to add in version information as well.
> > Changing just the create_id call makes sense.
> >
> > - Sean
>
> I also have this notion that it might be a good idea to put the
> protocol in the reserved bits in the service id.

It seems I was too brief. What I'm trying to say here is that the service ID not only specifies the private data format but also allows demultiplexing.

For SDP the SID format is set, but for other protocols we can have

	if (id->protocol != CMA_SDP_PROTO)
		return cpu_to_be64(((u64)IB_OPENIB_OUI << 48) +
				   (id->protocol << 16) +
				   ((struct sockaddr_in *) addr)->sin_port);

and this way more than one protocol will be able to listen on the same port.

> Makes sense?

-- MST

From sean.hefty at intel.com Tue Dec 6 10:57:28 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 6 Dec 2005 10:57:28 -0800
Subject: [openib-general] Re: SDP use of CMA
In-Reply-To: <20051206185752.GD23088@mellanox.co.il>
Message-ID: 

>I also have this notion that it might be a good idea to put the
>protocol in the reserved bits in the service id.
>
>Makes sense?

The service ID is an IB specific identifier, not exposed through the CMA. Also, there aren't any "reserved bits" in the service ID. Can you clarify?

- Sean

From mshefty at ichips.intel.com Tue Dec 6 11:29:01 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 06 Dec 2005 11:29:01 -0800
Subject: [openib-general] Re: SDP use of CMA
In-Reply-To: <20051206192558.GF23088@mellanox.co.il>
References: <20051206185752.GD23088@mellanox.co.il> <20051206192558.GF23088@mellanox.co.il>
Message-ID: <4395E67D.9000703@ichips.intel.com>

Michael S. Tsirkin wrote:
> It seems I was too brief. What I'm trying to say here is that the
> service ID not only specifies the private data format but also allows
> demultiplexing.
>
> For SDP the SID format is set, but for other protocols we can have
>
> if (id->protocol != CMA_SDP_PROTO)
> 	return cpu_to_be64(((u64)IB_OPENIB_OUI << 48) +
> 			   (id->protocol << 16) +
> 			   ((struct sockaddr_in *) addr)->sin_port);
>
> and this way more than one protocol will be able to listen on the
> same port.

The latest version of this has the socket protocol as part of the service ID, similar to what you have above.
The CMA won't be updated to reflect this until the proposed protocol becomes standard, however.

- Sean

From halr at voltaire.com Tue Dec 6 11:40:03 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: Tue, 6 Dec 2005 21:40:03 +0200
Subject: [openib-general] ISER question
Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AB80@taurus.voltaire.com>

Hi Steve,

* The Linux open-iscsi initiator is used with the iSER initiator (http://www.open-iscsi.org).
* It needs to be built for the iSER transport and produces scsi_transport_iscsi.ko and iscsi_iser.ko.

We will provide more complete build (and running) instructions on the wiki for this.

Do you have an iSER target?

-- Hal

________________________________

From: openib-general-bounces at openib.org on behalf of Steve Wise
Sent: Tue 12/6/2005 2:05 PM
To: openib-general
Subject: [openib-general] ISER question

Is there iscsi initiator code somewhere that uses the infiniband/ulp/iser module?

Thanks,

Steve.

_______________________________________________
openib-general mailing list
openib-general at openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From tom at opengridcomputing.com Tue Dec 6 12:19:51 2005
From: tom at opengridcomputing.com (Tom Tucker)
Date: Tue, 06 Dec 2005 14:19:51 -0600
Subject: [openib-general] cm_add_one events
Message-ID: <1133900391.11138.9.camel@trinity.austin.ammasso.com>

Sean:

Should the IB CM ignore add_one events for all but node_type == IB_NODE_HCA?

Thanks,
Tom

From mshefty at ichips.intel.com Tue Dec 6 12:24:04 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 06 Dec 2005 12:24:04 -0800
Subject: [openib-general] Re: cm_add_one events
In-Reply-To: <1133900391.11138.9.camel@trinity.austin.ammasso.com>
References: <1133900391.11138.9.camel@trinity.austin.ammasso.com>
Message-ID: <4395F364.6080204@ichips.intel.com>

Tom Tucker wrote:
> Should the IB CM ignore add_one events for all but node_type ==
> IB_NODE_HCA?

My assumption was that if someone wanted to run it on a switch or router for whatever reason, then there's no real reason to prevent it. If we add an iWarp node type, then the CM could filter out those nodes, but it should just fail when trying to register a MAD agent anyway.

- Sean

From iod00d at hp.com Tue Dec 6 13:00:25 2005
From: iod00d at hp.com (Grant Grundler)
Date: Tue, 6 Dec 2005 13:00:25 -0800
Subject: [openib-general] Flash sector size? eh?
Message-ID: <20051206210025.GJ21980@esmail.cup.hp.com>

Hi,
I'm wondering if anyone has a clue what this is about:

# ./mstflint/mstflint -d /proc/bus/pci/0084\:05/00.0 -i /root/fw-25208-4_7_400-MHGA28-1T.bin -s b
Flash sector size(0x10000) differs from sector size defined in image (0x20000)
#

Did I grab the wrong firmware image?
http://www.mellanox.com/support/firmware_download.php

The HCA is:
0084:05:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (rev a0)
        Subsystem: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode)
        Flags: bus master, fast devsel, latency 0, IRQ 58
        Memory at 00000f2888800000 (64-bit, non-prefetchable) [size=1M]
        Memory at 00000f2888000000 (64-bit, prefetchable) [size=8M]
        Memory at 00000f2880000000 (64-bit, prefetchable) [size=128M]
        Capabilities: [40] Power Management version 2
        Capabilities: [48] Vital Product Data
        Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable-
        Capabilities: [60] #10 [0001]

This is running a recent svn openib bits (less than 2 weeks old) on 2.6.14 kernel.

I'm messing with firmware because when loading mthca driver, I get:
...
GSI 65 (level, low) -> CPU 1 (0x0808) vector 58 unregistered
ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005)
ib_mthca: Initializing 0084:05:00.0
GSI 65 (level, low) -> CPU 0 (0x0008) vector 58
ACPI: PCI Interrupt 0084:05:00.0[A] -> GSI 65 (level, low) -> IRQ 58
ib_mthca 0084:05:00.0: HCA FW version 4.5.0 is old (4.7.0 is current).
ib_mthca 0084:05:00.0: If you have problems, try updating your HCA FW.

And I like that kind of warning.

thanks,
grant

From halr at voltaire.com Tue Dec 6 13:18:27 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Dec 2005 16:18:27 -0500
Subject: [openib-general] Re: ip_dev_find resolution?
In-Reply-To: <20051206175652.GC22860@mellanox.co.il>
References: <20051206174355.GB21980@esmail.cup.hp.com> <20051206175652.GC22860@mellanox.co.il>
Message-ID: <1133903906.4587.23311.camel@hal.voltaire.com>

On Tue, 2005-12-06 at 12:56, Michael S. Tsirkin wrote:
> Actually, I wonder whether instead of ip_dev_find we can just
>
> read_lock(&dev_base_lock);
> for (dev = dev_base; dev; dev = dev->next) {
>
> and check the ip address?

working off the ip_ptr and ip6_ptr ?

> If this works, this has the advantage of supporting IPv6 as well.

This was introduced at one point and we subsequently changed to ip_dev_find. I forget exactly why this was but can dig it out if no one recalls.

-- Hal

From steve.apo at googlemail.com Tue Dec 6 13:57:52 2005
From: steve.apo at googlemail.com (Steven Wooding)
Date: Tue, 6 Dec 2005 21:57:52 +0000
Subject: [openib-general] UC connection server
In-Reply-To: <79ae2f320512061018m61a445d9xeee4e37bfb326974@mail.gmail.com>
References: <2cfcf21e0512060805u466c9d83m@mail.gmail.com> <4395CDD6.1080601@ichips.intel.com> <79ae2f320512061018m61a445d9xeee4e37bfb326974@mail.gmail.com>
Message-ID: <2cfcf21e0512061357h5d1a90h@mail.gmail.com>

Hi Fabian,

> I think what Steve wants to do is issue a REQ, send it to the PC, but
> have the path record go to his embedded device (which has a different
> LID). The CM protocol supports this, but the implementation of the CM
> looks at the path record to determine the destination of the CM MADs.
> Supporting this would require some way for the user to set the target
> of the CM MADs independently of the path information contained in the
> REQ. Adding an optional extra path record for the CM path might do
> the trick.

That's it in a nutshell really. I don't know how useful such a feature would be in the wider IB community. We've been forced into this position by a vendor not following the standard. I wanted to check with you guys whether there was a quick solution that was ready to fly.
It seems that this feature would need to go into the openib drivers, which we don't have time or money to do. We do have a backup solution from the vendor, but it's non-standard and I was trying to keep our side of the interface using the standard.

Anyway, thanks for your suggestions. It's all useful info.

Regards,

Steve.

From panda at cse.ohio-state.edu Tue Dec 6 21:21:37 2005
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Wed, 7 Dec 2005 00:21:37 -0500 (EST)
Subject: [openib-general] Announcing the release of MVAPICH 0.9.6 (MPI-1 over InfiniBand and other RDMA Interconnects)
Message-ID: <200512070521.jB75LbnM010303@xi.cse.ohio-state.edu>

As MVAPICH software keeps on empowering several clusters in the TOP500 list (including the #5 ranked Sandia Thunderbird), the MVAPICH team is aiming to push the performance and scalability of InfiniBand clusters to the next level!!

The team is pleased to announce the release of MVAPICH 0.9.6 for the following platforms, OS, compilers, and InfiniBand adapters:

- Platforms: EM64T, Opteron, IA-32 and Mac G5
- Operating Systems: Linux, Solaris and Mac OSX
- Compilers: gcc, intel, pathscale and pgi
- InfiniBand Adapters: Mellanox adapters with PCI-X and PCI-Express (SDR and DDR with mem-full and mem-free cards)

In addition to delivering high performance with the VAPI interface, MVAPICH 0.9.6 also provides uDAPL support for portability across networks and platforms with highest performance. The uDAPL interface of this release has been tested with InfiniBand (OpenIB SCM/Gen2 uDAPL and Solaris IBTL/uDAPL) and Myrinet (DAPL-GM beta). Starting with this release, MVAPICH enables InfiniBand support for the Solaris environment through uDAPL support.

MVAPICH 0.9.6 is being distributed as a single integrated package (with MPICH 1.2.7 and MVICH). It is available under BSD license.
This release has the following features: - Designs for scaling InfiniBand clusters to multi-thousand nodes with highest performance and reduced memory usage - Optimized implementation of Rendezvous protocol (RDMA Read and RDMA Write) for better computation-communication overlap and progress - Two modes of communication progress (polling and blocking) - Resource-aware registration cache - Optimized intra-node communication for Bus-based and NUMA-based systems with processor affinity - High performance and scalable collective communication support (Broadcast support using IB hardware multicast mechanism; RDMA-based barrier, all-to-all and all-gather) - Multi-rail communication support (multiple ports per adapter and multiple adapters) - Shared library support - ROMIO support - uDAPL support for portability across networks and OS (tested for InfiniBand on Linux and Solaris; and Myrinet) - Scalable job start-up with MPD - TotalView debugger support - Optimized and tuned for the above platforms and different network interfaces (PCI-X and PCI-Express with SDR and DDR) - Support for multiple compilers (gcc, icc, pathscale and pgi) - Single code base for all of the above platforms and OS - Integrated and easy-to-use build script for installing the code on various platforms, OS, compilers, Devices, and InfiniBand adapters - Incorporates a set of runtime and compiler time tunable parameters for convenient tuning on large-scale clusters Other features of this release include: - Excellent performance: Sample performance numbers include: EM64T, PCI-Ex, IBA-DDR: - 3.09 microsec one-way latency (4 bytes) - 1475 MB/sec unidirectional bandwidth - 2661 MB/sec bidirectional bandwidth EM64T, PCI-Ex, IBA-SDR: - 3.52 microsec one-way latency (4 bytes) - 968 MB/sec unidirectional bandwidth with single-rail and 1497 MB/sec with multi-rail - 1781 MB/sec bidirectional bandwidth with single-rail and 2721 MB/sec with multi-rail Opteron, PCI-Ex, IBA-SDR: - 3.42 microsec one-way latency (4 bytes) - 968 MB/sec unidirectional bandwidth with single-rail - 1865 MB/sec bidirectional bandwidth with single-rail Solaris uDAPL/IBTL on Opteron, PCI-X, IBA-SDR: - 5.38 microsec one-way latency (4 bytes) - 651 MB/sec unidirectional bandwidth - 808 MB/sec bidirectional bandwidth OpenIB/Gen2 uDAPL on Opteron, PCI-Ex, IBA-SDR: - 3.39 microsec one-way latency (4 bytes) - 968 MB/sec unidirectional bandwidth - 1890 MB/sec bidirectional bandwidth OpenIB/Gen2 uDAPL on EM64T, PCI-Ex, IBA-SDR: - 3.43 microsec one-way latency (4 bytes) - 968 MB/sec unidirectional bandwidth - 1912 MB/sec bidirectional bandwidth Performance numbers for all other platforms, system configurations and operations can be viewed by visiting `Performance Results' section of the project's web page. - A set of benchmarks to evaluate point-to-point and collective operations - An enhanced and detailed `User Guide' to assist users: - to install this package on different platforms with both interfaces (VAPI and uDAPL) and different options - to vary different parameters of the MPI installation to extract maximum performance and achieve scalability, especially on large-scale systems. You are welcome to download the MVAPICH 0.9.6 package and access relevant information from the following URL: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/ A successive version with support for OpenIB/Gen2 will be available soon. All feedbacks, including bug reports and hints for performance tuning, are welcome. Please send an e-mail to mvapich-help at cse.ohio-state.edu. 
Thanks,

MVAPICH Team at OSU/NBCL

----------
PS: If you would like to be removed from this mailing list, please send an e-mail to mvapich_request at cse.ohio-state.edu.

======================================================================
MVAPICH/MVAPICH2 project is currently supported with funding from U.S. National Science Foundation, U.S. DOE Office of Science, Mellanox, Intel, Cisco Systems, Sun Microsystems, and Linux Networx; and with equipment support from AMD, Apple, Appro, IBM, Intel, Mellanox, Microway, PathScale, SilverStorm and Sun Microsystems. Other technology partners include Etnus.
======================================================================

From ianjiang.ict at gmail.com Wed Dec 7 00:13:04 2005
From: ianjiang.ict at gmail.com (Ian Jiang)
Date: Wed, 7 Dec 2005 16:13:04 +0800
Subject: [openib-general] [kDAPL]How to register a vmalloc() allocated buffer
In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F10C2A1A@NT-SJCA-0751.brcm.ad.broadcom.com>
References: <54AD0F12E08D1541B826BE97C98F99F10C2A1A@NT-SJCA-0751.brcm.ad.broadcom.com>
Message-ID: <7b2fa1820512070013n5f6a41d1ie39049452f5c4083@mail.gmail.com>

My question originally comes from the iSER. I used to think that a data buffer described in an iSCSI data PDU is in the user space, but now I am afraid that it was not correct.

openib-general-bounces at openib.org wrote:
> > On Tue, Dec 06, 2005 at 08:52:13PM +0800, Ian Jiang wrote:
> >> Hi James,
> >> You are always so kind!
> >> Now I have a question about reading a buffer of a application in
> >> user space. Is it the only way to use the uDAPL?
> >> I used to have an idea like this:
> >> The application in user space gives the virtual start address and
> >> length of its data buffer to a kernel module program. This kernel
> >> program acts as a application of the kDAPL and registers the user
> >> space data buffer with the kDAPl,
> >
> > Ian,
> > If you are doing this with OpenIB, my advice is to NOT start
> > with kDAPL.
> > AFAICT, kDAPL is going away once any dependencies on it are resolved.
> > And it's clearly not going to be pushed to kernel.org source trees.
> > ISTR Dan Bar Dov wrote iSER was no longer dependent on kDAPL
> > but not sure if that was the only module.
> >
> >> then request a RDMA read operation to complete the data transferring.
> >> But I think it is not feasible after getting your last reply. Am I
> >> right? Please give some suggestion and thanks very much!
> >
> > In general, a kernel module can map a user space address to a
> > "DMA Address". OpenIB code has interfaces to register the
> > "DMA Address" with the IB card.
>
> kDAPL will still be of value for applications that want to minimize
> their dependencies on the OS while still operating in kernel space
> (but obviously not as part of *the* kernel).
>
> However, agenting user-mode buffers is going to get very OS
> specific, so this application doesn't seem to be one that
> would benefit from kDAPL.

--
Ian Jiang
ianjiang.ict at gmail.com
Laboratory of Spatial Information Technology
Division of System Architecture
Institute of Computing Technology
Chinese Academy of Sciences
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From mst at mellanox.co.il Wed Dec 7 00:22:20 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 7 Dec 2005 10:22:20 +0200
Subject: [openib-general] Re: ip_dev_find resolution?
In-Reply-To: <1133903906.4587.23311.camel@hal.voltaire.com>
References: <1133903906.4587.23311.camel@hal.voltaire.com>
Message-ID: <20051207082220.GK21035@mellanox.co.il>

Quoting r. Hal Rosenstock :
> Subject: Re: ip_dev_find resolution?
>
> On Tue, 2005-12-06 at 12:56, Michael S. Tsirkin wrote:
> > Actually, I wonder whether instead of ip_dev_find we can just
> >
> > read_lock(&dev_base_lock);
> > for (dev = dev_base; dev; dev = dev->next) {
> >
> > and check the ip address?
>
> working off the ip_ptr and ip6_ptr ?

Yes.

> > If this works, this has the advantage of supporting IPv6 as well.
>
> This was introduced at one point and we subsequently changed to
> ip_dev_find. I forget exactly why this was but can dig it out if no one
> recalls.

Please do.

-- MST

From danb at voltaire.com Wed Dec 7 00:38:24 2005
From: danb at voltaire.com (Dan Bar Dov)
Date: Wed, 7 Dec 2005 10:38:24 +0200
Subject: [openib-general] ISER question
Message-ID: 

The open-iscsi initiator project has the capability to use the infiniband/ulp/iser module. Currently it needs an in-between module called iscsi_iser, but that in-between is being merged into ib_iser, so that open-iscsi will directly interface ib_iser.

Please let me know if you need iscsi_iser since its code is not in open-iscsi, nor in openIB.

Dan

> -----Original Message-----
> From: openib-general-bounces at openib.org
> [mailto:openib-general-bounces at openib.org] On Behalf Of Steve Wise
> Sent: Tuesday, December 06, 2005 9:06 PM
> To: openib-general
> Subject: [openib-general] ISER question
>
> Is there iscsi initiator code somewhere that uses the
> infiniband/ulp/iser module?
>
> Thanks,
>
> Steve.
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general

From mst at mellanox.co.il Wed Dec 7 01:31:20 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 7 Dec 2005 11:31:20 +0200
Subject: [openib-general] Re: Flash sector size? eh?
In-Reply-To: <20051206210025.GJ21980@esmail.cup.hp.com>
References: <20051206210025.GJ21980@esmail.cup.hp.com>
Message-ID: <20051207093120.GN21035@mellanox.co.il>

You need to query the Board ID on card and in image:
./mstflint/mstflint -d q
./mstflint/mstflint -i q

Quoting r. Grant Grundler :
> Subject: Flash sector size? eh?
>
> Hi,
> I'm wondering if anyone has a clue what this is about:
>
> # ./mstflint/mstflint -d /proc/bus/pci/0084\:05/00.0 -i
> /root/fw-25208-4_7_400-MHGA28-1T.bin -s b
> Flash sector size(0x10000) differs from sector size defined in image
> (0x20000)
> #
>
> Did I grab the wrong firmware image?
> http://www.mellanox.com/support/firmware_download.php
>
> The HCA is:
> 0084:05:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex
> (Tavor compatibility mode) (rev a0)
>         Subsystem: Mellanox Technologies MT25208 InfiniHost III Ex
> (Tavor compatibility mode)
>         Flags: bus master, fast devsel, latency 0, IRQ 58
>         Memory at 00000f2888800000 (64-bit, non-prefetchable) [size=1M]
>         Memory at 00000f2888000000 (64-bit, prefetchable) [size=8M]
>         Memory at 00000f2880000000 (64-bit, prefetchable) [size=128M]
>         Capabilities: [40] Power Management version 2
>         Capabilities: [48] Vital Product Data
>         Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable-
>         Capabilities: [60] #10 [0001]
>
> This is running a recent svn openib bits (less than 2 weeks old)
> on 2.6.14 kernel.
>
> I'm messing with firmware because when loading mthca driver, I get:
> ...
> GSI 65 (level, low) -> CPU 1 (0x0808) vector 58 unregistered
> ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005)
> ib_mthca: Initializing 0084:05:00.0
> GSI 65 (level, low) -> CPU 0 (0x0008) vector 58
> ACPI: PCI Interrupt 0084:05:00.0[A] -> GSI 65 (level, low) -> IRQ 58
> ib_mthca 0084:05:00.0: HCA FW version 4.5.0 is old (4.7.0 is current).
> ib_mthca 0084:05:00.0: If you have problems, try updating your HCA FW.
>
> And I like that kind of warning.
>
> thanks,
> grant
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general

-- MST

From yael at mellanox.co.il Wed Dec 7 03:59:14 2005
From: yael at mellanox.co.il (Yael Kalka)
Date: Wed, 7 Dec 2005 13:59:14 +0200
Subject: [openib-general] RE: [PATCH] OpenSM: SA SMInfoRecord should support GetTable as well as Get method
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30E2473@mtlexch01.mtl.com>

Hi Hal,
If you look at the code - currently what is returned is a record with the local SMInfo only.
The code should be fixed to return a table, or a requested SMInfo record - not only of the local port.
So currently a table is not returned, and the code isn't correct with or without the patch...
This issue should be added to our to-do list.
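To make that concrete, the fix will need a loop over all the SMs we know about, roughly like the sketch below. The remote-SM table accessors and the item type here are placeholder names, not the current OpenSM API, and the real allocator and error handling are elided:

	/* Sketch only: emit one SMInfoRecord per SM known to the subnet,
	 * so SubnAdmGetTable(SMInfoRecord) reports all SMs, not just us.
	 * remote_sm_first()/remote_sm_next() and smir_item_t are invented
	 * names for whatever the real subnet SM table provides.
	 */
	static void smir_collect_all(osm_smir_rcv_t *p_rcv, cl_qlist_t *p_list)
	{
		remote_sm_t *p_sm;
		smir_item_t *p_item;

		for (p_sm = remote_sm_first(p_rcv->p_subn); p_sm != NULL;
		     p_sm = remote_sm_next(p_sm)) {
			p_item = malloc(sizeof(*p_item));
			if (p_item == NULL)
				break;
			p_item->rec.lid = p_sm->lid;
			p_item->rec.sm_info = p_sm->smi; /* SMInfo captured at last sweep */
			cl_qlist_insert_tail(p_list, &p_item->list_item);
		}
	}

The usual SA response path can then page this list out as the GetTable reply.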
Yael -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Tuesday, December 06, 2005 4:06 PM To: Yael Kalka Cc: openib-general at openib.org Subject: [PATCH] OpenSM: SA SMInfoRecord should support GetTable as well as Get method OpenSM: SA SMInfoRecord should support GetTable as well as Get method Signed-off-by: Hal Rosenstock Index: osm_sa_sminfo_record.c =================================================================== --- osm_sa_sminfo_record.c (revision 4323) +++ osm_sa_sminfo_record.c (working copy) @@ -165,7 +165,8 @@ osm_smir_rcv_process( CL_ASSERT( p_sa_mad->attr_id == IB_MAD_ATTR_SMINFO_RECORD ); - if (p_sa_mad->method != IB_MAD_METHOD_GET) + if ( (p_sa_mad->method != IB_MAD_METHOD_GET) && + (p_sa_mad->method != IB_MAD_METHOD_GETTABLE) ) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, "osm_smir_rcv_process: ERR 2804: " From halr at voltaire.com Wed Dec 7 04:05:48 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Dec 2005 07:05:48 -0500 Subject: [openib-general] [PATCH] OpenSM: SubnAdmGet PathRecord should assume NumbPath of 1 Message-ID: <1133957147.4587.28533.camel@hal.voltaire.com> OpenSM: SubnAdmGet PathRecord should assume NumbPath of 1 (1.2 erratum) Signed-off-by: Hal Rosenstock Index: opensm/osm_sa_path_record.c =================================================================== --- opensm/osm_sa_path_record.c (revision 4335) +++ opensm/osm_sa_path_record.c (working copy) @@ -709,13 +709,15 @@ __osm_pr_rcv_get_lid_pair_path( static void __osm_pr_rcv_get_port_pair_paths( IN osm_pr_rcv_t* const p_rcv, - IN const ib_path_rec_t* const p_pr, + IN const osm_madw_t* const p_madw, IN const osm_port_t* const p_req_port, IN const osm_port_t* const p_src_port, IN const osm_port_t* const p_dest_port, IN const ib_net64_t comp_mask, IN cl_qlist_t* const p_list ) { + const ib_path_rec_t* p_pr; + const ib_sa_mad_t* p_sa_mad; osm_pr_item_t* p_pr_item; uint16_t src_lid_min_ho; uint16_t src_lid_max_ho; @@ -752,6 +754,9 @@ __osm_pr_rcv_get_port_pair_paths( goto Exit; } + p_sa_mad = osm_madw_get_sa_mad_ptr( p_madw ); + p_pr = (ib_path_rec_t*)ib_sa_mad_get_payload_ptr( p_sa_mad ); + /* We shouldn't be here if the paths are disqualified in some way... Thus, we assume every possible connection is valid. 
@@ -842,10 +847,14 @@ __osm_pr_rcv_get_port_pair_paths( preference = 0; path_num = 0; - if( comp_mask & IB_PR_COMPMASK_NUMBPATH ) - iterations = p_pr->num_path & 0x7F; + /* If SubnAdmGet, assume NumbPaths 1 (1.2 erratum) */ + if (p_sa_mad->method != IB_MAD_METHOD_GET) + if( comp_mask & IB_PR_COMPMASK_NUMBPATH ) + iterations = p_pr->num_path & 0x7F; + else + iterations = (uintn_t)(-1); else - iterations = (uintn_t)(-1); + iterations = 1; while( path_num < iterations ) { @@ -1101,7 +1110,7 @@ __osm_pr_rcv_get_end_points( static void __osm_pr_rcv_process_world( IN osm_pr_rcv_t* const p_rcv, - IN const ib_path_rec_t* const p_pr, + IN const osm_madw_t* const p_madw, IN const osm_port_t* const requestor_port, IN const ib_net64_t comp_mask, IN cl_qlist_t* const p_list ) @@ -1128,7 +1137,7 @@ __osm_pr_rcv_process_world( p_src_port = (osm_port_t*)cl_qmap_head( p_tbl ); while( p_src_port != (osm_port_t*)cl_qmap_end( p_tbl ) ) { - __osm_pr_rcv_get_port_pair_paths( p_rcv, p_pr, requestor_port, p_src_port, + __osm_pr_rcv_get_port_pair_paths( p_rcv, p_madw, requestor_port, p_src_port, p_dest_port, comp_mask, p_list ); p_src_port = (osm_port_t*)cl_qmap_next( &p_src_port->map_item ); @@ -1145,7 +1154,7 @@ __osm_pr_rcv_process_world( static void __osm_pr_rcv_process_half( IN osm_pr_rcv_t* const p_rcv, - IN const ib_path_rec_t* const p_pr, + IN const osm_madw_t* const p_madw, IN const osm_port_t* const requestor_port, IN const osm_port_t* const p_src_port, IN const osm_port_t* const p_dest_port, @@ -1172,7 +1181,7 @@ __osm_pr_rcv_process_half( p_port = (osm_port_t*)cl_qmap_head( p_tbl ); while( p_port != (osm_port_t*)cl_qmap_end( p_tbl ) ) { - __osm_pr_rcv_get_port_pair_paths( p_rcv, p_pr, requestor_port, p_src_port, + __osm_pr_rcv_get_port_pair_paths( p_rcv, p_madw , requestor_port, p_src_port, p_port, comp_mask, p_list ); p_port = (osm_port_t*)cl_qmap_next( &p_port->map_item ); } @@ -1185,7 +1194,7 @@ __osm_pr_rcv_process_half( p_port = (osm_port_t*)cl_qmap_head( p_tbl ); while( p_port != (osm_port_t*)cl_qmap_end( p_tbl ) ) { - __osm_pr_rcv_get_port_pair_paths( p_rcv, p_pr, requestor_port, p_port, + __osm_pr_rcv_get_port_pair_paths( p_rcv, p_madw, requestor_port, p_port, p_dest_port, comp_mask, p_list ); p_port = (osm_port_t*)cl_qmap_next( &p_port->map_item ); } @@ -1199,7 +1208,7 @@ __osm_pr_rcv_process_half( static void __osm_pr_rcv_process_pair( IN osm_pr_rcv_t* const p_rcv, - IN const ib_path_rec_t* const p_pr, + IN const osm_madw_t* const p_madw, IN const osm_port_t* const requestor_port, IN const osm_port_t* const p_src_port, IN const osm_port_t* const p_dest_port, @@ -1208,7 +1217,7 @@ __osm_pr_rcv_process_pair( { OSM_LOG_ENTER( p_rcv->p_log, __osm_pr_rcv_process_pair ); - __osm_pr_rcv_get_port_pair_paths( p_rcv, p_pr, requestor_port, p_src_port, + __osm_pr_rcv_get_port_pair_paths( p_rcv, p_madw, requestor_port, p_src_port, p_dest_port, comp_mask, p_list ); OSM_LOG_EXIT( p_rcv->p_log ); @@ -1413,7 +1422,8 @@ __osm_pr_match_mgrp_attributes( goto Exit; } - if( comp_mask & IB_PR_COMPMASK_NUMBPATH ) + /* If SubnAdmGet, assume NumbPaths of 1 (1.2 erratum) */ + if( ( comp_mask & IB_PR_COMPMASK_NUMBPATH ) && ( p_sa_mad->method != IB_MAD_METHOD_GET ) ) { if( ( p_pr->num_path & 0x7f ) == 0 ) goto Exit; @@ -1513,7 +1523,7 @@ __osm_pr_rcv_respond( /* * C15-0.1.30: - * If we do a SubAdmGet and got more than one record it is an error ! + * If we do a SubnAdmGet and got more than one record it is an error ! 
*/ if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec > 1)) { @@ -1720,22 +1730,22 @@ osm_pr_rcv_process( if( p_src_port ) { if( p_dest_port ) - __osm_pr_rcv_process_pair( p_rcv, p_pr, requestor_port, p_src_port, p_dest_port, + __osm_pr_rcv_process_pair( p_rcv, p_madw, requestor_port, p_src_port, p_dest_port, p_sa_mad->comp_mask, &pr_list ); else - __osm_pr_rcv_process_half( p_rcv, p_pr, requestor_port, p_src_port, NULL, + __osm_pr_rcv_process_half( p_rcv, p_madw, requestor_port, p_src_port, NULL, p_sa_mad->comp_mask, &pr_list ); } else { if( p_dest_port ) - __osm_pr_rcv_process_half( p_rcv, p_pr, requestor_port, NULL, p_dest_port, + __osm_pr_rcv_process_half( p_rcv, p_madw, requestor_port, NULL, p_dest_port, p_sa_mad->comp_mask, &pr_list ); else /* Katie, bar the door! */ - __osm_pr_rcv_process_world( p_rcv, p_pr, requestor_port, + __osm_pr_rcv_process_world( p_rcv, p_madw, requestor_port, p_sa_mad->comp_mask, &pr_list ); } goto Unlock; From halr at voltaire.com Wed Dec 7 04:14:39 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Dec 2005 07:14:39 -0500 Subject: [openib-general] RE: [PATCH] OpenSM: SA SMInfoRecord should support GetTable as well as Get method In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30E2473@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30E2473@mtlexch01.mtl.com> Message-ID: <1133957678.4587.28597.camel@hal.voltaire.com> Hi Yael, On Wed, 2005-12-07 at 06:59, Yael Kalka wrote: > Hi Hal, > If you look at the code - currently the What is returned is a record > with the > local SMInfo record. > The code should be fixed to return a table, or a requested SMInfo record > - > not only of the local port. > So currently - a table is not returned, and the code isn't correct with > or > without the patch... > This issue should be added to our to-do list. Are you saying that because of that the GetTable should not be accepted until this is fixed or is it a separate issue ? There is more to do here as you point out and I will track this on the TODO list. Do you have other things for this list (see management/osm/doc/todo) ? -- Hal > Yael > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Tuesday, December 06, 2005 4:06 PM > To: Yael Kalka > Cc: openib-general at openib.org > Subject: [PATCH] OpenSM: SA SMInfoRecord should support GetTable as well > as Get method > > > OpenSM: SA SMInfoRecord should support GetTable as well as Get method > > Signed-off-by: Hal Rosenstock > > Index: osm_sa_sminfo_record.c > =================================================================== > --- osm_sa_sminfo_record.c (revision 4323) > +++ osm_sa_sminfo_record.c (working copy) > @@ -165,7 +165,8 @@ osm_smir_rcv_process( > > CL_ASSERT( p_sa_mad->attr_id == IB_MAD_ATTR_SMINFO_RECORD ); > > - if (p_sa_mad->method != IB_MAD_METHOD_GET) > + if ( (p_sa_mad->method != IB_MAD_METHOD_GET) && > + (p_sa_mad->method != IB_MAD_METHOD_GETTABLE) ) > { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > "osm_smir_rcv_process: ERR 2804: " From yael at mellanox.co.il Wed Dec 7 04:36:16 2005 From: yael at mellanox.co.il (Yael Kalka) Date: Wed, 7 Dec 2005 14:36:16 +0200 Subject: [openib-general] RE: [PATCH] OpenSM: SA SMInfoRecord should support GetTable as well as Get method Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30E2474@mtlexch01.mtl.com> The GetTable can be accepted. As I said - currently it doesn't mean anything. As for the list - the client re-registration issue can be added. 
I can't think of anything else right now, I will let you know when I have something to add there. Yael -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Wednesday, December 07, 2005 2:15 PM To: Yael Kalka Cc: openib-general at openib.org Subject: RE: [PATCH] OpenSM: SA SMInfoRecord should support GetTable as well as Get method Hi Yael, On Wed, 2005-12-07 at 06:59, Yael Kalka wrote: > Hi Hal, > If you look at the code - currently the What is returned is a record > with the > local SMInfo record. > The code should be fixed to return a table, or a requested SMInfo record > - > not only of the local port. > So currently - a table is not returned, and the code isn't correct with > or > without the patch... > This issue should be added to our to-do list. Are you saying that because of that the GetTable should not be accepted until this is fixed or is it a separate issue ? There is more to do here as you point out and I will track this on the TODO list. Do you have other things for this list (see management/osm/doc/todo) ? -- Hal > Yael > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Tuesday, December 06, 2005 4:06 PM > To: Yael Kalka > Cc: openib-general at openib.org > Subject: [PATCH] OpenSM: SA SMInfoRecord should support GetTable as well > as Get method > > > OpenSM: SA SMInfoRecord should support GetTable as well as Get method > > Signed-off-by: Hal Rosenstock > > Index: osm_sa_sminfo_record.c > =================================================================== > --- osm_sa_sminfo_record.c (revision 4323) > +++ osm_sa_sminfo_record.c (working copy) > @@ -165,7 +165,8 @@ osm_smir_rcv_process( > > CL_ASSERT( p_sa_mad->attr_id == IB_MAD_ATTR_SMINFO_RECORD ); > > - if (p_sa_mad->method != IB_MAD_METHOD_GET) > + if ( (p_sa_mad->method != IB_MAD_METHOD_GET) && > + (p_sa_mad->method != IB_MAD_METHOD_GETTABLE) ) > { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > "osm_smir_rcv_process: ERR 2804: " From mst at mellanox.co.il Wed Dec 7 05:11:05 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 7 Dec 2005 15:11:05 +0200 Subject: [openib-general] [PATCH] ipoib_multicast: IPOIB_FLAG_ADMIN_UP test racy Message-ID: <20051207131105.GS21035@mellanox.co.il> Hello, Roland! Here's a simple race scenario. device is up. port event triggers flush_task. ipoib_ib_dev_flush (running from the default work queue) calls ipoib_ib_dev_down. This calls ipoib_mcast_stop_thread. This flushes the ipoib workqueue. mcast_task runs on ipoib workqueue, since IPOIB_FLAG_ADMIN_UP is set, this re-starts the mcast task. As a result mcast_task may be running while mcast_stop_thread is scanning the multicast_list, or after that. --- Fix race condition where mcast_task may be running after ipoib_mcast_stop_thread has flushed the workqueue. Signed-off-by: Michael S. 
Tsirkin

Index: linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===================================================================
--- linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib_multicast.c (revision 4042)
+++ linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib_multicast.c (working copy)
@@ -904,7 +904,7 @@
 		ipoib_mcast_free(mcast);
 	}
 
-	if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
+	if (test_bit(IPOIB_FLAG_OPER_UP, &priv->flags))
 		ipoib_mcast_start_thread(dev);
 }

-- MST

From Arkady.Kanevsky at netapp.com Wed Dec 7 05:11:53 2005
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Wed, 7 Dec 2005 08:11:53 -0500
Subject: [openib-general] FW: [swg] 12/6 meeting minutes (2nd half)
Message-ID: 

SWG have approved the IP address proposal (v5).

Arkady Kanevsky                       email: arkady at netapp.com
Network Appliance Inc.                phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.          Fax: 781-895-1195
Waltham, MA 02451                     central phone: 781-768-5300

> -----Original Message-----
> From: Mike Ko [mailto:mako at almaden.ibm.com]
> Sent: Tuesday, December 06, 2005 6:41 PM
> To: swg at infinibandta.org
> Subject: [swg] 12/6 meeting minutes (2nd half)
>
> We had a brief discussion on the revised slide deck from
> Arkady on the RDMA-Aware SID and CM REQ Message Extension and
> there were no disagreements on the direction.
>
> Arkady Kanevsky from NetApp made the following motion:
> "Create a new Annex for RDMA aware ULPs that includes:
> a. port mapping between IETF protocols ports and IB SIDs
> b. CM REQ message private data format extensions
> c. CM usage for RDMA aware ULPs"
>
> Ted Kim from Sun seconded the motion.
>
> Vote count:
> Against: 0
> Abstain: 0
>
> Motion passed.
>
> We continued with a discussion on the slide deck from Mike Ko
> on supporting iSER on InfiniBand. There were disagreements
> on the merits on the need for Connection Preference bits. We
> decided to move forward with the rest of the suggestions from
> Mike and postpone the decision on the CP bits until the next meeting.
>
> Mike Ko from IBM made the following motion:
> "Create a new annex to support iSER on InfiniBand release 1.1
> and 1.2 as represented in Mike Ko's slidedeck dated December
> 1 but not including the support for Connection Preference
> bits, and also making ARI a must requirement for CM REJ."
>
> Yaron Haviv from Voltaire seconded the motion.
>
> Vote count:
> Against: 0
> Abstain: 0
>
> Motion passed.
>
> The meeting was adjourned after the vote.
>
> Mike
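To make the port-mapping item concrete: following the formula Michael posted in the SDP/CMA thread, a Service ID for an RDMA-aware ULP could be packed as below. This is only an illustration of the proposal; IB_OPENIB_OUI and the exact field positions are taken from that thread, not from annex text:

	/* Illustration only: pack an IETF protocol number and a TCP/UDP
	 * port into a Service ID. The OUI sits in the top bits, the IP
	 * protocol number in bits 16-23, and the 16-bit port (host order
	 * here) in the low bits.
	 */
	static __be64 rdma_ulp_sid(u8 ip_proto, u16 port)
	{
		return cpu_to_be64(((u64) IB_OPENIB_OUI << 48) |
				   ((u64) ip_proto << 16) |
				   port);
	}

Packed this way, two different ULPs listening on the same 16-bit port number still resolve to distinct SIDs, which is the demultiplexing property discussed in that thread.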

From mst at mellanox.co.il Wed Dec 7 07:43:48 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 7 Dec 2005 17:43:48 +0200
Subject: [openib-general] Re: mthca_qp patch
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3B8D6CD@mtlexch01.mtl.com>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E3B8D6CD@mtlexch01.mtl.com>
Message-ID: <20051207154348.GZ21035@mellanox.co.il>

Several fixes in mthca:
1. Add limit checking on rd_atomic and dest_rd_atomic attributes: especially for max_dest_rd_atomic, a value that is larger than the HCA capability can cause RDB overflow and corruption of another QP.
2. Fix typo in rd_atomic calculation: ffs(x) - 1 does not find the next power of 2, fls(x - 1) does (e.g. for x = 5, ffs(5) - 1 gives 0, while fls(5 - 1) gives 3, i.e. 2^3 = 8).
3. Only change the driver's copy of the QP attributes in modify QP after checking the modify QP command completed successfully.

Signed-off-by: Jack Morgenstein
Signed-off-by: Michael S. Tsirkin

Index: linux-kernel/drivers/infiniband/hw/mthca/mthca_qp.c
===================================================================
--- linux-kernel.orig/drivers/infiniband/hw/mthca/mthca_qp.c
+++ linux-kernel/drivers/infiniband/hw/mthca/mthca_qp.c
@@ -589,6 +589,20 @@ int mthca_modify_qp(struct ib_qp *ibqp,
 		return -EINVAL;
 	}
 
+	if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC &&
+	    attr->max_rd_atomic > dev->limits.max_qp_init_rdma) {
+		mthca_dbg(dev, "Max rdma_atomic as initiator (%u) too large. max is %d\n",
+			  attr->max_rd_atomic, dev->limits.max_qp_init_rdma);
+		return -EINVAL;
+	}
+
+	if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC &&
+	    attr->max_dest_rd_atomic > 1 << dev->qp_table.rdb_shift) {
+		mthca_dbg(dev, "Max rdma_atomic as responder(%u) too large. max is %d\n",
+			  attr->max_dest_rd_atomic, 1 << dev->qp_table.rdb_shift);
+		return -EINVAL;
+	}
+
 	mailbox = mthca_alloc_mailbox(dev, GFP_KERNEL);
 	if (IS_ERR(mailbox))
 		return PTR_ERR(mailbox);
@@ -712,9 +726,9 @@ int mthca_modify_qp(struct ib_qp *ibqp,
 	}
 
 	if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC) {
-		qp_context->params1 |= cpu_to_be32(min(attr->max_rd_atomic ?
-						       ffs(attr->max_rd_atomic) - 1 : 0,
-						       7) << 21);
+		qp_context->params1 |=
+			cpu_to_be32(attr->max_rd_atomic ?
+				    fls(attr->max_rd_atomic - 1) << 21 : 0);
 		qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_SRA_MAX);
 	}
 
@@ -748,13 +762,9 @@ int mthca_modify_qp(struct ib_qp *ibqp,
 			qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RWE |
 								MTHCA_QP_OPTPAR_RRE |
 								MTHCA_QP_OPTPAR_RAE);
-
-		qp->atomic_rd_en = attr->qp_access_flags;
 	}
 
 	if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC) {
-		u8 rra_max;
-
 		if (qp->resp_depth && !attr->max_dest_rd_atomic) {
 			/*
 			 * Lowering our responder resources to zero.
@@ -782,16 +792,10 @@ int mthca_modify_qp(struct ib_qp *ibqp,
 							MTHCA_QP_OPTPAR_RAE);
 		}
 
-		for (rra_max = 0;
-		     1 << rra_max < attr->max_dest_rd_atomic &&
-			     rra_max < dev->qp_table.rdb_shift;
-		     ++rra_max)
-			; /* nothing */
-
-		qp_context->params2 |= cpu_to_be32(rra_max << 21);
+		qp_context->params2 |=
+			cpu_to_be32(attr->max_dest_rd_atomic ?
+ fls(attr->max_dest_rd_atomic - 1) << 21 : 0); qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RRA_MAX); - - qp->resp_depth = attr->max_dest_rd_atomic; } qp_context->params2 |= cpu_to_be32(MTHCA_QP_BIT_RSC); @@ -833,8 +837,13 @@ int mthca_modify_qp(struct ib_qp *ibqp, err = -EINVAL; } - if (!err) + if (!err) { + if (attr_mask & IB_QP_ACCESS_FLAGS) + qp->atomic_rd_en = attr->qp_access_flags; + if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC) + qp->resp_depth = attr->max_dest_rd_atomic; qp->state = new_state; + } mthca_free_mailbox(dev, mailbox); From mst at mellanox.co.il Wed Dec 7 08:44:33 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 7 Dec 2005 18:44:33 +0200 Subject: [openib-general] [PATCH] ipoib_multicast/ipoib_mcast_send race In-Reply-To: <20051207131105.GS21035@mellanox.co.il> References: <20051207131105.GS21035@mellanox.co.il> Message-ID: <20051207164433.GA21035@mellanox.co.il> Hello, Roland! Here's another race scenario. --- Fix the following race scenario: device is up. port event or set mcast list triggers ipoib_mcast_stop_thread, This cancels the query and waits on mcast "done" completion. completion is called and "done" is set. Meanwhile, ipoib_mcast_send arrives and starts a new query, re-initializing "done". Signed-off-by: Michael S. Tsirkin Index: openib/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- openib.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2005-12-01 14:53:08.000000000 +0200 +++ openib/drivers/infiniband/ulp/ipoib/ipoib.h 2005-12-07 18:23:55.000000000 +0200 @@ -78,6 +78,7 @@ enum { IPOIB_FLAG_SUBINTERFACE = 4, IPOIB_MCAST_RUN = 5, IPOIB_STOP_REAPER = 6, + IPOIB_MCAST_STARTED = 7, IPOIB_MAX_BACKOFF_SECONDS = 16, Index: openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- openib.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2005-12-07 18:22:12.000000000 +0200 +++ openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2005-12-07 18:29:49.000000000 +0200 @@ -582,6 +582,10 @@ int ipoib_mcast_start_thread(struct net_ queue_work(ipoib_workqueue, &priv->mcast_task); up(&mcast_mutex); + spin_lock_irq(&priv->lock); + set_bit(IPOIB_MCAST_STARTED, &priv->flags); + spin_unlock_irq(&priv->lock); + return 0; } @@ -592,6 +596,10 @@ int ipoib_mcast_stop_thread(struct net_d ipoib_dbg_mcast(priv, "stopping multicast thread\n"); + spin_lock_irq(&priv->lock); + clear_bit(IPOIB_MCAST_STARTED, &priv->flags); + spin_unlock_irq(&priv->lock); + down(&mcast_mutex); clear_bit(IPOIB_MCAST_RUN, &priv->flags); cancel_delayed_work(&priv->mcast_task); @@ -674,6 +682,9 @@ void ipoib_mcast_send(struct net_device */ spin_lock(&priv->lock); + if (!test_bit(IPOIB_MCAST_STARTED, &priv->flags)) + goto unlock; + mcast = __ipoib_mcast_find(dev, mgid); if (!mcast) { /* Let's create a new send only group now */ @@ -732,6 +743,7 @@ out: ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN); } +unlock: spin_unlock(&priv->lock); } -- MST From rdreier at cisco.com Wed Dec 7 10:38:53 2005 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 Dec 2005 10:38:53 -0800 Subject: [openib-general] [PATCH] ipoib_multicast: IPOIB_FLAG_ADMIN_UP test racy In-Reply-To: <20051207131105.GS21035@mellanox.co.il> (Michael S. 
Tsirkin's message of "Wed, 7 Dec 2005 15:11:05 +0200") References: <20051207131105.GS21035@mellanox.co.il> Message-ID: Thanks -- I'm just now digging myself out of the backlog caused by being offline while we moved offices but I will review and apply these patches ASAP. - R. From iod00d at hp.com Wed Dec 7 11:41:37 2005 From: iod00d at hp.com (Grant Grundler) Date: Wed, 7 Dec 2005 11:41:37 -0800 Subject: [openib-general] Re: Flash sector size? eh? In-Reply-To: <20051207093120.GN21035@mellanox.co.il> References: <20051206210025.GJ21980@esmail.cup.hp.com> <20051207093120.GN21035@mellanox.co.il> Message-ID: <20051207194137.GB26945@esmail.cup.hp.com> On Wed, Dec 07, 2005 at 11:31:20AM +0200, Michael S. Tsirkin wrote: > You need to query the Board ID on card and in image: > ./mstflint/mstflint -d q > ./mstflint/mstflint -i q Ah...another issue: # ./mstflint/mstflint -d /proc/bus/pci/0084\:05/00.0 q *** ERROR *** Can't open /proc/bus/pci/0084:05/00.0: Can not obtain Flash semaphore (63). You can use -clear_semaphore to force semaphore unlock. See help for details. I expect that's due to the previous failure not cleaning up behind itself. After clearing the flash semaphore: # ./mstflint/mstflint -d /proc/bus/pci/0084\:05/00.0 q Image type: FailSafe Chip rev.: A0 GUID Des: Node Port1 Port2 Sys image GUIDs: 001321ffff757800 001321ffff757801 001321ffff757802 001321ffff757803 Board ID: 76­ # ./mstflint/mstflint -i /root/fw-25208-4_7_400-MHGA28-1T.bin q Image type: FailSafe Chip rev.: A0 GUID Des: Node Port1 Port2 Sys image GUIDs: 0002c9000100d050 0002c9000100d051 0002c9000100d052 0002c9000100d050 Board ID: V_ym (MT_0200000001) I don't know what to make of the "76-" for board ID. Could this be a prototype board with some HP generated firmware? Is "Board ID" the only way to tell which vendor provided an HCA? I normally expect Subsystem ID to tell me that but have the impression (after looking at several PCI-X HCAs I have installed) that I can't trust that in this case. :( thanks, grant From ftillier at silverstorm.com Wed Dec 7 11:50:18 2005 From: ftillier at silverstorm.com (Fabian Tillier) Date: Wed, 7 Dec 2005 11:50:18 -0800 Subject: [openib-general] Re: Flash sector size? eh? In-Reply-To: <20051207194137.GB26945@esmail.cup.hp.com> References: <20051206210025.GJ21980@esmail.cup.hp.com> <20051207093120.GN21035@mellanox.co.il> <20051207194137.GB26945@esmail.cup.hp.com> Message-ID: <79ae2f320512071150p600f9854o3553340c6b385c97@mail.gmail.com> On 12/7/05, Grant Grundler wrote: > I don't know what to make of the "76-" for board ID. > Could this be a prototype board with some HP generated firmware? > > Is "Board ID" the only way to tell which vendor provided an HCA? > I normally expect Subsystem ID to tell me that but have the impression > (after looking at several PCI-X HCAs I have installed) that I can't > trust that in this case. :( The GUIDs have the OUI of the vendor in the first 3 bytes. 0002c9 is Mellanox, and 001321 is HP. So your card looks like it has HP FW. However, that in itself shouldn't prevent the FW from loading, so I'm not of much use to you. I'll let Michael chime in on that part. - Fab From mst at mellanox.co.il Wed Dec 7 13:25:04 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 7 Dec 2005 23:25:04 +0200 Subject: [openib-general] Re: Flash sector size? eh? In-Reply-To: <20051207194137.GB26945@esmail.cup.hp.com> References: <20051207194137.GB26945@esmail.cup.hp.com> Message-ID: <20051207212504.GE1404@mellanox.co.il> Quoting r. 
Grant Grundler : > Subject: Re: Flash sector size? eh? > > On Wed, Dec 07, 2005 at 11:31:20AM +0200, Michael S. Tsirkin wrote: > > You need to query the Board ID on card and in image: > > ./mstflint/mstflint -d q > > ./mstflint/mstflint -i q > > Ah...another issue: > # ./mstflint/mstflint -d /proc/bus/pci/0084\:05/00.0 q > *** ERROR *** Can't open /proc/bus/pci/0084:05/00.0: Can not obtain Flash semaphore (63). You can use -clear_semaphore to force semaphore unlock. See help for details. > > I expect that's due to the previous failure not cleaning up behind itself. > After clearing the flash semaphore: > > # ./mstflint/mstflint -d /proc/bus/pci/0084\:05/00.0 q > Image type: FailSafe > Chip rev.: A0 > GUID Des: Node Port1 Port2 Sys image > GUIDs: 001321ffff757800 001321ffff757801 001321ffff757802 001321ffff757803 > Board ID: 76­ > > # ./mstflint/mstflint -i /root/fw-25208-4_7_400-MHGA28-1T.bin q > Image type: FailSafe > Chip rev.: A0 > GUID Des: Node Port1 Port2 Sys image > GUIDs: 0002c9000100d050 0002c9000100d051 0002c9000100d052 0002c9000100d050 > Board ID: V_ym (MT_0200000001) > > I don't know what to make of the "76-" for board ID. > Could this be a prototype board with some HP generated firmware? Looks like it is, and looks like HP has its own way to format board IDs, different from both mellanox and topspin boards. Could you find out about this? > Is "Board ID" the only way to tell which vendor provided an HCA? > I normally expect Subsystem ID to tell me that but have the impression > (after looking at several PCI-X HCAs I have installed) that I can't > trust that in this case. :( Mellanox FW lets you set the Subsystem ID, I wouldnt know whether vendors actually do this. -- MST From mst at mellanox.co.il Wed Dec 7 13:27:41 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 7 Dec 2005 23:27:41 +0200 Subject: [openib-general] Re: Flash sector size? eh? In-Reply-To: <20051207194137.GB26945@esmail.cup.hp.com> References: <20051207194137.GB26945@esmail.cup.hp.com> Message-ID: <20051207212741.GF1404@mellanox.co.il> Quoting r. Grant Grundler : > # ./mstflint/mstflint -d /proc/bus/pci/0084\:05/00.0 q > Image type: FailSafe > Chip rev.: A0 > GUID Des: Node Port1 Port2 Sys image > GUIDs: 001321ffff757800 001321ffff757801 001321ffff757802 001321ffff757803 > Board ID: 76­ > > # ./mstflint/mstflint -i /root/fw-25208-4_7_400-MHGA28-1T.bin q > Image type: FailSafe > Chip rev.: A0 > GUID Des: Node Port1 Port2 Sys image > GUIDs: 0002c9000100d050 0002c9000100d051 0002c9000100d052 0002c9000100d050 > Board ID: V_ym (MT_0200000001) > > I don't know what to make of the "76-" for board ID. > Could this be a prototype board with some HP generated firmware? > > Is "Board ID" the only way to tell which vendor provided an HCA? Actually, quite a lot of data is accessible in the PCI VPD records. I'm not sure what, need to look the format up in the spec ... -- MST From ralphc at pathscale.com Wed Dec 7 13:28:04 2005 From: ralphc at pathscale.com (Ralph Campbell) Date: Wed, 07 Dec 2005 13:28:04 -0800 Subject: [openib-general] Async events all lumped together? Message-ID: <1133990884.12986.37.camel@brick.internal.keyresearch.com> I was wondering why the mthca driver generates a single IB_EVENT_LID_CHANGE event for all the changes (if any) contained in a SubnSet(Portinfo) MAD. Also, the various OpenIB core agents seem to respond to more event types than is strictly necessary: ib_sa_event() looks like it should only need to respond to IB_EVENT_SM_CHANGE. 
ib_cache_event() looks like it should only need to respond to IB_EVENT_PKEY_CHANGE or IB_EVENT_GID_CHANGE if there was one. -- Ralph Campbell From iod00d at hp.com Wed Dec 7 13:52:00 2005 From: iod00d at hp.com (Grant Grundler) Date: Wed, 7 Dec 2005 13:52:00 -0800 Subject: [openib-general] Re: Flash sector size? eh? In-Reply-To: <20051207212504.GE1404@mellanox.co.il> References: <20051207194137.GB26945@esmail.cup.hp.com> <20051207212504.GE1404@mellanox.co.il> Message-ID: <20051207215200.GH26945@esmail.cup.hp.com> On Wed, Dec 07, 2005 at 11:25:04PM +0200, Michael S. Tsirkin wrote: > Looks like it is, and looks like HP has its own way to format board IDs, > different from both mellanox and topspin boards. > Could you find out about this? Yes, I should be able to. In anycase, I want to upgrade the HCA firmware to 4.7.0 before posting issues with opensm. > Mellanox FW lets you set the Subsystem ID, > I wouldnt know whether vendors actually do this. I'll raise the issue inside HP and see who bites. thanks, grant From robert.j.woodruff at intel.com Wed Dec 7 15:59:00 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Wed, 7 Dec 2005 15:59:00 -0800 Subject: [openib-general] Announce: Updated packages available In-Reply-To: <438FB65E.50406@redhat.com> Message-ID: Doug Ledford wrote, >I've added to the list of available packages. In addition to >libibverbs, libmthca, libsdp, and opensm, we now have udapl compiled. >We also have an update initscripts package for RHEL-4 that enables >static IP setups on ipoib interfaces and works at boot time. In >addition, all the user space tools have been revved up to svn rev 4265. > The kernel has not been recompiled since the last one and is still at >3965. I hope to get an updated kernel sometime tomorrow. Hi Doug, I loaded your latest code onto a couple of X86_64 boxes and was successful at running MPI over the uDAPL from your RPM. The only problem I ran into was that I had to use my own libdat.so. Are you also planning on installing the libdat.so along with the libdapl.so for InfiniBand ? woody From caitlinb at broadcom.com Wed Dec 7 16:09:33 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 7 Dec 2005 16:09:33 -0800 Subject: [openib-general] Announce: Updated packages available Message-ID: <54AD0F12E08D1541B826BE97C98F99F10C2B6B@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > Doug Ledford wrote, >> I've added to the list of available packages. In addition to >> libibverbs, libmthca, libsdp, and opensm, we now have udapl compiled. >> We also have an update initscripts package for RHEL-4 that enables >> static IP setups on ipoib interfaces and works at boot time. In >> addition, all the user space tools have been revved up to svn rev >> 4265. The kernel has not been recompiled since the last one and is >> still at 3965. I hope to get an updated kernel sometime tomorrow. > > Hi Doug, > > I loaded your latest code onto a couple of X86_64 boxes and > was successful at running MPI over the uDAPL from your RPM. > The only problem I ran into was that I had to use my own > libdat.so. Are you also planning on installing the libdat.so along > with the libdapl.so for InfiniBand ? > > woody > In the true spirit of both RPM and DAPL, libdat should probably be its own distinct package. 
From robert.j.woodruff at intel.com Wed Dec 7 16:14:29 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Wed, 7 Dec 2005 16:14:29 -0800 Subject: [openib-general] Announce: Updated packages available In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F10C2B6B@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: Catlin wrote, >In the true spirit of both RPM and DAPL, libdat should probably >be its own distinct package. Makes sense. woody From dledford at redhat.com Wed Dec 7 16:40:54 2005 From: dledford at redhat.com (Doug Ledford) Date: Wed, 07 Dec 2005 19:40:54 -0500 Subject: [openib-general] Announce: Updated packages available In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F10C2B6B@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F10C2B6B@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <43978116.5050301@redhat.com> Caitlin Bestler wrote: > openib-general-bounces at openib.org wrote: > >>Doug Ledford wrote, >> >>>I've added to the list of available packages. In addition to >>>libibverbs, libmthca, libsdp, and opensm, we now have udapl compiled. >>>We also have an update initscripts package for RHEL-4 that enables >>>static IP setups on ipoib interfaces and works at boot time. In >>>addition, all the user space tools have been revved up to svn rev >>> 4265. The kernel has not been recompiled since the last one and is >>>still at 3965. I hope to get an updated kernel sometime tomorrow. >> >>Hi Doug, >> >>I loaded your latest code onto a couple of X86_64 boxes and >>was successful at running MPI over the uDAPL from your RPM. >>The only problem I ran into was that I had to use my own >>libdat.so. Are you also planning on installing the libdat.so along >>with the libdapl.so for InfiniBand ? >> >>woody >> > > In the true spirit of both RPM and DAPL, libdat should probably > be its own distinct package. > For the future, I more or less had that in mind, but I didn't do that this go around. -- Doug Ledford http://people.redhat.com/dledford From robert.j.woodruff at intel.com Wed Dec 7 16:59:01 2005 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Wed, 7 Dec 2005 16:59:01 -0800 Subject: [openib-general] Announce: Updated packages available Message-ID: <1AC79F16F5C5284499BB9591B33D6F00064EF5ED@orsmsx408> Bob Woodruff wrote, >I loaded your latest code onto a couple of X86_64 boxes and >was successful at running MPI over the uDAPL from your >RPM. The only problem I ran into was that I had to use my own >libdat.so. Are you also planning on installing the libdat.so along >with the libdapl.so for InfiniBand ? >woody Never mind, my bad. This turned out to be my problem when converting my systems from the SVN installed S/W to the new Redhat RPMS. woody From ogerlitz at voltaire.com Wed Dec 7 23:17:15 2005 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 08 Dec 2005 09:17:15 +0200 Subject: [openib-general] assumptions on page mapping (was High memory) In-Reply-To: <52acg25zr6.fsf@cisco.com> References: <437C1622.5070505@cse.ohio-state.edu> <52acg25zr6.fsf@cisco.com> Message-ID: <4397DDFB.2060102@voltaire.com> Roland Dreier wrote: > The right way to use the MR from get_dma_mr() is to use "bus > addresses" from the DMA mapping API. For highmem, the right way to > get those addresses is with dma_map_sg() or dma_map_page(). Looking on the kernel x86_64 code, both dma_map_sg and dma_map_page seem to assume that the page is already mapped, since they call page_address(page). 
Specifically is it safe in a SCSI LLD (eg SRP and iSER which is among other things such) to call dma_map_sg on a SG which comes with a SCSI command, so the SCSI Mid-Layer always makes sure the pages are mapped?

Or.
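A minimal sketch of the pattern being asked about, assuming 2.6-era SCSI LLD conventions (cmd->request_buffer holding the scatterlist and cmd->use_sg its entry count); the helper name example_map_cmd_sg and the HCA-registration comment are illustrative only, not code from SRP or iSER, while dma_map_sg(), sg_dma_address() and sg_dma_len() are the standard DMA mapping API:

    #include <linux/dma-mapping.h>
    #include <scsi/scsi_cmnd.h>

    /* Hypothetical helper: map the SG list the SCSI mid-layer hands down
     * with a command. Nothing here needs page_address() or kmap(); after
     * dma_map_sg() only bus addresses are used. */
    static int example_map_cmd_sg(struct device *dev, struct scsi_cmnd *cmd)
    {
            struct scatterlist *sg = cmd->request_buffer;
            unsigned int total = 0;
            int i, nents;

            /* dma_map_sg() returns the number of DMA segments (an IOMMU
             * may coalesce entries) or 0 on failure. */
            nents = dma_map_sg(dev, sg, cmd->use_sg, cmd->sc_data_direction);
            if (!nents)
                    return -ENOMEM;

            for (i = 0; i < nents; i++) {
                    /* sg_dma_address(&sg[i]) is what would be handed to
                     * the HCA when building the memory registration. */
                    total += sg_dma_len(&sg[i]);
            }

            return total;   /* bytes covered by the mapping */
    }

Whether the premise holds is exactly what Christoph answers further down: the mapping is legal, and the mid-layer never guarantees a kernel virtual mapping for the pages.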
From yael at mellanox.co.il Thu Dec 8 02:39:30 2005 From: yael at mellanox.co.il (Yael Kalka) Date: 08 Dec 2005 12:39:30 +0200 Subject: [openib-general] [PATCH] Opensm - fix osm_vendor_get_all_port_attr Message-ID: <5z8xuvhnz1.fsf@mtl066.yok.mtl.com> Hi Hal, If osm_vendor_get_all_port_attr is called before the osm_vendor_bind, then the sm_lid of the default port isn't updated correctly. This patch fixes it. Thanks, Yael Signed-off-by: Yael Kalka Index: libvendor/osm_vendor_ibumad.c =================================================================== --- libvendor/osm_vendor_ibumad.c (revision 4345) +++ libvendor/osm_vendor_ibumad.c (working copy) @@ -577,6 +577,7 @@ osm_vendor_get_all_port_attr( int *p_linkstates = linkstates; umad_port_t def_port = {""}; int r, i, j; + int sm_lid = 0; OSM_LOG_ENTER( p_vend->p_log, osm_vendor_get_all_port_attr ); @@ -636,6 +637,8 @@ osm_vendor_get_all_port_attr( def_port.ca_name, def_port.portnum, cl_hton64(def_port.port_guid)); + sm_lid = def_port.sm_lid; + umad_release_port(&def_port); } @@ -644,6 +647,9 @@ osm_vendor_get_all_port_attr( for (i = 0; i < *p_num_ports; i++) { p_attr_array[i].port_guid = portguids[i]; p_attr_array[i].lid = lids[i]; + if (i == 0) + p_attr_array[i].sm_lid = sm_lid; + else p_attr_array[i].sm_lid = p_vend->umad_port.sm_lid; p_attr_array[i].link_state = linkstates[i]; }
@@ -842,10 +847,14 @@ __osm_pr_rcv_get_port_pair_paths( preference = 0; path_num = 0; - if( comp_mask & IB_PR_COMPMASK_NUMBPATH ) - iterations = p_pr->num_path & 0x7F; + /* If SubnAdmGet, assume NumbPaths 1 (1.2 erratum) */ + if (p_sa_mad->method != IB_MAD_METHOD_GET) + if( comp_mask & IB_PR_COMPMASK_NUMBPATH ) + iterations = p_pr->num_path & 0x7F; + else + iterations = (uintn_t)(-1); else - iterations = (uintn_t)(-1); + iterations = 1; while( path_num < iterations ) { @@ -1101,7 +1110,7 @@ __osm_pr_rcv_get_end_points( static void __osm_pr_rcv_process_world( IN osm_pr_rcv_t* const p_rcv, - IN const ib_path_rec_t* const p_pr, + IN const osm_madw_t* const p_madw, IN const osm_port_t* const requestor_port, IN const ib_net64_t comp_mask, IN cl_qlist_t* const p_list ) @@ -1128,7 +1137,7 @@ __osm_pr_rcv_process_world( p_src_port = (osm_port_t*)cl_qmap_head( p_tbl ); while( p_src_port != (osm_port_t*)cl_qmap_end( p_tbl ) ) { - __osm_pr_rcv_get_port_pair_paths( p_rcv, p_pr, requestor_port, p_src_port, + __osm_pr_rcv_get_port_pair_paths( p_rcv, p_madw, requestor_port, p_src_port, p_dest_port, comp_mask, p_list ); p_src_port = (osm_port_t*)cl_qmap_next( &p_src_port->map_item ); @@ -1145,7 +1154,7 @@ __osm_pr_rcv_process_world( static void __osm_pr_rcv_process_half( IN osm_pr_rcv_t* const p_rcv, - IN const ib_path_rec_t* const p_pr, + IN const osm_madw_t* const p_madw, IN const osm_port_t* const requestor_port, IN const osm_port_t* const p_src_port, IN const osm_port_t* const p_dest_port, @@ -1172,7 +1181,7 @@ __osm_pr_rcv_process_half( p_port = (osm_port_t*)cl_qmap_head( p_tbl ); while( p_port != (osm_port_t*)cl_qmap_end( p_tbl ) ) { - __osm_pr_rcv_get_port_pair_paths( p_rcv, p_pr, requestor_port, p_src_port, + __osm_pr_rcv_get_port_pair_paths( p_rcv, p_madw , requestor_port, p_src_port, p_port, comp_mask, p_list ); p_port = (osm_port_t*)cl_qmap_next( &p_port->map_item ); } @@ -1185,7 +1194,7 @@ __osm_pr_rcv_process_half( p_port = (osm_port_t*)cl_qmap_head( p_tbl ); while( p_port != (osm_port_t*)cl_qmap_end( p_tbl ) ) { - __osm_pr_rcv_get_port_pair_paths( p_rcv, p_pr, requestor_port, p_port, + __osm_pr_rcv_get_port_pair_paths( p_rcv, p_madw, requestor_port, p_port, p_dest_port, comp_mask, p_list ); p_port = (osm_port_t*)cl_qmap_next( &p_port->map_item ); } @@ -1199,7 +1208,7 @@ __osm_pr_rcv_process_half( static void __osm_pr_rcv_process_pair( IN osm_pr_rcv_t* const p_rcv, - IN const ib_path_rec_t* const p_pr, + IN const osm_madw_t* const p_madw, IN const osm_port_t* const requestor_port, IN const osm_port_t* const p_src_port, IN const osm_port_t* const p_dest_port, @@ -1208,7 +1217,7 @@ __osm_pr_rcv_process_pair( { OSM_LOG_ENTER( p_rcv->p_log, __osm_pr_rcv_process_pair ); - __osm_pr_rcv_get_port_pair_paths( p_rcv, p_pr, requestor_port, p_src_port, + __osm_pr_rcv_get_port_pair_paths( p_rcv, p_madw, requestor_port, p_src_port, p_dest_port, comp_mask, p_list ); OSM_LOG_EXIT( p_rcv->p_log ); @@ -1413,7 +1422,8 @@ __osm_pr_match_mgrp_attributes( goto Exit; } - if( comp_mask & IB_PR_COMPMASK_NUMBPATH ) + /* If SubnAdmGet, assume NumbPaths of 1 (1.2 erratum) */ + if( ( comp_mask & IB_PR_COMPMASK_NUMBPATH ) && ( p_sa_mad->method != IB_MAD_METHOD_GET ) ) { if( ( p_pr->num_path & 0x7f ) == 0 ) goto Exit; @@ -1513,7 +1523,7 @@ __osm_pr_rcv_respond( /* * C15-0.1.30: - * If we do a SubAdmGet and got more than one record it is an error ! + * If we do a SubnAdmGet and got more than one record it is an error ! 
*/ if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec > 1)) { @@ -1720,22 +1730,22 @@ osm_pr_rcv_process( if( p_src_port ) { if( p_dest_port ) - __osm_pr_rcv_process_pair( p_rcv, p_pr, requestor_port, p_src_port, p_dest_port, + __osm_pr_rcv_process_pair( p_rcv, p_madw, requestor_port, p_src_port, p_dest_port, p_sa_mad->comp_mask, &pr_list ); else - __osm_pr_rcv_process_half( p_rcv, p_pr, requestor_port, p_src_port, NULL, + __osm_pr_rcv_process_half( p_rcv, p_madw, requestor_port, p_src_port, NULL, p_sa_mad->comp_mask, &pr_list ); } else { if( p_dest_port ) - __osm_pr_rcv_process_half( p_rcv, p_pr, requestor_port, NULL, p_dest_port, + __osm_pr_rcv_process_half( p_rcv, p_madw, requestor_port, NULL, p_dest_port, p_sa_mad->comp_mask, &pr_list ); else /* Katie, bar the door! */ - __osm_pr_rcv_process_world( p_rcv, p_pr, requestor_port, + __osm_pr_rcv_process_world( p_rcv, p_madw, requestor_port, p_sa_mad->comp_mask, &pr_list ); } goto Unlock; From mst at mellanox.co.il Thu Dec 8 05:51:16 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 8 Dec 2005 15:51:16 +0200 Subject: [openib-general] [PATCH] core: fix user_mad memory leaks on timeout Message-ID: <20051208135116.GL21035@mellanox.co.il> Dont leak packet if it had a timeout. Dont leak timeout mad if queue_packet fails. Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. Tsirkin Index: openib/drivers/infiniband/core/user_mad.c =================================================================== --- openib.orig/drivers/infiniband/core/user_mad.c 2005-12-08 15:40:41.000000000 +0200 +++ openib/drivers/infiniband/core/user_mad.c 2005-12-08 15:40:28.000000000 +0200 @@ -197,8 +197,8 @@ static void send_handler(struct ib_mad_a memcpy(timeout->mad.data, packet->mad.data, sizeof (struct ib_mad_hdr)); - if (!queue_packet(file, agent, timeout)) - return; + if (queue_packet(file, agent, timeout)) + kfree(timeout); } out: kfree(packet); -- MST From mst at mellanox.co.il Thu Dec 8 05:55:43 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 8 Dec 2005 15:55:43 +0200 Subject: [openib-general] [PATCH rebase] large rmpp support Message-ID: <20051208135543.GM21035@mellanox.co.il> Hi! I am still looking at addressing Sean's comments. Meanwhile, for all adventurous testers out there, here's a revision of the previous large rmpp patch that applies on top of the memory leak fix that I've just posted. For review only. Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. 
Tsirkin Index: openib/drivers/infiniband/core/mad_rmpp.c =================================================================== --- openib.orig/drivers/infiniband/core/mad_rmpp.c 2005-11-22 10:53:48.000000000 +0200 +++ openib/drivers/infiniband/core/mad_rmpp.c 2005-12-08 15:44:35.000000000 +0200 @@ -433,44 +433,6 @@ static struct ib_mad_recv_wc * complete_ return rmpp_wc; } -void ib_coalesce_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, void *buf) -{ - struct ib_mad_recv_buf *seg_buf; - struct ib_rmpp_mad *rmpp_mad; - void *data; - int size, len, offset; - u8 flags; - - len = mad_recv_wc->mad_len; - if (len <= sizeof(struct ib_mad)) { - memcpy(buf, mad_recv_wc->recv_buf.mad, len); - return; - } - - offset = data_offset(mad_recv_wc->recv_buf.mad->mad_hdr.mgmt_class); - - list_for_each_entry(seg_buf, &mad_recv_wc->rmpp_list, list) { - rmpp_mad = (struct ib_rmpp_mad *)seg_buf->mad; - flags = ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr); - - if (flags & IB_MGMT_RMPP_FLAG_FIRST) { - data = rmpp_mad; - size = sizeof(*rmpp_mad); - } else { - data = (void *) rmpp_mad + offset; - if (flags & IB_MGMT_RMPP_FLAG_LAST) - size = len; - else - size = sizeof(*rmpp_mad) - offset; - } - - memcpy(buf, data, size); - len -= size; - buf += size; - } -} -EXPORT_SYMBOL(ib_coalesce_recv_mad); - static struct ib_mad_recv_wc * continue_rmpp(struct ib_mad_agent_private *agent, struct ib_mad_recv_wc *mad_recv_wc) @@ -570,16 +532,26 @@ start_rmpp(struct ib_mad_agent_private * return mad_recv_wc; } -static inline u64 get_seg_addr(struct ib_mad_send_wr_private *mad_send_wr) +static inline void * get_seg_addr(struct ib_mad_send_wr_private *mad_send_wr) { - return mad_send_wr->sg_list[0].addr + mad_send_wr->data_offset + - (sizeof(struct ib_rmpp_mad) - mad_send_wr->data_offset) * - (mad_send_wr->seg_num - 1); + struct ib_mad_multipacket_seg *seg; + int i = 2; + + if (list_empty(&mad_send_wr->multipacket_list)) + return NULL; + + list_for_each_entry(seg, &mad_send_wr->multipacket_list, list) { + if (i == mad_send_wr->seg_num) + return seg->data; + i++; + } + return NULL; } -static int send_next_seg(struct ib_mad_send_wr_private *mad_send_wr) +int send_next_seg(struct ib_mad_send_wr_private *mad_send_wr) { struct ib_rmpp_mad *rmpp_mad; + void *next_data; int timeout; u32 paylen; @@ -594,12 +566,14 @@ static int send_next_seg(struct ib_mad_s rmpp_mad->rmpp_hdr.paylen_newwin = cpu_to_be32(paylen); mad_send_wr->sg_list[0].length = sizeof(struct ib_rmpp_mad); } else { - mad_send_wr->send_wr.num_sge = 2; - mad_send_wr->sg_list[0].length = mad_send_wr->data_offset; - mad_send_wr->sg_list[1].addr = get_seg_addr(mad_send_wr); - mad_send_wr->sg_list[1].length = sizeof(struct ib_rmpp_mad) - - mad_send_wr->data_offset; - mad_send_wr->sg_list[1].lkey = mad_send_wr->sg_list[0].lkey; + next_data = get_seg_addr(mad_send_wr); + if (!next_data) { + printk(KERN_ERR PFX "send_next_seg: " + "could not find next segment\n"); + return -EINVAL; + } + memcpy((void *)rmpp_mad + mad_send_wr->data_offset, next_data, + sizeof(struct ib_rmpp_mad) - mad_send_wr->data_offset); rmpp_mad->rmpp_hdr.paylen_newwin = 0; } Index: openib/drivers/infiniband/include/rdma/ib_mad.h =================================================================== --- openib.orig/drivers/infiniband/include/rdma/ib_mad.h 2005-11-22 12:52:31.000000000 +0200 +++ openib/drivers/infiniband/include/rdma/ib_mad.h 2005-12-08 15:44:35.000000000 +0200 @@ -141,6 +141,11 @@ struct ib_rmpp_hdr { __be32 paylen_newwin; }; +struct ib_mad_multipacket_seg { + struct list_head list; + u8 data[0]; +}; + typedef 
u64 __bitwise ib_sa_comp_mask; #define IB_SA_COMP_MASK(n) ((__force ib_sa_comp_mask) cpu_to_be64(1ull << n)) @@ -485,17 +490,6 @@ int ib_unregister_mad_agent(struct ib_ma int ib_post_send_mad(struct ib_mad_send_buf *send_buf, struct ib_mad_send_buf **bad_send_buf); -/** - * ib_coalesce_recv_mad - Coalesces received MAD data into a single buffer. - * @mad_recv_wc: Work completion information for a received MAD. - * @buf: User-provided data buffer to receive the coalesced buffers. The - * referenced buffer should be at least the size of the mad_len specified - * by @mad_recv_wc. - * - * This call copies a chain of received MAD segments into a single data buffer, - * removing duplicated headers. - */ -void ib_coalesce_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, void *buf); /** * ib_free_recv_mad - Returns data buffers used to receive a MAD. @@ -601,6 +595,18 @@ struct ib_mad_send_buf * ib_create_send_ gfp_t gfp_mask); /** + * ib_append_to_multipacket_mad - Append a segment of an RMPP multipacket mad send + * to the send buffer. + * @send_buf: Previously allocated send data buffer. + * @seg: segment to append to linked list (already filled with data). + * + * This routine appends a segment of a multipacket RMPP message + * (copied from user space) to a MAD for sending. + */ +void ib_append_to_multipacket_mad(struct ib_mad_send_buf * send_buf, + struct ib_mad_multipacket_seg *seg); + +/** * ib_free_send_mad - Returns data buffers used to send a MAD. * @send_buf: Previously allocated send data buffer. */ Index: openib/drivers/infiniband/core/mad.c =================================================================== --- openib.orig/drivers/infiniband/core/mad.c 2005-11-28 09:03:21.000000000 +0200 +++ openib/drivers/infiniband/core/mad.c 2005-12-08 15:44:35.000000000 +0200 @@ -792,17 +792,13 @@ struct ib_mad_send_buf * ib_create_send_ return ERR_PTR(-EINVAL); length = sizeof *mad_send_wr + buf_size; - if (length >= PAGE_SIZE) - buf = (void *)__get_free_pages(gfp_mask, long_log2(roundup_pow_of_two(length)) - PAGE_SHIFT); - else - buf = kmalloc(length, gfp_mask); + buf = kzalloc(sizeof *mad_send_wr + sizeof(struct ib_mad), gfp_mask); if (!buf) return ERR_PTR(-ENOMEM); - memset(buf, 0, length); - - mad_send_wr = buf + buf_size; + mad_send_wr = buf + sizeof(struct ib_mad); + INIT_LIST_HEAD(&mad_send_wr->multipacket_list); mad_send_wr->send_buf.mad = buf; mad_send_wr->mad_agent_priv = mad_agent_priv; @@ -834,23 +830,33 @@ struct ib_mad_send_buf * ib_create_send_ } EXPORT_SYMBOL(ib_create_send_mad); +void ib_append_to_multipacket_mad(struct ib_mad_send_buf * send_buf, + struct ib_mad_multipacket_seg *seg) +{ + struct ib_mad_send_wr_private *mad_send_wr; + + mad_send_wr = container_of(send_buf, struct ib_mad_send_wr_private, + send_buf); + list_add_tail(&seg->list, &mad_send_wr->multipacket_list); +} +EXPORT_SYMBOL(ib_append_to_multipacket_mad); + void ib_free_send_mad(struct ib_mad_send_buf *send_buf) { struct ib_mad_agent_private *mad_agent_priv; - void *mad_send_wr; - int length; + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_multipacket_seg *seg, *tmp; mad_agent_priv = container_of(send_buf->mad_agent, struct ib_mad_agent_private, agent); mad_send_wr = container_of(send_buf, struct ib_mad_send_wr_private, send_buf); - length = sizeof(struct ib_mad_send_wr_private) + (mad_send_wr - send_buf->mad); - if (length >= PAGE_SIZE) - free_pages((unsigned long)send_buf->mad, long_log2(roundup_pow_of_two(length)) - PAGE_SHIFT); - else - kfree(send_buf->mad); - + list_for_each_entry_safe(seg, tmp, 
&mad_send_wr->multipacket_list, list) { + list_del(&seg->list); + kfree(seg); + } + kfree(send_buf->mad); if (atomic_dec_and_test(&mad_agent_priv->refcount)) wake_up(&mad_agent_priv->wait); } Index: openib/drivers/infiniband/core/mad_priv.h =================================================================== --- openib.orig/drivers/infiniband/core/mad_priv.h 2005-11-13 10:48:32.000000000 +0200 +++ openib/drivers/infiniband/core/mad_priv.h 2005-12-08 15:44:35.000000000 +0200 @@ -130,6 +130,7 @@ struct ib_mad_send_wr_private { enum ib_wc_status status; /* RMPP control */ + struct list_head multipacket_list; int last_ack; int seg_num; int newwin; Index: openib/drivers/infiniband/core/user_mad.c =================================================================== --- openib.orig/drivers/infiniband/core/user_mad.c 2005-12-08 15:40:28.000000000 +0200 +++ openib/drivers/infiniband/core/user_mad.c 2005-12-08 15:45:07.000000000 +0200 @@ -123,6 +123,7 @@ struct ib_umad_packet { struct ib_mad_send_buf *msg; struct list_head list; int length; + struct list_head seg_list; struct ib_user_mad mad; }; @@ -176,6 +177,87 @@ static int queue_packet(struct ib_umad_f return ret; } +static int data_offset(u8 mgmt_class) +{ + if (mgmt_class == IB_MGMT_CLASS_SUBN_ADM) + return IB_MGMT_SA_HDR; + else if ((mgmt_class >= IB_MGMT_CLASS_VENDOR_RANGE2_START) && + (mgmt_class <= IB_MGMT_CLASS_VENDOR_RANGE2_END)) + return IB_MGMT_VENDOR_HDR; + else + return IB_MGMT_RMPP_HDR; +} + +static int copy_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, + struct ib_umad_packet *packet) +{ + struct ib_mad_recv_buf *seg_buf; + struct ib_rmpp_mad *rmpp_mad; + void *data; + struct ib_mad_multipacket_seg *seg; + int size, len, offset; + u8 flags; + + len = mad_recv_wc->mad_len; + if (len <= sizeof(struct ib_mad)) { + memcpy(&packet->mad.data, mad_recv_wc->recv_buf.mad, len); + return 0; + } + + offset = data_offset(mad_recv_wc->recv_buf.mad->mad_hdr.mgmt_class); + + list_for_each_entry(seg_buf, &mad_recv_wc->rmpp_list, list) { + rmpp_mad = (struct ib_rmpp_mad *)seg_buf->mad; + flags = ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr); + + if (flags & IB_MGMT_RMPP_FLAG_FIRST) { + size = sizeof(*rmpp_mad); + memcpy(&packet->mad.data, rmpp_mad, size); + } else { + data = (void *) rmpp_mad + offset; + if (flags & IB_MGMT_RMPP_FLAG_LAST) + size = len; + else + size = sizeof(*rmpp_mad) - offset; + seg = kmalloc(sizeof(struct ib_mad_multipacket_seg) + + sizeof(struct ib_rmpp_mad) - offset, + GFP_KERNEL); + if (!seg) + return -ENOMEM; + memcpy(seg->data, data, size); + list_add_tail(&seg->list, &packet->seg_list); + } + len -= size; + } + return 0; +} + +static struct ib_umad_packet *alloc_packet(void) +{ + struct ib_umad_packet *packet; + int length = sizeof *packet + sizeof(struct ib_mad); + + packet = kzalloc(length, GFP_KERNEL); + if (!packet) { + printk(KERN_ERR "alloc_packet: mem alloc failed for length %d\n", + length); + return NULL; + } + INIT_LIST_HEAD(&packet->seg_list); + return packet; +} + +static void free_packet(struct ib_umad_packet *packet) +{ + struct ib_mad_multipacket_seg *seg, *tmp; + + list_for_each_entry_safe(seg, tmp, &packet->seg_list, list) { + list_del(&seg->list); + kfree(seg); + } + kfree(packet); +} + static void send_handler(struct ib_mad_agent *agent, struct ib_mad_send_wc *send_wc) { @@ -187,7 +269,7 @@ static void send_handler(struct ib_mad_a ib_free_send_mad(packet->msg); if (send_wc->status == IB_WC_RESP_TIMEOUT_ERR) { - timeout = kzalloc(sizeof *timeout + IB_MGMT_MAD_HDR, GFP_KERNEL); + timeout = alloc_packet(); if 
(!timeout) goto out; @@ -198,40 +280,12 @@ static void send_handler(struct ib_mad_a sizeof (struct ib_mad_hdr)); if (queue_packet(file, agent, timeout)) - kfree(timeout); + free_packet(timeout); } out: kfree(packet); } -static struct ib_umad_packet *alloc_packet(int buf_size) -{ - struct ib_umad_packet *packet; - int length = sizeof *packet + buf_size; - - if (length >= PAGE_SIZE) - packet = (void *)__get_free_pages(GFP_KERNEL, long_log2(roundup_pow_of_two(length)) - PAGE_SHIFT); - else - packet = kmalloc(length, GFP_KERNEL); - - if (!packet) - return NULL; - - memset(packet, 0, length); - return packet; -} - -static void free_packet(struct ib_umad_packet *packet) -{ - int length = packet->length + sizeof *packet; - if (length >= PAGE_SIZE) - free_pages((unsigned long) packet, long_log2(roundup_pow_of_two(length)) - PAGE_SHIFT); - else - kfree(packet); -} - - - static void recv_handler(struct ib_mad_agent *agent, struct ib_mad_recv_wc *mad_recv_wc) { @@ -243,13 +297,16 @@ static void recv_handler(struct ib_mad_a goto out; length = mad_recv_wc->mad_len; - packet = alloc_packet(length); + packet = alloc_packet(); if (!packet) goto out; packet->length = length; - ib_coalesce_recv_mad(mad_recv_wc, packet->mad.data); + if (copy_recv_mad(mad_recv_wc, packet)) { + free_packet(packet); + goto out; + } packet->mad.hdr.status = 0; packet->mad.hdr.length = length + sizeof (struct ib_user_mad); @@ -278,6 +335,7 @@ static ssize_t ib_umad_read(struct file size_t count, loff_t *pos) { struct ib_umad_file *file = filp->private_data; + struct ib_mad_multipacket_seg *seg; struct ib_umad_packet *packet; ssize_t ret; @@ -304,18 +362,42 @@ static ssize_t ib_umad_read(struct file spin_unlock_irq(&file->recv_lock); - if (count < packet->length + sizeof (struct ib_user_mad)) { - /* Return length needed (and first RMPP segment) if too small */ - if (copy_to_user(buf, &packet->mad, - sizeof (struct ib_user_mad) + sizeof (struct ib_mad))) - ret = -EFAULT; - else - ret = -ENOSPC; - } else if (copy_to_user(buf, &packet->mad, - packet->length + sizeof (struct ib_user_mad))) + if (copy_to_user(buf, &packet->mad, + sizeof(struct ib_user_mad) + sizeof(struct ib_mad))) { ret = -EFAULT; - else + goto err; + } + + if (count < packet->length + sizeof (struct ib_user_mad)) + /* User buffer too small. Return first RMPP segment (which + * includes RMPP message length). + */ + ret = -ENOSPC; + else if (packet->length <= sizeof(struct ib_mad)) + ret = packet->length + sizeof(struct ib_user_mad); + else { + int len = packet->length - sizeof(struct ib_mad); + struct ib_rmpp_mad *rmpp_mad = + (struct ib_rmpp_mad *) packet->mad.data; + int max_seg_payload = sizeof(struct ib_mad) - + data_offset(rmpp_mad->mad_hdr.mgmt_class); + int seg_payload; + /* multipacket RMPP MAD message. Copy remainder of message. + * Note that last segment may have a shorter payload. 
+ */ + buf += sizeof(struct ib_user_mad) + sizeof(struct ib_mad); + list_for_each_entry(seg, &packet->seg_list, list) { + seg_payload = min_t(int, len, max_seg_payload); + if (copy_to_user(buf, seg->data, seg_payload)) { + ret = -EFAULT; + goto err; + } + buf += seg_payload; + len -= seg_payload; + } ret = packet->length + sizeof (struct ib_user_mad); + } +err: if (ret < 0) { /* Requeue packet */ spin_lock_irq(&file->recv_lock); @@ -339,6 +421,8 @@ static ssize_t ib_umad_write(struct file __be64 *tid; int ret, length, hdr_len, copy_offset; int rmpp_active, has_rmpp_header; + int max_seg_payload; + struct ib_mad_multipacket_seg *seg; if (count < sizeof (struct ib_user_mad) + IB_MGMT_RMPP_HDR) return -EINVAL; @@ -415,6 +499,11 @@ static ssize_t ib_umad_write(struct file goto err_ah; } + if (!rmpp_active && length > sizeof(struct ib_mad)) { + ret = -EINVAL; + goto err_ah; + } + packet->msg = ib_create_send_mad(agent, be32_to_cpu(packet->mad.hdr.qpn), 0, rmpp_active, @@ -432,12 +521,39 @@ static ssize_t ib_umad_write(struct file /* Copy MAD headers (RMPP header in place) */ memcpy(packet->msg->mad, packet->mad.data, IB_MGMT_MAD_HDR); - /* Now, copy rest of message from user into send buffer */ + /* complete copying first 256 bytes of message into send buffer */ if (copy_from_user(packet->msg->mad + copy_offset, buf + sizeof (struct ib_user_mad) + copy_offset, - length - copy_offset)) { + min_t(int, length, sizeof(struct ib_mad)) - copy_offset)) { ret = -EFAULT; - goto err_msg; + goto err_ah; + } + + /* if multipacket, copy remainder of send message from user to multipacket list */ + length -= sizeof(struct ib_mad); + buf += sizeof (struct ib_user_mad) + sizeof(struct ib_mad); + max_seg_payload = sizeof(struct ib_mad) - + data_offset(rmpp_mad->mad_hdr.mgmt_class); + while (length > 0) { + int seg_payload = min_t(int, length, max_seg_payload); + seg = kzalloc(sizeof(struct ib_mad_multipacket_seg) + + max_seg_payload, GFP_KERNEL); + if (!seg) { + printk(KERN_ERR "ib_umad_write: " + "mem alloc failed for length %d\n", + sizeof(struct ib_mad_multipacket_seg) + + max_seg_payload); + ret = -ENOMEM; + goto err_msg; + } + + if (copy_from_user(seg->data, buf, seg_payload)) { + ret = -EFAULT; + goto err_msg; + } + ib_append_to_multipacket_mad(packet->msg, seg); + buf += seg_payload; + length -= seg_payload; } /* -- MST From halr at voltaire.com Thu Dec 8 06:15:33 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Dec 2005 09:15:33 -0500 Subject: [openib-general] [PATCH] [TRIVIAL] OpenSM: Add DivergeNet into recognized vendors in osm_helper.c Message-ID: <1134051331.4587.42550.camel@hal.voltaire.com> OpenSM: Add DivergeNet into list of recognized manufacturers in osm_helper.c::osm_get_manufacturer_str Signed-off-by: Hal Rosenstock Index: osm_helper.c =================================================================== --- osm_helper.c (revision 4332) +++ osm_helper.c (working copy) @@ -1849,6 +1849,7 @@ osm_get_node_type_str_fixed_width( #define OSM_VENDOR_ID_YOTTAYOTTA 0x000453 /* Also, Obsidian Research */ #define OSM_VENDOR_ID_PATHSCALE 0x001175 #define OSM_VENDOR_ID_IBM 0x000255 +#define OSM_VENDOR_ID_DIVERGENET 0x00084E /********************************************************************** **********************************************************************/ @@ -1866,6 +1867,7 @@ osm_get_manufacturer_str( static const char* yotta_str = "YottaYotta "; static const char* pathscale_str = "PathScale "; static const char* ibm_str = "IBM "; + static const char* divergenet_str = "DivergeNet "; 
static const char* unknown_str = "Unknown "; switch( (uint32_t)(guid_ho >> (5 * 8)) ) @@ -1891,6 +1893,8 @@ osm_get_manufacturer_str( return( pathscale_str ); case OSM_VENDOR_ID_IBM: return( ibm_str ); + case OSM_VENDOR_ID_DIVERGENET: + return( divergenet_str ); default: return( unknown_str ); } From ianjiang.ict at gmail.com Thu Dec 8 06:26:14 2005 From: ianjiang.ict at gmail.com (Ian Jiang) Date: Thu, 8 Dec 2005 22:26:14 +0800 Subject: [openib-general] [kDAPL]questions about the LMR creation of different types of memory Message-ID: <7b2fa1820512080626kf4c9c23hdc3f416dcb970f6d@mail.gmail.com> Hi James, As is known to all, there are several memory types used for memory registration in kDAPL. I have some questions about the types DAT_MEM_TYPE_PHYSICAL and DAT_MEM_TYPE_IA: 1) Could memory allocated by kmem_cache_create() be OK? AFAIK, memory allocated by kmalloc() is OK and that by vmalloc() is not. What about that allocated by kmem_cache_create()? Is it OK on the condition that the SLAB_CACHE_DMA flag is used? 2) What is the difference between DAT_MEM_TYPE_PHYSICAL and DAT_MEM_TYPE_IA when a continuous range of physical memory is to be registered? In my opinion, the continuous range should be translated into a series of page addresses before being registered as the DAT_MEM_TYPE_PHYSICAL type, and it's not necessary for the DAT_MEM_TYPE_IA type. Is the translation done in dat_lmr_kcreate() for the DAT_MEM_TYPE_IA type? Thanks a lot! -- Ian Jiang ianjiang.ict at gmail.com Laboratory of Spatial Information Technology Division of System Architecture Institute of Computing Technology Chinese Academy of Sciences -------------- next part -------------- An HTML attachment was scrubbed... URL: From hch at lst.de Thu Dec 8 06:42:15 2005 From: hch at lst.de (Christoph Hellwig) Date: Thu, 8 Dec 2005 15:42:15 +0100 Subject: [openib-general] assumptions on page mapping (was High memory) In-Reply-To: <4397DDFB.2060102@voltaire.com> References: <437C1622.5070505@cse.ohio-state.edu> <52acg25zr6.fsf@cisco.com> <4397DDFB.2060102@voltaire.com> Message-ID: <20051208144215.GA15022@lst.de> On Thu, Dec 08, 2005 at 09:17:15AM +0200, Or Gerlitz wrote: > Roland Dreier wrote: > >The right way to use the MR from get_dma_mr() is to use "bus > >addresses" from the DMA mapping API. For highmem, the right way to > >get those addresses is with dma_map_sg() or dma_map_page(). > > Looking on the kernel x86_64 code, both dma_map_sg and dma_map_page seem > to assume that the page is already mapped, since they call > page_address(page). x86_64 doesn't have highmem, so page_address(page) is valid on every page. > Specifically is it safe in a SCSI LLD (eg SRP and iSER which is among > other things such) to call dma_map_sg on a SG which comes with a SCSI yes, this is definitely safe. > command, so the SCSI Mid-Layer always makes sure the pages are mapped? no, it doesn't. in fact pages don't need to be mapped at all for dma normally. From halr at voltaire.com Thu Dec 8 07:03:16 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Dec 2005 10:03:16 -0500 Subject: [openib-general] Re: [PATCH] Opensm - fix osm_vendor_get_all_port_attr In-Reply-To: <5z8xuvhnz1.fsf@mtl066.yok.mtl.com> References: <5z8xuvhnz1.fsf@mtl066.yok.mtl.com> Message-ID: <1134054074.4485.3.camel@hal.voltaire.com> Hi Yael, On Thu, 2005-12-08 at 05:39, Yael Kalka wrote: > Hi Hal, > > If osm_vendor_get_all_port_attr is called before the osm_vendor_bind, What exercises the vendor calls in this manner ? > then the sm_lid of the default port isn't updated correctly.
> This patch fixes it. Thanks. Applied. -- Hal From halr at voltaire.com Thu Dec 8 07:12:29 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Dec 2005 10:12:29 -0500 Subject: [openib-general] Re: [PATCH] core: fix user_mad memory leaks on timeout In-Reply-To: <20051208135116.GL21035@mellanox.co.il> References: <20051208135116.GL21035@mellanox.co.il> Message-ID: <1134054749.4485.10.camel@hal.voltaire.com> On Thu, 2005-12-08 at 08:51, Michael S. Tsirkin wrote: > Dont leak packet if it had a timeout. > Dont leak timeout mad if queue_packet fails. > > Signed-off-by: Jack Morgenstein > Signed-off-by: Michael S. Tsirkin > > Index: openib/drivers/infiniband/core/user_mad.c > =================================================================== > --- openib.orig/drivers/infiniband/core/user_mad.c 2005-12-08 15:40:41.000000000 +0200 > +++ openib/drivers/infiniband/core/user_mad.c 2005-12-08 15:40:28.000000000 +0200 > @@ -197,8 +197,8 @@ static void send_handler(struct ib_mad_a > memcpy(timeout->mad.data, packet->mad.data, > sizeof (struct ib_mad_hdr)); > > - if (!queue_packet(file, agent, timeout)) > - return; > + if (queue_packet(file, agent, timeout)) > + kfree(timeout); Yes, there appears to be a memory leak here but I don't think this fix is quite right as it has lost the return when the queue_packet succeeds. Isn't that still needed ? if (!queue_packet(file, agent, timeout)) return; kfree(timeout); > } > out: > kfree(packet); Another point: on either failure to allocate the timeout MAD or failure to queue the timeout MAD, is simply throwing this away sufficient ? It seems to me that if this occurs, then the contract is broken and the client still needs to worry about its own timeout. -- Hal From mst at mellanox.co.il Thu Dec 8 07:25:21 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 8 Dec 2005 17:25:21 +0200 Subject: [openib-general] Re: [PATCH] core: fix user_mad memory leaks on timeout In-Reply-To: <1134054749.4485.10.camel@hal.voltaire.com> References: <1134054749.4485.10.camel@hal.voltaire.com> Message-ID: <20051208152521.GQ21035@mellanox.co.il> Quoting r. Hal Rosenstock : > Subject: Re: [PATCH] core: fix user_mad memory leaks on timeout > > On Thu, 2005-12-08 at 08:51, Michael S. Tsirkin wrote: > > Dont leak packet if it had a timeout. > > Dont leak timeout mad if queue_packet fails. > > > > Signed-off-by: Jack Morgenstein > > Signed-off-by: Michael S. Tsirkin > > > > Index: openib/drivers/infiniband/core/user_mad.c > > =================================================================== > > --- openib.orig/drivers/infiniband/core/user_mad.c 2005-12-08 > 15:40:41.000000000 +0200 > > +++ openib/drivers/infiniband/core/user_mad.c 2005-12-08 > 15:40:28.000000000 +0200 > > @@ -197,8 +197,8 @@ static void send_handler(struct ib_mad_a > > memcpy(timeout->mad.data, packet->mad.data, > > sizeof (struct ib_mad_hdr)); > > > > - if (!queue_packet(file, agent, timeout)) > > - return; > > + if (queue_packet(file, agent, timeout)) > > + kfree(timeout); > > Yes, there appears to be a memory leak here but I don't think this fix > is quite right as it has lost the return when the queue_packet succeeds. > Isn't that still needed ? No, the return here was wrong: we copied the packet and we need to free it anyway so falling through to kfree below is the correct behaviour. Thats what I mean by "Dont leak packet if it had a timeout". 
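To make the agreed control flow concrete, here is a sketch of send_handler() as it reads with the patch applied, reconstructed from the quoted diff; the recovery of "file" from agent->context and of "packet" from the work completion is an assumption about the surrounding code (summarized in a comment below), so treat this as illustrative rather than a verbatim copy of user_mad.c:

    static void send_handler(struct ib_mad_agent *agent,
                             struct ib_mad_send_wc *send_wc)
    {
            struct ib_umad_file *file = agent->context;
            struct ib_umad_packet *packet, *timeout;

            /* ... recover "packet" from send_wc and release its send
             * buffer via ib_free_send_mad(packet->msg), as in the
             * original function (elided here) ... */

            if (send_wc->status == IB_WC_RESP_TIMEOUT_ERR) {
                    timeout = kzalloc(sizeof *timeout + IB_MGMT_MAD_HDR,
                                      GFP_KERNEL);
                    if (!timeout)
                            goto out;

                    /* ... fill in the timeout packet's header fields ... */
                    memcpy(timeout->mad.data, packet->mad.data,
                           sizeof (struct ib_mad_hdr));

                    /* Free the timeout MAD only if queueing it failed;
                     * either way fall through, so the copied packet is
                     * freed below. The old "return" on success is what
                     * leaked the packet. */
                    if (queue_packet(file, agent, timeout))
                            kfree(timeout);
            }
    out:
            kfree(packet);
    }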
> if (!queue_packet(file, agent, timeout)) > return; > kfree(timeout); > > > } > > out: > > kfree(packet); > > Another point: on either failure to allocate the timeout MAD or failure > to queue the timeout MAD, is simply throwing this away sufficient ? It > seems to me that if this occurs, then the contract is broken and the > client still needs to worry about its own timeout. > > -- Hal > Not much to do though, since allocating memory with gfp kernel fails: lets at least be careful to avoid a crash or memory leak. -- MST From mst at mellanox.co.il Thu Dec 8 07:36:46 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 8 Dec 2005 17:36:46 +0200 Subject: [openib-general] mthca_multicast_attach/detach questions Message-ID: <20051208153646.GR21035@mellanox.co.il> Hello, Roland! 1. error handling in mthca_multicast_attach looks strange: in particular, dont we want to revert the result of mthca_alloc if QP is already a member of MGM, or if MGM is full? 2. mthca_multicast_detach has an unconditional goto if (i != 1) goto out; goto out; this looks wrong: it seems you'll never remove an empty multicast group. Comments? -- MST From halr at voltaire.com Thu Dec 8 07:35:14 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Dec 2005 10:35:14 -0500 Subject: [openib-general] Re: [PATCH] core: fix user_mad memory leaks on timeout In-Reply-To: <20051208152521.GQ21035@mellanox.co.il> References: <1134054749.4485.10.camel@hal.voltaire.com> <20051208152521.GQ21035@mellanox.co.il> Message-ID: <1134056114.4485.13.camel@hal.voltaire.com> On Thu, 2005-12-08 at 10:25, Michael S. Tsirkin wrote: > Quoting r. Hal Rosenstock : > > Subject: Re: [PATCH] core: fix user_mad memory leaks on timeout > > > > On Thu, 2005-12-08 at 08:51, Michael S. Tsirkin wrote: > > > Dont leak packet if it had a timeout. > > > Dont leak timeout mad if queue_packet fails. > > > > > > Signed-off-by: Jack Morgenstein > > > Signed-off-by: Michael S. Tsirkin > > > > > > Index: openib/drivers/infiniband/core/user_mad.c > > > =================================================================== > > > --- openib.orig/drivers/infiniband/core/user_mad.c 2005-12-08 > > 15:40:41.000000000 +0200 > > > +++ openib/drivers/infiniband/core/user_mad.c 2005-12-08 > > 15:40:28.000000000 +0200 > > > @@ -197,8 +197,8 @@ static void send_handler(struct ib_mad_a > > > memcpy(timeout->mad.data, packet->mad.data, > > > sizeof (struct ib_mad_hdr)); > > > > > > - if (!queue_packet(file, agent, timeout)) > > > - return; > > > + if (queue_packet(file, agent, timeout)) > > > + kfree(timeout); > > > > Yes, there appears to be a memory leak here but I don't think this fix > > is quite right as it has lost the return when the queue_packet succeeds. > > Isn't that still needed ? > > No, the return here was wrong: we copied the packet > and we need to free it anyway > so falling through to kfree below is the correct behaviour. > Thats what I mean by "Dont leak packet if it had a timeout". You're right. > > if (!queue_packet(file, agent, timeout)) > > return; > > kfree(timeout); > > > > > } > > > out: > > > kfree(packet); > > > > Another point: on either failure to allocate the timeout MAD or failure > > to queue the timeout MAD, is simply throwing this away sufficient ? It > > seems to me that if this occurs, then the contract is broken and the > > client still needs to worry about its own timeout. 
> > > > -- Hal > > > > Not much to do though, since allocating memory with gfp kernel fails: Couldn't the callback be rescheduled for some time later where the allocation might succeed ? > lets at least be careful to avoid a crash or memory leak. Agreed. -- Hal From rdreier at cisco.com Thu Dec 8 07:44:10 2005 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 08 Dec 2005 07:44:10 -0800 Subject: [openib-general] Re: mthca_multicast_attach/detach questions In-Reply-To: <20051208153646.GR21035@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 8 Dec 2005 17:36:46 +0200") References: <20051208153646.GR21035@mellanox.co.il> Message-ID: Michael> Hello, Roland! 1. error handling in Michael> mthca_multicast_attach looks strange: in particular, dont Michael> we want to revert the result of mthca_alloc if QP is Michael> already a member of MGM, or if MGM is full? Yes, I would use the word "wrong" instead of "strange." We certainly have to undo the alloc if something after it fails. Michael> 2. mthca_multicast_detach has an unconditional goto Michael> if (i != 1) goto out; Michael> goto out; Michael> this looks wrong: it seems you'll never remove an empty Michael> multicast group. Yes, the second goto looks wrong as well... probably some sort of typing/editing error in the past. - R. From mst at mellanox.co.il Thu Dec 8 07:54:37 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 8 Dec 2005 17:54:37 +0200 Subject: [openib-general] Re: [PATCH] core: fix user_mad memory leaks on timeout In-Reply-To: <1134056114.4485.13.camel@hal.voltaire.com> References: <1134056114.4485.13.camel@hal.voltaire.com> Message-ID: <20051208155437.GS21035@mellanox.co.il> Quoting Hal Rosenstock : > Couldn't the callback be rescheduled for some time later where the > allocation might succeed ? Arent we allocating with GFP_KERNEL already? I think this means "try as hard as you can" already. No? Maybe we could allocate the timeout packet together with allocating the send mad. > > lets at least be careful to avoid a crash or memory leak. > > Agreed. -- MST From yipeeyipeeyipeeyipee at yahoo.com Thu Dec 8 07:54:29 2005 From: yipeeyipeeyipeeyipee at yahoo.com (yipee) Date: Thu, 8 Dec 2005 15:54:29 +0000 (UTC) Subject: [openib-general] QP from userspace used in kernel Message-ID: Hi, What are the reasons that a qp allocated in user-space can't be passed and used by a kernel module? What are the steps needed to make a userspace-allocated qp usable by a kernel module? Thanks, x From jlentini at netapp.com Thu Dec 8 08:05:48 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 8 Dec 2005 08:05:48 -0800 (PST) Subject: [openib-general] Re: [kDAPL]questions about the LMR creation of different types of memory In-Reply-To: <7b2fa1820512080626kf4c9c23hdc3f416dcb970f6d@mail.gmail.com> Message-ID: ian> Hi James, ian> As is known to all, there several memory types uesed for memory ian> register in kDAPL. I have some questions about the types ian> DAT_MEM_TYPE_PHYSICAL and DAT_MEM_TYPE_IA: ian> 1) Could memory allocated by kmem_cache_create() be OK? ian> AFAIK, memory allocated by kmalloc() is OK and that by vmalloc() ian> is not. Correct, assuming you pass kmalloc the GFP_DMA flag. ian> What about that allocated by kmem_cache_create()? Is it ian> OK in the condition that the SLAB_CACHE_DMA flag is used? Yes. ian> 2) What is the difference between DAT_MEM_TYPE_PHYSICAL and ian> DAT_MEM_TYPE_IA when a continuous range of physical memory is to ian> be registered? 
ian> In my opinion, the continuous range should be tranlated into a ian> serious of page addresses before registered as the ian> DAT_MEM_TYPE_PHYSICAL type, correct ian> and it's not necessary for the DAT_MEM_TYPE_IA type. Is the ian> translation is done in the dat_lmr_kcreate() for the ian> DAT_MEM_TYPE_IA type? No translation is done in dat_lmr_kcreate for DAT_MEM_TYPE_IA. A DAT_MEM_TYPE_IA address is supposed to be an I/O address that the adapter can use. From jackm at mellanox.co.il Thu Dec 8 08:16:25 2005 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Thu, 8 Dec 2005 18:16:25 +0200 Subject: [openib-general] [PATCH] mthca: fix memory leak Message-ID: <20051208161625.GA7653@mellanox.co.il> Hi, This patch frees the memory allocated in mthca_init_user_db_tab. Signed-off-by: Jack Morgenstein Index: linux-kernel/drivers/infiniband/hw/mthca/mthca_memfree.c =================================================================== --- linux-kernel.orig/drivers/infiniband/hw/mthca/mthca_memfree.c +++ linux-kernel/drivers/infiniband/hw/mthca/mthca_memfree.c @@ -485,6 +485,7 @@ void mthca_cleanup_user_db_tab(struct mt put_page(db_tab->page[i].mem.page); } } + kfree(db_tab); } int mthca_alloc_db(struct mthca_dev *dev, enum mthca_db_type type, From mst at mellanox.co.il Thu Dec 8 08:16:46 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 8 Dec 2005 18:16:46 +0200 Subject: [openib-general] ipoib: ipoib_mcast_join_finish oops Message-ID: <20051208161646.GT21035@mellanox.co.il> Roland, from some ipoib oopses that I see, it seems, that ipoib_mcast_join_finish is running when priv->dev->broadcast is NULL. Any idea how could that be the case? -- MST From caitlinb at broadcom.com Thu Dec 8 08:22:51 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 8 Dec 2005 08:22:51 -0800 Subject: [openib-general] QP from userspace used in kernel Message-ID: <54AD0F12E08D1541B826BE97C98F99F10C2BF6@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > Hi, > > What are the reasons that a qp allocated in user-space can't > be passed and used by a kernel module? > What are the steps needed to make a userspace-allocated qp usable by > a kernel module? > > You would need to construct an environment such that the device-specific verbs module, which assumes it is executing in the user space where the QP was created, would never notice the difference. The device-specific verbs will typically have created shared memory resources that are accessible by both the RDMA device and from the creating user memory map. These resources may include pointers that assume the original memory map. The exact methods of remembering the locations of these resources will vary by device, so the chance of coming up with a scheme that works without explicit support of all device vendors is very low. The chances of convincing all device vendors to add a new option to support this model is similarly low unless you can make a very compelling case as to why this is necessary. Having the in-kernel proxy create the QP and do operations for the end-user is a very adequate work around. For complex cleanup purposes the kernel could simply assume the identity of the failed process, but that would only be required if the standard cleanup was somehow not adequate. From mst at mellanox.co.il Thu Dec 8 10:34:06 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 8 Dec 2005 20:34:06 +0200 Subject: [openib-general] [PATCH fixed] was Re: [PATCH] ipoib_multicast/ipoib_mcast_send race In-Reply-To: <20051207164433.GA21035@mellanox.co.il> References: <20051207164433.GA21035@mellanox.co.il> Message-ID: <20051208183406.GA13614@mellanox.co.il> Quoting Michael S. Tsirkin : > Subject: [PATCH] ipoib_multicast/ipoib_mcast_send race > > Hello, Roland! > Here's another race scenario. > > --- > > Fix the following race scenario: > device is up. > port event or set mcast list triggers ipoib_mcast_stop_thread, > This cancels the query and waits on mcast "done" completion. > completion is called and "done" is set. > Meanwhile, ipoib_mcast_send arrives and starts a new query, > re-initializing "done". > > Signed-off-by: Michael S. Tsirkin The patch I posted previously leaked an skb when a multicast send arrived while the mcast thread is stopped. Further, there's an additional issue that I saw in testing: ipoib_mcast_send may get called when priv->broadcast is NULL (e.g. if the device was downed and then upped internally because of a port event). If this happends and the sendonly join request gets completed before priv->broadcast is set, we get an oops that I posted previously. Here's a better patch to address these two problems. It has been running fine here for a while now. Please note that this replaces the ipoib_multicast/ipoib_mcast_send patch, but not the ADMIN_UP patch that I posted previously. --- Do not send multicasts if mcast thread is stopped or if priv->broadcast is not set. Signed-off-by: Michael S. Tsirkin Index: openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c (revision 4222) +++ openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c (working copy) @@ -582,6 +582,10 @@ int ipoib_mcast_start_thread(struct net_ queue_work(ipoib_workqueue, &priv->mcast_task); up(&mcast_mutex); + spin_lock_irq(&priv->lock); + set_bit(IPOIB_MCAST_STARTED, &priv->flags); + spin_unlock_irq(&priv->lock); + return 0; } @@ -592,6 +596,10 @@ int ipoib_mcast_stop_thread(struct net_d ipoib_dbg_mcast(priv, "stopping multicast thread\n"); + spin_lock_irq(&priv->lock); + clear_bit(IPOIB_MCAST_STARTED, &priv->flags); + spin_unlock_irq(&priv->lock); + down(&mcast_mutex); clear_bit(IPOIB_MCAST_RUN, &priv->flags); cancel_delayed_work(&priv->mcast_task); @@ -674,6 +682,11 @@ void ipoib_mcast_send(struct net_device */ spin_lock(&priv->lock); + if (!test_bit(IPOIB_MCAST_STARTED, &priv->flags) || !priv->broadcast) { + dev_kfree_skb_any(skb); + goto unlock; + } + mcast = __ipoib_mcast_find(dev, mgid); if (!mcast) { /* Let's create a new send only group now */ @@ -732,6 +745,7 @@ out: ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN); } +unlock: spin_unlock(&priv->lock); } Index: openib/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- openib/drivers/infiniband/ulp/ipoib/ipoib.h (revision 4222) +++ openib/drivers/infiniband/ulp/ipoib/ipoib.h (working copy) @@ -78,6 +78,7 @@ enum { IPOIB_FLAG_SUBINTERFACE = 4, IPOIB_MCAST_RUN = 5, IPOIB_STOP_REAPER = 6, + IPOIB_MCAST_STARTED = 7, IPOIB_MAX_BACKOFF_SECONDS = 16, -- MST From rdreier at cisco.com Thu Dec 8 12:16:03 2005 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 08 Dec 2005 12:16:03 -0800 Subject: [openib-general] Re: [kDAPL]questions about the LMR creation of different types of memory In-Reply-To: (James Lentini's message of "Thu, 8 
Dec 2005 08:05:48 -0800 (PST)") References: Message-ID: ian> 1) Could memory allocated by kmem_cache_create() be OK? ian> AFAIK, memory allocated by kmalloc() is OK and that by ian> vmalloc() is not. James> Correct, assuming you pass kmalloc the GFP_DMA flag. No, the GFP_DMA flag is not necessary. On x86 it means to allocate from the 24-bit ISA DMA region (ie the low 16 MB of RAM). In general it is never necessary to use GFP_DMA in modern code. - R. From sean.hefty at intel.com Thu Dec 8 16:59:26 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 8 Dec 2005 16:59:26 -0800 Subject: [openib-general] [PATCH] [CMA] support for SDP + standard protocol Message-ID: The following patch updates the CMA to support the IB socket-based protocol standard and SDP's private data format. The CMA now defines RDMA "port spaces". RDMA identifiers are associated with a user-specified port space at creation time. Please respond with any comments on the approach. Note that these changes have not been pushed up to userspace yet. Signed-off-by: Sean Hefty Index: ulp/iser/iser_verbs.c =================================================================== --- ulp/iser/iser_verbs.c (revision 4356) +++ ulp/iser/iser_verbs.c (working copy) @@ -428,7 +428,8 @@ iser_connect(struct iser_conn *p_iser_co return -1; } p_iser_conn->cma_id = rdma_create_id(iser_cma_handler, - (void *)p_iser_conn); + (void *)p_iser_conn, + RDMA_PS_TCP); if (IS_ERR(p_iser_conn->cma_id)) { ret = PTR_ERR(p_iser_conn->cma_id); iser_err("rdma_create_id failed: %d\n", ret); Index: include/rdma/rdma_cm.h =================================================================== --- include/rdma/rdma_cm.h (revision 4356) +++ include/rdma/rdma_cm.h (working copy) @@ -54,6 +54,13 @@ enum rdma_cm_event_type { RDMA_CM_EVENT_DEVICE_REMOVAL, }; +enum rdma_port_space { + RDMA_PS_SDP = 0x0001, + RDMA_PS_TCP = 0x0106, + RDMA_PS_UDP = 0x0111, + RDMA_PS_SCTP = 0x0183 +}; + struct rdma_addr { struct sockaddr src_addr; u8 src_pad[sizeof(struct sockaddr_in6) - @@ -97,11 +104,20 @@ struct rdma_cm_id { struct ib_qp *qp; rdma_cm_event_handler event_handler; struct rdma_route route; + enum rdma_port_space ps; u8 port_num; }; +/** + * rdma_create_id - Create an RDMA identifier. + * + * @event_handler: User callback invoked to report events associated with the + * returned rdma_id. + * @context: User specified context associated with the id. + * @ps: RDMA port space. 
+ */ struct rdma_cm_id* rdma_create_id(rdma_cm_event_handler event_handler, - void *context); + void *context, enum rdma_port_space ps); void rdma_destroy_id(struct rdma_cm_id *id); Index: core/cma.c =================================================================== --- core/cma.c (revision 4356) +++ core/cma.c (working copy) @@ -110,21 +110,35 @@ struct rdma_id_private { u8 srq; }; -struct cma_addr { - u8 version; /* CMA version: 7:4, IP version: 3:0 */ - u8 reserved; - __u16 port; +union cma_ip_addr { + struct in6_addr ip6; struct { - union { - struct in6_addr ip6; - struct { - __u32 pad[3]; - __u32 addr; - } ip4; - } ver; - } src_addr, dst_addr; + __u32 pad[3]; + __u32 addr; + } ip4; +}; + +struct cma_hdr { + u8 cma_version; + u8 ip_version; /* IP version: 7:4 */ + __u16 port; + union cma_ip_addr src_addr; + union cma_ip_addr dst_addr; }; +struct sdp_hh { + u8 sdp_version; + u8 ip_version; /* IP version: 7:4 */ + u8 sdp_specific1[10]; + __u16 port; + __u16 sdp_specific2; + union cma_ip_addr src_addr; + union cma_ip_addr dst_addr; +}; + +#define CMA_VERSION 0x10 +#define SDP_VERSION 0x22 + static int cma_comp(struct rdma_id_private *id_priv, enum cma_state comp) { unsigned long flags; @@ -162,19 +176,24 @@ static enum cma_state cma_exch(struct rd return old; } -static inline u8 cma_get_ip_ver(struct cma_addr *addr) +static inline u8 cma_get_ip_ver(struct cma_hdr *hdr) { - return addr->version & 0xF; + return hdr->ip_version >> 4; } -static inline u8 cma_get_cma_ver(struct cma_addr *addr) +static inline void cma_set_ip_ver(struct cma_hdr *hdr, u8 ip_ver) { - return addr->version >> 4; + hdr->ip_version = (ip_ver << 4) | (hdr->ip_version & 0xF); } -static inline void cma_set_vers(struct cma_addr *addr, u8 cma_ver, u8 ip_ver) +static inline u8 sdp_get_ip_ver(struct sdp_hh *hh) { - addr->version = (cma_ver << 4) + (ip_ver & 0xF); + return hh->ip_version >> 4; +} + +static inline void sdp_set_ip_ver(struct sdp_hh *hh, u8 ip_ver) +{ + hh->ip_version = (ip_ver << 4) | (hh->ip_version & 0xF); } static void cma_attach_to_dev(struct rdma_id_private *id_priv, @@ -226,17 +245,18 @@ static void cma_release_remove(struct rd } struct rdma_cm_id* rdma_create_id(rdma_cm_event_handler event_handler, - void *context) + void *context, enum rdma_port_space ps) { struct rdma_id_private *id_priv; id_priv = kzalloc(sizeof *id_priv, GFP_KERNEL); if (!id_priv) - return NULL; + return ERR_PTR(-ENOMEM); id_priv->state = CMA_IDLE; id_priv->id.context = context; id_priv->id.event_handler = event_handler; + id_priv->id.ps = ps; spin_lock_init(&id_priv->lock); init_waitqueue_head(&id_priv->wait); atomic_set(&id_priv->refcount, 1); @@ -387,25 +407,93 @@ int rdma_init_qp_attr(struct rdma_cm_id } EXPORT_SYMBOL(rdma_init_qp_attr); -static int cma_verify_addr(struct cma_addr *addr, - struct sockaddr_in *ip_addr) +static inline int cma_any_addr(struct sockaddr *addr) { - if (cma_get_cma_ver(addr) != 1 || cma_get_ip_ver(addr) != 4) - return -EINVAL; + struct in6_addr *ip6; - if (ip_addr->sin_port != addr->port) - return -EINVAL; + if (addr->sa_family == AF_INET) + return ((struct sockaddr_in *) addr)->sin_addr.s_addr == + INADDR_ANY; + else { + ip6 = &((struct sockaddr_in6 *) addr)->sin6_addr; + return (ip6->s6_addr32[0] | ip6->s6_addr32[1] | + ip6->s6_addr32[2] | ip6->s6_addr32[3]) == 0; + } +} - if (ip_addr->sin_addr.s_addr && - (ip_addr->sin_addr.s_addr != addr->dst_addr.ver.ip4.addr)) - return -EINVAL; +static int cma_get_net_info(void *hdr, enum rdma_port_space ps, + u8 *ip_ver, __u16 *port, + union cma_ip_addr **src,
union cma_ip_addr **dst) +{ + switch (ps) { + case RDMA_PS_SDP: + if (((struct sdp_hh *) hdr)->sdp_version != SDP_VERSION) + return -EINVAL; + *ip_ver = sdp_get_ip_ver(hdr); + *port = ((struct sdp_hh *) hdr)->port; + *src = &((struct sdp_hh *) hdr)->src_addr; + *dst = &((struct sdp_hh *) hdr)->dst_addr; + break; + default: + if (((struct cma_hdr *) hdr)->cma_version != CMA_VERSION) + return -EINVAL; + + *ip_ver = cma_get_ip_ver(hdr); + *port = ((struct cma_hdr *) hdr)->port; + *src = &((struct cma_hdr *) hdr)->src_addr; + *dst = &((struct cma_hdr *) hdr)->dst_addr; + break; + } return 0; } -static inline int cma_any_addr(struct sockaddr *addr) +static void cma_save_net_info(struct rdma_addr *addr, + struct rdma_addr *listen_addr, + u8 ip_ver, __u16 port, + union cma_ip_addr *src, union cma_ip_addr *dst) +{ + struct sockaddr_in *listen4, *ip4; + struct sockaddr_in6 *listen6, *ip6; + + switch (ip_ver) { + case 4: + listen4 = (struct sockaddr_in *) &listen_addr->src_addr; + ip4 = (struct sockaddr_in *) &addr->src_addr; + ip4->sin_family = listen4->sin_family; + ip4->sin_addr.s_addr = dst->ip4.addr; + ip4->sin_port = listen4->sin_port; + + ip4 = (struct sockaddr_in *) &addr->dst_addr; + ip4->sin_family = listen4->sin_family; + ip4->sin_addr.s_addr = src->ip4.addr; + ip4->sin_port = port; + break; + case 6: + listen6 = (struct sockaddr_in6 *) &listen_addr->src_addr; + ip6 = (struct sockaddr_in6 *) &addr->src_addr; + ip6->sin6_family = listen6->sin6_family; + ip6->sin6_addr = dst->ip6; + ip6->sin6_port = listen6->sin6_port; + + ip6 = (struct sockaddr_in6 *) &addr->dst_addr; + ip6->sin6_family = listen6->sin6_family; + ip6->sin6_addr = src->ip6; + ip6->sin6_port = port; + break; + default: + break; + } +} + +static inline int cma_user_data_offset(enum rdma_port_space ps) { - return ((struct sockaddr_in *) addr)->sin_addr.s_addr == 0; + switch (ps) { + case RDMA_PS_SDP: + return 0; + default: + return sizeof(struct cma_hdr); + } } static int cma_notify_user(struct rdma_id_private *id_priv, @@ -640,53 +728,41 @@ static struct rdma_id_private* cma_new_i { struct rdma_id_private *id_priv; struct rdma_cm_id *id; - struct rdma_route *route; - struct sockaddr_in *ip_addr, *listen_addr; - struct ib_sa_path_rec *path_rec; - struct cma_addr *addr; - int num_paths; - - listen_addr = (struct sockaddr_in *) &listen_id->route.addr.src_addr; - if (cma_verify_addr(ib_event->private_data, listen_addr)) - return NULL; + struct rdma_route *rt; + union cma_ip_addr *src, *dst; + __u16 port; + u8 ip_ver; - num_paths = 1 + (ib_event->param.req_rcvd.alternate_path != NULL); - path_rec = kmalloc(sizeof *path_rec * num_paths, GFP_KERNEL); - if (!path_rec) + id = rdma_create_id(listen_id->event_handler, listen_id->context, + listen_id->ps); + if (IS_ERR(id)) return NULL; - id = rdma_create_id(listen_id->event_handler, listen_id->context); - if (!id) + rt = &id->route; + rt->num_paths = ib_event->param.req_rcvd.alternate_path ? 
2 : 1; + rt->path_rec = kmalloc(sizeof *rt->path_rec * rt->num_paths, GFP_KERNEL); + if (!rt->path_rec) goto err; - addr = ib_event->private_data; - route = &id->route; + if (cma_get_net_info(ib_event->private_data, listen_id->ps, + &ip_ver, &port, &src, &dst)) + goto err; - ip_addr = (struct sockaddr_in *) &route->addr.src_addr; - ip_addr->sin_family = listen_addr->sin_family; - ip_addr->sin_addr.s_addr = addr->dst_addr.ver.ip4.addr; - ip_addr->sin_port = listen_addr->sin_port; - - ip_addr = (struct sockaddr_in *) &route->addr.dst_addr; - ip_addr->sin_family = listen_addr->sin_family; - ip_addr->sin_addr.s_addr = addr->src_addr.ver.ip4.addr; - ip_addr->sin_port = addr->port; - - route->num_paths = num_paths; - route->path_rec = path_rec; - path_rec[0] = *ib_event->param.req_rcvd.primary_path; - if (num_paths == 2) - path_rec[1] = *ib_event->param.req_rcvd.alternate_path; - - route->addr.addr.ibaddr.sgid = path_rec->sgid; - route->addr.addr.ibaddr.dgid = path_rec->dgid; - route->addr.addr.ibaddr.pkey = be16_to_cpu(path_rec->pkey); + cma_save_net_info(&id->route.addr, &listen_id->route.addr, + ip_ver, port, src, dst); + rt->path_rec[0] = *ib_event->param.req_rcvd.primary_path; + if (rt->num_paths == 2) + rt->path_rec[1] = *ib_event->param.req_rcvd.alternate_path; + + rt->addr.addr.ibaddr.sgid = rt->path_rec[0].sgid; + rt->addr.addr.ibaddr.dgid = rt->path_rec[0].dgid; + rt->addr.addr.ibaddr.pkey = be16_to_cpu(rt->path_rec[0].pkey); id_priv = container_of(id, struct rdma_id_private, id); id_priv->state = CMA_CONNECT; return id_priv; err: - kfree(path_rec); + rdma_destroy_id(id); return NULL; } @@ -708,7 +784,6 @@ static int cma_req_handler(struct ib_cm_ goto out; } - conn_id->state = CMA_CONNECT; atomic_inc(&conn_id->dev_remove); ret = cma_acquire_ib_dev(conn_id, &conn_id->id.route.path_rec[0].sgid); if (ret) { @@ -722,7 +797,7 @@ static int cma_req_handler(struct ib_cm_ cm_id->context = conn_id; cm_id->cm_handler = cma_ib_handler; - offset = sizeof(struct cma_addr); + offset = cma_user_data_offset(listen_id->id.ps); ret = cma_notify_user(conn_id, RDMA_CM_EVENT_CONNECT_REQUEST, 0, ib_event->private_data + offset, IB_CM_REQ_PRIVATE_DATA_SIZE - offset); @@ -738,16 +813,16 @@ out: return ret; } -static __be64 cma_get_service_id(struct sockaddr *addr) +static __be64 cma_get_service_id(enum rdma_port_space ps, struct sockaddr *addr) { - return cpu_to_be64(((u64)IB_OPENIB_OUI << 48) + + return cpu_to_be64(((u64)ps << 16) + ((struct sockaddr_in *) addr)->sin_port); } static void cma_set_compare_data(struct sockaddr *addr, struct ib_cm_private_data_compare *compare) { - struct cma_addr *data, *mask; + struct cma_hdr *data, *mask; memset(compare, 0, sizeof *compare); data = (void *) compare->data; @@ -755,19 +830,18 @@ static void cma_set_compare_data(struct switch (addr->sa_family) { case AF_INET: - cma_set_vers(data, 0, 4); - cma_set_vers(mask, 0, 0xF); - data->dst_addr.ver.ip4.addr = ((struct sockaddr_in *) addr)-> - sin_addr.s_addr; - mask->dst_addr.ver.ip4.addr = ~0; + cma_set_ip_ver(data, 4); + cma_set_ip_ver(mask, 0xF); + data->dst_addr.ip4.addr = ((struct sockaddr_in *) addr)-> + sin_addr.s_addr; + mask->dst_addr.ip4.addr = ~0; break; case AF_INET6: - cma_set_vers(data, 0, 6); - cma_set_vers(mask, 0, 0xF); - data->dst_addr.ver.ip6 = ((struct sockaddr_in6 *) addr)-> - sin6_addr; - memset(&mask->dst_addr.ver.ip6, 1, - sizeof mask->dst_addr.ver.ip6); + cma_set_ip_ver(data, 6); + cma_set_ip_ver(mask, 0xF); + data->dst_addr.ip6 = ((struct sockaddr_in6 *) addr)-> + sin6_addr; + 
memset(&mask->dst_addr.ip6, 1, sizeof mask->dst_addr.ip6); break; default: break; @@ -787,7 +861,7 @@ static int cma_ib_listen(struct rdma_id_ return PTR_ERR(id_priv->cm_id); addr = &id_priv->id.route.addr.src_addr; - svc_id = cma_get_service_id(addr); + svc_id = cma_get_service_id(id_priv->id.ps, addr); if (cma_any_addr(addr)) ret = ib_cm_listen(id_priv->cm_id, svc_id, 0, NULL); else { @@ -835,7 +909,7 @@ static void cma_listen_on_dev(struct rdm struct rdma_cm_id *id; int ret; - id = rdma_create_id(cma_listen_handler, id_priv); + id = rdma_create_id(cma_listen_handler, id_priv, id_priv->id.ps); if (IS_ERR(id)) return; @@ -1099,19 +1173,34 @@ err: } EXPORT_SYMBOL(rdma_bind_addr); -static void cma_format_addr(struct cma_addr *addr, struct rdma_route *route) +static void cma_format_hdr(void *hdr, enum rdma_port_space ps, + struct rdma_route *route) { - struct sockaddr_in *ip_addr; - - memset(addr, 0, sizeof *addr); - cma_set_vers(addr, 1, 4); - - ip_addr = (struct sockaddr_in *) &route->addr.src_addr; - addr->src_addr.ver.ip4.addr = ip_addr->sin_addr.s_addr; - - ip_addr = (struct sockaddr_in *) &route->addr.dst_addr; - addr->dst_addr.ver.ip4.addr = ip_addr->sin_addr.s_addr; - addr->port = ip_addr->sin_port; + struct sockaddr_in *src4, *dst4; + struct cma_hdr *cma_hdr; + struct sdp_hh *sdp_hdr; + + src4 = (struct sockaddr_in *) &route->addr.src_addr; + dst4 = (struct sockaddr_in *) &route->addr.dst_addr; + + switch (ps) { + case RDMA_PS_SDP: + sdp_hdr = hdr; + sdp_hdr->sdp_version = SDP_VERSION; + sdp_set_ip_ver(sdp_hdr, 4); + sdp_hdr->src_addr.ip4.addr = src4->sin_addr.s_addr; + sdp_hdr->dst_addr.ip4.addr = dst4->sin_addr.s_addr; + sdp_hdr->port = src4->sin_port; + break; + default: + cma_hdr = hdr; + cma_hdr->cma_version = CMA_VERSION; + cma_set_ip_ver(cma_hdr, 4); + cma_hdr->src_addr.ip4.addr = src4->sin_addr.s_addr; + cma_hdr->dst_addr.ip4.addr = dst4->sin_addr.s_addr; + cma_hdr->port = src4->sin_port; + break; + } } static int cma_connect_ib(struct rdma_id_private *id_priv, @@ -1119,17 +1208,20 @@ static int cma_connect_ib(struct rdma_id { struct ib_cm_req_param req; struct rdma_route *route; - struct cma_addr *addr; void *private_data; - int ret; + int offset, ret; memset(&req, 0, sizeof req); - req.private_data_len = sizeof *addr + conn_param->private_data_len; - - private_data = kmalloc(req.private_data_len, GFP_ATOMIC); + offset = cma_user_data_offset(id_priv->id.ps); + req.private_data_len = offset + conn_param->private_data_len; + private_data = kzalloc(req.private_data_len, GFP_ATOMIC); if (!private_data) return -ENOMEM; + if (conn_param->private_data && conn_param->private_data_len) + memcpy(private_data + offset, conn_param->private_data, + conn_param->private_data_len); + id_priv->cm_id = ib_create_cm_id(id_priv->id.device, cma_ib_handler, id_priv); if (IS_ERR(id_priv->cm_id)) { @@ -1137,20 +1229,16 @@ static int cma_connect_ib(struct rdma_id goto out; } - addr = private_data; route = &id_priv->id.route; - cma_format_addr(addr, route); - - if (conn_param->private_data && conn_param->private_data_len) - memcpy(addr + 1, conn_param->private_data, - conn_param->private_data_len); + cma_format_hdr(private_data, id_priv->id.ps, route); req.private_data = private_data; req.primary_path = &route->path_rec[0]; if (route->num_paths == 2) req.alternate_path = &route->path_rec[1]; - req.service_id = cma_get_service_id(&route->addr.dst_addr); + req.service_id = cma_get_service_id(id_priv->id.ps, + &route->addr.dst_addr); req.qp_num = id_priv->qp_num; req.qp_type = id_priv->qp_type; 
req.starting_psn = id_priv->seq_num; @@ -1317,23 +1405,6 @@ out: } EXPORT_SYMBOL(rdma_disconnect); -/* TODO: add this to the device structure - see Roland's patch */ -static __be64 get_ca_guid(struct ib_device *device) -{ - struct ib_device_attr *device_attr; - __be64 guid; - int ret; - - device_attr = kmalloc(sizeof *device_attr, GFP_KERNEL); - if (!device_attr) - return 0; - - ret = ib_query_device(device, device_attr); - guid = ret ? 0 : device_attr->node_guid; - kfree(device_attr); - return guid; -} - static void cma_add_one(struct ib_device *device) { struct cma_device *cma_dev; @@ -1344,7 +1415,7 @@ static void cma_add_one(struct ib_device return; cma_dev->device = device; - cma_dev->node_guid = get_ca_guid(device); + cma_dev->node_guid = device->node_guid; if (!cma_dev->node_guid) goto err; Index: core/ucma.c =================================================================== --- core/ucma.c (revision 4356) +++ core/ucma.c (working copy) @@ -287,7 +287,7 @@ static ssize_t ucma_create_id(struct ucm return -ENOMEM; ctx->uid = cmd.uid; - ctx->cm_id = rdma_create_id(ucma_event_handler, ctx); + ctx->cm_id = rdma_create_id(ucma_event_handler, ctx, RDMA_PS_TCP); if (IS_ERR(ctx->cm_id)) { ret = PTR_ERR(ctx->cm_id); goto err1; From ianjiang.ict at gmail.com Thu Dec 8 17:42:56 2005 From: ianjiang.ict at gmail.com (Ian Jiang) Date: Fri, 9 Dec 2005 09:42:56 +0800 Subject: [openib-general] Re: [kDAPL]questions about the LMR creation of different types of memory In-Reply-To: References: <7b2fa1820512080626kf4c9c23hdc3f416dcb970f6d@mail.gmail.com> Message-ID: <7b2fa1820512081742j7ef50a27kc2322cbf0e52d908@mail.gmail.com> On 12/9/05, James Lentini wrote: > ian> 1) Could memory allocated by kmem_cache_create() be OK? > ian> AFAIK, memory allocated by kmalloc() is OK and that by vmalloc() > ian> is not. > > Correct, assuming you pass kmalloc the GFP_DMA flag. Roland> No, the GFP_DMA flag is not necessary. On x86 it means to allocate Roland> from the 24-bit ISA DMA region (ie the low 16 MB of RAM). In general Roland> it is never necessary to use GFP_DMA in modern code. I agree with Roland. I tried kmalloc() with the GFP_ATOMIC flag and it was OK. ian> and it's not necessary for the DAT_MEM_TYPE_IA type. Is the > ian> translation done in the dat_lmr_kcreate() for the > ian> DAT_MEM_TYPE_IA type? > > No translation is done in dat_lmr_kcreate for DAT_MEM_TYPE_IA. A > DAT_MEM_TYPE_IA address is supposed to be an I/O address that the > adapter can use. Question 1: How do you distinguish an address that the adapter can use from one that it cannot use? Could you give an example? I am really not very familiar with the I/O address details. Question 2: Which memory type should be used given a contiguous range of physical memory? It seems simpler to use the DAT_MEM_TYPE_IA type since no translation is needed. But isn't there any limitation on the memory to be registered using DAT_MEM_TYPE_IA, as contrasted with the DAT_MEM_PHYSICAL type? Thanks a lot! -- Ian Jiang ianjiang.ict at gmail.com Laboratory of Spatial Information Technology Division of System Architecture Institute of Computing Technology Chinese Academy of Sciences -------------- next part -------------- An HTML attachment was scrubbed...
URL:
From bboas at llnl.gov Thu Dec 8 22:52:38 2005 From: bboas at llnl.gov (Bill Boas) Date: Thu, 08 Dec 2005 22:52:38 -0800 Subject: [openib-general] Next workshop dates? Please respond with your preferences Message-ID: <6.2.3.4.2.20051208224443.03a16be0@mail-lc.llnl.gov> All those wishing to attend the next workshop in Sonoma at the Lodge (same as last year) in late January or early February, please respond with your preferred dates. We currently have Jan 29-Feb 1 held for us, but some people are telling us that is bad for them. The next two Sun-Wed slots (Feb 5-8 or 12-15) may be available, but we need guidance from those planning to attend as to their preferred dates. Bill. Bill Boas bboas at llnl.gov ICCD LLNL, B-453, R-2018 Wk: 925-422-4110 7000 East Ave, L-555 Cell: 925-337-2224 Livermore, CA 94551 Pgr: 877-203-2248 From krkumar2 at in.ibm.com Fri Dec 9 00:48:47 2005 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Fri, 9 Dec 2005 14:18:47 +0530 Subject: [openib-general] [PATCH fixed] was Re: [PATCH] ipoib_multicast/ipoib_mcast_send race In-Reply-To: <20051208183406.GA13614@mellanox.co.il> Message-ID: Hi Michael, Is there a reason to have the atomic set_bit() within a lock (even for a race condition of stop vs send, it doesn't seem to be required)? Which means the test_bit() can also be put before the existing lock... Thanks, - KK openib-general-bounces at openib.org wrote on 12/09/2005 12:04:06 AM: > Quoting Michael S. Tsirkin : > > Subject: [PATCH] ipoib_multicast/ipoib_mcast_send race > > > > Hello, Roland! > > Here's another race scenario. > > The patch I posted previously leaked an skb when a multicast > send arrived while the mcast thread is stopped. > > Further, there's an additional issue that I saw in testing: > ipoib_mcast_send may get called when priv->broadcast is NULL > (e.g. if the device was downed and then upped internally because > of a port event). > If this happens and the sendonly join request gets completed before > priv->broadcast is set, we get an oops that I posted previously. > > Here's a better patch to address these two problems. > It has been running fine here for a while now. > > Please note that this replaces the ipoib_multicast/ipoib_mcast_send patch, > but not the ADMIN_UP patch that I posted previously. > > --- > > Do not send multicasts if mcast thread is stopped or if > priv->broadcast is not set. > > Signed-off-by: Michael S. Tsirkin -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at mellanox.co.il Fri Dec 9 05:11:51 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 9 Dec 2005 15:11:51 +0200 Subject: [openib-general] [PATCH fixed] was Re: [PATCH] ipoib_multicast/ipoib_mcast_send race In-Reply-To: References: Message-ID: <20051209131151.GA21716@mellanox.co.il> The lock around clear_bit is there to ensure that ipoib_mcast_send isn't running already when we stop the thread. That's why test_bit has to be inside the lock, too. Quoting r. Krishna Kumar2 : > Subject: Re: [openib-general] [PATCH fixed] was Re: [PATCH] ipoib_multicast/ipoib_mcast_send race > > Hi Michael, > > Is there a reason to have the atomic set_bit() within a lock (even for > a race condition of stop vs send, it doesn't seem to be required)? > Which means the test_bit() can also be put before the existing lock... > > Thanks, > > - KK -- MST From halr at voltaire.com Fri Dec 9 07:51:36 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Dec 2005 10:51:36 -0500 Subject: [openib-general] Re: [PATCH] core: fix user_mad memory leaks on timeout In-Reply-To: <20051208135116.GL21035@mellanox.co.il> References:
<20051208135116.GL21035@mellanox.co.il> Message-ID: <1134143496.4485.6696.camel@hal.voltaire.com> On Thu, 2005-12-08 at 08:51, Michael S. Tsirkin wrote: > Don't leak packet if it had a timeout. > Don't leak timeout mad if queue_packet fails. Thanks. Applied. Should this change be pushed upstream to 2.6.15? From arlin.r.davis at intel.com Fri Dec 9 12:39:15 2005 From: arlin.r.davis at intel.com (Arlin Davis) Date: Fri, 9 Dec 2005 12:39:15 -0800 Subject: [openib-general] [PATCH][uDAPL] openib_cma provider update Message-ID: James, I modified the IP address lookup during the open to take either a network name, network address, or device name. This will make the dat.conf setup a little easier and more flexible. I updated the README and doc/dat.conf with details. Thanks, -arlin Signed-off-by: Arlin Davis Index: dapl/openib_cma/dapl_ib_util.c =================================================================== --- dapl/openib_cma/dapl_ib_util.c (revision 4361) +++ dapl/openib_cma/dapl_ib_util.c (working copy) @@ -58,6 +58,13 @@ static const char rcsid[] = "$Id: $"; #include #include +#include /* for IOCTL's */ +#include /* for socket(2) and related bits and pieces */ +#include /* for socket(2) */ +#include /* for struct ifreq */ +#include /* for ARPHRD_INFINIBAND */ + + int g_dapl_loopback_connection = 0; int g_ib_pipe[2]; ib_thread_state_t g_ib_thread_state = 0; @@ -65,39 +72,77 @@ DAPL_OS_THREAD g_ib_thread; DAPL_OS_LOCK g_hca_lock; struct dapl_llist_entry *g_hca_list; -/* Get IP address */ +/* Get IP address using network device name */ +static int getipaddr_netdev(char *name, char *addr, int addr_len) +{ + struct ifreq ifr; + int skfd, ret, len; + + /* Fill in the structure */ + snprintf(ifr.ifr_name, IFNAMSIZ, "%s", name); + ifr.ifr_hwaddr.sa_family = ARPHRD_INFINIBAND; + + /* Create a socket fd */ + skfd = socket(PF_INET, SOCK_STREAM, 0); + ret = ioctl(skfd, SIOCGIFADDR, &ifr); + if (ret) + goto bail; + + switch (ifr.ifr_addr.sa_family) + { +#ifdef AF_INET6 + case AF_INET6: + len = sizeof(struct sockaddr_in6); + break; +#endif + case AF_INET: + default: + len = sizeof(struct sockaddr); + break; + } + + if (len <= addr_len) + memcpy(addr, &ifr.ifr_addr, len); + else + ret = EINVAL; + +bail: + close(skfd); + return ret; +} + +/* Get IP address using network name, address, or device name */ static int getipaddr(char *name, char *addr, int len) { struct addrinfo *res; int ret; - - ret = getaddrinfo(name, NULL, NULL, &res); - if (ret) { - dapl_dbg_log(DAPL_DBG_TYPE_WARN, - " getipaddr: invalid name or address (%s)\n", - name); + + /* Assume network name and address type for first attempt */ + if (getaddrinfo(name, NULL, NULL, &res)) { + /* retry using network device name */ + ret = getipaddr_netdev(name,addr,len); + if (ret) + dapl_dbg_log(DAPL_DBG_TYPE_WARN, + " getipaddr: invalid name, addr, or netdev(%s)\n", + name); return ret; + } else { + if (len >= res->ai_addrlen) + memcpy(addr, res->ai_addr, res->ai_addrlen); + else + return EINVAL; + + freeaddrinfo(res); } dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " getipaddr: family %d port %d addr %d.%d.%d.%d\n", - ((struct sockaddr_in *)res->ai_addr)->sin_family, - ((struct sockaddr_in *)res->ai_addr)->sin_port, - ((struct sockaddr_in *) - res->ai_addr)->sin_addr.s_addr >> 0 & 0xff, - ((struct sockaddr_in *) - res->ai_addr)->sin_addr.s_addr >> 8 & 0xff, - ((struct sockaddr_in *) - res->ai_addr)->sin_addr.s_addr >> 16 & 0xff, - ((struct sockaddr_in *) - res->ai_addr)->sin_addr.s_addr >> 24 & 0xff ); - - if (len >= res->ai_addrlen) - memcpy(addr,
res->ai_addr, res->ai_addrlen); - else - return EINVAL; - - freeaddrinfo(res); + ((struct sockaddr_in *)addr)->sin_family, + ((struct sockaddr_in *)addr)->sin_port, + ((struct sockaddr_in *)addr)->sin_addr.s_addr >> 0 & 0xff, + ((struct sockaddr_in *)addr)->sin_addr.s_addr >> 8 & 0xff, + ((struct sockaddr_in *)addr)->sin_addr.s_addr >> 16 & 0xff, + ((struct sockaddr_in *)addr)->sin_addr.s_addr >> 24 & 0xff); return 0; } Index: doc/dat.conf =================================================================== --- doc/dat.conf (revision 4361) +++ doc/dat.conf (working copy) @@ -9,9 +9,12 @@ # Example for openib_cma and openib_scm # # For scm version you specify as actual device name and port -# For cma version you specify as the ib device network address or network hostname and 0 for port +# For cma version you specify as: +# network address, network hostname, or netdev name and 0 for port # OpenIB-scm1 u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "mthca0 1" "" OpenIB-scm2 u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "mthca0 2" "" OpenIB-cma-ip u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "192.168.0.22 0" "" OpenIB-cma-name u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "svr1-ib0 0" "" +OpenIB-cma-netdev u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "ib0 0" "" + Index: README =================================================================== --- README (revision 4361) +++ README (working copy) @@ -63,12 +63,14 @@ sample /etc/dat.conf # Example for openib_cma and openib_scm # # For scm version you specify as actual device name and port -# For cma version you specify as the ib device network address or network hostname and 0 for port +# For cma version you specify as: +# network address, network hostname, or netdev name and 0 for port # OpenIB-scm1 u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "mthca0 1" "" OpenIB-scm2 u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "mthca0 2" "" OpenIB-cma-ip u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "192.168.0.22 0" "" OpenIB-cma-name u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "svr1-ib0 0" "" +OpenIB-cma-netdev u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "ib0 0" "" ============================= 3.0 SAMPLE uDAPL APPLICATION: From rdreier at cisco.com Fri Dec 9 13:46:46 2005 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 09 Dec 2005 13:46:46 -0800 Subject: [openib-general] Re: [PATCH] core: fix user_mad memory leaks on timeout In-Reply-To: <20051208135116.GL21035@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 8 Dec 2005 15:51:16 +0200") References: <20051208135116.GL21035@mellanox.co.il> Message-ID: Thanks, I queued this in my git tree. - R. From rdreier at cisco.com Fri Dec 9 13:49:14 2005 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 09 Dec 2005 13:49:14 -0800 Subject: [openib-general] Re: [PATCH] mthca: fix memory leak In-Reply-To: <20051208161625.GA7653@mellanox.co.il> (Jack Morgenstein's message of "Thu, 8 Dec 2005 18:16:25 +0200") References: <20051208161625.GA7653@mellanox.co.il> Message-ID: Thanks, applied to svn and queued in git... - R. 
From rolandd at cisco.com Fri Dec 9 13:51:50 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 09 Dec 2005 21:51:50 +0000 Subject: [openib-general] [git patch review 2/5] IB/cm: correct reported reject code In-Reply-To: <1134165110300-0a7b2146d584150e@cisco.com> Message-ID: <1134165110300-7a2e27ea7ca96ec0@cisco.com> Change reject code from TIMEOUT to CONSUMER_REJECT when destroying a cm_id in the process of connecting. Signed-off-by: Sean Hefty Signed-off-by: Roland Dreier --- drivers/infiniband/core/cm.c | 13 +++++++++---- 1 files changed, 9 insertions(+), 4 deletions(-) 227eca83690da7dcbd698d3268e29402e0571723 diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index 02110e0..1fe2186 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -684,6 +684,13 @@ retest: cm_reject_sidr_req(cm_id_priv, IB_SIDR_REJECT); break; case IB_CM_REQ_SENT: + ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg); + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ib_send_cm_rej(cm_id, IB_CM_REJ_TIMEOUT, + &cm_id_priv->av.port->cm_dev->ca_guid, + sizeof cm_id_priv->av.port->cm_dev->ca_guid, + NULL, 0); + break; case IB_CM_MRA_REQ_RCVD: case IB_CM_REP_SENT: case IB_CM_MRA_REP_RCVD: @@ -694,10 +701,8 @@ retest: case IB_CM_REP_RCVD: case IB_CM_MRA_REP_SENT: spin_unlock_irqrestore(&cm_id_priv->lock, flags); - ib_send_cm_rej(cm_id, IB_CM_REJ_TIMEOUT, - &cm_id_priv->av.port->cm_dev->ca_guid, - sizeof cm_id_priv->av.port->cm_dev->ca_guid, - NULL, 0); + ib_send_cm_rej(cm_id, IB_CM_REJ_CONSUMER_DEFINED, + NULL, 0, NULL, 0); break; case IB_CM_ESTABLISHED: spin_unlock_irqrestore(&cm_id_priv->lock, flags); -- 0.99.9l From rolandd at cisco.com Fri Dec 9 13:51:50 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 09 Dec 2005 21:51:50 +0000 Subject: [openib-general] [git patch review 4/5] IB/umad: fix memory leaks In-Reply-To: <1134165110300-7535693e84cc230f@cisco.com> Message-ID: <1134165110301-ac635a95a66180bb@cisco.com> Don't leak packet if it had a timeout, and don't leak timeout struct if queue_packet() fails. Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/core/user_mad.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) 0efc4883a6b3de12476cd7a35e638c0a9f5fd75f diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c index eb7f525..c908de8 100644 --- a/drivers/infiniband/core/user_mad.c +++ b/drivers/infiniband/core/user_mad.c @@ -197,8 +197,8 @@ static void send_handler(struct ib_mad_a memcpy(timeout->mad.data, packet->mad.data, sizeof (struct ib_mad_hdr)); - if (!queue_packet(file, agent, timeout)) - return; + if (queue_packet(file, agent, timeout)) + kfree(timeout); } out: kfree(packet); -- 0.99.9l From rolandd at cisco.com Fri Dec 9 13:51:50 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 09 Dec 2005 21:51:50 +0000 Subject: [openib-general] [git patch review 3/5] IB/cm: avoid reusing local ID In-Reply-To: <1134165110300-7a2e27ea7ca96ec0@cisco.com> Message-ID: <1134165110300-7535693e84cc230f@cisco.com> Use an increasing local ID to avoid re-using identifiers while messages may still be outstanding on the old ID. Without this, a quick connect-disconnect-connect sequence can fail by matching messages for the new connection with the old connection. 
Signed-off-by: Sean Hefty Signed-off-by: Roland Dreier --- drivers/infiniband/core/cm.c | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) de1bb1a64c29bae4f5330c70bd1dc6a62954c9f4 diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index 1fe2186..3a611fe 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -308,10 +308,11 @@ static int cm_alloc_id(struct cm_id_priv { unsigned long flags; int ret; + static int next_id; do { spin_lock_irqsave(&cm.lock, flags); - ret = idr_get_new_above(&cm.local_id_table, cm_id_priv, 1, + ret = idr_get_new_above(&cm.local_id_table, cm_id_priv, next_id++, (__force int *) &cm_id_priv->id.local_id); spin_unlock_irqrestore(&cm.lock, flags); } while( (ret == -EAGAIN) && idr_pre_get(&cm.local_id_table, GFP_KERNEL) ); -- 0.99.9l From rolandd at cisco.com Fri Dec 9 13:51:50 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 09 Dec 2005 21:51:50 +0000 Subject: [openib-general] [git patch review 1/5] IB/mthca: fix QP size limits for mem-free HCAs Message-ID: <1134165110300-0a7b2146d584150e@cisco.com> Unlike tavor, the max work queue size is an exact power of 2 for arbel mode, despite what the documentation (of the QUERY_DEV_LIM firmware command) says. Without this patch, on Arbel, we can start with a QP of a valid size and get above the reported limit after rounding to the next power of two. Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_cmd.c | 12 ++++++++---- 1 files changed, 8 insertions(+), 4 deletions(-) a3c8ab4fe8f006d742c24be677518bfa9862e732 diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.c b/drivers/infiniband/hw/mthca/mthca_cmd.c index 9ed3458..22ac72b 100644 --- a/drivers/infiniband/hw/mthca/mthca_cmd.c +++ b/drivers/infiniband/hw/mthca/mthca_cmd.c @@ -937,10 +937,6 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev if (err) goto out; - MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET); - dev_lim->max_srq_sz = (1 << field) - 1; - MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_SZ_OFFSET); - dev_lim->max_qp_sz = (1 << field) - 1; MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_QP_OFFSET); dev_lim->reserved_qps = 1 << (field & 0xf); MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_OFFSET); @@ -1056,6 +1052,10 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev mthca_dbg(dev, "Flags: %08x\n", dev_lim->flags); if (mthca_is_memfree(dev)) { + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET); + dev_lim->max_srq_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_SZ_OFFSET); + dev_lim->max_qp_sz = 1 << field; MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSZ_SRQ_OFFSET); dev_lim->hca.arbel.resize_srq = field & 1; MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SG_RQ_OFFSET); @@ -1087,6 +1087,10 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev mthca_dbg(dev, "Max ICM size %lld MB\n", (unsigned long long) dev_lim->hca.arbel.max_icm_sz >> 20); } else { + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET); + dev_lim->max_srq_sz = (1 << field) - 1; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_SZ_OFFSET); + dev_lim->max_qp_sz = (1 << field) - 1; MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_AV_OFFSET); dev_lim->hca.tavor.max_avs = 1 << (field & 0x3f); dev_lim->mpt_entry_sz = MTHCA_MPT_ENTRY_SIZE; -- 0.99.9l From rolandd at cisco.com Fri Dec 9 13:51:50 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 09 Dec 2005 21:51:50 +0000 Subject: [openib-general] [git patch review 5/5] IB/mthca: fix memory user DB table leak 
In-Reply-To: <1134165110301-ac635a95a66180bb@cisco.com> Message-ID: <1134165110301-b5d3e449a24a06fe@cisco.com> Free the memory allocated in mthca_init_user_db_tab() when releasing the db_tab in mthca_cleanup_user_db_tab(). Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_memfree.c | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) 52d0df153c987e4ad57d15f5df91848f65858e5d diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c b/drivers/infiniband/hw/mthca/mthca_memfree.c index d72fe95..5798ed0 100644 --- a/drivers/infiniband/hw/mthca/mthca_memfree.c +++ b/drivers/infiniband/hw/mthca/mthca_memfree.c @@ -485,6 +485,8 @@ void mthca_cleanup_user_db_tab(struct mt put_page(db_tab->page[i].mem.page); } } + + kfree(db_tab); } int mthca_alloc_db(struct mthca_dev *dev, enum mthca_db_type type, -- 0.99.9l From sean.hefty at intel.com Fri Dec 9 14:55:18 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 9 Dec 2005 14:55:18 -0800 Subject: [openib-general] [PATCH] [uCM] prevent userspace from using SDP/CMA SIDs Message-ID: The following patch rejects requests from userspace to use either the SDP or CMA service IDs. Signed-off-by: Sean Hefty Index: core/ucm.c =================================================================== --- core/ucm.c (revision 4356) +++ core/ucm.c (working copy) @@ -645,6 +645,17 @@ out: return result; } +static int ucm_validate_listen(__be64 service_id, __be64 service_mask) +{ + service_id &= service_mask; + + if (((service_id & IB_CMA_SERVICE_ID_MASK) == IB_CMA_SERVICE_ID) || + ((service_id & IB_SDP_SERVICE_ID_MASK) == IB_SDP_SERVICE_ID)) + return -EINVAL; + + return 0; +} + static ssize_t ib_ucm_listen(struct ib_ucm_file *file, const char __user *inbuf, int in_len, int out_len) @@ -660,8 +671,13 @@ static ssize_t ib_ucm_listen(struct ib_u if (IS_ERR(ctx)) return PTR_ERR(ctx); + result = ucm_validate_listen(cmd.service_id, cmd.service_mask); + if (result) + goto out; + result = ib_cm_listen(ctx->cm_id, cmd.service_id, cmd.service_mask, NULL); +out: ib_ucm_ctx_put(ctx); return result; } Index: include/rdma/ib_cm.h =================================================================== --- include/rdma/ib_cm.h (revision 4356) +++ include/rdma/ib_cm.h (working copy) @@ -317,6 +317,10 @@ void ib_destroy_cm_id(struct ib_cm_id *c #define IB_SERVICE_ID_AGN_MASK __constant_cpu_to_be64(0xFF00000000000000ULL) #define IB_CM_ASSIGN_SERVICE_ID __constant_cpu_to_be64(0x0200000000000000ULL) +#define IB_CMA_SERVICE_ID __constant_cpu_to_be64(0x0000000001000000ULL) +#define IB_CMA_SERVICE_ID_MASK __constant_cpu_to_be64(0xFFFFFFFFFF000000ULL) +#define IB_SDP_SERVICE_ID __constant_cpu_to_be64(0x0000000000010000ULL) +#define IB_SDP_SERVICE_ID_MASK __constant_cpu_to_be64(0xFFFFFFFFFFFF0000ULL) struct ib_cm_private_data_compare { u8 data[IB_CM_PRIVATE_DATA_COMPARE_SIZE]; From xma at us.ibm.com Fri Dec 9 16:11:03 2005 From: xma at us.ibm.com (Shirley Ma) Date: Fri, 9 Dec 2005 17:11:03 -0700 Subject: [openib-general] [PATCH] check create_srq in libibverbs Message-ID: create_srq is not a mandatory device function; therefore, in userspace/libibverbs/src/verbs.c, ibv_create_srq() should check that create_srq() is provided before calling it; otherwise the caller will hit a segmentation fault on a device which doesn't support SRQs.
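For illustration, a caller can then simply test the return value. A minimal sketch of the application-side pattern (the attribute values here are arbitrary, and error handling beyond the NULL check is omitted; with the patch below, ibv_create_srq() returns NULL rather than crashing when the driver provides no create_srq):

#include <stdio.h>
#include <infiniband/verbs.h>

/* Try to create an SRQ; returns NULL if the device has no SRQ support. */
static struct ibv_srq *try_create_srq(struct ibv_pd *pd)
{
	struct ibv_srq_init_attr attr = {
		.attr = { .max_wr = 128, .max_sge = 1 }
	};
	struct ibv_srq *srq = ibv_create_srq(pd, &attr);

	if (!srq)
		fprintf(stderr, "device does not support SRQs\n");
	return srq;
}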
Signed-off-by: Shirley Ma diff -urN userspace/libibverbs/src/verbs.c userspace-srq/libibverbs/src/verbs.c --- userspace/libibverbs/src/verbs.c 2005-11-14 13:44:52.000000000 -0800 +++ userspace-srq/libibverbs/src/verbs.c 2005-12-09 16:04:12.022433272 -0800 @@ -246,7 +246,9 @@ struct ibv_srq *ibv_create_srq(struct ibv_pd *pd, struct ibv_srq_init_attr *srq_init_attr) { - struct ibv_srq *srq = pd->context->ops.create_srq(pd, srq_init_attr); + struct ibv_srq *srq = NULL; + if (pd->context->ops.create_srq) + srq = pd->context->ops.create_srq(pd, srq_init_attr); if (srq) { srq->context = pd->context; Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ibv_srq.patch Type: application/octet-stream Size: 605 bytes Desc: not available URL: From rdreier at cisco.com Fri Dec 9 16:42:56 2005 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 09 Dec 2005 16:42:56 -0800 Subject: [openib-general] Re: mthca_qp patch In-Reply-To: <20051207154348.GZ21035@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 7 Dec 2005 17:43:48 +0200") References: <6AB138A2AB8C8E4A98B9C0C3D52670E3B8D6CD@mtlexch01.mtl.com> <20051207154348.GZ21035@mellanox.co.il> Message-ID: Thanks, I applied this and queued it in git as three separate patches. From rdreier at cisco.com Fri Dec 9 16:48:59 2005 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 09 Dec 2005 16:48:59 -0800 Subject: [openib-general] [PATCH] check create_srq in libibverbs In-Reply-To: (Shirley Ma's message of "Fri, 9 Dec 2005 17:11:03 -0700") References: Message-ID: Thanks, looks good. I'll apply this after some pending stuff I have in my tree... From rdreier at cisco.com Fri Dec 9 17:22:32 2005 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 09 Dec 2005 17:22:32 -0800 Subject: [openib-general] [PATCH/RFC] change ibv_get_devices() to ibv_get_device_list() Message-ID: This patch converts the ibv_get_devices() API to a better ibv_get_device_list(). The old API was bad because it exposed the dlist data structure from libsysfs, which was not thread-safe and was just plain overly complex for what it was used for. In addition, I've converted over all the in-tree users of ibv_get_devices() that I could find -- DAPL, libehca, libibcm, librdmacm and mvapich. I'm planning to commit this early next week; any objections, comments, or suggestions before I do so?
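For anyone converting out-of-tree code, typical usage of the new API looks like the following minimal sketch (modeled on the updated device_list.c example below; error handling is kept to the bare minimum):

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
	struct ibv_device **dev_list;
	int num_devices;
	int i;

	/* Returns a NULL-terminated array; the count argument is optional. */
	dev_list = ibv_get_device_list(&num_devices);
	if (!dev_list) {
		fprintf(stderr, "No IB devices found\n");
		return 1;
	}

	for (i = 0; i < num_devices; ++i)
		printf("%s\n", ibv_get_device_name(dev_list[i]));

	ibv_free_device_list(dev_list);
	return 0;
}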
Thanks, Roland dapl/dapl/openib/dapl_ib_util.c | 36 ++++++----- dapl/dapl/openib_scm/dapl_ib_util.c | 35 +++++++---- libehca/configure.in | 2 libibcm/configure.in | 4 - libibcm/examples/cmpost.c | 7 -- libibverbs/ChangeLog | 13 +++- libibverbs/examples/asyncwatch.c | 16 +---- libibverbs/examples/device_list.c | 13 ++-- libibverbs/examples/devinfo.c | 101 +++++++++++++++----------------- libibverbs/examples/rc_pingpong.c | 9 +- libibverbs/examples/srq_pingpong.c | 9 +- libibverbs/examples/uc_pingpong.c | 9 +- libibverbs/examples/ud_pingpong.c | 9 +- libibverbs/include/infiniband/verbs.h | 16 ++++- libibverbs/src/device.c | 27 ++++++-- libibverbs/src/ibverbs.h | 11 +-- libibverbs/src/init.c | 72 +++++++++++++--------- libibverbs/src/libibverbs.map | 3 librdmacm/configure.in | 4 - librdmacm/src/cma.c | 14 ++-- mpi/mvapich-gen2/mpid/ch_gen2/viainit.c | 11 +++ perftest/rdma_bw.c | 9 +- perftest/rdma_lat.c | 9 +- 23 files changed, 255 insertions(+), 184 deletions(-) --- userspace/libibverbs/include/infiniband/verbs.h (revision 4360) +++ userspace/libibverbs/include/infiniband/verbs.h (working copy) @@ -585,9 +585,21 @@ struct ibv_context { }; /** - * ibv_get_devices - Return list of IB devices + * ibv_get_device_list - Get list of IB devices currently available + * @num_devices: optional. if non-NULL, set to the number of devices + * returned in the array. + * + * Return a NULL-terminated array of IB devices. The array can be + * released with ibv_free_device_list(). + */ +extern struct ibv_device **ibv_get_device_list(int *num_devices); + +/** + * ibv_free_device_list - Free list from ibv_get_device_list() + * + * Free an array of devices returned from ibv_get_device_list() */ -extern struct dlist *ibv_get_devices(void); +extern void ibv_free_device_list(struct ibv_device **list); /** * ibv_get_device_name - Return kernel device name --- userspace/libibverbs/ChangeLog (revision 4360) +++ userspace/libibverbs/ChangeLog (working copy) @@ -1,4 +1,15 @@ -2005-11-10 Sean Hefty +2005-11-11 Roland Dreier + + * examples/asyncwatch.c, examples/rc_pingpong.c, + examples/srq_pingpong.c, examples/uc_pingpong.c, + examples/ud_pingpong.c, examples/device_list.c, + examples/devinfo.c: Update examples to match new API. + + * include/infiniband/verbs.h, src/device.c, src/init.c, + src/ibverbs.h: Change from dlist-based ibv_get_devices() API to + simpler ibv_get_device_list() and ibv_free_device_list() API. + +2005-11-10 Sean Hefty * include/infiniband/sa-kern-abi.h: New include file to contain definitions of SA structures passed between userspace and kernel. 
--- userspace/libibverbs/src/libibverbs.map (revision 4360) +++ userspace/libibverbs/src/libibverbs.map (working copy) @@ -1,6 +1,7 @@ IBVERBS_1.0 { global: - ibv_get_devices; + ibv_get_device_list; + ibv_free_device_list; ibv_get_device_name; ibv_get_device_guid; ibv_open_device; --- userspace/libibverbs/src/device.c (revision 4360) +++ userspace/libibverbs/src/device.c (working copy) @@ -49,21 +49,36 @@ #include "ibverbs.h" static pthread_mutex_t device_list_lock = PTHREAD_MUTEX_INITIALIZER; -static struct dlist *device_list; +static int num_devices; +static struct ibv_device **device_list; -struct dlist *ibv_get_devices(void) +struct ibv_device **ibv_get_device_list(int *num) { - struct dlist *l; + struct ibv_device **l; + int i; pthread_mutex_lock(&device_list_lock); - if (!device_list) - device_list = ibverbs_init(); - l = device_list; + + if (!num_devices) + num_devices = ibverbs_init(&device_list); + + l = calloc(num_devices + 1, sizeof (struct ibv_device *)); + for (i = 0; i < num_devices; ++i) + l[i] = device_list[i]; + pthread_mutex_unlock(&device_list_lock); + if (num) + *num = l ? num_devices : 0; + return l; } +void ibv_free_device_list(struct ibv_device **list) +{ + free(list); +} + const char *ibv_get_device_name(struct ibv_device *device) { return device->ibdev->name; --- userspace/libibverbs/src/ibverbs.h (revision 4360) +++ userspace/libibverbs/src/ibverbs.h (working copy) @@ -47,7 +47,8 @@ #define PFX "libibverbs: " struct ibv_driver { - ibv_driver_init_func init_func; + ibv_driver_init_func init_func; + struct ibv_driver *next; }; struct ibv_abi_compat_v2 { @@ -57,11 +58,11 @@ struct ibv_abi_compat_v2 { extern HIDDEN int abi_ver; -extern struct dlist *ibverbs_init(void); +extern HIDDEN int ibverbs_init(struct ibv_device ***list); -extern int ibv_init_mem_map(void); -extern int ibv_lock_range(void *base, size_t size); -extern int ibv_unlock_range(void *base, size_t size); +extern HIDDEN int ibv_init_mem_map(void); +extern HIDDEN int ibv_lock_range(void *base, size_t size); +extern HIDDEN int ibv_unlock_range(void *base, size_t size); #define IBV_INIT_CMD(cmd, size, opcode) \ do { \ --- userspace/libibverbs/src/init.c (revision 4360) +++ userspace/libibverbs/src/init.c (working copy) @@ -55,7 +55,7 @@ HIDDEN int abi_ver; static char default_path[] = DRIVER_PATH; static const char *user_path; -static struct dlist *driver_list; +static struct ibv_driver *driver_list; static void load_driver(char *so_path) { @@ -82,7 +82,8 @@ static void load_driver(char *so_path) } driver->init_func = init_func; - dlist_push(driver_list, driver); + driver->next = driver_list; + driver_list = driver; } static void find_drivers(char *dir) @@ -112,8 +113,7 @@ static void find_drivers(char *dir) load_driver(so_glob.gl_pathv[i]); } -static void init_drivers(struct sysfs_class_device *verbs_dev, - struct dlist *device_list) +static struct ibv_device *init_drivers(struct sysfs_class_device *verbs_dev) { struct sysfs_class_device *ib_dev; struct sysfs_attribute *attr; @@ -125,7 +125,7 @@ static void init_drivers(struct sysfs_cl if (!attr) { fprintf(stderr, PFX "Warning: no ibdev class attr for %s\n", verbs_dev->name); - return; + return NULL; } sscanf(attr->value, "%63s", ibdev_name); @@ -134,19 +134,17 @@ static void init_drivers(struct sysfs_cl if (!ib_dev) { fprintf(stderr, PFX "Warning: no infiniband class device %s for %s\n", attr->value, verbs_dev->name); - return; + return NULL; } - dlist_for_each_data(driver_list, driver, struct ibv_driver) { + for (driver = driver_list; driver; driver =
driver->next) { dev = driver->init_func(verbs_dev); if (dev) { dev->dev = verbs_dev; dev->ibdev = ib_dev; dev->driver = driver; - dlist_push(device_list, dev); - - return; + return dev; } } @@ -155,6 +153,8 @@ static void init_drivers(struct sysfs_cl if (user_path) fprintf(stderr, "%s:", user_path); fprintf(stderr, "%s\n", default_path); + + return NULL; } static int check_abi_version(void) @@ -188,28 +188,23 @@ static int check_abi_version(void) } -struct dlist *ibverbs_init(void) +HIDDEN int ibverbs_init(struct ibv_device ***list) { char *wr_path, *dir; struct sysfs_class *cls; struct dlist *verbs_dev_list; - struct dlist *device_list; struct sysfs_class_device *verbs_dev; + struct ibv_device *device; + struct ibv_device **new_list; + int num_devices = 0; + int list_size = 0; - driver_list = dlist_new(sizeof (struct ibv_driver)); - device_list = dlist_new(sizeof (struct ibv_device)); - if (!driver_list || !device_list) { - fprintf(stderr, PFX "Fatal: couldn't allocate device/driver list.\n"); - abort(); - } + *list = NULL; if (ibv_init_mem_map()) - return NULL; + return 0; - /* - * Check if a driver is statically linked, and if so load it first. - */ - load_driver(NULL); + find_drivers(default_path); /* * Only follow the path passed in through the calling user's @@ -224,25 +219,42 @@ struct dlist *ibverbs_init(void) } } - find_drivers(default_path); + /* + * Now check if a driver is statically linked. Since we push + * drivers onto our driver list, the last driver we find will + * be the first one we try. + */ + load_driver(NULL); cls = sysfs_open_class("infiniband_verbs"); if (!cls) { fprintf(stderr, PFX "Fatal: couldn't open sysfs class 'infiniband_verbs'.\n"); - return NULL; + return 0; } if (check_abi_version()) - return NULL; + return 0; verbs_dev_list = sysfs_get_class_devices(cls); if (!verbs_dev_list) { fprintf(stderr, PFX "Fatal: no infiniband class devices found.\n"); - return NULL; + return 0; } - dlist_for_each_data(verbs_dev_list, verbs_dev, struct sysfs_class_device) - init_drivers(verbs_dev, device_list); + dlist_for_each_data(verbs_dev_list, verbs_dev, struct sysfs_class_device) { + device = init_drivers(verbs_dev); + if (device) { + if (list_size <= num_devices) { + list_size = list_size ? 
list_size * 2 : 1; + new_list = realloc(*list, list_size * sizeof (struct ibv_device *)); + if (!new_list) + goto out; + *list = new_list; + } + (*list)[num_devices++] = device; + } - return device_list; +out: + return num_devices; } --- userspace/libibverbs/examples/asyncwatch.c (revision 4360) +++ userspace/libibverbs/examples/asyncwatch.c (working copy) @@ -50,34 +50,30 @@ static inline uint64_t be64_to_cpu(uint6 int main(int argc, char *argv[]) { - struct dlist *dev_list; - struct ibv_device *ib_dev; + struct ibv_device **dev_list; struct ibv_context *context; struct ibv_async_event event; - dev_list = ibv_get_devices(); + dev_list = ibv_get_device_list(NULL); if (!dev_list) { fprintf(stderr, "No IB devices found\n"); return 1; } - dlist_start(dev_list); - ib_dev = dlist_next(dev_list); - - if (!ib_dev) { + if (!*dev_list) { fprintf(stderr, "No IB devices found\n"); return 1; } - context = ibv_open_device(ib_dev); + context = ibv_open_device(*dev_list); if (!context) { fprintf(stderr, "Couldn't get context for %s\n", - ibv_get_device_name(ib_dev)); + ibv_get_device_name(*dev_list)); return 1; } printf("%s: async event FD %d\n", - ibv_get_device_name(ib_dev), context->async_fd); + ibv_get_device_name(*dev_list), context->async_fd); while (1) { if (ibv_get_async_event(context, &event)) --- userspace/libibverbs/examples/rc_pingpong.c (revision 4360) +++ userspace/libibverbs/examples/rc_pingpong.c (working copy) @@ -447,7 +447,7 @@ static void usage(const char *argv0) int main(int argc, char *argv[]) { - struct dlist *dev_list; + struct ibv_device **dev_list; struct ibv_device *ib_dev; struct pingpong_context *ctx; struct pingpong_dest my_dest; @@ -536,21 +536,20 @@ int main(int argc, char *argv[]) page_size = sysconf(_SC_PAGESIZE); - dev_list = ibv_get_devices(); + dev_list = ibv_get_device_list(NULL); if (!dev_list) { fprintf(stderr, "No IB devices found\n"); return 1; } - dlist_start(dev_list); if (!ib_devname) { - ib_dev = dlist_next(dev_list); + ib_dev = *dev_list; if (!ib_dev) { fprintf(stderr, "No IB devices found\n"); return 1; } } else { - dlist_for_each_data(dev_list, ib_dev, struct ibv_device) + for (; (ib_dev = *dev_list); ++dev_list) if (!strcmp(ibv_get_device_name(ib_dev), ib_devname)) break; if (!ib_dev) { --- userspace/libibverbs/examples/srq_pingpong.c (revision 4360) +++ userspace/libibverbs/examples/srq_pingpong.c (working copy) @@ -509,7 +509,7 @@ static void usage(const char *argv0) int main(int argc, char *argv[]) { - struct dlist *dev_list; + struct ibv_device **dev_list; struct ibv_device *ib_dev; struct pingpong_context *ctx; struct pingpong_dest my_dest[MAX_QP]; @@ -605,21 +605,20 @@ int main(int argc, char *argv[]) page_size = sysconf(_SC_PAGESIZE); - dev_list = ibv_get_devices(); + dev_list = ibv_get_device_list(NULL); if (!dev_list) { fprintf(stderr, "No IB devices found\n"); return 1; } - dlist_start(dev_list); if (!ib_devname) { - ib_dev = dlist_next(dev_list); + ib_dev = *dev_list; if (!ib_dev) { fprintf(stderr, "No IB devices found\n"); return 1; } } else { - dlist_for_each_data(dev_list, ib_dev, struct ibv_device) + for (; (ib_dev = *dev_list); ++dev_list) if (!strcmp(ibv_get_device_name(ib_dev), ib_devname)) break; if (!ib_dev) { --- userspace/libibverbs/examples/uc_pingpong.c (revision 4360) +++ userspace/libibverbs/examples/uc_pingpong.c (working copy) @@ -435,7 +435,7 @@ static void usage(const char *argv0) int main(int argc, char *argv[]) { - struct dlist *dev_list; + struct ibv_device **dev_list; struct ibv_device *ib_dev; struct
pingpong_context *ctx; struct pingpong_dest my_dest; @@ -524,21 +524,20 @@ int main(int argc, char *argv[]) page_size = sysconf(_SC_PAGESIZE); - dev_list = ibv_get_devices(); + dev_list = ibv_get_device_list(NULL); if (!dev_list) { fprintf(stderr, "No IB devices found\n"); return 1; } - dlist_start(dev_list); if (!ib_devname) { - ib_dev = dlist_next(dev_list); + ib_dev = *dev_list; if (!ib_dev) { fprintf(stderr, "No IB devices found\n"); return 1; } } else { - dlist_for_each_data(dev_list, ib_dev, struct ibv_device) + for (ib_dev = *dev_list; ib_dev; ++dev_list) if (!strcmp(ibv_get_device_name(ib_dev), ib_devname)) break; if (!ib_dev) { --- userspace/libibverbs/examples/ud_pingpong.c (revision 4360) +++ userspace/libibverbs/examples/ud_pingpong.c (working copy) @@ -443,7 +443,7 @@ static void usage(const char *argv0) int main(int argc, char *argv[]) { - struct dlist *dev_list; + struct ibv_device **dev_list; struct ibv_device *ib_dev; struct pingpong_context *ctx; struct pingpong_dest my_dest; @@ -532,21 +532,20 @@ int main(int argc, char *argv[]) page_size = sysconf(_SC_PAGESIZE); - dev_list = ibv_get_devices(); + dev_list = ibv_get_device_list(NULL); if (!dev_list) { fprintf(stderr, "No IB devices found\n"); return 1; } - dlist_start(dev_list); if (!ib_devname) { - ib_dev = dlist_next(dev_list); + ib_dev = *dev_list; if (!ib_dev) { fprintf(stderr, "No IB devices found\n"); return 1; } } else { - dlist_for_each_data(dev_list, ib_dev, struct ibv_device) + for (ib_dev = *dev_list; ib_dev; ++dev_list) if (!strcmp(ibv_get_device_name(ib_dev), ib_devname)) break; if (!ib_dev) { --- userspace/libibverbs/examples/device_list.c (revision 4360) +++ userspace/libibverbs/examples/device_list.c (working copy) @@ -51,10 +51,9 @@ static inline uint64_t be64_to_cpu(uint6 int main(int argc, char *argv[]) { - struct dlist *dev_list; - struct ibv_device *ib_dev; + struct ibv_device **dev_list; - dev_list = ibv_get_devices(); + dev_list = ibv_get_device_list(NULL); if (!dev_list) { fprintf(stderr, "No IB devices found\n"); return 1; @@ -63,10 +62,12 @@ int main(int argc, char *argv[]) printf(" %-16s\t node GUID\n", "device"); printf(" %-16s\t----------------\n", "------"); - dlist_for_each_data(dev_list, ib_dev, struct ibv_device) + while (*dev_list) { printf(" %-16s\t%016llx\n", - ibv_get_device_name(ib_dev), - (unsigned long long) be64_to_cpu(ibv_get_device_guid(ib_dev))); + ibv_get_device_name(*dev_list), + (unsigned long long) be64_to_cpu(ibv_get_device_guid(*dev_list))); + ++dev_list; + } return 0; } --- userspace/libibverbs/examples/devinfo.c (revision 4360) +++ userspace/libibverbs/examples/devinfo.c (working copy) @@ -299,11 +299,11 @@ cleanup: static void usage(const char *argv0) { - printf("Usage: %s print the ca attributes\n", argv0); - printf("\n"); - printf("Options:\n"); - printf(" -d, --ib-dev= use IB device (default first device found)\n"); - printf(" -i, --ib-port= use port of IB device (default all ports)\n"); + printf("Usage: %s print the ca attributes\n", argv0); + printf("\n"); + printf("Options:\n"); + printf(" -d, --ib-dev= use IB device (default first device found)\n"); + printf(" -i, --ib-port= use port of IB device (default all ports)\n"); printf(" -l, --list print only the IB devices names\n"); printf(" -v, --verbose print all the attributes of the IB device(s)\n"); } @@ -312,60 +312,56 @@ int main(int argc, char *argv[]) { char *ib_devname = NULL; int ret = 0; - struct dlist *dev_list; - struct ibv_device *ib_dev; + struct ibv_device **dev_list; int num_of_hcas; int ib_port = 0; 
/* parse command line options */ while (1) { int c; - static struct option long_options[] = { - { .name = "ib-dev", .has_arg = 1, .val = 'd' }, - { .name = "ib-port", .has_arg = 1, .val = 'i' }, + static struct option long_options[] = { + { .name = "ib-dev", .has_arg = 1, .val = 'd' }, + { .name = "ib-port", .has_arg = 1, .val = 'i' }, { .name = "list", .has_arg = 0, .val = 'l' }, - { .name = "verbose", .has_arg = 0, .val = 'v' }, - { 0, 0, 0, 0} - }; + { .name = "verbose", .has_arg = 0, .val = 'v' }, + { 0, 0, 0, 0} + }; - c = getopt_long(argc, argv, "d:i:lv", long_options, NULL); - if (c == -1) - break; - - switch (c) { - case 'd': - ib_devname = strdup(optarg); - break; - - case 'i': - ib_port = strtol(optarg, NULL, 0); - if (ib_port < 0) { - usage(argv[0]); - return 1; - } - break; + c = getopt_long(argc, argv, "d:i:lv", long_options, NULL); + if (c == -1) + break; + + switch (c) { + case 'd': + ib_devname = strdup(optarg); + break; + + case 'i': + ib_port = strtol(optarg, NULL, 0); + if (ib_port < 0) { + usage(argv[0]); + return 1; + } + break; case 'v': - verbose = 1; - break; + verbose = 1; + break; case 'l': - dev_list = ibv_get_devices(); + dev_list = ibv_get_device_list(&num_of_hcas); if (!dev_list) { fprintf(stderr, "Failed to get IB devices list"); return -1; } - num_of_hcas = 0; - dlist_for_each_data(dev_list, ib_dev, struct ibv_device) - num_of_hcas ++; - printf("%d HCA%s found:\n", num_of_hcas, num_of_hcas != 1 ? "s" : ""); - dlist_start(dev_list); - dlist_for_each_data(dev_list, ib_dev, struct ibv_device) - printf("\t%s\n", ibv_get_device_name(ib_dev)); + while (*dev_list) { + printf("\t%s\n", ibv_get_device_name(*dev_list)); + ++dev_list; + } printf("\n"); return 0; @@ -376,28 +372,31 @@ int main(int argc, char *argv[]) } } - dev_list = ibv_get_devices(); + dev_list = ibv_get_device_list(NULL); if (!dev_list) { fprintf(stderr, "Failed to get IB device list\n"); return -1; } - dlist_start(dev_list); + if (ib_devname) { - dlist_for_each_data(dev_list, ib_dev, struct ibv_device) - if (!strcmp(ibv_get_device_name(ib_dev), ib_devname)) + while (*dev_list) { + if (!strcmp(ibv_get_device_name(*dev_list), ib_devname)) break; - if (!ib_dev) { + ++dev_list; + } + + if (!*dev_list) { fprintf(stderr, "IB device '%s' wasn't found\n", ib_devname); return -1; } - ret |= print_hca_cap(ib_dev, ib_port); + + ret |= print_hca_cap(*dev_list, ib_port); } else { - ib_dev = dlist_next(dev_list); - if (!ib_dev) { - fprintf(stderr, "No IB devices found\n"); - return -1; - } - ret |= print_hca_cap(ib_dev, ib_port); + if (!*dev_list) { + fprintf(stderr, "No IB devices found\n"); + return -1; + } + ret |= print_hca_cap(*dev_list, ib_port); } if (ib_devname) --- userspace/dapl/dapl/openib/dapl_ib_util.c (revision 4360) +++ userspace/dapl/dapl/openib/dapl_ib_util.c (working copy) @@ -206,29 +206,34 @@ DAT_RETURN dapls_ib_open_hca ( IN IB_HCA_NAME hca_name, IN DAPL_HCA *hca_ptr) { - struct dlist *dev_list; + struct ibv_device **dev_list; long opts; + int i; dapl_dbg_log (DAPL_DBG_TYPE_UTIL, " open_hca: %s - %p\n", hca_name, hca_ptr ); /* Get list of all IB devices, find match, open */ - dev_list = ibv_get_devices(); - dlist_start(dev_list); - dlist_for_each_data(dev_list, - hca_ptr->ib_trans.ib_dev, - struct ibv_device) { - if (!strcmp(ibv_get_device_name(hca_ptr->ib_trans.ib_dev), - hca_name)) - break; - } - - if (!hca_ptr->ib_trans.ib_dev) { + dev_list = ibv_get_device_list(NULL); + if (!dev_list) { dapl_dbg_log (DAPL_DBG_TYPE_ERR, - " open_hca: IB device %s not found\n", + " open_hca: 
ibv_get_device_list() failed\n", hca_name); return DAT_INTERNAL_ERROR; } + + for (i = 0; dev_list[i]; ++i) { + hca_ptr->ib_trans.ib_dev = dev_list[i]; + if (!strcmp(ibv_get_device_name(hca_ptr->ib_trans.ib_dev), + hca_name)) + goto found; + } + + dapl_dbg_log (DAPL_DBG_TYPE_ERR, + " open_hca: IB device %s not found\n", + hca_name); + goto err; +found: dapl_dbg_log ( DAPL_DBG_TYPE_UTIL," open_hca: Found dev %s %016llx\n", ibv_get_device_name(hca_ptr->ib_trans.ib_dev), @@ -240,7 +245,7 @@ DAT_RETURN dapls_ib_open_hca ( dapl_dbg_log (DAPL_DBG_TYPE_ERR, " open_hca: IB dev open failed for %s\n", ibv_get_device_name(hca_ptr->ib_trans.ib_dev)); - return DAT_INTERNAL_ERROR; + goto err; } hca_ptr->ib_trans.ib_ctx = hca_ptr->ib_hca_handle; @@ -336,11 +341,14 @@ DAT_RETURN dapls_ib_open_hca ( hca_ptr->ib_trans.max_inline_send ); hca_ptr->ib_trans.d_hca = hca_ptr; + ibv_free_device_list(dev_list); return DAT_SUCCESS; bail: ibv_close_device(hca_ptr->ib_hca_handle); hca_ptr->ib_hca_handle = IB_INVALID_HANDLE; +err: + ibv_free_device_list(dev_list); return DAT_INTERNAL_ERROR; } --- userspace/dapl/dapl/openib_scm/dapl_ib_util.c (revision 4360) +++ userspace/dapl/dapl/openib_scm/dapl_ib_util.c (working copy) @@ -131,28 +131,35 @@ DAT_RETURN dapls_ib_open_hca ( IN IB_HCA_NAME hca_name, IN DAPL_HCA *hca_ptr) { - struct dlist *dev_list; + struct ibv_device **dev_list; int opts; + int i; DAT_RETURN dat_status = DAT_SUCCESS; dapl_dbg_log (DAPL_DBG_TYPE_UTIL, " open_hca: %s - %p\n", hca_name, hca_ptr ); /* Get list of all IB devices, find match, open */ - dev_list = ibv_get_devices(); - dlist_start(dev_list); - dlist_for_each_data(dev_list,hca_ptr->ib_trans.ib_dev,struct ibv_device) { - if (!strcmp(ibv_get_device_name(hca_ptr->ib_trans.ib_dev),hca_name)) - break; - } - - if (!hca_ptr->ib_trans.ib_dev) { + dev_list = ibv_get_device_list(NULL); + if (!dev_list) { dapl_dbg_log (DAPL_DBG_TYPE_ERR, - " open_hca: IB device %s not found\n", + " open_hca: ibv_get_device_list() failed\n", hca_name); return DAT_INTERNAL_ERROR; } - + + for (i = 0; dev_list[i]; ++i) { + hca_ptr->ib_trans.ib_dev = dev_list[i]; + if (!strcmp(ibv_get_device_name(hca_ptr->ib_trans.ib_dev),hca_name)) + goto found; + } + + dapl_dbg_log (DAPL_DBG_TYPE_ERR, + " open_hca: IB device %s not found\n", + hca_name); + goto err; + +found: dapl_dbg_log (DAPL_DBG_TYPE_UTIL," open_hca: Found dev %s %016llx\n", ibv_get_device_name(hca_ptr->ib_trans.ib_dev), (unsigned long long)bswap_64(ibv_get_device_guid(hca_ptr->ib_trans.ib_dev))); @@ -162,7 +169,7 @@ DAT_RETURN dapls_ib_open_hca ( dapl_dbg_log (DAPL_DBG_TYPE_ERR, " open_hca: IB dev open failed for %s\n", ibv_get_device_name(hca_ptr->ib_trans.ib_dev) ); - return DAT_INTERNAL_ERROR; + goto err; } /* set inline max with enviroment or default */ @@ -242,10 +249,14 @@ DAT_RETURN dapls_ib_open_hca ( ((struct sockaddr_in *)&hca_ptr->hca_address)->sin_addr.s_addr >> 16 & 0xff, ((struct sockaddr_in *)&hca_ptr->hca_address)->sin_addr.s_addr >> 24 & 0xff ); + ibv_free_device_list(dev_list); return dat_status; + bail: ibv_close_device(hca_ptr->ib_hca_handle); hca_ptr->ib_hca_handle = IB_INVALID_HANDLE; +err: + ibv_free_device_list(dev_list); return DAT_INTERNAL_ERROR; } --- userspace/mpi/mvapich-gen2/mpid/ch_gen2/viainit.c (revision 4360) +++ userspace/mpi/mvapich-gen2/mpid/ch_gen2/viainit.c (working copy) @@ -74,13 +74,22 @@ static void set_malloc_options(void) static void open_hca(void) { - struct dlist *dev_list; struct ibv_device *ib_dev = NULL; +#ifdef GEN2_OLD_DEVICE_LIST_VERB + struct dlist *dev_list; + 
dev_list = ibv_get_devices(); dlist_start(dev_list); ib_dev = dlist_next(dev_list); +#else + struct ibv_device **dev_list; + + dev_list = ibv_get_device_list(NULL); + ib_dev = dev_list[0]; + ibv_free_device_list(dev_list); +#endif if (!ib_dev) { fprintf(stderr, "No IB devices found\n"); --- userspace/libehca/configure.in (revision 4360) +++ userspace/libehca/configure.in (working copy) @@ -12,7 +12,7 @@ AC_HEADER_STDC dnl Checks for libraries. AC_CHECK_LIB(ibverbs, - ibv_get_devices, + ibv_get_device_list, [], AC_MSG_ERROR([libibverbs not installed])) --- userspace/librdmacm/configure.in (revision 4360) +++ userspace/librdmacm/configure.in (working copy) @@ -25,8 +25,8 @@ AC_CHECK_SIZEOF(long) dnl Checks for libraries if test "$disable_libcheck" != "yes" then -AC_CHECK_LIB(ibverbs, ibv_get_devices, [], - AC_MSG_ERROR([ibv_get_devices() not found. librdmacm requires libibverbs.])) +AC_CHECK_LIB(ibverbs, ibv_get_device_list, [], + AC_MSG_ERROR([ibv_get_device_list() not found. librdmacm requires libibverbs.])) fi dnl Checks for header files. --- userspace/librdmacm/src/cma.c (revision 4360) +++ userspace/librdmacm/src/cma.c (working copy) @@ -114,7 +114,7 @@ struct cma_id_private { uint32_t handle; }; -static struct dlist *dev_list; +static struct ibv_device **dev_list; static struct dlist *cma_dev_list; static pthread_mutex_t mut = PTHREAD_MUTEX_INITIALIZER; static int ucma_initialized; @@ -141,7 +141,7 @@ static void ucma_cleanup(void) static int ucma_init(void) { - struct ibv_device *dev; + int i; struct cma_device *cma_dev; struct ibv_device_attr attr; int ret; @@ -163,22 +163,22 @@ static int ucma_init(void) goto err; } - dev_list = ibv_get_devices(); + dev_list = ibv_get_device_list(NULL); if (!dev_list) { printf("CMA: unable to get RDMA device liste\n"); ret = -ENODEV; goto err; } - dlist_for_each_data(dev_list, dev, struct ibv_device) { + for (i = 0; dev_list[i]; ++i) { cma_dev = malloc(sizeof *cma_dev); if (!cma_dev) { ret = -ENOMEM; goto err; } - cma_dev->guid = ibv_get_device_guid(dev); - cma_dev->verbs = ibv_open_device(dev); + cma_dev->guid = ibv_get_device_guid(dev_list[i]); + cma_dev->verbs = ibv_open_device(dev_list[i]); if (!cma_dev->verbs) { printf("CMA: unable to open RDMA device\n"); ret = -ENODEV; @@ -201,6 +201,8 @@ out: err: ucma_cleanup(); pthread_mutex_unlock(&mut); + if (dev_list) + ibv_free_device_list(dev_list); return ret; } --- userspace/perftest/rdma_lat.c (revision 4360) +++ userspace/perftest/rdma_lat.c (working copy) @@ -105,18 +105,17 @@ static uint16_t pp_get_local_lid(struct static struct ibv_device *pp_find_dev(const char *ib_devname) { - struct dlist *dev_list; + struct ibv_device **dev_list; struct ibv_device *ib_dev = NULL; - dev_list = ibv_get_devices(); + dev_list = ibv_get_device_list(NULL); - dlist_start(dev_list); if (!ib_devname) { - ib_dev = dlist_next(dev_list); + ib_dev = dev_list[0]; if (!ib_dev) fprintf(stderr, "No IB devices found\n"); } else { - dlist_for_each_data(dev_list, ib_dev, struct ibv_device) + for (ib_dev = *dev_list; ib_dev; ++dev_list) if (!strcmp(ibv_get_device_name(ib_dev), ib_devname)) break; if (!ib_dev) --- userspace/perftest/rdma_bw.c (revision 4360) +++ userspace/perftest/rdma_bw.c (working copy) @@ -472,7 +472,7 @@ static void print_report(unsigned int it int main(int argc, char *argv[]) { - struct dlist *dev_list; + struct ibv_device **dev_list; struct ibv_device *ib_dev; struct pingpong_context *ctx; struct pingpong_dest my_dest; @@ -587,17 +587,16 @@ int main(int argc, char *argv[]) page_size = 
sysconf(_SC_PAGESIZE); - dev_list = ibv_get_devices(); + dev_list = ibv_get_device_list(NULL); - dlist_start(dev_list); if (!ib_devname) { - ib_dev = dlist_next(dev_list); + ib_dev = dev_list[0]; if (!ib_dev) { fprintf(stderr, "No IB devices found\n"); return 1; } } else { - dlist_for_each_data(dev_list, ib_dev, struct ibv_device) + for (ib_dev = *dev_list; ib_dev; ++dev_list) if (!strcmp(ibv_get_device_name(ib_dev), ib_devname)) break; if (!ib_dev) { --- userspace/libibcm/configure.in (revision 4360) +++ userspace/libibcm/configure.in (working copy) @@ -25,8 +25,8 @@ AC_CHECK_SIZEOF(long) dnl Checks for libraries if test "$disable_libcheck" != "yes" then -AC_CHECK_LIB(ibverbs, ibv_get_devices, [], - AC_MSG_ERROR([ibv_get_devices() not found. libibcm requires libibverbs.])) +AC_CHECK_LIB(ibverbs, ibv_get_device_list, [], + AC_MSG_ERROR([ibv_get_device_list() not found. libibcm requires libibverbs.])) #AC_CHECK_LIB(rdmacm, rdma_create_id, [], # AC_MSG_ERROR([rdma_create_id() not found. ucmpost requires librdmacm.])) fi --- userspace/libibcm/examples/cmpost.c (revision 4360) +++ userspace/libibcm/examples/cmpost.c (working copy) @@ -423,15 +423,14 @@ static void destroy_messages(void) static int init(void) { - struct dlist *dev_list; + struct ibv_device **dev_list; int ret; test.connects_left = connections; test.disconnects_left = connections; - dev_list = ibv_get_devices(); - dlist_start(dev_list); - test.device = dlist_next(dev_list); + dev_list = ibv_get_device_list(NULL); + test.device = dev_list[0]; if (!test.device) return -1; From sean.hefty at intel.com Sat Dec 10 09:49:49 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Sat, 10 Dec 2005 09:49:49 -0800 Subject: [openib-general] [PATCH/RFC] change ibv_get_devices() toibv_get_device_list() In-Reply-To: Message-ID: >I'm planning to commit this early next week; any objections, comments, >or suggestions before I do so? I'm in favor of this change. - Sean From mst at mellanox.co.il Sat Dec 10 11:17:43 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sat, 10 Dec 2005 21:17:43 +0200 Subject: [openib-general] Re: [PATCH/RFC] change ibv_get_devices() to ibv_get_device_list() In-Reply-To: References: Message-ID: <20051210191743.GB30682@mellanox.co.il> Quoting r. Roland Dreier : > Subject: [PATCH/RFC] change ibv_get_devices() to ibv_get_device_list() > > This patch converts the ibv_get_devices() API to a better > ibv_get_device_list(). The old API was bad because it exposed the > dlist data structure exposed by libsysfs, which was not thread-safe > and was just plain overly complex for what it was used for. > > In addition, I've converted over all the in-tree users of > ibv_get_devices() that I could find -- DAPL, libehca, libibcm, > librdmacm and mvapich. > > I'm planning to commit this early next week; any objections, comments, > or suggestions before I do so? > > Thanks, > Roland To make hotplug feasible, we need to document requirements that 1. users call free_device_list after opening relevant devices 2. no user opens a device after calling free_device_list -- MST From mst at mellanox.co.il Sat Dec 10 13:21:40 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sat, 10 Dec 2005 23:21:40 +0200 Subject: [openib-general] [PATCH applied] sdp: fix kunmap_atomic usage Message-ID: <20051210212140.GA30971@mellanox.co.il> SDP was using kunmap_atomic incorrectly. Of course, I'm typically using it on platforms where its a nop, so I dint notice :) The following is already applied. 
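For reference, kmap_atomic() returns a kernel virtual address, and kunmap_atomic() must be handed that same address back when tearing the mapping down. A minimal sketch of the correct pairing (copy_from_page() is a hypothetical helper for illustration only, not code from the patch; the KM_IRQ0 slot and the irq save/restore mirror how SDP uses the mapping):

#include <linux/highmem.h>
#include <linux/string.h>

static void copy_from_page(struct page *page, void *dest,
                           unsigned long off, size_t len)
{
	unsigned long flags;
	void *addr;

	local_irq_save(flags);
	addr = kmap_atomic(page, KM_IRQ0);	/* map: returns a virtual address */
	memcpy(dest, addr + off, len);
	kunmap_atomic(addr, KM_IRQ0);		/* unmap with that address, not the page */
	local_irq_restore(flags);
}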
--- kunmap_atomic gets a virtual address, not a page* pointer. Signed-off-by: Michael S. Tsirkin Index: openib/drivers/infiniband/ulp/sdp/sdp_send.c =================================================================== --- openib/drivers/infiniband/ulp/sdp/sdp_send.c (revision 4369) +++ openib/drivers/infiniband/ulp/sdp/sdp_send.c (working copy) @@ -647,7 +647,7 @@ static int sdp_send_data_iocb_src(struct memcpy(buff->tail, addr + off, len); - kunmap_atomic(iocb->page_array[pos], KM_IRQ0); + kunmap_atomic(addr, KM_IRQ0); local_irq_restore(flags); Index: openib/drivers/infiniband/ulp/sdp/sdp_recv.c =================================================================== --- openib/drivers/infiniband/ulp/sdp/sdp_recv.c (revision 4369) +++ openib/drivers/infiniband/ulp/sdp/sdp_recv.c (working copy) @@ -618,7 +618,7 @@ static int sdp_read_buff_iocb(struct sdp iocb->io_addr += copy; - kunmap_atomic(iocb->page_array[counter], KM_IRQ0); + kunmap_atomic(addr, KM_IRQ0); ++counter; local_irq_restore(flags); -- MST From mst at mellanox.co.il Sat Dec 10 13:48:32 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sat, 10 Dec 2005 23:48:32 +0200 Subject: [openib-general] Re: [PATCH/RFC] change ibv_get_devices() to ibv_get_device_list() In-Reply-To: References: Message-ID: <20051210214832.GA31057@mellanox.co.il> Quoting Roland Dreier : > --- userspace/mpi/mvapich-gen2/mpid/ch_gen2/viainit.c (revision 4360) > +++ userspace/mpi/mvapich-gen2/mpid/ch_gen2/viainit.c (working copy) > @@ -74,13 +74,22 @@ static void set_malloc_options(void) > > static void open_hca(void) > { > - struct dlist *dev_list; > struct ibv_device *ib_dev = NULL; > > +#ifdef GEN2_OLD_DEVICE_LIST_VERB > + struct dlist *dev_list; > + > dev_list = ibv_get_devices(); > > dlist_start(dev_list); > ib_dev = dlist_next(dev_list); > +#else > + struct ibv_device **dev_list; > + > + dev_list = ibv_get_device_list(NULL); > + ib_dev = dev_list[0]; > + ibv_free_device_list(dev_list); > +#endif > > if (!ib_dev) { > fprintf(stderr, "No IB devices found\n"); This wont work for hotplug: you are saving the device pointer without opening the device, so it might go away from under your feet. I wander whether we can come up with an API that helps people get it right more easily? -- MST From iod00d at hp.com Sat Dec 10 19:23:47 2005 From: iod00d at hp.com (Grant Grundler) Date: Sat, 10 Dec 2005 19:23:47 -0800 Subject: [openib-general] [PATCH] better warning about libsdp.conf location Message-ID: <20051211032347.GC9348@esmail.cup.hp.com> Michael, When LIBSDP_DEFAULT_CONFIG_FILE isn't set, the default location lidsdp looks for libsdp.conf doesn't match where the Makefile installs it (sysconfdir = /usr/local/etc). Patch below also provides a _useful_ warning message by indicating *why* we are warning the user and the default location (which might vary by release). It just occurred to me that libsdp could set LIBSDP_CONFIG_FILE so the warning doesn't appear on the next invocation. Oh well, idea for another patch... 
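In sketch form, that follow-on idea might look like the fragment below inside __sdp_init() (an illustration only, assuming setenv() is safe to call at library-init time; config_file, __sdp_read_config() and LIBSDP_DEFAULT_CONFIG_FILE are the names used in the patch that follows):

	config_file = getenv("LIBSDP_CONFIG_FILE");
	if (!config_file) {
		config_file = LIBSDP_DEFAULT_CONFIG_FILE;
		printf("libsdp.so: $LIBSDP_CONFIG_FILE not set. Using %s\n",
		       config_file);
		/* Export the default so exec'd children inherit the path and
		 * skip this warning; the 0 means never overwrite a user value. */
		setenv("LIBSDP_CONFIG_FILE", config_file, 0);
	}
	__sdp_read_config(config_file);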
thanks, grant Signed-off-by: Grant Grundler Index: src/userspace/libsdp/src/port.c =================================================================== --- src/userspace/libsdp/src/port.c (revision 4356) +++ src/userspace/libsdp/src/port.c (working copy) @@ -1202,8 +1202,9 @@ if (config_file) { __sdp_read_config(config_file); } else { - printf("default libsdp configuration is used\n"); -#define LIBSDP_DEFAULT_CONFIG_FILE "/usr/local/ibgd/etc/libsdp.conf" +#define LIBSDP_DEFAULT_CONFIG_FILE "/usr/local/etc/libsdp.conf" + printf("libsdp.so: $LIBSDP_CONFIG_FILE not set. Using " + LIBSDP_DEFAULT_CONFIG_FILE "\n"); __sdp_read_config(LIBSDP_DEFAULT_CONFIG_FILE); } } /* __sdp_init */ From iod00d at hp.com Sat Dec 10 20:15:13 2005 From: iod00d at hp.com (Grant Grundler) Date: Sat, 10 Dec 2005 20:15:13 -0800 Subject: [openib-general] [PATCH applied] sdp: fix kunmap_atomic usage In-Reply-To: <20051210212140.GA30971@mellanox.co.il> References: <20051210212140.GA30971@mellanox.co.il> Message-ID: <20051211041513.GD9348@esmail.cup.hp.com> On Sat, Dec 10, 2005 at 11:21:40PM +0200, Michael S. Tsirkin wrote: > SDP was using kunmap_atomic incorrectly. > Of course, I'm typically using it on platforms where its a nop, > so I dint notice :) It's a real function on ia64 so I had to try this. :) One of the recent changes (possibly this one) seems to have fixed the issue! I'll have to run a full set but the initial test was promising. IIRC, the most recent "failure" was with r4279. With r4371, I'm now getting: gsyprf3:~# LD_PRELOAD=/usr/local/lib/libsdp.so /usr/local/bin/netperf -p 12866 -l 60 -H 10.0.0.30 -t TCP_STREAM -T 1,1 -- -m 512 -s 16384 -S 16384 libsdp.so: $LIBSDP_CONFIG_FILE not set. Using /usr/local/etc/libsdp.conf bind_to_specific_processor: enter masking masked TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.30 (10.0.0.30) port 0 AF_INET Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 32768 32768 512 60.01 4823.45 gsyprf3:~# ("dual rope" PCI-X on a pair of HP rx2600, 1.5Ghz/6M) Single stream throughput normally peaks around ~5.5 to 6 Gb/s with this configuration. thanks! grant From iod00d at hp.com Sat Dec 10 20:34:52 2005 From: iod00d at hp.com (Grant Grundler) Date: Sat, 10 Dec 2005 20:34:52 -0800 Subject: [openib-general] [PATCH] OPENSM make missing /dev/infiniband/umad entries obvious Message-ID: <20051211043452.GF9348@esmail.cup.hp.com> Hi, When installing openib bits on a new machine, I wasted an unreasonable and absurd amount of time (by doing some other stupid things) when OpenSM failed to start and gave an error msg about "osm_vendor_bind: ERR 5424: Unable to Open Port 0x1321ffff75787a" Once I enabled debugging in umad.c the problem was obvious: I forgot to mknod the /dev/infiniband device files! Doh! (obviously didn't have udev installed either.) Just tell me the first time around please. Don't hide error messages that users likely to run opensm are able to correct. This error now shows up on the controlling tty. 
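The distinction the patch below relies on is roughly the following pattern (a sketch of the general idea, not libibumad's actual macro definitions): DEBUG output is gated behind a debug knob, while IBWARN always reaches stderr:

#include <stdio.h>

extern int umaddebug;	/* assumed debug knob, for this sketch only */

#define IBWARN(fmt, ...) \
	fprintf(stderr, "libibumad: %s: " fmt "\n", __func__, ##__VA_ARGS__)

#define DEBUG(fmt, ...) \
	do { if (umaddebug) IBWARN(fmt, ##__VA_ARGS__); } while (0)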
thanks, grant Signed-off-by: Grant Grundler Index: src/userspace/management/libibumad/src/umad.c =================================================================== --- src/userspace/management/libibumad/src/umad.c (revision 4371) +++ src/userspace/management/libibumad/src/umad.c (working copy) @@ -558,7 +558,7 @@ umad_open_port(char *ca_name, int portnu UMAD_DEV_DIR , umad_id); if ((port->dev_fd = open(port->dev_file, O_RDWR|O_NONBLOCK)) < 0) { - DEBUG("open %s failed", port->dev_file); + IBWARN("open %s failed", port->dev_file); return -EIO; } From iod00d at hp.com Sat Dec 10 20:42:31 2005 From: iod00d at hp.com (Grant Grundler) Date: Sat, 10 Dec 2005 20:42:31 -0800 Subject: [openib-general] [PATCH] OPENSM identify failure cases uniquely Message-ID: <20051211044231.GG9348@esmail.cup.hp.com> Hi, When tracking down the opensm "can't open port" failure described in previous email, I added log output for each of the failure cases in osm_vendor_open_port(). The "ERR" numbers need to be compared to some "master list" that I don't know about and replaced. I just picked sequential numbers not used in that routine. thanks, grant Signed-off-by: Grant Grundler Index: src/userspace/management/osm/libvendor/osm_vendor_ibumad.c =================================================================== --- src/userspace/management/osm/libvendor/osm_vendor_ibumad.c (revision 4371) +++ src/userspace/management/osm/libvendor/osm_vendor_ibumad.c (working copy) @@ -715,14 +715,26 @@ osm_vendor_open_port( } /* Port found, try to open it */ - if (umad_get_ca(p_vend->ca_names[ca], &p_vend->umad_ca) < 0) + if (umad_get_ca(p_vend->ca_names[ca], &p_vend->umad_ca) < 0) { + osm_log( p_vend->p_log, OSM_LOG_ERROR, + "osm_vendor_open_port: ERR 5423: " + "umad_get_ca() failed\n" ); goto Exit; + } - if (umad_get_port(p_vend->ca_names[ca], i, &p_vend->umad_port) < 0) + if (umad_get_port(p_vend->ca_names[ca], i, &p_vend->umad_port) < 0) { + osm_log( p_vend->p_log, OSM_LOG_ERROR, + "osm_vendor_open_port: ERR 5424: " + "umad_get_port() failed\n" ); goto Exit; + } - if ((umad_port_id = umad_open_port(p_vend->ca_names[ca], i)) < 0) + if ((umad_port_id = umad_open_port(p_vend->ca_names[ca], i)) < 0) { + osm_log( p_vend->p_log, OSM_LOG_ERROR, + "osm_vendor_open_port: ERR 5425: " + "umad_open_port() failed\n" ); goto Exit; + } p_vend->umad_port_id = umad_port_id; From yael at mellanox.co.il Sun Dec 11 04:06:06 2005 From: yael at mellanox.co.il (Yael Kalka) Date: Sun, 11 Dec 2005 14:06:06 +0200 Subject: [openib-general] RE: [PATCH] Opensm - fix osm_venodr_get_all_port_attr Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30E2486@mtlexch01.mtl.com> Hi Hal, Hi Yael, On Thu, 2005-12-08 at 05:39, Yael Kalka wrote: > Hi Hal, > > If osm_vendor_get_all_port_attr is called before the osm_vendor_bind, What exercises the vendor calls in this manner ? [YK] - We saw it in some test that uses the vendor lib. Later on the bind was called. > then the sm_lid of the default port isn't updated correctly. > This patch fixes it. Thanks. Applied. -- Hal From tziporet at mellanox.co.il Sun Dec 11 07:25:32 2005 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Sun, 11 Dec 2005 17:25:32 +0200 Subject: [openib-general] Next workshop dates? Please respond with your preferences Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E366C44A@mtlexch01.mtl.com> Hi Bill, What is the proposed agenda for the workshop? 
Tziporet -----Original Message----- From: Bill Boas [mailto:bboas at llnl.gov] Sent: Friday, December 09, 2005 8:53 AM To: openib-promoters at openib.org; openib-general at openib.org Subject: [openib-general] Next workshop dates? Please respond with your preferences All those wishing to attend the next workshop in Sonoma at the Lodge (same as last year) in the late January-early February please respond with your preferred dates. We currently have Jan29-Feb1 held for us but some people are telling us that is bad for them. The next 2 Sun-Wed slots (Feb 5-8 or 12-15) maybe available but we need guidance from those planning to attend as to their preferred dates. Bill. Bill Boas bboas at llnl.gov ICCD LLNL, B-453, R-2018 Wk: 925-422-4110 7000 East Ave, L-555 Cell: 925-337-2224 Livermore, CA 94551 Pgr: 877-203-2248 _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From bardov at gmail.com Sun Dec 11 07:57:04 2005 From: bardov at gmail.com (Dan Bar Dov) Date: Sun, 11 Dec 2005 17:57:04 +0200 Subject: [openib-general] [PATCH] [CMA] support for SDP + standard protocol In-Reply-To: References: Message-ID: I would have preferred not to add upper layer aware code into CMA, but I guess I'm late for that discussion. Regarding the patch below, it makes sense. Are you going to apply it to all affected modules? Dan On 12/9/05, Sean Hefty wrote: > The following patch updates the CMA to support the IB socket-based > protocol standard and SDP's private data format. > > The CMA now defines RDMA "port spaces". RDMA identifiers are associated > with a user-specified port space at creation time. > > Please respond with any comments on the approach. Note that these > changes have not been pushed up to userspace yet. > > Signed-off-by: Sean Hefty > > > > Index: ulp/iser/iser_verbs.c > =================================================================== > --- ulp/iser/iser_verbs.c (revision 4356) > +++ ulp/iser/iser_verbs.c (working copy) > @@ -428,7 +428,8 @@ iser_connect(struct iser_conn *p_iser_co > return -1; > } > p_iser_conn->cma_id = rdma_create_id(iser_cma_handler, > - (void *)p_iser_conn); > + (void *)p_iser_conn, > + RDMA_PS_TCP); > if (IS_ERR(p_iser_conn->cma_id)) { > ret = PTR_ERR(p_iser_conn->cma_id); > iser_err("rdma_create_id failed: %d\n", ret); > Index: include/rdma/rdma_cm.h > =================================================================== > --- include/rdma/rdma_cm.h (revision 4356) > +++ include/rdma/rdma_cm.h (working copy) > @@ -54,6 +54,13 @@ enum rdma_cm_event_type { > RDMA_CM_EVENT_DEVICE_REMOVAL, > }; > > +enum rdma_port_space { > + RDMA_PS_SDP = 0x0001, > + RDMA_PS_TCP = 0x0106, > + RDMA_PS_UDP = 0x0111, > + RDMA_PS_SCTP = 0x0183 > +}; > + > struct rdma_addr { > struct sockaddr src_addr; > u8 src_pad[sizeof(struct sockaddr_in6) - > @@ -97,11 +104,20 @@ struct rdma_cm_id { > struct ib_qp *qp; > rdma_cm_event_handler event_handler; > struct rdma_route route; > + enum rdma_port_space ps; > u8 port_num; > }; > > +/** > + * rdma_create_id - Create an RDMA identifier. > + * > + * @event_handler: User callback invoked to report events associated with the > + * returned rdma_id. > + * @context: User specified context associated with the id. > + * @ps: RDMA port space. 
> + */ > struct rdma_cm_id* rdma_create_id(rdma_cm_event_handler event_handler, > - void *context); > + void *context, enum rdma_port_space ps); > > void rdma_destroy_id(struct rdma_cm_id *id); > > Index: core/cma.c > =================================================================== > --- core/cma.c (revision 4356) > +++ core/cma.c (working copy) > @@ -110,21 +110,35 @@ struct rdma_id_private { > u8 srq; > }; > > -struct cma_addr { > - u8 version; /* CMA version: 7:4, IP version: 3:0 */ > - u8 reserved; > - __u16 port; > +union cma_ip_addr { > + struct in6_addr ip6; > struct { > - union { > - struct in6_addr ip6; > - struct { > - __u32 pad[3]; > - __u32 addr; > - } ip4; > - } ver; > - } src_addr, dst_addr; > + __u32 pad[3]; > + __u32 addr; > + } ip4; > +}; > + > +struct cma_hdr { > + u8 cma_version; > + u8 ip_version; /* IP version: 7:4 */ > + __u16 port; > + union cma_ip_addr src_addr; > + union cma_ip_addr dst_addr; > }; > > +struct sdp_hh { > + u8 sdp_version; > + u8 ip_version; /* IP version: 7:4 */ > + u8 sdp_specific1[10]; > + __u16 port; > + __u16 sdp_specific2; > + union cma_ip_addr src_addr; > + union cma_ip_addr dst_addr; > +}; > + > +#define CMA_VERSION 0x10 > +#define SDP_VERSION 0x22 > + > static int cma_comp(struct rdma_id_private *id_priv, enum cma_state comp) > { > unsigned long flags; > @@ -162,19 +176,24 @@ static enum cma_state cma_exch(struct rd > return old; > } > > -static inline u8 cma_get_ip_ver(struct cma_addr *addr) > +static inline u8 cma_get_ip_ver(struct cma_hdr *hdr) > { > - return addr->version & 0xF; > + return hdr->ip_version >> 4; > } > > -static inline u8 cma_get_cma_ver(struct cma_addr *addr) > +static inline void cma_set_ip_ver(struct cma_hdr *hdr, u8 ip_ver) > { > - return addr->version >> 4; > + hdr->ip_version = (ip_ver << 4) | (hdr->ip_version & 0xF); > } > > -static inline void cma_set_vers(struct cma_addr *addr, u8 cma_ver, u8 ip_ver) > +static inline u8 sdp_get_ip_ver(struct sdp_hh *hh) > { > - addr->version = (cma_ver << 4) + (ip_ver & 0xF); > + return hh->ip_version >> 4; > +} > + > +static inline void sdp_set_ip_ver(struct sdp_hh *hh, u8 ip_ver) > +{ > + hh->ip_version = (ip_ver << 4) | (hh->ip_version & 0xF); > } > > static void cma_attach_to_dev(struct rdma_id_private *id_priv, > @@ -226,17 +245,18 @@ static void cma_release_remove(struct rd > } > > struct rdma_cm_id* rdma_create_id(rdma_cm_event_handler event_handler, > - void *context) > + void *context, enum rdma_port_space ps) > { > struct rdma_id_private *id_priv; > > id_priv = kzalloc(sizeof *id_priv, GFP_KERNEL); > if (!id_priv) > - return NULL; > + return ERR_PTR(-ENOMEM); > > id_priv->state = CMA_IDLE; > id_priv->id.context = context; > id_priv->id.event_handler = event_handler; > + id_priv->id.ps = ps; > spin_lock_init(&id_priv->lock); > init_waitqueue_head(&id_priv->wait); > atomic_set(&id_priv->refcount, 1); > @@ -387,25 +407,93 @@ int rdma_init_qp_attr(struct rdma_cm_id > } > EXPORT_SYMBOL(rdma_init_qp_attr); > > -static int cma_verify_addr(struct cma_addr *addr, > - struct sockaddr_in *ip_addr) > +static inline int cma_any_addr(struct sockaddr *addr) > { > - if (cma_get_cma_ver(addr) != 1 || cma_get_ip_ver(addr) != 4) > - return -EINVAL; > + struct in6_addr *ip6; > > - if (ip_addr->sin_port != addr->port) > - return -EINVAL; > + if (addr->sa_family == AF_INET) > + return ((struct sockaddr_in *) addr)->sin_addr.s_addr == > + INADDR_ANY; > + else { > + ip6 = &((struct sockaddr_in6 *) addr)->sin6_addr; > + return (ip6->s6_addr32[0] | ip6->s6_addr32[1] | > + ip6->s6_addr32[3] | 
ip6->s6_addr32[4]) == 0; > + } > +} > > - if (ip_addr->sin_addr.s_addr && > - (ip_addr->sin_addr.s_addr != addr->dst_addr.ver.ip4.addr)) > - return -EINVAL; > +static int cma_get_net_info(void *hdr, enum rdma_port_space ps, > + u8 *ip_ver, __u16 *port, > + union cma_ip_addr **src, union cma_ip_addr **dst) > +{ > + switch (ps) { > + case RDMA_PS_SDP: > + if (((struct sdp_hh *) hdr)->sdp_version != SDP_VERSION) > + return -EINVAL; > > + *ip_ver = sdp_get_ip_ver(hdr); > + *port = ((struct sdp_hh *) hdr)->port; > + *src = &((struct sdp_hh *) hdr)->src_addr; > + *dst = &((struct sdp_hh *) hdr)->dst_addr; > + break; > + default: > + if (((struct cma_hdr *) hdr)->cma_version != CMA_VERSION) > + return -EINVAL; > + > + *ip_ver = cma_get_ip_ver(hdr); > + *port = ((struct cma_hdr *) hdr)->port; > + *src = &((struct cma_hdr *) hdr)->src_addr; > + *dst = &((struct cma_hdr *) hdr)->dst_addr; > + break; > + } > return 0; > } > > -static inline int cma_any_addr(struct sockaddr *addr) > +static void cma_save_net_info(struct rdma_addr *addr, > + struct rdma_addr *listen_addr, > + u8 ip_ver, __u16 port, > + union cma_ip_addr *src, union cma_ip_addr *dst) > +{ > + struct sockaddr_in *listen4, *ip4; > + struct sockaddr_in6 *listen6, *ip6; > + > + switch (ip_ver) { > + case 4: > + listen4 = (struct sockaddr_in *) &listen_addr->src_addr; > + ip4 = (struct sockaddr_in *) &addr->src_addr; > + ip4->sin_family = listen4->sin_family; > + ip4->sin_addr.s_addr = dst->ip4.addr; > + ip4->sin_port = listen4->sin_port; > + > + ip4 = (struct sockaddr_in *) &addr->dst_addr; > + ip4->sin_family = listen4->sin_family; > + ip4->sin_addr.s_addr = src->ip4.addr; > + ip4->sin_port = port; > + break; > + case 6: > + listen6 = (struct sockaddr_in6 *) &listen_addr->src_addr; > + ip6 = (struct sockaddr_in6 *) &addr->src_addr; > + ip6->sin6_family = listen6->sin6_family; > + ip6->sin6_addr = dst->ip6; > + ip6->sin6_port = listen6->sin6_port; > + > + ip6 = (struct sockaddr_in6 *) &addr->dst_addr; > + ip6->sin6_family = listen6->sin6_family; > + ip6->sin6_addr = src->ip6; > + ip6->sin6_port = port; > + break; > + default: > + break; > + } > +} > + > +static inline int cma_user_data_offset(enum rdma_port_space ps) > { > - return ((struct sockaddr_in *) addr)->sin_addr.s_addr == 0; > + switch (ps) { > + case RDMA_PS_SDP: > + return 0; > + default: > + return sizeof(struct cma_hdr); > + } > } > > static int cma_notify_user(struct rdma_id_private *id_priv, > @@ -640,53 +728,41 @@ static struct rdma_id_private* cma_new_i > { > struct rdma_id_private *id_priv; > struct rdma_cm_id *id; > - struct rdma_route *route; > - struct sockaddr_in *ip_addr, *listen_addr; > - struct ib_sa_path_rec *path_rec; > - struct cma_addr *addr; > - int num_paths; > - > - listen_addr = (struct sockaddr_in *) &listen_id->route.addr.src_addr; > - if (cma_verify_addr(ib_event->private_data, listen_addr)) > - return NULL; > + struct rdma_route *rt; > + union cma_ip_addr *src, *dst; > + __u16 port; > + u8 ip_ver; > > - num_paths = 1 + (ib_event->param.req_rcvd.alternate_path != NULL); > - path_rec = kmalloc(sizeof *path_rec * num_paths, GFP_KERNEL); > - if (!path_rec) > + id = rdma_create_id(listen_id->event_handler, listen_id->context, > + listen_id->ps); > + if (IS_ERR(id)) > return NULL; > > - id = rdma_create_id(listen_id->event_handler, listen_id->context); > - if (!id) > + rt = &id->route; > + rt->num_paths = ib_event->param.req_rcvd.alternate_path ? 
2 : 1; > + rt->path_rec = kmalloc(sizeof *rt->path_rec * rt->num_paths, GFP_KERNEL); > + if (!rt->path_rec) > goto err; > > - addr = ib_event->private_data; > - route = &id->route; > + if (cma_get_net_info(ib_event->private_data, listen_id->ps, > + &ip_ver, &port, &src, &dst)) > + goto err; > > - ip_addr = (struct sockaddr_in *) &route->addr.src_addr; > - ip_addr->sin_family = listen_addr->sin_family; > - ip_addr->sin_addr.s_addr = addr->dst_addr.ver.ip4.addr; > - ip_addr->sin_port = listen_addr->sin_port; > - > - ip_addr = (struct sockaddr_in *) &route->addr.dst_addr; > - ip_addr->sin_family = listen_addr->sin_family; > - ip_addr->sin_addr.s_addr = addr->src_addr.ver.ip4.addr; > - ip_addr->sin_port = addr->port; > - > - route->num_paths = num_paths; > - route->path_rec = path_rec; > - path_rec[0] = *ib_event->param.req_rcvd.primary_path; > - if (num_paths == 2) > - path_rec[1] = *ib_event->param.req_rcvd.alternate_path; > - > - route->addr.addr.ibaddr.sgid = path_rec->sgid; > - route->addr.addr.ibaddr.dgid = path_rec->dgid; > - route->addr.addr.ibaddr.pkey = be16_to_cpu(path_rec->pkey); > + cma_save_net_info(&id->route.addr, &listen_id->route.addr, > + ip_ver, port, src, dst); > + rt->path_rec[0] = *ib_event->param.req_rcvd.primary_path; > + if (rt->num_paths == 2) > + rt->path_rec[1] = *ib_event->param.req_rcvd.alternate_path; > + > + rt->addr.addr.ibaddr.sgid = rt->path_rec[0].sgid; > + rt->addr.addr.ibaddr.dgid = rt->path_rec[0].dgid; > + rt->addr.addr.ibaddr.pkey = be16_to_cpu(rt->path_rec[0].pkey); > > id_priv = container_of(id, struct rdma_id_private, id); > id_priv->state = CMA_CONNECT; > return id_priv; > err: > - kfree(path_rec); > + rdma_destroy_id(id); > return NULL; > } > > @@ -708,7 +784,6 @@ static int cma_req_handler(struct ib_cm_ > goto out; > } > > - conn_id->state = CMA_CONNECT; > atomic_inc(&conn_id->dev_remove); > ret = cma_acquire_ib_dev(conn_id, &conn_id->id.route.path_rec[0].sgid); > if (ret) { > @@ -722,7 +797,7 @@ static int cma_req_handler(struct ib_cm_ > cm_id->context = conn_id; > cm_id->cm_handler = cma_ib_handler; > > - offset = sizeof(struct cma_addr); > + offset = cma_user_data_offset(listen_id->id.ps); > ret = cma_notify_user(conn_id, RDMA_CM_EVENT_CONNECT_REQUEST, 0, > ib_event->private_data + offset, > IB_CM_REQ_PRIVATE_DATA_SIZE - offset); > @@ -738,16 +813,16 @@ out: > return ret; > } > > -static __be64 cma_get_service_id(struct sockaddr *addr) > +static __be64 cma_get_service_id(enum rdma_port_space ps, struct sockaddr *addr) > { > - return cpu_to_be64(((u64)IB_OPENIB_OUI << 48) + > + return cpu_to_be64(((u64)ps << 16) + > ((struct sockaddr_in *) addr)->sin_port); > } > > static void cma_set_compare_data(struct sockaddr *addr, > struct ib_cm_private_data_compare *compare) > { > - struct cma_addr *data, *mask; > + struct cma_hdr *data, *mask; > > memset(compare, 0, sizeof *compare); > data = (void *) compare->data; > @@ -755,19 +830,18 @@ static void cma_set_compare_data(struct > > switch (addr->sa_family) { > case AF_INET: > - cma_set_vers(data, 0, 4); > - cma_set_vers(mask, 0, 0xF); > - data->dst_addr.ver.ip4.addr = ((struct sockaddr_in *) addr)-> > - sin_addr.s_addr; > - mask->dst_addr.ver.ip4.addr = ~0; > + cma_set_ip_ver(data, 4); > + cma_set_ip_ver(mask, 0xF); > + data->dst_addr.ip4.addr = ((struct sockaddr_in *) addr)-> > + sin_addr.s_addr; > + mask->dst_addr.ip4.addr = ~0; > break; > case AF_INET6: > - cma_set_vers(data, 0, 6); > - cma_set_vers(mask, 0, 0xF); > - data->dst_addr.ver.ip6 = ((struct sockaddr_in6 *) addr)-> > - sin6_addr; > - 
memset(&mask->dst_addr.ver.ip6, 1, > - sizeof mask->dst_addr.ver.ip6); > + cma_set_ip_ver(data, 6); > + cma_set_ip_ver(mask, 0xF); > + data->dst_addr.ip6 = ((struct sockaddr_in6 *) addr)-> > + sin6_addr; > + memset(&mask->dst_addr.ip6, 1, sizeof mask->dst_addr.ip6); > break; > default: > break; > @@ -787,7 +861,7 @@ static int cma_ib_listen(struct rdma_id_ > return PTR_ERR(id_priv->cm_id); > > addr = &id_priv->id.route.addr.src_addr; > - svc_id = cma_get_service_id(addr); > + svc_id = cma_get_service_id(id_priv->id.ps, addr); > if (cma_any_addr(addr)) > ret = ib_cm_listen(id_priv->cm_id, svc_id, 0, NULL); > else { > @@ -835,7 +909,7 @@ static void cma_listen_on_dev(struct rdm > struct rdma_cm_id *id; > int ret; > > - id = rdma_create_id(cma_listen_handler, id_priv); > + id = rdma_create_id(cma_listen_handler, id_priv, id_priv->id.ps); > if (IS_ERR(id)) > return; > > @@ -1099,19 +1173,34 @@ err: > } > EXPORT_SYMBOL(rdma_bind_addr); > > -static void cma_format_addr(struct cma_addr *addr, struct rdma_route *route) > +static void cma_format_hdr(void *hdr, enum rdma_port_space ps, > + struct rdma_route *route) > { > - struct sockaddr_in *ip_addr; > - > - memset(addr, 0, sizeof *addr); > - cma_set_vers(addr, 1, 4); > - > - ip_addr = (struct sockaddr_in *) &route->addr.src_addr; > - addr->src_addr.ver.ip4.addr = ip_addr->sin_addr.s_addr; > - > - ip_addr = (struct sockaddr_in *) &route->addr.dst_addr; > - addr->dst_addr.ver.ip4.addr = ip_addr->sin_addr.s_addr; > - addr->port = ip_addr->sin_port; > + struct sockaddr_in *src4, *dst4; > + struct cma_hdr *cma_hdr; > + struct sdp_hh *sdp_hdr; > + > + src4 = (struct sockaddr_in *) &route->addr.src_addr; > + dst4 = (struct sockaddr_in *) &route->addr.dst_addr; > + > + switch (ps) { > + case RDMA_PS_SDP: > + sdp_hdr = hdr; > + sdp_hdr->sdp_version = SDP_VERSION; > + sdp_set_ip_ver(sdp_hdr, 4); > + sdp_hdr->src_addr.ip4.addr = src4->sin_addr.s_addr; > + sdp_hdr->dst_addr.ip4.addr = dst4->sin_addr.s_addr; > + sdp_hdr->port = src4->sin_port; > + break; > + default: > + cma_hdr = hdr; > + cma_hdr->cma_version = CMA_VERSION; > + cma_set_ip_ver(cma_hdr, 4); > + cma_hdr->src_addr.ip4.addr = src4->sin_addr.s_addr; > + cma_hdr->dst_addr.ip4.addr = dst4->sin_addr.s_addr; > + cma_hdr->port = src4->sin_port; > + break; > + } > } > > static int cma_connect_ib(struct rdma_id_private *id_priv, > @@ -1119,17 +1208,20 @@ static int cma_connect_ib(struct rdma_id > { > struct ib_cm_req_param req; > struct rdma_route *route; > - struct cma_addr *addr; > void *private_data; > - int ret; > + int offset, ret; > > memset(&req, 0, sizeof req); > - req.private_data_len = sizeof *addr + conn_param->private_data_len; > - > - private_data = kmalloc(req.private_data_len, GFP_ATOMIC); > + offset = cma_user_data_offset(id_priv->id.ps); > + req.private_data_len = offset + conn_param->private_data_len; > + private_data = kzalloc(req.private_data_len, GFP_ATOMIC); > if (!private_data) > return -ENOMEM; > > + if (conn_param->private_data && conn_param->private_data_len) > + memcpy(private_data + offset, conn_param->private_data, > + conn_param->private_data_len); > + > id_priv->cm_id = ib_create_cm_id(id_priv->id.device, cma_ib_handler, > id_priv); > if (IS_ERR(id_priv->cm_id)) { > @@ -1137,20 +1229,16 @@ static int cma_connect_ib(struct rdma_id > goto out; > } > > - addr = private_data; > route = &id_priv->id.route; > - cma_format_addr(addr, route); > - > - if (conn_param->private_data && conn_param->private_data_len) > - memcpy(addr + 1, conn_param->private_data, > - 
conn_param->private_data_len); > + cma_format_hdr(private_data, id_priv->id.ps, route); > req.private_data = private_data; > > req.primary_path = &route->path_rec[0]; > if (route->num_paths == 2) > req.alternate_path = &route->path_rec[1]; > > - req.service_id = cma_get_service_id(&route->addr.dst_addr); > + req.service_id = cma_get_service_id(id_priv->id.ps, > + &route->addr.dst_addr); > req.qp_num = id_priv->qp_num; > req.qp_type = id_priv->qp_type; > req.starting_psn = id_priv->seq_num; > @@ -1317,23 +1405,6 @@ out: > } > EXPORT_SYMBOL(rdma_disconnect); > > -/* TODO: add this to the device structure - see Roland's patch */ > -static __be64 get_ca_guid(struct ib_device *device) > -{ > - struct ib_device_attr *device_attr; > - __be64 guid; > - int ret; > - > - device_attr = kmalloc(sizeof *device_attr, GFP_KERNEL); > - if (!device_attr) > - return 0; > - > - ret = ib_query_device(device, device_attr); > - guid = ret ? 0 : device_attr->node_guid; > - kfree(device_attr); > - return guid; > -} > - > static void cma_add_one(struct ib_device *device) > { > struct cma_device *cma_dev; > @@ -1344,7 +1415,7 @@ static void cma_add_one(struct ib_device > return; > > cma_dev->device = device; > - cma_dev->node_guid = get_ca_guid(device); > + cma_dev->node_guid = device->node_guid; > if (!cma_dev->node_guid) > goto err; > > Index: core/ucma.c > =================================================================== > --- core/ucma.c (revision 4356) > +++ core/ucma.c (working copy) > @@ -287,7 +287,7 @@ static ssize_t ucma_create_id(struct ucm > return -ENOMEM; > > ctx->uid = cmd.uid; > - ctx->cm_id = rdma_create_id(ucma_event_handler, ctx); > + ctx->cm_id = rdma_create_id(ucma_event_handler, ctx, RDMA_PS_TCP); > if (IS_ERR(ctx->cm_id)) { > ret = PTR_ERR(ctx->cm_id); > goto err1; > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From iod00d at hp.com Sun Dec 11 09:53:41 2005 From: iod00d at hp.com (Grant Grundler) Date: Sun, 11 Dec 2005 09:53:41 -0800 Subject: [openib-general] [PATCH applied] sdp: fix kunmap_atomic usage In-Reply-To: <20051210212140.GA30971@mellanox.co.il> References: <20051210212140.GA30971@mellanox.co.il> Message-ID: <20051211175341.GA12176@esmail.cup.hp.com> On Sat, Dec 10, 2005 at 11:21:40PM +0200, Michael S. Tsirkin wrote: > SDP was using kunmap_atomic incorrectly. > Of course, I'm typically using it on platforms where its a nop, > so I dint notice :) I might have spoken too soon...I just started getting "ERR" output from ib_sdp running netperf TCP_STREAM over SDP on the IA64 rx2600's. I killed and restarted the "sdpstream" script. It seems to be working. I've not yet seen this type of error running r4344 on a different box. If it's not obvious what's wrong, I can try r4344 on the rx2600's as well. 
thanks, grant $ for i in run-*; do date; echo $i; ./$i 10.0.0.30 CPU ; done Sat Dec 10 20:12:36 PST 2005 run-sdprr Sun Dec 11 09:07:06 PST 2005 run-sdprr-gnuplot Sun Dec 11 09:07:06 PST 2005 run-sdpstream ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <8192:0:8192> ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <8197:0:8197> ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <16384:0:16384> ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <49152:0:49152> ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <49157:0:49157> ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <57344:0:57344> ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <65536:0:65536> ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <126976:0:126976> ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <126976:0:126976> ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <126976:0:126976> ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <8192:0:8192> ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <8197:0:8197> ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <16384:0:16384> ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <49152:0:49152> ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <49157:0:49157> ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <57344:0:57344> ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <65536:0:65536> ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <126976:0:126976> ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <126976:0:126976> ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <126976:0:126976> ... From mst at mellanox.co.il Sun Dec 11 10:05:43 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 11 Dec 2005 20:05:43 +0200 Subject: [openib-general] [PATCH applied] sdp: fix kunmap_atomic usage In-Reply-To: <20051211175341.GA12176@esmail.cup.hp.com> References: <20051211175341.GA12176@esmail.cup.hp.com> Message-ID: <20051211180543.GR14936@mellanox.co.il> Quoting Grant Grundler : > ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <8192:0:8192> > ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <8197:0:8197> > ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <16384:0:16384> > ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <49152:0:49152> > ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <49157:0:49157> > ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <57344:0:57344> > ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <65536:0:65536> > ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <126976:0:126976> > ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <126976:0:126976> > ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <126976:0:126976> > ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <8192:0:8192> > ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <8197:0:8197> > ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <16384:0:16384> > ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <49152:0:49152> > ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <49157:0:49157> > ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <57344:0:57344> > ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <65536:0:65536> > ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <126976:0:126976> > ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <126976:0:126976> > ib_sdp ERR: IOCB <-1> cancel <0> flag <0340> size <126976:0:126976> This might be benign, need to check. Did the test run to completion with these messages? 
-- MST

From bboas at llnl.gov Sun Dec 11 16:10:54 2005
From: bboas at llnl.gov (Bill Boas)
Date: Sun, 11 Dec 2005 16:10:54 -0800
Subject: [openib-general] Next workshop dates? Ideas for agenda???
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E366C44A@mtlexch01.mtl.com>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E366C44A@mtlexch01.mtl.com>
Message-ID: <6.2.3.4.2.20051211160126.03bc3878@mail-lc.llnl.gov>

Tziporet,

Not sure yet - I think, subject to others' input, it'll be focused on wrapping up rel 1.0 of OpenIB, discussing what the developers are going to focus on next, and validating the strategy for RDMA over Ethernet integration at the verbs level to lay the foundation for one, consistent RDMA structure in Linux, if possible.

We may also see the formation of customer working groups with vertical (e.g. financial, HPC, oil and gas) common interests able to express their requirements as a group to the development community.

Just some ideas, it would be good to get feedback from both the developer and the promoter communities????

Bill.

At 07:25 AM 12/11/2005, Tziporet Koren wrote:
>Hi Bill,
>
>What is the proposed agenda for the workshop?
>
>Tziporet
>
>
>-----Original Message-----
>From: Bill Boas [mailto:bboas at llnl.gov]
>Sent: Friday, December 09, 2005 8:53 AM
>To: openib-promoters at openib.org; openib-general at openib.org
>Subject: [openib-general] Next workshop dates? Please respond with your
>preferences
>
>All those wishing to attend the next workshop in Sonoma at the Lodge
>(same as last year) in the late January-early February please respond
>with your preferred dates.
> >We currently have Jan29-Feb1 held for us but some people are telling >us that is bad for them. > >The next 2 Sun-Wed slots (Feb 5-8 or 12-15) maybe available but we >need guidance from those planning to attend as to their preferred dates. > >Bill. > >Bill Boas bboas at llnl.gov >ICCD LLNL, B-453, R-2018 Wk: 925-422-4110 >7000 East Ave, L-555 Cell: 925-337-2224 >Livermore, CA 94551 Pgr: 877-203-2248 > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit >http://openib.org/mailman/listinfo/openib-general >_______________________________________________ >openib-promoters mailing list >openib-promoters at openib.org >http://openib.org/mailman/listinfo/openib-promoters Bill Boas bboas at llnl.gov ICCD LLNL, B-453, R-2018 Wk: 925-422-4110 7000 East Ave, L-555 Cell: 925-337-2224 Livermore, CA 94551 Pgr: 877-203-2248 From spoole at lanl.gov Sun Dec 11 16:16:37 2005 From: spoole at lanl.gov (Steve Poole) Date: Sun, 11 Dec 2005 17:16:37 -0700 Subject: [openib-general] Re: [Openib-promoters] Next workshop dates? Ideas for agenda??? In-Reply-To: <6.2.3.4.2.20051211160126.03bc3878@mail-lc.llnl.gov> References: <6AB138A2AB8C8E4A98B9C0C3D52670E366C44A@mtlexch01.mtl.com> <6.2.3.4.2.20051211160126.03bc3878@mail-lc.llnl.gov> Message-ID: <6.2.5.6.0.20051211171549.0203dd28@lanl.gov> At 05:10 PM 12/11/2005, Bill Boas wrote: >Tziporet, > >Not sure yet - I think, subject to others input, it'll be focused on >wrapping up rel 1.0 of OpenIB, discussing what the developers are >going to focus on next and validating the strategy for RDMA over >Ethernet integration at the verbs level to lay the foundation for >one, consistent RDMA structure in Linux, if possible. > >We may also see the formation of customer working groups with >vertical (e.g.:- financial, HPC, oil and gas, ) common interests >able to express their requirements as a group to the development community. As long as they merge with the rest of the requirements for OpenIB, this is great. We will not have several different versions of OpenIB. Steve... >Just some ideas, it would be good to get feedback from both the >developer and the promoter communities???? > >Bill. > > At 07:25 AM 12/11/2005, Tziporet Koren wrote: >>Hi Bill, >> >>What is the proposed agenda for the workshop? >> >>Tziporet >> >> >>-----Original Message----- >>From: Bill Boas [mailto:bboas at llnl.gov] >>Sent: Friday, December 09, 2005 8:53 AM >>To: openib-promoters at openib.org; openib-general at openib.org >>Subject: [openib-general] Next workshop dates? Please respond with your >>preferences >> >>All those wishing to attend the next workshop in Sonoma at the Lodge >>(same as last year) in the late January-early February please respond >>with your preferred dates. >> >>We currently have Jan29-Feb1 held for us but some people are telling >>us that is bad for them. >> >>The next 2 Sun-Wed slots (Feb 5-8 or 12-15) maybe available but we >>need guidance from those planning to attend as to their preferred dates. >> >>Bill. 
>> >>Bill Boas bboas at llnl.gov >>ICCD LLNL, B-453, R-2018 Wk: 925-422-4110 >>7000 East Ave, L-555 Cell: 925-337-2224 >>Livermore, CA 94551 Pgr: 877-203-2248 >> >>_______________________________________________ >>openib-general mailing list >>openib-general at openib.org >>http://openib.org/mailman/listinfo/openib-general >> >>To unsubscribe, please visit >>http://openib.org/mailman/listinfo/openib-general >>_______________________________________________ >>openib-promoters mailing list >>openib-promoters at openib.org >>http://openib.org/mailman/listinfo/openib-promoters > >Bill Boas bboas at llnl.gov >ICCD LLNL, B-453, R-2018 Wk: 925-422-4110 >7000 East Ave, L-555 Cell: 925-337-2224 >Livermore, CA 94551 Pgr: 877-203-2248 >_______________________________________________ >openib-promoters mailing list >openib-promoters at openib.org >http://openib.org/mailman/listinfo/openib-promoters From bboas at llnl.gov Sun Dec 11 16:18:19 2005 From: bboas at llnl.gov (Bill Boas) Date: Sun, 11 Dec 2005 16:18:19 -0800 Subject: [openib-general] RE: Next workshop dates? Feedback from others please!!! In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F000652719A@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F000652719A@orsmsx408> Message-ID: <6.2.3.4.2.20051211161206.039c1b80@mail-lc.llnl.gov> Bob, Voltaire says they cannot make Jan 29 'cos they have an all hands 3 day meeting in that time slot. In the Feb 5-8 timeframe we worked in the super bowl last year OK I think??? By the time we get to late Feb the hotel rates in Sonoma go up so I'm thinking we should go for Feb 5 again, but feedback on this would be very helpful as we need to make a decision in the next few days. Feedback from others please. Bill. At 03:10 PM 12/9/2005, you wrote: >Jan29-Feb1 works for me. >Feb 5 is super bowl Sunday, might want to stay away from that one >Feb 12 - Sean is on vacation. >Feb 20 - Is a holiday in the US. >How about Feb 27- March 2 as an alternative if people cannot make Jan29 >? > >woody > > >-----Original Message----- >From: openib-general-bounces at openib.org >[mailto:openib-general-bounces at openib.org] On Behalf Of Bill Boas >Sent: Thursday, December 08, 2005 10:53 PM >To: openib-promoters at openib.org; openib-general at openib.org >Subject: [openib-general] Next workshop dates? Please respond with >yourpreferences > >All those wishing to attend the next workshop in Sonoma at the Lodge >(same as last year) in the late January-early February please respond >with your preferred dates. > >We currently have Jan29-Feb1 held for us but some people are telling >us that is bad for them. > >The next 2 Sun-Wed slots (Feb 5-8 or 12-15) maybe available but we >need guidance from those planning to attend as to their preferred dates. > >Bill. 
> >Bill Boas bboas at llnl.gov >ICCD LLNL, B-453, R-2018 Wk: 925-422-4110 >7000 East Ave, L-555 Cell: 925-337-2224 >Livermore, CA 94551 Pgr: 877-203-2248 > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit >http://openib.org/mailman/listinfo/openib-general Bill Boas bboas at llnl.gov ICCD LLNL, B-453, R-2018 Wk: 925-422-4110 7000 East Ave, L-555 Cell: 925-337-2224 Livermore, CA 94551 Pgr: 877-203-2248 From krkumar2 at in.ibm.com Sun Dec 11 21:40:12 2005 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Mon, 12 Dec 2005 11:10:12 +0530 Subject: [openib-general] [PATCH fixed] was Re: [PATCH] ipoib_multicast/ipoib_mcast_send race In-Reply-To: <20051209131151.GA21716@mellanox.co.il> Message-ID: Hi Michael, But the lock doesn't help that case. The only difference with having the lock is that in case of a race, the mcast_send() will complete *before* the flag is set, while without the lock the mcast_send() could be in the *middle* of execution when the flag is set. But in both cases, the packet would be sent out. In the reverse race (that is, when stop_thread executes before mcast_send()), the packet would not be sent out in either case, hence the lock does not help in either case. I feel the new code looks fine without the lock. Thanks, - KK "Michael S. Tsirkin" wrote on 12/09/2005 06:41:51 PM: > The lock around clear_bit is there to ensure that ipoib_mcast_send isnt running > already when we stop the thread. > Thats why test_bit has to be inside the lock, too. > > > Quoting r. Krishna Kumar2 : > > Subject: Re: [openib-general] [PATCH fixed] was Re: [PATCH]? > ipoib_multicast/ipoib_mcast_send race > > > > > > Hi Michael, > > > > Is there a reason to have the atomic set_bit() within a lock (even for > > a race condition of stop vs send, it doesn't seem to be required) ? > > Which means the test_bit() can also be put before the existing lock... > > > > Thanks, > > > > - KK > > > > openib-general-bounces at openib.org wrote on 12/09/2005 12:04:06 AM: > > > > > Quoting Michael S. Tsirkin : > > > > Subject: [PATCH] ipoib_multicast/ipoib_mcast_send race > > > > > > > > Hello, Roland! > > > > Here's another race scenario. > > > > > > > > --- > > > > > > > > Fix the following race scenario: > > > > device is up. > > > > port event or set mcast list triggers ipoib_mcast_stop_thread, > > > > This cancels the query and waits on mcast "done" completion. > > > > completion is called and "done" is set. > > > > Meanwhile, ipoib_mcast_send arrives and starts a new query, > > > > re-initializing "done". > > > > > > > > Signed-off-by: Michael S. Tsirkin > > > > > > The patch I posted previously leaked an skb when a multicast > > > send arrived while the mcast thread is stopped. > > > > > > Further, there's an additional issue that I saw in testing: > > > ipoib_mcast_send may get called when priv->broadcast is NULL > > > (e.g. if the device was downed and then upped internally because > > > of a port event). > > > If this happens and the sendonly join request gets completed before > > > priv->broadcast is set, we get an oops that I posted previously. > > > > > > Here's a better patch to address these two problems. > > > It has been running fine here for a while now. > > > > > > Please note that this replaces the ipoib_multicast/ipoib_mcast_send > > patch, > > > but not the ADMIN_UP patch that I posted previously. > > > > > > --- > > > > > > Do not send multicasts if mcast thread is stopped or if > > > priv->broadcast is not set. > > > > > > Signed-off-by: Michael S. Tsirkin > > > > > > Index: openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > > > =================================================================== > > > --- openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c (revision > > 4222) > > > +++ openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c (working > > copy) > > > @@ -582,6 +582,10 @@ int ipoib_mcast_start_thread(struct net_ > > > queue_work(ipoib_workqueue, &priv->mcast_task); > > > up(&mcast_mutex); > > > > > > + spin_lock_irq(&priv->lock); > > > + set_bit(IPOIB_MCAST_STARTED, &priv->flags); > > > + spin_unlock_irq(&priv->lock); > > > + > > > return 0; > > > } > > > > > > @@ -592,6 +596,10 @@ int ipoib_mcast_stop_thread(struct net_d > > > > > > ipoib_dbg_mcast(priv, "stopping multicast thread\n"); > > > > > > + spin_lock_irq(&priv->lock); > > > + clear_bit(IPOIB_MCAST_STARTED, &priv->flags); > > > + spin_unlock_irq(&priv->lock); > > > + > > > down(&mcast_mutex); > > > clear_bit(IPOIB_MCAST_RUN, &priv->flags); > > > cancel_delayed_work(&priv->mcast_task); > > > @@ -674,6 +682,11 @@ void ipoib_mcast_send(struct net_device > > > */ > > > spin_lock(&priv->lock); > > > > > > + if (!test_bit(IPOIB_MCAST_STARTED, &priv->flags) || > > !priv->broadcast) { > > > + dev_kfree_skb_any(skb); > > > + goto unlock; > > > + } > > > + > > > mcast = __ipoib_mcast_find(dev, mgid); > > > if (!mcast) { > > > /* Let's create a new send only group now */ > > > @@ -732,6 +745,7 @@ out: > > > ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN); > > > } > > > > > > +unlock: > > > spin_unlock(&priv->lock); > > > } > > > > > > Index: openib/drivers/infiniband/ulp/ipoib/ipoib.h > > > =================================================================== > > > --- openib/drivers/infiniband/ulp/ipoib/ipoib.h (revision 4222) > > > +++ openib/drivers/infiniband/ulp/ipoib/ipoib.h (working copy) > > > @@ -78,6 +78,7 @@ enum { > > > IPOIB_FLAG_SUBINTERFACE = 4, > > > IPOIB_MCAST_RUN = 5, > > > IPOIB_STOP_REAPER = 6, > > > + IPOIB_MCAST_STARTED = 7, > > > > > > IPOIB_MAX_BACKOFF_SECONDS = 16, > > > > > > > > > -- > > > MST > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > -- > MST -------------- next part -------------- An HTML attachment was scrubbed... URL: 
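
The race being fixed is easier to see laid out as a timeline. This sketch only restates the scenario quoted above, using the names from the patch (priv, mcast->done); it is not code from the posting:

/*
 * stop path                              send path
 * ipoib_mcast_stop_thread()              ipoib_mcast_send()
 *
 * cancels the outstanding query
 *                                        (query completion runs,
 *                                         mcast->done is completed)
 *                                        arrives, starts a new query,
 *                                        re-initializing mcast->done
 * wait_for_completion(&mcast->done)
 *   now waits on the re-initialized
 *   completion and can block forever
 */

The IPOIB_MCAST_STARTED bit added above closes the window by making ipoib_mcast_send() drop the packet once the thread is being stopped.
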
From mst at mellanox.co.il Sun Dec 11 22:12:13 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 12 Dec 2005 08:12:13 +0200 Subject: [openib-general] [PATCH fixed] was Re: [PATCH] ipoib_multicast/ipoib_mcast_send race In-Reply-To: References: Message-ID: <20051212061213.GA24168@mellanox.co.il> Quoting Krishna Kumar2 : > Subject: Re: [openib-general] [PATCH fixed] was Re: [PATCH]?ipoib_multicast/ipoib_mcast_send race > > > Hi Michael, > > But the lock doesn't help that case. The only difference with having the > > lock is that in case of a race, the mcast_send() will complete *before* > the flag is set, while without the lock the mcast_send() could be in the > > *middle* of execution when the flag is set. Exactly, you got it. If you look at mcast_send you'll see that it creates new queries, creates the broadcast group and adds entries to the list. 
So here's why the lock helps: > > > > > Fix the following race scenario: > > > > > device is up. > > > > > port event or set mcast list triggers ipoib_mcast_stop_thread, > > > > > This cancels the query and waits on mcast "done" completion. > > > > > completion is called and "done" is set. > > > > > Meanwhile, ipoib_mcast_send arrives and starts a new query, > > > > > re-initializing "done". Clear now? -- MST From krkumar2 at in.ibm.com Sun Dec 11 22:37:58 2005 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Mon, 12 Dec 2005 12:07:58 +0530 Subject: [openib-general] [PATCH fixed] was Re: [PATCH] ipoib_multicast/ipoib_mcast_send race In-Reply-To: <20051212061213.GA24168@mellanox.co.il> Message-ID: Hi Michael, > Exactly, you got it. If you look at mcast_send you'll see that it creates new > queries, creates the broadcast group and adds entries to the list. Correct, but even with the lock, it would create those (once, and that is true whether or not the lock is held). The only thing stopping creation of those is setting the bit (but only for (a) the non-race case, or (b) the race case where stop_thread executes before mcast_send(), not the race case where mcast_send() wins over stop_thread), and holding the lock for the setting/testing of that bit will not stop creation of those entries in the (b) case. > This cancels the query and waits on mcast "done" completion. > completion is called and "done" is set. > Meanwhile, ipoib_mcast_send arrives and starts a new query, > re-initializing "done". Isn't all that managed by clearing/testing the bit ? Because holding the lock doesn't solve it. To give an example : stop_thread() { lock(); clear(); unlock(); ... wait_for_completion(mcast); } mcast_send() { lock(); test(); results_in_creation_of_entries_done_etc();; unlock(); } In this case, if mcast_send() gets the lock first and proceeds while the stop_thread is spinning on the lock, the entries are created and then the stop_thread() clears the bit and at this point in time, no more entries can be ever created. Now if the lock were removed, the behavior is identical - the mcast_send() would test the bit, and get the lock() while the stop_thread() clears the bit (without a lock) and waits for completion, while *no more* mcast_sends() would ever continue beyond this time. thanks, - KK "Michael S. Tsirkin" wrote on 12/12/2005 11:42:13 AM: > Quoting Krishna Kumar2 : > > Subject: Re: [openib-general] [PATCH fixed] was Re: [PATCH]? > ipoib_multicast/ipoib_mcast_send race > > > > > > Hi Michael, > > > > But the lock doesn't help that case. The only difference with having the > > > > lock is that in case of a race, the mcast_send() will complete *before* > > the flag is set, while without the lock the mcast_send() could be in the > > > > *middle* of execution when the flag is set. > > Exactly, you got it. If you look at mcast_send you'll see that it creates new > queries, creates the broadcast group and adds entries to the list. > > So here's why the lock helps: > > > > > > > Fix the following race scenario: > > > > > > device is up. > > > > > > port event or set mcast list triggers ipoib_mcast_stop_thread, > > > > > > This cancels the query and waits on mcast "done" completion. > > > > > > completion is called and "done" is set. > > > > > > Meanwhile, ipoib_mcast_send arrives and starts a new query, > > > > > > re-initializing "done". > > Clear now? > > -- > MST -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at mellanox.co.il Sun Dec 11 22:56:23 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 12 Dec 2005 08:56:23 +0200 Subject: [openib-general] [PATCH fixed] was Re: [PATCH] ipoib_multicast/ipoib_mcast_send race In-Reply-To: References: Message-ID: <20051212065623.GD24168@mellanox.co.il> Quoting r. Krishna Kumar2 : > Isn't all that managed by clearing/testing the bit ? Because holding the > lock doesn't solve > it. > To give an example : > > stop_thread() > { > lock(); > clear(); > unlock(); > ... > wait_for_completion(mcast); > } > > mcast_send() > { > lock(); > test(); > results_in_creation_of_entries_done_etc();; > unlock(); > } > > In this case, if mcast_send() gets the lock first and proceeds while the > stop_thread is spinning > on the lock, the entries are created and then the stop_thread() clears > the bit and at this point > in time, no more entries can be ever created. Now if the lock were > removed, the behavior > is identical - the mcast_send() would test the bit, and get the lock() > while the stop_thread() > clears the bit (without a lock) and waits for completion, while *no > more* mcast_sends() would > ever continue beyond this time. Now mcast_send can call init_completion *after* stop_thread does wait for completion. It could also call list_add while mcast_stop_thread walks the list. Thats what I am trying to prevent. -- MST 
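
Pulling the two positions together, here is the locking rule being argued for, condensed from the patch quoted earlier in this thread. A simplified sketch, with error handling and unrelated work elided:

ipoib_mcast_stop_thread()
{
	spin_lock_irq(&priv->lock);
	clear_bit(IPOIB_MCAST_STARTED, &priv->flags);
	spin_unlock_irq(&priv->lock);
	/* Taking the lock here flushes concurrent senders: any
	 * ipoib_mcast_send() that saw the bit set has already done its
	 * list_add()/init_completion() and dropped the lock; any later
	 * sender will see the bit clear and bail out.  So "done" cannot
	 * be re-initialized once the stop path starts waiting. */
	...
	wait_for_completion(&mcast->done);
}

ipoib_mcast_send()
{
	spin_lock(&priv->lock);
	if (!test_bit(IPOIB_MCAST_STARTED, &priv->flags) || !priv->broadcast) {
		dev_kfree_skb_any(skb);
		goto unlock;
	}
	/* query/group creation and list manipulation happen entirely
	 * under the lock, so they cannot straddle the clear_bit() above */
	...
unlock:
	spin_unlock(&priv->lock);
}
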
From yael at mellanox.co.il Sun Dec 11 23:52:49 2005 From: yael at mellanox.co.il (Yael Kalka) Date: 12 Dec 2005 09:52:49 +0200 Subject: [openib-general] [PATCH] Opensm - fix bug in osm_sa_portinfo_record Message-ID: <5z7jaahhv2.fsf@mtl066.yok.mtl.com> Hi Hal, During some tests here, we noticed that if the SA is queried with IB_PIR_COMPMASK_BASELID, and base_lid = zero - the SA will return all the ports in the result. The following patch fixes this. Thanks, Yael Signed-off-by: Yael Kalka Index: opensm/osm_sa_portinfo_record.c =================================================================== --- opensm/osm_sa_portinfo_record.c (revision 4371) +++ opensm/osm_sa_portinfo_record.c (working copy) @@ -266,6 +266,13 @@ __osm_sa_pir_check_physp( &p_physp->port_info, OSM_LOG_DEBUG ); + /* We have to re-check the base_lid, since if the given + base_lid in p_pi is zero - we are comparing on all ports. */ + if( comp_mask & IB_PIR_COMPMASK_BASELID ) + { + if( p_comp_pi->base_lid != p_pi->base_lid ) + goto Exit; + } if( comp_mask & IB_PIR_COMPMASK_MKEY ) { if( p_comp_pi->m_key != p_pi->m_key ) @@ -586,6 +593,9 @@ osm_pir_rcv_process( goto Exit; } + if ( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) + osm_dump_portinfo_record( p_rcv->p_log, p_rcvd_rec, OSM_LOG_DEBUG ); + p_tbl = &p_rcv->p_subn->port_lid_tbl; p_pi = &p_rcvd_rec->port_info; From krkumar2 at in.ibm.com Mon Dec 12 02:52:13 2005 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Mon, 12 Dec 2005 16:22:13 +0530 Subject: [openib-general] [PATCH fixed] was Re: [PATCH] ipoib_multicast/ipoib_mcast_send race In-Reply-To: <20051212065623.GD24168@mellanox.co.il> Message-ID: Hi Michael, I see what you are doing with that lock :-) But isn't the lock a hack ? Eg, I could instead do this (If I can make sure the redundant lock/unlock is not optimized out) and it would still work : stop_thread() { clear_bit(); lock(); /* empty lock/unlock to synchronize with the mcast_send() */ unlock(); /* make the other routine FINISH before we start other activity */ ... ... } mcast_send() { lock(); if (test_bit) ... ... 
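/* (in the real code, everything elided by the "..." above -- creating
   the group, list_add(), init_completion() -- runs while the lock is
   held, which is what would let the empty lock/unlock in stop_thread()
   act as a barrier) */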
unlock(); } So basically, the lock is not required for clearing (and absolutely not required for setting) the bit, but a lock is required before we start waiting, to enable us to synchronize with any sends. The lock is being used as a signalling mechanism between two processes (in this case, lock/unlock is a mechanism for the mcast_send() to finish if running). Thanks, - KK "Michael S. Tsirkin" wrote on 12/12/2005 12:26:23 PM: > Quoting r. Krishna Kumar2 : > > Isn't all that managed by clearing/testing the bit ? Because holding the > > lock doesn't solve > > it. > > To give an example : > > > > stop_thread() > > { > > lock(); > > clear(); > > unlock(); > > ... > > wait_for_completion(mcast); > > } > > > > mcast_send() > > { > > lock(); > > test(); > > results_in_creation_of_entries_done_etc();; > > unlock(); > > } > > > > In this case, if mcast_send() gets the lock first and proceeds while the > > stop_thread is spinning > > on the lock, the entries are created and then the stop_thread() clears > > the bit and at this point > > in time, no more entries can be ever created. Now if the lock were > > removed, the behavior > > is identical - the mcast_send() would test the bit, and get the lock() > > while the stop_thread() > > clears the bit (without a lock) and waits for completion, while *no > > more* mcast_sends() would > > ever continue beyond this time. > > Now mcast_send can call init_completion *after* stop_thread does wait for > completion. > It could also call list_add while mcast_stop_thread walks the list. > > Thats what I am trying to prevent. > > > -- > MST -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at mellanox.co.il Mon Dec 12 03:03:50 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 12 Dec 2005 13:03:50 +0200 Subject: [openib-general] [PATCH fixed] was Re: [PATCH] ipoib_multicast/ipoib_mcast_send race In-Reply-To: References: Message-ID: <20051212110350.GY14936@mellanox.co.il> Quoting Krishna Kumar2 : > stop_thread() > { > clear_bit(); > lock(); /* empty lock/unlock to synchronize with the mcast_send() */ > unlock(); /* make the other routine FINISH before we start other activity */ > ... > ... > } > mcast_send() > { > lock(); > if (test_bit) > ... > ... > unlock(); > } I think this will work, too. But I have easier time reasoning about locks than barriers and atomic operations. "bit is protected by priv->lock" is a simple rule, and we are not on data path here. The fact that the race went unnoticed for a while validates this approach in my eyes. -- MST From krkumar2 at in.ibm.com Mon Dec 12 03:30:04 2005 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Mon, 12 Dec 2005 17:00:04 +0530 Subject: [openib-general] [PATCH fixed] was Re: [PATCH] ipoib_multicast/ipoib_mcast_send race In-Reply-To: <20051212110350.GY14936@mellanox.co.il> Message-ID: Hi Michael, (wow, don't you sleep!) > I think this will work, too. But I have easier time reasoning about locks than > barriers and atomic operations. "bit is protected by priv->lock" is a simple Correct. Also, some optimization in mcast_send() could be done by moving the label "out:" to just before the spin_unlock() (and change "mcast = NULL" to "goto out"). - KK "Michael S. Tsirkin" wrote on 12/12/2005 04:33:50 PM: > Quoting Krishna Kumar2 : > > stop_thread() > > { > > clear_bit(); > > lock(); /* empty lock/unlock to synchronize with the mcast_send() */ > > unlock(); /* make the other routine FINISH before we start other activity */ > > ... > > ... 
> > } > > mcast_send() > > { > > lock(); > > if (test_bit) > > ... > > ... > > unlock(); > > } > I think this will work, too. But I have easier time reasoning about locks than > barriers and atomic operations. "bit is protected by priv->lock" is a simple > rule, and we are not on data path here. > > The fact that the race went unnoticed for a while validates this approach > in my eyes. > > -- > MST -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Mon Dec 12 06:07:20 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Dec 2005 09:07:20 -0500 Subject: [openib-general] [PATCH] OPENSM identify failure cases uniquely In-Reply-To: <20051211044231.GG9348@esmail.cup.hp.com> References: <20051211044231.GG9348@esmail.cup.hp.com> Message-ID: <1134396439.4485.28325.camel@hal.voltaire.com> Hi Grant, On Sat, 2005-12-10 at 23:42, Grant Grundler wrote: > Hi, > When tracking down the opensm "can't open port" failure described > in previous email, I added log output for each of the failure cases > in osm_vendor_open_port(). Thanks. Applied. > The "ERR" numbers need to be compared to some "master list" that > I don't know about and replaced. > I just picked sequential numbers not used in that routine. I modified the error numbers to be unique. -- Hal From halr at voltaire.com Mon Dec 12 06:24:31 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Dec 2005 09:24:31 -0500 Subject: [openib-general] Re: [PATCH] Opensm - fix bug in osm_sa_portinfo_record In-Reply-To: <5z7jaahhv2.fsf@mtl066.yok.mtl.com> References: <5z7jaahhv2.fsf@mtl066.yok.mtl.com> Message-ID: <1134397469.4485.28408.camel@hal.voltaire.com> On Mon, 2005-12-12 at 02:52, Yael Kalka wrote: > Hi Hal, > > During some tests here, we noticed that if the SA is queried with > IB_PIR_COMPMASK_BASELID, and base_lid = zero - the SA will return all > the ports in the result. > The following patch fixes this. Thanks. Applied. From ianjiang.ict at gmail.com Mon Dec 12 06:37:07 2005 From: ianjiang.ict at gmail.com (Ian Jiang) Date: Mon, 12 Dec 2005 22:37:07 +0800 Subject: [openib-general] [kDAPL]Need the array of physical pages be contiguous when using dat_lmr_kcreate Message-ID: <7b2fa1820512120637m2869e1fdjf4c962decc4de9ae@mail.gmail.com> I created a LMR from three buffers which were allocated respectively with kmalloc of size 64kB. The registration went well, but the subsequent rdma read dto completed with a DAT_DTO_ERR_LOCAL_PROTECTION error. Was that because the physical addresses of the three buffers were not contiguous? Any suggestion is appreciated! -- Ian Jiang ianjiang.ict at gmail.com Laboratory of Spatial Information Technology Division of System Architecture Institute of Computing Technology Chinese Academy of Sciences -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at mellanox.co.il Mon Dec 12 06:56:29 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 12 Dec 2005 16:56:29 +0200 Subject: [openib-general] [PATCH] sdp: replace ip_dev_find with dev_base scan (was Re: ip_dev_find resolution?) In-Reply-To: <20051207082220.GK21035@mellanox.co.il> References: <20051207082220.GK21035@mellanox.co.il> Message-ID: <20051212145629.GF14936@mellanox.co.il> Quoting Michael S. Tsirkin : > Subject: Re: ip_dev_find resolution? > > Quoting r. Hal Rosenstock : > > Subject: Re: ip_dev_find resolution? > > > > On Tue, 2005-12-06 at 12:56, Michael S. 
Tsirkin wrote: > > > Actually, I wander whether instead of ip_dev_find we can just > > > > > > read_lock(&dev_base_lock); > > > for (dev = dev_base; dev; dev = dev->next) { > > > > > > and check the ip address? > > > > working off the ip_ptr and ip6_ptr ? > > Yes. > > > > If this works, this has the advantage of supporting IPv6 as well. > > > > This was introduced at one point and we subsequently changed to > > ip_dev_find. I forget exactly why this was but can dig it out if no > one > > recalls. > > Please do. Any updates? Hal? I've coded the following up since I grew tired of patching my kernels to run sdp. Seems to work fine for me, can someone please speak up on why this isnt a good idea for CMA, as well? Ultimately, IMO this also has a better chance to be generalizable to IPv6. --- Replace ip_dev_find (which isnt exported in 2.6.14) with full device list lookup. Signed-off-by: Michael S. Tsirkin Index: linux-2.6.14.3/drivers/infiniband/ulp/sdp/sdp_link.c =================================================================== --- linux-2.6.14.3.orig/drivers/infiniband/ulp/sdp/sdp_link.c +++ linux-2.6.14.3/drivers/infiniband/ulp/sdp/sdp_link.c @@ -346,6 +346,41 @@ static int sdp_link_path_rec_get(struct return 0; } +static int tryaddrmatch(struct net_device *dev, u32 s_addr, u32 d_addr) +{ + struct in_ifaddr **ifap; + struct in_ifaddr *ifa; + struct in_device *in_dev; + int rc = -ENETUNREACH; + __be32 addr; + + if (dev->type != ARPHRD_INFINIBAND) + return rc; + + in_dev = in_dev_get(dev); + if (!in_dev) + return rc; + + addr = (ZERONET(s_addr) || LOOPBACK(s_addr)) ? d_addr : s_addr; + + /* Hack to enable using SDP on addresses such as 127.0.0.1 */ + if (ZERONET(addr) || LOOPBACK(addr)) { + rc = (dev->flags & IFF_UP) ? 0 : -ENETUNREACH; + goto done; + } + + for (ifap = &in_dev->ifa_list; (ifa = *ifap); ifap = &ifa->ifa_next) { + if (s_addr == ifa->ifa_address) { + rc = 0; + break; /* found */ + } + } + +done: + in_dev_put(in_dev); + return rc; +} + /* * do_link_path_lookup - resolve an ip address to a path record */ @@ -406,17 +441,9 @@ static void do_link_path_lookup(struct s rt->u.dst.neighbour->dev->name, rt->rt_src, rt->rt_dst, rt->rt_gateway, rt->u.dst.neighbour->nud_state); - /* - * device needs to be a valid IB device. Check for loopback. - * In case of loopback find a valid IB device on which to - * direct the loopback traffic. - */ - if (rt->u.dst.neighbour->dev->flags & IFF_LOOPBACK) - dev = ip_dev_find(rt->rt_src); - else { - dev = rt->u.dst.neighbour->dev; - dev_hold(dev); - } + + dev = rt->u.dst.neighbour->dev; + dev_hold(dev); /* * check for IB device or loopback, the later requires extra @@ -433,13 +460,11 @@ static void do_link_path_lookup(struct s if (dev->flags & IFF_LOOPBACK) { dev_put(dev); read_lock(&dev_base_lock); - for (dev = dev_base; dev; dev = dev->next) { - if (dev->type == ARPHRD_INFINIBAND && - (dev->flags & IFF_UP)) { + for (dev = dev_base; dev; dev = dev->next) + if (!tryaddrmatch(dev, rt->rt_src, rt->rt_dst)) { dev_hold(dev); break; } - } read_unlock(&dev_base_lock); } -- MST From halr at voltaire.com Mon Dec 12 07:08:09 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Dec 2005 10:08:09 -0500 Subject: [openib-general] Re: [PATCH] sdp: replace ip_dev_find with dev_base scan (was Re: ip_dev_find resolution?) In-Reply-To: <20051212145629.GF14936@mellanox.co.il> References: <20051207082220.GK21035@mellanox.co.il> <20051212145629.GF14936@mellanox.co.il> Message-ID: <1134400089.4485.28653.camel@hal.voltaire.com> On Mon, 2005-12-12 at 09:56, Michael S. 
Tsirkin wrote: > Quoting Michael S. Tsirkin : > > Subject: Re: ip_dev_find resolution? > > > > Quoting r. Hal Rosenstock : > > > Subject: Re: ip_dev_find resolution? > > > > > > On Tue, 2005-12-06 at 12:56, Michael S. Tsirkin wrote: > > > > Actually, I wander whether instead of ip_dev_find we can just > > > > > > > > read_lock(&dev_base_lock); > > > > for (dev = dev_base; dev; dev = dev->next) { > > > > > > > > and check the ip address? > > > > > > working off the ip_ptr and ip6_ptr ? > > > > Yes. > > > > > > If this works, this has the advantage of supporting IPv6 as well. > > > > > > This was introduced at one point and we subsequently changed to > > > ip_dev_find. I forget exactly why this was but can dig it out if no > > one > > > recalls. > > > > Please do. > > Any updates? Hal? > > I've coded the following up since I grew tired of patching my kernels > to run sdp. Seems to work fine for me, can someone please speak up > on why this isnt a good idea for CMA, as well? Sorry for the slow response on this. I meant to dig this out over the weekend. I believe the reason it was changed from searching the netdevices list to ip_dev_find originally was that this really is a route lookup on the dest addr to determine what the local outgoing interface is and that can't be done directly from the netdevices list if the destination is not (IP) subnet local (e.g. gateway cases). > Ultimately, IMO this also has a better chance to be generalizable to IPv6. Yes, we need a different lookup for IPv6 with the current (ip_dev_find) approach. -- Hal From mst at mellanox.co.il Mon Dec 12 07:28:43 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 12 Dec 2005 17:28:43 +0200 Subject: [openib-general] Re: [PATCH] sdp: replace ip_dev_find with dev_base scan (was Re: ip_dev_find resolution?) In-Reply-To: <1134400089.4485.28653.camel@hal.voltaire.com> References: <1134400089.4485.28653.camel@hal.voltaire.com> Message-ID: <20051212152843.GJ14936@mellanox.co.il> Quoting Hal Rosenstock : > > I've coded the following up since I grew tired of patching my kernels > > to run sdp. Seems to work fine for me, can someone please speak up > > on why this isnt a good idea for CMA, as well? > > Sorry for the slow response on this. I meant to dig this out over the > weekend. > > I believe the reason it was changed from searching the netdevices list > to ip_dev_find originally was that this really is a route lookup on the > dest addr to determine what the local outgoing interface is and that > can't be done directly from the netdevices list if the destination is > not (IP) subnet local (e.g. gateway cases). What you are saying is, the original approach didnt do ip route lookup at all, and thats why it was changed? But now we only do the list walk if the ip route resolution returns a loopback device, so we are ok, right? -- MST From mst at mellanox.co.il Mon Dec 12 08:09:20 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 12 Dec 2005 18:09:20 +0200 Subject: [openib-general] [PATCH] libibverbs: report board id in ibv_devinfo Message-ID: <20051212160920.GM14936@mellanox.co.il> Report board_id from ibv_devinfo, if present. Signed-off-by: Dotan Barak Signed-off-by: Michael S. 
Tsirkin Index: openib/src/userspace/libibverbs/examples/devinfo.c =================================================================== --- openib/src/userspace/libibverbs/examples/devinfo.c (revision 4399) +++ openib/src/userspace/libibverbs/examples/devinfo.c (working copy) @@ -174,6 +174,7 @@ static int print_hca_cap(struct ibv_devi struct ibv_context *ctx; struct ibv_device_attr device_attr; struct ibv_port_attr port_attr; + struct sysfs_attribute *attr; int rc = 0; uint8_t port; char buf[256]; @@ -198,6 +199,12 @@ static int print_hca_cap(struct ibv_devi printf("\tvendor_id:\t\t\t0x%04x\n", device_attr.vendor_id); printf("\tvendor_part_id:\t\t\t%d\n", device_attr.vendor_part_id); printf("\thw_ver:\t\t\t\t0x%X\n", device_attr.hw_ver); + attr = sysfs_get_classdev_attr(ib_dev->ibdev, "board_id"); + if (attr) { + printf("\tboard_id:\t\t\t%s", attr->value); + sysfs_close_attribute(attr); + } + printf("\tphys_port_cnt:\t\t\t%d\n", device_attr.phys_port_cnt); if (verbose) { -- MST From robert.j.woodruff at intel.com Mon Dec 12 08:30:38 2005 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 12 Dec 2005 08:30:38 -0800 Subject: [openib-general] RE: Next workshop dates? Feedback from others please!!! Message-ID: <1AC79F16F5C5284499BB9591B33D6F00065627B7@orsmsx408> Sounds like the 5th is the best date to accommodate most people. woody -----Original Message----- From: Bill Boas [mailto:bboas at llnl.gov] Sent: Sunday, December 11, 2005 4:18 PM To: Woodruff, Robert J; openib-general at openib.org; openib-promoters at openib.org Subject: RE: Next workshop dates? Feedback from others please!!! Bob, Voltaire says they cannot not make Jan 29 'cos they have an all hands 3 day meeting in that time slot. In the Feb 5-8 timeframe we worked in the super bowl last year OK I think??? By the time we get to late Feb the hotel rates in Sonoma go up so I'm thinking we should go for Feb 5 again, but feedback on this would be very helpful as we need to make a decision in the next few days. Feedback from others please. Bill. At 03:10 PM 12/9/2005, you wrote: >Jan29-Feb1 works for me. >Feb 5 is super bowl Sunday, might want to stay away from that one >Feb 12 - Sean is on vacation. >Feb 20 - Is a holiday in the US. >How about Feb 27- March 2 as an alternative if people cannot make Jan29 >? > >woody > > >-----Original Message----- >From: openib-general-bounces at openib.org >[mailto:openib-general-bounces at openib.org] On Behalf Of Bill Boas >Sent: Thursday, December 08, 2005 10:53 PM >To: openib-promoters at openib.org; openib-general at openib.org >Subject: [openib-general] Next workshop dates? Please respond with >yourpreferences > >All those wishing to attend the next workshop in Sonoma at the Lodge >(same as last year) in the late January-early February please respond >with your preferred dates. > >We currently have Jan29-Feb1 held for us but some people are telling >us that is bad for them. > >The next 2 Sun-Wed slots (Feb 5-8 or 12-15) maybe available but we >need guidance from those planning to attend as to their preferred dates. > >Bill. 
> >Bill Boas bboas at llnl.gov >ICCD LLNL, B-453, R-2018 Wk: 925-422-4110 >7000 East Ave, L-555 Cell: 925-337-2224 >Livermore, CA 94551 Pgr: 877-203-2248 > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit >http://openib.org/mailman/listinfo/openib-general Bill Boas bboas at llnl.gov ICCD LLNL, B-453, R-2018 Wk: 925-422-4110 7000 East Ave, L-555 Cell: 925-337-2224 Livermore, CA 94551 Pgr: 877-203-2248 From mst at mellanox.co.il Mon Dec 12 08:58:59 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 12 Dec 2005 18:58:59 +0200 Subject: [openib-general] [PATCH] mthca: correct IB_QP_ACCESS_FLAGS handling Message-ID: <20051212165859.GO14936@mellanox.co.il> This patch corrects some corner cases in managing the RAE/RRE bits in the mthca qp context. These bits need to be zero if the user requests max_dest_rd_atomic of zero. The bits need to be restored to the value implied by the qp access flags attribute in a previous (or the current) modify-qp command if the dest_rd_atomic variable is changed to non-zero. In the current implementation, the following scenario will not work: RESET-to-INIT set QP access flags to all disabled (zeroes) INIT-to-RTR set max_dest_rd_atomic=10, AND set qp_access_flags = IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_ATOMIC The current code will incorrectly take the access-flags value set in the RESET-to-INIT transition. --- Simplify, and correct, IB_QP_ACCESS_FLAGS handling: it is always safe to set qp access flags in hardware command if either of IB_QP_MAX_DEST_RD_ATOMIC or IB_QP_ACCESS_FLAGS is set, so lets just set it to the correct value, always. Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. Tsirkin Index: linux-kernel/drivers/infiniband/hw/mthca/mthca_qp.c =================================================================== --- linux-kernel.orig/drivers/infiniband/hw/mthca/mthca_qp.c +++ linux-kernel/drivers/infiniband/hw/mthca/mthca_qp.c @@ -520,6 +520,36 @@ static void init_port(struct mthca_dev * mthca_warn(dev, "INIT_IB returned status %02x.\n", status); } +static u32 get_hw_access_flags(struct mthca_qp *qp, struct ib_qp_attr *attr, + int attr_mask) +{ + u8 dest_rd_atomic; + u32 access_flags; + u32 hw_access_flags; + + if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC) + dest_rd_atomic = attr->max_dest_rd_atomic; + else + dest_rd_atomic = qp->resp_depth; + + if (attr_mask & IB_QP_ACCESS_FLAGS) + access_flags = (u32)attr->qp_access_flags; + else + access_flags = qp->atomic_rd_en; + + if (!dest_rd_atomic) + access_flags &= IB_ACCESS_REMOTE_WRITE; + + hw_access_flags = access_flags & IB_ACCESS_REMOTE_READ ? + MTHCA_QP_BIT_RRE : 0; + hw_access_flags |= access_flags & IB_ACCESS_REMOTE_ATOMIC ? + MTHCA_QP_BIT_RAE : 0; + hw_access_flags |= access_flags & IB_ACCESS_REMOTE_WRITE ? + MTHCA_QP_BIT_RWE : 0; + + return cpu_to_be32(hw_access_flags); +} + int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) { struct mthca_dev *dev = to_mdev(ibqp->device); @@ -741,57 +776,7 @@ int mthca_modify_qp(struct ib_qp *ibqp, qp_context->snd_db_index = cpu_to_be32(qp->sq.db_index); } - if (attr_mask & IB_QP_ACCESS_FLAGS) { - qp_context->params2 |= - cpu_to_be32(attr->qp_access_flags & IB_ACCESS_REMOTE_WRITE ? - MTHCA_QP_BIT_RWE : 0); - - /* - * Only enable RDMA reads and atomics if we have - * responder resources set to a non-zero value. 
- */ - if (qp->resp_depth) { - qp_context->params2 |= - cpu_to_be32(attr->qp_access_flags & IB_ACCESS_REMOTE_READ ? - MTHCA_QP_BIT_RRE : 0); - qp_context->params2 |= - cpu_to_be32(attr->qp_access_flags & IB_ACCESS_REMOTE_ATOMIC ? - MTHCA_QP_BIT_RAE : 0); - } - - qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RWE | - MTHCA_QP_OPTPAR_RRE | - MTHCA_QP_OPTPAR_RAE); - } - if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC) { - if (qp->resp_depth && !attr->max_dest_rd_atomic) { - /* - * Lowering our responder resources to zero. - * Turn off reads RDMA and atomics as responder. - * (RRE/RAE in params2 already zero) - */ - qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RRE | - MTHCA_QP_OPTPAR_RAE); - } - - if (!qp->resp_depth && attr->max_dest_rd_atomic) { - /* - * Increasing our responder resources from - * zero. Turn on RDMA reads and atomics as - * appropriate. - */ - qp_context->params2 |= - cpu_to_be32(qp->atomic_rd_en & IB_ACCESS_REMOTE_READ ? - MTHCA_QP_BIT_RRE : 0); - qp_context->params2 |= - cpu_to_be32(qp->atomic_rd_en & IB_ACCESS_REMOTE_ATOMIC ? - MTHCA_QP_BIT_RAE : 0); - - qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RRE | - MTHCA_QP_OPTPAR_RAE); - } - if (attr->max_dest_rd_atomic) qp_context->params2 |= cpu_to_be32(fls(attr->max_dest_rd_atomic - 1) << 21); @@ -799,6 +784,13 @@ int mthca_modify_qp, qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RRA_MAX); } + if (attr_mask & (IB_QP_ACCESS_FLAGS | IB_QP_MAX_DEST_RD_ATOMIC)) { + qp_context->params2 |= get_hw_access_flags(qp, attr, attr_mask); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RWE | + MTHCA_QP_OPTPAR_RRE | + MTHCA_QP_OPTPAR_RAE); + } + qp_context->params2 |= cpu_to_be32(MTHCA_QP_BIT_RSC); if (ibqp->srq) -- MST 
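
To make the corner case concrete, the problematic sequence from the patch description reads roughly as follows in verbs terms. A sketch only: it assumes an RC QP, and the other attributes the INIT and RTR transitions require (pkey index, port, path, PSN, etc.) are omitted:

struct ib_qp_attr attr;

memset(&attr, 0, sizeof attr);
attr.qp_state        = IB_QPS_INIT;
attr.qp_access_flags = 0;		/* all remote access disabled */
ib_modify_qp(qp, &attr, IB_QP_STATE | IB_QP_ACCESS_FLAGS /* | ... */);

memset(&attr, 0, sizeof attr);
attr.qp_state           = IB_QPS_RTR;
attr.max_dest_rd_atomic = 10;
attr.qp_access_flags    = IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_ATOMIC;
ib_modify_qp(qp, &attr, IB_QP_STATE | IB_QP_MAX_DEST_RD_ATOMIC |
	     IB_QP_ACCESS_FLAGS /* | ... */);

/* Before the patch, mthca computed RRE/RAE for the second call from
 * qp->atomic_rd_en, i.e. the flags stored by the first transition, so
 * reads and atomics stayed disabled; get_hw_access_flags() recomputes
 * them from whichever of the two attributes was supplied most recently. */
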
From iod00d at hp.com Mon Dec 12 09:38:24 2005 From: iod00d at hp.com (Grant Grundler) Date: Mon, 12 Dec 2005 09:38:24 -0800 Subject: [openib-general] [PATCH applied] sdp: fix kunmap_atomic usage In-Reply-To: <20051211180543.GR14936@mellanox.co.il> References: <20051211175341.GA12176@esmail.cup.hp.com> <20051211180543.GR14936@mellanox.co.il> Message-ID: <20051212173824.GA15771@esmail.cup.hp.com> On Sun, Dec 11, 2005 at 08:05:43PM +0200, Michael S. Tsirkin wrote: > This might be benign, need to check. > Did the test run to completion with these messages? I aborted the test with ^C and I tried to restart it. The two hosts could not ping each other via the IPoIB link. I'll poke at this more later today. grant From mst at mellanox.co.il Mon Dec 12 09:42:39 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 12 Dec 2005 19:42:39 +0200 Subject: [openib-general] [PATCH applied] sdp: fix kunmap_atomic usage In-Reply-To: <20051212173824.GA15771@esmail.cup.hp.com> References: <20051212173824.GA15771@esmail.cup.hp.com> Message-ID: <20051212174239.GQ14936@mellanox.co.il> Quoting r. Grant Grundler : > Subject: Re: [openib-general] [PATCH applied] sdp: fix kunmap_atomic usage > > On Sun, Dec 11, 2005 at 08:05:43PM +0200, Michael S. Tsirkin wrote: > > This might be benign, need to check. > > Did the test run to completion with these messages? > > I aborted the test with ^C and I tried to restart it. > The two hosts could not ping each other via the IPoIB link. > I'll poke at this more later today. Unless SDP triggered an oops, this doesnt sound like an SDP problem ... -- MST From rdreier at cisco.com Mon Dec 12 09:39:30 2005 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 12 Dec 2005 09:39:30 -0800 Subject: [openib-general] Next workshop dates? Ideas for agenda??? In-Reply-To: <6.2.3.4.2.20051211160126.03bc3878@mail-lc.llnl.gov> (Bill Boas's message of "Sun, 11 Dec 2005 16:10:54 -0800") References: <6AB138A2AB8C8E4A98B9C0C3D52670E366C44A@mtlexch01.mtl.com> <6.2.3.4.2.20051211160126.03bc3878@mail-lc.llnl.gov> Message-ID: Bill> I think, subject to others input, it'll be focused on Bill> wrapping up rel 1.0 of OpenIB, discussing what the Bill> developers are going to focus on next and validating the Bill> strategy for RDMA over Ethernet integration at the verbs Bill> level to lay the foundation for one, consistent RDMA Bill> structure in Linux, if possible. I'm not sure I see the point in dragging everyone together in early February. With the holidays coming, realistically we only have maybe 5 weeks to prepare a conference agenda, and I don't see that as being enough time to set up a productive meeting. In particular: * wrapping up rel 1.0 -- the release process for a "1.0" release has not even started. About all we could hope to accomplish would be to pick a release manager and tell that person to go start driving a release, and I don't see that as a good use of face-to-face time. It would be much better to pick someone to drive the release and then give the release manager time to start putting the release together before getting together, so that we have some idea of what the real issues that need to be hashed out in person are. * iWARP integration -- again, not enough discussion has taken place in advance. Until the community has a chance to really study the proposed changes and figure out what the real difficult issues that need to be sorted out in person are, again it's a waste of time to meet in person. * discuss developers next steps -- perhaps I'm pessimistic but I think we'll just get the same talks we've already seen twice before at Sonoma and IDF. Sonoma is a short trip for me but given the number of people that will have to come from the East coast and Israel, I think we should think hard about whether this conference is the best use of our time. - R. From mshefty at ichips.intel.com Mon Dec 12 09:54:26 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 12 Dec 2005 09:54:26 -0800 Subject: [openib-general] [PATCH] [CMA] support for SDP + standard protocol In-Reply-To: References: Message-ID: <439DB952.3040506@ichips.intel.com> Dan Bar Dov wrote: > I would have preferred not to add upper layer aware code into CMA, > but I guess I'm late for that discussion. If you mean add SDP code to the CMA, without it, SDP cannot use the CMA and must duplicate most of the same functionality itself. > Regarding the patch below, it makes sense. Are you going to apply it to all > affected modules? I will apply the patch to all affected modules. - Sean From bboas at llnl.gov Mon Dec 12 09:55:03 2005 From: bboas at llnl.gov (Bill Boas) Date: Mon, 12 Dec 2005 09:55:03 -0800 Subject: [Openib-promoters] Re: [openib-general] Next workshop dates? Ideas for agenda??? In-Reply-To: References: <6AB138A2AB8C8E4A98B9C0C3D52670E366C44A@mtlexch01.mtl.com> <6.2.3.4.2.20051211160126.03bc3878@mail-lc.llnl.gov> Message-ID: <6.2.3.4.2.20051212094618.01ff5c70@mail-lc.llnl.gov> Roland, These are all excellent perspectives, I hope others will respond with their viewpoints. 
Certainly repeating what we have heard already is not a good use of anyone's time or money but I'm under the impression that we will have made some progress toward what we want to work on next as a result of PathForward Phase 2, input from Tom Tucker and others on OpenIB iWARP integration and the HSIR meeting in NYC tomorrow. With respect to "release" of OpenIB rel 1.0, did Doug Ledford effectively do that a week or two ago? I think those of us (including me) who originally thought OpenIB was actually going to be an organization that released and supported code (like RedHat, say) had got it wrong. Now I believe that when a Linux distribution, an IB company or a Tier One OEM decides that a version of the code is one that they will support, then that is a "release". OpenIB may be best utilized to try to achieve some consistency in timeframe and content amongst those who wish to "release and support" the code??? Bill. At 09:39 AM 12/12/2005, Roland Dreier wrote: > Bill> I think, subject to others input, it'll be focused on > Bill> wrapping up rel 1.0 of OpenIB, discussing what the > Bill> developers are going to focus on next and validating the > Bill> strategy for RDMA over Ethernet integration at the verbs > Bill> level to lay the foundation for one, consistent RDMA > Bill> structure in Linux, if possible. > >I'm not sure I see the point in dragging everyone together in early >February. With the holidays coming, realistically we only have maybe >5 weeks to prepare a conference agenda, and I don't see that as being >enough time to set up a productive meeting. > >In particular: > > * wrapping up rel 1.0 -- the release process for a "1.0" release has > not even started. About all we could hope to accomplish would be > to pick a release manager and tell that person to go start driving > a release, and I don't see that as a good use of face-to-face > time. It would be much better to pick someone to drive the release > and then give the release manager time to start putting the release > together before getting together, so that we have some idea of what > the real issues that need to be hashed out in person are. > > * iWARP integration -- again, not enough discussion has taken place > in advance. Until the community has a chance to really study the > proposed changes and figure out what the real difficult issues that > need to be sorted out in person are, again it's a waste of time to > meet in person. > > * discuss developers next steps -- perhaps I'm pessimistic but I > think we'll just get the same talks we've already seen twice before > at Sonoma and IDF. > >Sonoma is a short trip for me but given the number of people that will >have to come from the East coast and Israel, I think we should think >hard about whether this conference is the best use of our time. > > - R. >_______________________________________________ >openib-promoters mailing list >openib-promoters at openib.org >http://openib.org/mailman/listinfo/openib-promoters Bill Boas bboas at llnl.gov ICCD LLNL, B-453, R-2018 Wk: 925-422-4110 7000 East Ave, L-555 Cell: 925-337-2224 Livermore, CA 94551 Pgr: 877-203-2248 From robert.j.woodruff at intel.com Mon Dec 12 11:37:26 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Mon, 12 Dec 2005 11:37:26 -0800 Subject: [Openib-promoters] Re: [openib-general] Next workshopdates? Ideas for agenda??? In-Reply-To: <6.2.3.4.2.20051212094618.01ff5c70@mail-lc.llnl.gov> Message-ID: Roland wrote, >I'm not sure I see the point in dragging everyone together in early >February. 
With the holidays coming, realistically we only have maybe >5 weeks to prepare a conference agenda, and I don't see that as being >enough time to set up a productive meeting. Another possibility would be to delay the workshop till early March and have it the day before IDF, as we did last fall. Thoughts ? woody From halr at voltaire.com Mon Dec 12 12:41:50 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Dec 2005 15:41:50 -0500 Subject: [openib-general] Re: [PATCH] sdp: replace ip_dev_find with dev_base scan (was Re: ip_dev_find resolution?) In-Reply-To: <20051212152843.GJ14936@mellanox.co.il> References: <1134400089.4485.28653.camel@hal.voltaire.com> <20051212152843.GJ14936@mellanox.co.il> Message-ID: <1134420109.4485.31207.camel@hal.voltaire.com> On Mon, 2005-12-12 at 10:28, Michael S. Tsirkin wrote: > Quoting Hal Rosenstock : > > > I've coded the following up since I grew tired of patching my kernels > > > to run sdp. Seems to work fine for me, can someone please speak up > > > on why this isnt a good idea for CMA, as well? > > > > Sorry for the slow response on this. I meant to dig this out over the > > weekend. > > > > I believe the reason it was changed from searching the netdevices list > > to ip_dev_find originally was that this really is a route lookup on the > > dest addr to determine what the local outgoing interface is and that > > can't be done directly from the netdevices list if the destination is > > not (IP) subnet local (e.g. gateway cases). > > What you are saying is, the original approach didnt do ip route lookup at all, > and thats why it was changed? Yes, I believe so. > But now we only do the list walk if the ip route resolution returns > a loopback device, so we are ok, right? That's the way it seems to me (at least for SDP but not CMA (addr)). -- Hal From mst at mellanox.co.il Mon Dec 12 12:55:15 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 12 Dec 2005 22:55:15 +0200 Subject: [openib-general] Re: [PATCH] sdp: replace ip_dev_find with dev_base scan (was Re: ip_dev_find resolution?) In-Reply-To: <1134420109.4485.31207.camel@hal.voltaire.com> References: <1134420109.4485.31207.camel@hal.voltaire.com> Message-ID: <20051212205515.GC28391@mellanox.co.il> Quoting Hal Rosenstock : > > But now we only do the list walk if the ip route resolution returns > > a loopback device, so we are ok, right? > > That's the way it seems to me (at least for SDP but not CMA (addr)). OK, Sean, do you want to cook up a patch for CMA based on this code of mine? I could do it too but I dont have a way to test cma yet. -- MST From halr at voltaire.com Mon Dec 12 12:54:55 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Dec 2005 15:54:55 -0500 Subject: [openib-general] Re: [PATCH] sdp: replace ip_dev_find with dev_base scan (was Re: ip_dev_find resolution?) In-Reply-To: <20051212205515.GC28391@mellanox.co.il> References: <1134420109.4485.31207.camel@hal.voltaire.com> <20051212205515.GC28391@mellanox.co.il> Message-ID: <1134420895.4485.31272.camel@hal.voltaire.com> On Mon, 2005-12-12 at 15:55, Michael S. Tsirkin wrote: > Quoting Hal Rosenstock : > > > But now we only do the list walk if the ip route resolution returns > > > a loopback device, so we are ok, right? > > > > That's the way it seems to me (at least for SDP but not CMA (addr)). > > OK, Sean, do you want to cook up a patch for CMA based on this code of mine? > I could do it too but I dont have a way to test cma yet. It's used in 2 ways in addr.c from what I can see. 
One is for the local address, the other not. So I'm not sure the same thing can be done here. -- Hal From lindahl at pathscale.com Mon Dec 12 13:23:25 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Mon, 12 Dec 2005 13:23:25 -0800 Subject: [Openib-promoters] Re: [openib-general] Next workshop dates? Ideas for agenda??? In-Reply-To: <6.2.3.4.2.20051212094618.01ff5c70@mail-lc.llnl.gov> References: <6AB138A2AB8C8E4A98B9C0C3D52670E366C44A@mtlexch01.mtl.com> <6.2.3.4.2.20051211160126.03bc3878@mail-lc.llnl.gov> <6.2.3.4.2.20051212094618.01ff5c70@mail-lc.llnl.gov> Message-ID: <20051212212325.GA1990@greglaptop.internal.keyresearch.com> On Mon, Dec 12, 2005 at 09:55:03AM -0800, Bill Boas wrote: > Now I believe that when a Linux > distribution, an IB company or a Tier One OEM decides that is a > version of the code that they will support, then that is a "release". Why don't we imitate the Linux kernel process? OpenIB has to follow a sane process of innovation followed by stabilization and bug-fixing in order for the IB companies and Tier 1s to be able to make solid releases. -- greg From rdreier at cisco.com Mon Dec 12 14:01:33 2005 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 12 Dec 2005 14:01:33 -0800 Subject: [Openib-promoters] Re: [openib-general] Next workshop dates? Ideas for agenda??? In-Reply-To: <6.2.3.4.2.20051212094618.01ff5c70@mail-lc.llnl.gov> (Bill Boas's message of "Mon, 12 Dec 2005 09:55:03 -0800") References: <6AB138A2AB8C8E4A98B9C0C3D52670E366C44A@mtlexch01.mtl.com> <6.2.3.4.2.20051211160126.03bc3878@mail-lc.llnl.gov> <6.2.3.4.2.20051212094618.01ff5c70@mail-lc.llnl.gov> Message-ID: Bill> With respect to "release" of OpenIB rel 1.0, did Doug Bill> Ledford effectively do that a week or two ago? No, Doug put a snapshot of the OpenIB tree into a RHEL update, which is quite a different thing. Bill> I think those of us ( including me) who originally thought Bill> OpenIB was actually going be an organization that released Bill> and supported code (like RedHat, say) had got it wrong. Now Bill> I believe that when a Linux distribution, an IB company or a Bill> Tier One OEM decides that is a version of the code that they Bill> will support, then that is a "release". OpenIB may be best Bill> utilized to try to achieve some consistency in timeframe and Bill> content amongst those who wish to "release and support" the Bill> code??? I think the model that projects like the Linux kernel, Gnome, KDE, Mozilla, et al follow is a good one: the project releases well-defined packages with a known version number, which distributors can then base their packages on (adding value through QA, bug fixes, additional functionality, or whatever else they can think of). For example, I know that my Ubuntu desktop ships with Gnome version 2.10, so even though Ubuntu has done a fair bit of customization, I know basically what to expect. - R. From mshefty at ichips.intel.com Mon Dec 12 14:36:56 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 12 Dec 2005 14:36:56 -0800 Subject: [openib-general] Re: [PATCH] sdp: replace ip_dev_find with dev_base scan (was Re: ip_dev_find resolution?) In-Reply-To: <20051212205515.GC28391@mellanox.co.il> References: <1134420109.4485.31207.camel@hal.voltaire.com> <20051212205515.GC28391@mellanox.co.il> Message-ID: <439DFB88.9000006@ichips.intel.com> Michael S. Tsirkin wrote: >>>But now we only do the list walk if the ip route resolution returns >>>a loopback device, so we are ok, right? 
>> >>That's the way it seems to me (at least for SDP but not CMA (addr)). > > OK, Sean, do you want to cook up a patch for CMA based on this code of mine? > I could do it too but I dont have a way to test cma yet. I need to read back over this thread; I didn't follow it the first time. (At this point, I think that it makes more sense to convert SDP to use the CMA or ib_addr, rather than duplicating their functionality.) There are both kernel and userspace test programs (cmatose) for the CMA checked into the tree. ib_addr can use a different approach than ip_dev_find, but I think it makes sense to use existing kernel functionality wherever possible. If ip_dev_find cannot be modified to support IPv6 addresses, then how about adding an API that is similar, but takes a pointer to an address, along with an indication of the address family that's used? I can work on a patch for this. - Sean 
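
As a sketch of what Sean's suggestion might look like for the IPv4 case, modeled on the tryaddrmatch() loop from the patch earlier in this thread. ip_dev_find_family() is an invented name, not an existing kernel API, and the IPv6 branch is only stubbed out:

/* assumes <linux/netdevice.h> and <linux/inetdevice.h> */
static struct net_device *ip_dev_find_family(int family, const void *addr)
{
	struct net_device *dev;
	struct in_device *in_dev;
	struct in_ifaddr *ifa;
	__be32 ip;

	if (family != AF_INET)
		return NULL;	/* IPv6 would walk dev->ip6_ptr here */
	ip = *(const __be32 *)addr;

	read_lock(&dev_base_lock);
	for (dev = dev_base; dev; dev = dev->next) {
		in_dev = in_dev_get(dev);
		if (!in_dev)
			continue;
		for (ifa = in_dev->ifa_list; ifa; ifa = ifa->ifa_next) {
			if (ifa->ifa_address == ip) {
				dev_hold(dev);	/* caller does dev_put() */
				in_dev_put(in_dev);
				goto out;
			}
		}
		in_dev_put(in_dev);
	}
out:
	read_unlock(&dev_base_lock);
	return dev;	/* NULL if no interface owns the address */
}
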
+ + in_dev = in_dev_get(dev); + if (!in_dev) + return rc; And that supports the given address family. + + addr = (ZERONET(s_addr) || LOOPBACK(s_addr)) ? d_addr : s_addr; If the source address is for a loopback device, select by the destination address. + + /* Hack to enable using SDP on addresses such as 127.0.0.1 */ + if (ZERONET(addr) || LOOPBACK(addr)) { If the destination address is for loopback as well, select any device of the appropriate type that is up. + rc = (dev->flags & IFF_UP) ? 0 : -ENETUNREACH; + goto done; + } + + for (ifap = &in_dev->ifa_list; (ifa = *ifap); ifap = &ifa->ifa_next) { + if (addr == ifa->ifa_address) { Otherwise, look for a device with the appropriate IP address. + rc = 0; + break; /* found */ + } + } + +done: + in_dev_put(in_dev); + return rc; +} + /* * do_link_path_lookup - resolve an ip address to a path record */ @@ -406,17 +441,9 @@ static void do_link_path_lookup(struct s rt->u.dst.neighbour->dev->name, rt->rt_src, rt->rt_dst, rt->rt_gateway, rt->u.dst.neighbour->nud_state); - /* - * device needs to be a valid IB device. Check for loopback. - * In case of loopback find a valid IB device on which to - * direct the loopback traffic. - */ - if (rt->u.dst.neighbour->dev->flags & IFF_LOOPBACK) - dev = ip_dev_find(rt->rt_src); - else { - dev = rt->u.dst.neighbour->dev; - dev_hold(dev); - } + + dev = rt->u.dst.neighbour->dev; + dev_hold(dev); /* * check for IB device or loopback, the later requires extra @@ -433,13 +460,11 @@ static void do_link_path_lookup(struct s if (dev->flags & IFF_LOOPBACK) { dev_put(dev); read_lock(&dev_base_lock); - for (dev = dev_base; dev; dev = dev->next) { - if (dev->type == ARPHRD_INFINIBAND && - (dev->flags & IFF_UP)) { + for (dev = dev_base; dev; dev = dev->next) + if (!tryaddrmatch(dev, rt->rt_src, rt->rt_dst)) { + dev_hold(dev); + break; } - } read_unlock(&dev_base_lock); } -- MST From bardov at gmail.com Tue Dec 13 00:40:26 2005 From: bardov at gmail.com (Dan Bar Dov) Date: Tue, 13 Dec 2005 10:40:26 +0200 Subject: [openib-general] [PATCH] [CMA] support for SDP + standard protocol In-Reply-To: <439DB952.3040506@ichips.intel.com> References: <439DB952.3040506@ichips.intel.com> Message-ID: On 12/12/05, Sean Hefty wrote: > Dan Bar Dov wrote: > > I would have preferred not to add upper layer aware code into CMA, > > but I guess I'm late for that discussion. > > If you mean add SDP code to the CMA, without it, SDP cannot use the CMA and must > duplicate most of the same functionality itself. I understand that SDP needs address translation services as well as its own private data. However, I think it could be implemented using optional API functions that allow the ULP to modify the private data per its need, rather than adding ULP knowledge into CMA. As an example, if the ISER spec is modified, or some new ULP is implemented that needs its own private data, we'll need to modify CMA again, as well as creating a dependency between CMA versions and ULPs. > > > Regarding the patch below, it makes sense. Are you going to apply it to all > affected modules? > > I will apply the patch to all affected modules. Thanks. Dan > > - Sean > From mst at mellanox.co.il Tue Dec 13 01:07:26 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 13 Dec 2005 11:07:26 +0200 Subject: [openib-general] Re: [PATCH] [CMA] support for SDP + standard protocol In-Reply-To: References: Message-ID: <20051213090726.GV14936@mellanox.co.il> Quoting Dan Bar Dov : > I understand that SDP needs address translation services as well as > its own private data.
SDP is an exception simply because it was there first. > However, I think it could be implemented using > optional API functions that allow the ULP to modify the private data > per its need, rather than adding ULP knowledge into CMA. I agree this would also work, but I like the existing API better. Hopefully, the simple way in which it's being implemented will help drive new ULP authors to follow the uniform spec rather than override it :) > As an example, if the ISER spec is modified, or some new ULP > is implemented that needs its own private data, we'll need to modify > CMA again, as well as creating a dependency between CMA versions and > ULPs. What do you mean by "CMA versions"? -- MST From mst at mellanox.co.il Tue Dec 13 01:09:19 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 13 Dec 2005 11:09:19 +0200 Subject: [openib-general] [PATCH] mthca: correct max_rd_atomic handling Message-ID: <20051213090919.GW14936@mellanox.co.il> Fix corner cases in max_rd_atomic value handling in modify-qp. sae/sre/swe bits should only be set when setting sra_max. Further, in current code, if caller specifies max_rd_atomic = 0, the sre and sae bits remain set, with the result that max_rd_atomic = 1 in effect. Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. Tsirkin Index: latest/drivers/infiniband/hw/mthca/mthca_qp.c =================================================================== --- latest.orig/drivers/infiniband/hw/mthca/mthca_qp.c +++ latest/drivers/infiniband/hw/mthca/mthca_qp.c @@ -714,10 +714,7 @@ int mthca_modify_qp(struct ib_qp *ibqp, /* leave wqe_base as 0 (we always create an MR based at 0 for WQs) */ qp_context->wqe_lkey = cpu_to_be32(qp->mr.ibmr.lkey); qp_context->params1 = cpu_to_be32((MTHCA_ACK_REQ_FREQ << 28) | - (MTHCA_FLIGHT_LIMIT << 24) | - MTHCA_QP_BIT_SRE | - MTHCA_QP_BIT_SWE | - MTHCA_QP_BIT_SAE); + (MTHCA_FLIGHT_LIMIT << 24)); if (qp->sq_policy == IB_SIGNAL_ALL_WR) qp_context->params1 |= cpu_to_be32(MTHCA_QP_BIT_SSC); if (attr_mask & IB_QP_RETRY_CNT) { @@ -726,9 +723,14 @@ int mthca_modify_qp(struct ib_qp *ibqp, } if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC) { - if (attr->max_rd_atomic) + qp_context->params1 |= cpu_to_be32(MTHCA_QP_BIT_SWE); + if (attr->max_rd_atomic) { + qp_context->params1 |= + cpu_to_be32(MTHCA_QP_BIT_SRE | + MTHCA_QP_BIT_SAE); qp_context->params1 |= cpu_to_be32(fls(attr->max_rd_atomic - 1) << 21); + } qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_SRA_MAX); } -- MST From mst at mellanox.co.il Tue Dec 13 02:22:55 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 13 Dec 2005 12:22:55 +0200 Subject: [openib-general] [PATCH] libmthca: fix error handling in mthca_store_qp In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3D1B80D@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3D1B80D@mtlexch01.mtl.com> Message-ID: <20051213102255.GX14936@mellanox.co.il> fix error handling in mthca_store_qp Signed-off-by: Jack Morgenstein Signed-off-by: Michael S.
Tsirkin Index: latest/src/userspace/libmthca/src/qp.c =================================================================== --- latest.orig/src/userspace/libmthca/src/qp.c +++ latest/src/userspace/libmthca/src/qp.c @@ -879,6 +879,7 @@ int mthca_store_qp(struct mthca_context ctx->qp_table[tind].table = calloc(ctx->qp_table_mask + 1, sizeof (struct mthca_qp *)); if (!ctx->qp_table[tind].table) { + --ctx->qp_table[tind].refcnt; ret = -1; goto out; } From sean.hefty at intel.com Tue Dec 13 10:39:13 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 13 Dec 2005 10:39:13 -0800 Subject: [openib-general] [PATCH] [CMA] support for SDP + standard protocol In-Reply-To: Message-ID: >I understand that SDP needs address translation services as well as >its own private data. However, I think it could be implemented using >optional API functions that allow the ULP to modify the private data >per its need, rather than adding ULP knowledge into CMA. >As an example, if the ISER spec is modified, or some new ULP >is implemented that needs its own private data, we'll need to modify >CMA again, as well as creating a dependency between CMA versions and >ULPs. The CMA must be aware of the format of the data in order to set and extract the IP addressing information. SDP and the new CMA format locate these in different areas of the private data. The CMA only defines the SDP hello header, and restricts its definition to the location of the IP addresses, source port, and version information. If a ULP wants to define their own private data format and move the locations of any of those fields, then yes, the CMA would need to be changed again. But I don't see how any API changes can prevent this, since the CMA must be able to extract the data on the remote side. - Sean From sean.hefty at intel.com Tue Dec 13 10:46:03 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 13 Dec 2005 10:46:03 -0800 Subject: [openib-general] Re: [PATCH] sdp: replace ip_dev_find with dev_base scan (was Re: ip_dev_find resolution?) In-Reply-To: <20051213050657.GA4940@mellanox.co.il> Message-ID: >Sure, this was just a test to show how to get rid of ip_dev_find. >I didn't follow CMA recently - does CMA already support SDP private format? >If yes, I need to work on moving SDP to use it. The latest check-in added support for SDP. >As a reminder, what we are trying to do is handle the loopback case, >where IP route resolution simply gives us back the loopback device, but >we want to use the IB loopback: either external if source/destination >are specified, or external if not. Can you look over the CMA/ib_addr code and see if what is done meets your needs (minus support for IPv6)? I've never tested the CMA using an address of 127.0.0.1, so I'm not sure what it would do in that case. I'm also not sure if it makes sense for the CMA to handle that case... I will review this patch early next week. - Sean From ftillier at silverstorm.com Tue Dec 13 12:14:45 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Tue, 13 Dec 2005 12:14:45 -0800 Subject: [openib-general] [PATCH] [CMA] support for SDP + standard protocol In-Reply-To: Message-ID: <000001c60021$e4215bb0$6401a8c0@infiniconsys.com> > From: Sean Hefty [mailto:sean.hefty at intel.com] > Sent: Tuesday, December 13, 2005 10:39 AM > > >I understand that SDP needs address translation services as well as > >its own private data.
However, I think it could be implemented using > >optional API functions that allow the ULP to modify the private data > >per its need, rather than adding ULP knowledge into CMA. > >As an example, if the ISER spec is modified, or some new ULP > >is implemented that needs its own private data, we'll need to modify > >CMA again, as well as creating a dependency between CMA versions and > >ULPs. > > The CMA must be aware of the format of the data in order to > set and extract the IP addressing information. SDP and the > new CMA format locate these in different areas of the private > data. The CMA only defines the SDP hello header, and > restricts its definition to the location of the IP addresses, > source port, and version information. > > If a ULP wants to define their own private data format and move > the locations of any of those fields, then yes, the CMA would > need to be changed again. But I don't see how any API changes > can prevent this, since the CMA must be able to extract the data > on the remote side. Now that the IB spec is going to have a section for how to support IP addressing in CM MADs, there shouldn't be any need for a ULP to duplicate that functionality. SDP is a special case because it predates the IP addressing extension to the CM protocol. - Fab From mst at mellanox.co.il Tue Dec 13 12:37:58 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 13 Dec 2005 22:37:58 +0200 Subject: [openib-general] Re: [PATCH] sdp: replace ip_dev_find with dev_base scan (was Re: ip_dev_find resolution?) In-Reply-To: References: Message-ID: <20051213203758.GA6715@mellanox.co.il> Quoting Sean Hefty : > >As a reminder, what we are trying to do is handle the loopback case, > >where IP route resolution simply gives us back the loopback device, but > >we want to use the IB loopback: either external if source/destination > >are specified, or external if not. > > Can you look over the CMA/ib_addr code and see if what is done meets > your needs > (minus support for IPv6)? Will do. > I've never tested the CMA using an address of > 127.0.0.1, so I'm not sure what it would do in that case. I'm also not > sure if it makes sense for the CMA to handle that case... At least for SDP it's important: people are used to being able to specify 127.0.0.1 and get a loopback connection. And in some cases (zcopy), you actually can get good performance out of it. > I will review this patch early next week. -- MST From mst at mellanox.co.il Tue Dec 13 12:44:01 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 13 Dec 2005 22:44:01 +0200 Subject: [openib-general] Re: [PATCH] [CMA] support for SDP + standard protocol In-Reply-To: References: Message-ID: <20051213204401.GB6715@mellanox.co.il> Quoting Sean Hefty : > Subject: [PATCH] [CMA] support for SDP + standard protocol > > The following patch updates the CMA to support the IB socket-based > protocol standard and SDP's private data format. > > The CMA now defines RDMA "port spaces". RDMA identifiers are associated > with a user-specified port space at creation time. > > Please respond with any comments on the approach. Note that these > changes have not been pushed up to userspace yet. > > Signed-off-by: Sean Hefty OK, I started looking at converting SDP to CMA. One thing I'm a bit confused about: do I do my own QP transitions on the passive side?
-- MST From ardavis at ichips.intel.com Tue Dec 13 13:55:17 2005 From: ardavis at ichips.intel.com (Arlin Davis) Date: Tue, 13 Dec 2005 13:55:17 -0800 Subject: [openib-general] [PATCH][uDAPL] openib_cma provider update In-Reply-To: References: Message-ID: <439F4345.6070304@ichips.intel.com> Arlin Davis wrote: >James, > >I modified the IP address lookup during the open to take either a network name, network address, or >device name. This will make the dat.conf setup a little easier and more flexible. I updated the >README, and /doc/dat.conf with details. > >Thanks, > >-arlin > >Signed-off-by: Arlin Davis > > > James, Did you get a chance to look at this patch? -arlin From sean.hefty at intel.com Tue Dec 13 14:21:34 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 13 Dec 2005 14:21:34 -0800 Subject: [openib-general] Re: [PATCH] sdp: replace ip_dev_find with dev_base scan (was Re: ip_dev_find resolution?) In-Reply-To: <20051213203758.GA6715@mellanox.co.il> Message-ID: >> I've never tested the CMA using an address of >> 127.0.0.1, so I'm not sure what it would do in that case. I'm also not >> sure if it makes sense for the CMA to handle that case... > >At least for SDP it's important: people are used to being able >to specify 127.0.0.1 and get a loopback connection. >And in some cases (zcopy), you actually can get good performance out of it. I agree that this should be supported from the user's perspective, just not sure if the CMA should perform this functionality. A higher level ULP could map 127.0.0.1 to a specific IP address before calling the CMA, but I'm not sure that's any better. From the CMA's perspective, 127.0.0.1 could just as easily map to an iWarp device as an Infiniband device. - Sean From jlentini at netapp.com Tue Dec 13 14:31:26 2005 From: jlentini at netapp.com (James Lentini) Date: Tue, 13 Dec 2005 17:31:26 -0500 (EST) Subject: [openib-general] [PATCH][uDAPL] openib_cma provider update In-Reply-To: <439F4345.6070304@ichips.intel.com> Message-ID: On Tue, 13 Dec 2005, Arlin Davis wrote: > Arlin Davis wrote: > > >James, > > > >I modified the IP address lookup during the open to take either a network name, network address, or > >device name. This will make the dat.conf setup a little easier and more flexible. I updated the > >README, and /doc/dat.conf with details. > > > >Thanks, > > > >-arlin > > > >Signed-off-by: Arlin Davis > > > > > > > James, > > Did you get a chance to look at this patch? > > -arlin I haven't had a chance yet. I'm traveling this week, so my connectivity is sporadic. I'll be able to review it by Thursday at the latest. From sean.hefty at intel.com Tue Dec 13 14:31:47 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 13 Dec 2005 14:31:47 -0800 Subject: [openib-general] RE: [PATCH] [CMA] support for SDP + standard protocol In-Reply-To: <20051213204401.GB6715@mellanox.co.il> Message-ID: >OK, I started looking at converting SDP to CMA. >One thing I'm a bit confused about: do I do >my own QP transitions on the passive side? The CMA should perform the QP transitions on both sides. The main difference between the SDP and other users of the CMA is that SDP passes in the start of the SDP hello header, and owns setting any information not related to the IP addressing, such as the SDP version, MaxAdverts, etc. The CMA will fill in the IP version, IP addresses, and local port. - Sean From mst at mellanox.co.il Tue Dec 13 14:36:22 2005 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Wed, 14 Dec 2005 00:36:22 +0200 Subject: [openib-general] Re: [PATCH] sdp: replace ip_dev_find with dev_base scan (was Re: ip_dev_find resolution?) In-Reply-To: References: Message-ID: <20051213223622.GA7173@mellanox.co.il> Quoting Sean Hefty : > A higher level ULP could map 127.0.0.1 to a specific IP address before > calling the CMA, but I'm not sure that's any better. Ugh. I really would like to hide all the IPv4/IPv6 etc from ULPs. > From the CMA's perspective, 127.0.0.1 could just as easily > map to an iWarp device as an Infiniband device. Which device to select is a difficult problem. I think we might be able to just punt on this for now, selecting an arbitrary device of an appropriate type that happens to be up. My hope is that in the long run, this can be viewed as a special case in the general path selection/multipathing problem. By the way, CMA seems to happily take bits out of the hardware address and assume that these include the gid, pkey, etc. Shouldn't it check the device type before doing this? -- MST From mst at mellanox.co.il Tue Dec 13 14:42:02 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 14 Dec 2005 00:42:02 +0200 Subject: [openib-general] Re: [PATCH] [CMA] support for SDP + standard protocol In-Reply-To: References: Message-ID: <20051213224202.GB7173@mellanox.co.il> Quoting r. Sean Hefty : > Subject: RE: [PATCH] [CMA] support for SDP + standard protocol > > >OK, I started looking at converting SDP to CMA. > >One thing I'm a bit confused about: do I do > >my own QP transitions on the passive side? > > The CMA should perform the QP transitions on both sides. > > The main difference between the SDP and other users of the CMA is that > SDP > passes in the start of the SDP hello header, and owns setting any > information > not related to the IP addressing, such as the SDP version, MaxAdverts, > etc. The > CMA will fill in the IP version, IP addresses, and local port. > > - Sean > What confuses me is how do I handle creation of multiple QPs when multiple clients want to connect to a specific port on a server. cma id seems to only include one qp: do I disconnect it from qp somehow after connection is set up? -- MST From rpandit at silverstorm.com Tue Dec 13 14:59:13 2005 From: rpandit at silverstorm.com (Ranjit Pandit) Date: Tue, 13 Dec 2005 14:59:13 -0800 Subject: [openib-general] Re: [PATCH] [CMA] support for SDP + standard protocol In-Reply-To: <20051213224202.GB7173@mellanox.co.il> References: <20051213224202.GB7173@mellanox.co.il> Message-ID: <96f8e60e0512131459m76301fb4gae24954921a3e388@mail.gmail.com> On 12/13/05, Michael S. Tsirkin wrote: > > What confuses me is how do I handle creation of multiple QPs > when multiple clients want to connect to a specific port on a server. > cma id seems to only include one qp: do I disconnect it from qp > somehow after connection is set up? RDS has a similar requirement - it creates one listener to which all clients connect. I'm also trying to figure out whether to use CM or CMA for RDS. I would think that rdma_accept() would create a separate cma_id (and its associated qp) which will then become the actual passive side of the connection.
Ranjit > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From tom at opengridcomputing.com Tue Dec 13 15:07:24 2005 From: tom at opengridcomputing.com (Tom Tucker) Date: Tue, 13 Dec 2005 17:07:24 -0600 Subject: [openib-general] dev_remove in the CMA Message-ID: <1134515244.3764.9.camel@trinity.austin.ammasso.com> Sean: I don't understand the dev_remove usage in the rdma_cm_id. It looks to me like, if the user calls rdma_resolve_addr but never calls rdma_resolve_route, the device cannot be removed. Is this the intended behavior? Is the goal to prevent the user from removing the device if the client is in a callback? If so, can't we just increment and decrement in the cma_notify_user function? I guess I just don't understand... Thanks, Tom From rpandit at silverstorm.com Tue Dec 13 15:50:22 2005 From: rpandit at silverstorm.com (Ranjit Pandit) Date: Tue, 13 Dec 2005 15:50:22 -0800 Subject: [openib-general] Re: [PATCH] [CMA] support for SDP + standard protocol In-Reply-To: <96f8e60e0512131459m76301fb4gae24954921a3e388@mail.gmail.com> References: <20051213224202.GB7173@mellanox.co.il> <96f8e60e0512131459m76301fb4gae24954921a3e388@mail.gmail.com> Message-ID: <96f8e60e0512131550m4cdc9081h9b00883e4cd572e8@mail.gmail.com> Looks like the CMA does create a new cm_id on every connect request. cma_req_handler() calls cma_new_id() and passes the new id to the connect request callback. On 12/13/05, Ranjit Pandit wrote: > On 12/13/05, Michael S. Tsirkin wrote: > > > > What confuses me is how do I handle creation of multiple QPs > > when multiple clients want to connect to a specific port on a server. > > cma id seems to only include one qp: do I disconnect it from qp > > somehow after connection is set up? > > RDS has a similar requirement - it creates one listener to which all > clients connect. > I'm also trying to figure out whether to use CM or CMA for RDS. > > I would think that rdma_accept() would create a separate cma_id (and > its associated qp) which will then become the actual passive side of > the connection. > > Ranjit > > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From abeyn at datadirectnet.com Tue Dec 13 19:07:53 2005 From: abeyn at datadirectnet.com (Alexander Beyn) Date: Tue, 13 Dec 2005 19:07:53 -0800 Subject: [openib-general] [PATCH] ibsrpdm: use the proper HCA and port with non-default umad device Message-ID: <439F8C89.7040507@datadirectnet.com> In srptools-0.0.2, ibsrpdm gets the LID of the first port of the first HCA to do discovery. This means ibsrpdm can't find SRP targets connected to other ports, even if the proper umad device is passed with the -d option. With the following patch, ibsrpdm uses the HCA and port associated with the umad device to get the LID. It was tested with 2 dual-port HCAs directly connected to our Infiniband array, properly finding SRP targets on all four ports.
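The core of the lookup in the patch below, condensed into a rough standalone sketch (the libsysfs calls are the ones the patch itself uses; error reporting and the analogous read of the "port" attribute are trimmed for brevity):

#define _GNU_SOURCE
#include <stdio.h>
#include <sysfs/libsysfs.h>

/* map a umad device name (e.g. "umad0") to the name of its HCA via
 * /sys/class/infiniband_mad/<umad>/ibdev; 'ibdev' must hold 16 bytes */
static int umad_to_ibdev(const char *umad_name, char ibdev[16])
{
	struct sysfs_class_device *cdev;
	struct sysfs_attribute *attr;
	int ret = -1;

	cdev = sysfs_open_class_device("infiniband_mad", (char *) umad_name);
	if (!cdev)
		return -1;

	attr = sysfs_get_classdev_attr(cdev, "ibdev");
	if (attr && !sysfs_read_attribute(attr)) {
		/* attribute value is the HCA name, e.g. "mthca0" */
		sscanf(attr->value, "%15s", ibdev);
		ret = 0;
	}
	sysfs_close_class_device(cdev);
	return ret;
}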
Alexander Beyn DataDirect Networks --- srp-dm.c.orig 2005-12-13 15:32:52.000000000 -0800 +++ srp-dm.c 2005-12-13 18:22:10.000000000 -0800 @@ -44,6 +44,7 @@ static const uint8_t topspin_oui[3] = { 0x00, 0x05, 0xad }; static char *umad_dev = "/dev/infiniband/umad0"; +static char *port_sysfs_path; static int timeout_ms = 2500; static uint16_t sm_lid; static uint32_t tid = 1; @@ -77,6 +78,47 @@ fprintf(stderr, "Usage: %s [-gGvc] [-d ]\n", argv0); } +int setup_port_sysfs_path(void) { + char path[256]; + char ibport[16]; + char ibdev[16]; + char *umad_dev_name; + struct sysfs_class_device *umad_sysfs_dev; + struct sysfs_attribute *umad_attr; + + if (sysfs_get_mnt_path(path, sizeof path)) { + fprintf(stderr, "Couldn't find sysfs mount.\n"); + return -1; + } + if((umad_dev_name = rindex(umad_dev, '/'))) { + umad_dev_name++; + } + umad_sysfs_dev = sysfs_open_class_device("infiniband_mad", + umad_dev_name); + if(!umad_sysfs_dev) { + fprintf(stderr, "Couldn't open umad sysfs entry named: %s\n", + umad_dev_name); + return -1; + } + umad_attr = sysfs_get_classdev_attr(umad_sysfs_dev, "ibdev"); + if(sysfs_read_attribute(umad_attr)) { + fprintf(stderr, "Couldn't read ibdev attribute.\n"); + return -1; + } + sscanf(umad_attr->value, "%15s", ibdev); + + umad_attr = sysfs_get_classdev_attr(umad_sysfs_dev, "port"); + if(sysfs_read_attribute(umad_attr)) { + fprintf(stderr, "Couldn't read port attribute.\n"); + return -1; + } + sscanf(umad_attr->value, "%15s", ibport); + + asprintf(&port_sysfs_path, "%s/class/infiniband/%s/ports/%s", + path, ibdev, ibport); + return 0; +} + int create_agent(int fd, uint32_t agent[2]) { struct ib_user_mad_reg_req req; @@ -196,7 +238,6 @@ struct ib_user_mad in_mad, out_mad; struct srp_dm_mad *out_dm_mad, *in_dm_mad; struct srp_dm_class_port_info *cpi; - char path[256]; char val[64]; char *name; int i; @@ -208,12 +249,7 @@ cpi = (void *) out_dm_mad->data; - if (sysfs_get_mnt_path(path, sizeof path)) { - fprintf(stderr, "Couldn't find sysfs mount.\n"); - return -1; - } - - asprintf(&name, "%s/class/infiniband/mthca0/ports/1/lid", path); + asprintf(&name, "%s/lid", port_sysfs_path); if (sysfs_read_attribute_value(name, val, sizeof val)) { fprintf(stderr, "Couldn't read LID at %s\n", name); @@ -222,7 +258,7 @@ cpi->trap_lid = htons(strtol(val, NULL, 0)); - asprintf(&name, "%s/class/infiniband/mthca0/ports/1/gids/0", path); + asprintf(&name, "%s/gids/0", port_sysfs_path); if (sysfs_read_attribute_value(name, val, sizeof val)) { fprintf(stderr, "Couldn't read GID at %s\n", name); @@ -473,19 +509,13 @@ struct srp_dm_rmpp_sa_mad *out_sa_mad, *in_sa_mad; struct srp_sa_port_info_rec *port_info; ssize_t len; - char path[256]; char val[64]; char *name; int pn; int size; int i; - if (sysfs_get_mnt_path(path, sizeof path)) { - fprintf(stderr, "Couldn't find sysfs mount.\n"); - return -1; - } - - asprintf(&name, "%s/class/infiniband/mthca0/ports/1/sm_lid", path); + asprintf(&name, "%s/sm_lid", port_sysfs_path); if (sysfs_read_attribute_value(name, val, sizeof val)) { fprintf(stderr, "Couldn't read LID at %s\n", name); @@ -585,6 +615,8 @@ perror("open"); return 1; } + if(setup_port_sysfs_path()) + return 1; if (create_agent(fd, agent)) return 1; From bardov at gmail.com Tue Dec 13 22:43:26 2005 From: bardov at gmail.com (Dan Bar Dov) Date: Wed, 14 Dec 2005 08:43:26 +0200 Subject: [openib-general] [PATCH] [CMA] support for SDP + standard protocol In-Reply-To: <000001c60021$e4215bb0$6401a8c0@infiniconsys.com> References: <000001c60021$e4215bb0$6401a8c0@infiniconsys.com> Message-ID: Wouldn't it 
make sense then, to also modify the SDP spec? After all, the change in openIB would modify both client & server sides. Of course existing stacks would have to be changed if they want interoperability, but I think it could fly. Dan On 12/13/05, Fab Tillier wrote: > > From: Sean Hefty [mailto:sean.hefty at intel.com] > > Sent: Tuesday, December 13, 2005 10:39 AM > > > > >I understand that SDP needs address translation services as well as > > >its own private data. However, I think it could be implemented using > > >optional API functions that allow the ULP to modify the private data > > >per its need, rather than adding ULP knowledge into CMA. > > >As an example, if the ISER spec is modified, or some new ULP > > >is implemented that needs its own private data, we'll need to modify > > >CMA again, as well as creating a dependency between CMA versions and > > >ULPs. > > > > The CMA must be aware of the format of the data in order to > > set and extract the IP addressing information. SDP and the > > new CMA format locate these in different areas of the private > > data. The CMA only defines the SDP hello header, and > > restricts its definition to the location of the IP addresses, > > source port, and version information. > > > > If a ULP wants to define their own private data format and move > > the locations of any of those fields, then yes, the CMA would > > need to be changed again. But I don't see how any API changes > > can prevent this, since the CMA must be able to extract the data > > on the remote side. > > Now that the IB spec is going to have a section for how to support IP addressing > in CM MADs, there shouldn't be any need for a ULP to duplicate that > functionality. SDP is a special case because it predates the IP addressing > extension to the CM protocol. > > - Fab > > > From ogerlitz at voltaire.com Wed Dec 14 01:56:22 2005 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 14 Dec 2005 11:56:22 +0200 Subject: [openib-general] Re: [PATCH] [CMA] support for SDP + standard protocol In-Reply-To: <96f8e60e0512131550m4cdc9081h9b00883e4cd572e8@mail.gmail.com> References: <20051213224202.GB7173@mellanox.co.il> <96f8e60e0512131459m76301fb4gae24954921a3e388@mail.gmail.com> <96f8e60e0512131550m4cdc9081h9b00883e4cd572e8@mail.gmail.com> Message-ID: <439FEC46.9080904@voltaire.com> Ranjit Pandit wrote: > Looks like the CMA does create a new cm_id on every connect request. > cma_req_handler() calls cma_new_id() and passes the new id to the > connect request callback. Indeed, to see how it is used you can follow the passive side flow in gen2/utils/src/linux-kernel/infiniband/util/cmatose.
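A rough sketch of that passive-side pattern, along the lines described next (handler and helper names here are hypothetical, and the exact rdma_cm signatures of this period may differ slightly):

struct my_listen_ctx {
	struct rdma_cm_id *listen_id;	/* saved when rdma_listen() was issued */
};

/* The CMA invokes this with a *new* cma id for each connect request
 * (created by cma_new_id() above); the listening id stays bound for
 * further connect requests and is recovered through the context,
 * which the new id inherits from the listener. */
static int my_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
{
	struct my_listen_ctx *ctx = id->context;

	if (event->event == RDMA_CM_EVENT_CONNECT_REQUEST)
		return my_accept_connection(id, ctx);	/* hypothetical helper:
							 * set up qp, rdma_accept() */
	return 0;
}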
Having both the cma ids of the listener and the new connection at hand within the connection request callback is easy: you add the listener id as a field in the context struct which you associate with it, so you get it as the callback's second param. From ogerlitz at voltaire.com Wed Dec 14 02:17:55 2005 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 14 Dec 2005 12:17:55 +0200 Subject: [openib-general] CMA/RDS In-Reply-To: <96f8e60e0512131459m76301fb4gae24954921a3e388@mail.gmail.com> References: <20051213224202.GB7173@mellanox.co.il> <96f8e60e0512131459m76301fb4gae24954921a3e388@mail.gmail.com> Message-ID: <439FF153.8030203@voltaire.com> Ranjit Pandit wrote: > I'm also trying to figure out whether to use CM or CMA for RDS Please note that the CMA does much more than making your CM interaction easier, eg handle IP to RDMA (eg IB) address translation, being a hotplug client and generally providing an IP based RDMA transport neutral connection management service. Combining this with the ib_verbs api being also RDMA transport neutral enables an app/ulp to be such. Eventually, most (all) the middlewares/ULPs are to be coded over the CMA and ib_verbs, eg uDAPL, iSER, NFSoRDMA, Lustre, SDP, SRP(?). The iSER initiator code over the CMA is under work and an initial drop is committed to the openib svn. Or. From yael at mellanox.co.il Wed Dec 14 02:41:42 2005 From: yael at mellanox.co.il (Yael Kalka) Date: 14 Dec 2005 12:41:42 +0200 Subject: [openib-general] [PATCH] Opensm - support arbitrary paths for driver installation Message-ID: <5z64psgduh.fsf@mtl066.yok.mtl.com> Hi Hal, Currently, if the user level driver installation is not in /usr/local/ dir - configure of opensm fails. The following patch enables support for arbitrary paths of user level driver installation. The path can be given using the --with-uldrv flag.
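For example, assuming a hypothetical installation prefix of /usr/local/ibgd, the build could then be configured with something like:

./configure --with-osmv=openib --with-uldrv=/usr/local/ibgd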
Thanks, Yael Signed-off-by: Yael Kalka Index: osmtest/Makefile.am =================================================================== --- osmtest/Makefile.am (revision 4412) +++ osmtest/Makefile.am (working copy) @@ -5,10 +5,7 @@ else DBGFLAGS = -g -O2 endif -INCLUDES = -I$(srcdir)/include \ - -I$(srcdir)/../include \ - -I$(srcdir)/../../libibcommon/include/infiniband \ - -I$(srcdir)/../../libibumad/include/infiniband +INCLUDES = -I$(srcdir)/include $(OSMV_INCLUDES) bin_PROGRAMS = osmtest osmtest_SOURCES = main.c osmtest.c osmt_service.c osmt_slvl_vl_arb.c \ Index: opensm/Makefile.am =================================================================== --- opensm/Makefile.am (revision 4412) +++ opensm/Makefile.am (working copy) @@ -1,7 +1,5 @@ -INCLUDES = -I$(srcdir)/../include \ - -I$(srcdir)/../../libibcommon/include/infiniband \ - -I$(srcdir)/../../libibumad/include/infiniband +INCLUDES = $(OSMV_INCLUDES) lib_LTLIBRARIES = libopensm.la Index: config/osmvsel.m4 =================================================================== --- config/osmvsel.m4 (revision 4412) +++ config/osmvsel.m4 (working copy) @@ -13,21 +13,39 @@ AC_DEFUN([OPENIB_APP_OSMV_SEL], [ dnl Define a way for the user to provide the osm vendor type AC_ARG_WITH(osmv, -[ --with-osmv= define the osm vendor type], +[ --with-osmv= define the osm vendor type to build], AC_MSG_NOTICE(Using OSM Vendor Type:$with_osmv), with_osmv="openib") +dnl Define a way for the user to provide the path to the driver installation +AC_ARG_WITH(uldrv, +[ --with-uldrv= define the dir where the user level driver is installed], +AC_MSG_NOTICE(Using user level installation prefix:$with_uldrv), +with_uldrv="") + dnl Define a way for the user to provide the path to the simulator installation AC_ARG_WITH(sim, [ --with-sim= define the simulator prefix for building sim vendor (/usr)], AC_MSG_NOTICE(Using Simulator from:$with_sim), with_sim="/usr") +dnl Should we use lib64 or lib +if test "$(uname -m)" = "x86_64"; then + osmv_lib_type="lib64" +else + osmv_lib_type="lib" +fi + dnl based on the with_osmv we can try the vendor flag if test $with_osmv = "openib"; then OSMV_CFLAGS="-DOSM_VENDOR_INTF_OPENIB" OSMV_INCLUDES="-I\$(srcdir)/../include -I\$(srcdir)/../../libibcommon/include/infiniband -I\$(srcdir)/../../libibumad/include/infiniband" + if test "x$with_uldrv" = "x"; then OSMV_LDADD="-libumad" + else + OSMV_INCLUDES="-I$with_uldrv/include $OSMV_INCLUDES" + OSMV_LDADD="-L$with_uldrv/$osmv_lib_type -libumad" + fi elif test $with_osmv = "sim" ; then OSMV_CFLAGS="-DOSM_VENDOR_INTF_SIM" OSMV_INCLUDES="-I$with_sim/include -I\$(srcdir)/../include" @@ -90,8 +108,11 @@ if test "$disable_libcheck" != "yes"; th dnl based on the with_osmv we can try the vendor flag if test $with_osmv = "openib"; then + osmv_save_ldflags=$LDFALGS + LDFLAGS="$LDFLAGS $OSMV_LDADD" AC_CHECK_LIB(ibumad, umad_init, [], AC_MSG_ERROR([umad_init() not found. libosmvendor of type openib requires libibumad.])) + LD_FLAGS=$osmv_save_ldflags elif test $with_osmv = "sim" ; then LDFLAGS="$LDFLAGS -L$with_sim/lib" AC_CHECK_FILE([$with_sim/lib/libibmscli.a], [], From mst at mellanox.co.il Wed Dec 14 04:48:40 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 14 Dec 2005 14:48:40 +0200 Subject: [openib-general] ipoib: question Message-ID: <20051214124840.GN14870@mellanox.co.il> Roland, where exactly does the following math come from? 
static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh) { return (struct ipoib_neigh **) (neigh->ha + 24 - (offsetof(struct neighbour, ha) & 4)); } 1. What does & 4 do here? 2. Why are we subtracting a function of ha offset? 4. What is 24? Is it related to INFINIBAND_ALEN? Thanks, -- MST From halr at voltaire.com Wed Dec 14 08:09:04 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Dec 2005 11:09:04 -0500 Subject: [openib-general] Re: [PATCH] Opensm - support arbitrary paths for driver installation In-Reply-To: <5z64psgduh.fsf@mtl066.yok.mtl.com> References: <5z64psgduh.fsf@mtl066.yok.mtl.com> Message-ID: <1134576541.26766.7669.camel@hal.voltaire.com> Hi Yael, On Wed, 2005-12-14 at 05:41, Yael Kalka wrote: > Hi Hal, > > Currently, if the user level driver installation is not in /usr/local/ > dir - configure of opensm fails. > The following patch enables support for arbitrary paths of user level > driver installation. The path can be given using the --with-uldrv flag. Thanks. Applied. A couple of comments below. -- Hal
> Index: config/osmvsel.m4 > =================================================================== > --- config/osmvsel.m4 (revision 4412) > +++ config/osmvsel.m4 (working copy) > @@ -13,21 +13,39 @@ AC_DEFUN([OPENIB_APP_OSMV_SEL], [ > > dnl Define a way for the user to provide the osm vendor type > AC_ARG_WITH(osmv, > -[ --with-osmv= define the osm vendor type], > +[ --with-osmv= define the osm vendor type to build], > AC_MSG_NOTICE(Using OSM Vendor Type:$with_osmv), > with_osmv="openib") > > +dnl Define a way for the user to provide the path to the driver installation > +AC_ARG_WITH(uldrv, > +[ --with-uldrv= define the dir where the user level driver is installed], > +AC_MSG_NOTICE(Using user level installation prefix:$with_uldrv), > +with_uldrv="") > + > dnl Define a way for the user to provide the path to the simulator installation > AC_ARG_WITH(sim, > [ --with-sim= define the simulator prefix for building sim vendor (/usr)], > AC_MSG_NOTICE(Using Simulator from:$with_sim), > with_sim="/usr") > > +dnl Should we use lib64 or lib > +if test "$(uname -m)" = "x86_64"; then > + osmv_lib_type="lib64" > +else > + osmv_lib_type="lib" > +fi > + > dnl based on the with_osmv we can try the vendor flag > if test $with_osmv = "openib"; then > OSMV_CFLAGS="-DOSM_VENDOR_INTF_OPENIB" > OSMV_INCLUDES="-I\$(srcdir)/../include -I\$(srcdir)/../../libibcommon/include/infiniband -I\$(srcdir)/../../libibumad/include/infiniband" > + if test "x$with_uldrv" = "x"; then > OSMV_LDADD="-libumad" > + else > + OSMV_INCLUDES="-I$with_uldrv/include $OSMV_INCLUDES" > + OSMV_LDADD="-L$with_uldrv/$osmv_lib_type -libumad" > + fi > elif test $with_osmv = "sim" ; then > OSMV_CFLAGS="-DOSM_VENDOR_INTF_SIM" > OSMV_INCLUDES="-I$with_sim/include -I\$(srcdir)/../include" > @@ -90,8 +108,11 @@ if test "$disable_libcheck" != "yes"; th > > dnl based on the with_osmv we can try the vendor flag > if test $with_osmv = "openib"; then > + osmv_save_ldflags=$LDFALGS ^^^^^^^ LDFLAGS (in some other places too) > + LDFLAGS="$LDFLAGS $OSMV_LDADD" > AC_CHECK_LIB(ibumad, umad_init, [], > AC_MSG_ERROR([umad_init() not found. libosmvendor of type openib requires libibumad.])) > + LD_FLAGS=$osmv_save_ldflags > elif test $with_osmv = "sim" ; then > LDFLAGS="$LDFLAGS -L$with_sim/lib" > AC_CHECK_FILE([$with_sim/lib/libibmscli.a], [], From iod00d at hp.com Wed Dec 14 09:14:52 2005 From: iod00d at hp.com (Grant Grundler) Date: Wed, 14 Dec 2005 09:14:52 -0800 Subject: [openib-general] Re: [openib-commits] r4453 - trunk/contrib/mellanox/gen2/src/userspace/perftest In-Reply-To: <20051214095726.682B32283EE@openib.ca.sandia.gov> References: <20051214095726.682B32283EE@openib.ca.sandia.gov> Message-ID: <20051214171452.GA26274@esmail.cup.hp.com> On Wed, Dec 14, 2005 at 01:57:26AM -0800, sagir at openib.org wrote: > Author: sagir > Date: 2005-12-14 01:57:24 -0800 (Wed, 14 Dec 2005) > New Revision: 4453 > > Modified: > trunk/contrib/mellanox/gen2/src/userspace/perftest/rdma_lat.c Can someone from mellanox explain why mainline src/userspace is cloned under contrib/mellanox? > Log: > mtu per device You guys are certainly welcome to add stuff to contrib/mellanox. I just would like to be able to explain to HP management why there are two versions of rmda_lat.c. thanks, grant From eitan at mellanox.co.il Wed Dec 14 11:25:23 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 14 Dec 2005 21:25:23 +0200 Subject: [openib-general] Next workshop dates? 
Please respond with your preferences In-Reply-To: <6.2.3.4.2.20051208224443.03a16be0@mail-lc.llnl.gov> References: <6.2.3.4.2.20051208224443.03a16be0@mail-lc.llnl.gov> Message-ID: <43A071A3.30500@mellanox.co.il> Hi, I would like to propose the following agenda topics: Core Enhancements: QoS - directions for integration and support by OpenIB stack Partitions - recognize areas needing enhancements Multicast, Services and InformInfo Registrations: reference counting and re-registrations - implementation plan/API Diagnostics: Describe/discuss new diagnostic tools feature. Eitan From arlin.r.davis at intel.com Wed Dec 14 11:56:02 2005 From: arlin.r.davis at intel.com (Arlin Davis) Date: Wed, 14 Dec 2005 11:56:02 -0800 Subject: [openib-general] [PATCH][uDAPL] openib_scm uses incorrect rd_atomic values for modify_qp Message-ID: James, Here is a fix for openib socket cm version. I ran into a problem with the latest verbs qp_modify as a result of incorrect rd_atomic values so I modified to use the values returned from the ibv_query_device() instead of hard coded values. -arlin Signed-off by: Arlin Davis Index: dapl/openib_scm/dapl_ib_qp.c =================================================================== --- dapl/openib_scm/dapl_ib_qp.c (revision 4464) +++ dapl/openib_scm/dapl_ib_qp.c (working copy) @@ -300,10 +300,11 @@ dapls_modify_qp_state ( IN ib_qp_handle_ { struct ibv_qp_attr qp_attr; enum ibv_qp_attr_mask mask = IBV_QP_STATE; - + DAPL_EP *ep_ptr = (DAPL_EP*)qp_handle->qp_context; + dapl_os_memzero((void*)&qp_attr, sizeof(qp_attr)); qp_attr.qp_state = qp_state; - + switch (qp_state) { /* additional attributes with RTR and RTS */ case IBV_QPS_RTR: @@ -318,17 +319,21 @@ dapls_modify_qp_state ( IN ib_qp_handle_ qp_attr.path_mtu = IBV_MTU_1024; qp_attr.dest_qp_num = qp_cm->qpn; qp_attr.rq_psn = 1; - qp_attr.max_dest_rd_atomic = 8; + qp_attr.max_dest_rd_atomic = + ep_ptr->param.ep_attr.max_rdma_read_out; qp_attr.min_rnr_timer = 12; qp_attr.ah_attr.is_global = 0; qp_attr.ah_attr.dlid = qp_cm->lid; qp_attr.ah_attr.sl = 0; qp_attr.ah_attr.src_path_bits = 0; qp_attr.ah_attr.port_num = qp_cm->port; - + dapl_dbg_log (DAPL_DBG_TYPE_EP, - " modify_qp_rtr: qpn %x lid %x port %x\n", - qp_cm->qpn,qp_cm->lid,qp_cm->port ); + " modify_qp_rtr: qpn %x lid %x " + "port %x rd_atomic %d\n", + qp_cm->qpn, qp_cm->lid, qp_cm->port, + qp_attr.max_dest_rd_atomic ); + break; } case IBV_QPS_RTS: @@ -343,9 +348,11 @@ dapls_modify_qp_state ( IN ib_qp_handle_ qp_attr.retry_cnt = 7; qp_attr.rnr_retry = 7; qp_attr.sq_psn = 1; - qp_attr.max_rd_atomic = 8; + qp_attr.max_rd_atomic = + ep_ptr->param.ep_attr.max_rdma_read_out; + dapl_dbg_log (DAPL_DBG_TYPE_EP, - " modify_qp_rts: psn %x or %x\n", + " modify_qp_rts: psn %x rd_atomic %d\n", qp_attr.sq_psn, qp_attr.max_rd_atomic ); break; } From rdreier at cisco.com Wed Dec 14 12:16:45 2005 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 14 Dec 2005 12:16:45 -0800 Subject: [openib-general] Re: ipoib: question In-Reply-To: <20051214124840.GN14870@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 14 Dec 2005 14:48:40 +0200") References: <20051214124840.GN14870@mellanox.co.il> Message-ID: > where exactly does the following math come from? > > static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh) > { > return (struct ipoib_neigh **) (neigh->ha + 24 - > (offsetof(struct neighbour, ha) & 4)); > } > > 1. What does & 4 do here? > 2. Why are we subsrctucting a function of ha offset? No #3 again ;) > 4. What is 24? Is it related to INFINIBAND_ALEN? 
Yes, 24 is INFINIBAND_ALEN + 4. Maybe it would be clearer to write it that way. The idea is that we want to get something aligned to 8 bytes. I'd have to check again to be sure but I think that on some architectures, the beginning ha member of struct neighbour is only aligned to 4 bytes, so we should offset by 20 bytes to get to an alignment of 8 and still leave room for the real hardware address. - R. From rdreier at cisco.com Wed Dec 14 12:21:34 2005 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 14 Dec 2005 12:21:34 -0800 Subject: [openib-general] Re: [PATCH] libmthca: fix error handling in mthca_store_qp In-Reply-To: <20051213102255.GX14936@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 13 Dec 2005 12:22:55 +0200") References: <6AB138A2AB8C8E4A98B9C0C3D52670E3D1B80D@mtlexch01.mtl.com> <20051213102255.GX14936@mellanox.co.il> Message-ID: Thanks, good catch. I decided it was clearer to split off the test of refcnt, and only increment refcnt if the allocation succeeds. - R. --- libmthca/src/qp.c (revision 4465) +++ libmthca/src/qp.c (working copy) @@ -875,13 +875,15 @@ int mthca_store_qp(struct mthca_context pthread_mutex_lock(&ctx->qp_table_mutex); - if (!ctx->qp_table[tind].refcnt++) { + if (!ctx->qp_table[tind].refcnt) { ctx->qp_table[tind].table = calloc(ctx->qp_table_mask + 1, sizeof (struct mthca_qp *)); if (!ctx->qp_table[tind].table) { ret = -1; goto out; } + + ++ctx->qp_table[tind].refcnt; } ctx->qp_table[tind].table[qpn & ctx->qp_table_mask] = qp; --- libmthca/ChangeLog (revision 4465) +++ libmthca/ChangeLog (working copy) @@ -1,3 +1,8 @@ +2005-12-14 Roland Dreier + + * src/qp.c (mthca_store_qp): Only increment qp_table ref count if + allocation succeeds. + 2005-11-29 Michael S. Tsirkin * src/qp.c (mthca_arbel_post_send): Add handling for posting long From rdreier at cisco.com Wed Dec 14 12:22:50 2005 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 14 Dec 2005 12:22:50 -0800 Subject: [openib-general] Re: [PATCH] mthca: correct max_rd_atomic handling In-Reply-To: <20051213090919.GW14936@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 13 Dec 2005 11:09:19 +0200") References: <20051213090919.GW14936@mellanox.co.il> Message-ID: What happens if SAE and SRE are turned off and the consumer posts an RDMA read? Does it fail and generate an error completion? I don't think we want it to just stop the QP processing. - R. From rdreier at cisco.com Wed Dec 14 12:28:18 2005 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 14 Dec 2005 12:28:18 -0800 Subject: [openib-general] Re: [PATCH/RFC] change ibv_get_devices() to ibv_get_device_list() In-Reply-To: <20051210214832.GA31057@mellanox.co.il> (Michael S. Tsirkin's message of "Sat, 10 Dec 2005 23:48:32 +0200") References: <20051210214832.GA31057@mellanox.co.il> Message-ID: Michael> This wont work for hotplug: you are saving the device Michael> pointer without opening the device, so it might go away Michael> from under your feet. Good point -- see updated change below. Michael> I wander whether we can come up with an API that helps Michael> people get it right more easily? I guess we could return some opaque cookie or validate pointers using a hash table or something like that. 
--- mvapich-gen2/mpid/ch_gen2/viainit.c (revision 4465) +++ mvapich-gen2/mpid/ch_gen2/viainit.c (working copy) @@ -74,13 +74,21 @@ static void set_malloc_options(void) static void open_hca(void) { - struct dlist *dev_list; struct ibv_device *ib_dev = NULL; +#ifdef GEN2_OLD_DEVICE_LIST_VERB + struct dlist *dev_list; + dev_list = ibv_get_devices(); dlist_start(dev_list); ib_dev = dlist_next(dev_list); +#else + struct ibv_device **dev_list; + + dev_list = ibv_get_device_list(NULL); + ib_dev = dev_list[0]; +#endif if (!ib_dev) { fprintf(stderr, "No IB devices found\n"); @@ -90,6 +98,10 @@ static void open_hca(void) ibv_dev.context = ibv_open_device(ib_dev); +#ifndef GEN2_OLD_DEVICE_LIST_VERB + ibv_free_device_list(dev_list); +#endif + if(!ibv_dev.context) { error_abort_all(GEN_EXIT_ERR, "Error getting HCA context\n"); } From rdreier at cisco.com Wed Dec 14 12:39:43 2005 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 14 Dec 2005 12:39:43 -0800 Subject: [openib-general] Re: [PATCH/RFC] change ibv_get_devices() to ibv_get_device_list() In-Reply-To: (Roland Dreier's message of "Wed, 14 Dec 2005 12:28:18 -0800") References: <20051210214832.GA31057@mellanox.co.il> Message-ID: BTW, I'm going to commit this whole set of changes now... - R. From mst at mellanox.co.il Wed Dec 14 13:25:16 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 14 Dec 2005 23:25:16 +0200 Subject: [openib-general] Re: ipoib: question In-Reply-To: References: Message-ID: <20051214212516.GI17538@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: ipoib: question > > > where exactly does the following math come from? > > > > static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh) > > { > > return (struct ipoib_neigh **) (neigh->ha + 24 - > > (offsetof(struct neighbour, ha) & 4)); > > } > The idea is that we want to get something aligned to 8 bytes. Does & 4 do that? -- MST From mst at mellanox.co.il Wed Dec 14 13:28:49 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 14 Dec 2005 23:28:49 +0200 Subject: [openib-general] Re: [PATCH/RFC] change ibv_get_devices() to ibv_get_device_list() In-Reply-To: References: Message-ID: <20051214212848.GJ17538@mellanox.co.il> Quoting r. Roland Dreier : > Michael> I wander whether we can come up with an API that helps > Michael> people get it right more easily? > > I guess we could return some opaque cookie or validate pointers using > a hash table or something like that. Lets just clarify this in the documentation. -- MST From rdreier at cisco.com Wed Dec 14 13:29:17 2005 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 14 Dec 2005 13:29:17 -0800 Subject: [openib-general] Re: ipoib: question In-Reply-To: <20051214212516.GI17538@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 14 Dec 2005 23:25:16 +0200") References: <20051214212516.GI17538@mellanox.co.il> Message-ID: Roland> The idea is that we want to get something aligned to 8 bytes. Michael> Does & 4 do that? I think so -- I'd have to check to be sure, but the idea is that if the offset of ha is 4 mod 8, then we should subtract off 4, and if it's 0 mod 8, then we should subtract off 0. So "& 4" is right, I think. - R. 
From huanwei at cse.ohio-state.edu Wed Dec 14 13:36:38 2005 From: huanwei at cse.ohio-state.edu (wei huang) Date: Wed, 14 Dec 2005 16:36:38 -0500 (EST) Subject: [openib-general] *** glibc detected *** corrupted double-linked list error Message-ID: Hi, We encountered the following error when we call ibv_close_device: *** glibc detected *** corrupted double-linked list: 0x0000000000a54e10 *** Could someone tell us what could be the possible reasons for this error? Thanks! Regards, Wei Huang 774 Dreese Lab, 2015 Neil Ave, Dept. of Computer Science and Engineering Ohio State University OH 43210 Tel: (614)292-8501 From huanwei at cse.ohio-state.edu Wed Dec 14 13:44:52 2005 From: huanwei at cse.ohio-state.edu (wei huang) Date: Wed, 14 Dec 2005 16:44:52 -0500 (EST) Subject: [openib-general] *** glibc detected *** corrupted double-linked list error In-Reply-To: Message-ID: Sorry we forget some detailed information: 1) we use gen2 svn revision 4344 with linux kernel 2.6.14 2) Machine is Opteron. On Wed, 14 Dec 2005, wei huang wrote: > Hi, > > We encountered the following error when we call ibv_close_device: > *** glibc detected *** corrupted double-linked list: 0x0000000000a54e10 *** > > Could someone tell us what could be the possible reasons for this error? > > Thanks! > > Regards, > Wei Huang > > 774 Dreese Lab, 2015 Neil Ave, > Dept. of Computer Science and Engineering > Ohio State University > OH 43210 > Tel: (614)292-8501 > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From mst at mellanox.co.il Wed Dec 14 13:59:15 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 14 Dec 2005 23:59:15 +0200 Subject: [openib-general] Re: ipoib: question In-Reply-To: References: Message-ID: <20051214215915.GA18526@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: ipoib: question > > Roland> The idea is that we want to get something aligned to 8 bytes. > > Michael> Does & 4 do that? > > I think so -- I'd have to check to be sure, but the idea is that if > the offset of ha is 4 mod 8, then we should subtract off 4, and if > it's 0 mod 8, then we should subtract off 0. So "& 4" is right, I think. Oh, I see. Is this better? - return (struct ipoib_neigh **) (neigh->ha + 24 - - (offsetof(struct neighbour, ha) & 4)); + return (void*)neigh + ALIGN(offsetof(struct neighbour, ha) + INFINIBAND_ALEN, x) -- MST From nacc at us.ibm.com Wed Dec 14 15:23:08 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Wed, 14 Dec 2005 15:23:08 -0800 Subject: [openib-general] [PATCH] add LDFLAGS to perftest/Makefile In-Reply-To: <20051214171452.GA26274@esmail.cup.hp.com> References: <20051214095726.682B32283EE@openib.ca.sandia.gov> <20051214171452.GA26274@esmail.cup.hp.com> Message-ID: <20051214232308.GA3369@us.ibm.com> On 14.12.2005 [09:14:52 -0800], Grant Grundler wrote: > On Wed, Dec 14, 2005 at 01:57:26AM -0800, sagir at openib.org wrote: > > Author: sagir > > Date: 2005-12-14 01:57:24 -0800 (Wed, 14 Dec 2005) > > New Revision: 4453 > > > > Modified: > > trunk/contrib/mellanox/gen2/src/userspace/perftest/rdma_lat.c > > Can someone from mellanox explain why mainline src/userspace > is cloned under contrib/mellanox? Is there a reason the perftest/Makefile doesn't use LDFLAGS? 
Specifically, in automating userspace build & test, I put the IB libraries in a temporary directory, and exporting CFLAGS and LDFLAGS works with all other Makefiles (well, the ones I expect to work), but perftest does not seem to pick up my exports. Would something like the following make sense (sorry if a different -p is preferred)? Or does it need to be +=? Description: Add LDFLAGS to the perftest Makefile to allow library directories in non-standard locations to be specified. Signed-off-by: Nishanth Aravamudan --- Makefile 2005-12-14 14:57:04.000000000 -0800 +++ Makefile.ldflags 2005-12-14 14:57:23.000000000 -0800 @@ -2,6 +2,7 @@ TESTS = rdma_lat rdma_bw all: ${TESTS} +LDFLAGS = CFLAGS += -Wall -O2 -g -D_GNU_SOURCE LOADLIBES += -libverbs EXTRA_FILES = get_clock.c From nacc at us.ibm.com Wed Dec 14 18:25:19 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Wed, 14 Dec 2005 18:25:19 -0800 Subject: [openib-general] Re: [PATCH] add LDFLAGS to perftest/Makefile In-Reply-To: <20051214232308.GA3369@us.ibm.com> References: <20051214095726.682B32283EE@openib.ca.sandia.gov> <20051214171452.GA26274@esmail.cup.hp.com> <20051214232308.GA3369@us.ibm.com> Message-ID: <20051215022519.GF11674@us.ibm.com> On 14.12.2005 [15:23:08 -0800], Nishanth Aravamudan wrote: > On 14.12.2005 [09:14:52 -0800], Grant Grundler wrote: > > On Wed, Dec 14, 2005 at 01:57:26AM -0800, sagir at openib.org wrote: > > > Author: sagir > > > Date: 2005-12-14 01:57:24 -0800 (Wed, 14 Dec 2005) > > > New Revision: 4453 > > > > > > Modified: > > > trunk/contrib/mellanox/gen2/src/userspace/perftest/rdma_lat.c > > > > Can someone from mellanox explain why mainline src/userspace > > is cloned under contrib/mellanox? > > Is there a reason the perftest/Makefile doesn't use LDFLAGS? > Specifically, in automating userspace build & test, I put the IB > libraries in a temporary directory, and exporting CFLAGS and LDFLAGS > works with all other Makefiles (well, the ones I expect to work), but > perftest does not seem to pick up my exports. > > Would something like the following make sense (sorry if a different -p > is preferred)? Or does it need to be +=? It does need to be +=... Description: Add LDFLAGS to perftest/Makefile to allow non-standard library location. 
Signed-off-by: Nishanth Aravamudan --- Makefile 2005-12-14 14:57:04.000000000 -0800 +++ Makefile.ldflags 2005-12-14 14:57:23.000000000 -0800 @@ -2,6 +2,7 @@ TESTS = rdma_lat rdma_bw all: ${TESTS} +LDFLAGS += CFLAGS += -Wall -O2 -g -D_GNU_SOURCE LOADLIBES += -libverbs EXTRA_FILES = get_clock.c From nacc at us.ibm.com Wed Dec 14 18:29:04 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Wed, 14 Dec 2005 18:29:04 -0800 Subject: [openib-general] Compilation failure in perftest/rdma_lat (latest svn) Message-ID: <20051215022904.GG11674@us.ibm.com> Hi, ppc32 version of perftest is failing with: gcc -m32 -m32 -I/usr/local/autobench/var/tmp/out/ppc32/include -Wall -O2 -g -D_GNU_SOURCE rdma_lat.c get_clock.c -libverbs -o rdma_lat /usr/local/autobench/var/tmp//ccJld3yB.o(.text+0x6c8): In function `main': /usr/local/autobench/var/tmp/gen2-trunk/userspace/perftest/rdma_lat.c:111: undefined reference to `ibv_get_device_list' collect2: ld returned 1 exit status Thanks, Nish From rdreier at cisco.com Wed Dec 14 19:16:10 2005 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 14 Dec 2005 19:16:10 -0800 Subject: [openib-general] Compilation failure in perftest/rdma_lat (latest svn) In-Reply-To: <20051215022904.GG11674@us.ibm.com> (Nishanth Aravamudan's message of "Wed, 14 Dec 2005 18:29:04 -0800") References: <20051215022904.GG11674@us.ibm.com> Message-ID: > /usr/local/autobench/var/tmp/gen2-trunk/userspace/perftest/rdma_lat.c:111: undefined reference to `ibv_get_device_list' Is your build in sync with the latest tree? ibv_get_device_list was added to libibverbs in the same changeset that changed perftest to use it. - R. From rdreier at cisco.com Wed Dec 14 19:17:36 2005 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 14 Dec 2005 19:17:36 -0800 Subject: [openib-general] *** glibc detected *** corrupted double-linked list error In-Reply-To: (wei huang's message of "Wed, 14 Dec 2005 16:36:38 -0500 (EST)") References: Message-ID: wei> Hi, We encountered the following error when we call wei> ibv_close_device: *** glibc detected *** corrupted wei> double-linked list: 0x0000000000a54e10 *** wei> Could someone tell us what could be the possible reasons for wei> this error? Probably a memory-management bug somewhere. Can you get a traceback from a core dump when this happens? - R. From nacc at us.ibm.com Wed Dec 14 19:26:04 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Wed, 14 Dec 2005 19:26:04 -0800 Subject: [openib-general] Compilation failure in perftest/rdma_lat (latest svn) In-Reply-To: <20051215022904.GG11674@us.ibm.com> References: <20051215022904.GG11674@us.ibm.com> Message-ID: <20051215032604.GI11674@us.ibm.com> On 14.12.2005 [18:29:04 -0800], Nishanth Aravamudan wrote: > Hi, > > ppc32 version of perftest is failing with: > > gcc -m32 -m32 -I/usr/local/autobench/var/tmp/out/ppc32/include -Wall -O2 -g -D_GNU_SOURCE rdma_lat.c get_clock.c -libverbs -o rdma_lat > /usr/local/autobench/var/tmp//ccJld3yB.o(.text+0x6c8): In function `main': > /usr/local/autobench/var/tmp/gen2-trunk/userspace/perftest/rdma_lat.c:111: undefined reference to `ibv_get_device_list' > collect2: ld returned 1 exit status Nevermind, this is is an error on my part. Thanks, Nish From mst at mellanox.co.il Wed Dec 14 22:39:51 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 15 Dec 2005 08:39:51 +0200 Subject: [openib-general] Re: [PATCH] libmthca: fix error handling in mthca_store_qp In-Reply-To: References: Message-ID: <20051215063951.GB26191@mellanox.co.il> Quoting Roland Dreier : > Subject: Re: [PATCH] libmthca: fix error handling in mthca_store_qp > > Thanks, good catch. I decided it was clearer to split off the test of > refcnt, and only increment refcnt if the allocation succeeds. > > - R. > > --- libmthca/src/qp.c (revision 4465) > +++ libmthca/src/qp.c (working copy) > @@ -875,13 +875,15 @@ int mthca_store_qp(struct mthca_context > > pthread_mutex_lock(&ctx->qp_table_mutex); > > - if (!ctx->qp_table[tind].refcnt++) { > + if (!ctx->qp_table[tind].refcnt) { > ctx->qp_table[tind].table = calloc(ctx->qp_table_mask + 1, > sizeof (struct mthca_qp *)); > if (!ctx->qp_table[tind].table) { > ret = -1; > goto out; > } > + > + ++ctx->qp_table[tind].refcnt; > } > > ctx->qp_table[tind].table[qpn & ctx->qp_table_mask] = qp; This does not look right: it seems you are incrementing the counter from 0 to 1, but it never goes to 2. How about this: --- Only increment qp_table ref count if allocation succeeds. Signed-off-by: Michael S. Tsirkin Index: openib/src/userspace/libmthca/src/qp.c =================================================================== --- openib/src/userspace/libmthca/src/qp.c (revision 4466) +++ openib/src/userspace/libmthca/src/qp.c (working copy) @@ -875,7 +875,7 @@ int mthca_store_qp(struct mthca_context pthread_mutex_lock(&ctx->qp_table_mutex); - if (!ctx->qp_table[tind].refcnt++) { + if (!ctx->qp_table[tind].refcnt) { ctx->qp_table[tind].table = calloc(ctx->qp_table_mask + 1, sizeof (struct mthca_qp *)); if (!ctx->qp_table[tind].table) { @@ -884,6 +884,8 @@ int mthca_store_qp(struct mthca_context } } + ++ctx->qp_table[tind].refcnt; + ctx->qp_table[tind].table[qpn & ctx->qp_table_mask] = qp; out: -- MST From mst at mellanox.co.il Wed Dec 14 22:52:41 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 15 Dec 2005 08:52:41 +0200 Subject: [openib-general] Re: [PATCH] add LDFLAGS to perftest/Makefile In-Reply-To: <20051215022519.GF11674@us.ibm.com> References: <20051215022519.GF11674@us.ibm.com> Message-ID: <20051215065241.GE26191@mellanox.co.il> Quoting Nishanth Aravamudan : > > Is there a reason the perftest/Makefile doesn't use LDFLAGS? > > Specifically, in automating userspace build & test, I put the IB > > libraries in a temporary directory, and exporting CFLAGS and LDFLAGS > > works with all other Makefiles (well, the ones I expect to work), but > > perftest does not seem to pick up my exports. I'll fix this. -- MST From mst at mellanox.co.il Wed Dec 14 22:57:24 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 15 Dec 2005 08:57:24 +0200 Subject: [openib-general] Re: [PATCH] add LDFLAGS to perftest/Makefile In-Reply-To: <20051214232308.GA3369@us.ibm.com> References: <20051214232308.GA3369@us.ibm.com> Message-ID: <20051215065723.GA26401@mellanox.co.il> Quoting Nishanth Aravamudan : > Is there a reason the perftest/Makefile doesn't use LDFLAGS? > Specifically, in automating userspace build & test, I put the IB > libraries in a temporary directory, and exporting CFLAGS and LDFLAGS > works with all other Makefiles (well, the ones I expect to work), but > perftest does not seem to pick up my exports. > > Would something like the following make sense (sorry if a different -p > is preferred)? Or does it need to be +=? 
> > Description: Add LDFLAGS to the perftest Makefile to allow library > directories in non-standard locations to be specified. Are you using gnu make? which version? Gnu make should use LDFLAGS automatically: Linking a single object file `N' is made automatically from `N.o' by running the linker (usually called `ld') via the C compiler. The precise command used is `$(CC) $(LDFLAGS) N.o $(LOADLIBES) $(LDLIBS)'. > Signed-off-by: Nishanth Aravamudan > > --- Makefile 2005-12-14 14:57:04.000000000 -0800 > +++ Makefile.ldflags 2005-12-14 14:57:23.000000000 -0800 > @@ -2,6 +2,7 @@ TESTS = rdma_lat rdma_bw > > all: ${TESTS} > > +LDFLAGS += > CFLAGS += -Wall -O2 -g -D_GNU_SOURCE > LOADLIBES += -libverbs > EXTRA_FILES = get_clock.c This really does nothing. Does this patch help you? -- MST From nacc at us.ibm.com Wed Dec 14 23:02:22 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Wed, 14 Dec 2005 23:02:22 -0800 Subject: [openib-general] Re: [PATCH] add LDFLAGS to perftest/Makefile In-Reply-To: <20051215065723.GA26401@mellanox.co.il> References: <20051214232308.GA3369@us.ibm.com> <20051215065723.GA26401@mellanox.co.il> Message-ID: <20051215070222.GK11674@us.ibm.com> On 15.12.2005 [08:57:24 +0200], Michael S. Tsirkin wrote: > Quoting Nishanth Aravamudan : > > Is there a reason the perftest/Makefile doesn't use LDFLAGS? > > Specifically, in automating userspace build & test, I put the IB > > libraries in a temporary directory, and exporting CFLAGS and LDFLAGS > > works with all other Makefiles (well, the ones I expect to work), but > > perftest does not seem to pick up my exports. > > > > Would something like the following make sense (sorry if a different -p > > is preferred)? Or does it need to be +=? > > > > Description: Add LDFLAGS to the perftest Makefile to allow library > > directories in non-standard locations to be specified. > > Are you using gnu make? which version? GNU Make 3.80 on SLES 9 SP2. > Gnu make should use LDFLAGS automatically: > > Linking a single object file > `N' is made automatically from `N.o' by running the linker > (usually called `ld') via the C compiler. The precise command > used is `$(CC) $(LDFLAGS) N.o $(LOADLIBES) $(LDLIBS)'. I thought this would be the case as well, but it didn't seem to work without the Makefile modification. > > Signed-off-by: Nishanth Aravamudan > > > > --- Makefile 2005-12-14 14:57:04.000000000 -0800 > > +++ Makefile.ldflags 2005-12-14 14:57:23.000000000 -0800 > > @@ -2,6 +2,7 @@ TESTS = rdma_lat rdma_bw > > > > all: ${TESTS} > > > > +LDFLAGS += > > CFLAGS += -Wall -O2 -g -D_GNU_SOURCE > > LOADLIBES += -libverbs > > EXTRA_FILES = get_clock.c > > This really does nothing. Does this patch help you? I didn't think it should do anything either, but it did allow the make to work on both ppc32 and ppc64 with LDFLAGS exported in the environment. Without the change, the build would fail as it would not have the appropriate -L flags. Thanks, Nish From mst at mellanox.co.il Wed Dec 14 23:35:12 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 15 Dec 2005 09:35:12 +0200 Subject: [openib-general] Re: [PATCH] add LDFLAGS to perftest/Makefile In-Reply-To: <20051215070222.GK11674@us.ibm.com> References: <20051215070222.GK11674@us.ibm.com> Message-ID: <20051215073512.GA26722@mellanox.co.il> Quoting r. Nishanth Aravamudan : > Subject: Re: [PATCH] add LDFLAGS to perftest/Makefile > > On 15.12.2005 [08:57:24 +0200], Michael S. Tsirkin wrote: > > Quoting Nishanth Aravamudan : > > > Is there a reason the perftest/Makefile doesn't use LDFLAGS? 
> > > Specifically, in automating userspace build & test, I put the IB > > > libraries in a temporary directory, and exporting CFLAGS and LDFLAGS > > > works with all other Makefiles (well, the ones I expect to work), > but > > > perftest does not seem to pick up my exports. > > > > > > Would something like the following make sense (sorry if a different > -p > > > is preferred)? Or does it need to be +=? > > > > > > Description: Add LDFLAGS to the perftest Makefile to allow library > > > directories in non-standard locations to be specified. > > > > Are you using gnu make? which version? > > GNU Make 3.80 on SLES 9 SP2. > > > Gnu make should use LDFLAGS automatically: > > > > Linking a single object file > > `N' is made automatically from `N.o' by running the linker > > (usually called `ld') via the C compiler. The precise command > > used is `$(CC) $(LDFLAGS) N.o $(LOADLIBES) $(LDLIBS)'. > > I thought this would be the case as well, but it didn't seem to work > without the Makefile modification. > > > > Signed-off-by: Nishanth Aravamudan > > > > > > --- Makefile 2005-12-14 14:57:04.000000000 -0800 > > > +++ Makefile.ldflags 2005-12-14 14:57:23.000000000 -0800 > > > @@ -2,6 +2,7 @@ TESTS = rdma_lat rdma_bw > > > > > > all: ${TESTS} > > > > > > +LDFLAGS += > > > CFLAGS += -Wall -O2 -g -D_GNU_SOURCE > > > LOADLIBES += -libverbs > > > EXTRA_FILES = get_clock.c > > > > This really does nothing. Does this patch help you? > > I didn't think it should do anything either, but it did allow the make > to work on both ppc32 and ppc64 with LDFLAGS exported in the > environment. Without the change, the build would fail as it would not > have the appropriate -L flags. Looks like a work around for bug in make. I'll have a look. -- MST From mst at mellanox.co.il Thu Dec 15 00:30:35 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 15 Dec 2005 10:30:35 +0200 Subject: [openib-general] Re: [openib-commits] r4453 - trunk/contrib/mellanox/gen2/src/userspace/perftest In-Reply-To: <20051214171452.GA26274@esmail.cup.hp.com> References: <20051214171452.GA26274@esmail.cup.hp.com> Message-ID: <20051215083035.GF26722@mellanox.co.il> Quoting Grant Grundler : > On Wed, Dec 14, 2005 at 01:57:26AM -0800, sagir at openib.org wrote: > > Author: sagir > > Date: 2005-12-14 01:57:24 -0800 (Wed, 14 Dec 2005) > > New Revision: 4453 > > > > Modified: > > trunk/contrib/mellanox/gen2/src/userspace/perftest/rdma_lat.c > > Can someone from mellanox explain why mainline src/userspace > is cloned under contrib/mellanox? Thats how subversion handles tags: we are tagging mainline approximately weekly. I could put it in some other place - just didnt want to interfere with people. > > Log: > > mtu per device > > You guys are certainly welcome to add stuff to contrib/mellanox. > I just would like to be able to explain to HP management why there > are two versions of rmda_lat.c. > > thanks, > grant Sagi is working on adding performance tests. We are not ready to get community feedback on that, yet. I agree it is probably better to leave rdma_lat.c alone and add new tests under new names. I'll fix that next week. -- MST From mst at mellanox.co.il Thu Dec 15 00:46:47 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 15 Dec 2005 10:46:47 +0200 Subject: [openib-general] [PATCH applied] kill dead code around kmap_atomic Message-ID: <20051215084647.GH26722@mellanox.co.il> kmap_atomic never returns NULL. Signed-off-by: Michael S. 
Tsirkin Index: linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- linux-2.6.14.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2005-11-15 21:10:49.000000000 +0200 +++ linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib.h 2005-12-07 21:23:33.000000000 +0200 @@ -78,6 +78,7 @@ enum { IPOIB_FLAG_SUBINTERFACE = 4, IPOIB_MCAST_RUN = 5, IPOIB_STOP_REAPER = 6, + IPOIB_MCAST_STARTED = 7, IPOIB_MAX_BACKOFF_SECONDS = 16, Index: linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- linux-2.6.14.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2005-12-07 17:42:56.000000000 +0200 +++ linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2005-12-08 20:28:02.000000000 +0200 @@ -203,16 +203,20 @@ static int ipoib_mcast_join_finish(struc { struct net_device *dev = mcast->dev; struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned long flags; int ret; mcast->mcmember = *mcmember; + spin_lock_irqsave(&priv->lock, flags); /* Set the cached Q_Key before we attach if it's the broadcast group */ - if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4, + if (priv->broadcast && + !memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4, sizeof (union ib_gid))) { priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey); priv->tx_wr.wr.ud.remote_qkey = priv->qkey; } + spin_unlock_irqrestore(&priv->lock, flags); if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { if (test_and_set_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { @@ -582,6 +586,10 @@ int ipoib_mcast_start_thread(struct net_ queue_work(ipoib_workqueue, &priv->mcast_task); up(&mcast_mutex); + spin_lock_irq(&priv->lock); + set_bit(IPOIB_MCAST_STARTED, &priv->flags); + spin_unlock_irq(&priv->lock); + return 0; } @@ -592,6 +600,10 @@ int ipoib_mcast_stop_thread(struct net_d ipoib_dbg_mcast(priv, "stopping multicast thread\n"); + spin_lock_irq(&priv->lock); + clear_bit(IPOIB_MCAST_STARTED, &priv->flags); + spin_unlock_irq(&priv->lock); + down(&mcast_mutex); clear_bit(IPOIB_MCAST_RUN, &priv->flags); cancel_delayed_work(&priv->mcast_task); @@ -674,6 +686,9 @@ void ipoib_mcast_send(struct net_device */ spin_lock(&priv->lock); + if (!test_bit(IPOIB_MCAST_STARTED, &priv->flags)) + goto unlock; + mcast = __ipoib_mcast_find(dev, mgid); if (!mcast) { /* Let's create a new send only group now */ @@ -732,6 +747,7 @@ out: ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN); } +unlock: spin_unlock(&priv->lock); } Index: linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_send.c =================================================================== --- linux-2.6.14.orig/drivers/infiniband/ulp/sdp/sdp_send.c 2005-12-15 13:25:11.000000000 +0200 +++ linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_send.c 2005-12-15 13:25:49.000000000 +0200 @@ -639,11 +639,6 @@ static int sdp_send_data_iocb_src(struct local_irq_save(flags); addr = kmap_atomic(iocb->page_array[pos], KM_IRQ0); - if (!addr) { - result = -ENOMEM; - local_irq_restore(flags); - goto error; - } memcpy(buff->tail, addr + off, len); @@ -711,10 +706,6 @@ static int sdp_send_iocb_buff_write(stru local_irq_save(flags); addr = kmap_atomic(iocb->page_array[counter], KM_IRQ0); - if (!addr) { - local_irq_restore(flags); - break; - } copy = min(PAGE_SIZE - offset, (unsigned long)(buff->end - buff->tail)); Index: linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_recv.c =================================================================== --- 
linux-2.6.14.orig/drivers/infiniband/ulp/sdp/sdp_recv.c 2005-12-15 13:25:11.000000000 +0200 +++ linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_recv.c 2005-12-15 13:25:53.000000000 +0200 @@ -599,8 +599,6 @@ static int sdp_read_buff_iocb(struct sdp local_irq_save(flags); addr = kmap_atomic(iocb->page_array[counter], KM_IRQ0); - if (!addr) - break; copy = min(PAGE_SIZE - offset, (unsigned long)(buff->tail - buff->data)); -- MST From mst at mellanox.co.il Thu Dec 15 00:49:45 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 15 Dec 2005 10:49:45 +0200 Subject: [openib-general] Re: [PATCH applied] kill dead code around kmap_atomic In-Reply-To: <20051215084647.GH26722@mellanox.co.il> References: <20051215084647.GH26722@mellanox.co.il> Message-ID: <20051215084945.GI26722@mellanox.co.il> Quoting r. Michael S. Tsirkin : > Subject: [PATCH applied] kill dead code around kmap_atomic > > kmap_atomic never returns NULL. > > Signed-off-by: Michael S. Tsirkin Sorry, wrong patch. Here it is: Index: linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_send.c =================================================================== --- linux-2.6.14.orig/drivers/infiniband/ulp/sdp/sdp_send.c 2005-12-15 13:25:11.000000000 +0200 +++ linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_send.c 2005-12-15 13:25:49.000000000 +0200 @@ -639,11 +639,6 @@ static int sdp_send_data_iocb_src(struct local_irq_save(flags); addr = kmap_atomic(iocb->page_array[pos], KM_IRQ0); - if (!addr) { - result = -ENOMEM; - local_irq_restore(flags); - goto error; - } memcpy(buff->tail, addr + off, len); @@ -711,10 +706,6 @@ static int sdp_send_iocb_buff_write(stru local_irq_save(flags); addr = kmap_atomic(iocb->page_array[counter], KM_IRQ0); - if (!addr) { - local_irq_restore(flags); - break; - } copy = min(PAGE_SIZE - offset, (unsigned long)(buff->end - buff->tail)); Index: linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_recv.c =================================================================== --- linux-2.6.14.orig/drivers/infiniband/ulp/sdp/sdp_recv.c 2005-12-15 13:25:11.000000000 +0200 +++ linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_recv.c 2005-12-15 13:25:53.000000000 +0200 @@ -599,8 +599,6 @@ static int sdp_read_buff_iocb(struct sdp local_irq_save(flags); addr = kmap_atomic(iocb->page_array[counter], KM_IRQ0); - if (!addr) - break; copy = min(PAGE_SIZE - offset, (unsigned long)(buff->tail - buff->data)); -- MST From jackm at mellanox.co.il Thu Dec 15 01:23:03 2005 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Thu, 15 Dec 2005 11:23:03 +0200 Subject: [openib-general] [PATCH] libmthca: fix SRQ cleanup during destroy-qp Message-ID: <20051215092303.GA27784@mellanox.co.il> When cleaning up a CQ for a QP attached to SRQ, need to free an SRQ wqe only if the CQE is a receive completion. Signed-off-by: Jack Morgenstein Index: latest/src/userspace/libmthca/src/cq.c =================================================================== --- latest.orig/src/userspace/libmthca/src/cq.c +++ latest/src/userspace/libmthca/src/cq.c @@ -121,6 +121,13 @@ struct mthca_err_cqe { uint8_t owner; }; +static inline int is_recv_cqe(struct mthca_cqe * cqe) +{ + return (((cqe->opcode & MTHCA_ERROR_CQE_OPCODE_MASK) == + MTHCA_ERROR_CQE_OPCODE_MASK) ? 
+ !(cqe->opcode & 0x01) : !(cqe->is_send & 0x80)); +} + static inline struct mthca_cqe *get_cqe(struct mthca_cq *cq, int entry) { return cq->buf + entry * MTHCA_CQ_ENTRY_SIZE; @@ -549,7 +556,7 @@ void mthca_cq_clean(struct mthca_cq *cq, while ((int) --prod_index - (int) cq->cons_index >= 0) { cqe = get_cqe(cq, prod_index & cq->ibv_cq.cqe); if (cqe->my_qpn == htonl(qpn)) { - if (srq) + if (srq && is_recv_cqe(cqe)) mthca_free_srq_wqe(srq, ntohl(cqe->wqe) >> srq->wqe_shift); ++nfreed; From jackm at mellanox.co.il Thu Dec 15 01:26:18 2005 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Thu, 15 Dec 2005 11:26:18 +0200 Subject: [openib-general] [patch] mthca: fix SRQ cleanup during destroy-qp Message-ID: <20051215092618.GB27784@mellanox.co.il> When cleaning up a CQ for a QP attached to SRQ, need to free an SRQ wqe only if the CQE is a receive completion. Signed-off-by: Jack Morgenstein Index: openib/drivers/infiniband/hw/mthca/mthca_cq.c =================================================================== --- openib.orig/drivers/infiniband/hw/mthca/mthca_cq.c +++ openib/drivers/infiniband/hw/mthca/mthca_cq.c @@ -150,6 +150,13 @@ struct mthca_err_cqe { #define MTHCA_ARBEL_CQ_DB_REQ_NOT (2 << 24) #define MTHCA_ARBEL_CQ_DB_REQ_NOT_MULT (3 << 24) +static inline int is_recv_cqe(struct mthca_cqe * cqe) +{ + return (((cqe->opcode & MTHCA_ERROR_CQE_OPCODE_MASK) == + MTHCA_ERROR_CQE_OPCODE_MASK) ? + !(cqe->opcode & 0x01) : !(cqe->is_send & 0x80)); +} + static inline struct mthca_cqe *get_cqe(struct mthca_cq *cq, int entry) { if (cq->is_direct) @@ -296,7 +303,7 @@ void mthca_cq_clean(struct mthca_dev *de while ((int) --prod_index - (int) cq->cons_index >= 0) { cqe = get_cqe(cq, prod_index & cq->ibcq.cqe); if (cqe->my_qpn == cpu_to_be32(qpn)) { - if (srq) + if (srq && is_recv_cqe(cqe)) mthca_free_srq_wqe(srq, be32_to_cpu(cqe->wqe)); ++nfreed; } else if (nfreed) From jackm at mellanox.co.il Thu Dec 15 01:51:58 2005 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Thu, 15 Dec 2005 11:51:58 +0200 Subject: [openib-general] [PATCH] mthca: fix SRQ cleanup during destroy-qp Message-ID: <20051215095158.GA27874@mellanox.co.il> When cleaning up a CQ for a QP attached to SRQ, need to free an SRQ wqe only if the CQE is a receive completion. Signed-off-by: Jack Morgenstein Index: openib/drivers/infiniband/hw/mthca/mthca_cq.c =================================================================== --- openib.orig/drivers/infiniband/hw/mthca/mthca_cq.c +++ openib/drivers/infiniband/hw/mthca/mthca_cq.c @@ -150,6 +150,13 @@ struct mthca_err_cqe { #define MTHCA_ARBEL_CQ_DB_REQ_NOT (2 << 24) #define MTHCA_ARBEL_CQ_DB_REQ_NOT_MULT (3 << 24) +static inline int is_recv_cqe(struct mthca_cqe * cqe) +{ + return (((cqe->opcode & MTHCA_ERROR_CQE_OPCODE_MASK) == + MTHCA_ERROR_CQE_OPCODE_MASK) ? + !(cqe->opcode & 0x01) : !(cqe->is_send & 0x80)); +} + static inline struct mthca_cqe *get_cqe(struct mthca_cq *cq, int entry) { if (cq->is_direct) @@ -296,7 +303,7 @@ void mthca_cq_clean(struct mthca_dev *de while ((int) --prod_index - (int) cq->cons_index >= 0) { cqe = get_cqe(cq, prod_index & cq->ibcq.cqe); if (cqe->my_qpn == cpu_to_be32(qpn)) { - if (srq) + if (srq && is_recv_cqe(cqe)) mthca_free_srq_wqe(srq, be32_to_cpu(cqe->wqe)); ++nfreed; } else if (nfreed) From tziporet at mellanox.co.il Thu Dec 15 06:37:40 2005 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 15 Dec 2005 16:37:40 +0200 Subject: [openib-general] Next workshop dates? 
	Please respond with your preferences
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E366C4B4@mtlexch01.mtl.com>

Hi,

Regarding dates - Mellanox prefer not to have the workshop on Feb 5 but
from the week of 12 Feb or afterward

Tziporet

-----Original Message-----
From: Eitan Zahavi
Sent: Wednesday, December 14, 2005 9:25 PM
To: Bill Boas
Cc: openib-general at openib.org; openib-promoters at openib.org
Subject: Re: [openib-general] Next workshop dates? Please respond with
your preferences

Hi,

I would like to propose the following agenda topics:

Core Enhancements:
QoS - directions for integration and support by OpenIB stack
Partitions - recognize areas needing enhancements
Multicast, Services and InformInfo Registrations: reference counting
and re-registrations - implementation plan/API

Diagnostics:
Describe/discuss new diagnostic tools feature.

Eitan

_______________________________________________
openib-general mailing list
openib-general at openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit
http://openib.org/mailman/listinfo/openib-general

From halr at voltaire.com Thu Dec 15 06:32:54 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 15 Dec 2005 09:32:54 -0500
Subject: [openib-general] [PATCH] OpenSM/ib_types.h: Modify ib_port_info_compute_rate
Message-ID: <1134657173.4334.9122.camel@hal.voltaire.com>

OpenSM/ib_types.h: Modify ib_port_info_compute_rate so that
gcc version 4.0.0 20050519 (Red Hat 4.0.0-8) doesn't complain
when compiling osm_sa_path_record.c as follows:

osm_sa_path_record.c: In function ‘__osm_pr_rcv_get_path_parms’:
osm_sa_path_record.c:194: warning: control may reach end of non-void
function ‘ib_port_info_compute_rate’ being inlined

Signed-off-by: Hal Rosenstock

Index: include/iba/ib_types.h
===================================================================
--- include/iba/ib_types.h	(revision 4479)
+++ include/iba/ib_types.h	(working copy)
@@ -4290,59 +4290,76 @@
 static inline uint8_t
 ib_port_info_compute_rate(
 	IN const ib_port_info_t* const p_pi )
 {
+	uint8_t rate = 0;
+
 	switch (ib_port_info_get_link_speed_active(p_pi)) {
 	case IB_LINK_SPEED_ACTIVE_2_5:
 		switch (p_pi->link_width_active) {
 		case IB_LINK_WIDTH_ACTIVE_1X:
-			return IB_PATH_RECORD_RATE_2_5_GBS;
+			rate = IB_PATH_RECORD_RATE_2_5_GBS;
+			break;
 		case IB_LINK_WIDTH_ACTIVE_4X:
-			return IB_PATH_RECORD_RATE_10_GBS;
-
+			rate = IB_PATH_RECORD_RATE_10_GBS;
+			break;
+
 		case IB_LINK_WIDTH_ACTIVE_12X:
-			return IB_PATH_RECORD_RATE_30_GBS;
-
+			rate = IB_PATH_RECORD_RATE_30_GBS;
+			break;
+
 		default:
-			return IB_PATH_RECORD_RATE_2_5_GBS;
+			rate = IB_PATH_RECORD_RATE_2_5_GBS;
+			break;
 		}
 		break;
 
 	case IB_LINK_SPEED_ACTIVE_5:
 		switch (p_pi->link_width_active) {
 		case IB_LINK_WIDTH_ACTIVE_1X:
-			return IB_PATH_RECORD_RATE_5_GBS;
-
+			rate = IB_PATH_RECORD_RATE_5_GBS;
+			break;
+
 		case IB_LINK_WIDTH_ACTIVE_4X:
-			return IB_PATH_RECORD_RATE_20_GBS;
-
+			rate = IB_PATH_RECORD_RATE_20_GBS;
+			break;
+
 		case IB_LINK_WIDTH_ACTIVE_12X:
-			return IB_PATH_RECORD_RATE_60_GBS;
-
+			rate = IB_PATH_RECORD_RATE_60_GBS;
+			break;
+
 		default:
-			return IB_PATH_RECORD_RATE_5_GBS;
+			rate = IB_PATH_RECORD_RATE_5_GBS;
+			break;
 		}
 		break;
 
 	case IB_LINK_SPEED_ACTIVE_10:
 		switch (p_pi->link_width_active) {
 		case IB_LINK_WIDTH_ACTIVE_1X:
-			return IB_PATH_RECORD_RATE_10_GBS;
-
+			rate = IB_PATH_RECORD_RATE_10_GBS;
+			break;
+
 		case IB_LINK_WIDTH_ACTIVE_4X:
-			return IB_PATH_RECORD_RATE_40_GBS;
-
+			rate = IB_PATH_RECORD_RATE_40_GBS;
+			break;
+
 		case IB_LINK_WIDTH_ACTIVE_12X:
-			return IB_PATH_RECORD_RATE_120_GBS;
-
+			rate = IB_PATH_RECORD_RATE_120_GBS;
+			break;
+
 		default:
-			return IB_PATH_RECORD_RATE_10_GBS;
+			rate = IB_PATH_RECORD_RATE_10_GBS;
+			break;
 		}
 		break;
 
 	default:
-		return IB_PATH_RECORD_RATE_2_5_GBS;
+		rate = IB_PATH_RECORD_RATE_2_5_GBS;
+		break;
 	}
+
+	return rate;
 }
 
 /*
 * PARAMETERS

From halr at voltaire.com Thu Dec 15 06:56:22 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 15 Dec 2005 09:56:22 -0500
Subject: [openib-general] [PATCH] OpenSM/osm_base.h: Modify some SM constants
Message-ID: <1134658582.4334.9332.camel@hal.voltaire.com>

OpenSM/osm_base.h: Modify some SM constants

Default SM Key should not be 0

Also, changed default SM priority to 1 so can have lower priority SM
(and not rely on low GUID comparison). Note also that priority sense
was inverted in comment.

Signed-off-by: Hal Rosenstock

Index: osm_base.h
===================================================================
--- osm_base.h	(revision 4478)
+++ osm_base.h	(working copy)
@@ -122,7 +122,7 @@ BEGIN_C_DECLS
 *
 * SYNOPSIS
 */
-#define OSM_DEFAULT_SM_KEY 0
+#define OSM_DEFAULT_SM_KEY 1
 /********/
 
 /****s* OpenSM: Base/OSM_DEFAULT_LMC
@@ -169,11 +169,11 @@ BEGIN_C_DECLS
 *
 * DESCRIPTION
 * Default SM priority value used by the OpenSM,
-* as defined in the SMInfo attribute. 0 is the highest priority.
+* as defined in the SMInfo attribute. 0 is the lowest priority.
 *
 * SYNOPSIS
 */
-#define OSM_DEFAULT_SM_PRIORITY 0
+#define OSM_DEFAULT_SM_PRIORITY 1
 /********/
 
 /****d* OpenSM: Base/OSM_DEFAULT_TMP_DIR

From rdreier at cisco.com Thu Dec 15 08:08:13 2005
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 15 Dec 2005 08:08:13 -0800
Subject: [openib-general] Re: [PATCH] libmthca: fix error handling in
	mthca_store_qp
In-Reply-To: <20051215063951.GB26191@mellanox.co.il> (Michael S. Tsirkin's
	message of "Thu, 15 Dec 2005 08:39:51 +0200")
References: <20051215063951.GB26191@mellanox.co.il>
Message-ID: 

good catch again, thanks -- applied

From mst at mellanox.co.il Thu Dec 15 08:43:37 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 15 Dec 2005 18:43:37 +0200
Subject: [openib-general] Re: [PATCH] mthca: correct max_rd_atomic handling
In-Reply-To: 
References: 
Message-ID: <20051215164337.GV26722@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: [PATCH] mthca: correct max_rd_atomic handling
>
> What happens if SAE and SRE are turned off and the consumer posts an
> RDMA read? Does it fail and generate an error completion?

Hardware guys confirmed that it does, as per spec: clearing these bits
is the way to tell hardware that we have max_rd_atomic set to 0.

I thought its obvious from documentation: do you think this needs
clarification?

> I don't think we want it to just stop the QP processing.

It doesnt do that.

--
MST

From mst at mellanox.co.il Thu Dec 15 08:44:02 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 15 Dec 2005 18:44:02 +0200
Subject: [openib-general] [PATCH] mthca thinko
Message-ID: <20051215164402.GW26722@mellanox.co.il>

Fix thinko in mthca_table_find: break only escapes from the innermost
loop.
Ishai Rabinovitch

Signed-off-by: Michael S.
Tsirkin Index: openib/drivers/infiniband/hw/mthca/mthca_memfree.c =================================================================== --- openib/drivers/infiniband/hw/mthca/mthca_memfree.c (revision 4369) +++ openib/drivers/infiniband/hw/mthca/mthca_memfree.c (working copy) @@ -232,9 +232,9 @@ void *mthca_table_find(struct mthca_icm_ list_for_each_entry(chunk, &icm->chunk_list, list) { for (i = 0; i < chunk->npages; ++i) { if (chunk->mem[i].length >= offset) { page = chunk->mem[i].page; - break; + goto out; } offset -= chunk->mem[i].length; } } -- MST From iod00d at hp.com Thu Dec 15 08:44:41 2005 From: iod00d at hp.com (Grant Grundler) Date: Thu, 15 Dec 2005 08:44:41 -0800 Subject: [openib-general] Re: [openib-commits] r4453 - trunk/contrib/mellanox/gen2/src/userspace/perftest In-Reply-To: <20051215083035.GF26722@mellanox.co.il> References: <20051214171452.GA26274@esmail.cup.hp.com> <20051215083035.GF26722@mellanox.co.il> Message-ID: <20051215164441.GA2923@esmail.cup.hp.com> On Thu, Dec 15, 2005 at 10:30:35AM +0200, Michael S. Tsirkin wrote: > > Can someone from mellanox explain why mainline src/userspace > > is cloned under contrib/mellanox? > > Thats how subversion handles tags: we are tagging mainline > approximately weekly. > I could put it in some other place - just didnt want to interfere with > people. If the plan is to move them to mainline anyway, I'm perfectly happy to have him do the work on mainline. This is not a broadly used test that everyone depends on. People can continue using the old versions if they really have to. > Sagi is working on adding performance tests. > We are not ready to get community feedback on that, yet. Nice! I'll remind that "open source" is about collaboration. If Sagi gets a new test working on one architecture, he has a good basis to ask for help from the list with running on other architectures. > I agree it is probably better to leave rdma_lat.c alone and add > new tests under new names. I'll fix that next week. If you think that's necessary....I don't mind one test that deals with multiple variants of latency testing. thanks, grant From bardov at gmail.com Thu Dec 15 09:07:47 2005 From: bardov at gmail.com (Dan Bar Dov) Date: Thu, 15 Dec 2005 19:07:47 +0200 Subject: [openib-general] QP from userspace used in kernel In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F10C2BF6@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F10C2BF6@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: This is what we are doing in ISER (creating the QP in kernel as the connect action of a socket provider that is accessed by userspace). Dan On 12/8/05, Caitlin Bestler wrote: > openib-general-bounces at openib.org wrote: > > Hi, > > > > What are the reasons that a qp allocated in user-space can't > > be passed and used by a kernel module? > > What are the steps needed to make a userspace-allocated qp usable by > > a kernel module? > > > > > > You would need to construct an environment such that the > device-specific verbs module, which assumes it is executing > in the user space where the QP was created, would never notice > the difference. > > The device-specific verbs will typically have created shared > memory resources that are accessible by both the RDMA device > and from the creating user memory map. These resources may > include pointers that assume the original memory map. 
The > exact methods of remembering the locations of these resources > will vary by device, so the chance of coming up with a scheme > that works without explicit support of all device vendors is > very low. The chances of convincing all device vendors to add > a new option to support this model is similarly low unless you > can make a very compelling case as to why this is necessary. > > Having the in-kernel proxy create the QP and do operations > for the end-user is a very adequate work around. > > For complex cleanup purposes the kernel could simply assume > the identity of the failed process, but that would only be > required if the standard cleanup was somehow not adequate. > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From mst at mellanox.co.il Thu Dec 15 09:44:32 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 15 Dec 2005 19:44:32 +0200 Subject: [openib-general] Re: [PATCH] mthca thinko In-Reply-To: <20051215164402.GW26722@mellanox.co.il> References: <20051215164402.GW26722@mellanox.co.il> Message-ID: <20051215174432.GA29677@mellanox.co.il> Quoting Michael S. Tsirkin : > Subject: [PATCH] mthca thinko > > Fix thinko in mthca_table_find: break only escapes from the innermost > loop. > Ishai Rabinovitch Correction: Noticed by Ishai Rabinovitch -- MST From mst at mellanox.co.il Thu Dec 15 09:33:53 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 15 Dec 2005 19:33:53 +0200 Subject: [openib-general] kmap_atomic slot collision Message-ID: <20051215173353.GA29402@mellanox.co.il> Hi! I'm trying to use kmap_atomic from both interrupt and task context. My idea was to do local_irq_save and then use KM_IRQ0/KM_IRQ1: since I'm disabling interrupts I assumed that this should be safe. The relevant code is here: https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband/ulp/sdp/sdp_iocb.c However, under stress I see errors from arch/i386/mm/highmem.c:42 if (!pte_none(*(kmap_pte-idx))) BUG(); Apparently, my routine, running from a task context, races with some other kernel code, and so I'm trying to use a slot that was not yet unmapped. Anyone has an idea on what I could be doing wrong? Thanks, -- MST From tom at opengridcomputing.com Thu Dec 15 09:55:36 2005 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 15 Dec 2005 11:55:36 -0600 Subject: [openib-general] [PATCH] iWARP Support added to the CMA Message-ID: <1134669336.7186.2.camel@trinity.austin.ammasso.com> This is a patch to the iWARP branch that adds: - A generic iWARP transport CM module - Support for iWARP transports to the CMA - Modifications to the AMSO1100 driver for the iWARP transport CM - ULP add_one event changes to filter events based on node_type The code has been tested on IB and iWARP HCA with both the cmatose and krping applications. The code can also be checked out from the iWARP branch with these patches applied. 
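Of the items above, the node_type filtering is the one pattern repeated
across every ULP the patch touches; a minimal sketch of it follows (the
ULP name my_ulp is a placeholder, not a real module):

/* Sketch only: how an IB-only ULP skips iWARP devices in its
 * ib_client callbacks. An RNIC has no SM/SA, UD QPs, or IB CM, so
 * IB-only consumers bail out before allocating any per-device state;
 * the new iWARP CM performs the inverse check. */
#include <rdma/ib_verbs.h>

static void my_ulp_add_one(struct ib_device *device)
{
	if (device->node_type == IB_NODE_RNIC)
		return;
	/* ... normal per-device initialization ... */
}

static void my_ulp_remove_one(struct ib_device *device)
{
	if (device->node_type == IB_NODE_RNIC)
		return;
	/* ... normal per-device teardown ... */
}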
Signed-off-by: Tom Tucker Index: ulp/ipoib/ipoib_main.c =================================================================== --- ulp/ipoib/ipoib_main.c (revision 4186) +++ ulp/ipoib/ipoib_main.c (working copy) @@ -1024,6 +1024,9 @@ struct ipoib_dev_priv *priv; int s, e, p; + if (device->node_type == IB_NODE_RNIC) + return; + dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); if (!dev_list) return; @@ -1054,6 +1057,9 @@ struct ipoib_dev_priv *priv, *tmp; struct list_head *dev_list; + if (device->node_type == IB_NODE_RNIC) + return; + dev_list = ib_get_client_data(device, &ipoib_client); list_for_each_entry_safe(priv, tmp, dev_list, list) { Index: include/rdma/ib_verbs.h =================================================================== --- include/rdma/ib_verbs.h (revision 4186) +++ include/rdma/ib_verbs.h (working copy) @@ -805,7 +805,7 @@ struct ib_gid_cache **gid_cache; }; -struct iw_cm; +struct iw_cm_provider; struct ib_device { struct device *dma_device; @@ -822,7 +822,7 @@ u32 flags; - struct iw_cm *iwcm; + struct iw_cm_verbs *iwcm; int (*query_device)(struct ib_device *device, struct ib_device_attr *device_attr); Index: include/rdma/iw_cm.h =================================================================== --- include/rdma/iw_cm.h (revision 4186) +++ include/rdma/iw_cm.h (working copy) @@ -1,5 +1,7 @@ /* * Copyright (c) 2005 Ammasso, Inc. All rights reserved. + * Copyright (c) 2005 Network Appliance, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -33,112 +35,119 @@ #define IW_CM_H #include +#include -/* iWARP connection attributes. */ +struct iw_cm_id; +struct iw_cm_event; -struct iw_conn_attr { - struct in_addr local_addr; - struct in_addr remote_addr; - u16 local_port; - u16 remote_port; +enum iw_cm_event_type { + IW_CM_EVENT_CONNECT_REQUEST = 1, /* connect request received */ + IW_CM_EVENT_CONNECT_REPLY, /* reply from active connect request */ + IW_CM_EVENT_ESTABLISHED, + IW_CM_EVENT_LLP_DISCONNECT, + IW_CM_EVENT_LLP_RESET, + IW_CM_EVENT_LLP_TIMEOUT, + IW_CM_EVENT_CLOSE }; -/* This is provided in the event generated when - * a remote peer accepts our connect request - */ - -enum conn_result { - IW_CONN_ACCEPT = 0, - IW_CONN_RESET, - IW_CONN_PEER_REJECT, - IW_CONN_TIMEDOUT, - IW_CONN_NO_ROUTE_TO_HOST, - IW_CONN_INVALID_PARM +struct iw_cm_event { + enum iw_cm_event_type event; + int status; + u32 provider_id; + struct sockaddr_in local_addr; + struct sockaddr_in remote_addr; + void *private_data; + u8 private_data_len; }; - -/* This structure is provided in the event that - * completes an active connection request. - */ -struct iw_conn_results { - enum conn_result result; - struct iw_conn_attr conn_attr; - u8 *private_data; - int private_data_len; -}; -/* This is provided in the event generated by a remote - * connect request to a listening endpoint - */ -struct iw_conn_request { - u32 cr_id; - struct iw_conn_attr conn_attr; - u8 *private_data; - int private_data_len; -}; +typedef int (*iw_cm_handler)(struct iw_cm_id *cm_id, + struct iw_cm_event *event); -/* Connection events. 
*/ -enum iw_cm_event_type { - IW_EVENT_ACTIVE_CONNECT_RESULTS, - IW_EVENT_CONNECT_REQUEST, - IW_EVENT_DISCONNECT +enum iw_cm_state { + IW_CM_STATE_IDLE, /* unbound, inactive */ + IW_CM_STATE_LISTEN, /* listen waiting for connect */ + IW_CM_STATE_CONN_SENT, /* outbound waiting for peer accept */ + IW_CM_STATE_CONN_RECV, /* inbound waiting for user accept */ + IW_CM_STATE_ESTABLISHED, /* established */ }; -struct iw_cm_event { - struct ib_device *device; - union { - struct iw_conn_results active_results; - struct iw_conn_request conn_request; - } element; - enum iw_cm_event_type event; +typedef void (*iw_event_handler)(struct iw_cm_id* cm_id, + struct iw_cm_event* event); +struct iw_cm_id { + iw_cm_handler cm_handler; /* client callback function */ + void *context; /* context to provide to client cb */ + enum iw_cm_state state; + struct ib_device *device; + struct ib_qp *qp; + struct sockaddr_in local_addr; + struct sockaddr_in remote_addr; + u64 provider_id; /* device handle for this conn. */ + iw_event_handler event_handler; /* callback for IW CM Provider events */ }; -/* Listening endpoint. */ -struct iw_listen_ep_attr { - void (*event_handler)(struct iw_cm_event *, void *); - void *listen_context; - struct in_addr addr; - u16 port; - int backlog; -}; +/** + * iw_create_cm_id - Allocate a communication identifier. + * @device: Device associated with the cm_id. All related communication will + * be associated with the specified device. + * @cm_handler: Callback invoked to notify the user of CM events. + * @context: User specified context associated with the communication + * identifier. + * + * Communication identifiers are used to track connection states, + * addr resolution requests, and listen requests. + */ +struct iw_cm_id *iw_create_cm_id(struct ib_device *device, + iw_cm_handler cm_handler, + void *context); -struct iw_cm { +/* This is provided in the event generated when + * a remote peer accepts our connect request + */ - int (*connect_qp)(struct ib_qp *ib_qp, - struct iw_conn_attr* attr, - void (*event_handler)(struct iw_cm_event*, void*), - void* context, - u8 *pdata, - int pdata_len - ); +struct iw_cm_verbs { + int (*connect)(struct iw_cm_id* cm_id, + const void* private_data, + u8 private_data_len); + + int (*disconnect)(struct iw_cm_id* cm_id, + int abrupt); - int (*disconnect_qp)(struct ib_qp *qp, - int abrupt - ); + int (*accept)(struct iw_cm_id*, + const void *private_data, + u8 pdata_data_len); - int (*accept_cr)(struct ib_device* ibdev, - u32 cr_id, - struct ib_qp *qp, - void (*event_handler)(struct iw_cm_event*, void*), - void *context, - u8 *pdata, - int pdata_len); + int (*reject)(struct iw_cm_id* cm_id, + const void* private_data, + u8 private_data_len); - int (*reject_cr)(struct ib_device* ibdev, - u32 cr_id, - u8 *pdata, - int pdata_len); + int (*getpeername)(struct iw_cm_id* cm_id, + struct sockaddr_in* local_addr, + struct sockaddr_in* remote_addr); - int (*query_cr)(struct ib_device* ibdev, - u32 cr_id, - struct iw_conn_request* req); + int (*create_listen)(struct iw_cm_id* cm_id, + int backlog); - int (*create_listen_ep)(struct ib_device *ibdev, - struct iw_listen_ep_attr *ep_attrs, - void **ep_handle); + int (*destroy_listen)(struct iw_cm_id* cm_id); - int (*destroy_listen_ep)(struct ib_device *ibdev, - void *ep_handle); - }; +struct iw_cm_id *iw_create_cm_id(struct ib_device *device, + iw_cm_handler cm_handler, + void *context); +void iw_destroy_cm_id(struct iw_cm_id *cm_id); +int iw_cm_listen(struct iw_cm_id *cm_id, int backlog); +int 
iw_cm_getpeername(struct iw_cm_id *cm_id, + struct sockaddr_in* local_add, + struct sockaddr_in* remote_addr); +int iw_cm_reject(struct iw_cm_id *cm_id, + const void *private_data, + u8 private_data_len); +int iw_cm_accept(struct iw_cm_id *cm_id, + const void *private_data, + u8 private_data_len); +int iw_cm_connect(struct iw_cm_id *cm_id, + const void* pdata, u8 pdata_len); +int iw_cm_disconnect(struct iw_cm_id *cm_id); +int iw_cm_bind_qp(struct iw_cm_id* cm_id, struct ib_qp* qp); + #endif /* IW_CM_H */ Index: core/cm.c =================================================================== --- core/cm.c (revision 4186) +++ core/cm.c (working copy) @@ -3227,6 +3227,10 @@ int ret; u8 i; + /* Ignore RNIC devices */ + if (device->node_type == IB_NODE_RNIC) + return; + cm_dev = kmalloc(sizeof(*cm_dev) + sizeof(*port) * device->phys_port_cnt, GFP_KERNEL); if (!cm_dev) @@ -3291,6 +3295,10 @@ if (!cm_dev) return; + /* Ignore RNIC devices */ + if (device->node_type == IB_NODE_RNIC) + return; + write_lock_irqsave(&cm.device_lock, flags); list_del(&cm_dev->list); write_unlock_irqrestore(&cm.device_lock, flags); Index: core/iwcm.c =================================================================== --- core/iwcm.c (revision 0) +++ core/iwcm.c (revision 0) @@ -0,0 +1,671 @@ +/* + * Copyright (c) 2004, 2005 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved. + * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * Copyright (c) 2005 Network Appliance, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include "cm_msgs.h" + +MODULE_AUTHOR("Tom Tucker"); +MODULE_DESCRIPTION("iWARP CM"); +MODULE_LICENSE("Dual BSD/GPL"); + +static void iwcm_add_one(struct ib_device *device); +static void iwcm_remove_one(struct ib_device *device); +struct iwcm_id_private; + +static struct ib_client iwcm_client = { + .name = "cm", + .add = iwcm_add_one, + .remove = iwcm_remove_one +}; + +static struct { + spinlock_t lock; + struct list_head device_list; + rwlock_t device_lock; + struct workqueue_struct* wq; +} iwcm; + +struct iwcm_device; +struct iwcm_port { + struct iwcm_device *iwcm_dev; + struct sockaddr_in local_addr; + u8 port_num; +}; + +struct iwcm_device { + struct list_head list; + struct ib_device *device; + struct iwcm_port port[0]; +}; + +struct iwcm_id_private { + struct iw_cm_id id; + + spinlock_t lock; + wait_queue_head_t wait; + atomic_t refcount; + + struct rb_node listen_node; + + struct list_head work_list; + atomic_t work_count; +}; + +struct iwcm_work { + struct work_struct work; + struct iwcm_id_private* cm_id; + struct iw_cm_event event; +}; + +/* Called whenever a reference added for a cm_id */ +static inline void iwcm_addref_id(struct iwcm_id_private *cm_id_priv) +{ + atomic_inc(&cm_id_priv->refcount); +} + +/* Called whenever releasing a reference to a cm id */ +static inline void iwcm_deref_id(struct iwcm_id_private *cm_id_priv) +{ + if (atomic_dec_and_test(&cm_id_priv->refcount)) + wake_up(&cm_id_priv->wait); +} + +static void cm_event_handler(struct iw_cm_id* cm_id, struct iw_cm_event* event); + +struct iw_cm_id *iw_create_cm_id(struct ib_device *device, + iw_cm_handler cm_handler, + void *context) +{ + struct iwcm_id_private *iwcm_id_priv; + + iwcm_id_priv = kmalloc(sizeof *iwcm_id_priv, GFP_KERNEL); + if (!iwcm_id_priv) + return ERR_PTR(-ENOMEM); + + memset(iwcm_id_priv, 0, sizeof *iwcm_id_priv); + iwcm_id_priv->id.state = IW_CM_STATE_IDLE; + iwcm_id_priv->id.device = device; + iwcm_id_priv->id.cm_handler = cm_handler; + iwcm_id_priv->id.context = context; + iwcm_id_priv->id.event_handler = cm_event_handler; + + spin_lock_init(&iwcm_id_priv->lock); + init_waitqueue_head(&iwcm_id_priv->wait); + atomic_set(&iwcm_id_priv->refcount, 1); + + return &iwcm_id_priv->id; + +} +EXPORT_SYMBOL(iw_create_cm_id); + +struct iw_cm_id* iw_clone_id(struct iw_cm_id* parent) +{ + return iw_create_cm_id(parent->device, + parent->cm_handler, + parent->context); +} + +void iw_destroy_cm_id(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *iwcm_id_priv; + unsigned long flags; + int ret = 0; + + + iwcm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + + spin_lock_irqsave(&iwcm_id_priv->lock, flags); + switch (cm_id->state) { + case IW_CM_STATE_LISTEN: + cm_id->state = IW_CM_STATE_IDLE; + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + ret = cm_id->device->iwcm->destroy_listen(cm_id); + break; + + case IW_CM_STATE_CONN_RECV: + case IW_CM_STATE_CONN_SENT: + case IW_CM_STATE_ESTABLISHED: + cm_id->state = IW_CM_STATE_IDLE; + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + ret = cm_id->device->iwcm->disconnect(cm_id,1); + break; + + case IW_CM_STATE_IDLE: + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + break; + + default: + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + printk(KERN_ERR "%s:%s:%u Illegal state %d for iw_cm_id.\n", + __FILE__, __FUNCTION__, __LINE__, cm_id->state); + ; + } + + atomic_dec(&iwcm_id_priv->refcount); + 
wait_event(iwcm_id_priv->wait, !atomic_read(&iwcm_id_priv->refcount)); + + kfree(iwcm_id_priv); +} +EXPORT_SYMBOL(iw_destroy_cm_id); + +int iw_cm_listen(struct iw_cm_id *cm_id, int backlog) +{ + struct iwcm_id_private *iwcm_id_priv; + unsigned long flags; + int ret = 0; + + + iwcm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + + if (cm_id->device == 0) { + printk(KERN_ERR "device is NULL\n"); + return -EINVAL; + } + + if (cm_id->device->iwcm == 0) { + printk(KERN_ERR "iwcm is NULL\n"); + return -EINVAL; + } + + spin_lock_irqsave(&iwcm_id_priv->lock, flags); + if (cm_id->state != IW_CM_STATE_IDLE) { + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + return -EBUSY; + } + cm_id->state = IW_CM_STATE_LISTEN; + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + + ret = cm_id->device->iwcm->create_listen(cm_id, backlog); + if (ret != 0) { + spin_lock_irqsave(&iwcm_id_priv->lock, flags); + cm_id->state = IW_CM_STATE_IDLE; + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + } + return ret; +} +EXPORT_SYMBOL(iw_cm_listen); + +int iw_cm_getpeername(struct iw_cm_id *cm_id, + struct sockaddr_in* local_addr, + struct sockaddr_in* remote_addr) +{ + if (cm_id->device == 0) + return -EINVAL; + + if (cm_id->device->iwcm == 0) + return -EINVAL; + + /* Make sure there's a connection */ + if (cm_id->state != IW_CM_STATE_ESTABLISHED) + return -ENOTCONN; + + return cm_id->device->iwcm->getpeername(cm_id, local_addr, remote_addr); +} +EXPORT_SYMBOL(iw_cm_getpeername); + +int iw_cm_reject(struct iw_cm_id *cm_id, + const void *private_data, + u8 private_data_len) +{ + struct iwcm_id_private *iwcm_id_priv; + unsigned long flags; + int ret; + + + if (cm_id->device == 0 || cm_id->device->iwcm == 0) + return -EINVAL; + + iwcm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + + spin_lock_irqsave(&iwcm_id_priv->lock, flags); + switch (cm_id->state) { + case IW_CM_STATE_CONN_RECV: + ret = cm_id->device->iwcm->reject(cm_id, private_data, private_data_len); + cm_id->state = IW_CM_STATE_IDLE; + break; + default: + ret = -EINVAL; + goto out; + } + +out: spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); + return ret; +} +EXPORT_SYMBOL(iw_cm_reject); + +int iw_cm_accept(struct iw_cm_id *cm_id, + const void *private_data, + u8 private_data_len) +{ + struct iwcm_id_private *iwcm_id_priv; + int ret; + + if (cm_id->device == 0 || cm_id->device->iwcm == 0) + return -EINVAL; + + iwcm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + + switch (cm_id->state) { + case IW_CM_STATE_CONN_RECV: + ret = cm_id->device->iwcm->accept(cm_id, private_data, + private_data_len); + if (ret == 0) { + struct iw_cm_event event; + event.event = IW_CM_EVENT_ESTABLISHED; + event.provider_id = cm_id->provider_id; + event.status = 0; + event.local_addr = cm_id->local_addr; + event.remote_addr = cm_id->remote_addr; + event.private_data = 0; + event.private_data_len = 0; + cm_event_handler(cm_id, &event); + } + + break; + default: + ret = -EINVAL; + } + + return ret; +} +EXPORT_SYMBOL(iw_cm_accept); + +int iw_cm_bind_qp(struct iw_cm_id* cm_id, struct ib_qp* qp) +{ + int ret = -EINVAL; + + if (cm_id) + cm_id->qp = qp; + + return ret; +} +EXPORT_SYMBOL(iw_cm_bind_qp); + +int iw_cm_connect(struct iw_cm_id *cm_id, + const void* pdata, u8 pdata_len) +{ + struct iwcm_id_private* cm_id_priv; + int ret = 0; + unsigned long flags; + + if (cm_id->state != IW_CM_STATE_IDLE) + return -EBUSY; + + if (cm_id->device == 0) + return -EINVAL; + + if (cm_id->device->iwcm == 0) + return -ENOSYS; + + cm_id_priv = 
container_of(cm_id, struct iwcm_id_private, id); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + cm_id->state = IW_CM_STATE_CONN_SENT; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + ret = cm_id->device->iwcm->connect(cm_id, pdata, pdata_len); + if (ret != 0) { + spin_lock_irqsave(&cm_id_priv->lock, flags); + cm_id->state = IW_CM_STATE_IDLE; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + } + return ret; +} +EXPORT_SYMBOL(iw_cm_connect); + +int iw_cm_disconnect(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *iwcm_id_priv; + int ret; + + if (cm_id->device == 0 || cm_id->device->iwcm == 0 || cm_id->qp == 0) + return -EINVAL; + + iwcm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + + switch (cm_id->state) { + case IW_CM_STATE_ESTABLISHED: + ret = cm_id->device->iwcm->disconnect(cm_id, 1); + cm_id->state = IW_CM_STATE_IDLE; + if (ret == 0) { + struct iw_cm_event event; + event.event = IW_CM_EVENT_LLP_DISCONNECT; + event.provider_id = cm_id->provider_id; + event.status = 0; + event.local_addr = cm_id->local_addr; + event.remote_addr = cm_id->remote_addr; + event.private_data = 0; + event.private_data_len = 0; + cm_event_handler(cm_id, &event); + } + + break; + default: + ret = -EINVAL; + } + + return ret; +} +EXPORT_SYMBOL(iw_cm_disconnect); + +static void iwcm_add_one(struct ib_device *device) +{ + struct iwcm_device *iwcm_dev; + struct iwcm_port *port; + unsigned long flags; + u8 i; + + if (device->node_type != IB_NODE_RNIC) + return; + + iwcm_dev = kmalloc(sizeof(*iwcm_dev) + sizeof(*port) * + device->phys_port_cnt, GFP_KERNEL); + if (!iwcm_dev) + return; + + iwcm_dev->device = device; + + for (i = 1; i <= device->phys_port_cnt; i++) { + port = &iwcm_dev->port[i-1]; + port->iwcm_dev = iwcm_dev; + port->port_num = i; + } + + ib_set_client_data(device, &iwcm_client, iwcm_dev); + + write_lock_irqsave(&iwcm.device_lock, flags); + list_add_tail(&iwcm_dev->list, &iwcm.device_list); + write_unlock_irqrestore(&iwcm.device_lock, flags); + return; +} + +static void iwcm_remove_one(struct ib_device *device) +{ + struct iwcm_device *iwcm_dev; + unsigned long flags; + + if (device->node_type != IB_NODE_RNIC) + return; + + iwcm_dev = ib_get_client_data(device, &iwcm_client); + if (!iwcm_dev) + return; + + write_lock_irqsave(&iwcm.device_lock, flags); + list_del(&iwcm_dev->list); + write_unlock_irqrestore(&iwcm.device_lock, flags); + + kfree(iwcm_dev); +} + +/* Handles an inbound connect request. The function creates a new + * iw_cm_id to represent the new connection and inherits the client + * callback function and other attributes from the listening parent. + * + * The work item contains a pointer to the listen_cm_id and the event. The + * listen_cm_id contains the client cm_handler, context and device. These are + * copied when the device is cloned. The event contains the new four tuple. + */ +static int cm_conn_req_handler(struct iwcm_work* work) +{ + struct iw_cm_id* cm_id; + struct iwcm_id_private* cm_id_priv; + unsigned long flags; + int rc; + + /* If the status was not successful, ignore request */ + if (work->event.status) { + printk(KERN_ERR "Bad status=%d for connection request ... 
" + "should be filtered by provider\n", + work->event.status); + return work->event.status; + } + cm_id = iw_clone_id(&work->cm_id->id); + if (IS_ERR(cm_id)) + return PTR_ERR(cm_id); + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + spin_lock_irqsave(&cm_id_priv->lock, flags); + cm_id_priv->id.local_addr = work->event.local_addr; + cm_id_priv->id.remote_addr = work->event.remote_addr; + cm_id_priv->id.provider_id = work->event.provider_id; + cm_id_priv->id.state = IW_CM_STATE_CONN_RECV; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + /* Call the client CM handler */ + rc = cm_id->cm_handler(cm_id, &work->event); + if (rc) { + cm_id->state = IW_CM_STATE_IDLE; + iw_destroy_cm_id(cm_id); + } + kfree(work); + return 0; +} + +/* + * Handles the transition to established state on the passive side. + */ +static int cm_conn_est_handler(struct iwcm_work* work) +{ + struct iwcm_id_private* cm_id_priv; + unsigned long flags; + int ret = 0; + + cm_id_priv = work->cm_id; + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->id.state != IW_CM_STATE_CONN_RECV) { + printk(KERN_ERR "%s:%d Invalid cm_id state=%d for established event\n", + __FUNCTION__, __LINE__, cm_id_priv->id.state); + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ret = -EINVAL; + goto error_out; + } + + if (work->event.status == 0) { + cm_id_priv = work->cm_id; + cm_id_priv->id.local_addr = work->event.local_addr; + cm_id_priv->id.remote_addr = work->event.remote_addr; + cm_id_priv->id.state = IW_CM_STATE_ESTABLISHED; + } else { + cm_id_priv->id.state = IW_CM_STATE_IDLE; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + /* Call the client CM handler */ + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, &work->event); + if (ret) { + cm_id_priv->id.state = IW_CM_STATE_IDLE; + iw_destroy_cm_id(&cm_id_priv->id); + } + + error_out: + kfree(work); + return ret; +} + +/* + * Handles the reply to our connect request. There are three + * possibilities: + * - If the cm_id is in the wrong state when the event is + * delivered, the event is ignored. [What should we do when the + * provider does something crazy?] + * - If the remote peer accepts the connection, we update the 4-tuple + * in the cm_id with the remote peer info, move the cm_id to the + * ESTABLISHED state and deliver the event to the client. + * - If the remote peer rejects the connection, or there is some + * connection error, move the cm_id to the IDLE state, and deliver + * the event to the client. 
+ */ +static int cm_conn_rep_handler(struct iwcm_work* work) +{ + struct iwcm_id_private* cm_id_priv; + unsigned long flags; + int ret = 0; + + cm_id_priv = work->cm_id; + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id_priv->id.state != IW_CM_STATE_CONN_SENT) { + printk(KERN_ERR "%s:%d Invalid cm_id state=%d for connect reply event\n", + __FUNCTION__, __LINE__, cm_id_priv->id.state); + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ret = -EINVAL; + goto error_out; + } + + if (work->event.status == 0) { + cm_id_priv = work->cm_id; + cm_id_priv->id.local_addr = work->event.local_addr; + cm_id_priv->id.remote_addr = work->event.remote_addr; + cm_id_priv->id.state = IW_CM_STATE_ESTABLISHED; + } else { + cm_id_priv->id.state = IW_CM_STATE_IDLE; + } + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + /* Call the client CM handler */ + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, &work->event); + if (ret) { + cm_id_priv->id.state = IW_CM_STATE_IDLE; + iw_destroy_cm_id(&cm_id_priv->id); + } + + error_out: + kfree(work); + return ret; +} + +static int cm_disconnect_handler(struct iwcm_work* work) +{ + struct iwcm_id_private* cm_id_priv; + unsigned long flags; + int ret = 0; + + cm_id_priv = work->cm_id; + spin_lock_irqsave(&cm_id_priv->lock, flags); + cm_id_priv->id.state = IW_CM_STATE_IDLE; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + /* Call the client CM handler */ + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, &work->event); + if (ret) { + cm_id_priv->id.state = IW_CM_STATE_IDLE; + iw_destroy_cm_id(&cm_id_priv->id); + } + + kfree(work); + return ret; +} + +static void cm_work_handler(void* arg) +{ + struct iwcm_work* work = (struct iwcm_work*)arg; + int rc; + + switch (work->event.event) { + case IW_CM_EVENT_CONNECT_REQUEST: + rc = cm_conn_req_handler(work); + break; + case IW_CM_EVENT_CONNECT_REPLY: + rc = cm_conn_rep_handler(work); + break; + case IW_CM_EVENT_ESTABLISHED: + rc = cm_conn_est_handler(work); + break; + case IW_CM_EVENT_LLP_DISCONNECT: + case IW_CM_EVENT_LLP_TIMEOUT: + case IW_CM_EVENT_LLP_RESET: + case IW_CM_EVENT_CLOSE: + rc = cm_disconnect_handler(work); + break; + } +} + +/* IW CM provider event callback handler. This function is called on + * interrupt context. The function builds a work queue element + * and enqueues it for processing on a work queue thread. This allows + * CM client callback functions to block. 
+ */ +static void cm_event_handler(struct iw_cm_id* cm_id, + struct iw_cm_event* event) +{ + struct iwcm_work *work; + struct iwcm_id_private* cm_id_priv; + + work = kmalloc(sizeof *work, GFP_ATOMIC); + if (!work) + return; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + INIT_WORK(&work->work, cm_work_handler, work); + work->cm_id = cm_id_priv; + work->event = *event; + queue_work(iwcm.wq, &work->work); +} + +static int __init iw_cm_init(void) +{ + memset(&iwcm, 0, sizeof iwcm); + INIT_LIST_HEAD(&iwcm.device_list); + rwlock_init(&iwcm.device_lock); + spin_lock_init(&iwcm.lock); + iwcm.wq = create_workqueue("iw_cm"); + if (!iwcm.wq) + return -ENOMEM; + + return ib_register_client(&iwcm_client); +} + +static void __exit iw_cm_cleanup(void) +{ + ib_unregister_client(&iwcm_client); +} + +module_init(iw_cm_init); +module_exit(iw_cm_cleanup); + Index: core/addr.c =================================================================== --- core/addr.c (revision 4186) +++ core/addr.c (working copy) @@ -73,8 +73,13 @@ if (!dev) return -EADDRNOTAVAIL; - *gid = *(union ib_gid *) (dev->dev_addr + 4); - *pkey = addr_get_pkey(dev); + if (dev->type == ARPHRD_INFINIBAND) { + *gid = *(union ib_gid *) (dev->dev_addr + 4); + *pkey = addr_get_pkey(dev); + } else { + *gid = *(union ib_gid *) (dev->dev_addr); + *pkey = 0; + } dev_put(dev); return 0; } Index: core/Makefile =================================================================== --- core/Makefile (revision 4186) +++ core/Makefile (working copy) @@ -1,6 +1,6 @@ EXTRA_CFLAGS += -Idrivers/infiniband/include -Idrivers/infiniband/ulp/ipoib -obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_ping.o ib_cm.o \ +obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_ping.o ib_cm.o iw_cm.o \ ib_sa.o ib_at.o ib_addr.o rdma_cm.o obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o ib_uat.o @@ -14,6 +14,8 @@ ib_cm-y := cm.o +iw_cm-y := iwcm.o + rdma_cm-y := cma.o ib_addr-y := addr.o Index: core/cma.c =================================================================== --- core/cma.c (revision 4186) +++ core/cma.c (working copy) @@ -1,4 +1,5 @@ /* + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. * Copyright (c) 2005 Voltaire Inc. All rights reserved. * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved. @@ -30,9 +31,14 @@ */ #include #include +#include +#include +#include +#include #include #include #include +#include #include MODULE_AUTHOR("Guy German"); @@ -100,7 +106,10 @@ int timeout_ms; struct ib_sa_query *query; int query_id; - struct ib_cm_id *cm_id; + union { + struct ib_cm_id *ib; + struct iw_cm_id *iw; + } cm_id; }; struct cma_addr { @@ -266,6 +275,16 @@ IB_QP_PKEY_INDEX | IB_QP_PORT); } +static int cma_init_iw_qp(struct rdma_id_private *id_priv, struct ib_qp *qp) +{ + struct ib_qp_attr qp_attr; + + qp_attr.qp_state = IB_QPS_INIT; + qp_attr.qp_access_flags = IB_ACCESS_LOCAL_WRITE; + + return ib_modify_qp(qp, &qp_attr, IB_QP_STATE | IB_QP_ACCESS_FLAGS); +} + int rdma_create_qp(struct rdma_cm_id *id, struct ib_pd *pd, struct ib_qp_init_attr *qp_init_attr) { @@ -285,6 +304,9 @@ case IB_NODE_CA: ret = cma_init_ib_qp(id_priv, qp); break; + case IB_NODE_RNIC: + ret = cma_init_iw_qp(id_priv, qp); + break; default: ret = -ENOSYS; break; @@ -314,7 +336,7 @@ /* Need to update QP attributes from default values. 
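 * The QP is stepped through INIT and RTR here, with the attribute mask
 * for each state supplied by ib_cm_init_qp_attr(); a companion helper
 * moves it on to RTS the same way.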
*/ qp_attr.qp_state = IB_QPS_INIT; - ret = ib_cm_init_qp_attr(id_priv->cm_id, &qp_attr, &qp_attr_mask); + ret = ib_cm_init_qp_attr(id_priv->cm_id.ib, &qp_attr, &qp_attr_mask); if (ret) return ret; @@ -323,7 +345,7 @@ return ret; qp_attr.qp_state = IB_QPS_RTR; - ret = ib_cm_init_qp_attr(id_priv->cm_id, &qp_attr, &qp_attr_mask); + ret = ib_cm_init_qp_attr(id_priv->cm_id.ib, &qp_attr, &qp_attr_mask); if (ret) return ret; @@ -337,7 +359,7 @@ int qp_attr_mask, ret; qp_attr.qp_state = IB_QPS_RTS; - ret = ib_cm_init_qp_attr(id_priv->cm_id, &qp_attr, &qp_attr_mask); + ret = ib_cm_init_qp_attr(id_priv->cm_id.ib, &qp_attr, &qp_attr_mask); if (ret) return ret; @@ -419,8 +441,8 @@ { cma_exch(id_priv, CMA_DESTROYING); - if (id_priv->cm_id && !IS_ERR(id_priv->cm_id)) - ib_destroy_cm_id(id_priv->cm_id); + if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) + ib_destroy_cm_id(id_priv->cm_id.ib); list_del(&id_priv->listen_list); if (id_priv->cma_dev) @@ -476,8 +498,22 @@ state = cma_exch(id_priv, CMA_DESTROYING); cma_cancel_operation(id_priv, state); - if (id_priv->cm_id && !IS_ERR(id_priv->cm_id)) - ib_destroy_cm_id(id_priv->cm_id); + if (id->device) { + switch (id->device->node_type) { + case IB_NODE_RNIC: + if (id_priv->cm_id.iw && !IS_ERR(id_priv->cm_id.iw)) { + iw_destroy_cm_id(id_priv->cm_id.iw); + id_priv->cm_id.iw = 0; + } + break; + default: + if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) { + ib_destroy_cm_id(id_priv->cm_id.ib); + id_priv->cm_id.ib = 0; + } + break; + } + } if (id_priv->cma_dev) { down(&mutex); @@ -505,14 +541,14 @@ if (ret) goto reject; - ret = ib_send_cm_rtu(id_priv->cm_id, NULL, 0); + ret = ib_send_cm_rtu(id_priv->cm_id.ib, NULL, 0); if (ret) goto reject; return 0; reject: cma_modify_qp_err(&id_priv->id); - ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, + ib_send_cm_rej(id_priv->cm_id.ib, IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, NULL, 0); return ret; } @@ -528,7 +564,7 @@ return 0; reject: cma_modify_qp_err(&id_priv->id); - ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, + ib_send_cm_rej(id_priv->cm_id.ib, IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, NULL, 0); return ret; } @@ -586,7 +622,7 @@ private_data_len); if (ret) { /* Destroy the CM ID by returning a non-zero value. */ - id_priv->cm_id = NULL; + id_priv->cm_id.ib = NULL; cma_exch(id_priv, CMA_DESTROYING); cma_release_remove(id_priv); rdma_destroy_id(&id_priv->id); @@ -675,7 +711,7 @@ goto out; } - conn_id->cm_id = cm_id; + conn_id->cm_id.ib = cm_id; cm_id->context = conn_id; cm_id->cm_handler = cma_ib_handler; @@ -685,7 +721,7 @@ IB_CM_REQ_PRIVATE_DATA_SIZE - offset); if (ret) { /* Destroy the CM ID by returning a non-zero value. 
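Returning non-zero tells the IB CM to free this cm_id itself,
 * so conn_id->cm_id is cleared below to keep rdma_destroy_id() from
 * destroying it a second time.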
*/ - conn_id->cm_id = NULL; + conn_id->cm_id.ib = NULL; cma_exch(conn_id, CMA_DESTROYING); cma_release_remove(conn_id); rdma_destroy_id(&conn_id->id); @@ -695,6 +731,112 @@ return ret; } +static int cma_iw_handler(struct iw_cm_id* iw_id, struct iw_cm_event* event) +{ + struct rdma_id_private *id_priv = iw_id->context; + enum rdma_cm_event_type event_type = 0; + int ret = 0; + + atomic_inc(&id_priv->dev_remove); + + switch (event->event) { + case IW_CM_EVENT_LLP_DISCONNECT: + case IW_CM_EVENT_LLP_RESET: + case IW_CM_EVENT_LLP_TIMEOUT: + case IW_CM_EVENT_CLOSE: + event_type = RDMA_CM_EVENT_DISCONNECTED; + break; + + case IW_CM_EVENT_CONNECT_REQUEST: + BUG_ON(1); + break; + + case IW_CM_EVENT_CONNECT_REPLY: { + if (event->status) + event_type = RDMA_CM_EVENT_REJECTED; + else + event_type = RDMA_CM_EVENT_ESTABLISHED; + break; + } + + case IW_CM_EVENT_ESTABLISHED: + event_type = RDMA_CM_EVENT_ESTABLISHED; + break; + } + + ret = cma_notify_user(id_priv, + event_type, + event->status, + event->private_data, + event->private_data_len); + if (ret) { + /* Destroy the CM ID by returning a non-zero value. */ + id_priv->cm_id.iw = NULL; + cma_exch(id_priv, CMA_DESTROYING); + cma_release_remove(id_priv); + rdma_destroy_id(&id_priv->id); + return ret; + } + + cma_release_remove(id_priv); + return ret; +} + +static int iw_conn_req_handler(struct iw_cm_id *cm_id, + struct iw_cm_event *iw_event) +{ + struct rdma_cm_id* new_cm_id; + struct rdma_id_private *listen_id, *conn_id; + struct sockaddr_in* sin; + int ret; + + listen_id = cm_id->context; + atomic_inc(&listen_id->dev_remove); + if (!cma_comp(listen_id, CMA_LISTEN)) { + ret = -ECONNABORTED; + goto out; + } + + /* Create a new RDMA id for the new IW CM ID */ + new_cm_id = rdma_create_id(listen_id->id.event_handler, + listen_id->id.context); + if (!new_cm_id) { + ret = -ENOMEM; + goto out; + } + conn_id = container_of(new_cm_id, struct rdma_id_private, id); + atomic_inc(&conn_id->dev_remove); + conn_id->state = CMA_CONNECT; + + /* New connection inherits device from parent */ + cma_attach_to_dev(conn_id, listen_id->cma_dev); + + conn_id->cm_id.iw = cm_id; + cm_id->context = conn_id; + cm_id->cm_handler = cma_iw_handler; + + sin = (struct sockaddr_in*)&new_cm_id->route.addr.src_addr; + *sin = iw_event->local_addr; + + sin = (struct sockaddr_in*)&new_cm_id->route.addr.dst_addr; + *sin = iw_event->remote_addr; + + ret = cma_notify_user(conn_id, RDMA_CM_EVENT_CONNECT_REQUEST, 0, + iw_event->private_data, + iw_event->private_data_len); + if (ret) { + /* Destroy the CM ID by returning a non-zero value. 
*/ + conn_id->cm_id.iw = NULL; + cma_exch(conn_id, CMA_DESTROYING); + cma_release_remove(conn_id); + rdma_destroy_id(&conn_id->id); + } + +out: + cma_release_remove(listen_id); + return ret; +} + static __be64 cma_get_service_id(struct sockaddr *addr) { return cpu_to_be64(((u64)IB_OPENIB_OUI << 48) + @@ -706,21 +848,44 @@ __be64 svc_id; int ret; - id_priv->cm_id = ib_create_cm_id(id_priv->id.device, cma_req_handler, + id_priv->cm_id.ib = ib_create_cm_id(id_priv->id.device, cma_req_handler, id_priv); - if (IS_ERR(id_priv->cm_id)) - return PTR_ERR(id_priv->cm_id); + if (IS_ERR(id_priv->cm_id.ib)) + return PTR_ERR(id_priv->cm_id.ib); svc_id = cma_get_service_id(&id_priv->id.route.addr.src_addr); - ret = ib_cm_listen(id_priv->cm_id, svc_id, 0); + ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0); if (ret) { - ib_destroy_cm_id(id_priv->cm_id); - id_priv->cm_id = NULL; + ib_destroy_cm_id(id_priv->cm_id.ib); + id_priv->cm_id.ib = NULL; } return ret; } +static int cma_iw_listen(struct rdma_id_private *id_priv) +{ + int ret; + struct sockaddr_in* sin; + + id_priv->cm_id.iw = iw_create_cm_id(id_priv->id.device, + iw_conn_req_handler, + id_priv); + if (IS_ERR(id_priv->cm_id.iw)) + return PTR_ERR(id_priv->cm_id.iw); + + sin = (struct sockaddr_in*)&id_priv->id.route.addr.src_addr; + id_priv->cm_id.iw->local_addr = *sin; + + ret = iw_cm_listen(id_priv->cm_id.iw, 10 /* backlog */); + if (ret) { + iw_destroy_cm_id(id_priv->cm_id.iw); + id_priv->cm_id.iw = NULL; + } + + return ret; +} + static int cma_duplicate_listen(struct rdma_id_private *id_priv) { struct rdma_id_private *cur_id_priv; @@ -785,8 +950,9 @@ goto out; list_add_tail(&id_priv->list, &listen_any_list); - list_for_each_entry(cma_dev, &dev_list, list) + list_for_each_entry(cma_dev, &dev_list, list) { cma_listen_on_dev(id_priv, cma_dev); + } out: up(&mutex); return ret; @@ -796,7 +962,6 @@ { struct rdma_id_private *id_priv; int ret; - id_priv = container_of(id, struct rdma_id_private, id); if (!cma_comp_exch(id_priv, CMA_ADDR_BOUND, CMA_LISTEN)) return -EINVAL; @@ -806,6 +971,9 @@ case IB_NODE_CA: ret = cma_ib_listen(id_priv); break; + case IB_NODE_RNIC: + ret = cma_iw_listen(id_priv); + break; default: ret = -ENOSYS; break; @@ -890,6 +1058,30 @@ return (id_priv->query_id < 0) ? 
id_priv->query_id : 0; } +static int cma_resolve_iw_route(struct rdma_id_private *id_priv, int timeout_ms) +{ + enum rdma_cm_event_type event = RDMA_CM_EVENT_ROUTE_RESOLVED; + int rc; + + atomic_inc(&id_priv->dev_remove); + + if (!cma_comp_exch(id_priv, CMA_ROUTE_QUERY, CMA_ROUTE_RESOLVED)) + BUG_ON(1); + + rc = cma_notify_user(id_priv, event, 0, NULL, 0); + if (rc) { + cma_exch(id_priv, CMA_DESTROYING); + cma_release_remove(id_priv); + cma_deref_id(id_priv); + rdma_destroy_id(&id_priv->id); + return rc; + } + + cma_release_remove(id_priv); + cma_deref_id(id_priv); + return rc; +} + int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms) { struct rdma_id_private *id_priv; @@ -904,6 +1096,9 @@ case IB_NODE_CA: ret = cma_resolve_ib_route(id_priv, timeout_ms); break; + case IB_NODE_RNIC: + ret = cma_resolve_iw_route(id_priv, timeout_ms); + break; default: ret = -ENOSYS; break; @@ -952,20 +1147,133 @@ cma_deref_id(id_priv); } + +/* Find the local interface with a route to the specified address and + * bind the CM ID to this interface's CMA device + */ +static int cma_acquire_iw_dev(struct rdma_cm_id* id, struct sockaddr* addr) +{ + int ret = -ENOENT; + struct cma_device* cma_dev; + struct rdma_id_private *id_priv; + struct sockaddr_in* sin; + struct rtable *rt = 0; + struct flowi fl; + struct net_device* netdev; + struct in_addr src_ip; + unsigned char* dev_addr; + + sin = (struct sockaddr_in*)addr; + if (sin->sin_family != AF_INET) + return -EINVAL; + + id_priv = container_of(id, struct rdma_id_private, id); + + /* If the address is local, use the device. If it is remote, + * look up a route to get the local address + */ + netdev = ip_dev_find(sin->sin_addr.s_addr); + if (netdev) { + src_ip = sin->sin_addr; + dev_addr = netdev->dev_addr; + dev_put(netdev); + } else { + memset(&fl, 0, sizeof(fl)); + fl.nl_u.ip4_u.daddr = sin->sin_addr.s_addr; + if (ip_route_output_key(&rt, &fl)) { + return -ENETUNREACH; + } + dev_addr = rt->idev->dev->dev_addr; + src_ip.s_addr = rt->rt_src; + + ip_rt_put(rt); + } + + down(&mutex); + + list_for_each_entry(cma_dev, &dev_list, list) { + if (memcmp(dev_addr, + &cma_dev->node_guid, + sizeof(cma_dev->node_guid)) == 0) { + /* If we find the device, then check if this + * is an iWARP device. If it is, then call the + * callback handler immediately because we + * already have the native address + */ + if (cma_dev->device->node_type == IB_NODE_RNIC) { + struct sockaddr_in* cm_sin; + /* Set our source address */ + cm_sin = (struct sockaddr_in*) + &id_priv->id.route.addr.src_addr; + cm_sin->sin_family = AF_INET; + cm_sin->sin_addr.s_addr = src_ip.s_addr; + + /* Claim the device in the mutex */ + cma_attach_to_dev(id_priv, cma_dev); + ret = 0; + break; + } + } + } + up(&mutex); + + return ret; +} + + +/** + * rdma_resolve_addr - RDMA Resolve Address + * + * @id: RDMA identifier. + * @src_addr: Source IP address + * @dst_addr: Destination IP address + * @timeout_ms: Timeout to wait for address resolution + * + * Bind the specified cm_id to a local interface and, if this is an IB + * CA, determine the GIDs associated with the specified IP addresses. 
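+ *
+ * A minimal, hypothetical caller (illustration only; dst_addr and the
+ * 2000 ms timeout are assumed values):
+ *
+ *	ret = rdma_resolve_addr(id, NULL, dst_addr, 2000);
+ *
+ * On success, wait for RDMA_CM_EVENT_ADDR_RESOLVED in the event
+ * handler, then call rdma_resolve_route(id, timeout_ms).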
+ */ int rdma_resolve_addr(struct rdma_cm_id *id, struct sockaddr *src_addr, struct sockaddr *dst_addr, int timeout_ms) { struct rdma_id_private *id_priv; - int ret; + int ret = 0; id_priv = container_of(id, struct rdma_id_private, id); if (!cma_comp_exch(id_priv, CMA_IDLE, CMA_ADDR_QUERY)) return -EINVAL; atomic_inc(&id_priv->refcount); + id->route.addr.dst_addr = *dst_addr; - ret = ib_resolve_addr(src_addr, dst_addr, &id->route.addr.addr.ibaddr, - timeout_ms, addr_handler, id_priv); + + if (cma_acquire_iw_dev(id, dst_addr)==0) { + + enum rdma_cm_event_type event; + + cma_exch(id_priv, CMA_ADDR_RESOLVED); + + atomic_inc(&id_priv->dev_remove); + + event = RDMA_CM_EVENT_ADDR_RESOLVED; + if (cma_notify_user(id_priv, event, 0, NULL, 0)) { + cma_exch(id_priv, CMA_DESTROYING); + cma_deref_id(id_priv); + cma_release_remove(id_priv); + rdma_destroy_id(&id_priv->id); + return -EINVAL; + } + + cma_release_remove(id_priv); + cma_deref_id(id_priv); + + } else { + + ret = ib_resolve_addr(src_addr, + dst_addr, &id->route.addr.addr.ibaddr, + timeout_ms, addr_handler, id_priv); + + } + if (ret) goto err; @@ -980,10 +1288,13 @@ int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr) { struct rdma_id_private *id_priv; + struct sockaddr_in* sin; struct ib_addr *ibaddr = &id->route.addr.addr.ibaddr; int ret; - if (addr->sa_family != AF_INET) + sin = (struct sockaddr_in*)addr; + + if (sin->sin_family != AF_INET) return -EINVAL; id_priv = container_of(id, struct rdma_id_private, id); @@ -994,9 +1305,11 @@ id->route.addr.src_addr = *addr; ret = 0; } else { - ret = ib_translate_addr(addr, &ibaddr->sgid, &ibaddr->pkey); - if (!ret) - ret = cma_acquire_ib_dev(id_priv, &ibaddr->sgid); + if ((ret = cma_acquire_iw_dev(id, addr))) { + ret = ib_translate_addr(addr, &ibaddr->sgid, &ibaddr->pkey); + if (!ret) + ret = cma_acquire_ib_dev(id_priv, &ibaddr->sgid); + } } if (ret) @@ -1041,10 +1354,10 @@ if (!private_data) return -ENOMEM; - id_priv->cm_id = ib_create_cm_id(id_priv->id.device, cma_ib_handler, + id_priv->cm_id.ib = ib_create_cm_id(id_priv->id.device, cma_ib_handler, id_priv); - if (IS_ERR(id_priv->cm_id)) { - ret = PTR_ERR(id_priv->cm_id); + if (IS_ERR(id_priv->cm_id.ib)) { + ret = PTR_ERR(id_priv->cm_id.ib); goto out; } @@ -1075,25 +1388,61 @@ req.max_cm_retries = CMA_MAX_CM_RETRIES; req.srq = id_priv->id.qp->srq ? 
1 : 0; - ret = ib_send_cm_req(id_priv->cm_id, &req); + ret = ib_send_cm_req(id_priv->cm_id.ib, &req); out: kfree(private_data); return ret; } +static int cma_connect_iw(struct rdma_id_private *id_priv, + struct rdma_conn_param *conn_param) +{ + struct iw_cm_id* cm_id; + struct sockaddr_in* sin; + int ret; + + if (id_priv->id.qp == NULL) + return -EINVAL; + + cm_id = iw_create_cm_id(id_priv->id.device, cma_iw_handler, id_priv); + if (IS_ERR(cm_id)) { + ret = PTR_ERR(cm_id); + goto out; + } + + id_priv->cm_id.iw = cm_id; + + sin = (struct sockaddr_in*)&id_priv->id.route.addr.src_addr; + cm_id->local_addr = *sin; + + sin = (struct sockaddr_in*)&id_priv->id.route.addr.dst_addr; + cm_id->remote_addr = *sin; + + iw_cm_bind_qp(cm_id, id_priv->id.qp); + + ret = iw_cm_connect(cm_id, conn_param->private_data, + conn_param->private_data_len); + +out: + return ret; +} + int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) { struct rdma_id_private *id_priv; int ret; id_priv = container_of(id, struct rdma_id_private, id); - if (!cma_comp_exch(id_priv, CMA_ROUTE_RESOLVED, CMA_CONNECT)) + if (!cma_comp_exch(id_priv, CMA_ROUTE_RESOLVED, CMA_CONNECT)) return -EINVAL; switch (id->device->node_type) { case IB_NODE_CA: ret = cma_connect_ib(id_priv, conn_param); break; + case IB_NODE_RNIC: + ret = cma_connect_iw(id_priv, conn_param); + break; default: ret = -ENOSYS; break; @@ -1131,7 +1480,7 @@ rep.rnr_retry_count = conn_param->rnr_retry_count; rep.srq = id_priv->id.qp->srq ? 1 : 0; - return ib_send_cm_rep(id_priv->cm_id, &rep); + return ib_send_cm_rep(id_priv->cm_id.ib, &rep); } int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) @@ -1147,6 +1496,12 @@ case IB_NODE_CA: ret = cma_accept_ib(id_priv, conn_param); break; + case IB_NODE_RNIC: { + iw_cm_bind_qp(id_priv->cm_id.iw, id_priv->id.qp); + ret = iw_cm_accept(id_priv->cm_id.iw, conn_param->private_data, + conn_param->private_data_len); + break; + } default: ret = -ENOSYS; break; @@ -1175,9 +1530,15 @@ switch (id->device->node_type) { case IB_NODE_CA: - ret = ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, + ret = ib_send_cm_rej(id_priv->cm_id.ib, IB_CM_REJ_CONSUMER_DEFINED, NULL, 0, private_data, private_data_len); break; + + case IB_NODE_RNIC: + ret = iw_cm_reject(id_priv->cm_id.iw, + private_data, private_data_len); + break; + default: ret = -ENOSYS; break; @@ -1190,7 +1551,6 @@ { struct rdma_id_private *id_priv; int ret; - id_priv = container_of(id, struct rdma_id_private, id); if (!cma_comp(id_priv, CMA_CONNECT)) return -EINVAL; @@ -1202,9 +1562,12 @@ switch (id->device->node_type) { case IB_NODE_CA: /* Initiate or respond to a disconnect. 
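If the DREQ cannot be sent, the peer has most
	 * likely started the teardown already, so a DREP is sent in
	 * response instead.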
*/ - if (ib_send_cm_dreq(id_priv->cm_id, NULL, 0)) - ib_send_cm_drep(id_priv->cm_id, NULL, 0); + if (ib_send_cm_dreq(id_priv->cm_id.ib, NULL, 0)) + ib_send_cm_drep(id_priv->cm_id.ib, NULL, 0); break; + case IB_NODE_RNIC: + ret = iw_cm_disconnect(id_priv->cm_id.iw); + break; default: break; } Index: Makefile =================================================================== --- Makefile (revision 4186) +++ Makefile (working copy) @@ -7,3 +7,5 @@ obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/ obj-$(CONFIG_KDAPL) += ulp/kdapl/ obj-$(CONFIG_INFINIBAND_ISER) += ulp/iser/ +obj-$(CONFIG_KRPING) += krping/ +obj-$(CONFIG_RDMA_CMATOSE) += cmatose/ Index: hw/amso1100/c2.c =================================================================== --- hw/amso1100/c2.c (revision 4482) +++ hw/amso1100/c2.c (working copy) @@ -933,7 +933,7 @@ spin_lock_init(&c2_port->tx_lock); /* Copy our 48-bit ethernet hardware address */ - memcpy_fromio(netdev->dev_addr, mmio_addr + C2_REGS_ENADDR, 6); + memcpy_fromio(netdev->dev_addr, mmio_addr + C2_REGS_RDMA_ENADDR, 6); /* Validate the MAC address */ if(!is_valid_ether_addr(netdev->dev_addr)) { Index: hw/amso1100/c2_qp.c =================================================================== --- hw/amso1100/c2_qp.c (revision 4482) +++ hw/amso1100/c2_qp.c (working copy) @@ -184,7 +184,7 @@ struct c2_vq_req *vq_req; ccwr_qp_destroy_req_t wr; ccwr_qp_destroy_rep_t *reply; - int err; + int err; /* * Allocate a verb request message @@ -343,8 +343,6 @@ qp->send_sgl_depth = qp_attrs->cap.max_send_sge; qp->rdma_write_sgl_depth = qp_attrs->cap.max_send_sge; qp->recv_sgl_depth = qp_attrs->cap.max_recv_sge; - qp->event_handler = NULL; - qp->context = NULL; /* Initialize the SQ MQ */ q_size = be32_to_cpu(reply->sq_depth); Index: hw/amso1100/c2.h =================================================================== --- hw/amso1100/c2.h (revision 4482) +++ hw/amso1100/c2.h (working copy) @@ -113,6 +113,7 @@ C2_REGS_Q2_MSGSIZE = 0x0038, C2_REGS_Q2_SHARED = 0x0040, C2_REGS_ENADDR = 0x004C, + C2_REGS_RDMA_ENADDR = 0x0054, C2_REGS_HRX_CUR = 0x006C, }; @@ -592,16 +593,11 @@ extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify); /* CM */ -extern int c2_qp_connect(struct c2_dev *c2dev, struct c2_qp *qp, u32 remote_addr, - u16 remote_port, u32 pdata_len, u8 *pdata); -extern int c2_cr_query(struct c2_dev *c2dev, u32 cr_id, - struct c2_cr_query_attrs *cr_attrs); -extern int c2_cr_accept(struct c2_dev *c2dev, u32 cr_id, struct c2_qp *qp, - u32 pdata_len, u8 *pdata); -extern int c2_cr_reject(struct c2_dev *c2dev, u32 cr_id); -extern int c2_ep_listen_create(struct c2_dev *c2dev, u32 addr, u16 port, - u32 backlog, struct c2_ep *ep); -extern int c2_ep_listen_destroy(struct c2_dev *c2dev, struct c2_ep *ep); +extern int c2_llp_connect(struct iw_cm_id* cm_id, const void* pdata, u8 pdata_len); +extern int c2_llp_accept(struct iw_cm_id* cm_id, const void* pdata, u8 pdata_len); +extern int c2_llp_reject(struct iw_cm_id* cm_id, const void* pdata, u8 pdata_len); +extern int c2_llp_service_create(struct iw_cm_id* cm_id, int backlog); +extern int c2_llp_service_destroy(struct iw_cm_id* cm_id); /* MM */ extern int c2_nsmr_register_phys_kern(struct c2_dev *c2dev, u64 **addr_list, Index: hw/amso1100/c2_pd.c =================================================================== --- hw/amso1100/c2_pd.c (revision 4482) +++ hw/amso1100/c2_pd.c (working copy) @@ -44,6 +44,8 @@ { int err = 0; + printk(KERN_ERR "%s:%d\n", __FUNCTION__, __LINE__); + might_sleep(); atomic_set(&pd->sqp_count, 0); Index: hw/amso1100/c2_ae.c 
=================================================================== --- hw/amso1100/c2_ae.c (revision 4482) +++ hw/amso1100/c2_ae.c (working copy) @@ -35,51 +35,37 @@ #include "cc_status.h" #include "cc_ae.h" -enum conn_result -c2_convert_cm_status(u32 cc_status) +static int c2_convert_cm_status(u32 cc_status) { switch (cc_status) { - case CC_CONN_STATUS_SUCCESS: return IW_CONN_ACCEPT; - case CC_CONN_STATUS_REJECTED: return IW_CONN_RESET; - case CC_CONN_STATUS_REFUSED: return IW_CONN_PEER_REJECT; - case CC_CONN_STATUS_TIMEDOUT: return IW_CONN_TIMEDOUT; - case CC_CONN_STATUS_NETUNREACH: return IW_CONN_NO_ROUTE_TO_HOST; - case CC_CONN_STATUS_HOSTUNREACH: return IW_CONN_NO_ROUTE_TO_HOST; - case CC_CONN_STATUS_INVALID_RNIC: return IW_CONN_INVALID_PARM; - case CC_CONN_STATUS_INVALID_QP: return IW_CONN_INVALID_PARM; - case CC_CONN_STATUS_INVALID_QP_STATE: return IW_CONN_INVALID_PARM; + case CC_CONN_STATUS_SUCCESS: + return 0; + case CC_CONN_STATUS_REJECTED: + return -ENETRESET; + case CC_CONN_STATUS_REFUSED: + return -ECONNREFUSED; + case CC_CONN_STATUS_TIMEDOUT: + return -ETIMEDOUT; + case CC_CONN_STATUS_NETUNREACH: + return -ENETUNREACH; + case CC_CONN_STATUS_HOSTUNREACH: + return -EHOSTUNREACH; + case CC_CONN_STATUS_INVALID_RNIC: + return -EINVAL; + case CC_CONN_STATUS_INVALID_QP: + return -EINVAL; + case CC_CONN_STATUS_INVALID_QP_STATE: + return -EINVAL; default: panic("Unable to convert CM status: %d\n", cc_status); break; } } -static int -is_cm_event(cc_event_id_t id) -{ - int is_cm; - - switch (id) { - case CCAE_ACTIVE_CONNECT_RESULTS: - case CCAE_BAD_CLOSE: - case CCAE_LLP_CLOSE_COMPLETE: - case CCAE_LLP_CONNECTION_RESET: - case CCAE_LLP_CONNECTION_LOST: - is_cm = 1; - break; - case CCAE_TERMINATE_MESSAGE_RECEIVED: - case CCAE_CQ_SQ_COMPLETION_OVERFLOW: - default: - is_cm = 0; - break; - } - - return is_cm; -} void c2_ae_event(struct c2_dev *c2dev, u32 mq_index) { - ccwr_t *wr; struct c2_mq *mq = c2dev->qptr_array[mq_index]; + ccwr_t *wr; void *resource_user_context; struct iw_cm_event cm_event; struct ib_event ib_event; @@ -94,6 +80,7 @@ if (!wr) return; + memset(&cm_event, 0, sizeof(cm_event)); event_id = c2_wr_get_id(wr); resource_indicator = be32_to_cpu(wr->ae.ae_generic.resource_type); resource_user_context = (void *)(unsigned long)wr->ae.ae_generic.user_context; @@ -102,117 +89,126 @@ case CC_RES_IND_QP: { struct c2_qp *qp = (struct c2_qp *)resource_user_context; + cm_event.status = c2_convert_cm_status(c2_wr_get_result(wr)); - if (is_cm_event(event_id)) { - - cm_event.device = &c2dev->ibdev; - if (event_id == CCAE_ACTIVE_CONNECT_RESULTS) { - cm_event.event = IW_EVENT_ACTIVE_CONNECT_RESULTS; - cm_event.element.active_results.result = - c2_convert_cm_status(c2_wr_get_result(wr)); - cm_event.element.active_results.conn_attr.local_addr.s_addr = - wr->ae.ae_active_connect_results.laddr; - cm_event.element.active_results.conn_attr.remote_addr.s_addr = - wr->ae.ae_active_connect_results.raddr; - cm_event.element.active_results.conn_attr.local_port = - wr->ae.ae_active_connect_results.lport; - cm_event.element.active_results.conn_attr.remote_port = - wr->ae.ae_active_connect_results.rport; - cm_event.element.active_results.private_data_len = + switch (event_id) { + case CCAE_ACTIVE_CONNECT_RESULTS: + cm_event.event = IW_CM_EVENT_CONNECT_REPLY; + cm_event.local_addr.sin_addr.s_addr = + wr->ae.ae_active_connect_results.laddr; + cm_event.remote_addr.sin_addr.s_addr = + wr->ae.ae_active_connect_results.raddr; + cm_event.local_addr.sin_port = + wr->ae.ae_active_connect_results.lport; + 
cm_event.remote_addr.sin_port = + wr->ae.ae_active_connect_results.rport; + cm_event.private_data_len = be32_to_cpu(wr->ae.ae_active_connect_results.private_data_length); + if (cm_event.private_data_len) { /* XXX */ - pdata = kmalloc(cm_event.element.active_results.private_data_len, - GFP_ATOMIC); - if (!pdata) - break; + pdata = kmalloc(cm_event.private_data_len, GFP_ATOMIC); + if (!pdata) { + /* Ignore the request, maybe the remote peer + * will retry */ + dprintk("Ignored connect request -- no memory for pdata" + "private_data_len=%d\n", cm_event.private_data_len); + goto ignore_it; + } memcpy(pdata, wr->ae.ae_active_connect_results.private_data, - cm_event.element.active_results.private_data_len); - cm_event.element.active_results.private_data = pdata; + cm_event.private_data_len); - } else { - cm_event.event = IW_EVENT_DISCONNECT; + cm_event.private_data = pdata; } + if (qp->cm_id->event_handler) + qp->cm_id->event_handler(qp->cm_id, &cm_event); - if (qp->event_handler) - (*qp->event_handler)(&cm_event, qp->context); + break; - if (pdata) - kfree(pdata); - } else { - + case CCAE_TERMINATE_MESSAGE_RECEIVED: + case CCAE_CQ_SQ_COMPLETION_OVERFLOW: ib_event.device = &c2dev->ibdev; ib_event.element.qp = &qp->ibqp; - /* XXX */ ib_event.event = IB_EVENT_QP_REQ_ERR; if(qp->ibqp.event_handler) - (*qp->ibqp.event_handler)(&ib_event, qp->context); - } + (*qp->ibqp.event_handler)(&ib_event, + qp->ibqp.qp_context); + case CCAE_BAD_CLOSE: + case CCAE_LLP_CLOSE_COMPLETE: + case CCAE_LLP_CONNECTION_RESET: + case CCAE_LLP_CONNECTION_LOST: + default: + cm_event.event = IW_CM_EVENT_CLOSE; + if (qp->cm_id->event_handler) + qp->cm_id->event_handler(qp->cm_id, &cm_event); + } break; } + case CC_RES_IND_EP: { - struct c2_ep *ep = (struct c2_ep *)resource_user_context; + struct iw_cm_id* cm_id = (struct iw_cm_id*)resource_user_context; + dprintk("CC_RES_IND_EP event_id=%d\n", event_id); if (event_id != CCAE_CONNECTION_REQUEST) { dprintk("%s: Invalid event_id: %d\n", __FUNCTION__, event_id); break; } - cm_event.device = &c2dev->ibdev; - cm_event.event = IW_EVENT_CONNECT_REQUEST; - cm_event.element.conn_request.cr_id = + cm_event.event = IW_CM_EVENT_CONNECT_REQUEST; + cm_event.provider_id = wr->ae.ae_connection_request.cr_handle; - cm_event.element.conn_request.conn_attr.local_addr.s_addr = + cm_event.local_addr.sin_addr.s_addr = wr->ae.ae_connection_request.laddr; - cm_event.element.conn_request.conn_attr.remote_addr.s_addr = + cm_event.remote_addr.sin_addr.s_addr = wr->ae.ae_connection_request.raddr; - cm_event.element.conn_request.conn_attr.local_port = + cm_event.local_addr.sin_port = wr->ae.ae_connection_request.lport; - cm_event.element.conn_request.conn_attr.remote_port = + cm_event.remote_addr.sin_port = wr->ae.ae_connection_request.rport; - cm_event.element.conn_request.private_data_len = + cm_event.private_data_len = be32_to_cpu(wr->ae.ae_connection_request.private_data_length); - /* XXX */ - pdata = kmalloc(cm_event.element.conn_request.private_data_len, - GFP_ATOMIC); - if (!pdata) - break; + if (cm_event.private_data_len) { + pdata = kmalloc(cm_event.private_data_len, GFP_ATOMIC); + if (!pdata) { + /* Ignore the request, maybe the remote peer + * will retry */ + dprintk("Ignored connect request -- no memory for pdata" + "private_data_len=%d\n", cm_event.private_data_len); + goto ignore_it; + } + memcpy(pdata, + wr->ae.ae_connection_request.private_data, + cm_event.private_data_len); - memcpy(pdata, - wr->ae.ae_connection_request.private_data, - cm_event.element.conn_request.private_data_len); - - 
cm_event.element.conn_request.private_data = pdata; - - if (ep->event_handler) - (*ep->event_handler)(&cm_event, ep->listen_context); - - kfree(pdata); + cm_event.private_data = pdata; + } + if (cm_id->event_handler) + cm_id->event_handler(cm_id, &cm_event); break; } + case CC_RES_IND_CQ: { struct c2_cq *cq = (struct c2_cq *)resource_user_context; + dprintk("IB_EVENT_CQ_ERR\n"); ib_event.device = &c2dev->ibdev; ib_event.element.cq = &cq->ibcq; ib_event.event = IB_EVENT_CQ_ERR; if (cq->ibcq.event_handler) - (*cq->ibcq.event_handler)(&ib_event, cq->ibcq.cq_context); + cq->ibcq.event_handler(&ib_event, cq->ibcq.cq_context); } + default: break; } - - /* - * free the adapter message - */ + + ignore_it: c2_mq_free(mq); } - Index: hw/amso1100/c2_provider.c =================================================================== --- hw/amso1100/c2_provider.c (revision 4482) +++ hw/amso1100/c2_provider.c (working copy) @@ -305,8 +305,6 @@ struct c2_cq *cq; int err; - dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); - cq = kmalloc(sizeof(*cq), GFP_KERNEL); if (!cq) { dprintk("%s: Unable to allocate CQ\n", __FUNCTION__); @@ -315,6 +313,7 @@ err = c2_init_cq(to_c2dev(ibdev), entries, NULL, cq); if (err) { + dprintk("%s: error initializing CQ\n", __FUNCTION__); kfree(cq); return ERR_PTR(err); } @@ -540,156 +539,96 @@ return -ENOSYS; } -static int c2_connect_qp(struct ib_qp *ib_qp, - struct iw_conn_attr *attr, - void (*event_handler)(struct iw_cm_event*, void*), - void *context, - u8 *pdata, - int pdata_len - ) +static int c2_connect(struct iw_cm_id* cm_id, + const void* pdata, u8 pdata_len) { - struct c2_qp *qp = to_c2qp(ib_qp); int err; + struct c2_qp* qp = container_of(cm_id->qp, struct c2_qp, ibqp); dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); - if (!event_handler) + if (cm_id->qp == NULL) return -EINVAL; - /* - * Store the event handler and the - * context in the QP. - */ - qp->event_handler = event_handler; - qp->context = context; + /* Cache the cm_id in the qp */ + qp->cm_id = cm_id; - err = c2_qp_connect(to_c2dev(ib_qp->device), qp, - attr->remote_addr.s_addr, attr->remote_port, - pdata_len, pdata); - if (err) { - qp->event_handler = NULL; - qp->context = NULL; - } + err = c2_llp_connect(cm_id, pdata, pdata_len); return err; } -static int c2_disconnect_qp(struct ib_qp *qp, - int abrupt) +static int c2_disconnect(struct iw_cm_id* cm_id, int abrupt) { struct ib_qp_attr attr; + struct ib_qp *ib_qp = cm_id->qp; int err; dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); + if (ib_qp == 0) + /* If this is a listening endpoint, there is no QP */ + return 0; + memset(&attr, 0, sizeof(struct ib_qp_attr)); if (abrupt) attr.qp_state = IB_QPS_ERR; else attr.qp_state = IB_QPS_SQD; - err = c2_modify_qp(qp, &attr, IB_QP_STATE); + err = c2_modify_qp(ib_qp, &attr, IB_QP_STATE); return err; } -static int c2_accept_cr(struct ib_device *ibdev, - u32 cr_id, - struct ib_qp *ib_qp, - void (*event_handler)(struct iw_cm_event*, void*), - void *context, - u8 *pdata, - int pdata_len) +static int c2_accept(struct iw_cm_id* cm_id, const void *pdata, u8 pdata_len) { - struct c2_qp *qp = to_c2qp(ib_qp); int err; dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); - /* - * Store the event handler and the - * context in the QP. 
- */ - qp->event_handler = event_handler; - qp->context = context; + err = c2_llp_accept(cm_id, pdata, pdata_len); - err = c2_cr_accept(to_c2dev(ibdev), cr_id, qp, - pdata_len, pdata); - return err; } -static int c2_reject_cr(struct ib_device *ibdev, - u32 cr_id, - u8 *pdata, - int pdata_len) +static int c2_reject(struct iw_cm_id* cm_id, const void* pdata, u8 pdata_len) { int err; dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); - err = c2_cr_reject(to_c2dev(ibdev), cr_id); + err = c2_llp_reject(cm_id, pdata, pdata_len); return err; } -static int c2_query_cr(struct ib_device *ibdev, - u32 cr_id, - struct iw_conn_request *req) +static int c2_getpeername(struct iw_cm_id* cm_id, + struct sockaddr_in* local_addr, + struct sockaddr_in* remote_addr ) { - int err; - struct c2_cr_query_attrs cr_attrs; - dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); - err = c2_cr_query(to_c2dev(ibdev), cr_id, &cr_attrs); - if (!err) { - req->cr_id = cr_id; - req->conn_attr.local_addr.s_addr = cr_attrs.local_addr; - req->conn_attr.local_port = cr_attrs.local_port; - req->conn_attr.remote_addr.s_addr = cr_attrs.remote_addr; - req->conn_attr.remote_port = cr_attrs.remote_port; - /* XXX pdata? */ - } - return err; + *local_addr = cm_id->local_addr; + *remote_addr = cm_id->remote_addr; + return 0; } -static int c2_create_listen_ep(struct ib_device *ibdev, - struct iw_listen_ep_attr *ep_attr, - void **ep_handle) +static int c2_service_create(struct iw_cm_id* cm_id, int backlog) { int err; - struct c2_ep *ep; dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); - - ep = kmalloc(sizeof(*ep), GFP_KERNEL); - if (!ep) { - dprintk("%s: Unable to allocate EP\n", __FUNCTION__); - return -ENOMEM; - } - - ep->event_handler = ep_attr->event_handler; - ep->listen_context = ep_attr->listen_context; - - err = c2_ep_listen_create(to_c2dev(ibdev), - ep_attr->addr.s_addr, ep_attr->port, - ep_attr->backlog, ep); - if (err) - kfree(ep); - else - *ep_handle = (void *)ep; - + err = c2_llp_service_create(cm_id, backlog); return err; } -static int c2_destroy_listen_ep(struct ib_device *ibdev, void *ep_handle) +static int c2_service_destroy(struct iw_cm_id* cm_id) { - struct c2_ep *ep = (struct c2_ep *)ep_handle; - + int err; dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); - c2_ep_listen_destroy(to_c2dev(ibdev), ep); - kfree(ep); - return 0; + err = c2_llp_service_destroy(cm_id); + + return err; } int c2_register_device(struct c2_dev *dev) @@ -742,13 +681,13 @@ dev->ibdev.post_recv = c2_post_receive; dev->ibdev.iwcm = kmalloc(sizeof(*dev->ibdev.iwcm), GFP_KERNEL); - dev->ibdev.iwcm->connect_qp = c2_connect_qp; - dev->ibdev.iwcm->disconnect_qp = c2_disconnect_qp; - dev->ibdev.iwcm->accept_cr = c2_accept_cr; - dev->ibdev.iwcm->reject_cr = c2_reject_cr; - dev->ibdev.iwcm->query_cr = c2_query_cr; - dev->ibdev.iwcm->create_listen_ep = c2_create_listen_ep; - dev->ibdev.iwcm->destroy_listen_ep = c2_destroy_listen_ep; + dev->ibdev.iwcm->connect = c2_connect; + dev->ibdev.iwcm->disconnect = c2_disconnect; + dev->ibdev.iwcm->accept = c2_accept; + dev->ibdev.iwcm->reject = c2_reject; + dev->ibdev.iwcm->getpeername = c2_getpeername; + dev->ibdev.iwcm->create_listen = c2_service_create; + dev->ibdev.iwcm->destroy_listen = c2_service_destroy; ret = ib_register_device(&dev->ibdev); if (ret) Index: hw/amso1100/c2_provider.h =================================================================== --- hw/amso1100/c2_provider.h (revision 4482) +++ hw/amso1100/c2_provider.h (working copy) @@ -115,17 +115,15 @@ struct c2_wq { spinlock_t 
lock; }; - +struct iw_cm_id; struct c2_qp { struct ib_qp ibqp; + struct iw_cm_id* cm_id; spinlock_t lock; atomic_t refcount; wait_queue_head_t wait; int qpn; - void (*event_handler)(struct iw_cm_event *, void *); - void *context; - u32 adapter_handle; u32 send_sgl_depth; u32 recv_sgl_depth; @@ -136,15 +134,6 @@ struct c2_mq rq_mq; }; -struct c2_ep { - u32 adapter_handle; - void (*event_handler)(struct iw_cm_event *, void *); - void *listen_context; - u32 addr; - u16 port; - int backlog; -}; - struct c2_cr_query_attrs { u32 local_addr; u32 remote_addr; Index: hw/amso1100/c2_cm.c =================================================================== --- hw/amso1100/c2_cm.c (revision 4482) +++ hw/amso1100/c2_cm.c (working copy) @@ -35,11 +35,10 @@ #include "c2_vq.h" #include -int -c2_qp_connect(struct c2_dev *c2dev, struct c2_qp *qp, - u32 remote_addr, u16 remote_port, - u32 pdata_len, u8 *pdata) +int c2_llp_connect(struct iw_cm_id* cm_id, const void* pdata, u8 pdata_len) { + struct c2_dev *c2dev = to_c2dev(cm_id->device); + struct c2_qp *qp = to_c2qp(cm_id->qp); ccwr_qp_connect_req_t *wr; /* variable size needs a malloc. */ struct c2_vq_req *vq_req; int err; @@ -70,8 +69,8 @@ wr->rnic_handle = c2dev->adapter_handle; wr->qp_handle = qp->adapter_handle; - wr->remote_addr = remote_addr; /* already in Network Byte Order */ - wr->remote_port = remote_port; /* already in Network Byte Order */ + wr->remote_addr = cm_id->remote_addr.sin_addr.s_addr; + wr->remote_port = cm_id->remote_addr.sin_port; /* * Move any private data from the callers's buf into @@ -96,14 +95,18 @@ } int -c2_ep_listen_create(struct c2_dev *c2dev, u32 addr, - u16 port, u32 backlog, struct c2_ep *ep) +c2_llp_service_create(struct iw_cm_id* cm_id, int backlog) { + struct c2_dev *c2dev; ccwr_ep_listen_create_req_t wr; ccwr_ep_listen_create_rep_t *reply; struct c2_vq_req *vq_req; int err; + c2dev = to_c2dev(cm_id->device); + if (c2dev == NULL) + return -EINVAL; + /* * Allocate verbs request. */ @@ -115,15 +118,15 @@ * Build the WR */ c2_wr_set_id(&wr, CCWR_EP_LISTEN_CREATE); - wr.hdr.context = (unsigned long)vq_req; + wr.hdr.context = (u64)(unsigned long)vq_req; wr.rnic_handle = c2dev->adapter_handle; - wr.local_addr = addr; /* already in Net Byte Order */ - wr.local_port = port; /* already in Net Byte Order */ + wr.local_addr = cm_id->local_addr.sin_addr.s_addr; + wr.local_port = cm_id->local_addr.sin_port; wr.backlog = cpu_to_be32(backlog); - wr.user_context = (unsigned long)ep; + wr.user_context = (u64)(unsigned long)cm_id; /* - * reference the request struct. dereferenced in the int handler. + * Reference the request struct. Dereferenced in the int handler. */ vq_req_get(c2dev, vq_req); @@ -160,12 +163,7 @@ /* * get the adapter handle */ - ep->adapter_handle = reply->ep_handle; - if (port != reply->local_port) - { - // XXX - //*p_port = reply->local_port; - } + cm_id->provider_id = reply->ep_handle; /* * free vq stuff @@ -184,13 +182,19 @@ int -c2_ep_listen_destroy(struct c2_dev *c2dev, struct c2_ep *ep) +c2_llp_service_destroy(struct iw_cm_id* cm_id) { + + struct c2_dev *c2dev; ccwr_ep_listen_destroy_req_t wr; ccwr_ep_listen_destroy_rep_t *reply; struct c2_vq_req *vq_req; int err; + c2dev = to_c2dev(cm_id->device); + if (c2dev == NULL) + return -EINVAL; + /* * Allocate verbs request. 
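 * The same verbs-queue request/reply handshake used by
 * c2_llp_service_create() tears the listening endpoint down here.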
*/ @@ -205,7 +209,7 @@ c2_wr_set_id(&wr, CCWR_EP_LISTEN_DESTROY); wr.hdr.context = (unsigned long)vq_req; wr.rnic_handle = c2dev->adapter_handle; - wr.ep_handle = ep->adapter_handle; + wr.ep_handle = cm_id->provider_id; /* * reference the request struct. dereferenced in the int handler. @@ -250,87 +254,20 @@ int -c2_cr_query(struct c2_dev *c2dev, u32 cr_id, - struct c2_cr_query_attrs *cr_attrs) +c2_llp_accept(struct iw_cm_id* cm_id, const void* pdata, u8 pdata_len) { - ccwr_ep_query_req_t wr; - ccwr_ep_query_rep_t *reply; - struct c2_vq_req *vq_req; - int err; + struct c2_dev *c2dev = to_c2dev(cm_id->device); + struct c2_qp *qp = to_c2qp(cm_id->qp); + ccwr_cr_accept_req_t *wr; /* variable length WR */ + struct c2_vq_req *vq_req; + ccwr_cr_accept_rep_t *reply; /* VQ Reply msg ptr. */ + int err; - /* - * Create and send a WR_EP_CREATE... - */ - vq_req = vq_req_alloc(c2dev); - if (!vq_req) { - return -ENOMEM; - } + /* Make sure there's a bound QP */ + if (qp == 0) + return -EINVAL; - /* - * Build the WR - */ - c2_wr_set_id(&wr, CCWR_EP_QUERY); - wr.hdr.context = (unsigned long)vq_req; - wr.rnic_handle = c2dev->adapter_handle; - wr.ep_handle = cr_id; - /* - * reference the request struct. dereferenced in the int handler. - */ - vq_req_get(c2dev, vq_req); - - /* - * Send WR to adapter - */ - err = vq_send_wr(c2dev, (ccwr_t*)&wr); - if (err) { - vq_req_put(c2dev, vq_req); - goto bail0; - } - - /* - * Wait for reply from adapter - */ - err = vq_wait_for_reply(c2dev, vq_req); - if (err) { - goto bail0; - } - - /* - * Process reply - */ - reply = (ccwr_ep_query_rep_t*)(unsigned long)vq_req->reply_msg; - if (!reply) { - err = -ENOMEM; - goto bail0; - } - if ( (err = c2_errno(reply)) != 0) { - goto bail1; - } - - cr_attrs->local_addr = reply->local_addr; - cr_attrs->local_port = reply->local_port; - cr_attrs->remote_addr = reply->remote_addr; - cr_attrs->remote_port = reply->remote_port; - -bail1: - vq_repbuf_free(c2dev, reply); -bail0: - vq_req_free(c2dev, vq_req); - return err; -} - - -int -c2_cr_accept(struct c2_dev *c2dev, u32 cr_id, struct c2_qp *qp, - u32 pdata_len, u8 *pdata) -{ - ccwr_cr_accept_req_t *wr; /* variable length WR */ - struct c2_vq_req *vq_req; - ccwr_cr_accept_rep_t* reply; /* VQ Reply msg ptr. */ - int err; - - /* * only support the max private_data length */ if (pdata_len > CC_MAX_PRIVATE_DATA_SIZE) { @@ -357,7 +294,7 @@ c2_wr_set_id(wr, CCWR_CR_ACCEPT); wr->hdr.context = (unsigned long)vq_req; wr->rnic_handle = c2dev->adapter_handle; - wr->ep_handle = cr_id; + wr->ep_handle = (u32)cm_id->provider_id; wr->qp_handle = qp->adapter_handle; if (pdata) { wr->private_data_length = cpu_to_be32(pdata_len); @@ -407,15 +344,17 @@ return err; } - int -c2_cr_reject(struct c2_dev *c2dev, u32 cr_id) +c2_llp_reject(struct iw_cm_id* cm_id, const void* pdata, u8 pdata_len) { + struct c2_dev *c2dev; ccwr_cr_reject_req_t wr; struct c2_vq_req *vq_req; ccwr_cr_reject_rep_t *reply; int err; + c2dev = to_c2dev(cm_id->device); + /* * Allocate verbs request. */ @@ -430,7 +369,7 @@ c2_wr_set_id(&wr, CCWR_CR_REJECT); wr.hdr.context = (unsigned long)vq_req; wr.rnic_handle = c2dev->adapter_handle; - wr.ep_handle = cr_id; + wr.ep_handle = (u32)cm_id->provider_id; /* * reference the request struct. dereferenced in the int handler. From mst at mellanox.co.il Thu Dec 15 10:07:25 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 15 Dec 2005 20:07:25 +0200 Subject: [openib-general] [PATCH applied] sdp_iocb memory corruption fix Message-ID: <20051215180725.GA29878@mellanox.co.il> Fix thinko in sdp_copy_one_page: dont try to copy beyond the page boundary. Signed-off-by: Michael S. Tsirkin Index: linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_iocb.c =================================================================== --- linux-2.6.14.orig/drivers/infiniband/ulp/sdp/sdp_iocb.c 2005-12-15 22:33:02.000000000 +0200 +++ linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_iocb.c 2005-12-15 22:34:01.000000000 +0200 @@ -43,8 +43,8 @@ static void sdp_copy_one_page(struct pag unsigned long uaddr) { size_t size_left = iocb_addr + iocb_size - uaddr; - size_t size = min(size_left, (size_t)PAGE_SIZE); unsigned long offset = uaddr % PAGE_SIZE; + size_t size = min(size_left, (size_t)(PAGE_SIZE - offset)); unsigned long flags; void* fptr; -- MST From nacc at us.ibm.com Thu Dec 15 10:03:14 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Thu, 15 Dec 2005 10:03:14 -0800 Subject: [openib-general] Re: [PATCH] add LDFLAGS to perftest/Makefile In-Reply-To: <20051215073512.GA26722@mellanox.co.il> References: <20051215070222.GK11674@us.ibm.com> <20051215073512.GA26722@mellanox.co.il> Message-ID: <20051215180314.GN11674@us.ibm.com> On 15.12.2005 [09:35:12 +0200], Michael S. Tsirkin wrote: > Quoting r. Nishanth Aravamudan : > > Subject: Re: [PATCH] add LDFLAGS to perftest/Makefile > > > > On 15.12.2005 [08:57:24 +0200], Michael S. Tsirkin wrote: > > > Quoting Nishanth Aravamudan : > > > > Is there a reason the perftest/Makefile doesn't use LDFLAGS? > > > > Specifically, in automating userspace build & test, I put the IB > > > > libraries in a temporary directory, and exporting CFLAGS and LDFLAGS > > > > works with all other Makefiles (well, the ones I expect to work), > > but > > > > perftest does not seem to pick up my exports. > > > > > > > > Would something like the following make sense (sorry if a different > > -p > > > > is preferred)? Or does it need to be +=? > > > > > > > > Description: Add LDFLAGS to the perftest Makefile to allow library > > > > directories in non-standard locations to be specified. > > > > > > Are you using gnu make? which version? > > > > GNU Make 3.80 on SLES 9 SP2. > > > > > Gnu make should use LDFLAGS automatically: > > > > > > Linking a single object file > > > `N' is made automatically from `N.o' by running the linker > > > (usually called `ld') via the C compiler. The precise command > > > used is `$(CC) $(LDFLAGS) N.o $(LOADLIBES) $(LDLIBS)'. > > > > I thought this would be the case as well, but it didn't seem to work > > without the Makefile modification. > > > > > > Signed-off-by: Nishanth Aravamudan > > > > > > > > --- Makefile 2005-12-14 14:57:04.000000000 -0800 > > > > +++ Makefile.ldflags 2005-12-14 14:57:23.000000000 -0800 > > > > @@ -2,6 +2,7 @@ TESTS = rdma_lat rdma_bw > > > > > > > > all: ${TESTS} > > > > > > > > +LDFLAGS += > > > > CFLAGS += -Wall -O2 -g -D_GNU_SOURCE > > > > LOADLIBES += -libverbs > > > > EXTRA_FILES = get_clock.c > > > > > > This really does nothing. Does this patch help you? > > > > I didn't think it should do anything either, but it did allow the make > > to work on both ppc32 and ppc64 with LDFLAGS exported in the > > environment. Without the change, the build would fail as it would not > > have the appropriate -L flags. > > Looks like a work around for bug in make. I'll have a look. 
Would make more sense than it does now if that's the case :) Thanks, Nish From swise at opengridcomputing.com Thu Dec 15 13:11:34 2005 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 15 Dec 2005 15:11:34 -0600 Subject: [Fwd: [openib-general] [PATCH] iWARP Support added to the CMA] In-Reply-To: <1134669640.7186.8.camel@trinity.austin.ammasso.com> References: <1134669640.7186.8.camel@trinity.austin.ammasso.com> Message-ID: <1134681094.18325.16.camel@stevo-desktop> Here are some comments on iwcm.c: iwcm_client struct: You probably want this ib client to be named something other than "cm". I would suggest you make the IB CM named "ibcm", and the IW CM named "iwcm". iw_cm_listen() - nuke the error path printks or make them more informative. iw_cm_reject() - you don't need the "goto out" and the out: label. iw_bind_qp() - this always returns -EINVAL??? Should probably always return 0. iw_cm_disconnect() - should there be a lock around setting the cm_id state here? cm_disconnect_handler() - if the upcall to the client CM fails, do you really need to set state to IDLE? It should already be IDLE, right? Or am I confused? On Thu, 2005-12-15 at 12:00 -0600, Tom Tucker wrote: > FYI, a patch to support iWARP in the CMA has been posted to OpenIB for > review. The code is also checked into the iWARP branch in svn. > > -------- Forwarded Message -------- > From: Tom Tucker > To: openib-general at openib.org > Subject: [openib-general] [PATCH] iWARP Support added to the CMA > Date: Thu, 15 Dec 2005 11:55:36 -0600 > This is a patch to the iWARP branch that adds: > > - A generic iWARP transport CM module > - Support for iWARP transports to the CMA > - Modifications to the AMSO1100 driver for the iWARP transport CM > - ULP add_one event changes to filter events based on node_type > > The code has been tested on IB and iWARP HCA with both the cmatose and krping applications. > > The code can also be checked out from the iWARP branch with these patches applied. > > Signed-off-by: Tom Tucker > > > Index: ulp/ipoib/ipoib_main.c > =================================================================== > --- ulp/ipoib/ipoib_main.c (revision 4186) > +++ ulp/ipoib/ipoib_main.c (working copy) > @@ -1024,6 +1024,9 @@ > struct ipoib_dev_priv *priv; > int s, e, p; > > + if (device->node_type == IB_NODE_RNIC) > + return; > + > dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); > if (!dev_list) > return; > @@ -1054,6 +1057,9 @@ > struct ipoib_dev_priv *priv, *tmp; > struct list_head *dev_list; > > + if (device->node_type == IB_NODE_RNIC) > + return; > + > dev_list = ib_get_client_data(device, &ipoib_client); > > list_for_each_entry_safe(priv, tmp, dev_list, list) { > Index: include/rdma/ib_verbs.h > =================================================================== > --- include/rdma/ib_verbs.h (revision 4186) > +++ include/rdma/ib_verbs.h (working copy) > @@ -805,7 +805,7 @@ > struct ib_gid_cache **gid_cache; > }; > > -struct iw_cm; > +struct iw_cm_provider; > struct ib_device { > struct device *dma_device; > > @@ -822,7 +822,7 @@ > > u32 flags; > > - struct iw_cm *iwcm; > + struct iw_cm_verbs *iwcm; > > int (*query_device)(struct ib_device *device, > struct ib_device_attr *device_attr); > Index: include/rdma/iw_cm.h > =================================================================== > --- include/rdma/iw_cm.h (revision 4186) > +++ include/rdma/iw_cm.h (working copy) > @@ -1,5 +1,7 @@ > /* > * Copyright (c) 2005 Ammasso, Inc. All rights reserved. > + * Copyright (c) 2005 Network Appliance, Inc. 
All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. You may choose to be licensed under the terms of the GNU > @@ -33,112 +35,119 @@ > #define IW_CM_H > > #include > +#include > > -/* iWARP connection attributes. */ > +struct iw_cm_id; > +struct iw_cm_event; > > -struct iw_conn_attr { > - struct in_addr local_addr; > - struct in_addr remote_addr; > - u16 local_port; > - u16 remote_port; > +enum iw_cm_event_type { > + IW_CM_EVENT_CONNECT_REQUEST = 1, /* connect request received */ > + IW_CM_EVENT_CONNECT_REPLY, /* reply from active connect request */ > + IW_CM_EVENT_ESTABLISHED, > + IW_CM_EVENT_LLP_DISCONNECT, > + IW_CM_EVENT_LLP_RESET, > + IW_CM_EVENT_LLP_TIMEOUT, > + IW_CM_EVENT_CLOSE > }; > > -/* This is provided in the event generated when > - * a remote peer accepts our connect request > - */ > - > -enum conn_result { > - IW_CONN_ACCEPT = 0, > - IW_CONN_RESET, > - IW_CONN_PEER_REJECT, > - IW_CONN_TIMEDOUT, > - IW_CONN_NO_ROUTE_TO_HOST, > - IW_CONN_INVALID_PARM > +struct iw_cm_event { > + enum iw_cm_event_type event; > + int status; > + u32 provider_id; > + struct sockaddr_in local_addr; > + struct sockaddr_in remote_addr; > + void *private_data; > + u8 private_data_len; > }; > - > -/* This structure is provided in the event that > - * completes an active connection request. > - */ > -struct iw_conn_results { > - enum conn_result result; > - struct iw_conn_attr conn_attr; > - u8 *private_data; > - int private_data_len; > -}; > > -/* This is provided in the event generated by a remote > - * connect request to a listening endpoint > - */ > -struct iw_conn_request { > - u32 cr_id; > - struct iw_conn_attr conn_attr; > - u8 *private_data; > - int private_data_len; > -}; > +typedef int (*iw_cm_handler)(struct iw_cm_id *cm_id, > + struct iw_cm_event *event); > > -/* Connection events. */ > -enum iw_cm_event_type { > - IW_EVENT_ACTIVE_CONNECT_RESULTS, > - IW_EVENT_CONNECT_REQUEST, > - IW_EVENT_DISCONNECT > +enum iw_cm_state { > + IW_CM_STATE_IDLE, /* unbound, inactive */ > + IW_CM_STATE_LISTEN, /* listen waiting for connect */ > + IW_CM_STATE_CONN_SENT, /* outbound waiting for peer accept */ > + IW_CM_STATE_CONN_RECV, /* inbound waiting for user accept */ > + IW_CM_STATE_ESTABLISHED, /* established */ > }; > > -struct iw_cm_event { > - struct ib_device *device; > - union { > - struct iw_conn_results active_results; > - struct iw_conn_request conn_request; > - } element; > - enum iw_cm_event_type event; > +typedef void (*iw_event_handler)(struct iw_cm_id* cm_id, > + struct iw_cm_event* event); > +struct iw_cm_id { > + iw_cm_handler cm_handler; /* client callback function */ > + void *context; /* context to provide to client cb */ > + enum iw_cm_state state; > + struct ib_device *device; > + struct ib_qp *qp; > + struct sockaddr_in local_addr; > + struct sockaddr_in remote_addr; > + u64 provider_id; /* device handle for this conn. */ > + iw_event_handler event_handler; /* callback for IW CM Provider events */ > }; > > -/* Listening endpoint. */ > -struct iw_listen_ep_attr { > - void (*event_handler)(struct iw_cm_event *, void *); > - void *listen_context; > - struct in_addr addr; > - u16 port; > - int backlog; > -}; > +/** > + * iw_create_cm_id - Allocate a communication identifier. > + * @device: Device associated with the cm_id. All related communication will > + * be associated with the specified device. 
> + * @cm_handler: Callback invoked to notify the user of CM events. > + * @context: User specified context associated with the communication > + * identifier. > + * > + * Communication identifiers are used to track connection states, > + * addr resolution requests, and listen requests. > + */ > +struct iw_cm_id *iw_create_cm_id(struct ib_device *device, > + iw_cm_handler cm_handler, > + void *context); > > -struct iw_cm { > +/* This is provided in the event generated when > + * a remote peer accepts our connect request > + */ > > - int (*connect_qp)(struct ib_qp *ib_qp, > - struct iw_conn_attr* attr, > - void (*event_handler)(struct iw_cm_event*, void*), > - void* context, > - u8 *pdata, > - int pdata_len > - ); > +struct iw_cm_verbs { > + int (*connect)(struct iw_cm_id* cm_id, > + const void* private_data, > + u8 private_data_len); > + > + int (*disconnect)(struct iw_cm_id* cm_id, > + int abrupt); > > - int (*disconnect_qp)(struct ib_qp *qp, > - int abrupt > - ); > + int (*accept)(struct iw_cm_id*, > + const void *private_data, > + u8 pdata_data_len); > > - int (*accept_cr)(struct ib_device* ibdev, > - u32 cr_id, > - struct ib_qp *qp, > - void (*event_handler)(struct iw_cm_event*, void*), > - void *context, > - u8 *pdata, > - int pdata_len); > + int (*reject)(struct iw_cm_id* cm_id, > + const void* private_data, > + u8 private_data_len); > > - int (*reject_cr)(struct ib_device* ibdev, > - u32 cr_id, > - u8 *pdata, > - int pdata_len); > + int (*getpeername)(struct iw_cm_id* cm_id, > + struct sockaddr_in* local_addr, > + struct sockaddr_in* remote_addr); > > - int (*query_cr)(struct ib_device* ibdev, > - u32 cr_id, > - struct iw_conn_request* req); > + int (*create_listen)(struct iw_cm_id* cm_id, > + int backlog); > > - int (*create_listen_ep)(struct ib_device *ibdev, > - struct iw_listen_ep_attr *ep_attrs, > - void **ep_handle); > + int (*destroy_listen)(struct iw_cm_id* cm_id); > > - int (*destroy_listen_ep)(struct ib_device *ibdev, > - void *ep_handle); > - > }; > > +struct iw_cm_id *iw_create_cm_id(struct ib_device *device, > + iw_cm_handler cm_handler, > + void *context); > +void iw_destroy_cm_id(struct iw_cm_id *cm_id); > +int iw_cm_listen(struct iw_cm_id *cm_id, int backlog); > +int iw_cm_getpeername(struct iw_cm_id *cm_id, > + struct sockaddr_in* local_add, > + struct sockaddr_in* remote_addr); > +int iw_cm_reject(struct iw_cm_id *cm_id, > + const void *private_data, > + u8 private_data_len); > +int iw_cm_accept(struct iw_cm_id *cm_id, > + const void *private_data, > + u8 private_data_len); > +int iw_cm_connect(struct iw_cm_id *cm_id, > + const void* pdata, u8 pdata_len); > +int iw_cm_disconnect(struct iw_cm_id *cm_id); > +int iw_cm_bind_qp(struct iw_cm_id* cm_id, struct ib_qp* qp); > + > #endif /* IW_CM_H */ > Index: core/cm.c > =================================================================== > --- core/cm.c (revision 4186) > +++ core/cm.c (working copy) > @@ -3227,6 +3227,10 @@ > int ret; > u8 i; > > + /* Ignore RNIC devices */ > + if (device->node_type == IB_NODE_RNIC) > + return; > + > cm_dev = kmalloc(sizeof(*cm_dev) + sizeof(*port) * > device->phys_port_cnt, GFP_KERNEL); > if (!cm_dev) > @@ -3291,6 +3295,10 @@ > if (!cm_dev) > return; > > + /* Ignore RNIC devices */ > + if (device->node_type == IB_NODE_RNIC) > + return; > + > write_lock_irqsave(&cm.device_lock, flags); > list_del(&cm_dev->list); > write_unlock_irqrestore(&cm.device_lock, flags); > Index: core/iwcm.c > =================================================================== > --- core/iwcm.c (revision 0) > +++ 
core/iwcm.c (revision 0) > @@ -0,0 +1,671 @@ > +/* > + * Copyright (c) 2004, 2005 Intel Corporation. All rights reserved. > + * Copyright (c) 2004 Topspin Corporation. All rights reserved. > + * Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved. > + * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved. > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > + * Copyright (c) 2005 Network Appliance, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. 
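
[Aside: both connection managers register as ib_clients, and each one
filters add/remove on node_type so a device is claimed by exactly one
of them -- ib_cm skips RNICs (the cm.c hunks above) while iw_cm, below,
skips everything that is not an RNIC. A sketch of the ownership split,
assuming only these two CM clients exist:]

	static void some_cm_add_one(struct ib_device *device)
	{
		if (device->node_type == IB_NODE_RNIC)
			return;		/* iWARP devices belong to iw_cm */
		/* ...per-device setup for IB CAs... */
	}
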
> + * > + */ > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include > +#include > +#include > + > +#include "cm_msgs.h" > + > +MODULE_AUTHOR("Tom Tucker"); > +MODULE_DESCRIPTION("iWARP CM"); > +MODULE_LICENSE("Dual BSD/GPL"); > + > +static void iwcm_add_one(struct ib_device *device); > +static void iwcm_remove_one(struct ib_device *device); > +struct iwcm_id_private; > + > +static struct ib_client iwcm_client = { > + .name = "cm", > + .add = iwcm_add_one, > + .remove = iwcm_remove_one > +}; > + > +static struct { > + spinlock_t lock; > + struct list_head device_list; > + rwlock_t device_lock; > + struct workqueue_struct* wq; > +} iwcm; > + > +struct iwcm_device; > +struct iwcm_port { > + struct iwcm_device *iwcm_dev; > + struct sockaddr_in local_addr; > + u8 port_num; > +}; > + > +struct iwcm_device { > + struct list_head list; > + struct ib_device *device; > + struct iwcm_port port[0]; > +}; > + > +struct iwcm_id_private { > + struct iw_cm_id id; > + > + spinlock_t lock; > + wait_queue_head_t wait; > + atomic_t refcount; > + > + struct rb_node listen_node; > + > + struct list_head work_list; > + atomic_t work_count; > +}; > + > +struct iwcm_work { > + struct work_struct work; > + struct iwcm_id_private* cm_id; > + struct iw_cm_event event; > +}; > + > +/* Called whenever a reference added for a cm_id */ > +static inline void iwcm_addref_id(struct iwcm_id_private *cm_id_priv) > +{ > + atomic_inc(&cm_id_priv->refcount); > +} > + > +/* Called whenever releasing a reference to a cm id */ > +static inline void iwcm_deref_id(struct iwcm_id_private *cm_id_priv) > +{ > + if (atomic_dec_and_test(&cm_id_priv->refcount)) > + wake_up(&cm_id_priv->wait); > +} > + > +static void cm_event_handler(struct iw_cm_id* cm_id, struct iw_cm_event* event); > + > +struct iw_cm_id *iw_create_cm_id(struct ib_device *device, > + iw_cm_handler cm_handler, > + void *context) > +{ > + struct iwcm_id_private *iwcm_id_priv; > + > + iwcm_id_priv = kmalloc(sizeof *iwcm_id_priv, GFP_KERNEL); > + if (!iwcm_id_priv) > + return ERR_PTR(-ENOMEM); > + > + memset(iwcm_id_priv, 0, sizeof *iwcm_id_priv); > + iwcm_id_priv->id.state = IW_CM_STATE_IDLE; > + iwcm_id_priv->id.device = device; > + iwcm_id_priv->id.cm_handler = cm_handler; > + iwcm_id_priv->id.context = context; > + iwcm_id_priv->id.event_handler = cm_event_handler; > + > + spin_lock_init(&iwcm_id_priv->lock); > + init_waitqueue_head(&iwcm_id_priv->wait); > + atomic_set(&iwcm_id_priv->refcount, 1); > + > + return &iwcm_id_priv->id; > + > +} > +EXPORT_SYMBOL(iw_create_cm_id); > + > +struct iw_cm_id* iw_clone_id(struct iw_cm_id* parent) > +{ > + return iw_create_cm_id(parent->device, > + parent->cm_handler, > + parent->context); > +} > + > +void iw_destroy_cm_id(struct iw_cm_id *cm_id) > +{ > + struct iwcm_id_private *iwcm_id_priv; > + unsigned long flags; > + int ret = 0; > + > + > + iwcm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > + > + spin_lock_irqsave(&iwcm_id_priv->lock, flags); > + switch (cm_id->state) { > + case IW_CM_STATE_LISTEN: > + cm_id->state = IW_CM_STATE_IDLE; > + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); > + ret = cm_id->device->iwcm->destroy_listen(cm_id); > + break; > + > + case IW_CM_STATE_CONN_RECV: > + case IW_CM_STATE_CONN_SENT: > + case IW_CM_STATE_ESTABLISHED: > + cm_id->state = IW_CM_STATE_IDLE; > + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); > + ret = cm_id->device->iwcm->disconnect(cm_id,1); > + break; > + > + case IW_CM_STATE_IDLE: > + 
spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); > + break; > + > + default: > + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); > + printk(KERN_ERR "%s:%s:%u Illegal state %d for iw_cm_id.\n", > + __FILE__, __FUNCTION__, __LINE__, cm_id->state); > + ; > + } > + > + atomic_dec(&iwcm_id_priv->refcount); > + wait_event(iwcm_id_priv->wait, !atomic_read(&iwcm_id_priv->refcount)); > + > + kfree(iwcm_id_priv); > +} > +EXPORT_SYMBOL(iw_destroy_cm_id); > + > +int iw_cm_listen(struct iw_cm_id *cm_id, int backlog) > +{ > + struct iwcm_id_private *iwcm_id_priv; > + unsigned long flags; > + int ret = 0; > + > + > + iwcm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > + > + if (cm_id->device == 0) { > + printk(KERN_ERR "device is NULL\n"); > + return -EINVAL; > + } > + > + if (cm_id->device->iwcm == 0) { > + printk(KERN_ERR "iwcm is NULL\n"); > + return -EINVAL; > + } > + > + spin_lock_irqsave(&iwcm_id_priv->lock, flags); > + if (cm_id->state != IW_CM_STATE_IDLE) { > + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); > + return -EBUSY; > + } > + cm_id->state = IW_CM_STATE_LISTEN; > + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); > + > + ret = cm_id->device->iwcm->create_listen(cm_id, backlog); > + if (ret != 0) { > + spin_lock_irqsave(&iwcm_id_priv->lock, flags); > + cm_id->state = IW_CM_STATE_IDLE; > + spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); > + } > + return ret; > +} > +EXPORT_SYMBOL(iw_cm_listen); > + > +int iw_cm_getpeername(struct iw_cm_id *cm_id, > + struct sockaddr_in* local_addr, > + struct sockaddr_in* remote_addr) > +{ > + if (cm_id->device == 0) > + return -EINVAL; > + > + if (cm_id->device->iwcm == 0) > + return -EINVAL; > + > + /* Make sure there's a connection */ > + if (cm_id->state != IW_CM_STATE_ESTABLISHED) > + return -ENOTCONN; > + > + return cm_id->device->iwcm->getpeername(cm_id, local_addr, remote_addr); > +} > +EXPORT_SYMBOL(iw_cm_getpeername); > + > +int iw_cm_reject(struct iw_cm_id *cm_id, > + const void *private_data, > + u8 private_data_len) > +{ > + struct iwcm_id_private *iwcm_id_priv; > + unsigned long flags; > + int ret; > + > + > + if (cm_id->device == 0 || cm_id->device->iwcm == 0) > + return -EINVAL; > + > + iwcm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > + > + spin_lock_irqsave(&iwcm_id_priv->lock, flags); > + switch (cm_id->state) { > + case IW_CM_STATE_CONN_RECV: > + ret = cm_id->device->iwcm->reject(cm_id, private_data, private_data_len); > + cm_id->state = IW_CM_STATE_IDLE; > + break; > + default: > + ret = -EINVAL; > + goto out; > + } > + > +out: spin_unlock_irqrestore(&iwcm_id_priv->lock, flags); > + return ret; > +} > +EXPORT_SYMBOL(iw_cm_reject); > + > +int iw_cm_accept(struct iw_cm_id *cm_id, > + const void *private_data, > + u8 private_data_len) > +{ > + struct iwcm_id_private *iwcm_id_priv; > + int ret; > + > + if (cm_id->device == 0 || cm_id->device->iwcm == 0) > + return -EINVAL; > + > + iwcm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > + > + switch (cm_id->state) { > + case IW_CM_STATE_CONN_RECV: > + ret = cm_id->device->iwcm->accept(cm_id, private_data, > + private_data_len); > + if (ret == 0) { > + struct iw_cm_event event; > + event.event = IW_CM_EVENT_ESTABLISHED; > + event.provider_id = cm_id->provider_id; > + event.status = 0; > + event.local_addr = cm_id->local_addr; > + event.remote_addr = cm_id->remote_addr; > + event.private_data = 0; > + event.private_data_len = 0; > + cm_event_handler(cm_id, &event); > + } > + > + break; > + default: > + ret = -EINVAL; > + } > + > + 
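
[Aside, hypothetical client code: on the passive side the calls above
combine roughly as follows, from inside a connect-request callback.
Note that on success the ESTABLISHED event is synthesized by
iw_cm_accept() and reaches the client asynchronously via the work
queue, not before iw_cm_accept() returns. example_server and its
fields are invented for illustration.]

	static int example_conn_req_cb(struct iw_cm_id *cm_id,
				       struct iw_cm_event *event)
	{
		struct example_server *srv = cm_id->context; /* hypothetical */

		if (!srv->want_peers) {	/* hypothetical policy check */
			iw_cm_reject(cm_id, NULL, 0);
			return -ECONNREFUSED; /* nonzero: core destroys cm_id */
		}

		iw_cm_bind_qp(cm_id, srv->qp);
		return iw_cm_accept(cm_id, NULL, 0);
	}
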
return ret;
> +}
> +EXPORT_SYMBOL(iw_cm_accept);
> +
> +int iw_cm_bind_qp(struct iw_cm_id* cm_id, struct ib_qp* qp)
> +{
> + int ret = -EINVAL;
> +
> + if (cm_id) {
> + cm_id->qp = qp;
> + ret = 0;
> + }
> +
> + return ret;
> +}
> +EXPORT_SYMBOL(iw_cm_bind_qp);
> +
> +int iw_cm_connect(struct iw_cm_id *cm_id,
> + const void* pdata, u8 pdata_len)
> +{
> + struct iwcm_id_private* cm_id_priv;
> + int ret = 0;
> + unsigned long flags;
> +
> + if (cm_id->state != IW_CM_STATE_IDLE)
> + return -EBUSY;
> +
> + if (cm_id->device == 0)
> + return -EINVAL;
> +
> + if (cm_id->device->iwcm == 0)
> + return -ENOSYS;
> +
> + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id);
> +
> + spin_lock_irqsave(&cm_id_priv->lock, flags);
> + cm_id->state = IW_CM_STATE_CONN_SENT;
> + spin_unlock_irqrestore(&cm_id_priv->lock, flags);
> +
> + ret = cm_id->device->iwcm->connect(cm_id, pdata, pdata_len);
> + if (ret != 0) {
> + spin_lock_irqsave(&cm_id_priv->lock, flags);
> + cm_id->state = IW_CM_STATE_IDLE;
> + spin_unlock_irqrestore(&cm_id_priv->lock, flags);
> + }
> + return ret;
> +}
> +EXPORT_SYMBOL(iw_cm_connect);
> +
> +int iw_cm_disconnect(struct iw_cm_id *cm_id)
> +{
> + struct iwcm_id_private *iwcm_id_priv;
> + int ret;
> +
> + if (cm_id->device == 0 || cm_id->device->iwcm == 0 || cm_id->qp == 0)
> + return -EINVAL;
> +
> + iwcm_id_priv = container_of(cm_id, struct iwcm_id_private, id);
> +
> + switch (cm_id->state) {
> + case IW_CM_STATE_ESTABLISHED:
> + ret = cm_id->device->iwcm->disconnect(cm_id, 1);
> + cm_id->state = IW_CM_STATE_IDLE;
> + if (ret == 0) {
> + struct iw_cm_event event;
> + event.event = IW_CM_EVENT_LLP_DISCONNECT;
> + event.provider_id = cm_id->provider_id;
> + event.status = 0;
> + event.local_addr = cm_id->local_addr;
> + event.remote_addr = cm_id->remote_addr;
> + event.private_data = 0;
> + event.private_data_len = 0;
> + cm_event_handler(cm_id, &event);
> + }
> +
> + break;
> + default:
> + ret = -EINVAL;
> + }
> +
> + return ret;
> +}
> +EXPORT_SYMBOL(iw_cm_disconnect);
> +
> +static void iwcm_add_one(struct ib_device *device)
> +{
> + struct iwcm_device *iwcm_dev;
> + struct iwcm_port *port;
> + unsigned long flags;
> + u8 i;
> +
> + if (device->node_type != IB_NODE_RNIC)
> + return;
> +
> + iwcm_dev = kmalloc(sizeof(*iwcm_dev) + sizeof(*port) *
> + device->phys_port_cnt, GFP_KERNEL);
> + if (!iwcm_dev)
> + return;
> +
> + iwcm_dev->device = device;
> +
> + for (i = 1; i <= device->phys_port_cnt; i++) {
> + port = &iwcm_dev->port[i-1];
> + port->iwcm_dev = iwcm_dev;
> + port->port_num = i;
> + }
> +
> + ib_set_client_data(device, &iwcm_client, iwcm_dev);
> +
> + write_lock_irqsave(&iwcm.device_lock, flags);
> + list_add_tail(&iwcm_dev->list, &iwcm.device_list);
> + write_unlock_irqrestore(&iwcm.device_lock, flags);
> + return;
> +}
> +
> +static void iwcm_remove_one(struct ib_device *device)
> +{
> + struct iwcm_device *iwcm_dev;
> + unsigned long flags;
> +
> + if (device->node_type != IB_NODE_RNIC)
> + return;
> +
> + iwcm_dev = ib_get_client_data(device, &iwcm_client);
> + if (!iwcm_dev)
> + return;
> +
> + write_lock_irqsave(&iwcm.device_lock, flags);
> + list_del(&iwcm_dev->list);
> + write_unlock_irqrestore(&iwcm.device_lock, flags);
> +
> + kfree(iwcm_dev);
> +}
> +
> +/* Handles an inbound connect request. The function creates a new
> + * iw_cm_id to represent the new connection and inherits the client
> + * callback function and other attributes from the listening parent.
> + *
> + * The work item contains a pointer to the listen_cm_id and the event.
The > + * listen_cm_id contains the client cm_handler, context and device. These are > + * copied when the device is cloned. The event contains the new four tuple. > + */ > +static int cm_conn_req_handler(struct iwcm_work* work) > +{ > + struct iw_cm_id* cm_id; > + struct iwcm_id_private* cm_id_priv; > + unsigned long flags; > + int rc; > + > + /* If the status was not successful, ignore request */ > + if (work->event.status) { > + printk(KERN_ERR "Bad status=%d for connection request ... " > + "should be filtered by provider\n", > + work->event.status); > + return work->event.status; > + } > + cm_id = iw_clone_id(&work->cm_id->id); > + if (IS_ERR(cm_id)) > + return PTR_ERR(cm_id); > + > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > + spin_lock_irqsave(&cm_id_priv->lock, flags); > + cm_id_priv->id.local_addr = work->event.local_addr; > + cm_id_priv->id.remote_addr = work->event.remote_addr; > + cm_id_priv->id.provider_id = work->event.provider_id; > + cm_id_priv->id.state = IW_CM_STATE_CONN_RECV; > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > + > + /* Call the client CM handler */ > + rc = cm_id->cm_handler(cm_id, &work->event); > + if (rc) { > + cm_id->state = IW_CM_STATE_IDLE; > + iw_destroy_cm_id(cm_id); > + } > + kfree(work); > + return 0; > +} > + > +/* > + * Handles the transition to established state on the passive side. > + */ > +static int cm_conn_est_handler(struct iwcm_work* work) > +{ > + struct iwcm_id_private* cm_id_priv; > + unsigned long flags; > + int ret = 0; > + > + cm_id_priv = work->cm_id; > + spin_lock_irqsave(&cm_id_priv->lock, flags); > + if (cm_id_priv->id.state != IW_CM_STATE_CONN_RECV) { > + printk(KERN_ERR "%s:%d Invalid cm_id state=%d for established event\n", > + __FUNCTION__, __LINE__, cm_id_priv->id.state); > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > + ret = -EINVAL; > + goto error_out; > + } > + > + if (work->event.status == 0) { > + cm_id_priv = work->cm_id; > + cm_id_priv->id.local_addr = work->event.local_addr; > + cm_id_priv->id.remote_addr = work->event.remote_addr; > + cm_id_priv->id.state = IW_CM_STATE_ESTABLISHED; > + } else { > + cm_id_priv->id.state = IW_CM_STATE_IDLE; > + } > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > + > + /* Call the client CM handler */ > + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, &work->event); > + if (ret) { > + cm_id_priv->id.state = IW_CM_STATE_IDLE; > + iw_destroy_cm_id(&cm_id_priv->id); > + } > + > + error_out: > + kfree(work); > + return ret; > +} > + > +/* > + * Handles the reply to our connect request. There are three > + * possibilities: > + * - If the cm_id is in the wrong state when the event is > + * delivered, the event is ignored. [What should we do when the > + * provider does something crazy?] > + * - If the remote peer accepts the connection, we update the 4-tuple > + * in the cm_id with the remote peer info, move the cm_id to the > + * ESTABLISHED state and deliver the event to the client. > + * - If the remote peer rejects the connection, or there is some > + * connection error, move the cm_id to the IDLE state, and deliver > + * the event to the client. 
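
[Aside: a condensed view of the passive-side flow implemented above --
a restatement for clarity, not new behavior. Every inbound request
gets a child cm_id cloned from the listener, so the listener keeps
listening while each child tracks exactly one connection:]

	cm_id = iw_clone_id(&listen_id->id);	/* inherits handler, context */
	cm_id->local_addr  = event->local_addr;	/* new four-tuple */
	cm_id->remote_addr = event->remote_addr;
	cm_id->state = IW_CM_STATE_CONN_RECV;
	if (cm_id->cm_handler(cm_id, event))	/* client declined */
		iw_destroy_cm_id(cm_id);
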
> + */ > +static int cm_conn_rep_handler(struct iwcm_work* work) > +{ > + struct iwcm_id_private* cm_id_priv; > + unsigned long flags; > + int ret = 0; > + > + cm_id_priv = work->cm_id; > + spin_lock_irqsave(&cm_id_priv->lock, flags); > + if (cm_id_priv->id.state != IW_CM_STATE_CONN_SENT) { > + printk(KERN_ERR "%s:%d Invalid cm_id state=%d for connect reply event\n", > + __FUNCTION__, __LINE__, cm_id_priv->id.state); > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > + ret = -EINVAL; > + goto error_out; > + } > + > + if (work->event.status == 0) { > + cm_id_priv = work->cm_id; > + cm_id_priv->id.local_addr = work->event.local_addr; > + cm_id_priv->id.remote_addr = work->event.remote_addr; > + cm_id_priv->id.state = IW_CM_STATE_ESTABLISHED; > + } else { > + cm_id_priv->id.state = IW_CM_STATE_IDLE; > + } > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > + > + /* Call the client CM handler */ > + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, &work->event); > + if (ret) { > + cm_id_priv->id.state = IW_CM_STATE_IDLE; > + iw_destroy_cm_id(&cm_id_priv->id); > + } > + > + error_out: > + kfree(work); > + return ret; > +} > + > +static int cm_disconnect_handler(struct iwcm_work* work) > +{ > + struct iwcm_id_private* cm_id_priv; > + unsigned long flags; > + int ret = 0; > + > + cm_id_priv = work->cm_id; > + spin_lock_irqsave(&cm_id_priv->lock, flags); > + cm_id_priv->id.state = IW_CM_STATE_IDLE; > + spin_unlock_irqrestore(&cm_id_priv->lock, flags); > + > + /* Call the client CM handler */ > + ret = cm_id_priv->id.cm_handler(&cm_id_priv->id, &work->event); > + if (ret) { > + cm_id_priv->id.state = IW_CM_STATE_IDLE; > + iw_destroy_cm_id(&cm_id_priv->id); > + } > + > + kfree(work); > + return ret; > +} > + > +static void cm_work_handler(void* arg) > +{ > + struct iwcm_work* work = (struct iwcm_work*)arg; > + int rc; > + > + switch (work->event.event) { > + case IW_CM_EVENT_CONNECT_REQUEST: > + rc = cm_conn_req_handler(work); > + break; > + case IW_CM_EVENT_CONNECT_REPLY: > + rc = cm_conn_rep_handler(work); > + break; > + case IW_CM_EVENT_ESTABLISHED: > + rc = cm_conn_est_handler(work); > + break; > + case IW_CM_EVENT_LLP_DISCONNECT: > + case IW_CM_EVENT_LLP_TIMEOUT: > + case IW_CM_EVENT_LLP_RESET: > + case IW_CM_EVENT_CLOSE: > + rc = cm_disconnect_handler(work); > + break; > + } > +} > + > +/* IW CM provider event callback handler. This function is called on > + * interrupt context. The function builds a work queue element > + * and enqueues it for processing on a work queue thread. This allows > + * CM client callback functions to block. 
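
[Aside: a convention worth making explicit. Each work handler above
treats a non-zero return from the client's cm_handler as a request to
destroy the cm_id, so a client can drop a failed connection directly
from its callback. Illustrative client code:]

	static int example_handler(struct iw_cm_id *cm_id,
				   struct iw_cm_event *event)
	{
		if (event->event == IW_CM_EVENT_CONNECT_REPLY &&
		    event->status)
			return event->status; /* nonzero: core destroys cm_id */
		return 0;		      /* keep the cm_id alive */
	}
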
> + */ > +static void cm_event_handler(struct iw_cm_id* cm_id, > + struct iw_cm_event* event) > +{ > + struct iwcm_work *work; > + struct iwcm_id_private* cm_id_priv; > + > + work = kmalloc(sizeof *work, GFP_ATOMIC); > + if (!work) > + return; > + > + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); > + INIT_WORK(&work->work, cm_work_handler, work); > + work->cm_id = cm_id_priv; > + work->event = *event; > + queue_work(iwcm.wq, &work->work); > +} > + > +static int __init iw_cm_init(void) > +{ > + memset(&iwcm, 0, sizeof iwcm); > + INIT_LIST_HEAD(&iwcm.device_list); > + rwlock_init(&iwcm.device_lock); > + spin_lock_init(&iwcm.lock); > + iwcm.wq = create_workqueue("iw_cm"); > + if (!iwcm.wq) > + return -ENOMEM; > + > + return ib_register_client(&iwcm_client); > +} > + > +static void __exit iw_cm_cleanup(void) > +{ > + ib_unregister_client(&iwcm_client); > +} > + > +module_init(iw_cm_init); > +module_exit(iw_cm_cleanup); > + > Index: core/addr.c > =================================================================== > --- core/addr.c (revision 4186) > +++ core/addr.c (working copy) > @@ -73,8 +73,13 @@ > if (!dev) > return -EADDRNOTAVAIL; > > - *gid = *(union ib_gid *) (dev->dev_addr + 4); > - *pkey = addr_get_pkey(dev); > + if (dev->type == ARPHRD_INFINIBAND) { > + *gid = *(union ib_gid *) (dev->dev_addr + 4); > + *pkey = addr_get_pkey(dev); > + } else { > + *gid = *(union ib_gid *) (dev->dev_addr); > + *pkey = 0; > + } > dev_put(dev); > return 0; > } > Index: core/Makefile > =================================================================== > --- core/Makefile (revision 4186) > +++ core/Makefile (working copy) > @@ -1,6 +1,6 @@ > EXTRA_CFLAGS += -Idrivers/infiniband/include -Idrivers/infiniband/ulp/ipoib > > -obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_ping.o ib_cm.o \ > +obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_ping.o ib_cm.o iw_cm.o \ > ib_sa.o ib_at.o ib_addr.o rdma_cm.o > obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o > obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o ib_uat.o > @@ -14,6 +14,8 @@ > > ib_cm-y := cm.o > > +iw_cm-y := iwcm.o > + > rdma_cm-y := cma.o > > ib_addr-y := addr.o > Index: core/cma.c > =================================================================== > --- core/cma.c (revision 4186) > +++ core/cma.c (working copy) > @@ -1,4 +1,5 @@ > /* > + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. > * Copyright (c) 2005 Voltaire Inc. All rights reserved. > * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. > * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved. 
> @@ -30,9 +31,14 @@ > */ > #include > #include > +#include > +#include > +#include > +#include > #include > #include > #include > +#include > #include > > MODULE_AUTHOR("Guy German"); > @@ -100,7 +106,10 @@ > int timeout_ms; > struct ib_sa_query *query; > int query_id; > - struct ib_cm_id *cm_id; > + union { > + struct ib_cm_id *ib; > + struct iw_cm_id *iw; > + } cm_id; > }; > > struct cma_addr { > @@ -266,6 +275,16 @@ > IB_QP_PKEY_INDEX | IB_QP_PORT); > } > > +static int cma_init_iw_qp(struct rdma_id_private *id_priv, struct ib_qp *qp) > +{ > + struct ib_qp_attr qp_attr; > + > + qp_attr.qp_state = IB_QPS_INIT; > + qp_attr.qp_access_flags = IB_ACCESS_LOCAL_WRITE; > + > + return ib_modify_qp(qp, &qp_attr, IB_QP_STATE | IB_QP_ACCESS_FLAGS); > +} > + > int rdma_create_qp(struct rdma_cm_id *id, struct ib_pd *pd, > struct ib_qp_init_attr *qp_init_attr) > { > @@ -285,6 +304,9 @@ > case IB_NODE_CA: > ret = cma_init_ib_qp(id_priv, qp); > break; > + case IB_NODE_RNIC: > + ret = cma_init_iw_qp(id_priv, qp); > + break; > default: > ret = -ENOSYS; > break; > @@ -314,7 +336,7 @@ > > /* Need to update QP attributes from default values. */ > qp_attr.qp_state = IB_QPS_INIT; > - ret = ib_cm_init_qp_attr(id_priv->cm_id, &qp_attr, &qp_attr_mask); > + ret = ib_cm_init_qp_attr(id_priv->cm_id.ib, &qp_attr, &qp_attr_mask); > if (ret) > return ret; > > @@ -323,7 +345,7 @@ > return ret; > > qp_attr.qp_state = IB_QPS_RTR; > - ret = ib_cm_init_qp_attr(id_priv->cm_id, &qp_attr, &qp_attr_mask); > + ret = ib_cm_init_qp_attr(id_priv->cm_id.ib, &qp_attr, &qp_attr_mask); > if (ret) > return ret; > > @@ -337,7 +359,7 @@ > int qp_attr_mask, ret; > > qp_attr.qp_state = IB_QPS_RTS; > - ret = ib_cm_init_qp_attr(id_priv->cm_id, &qp_attr, &qp_attr_mask); > + ret = ib_cm_init_qp_attr(id_priv->cm_id.ib, &qp_attr, &qp_attr_mask); > if (ret) > return ret; > > @@ -419,8 +441,8 @@ > { > cma_exch(id_priv, CMA_DESTROYING); > > - if (id_priv->cm_id && !IS_ERR(id_priv->cm_id)) > - ib_destroy_cm_id(id_priv->cm_id); > + if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) > + ib_destroy_cm_id(id_priv->cm_id.ib); > > list_del(&id_priv->listen_list); > if (id_priv->cma_dev) > @@ -476,8 +498,22 @@ > state = cma_exch(id_priv, CMA_DESTROYING); > cma_cancel_operation(id_priv, state); > > - if (id_priv->cm_id && !IS_ERR(id_priv->cm_id)) > - ib_destroy_cm_id(id_priv->cm_id); > + if (id->device) { > + switch (id->device->node_type) { > + case IB_NODE_RNIC: > + if (id_priv->cm_id.iw && !IS_ERR(id_priv->cm_id.iw)) { > + iw_destroy_cm_id(id_priv->cm_id.iw); > + id_priv->cm_id.iw = 0; > + } > + break; > + default: > + if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) { > + ib_destroy_cm_id(id_priv->cm_id.ib); > + id_priv->cm_id.ib = 0; > + } > + break; > + } > + } > > if (id_priv->cma_dev) { > down(&mutex); > @@ -505,14 +541,14 @@ > if (ret) > goto reject; > > - ret = ib_send_cm_rtu(id_priv->cm_id, NULL, 0); > + ret = ib_send_cm_rtu(id_priv->cm_id.ib, NULL, 0); > if (ret) > goto reject; > > return 0; > reject: > cma_modify_qp_err(&id_priv->id); > - ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, > + ib_send_cm_rej(id_priv->cm_id.ib, IB_CM_REJ_CONSUMER_DEFINED, > NULL, 0, NULL, 0); > return ret; > } > @@ -528,7 +564,7 @@ > return 0; > reject: > cma_modify_qp_err(&id_priv->id); > - ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, > + ib_send_cm_rej(id_priv->cm_id.ib, IB_CM_REJ_CONSUMER_DEFINED, > NULL, 0, NULL, 0); > return ret; > } > @@ -586,7 +622,7 @@ > private_data_len); > if (ret) { > /* Destroy the CM ID by returning a 
non-zero value. */ > - id_priv->cm_id = NULL; > + id_priv->cm_id.ib = NULL; > cma_exch(id_priv, CMA_DESTROYING); > cma_release_remove(id_priv); > rdma_destroy_id(&id_priv->id); > @@ -675,7 +711,7 @@ > goto out; > } > > - conn_id->cm_id = cm_id; > + conn_id->cm_id.ib = cm_id; > cm_id->context = conn_id; > cm_id->cm_handler = cma_ib_handler; > > @@ -685,7 +721,7 @@ > IB_CM_REQ_PRIVATE_DATA_SIZE - offset); > if (ret) { > /* Destroy the CM ID by returning a non-zero value. */ > - conn_id->cm_id = NULL; > + conn_id->cm_id.ib = NULL; > cma_exch(conn_id, CMA_DESTROYING); > cma_release_remove(conn_id); > rdma_destroy_id(&conn_id->id); > @@ -695,6 +731,112 @@ > return ret; > } > > +static int cma_iw_handler(struct iw_cm_id* iw_id, struct iw_cm_event* event) > +{ > + struct rdma_id_private *id_priv = iw_id->context; > + enum rdma_cm_event_type event_type = 0; > + int ret = 0; > + > + atomic_inc(&id_priv->dev_remove); > + > + switch (event->event) { > + case IW_CM_EVENT_LLP_DISCONNECT: > + case IW_CM_EVENT_LLP_RESET: > + case IW_CM_EVENT_LLP_TIMEOUT: > + case IW_CM_EVENT_CLOSE: > + event_type = RDMA_CM_EVENT_DISCONNECTED; > + break; > + > + case IW_CM_EVENT_CONNECT_REQUEST: > + BUG_ON(1); > + break; > + > + case IW_CM_EVENT_CONNECT_REPLY: { > + if (event->status) > + event_type = RDMA_CM_EVENT_REJECTED; > + else > + event_type = RDMA_CM_EVENT_ESTABLISHED; > + break; > + } > + > + case IW_CM_EVENT_ESTABLISHED: > + event_type = RDMA_CM_EVENT_ESTABLISHED; > + break; > + } > + > + ret = cma_notify_user(id_priv, > + event_type, > + event->status, > + event->private_data, > + event->private_data_len); > + if (ret) { > + /* Destroy the CM ID by returning a non-zero value. */ > + id_priv->cm_id.iw = NULL; > + cma_exch(id_priv, CMA_DESTROYING); > + cma_release_remove(id_priv); > + rdma_destroy_id(&id_priv->id); > + return ret; > + } > + > + cma_release_remove(id_priv); > + return ret; > +} > + > +static int iw_conn_req_handler(struct iw_cm_id *cm_id, > + struct iw_cm_event *iw_event) > +{ > + struct rdma_cm_id* new_cm_id; > + struct rdma_id_private *listen_id, *conn_id; > + struct sockaddr_in* sin; > + int ret; > + > + listen_id = cm_id->context; > + atomic_inc(&listen_id->dev_remove); > + if (!cma_comp(listen_id, CMA_LISTEN)) { > + ret = -ECONNABORTED; > + goto out; > + } > + > + /* Create a new RDMA id the new IW CM ID */ > + new_cm_id = rdma_create_id(listen_id->id.event_handler, > + listen_id->id.context); > + if (!new_cm_id) { > + ret = -ENOMEM; > + goto out; > + } > + conn_id = container_of(new_cm_id, struct rdma_id_private, id); > + atomic_inc(&conn_id->dev_remove); > + conn_id->state = CMA_CONNECT; > + > + /* New connection inherits device from parent */ > + cma_attach_to_dev(conn_id, listen_id->cma_dev); > + > + conn_id->cm_id.iw = cm_id; > + cm_id->context = conn_id; > + cm_id->cm_handler = cma_iw_handler; > + > + sin = (struct sockaddr_in*)&new_cm_id->route.addr.src_addr; > + *sin = iw_event->local_addr; > + > + sin = (struct sockaddr_in*)&new_cm_id->route.addr.dst_addr; > + *sin = iw_event->remote_addr; > + > + ret = cma_notify_user(conn_id, RDMA_CM_EVENT_CONNECT_REQUEST, 0, > + iw_event->private_data, > + iw_event->private_data_len); > + if (ret) { > + /* Destroy the CM ID by returning a non-zero value. 
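
[Aside: the pattern these hunks establish, condensed. rdma_id_private
now carries a union of CM identifiers, and every consumer branches on
the device's node_type; this mirrors the rdma_disconnect change later
in this patch:]

	switch (id->device->node_type) {
	case IB_NODE_RNIC:
		ret = iw_cm_disconnect(id_priv->cm_id.iw);
		break;
	default:			/* IB CAs keep using the IB CM */
		if (ib_send_cm_dreq(id_priv->cm_id.ib, NULL, 0))
			ib_send_cm_drep(id_priv->cm_id.ib, NULL, 0);
		break;
	}
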
*/ > + conn_id->cm_id.iw = NULL; > + cma_exch(conn_id, CMA_DESTROYING); > + cma_release_remove(conn_id); > + rdma_destroy_id(&conn_id->id); > + } > + > +out: > + cma_release_remove(listen_id); > + return ret; > +} > + > static __be64 cma_get_service_id(struct sockaddr *addr) > { > return cpu_to_be64(((u64)IB_OPENIB_OUI << 48) + > @@ -706,21 +848,44 @@ > __be64 svc_id; > int ret; > > - id_priv->cm_id = ib_create_cm_id(id_priv->id.device, cma_req_handler, > + id_priv->cm_id.ib = ib_create_cm_id(id_priv->id.device, cma_req_handler, > id_priv); > - if (IS_ERR(id_priv->cm_id)) > - return PTR_ERR(id_priv->cm_id); > + if (IS_ERR(id_priv->cm_id.ib)) > + return PTR_ERR(id_priv->cm_id.ib); > > svc_id = cma_get_service_id(&id_priv->id.route.addr.src_addr); > - ret = ib_cm_listen(id_priv->cm_id, svc_id, 0); > + ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0); > if (ret) { > - ib_destroy_cm_id(id_priv->cm_id); > - id_priv->cm_id = NULL; > + ib_destroy_cm_id(id_priv->cm_id.ib); > + id_priv->cm_id.ib = NULL; > } > > return ret; > } > > +static int cma_iw_listen(struct rdma_id_private *id_priv) > +{ > + int ret; > + struct sockaddr_in* sin; > + > + id_priv->cm_id.iw = iw_create_cm_id(id_priv->id.device, > + iw_conn_req_handler, > + id_priv); > + if (IS_ERR(id_priv->cm_id.iw)) > + return PTR_ERR(id_priv->cm_id.iw); > + > + sin = (struct sockaddr_in*)&id_priv->id.route.addr.src_addr; > + id_priv->cm_id.iw->local_addr = *sin; > + > + ret = iw_cm_listen(id_priv->cm_id.iw, 10 /* backlog */); > + if (ret) { > + iw_destroy_cm_id(id_priv->cm_id.iw); > + id_priv->cm_id.iw = NULL; > + } > + > + return ret; > +} > + > static int cma_duplicate_listen(struct rdma_id_private *id_priv) > { > struct rdma_id_private *cur_id_priv; > @@ -785,8 +950,9 @@ > goto out; > > list_add_tail(&id_priv->list, &listen_any_list); > - list_for_each_entry(cma_dev, &dev_list, list) > + list_for_each_entry(cma_dev, &dev_list, list) { > cma_listen_on_dev(id_priv, cma_dev); > + } > out: > up(&mutex); > return ret; > @@ -796,7 +962,6 @@ > { > struct rdma_id_private *id_priv; > int ret; > - > id_priv = container_of(id, struct rdma_id_private, id); > if (!cma_comp_exch(id_priv, CMA_ADDR_BOUND, CMA_LISTEN)) > return -EINVAL; > @@ -806,6 +971,9 @@ > case IB_NODE_CA: > ret = cma_ib_listen(id_priv); > break; > + case IB_NODE_RNIC: > + ret = cma_iw_listen(id_priv); > + break; > default: > ret = -ENOSYS; > break; > @@ -890,6 +1058,30 @@ > return (id_priv->query_id < 0) ? 
id_priv->query_id : 0; > } > > +static int cma_resolve_iw_route(struct rdma_id_private *id_priv, int timeout_ms) > +{ > + enum rdma_cm_event_type event = RDMA_CM_EVENT_ROUTE_RESOLVED; > + int rc; > + > + atomic_inc(&id_priv->dev_remove); > + > + if (!cma_comp_exch(id_priv, CMA_ROUTE_QUERY, CMA_ROUTE_RESOLVED)) > + BUG_ON(1); > + > + rc = cma_notify_user(id_priv, event, 0, NULL, 0); > + if (rc) { > + cma_exch(id_priv, CMA_DESTROYING); > + cma_release_remove(id_priv); > + cma_deref_id(id_priv); > + rdma_destroy_id(&id_priv->id); > + return rc; > + } > + > + cma_release_remove(id_priv); > + cma_deref_id(id_priv); > + return rc; > +} > + > int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms) > { > struct rdma_id_private *id_priv; > @@ -904,6 +1096,9 @@ > case IB_NODE_CA: > ret = cma_resolve_ib_route(id_priv, timeout_ms); > break; > + case IB_NODE_RNIC: > + ret = cma_resolve_iw_route(id_priv, timeout_ms); > + break; > default: > ret = -ENOSYS; > break; > @@ -952,20 +1147,133 @@ > cma_deref_id(id_priv); > } > > + > +/* Find the local interface with a route to the specified address and > + * bind the CM ID to this interface's CMA device > + */ > +static int cma_acquire_iw_dev(struct rdma_cm_id* id, struct sockaddr* addr) > +{ > + int ret = -ENOENT; > + struct cma_device* cma_dev; > + struct rdma_id_private *id_priv; > + struct sockaddr_in* sin; > + struct rtable *rt = 0; > + struct flowi fl; > + struct net_device* netdev; > + struct in_addr src_ip; > + unsigned char* dev_addr; > + > + sin = (struct sockaddr_in*)addr; > + if (sin->sin_family != AF_INET) > + return -EINVAL; > + > + id_priv = container_of(id, struct rdma_id_private, id); > + > + /* If the address is local, use the device. If it is remote, > + * look up a route to get the local address > + */ > + netdev = ip_dev_find(sin->sin_addr.s_addr); > + if (netdev) { > + src_ip = sin->sin_addr; > + dev_addr = netdev->dev_addr; > + dev_put(netdev); > + } else { > + memset(&fl, 0, sizeof(fl)); > + fl.nl_u.ip4_u.daddr = sin->sin_addr.s_addr; > + if (ip_route_output_key(&rt, &fl)) { > + return -ENETUNREACH; > + } > + dev_addr = rt->idev->dev->dev_addr; > + src_ip.s_addr = rt->rt_src; > + > + ip_rt_put(rt); > + } > + > + down(&mutex); > + > + list_for_each_entry(cma_dev, &dev_list, list) { > + if (memcmp(dev_addr, > + &cma_dev->node_guid, > + sizeof(cma_dev->node_guid)) == 0) { > + /* If we find the device, then check if this > + * is an iWARP device. If it is, then call the > + * callback handler immediately because we > + * already have the native address > + */ > + if (cma_dev->device->node_type == IB_NODE_RNIC) { > + struct sockaddr_in* cm_sin; > + /* Set our source address */ > + cm_sin = (struct sockaddr_in*) > + &id_priv->id.route.addr.src_addr; > + cm_sin->sin_family = AF_INET; > + cm_sin->sin_addr.s_addr = src_ip.s_addr; > + > + /* Claim the device in the mutex */ > + cma_attach_to_dev(id_priv, cma_dev); > + ret = 0; > + break; > + } > + } > + } > + up(&mutex); > + > + return ret; > +} > + > + > +/** > + * rdma_resolve_addr - RDMA Resolve Address > + * > + * @id: RDMA identifier. > + * @src_addr: Source IP address > + * @dst_addr: Destination IP address > + * &timeout_ms: Timeout to wait for address resolution > + * > + * Bind the specified cm_id to a local interface and if this is an IB > + * CA, determine the GIDs associated with the specified IP addresses. 
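
[Aside: why the iWARP branch can complete immediately -- an
explanatory sketch, not new code. An RNIC's "route" is just the
kernel IP route, already determined when the device was acquired, so
cma_resolve_iw_route() skips the SA path query and reports success
synchronously. From the consumer's side the calling pattern is
unchanged:]

	rdma_resolve_route(id, 2000);	/* hypothetical 2s timeout */
	/* for IB: an SA path-record query completes asynchronously;
	 * for iWARP: RDMA_CM_EVENT_ROUTE_RESOLVED is delivered from
	 * within this call path itself */
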
> + */ > int rdma_resolve_addr(struct rdma_cm_id *id, struct sockaddr *src_addr, > struct sockaddr *dst_addr, int timeout_ms) > { > struct rdma_id_private *id_priv; > - int ret; > + int ret = 0; > > id_priv = container_of(id, struct rdma_id_private, id); > if (!cma_comp_exch(id_priv, CMA_IDLE, CMA_ADDR_QUERY)) > return -EINVAL; > > atomic_inc(&id_priv->refcount); > + > id->route.addr.dst_addr = *dst_addr; > - ret = ib_resolve_addr(src_addr, dst_addr, &id->route.addr.addr.ibaddr, > - timeout_ms, addr_handler, id_priv); > + > + if (cma_acquire_iw_dev(id, dst_addr)==0) { > + > + enum rdma_cm_event_type event; > + > + cma_exch(id_priv, CMA_ADDR_RESOLVED); > + > + atomic_inc(&id_priv->dev_remove); > + > + event = RDMA_CM_EVENT_ADDR_RESOLVED; > + if (cma_notify_user(id_priv, event, 0, NULL, 0)) { > + cma_exch(id_priv, CMA_DESTROYING); > + cma_deref_id(id_priv); > + cma_release_remove(id_priv); > + rdma_destroy_id(&id_priv->id); > + return -EINVAL; > + } > + > + cma_release_remove(id_priv); > + cma_deref_id(id_priv); > + > + } else { > + > + ret = ib_resolve_addr(src_addr, > + dst_addr, &id->route.addr.addr.ibaddr, > + timeout_ms, addr_handler, id_priv); > + > + } > + > if (ret) > goto err; > > @@ -980,10 +1288,13 @@ > int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr) > { > struct rdma_id_private *id_priv; > + struct sockaddr_in* sin; > struct ib_addr *ibaddr = &id->route.addr.addr.ibaddr; > int ret; > > - if (addr->sa_family != AF_INET) > + sin = (struct sockaddr_in*)addr; > + > + if (sin->sin_family != AF_INET) > return -EINVAL; > > id_priv = container_of(id, struct rdma_id_private, id); > @@ -994,9 +1305,11 @@ > id->route.addr.src_addr = *addr; > ret = 0; > } else { > - ret = ib_translate_addr(addr, &ibaddr->sgid, &ibaddr->pkey); > - if (!ret) > - ret = cma_acquire_ib_dev(id_priv, &ibaddr->sgid); > + if ((ret = cma_acquire_iw_dev(id, addr))) { > + ret = ib_translate_addr(addr, &ibaddr->sgid, &ibaddr->pkey); > + if (!ret) > + ret = cma_acquire_ib_dev(id_priv, &ibaddr->sgid); > + } > } > > if (ret) > @@ -1041,10 +1354,10 @@ > if (!private_data) > return -ENOMEM; > > - id_priv->cm_id = ib_create_cm_id(id_priv->id.device, cma_ib_handler, > + id_priv->cm_id.ib = ib_create_cm_id(id_priv->id.device, cma_ib_handler, > id_priv); > - if (IS_ERR(id_priv->cm_id)) { > - ret = PTR_ERR(id_priv->cm_id); > + if (IS_ERR(id_priv->cm_id.ib)) { > + ret = PTR_ERR(id_priv->cm_id.ib); > goto out; > } > > @@ -1075,25 +1388,61 @@ > req.max_cm_retries = CMA_MAX_CM_RETRIES; > req.srq = id_priv->id.qp->srq ? 
1 : 0; > > - ret = ib_send_cm_req(id_priv->cm_id, &req); > + ret = ib_send_cm_req(id_priv->cm_id.ib, &req); > out: > kfree(private_data); > return ret; > } > > +static int cma_connect_iw(struct rdma_id_private *id_priv, > + struct rdma_conn_param *conn_param) > +{ > + struct iw_cm_id* cm_id; > + struct sockaddr_in* sin; > + int ret; > + > + if (id_priv->id.qp == NULL) > + return -EINVAL; > + > + cm_id = iw_create_cm_id(id_priv->id.device, cma_iw_handler, id_priv); > + if (IS_ERR(cm_id)) { > + ret = PTR_ERR(cm_id); > + goto out; > + } > + > + id_priv->cm_id.iw = cm_id; > + > + sin = (struct sockaddr_in*)&id_priv->id.route.addr.src_addr; > + cm_id->local_addr = *sin; > + > + sin = (struct sockaddr_in*)&id_priv->id.route.addr.dst_addr; > + cm_id->remote_addr = *sin; > + > + iw_cm_bind_qp(cm_id, id_priv->id.qp); > + > + ret = iw_cm_connect(cm_id, conn_param->private_data, > + conn_param->private_data_len); > + > +out: > + return ret; > +} > + > int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) > { > struct rdma_id_private *id_priv; > int ret; > > id_priv = container_of(id, struct rdma_id_private, id); > - if (!cma_comp_exch(id_priv, CMA_ROUTE_RESOLVED, CMA_CONNECT)) > + if (!cma_comp_exch(id_priv, CMA_ROUTE_RESOLVED, CMA_CONNECT)) > return -EINVAL; > > switch (id->device->node_type) { > case IB_NODE_CA: > ret = cma_connect_ib(id_priv, conn_param); > break; > + case IB_NODE_RNIC: > + ret = cma_connect_iw(id_priv, conn_param); > + break; > default: > ret = -ENOSYS; > break; > @@ -1131,7 +1480,7 @@ > rep.rnr_retry_count = conn_param->rnr_retry_count; > rep.srq = id_priv->id.qp->srq ? 1 : 0; > > - return ib_send_cm_rep(id_priv->cm_id, &rep); > + return ib_send_cm_rep(id_priv->cm_id.ib, &rep); > } > > int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) > @@ -1147,6 +1496,12 @@ > case IB_NODE_CA: > ret = cma_accept_ib(id_priv, conn_param); > break; > + case IB_NODE_RNIC: { > + iw_cm_bind_qp(id_priv->cm_id.iw, id_priv->id.qp); > + ret = iw_cm_accept(id_priv->cm_id.iw, conn_param->private_data, > + conn_param->private_data_len); > + break; > + } > default: > ret = -ENOSYS; > break; > @@ -1175,9 +1530,15 @@ > > switch (id->device->node_type) { > case IB_NODE_CA: > - ret = ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, > + ret = ib_send_cm_rej(id_priv->cm_id.ib, IB_CM_REJ_CONSUMER_DEFINED, > NULL, 0, private_data, private_data_len); > break; > + > + case IB_NODE_RNIC: > + ret = iw_cm_reject(id_priv->cm_id.iw, > + private_data, private_data_len); > + break; > + > default: > ret = -ENOSYS; > break; > @@ -1190,7 +1551,6 @@ > { > struct rdma_id_private *id_priv; > int ret; > - > id_priv = container_of(id, struct rdma_id_private, id); > if (!cma_comp(id_priv, CMA_CONNECT)) > return -EINVAL; > @@ -1202,9 +1562,12 @@ > switch (id->device->node_type) { > case IB_NODE_CA: > /* Initiate or respond to a disconnect. 
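
[Aside: putting the pieces together. Hypothetical active-side
consumer code; every entry point below is defined in this patch or
already present in cma.c, while the buffer and timeouts are invented:]

	static const char hello[] = "hi";	/* hypothetical payload */
	struct rdma_conn_param param = {
		.private_data	  = hello,
		.private_data_len = sizeof(hello),
	};

	rdma_resolve_addr(id, NULL, dst_addr, 2000);
	/* wait for RDMA_CM_EVENT_ADDR_RESOLVED */
	rdma_resolve_route(id, 2000);
	/* wait for RDMA_CM_EVENT_ROUTE_RESOLVED */
	rdma_create_qp(id, pd, &init_attr);
	rdma_connect(id, &param);
	/* RDMA_CM_EVENT_ESTABLISHED then arrives via cma_iw_handler */
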
*/ > - if (ib_send_cm_dreq(id_priv->cm_id, NULL, 0)) > - ib_send_cm_drep(id_priv->cm_id, NULL, 0); > + if (ib_send_cm_dreq(id_priv->cm_id.ib, NULL, 0)) > + ib_send_cm_drep(id_priv->cm_id.ib, NULL, 0); > break; > + case IB_NODE_RNIC: > + ret = iw_cm_disconnect(id_priv->cm_id.iw); > + break; > default: > break; > } > Index: Makefile > =================================================================== > --- Makefile (revision 4186) > +++ Makefile (working copy) > @@ -7,3 +7,5 @@ > obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/ > obj-$(CONFIG_KDAPL) += ulp/kdapl/ > obj-$(CONFIG_INFINIBAND_ISER) += ulp/iser/ > +obj-$(CONFIG_KRPING) += krping/ > +obj-$(CONFIG_RDMA_CMATOSE) += cmatose/ > Index: hw/amso1100/c2.c > =================================================================== > --- hw/amso1100/c2.c (revision 4482) > +++ hw/amso1100/c2.c (working copy) > @@ -933,7 +933,7 @@ > spin_lock_init(&c2_port->tx_lock); > > /* Copy our 48-bit ethernet hardware address */ > - memcpy_fromio(netdev->dev_addr, mmio_addr + C2_REGS_ENADDR, 6); > + memcpy_fromio(netdev->dev_addr, mmio_addr + C2_REGS_RDMA_ENADDR, 6); > > /* Validate the MAC address */ > if(!is_valid_ether_addr(netdev->dev_addr)) { > Index: hw/amso1100/c2_qp.c > =================================================================== > --- hw/amso1100/c2_qp.c (revision 4482) > +++ hw/amso1100/c2_qp.c (working copy) > @@ -184,7 +184,7 @@ > struct c2_vq_req *vq_req; > ccwr_qp_destroy_req_t wr; > ccwr_qp_destroy_rep_t *reply; > - int err; > + int err; > > /* > * Allocate a verb request message > @@ -343,8 +343,6 @@ > qp->send_sgl_depth = qp_attrs->cap.max_send_sge; > qp->rdma_write_sgl_depth = qp_attrs->cap.max_send_sge; > qp->recv_sgl_depth = qp_attrs->cap.max_recv_sge; > - qp->event_handler = NULL; > - qp->context = NULL; > > /* Initialize the SQ MQ */ > q_size = be32_to_cpu(reply->sq_depth); > Index: hw/amso1100/c2.h > =================================================================== > --- hw/amso1100/c2.h (revision 4482) > +++ hw/amso1100/c2.h (working copy) > @@ -113,6 +113,7 @@ > C2_REGS_Q2_MSGSIZE = 0x0038, > C2_REGS_Q2_SHARED = 0x0040, > C2_REGS_ENADDR = 0x004C, > + C2_REGS_RDMA_ENADDR = 0x0054, > C2_REGS_HRX_CUR = 0x006C, > }; > > @@ -592,16 +593,11 @@ > extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify); > > /* CM */ > -extern int c2_qp_connect(struct c2_dev *c2dev, struct c2_qp *qp, u32 remote_addr, > - u16 remote_port, u32 pdata_len, u8 *pdata); > -extern int c2_cr_query(struct c2_dev *c2dev, u32 cr_id, > - struct c2_cr_query_attrs *cr_attrs); > -extern int c2_cr_accept(struct c2_dev *c2dev, u32 cr_id, struct c2_qp *qp, > - u32 pdata_len, u8 *pdata); > -extern int c2_cr_reject(struct c2_dev *c2dev, u32 cr_id); > -extern int c2_ep_listen_create(struct c2_dev *c2dev, u32 addr, u16 port, > - u32 backlog, struct c2_ep *ep); > -extern int c2_ep_listen_destroy(struct c2_dev *c2dev, struct c2_ep *ep); > +extern int c2_llp_connect(struct iw_cm_id* cm_id, const void* pdata, u8 pdata_len); > +extern int c2_llp_accept(struct iw_cm_id* cm_id, const void* pdata, u8 pdata_len); > +extern int c2_llp_reject(struct iw_cm_id* cm_id, const void* pdata, u8 pdata_len); > +extern int c2_llp_service_create(struct iw_cm_id* cm_id, int backlog); > +extern int c2_llp_service_destroy(struct iw_cm_id* cm_id); > > /* MM */ > extern int c2_nsmr_register_phys_kern(struct c2_dev *c2dev, u64 **addr_list, > Index: hw/amso1100/c2_pd.c > =================================================================== > --- hw/amso1100/c2_pd.c (revision 4482) > +++ 
hw/amso1100/c2_pd.c (working copy) > @@ -44,6 +44,8 @@ > { > int err = 0; > > + printk(KERN_ERR "%s:%d\n", __FUNCTION__, __LINE__); > + > might_sleep(); > > atomic_set(&pd->sqp_count, 0); > Index: hw/amso1100/c2_ae.c > =================================================================== > --- hw/amso1100/c2_ae.c (revision 4482) > +++ hw/amso1100/c2_ae.c (working copy) > @@ -35,51 +35,37 @@ > #include "cc_status.h" > #include "cc_ae.h" > > -enum conn_result > -c2_convert_cm_status(u32 cc_status) > +static int c2_convert_cm_status(u32 cc_status) > { > switch (cc_status) { > - case CC_CONN_STATUS_SUCCESS: return IW_CONN_ACCEPT; > - case CC_CONN_STATUS_REJECTED: return IW_CONN_RESET; > - case CC_CONN_STATUS_REFUSED: return IW_CONN_PEER_REJECT; > - case CC_CONN_STATUS_TIMEDOUT: return IW_CONN_TIMEDOUT; > - case CC_CONN_STATUS_NETUNREACH: return IW_CONN_NO_ROUTE_TO_HOST; > - case CC_CONN_STATUS_HOSTUNREACH: return IW_CONN_NO_ROUTE_TO_HOST; > - case CC_CONN_STATUS_INVALID_RNIC: return IW_CONN_INVALID_PARM; > - case CC_CONN_STATUS_INVALID_QP: return IW_CONN_INVALID_PARM; > - case CC_CONN_STATUS_INVALID_QP_STATE: return IW_CONN_INVALID_PARM; > + case CC_CONN_STATUS_SUCCESS: > + return 0; > + case CC_CONN_STATUS_REJECTED: > + return -ENETRESET; > + case CC_CONN_STATUS_REFUSED: > + return -ECONNREFUSED; > + case CC_CONN_STATUS_TIMEDOUT: > + return -ETIMEDOUT; > + case CC_CONN_STATUS_NETUNREACH: > + return -ENETUNREACH; > + case CC_CONN_STATUS_HOSTUNREACH: > + return -EHOSTUNREACH; > + case CC_CONN_STATUS_INVALID_RNIC: > + return -EINVAL; > + case CC_CONN_STATUS_INVALID_QP: > + return -EINVAL; > + case CC_CONN_STATUS_INVALID_QP_STATE: > + return -EINVAL; > default: > panic("Unable to convert CM status: %d\n", cc_status); > break; > } > } > > -static int > -is_cm_event(cc_event_id_t id) > -{ > - int is_cm; > - > - switch (id) { > - case CCAE_ACTIVE_CONNECT_RESULTS: > - case CCAE_BAD_CLOSE: > - case CCAE_LLP_CLOSE_COMPLETE: > - case CCAE_LLP_CONNECTION_RESET: > - case CCAE_LLP_CONNECTION_LOST: > - is_cm = 1; > - break; > - case CCAE_TERMINATE_MESSAGE_RECEIVED: > - case CCAE_CQ_SQ_COMPLETION_OVERFLOW: > - default: > - is_cm = 0; > - break; > - } > - > - return is_cm; > -} > void c2_ae_event(struct c2_dev *c2dev, u32 mq_index) > { > - ccwr_t *wr; > struct c2_mq *mq = c2dev->qptr_array[mq_index]; > + ccwr_t *wr; > void *resource_user_context; > struct iw_cm_event cm_event; > struct ib_event ib_event; > @@ -94,6 +80,7 @@ > if (!wr) > return; > > + memset(&cm_event, 0, sizeof(cm_event)); > event_id = c2_wr_get_id(wr); > resource_indicator = be32_to_cpu(wr->ae.ae_generic.resource_type); > resource_user_context = (void *)(unsigned long)wr->ae.ae_generic.user_context; > @@ -102,117 +89,126 @@ > case CC_RES_IND_QP: { > > struct c2_qp *qp = (struct c2_qp *)resource_user_context; > + cm_event.status = c2_convert_cm_status(c2_wr_get_result(wr)); > > - if (is_cm_event(event_id)) { > - > - cm_event.device = &c2dev->ibdev; > - if (event_id == CCAE_ACTIVE_CONNECT_RESULTS) { > - cm_event.event = IW_EVENT_ACTIVE_CONNECT_RESULTS; > - cm_event.element.active_results.result = > - c2_convert_cm_status(c2_wr_get_result(wr)); > - cm_event.element.active_results.conn_attr.local_addr.s_addr = > - wr->ae.ae_active_connect_results.laddr; > - cm_event.element.active_results.conn_attr.remote_addr.s_addr = > - wr->ae.ae_active_connect_results.raddr; > - cm_event.element.active_results.conn_attr.local_port = > - wr->ae.ae_active_connect_results.lport; > - cm_event.element.active_results.conn_attr.remote_port = > - 
wr->ae.ae_active_connect_results.rport; > - cm_event.element.active_results.private_data_len = > + switch (event_id) { > + case CCAE_ACTIVE_CONNECT_RESULTS: > + cm_event.event = IW_CM_EVENT_CONNECT_REPLY; > + cm_event.local_addr.sin_addr.s_addr = > + wr->ae.ae_active_connect_results.laddr; > + cm_event.remote_addr.sin_addr.s_addr = > + wr->ae.ae_active_connect_results.raddr; > + cm_event.local_addr.sin_port = > + wr->ae.ae_active_connect_results.lport; > + cm_event.remote_addr.sin_port = > + wr->ae.ae_active_connect_results.rport; > + cm_event.private_data_len = > be32_to_cpu(wr->ae.ae_active_connect_results.private_data_length); > > + if (cm_event.private_data_len) { > /* XXX */ > - pdata = kmalloc(cm_event.element.active_results.private_data_len, > - GFP_ATOMIC); > - if (!pdata) > - break; > + pdata = kmalloc(cm_event.private_data_len, GFP_ATOMIC); > + if (!pdata) { > + /* Ignore the request, maybe the remote peer > + * will retry */ > + dprintk("Ignored connect request -- no memory for pdata" > + "private_data_len=%d\n", cm_event.private_data_len); > + goto ignore_it; > + } > > memcpy(pdata, > wr->ae.ae_active_connect_results.private_data, > - cm_event.element.active_results.private_data_len); > - cm_event.element.active_results.private_data = pdata; > + cm_event.private_data_len); > > - } else { > - cm_event.event = IW_EVENT_DISCONNECT; > + cm_event.private_data = pdata; > } > + if (qp->cm_id->event_handler) > + qp->cm_id->event_handler(qp->cm_id, &cm_event); > > - if (qp->event_handler) > - (*qp->event_handler)(&cm_event, qp->context); > + break; > > - if (pdata) > - kfree(pdata); > - } else { > - > + case CCAE_TERMINATE_MESSAGE_RECEIVED: > + case CCAE_CQ_SQ_COMPLETION_OVERFLOW: > ib_event.device = &c2dev->ibdev; > ib_event.element.qp = &qp->ibqp; > - /* XXX */ > ib_event.event = IB_EVENT_QP_REQ_ERR; > > if(qp->ibqp.event_handler) > - (*qp->ibqp.event_handler)(&ib_event, qp->context); > - } > + (*qp->ibqp.event_handler)(&ib_event, > + qp->ibqp.qp_context); > + case CCAE_BAD_CLOSE: > + case CCAE_LLP_CLOSE_COMPLETE: > + case CCAE_LLP_CONNECTION_RESET: > + case CCAE_LLP_CONNECTION_LOST: > + default: > + cm_event.event = IW_CM_EVENT_CLOSE; > + if (qp->cm_id->event_handler) > + qp->cm_id->event_handler(qp->cm_id, &cm_event); > > + } > break; > } > + > case CC_RES_IND_EP: { > > - struct c2_ep *ep = (struct c2_ep *)resource_user_context; > + struct iw_cm_id* cm_id = (struct iw_cm_id*)resource_user_context; > > + dprintk("CC_RES_IND_EP event_id=%d\n", event_id); > if (event_id != CCAE_CONNECTION_REQUEST) { > dprintk("%s: Invalid event_id: %d\n", __FUNCTION__, event_id); > break; > } > > - cm_event.device = &c2dev->ibdev; > - cm_event.event = IW_EVENT_CONNECT_REQUEST; > - cm_event.element.conn_request.cr_id = > + cm_event.event = IW_CM_EVENT_CONNECT_REQUEST; > + cm_event.provider_id = > wr->ae.ae_connection_request.cr_handle; > - cm_event.element.conn_request.conn_attr.local_addr.s_addr = > + cm_event.local_addr.sin_addr.s_addr = > wr->ae.ae_connection_request.laddr; > - cm_event.element.conn_request.conn_attr.remote_addr.s_addr = > + cm_event.remote_addr.sin_addr.s_addr = > wr->ae.ae_connection_request.raddr; > - cm_event.element.conn_request.conn_attr.local_port = > + cm_event.local_addr.sin_port = > wr->ae.ae_connection_request.lport; > - cm_event.element.conn_request.conn_attr.remote_port = > + cm_event.remote_addr.sin_port = > wr->ae.ae_connection_request.rport; > - cm_event.element.conn_request.private_data_len = > + cm_event.private_data_len = > 
be32_to_cpu(wr->ae.ae_connection_request.private_data_length); > > - /* XXX */ > - pdata = kmalloc(cm_event.element.conn_request.private_data_len, > - GFP_ATOMIC); > - if (!pdata) > - break; > + if (cm_event.private_data_len) { > + pdata = kmalloc(cm_event.private_data_len, GFP_ATOMIC); > + if (!pdata) { > + /* Ignore the request, maybe the remote peer > + * will retry */ > + dprintk("Ignored connect request -- no memory for pdata" > + "private_data_len=%d\n", cm_event.private_data_len); > + goto ignore_it; > + } > + memcpy(pdata, > + wr->ae.ae_connection_request.private_data, > + cm_event.private_data_len); > > - memcpy(pdata, > - wr->ae.ae_connection_request.private_data, > - cm_event.element.conn_request.private_data_len); > - > - cm_event.element.conn_request.private_data = pdata; > - > - if (ep->event_handler) > - (*ep->event_handler)(&cm_event, ep->listen_context); > - > - kfree(pdata); > + cm_event.private_data = pdata; > + } > + if (cm_id->event_handler) > + cm_id->event_handler(cm_id, &cm_event); > break; > } > + > case CC_RES_IND_CQ: { > struct c2_cq *cq = (struct c2_cq *)resource_user_context; > > + dprintk("IB_EVENT_CQ_ERR\n"); > ib_event.device = &c2dev->ibdev; > ib_event.element.cq = &cq->ibcq; > ib_event.event = IB_EVENT_CQ_ERR; > > if (cq->ibcq.event_handler) > - (*cq->ibcq.event_handler)(&ib_event, cq->ibcq.cq_context); > + cq->ibcq.event_handler(&ib_event, cq->ibcq.cq_context); > } > + > default: > break; > } > - > - /* > - * free the adapter message > - */ > + > + ignore_it: > c2_mq_free(mq); > } > - > Index: hw/amso1100/c2_provider.c > =================================================================== > --- hw/amso1100/c2_provider.c (revision 4482) > +++ hw/amso1100/c2_provider.c (working copy) > @@ -305,8 +305,6 @@ > struct c2_cq *cq; > int err; > > - dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > - > cq = kmalloc(sizeof(*cq), GFP_KERNEL); > if (!cq) { > dprintk("%s: Unable to allocate CQ\n", __FUNCTION__); > @@ -315,6 +313,7 @@ > > err = c2_init_cq(to_c2dev(ibdev), entries, NULL, cq); > if (err) { > + dprintk("%s: error initializing CQ\n", __FUNCTION__); > kfree(cq); > return ERR_PTR(err); > } > @@ -540,156 +539,96 @@ > return -ENOSYS; > } > > -static int c2_connect_qp(struct ib_qp *ib_qp, > - struct iw_conn_attr *attr, > - void (*event_handler)(struct iw_cm_event*, void*), > - void *context, > - u8 *pdata, > - int pdata_len > - ) > +static int c2_connect(struct iw_cm_id* cm_id, > + const void* pdata, u8 pdata_len) > { > - struct c2_qp *qp = to_c2qp(ib_qp); > int err; > + struct c2_qp* qp = container_of(cm_id->qp, struct c2_qp, ibqp); > > dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > - if (!event_handler) > + if (cm_id->qp == NULL) > return -EINVAL; > > - /* > - * Store the event handler and the > - * context in the QP. 
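
[Aside: because c2_convert_cm_status() above maps the adapter's status
codes onto standard errnos, a client sees ordinary kernel error codes
in event->status. For instance (illustrative only):]

	if (event->status == -ECONNREFUSED)
		printk(KERN_INFO "peer refused our connect request\n");
	else if (event->status == -ETIMEDOUT)
		printk(KERN_INFO "LLP connect timed out\n");
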
> - */ > - qp->event_handler = event_handler; > - qp->context = context; > + /* Cache the cm_id in the qp */ > + qp->cm_id = cm_id; > > - err = c2_qp_connect(to_c2dev(ib_qp->device), qp, > - attr->remote_addr.s_addr, attr->remote_port, > - pdata_len, pdata); > - if (err) { > - qp->event_handler = NULL; > - qp->context = NULL; > - } > + err = c2_llp_connect(cm_id, pdata, pdata_len); > > return err; > } > > -static int c2_disconnect_qp(struct ib_qp *qp, > - int abrupt) > +static int c2_disconnect(struct iw_cm_id* cm_id, int abrupt) > { > struct ib_qp_attr attr; > + struct ib_qp *ib_qp = cm_id->qp; > int err; > > dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > + if (ib_qp == 0) > + /* If this is a lietening endpoint, there is no QP */ > + return 0; > + > memset(&attr, 0, sizeof(struct ib_qp_attr)); > if (abrupt) > attr.qp_state = IB_QPS_ERR; > else > attr.qp_state = IB_QPS_SQD; > > - err = c2_modify_qp(qp, &attr, IB_QP_STATE); > + err = c2_modify_qp(ib_qp, &attr, IB_QP_STATE); > return err; > } > > -static int c2_accept_cr(struct ib_device *ibdev, > - u32 cr_id, > - struct ib_qp *ib_qp, > - void (*event_handler)(struct iw_cm_event*, void*), > - void *context, > - u8 *pdata, > - int pdata_len) > +static int c2_accept(struct iw_cm_id* cm_id, const void *pdata, u8 pdata_len) > { > - struct c2_qp *qp = to_c2qp(ib_qp); > int err; > > dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > - /* > - * Store the event handler and the > - * context in the QP. > - */ > - qp->event_handler = event_handler; > - qp->context = context; > + err = c2_llp_accept(cm_id, pdata, pdata_len); > > - err = c2_cr_accept(to_c2dev(ibdev), cr_id, qp, > - pdata_len, pdata); > - > return err; > } > > -static int c2_reject_cr(struct ib_device *ibdev, > - u32 cr_id, > - u8 *pdata, > - int pdata_len) > +static int c2_reject(struct iw_cm_id* cm_id, const void* pdata, u8 pdata_len) > { > int err; > > dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > - err = c2_cr_reject(to_c2dev(ibdev), cr_id); > + err = c2_llp_reject(cm_id, pdata, pdata_len); > return err; > } > > -static int c2_query_cr(struct ib_device *ibdev, > - u32 cr_id, > - struct iw_conn_request *req) > +static int c2_getpeername(struct iw_cm_id* cm_id, > + struct sockaddr_in* local_addr, > + struct sockaddr_in* remote_addr ) > { > - int err; > - struct c2_cr_query_attrs cr_attrs; > - > dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > - err = c2_cr_query(to_c2dev(ibdev), cr_id, &cr_attrs); > - if (!err) { > - req->cr_id = cr_id; > - req->conn_attr.local_addr.s_addr = cr_attrs.local_addr; > - req->conn_attr.local_port = cr_attrs.local_port; > - req->conn_attr.remote_addr.s_addr = cr_attrs.remote_addr; > - req->conn_attr.remote_port = cr_attrs.remote_port; > - /* XXX pdata? 
*/ > - } > - return err; > + *local_addr = cm_id->local_addr; > + *remote_addr = cm_id->remote_addr; > + return 0; > } > > -static int c2_create_listen_ep(struct ib_device *ibdev, > - struct iw_listen_ep_attr *ep_attr, > - void **ep_handle) > +static int c2_service_create(struct iw_cm_id* cm_id, int backlog) > { > int err; > - struct c2_ep *ep; > > dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > - > - ep = kmalloc(sizeof(*ep), GFP_KERNEL); > - if (!ep) { > - dprintk("%s: Unable to allocate EP\n", __FUNCTION__); > - return -ENOMEM; > - } > - > - ep->event_handler = ep_attr->event_handler; > - ep->listen_context = ep_attr->listen_context; > - > - err = c2_ep_listen_create(to_c2dev(ibdev), > - ep_attr->addr.s_addr, ep_attr->port, > - ep_attr->backlog, ep); > - if (err) > - kfree(ep); > - else > - *ep_handle = (void *)ep; > - > + err = c2_llp_service_create(cm_id, backlog); > return err; > } > > -static int c2_destroy_listen_ep(struct ib_device *ibdev, void *ep_handle) > +static int c2_service_destroy(struct iw_cm_id* cm_id) > { > - struct c2_ep *ep = (struct c2_ep *)ep_handle; > - > + int err; > dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__); > > - c2_ep_listen_destroy(to_c2dev(ibdev), ep); > - kfree(ep); > - return 0; > + err = c2_llp_service_destroy(cm_id); > + > + return err; > } > > int c2_register_device(struct c2_dev *dev) > @@ -742,13 +681,13 @@ > dev->ibdev.post_recv = c2_post_receive; > > dev->ibdev.iwcm = kmalloc(sizeof(*dev->ibdev.iwcm), GFP_KERNEL); > - dev->ibdev.iwcm->connect_qp = c2_connect_qp; > - dev->ibdev.iwcm->disconnect_qp = c2_disconnect_qp; > - dev->ibdev.iwcm->accept_cr = c2_accept_cr; > - dev->ibdev.iwcm->reject_cr = c2_reject_cr; > - dev->ibdev.iwcm->query_cr = c2_query_cr; > - dev->ibdev.iwcm->create_listen_ep = c2_create_listen_ep; > - dev->ibdev.iwcm->destroy_listen_ep = c2_destroy_listen_ep; > + dev->ibdev.iwcm->connect = c2_connect; > + dev->ibdev.iwcm->disconnect = c2_disconnect; > + dev->ibdev.iwcm->accept = c2_accept; > + dev->ibdev.iwcm->reject = c2_reject; > + dev->ibdev.iwcm->getpeername = c2_getpeername; > + dev->ibdev.iwcm->create_listen = c2_service_create; > + dev->ibdev.iwcm->destroy_listen = c2_service_destroy; > > ret = ib_register_device(&dev->ibdev); > if (ret) > Index: hw/amso1100/c2_provider.h > =================================================================== > --- hw/amso1100/c2_provider.h (revision 4482) > +++ hw/amso1100/c2_provider.h (working copy) > @@ -115,17 +115,15 @@ > struct c2_wq { > spinlock_t lock; > }; > - > +struct iw_cm_id; > struct c2_qp { > struct ib_qp ibqp; > + struct iw_cm_id* cm_id; > spinlock_t lock; > atomic_t refcount; > wait_queue_head_t wait; > int qpn; > > - void (*event_handler)(struct iw_cm_event *, void *); > - void *context; > - > u32 adapter_handle; > u32 send_sgl_depth; > u32 recv_sgl_depth; > @@ -136,15 +134,6 @@ > struct c2_mq rq_mq; > }; > > -struct c2_ep { > - u32 adapter_handle; > - void (*event_handler)(struct iw_cm_event *, void *); > - void *listen_context; > - u32 addr; > - u16 port; > - int backlog; > -}; > - > struct c2_cr_query_attrs { > u32 local_addr; > u32 remote_addr; > Index: hw/amso1100/c2_cm.c > =================================================================== > --- hw/amso1100/c2_cm.c (revision 4482) > +++ hw/amso1100/c2_cm.c (working copy) > @@ -35,11 +35,10 @@ > #include "c2_vq.h" > #include > > -int > -c2_qp_connect(struct c2_dev *c2dev, struct c2_qp *qp, > - u32 remote_addr, u16 remote_port, > - u32 pdata_len, u8 *pdata) > +int c2_llp_connect(struct iw_cm_id* 
cm_id, const void* pdata, u8 pdata_len) > { > + struct c2_dev *c2dev = to_c2dev(cm_id->device); > + struct c2_qp *qp = to_c2qp(cm_id->qp); > ccwr_qp_connect_req_t *wr; /* variable size needs a malloc. */ > struct c2_vq_req *vq_req; > int err; > @@ -70,8 +69,8 @@ > wr->rnic_handle = c2dev->adapter_handle; > wr->qp_handle = qp->adapter_handle; > > - wr->remote_addr = remote_addr; /* already in Network Byte Order */ > - wr->remote_port = remote_port; /* already in Network Byte Order */ > + wr->remote_addr = cm_id->remote_addr.sin_addr.s_addr; > + wr->remote_port = cm_id->remote_addr.sin_port; > > /* > * Move any private data from the callers's buf into > @@ -96,14 +95,18 @@ > } > > int > -c2_ep_listen_create(struct c2_dev *c2dev, u32 addr, > - u16 port, u32 backlog, struct c2_ep *ep) > +c2_llp_service_create(struct iw_cm_id* cm_id, int backlog) > { > + struct c2_dev *c2dev; > ccwr_ep_listen_create_req_t wr; > ccwr_ep_listen_create_rep_t *reply; > struct c2_vq_req *vq_req; > int err; > > + c2dev = to_c2dev(cm_id->device); > + if (c2dev == NULL) > + return -EINVAL; > + > /* > * Allocate verbs request. > */ > @@ -115,15 +118,15 @@ > * Build the WR > */ > c2_wr_set_id(&wr, CCWR_EP_LISTEN_CREATE); > - wr.hdr.context = (unsigned long)vq_req; > + wr.hdr.context = (u64)(unsigned long)vq_req; > wr.rnic_handle = c2dev->adapter_handle; > - wr.local_addr = addr; /* already in Net Byte Order */ > - wr.local_port = port; /* already in Net Byte Order */ > + wr.local_addr = cm_id->local_addr.sin_addr.s_addr; > + wr.local_port = cm_id->local_addr.sin_port; > wr.backlog = cpu_to_be32(backlog); > - wr.user_context = (unsigned long)ep; > + wr.user_context = (u64)(unsigned long)cm_id; > > /* > - * reference the request struct. dereferenced in the int handler. > + * Reference the request struct. Dereferenced in the int handler. > */ > vq_req_get(c2dev, vq_req); > > @@ -160,12 +163,7 @@ > /* > * get the adapter handle > */ > - ep->adapter_handle = reply->ep_handle; > - if (port != reply->local_port) > - { > - // XXX > - //*p_port = reply->local_port; > - } > + cm_id->provider_id = reply->ep_handle; > > /* > * free vq stuff > @@ -184,13 +182,19 @@ > > > int > -c2_ep_listen_destroy(struct c2_dev *c2dev, struct c2_ep *ep) > +c2_llp_service_destroy(struct iw_cm_id* cm_id) > { > + > + struct c2_dev *c2dev; > ccwr_ep_listen_destroy_req_t wr; > ccwr_ep_listen_destroy_rep_t *reply; > struct c2_vq_req *vq_req; > int err; > > + c2dev = to_c2dev(cm_id->device); > + if (c2dev == NULL) > + return -EINVAL; > + > /* > * Allocate verbs request. > */ > @@ -205,7 +209,7 @@ > c2_wr_set_id(&wr, CCWR_EP_LISTEN_DESTROY); > wr.hdr.context = (unsigned long)vq_req; > wr.rnic_handle = c2dev->adapter_handle; > - wr.ep_handle = ep->adapter_handle; > + wr.ep_handle = cm_id->provider_id; > > /* > * reference the request struct. dereferenced in the int handler. > @@ -250,87 +254,20 @@ > > > int > -c2_cr_query(struct c2_dev *c2dev, u32 cr_id, > - struct c2_cr_query_attrs *cr_attrs) > +c2_llp_accept(struct iw_cm_id* cm_id, const void* pdata, u8 pdata_len) > { > - ccwr_ep_query_req_t wr; > - ccwr_ep_query_rep_t *reply; > - struct c2_vq_req *vq_req; > - int err; > + struct c2_dev *c2dev = to_c2dev(cm_id->device); > + struct c2_qp *qp = to_c2qp(cm_id->qp); > + ccwr_cr_accept_req_t *wr; /* variable length WR */ > + struct c2_vq_req *vq_req; > + ccwr_cr_accept_rep_t *reply; /* VQ Reply msg ptr. */ > + int err; > > - /* > - * Create and send a WR_EP_CREATE... 
> - */ > - vq_req = vq_req_alloc(c2dev); > - if (!vq_req) { > - return -ENOMEM; > - } > + /* Make sure there's a bound QP */ > + if (qp == 0) > + return -EINVAL; > > - /* > - * Build the WR > - */ > - c2_wr_set_id(&wr, CCWR_EP_QUERY); > - wr.hdr.context = (unsigned long)vq_req; > - wr.rnic_handle = c2dev->adapter_handle; > - wr.ep_handle = cr_id; > - > /* > - * reference the request struct. dereferenced in the int handler. > - */ > - vq_req_get(c2dev, vq_req); > - > - /* > - * Send WR to adapter > - */ > - err = vq_send_wr(c2dev, (ccwr_t*)&wr); > - if (err) { > - vq_req_put(c2dev, vq_req); > - goto bail0; > - } > - > - /* > - * Wait for reply from adapter > - */ > - err = vq_wait_for_reply(c2dev, vq_req); > - if (err) { > - goto bail0; > - } > - > - /* > - * Process reply > - */ > - reply = (ccwr_ep_query_rep_t*)(unsigned long)vq_req->reply_msg; > - if (!reply) { > - err = -ENOMEM; > - goto bail0; > - } > - if ( (err = c2_errno(reply)) != 0) { > - goto bail1; > - } > - > - cr_attrs->local_addr = reply->local_addr; > - cr_attrs->local_port = reply->local_port; > - cr_attrs->remote_addr = reply->remote_addr; > - cr_attrs->remote_port = reply->remote_port; > - > -bail1: > - vq_repbuf_free(c2dev, reply); > -bail0: > - vq_req_free(c2dev, vq_req); > - return err; > -} > - > - > -int > -c2_cr_accept(struct c2_dev *c2dev, u32 cr_id, struct c2_qp *qp, > - u32 pdata_len, u8 *pdata) > -{ > - ccwr_cr_accept_req_t *wr; /* variable length WR */ > - struct c2_vq_req *vq_req; > - ccwr_cr_accept_rep_t* reply; /* VQ Reply msg ptr. */ > - int err; > - > - /* > * only support the max private_data length > */ > if (pdata_len > CC_MAX_PRIVATE_DATA_SIZE) { > @@ -357,7 +294,7 @@ > c2_wr_set_id(wr, CCWR_CR_ACCEPT); > wr->hdr.context = (unsigned long)vq_req; > wr->rnic_handle = c2dev->adapter_handle; > - wr->ep_handle = cr_id; > + wr->ep_handle = (u32)cm_id->provider_id; > wr->qp_handle = qp->adapter_handle; > if (pdata) { > wr->private_data_length = cpu_to_be32(pdata_len); > @@ -407,15 +344,17 @@ > return err; > } > > - > int > -c2_cr_reject(struct c2_dev *c2dev, u32 cr_id) > +c2_llp_reject(struct iw_cm_id* cm_id, const void* pdata, u8 pdata_len) > { > + struct c2_dev *c2dev; > ccwr_cr_reject_req_t wr; > struct c2_vq_req *vq_req; > ccwr_cr_reject_rep_t *reply; > int err; > > + c2dev = to_c2dev(cm_id->device); > + > /* > * Allocate verbs request. > */ > @@ -430,7 +369,7 @@ > c2_wr_set_id(&wr, CCWR_CR_REJECT); > wr.hdr.context = (unsigned long)vq_req; > wr.rnic_handle = c2dev->adapter_handle; > - wr.ep_handle = cr_id; > + wr.ep_handle = (u32)cm_id->provider_id; > > /* > * reference the request struct. dereferenced in the int handler. > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Thu Dec 15 13:22:22 2005 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 15 Dec 2005 13:22:22 -0800 Subject: [openib-general] [PATCH] check create_srq in libibverbs In-Reply-To: (Shirley Ma's message of "Fri, 9 Dec 2005 17:11:03 -0700") References: Message-ID: Thanks, applied (at long last) From rdreier at cisco.com Thu Dec 15 13:24:32 2005 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 15 Dec 2005 13:24:32 -0800 Subject: [openib-general] Re: [PATCH] mthca: correct max_rd_atomic handling In-Reply-To: <20051215164337.GV26722@mellanox.co.il> (Michael S. 
Tsirkin's message of "Thu, 15 Dec 2005 18:43:37 +0200")
References: <20051215164337.GV26722@mellanox.co.il>
Message-ID: 

    Michael> Hardware guys confirmed that it does, as per spec:
    Michael> clearing these bits is the way to tell hardware that we
    Michael> have max_rd_atomic set to 0. I thought it's obvious from
    Michael> documentation: do you think this needs clarification?

I was just lazy and didn't even read the spec before asking.

 - R.

From rdreier at cisco.com Thu Dec 15 13:25:58 2005
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 15 Dec 2005 13:25:58 -0800
Subject: [openib-general] Re: ipoib: question
In-Reply-To: <20051214215915.GA18526@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 14 Dec 2005 23:59:15 +0200")
References: <20051214215915.GA18526@mellanox.co.il>
Message-ID: 

 > Is this better?
 >
 > -	return (struct ipoib_neigh **) (neigh->ha + 24 -
 > -					(offsetof(struct neighbour, ha) & 4));
 > +	return (void*)neigh + ALIGN(offsetof(struct neighbour, ha) + INFINIBAND_ALEN, x)

I guess so, with "x" replaced by "sizeof (void *)".

 - R.

From halr at voltaire.com Thu Dec 15 13:32:44 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 15 Dec 2005 16:32:44 -0500
Subject: [openib-general] [PATCH] osmtest/osmt_service.c: Eliminate compile warnings with gcc version 4.0.0 20050519 (Red Hat 4.0.0-8)
Message-ID: <1134682364.4338.31.camel@hal.voltaire.com>

osmtest/osmt_service.c: Eliminate compile warnings with gcc version 4.0.0 20050519 (Red Hat 4.0.0-8)

Signed-off-by: Hal Rosenstock 

Index: osmt_service.c
===================================================================
--- osmt_service.c	(revision 4478)
+++ osmt_service.c	(working copy)
@@ -1071,7 +1071,8 @@ osmt_get_all_services_and_check_names( I
              "osmt_get_all_services_and_check_names: "
              "-I- Comparing source name : >%s<, with record name : >%s<, idx : %d\n",
              p_valid_service_names_arr[j],p_rec->service_name, p_checked_names[j]);
-    if ( strcmp(p_valid_service_names_arr[j],p_rec->service_name) == 0 )
+    if ( strcmp((const char *)p_valid_service_names_arr[j],
+                (const char *)p_rec->service_name) == 0 )
     {
       osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
                "osmt_get_all_services_and_check_names: "

From mst at mellanox.co.il Thu Dec 15 13:49:27 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 15 Dec 2005 23:49:27 +0200
Subject: [openib-general] [PATCH] ipoib: fix destructor usage
Message-ID: <20051215214927.GA31053@mellanox.co.il>

IPoIB uses neighbour ops->destructor to clean up struct ipoib_neigh, but ignores the fact that multiple neighbour objects can share the same ops structure, so setting it to NULL affects multiple neighbours. Fix this by tracking all ipoib_neigh objects, and only clearing the destructor after no neighbour is going to use it. Note that the ops structure isn't per device, so we track them in a global list.

Signed-off-by: Michael S.
Tsirkin Index: linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- linux-2.6.14.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2005-12-16 01:48:54.000000000 +0200 +++ linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib_main.c 2005-12-16 01:52:26.000000000 +0200 @@ -71,6 +71,9 @@ static const u8 ipv4_bcast_addr[] = { struct workqueue_struct *ipoib_workqueue; +static spinlock_t ipoib_neigh_ops_list_lock; +static LIST_HEAD(ipoib_neigh_ops_list); + static void ipoib_add_one(struct ib_device *device); static void ipoib_remove_one(struct ib_device *device); @@ -244,9 +247,8 @@ static void path_free(struct net_device */ if (neigh->ah) ipoib_put_ah(neigh->ah); - *to_ipoib_neigh(neigh->neighbour) = NULL; - neigh->neighbour->ops->destructor = NULL; - kfree(neigh); + + ipoib_neigh_free(neigh); } spin_unlock_irqrestore(&priv->lock, flags); @@ -474,7 +476,7 @@ static void neigh_add_path(struct sk_buf struct ipoib_path *path; struct ipoib_neigh *neigh; - neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + neigh = ipoib_neigh_alloc(skb->dst->neighbour); if (!neigh) { ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); @@ -482,8 +484,6 @@ static void neigh_add_path(struct sk_buf } skb_queue_head_init(&neigh->queue); - neigh->neighbour = skb->dst->neighbour; - *to_ipoib_neigh(skb->dst->neighbour) = neigh; /* * We can only be called from ipoib_start_xmit, so we're @@ -526,11 +526,8 @@ static void neigh_add_path(struct sk_buf return; err: - *to_ipoib_neigh(skb->dst->neighbour) = NULL; list_del(&neigh->list); - neigh->neighbour->ops->destructor = NULL; - kfree(neigh); - + ipoib_neigh_free(neigh); ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); @@ -757,8 +754,7 @@ static void ipoib_neigh_destructor(struc if (neigh->ah) ah = neigh->ah; list_del(&neigh->list); - *to_ipoib_neigh(n) = NULL; - kfree(neigh); + ipoib_neigh_free(neigh); } spin_unlock_irqrestore(&priv->lock, flags); @@ -767,23 +763,45 @@ static void ipoib_neigh_destructor(struc ipoib_put_ah(ah); } -static int ipoib_neigh_setup(struct neighbour *neigh) +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour) { + struct ipoib_neigh *neigh; + + neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + if (!neigh) + return NULL; + + neigh->neighbour = neighbour; + *to_ipoib_neigh(neighbour) = neigh; + /* * Is this kosher? I can't find anybody in the kernel that * sets neigh->destructor, so we should be able to set it here * without trouble. 
*/ - neigh->ops->destructor = ipoib_neigh_destructor; - - return 0; + spin_lock(&ipoib_neigh_ops_list_lock); + list_add_tail(&neigh->ops_list, &ipoib_neigh_ops_list); + neigh->neighbour->ops->destructor = ipoib_neigh_destructor; + spin_unlock(&ipoib_neigh_ops_list_lock); + return neigh; } -static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) +void ipoib_neigh_free(struct ipoib_neigh *neigh) { - parms->neigh_setup = ipoib_neigh_setup; + struct ipoib_neigh *n; - return 0; + spin_lock(&ipoib_neigh_ops_list_lock); + list_del(&neigh->ops_list); + + list_for_each_entry(n, &ipoib_neigh_ops_list, ops_list) + if (n->neighbour->ops == neigh->neighbour->ops) + goto found; + + neigh->neighbour->ops->destructor = NULL; +found: + spin_unlock(&ipoib_neigh_ops_list_lock); + *to_ipoib_neigh(neigh->neighbour) = NULL; + kfree(neigh); } int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port) @@ -859,7 +877,6 @@ static void ipoib_setup(struct net_devic dev->tx_timeout = ipoib_timeout; dev->hard_header = ipoib_hard_header; dev->set_multicast_list = ipoib_set_mcast_list; - dev->neigh_setup = ipoib_neigh_setup_dev; dev->watchdog_timeo = HZ; @@ -1146,6 +1163,8 @@ static int __init ipoib_init_module(void if (ret) goto err_wq; + spin_lock_init(&ipoib_neigh_ops_list_lock); + return 0; err_wq: Index: linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- linux-2.6.14.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2005-12-16 01:48:54.000000000 +0200 +++ linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2005-12-16 01:49:08.000000000 +0200 @@ -107,9 +107,7 @@ static void ipoib_mcast_free(struct ipoi list_for_each_entry_safe(neigh, tmp, &mcast->neigh_list, list) { if (neigh->ah) list_add_tail(&neigh->ah->list, &ah_list); - *to_ipoib_neigh(neigh->neighbour) = NULL; - neigh->neighbour->ops->destructor = NULL; - kfree(neigh); + ipoib_neigh_free(neigh); } spin_unlock_irqrestore(&priv->lock, flags); @@ -733,13 +731,11 @@ out: if (skb->dst && skb->dst->neighbour && !*to_ipoib_neigh(skb->dst->neighbour)) { - struct ipoib_neigh *neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + struct ipoib_neigh *neigh = ipoib_neigh_alloc(skb->dst->neighbour); if (neigh) { kref_get(&mcast->ah->ref); neigh->ah = mcast->ah; - neigh->neighbour = skb->dst->neighbour; - *to_ipoib_neigh(skb->dst->neighbour) = neigh; list_add_tail(&neigh->list, &mcast->neigh_list); } } Index: linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- linux-2.6.14.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2005-12-16 01:48:54.000000000 +0200 +++ linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib.h 2005-12-16 01:50:46.000000000 +0200 @@ -215,6 +215,7 @@ struct ipoib_neigh { struct neighbour *neighbour; struct list_head list; + struct list_head ops_list; }; static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh) @@ -223,6 +224,9 @@ static inline struct ipoib_neigh **to_ip (offsetof(struct neighbour, ha) & 4)); } +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh); +void ipoib_neigh_free(struct ipoib_neigh *neigh); + extern struct workqueue_struct *ipoib_workqueue; /* functions */ -- MST From rdreier at cisco.com Thu Dec 15 13:47:37 2005 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 15 Dec 2005 13:47:37 -0800 Subject: [openib-general] Re: [PATCH] libmthca: fix SRQ cleanup during destroy-qp In-Reply-To: <20051215092303.GA27784@mellanox.co.il> 
(Jack Morgenstein's message of "Thu, 15 Dec 2005 11:23:03 +0200")
References: <20051215092303.GA27784@mellanox.co.il>
Message-ID: 

Thanks, applied.

From rdreier at cisco.com Thu Dec 15 13:54:35 2005
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 15 Dec 2005 13:54:35 -0800
Subject: [openib-general] Re: [PATCH] mthca thinko
In-Reply-To: <20051215164402.GW26722@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 15 Dec 2005 18:44:02 +0200")
References: <20051215164402.GW26722@mellanox.co.il>
Message-ID: 

Thanks, applied.

From mst at mellanox.co.il Thu Dec 15 14:02:14 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 16 Dec 2005 00:02:14 +0200
Subject: [openib-general] Re: ipoib: question
In-Reply-To: 
References: 
Message-ID: <20051215220214.GA31463@mellanox.co.il>

Quoting Roland Dreier :
> Subject: Re: ipoib: question
> 
> > Is this better?
> >
> > -	return (struct ipoib_neigh **) (neigh->ha + 24 -
> > -					(offsetof(struct neighbour, ha) & 4));
> > +	return (void*)neigh +
> >		ALIGN(offsetof(struct neighbour, ha) + INFINIBAND_ALEN, x)
> 
> I guess so, with "x" replaced by "sizeof (void *)".
> 
> - R.
> 

Right, there's also a ; missing - I hope you figured out it wasn't a real patch.
The below does compile.

Signed-off-by: Michael S. Tsirkin 

Index: linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib.h
===================================================================
--- linux-2.6.14.orig/drivers/infiniband/ulp/ipoib/ipoib.h	2005-12-16 02:15:55.000000000 +0200
+++ linux-2.6.14/drivers/infiniband/ulp/ipoib/ipoib.h	2005-12-16 02:39:42.000000000 +0200
@@ -219,8 +219,8 @@ struct ipoib_neigh {
 
 static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh)
 {
-	return (struct ipoib_neigh **) (neigh->ha + 24 -
-					(offsetof(struct neighbour, ha) & 4));
+	return (void*)neigh + ALIGN(offsetof(struct neighbour, ha) +
+				    INFINIBAND_ALEN, sizeof(void *));
 }
 
 extern struct workqueue_struct *ipoib_workqueue;

-- 
MST

From rdreier at cisco.com Thu Dec 15 14:20:34 2005
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 15 Dec 2005 14:20:34 -0800
Subject: [openib-general] Re: [patch] mthca: fix SRQ cleanup during destroy-qp
In-Reply-To: <20051215092618.GB27784@mellanox.co.il> (Jack Morgenstein's message of "Thu, 15 Dec 2005 11:26:18 +0200")
References: <20051215092618.GB27784@mellanox.co.il>
Message-ID: 

Thanks, applied.

From rdreier at cisco.com Thu Dec 15 14:36:34 2005
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 15 Dec 2005 14:36:34 -0800
Subject: [openib-general] Re: [PATCH] mthca: correct IB_QP_ACCESS_FLAGS handling
In-Reply-To: <20051212165859.GO14936@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 12 Dec 2005 18:58:59 +0200")
References: <20051212165859.GO14936@mellanox.co.il>
Message-ID: 

Thanks, applied.

From rdreier at cisco.com Thu Dec 15 14:39:45 2005
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 15 Dec 2005 14:39:45 -0800
Subject: [openib-general] Re: [PATCH] mthca: correct max_rd_atomic handling
In-Reply-To: <20051213090919.GW14936@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 13 Dec 2005 11:09:19 +0200")
References: <20051213090919.GW14936@mellanox.co.il>
Message-ID: 

Hmm, I'm not sure about this change any more now. Why is setting swe being tied to sra_max? We should be able to post RDMA writes even if reads/atomics are disabled, right?

 - R.

From mst at mellanox.co.il Thu Dec 15 15:09:55 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 16 Dec 2005 01:09:55 +0200
Subject: [openib-general] Re: [PATCH] mthca: correct max_rd_atomic handling
In-Reply-To: 
References: 
Message-ID: <20051215230955.GA31616@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: [PATCH] mthca: correct max_rd_atomic handling
> 
> Hmm, I'm not sure about this change any more now. Why is setting swe
> being tied to sra_max?

Note it's tied to IB_QP_MAX_QP_RD_ATOMIC - the attribute bit, not the value.

Well, as I see it SWE isn't really used in IB spec, but hardware needs it. Our hardware requires/allows you to set this bit in exactly the same transitions where sre and sae are required/optional, and this unsurprisingly is when IB_QP_MAX_QP_RD_ATOMIC is required/optional. In other transitions it's supposed to be 0 since that's the value reserved bits should have.

VAPI and current mthca code seem to always set this bit: this seems to violate what the documentation says, but seems to work nevertheless.

> We should be able to post RDMA writes even if
> reads/atomics are disabled, right?
> 
> - R.
> 

Right. That's why we set it unconditionally when the IB_QP_MAX_QP_RD_ATOMIC attribute bit is set, and IB_QP_MAX_QP_RD_ATOMIC is required to go to RTS.

-- 
MST

From rdreier at cisco.com Thu Dec 15 15:10:26 2005
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 15 Dec 2005 15:10:26 -0800
Subject: [openib-general] Re: [PATCH] mthca: correct max_rd_atomic handling
In-Reply-To: <20051215230955.GA31616@mellanox.co.il> (Michael S. Tsirkin's message of "Fri, 16 Dec 2005 01:09:55 +0200")
References: <20051215230955.GA31616@mellanox.co.il>
Message-ID: 

    Michael> Right. That's why we set it unconditionally when the
    Michael> IB_QP_MAX_QP_RD_ATOMIC attribute bit is set, and
    Michael> IB_QP_MAX_QP_RD_ATOMIC is required to go to RTS.

Not for UC transport... I think this patch would break RDMA on UC QPs, right?

 - R.

From mst at mellanox.co.il Thu Dec 15 15:44:10 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 16 Dec 2005 01:44:10 +0200
Subject: [openib-general] Re: [PATCH] mthca: correct max_rd_atomic handling
In-Reply-To: 
References: 
Message-ID: <20051215234410.GB31616@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: [PATCH] mthca: correct max_rd_atomic handling
> 
>     Michael> Right. That's why we set it unconditionally when the
>     Michael> IB_QP_MAX_QP_RD_ATOMIC attribute bit is set, and
>     Michael> IB_QP_MAX_QP_RD_ATOMIC is required to go to RTS.
> 
> Not for UC transport... I think this patch would break RDMA on UC QPs, right?

Ugh. Looks right. It did seem to work ... go figure. So, let's set MTHCA_QP_BIT_SWE together with MTHCA_FLIGHT_LIMIT as we did previously. The other bits are correct, though, aren't they?

Like this (untested: I'm out of the lab for the weekend).

Signed-off-by: Jack Morgenstein 
Signed-off-by: Michael S. Tsirkin

Index: openib/drivers/infiniband/hw/mthca/mthca_qp.c
===================================================================
--- openib/drivers/infiniband/hw/mthca/mthca_qp.c	(revision 4489)
+++ openib/drivers/infiniband/hw/mthca/mthca_qp.c	(working copy)
@@ -715,9 +715,7 @@ int mthca_modify_qp(struct ib_qp *ibqp,
 		qp_context->wqe_lkey = cpu_to_be32(qp->mr.ibmr.lkey);
 	qp_context->params1 = cpu_to_be32((MTHCA_ACK_REQ_FREQ << 28) |
 					  (MTHCA_FLIGHT_LIMIT << 24) |
-					  MTHCA_QP_BIT_SRE |
-					  MTHCA_QP_BIT_SWE |
-					  MTHCA_QP_BIT_SAE);
+					  MTHCA_QP_BIT_SWE);
 	if (qp->sq_policy == IB_SIGNAL_ALL_WR)
 		qp_context->params1 |= cpu_to_be32(MTHCA_QP_BIT_SSC);
 	if (attr_mask & IB_QP_RETRY_CNT) {
@@ -726,9 +724,13 @@ int mthca_modify_qp(struct ib_qp *ibqp,
 	}
 
 	if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC) {
-		if (attr->max_rd_atomic)
+		if (attr->max_rd_atomic) {
+			qp_context->params1 |=
+				cpu_to_be32(MTHCA_QP_BIT_SRE |
+					    MTHCA_QP_BIT_SAE);
 			qp_context->params1 |=
 				cpu_to_be32(fls(attr->max_rd_atomic - 1) << 21);
+		}
 		qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_SRA_MAX);
 	}

-- 
MST

From jlentini at netapp.com Thu Dec 15 17:55:31 2005
From: jlentini at netapp.com (James Lentini)
Date: Thu, 15 Dec 2005 20:55:31 -0500 (EST)
Subject: [openib-general] Re: [PATCH][uDAPL] openib_cma provider update
In-Reply-To: 
References: 
Message-ID: 

On Fri, 9 Dec 2005, Arlin Davis wrote:

 > James,
 >
 > I modified the IP address lookup during the open to take either a
 > network name, network address, or device name. This will make the
 > dat.conf setup a little easier and more flexible. I updated the
 > README, and /doc/dat.conf with details.
 >
 > Thanks,
 >
 > -arlin

Committed in 4501.

From jlentini at netapp.com Thu Dec 15 18:00:35 2005
From: jlentini at netapp.com (James Lentini)
Date: Thu, 15 Dec 2005 21:00:35 -0500 (EST)
Subject: [openib-general] Re: [PATCH][uDAPL] openib_scm uses incorrect rd_atomic values for modify_qp
In-Reply-To: 
References: 
Message-ID: 

 > James,
 >
 > Here is a fix for openib socket cm version. I ran into a problem
 > with the latest verbs qp_modify as a result of incorrect rd_atomic
 > values so I modified to use the values returned from the
 > ibv_query_device() instead of hard coded values.
 >
 > -arlin

Committed in revision 4502.

From rdreier at cisco.com Thu Dec 15 20:00:05 2005
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 15 Dec 2005 20:00:05 -0800
Subject: [openib-general] Re: [PATCH] mthca: correct max_rd_atomic handling
In-Reply-To: <20051215234410.GB31616@mellanox.co.il> (Michael S. Tsirkin's message of "Fri, 16 Dec 2005 01:44:10 +0200")
References: <20051215234410.GB31616@mellanox.co.il>
Message-ID: 

Looks right to me -- I applied it.

From rolandd at cisco.com Thu Dec 15 20:00:17 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Fri, 16 Dec 2005 04:00:17 +0000
Subject: [openib-general] [git patch review 2/7] IB/mthca: correct log2 calculation
In-Reply-To: <1134705617067-b51dec64cec55f52@cisco.com>
Message-ID: <1134705617067-bb88e1b23a3e36b6@cisco.com>

Fix thinko in rd_atomic calculation: ffs(x) - 1 does not find the next power of 2 -- it should be fls(x - 1).

Signed-off-by: Jack Morgenstein 
Signed-off-by: Michael S. Tsirkin 
Signed-off-by: Roland Dreier 
---
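A quick standalone illustration of the thinko, for readers skimming the archive. This is a user-space sketch, not part of the patch; my_ffs()/my_fls() mimic the kernel's 1-based ffs()/fls() semantics and are not the kernel functions themselves:

	#include <stdio.h>

	/* 1-based index of the lowest/highest set bit; 0 for x == 0,
	 * matching the kernel's ffs()/fls() conventions */
	static int my_ffs(unsigned int x) { return __builtin_ffs(x); }
	static int my_fls(unsigned int x) { return x ? 32 - __builtin_clz(x) : 0; }

	int main(void)
	{
		unsigned int x;

		/* For x = 5: my_ffs(5) - 1 == 0, so 1 << 0 == 1 < 5 (too small),
		 * while my_fls(5 - 1) == 3, and 1 << 3 == 8 >= 5 (correct).
		 * The two expressions only agree when x is already a power of 2. */
		for (x = 1; x <= 8; ++x)
			printf("x=%u  ffs(x)-1=%d  fls(x-1)=%d\n",
			       x, my_ffs(x) - 1, my_fls(x - 1));
		return 0;
	}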
 drivers/infiniband/hw/mthca/mthca_qp.c |   17 ++++++-----------
 1 files changed, 6 insertions(+), 11 deletions(-)

6aa2e4e8063114bd7cea8616dd5848d3c64b4c36
diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c
index c5c3d0e..84056a8 100644
--- a/drivers/infiniband/hw/mthca/mthca_qp.c
+++ b/drivers/infiniband/hw/mthca/mthca_qp.c
@@ -728,9 +728,9 @@ int mthca_modify_qp(struct ib_qp *ibqp,
 	}
 
 	if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC) {
-		qp_context->params1 |= cpu_to_be32(min(attr->max_rd_atomic ?
-						       ffs(attr->max_rd_atomic) - 1 : 0,
-						       7) << 21);
+		if (attr->max_rd_atomic)
+			qp_context->params1 |=
+				cpu_to_be32(fls(attr->max_rd_atomic - 1) << 21);
 		qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_SRA_MAX);
 	}
 
@@ -769,8 +769,6 @@ int mthca_modify_qp(struct ib_qp *ibqp,
 	}
 
 	if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC) {
-		u8 rra_max;
-
 		if (qp->resp_depth && !attr->max_dest_rd_atomic) {
 			/*
 			 * Lowering our responder resources to zero.
@@ -798,13 +796,10 @@ int mthca_modify_qp(struct ib_qp *ibqp,
 							MTHCA_QP_OPTPAR_RAE);
 		}
 
-		for (rra_max = 0;
-		     1 << rra_max < attr->max_dest_rd_atomic &&
-			     rra_max < dev->qp_table.rdb_shift;
-		     ++rra_max)
-			; /* nothing */
+		if (attr->max_dest_rd_atomic)
+			qp_context->params2 |=
+				cpu_to_be32(fls(attr->max_dest_rd_atomic - 1) << 21);
 
-		qp_context->params2 |= cpu_to_be32(rra_max << 21);
 		qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RRA_MAX);
 
 		qp->resp_depth = attr->max_dest_rd_atomic;
-- 
0.99.9n

From rolandd at cisco.com Thu Dec 15 20:00:17 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Fri, 16 Dec 2005 04:00:17 +0000
Subject: [openib-general] [git patch review 3/7] IB/mthca: don't change driver's copy of attributes if modify QP fails
In-Reply-To: <1134705617067-bb88e1b23a3e36b6@cisco.com>
Message-ID: <1134705617068-301e9a8555929947@cisco.com>

Only change the driver's copy of the QP attributes in modify QP after checking the modify QP command completed successfully.

Signed-off-by: Jack Morgenstein 
Signed-off-by: Michael S.
Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_qp.c | 11 ++++++----- 1 files changed, 6 insertions(+), 5 deletions(-) 44b5b0303327cfb23f135b95b2fe5436c81ed27c diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index 84056a8..3543299 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -764,8 +764,6 @@ int mthca_modify_qp(struct ib_qp *ibqp, qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RWE | MTHCA_QP_OPTPAR_RRE | MTHCA_QP_OPTPAR_RAE); - - qp->atomic_rd_en = attr->qp_access_flags; } if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC) { @@ -801,8 +799,6 @@ int mthca_modify_qp(struct ib_qp *ibqp, cpu_to_be32(fls(attr->max_dest_rd_atomic - 1) << 21); qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RRA_MAX); - - qp->resp_depth = attr->max_dest_rd_atomic; } qp_context->params2 |= cpu_to_be32(MTHCA_QP_BIT_RSC); @@ -844,8 +840,13 @@ int mthca_modify_qp(struct ib_qp *ibqp, err = -EINVAL; } - if (!err) + if (!err) { qp->state = new_state; + if (attr_mask & IB_QP_ACCESS_FLAGS) + qp->atomic_rd_en = attr->qp_access_flags; + if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC) + qp->resp_depth = attr->max_dest_rd_atomic; + } mthca_free_mailbox(dev, mailbox); -- 0.99.9n From rolandd at cisco.com Thu Dec 15 20:00:17 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 16 Dec 2005 04:00:17 +0000 Subject: [openib-general] [git patch review 5/7] IB/mthca: Fix SRQ cleanup during QP destroy In-Reply-To: <1134705617068-92874cd0c1ec02ff@cisco.com> Message-ID: <1134705617068-ccaf1fa200ee1176@cisco.com> When cleaning up a CQ for a QP attached to SRQ, need to free an SRQ WQE only if the CQE is a receive completion. Signed-off-by: Jack Morgenstein Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_cq.c | 11 ++++++++++- 1 files changed, 10 insertions(+), 1 deletions(-) 576d2e4e40315e8140c04be99cd057720d8a3817 diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c index 4a8adce..fcef8dc 100644 --- a/drivers/infiniband/hw/mthca/mthca_cq.c +++ b/drivers/infiniband/hw/mthca/mthca_cq.c @@ -253,6 +253,15 @@ void mthca_cq_event(struct mthca_dev *de wake_up(&cq->wait); } +static inline int is_recv_cqe(struct mthca_cqe *cqe) +{ + if ((cqe->opcode & MTHCA_ERROR_CQE_OPCODE_MASK) == + MTHCA_ERROR_CQE_OPCODE_MASK) + return !(cqe->opcode & 0x01); + else + return !(cqe->is_send & 0x80); +} + void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn, struct mthca_srq *srq) { @@ -296,7 +305,7 @@ void mthca_cq_clean(struct mthca_dev *de while ((int) --prod_index - (int) cq->cons_index >= 0) { cqe = get_cqe(cq, prod_index & cq->ibcq.cqe); if (cqe->my_qpn == cpu_to_be32(qpn)) { - if (srq) + if (srq && is_recv_cqe(cqe)) mthca_free_srq_wqe(srq, be32_to_cpu(cqe->wqe)); ++nfreed; } else if (nfreed) -- 0.99.9n From rolandd at cisco.com Thu Dec 15 20:00:17 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 16 Dec 2005 04:00:17 +0000 Subject: [openib-general] [git patch review 4/7] IB/mthca: Fix thinko in mthca_table_find() In-Reply-To: <1134705617068-301e9a8555929947@cisco.com> Message-ID: <1134705617068-92874cd0c1ec02ff@cisco.com> break only escapes from the innermost loop, and we want to escape both loops and return an answer. Noticed by Ishai Rabinovitch. Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_memfree.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) 6c7d2a75b512c64c910b69adf32dbaddb461910b diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c b/drivers/infiniband/hw/mthca/mthca_memfree.c index 5798ed0..9fb985a 100644 --- a/drivers/infiniband/hw/mthca/mthca_memfree.c +++ b/drivers/infiniband/hw/mthca/mthca_memfree.c @@ -233,7 +233,7 @@ void *mthca_table_find(struct mthca_icm_ for (i = 0; i < chunk->npages; ++i) { if (chunk->mem[i].length >= offset) { page = chunk->mem[i].page; - break; + goto out; } offset -= chunk->mem[i].length; } -- 0.99.9n From rolandd at cisco.com Thu Dec 15 20:00:17 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 16 Dec 2005 04:00:17 +0000 Subject: [openib-general] [git patch review 7/7] IB/mthca: Fix corner cases in max_rd_atomic value handling in modify QP In-Reply-To: <1134705617068-7e5f92b5a82fa6a2@cisco.com> Message-ID: <1134705617068-3687c807077d2ef3@cisco.com> sae and sre bits should only be set when setting sra_max. Further, in the old code, if the caller specifies max_rd_atomic = 0, the sre and sae bits are still set, with the result that the QP ends up with max_rd_atomic = 1 in effect. Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_qp.c | 10 ++++++---- 1 files changed, 6 insertions(+), 4 deletions(-) c4342d8a4d95e18b957b898dbf5bfce28fca2780 diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index e826c9f..d786ef4 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -747,9 +747,7 @@ int mthca_modify_qp(struct ib_qp *ibqp, qp_context->wqe_lkey = cpu_to_be32(qp->mr.ibmr.lkey); qp_context->params1 = cpu_to_be32((MTHCA_ACK_REQ_FREQ << 28) | (MTHCA_FLIGHT_LIMIT << 24) | - MTHCA_QP_BIT_SRE | - MTHCA_QP_BIT_SWE | - MTHCA_QP_BIT_SAE); + MTHCA_QP_BIT_SWE); if (qp->sq_policy == IB_SIGNAL_ALL_WR) qp_context->params1 |= cpu_to_be32(MTHCA_QP_BIT_SSC); if (attr_mask & IB_QP_RETRY_CNT) { @@ -758,9 +756,13 @@ int mthca_modify_qp(struct ib_qp *ibqp, } if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC) { - if (attr->max_rd_atomic) + if (attr->max_rd_atomic) { + qp_context->params1 |= + cpu_to_be32(MTHCA_QP_BIT_SRE | + MTHCA_QP_BIT_SAE); qp_context->params1 |= cpu_to_be32(fls(attr->max_rd_atomic - 1) << 21); + } qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_SRA_MAX); } -- 0.99.9n From rolandd at cisco.com Thu Dec 15 20:00:17 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 16 Dec 2005 04:00:17 +0000 Subject: [openib-general] [git patch review 1/7] IB/mthca: check RDMA limits Message-ID: <1134705617067-b51dec64cec55f52@cisco.com> Add limit checking on rd_atomic and dest_rd_atomic attributes: especially for max_dest_rd_atomic, a value that is larger than HCA capability can cause RDB overflow and corruption of another QP. Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_qp.c | 14 ++++++++++++++ 1 files changed, 14 insertions(+), 0 deletions(-) 94361cf74a6fca1973d2fed5338d5fb4bcd902fa diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index 7450550..c5c3d0e 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -591,6 +591,20 @@ int mthca_modify_qp(struct ib_qp *ibqp, return -EINVAL; } + if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC && + attr->max_rd_atomic > dev->limits.max_qp_init_rdma) { + mthca_dbg(dev, "Max rdma_atomic as initiator %u too large (max is %d)\n", + attr->max_rd_atomic, dev->limits.max_qp_init_rdma); + return -EINVAL; + } + + if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC && + attr->max_dest_rd_atomic > 1 << dev->qp_table.rdb_shift) { + mthca_dbg(dev, "Max rdma_atomic as responder %u too large (max %d)\n", + attr->max_dest_rd_atomic, 1 << dev->qp_table.rdb_shift); + return -EINVAL; + } + mailbox = mthca_alloc_mailbox(dev, GFP_KERNEL); if (IS_ERR(mailbox)) return PTR_ERR(mailbox); -- 0.99.9n From rolandd at cisco.com Thu Dec 15 20:00:17 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 16 Dec 2005 04:00:17 +0000 Subject: [openib-general] [git patch review 6/7] IB/mthca: Fix IB_QP_ACCESS_FLAGS handling. In-Reply-To: <1134705617068-ccaf1fa200ee1176@cisco.com> Message-ID: <1134705617068-7e5f92b5a82fa6a2@cisco.com> This patch corrects some corner cases in managing the RAE/RRE bits in the mthca qp context. These bits need to be zero if the user requests max_dest_rd_atomic of zero. The bits need to be restored to the value implied by the qp access flags attribute in a previous (or the current) modify-qp command if the dest_rd_atomic variable is changed to non-zero. In the current implementation, the following scenario will not work: RESET-to-INIT set QP access flags to all disabled (zeroes) INIT-to-RTR set max_dest_rd_atomic=10, AND set qp_access_flags = IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_ATOMIC The current code will incorrectly take the access-flags value set in the RESET-to-INIT transition. We can simplify, and correct, this IB_QP_ACCESS_FLAGS handling: it is always safe to set qp access flags in the firmware command if either of IB_QP_MAX_DEST_RD_ATOMIC or IB_QP_ACCESS_FLAGS is set, so let's just set it to the correct value, always. Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_qp.c | 87 ++++++++++++++------------------ 1 files changed, 37 insertions(+), 50 deletions(-) d1646f86a2a05a956adbb163c81a81bd621f055e diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index 3543299..e826c9f 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -522,6 +522,36 @@ static void init_port(struct mthca_dev * mthca_warn(dev, "INIT_IB returned status %02x.\n", status); } +static __be32 get_hw_access_flags(struct mthca_qp *qp, struct ib_qp_attr *attr, + int attr_mask) +{ + u8 dest_rd_atomic; + u32 access_flags; + u32 hw_access_flags = 0; + + if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC) + dest_rd_atomic = attr->max_dest_rd_atomic; + else + dest_rd_atomic = qp->resp_depth; + + if (attr_mask & IB_QP_ACCESS_FLAGS) + access_flags = attr->qp_access_flags; + else + access_flags = qp->atomic_rd_en; + + if (!dest_rd_atomic) + access_flags &= IB_ACCESS_REMOTE_WRITE; + + if (access_flags & IB_ACCESS_REMOTE_READ) + hw_access_flags |= MTHCA_QP_BIT_RRE; + if (access_flags & IB_ACCESS_REMOTE_ATOMIC) + hw_access_flags |= MTHCA_QP_BIT_RAE; + if (access_flags & IB_ACCESS_REMOTE_WRITE) + hw_access_flags |= MTHCA_QP_BIT_RWE; + + return cpu_to_be32(hw_access_flags); +} + int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) { struct mthca_dev *dev = to_mdev(ibqp->device); @@ -743,57 +773,7 @@ int mthca_modify_qp(struct ib_qp *ibqp, qp_context->snd_db_index = cpu_to_be32(qp->sq.db_index); } - if (attr_mask & IB_QP_ACCESS_FLAGS) { - qp_context->params2 |= - cpu_to_be32(attr->qp_access_flags & IB_ACCESS_REMOTE_WRITE ? - MTHCA_QP_BIT_RWE : 0); - - /* - * Only enable RDMA reads and atomics if we have - * responder resources set to a non-zero value. - */ - if (qp->resp_depth) { - qp_context->params2 |= - cpu_to_be32(attr->qp_access_flags & IB_ACCESS_REMOTE_READ ? - MTHCA_QP_BIT_RRE : 0); - qp_context->params2 |= - cpu_to_be32(attr->qp_access_flags & IB_ACCESS_REMOTE_ATOMIC ? - MTHCA_QP_BIT_RAE : 0); - } - - qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RWE | - MTHCA_QP_OPTPAR_RRE | - MTHCA_QP_OPTPAR_RAE); - } - if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC) { - if (qp->resp_depth && !attr->max_dest_rd_atomic) { - /* - * Lowering our responder resources to zero. - * Turn off reads RDMA and atomics as responder. - * (RRE/RAE in params2 already zero) - */ - qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RRE | - MTHCA_QP_OPTPAR_RAE); - } - - if (!qp->resp_depth && attr->max_dest_rd_atomic) { - /* - * Increasing our responder resources from - * zero. Turn on RDMA reads and atomics as - * appropriate. - */ - qp_context->params2 |= - cpu_to_be32(qp->atomic_rd_en & IB_ACCESS_REMOTE_READ ? - MTHCA_QP_BIT_RRE : 0); - qp_context->params2 |= - cpu_to_be32(qp->atomic_rd_en & IB_ACCESS_REMOTE_ATOMIC ? 
-					MTHCA_QP_BIT_RAE : 0);
-
-			qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RRE |
-								MTHCA_QP_OPTPAR_RAE);
-		}
-
 		if (attr->max_dest_rd_atomic)
 			qp_context->params2 |=
 				cpu_to_be32(fls(attr->max_dest_rd_atomic - 1) << 21);
@@ -801,6 +781,13 @@ int mthca_modify_qp(struct ib_qp *ibqp,
 		qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RRA_MAX);
 	}
 
+	if (attr_mask & (IB_QP_ACCESS_FLAGS | IB_QP_MAX_DEST_RD_ATOMIC)) {
+		qp_context->params2 |= get_hw_access_flags(qp, attr, attr_mask);
+		qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RWE |
+							MTHCA_QP_OPTPAR_RRE |
+							MTHCA_QP_OPTPAR_RAE);
+	}
+
 	qp_context->params2 |= cpu_to_be32(MTHCA_QP_BIT_RSC);
 
 	if (ibqp->srq)
-- 
0.99.9n

From mst at mellanox.co.il Fri Dec 16 05:57:40 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 16 Dec 2005 15:57:40 +0200
Subject: [openib-general] [RFC] IB_AT_MOST
Message-ID: <20051216135740.GA8300@mellanox.co.il>

Hi!
I recently noted that some middleware seems to use the "as much as possible" approach, for example, using the maximum possible value for max_rd_atomic or other fields, in create/modify qp.

An obvious thing could be to perform query_device and use max. values from there. However, it turns out that hardware max supported values might not be easy to express in terms of a single constant. Consider for example the max number of s/g entries supported per WQE: mellanox HCAs support a different number of these for RC and UD QPs. So whatever single number query device reports, using it will never achieve what the user wants for all QP types.

Rather than extending the device query for all thinkable hardware weirdness, I'd like to propose, instead, the following API extension (below): passing a negative value in e.g. a qp attribute would have the meaning: let hardware use at most the specified value. This, as opposed to the usual "at least the specified value" meaning for positive values.

How does the following work, for an API? Please comment.

Thanks,
MST

Index: openib/drivers/infiniband/include/rdma/ib_verbs.h
===================================================================
--- openib/drivers/infiniband/include/rdma/ib_verbs.h	(revision 4369)
+++ openib/drivers/infiniband/include/rdma/ib_verbs.h	(working copy)
@@ -56,6 +56,8 @@
 	class_device_create(cls, devt, device, fmt, ## arg)
 #endif /* XXX end of hack */
 
+#define IB_AT_MOST(x) (-(x))
+
 union ib_gid {
 	u8			raw[16];
 	struct {

-- 
MST
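A concrete sketch of the proposed semantics, for readers of the archive. This is hypothetical consumer code, not part of the posted patch, and it assumes the attribute field in question is signed or widened so that a negative request survives — a detail the RFC leaves open:

	struct ib_qp_attr attr;
	int err;

	/* Today this requests "at least 4"; ib_modify_qp() fails if the
	 * HCA cannot support 4 outstanding RDMA reads as initiator. */
	attr.max_rd_atomic = 4;

	/* Under the proposal this would request "as many as the hardware
	 * supports, but no more than 4"; the core or the provider would
	 * clamp the value instead of failing. */
	attr.max_rd_atomic = IB_AT_MOST(4);

	err = ib_modify_qp(qp, &attr, IB_QP_MAX_QP_RD_ATOMIC);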
From jlentini at netapp.com Fri Dec 16 06:38:51 2005
From: jlentini at netapp.com (James Lentini)
Date: Fri, 16 Dec 2005 09:38:51 -0500 (EST)
Subject: [openib-general] Re: [kDAPL]questions about the LMR creation of different types of memory
In-Reply-To: <7b2fa1820512081742j7ef50a27kc2322cbf0e52d908@mail.gmail.com>
References: <7b2fa1820512080626kf4c9c23hdc3f416dcb970f6d@mail.gmail.com> <7b2fa1820512081742j7ef50a27kc2322cbf0e52d908@mail.gmail.com>
Message-ID: 

ian> Question 1: How to distinguish a address that the adapter can use
ian> from that the adapter cannot use? Could you give an example? I am
ian> really not very familiar with the I/O address details.

Take a look at Documentation/IO-mapping.txt in the Linux source tree.

ian> Question 2: Which memory type should be use given a continuous
ian> range of physical memory? It seems simpler to use the
ian> DAT_MEM_TYPE_IA type since no translation is needed. But is not
ian> there any limitation to the memory to be registered using the
ian> DAT_MEM_TYPE_IA, contrasted with the DAT_MEM_PHYSICAL type?

If you have physical memory, use DAT_MEM_TYPE_PHYSICAL. If there is only 1 region, the array should only have 1 element.

From jlentini at netapp.com Fri Dec 16 06:42:24 2005
From: jlentini at netapp.com (James Lentini)
Date: Fri, 16 Dec 2005 09:42:24 -0500 (EST)
Subject: [openib-general] [kDAPL]Need the array of physical pages be continuous when using dat_lmr_kcreate
In-Reply-To: <7b2fa1820512120637m2869e1fdjf4c962decc4de9ae@mail.gmail.com>
References: <7b2fa1820512120637m2869e1fdjf4c962decc4de9ae@mail.gmail.com>
Message-ID: 

On Mon, 12 Dec 2005, Ian Jiang wrote:

ian> I created a LMR from three buffers which were allocated
ian> respectively with kmalloc of size 64kB. The registration went
ian> well, but the subsequent rdma read dto completed with a
ian> DAT_DTO_ERR_LOCAL_PROTECTION error. Was that because the physical
ian> address of the three buffers were not continuous?

Was the LMR the target or the destination of the RDMA read? What dat_mem_priv_flags values are you passing to dat_lmr_kcreate? If you are unsure of what to use, try DAT_MEM_PRIV_ALL_FLAG.

From swise at opengridcomputing.com Fri Dec 16 09:00:06 2005
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 16 Dec 2005 11:00:06 -0600
Subject: [openib-general] [PATCH] [IWARP BRANCH] - hang on reset from client
Message-ID: <1134752406.13678.19.camel@stevo-desktop>

The c2 provider was not caching the iw cm_id in the qp for the passive side connection. This causes an OOPS in the interrupt path when a RST is received on the passive side.

Signed-off-by: Steve Wise 

Index: c2_provider.c
===================================================================
--- c2_provider.c	(revision 4497)
+++ c2_provider.c	(working copy)
@@ -583,9 +583,13 @@
 static int c2_accept(struct iw_cm_id* cm_id, const void *pdata, u8 pdata_len)
 {
 	int err;
+	struct c2_qp* qp = container_of(cm_id->qp, struct c2_qp, ibqp);
 
 	dprintk("%s:%s:%u\n", __FILE__, __FUNCTION__, __LINE__);
 
+	/* Cache the cm_id in the qp */
+	qp->cm_id = cm_id;
+
 	err = c2_llp_accept(cm_id, pdata, pdata_len);
 
 	return err;
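For readers unfamiliar with the idiom in the first added line: container_of() recovers the driver-private structure from a pointer to a member embedded inside it. A minimal sketch of what the macro boils down to, assuming the usual kernel definition (illustration only, not code from the branch):

	struct c2_qp {
		struct ib_qp ibqp;	/* embedded member */
		struct iw_cm_id *cm_id;	/* back-pointer this fix caches */
		/* ... */
	};

	/* container_of(cm_id->qp, struct c2_qp, ibqp) is essentially: */
	struct c2_qp *qp = (struct c2_qp *)
		((char *)cm_id->qp - offsetof(struct c2_qp, ibqp));

With the cm_id cached on both the active side (c2_connect, in the earlier iwcm rework posted to this list) and now the passive side, the interrupt path that sees only the QP on an incoming RST can find its way back to the connection identifier.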
From iod00d at hp.com Fri Dec 16 10:03:38 2005
From: iod00d at hp.com (Grant Grundler)
Date: Fri, 16 Dec 2005 10:03:38 -0800
Subject: [openib-general] Re: [kDAPL]questions about the LMR creation of different types of memory
In-Reply-To: 
References: <7b2fa1820512080626kf4c9c23hdc3f416dcb970f6d@mail.gmail.com> <7b2fa1820512081742j7ef50a27kc2322cbf0e52d908@mail.gmail.com>
Message-ID: <20051216180338.GC8493@esmail.cup.hp.com>

On Fri, Dec 16, 2005 at 09:38:51AM -0500, James Lentini wrote:
> ian> Question 1: How to distinguish a address that the adapter can use
> ian> from that the adapter cannot use? Could you give an example? I am
> ian> really not very familiar with the I/O address details.
> 
> Take a look at Documentation/IO-mapping.txt in the Linux source tree.

While IO-mapping.txt gives a nice introduction into the topic of "bus addresses", the answer to the question lies in Documentation/DMA-API.txt. IO devices can only use "bus addresses" that are handed back by the interfaces described in DMA-API.txt. For OpenIB, ULPs (e.g. SDP or IPoIB) are responsible for properly mapping and unmapping for DMA use.

While many architectures don't use IOMMU (and thus have 1:1 between host physical:bus address), virtualization seems to be forcing the issue in the "near" future. All DMA access will need to be enforced to isolate virtualized guests. This is something some platforms with IOMMUs enforce today (e.g. Sparc64, PPC64 and PA-RISC).

hth,
grant

From rdreier at cisco.com Fri Dec 16 12:38:21 2005
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 16 Dec 2005 12:38:21 -0800
Subject: [openib-general] Re: [PATCH] libibverbs: document immediate data ordering
In-Reply-To: <20051201100252.GS25751@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 1 Dec 2005 12:02:52 +0200")
References: <20051201100252.GS25751@mellanox.co.il>
Message-ID: 

Thanks, applied.

From ftillier at silverstorm.com Fri Dec 16 12:42:51 2005
From: ftillier at silverstorm.com (Fab Tillier)
Date: Fri, 16 Dec 2005 12:42:51 -0800
Subject: [openib-general] [RFC] IB_AT_MOST
In-Reply-To: <20051216135740.GA8300@mellanox.co.il>
Message-ID: <000201c60281$4cf5c610$6401a8c0@infiniconsys.com>

Hi Michael,

> From: Michael S. Tsirkin [mailto:mst at mellanox.co.il]
> Sent: Friday, December 16, 2005 5:58 AM
> 
> Hi!
> I recently noted that some middleware seems to use the "as much
> as possible" approach, for example, using maximum possible value
> for max_rd_atomic or other fields, in create/modify qp.
> 
> An obvious thing could be to perform query_device and use max.
> values from there. However, it turns out that hardware max supported
> values might not be easy to express in terms of a single constant.
> Consider for example the max number of s/g entries supported per
> WQE: mellanox HCAs support different number of these for RC and UD
> QPs. So whatever single number query device reports, using it will
> never achieve what the user wants for all QP types.
> 
> Rather than extending the device query for all thinkable hardware
> weirdness, I'd like to propose, instead, the following API extension
> (below): passing a negative value in e.g. qp attribute would have the
> meaning: let hardware use at most the specified value.
> This, as opposed to the usual "at least the specified value" meaning
> for positive values.
> 
> How does the following work, for an API? Please comment.

I don't understand the IB_AT_MOST macro. If someone uses IB_AT_MOST( 1 ) and the hardware supports 4, they will get 4, which is definitely not "at most 1".
I would rename it to IB_MAX, and define it a -1 or something like that. - Fab From halr at voltaire.com Fri Dec 16 12:54:52 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Dec 2005 15:54:52 -0500 Subject: [openib-general] A couple of questions about osm_lid_mgr.c::__osm_lid_mgr_set_physp_pi Message-ID: <1134766491.4338.10299.camel@hal.voltaire.com> Hi, I have a couple of questions about osm_lid_mgr.c::__osm_lid_mgr_set_physp_pi. There is the following code: if ( (mtu != ib_port_info_get_mtu_cap( p_old_pi )) || (op_vls != ib_port_info_get_op_vls(p_old_pi))) { if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) { osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "__osm_lid_mgr_set_physp_pi: " "Sending Link Down due to op_vls or mtu change. MTU:%u,%u VL_CAP:%u,%u\n", mtu, ib_port_info_get_mtu_cap( p_old_pi ), op_vls, ib_port_info_get_op_vls(p_old_pi) ); } ib_port_info_set_port_state( p_pi, IB_LINK_DOWN ); This seems a little inconsistent to me. It seems like NeighborMTU would be the equivalent of OperationalVLs, rather than MTUCap (which is RO). Also, why does changing the MTU require that the link be taken down ? I also noticed a nit in the same function: p_pi->m_key_lease_period = p_mgr->p_subn->opt.m_key_lease_period; /* Check to see if the value we are setting is different than the value in the port_info. If it is - turn on send_set flag */ if (cl_memcmp( &p_pi->m_key_lease_period, &p_old_pi->m_key_lease_period, sizeof(p_pi->m_key_lease_period) )) send_set = TRUE; Should that be only when the Mkey is non 0 ? -- Hal From rolandd at cisco.com Fri Dec 16 15:48:54 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 16 Dec 2005 15:48:54 -0800 Subject: [openib-general] [PATCH 03/13] [RFC] ipath copy routines In-Reply-To: <200512161548.HbgfRzF2TysjsR2G@cisco.com> Message-ID: <200512161548.lRw6KI369ooIXS9o@cisco.com> Copy routines for ipath driver --- drivers/infiniband/hw/ipath/ipath_copy.c | 666 ++++++++++++++++++++++++++ drivers/infiniband/hw/ipath/ipath_dwordcpy.S | 62 ++ 2 files changed, 728 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/hw/ipath/ipath_copy.c create mode 100644 drivers/infiniband/hw/ipath/ipath_dwordcpy.S 99f636a78e0d759ab663a7abb29e6a71b32a552d diff --git a/drivers/infiniband/hw/ipath/ipath_copy.c b/drivers/infiniband/hw/ipath/ipath_copy.c new file mode 100644 index 0000000..26211ad --- /dev/null +++ b/drivers/infiniband/hw/ipath/ipath_copy.c @@ -0,0 +1,666 @@ +/* + * Copyright (c) 2003, 2004, 2005. PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. + * + * $Id: ipath_copy.c 4365 2005-12-10 00:04:16Z rjwalsh $ + */ + +/* + * This file provides support for doing sk_buff buffer swapping between + * the low level driver eager buffers, and the network layer. It's part + * of the core driver, rather than the ether driver, because it relies + * on variables and functions in the core driver. It exports a single + * entry point for use in the ipath_ether module. + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include /* we can generate our own crc's for testing */ + +#include "ipath_kernel.h" +#include "ips_common.h" +#include "ipath_layer.h" + +#define TRUE 1 +#define FALSE 0 + +/* + * Allocate a PIO send buffer, initialize the header and copy it out. + */ +static int layer_send_getpiobuf(struct copy_data_s *cdp) +{ + int whichpb; + uint32_t device = cdp->device; + uint32_t extra_bytes; + uint32_t len, nwords; + uint32_t *piobuf; + + whichpb = ipath_getpiobuf(device); + if (whichpb < 0) { + cdp->error = whichpb; + return whichpb; + } + + /* + * Compute the max amount of data that can fit into a PIO buffer. + * buffer size - header size - trigger qword length & flags - CRC + */ + len = devdata[device].ipath_ibmaxlen - + sizeof(ether_header_typ) - 8 - (SIZE_OF_CRC << 2); + if (len > (cdp->len + cdp->extra)) + len = (cdp->len + cdp->extra); + /* Compute word aligment (i.e., (len & 3) ? 4 - (len & 3) : 0) */ + extra_bytes = (4 - len) & 3; + nwords = (sizeof(ether_header_typ) + len + extra_bytes) >> 2; + cdp->hdr->lrh[2] = htons(nwords + SIZE_OF_CRC); + cdp->hdr->bth[0] = htonl((OPCODE_ITH4X << 24) + (extra_bytes << 20) + + IPS_DEFAULT_P_KEY); + cdp->hdr->sub_opcode = OPCODE_ENCAP; + + cdp->hdr->bth[2] = 0; + /* Generate an interrupt on the receive side for the last fragment. */ + cdp->hdr->iph.pkt_flags = ((cdp->len+cdp->extra) == len) ? INFINIPATH_KPF_INTR : 0; + cdp->hdr->iph.chksum = + (uint16_t) IPS_LRH_BTH + + (uint16_t) (nwords + SIZE_OF_CRC) - + (uint16_t) ((cdp->hdr->iph.ver_port_tid_offset >> 16) & 0xFFFF) - + (uint16_t) (cdp->hdr->iph.ver_port_tid_offset & 0xFFFF) - + (uint16_t) cdp->hdr->iph.pkt_flags; + + piobuf = (uint32_t *) (((char *)(devdata[device].ipath_kregbase)) + + devdata[device].ipath_piobufbase + + whichpb * devdata[device].ipath_palign); + _IPATH_VDBG("send %d (%x %x %x %x %x %x %x)\n", nwords, + cdp->hdr->lrh[0], + cdp->hdr->lrh[1], + cdp->hdr->lrh[2], + cdp->hdr->lrh[3], + cdp->hdr->bth[0], cdp->hdr->bth[1], cdp->hdr->bth[2]); + /* + * Write len to control qword, no flags. + * +1 is for the qword padding of pbc. 
+ */ + *((uint64_t *) piobuf) = (uint64_t) (nwords + 1); + piobuf += 2; + ipath_dwordcpy(piobuf, (uint32_t *) cdp->hdr, + sizeof(ether_header_typ) >> 2); + cdp->csum_pio = &((ether_header_typ *) piobuf)->csum; + cdp->to = piobuf + (sizeof(ether_header_typ) >> 2); + cdp->flen = nwords - (sizeof(ether_header_typ) >> 2); + cdp->hdr->frag_num++; + return 0; +} + +/* + * Copy data out of one or a chain of sk_buffs, into the PIO buffer. + * Fragment an sk_buff into multiple IB packets if the amount of data is + * more than a single eager send. + * Offset and len are in bytes. + * Note that this function is recursive! + */ +static void copy_bits(const struct sk_buff *skb, unsigned int offset, + unsigned int len, struct copy_data_s *cdp) +{ + unsigned int start = skb_headlen(skb); + unsigned int i, copy; + uint32_t n; + u8 *p; + + /* Copy header. */ + if ((int)(copy = start - offset) > 0) { + if (copy > len) + copy = len; + p = skb->data + offset; + offset += copy; + len -= copy; + /* + * If the alignment buffer is not empty, fill it and write + * it out. + */ + if (cdp->extra) { + if (cdp->extra == 4) + goto extra_copy_bits_done; + + while (copy != 0) { + cdp->u.buf[cdp->extra] = *p++; + copy--; + cdp->offset++; + cdp->len--; + + if (++cdp->extra == 4) { +extra_copy_bits_done: + if (cdp->flen == 0 + && layer_send_getpiobuf(cdp) < 0) + return; + *cdp->to++ = cdp->u.w; + cdp->extra = 0; + cdp->flen -= 1; + break; + } + } + } + while (copy >= 4) { + if (cdp->flen == 0 && layer_send_getpiobuf(cdp) < 0) + return; + n = copy >> 2; + if (n > cdp->flen) + n = cdp->flen; + ipath_dwordcpy(cdp->to, (uint32_t *) p, n); + cdp->to += n; + cdp->flen -= n; + n <<= 2; + p += n; + cdp->offset += n; + cdp->len -= n; + copy -= n; + } + /* + * Either cdp->extra is zero or copy is zero which means that + * the loop here can't cause the alignment buffer to fill up. + */ + while (copy != 0) { + cdp->u.buf[cdp->extra++] = *p++; + copy--; + cdp->offset++; + cdp->len--; + + } + if (len == 0) + return; + } + + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + unsigned int end; + + end = start + frag->size; + if ((int)(copy = end - offset) > 0) { + u8 *vaddr; + + if (copy > len) + copy = len; + vaddr = kmap_skb_frag(frag); + p = vaddr + frag->page_offset + offset - start; + offset += copy; + len -= copy; + /* + * If the alignment buffer is not empty, fill + * it and write it out. + */ + if (cdp->extra) { + if (cdp->extra == 4) + goto extra1_copy_bits_done; + + while (copy != 0) { + cdp->u.buf[cdp->extra] = *p++; + copy--; + cdp->offset++; + cdp->len--; + + if (++cdp->extra == 4) { +extra1_copy_bits_done: + if (cdp->flen == 0 + && layer_send_getpiobuf(cdp) + < 0) + return; + *cdp->to++ = cdp->u.w; + cdp->extra = 0; + cdp->flen -= 1; + break; + } + } + } + while (copy >= 4) { + if (cdp->flen == 0 + && layer_send_getpiobuf(cdp) < 0) + return; + n = copy >> 2; + if (n > cdp->flen) + n = cdp->flen; + ipath_dwordcpy(cdp->to, (uint32_t *) p, n); + cdp->to += n; + cdp->flen -= n; + n <<= 2; + p += n; + cdp->offset += n; + cdp->len -= n; + copy -= n; + } + /* + * Either cdp->extra is zero or copy is zero + * which means that the loop here can't cause + * the alignment buffer to fill up. 
+ */ + while (copy != 0) { + cdp->u.buf[cdp->extra++] = *p++; + copy--; + cdp->offset++; + cdp->len--; + } + kunmap_skb_frag(vaddr); + + if (len == 0) + return; + } + start = end; + } + + if (skb_shinfo(skb)->frag_list) { + struct sk_buff *list = skb_shinfo(skb)->frag_list; + + for (; list; list = list->next) { + unsigned int end; + + end = start + list->len; + if ((int)(copy = end - offset) > 0) { + if (copy > len) + copy = len; + copy_bits(list, offset - start, copy, cdp); + if (cdp->error || (len -= copy) == 0) + return; + } + start = end; + } + } + if (len) + cdp->error = -EFAULT; +} + +/* + * Copy data out of one or a chain of sk_buffs, into the PIO buffer, generating + * the checksum as we go. + * Fragment an sk_buff into multiple IB packets if the amount of data is + * more than a single eager send. + * Offset and len are in bytes. + * Note that this function is recursive! + */ +static void copy_and_csum_bits(const struct sk_buff *skb, unsigned int offset, + unsigned int len, struct copy_data_s *cdp) +{ + unsigned int start = skb_headlen(skb); + unsigned int i, copy; + unsigned int csum2; + uint32_t n; + u8 *p; + + /* Copy header. */ + if ((int)(copy = start - offset) > 0) { + if (copy > len) + copy = len; + p = skb->data + offset; + offset += copy; + len -= copy; + if (!cdp->checksum_calc) { + cdp->checksum_calc = TRUE; + + csum2 = csum_partial(p, copy, 0); + cdp->csum = csum_block_add(cdp->csum, csum2, cdp->pos); + cdp->pos += copy; + } + /* + * If the alignment buffer is not empty, fill it and + * write it out. + */ + if (cdp->extra) { + if (cdp->extra == 4) + goto extra_copy_and_csum_bits_done; + + while (copy != 0) { + cdp->u.buf[cdp->extra] = *p++; + copy--; + cdp->offset++; + cdp->len--; + if (++cdp->extra == 4) { +extra_copy_and_csum_bits_done: + if (cdp->flen == 0 + && layer_send_getpiobuf(cdp) < 0) + return; + /* + * write the checksum before + * the last PIO write. + */ + if (cdp->flen == 1) { + *cdp->csum_pio = + csum_fold(cdp->csum); + mb(); + } + *cdp->to++ = cdp->u.w; + cdp->extra = 0; + cdp->flen -= 1; + break; + } + } + } + + while (copy >= 4) { + if (cdp->flen == 0 && layer_send_getpiobuf(cdp) < 0) + return; + + n = copy >> 2; + if (n > cdp->flen) + n = cdp->flen; + /* write the checksum before the last PIO write. */ + if (cdp->flen == n) { + *cdp->csum_pio = csum_fold(cdp->csum); + mb(); + } + ipath_dwordcpy(cdp->to, (uint32_t *) p, n); + cdp->to += n; + cdp->flen -= n; + n <<= 2; + p += n; + cdp->offset += n; + cdp->len -= n; + copy -= n; + } + /* + * Either cdp->extra is zero or copy is zero which means that + * the loop here can't cause the alignment buffer to fill up. + */ + while (copy != 0) { + cdp->u.buf[cdp->extra++] = *p++; + copy--; + cdp->offset++; + cdp->len--; + } + + cdp->checksum_calc = FALSE; + + if (len == 0) + return; + } + + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + unsigned int end; + + end = start + frag->size; + if ((int)(copy = end - offset) > 0) { + u8 *vaddr; + + if (copy > len) + copy = len; + vaddr = kmap_skb_frag(frag); + p = vaddr + frag->page_offset + offset - start; + offset += copy; + len -= copy; + + if (!cdp->checksum_calc) { + cdp->checksum_calc = TRUE; + + csum2 = csum_partial(p, copy, 0); + cdp->csum = csum_block_add(cdp->csum, csum2, + cdp->pos); + cdp->pos += copy; + } + /* + * If the alignment buffer is not empty, fill + * it and write it out. 
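+ * Same as the non-checksum path, except that when the flushed dword + * would be the fragment's last (the cdp->flen == 1 case below), the + * folded checksum is stored through cdp->csum_pio first, since that + * final write is what triggers the send.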
+ */ + if (cdp->extra) { + if (cdp->extra == 4) + goto extra1_copy_and_csum_bits_done; + while (copy != 0) { + cdp->u.buf[cdp->extra] = *p++; + copy--; + cdp->offset++; + cdp->len--; + + if (++cdp->extra == 4) { +extra1_copy_and_csum_bits_done: + if (cdp->flen == 0 + && layer_send_getpiobuf(cdp) + < 0) { + kunmap_skb_frag(vaddr); + return; + } + /* + * write the checksum + * before the last PIO + * write. + */ + if (cdp->flen == 1) { + *cdp->csum_pio = + csum_fold(cdp-> + csum); + mb(); + } + *cdp->to++ = cdp->u.w; + cdp->extra = 0; + cdp->flen -= 1; + break; + } + } + } + while (copy >= 4) { + if (cdp->flen == 0 + && layer_send_getpiobuf(cdp) < 0) { + kunmap_skb_frag(vaddr); + return; + } + n = copy >> 2; + if (n > cdp->flen) + n = cdp->flen; + /* + * write the checksum before the last + * PIO write. + */ + if (cdp->flen == n) { + *cdp->csum_pio = csum_fold(cdp->csum); + mb(); + } + ipath_dwordcpy(cdp->to, (uint32_t *) p, n); + cdp->to += n; + cdp->flen -= n; + n <<= 2; + p += n; + cdp->offset += n; + cdp->len -= n; + copy -= n; + } + /* + * Either cdp->extra is zero or copy is zero + * which means that the loop here can't cause + * the alignment buffer to fill up. + */ + while (copy != 0) { + cdp->u.buf[cdp->extra++] = *p++; + copy--; + cdp->offset++; + cdp->len--; + } + kunmap_skb_frag(vaddr); + + cdp->checksum_calc = FALSE; + + if (len == 0) + return; + } + start = end; + } + + if (skb_shinfo(skb)->frag_list) { + struct sk_buff *list = skb_shinfo(skb)->frag_list; + + for (; list; list = list->next) { + unsigned int end; + + end = start + list->len; + if ((int)(copy = end - offset) > 0) { + if (copy > len) + copy = len; + copy_and_csum_bits(list, offset - start, copy, cdp); + if (cdp->error || (len -= copy) == 0) + return; + offset += copy; + } + start = end; + } + } + if (len) + cdp->error = -EFAULT; +} + +/* + * Note that the header should have the unchanging parts + * initialized but the rest of the header is computed as needed in + * order to break up skb data buffers larger than the hardware MTU. + * In other words, the Linux network stack MTU can be larger than the + * hardware MTU. + */ +int ipath_layer_send_skb(struct copy_data_s *cdata) +{ + int ret = 0; + uint16_t vlsllnh; + int device = cdata->device; + + if (device >= infinipath_max) { + _IPATH_INFO("Invalid unit %u, failing\n", device); + return -EINVAL; + } + if (!(devdata[device].ipath_flags & IPATH_RCVHDRSZ_SET)) { + _IPATH_INFO("send while not open\n"); + ret = -EINVAL; + } else + if ((devdata[device].ipath_flags & (IPATH_LINKUNK | IPATH_LINKDOWN)) + || devdata[device].ipath_lid == 0) { + /* lid check is for when sma hasn't yet configured */ + ret = -ENETDOWN; + _IPATH_VDBG("send while not ready, mylid=%u, flags=0x%x\n", + devdata[device].ipath_lid, + devdata[device].ipath_flags); + } + vlsllnh = *((uint16_t *) cdata->hdr); + if (vlsllnh != htons(IPS_LRH_BTH)) { + _IPATH_DBG("Warning: lrh[0] wrong (%x, not %x); not sending\n", + vlsllnh, htons(IPS_LRH_BTH)); + ret = -EINVAL; + } + if (ret) + goto done; + + cdata->error = 0; /* clear last calls error */ + + if (cdata->skb->ip_summed == CHECKSUM_HW) { + unsigned int csstart = cdata->skb->h.raw - cdata->skb->data; + + /* + * Computing the checksum is a bit tricky since if we fragment + * the packet, the fragment that should contain the checksum + * will have already been sent. The solution is to + * store the checksum in the header of the last fragment + * just before we write the last data word which triggers + * the last fragment to be sent. 
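+ * (A hypothetical two-fragment send as a sketch: fragment one goes + * out with csum_offset set in its header but no checksum stored; + * while fragment two is copied, the running csum is completed, + * folded, and written through cdp->csum_pio just ahead of that + * fragment's trigger word.)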
The receiver will + * check the header "tag" field, see that there is a + * checksum, and store the checksum back into the packet. + * + * Save the offset of the two byte checksum. + * Note that we have to add 2 to account for the two + * bytes of the ethernet address we stripped from the + * packet and put in the header. + */ + cdata->hdr->csum_offset = csstart + cdata->skb->csum + 2; + + if (cdata->offset < csstart) + copy_bits(cdata->skb, cdata->offset, + csstart - cdata->offset, cdata); + + if (cdata->error) { + return (cdata->error); + + } + + if (cdata->offset < cdata->skb->len) + copy_and_csum_bits(cdata->skb, cdata->offset, + cdata->skb->len - cdata->offset, + cdata); + + if (cdata->error) { + return (cdata->error); + } + + if (cdata->extra) { + while (cdata->extra < 4) + cdata->u.buf[cdata->extra++] = 0; + if (cdata->flen != 0 + || layer_send_getpiobuf(cdata) >= 0) { + /* + * write the checksum before the last + * PIO write. + */ + *cdata->csum_pio = csum_fold(cdata->csum); + mb(); + *cdata->to = cdata->u.w; + } + } + } else { + copy_bits(cdata->skb, cdata->offset, + cdata->skb->len - cdata->offset, cdata); + + if (cdata->error) { + return (cdata->error); + } + + if (cdata->extra) { + while (cdata->extra < 4) + cdata->u.buf[cdata->extra++] = 0; + if (cdata->flen != 0 + || layer_send_getpiobuf(cdata) >= 0) + *cdata->to = cdata->u.w; + } + } + + if (cdata->error) { + ret = cdata->error; + if (cdata->error != -EBUSY) + /* just means no PIO buffers available */ + _IPATH_UNIT_ERROR(device, + "layer_send copy_bits failed with error %d\n", + -ret); + } + + ipath_stats.sps_ether_spkts++; /* another ether packet sent */ + +done: + return ret; +} + +EXPORT_SYMBOL(ipath_layer_send_skb); diff --git a/drivers/infiniband/hw/ipath/ipath_dwordcpy.S b/drivers/infiniband/hw/ipath/ipath_dwordcpy.S new file mode 100644 index 0000000..fdd8ec7 --- /dev/null +++ b/drivers/infiniband/hw/ipath/ipath_dwordcpy.S @@ -0,0 +1,62 @@ +/* + * Copyright (c) 2003, 2004, 2005. PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. 
+ * + * $Id: ipath_dwordcpy.S 4365 2005-12-10 00:04:16Z rjwalsh $ + */ + +/* + * ipath_dwordcpy - Copy a memory block, primarily for writing to the + * InfiniPath PIO buffers, which only support dword multiple writes, and + * thus can not use memcpy(). For this reason, we use nothing smaller than + * dword writes. + * It is also used as a fast copy routine in some places that have been + * measured to win over memcpy, and the performance delta matters. + * + * Count is number of dwords; might not be a qword multiple. +*/ + + .globl ipath_dwordcpy +/* rdi destination, rsi source, rdx count */ +ipath_dwordcpy: + movl %edx,%ecx + shrl $1,%ecx + andl $1,%edx + cld + rep + movsq + movl %edx,%ecx + rep + movsd + ret -- 0.99.9n From rolandd at cisco.com Fri Dec 16 15:48:54 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 16 Dec 2005 15:48:54 -0800 Subject: [openib-general] [PATCH 00/13] [RFC] IB: PathScale InfiniPath driver In-Reply-To: <20051031150618.627779f1.akpm@osdl.org> Message-ID: <200512161548.jRuyTS0HPMLd7V81@cisco.com> Here is an initial submission from PathScale of a driver for InfiniPath InfiniBand HCAs. The driver is fairly big -- some single files are more than the 100 KB limit for lkml posts -- so I've split it up into a patch series so it can be reviewed inline. The split-up doesn't make sense functionally but I want to make review as easy as possible; any final import will merge the driver as a single git patch. I've also put the current splitup patchset into my git tree at git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git in the ipath branch. There are some things I noticed that could maybe be cleaned up, like having sysctls that set values also settable through module parameters under /sys/module, code inside #ifndef __KERNEL__ so include files can be shared with other PathScale code, code in ipath_i2c.c that might be simplified by using drivers/i2c, etc. I'd like to try to get a sense of whether I'm being too picky or whether PathScale really does need to fix these up before the driver is merged. Basically I'm trying to feel my way as a maintainer so I can find the right balance between wanting kernel code to be absolutely perfect and not wanting to put arbitrary hurdles in front of a vendor who has done a lot of work on contributing an open driver for their hardware. I am especially interested in feedback about the mergability of this driver from the broader kernel community, although of course feedback from the InfiniBand/RDMA community on IB-specific aspects of the driver is very much appreciated as well. Thanks, Roland From rolandd at cisco.com Fri Dec 16 15:48:54 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 16 Dec 2005 15:48:54 -0800 Subject: [openib-general] [PATCH 02/13] [RFC] ipath debug header In-Reply-To: <200512161548.aLjaDpGm5aqk0k0p@cisco.com> Message-ID: <200512161548.HbgfRzF2TysjsR2G@cisco.com> Debugging macros for ipath driver --- drivers/infiniband/hw/ipath/ipath_debug.h | 211 +++++++++++++++++++++++++++++ 1 files changed, 211 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/hw/ipath/ipath_debug.h 8f834f72344a3c7e9c5eafdf59b9fc96b4e08e5f diff --git a/drivers/infiniband/hw/ipath/ipath_debug.h b/drivers/infiniband/hw/ipath/ipath_debug.h new file mode 100644 index 0000000..c8b7374 --- /dev/null +++ b/drivers/infiniband/hw/ipath/ipath_debug.h @@ -0,0 +1,211 @@ +/* + * Copyright (c) 2003, 2004, 2005. PathScale, Inc. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. + * + * $Id: ipath_debug.h 4504 2005-12-16 06:15:47Z rjwalsh $ + */ + +#ifndef _IPATH_DEBUG_H +#define _IPATH_DEBUG_H + +/* + * This file contains tracing code that is lightweight, and can be + * called from both user and kernel mode. + * _IPATH_DBG should have the same calling conventions and semantics + * for both user and kernel. + */ +#ifndef _IPATH_DEBUGGING /* tracing enabled or not */ +#define _IPATH_DEBUGGING 1 +#endif + +/* This macro should only be used in the trace library code. */ +#define _IPATH_DEBUG_VARS_DECL unsigned infinipath_debug; +extern unsigned infinipath_debug; +extern const char *ipath_get_unit_name(int unit); + +/* + * These are always defined, because _IPATH_ERROR is always defined, + * unlike the other debugging calls. It might make sense to change + * to using "fprintf(stderr", for the usermode version, but not now. + */ +#ifdef __KERNEL__ +#define __IPPRT printk +#define __IPATH_UNIT_ERRID(unit) ipath_get_unit_name(unit) +#define __IPATH_ERRID "infinipath" +#else +#define __IPPRT printf +extern char *__progname; +#define __IPATH_UNIT_ERRID(unit) __progname +#define __IPATH_ERRID __progname +#define KERN_ERR +#endif + +#if _IPATH_DEBUGGING + +/* + * Mask values for debugging.
The scheme allows us to compile out any of + * the debug tracing stuff, and if compiled in, to enable or disable dynamically + * This can be set at modprobe time also: + * modprobe infinipath.ko infinipath_debug=7 + */ +#define __IPATH_INFO 0x1 /* generic low verbosity stuff */ +#define __IPATH_DBG 0x2 /* generic debug */ +#define __IPATH_TRSAMPLE 0x8 /* generate trace buffer sample entries */ +/* leave some low verbosity spots open */ +#define __IPATH_VERBDBG 0x40 /* very verbose debug */ +#define __IPATH_PKTDBG 0x80 /* print packet data */ +/* print process startup (init)/exit messages */ +#define __IPATH_PROCDBG 0x100 +/* print mmap/nopage stuff, not using VDBG any more */ +#define __IPATH_MMDBG 0x200 +#define __IPATH_USER_SEND 0x1000 /* use user mode send */ +#define __IPATH_KERNEL_SEND 0x2000 /* use kernel mode send */ +#define __IPATH_EPKTDBG 0x4000 /* print ethernet packet data */ +#define __IPATH_SMADBG 0x8000 /* sma packet debug */ +#define __IPATH_IPATHDBG 0x10000 /* Ethernet (IPATH) general debug on */ +#define __IPATH_IPATHWARN 0x20000 /* Ethernet (IPATH) warnings on */ +#define __IPATH_IPATHERR 0x40000 /* Ethernet (IPATH) errors on */ +#define __IPATH_IPATHPD 0x80000 /* Ethernet (IPATH) packet dump on */ +#define __IPATH_IPATHTABLE 0x100000 /* Ethernet (IPATH) table dump on */ + +#ifdef __KERNEL__ +#define __IPIDENT +#define __IPIDENT_ARG +#define __IP_INFO_TAG __IPATH_ERRID +#define _Pragma_unlikely +#else +#define KERN_INFO +#define KERN_DEBUG +#define __IPIDENT "%s" +#define __IPIDENT_ARG __ipath_mylabel, +#define __IP_INFO_TAG __func__ +extern char *__ipath_mylabel; +extern void ipath_set_mylabel(char *); +#define _Pragma_unlikely _Pragma("mips_frequency_hint never") +#endif + +#define _IPATH_UNIT_ERROR(unit,fmt,...) do { \ + _Pragma_unlikely \ + __IPPRT (KERN_ERR __IPIDENT "%s: " fmt, __IPIDENT_ARG __IPATH_UNIT_ERRID(unit), \ + ##__VA_ARGS__); \ + } while(0) + +#define _IPATH_ERROR(fmt,...) do { \ + _Pragma_unlikely \ + __IPPRT (KERN_ERR __IPIDENT "%s: " fmt, __IPIDENT_ARG __IPATH_ERRID, \ + ##__VA_ARGS__); \ + } while(0) + +#define _IPATH_INFO(fmt,...) do { \ + _Pragma_unlikely \ + if(unlikely(infinipath_debug&__IPATH_INFO)) \ + __IPPRT (KERN_INFO __IPIDENT "%s: " fmt,\ + __IPIDENT_ARG __IP_INFO_TAG,##__VA_ARGS__); \ + } while(0) + +#define __IPATH_USER_MODE_SEND unlikely(infinipath_debug & __IPATH_USER_SEND) +#define __IPATH_KERNEL_MODE_SEND unlikely(infinipath_debug & __IPATH_KERNEL_SEND) +#define __IPATH_PKTDBG_ON unlikely(infinipath_debug & __IPATH_PKTDBG) + +#define __IPATH_DBG_WHICH(which,fmt,...) do { \ + _Pragma_unlikely \ + if(unlikely(infinipath_debug&(which))) __IPPRT (KERN_DEBUG __IPIDENT "%s: " fmt,\ + __IPIDENT_ARG __func__,##__VA_ARGS__); \ + } while(0) +#define _IPATH_DBG(fmt,...) __IPATH_DBG_WHICH(__IPATH_DBG,fmt,##__VA_ARGS__) +#define _IPATH_VDBG(fmt,...) __IPATH_DBG_WHICH(__IPATH_VERBDBG,fmt,##__VA_ARGS__) +#define _IPATH_PDBG(fmt,...) __IPATH_DBG_WHICH(__IPATH_PKTDBG,fmt,##__VA_ARGS__) +#define _IPATH_EPDBG(fmt,...) __IPATH_DBG_WHICH(__IPATH_EPKTDBG,fmt,##__VA_ARGS__) +#define _IPATH_PRDBG(fmt,...) __IPATH_DBG_WHICH(__IPATH_PROCDBG,fmt,##__VA_ARGS__) +#define _IPATH_MMDBG(fmt,...) __IPATH_DBG_WHICH(__IPATH_MMDBG,fmt,##__VA_ARGS__) +#define _IPATH_SMADBG(fmt,...) __IPATH_DBG_WHICH(__IPATH_SMADBG,fmt,##__VA_ARGS__) +#define _IPATH_IPATHDBG(fmt,...) __IPATH_DBG_WHICH(__IPATH_IPATHDBG,fmt,##__VA_ARGS__) +#define _IPATH_IPATHWARN(fmt,...) __IPATH_DBG_WHICH(__IPATH_IPATHWARN,fmt,##__VA_ARGS__) +#define _IPATH_IPATHERR(fmt,...) 
__IPATH_DBG_WHICH(__IPATH_IPATHERR ,fmt,##__VA_ARGS__) +#define _IPATH_IPATHPD(fmt,...) __IPATH_DBG_WHICH(__IPATH_IPATHPD ,fmt,##__VA_ARGS__) +#define _IPATH_IPATHTABLE(fmt,...) __IPATH_DBG_WHICH(__IPATH_IPATHTABLE ,fmt,##__VA_ARGS__) + +#else /* ! _IPATH_DEBUGGING */ + +#define _IPATH_UNIT_ERROR(unit,fmt,...) do { \ + __IPPRT (KERN_ERR "%s" fmt, "",##__VA_ARGS__); \ + } while(0) + +#define _IPATH_ERROR(fmt,...) do { \ + __IPPRT (KERN_ERR "%s" fmt, "",##__VA_ARGS__); \ + } while(0) + +#define _IPATH_INFO(fmt,...) +#define _IPATH_DBG(fmt,...) +#define _IPATH_PDBG(fmt,...) +#define _IPATH_EPDBG(fmt,...) +#define _IPATH_PRDBG(fmt,...) +#define _IPATH_VDBG(fmt,...) +#define _IPATH_MMDBG(fmt,...) +#define _IPATH_SMADBG(fmt,...) +#define _IPATH_IPATHDBG(fmt,...) +#define _IPATH_IPATHWARN(fmt,...) +#define _IPATH_IPATHERR(fmt,...) +#define _IPATH_IPATHPD(fmt,...) +#define _IPATH_IPATHTABLE(fmt,...) + +/* + * define all of these even with debugging off, for the few places that do + * if(infinipath_debug&_IPATH_xyzzy), but in a way that will make the + * compiler eliminate the code + */ +#define __IPATH_INFO 0x0 /* generic low verbosity stuff */ +#define __IPATH_DBG 0x0 /* generic debug */ +#define __IPATH_CALL 0x0 /* function call entrance/exit */ +#define __IPATH_TRSAMPLE 0x0 /* generate trace buffer sample entries */ +#define __IPATH_VCALL 0x0 /* function call entrance/exit */ +#define __IPATH_VERBDBG 0x0 /* very verbose debug */ +#define __IPATH_PKTDBG 0x0 /* print packet data */ +#define __IPATH_PROCDBG 0x0 /* print process startup (init)/exit messages */ +/* print mmap/nopage stuff, not using VDBG any more */ +#define __IPATH_MMDBG 0x0 +#define __IPATH_EPKTDBG 0x0 /* print ethernet packet data */ +#define __IPATH_SMADBG 0x0 /* sma packet debug */ +#define __IPATH_IPATHDBG 0x0 /* Ethernet (IPATH) general debug on */ +#define __IPATH_IPATHWARN 0x0 /* Ethernet (IPATH) warnings on */ +#define __IPATH_IPATHERR 0x0 /* Ethernet (IPATH) errors on */ +#define __IPATH_IPATHPD 0x0 /* Ethernet (IPATH) packet dump on */ +#define __IPATH_IPATHTABLE 0x0 /* Ethernet (IPATH) table dump on */ +#define __IPATH_USER_MODE_SEND 0 +#define __IPATH_KERNEL_MODE_SEND 0 +#define __IPATH_PKTDBG_ON 0 + +#endif /* _IPATH_DEBUGGING */ + +#endif /* _IPATH_DEBUG_H */ -- 0.99.9n From rolandd at cisco.com Fri Dec 16 15:48:55 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 16 Dec 2005 15:48:55 -0800 Subject: [openib-general] [PATCH 05/13] [RFC] ipath LLD core, part 2 In-Reply-To: <200512161548.20XjmmxDHjOZRXcz@cisco.com> Message-ID: <200512161548.YvnmQHKTsmmCBp1k@cisco.com> Next part of ipath core driver --- drivers/infiniband/hw/ipath/ipath_driver.c | 2290 ++++++++++++++++++++++++++++ 1 files changed, 2290 insertions(+), 0 deletions(-) fc2b052ff2abadc8547dc1b319883f9c942b0ae4 diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c index df650d6..0dee4ce 100644 --- a/drivers/infiniband/hw/ipath/ipath_driver.c +++ b/drivers/infiniband/hw/ipath/ipath_driver.c @@ -2587,3 +2587,2293 @@ static int ipath_get_unit_counters(struc return -EFAULT; return ipath_get_counters(c.unit, (struct infinipath_counters *)c.data); } + +/* + * ioctls for the control device, which is useful when you don't want + * to open the main device and use up a port.
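+ * A minimal userspace sketch of the intent (the "/dev/ipath_ctrl" + * node name is assumed here for illustration; this patch does not + * define it): + * + * struct infinipath_stats st; + * int fd = open("/dev/ipath_ctrl", O_RDONLY); + * if (fd >= 0) + * ioctl(fd, IPATH_GETSTATS, &st);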
+ */ + +static int ipath_ctrl_ioctl(struct file *fp, unsigned int cmd, unsigned long a) +{ + int ret = 0; + + switch (cmd) { + case IPATH_GETSTATS: /* return driver stats */ + ret = ipath_get_stats((struct infinipath_stats *) a); + break; + case IPATH_GETUNITCOUNTERS: /* return chip counters */ + ret = ipath_get_unit_counters((struct infinipath_getunitcounters *) a); + break; + default: + _IPATH_DBG("%x not a valid CTRL ioctl for infinipath\n", cmd); + ret = -EINVAL; + break; + } + + return ret; +} + +long ipath_ioctl(struct file *fp, unsigned int cmd, unsigned long a) +{ + int ret = 0; + ipath_portdata *pd; + ipath_type unit; + uint32_t tmp, i, nactive = 0; + + if (cmd == IPATH_GETUNITS) { + /* + * Return number of units supported. This is called + * here as this ioctl is needed via both the normal and + * diags interface, and it does not need the device to + * be opened. + */ + return ipath_get_units(); + } + + pd = port_fp(fp); + if (!pd) { + if (IPATH_SMA == (unsigned long)fp->private_data) + /* sma separate; no pd */ + return (long)ipath_sma_ioctl(fp, cmd, a); +#ifdef IPATH_DIAG + else if (IPATH_DIAG == (unsigned long)fp->private_data) + /* diags separate; no pd */ + return (long)ipath_diags_ioctl(fp, cmd, a); +#endif + else if (IPATH_CTRL == (unsigned long)fp->private_data) + /* ctrl separate; no pd */ + return (long)ipath_ctrl_ioctl(fp, cmd, a); + else { + _IPATH_DBG("NULL pd from fp (%p), cmd=%x\n", fp, cmd); + return -ENODEV; /* bad; shouldn't ever happen */ + } + } + + unit = pd->port_unit; + + if ((devdata[unit].ipath_flags & IPATH_PRESENT) + && (cmd == IPATH_GETCOUNTERS || cmd == IPATH_GETSTATS + || cmd == IPATH_READ_EEPROM || cmd == IPATH_WRITE_EEPROM)) { + /* allowed to do these, as long as chip is accessible */ + } else if (!(devdata[unit].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG + ("%s not initialized (flags=0x%x), failing ioctl #%u\n", + ipath_get_unit_name(unit), devdata[unit].ipath_flags, + _IOC_NR(cmd)); + ret = -ENODEV; + } else + if ((devdata[unit]. + ipath_flags & (IPATH_LINKDOWN | IPATH_LINKUNK))) { + _IPATH_DBG("%s link is down, failing ioctl #%u\n", + ipath_get_unit_name(unit), _IOC_NR(cmd)); + ret = -ENETDOWN; + } + + if (ret) + return ret; + + /* normal driver ioctls, not sim-specific */ + switch (cmd) { + case IPATH_USERINIT: + /* real application is starting on a port */ + ret = ipath_do_user_init(pd, (struct ipath_user_info *) a); + break; + case IPATH_BASEINFO: + /* it's done the init, now return the info it needs */ + ret = ipath_get_baseinfo(pd, (struct ipath_base_info *) a); + break; + case IPATH_GETPORT: + /* + * just return the unit:port that we were assigned, + * and the number of active chips. This is is used for + * doing sched_setaffinity() before initialization.
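+ * The word copied back is packed as (nactive << 24) | + * (unit << 16) | unit, so a caller can recover the pieces with, + * e.g., nactive = w >> 24 and unit = (w >> 16) & 0xff.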
+ */ + for (i = 0; i < infinipath_max; i++) + if ((devdata[i].ipath_flags & IPATH_PRESENT) + && devdata[i].ipath_kregbase + && devdata[i].ipath_lid + && !(devdata[i].ipath_flags & + (IPATH_LINKDOWN | IPATH_LINKUNK))) + nactive++; + tmp = (nactive << 24) | (unit << 16) | unit; + if(copy_to_user((void *)a, &tmp, sizeof(unit))) + ret = -EFAULT; + break; + case IPATH_GETLID: + /* get LID for given unit # */ + ret = ipath_layer_get_lid(a); + break; + case IPATH_UPDM_TID: /* update expected TID entries */ + ret = ipath_tid_update(pd, (struct _tidupd *)a); + break; + case IPATH_FREE_TID: /* free expected TID entries */ + ret = ipath_tid_free(pd, (struct _tidupd *)a); + break; + case IPATH_GETCOUNTERS: /* return chip counters */ + ret = ipath_get_counters(unit, (struct infinipath_counters *)a); + break; + case IPATH_GETSTATS: /* return driver stats */ + ret = ipath_get_stats((struct infinipath_stats *) a); + break; + case IPATH_GETUNITCOUNTERS: /* return chip counters */ + ret = ipath_get_unit_counters((struct infinipath_getunitcounters *) a); + break; + case IPATH_SET_PKEY: /* set a partition key */ + ret = ipath_set_partkey(pd, (uint16_t) a); + break; + case IPATH_RCVCTRL: /* error handling to manage the rcvq */ + ret = ipath_manage_rcvq(pd, (uint16_t) a); + break; + case IPATH_WRITE_EEPROM: + /* write the eeprom (for GUID) */ + ret = ipath_wr_eeprom(pd, (struct ipath_eeprom_req *)a); + break; + case IPATH_READ_EEPROM: /* read the eeprom (for GUID) */ + ret = ipath_rd_eeprom(pd->port_unit, + (struct ipath_eeprom_req *)a); + break; + case IPATH_WAIT: + /* + * wait for a receive intr for this port, or PIO avail + */ + ret = ipath_wait_intr(pd, (uint32_t) a); + break; + + default: + _IPATH_DBG("cmd %x (%c,%u) not a valid ioctl\n", cmd, + _IOC_TYPE(cmd), _IOC_NR(cmd)); + ret = -EINVAL; + break; + } + + return ret; +} + +static loff_t ipath_llseek(struct file *fp, loff_t off, int whence) +{ + loff_t ret; + + /* range checking is done where offset is used, not here. */ + down(&fp->f_dentry->d_inode->i_sem); + if (!whence) + ret = fp->f_pos = off; + else if (whence == 1) { + fp->f_pos += off; + ret = fp->f_pos; + } else + ret = -EINVAL; + up(&fp->f_dentry->d_inode->i_sem); + _IPATH_DBG("New offset %llx from seek %llx whence=%d\n", fp->f_pos, off, + whence); + + return ret; +} + +/* + * We use this to have a shared buffer between the kernel and the user + * code for the rcvhdr queue, egr buffers, and the per-port user regs and pio + * buffers in the chip. We have the open and close entries so we can bump + * the ref count and keep the driver from being unloaded while still mapped. + */ + +static struct vm_operations_struct ipath_vmops = { + .nopage = ipath_nopage, +}; + +static int ipath_mmap(struct file *fp, struct vm_area_struct *vm) +{ + int setlen = 0, ret = -EINVAL; + ipath_portdata *pd; + + if (fp->private_data && 255UL < (unsigned long)fp->private_data) { + pd = port_fp(fp); + { + /* + * This is the ipath_do_user_init() code, + * mapping the shared buffers into the user + * process. The address referred to by vm_pgoff + * is the virtual, not physical, address; we only + * do one mmap for each space mapped. + */ + uint64_t pgaddr, ureg; + + pgaddr = vm->vm_pgoff << PAGE_SHIFT; + + /* + * note that ureg does *NOT* have the kregvirt + * as part of it, to be sure that for 32 bit + * programs, we don't end up trying to map + * a > 44 address.
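+ * (Sketch of the intent: ureg below is just uregbase + + * palign * port_port, a chip-relative offset rather than a + * kernel virtual address, which is what keeps the mmap token + * small.)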
Has to match ipath_get_baseinfo() + * code that sets __spi_uregbase + */ + + ureg = devdata[pd->port_unit].ipath_uregbase + + devdata[pd->port_unit].ipath_palign * pd->port_port; + + _IPATH_MMDBG + ("ushare: pgaddr %llx vm_start=%lx, vmlen %lx\n", + pgaddr, vm->vm_start, vm->vm_end - vm->vm_start); + + if (pgaddr == ureg) { + /* it's the real hardware, so io_remap works */ + unsigned long phys; + if ((vm->vm_end - vm->vm_start) > PAGE_SIZE) { + _IPATH_INFO + ("FAIL mmap userreg: reqlen %lx > PAGE\n", + vm->vm_end - vm->vm_start); + ret = -EFAULT; + } else { + phys = + devdata[pd->port_unit]. + ipath_physaddr + ureg; + vm->vm_page_prot = + pgprot_noncached(vm->vm_page_prot); + + vm->vm_flags |= + VM_DONTCOPY | VM_DONTEXPAND | VM_IO + | VM_SHM | VM_LOCKED; + ret = + io_remap_pfn_range(vm, vm->vm_start, phys >> PAGE_SHIFT, + vm->vm_end - vm->vm_start, + vm->vm_page_prot); + } + } else if (pgaddr == pd->port_piobufs) { + /* + * We use io_remap, so there is not a + * nopage handler for this case! + * when we map the PIO buffers, we want + * to map them as writeonly, no read possible. + */ + + unsigned long phys; + if ((vm->vm_end - vm->vm_start) > + (devdata[pd->port_unit].ipath_pbufsport * + devdata[pd->port_unit].ipath_palign)) { + _IPATH_INFO + ("FAIL mmap userreg: reqlen %lx > PAGE\n", + vm->vm_end - vm->vm_start); + ret = -EFAULT; + } else { + phys = + devdata[pd->port_unit]. + ipath_physaddr + pd->port_piobufs; + /* + * Do *NOT* mark this as + * non-cached (PWT bit), or we + * don't get the write combining + * behavior we want on the + * PIO buffers! + * vm->vm_page_prot = pgprot_noncached(vm->vm_page_prot); + */ + +#if defined (pgprot_writecombine) && defined(_PAGE_MA_WC) + /* Enable WC */ + vm->vm_page_prot = + pgprot_writecombine(vm-> + vm_page_prot); +#endif + + if (vm->vm_flags & VM_READ) { + _IPATH_INFO + ("Can't map piobufs as readable (flags=%lx)\n", + vm->vm_flags); + ret = -EPERM; + } else { + /* + * don't allow them to + * later change to readable + * with mprotect + */ + + vm->vm_flags &= ~VM_MAYWRITE; + + vm->vm_flags |= + VM_DONTCOPY | VM_DONTEXPAND + | VM_IO | VM_SHM | + VM_LOCKED; + ret = + io_remap_pfn_range(vm, vm->vm_start, phys >> PAGE_SHIFT, + vm->vm_end - vm->vm_start, + vm->vm_page_prot); + } + } + } else if (pgaddr == (uint64_t) pd->port_rcvegr_phys) { + if (!pd->port_rcvegrbuf_virt) + return -EFAULT; + /* + * page_alloc'ed egr memory, not + * physically contiguous + * *BUT* to work around the 32 bit mmap64 + * only handling 44 bits, we have remapped + * the first page to kernel virtual, so + * we have to do the conversion here to + * get back to the original virtual + * address (not contig pages) so we have + * to mark this for special handling. + */ + + /* + * not egrbufs * egrsize since they are + * no longer virtually contiguous. 
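+ * Illustrative numbers (assumed, not chip constants): with 4 KB + * pages, port_rcvegrbuf_order 2 and 8 chunks, setlen below works + * out to 8 * 4096 * 4 = 128 KB of mappable space.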
+ */ + setlen = pd->port_rcvegrbuf_chunks * PAGE_SIZE * + (1 << pd->port_rcvegrbuf_order); + if ((vm->vm_end - vm->vm_start) > setlen) { + _IPATH_INFO + ("FAIL on egr bufs: reqlen %lx > actual %x\n", + vm->vm_end - vm->vm_start, setlen); + ret = -EFAULT; + } else { + vm->vm_ops = &ipath_vmops; + vm->vm_private_data = + (void *)(3 | (uint64_t) pd); + if (vm->vm_flags & VM_WRITE) { + _IPATH_INFO + ("Can't map eager buffers as writable (flags=%lx)\n", + vm->vm_flags); + ret = -EPERM; + } else { + /* + * don't allow them to + * later change to writeable + * with mprotect + */ + + vm->vm_flags &= ~VM_MAYWRITE; + _IPATH_MMDBG + ("egrbufs, set private to %p, not %llx\n", + vm->vm_private_data, + pgaddr); + ret = 0; + } + } + } else if (pgaddr == (uint64_t) pd->port_rcvhdrq_phys) { + /* + * kmalloc'ed memory, physically + * contiguous; this is from + * spi_rcvhdr_base; we allow user to + * map read-write so they can write + * hdrq entries to allow protocol code + * to directly poll whether a hdrq entry + * has been written. + */ + setlen = + round_up(devdata[pd->port_unit]. + ipath_rcvhdrcnt * + devdata[pd->port_unit]. + ipath_rcvhdrentsize * + sizeof(uint32_t), PAGE_SIZE); + if ((vm->vm_end - vm->vm_start) > setlen) { + _IPATH_INFO + ("FAIL on rcvhdrq: reqlen %lx > actual %x\n", + vm->vm_end - vm->vm_start, setlen); + ret = -EFAULT; + } else { + vm->vm_ops = &ipath_vmops; + vm->vm_private_data = + (void *)(pgaddr | 1); + ret = 0; + } + } + /* + * when we map the PIO bufferavail registers, + * we want to map them as readonly, no write + * possible. + */ + else if (pgaddr == + devdata[pd->port_unit]. + ipath_pioavailregs_phys) { + /* + * kmalloc'ed memory, physically + * contiguous, one page only, readonly + */ + setlen = PAGE_SIZE; + if ((vm->vm_end - vm->vm_start) > setlen) { + _IPATH_INFO + ("FAIL on pioavailregs_dma: reqlen %lx > actual %x\n", + vm->vm_end - vm->vm_start, setlen); + ret = -EFAULT; + } else if (vm->vm_flags & VM_WRITE) { + _IPATH_INFO + ("Can't map pioavailregs as writable (flags=%lx)\n", + vm->vm_flags); + ret = -EPERM; + } else { + /* + * don't allow them to later + * change with mprotect + */ + vm->vm_flags &= ~VM_MAYWRITE; + vm->vm_ops = &ipath_vmops; + vm->vm_private_data = + (void *)(pgaddr | 2); + ret = 0; + } + } + if (!ret && setlen) { + /* keep page(s) from being swapped, etc. */ + vm->vm_flags |= + VM_DONTEXPAND | VM_DONTCOPY | VM_RESERVED | + VM_IO | VM_SHM; + } else { + /* failure, or io_remap case */ + vm->vm_private_data = NULL; + if (ret) + _IPATH_INFO + ("Failure %d, setlen %d, on addr %lx, off %lx\n", + ret, setlen, vm->vm_start, + vm->vm_pgoff); + } + } + } else /* something very wrong */ + _IPATH_INFO("fp_private wasn't set, no mmaping\n"); + + return ret; +} + +/* page fault handler. For each page that is first faulted in from the + * mmap'ed shared address buffer, this routine is called. + * It's always for a single page. + * We use the low bits of the private_data field to tell us which case + * we are dealing with.
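+ * The encoding used below: 1 = rcvhdrq and 2 = PIO buffer avail + * registers (both keep the physical address in the upper bits), + * and 3 = eager buffers, where the upper bits instead hold the + * ipath_portdata pointer.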
+ */ + +static struct page *ipath_nopage(struct vm_area_struct *vma, unsigned long addr, + int *type) +{ + unsigned long avirt, /* the original [kv]malloc virtual address */ + paddr, /* physical address */ + off; /* calculated page offset */ + uint32_t which, chunk; + void *vaddr = NULL; + ipath_portdata *pd; + struct page *vpage = NOPAGE_SIGBUS; + + if (!(avirt = (unsigned long)vma->vm_private_data)) { + _IPATH_DBG("NULL private_data, vm_pgoff %lx\n", vma->vm_pgoff); + which = 0; /* quiet incorrect gcc warning */ + goto done; + } + which = avirt & 3; + avirt &= ~3ULL; + + if (addr > vma->vm_end) { + _IPATH_DBG("trying to fault in addr %lx past end\n", addr); + goto done; + } + + /* + * most of our memory is vmalloc'ed, but rcvhdr Q is physically + * contiguous, either from kmalloc or alloc_pages() + * pgoff is virtual. + */ + switch (which) { + case 1: /* rcvhdrq_phys */ + /* should always be 0 */ + off = vma->vm_pgoff - (avirt >> PAGE_SHIFT); + paddr = addr - vma->vm_start + (off << PAGE_SHIFT) + avirt; + _IPATH_MMDBG("hdrq %lx (u=%lx)\n", paddr, addr); + vpage = pfn_to_page(paddr >> PAGE_SHIFT); + break; + case 2: /* PIO buffer avail regs */ + /* should always be 0 */ + off = vma->vm_pgoff - (avirt >> PAGE_SHIFT); + paddr = (addr - vma->vm_start + (off << PAGE_SHIFT) + avirt); + _IPATH_MMDBG("pioav %lx\n", paddr); + vpage = pfn_to_page(paddr >> PAGE_SHIFT); + break; + case 3: + /* + * rcvegrbufs; page_alloc()'ed like rcvhdrq, but we + * have to pick out which page_alloc()'ed chunk it is. + */ + pd = (ipath_portdata *) avirt; + /* this should always be 0 */ + off = + vma->vm_pgoff - + ((unsigned long)pd->port_rcvegr_phys >> PAGE_SHIFT); + off = (addr - vma->vm_start + (off << PAGE_SHIFT)); + + chunk = off / (PAGE_SIZE * (1 << pd->port_rcvegrbuf_order)); + if (chunk > pd->port_rcvegrbuf_chunks) + _IPATH_DBG("Bad egrbuf chunk %u (max %u); off = %lx\n", + chunk, pd->port_rcvegrbuf_chunks, off); + vaddr = pd->port_rcvegrbuf_virt[chunk] + + off % (PAGE_SIZE * (1 << pd->port_rcvegrbuf_order)); + paddr = virt_to_phys(vaddr); + vpage = pfn_to_page(paddr >> PAGE_SHIFT); + _IPATH_MMDBG("egrb %p,%lx\n", vaddr, paddr); + break; + default: + _IPATH_DBG + ("trying to fault in mmap addr %lx (avirt %lx) that isn't known (case %u)\n", + addr, avirt, which); + } + +done: + if (vpage != NOPAGE_SIGBUS && vpage != NOPAGE_OOM) { + if (which == 2) + /* + * media/video/video-buf.c doesn't do get_page() for + * buffer from alloc_page(). Hmmm. + * + * keep it from being swapped, complaints if + * process exits before we [vf]free it, etc, + * and keep shared page counts correct, etc. + */ + get_page(vpage); + mark_page_accessed(vpage); + if (type) + *type = VM_FAULT_MINOR; + } else + _IPATH_DBG("faultin of addr %lx vaddr %p avirt %lx failed\n", + addr, vaddr, avirt); + + return vpage; +} + +/* this is separate to allow for better optimization of ipath_intr() */ + +static void ipath_bad_intr(const ipath_type t, uint32_t * unexpectp) +{ + ipath_devdata *dd = &devdata[t]; + + /* + * sometimes happen during driver init and unload, don't want + * to process any interrupts at that point + */ + + /* this is just a bandaid, not a fix, if something goes badly wrong */ + if (++*unexpectp > 100) { + if (++*unexpectp > 105) { + /* + * ok, we must be taking somebody else's interrupts, + * due to a messed up mptable and/or PIRQ table, so + * unregister the interrupt. We've seen this + * during linuxbios development work, and it + * may happen in the future again. 
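+ * (The thresholds below are heuristics: past roughly 100 stray + * interrupts the chip's interrupt mask is cleared, and past + * roughly 105 the handler also unregisters itself, so broken IRQ + * routing can't wedge the machine.)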
+ */ + if (dd->pcidev && dd->pcidev->irq) { + _IPATH_UNIT_ERROR(t, + "Now %u unexpected interrupts, unregistering interrupt handler\n", + *unexpectp); + _IPATH_DBG("free_irq of irq %x\n", + dd->pcidev->irq); + free_irq(dd->pcidev->irq, dd); + dd->pcidev->irq = 0; + } + } + if (ipath_kget_kreg32(t, kr_intmask)) { + _IPATH_UNIT_ERROR(t, + "%u unexpected interrupts, disabling interrupts completely\n", + *unexpectp); + /* disable all interrupts, something is very wrong */ + ipath_kput_kreg(t, kr_intmask, 0ULL); + } + } else if (*unexpectp > 1) + _IPATH_DBG + ("Interrupt when not ready, should not happen, ignoring\n"); +} + +/* separate routine, for better optimization of ipath_intr() */ + +static void ipath_bad_regread(const ipath_type t) +{ + static int allbits; + ipath_devdata *dd = &devdata[t]; + + /* + * We print the message and disable interrupts, in hope of + * having a better chance of debugging the problem. + */ + _IPATH_UNIT_ERROR(t, + "Read of interrupt status failed (all bits set)\n"); + if (allbits++) { + /* disable all interrupts, something is very wrong */ + ipath_kput_kreg(t, kr_intmask, 0ULL); + if (allbits == 2) { + _IPATH_UNIT_ERROR(t, + "Still bad interrupt status, unregistering interrupt\n"); + free_irq(dd->pcidev->irq, dd); + dd->pcidev->irq = 0; + } else if (allbits > 2) { + if ((allbits % 10000) == 0) + printk("."); + } else + _IPATH_UNIT_ERROR(t, + "Disabling interrupts, multiple errors\n"); + } +} + +static irqreturn_t ipath_intr(int irq, void *data, struct pt_regs *regs) +{ + ipath_devdata *dd = data; + const ipath_type t = IPATH_UNIT(dd); + uint32_t istat = ipath_kget_kreg32(t, kr_intstatus); + uint64_t estat = 0; + static unsigned unexpected = 0; + + if (unlikely(!istat)) { + ipath_stats.sps_nullintr++; + /* not our interrupt, or already handled */ + return IRQ_NONE; + } + if (unlikely(istat == ~0)) { + ipath_bad_regread(t); + /* don't know if it was our interrupt or not */ + return IRQ_NONE; + } + + ipath_stats.sps_ints++; + + /* + * this needs to be flags&initted, not statusp, so we keep + * taking interrupts even after link goes down, etc. + * Also, we *must* clear the interrupt at some point, or we won't + * take it again, which can be real bad for errors, etc... + */ + + if (!(dd->ipath_flags & IPATH_INITTED)) { + ipath_bad_intr(t, &unexpected); + return IRQ_NONE; + } + if (unexpected) + unexpected = 0; + + if (istat & ~infinipath_i_bitsextant) + _IPATH_UNIT_ERROR(t, + "interrupt with unknown interrupts %x set\n", + istat & (uint32_t) ~ infinipath_i_bitsextant); + + if (istat & INFINIPATH_I_ERROR) { + ipath_stats.sps_errints++; + estat = ipath_kget_kreg64(t, kr_errorstatus); + if (!estat) + _IPATH_INFO + ("error interrupt (%x), but no error bits set!\n", + istat); + else if (estat == ~0ULL) + /* + * should we try clearing all, or hope next read + * works? + */ + _IPATH_UNIT_ERROR(t, + "Read of error status failed (all bits set); ignoring\n"); + else + ipath_handle_errors(t, estat); + } + + if (istat & INFINIPATH_I_GPIO) { + /* Clear GPIO status bit 2 */ + ipath_kput_kreg(t, kr_gpio_clear, (uint64_t)(1 << 2)); + + /* + * Packets are available in the port 0 receive queue. + * Eventually this needs to be generalized to check + * IPATH_GPIO_INTR, and the specific GPIO bit, when + * GPIO interrupts start being used for other things. + * We skip that now to improve performance. 
+ */ + ipath_kreceive(t); + } + + /* + * clear the ones we will deal with on this round + * We clear it early, mostly for receive interrupts, so we + * know the chip will have seen this by the time we process + * the queue, and will re-interrupt if necessary. The processor + * itself won't take the interrupt again until we return. + */ + ipath_kput_kreg(t, kr_intclear, istat); + + if (istat & INFINIPATH_I_SPIOBUFAVAIL) { + atomic_clear_mask(INFINIPATH_S_PIOINTBUFAVAIL, + &dd->ipath_sendctrl); + ipath_kput_kreg(t, kr_sendctrl, dd->ipath_sendctrl); + + if (dd->ipath_portpiowait) { + uint32_t i; + /* + * start from port 1, since for now port 0 is + * never using wait_event for PIO + */ + for (i = 1; + dd->ipath_portpiowait && i < dd->ipath_cfgports; + i++) { + if (dd->ipath_pd[i] + && dd->ipath_portpiowait & (1U << i)) { + atomic_clear_mask(1U << i, + &dd-> + ipath_portpiowait); + if (dd->ipath_pd[i]-> + port_flag & IPATH_PORT_WAITING_PIO) + { + dd->ipath_pd[i]->port_flag &= + ~IPATH_PORT_WAITING_PIO; + wake_up_interruptible(&dd-> + ipath_pd + [i]-> + port_wait); + } + } + } + } + + if (dd->ipath_layer.l_intr) { + if (dd->ipath_layer.l_intr(t, + IPATH_LAYER_INT_SEND_CONTINUE)) { + atomic_set_mask(INFINIPATH_S_PIOINTBUFAVAIL, + &dd->ipath_sendctrl); + ipath_kput_kreg(t, kr_sendctrl, + dd->ipath_sendctrl); + } + } + + if (dd->verbs_layer.l_piobufavail) { + if (!dd->verbs_layer.l_piobufavail(t)) { + atomic_set_mask(INFINIPATH_S_PIOINTBUFAVAIL, + &dd->ipath_sendctrl); + ipath_kput_kreg(t, kr_sendctrl, + dd->ipath_sendctrl); + } + } + } + + /* + * we check for both transition from empty to non-empty, and urgent + * packets (those with the interrupt bit set in the header) + */ + + if (istat & ((infinipath_i_rcvavail_mask << INFINIPATH_I_RCVAVAIL_SHIFT) + | (infinipath_i_rcvurg_mask << INFINIPATH_I_RCVURG_SHIFT))) { + uint64_t portr; + int i; + uint32_t rcvdint = 0; + + portr = ((istat >> INFINIPATH_I_RCVAVAIL_SHIFT) & + infinipath_i_rcvavail_mask) + | ((istat >> INFINIPATH_I_RCVURG_SHIFT) & + infinipath_i_rcvurg_mask); + for (i = 0; i < dd->ipath_cfgports; i++) { + if (portr & (1 << i) && dd->ipath_pd[i]) { + if (i == 0) + ipath_kreceive(t); + else if (dd->ipath_pd[i]-> + port_flag & IPATH_PORT_WAITING_RCV) { + atomic_clear_mask + (IPATH_PORT_WAITING_RCV, + &dd->ipath_pd[i]->port_flag); + wake_up_interruptible(&dd->ipath_pd[i]-> + port_wait); + rcvdint |= 1U << i; + } + } + } + if (rcvdint) { + /* + * only want to take one interrupt, so turn off + * the rcv interrupt for all the ports that we + * did the wakeup on (but never for kernel port) + */ + atomic_clear_mask(rcvdint << + INFINIPATH_R_INTRAVAIL_SHIFT, + &dd->ipath_rcvctrl); + ipath_kput_kreg(t, kr_rcvctrl, dd->ipath_rcvctrl); + } + } + + return IRQ_HANDLED; +} + +static void ipath_decode_err(char *buf, size_t blen, uint64_t err) +{ + *buf = '\0'; + if (err & INFINIPATH_E_RHDRLEN) + strlcat(buf, "rhdrlen ", blen); + if (err & INFINIPATH_E_RBADTID) + strlcat(buf, "rbadtid ", blen); + if (err & INFINIPATH_E_RBADVERSION) + strlcat(buf, "rbadversion ", blen); + if (err & INFINIPATH_E_RHDR) + strlcat(buf, "rhdr ", blen); + if (err & INFINIPATH_E_RLONGPKTLEN) + strlcat(buf, "rlongpktlen ", blen); + if (err & INFINIPATH_E_RSHORTPKTLEN) + strlcat(buf, "rshortpktlen ", blen); + if (err & INFINIPATH_E_RMAXPKTLEN) + strlcat(buf, "rmaxpktlen ", blen); + if (err & INFINIPATH_E_RMINPKTLEN) + strlcat(buf, "rminpktlen ", blen); + if (err & INFINIPATH_E_RFORMATERR) + strlcat(buf, "rformaterr ", blen); + if (err & INFINIPATH_E_RUNSUPVL) + strlcat(buf, 
"runsupvl ", blen); + if (err & INFINIPATH_E_RUNEXPCHAR) + strlcat(buf, "runexpchar ", blen); + if (err & INFINIPATH_E_RIBFLOW) + strlcat(buf, "ribflow ", blen); + if (err & INFINIPATH_E_REBP) + strlcat(buf, "EBP ", blen); + if (err & INFINIPATH_E_SUNDERRUN) + strlcat(buf, "sunderrun ", blen); + if (err & INFINIPATH_E_SPIOARMLAUNCH) + strlcat(buf, "spioarmlaunch ", blen); + if (err & INFINIPATH_E_SUNEXPERRPKTNUM) + strlcat(buf, "sunexperrpktnum ", blen); + if (err & INFINIPATH_E_SDROPPEDDATAPKT) + strlcat(buf, "sdroppeddatapkt ", blen); + if (err & INFINIPATH_E_SDROPPEDSMPPKT) + strlcat(buf, "sdroppedsmppkt ", blen); + if (err & INFINIPATH_E_SMAXPKTLEN) + strlcat(buf, "smaxpktlen ", blen); + if (err & INFINIPATH_E_SMINPKTLEN) + strlcat(buf, "sminpktlen ", blen); + if (err & INFINIPATH_E_SUNSUPVL) + strlcat(buf, "sunsupVL ", blen); + if (err & INFINIPATH_E_SPKTLEN) + strlcat(buf, "spktlen ", blen); + if (err & INFINIPATH_E_INVALIDADDR) + strlcat(buf, "invalidaddr ", blen); + if (err & INFINIPATH_E_RICRC) + strlcat(buf, "CRC ", blen); + if (err & INFINIPATH_E_RVCRC) + strlcat(buf, "VCRC ", blen); + if (err & INFINIPATH_E_RRCVEGRFULL) + strlcat(buf, "rcvegrfull ", blen); + if (err & INFINIPATH_E_RRCVHDRFULL) + strlcat(buf, "rcvhdrfull ", blen); + if (err & INFINIPATH_E_IBSTATUSCHANGED) + strlcat(buf, "ibcstatuschg ", blen); + if (err & INFINIPATH_E_RIBLOSTLINK) + strlcat(buf, "riblostlink ", blen); + if (err & INFINIPATH_E_HARDWARE) + strlcat(buf, "hardware ", blen); + if (err & INFINIPATH_E_RESET) + strlcat(buf, "reset ", blen); +} + +/* decode RHF errors; only used one place now, may want more later */ +static void get_rhf_errstring(uint32_t err, char *msg, size_t len) +{ + /* if no errors, and so don't need to check what's first */ + *msg = '\0'; + + if (err & INFINIPATH_RHF_H_ICRCERR) + strlcat(msg, "icrcerr ", len); + if (err & INFINIPATH_RHF_H_VCRCERR) + strlcat(msg, "vcrcerr ", len); + if (err & INFINIPATH_RHF_H_PARITYERR) + strlcat(msg, "parityerr ", len); + if (err & INFINIPATH_RHF_H_LENERR) + strlcat(msg, "lenerr ", len); + if (err & INFINIPATH_RHF_H_MTUERR) + strlcat(msg, "mtuerr ", len); + if (err & INFINIPATH_RHF_H_IHDRERR) + /* infinipath hdr checksum error */ + strlcat(msg, "ipathhdrerr ", len); + if (err & INFINIPATH_RHF_H_TIDERR) + strlcat(msg, "tiderr ", len); + if (err & INFINIPATH_RHF_H_MKERR) + /* bad port, offset, etc. 
*/ + strlcat(msg, "invalid ipathhdr ", len); + if (err & INFINIPATH_RHF_H_IBERR) + strlcat(msg, "iberr ", len); + if (err & INFINIPATH_RHF_L_SWA) + strlcat(msg, "swA ", len); + if (err & INFINIPATH_RHF_L_SWB) + strlcat(msg, "swB ", len); +} + +static void ipath_handle_errors(const ipath_type t, uint64_t errs) +{ + char msg[512]; + uint32_t piobcnt; + uint64_t sbuf[4], ignore_this_time = 0; + int i; + int chkerrpkts = 0, noprint = 0; + cycles_t nc; + static cycles_t nextmsg_time; + static unsigned nmsgs, supp_msgs; + ipath_devdata *dd = &devdata[t]; + +#define E_SUM_PKTERRS (INFINIPATH_E_RHDRLEN | INFINIPATH_E_RBADTID \ + | INFINIPATH_E_RBADVERSION \ + | INFINIPATH_E_RHDR | INFINIPATH_E_RLONGPKTLEN | INFINIPATH_E_RSHORTPKTLEN \ + | INFINIPATH_E_RMAXPKTLEN | INFINIPATH_E_RMINPKTLEN \ + | INFINIPATH_E_RFORMATERR | INFINIPATH_E_RUNSUPVL | INFINIPATH_E_RUNEXPCHAR \ + | INFINIPATH_E_REBP) + +#define E_SUM_ERRS ( INFINIPATH_E_SPIOARMLAUNCH \ + | INFINIPATH_E_SUNEXPERRPKTNUM | INFINIPATH_E_SDROPPEDDATAPKT \ + | INFINIPATH_E_SDROPPEDSMPPKT | INFINIPATH_E_SMAXPKTLEN \ + | INFINIPATH_E_SUNSUPVL | INFINIPATH_E_SMINPKTLEN | INFINIPATH_E_SPKTLEN \ + | INFINIPATH_E_INVALIDADDR) + + /* + * throttle back "fast" messages to no more than 10 per 5 seconds + * (1.4-2GHz clock). This isn't perfect, but it's a reasonable + * heuristic + * If we get more than 10, give a 5x longer delay + */ + nc = get_cycles(); + if (nmsgs > 10) { + if (nc < nextmsg_time) { + noprint = 1; + if (!supp_msgs++) + nextmsg_time = nc + 50000000000ULL; + } else if (supp_msgs) { + /* + * Print the message unless it's ibc status + * change only, which happens so often we never + * want to count it. + */ + if (dd->ipath_lasterror & ~INFINIPATH_E_IBSTATUSCHANGED) { + ipath_decode_err(msg, sizeof msg, + dd-> + ipath_lasterror & + ~INFINIPATH_E_IBSTATUSCHANGED); + if (dd-> + ipath_lasterror & ~(INFINIPATH_E_RRCVEGRFULL + | + INFINIPATH_E_RRCVHDRFULL)) + _IPATH_UNIT_ERROR(t, + "Suppressed %u messages for fast-repeating errors (%s) (%llx)\n", + supp_msgs, msg, + dd->ipath_lasterror); + else { + /* + * rcvegrfull and rcvhdrqfull are + * "normal", for some types of + * processes (mostly benchmarks) + * that send huge numbers of + * messages, while not processing + * them. So only complain about + * these at debug level. 
+ */ + _IPATH_DBG + ("Suppressed %u messages for %s\n", + supp_msgs, msg); + } + } + supp_msgs = 0; + nmsgs = 0; + } + } else if (!nmsgs++ || nc > nextmsg_time) /* start timer */ + nextmsg_time = nc + 10000000000ULL; + + /* + * don't report errors that are masked (includes those always + * ignored) + */ + errs &= ~dd->ipath_maskederrs; + + /* do these first, they are most important */ + if (errs & INFINIPATH_E_HARDWARE) { + /* reuse same msg buf */ + ipath_handle_hwerrors(t, msg, sizeof msg); + } + + if (!noprint && (errs & ~infinipath_e_bitsextant)) + _IPATH_UNIT_ERROR(t, + "error interrupt with unknown errors %llx set\n", + errs & ~infinipath_e_bitsextant); + + if (errs & E_SUM_ERRS) { + /* if possible that sendbuffererror could be valid */ + piobcnt = dd->ipath_piobcnt; + /* read these before writing errorclear */ + sbuf[0] = ipath_kget_kreg64(t, kr_sendbuffererror); + sbuf[1] = ipath_kget_kreg64(t, kr_sendbuffererror + 1); + if (piobcnt > 128) { + sbuf[2] = ipath_kget_kreg64(t, kr_sendbuffererror + 2); + sbuf[3] = ipath_kget_kreg64(t, kr_sendbuffererror + 3); + } + + if (sbuf[0] || sbuf[1] + || (piobcnt > 128 && (sbuf[2] || sbuf[3]))) { + _IPATH_PDBG("SendbufErrs %llx %llx ", sbuf[0], sbuf[1]); + if (infinipath_debug & __IPATH_PKTDBG && piobcnt > 128) + printk("%llx %llx ", sbuf[2], sbuf[3]); + for (i = 0; i < piobcnt; i++) { + if (test_bit(i, sbuf)) { + uint32_t sendctrl; + if (infinipath_debug & __IPATH_PKTDBG) + printk("%u ", i); + sendctrl = + dd-> + ipath_sendctrl | INFINIPATH_S_DISARM + | (i << + INFINIPATH_S_DISARMPIOBUF_SHIFT); + ipath_kput_kreg(t, kr_sendctrl, + sendctrl); + } + } + if (infinipath_debug & __IPATH_PKTDBG) + printk("\n"); + } + if ((errs & + (INFINIPATH_E_SDROPPEDDATAPKT | INFINIPATH_E_SDROPPEDSMPPKT + | INFINIPATH_E_SMINPKTLEN)) + && !(dd->ipath_flags & IPATH_LINKACTIVE)) { + /* + * This can happen when SMA is trying to bring + * the link up, but the IB link changes state + * at the "wrong" time. The IB logic then + * complains that the packet isn't valid. + * We don't want to confuse people, so we just + * don't print them, except at debug + */ + _IPATH_DBG + ("Ignoring pktsend errors %llx, because not yet active\n", + errs); + ignore_this_time |= + INFINIPATH_E_SDROPPEDDATAPKT | + INFINIPATH_E_SDROPPEDSMPPKT | + INFINIPATH_E_SMINPKTLEN; + } + } + + if (supp_msgs == 250000) { + /* + * It's not entirely reasonable assuming that the errors + * set in the last clear period are all responsible for + * the problem, but the alternative is to assume it's the only + * ones on this particular interrupt, which also isn't great + */ + dd->ipath_maskederrs |= dd->ipath_lasterror | errs; + ipath_kput_kreg(t, kr_errormask, ~dd->ipath_maskederrs); + ipath_decode_err(msg, sizeof msg, + (dd->ipath_maskederrs & ~dd-> + ipath_ignorederrs)); + + if ((dd->ipath_maskederrs & ~dd->ipath_ignorederrs) + & ~(INFINIPATH_E_RRCVEGRFULL | INFINIPATH_E_RRCVHDRFULL)) + _IPATH_UNIT_ERROR(t, + "Disabling error(s) %llx because occurring too frequently (%s)\n", + (dd->ipath_maskederrs & ~dd-> + ipath_ignorederrs), msg); + else { + /* + * rcvegrfull and rcvhdrqfull are "normal", + * for some types of processes (mostly benchmarks) + * that send huge numbers of messages, while not + * processing them. So only complain about + * these at debug level. + */ + _IPATH_DBG + ("Disabling frequent queue full errors (%s)\n", + msg); + } + + /* + * re-enable the masked errors after around 3 minutes, + * in ipath_get_faststats().
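+ * (At the ~2 GHz clock assumed earlier, the 400000000000 cycle + * window used below comes to roughly 200 seconds.)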
If we have a series of + * fast repeating but different errors, the interval will keep + * stretching out, but that's OK, as that's pretty catastrophic. + */ + dd->ipath_unmasktime = nc + 400000000000ULL; + } + + ipath_kput_kreg(t, kr_errorclear, errs); + if (ignore_this_time) + errs &= ~ignore_this_time; + if (errs & ~dd->ipath_lasterror) { + errs &= ~dd->ipath_lasterror; + /* never suppress duplicate hwerrors or ibstatuschange */ + dd->ipath_lasterror |= errs & + ~(INFINIPATH_E_HARDWARE | INFINIPATH_E_IBSTATUSCHANGED); + } + if (!errs) + return; + + if (!noprint) + /* the ones we mask off are handled specially below or above */ + ipath_decode_err(msg, sizeof msg, + errs & ~(INFINIPATH_E_IBSTATUSCHANGED | + INFINIPATH_E_RRCVEGRFULL | + INFINIPATH_E_RRCVHDRFULL | + INFINIPATH_E_HARDWARE)); + else + /* so we don't need if(!noprint) at strlcat's below */ + *msg = 0; + + if (errs & E_SUM_PKTERRS) { + ipath_stats.sps_pkterrs++; + chkerrpkts = 1; + } + if (errs & E_SUM_ERRS) + ipath_stats.sps_errs++; + + if (errs & (INFINIPATH_E_RICRC | INFINIPATH_E_RVCRC)) { + ipath_stats.sps_crcerrs++; + chkerrpkts = 1; + } + + /* + * We don't want to print these two as they happen, or we can make + * the situation even worse, because it takes so long to print messages. + * to serial consoles. kernel ports get printed from fast_stats, no + * more than every 5 seconds, user ports get printed on close + */ + if (errs & INFINIPATH_E_RRCVHDRFULL) { + int any; + uint32_t hd, tl; + ipath_stats.sps_hdrqfull++; + for (any = i = 0; i < dd->ipath_cfgports; i++) { + if (i == 0) { + hd = dd->ipath_port0head; + tl = *dd->ipath_hdrqtailptr; + } else if (dd->ipath_pd[i] && + dd->ipath_pd[i]->port_rcvhdrtail_kvaddr) { + /* + * don't report same point multiple times, + * except kernel + */ + tl = (uint32_t) * + dd->ipath_pd[i]->port_rcvhdrtail_kvaddr; + if (tl == dd->ipath_lastrcvhdrqtails[i]) + continue; + hd = ipath_kget_ureg32(t, ur_rcvhdrhead, i); + } else + continue; + if (hd == (tl + 1) || (!hd && tl == dd->ipath_hdrqlast)) { + dd->ipath_lastrcvhdrqtails[i] = tl; + dd->ipath_pd[i]->port_hdrqfull++; + if (i == 0) + chkerrpkts = 1; + } + } + } + if (errs & INFINIPATH_E_RRCVEGRFULL) { + /* + * since this is of less importance and not likely to + * happen without also getting hdrfull, only count + * occurrences; don't check each port (or even the kernel + * vs user) + */ + ipath_stats.sps_etidfull++; + if (dd->ipath_port0head != *dd->ipath_hdrqtailptr) + chkerrpkts = 1; + } + + /* + * do this before IBSTATUSCHANGED, in case both bits set in a single + * interrupt; we want the STATUSCHANGE to "win", so we do our + * internal copy of state machine correctly + */ + if (errs & INFINIPATH_E_RIBLOSTLINK) { + /* force through block below */ + errs |= INFINIPATH_E_IBSTATUSCHANGED; + ipath_stats.sps_iblink++; + dd->ipath_flags |= IPATH_LINKDOWN; + dd->ipath_flags &= ~(IPATH_LINKUNK | IPATH_LINKINIT + | IPATH_LINKARMED | IPATH_LINKACTIVE); + if (!noprint) + _IPATH_DBG("Lost link, link now down (%s)\n", + ipath_ibcstatus_str[ipath_kget_kreg64 + (t, + kr_ibcstatus) & 0xf]); + } + + if ((errs & INFINIPATH_E_IBSTATUSCHANGED) && (!ipath_diags_enabled)) { + uint64_t val; + uint32_t ltstate; + + val = ipath_kget_kreg64(t, kr_ibcstatus); + ltstate = val & 0xff; + if(ltstate == 0x11 || ltstate == 0x21 || ltstate == 0x31) + _IPATH_DBG("Link state changed unit %u to 0x%x, last was 0x%llx\n", + t, ltstate, dd->ipath_lastibcstat); + else { + ltstate = dd->ipath_lastibcstat & 0xff; + if(ltstate == 0x11 || ltstate == 0x21 || ltstate == 0x31) + 
_IPATH_DBG("Link state unit %u changed to down state 0x%llx, last was 0x%llx\n", + t, val, dd->ipath_lastibcstat); + else + _IPATH_VDBG("Link state unit %u changed to 0x%llx from one of down states\n", + t, val); + } + ltstate = (val >> INFINIPATH_IBCS_LINKTRAININGSTATE_SHIFT) & + INFINIPATH_IBCS_LINKTRAININGSTATE_MASK; + + if (ltstate == 2 || ltstate == 3) { + uint32_t last_ltstate; + + /* + * ignore cycling back and forth from states 2 to 3 + * while waiting for other end of link to come up + * except that if it keeps happening, we switch between + * linkinitstate SLEEP and POLL. While we cycle + * back and forth between them, we aren't seeing + * any other device, either no cable plugged in, + * other device powered off, other device is + * switch that hasn't yet polled us, etc. + */ + last_ltstate = (dd->ipath_lastibcstat >> + INFINIPATH_IBCS_LINKTRAININGSTATE_SHIFT) + & INFINIPATH_IBCS_LINKTRAININGSTATE_MASK; + if (last_ltstate == 2 || last_ltstate == 3) { + if (++dd->ipath_ibpollcnt > 4) { + uint64_t ibc; + dd->ipath_flags |= + IPATH_LINK_SLEEPING | IPATH_NOCABLE; + *dd->ipath_statusp |= + IPATH_STATUS_IB_NOCABLE; + _IPATH_VDBG + ("linkinitcmd POLL, move to SLEEP\n"); + ibc = dd->ipath_ibcctrl; + ibc |= INFINIPATH_IBCC_LINKINITCMD_SLEEP + << + INFINIPATH_IBCC_LINKINITCMD_SHIFT; + /* + * don't put linkinitcmd in + * ipath_ibcctrl, want that to + * stay a NOP + */ + ipath_kput_kreg(t, kr_ibcctrl, ibc); + dd->ipath_ibpollcnt = 0; + } + goto skip_ibchange; + } + } + /* some state other than 2 or 3 */ + dd->ipath_ibpollcnt = 0; + ipath_stats.sps_iblink++; + /* + * Note: We try to match the Mellanox HCA LED behavior + * as best we can. That changed around Oct 2003. + * Green indicates link state (something is plugged in, + * and we can train). Amber indicates the link is + * logically up (ACTIVE). Mellanox further blinks the + * amber LED to indicate data packet activity, but we + * have no hardware support for that, so it would require + * waking up every 10-20 msecs and checking the counters + * on the chip, and then turning the LED off if + * appropriate. That's visible overhead, so not something + * we will do. 
+ */ + if (ltstate != 1 || ((dd->ipath_lastibcstat & 0x30) == 0x30 && + (val & 0x30) != 0x30)) { + dd->ipath_flags |= IPATH_LINKDOWN; + dd->ipath_flags &= ~(IPATH_LINKUNK | IPATH_LINKINIT + | IPATH_LINKACTIVE | + IPATH_LINKARMED); + *dd->ipath_statusp &= ~IPATH_STATUS_IB_READY; + if (!noprint) { + if ((dd->ipath_lastibcstat & 0x30) == 0x30) + /* if from up to down be more vocal */ + _IPATH_DBG("Link unit %u is now down (%s)\n", + t, ipath_ibcstatus_str + [ltstate]); + else + _IPATH_VDBG("Link unit %u is down (%s)\n", + t, ipath_ibcstatus_str + [ltstate]); + } + + if (val & 0x30) { + /* leave just green on, 0x11 and 0x21 */ + dd->ipath_extctrl &= + ~INFINIPATH_EXTC_LEDPRIPORTYELLOWON; + dd->ipath_extctrl |= + INFINIPATH_EXTC_LEDPRIPORTGREENON; + } else /* not up at all, so turn the leds off */ + dd->ipath_extctrl &= + ~(INFINIPATH_EXTC_LEDPRIPORTGREENON | + INFINIPATH_EXTC_LEDPRIPORTYELLOWON); + ipath_kput_kreg(t, kr_extctrl, + (uint64_t) dd->ipath_extctrl); + if (ltstate == 1 + && (dd-> + ipath_flags & (IPATH_LINK_TOARMED | + IPATH_LINK_TOACTIVE))) { + ipath_set_ib_lstate(t, + INFINIPATH_IBCC_LINKCMD_INIT); + } + } else if ((val & 0x31) == 0x31) { + if (!noprint) + _IPATH_DBG("Link unit %u is now in active state\n", t); + dd->ipath_flags |= IPATH_LINKACTIVE; + dd->ipath_flags &= + ~(IPATH_LINKUNK | IPATH_LINKINIT | IPATH_LINKDOWN | + IPATH_LINKARMED | IPATH_NOCABLE | + IPATH_LINK_TOACTIVE | IPATH_LINK_SLEEPING); + *dd->ipath_statusp &= ~IPATH_STATUS_IB_NOCABLE; + *dd->ipath_statusp |= + IPATH_STATUS_IB_READY | IPATH_STATUS_IB_CONF; + /* set the externally visible LEDs to indicate state */ + dd->ipath_extctrl |= INFINIPATH_EXTC_LEDPRIPORTGREENON + | INFINIPATH_EXTC_LEDPRIPORTYELLOWON; + ipath_kput_kreg(t, kr_extctrl, + (uint64_t) dd->ipath_extctrl); + + /* + * since we are now active, set the linkinitcmd + * to NOP (0) it was probably either POLL or SLEEP + */ + dd->ipath_ibcctrl &= + ~(INFINIPATH_IBCC_LINKINITCMD_MASK << + INFINIPATH_IBCC_LINKINITCMD_SHIFT); + ipath_kput_kreg(t, kr_ibcctrl, dd->ipath_ibcctrl); + + if (devdata[t].ipath_layer.l_intr) + devdata[t].ipath_layer.l_intr(t, + IPATH_LAYER_INT_IF_UP); + } else if ((val & 0x31) == 0x11) { + /* + * set set INIT and DOWN. Down is checked by + * most of the other code, but INIT is useful + * to know in a few places. + */ + dd->ipath_flags |= IPATH_LINKINIT | IPATH_LINKDOWN; + dd->ipath_flags &= + ~(IPATH_LINKUNK | IPATH_LINKACTIVE | IPATH_LINKARMED + | IPATH_NOCABLE | IPATH_LINK_SLEEPING); + *dd->ipath_statusp &= ~(IPATH_STATUS_IB_NOCABLE + | IPATH_STATUS_IB_READY); + + /* set the externally visible LEDs to indicate state */ + dd->ipath_extctrl &= + ~INFINIPATH_EXTC_LEDPRIPORTYELLOWON; + dd->ipath_extctrl |= INFINIPATH_EXTC_LEDPRIPORTGREENON; + ipath_kput_kreg(t, kr_extctrl, + (uint64_t) dd->ipath_extctrl); + if (dd-> + ipath_flags & (IPATH_LINK_TOARMED | + IPATH_LINK_TOACTIVE)) { + /* + * if we got here while trying to bring + * the link up, try again, but only once more! 
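
The "only once more" retry that appears in several branches below is a one-shot latch: the pending bring-up flag both requests the retry and is consumed by issuing it. Reduced to a standalone sketch with hypothetical names:

enum { WANT_ARMED = 0x1, WANT_ACTIVE = 0x2, LINKCMD_ARMED = 0x3 };

static void issue_link_cmd(int cmd)
{
	(void)cmd;	/* stub: the driver writes kr_ibcctrl here */
}

static void on_fell_back_to_init(unsigned *want)
{
	if (*want & (WANT_ARMED | WANT_ACTIVE)) {
		/* consuming the request as we retry is what limits a
		 * flapping link to one extra attempt, not an endless loop */
		issue_link_cmd(LINKCMD_ARMED);
		*want &= ~(WANT_ARMED | WANT_ACTIVE);
	}
}
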
+ */ + ipath_set_ib_lstate(t, + INFINIPATH_IBCC_LINKCMD_ARMED); + dd->ipath_flags &= + ~(IPATH_LINK_TOARMED | IPATH_LINK_TOACTIVE); + } + } else if ((val & 0x31) == 0x21) { + dd->ipath_flags |= IPATH_LINKARMED; + dd->ipath_flags &= + ~(IPATH_LINKUNK | IPATH_LINKDOWN | IPATH_LINKINIT | + IPATH_LINKACTIVE | IPATH_NOCABLE | + IPATH_LINK_TOARMED | IPATH_LINK_SLEEPING); + *dd->ipath_statusp &= ~(IPATH_STATUS_IB_NOCABLE + | IPATH_STATUS_IB_READY); + /* + * set the externally visible LEDs to indicate + * state (same as 0x11) + */ + dd->ipath_extctrl &= + ~INFINIPATH_EXTC_LEDPRIPORTYELLOWON; + dd->ipath_extctrl |= INFINIPATH_EXTC_LEDPRIPORTGREENON; + ipath_kput_kreg(t, kr_extctrl, + (uint64_t) dd->ipath_extctrl); + if (dd->ipath_flags & IPATH_LINK_TOACTIVE) { + /* + * if we got here while trying to bring + * the link up, try again, but only once more! + */ + ipath_set_ib_lstate(t, + INFINIPATH_IBCC_LINKCMD_ACTIVE); + dd->ipath_flags &= ~IPATH_LINK_TOACTIVE; + } + } else { + if (dd-> + ipath_flags & (IPATH_LINK_TOARMED | + IPATH_LINK_TOACTIVE)) + ipath_set_ib_lstate(t, + INFINIPATH_IBCC_LINKCMD_INIT); + else if (!noprint) + _IPATH_DBG("IBstatuschange unit %u: %s\n", + t, ipath_ibcstatus_str[ltstate]); + } + dd->ipath_lastibcstat = val; + } + +skip_ibchange: + + if (errs & INFINIPATH_E_RESET) { + if (!noprint) + _IPATH_UNIT_ERROR(t, + "Got reset, requires re-initialization (unload and reload driver)\n"); + dd->ipath_flags &= ~IPATH_INITTED; /* needs re-init */ + /* mark as having had error */ + *dd->ipath_statusp |= IPATH_STATUS_HWERROR; + *dd->ipath_statusp &= ~IPATH_STATUS_IB_CONF; + } + + if (!noprint && *msg) + _IPATH_UNIT_ERROR(t, "%s error\n", msg); + if (dd->ipath_sma_state_wanted & dd->ipath_flags) { + _IPATH_VDBG("sma wanted state %x, iflags now %x, waking\n", + dd->ipath_sma_state_wanted, dd->ipath_flags); + wake_up_interruptible(&ipath_sma_state_wait); + } + + if (chkerrpkts) + /* process possible error packets in hdrq */ + ipath_kreceive(t); +} + +/* must only be called if ipath_pd[port] is known to be allocated */ +static __inline__ void *ipath_get_egrbuf(const ipath_type t, uint32_t bufnum, + int err) +{ + return devdata[t].ipath_port0_skbs ? + (void *)devdata[t].ipath_port0_skbs[bufnum]->data : NULL; + +#ifdef _USE_FOR_DEBUGGING_ONLY + /* + * want routine to be inlined and fast this is here so if we do ports + * other than 0, I don't have to rewrite the code, since it's slightly + * complicated + */ + if (port != 1) { + void *chunkbase; + /* + * This calculation takes about 50 cycles. Could do + * what I did for protocol code, and have an array of + * addresses, getting it down to just a few cycles per + * lookup, at the cost of 16KB of memory. + */ + if (!devdata[t].ipath_pd[port]->port_rcvegrbuf_virt) + return NULL; + chunkbase = devdata[t].ipath_pd[port]->port_rcvegrbuf_virt + [bufnum / + devdata[t].ipath_pd[port]->port_rcvegrbufs_perchunk]; + return (void *)(chunkbase + + (bufnum % + devdata[t].ipath_pd[port]-> + port_rcvegrbufs_perchunk) + * devdata[t].ipath_rcvegrbufsize); + } +#endif +} + +/* receive an sma packet. 
Separate for better overall optimization */ +static void ipath_rcv_sma(const ipath_type t, uint32_t tlen, + uint64_t * rc, void *ebuf) +{ + int sindex, slen, elen; + void *smbuf; + uint8_t pad, *bthbytes; + + ipath_stats.sps_sma_rpkts++; /* another SMA packet received */ + + bthbytes = (uint8_t *) ((ips_message_header_typ *) & rc[1])->bth; + + pad = (bthbytes[1] >> 4) & 3; + elen = tlen - (IPATH_SMA_HDRSZ + pad + (uint32_t) sizeof(uint32_t)); + if (elen > (SMA_MAX_PKTSZ - IPATH_SMA_HDRSZ)) + elen = SMA_MAX_PKTSZ - IPATH_SMA_HDRSZ; + + spin_lock_irq(&ipath_sma_lock); + sindex = ipath_sma_next; + smbuf = ipath_sma_data[sindex].buf; + ipath_sma_data[sindex].unit = t; + slen = ipath_sma_data[ipath_sma_next].len; + memcpy(smbuf, &rc[1], IPATH_SMA_HDRSZ); + memcpy(smbuf + IPATH_SMA_HDRSZ, ebuf, elen); + if (slen) { + /* + * overwriting a yet unread old one (buffer wrap), have to + * advance ipath_sma_first to next oldest + */ + + /* count OK packets that we drop */ + ipath_stats.sps_krdrops++; + if (++ipath_sma_first >= IPATH_NUM_SMAPKTS) + ipath_sma_first = 0; + } + slen = ipath_sma_data[sindex].len = elen + IPATH_SMA_HDRSZ; + if (++ipath_sma_next >= IPATH_NUM_SMAPKTS) + ipath_sma_next = 0; + spin_unlock_irq(&ipath_sma_lock); +} + +/* + * receive a packet for the layered (ethernet) driver. + * Separate routine for better overall optimization + */ +static void ipath_rcv_layer(const ipath_type t, uint32_t etail, + uint32_t tlen, ether_header_typ * hdr) +{ + uint32_t elen; + uint8_t pad, *bthbytes; + struct sk_buff *skb; + struct sk_buff *nskb; + ipath_devdata *dd = &devdata[t]; + ipath_portdata *pd; + unsigned long pa, pent; + uint64_t *egrbase; + uint64_t lenvalid; /* in words */ + + if (dd->ipath_port0_skbs && hdr->sub_opcode == OPCODE_ENCAP) { + /* + * Allocate a new sk_buff to replace the one we give + * to the network stack. 
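
The SMA receive buffering that ipath_rcv_sma() implements is a fixed-size ring with a producer index (next), a consumer index (first), and a nonzero stored length marking a slot that is still unread; overwriting an unread slot drops the oldest packet. The same invariants as a self-contained sketch:

#include <stdint.h>
#include <string.h>

#define NUM_SLOTS 16	/* stands in for IPATH_NUM_SMAPKTS */

struct slot { uint32_t len; uint8_t buf[256]; };

static struct slot ring[NUM_SLOTS];
static int first, next;		/* consumer / producer indices */
static uint64_t drops;

static void ring_put(const void *data, uint32_t len)
{
	if (ring[next].len) {
		/* producer caught up to an unread slot: count the drop
		 * and advance the consumer past the sacrificed packet */
		drops++;
		if (++first >= NUM_SLOTS)
			first = 0;
	}
	memcpy(ring[next].buf, data, len < sizeof(ring[next].buf) ?
	       len : sizeof(ring[next].buf));
	ring[next].len = len;
	if (++next >= NUM_SLOTS)
		next = 0;
}
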
+ */ + if (!(nskb = dev_alloc_skb(dd->ipath_ibmaxlen + 4))) { + /* count OK packets that we drop */ + ipath_stats.sps_krdrops++; + return; + } + + bthbytes = (uint8_t *) hdr->bth; + pad = (bthbytes[1] >> 4) & 3; + /* +CRC32 */ + elen = tlen - (sizeof(*hdr) + pad + sizeof(uint32_t)); + + skb_reserve(nskb, 4); + + skb = dd->ipath_port0_skbs[etail]; + dd->ipath_port0_skbs[etail] = nskb; + skb_put(skb, elen); + + pd = dd->ipath_pd[0]; + lenvalid = (dd->ipath_ibmaxlen - pd->port_egrskip) >> 2; + lenvalid <<= INFINIPATH_RT_BUFSIZE_SHIFT; + lenvalid |= INFINIPATH_RT_VALID; + pa = virt_to_phys(nskb->data); + pa += pd->port_egrskip; + pent = (pa & INFINIPATH_RT_ADDR_MASK) | lenvalid; + /* This is simplified for port 0 */ + egrbase = (uint64_t *) ((char *)(dd->ipath_kregbase) + + dd->ipath_rcvegrbase); + ipath_kput_memq(t, &egrbase[etail], pent); + + dd->ipath_layer.l_rcv(t, hdr, skb); + + /* another ether packet received */ + ipath_stats.sps_ether_rpkts++; + } else if (hdr->sub_opcode == OPCODE_LID_ARP) { + if (dd->ipath_layer.l_rcv_lid) + dd->ipath_layer.l_rcv_lid(t, hdr); + } + +} + +/* called from interrupt handler for errors or receive interrupt */ +void ipath_kreceive(const ipath_type t) +{ + uint64_t *rc; + void *ebuf; + ipath_devdata *dd = &devdata[t]; + const uint32_t rsize = dd->ipath_rcvhdrentsize; /* words */ + const uint32_t maxcnt = dd->ipath_rcvhdrcnt * rsize; /* in words */ + uint32_t etail = ~0U, l, hdrqtail, sma_this_time = 0; + ips_message_header_typ *hdr; + uint32_t eflags, i, etype, tlen, pkttot=0; + static uint64_t totcalls; /* stats, may eventually remove */ + char emsg[128]; + + if (!dd->ipath_hdrqtailptr) { + _IPATH_UNIT_ERROR(t, + "hdrqtailptr not set, can't do receives\n"); + return; + } + + if (test_and_set_bit(0, &dd->ipath_rcv_pending)) { + /* There is already a thread processing this queue. */ + return; + } + + if (dd->ipath_port0head == *dd->ipath_hdrqtailptr) + goto done; + +gotmore: + /* + * read only once at start. If in flood situation, this helps + * performance slightly. If more arrive while we are processing, + * we'll come back here and do them + */ + hdrqtail = *dd->ipath_hdrqtailptr; + + for (i = 0, l = dd->ipath_port0head; l != hdrqtail; i++) { + uint32_t qp; + uint8_t *bthbytes; + + + rc = (uint64_t *) (dd->ipath_pd[0]->port_rcvhdrq + (l << 2)); + hdr = (ips_message_header_typ *) & rc[1]; + /* + * could make a network order version of IPATH_KD_QP, and + * do the obvious shift before masking to speed this up. + */ + qp = ntohl(hdr->bth[1]) & 0xffffff; + bthbytes = (uint8_t *) hdr->bth; + + eflags = ips_get_hdr_err_flags(rc); + etype = ips_get_rcv_type(rc); + tlen = ips_get_length_in_bytes(rc); /* total length */ + ebuf = NULL; + if (etype != RCVHQ_RCV_TYPE_EXPECTED) { + /* + * it turns out that the chips uses an eager buffer for + * all non-expected packets, whether it "needs" + * one or not. So always get the index, but + * don't set ebuf (so we try to copy data) + * unless the length requires it. + */ + etail = ips_get_index(rc); + if (tlen > sizeof(*hdr) + || etype == RCVHQ_RCV_TYPE_NON_KD) { + ebuf = ipath_get_egrbuf(t, etail, 0); + } + } + + /* + * both tiderr and ipathhdrerr are set for all plain IB + * packets; only ipathhdrerr should be set. 
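
The ethernet receive path shown here never copies payload: the filled sk_buff is handed up the stack and a freshly allocated one takes its place in the eager ring. Schematically (a kernel-context fragment; the ring, repost, and delivery helpers are hypothetical, and the real code also rebuilds the chip's buffer descriptor from the new buffer's physical address):

static void rx_replace_and_deliver(struct sk_buff **ring, int slot,
				   unsigned len, unsigned maxlen)
{
	struct sk_buff *fresh, *full;

	fresh = dev_alloc_skb(maxlen + 4);
	if (!fresh) {
		count_drop();		/* keep the old buffer in the ring */
		return;
	}
	skb_reserve(fresh, 4);		/* mirror the driver's alignment */
	full = ring[slot];		/* the buffer the chip just filled */
	ring[slot] = fresh;
	skb_put(full, len);
	repost_buffer(slot, fresh);	/* hypothetical: rewrite eager entry */
	deliver_up_stack(full);		/* hypothetical: the l_rcv callback */
}
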
+ */ + + if (etype != RCVHQ_RCV_TYPE_NON_KD + && etype != RCVHQ_RCV_TYPE_ERROR + && ips_get_ipath_ver(hdr->iph.ver_port_tid_offset) != + IPS_PROTO_VERSION) { + _IPATH_PDBG("Bad InfiniPath protocol version %x\n", + etype); + } + + if (eflags & + ~(INFINIPATH_RHF_H_TIDERR | INFINIPATH_RHF_H_IHDRERR)) { + get_rhf_errstring(eflags, emsg, sizeof emsg); + _IPATH_PDBG + ("RHFerrs %x hdrqtail=%x typ=%u tlen=%x opcode=%x egridx=%x: %s\n", + eflags, l, etype, tlen, bthbytes[0], + ips_get_index(rc), emsg); + } else if (etype == RCVHQ_RCV_TYPE_NON_KD) { + /* + * If there is a userland SMA and this is a MAD packet, + * then pass it to the userland SMA. + */ + if (ipath_sma_alive && qp <= 1) { + /* + * count OK packets that we drop because + * SMA isn't yet running, or because we + * are in an sma flood (no point in + * constantly acquiring the spin lock, and + * overwriting previous packets). + * Eventually things will recover. + * Similarly if the sma consumer is + * so far behind that we would overwrite + * (yes, it's outside the lock) + */ + if (!ipath_sma_data_spare || + ipath_sma_data[ipath_sma_next].len || + ++sma_this_time > IPATH_NUM_SMAPKTS) { + ipath_stats.sps_krdrops++; + } else if (ebuf) { + ipath_rcv_sma(t, tlen, rc, ebuf); + } + } else if (dd->verbs_layer.l_rcv) { + dd->verbs_layer.l_rcv(t, rc + 1, ebuf, tlen); + } else { + _IPATH_VDBG("received IB packet, not SMA (QP=%x)\n", + qp); + } + } else if (etype == RCVHQ_RCV_TYPE_EAGER) { + if (qp == IPATH_KD_QP && bthbytes[0] == + dd->ipath_layer.l_rcv_opcode && ebuf) + ipath_rcv_layer(t, etail, tlen, + (ether_header_typ *) hdr); + else + _IPATH_PDBG + ("typ %x, opcode %x (eager, qp=%x), len %x; ignored\n", + etype, bthbytes[0], qp, tlen); + } else if (etype == RCVHQ_RCV_TYPE_EXPECTED) { + _IPATH_DBG("Bug: Expected TID, opcode %x; ignored\n", + hdr->bth[0] & 0xff); + } else if (eflags & + (INFINIPATH_RHF_H_TIDERR | INFINIPATH_RHF_H_IHDRERR)) + { + /* + * This is a type 3 packet, only the LRH is in + * the rcvhdrq, the rest of the header is in + * the eager buffer. + */ + uint8_t opcode; + if (ebuf) { + bthbytes = (uint8_t *) ebuf; + opcode = *bthbytes; + } else + opcode = 0; + get_rhf_errstring(eflags, emsg, sizeof emsg); + _IPATH_DBG + ("Err %x (%s), opcode %x, egrbuf %x, len %x\n", + eflags, emsg, opcode, etail, tlen); + } else { + /* + * error packet, type of error unknown. + * Probably type 3, but we don't know, so don't + * even try to print the opcode, etc. + */ + _IPATH_DBG + ("Error Pkt, but no eflags! egrbuf %x, len %x\n" + "hdrq@%lx;hdrq+%x rhf: %llx; hdr %llx %llx %llx %llx %llx\n", + etail, tlen, (unsigned long)rc, l, rc[0], rc[1], + rc[2], rc[3], rc[4], rc[5]); + } + l += rsize; + if (l >= maxcnt) + l = 0; + /* + * update for each packet, to help prevent overflows if we have + * lots of packets. 
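
The demultiplexing in this loop can be summarized as a dispatch on the receive type the chip records in the header flags; flattened into a switch, with hypothetical handler names standing in for the SMA, verbs, and layered-driver paths:

	switch (etype) {
	case RCVHQ_RCV_TYPE_NON_KD:	/* plain IB: MAD to SMA, else verbs */
		if (sma_wants(qp))
			queue_for_sma(hdr, ebuf, tlen);
		else
			verbs_rcv(hdr, ebuf, tlen);
		break;
	case RCVHQ_RCV_TYPE_EAGER:	/* native protocol, e.g. ether encap */
		layer_rcv(etail, tlen, hdr);
		break;
	case RCVHQ_RCV_TYPE_EXPECTED:	/* should have hit a TID buffer: bug */
	default:			/* error packet of unknown type */
		log_and_drop(eflags, hdr);
		break;
	}
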
+ */ + (void)ipath_kput_ureg(t, ur_rcvhdrhead, l, 0); + if (etype != RCVHQ_RCV_TYPE_EXPECTED) + (void)ipath_kput_ureg(t, ur_rcvegrindexhead, etail, 0); + } + + pkttot += i; + + dd->ipath_port0head = l; + + if (hdrqtail != *dd->ipath_hdrqtailptr) + goto gotmore; /* more arrived while we handled first batch */ + + if(pkttot > ipath_stats.sps_maxpkts_call) + ipath_stats.sps_maxpkts_call = pkttot; + ipath_stats.sps_port0pkts += pkttot; + ipath_stats.sps_avgpkts_call = ipath_stats.sps_port0pkts / ++totcalls; + + if (sma_this_time) /* only once at end, not each time */ + wake_up_interruptible(&ipath_sma_wait); + +done: + clear_bit(0, &dd->ipath_rcv_pending); + smp_mb__after_clear_bit(); +} + +/* + * Update our shadow copy of the PIO availability register map, called + * whenever our local copy indicates we have run out of send buffers + * NOTE: This can be called from interrupt context by ipath_bufavail() + * and from non-interrupt context by ipath_getpiobuf(). + */ + +static void ipath_update_pio_bufs(const ipath_type t) +{ + unsigned long flags; + int i; + const unsigned piobregs = (unsigned)devdata[t].ipath_pioavregs; + + /* If the generation (check) bits have changed, then we update the + * busy bit for the corresponding PIO buffer. This algorithm will + * modify positions to the value they already have in some cases + * (i.e., no change), but it's faster than changing only the bits + * that have changed. + * + * We would like to do this atomicly, to avoid spinlocks in the + * critical send path, but that's not really possible, given the + * type of changes, and that this routine could be called on multiple + * cpu's simultaneously, so we lock in this routine only, to avoid + * conflicting updates; all we change is the shadow, and it's a + * single 64 bit memory location, so by definition the update is + * atomic in terms of what other cpu's can see in testing the + * bits. The spin_lock overhead isn't too bad, since it only + * happens when all buffers are in use, so only cpu overhead, + * not latency or bandwidth is affected. + */ +#define _IPATH_ALL_CHECKBITS 0x5555555555555555ULL + if (!devdata[t].ipath_pioavailregs_dma) { + _IPATH_DBG("Update shadow pioavail, but regs_dma NULL!\n"); + return; + } + if (infinipath_debug & __IPATH_VERBDBG) { + /* only if packet debug and verbose */ + _IPATH_PDBG("Refill avail, dma0=%llx shad0=%llx, " + "d1=%llx s1=%llx, d2=%llx s2=%llx, d3=%llx s3=%llx\n", + devdata[t].ipath_pioavailregs_dma[0], + devdata[t].ipath_pioavailshadow[0], + devdata[t].ipath_pioavailregs_dma[1], + devdata[t].ipath_pioavailshadow[1], + devdata[t].ipath_pioavailregs_dma[2], + devdata[t].ipath_pioavailshadow[2], + devdata[t].ipath_pioavailregs_dma[3], + devdata[t].ipath_pioavailshadow[3]); + if (piobregs > 4) + _IPATH_PDBG("2nd group, dma4=%llx shad4=%llx, " + "d5=%llx s5=%llx, d6=%llx s6=%llx, d7=%llx s7=%llx\n", + devdata[t].ipath_pioavailregs_dma[4], + devdata[t].ipath_pioavailshadow[4], + devdata[t].ipath_pioavailregs_dma[5], + devdata[t].ipath_pioavailshadow[5], + devdata[t].ipath_pioavailregs_dma[6], + devdata[t].ipath_pioavailshadow[6], + devdata[t].ipath_pioavailregs_dma[7], + devdata[t].ipath_pioavailshadow[7]); + } + spin_lock_irqsave(&ipath_pioavail_lock, flags); + for (i = 0; i < piobregs; i++) { + uint64_t pchbusy, pchg, piov, pnew; + /* Chip Errata: bug 6641; even and odd qwords>3 are swapped */ + piov = devdata[t].ipath_pioavailregs_dma[i > 3 ? i ^ 1 : i]; + pchg = + _IPATH_ALL_CHECKBITS & ~(devdata[t]. 
+ ipath_pioavailshadow[i] ^ piov); + pchbusy = pchg << INFINIPATH_SENDPIOAVAIL_BUSY_SHIFT; + if (pchg && (pchbusy & devdata[t].ipath_pioavailshadow[i])) { + pnew = devdata[t].ipath_pioavailshadow[i] & ~pchbusy; + pnew |= piov & pchbusy; + devdata[t].ipath_pioavailshadow[i] = pnew; + } + } + spin_unlock_irqrestore(&ipath_pioavail_lock, flags); +} + +static int ipath_do_user_init(ipath_portdata * pd, + struct ipath_user_info *uinfo) +{ + int ret = 0; + ipath_type t = pd->port_unit; + ipath_devdata *dd = &devdata[t]; + struct ipath_user_info kinfo; + + if (copy_from_user(&kinfo, uinfo, sizeof kinfo)) + ret = -EFAULT; + else { + /* for now, if major version is different, bail */ + if ((kinfo.spu_userversion >> 16) != IPATH_USER_SWMAJOR) { + _IPATH_INFO + ("User major version %d not same as driver major %d\n", + kinfo.spu_userversion >> 16, IPATH_USER_SWMAJOR); + ret = -ENODEV; + } else { + if ((kinfo.spu_userversion & 0xffff) != + IPATH_USER_SWMINOR) + _IPATH_DBG + ("User minor version %d not same as driver minor %d\n", + kinfo.spu_userversion & 0xffff, + IPATH_USER_SWMINOR); + if (kinfo.spu_rcvhdrsize) { + if ((ret = + ipath_setrcvhdrsize(t, + kinfo.spu_rcvhdrsize))) + goto done; + } else if (!dd->ipath_rcvhdrsize) { + /* + * first user of field, kernel or user + * code, and using default + */ + dd->ipath_rcvhdrsize = IPATH_DFLT_RCVHDRSIZE; + ipath_kput_kreg(pd->port_unit, kr_rcvhdrsize, + dd->ipath_rcvhdrsize); + _IPATH_VDBG + ("Use default protocol header size %u\n", + dd->ipath_rcvhdrsize); + } + + pd->port_egrskip = kinfo.spu_egrskip; + if (pd->port_egrskip) { + if (pd->port_egrskip & 3) { + _IPATH_DBG + ("eager skip 0x%x invalid, must be word multiple; using 0x%x\n", + pd->port_egrskip, + pd->port_egrskip & ~3); + pd->port_egrskip &= ~3; + } + _IPATH_DBG + ("user reserves 0x%x bytes at start of eager TIDs\n", + pd->port_egrskip); + } + + /* + * for now we do nothing with rcvhdrcnt: + * kinfo.spu_rcvhdrcnt + */ + + /* + * set up for the rcvhdr Q tail register writeback + * to user memory + */ + if (kinfo.spu_rcvhdraddr && + access_ok(VERIFY_WRITE, kinfo.spu_rcvhdraddr, + sizeof(uint64_t))) { + uint64_t physaddr, uaddr, off, atmp; + struct page *pagep; + off = offset_in_page(kinfo.spu_rcvhdraddr); + uaddr = + PAGE_MASK & (unsigned long)kinfo. 
+ spu_rcvhdraddr; + if ((ret = ipath_mlock_nocopy(uaddr, &pagep))) { + _IPATH_INFO + ("Failed to lookup and lock address %llx for rcvhdrtail: errno %d\n", + kinfo.spu_rcvhdraddr, -ret); + goto done; + } + ipath_stats.sps_pagelocks++; + pd->port_rcvhdrtail_uaddr = uaddr; + pd->port_rcvhdrtail_pagep = pagep; + pd->port_rcvhdrtail_kvaddr = + page_address(pagep); + pd->port_rcvhdrtail_kvaddr += off; + physaddr = page_to_phys(pagep) + off; + _IPATH_VDBG + ("port %d user addr %llx hdrtailaddr, %llx physical (off=%llx)\n", + pd->port_port, kinfo.spu_rcvhdraddr, + physaddr, off); + ipath_kput_kreg_port(t, kr_rcvhdrtailaddr, + pd->port_port, physaddr); + atmp = + ipath_kget_kreg64_port(t, kr_rcvhdrtailaddr, + pd->port_port); + if (physaddr != atmp) { + _IPATH_UNIT_ERROR(t, + "Catastrophic software error, RcvHdrTailAddr%u written as %llx, read back as %llx\n", + pd->port_port, + physaddr, atmp); + ret = -EINVAL; + goto done; + } + } else { + _IPATH_DBG + ("Port %d rcvhdrtail addr %llx not valid\n", + pd->port_port, kinfo.spu_rcvhdraddr); + ret = -EINVAL; + goto done; + } + + /* + * for right now, kernel piobufs are at end, + * so port 1 is at 0 + */ + pd->port_piobufs = dd->ipath_piobufbase + + dd->ipath_pbufsport * (pd->port_port - + 1) * dd->ipath_palign; + _IPATH_VDBG("Set base of piobufs for port %u to 0x%x\n", + pd->port_port, pd->port_piobufs); + + /* + * Now allocate the rcvhdr Q and eager TIDs; + * skip the TID array for time being. + * If pd->port_port > chip-supported, we need + * to do extra stuff here to handle by handling + * overflow through port 0, someday + */ + if (!(ret = ipath_create_rcvhdrq(pd))) + ret = ipath_create_user_egr(pd); + if (!ret) { /* enable receives now */ + uint64_t head; + uint32_t head32; + /* atomically set enable bit for this port */ + atomic_set_mask(1U << + (INFINIPATH_R_PORTENABLE_SHIFT + + pd->port_port), + &dd->ipath_rcvctrl); + + /* + * set the head registers for this port + * to the current values of the tail + * pointers, since we don't know if they + * were updated on last use of the port. + */ + head32 = + ipath_kget_ureg32(t, ur_rcvhdrtail, + pd->port_port); + head = (uint64_t) head32; + ipath_kput_ureg(t, ur_rcvhdrhead, head, + pd->port_port); + head32 = + ipath_kget_ureg32(t, ur_rcvegrindextail, + pd->port_port); + ipath_kput_ureg(t, ur_rcvegrindexhead, head32, + pd->port_port); + dd->ipath_lastegrheads[pd->port_port] = ~0; + dd->ipath_lastrcvhdrqtails[pd->port_port] = ~0; + _IPATH_VDBG + ("Wrote port%d head %llx, egrhead %x from tail regs\n", + pd->port_port, head, head32); + /* start at beginning after open */ + pd->port_tidcursor = 0; + { + /* + * now enable the port; the tail + * registers will be written to + * memory by the chip as soon + * as it sees the write to + * kr_rcvctrl. The update only + * happens on transition from 0 + * to 1, so clear it first, then + * set it as part of enabling + * the port. This will (very + * briefly) affect any other open + * ports, but it shouldn't be long + * enough to be an issue. 
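
Note the defensive pattern used when programming the tail writeback address: write the register, read it back, and treat any mismatch as fatal, since a silently truncated DMA address would let the chip write to the wrong memory. The skeleton, with stand-ins for the ipath_kput_kreg_port()/ipath_kget_kreg64_port() helpers:

	write_port_reg(kr_rcvhdrtailaddr, port, physaddr);
	readback = read_port_reg(kr_rcvhdrtailaddr, port);
	if (readback != physaddr) {
		/* the register silently dropped bits; failing the open is
		 * much cheaper than letting the chip DMA somewhere else */
		ret = -EINVAL;
		goto done;
	}
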
+ */ + ipath_kput_kreg(t, kr_rcvctrl, + dd-> + ipath_rcvctrl & + ~INFINIPATH_R_TAILUPD); + ipath_kput_kreg(t, kr_rcvctrl, + dd->ipath_rcvctrl); + } + } + } + } + +done: + return ret; +} + +static int ipath_get_baseinfo(ipath_portdata * pd, + struct ipath_base_info *ubase) +{ + int ret = 0; + struct ipath_base_info kbase; + ipath_devdata *dd = &devdata[pd->port_unit]; + + /* be sure anything we don't set is 0ed */ + memset(&kbase, 0, sizeof kbase); + kbase.spi_rcvhdr_cnt = dd->ipath_rcvhdrcnt; + kbase.spi_rcvhdrent_size = dd->ipath_rcvhdrentsize; + kbase.spi_tidegrcnt = dd->ipath_rcvegrcnt; + kbase.spi_rcv_egrbufsize = dd->ipath_rcvegrbufsize; + kbase.spi_rcv_egrbuftotlen = pd->port_rcvegrbuf_chunks * PAGE_SIZE * (1 << pd->port_rcvegrbuf_order); /* have to mmap whole thing */ + kbase.spi_rcv_egrperchunk = pd->port_rcvegrbufs_perchunk; + kbase.spi_rcv_egrchunksize = kbase.spi_rcv_egrbuftotlen / + pd->port_rcvegrbuf_chunks; + kbase.spi_tidcnt = dd->ipath_rcvtidcnt; + /* + * for this use, may be ipath_cfgports summed over all chips that + * are are configured and present + */ + kbase.spi_nports = dd->ipath_cfgports; + kbase.spi_unit = pd->port_unit; /* unit (chip/board) our port is on */ + /* for now, only a single page */ + kbase.spi_tid_maxsize = PAGE_SIZE; + + /* + * doing this per port, and based on the skip value, etc. + * This has to be the actual buffer size, since the protocol + * code treats it as an array. + * + * These have to be set to user addresses in the user code via mmap + * These values are used on return to user code for the mmap target + * addresses only. For 32 bit, same 44 bit address problem, so use + * the physical address, not virtual. Before 2.6.11, using the + * page_address() macro worked, but in 2.6.11, even that returns + * the full 64 bit address (upper bits all 1's). + * So far, using the physical addresses (or chip offsets, for + * chip mapping) works, but no doubt some future kernel release + * will chang that, and we'll be on to yet another method of + * dealing with this + */ + kbase.spi_rcvhdr_base = (uint64_t) pd->port_rcvhdrq_phys; + kbase.spi_rcv_egrbufs = (uint64_t) pd->port_rcvegr_phys; + kbase.spi_pioavailaddr = (uint64_t) dd->ipath_pioavailregs_phys; + kbase.spi_status = (uint64_t) kbase.spi_pioavailaddr + + (void *)dd->ipath_statusp - (void *)dd->ipath_pioavailregs_dma; + kbase.spi_piobufbase = (uint64_t) pd->port_piobufs; + kbase.__spi_uregbase = + dd->ipath_uregbase + dd->ipath_palign * pd->port_port; + + kbase.spi_pioindex = dd->ipath_pbufsport * (pd->port_port - 1); + kbase.spi_piocnt = dd->ipath_pbufsport; + kbase.spi_pioalign = dd->ipath_palign; + + kbase.spi_qpair = IPATH_KD_QP; + kbase.spi_piosize = dd->ipath_ibmaxlen; + kbase.spi_mtu = dd->ipath_ibmaxlen; /* maxlen, not ibmtu */ + kbase.spi_port = pd->port_port; + kbase.spi_sw_version = IPATH_KERN_SWVERSION; + kbase.spi_hw_version = dd->ipath_revision; + + if (copy_to_user(ubase, &kbase, sizeof kbase)) + ret = -EFAULT; + + return ret; +} + +/* + * return number of units supported by driver. This is infinipath_max, + * unless there are no initted units. 
+ */ +static int ipath_get_units(void) +{ + int i; + + for (i = 0; i < infinipath_max; i++) + if (devdata[i].ipath_flags & IPATH_INITTED) + return infinipath_max; + return 0; +} + +/* write data to the EEPROM on the board */ +static int ipath_wr_eeprom(ipath_portdata * pd, struct ipath_eeprom_req *req) +{ + int ret = 0; + struct ipath_eeprom_req kreq; + void *buf = NULL; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; /* not just any old user can write flash */ + if (copy_from_user(&kreq, req, sizeof kreq)) + return -EFAULT; + if (!kreq.addr || (kreq.offset + kreq.len) > 128) { + _IPATH_DBG + ("called with NULL addr %llx, or bad cnt %u or offset %u\n", + kreq.addr, kreq.len, kreq.offset); + return -EINVAL; + } + + if (!(buf = vmalloc(kreq.len))) { + ret = -ENOMEM; + _IPATH_UNIT_ERROR(pd->port_unit, + "Couldn't allocate memory to write %u bytes from eeprom\n", + kreq.len); + goto done; + } + if (copy_from_user(buf, (void *)kreq.addr, kreq.len)) { + ret = -EFAULT; + goto done; + } + if (ipath_eeprom_write(pd->port_unit, kreq.offset, buf, kreq.len)) { + ret = -ENXIO; + _IPATH_UNIT_ERROR(pd->port_unit, + "Failed write to eeprom %u bytes offset %u\n", + kreq.len, kreq.offset); + } + +done: + if (buf) + vfree(buf); + return ret; +} + +/* read data from the EEPROM on the board */ +int ipath_rd_eeprom(const ipath_type port_unit, struct ipath_eeprom_req *req) +{ + int ret = 0; + struct ipath_eeprom_req kreq; + void *buf = NULL; + + if (copy_from_user(&kreq, req, sizeof kreq)) + return -EFAULT; + if (!kreq.addr || (kreq.offset + kreq.len) > 128) { + _IPATH_DBG + ("called with NULL addr %llx, or bad cnt %u or offset %u\n", + kreq.addr, kreq.len, kreq.offset); + return -EINVAL; + } + + if (!(buf = vmalloc(kreq.len))) { + ret = -ENOMEM; + _IPATH_UNIT_ERROR(port_unit, + "Couldn't allocate memory to read %u bytes from eeprom\n", + kreq.len); + goto done; + } + if (ipath_eeprom_read(port_unit, kreq.offset, buf, kreq.len)) { + ret = -ENXIO; + _IPATH_UNIT_ERROR(port_unit, + "Failed reading %u bytes offset %u from eeprom\n", + kreq.len, kreq.offset); + } + if (copy_to_user((void *)kreq.addr, buf, kreq.len)) + ret = -EFAULT; + +done: + if (buf) + vfree(buf); + return ret; +} + +/* + * wait for something to happen on a port. Currently this is + * PIO buffer available, or a packet being received. For now, at + * least, we wait no longer than 1/2 seconds on rcv, 1 tick on PIO, so + * we recover from any bugs (or, as we see in ips.c init and close, cases + * where other side isn't yet ready). + * NOTE: currently called only with PIO or RCV, never both, so path with both + * has not been tested + */ +static int ipath_wait_intr(ipath_portdata * pd, uint32_t flag) +{ + ipath_devdata *dd = &devdata[pd->port_unit]; + /* stupid compiler can't tell it's initialized */ + uint32_t im = 0; + uint32_t head, tail, timeo = 0, wflag = 0; + + if (!(flag & (IPATH_WAIT_RCV | IPATH_WAIT_PIO))) + return -EINVAL; + if (flag & IPATH_WAIT_RCV) { + head = flag >> 16; + im = (1U << pd->port_port) << INFINIPATH_R_INTRAVAIL_SHIFT; + atomic_set_mask(im, &dd->ipath_rcvctrl); + /* + * now, before blocking, make sure that head is still == tail, + * reading from the chip, so we can be sure the interrupt enable + * has made it to the chip. If not equal, disable + * interrupt again and return immediately. This avoids + * races, and the overhead of the chip read doesn't + * matter much at this point, since we are waiting for + * something anyway. 
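
The ordering that comment insists on is the standard lost-wakeup defense: arm the interrupt, make sure the chip has seen the enable, then re-check the condition, and only block if it still holds. In outline (all helpers hypothetical):

	enable_rcv_interrupt(port);		/* 1. arm the wakeup source */
	write_rcvctrl_to_chip();		/* 2. ensure the chip saw it */
	tail = read_tail_from_chip(port);	/* 3. recheck the condition */
	if (tail != head) {
		disable_rcv_interrupt(port);	/*    data arrived: don't block */
		write_rcvctrl_to_chip();
	} else {
		sleep_with_timeout(HZ / 2);	/* 4. now blocking is safe */
	}
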
+ */ + ipath_kput_kreg(pd->port_unit, kr_rcvctrl, dd->ipath_rcvctrl); + tail = + ipath_kget_ureg32(pd->port_unit, ur_rcvhdrtail, + pd->port_port); + if (tail == head) { + timeo = HZ / 2; + wflag = IPATH_PORT_WAITING_RCV; + } else { + atomic_clear_mask(im, &dd->ipath_rcvctrl); + ipath_kput_kreg(pd->port_unit, kr_rcvctrl, + dd->ipath_rcvctrl); + } + } + if (flag & IPATH_WAIT_PIO) { + /* + * this one's a bit worse than the receive case, in that we + * can't really verify that at least one interrupt + * will happen... + * We do use a really short timeout, however + */ + timeo = 1; /* if both, the short PIO timeout wins */ + atomic_set_mask(1U << pd->port_port, &dd->ipath_portpiowait); + wflag |= IPATH_PORT_WAITING_PIO; + /* + * this has a possible race with the ipath stuff, so do + * it atomicly + */ + atomic_set_mask(INFINIPATH_S_PIOINTBUFAVAIL, + &dd->ipath_sendctrl); + ipath_kput_kreg(pd->port_unit, kr_sendctrl, dd->ipath_sendctrl); + } + if (wflag) { + pd->port_flag |= wflag; + wait_event_interruptible_timeout(pd->port_wait, + (pd->port_flag & wflag) != + wflag, timeo); + if (wflag & pd->port_flag & IPATH_PORT_WAITING_PIO) { + /* timed out, no PIO interrupts */ + atomic_clear_mask(IPATH_PORT_WAITING_PIO, + &pd->port_flag); + pd->port_piowait_to++; + atomic_clear_mask(1U << pd->port_port, + &dd->ipath_portpiowait); + /* + * *don't* clear the pio interrupt enable; + * let that happen in the interrupt handler; + * else we have a race condition. + */ + } + if (wflag & pd->port_flag & IPATH_PORT_WAITING_RCV) { + /* timed out, no packets received */ + atomic_clear_mask(IPATH_PORT_WAITING_RCV, + &pd->port_flag); + pd->port_rcvwait_to++; + atomic_clear_mask(im, &dd->ipath_rcvctrl); + ipath_kput_kreg(pd->port_unit, kr_rcvctrl, + dd->ipath_rcvctrl); + } + } else { + /* else it's already happened, don't do wait_event overhead */ + if (flag & IPATH_WAIT_RCV) + pd->port_rcvnowait++; + if (flag & IPATH_WAIT_PIO) + pd->port_pionowait++; + } + return 0; +} -- 0.99.9n From rolandd at cisco.com Fri Dec 16 15:48:54 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 16 Dec 2005 15:48:54 -0800 Subject: [openib-general] [PATCH 01/13] [RFC] ipath basic headers In-Reply-To: <200512161548.jRuyTS0HPMLd7V81@cisco.com> Message-ID: <200512161548.aLjaDpGm5aqk0k0p@cisco.com> Basic headers for the ipath driver --- drivers/infiniband/hw/ipath/ipath_common.h | 798 +++++++++++++++++++++++++ drivers/infiniband/hw/ipath/ipath_kernel.h | 776 ++++++++++++++++++++++++ drivers/infiniband/hw/ipath/ipath_layer.h | 131 ++++ drivers/infiniband/hw/ipath/ipath_registers.h | 359 +++++++++++ drivers/infiniband/hw/ipath/ips_common.h | 221 +++++++ 5 files changed, 2285 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/hw/ipath/ipath_common.h create mode 100644 drivers/infiniband/hw/ipath/ipath_kernel.h create mode 100644 drivers/infiniband/hw/ipath/ipath_layer.h create mode 100644 drivers/infiniband/hw/ipath/ipath_registers.h create mode 100644 drivers/infiniband/hw/ipath/ips_common.h 200aa6cff25b6ab39be1f9d8949c2b3b4258ee1d diff --git a/drivers/infiniband/hw/ipath/ipath_common.h b/drivers/infiniband/hw/ipath/ipath_common.h new file mode 100644 index 0000000..ac33458 --- /dev/null +++ b/drivers/infiniband/hw/ipath/ipath_common.h @@ -0,0 +1,798 @@ +/* + * Copyright (c) 2003, 2004, 2005. PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. + * + * $Id: ipath_common.h 4491 2005-12-15 22:20:31Z rjwalsh $ + */ + +#ifndef _IPATH_COMMON_H +#define _IPATH_COMMON_H + +/* + * This file contains defines, structures, etc. that are used + * to communicate between kernel and user code. + */ + +#ifdef __KERNEL__ +#include +#include +#include +#else /* !__KERNEL__; user mode */ +#include +#include +#include +#include + +/* these aren't implemented for user mode, which is OK until we multi-thread */ +typedef struct _atomic { + uint32_t counter; +} atomic_t; /* no atomic_t type in user-land */ +#define atomic_set(a,v) ((a)->counter = (v)) +#define atomic_inc_return(a) (++(a)->counter) +#define likely(x) (x) +#define unlikely(x) (x) + +#define yield() sched_yield() + +/* + * too horrible to try and use the kernel get_cycles() or equivalent, + * so define and inline it here + */ + +#if !defined(rdtscll) +#if defined(__x86_64) || defined(__i386) +#define rdtscll(v) do {uint32_t a,d;asm volatile("rdtsc" : "=a" (a), "=d" (d)); \ + (v) = ((uint64_t)a) | (((uint64_t)d)<<32); \ +} while(0) +#else +#error "No cycle counter routine implemented yet for this platform" +#endif +#endif /* !defined(rdtscll) */ + +#endif /* ! __KERNEL__ */ + +typedef uint8_t ipath_type; + +/* This is the IEEE-assigned OUI for PathScale, Inc. */ +#define IPATH_SRC_OUI_1 0x00 +#define IPATH_SRC_OUI_2 0x11 +#define IPATH_SRC_OUI_3 0x75 + +/* version of protocol header (known to chip also). In the long run, + * we should be able to generate and accept a range of version numbers; + * for now we only accept one, and it's compiled in. + */ +#define IPS_PROTO_VERSION 2 + +#ifndef _BITS_PER_BYTE +#define _BITS_PER_BYTE 8 +#endif + +static __inline__ void ipath_shortcopy(void *dest, void *src, uint32_t cnt) + __attribute__ ((always_inline)); + +/* + * this is used for very short copies, usually 1 - 8 bytes, + * *NEVER* to the PIO buffers!!!!!!! use ipath_dwordcpy for longer + * copies, or any copy to the PIO buffers. 
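
The x86-specific rep-movsb assembly that implements ipath_shortcopy() below is functionally a byte copy; on any other architecture the portable stand-in would simply be memcpy, subject to the same restriction the comment states:

#include <string.h>
#include <stdint.h>

/* Portable stand-in for ipath_shortcopy(): a plain byte copy. Must NOT
 * be used for the PIO buffers; the comment below requires
 * ipath_dwordcpy for those, and for any longer copy. */
static inline void short_copy_fallback(void *dest, const void *src,
				       uint32_t cnt)
{
	memcpy(dest, src, cnt);
}
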
Works for 32 and 64 bit + * gcc and pathcc + */ +static __inline__ void ipath_shortcopy(void *dest, void *src, uint32_t cnt) +{ + void *ssv, *dsv; + uint32_t csv; + __asm__ __volatile__("cld\n\trep\n\tmovsb":"=&c"(csv), "=&D"(dsv), + "=&S"(ssv) + :"0"(cnt), "1"(dest), "2"(src) + :"memory"); +} + +/* + * optimized word copy; good for rev C and later opterons. Among the best for + * short copies, and does as well or slightly better than the optimizization + * guide copies 6 and 8 at 2KB. + */ +void ipath_dwordcpy(uint32_t * dest, uint32_t * src, uint32_t ndwords); + +/* + * These are compile time constants that you may want to enable or disable + * if you are trying to debug problems with code or performance. + * IPATH_VERBOSE_TRACING define as 1 if you want additional tracing in + * fastpath code + * IPATH_TRACE_REGWRITES define as 1 if you want register writes to be + * traced in faspath code + * _IPATH_TRACING define as 0 if you want to remove all tracing in a + * compilation unit + * _IPATH_DEBUGGING define as 0 if you want to remove debug prints + */ + +#define round_up(v,sz) (((v) + (sz)-1) & ~((sz)-1)) + +/* These are used in the driver, don't use them elsewhere */ +#define _IPATH_SIMFUNC_IOCTL_LOW 1 +#define _IPATH_SIMFUNC_IOCTL_HIGH 7 + +/* + * These tell the driver which ioctl's belong to the diags interface. + * As above, don't use them elsewhere. + */ +#define _IPATH_DIAG_IOCTL_LOW 100 +#define _IPATH_DIAG_IOCTL_HIGH 109 + +/* for IPATHSETREGBASE the length is the length covered by addr, in bytes */ +struct ipath_setregbase { + void *addr; + size_t length; +}; +/* + * IPATHINTERRUPT ioctl passes this as of rev 1.6 of the simulator; + * used to be an int + */ +struct ipath_int_vec { + int long long addr; + uint32_t info; +}; +struct ipath_eeprom_req { + long long addr; + uint16_t len; + uint16_t offset; +}; + +/* simulated chip space */ +#define IPATHSETREGBASE _IOW('s', 1, struct ipath_setregbase) +/* arg is currently unused */ +#define IPATHINTERRUPT _IOW('s', 2, struct ipath_int_vec) +/* + * arg is low 32 bits of the simulator sync register, and means that + * the simulator has processed up to and including that write + */ +#define IPATHSYNC _IOW('s', 3, int) + +/* + * simulator has initialized the memory from IPATHSETREGBASE, and driver + * can initialize based on the contents + */ +#define IPATHREADY _IOW('s', 4, int) +/* user mode userreg write, so we can notify simulators */ +#define IPATH_USERREG _IOW('s', 5, __ipath_rdummy) + +/* init; user params to kernel */ +#define IPATH_USERINIT _IOW('s', 16, struct ipath_user_info) +/* init; kernel/chip params to user */ +#define IPATH_BASEINFO _IOR('s', 17, struct ipath_base_info) +/* send a packet */ +#define IPATH_SENDPKT _IOW('s', 18, struct ipath_sendpkt) +/* + * if arg is 0, disable port, used when flushing after a hdrq overflow. 
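
These _IOW/_IOR macros pack a direction, the type character 's', a command number, and the argument size into each ioctl number, and user code invokes them in the usual way. A hypothetical caller of IPATH_USERINIT, assuming this header is visible to user space and the device node is the /dev/ipath mentioned further down:

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

int ipath_open_and_init(struct ipath_user_info *uinfo)
{
	int fd = open("/dev/ipath", O_RDWR);

	if (fd < 0)
		return -1;
	if (ioctl(fd, IPATH_USERINIT, uinfo) < 0) {
		close(fd);
		return -1;
	}
	return fd;	/* next step would be ioctl(fd, IPATH_BASEINFO, ...) */
}
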
+ * If arg ia 1, re-enable, and return new value of head register + */ +#define IPATH_RCVCTRL _IOR('s', 19, uint32_t) +/* only to make iow macro happy, w/o a struct */ +static uint64_t __ipath_rdummy[2] __attribute__ ((unused)); +#define IPATH_READ_EEPROM _IOWR('s', 20, struct ipath_eeprom_req) +/* set an accepted partition key; up to 4 pkeys can be active at once */ +#define IPATH_SET_PKEY _IOW('s', 21, uint16_t) +#define IPATH_WRITE_EEPROM _IOWR('s', 22, struct ipath_eeprom_req) +/* set LID for interface (SMA) */ +#define IPATH_SET_LID _IOW('s', 23, uint32_t) +/* set IB MTU for interface (SMA) */ +#define IPATH_SET_MTU _IOW('s', 24, uint32_t) +/* set IB link state for interface (SMA) */ +#define IPATH_SET_LINKSTATE _IOW('s', 25, uint32_t) +/* send an SMA packet, sps_flags contains "normal" SMA unit and minor number. */ +#define IPATH_SEND_SMA_PKT _IOW('s', 26, struct ipath_sendpkt) +/* receive an SMA packet */ +#define IPATH_RCV_SMA_PKT _IOW('s', 27, struct ipath_sendpkt) +/* get the portinfo data (SMA) + * takes array of 13, returns port info fields. Data is in host order, + * not network order; SMA-only fields are not filled in + */ +#define IPATH_GET_PORTINFO _IOWR('s', 28, uint32_t *) +/* + * get the nodeinfo data (SMA) + * takes an array of 10, returns nodeinfo fields in host order + */ +#define IPATH_GET_NODEINFO _IOWR('s', 29, uint32_t *) +/* set GUID on interface (SMA; GUID given in network order) */ +#define IPATH_SET_GUID _IOW('s', 30, struct ipath_setguid) +/* set MLID for interface (SMA) */ +#define IPATH_SET_MLID _IOW('s', 31, uint32_t) +#define IPATH_GET_MLID _IOWR('s', 32, uint32_t *) /* get the MLID (SMA) */ +/* update expected TID entries */ +#define IPATH_UPDM_TID _IOWR('s', 33, struct _tidupd) +/* free expected TID entries */ +#define IPATH_FREE_TID _IOW('s', 34, struct _tidupd) +/* return assigned unit:port */ +#define IPATH_GETPORT _IOR('s', 35, uint32_t) +/* wait for rcv pkt or pioavail */ +#define IPATH_WAIT _IOW('s', 36, uint32_t) +/* return LID for passed in unit */ +#define IPATH_GETLID _IOR('s', 37, uint16_t) +/* return # of units supported by driver */ +#define IPATH_GETUNITS _IO('s', 38) +/* get the device status */ +#define IPATH_GET_DEVSTATUS _IOWR('s', 39, uint64_t *) + +/* available for reuse ('s', 48) */ + +/* diagnostic read */ +#define IPATH_DIAGREAD _IOR('s', 100, struct ipath_diag_info) +/* diagnostic write */ +#define IPATH_DIAGWRITE _IOW('s', 101, struct ipath_diag_info) +/* HT Config read */ +#define IPATH_DIAG_HTREAD _IOR('s', 102, struct ipath_diag_info) +/* HT config write */ +#define IPATH_DIAG_HTWRITE _IOW('s', 103, struct ipath_diag_info) +#define IPATH_DIAGENTER _IO('s', 104) /* Enter diagnostic mode */ +#define IPATH_DIAGLEAVE _IO('s', 105) /* Leave diagnostic mode */ +/* send a packet, sps_flags contains unit and minor number. */ +#define IPATH_SEND_DIAG_PKT _IOW('s', 106, struct ipath_sendpkt) +/* + * read I2C FLASH + * NOTE: To read the I2C device, the _uaddress field should contain + * a pointer to struct ipath_eeprom_req, and _unit must be valid + */ +#define IPATH_DIAG_RD_I2C _IOW('s', 107, struct ipath_diag_info) + +/* + * Monitoring ioctls. All of these work with the main device + * (/dev/ipath), if you don't mind using a port (e.g. you already have + * the device open.) IPATH_GETSTATS and IPATH_GETUNITCOUNTERS also + * work with the control device (/dev/ipath_ctrl), if you don't want to + * use a port. + */ + +/* return chip counters for current unit. 
*/ +#define IPATH_GETCOUNTERS _IOR('s', 40, struct infinipath_counters) +/* return chip stats */ +#define IPATH_GETSTATS _IOR('s', 41, struct infinipath_stats) +/* return chip counters for a particular unit. */ +#define IPATH_GETUNITCOUNTERS _IOR('s', 42, struct infinipath_getunitcounters) + +/* + * unit is incoming unit number. + * data is a pointer to the infinipath_counters structure. + */ +struct infinipath_getunitcounters { + uint16_t unit; + uint64_t data; +}; + +/* + * The value in the BTH QP field that InfiniPath uses to differentiate + * an infinipath protocol IB packet vs standard IB transport + */ +#define IPATH_KD_QP 0x656b79 + +/* + * valid states passed to ipath_set_linkstate() user call + * (IPATH_SET_LINKSTATE ioctl) + */ +#define IPATH_IB_LINKDOWN 0 +#define IPATH_IB_LINKARM 1 +#define IPATH_IB_LINKACTIVE 2 + +/* + * stats maintained by the driver. For now, at least, this is global + * to all minor devices. + */ +struct infinipath_stats { + uint64_t sps_ints; /* number of interrupts taken */ + uint64_t sps_errints; /* number of interrupts for errors */ + /* number of errors from chip (not including packet errors or CRC) */ + uint64_t sps_errs; + /* number of packet errors from chip other than CRC */ + uint64_t sps_pkterrs; + /* number of packets with CRC errors (ICRC and VCRC) */ + uint64_t sps_crcerrs; + /* number of hardware errors reported (parity, etc.) */ + uint64_t sps_hwerrs; + /* number of times IB link changed state unexpectedly */ + uint64_t sps_iblink; + uint64_t sps_unused3; /* no longer used; left for compatibility */ + uint64_t sps_port0pkts; /* number of kernel (port0) packets received */ + /* number of "ethernet" packets sent by driver */ + uint64_t sps_ether_spkts; + /* number of "ethernet" packets received by driver */ + uint64_t sps_ether_rpkts; + uint64_t sps_sma_spkts; /* number of SMA packets sent by driver */ + uint64_t sps_sma_rpkts; /* number of SMA packets received by driver */ + /* number of times all ports rcvhdrq was full and packet dropped */ + uint64_t sps_hdrqfull; + /* number of times all ports egrtid was full and packet dropped */ + uint64_t sps_etidfull; + /* + * number of times we tried to send from driver, but no pio + * buffers avail + */ + uint64_t sps_nopiobufs; + uint64_t sps_ports; /* number of ports currently open */ + /* list of pkeys (other than default) accepted (0 means not set) */ + uint16_t sps_pkeys[4]; + /* lids for up to 4 infinipaths, indexed by infinipath # */ + uint16_t sps_lid[4]; + /* number of user ports per chip (not IB ports) */ + uint32_t sps_nports; + uint32_t sps_nullintr; /* not our interrupt, or already handled */ + uint32_t sps_maxpkts_call; /* max number of packets handled per receive call */ + uint32_t sps_avgpkts_call; /* avg number of packets handled per receive call */ + uint64_t sps_pagelocks; /* total number of pages ipath_mlock()'ed */ + /* total number of pages ipath_munlock()'ed */ + uint64_t sps_pageunlocks; + /* + * Number of packets dropped in kernel other than errors + * (ether packets if ipath not configured, sma/mad, etc.) + */ + uint64_t sps_krdrops; + /* mlids for up to 4 infinipaths, indexed by infinipath # */ + uint16_t sps_mlid[4]; + uint64_t __sps_pad[45]; /* pad for future growth */ +}; + +/* + * These are the status bits returned (in ascii form, 64bit value) + * by the IPATH_GETSTATS ioctl. 
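
Since the driver-status word is a plain 64-bit bitmask (the IPATH_STATUS_* bits are defined just below), consumers only ever test bits. For instance, a hypothetical readiness check:

#include <stdint.h>

/* returns nonzero when the link is usable and fully configured */
static int ib_link_ready(uint64_t status)
{
	const uint64_t want = IPATH_STATUS_IB_READY | IPATH_STATUS_IB_CONF;

	if (status & IPATH_STATUS_HWERROR)
		return 0;	/* a fatal hardware error was reported */
	return (status & want) == want;
}
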
+ */ +#define IPATH_STATUS_INITTED 0x1 /* basic driver initialization done */ +#define IPATH_STATUS_DISABLED 0x2 /* hardware disabled */ +#define IPATH_STATUS_UNUSED 0x4 /* available */ +#define IPATH_STATUS_OIB_SMA 0x8 /* ipath_mad kernel SMA running */ +#define IPATH_STATUS_SMA 0x10 /* user SMA running */ +/* Chip (simulator) has been found and initted */ +#define IPATH_STATUS_CHIP_PRESENT 0x20 +#define IPATH_STATUS_IB_READY 0x40 /* IB link is at ACTIVE, has LID, + * usable for all VL's */ +/* after link up, LID,MTU,etc. has been configured */ +#define IPATH_STATUS_IB_CONF 0x80 +/* no link established, probably no cable */ +#define IPATH_STATUS_IB_NOCABLE 0x100 +/* A Fatal hardware error has occurred. */ +#define IPATH_STATUS_HWERROR 0x200 + +/* The list of usermode accessible registers. Also see Reg_* later in file */ +typedef enum _ipath_ureg { + ur_rcvhdrtail = 0, /* (RO) DMA RcvHdr to be used next. */ + /* (RW) RcvHdr entry to be processed next by host. */ + ur_rcvhdrhead = 1, + ur_rcvegrindextail = 2, /* (RO) Index of next Eager index to use. */ + ur_rcvegrindexhead = 3, /* (RW) Eager TID to be processed next */ + /* For internal use only; max register number. */ + _IPATH_UregMax +} ipath_ureg; + +/* SMA minor# no portinfo, one for all instances */ +#define IPATH_SMA 128 + +/* Control minor# no portinfo, one for all instances */ +#define IPATH_CTRL 130 + +/* + * This structure is returned by ipath_userinit() immediately after open + * to get implementation-specific info, and info specific to this + * instance. + */ +struct ipath_base_info { + /* version of hardware, for feature checking. */ + uint32_t spi_hw_version; + /* version of software, for feature checking. */ + uint32_t spi_sw_version; + /* InfiniPath port assigned, goes into sent packets */ + uint32_t spi_port; + /* + * IB MTU, packets IB data must be less than this. + * The MTU is in bytes, and will be a multiple of 4 bytes. + */ + uint32_t spi_mtu; + /* + * size of a PIO buffer. Any given packet's total + * size must be less than this (in words). Included is the + * starting control word, so if 513 is returned, then total + * pkt size is 512 words or less. + */ + uint32_t spi_piosize; + /* size of the TID cache in infinipath, in entries */ + uint32_t spi_tidcnt; + /* size of the TID Eager list in infinipath, in entries */ + uint32_t spi_tidegrcnt; + /* size of a single receive header queue entry. */ + uint32_t spi_rcvhdrent_size; + /* Count of receive header queue entries allocated. + * This may be less than the spu_rcvhdrcnt passed in!. + */ + uint32_t spi_rcvhdr_cnt; + + uint32_t __32_bit_compatibility_pad; /* DO NOT MOVE OR REMOVE */ + + /* address where receive buffer queue is mapped into */ + uint64_t spi_rcvhdr_base; + + /* user program. */ + + /* base address of eager TID receive buffers. */ + uint64_t spi_rcv_egrbufs; + + /* Allocated by initialization code, not by protocol. */ + + /* size of each TID buffer in host memory, + * starting at spi_rcv_egrbufs. It includes spu_egrskip, and is + * at least spi_mtu bytes, and the buffers are virtually contiguous + */ + uint32_t spi_rcv_egrbufsize; + /* + * The special QP (queue pair) value that identifies an infinipath + * protocol packet from standard IB packets. More, probably much + * more, to be added. 
+ */
+	uint32_t spi_qpair;
+
+	/*
+	 * user register base for init code, not to be used directly by
+	 * protocol or applications
+	 */
+	uint64_t __spi_uregbase;
+	/*
+	 * maximum buffer size in bytes that can be used in a
+	 * single TID entry (assuming the buffer is aligned to this boundary).
+	 * This is the minimum of what the hardware and software support.
+	 * Guaranteed to be a power of 2.
+	 */
+	uint32_t spi_tid_maxsize;
+	/*
+	 * alignment of each pio send buffer (byte count
+	 * to add to spi_piobufbase to get to second buffer)
+	 */
+	uint32_t spi_pioalign;
+	/*
+	 * the index of the first pio buffer available
+	 * to this process; needed to do lookup in spi_pioavailaddr; not added
+	 * to spi_piobufbase
+	 */
+	uint32_t spi_pioindex;
+	uint32_t spi_piocnt;	/* number of buffers mapped for this process */
+
+	/*
+	 * base address of writeonly pio buffers for this process.
+	 * Each buffer has spi_piosize words, and is aligned on spi_pioalign
+	 * boundaries. spi_piocnt buffers are mapped from this address
+	 */
+	uint64_t spi_piobufbase;
+
+	/*
+	 * base address of readonly memory copy of the pioavail registers.
+	 * There are 2 bits for each buffer.
+	 */
+	uint64_t spi_pioavailaddr;
+
+	/*
+	 * Address where driver updates a copy
+	 * of the interface and driver status (IPATH_STATUS_*) as a 64 bit value
+	 * It's followed by a string indicating hardware error, if there was one
+	 */
+	uint64_t spi_status;
+
+	/* number of chip ports available to user processes */
+	uint32_t spi_nports;
+	uint32_t spi_unit;	/* unit number of chip we are using */
+	uint32_t spi_rcv_egrperchunk;	/* num bufs in each contiguous set */
+	/* size in bytes of each contiguous set */
+	uint32_t spi_rcv_egrchunksize;
+	/* total size of mmap to cover full rcvegrbuffers */
+	uint32_t spi_rcv_egrbuftotlen;
+	/*
+	 * ioctl cmd includes struct size, so pad out, and adjust down as
+	 * new fields are added to keep size constant
+	 */
+	uint32_t __spi_pad[19];
+} __attribute__ ((aligned(8)));
+
+#define IPATH_WAIT_RCV	0x1	/* IPATH_WAIT, receive */
+#define IPATH_WAIT_PIO	0x2	/* IPATH_WAIT, PIO */
+
+/*
+ * This version number is given to the driver by the user code during
+ * initialization in the spu_userversion field of ipath_user_info, so
+ * the driver can check for compatibility with user code.
+ *
+ * The major version changes when data structures
+ * change in an incompatible way. The driver must be the same or higher
+ * for initialization to succeed. In some cases, a higher version
+ * driver will not interoperate with older software, and initialization
+ * will return an error.
+ */
+#define IPATH_USER_SWMAJOR 1
+
+/*
+ * Minor version differences are always compatible
+ * within a major version; however, if user software is larger
+ * than driver software, some new features and/or structure fields
+ * may not be implemented; the user code must deal with this if it
+ * cares, or it must abort after initialization reports the difference
+ */
+#define IPATH_USER_SWMINOR 2
+
+#define IPATH_USER_SWVERSION ((IPATH_USER_SWMAJOR<<16) | IPATH_USER_SWMINOR)
+
+/* Similarly, this is the kernel version going back to the user. It's slightly
+ * different, in that we want to tell if the driver was built as part of a
+ * PathScale release, or from the driver in OpenIB, kernel.org, or a
+ * standard distribution, for support reasons. The high bit is 0 for
+ * non-PathScale, and 1 for PathScale-built/supplied. That bit is defined
+ * in Makefiles, rather than this file.
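
The resulting handshake is: abort on a major-version mismatch, warn or degrade on a minor one. A sketch of the user-side check, the mirror image of what ipath_do_user_init() does earlier in this patch; the mask discards the PathScale-built bit described above:

#include <stdint.h>
#include <stdio.h>

static int check_sw_version(uint32_t spi_sw_version)
{
	uint32_t major = (spi_sw_version >> 16) & 0x7fff; /* skip high bit */
	uint32_t minor = spi_sw_version & 0xffff;

	if (major != IPATH_USER_SWMAJOR)
		return -1;	/* incompatible data structures: abort */
	if (minor != IPATH_USER_SWMINOR)
		fprintf(stderr, "minor version skew (%u vs %u), "
			"some features may be unavailable\n",
			minor, IPATH_USER_SWMINOR);
	return 0;
}
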
+ * + * It's returned by the driver to the user code during initialization + * in the spi_sw_version field of ipath_base_info, so the user code can + * in turn check for compatibility with the kernel. +*/ +#define IPATH_KERN_SWVERSION ((IPATH_KERN_TYPE<<31) | IPATH_USER_SWVERSION) + +/* + * This structure is passed to ipath_userinit() to tell the driver where + * user code buffers are, sizes, etc. + */ +struct ipath_user_info { + /* + * version of user software, to detect compatibility issues. + * Should be set to IPATH_USER_SWVERSION. + */ + uint32_t spu_userversion; + + /* desired number of receive header queue entries */ + uint32_t spu_rcvhdrcnt; + + /* + * Leave this much unused space at the start of + * each eager buffer for software use. Similar in effect to + * setting K_Offset to this value. needs to be 'small', on the + * order of one or two cachelines + */ + uint32_t spu_egrskip; + + /* + * number of words in KD protocol header + * This tells InfiniPath how many words to copy to rcvhdrq. If 0, + * kernel uses a default. Once set, attempts to set any other value + * are an error (EAGAIN) until driver is reloaded. + */ + uint32_t spu_rcvhdrsize; + + /* + * cache line aligned (64 byte) user address to + * which the rcvhdrtail register will be written by infinipath + * whenever it changes, so that no chip registers are read in + * the performance path. + */ + uint64_t spu_rcvhdraddr; + + /* + * ioctl cmd includes struct size, so pad out, + * and adjust down as new fields are added to keep size constant + */ + uint32_t __spu_pad[6]; +} __attribute__ ((aligned(8))); + +struct ipath_iovec { + /* Pointer to data, but same size 32 and 64 bit */ + uint64_t iov_base; + + /* + * Length of data; don't need 64 bits, but want + * ipath_sendpkt to remain same size as before 32 bit changes, so... + */ + uint64_t iov_len; +}; + +/* + * Describes a single packet for send. Each packet can have one or more + * buffers, but the total length (exclusive of IB headers) must be less + * than the MTU, and if using the PIO method, entire packet length, + * including IB headers, must be less than the ipath_piosize value (words). + * Use of this necessitates including sys/uio.h + */ +struct ipath_sendpkt { + uint32_t sps_flags; /* flags for packet (TBD) */ + uint32_t sps_cnt; /* number of entries to use in sps_iov */ + /* array of iov's describing packet. TEMPORARY */ + struct ipath_iovec sps_iov[4]; +}; + +struct _tidupd { /* used only in inlined function for ioctl. */ + uint32_t tidcnt; + uint32_t tid__unused; /* make structure same size in 32 and 64 bit */ + uint64_t tidvaddr; /* virtual address of first page in transfer */ + /* pointer (same size 32/64 bit) to uint16_t tid array */ + uint64_t tidlist; + + /* + * pointer (same size 32/64 bit) to bitmap of TIDs used + * for this call; checked for being large enough at open + */ + uint64_t tidmap; +}; + +struct ipath_setguid { /* set GUID for interface */ + uint64_t sguid; /* in network order */ + uint64_t sunit; /* unit number of interface */ +}; + +/* + * Structure used to send data to and receive data from a diags ioctl. + * + * NOTE: For HT reads and writes, we only support byte, word (16bits) and + * dword (32bits). All other sizes for HT are invalid. + */ +struct ipath_diag_info { + uint64_t _base_offset; /* register to start reading from */ + uint64_t _num_bytes; /* number of bytes to read or write */ + /* + * address in user space. + * for reads, this is the address to store the read result(s). 
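
Filling the send descriptor is mechanical; because ipath_iovec carries a 64-bit pointer and a 64-bit length, the layout is identical for 32-bit and 64-bit callers. A hypothetical single-buffer setup:

#include <stdint.h>
#include <string.h>

static void fill_single_send(struct ipath_sendpkt *pkt,
			     const void *buf, uint32_t len)
{
	memset(pkt, 0, sizeof(*pkt));
	pkt->sps_flags = 0;		/* flags are TBD per the header */
	pkt->sps_cnt = 1;		/* one iovec entry in use */
	pkt->sps_iov[0].iov_base = (uint64_t)(uintptr_t)buf;
	pkt->sps_iov[0].iov_len = len;
	/* the caller would then issue ioctl(fd, IPATH_SENDPKT, pkt) */
}
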
* for writes, it is the address to get the write data from.
+         * This memory better be valid in user space!
+         */
+        uint64_t _uaddress;
+        uint64_t _unit; /* Unit ID of chip we are accessing. */
+        uint64_t _pad[15];
+};
+
+/*
+ * Data layout in I2C flash (for GUID, etc.)
+ * All fields are little-endian binary unless otherwise stated
+ */
+#define IPATH_FLASH_VERSION 1
+struct ipath_flash {
+        uint8_t if_fversion; /* flash layout version (IPATH_FLASH_VERSION) */
+        uint8_t if_csum; /* checksum protecting if_length bytes */
+        /*
+         * valid length (in use, protected by if_csum), including if_fversion
+         * and if_csum themselves
+         */
+        uint8_t if_length;
+        uint8_t if_guid[8]; /* the GUID, in network order */
+        /* number of GUIDs to use, starting from if_guid */
+        uint8_t if_numguid;
+        uint8_t if_serial[12]; /* the board serial number, in ASCII */
+        uint8_t if_mfgdate[8]; /* board mfg date (YYYYMMDD ASCII) */
+        /* last board rework/test date (YYYYMMDD ASCII) */
+        uint8_t if_testdate[8];
+        uint8_t if_errcntp[4]; /* logging of error counts, TBD */
+        /* powered on hours, updated at driver unload */
+        uint8_t if_powerhour[2];
+        uint8_t if_comment[32]; /* ASCII free-form comment field */
+        uint8_t if_future[50]; /* 78 bytes used, min flash size is 128 bytes */
+};
+
+uint8_t ipath_flash_csum(struct ipath_flash *, int);
+
+/*
+ * These are the counters implemented in the chip, and are listed in order.
+ * They are returned in this order by the IPATH_GETCOUNTERS ioctl
+ */
+struct infinipath_counters {
+        unsigned long long LBIntCnt;
+        unsigned long long LBFlowStallCnt;
+        unsigned long long Reserved1;
+        unsigned long long TxUnsupVLErrCnt;
+        unsigned long long TxDataPktCnt;
+        unsigned long long TxFlowPktCnt;
+        unsigned long long TxDwordCnt;
+        unsigned long long TxLenErrCnt;
+        unsigned long long TxMaxMinLenErrCnt;
+        unsigned long long TxUnderrunCnt;
+        unsigned long long TxFlowStallCnt;
+        unsigned long long TxDroppedPktCnt;
+        unsigned long long RxDroppedPktCnt;
+        unsigned long long RxDataPktCnt;
+        unsigned long long RxFlowPktCnt;
+        unsigned long long RxDwordCnt;
+        unsigned long long RxLenErrCnt;
+        unsigned long long RxMaxMinLenErrCnt;
+        unsigned long long RxICRCErrCnt;
+        unsigned long long RxVCRCErrCnt;
+        unsigned long long RxFlowCtrlErrCnt;
+        unsigned long long RxBadFormatCnt;
+        unsigned long long RxLinkProblemCnt;
+        unsigned long long RxEBPCnt;
+        unsigned long long RxLPCRCErrCnt;
+        unsigned long long RxBufOvflCnt;
+        unsigned long long RxTIDFullErrCnt;
+        unsigned long long RxTIDValidErrCnt;
+        unsigned long long RxPKeyMismatchCnt;
+        unsigned long long RxP0HdrEgrOvflCnt;
+        unsigned long long RxP1HdrEgrOvflCnt;
+        unsigned long long RxP2HdrEgrOvflCnt;
+        unsigned long long RxP3HdrEgrOvflCnt;
+        unsigned long long RxP4HdrEgrOvflCnt;
+        unsigned long long RxP5HdrEgrOvflCnt;
+        unsigned long long RxP6HdrEgrOvflCnt;
+        unsigned long long RxP7HdrEgrOvflCnt;
+        unsigned long long RxP8HdrEgrOvflCnt;
+        unsigned long long Reserved6;
+        unsigned long long Reserved7;
+        unsigned long long IBStatusChangeCnt;
+        unsigned long long IBLinkErrRecoveryCnt;
+        unsigned long long IBLinkDownedCnt;
+        unsigned long long IBSymbolErrCnt;
+};
+
+/*
+ * The next set of defines are for packet headers, and chip register
+ * and memory bits that are visible to and/or used by user-mode software
+ * The other bits that are used only by the driver or diags are in
+ * ipath_registers.h
+ */
+
+/* RcvHdrFlags bits */
+#define INFINIPATH_RHF_LENGTH_MASK 0x7FF
+#define INFINIPATH_RHF_LENGTH_SHIFT 0
+#define INFINIPATH_RHF_RCVTYPE_MASK 0x7
+#define
INFINIPATH_RHF_RCVTYPE_SHIFT 11 +#define INFINIPATH_RHF_EGRINDEX_MASK 0x7FF +#define INFINIPATH_RHF_EGRINDEX_SHIFT 16 +#define INFINIPATH_RHF_H_ICRCERR 0x80000000 +#define INFINIPATH_RHF_H_VCRCERR 0x40000000 +#define INFINIPATH_RHF_H_PARITYERR 0x20000000 +#define INFINIPATH_RHF_H_LENERR 0x10000000 +#define INFINIPATH_RHF_H_MTUERR 0x08000000 +#define INFINIPATH_RHF_H_IHDRERR 0x04000000 +#define INFINIPATH_RHF_H_TIDERR 0x02000000 +#define INFINIPATH_RHF_H_MKERR 0x01000000 +#define INFINIPATH_RHF_H_IBERR 0x00800000 +#define INFINIPATH_RHF_L_SWA 0x00008000 +#define INFINIPATH_RHF_L_SWB 0x00004000 + +/* infinipath header fields */ +#define INFINIPATH_I_VERS_MASK 0xF +#define INFINIPATH_I_VERS_SHIFT 28 +#define INFINIPATH_I_PORT_MASK 0xF +#define INFINIPATH_I_PORT_SHIFT 24 +#define INFINIPATH_I_TID_MASK 0x7FF +#define INFINIPATH_I_TID_SHIFT 13 +#define INFINIPATH_I_OFFSET_MASK 0x1FFF +#define INFINIPATH_I_OFFSET_SHIFT 0 + +/* K_PktFlags bits */ +#define INFINIPATH_KPF_INTR 0x1 + +/* SendPIO per-buffer control */ +#define INFINIPATH_SP_LENGTHP1_MASK 0x3FF +#define INFINIPATH_SP_LENGTHP1_SHIFT 0 +#define INFINIPATH_SP_INTR 0x80000000 +#define INFINIPATH_SP_TEST 0x40000000 +#define INFINIPATH_SP_TESTEBP 0x20000000 + +/* SendPIOAvail bits */ +#define INFINIPATH_SENDPIOAVAIL_BUSY_SHIFT 1 +#define INFINIPATH_SENDPIOAVAIL_CHECK_SHIFT 0 + +#endif /* _IPATH_COMMON_H */ diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h new file mode 100644 index 0000000..ba53fa3 --- /dev/null +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h @@ -0,0 +1,776 @@ +/* + * Copyright (c) 2003, 2004, 2005. PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. 
+ *
+ * $Id: ipath_kernel.h 4491 2005-12-15 22:20:31Z rjwalsh $
+ */
+
+#ifndef _IPATH_KERNEL_H
+#define _IPATH_KERNEL_H
+
+#ifndef PCI_VENDOR_ID_PATHSCALE /* not in pci.ids yet */
+#define PCI_VENDOR_ID_PATHSCALE 0x1fc1
+#define PCI_DEVICE_ID_PATHSCALE_INFINIPATH1 0xa
+#define PCI_DEVICE_ID_PATHSCALE_INFINIPATH2 0xd
+#endif
+
+/*
+ * This header file is the base header file for infinipath kernel code.
+ * ipath_user.h serves a similar purpose for user code.
+ */
+
+#include "ipath_common.h"
+#include "ipath_debug.h"
+#include "ipath_registers.h"
+#include
+#include
+
+/* only s/w major version of InfiniPath we can handle */
+#define IPATH_CHIP_VERS_MAJ 2U
+
+#define IPATH_CHIP_VERS_MIN 0U /* don't care about this except printing */
+
+extern struct infinipath_stats ipath_stats; /* temporary, maybe always */
+
+/* sysctl stuff */
+#define CTL_INFINIPATH 0x70736e69 /* "spin" as a hex value, top level */
+/* rest are in infinipath domain */
+#define CTL_INFINIPATH_DEBUG 1 /* infinipath_debug mask */
+#define CTL_INFINIPATH_TRACEMASK 2 /* trace mask */
+#define CTL_INFINIPATH_UNUSED 4 /* available for re-use */
+/* count of pio buffers reserved for kernel */
+#define CTL_INFINIPATH_LAYERBUF 8
+
+/* only s/w version of chip (simulator) we can handle for now */
+#define IPATH_CHIP_SWVERSION IPATH_CHIP_VERS_MAJ
+
+typedef struct _ipath_portdata {
+        /* minor number of devices, for ipath_type use */
+        unsigned port_unit;
+        /* array of struct page pointers */
+        struct page **port_rcvegrbuf_pages;
+        /* array of virtual addresses (from above) */
+        void **port_rcvegrbuf_virt;
+        void *port_rcvhdrq; /* rcvhdrq base, needs mmap before useful */
+        /* kernel virtual address where hdrqtail is updated */
+        uint64_t *port_rcvhdrtail_kvaddr;
+        struct page *port_rcvhdrtail_pagep; /* page * used for uaddr */
+        /*
+         * temp buffer for expected send setup, allocated at open, instead
+         * of each setup call
+         */
+        void *port_tid_pg_list;
+        wait_queue_head_t port_wait; /* when waiting for rcv or pioavail */
+        /*
+         * rcvegr bufs base, physical (must fit
+         * in 44 bits so 32 bit programs mmap64 44 bit works)
+         */
+        unsigned long port_rcvegr_phys;
+        /* for mmap of hdrq, must fit in 44 bits */
+        unsigned long port_rcvhdrq_phys;
+        /*
+         * the actual user address that we ipath_mlock'ed, so we can
+         * ipath_munlock it at close
+         */
+        unsigned long port_rcvhdrtail_uaddr;
+        /*
+         * number of opens on this instance (0 or 1; ignoring forks, dup,
+         * etc. for now)
+         */
for now) + */ + int port_cnt; + /* + * how much space to leave at start of eager TID entries for protocol + * use, on each TID + */ + unsigned port_egrskip; + unsigned port_port; /* instead of calculating it */ + uint32_t port_piobufs; /* chip offset of PIO buffers for this port */ + /* how many alloc_pages() chunks in port_rcvegrbuf_pages */ + uint32_t port_rcvegrbuf_chunks; + uint32_t port_rcvegrbufs_perchunk; /* how many egrbufs per chunk */ + /* order used with port_rcvegrbuf_pages */ + uint32_t port_rcvegrbuf_order; + uint32_t port_rcvhdrq_order; /* rcvhdrq order (for free_pages) */ + /* next expected TID to check when looking for free */ + uint32_t port_tidcursor; + /* next expected TID to check when looking for free */ + uint32_t port_flag; + /* WAIT_RCV that timed out, no interrupt */ + uint32_t port_rcvwait_to; + /* WAIT_PIO that timed out, no interrupt */ + uint32_t port_piowait_to; + uint32_t port_rcvnowait; /* WAIT_RCV already happened, no wait */ + uint32_t port_pionowait; /* WAIT_PIO already happened, no wait */ + uint32_t port_hdrqfull; /* total number of rcvhdrqfull errors */ + pid_t port_pid; /* pid of process using this port */ + /* same size as task_struct .comm[], but no define */ + char port_comm[16]; + uint16_t port_pkeys[4]; /* pkeys set by this use of this port */ +} ipath_portdata; + +struct sk_buff; + +/* + * control information for layered drivers + * This is used only as part of devdata via ipath_layer; + */ +struct _ipath_layer { + int (*l_intr) (const ipath_type, uint32_t); + int (*l_rcv) (const ipath_type, void *, struct sk_buff *); + int (*l_rcv_lid) (const ipath_type, void *); + uint16_t l_rcv_opcode; + uint16_t l_rcv_lid_opcode; +}; + +/* Verbs layer interface */ +struct _verbs_layer { + int (*l_piobufavail) (const ipath_type); + void (*l_rcv) (const ipath_type, void *, void *, u32); + void (*l_timer_cb) (const ipath_type); + struct timer_list l_timer; + unsigned l_flags; +}; + +/* + * These are the fields that only exist for port 0, not per port, so + * they aren't in ipath_devdata + */ +typedef struct _ipath_devdata { + /* driver data structures */ + /* mem-mapped pointer to base of chip regs */ + volatile uint64_t *ipath_kregbase; + /* end of mem-mapped chip space; range checking */ + uint64_t *ipath_kregend; + /* physical address of chip for io_remap, etc. */ + unsigned long ipath_physaddr; + /* base of memory alloced for ipath_kregbase, for free */ + uint64_t *ipath_kregalloc; + /* + * version of kregbase that doesn't have high bits set (for 32 bit + * programs, so mmap64 44 bit works) + */ + uint64_t *ipath_kregvirt; + /* virtual address where port0 rcvhdrqtail updated for this unit */ + volatile uint64_t *ipath_hdrqtailptr; + ipath_portdata **ipath_pd; /* ipath_cfgports pointers */ + /* sk_buffs used by port 0 eager receive queue */ + struct sk_buff **ipath_port0_skbs; + /* + * points to area where PIOavail registers will be DMA'ed. Has to + * be on a page of it's own, because the page will be mapped into user + * program space. This copy is *ONLY* ever written by DMA, not by + * the driver! 
+        volatile uint64_t *ipath_pioavailregs_dma;
+        /* original address for free */
+        volatile uint64_t *__ipath_pioavailregs_base;
+        /* physical address where updates occur */
+        unsigned long ipath_pioavailregs_phys;
+        struct _ipath_layer ipath_layer;
+        struct _verbs_layer verbs_layer;
+        /* total dwords sent (summed from counter) */
+        uint64_t ipath_sword;
+        /* total dwords received (summed from counter) */
+        uint64_t ipath_rword;
+        /* total packets sent (summed from counter) */
+        uint64_t ipath_spkts;
+        /* total packets received (summed from counter) */
+        uint64_t ipath_rpkts;
+        /* to make the receive interrupt failsafe */
+        uint64_t ipath_lastqtail;
+        uint64_t _ipath_status; /* ipath_statusp initially points to this. */
+        uint64_t ipath_guid; /* GUID for this interface, in network order */
+        /*
+         * aggregate of error bits reported since
+         * last cleared, for limiting of error reporting
+         */
+        uint64_t ipath_lasterror;
+        /*
+         * aggregate of error bits reported
+         * since last cleared, for limiting of hwerror reporting
+         */
+        uint64_t ipath_lasthwerror;
+        /*
+         * errors masked because they occur too fast,
+         * also includes errors that are always ignored (ipath_ignorederrs)
+         */
+        uint64_t ipath_maskederrs;
+        /* time at which to re-enable maskederrs */
+        cycles_t ipath_unmasktime;
+        /*
+         * errors always ignored (masked), at least
+         * for a given chip/device, because they are wrong or not useful
+         */
+        uint64_t ipath_ignorederrs;
+        /* count of egrfull errors, combined for all ports */
+        uint64_t ipath_last_tidfull;
+        uint64_t ipath_lastport0rcv_cnt; /* for ipath_qcheck() */
+
+        uint32_t ipath_kregsize; /* size of memory at ipath_kregbase */
+        /* number of registers used for pioavail */
+        uint32_t ipath_pioavregs;
+        uint32_t ipath_flags; /* IPATH_POLL, etc. */
+        /* ipath_flags sma is waiting for */
+        uint32_t ipath_sma_state_wanted;
+        /* last buffer for user use, first buf for kernel use is this index. */
+        uint32_t ipath_lastport_piobuf;
+        uint32_t pci_registered; /* driver is a registered pci device */
+        uint32_t ipath_stats_timer_active; /* is a stats timer active */
+        /* dwords sent read from infinipath counter */
+        uint32_t ipath_lastsword;
+        /* dwords received read from infinipath counter */
+        uint32_t ipath_lastrword;
+        /* sent packets read from infinipath counter */
+        uint32_t ipath_lastspkts;
+        /* received packets read from infinipath counter */
+        uint32_t ipath_lastrpkts;
+        uint32_t ipath_pbufsport; /* pio bufs allocated per port */
+        /*
+         * number of ports configured as max; zero is
+         * set to number chip supports, less gives more pio bufs/port, etc.
+         */
+        uint32_t ipath_cfgports;
+        /* our idea of the port0 rcvhdrq head offset */
+        uint32_t ipath_port0head;
+        uint32_t ipath_p0_hdrqfull; /* count of port 0 hdrqfull errors */
+
+        /*
+         * (*cfgports) used to suppress multiple instances of same port
+         * staying stuck at same point
+         */
+        uint32_t *ipath_lastrcvhdrqtails;
+        /*
+         * (*cfgports) used to suppress multiple instances of same port
+         * staying stuck at same point
+         */
+        uint32_t *ipath_lastegrheads;
+        /*
+         * index of last piobuffer we used. Speeds up searching, by starting
+         * at this point. Doesn't matter if multiple cpu's use and update,
+         * last updater is only write that matters. Whenever it wraps,
+         * we update shadow copies. Need a copy per device when we get to
+         * multiple devices
+         */
+        uint32_t ipath_lastpioindex;
+        uint32_t ipath_freezelen; /* max length of freezemsg */
+        uint32_t ipath_consec_nopiobuf; /* consecutive times we wanted a PIO buffer
+                                         * but were unable to get one */
+        uint32_t ipath_upd_pio_shadow; /* hint that we should update
+                                        * ipath_pioavailshadow before looking for a PIO buffer */
+        uint32_t ipath_nosma_bufs; /* sequential tries for SMA send and no bufs */
+        uint32_t ipath_nosma_secs; /* duration (seconds) ipath_nosma_bufs set */
+        /* HT/PCI Vendor ID (here for NodeInfo) */
+        uint16_t ipath_vendorid;
+        /* HT/PCI Device ID (here for NodeInfo) */
+        uint16_t ipath_deviceid;
+        /* offset in HT config space of slave/primary interface block */
+        uint8_t ipath_ht_slave_off;
+        int ipath_mtrr; /* registration handle for WRCOMB setting on */
+        /* ref count of how many users set each pkey */
+        atomic_t ipath_pkeyrefs[4];
+        /* shadow copy of all exptids physaddr; used only by funcsim */
+        uint64_t *ipath_tidsimshadow;
+        /* shadow copy of struct page *'s for exp tid pages */
+        struct page **ipath_pageshadow;
+        /*
+         * IPATH_STATUS_*
+         * this address is mapped readonly into user processes so they can
+         * get status cheaply, whenever they want.
+         */
+        uint64_t *ipath_statusp;
+        char *ipath_freezemsg; /* freeze msg if hw error put chip in freeze */
+        struct pci_dev *pcidev; /* pci access data structure */
+        /* timer used to prevent stats overflow, error throttling, etc. */
+        struct timer_list ipath_stats_timer;
+        /* only allow one interrupt at a time. */
+        unsigned long ipath_rcv_pending;
+
+        /*
+         * shadow copies of registers; size indicates read access size.
+         * Most of them are readonly, but some are write-only registers, where
+         * we manipulate the bits in the shadow copy, and then write the shadow
+         * copy to infinipath.
+         * We deliberately make most of these 32 bits, since they have
+         * restricted range, and for any that we read, we want to generate
+         * 32 bit accesses, since Opteron will generate 2 separate 32 bit
+         * HT transactions for a 64 bit read, and we want to avoid unnecessary
+         * HT transactions
+         */
+
+        /* This is the 64 bit group */
+        /*
+         * shadow of pioavail, check to be sure it's large enough at
+         * init time.
+         */
+        uint64_t ipath_pioavailshadow[8];
+        uint64_t ipath_gpio_out; /* shadow of kr_gpio_out, for rmw ops */
+        /* kr_revision value (also see ipath_majrev) */
+        uint64_t ipath_revision;
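        /*
         * (Illustration, not part of the patch: a PIO buffer search
         * using ipath_lastpioindex and the 2-bits-per-buffer
         * ipath_pioavailshadow[] above might look roughly like this;
         * locking and refreshing the shadow from the DMA'ed copy are
         * omitted, and a set busy bit is assumed to mean "in use":
         *
         *      i = dd->ipath_lastpioindex;
         *      for (n = 0; n < piobcnt; n++, i = (i + 1) % piobcnt)
         *              if (!(dd->ipath_pioavailshadow[i / 32] >>
         *                    ((i % 32) * 2 + INFINIPATH_SENDPIOAVAIL_BUSY_SHIFT) & 1))
         *                      break;  // buffer i appears free
         *      dd->ipath_lastpioindex = i;
         * )
         */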
+        /* shadow of ibcctrl, for interrupt handling of link changes, etc. */
+        uint64_t ipath_ibcctrl;
+        /*
+         * last ibcstatus, to suppress "duplicate" status change messages,
+         * mostly from 2 to 3
+         */
+        uint64_t ipath_lastibcstat;
+        /* mask of hardware errors that are enabled */
+        uint64_t ipath_hwerrmask;
+        uint64_t ipath_extctrl; /* shadow the gpio output contents */
+
+        /* these are the "32 bit" regs */
+        /*
+         * number of GUIDs in the flash for this interface; may need some
+         * rethinking for setting on other ifaces
+         */
+        uint32_t ipath_nguid;
+        uint32_t ipath_rcvctrl; /* shadow kr_rcvctrl */
+        uint32_t ipath_sendctrl; /* shadow kr_sendctrl */
+        uint32_t ipath_rcvhdrcnt; /* value we put in kr_rcvhdrcnt */
+        uint32_t ipath_rcvhdrsize; /* value we put in kr_rcvhdrsize */
+        uint32_t ipath_rcvhdrentsize; /* value we put in kr_rcvhdrentsize */
+        /* byte offset of last entry in rcvhdrq */
+        uint32_t ipath_hdrqlast;
+        uint32_t ipath_portcnt; /* kr_portcnt value */
+        uint32_t ipath_palign; /* kr_pagealign value */
+        uint32_t ipath_piobcnt; /* kr_sendpiobufcnt value */
+        uint32_t ipath_piobufbase; /* kr_sendpiobufbase value */
+        uint32_t ipath_piosize; /* kr_sendpiosize */
+        uint32_t ipath_rcvegrbase; /* kr_rcvegrbase value */
+        uint32_t ipath_rcvegrcnt; /* kr_rcvegrcnt value */
+        uint32_t ipath_rcvtidbase; /* kr_rcvtidbase value */
+        uint32_t ipath_rcvtidcnt; /* kr_rcvtidcnt value */
+        uint32_t ipath_sregbase; /* kr_sendregbase */
+        uint32_t ipath_uregbase; /* kr_userregbase */
+        uint32_t ipath_cregbase; /* kr_counterregbase */
+        uint32_t ipath_control; /* shadow the control register contents */
+        uint32_t ipath_pcirev; /* PCI revision register (HTC rev on FPGA) */
+
+        uint32_t ipath_ibmtu; /* The MTU programmed for this unit */
+        /*
+         * The max size IB packet, including IB headers, that we can send.
+         * Starts same as ipath_piosize, but is affected when ibmtu is
+         * changed, or by size of eager buffers
+         */
+        uint32_t ipath_ibmaxlen;
+        /*
+         * ibmaxlen at init time, limited by chip and by receive buffer size.
+         * Not changed after init.
+         */
+        uint32_t ipath_init_ibmaxlen;
+        /* size we allocate for each rcvegrbuffer */
+        uint32_t ipath_rcvegrbufsize;
+        uint32_t ipath_htwidth; /* width (2,4,8,16,32) from HT config reg */
+        uint32_t ipath_htspeed; /* HT speed (200,400,800,1000) from HT config */
+        /* bitmap of ports waiting for PIO avail intr */
+        uint32_t ipath_portpiowait;
+        /*
+         * number of sequential ibcstatus changes for polling active/quiet
+         * (i.e., link not coming up).
+         */
+        uint32_t ipath_ibpollcnt;
+        uint16_t ipath_mlid; /* MLID programmed for this instance */
+        uint16_t ipath_lid; /* LID programmed for this instance */
+        /* list of pkeys programmed; 0 means not set */
+        uint16_t ipath_pkeys[4];
+        uint8_t ipath_serial[12]; /* ASCII serial number, from flash */
+        uint8_t ipath_majrev; /* chip major rev, from ipath_revision */
+        uint8_t ipath_minrev; /* chip minor rev, from ipath_revision */
+        uint8_t ipath_boardrev; /* board rev, from ipath_revision */
+        uint8_t ipath_unit; /* Unit number for this chip */
+} ipath_devdata;
+
+/*
+ * A segment is a linear region of low physical memory.
+ * XXX Maybe we should use phys addr here and kmap()/kunmap()
+ * Used by the verbs layer.
+ */
+struct ipath_seg {
+        void *vaddr;
+        u64 length;
+};
+
+/* The number of ipath_segs that fit in a page. */
+#define IPATH_SEGSZ (PAGE_SIZE / sizeof (struct ipath_seg))
+
+struct ipath_segarray {
+        struct ipath_seg segs[IPATH_SEGSZ];
+};
+
+/*
+ * Used by the verbs layer.
+ */
+struct ipath_mregion {
+        u64 user_base; /* User's address for this region */
+        u64 iova; /* IB start address of this region */
+        size_t length;
+        u32 lkey;
+        u32 offset; /* offset (bytes) to start of region */
+        int access_flags;
+        u32 max_segs; /* number of ipath_segs in all the arrays */
+        u32 mapsz; /* size of the map array */
+        struct ipath_segarray *map[0]; /* the segments */
+};
+
+/*
+ * These keep track of the copy progress within a memory region.
+ * Used by the verbs layer.
+ */
+struct ipath_sge {
+        struct ipath_mregion *mr;
+        void *vaddr; /* current pointer into the segment */
+        u32 sge_length; /* length of the SGE */
+        u32 length; /* remaining length of the segment */
+        u16 m; /* current index: mr->map[m] */
+        u16 n; /* current index: mr->map[m]->segs[n] */
+};
+
+struct ipath_sge_state {
+        struct ipath_sge *sg_list; /* next SGE to be used if any */
+        struct ipath_sge sge; /* progress state for the current SGE */
+        u8 num_sge;
+};
+
+extern ipath_devdata devdata[];
+#define IPATH_UNIT(p) ((p)-devdata)
+extern const uint32_t infinipath_max; /* number of units (chips) supported */
+extern const char *ipath_minor_names[];
+
+extern int ipath_diags_enabled; /* is diags mode enabled? */
+
+/* clean up any per-chip chip-specific stuff */
+void ipath_chip_cleanup(ipath_devdata *);
+void ipath_chip_done(void); /* clean up any chip type-specific stuff */
+void ipath_handle_hwerrors(const ipath_type, char *, int);
+int ipath_validate_rev(ipath_devdata *);
+void ipath_clear_init_hwerrs(const ipath_type);
+
+/*
+ * This is here to simplify compatibility with source that supports
+ * multiple chip types
+ */
+void ipath_ht_get_boardname(const ipath_type t, char *name, size_t namelen);
+
+/* these are primarily for SMA, but are also used by diags */
+int ipath_send_smapkt(struct ipath_sendpkt *);
+
+int ipath_wait_linkstate(const ipath_type, uint32_t, int);
+void ipath_down_link(const ipath_type);
+void ipath_set_ib_lstate(const ipath_type, int);
+void ipath_kreceive(const ipath_type);
+int ipath_setrcvhdrsize(const ipath_type, unsigned);
+
+/* for use in system calls, where we want to know device type, etc. */
+#define port_fp(fp) (((fp)->private_data>(void*)255UL)?((ipath_portdata *)fp->private_data):NULL)
+
+/*
+ * somebody is waiting in poll (initially
+ * used only for simulation notification of register/infinipath memory
+ * changes)
+ */
+#define IPATH_POLL 0x1
+#define IPATH_INITTED 0x2 /* The chip or simulator is up and initted */
+#define IPATH_RCVHDRSZ_SET 0x4 /* set if any user code has set kr_rcvhdrsize */
+/* The chip or simulator is present and valid for accesses */
+#define IPATH_PRESENT 0x8
+/* HT link0 is only 8 bits wide, ignore upper byte crc errors, etc. */
+#define IPATH_8BIT_IN_HT0 0x10
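For illustration, walking a memory region with the ipath_sge cursor shown above: after copying nbytes from the current segment, the m/n indices step through the two-level map[]/segs[] arrays. A sketch only (the function name is made up; it assumes nbytes stays within the current segment and skips num_sge handling):

        static void ipath_sge_advance(struct ipath_sge *sge, u32 nbytes)
        {
                sge->vaddr = (u8 *) sge->vaddr + nbytes;
                sge->length -= nbytes;
                sge->sge_length -= nbytes;
                if (sge->length == 0 && sge->sge_length != 0) {
                        /* segment exhausted: step to the next ipath_seg */
                        if (++sge->n >= IPATH_SEGSZ) {
                                sge->m++;       /* next page of segment pointers */
                                sge->n = 0;
                        }
                        sge->vaddr = sge->mr->map[sge->m]->segs[sge->n].vaddr;
                        sge->length = sge->mr->map[sge->m]->segs[sge->n].length;
                }
        }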
+/* HT link1 is only 8 bits wide, ignore upper byte crc errors, etc. */
+#define IPATH_8BIT_IN_HT1 0x20
+/* The link is down (or not yet up 0x11 or earlier) */
+#define IPATH_LINKDOWN 0x40
+#define IPATH_LINKINIT 0x80 /* The link level is up (0x11) */
+/* The link is in the armed (0x21) state */
+#define IPATH_LINKARMED 0x100
+/* The link is in the active (0x31) state */
+#define IPATH_LINKACTIVE 0x200
+/* The link was taken down, but no interrupt yet */
+#define IPATH_LINKUNK 0x400
+/* link being moved to armed (0x21) state */
+#define IPATH_LINK_TOARMED 0x800
+/* link being moved to active (0x31) state */
+#define IPATH_LINK_TOACTIVE 0x1000
+/* linkinit cmd is SLEEP, move to POLL */
+#define IPATH_LINK_SLEEPING 0x2000
+/* no IB cable, or no device on IB cable */
+#define IPATH_NOCABLE 0x4000
+/* Supports port zero per packet receive interrupts via GPIO */
+#define IPATH_GPIO_INTR 0x8000
+
+/* portdata flag values */
+#define IPATH_PORT_WAITING_RCV 0x4 /* waiting for a packet to arrive */
+/* waiting for a PIO buffer to be available */
+#define IPATH_PORT_WAITING_PIO 0x8
+
+/*
+ * do the chip initialization, either on startup for the real hardware,
+ * or via ioctl for simulation.
+ */
+extern int ipath_init_chip(const ipath_type);
+/* free up any allocated data at close */
+extern void ipath_free_data(ipath_portdata * dd);
+extern void ipath_init_picotime(void); /* init cycles to picosecs conversion */
+extern int ipath_bringup_serdes(const ipath_type);
+extern int ipath_waitfor_mdio_cmdready(const ipath_type);
+extern int ipath_waitfor_complete(const ipath_type, ipath_kreg, uint64_t,
+                                  uint64_t *);
+extern void ipath_quiet_serdes(const ipath_type);
+extern void ipath_get_boardname(uint8_t, char *, size_t);
+extern int ipath_getpiobuf(int);
+extern int ipath_bufavail(int);
+extern int ipath_rd_eeprom(const ipath_type port_unit,
+                           struct ipath_eeprom_req *);
+extern uint64_t ipath_snap_cntr(const ipath_type, ipath_creg);
+
+/*
+ * these should be somewhat dynamic someday, although they are fixed
+ * for all users of the device on any given load.
+ *
+ * NOTE: There is a VM bug in the 2.4 Kernels similar to the one Dave
+ * fixed in the 2.6 Kernel. When using large or discontinuous memory,
+ * we get random kernel oops. So, in 2.4, we are just going to stick
+ * with 4k chunks instead of 64k chunks.
+ */
+/* (words) room for all IB headers and KD proto header */
+#define IPATH_RCVHDRENTSIZE 16
+/*
+ * 64K, which is about all you can hope to get contiguous. API allows
+ * users to request a size, for now I'm ignoring that.
+ */
+#define IPATH_RCVHDRCNT 1024
+
+/*
+ * number of words in KD protocol header if not set by ipath_userinit();
+ * this uses the full 64 bytes of rcvhdrentry
+ */
+#define IPATH_DFLT_RCVHDRSIZE 9
+
+#define IPATH_MDIO_CMD_WRITE 1
+#define IPATH_MDIO_CMD_READ 2
+#define IPATH_MDIO_CLD_DIV 25 /* to get 2.5 MHz mdio clock */
+#define IPATH_MDIO_CMDVALID 0x40000000 /* bit 30 */
+#define IPATH_MDIO_DATAVALID 0x80000000 /* bit 31 */
+#define IPATH_MDIO_CTRL_STD 0x0
+
+#define IPATH_MDIO_REQ(cmd,dev,reg,data) ( (((uint64_t)IPATH_MDIO_CLD_DIV) << 32) | \
+        ((cmd) << 26) | ((dev)<<21) | ((reg) << 16) | ((data) & 0xFFFF))
+
+#define IPATH_MDIO_CTRL_XGXS_REG_8 0x8 /* signal and fifo status, in bank 31 */
+
+/* controls loopback, redundancy */
+#define IPATH_MDIO_CTRL_8355_REG_1 0x10
+#define IPATH_MDIO_CTRL_8355_REG_2 0x11 /* premph, encdec, etc. */
+#define IPATH_MDIO_CTRL_8355_REG_6 0x15 /* Kchars, etc. */
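As a usage note for IPATH_MDIO_REQ(): serdes bringup code could compose, say, a read of the XGXS signal/fifo status register and write it to kr_mdio (declared in ipath_registers.h below). A sketch; it assumes the "bank 31" mentioned in the comment above is what goes in the dev field, and the poll loop for the result is elided:

        uint64_t req = IPATH_MDIO_REQ(IPATH_MDIO_CMD_READ, 31,
                                      IPATH_MDIO_CTRL_XGXS_REG_8, 0);
        ipath_kput_kreg(t, kr_mdio, req);
        /* ...then poll ipath_kget_kreg64(t, kr_mdio) for IPATH_MDIO_DATAVALID */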
+#define IPATH_MDIO_CTRL_8355_REG_9 0x18
+#define IPATH_MDIO_CTRL_8355_REG_10 0x1D
+
+/*
+ * these function similarly to the mlock/munlock system calls.
+ * ipath_mlock() is used to pin an address range (if not already pinned),
+ * and optionally return the list of physical addresses
+ * ipath_munlock() does the obvious, and ipath_mlock_cleanup() cleans up all
+ * private memory, used at driver unload.
+ * ipath_mlock_nocopy() is similar to mlock, but only one page, and marks
+ * the vm so the page isn't taken away on a fork.
+ */
+int ipath_mlock(unsigned long, size_t, struct page **);
+int ipath_mlock_nocopy(unsigned long, struct page **);
+int ipath_munlock(size_t, struct page **);
+void ipath_mlock_cleanup(ipath_portdata *);
+int ipath_eeprom_read(const ipath_type, uint8_t, void *, int);
+int ipath_eeprom_write(const ipath_type, uint8_t, void *, int);
+
+/* these are used for the registers that vary with port */
+void ipath_kput_kreg_port(const ipath_type, ipath_kreg, unsigned, uint64_t);
+uint64_t ipath_kget_kreg64_port(const ipath_type, ipath_kreg, unsigned);
+
+#define ipath_func_krecord(a)
+#define ipath_func_urecord(a, b)
+#define ipath_func_mrecord(a, b)
+#define ipath_func_rkrecord(a)
+#define ipath_func_rurecord(a, b)
+#define ipath_func_rmrecord(a, b)
+#define ipath_func_rsrecord(a)
+#define ipath_func_rcrecord(a)
+
+/*
+ * we could have a single register get/put routine, that takes a group
+ * type, but for now I've chosen to have separate routines; I think this
+ * is somewhat clearer and cleaner, but we'll see. It also gives us some
+ * error checking. 64 bit register reads should always work, but are
+ * inefficient on opteron (2 separate HT 32 bit reads), so we use kreg32
+ * wherever possible. User register and counter register reads are always
+ * 32 bit reads, so only one form of those routines
+ */
+
+/*
+ * return contents of a user register group register; not normally
+ * used in the kernel, except port 0
+ */
+static __inline__ uint32_t ipath_kget_ureg32(const ipath_type, ipath_ureg, int)
+        __attribute__ ((always_inline));
+/* return contents of a kernel register group register */
+static __inline__ uint64_t ipath_kget_kreg64(const ipath_type, ipath_kreg)
+        __attribute__ ((always_inline));
+static __inline__ uint32_t ipath_kget_kreg32(const ipath_type, ipath_kreg)
+        __attribute__ ((always_inline));
+/* return contents of a counter register group register */
+static __inline__ uint32_t ipath_kget_creg32(const ipath_type, ipath_creg)
+        __attribute__ ((always_inline));
+
+/*
+ * change contents of a user register group register; not normally
+ * used in the kernel, except port 0
+ */
+static __inline__ void ipath_kput_ureg(const ipath_type, ipath_ureg, uint64_t,
+                                       int) __attribute__ ((always_inline));
+/* change contents of a kernel register group register */
+static __inline__ void ipath_kput_kreg(const ipath_type, ipath_kreg, uint64_t)
+        __attribute__ ((always_inline));
+static __inline__ void ipath_kput_memq(const ipath_type, volatile uint64_t *,
+                                       uint64_t)
+        __attribute__ ((always_inline));
+
+#ifdef IPATH_COSIM
+extern __u32 sim_readl(const volatile void __iomem * addr);
+extern __u64 sim_readq(const volatile void __iomem * addr);
+extern void sim_writel(__u32 val, volatile void __iomem * addr);
+extern void sim_writeq(__u64 val, volatile void __iomem * addr);
+#define ipath_readl(addr) sim_readl(addr)
+#define ipath_readq(addr) sim_readq(addr)
+#define ipath_writel(val, addr) sim_writel(val, addr)
+#define ipath_writeq(val, addr) sim_writeq(val, addr)
+#else
+#define ipath_readl(addr) readl(addr) +#define ipath_readq(addr) readq(addr) +#define ipath_writel(val, addr) writel(val, addr) +#define ipath_writeq(val, addr) writeq(val, addr) +#endif + +/* + * At the moment, none of the s-registers are writable, so no ipath_kput_sreg() + * At the moment, none of the c-registers are writable, so no ipath_kput_creg() + */ + +/* + * return the contents of a register that is virtualized to be per port + * prints a debug message and returns ~0ULL on errors (not distinguishable from + * valid contents at runtime; we may add a separate error variable at some + * point). Initially, ipath_dev isn't needed because I only have one simulation + * but that will change soon + * This is normally not used by the kernel, but may be for debugging, + * and has a different implementation than user mode, which is why + * it's not in _common.h + */ +static __inline__ uint32_t ipath_kget_ureg32(const ipath_type stype, + ipath_ureg regno, int port) +{ + uint64_t *ubase; + + ubase = (uint64_t *) (devdata[stype].ipath_uregbase + + (char *)devdata[stype].ipath_kregbase + + devdata[stype].ipath_palign * port); + return ubase ? ipath_readl(ubase + regno) : 0; +} + +/* + * change the contents of a register that is virtualized to be per port + * prints a debug message and returns 1 on errors, 0 on success. + * Initially, ipath_dev isn't needed because I only have one simulation + * but that will change soon + */ +static __inline__ void ipath_kput_ureg(const ipath_type stype, ipath_ureg regno, + uint64_t value, int port) +{ + uint64_t *ubase; + + ubase = (uint64_t *) (devdata[stype].ipath_uregbase + + (char *)devdata[stype].ipath_kregbase + + devdata[stype].ipath_palign * port); + if (ubase) + ipath_writeq(value, &ubase[regno]); +} + +static __inline__ uint32_t ipath_kget_kreg32(const ipath_type stype, + ipath_kreg regno) +{ + volatile uint32_t *kreg32; + + if (!devdata[stype].ipath_kregbase) + return ~0; + + kreg32 = (volatile uint32_t *)&devdata[stype].ipath_kregbase[regno]; + return ipath_readl(kreg32); +} + +static __inline__ uint64_t ipath_kget_kreg64(const ipath_type stype, + ipath_kreg regno) +{ + if (!devdata[stype].ipath_kregbase) + return ~0ULL; + + return ipath_readq(&devdata[stype].ipath_kregbase[regno]); +} + +static __inline__ void ipath_kput_kreg(const ipath_type stype, + ipath_kreg regno, uint64_t value) +{ + if (devdata[stype].ipath_kregbase) + ipath_writeq(value, &devdata[stype].ipath_kregbase[regno]); +} + +static __inline__ uint32_t ipath_kget_creg32(const ipath_type stype, + ipath_sreg regno) +{ + uint64_t *cbase; + + cbase = (uint64_t *) (devdata[stype].ipath_cregbase + + (char *)devdata[stype].ipath_kregbase); + return cbase ? ipath_readl(cbase + regno) : 0; +} + +/* + * caddr is the destination chip address (full pointer, not offset), + * val is the qword to write there. We only handle a single qword (8 bytes). + * This is not used for copies to the PIO buffer, just TID updates, etc. + * This function is needed for simulation, and also localizes all chip + * mem writes for better/simpler debugging. 
+ */ +static __inline__ void ipath_kput_memq(const ipath_type stype, + volatile uint64_t * caddr, uint64_t val) +{ + if (devdata[stype].ipath_kregbase) + ipath_writeq(val, caddr); +} + +#endif /* _IPATH_KERNEL_H */ diff --git a/drivers/infiniband/hw/ipath/ipath_layer.h b/drivers/infiniband/hw/ipath/ipath_layer.h new file mode 100644 index 0000000..3b7954d --- /dev/null +++ b/drivers/infiniband/hw/ipath/ipath_layer.h @@ -0,0 +1,131 @@ +/* + * Copyright (c) 2003, 2004, 2005. PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. + * + * $Id: ipath_layer.h 4365 2005-12-10 00:04:16Z rjwalsh $ + */ + +#ifndef _IPATH_LAYER_H +#define _IPATH_LAYER_H + +/* + * This header file is for symbols shared between the infinipath driver + * and drivers layered upon it (such as ipath). 
+ */ + +struct sk_buff; +struct ipath_sge_state; + +struct ipath_layer_counters { + uint64_t symbol_error_counter; + uint64_t link_error_recovery_counter; + uint64_t link_downed_counter; + uint64_t port_rcv_errors; + uint64_t port_rcv_remphys_errors; + uint64_t port_xmit_discards; + uint64_t port_xmit_data; + uint64_t port_rcv_data; + uint64_t port_xmit_packets; + uint64_t port_rcv_packets; +}; + +extern int ipath_layer_register(const ipath_type device, + int (*l_intr) (const ipath_type, uint32_t), + int (*l_rcv) (const ipath_type, void *, + struct sk_buff *), + uint16_t rcv_opcode, + int (*l_rcv_lid) (const ipath_type, void *), + uint16_t rcv_lid_opcode); +extern int ipath_verbs_register(const ipath_type device, + int (*l_piobufavail) (const ipath_type device), + void (*l_rcv) (const ipath_type device, + void *rhdr, void *data, + u32 tlen), + void (*l_timer_cb) (const ipath_type device)); +extern void ipath_verbs_unregister(const ipath_type device); +extern int ipath_layer_open(const ipath_type device, uint32_t * pktmax); +extern int16_t ipath_layer_get_lid(const ipath_type device); +extern int ipath_layer_get_mac(const ipath_type device, uint8_t *); +extern int16_t ipath_layer_get_bcast(const ipath_type device); +extern int ipath_layer_get_num_of_dev(void); +extern int ipath_layer_get_cr_errpkey(const ipath_type device); +extern int ipath_kset_linkstate(uint32_t arg); +extern int ipath_kset_mtu(uint32_t); +extern void ipath_set_sps_lid(const ipath_type, uint32_t); +extern void ipath_layer_close(const ipath_type device); +extern int ipath_layer_send(const ipath_type device, void *hdr, void *data, + uint32_t datalen); +extern int ipath_verbs_send(const ipath_type device, uint32_t hdrwords, + uint32_t *hdr, uint32_t len, + struct ipath_sge_state *ss); +extern int ipath_layer_send_skb(struct copy_data_s *cdata); +extern void ipath_layer_set_piointbufavail_int(const ipath_type device); +extern void ipath_get_boardname(const ipath_type, char *name, size_t namelen); +extern void ipath_layer_snapshot_counters(const ipath_type t, u64 * swords, + u64 * rwords, u64 * spkts, + u64 * rpkts); +extern void ipath_layer_get_counters(const ipath_type device, + struct ipath_layer_counters *cntrs); +extern void ipath_layer_want_buffer(const ipath_type t); +extern int ipath_layer_set_guid(const ipath_type t, uint64_t guid); +extern uint64_t ipath_layer_get_guid(const ipath_type t); +extern uint32_t ipath_layer_get_nguid(const ipath_type t); +extern int ipath_layer_query_device(const ipath_type t, uint32_t * vendor, + uint32_t * boardrev, uint32_t * majrev, + uint32_t * minrev); +extern uint32_t ipath_layer_get_flags(const ipath_type t); +extern struct device *ipath_layer_get_pcidev(const ipath_type t); +extern uint16_t ipath_layer_get_deviceid(const ipath_type t); +extern uint64_t ipath_layer_get_lastibcstat(const ipath_type t); +extern uint32_t ipath_layer_get_ibmtu(const ipath_type t); +extern void ipath_layer_enable_timer(const ipath_type t); +extern void ipath_layer_disable_timer(const ipath_type t); +extern unsigned ipath_verbs_get_flags(const ipath_type device); +extern void ipath_verbs_set_flags(const ipath_type device, unsigned flags); +extern unsigned ipath_layer_get_npkeys(const ipath_type device); +extern unsigned ipath_layer_get_pkey(const ipath_type device, unsigned index); +extern void ipath_layer_get_pkeys(const ipath_type device, uint16_t *pkeys); +extern int ipath_layer_set_pkeys(const ipath_type device, uint16_t *pkeys); + +/* ipath_ether interrupt values */ +#define IPATH_LAYER_INT_IF_UP 
0x2 +#define IPATH_LAYER_INT_IF_DOWN 0x4 +#define IPATH_LAYER_INT_LID 0x8 +#define IPATH_LAYER_INT_SEND_CONTINUE 0x10 +#define IPATH_LAYER_INT_BCAST 0x40 + +/* _verbs_layer.l_flags */ +#define IPATH_VERBS_KERNEL_SMA 0x1 + +#endif /* _IPATH_LAYER_H */ diff --git a/drivers/infiniband/hw/ipath/ipath_registers.h b/drivers/infiniband/hw/ipath/ipath_registers.h new file mode 100644 index 0000000..6bf0c8b --- /dev/null +++ b/drivers/infiniband/hw/ipath/ipath_registers.h @@ -0,0 +1,359 @@ +/* + * Copyright (c) 2003, 2004, 2005. PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. + * + * $Id: ipath_registers.h 4365 2005-12-10 00:04:16Z rjwalsh $ + */ + +#ifndef _IPATH_REGISTERS_H +#define _IPATH_REGISTERS_H + +/* + * This file should only be included by kernel source, and by the diags. + * It defines the registers, and their contents, for the InfiniPath HT-400 chip + */ + +/* + * These are the InfiniPath register and buffer bit definitions, + * that are visible to software, and needed only by the kernel + * and diag code. A few, that are visible to protocol and user + * code are in ipath_common.h. 
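Putting the ipath_layer.h interface just shown together: a layered driver such as ipath_ether would register its interrupt and receive callbacks once per unit. A hypothetical wiring, using the ith4x sub-opcodes from ips_common.h further down in this posting; the stub callbacks do nothing:

        static int my_intr(const ipath_type t, uint32_t what) { return 0; }
        static int my_rcv(const ipath_type t, void *hdr, struct sk_buff *skb) { return 0; }
        static int my_rcv_lid(const ipath_type t, void *hdr) { return 0; }

        static int __init my_layer_init(void)
        {
                /* unit 0 only; real code would loop over ipath_layer_get_num_of_dev() */
                return ipath_layer_register(0, my_intr, my_rcv, OPCODE_ENCAP,
                                            my_rcv_lid, OPCODE_LID_ARP);
        }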
Some bits are specific
+ * to a given chip implementation, and have been moved to the
+ * chip-specific source file
+ */
+
+/* kr_revision bits */
+#define INFINIPATH_R_CHIPREVMINOR_MASK 0xFF
+#define INFINIPATH_R_CHIPREVMINOR_SHIFT 0
+#define INFINIPATH_R_CHIPREVMAJOR_MASK 0xFF
+#define INFINIPATH_R_CHIPREVMAJOR_SHIFT 8
+#define INFINIPATH_R_ARCH_MASK 0xFF
+#define INFINIPATH_R_ARCH_SHIFT 16
+#define INFINIPATH_R_SOFTWARE_MASK 0xFF
+#define INFINIPATH_R_SOFTWARE_SHIFT 24
+#define INFINIPATH_R_BOARDID_MASK 0xFF
+#define INFINIPATH_R_BOARDID_SHIFT 32
+#define INFINIPATH_R_SIMULATOR 0x8000000000000000ULL
+
+/* kr_control bits */
+#define INFINIPATH_C_FREEZEMODE 0x00000002
+#define INFINIPATH_C_LINKENABLE 0x00000004
+
+/* kr_sendctrl bits */
+#define INFINIPATH_S_DISARMPIOBUF_SHIFT 16
+#define INFINIPATH_S_ABORT 0x00000001U
+#define INFINIPATH_S_PIOINTBUFAVAIL 0x00000002U
+#define INFINIPATH_S_PIOBUFAVAILUPD 0x00000004U
+#define INFINIPATH_S_PIOENABLE 0x00000008U
+#define INFINIPATH_S_DISARM 0x80000000U
+
+/* kr_rcvctrl bits */
+#define INFINIPATH_R_PORTENABLE_SHIFT 0
+#define INFINIPATH_R_INTRAVAIL_SHIFT 16
+#define INFINIPATH_R_TAILUPD 0x80000000
+
+/* kr_intstatus, kr_intclear, kr_intmask bits */
+#define INFINIPATH_I_RCVURG_SHIFT 0
+#define INFINIPATH_I_RCVAVAIL_SHIFT 12
+#define INFINIPATH_I_ERROR 0x80000000
+#define INFINIPATH_I_SPIOSENT 0x40000000
+#define INFINIPATH_I_SPIOBUFAVAIL 0x20000000
+#define INFINIPATH_I_GPIO 0x10000000
+
+/* kr_errorstatus, kr_errorclear, kr_errormask bits */
+#define INFINIPATH_E_RFORMATERR 0x0000000000000001ULL
+#define INFINIPATH_E_RVCRC 0x0000000000000002ULL
+#define INFINIPATH_E_RICRC 0x0000000000000004ULL
+#define INFINIPATH_E_RMINPKTLEN 0x0000000000000008ULL
+#define INFINIPATH_E_RMAXPKTLEN 0x0000000000000010ULL
+#define INFINIPATH_E_RLONGPKTLEN 0x0000000000000020ULL
+#define INFINIPATH_E_RSHORTPKTLEN 0x0000000000000040ULL
+#define INFINIPATH_E_RUNEXPCHAR 0x0000000000000080ULL
+#define INFINIPATH_E_RUNSUPVL 0x0000000000000100ULL
+#define INFINIPATH_E_REBP 0x0000000000000200ULL
+#define INFINIPATH_E_RIBFLOW 0x0000000000000400ULL
+#define INFINIPATH_E_RBADVERSION 0x0000000000000800ULL
+#define INFINIPATH_E_RRCVEGRFULL 0x0000000000001000ULL
+#define INFINIPATH_E_RRCVHDRFULL 0x0000000000002000ULL
+#define INFINIPATH_E_RBADTID 0x0000000000004000ULL
+#define INFINIPATH_E_RHDRLEN 0x0000000000008000ULL
+#define INFINIPATH_E_RHDR 0x0000000000010000ULL
+#define INFINIPATH_E_RIBLOSTLINK 0x0000000000020000ULL
+#define INFINIPATH_E_SMINPKTLEN 0x0000000020000000ULL
+#define INFINIPATH_E_SMAXPKTLEN 0x0000000040000000ULL
+#define INFINIPATH_E_SUNDERRUN 0x0000000080000000ULL
+#define INFINIPATH_E_SPKTLEN 0x0000000100000000ULL
+#define INFINIPATH_E_SDROPPEDSMPPKT 0x0000000200000000ULL
+#define INFINIPATH_E_SDROPPEDDATAPKT 0x0000000400000000ULL
+#define INFINIPATH_E_SPIOARMLAUNCH 0x0000000800000000ULL
+#define INFINIPATH_E_SUNEXPERRPKTNUM 0x0000001000000000ULL
+#define INFINIPATH_E_SUNSUPVL 0x0000002000000000ULL
+#define INFINIPATH_E_IBSTATUSCHANGED 0x0001000000000000ULL
+#define INFINIPATH_E_INVALIDADDR 0x0002000000000000ULL
+#define INFINIPATH_E_RESET 0x0004000000000000ULL
+#define INFINIPATH_E_HARDWARE 0x0008000000000000ULL
+
+/* kr_hwerrclear, kr_hwerrmask, kr_hwerrstatus bits */
+#define INFINIPATH_HWE_HTCMEMPARITYERR_SHIFT 0
+#define INFINIPATH_HWE_TXEMEMPARITYERR_MASK 0xFULL
+#define INFINIPATH_HWE_TXEMEMPARITYERR_SHIFT 40
+#define INFINIPATH_HWE_RXEMEMPARITYERR_MASK 0x7FULL
+#define INFINIPATH_HWE_RXEMEMPARITYERR_SHIFT 44
+#define
INFINIPATH_HWE_HTCBUSTREQPARITYERR 0x0000000080000000ULL +#define INFINIPATH_HWE_HTCBUSTRESPPARITYERR 0x0000000100000000ULL +#define INFINIPATH_HWE_HTCBUSIREQPARITYERR 0x0000000200000000ULL +#define INFINIPATH_HWE_RXDSYNCMEMPARITYERR 0x0000000400000000ULL +#define INFINIPATH_HWE_SERDESPLLFAILED 0x2000000000000000ULL +#define INFINIPATH_HWE_IBCBUSTOSPCPARITYERR 0x4000000000000000ULL +#define INFINIPATH_HWE_IBCBUSFRSPCPARITYERR 0x8000000000000000ULL + +/* kr_hwdiagctrl bits */ +#define INFINIPATH_DC_FORCEHTCENABLE 0x20 +#define INFINIPATH_DC_FORCEHTCMEMPARITYERR_MASK 0x3FULL +#define INFINIPATH_DC_FORCEHTCMEMPARITYERR_SHIFT 0 +#define INFINIPATH_DC_FORCETXEMEMPARITYERR_MASK 0xFULL +#define INFINIPATH_DC_FORCETXEMEMPARITYERR_SHIFT 40 +#define INFINIPATH_DC_FORCERXEMEMPARITYERR_MASK 0x7FULL +#define INFINIPATH_DC_FORCERXEMEMPARITYERR_SHIFT 44 +#define INFINIPATH_DC_FORCEHTCBUSTREQPARITYERR 0x0000000080000000ULL +#define INFINIPATH_DC_FORCEHTCBUSTRESPPARITYERR 0x0000000100000000ULL +#define INFINIPATH_DC_FORCEHTCBUSIREQPARITYERR 0x0000000200000000ULL +#define INFINIPATH_DC_FORCERXDSYNCMEMPARITYERR 0x0000000400000000ULL +#define INFINIPATH_DC_COUNTERDISABLE 0x1000000000000000ULL +#define INFINIPATH_DC_COUNTERWREN 0x2000000000000000ULL +#define INFINIPATH_DC_FORCEIBCBUSTOSPCPARITYERR 0x4000000000000000ULL +#define INFINIPATH_DC_FORCEIBCBUSFRSPCPARITYERR 0x8000000000000000ULL + +/* kr_ibcctrl bits */ +#define INFINIPATH_IBCC_FLOWCTRLPERIOD_MASK 0xFFULL +#define INFINIPATH_IBCC_FLOWCTRLPERIOD_SHIFT 0 +#define INFINIPATH_IBCC_FLOWCTRLWATERMARK_MASK 0xFFULL +#define INFINIPATH_IBCC_FLOWCTRLWATERMARK_SHIFT 8 +#define INFINIPATH_IBCC_LINKINITCMD_MASK 0x3ULL +#define INFINIPATH_IBCC_LINKINITCMD_DISABLE 1 +/* cycle through TS1/TS2 till OK */ +#define INFINIPATH_IBCC_LINKINITCMD_POLL 2 +#define INFINIPATH_IBCC_LINKINITCMD_SLEEP 3 /* wait for TS1, then go on */ +#define INFINIPATH_IBCC_LINKINITCMD_SHIFT 16 +#define INFINIPATH_IBCC_LINKCMD_MASK 0x3ULL +#define INFINIPATH_IBCC_LINKCMD_INIT 1 /* move to 0x11 */ +#define INFINIPATH_IBCC_LINKCMD_ARMED 2 /* move to 0x21 */ +#define INFINIPATH_IBCC_LINKCMD_ACTIVE 3 /* move to 0x31 */ +#define INFINIPATH_IBCC_LINKCMD_SHIFT 18 +#define INFINIPATH_IBCC_MAXPKTLEN_MASK 0x7FFULL +#define INFINIPATH_IBCC_MAXPKTLEN_SHIFT 20 +#define INFINIPATH_IBCC_PHYERRTHRESHOLD_MASK 0xFULL +#define INFINIPATH_IBCC_PHYERRTHRESHOLD_SHIFT 32 +#define INFINIPATH_IBCC_OVERRUNTHRESHOLD_MASK 0xFULL +#define INFINIPATH_IBCC_OVERRUNTHRESHOLD_SHIFT 36 +#define INFINIPATH_IBCC_CREDITSCALE_MASK 0x7ULL +#define INFINIPATH_IBCC_CREDITSCALE_SHIFT 40 +#define INFINIPATH_IBCC_LOOPBACK 0x8000000000000000ULL +#define INFINIPATH_IBCC_LINKDOWNDEFAULTSTATE 0x4000000000000000ULL + +/* kr_ibcstatus bits */ +#define INFINIPATH_IBCS_LINKTRAININGSTATE_MASK 0xF +#define INFINIPATH_IBCS_LINKTRAININGSTATE_SHIFT 0 +#define INFINIPATH_IBCS_LINKSTATE_MASK 0x7 +#define INFINIPATH_IBCS_LINKSTATE_SHIFT 4 +#define INFINIPATH_IBCS_TXREADY 0x40000000 +#define INFINIPATH_IBCS_TXCREDITOK 0x80000000 + +/* kr_extstatus bits */ +#define INFINIPATH_EXTS_SERDESPLLLOCK 0x1 +#define INFINIPATH_EXTS_GPIOIN_MASK 0xFFFFULL +#define INFINIPATH_EXTS_GPIOIN_SHIFT 48 + +/* kr_extctrl bits */ +#define INFINIPATH_EXTC_GPIOINVERT_MASK 0xFFFFULL +#define INFINIPATH_EXTC_GPIOINVERT_SHIFT 32 +#define INFINIPATH_EXTC_GPIOOE_MASK 0xFFFFULL +#define INFINIPATH_EXTC_GPIOOE_SHIFT 48 +#define INFINIPATH_EXTC_SERDESENABLE 0x80000000ULL +#define INFINIPATH_EXTC_SERDESCONNECT 0x40000000ULL +#define INFINIPATH_EXTC_SERDESENTRUNKING 0x20000000ULL +#define 
INFINIPATH_EXTC_SERDESDISRXFIFO 0x10000000ULL
+#define INFINIPATH_EXTC_SERDESENPLPBK1 0x08000000ULL
+#define INFINIPATH_EXTC_SERDESENPLPBK2 0x04000000ULL
+#define INFINIPATH_EXTC_SERDESENENCDEC 0x02000000ULL
+#define INFINIPATH_EXTC_LEDSECPORTGREENON 0x00000020ULL
+#define INFINIPATH_EXTC_LEDSECPORTYELLOWON 0x00000010ULL
+#define INFINIPATH_EXTC_LEDPRIPORTGREENON 0x00000008ULL
+#define INFINIPATH_EXTC_LEDPRIPORTYELLOWON 0x00000004ULL
+#define INFINIPATH_EXTC_LEDGBLOKGREENON 0x00000002ULL
+#define INFINIPATH_EXTC_LEDGBLERRREDOFF 0x00000001ULL
+
+/* kr_mdio bits */
+#define INFINIPATH_MDIO_CLKDIV_MASK 0x7FULL
+#define INFINIPATH_MDIO_CLKDIV_SHIFT 32
+#define INFINIPATH_MDIO_COMMAND_MASK 0x7ULL
+#define INFINIPATH_MDIO_COMMAND_SHIFT 26
+#define INFINIPATH_MDIO_DEVADDR_MASK 0x1FULL
+#define INFINIPATH_MDIO_DEVADDR_SHIFT 21
+#define INFINIPATH_MDIO_REGADDR_MASK 0x1FULL
+#define INFINIPATH_MDIO_REGADDR_SHIFT 16
+#define INFINIPATH_MDIO_DATA_MASK 0xFFFFULL
+#define INFINIPATH_MDIO_DATA_SHIFT 0
+#define INFINIPATH_MDIO_CMDVALID 0x0000000040000000ULL
+#define INFINIPATH_MDIO_RDDATAVALID 0x0000000080000000ULL
+
+/* kr_partitionkey bits */
+#define INFINIPATH_PKEY_SIZE 16
+#define INFINIPATH_PKEY_MASK 0xFFFF
+#define INFINIPATH_PKEY_DEFAULT_PKEY 0xFFFF
+
+/* kr_serdesconfig0 bits */
+#define INFINIPATH_SERDC0_RESET_MASK 0xfULL /* overall reset bits */
+#define INFINIPATH_SERDC0_RESET_PLL 0x10000000ULL /* pll reset */
+#define INFINIPATH_SERDC0_TXIDLE 0xF000ULL /* tx idle enables (per lane) */
+
+/* kr_xgxsconfig bits */
+#define INFINIPATH_XGXS_RESET 0x7ULL
+#define INFINIPATH_XGXS_MDIOADDR_MASK 0xfULL
+#define INFINIPATH_XGXS_MDIOADDR_SHIFT 4
+
+/* TID entries (memory) */
+#define INFINIPATH_RT_VALID 0x8000000000000000ULL
+#define INFINIPATH_RT_ADDR_MASK 0xFFFFFFFFFFULL
+#define INFINIPATH_RT_ADDR_SHIFT 0
+#define INFINIPATH_RT_BUFSIZE_MASK 0x3FFF
+#define INFINIPATH_RT_BUFSIZE_SHIFT 48
+
+/* mask of defined bits for various registers */
+extern const uint64_t infinipath_c_bitsextant,
+        infinipath_s_bitsextant, infinipath_r_bitsextant,
+        infinipath_i_bitsextant, infinipath_e_bitsextant,
+        infinipath_hwe_bitsextant, infinipath_dc_bitsextant,
+        infinipath_extc_bitsextant, infinipath_mdio_bitsextant,
+        infinipath_ibcs_bitsextant, infinipath_ibcc_bitsextant;
+
+/* masks that are different in different chips */
+extern const uint32_t infinipath_i_rcvavail_mask, infinipath_i_rcvurg_mask;
+extern const uint64_t infinipath_hwe_htcmemparityerr_mask;
+extern const uint64_t infinipath_hwe_spibdcmlockfailed_mask;
+extern const uint64_t infinipath_hwe_sphtdcmlockfailed_mask;
+extern const uint64_t infinipath_hwe_htcdcmlockfailed_mask;
+extern const uint64_t infinipath_hwe_htcdcmlockfailed_shift;
+extern const uint64_t infinipath_hwe_sphtdcmlockfailed_shift;
+extern const uint64_t infinipath_hwe_spibdcmlockfailed_shift;
+
+extern const uint64_t infinipath_hwe_htclnkabyte0crcerr;
+extern const uint64_t infinipath_hwe_htclnkabyte1crcerr;
+extern const uint64_t infinipath_hwe_htclnkbbyte0crcerr;
+extern const uint64_t infinipath_hwe_htclnkbbyte1crcerr;
+
+/*
+ * These are the infinipath general register numbers (not offsets).
+ * The kernel registers are used directly, those beyond the kernel
+ * registers are calculated from one of the base registers. The use of
+ * an integer type doesn't allow type-checking as thorough as, say,
+ * an enum but allows for better hiding of chip differences.
+ */
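Concretely, the register-number indirection described above means a chip-specific source file assigns the numeric values, and common code only ever uses the names through the ipath_kget_*/ipath_kput_* inlines from ipath_kernel.h. A sketch, with invented offsets:

        /* in a chip-specific file (the values here are made up): */
        ipath_kreg kr_control = 8;
        ipath_kreg kr_scratch = 11;

        /* in common code: */
        uint64_t ctrl = ipath_kget_kreg64(t, kr_control);
        ipath_kput_kreg(t, kr_control, ctrl | INFINIPATH_C_LINKENABLE);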
+typedef const uint16_t
+        ipath_kreg, /* kernel-only, infinipath general registers */
+        ipath_creg, /* kernel-only, infinipath counter registers */
+        ipath_sreg; /* kernel-only, infinipath send registers */
+
+/*
+ * These are all implemented such that 64 bit accesses work.
+ * Some implement no more than 32 bits. Because 64 bit reads
+ * require 2 HT cmds on opteron, we access those with 32 bit
+ * reads for efficiency (they are written as 64 bits, since
+ * the extra 32 bits are nearly free on writes, and it slightly reduces
+ * complexity). The rest are all accessed as 64 bits.
+ */
+extern ipath_kreg
+        /* These are the 32 bit group */
+        kr_control, kr_counterregbase, kr_intmask, kr_intstatus,
+        kr_pagealign, kr_portcnt, kr_rcvtidbase, kr_rcvtidcnt,
+        kr_rcvegrbase, kr_rcvegrcnt, kr_scratch, kr_sendctrl,
+        kr_sendpiobufbase, kr_sendpiobufcnt, kr_sendpiosize,
+        kr_sendregbase, kr_userregbase,
+        /* These are the 64 bit group */
+        kr_debugport, kr_debugportselect, kr_errorclear, kr_errormask,
+        kr_errorstatus, kr_extctrl, kr_extstatus, kr_gpio_clear, kr_gpio_mask,
+        kr_gpio_out, kr_gpio_status, kr_hwdiagctrl, kr_hwerrclear,
+        kr_hwerrmask, kr_hwerrstatus, kr_ibcctrl, kr_ibcstatus, kr_intblocked,
+        kr_intclear, kr_interruptconfig, kr_mdio, kr_partitionkey, kr_rcvbthqp,
+        kr_rcvbufbase, kr_rcvbufsize, kr_rcvctrl, kr_rcvhdrcnt,
+        kr_rcvhdrentsize, kr_rcvhdrsize, kr_rcvintmembase, kr_rcvintmemsize,
+        kr_revision, kr_sendbuffererror, kr_sendbuffererror1,
+        kr_sendbuffererror2, kr_sendbuffererror3, kr_sendpioavailaddr,
+        kr_serdesconfig0, kr_serdesconfig1, kr_serdesstatus, kr_txintmembase,
+        kr_txintmemsize, kr_xgxsconfig, kr_sync, kr_dump,
+        kr_simver, /* simulator only */
+        __kr_invalid, /* a marker for debug, don't use them directly */
+        /* a marker for debug, don't use them directly */
+        __kr_lastvaliddirect,
+        /* use only with ipath_k*_kreg64_port(), not *kreg64() */
+        kr_rcvhdraddr,
+        /* use only with ipath_k*_kreg64_port(), not *kreg64() */
+        kr_rcvhdrtailaddr,
+        /* we define the full set for the diags, the kernel doesn't use them */
+        kr_rcvhdraddr1, kr_rcvhdraddr2, kr_rcvhdraddr3, kr_rcvhdraddr4,
+        kr_rcvhdraddr5, kr_rcvhdraddr6, kr_rcvhdraddr7, kr_rcvhdraddr8,
+        kr_rcvhdrtailaddr1, kr_rcvhdrtailaddr2, kr_rcvhdrtailaddr3,
+        kr_rcvhdrtailaddr4, kr_rcvhdrtailaddr5, kr_rcvhdrtailaddr6,
+        kr_rcvhdrtailaddr7, kr_rcvhdrtailaddr8;
+
+/*
+ * first of the pioavail registers, the total number is
+ * (kr_sendpiobufcnt / 32); each buffer uses 2 bits
+ */
+extern ipath_sreg sr_sendpioavail;
+
+extern ipath_creg cr_badformatcnt, cr_erricrccnt, cr_errlinkcnt,
+        cr_errlpcrccnt, cr_errpkey, cr_errrcvflowctrlcnt,
+        cr_err_rlencnt, cr_errslencnt, cr_errtidfull,
+        cr_errtidvalid, cr_errvcrccnt, cr_ibstatuschange,
+        cr_intcnt, cr_invalidrlencnt, cr_invalidslencnt,
+        cr_lbflowstallcnt, cr_iblinkdowncnt, cr_iblinkerrrecovcnt,
+        cr_ibsymbolerrcnt, cr_pktrcvcnt, cr_pktrcvflowctrlcnt,
+        cr_pktsendcnt, cr_pktsendflowcnt, cr_portovflcnt,
+        cr_portovflcnt1, cr_portovflcnt2, cr_portovflcnt3, cr_portovflcnt4,
+        cr_portovflcnt5, cr_portovflcnt6, cr_portovflcnt7, cr_portovflcnt8,
+        cr_rcvebpcnt, cr_rcvovflcnt, cr_rxdroppktcnt,
+        cr_senddropped, cr_sendstallcnt, cr_sendunderruncnt,
+        cr_unsupvlcnt, cr_wordrcvcnt, cr_wordsendcnt;
+
+/*
+ * register bits for selecting i2c direction and values, used for I2C serial
+ * flash
+ */
+extern const uint16_t ipath_gpio_sda_num;
+extern const uint16_t ipath_gpio_scl_num;
+extern const uint64_t ipath_gpio_sda;
+extern const uint64_t ipath_gpio_scl;
+
+#endif /*
_IPATH_REGISTERS_H */ diff --git a/drivers/infiniband/hw/ipath/ips_common.h b/drivers/infiniband/hw/ipath/ips_common.h new file mode 100644 index 0000000..8a6a059 --- /dev/null +++ b/drivers/infiniband/hw/ipath/ips_common.h @@ -0,0 +1,221 @@ +/* + * Copyright (c) 2003, 2004, 2005. PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. + * + * $Id: ips_common.h 4365 2005-12-10 00:04:16Z rjwalsh $ + */ + +#ifndef IPS_COMMON_H +#define IPS_COMMON_H + +typedef struct _ipath_header_typ { + /* + * Version - 4 bits, Port - 4 bits, TID - 10 bits and Offset - 14 bits + * before ECO change ~28 Dec 03. + * After that, Vers 4, Port 3, TID 11, offset 14. + */ + uint32_t ver_port_tid_offset; + uint16_t chksum; + uint16_t pkt_flags; +} ipath_header_typ; + +typedef struct _ips_message_header_typ { + uint16_t lrh[4]; + uint32_t bth[3]; + ipath_header_typ iph; + uint8_t sub_opcode; + uint8_t flags; + uint16_t src_rank; + /* 24 bits. 
The upper 8 bits are available for other use */
+ union {
+ struct {
+ unsigned ack_seq_num : 24;
+ unsigned port : 4;
+ unsigned unused : 4;
+ };
+ uint32_t ack_seq_num_org;
+ };
+ uint8_t expected_tid_session_id;
+ uint8_t tinylen; /* to aid MPI */
+ uint16_t tag; /* to aid MPI */
+ union {
+ uint32_t mpi[4]; /* to aid MPI */
+ uint32_t data[4];
+ struct {
+ uint16_t mtu;
+ uint8_t major_ver;
+ uint8_t minor_ver;
+ uint32_t not_used; /* free */
+ uint32_t run_id;
+ uint32_t client_ver;
+ };
+ };
+} ips_message_header_typ;
+
+typedef struct _ether_header_typ {
+ uint16_t lrh[4];
+ uint32_t bth[3];
+ ipath_header_typ iph;
+ uint8_t sub_opcode;
+ uint8_t cmd;
+ uint16_t lid;
+ uint16_t mac[3];
+ uint8_t frag_num;
+ uint8_t seq_num;
+ uint32_t len;
+ /* MUST be of word size due to PIO write requirements */
+ uint32_t csum;
+ uint16_t csum_offset;
+ uint16_t flags;
+ uint16_t first_2_bytes;
+ uint8_t unused[2]; /* currently unused */
+} ether_header_typ;
+
+/*
+ * The PIO buffer used for sending infinipath messages must only be written
+ * in 32-bit words, all the data must be written, and no writes can occur
+ * after the last word is written (which transfers "ownership" of the buffer
+ * to the chip and triggers the message to be sent).
+ * Since the Linux sk_buff structure can be recursive and non-aligned, with
+ * any number of bytes in each segment, we use the following structure
+ * to keep information about the overall state of the copy operation.
+ * This is used to save the information needed to store the checksum
+ * in the right place before sending the last word to the hardware and
+ * to buffer the last 0-3 bytes of non-word sized segments.
+ */
+struct copy_data_s {
+ ether_header_typ *hdr;
+ uint32_t *csum_pio; /* address of the PIO buffer to write csum to */
+ uint32_t *to; /* address of the PIO buffer to write data to */
+ uint32_t device; /* which device to allocate PIO bufs from */
+ int error; /* set if there is an error. */
+ int extra; /* amount of data saved in u.buf below */
+ unsigned int len; /* total length to send in bytes */
+ unsigned int flen; /* fragment length in words */
+ unsigned int csum; /* partial IP checksum */
+ unsigned int pos; /* position for partial checksum */
+ unsigned int offset; /* offset to where data currently starts */
+ int checksum_calc; /* set to 'true' when the checksum has been calculated */
+ struct sk_buff *skb;
+ union {
+ uint32_t w;
+ uint8_t buf[4];
+ } u;
+};
+
+typedef struct copy_data_s copy_data_ctrl_typ;
+
+/* IB - LRH header consts */
+#define IPS_LRH_GRH 0x0003 /* 1. word of IB LRH - next header: GRH */
+#define IPS_LRH_BTH 0x0002 /* 1.
word of IB LRH - next header: BTH */ + +#define IPS_OFFSET 0 + +/* + * defines the cut-off point between the header queue and eager/expected + * TID queue + */ +#define NUM_OF_EKSTRA_WORDS_IN_HEADER_QUEUE ((sizeof(ips_message_header_typ) - offsetof(ips_message_header_typ, iph)) >> 2) + +/* OpCodes */ +#define OPCODE_IPS 0xC0 +#define OPCODE_ITH4X 0xC1 + +/* OpCode 30 is use by stand-alone test programs */ +#define OPCODE_RAW_DATA 0xDE +/* last OpCode (31) is reserved for test */ +#define OPCODE_TEST 0xDF + +/* sub OpCodes - ips */ +#define OPCODE_SEQ_DATA 0x01 +#define OPCODE_SEQ_CTRL 0x02 + +#define OPCODE_ACK 0x10 +#define OPCODE_NAK 0x11 + +#define OPCODE_ERR_CHK 0x20 +#define OPCODE_ERR_CHK_PLS 0x21 + +#define OPCODE_STARTUP 0x30 +#define OPCODE_STARTUP_ACK 0x31 +#define OPCODE_STARTUP_NAK 0x32 + +#define OPCODE_STARTUP_EXT 0x34 +#define OPCODE_STARTUP_ACK_EXT 0x35 +#define OPCODE_STARTUP_NAK_EXT 0x36 + +#define OPCODE_TIDS_RELEASE 0x40 +#define OPCODE_TIDS_RELEASE_CONFIRM 0x41 + +#define OPCODE_CLOSE 0x50 +#define OPCODE_CLOSE_ACK 0x51 +/* + * like OPCODE_CLOSE, but no complaint if other side has already closed. Used + * when doing abort(), MPI_Abort(), etc. + */ +#define OPCODE_ABORT 0x52 + +/* sub OpCodes - ith4x */ +#define OPCODE_ENCAP 0x81 +#define OPCODE_LID_ARP 0x82 + +/* Receive Header Queue: receive type (from infinipath) */ +#define RCVHQ_RCV_TYPE_EXPECTED 0 +#define RCVHQ_RCV_TYPE_EAGER 1 +#define RCVHQ_RCV_TYPE_NON_KD 2 +#define RCVHQ_RCV_TYPE_ERROR 3 + +/* misc. */ +#define SIZE_OF_CRC 1 + +#define EAGER_TID_ID INFINIPATH_I_TID_MASK + +#define IPS_DEFAULT_P_KEY 0xFFFF + +/* macros for processing rcvhdrq entries */ +#define ips_get_hdr_err_flags(StartOfBuffer) *(((uint32_t *)(StartOfBuffer))+1) +#define ips_get_index(StartOfBuffer) (((*((uint32_t *)(StartOfBuffer))) >> \ + INFINIPATH_RHF_EGRINDEX_SHIFT) & INFINIPATH_RHF_EGRINDEX_MASK) +#define ips_get_rcv_type(StartOfBuffer) ((*(((uint32_t *)(StartOfBuffer))) >> \ + INFINIPATH_RHF_RCVTYPE_SHIFT) & INFINIPATH_RHF_RCVTYPE_MASK) +#define ips_get_length_in_bytes(StartOfBuffer) \ + (uint32_t)(((*(((uint32_t *)(StartOfBuffer))) >> \ + INFINIPATH_RHF_LENGTH_SHIFT) & INFINIPATH_RHF_LENGTH_MASK) << 2) +#define ips_get_first_protocol_header(StartOfBuffer) (void *) \ + ((uint32_t *)(StartOfBuffer) + 2) +#define ips_get_ips_header(StartOfBuffer) ((ips_message_header_typ *) \ + ((uint32_t *)(StartOfBuffer) + 2)) +#define ips_get_ipath_ver(ipath_header) (((ipath_header) >> INFINIPATH_I_VERS_SHIFT) \ + & INFINIPATH_I_VERS_MASK) +#endif -- 0.99.9n From rolandd at cisco.com Fri Dec 16 15:48:54 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 16 Dec 2005 15:48:54 -0800 Subject: [openib-general] [PATCH 04/13] [RFC] ipath LLD core, part 1 In-Reply-To: <200512161548.lRw6KI369ooIXS9o@cisco.com> Message-ID: <200512161548.20XjmmxDHjOZRXcz@cisco.com> First part of core driver --- drivers/infiniband/hw/ipath/ipath_driver.c | 2589 ++++++++++++++++++++++++++++ 1 files changed, 2589 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/hw/ipath/ipath_driver.c 04a2c405bc3b7f074758c4329933e9499681ccf0 diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c new file mode 100644 index 0000000..df650d6 --- /dev/null +++ b/drivers/infiniband/hw/ipath/ipath_driver.c @@ -0,0 +1,2589 @@ +/* + * Copyright (c) 2003, 2004, 2005. PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. + * + * $Id: ipath_driver.c 4500 2005-12-16 01:34:22Z rjwalsh $ + */ + +#include +#include +#include +#include +#include +#include +#include + +#include /* we can generate our own crc's for testing */ + +#include "ipath_kernel.h" +#include "ips_common.h" +#include "ipath_layer.h" + +/* + * Our LSB-assigned major number, so scripts can figure + * out how to make entry in /dev. + */ + +static int ipath_major = 233; + +/* + * number of buffers reserved for driver (layered drivers and SMA send), + * settable via sysctl, although it may not take effect if user + * processes have the port open. Reserved at end of buffer list + */ + +static uint infinipath_kpiobufs = 32; + +/* + * number of ports we are configured to use (to allow for more pio + * buffers per port, etc.) Zero means use chip value + */ + +static uint infinipath_cfgports; + +/* + * number of units we are configured to use (to allow for bringup on + * multi-chip systems) Zero means use only one for now, but eventually + * will mean to use infinipath_max + */ + +static uint infinipath_cfgunits; + +uint64_t ipath_dummy_val_for_testing; + +static __kernel_pid_t ipath_sma_alive; /* PID of SMA, if it's running */ +static spinlock_t ipath_sma_lock; /* SMA receive */ + +/* max SM received packets we'll queue; we keep the most recent packets. */ + +#define IPATH_NUM_SMAPKTS 16 + +#define IPATH_SMA_HDRSZ (8+12+8) /* LRH+BTH+DETH */ + +static struct _ipath_sma_rpkt { + /* length of received packet; non-zero if queued */ + uint32_t len; + /* unit number of interface packet was received from */ + uint32_t unit; + uint8_t *buf; +} ipath_sma_data[IPATH_NUM_SMAPKTS]; + +static unsigned ipath_sma_first; /* oldest sma packet index */ +static unsigned ipath_sma_next; /* next sma packet index to use */ + +/* + * ipath_sma_data_bufs has one extra, pointed to by ipath_sma_data_spare, + * so we can exchange buffers to do copy_to_user, and not hold the lock + * across the copy_to_user(). 
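+ *
+ * Roughly, the consumer swaps the filled buffer with the spare while
+ * holding the lock, then copies from the detached buffer; a minimal
+ * sketch (ubuf, len, flags and ret are illustrative locals only):
+ *
+ *	spin_lock_irqsave(&ipath_sma_lock, flags);
+ *	buf = ipath_sma_data[ipath_sma_first].buf;
+ *	len = ipath_sma_data[ipath_sma_first].len;
+ *	ipath_sma_data[ipath_sma_first].buf = ipath_sma_data_spare;
+ *	ipath_sma_data_spare = buf;
+ *	spin_unlock_irqrestore(&ipath_sma_lock, flags);
+ *	ret = copy_to_user(ubuf, buf, len);	/* lock not held here */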
+ */ + +#define SMA_MAX_PKTSZ (IPATH_SMA_HDRSZ+256) /* max len of an SMA packet */ + +static uint8_t ipath_sma_data_bufs[IPATH_NUM_SMAPKTS + 1][SMA_MAX_PKTSZ]; +static uint8_t *ipath_sma_data_spare; +/* sma waits globally on all units */ +static wait_queue_head_t ipath_sma_wait; +static wait_queue_head_t ipath_sma_state_wait; + +struct infinipath_stats ipath_stats; + +static __inline__ uint64_t ipath_kget_sreg(const ipath_type, ipath_sreg) + __attribute__ ((always_inline)); + +/* + * this will only be used for diags, now that we have enabled the DMA + * of the sendpioavail regs to system memory. + */ + +static __inline__ uint64_t ipath_kget_sreg(const ipath_type stype, + ipath_sreg regno) +{ + uint64_t val; + uint64_t *sbase; + + sbase = (uint64_t *) (devdata[stype].ipath_sregbase + + (char *)devdata[stype].ipath_kregbase); + val = sbase ? sbase[regno] : 0ULL; + return val; +} + +/* + * make infinipath_debug, etc. changeable on the fly via sysctl. + */ + +static int ipath_sysctl(ctl_table *, int, struct file *, void __user *, + size_t *, loff_t *); + +static int ipath_do_user_init(ipath_portdata *, struct ipath_user_info *); +static int ipath_get_baseinfo(ipath_portdata *, struct ipath_base_info *); +static int ipath_get_units(void); +static int ipath_wr_eeprom(ipath_portdata *, struct ipath_eeprom_req *); +static int ipath_wait_intr(ipath_portdata *, uint32_t); +static int ipath_tid_update(ipath_portdata *, struct _tidupd *); +static int ipath_tid_free(ipath_portdata *, struct _tidupd *); +static int ipath_get_counters(ipath_type, struct infinipath_counters *); +static int ipath_get_unit_counters(struct infinipath_getunitcounters *a); +static int ipath_get_stats(struct infinipath_stats *); +static int ipath_set_partkey(ipath_portdata *, uint16_t); +static int ipath_manage_rcvq(ipath_portdata *, uint16_t); +static void ipath_clean_partkey(ipath_portdata *, ipath_devdata *); +static void ipath_disarm_piobufs(const ipath_type, unsigned, unsigned); +static int ipath_create_user_egr(ipath_portdata *); +static int ipath_create_port0_egr(ipath_portdata *); +static int ipath_create_rcvhdrq(ipath_portdata *); +static void ipath_handle_errors(const ipath_type, uint64_t); +static void ipath_update_pio_bufs(const ipath_type); +static __inline__ void *ipath_get_egrbuf(const ipath_type, uint32_t, int); +static int ipath_shutdown_link(const ipath_type); +static int ipath_bringup_link(const ipath_type); +int ipath_bringup_serdes(const ipath_type); +static void ipath_get_faststats(unsigned long); +static int ipath_setup_htconfig(struct pci_dev *, uint64_t *, const ipath_type); +static struct page *ipath_nopage(struct vm_area_struct *, unsigned long, int *); +static irqreturn_t ipath_intr(int irq, void *devid, struct pt_regs *regs); +static void ipath_decode_err(char *, size_t, uint64_t); +void ipath_free_pddata(ipath_devdata *, uint32_t, int); +static void ipath_clear_tids(const ipath_type, unsigned); +static void ipath_get_guid(const ipath_type); +static int ipath_sma_ioctl(struct file *, unsigned int, unsigned long); +static int ipath_rcvsma_pkt(struct ipath_sendpkt *); +static int ipath_kset_lid(uint32_t); +static int ipath_kset_mlid(uint32_t); +static int ipath_get_mlid(uint32_t *); +static int ipath_get_devstatus(uint64_t *); +static int ipath_kset_guid(struct ipath_setguid *); +static int ipath_get_portinfo(uint32_t *); +static int ipath_get_nodeinfo(uint32_t *); +#ifdef _IPATH_EXTRA_DEBUG +static void ipath_dump_allregs(char *, ipath_type); +#endif + +static const char ipath_sma_name[] = 
"infinipath_SMA"; + +/* + * is diags mode enabled? if it is, then things like auto bringup of + * links is disabled + */ + +int ipath_diags_enabled = 0; + +void ipath_chip_done(void) +{ +} + +void ipath_chip_cleanup(ipath_devdata * dd) +{ +} + +/* + * cache aligned location + * + * where port 0 rcvhdrtail register is written back; also want + * nothing else sharing the cache line, so make it a cache line in size + * used for all units + * + * This is volatile as it's the target of a DMA from the chip. + */ + +static volatile uint64_t ipath_port0_rcvhdrtail[512] + __attribute__ ((aligned(4096))); + +#define MODNAME "ipath_core" +#define DRIVER_LOAD_MSG "PathScale " MODNAME " loaded: " +#define PFX MODNAME ": " + +/* + * min buffers we want to have per port, after driver + */ + +#define IPATH_MIN_USER_PORT_BUFCNT 8 + +/* The size has to be longer than this string, so we can + * append board/chip information to it in the init code. + */ +static char ipath_core_version[192] = _IPATH_IDSTR "\n"; +static char *chip_driver_version; +static int chip_driver_size; + +/* mylid and lidbase are to deal with LIDs in "fabric", until SM is working */ + +module_param(infinipath_debug, uint, 0644); +module_param(infinipath_kpiobufs, uint, 0644); +module_param(infinipath_cfgports, uint, 0644); +module_param(infinipath_cfgunits, uint, 0644); + +MODULE_PARM_DESC(infinipath_debug, "mask for debug prints"); +MODULE_PARM_DESC(infinipath_cfgports, "Set max number of ports to use"); +MODULE_PARM_DESC(infinipath_cfgunits, "Set max number of devices to use"); + +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("PathScale "); +MODULE_DESCRIPTION("Pathscale InfiniPath driver"); + +#ifdef IPATH_DIAG +static __kernel_pid_t ipath_diag_alive; /* PID of diags, if running */ +extern int ipath_diags_ioctl(struct file *, unsigned, unsigned long); +static int ipath_opendiag(struct inode *, struct file *); +#endif + +#if __IPATH_INFO || __IPATH_DBG +static const char *ipath_ibcstatus_str[] = { + "Disabled", + "LinkUp", + "PollActive", + "PollQuiet", + "SleepDelay", + "SleepQuiet", + "LState6", /* unused */ + "LState7", /* unused */ + "CfgDebounce", + "CfgRcvfCfg", + "CfgWaitRmt", + "CfgIdle", + "RecovRetrain", + "LState0xD", /* unused */ + "RecovWaitRmt", + "RecovIdle", +}; +#endif + +static ssize_t show_version(struct device_driver *dev, char *buf) +{ + return snprintf(buf, PAGE_SIZE, "%s", ipath_core_version); +} + +static ssize_t show_status(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + ipath_devdata *dd = dev_get_drvdata(dev); + + if(!dd) + return -EINVAL; + + if(!dd->ipath_statusp) + return -EINVAL; + + return snprintf(buf, PAGE_SIZE, "%llx\n", *(dd->ipath_statusp)); +} + +static const char *ipath_status_str[] = { + "Initted", + "Disabled", + "4", /* unused */ + "OIB_SMA", + "SMA", + "Present", + "IB_link_up", + "IB_configured", + "NoIBcable", + "Fatal_Hardware_Error", + NULL, +}; + +static ssize_t show_status_str(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + ipath_devdata *dd = dev_get_drvdata(dev); + int i, any; + uint64_t s; + + if(!dd) + return -EINVAL; + + if(!dd->ipath_statusp) + return -EINVAL; + + s = *(dd->ipath_statusp); + *buf = '\0'; + for (any = i = 0; s && ipath_status_str[i]; i++) { + if (s & 1) { + if (any && strlcat(buf, " ", PAGE_SIZE) >= PAGE_SIZE) + /* overflow */ + break; + if (strlcat(buf, ipath_status_str[i], + PAGE_SIZE) >= PAGE_SIZE) + break; + any = 1; + } + s >>= 1; + } + if(any) + strlcat(buf, "\n", PAGE_SIZE); + + return strlen(buf); +} + +static ssize_t 
show_lid(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + ipath_devdata *dd = dev_get_drvdata(dev); + + if(!dd) + return -EINVAL; + + return snprintf(buf, PAGE_SIZE, "%x\n", dd->ipath_lid); +} + +static ssize_t show_mlid(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + ipath_devdata *dd = dev_get_drvdata(dev); + + if(!dd) + return -EINVAL; + + return snprintf(buf, PAGE_SIZE, "%x\n", dd->ipath_mlid); +} + +static ssize_t show_guid(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + ipath_devdata *dd = dev_get_drvdata(dev); + uint8_t *guid; + + if(!dd) + return -EINVAL; + + guid = (uint8_t *)&(dd->ipath_guid); + + return snprintf(buf, PAGE_SIZE, "%x:%x:%x:%x:%x:%x:%x:%x\n", + guid[0], guid[1], guid[2], guid[3], guid[4], guid[5], + guid[6], guid[7]); +} + +static ssize_t show_nguid(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + ipath_devdata *dd = dev_get_drvdata(dev); + + if(!dd) + return -EINVAL; + + return snprintf(buf, PAGE_SIZE, "%u\n", dd->ipath_nguid); +} + +static ssize_t show_serial(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + ipath_devdata *dd = dev_get_drvdata(dev); + + if(!dd) + return -EINVAL; + + buf[sizeof dd->ipath_serial] = '\0'; + memcpy(buf, dd->ipath_serial, sizeof dd->ipath_serial); + strcat(buf, "\n"); + return strlen(buf); +} + +static ssize_t show_unit(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + ipath_devdata *dd = dev_get_drvdata(dev); + + if(!dd) + return -EINVAL; + + snprintf(buf, PAGE_SIZE, "%u\n", dd->ipath_unit); + return strlen(buf); +} + +static DRIVER_ATTR(version, S_IRUGO, show_version, NULL); +static DEVICE_ATTR(status, S_IRUGO, show_status, NULL); +static DEVICE_ATTR(status_str, S_IRUGO, show_status_str, NULL); +static DEVICE_ATTR(lid, S_IRUGO, show_lid, NULL); +static DEVICE_ATTR(mlid, S_IRUGO, show_mlid, NULL); +static DEVICE_ATTR(guid, S_IRUGO, show_guid, NULL); +static DEVICE_ATTR(nguid, S_IRUGO, show_nguid, NULL); +static DEVICE_ATTR(serial, S_IRUGO, show_serial, NULL); +static DEVICE_ATTR(unit, S_IRUGO, show_unit, NULL); + +/* + * Much like proc_dointvec_minmax, but only one int, and show as 0x on read + * Apparently between 2.6.3 and 2.6.10, convenience functions were added + * that I should probably convert to using. For now, do the minimum possible + * change, using the new ppos parameter, instead of f_pos + */ + +static int ipath_sysctl(ctl_table * ct, int wr, struct file *f, void __user * b, + size_t * l, loff_t * ppos) +{ + char t[20]; + int len, ret = 0; + + if (*ppos && !wr) + *l = 0; + if (!*l) + goto done; + if (!access_ok(wr ? VERIFY_READ : VERIFY_WRITE, b, *l)) { + ret = -EFAULT; + goto done; + } + len = min_t(int, sizeof t, *l); + + if (!wr) { + /* All of our changeable sysctl stuff is unsigned's for now */ + *l = snprintf(t, len, "%x\n", *(unsigned *)ct->data); + if (*l < 0) + *l = 0; + else + copy_to_user(b, t, *l); + } else { + int i; + char *e; + if (copy_from_user(t, b, len)) { + ret = -EFAULT; + goto done; + } + t[len < (sizeof t - 1) ? 
len : (sizeof t - 1)] = '\0'; + i = simple_strtoul(t, &e, 0); + if (e > t) { + /* + * All of our changeable sysctl stuff is + * unsigned's for now + */ + if (ct->ctl_name == CTL_INFINIPATH_LAYERBUF) { + /* we don't need locking for this, + * because we still do the normal avail + * checks, it's just a question of what + * range we check within; at best + * during the update, we miss checking + * some buffers we could have used, + * for a short period + */ + + int d; + + if (i < 1) { + _IPATH_ERROR + ("Must have at least one kernel PIO buffer\n"); + ret = -EINVAL; + goto done; + } + for (d = 0; d < infinipath_max; d++) { + if (devdata[d].ipath_kregbase) { + if (i > + (devdata[d].ipath_piobcnt - + (devdata[d]. + ipath_cfgports * + IPATH_MIN_USER_PORT_BUFCNT))) + { + _IPATH_UNIT_ERROR(d, + "Allocating %d PIO bufs for kernel leaves too few for %d user ports (%d each)\n", + i, + devdata + [d]. + ipath_cfgports + - 1, + IPATH_MIN_USER_PORT_BUFCNT); + ret = -EINVAL; + goto done; + } + devdata[d].ipath_lastport_piobuf = + devdata[d].ipath_piobcnt - i; + devdata[d].ipath_lastpioindex = + devdata[d].ipath_lastport_piobuf; + } + } + } + *(unsigned *)ct->data = i; + } else { + ret = -EINVAL; + goto done; + } + *l = len; + } +done: + + *ppos += *l; + return ret; +} + +/* + * make infinipath_debug changeable on the fly via sysctl. + */ + +static struct ctl_table_header *ipath_ctl_header; + +static ctl_table ipath_ctl_debug[] = { + { + .ctl_name = CTL_INFINIPATH_DEBUG, + .procname = "debug", + .data = &infinipath_debug, + .maxlen = sizeof(infinipath_debug), + .mode = 0644, + .proc_handler = ipath_sysctl, + } + , + { + .ctl_name = CTL_INFINIPATH_LAYERBUF, + .procname = "kern_piobufs", + .data = &infinipath_kpiobufs, + .maxlen = sizeof(infinipath_kpiobufs), + .mode = 0644, + .proc_handler = ipath_sysctl, + } + , + {.ctl_name = 0} +}; + +static ctl_table ipath_ctl[] = { + { + .ctl_name = CTL_INFINIPATH, + .procname = "infinipath", + .mode = 0555, + .child = ipath_ctl_debug, + }, + {.ctl_name = 0} +}; + +/* + * called from add_timer and user counter read calls, to deal with + * counters that wrap in "human time". The words sent and received, and + * the packets sent and received are all that we worry about. For now, + * at least, we don't worry about error counters, because if they wrap + * that quickly, we probably don't care. We may eventually just make this + * handle all the counters. word counters can wrap in about 20 seconds + * of full bandwidth traffic, packet counters in a few hours. + */ + +uint64_t ipath_snap_cntr(const ipath_type t, ipath_creg creg) +{ + uint32_t val; + uint64_t val64, t0, t1; + ipath_devdata *dd = &devdata[t]; + static uint64_t one_sec_in_cycles; + extern uint32_t _ipath_pico_per_cycle; + + if (!one_sec_in_cycles && _ipath_pico_per_cycle) + one_sec_in_cycles = 1000000000000UL / _ipath_pico_per_cycle; + + t0 = get_cycles(); + val = ipath_kget_creg32(t, creg); + t1 = get_cycles(); + if ((t1 - t0) > one_sec_in_cycles && val == ~0) { + /* + * This is just a way to detect things that are quite broken. + * Normally this should take just a few cycles (the check is + * for long enough that we don't care if we get pre-empted.) + * An Opteron HT O read timeout is 4 seconds with normal + * NB values + */ + + _IPATH_UNIT_ERROR(t, "Error! 
Reading counter 0x%x timed out\n",
+ creg);
+ return 0ULL;
+ }
+
+ /* chip counters are 32 bits wide; unsigned subtraction from the
+ * last sampled value gives the correct delta across a single wrap */
+ if (creg == cr_wordsendcnt) {
+ if (val != dd->ipath_lastsword) {
+ dd->ipath_sword += val - dd->ipath_lastsword;
+ dd->ipath_lastsword = val;
+ }
+ val64 = dd->ipath_sword;
+ } else if (creg == cr_wordrcvcnt) {
+ if (val != dd->ipath_lastrword) {
+ dd->ipath_rword += val - dd->ipath_lastrword;
+ dd->ipath_lastrword = val;
+ }
+ val64 = dd->ipath_rword;
+ } else if (creg == cr_pktsendcnt) {
+ if (val != dd->ipath_lastspkts) {
+ dd->ipath_spkts += val - dd->ipath_lastspkts;
+ dd->ipath_lastspkts = val;
+ }
+ val64 = dd->ipath_spkts;
+ } else if (creg == cr_pktrcvcnt) {
+ if (val != dd->ipath_lastrpkts) {
+ dd->ipath_rpkts += val - dd->ipath_lastrpkts;
+ dd->ipath_lastrpkts = val;
+ }
+ val64 = dd->ipath_rpkts;
+ } else
+ val64 = (uint64_t) val;
+
+ return val64;
+}
+
+/*
+ * print the delta of egrfull/hdrqfull errors for kernel ports no more
+ * than every 5 seconds. User processes are printed at close, but the
+ * kernel doesn't close, so... Separate routine so it may be called from
+ * other places someday, and so the function name is meaningful when
+ * printed by _IPATH_INFO
+ */
+
+static void ipath_qcheck(const ipath_type t)
+{
+ static uint64_t last_tot_hdrqfull;
+ size_t blen = 0;
+ ipath_devdata *dd = &devdata[t];
+ char buf[128];
+
+ *buf = 0;
+ if (dd->ipath_pd[0]->port_hdrqfull != dd->ipath_p0_hdrqfull) {
+ blen = snprintf(buf, sizeof buf, "port 0 hdrqfull %u",
+ dd->ipath_pd[0]->port_hdrqfull -
+ dd->ipath_p0_hdrqfull);
+ dd->ipath_p0_hdrqfull = dd->ipath_pd[0]->port_hdrqfull;
+ }
+ if (ipath_stats.sps_etidfull != dd->ipath_last_tidfull) {
+ blen +=
+ snprintf(buf + blen, sizeof buf - blen, "%srcvegrfull %llu",
+ blen ? ", " : "",
+ ipath_stats.sps_etidfull - dd->ipath_last_tidfull);
+ dd->ipath_last_tidfull = ipath_stats.sps_etidfull;
+ }
+
+ /*
+ * this is actually the number of hdrq full interrupts, not actual
+ * events, but at the moment that's mostly what I'm interested in.
+ * Actual count, etc. is in the counters, if needed. For production
+ * users this won't ordinarily be printed.
+ */
+
+ if ((infinipath_debug & (__IPATH_PKTDBG | __IPATH_DBG)) &&
+ ipath_stats.sps_hdrqfull != last_tot_hdrqfull) {
+ blen +=
+ snprintf(buf + blen, sizeof buf - blen,
+ "%shdrqfull %llu (all ports)", blen ? ", " : "",
+ ipath_stats.sps_hdrqfull - last_tot_hdrqfull);
+ last_tot_hdrqfull = ipath_stats.sps_hdrqfull;
+ }
+ if (blen)
+ _IPATH_DBG("%s\n", buf);
+
+ if(*dd->ipath_hdrqtailptr != dd->ipath_port0head) {
+ if(dd->ipath_lastport0rcv_cnt == ipath_stats.sps_port0pkts) {
+ _IPATH_PDBG("missing rcv interrupts?
port0 hd=%llx tl=%x; port0pkts %llx\n", + *dd->ipath_hdrqtailptr, dd->ipath_port0head,ipath_stats.sps_port0pkts); + ipath_kreceive(t); + } + dd->ipath_lastport0rcv_cnt = ipath_stats.sps_port0pkts; + } +} + +/* + * called from add_timer to get word counters from chip before they + * can overflow + */ + +static void ipath_get_faststats(unsigned long t) +{ + uint32_t val; + ipath_devdata *dd = &devdata[t]; + static unsigned cnt; + + /* + * don't access the chip while running diags, or memory diags + * can fail + */ + if (!dd->ipath_kregbase || !(dd->ipath_flags & IPATH_PRESENT) || + ipath_diags_enabled) { + /* but re-arm the timer, for diags case; won't hurt other */ + goto done; + } + + ipath_snap_cntr((ipath_type) t, cr_wordsendcnt); + ipath_snap_cntr((ipath_type) t, cr_wordrcvcnt); + ipath_snap_cntr((ipath_type) t, cr_pktsendcnt); + ipath_snap_cntr((ipath_type) t, cr_pktrcvcnt); + + ipath_qcheck(t); + + /* + * deal with repeat error suppression. Doesn't really matter if + * last error was almost a full interval ago, or just a few usecs + * ago; still won't get more than 2 per interval. We may want + * longer intervals for this eventually, could do with mod, counter + * or separate timer. Also see code in ipath_handle_errors() and + * ipath_handle_hwerrors(). + */ + + if (dd->ipath_lasterror) + dd->ipath_lasterror = 0; + if (dd->ipath_lasthwerror) + dd->ipath_lasthwerror = 0; + if ((devdata[t].ipath_maskederrs & ~devdata[t].ipath_ignorederrs) + && get_cycles() > devdata[t].ipath_unmasktime) { + char ebuf[256]; + ipath_decode_err(ebuf, sizeof ebuf, + (devdata[t].ipath_maskederrs & ~devdata[t]. + ipath_ignorederrs)); + if ((devdata[t].ipath_maskederrs & ~devdata[t]. + ipath_ignorederrs) + & ~(INFINIPATH_E_RRCVEGRFULL | INFINIPATH_E_RRCVHDRFULL)) { + _IPATH_UNIT_ERROR(t, "Re-enabling masked errors (%s)\n", + ebuf); + } else { + /* + * rcvegrfull and rcvhdrqfull are "normal", + * for some types of processes (mostly benchmarks) + * that send huge numbers of messages, while + * not processing them. So only complain about + * these at debug level. 
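+ * (In either case ipath_maskederrs is reset to the ipath_ignorederrs
+ * baseline just below, and its complement is written back to
+ * kr_errormask, re-enabling reporting of everything not ignored.)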
+ */ + _IPATH_DBG + ("Disabling frequent queue full errors (%s)\n", + ebuf); + } + devdata[t].ipath_maskederrs = devdata[t].ipath_ignorederrs; + ipath_kput_kreg(t, kr_errormask, ~devdata[t].ipath_maskederrs); + } + + if (dd->ipath_flags & IPATH_LINK_SLEEPING) { + uint64_t ibc; + _IPATH_VDBG("linkinitcmd SLEEP, move to POLL\n"); + dd->ipath_flags &= ~IPATH_LINK_SLEEPING; + ibc = dd->ipath_ibcctrl; + /* + * don't put linkinitcmd in ipath_ibcctrl, want that to + * stay a NOP + */ + ibc |= + INFINIPATH_IBCC_LINKINITCMD_POLL << + INFINIPATH_IBCC_LINKINITCMD_SHIFT; + ipath_kput_kreg(t, kr_ibcctrl, ibc); + } + + /* limit qfull messages to ~one per minute per port */ + if ((++cnt & 0x10)) { + for (val = devdata[t].ipath_cfgports - 1; ((int)val) >= 0; + val--) { + if (dd->ipath_lastegrheads[val] != ~0) + dd->ipath_lastegrheads[val] = ~0; + if (dd->ipath_lastrcvhdrqtails[val] != ~0) + dd->ipath_lastrcvhdrqtails[val] = ~0; + } + } + + if(dd->ipath_nosma_bufs) { + dd->ipath_nosma_secs += 5; + if(dd->ipath_nosma_secs >= 30) { + _IPATH_SMADBG("No SMA bufs avail %u seconds; cancelling pending sends\n", + dd->ipath_nosma_secs); + ipath_disarm_piobufs(t, dd->ipath_lastport_piobuf, + dd->ipath_piobcnt - dd->ipath_lastport_piobuf); + dd->ipath_nosma_secs = 0; /* start again, if necessary */ + } + else + _IPATH_SMADBG("No SMA bufs avail %u tries, after %u seconds\n", + dd->ipath_nosma_bufs, dd->ipath_nosma_secs); + } + +done: + mod_timer(&dd->ipath_stats_timer, jiffies + HZ * 5); +} + + +static void __devexit infinipath_remove_one(struct pci_dev *); +static int infinipath_init_one(struct pci_dev *, const struct pci_device_id *); + +const struct pci_device_id infinipath_pci_tbl[] = { + { + PCI_VENDOR_ID_PATHSCALE, PCI_DEVICE_ID_PATHSCALE_INFINIPATH2, + PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0}, + {0,} +}; + +MODULE_DEVICE_TABLE(pci, infinipath_pci_tbl); + +static struct pci_driver infinipath_driver = { + .name = MODNAME, + .driver.owner = THIS_MODULE, + .probe = infinipath_init_one, + .remove = __devexit_p(infinipath_remove_one), + .id_table = infinipath_pci_tbl, +}; + +#if defined (pgprot_writecombine) && defined(_PAGE_MA_WC) +int remap_area_pages(unsigned long address, unsigned long phys_addr, + unsigned long size, unsigned long flags); +#endif + +static int infinipath_init_one(struct pci_dev *pdev, + const struct pci_device_id *ent) +{ + int ret, len, j; + static int chip_idx = -1; + unsigned long addr; + uint64_t pioaddr, piolen, intconfig; + uint8_t rev; + ipath_type dev; + + /* + * XXX: Right now, we have a hardcoded array of devices. We'll + * change this in a future release, but not just yet. For the + * moment, we're limited to 4 infinipath devices per system. 
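+ * Unit numbers are assigned in probe order: chip_idx starts at -1,
+ * so the first chip probed becomes unit 0, the next unit 1, etc.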
+ */ + + dev = ++chip_idx; + + _IPATH_VDBG("initializing unit #%u\n", dev); + if ((!infinipath_cfgunits && (dev >= 1)) || + (infinipath_cfgunits && (dev >= infinipath_cfgunits)) || + (dev >= infinipath_max)) { + _IPATH_ERROR("Trying to initialize unit %u, max is %u\n", + dev, infinipath_max - 1); + return -EINVAL; + } + + devdata[dev].pci_registered = 1; + devdata[dev].ipath_unit = dev; + + if ((ret = pci_enable_device(pdev))) { + _IPATH_DBG("pci_enable unit %u failed: %x\n", dev, ret); + } + + if ((ret = pci_request_regions(pdev, MODNAME))) + _IPATH_INFO("pci_request_regions unit %u fails: %d\n", dev, + ret); + + if ((ret = pci_set_dma_mask(pdev, DMA_64BIT_MASK)) != 0) + _IPATH_INFO("pci_set_dma_mask unit %u fails: %d\n", dev, ret); + + pci_set_master(pdev); /* probably not be needed for HT */ + + addr = pci_resource_start(pdev, 0); + len = pci_resource_len(pdev, 0); + _IPATH_VDBG + ("regbase (0) %lx len %d irq %x, vend %x/%x driver_data %lx\n", + addr, len, pdev->irq, ent->vendor, ent->device, ent->driver_data); + devdata[dev].ipath_deviceid = ent->device; /* save for later use */ + devdata[dev].ipath_vendorid = ent->vendor; + for (j = 0; j < 6; j++) { + if (!pdev->resource[j].start) + continue; + _IPATH_VDBG("BAR %d start %lx, end %lx, len %lx\n", + j, pdev->resource[j].start, + pdev->resource[j].end, pci_resource_len(pdev, j)); + } + + if (!addr) { + _IPATH_UNIT_ERROR(dev, "No valid address in BAR 0!\n"); + return -ENODEV; + } + + if ((ret = pci_read_config_byte(pdev, PCI_REVISION_ID, &rev))) { + _IPATH_UNIT_ERROR(dev, + "Failed to read PCI revision ID unit %u: %d\n", + dev, ret); + return ret; /* shouldn't ever happen */ + } else + devdata[dev].ipath_pcirev = rev; + + devdata[dev].ipath_kregbase = ioremap_nocache(addr, len); +#if defined (pgprot_writecombine) && defined(_PAGE_MA_WC) + printk("Remapping pages WC\n"); + remap_area_pages((u64) devdata[dev].ipath_kregbase + 1024 * 1024, + addr + 1024 * 1024, 1024 * 1024, _PAGE_MA_WC); + /* devdata[dev].ipath_kregbase = __ioremap(addr, len, _PAGE_MA_WC); */ +#endif + + if (!devdata[dev].ipath_kregbase) { + _IPATH_DBG("Unable to map io addr %lx to kvirt, failing\n", + addr); + ret = -ENOMEM; + goto fail; + } + devdata[dev].ipath_kregend = + (uint64_t *) ((void *)devdata[dev].ipath_kregbase + len); + devdata[dev].ipath_physaddr = addr; /* used for io_remap, etc. */ + /* for user mmap */ + devdata[dev].ipath_kregvirt = phys_to_virt(addr); + _IPATH_VDBG("mapped io addr %lx to kregbase %p kregvirt %p\n", addr, + devdata[dev].ipath_kregbase, devdata[dev].ipath_kregvirt); + + /* + * set these up before registering the interrupt handler, just + * in case + */ + devdata[dev].pcidev = pdev; + pci_set_drvdata(pdev, &(devdata[dev])); + + /* + * set up our interrupt handler; SA_SHIRQ probably not needed, + * but won't hurt for now. + */ + + if (!pdev->irq) { + _IPATH_UNIT_ERROR(dev, "irq is 0, failing init\n"); + ret = -EINVAL; + goto fail; + } + if ((ret = request_irq(pdev->irq, ipath_intr, + SA_SHIRQ, MODNAME, &devdata[dev]))) { + _IPATH_UNIT_ERROR(dev, + "Couldn't setup interrupt handler, irq=%u: %d\n", + pdev->irq, ret); + goto fail; + } + + /* + * clear ipath_flags here instead of in ipath_init_chip as it is set + * by ipath_setup_htconfig. 
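+ * (ipath_setup_htconfig may set IPATH_8BIT_IN_HT0 in ipath_flags,
+ * so the clear has to happen before that call, not during chip init.)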
+ */ + devdata[dev].ipath_flags = 0; + if (ipath_setup_htconfig(pdev, &intconfig, dev)) + _IPATH_DBG + ("Failed to setup HT config, continuing anyway for now\n"); + + ret = ipath_init_chip(dev); /* do the chip-specific init */ + if (!ret) { +#ifdef CONFIG_MTRR + unsigned bits; + /* + * Set the PIO buffers to be WCCOMB, so we get HT bursts + * to the chip. Linux (possibly the hardware) requires + * it to be on a power of 2 address matching the length + * (which has to be a power of 2). For rev1, that means + * the base address, for rev2, it will be just the PIO + * buffers themselves. + */ + pioaddr = addr + devdata[dev].ipath_piobufbase; + piolen = devdata[dev].ipath_piobcnt * + round_up(devdata[dev].ipath_piosize, + devdata[dev].ipath_palign); + + for (bits = 0; !(piolen & (1ULL << bits)); bits++) ; + if (piolen != (1ULL << bits)) { + _IPATH_DBG("piolen 0x%llx not power of 2, bits=%u\n", + piolen, bits); + piolen >>= bits; + while (piolen >>= 1) + bits++; + piolen = 1ULL << (bits + 1); + _IPATH_DBG("Changed piolen to 0x%llx bits=%u\n", piolen, + bits); + } + if (pioaddr & (piolen - 1)) { + uint64_t atmp; + _IPATH_DBG + ("pioaddr %llx not on right boundary for size %llx, fixing\n", + pioaddr, piolen); + atmp = pioaddr & ~(piolen - 1); + if (atmp < addr || (atmp + piolen) > (addr + len)) { + _IPATH_UNIT_ERROR(dev, + "No way to align address/size (%llx/%llx), no WC mtrr\n", + atmp, piolen << 1); + ret = -ENODEV; + } else { + _IPATH_DBG + ("changing WC base from %llx to %llx, len from %llx to %llx\n", + pioaddr, atmp, piolen, piolen << 1); + pioaddr = atmp; + piolen <<= 1; + } + } + + if (!ret) { + int cookie; + _IPATH_VDBG + ("Setting mtrr for chip to WC (addr %llx, len=0x%llx)\n", + pioaddr, piolen); + cookie = mtrr_add(pioaddr, piolen, MTRR_TYPE_WRCOMB, 0); + if (cookie < 0) { + _IPATH_INFO + ("mtrr_add(%llx,0x%llx,WC,0) failed (%d)\n", + pioaddr, piolen, cookie); + ret = -EINVAL; + } else { + _IPATH_VDBG + ("Set mtrr for chip to WC, cookie is %d\n", + cookie); + devdata[dev].ipath_mtrr = (uint32_t) cookie; + } + } +#endif /* CONFIG_MTRR */ + } + + if (!ret && devdata[dev].ipath_kregbase && (devdata[dev].ipath_flags + & IPATH_PRESENT)) { + /* + * for the hardware, enable interrupts only after + * kr_interruptconfig is written, if we could set it up + */ + if (intconfig) { + /* interrupt address */ + ipath_kput_kreg(dev, kr_interruptconfig, intconfig); + /* enable all interrupts */ + ipath_kput_kreg(dev, kr_intmask, ~0ULL); + /* force re-interrupt of any pending interrupts. */ + ipath_kput_kreg(dev, kr_intclear, 0ULL); + /* OK, the chip is usable, marked it as initialized */ + *devdata[dev].ipath_statusp |= IPATH_STATUS_INITTED; + } else + _IPATH_UNIT_ERROR(dev, + "No interrupts enabled, couldn't setup interrupt address\n"); + } else if(ret != -EPERM) + _IPATH_INFO("Not configuring unit %u interrupts, init failed\n", + dev); + + device_create_file(&(pdev->dev), &dev_attr_status); + device_create_file(&(pdev->dev), &dev_attr_status_str); + device_create_file(&(pdev->dev), &dev_attr_lid); + device_create_file(&(pdev->dev), &dev_attr_mlid); + device_create_file(&(pdev->dev), &dev_attr_guid); + device_create_file(&(pdev->dev), &dev_attr_nguid); + device_create_file(&(pdev->dev), &dev_attr_serial); + device_create_file(&(pdev->dev), &dev_attr_unit); + + /* + * We used to cleanup here, with pci_release_regions, etc. but that + * can cause other problems if we want to run diags, etc., so instead + * defer that until driver unload. 
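+ * (The deferred release happens in infinipath_remove_one() below,
+ * which unmaps kregbase and calls pci_release_regions() and
+ * pci_disable_device().)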
+ */ + +fail: /* after we've done at least some of the pci setup */ + if(ret == -EPERM) /* disabled device, don't want module load error; + * just want to carry status through to this point */ + ret = 0; + + return ret; +} + + + +#define HT_CAPABILITY_ID 0x08 /* HT capabilities not defined in kernel */ +#define HT_INTR_DISC_CONFIG 0x80 /* HT interrupt and discovery cap */ +#define HT_INTR_REG_INDEX 2 /* intconfig requires indirect accesses */ + +/* + * setup the interruptconfig register from the HT config info. + * Also clear CRC errors in HT linkcontrol, if necessary. + * This is done only for the real hardware. It is done before + * chip address space is initted, so can't touch infinipath registers + */ + +static int ipath_setup_htconfig(struct pci_dev *pdev, uint64_t * iaddr, + const ipath_type t) +{ + uint8_t cap_type; + uint32_t int_handler_addr_lower; + uint32_t int_handler_addr_upper; + uint64_t ihandler = 0; + int i, pos, ret = 0; + + *iaddr = 0ULL; /* init to zero in case not able to configure */ + + /* + * Read the capability info to find the interrupt info, and also + * handle clearing CRC errors in linkctrl register if necessary. + * We do this early, before we ever enable errors or hardware errors, + * mostly to avoid causing the chip to enter freeze mode. + */ + if(!(pos = pci_find_capability(pdev, HT_CAPABILITY_ID))) { + _IPATH_UNIT_ERROR(t, + "Couldn't find HyperTransport capability; no interrupts\n"); + return -ENODEV; + } + do { + /* the HT capability type byte is 3 bytes after the + * capability byte. + */ + if(pci_read_config_byte(pdev, pos+3, &cap_type)) { + _IPATH_INFO + ("Couldn't read config command @ %d\n", pos); + continue; + } + if(!(cap_type & 0xE0)) { + /* bits 13-15 of command==0 is slave/primary block. + * Clear any HT CRC errors. We only bother to + * do this at load time, because it's OK if it + * happened before we were loaded (first time + * after boot/reset), but any time after that, + * it's fatal anyway. Also need to not check for + * for upper byte errors if we are in 8 bit mode, + * so figure out our width. For now, at least, + * also complain if it's 8 bit. + */ + uint8_t linkwidth = 0, linkerr, link_a_b_off, link_off; + uint16_t linkctrl = 0; + + devdata[t].ipath_ht_slave_off = pos; + /* command word, master_host bit */ + if((cap_type >> 2) & 1) /* master host || slave */ + link_a_b_off = 4; + else + link_a_b_off = 0; + _IPATH_VDBG("HT%u (Link %c) connected to processor\n", + link_a_b_off ? 1 : 0, + link_a_b_off ? 'B' : 'A'); + + link_a_b_off += pos; + + /* + * check both link control registers; clear both + * HT CRC sets if necessary. + */ + + for (i = 0; i < 2; i++) { + link_off = pos + i * 4 + 0x4; + if (pci_read_config_word + (pdev, link_off, &linkctrl)) + _IPATH_UNIT_ERROR(t, + "Couldn't read HT link control%d register\n", + i); + else if (linkctrl & (0xf << 8)) { + _IPATH_VDBG + ("Clear linkctrl%d CRC Error bits %x\n", + i, linkctrl & (0xf << 8)); + /* + * now write them back to clear + * the error. + */ + pci_write_config_byte(pdev, link_off, + linkctrl & (0xf << + 8)); + } + } + + /* + * As with HT CRC bits, same for protocol errors + * that might occur during boot. 
+ */ + + for (i = 0; i < 2; i++) { + link_off = pos + i * 4 + 0xd; + if (pci_read_config_byte + (pdev, link_off, &linkerr)) + _IPATH_INFO + ("Couldn't read linkerror%d of HT slave/primary block\n", + i); + else if (linkerr & 0xf0) { + _IPATH_VDBG + ("HT linkerr%d bits 0x%x set, clearing\n", + linkerr >> 4, i); + /* + * writing the linkerr bits that + * are set will clear them + */ + if (pci_write_config_byte + (pdev, link_off, linkerr)) + _IPATH_DBG + ("Failed write to clear HT linkerror%d\n", + i); + if (pci_read_config_byte + (pdev, link_off, &linkerr)) + _IPATH_INFO + ("Couldn't reread linkerror%d of HT slave/primary block\n", + i); + else if (linkerr & 0xf0) + _IPATH_INFO + ("HT linkerror%d bits 0x%x couldn't be cleared\n", + i, linkerr >> 4); + } + } + + /* + * this is just for our link to the host, not + * devices connected through tunnel. + */ + + if (pci_read_config_byte + (pdev, link_a_b_off + 7, &linkwidth)) + _IPATH_UNIT_ERROR(t, + "Couldn't read HT link width config register\n"); + else { + uint32_t width; + switch (linkwidth & 7) { + case 5: + width = 4; + break; + case 4: + width = 2; + break; + case 3: + width = 32; + break; + case 1: + width = 16; + break; + case 0: + default: /* if wrong, assume 8 bit */ + width = 8; + break; + } + ((ipath_devdata *) pci_get_drvdata(pdev))-> + ipath_htwidth = width; + + if (linkwidth != 0x11) { + _IPATH_UNIT_ERROR(t, + "Not configured for 16 bit HT (%x)\n", + linkwidth); + if (!(linkwidth & 0xf)) { + _IPATH_DBG + ("Will ignore HT lane1 errors\n"); + ((ipath_devdata *) + pci_get_drvdata(pdev))-> + ipath_flags |= IPATH_8BIT_IN_HT0; + } + } + } + + /* + * this is just for our link to the host, not + * devices connected through tunnel. + */ + + if (pci_read_config_byte + (pdev, link_a_b_off + 0xd, &linkwidth)) + _IPATH_UNIT_ERROR(t, + "Couldn't read HT link frequency config register\n"); + else { + uint32_t speed; + switch (linkwidth & 0xf) { + case 6: + speed = 1000; + break; + case 5: + speed = 800; + break; + case 4: + speed = 600; + break; + case 3: + speed = 500; + break; + case 2: + speed = 400; + break; + case 1: + speed = 300; + break; + default: + /* + * assume reserved and + * vendor-specific are 200... + */ + case 0: + speed = 200; + break; + } + ((ipath_devdata *) pci_get_drvdata(pdev))-> + ipath_htspeed = speed; + } + } else if (cap_type == HT_INTR_DISC_CONFIG) { + /* use indirection register to get the intr handler */ + uint32_t intvec; + pci_write_config_byte(pdev, pos + HT_INTR_REG_INDEX, + 0x10); + pci_read_config_dword(pdev, pos + 4, + &int_handler_addr_lower); + + pci_write_config_byte(pdev, pos + HT_INTR_REG_INDEX, + 0x11); + pci_read_config_dword(pdev, pos + 4, + &int_handler_addr_upper); + + ihandler = (uint64_t) int_handler_addr_lower | + ((uint64_t) int_handler_addr_upper << 32); + + /* + * I'm unable to find an exported API to get + * the the actual vector, either from the PCI + * infrastructure, or from the APIC + * infrastructure. This heuristic seems to be + * valid for Opteron on 2.6.x kernels, for irq's > 2. + * It may not be universally true... Bug 2338 + * + * Oh well; the heuristic doesn't work for the + * AMI/Iwill BIOS... But the good news is, + * somewhere by 2.6.9, when CONFIG_PCI_MSI is + * enabled, the irq field actually turned into + * the vector number + * We therefore require that MSI be enabled... 
+ */ + + intvec = pdev->irq; + /* + * clear any bits there; normally not set but + * we'll overload this for some debug purposes + * (setting the HTC debug register value from + * software, rather than GPIOs), so it might be + * set on a driver reload. + */ + + ihandler &= ~0xff0000; + /* x86 vector goes in intrinfo[23:16] */ + ihandler |= intvec << 16; + _IPATH_VDBG + ("ihandler lower %x, upper %x, intvec %x, interruptconfig %llx\n", + int_handler_addr_lower, int_handler_addr_upper, + intvec, ihandler); + + /* return to caller, can't program yet. */ + *iaddr = ihandler; + /* + * no break, have to be sure we find link control + * stuff also + */ + } + + } while((pos=pci_find_next_capability(pdev, pos, HT_CAPABILITY_ID))); + + if (!ihandler) { + _IPATH_UNIT_ERROR(t, + "Couldn't find interrupt handler in config space\n"); + ret = -ENODEV; + } + return ret; +} + +/* + * get the GUID from the i2c device + * When we add the multi-chip support, we will probably have to add + * the ability to use the number of guids field, and get the guid from + * the first chip's flash, to use for all of them. + */ + +static void ipath_get_guid(const ipath_type t) +{ + void *buf; + struct ipath_flash *ifp; + uint64_t guid; + int len; + uint8_t csum, *bguid; + + if (t && devdata[0].ipath_nguid > 1 && t <= devdata[0].ipath_nguid) { + uint8_t oguid; + devdata[t].ipath_guid = devdata[0].ipath_guid; + bguid = (uint8_t *) & devdata[t].ipath_guid; + + oguid = bguid[7]; + bguid[7] += t; + if (oguid > bguid[7]) { + if (bguid[6] == 0xff) { + if (bguid[5] == 0xff) { + _IPATH_UNIT_ERROR(t, + "Can't set %s GUID from base GUID, wraps to OUI!\n", + ipath_get_unit_name + (t)); + devdata[t].ipath_guid = 0; + return; + } + bguid[5]++; + } + bguid[6]++; + } + devdata[t].ipath_nguid = 1; + + _IPATH_DBG + ("nguid %u, so adding %u to device 0 guid, for %llx (big-endian)\n", + devdata[0].ipath_nguid, t, devdata[t].ipath_guid); + return; + } + + len = offsetof(struct ipath_flash, if_future); + if (!(buf = vmalloc(len))) { + _IPATH_UNIT_ERROR(t, + "Couldn't allocate memory to read %u bytes from eeprom for GUID\n", + len); + return; + } + + if (ipath_eeprom_read(t, 0, buf, len)) { + _IPATH_UNIT_ERROR(t, "Failed reading GUID from eeprom\n"); + goto done; + } + ifp = (struct ipath_flash *)buf; + + csum = ipath_flash_csum(ifp, 0); + if (csum != ifp->if_csum) { + _IPATH_INFO("Bad I2C flash checksum: 0x%x, not 0x%x\n", + csum, ifp->if_csum); + goto done; + } + if (*(uint64_t *) ifp->if_guid == 0ULL + || *(uint64_t *) ifp->if_guid == ~0ULL) { + _IPATH_UNIT_ERROR(t, "Invalid GUID %llx from flash; ignoring\n", + *(uint64_t *) ifp->if_guid); + goto done; /* don't allow GUID if all 0 or all 1's */ + } + + /* complain, but allow it */ + if (*(uint64_t *) ifp->if_guid == 0x100007511000000) + _IPATH_INFO + ("Warning, GUID %llx is default, probabaly not correct!\n", + *(uint64_t *) ifp->if_guid); + + bguid = ifp->if_guid; + if(!bguid[0] && !bguid[1] && !bguid[2]) { + /* original incorrect GUID format in flash; fix in core copy, by + * shifting up 2 octets; don't need to change top octet, since both + * it and shifted are 0.. 
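+ * e.g. flash bytes 00:00:00:11:75:xx:yy:zz become
+ * 00:11:75:00:00:xx:yy:zz, putting the OUI back in the top octets.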
*/ + bguid[1] = bguid[3]; + bguid[2] = bguid[4]; + bguid[3] = bguid[4] = 0; + guid = *(uint64_t *)ifp->if_guid; + _IPATH_VDBG("Old GUID format in flash, top 3 zero, shifting 2 octets\n"); + } + else + guid = *(uint64_t *)ifp->if_guid; + devdata[t].ipath_guid = guid; + devdata[t].ipath_nguid = ifp->if_numguid; + memcpy(devdata[t].ipath_serial, ifp->if_serial, sizeof(ifp->if_serial)); + _IPATH_VDBG("Initted GUID to %llx (big-endian) from i2c flash\n", + devdata[t].ipath_guid); + +done: + vfree(buf); +} + +static void __devexit infinipath_remove_one(struct pci_dev *pdev) +{ + ipath_devdata *dd; + + _IPATH_VDBG("pci_release, pdev=%p\n", pdev); + if (pdev) { + device_remove_file(&(pdev->dev), &dev_attr_status); + device_remove_file(&(pdev->dev), &dev_attr_status_str); + device_remove_file(&(pdev->dev), &dev_attr_lid); + device_remove_file(&(pdev->dev), &dev_attr_mlid); + device_remove_file(&(pdev->dev), &dev_attr_guid); + device_remove_file(&(pdev->dev), &dev_attr_nguid); + device_remove_file(&(pdev->dev), &dev_attr_serial); + device_remove_file(&(pdev->dev), &dev_attr_unit); + dd = pci_get_drvdata(pdev); + pci_set_drvdata(pdev, NULL); + _IPATH_VDBG + ("Releasing pci memory regions, devdata %p, unit %u\n", dd, + (uint32_t) (dd - devdata)); + if (dd && dd->ipath_kregbase) { + _IPATH_VDBG("Unmapping kregbase %p\n", + dd->ipath_kregbase); + iounmap((void *)dd->ipath_kregbase); + dd->ipath_kregbase = NULL; + } + pci_release_regions(pdev); + _IPATH_VDBG("calling pci_disable_device\n"); + pci_disable_device(pdev); + } +} + +int ipath_open(struct inode *, struct file *); +static int ipath_opensma(struct inode *, struct file *); +int ipath_close(struct inode *, struct file *); +static unsigned int ipath_poll(struct file *, struct poll_table_struct *); +long ipath_ioctl(struct file *, unsigned int, unsigned long); +static loff_t ipath_llseek(struct file *, loff_t, int); +static int ipath_mmap(struct file *, struct vm_area_struct *); + +static struct file_operations ipath_fops = { + .owner = THIS_MODULE, + .open = ipath_open, + .release = ipath_close, + .poll = ipath_poll, + /* + * all of ours are completely compatible and don't require the + * kernel lock + */ + .compat_ioctl = ipath_ioctl, + /* we don't need kernel lock for our ioctls */ + .unlocked_ioctl = ipath_ioctl, + .llseek = ipath_llseek, + .mmap = ipath_mmap +}; + +static DECLARE_MUTEX(ipath_mutex); /* general driver use */ +spinlock_t ipath_pioavail_lock; + +/* + * For now, at least (and probably forever), we don't require root + * or equivalent permissions to use the device. + */ + +int ipath_open(struct inode *in, struct file *fp) +{ + int ret = 0, minor, i, prefunit=-1, devmax; + int maxofallports, npresent = 0, notup = 0; + ipath_type ndev; + + down(&ipath_mutex); + + minor = iminor(in); + _IPATH_VDBG("open on dev %lx (minor %d)\n", (long)in->i_rdev, minor); + + /* This code is present to allow a knowledgeable person to specify the + * layout of processes to processors before opening this driver, and + * then we'll assign the process to the "closest" HT-400 to + * that processor * (we assume reasonable connectivity, for now). + * This code assumes that if affinity has been set before this + * point, that at most one cpu is set; for now this is reasonable. + * I check for both cpus_empty() and cpus_full(), in case some + * kernel variant sets none of the bits when no affinity is set. + * 2.6.11 and 12 kernels have all present cpus set. + * Some day we'll have to fix it up further to handle a cpu subset. 
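+ * For example, with 8 online cpus and 2 chips present, a process
+ * bound to cpu 5 is steered to prefunit = 5 / (8 / 2) = 1.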
+ * This algorithm fails for two HT-400's connected in tunnel fashion. + * Eventually this needs real topology information. + * There may be some issues with dual core numbering as well. This + * needs more work prior to release. + */ + if(minor != IPATH_SMA +#ifdef IPATH_DIAG + && minor != IPATH_DIAG +#endif + && minor != IPATH_CTRL + && !cpus_empty(current->cpus_allowed) + && !cpus_full(current->cpus_allowed)) { + int ncpus = num_online_cpus(), curcpu = -1; + for(i=0; icpus_allowed)) { + _IPATH_PRDBG("%s[%u] affinity set for cpu %d\n", + current->comm, current->pid, i); + curcpu = i; + } + if(curcpu != -1) { + for(ndev = 0; ndev < infinipath_max; ndev++) + if((devdata[ndev].ipath_flags & IPATH_PRESENT) + && devdata[ndev].ipath_kregbase) + npresent++; + if(npresent) { + prefunit = curcpu/(ncpus/npresent); + _IPATH_DBG("%s[%u] %d chips, %d cpus, " + "%d cpus/chip, select unit %d\n", + current->comm, current->pid, + npresent, ncpus, ncpus/npresent, + prefunit); + } + } + } + + if (minor == IPATH_SMA) { + ret = ipath_opensma(in, fp); + /* for ipath_ioctl */ + fp->private_data = (void *)(unsigned long)minor; + goto done; + } +#ifdef IPATH_DIAG + else if (minor == IPATH_DIAG) { + ret = ipath_opendiag(in, fp); + /* for ipath_ioctl */ + fp->private_data = (void *)(unsigned long)minor; + goto done; + } +#endif + else if (minor == IPATH_CTRL) { + /* for ipath_ioctl */ + fp->private_data = (void *)(unsigned long)minor; + ret = 0; + goto done; + } + else if (minor) { + /* + * minor number 0 is used for all chips, we choose available + * chip ourselves, it isn't based on what they open. + */ + + _IPATH_DBG("open on invalid minor %u\n", minor); + ret = -ENXIO; + goto done; + } + + /* + * for now, we use all ports on one, then all ports on the + * next, etc. Eventually we want to tweak this to be cpu/chip + * topology aware, and round-robin across chips that are + * configured and connected, placing processes on the closest + * available processor that isn't already over-allocated. + * multi-HT400 topology could be better handled + */ + + npresent = maxofallports = 0; + for (ndev = 0; ndev < infinipath_max; ndev++) { + if (!(devdata[ndev].ipath_flags & IPATH_PRESENT) || + !devdata[ndev].ipath_kregbase) + continue; + npresent++; + if ((devdata[ndev]. + ipath_flags & (IPATH_LINKDOWN | IPATH_LINKUNK))) { + _IPATH_VDBG("unit %u present, but link not ready\n", + ndev); + notup++; + continue; + } else if (!devdata[ndev].ipath_lid) { + _IPATH_VDBG + ("unit %u present, but LID not assigned, down\n", + ndev); + notup++; + continue; + } + if (devdata[ndev].ipath_cfgports > maxofallports) + maxofallports = devdata[ndev].ipath_cfgports; + } + + /* + * user ports start at 1, kernel port is 0 + * For now, we do round-robin access across all chips + */ + + devmax = prefunit!=-1 ? prefunit+1 : infinipath_max; +recheck: + for (i = 1; i < maxofallports; i++) { + for (ndev = prefunit!=-1?prefunit:0; ndev < devmax; ndev++) { + if (!(devdata[ndev].ipath_flags & IPATH_PRESENT) || + !devdata[ndev].ipath_kregbase + || !devdata[ndev].ipath_lid + || (devdata[ndev]. + ipath_flags & (IPATH_LINKDOWN | IPATH_LINKUNK))) + break; /* can't use this chip */ + if (i >= devdata[ndev].ipath_cfgports) + break; /* max'ed out on users of this chip */ + if (!devdata[ndev].ipath_pd[i]) { + void *p, *ptmp; + p = kmalloc(sizeof(ipath_portdata), GFP_KERNEL); + + /* + * allocate memory for use in + * ipath_tid_update() just once at open, + * not per call. 
Reduces cost of expected + * send setup + */ + + ptmp = + kmalloc(devdata[ndev].ipath_rcvtidcnt * + sizeof(uint16_t) + + + devdata[ndev].ipath_rcvtidcnt * + sizeof(struct page **), GFP_KERNEL); + if (!p || !ptmp) { + _IPATH_UNIT_ERROR(ndev, + "Unable to allocate portdata memory, failing open\n"); + ret = -ENOMEM; + kfree(p); + kfree(ptmp); + goto done; + } + memset(p, 0, sizeof(ipath_portdata)); + devdata[ndev].ipath_pd[i] = p; + devdata[ndev].ipath_pd[i]->port_port = i; + devdata[ndev].ipath_pd[i]->port_unit = ndev; + devdata[ndev].ipath_pd[i]->port_tid_pg_list = + ptmp; + init_waitqueue_head(&devdata[ndev].ipath_pd[i]-> + port_wait); + } + if (!devdata[ndev].ipath_pd[i]->port_cnt) { + devdata[ndev].ipath_pd[i]->port_cnt = 1; + fp->private_data = + (void *)devdata[ndev].ipath_pd[i]; + _IPATH_PRDBG("%s[%u] opened unit:port %u:%u\n", + current->comm, current->pid, ndev, + i); + devdata[ndev].ipath_pd[i]->port_pid = + current->pid; + strncpy(devdata[ndev].ipath_pd[i]->port_comm, + current->comm, + sizeof(devdata[ndev].ipath_pd[i]-> + port_comm)); + ipath_stats.sps_ports++; + goto done; + } + } + } + + if (npresent) { + if (notup) { + ret = -ENETDOWN; + _IPATH_DBG + ("No ports available (none initialized and ready)\n"); + } else { + if(prefunit > 0) { /* if we started above unit 0, retry from 0 */ + _IPATH_PRDBG("%s[%u] no ports on prefunit %d, clear and re-check\n", + current->comm, current->pid, prefunit); + devmax = infinipath_max; + prefunit = -1; + goto recheck; + } + ret = -EBUSY; + _IPATH_DBG("No ports available\n"); + } + } else { + ret = -ENXIO; + _IPATH_DBG("No boards found\n"); + } + +done: + up(&ipath_mutex); + return ret; +} + +static int ipath_opensma(struct inode *in, struct file *fp) +{ + ipath_type s; + + if (ipath_sma_alive) { + _IPATH_DBG("SMA already running (pid %u), failing\n", + ipath_sma_alive); + return -EBUSY; + } + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; /* all SMA functions are root-only */ + + for (s = 0; s < infinipath_max; s++) { + /* we need at least one infinipath device to be initialized. */ + if (devdata[s].ipath_flags & IPATH_INITTED) { + ipath_sma_alive = current->pid; + *devdata[s].ipath_statusp |= IPATH_STATUS_SMA; + *devdata[s].ipath_statusp &= ~IPATH_STATUS_OIB_SMA; + } + } + if (ipath_sma_alive) { + _IPATH_SMADBG + ("SMA device now open, SMA active as PID %u\n", + ipath_sma_alive); + return 0; + } + _IPATH_DBG("No hardware yet found and initted, failing\n"); + return -ENODEV; +} + + +#ifdef IPATH_DIAG +static int ipath_opendiag(struct inode *in, struct file *fp) +{ + ipath_type s; + + if (ipath_diag_alive) { + _IPATH_DBG("Diags already running (pid %u), failing\n", + ipath_diag_alive); + return -EBUSY; + } + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; /* all diags functions are root-only */ + + for (s = 0; s < infinipath_max; s++) + /* + * we need at least one infinipath device to be present + * (don't use INITTED, because we want to be able to open + * even if device is in freeze mode, which cleared INITTED. + * There is s small amount of risk to this, which is + * why we also verify kregbase is set. + */ + + if ((devdata[s].ipath_flags & IPATH_PRESENT) + && devdata[s].ipath_kregbase) { + ipath_diag_alive = current->pid; + _IPATH_DBG("diag device now open, active as PID %u\n", + ipath_diag_alive); + return 0; + } + _IPATH_DBG("No hardware yet found and initted, failing diags\n"); + return -ENODEV; +} +#endif + +/* + * clear all TID entries for a port, expected and eager. + * Used from ipath_close(), and at chip initialization. 
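+ * Note that entries are not simply zeroed; the buffer-size field is
+ * forced to the maximum as a workaround for chip errata bug 7358
+ * (see the comment in the function body).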
+ */ + +static void ipath_clear_tids(const ipath_type t, unsigned port) +{ + volatile uint64_t *tidbase; + int i; + ipath_devdata *dd; + uint64_t tidval; + dd = &devdata[t]; + + if (!dd->ipath_kregbase) + return; + + /* + * chip errata bug 7358, try to work around it by marking invalid + * tids as having max length + */ + + tidval = + (~0ULL & INFINIPATH_RT_BUFSIZE_MASK) << INFINIPATH_RT_BUFSIZE_SHIFT; + + /* + * need to invalidate all of the expected TID entries for this + * port, so we don't have valid entries that might somehow get + * used (early in next use of this port, or through some bug) + * We don't bother with the eager, because they are initialized + * each time before receives are enabled; expected aren't + */ + + tidbase = (volatile uint64_t *)((char *)(dd->ipath_kregbase) + + dd->ipath_rcvtidbase + + port * dd->ipath_rcvtidcnt * + sizeof(*tidbase)); + _IPATH_VDBG("Invalidate expected TIDs for port %u, tidbase=%p\n", port, + tidbase); + for (i = 0; i < dd->ipath_rcvtidcnt; i++) + ipath_kput_memq(t, &tidbase[i], tidval); + yield(); /* don't hog the cpu */ + + /* zero the eager TID entries */ + tidbase = (volatile uint64_t *)((char *)(dd->ipath_kregbase) + + dd->ipath_rcvegrbase + + port * dd->ipath_rcvegrcnt * + sizeof(*tidbase)); + + for (i = 0; i < dd->ipath_rcvegrcnt; i++) + ipath_kput_memq(t, &tidbase[i], tidval); + yield(); /* don't hog the cpu */ +} + +int ipath_close(struct inode *in, struct file *fp) +{ + int ret = 0; + ipath_portdata *pd; + + _IPATH_VDBG("close on dev %lx, private data %p\n", (long)in->i_rdev, + fp->private_data); + + down(&ipath_mutex); + if (iminor(in) == IPATH_SMA) { + ipath_type s; + + ipath_sma_alive = 0; + _IPATH_SMADBG("Closing SMA device\n"); + for (s = 0; s < infinipath_max; s++) { + if (!(devdata[s].ipath_flags & IPATH_INITTED)) + continue; + *devdata[s].ipath_statusp &= ~IPATH_STATUS_SMA; + if (devdata[s].verbs_layer.l_flags & + IPATH_VERBS_KERNEL_SMA) + *devdata[s].ipath_statusp |= + IPATH_STATUS_OIB_SMA; + } + } +#ifdef IPATH_DIAG + else if (iminor(in) == IPATH_DIAG) { + ipath_diag_alive = 0; + _IPATH_DBG("Closing DIAG device\n"); + } +#endif + else if (fp->private_data && 255UL < (unsigned long)fp->private_data) { + ipath_type t; + unsigned port; + ipath_devdata *dd; + + pd = (ipath_portdata *) fp->private_data; + port = pd->port_port; + fp->private_data = NULL; + t = pd->port_unit; + if (t > infinipath_max) { + _IPATH_ERROR + ("closing, fp %p, pd %p, but unit %x not valid!\n", + fp, pd, t); + goto done; + } + dd = &devdata[t]; + + if (pd->port_hdrqfull) { + _IPATH_PRDBG + ("%s[%u] had %u rcvhdrqfull errors during run\n", + pd->port_comm, pd->port_pid, pd->port_hdrqfull); + pd->port_hdrqfull = 0; + } + + if (pd->port_rcvwait_to || pd->port_piowait_to + || pd->port_rcvnowait || pd->port_pionowait) { + _IPATH_VDBG + ("port%u, %u rcv, %u pio wait timeo; %u rcv %u, pio already\n", + pd->port_port, pd->port_rcvwait_to, + pd->port_piowait_to, pd->port_rcvnowait, + pd->port_pionowait); + pd->port_rcvwait_to = pd->port_piowait_to = + pd->port_rcvnowait = pd->port_pionowait = 0; + } + if (pd->port_flag) { + _IPATH_DBG("port %u port_flag still set to 0x%x\n", + pd->port_port, pd->port_flag); + pd->port_flag = 0; + } + + if (devdata[t].ipath_kregbase) { + if (pd->port_rcvhdrtail_uaddr) { + pd->port_rcvhdrtail_uaddr = 0; + pd->port_rcvhdrtail_kvaddr = NULL; + ipath_munlock(1, &pd->port_rcvhdrtail_pagep); + pd->port_rcvhdrtail_pagep = NULL; + ipath_stats.sps_pageunlocks++; + } + ipath_kput_kreg_port(t, kr_rcvhdrtailaddr, port, 0ULL); + 
ipath_kput_kreg_port(pd->port_unit, kr_rcvhdraddr, + pd->port_port, 0); + + /* clean up the pkeys for this port user */ + ipath_clean_partkey(pd, dd); + + if (port < dd->ipath_cfgports) { + int i = dd->ipath_pbufsport * (port - 1); + ipath_disarm_piobufs(t, i, dd->ipath_pbufsport); + + /* atomically clear receive enable port. */ + atomic_clear_mask(1U << + (INFINIPATH_R_PORTENABLE_SHIFT + + port), + &devdata[t].ipath_rcvctrl); + ipath_kput_kreg(t, kr_rcvctrl, + devdata[t].ipath_rcvctrl); + + if (dd->ipath_pageshadow) { + /* + * unlock any expected TID + * entries port still had in use + */ + int port_tidbase = + pd->port_port * dd->ipath_rcvtidcnt; + int i, cnt = 0, maxtid = + port_tidbase + dd->ipath_rcvtidcnt; + + _IPATH_VDBG + ("Port %u unlocking any locked expTID pages\n", + pd->port_port); + for (i = port_tidbase; i < maxtid; i++) { + if (dd->ipath_pageshadow[i]) { + ipath_munlock(1, + &dd-> + ipath_pageshadow + [i]); + dd->ipath_pageshadow[i] + = NULL; + cnt++; + ipath_stats. + sps_pageunlocks++; + } + } + if (cnt) + _IPATH_VDBG + ("Port %u had %u expTID entries locked\n", + pd->port_port, cnt); + if (ipath_stats.sps_pagelocks + || ipath_stats.sps_pageunlocks) + _IPATH_VDBG + ("%llu pages locked, %llu unlocked with" + " ipath_m{un}lock\n", + ipath_stats.sps_pagelocks, + ipath_stats. + sps_pageunlocks); + } + ipath_stats.sps_ports--; + _IPATH_PRDBG("%s[%u] closed port %u:%u\n", + pd->port_comm, pd->port_pid, t, + port); + } + } + + pd->port_cnt = 0; + pd->port_pid = 0; + + ipath_clear_tids(t, pd->port_port); + + ipath_free_pddata(dd, pd->port_port, 0); + } + +done: + up(&ipath_mutex); + + return ret; +} + +/* + * cancel a range of PIO buffers, used when they might be armed, but + * not triggered. Used at init to ensure buffer state, and also user + * process close, in case it died while writing to a PIO buffer + */ + +static void ipath_disarm_piobufs(const ipath_type t, unsigned first, + unsigned cnt) +{ + unsigned i, last = first + cnt; + uint64_t sendctrl; + for (i = first; i < last; i++) { + sendctrl = devdata[t].ipath_sendctrl | INFINIPATH_S_DISARM | + (i << INFINIPATH_S_DISARMPIOBUF_SHIFT); + ipath_kput_kreg(t, kr_sendctrl, sendctrl); + } +} + +static void ipath_clean_partkey(ipath_portdata * pd, ipath_devdata * dd) +{ + int i, j, pchanged = 0; + uint64_t oldpkey; + + /* for debugging only */ + oldpkey = + (uint64_t) dd->ipath_pkeys[0] | ((uint64_t) dd-> + ipath_pkeys[1] << 16) + | ((uint64_t) dd->ipath_pkeys[2] << 32) + | ((uint64_t) dd->ipath_pkeys[3] << 48); + + for (i = 0; i < (sizeof(pd->port_pkeys) / sizeof(pd->port_pkeys[0])); + i++) { + if (!pd->port_pkeys[i]) + continue; + _IPATH_VDBG("look for key[%d] %hx in pkeys\n", i, + pd->port_pkeys[i]); + for (j = 0; + j < (sizeof(dd->ipath_pkeys) / sizeof(dd->ipath_pkeys[0])); + j++) { + /* check for match independent of the global bit */ + if ((dd->ipath_pkeys[j] & 0x7fff) == + (pd->port_pkeys[i] & 0x7fff)) { + if (atomic_dec_and_test(&dd->ipath_pkeyrefs[j])) { + _IPATH_VDBG + ("p%u clear key %x matches #%d\n", + pd->port_port, pd->port_pkeys[i], + j); + ipath_stats.sps_pkeys[j] = + dd->ipath_pkeys[j] = 0; + pchanged++; + } else + _IPATH_VDBG + ("p%u key %x matches #%d, but ref still %d\n", + pd->port_port, pd->port_pkeys[i], + j, + atomic_read(&dd-> + ipath_pkeyrefs[j])); + break; + } + } + pd->port_pkeys[i] = 0; + } + if (pchanged) { + uint64_t pkey; + pkey = + (uint64_t) dd->ipath_pkeys[0] | ((uint64_t) dd-> + ipath_pkeys[1] << 16) + | ((uint64_t) dd->ipath_pkeys[2] << 32) + | ((uint64_t) dd->ipath_pkeys[3] << 48); + 
_IPATH_VDBG("p%u old pkey reg %llx, new pkey reg %llx\n", + pd->port_port, oldpkey, pkey); + ipath_kput_kreg(pd->port_unit, kr_partitionkey, pkey); + } +} + +static unsigned int ipath_poll(struct file *fp, struct poll_table_struct *pt) +{ + int ret; + ipath_portdata *pd; + + pd = port_fp(fp); + /* nothing for select/poll in this driver, at least for now */ + ret = 0; + + return ret; +} + +/* + * wait up to msecs milliseconds for IB link state change to occur + * for now, take the easy polling route. Currently used only by + * the SMA ioctls. Returns 0 if state reached, otherwise -ETIMEDOUT + * state can have multiple states set, for any of several transitions. + */ + +int ipath_wait_linkstate(const ipath_type t, uint32_t state, int msecs) +{ + devdata[t].ipath_sma_state_wanted = state; + wait_event_interruptible_timeout(ipath_sma_state_wait, + (devdata[t].ipath_flags & state), + msecs_to_jiffies(msecs)); + devdata[t].ipath_sma_state_wanted = 0; + + if (!(devdata[t].ipath_flags & state)) + _IPATH_DBG + ("Didn't reach linkstate %s within %u ms (ibcc %llx %s)\n", + /* test INIT ahead of DOWN, both can be set */ + (state & IPATH_LINKINIT) ? "INIT" : + ((state & IPATH_LINKDOWN) ? "DOWN" : + ((state & IPATH_LINKARMED) ? "ARM" : "ACTIVE")), + msecs, ipath_kget_kreg64(t, kr_ibcctrl), + ipath_ibcstatus_str[ipath_kget_kreg64(t, kr_ibcstatus) & + 0xf]); + return (devdata[t].ipath_flags & state) ? 0 : -ETIMEDOUT; +} + +/* unit number is already validated in ipath_ioctl() */ +static int ipath_kset_lid(uint32_t arg) +{ + unsigned unit = (arg >> 16) & 0xffff; + + if (unit >= infinipath_max + || !(devdata[unit].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", unit); + return -ENODEV; + } + + arg &= 0xffff; + _IPATH_SMADBG("Unit %u setting lid to 0x%x, was 0x%x\n", unit, arg, + devdata[unit].ipath_lid); + ipath_set_sps_lid(unit, arg); + return 0; +} + +static int ipath_kset_mlid(uint32_t arg) +{ + unsigned unit = (arg >> 16) & 0xffff; + + if (unit >= infinipath_max + || !(devdata[unit].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", unit); + return -ENODEV; + } + + arg &= 0xffff; + _IPATH_SMADBG("Unit %u setting mlid to 0x%x, was 0x%x\n", unit, arg, + devdata[unit].ipath_mlid); + ipath_stats.sps_mlid[unit] = devdata[unit].ipath_mlid = arg; + if (devdata[unit].ipath_layer.l_intr) + devdata[unit].ipath_layer.l_intr(unit, IPATH_LAYER_INT_BCAST); + return 0; +} + +/* unit number is in incoming, overwritten on return with data */ + +static int ipath_get_devstatus(uint64_t * a) +{ + int ret; + uint64_t unit64; + uint32_t unit; + uint64_t devstatus; + + if ((ret = copy_from_user(&unit64, a, sizeof unit64))) { + _IPATH_DBG("Failed to copy in unit: %d\n", ret); + return ret; + } + unit = unit64; + if (unit >= infinipath_max + || !(devdata[unit].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", unit); + return -ENODEV; + } + + devstatus = *devdata[unit].ipath_statusp; + + if ((ret = copy_to_user(a, &devstatus, sizeof devstatus))) + _IPATH_DBG("Failed to copy out device status: %d\n", ret); + return ret; +} + +/* unit number is in incoming, overwritten on return with data */ + +static int ipath_get_mlid(uint32_t * a) +{ + int ret; + uint32_t unit; + uint32_t mlid; + + if ((ret = copy_from_user(&unit, a, sizeof unit))) { + _IPATH_DBG("Failed to copy in mlid: %d\n", ret); + return ret; + } + if (unit >= infinipath_max + || !(devdata[unit].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", unit); + return -ENODEV; + } + + mlid = devdata[unit].ipath_mlid; 
+ + if ((ret = copy_to_user(a, &mlid, sizeof mlid))) + _IPATH_DBG("Failed to copy out MLID: %d\n", ret); + return ret; +} + +static int ipath_kset_guid(struct ipath_setguid *a) +{ + struct ipath_setguid setguid; + int ret; + + if ((ret = copy_from_user(&setguid, a, sizeof setguid))) { + _IPATH_DBG("Failed to copy in guid info: %d\n", ret); + return ret; + } + if (setguid.sunit >= infinipath_max || + !(devdata[setguid.sunit].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %llu\n", setguid.sunit); + return -ENODEV; + } + if (setguid.sguid == 0ULL || setguid.sguid == ~0ULL) { + /* + * use INFO, not DBG, because ipath_mux doesn't yet + * complain about errors on this + */ + + _IPATH_INFO("Ignoring attempt to set invalid GUID %llx\n", + setguid.sguid); + return -EINVAL; + } + devdata[setguid.sunit].ipath_guid = setguid.sguid; + devdata[setguid.sunit].ipath_nguid = 1; + _IPATH_DBG("SMA set hardware GUID unit %llu to %llx (network order)\n", + setguid.sunit, devdata[setguid.sunit].ipath_guid); + return 0; +} + +/* + * receive an IB packet with QP 0 or 1. For now, we have no timeout implemented + * We put the actual received count into the iov on return, and the unit we + * received from goes into the lower 16 bits of sps_flags. + * This receives from all/any of the active chips, and we currently do not + * allow specifying just one (we could, by filling in unit in the library + * before the syscall, and checking here). + */ + +static int ipath_rcvsma_pkt(struct ipath_sendpkt * p) +{ + struct ipath_sendpkt rpkt; + int i, any, ret; + unsigned long flags; + + if ((ret = copy_from_user(&rpkt, p, sizeof rpkt))) { + _IPATH_DBG("Failed to copy in pkt struct (%d)\n", ret); + return ret; + } + if (!ipath_sma_data_spare) { + _IPATH_DBG("can't do receive, sma not initialized\n"); + return -ENETDOWN; + } + + for (any = i = 0; i < infinipath_max; i++) + if (devdata[i].ipath_flags & IPATH_INITTED) + any++; + if (!any) { /* no hardware, freeze, etc. */ + _IPATH_SMADBG("Didn't find any initialized and usable chips\n"); + return -ENODEV; + } + + wait_event_interruptible(ipath_sma_wait, + ipath_sma_data[ipath_sma_first].len); + + spin_lock_irqsave(&ipath_sma_lock, flags); + if (ipath_sma_data[ipath_sma_first].len) { + int len; + uint32_t slen; + uint8_t *sdata; + struct _ipath_sma_rpkt *smpkt = + &ipath_sma_data[ipath_sma_first]; + + /* + * we swap out the buffer we are going to use with the + * spare buffer and set spare to that buffer. This code + * is the only code that ever manipulates spare, other + * than the initialization code. This code should never + * be entered by more than one process at a time, and + * if it is, the user code doing so deserves what it gets; + * it won't break anything in the driver by doing so. + * We do it this way to avoid holding a lock across the + * copy_to_user, which could fault, or delay a long time + * while paging occurs; ditto for printks + */ + + slen = smpkt->len; + sdata = smpkt->buf; + rpkt.sps_flags = smpkt->unit; + smpkt->buf = ipath_sma_data_spare; + ipath_sma_data_spare = sdata; + smpkt->len = 0; /* it's available again */ + if (++ipath_sma_first >= IPATH_NUM_SMAPKTS) + ipath_sma_first = 0; + spin_unlock_irqrestore(&ipath_sma_lock, flags); + + len = min((uint32_t) rpkt.sps_iov[0].iov_len, slen); + ret = + copy_to_user((void *)rpkt.sps_iov[0].iov_base, sdata, len); + _IPATH_VDBG + ("SMA packet (index=%d), len %d (actual %d) buf %p, ubuf %llx\n", + ipath_sma_first, slen, len, sdata, + rpkt.sps_iov[0].iov_base); + if (!ret) { + /* actual length read. 
*/
+			rpkt.sps_iov[0].iov_len = len;
+			rpkt.sps_cnt = 1;	/* received one packet */
+			if ((ret = copy_to_user(p, &rpkt, sizeof rpkt)))
+				_IPATH_DBG
+				    ("Failed to copy out pkt struct (%d)\n",
+				     ret);
+		} else
+			_IPATH_DBG("copyout failed: %d\n", ret);
+	} else {
+		/* usually means SMA process received a signal */
+		spin_unlock_irqrestore(&ipath_sma_lock, flags);
+		return -EAGAIN;
+	}
+
+	return ret;
+}
+
+/* unit number is in first word incoming, overwritten on return with data */
+static int ipath_get_portinfo(uint32_t * a)
+{
+	int ret;
+	uint32_t unit, tmp, tmp2;
+	ipath_devdata *dd;
+	uint32_t portinfo[13];	/* just the data for PortInfo, in host order */
+
+	if ((ret = copy_from_user(&unit, a, sizeof unit))) {
+		_IPATH_DBG("Failed to copy in portinfo: %d\n", ret);
+		return ret;
+	}
+	if (unit >= infinipath_max
+	    || !(devdata[unit].ipath_flags & IPATH_INITTED)) {
+		_IPATH_DBG("Invalid unit %u\n", unit);
+		return -ENODEV;
+	}
+	dd = &devdata[unit];
+	/* so we only initialize non-zero fields. */
+	memset(portinfo, 0, sizeof portinfo);
+
+	/*
+	 * Notimpl yet M_Key (64)
+	 * Notimpl yet GID (64)
+	 */
+
+	portinfo[4] = (dd->ipath_lid << 16);
+
+	/*
+	 * Notimpl yet SMLID (should we store this in the driver, in
+	 * case SMA dies?)
+	 * CapabilityMask is 0, we don't support any of these
+	 * DiagCode is 0; we don't store any diag info for now
+	 * Notimpl yet M_KeyLeasePeriod (we don't support M_Key)
+	 */
+
+	/* LocalPortNum is whichever port number they ask for */
+	portinfo[7] = (unit << 24)
+	    /* LinkWidthEnabled */
+	    |(2 << 16)
+	    /* LinkWidthSupported (really 2, but that's not IB valid...) */
+	    |(3 << 8)
+	    /* LinkWidthActive */
+	    |(2 << 0);
+	tmp = dd->ipath_lastibcstat & 0xff;
+	tmp2 = 5;
+	if (tmp == 0x11)
+		tmp = 2;
+	else if (tmp == 0x21)
+		tmp = 3;
+	else if (tmp == 0x31)
+		tmp = 4;
+	else {
+		tmp = 0;	/* down */
+		tmp2 = tmp & 0xf;
+	}
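+	/*
+	 * Worked mapping (editorial note, not in the original code): the
+	 * low byte of the last ibcstatus yields the standard IB PortState
+	 * values here -- 0x11 -> 2 (Initialize), 0x21 -> 3 (Armed),
+	 * 0x31 -> 4 (Active); anything else is reported as down.
+	 */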
+	portinfo[8] = (1 << 28)	/* LinkSpeedSupported */
+	    |(tmp << 24)	/* PortState */
+	    |(tmp2 << 20)	/* PortPhysicalState */
+	    |(2 << 16)
+	    /* LinkDownDefaultState */
+	    /* M_KeyProtectBits == 0 */
+	    /* NotImpl yet LMC == 0 (we can support all values) */
+	    |(1 << 4)		/* LinkSpeedActive */
+	    |(1 << 0);		/* LinkSpeedEnabled */
+	switch (dd->ipath_ibmtu) {
+	case 4096:
+		tmp = 5;
+		break;
+	case 2048:
+		tmp = 4;
+		break;
+	case 1024:
+		tmp = 3;
+		break;
+	case 512:
+		tmp = 2;
+		break;
+	case 256:
+		tmp = 1;
+		break;
+	default:		/* oops, something is wrong */
+		_IPATH_DBG
+		    ("Problem, ipath_ibmtu 0x%x not a valid IB MTU, treat as 2048\n",
+		     dd->ipath_ibmtu);
+		tmp = 4;
+		break;
+	}
+	portinfo[9] = (tmp << 28)
+	    /* NeighborMTU */
+	    /* Notimpl MasterSMSL */
+	    |(1 << 20)
+	    /* VLCap */
+	    /* Notimpl InitType (actually, an SMA decision) */
+	    /* VLHighLimit is 0 (only one VL) */
+	    ;			/* VLArbitrationHighCap is 0 (only one VL) */
+	portinfo[10] =		/* VLArbitrationLowCap is 0 (only one VL) */
+	    /* InitTypeReply is SMA decision */
+	    (5 << 16)		/* MTUCap 4096 */
+	    |(7 << 13)		/* VLStallCount */
+	    |(0x1f << 8)	/* HOQLife */
+	    |(1 << 4)		/* OperationalVLs 0 */
+	    |(1 << 3)
+	    /* PartitionEnforcementInbound */
+	    /* PartitionEnforcementOutbound not enforced */
+	    /* FilterRawInbound not enforced */
+	    ;			/* FilterRawOutbound not enforced */
+	/* M_KeyViolations are not counted by hardware, SMA can count */
+	tmp = ipath_kget_creg32(unit, cr_errpkey);
+	/* P_KeyViolations are counted by hardware. */
+	portinfo[11] = ((tmp & 0xffff) << 0);
+	portinfo[12] =
+	    /* Q_KeyViolations are not counted by hardware */
+	    (1 << 8)
+	    /* GUIDCap */
+	    /* SubnetTimeOut handled by SMA */
+	    /* RespTimeValue handled by SMA */
+	    ;
+	/* LocalPhyErrors are programmed to max */
+	portinfo[12] |= (0xf << 20)
+	    |(0xf << 16)	/* OverRunErrors are programmed to max */
+	    ;
+
+	if ((ret = copy_to_user(a, portinfo, sizeof portinfo)))
+		_IPATH_DBG("Failed to copy out portinfo: %d\n", ret);
+	return ret;
+}
+
+/* unit number is in first word incoming, overwritten on return with data */
+static int ipath_get_nodeinfo(uint32_t * a)
+{
+	int ret;
+	uint32_t unit;		/*, tmp, tmp2; */
+	ipath_devdata *dd;
+	uint32_t nodeinfo[10];	/* just the data for NodeInfo, in host order */
+
+	if ((ret = copy_from_user(&unit, a, sizeof unit))) {
+		_IPATH_DBG("Failed to copy in nodeinfo: %d\n", ret);
+		return ret;
+	}
+	if (unit >= infinipath_max
+	    || !(devdata[unit].ipath_flags & IPATH_INITTED)) {
+		/* VDBG because sma normally probes for all possible units */
+		_IPATH_VDBG("Invalid unit %u\n", unit);
+		return -ENODEV;
+	}
+	dd = &devdata[unit];
+
+	/* so we only initialize non-zero fields. */
+	memset(nodeinfo, 0, sizeof nodeinfo);
+
+	nodeinfo[0] =		/* BaseVersion is SMA */
+	    /* ClassVersion is SMA */
+	    (1 << 8)		/* NodeType */
+	    |(1 << 0);		/* NumPorts */
+	nodeinfo[1] = (uint32_t) (dd->ipath_guid >> 32);
+	nodeinfo[2] = (uint32_t) (dd->ipath_guid & 0xffffffff);
+	nodeinfo[3] = nodeinfo[1];	/* NodeGUID == SystemImageGUID for us */
+	nodeinfo[4] = nodeinfo[2];	/* NodeGUID == SystemImageGUID for us */
+	nodeinfo[5] = nodeinfo[3];	/* PortGUID == NodeGUID for us */
+	nodeinfo[6] = nodeinfo[4];	/* PortGUID == NodeGUID for us */
+	nodeinfo[7] = (4 << 16)	/* we support 4 pkeys */
+	    |(dd->ipath_deviceid << 0);
+	/* our chip version as 16 bits major, 16 bits minor */
+	nodeinfo[8] = dd->ipath_minrev | (dd->ipath_majrev << 16);
+	nodeinfo[9] = (unit << 24) | (dd->ipath_vendorid << 0);
+
+	if ((ret = copy_to_user(a, nodeinfo, sizeof nodeinfo)))
+		_IPATH_DBG("Failed to copy out nodeinfo: %d\n", ret);
+	return ret;
+}
+
+static int ipath_sma_ioctl(struct file *fp, unsigned int cmd, unsigned long a)
+{
+	int ret = 0;
+	switch (cmd) {
+	case IPATH_SEND_SMA_PKT:	/* send SMA packet */
+		if (!(ret = ipath_send_smapkt((struct ipath_sendpkt *) a)))
+			/* another SMA packet sent */
+			ipath_stats.sps_sma_spkts++;
+		break;
+	case IPATH_RCV_SMA_PKT:	/* receive an SMA or MAD packet */
+		ret = ipath_rcvsma_pkt((struct ipath_sendpkt *) a);
+		break;
+	case IPATH_SET_LID:	/* set our lid, (SMA) */
+		ret = ipath_kset_lid((uint32_t) a);
+		break;
+	case IPATH_SET_MTU:	/* set the IB mtu (not maxpktlen) (SMA) */
+		ret = ipath_kset_mtu((uint32_t) a);
+		break;
+	case IPATH_SET_LINKSTATE:
+		/* walk through the linkstate states (SMA) */
+		ret = ipath_kset_linkstate((uint32_t) a);
+		break;
+	case IPATH_GET_PORTINFO:	/* get the SMA portinfo */
+		ret = ipath_get_portinfo((uint32_t *) a);
+		break;
+	case IPATH_GET_NODEINFO:	/* get the SMA nodeinfo */
+		ret = ipath_get_nodeinfo((uint32_t *) a);
+		break;
+	case IPATH_SET_GUID:
+		/*
+		 * set our guid, (SMA).  This is not normally
+		 * used, but provides a way to set the GUID when the i2c flash
+		 * has a problem, or for special testing.
+ */ + ret = ipath_kset_guid((struct ipath_setguid *)a); + break; + case IPATH_SET_MLID: /* set multicast LID for ipath broadcast */ + ret = ipath_kset_mlid((uint32_t) a); + break; + case IPATH_GET_MLID: /* get multicast LID for ipath broadcast */ + ret = ipath_get_mlid((uint32_t *) a); + break; + case IPATH_GET_DEVSTATUS: /* get device status */ + ret = ipath_get_devstatus((uint64_t *) a); + break; + default: + _IPATH_DBG("%x not a valid SMA ioctl for infinipath\n", cmd); + ret = -EINVAL; + break; + } + return ret; +} + +static int ipath_get_unit_counters(struct infinipath_getunitcounters *a) +{ + struct infinipath_getunitcounters c; + + if(copy_from_user(&c, (void *)a, sizeof c)) + return -EFAULT; + return ipath_get_counters(c.unit, (struct infinipath_counters *)c.data); +} -- 0.99.9n From rolandd at cisco.com Fri Dec 16 15:48:55 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 16 Dec 2005 15:48:55 -0800 Subject: [openib-general] [PATCH 06/13] [RFC] ipath LLD core, part 3 In-Reply-To: <200512161548.YvnmQHKTsmmCBp1k@cisco.com> Message-ID: <200512161548.KglSM2YESlGlEQfQ@cisco.com> Last part of core driver --- drivers/infiniband/hw/ipath/ipath_driver.c | 2380 ++++++++++++++++++++++++++++ 1 files changed, 2380 insertions(+), 0 deletions(-) f7ffc0cabd62be5e13ad84027d5712e6f92d9cc1 diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c index 0dee4ce..87b6dae 100644 --- a/drivers/infiniband/hw/ipath/ipath_driver.c +++ b/drivers/infiniband/hw/ipath/ipath_driver.c @@ -4877,3 +4877,2383 @@ static int ipath_wait_intr(ipath_portdat } return 0; } + +/* + * The new implementation as of Oct 2004 is that the driver assigns + * the tid and returns it to the caller. To make it easier to + * catch bugs, and to reduce search time, we keep a cursor for + * each port, walking the shadow tid array to find one that's not + * in use. + * + * For now, if we can't allocate the full list, we fail, although + * in the long run, we'll allocate as many as we can, and the + * caller will deal with that by trying the remaining pages later. + * That means that when we fail, we have to mark the tids as not in + * use again, in our shadow copy. + * + * It's up to the caller to free the tids when they are done. + * We'll unlock the pages as they free them. + * + * Also, right now we are locking one page at a time, but since + * the intended use of this routine is for a single group of + * virtually contiguous pages, that should change to improve + * performance. + */ +static int ipath_tid_update(ipath_portdata * pd, struct _tidupd *tidu) +{ + int ret = 0, ntids; + uint32_t tid, porttid, cnt, i, tidcnt; + struct _tidupd tu; + uint16_t *tidlist; + ipath_devdata *dd = &devdata[pd->port_unit]; + uint64_t vaddr, physaddr, lenvalid; + volatile uint64_t *tidbase; + uint64_t tidmap[8]; + struct page **pagep = NULL; + + tu.tidcnt = 0; /* for early errors */ + if (!dd->ipath_pageshadow) { + ret = -ENOMEM; + goto done; + } + if (copy_from_user(&tu, tidu, sizeof tu)) { + ret = -EFAULT; + goto done; + } + + if (!(cnt = tu.tidcnt)) { + _IPATH_DBG("After copyin, tidcnt 0, tidlist %llx\n", + tu.tidlist); + /* or should we treat as success? 
likely a bug */ + ret = -EFAULT; + goto done; + } + tidcnt = dd->ipath_rcvtidcnt; + if (cnt >= tidcnt) { /* make sure it all fits in port_tid_pg_list */ + _IPATH_INFO + ("Process tried to allocate %u TIDs, only trying max (%u)\n", + cnt, tidcnt); + cnt = tidcnt; + } + pagep = (struct page **)pd->port_tid_pg_list; + tidlist = (uint16_t *) (&pagep[cnt]); + + memset(tidmap, 0, sizeof(tidmap)); + tid = pd->port_tidcursor; + /* before decrement; chip actual # */ + porttid = pd->port_port * tidcnt; + ntids = tidcnt; + tidbase = (volatile uint64_t *)((volatile char *) + (devdata[pd->port_unit]. + ipath_kregbase) + + devdata[pd->port_unit]. + ipath_rcvtidbase + + porttid * sizeof(*tidbase)); + + _IPATH_VDBG("Port%u %u tids, cursor %u, tidbase %p\n", pd->port_port, + cnt, tid, tidbase); + + vaddr = tu.tidvaddr; /* virtual address of first page in transfer */ + if (!access_ok(VERIFY_WRITE, (void *)vaddr, cnt * PAGE_SIZE)) { + _IPATH_DBG("Fail vaddr %llx, %u pages, !access_ok\n", + vaddr, cnt); + ret = -EFAULT; + goto done; + } + if ((ret = ipath_mlock((unsigned long)vaddr, cnt, pagep))) { + if (ret == -EBUSY) { + _IPATH_DBG + ("Failed to lock addr %p, %u pages (already locked)\n", + (void *)vaddr, cnt); + /* + * for now, continue, and see what happens + * but with the new implementation, this should + * never happen, unless perhaps the user has + * mpin'ed the pages themselves (something we + * need to test) + */ + ret = 0; + } else { + _IPATH_INFO + ("Failed to lock addr %p, %u pages: errno %d\n", + (void *)vaddr, cnt, -ret); + goto done; + } + } + for (i = 0; i < cnt; i++, vaddr += PAGE_SIZE) { + for (; ntids--; tid++) { + if (tid == tidcnt) + tid = 0; + if (!dd->ipath_pageshadow[porttid + tid]) + break; + } + if (ntids < 0) { + /* + * oops, wrapped all the way through their TIDs, + * and didn't have enough free; see comments at + * start of routine + */ + _IPATH_DBG + ("Not enough free TIDs for %u pages (index %d), failing\n", + cnt, i); + i--; /* last tidlist[i] not filled in */ + ret = -ENOMEM; + break; + } + tidlist[i] = tid; + _IPATH_VDBG("Updating idx %u to TID %u, vaddr %llx\n", + i, tid, vaddr); + /* for now we "know" system pages and TID pages are same size */ + /* for ipath_free_tid */ + dd->ipath_pageshadow[porttid + tid] = pagep[i]; + __set_bit(tid, tidmap); /* don't need atomic or it's overhead */ + physaddr = page_to_phys(pagep[i]); + ipath_stats.sps_pagelocks++; + _IPATH_VDBG("TID %u, vaddr %llx, physaddr %llx pgp %p\n", + tid, vaddr, physaddr, pagep[i]); + /* + * in words (fixed, full page). could make less for very last + * page in transfer, but for now we won't worry about it. + */ + lenvalid = PAGE_SIZE >> 2; + lenvalid <<= INFINIPATH_RT_BUFSIZE_SHIFT; + physaddr |= lenvalid | INFINIPATH_RT_VALID; + ipath_kput_memq(pd->port_unit, &tidbase[tid], physaddr); + /* + * don't check this tid in ipath_portshadow, since we + * just filled it in; start with the next one. + */ + tid++; + } + + if (ret) { + uint32_t limit; + uint64_t tidval; + /* + * chip errata bug 7358, try to work around it by + * marking invalid tids as having max length + */ + tidval = + (~0ULL & INFINIPATH_RT_BUFSIZE_MASK) << + INFINIPATH_RT_BUFSIZE_SHIFT; + cleanup: + /* jump here if copy out of updated info failed... 
*/ + _IPATH_DBG("After failure (ret=%d), undo %d of %d entries\n", + -ret, i, cnt); + /* same code that's in ipath_free_tid() */ + if ((limit = sizeof(tidmap) * _BITS_PER_BYTE) > tidcnt) + /* just in case size changes in future */ + limit = tidcnt; + tid = find_first_bit((const unsigned long *)tidmap, limit); + /* + * chip errata bug 7358, try to work around it by + * marking invalid tids as having max length + */ + tidval = + (~0ULL & INFINIPATH_RT_BUFSIZE_MASK) << + INFINIPATH_RT_BUFSIZE_SHIFT; + for (; tid < limit; tid++) { + if (!test_bit(tid, tidmap)) + continue; + if (dd->ipath_pageshadow[porttid + tid]) { + _IPATH_VDBG("Freeing TID %u\n", tid); + ipath_kput_memq(pd->port_unit, &tidbase[tid], + tidval); + dd->ipath_pageshadow[porttid + tid] = NULL; + ipath_stats.sps_pageunlocks++; + } + } + (void)ipath_munlock(cnt, pagep); + } else { + /* + * copy the updated array, with ipath_tid's filled in, + * back to user. Since we did the copy in already, this + * "should never fail" + * If it does, we have to clean up... + */ + int r; + if ((r = + copy_to_user((void *)tu.tidlist, tidlist, + cnt * sizeof(*tidlist)))) { + _IPATH_DBG + ("Failed to copy out %d TIDs (%lx bytes) to %llx (ret %x)\n", + cnt, cnt * sizeof(*tidlist), tu.tidlist, r); + ret = -EFAULT; + goto cleanup; + } + if (copy_to_user((void *)tu.tidmap, tidmap, sizeof tidmap)) { + _IPATH_DBG("Failed to copy out TID map to %llx\n", + tu.tidmap); + ret = -EFAULT; + goto cleanup; + } + if (tid == tidcnt) + tid = 0; + pd->port_tidcursor = tid; + } + +done: + if (ret) + _IPATH_DBG + ("Failed to map %u TID pages, failing with %d, tidu %p\n", + tu.tidcnt, -ret, tidu); + return ret; +} + +/* + * right now we are unlocking one page at a time, but since + * the intended use of this routine is for a single group of + * virtually contiguous pages, that should change to improve + * performance. We check that the TID is in range for this port + * but otherwise don't check validity; if user has an error and + * frees the wrong tid, it's only their own data that can thereby + * be corrupted. We do check that the TID was in use, for sanity + * We always use our idea of the saved address, not the address that + * they pass in to us. 
+ */ + +static int ipath_tid_free(ipath_portdata * pd, struct _tidupd *tidu) +{ + int ret = 0; + uint32_t tid, porttid, cnt, limit, tidcnt; + struct _tidupd tu; + ipath_devdata *dd = &devdata[pd->port_unit]; + uint64_t *tidbase; + uint64_t tidmap[8]; + uint64_t tidval; + + tu.tidcnt = 0; /* for early errors */ + if (!dd->ipath_pageshadow) { + ret = -ENOMEM; + goto done; + } + + if (copy_from_user(&tu, tidu, sizeof tu)) { + _IPATH_DBG("copy of tidupd structure failed\n"); + ret = -EFAULT; + goto done; + } + if (copy_from_user(tidmap, (void *)tu.tidmap, sizeof tidmap)) { + _IPATH_DBG("copy of tidmap failed\n"); + ret = -EFAULT; + goto done; + } + + porttid = pd->port_port * dd->ipath_rcvtidcnt; + tidbase = + (uint64_t *) ((char *)(devdata[pd->port_unit].ipath_kregbase) + + devdata[pd->port_unit].ipath_rcvtidbase + + porttid * sizeof(*tidbase)); + + tidcnt = dd->ipath_rcvtidcnt; + if ((limit = sizeof(tidmap) * _BITS_PER_BYTE) > tidcnt) + limit = tidcnt; /* just in case size changes in future */ + tid = find_first_bit((const unsigned long *)tidmap, limit); + _IPATH_VDBG + ("Port%u free %u tids; first bit (max=%d) set is %d, porttid %u\n", + pd->port_port, tu.tidcnt, limit, tid, porttid); + /* + * chip errata bug 7358, try to work around it by marking invalid + * tids as having max length + */ + tidval = + (~0ULL & INFINIPATH_RT_BUFSIZE_MASK) << INFINIPATH_RT_BUFSIZE_SHIFT; + for (cnt = 0; tid < limit; tid++) { + /* + * small optimization; if we detect a run of 3 or so without + * any set, use find_first_bit again. That's mainly to + * accelerate the case where we wrapped, so we have some at + * the beginning, and some at the end, and a big gap + * in the middle. + */ + if (!test_bit(tid, tidmap)) + continue; + cnt++; + if (dd->ipath_pageshadow[porttid + tid]) { + _IPATH_VDBG("Freeing TID %u\n", tid); + ipath_kput_memq(pd->port_unit, &tidbase[tid], tidval); + ipath_munlock(1, &dd->ipath_pageshadow[porttid + tid]); + dd->ipath_pageshadow[porttid + tid] = NULL; + ipath_stats.sps_pageunlocks++; + } else + _IPATH_DBG("Unused tid %u, ignoring\n", tid); + } + if (cnt != tu.tidcnt) + _IPATH_DBG("passed in tidcnt %d, only %d bits set in map\n", + tu.tidcnt, cnt); +done: + if (ret) + _IPATH_DBG("Failed to unmap %u TID pages, failing with %d\n", + tu.tidcnt, -ret); + return ret; +} + +/* called from user init code, and also layered driver init */ +int ipath_setrcvhdrsize(const ipath_type mdev, unsigned rhdrsize) +{ + int ret = 0; + if (devdata[mdev].ipath_flags & IPATH_RCVHDRSZ_SET) { + if (devdata[mdev].ipath_rcvhdrsize != rhdrsize) { + _IPATH_INFO + ("Error: can't set protocol header size %u, already %u\n", + rhdrsize, devdata[mdev].ipath_rcvhdrsize); + ret = -EAGAIN; + } else + /* OK if set already, with same value, nothing to do */ + _IPATH_VDBG("Reuse same protocol header size %u\n", + devdata[mdev].ipath_rcvhdrsize); + } else if (rhdrsize > + (devdata[mdev].ipath_rcvhdrentsize - + (sizeof(uint64_t) / sizeof(uint32_t)))) { + _IPATH_DBG + ("Error: can't set protocol header size %u (> max %u)\n", + rhdrsize, + devdata[mdev].ipath_rcvhdrentsize - + (uint32_t) (sizeof(uint64_t) / sizeof(uint32_t))); + ret = -EOVERFLOW; + } else { + devdata[mdev].ipath_flags |= IPATH_RCVHDRSZ_SET; + devdata[mdev].ipath_rcvhdrsize = rhdrsize; + ipath_kput_kreg(mdev, kr_rcvhdrsize, + devdata[mdev].ipath_rcvhdrsize); + _IPATH_VDBG("Set protocol header size to %u\n", + devdata[mdev].ipath_rcvhdrsize); + } + return ret; +} + +/* + * find an available pio buffer, and do appropriate marking as busy, etc. 
+ * returns buffer number if one found (>=0), negative number is error. + * Used by ipath_send_smapkt and ipath_layer_send + */ +int ipath_getpiobuf(int mdev) +{ + int i, j, starti, updated = 0; + unsigned piobcnt, iter; + unsigned long flags; + ipath_devdata *dd = &devdata[mdev]; + uint64_t *shadow = dd->ipath_pioavailshadow; + + piobcnt = (unsigned)dd->ipath_piobcnt; + starti = dd->ipath_lastport_piobuf; + iter = piobcnt - starti; + if (dd->ipath_upd_pio_shadow) { + /* + * minor optimization. If we had no buffers on last call, + * start out by doing the update; continue and do scan + * even if no buffers were updated, to be paranoid + */ + ipath_update_pio_bufs(mdev); + /* we scanned here, don't do it at end of scan */ + updated = 1; + i = starti; + } else + i = dd->ipath_lastpioindex; + +rescan: + /* + * while test_and_set_bit() is atomic, + * we do that and then the change_bit(), and the pair is not. + * See if this is the cause of the remaining armlaunch errors. + */ + spin_lock_irqsave(&ipath_pioavail_lock, flags); + for (j = 0; j < iter; j++, i++) { + if (i >= piobcnt) + i = starti; + /* + * To avoid bus lock overhead, we first find a candidate + * buffer, then do the test and set, and continue if + * that fails. + */ + if (test_bit((2 * i) + 1, shadow) || + test_and_set_bit((2 * i) + 1, shadow)) { + continue; + } + /* flip generation bit */ + change_bit(2 * i, shadow); + break; + } + spin_unlock_irqrestore(&ipath_pioavail_lock, flags); + + if (j == iter) { + /* + * first time through; shadow exhausted, but may be + * real buffers available, so go see; if any updated, + * rescan (once) + */ + if (!updated) { + ipath_update_pio_bufs(mdev); + updated = 1; + i = starti; + goto rescan; + } + dd->ipath_upd_pio_shadow = 1; + /* not atomic, but if we lose one once in a while, that's OK */ + ipath_stats.sps_nopiobufs++; + if (!(++dd->ipath_consec_nopiobuf % 100000)) { + _IPATH_DBG + ("%u pio sends with no bufavail; dmacopy: %llx %llx %llx %llx; shadow: %llx %llx %llx %llx\n", + dd->ipath_consec_nopiobuf, + dd->ipath_pioavailregs_dma[0], + dd->ipath_pioavailregs_dma[1], + dd->ipath_pioavailregs_dma[2], + dd->ipath_pioavailregs_dma[3], + shadow[0], shadow[1], shadow[2], shadow[3]); + /* + * 4 buffers per byte, 4 registers above, cover + * rest below + */ + if (dd->ipath_piobcnt > (sizeof(shadow[0]) * 4 * 4)) + _IPATH_DBG + ("2nd group: dmacopy: %llx %llx %llx %llx; shadow: %llx %llx %llx %llx\n", + dd->ipath_pioavailregs_dma[4], + dd->ipath_pioavailregs_dma[5], + dd->ipath_pioavailregs_dma[6], + dd->ipath_pioavailregs_dma[7], + shadow[4], shadow[5], shadow[6], + shadow[7]); + } + return -EBUSY; + } + + if (updated && dd->ipath_layer.l_intr) { + /* + * ran out of bufs, now some (at least this one we just got) + * are now available, so tell the layered driver. + */ + dd->ipath_layer.l_intr(mdev, IPATH_LAYER_INT_SEND_CONTINUE); + } + + /* + * set next starting place. Since it's just an optimization, + * it doesn't matter who wins on this, so no locking + */ + dd->ipath_lastpioindex = i + 1; + if(dd->ipath_upd_pio_shadow) + dd->ipath_upd_pio_shadow = 0; + if(dd->ipath_consec_nopiobuf) + dd->ipath_consec_nopiobuf = 0; + return i; +} + +/* + * this is like ipath_getpiobuf(), except it just probes to see if a buffer + * is available. If it returns that there is one, it's not allocated, + * and so may not be available if caller tries to send. + * NOTE: This can be called from interrupt context by ipath_intr() + * and from non-interrupt context by layer_send_getpiobuf(). 
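+ *
+ * (Editorial aside, added for clarity: the pioavail shadow keeps two
+ * bits per buffer -- bit 2*i+1 is the "busy" bit tested here and in
+ * ipath_getpiobuf(), bit 2*i is a generation flag flipped on each
+ * allocation -- so each 64-bit shadow word covers 32 buffers, i.e.
+ * "4 buffers per byte" as the debug output above puts it.)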
+ */ +int ipath_bufavail(int mdev) +{ + int i; + unsigned piobcnt; + uint64_t *shadow = devdata[mdev].ipath_pioavailshadow; + + piobcnt = (unsigned)devdata[mdev].ipath_piobcnt; + + for (i = devdata[mdev].ipath_lastport_piobuf; i < piobcnt; i++) + if (!test_bit((2 * i) + 1, shadow)) + return 1; + + /* if none, check for update and rescan if we updated */ + ipath_update_pio_bufs(mdev); + for (i = devdata[mdev].ipath_lastport_piobuf; i < piobcnt; i++) + if (!test_bit((2 * i) + 1, shadow)) + return 1; + _IPATH_PDBG("No bufs avail\n"); + return 0; +} + +/* + * This routine is no longer on any critical paths; it is used only + * for sending SMA packets, but that could change in the future, so it + * should be kept pretty tight, with anything that + * increases the cache footprint, adds branches, etc. carefully + * examined, and if needed only for unusual cases, should, be moved out to + * a separate routine, or out of the main execution path. + * Because it's currently sma only, there are no checks to see if the + * link is up; sma must be able to send in the not fully initialized state + */ +int ipath_send_smapkt(struct ipath_sendpkt * upkt) +{ + int i, ret = 0, whichpb; + uint32_t *piobuf, plen = 0, clen; + uint64_t pboff; + struct ipath_sendpkt kpkt; + struct ipath_iovec *iov = kpkt.sps_iov; + ipath_type t; + + if (unlikely((copy_from_user(&kpkt, upkt, sizeof kpkt)))) + ret = -EFAULT; + if (ret) { + _IPATH_VDBG("Send failed: error %d\n", -ret); + goto done; + } + t = kpkt.sps_flags; + if (t >= infinipath_max || !(devdata[t].ipath_flags & IPATH_PRESENT) || + !devdata[t].ipath_kregbase) { + _IPATH_SMADBG("illegal unit %u for sma send\n", t); + return -ENODEV; + } + if (!(devdata[t].ipath_flags & IPATH_INITTED)) { + /* no hardware, freeze, etc. */ + _IPATH_SMADBG("unit %u not usable\n", t); + return -ENODEV; + } + + /* need total length before first word written */ + plen = sizeof(uint32_t); /* +1 word is for the qword padding */ + for (i = 0; i < kpkt.sps_cnt; i++) + /* each must be dword multiple */ + plen += kpkt.sps_iov[i].iov_len; + + if ((plen + 4) > devdata[t].ipath_ibmaxlen) { + _IPATH_DBG("Pkt len 0x%x > ibmaxlen %x!\n", plen - 4, + devdata[t].ipath_ibmaxlen); + ret = -EINVAL; + goto done; /* before writing pbc */ + } + plen >>= 2; /* in words */ + + whichpb = ipath_getpiobuf(t); + if (whichpb < 0) { + ret = whichpb; + devdata[t].ipath_nosma_bufs++; + _IPATH_SMADBG("No PIO buffers available unit %u %u times\n", + t, devdata[t].ipath_nosma_bufs); + goto done; + } + if(devdata[t].ipath_nosma_bufs) { + _IPATH_SMADBG( + "Unit %u got SMA send buffer after %u failures, %u seconds\n", + t, devdata[t].ipath_nosma_bufs, devdata[t].ipath_nosma_secs); + devdata[t].ipath_nosma_bufs = 0; + devdata[t].ipath_nosma_secs = 0; + } + if((devdata[t].ipath_lastibcstat & 0x11) != 0x11 && + (devdata[t].ipath_lastibcstat & 0x21) != 0x21) { + /* we need to be at least at INIT for SMA packets to go out. If we + * aren't, something has gone wrong, and SMA hasn't noticed. + * Therefore we'll try to go to INIT here, in hopes of fixing up the + * problem. 
First we verify that indeed the state is still "bad"
+	 * (that is, that lastibcstat isn't "stale") */
+		uint64_t val;
+		val = ipath_kget_kreg64(t, kr_ibcstatus);
+		if((val & 0x11) != 0x11 && (val & 0x21) != 0x21) {
+			_IPATH_SMADBG("Invalid Link state 0x%llx unit %u for send, try INIT\n",
+				val, t);
+			ipath_set_ib_lstate(t, INFINIPATH_IBCC_LINKCMD_INIT);
+			val = ipath_kget_kreg64(t, kr_ibcstatus);
+			if((val & 0x11) != 0x11 && (val & 0x21) != 0x21)
+				_IPATH_SMADBG("Link state still not OK unit %u (0x%llx) after INIT\n",
+					t, val);
+			else
+				_IPATH_SMADBG("Link state OK unit %u (0x%llx) after INIT\n",
+					t, val);
+		}
+		/* and continue, regardless */
+	}
+
+	pboff = devdata[t].ipath_piobufbase;
+	piobuf = (uint32_t *) (((char *)(devdata[t].ipath_kregbase)) + pboff +
+			       whichpb * devdata[t].ipath_palign);
+
+	if(infinipath_debug & __IPATH_PKTDBG)	// SMA and PKT, both
+		_IPATH_SMADBG("unit %u 0x%x+1w pio%d, (scnt %d)\n",
+			      t, plen - 1, whichpb, kpkt.sps_cnt);
+
+	ret = 0;
+	clen = 2;		/* size of the pbc */
+	{
+		/*
+		 * If this code ever gets used for anything performance
+		 * oriented, or that isn't inherently single-threaded,
+		 * then I need to implement the original idea of our
+		 * own equivalent of copy_from_user that uses only dword
+		 * or qword copies.  copy_from_user() can use byte copies,
+		 * and that is a problem for our chip.
+		 */
+		static uint32_t tmpbuf[2176 / sizeof(uint32_t)];
+		*(uint64_t *) tmpbuf = (uint64_t) plen;
+		for (i = 0; i < kpkt.sps_cnt; i++) {
+			if (unlikely
+			    (copy_from_user
+			     (tmpbuf + clen, (void *)iov->iov_base,
+			      iov->iov_len)))
+				ret = -EFAULT;	/* no break */
+			clen += iov->iov_len >> 2;
+			iov++;
+		}
+		ipath_dwordcpy(piobuf, tmpbuf, clen);
+	}
+
+	/* flush the packet out now, don't leave it waiting around */
+	mb();
+
+	if (ret) {
+		/*
+		 * Packet is bad, so we need to use the PIO abort mechanism to
+		 * abort the packet
+		 */
+		uint32_t sendctrl;
+		sendctrl = devdata[t].ipath_sendctrl | INFINIPATH_S_DISARM |
+		    (whichpb << INFINIPATH_S_DISARMPIOBUF_SHIFT);
+		_IPATH_DBG("Doing PIO abort on buffer %u after error\n",
+			   whichpb);
+		ipath_kput_kreg(t, kr_sendctrl, sendctrl);
+	}
+
+done:
+	return ret;
+}
+
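+/*
+ * Illustrative sketch (editorial, not part of this patch): the comment
+ * in ipath_send_smapkt() above notes that the chip cannot tolerate
+ * byte-granularity stores, so a copy routine such as ipath_dwordcpy()
+ * must move data strictly one dword at a time, along the lines of:
+ *
+ *	void dwordcpy(uint32_t *dst, const uint32_t *src, uint32_t n)
+ *	{
+ *		while (n--)
+ *			*dst++ = *src++;
+ *	}
+ */
+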
+/*
+ * implementation of the ioctl to get the counter values from the chip.
+ * For the time being, we get all of them when asked, no shadowing.
+ * We need to shadow the byte counters at a minimum, because otherwise
+ * they will wrap in just a few seconds at full bandwidth.
+ * The second argument is the user address to which we do the copy_to_user()
+ */
+static int ipath_get_counters(ipath_type t,
+			      struct infinipath_counters * ucounters)
+{
+	int ret = 0;
+	uint64_t val;
+	uint64_t *ucreg;
+	uint16_t vcreg;
+
+	ucreg = (uint64_t *) ucounters;
+	/*
+	 * for now, let's do this one at a time.  It's not the most
+	 * optimal method, but it is simple, and has no intermediate
+	 * memory requirements.
+	 */
+	for (vcreg = 0;
+	     vcreg < (sizeof(struct infinipath_counters) / sizeof(val));
+	     vcreg++, ucreg++) {
+		ipath_creg creg = vcreg;
+		val = ipath_snap_cntr(t, creg);
+		if ((ret = copy_to_user(ucreg, &val, sizeof(val)))) {
+			_IPATH_DBG("copy_to_user error on counter %d\n", creg);
+			break;
+		}
+	}
+
+	return ret;
+}
+
+/*
+ * implementation of the ioctl to get the stats values from the driver.
+ * The argument is the user address to which we do the copy_to_user()
+ */
+static int ipath_get_stats(struct infinipath_stats *ustats)
+{
+	int ret = 0;
+
+	if ((ret = copy_to_user(ustats, &ipath_stats, sizeof(ipath_stats))))
+		_IPATH_DBG("copy_to_user error on driver stats\n");
+
+	return ret;
+}
+
+/* set a partition key.  We can have up to 4 active at a time (other than
+ * the default, which is always allowed).  This is somewhat tricky, since
+ * multiple ports may set the same key, so we reference count them, and
+ * clean up at exit.  All 4 partition keys are packed into a single
+ * infinipath register.  It's an error for a process to set the same
+ * pkey multiple times.  We provide no mechanism to de-allocate a pkey
+ * at this time, we may eventually need to do that.
+ * I've used the atomic operations, and no locking, and only make a single
+ * pass through what's available.  This should be more than adequate for
+ * some time.  I'll think about spinlocks or the like if and as it's necessary
+ */
+static int ipath_set_partkey(ipath_portdata *pd, uint16_t key)
+{
+	ipath_devdata *dd;
+	int i, any = 0, pidx = -1;
+	uint16_t lkey = key & 0x7FFF;
+
+	dd = &devdata[pd->port_unit];
+
+	if (lkey == (IPS_DEFAULT_P_KEY & 0x7FFF)) {
+		/* nothing to do; this key always valid */
+		return 0;
+	}
+
+	_IPATH_VDBG
+	    ("p%u try to set pkey %hx, current keys %hx:%x %hx:%x %hx:%x %hx:%x\n",
+	     pd->port_port, key, dd->ipath_pkeys[0],
+	     atomic_read(&dd->ipath_pkeyrefs[0]), dd->ipath_pkeys[1],
+	     atomic_read(&dd->ipath_pkeyrefs[1]), dd->ipath_pkeys[2],
+	     atomic_read(&dd->ipath_pkeyrefs[2]), dd->ipath_pkeys[3],
+	     atomic_read(&dd->ipath_pkeyrefs[3]));
+
+	if (!lkey) {
+		_IPATH_PRDBG("p%u tries to set key 0, not allowed\n",
+			     pd->port_port);
+		return -EINVAL;
+	}
+
+	/*
+	 * Set the full membership bit, because it has to be
+	 * set in the register or the packet, and it seems
+	 * cleaner to set in the register than to force all
+	 * callers to set it.
(see bug 4331) + */ + key |= 0x8000; + + for (i = 0; i < ARRAY_SIZE(pd->port_pkeys); i++) { + if (!pd->port_pkeys[i] && pidx == -1) + pidx = i; + if (pd->port_pkeys[i] == key) { + _IPATH_VDBG + ("p%u tries to set same pkey (%x) more than once\n", + pd->port_port, key); + return -EEXIST; + } + } + if (pidx == -1) { + _IPATH_DBG + ("All pkeys for port %u already in use, can't set %x\n", + pd->port_port, key); + return -EBUSY; + } + for (any = i = 0; i < ARRAY_SIZE(dd->ipath_pkeys); i++) { + if (!dd->ipath_pkeys[i]) { + any++; + continue; + } + if (dd->ipath_pkeys[i] == key) { + if (atomic_inc_return(&dd->ipath_pkeyrefs[i]) > 1) { + pd->port_pkeys[pidx] = key; + _IPATH_VDBG + ("p%u set key %x matches #%d, count now %d\n", + pd->port_port, key, i, + atomic_read(&dd->ipath_pkeyrefs[i])); + return 0; + } else { + /* lost race, decrement count, catch below */ + atomic_dec(&dd->ipath_pkeyrefs[i]); + _IPATH_VDBG + ("Lost race, count was 0, after dec, it's %d\n", + atomic_read(&dd->ipath_pkeyrefs[i])); + any++; + } + } + if ((dd->ipath_pkeys[i] & 0x7FFF) == lkey) { + /* + * It makes no sense to have both the limited and full + * membership PKEY set at the same time since the + * unlimited one will disable the limited one. + */ + return -EEXIST; + } + } + if (!any) { + _IPATH_DBG + ("port %u, all pkeys already in use, can't set %x\n", + pd->port_port, key); + return -EBUSY; + } + for (any = i = 0; i < ARRAY_SIZE(dd->ipath_pkeys); i++) { + if (!dd->ipath_pkeys[i] && + atomic_inc_return(&dd->ipath_pkeyrefs[i]) == 1) { + uint64_t pkey; + + /* for ipathstats, etc. */ + ipath_stats.sps_pkeys[i] = lkey; + pd->port_pkeys[pidx] = dd->ipath_pkeys[i] = key; + pkey = + (uint64_t) dd->ipath_pkeys[0] | + ((uint64_t) dd->ipath_pkeys[1] << 16) | + ((uint64_t) dd->ipath_pkeys[2] << 32) | + ((uint64_t) dd->ipath_pkeys[3] << 48); + _IPATH_PRDBG + ("p%u set key %x in #%d, portidx %d, new pkey reg %llx\n", + pd->port_port, key, i, pidx, pkey); + ipath_kput_kreg(pd->port_unit, kr_partitionkey, pkey); + + return 0; + } + } + _IPATH_DBG + ("port %u, all pkeys already in use 2nd pass, can't set %x\n", + pd->port_port, key); + return -EBUSY; +} + +/* + * stop_start == 0 disables receive on the port, for use in queue overflow + * conditions. stop_start==1 re-enables, and returns value of tail register, + * to be used to re-init the software copy of the head register + */ + +static int ipath_manage_rcvq(ipath_portdata * pd, uint16_t start_stop) +{ + ipath_devdata *dd; + /* + * This needs to be volatile, so that the compiler doesn't + * optimize away the read to the device's mapped memory. + */ + volatile uint64_t tval; + + dd = &devdata[pd->port_unit]; + _IPATH_PRDBG("%sabling rcv for unit %u port %u\n", + start_stop ? "en" : "dis", pd->port_unit, pd->port_port); + /* atomically clear receive enable port. */ + if (start_stop) { + /* + * on enable, force in-memory copy of the tail register + * to 0, so that protocol code doesn't have to worry + * about whether or not the chip has yet updated + * the in-memory copy or not on return from the system + * call. The chip always resets it's tail register back + * to 0 on a transition from disabled to enabled. + * This could cause a problem if software was broken, + * and did the enable w/o the disable, but eventually + * the in-memory copy will be updated and correct + * itself, even in the face of software bugs. 
+	 */
+		*pd->port_rcvhdrtail_kvaddr = 0;
+		atomic_set_mask(1U <<
+				(INFINIPATH_R_PORTENABLE_SHIFT + pd->port_port),
+				&dd->ipath_rcvctrl);
+	} else
+		atomic_clear_mask(1U <<
+				  (INFINIPATH_R_PORTENABLE_SHIFT +
+				   pd->port_port), &dd->ipath_rcvctrl);
+	ipath_kput_kreg(pd->port_unit, kr_rcvctrl, dd->ipath_rcvctrl);
+	/* now be sure chip saw it before we return */
+	tval = ipath_kget_kreg64(pd->port_unit, kr_scratch);
+	if (start_stop) {
+		/*
+		 * and try to be sure that tail reg update has happened
+		 * too.  This should in theory interlock with the RXE
+		 * changes to the tail register.  Don't assign it to
+		 * the tail register in memory copy, since we could
+		 * overwrite an update by the chip if we did.
+		 */
+		tval =
+		    ipath_kget_ureg32(pd->port_unit, ur_rcvhdrtail,
+				      pd->port_port);
+	}
+	/* always; new head should be equal to new tail; see above */
+	return 0;
+}
+
+/*
+ * This routine is now quite different for user and kernel, because
+ * the kernel uses skb's, for the accelerated network performance.
+ * This is the user port version.
+ *
+ * allocate the eager TID buffers and program them into infinipath
+ * They are no longer completely contiguous, we do multiple
+ * alloc_pages() calls.
+ */
+static int ipath_create_user_egr(ipath_portdata * pd)
+{
+	char *buf;
+	ipath_devdata *dd = &devdata[pd->port_unit];
+	uint64_t *egrbase, egroff, lenvalid;
+	unsigned e, egrcnt, alloced, order, egrperchunk, chunk;
+	unsigned long pa, pent;
+
+	egrcnt = dd->ipath_rcvegrcnt;
+	egroff =
+	    dd->ipath_rcvegrbase + pd->port_port * egrcnt * sizeof(*egrbase);
+	egrbase = (uint64_t *) ((char *)(dd->ipath_kregbase) + egroff);
+	_IPATH_VDBG("Allocating %d egr buffers, at chip offset %llx (%p)\n",
+		    egrcnt, egroff, egrbase);
+
+	/*
+	 * to avoid wasting a lot of memory, we allocate 32KB chunks of
+	 * physically contiguous memory, advance through it until used up
+	 * and then allocate more.  Of course, we need memory to store
+	 * those extra pointers, now.  Started out with 256KB, but under
+	 * heavy memory pressure (creating large files and then copying
+	 * them over NFS while doing lots of MPI jobs), we hit some
+	 * alloc_pages() failures, even though we can sleep... (2.6.10)
+	 * Still get failures at 64K.  32K is the lowest we can go without
+	 * waiting more memory again.  It seems likely that the coalescing
+	 * in free_pages, etc. still has issues (as it has had previously
+	 * during 2.6.x development).
+	 */
+	order = get_order(0x8000);
+	alloced =
+	    round_up(dd->ipath_rcvegrbufsize * egrcnt,
+		     (1 << order) * PAGE_SIZE);
+	egrperchunk = ((1 << order) * PAGE_SIZE) / dd->ipath_rcvegrbufsize;
+	chunk = (egrcnt + egrperchunk - 1) / egrperchunk;
+	pd->port_rcvegrbuf_chunks = chunk;
+	pd->port_rcvegrbufs_perchunk = egrperchunk;
+	pd->port_rcvegrbuf_order = order;
+	pd->port_rcvegrbuf_pages =
+	    vmalloc(chunk * sizeof(pd->port_rcvegrbuf_pages[0]));
+	pd->port_rcvegrbuf_virt =
+	    vmalloc(chunk * sizeof(pd->port_rcvegrbuf_virt[0]));
+	if (!pd->port_rcvegrbuf_pages || !pd->port_rcvegrbuf_virt) {
+		_IPATH_UNIT_ERROR(pd->port_unit,
+		    "Unable to allocate %u EGR buffer array pointers\n",
+		    chunk);
+		if (pd->port_rcvegrbuf_pages) {
+			vfree(pd->port_rcvegrbuf_pages);
+			pd->port_rcvegrbuf_pages = NULL;
+		}
+		if (pd->port_rcvegrbuf_virt) {
+			vfree(pd->port_rcvegrbuf_virt);
+			pd->port_rcvegrbuf_virt = NULL;
+		}
+		return -ENOMEM;
+	}
+	for (e = 0; e < pd->port_rcvegrbuf_chunks; e++) {
+		/*
+		 * GFP_USER, but without GFP_FS, so buffer cache can
+		 * be coalesced (we hope); otherwise, even at order 4, heavy
+		 * filesystem activity makes these fail
+		 */
+		if (!
+ (pd->port_rcvegrbuf_pages[e] = + alloc_pages(__GFP_WAIT | __GFP_IO, order))) { + _IPATH_UNIT_ERROR(pd->port_unit, + "Unable to allocate EGR buffer array %u/%u\n", + e, pd->port_rcvegrbuf_chunks); + vfree(pd->port_rcvegrbuf_pages); + pd->port_rcvegrbuf_pages = NULL; + vfree(pd->port_rcvegrbuf_virt); + pd->port_rcvegrbuf_virt = NULL; + return -ENOMEM; + } + } + + /* + * calculate physical, then phys_to_virt() + * so that we get an address that fits in 64 bits, so we can use + * mmap64 from 32 bit programs on the chip and kernel virtual + * addresses (mmap64 for 32 bit programs on i386 and x86_64 + * only has 44 bits of address, because it uses mmap2()) + * We do this with the first chunk; We don't need a kernel + * virtually contiguous address to give the user virtually + * contiguous mappings. It just complicates the nopage routine + * a little tiny bit ;) + */ + buf = page_address(pd->port_rcvegrbuf_pages[0]); + pa = virt_to_phys(buf); + pd->port_rcvegr_phys = pa; + + /* in words */ + lenvalid = (dd->ipath_rcvegrbufsize - pd->port_egrskip) >> 2; + _IPATH_VDBG + ("port%u egrbuf vaddr %p, cpu %d, egrskip %u, len %llx words\n", + pd->port_port, buf, smp_processor_id(), pd->port_egrskip, + lenvalid); + lenvalid <<= INFINIPATH_RT_BUFSIZE_SHIFT; + lenvalid |= INFINIPATH_RT_VALID; + + for (e = chunk = 0; chunk < pd->port_rcvegrbuf_chunks; chunk++) { + int i, n; + struct page *p; + p = pd->port_rcvegrbuf_pages[chunk]; + pa = page_to_phys(p); + buf = page_address(p); + /* + * stash away for later use, since page_address() lookup + * is not cheap + */ + pd->port_rcvegrbuf_virt[chunk] = buf; + if (pa & ~INFINIPATH_RT_ADDR_MASK) + _IPATH_INFO + ("physaddr %lx has more than 40 bits, using only 40!\n", + pa); + n = 1 << pd->port_rcvegrbuf_order; + for (i = 0; i < n; i++) + SetPageReserved(virt_to_page(buf + (i * PAGE_SIZE))); + + /* clear buffer for security, sanity, and, debugging */ + memset(buf, 0, PAGE_SIZE * n); + + for (i = 0; e < egrcnt && i < egrperchunk; e++, i++) { + pent = + ((pa + + pd-> + port_egrskip) & INFINIPATH_RT_ADDR_MASK) | + lenvalid; + + ipath_kput_memq(pd->port_unit, &egrbase[e], pent); + _IPATH_VDBG("egr %u phys %lx val %lx\n", e, pa, pent); + pa += dd->ipath_rcvegrbufsize; + } + yield(); /* don't hog the cpu */ + } + + return 0; +} + +/* + * This routine is now quite different for user and kernel, because + * the kernel uses skb's, for the accelerated network performance + * This is the kernel (port0) version + * + * Allocate the eager TID buffers and program them into infinipath. 
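+ * (Worked example for the user-port variant above, editorial and not
+ * in the original comment: chunks are 32KB (get_order(0x8000)), so
+ * with a hypothetical 2KB ipath_rcvegrbufsize, egrperchunk = 16 and
+ * 512 eager entries need chunk = (512 + 15) / 16 = 32 allocations.)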
We use the network layer alloc_skb() allocator to allocate the memory, and
+ * either use the buffers as is for things like SMA packets, or pass
+ * the buffers up to the ipath layered driver and thence the network layer,
+ * replacing them as we do so (see ipath_kreceive())
+ */
+static int ipath_create_port0_egr(ipath_portdata * pd)
+{
+	int ret = 0;
+	uint64_t *egrbase, egroff;
+	unsigned e, egrcnt;
+	ipath_devdata *dd;
+	struct sk_buff **skbs;
+
+	dd = &devdata[pd->port_unit];
+	egrcnt = dd->ipath_rcvegrcnt;
+	egroff =
+	    dd->ipath_rcvegrbase + pd->port_port * egrcnt * sizeof(*egrbase);
+	egrbase = (uint64_t *) ((char *)(dd->ipath_kregbase) + egroff);
+	_IPATH_VDBG
+	    ("unit%u Allocating %d egr buffers, at chip offset %llx (%p)\n",
+	     pd->port_unit, egrcnt, egroff, egrbase);
+
+	skbs = vmalloc(sizeof(*dd->ipath_port0_skbs) * egrcnt);
+	if (skbs == NULL)
+		ret = -ENOMEM;
+	else {
+		for (e = 0; e < egrcnt; e++) {
+			/*
+			 * This is a bit tricky in that we allocate
+			 * extra space for 2 bytes of the 14 byte
+			 * ethernet header.  These two bytes are passed
+			 * in the ipath header so the rest of the data
+			 * is word aligned.  We allocate 4 bytes so that the
+			 * data buffer stays word aligned.
+			 * See ipath_kreceive() for more details.
+			 */
+			skbs[e] =
+			    __dev_alloc_skb(dd->ipath_ibmaxlen + 4, GFP_KERNEL);
+			if (skbs[e] == NULL) {
+				_IPATH_UNIT_ERROR(pd->port_unit,
+				    "SKB allocation error for eager TID %u\n",
+				    e);
+				while (e != 0)
+					dev_kfree_skb(skbs[--e]);
+				ret = -ENOMEM;
+				break;
+			}
+			skb_reserve(skbs[e], 4);
+		}
+	}
+	/*
+	 * after loop above, so we can test non-NULL
+	 * to see if ready to use at receive, etc.  Hope this fixes some
+	 * panics.
+	 */
+	dd->ipath_port0_skbs = skbs;
+
+	/*
+	 * have to tell chip each time we init it,
+	 * even if we are re-using previous memory.
+	 */
+	if (!ret) {
+		uint64_t lenvalid;	/* in words */
+
+		lenvalid = (dd->ipath_ibmaxlen - pd->port_egrskip) >> 2;
+		lenvalid <<= INFINIPATH_RT_BUFSIZE_SHIFT;
+		lenvalid |= INFINIPATH_RT_VALID;
+		for (e = 0; e < egrcnt; e++) {
+			unsigned long pa, pent;
+
+			pa = virt_to_phys(dd->ipath_port0_skbs[e]->data);
+			pa += pd->port_egrskip;
+			if (!e && (pa & ~INFINIPATH_RT_ADDR_MASK))
+				_IPATH_INFO
+				    ("phys addr %lx has more than 40 bits, using only 40!!!\n",
+				     pa);
+			pent = (pa & INFINIPATH_RT_ADDR_MASK) | lenvalid;
+			/*
+			 * don't need this except extreme debugging,
+			 * but leaving to save future typing.
+			 * _IPATH_VDBG("egr[%d] %p <- %lx\n", e, &egrbase[e], pent);
+			 */
+			ipath_kput_memq(pd->port_unit, &egrbase[e], pent);
+		}
+		yield();	/* don't hog the cpu */
+	}
+
+	return ret;
+}
+
+/*
+ * this *must* be physically contiguous memory, and for now,
+ * that limits it to what kmalloc can do.
+ */
+static int ipath_create_rcvhdrq(ipath_portdata * pd)
+{
+	int i, ret = 0, amt, order, pgs;
+	char *qt;
+	struct page *p;
+	unsigned long pa, pa0;
+
+	amt = round_up(devdata[pd->port_unit].ipath_rcvhdrcnt
+		       * devdata[pd->port_unit].ipath_rcvhdrentsize *
+		       sizeof(uint32_t), PAGE_SIZE);
+	if (!pd->port_rcvhdrq) {
+		order = get_order(amt);
+		/*
+		 * not using REPEAT isn't viable; at 128KB, we can easily fail
+		 * this.  The problem with REPEAT is we can block here
+		 * "forever".  There isn't an in-between, unfortunately.
+		 * We could reduce the risk by never freeing the rcvhdrq
+		 * except at unload, but even then, the first time a
+		 * port is used, we could delay for some time...
+ */ + p = alloc_pages(GFP_USER, order); + if (!p) { + _IPATH_UNIT_ERROR(pd->port_unit, + "attempt to allocate order %u memory for port %u rcvhdrq failed\n", + order, pd->port_port); + return -ENOMEM; + } + + /* + * should use kmap (and later kunmap), even though high mem will + * always be mapped on x86_64, to play it safe, but for some + * bizarre reason these aren't exported symbols... + */ + pd->port_rcvhdrq = page_address(p); + if (!virt_addr_valid(pd->port_rcvhdrq)) { + _IPATH_DBG + ("weird, virt_addr_valid false right after alloc_pages\n"); + _IPATH_DBG("__pa(%p) is %lx, num_physpages %lx\n", + pd->port_rcvhdrq, __pa(pd->port_rcvhdrq), + num_physpages); + } + pd->port_rcvhdrq_phys = virt_to_phys(pd->port_rcvhdrq); + pd->port_rcvhdrq_order = order; + + pa0 = pd->port_rcvhdrq_phys; + pgs = amt >> PAGE_SHIFT; + _IPATH_VDBG + ("%d pages at %p (phys %lx) order=%u for port %u rcvhdr Q\n", + pgs, pd->port_rcvhdrq, pa0, pd->port_rcvhdrq_order, + pd->port_port); + + /* + * verify it's really physically contiguous, to be paranoid + * also mark pages as reserved, to avoid problems when + * user process with them mapped then exits. + */ + qt = pd->port_rcvhdrq; + SetPageReserved(virt_to_page(qt)); + qt += PAGE_SIZE; + for (pa = pa0, i = 1; i < pgs; i++, qt += PAGE_SIZE) { + SetPageReserved(virt_to_page(qt)); + pa = virt_to_phys(qt); + if (pa != (pa0 + (i * PAGE_SIZE))) + _IPATH_INFO + ("pg %d at %p phys %lx not contiguous\n", i, + qt, pa); + else + _IPATH_VDBG("pg %d at %p phys %lx\n", i, qt, + pa); + } + } + + /* + * clear for security, sanity, and/or debugging (each time we + * use/reuse) + */ + memset(pd->port_rcvhdrq, 0, amt); + + /* + * tell chip each time we init it, even if we are re-using previous + * memory (we zero it at process close) + */ + _IPATH_VDBG("writing port %d rcvhdraddr as %lx\n", pd->port_port, + pd->port_rcvhdrq_phys); + ipath_kput_kreg_port(pd->port_unit, kr_rcvhdraddr, pd->port_port, + pd->port_rcvhdrq_phys); + + return ret; +} + +#ifdef _IPATH_EXTRA_DEBUG +/* + * occasionally useful to dump the full set of kernel registers for debugging. + */ +static void ipath_dump_allregs(char *what, ipath_type t) +{ + uint16_t reg; + _IPATH_DBG("%s\n", what); + for (reg = 0; reg <= 0x100; reg++) { + uint64_t v = ipath_kget_kreg64(t, reg); + if (!(reg % 4)) + printk("\n%3x: ", reg); + printk("%16llx ", v); + } + printk("\n"); +} +#endif /* _IPATH_EXTRA_DEBUG */ + +/* + * Do the actual initialization sequence on the chip. For the real + * hardware, this is done from the init routine called from the PCI + * infrastructure. + */ +int ipath_init_chip(const ipath_type t) +{ + int ret = 0, i; + uint32_t val32, kpiobufs; + uint64_t val, atmp; + volatile uint32_t *piobuf; + uint32_t pioincr; + ipath_devdata *dd = &devdata[t]; + ipath_portdata *pd; + struct page *vpage; + char boardn[32]; + + /* first time only, set after static version info */ + if (!chip_driver_version) { + i = strlen(ipath_core_version); + chip_driver_version = ipath_core_version + i; + chip_driver_size = sizeof ipath_core_version - i; + } + + /* + * have to clear shadow copies of registers at init that are not + * otherwise set here, or all kinds of bizarre things happen with + * driver on chip reset + */ + dd->ipath_rcvhdrsize = 0; + + /* + * don't clear ipath_flags as 8bit mode was set before entering + * this func. 
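
/*
 * Sketch of the size-to-order step used for the rcvhdrq above: round
 * the byte count up to whole pages, convert to a power-of-two page
 * order, and allocate one physically contiguous block.  The helper
 * name is illustrative; the kernel calls are real.
 */
#include <linux/mm.h>

static void *example_alloc_contig(unsigned long bytes, int *order)
{
	struct page *p;

	*order = get_order(bytes);	/* get_order() rounds up to pages */
	p = alloc_pages(GFP_USER, *order);
	return p ? page_address(p) : NULL;
}
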
However, we do set the linkstate to unknown + */ + + /* so we can watch for a transition */ + dd->ipath_flags |= IPATH_LINKUNK; + dd->ipath_flags &= ~(IPATH_LINKACTIVE | IPATH_LINKARMED | IPATH_LINKDOWN + | IPATH_LINKINIT); + + _IPATH_VDBG("Try to read spc chip revision\n"); + dd->ipath_revision = ipath_kget_kreg64(t, kr_revision); + + /* + * set up fundamental info we need to use the chip; we assume if + * the revision reg and these regs are OK, we don't need to special + * case the rest + */ + dd->ipath_sregbase = ipath_kget_kreg32(t, kr_sendregbase); + dd->ipath_cregbase = ipath_kget_kreg32(t, kr_counterregbase); + dd->ipath_uregbase = ipath_kget_kreg32(t, kr_userregbase); + _IPATH_VDBG("ipath_kregbase %p, sendbase %x usrbase %x, cntrbase %x\n", + dd->ipath_kregbase, dd->ipath_sregbase, dd->ipath_uregbase, + dd->ipath_cregbase); + if ((dd->ipath_revision & 0xffffffff) == 0xffffffff || + (dd->ipath_sregbase & 0xffffffff) == 0xffffffff || + (dd->ipath_cregbase & 0xffffffff) == 0xffffffff || + (dd->ipath_uregbase & 0xffffffff) == 0xffffffff) { + _IPATH_UNIT_ERROR(t, + "Register read failures from chip, giving up initialization\n"); + ret = -ENODEV; + goto done; + } + + /* clear the initial reset flag, in case first driver load */ + ipath_kput_kreg(t, kr_errorclear, INFINIPATH_E_RESET); + + dd->ipath_portcnt = ipath_kget_kreg32(t, kr_portcnt); + if (!infinipath_cfgports) + dd->ipath_cfgports = dd->ipath_portcnt; + else if (infinipath_cfgports <= dd->ipath_portcnt) { + dd->ipath_cfgports = infinipath_cfgports; + _IPATH_DBG("Configured to use %u ports out of %u in chip\n", + dd->ipath_cfgports, dd->ipath_portcnt); + } else { + dd->ipath_cfgports = dd->ipath_portcnt; + _IPATH_DBG + ("Tried to configured to use %u ports; chip only supports %u\n", + infinipath_cfgports, dd->ipath_portcnt); + } + dd->ipath_pd = kmalloc(sizeof(*dd->ipath_pd) * dd->ipath_cfgports, + GFP_KERNEL); + if (!dd->ipath_pd) { + _IPATH_UNIT_ERROR(t, + "Unable to allocate portdata array, failing\n"); + ret = -ENOMEM; + goto done; + } + memset(dd->ipath_pd, 0, sizeof(*dd->ipath_pd) * dd->ipath_cfgports); + + dd->ipath_lastegrheads = kmalloc(sizeof(*dd->ipath_lastegrheads) + * dd->ipath_cfgports, GFP_KERNEL); + dd->ipath_lastrcvhdrqtails = kmalloc(sizeof(*dd->ipath_lastrcvhdrqtails) + * dd->ipath_cfgports, GFP_KERNEL); + if (!dd->ipath_lastegrheads || !dd->ipath_lastrcvhdrqtails) { + _IPATH_UNIT_ERROR(t, + "Unable to allocate head arrays, failing\n"); + ret = -ENOMEM; + goto done; + } + memset(dd->ipath_lastrcvhdrqtails, 0, + sizeof(*dd->ipath_lastrcvhdrqtails) + * dd->ipath_cfgports); + memset(dd->ipath_lastegrheads, 0, sizeof(*dd->ipath_lastegrheads) + * dd->ipath_cfgports); + + dd->ipath_pd[0] = kmalloc(sizeof(ipath_portdata), GFP_KERNEL); + if (!dd->ipath_pd[0]) { + _IPATH_UNIT_ERROR(t, + "Unable to allocate portdata for port 0, failing\n"); + ret = -ENOMEM; + goto done; + } + memset(dd->ipath_pd[0], 0, sizeof(ipath_portdata)); + + pd = dd->ipath_pd[0]; + pd->port_unit = t; + pd->port_port = 0; + pd->port_cnt = 1; + /* The port 0 pkey table is used by the layer interface. 
*/ + pd->port_pkeys[0] = IPS_DEFAULT_P_KEY; + + dd->ipath_rcvtidcnt = ipath_kget_kreg32(t, kr_rcvtidcnt); + dd->ipath_rcvtidbase = ipath_kget_kreg32(t, kr_rcvtidbase); + dd->ipath_rcvegrcnt = ipath_kget_kreg32(t, kr_rcvegrcnt); + dd->ipath_rcvegrbase = ipath_kget_kreg32(t, kr_rcvegrbase); + dd->ipath_palign = ipath_kget_kreg32(t, kr_pagealign); + dd->ipath_piobufbase = ipath_kget_kreg32(t, kr_sendpiobufbase); + dd->ipath_piosize = ipath_kget_kreg32(t, kr_sendpiosize); + dd->ipath_ibmtu = 4096; /* default to largest legal MTU */ + dd->ipath_piobcnt = ipath_kget_kreg32(t, kr_sendpiobufcnt); + + _IPATH_VDBG + ("Revision %llx (PCI %x), %u ports, %u tids, %u egrtids, %u piobufs\n", + dd->ipath_revision, dd->ipath_pcirev, dd->ipath_portcnt, + dd->ipath_rcvtidcnt, dd->ipath_rcvegrcnt, dd->ipath_piobcnt); + + if (((dd->ipath_revision >> INFINIPATH_R_SOFTWARE_SHIFT) & INFINIPATH_R_SOFTWARE_MASK) != IPATH_CHIP_SWVERSION) { /* >= maybe, someday */ + _IPATH_UNIT_ERROR(t, + "Driver only handles version %d, chip swversion is %d (%llx), failng\n", + IPATH_CHIP_SWVERSION, + (int)(dd-> + ipath_revision >> + INFINIPATH_R_SOFTWARE_SHIFT) & + INFINIPATH_R_SOFTWARE_MASK, + dd->ipath_revision); + ret = -ENOSYS; + goto done; + } + dd->ipath_majrev = (uint8_t) ((dd->ipath_revision >> + INFINIPATH_R_CHIPREVMAJOR_SHIFT) & + INFINIPATH_R_CHIPREVMAJOR_MASK); + dd->ipath_minrev = + (uint8_t) ((dd-> + ipath_revision >> INFINIPATH_R_CHIPREVMINOR_SHIFT) & + INFINIPATH_R_CHIPREVMINOR_MASK); + dd->ipath_boardrev = + (uint8_t) ((dd-> + ipath_revision >> INFINIPATH_R_BOARDID_SHIFT) & + INFINIPATH_R_BOARDID_MASK); + + ipath_get_boardname(t, boardn, sizeof boardn); + + { + snprintf(chip_driver_version, chip_driver_size, + "Driver %u.%u, %s, InfiniPath%u %u.%u, PCI %u, SW Compat %u\n", + IPATH_CHIP_VERS_MAJ, IPATH_CHIP_VERS_MIN, boardn, + (unsigned)(dd-> + ipath_revision >> INFINIPATH_R_ARCH_SHIFT) & + INFINIPATH_R_ARCH_MASK, dd->ipath_majrev, + dd->ipath_minrev, dd->ipath_pcirev, + (unsigned)(dd-> + ipath_revision >> + INFINIPATH_R_SOFTWARE_SHIFT) & + INFINIPATH_R_SOFTWARE_MASK); + + } + + _IPATH_DBG("%s", chip_driver_version); + + /* + * we ignore most issues after reporting them, but have to specially + * handle hardware-disabled chips. + */ + if(ipath_validate_rev(dd) == 2) { + ret = -EPERM; /* unique error, known to infinipath_init_one() */ + goto done; + } + + /* + * zero all the TID entries at startup. We do this for sanity, + * in case of a previous driver crash of some kind, and also + * because the chip powers up with these memories in an unknown + * state. Use portcnt, not cfgports, since this is for the full chip, + * not for current (possibly different) configuration value + * Chip Errata bug 6447 + */ + for (val32 = 0; val32 < dd->ipath_portcnt; val32++) + ipath_clear_tids(t, val32); + + dd->ipath_rcvhdrentsize = IPATH_RCVHDRENTSIZE; + /* we could bump this + * to allow for full rcvegrcnt + rcvtidcnt, but then it no + * longer nicely fits power of two, and since we now use + * alloc_pages, the rest would be wasted. + */ + dd->ipath_rcvhdrcnt = dd->ipath_rcvegrcnt; + /* + * setup offset of last valid entry in rcvhdrq, for various tests, to + * avoid calculating each time we need it + */ + dd->ipath_hdrqlast = + dd->ipath_rcvhdrentsize * (dd->ipath_rcvhdrcnt - 1); + ipath_kput_kreg(t, kr_rcvhdrentsize, dd->ipath_rcvhdrentsize); + ipath_kput_kreg(t, kr_rcvhdrcnt, dd->ipath_rcvhdrcnt); + /* + * not in ipath_rcvhdrsize, so user programs can set differently, but + * so any early packets see the default size. 
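
/*
 * Every revision field above is decoded with the same shift-and-mask
 * idiom; factored out for clarity (helper name is illustrative):
 */
static inline uint32_t example_reg_field(uint64_t reg, unsigned int shift,
					 uint64_t mask)
{
	return (uint32_t)((reg >> shift) & mask);
}

/*
 * e.g.  majrev = example_reg_field(dd->ipath_revision,
 *                                  INFINIPATH_R_CHIPREVMAJOR_SHIFT,
 *                                  INFINIPATH_R_CHIPREVMAJOR_MASK);
 */
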
+ */ + ipath_kput_kreg(t, kr_rcvhdrsize, IPATH_DFLT_RCVHDRSIZE); + + /* + * we "know" that this works + * out OK. It's actually a bit more than we need, but 2048+64 isn't + * quite enough for full size, and we want the +N to be a power of 2 + * to give us reasonable alignment and fit within page_alloc()'ed + * memory + */ + dd->ipath_rcvegrbufsize = dd->ipath_piosize; + + /* + * the min() check here is currently a nop, but it may not always be, + * depending on just how we do ipath_rcvegrbufsize + */ + dd->ipath_ibmaxlen = min(dd->ipath_piosize, dd->ipath_rcvegrbufsize); + dd->ipath_init_ibmaxlen = dd->ipath_ibmaxlen; + + /* + * set up the shadow copies of the piobufavail registers, which + * we compare against the chip registers for now, and the in + * memory DMA'ed copies of the registers. This has to be done + * early, before we calculate lastport, etc. + */ + val = dd->ipath_piobcnt; + /* + * calc number of pioavail registers, and save it; we have 2 bits + * per buffer + */ + dd->ipath_pioavregs = + round_up(val, sizeof(uint64_t) * _BITS_PER_BYTE / 2) / + (sizeof(uint64_t) * _BITS_PER_BYTE / 2); + if (dd->ipath_pioavregs > + (sizeof(dd->ipath_pioavailshadow) / + sizeof(dd->ipath_pioavailshadow[0]))) { + dd->ipath_pioavregs = + sizeof(dd->ipath_pioavailshadow) / + sizeof(dd->ipath_pioavailshadow[0]); + dd->ipath_piobcnt = dd->ipath_pioavregs * sizeof(uint64_t) * _BITS_PER_BYTE >> 1; /* 2 bits/reg */ + _IPATH_INFO + ("Warning: %lld piobufs is too many to fit in shadow, only using %d\n", + val, dd->ipath_piobcnt); + } + + if (!infinipath_kpiobufs) { + /* have to have at least one, for SMA */ + kpiobufs = infinipath_kpiobufs = 1; + } else if (dd->ipath_piobcnt < + (dd->ipath_cfgports * IPATH_MIN_USER_PORT_BUFCNT)) { + _IPATH_INFO + ("Too few PIO buffers (%u) for %u ports to have %u each!\n", + dd->ipath_piobcnt, dd->ipath_cfgports, + IPATH_MIN_USER_PORT_BUFCNT); + kpiobufs = 1; /* reserve just the minimum for SMA/ether */ + } else + kpiobufs = infinipath_kpiobufs; + + if (kpiobufs > + (dd->ipath_piobcnt - + (dd->ipath_cfgports * IPATH_MIN_USER_PORT_BUFCNT))) { + i = dd->ipath_piobcnt - + (dd->ipath_cfgports * IPATH_MIN_USER_PORT_BUFCNT); + if (i < 0) + i = 0; + _IPATH_INFO + ("Allocating %d PIO bufs for kernel leaves too few for %d user ports (%d each); using %u\n", + kpiobufs, dd->ipath_cfgports - 1, + IPATH_MIN_USER_PORT_BUFCNT, i); + /* + * shouldn't change infinipath_kpiobufs, because could be + * different for different devices... + */ + kpiobufs = i; + } + dd->ipath_lastport_piobuf = dd->ipath_piobcnt - kpiobufs; + dd->ipath_pbufsport = dd->ipath_cfgports > 1 ? + dd->ipath_lastport_piobuf / (dd->ipath_cfgports - 1) : 0; + val32 = dd->ipath_lastport_piobuf - + (dd->ipath_pbufsport * (dd->ipath_cfgports - 1)); + if (val32 > 0) { + _IPATH_DBG + ("allocating %u pbufs/port leaves %u unused, add to kernel\n", + dd->ipath_pbufsport, val32); + dd->ipath_lastport_piobuf -= val32; + _IPATH_DBG("%u pbufs/port leaves %u unused, add to kernel\n", + dd->ipath_pbufsport, val32); + } + dd->ipath_lastpioindex = dd->ipath_lastport_piobuf; + _IPATH_VDBG + ("%d PIO bufs %u - %u, %u each for %u user ports\n", + kpiobufs, dd->ipath_lastport_piobuf, dd->ipath_piobcnt, dd->ipath_pbufsport, + dd->ipath_cfgports - 1); + + /* + * this has to be page aligned, and on a page of it's own, so we + * can map it into user space. We also use it to give processes + * a copy of ipath_statusp, on a separate cacheline, followed by + * a copy of the freeze error string, if it's happened. 
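
/*
 * The pioavail sizing above amounts to two bits of state per PIO
 * buffer packed into 64-bit registers, i.e. 32 buffers per register.
 * Restated as standalone arithmetic (illustrative helper):
 */
static unsigned int example_pioav_regs(unsigned int piobufs)
{
	const unsigned int per_reg = 64 / 2;	/* 32 buffers/register */

	return (piobufs + per_reg - 1) / per_reg;	/* round up */
}
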
Might also + * use that space for other things. + */ + val = round_up(2 * L1_CACHE_BYTES + sizeof(*dd->ipath_statusp) + + dd->ipath_pioavregs * sizeof(uint64_t), 2 * PAGE_SIZE); + if (!(dd->ipath_pioavailregs_dma = kmalloc(val * sizeof(uint64_t), + GFP_KERNEL))) { + _IPATH_UNIT_ERROR(t, + "failed to allocate PIOavail reg area in memory\n"); + ret = -ENOMEM; + goto done; + } + if ((PAGE_SIZE - 1) & (uint64_t) dd->ipath_pioavailregs_dma) { + dd->__ipath_pioavailregs_base = dd->ipath_pioavailregs_dma; + dd->ipath_pioavailregs_dma = (uint64_t *) + round_up((uint64_t) dd->ipath_pioavailregs_dma, PAGE_SIZE); + } else + dd->__ipath_pioavailregs_base = dd->ipath_pioavailregs_dma; + /* + * zero initial, since whole thing mapped + * into user space, and don't want info leak, or confusing garbage + */ + memset((void *)dd->ipath_pioavailregs_dma, 0, PAGE_SIZE); + + /* + * we really want L2 cache aligned, but for current CPUs of interest, + * they are the same. + */ + dd->ipath_statusp = (uint64_t *) ((char *)dd->ipath_pioavailregs_dma + + ((2 * L1_CACHE_BYTES + + dd->ipath_pioavregs * + sizeof(uint64_t)) & + ~L1_CACHE_BYTES)); + /* copy the current value now that it's really allocated */ + *dd->ipath_statusp = dd->_ipath_status; + /* + * setup buffer to hold freeze msg, accessible to apps, following + * statusp + */ + dd->ipath_freezemsg = (char *)&dd->ipath_statusp[1]; + /* and it's length */ + dd->ipath_freezelen = L1_CACHE_BYTES - sizeof(dd->ipath_statusp[0]); + + atmp = virt_to_phys(dd->ipath_pioavailregs_dma); + /* stash physical address for user progs */ + dd->ipath_pioavailregs_phys = atmp; + (void)ipath_kput_kreg(t, kr_sendpioavailaddr, atmp); + /* + * this is to detect s/w errors, which the h/w works around by + * ignoring the low 6 bits of address, if it wasn't aligned. + */ + val = ipath_kget_kreg64(t, kr_sendpioavailaddr); + if (val != atmp) { + _IPATH_UNIT_ERROR(t, + "Catastrophic software error, SendPIOAvailAddr written as %llx, read back as %llx\n", + atmp, val); + ret = -EINVAL; + goto done; + } + + if (t * 64 > (sizeof(ipath_port0_rcvhdrtail) - 64)) { + _IPATH_UNIT_ERROR(t, + "unit %u too large for port 0 rcvhdrtail buffer size\n", + t); + ret = -ENODEV; + } + + /* + * kernel modules loaded into vmalloc'ed memory, + * verify that when we assume that, map to phys, and back to virt, + * that we get the right contents, so we did the mapping right. 
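
/*
 * Sketch of the write-then-read-back check used above for
 * kr_sendpioavailaddr: the chip silently ignores the low-order bits
 * of a misaligned address, so reading back a different value than was
 * written pinpoints a software alignment bug.  This reuses the
 * driver's own accessors; the helper name is illustrative.
 */
static int example_kput_verify(const ipath_type t, ipath_kreg reg,
			       uint64_t val)
{
	ipath_kput_kreg(t, reg, val);
	return (ipath_kget_kreg64(t, reg) == val) ? 0 : -EINVAL;
}
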
+ */ + vpage = vmalloc_to_page((void *)ipath_port0_rcvhdrtail); + if (vpage == NOPAGE_SIGBUS || vpage == NOPAGE_OOM) { + _IPATH_UNIT_ERROR(t, "vmalloc_to_page for rcvhdrtail fails!\n"); + ret = -ENOMEM; + goto done; + } + + /* + * 64 is driven by cache line size, and also by chip requirement + * that low 6 bits be 0 + */ + val = page_to_phys(vpage) + t * 64; + + /* verify that the alignment requirement was met */ + ipath_kput_kreg_port(t, kr_rcvhdrtailaddr, 0, val); + atmp = ipath_kget_kreg64_port(t, kr_rcvhdrtailaddr, 0); + if (val != atmp) { + _IPATH_UNIT_ERROR(t, + "Catastrophic software error, RcvHdrTailAddr0 written as %llx, read back as %llx from %x\n", + val, atmp, kr_rcvhdrtailaddr); + ret = -EINVAL; + goto done; + } + /* so we can get current tail in ipath_kreceive(), per chip */ + dd->ipath_hdrqtailptr = + &ipath_port0_rcvhdrtail[t * + (64 / sizeof(ipath_port0_rcvhdrtail[0]))]; + + ipath_kput_kreg(t, kr_rcvbthqp, IPATH_KD_QP); + + /* + * make sure we are not in freeze, and PIO send enabled, so + * writes to pbc happen + */ + ipath_kput_kreg(t, kr_hwerrmask, 0ULL); + ipath_kput_kreg(t, kr_hwerrclear, ~0ULL); + ipath_kput_kreg(t, kr_control, 0ULL); + ipath_kput_kreg(t, kr_sendctrl, INFINIPATH_S_PIOENABLE); + + /* + * write the pbc of each buffer, to be sure it's initialized, then + * cancel all the buffers, and also abort any packets that might + * have been in flight for some reason (the latter is for driver + * unload/reload, but isn't a bad idea at first init). + * PIO send isn't enabled at this point, so there is no danger + * of sending these out on the wire. + * Chip Errata bug 6610 + */ + piobuf = (uint32_t *) (((char *)(dd->ipath_kregbase)) + + dd->ipath_piobufbase); + pioincr = devdata[t].ipath_palign / sizeof(*piobuf); + for (i = 0; i < dd->ipath_piobcnt; i++) { + *piobuf = 16; /* reasonable word count, just to init pbc */ + piobuf += pioincr; + } + /* self-clearing */ + ipath_kput_kreg(t, kr_sendctrl, INFINIPATH_S_ABORT); + + /* + * before error clears, since we expect serdes pll errors during + * this, the first time after reset + */ + if (ipath_bringup_link(t)) { + _IPATH_INFO("Failed to bringup IB link\n"); + ret = -ENETDOWN; + goto done; + } + + /* + * clear any "expected" hwerrs from reset and/or initialization + * clear any that aren't enabled (at least this once), and then + * set the enable mask + */ + ipath_clear_init_hwerrs(t); + ipath_kput_kreg(t, kr_hwerrclear, ~0ULL); + ipath_kput_kreg(t, kr_hwerrmask, dd->ipath_hwerrmask); + + dd->ipath_maskederrs = dd->ipath_ignorederrs; + ipath_kput_kreg(t, kr_errorclear, ~0ULL); /* clear all */ + /* enable errors that are masked, at least this first time. */ + ipath_kput_kreg(t, kr_errormask, ~dd->ipath_maskederrs); + /* clear any interrups up to this point (ints still not enabled) */ + ipath_kput_kreg(t, kr_intclear, ~0ULL); + + ipath_stats.sps_lid[t] = dd->ipath_lid; + + /* + * allocate the shadow TID array, so we can ipath_munlock + * previous entries. 
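
/*
 * The rcvhdrtail setup above in miniature: resolve a vmalloc'ed
 * kernel address to the physical address of its backing page, then
 * give each unit a 64-byte slot so the chip's low-6-bits-zero
 * alignment rule holds.  Sketch only; assumes the array fits in one
 * page, and the helper name is illustrative.
 */
#include <linux/vmalloc.h>

static unsigned long example_tail_slot_phys(void *vbase, unsigned int unit)
{
	return page_to_phys(vmalloc_to_page(vbase)) + unit * 64;
}
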
It may make more sense to move the pageshadow
+ * to the port data structure, so we only allocate memory for ports
+ * actually in use, since it is 8k per port now
+ */
+	dd->ipath_pageshadow = (struct page **)
+	    vmalloc(dd->ipath_cfgports * dd->ipath_rcvtidcnt *
+		    sizeof(struct page *));
+	if (!dd->ipath_pageshadow)
+		_IPATH_UNIT_ERROR(t,
+				  "failed to allocate shadow page * array, no expected sends!\n");
+	else
+		memset(dd->ipath_pageshadow, 0,
+		       dd->ipath_cfgports * dd->ipath_rcvtidcnt *
+		       sizeof(struct page *));
+
+	/* set up the port 0 (kernel) rcvhdr q and egr TIDs */
+	if (!(ret = ipath_create_rcvhdrq(dd->ipath_pd[0])))
+		ret = ipath_create_port0_egr(dd->ipath_pd[0]);
+	if (ret)
+		_IPATH_UNIT_ERROR(t,
+				  "failed to allocate port 0 (kernel) rcvhdrq and/or egr bufs\n");
+	else {
+		init_waitqueue_head(&ipath_sma_wait);
+		init_waitqueue_head(&ipath_sma_state_wait);
+
+		ipath_kput_kreg(pd->port_unit, kr_rcvctrl, dd->ipath_rcvctrl);
+
+		ipath_kput_kreg(t, kr_rcvbthqp, IPATH_KD_QP);
+
+		/* Enable PIO send, and update of PIOavail regs to memory. */
+		dd->ipath_sendctrl = INFINIPATH_S_PIOENABLE
+			| INFINIPATH_S_PIOBUFAVAILUPD;
+		ipath_kput_kreg(t, kr_sendctrl, dd->ipath_sendctrl);
+
+		/*
+		 * enable port 0 receive, and receive interrupt
+		 * other ports done as user opens and inits them
+		 */
+		dd->ipath_rcvctrl = INFINIPATH_R_TAILUPD |
+			(1ULL << INFINIPATH_R_PORTENABLE_SHIFT) |
+			(1ULL << INFINIPATH_R_INTRAVAIL_SHIFT);
+		ipath_kput_kreg(t, kr_rcvctrl, dd->ipath_rcvctrl);
+
+		/*
+		 * now ready for use
+		 * this should be cleared whenever we detect a reset, or
+		 * initiate one.
+		 */
+		dd->ipath_flags |= IPATH_INITTED;
+
+		/*
+		 * init our shadow copies of head from tail values, and write
+		 * head values to match
+		 */
+		val32 = ipath_kget_ureg32(t, ur_rcvegrindextail, 0);
+		(void)ipath_kput_ureg(t, ur_rcvegrindexhead, val32, 0);
+		dd->ipath_port0head = ipath_kget_ureg32(t, ur_rcvhdrtail, 0);
+		(void)ipath_kput_ureg(t, ur_rcvhdrhead, dd->ipath_port0head, 0);
+
+		/*
+		 * by now pioavail updates to memory should have occurred,
+		 * so copy them into our working/shadow registers; this is
+		 * in case something went wrong with abort, but mostly to
+		 * get the initial values of the generation bit correct
+		 */
+		for (i = 0; i < dd->ipath_pioavregs; i++) {
+			/*
+			 * Chip Errata bug 6641; even and odd qwords>3
+			 * are swapped
+			 */
+			if (i > 3) {
+				if (i & 1)
+					dd->ipath_pioavailshadow[i] =
+						dd->ipath_pioavailregs_dma[i - 1];
+				else
+					dd->ipath_pioavailshadow[i] =
+						dd->ipath_pioavailregs_dma[i + 1];
+			} else
+				dd->ipath_pioavailshadow[i] =
+					dd->ipath_pioavailregs_dma[i];
+		}
+		/* can get counters, stats, etc.
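
/*
 * The errata 6641 copy loop above, reduced to its index mapping: for
 * qword indices above 3, even/odd pairs arrive swapped, which is
 * exactly an XOR with 1 (illustrative helper, not part of the patch):
 */
static inline unsigned int example_pioav_dma_index(unsigned int i)
{
	return (i > 3) ? (i ^ 1) : i;
}
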
*/ + dd->ipath_flags |= IPATH_PRESENT; + } + + /* + * cause retrigger of pending interrupts ignored during init, even if + * we had errors + */ + ipath_kput_kreg(t, kr_intclear, 0ULL); + + /* + * set up stats retrieval timer, even if we had errors in last + * portion of setup + */ + init_timer(&dd->ipath_stats_timer); + dd->ipath_stats_timer.function = ipath_get_faststats; + dd->ipath_stats_timer.data = (unsigned long)t; + /* every 5 seconds; */ + dd->ipath_stats_timer.expires = jiffies + 5 * HZ; + /* takes ~16 seconds to overflow at full IB 4x bandwdith */ + add_timer(&dd->ipath_stats_timer); + + dd->ipath_stats_timer_active = 1; + +done: + if (!ret) { + ipath_get_guid(t); + *dd->ipath_statusp |= IPATH_STATUS_CHIP_PRESENT; + if (!ipath_sma_data_spare) { + /* first init, setup SMA data structs */ + ipath_sma_data_spare = + ipath_sma_data_bufs[IPATH_NUM_SMAPKTS]; + for (i = 0; i < IPATH_NUM_SMAPKTS; i++) + ipath_sma_data[i].buf = ipath_sma_data_bufs[i]; + } + /* + * sps_nports is a global, so, we set it to the highest + * number of ports of any of the chips we find; we never + * decrement it, at least for now. + */ + if (dd->ipath_cfgports > ipath_stats.sps_nports) + ipath_stats.sps_nports = dd->ipath_cfgports; + } + /* if ret is non-zero, we probably should do some cleanup here... */ + return ret; +} + +int ipath_waitfor_complete(const ipath_type t, ipath_kreg reg_id, + uint64_t bits_to_wait_for, uint64_t * valp) +{ + uint64_t timeout, lastval, val; + + lastval = ipath_kget_kreg64(t, reg_id); + timeout = get_cycles() + 0x10000000ULL; /* <- ridiculously long time */ + do { + val = ipath_kget_kreg64(t, reg_id); + *valp = val; /* so they have something, even on failures. */ + if ((val & bits_to_wait_for) == bits_to_wait_for) + return 0; + if (val != lastval) + _IPATH_VDBG + ("Changed from %llx to %llx, waiting for %llx bits\n", + lastval, val, bits_to_wait_for); + yield(); + if (get_cycles() > timeout) { + _IPATH_DBG + ("Didn't get bits %llx in register 0x%x, got %llx\n", + bits_to_wait_for, reg_id, *valp); + return ENODEV; + } + } while (1); +} + +/* + * like ipath_waitfor_complete(), but we wait for the CMDVALID bit to go away + * indicating the last command has completed. It doesn't return data + */ +int ipath_waitfor_mdio_cmdready(const ipath_type t) +{ + uint64_t timeout; + uint64_t val; + + timeout = get_cycles() + 0x10000000ULL; /* <- ridiculously long time */ + do { + val = ipath_kget_kreg64(t, kr_mdio); + if (!(val & IPATH_MDIO_CMDVALID)) + return 0; + yield(); + if (get_cycles() > timeout) { + _IPATH_DBG("CMDVALID stuck in mdio reg? (%llx)\n", val); + return ENODEV; + } + } while (1); +} + +void ipath_set_ib_lstate(const ipath_type t, int which) +{ + ipath_devdata *dd = &devdata[t]; + char *what; + + /* + * For all cases, we'll either be setting a new value of linkcmd, or + * we want it to be NOP, so clear it here. 
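
/*
 * Both wait routines above share the same poll/yield/timeout shape;
 * a condensed sketch keeping the patch's deliberately huge cycle
 * budget.  The function pointer stands in for the register read, and
 * the helper name is illustrative.
 */
#include <linux/sched.h>
#include <linux/timex.h>

static int example_poll_for_bits(uint64_t (*rd)(void), uint64_t bits)
{
	uint64_t deadline = get_cycles() + 0x10000000ULL;

	do {
		if ((rd() & bits) == bits)
			return 0;
		yield();		/* don't hog the cpu */
	} while (get_cycles() <= deadline);
	return -ENODEV;
}
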
+ * Similarly, we want the linkinitcmd to be NOP for everything
+ * other than explicitly changing linkinitcmd, and for that case,
+ * we want to first clear any existing bits
+ */
+	dd->ipath_ibcctrl &= ~((INFINIPATH_IBCC_LINKCMD_MASK <<
+				INFINIPATH_IBCC_LINKCMD_SHIFT) |
+			       (INFINIPATH_IBCC_LINKINITCMD_MASK <<
+				INFINIPATH_IBCC_LINKINITCMD_SHIFT));
+
+	if (which == INFINIPATH_IBCC_LINKCMD_INIT) {
+		dd->ipath_flags &= ~(IPATH_LINK_TOARMED | IPATH_LINK_TOACTIVE
+				     | IPATH_LINK_SLEEPING);
+		/* so we can watch for a transition */
+		dd->ipath_flags |= IPATH_LINKDOWN;
+		what = "INIT";
+	} else if (which == INFINIPATH_IBCC_LINKCMD_ARMED) {
+		dd->ipath_flags |= IPATH_LINK_TOARMED;
+		dd->ipath_flags &= ~(IPATH_LINK_TOACTIVE | IPATH_LINK_SLEEPING);
+		/*
+		 * this is mainly for loopback testing. If INITCMD is
+		 * NOP or SLEEP, the link won't ever come up in loopback...
+		 */
+		if (!(dd->ipath_flags & (IPATH_LINKINIT | IPATH_LINKARMED |
+					 IPATH_LINKACTIVE))) {
+			_IPATH_SMADBG
+			    ("going to armed, but link not yet up, set POLL\n");
+			dd->ipath_ibcctrl |=
+				INFINIPATH_IBCC_LINKINITCMD_POLL <<
+				INFINIPATH_IBCC_LINKINITCMD_SHIFT;
+		}
+		what = "ARMED";
+	} else if (which == INFINIPATH_IBCC_LINKCMD_ACTIVE) {
+		dd->ipath_flags |= IPATH_LINK_TOACTIVE;
+		dd->ipath_flags &= ~(IPATH_LINK_TOARMED | IPATH_LINK_SLEEPING);
+		what = "ACTIVE";
+	} else if (which & (INFINIPATH_IBCC_LINKINITCMD_MASK <<
+			    INFINIPATH_IBCC_LINKINITCMD_SHIFT)) {
+		/* down, disable, etc. */
+		dd->ipath_flags &= ~(IPATH_LINK_TOARMED | IPATH_LINK_TOACTIVE);
+		if (((which & INFINIPATH_IBCC_LINKINITCMD_MASK) >>
+		     INFINIPATH_IBCC_LINKINITCMD_SHIFT) ==
+		    INFINIPATH_IBCC_LINKINITCMD_SLEEP) {
+			dd->ipath_flags |= IPATH_LINK_SLEEPING | IPATH_LINKDOWN;
+		} else
+			dd->ipath_flags |= IPATH_LINKDOWN;
+		dd->ipath_ibcctrl |=
+			which & (INFINIPATH_IBCC_LINKINITCMD_MASK <<
+				 INFINIPATH_IBCC_LINKINITCMD_SHIFT);
+		what = "DOWN";
+	} else {
+		what = "UNKNOWN";
+		_IPATH_INFO("Unknown link transition requested (which=0x%x)\n",
+			    which);
+	}
+
+	dd->ipath_ibcctrl |= ((uint64_t) which & INFINIPATH_IBCC_LINKCMD_MASK)
+		<< INFINIPATH_IBCC_LINKCMD_SHIFT;
+
+	_IPATH_SMADBG("Trying to move unit %u to %s, current ltstate is %s\n",
+		      t, what,
+		      ipath_ibcstatus_str[(ipath_kget_kreg64(t, kr_ibcstatus)
+			  >> INFINIPATH_IBCS_LINKTRAININGSTATE_SHIFT)
+			  & INFINIPATH_IBCS_LINKTRAININGSTATE_MASK]);
+	ipath_kput_kreg(t, kr_ibcctrl, dd->ipath_ibcctrl);
+}
+
+static int ipath_bringup_link(const ipath_type t)
+{
+	ipath_devdata *dd = &devdata[t];
+	uint64_t val, ibc;
+	int ret = 0;
+
+	dd->ipath_control &= ~INFINIPATH_C_LINKENABLE;	/* hold IBC in reset */
+	ipath_kput_kreg(t, kr_control, dd->ipath_control);
+
+	/*
+	 * Note that prior to try 14 or 15 of IB, the credit scaling
+	 * wasn't working, because it was swapped for writes with the
+	 * 1 bit default linkstate field
+	 */
+
+	/* ignore pbc and align word */
+	val = dd->ipath_piosize - 2 * sizeof(uint32_t);
+	/*
+	 * for ICRC, which we only send in diag test pkt mode, and we don't
+	 * need to worry about that for mtu
+	 */
+	val += 1;
+	/*
+	 * set the IBC maxpktlength to the size of our pio buffers
+	 * the maxpktlength is in words. This is *not* the IB data MTU
+	 */
+	ibc = (val / sizeof(uint32_t)) << INFINIPATH_IBCC_MAXPKTLEN_SHIFT;
+	/* in KB */
+	ibc |= 0x5ULL << INFINIPATH_IBCC_FLOWCTRLWATERMARK_SHIFT;
+	/* how often flowctrl sent
+	 * more or less in usecs; balance against watermark value, so that
+	 * in theory senders always get a flow control update in time to not
+	 * let the IB link go idle.
+ */ + ibc |= 0x3ULL << INFINIPATH_IBCC_FLOWCTRLPERIOD_SHIFT; + /* max error tolerance */ + ibc |= 0xfULL << INFINIPATH_IBCC_PHYERRTHRESHOLD_SHIFT; + /* use "real" buffer space for */ + ibc |= 4ULL << INFINIPATH_IBCC_CREDITSCALE_SHIFT; + /* IB credit flow control. */ + ibc |= 0xfULL << INFINIPATH_IBCC_OVERRUNTHRESHOLD_SHIFT; + /* initially come up waiting for TS1, without sending anything. */ + dd->ipath_ibcctrl = ibc; + /* don't put linkinitcmd in ipath_ibcctrl, want that to stay a NOP */ + ibc |= + INFINIPATH_IBCC_LINKINITCMD_SLEEP << + INFINIPATH_IBCC_LINKINITCMD_SHIFT; + dd->ipath_flags |= IPATH_LINK_SLEEPING; + ipath_kput_kreg(t, kr_ibcctrl, ibc); + + ret = ipath_bringup_serdes(t); + + if (ret) + _IPATH_INFO("Could not initialize SerDes, not usable\n"); + else { + dd->ipath_control |= INFINIPATH_C_LINKENABLE; /* enable IBC */ + ipath_kput_kreg(t, kr_control, dd->ipath_control); + } + + return ret; +} + +/* + * called from ipath_shutdown_link(), and from sma doing a LINKDOWN + * Left as a separate function for historical reasons, and may want + * it to do more than just call ipath_set_ib_lstate() again sometime + * in the future. + */ +void ipath_down_link(const ipath_type t) +{ + ipath_set_ib_lstate(t, INFINIPATH_IBCC_LINKINITCMD_SLEEP << + INFINIPATH_IBCC_LINKINITCMD_SHIFT); +} + +/* + * do this when driver is being unloaded, or perhaps for diags, and + * maybe when we get an interrupt of a fatal link error that requires + * bringing the linkd down and back up + */ +static int ipath_shutdown_link(const ipath_type t) +{ + uint64_t val; + ipath_devdata *dd = &devdata[t]; + int ret = 0; + + _IPATH_DBG("Shutting down the link\n"); + ipath_down_link(t); + + /* + * we are shutting down, so tell the layered driver. We don't + * do this on just a link state change, much like ethernet, + * a cable unplug, etc. doesn't change driver state + */ + if (dd->ipath_layer.l_intr) + dd->ipath_layer.l_intr(t, IPATH_LAYER_INT_IF_DOWN); + + dd->ipath_control &= ~INFINIPATH_C_LINKENABLE; /* disable IBC */ + ipath_kput_kreg(t, kr_control, dd->ipath_control); + + *dd->ipath_statusp &= ~(IPATH_STATUS_IB_CONF | IPATH_STATUS_IB_READY); + + /* + * clear SerdesEnable and turn the leds off; do this here because + * we are unloading, so don't count on interrupts to move along + */ + + ipath_quiet_serdes(t); + val = dd->ipath_extctrl & + ~(INFINIPATH_EXTC_LEDPRIPORTGREENON | + INFINIPATH_EXTC_LEDPRIPORTYELLOWON); + dd->ipath_extctrl = val; + ipath_kput_kreg(t, kr_extctrl, val); + + if (dd->ipath_stats_timer_active) { + del_timer_sync(&dd->ipath_stats_timer); + dd->ipath_stats_timer_active = 0; + } + if (*dd->ipath_statusp & IPATH_STATUS_CHIP_PRESENT) { + /* can't do anything more with chip */ + /* needs re-init */ + *dd->ipath_statusp &= ~IPATH_STATUS_CHIP_PRESENT; + if (dd->ipath_kregbase) { + /* + * if we haven't already cleaned up before these + * are to ensure any register reads/writes "fail" + * until re-init + */ + dd->ipath_kregbase = NULL; + dd->ipath_kregvirt = NULL; + dd->ipath_uregbase = 0ULL; + dd->ipath_sregbase = 0ULL; + dd->ipath_cregbase = 0ULL; + dd->ipath_kregsize = 0; + } +#ifdef CONFIG_MTRR + if (dd->ipath_mtrr) { + _IPATH_VDBG("undoing WCCOMB on pio buffers\n"); + mtrr_del(dd->ipath_mtrr, 0, 0); + dd->ipath_mtrr = 0; + } +#endif + } + + return ret; +} + +/* + * when closing, free up any allocated data for a port, if the + * reference count goes to zero + * Note: this also frees the portdata itself! 
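
/*
 * Gathered in one place for readability, the IBC control word built
 * in ipath_bringup_link() above is just ORed bit fields.  The
 * constants are the ones the patch uses; the helper itself is
 * illustrative and not part of the driver.
 */
static uint64_t example_build_ibcctrl(uint64_t maxpkt_words)
{
	uint64_t ibc;

	ibc  = maxpkt_words << INFINIPATH_IBCC_MAXPKTLEN_SHIFT;
	ibc |= 0x5ULL << INFINIPATH_IBCC_FLOWCTRLWATERMARK_SHIFT; /* KB */
	ibc |= 0x3ULL << INFINIPATH_IBCC_FLOWCTRLPERIOD_SHIFT;	/* ~usecs */
	ibc |= 0xfULL << INFINIPATH_IBCC_PHYERRTHRESHOLD_SHIFT;
	ibc |= 4ULL << INFINIPATH_IBCC_CREDITSCALE_SHIFT;
	ibc |= 0xfULL << INFINIPATH_IBCC_OVERRUNTHRESHOLD_SHIFT;
	return ibc;
}
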
+ */ +void ipath_free_pddata(ipath_devdata * dd, uint32_t port, int freehdrq) +{ + ipath_portdata *pd = dd->ipath_pd[port]; + + if (!pd) + return; + if (freehdrq) + /* + * only clear and free portdata if we are going to + * also release the hdrq, otherwise we leak the hdrq on each + * open/close cycle + */ + dd->ipath_pd[port] = NULL; + /* cleanup locked pages private data structures */ + ipath_mlock_cleanup(pd); + if (freehdrq && pd->port_rcvhdrq) { + int i, n = 1 << pd->port_rcvhdrq_order; + _IPATH_VDBG("free closed port %d rcvhdrq @ %p (order=%u)\n", + pd->port_port, pd->port_rcvhdrq, + pd->port_rcvhdrq_order); + for (i = 0; i < n; i++) + ClearPageReserved(virt_to_page + (pd->port_rcvhdrq + (i * PAGE_SIZE))); + free_pages((unsigned long)pd->port_rcvhdrq, + pd->port_rcvhdrq_order); + pd->port_rcvhdrq = NULL; + } + if (port && pd->port_rcvegrbuf_pages) { /* always free this, however */ + void *virt; + unsigned e, i, n = 1 << pd->port_rcvegrbuf_order; + if (pd->port_rcvegrbuf_virt) { + for (e = 0; e < pd->port_rcvegrbuf_chunks; e++) { + virt = pd->port_rcvegrbuf_virt[e]; + for (i = 0; i < n; i++) + ClearPageReserved(virt_to_page + (virt + + (i * PAGE_SIZE))); + _IPATH_VDBG + ("egrbuf free_pages(%p, %x), chunk %u/%u\n", + virt, pd->port_rcvegrbuf_order, e, + pd->port_rcvegrbuf_chunks); + free_pages((unsigned long)virt, + pd->port_rcvegrbuf_order); + } + vfree(pd->port_rcvegrbuf_virt); + pd->port_rcvegrbuf_virt = NULL; + } + pd->port_rcvegrbuf_chunks = 0; + _IPATH_VDBG("free closed port %d rcvegrbufs ptr array\n", + pd->port_port); + /* now the pointer array. */ + vfree(pd->port_rcvegrbuf_pages); + pd->port_rcvegrbuf_pages = NULL; + } else if (port == 0 && dd->ipath_port0_skbs) { + unsigned e; + struct sk_buff **skbs = dd->ipath_port0_skbs; + + dd->ipath_port0_skbs = NULL; + _IPATH_VDBG("free closed port %d ipath_port0_skbs @ %p\n", + pd->port_port, skbs); + for (e = 0; e < dd->ipath_rcvegrcnt; e++) + if (skbs[e]) + dev_kfree_skb(skbs[e]); + vfree(skbs); + } + if (freehdrq) { + kfree(pd->port_tid_pg_list); + kfree(pd); + } +} + +int __init infinipath_init(void) +{ + int r = 0, i; + + _IPATH_DBG(KERN_INFO DRIVER_LOAD_MSG "%s", ipath_core_version); + + ipath_init_picotime(); /* init cycles -> pico conversion */ + + if (!ipath_ctl_header) { /* should be always */ + if (!(ipath_ctl_header = register_sysctl_table(ipath_ctl, 1))) + _IPATH_INFO("Couldn't register sysctl interface\n"); + } + + /* + * initialize the statusp to temporary storage so we can use it + * everywhere without first checking. When we "really" assign it, + * we copy from _ipath_status + */ + for (i = 0; i < infinipath_max; i++) + devdata[i].ipath_statusp = &devdata[i]._ipath_status; + + /* + * init these early, in case we take an interrupt as soon as the irq + * is setup. Saw a spinlock panic once that appeared to be due to that + * problem, when they were initted later on. + */ + spin_lock_init(&ipath_pioavail_lock); + spin_lock_init(&ipath_sma_lock); + + pci_register_driver(&infinipath_driver); + + driver_create_file(&(infinipath_driver.driver), &driver_attr_version); + + if ((r = register_chrdev(ipath_major, MODNAME, &ipath_fops))) + _IPATH_ERROR("Unable to register %s device\n", MODNAME); + + + /* + * never return an error, since we could have stuff registered, + * resources used, etc., even if no hardware found. This way we + * can clean up through unload. 
+ */ + return 0; +} + +/* + * note: if for some reason the unload fails after this routine, and leaves + * the driver enterable by user code, we'll almost certainly crash and burn... + */ +static void __exit infinipath_cleanup(void) +{ + int r, m, port; + + driver_remove_file(&(infinipath_driver.driver), &driver_attr_version); + if (ipath_ctl_header) { + unregister_sysctl_table(ipath_ctl_header); + ipath_ctl_header = NULL; + } else + _IPATH_DBG("No sysctl unregister, not registered OK\n"); + if ((r = unregister_chrdev(ipath_major, MODNAME))) + _IPATH_DBG("unregister of device failed: %d\n", r); + + + /* + * turn off rcv, send, and interrupts for all ports, all drivers + * should also hard reset the chip here? + * free up port 0 (kernel) rcvhdr, egr bufs, and eventually tid bufs + * for all versions of the driver, if they were allocated + */ + for (m = 0; m < infinipath_max; m++) { + uint64_t val; + ipath_devdata *dd = &devdata[m]; + if (dd->ipath_kregbase) { + /* in case unload fails, be consistent */ + dd->ipath_rcvctrl = 0U; + ipath_kput_kreg(m, kr_rcvctrl, dd->ipath_rcvctrl); + + /* + * gracefully stop all sends allowing any in + * progress to trickle out first. + */ + ipath_kput_kreg(m, kr_sendctrl, 0ULL); + val = ipath_kget_kreg64(m, kr_scratch); /* flush it */ + /* + * enough for anything that's going to trickle + * out to have actually done so. + */ + udelay(5); + + /* + * abort any armed or launched PIO buffers that + * didn't go. (self clearing). Will cause any + * packet currently being transmitted to go out + * with an EBP, and may also cause a short packet + * error on the receiver. + */ + ipath_kput_kreg(m, kr_sendctrl, INFINIPATH_S_ABORT); + + /* mask interrupts, but not errors */ + ipath_kput_kreg(m, kr_intmask, 0ULL); + ipath_shutdown_link(m); + + /* + * clear all interrupts and errors. 
Next time + * driver is loaded, we know that whatever is + * set happened while we were unloaded + */ + ipath_kput_kreg(m, kr_hwerrclear, ~0ULL); + ipath_kput_kreg(m, kr_errorclear, ~0ULL); + ipath_kput_kreg(m, kr_intclear, ~0ULL); + if (dd->__ipath_pioavailregs_base) { + kfree((void *)dd->__ipath_pioavailregs_base); + dd->__ipath_pioavailregs_base = + dd->ipath_pioavailregs_dma = 0; + } + + if (dd->ipath_pageshadow) { + struct page **tmpp = dd->ipath_pageshadow; + int i, cnt = 0; + + _IPATH_VDBG + ("Unlocking any expTID pages still locked\n"); + for (port = 0; port < dd->ipath_cfgports; + port++) { + int port_tidbase = + port * dd->ipath_rcvtidcnt; + int maxtid = + port_tidbase + dd->ipath_rcvtidcnt; + for (i = port_tidbase; i < maxtid; i++) { + if (tmpp[i]) { + ipath_munlock(1, + &tmpp[i]); + tmpp[i] = 0; + cnt++; + } + } + } + if (cnt) { + ipath_stats.sps_pageunlocks += cnt; + _IPATH_VDBG + ("There were still %u expTID entries locked\n", + cnt); + } + if (ipath_stats.sps_pagelocks + || ipath_stats.sps_pageunlocks) + _IPATH_VDBG + ("%llu pages locked, %llu unlocked via ipath_m{un}lock\n", + ipath_stats.sps_pagelocks, + ipath_stats.sps_pageunlocks); + + _IPATH_VDBG + ("Free shadow page tid array at %p\n", + dd->ipath_pageshadow); + vfree(dd->ipath_pageshadow); + dd->ipath_pageshadow = NULL; + } + + /* + * free any resources still in use (usually just + * kernel ports) at unload + */ + for (port = 0; port < dd->ipath_cfgports; port++) + ipath_free_pddata(dd, port, 1); + kfree(dd->ipath_pd); + /* + * debuggability, in case some cleanup path + * tries to use it after this + */ + dd->ipath_pd = NULL; + } + + if (dd->pcidev) { + if (dd->pcidev->irq) { + _IPATH_VDBG("unit %u free_irq of irq %x\n", m, + dd->pcidev->irq); + free_irq(dd->pcidev->irq, dd); + } else + _IPATH_DBG + ("irq is 0, not doing free_irq for unit %u\n", + m); + dd->pcidev = NULL; + } + if (dd->pci_registered) { + _IPATH_VDBG + ("Unregistering pci infrastructure unit %u\n", m); + pci_unregister_driver(&infinipath_driver); + dd->pci_registered = 0; + } else + _IPATH_VDBG + ("unit %u: no pci unreg, wasn't registered\n", m); + ipath_chip_cleanup(dd); /* clean up any per-chip chip-specific stuff */ + } + /* + * clean up any chip-specific stuff for now, only one type of chip + * for any given driver + */ + ipath_chip_done(); + + /* cleanup all our locked pages private data structures */ + ipath_mlock_cleanup(NULL); +} + +/* This is a generic function here, so it can return device-specific + * info. This allows keeping in sync with the version that supports + * multiple chip types. 
+*/ +void ipath_get_boardname(const ipath_type t, char *name, size_t namelen) +{ + ipath_ht_get_boardname(t, name, namelen); +} + +module_init(infinipath_init); +module_exit(infinipath_cleanup); + +EXPORT_SYMBOL(infinipath_debug); +EXPORT_SYMBOL(ipath_get_boardname); -- 0.99.9n From rolandd at cisco.com Fri Dec 16 15:48:55 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 16 Dec 2005 15:48:55 -0800 Subject: [openib-general] [PATCH 08/13] [RFC] ipath core last bit In-Reply-To: <200512161548.3fqe3fMerrheBMdX@cisco.com> Message-ID: <200512161548.y9KRuNtfMzpZjwni@cisco.com> Last piece of ipath LLD --- drivers/infiniband/hw/ipath/ipath_layer.c | 1155 +++++++++++++++++++++++++++++ 1 files changed, 1155 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/hw/ipath/ipath_layer.c 978ded82c9b5a4bca4e55f36d20ef4a585c50f38 diff --git a/drivers/infiniband/hw/ipath/ipath_layer.c b/drivers/infiniband/hw/ipath/ipath_layer.c new file mode 100644 index 0000000..6a60851 --- /dev/null +++ b/drivers/infiniband/hw/ipath/ipath_layer.c @@ -0,0 +1,1155 @@ +/* + * Copyright (c) 2003, 2004, 2005. PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. + * + * $Id: ipath_layer.c 4365 2005-12-10 00:04:16Z rjwalsh $ + */ + +/* + * These are the routines used by layered drivers, currently just the + * layered ethernet driver and verbs layer. 
+ */ + +#include + +#include "ipath_kernel.h" +#include "ips_common.h" +#include "ipath_layer.h" + +/* unit number is already validated in ipath_ioctl() */ +int ipath_kset_linkstate(uint32_t arg) +{ + ipath_type unit = 0xffff & (arg >> 16); + uint32_t lstate; + ipath_devdata *dd; + + if (unit >= infinipath_max || + !(devdata[unit].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", unit); + return -ENODEV; + } + + dd = &devdata[unit]; + arg &= 0xffff; + if (arg != IPATH_IB_LINKDOWN && arg != IPATH_IB_LINKARM && + arg != IPATH_IB_LINKACTIVE) { + _IPATH_DBG("Unknown linkstate 0x%x requested\n", arg); + return -EINVAL; + } + if (arg == IPATH_IB_LINKDOWN) { + ipath_down_link(unit); /* really moving it to idle */ + lstate = IPATH_LINKDOWN | IPATH_LINK_SLEEPING; + } else if (arg == IPATH_IB_LINKARM) { + if (!(dd->ipath_flags & + (IPATH_LINKINIT | IPATH_LINKARMED | IPATH_LINKDOWN | + IPATH_LINK_SLEEPING | IPATH_LINKACTIVE))) + _IPATH_DBG + ("don't know current state (flags 0x%x), try anyway\n", + dd->ipath_flags); + ipath_set_ib_lstate(unit, INFINIPATH_IBCC_LINKCMD_ARMED); + lstate = IPATH_LINKARMED; + } else { + int tryarmed = 0; + /* + * because we sometimes go to ARMED, but then back to 0x11 + * (initialized) before the SMA asks us to move to ACTIVE, + * we will try to advance state to ARMED here, if necessary + */ + if (!(dd->ipath_flags & + (IPATH_LINKINIT | IPATH_LINKARMED | IPATH_LINKDOWN | + IPATH_LINK_SLEEPING | IPATH_LINKACTIVE))) { + /* this one is just paranoia */ + _IPATH_DBG + ("don't know current state (flags 0x%x), try anyway\n", + dd->ipath_flags); + tryarmed = 1; + + } + if (!(dd->ipath_flags & (IPATH_LINKARMED | IPATH_LINKACTIVE))) + tryarmed = 1; + if (tryarmed) { + ipath_set_ib_lstate(unit, + INFINIPATH_IBCC_LINKCMD_ARMED); + /* + * give it up to 2 seconds to get to ARMED or + * ACTIVE; continue afterwards even if we fail + */ + if (ipath_wait_linkstate + (unit, IPATH_LINKARMED | IPATH_LINKACTIVE, 2000)) + _IPATH_VDBG + ("try for active, even though didn't get to ARMED\n"); + } + + ipath_set_ib_lstate(unit, INFINIPATH_IBCC_LINKCMD_ACTIVE); + lstate = IPATH_LINKACTIVE; + } + return ipath_wait_linkstate(unit, lstate, 5000); +} + +/* + * we can handle "any" incoming size, the issue here is whether we + * need to restrict our outgoing size. For now, we don't do any + * sanity checking on this, and we don't deal with what happens to + * programs that are already running when the size changes. + * unit number is already validated in ipath_ioctl() + * NOTE: changing the MTU will usually cause the IBC to go back to + * link initialize (0x11) state... + */ +int ipath_kset_mtu(uint32_t arg) +{ + unsigned unit = (arg >> 16) & 0xffff; + uint32_t piosize; + int changed = 0; + + if (unit >= infinipath_max || + !(devdata[unit].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", unit); + return -ENODEV; + } + + arg &= 0xffff; + /* + * mtu is IB data payload max. It's the largest power of 2 less + * than piosize (or even larger, since it only really controls the + * largest we can receive; we can send the max of the mtu and piosize). + * We check that it's one of the valid IB sizes. 
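
/*
 * The MTU test just below admits exactly the five legal IB payload
 * sizes; an equivalent predicate is "a power of two between 256 and
 * 4096" (illustrative helper, not part of the patch):
 */
static int example_valid_ib_mtu(uint32_t mtu)
{
	return mtu >= 256 && mtu <= 4096 && !(mtu & (mtu - 1));
}
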
+ */ + if (arg != 256 && arg != 512 && arg != 1024 && arg != 2048 && + arg != 4096) { + _IPATH_DBG("Trying to set invalid mtu %u, failing\n", arg); + return -EINVAL; + } + if (devdata[unit].ipath_ibmtu == arg) { + return 0; /* same as current */ + } + + piosize = devdata[unit].ipath_ibmaxlen; + devdata[unit].ipath_ibmtu = arg; + + /* + * the 128 is the max IB header size allowed for in our pio send buffers + * If we are reducing the MTU below that, this doesn't completely make + * sense, but it's OK. + */ + if (arg >= (piosize - 128)) { + /* hasn't been changed */ + if (piosize == devdata[unit].ipath_init_ibmaxlen) + _IPATH_VDBG + ("mtu 0x%x >= ibmaxlen hardware max, nothing to do\n", + arg); + else { + _IPATH_VDBG + ("mtu 0x%x restores ibmaxlen to full amount 0x%x\n", + arg, piosize); + devdata[unit].ipath_ibmaxlen = piosize; + changed = 1; + } + } else if ((arg + 128) == devdata[unit].ipath_ibmaxlen) + _IPATH_VDBG("ibmaxlen %x same as current, no change\n", arg); + else { + piosize = arg + 128; + _IPATH_VDBG("ibmaxlen was 0x%x, setting to 0x%x (mtu 0x%x)\n", + devdata[unit].ipath_ibmaxlen, piosize, arg); + devdata[unit].ipath_ibmaxlen = piosize; + changed = 1; + } + + if (changed) { + /* + * set the IBC maxpktlength to the size of our pio + * buffers in words + */ + uint64_t ibc = devdata[unit].ipath_ibcctrl; + ibc &= ~(INFINIPATH_IBCC_MAXPKTLEN_MASK << + INFINIPATH_IBCC_MAXPKTLEN_SHIFT); + + piosize = piosize - 2 * sizeof(uint32_t); /* ignore pbc */ + devdata[unit].ipath_ibmaxlen = piosize; + piosize /= sizeof(uint32_t); /* in words */ + /* + * for ICRC, which we only send in diag test pkt mode, and we + * don't need to worry about that for mtu + */ + piosize += 1; + + ibc |= piosize << INFINIPATH_IBCC_MAXPKTLEN_SHIFT; + devdata[unit].ipath_ibcctrl = ibc; + ipath_kput_kreg(unit, kr_ibcctrl, devdata[unit].ipath_ibcctrl); + } + return 0; +} + +void ipath_set_sps_lid(const ipath_type unit, uint32_t arg) +{ + if (unit >= infinipath_max || + !(devdata[unit].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", unit); + return; + } + + ipath_stats.sps_lid[unit] = devdata[unit].ipath_lid = arg; + if (devdata[unit].ipath_layer.l_intr) + devdata[unit].ipath_layer.l_intr(unit, IPATH_LAYER_INT_LID); +} + +/* XXX - need to inform anyone who cares this just happened. 
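
/*
 * The MAXPKTLEN update in ipath_kset_mtu() above is the generic
 * clear-then-set idiom for one field of a packed control word
 * (illustrative helper):
 */
static uint64_t example_set_reg_field(uint64_t reg, uint64_t mask,
				      unsigned int shift, uint64_t val)
{
	reg &= ~(mask << shift);
	return reg | ((val & mask) << shift);
}
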
*/ +int ipath_layer_set_guid(const ipath_type device, uint64_t guid) +{ + if (device >= infinipath_max || + !(devdata[device].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", device); + return -ENODEV; + } + devdata[device].ipath_guid = guid; + return 0; +} + +uint64_t ipath_layer_get_guid(const ipath_type device) +{ + if (device >= infinipath_max || + !(devdata[device].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", device); + return 0; + } + return devdata[device].ipath_guid; +} + +uint32_t ipath_layer_get_nguid(const ipath_type device) +{ + if (device >= infinipath_max || + !(devdata[device].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", device); + return 0; + } + return devdata[device].ipath_nguid; +} + +int ipath_layer_query_device(const ipath_type device, uint32_t * vendor, + uint32_t * boardrev, uint32_t * majrev, + uint32_t * minrev) +{ + if (device >= infinipath_max || + !(devdata[device].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", device); + return -ENODEV; + } + + *vendor = devdata[device].ipath_vendorid; + *boardrev = devdata[device].ipath_boardrev; + *majrev = devdata[device].ipath_majrev; + *minrev = devdata[device].ipath_minrev; + + return 0; +} + +uint32_t ipath_layer_get_flags(const ipath_type device) +{ + if (device >= infinipath_max || + !(devdata[device].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", device); + return 0; + } + + return devdata[device].ipath_flags; +} + +struct device *ipath_layer_get_pcidev(const ipath_type device) +{ + if (device >= infinipath_max || + !(devdata[device].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", device); + return NULL; + } + + return &(devdata[device].pcidev->dev); +} + +uint16_t ipath_layer_get_deviceid(const ipath_type device) +{ + if (device >= infinipath_max || + !(devdata[device].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", device); + return 0; + } + + return devdata[device].ipath_deviceid; +} + +uint64_t ipath_layer_get_lastibcstat(const ipath_type device) +{ + if (device >= infinipath_max || + !(devdata[device].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", device); + return 0; + } + + return devdata[device].ipath_lastibcstat; +} + +uint32_t ipath_layer_get_ibmtu(const ipath_type device) +{ + if (device >= infinipath_max || + !(devdata[device].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", device); + return 0; + } + + return devdata[device].ipath_ibmtu; +} + +int ipath_layer_register(const ipath_type device, + int (*l_intr) (const ipath_type, uint32_t), + int (*l_rcv) (const ipath_type, void *, + struct sk_buff *), uint16_t l_rcv_opcode, + int (*l_rcv_lid) (const ipath_type, void *), + uint16_t l_rcv_lid_opcode) +{ + int ret = 0; + + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return 1; + } + if (!(devdata[device].ipath_flags & IPATH_INITTED)) { + _IPATH_VDBG("%s not yet initialized, failing\n", + ipath_get_unit_name(device)); + return 1; + } + + _IPATH_VDBG("intr %p rx %p, rx_lid %p\n", l_intr, l_rcv, l_rcv_lid); + if (devdata[device].ipath_layer.l_intr + || devdata[device].ipath_layer.l_rcv) { + _IPATH_DBG + ("Layered device already registered on unit %u, failing\n", + device); + return 1; + } + + if(!(*devdata[device].ipath_statusp & IPATH_STATUS_SMA)) + *devdata[device].ipath_statusp |= IPATH_STATUS_OIB_SMA; + devdata[device].ipath_layer.l_intr = l_intr; + devdata[device].ipath_layer.l_rcv = l_rcv; + 
devdata[device].ipath_layer.l_rcv_lid = l_rcv_lid; + devdata[device].ipath_layer.l_rcv_opcode = l_rcv_opcode; + devdata[device].ipath_layer.l_rcv_lid_opcode = l_rcv_lid_opcode; + + return ret; +} + +static void ipath_verbs_timer(unsigned long t) +{ + /* + * If port 0 receive packet interrupts are not availabile, + * check the receive queue. + */ + if (!(devdata[t].ipath_flags & IPATH_GPIO_INTR)) + ipath_kreceive(t); + + /* Handle verbs layer timeouts. */ + if (devdata[t].verbs_layer.l_timer_cb) + devdata[t].verbs_layer.l_timer_cb(t); + + mod_timer(&devdata[t].verbs_layer.l_timer, jiffies + 1); +} + +/* Verbs layer registration. */ +int ipath_verbs_register(const ipath_type device, + int (*l_piobufavail) (const ipath_type device), + void (*l_rcv) (const ipath_type device, void *rhdr, + void *data, u32 tlen), + void (*l_timer_cb) (const ipath_type device)) +{ + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return 0; + } + if (!(devdata[device].ipath_flags & IPATH_INITTED)) { + _IPATH_VDBG("%s not yet initialized, failing\n", + ipath_get_unit_name(device)); + return 0; + } + + _IPATH_VDBG("piobufavail %p rx %p\n", l_piobufavail, l_rcv); + if (devdata[device].verbs_layer.l_piobufavail || + devdata[device].verbs_layer.l_rcv) { + _IPATH_DBG("Verbs layer already registered on unit %u, " + "failing\n", device); + return 0; + } + + devdata[device].verbs_layer.l_piobufavail = l_piobufavail; + devdata[device].verbs_layer.l_rcv = l_rcv; + devdata[device].verbs_layer.l_timer_cb = l_timer_cb; + devdata[device].verbs_layer.l_flags = 0; + + return 1; +} + +void ipath_verbs_unregister(const ipath_type device) +{ + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return; + } + if (!(devdata[device].ipath_flags & IPATH_INITTED)) { + _IPATH_VDBG("%s not yet initialized, failing\n", + ipath_get_unit_name(device)); + return; + } + + *devdata[device].ipath_statusp &= ~IPATH_STATUS_OIB_SMA; + devdata[device].verbs_layer.l_piobufavail = NULL; + devdata[device].verbs_layer.l_rcv = NULL; + devdata[device].verbs_layer.l_timer_cb = NULL; + devdata[device].verbs_layer.l_flags = 0; +} + +int ipath_layer_open(const ipath_type device, uint32_t * pktmax) +{ + int ret = 0; + uint32_t intval = 0; + + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return 1; + } + if (!devdata[device].ipath_layer.l_intr + || !devdata[device].ipath_layer.l_rcv) { + _IPATH_DBG("layer not registered, failing\n"); + return 1; + } + + if ((ret = + ipath_setrcvhdrsize(device, NUM_OF_EKSTRA_WORDS_IN_HEADER_QUEUE))) + return ret; + + *pktmax = devdata[device].ipath_ibmaxlen; + + if (*devdata[device].ipath_statusp & IPATH_STATUS_IB_READY) + intval |= IPATH_LAYER_INT_IF_UP; + if (ipath_stats.sps_lid[device]) + intval |= IPATH_LAYER_INT_LID; + if (ipath_stats.sps_mlid[device]) + intval |= IPATH_LAYER_INT_BCAST; + /* + * do this on open, in case low level is already up and + * just layered driver was reloaded, etc. + */ + if (intval) + devdata[device].ipath_layer.l_intr(device, intval); + + return ret; +} + +int16_t ipath_layer_get_lid(const ipath_type device) +{ + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return 0; + } + + _IPATH_VDBG("returning mylid 0x%x for layered dev %d\n", + devdata[device].ipath_lid, device); + return devdata[device].ipath_lid; +} + +/* + * get the MAC address. This is the EUID-64 OUI octets (top 3), then + * skip the next 2 (which should both be zero or 0xff). 
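
/*
 * ipath_verbs_timer() above is a self-rearming one-jiffy poll timer:
 * do the work, then mod_timer() yourself one tick out.  The skeleton,
 * using the timer API of this era; all names here are illustrative.
 */
#include <linux/timer.h>

static struct timer_list example_timer;

static void example_tick(unsigned long data)
{
	/* ... poll the receive queue, run verbs timeouts ... */
	mod_timer(&example_timer, jiffies + 1);	/* re-arm */
}

static void example_timer_start(unsigned long data)
{
	init_timer(&example_timer);
	example_timer.function = example_tick;
	example_timer.data = data;
	example_timer.expires = jiffies + 1;
	add_timer(&example_timer);
}
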
+ * The returned MAC is in network order + * mac points to at least 6 bytes of buffer + * returns 0 on error (to be consistent with get_lid and get_bcast + * return 1 on success + * We assume that by the time the LID is set, that the GUID is as valid + * as it's ever going to be, rather than adding yet another status bit. + */ + +int ipath_layer_get_mac(const ipath_type device, uint8_t * mac) +{ + uint8_t *guid; + + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u, failing\n", device); + return 0; + } + guid = (uint8_t *) & devdata[device].ipath_guid; + + mac[0] = guid[0]; + mac[1] = guid[1]; + mac[2] = guid[2]; + mac[3] = guid[5]; + mac[4] = guid[6]; + mac[5] = guid[7]; + if((guid[3] || guid[4]) && !(guid[3] == 0xff && guid[4] == 0xff)) + _IPATH_DBG("Warning, guid bytes 3 and 4 not 0 or 0xffff: %x %x\n", + guid[3], guid[4]); + _IPATH_VDBG("Returning %x:%x:%x:%x:%x:%x\n", + mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]); + return 1; +} + +int16_t ipath_layer_get_bcast(const ipath_type device) +{ + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u, failing\n", device); + return 0; + } + + _IPATH_VDBG("returning broadcast LID 0x%x for unit %u\n", + devdata[device].ipath_mlid, device); + return devdata[device].ipath_mlid; +} + +int ipath_layer_get_num_of_dev(void) +{ + return infinipath_max; +} + +int ipath_layer_get_cr_errpkey(const ipath_type device) +{ + return ipath_kget_creg32(device, cr_errpkey); +} + +void ipath_layer_close(const ipath_type device) +{ + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return; + } + if (!devdata[device].ipath_layer.l_intr + || !devdata[device].ipath_layer.l_rcv) { + /* normal if not all chips are present */ + _IPATH_VDBG("layer close without open\n"); + } else { + devdata[device].ipath_layer.l_intr = NULL; + devdata[device].ipath_layer.l_rcv = NULL; + devdata[device].ipath_layer.l_rcv_lid = NULL; + devdata[device].ipath_layer.l_rcv_opcode = 0; + devdata[device].ipath_layer.l_rcv_lid_opcode = 0; + } +} + +static inline void copy_aligned(uint32_t *piobuf, struct ipath_sge_state *ss, + uint32_t length) +{ + struct ipath_sge *sge = &ss->sge; + + while (length) { + u32 len = sge->length; + u32 w; + + BUG_ON(len == 0); + if (len > length) + len = length; + /* Need to round up for the last dword in the packet. 
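
/*
 * ipath_layer_get_mac() condensed: an EUI-64 becomes a 48-bit MAC by
 * keeping the 3-byte OUI and the low 3 bytes, dropping the two middle
 * padding bytes (0x0000 or 0xffff).  Illustrative helper:
 */
static void example_guid_to_mac(const uint8_t guid[8], uint8_t mac[6])
{
	mac[0] = guid[0]; mac[1] = guid[1]; mac[2] = guid[2];
	mac[3] = guid[5]; mac[4] = guid[6]; mac[5] = guid[7];
}
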
*/ + w = (len + 3) >> 2; + ipath_dwordcpy(piobuf, sge->vaddr, w); + piobuf += w; + sge->vaddr += len; + sge->length -= len; + sge->sge_length -= len; + if (sge->sge_length == 0) { + if (--ss->num_sge) + *sge = *ss->sg_list++; + } else if (sge->length == 0 && sge->mr != NULL) { + if (++sge->n >= IPATH_SEGSZ) { + if (++sge->m >= sge->mr->mapsz) + break; + sge->n = 0; + } + sge->vaddr = sge->mr->map[sge->m]->segs[sge->n].vaddr; + sge->length = sge->mr->map[sge->m]->segs[sge->n].length; + } + length -= len; + } +} + +static inline void copy_unaligned(uint32_t *piobuf, struct ipath_sge_state *ss, + uint32_t length) +{ + struct ipath_sge *sge = &ss->sge; + union { + u8 wbuf[4]; + u32 w; + } u; + int extra = 0; + + while (length) { + u32 len = sge->length; + + BUG_ON(len == 0); + if (len > length) + len = length; + length -= len; + while (len) { + u.wbuf[extra++] = *(u8 *) sge->vaddr; + sge->vaddr++; + sge->length--; + sge->sge_length--; + if (extra >= 4) { + *piobuf++ = u.w; + extra = 0; + } + len--; + } + if (sge->sge_length == 0) { + if (--ss->num_sge) + *sge = *ss->sg_list++; + } else if (sge->length == 0) { + if (++sge->n >= IPATH_SEGSZ) { + if (++sge->m >= sge->mr->mapsz) + break; + sge->n = 0; + } + sge->vaddr = sge->mr->map[sge->m]->segs[sge->n].vaddr; + sge->length = sge->mr->map[sge->m]->segs[sge->n].length; + } + } + if (extra) { + while (extra < 4) + u.wbuf[extra++] = 0; + *piobuf = u.w; + } +} + +/* + * This is like ipath_send_smapkt() in that we need to be able to send + * packets after the chip is initialized (MADs) but also like + * ipath_layer_send() since its used by the verbs layer. + */ +int ipath_verbs_send(const ipath_type device, uint32_t hdrwords, + uint32_t *hdr, uint32_t len, struct ipath_sge_state *ss) +{ + ipath_devdata *dd = &devdata[device]; + int whichpb; + uint32_t *piobuf, plen; + uint64_t pboff; + + if (device >= infinipath_max || + !(dd->ipath_flags & IPATH_PRESENT) || !dd->ipath_kregbase) { + _IPATH_DBG("illegal unit %u\n", device); + return -ENODEV; + } + if (!(dd->ipath_flags & IPATH_INITTED)) { + /* no hardware, freeze, etc. */ + _IPATH_DBG("unit %u not usable\n", device); + return -ENODEV; + } + /* +1 is for the qword padding of pbc */ + plen = hdrwords + ((len + 3) >> 2) + 1; + if ((plen << 2) > dd->ipath_ibmaxlen) { + _IPATH_DBG("packet len 0x%x too long, failing\n", plen); + return -EINVAL; + } + + /* Get a PIO buffer to use. */ + if ((whichpb = ipath_getpiobuf(device)) < 0) + return whichpb; + + pboff = dd->ipath_piobufbase; + piobuf = (uint32_t *) (((char *)(dd->ipath_kregbase)) + pboff + + whichpb * dd->ipath_palign); + _IPATH_EPDBG("0x%x+1w pio%d\n", plen - 1, whichpb); + + /* Write len to control qword, no flags. */ + *((uint64_t *) piobuf) = (uint64_t) plen; + piobuf += 2; + ipath_dwordcpy(piobuf, hdr, hdrwords); + if (len == 0) + return 0; + piobuf += hdrwords; + /* + * If we really wanted to check everything, we would have to + * check that each segment starts on a dword boundary and is + * a dword multiple in length. + * Since there can be lots of segments, we only check for a simple + * common case where the amount to copy is contained in one segment. 
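
/*
 * The length math in ipath_verbs_send() above: payload bytes are
 * rounded up to 32-bit words (the PIO interface is word oriented),
 * and one extra word accounts for the qword-padded PBC.  Illustrative
 * helper:
 */
static uint32_t example_pio_packet_words(uint32_t hdrwords,
					 uint32_t payload_bytes)
{
	return hdrwords + ((payload_bytes + 3) >> 2) + 1;
}
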
+ */ + if (ss->sge.length == len) + copy_aligned(piobuf, ss, len); + else + copy_unaligned(piobuf, ss, len); + return 0; +} + +void ipath_layer_snapshot_counters(const ipath_type device, u64 * swords, + u64 * rwords, u64 * spkts, u64 * rpkts) +{ + if (device >= infinipath_max || + !(devdata[device].ipath_flags & IPATH_PRESENT)) { + _IPATH_DBG("illegal unit %u\n", device); + return; + } + if (!(devdata[device].ipath_flags & IPATH_INITTED)) { + /* no hardware, freeze, etc. */ + _IPATH_DBG("unit %u not usable\n", device); + return; + } + *swords = ipath_snap_cntr(device, cr_wordsendcnt); + *rwords = ipath_snap_cntr(device, cr_wordrcvcnt); + *spkts = ipath_snap_cntr(device, cr_pktsendcnt); + *rpkts = ipath_snap_cntr(device, cr_pktrcvcnt); +} + +/* + * Return the counters needed by recv_pma_get_portcounters(). + */ +void ipath_layer_get_counters(const ipath_type device, + struct ipath_layer_counters *cntrs) +{ + if (device >= infinipath_max || + !(devdata[device].ipath_flags & IPATH_PRESENT)) { + _IPATH_DBG("illegal unit %u\n", device); + return; + } + if (!(devdata[device].ipath_flags & IPATH_INITTED)) { + /* no hardware, freeze, etc. */ + _IPATH_DBG("unit %u not usable\n", device); + return; + } + cntrs->symbol_error_counter = + ipath_snap_cntr(device, cr_ibsymbolerrcnt); + cntrs->link_error_recovery_counter = + ipath_snap_cntr(device, cr_iblinkerrrecovcnt); + cntrs->link_downed_counter = ipath_snap_cntr(device, cr_iblinkdowncnt); + cntrs->port_rcv_errors = ipath_snap_cntr(device, cr_err_rlencnt) + + ipath_snap_cntr(device, cr_invalidrlencnt) + + ipath_snap_cntr(device, cr_erricrccnt) + + ipath_snap_cntr(device, cr_errvcrccnt) + + ipath_snap_cntr(device, cr_badformatcnt); + cntrs->port_rcv_remphys_errors = ipath_snap_cntr(device, cr_rcvebpcnt); + cntrs->port_xmit_discards = ipath_snap_cntr(device, cr_unsupvlcnt); + cntrs->port_xmit_data = ipath_snap_cntr(device, cr_wordsendcnt); + cntrs->port_rcv_data = ipath_snap_cntr(device, cr_wordrcvcnt); + cntrs->port_xmit_packets = ipath_snap_cntr(device, cr_pktsendcnt); + cntrs->port_rcv_packets = ipath_snap_cntr(device, cr_pktrcvcnt); +} + +void ipath_layer_want_buffer(const ipath_type device) +{ + atomic_set_mask(INFINIPATH_S_PIOINTBUFAVAIL, + &devdata[device].ipath_sendctrl); + ipath_kput_kreg(device, kr_sendctrl, devdata[device].ipath_sendctrl); +} + +int ipath_layer_send(const ipath_type device, void *hdr, void *data, + uint32_t datawords) +{ + int ret = 0, whichpb; + uint32_t *piobuf, plen; + uint16_t vlsllnh; + uint64_t pboff; + + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u, failing\n", device); + return -EINVAL; + } + if (!(devdata[device].ipath_flags & IPATH_RCVHDRSZ_SET)) { + _IPATH_DBG("send while not open\n"); + ret = -EINVAL; + } else + if ((devdata[device].ipath_flags & (IPATH_LINKUNK | IPATH_LINKDOWN)) + || devdata[device].ipath_lid == 0) { + /* lid check is for when sma hasn't yet configured */ + ret = -ENETDOWN; + _IPATH_VDBG("send while not ready, mylid=%u, flags=0x%x\n", + devdata[device].ipath_lid, + devdata[device].ipath_flags); + } + /* +1 is for the qword padding of pbc */ + plen = (sizeof(ips_message_header_typ) >> 2) + datawords + 1; + if (plen > (devdata[device].ipath_ibmaxlen >> 2)) { + _IPATH_DBG("packet len 0x%x too long, failing\n", plen); + ret = -EINVAL; + } + vlsllnh = *((uint16_t *) hdr); + if (vlsllnh != htons(IPS_LRH_BTH)) { + _IPATH_DBG("Warning: lrh[0] wrong (%x, not %x); not sending\n", + vlsllnh, htons(IPS_LRH_BTH)); + ret = -EINVAL; + } + if (ret) + goto done; + + /* Get a PIO buffer to use. 
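+ * PIO buffer N lives at ipath_piobufbase + N * ipath_palign past the mapped register base, as the address arithmetic below computes. 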
*/ + if ((whichpb = ipath_getpiobuf(device)) < 0) { + ret = whichpb; + goto done; + } + + pboff = devdata[device].ipath_piobufbase; + piobuf = + (uint32_t *) (((char *)(devdata[device].ipath_kregbase)) + pboff + + whichpb * devdata[device].ipath_palign); + _IPATH_EPDBG("0x%x+1w pio%d\n", plen - 1, whichpb); + + /* len to control qword, no flags */ + *((uint64_t *) piobuf) = (uint64_t) plen; + piobuf += 2; + ipath_dwordcpy(piobuf, hdr, (sizeof(ips_message_header_typ) >> 2)); + piobuf += (sizeof(ips_message_header_typ) >> 2); + ipath_dwordcpy(piobuf, data, datawords); + + ipath_stats.sps_ether_spkts++; /* another ether packet sent */ + +done: + return ret; +} + +void ipath_layer_set_piointbufavail_int(const ipath_type device) +{ + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return; + } + + atomic_set_mask(INFINIPATH_S_PIOINTBUFAVAIL, + &devdata[device].ipath_sendctrl); + + ipath_kput_kreg(device, kr_sendctrl, devdata[device].ipath_sendctrl); +} + +void ipath_layer_enable_timer(const ipath_type device) +{ + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return; + } + + /* + * HT-400 has a design flaw where the chip and kernel idea + * of the tail register don't always agree, and therefore we won't + * get an interrupt on the next packet received. + * If the board supports per packet receive interrupts, use it. + * Otherwise, the timer function periodically checks for packets + * to cover this case. + * Either way, the timer is needed for verbs layer related + * processing. + */ + if (devdata[device].ipath_flags & IPATH_GPIO_INTR) { + ipath_kput_kreg(device, kr_debugportselect, 0x2074076542310UL); + /* Enable GPIO bit 2 interrupt */ + ipath_kput_kreg(device, kr_gpio_mask, (uint64_t)(1 << 2)); + } + + init_timer(&devdata[device].verbs_layer.l_timer); + devdata[device].verbs_layer.l_timer.function = ipath_verbs_timer; + devdata[device].verbs_layer.l_timer.data = (unsigned long)device; + devdata[device].verbs_layer.l_timer.expires = jiffies + 1; + add_timer(&devdata[device].verbs_layer.l_timer); +} + +void ipath_layer_disable_timer(const ipath_type device) +{ + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return; + } + + /* Disable GPIO bit 2 interrupt */ + if (devdata[device].ipath_flags & IPATH_GPIO_INTR) + ipath_kput_kreg(device, kr_gpio_mask, 0); + + del_timer_sync(&devdata[device].verbs_layer.l_timer); +} + +/* + * Get the verbs layer flags. + */ +unsigned ipath_verbs_get_flags(const ipath_type device) +{ + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return 0; + } + + return devdata[device].verbs_layer.l_flags; +} + +/* + * Set the verbs layer flags. + */ +void ipath_verbs_set_flags(const ipath_type device, unsigned flags) +{ + ipath_type s; + + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return; + } + + devdata[device].verbs_layer.l_flags = flags; + + for (s = 0; s < infinipath_max; s++) { + if (!(devdata[s].ipath_flags & IPATH_INITTED)) + continue; + if ((flags & IPATH_VERBS_KERNEL_SMA) && + !(*devdata[s].ipath_statusp & IPATH_STATUS_SMA)) { + *devdata[s].ipath_statusp |= IPATH_STATUS_OIB_SMA; + } else { + *devdata[s].ipath_statusp &= ~IPATH_STATUS_OIB_SMA; + } + } +} + +/* + * Return the size of the PKEY table for port 0. 
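+ * The per-port PKEY table is a fixed-size array, so this is simply the ARRAY_SIZE of the port 0 pkey array. 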
+ */ +unsigned ipath_layer_get_npkeys(const ipath_type device) +{ + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return 0; + } + + return ARRAY_SIZE(devdata[device].ipath_pd[0]->port_pkeys); +} + +/* + * Return the indexed PKEY from the port 0 PKEY table. + */ +unsigned ipath_layer_get_pkey(const ipath_type device, unsigned index) +{ + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return 0; + } + if (index >= ARRAY_SIZE(devdata[device].ipath_pd[0]->port_pkeys)) + return 0; + + return devdata[device].ipath_pd[0]->port_pkeys[index]; +} + +/* + * Return the PKEY table for port 0. + */ +void ipath_layer_get_pkeys(const ipath_type device, uint16_t *pkeys) +{ + struct _ipath_portdata *pd; + + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return; + } + + pd = devdata[device].ipath_pd[0]; + memcpy(pkeys, pd->port_pkeys, sizeof(pd->port_pkeys)); +} + +/* + * Decrement the reference count for the given PKEY. + * Return true if this was the last reference and the hardware table entry + * needs to be changed. + */ +static inline int rm_pkey(ipath_devdata *dd, uint16_t key) +{ + int i; + + for (i = 0; i < ARRAY_SIZE(dd->ipath_pkeys); i++) { + if (dd->ipath_pkeys[i] != key) + continue; + if (atomic_dec_and_test(&dd->ipath_pkeyrefs[i])) { + dd->ipath_pkeys[i] = 0; + return 1; + } + break; + } + return 0; +} + +/* + * Add the given PKEY to the hardware table. + * Return an error code if unable to add the entry, zero if no change, + * or 1 if the hardware PKEY register needs to be updated. + */ +static inline int add_pkey(ipath_devdata *dd, uint16_t key) +{ + int i; + uint16_t lkey = key & 0x7FFF; + int any = 0; + + for (i = 0; i < ARRAY_SIZE(dd->ipath_pkeys); i++) { + if (!dd->ipath_pkeys[i]) { + any++; + continue; + } + /* If it matches exactly, try to increment the ref count */ + if (dd->ipath_pkeys[i] == key) { + if (atomic_inc_return(&dd->ipath_pkeyrefs[i]) > 1) + return 0; + /* Lost the race. Look for an empty slot below. */ + atomic_dec(&dd->ipath_pkeyrefs[i]); + any++; + } + /* + * It makes no sense to have both the limited and unlimited + * PKEY set at the same time since the unlimited one will + * disable the limited one. + */ + if ((dd->ipath_pkeys[i] & 0x7FFF) == lkey) + return -EEXIST; + } + if (!any) + return -EBUSY; + for (i = 0; i < ARRAY_SIZE(dd->ipath_pkeys); i++) { + if (!dd->ipath_pkeys[i] && + atomic_inc_return(&dd->ipath_pkeyrefs[i]) == 1) { + /* for ipathstats, etc. */ + ipath_stats.sps_pkeys[i] = lkey; + dd->ipath_pkeys[i] = key; + return 1; + } + } + return -EBUSY; +} + +/* + * Set the PKEY table for port 0. + */ +int ipath_layer_set_pkeys(const ipath_type device, uint16_t *pkeys) +{ + ipath_portdata *pd; + ipath_devdata *dd; + int i; + int changed = 0; + + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return -EINVAL; + } + + dd = &devdata[device]; + pd = dd->ipath_pd[0]; + + for (i = 0; i < ARRAY_SIZE(pd->port_pkeys); i++) { + uint16_t key = pkeys[i]; + uint16_t okey = pd->port_pkeys[i]; + + if (key == okey) + continue; + /* + * The value of this PKEY table entry is changing. + * Remove the old entry in the hardware's array of PKEYs. 
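+ * Bit 15 of a PKEY is the membership bit (1 = full, 0 = limited) and the low 15 bits name the partition, which is why rm_pkey() and add_pkey() compare keys masked with 0x7FFF. 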
+ */ + if (okey & 0x7FFF) + changed |= rm_pkey(dd, okey); + if (key & 0x7FFF) { + int ret = add_pkey(dd, key); + + if (ret < 0) + key = 0; + else + changed |= ret; + } + pd->port_pkeys[i] = key; + } + if (changed) { + uint64_t pkey; + + pkey = (uint64_t) dd->ipath_pkeys[0] | + ((uint64_t) dd->ipath_pkeys[1] << 16) | + ((uint64_t) dd->ipath_pkeys[2] << 32) | + ((uint64_t) dd->ipath_pkeys[3] << 48); + _IPATH_VDBG("p0 new pkey reg %llx\n", pkey); + ipath_kput_kreg(pd->port_unit, kr_partitionkey, pkey); + } + return 0; +} + +/* + * Registers that vary with the chip implementation constants (port) + * use this routine. + */ +uint64_t ipath_kget_kreg64_port(const ipath_type stype, ipath_kreg regno, + unsigned port) +{ + ipath_kreg tmp = + (port < devdata[stype].ipath_portcnt && regno == kr_rcvhdraddr) ? + regno + port : + ((port < devdata[stype].ipath_portcnt + && regno == kr_rcvhdrtailaddr) ? regno + port : __kr_invalid); + return ipath_kget_kreg64(stype, tmp); +} + +/* + * Registers that vary with the chip implementation constants (port) + * use this routine. + */ +void ipath_kput_kreg_port(const ipath_type stype, ipath_kreg regno, + unsigned port, uint64_t value) +{ + ipath_kreg tmp = + (port < devdata[stype].ipath_portcnt && regno == kr_rcvhdraddr) ? + regno + port : + ((port < devdata[stype].ipath_portcnt + && regno == kr_rcvhdrtailaddr) ? regno + port : __kr_invalid); + ipath_kput_kreg(stype, tmp, value); +} + +EXPORT_SYMBOL(ipath_kset_linkstate); +EXPORT_SYMBOL(ipath_kset_mtu); +EXPORT_SYMBOL(ipath_layer_close); +EXPORT_SYMBOL(ipath_layer_get_bcast); +EXPORT_SYMBOL(ipath_layer_get_cr_errpkey); +EXPORT_SYMBOL(ipath_layer_get_deviceid); +EXPORT_SYMBOL(ipath_layer_get_flags); +EXPORT_SYMBOL(ipath_layer_get_guid); +EXPORT_SYMBOL(ipath_layer_get_ibmtu); +EXPORT_SYMBOL(ipath_layer_get_lastibcstat); +EXPORT_SYMBOL(ipath_layer_get_lid); +EXPORT_SYMBOL(ipath_layer_get_mac); +EXPORT_SYMBOL(ipath_layer_get_nguid); +EXPORT_SYMBOL(ipath_layer_get_num_of_dev); +EXPORT_SYMBOL(ipath_layer_get_pcidev); +EXPORT_SYMBOL(ipath_layer_open); +EXPORT_SYMBOL(ipath_layer_query_device); +EXPORT_SYMBOL(ipath_layer_register); +EXPORT_SYMBOL(ipath_layer_send); +EXPORT_SYMBOL(ipath_layer_set_guid); +EXPORT_SYMBOL(ipath_layer_set_piointbufavail_int); +EXPORT_SYMBOL(ipath_layer_snapshot_counters); +EXPORT_SYMBOL(ipath_layer_get_counters); +EXPORT_SYMBOL(ipath_layer_want_buffer); +EXPORT_SYMBOL(ipath_verbs_register); +EXPORT_SYMBOL(ipath_verbs_send); +EXPORT_SYMBOL(ipath_verbs_unregister); +EXPORT_SYMBOL(ipath_set_sps_lid); +EXPORT_SYMBOL(ipath_layer_enable_timer); +EXPORT_SYMBOL(ipath_layer_disable_timer); +EXPORT_SYMBOL(ipath_verbs_get_flags); +EXPORT_SYMBOL(ipath_verbs_set_flags); +EXPORT_SYMBOL(ipath_layer_get_npkeys); +EXPORT_SYMBOL(ipath_layer_get_pkey); +EXPORT_SYMBOL(ipath_layer_get_pkeys); +EXPORT_SYMBOL(ipath_layer_set_pkeys); -- 0.99.9n From rolandd at cisco.com Fri Dec 16 15:48:55 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 16 Dec 2005 15:48:55 -0800 Subject: [openib-general] [PATCH 09/13] [RFC] ipath IB driver headers In-Reply-To: <200512161548.y9KRuNtfMzpZjwni@cisco.com> Message-ID: <200512161548.zxp6FKcabEu47EnS@cisco.com> Headers for ipath IB driver --- drivers/infiniband/hw/ipath/ipath_verbs.h | 527 +++++++++++++++++++++++++++++ drivers/infiniband/hw/ipath/verbs_debug.h | 104 ++++++ 2 files changed, 631 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/hw/ipath/ipath_verbs.h create mode 100644 drivers/infiniband/hw/ipath/verbs_debug.h 
8b106f1a0a6cb02f702c4ef957acad1bf8225c7d diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h new file mode 100644 index 0000000..4a4c65a --- /dev/null +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h @@ -0,0 +1,527 @@ +/* + * Copyright (c) 2005. PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. + * + * $Id: ipath_verbs.h 4504 2005-12-16 06:15:47Z rjwalsh $ + */ + +#ifndef IPATH_VERBS_H +#define IPATH_VERBS_H + +#include +#include +#include +#include +#include + +#include "ipath_kernel.h" +#include "verbs_debug.h" + +#define CTL_IPATH_VERBS 0x70736e68 /* "spin" as a hex value, top level */ +#define CTL_IPATH_VERBS_FAULT 1 +#define CTL_IPATH_VERBS_DEBUG 2 + +/* + * Increment this value if any changes that break userspace ABI + * compatibility are made. + */ +#define IPATH_UVERBS_ABI_VERSION 1 + +/* + * Define an ib_cq_notify value that is not valid so we know when CQ + * notifications are armed. + */ +#define IB_CQ_NONE (IB_CQ_NEXT_COMP + 1) + +enum { + IB_RNR_NAK = 0x20, + + IB_NAK_PSN_ERROR = 0x60, + IB_NAK_INVALID_REQUEST = 0x61, + IB_NAK_REMOTE_ACCESS_ERROR = 0x62, + IB_NAK_REMOTE_OPERATIONAL_ERROR = 0x63, + IB_NAK_INVALID_RD_REQUEST = 0x64 +}; + +/* IB Performance Manager status values */ +enum { + IB_PMA_SAMPLE_STATUS_DONE = 0x00, + IB_PMA_SAMPLE_STATUS_STARTED = 0x01, + IB_PMA_SAMPLE_STATUS_RUNNING = 0x02 +}; + +/* Mandatory IB performance counter select values. 
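+ * These are the CounterSelect codes from the Performance Management class's PortSamplesControl attribute, kept in network byte order (hence __constant_htons) so they compare directly against fields of an incoming MAD. 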
*/ +#define IB_PMA_PORT_XMIT_DATA __constant_htons(0x0001) +#define IB_PMA_PORT_RCV_DATA __constant_htons(0x0002) +#define IB_PMA_PORT_XMIT_PKTS __constant_htons(0x0003) +#define IB_PMA_PORT_RCV_PKTS __constant_htons(0x0004) +#define IB_PMA_PORT_XMIT_WAIT __constant_htons(0x0005) + +struct ib_reth { + u64 vaddr; + u32 rkey; + u32 length; +} __attribute__ ((packed)); + +struct ib_atomic_eth { + u64 vaddr; + u32 rkey; + u64 swap_data; + u64 compare_data; +} __attribute__ ((packed)); + +struct ipath_other_headers { + uint32_t bth[3]; + union { + struct { + uint32_t deth[2]; + uint32_t imm_data; + } ud; + struct { + struct ib_reth reth; + uint32_t imm_data; + } rc; + struct { + uint32_t aeth; + uint64_t atomic_ack_eth; + } at; + uint32_t imm_data; + uint32_t aeth; + struct ib_atomic_eth atomic_eth; + } u; +} __attribute__ ((packed)); + +/* + * Note that UD packets with a GRH header are 8+40+12+8 = 68 bytes long + * (72 w/ imm_data). + * Only the first 56 bytes of the IB header will be in the + * eager header buffer. The remaining 12 or 16 bytes are in the data buffer. + */ +struct ipath_ib_header { + uint16_t lrh[4]; + union { + struct { + struct ib_grh grh; + struct ipath_other_headers oth; + } l; + struct ipath_other_headers oth; + } u; +} __attribute__ ((packed)); + +/* + * There is one struct ipath_mcast for each multicast GID. + * All attached QPs are then stored as a list of + * struct ipath_mcast_qp. + */ +struct ipath_mcast_qp { + struct list_head list; + struct ipath_qp *qp; +}; + +struct ipath_mcast { + struct rb_node rb_node; + union ib_gid mgid; + struct list_head qp_list; + wait_queue_head_t wait; + atomic_t refcount; +}; + +/* Memory region */ +struct ipath_mr { + struct ib_mr ibmr; + struct ipath_mregion mr; /* must be last */ +}; + +/* Fast memory region */ +struct ipath_fmr { + struct ib_fmr ibfmr; + u8 page_size; + struct ipath_mregion mr; /* must be last */ +}; + +/* Protection domain */ +struct ipath_pd { + struct ib_pd ibpd; + int user; /* non-zero if created from user space */ +}; + +/* Address Handle */ +struct ipath_ah { + struct ib_ah ibah; + struct ib_ah_attr attr; +}; + +/* + * Quick description of our CQ/QP locking scheme: + * + * We have one global lock that protects dev->cq/qp_table. Each + * struct ipath_cq/qp also has its own lock. An individual qp lock + * may be taken inside of an individual cq lock. Both cqs attached to + * a qp may be locked, with the send cq locked first. No other + * nesting should be done. + * + * Each struct ipath_cq/qp also has an atomic_t ref count. The + * pointer from the cq/qp_table to the struct counts as one reference. + * This reference also is good for access through the consumer API, so + * modifying the CQ/QP etc doesn't need to take another reference. + * Access because of a completion being polled does need a reference. + * + * Finally, each struct ipath_cq/qp has a wait_queue_head_t for the + * destroy function to sleep on. + * + * This means that access from the consumer API requires nothing but + * taking the struct's lock. 
+ * + * Access because of a completion event should go as follows: + * - lock cq/qp_table and look up struct + * - increment ref count in struct + * - drop cq/qp_table lock + * - lock struct, do your thing, and unlock struct + * - decrement ref count; if zero, wake up waiters + * + * To destroy a CQ/QP, we can do the following: + * - lock cq/qp_table, remove pointer, unlock cq/qp_table lock + * - decrement ref count + * - wait_event until ref count is zero + * + * It is the consumer's responsibility to make sure that no QP + * operations (WQE posting or state modification) are pending when the + * QP is destroyed. Also, the consumer must make sure that calls to + * qp_modify are serialized. + * + * Possible optimizations (wait for profile data to see if/where we + * have locks bouncing between CPUs): + * - split cq/qp table lock into n separate (cache-aligned) locks, + * indexed (say) by the page in the table + */ + +struct ipath_cq { + struct ib_cq ibcq; + struct tasklet_struct comptask; + spinlock_t lock; + u8 notify; + u8 triggered; + u32 head; /* new records added to the head */ + u32 tail; /* poll_cq() reads from here. */ + struct ib_wc queue[1]; /* this is actually ibcq.cqe + 1 */ +}; + +/* + * Send work request queue entry. + * The size of the sg_list is determined when the QP is created and stored + * in qp->s_max_sge. + */ +struct ipath_swqe { + struct ib_send_wr wr; /* don't use wr.sg_list */ + u32 psn; /* first packet sequence number */ + u32 lpsn; /* last packet sequence number */ + u32 ssn; /* send sequence number */ + u32 length; /* total length of data in sg_list */ + struct ipath_sge sg_list[0]; +}; + +/* + * Receive work request queue entry. + * The size of the sg_list is determined when the QP is created and stored + * in qp->r_max_sge. + */ +struct ipath_rwqe { + u64 wr_id; + u32 length; /* total length of data in sg_list */ + u8 num_sge; + struct ipath_sge sg_list[0]; +}; + +struct ipath_rq { + spinlock_t lock; + u32 head; /* new work requests posted to the head */ + u32 tail; /* receives pull requests from here. */ + u32 size; /* size of RWQE array */ + u8 max_sge; + struct ipath_rwqe *wq; /* RWQE array */ +}; + +struct ipath_srq { + struct ib_srq ibsrq; + struct ipath_rq rq; + u32 limit; /* send signal when number of RWQEs < limit */ +}; + +/* + * Variables prefixed with s_ are for the requester (sender). + * Variables prefixed with r_ are for the responder (receiver). + * Variables prefixed with ack_ are for responder replies. + * + * Common variables are protected by both r_rq.lock and s_lock in that order, + * which only happens in modify_qp() or changing the QP 'state'. 
+ */ +struct ipath_qp { + struct ib_qp ibqp; + struct ipath_qp *next; /* link list for QPN hash table */ + struct list_head piowait; /* link for wait PIO buf */ + struct list_head timerwait; /* link for waiting for timeouts */ + struct ib_ah_attr remote_ah_attr; + struct ipath_ib_header s_hdr; /* next packet header to send */ + atomic_t refcount; + wait_queue_head_t wait; + struct tasklet_struct s_task; + struct ipath_sge_state *s_cur_sge; + struct ipath_sge_state s_sge; /* current send request data */ + struct ipath_sge_state s_rdma_sge; /* current RDMA read send data */ + struct ipath_sge_state r_sge; /* current receive data */ + spinlock_t s_lock; + int s_flags; + u32 s_hdrwords; /* size of s_hdr in 32 bit words */ + u32 s_cur_size; /* size of send packet in bytes */ + u32 s_len; /* total length of s_sge */ + u32 s_rdma_len; /* total length of s_rdma_sge */ + u32 s_next_psn; /* PSN for next request */ + u32 s_last_psn; /* last response PSN processed */ + u32 s_psn; /* current packet sequence number */ + u32 s_rnr_timeout; /* number of milliseconds for RNR timeout */ + u32 s_ack_psn; /* PSN for next ACK or RDMA_READ */ + u64 s_ack_atomic; /* data for atomic ACK */ + u64 r_wr_id; /* ID for current receive WQE */ + u64 r_atomic_data; /* data for last atomic op */ + u32 r_atomic_psn; /* PSN of last atomic op */ + u32 r_len; /* total length of r_sge */ + u32 r_rcv_len; /* receive data len processed */ + u32 r_psn; /* expected rcv packet sequence number */ + u8 state; /* QP state */ + u8 s_state; /* opcode of last packet sent */ + u8 s_ack_state; /* opcode of packet to ACK */ + u8 s_nak_state; /* non-zero if NAK is pending */ + u8 r_state; /* opcode of last packet received */ + u8 r_reuse_sge; /* for UC receive errors */ + u8 r_sge_inx; /* current index into sg_list */ + u8 s_max_sge; /* size of s_wq->sg_list */ + u8 qp_access_flags; + u8 s_retry_cnt; /* number of times to retry */ + u8 s_rnr_retry_cnt; + u8 s_min_rnr_timer; + u8 s_retry; /* requester retry counter */ + u8 s_rnr_retry; /* requester RNR retry counter */ + u8 s_pkey_index; /* PKEY index to use */ + enum ib_mtu path_mtu; + atomic_t msn; /* message sequence number */ + u32 remote_qpn; + u32 qkey; /* QKEY for this QP (for UD or RD) */ + u32 s_size; /* send work queue size */ + u32 s_head; /* new entries added here */ + u32 s_tail; /* next entry to process */ + u32 s_cur; /* current work queue entry */ + u32 s_last; /* last un-ACK'ed entry */ + u32 s_ssn; /* SSN of tail entry */ + u32 s_lsn; /* limit sequence number (credit) */ + struct ipath_swqe *s_wq; /* send work queue */ + struct ipath_rq r_rq; /* receive work queue */ +}; + +/* + * Bit definitions for s_flags. + */ +#define IPATH_S_BUSY 0 +#define IPATH_S_SIGNAL_REQ_WR 1 + +/* + * Since struct ipath_swqe is not a fixed size, we can't simply index into + * struct ipath_qp.s_wq. This function does the array index computation. + */ +static inline struct ipath_swqe *get_swqe_ptr(struct ipath_qp *qp, unsigned n) +{ + return (struct ipath_swqe *)((char *) qp->s_wq + + (sizeof(struct ipath_swqe) + + qp->s_max_sge * sizeof(struct ipath_sge)) * n); +} + +/* + * Since struct ipath_rwqe is not a fixed size, we can't simply index into + * struct ipath_rq.wq. This function does the array index computation. 
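+ * For example, with max_sge == 4 each entry occupies sizeof(struct ipath_rwqe) + 4 * sizeof(struct ipath_sge) bytes, and entry n begins n such strides past rq->wq. 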
+ */ +static inline struct ipath_rwqe *get_rwqe_ptr(struct ipath_rq *rq, unsigned n) +{ + return (struct ipath_rwqe *)((char *) rq->wq + + (sizeof(struct ipath_rwqe) + + rq->max_sge * sizeof(struct ipath_sge)) * n); +} + +#define QPN_MAX (1 << 24) +#define QPNMAP_ENTRIES (QPN_MAX / PAGE_SIZE / 8) +#define BITS_PER_PAGE (PAGE_SIZE*8) +#define BITS_PER_PAGE_MASK (BITS_PER_PAGE-1) +#define mk_qpn(qpt, map, off) (((map) - (qpt)->map)*BITS_PER_PAGE + (off)) +#define find_next_offset(map, off) \ + find_next_zero_bit((map)->page, BITS_PER_PAGE, off) + +/* + * QPN-map pages start out as NULL, they get allocated upon + * first use and are never deallocated. This way, + * large bitmaps are not allocated unless large numbers of QPs are used. + */ +struct qpn_map { + atomic_t n_free; + void *page; +}; + +struct ipath_qp_table { + spinlock_t lock; + u32 last; /* last QP number allocated */ + u32 max; /* size of the hash table */ + u32 nmaps; /* size of the map table */ + struct ipath_qp **table; + struct qpn_map map[QPNMAP_ENTRIES]; /* bit map of free numbers */ +}; + +struct ipath_lkey_table { + spinlock_t lock; + u32 next; /* next unused index (speeds search) */ + u32 gen; /* generation count */ + u32 max; /* size of the table */ + struct ipath_mregion **table; +}; + +struct ipath_opcode_stats { + u64 n_packets; /* number of packets */ + u64 n_bytes; /* total number of bytes */ +}; + +struct ipath_ibdev { + struct ib_device ibdev; + ipath_type ib_unit; /* This is the device number */ + u16 sm_lid; /* in host order */ + u8 sm_sl; + u8 mkeyprot_resv_lmc; + unsigned long mkey_lease_timeout; /* non-zero when timer is set */ + + /* The following fields are really per port. */ + struct ipath_qp_table qp_table; + struct ipath_lkey_table lk_table; + struct list_head pending[3]; /* FIFO of QPs waiting for ACKs */ + struct list_head piowait; /* list for wait PIO buf */ + struct list_head rnrwait; /* list of QPs waiting for RNR timer */ + spinlock_t pending_lock; + __be64 sys_image_guid; /* in network order */ + __be64 gid_prefix; /* in network order */ + __be64 mkey; + u64 ipath_sword; /* total dwords sent (sample result) */ + u64 ipath_rword; /* total dwords received (sample result) */ + u64 ipath_spkts; /* total packets sent (sample result) */ + u64 ipath_rpkts; /* total packets received (sample result) */ + u64 n_multicast_xmit; /* total multicast packets sent */ + u64 n_multicast_rcv; /* total multicast packets received */ + u32 n_rc_resends; + u32 n_rc_acks; + u32 n_rc_qacks; + u32 n_seq_naks; + u32 n_rdma_seq; + u32 n_rnr_naks; + u32 n_other_naks; + u32 n_timeouts; + u32 n_pkt_drops; + u32 n_wqe_errs; + u32 n_rdma_dup_busy; + u32 n_piowait; + u32 n_no_piobuf; + u32 port_cap_flags; + u32 pma_sample_start; + u32 pma_sample_interval; + __be16 pma_counter_select[5]; + u16 pma_tag; + u16 qkey_violations; + u16 mkey_violations; + u16 mkey_lease_period; + u16 pending_index; /* which pending queue is active */ + u8 pma_sample_status; + u8 subnet_timeout; + struct ipath_opcode_stats opstats[128]; +}; + +struct ipath_ucontext { + struct ib_ucontext ibucontext; +}; + +static inline struct ipath_mr *to_imr(struct ib_mr *ibmr) +{ + return container_of(ibmr, struct ipath_mr, ibmr); +} + +static inline struct ipath_fmr *to_ifmr(struct ib_fmr *ibfmr) +{ + return container_of(ibfmr, struct ipath_fmr, ibfmr); +} + +static inline struct ipath_pd *to_ipd(struct ib_pd *ibpd) +{ + return container_of(ibpd, struct ipath_pd, ibpd); +} + +static inline struct ipath_ah *to_iah(struct ib_ah *ibah) +{ + return container_of(ibah, 
struct ipath_ah, ibah); +} + +static inline struct ipath_cq *to_icq(struct ib_cq *ibcq) +{ + return container_of(ibcq, struct ipath_cq, ibcq); +} + +static inline struct ipath_srq *to_isrq(struct ib_srq *ibsrq) +{ + return container_of(ibsrq, struct ipath_srq, ibsrq); +} + +static inline struct ipath_qp *to_iqp(struct ib_qp *ibqp) +{ + return container_of(ibqp, struct ipath_qp, ibqp); +} + +static inline struct ipath_ibdev *to_idev(struct ib_device *ibdev) +{ + return container_of(ibdev, struct ipath_ibdev, ibdev); +} + +int ipath_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + struct ib_wc *in_wc, + struct ib_grh *in_grh, + struct ib_mad *in_mad, struct ib_mad *out_mad); + +static inline struct ipath_ucontext *to_iucontext(struct ib_ucontext + *ibucontext) +{ + return container_of(ibucontext, struct ipath_ucontext, ibucontext); +} + +#endif /* IPATH_VERBS_H */ diff --git a/drivers/infiniband/hw/ipath/verbs_debug.h b/drivers/infiniband/hw/ipath/verbs_debug.h new file mode 100644 index 0000000..21e0f8c --- /dev/null +++ b/drivers/infiniband/hw/ipath/verbs_debug.h @@ -0,0 +1,104 @@ +/* + * Copyright (c) 2003, 2004, 2005. PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. + * + * $Id: verbs_debug.h 4365 2005-12-10 00:04:16Z rjwalsh $ + */ + +#ifndef _VERBS_DEBUG_H +#define _VERBS_DEBUG_H + +/* + * This file contains tracing code for the ib_ipath kernel module. + */ +#ifndef _VERBS_DEBUGGING /* tracing enabled or not */ +#define _VERBS_DEBUGGING 1 +#endif + +extern unsigned ib_ipath_debug; + +#define __VERBS_ERRID "ib_ipath" +#define __VERBS_UNIT_ERRID(unit) ipath_get_unit_name(unit) + +#define _VERBS_ERROR(fmt,...) do { \ + printk (KERN_ERR "%s: " fmt, __VERBS_ERRID,##__VA_ARGS__); \ + } while(0) + +#define _VERBS_UNIT_ERROR(unit,fmt,...) do { \ + printk (KERN_ERR "%s: " fmt, __VERBS_UNIT_ERRID(unit),##__VA_ARGS__); \ + } while(0) + +#if _VERBS_DEBUGGING + +/* + * Mask values for debugging. 
The scheme allows us to compile out any of + * the debug tracing stuff, and if compiled in, to enable or disable it dynamically. + * This can be set at modprobe time also: + * modprobe ib_ipath ib_ipath_debug=3 + */ +#define __VERBS_INFO 0x1 /* generic low verbosity stuff */ +#define __VERBS_DBG 0x2 /* generic debug */ +#define __VERBS_VDBG 0x4 /* verbose debug */ +#define __VERBS_SMADBG 0x8000 /* sma packet debug */ + +#define _VERBS_INFO(fmt,...) do { \ + if(unlikely(ib_ipath_debug&__VERBS_INFO)) \ + printk (KERN_INFO "%s: " fmt,__VERBS_ERRID,##__VA_ARGS__); \ + } while(0) + +#define _VERBS_DBG(fmt,...) do { \ + if(unlikely(ib_ipath_debug&__VERBS_DBG)) \ + printk (KERN_DEBUG "%s: " fmt, __func__,##__VA_ARGS__); \ + } while(0) + +#define _VERBS_VDBG(fmt,...) do { \ + if(unlikely(ib_ipath_debug&__VERBS_VDBG)) \ + printk (KERN_DEBUG "%s: " fmt, __func__,##__VA_ARGS__); \ + } while(0) + +#define _VERBS_SMADBG(fmt,...) do { \ + if(unlikely(ib_ipath_debug&__VERBS_SMADBG)) \ + printk (KERN_DEBUG "%s: " fmt, __func__,##__VA_ARGS__); \ + } while(0) + +#else /* ! _VERBS_DEBUGGING */ + +#define _VERBS_INFO(fmt,...) +#define _VERBS_DBG(fmt,...) +#define _VERBS_VDBG(fmt,...) +#define _VERBS_SMADBG(fmt,...) + +#endif /* _VERBS_DEBUGGING */ + +#endif /* _VERBS_DEBUG_H */ -- 0.99.9n From rolandd at cisco.com Fri Dec 16 15:48:55 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 16 Dec 2005 15:48:55 -0800 Subject: [openib-general] [PATCH 07/13] [RFC] ipath core misc files In-Reply-To: <200512161548.KglSM2YESlGlEQfQ@cisco.com> Message-ID: <200512161548.3fqe3fMerrheBMdX@cisco.com> Misc ipath LLD files --- drivers/infiniband/hw/ipath/ipath_ht400.c | 1164 +++++++++++++++++++++++++++++ drivers/infiniband/hw/ipath/ipath_i2c.c | 472 ++++++++++++ drivers/infiniband/hw/ipath/ipath_lib.c | 92 ++ drivers/infiniband/hw/ipath/ipath_mlock.c | 139 +++ 4 files changed, 1867 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/hw/ipath/ipath_ht400.c create mode 100644 drivers/infiniband/hw/ipath/ipath_i2c.c create mode 100644 drivers/infiniband/hw/ipath/ipath_lib.c create mode 100644 drivers/infiniband/hw/ipath/ipath_mlock.c 8d0f07cdb6b7e4f243e31f5ac9fddb4225908062 diff --git a/drivers/infiniband/hw/ipath/ipath_ht400.c b/drivers/infiniband/hw/ipath/ipath_ht400.c new file mode 100644 index 0000000..2d5b795 --- /dev/null +++ b/drivers/infiniband/hw/ipath/ipath_ht400.c @@ -0,0 +1,1164 @@ +/* + * Copyright (c) 2003, 2004, 2005. PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. + * + * $Id: ipath_ht400.c 4491 2005-12-15 22:20:31Z rjwalsh $ + */ + +/* + * The first part of this file is shared with the diags, the second + * part is used only in the kernel. + */ + +#include /* for offsetof */ + +#ifdef __KERNEL__ +#include +#include +#include +#include +#include "ipath_kernel.h" +#else +#include +#include +#endif + +#include "ipath_registers.h" +#include "ipath_common.h" + +/* + * This lists the InfiniPath registers, in the actual chip layout. This + * structure should never be directly accessed. It is included by the + * user mode diags, and so must be able to be compiled in both user + * and kernel mode. + */ +struct _infinipath_do_not_use_kernel_regs { + unsigned long long Revision; + unsigned long long Control; + unsigned long long PageAlign; + unsigned long long PortCnt; + unsigned long long DebugPortSelect; + unsigned long long DebugPort; + unsigned long long SendRegBase; + unsigned long long UserRegBase; + unsigned long long CounterRegBase; + unsigned long long Scratch; + unsigned long long ReservedMisc1; + unsigned long long InterruptConfig; + unsigned long long IntBlocked; + unsigned long long IntMask; + unsigned long long IntStatus; + unsigned long long IntClear; + unsigned long long ErrorMask; + unsigned long long ErrorStatus; + unsigned long long ErrorClear; + unsigned long long HwErrMask; + unsigned long long HwErrStatus; + unsigned long long HwErrClear; + unsigned long long HwDiagCtrl; + unsigned long long MDIO; + unsigned long long IBCStatus; + unsigned long long IBCCtrl; + unsigned long long ExtStatus; + unsigned long long ExtCtrl; + unsigned long long GPIOOut; + unsigned long long GPIOMask; + unsigned long long GPIOStatus; + unsigned long long GPIOClear; + unsigned long long RcvCtrl; + unsigned long long RcvBTHQP; + unsigned long long RcvHdrSize; + unsigned long long RcvHdrCnt; + unsigned long long RcvHdrEntSize; + unsigned long long RcvTIDBase; + unsigned long long RcvTIDCnt; + unsigned long long RcvEgrBase; + unsigned long long RcvEgrCnt; + unsigned long long RcvBufBase; + unsigned long long RcvBufSize; + unsigned long long RxIntMemBase; + unsigned long long RxIntMemSize; + unsigned long long RcvPartitionKey; + unsigned long long ReservedRcv[10]; + unsigned long long SendCtrl; + unsigned long long SendPIOBufBase; + unsigned long long SendPIOSize; + unsigned long long SendPIOBufCnt; + unsigned long long SendPIOAvailAddr; + unsigned long long TxIntMemBase; + unsigned long long TxIntMemSize; + unsigned long long ReservedSend[9]; + unsigned long long SendBufferError; + unsigned long long SendBufferErrorCONT1; + unsigned long long SendBufferErrorCONT2; + unsigned long long SendBufferErrorCONT3; + unsigned long long ReservedSBE[4]; + unsigned long long RcvHdrAddr0; + unsigned long long RcvHdrAddr1; + unsigned long long RcvHdrAddr2; + unsigned long long RcvHdrAddr3; + unsigned long long RcvHdrAddr4; + unsigned long long RcvHdrAddr5; + unsigned long long RcvHdrAddr6; + unsigned long long RcvHdrAddr7; + unsigned long long RcvHdrAddr8; + unsigned long long ReservedRHA[7]; + unsigned long long RcvHdrTailAddr0; + unsigned long long 
RcvHdrTailAddr1; + unsigned long long RcvHdrTailAddr2; + unsigned long long RcvHdrTailAddr3; + unsigned long long RcvHdrTailAddr4; + unsigned long long RcvHdrTailAddr5; + unsigned long long RcvHdrTailAddr6; + unsigned long long RcvHdrTailAddr7; + unsigned long long RcvHdrTailAddr8; + unsigned long long ReservedRHTA[7]; + unsigned long long Sync; /* Software only */ + unsigned long long Dump; /* Software only */ + unsigned long long SimVer; /* Software only */ + unsigned long long ReservedSW[5]; + unsigned long long SerdesConfig0; + unsigned long long SerdesConfig1; + unsigned long long SerdesStatus; + unsigned long long XGXSConfig; + unsigned long long ReservedSW2[4]; +}; + +#ifdef __KERNEL__ /* kernel uses reg#; diags use offset in bytes, not reg # */ +#define IPATH_KREG_OFFSET(field) (offsetof(struct \ + _infinipath_do_not_use_kernel_regs, field) / sizeof(uint64_t)) +#define IPATH_CREG_OFFSET(field) (offsetof( \ + struct infinipath_counters, field) / sizeof(uint64_t)) +#else /* diags */ +#define IPATH_KREG_OFFSET(field) (offsetof(struct \ + _infinipath_do_not_use_kernel_regs, field)) +#define IPATH_CREG_OFFSET(field) (offsetof( \ + struct infinipath_counters, field)) +#endif /* __KERNEL__ */ + +ipath_kreg + kr_control = IPATH_KREG_OFFSET(Control), + kr_counterregbase = IPATH_KREG_OFFSET(CounterRegBase), + kr_debugport = IPATH_KREG_OFFSET(DebugPort), + kr_debugportselect = IPATH_KREG_OFFSET(DebugPortSelect), + kr_errorclear = IPATH_KREG_OFFSET(ErrorClear), + kr_errormask = IPATH_KREG_OFFSET(ErrorMask), + kr_errorstatus = IPATH_KREG_OFFSET(ErrorStatus), + kr_extctrl = IPATH_KREG_OFFSET(ExtCtrl), + kr_extstatus = IPATH_KREG_OFFSET(ExtStatus), + kr_gpio_clear = IPATH_KREG_OFFSET(GPIOClear), + kr_gpio_mask = IPATH_KREG_OFFSET(GPIOMask), + kr_gpio_out = IPATH_KREG_OFFSET(GPIOOut), + kr_gpio_status = IPATH_KREG_OFFSET(GPIOStatus), + kr_hwdiagctrl = IPATH_KREG_OFFSET(HwDiagCtrl), + kr_hwerrclear = IPATH_KREG_OFFSET(HwErrClear), + kr_hwerrmask = IPATH_KREG_OFFSET(HwErrMask), + kr_hwerrstatus = IPATH_KREG_OFFSET(HwErrStatus), + kr_ibcctrl = IPATH_KREG_OFFSET(IBCCtrl), + kr_ibcstatus = IPATH_KREG_OFFSET(IBCStatus), + kr_intblocked = IPATH_KREG_OFFSET(IntBlocked), + kr_intclear = IPATH_KREG_OFFSET(IntClear), + kr_interruptconfig = IPATH_KREG_OFFSET(InterruptConfig), + kr_intmask = IPATH_KREG_OFFSET(IntMask), + kr_intstatus = IPATH_KREG_OFFSET(IntStatus), + kr_mdio = IPATH_KREG_OFFSET(MDIO), + kr_pagealign = IPATH_KREG_OFFSET(PageAlign), + kr_partitionkey = IPATH_KREG_OFFSET(RcvPartitionKey), + kr_portcnt = IPATH_KREG_OFFSET(PortCnt), + kr_rcvbthqp = IPATH_KREG_OFFSET(RcvBTHQP), + kr_rcvbufbase = IPATH_KREG_OFFSET(RcvBufBase), + kr_rcvbufsize = IPATH_KREG_OFFSET(RcvBufSize), + kr_rcvctrl = IPATH_KREG_OFFSET(RcvCtrl), + kr_rcvegrbase = IPATH_KREG_OFFSET(RcvEgrBase), + kr_rcvegrcnt = IPATH_KREG_OFFSET(RcvEgrCnt), + kr_rcvhdrcnt = IPATH_KREG_OFFSET(RcvHdrCnt), + kr_rcvhdrentsize = IPATH_KREG_OFFSET(RcvHdrEntSize), + kr_rcvhdrsize = IPATH_KREG_OFFSET(RcvHdrSize), + kr_rcvintmembase = IPATH_KREG_OFFSET(RxIntMemBase), + kr_rcvintmemsize = IPATH_KREG_OFFSET(RxIntMemSize), + kr_rcvtidbase = IPATH_KREG_OFFSET(RcvTIDBase), + kr_rcvtidcnt = IPATH_KREG_OFFSET(RcvTIDCnt), + kr_revision = IPATH_KREG_OFFSET(Revision), + kr_scratch = IPATH_KREG_OFFSET(Scratch), + kr_sendbuffererror = IPATH_KREG_OFFSET(SendBufferError), + kr_sendbuffererror1 = IPATH_KREG_OFFSET(SendBufferErrorCONT1), + kr_sendbuffererror2 = IPATH_KREG_OFFSET(SendBufferErrorCONT2), + kr_sendbuffererror3 = IPATH_KREG_OFFSET(SendBufferErrorCONT3), + 
kr_sendctrl = IPATH_KREG_OFFSET(SendCtrl), + kr_sendpioavailaddr = IPATH_KREG_OFFSET(SendPIOAvailAddr), + kr_sendpiobufbase = IPATH_KREG_OFFSET(SendPIOBufBase), + kr_sendpiobufcnt = IPATH_KREG_OFFSET(SendPIOBufCnt), + kr_sendpiosize = IPATH_KREG_OFFSET(SendPIOSize), + kr_sendregbase = IPATH_KREG_OFFSET(SendRegBase), + kr_txintmembase = IPATH_KREG_OFFSET(TxIntMemBase), + kr_txintmemsize = IPATH_KREG_OFFSET(TxIntMemSize), + kr_userregbase = IPATH_KREG_OFFSET(UserRegBase), + /* no simulator, register not used */ + kr_sync = IPATH_KREG_OFFSET(Scratch), + /* no simulator, register not used */ + kr_dump = IPATH_KREG_OFFSET(Scratch), + /* no simulator, register not used */ + kr_simver = IPATH_KREG_OFFSET(Scratch), + /* onchip serdes */ + kr_serdesconfig0 = IPATH_KREG_OFFSET(SerdesConfig0), + /* onchip serdes */ + kr_serdesconfig1 = IPATH_KREG_OFFSET(SerdesConfig1), + /* onchip serdes */ + kr_serdesstatus = IPATH_KREG_OFFSET(SerdesStatus), + /* onchip serdes */ + kr_xgxsconfig = IPATH_KREG_OFFSET(XGXSConfig), + /* + * last valid direct use register other than diag-only registers + */ + __kr_lastvaliddirect = IPATH_KREG_OFFSET(ReservedSW2[0]), + /* always invalid for initializing */ + __kr_invalid = IPATH_KREG_OFFSET(ReservedSW2[0]) + 1, + /* + * These should not be used directly via ipath_kget_kreg64(), + * use them with ipath_kget_kreg64_port() + */ + kr_rcvhdraddr = IPATH_KREG_OFFSET(RcvHdrAddr0), /* not for direct use */ + /* not for direct use */ + kr_rcvhdrtailaddr = IPATH_KREG_OFFSET(RcvHdrTailAddr0), + /* we define the full set for the diags, the kernel doesn't use them */ + kr_rcvhdraddr1 = IPATH_KREG_OFFSET(RcvHdrAddr1), + kr_rcvhdraddr2 = IPATH_KREG_OFFSET(RcvHdrAddr2), + kr_rcvhdraddr3 = IPATH_KREG_OFFSET(RcvHdrAddr3), + kr_rcvhdraddr4 = IPATH_KREG_OFFSET(RcvHdrAddr4), + kr_rcvhdrtailaddr1 = IPATH_KREG_OFFSET(RcvHdrTailAddr1), + kr_rcvhdrtailaddr2 = IPATH_KREG_OFFSET(RcvHdrTailAddr2), + kr_rcvhdrtailaddr3 = IPATH_KREG_OFFSET(RcvHdrTailAddr3), + kr_rcvhdrtailaddr4 = IPATH_KREG_OFFSET(RcvHdrTailAddr4), + kr_rcvhdraddr5 = IPATH_KREG_OFFSET(RcvHdrAddr5), + kr_rcvhdraddr6 = IPATH_KREG_OFFSET(RcvHdrAddr6), + kr_rcvhdraddr7 = IPATH_KREG_OFFSET(RcvHdrAddr7), + kr_rcvhdraddr8 = IPATH_KREG_OFFSET(RcvHdrAddr8), + kr_rcvhdrtailaddr5 = IPATH_KREG_OFFSET(RcvHdrTailAddr5), + kr_rcvhdrtailaddr6 = IPATH_KREG_OFFSET(RcvHdrTailAddr6), + kr_rcvhdrtailaddr7 = IPATH_KREG_OFFSET(RcvHdrTailAddr7), + kr_rcvhdrtailaddr8 = IPATH_KREG_OFFSET(RcvHdrTailAddr8); + +/* + * first of the pioavail registers, the total number is + * (kr_sendpiobufcnt / 32); each buffer uses 2 bits + * More properly, it's: + * (kr_sendpiobufcnt / ((sizeof(uint64_t)*BITS_PER_BYTE)/2)) + */ +ipath_sreg sr_sendpioavail = 0; + +ipath_creg + cr_badformatcnt = IPATH_CREG_OFFSET(RxBadFormatCnt), + cr_erricrccnt = IPATH_CREG_OFFSET(RxICRCErrCnt), + cr_errlinkcnt = IPATH_CREG_OFFSET(RxLinkProblemCnt), + cr_errlpcrccnt = IPATH_CREG_OFFSET(RxLPCRCErrCnt), + cr_errpkey = IPATH_CREG_OFFSET(RxPKeyMismatchCnt), + cr_errrcvflowctrlcnt = IPATH_CREG_OFFSET(RxFlowCtrlErrCnt), + cr_err_rlencnt = IPATH_CREG_OFFSET(RxLenErrCnt), + cr_errslencnt = IPATH_CREG_OFFSET(TxLenErrCnt), + cr_errtidfull = IPATH_CREG_OFFSET(RxTIDFullErrCnt), + cr_errtidvalid = IPATH_CREG_OFFSET(RxTIDValidErrCnt), + cr_errvcrccnt = IPATH_CREG_OFFSET(RxVCRCErrCnt), + cr_ibstatuschange = IPATH_CREG_OFFSET(IBStatusChangeCnt), + /* calc from Reg_CounterRegBase + offset */ + cr_intcnt = IPATH_CREG_OFFSET(LBIntCnt), + cr_invalidrlencnt = IPATH_CREG_OFFSET(RxMaxMinLenErrCnt), + cr_invalidslencnt 
= IPATH_CREG_OFFSET(TxMaxMinLenErrCnt), + cr_lbflowstallcnt = IPATH_CREG_OFFSET(LBFlowStallCnt), + cr_pktrcvcnt = IPATH_CREG_OFFSET(RxDataPktCnt), + cr_pktrcvflowctrlcnt = IPATH_CREG_OFFSET(RxFlowPktCnt), + cr_pktsendcnt = IPATH_CREG_OFFSET(TxDataPktCnt), + cr_pktsendflowcnt = IPATH_CREG_OFFSET(TxFlowPktCnt), + cr_portovflcnt = IPATH_CREG_OFFSET(RxP0HdrEgrOvflCnt), + cr_portovflcnt1 = IPATH_CREG_OFFSET(RxP1HdrEgrOvflCnt), + cr_portovflcnt2 = IPATH_CREG_OFFSET(RxP2HdrEgrOvflCnt), + cr_portovflcnt3 = IPATH_CREG_OFFSET(RxP3HdrEgrOvflCnt), + cr_portovflcnt4 = IPATH_CREG_OFFSET(RxP4HdrEgrOvflCnt), + cr_portovflcnt5 = IPATH_CREG_OFFSET(RxP5HdrEgrOvflCnt), + cr_portovflcnt6 = IPATH_CREG_OFFSET(RxP6HdrEgrOvflCnt), + cr_portovflcnt7 = IPATH_CREG_OFFSET(RxP7HdrEgrOvflCnt), + cr_portovflcnt8 = IPATH_CREG_OFFSET(RxP8HdrEgrOvflCnt), + cr_rcvebpcnt = IPATH_CREG_OFFSET(RxEBPCnt), + cr_rcvovflcnt = IPATH_CREG_OFFSET(RxBufOvflCnt), + cr_senddropped = IPATH_CREG_OFFSET(TxDroppedPktCnt), + cr_sendstallcnt = IPATH_CREG_OFFSET(TxFlowStallCnt), + cr_sendunderruncnt = IPATH_CREG_OFFSET(TxUnderrunCnt), + cr_wordrcvcnt = IPATH_CREG_OFFSET(RxDwordCnt), + cr_wordsendcnt = IPATH_CREG_OFFSET(TxDwordCnt), + cr_unsupvlcnt = IPATH_CREG_OFFSET(TxUnsupVLErrCnt), + cr_rxdroppktcnt = IPATH_CREG_OFFSET(RxDroppedPktCnt), + cr_iblinkerrrecovcnt = IPATH_CREG_OFFSET(IBLinkErrRecoveryCnt), + cr_iblinkdowncnt = IPATH_CREG_OFFSET(IBLinkDownedCnt), + cr_ibsymbolerrcnt = IPATH_CREG_OFFSET(IBSymbolErrCnt); + +/* kr_sendctrl bits */ +#define INFINIPATH_S_DISARMPIOBUF_MASK 0xFF + +/* kr_rcvctrl bits */ +#define INFINIPATH_R_PORTENABLE_MASK 0x1FF +#define INFINIPATH_R_INTRAVAIL_MASK 0x1FF + +/* kr_intstatus, kr_intclear, kr_intmask bits */ +#define INFINIPATH_I_RCVURG_MASK 0x1FF +#define INFINIPATH_I_RCVAVAIL_MASK 0x1FF + +/* kr_hwerrclear, kr_hwerrmask, kr_hwerrstatus, bits */ +#define INFINIPATH_HWE_HTCMEMPARITYERR_MASK 0x3FFFFFULL +#define INFINIPATH_HWE_HTCLNKABYTE0CRCERR 0x0000000000800000ULL +#define INFINIPATH_HWE_HTCLNKABYTE1CRCERR 0x0000000001000000ULL +#define INFINIPATH_HWE_HTCLNKBBYTE0CRCERR 0x0000000002000000ULL +#define INFINIPATH_HWE_HTCLNKBBYTE1CRCERR 0x0000000004000000ULL +#define INFINIPATH_HWE_HTCMISCERR4 0x0000000008000000ULL +#define INFINIPATH_HWE_HTCMISCERR5 0x0000000010000000ULL +#define INFINIPATH_HWE_HTCMISCERR6 0x0000000020000000ULL +#define INFINIPATH_HWE_HTCMISCERR7 0x0000000040000000ULL +#define INFINIPATH_HWE_MEMBISTFAILED 0x0040000000000000ULL +#define INFINIPATH_HWE_COREPLL_FBSLIP 0x0080000000000000ULL +#define INFINIPATH_HWE_COREPLL_RFSLIP 0x0100000000000000ULL +#define INFINIPATH_HWE_HTBPLL_FBSLIP 0x0200000000000000ULL +#define INFINIPATH_HWE_HTBPLL_RFSLIP 0x0400000000000000ULL +#define INFINIPATH_HWE_HTAPLL_FBSLIP 0x0800000000000000ULL +#define INFINIPATH_HWE_HTAPLL_RFSLIP 0x1000000000000000ULL +#define INFINIPATH_HWE_EXTSERDESPLLFAILED 0x2000000000000000ULL + +/* kr_hwdiagctrl bits */ +#define INFINIPATH_DC_NUMHTMEMS 22 + +/* kr_extstatus bits */ +#define INFINIPATH_EXTS_FREQSEL 0x2 +#define INFINIPATH_EXTS_SERDESSEL 0x4 +#define INFINIPATH_EXTS_MEMBIST_ENDTEST 0x0000000000004000 +#define INFINIPATH_EXTS_MEMBIST_CORRECT 0x0000000000008000 + +/* kr_extctrl bits */ + +/* + * masks and bits that are different in different chips, or present only + * in one + */ +const uint32_t infinipath_i_rcvavail_mask = INFINIPATH_I_RCVAVAIL_MASK; +const uint32_t infinipath_i_rcvurg_mask = INFINIPATH_I_RCVURG_MASK; +const uint64_t infinipath_hwe_htcmemparityerr_mask = + INFINIPATH_HWE_HTCMEMPARITYERR_MASK; + +const 
uint64_t infinipath_hwe_spibdcmlockfailed_mask = 0ULL; +const uint64_t infinipath_hwe_sphtdcmlockfailed_mask = 0ULL; +const uint64_t infinipath_hwe_htcdcmlockfailed_mask = 0ULL; +const uint64_t infinipath_hwe_htcdcmlockfailed_shift = 0ULL; +const uint64_t infinipath_hwe_sphtdcmlockfailed_shift = 0ULL; +const uint64_t infinipath_hwe_spibdcmlockfailed_shift = 0ULL; + +const uint64_t infinipath_hwe_htclnkabyte0crcerr = + INFINIPATH_HWE_HTCLNKABYTE0CRCERR; +const uint64_t infinipath_hwe_htclnkabyte1crcerr = + INFINIPATH_HWE_HTCLNKABYTE1CRCERR; +const uint64_t infinipath_hwe_htclnkbbyte0crcerr = + INFINIPATH_HWE_HTCLNKBBYTE0CRCERR; +const uint64_t infinipath_hwe_htclnkbbyte1crcerr = + INFINIPATH_HWE_HTCLNKBBYTE1CRCERR; + +const uint64_t infinipath_c_bitsextant = + (INFINIPATH_C_FREEZEMODE | INFINIPATH_C_LINKENABLE); + +const uint64_t infinipath_s_bitsextant = + (INFINIPATH_S_ABORT | INFINIPATH_S_PIOINTBUFAVAIL | + INFINIPATH_S_PIOBUFAVAILUPD | INFINIPATH_S_PIOENABLE | + INFINIPATH_S_DISARM | + (INFINIPATH_S_DISARMPIOBUF_MASK << INFINIPATH_S_DISARMPIOBUF_SHIFT)); + +const uint64_t infinipath_r_bitsextant = + ((INFINIPATH_R_PORTENABLE_MASK << INFINIPATH_R_PORTENABLE_SHIFT) | + (INFINIPATH_R_INTRAVAIL_MASK << INFINIPATH_R_INTRAVAIL_SHIFT) | + INFINIPATH_R_TAILUPD); + +const uint64_t infinipath_i_bitsextant = + ((INFINIPATH_I_RCVURG_MASK << INFINIPATH_I_RCVURG_SHIFT) | + (INFINIPATH_I_RCVAVAIL_MASK << INFINIPATH_I_RCVAVAIL_SHIFT) | + INFINIPATH_I_ERROR | INFINIPATH_I_SPIOSENT | + INFINIPATH_I_SPIOBUFAVAIL | INFINIPATH_I_GPIO); + +const uint64_t infinipath_e_bitsextant = + (INFINIPATH_E_RFORMATERR | INFINIPATH_E_RVCRC | INFINIPATH_E_RICRC | + INFINIPATH_E_RMINPKTLEN | INFINIPATH_E_RMAXPKTLEN | + INFINIPATH_E_RLONGPKTLEN | INFINIPATH_E_RSHORTPKTLEN | + INFINIPATH_E_RUNEXPCHAR | INFINIPATH_E_RUNSUPVL | INFINIPATH_E_REBP | + INFINIPATH_E_RIBFLOW | INFINIPATH_E_RBADVERSION | + INFINIPATH_E_RRCVEGRFULL | INFINIPATH_E_RRCVHDRFULL | + INFINIPATH_E_RBADTID | INFINIPATH_E_RHDRLEN | + INFINIPATH_E_RHDR | INFINIPATH_E_RIBLOSTLINK | + INFINIPATH_E_SMINPKTLEN | INFINIPATH_E_SMAXPKTLEN | + INFINIPATH_E_SUNDERRUN | INFINIPATH_E_SPKTLEN | + INFINIPATH_E_SDROPPEDSMPPKT | INFINIPATH_E_SDROPPEDDATAPKT | + INFINIPATH_E_SPIOARMLAUNCH | INFINIPATH_E_SUNEXPERRPKTNUM | + INFINIPATH_E_SUNSUPVL | INFINIPATH_E_IBSTATUSCHANGED | + INFINIPATH_E_INVALIDADDR | INFINIPATH_E_RESET | INFINIPATH_E_HARDWARE); + +const uint64_t infinipath_hwe_bitsextant = + (INFINIPATH_HWE_HTCMEMPARITYERR_MASK << + INFINIPATH_HWE_HTCMEMPARITYERR_SHIFT) | + (INFINIPATH_HWE_TXEMEMPARITYERR_MASK << + INFINIPATH_HWE_TXEMEMPARITYERR_SHIFT) | + (INFINIPATH_HWE_RXEMEMPARITYERR_MASK << + INFINIPATH_HWE_RXEMEMPARITYERR_SHIFT) | + INFINIPATH_HWE_HTCLNKABYTE0CRCERR | + INFINIPATH_HWE_HTCLNKABYTE1CRCERR | INFINIPATH_HWE_HTCLNKBBYTE0CRCERR | + INFINIPATH_HWE_HTCLNKBBYTE1CRCERR | INFINIPATH_HWE_HTCMISCERR4 | + INFINIPATH_HWE_HTCMISCERR5 | INFINIPATH_HWE_HTCMISCERR6 | + INFINIPATH_HWE_HTCMISCERR7 | INFINIPATH_HWE_HTCBUSTREQPARITYERR | + INFINIPATH_HWE_HTCBUSTRESPPARITYERR | + INFINIPATH_HWE_HTCBUSIREQPARITYERR | + INFINIPATH_HWE_RXDSYNCMEMPARITYERR | INFINIPATH_HWE_MEMBISTFAILED | + INFINIPATH_HWE_COREPLL_FBSLIP | INFINIPATH_HWE_COREPLL_RFSLIP | + INFINIPATH_HWE_HTBPLL_FBSLIP | INFINIPATH_HWE_HTBPLL_RFSLIP | + INFINIPATH_HWE_HTAPLL_FBSLIP | INFINIPATH_HWE_HTAPLL_RFSLIP | + INFINIPATH_HWE_EXTSERDESPLLFAILED | + INFINIPATH_HWE_IBCBUSTOSPCPARITYERR | + INFINIPATH_HWE_IBCBUSFRSPCPARITYERR; + +const uint64_t infinipath_dc_bitsextant = + 
(INFINIPATH_DC_FORCEHTCMEMPARITYERR_MASK << + INFINIPATH_DC_FORCEHTCMEMPARITYERR_SHIFT) | + (INFINIPATH_DC_FORCETXEMEMPARITYERR_MASK << + INFINIPATH_DC_FORCETXEMEMPARITYERR_SHIFT) | + (INFINIPATH_DC_FORCERXEMEMPARITYERR_MASK << + INFINIPATH_DC_FORCERXEMEMPARITYERR_SHIFT) | + INFINIPATH_DC_FORCEHTCBUSTREQPARITYERR | + INFINIPATH_DC_FORCEHTCBUSTRESPPARITYERR | + INFINIPATH_DC_FORCEHTCBUSIREQPARITYERR | + INFINIPATH_DC_FORCERXDSYNCMEMPARITYERR | + INFINIPATH_DC_COUNTERDISABLE | INFINIPATH_DC_COUNTERWREN | + INFINIPATH_DC_FORCEIBCBUSTOSPCPARITYERR | + INFINIPATH_DC_FORCEIBCBUSFRSPCPARITYERR; + +const uint64_t infinipath_ibcc_bitsextant = + (INFINIPATH_IBCC_FLOWCTRLPERIOD_MASK << + INFINIPATH_IBCC_FLOWCTRLPERIOD_SHIFT) | + (INFINIPATH_IBCC_FLOWCTRLWATERMARK_MASK << + INFINIPATH_IBCC_FLOWCTRLWATERMARK_SHIFT) | + (INFINIPATH_IBCC_LINKINITCMD_MASK << + INFINIPATH_IBCC_LINKINITCMD_SHIFT) | + (INFINIPATH_IBCC_LINKCMD_MASK << INFINIPATH_IBCC_LINKCMD_SHIFT) | + (INFINIPATH_IBCC_MAXPKTLEN_MASK << INFINIPATH_IBCC_MAXPKTLEN_SHIFT) | + (INFINIPATH_IBCC_PHYERRTHRESHOLD_MASK << + INFINIPATH_IBCC_PHYERRTHRESHOLD_SHIFT) | + (INFINIPATH_IBCC_OVERRUNTHRESHOLD_MASK << + INFINIPATH_IBCC_OVERRUNTHRESHOLD_SHIFT) | + (INFINIPATH_IBCC_CREDITSCALE_MASK << + INFINIPATH_IBCC_CREDITSCALE_SHIFT) | + INFINIPATH_IBCC_LOOPBACK | INFINIPATH_IBCC_LINKDOWNDEFAULTSTATE; + +const uint64_t infinipath_mdio_bitsextant = + (INFINIPATH_MDIO_CLKDIV_MASK << INFINIPATH_MDIO_CLKDIV_SHIFT) | + (INFINIPATH_MDIO_COMMAND_MASK << INFINIPATH_MDIO_COMMAND_SHIFT) | + (INFINIPATH_MDIO_DEVADDR_MASK << INFINIPATH_MDIO_DEVADDR_SHIFT) | + (INFINIPATH_MDIO_REGADDR_MASK << INFINIPATH_MDIO_REGADDR_SHIFT) | + (INFINIPATH_MDIO_DATA_MASK << INFINIPATH_MDIO_DATA_SHIFT) | + INFINIPATH_MDIO_CMDVALID | INFINIPATH_MDIO_RDDATAVALID; + +const uint64_t infinipath_ibcs_bitsextant = + (INFINIPATH_IBCS_LINKTRAININGSTATE_MASK << + INFINIPATH_IBCS_LINKTRAININGSTATE_SHIFT) | + (INFINIPATH_IBCS_LINKSTATE_MASK << INFINIPATH_IBCS_LINKSTATE_SHIFT) | + INFINIPATH_IBCS_TXREADY | INFINIPATH_IBCS_TXCREDITOK; + +const uint64_t infinipath_extc_bitsextant = + (INFINIPATH_EXTC_GPIOINVERT_MASK << INFINIPATH_EXTC_GPIOINVERT_SHIFT) | + (INFINIPATH_EXTC_GPIOOE_MASK << INFINIPATH_EXTC_GPIOOE_SHIFT) | + INFINIPATH_EXTC_SERDESENABLE | INFINIPATH_EXTC_SERDESCONNECT | + INFINIPATH_EXTC_SERDESENTRUNKING | INFINIPATH_EXTC_SERDESDISRXFIFO | + INFINIPATH_EXTC_SERDESENPLPBK1 | INFINIPATH_EXTC_SERDESENPLPBK2 | + INFINIPATH_EXTC_SERDESENENCDEC | INFINIPATH_EXTC_LEDSECPORTGREENON | + INFINIPATH_EXTC_LEDSECPORTYELLOWON | INFINIPATH_EXTC_LEDPRIPORTGREENON | + INFINIPATH_EXTC_LEDPRIPORTYELLOWON | INFINIPATH_EXTC_LEDGBLOKGREENON | + INFINIPATH_EXTC_LEDGBLERRREDOFF; + +/* Start of Documentation block for SerDes registers + * serdes and xgxs register bits; not all have defines, + * since I haven't yet needed them all, and I'm lazy. 
Those that I needed + * are in ipath_registers.h + +serdesConfig0Out (R/W) + Default Value +bit[3:0] - ResetA/B/C/D (4'b1111) +bit[7:4] -L1PwrdnA/B/C/D (4'b0000) +bit[11:8] - RxIdleEnX (4'b0000) +bit[15:12] - TxIdleEnX (4'b0000) +bit[19:16] - RxDetectEnX (4'b0000) +bit[23:20] - BeaconTxEnX (4'b0000) +bit[27:24] - RxTermEnX (4'b0000) +bit[28] - ResetPLL (1'b0) +bit[29] -L2Pwrdn (1'b0) +bit[37:30] - Offset[7:0] (8'b00000000) +bit[38] -OffsetEn (1'b0) +bit[39] -ParLBPK (1'b0) +bit[40] -ParReset (1'b0) +bit[42:41] - RefSel (2'b10) +bit[43] - PW (1'b0) +bit[47:44] - LPBKA/B/C/D (4'b0000) +bit[49:48] - ClkBufTermAdj (2'b0) +bit[51:50] - RxTermAdj (2'b0) +bit[53:52] - TxTermAdj (2'b0) +bit[55:54] - RxEqCtl (2'b0) +bit[63:56] - Reserved + +cce_wip_serdesConfig1Out[63:0] (R/W) +bit[3:0] - HiDrvX (4'b0000) +bit[7:4] - LoDrvX (4'b0000) +bit[12:11] - DtxA[3:0] (4'b0000) +bit[15:12] - DtxB[3:0] (4'b0000) +bit[19:16] - DtxC[3:0] (4'b0000) +bit[23:20] - DtxD[3:0] (4'b0000) +bit[27:24] - DeqA[3:0] (4'b0000) +bit[31:28] - DeqB[3:0] (4'b0000) +bit[35:32] - DeqC[3:0] (4'b0000) +bit[39:36] - DeqD[3:0] (4'b0000) +Framer interface, bits 40-59, not used +bit[44:40] - FmOffsetA[4:0] (5'b00000) +bit[49:45] - FmOffsetB[4:0] (5'b00000) +bit[54:50] - FmOffsetC[4:0] (5'b00000) +bit[59:55] - FmOffsetD[4:0] (5'b00000) +bit[63:60] - FmOffsetEnA/B/C/D (4'b0000) + +SerdesStatus[63:0] (RO) +bit[3:0] - TxIdleDetectA/B/C/D +bit[7:4] - RxDetectA/B/C/D +bit[11:8] - BeaconDetectA/B/C/D +bit[63:12] - Reserved + +XGXSConfigOut[63:0] +bit[2:0] - Resets, init to 1; bit 0 unused? +bit[3] - MDIO, select register bank for vendor specific register + (0x1e if set, else 0x1f); vendor-specific status in register 8 + bits 0-3 lanes0-3 signal detect, 1 if detected + bits 4-7 lanes0-3 CTC fifo errors, 1 if detected (latched until read) +bit[8:4] - MDIO port address +bit[18:9] - lnk_sync_mask +bit[22:19] - polarity inv + +Documentation end */ + +/* + * + * General specs: + * ExtCtrl[63:48] = EXTC_GPIOOE[15:0] + * ExtCtrl[47:32] = EXTC_GPIOInvert[15:0] + * ExtStatus[63:48] = GpioIn[15:0] + * + * GPIO[1] = EEPROM_SDA + * GPIO[0] = EEPROM_SCL + */ + +#define _IPATH_GPIO_SDA_NUM 1 +#define _IPATH_GPIO_SCL_NUM 0 + +#define IPATH_GPIO_SDA \ + (1UL << (_IPATH_GPIO_SDA_NUM+INFINIPATH_EXTC_GPIOOE_SHIFT)) +#define IPATH_GPIO_SCL \ + (1UL << (_IPATH_GPIO_SCL_NUM+INFINIPATH_EXTC_GPIOOE_SHIFT)) + +/* + * register bits for selecting i2c direction and values, used for I2C serial + * flash + */ +const uint16_t ipath_gpio_sda_num = _IPATH_GPIO_SDA_NUM; +const uint16_t ipath_gpio_scl_num = _IPATH_GPIO_SCL_NUM; +const uint64_t ipath_gpio_sda = IPATH_GPIO_SDA; +const uint64_t ipath_gpio_scl = IPATH_GPIO_SCL; + +/* The remaining portion of this file is used only for the kernel */ +#ifdef __KERNEL__ + +#include +#include +#include +#include +#include +#include + +/* + * This file contains all of the code that is specific to the InfiniPath + * HT-400 chip. + */ + +/* we support up to 4 chips per system */ +const uint32_t infinipath_max = 4; +ipath_devdata devdata[4]; +static const char *ipath_unit_names[4] = { + "infinipath0", "infinipath1", "infinipath2", "infinipath3" +}; + +const char *ipath_get_unit_name(int unit) +{ + return ipath_unit_names[unit]; +} + +static void ipath_check_htlink(ipath_type t); + +/* + * display hardware errors. Use same msg buffer as regular errors to avoid + * excessive stack use. Most hardware errors are catastrophic, but for + * right now, we'll print them and continue. 
+ * We reuse the same message buffer as ipath_handle_errors() to avoid
+ * excessive stack usage.
+ */
+void ipath_handle_hwerrors(const ipath_type t, char *msg, int msgl)
+{
+	uint64_t hwerrs = ipath_kget_kreg64(t, kr_hwerrstatus);
+	uint32_t bits, ctrl;
+	int isfatal = 0;
+	char bitsmsg[64];
+
+	if (!hwerrs) {
+		_IPATH_VDBG("Called but no hardware errors set\n");
+		/*
+		 * better than printing confusing messages; this seems to be
+		 * related to clearing the crc error, or the pll error,
+		 * during init.
+		 */
+		return;
+	} else if (hwerrs == ~0ULL) {
+		_IPATH_UNIT_ERROR(t,
+			"Read of hardware error status failed (all bits set); ignoring\n");
+		return;
+	}
+	ipath_stats.sps_hwerrs++;
+
+	/*
+	 * clear the error, regardless of whether we continue or stop using
+	 * the chip.
+	 */
+	ipath_kput_kreg(t, kr_hwerrclear, hwerrs);
+
+	hwerrs &= devdata[t].ipath_hwerrmask;
+
+	/*
+	 * make sure we get this much out, unless told to be quiet,
+	 * or it's occurred within the last 5 seconds
+	 */
+	if ((hwerrs & ~devdata[t].ipath_lasthwerror) ||
+	    (infinipath_debug & __IPATH_VERBDBG))
+		_IPATH_INFO("Hardware error: hwerr=0x%llx (cleared)\n", hwerrs);
+	devdata[t].ipath_lasthwerror |= hwerrs;
+
+	if (hwerrs & ~infinipath_hwe_bitsextant)
+		_IPATH_UNIT_ERROR(t,
+			"hwerror interrupt with unknown errors %llx set\n",
+			hwerrs & ~infinipath_hwe_bitsextant);
+
+	ctrl = ipath_kget_kreg32(t, kr_control);
+	if (ctrl & INFINIPATH_C_FREEZEMODE) {
+		if (hwerrs) {
+			/*
+			 * if any bits are set that we aren't ignoring, only
+			 * make the complaint once, in case it's stuck or
+			 * recurring, and we get here multiple times
+			 */
+			if (devdata[t].ipath_flags & IPATH_INITTED) {
+				_IPATH_UNIT_ERROR(t,
+					"Fatal Error (freezemode), no longer usable\n");
+				isfatal = 1;
+			}
+			*devdata[t].ipath_statusp &= ~IPATH_STATUS_IB_READY;
+			/* mark as having had an error */
+			*devdata[t].ipath_statusp |= IPATH_STATUS_HWERROR;
+			/*
+			 * mark as not usable, at a minimum until driver
+			 * is reloaded, probably until reboot, since no
+			 * other reset is possible.
+			 */
+			devdata[t].ipath_flags &= ~IPATH_INITTED;
+		} else {
+			_IPATH_DBG
+			    ("Clearing freezemode on ignored hardware error\n");
+			ctrl &= ~INFINIPATH_C_FREEZEMODE;
+			ipath_kput_kreg(t, kr_control, ctrl);
+		}
+	}
+
+	*msg = '\0';
+
+	/*
+	 * may someday want to decode into which bits are which
+	 * functional area for parity errors, etc.
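 *
 * (Illustrative aside, not part of the original patch: such a decode
 * could be table-driven rather than the open-coded ifs below; names
 * like hwerr_fields are hypothetical.
 *
 *	static const struct {
 *		uint64_t mask;
 *		int shift;
 *		const char *what;
 *	} hwerr_fields[] = {
 *		{ INFINIPATH_HWE_HTCMEMPARITYERR_MASK,
 *		  INFINIPATH_HWE_HTCMEMPARITYERR_SHIFT, "HTC Parity Errs" },
 *		{ INFINIPATH_HWE_RXEMEMPARITYERR_MASK,
 *		  INFINIPATH_HWE_RXEMEMPARITYERR_SHIFT, "RXE Parity Errs" },
 *		{ INFINIPATH_HWE_TXEMEMPARITYERR_MASK,
 *		  INFINIPATH_HWE_TXEMEMPARITYERR_SHIFT, "TXE Parity Errs" },
 *	};
 *	int i;
 *	for (i = 0; i < 3; i++) {
 *		bits = (uint32_t) ((hwerrs >> hwerr_fields[i].shift) &
 *				   hwerr_fields[i].mask);
 *		if (!bits)
 *			continue;
 *		snprintf(bitsmsg, sizeof bitsmsg, "[%s %x] ",
 *			 hwerr_fields[i].what, bits);
 *		strlcat(msg, bitsmsg, msgl);
 *	}
 *
 * The patch keeps the explicit ifs below, which read well enough for
 * three fields.)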
+ */
+	if (hwerrs & (infinipath_hwe_htcmemparityerr_mask
+		      << INFINIPATH_HWE_HTCMEMPARITYERR_SHIFT)) {
+		bits = (uint32_t) ((hwerrs >>
+				    INFINIPATH_HWE_HTCMEMPARITYERR_SHIFT) &
+				   INFINIPATH_HWE_HTCMEMPARITYERR_MASK);
+		snprintf(bitsmsg, sizeof bitsmsg, "[HTC Parity Errs %x] ",
+			 bits);
+		strlcat(msg, bitsmsg, msgl);
+	}
+	if (hwerrs & (INFINIPATH_HWE_RXEMEMPARITYERR_MASK
+		      << INFINIPATH_HWE_RXEMEMPARITYERR_SHIFT)) {
+		bits = (uint32_t) ((hwerrs >>
+				    INFINIPATH_HWE_RXEMEMPARITYERR_SHIFT) &
+				   INFINIPATH_HWE_RXEMEMPARITYERR_MASK);
+		snprintf(bitsmsg, sizeof bitsmsg, "[RXE Parity Errs %x] ",
+			 bits);
+		strlcat(msg, bitsmsg, msgl);
+	}
+	if (hwerrs & (INFINIPATH_HWE_TXEMEMPARITYERR_MASK
+		      << INFINIPATH_HWE_TXEMEMPARITYERR_SHIFT)) {
+		bits = (uint32_t) ((hwerrs >>
+				    INFINIPATH_HWE_TXEMEMPARITYERR_SHIFT) &
+				   INFINIPATH_HWE_TXEMEMPARITYERR_MASK);
+		snprintf(bitsmsg, sizeof bitsmsg, "[TXE Parity Errs %x] ",
+			 bits);
+		strlcat(msg, bitsmsg, msgl);
+	}
+	if (hwerrs & INFINIPATH_HWE_IBCBUSTOSPCPARITYERR)
+		strlcat(msg, "[IB2IPATH Parity]", msgl);
+	if (hwerrs & INFINIPATH_HWE_IBCBUSFRSPCPARITYERR)
+		strlcat(msg, "[IPATH2IB Parity]", msgl);
+	if (hwerrs & INFINIPATH_HWE_HTCBUSIREQPARITYERR)
+		strlcat(msg, "[HTC Ireq Parity]", msgl);
+	if (hwerrs & INFINIPATH_HWE_HTCBUSTREQPARITYERR)
+		strlcat(msg, "[HTC Treq Parity]", msgl);
+	if (hwerrs & INFINIPATH_HWE_HTCBUSTRESPPARITYERR)
+		strlcat(msg, "[HTC Tresp Parity]", msgl);
+
+/* keep the code below somewhat more readable; not used elsewhere */
+#define _IPATH_HTLINK0_CRCBITS (infinipath_hwe_htclnkabyte0crcerr | \
+				infinipath_hwe_htclnkabyte1crcerr)
+#define _IPATH_HTLINK1_CRCBITS (infinipath_hwe_htclnkbbyte0crcerr | \
+				infinipath_hwe_htclnkbbyte1crcerr)
+#define _IPATH_HTLANE0_CRCBITS (infinipath_hwe_htclnkabyte0crcerr | \
+				infinipath_hwe_htclnkbbyte0crcerr)
+#define _IPATH_HTLANE1_CRCBITS (infinipath_hwe_htclnkabyte1crcerr | \
+				infinipath_hwe_htclnkbbyte1crcerr)
+	if (hwerrs & (_IPATH_HTLINK0_CRCBITS | _IPATH_HTLINK1_CRCBITS)) {
+		char bitsmsg[64];
+		uint64_t crcbits = hwerrs &
+		    (_IPATH_HTLINK0_CRCBITS | _IPATH_HTLINK1_CRCBITS);
+		/* don't check if 8bit HT */
+		if (devdata[t].ipath_flags & IPATH_8BIT_IN_HT0)
+			crcbits &= ~infinipath_hwe_htclnkabyte1crcerr;
+		/* don't check if 8bit HT */
+		if (devdata[t].ipath_flags & IPATH_8BIT_IN_HT1)
+			crcbits &= ~infinipath_hwe_htclnkbbyte1crcerr;
+		/*
+		 * we'll want to ignore link errors on a link that is
+		 * not in use, if any. For now, complain about both
+		 */
+		if (crcbits) {
+			uint16_t ctrl0, ctrl1;
+			snprintf(bitsmsg, sizeof bitsmsg,
+				 "[HT%s lane %s CRC (%llx); ignore till reload]",
+				 !(crcbits & _IPATH_HTLINK1_CRCBITS) ?
+				 "0 (A)" : (!(crcbits & _IPATH_HTLINK0_CRCBITS)
+					    ? "1 (B)" : "0+1 (A+B)"),
+				 !(crcbits & _IPATH_HTLANE1_CRCBITS) ? "0"
+				 : (!(crcbits & _IPATH_HTLANE0_CRCBITS) ? "1" :
+				    "0+1"), crcbits);
+			strlcat(msg, bitsmsg, msgl);
+
+			/*
+			 * print extra info for debugging.
+			 * slave/primary config word 4, 8 (link control 0, 1)
+			 */
+
+			if (pci_read_config_word(devdata[t].pcidev,
+						 devdata[t].ipath_ht_slave_off +
+						 0x4, &ctrl0))
+				_IPATH_INFO
+				    ("Couldn't read linkctrl0 of slave/primary config block\n");
+			else if (!(ctrl0 & 1 << 6))	/* not if EOC bit set */
+				_IPATH_DBG("HT linkctrl0 0x%x%s%s\n", ctrl0,
+					   ((ctrl0 >> 8) & 7) ? " CRC" : "",
+					   ((ctrl0 >> 4) & 1) ? "linkfail" :
+					   "");
"linkfail" : + ""); + if (pci_read_config_word + (devdata[t].pcidev, + devdata[t].ipath_ht_slave_off + 0x8, &ctrl1)) + _IPATH_INFO + ("Couldn't read linkctrl1 of slave/primary config block\n"); + else if (!(ctrl1 & 1 << 6)) /* not if EOC bit set */ + _IPATH_DBG("HT linkctrl1 0x%x%s%s\n", ctrl1, + ((ctrl1 >> 8) & 7) ? " CRC" : "", + ((ctrl1 >> 4) & 1) ? "linkfail" : + ""); + + /* disable until driver reloaded */ + devdata[t].ipath_hwerrmask &= ~crcbits; + ipath_kput_kreg(t, kr_hwerrmask, + devdata[t].ipath_hwerrmask); + _IPATH_DBG("HT crc errs: %s\n", msg); + } else + _IPATH_DBG + ("ignoring HT crc errors 0x%llx, not in use\n", + hwerrs & (_IPATH_HTLINK0_CRCBITS | + _IPATH_HTLINK1_CRCBITS)); + } + + if (hwerrs & INFINIPATH_HWE_HTCMISCERR5) + strlcat(msg, "[HT core Misc5]", msgl); + if (hwerrs & INFINIPATH_HWE_HTCMISCERR6) + strlcat(msg, "[HT core Misc6]", msgl); + if (hwerrs & INFINIPATH_HWE_HTCMISCERR7) + strlcat(msg, "[HT core Misc7]", msgl); + if (hwerrs & INFINIPATH_HWE_MEMBISTFAILED) { + strlcat(msg, "[Memory BIST test failed, HT-400 unusable]", + msgl); + /* ignore from now on, so disable until driver reloaded */ + devdata[t].ipath_hwerrmask &= ~INFINIPATH_HWE_MEMBISTFAILED; + ipath_kput_kreg(t, kr_hwerrmask, devdata[t].ipath_hwerrmask); + } +#define _IPATH_PLL_FAIL (INFINIPATH_HWE_COREPLL_FBSLIP | \ + INFINIPATH_HWE_COREPLL_RFSLIP | \ + INFINIPATH_HWE_HTBPLL_FBSLIP | \ + INFINIPATH_HWE_HTBPLL_RFSLIP | \ + INFINIPATH_HWE_HTAPLL_FBSLIP | \ + INFINIPATH_HWE_HTAPLL_RFSLIP) + + if (hwerrs & _IPATH_PLL_FAIL) { + snprintf(bitsmsg, sizeof bitsmsg, + "[PLL failed (%llx), HT-400 unusable]", + hwerrs & _IPATH_PLL_FAIL); + strlcat(msg, bitsmsg, msgl); + /* ignore from now on, so disable until driver reloaded */ + devdata[t].ipath_hwerrmask &= ~(hwerrs & _IPATH_PLL_FAIL); + ipath_kput_kreg(t, kr_hwerrmask, devdata[t].ipath_hwerrmask); + } + + if (hwerrs & INFINIPATH_HWE_EXTSERDESPLLFAILED) { + /* + * If it occurs, it is left masked since the eternal interface + * is unused + */ + devdata[t].ipath_hwerrmask &= + ~INFINIPATH_HWE_EXTSERDESPLLFAILED; + ipath_kput_kreg(t, kr_hwerrmask, devdata[t].ipath_hwerrmask); + } + + if (hwerrs & INFINIPATH_HWE_RXDSYNCMEMPARITYERR) + strlcat(msg, "[Rx Dsync]", msgl); + if (hwerrs & INFINIPATH_HWE_SERDESPLLFAILED) + strlcat(msg, "[SerDes PLL]", msgl); + + _IPATH_UNIT_ERROR(t, "%s hardware error\n", msg); + if (isfatal && (!ipath_diags_enabled)) { + if (devdata[t].ipath_freezemsg) { + /* + * for proc status file ; if no trailing } is copied, we'll know + * it was truncated. 
+ */ + snprintf(devdata[t].ipath_freezemsg, + devdata[t].ipath_freezelen, "{%s}", msg); + } + } +} + +/* fill in the board name, based on the board revision register */ +void ipath_ht_get_boardname(const ipath_type t, char *name, size_t namelen) +{ + char *n = NULL; + uint8_t boardrev = devdata[t].ipath_boardrev; + + switch (boardrev) { + case 4: /* Ponderosa is one of the bringup boards */ + n = "Ponderosa"; + break; + case 5: /* HT-460 original production board */ + n = "InfiniPath_HT-460"; + break; + case 7: /* HT-460 small form factor production board */ + n = "InfiniPath_HT-460-2"; + break; + case 6: + n = "OEM_Board_3"; + break; + case 8: + n = "LS/X-1"; + break; + case 9: /* Comstock bringup test board */ + n = "Comstock"; + break; + case 10: + n = "OEM_Board_2"; + break; + case 11: + n = "OEM_Board_4"; + break; + default: /* don't know, just print the number */ + _IPATH_ERROR("Don't yet know about board with ID %u\n", + boardrev); + snprintf(name, namelen, "UnknownBoardRev%u", boardrev); + break; + } + if (n) + snprintf(name, namelen, "%s", n); +} + +int ipath_validate_rev(ipath_devdata * dd) +{ + if (dd->ipath_majrev != 3 || dd->ipath_minrev != 2) { + /* + * This version of the driver only supports the HT-400 + * Rev 3.2 + */ + _IPATH_UNIT_ERROR(IPATH_UNIT(dd), + "Unsupported HT-400 revision %u.%u!\n", + dd->ipath_majrev, dd->ipath_minrev); + return 1; + } + if (dd->ipath_htspeed != 800) + _IPATH_UNIT_ERROR(IPATH_UNIT(dd), + "Incorrectly configured for HT @ %uMHz\n", + dd->ipath_htspeed); + if(dd->ipath_boardrev == 7 || dd->ipath_boardrev == 11 || + dd->ipath_boardrev == 6) + dd->ipath_flags |= IPATH_GPIO_INTR; + else if (dd->ipath_boardrev == 8) { /* LS/X-1 */ + uint64_t val; + val = ipath_kget_kreg64(dd->ipath_pd[0]->port_unit, kr_extstatus); + if(val & INFINIPATH_EXTS_SERDESSEL) { /* hardware disabled */ + /* This means that the chip is hardware disabled, and will + * not be able to bring up the link, in any case. We special + * case this and abort early, to avoid later messages. 
We
+		 * also set the DISABLED status bit
+		 */
+		_IPATH_DBG("Unit %u is hardware-disabled\n",
+			   dd->ipath_pd[0]->port_unit);
+		*dd->ipath_statusp |= IPATH_STATUS_DISABLED;
+		return 2;	/* this value is handled differently */
+		}
+	}
+	return 0;
+}
+
+static void ipath_check_htlink(ipath_type t)
+{
+	uint8_t linkerr, link_off, i;
+
+	for (i = 0; i < 2; i++) {
+		link_off = devdata[t].ipath_ht_slave_off + i * 4 + 0xd;
+		if (pci_read_config_byte(devdata[t].pcidev, link_off, &linkerr))
+			_IPATH_INFO
+			    ("Couldn't read linkerror%d of HT slave/primary block\n",
+			     i);
+		else if (linkerr & 0xf0) {
+			_IPATH_VDBG("HT linkerr%d bits 0x%x set, clearing\n",
+				    i, linkerr >> 4);
+			/*
+			 * writing the linkerr bits that are set should
+			 * clear them
+			 */
+			if (pci_write_config_byte
+			    (devdata[t].pcidev, link_off, linkerr))
+				_IPATH_DBG
+				    ("Failed write to clear HT linkerror%d\n",
+				     i);
+			if (pci_read_config_byte
+			    (devdata[t].pcidev, link_off, &linkerr))
+				_IPATH_INFO
+				    ("Couldn't reread linkerror%d of HT slave/primary block\n",
+				     i);
+			else if (linkerr & 0xf0)
+				_IPATH_INFO
+				    ("HT linkerror%d bits 0x%x couldn't be cleared\n",
+				     i, linkerr >> 4);
+		}
+	}
+}
+
+/*
+ * now that we have finished initializing everything that might reasonably
+ * cause a hardware error, and cleared those error bits as they occur,
+ * we can enable hardware errors in the mask (potentially enabling
+ * freeze mode), and enable hardware errors as errors (along with
+ * everything else) in errormask
+ */
+void ipath_clear_init_hwerrs(ipath_type t)
+{
+	uint64_t val, extsval;
+
+	extsval = ipath_kget_kreg64(t, kr_extstatus);
+
+	if (!(extsval & INFINIPATH_EXTS_MEMBIST_ENDTEST))
+		_IPATH_UNIT_ERROR(t, "MemBIST did not complete!\n");
+
+	ipath_check_htlink(t);
+
+	/* barring bugs, all hwerrors become interrupts, which can */
+	val = ~0ULL;
+	/* don't look at crc lane1 if 8 bit */
+	if (devdata[t].ipath_flags & IPATH_8BIT_IN_HT0)
+		val &= ~infinipath_hwe_htclnkabyte1crcerr;
+	/* don't look at crc lane1 if 8 bit */
+	if (devdata[t].ipath_flags & IPATH_8BIT_IN_HT1)
+		val &= ~infinipath_hwe_htclnkbbyte1crcerr;
+
+	/*
+	 * disable RXDSYNCMEMPARITY because external serdes is unused,
+	 * and therefore the logic will never be used or initialized,
+	 * and uninitialized state will normally result in this error
+	 * being asserted. Similarly for the external serdes pll
+	 * lock signal.
+	 */
+	val &=
+	    ~(INFINIPATH_HWE_EXTSERDESPLLFAILED |
+	      INFINIPATH_HWE_RXDSYNCMEMPARITYERR);
+
+	/*
+	 * Disable MISCERR4 because of an inversion in the HT core
+	 * logic checking for errors that cause this bit to be set.
+	 * The errata can also cause the protocol error bit to be set
+	 * in the HT config space linkerror register(s).
+	 */
+	val &= ~INFINIPATH_HWE_HTCMISCERR4;
+
+	/*
+	 * PLL ignored because the MDIO interface has a logic problem
+	 * for reads on Comstock and Ponderosa.
BRINGUP + */ + if (devdata[t].ipath_boardrev == 4 || devdata[t].ipath_boardrev == 9) + val &= ~INFINIPATH_HWE_EXTSERDESPLLFAILED; /* BRINGUP */ + devdata[t].ipath_hwerrmask = val; +} + +/* bring up the serdes */ +int ipath_bringup_serdes(ipath_type t) +{ + uint64_t val, config1; + int ret = 0, change = 0; + + _IPATH_DBG("Trying to bringup serdes\n"); + + if (ipath_kget_kreg64(t, kr_hwerrstatus) & + INFINIPATH_HWE_SERDESPLLFAILED) { + _IPATH_DBG + ("At start, serdes PLL failed bit set in hwerrstatus, clearing and continuing\n"); + ipath_kput_kreg(t, kr_hwerrclear, + INFINIPATH_HWE_SERDESPLLFAILED); + } + + val = ipath_kget_kreg64(t, kr_serdesconfig0); + config1 = ipath_kget_kreg64(t, kr_serdesconfig1); + + _IPATH_VDBG + ("Initial serdes status is config0=%llx config1=%llx, sstatus=%llx xgxs %llx\n", + val, config1, ipath_kget_kreg64(t, kr_serdesstatus), + ipath_kget_kreg64(t, kr_xgxsconfig)); + + /* force reset on */ + val |= + INFINIPATH_SERDC0_RESET_PLL /* | INFINIPATH_SERDC0_RESET_MASK */ ; + ipath_kput_kreg(t, kr_serdesconfig0, val); + udelay(15); /* need pll reset set at least for a bit */ + + if (val & INFINIPATH_SERDC0_RESET_PLL) { + uint64_t val2 = val &= ~INFINIPATH_SERDC0_RESET_PLL; + /* set lane resets, and tx idle, during pll reset */ + val2 |= INFINIPATH_SERDC0_RESET_MASK | INFINIPATH_SERDC0_TXIDLE; + _IPATH_VDBG("Clearing serdes PLL reset (writing %llx)\n", val2); + ipath_kput_kreg(t, kr_serdesconfig0, val2); + /* be sure chip saw it */ + val = ipath_kget_kreg64(t, kr_scratch); + /* + * need pll reset clear at least 11 usec before lane resets + * cleared; give it a few more + */ + udelay(15); + val = val2; /* for check below */ + } + + if (val & (INFINIPATH_SERDC0_RESET_PLL | INFINIPATH_SERDC0_RESET_MASK + | INFINIPATH_SERDC0_TXIDLE)) { + val &= + ~(INFINIPATH_SERDC0_RESET_PLL | INFINIPATH_SERDC0_RESET_MASK + | INFINIPATH_SERDC0_TXIDLE); + ipath_kput_kreg(t, kr_serdesconfig0, val); /* clear them */ + } + + val = ipath_kget_kreg64(t, kr_xgxsconfig); + if (((val >> INFINIPATH_XGXS_MDIOADDR_SHIFT) & + INFINIPATH_XGXS_MDIOADDR_MASK) != 3) { + val &= + ~(INFINIPATH_XGXS_MDIOADDR_MASK << + INFINIPATH_XGXS_MDIOADDR_SHIFT); + /* we use address 3 */ + val |= 3ULL << INFINIPATH_XGXS_MDIOADDR_SHIFT; + change = 1; + } + if (val & INFINIPATH_XGXS_RESET) { /* normally true after boot */ + val &= ~INFINIPATH_XGXS_RESET; + change = 1; + } + if (change) + ipath_kput_kreg(t, kr_xgxsconfig, val); + + val = ipath_kget_kreg64(t, kr_serdesconfig0); + + config1 &= ~0x0ffffffff00ULL; /* clear current and de-emphasis bits */ + config1 |= 0x00000000000ULL; /* set current to 20ma */ + config1 |= 0x0cccc000000ULL; /* set de-emphasis to -5.68dB */ + ipath_kput_kreg(t, kr_serdesconfig1, config1); + + _IPATH_VDBG + ("After setup: serdes status is config0=%llx config1=%llx, sstatus=%llx xgxs %llx\n", + val, config1, ipath_kget_kreg64(t, kr_serdesstatus), + ipath_kget_kreg64(t, kr_xgxsconfig)); + + if ((!ipath_waitfor_mdio_cmdready(t))) { + ipath_kput_kreg(t, kr_mdio, IPATH_MDIO_REQ(IPATH_MDIO_CMD_READ, + 31, + IPATH_MDIO_CTRL_XGXS_REG_8, + 0)); + if (ipath_waitfor_complete + (t, kr_mdio, IPATH_MDIO_DATAVALID, &val)) + _IPATH_DBG + ("Never got MDIO data for XGXS status read\n"); + else + _IPATH_VDBG("MDIO Read reg8, 'bank' 31 %x\n", + (uint32_t) val); + } else + _IPATH_DBG("Never got MDIO cmdready for XGXS status read\n"); + + return ret; /* for now, say we always succeeded */ +} + +/* set serdes to txidle; driver is being unloaded */ +void ipath_quiet_serdes(const ipath_type t) +{ + uint64_t val = 
ipath_kget_kreg64(t, kr_serdesconfig0); + + val |= INFINIPATH_SERDC0_TXIDLE; + _IPATH_DBG("Setting TxIdleEn on serdes (config0 = %llx)\n", val); + ipath_kput_kreg(t, kr_serdesconfig0, val); +} + +EXPORT_SYMBOL(ipath_get_unit_name); + +#endif /* __KERNEL__ */ diff --git a/drivers/infiniband/hw/ipath/ipath_i2c.c b/drivers/infiniband/hw/ipath/ipath_i2c.c new file mode 100644 index 0000000..4ee3d46 --- /dev/null +++ b/drivers/infiniband/hw/ipath/ipath_i2c.c @@ -0,0 +1,472 @@ +/* + * Copyright (c) 2003, 2004, 2005. PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. + * + * $Id: ipath_i2c.c 4365 2005-12-10 00:04:16Z rjwalsh $ + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "ipath_kernel.h" +#include "ips_common.h" +#include "ipath_layer.h" + +/* + * InfiniPath I2C Driver for onboard flash + * Bus Master, Standard Speed only. + * HT-460 uses the Atmel AT24C01 I2C serial FLASH part. + * This part is a 1Kbit part, that uses no programmable address bits, + * (the address is 1010000b) + */ + +typedef enum i2c_line_type_e { + i2c_line_scl = 0, + i2c_line_sda +} ipath_i2c_type; + +typedef enum i2c_line_state_e { + i2c_line_low = 0, + i2c_line_high +} ipath_i2c_state; + +#define READ_CMD 1 +#define WRITE_CMD 0 + +static int ipath_eeprom_init; + +/* + * The gpioval manipulation really should be protected by spinlocks + * or be converted to use atomic operations (unfortunately, atomic.h + * doesn't cover 64 bit ops for some of them). + */ + +int i2c_gpio_set(ipath_type dev, ipath_i2c_type line, + ipath_i2c_state new_line_state); +int i2c_gpio_get(ipath_type dev, ipath_i2c_type line, + ipath_i2c_state * curr_statep); + +/* + * returns 0 if the line was set to the new state successfully, non-zero + * on error. 
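 *
 * (Clarifying aside, based on the body below: the line is handled
 * open-drain style. "High" means tri-state the pin -- clear its
 * EXTC_GPIOOE bit so the external pull-up raises the line -- while
 * "low" means enable the pin as an output driving 0:
 *	extctrl &= ~mask;	high: pin floats, pull-up wins
 *	extctrl |=  mask;	low:  pin actively drives the line
 * This is the usual way to share an I2C bus line between devices.)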
+ */ +int i2c_gpio_set(ipath_type dev, ipath_i2c_type line, + ipath_i2c_state new_line_state) +{ + uint64_t read_val, write_val, mask, *gpioval; + + gpioval = &devdata[dev].ipath_gpio_out; + read_val = ipath_kget_kreg64(dev, kr_extctrl); + if (line == i2c_line_scl) + mask = ipath_gpio_scl; + else + mask = ipath_gpio_sda; + + if (new_line_state == i2c_line_high) + /* tri-state the output rather than force high */ + write_val = read_val & ~mask; + else + /* config line to be an output */ + write_val = read_val | mask; + ipath_kput_kreg(dev, kr_extctrl, write_val); + + /* set high and verify */ + if (new_line_state == i2c_line_high) + write_val = 0x1UL; + else + write_val = 0x0UL; + + if (line == i2c_line_scl) { + write_val <<= ipath_gpio_scl_num; + *gpioval = *gpioval & ~(1UL << ipath_gpio_scl_num); + *gpioval |= write_val; + } else { + write_val <<= ipath_gpio_sda_num; + *gpioval = *gpioval & ~(1UL << ipath_gpio_sda_num); + *gpioval |= write_val; + } + ipath_kput_kreg(dev, kr_gpio_out, *gpioval); + + return 0; +} + +/* + * returns 0 if the line was set to the new state successfully, non-zero + * on error. curr_state is not set on error. + */ +int i2c_gpio_get(ipath_type dev, ipath_i2c_type line, + ipath_i2c_state * curr_statep) +{ + uint64_t read_val, write_val, mask; + + /* check args */ + if (curr_statep == NULL) + return 1; + + read_val = ipath_kget_kreg64(dev, kr_extctrl); + /* config line to be an input */ + if (line == i2c_line_scl) + mask = ipath_gpio_scl; + else + mask = ipath_gpio_sda; + write_val = read_val & ~mask; + ipath_kput_kreg(dev, kr_extctrl, write_val); + read_val = ipath_kget_kreg64(dev, kr_extstatus); + + if (read_val & mask) + *curr_statep = i2c_line_high; + else + *curr_statep = i2c_line_low; + + return 0; +} + +/* + * would prefer to not inline this, to avoid code bloat, and simplify debugging + * But when compiling against 2.6.10 kernel tree, it gets an error, so + * not for now. + */ +static void ipath_i2c_delay(ipath_type, int); + +/* + * we use this instead of udelay directly, so we can make sure + * that previous register writes have been flushed all the way + * to the chip. Since we are delaying anyway, the cost doesn't + * hurt, and makes the bit twiddling more regular + * If delay is negative, we'll do the chip read, to be sure write made it + * to our chip, but won't do udelay() + */ +static void ipath_i2c_delay(ipath_type dev, int dtime) +{ + /* + * This needs to be volatile, so that the compiler doesn't + * optimize away the read to the device's mapped memory. + */ + volatile uint32_t read_val; + if (!dtime) + return; + read_val = ipath_kget_kreg32(dev, kr_scratch); + if (--dtime > 0) /* register read takes about .5 usec, itself */ + udelay(dtime); +} + +static void ipath_scl_out(ipath_type dev, uint8_t bit, int delay) +{ + i2c_gpio_set(dev, i2c_line_scl, bit ? i2c_line_high : i2c_line_low); + + ipath_i2c_delay(dev, delay); +} + +static void ipath_sda_out(ipath_type dev, uint8_t bit, int delay) +{ + i2c_gpio_set(dev, i2c_line_sda, bit ? i2c_line_high : i2c_line_low); + + ipath_i2c_delay(dev, delay); +} + +static uint8_t ipath_sda_in(ipath_type dev, int delay) +{ + ipath_i2c_state bit; + + if (i2c_gpio_get(dev, i2c_line_sda, &bit)) + _IPATH_DBG("get bit failed!\n"); + + ipath_i2c_delay(dev, delay); + + return bit == i2c_line_high ? 
1U : 0; +} + +/* see if ack following write is true */ +static int ipath_i2c_ackrcv(ipath_type dev) +{ + uint8_t ack_received; + + /* AT ENTRY SCL = LOW */ + /* change direction, ignore data */ + ack_received = ipath_sda_in(dev, 1); + ipath_scl_out(dev, i2c_line_high, 1); + ack_received = ipath_sda_in(dev, 1) == 0; + ipath_scl_out(dev, i2c_line_low, 1); + return ack_received; +} + +/* + * write a byte, one bit at a time. Returns 0 if we got the following + * ack, otherwise 1 + */ +static int ipath_wr_byte(ipath_type dev, uint8_t data) +{ + int bit_cntr; + uint8_t bit; + + for (bit_cntr = 7; bit_cntr >= 0; bit_cntr--) { + bit = (data >> bit_cntr) & 1; + ipath_sda_out(dev, bit, 1); + ipath_scl_out(dev, i2c_line_high, 1); + ipath_scl_out(dev, i2c_line_low, 1); + } + if (!ipath_i2c_ackrcv(dev)) + return 1; + return 0; +} + +static void send_ack(ipath_type dev) +{ + ipath_sda_out(dev, i2c_line_low, 1); + ipath_scl_out(dev, i2c_line_high, 1); + ipath_scl_out(dev, i2c_line_low, 1); + ipath_sda_out(dev, i2c_line_high, 1); +} + +/* + * ipath_i2c_startcmd - Transmit the start condition, followed by + * address/cmd + * (both clock/data high, clock high, data low while clock is high) + */ +static int ipath_i2c_startcmd(ipath_type dev, uint8_t offset_dir) +{ + int res; + + /* issue start sequence */ + ipath_sda_out(dev, i2c_line_high, 1); + ipath_scl_out(dev, i2c_line_high, 1); + ipath_sda_out(dev, i2c_line_low, 1); + ipath_scl_out(dev, i2c_line_low, 1); + + /* issue length and direction byte */ + res = ipath_wr_byte(dev, offset_dir); + + if (res) + _IPATH_VDBG("No ack to complete start\n"); + return res; +} + +/* + * stop_cmd - Transmit the stop condition + * (both clock/data low, clock high, data high while clock is high) + */ +static void stop_cmd(ipath_type dev) +{ + ipath_scl_out(dev, i2c_line_low, 1); + ipath_sda_out(dev, i2c_line_low, 1); + ipath_scl_out(dev, i2c_line_high, 1); + ipath_sda_out(dev, i2c_line_high, 3); +} + +/* + * ipath_eeprom_reset - reset I2C communication. + * + * eeprom: Atmel AT24C01 + * + */ + +static int ipath_eeprom_reset(ipath_type dev) +{ + int clock_cycles_left = 9; + uint64_t *gpioval = &devdata[dev].ipath_gpio_out; + + ipath_eeprom_init = 1; + *gpioval = ipath_kget_kreg64(dev, kr_gpio_out); + _IPATH_VDBG("Resetting i2c flash; initial gpioout reg is %llx\n", + *gpioval); + + /* + * This is to get the i2c into a known state, by first going low, + * then tristate sda (and then tristate scl as first thing in loop) + */ + ipath_scl_out(dev, i2c_line_low, 1); + ipath_sda_out(dev, i2c_line_high, 1); + + while (clock_cycles_left--) { + ipath_scl_out(dev, i2c_line_high, 1); + + if (ipath_sda_in(dev, 0)) { + ipath_sda_out(dev, i2c_line_low, 1); + ipath_scl_out(dev, i2c_line_low, 1); + return 0; + } + + ipath_scl_out(dev, i2c_line_low, 1); + } + + return 1; +} + +/* + * ipath_eeprom_read - Receives x # byte from the eeprom via I2C. + * + * eeprom: Atmel AT24C01 + * + */ + +int ipath_eeprom_read(ipath_type dev, uint8_t eeprom_offset, void *buffer, + int len) +{ + /* compiler complains unless initialized */ + uint8_t single_byte = 0; + int bit_cntr; + + if (!ipath_eeprom_init) + ipath_eeprom_reset(dev); + + eeprom_offset = (eeprom_offset << 1) | READ_CMD; + + if (ipath_i2c_startcmd(dev, eeprom_offset)) { + _IPATH_DBG("Failed startcmd\n"); + stop_cmd(dev); + return 1; + } + + /* + * flash keeps clocking data out as long as we ack, automatically + * incrementing the address. 
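 *
 * (Summary of the sequential-read protocol implemented below, for
 * illustration: START, send (offset << 1) | READ_CMD, then for each
 * byte clock in 8 bits MSB first, ACK every byte except the last,
 * and finish with STOP. A hypothetical caller reading the first 8
 * bytes of the flash would do:
 *	uint8_t buf[8];
 *	if (ipath_eeprom_read(dev, 0, buf, sizeof buf) == 0)
 *		... use buf ...
 * )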
+ */ + while (len-- > 0) { + /* get data */ + single_byte = 0; + for (bit_cntr = 8; bit_cntr; bit_cntr--) { + uint8_t bit; + ipath_scl_out(dev, i2c_line_high, 1); + bit = ipath_sda_in(dev, 0); + single_byte |= bit << (bit_cntr - 1); + ipath_scl_out(dev, i2c_line_low, 1); + } + + /* send ack if not the last byte */ + if (len) + send_ack(dev); + + *((uint8_t *) buffer) = single_byte; + (uint8_t *) buffer++; + } + + stop_cmd(dev); + + return 0; +} + +/* + * ipath_eeprom_write - writes data to the eeprom via I2C. + * +*/ +int ipath_eeprom_write(ipath_type dev, uint8_t eeprom_offset, void *buffer, + int len) +{ + uint8_t single_byte; + int sub_len; + uint8_t *bp = buffer; + int max_wait_time, i; + + if (!ipath_eeprom_init) + ipath_eeprom_reset(dev); + + while (len > 0) { + if (ipath_i2c_startcmd(dev, (eeprom_offset << 1) | WRITE_CMD)) { + _IPATH_DBG("Failed to start cmd offset %u\n", + eeprom_offset); + goto failed_write; + } + + sub_len = min(len, 4); + eeprom_offset += sub_len; + len -= sub_len; + + for (i = 0; i < sub_len; i++) { + if (ipath_wr_byte(dev, *bp++)) { + _IPATH_DBG + ("no ack after byte %u/%u (%u total remain)\n", + i, sub_len, len + sub_len - i); + goto failed_write; + } + } + + stop_cmd(dev); + + /* + * wait for write complete by waiting for a successful + * read (the chip replies with a zero after the write + * cmd completes, and before it writes to the flash. + * The startcmd for the read will fail the ack until + * the writes have completed. We do this inline to avoid + * the debug prints that are in the real read routine + * if the startcmd fails. + */ + max_wait_time = 100; + while (ipath_i2c_startcmd(dev, READ_CMD)) { + stop_cmd(dev); + if (!--max_wait_time) { + _IPATH_DBG + ("Did not get successful read to complete write\n"); + goto failed_write; + } + } + /* now read the zero byte */ + for (i = single_byte = 0; i < 8; i++) { + uint8_t bit; + ipath_scl_out(dev, i2c_line_high, 1); + bit = ipath_sda_in(dev, 0); + ipath_scl_out(dev, i2c_line_low, 1); + single_byte <<= 1; + single_byte |= bit; + } + stop_cmd(dev); + } + + return 0; + +failed_write: + stop_cmd(dev); + return 1; +} + +uint8_t ipath_flash_csum(struct ipath_flash * ifp, int adjust) +{ + uint8_t *ip = (uint8_t *) ifp; + uint8_t csum = 0, len; + + for (len = 0; len < ifp->if_length; len++) + csum += *ip++; + csum -= ifp->if_csum; + csum = ~csum; + if (adjust) + ifp->if_csum = csum; + return csum; +} diff --git a/drivers/infiniband/hw/ipath/ipath_lib.c b/drivers/infiniband/hw/ipath/ipath_lib.c new file mode 100644 index 0000000..d9a40b8 --- /dev/null +++ b/drivers/infiniband/hw/ipath/ipath_lib.c @@ -0,0 +1,92 @@ +/* + * Copyright (c) 2003, 2004, 2005. PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. + * + * $Id: ipath_lib.c 4365 2005-12-10 00:04:16Z rjwalsh $ + */ + +/* + * This is library code for the driver, similar to what's in libinfinipath for + * usermode code. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "ipath_kernel.h" + +unsigned infinipath_debug = __IPATH_INFO; + +uint32_t _ipath_pico_per_cycle; /* always present, for now */ + +/* + * This isn't perfect, but it's close enough for timing work. We want this + * to work on systems where the cycle counter isn't the same as the clock + * frequency. The one msec spin is OK, since we execute this only once + * when first loaded. We don't use CURRENT_TIME because on some systems + * it only has jiffy resolution; we just assume udelay is well calibrated + * and that we aren't likely to be rescheduled. Do it multiple times, + * with a yield in between, to try to make sure we get the "true minimum" + * value. + * _ipath_pico_per_cycle isn't going to lead to completely accurate + * conversions from timestamps to nanoseconds, but it's close enough + * for our purposes, which is mainly to allow people to show events with + * nsecs or usecs if desired, rather than cycles. + */ +void ipath_init_picotime(void) +{ + int i; + u_int64_t ts, te, delta = -1ULL; + + for (i = 0; i < 5; i++) { + ts = get_cycles(); + udelay(250); + te = get_cycles(); + if ((te - ts) < delta) + delta = te - ts; + yield(); + } + _ipath_pico_per_cycle = 250000000 / delta; +} diff --git a/drivers/infiniband/hw/ipath/ipath_mlock.c b/drivers/infiniband/hw/ipath/ipath_mlock.c new file mode 100644 index 0000000..72eb7c0 --- /dev/null +++ b/drivers/infiniband/hw/ipath/ipath_mlock.c @@ -0,0 +1,139 @@ +/* + * Copyright (c) 2003, 2004, 2005. PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. + * + * $Id: ipath_mlock.c 4365 2005-12-10 00:04:16Z rjwalsh $ + */ + +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#include "ipath_kernel.h" + +/* + * Our version of the kernel mlock function. This function is no longer + * exposed, so we need to do it ourselves. It takes a given start page + * (page aligned user virtual address) and pins it and the following specified + * number of pages. + * For now, num_pages is always 1, but that will probably change at some + * point (because caller is doing expected sends on a single virtually + * contiguous buffer, so we can do all pages at once). + */ +int ipath_mlock(unsigned long start_page, size_t num_pages, struct page **p) +{ + int n; + + _IPATH_VDBG("pin %lx pages from vaddr %lx\n", num_pages, start_page); + down_read(¤t->mm->mmap_sem); + n = get_user_pages(current, current->mm, start_page, num_pages, 1, 1, + p, NULL); + up_read(¤t->mm->mmap_sem); + if (n != num_pages) { + _IPATH_INFO + ("get_user_pages (0x%lx pages starting at 0x%lx failed with %d\n", + num_pages, start_page, n); + if (n < 0) /* it's an errno */ + return n; + return -ENOMEM; /* no way to know actual error */ + } + + return 0; +} + +/* + * this is similar to ipath_mlock, but it's always one page, and we mark + * the page as locked for i/o, and shared. This is used for the user process + * page that contains the destination address for the rcvhdrq tail update, + * so we need to have the vma. If we don't do this, the page can be taken + * away from us on fork, even if the child never touches it, and then + * the user process never sees the tail register updates. + */ +int ipath_mlock_nocopy(unsigned long start_page, struct page **p) +{ + int n; + struct vm_area_struct *vm = NULL; + + down_read(¤t->mm->mmap_sem); + n = get_user_pages(current, current->mm, start_page, 1, 1, 1, p, &vm); + up_read(¤t->mm->mmap_sem); + if (n != 1) { + _IPATH_INFO("get_user_pages for 0x%lx failed with %d\n", + start_page, n); + if (n < 0) /* it's an errno */ + return n; + return -ENOMEM; /* no way to know actual error */ + } + vm->vm_flags |= VM_SHM | VM_LOCKED; + + return 0; +} + +/* + * Our version of the kernel munlock function. This function is no longer + * exposed, so we need to do it ourselves. It unpins the start page + * (a page aligned full user virtual address, not a page number) + * and pins it and the following specified number of pages. + */ +int ipath_munlock(size_t num_pages, struct page **p) +{ + int i; + + for (i = 0; i < num_pages; i++) { + _IPATH_MMDBG("%u/%lu put_page %p\n", i, num_pages, p[i]); + SetPageDirty(p[i]); + put_page(p[i]); + } + return 0; +} + +/* + * This routine frees up all the allocations made in this file; it's a nop + * now, but I'm leaving it in case we go back to a more sophisticated + * implementation later. 
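 *
 * (Illustrative sketch of the pin/unpin pairing the functions above
 * expect; uaddr and the DMA step are hypothetical:
 *	struct page *pages[1];
 *	if (!ipath_mlock(uaddr & PAGE_MASK, 1, pages)) {
 *		... let the chip DMA into the page ...
 *		ipath_munlock(1, pages);   marks dirty, drops the pin
 *	}
 * ipath_mlock_nocopy() is the variant for the single rcvhdrq tail
 * page; it additionally marks the vma VM_SHM | VM_LOCKED so fork()
 * cannot take the page away via copy-on-write.)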
+ */ +void ipath_mlock_cleanup(ipath_portdata * pd) +{ +} -- 0.99.9n From rolandd at cisco.com Fri Dec 16 15:48:55 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 16 Dec 2005 15:48:55 -0800 Subject: [openib-general] [PATCH 11/13] [RFC] ipath verbs, part 2 In-Reply-To: <200512161548.W9sJn4CLmdhnSTcH@cisco.com> Message-ID: <200512161548.mhIvDiba3wkjPaMc@cisco.com> Second half of ipath verbs --- drivers/infiniband/hw/ipath/ipath_verbs.c | 2931 +++++++++++++++++++++++++++++ 1 files changed, 2931 insertions(+), 0 deletions(-) 3f617d81354835f183e089849cca09e295b2df0a diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index 808326e..25d738d 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -3242,3 +3242,2934 @@ static int get_rwqe(struct ipath_qp *qp, spin_unlock(&rq->lock); return 1; } + +/* + * This is called from ipath_qp_rcv() to process an incomming UC packet + * for the given QP. + * Called at interrupt level. + */ +static void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, + int has_grh, void *data, u32 tlen, struct ipath_qp *qp) +{ + struct ipath_other_headers *ohdr; + int opcode; + u32 hdrsize; + u32 psn; + u32 pad; + unsigned long flags; + struct ib_wc wc; + u32 pmtu = ib_mtu_enum_to_int(qp->path_mtu); + struct ib_reth *reth; + + /* Check for GRH */ + if (!has_grh) { + ohdr = &hdr->u.oth; + hdrsize = 8 + 12; /* LRH + BTH */ + psn = be32_to_cpu(ohdr->bth[2]); + } else { + ohdr = &hdr->u.l.oth; + hdrsize = 8 + 40 + 12; /* LRH + GRH + BTH */ + /* + * The header with GRH is 60 bytes and the + * core driver sets the eager header buffer + * size to 56 bytes so the last 4 bytes of + * the BTH header (PSN) is in the data buffer. + */ + psn = be32_to_cpu(((u32 *) data)[0]); + data += sizeof(u32); + } + /* + * The opcode is in the low byte when its in network order + * (top byte when in host order). + */ + opcode = *(u8 *) (&ohdr->bth[0]); + + wc.imm_data = 0; + wc.wc_flags = 0; + + spin_lock_irqsave(&qp->r_rq.lock, flags); + + /* Compare the PSN verses the expected PSN. */ + if (unlikely(cmp24(psn, qp->r_psn) != 0)) { + /* + * Handle a sequence error. + * Silently drop any current message. + */ + qp->r_psn = psn; + inv: + qp->r_state = IB_OPCODE_UC_SEND_LAST; + switch (opcode) { + case IB_OPCODE_UC_SEND_FIRST: + case IB_OPCODE_UC_SEND_ONLY: + case IB_OPCODE_UC_SEND_ONLY_WITH_IMMEDIATE: + goto send_first; + + case IB_OPCODE_UC_RDMA_WRITE_FIRST: + case IB_OPCODE_UC_RDMA_WRITE_ONLY: + case IB_OPCODE_UC_RDMA_WRITE_ONLY_WITH_IMMEDIATE: + goto rdma_first; + + default: + dev->n_pkt_drops++; + goto done; + } + } + + /* Check for opcode sequence errors. 
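 * In UC, an opcode is only valid if it continues the message in
 * progress: after SEND_FIRST/MIDDLE only SEND_MIDDLE or a SEND_LAST
 * variant may follow, after RDMA_WRITE_FIRST/MIDDLE only the WRITE
 * continuations, and from any resting state only a FIRST or ONLY
 * opcode may start a new message. Anything else resynchronizes via
 * the inv path above, silently dropping the partial message.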
*/ + switch (qp->r_state) { + case IB_OPCODE_UC_SEND_FIRST: + case IB_OPCODE_UC_SEND_MIDDLE: + if (opcode == IB_OPCODE_UC_SEND_MIDDLE || + opcode == IB_OPCODE_UC_SEND_LAST || + opcode == IB_OPCODE_UC_SEND_LAST_WITH_IMMEDIATE) + break; + goto inv; + + case IB_OPCODE_UC_RDMA_WRITE_FIRST: + case IB_OPCODE_UC_RDMA_WRITE_MIDDLE: + if (opcode == IB_OPCODE_UC_RDMA_WRITE_MIDDLE || + opcode == IB_OPCODE_UC_RDMA_WRITE_LAST || + opcode == IB_OPCODE_UC_RDMA_WRITE_LAST_WITH_IMMEDIATE) + break; + goto inv; + + default: + if (opcode == IB_OPCODE_UC_SEND_FIRST || + opcode == IB_OPCODE_UC_SEND_ONLY || + opcode == IB_OPCODE_UC_SEND_ONLY_WITH_IMMEDIATE || + opcode == IB_OPCODE_UC_RDMA_WRITE_FIRST || + opcode == IB_OPCODE_UC_RDMA_WRITE_ONLY || + opcode == IB_OPCODE_UC_RDMA_WRITE_ONLY_WITH_IMMEDIATE) + break; + goto inv; + } + + /* OK, process the packet. */ + switch (opcode) { + case IB_OPCODE_UC_SEND_FIRST: + case IB_OPCODE_UC_SEND_ONLY: + case IB_OPCODE_UC_SEND_ONLY_WITH_IMMEDIATE: + send_first: + if (qp->r_reuse_sge) { + qp->r_reuse_sge = 0; + qp->r_sge = qp->s_rdma_sge; + } else if (!get_rwqe(qp, 0)) { + dev->n_pkt_drops++; + goto done; + } + /* Save the WQE so we can reuse it in case of an error. */ + qp->s_rdma_sge = qp->r_sge; + qp->r_rcv_len = 0; + if (opcode == IB_OPCODE_UC_SEND_ONLY) + goto send_last; + else if (opcode == IB_OPCODE_UC_SEND_ONLY_WITH_IMMEDIATE) + goto send_last_imm; + /* FALLTHROUGH */ + case IB_OPCODE_UC_SEND_MIDDLE: + /* Check for invalid length PMTU or posted rwqe len. */ + if (unlikely(tlen != (hdrsize + pmtu + 4))) { + qp->r_reuse_sge = 1; + dev->n_pkt_drops++; + goto done; + } + qp->r_rcv_len += pmtu; + if (unlikely(qp->r_rcv_len > qp->r_len)) { + qp->r_reuse_sge = 1; + dev->n_pkt_drops++; + goto done; + } + copy_sge(&qp->r_sge, data, pmtu); + break; + + case IB_OPCODE_UC_SEND_LAST_WITH_IMMEDIATE: + send_last_imm: + if (has_grh) { + wc.imm_data = *(u32 *) data; + data += sizeof(u32); + } else { + /* Immediate data comes after BTH */ + wc.imm_data = ohdr->u.imm_data; + } + hdrsize += 4; + wc.wc_flags = IB_WC_WITH_IMM; + /* FALLTHROUGH */ + case IB_OPCODE_UC_SEND_LAST: + send_last: + /* Get the number of bytes the message was padded by. */ + pad = (ohdr->bth[0] >> 12) & 3; + /* Check for invalid length. */ + /* XXX LAST len should be >= 1 */ + if (unlikely(tlen < (hdrsize + pad + 4))) { + qp->r_reuse_sge = 1; + dev->n_pkt_drops++; + goto done; + } + /* Don't count the CRC. */ + tlen -= (hdrsize + pad + 4); + wc.byte_len = tlen + qp->r_rcv_len; + if (unlikely(wc.byte_len > qp->r_len)) { + qp->r_reuse_sge = 1; + dev->n_pkt_drops++; + goto done; + } + /* XXX Need to free SGEs */ + last_imm: + copy_sge(&qp->r_sge, data, tlen); + wc.wr_id = qp->r_wr_id; + wc.status = IB_WC_SUCCESS; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.qp_num = qp->ibqp.qp_num; + wc.src_qp = qp->remote_qpn; + wc.pkey_index = 0; + wc.slid = qp->remote_ah_attr.dlid; + wc.sl = qp->remote_ah_attr.sl; + wc.dlid_path_bits = 0; + wc.port_num = 0; + /* Signal completion event if the solicited bit is set. 
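 * The solicited bit is bit 23 of the first BTH word; it is tested
 * below while still in network byte order, hence the
 * __constant_cpu_to_be32(1 << 23) rather than a shift after swapping.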
*/ + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, + ohdr->bth[0] & __constant_cpu_to_be32(1 << 23)); + break; + + case IB_OPCODE_UC_RDMA_WRITE_FIRST: + case IB_OPCODE_UC_RDMA_WRITE_ONLY: + case IB_OPCODE_UC_RDMA_WRITE_ONLY_WITH_IMMEDIATE: /* consume RWQE */ + rdma_first: + /* RETH comes after BTH */ + if (!has_grh) + reth = &ohdr->u.rc.reth; + else { + reth = (struct ib_reth *)data; + data += sizeof(*reth); + } + hdrsize += sizeof(*reth); + qp->r_len = be32_to_cpu(reth->length); + qp->r_rcv_len = 0; + if (qp->r_len != 0) { + u32 rkey = be32_to_cpu(reth->rkey); + u64 vaddr = be64_to_cpu(reth->vaddr); + + /* Check rkey */ + if (unlikely(!ipath_rkey_ok(dev, &qp->r_sge, qp->r_len, + vaddr, rkey, + IB_ACCESS_REMOTE_WRITE))) { + dev->n_pkt_drops++; + goto done; + } + } else { + qp->r_sge.sg_list = NULL; + qp->r_sge.sge.mr = NULL; + qp->r_sge.sge.vaddr = NULL; + qp->r_sge.sge.length = 0; + qp->r_sge.sge.sge_length = 0; + } + if (unlikely(!(qp->qp_access_flags & IB_ACCESS_REMOTE_WRITE))) { + dev->n_pkt_drops++; + goto done; + } + if (opcode == IB_OPCODE_UC_RDMA_WRITE_ONLY) + goto rdma_last; + else if (opcode == IB_OPCODE_UC_RDMA_WRITE_ONLY_WITH_IMMEDIATE) + goto rdma_last_imm; + /* FALLTHROUGH */ + case IB_OPCODE_UC_RDMA_WRITE_MIDDLE: + /* Check for invalid length PMTU or posted rwqe len. */ + if (unlikely(tlen != (hdrsize + pmtu + 4))) { + dev->n_pkt_drops++; + goto done; + } + qp->r_rcv_len += pmtu; + if (unlikely(qp->r_rcv_len > qp->r_len)) { + dev->n_pkt_drops++; + goto done; + } + copy_sge(&qp->r_sge, data, pmtu); + break; + + case IB_OPCODE_UC_RDMA_WRITE_LAST_WITH_IMMEDIATE: + rdma_last_imm: + /* Get the number of bytes the message was padded by. */ + pad = (ohdr->bth[0] >> 12) & 3; + /* Check for invalid length. */ + /* XXX LAST len should be >= 1 */ + if (unlikely(tlen < (hdrsize + pad + 4))) { + dev->n_pkt_drops++; + goto done; + } + /* Don't count the CRC. */ + tlen -= (hdrsize + pad + 4); + if (unlikely(tlen + qp->r_rcv_len != qp->r_len)) { + dev->n_pkt_drops++; + goto done; + } + if (qp->r_reuse_sge) { + qp->r_reuse_sge = 0; + } else if (!get_rwqe(qp, 1)) { + dev->n_pkt_drops++; + goto done; + } + if (has_grh) { + wc.imm_data = *(u32 *) data; + data += sizeof(u32); + } else { + /* Immediate data comes after BTH */ + wc.imm_data = ohdr->u.imm_data; + } + hdrsize += 4; + wc.wc_flags = IB_WC_WITH_IMM; + wc.byte_len = 0; + goto last_imm; + + case IB_OPCODE_UC_RDMA_WRITE_LAST: + rdma_last: + /* Get the number of bytes the message was padded by. */ + pad = (ohdr->bth[0] >> 12) & 3; + /* Check for invalid length. */ + /* XXX LAST len should be >= 1 */ + if (unlikely(tlen < (hdrsize + pad + 4))) { + dev->n_pkt_drops++; + goto done; + } + /* Don't count the CRC. */ + tlen -= (hdrsize + pad + 4); + if (unlikely(tlen + qp->r_rcv_len != qp->r_len)) { + dev->n_pkt_drops++; + goto done; + } + copy_sge(&qp->r_sge, data, tlen); + break; + + default: + /* Drop packet for unknown opcodes. */ + spin_unlock_irqrestore(&qp->r_rq.lock, flags); + dev->n_pkt_drops++; + return; + } + qp->r_psn++; + qp->r_state = opcode; +done: + spin_unlock_irqrestore(&qp->r_rq.lock, flags); +} + +/* + * Put this QP on the RNR timeout list for the device. + * XXX Use a simple list for now. We might need a priority + * queue if we have lots of QPs waiting for RNR timeouts + * but that should be rare. 
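 * Timeouts on this list are delta-encoded: each entry stores its
 * timeout relative to the entries ahead of it. Worked example: a list
 * with absolute expiries 3 and 8 holds deltas {3, 5}; inserting a QP
 * with timeout 7 walks past the first entry (subtracting 3, leaving
 * 4), stops because 4 < 5, and links itself in with delta 4 after
 * the first entry.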
+ */ +static void insert_rnr_queue(struct ipath_qp *qp) +{ + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + unsigned long flags; + + spin_lock_irqsave(&dev->pending_lock, flags); + if (list_empty(&dev->rnrwait)) + list_add(&qp->timerwait, &dev->rnrwait); + else { + struct list_head *l = &dev->rnrwait; + struct ipath_qp *nqp = list_entry(l->next, struct ipath_qp, + timerwait); + + while (qp->s_rnr_timeout >= nqp->s_rnr_timeout) { + qp->s_rnr_timeout -= nqp->s_rnr_timeout; + l = l->next; + if (l->next == &dev->rnrwait) + break; + nqp = list_entry(l->next, struct ipath_qp, timerwait); + } + list_add(&qp->timerwait, l); + } + spin_unlock_irqrestore(&dev->pending_lock, flags); +} + +/* + * This is called from do_uc_send() or do_rc_send() to forward a WQE addressed + * to the same HCA. + * Note that although we are single threaded due to the tasklet, we still + * have to protect against post_send(). We don't have to worry about + * receive interrupts since this is a connected protocol and all packets + * will pass through here. + */ +static void ipath_ruc_loopback(struct ipath_qp *sqp, struct ib_wc *wc) +{ + struct ipath_ibdev *dev = to_idev(sqp->ibqp.device); + struct ipath_qp *qp; + struct ipath_swqe *wqe; + struct ipath_sge *sge; + unsigned long flags; + u64 sdata; + + qp = ipath_lookup_qpn(&dev->qp_table, sqp->remote_qpn); + if (!qp) { + dev->n_pkt_drops++; + return; + } + +again: + spin_lock_irqsave(&sqp->s_lock, flags); + + if (!(state_ops[sqp->state] & IPATH_PROCESS_SEND_OK)) { + spin_unlock_irqrestore(&sqp->s_lock, flags); + goto done; + } + + /* Get the next send request. */ + if (sqp->s_last == sqp->s_head) { + /* Send work queue is empty. */ + spin_unlock_irqrestore(&sqp->s_lock, flags); + goto done; + } + + /* + * We can rely on the entry not changing without the s_lock + * being held until we update s_last. 
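 * In other words: post_send() only appends at s_head, and this
 * function is the only consumer that advances s_last, so once s_last
 * has been read under the lock the WQE it names stays stable until
 * we finish with it and bump s_last at the bottom.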
+ */ + wqe = get_swqe_ptr(sqp, sqp->s_last); + spin_unlock_irqrestore(&sqp->s_lock, flags); + + wc->wc_flags = 0; + wc->imm_data = 0; + + sqp->s_sge.sge = wqe->sg_list[0]; + sqp->s_sge.sg_list = wqe->sg_list + 1; + sqp->s_sge.num_sge = wqe->wr.num_sge; + sqp->s_len = wqe->length; + switch (wqe->wr.opcode) { + case IB_WR_SEND_WITH_IMM: + wc->wc_flags = IB_WC_WITH_IMM; + wc->imm_data = wqe->wr.imm_data; + /* FALLTHROUGH */ + case IB_WR_SEND: + spin_lock_irqsave(&qp->r_rq.lock, flags); + if (!get_rwqe(qp, 0)) { + rnr_nak: + spin_unlock_irqrestore(&qp->r_rq.lock, flags); + /* Handle RNR NAK */ + if (qp->ibqp.qp_type == IB_QPT_UC) + goto send_comp; + if (sqp->s_rnr_retry == 0) { + wc->status = IB_WC_RNR_RETRY_EXC_ERR; + goto err; + } + if (sqp->s_rnr_retry_cnt < 7) + sqp->s_rnr_retry--; + dev->n_rnr_naks++; + sqp->s_rnr_timeout = rnr_table[sqp->s_min_rnr_timer]; + insert_rnr_queue(sqp); + goto done; + } + spin_unlock_irqrestore(&qp->r_rq.lock, flags); + break; + + case IB_WR_RDMA_WRITE_WITH_IMM: + wc->wc_flags = IB_WC_WITH_IMM; + wc->imm_data = wqe->wr.imm_data; + spin_lock_irqsave(&qp->r_rq.lock, flags); + if (!get_rwqe(qp, 1)) + goto rnr_nak; + spin_unlock_irqrestore(&qp->r_rq.lock, flags); + /* FALLTHROUGH */ + case IB_WR_RDMA_WRITE: + if (wqe->length == 0) + break; + if (unlikely(!ipath_rkey_ok(dev, &qp->r_sge, wqe->length, + wqe->wr.wr.rdma.remote_addr, + wqe->wr.wr.rdma.rkey, + IB_ACCESS_REMOTE_WRITE))) { + acc_err: + wc->status = IB_WC_REM_ACCESS_ERR; + err: + wc->wr_id = wqe->wr.wr_id; + wc->opcode = wc_opcode[wqe->wr.opcode]; + wc->vendor_err = 0; + wc->byte_len = 0; + wc->qp_num = sqp->ibqp.qp_num; + wc->src_qp = sqp->remote_qpn; + wc->pkey_index = 0; + wc->slid = sqp->remote_ah_attr.dlid; + wc->sl = sqp->remote_ah_attr.sl; + wc->dlid_path_bits = 0; + wc->port_num = 0; + ipath_sqerror_qp(sqp, wc); + goto done; + } + break; + + case IB_WR_RDMA_READ: + if (unlikely(!ipath_rkey_ok(dev, &sqp->s_sge, wqe->length, + wqe->wr.wr.rdma.remote_addr, + wqe->wr.wr.rdma.rkey, + IB_ACCESS_REMOTE_READ))) { + goto acc_err; + } + if (unlikely(!(qp->qp_access_flags & IB_ACCESS_REMOTE_READ))) + goto acc_err; + qp->r_sge.sge = wqe->sg_list[0]; + qp->r_sge.sg_list = wqe->sg_list + 1; + qp->r_sge.num_sge = wqe->wr.num_sge; + break; + + case IB_WR_ATOMIC_CMP_AND_SWP: + case IB_WR_ATOMIC_FETCH_AND_ADD: + if (unlikely(!ipath_rkey_ok(dev, &qp->r_sge, sizeof(u64), + wqe->wr.wr.rdma.remote_addr, + wqe->wr.wr.rdma.rkey, + IB_ACCESS_REMOTE_ATOMIC))) { + goto acc_err; + } + /* Perform atomic OP and save result. 
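 * Under dev->pending_lock: fetch-and-add stores old + sdata (sdata is
 * taken from the swap field here) and compare-and-swap stores the
 * swap value only when old == compare_add; either way the old value
 * is handed back to the requester through sqp->s_sge just after the
 * unlock.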
 */
+		sdata = wqe->wr.wr.atomic.swap;
+		spin_lock_irqsave(&dev->pending_lock, flags);
+		qp->r_atomic_data = *(u64 *) qp->r_sge.sge.vaddr;
+		if (wqe->wr.opcode == IB_WR_ATOMIC_FETCH_AND_ADD) {
+			*(u64 *) qp->r_sge.sge.vaddr =
+			    qp->r_atomic_data + sdata;
+		} else if (qp->r_atomic_data == wqe->wr.wr.atomic.compare_add) {
+			*(u64 *) qp->r_sge.sge.vaddr = sdata;
+		}
+		spin_unlock_irqrestore(&dev->pending_lock, flags);
+		*(u64 *) sqp->s_sge.sge.vaddr = qp->r_atomic_data;
+		goto send_comp;
+
+	default:
+		goto done;
+	}
+
+	sge = &sqp->s_sge.sge;
+	while (sqp->s_len) {
+		u32 len = sqp->s_len;
+
+		if (len > sge->length)
+			len = sge->length;
+		BUG_ON(len == 0);
+		copy_sge(&qp->r_sge, sge->vaddr, len);
+		sge->vaddr += len;
+		sge->length -= len;
+		sge->sge_length -= len;
+		if (sge->sge_length == 0) {
+			if (--sqp->s_sge.num_sge)
+				*sge = *sqp->s_sge.sg_list++;
+		} else if (sge->length == 0 && sge->mr != NULL) {
+			if (++sge->n >= IPATH_SEGSZ) {
+				if (++sge->m >= sge->mr->mapsz)
+					break;
+				sge->n = 0;
+			}
+			sge->vaddr = sge->mr->map[sge->m]->segs[sge->n].vaddr;
+			sge->length = sge->mr->map[sge->m]->segs[sge->n].length;
+		}
+		sqp->s_len -= len;
+	}
+
+	if (wqe->wr.opcode == IB_WR_RDMA_WRITE ||
+	    wqe->wr.opcode == IB_WR_RDMA_READ)
+		goto send_comp;
+
+	if (wqe->wr.opcode == IB_WR_RDMA_WRITE_WITH_IMM)
+		wc->opcode = IB_WC_RECV_RDMA_WITH_IMM;
+	else
+		wc->opcode = IB_WC_RECV;
+	wc->wr_id = qp->r_wr_id;
+	wc->status = IB_WC_SUCCESS;
+	wc->vendor_err = 0;
+	wc->byte_len = wqe->length;
+	wc->qp_num = qp->ibqp.qp_num;
+	wc->src_qp = qp->remote_qpn;
+	/* XXX do we know which pkey matched? Only needed for GSI. */
+	wc->pkey_index = 0;
+	wc->slid = qp->remote_ah_attr.dlid;
+	wc->sl = qp->remote_ah_attr.sl;
+	wc->dlid_path_bits = 0;
+	/* Signal completion event if the solicited bit is set. */
+	ipath_cq_enter(to_icq(qp->ibqp.recv_cq), wc,
+		       wqe->wr.send_flags & IB_SEND_SOLICITED);
+
+send_comp:
+	sqp->s_rnr_retry = sqp->s_rnr_retry_cnt;
+
+	if (!test_bit(IPATH_S_SIGNAL_REQ_WR, &sqp->s_flags) ||
+	    (wqe->wr.send_flags & IB_SEND_SIGNALED)) {
+		wc->wr_id = wqe->wr.wr_id;
+		wc->status = IB_WC_SUCCESS;
+		wc->opcode = wc_opcode[wqe->wr.opcode];
+		wc->vendor_err = 0;
+		wc->byte_len = wqe->length;
+		wc->qp_num = sqp->ibqp.qp_num;
+		wc->src_qp = 0;
+		wc->pkey_index = 0;
+		wc->slid = 0;
+		wc->sl = 0;
+		wc->dlid_path_bits = 0;
+		wc->port_num = 0;
+		ipath_cq_enter(to_icq(sqp->ibqp.send_cq), wc, 0);
+	}
+
+	/* Update s_last now that we are finished with the SWQE */
+	spin_lock_irqsave(&sqp->s_lock, flags);
+	if (++sqp->s_last >= sqp->s_size)
+		sqp->s_last = 0;
+	spin_unlock_irqrestore(&sqp->s_lock, flags);
+	goto again;
+
+done:
+	if (atomic_dec_and_test(&qp->refcount))
+		wake_up(&qp->wait);
+}
+
+/*
+ * Process the credit field of an incoming AETH and raise the QP's
+ * send credit limit (s_lsn) accordingly.
+ * The QP s_lock should be held.
+ */
+static void ipath_get_credit(struct ipath_qp *qp, u32 aeth)
+{
+	u32 credit = (aeth >> 24) & 0x1F;
+
+	/*
+	 * If credit == 0x1F, credit is invalid and we can send
+	 * as many packets as we like. Otherwise, we have to
+	 * honor the credit field.
+	 */
+	if (credit == 0x1F) {
+		qp->s_lsn = (u32) -1;
+	} else if (qp->s_lsn != (u32) -1) {
+		/* Compute new LSN (i.e., MSN + credit) */
+		credit = (aeth + credit_table[credit]) & 0xFFFFFF;
+		if (cmp24(credit, qp->s_lsn) > 0)
+			qp->s_lsn = credit;
+	}
+
+	/* Restart sending if it was blocked due to lack of credits.
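 * The test below kicks the send tasklet only when an unsent WQE
 * exists (s_cur != s_head) and either the limit is disabled
 * (s_lsn == (u32) -1) or the next WQE's SSN is within the freshly
 * computed limit.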
*/ + if (qp->s_cur != qp->s_head && + (qp->s_lsn == (u32) -1 || + cmp24(get_swqe_ptr(qp, qp->s_cur)->ssn, qp->s_lsn + 1) <= 0)) { + tasklet_schedule(&qp->s_task); + } +} + +/* + * This is called from ipath_rc_rcv() to process an incomming RC ACK + * for the given QP. + * Called at interrupt level with the QP s_lock held. + * Returns 1 if OK, 0 if current operation should be aborted (NAK). + */ +static int do_rc_ack(struct ipath_qp *qp, u32 aeth, u32 psn, int opcode) +{ + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + struct ib_wc wc; + struct ipath_swqe *wqe; + + /* + * Remove the QP from the timeout queue (or RNR timeout queue). + * If ipath_ib_timer() has already removed it, + * it's OK since we hold the QP s_lock and ipath_restart_rc() + * just won't find anything to restart if we ACK everything. + */ + spin_lock(&dev->pending_lock); + if (qp->timerwait.next != LIST_POISON1) + list_del(&qp->timerwait); + spin_unlock(&dev->pending_lock); + + /* + * Note that NAKs implicitly ACK outstanding SEND and + * RDMA write requests and implicitly NAK RDMA read and + * atomic requests issued before the NAK'ed request. + * The MSN won't include the NAK'ed request but will include + * an ACK'ed request(s). + */ + wqe = get_swqe_ptr(qp, qp->s_last); + + /* Nothing is pending to ACK/NAK. */ + if (qp->s_last == qp->s_tail) + return 0; + + /* + * The MSN might be for a later WQE than the PSN indicates so + * only complete WQEs that the PSN finishes. + */ + while (cmp24(psn, wqe->lpsn) >= 0) { + /* If we are ACKing a WQE, the MSN should be >= the SSN. */ + if (cmp24(aeth, wqe->ssn) < 0) + break; + /* + * If this request is a RDMA read or atomic, and the ACK is + * for a later operation, this ACK NAKs the RDMA read or atomic. + * In other words, only a RDMA_READ_LAST or ONLY can ACK + * a RDMA read and likewise for atomic ops. + * Note that the NAK case can only happen if relaxed ordering + * is used and requests are sent after an RDMA read + * or atomic is sent but before the response is received. + */ + if ((wqe->wr.opcode == IB_WR_RDMA_READ && + opcode != IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST) || + ((wqe->wr.opcode == IB_WR_ATOMIC_CMP_AND_SWP || + wqe->wr.opcode == IB_WR_ATOMIC_FETCH_AND_ADD) && + (opcode != IB_OPCODE_RC_ATOMIC_ACKNOWLEDGE || + cmp24(wqe->psn, psn) != 0))) { + /* The last valid PSN seen is the previous request's. */ + qp->s_last_psn = wqe->psn - 1; + /* Retry this request. */ + ipath_restart_rc(qp, wqe->psn, &wc); + /* + * No need to process the ACK/NAK since we are + * restarting an earlier request. + */ + return 0; + } + /* Post a send completion queue entry if requested. */ + if (!test_bit(IPATH_S_SIGNAL_REQ_WR, &qp->s_flags) || + (wqe->wr.send_flags & IB_SEND_SIGNALED)) { + wc.wr_id = wqe->wr.wr_id; + wc.status = IB_WC_SUCCESS; + wc.opcode = wc_opcode[wqe->wr.opcode]; + wc.vendor_err = 0; + wc.byte_len = wqe->length; + wc.qp_num = qp->ibqp.qp_num; + wc.src_qp = qp->remote_qpn; + wc.pkey_index = 0; + wc.slid = qp->remote_ah_attr.dlid; + wc.sl = qp->remote_ah_attr.sl; + wc.dlid_path_bits = 0; + wc.port_num = 0; + ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, 0); + } + qp->s_retry = qp->s_retry_cnt; + /* + * If we are completing a request which is in the process + * of being resent, we can stop resending it since we know + * the responder has already seen it. 
+ */ + if (qp->s_last == qp->s_cur) { + if (++qp->s_cur >= qp->s_size) + qp->s_cur = 0; + wqe = get_swqe_ptr(qp, qp->s_cur); + qp->s_state = IB_OPCODE_RC_SEND_LAST; + qp->s_psn = wqe->psn; + } + if (++qp->s_last >= qp->s_size) + qp->s_last = 0; + wqe = get_swqe_ptr(qp, qp->s_last); + if (qp->s_last == qp->s_tail) + break; + } + + switch (aeth >> 29) { + case 0: /* ACK */ + dev->n_rc_acks++; + /* If this is a partial ACK, reset the retransmit timer. */ + if (qp->s_last != qp->s_tail) { + spin_lock(&dev->pending_lock); + list_add_tail(&qp->timerwait, + &dev->pending[dev->pending_index]); + spin_unlock(&dev->pending_lock); + } + ipath_get_credit(qp, aeth); + qp->s_rnr_retry = qp->s_rnr_retry_cnt; + qp->s_retry = qp->s_retry_cnt; + qp->s_last_psn = psn; + return 1; + + case 1: /* RNR NAK */ + dev->n_rnr_naks++; + if (qp->s_rnr_retry == 0) { + if (qp->s_last == qp->s_tail) + return 0; + + wc.status = IB_WC_RNR_RETRY_EXC_ERR; + goto class_b; + } + if (qp->s_rnr_retry_cnt < 7) + qp->s_rnr_retry--; + if (qp->s_last == qp->s_tail) + return 0; + + /* The last valid PSN seen is the previous request's. */ + qp->s_last_psn = wqe->psn - 1; + + /* Restart this request after the RNR timeout. */ + wqe = get_swqe_ptr(qp, qp->s_last); + + dev->n_rc_resends += (int)qp->s_psn - (int)psn; + + /* + * If we are starting the request from the beginning, let the + * normal send code handle initialization. + */ + qp->s_cur = qp->s_last; + if (cmp24(psn, wqe->psn) <= 0) { + qp->s_state = IB_OPCODE_RC_SEND_LAST; + qp->s_psn = wqe->psn; + } else { + u32 n; + + n = qp->s_cur; + for (;;) { + if (++n == qp->s_size) + n = 0; + if (n == qp->s_tail) { + if (cmp24(psn, qp->s_next_psn) >= 0) { + qp->s_cur = n; + wqe = get_swqe_ptr(qp, n); + } + break; + } + wqe = get_swqe_ptr(qp, n); + if (cmp24(psn, wqe->psn) < 0) + break; + qp->s_cur = n; + } + qp->s_psn = psn; + + /* + * Set the state to restart in the middle of a request. + * Don't change the s_sge, s_cur_sge, or s_cur_size. + * See do_rc_send(). + */ + switch (wqe->wr.opcode) { + case IB_WR_SEND: + case IB_WR_SEND_WITH_IMM: + qp->s_state = + IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST; + break; + + case IB_WR_RDMA_WRITE: + case IB_WR_RDMA_WRITE_WITH_IMM: + qp->s_state = + IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST; + break; + + case IB_WR_RDMA_READ: + qp->s_state = + IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE; + break; + + default: + /* + * This case shouldn't happen since its only + * one PSN per req. + */ + qp->s_state = IB_OPCODE_RC_SEND_LAST; + } + } + + qp->s_rnr_timeout = rnr_table[(aeth >> 24) & 0x1F]; + insert_rnr_queue(qp); + return 0; + + case 3: /* NAK */ + /* The last valid PSN seen is the previous request's. */ + if (qp->s_last != qp->s_tail) + qp->s_last_psn = wqe->psn - 1; + switch ((aeth >> 24) & 0x1F) { + case 0: /* PSN sequence error */ + dev->n_seq_naks++; + /* + * Back up to the responder's expected PSN. + * XXX Note that we might get a NAK in the + * middle of an RDMA READ response which + * terminates the RDMA READ. + */ + if (qp->s_last == qp->s_tail) + break; + + if (cmp24(psn, wqe->psn) < 0) { + break; + } + /* Retry the request. 
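+ * The sequence-error NAK carries the responder's expected PSN,
+ * so ipath_restart_rc() resumes transmission from that PSN
+ * rather than from the start of the request.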
*/ + ipath_restart_rc(qp, psn, &wc); + break; + + case 1: /* Invalid Request */ + wc.status = IB_WC_REM_INV_REQ_ERR; + dev->n_other_naks++; + goto class_b; + + case 2: /* Remote Access Error */ + wc.status = IB_WC_REM_ACCESS_ERR; + dev->n_other_naks++; + goto class_b; + + case 3: /* Remote Operation Error */ + wc.status = IB_WC_REM_OP_ERR; + dev->n_other_naks++; + class_b: + wc.wr_id = wqe->wr.wr_id; + wc.opcode = wc_opcode[wqe->wr.opcode]; + wc.vendor_err = 0; + wc.byte_len = 0; + wc.qp_num = qp->ibqp.qp_num; + wc.src_qp = qp->remote_qpn; + wc.pkey_index = 0; + wc.slid = qp->remote_ah_attr.dlid; + wc.sl = qp->remote_ah_attr.sl; + wc.dlid_path_bits = 0; + wc.port_num = 0; + ipath_sqerror_qp(qp, &wc); + break; + + default: + /* Ignore other reserved NAK error codes */ + goto reserved; + } + qp->s_rnr_retry = qp->s_rnr_retry_cnt; + return 0; + + default: /* 2: reserved */ + reserved: + /* Ignore reserved NAK codes. */ + return 0; + } +} + +/* + * This is called from ipath_qp_rcv() to process an incomming RC packet + * for the given QP. + * Called at interrupt level. + */ +static void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, + int has_grh, void *data, u32 tlen, struct ipath_qp *qp) +{ + struct ipath_other_headers *ohdr; + int opcode; + u32 hdrsize; + u32 psn; + u32 pad; + unsigned long flags; + struct ib_wc wc; + u32 pmtu = ib_mtu_enum_to_int(qp->path_mtu); + int diff; + struct ib_reth *reth; + + /* Check for GRH */ + if (!has_grh) { + ohdr = &hdr->u.oth; + hdrsize = 8 + 12; /* LRH + BTH */ + psn = be32_to_cpu(ohdr->bth[2]); + } else { + ohdr = &hdr->u.l.oth; + hdrsize = 8 + 40 + 12; /* LRH + GRH + BTH */ + /* + * The header with GRH is 60 bytes and the + * core driver sets the eager header buffer + * size to 56 bytes so the last 4 bytes of + * the BTH header (PSN) is in the data buffer. + */ + psn = be32_to_cpu(((u32 *) data)[0]); + data += sizeof(u32); + } + /* + * The opcode is in the low byte when its in network order + * (top byte when in host order). + */ + opcode = *(u8 *) (&ohdr->bth[0]); + + /* + * Process responses (ACKs) before anything else. + * Note that the packet sequence number will be for something + * in the send work queue rather than the expected receive + * packet sequence number. In other words, this QP is the + * requester. + */ + if (opcode >= IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST && + opcode <= IB_OPCODE_RC_ATOMIC_ACKNOWLEDGE) { + + spin_lock_irqsave(&qp->s_lock, flags); + + /* Ignore invalid responses. */ + if (cmp24(psn, qp->s_next_psn) >= 0) { + goto ack_done; + } + + /* Ignore duplicate responses. 
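+ * A response PSN at or below s_last_psn has already been processed.
+ * The one useful case handled below is a "ghost" ACK that repeats
+ * the last PSN: its AETH can still carry fresh flow-control credit,
+ * so that is extracted before the duplicate is dropped.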
*/ + diff = cmp24(psn, qp->s_last_psn); + if (unlikely(diff <= 0)) { + /* Update credits for "ghost" ACKs */ + if (diff == 0 && opcode == IB_OPCODE_RC_ACKNOWLEDGE) { + if (!has_grh) { + pad = be32_to_cpu(ohdr->u.aeth); + } else { + pad = be32_to_cpu(((u32 *) data)[0]); + data += sizeof(u32); + } + if ((pad >> 29) == 0) { + ipath_get_credit(qp, pad); + } + } + goto ack_done; + } + + switch (opcode) { + case IB_OPCODE_RC_ACKNOWLEDGE: + case IB_OPCODE_RC_ATOMIC_ACKNOWLEDGE: + case IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST: + if (!has_grh) { + pad = be32_to_cpu(ohdr->u.aeth); + } else { + pad = be32_to_cpu(((u32 *) data)[0]); + data += sizeof(u32); + } + if (opcode == IB_OPCODE_RC_ATOMIC_ACKNOWLEDGE) { + *(u64 *) qp->s_sge.sge.vaddr = *(u64 *) data; + } + if (!do_rc_ack(qp, pad, psn, opcode) || + opcode != IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST) { + goto ack_done; + } + hdrsize += 4; + /* + * do_rc_ack() has already checked the PSN so skip + * the sequence check. + */ + goto rdma_read; + + case IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE: + /* no AETH, no ACK */ + if (unlikely(cmp24(psn, qp->s_last_psn + 1) != 0)) { + dev->n_rdma_seq++; + ipath_restart_rc(qp, qp->s_last_psn + 1, &wc); + goto ack_done; + } + rdma_read: + if (unlikely(qp->s_state != + IB_OPCODE_RC_RDMA_READ_REQUEST)) + goto ack_done; + if (unlikely(tlen != (hdrsize + pmtu + 4))) + goto ack_done; + if (unlikely(pmtu >= qp->s_len)) + goto ack_done; + /* We got a response so update the timeout. */ + if (unlikely(qp->s_last == qp->s_tail || + get_swqe_ptr(qp, qp->s_last)->wr.opcode != + IB_WR_RDMA_READ)) + goto ack_done; + spin_lock(&dev->pending_lock); + if (qp->s_rnr_timeout == 0 && + qp->timerwait.next != LIST_POISON1) { + list_move_tail(&qp->timerwait, + &dev->pending[dev-> + pending_index]); + } + spin_unlock(&dev->pending_lock); + /* + * Update the RDMA receive state but do the copy w/o + * holding the locks and blocking interrupts. + * XXX Yet another place that affects relaxed + * RDMA order since we don't want s_sge modified. + */ + qp->s_len -= pmtu; + qp->s_last_psn = psn; + spin_unlock_irqrestore(&qp->s_lock, flags); + copy_sge(&qp->s_sge, data, pmtu); + return; + + case IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST: + /* ACKs READ req. */ + if (unlikely(cmp24(psn, qp->s_last_psn + 1) != 0)) { + dev->n_rdma_seq++; + ipath_restart_rc(qp, qp->s_last_psn + 1, &wc); + goto ack_done; + } + /* FALLTHROUGH */ + case IB_OPCODE_RC_RDMA_READ_RESPONSE_ONLY: + if (unlikely(qp->s_state != + IB_OPCODE_RC_RDMA_READ_REQUEST)) { + goto ack_done; + } + /* Get the number of bytes the message was padded by. */ + pad = (ohdr->bth[0] >> 12) & 3; + /* + * Check that the data size is >= 1 && <= pmtu. + * Remember to account for the AETH header (4) + * and ICRC (4). + */ + if (unlikely(tlen <= (hdrsize + pad + 8))) { + /* XXX Need to generate an error CQ entry. */ + goto ack_done; + } + tlen -= hdrsize + pad + 8; + if (unlikely(tlen != qp->s_len)) { + /* XXX Need to generate an error CQ entry. */ + goto ack_done; + } + if (!has_grh) { + pad = be32_to_cpu(ohdr->u.aeth); + } else { + pad = be32_to_cpu(((u32 *) data)[0]); + data += sizeof(u32); + } + copy_sge(&qp->s_sge, data, tlen); + if (do_rc_ack(qp, pad, psn, + IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST)) { + /* + * Change the state so we contimue + * processing new requests. + */ + qp->s_state = IB_OPCODE_RC_SEND_LAST; + } + goto ack_done; + } + ack_done: + spin_unlock_irqrestore(&qp->s_lock, flags); + return; + } + + spin_lock_irqsave(&qp->r_rq.lock, flags); + + /* Compute 24 bits worth of difference. 
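+ * PSNs are 24-bit values that wrap, so cmp24() (defined near the
+ * top of this file) compares them circularly by shifting the 32-bit
+ * difference left 8 bits, which puts bit 23 of the difference into
+ * the sign bit. E.g. cmp24(0x000002, 0xFFFFFE) is positive, i.e.
+ * PSN 2 is "after" PSN 0xFFFFFE across the wrap.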
*/ + diff = cmp24(psn, qp->r_psn); + if (unlikely(diff)) { + if (diff > 0) { + /* + * Packet sequence error. + * A NAK will ACK earlier sends and RDMA writes. + * Don't queue the NAK if a RDMA read, atomic, or + * NAK is pending though. + */ + spin_lock(&qp->s_lock); + if ((qp->s_ack_state >= + IB_OPCODE_RC_RDMA_READ_REQUEST && + qp->s_ack_state != IB_OPCODE_ACKNOWLEDGE) || + qp->s_nak_state != 0) { + spin_unlock(&qp->s_lock); + goto done; + } + qp->s_ack_state = IB_OPCODE_RC_SEND_ONLY; + qp->s_nak_state = IB_NAK_PSN_ERROR; + /* Use the expected PSN. */ + qp->s_ack_psn = qp->r_psn; + goto resched; + } + + /* + * Handle a duplicate request. + * Don't re-execute SEND, RDMA write or atomic op. + * Don't NAK errors, just silently drop the duplicate request. + * Note that r_sge, r_len, and r_rcv_len may be + * in use so don't modify them. + * + * We are supposed to ACK the earliest duplicate PSN + * but we can coalesce an outstanding duplicate ACK. + * We have to send the earliest so that RDMA reads + * can be restarted at the requester's expected PSN. + */ + spin_lock(&qp->s_lock); + if (qp->s_ack_state != IB_OPCODE_ACKNOWLEDGE && + cmp24(psn, qp->s_ack_psn) >= 0) { + if (qp->s_ack_state < IB_OPCODE_RDMA_READ_REQUEST) + qp->s_ack_psn = psn; + spin_unlock(&qp->s_lock); + goto done; + } + switch (opcode) { + case IB_OPCODE_RC_RDMA_READ_REQUEST: + /* + * We have to be careful to not change s_rdma_sge + * while do_rc_send() is using it and not holding + * the s_lock. + */ + if (qp->s_ack_state != IB_OPCODE_RC_ACKNOWLEDGE && + qp->s_ack_state >= IB_OPCODE_RDMA_READ_REQUEST) { + spin_unlock(&qp->s_lock); + dev->n_rdma_dup_busy++; + goto done; + } + /* RETH comes after BTH */ + if (!has_grh) + reth = &ohdr->u.rc.reth; + else { + reth = (struct ib_reth *)data; + data += sizeof(*reth); + } + qp->s_rdma_len = be32_to_cpu(reth->length); + if (qp->s_rdma_len != 0) { + u32 rkey = be32_to_cpu(reth->rkey); + u64 vaddr = be64_to_cpu(reth->vaddr); + + /* + * Address range must be a subset of the + * original request and start on pmtu + * boundaries. + */ + if (unlikely(!ipath_rkey_ok(dev, + &qp->s_rdma_sge, + qp->s_rdma_len, + vaddr, rkey, + IB_ACCESS_REMOTE_READ))) + { + goto done; + } + } else { + qp->s_rdma_sge.sg_list = NULL; + qp->s_rdma_sge.num_sge = 0; + qp->s_rdma_sge.sge.mr = NULL; + qp->s_rdma_sge.sge.vaddr = NULL; + qp->s_rdma_sge.sge.length = 0; + qp->s_rdma_sge.sge.sge_length = 0; + } + break; + + case IB_OPCODE_RC_COMPARE_SWAP: + case IB_OPCODE_RC_FETCH_ADD: + /* + * Check for the PSN of the last atomic operations + * performed and resend the result if found. + */ + if ((psn & 0xFFFFFF) != qp->r_atomic_psn) { + spin_unlock(&qp->s_lock); + goto done; + } + qp->s_ack_atomic = qp->r_atomic_data; + break; + } + qp->s_ack_state = opcode; + qp->s_nak_state = 0; + qp->s_ack_psn = psn; + goto resched; + } + + /* Check for opcode sequence errors. */ + switch (qp->r_state) { + case IB_OPCODE_RC_SEND_FIRST: + case IB_OPCODE_RC_SEND_MIDDLE: + if (opcode == IB_OPCODE_RC_SEND_MIDDLE || + opcode == IB_OPCODE_RC_SEND_LAST || + opcode == IB_OPCODE_RC_SEND_LAST_WITH_IMMEDIATE) + break; + nack_inv: + /* + * A NAK will ACK earlier sends and RDMA writes. + * Don't queue the NAK if a RDMA read, atomic, or + * NAK is pending though. 
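+ * Only one ACK/NAK or RDMA READ/atomic response can be outstanding
+ * at a time (there is no response send queue), so a pending
+ * response must drain before this NAK can be scheduled.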
+ */ + spin_lock(&qp->s_lock); + if (qp->s_ack_state >= IB_OPCODE_RC_RDMA_READ_REQUEST && + qp->s_ack_state != IB_OPCODE_ACKNOWLEDGE) { + spin_unlock(&qp->s_lock); + goto done; + } + /* XXX Flush WQEs */ + qp->state = IB_QPS_ERR; + qp->s_ack_state = IB_OPCODE_RC_SEND_ONLY; + qp->s_nak_state = IB_NAK_INVALID_REQUEST; + qp->s_ack_psn = qp->r_psn; + goto resched; + + case IB_OPCODE_RC_RDMA_WRITE_FIRST: + case IB_OPCODE_RC_RDMA_WRITE_MIDDLE: + if (opcode == IB_OPCODE_RC_RDMA_WRITE_MIDDLE || + opcode == IB_OPCODE_RC_RDMA_WRITE_LAST || + opcode == IB_OPCODE_RC_RDMA_WRITE_LAST_WITH_IMMEDIATE) + break; + goto nack_inv; + + case IB_OPCODE_RC_RDMA_READ_REQUEST: + case IB_OPCODE_RC_COMPARE_SWAP: + case IB_OPCODE_RC_FETCH_ADD: + /* + * Drop all new requests until a response has been sent. + * A new request then ACKs the RDMA response we sent. + * Relaxed ordering would allow new requests to be + * processed but we would need to keep a queue + * of rwqe's for all that are in progress. + * Note that we can't RNR NAK this request since the RDMA + * READ or atomic response is already queued to be sent + * (unless we implement a response send queue). + */ + goto done; + + default: + if (opcode == IB_OPCODE_RC_SEND_MIDDLE || + opcode == IB_OPCODE_RC_SEND_LAST || + opcode == IB_OPCODE_RC_SEND_LAST_WITH_IMMEDIATE || + opcode == IB_OPCODE_RC_RDMA_WRITE_MIDDLE || + opcode == IB_OPCODE_RC_RDMA_WRITE_LAST || + opcode == IB_OPCODE_RC_RDMA_WRITE_LAST_WITH_IMMEDIATE) + goto nack_inv; + break; + } + + wc.imm_data = 0; + wc.wc_flags = 0; + + /* OK, process the packet. */ + switch (opcode) { + case IB_OPCODE_RC_SEND_FIRST: + if (!get_rwqe(qp, 0)) { + rnr_nak: + /* + * A RNR NAK will ACK earlier sends and RDMA writes. + * Don't queue the NAK if a RDMA read or atomic + * is pending though. + */ + spin_lock(&qp->s_lock); + if (qp->s_ack_state >= IB_OPCODE_RC_RDMA_READ_REQUEST && + qp->s_ack_state != IB_OPCODE_ACKNOWLEDGE) { + spin_unlock(&qp->s_lock); + goto done; + } + qp->s_ack_state = IB_OPCODE_RC_SEND_ONLY; + qp->s_nak_state = IB_RNR_NAK | qp->s_min_rnr_timer; + qp->s_ack_psn = qp->r_psn; + goto resched; + } + qp->r_rcv_len = 0; + /* FALLTHROUGH */ + case IB_OPCODE_RC_SEND_MIDDLE: + case IB_OPCODE_RC_RDMA_WRITE_MIDDLE: + send_middle: + /* Check for invalid length PMTU or posted rwqe len. */ + if (unlikely(tlen != (hdrsize + pmtu + 4))) { + goto nack_inv; + } + qp->r_rcv_len += pmtu; + if (unlikely(qp->r_rcv_len > qp->r_len)) { + goto nack_inv; + } + copy_sge(&qp->r_sge, data, pmtu); + break; + + case IB_OPCODE_RC_RDMA_WRITE_LAST_WITH_IMMEDIATE: + /* consume RWQE */ + if (!get_rwqe(qp, 1)) + goto rnr_nak; + goto send_last_imm; + + case IB_OPCODE_RC_SEND_ONLY: + case IB_OPCODE_RC_SEND_ONLY_WITH_IMMEDIATE: + if (!get_rwqe(qp, 0)) + goto rnr_nak; + qp->r_rcv_len = 0; + if (opcode == IB_OPCODE_RC_SEND_ONLY) + goto send_last; + /* FALLTHROUGH */ + case IB_OPCODE_RC_SEND_LAST_WITH_IMMEDIATE: + send_last_imm: + if (has_grh) { + wc.imm_data = *(u32 *) data; + data += sizeof(u32); + } else { + /* Immediate data comes after BTH */ + wc.imm_data = ohdr->u.imm_data; + } + hdrsize += 4; + wc.wc_flags = IB_WC_WITH_IMM; + /* FALLTHROUGH */ + case IB_OPCODE_RC_SEND_LAST: + case IB_OPCODE_RC_RDMA_WRITE_LAST: + send_last: + /* Get the number of bytes the message was padded by. */ + pad = (ohdr->bth[0] >> 12) & 3; + /* Check for invalid length. */ + /* XXX LAST len should be >= 1 */ + if (unlikely(tlen < (hdrsize + pad + 4))) { + goto nack_inv; + } + /* Don't count the CRC. 
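+ * E.g. a 1024-byte SEND LAST with no GRH and no pad arrives with
+ * tlen == 20 (LRH + BTH) + 1024 + 4 (ICRC); the subtraction below
+ * leaves exactly the 1024 payload bytes.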
*/ + tlen -= (hdrsize + pad + 4); + wc.byte_len = tlen + qp->r_rcv_len; + if (unlikely(wc.byte_len > qp->r_len)) { + goto nack_inv; + } + /* XXX Need to free SGEs */ + copy_sge(&qp->r_sge, data, tlen); + atomic_inc(&qp->msn); + if (opcode == IB_OPCODE_RC_RDMA_WRITE_LAST || + opcode == IB_OPCODE_RC_RDMA_WRITE_ONLY) + break; + wc.wr_id = qp->r_wr_id; + wc.status = IB_WC_SUCCESS; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.qp_num = qp->ibqp.qp_num; + wc.src_qp = qp->remote_qpn; + wc.pkey_index = 0; + wc.slid = qp->remote_ah_attr.dlid; + wc.sl = qp->remote_ah_attr.sl; + wc.dlid_path_bits = 0; + wc.port_num = 0; + /* Signal completion event if the solicited bit is set. */ + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, + ohdr->bth[0] & __constant_cpu_to_be32(1 << 23)); + break; + + case IB_OPCODE_RC_RDMA_WRITE_FIRST: + case IB_OPCODE_RC_RDMA_WRITE_ONLY: + case IB_OPCODE_RC_RDMA_WRITE_ONLY_WITH_IMMEDIATE: + /* consume RWQE */ + /* RETH comes after BTH */ + if (!has_grh) + reth = &ohdr->u.rc.reth; + else { + reth = (struct ib_reth *)data; + data += sizeof(*reth); + } + hdrsize += sizeof(*reth); + qp->r_len = be32_to_cpu(reth->length); + qp->r_rcv_len = 0; + if (qp->r_len != 0) { + u32 rkey = be32_to_cpu(reth->rkey); + u64 vaddr = be64_to_cpu(reth->vaddr); + + /* Check rkey & NAK */ + if (unlikely(!ipath_rkey_ok(dev, &qp->r_sge, qp->r_len, + vaddr, rkey, + IB_ACCESS_REMOTE_WRITE))) { + nack_acc: + /* + * A NAK will ACK earlier sends and RDMA + * writes. + * Don't queue the NAK if a RDMA read, + * atomic, or NAK is pending though. + */ + spin_lock(&qp->s_lock); + if (qp->s_ack_state >= + IB_OPCODE_RC_RDMA_READ_REQUEST && + qp->s_ack_state != IB_OPCODE_ACKNOWLEDGE) { + spin_unlock(&qp->s_lock); + goto done; + } + /* XXX Flush WQEs */ + qp->state = IB_QPS_ERR; + qp->s_ack_state = IB_OPCODE_RC_RDMA_WRITE_ONLY; + qp->s_nak_state = IB_NAK_REMOTE_ACCESS_ERROR; + qp->s_ack_psn = qp->r_psn; + goto resched; + } + } else { + qp->r_sge.sg_list = NULL; + qp->r_sge.sge.mr = NULL; + qp->r_sge.sge.vaddr = NULL; + qp->r_sge.sge.length = 0; + qp->r_sge.sge.sge_length = 0; + } + if (unlikely(!(qp->qp_access_flags & IB_ACCESS_REMOTE_WRITE))) + goto nack_acc; + if (opcode == IB_OPCODE_RC_RDMA_WRITE_FIRST) + goto send_middle; + else if (opcode == IB_OPCODE_RC_RDMA_WRITE_ONLY) + goto send_last; + if (!get_rwqe(qp, 1)) + goto rnr_nak; + goto send_last_imm; + + case IB_OPCODE_RC_RDMA_READ_REQUEST: + /* RETH comes after BTH */ + if (!has_grh) + reth = &ohdr->u.rc.reth; + else { + reth = (struct ib_reth *)data; + data += sizeof(*reth); + } + spin_lock(&qp->s_lock); + if (qp->s_ack_state != IB_OPCODE_RC_ACKNOWLEDGE && + qp->s_ack_state >= IB_OPCODE_RDMA_READ_REQUEST) { + spin_unlock(&qp->s_lock); + goto done; + } + qp->s_rdma_len = be32_to_cpu(reth->length); + if (qp->s_rdma_len != 0) { + u32 rkey = be32_to_cpu(reth->rkey); + u64 vaddr = be64_to_cpu(reth->vaddr); + + /* Check rkey & NAK */ + if (unlikely(!ipath_rkey_ok(dev, &qp->s_rdma_sge, + qp->s_rdma_len, + vaddr, rkey, + IB_ACCESS_REMOTE_READ))) { + spin_unlock(&qp->s_lock); + goto nack_acc; + } + /* + * Update the next expected PSN. + * We add 1 later below, so only add the remainder here. 
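+ * E.g. a 5000-byte RDMA READ with a 2048-byte path MTU returns
+ * three response packets (2048 + 2048 + 904 bytes), so
+ * (5000 - 1) / 2048 = 2 is added here and the r_psn++ just below
+ * brings the total advance to 3.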
+ */ + if (qp->s_rdma_len > pmtu) + qp->r_psn += (qp->s_rdma_len - 1) / pmtu; + } else { + qp->s_rdma_sge.sg_list = NULL; + qp->s_rdma_sge.num_sge = 0; + qp->s_rdma_sge.sge.mr = NULL; + qp->s_rdma_sge.sge.vaddr = NULL; + qp->s_rdma_sge.sge.length = 0; + qp->s_rdma_sge.sge.sge_length = 0; + } + if (unlikely(!(qp->qp_access_flags & IB_ACCESS_REMOTE_READ))) + goto nack_acc; + /* + * We need to increment the MSN here instead of when we + * finish sending the result since a duplicate request would + * increment it more than once. + */ + atomic_inc(&qp->msn); + qp->s_ack_state = opcode; + qp->s_nak_state = 0; + qp->s_ack_psn = psn; + qp->r_psn++; + qp->r_state = opcode; + goto rdmadone; + + case IB_OPCODE_RC_COMPARE_SWAP: + case IB_OPCODE_RC_FETCH_ADD:{ + struct ib_atomic_eth *ateth; + u64 vaddr; + u64 sdata; + u32 rkey; + + if (!has_grh) + ateth = &ohdr->u.atomic_eth; + else { + ateth = (struct ib_atomic_eth *)data; + data += sizeof(*ateth); + } + vaddr = be64_to_cpu(ateth->vaddr); + if (unlikely(vaddr & 0x7)) + goto nack_inv; + rkey = be32_to_cpu(ateth->rkey); + /* Check rkey & NAK */ + if (unlikely(!ipath_rkey_ok(dev, &qp->r_sge, + sizeof(u64), vaddr, rkey, + IB_ACCESS_REMOTE_ATOMIC))) { + goto nack_acc; + } + if (unlikely(!(qp->qp_access_flags & + IB_ACCESS_REMOTE_ATOMIC))) + goto nack_acc; + /* Perform atomic OP and save result. */ + sdata = be64_to_cpu(ateth->swap_data); + spin_lock(&dev->pending_lock); + qp->r_atomic_data = *(u64 *) qp->r_sge.sge.vaddr; + if (opcode == IB_OPCODE_RC_FETCH_ADD) { + *(u64 *) qp->r_sge.sge.vaddr = + qp->r_atomic_data + sdata; + } else if (qp->r_atomic_data == + be64_to_cpu(ateth->compare_data)) { + *(u64 *) qp->r_sge.sge.vaddr = sdata; + } + spin_unlock(&dev->pending_lock); + atomic_inc(&qp->msn); + qp->r_atomic_psn = psn & 0xFFFFFF; + psn |= 1 << 31; + break; + } + + default: + /* Drop packet for unknown opcodes. */ + spin_unlock_irqrestore(&qp->r_rq.lock, flags); + return; + } + qp->r_psn++; + qp->r_state = opcode; + /* Send an ACK if requested or required. */ + if (psn & (1 << 31)) { + /* + * Coalesce ACKs unless there is a RDMA READ or + * ATOMIC pending. + */ + spin_lock(&qp->s_lock); + if (qp->s_ack_state == IB_OPCODE_RC_ACKNOWLEDGE || + qp->s_ack_state < IB_OPCODE_RDMA_READ_REQUEST) { + qp->s_ack_state = opcode; + qp->s_nak_state = 0; + qp->s_ack_psn = psn; + qp->s_ack_atomic = qp->r_atomic_data; + goto resched; + } + spin_unlock(&qp->s_lock); + } +done: + spin_unlock_irqrestore(&qp->r_rq.lock, flags); + return; + +resched: + /* Try to send ACK right away but not if do_rc_send() is active. */ + if (qp->s_hdrwords == 0 && + (qp->s_ack_state < IB_OPCODE_RDMA_READ_REQUEST || + qp->s_ack_state >= IB_OPCODE_COMPARE_SWAP)) + send_rc_ack(qp); + +rdmadone: + spin_unlock(&qp->s_lock); + spin_unlock_irqrestore(&qp->r_rq.lock, flags); + + /* Call do_rc_send() in another thread. */ + tasklet_schedule(&qp->s_task); +} + +/* + * This is called from ipath_ib_rcv() to process an incomming packet + * for the given QP. + * Called at interrupt level. + */ +static inline void ipath_qp_rcv(struct ipath_ibdev *dev, + struct ipath_ib_header *hdr, int has_grh, + void *data, u32 tlen, struct ipath_qp *qp) +{ + /* Check for valid receive state. 
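+ * state_ops[] (near the top of this file) maps each QP state to a
+ * capability mask; packets are handed to the per-transport receive
+ * routines only when the current state has IPATH_PROCESS_RECV_OK
+ * set, otherwise the packet just bumps n_pkt_drops.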
*/ + if (!(state_ops[qp->state] & IPATH_PROCESS_RECV_OK)) { + dev->n_pkt_drops++; + return; + } + + switch (qp->ibqp.qp_type) { + case IB_QPT_SMI: + case IB_QPT_GSI: + case IB_QPT_UD: + ipath_ud_rcv(dev, hdr, has_grh, data, tlen, qp); + break; + + case IB_QPT_RC: + ipath_rc_rcv(dev, hdr, has_grh, data, tlen, qp); + break; + + case IB_QPT_UC: + ipath_uc_rcv(dev, hdr, has_grh, data, tlen, qp); + break; + + default: + break; + } +} + +/* + * This is called from ipath_kreceive() to process an incomming packet at + * interrupt level. Tlen is the length of the header + data + CRC in bytes. + */ +static void ipath_ib_rcv(const ipath_type t, void *rhdr, void *data, u32 tlen) +{ + struct ipath_ibdev *dev = ipath_devices[t]; + struct ipath_ib_header *hdr = rhdr; + struct ipath_other_headers *ohdr; + struct ipath_qp *qp; + u32 qp_num; + int lnh; + u8 opcode; + + if (dev == NULL) + return; + + if (tlen < 24) { /* LRH+BTH+CRC */ + dev->n_pkt_drops++; + return; + } + + /* Check for GRH */ + lnh = be16_to_cpu(hdr->lrh[0]) & 3; + if (lnh == IPS_LRH_BTH) + ohdr = &hdr->u.oth; + else if (lnh == IPS_LRH_GRH) + ohdr = &hdr->u.l.oth; + else { + dev->n_pkt_drops++; + return; + } + + opcode = *(u8 *) (&ohdr->bth[0]); + dev->opstats[opcode].n_bytes += tlen; + dev->opstats[opcode].n_packets++; + + /* Get the destination QP number. */ + qp_num = be32_to_cpu(ohdr->bth[1]) & 0xFFFFFF; + if (qp_num == 0xFFFFFF) { + struct ipath_mcast *mcast; + struct ipath_mcast_qp *p; + + mcast = ipath_mcast_find(&hdr->u.l.grh.dgid); + if (mcast == NULL) { + dev->n_pkt_drops++; + return; + } + dev->n_multicast_rcv++; + list_for_each_entry_rcu(p, &mcast->qp_list, list) + ipath_qp_rcv(dev, hdr, lnh == IPS_LRH_GRH, data, tlen, + p->qp); + /* + * Notify ipath_multicast_detach() if it is waiting for us + * to finish. + */ + if (atomic_dec_return(&mcast->refcount) <= 1) + wake_up(&mcast->wait); + } else if ((qp = ipath_lookup_qpn(&dev->qp_table, qp_num)) != NULL) { + ipath_qp_rcv(dev, hdr, lnh == IPS_LRH_GRH, data, tlen, qp); + /* + * Notify ipath_destroy_qp() if it is waiting for us to finish. + */ + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); + } else + dev->n_pkt_drops++; +} + +/* + * This is called from ipath_do_rcv_timer() at interrupt level + * to check for QPs which need retransmits and to collect performance numbers. + */ +static void ipath_ib_timer(const ipath_type t) +{ + struct ipath_ibdev *dev = ipath_devices[t]; + struct ipath_qp *resend = NULL; + struct ipath_qp *rnr = NULL; + struct list_head *last; + struct ipath_qp *qp; + unsigned long flags; + + if (dev == NULL) + return; + + spin_lock_irqsave(&dev->pending_lock, flags); + /* Start filling the next pending queue. */ + if (++dev->pending_index >= ARRAY_SIZE(dev->pending)) + dev->pending_index = 0; + /* Save any requests still in the new queue, they have timed out. 
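+ * The pending[] lists act as a small timer wheel: QPs waiting for
+ * an ACK are queued on the slot at pending_index and each timer
+ * tick advances the index, so entries still found on the slot
+ * being reused have waited a full rotation and are collected for
+ * retransmission below.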
*/ + last = &dev->pending[dev->pending_index]; + while (!list_empty(last)) { + qp = list_entry(last->next, struct ipath_qp, timerwait); + if (last->next == LIST_POISON1 || + last->next != &qp->timerwait || + qp->timerwait.prev != last) { + INIT_LIST_HEAD(last); + } else { + list_del(&qp->timerwait); + qp->timerwait.prev = (struct list_head *) resend; + resend = qp; + atomic_inc(&qp->refcount); + } + } + last = &dev->rnrwait; + if (!list_empty(last)) { + qp = list_entry(last->next, struct ipath_qp, timerwait); + if (--qp->s_rnr_timeout == 0) { + do { + if (last->next == LIST_POISON1 || + last->next != &qp->timerwait || + qp->timerwait.prev != last) { + INIT_LIST_HEAD(last); + break; + } + list_del(&qp->timerwait); + qp->timerwait.prev = (struct list_head *) rnr; + rnr = qp; + if (list_empty(last)) + break; + qp = list_entry(last->next, struct ipath_qp, + timerwait); + } while (qp->s_rnr_timeout == 0); + } + } + /* We should only be in the started state if pma_sample_start != 0 */ + if (dev->pma_sample_status == IB_PMA_SAMPLE_STATUS_STARTED && + --dev->pma_sample_start == 0) { + dev->pma_sample_status = IB_PMA_SAMPLE_STATUS_RUNNING; + ipath_layer_snapshot_counters(dev->ib_unit, &dev->ipath_sword, + &dev->ipath_rword, + &dev->ipath_spkts, + &dev->ipath_rpkts); + } + if (dev->pma_sample_status == IB_PMA_SAMPLE_STATUS_RUNNING) { + if (dev->pma_sample_interval == 0) { + u64 ta, tb, tc, td; + + dev->pma_sample_status = IB_PMA_SAMPLE_STATUS_DONE; + ipath_layer_snapshot_counters(dev->ib_unit, + &ta, &tb, &tc, &td); + + dev->ipath_sword = ta - dev->ipath_sword; + dev->ipath_rword = tb - dev->ipath_rword; + dev->ipath_spkts = tc - dev->ipath_spkts; + dev->ipath_rpkts = td - dev->ipath_rpkts; + } else { + dev->pma_sample_interval--; + } + } + spin_unlock_irqrestore(&dev->pending_lock, flags); + + /* XXX What if timer fires again while this is running? */ + for (qp = resend; qp != NULL; + qp = (struct ipath_qp *) qp->timerwait.prev) { + struct ib_wc wc; + + spin_lock_irqsave(&qp->s_lock, flags); + if (qp->s_last != qp->s_tail && qp->state == IB_QPS_RTS) { + dev->n_timeouts++; + ipath_restart_rc(qp, qp->s_last_psn + 1, &wc); + } + spin_unlock_irqrestore(&qp->s_lock, flags); + + /* Notify ipath_destroy_qp() if it is waiting. */ + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); + } + for (qp = rnr; qp != NULL; + qp = (struct ipath_qp *) qp->timerwait.prev) { + tasklet_schedule(&qp->s_task); + } +} + +/* + * This is called from ipath_intr() at interrupt level when a PIO buffer + * is available after ipath_verbs_send() returned an error that no + * buffers were available. + * Return 0 if we consumed all the PIO buffers and we still have QPs + * waiting for buffers (for now, just do a tasklet_schedule and return one). 
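+ * Each waiter's send tasklet is rescheduled below; the tasklet
+ * retries the send itself, so nothing is transmitted directly
+ * from this interrupt path.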
+ */ +static int ipath_ib_piobufavail(const ipath_type t) +{ + struct ipath_ibdev *dev = ipath_devices[t]; + struct ipath_qp *qp; + unsigned long flags; + + if (dev == NULL) + return 1; + + spin_lock_irqsave(&dev->pending_lock, flags); + while (!list_empty(&dev->piowait)) { + qp = list_entry(dev->piowait.next, struct ipath_qp, piowait); + list_del(&qp->piowait); + tasklet_schedule(&qp->s_task); + } + spin_unlock_irqrestore(&dev->pending_lock, flags); + + return 1; +} + +static struct ib_qp *ipath_create_qp(struct ib_pd *ibpd, + struct ib_qp_init_attr *init_attr, + struct ib_udata *udata) +{ + struct ipath_qp *qp; + int err; + struct ipath_swqe *swq = NULL; + struct ipath_ibdev *dev; + size_t sz; + + if (init_attr->cap.max_send_sge > 255 || + init_attr->cap.max_recv_sge > 255) + return ERR_PTR(-ENOMEM); + + switch (init_attr->qp_type) { + case IB_QPT_UC: + case IB_QPT_RC: + sz = sizeof(struct ipath_sge) * init_attr->cap.max_send_sge + + sizeof(struct ipath_swqe); + swq = vmalloc((init_attr->cap.max_send_wr + 1) * sz); + if (swq == NULL) + return ERR_PTR(-ENOMEM); + /* FALLTHROUGH */ + case IB_QPT_UD: + case IB_QPT_SMI: + case IB_QPT_GSI: + qp = kmalloc(sizeof(*qp), GFP_KERNEL); + if (!qp) + return ERR_PTR(-ENOMEM); + qp->r_rq.size = init_attr->cap.max_recv_wr + 1; + sz = sizeof(struct ipath_sge) * init_attr->cap.max_recv_sge + + sizeof(struct ipath_rwqe); + qp->r_rq.wq = vmalloc(qp->r_rq.size * sz); + if (!qp->r_rq.wq) { + kfree(qp); + return ERR_PTR(-ENOMEM); + } + + /* + * ib_create_qp() will initialize qp->ibqp + * except for qp->ibqp.qp_num. + */ + spin_lock_init(&qp->s_lock); + spin_lock_init(&qp->r_rq.lock); + atomic_set(&qp->refcount, 0); + init_waitqueue_head(&qp->wait); + tasklet_init(&qp->s_task, + init_attr->qp_type == IB_QPT_RC ? do_rc_send : + do_uc_send, (unsigned long)qp); + qp->piowait.next = LIST_POISON1; + qp->piowait.prev = LIST_POISON2; + qp->timerwait.next = LIST_POISON1; + qp->timerwait.prev = LIST_POISON2; + qp->state = IB_QPS_RESET; + qp->s_wq = swq; + qp->s_size = init_attr->cap.max_send_wr + 1; + qp->s_max_sge = init_attr->cap.max_send_sge; + qp->r_rq.max_sge = init_attr->cap.max_recv_sge; + qp->s_flags = init_attr->sq_sig_type == IB_SIGNAL_REQ_WR ? + 1 << IPATH_S_SIGNAL_REQ_WR : 0; + dev = to_idev(ibpd->device); + err = ipath_alloc_qpn(&dev->qp_table, qp, init_attr->qp_type); + if (err) { + vfree(swq); + vfree(qp->r_rq.wq); + kfree(qp); + return ERR_PTR(err); + } + ipath_reset_qp(qp); + + /* Tell the core driver that the kernel SMA is present. */ + if (qp->ibqp.qp_type == IB_QPT_SMI) + ipath_verbs_set_flags(dev->ib_unit, + IPATH_VERBS_KERNEL_SMA); + break; + + default: + /* Don't support raw QPs */ + return ERR_PTR(-ENOSYS); + } + + init_attr->cap.max_inline_data = 0; + + return &qp->ibqp; +} + +/* + * Note that this can be called while the QP is actively sending or receiving! + */ +static int ipath_destroy_qp(struct ib_qp *ibqp) +{ + struct ipath_qp *qp = to_iqp(ibqp); + struct ipath_ibdev *dev = to_idev(ibqp->device); + unsigned long flags; + + /* Tell the core driver that the kernel SMA is gone. */ + if (qp->ibqp.qp_type == IB_QPT_SMI) + ipath_verbs_set_flags(dev->ib_unit, 0); + + spin_lock_irqsave(&qp->r_rq.lock, flags); + spin_lock(&qp->s_lock); + qp->state = IB_QPS_ERR; + spin_unlock(&qp->s_lock); + spin_unlock_irqrestore(&qp->r_rq.lock, flags); + + /* Stop the sending tasklet. */ + tasklet_kill(&qp->s_task); + + /* Make sure the QP isn't on the timeout list. 
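+ * list_del() poisons the timerwait and piowait entries, so the
+ * LIST_POISON1 checks below tell whether the QP is still queued;
+ * removal is done under pending_lock since interrupt-level code
+ * walks these lists.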
*/ + spin_lock_irqsave(&dev->pending_lock, flags); + if (qp->timerwait.next != LIST_POISON1) + list_del(&qp->timerwait); + if (qp->piowait.next != LIST_POISON1) + list_del(&qp->piowait); + spin_unlock_irqrestore(&dev->pending_lock, flags); + + /* + * Make sure that the QP is not in the QPN table so receive interrupts + * will discard packets for this QP. + * XXX Also remove QP from multicast table. + */ + if (atomic_read(&qp->refcount) != 0) + ipath_free_qp(&dev->qp_table, qp); + + vfree(qp->s_wq); + vfree(qp->r_rq.wq); + kfree(qp); + return 0; +} + +static struct ib_srq *ipath_create_srq(struct ib_pd *ibpd, + struct ib_srq_init_attr *srq_init_attr, + struct ib_udata *udata) +{ + struct ipath_srq *srq; + u32 sz; + + if (srq_init_attr->attr.max_sge < 1) + return ERR_PTR(-EINVAL); + + srq = kmalloc(sizeof(*srq), GFP_KERNEL); + if (!srq) + return ERR_PTR(-ENOMEM); + + /* Need to use vmalloc() if we want to support large #s of entries. */ + srq->rq.size = srq_init_attr->attr.max_wr + 1; + sz = sizeof(struct ipath_sge) * srq_init_attr->attr.max_sge + + sizeof(struct ipath_rwqe); + srq->rq.wq = vmalloc(srq->rq.size * sz); + if (!srq->rq.wq) { + kfree(srq); + return ERR_PTR(-ENOMEM); + } + + /* + * ib_create_srq() will initialize srq->ibsrq. + */ + spin_lock_init(&srq->rq.lock); + srq->rq.head = 0; + srq->rq.tail = 0; + srq->rq.max_sge = srq_init_attr->attr.max_sge; + srq->limit = srq_init_attr->attr.srq_limit; + + return &srq->ibsrq; +} + +int ipath_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, + enum ib_srq_attr_mask attr_mask) +{ + struct ipath_srq *srq = to_isrq(ibsrq); + unsigned long flags; + + if (attr_mask & IB_SRQ_LIMIT) { + spin_lock_irqsave(&srq->rq.lock, flags); + srq->limit = attr->srq_limit; + spin_unlock_irqrestore(&srq->rq.lock, flags); + } + if (attr_mask & IB_SRQ_MAX_WR) { + u32 size = attr->max_wr + 1; + struct ipath_rwqe *wq, *p; + u32 n; + u32 sz; + + if (attr->max_sge < srq->rq.max_sge) + return -EINVAL; + + sz = sizeof(struct ipath_rwqe) + + attr->max_sge * sizeof(struct ipath_sge); + wq = vmalloc(size * sz); + if (!wq) + return -ENOMEM; + + spin_lock_irqsave(&srq->rq.lock, flags); + if (srq->rq.head < srq->rq.tail) + n = srq->rq.size + srq->rq.head - srq->rq.tail; + else + n = srq->rq.head - srq->rq.tail; + if (size <= n || size <= srq->limit) { + spin_unlock_irqrestore(&srq->rq.lock, flags); + vfree(wq); + return -EINVAL; + } + n = 0; + p = wq; + while (srq->rq.tail != srq->rq.head) { + struct ipath_rwqe *wqe; + int i; + + wqe = get_rwqe_ptr(&srq->rq, srq->rq.tail); + p->wr_id = wqe->wr_id; + p->length = wqe->length; + p->num_sge = wqe->num_sge; + for (i = 0; i < wqe->num_sge; i++) + p->sg_list[i] = wqe->sg_list[i]; + n++; + p = (struct ipath_rwqe *)((char *) p + sz); + if (++srq->rq.tail >= srq->rq.size) + srq->rq.tail = 0; + } + vfree(srq->rq.wq); + srq->rq.wq = wq; + srq->rq.size = size; + srq->rq.head = n; + srq->rq.tail = 0; + srq->rq.max_sge = attr->max_sge; + spin_unlock_irqrestore(&srq->rq.lock, flags); + } + return 0; +} + +static int ipath_destroy_srq(struct ib_srq *ibsrq) +{ + struct ipath_srq *srq = to_isrq(ibsrq); + + vfree(srq->rq.wq); + kfree(srq); + + return 0; +} + +/* + * This may be called from interrupt context. 
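+ * The CQ is a plain ring buffer: entries are copied out from
+ * cq->tail under cq->lock until num_entries are returned or tail
+ * catches up with head, which keeps polling safe against
+ * completions arriving from interrupt context.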
+ */ +static int ipath_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry) +{ + struct ipath_cq *cq = to_icq(ibcq); + unsigned long flags; + int npolled; + + spin_lock_irqsave(&cq->lock, flags); + + for (npolled = 0; npolled < num_entries; ++npolled, ++entry) { + if (cq->tail == cq->head) + break; + *entry = cq->queue[cq->tail]; + if (++cq->tail == cq->ibcq.cqe) + cq->tail = 0; + } + + spin_unlock_irqrestore(&cq->lock, flags); + + return npolled; +} + +static struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + struct ipath_cq *cq; + + /* Need to use vmalloc() if we want to support large #s of entries. */ + cq = vmalloc(sizeof(*cq) + entries * sizeof(*cq->queue)); + if (!cq) + return ERR_PTR(-ENOMEM); + /* + * ib_create_cq() will initialize cq->ibcq except for cq->ibcq.cqe. + * The number of entries should be >= the number requested or + * return an error. + */ + cq->ibcq.cqe = entries + 1; + cq->notify = IB_CQ_NONE; + cq->triggered = 0; + spin_lock_init(&cq->lock); + tasklet_init(&cq->comptask, send_complete, (unsigned long)cq); + cq->head = 0; + cq->tail = 0; + + return &cq->ibcq; +} + +static int ipath_destroy_cq(struct ib_cq *ibcq) +{ + struct ipath_cq *cq = to_icq(ibcq); + + tasklet_kill(&cq->comptask); + vfree(cq); + + return 0; +} + +/* + * This may be called from interrupt context. + */ +static int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +{ + struct ipath_cq *cq = to_icq(ibcq); + unsigned long flags; + + spin_lock_irqsave(&cq->lock, flags); + /* + * Don't change IB_CQ_NEXT_COMP to IB_CQ_SOLICITED but allow + * any other transitions. + */ + if (cq->notify != IB_CQ_NEXT_COMP) + cq->notify = notify; + spin_unlock_irqrestore(&cq->lock, flags); + return 0; +} + +static int ipath_query_device(struct ib_device *ibdev, + struct ib_device_attr *props) +{ + struct ipath_ibdev *dev = to_idev(ibdev); + uint32_t vendor, boardrev, majrev, minrev; + + memset(props, 0, sizeof(*props)); + + props->device_cap_flags = IB_DEVICE_BAD_PKEY_CNTR | + IB_DEVICE_BAD_QKEY_CNTR | IB_DEVICE_SHUTDOWN_PORT | + IB_DEVICE_SYS_IMAGE_GUID; + ipath_layer_query_device(dev->ib_unit, &vendor, &boardrev, + &majrev, &minrev); + props->vendor_id = vendor; + props->vendor_part_id = boardrev; + props->hw_ver = boardrev << 16 | majrev << 8 | minrev; + + props->sys_image_guid = dev->sys_image_guid; + props->node_guid = ipath_layer_get_guid(dev->ib_unit); + + props->max_mr_size = ~0ull; + props->max_qp = 0xffff; + props->max_qp_wr = 0xffff; + props->max_sge = 255; + props->max_cq = 0xffff; + props->max_cqe = 0xffff; + props->max_mr = 0xffff; + props->max_pd = 0xffff; + props->max_qp_rd_atom = 1; + props->max_qp_init_rd_atom = 1; + /* props->max_res_rd_atom */ + props->max_srq = 0xffff; + props->max_srq_wr = 0xffff; + props->max_srq_sge = 255; + /* props->local_ca_ack_delay */ + props->atomic_cap = IB_ATOMIC_HCA; + props->max_pkeys = ipath_layer_get_npkeys(dev->ib_unit); + props->max_mcast_grp = 0xffff; + props->max_mcast_qp_attach = 0xffff; + props->max_total_mcast_qp_attach = props->max_mcast_qp_attach * + props->max_mcast_grp; + + return 0; +} + +static int ipath_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props) +{ + struct ipath_ibdev *dev = to_idev(ibdev); + uint32_t flags = ipath_layer_get_flags(dev->ib_unit); + enum ib_mtu mtu; + uint32_t l; + uint16_t lid = ipath_layer_get_lid(dev->ib_unit); + + memset(props, 0, sizeof(*props)); + props->lid = lid ? 
lid : IB_LID_PERMISSIVE; + props->lmc = dev->mkeyprot_resv_lmc & 7; + props->sm_lid = dev->sm_lid; + props->sm_sl = dev->sm_sl; + if (flags & IPATH_LINKDOWN) + props->state = IB_PORT_DOWN; + else if (flags & IPATH_LINKARMED) + props->state = IB_PORT_ARMED; + else if (flags & IPATH_LINKACTIVE) + props->state = IB_PORT_ACTIVE; + else if (flags & IPATH_LINK_SLEEPING) + props->state = IB_PORT_ACTIVE_DEFER; + else + props->state = IB_PORT_NOP; + /* See phys_state_show() */ + props->phys_state = 5; /* LinkUp */ + props->port_cap_flags = dev->port_cap_flags; + props->gid_tbl_len = 1; + props->max_msg_sz = 4096; + props->pkey_tbl_len = ipath_layer_get_npkeys(dev->ib_unit); + props->bad_pkey_cntr = ipath_layer_get_cr_errpkey(dev->ib_unit); + props->qkey_viol_cntr = dev->qkey_violations; + props->active_width = IB_WIDTH_4X; + /* See rate_show() */ + props->active_speed = 1; /* Regular 10Mbs speed. */ + props->max_vl_num = 1; /* VLCap = VL0 */ + props->init_type_reply = 0; + + props->max_mtu = IB_MTU_4096; + l = ipath_layer_get_ibmtu(dev->ib_unit); + switch (l) { + case 4096: + mtu = IB_MTU_4096; + break; + case 2048: + mtu = IB_MTU_2048; + break; + case 1024: + mtu = IB_MTU_1024; + break; + case 512: + mtu = IB_MTU_512; + break; + case 256: + mtu = IB_MTU_256; + break; + default: + mtu = IB_MTU_2048; + } + props->active_mtu = mtu; + props->subnet_timeout = dev->subnet_timeout; + + return 0; +} + +static int ipath_modify_device(struct ib_device *device, + int device_modify_mask, + struct ib_device_modify *device_modify) +{ + if (device_modify_mask & IB_DEVICE_MODIFY_SYS_IMAGE_GUID) + to_idev(device)->sys_image_guid = device_modify->sys_image_guid; + + return 0; +} + +static int ipath_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props) +{ + struct ipath_ibdev *dev = to_idev(ibdev); + + atomic_set_mask(props->set_port_cap_mask, &dev->port_cap_flags); + atomic_clear_mask(props->clr_port_cap_mask, &dev->port_cap_flags); + if (port_modify_mask & IB_PORT_SHUTDOWN) + ipath_kset_linkstate(dev->ib_unit << 16 | IPATH_IB_LINKDOWN); + if (port_modify_mask & IB_PORT_RESET_QKEY_CNTR) + dev->qkey_violations = 0; + return 0; +} + +static int ipath_query_pkey(struct ib_device *ibdev, + u8 port, u16 index, u16 *pkey) +{ + struct ipath_ibdev *dev = to_idev(ibdev); + + if (index >= ipath_layer_get_npkeys(dev->ib_unit)) + return -EINVAL; + *pkey = ipath_layer_get_pkey(dev->ib_unit, index); + return 0; +} + +static int ipath_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + struct ipath_ibdev *dev = to_idev(ibdev); + + if (index >= 1) + return -EINVAL; + gid->global.subnet_prefix = dev->gid_prefix; + gid->global.interface_id = ipath_layer_get_guid(dev->ib_unit); + + return 0; +} + +static struct ib_pd *ipath_alloc_pd(struct ib_device *ibdev, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + struct ipath_pd *pd; + + pd = kmalloc(sizeof *pd, GFP_KERNEL); + if (!pd) + return ERR_PTR(-ENOMEM); + + /* ib_alloc_pd() will initialize pd->ibpd. */ + pd->user = udata != NULL; + + return &pd->ibpd; +} + +static int ipath_dealloc_pd(struct ib_pd *ibpd) +{ + struct ipath_pd *pd = to_ipd(ibpd); + + kfree(pd); + + return 0; +} + +/* + * This may be called from interrupt context. + */ +static struct ib_ah *ipath_create_ah(struct ib_pd *pd, + struct ib_ah_attr *ah_attr) +{ + struct ipath_ah *ah; + + ah = kmalloc(sizeof *ah, GFP_ATOMIC); + if (!ah) + return ERR_PTR(-ENOMEM); + + /* ib_create_ah() will initialize ah->ibah. 
*/ + ah->attr = *ah_attr; + + return &ah->ibah; +} + +/* + * This may be called from interrupt context. + */ +static int ipath_destroy_ah(struct ib_ah *ibah) +{ + struct ipath_ah *ah = to_iah(ibah); + + kfree(ah); + + return 0; +} + +static struct ib_mr *ipath_get_dma_mr(struct ib_pd *pd, int acc) +{ + struct ipath_mr *mr; + + mr = kmalloc(sizeof *mr, GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + /* ib_get_dma_mr() will initialize mr->ibmr except for lkey and rkey. */ + memset(mr, 0, sizeof *mr); + mr->mr.access_flags = acc; + return &mr->ibmr; +} + +static struct ib_mr *ipath_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, + int acc, u64 *iova_start) +{ + struct ipath_mr *mr; + int n, m, i; + + /* Allocate struct plus pointers to first level page tables. */ + m = (num_phys_buf + IPATH_SEGSZ - 1) / IPATH_SEGSZ; + mr = kmalloc(sizeof *mr + m * sizeof mr->mr.map[0], GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + /* Allocate first level page tables. */ + for (i = 0; i < m; i++) { + mr->mr.map[i] = kmalloc(sizeof *mr->mr.map[0], GFP_KERNEL); + if (!mr->mr.map[i]) { + while (i) + kfree(mr->mr.map[--i]); + kfree(mr); + return ERR_PTR(-ENOMEM); + } + } + mr->mr.mapsz = m; + + /* + * ib_reg_phys_mr() will initialize mr->ibmr except for + * lkey and rkey. + */ + if (!ipath_alloc_lkey(&to_idev(pd->device)->lk_table, &mr->mr)) { + while (i) + kfree(mr->mr.map[--i]); + kfree(mr); + return ERR_PTR(-ENOMEM); + } + mr->ibmr.rkey = mr->ibmr.lkey = mr->mr.lkey; + mr->mr.user_base = *iova_start; + mr->mr.iova = *iova_start; + mr->mr.length = 0; + mr->mr.offset = 0; + mr->mr.access_flags = acc; + mr->mr.max_segs = num_phys_buf; + m = 0; + n = 0; + for (i = 0; i < num_phys_buf; i++) { + mr->mr.map[m]->segs[n].vaddr = + phys_to_virt(buffer_list[i].addr); + mr->mr.map[m]->segs[n].length = buffer_list[i].size; + mr->mr.length += buffer_list[i].size; + if (++n == IPATH_SEGSZ) { + m++; + n = 0; + } + } + return &mr->ibmr; +} + +static struct ib_mr *ipath_reg_user_mr(struct ib_pd *pd, + struct ib_umem *region, + int mr_access_flags, + struct ib_udata *udata) +{ + struct ipath_mr *mr; + struct ib_umem_chunk *chunk; + int n, m, i; + + n = 0; + list_for_each_entry(chunk, ®ion->chunk_list, list) + n += chunk->nents; + + /* Allocate struct plus pointers to first level page tables. */ + m = (n + IPATH_SEGSZ - 1) / IPATH_SEGSZ; + mr = kmalloc(sizeof *mr + m * sizeof mr->mr.map[0], GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + /* Allocate first level page tables. */ + for (i = 0; i < m; i++) { + mr->mr.map[i] = kmalloc(sizeof *mr->mr.map[0], GFP_KERNEL); + if (!mr->mr.map[i]) { + while (i) + kfree(mr->mr.map[--i]); + kfree(mr); + return ERR_PTR(-ENOMEM); + } + } + mr->mr.mapsz = m; + + /* + * ib_uverbs_reg_mr() will initialize mr->ibmr except for + * lkey and rkey. 
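+ * The region is described by a two-level table: mr->mr.map[] holds
+ * mapsz first-level pointers, each covering IPATH_SEGSZ
+ * (vaddr, length) segments, hence the m = ceil(n / IPATH_SEGSZ)
+ * allocations above.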
+ */ + if (!ipath_alloc_lkey(&to_idev(pd->device)->lk_table, &mr->mr)) { + while (i) + kfree(mr->mr.map[--i]); + kfree(mr); + return ERR_PTR(-ENOMEM); + } + mr->ibmr.rkey = mr->ibmr.lkey = mr->mr.lkey; + mr->mr.user_base = region->user_base; + mr->mr.iova = region->virt_base; + mr->mr.length = region->length; + mr->mr.offset = region->offset; + mr->mr.access_flags = mr_access_flags; + mr->mr.max_segs = n; + m = 0; + n = 0; + list_for_each_entry(chunk, ®ion->chunk_list, list) { + for (i = 0; i < chunk->nmap; i++) { + mr->mr.map[m]->segs[n].vaddr = + page_address(chunk->page_list[i].page); + mr->mr.map[m]->segs[n].length = region->page_size; + if (++n == IPATH_SEGSZ) { + m++; + n = 0; + } + } + } + return &mr->ibmr; +} + +/* + * Note that this is called to free MRs created by + * ipath_get_dma_mr() or ipath_reg_user_mr(). + */ +static int ipath_dereg_mr(struct ib_mr *ibmr) +{ + struct ipath_mr *mr = to_imr(ibmr); + int i; + + ipath_free_lkey(&to_idev(ibmr->device)->lk_table, ibmr->lkey); + i = mr->mr.mapsz; + while (i) + kfree(mr->mr.map[--i]); + kfree(mr); + return 0; +} + +static struct ib_fmr *ipath_alloc_fmr(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr) +{ + struct ipath_fmr *fmr; + int m, i; + + /* Allocate struct plus pointers to first level page tables. */ + m = (fmr_attr->max_pages + IPATH_SEGSZ - 1) / IPATH_SEGSZ; + fmr = kmalloc(sizeof *fmr + m * sizeof fmr->mr.map[0], GFP_KERNEL); + if (!fmr) + return ERR_PTR(-ENOMEM); + + /* Allocate first level page tables. */ + for (i = 0; i < m; i++) { + fmr->mr.map[i] = kmalloc(sizeof *fmr->mr.map[0], GFP_KERNEL); + if (!fmr->mr.map[i]) { + while (i) + kfree(fmr->mr.map[--i]); + kfree(fmr); + return ERR_PTR(-ENOMEM); + } + } + fmr->mr.mapsz = m; + + /* ib_alloc_fmr() will initialize fmr->ibfmr except for lkey & rkey. */ + if (!ipath_alloc_lkey(&to_idev(pd->device)->lk_table, &fmr->mr)) { + while (i) + kfree(fmr->mr.map[--i]); + kfree(fmr); + return ERR_PTR(-ENOMEM); + } + fmr->ibfmr.rkey = fmr->ibfmr.lkey = fmr->mr.lkey; + /* Resources are allocated but no valid mapping (RKEY can't be used). */ + fmr->mr.user_base = 0; + fmr->mr.iova = 0; + fmr->mr.length = 0; + fmr->mr.offset = 0; + fmr->mr.access_flags = mr_access_flags; + fmr->mr.max_segs = fmr_attr->max_pages; + fmr->page_size = fmr_attr->page_size; + return &fmr->ibfmr; +} + +/* + * This may be called from interrupt context. + * XXX Can we ever be called to map a portion of the RKEY space? 
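+ * ipath_map_phys_fmr() below just rewrites the segment array in
+ * place under the lkey-table lock; there is no flush or fence, so
+ * the caller is presumably expected to unmap before reusing the
+ * RKEY for a different range.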
+ */ +static int ipath_map_phys_fmr(struct ib_fmr *ibfmr, + u64 * page_list, int list_len, u64 iova) +{ + struct ipath_fmr *fmr = to_ifmr(ibfmr); + struct ipath_lkey_table *rkt; + unsigned long flags; + int m, n, i; + u32 ps; + + if (list_len > fmr->mr.max_segs) + return -EINVAL; + rkt = &to_idev(ibfmr->device)->lk_table; + spin_lock_irqsave(&rkt->lock, flags); + fmr->mr.user_base = iova; + fmr->mr.iova = iova; + ps = 1 << fmr->page_size; + fmr->mr.length = list_len * ps; + m = 0; + n = 0; + ps = 1 << fmr->page_size; + for (i = 0; i < list_len; i++) { + fmr->mr.map[m]->segs[n].vaddr = phys_to_virt(page_list[i]); + fmr->mr.map[m]->segs[n].length = ps; + if (++n == IPATH_SEGSZ) { + m++; + n = 0; + } + } + spin_unlock_irqrestore(&rkt->lock, flags); + return 0; +} + +static int ipath_unmap_fmr(struct list_head *fmr_list) +{ + struct ipath_fmr *fmr; + + list_for_each_entry(fmr, fmr_list, ibfmr.list) { + fmr->mr.user_base = 0; + fmr->mr.iova = 0; + fmr->mr.length = 0; + } + return 0; +} + +static int ipath_dealloc_fmr(struct ib_fmr *ibfmr) +{ + struct ipath_fmr *fmr = to_ifmr(ibfmr); + int i; + + ipath_free_lkey(&to_idev(ibfmr->device)->lk_table, ibfmr->lkey); + i = fmr->mr.mapsz; + while (i) + kfree(fmr->mr.map[--i]); + kfree(fmr); + return 0; +} + +static ssize_t show_rev(struct class_device *cdev, char *buf) +{ + struct ipath_ibdev *dev = + container_of(cdev, struct ipath_ibdev, ibdev.class_dev); + int vendor, boardrev, majrev, minrev; + + ipath_layer_query_device(dev->ib_unit, &vendor, &boardrev, + &majrev, &minrev); + return sprintf(buf, "%d.%d\n", majrev, minrev); +} + +static ssize_t show_hca(struct class_device *cdev, char *buf) +{ + struct ipath_ibdev *dev = + container_of(cdev, struct ipath_ibdev, ibdev.class_dev); + int vendor, boardrev, majrev, minrev; + + ipath_layer_query_device(dev->ib_unit, &vendor, &boardrev, + &majrev, &minrev); + ipath_get_boardname(dev->ib_unit, buf, 128); + strcat(buf, "\n"); + return strlen(buf); +} + +static ssize_t show_board(struct class_device *cdev, char *buf) +{ + struct ipath_ibdev *dev = + container_of(cdev, struct ipath_ibdev, ibdev.class_dev); + int vendor, boardrev, majrev, minrev; + + ipath_layer_query_device(dev->ib_unit, &vendor, &boardrev, + &majrev, &minrev); + ipath_get_boardname(dev->ib_unit, buf, 128); + strcat(buf, "\n"); + return strlen(buf); +} + +static ssize_t show_stats(struct class_device *cdev, char *buf) +{ + struct ipath_ibdev *dev = + container_of(cdev, struct ipath_ibdev, ibdev.class_dev); + char *p; + int i; + + sprintf(buf, + "RC resends %d\n" + "RC QACKs %d\n" + "RC ACKs %d\n" + "RC SEQ NAKs %d\n" + "RC RDMA seq %d\n" + "RC RNR NAKs %d\n" + "RC OTH NAKs %d\n" + "RC timeouts %d\n" + "RC RDMA dup %d\n" + "piobuf wait %d\n" + "no piobuf %d\n" + "PKT drops %d\n" + "WQE errs %d\n", + dev->n_rc_resends, dev->n_rc_qacks, dev->n_rc_acks, + dev->n_seq_naks, dev->n_rdma_seq, dev->n_rnr_naks, + dev->n_other_naks, dev->n_timeouts, dev->n_rdma_dup_busy, + dev->n_piowait, dev->n_no_piobuf, dev->n_pkt_drops, + dev->n_wqe_errs); + p = buf; + for (i = 0; i < ARRAY_SIZE(dev->opstats); i++) { + if (!dev->opstats[i].n_packets && !dev->opstats[i].n_bytes) + continue; + p += strlen(p); + sprintf(p, "%02x %llu/%llu\n", + i, dev->opstats[i].n_packets, dev->opstats[i].n_bytes); + } + return strlen(buf); +} + +static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); +static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); +static CLASS_DEVICE_ATTR(board_id, S_IRUGO, show_board, NULL); +static CLASS_DEVICE_ATTR(stats, S_IRUGO, show_stats, 
NULL); + +static struct class_device_attribute *ipath_class_attributes[] = { + &class_device_attr_hw_rev, + &class_device_attr_hca_type, + &class_device_attr_board_id, + &class_device_attr_stats +}; + +/* + * Allocate a ucontext. + */ + +static struct ib_ucontext *ipath_alloc_ucontext(struct ib_device *ibdev, + struct ib_udata *udata) +{ + struct ipath_ucontext *context; + + context = kmalloc(sizeof *context, GFP_KERNEL); + if (!context) + return ERR_PTR(-ENOMEM); + + return &context->ibucontext; +} + +static int ipath_dealloc_ucontext(struct ib_ucontext *context) +{ + kfree(to_iucontext(context)); + return 0; +} + +/* + * Register our device with the infiniband core. + */ +static int ipath_register_ib_device(const ipath_type t) +{ + struct ipath_ibdev *idev; + struct ib_device *dev; + int i; + int ret; + + idev = (struct ipath_ibdev *)ib_alloc_device(sizeof *idev); + if (idev == NULL) + return -ENOMEM; + + dev = &idev->ibdev; + + /* Only need to initialize non-zero fields. */ + spin_lock_init(&idev->qp_table.lock); + spin_lock_init(&idev->lk_table.lock); + idev->sm_lid = IB_LID_PERMISSIVE; + idev->gid_prefix = __constant_cpu_to_be64(0xfe80000000000000UL); + idev->qp_table.last = 1; /* QPN 0 and 1 are special. */ + idev->qp_table.max = ib_ipath_qp_table_size; + idev->qp_table.nmaps = 1; + idev->qp_table.table = kmalloc(idev->qp_table.max * + sizeof(*idev->qp_table.table), + GFP_KERNEL); + if (idev->qp_table.table == NULL) { + ret = -ENOMEM; + goto err_qp; + } + memset(idev->qp_table.table, 0, + idev->qp_table.max * sizeof(*idev->qp_table.table)); + for (i = 0; i < ARRAY_SIZE(idev->qp_table.map); i++) { + atomic_set(&idev->qp_table.map[i].n_free, BITS_PER_PAGE); + idev->qp_table.map[i].page = NULL; + } + /* + * The top ib_ipath_lkey_table_size bits are used to index the table. + * The lower 8 bits can be owned by the user (copied from the LKEY). + * The remaining bits act as a generation number or tag. + */ + idev->lk_table.max = 1 << ib_ipath_lkey_table_size; + idev->lk_table.table = kmalloc(idev->lk_table.max * + sizeof(*idev->lk_table.table), + GFP_KERNEL); + if (idev->lk_table.table == NULL) { + ret = -ENOMEM; + goto err_lk; + } + memset(idev->lk_table.table, 0, + idev->lk_table.max * sizeof(*idev->lk_table.table)); + spin_lock_init(&idev->pending_lock); + INIT_LIST_HEAD(&idev->pending[0]); + INIT_LIST_HEAD(&idev->pending[1]); + INIT_LIST_HEAD(&idev->pending[2]); + INIT_LIST_HEAD(&idev->piowait); + INIT_LIST_HEAD(&idev->rnrwait); + idev->pending_index = 0; + idev->port_cap_flags = + IB_PORT_SYS_IMAGE_GUID_SUP | IB_PORT_CLIENT_REG_SUP; + idev->pma_counter_select[0] = IB_PMA_PORT_XMIT_DATA; + idev->pma_counter_select[1] = IB_PMA_PORT_RCV_DATA; + idev->pma_counter_select[2] = IB_PMA_PORT_XMIT_PKTS; + idev->pma_counter_select[3] = IB_PMA_PORT_RCV_PKTS; + idev->pma_counter_select[5] = IB_PMA_PORT_XMIT_WAIT; + + /* + * The system image GUI is supposed to be the same for all + * IB HCAs in a single system. + * Note that this code assumes device zero is found first. + */ + idev->sys_image_guid = + t ? 
ipath_devices[t]->sys_image_guid : ipath_layer_get_guid(t); + idev->ib_unit = t; + + strlcpy(dev->name, "ipath%d", IB_DEVICE_NAME_MAX); + dev->node_guid = ipath_layer_get_guid(t); + dev->uverbs_abi_ver = IPATH_UVERBS_ABI_VERSION; + dev->uverbs_cmd_mask = + (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) | + (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) | + (1ull << IB_USER_VERBS_CMD_QUERY_PORT) | + (1ull << IB_USER_VERBS_CMD_ALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_DEALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_CREATE_AH) | + (1ull << IB_USER_VERBS_CMD_DESTROY_AH) | + (1ull << IB_USER_VERBS_CMD_REG_MR) | + (1ull << IB_USER_VERBS_CMD_DEREG_MR) | + (1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) | + (1ull << IB_USER_VERBS_CMD_CREATE_CQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_CQ) | + (1ull << IB_USER_VERBS_CMD_POLL_CQ) | + (1ull << IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) | + (1ull << IB_USER_VERBS_CMD_CREATE_QP) | + (1ull << IB_USER_VERBS_CMD_MODIFY_QP) | + (1ull << IB_USER_VERBS_CMD_DESTROY_QP) | + (1ull << IB_USER_VERBS_CMD_POST_SEND) | + (1ull << IB_USER_VERBS_CMD_POST_RECV) | + (1ull << IB_USER_VERBS_CMD_ATTACH_MCAST) | + (1ull << IB_USER_VERBS_CMD_DETACH_MCAST) | + (1ull << IB_USER_VERBS_CMD_CREATE_SRQ) | + (1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ) | + (1ull << IB_USER_VERBS_CMD_POST_SRQ_RECV); + dev->node_type = IB_NODE_CA; + dev->phys_port_cnt = 1; + dev->dma_device = ipath_layer_get_pcidev(t); + dev->class_dev.dev = dev->dma_device; + dev->query_device = ipath_query_device; + dev->modify_device = ipath_modify_device; + dev->query_port = ipath_query_port; + dev->modify_port = ipath_modify_port; + dev->query_pkey = ipath_query_pkey; + dev->query_gid = ipath_query_gid; + dev->alloc_ucontext = ipath_alloc_ucontext; + dev->dealloc_ucontext = ipath_dealloc_ucontext; + dev->alloc_pd = ipath_alloc_pd; + dev->dealloc_pd = ipath_dealloc_pd; + dev->create_ah = ipath_create_ah; + dev->destroy_ah = ipath_destroy_ah; + dev->create_srq = ipath_create_srq; + dev->modify_srq = ipath_modify_srq; + dev->destroy_srq = ipath_destroy_srq; + dev->create_qp = ipath_create_qp; + dev->modify_qp = ipath_modify_qp; + dev->destroy_qp = ipath_destroy_qp; + dev->post_send = ipath_post_send; + dev->post_recv = ipath_post_receive; + dev->post_srq_recv = ipath_post_srq_receive; + dev->create_cq = ipath_create_cq; + dev->destroy_cq = ipath_destroy_cq; + dev->poll_cq = ipath_poll_cq; + dev->req_notify_cq = ipath_req_notify_cq; + dev->get_dma_mr = ipath_get_dma_mr; + dev->reg_phys_mr = ipath_reg_phys_mr; + dev->reg_user_mr = ipath_reg_user_mr; + dev->dereg_mr = ipath_dereg_mr; + dev->alloc_fmr = ipath_alloc_fmr; + dev->map_phys_fmr = ipath_map_phys_fmr; + dev->unmap_fmr = ipath_unmap_fmr; + dev->dealloc_fmr = ipath_dealloc_fmr; + dev->attach_mcast = ipath_multicast_attach; + dev->detach_mcast = ipath_multicast_detach; + dev->process_mad = ipath_process_mad; + + ret = ib_register_device(dev); + if (ret) + goto err_reg; + + for (i = 0; i < ARRAY_SIZE(ipath_class_attributes); ++i) { + ret = class_device_create_file(&dev->class_dev, + ipath_class_attributes[i]); + if (ret) + goto err_class; + } + + ipath_layer_enable_timer(t); + + ipath_devices[t] = idev; + return 0; + +err_class: + ib_unregister_device(dev); +err_reg: + kfree(idev->lk_table.table); +err_lk: + kfree(idev->qp_table.table); +err_qp: + ib_dealloc_device(dev); + return ret; +} + +static void ipath_unregister_ib_device(struct ipath_ibdev *dev) +{ + struct ib_device *ibdev = &dev->ibdev; + + ipath_layer_disable_timer(dev->ib_unit); + + 
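/*
+ * Unregister from the IB core first so that no new verbs calls
+ * can arrive while the QP and lkey tables below are torn down.
+ */
+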
ib_unregister_device(ibdev); + + if (!list_empty(&dev->pending[0]) || !list_empty(&dev->pending[1]) || + !list_empty(&dev->pending[2])) + _VERBS_ERROR("ipath%d pending list not empty!\n", dev->ib_unit); + if (!list_empty(&dev->piowait)) + _VERBS_ERROR("ipath%d piowait list not empty!\n", dev->ib_unit); + if (!list_empty(&dev->rnrwait)) + _VERBS_ERROR("ipath%d rnrwait list not empty!\n", dev->ib_unit); + if (mcast_tree.rb_node != NULL) + _VERBS_ERROR("ipath%d multicast table memory leak!\n", + dev->ib_unit); + /* + * Note that ipath_unregister_ib_device() can be called before all + * the QPs are destroyed! + */ + ipath_free_all_qps(&dev->qp_table); + kfree(dev->qp_table.table); + kfree(dev->lk_table.table); + ib_dealloc_device(ibdev); +} + +int __init ipath_verbs_init(void) +{ + int i; + + number_of_devices = ipath_layer_get_num_of_dev(); + i = number_of_devices * sizeof(struct ipath_ibdev *); + ipath_devices = kmalloc(i, GFP_ATOMIC); + if (ipath_devices == NULL) + return -ENOMEM; + + for (i = 0; i < number_of_devices; i++) { + int ret = ipath_verbs_register(i, ipath_ib_piobufavail, + ipath_ib_rcv, ipath_ib_timer); + + if (ret == 0) + ipath_devices[i] = NULL; + else if ((ret = ipath_register_ib_device(i)) != 0) { + _VERBS_ERROR("ib_ipath%d cannot register ib device " + "(%d)!\n", i, ret); + ipath_verbs_unregister(i); + ipath_devices[i] = NULL; + } + } + + return 0; +} + +void __exit ipath_verbs_cleanup(void) +{ + int i; + + for (i = 0; i < number_of_devices; i++) + if (ipath_devices[i]) { + ipath_unregister_ib_device(ipath_devices[i]); + ipath_verbs_unregister(i); + } + + kfree(ipath_devices); +} + +module_init(ipath_verbs_init); +module_exit(ipath_verbs_cleanup); -- 0.99.9n From rolandd at cisco.com Fri Dec 16 15:48:55 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 16 Dec 2005 15:48:55 -0800 Subject: [openib-general] [PATCH 10/13] [RFC] ipath verbs, part 1 In-Reply-To: <200512161548.zxp6FKcabEu47EnS@cisco.com> Message-ID: <200512161548.W9sJn4CLmdhnSTcH@cisco.com> First half of ipath verbs driver --- drivers/infiniband/hw/ipath/ipath_verbs.c | 3244 +++++++++++++++++++++++++++++ 1 files changed, 3244 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/hw/ipath/ipath_verbs.c 72075ecec75f8c42e444a7d7d8ffcf340a845b96 diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c new file mode 100644 index 0000000..808326e --- /dev/null +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -0,0 +1,3244 @@ +/* + * Copyright (c) 2005. PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. + * + * $Id: ipath_verbs.c 4491 2005-12-15 22:20:31Z rjwalsh $ + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#include "ipath_common.h" +#include "ips_common.h" +#include "ipath_layer.h" +#include "ipath_verbs.h" + +/* + * Compare the lower 24 bits of the two values. + * Returns an integer <, ==, or > than zero. + */ +static inline int cmp24(u32 a, u32 b) +{ + return (((int) a) - ((int) b)) << 8; +} + +#define MODNAME "ib_ipath" +#define DRIVER_LOAD_MSG "PathScale " MODNAME " loaded: " +#define PFX MODNAME ": " + + +/* Not static, because we don't want the compiler removing it */ +const char ipath_verbs_version[] = "ipath_verbs " _IPATH_IDSTR; + +unsigned int ib_ipath_qp_table_size = 251; +module_param(ib_ipath_qp_table_size, uint, 0444); +MODULE_PARM_DESC(ib_ipath_qp_table_size, "QP table size"); + +unsigned int ib_ipath_lkey_table_size = 12; +module_param(ib_ipath_lkey_table_size, uint, 0444); +MODULE_PARM_DESC(ib_ipath_lkey_table_size, + "LKEY table size in bits (2^n, 1 <= n <= 23)"); + +unsigned int ib_ipath_debug; /* debug mask */ +module_param(ib_ipath_debug, uint, 0644); +MODULE_PARM_DESC(ib_ipath_debug, "Verbs debug mask"); + + +static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_sge_state *ss, + u32 len, struct ib_send_wr *wr, struct ib_wc *wc); +static void ipath_ruc_loopback(struct ipath_qp *sqp, struct ib_wc *wc); +static int ipath_destroy_qp(struct ib_qp *ibqp); + +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("PathScale "); +MODULE_DESCRIPTION("Pathscale InfiniPath driver"); + +enum { + IPATH_FAULT_RC_DROP_SEND_F = 1, + IPATH_FAULT_RC_DROP_SEND_M, + IPATH_FAULT_RC_DROP_SEND_L, + IPATH_FAULT_RC_DROP_SEND_O, + IPATH_FAULT_RC_DROP_RDMA_WRITE_F, + IPATH_FAULT_RC_DROP_RDMA_WRITE_M, + IPATH_FAULT_RC_DROP_RDMA_WRITE_L, + IPATH_FAULT_RC_DROP_RDMA_WRITE_O, + IPATH_FAULT_RC_DROP_RDMA_READ_RESP_F, + IPATH_FAULT_RC_DROP_RDMA_READ_RESP_M, + IPATH_FAULT_RC_DROP_RDMA_READ_RESP_L, + IPATH_FAULT_RC_DROP_RDMA_READ_RESP_O, + IPATH_FAULT_RC_DROP_ACK, +}; + +enum { + IPATH_TRANS_INVALID = 0, + IPATH_TRANS_ANY2RST, + IPATH_TRANS_RST2INIT, + IPATH_TRANS_INIT2INIT, + IPATH_TRANS_INIT2RTR, + IPATH_TRANS_RTR2RTS, + IPATH_TRANS_RTS2RTS, + IPATH_TRANS_SQERR2RTS, + IPATH_TRANS_ANY2ERR, + IPATH_TRANS_RTS2SQD, /* XXX Wait for expected ACKs & signal event */ + IPATH_TRANS_SQD2SQD, /* error if not drained & parameter change */ + IPATH_TRANS_SQD2RTS, /* error if not drained */ +}; + +enum { + IPATH_POST_SEND_OK = 0x0001, + IPATH_POST_RECV_OK = 0x0002, + IPATH_PROCESS_RECV_OK = 0x0004, + IPATH_PROCESS_SEND_OK = 0x0008, +}; + +static int state_ops[IB_QPS_ERR + 1] = { + [IB_QPS_RESET] = 0, + [IB_QPS_INIT] = IPATH_POST_RECV_OK, + [IB_QPS_RTR] = IPATH_POST_RECV_OK | IPATH_PROCESS_RECV_OK, + [IB_QPS_RTS] = IPATH_POST_RECV_OK | IPATH_PROCESS_RECV_OK | + IPATH_POST_SEND_OK | IPATH_PROCESS_SEND_OK, + 
[IB_QPS_SQD] = IPATH_POST_RECV_OK | IPATH_PROCESS_RECV_OK | + IPATH_POST_SEND_OK, + [IB_QPS_SQE] = IPATH_POST_RECV_OK | IPATH_PROCESS_RECV_OK, + [IB_QPS_ERR] = 0, +}; + +/* + * Convert the AETH credit code into the number of credits. + */ +static u32 credit_table[31] = { + 0, /* 0 */ + 1, /* 1 */ + 2, /* 2 */ + 3, /* 3 */ + 4, /* 4 */ + 6, /* 5 */ + 8, /* 6 */ + 12, /* 7 */ + 16, /* 8 */ + 24, /* 9 */ + 32, /* A */ + 48, /* B */ + 64, /* C */ + 96, /* D */ + 128, /* E */ + 192, /* F */ + 256, /* 10 */ + 384, /* 11 */ + 512, /* 12 */ + 768, /* 13 */ + 1024, /* 14 */ + 1536, /* 15 */ + 2048, /* 16 */ + 3072, /* 17 */ + 4096, /* 18 */ + 6144, /* 19 */ + 8192, /* 1A */ + 12288, /* 1B */ + 16384, /* 1C */ + 24576, /* 1D */ + 32768 /* 1E */ +}; + +/* + * Convert the AETH RNR timeout code into the number of milliseconds. + */ +static u32 rnr_table[32] = { + 656, /* 0 */ + 1, /* 1 */ + 1, /* 2 */ + 1, /* 3 */ + 1, /* 4 */ + 1, /* 5 */ + 1, /* 6 */ + 1, /* 7 */ + 1, /* 8 */ + 1, /* 9 */ + 1, /* A */ + 1, /* B */ + 1, /* C */ + 1, /* D */ + 2, /* E */ + 2, /* F */ + 3, /* 10 */ + 4, /* 11 */ + 6, /* 12 */ + 8, /* 13 */ + 11, /* 14 */ + 16, /* 15 */ + 21, /* 16 */ + 31, /* 17 */ + 41, /* 18 */ + 62, /* 19 */ + 82, /* 1A */ + 123, /* 1B */ + 164, /* 1C */ + 246, /* 1D */ + 328, /* 1E */ + 492 /* 1F */ +}; + +/* + * Translate ib_wr_opcode into ib_wc_opcode. + */ +static enum ib_wc_opcode wc_opcode[] = { + [IB_WR_RDMA_WRITE] = IB_WC_RDMA_WRITE, + [IB_WR_RDMA_WRITE_WITH_IMM] = IB_WC_RDMA_WRITE, + [IB_WR_SEND] = IB_WC_SEND, + [IB_WR_SEND_WITH_IMM] = IB_WC_SEND, + [IB_WR_RDMA_READ] = IB_WC_RDMA_READ, + [IB_WR_ATOMIC_CMP_AND_SWP] = IB_WC_COMP_SWAP, + [IB_WR_ATOMIC_FETCH_AND_ADD] = IB_WC_FETCH_ADD +}; + +/* + * Array of device pointers. + */ +static uint32_t number_of_devices; +static struct ipath_ibdev **ipath_devices; + +/* + * Global table of GID to attached QPs. + * The table is global to all ipath devices since a send from one QP/device + * needs to be locally routed to any locally attached QPs on the same + * or different device. + */ +static struct rb_root mcast_tree; +static spinlock_t mcast_lock = SPIN_LOCK_UNLOCKED; + +/* + * Allocate a structure to link a QP to the multicast GID structure. + */ +static struct ipath_mcast_qp *ipath_mcast_qp_alloc(struct ipath_qp *qp) +{ + struct ipath_mcast_qp *mqp; + + mqp = kmalloc(sizeof(*mqp), GFP_KERNEL); + if (!mqp) + return NULL; + + mqp->qp = qp; + atomic_inc(&qp->refcount); + + return mqp; +} + +static void ipath_mcast_qp_free(struct ipath_mcast_qp *mqp) +{ + struct ipath_qp *qp = mqp->qp; + + /* Notify ipath_destroy_qp() if it is waiting. */ + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); + + kfree(mqp); +} + +/* + * Allocate a structure for the multicast GID. + * A list of QPs will be attached to this structure. + */ +static struct ipath_mcast *ipath_mcast_alloc(union ib_gid *mgid) +{ + struct ipath_mcast *mcast; + + mcast = kmalloc(sizeof(*mcast), GFP_KERNEL); + if (!mcast) + return NULL; + + mcast->mgid = *mgid; + INIT_LIST_HEAD(&mcast->qp_list); + init_waitqueue_head(&mcast->wait); + atomic_set(&mcast->refcount, 0); + + return mcast; +} + +static void ipath_mcast_free(struct ipath_mcast *mcast) +{ + struct ipath_mcast_qp *p, *tmp; + + list_for_each_entry_safe(p, tmp, &mcast->qp_list, list) + ipath_mcast_qp_free(p); + + kfree(mcast); +} + +/* + * Search the global table for the given multicast GID. + * Return it or NULL if not found. + * The caller is responsible for decrementing the reference count if found. 
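
A standalone illustration of the wraparound semantics of cmp24() near the top of this file, which orders 24-bit PSNs circularly (modulo 2^24) by shifting their difference up into the sign bit. Plain userspace C, assuming a 32-bit int as the driver does; not part of the patch.

#include <stdio.h>

/* Same trick as cmp24() above: shift the 24-bit difference so that its
 * sign reflects circular (modulo 2^24) ordering of the two PSNs. */
static int cmp24(unsigned int a, unsigned int b)
{
	return (((int) a) - ((int) b)) << 8;
}

int main(void)
{
	/* PSN 0x000002 follows 0xFFFFF0, though numerically smaller. */
	printf("%d\n", cmp24(0x000002, 0xFFFFF0) > 0);	/* 1 */
	printf("%d\n", cmp24(0xFFFFF0, 0x000002) < 0);	/* 1 */
	return 0;
}
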
+ */
+static struct ipath_mcast *ipath_mcast_find(union ib_gid *mgid)
+{
+	struct rb_node *n;
+	unsigned long flags;
+
+	spin_lock_irqsave(&mcast_lock, flags);
+	n = mcast_tree.rb_node;
+	while (n) {
+		struct ipath_mcast *mcast;
+		int ret;
+
+		mcast = rb_entry(n, struct ipath_mcast, rb_node);
+
+		ret = memcmp(mgid->raw, mcast->mgid.raw, sizeof(union ib_gid));
+		if (ret < 0)
+			n = n->rb_left;
+		else if (ret > 0)
+			n = n->rb_right;
+		else {
+			atomic_inc(&mcast->refcount);
+			spin_unlock_irqrestore(&mcast_lock, flags);
+			return mcast;
+		}
+	}
+	spin_unlock_irqrestore(&mcast_lock, flags);
+
+	return NULL;
+}
+
+/*
+ * Insert the multicast GID into the table and
+ * attach the QP structure.
+ * Return zero if both were added.
+ * Return EEXIST if the GID was already in the table but the QP was added.
+ * Return ESRCH if the QP was already attached and neither structure was added.
+ */
+static int ipath_mcast_add(struct ipath_mcast *mcast,
+			   struct ipath_mcast_qp *mqp)
+{
+	struct rb_node **n = &mcast_tree.rb_node;
+	struct rb_node *pn = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&mcast_lock, flags);
+
+	while (*n) {
+		struct ipath_mcast *tmcast;
+		struct ipath_mcast_qp *p;
+		int ret;
+
+		pn = *n;
+		tmcast = rb_entry(pn, struct ipath_mcast, rb_node);
+
+		ret = memcmp(mcast->mgid.raw, tmcast->mgid.raw,
+			     sizeof(union ib_gid));
+		if (ret < 0) {
+			n = &pn->rb_left;
+			continue;
+		}
+		if (ret > 0) {
+			n = &pn->rb_right;
+			continue;
+		}
+
+		/* Search the QP list to see if this is already there. */
+		list_for_each_entry_rcu(p, &tmcast->qp_list, list) {
+			if (p->qp == mqp->qp) {
+				spin_unlock_irqrestore(&mcast_lock, flags);
+				return ESRCH;
+			}
+		}
+		list_add_tail_rcu(&mqp->list, &tmcast->qp_list);
+		spin_unlock_irqrestore(&mcast_lock, flags);
+		return EEXIST;
+	}
+
+	list_add_tail_rcu(&mqp->list, &mcast->qp_list);
+
+	atomic_inc(&mcast->refcount);
+	rb_link_node(&mcast->rb_node, pn, n);
+	rb_insert_color(&mcast->rb_node, &mcast_tree);
+
+	spin_unlock_irqrestore(&mcast_lock, flags);
+
+	return 0;
+}
+
+static int ipath_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid,
+				  u16 lid)
+{
+	struct ipath_qp *qp = to_iqp(ibqp);
+	struct ipath_mcast *mcast;
+	struct ipath_mcast_qp *mqp;
+
+	/*
+	 * Allocate data structures since it's better to do this outside of
+	 * spin locks and it will most likely be needed.
+	 */
+	mcast = ipath_mcast_alloc(gid);
+	if (mcast == NULL)
+		return -ENOMEM;
+	mqp = ipath_mcast_qp_alloc(qp);
+	if (mqp == NULL) {
+		ipath_mcast_free(mcast);
+		return -ENOMEM;
+	}
+	switch (ipath_mcast_add(mcast, mqp)) {
+	case ESRCH:
+		/* Neither was used: can't attach the same QP twice. */
+		ipath_mcast_qp_free(mqp);
+		ipath_mcast_free(mcast);
+		return -EINVAL;
+	case EEXIST:		/* The mcast wasn't used */
+		ipath_mcast_free(mcast);
+		break;
+	default:
+		break;
+	}
+	return 0;
+}
+
+static int ipath_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid,
+				  u16 lid)
+{
+	struct ipath_qp *qp = to_iqp(ibqp);
+	struct ipath_mcast *mcast = NULL;
+	struct ipath_mcast_qp *p, *tmp;
+	struct rb_node *n;
+	unsigned long flags;
+	int last = 0;
+
+	spin_lock_irqsave(&mcast_lock, flags);
+
+	/* Find the GID in the mcast table. */
+	n = mcast_tree.rb_node;
+	while (1) {
+		int ret;
+
+		if (n == NULL) {
+			spin_unlock_irqrestore(&mcast_lock, flags);
+			return 0;
+		}
+
+		mcast = rb_entry(n, struct ipath_mcast, rb_node);
+		ret = memcmp(gid->raw, mcast->mgid.raw, sizeof(union ib_gid));
+		if (ret < 0)
+			n = n->rb_left;
+		else if (ret > 0)
+			n = n->rb_right;
+		else
+			break;
+	}
+
+	/* Search the QP list.
*/ + list_for_each_entry_safe(p, tmp, &mcast->qp_list, list) { + if (p->qp != qp) + continue; + /* + * We found it, so remove it, but don't poison the forward link + * until we are sure there are no list walkers. + */ + list_del_rcu(&p->list); + + /* If this was the last attached QP, remove the GID too. */ + if (list_empty(&mcast->qp_list)) { + rb_erase(&mcast->rb_node, &mcast_tree); + last = 1; + } + break; + } + + spin_unlock_irqrestore(&mcast_lock, flags); + + if (p) { + /* + * Wait for any list walkers to finish before freeing the + * list element. + */ + wait_event(mcast->wait, atomic_read(&mcast->refcount) <= 1); + ipath_mcast_qp_free(p); + } + if (last) { + atomic_dec(&mcast->refcount); + wait_event(mcast->wait, !atomic_read(&mcast->refcount)); + ipath_mcast_free(mcast); + } + + return 0; +} + +/* + * Copy data to SGE memory. + */ +static void copy_sge(struct ipath_sge_state *ss, void *data, u32 length) +{ + struct ipath_sge *sge = &ss->sge; + + while (length) { + u32 len = sge->length; + + BUG_ON(len == 0); + if (len > length) + len = length; + memcpy(sge->vaddr, data, len); + sge->vaddr += len; + sge->length -= len; + sge->sge_length -= len; + if (sge->sge_length == 0) { + if (--ss->num_sge) + *sge = *ss->sg_list++; + } else if (sge->length == 0 && sge->mr != NULL) { + if (++sge->n >= IPATH_SEGSZ) { + if (++sge->m >= sge->mr->mapsz) + break; + sge->n = 0; + } + sge->vaddr = sge->mr->map[sge->m]->segs[sge->n].vaddr; + sge->length = sge->mr->map[sge->m]->segs[sge->n].length; + } + data += len; + length -= len; + } +} + +/* + * Skip over length bytes of SGE memory. + */ +static void skip_sge(struct ipath_sge_state *ss, u32 length) +{ + struct ipath_sge *sge = &ss->sge; + + while (length > sge->sge_length) { + length -= sge->sge_length; + ss->sge = *ss->sg_list++; + } + while (length) { + u32 len = sge->length; + + BUG_ON(len == 0); + if (len > length) + len = length; + sge->vaddr += len; + sge->length -= len; + sge->sge_length -= len; + if (sge->sge_length == 0) { + if (--ss->num_sge) + *sge = *ss->sg_list++; + } else if (sge->length == 0 && sge->mr != NULL) { + if (++sge->n >= IPATH_SEGSZ) { + if (++sge->m >= sge->mr->mapsz) + break; + sge->n = 0; + } + sge->vaddr = sge->mr->map[sge->m]->segs[sge->n].vaddr; + sge->length = sge->mr->map[sge->m]->segs[sge->n].length; + } + length -= len; + } +} + +static inline u32 alloc_qpn(struct ipath_qp_table *qpt) +{ + u32 i, offset, max_scan, qpn; + struct qpn_map *map; + + qpn = qpt->last + 1; + if (qpn >= QPN_MAX) + qpn = 2; + offset = qpn & BITS_PER_PAGE_MASK; + map = &qpt->map[qpn / BITS_PER_PAGE]; + max_scan = qpt->nmaps - !offset; + for (i = 0;;) { + if (unlikely(!map->page)) { + unsigned long page = get_zeroed_page(GFP_KERNEL); + unsigned long flags; + + /* + * Free the page if someone raced with us + * installing it: + */ + spin_lock_irqsave(&qpt->lock, flags); + if (map->page) + free_page(page); + else + map->page = (void *)page; + spin_unlock_irqrestore(&qpt->lock, flags); + if (unlikely(!map->page)) + break; + } + if (likely(atomic_read(&map->n_free))) { + do { + if (!test_and_set_bit(offset, map->page)) { + atomic_dec(&map->n_free); + qpt->last = qpn; + return qpn; + } + offset = find_next_offset(map, offset); + qpn = mk_qpn(qpt, map, offset); + /* + * This test differs from alloc_pidmap(). + * If find_next_offset() does find a zero bit, + * we don't need to check for QPN wrapping + * around past our starting QPN. We + * just need to be sure we don't loop forever. 
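
A simplified standalone version of the scatter/gather walk that copy_sge() and skip_sge() above perform, with the memory-region segment handling stripped out; illustrative only, not part of the patch.

#include <stdio.h>
#include <string.h>

struct sge { char *vaddr; unsigned int length; };

/* Scatter a flat buffer into a list of SGEs, advancing one SGE at a
 * time, the same walk copy_sge() performs (minus MR segments). */
static void copy_out(struct sge *sg, const char *data, unsigned int len)
{
	while (len) {
		unsigned int n = sg->length < len ? sg->length : len;

		memcpy(sg->vaddr, data, n);
		sg->vaddr += n;
		sg->length -= n;
		if (sg->length == 0)
			sg++;
		data += n;
		len -= n;
	}
}

int main(void)
{
	char a[3], b[5];
	struct sge sg[2] = { { a, sizeof(a) }, { b, sizeof(b) } };

	copy_out(sg, "12345678", 8);
	printf("%.3s %.5s\n", a, b);	/* 123 45678 */
	return 0;
}
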
+ */ + } while (offset < BITS_PER_PAGE && qpn < QPN_MAX); + } + /* + * In order to keep the number of pages allocated to a minimum, + * we scan the all existing pages before increasing the size + * of the bitmap table. + */ + if (++i > max_scan) { + if (qpt->nmaps == QPNMAP_ENTRIES) + break; + map = &qpt->map[qpt->nmaps++]; + offset = 0; + } else if (map < &qpt->map[qpt->nmaps]) { + ++map; + offset = 0; + } else { + map = &qpt->map[0]; + offset = 2; + } + qpn = mk_qpn(qpt, map, offset); + } + return 0; +} + +static inline void free_qpn(struct ipath_qp_table *qpt, u32 qpn) +{ + struct qpn_map *map; + + map = qpt->map + qpn / BITS_PER_PAGE; + if (map->page) + clear_bit(qpn & BITS_PER_PAGE_MASK, map->page); + atomic_inc(&map->n_free); +} + +/* + * Allocate the next available QPN and put the QP into the hash table. + * The hash table holds a reference to the QP. + */ +static int ipath_alloc_qpn(struct ipath_qp_table *qpt, struct ipath_qp *qp, + enum ib_qp_type type) +{ + unsigned long flags; + u32 qpn; + + if (type == IB_QPT_SMI) + qpn = 0; + else if (type == IB_QPT_GSI) + qpn = 1; + else { + /* Allocate the next available QPN */ + qpn = alloc_qpn(qpt); + if (qpn == 0) { + return -ENOMEM; + } + } + qp->ibqp.qp_num = qpn; + + /* Add the QP to the hash table. */ + spin_lock_irqsave(&qpt->lock, flags); + + qpn %= qpt->max; + qp->next = qpt->table[qpn]; + qpt->table[qpn] = qp; + atomic_inc(&qp->refcount); + + spin_unlock_irqrestore(&qpt->lock, flags); + return 0; +} + +/* + * Remove the QP from the table so it can't be found asynchronously by + * the receive interrupt routine. + */ +static void ipath_free_qp(struct ipath_qp_table *qpt, struct ipath_qp *qp) +{ + struct ipath_qp *q, **qpp; + unsigned long flags; + int fnd = 0; + + spin_lock_irqsave(&qpt->lock, flags); + + /* Remove QP from the hash table. */ + qpp = &qpt->table[qp->ibqp.qp_num % qpt->max]; + for (; (q = *qpp) != NULL; qpp = &q->next) { + if (q == qp) { + *qpp = qp->next; + qp->next = NULL; + atomic_dec(&qp->refcount); + fnd = 1; + break; + } + } + + spin_unlock_irqrestore(&qpt->lock, flags); + + if (!fnd) + return; + + /* If QPN is not reserved, mark QPN free in the bitmap. */ + if (qp->ibqp.qp_num > 1) + free_qpn(qpt, qp->ibqp.qp_num); + + wait_event(qp->wait, !atomic_read(&qp->refcount)); +} + +/* + * Remove all QPs from the table. + */ +static void ipath_free_all_qps(struct ipath_qp_table *qpt) +{ + unsigned long flags; + struct ipath_qp *qp, *nqp; + u32 n; + + for (n = 0; n < qpt->max; n++) { + spin_lock_irqsave(&qpt->lock, flags); + qp = qpt->table[n]; + qpt->table[n] = NULL; + spin_unlock_irqrestore(&qpt->lock, flags); + + while (qp) { + nqp = qp->next; + if (qp->ibqp.qp_num > 1) + free_qpn(qpt, qp->ibqp.qp_num); + if (!atomic_dec_and_test(&qp->refcount) || + !ipath_destroy_qp(&qp->ibqp)) + _VERBS_INFO("QP memory leak!\n"); + qp = nqp; + } + } + + for (n = 0; n < ARRAY_SIZE(qpt->map); n++) { + if (qpt->map[n].page) + free_page((unsigned long)qpt->map[n].page); + } +} + +/* + * Return the QP with the given QPN. + * The caller is responsible for decrementing the QP reference count when done. 
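
The QPN allocator above is essentially a bitmap with QPN 0 (SMI) and QPN 1 (GSI) permanently reserved. A toy userspace equivalent, first-fit with no paging or locking; purely illustrative.

#include <stdio.h>

#define QPN_MAX 64	/* tiny for illustration; the real space is 24 bits */

static unsigned char bits[QPN_MAX / 8];

/* First-fit allocation; QPN 0 (SMI) and 1 (GSI) are never handed out,
 * mirroring ipath_alloc_qpn() above.  Returns 0 when the space is full. */
static unsigned int alloc_qpn(void)
{
	unsigned int q;

	for (q = 2; q < QPN_MAX; q++)
		if (!(bits[q / 8] & (1 << (q % 8)))) {
			bits[q / 8] |= 1 << (q % 8);
			return q;
		}
	return 0;
}

static void free_qpn(unsigned int q)
{
	bits[q / 8] &= ~(1 << (q % 8));
}

int main(void)
{
	unsigned int a = alloc_qpn(), b = alloc_qpn();

	free_qpn(a);
	printf("%u %u %u\n", a, b, alloc_qpn());	/* 2 3 2 */
	return 0;
}
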
+ */ +static struct ipath_qp *ipath_lookup_qpn(struct ipath_qp_table *qpt, u32 qpn) +{ + unsigned long flags; + struct ipath_qp *qp; + + spin_lock_irqsave(&qpt->lock, flags); + + for (qp = qpt->table[qpn % qpt->max]; qp; qp = qp->next) { + if (qp->ibqp.qp_num == qpn) { + atomic_inc(&qp->refcount); + break; + } + } + + spin_unlock_irqrestore(&qpt->lock, flags); + return qp; +} + +static int ipath_alloc_lkey(struct ipath_lkey_table *rkt, + struct ipath_mregion *mr) +{ + unsigned long flags; + u32 r; + u32 n; + + spin_lock_irqsave(&rkt->lock, flags); + + /* Find the next available LKEY */ + r = n = rkt->next; + for (;;) { + if (rkt->table[r] == NULL) + break; + r = (r + 1) & (rkt->max - 1); + if (r == n) { + spin_unlock_irqrestore(&rkt->lock, flags); + _VERBS_INFO("LKEY table full\n"); + return 0; + } + } + rkt->next = (r + 1) & (rkt->max - 1); + /* + * Make sure lkey is never zero which is reserved to indicate an + * unrestricted LKEY. + */ + rkt->gen++; + mr->lkey = (r << (32 - ib_ipath_lkey_table_size)) | + ((((1 << (24 - ib_ipath_lkey_table_size)) - 1) & rkt->gen) << 8); + if (mr->lkey == 0) { + mr->lkey |= 1 << 8; + rkt->gen++; + } + rkt->table[r] = mr; + spin_unlock_irqrestore(&rkt->lock, flags); + + return 1; +} + +static void ipath_free_lkey(struct ipath_lkey_table *rkt, u32 lkey) +{ + unsigned long flags; + u32 r; + + if (lkey == 0) + return; + r = lkey >> (32 - ib_ipath_lkey_table_size); + spin_lock_irqsave(&rkt->lock, flags); + rkt->table[r] = NULL; + spin_unlock_irqrestore(&rkt->lock, flags); +} + +/* + * Check the IB SGE for validity and initialize our internal version of it. + * Return 1 if OK, else zero. + */ +static int ipath_lkey_ok(struct ipath_lkey_table *rkt, struct ipath_sge *isge, + struct ib_sge *sge, int acc) +{ + struct ipath_mregion *mr; + size_t off; + + /* + * We use LKEY == zero to mean a physical kmalloc() address. + * This is a bit of a hack since we rely on dma_map_single() + * being reversible by calling bus_to_virt(). + */ + if (sge->lkey == 0) { + isge->mr = NULL; + isge->vaddr = bus_to_virt(sge->addr); + isge->length = sge->length; + isge->sge_length = sge->length; + return 1; + } + spin_lock(&rkt->lock); + mr = rkt->table[(sge->lkey >> (32 - ib_ipath_lkey_table_size))]; + spin_unlock(&rkt->lock); + if (unlikely(mr == NULL || mr->lkey != sge->lkey)) + return 0; + + off = sge->addr - mr->user_base; + if (unlikely(sge->addr < mr->user_base || + off + sge->length > mr->length || + (mr->access_flags & acc) != acc)) + return 0; + + off += mr->offset; + isge->mr = mr; + isge->m = 0; + isge->n = 0; + while (off >= mr->map[isge->m]->segs[isge->n].length) { + off -= mr->map[isge->m]->segs[isge->n].length; + if (++isge->n >= IPATH_SEGSZ) { + isge->m++; + isge->n = 0; + } + } + isge->vaddr = mr->map[isge->m]->segs[isge->n].vaddr + off; + isge->length = mr->map[isge->m]->segs[isge->n].length - off; + isge->sge_length = sge->length; + return 1; +} + +/* + * Initialize the qp->s_sge after a restart. + * The QP s_lock should be held. 
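
A standalone sketch of the LKEY packing used by ipath_alloc_lkey() above: the top ib_ipath_lkey_table_size bits index the table, the middle bits carry the generation tag, and the low 8 bits are left to the consumer. Userspace C with illustrative values; not part of the patch.

#include <stdio.h>
#include <stdint.h>

#define LKEY_TABLE_BITS 12	/* ib_ipath_lkey_table_size above */

static uint32_t make_lkey(uint32_t index, uint32_t gen)
{
	return (index << (32 - LKEY_TABLE_BITS)) |
	       ((((1u << (24 - LKEY_TABLE_BITS)) - 1) & gen) << 8);
}

int main(void)
{
	uint32_t lkey = make_lkey(5, 9);

	printf("index %u\n", lkey >> (32 - LKEY_TABLE_BITS));		/* 5 */
	printf("gen   %u\n",
	       (lkey >> 8) & ((1u << (24 - LKEY_TABLE_BITS)) - 1));	/* 9 */
	return 0;
}
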
+ */ +static void ipath_init_restart(struct ipath_qp *qp, struct ipath_swqe *wqe) +{ + struct ipath_ibdev *dev; + u32 len; + + len = ((qp->s_psn - wqe->psn) & 0xFFFFFF) * + ib_mtu_enum_to_int(qp->path_mtu); + qp->s_sge.sge = wqe->sg_list[0]; + qp->s_sge.sg_list = wqe->sg_list + 1; + qp->s_sge.num_sge = wqe->wr.num_sge; + skip_sge(&qp->s_sge, len); + qp->s_len = wqe->length - len; + dev = to_idev(qp->ibqp.device); + spin_lock(&dev->pending_lock); + if (qp->timerwait.next == LIST_POISON1) + list_add_tail(&qp->timerwait, + &dev->pending[dev->pending_index]); + spin_unlock(&dev->pending_lock); +} + +/* + * Check the IB virtual address, length, and RKEY. + * Return 1 if OK, else zero. + * The QP r_rq.lock should be held. + */ +static int ipath_rkey_ok(struct ipath_ibdev *dev, struct ipath_sge_state *ss, + u32 len, u64 vaddr, u32 rkey, int acc) +{ + struct ipath_lkey_table *rkt = &dev->lk_table; + struct ipath_sge *sge = &ss->sge; + struct ipath_mregion *mr; + size_t off; + + spin_lock(&rkt->lock); + mr = rkt->table[(rkey >> (32 - ib_ipath_lkey_table_size))]; + spin_unlock(&rkt->lock); + if (unlikely(mr == NULL || mr->lkey != rkey)) + return 0; + + off = vaddr - mr->iova; + if (unlikely(vaddr < mr->iova || off + len > mr->length || + (mr->access_flags & acc) == 0)) + return 0; + + off += mr->offset; + sge->mr = mr; + sge->m = 0; + sge->n = 0; + while (off >= mr->map[sge->m]->segs[sge->n].length) { + off -= mr->map[sge->m]->segs[sge->n].length; + if (++sge->n >= IPATH_SEGSZ) { + sge->m++; + sge->n = 0; + } + } + sge->vaddr = mr->map[sge->m]->segs[sge->n].vaddr + off; + sge->length = mr->map[sge->m]->segs[sge->n].length - off; + sge->sge_length = len; + ss->sg_list = NULL; + ss->num_sge = 1; + return 1; +} + +/* + * Add a new entry to the completion queue. + * This may be called with one of the qp->s_lock or qp->r_rq.lock held. + */ +static void ipath_cq_enter(struct ipath_cq *cq, struct ib_wc *entry, int sig) +{ + unsigned long flags; + u32 next; + + spin_lock_irqsave(&cq->lock, flags); + + cq->queue[cq->head] = *entry; + next = cq->head + 1; + if (next == cq->ibcq.cqe) + next = 0; + if (next != cq->tail) + cq->head = next; + else { + /* XXX - need to mark current wr as having an error... */ + } + + if (cq->notify == IB_CQ_NEXT_COMP || + (cq->notify == IB_CQ_SOLICITED && sig)) { + cq->notify = IB_CQ_NONE; + cq->triggered++; + /* + * This will cause send_complete() to be called in + * another thread. + */ + tasklet_schedule(&cq->comptask); + } + + spin_unlock_irqrestore(&cq->lock, flags); + + if (entry->status != IB_WC_SUCCESS) + to_idev(cq->ibcq.device)->n_wqe_errs++; +} + +static void send_complete(unsigned long data) +{ + struct ipath_cq *cq = (struct ipath_cq *)data; + + /* + * The completion handler will most likely rearm the notification + * and poll for all pending entries. If a new completion entry + * is added while we are in this routine, tasklet_schedule() + * won't call us again until we return so we check triggered to + * see if we need to call the handler again. + */ + for (;;) { + u8 triggered = cq->triggered; + + cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context); + + if (cq->triggered == triggered) + return; + } +} + +/* + * This is the QP state transition table. + * See ipath_modify_qp() for details. 
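
ipath_cq_enter() above uses plain head/tail ring arithmetic, so a CQ created with cqe slots holds at most cqe - 1 completions before the overflow branch is taken. A minimal standalone sketch, not part of the patch:

#include <stdio.h>

#define CQE 4

static unsigned int head, tail;	/* tail stays 0: nothing is polled here */

static int cq_enter(void)
{
	unsigned int next = head + 1 == CQE ? 0 : head + 1;

	if (next == tail)
		return -1;	/* overflow; the driver flags an error here */
	head = next;
	return 0;
}

int main(void)
{
	int i, ok = 0;

	for (i = 0; i < CQE; i++)
		ok += cq_enter() == 0;
	printf("%d of %d entries fit\n", ok, CQE);	/* 3 of 4 */
	return 0;
}
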
+ */ +static const struct { + int trans; + u32 req_param[IB_QPT_RAW_IPV6]; + u32 opt_param[IB_QPT_RAW_IPV6]; +} qp_state_table[IB_QPS_ERR + 1][IB_QPS_ERR + 1] = { + [IB_QPS_RESET] = { + [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = IPATH_TRANS_RST2INIT, + .req_param = { + [IB_QPT_SMI] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [IB_QPT_GSI] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [IB_QPT_UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [IB_QPT_UC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [IB_QPT_RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + }, + }, + }, + [IB_QPS_INIT] = { + [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = IPATH_TRANS_INIT2INIT, + .opt_param = { + [IB_QPT_SMI] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [IB_QPT_GSI] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [IB_QPT_UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [IB_QPT_UC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [IB_QPT_RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + } + }, + [IB_QPS_RTR] = { + .trans = IPATH_TRANS_INIT2RTR, + .req_param = { + [IB_QPT_UC] = (IB_QP_AV | + IB_QP_PATH_MTU | + IB_QP_DEST_QPN | + IB_QP_RQ_PSN), + [IB_QPT_RC] = (IB_QP_AV | + IB_QP_PATH_MTU | + IB_QP_DEST_QPN | + IB_QP_RQ_PSN | + IB_QP_MAX_DEST_RD_ATOMIC | + IB_QP_MIN_RNR_TIMER), + }, + .opt_param = { + [IB_QPT_SMI] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [IB_QPT_GSI] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [IB_QPT_UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [IB_QPT_UC] = (IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX), + [IB_QPT_RC] = (IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX), + } + } + }, + [IB_QPS_RTR] = { + [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = IPATH_TRANS_RTR2RTS, + .req_param = { + [IB_QPT_SMI] = IB_QP_SQ_PSN, + [IB_QPT_GSI] = IB_QP_SQ_PSN, + [IB_QPT_UD] = IB_QP_SQ_PSN, + [IB_QPT_UC] = IB_QP_SQ_PSN, + [IB_QPT_RC] = (IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_SQ_PSN | + IB_QP_MAX_QP_RD_ATOMIC), + }, + .opt_param = { + [IB_QPT_SMI] = (IB_QP_CUR_STATE | IB_QP_QKEY), + [IB_QPT_GSI] = (IB_QP_CUR_STATE | IB_QP_QKEY), + [IB_QPT_UD] = (IB_QP_CUR_STATE | IB_QP_QKEY), + [IB_QPT_UC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_PATH_MIG_STATE), + [IB_QPT_RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + } + } + }, + [IB_QPS_RTS] = { + [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = IPATH_TRANS_RTS2RTS, + .opt_param = { + [IB_QPT_SMI] = (IB_QP_CUR_STATE | IB_QP_QKEY), + [IB_QPT_GSI] = (IB_QP_CUR_STATE | IB_QP_QKEY), + [IB_QPT_UD] = (IB_QP_CUR_STATE | IB_QP_QKEY), + [IB_QPT_UC] = (IB_QP_ACCESS_FLAGS | + IB_QP_ALT_PATH | + IB_QP_PATH_MIG_STATE), + [IB_QPT_RC] = (IB_QP_ACCESS_FLAGS | + IB_QP_ALT_PATH | + IB_QP_PATH_MIG_STATE | + IB_QP_MIN_RNR_TIMER), + } + }, + [IB_QPS_SQD] = { + .trans = IPATH_TRANS_RTS2SQD, + }, + }, + [IB_QPS_SQD] = { + [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = IPATH_TRANS_SQD2RTS, + .opt_param = { + [IB_QPT_SMI] = (IB_QP_CUR_STATE | 
IB_QP_QKEY), + [IB_QPT_GSI] = (IB_QP_CUR_STATE | IB_QP_QKEY), + [IB_QPT_UD] = (IB_QP_CUR_STATE | IB_QP_QKEY), + [IB_QPT_UC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PATH_MIG_STATE), + [IB_QPT_RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + } + }, + [IB_QPS_SQD] = { + .trans = IPATH_TRANS_SQD2SQD, + .opt_param = { + [IB_QPT_SMI] = (IB_QP_CUR_STATE | IB_QP_QKEY), + [IB_QPT_GSI] = (IB_QP_CUR_STATE | IB_QP_QKEY), + [IB_QPT_UD] = (IB_QP_PKEY_INDEX | IB_QP_QKEY), + [IB_QPT_UC] = (IB_QP_AV | + IB_QP_TIMEOUT | + IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_PATH_MIG_STATE), + [IB_QPT_RC] = (IB_QP_AV | + IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_MAX_QP_RD_ATOMIC | + IB_QP_MAX_DEST_RD_ATOMIC | + IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + } + } + }, + [IB_QPS_SQE] = { + [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = IPATH_TRANS_SQERR2RTS, + .opt_param = { + [IB_QPT_SMI] = (IB_QP_CUR_STATE | IB_QP_QKEY), + [IB_QPT_GSI] = (IB_QP_CUR_STATE | IB_QP_QKEY), + [IB_QPT_UD] = (IB_QP_CUR_STATE | IB_QP_QKEY), + [IB_QPT_UC] = IB_QP_CUR_STATE, + [IB_QPT_RC] = (IB_QP_CUR_STATE | + IB_QP_MIN_RNR_TIMER), + } + } + }, + [IB_QPS_ERR] = { + [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR } + } +}; + +/* + * Initialize the QP state to the reset state. + */ +static void ipath_reset_qp(struct ipath_qp *qp) +{ + qp->remote_qpn = 0; + qp->qkey = 0; + qp->qp_access_flags = 0; + qp->s_hdrwords = 0; + qp->s_psn = 0; + qp->r_psn = 0; + atomic_set(&qp->msn, 0); + if (qp->ibqp.qp_type == IB_QPT_RC) { + qp->s_state = IB_OPCODE_RC_SEND_LAST; + qp->r_state = IB_OPCODE_RC_SEND_LAST; + } else { + qp->s_state = IB_OPCODE_UC_SEND_LAST; + qp->r_state = IB_OPCODE_UC_SEND_LAST; + } + qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; + qp->s_nak_state = 0; + qp->s_rnr_timeout = 0; + qp->s_head = 0; + qp->s_tail = 0; + qp->s_cur = 0; + qp->s_last = 0; + qp->s_ssn = 1; + qp->s_lsn = 0; + qp->r_rq.head = 0; + qp->r_rq.tail = 0; + qp->r_reuse_sge = 0; +} + +/* + * Flush send work queue. + * The QP s_lock should be held. + */ +static void ipath_sqerror_qp(struct ipath_qp *qp, struct ib_wc *wc) +{ + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + struct ipath_swqe *wqe = get_swqe_ptr(qp, qp->s_last); + + _VERBS_INFO("Send queue error on QP%d/%d: err: %d\n", + qp->ibqp.qp_num, qp->remote_qpn, wc->status); + + spin_lock(&dev->pending_lock); + /* XXX What if its already removed by the timeout code? */ + if (qp->timerwait.next != LIST_POISON1) + list_del(&qp->timerwait); + if (qp->piowait.next != LIST_POISON1) + list_del(&qp->piowait); + spin_unlock(&dev->pending_lock); + + ipath_cq_enter(to_icq(qp->ibqp.send_cq), wc, 1); + if (++qp->s_last >= qp->s_size) + qp->s_last = 0; + + wc->status = IB_WC_WR_FLUSH_ERR; + + while (qp->s_last != qp->s_head) { + wc->wr_id = wqe->wr.wr_id; + wc->opcode = wc_opcode[wqe->wr.opcode]; + ipath_cq_enter(to_icq(qp->ibqp.send_cq), wc, 1); + if (++qp->s_last >= qp->s_size) + qp->s_last = 0; + wqe = get_swqe_ptr(qp, qp->s_last); + } + qp->s_cur = qp->s_tail = qp->s_head; + qp->state = IB_QPS_SQE; +} + +/* + * Flush both send and receive work queues. + * QP r_rq.lock and s_lock should be held. 
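
The req_param/opt_param masks in the transition table above feed the validation in ipath_modify_qp() below: every required bit must be present, and nothing outside required, optional, or IB_QP_STATE may be set. A standalone sketch with made-up mask values; not part of the patch.

#include <stdio.h>

#define QP_STATE	0x01	/* made-up mask values, for shape only */
#define QP_PKEY		0x02
#define QP_PORT		0x04
#define QP_QKEY		0x08
#define QP_AV		0x10

static int check(unsigned int attr_mask, unsigned int req, unsigned int opt)
{
	if ((req & attr_mask) != req)
		return -1;	/* a required attribute is missing */
	if (attr_mask & ~(req | opt | QP_STATE))
		return -1;	/* a stray attribute for this transition */
	return 0;
}

int main(void)
{
	unsigned int req = QP_PKEY | QP_PORT | QP_QKEY, opt = 0;

	printf("%d\n", check(QP_STATE | req, req, opt));		/* 0 */
	printf("%d\n", check(QP_STATE | QP_PKEY, req, opt));		/* -1 */
	printf("%d\n", check(QP_STATE | req | QP_AV, req, opt));	/* -1 */
	return 0;
}
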
+ */ +static void ipath_error_qp(struct ipath_qp *qp) +{ + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + struct ib_wc wc; + + _VERBS_INFO("QP%d/%d in error state\n", + qp->ibqp.qp_num, qp->remote_qpn); + + spin_lock(&dev->pending_lock); + /* XXX What if its already removed by the timeout code? */ + if (qp->timerwait.next != LIST_POISON1) + list_del(&qp->timerwait); + if (qp->piowait.next != LIST_POISON1) + list_del(&qp->piowait); + spin_unlock(&dev->pending_lock); + + wc.status = IB_WC_WR_FLUSH_ERR; + wc.vendor_err = 0; + wc.byte_len = 0; + wc.imm_data = 0; + wc.qp_num = qp->ibqp.qp_num; + wc.src_qp = 0; + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = 0; + wc.sl = 0; + wc.dlid_path_bits = 0; + wc.port_num = 0; + + while (qp->s_last != qp->s_head) { + struct ipath_swqe *wqe = get_swqe_ptr(qp, qp->s_last); + + wc.wr_id = wqe->wr.wr_id; + wc.opcode = wc_opcode[wqe->wr.opcode]; + if (++qp->s_last >= qp->s_size) + qp->s_last = 0; + ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, 1); + } + qp->s_cur = qp->s_tail = qp->s_head; + qp->s_hdrwords = 0; + qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; + + wc.opcode = IB_WC_RECV; + while (qp->r_rq.tail != qp->r_rq.head) { + wc.wr_id = get_rwqe_ptr(&qp->r_rq, qp->r_rq.tail)->wr_id; + if (++qp->r_rq.tail >= qp->r_rq.size) + qp->r_rq.tail = 0; + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); + } +} + +static int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, + int attr_mask) +{ + struct ipath_qp *qp = to_iqp(ibqp); + enum ib_qp_state cur_state, new_state; + u32 req_param, opt_param; + unsigned long flags; + + if (attr_mask & IB_QP_CUR_STATE) { + cur_state = attr->cur_qp_state; + if (cur_state != IB_QPS_RTR && + cur_state != IB_QPS_RTS && + cur_state != IB_QPS_SQD && cur_state != IB_QPS_SQE) + return -EINVAL; + spin_lock_irqsave(&qp->r_rq.lock, flags); + spin_lock(&qp->s_lock); + } else { + spin_lock_irqsave(&qp->r_rq.lock, flags); + spin_lock(&qp->s_lock); + cur_state = qp->state; + } + + if (attr_mask & IB_QP_STATE) { + new_state = attr->qp_state; + if (new_state < 0 || new_state > IB_QPS_ERR) + goto inval; + } else + new_state = cur_state; + + switch (qp_state_table[cur_state][new_state].trans) { + case IPATH_TRANS_INVALID: + goto inval; + + case IPATH_TRANS_ANY2RST: + ipath_reset_qp(qp); + break; + + case IPATH_TRANS_ANY2ERR: + ipath_error_qp(qp); + break; + + } + + req_param = + qp_state_table[cur_state][new_state].req_param[qp->ibqp.qp_type]; + opt_param = + qp_state_table[cur_state][new_state].opt_param[qp->ibqp.qp_type]; + + if ((req_param & attr_mask) != req_param) + goto inval; + + if (attr_mask & ~(req_param | opt_param | IB_QP_STATE)) + goto inval; + + if (attr_mask & IB_QP_PKEY_INDEX) { + struct ipath_ibdev *dev = to_idev(ibqp->device); + + if (attr->pkey_index >= ipath_layer_get_npkeys(dev->ib_unit)) + goto inval; + qp->s_pkey_index = attr->pkey_index; + } + + if (attr_mask & IB_QP_DEST_QPN) + qp->remote_qpn = attr->dest_qp_num; + + if (attr_mask & IB_QP_SQ_PSN) { + qp->s_next_psn = attr->sq_psn; + qp->s_last_psn = qp->s_next_psn - 1; + } + + if (attr_mask & IB_QP_RQ_PSN) + qp->r_psn = attr->rq_psn; + + if (attr_mask & IB_QP_ACCESS_FLAGS) + qp->qp_access_flags = attr->qp_access_flags; + + if (attr_mask & IB_QP_AV) + qp->remote_ah_attr = attr->ah_attr; + + if (attr_mask & IB_QP_PATH_MTU) + qp->path_mtu = attr->path_mtu; + + if (attr_mask & IB_QP_RETRY_CNT) + qp->s_retry = qp->s_retry_cnt = attr->retry_cnt; + + if (attr_mask & IB_QP_RNR_RETRY) { + qp->s_rnr_retry = attr->rnr_retry; + if (qp->s_rnr_retry > 7) + 
qp->s_rnr_retry = 7; + qp->s_rnr_retry_cnt = qp->s_rnr_retry; + } + + if (attr_mask & IB_QP_MIN_RNR_TIMER) + qp->s_min_rnr_timer = attr->min_rnr_timer & 0x1F; + + if (attr_mask & IB_QP_QKEY) + qp->qkey = attr->qkey; + + if (attr_mask & IB_QP_PKEY_INDEX) + qp->s_pkey_index = attr->pkey_index; + + qp->state = new_state; + spin_unlock(&qp->s_lock); + spin_unlock_irqrestore(&qp->r_rq.lock, flags); + + /* + * Try to move to ARMED if QP1 changed to the RTS state. + */ + if (qp->ibqp.qp_num == 1 && new_state == IB_QPS_RTS) { + struct ipath_ibdev *dev = to_idev(ibqp->device); + + /* + * Bounce the link even if it was active so the SM will + * reinitialize the SMA's state. + */ + ipath_kset_linkstate((dev->ib_unit << 16) | IPATH_IB_LINKDOWN); + ipath_kset_linkstate((dev->ib_unit << 16) | IPATH_IB_LINKARM); + } + return 0; + +inval: + spin_unlock(&qp->s_lock); + spin_unlock_irqrestore(&qp->r_rq.lock, flags); + return -EINVAL; +} + +/* + * Compute the AETH (syndrome + MSN). + * The QP s_lock should be held. + */ +static u32 ipath_compute_aeth(struct ipath_qp *qp) +{ + u32 aeth = atomic_read(&qp->msn) & 0xFFFFFF; + + if (qp->s_nak_state) { + aeth |= qp->s_nak_state << 24; + } else if (qp->ibqp.srq) { + /* Shared receive queues don't generate credits. */ + aeth |= 0x1F << 24; + } else { + u32 min, max, x; + u32 credits; + + /* + * Compute the number of credits available (RWQEs). + * XXX Not holding the r_rq.lock here so there is a small + * chance that the pair of reads are not atomic. + */ + credits = qp->r_rq.head - qp->r_rq.tail; + if ((int)credits < 0) + credits += qp->r_rq.size; + /* Binary search the credit table to find the code to use. */ + min = 0; + max = 31; + for (;;) { + x = (min + max) / 2; + if (credit_table[x] == credits) + break; + if (credit_table[x] > credits) + max = x; + else if (min == x) + break; + else + min = x; + } + aeth |= x << 24; + } + return cpu_to_be32(aeth); +} + + +static void no_bufs_available(struct ipath_qp *qp, struct ipath_ibdev *dev) +{ + unsigned long flags; + + spin_lock_irqsave(&dev->pending_lock, flags); + if (qp->piowait.next == LIST_POISON1) + list_add_tail(&qp->piowait, &dev->piowait); + spin_unlock_irqrestore(&dev->pending_lock, flags); + /* + * Note that as soon as ipath_layer_want_buffer() is called and + * possibly before it returns, ipath_ib_piobufavail() + * could be called. If we are still in the tasklet function, + * tasklet_schedule() will not call us until the next time + * tasklet_schedule() is called. + * We clear the tasklet flag now since we are committing to return + * from the tasklet function. + */ + tasklet_unlock(&qp->s_task); + ipath_layer_want_buffer(dev->ib_unit); + dev->n_piowait++; +} + +/* + * Process entries in the send work queue until the queue is exhausted. + * Only allow one CPU to send a packet per QP (tasklet). + * Otherwise, after we drop the QP lock, two threads could send + * packets out of order. + * This is similar to do_rc_send() below except we don't have timeouts or + * resends. 
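
A standalone version of the binary search in ipath_compute_aeth() above, run over a prefix of the credit_table[] defined earlier in this file; it yields the largest credit code whose value does not exceed the available RWQEs. Illustrative only, not part of the patch.

#include <stdio.h>

/* First entries of the credit_table[] above; the full table has 31. */
static unsigned int credit_table[] = { 0, 1, 2, 3, 4, 6, 8, 12, 16, 24, 32 };

static unsigned int credits_to_code(unsigned int credits)
{
	unsigned int min = 0;
	unsigned int max = sizeof(credit_table) / sizeof(credit_table[0]) - 1;

	for (;;) {
		unsigned int x = (min + max) / 2;

		if (credit_table[x] == credits)
			return x;
		if (credit_table[x] > credits)
			max = x;
		else if (min == x)
			return x; /* largest code not exceeding credits */
		else
			min = x;
	}
}

int main(void)
{
	/* 5 free RWQEs encode as code 4, i.e. 4 credits advertised. */
	printf("%u %u %u\n", credits_to_code(4), credits_to_code(5),
	       credits_to_code(12));	/* 4 4 7 */
	return 0;
}
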
+ */ +static void do_uc_send(unsigned long data) +{ + struct ipath_qp *qp = (struct ipath_qp *)data; + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + struct ipath_swqe *wqe; + unsigned long flags; + u16 lrh0; + u32 hwords; + u32 nwords; + u32 extra_bytes; + u32 bth0; + u32 bth2; + u32 pmtu = ib_mtu_enum_to_int(qp->path_mtu); + u32 len; + struct ipath_other_headers *ohdr; + struct ib_wc wc; + + if (test_and_set_bit(IPATH_S_BUSY, &qp->s_flags)) + return; + + if (unlikely(qp->remote_ah_attr.dlid == + ipath_layer_get_lid(dev->ib_unit))) { + /* Pass in an uninitialized ib_wc to save stack space. */ + ipath_ruc_loopback(qp, &wc); + clear_bit(IPATH_S_BUSY, &qp->s_flags); + return; + } + + ohdr = &qp->s_hdr.u.oth; + if (qp->remote_ah_attr.ah_flags & IB_AH_GRH) + ohdr = &qp->s_hdr.u.l.oth; + +again: + /* Check for a constructed packet to be sent. */ + if (qp->s_hdrwords != 0) { + /* + * If no PIO bufs are available, return. + * An interrupt will call ipath_ib_piobufavail() + * when one is available. + */ + if (ipath_verbs_send(dev->ib_unit, qp->s_hdrwords, + (uint32_t *) &qp->s_hdr, + qp->s_cur_size, qp->s_cur_sge)) { + no_bufs_available(qp, dev); + return; + } + /* Record that we sent the packet and s_hdr is empty. */ + qp->s_hdrwords = 0; + } + + lrh0 = IPS_LRH_BTH; + /* header size in 32-bit words LRH+BTH = (8+12)/4. */ + hwords = 5; + + /* + * The lock is needed to synchronize between + * setting qp->s_ack_state and post_send(). + */ + spin_lock_irqsave(&qp->s_lock, flags); + + if (!(state_ops[qp->state] & IPATH_PROCESS_SEND_OK)) + goto done; + + bth0 = ipath_layer_get_pkey(dev->ib_unit, qp->s_pkey_index); + + /* Send a request. */ + wqe = get_swqe_ptr(qp, qp->s_last); + switch (qp->s_state) { + default: + /* Signal the completion of the last send (if there is one). */ + if (qp->s_last != qp->s_tail) { + if (++qp->s_last == qp->s_size) + qp->s_last = 0; + if (!test_bit(IPATH_S_SIGNAL_REQ_WR, &qp->s_flags) || + (wqe->wr.send_flags & IB_SEND_SIGNALED)) { + wc.wr_id = wqe->wr.wr_id; + wc.status = IB_WC_SUCCESS; + wc.opcode = wc_opcode[wqe->wr.opcode]; + wc.vendor_err = 0; + wc.byte_len = wqe->length; + wc.qp_num = qp->ibqp.qp_num; + wc.src_qp = qp->remote_qpn; + wc.pkey_index = 0; + wc.slid = qp->remote_ah_attr.dlid; + wc.sl = qp->remote_ah_attr.sl; + wc.dlid_path_bits = 0; + wc.port_num = 0; + ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, + 0); + } + wqe = get_swqe_ptr(qp, qp->s_last); + } + /* Check if send work queue is empty. */ + if (qp->s_tail == qp->s_head) + goto done; + /* + * Start a new request. 
+ */ + qp->s_psn = wqe->psn = qp->s_next_psn; + qp->s_sge.sge = wqe->sg_list[0]; + qp->s_sge.sg_list = wqe->sg_list + 1; + qp->s_sge.num_sge = wqe->wr.num_sge; + qp->s_len = len = wqe->length; + switch (wqe->wr.opcode) { + case IB_WR_SEND: + case IB_WR_SEND_WITH_IMM: + if (len > pmtu) { + qp->s_state = IB_OPCODE_UC_SEND_FIRST; + len = pmtu; + break; + } + if (wqe->wr.opcode == IB_WR_SEND) { + qp->s_state = IB_OPCODE_UC_SEND_ONLY; + } else { + qp->s_state = + IB_OPCODE_UC_SEND_ONLY_WITH_IMMEDIATE; + /* Immediate data comes after the BTH */ + ohdr->u.imm_data = wqe->wr.imm_data; + hwords += 1; + } + if (wqe->wr.send_flags & IB_SEND_SOLICITED) + bth0 |= 1 << 23; + break; + + case IB_WR_RDMA_WRITE: + case IB_WR_RDMA_WRITE_WITH_IMM: + ohdr->u.rc.reth.vaddr = + cpu_to_be64(wqe->wr.wr.rdma.remote_addr); + ohdr->u.rc.reth.rkey = + cpu_to_be32(wqe->wr.wr.rdma.rkey); + ohdr->u.rc.reth.length = cpu_to_be32(len); + hwords += sizeof(struct ib_reth) / 4; + if (len > pmtu) { + qp->s_state = IB_OPCODE_UC_RDMA_WRITE_FIRST; + len = pmtu; + break; + } + if (wqe->wr.opcode == IB_WR_RDMA_WRITE) { + qp->s_state = IB_OPCODE_UC_RDMA_WRITE_ONLY; + } else { + qp->s_state = + IB_OPCODE_UC_RDMA_WRITE_ONLY_WITH_IMMEDIATE; + /* Immediate data comes after the RETH */ + ohdr->u.rc.imm_data = wqe->wr.imm_data; + hwords += 1; + if (wqe->wr.send_flags & IB_SEND_SOLICITED) + bth0 |= 1 << 23; + } + break; + + default: + goto done; + } + if (++qp->s_tail >= qp->s_size) + qp->s_tail = 0; + break; + + case IB_OPCODE_UC_SEND_FIRST: + qp->s_state = IB_OPCODE_UC_SEND_MIDDLE; + /* FALLTHROUGH */ + case IB_OPCODE_UC_SEND_MIDDLE: + len = qp->s_len; + if (len > pmtu) { + len = pmtu; + break; + } + if (wqe->wr.opcode == IB_WR_SEND) + qp->s_state = IB_OPCODE_UC_SEND_LAST; + else { + qp->s_state = IB_OPCODE_UC_SEND_LAST_WITH_IMMEDIATE; + /* Immediate data comes after the BTH */ + ohdr->u.imm_data = wqe->wr.imm_data; + hwords += 1; + } + if (wqe->wr.send_flags & IB_SEND_SOLICITED) + bth0 |= 1 << 23; + break; + + case IB_OPCODE_UC_RDMA_WRITE_FIRST: + qp->s_state = IB_OPCODE_UC_RDMA_WRITE_MIDDLE; + /* FALLTHROUGH */ + case IB_OPCODE_UC_RDMA_WRITE_MIDDLE: + len = qp->s_len; + if (len > pmtu) { + len = pmtu; + break; + } + if (wqe->wr.opcode == IB_WR_RDMA_WRITE) + qp->s_state = IB_OPCODE_UC_RDMA_WRITE_LAST; + else { + qp->s_state = + IB_OPCODE_UC_RDMA_WRITE_LAST_WITH_IMMEDIATE; + /* Immediate data comes after the BTH */ + ohdr->u.imm_data = wqe->wr.imm_data; + hwords += 1; + if (wqe->wr.send_flags & IB_SEND_SOLICITED) + bth0 |= 1 << 23; + } + break; + } + bth2 = qp->s_next_psn++ & 0xFFFFFF; + qp->s_len -= len; + bth0 |= qp->s_state << 24; + + spin_unlock_irqrestore(&qp->s_lock, flags); + + /* Construct the header. */ + extra_bytes = (4 - len) & 3; + nwords = (len + extra_bytes) >> 2; + if (unlikely(qp->remote_ah_attr.ah_flags & IB_AH_GRH)) { + /* Header size in 32-bit words. */ + hwords += 10; + lrh0 = IPS_LRH_GRH; + qp->s_hdr.u.l.grh.version_tclass_flow = + cpu_to_be32((6 << 28) | + (qp->remote_ah_attr.grh.traffic_class << 20) | + qp->remote_ah_attr.grh.flow_label); + qp->s_hdr.u.l.grh.paylen = + cpu_to_be16(((hwords - 12) + nwords + SIZE_OF_CRC) << 2); + qp->s_hdr.u.l.grh.next_hdr = 0x1B; + qp->s_hdr.u.l.grh.hop_limit = qp->remote_ah_attr.grh.hop_limit; + /* The SGID is 32-bit aligned. 
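
The padding arithmetic used in the header construction below, reduced to a quick standalone check: extra_bytes = (4 - len) & 3 rounds the payload up to whole 32-bit words, and the pad count is carried in BTH bits 20-21. Not part of the patch.

#include <stdio.h>

int main(void)
{
	unsigned int len;

	for (len = 5; len <= 8; len++) {
		unsigned int extra_bytes = (4 - len) & 3;
		unsigned int nwords = (len + extra_bytes) >> 2;

		printf("len %u -> %u words + %u pad\n",
		       len, nwords, extra_bytes);
	}
	return 0;
}
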
*/ + qp->s_hdr.u.l.grh.sgid.global.subnet_prefix = dev->gid_prefix; + qp->s_hdr.u.l.grh.sgid.global.interface_id = + ipath_layer_get_guid(dev->ib_unit); + qp->s_hdr.u.l.grh.dgid = qp->remote_ah_attr.grh.dgid; + } + qp->s_hdrwords = hwords; + qp->s_cur_sge = &qp->s_sge; + qp->s_cur_size = len; + lrh0 |= qp->remote_ah_attr.sl << 4; + qp->s_hdr.lrh[0] = cpu_to_be16(lrh0); + /* DEST LID */ + qp->s_hdr.lrh[1] = cpu_to_be16(qp->remote_ah_attr.dlid); + qp->s_hdr.lrh[2] = cpu_to_be16(hwords + nwords + SIZE_OF_CRC); + qp->s_hdr.lrh[3] = cpu_to_be16(ipath_layer_get_lid(dev->ib_unit)); + bth0 |= extra_bytes << 20; + ohdr->bth[0] = cpu_to_be32(bth0); + ohdr->bth[1] = cpu_to_be32(qp->remote_qpn); + ohdr->bth[2] = cpu_to_be32(bth2); + + /* Check for more work to do. */ + goto again; + +done: + spin_unlock_irqrestore(&qp->s_lock, flags); + clear_bit(IPATH_S_BUSY, &qp->s_flags); +} + +/* + * Process entries in the send work queue until credit or queue is exhausted. + * Only allow one CPU to send a packet per QP (tasklet). + * Otherwise, after we drop the QP s_lock, two threads could send + * packets out of order. + */ +static void do_rc_send(unsigned long data) +{ + struct ipath_qp *qp = (struct ipath_qp *)data; + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + struct ipath_swqe *wqe; + struct ipath_sge_state *ss; + unsigned long flags; + u16 lrh0; + u32 hwords; + u32 nwords; + u32 extra_bytes; + u32 bth0; + u32 bth2; + u32 pmtu = ib_mtu_enum_to_int(qp->path_mtu); + u32 len; + struct ipath_other_headers *ohdr; + char newreq; + + if (test_and_set_bit(IPATH_S_BUSY, &qp->s_flags)) + return; + + if (unlikely(qp->remote_ah_attr.dlid == + ipath_layer_get_lid(dev->ib_unit))) { + struct ib_wc wc; + + /* + * Pass in an uninitialized ib_wc to be consistent with + * other places where ipath_ruc_loopback() is called. + */ + ipath_ruc_loopback(qp, &wc); + clear_bit(IPATH_S_BUSY, &qp->s_flags); + return; + } + + ohdr = &qp->s_hdr.u.oth; + if (qp->remote_ah_attr.ah_flags & IB_AH_GRH) + ohdr = &qp->s_hdr.u.l.oth; + +again: + /* Check for a constructed packet to be sent. */ + if (qp->s_hdrwords != 0) { + /* + * If no PIO bufs are available, return. + * An interrupt will call ipath_ib_piobufavail() + * when one is available. + */ + if (ipath_verbs_send(dev->ib_unit, qp->s_hdrwords, + (uint32_t *) &qp->s_hdr, + qp->s_cur_size, qp->s_cur_sge)) { + no_bufs_available(qp, dev); + return; + } + /* Record that we sent the packet and s_hdr is empty. */ + qp->s_hdrwords = 0; + } + + lrh0 = IPS_LRH_BTH; + /* header size in 32-bit words LRH+BTH = (8+12)/4. */ + hwords = 5; + + /* + * The lock is needed to synchronize between + * setting qp->s_ack_state, resend timer, and post_send(). + */ + spin_lock_irqsave(&qp->s_lock, flags); + + bth0 = ipath_layer_get_pkey(dev->ib_unit, qp->s_pkey_index); + + /* Sending responses has higher priority over sending requests. */ + if (qp->s_ack_state != IB_OPCODE_RC_ACKNOWLEDGE) { + /* + * Send a response. + * Note that we are in the responder's side of the QP context. 
+ */ + switch (qp->s_ack_state) { + case IB_OPCODE_RC_RDMA_READ_REQUEST: + ss = &qp->s_rdma_sge; + len = qp->s_rdma_len; + if (len > pmtu) { + len = pmtu; + qp->s_ack_state = + IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST; + } else { + qp->s_ack_state = + IB_OPCODE_RC_RDMA_READ_RESPONSE_ONLY; + } + qp->s_rdma_len -= len; + bth0 |= qp->s_ack_state << 24; + ohdr->u.aeth = ipath_compute_aeth(qp); + hwords++; + break; + + case IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST: + qp->s_ack_state = + IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE; + /* FALLTHROUGH */ + case IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE: + ss = &qp->s_rdma_sge; + len = qp->s_rdma_len; + if (len > pmtu) { + len = pmtu; + } else { + ohdr->u.aeth = ipath_compute_aeth(qp); + hwords++; + qp->s_ack_state = + IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST; + } + qp->s_rdma_len -= len; + bth0 |= qp->s_ack_state << 24; + break; + + case IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST: + case IB_OPCODE_RC_RDMA_READ_RESPONSE_ONLY: + /* + * We have to prevent new requests from changing + * the r_sge state while a ipath_verbs_send() + * is in progress. + * Changing r_state allows the receiver + * to continue processing new packets. + * We do it here now instead of above so + * that we are sure the packet was sent before + * changing the state. + */ + qp->r_state = IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST; + qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; + goto send_req; + + case IB_OPCODE_RC_COMPARE_SWAP: + case IB_OPCODE_RC_FETCH_ADD: + ss = NULL; + len = 0; + qp->r_state = IB_OPCODE_RC_SEND_LAST; + qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; + bth0 |= IB_OPCODE_ATOMIC_ACKNOWLEDGE << 24; + ohdr->u.at.aeth = ipath_compute_aeth(qp); + ohdr->u.at.atomic_ack_eth = + cpu_to_be64(qp->s_ack_atomic); + hwords += sizeof(ohdr->u.at) / 4; + break; + + default: + /* Send a regular ACK. */ + ss = NULL; + len = 0; + qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; + bth0 |= qp->s_ack_state << 24; + ohdr->u.aeth = ipath_compute_aeth(qp); + hwords++; + } + bth2 = qp->s_ack_psn++ & 0xFFFFFF; + } else { + send_req: + if (!(state_ops[qp->state] & IPATH_PROCESS_SEND_OK) || + qp->s_rnr_timeout) + goto done; + + /* Send a request. */ + wqe = get_swqe_ptr(qp, qp->s_cur); + switch (qp->s_state) { + default: + /* + * Resend an old request or start a new one. + * + * We keep track of the current SWQE so that + * we don't reset the "furthest progress" state + * if we need to back up. + */ + newreq = 0; + if (qp->s_cur == qp->s_tail) { + /* Check if send work queue is empty. */ + if (qp->s_tail == qp->s_head) + goto done; + qp->s_psn = wqe->psn = qp->s_next_psn; + newreq = 1; + } + /* + * Note that we have to be careful not to modify the + * original work request since we may need to resend + * it. + */ + qp->s_sge.sge = wqe->sg_list[0]; + qp->s_sge.sg_list = wqe->sg_list + 1; + qp->s_sge.num_sge = wqe->wr.num_sge; + qp->s_len = len = wqe->length; + ss = &qp->s_sge; + bth2 = 0; + switch (wqe->wr.opcode) { + case IB_WR_SEND: + case IB_WR_SEND_WITH_IMM: + /* If no credit, return. 
*/ + if (qp->s_lsn != (u32) -1 && + cmp24(wqe->ssn, qp->s_lsn + 1) > 0) { + goto done; + } + wqe->lpsn = wqe->psn; + if (len > pmtu) { + wqe->lpsn += (len - 1) / pmtu; + qp->s_state = IB_OPCODE_RC_SEND_FIRST; + len = pmtu; + break; + } + if (wqe->wr.opcode == IB_WR_SEND) { + qp->s_state = IB_OPCODE_RC_SEND_ONLY; + } else { + qp->s_state = + IB_OPCODE_RC_SEND_ONLY_WITH_IMMEDIATE; + /* Immediate data comes after the BTH */ + ohdr->u.imm_data = wqe->wr.imm_data; + hwords += 1; + } + if (wqe->wr.send_flags & IB_SEND_SOLICITED) + bth0 |= 1 << 23; + bth2 = 1 << 31; /* Request ACK. */ + if (++qp->s_cur == qp->s_size) + qp->s_cur = 0; + break; + + case IB_WR_RDMA_WRITE: + if (newreq) + qp->s_lsn++; + /* FALLTHROUGH */ + case IB_WR_RDMA_WRITE_WITH_IMM: + /* If no credit, return. */ + if (qp->s_lsn != (u32) -1 && + cmp24(wqe->ssn, qp->s_lsn + 1) > 0) { + goto done; + } + ohdr->u.rc.reth.vaddr = + cpu_to_be64(wqe->wr.wr.rdma.remote_addr); + ohdr->u.rc.reth.rkey = + cpu_to_be32(wqe->wr.wr.rdma.rkey); + ohdr->u.rc.reth.length = cpu_to_be32(len); + hwords += sizeof(struct ib_reth) / 4; + wqe->lpsn = wqe->psn; + if (len > pmtu) { + wqe->lpsn += (len - 1) / pmtu; + qp->s_state = + IB_OPCODE_RC_RDMA_WRITE_FIRST; + len = pmtu; + break; + } + if (wqe->wr.opcode == IB_WR_RDMA_WRITE) { + qp->s_state = + IB_OPCODE_RC_RDMA_WRITE_ONLY; + } else { + qp->s_state = + IB_OPCODE_RC_RDMA_WRITE_ONLY_WITH_IMMEDIATE; + /* Immediate data comes after RETH */ + ohdr->u.rc.imm_data = wqe->wr.imm_data; + hwords += 1; + if (wqe->wr. + send_flags & IB_SEND_SOLICITED) + bth0 |= 1 << 23; + } + bth2 = 1 << 31; /* Request ACK. */ + if (++qp->s_cur == qp->s_size) + qp->s_cur = 0; + break; + + case IB_WR_RDMA_READ: + ohdr->u.rc.reth.vaddr = + cpu_to_be64(wqe->wr.wr.rdma.remote_addr); + ohdr->u.rc.reth.rkey = + cpu_to_be32(wqe->wr.wr.rdma.rkey); + ohdr->u.rc.reth.length = cpu_to_be32(len); + qp->s_state = IB_OPCODE_RC_RDMA_READ_REQUEST; + hwords += sizeof(ohdr->u.rc.reth) / 4; + if (newreq) { + qp->s_lsn++; + /* + * Adjust s_next_psn to count the + * expected number of responses. + */ + if (len > pmtu) + qp->s_next_psn += + (len - 1) / pmtu; + wqe->lpsn = qp->s_next_psn++; + } + ss = NULL; + len = 0; + if (++qp->s_cur == qp->s_size) + qp->s_cur = 0; + break; + + case IB_WR_ATOMIC_CMP_AND_SWP: + case IB_WR_ATOMIC_FETCH_AND_ADD: + qp->s_state = + wqe->wr.opcode == IB_WR_ATOMIC_CMP_AND_SWP ? + IB_OPCODE_RC_COMPARE_SWAP : + IB_OPCODE_RC_FETCH_ADD; + ohdr->u.atomic_eth.vaddr = + cpu_to_be64(wqe->wr.wr.atomic.remote_addr); + ohdr->u.atomic_eth.rkey = + cpu_to_be32(wqe->wr.wr.atomic.rkey); + ohdr->u.atomic_eth.swap_data = + cpu_to_be64(wqe->wr.wr.atomic.swap); + ohdr->u.atomic_eth.compare_data = + cpu_to_be64(wqe->wr.wr.atomic.compare_add); + hwords += sizeof(struct ib_atomic_eth) / 4; + if (newreq) { + qp->s_lsn++; + wqe->lpsn = wqe->psn; + } + if (++qp->s_cur == qp->s_size) + qp->s_cur = 0; + ss = NULL; + len = 0; + break; + + default: + goto done; + } + if (newreq) { + if (++qp->s_tail >= qp->s_size) + qp->s_tail = 0; + } + bth2 |= qp->s_psn++ & 0xFFFFFF; + if ((int)(qp->s_psn - qp->s_next_psn) > 0) + qp->s_next_psn = qp->s_psn; + spin_lock(&dev->pending_lock); + if (qp->timerwait.next == LIST_POISON1) { + list_add_tail(&qp->timerwait, + &dev->pending[dev-> + pending_index]); + } + spin_unlock(&dev->pending_lock); + break; + + case IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST: + /* + * This case can only happen if a send is + * restarted. See ipath_restart_rc(). 
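
A standalone sketch of the send-credit gate above: a request with send sequence number ssn may go to the wire only while ssn is at most s_lsn + 1 (circularly), and s_lsn == (u32) -1 disables the check entirely. cmp24() is the helper from the top of this file; illustrative only, not part of the patch.

#include <stdio.h>

static int cmp24(unsigned int a, unsigned int b)
{
	return (((int) a) - ((int) b)) << 8;
}

/* ssn may go out only while ssn <= lsn + 1 (circularly); lsn == ~0
 * disables the check, as in the test above. */
static int may_send(unsigned int ssn, unsigned int lsn)
{
	return lsn == (unsigned int) -1 || cmp24(ssn, lsn + 1) <= 0;
}

int main(void)
{
	printf("%d %d %d\n", may_send(7, 7), may_send(8, 7), may_send(9, 7));
	/* 1 1 0 */
	return 0;
}
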
+ */ + ipath_init_restart(qp, wqe); + /* FALLTHROUGH */ + case IB_OPCODE_RC_SEND_FIRST: + qp->s_state = IB_OPCODE_RC_SEND_MIDDLE; + /* FALLTHROUGH */ + case IB_OPCODE_RC_SEND_MIDDLE: + bth2 = qp->s_psn++ & 0xFFFFFF; + if ((int)(qp->s_psn - qp->s_next_psn) > 0) + qp->s_next_psn = qp->s_psn; + ss = &qp->s_sge; + len = qp->s_len; + if (len > pmtu) { + /* + * Request an ACK every 1/2 MB to avoid + * retransmit timeouts. + */ + if (((wqe->length - len) % (512 * 1024)) == 0) + bth2 |= 1 << 31; + len = pmtu; + break; + } + if (wqe->wr.opcode == IB_WR_SEND) + qp->s_state = IB_OPCODE_RC_SEND_LAST; + else { + qp->s_state = + IB_OPCODE_RC_SEND_LAST_WITH_IMMEDIATE; + /* Immediate data comes after the BTH */ + ohdr->u.imm_data = wqe->wr.imm_data; + hwords += 1; + } + if (wqe->wr.send_flags & IB_SEND_SOLICITED) + bth0 |= 1 << 23; + bth2 |= 1 << 31; /* Request ACK. */ + if (++qp->s_cur >= qp->s_size) + qp->s_cur = 0; + break; + + case IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST: + /* + * This case can only happen if a RDMA write is + * restarted. See ipath_restart_rc(). + */ + ipath_init_restart(qp, wqe); + /* FALLTHROUGH */ + case IB_OPCODE_RC_RDMA_WRITE_FIRST: + qp->s_state = IB_OPCODE_RC_RDMA_WRITE_MIDDLE; + /* FALLTHROUGH */ + case IB_OPCODE_RC_RDMA_WRITE_MIDDLE: + bth2 = qp->s_psn++ & 0xFFFFFF; + if ((int)(qp->s_psn - qp->s_next_psn) > 0) + qp->s_next_psn = qp->s_psn; + ss = &qp->s_sge; + len = qp->s_len; + if (len > pmtu) { + /* + * Request an ACK every 1/2 MB to avoid + * retransmit timeouts. + */ + if (((wqe->length - len) % (512 * 1024)) == 0) + bth2 |= 1 << 31; + len = pmtu; + break; + } + if (wqe->wr.opcode == IB_WR_RDMA_WRITE) + qp->s_state = IB_OPCODE_RC_RDMA_WRITE_LAST; + else { + qp->s_state = + IB_OPCODE_RC_RDMA_WRITE_LAST_WITH_IMMEDIATE; + /* Immediate data comes after the BTH */ + ohdr->u.imm_data = wqe->wr.imm_data; + hwords += 1; + if (wqe->wr.send_flags & IB_SEND_SOLICITED) + bth0 |= 1 << 23; + } + bth2 |= 1 << 31; /* Request ACK. */ + if (++qp->s_cur >= qp->s_size) + qp->s_cur = 0; + break; + + case IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE: + /* + * This case can only happen if a RDMA read is + * restarted. See ipath_restart_rc(). + */ + ipath_init_restart(qp, wqe); + len = ((qp->s_psn - wqe->psn) & 0xFFFFFF) * pmtu; + ohdr->u.rc.reth.vaddr = + cpu_to_be64(wqe->wr.wr.rdma.remote_addr + len); + ohdr->u.rc.reth.rkey = + cpu_to_be32(wqe->wr.wr.rdma.rkey); + ohdr->u.rc.reth.length = cpu_to_be32(qp->s_len); + qp->s_state = IB_OPCODE_RC_RDMA_READ_REQUEST; + hwords += sizeof(ohdr->u.rc.reth) / 4; + bth2 = qp->s_psn++ & 0xFFFFFF; + if ((int)(qp->s_psn - qp->s_next_psn) > 0) + qp->s_next_psn = qp->s_psn; + ss = NULL; + len = 0; + if (++qp->s_cur == qp->s_size) + qp->s_cur = 0; + break; + + case IB_OPCODE_RC_RDMA_READ_REQUEST: + case IB_OPCODE_RC_COMPARE_SWAP: + case IB_OPCODE_RC_FETCH_ADD: + /* + * We shouldn't start anything new until this request + * is finished. The ACK will handle rescheduling us. + * XXX The number of outstanding ones is negotiated + * at connection setup time (see pg. 258,289)? + * XXX Also, if we support multiple outstanding + * requests, we need to check the WQE IB_SEND_FENCE + * flag and not send a new request if a RDMA read or + * atomic is pending. + */ + goto done; + } + qp->s_len -= len; + bth0 |= qp->s_state << 24; + /* XXX queue resend timeout. */ + } + /* Make sure it is non-zero before dropping the lock. */ + qp->s_hdrwords = hwords; + spin_unlock_irqrestore(&qp->s_lock, flags); + + /* Construct the header. 
*/ + extra_bytes = (4 - len) & 3; + nwords = (len + extra_bytes) >> 2; + if (unlikely(qp->remote_ah_attr.ah_flags & IB_AH_GRH)) { + /* Header size in 32-bit words. */ + hwords += 10; + lrh0 = IPS_LRH_GRH; + qp->s_hdr.u.l.grh.version_tclass_flow = + cpu_to_be32((6 << 28) | + (qp->remote_ah_attr.grh.traffic_class << 20) | + qp->remote_ah_attr.grh.flow_label); + qp->s_hdr.u.l.grh.paylen = + cpu_to_be16(((hwords - 12) + nwords + SIZE_OF_CRC) << 2); + qp->s_hdr.u.l.grh.next_hdr = 0x1B; + qp->s_hdr.u.l.grh.hop_limit = qp->remote_ah_attr.grh.hop_limit; + /* The SGID is 32-bit aligned. */ + qp->s_hdr.u.l.grh.sgid.global.subnet_prefix = dev->gid_prefix; + qp->s_hdr.u.l.grh.sgid.global.interface_id = + ipath_layer_get_guid(dev->ib_unit); + qp->s_hdr.u.l.grh.dgid = qp->remote_ah_attr.grh.dgid; + qp->s_hdrwords = hwords; + } + qp->s_cur_sge = ss; + qp->s_cur_size = len; + lrh0 |= qp->remote_ah_attr.sl << 4; + qp->s_hdr.lrh[0] = cpu_to_be16(lrh0); + /* DEST LID */ + qp->s_hdr.lrh[1] = cpu_to_be16(qp->remote_ah_attr.dlid); + qp->s_hdr.lrh[2] = cpu_to_be16(hwords + nwords + SIZE_OF_CRC); + qp->s_hdr.lrh[3] = cpu_to_be16(ipath_layer_get_lid(dev->ib_unit)); + bth0 |= extra_bytes << 20; + ohdr->bth[0] = cpu_to_be32(bth0); + ohdr->bth[1] = cpu_to_be32(qp->remote_qpn); + ohdr->bth[2] = cpu_to_be32(bth2); + + /* Check for more work to do. */ + goto again; + +done: + spin_unlock_irqrestore(&qp->s_lock, flags); + clear_bit(IPATH_S_BUSY, &qp->s_flags); +} + +static void send_rc_ack(struct ipath_qp *qp) +{ + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + u16 lrh0; + u32 bth0; + u32 hwords; + struct ipath_other_headers *ohdr; + + /* Construct the header. */ + ohdr = &qp->s_hdr.u.oth; + lrh0 = IPS_LRH_BTH; + /* header size in 32-bit words LRH+BTH+AETH = (8+12+4)/4. */ + hwords = 6; + if (unlikely(qp->remote_ah_attr.ah_flags & IB_AH_GRH)) { + ohdr = &qp->s_hdr.u.l.oth; + /* Header size in 32-bit words. */ + hwords += 10; + lrh0 = IPS_LRH_GRH; + qp->s_hdr.u.l.grh.version_tclass_flow = + cpu_to_be32((6 << 28) | + (qp->remote_ah_attr.grh.traffic_class << 20) | + qp->remote_ah_attr.grh.flow_label); + qp->s_hdr.u.l.grh.paylen = + cpu_to_be16(((hwords - 12) + SIZE_OF_CRC) << 2); + qp->s_hdr.u.l.grh.next_hdr = 0x1B; + qp->s_hdr.u.l.grh.hop_limit = qp->remote_ah_attr.grh.hop_limit; + /* The SGID is 32-bit aligned. */ + qp->s_hdr.u.l.grh.sgid.global.subnet_prefix = dev->gid_prefix; + qp->s_hdr.u.l.grh.sgid.global.interface_id = + ipath_layer_get_guid(dev->ib_unit); + qp->s_hdr.u.l.grh.dgid = qp->remote_ah_attr.grh.dgid; + } + bth0 = ipath_layer_get_pkey(dev->ib_unit, qp->s_pkey_index); + ohdr->u.aeth = ipath_compute_aeth(qp); + if (qp->s_ack_state >= IB_OPCODE_RC_COMPARE_SWAP) { + bth0 |= IB_OPCODE_ATOMIC_ACKNOWLEDGE << 24; + ohdr->u.at.atomic_ack_eth = cpu_to_be64(qp->s_ack_atomic); + hwords += sizeof(ohdr->u.at.atomic_ack_eth) / 4; + } else { + bth0 |= IB_OPCODE_RC_ACKNOWLEDGE << 24; + } + lrh0 |= qp->remote_ah_attr.sl << 4; + qp->s_hdr.lrh[0] = cpu_to_be16(lrh0); + /* DEST LID */ + qp->s_hdr.lrh[1] = cpu_to_be16(qp->remote_ah_attr.dlid); + qp->s_hdr.lrh[2] = cpu_to_be16(hwords + SIZE_OF_CRC); + qp->s_hdr.lrh[3] = cpu_to_be16(ipath_layer_get_lid(dev->ib_unit)); + ohdr->bth[0] = cpu_to_be32(bth0); + ohdr->bth[1] = cpu_to_be32(qp->remote_qpn); + ohdr->bth[2] = cpu_to_be32(qp->s_ack_psn & 0xFFFFFF); + + /* + * If we can send the ACK, clear the ACK state. 
+ */ + if (ipath_verbs_send(dev->ib_unit, hwords, (uint32_t *) &qp->s_hdr, + 0, NULL) == 0) { + qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; + dev->n_rc_qacks++; + } +} + +/* + * Back up the requester to resend the last un-ACKed request. + * The QP s_lock should be held. + */ +static void ipath_restart_rc(struct ipath_qp *qp, u32 psn, struct ib_wc *wc) +{ + struct ipath_swqe *wqe = get_swqe_ptr(qp, qp->s_last); + struct ipath_ibdev *dev; + u32 n; + + /* + * If there are no requests pending, we are done. + */ + if (cmp24(psn, qp->s_next_psn) >= 0 || qp->s_last == qp->s_tail) + goto done; + + if (qp->s_retry == 0) { + wc->wr_id = wqe->wr.wr_id; + wc->status = IB_WC_RETRY_EXC_ERR; + wc->opcode = wc_opcode[wqe->wr.opcode]; + wc->vendor_err = 0; + wc->byte_len = 0; + wc->qp_num = qp->ibqp.qp_num; + wc->src_qp = qp->remote_qpn; + wc->pkey_index = 0; + wc->slid = qp->remote_ah_attr.dlid; + wc->sl = qp->remote_ah_attr.sl; + wc->dlid_path_bits = 0; + wc->port_num = 0; + ipath_sqerror_qp(qp, wc); + return; + } + qp->s_retry--; + + /* + * Remove the QP from the timeout queue. + * Note: it may already have been removed by ipath_ib_timer(). + */ + dev = to_idev(qp->ibqp.device); + spin_lock(&dev->pending_lock); + if (qp->timerwait.next != LIST_POISON1) + list_del(&qp->timerwait); + spin_unlock(&dev->pending_lock); + + if (wqe->wr.opcode == IB_WR_RDMA_READ) + dev->n_rc_resends++; + else + dev->n_rc_resends += (int)qp->s_psn - (int)psn; + + /* + * If we are starting the request from the beginning, let the + * normal send code handle initialization. + */ + qp->s_cur = qp->s_last; + if (cmp24(psn, wqe->psn) <= 0) { + qp->s_state = IB_OPCODE_RC_SEND_LAST; + qp->s_psn = wqe->psn; + } else { + n = qp->s_cur; + for (;;) { + if (++n == qp->s_size) + n = 0; + if (n == qp->s_tail) { + if (cmp24(psn, qp->s_next_psn) >= 0) { + qp->s_cur = n; + wqe = get_swqe_ptr(qp, n); + } + break; + } + wqe = get_swqe_ptr(qp, n); + if (cmp24(psn, wqe->psn) < 0) + break; + qp->s_cur = n; + } + qp->s_psn = psn; + + /* + * Reset the state to restart in the middle of a request. + * Don't change the s_sge, s_cur_sge, or s_cur_size. + * See do_rc_send(). + */ + switch (wqe->wr.opcode) { + case IB_WR_SEND: + case IB_WR_SEND_WITH_IMM: + qp->s_state = IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST; + break; + + case IB_WR_RDMA_WRITE: + case IB_WR_RDMA_WRITE_WITH_IMM: + qp->s_state = IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST; + break; + + case IB_WR_RDMA_READ: + qp->s_state = IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE; + break; + + default: + /* + * This case shouldn't happen since it's only + * one PSN per request. + */ + qp->s_state = IB_OPCODE_RC_SEND_LAST; + } + } + +done: + tasklet_schedule(&qp->s_task); +} + +/* + * Handle RC and UC post sends. + */ +static int ipath_post_rc_send(struct ipath_qp *qp, struct ib_send_wr *wr) +{ + struct ipath_swqe *wqe; + unsigned long flags; + u32 next; + int i, j; + int acc; + + /* + * Don't allow RDMA reads or atomic operations on UC, or + * undefined operations. + * Make sure buffer is large enough to hold the result for atomics. + */ + if (qp->ibqp.qp_type == IB_QPT_UC) { + if ((unsigned) wr->opcode >= IB_WR_RDMA_READ) + return -EINVAL; + } else if ((unsigned) wr->opcode > IB_WR_ATOMIC_FETCH_AND_ADD) + return -EINVAL; + else if (wr->opcode >= IB_WR_ATOMIC_CMP_AND_SWP && + (wr->num_sge == 0 || wr->sg_list[0].length < sizeof(u64) || + wr->sg_list[0].addr & 0x7)) + return -EINVAL; + + /* IB spec says that num_sge == 0 is OK.
*/ + if (wr->num_sge > qp->s_max_sge) + return -ENOMEM; + + spin_lock_irqsave(&qp->s_lock, flags); + next = qp->s_head + 1; + if (next >= qp->s_size) + next = 0; + if (next == qp->s_last) { + spin_unlock_irqrestore(&qp->s_lock, flags); + return -EINVAL; + } + + wqe = get_swqe_ptr(qp, qp->s_head); + wqe->wr = *wr; + wqe->ssn = qp->s_ssn++; + wqe->sg_list[0].mr = NULL; + wqe->sg_list[0].vaddr = NULL; + wqe->sg_list[0].length = 0; + wqe->sg_list[0].sge_length = 0; + wqe->length = 0; + acc = wr->opcode >= IB_WR_RDMA_READ ? IB_ACCESS_LOCAL_WRITE : 0; + for (i = 0, j = 0; i < wr->num_sge; i++) { + if (to_ipd(qp->ibqp.pd)->user && wr->sg_list[i].lkey == 0) { + spin_unlock_irqrestore(&qp->s_lock, flags); + return -EINVAL; + } + if (wr->sg_list[i].length == 0) + continue; + if (!ipath_lkey_ok(&to_idev(qp->ibqp.device)->lk_table, + &wqe->sg_list[j], &wr->sg_list[i], acc)) { + spin_unlock_irqrestore(&qp->s_lock, flags); + return -EINVAL; + } + wqe->length += wr->sg_list[i].length; + j++; + } + wqe->wr.num_sge = j; + qp->s_head = next; + /* + * Wake up the send tasklet if the QP is not waiting + * for an RNR timeout. + */ + next = qp->s_rnr_timeout; + spin_unlock_irqrestore(&qp->s_lock, flags); + + if (next == 0) { + if (qp->ibqp.qp_type == IB_QPT_UC) + do_uc_send((unsigned long) qp); + else + do_rc_send((unsigned long) qp); + } + return 0; +} + +/* + * Note that we actually send the data as it is posted instead of putting + * the request into a ring buffer. If we wanted to use a ring buffer, + * we would need to save a reference to the destination address in the SWQE. + */ +static int ipath_post_ud_send(struct ipath_qp *qp, struct ib_send_wr *wr) +{ + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + struct ipath_other_headers *ohdr; + struct ib_ah_attr *ah_attr; + struct ipath_sge_state ss; + struct ipath_sge *sg_list; + struct ib_wc wc; + u32 hwords; + u32 nwords; + u32 len; + u32 extra_bytes; + u32 bth0; + u16 lrh0; + u16 lid; + int i; + + if (!(state_ops[qp->state] & IPATH_PROCESS_SEND_OK)) + return 0; + + /* IB spec says that num_sge == 0 is OK. */ + if (wr->num_sge > qp->s_max_sge) + return -EINVAL; + + if (wr->num_sge > 1) { + sg_list = kmalloc((qp->s_max_sge - 1) * sizeof(*sg_list), + GFP_ATOMIC); + if (!sg_list) + return -ENOMEM; + } else + sg_list = NULL; + + /* Check the buffer to send. */ + ss.sg_list = sg_list; + ss.sge.mr = NULL; + ss.sge.vaddr = NULL; + ss.sge.length = 0; + ss.sge.sge_length = 0; + ss.num_sge = 0; + len = 0; + for (i = 0; i < wr->num_sge; i++) { + /* Check LKEY */ + if (to_ipd(qp->ibqp.pd)->user && wr->sg_list[i].lkey == 0) + return -EINVAL; + + if (wr->sg_list[i].length == 0) + continue; + if (!ipath_lkey_ok(&dev->lk_table, ss.num_sge ? + sg_list + ss.num_sge : &ss.sge, + &wr->sg_list[i], 0)) { + return -EINVAL; + } + len += wr->sg_list[i].length; + ss.num_sge++; + } + extra_bytes = (4 - len) & 3; + nwords = (len + extra_bytes) >> 2; + + /* Construct the header. */ + ah_attr = &to_iah(wr->wr.ud.ah)->attr; + if (ah_attr->dlid >= 0xC000 && ah_attr->dlid < 0xFFFF) + dev->n_multicast_xmit++; + if (unlikely(ah_attr->dlid == ipath_layer_get_lid(dev->ib_unit))) { + /* Pass in an uninitialized ib_wc to save stack space. */ + ipath_ud_loopback(qp, &ss, len, wr, &wc); + goto done; + } + if (ah_attr->ah_flags & IB_AH_GRH) { + /* Header size in 32-bit words. 
*/ + hwords = 17; + lrh0 = IPS_LRH_GRH; + ohdr = &qp->s_hdr.u.l.oth; + qp->s_hdr.u.l.grh.version_tclass_flow = + cpu_to_be32((6 << 28) | + (ah_attr->grh.traffic_class << 20) | + ah_attr->grh.flow_label); + qp->s_hdr.u.l.grh.paylen = + cpu_to_be16(((wr->opcode == + IB_WR_SEND_WITH_IMM ? 6 : 5) + nwords + + SIZE_OF_CRC) << 2); + qp->s_hdr.u.l.grh.next_hdr = 0x1B; + qp->s_hdr.u.l.grh.hop_limit = ah_attr->grh.hop_limit; + /* The SGID is 32-bit aligned. */ + qp->s_hdr.u.l.grh.sgid.global.subnet_prefix = dev->gid_prefix; + qp->s_hdr.u.l.grh.sgid.global.interface_id = + ipath_layer_get_guid(dev->ib_unit); + qp->s_hdr.u.l.grh.dgid = ah_attr->grh.dgid; + /* + * Don't worry about sending to locally attached + * multicast QPs; the spec leaves what happens unspecified. + */ + } else { + /* Header size in 32-bit words. */ + hwords = 7; + lrh0 = IPS_LRH_BTH; + ohdr = &qp->s_hdr.u.oth; + } + if (wr->opcode == IB_WR_SEND_WITH_IMM) { + ohdr->u.ud.imm_data = wr->imm_data; + wc.imm_data = wr->imm_data; + hwords += 1; + bth0 = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE << 24; + } else if (wr->opcode == IB_WR_SEND) { + wc.imm_data = 0; + bth0 = IB_OPCODE_UD_SEND_ONLY << 24; + } else + return -EINVAL; + lrh0 |= ah_attr->sl << 4; + if (qp->ibqp.qp_type == IB_QPT_SMI) + lrh0 |= 0xF000; /* Set VL */ + qp->s_hdr.lrh[0] = cpu_to_be16(lrh0); + qp->s_hdr.lrh[1] = cpu_to_be16(ah_attr->dlid); /* DEST LID */ + qp->s_hdr.lrh[2] = cpu_to_be16(hwords + nwords + SIZE_OF_CRC); + lid = ipath_layer_get_lid(dev->ib_unit); + qp->s_hdr.lrh[3] = lid ? cpu_to_be16(lid) : IB_LID_PERMISSIVE; + if (wr->send_flags & IB_SEND_SOLICITED) + bth0 |= 1 << 23; + bth0 |= extra_bytes << 20; + bth0 |= qp->ibqp.qp_type == IB_QPT_SMI ? IPS_DEFAULT_P_KEY : + ipath_layer_get_pkey(dev->ib_unit, qp->s_pkey_index); + ohdr->bth[0] = cpu_to_be32(bth0); + ohdr->bth[1] = cpu_to_be32(wr->wr.ud.remote_qpn); + /* XXX Could lose a PSN count but not worth locking */ + ohdr->bth[2] = cpu_to_be32(qp->s_psn++ & 0xFFFFFF); + /* + * Qkeys with the high order bit set mean use the + * qkey from the QP context instead of the WR. + */ + ohdr->u.ud.deth[0] = cpu_to_be32((int)wr->wr.ud.remote_qkey < 0 ? + qp->qkey : wr->wr.ud.remote_qkey); + ohdr->u.ud.deth[1] = cpu_to_be32(qp->ibqp.qp_num); + if (ipath_verbs_send(dev->ib_unit, hwords, (uint32_t *) &qp->s_hdr, + len, &ss)) + dev->n_no_piobuf++; + +done: + /* Queue the completion status entry. */ + if (!test_bit(IPATH_S_SIGNAL_REQ_WR, &qp->s_flags) || + (wr->send_flags & IB_SEND_SIGNALED)) { + wc.wr_id = wr->wr_id; + wc.status = IB_WC_SUCCESS; + wc.vendor_err = 0; + wc.opcode = IB_WC_SEND; + wc.byte_len = len; + wc.qp_num = qp->ibqp.qp_num; + wc.src_qp = 0; + wc.wc_flags = 0; + /* XXX initialize other fields? */ + ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, 0); + } + kfree(sg_list); + + return 0; +} + +/* + * This may be called from interrupt context. + */ +static int ipath_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr) +{ + struct ipath_qp *qp = to_iqp(ibqp); + int err = 0; + + /* Check that state is OK to post send.
*/ + if (!(state_ops[qp->state] & IPATH_POST_SEND_OK)) { + *bad_wr = wr; + return -EINVAL; + } + + for (; wr; wr = wr->next) { + switch (qp->ibqp.qp_type) { + case IB_QPT_UC: + case IB_QPT_RC: + err = ipath_post_rc_send(qp, wr); + break; + + case IB_QPT_SMI: + case IB_QPT_GSI: + case IB_QPT_UD: + err = ipath_post_ud_send(qp, wr); + break; + + default: + err = -EINVAL; + } + if (err) { + *bad_wr = wr; + break; + } + } + return err; +} + +/* + * This may be called from interrupt context. + */ +static int ipath_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + struct ipath_qp *qp = to_iqp(ibqp); + unsigned long flags; + + /* Check that state is OK to post receive. */ + if (!(state_ops[qp->state] & IPATH_POST_RECV_OK)) { + *bad_wr = wr; + return -EINVAL; + } + + for (; wr; wr = wr->next) { + struct ipath_rwqe *wqe; + u32 next; + int i, j; + + if (wr->num_sge > qp->r_rq.max_sge) { + *bad_wr = wr; + return -ENOMEM; + } + + spin_lock_irqsave(&qp->r_rq.lock, flags); + next = qp->r_rq.head + 1; + if (next >= qp->r_rq.size) + next = 0; + if (next == qp->r_rq.tail) { + spin_unlock_irqrestore(&qp->r_rq.lock, flags); + *bad_wr = wr; + return -ENOMEM; + } + + wqe = get_rwqe_ptr(&qp->r_rq, qp->r_rq.head); + wqe->wr_id = wr->wr_id; + wqe->sg_list[0].mr = NULL; + wqe->sg_list[0].vaddr = NULL; + wqe->sg_list[0].length = 0; + wqe->sg_list[0].sge_length = 0; + wqe->length = 0; + for (i = 0, j = 0; i < wr->num_sge; i++) { + /* Check LKEY */ + if (to_ipd(qp->ibqp.pd)->user && + wr->sg_list[i].lkey == 0) { + spin_unlock_irqrestore(&qp->r_rq.lock, flags); + *bad_wr = wr; + return -EINVAL; + } + if (wr->sg_list[i].length == 0) + continue; + if (!ipath_lkey_ok(&to_idev(qp->ibqp.device)->lk_table, + &wqe->sg_list[j], &wr->sg_list[i], + IB_ACCESS_LOCAL_WRITE)) { + spin_unlock_irqrestore(&qp->r_rq.lock, flags); + *bad_wr = wr; + return -EINVAL; + } + wqe->length += wr->sg_list[i].length; + j++; + } + wqe->num_sge = j; + qp->r_rq.head = next; + spin_unlock_irqrestore(&qp->r_rq.lock, flags); + } + return 0; +} + +/* + * This may be called from interrupt context. 
+ */ +static int ipath_post_srq_receive(struct ib_srq *ibsrq, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + struct ipath_srq *srq = to_isrq(ibsrq); + struct ipath_ibdev *dev = to_idev(ibsrq->device); + unsigned long flags; + + for (; wr; wr = wr->next) { + struct ipath_rwqe *wqe; + u32 next; + int i, j; + + if (wr->num_sge > srq->rq.max_sge) { + *bad_wr = wr; + return -ENOMEM; + } + + spin_lock_irqsave(&srq->rq.lock, flags); + next = srq->rq.head + 1; + if (next >= srq->rq.size) + next = 0; + if (next == srq->rq.tail) { + spin_unlock_irqrestore(&srq->rq.lock, flags); + *bad_wr = wr; + return -ENOMEM; + } + + wqe = get_rwqe_ptr(&srq->rq, srq->rq.head); + wqe->wr_id = wr->wr_id; + wqe->sg_list[0].mr = NULL; + wqe->sg_list[0].vaddr = NULL; + wqe->sg_list[0].length = 0; + wqe->sg_list[0].sge_length = 0; + wqe->length = 0; + for (i = 0, j = 0; i < wr->num_sge; i++) { + /* Check LKEY */ + if (to_ipd(srq->ibsrq.pd)->user && + wr->sg_list[i].lkey == 0) { + spin_unlock_irqrestore(&srq->rq.lock, flags); + *bad_wr = wr; + return -EINVAL; + } + if (wr->sg_list[i].length == 0) + continue; + if (!ipath_lkey_ok(&dev->lk_table, + &wqe->sg_list[j], &wr->sg_list[i], + IB_ACCESS_LOCAL_WRITE)) { + spin_unlock_irqrestore(&srq->rq.lock, flags); + *bad_wr = wr; + return -EINVAL; + } + wqe->length += wr->sg_list[i].length; + j++; + } + wqe->num_sge = j; + srq->rq.head = next; + spin_unlock_irqrestore(&srq->rq.lock, flags); + } + return 0; +} + +/* + * This is called from ipath_qp_rcv() to process an incoming UD packet + * for the given QP. + * Called at interrupt level. + */ +static void ipath_ud_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, + int has_grh, void *data, u32 tlen, struct ipath_qp *qp) +{ + struct ipath_other_headers *ohdr; + int opcode; + u32 hdrsize; + u32 pad; + unsigned long flags; + struct ib_wc wc; + u32 qkey; + u32 src_qp; + struct ipath_rq *rq; + struct ipath_srq *srq; + struct ipath_rwqe *wqe; + + /* Check for GRH */ + if (!has_grh) { + ohdr = &hdr->u.oth; + hdrsize = 8 + 12 + 8; /* LRH + BTH + DETH */ + qkey = be32_to_cpu(ohdr->u.ud.deth[0]); + src_qp = be32_to_cpu(ohdr->u.ud.deth[1]); + } else { + ohdr = &hdr->u.l.oth; + hdrsize = 8 + 40 + 12 + 8; /* LRH + GRH + BTH + DETH */ + /* + * The header with GRH is 68 bytes and the + * core driver sets the eager header buffer + * size to 56 bytes so the last 12 bytes of + * the IB header are in the data buffer. + */ + qkey = be32_to_cpu(((u32 *) data)[1]); + src_qp = be32_to_cpu(((u32 *) data)[2]); + data += 12; + } + src_qp &= 0xFFFFFF; + + /* Check that the qkey matches. */ + if (unlikely(qkey != qp->qkey)) { + /* XXX OK to lose a count once in a while. */ + dev->qkey_violations++; + dev->n_pkt_drops++; + return; + } + + /* Get the number of bytes the message was padded by. */ + pad = (ohdr->bth[0] >> 12) & 3; + if (unlikely(tlen < (hdrsize + pad + 4))) { + /* Drop incomplete packets. */ + dev->n_pkt_drops++; + return; + } + + /* + * A GRH is expected to precede the data even if not + * present on the wire. + */ + wc.byte_len = tlen - (hdrsize + pad + 4) + sizeof(struct ib_grh); + + /* + * The opcode is in the low byte when it's in network order + * (top byte when in host order).
+ */ + opcode = *(u8 *) (&ohdr->bth[0]); + if (opcode == IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE) { + if (has_grh) { + wc.imm_data = *(u32 *) data; + data += sizeof(u32); + } else + wc.imm_data = ohdr->u.ud.imm_data; + wc.wc_flags = IB_WC_WITH_IMM; + hdrsize += sizeof(u32); + } else if (opcode == IB_OPCODE_UD_SEND_ONLY) { + wc.imm_data = 0; + wc.wc_flags = 0; + } else { + dev->n_pkt_drops++; + return; + } + + /* + * Get the next work request entry to find where to put the data. + * Note that it is safe to drop the lock after changing rq->tail + * since ipath_post_receive() won't fill the empty slot. + */ + if (qp->ibqp.srq) { + srq = to_isrq(qp->ibqp.srq); + rq = &srq->rq; + } else { + srq = NULL; + rq = &qp->r_rq; + } + spin_lock_irqsave(&rq->lock, flags); + if (rq->tail == rq->head) { + spin_unlock_irqrestore(&rq->lock, flags); + dev->n_pkt_drops++; + return; + } + /* Silently drop packets which are too big. */ + wqe = get_rwqe_ptr(rq, rq->tail); + if (wc.byte_len > wqe->length) { + spin_unlock_irqrestore(&rq->lock, flags); + dev->n_pkt_drops++; + return; + } + wc.wr_id = wqe->wr_id; + qp->r_sge.sge = wqe->sg_list[0]; + qp->r_sge.sg_list = wqe->sg_list + 1; + qp->r_sge.num_sge = wqe->num_sge; + if (++rq->tail >= rq->size) + rq->tail = 0; + if (srq && srq->ibsrq.event_handler) { + u32 n; + + if (rq->head < rq->tail) + n = rq->size + rq->head - rq->tail; + else + n = rq->head - rq->tail; + if (n < srq->limit) { + struct ib_event ev; + + srq->limit = 0; + spin_unlock_irqrestore(&rq->lock, flags); + ev.device = qp->ibqp.device; + ev.element.srq = qp->ibqp.srq; + ev.event = IB_EVENT_SRQ_LIMIT_REACHED; + srq->ibsrq.event_handler(&ev, srq->ibsrq.srq_context); + } else + spin_unlock_irqrestore(&rq->lock, flags); + } else + spin_unlock_irqrestore(&rq->lock, flags); + if (has_grh) { + copy_sge(&qp->r_sge, &hdr->u.l.grh, sizeof(struct ib_grh)); + wc.wc_flags |= IB_WC_GRH; + } else + skip_sge(&qp->r_sge, sizeof(struct ib_grh)); + copy_sge(&qp->r_sge, data, wc.byte_len - sizeof(struct ib_grh)); + wc.status = IB_WC_SUCCESS; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.qp_num = qp->ibqp.qp_num; + wc.src_qp = src_qp; + /* XXX do we know which pkey matched? Only needed for GSI. */ + wc.pkey_index = 0; + wc.slid = be16_to_cpu(hdr->lrh[3]); + wc.sl = (be16_to_cpu(hdr->lrh[0]) >> 4) & 0xF; + wc.dlid_path_bits = 0; + /* Signal completion event if the solicited bit is set. */ + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, + ohdr->bth[0] & __constant_cpu_to_be32(1 << 23)); +} + +/* + * This is called from ipath_post_ud_send() to forward a WQE addressed + * to the same HCA. + */ +static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_sge_state *ss, + u32 length, struct ib_send_wr *wr, + struct ib_wc *wc) +{ + struct ipath_ibdev *dev = to_idev(sqp->ibqp.device); + struct ipath_qp *qp; + struct ib_ah_attr *ah_attr; + unsigned long flags; + struct ipath_rq *rq; + struct ipath_srq *srq; + struct ipath_sge_state rsge; + struct ipath_sge *sge; + struct ipath_rwqe *wqe; + + qp = ipath_lookup_qpn(&dev->qp_table, wr->wr.ud.remote_qpn); + if (!qp) + return; + + /* Check that the qkey matches. */ + if (unlikely(wr->wr.ud.remote_qkey != qp->qkey)) { + /* XXX OK to lose a count once in a while. */ + dev->qkey_violations++; + dev->n_pkt_drops++; + goto done; + } + + /* + * A GRH is expected to precede the data even if not + * present on the wire.
+ */ + wc->byte_len = length + sizeof(struct ib_grh); + + if (wr->opcode == IB_WR_SEND_WITH_IMM) { + wc->wc_flags = IB_WC_WITH_IMM; + wc->imm_data = wr->imm_data; + } else { + wc->wc_flags = 0; + wc->imm_data = 0; + } + + /* + * Get the next work request entry to find where to put the data. + * Note that it is safe to drop the lock after changing rq->tail + * since ipath_post_receive() won't fill the empty slot. + */ + if (qp->ibqp.srq) { + srq = to_isrq(qp->ibqp.srq); + rq = &srq->rq; + } else { + srq = NULL; + rq = &qp->r_rq; + } + spin_lock_irqsave(&rq->lock, flags); + if (rq->tail == rq->head) { + spin_unlock_irqrestore(&rq->lock, flags); + dev->n_pkt_drops++; + goto done; + } + /* Silently drop packets which are too big. */ + wqe = get_rwqe_ptr(rq, rq->tail); + if (wc->byte_len > wqe->length) { + spin_unlock_irqrestore(&rq->lock, flags); + dev->n_pkt_drops++; + goto done; + } + wc->wr_id = wqe->wr_id; + rsge.sge = wqe->sg_list[0]; + rsge.sg_list = wqe->sg_list + 1; + rsge.num_sge = wqe->num_sge; + if (++rq->tail >= rq->size) + rq->tail = 0; + if (srq && srq->ibsrq.event_handler) { + u32 n; + + if (rq->head < rq->tail) + n = rq->size + rq->head - rq->tail; + else + n = rq->head - rq->tail; + if (n < srq->limit) { + struct ib_event ev; + + srq->limit = 0; + spin_unlock_irqrestore(&rq->lock, flags); + ev.device = qp->ibqp.device; + ev.element.srq = qp->ibqp.srq; + ev.event = IB_EVENT_SRQ_LIMIT_REACHED; + srq->ibsrq.event_handler(&ev, srq->ibsrq.srq_context); + } else + spin_unlock_irqrestore(&rq->lock, flags); + } else + spin_unlock_irqrestore(&rq->lock, flags); + ah_attr = &to_iah(wr->wr.ud.ah)->attr; + if (ah_attr->ah_flags & IB_AH_GRH) { + copy_sge(&rsge, &ah_attr->grh, sizeof(struct ib_grh)); + wc->wc_flags |= IB_WC_GRH; + } else + skip_sge(&rsge, sizeof(struct ib_grh)); + sge = &ss->sge; + while (length) { + u32 len = sge->length; + + if (len > length) + len = length; + BUG_ON(len == 0); + copy_sge(&rsge, sge->vaddr, len); + sge->vaddr += len; + sge->length -= len; + sge->sge_length -= len; + if (sge->sge_length == 0) { + if (--ss->num_sge) + *sge = *ss->sg_list++; + } else if (sge->length == 0 && sge->mr != NULL) { + if (++sge->n >= IPATH_SEGSZ) { + if (++sge->m >= sge->mr->mapsz) + break; + sge->n = 0; + } + sge->vaddr = sge->mr->map[sge->m]->segs[sge->n].vaddr; + sge->length = sge->mr->map[sge->m]->segs[sge->n].length; + } + length -= len; + } + wc->status = IB_WC_SUCCESS; + wc->opcode = IB_WC_RECV; + wc->vendor_err = 0; + wc->qp_num = qp->ibqp.qp_num; + wc->src_qp = sqp->ibqp.qp_num; + /* XXX do we know which pkey matched? Only needed for GSI. */ + wc->pkey_index = 0; + wc->slid = ipath_layer_get_lid(dev->ib_unit); + wc->sl = ah_attr->sl; + wc->dlid_path_bits = 0; + /* Signal completion event if the solicited bit is set. */ + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), wc, + wr->send_flags & IB_SEND_SOLICITED); + +done: + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); +} + +/* + * Copy the next RWQE into the QP's RWQE. + * Return zero if no RWQE is available. + * Called at interrupt level with the QP r_rq.lock held. 
+ */ +static int get_rwqe(struct ipath_qp *qp, int wr_id_only) +{ + struct ipath_rq *rq; + struct ipath_srq *srq; + struct ipath_rwqe *wqe; + + if (!qp->ibqp.srq) { + rq = &qp->r_rq; + if (unlikely(rq->tail == rq->head)) + return 0; + wqe = get_rwqe_ptr(rq, rq->tail); + qp->r_wr_id = wqe->wr_id; + if (!wr_id_only) { + qp->r_sge.sge = wqe->sg_list[0]; + qp->r_sge.sg_list = wqe->sg_list + 1; + qp->r_sge.num_sge = wqe->num_sge; + qp->r_len = wqe->length; + } + if (++rq->tail >= rq->size) + rq->tail = 0; + return 1; + } + + srq = to_isrq(qp->ibqp.srq); + rq = &srq->rq; + spin_lock(&rq->lock); + if (unlikely(rq->tail == rq->head)) { + spin_unlock(&rq->lock); + return 0; + } + wqe = get_rwqe_ptr(rq, rq->tail); + qp->r_wr_id = wqe->wr_id; + if (!wr_id_only) { + qp->r_sge.sge = wqe->sg_list[0]; + qp->r_sge.sg_list = wqe->sg_list + 1; + qp->r_sge.num_sge = wqe->num_sge; + qp->r_len = wqe->length; + } + if (++rq->tail >= rq->size) + rq->tail = 0; + if (srq->ibsrq.event_handler) { + struct ib_event ev; + u32 n; + + if (rq->head < rq->tail) + n = rq->size + rq->head - rq->tail; + else + n = rq->head - rq->tail; + if (n < srq->limit) { + srq->limit = 0; + spin_unlock(&rq->lock); + ev.device = qp->ibqp.device; + ev.element.srq = qp->ibqp.srq; + ev.event = IB_EVENT_SRQ_LIMIT_REACHED; + srq->ibsrq.event_handler(&ev, srq->ibsrq.srq_context); + } else + spin_unlock(&rq->lock); + } else + spin_unlock(&rq->lock); + return 1; +} -- 0.99.9n From rolandd at cisco.com Fri Dec 16 15:48:55 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 16 Dec 2005 15:48:55 -0800 Subject: [openib-general] [PATCH 13/13] [RFC] ipath Kconfig and Makefile In-Reply-To: <200512161548.MdcxE8ZQTy1yj4v1@cisco.com> Message-ID: <200512161548.lokgvLraSGi0enUH@cisco.com> Kconfig and Makefile for ipath driver. (Leaving out changes to base drivers/infiniband/{Kconfig,Makefile} to hook these new files into kernel build) --- drivers/infiniband/hw/ipath/Kconfig | 18 ++++++++++++++++++ drivers/infiniband/hw/ipath/Makefile | 15 +++++++++++++++ 2 files changed, 33 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/hw/ipath/Kconfig create mode 100644 drivers/infiniband/hw/ipath/Makefile 8748441795d589631fc58cbf477f485ff6716348 diff --git a/drivers/infiniband/hw/ipath/Kconfig b/drivers/infiniband/hw/ipath/Kconfig new file mode 100644 index 0000000..092faa6 --- /dev/null +++ b/drivers/infiniband/hw/ipath/Kconfig @@ -0,0 +1,18 @@ +config IPATH_CORE + tristate "PathScale InfiniPath Driver" + depends on PCI_MSI && X86_64 + ---help--- + This is a low-level driver for PathScale InfiniPath host + channel adapters (HCAs) based on the HT-400 chip, including the + InfiniPath HT-460, the small form factor InfiniPath HT-460, + the InfiniPath HT-470 and the Linux Networx LS/X. + +config INFINIBAND_IPATH + tristate "PathScale InfiniPath Verbs Driver" + depends on IPATH_CORE && INFINIBAND + ---help--- + This is a driver that provides InfiniBand verbs support for + PathScale InfiniPath host channel adapters (HCAs). This + allows these devices to be used with both kernel upper level + protocols such as IP-over-InfiniBand as well as with userspace + applications (in conjunction with InfiniBand userspace access). 
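With the two Kconfig entries above hooked into drivers/infiniband/Kconfig as noted, selecting the driver reduces to a two-line .config fragment; a sketch, assuming the symbol names stay as posted:

CONFIG_IPATH_CORE=m
CONFIG_INFINIBAND_IPATH=m

IPATH_CORE builds the low-level chip driver and INFINIBAND_IPATH layers the verbs support on top of it, which is why the second symbol depends on the first.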
diff --git a/drivers/infiniband/hw/ipath/Makefile b/drivers/infiniband/hw/ipath/Makefile new file mode 100644 index 0000000..dbe2557 --- /dev/null +++ b/drivers/infiniband/hw/ipath/Makefile @@ -0,0 +1,15 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +EXTRA_CFLAGS += -Wall -O3 -g3 + +_ipath_idstr:="$$""Id: kernel.org InfiniPath Release 1.1 $$"" $$""Date: $(shell date +%F-%R)"" $$" +EXTRA_CFLAGS += -D_IPATH_IDSTR='$(_ipath_idstr)' -DIPATH_KERN_TYPE=0 + +obj-$(CONFIG_IPATH_CORE) += ipath_core.o +obj-$(CONFIG_INFINIBAND_IPATH) += ib_ipath.o + +ipath_core-objs := ipath_copy.o ipath_driver.o \ + ipath_dwordcpy.o ipath_ht400.o ipath_i2c.o ipath_layer.o \ + ipath_lib.o ipath_mlock.o + +ib_ipath-objs := ipath_mad.o ipath_verbs.o -- 0.99.9n From rolandd at cisco.com Fri Dec 16 15:48:55 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 16 Dec 2005 15:48:55 -0800 Subject: [openib-general] [PATCH 12/13] [RFC] ipath verbs MAD handling In-Reply-To: <200512161548.mhIvDiba3wkjPaMc@cisco.com> Message-ID: <200512161548.MdcxE8ZQTy1yj4v1@cisco.com> MAD handling for ipath verbs driver --- drivers/infiniband/hw/ipath/ipath_mad.c | 1020 +++++++++++++++++++++++++++++++ 1 files changed, 1020 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/hw/ipath/ipath_mad.c 21556dcebce3886215a02ff1b730a60beea53125 diff --git a/drivers/infiniband/hw/ipath/ipath_mad.c b/drivers/infiniband/hw/ipath/ipath_mad.c new file mode 100644 index 0000000..361c7fb --- /dev/null +++ b/drivers/infiniband/hw/ipath/ipath_mad.c @@ -0,0 +1,1020 @@ +/* + * Copyright (c) 2005. PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. + * + * $Id: ipath_mad.c 4491 2005-12-15 22:20:31Z rjwalsh $ + */ + +#include +#include + +#include "ips_common.h" +#include "ipath_verbs.h" +#include "ipath_layer.h" + + +#define IB_SMP_INVALID_FIELD __constant_htons(0x001C) + +static int reply(struct ib_smp *smp, int line) +{ + + /* + * The verbs framework will handle the directed/LID route + * packet changes. 
+ */ + smp->method = IB_MGMT_METHOD_GET_RESP; + if (smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + smp->status |= IB_SMP_DIRECTION; + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY; +} + +static inline int recv_subn_get_nodedescription(struct ib_smp *smp) +{ + + strncpy(smp->data, "Infinipath", sizeof(smp->data)); + + return reply(smp, __LINE__); +} + +struct nodeinfo { + u8 base_version; + u8 class_version; + u8 node_type; + u8 num_ports; + __be64 sys_guid; + __be64 node_guid; + __be64 port_guid; + __be16 partition_cap; + __be16 device_id; + __be32 revision; + u8 local_port_num; + u8 vendor_id[3]; +} __attribute__ ((packed)); + +/* + * XXX The num_ports value will need a layer function to get the value + * if we ever have more than one IB port on a chip. + * We will also need to get the GUID for the port. + */ +static inline int recv_subn_get_nodeinfo(struct ib_smp *smp, + struct ib_device *ibdev, u8 port) +{ + struct nodeinfo *nip = (struct nodeinfo *)&smp->data; + ipath_type t = to_idev(ibdev)->ib_unit; + uint32_t vendor, boardid, majrev, minrev; + + nip->base_version = 1; + nip->class_version = 1; + nip->node_type = 1; /* channel adapter */ + nip->num_ports = 1; + /* This is already in network order */ + nip->sys_guid = to_idev(ibdev)->sys_image_guid; + nip->node_guid = ipath_layer_get_guid(t); + nip->port_guid = nip->sys_guid; + nip->partition_cap = cpu_to_be16(ipath_layer_get_npkeys(t)); + nip->device_id = cpu_to_be16(ipath_layer_get_deviceid(t)); + ipath_layer_query_device(t, &vendor, &boardid, &majrev, &minrev); + nip->revision = cpu_to_be32((majrev << 16) | minrev); + nip->local_port_num = port; + nip->vendor_id[0] = 0; + nip->vendor_id[1] = vendor >> 8; + nip->vendor_id[2] = vendor; + + return reply(smp, __LINE__); +} + +static int recv_subn_get_guidinfo(struct ib_smp *smp, struct ib_device *ibdev) +{ + uint32_t t = to_idev(ibdev)->ib_unit; + u32 startgx = 8 * be32_to_cpu(smp->attr_mod); + u64 *p = (u64 *) smp->data; + + /* 32 blocks of 8 64-bit GUIDs per block */ + + memset(smp->data, 0, sizeof(smp->data)); + + /* + * We only support one GUID for now. + * If this changes, the portinfo.guid_cap field needs to be updated too. + */ + if (startgx == 0) { + /* The first is a copy of the read-only HW GUID. 
*/ + *p = ipath_layer_get_guid(t); + } + + return reply(smp, __LINE__); +} + +struct port_info { + __be64 mkey; + __be64 gid_prefix; + __be16 lid; + __be16 sm_lid; + __be32 cap_mask; + __be16 diag_code; + __be16 mkey_lease_period; + u8 local_port_num; + u8 link_width_enabled; + u8 link_width_supported; + u8 link_width_active; + u8 linkspeed_portstate; /* 4 bits, 4 bits */ + u8 portphysstate_linkdown; /* 4 bits, 4 bits */ + u8 mkeyprot_resv_lmc; /* 2 bits, 3 bits, 3 bits */ + u8 linkspeedactive_enabled; /* 4 bits, 4 bits */ + u8 neighbormtu_mastersmsl; /* 4 bits, 4 bits */ + u8 vlcap_inittype; /* 4 bits, 4 bits */ + u8 vl_high_limit; + u8 vl_arb_high_cap; + u8 vl_arb_low_cap; + u8 inittypereply_mtucap; /* 4 bits, 4 bits */ + u8 vlstallcnt_hoqlife; /* 3 bits, 5 bits */ + u8 operationalvl_pei_peo_fpi_fpo; /* 4 bits, 1, 1, 1, 1 */ + __be16 mkey_violations; + __be16 pkey_violations; + __be16 qkey_violations; + u8 guid_cap; + u8 clientrereg_resv_subnetto; /* 1 bit, 2 bits, 5 bits */ + u8 resv_resptimevalue; /* 3 bits, 5 bits */ + u8 localphyerrors_overrunerrors; /* 4 bits, 4 bits */ + __be16 max_credit_hint; + u8 resv; + u8 link_roundtrip_latency[3]; +} __attribute__ ((packed)); + +static int recv_subn_get_portinfo(struct ib_smp *smp, struct ib_device *ibdev, + u8 port) +{ + u32 lportnum = be32_to_cpu(smp->attr_mod); + struct ipath_ibdev *dev; + struct port_info *pip = (struct port_info *)smp->data; + u32 tmp, tmp2; + + if (lportnum == 0) { + lportnum = port; + smp->attr_mod = cpu_to_be32(lportnum); + } + + if (lportnum < 1 || lportnum > ibdev->phys_port_cnt) + return IB_MAD_RESULT_FAILURE; + + dev = to_idev(ibdev); + + /* Clear all fields. Only set the non-zero fields. */ + memset(smp->data, 0, sizeof(smp->data)); + + /* Only return the mkey if the protection field allows it. */ + if ((dev->mkeyprot_resv_lmc >> 6) == 0) + pip->mkey = dev->mkey; + else + pip->mkey = 0; + pip->gid_prefix = dev->gid_prefix; + tmp = ipath_layer_get_lid(dev->ib_unit); + pip->lid = tmp ? 
cpu_to_be16(tmp) : IB_LID_PERMISSIVE; + pip->sm_lid = cpu_to_be16(dev->sm_lid); + pip->cap_mask = cpu_to_be32(dev->port_cap_flags); + /* pip->diag_code; */ + pip->mkey_lease_period = cpu_to_be16(dev->mkey_lease_period); + pip->local_port_num = port; + pip->link_width_enabled = 2; /* 4x */ + pip->link_width_supported = 3; /* 1x or 4x */ + pip->link_width_active = 2; /* 4x */ + pip->linkspeed_portstate = 0x10; /* 2.5Gbps */ + tmp = ipath_layer_get_lastibcstat(dev->ib_unit) & 0xff; + tmp2 = 5; /* link up */ + if (tmp == 0x11) + pip->linkspeed_portstate |= 2; /* initialize */ + else if (tmp == 0x21) + pip->linkspeed_portstate |= 3; /* armed */ + else if (tmp == 0x31) + pip->linkspeed_portstate |= 4; /* active */ + else { + pip->linkspeed_portstate |= 1; /* down */ + tmp2 = tmp & 0xf; + } + /* default state is polling */ + pip->portphysstate_linkdown = (tmp2 << 4) | 2; + pip->mkeyprot_resv_lmc = dev->mkeyprot_resv_lmc; + pip->linkspeedactive_enabled = 0x11; /* 2.5Gbps, 2.5Gbps */ + switch (ipath_layer_get_ibmtu(dev->ib_unit)) { + case 4096: + tmp = IB_MTU_4096; + break; + case 2048: + tmp = IB_MTU_2048; + break; + case 1024: + tmp = IB_MTU_1024; + break; + case 512: + tmp = IB_MTU_512; + break; + case 256: + tmp = IB_MTU_256; + break; + default: /* oops, something is wrong */ + tmp = IB_MTU_2048; + break; + } + pip->neighbormtu_mastersmsl = (tmp << 4) | dev->sm_sl; + pip->vlcap_inittype = 0x10; /* VLCap = VL0, InitType = 0 */ + /* pip->vl_high_limit; // only one VL */ + /* pip->vl_arb_high_cap; // only one VL */ + /* pip->vl_arb_low_cap; // only one VL */ + pip->inittypereply_mtucap = IB_MTU_4096; /* InitTypeReply = 0 */ + /* pip->vlstallcnt_hoqlife; // HCAs ignore VLStallCount and HOQLife */ + pip->operationalvl_pei_peo_fpi_fpo = 0x18; /* OVLs = 1, PEI = 1 */ + pip->mkey_violations = cpu_to_be16(dev->mkey_violations); + /* P_KeyViolations are counted by hardware. */ + tmp = ipath_layer_get_cr_errpkey(dev->ib_unit) & 0xFFFF; + pip->pkey_violations = cpu_to_be16(tmp); + pip->qkey_violations = cpu_to_be16(dev->qkey_violations); + /* Only the hardware GUID is supported for now */ + pip->guid_cap = 1; + pip->clientrereg_resv_subnetto = dev->subnet_timeout; + /* 32.768 usec. response time (guessing) */ + pip->resv_resptimevalue = 3; + /* LocalPhyErrors=max, OverRunErrors=max */ + pip->localphyerrors_overrunerrors = 0xFF; + /* pip->max_credit_hint; */ + /* pip->link_roundtrip_latency[3]; */ + + return reply(smp, __LINE__); +} + +static int recv_subn_get_pkeytable(struct ib_smp *smp, struct ib_device *ibdev) +{ + u32 startpx = 32 * (be32_to_cpu(smp->attr_mod) & 0xffff); + u16 *p = (u16 *) smp->data; + + /* 64 blocks of 32 16-bit P_Key entries */ + + memset(smp->data, 0, sizeof(smp->data)); + if (startpx == 0) + ipath_layer_get_pkeys(to_idev(ibdev)->ib_unit, p); + else + smp->status |= IB_SMP_INVALID_FIELD; + + return reply(smp, __LINE__); +} + +static inline int recv_subn_set_guidinfo(struct ib_smp *smp, + struct ib_device *ibdev) +{ + /* The only GUID we support is the first read-only entry. 
*/ + return recv_subn_get_guidinfo(smp, ibdev); +} + +static inline int recv_subn_set_portinfo(struct ib_smp *smp, + struct ib_device *ibdev, u8 port) +{ + struct port_info *pip = (struct port_info *)smp->data; + uint32_t lportnum = be32_to_cpu(smp->attr_mod); + struct ib_event event; + struct ipath_ibdev *dev; + uint32_t flags; + char clientrereg = 0; + u32 tmp; + u32 tmp2; + int ret; + + if (lportnum == 0) { + lportnum = port; + smp->attr_mod = cpu_to_be32(lportnum); + } + + if (lportnum < 1 || lportnum > ibdev->phys_port_cnt) + return IB_MAD_RESULT_FAILURE; + + dev = to_idev(ibdev); + event.device = ibdev; + event.element.port_num = port; + + if (dev->mkey != pip->mkey) + dev->mkey = pip->mkey; + + if (pip->gid_prefix != dev->gid_prefix) + dev->gid_prefix = pip->gid_prefix; + + tmp = be16_to_cpu(pip->lid); + if (tmp != ipath_layer_get_lid(dev->ib_unit)) { + ipath_set_sps_lid(dev->ib_unit, tmp); + event.event = IB_EVENT_LID_CHANGE; + ib_dispatch_event(&event); + } + + tmp = be16_to_cpu(pip->sm_lid); + if (tmp != dev->sm_lid) { + dev->sm_lid = tmp; + event.event = IB_EVENT_SM_CHANGE; + ib_dispatch_event(&event); + } + + dev->mkey_lease_period = be16_to_cpu(pip->mkey_lease_period); + +#if 0 + tmp = pip->link_width_enabled; + if (tmp && (tmp != lpp->linkwidthenabled)) { + lpp->linkwidthenabled = tmp; + /* JAG - notify driver here */ + } +#endif + + tmp = pip->linkspeed_portstate & 0xF; + flags = ipath_layer_get_flags(dev->ib_unit); + if (flags & IPATH_LINKDOWN) + tmp2 = IB_PORT_DOWN; + else if (flags & IPATH_LINKINIT) + tmp2 = IB_PORT_INIT; + else if (flags & IPATH_LINKARMED) + tmp2 = IB_PORT_ARMED; + else if (flags & IPATH_LINKACTIVE) + tmp2 = IB_PORT_ACTIVE; + else + tmp2 = IB_PORT_NOP; + if (tmp && tmp != tmp2) { + switch (tmp) { + case IB_PORT_DOWN: + case IB_PORT_INIT: + ipath_kset_linkstate(dev->ib_unit << 16 | + IPATH_IB_LINKDOWN); + if (tmp2 == IB_PORT_ACTIVE) { + event.event = IB_EVENT_PORT_ERR; + ib_dispatch_event(&event); + } + break; + + case IB_PORT_ARMED: + ipath_kset_linkstate(dev->ib_unit << 16 | + IPATH_IB_LINKARM); + if (tmp2 == IB_PORT_ACTIVE) { + event.event = IB_EVENT_PORT_ERR; + ib_dispatch_event(&event); + } + break; + + case IB_PORT_ACTIVE: + ipath_kset_linkstate(dev->ib_unit << 16 | + IPATH_IB_LINKACTIVE); + event.event = IB_EVENT_PORT_ACTIVE; + ib_dispatch_event(&event); + break; + + default: + /* XXX We have already partially updated our state! */ + return IB_MAD_RESULT_FAILURE; + } + } +#if 0 + tmp = BF_GET(g.madp, iba_Subn_PortInfo, FIELD_PortPhysicalState); + if (tmp && (tmp != lpp->portphysicalstate)) { + lpp->portphysicalstate = tmp; + /* JAG - notify driver here */ + } + + tmp = BF_GET(g.madp, iba_Subn_PortInfo, FIELD_LinkDownDefaultState); + if (tmp && (tmp != lpp->linkdowndefaultstate)) { + lpp->linkdowndefaultstate = tmp; + /* JAG - notify driver here */ + } +#endif + + dev->mkeyprot_resv_lmc = pip->mkeyprot_resv_lmc; + +#if 0 + tmp = BF_GET(g.madp, iba_Subn_PortInfo, FIELD_LinkSpeedEnabled); + if (tmp && (tmp != lpp->linkspeedenabled)) { + lpp->linkspeedenabled = tmp; + /* JAG - notify driver here */ + } +#endif + + tmp = (pip->neighbormtu_mastersmsl >> 4) & 0xF; + if (tmp) { + switch (tmp) { + case IB_MTU_256: + tmp2 = 256; + break; + case IB_MTU_512: + tmp2 = 512; + break; + case IB_MTU_1024: + tmp2 = 1024; + break; + case IB_MTU_2048: + tmp2 = 2048; + break; + case IB_MTU_4096: + tmp2 = 4096; + break; + default: + /* XXX We have already partially updated our state! 
*/ + return IB_MAD_RESULT_FAILURE; + } + + ipath_kset_mtu(dev->ib_unit << 16 | tmp2); + } + + dev->sm_sl = pip->neighbormtu_mastersmsl & 0xF; + +#if 0 + tmp = BF_GET(g.madp, iba_Subn_PortInfo, FIELD_VLHighLimit); + if (tmp != lpp->vlhighlimit) { + lpp->vlhighlimit = tmp; + /* JAG - notify driver here */ + } + + lpp->inittypereply = + BF_GET(g.madp, iba_Subn_PortInfo, FIELD_InitTypeReply); + + tmp = BF_GET(g.madp, iba_Subn_PortInfo, FIELD_OperationalVLs); + if (tmp && (tmp != lpp->operationalvls)) { + lpp->operationalvls = tmp; + /* JAG - notify driver here */ + } +#endif + + if (pip->mkey_violations != 0) + dev->mkey_violations = 0; +#if 0 + /* XXX Hardware counter can't be reset. */ + if (pip->pkey_violations != 0) + dev->pkey_violations = 0; +#endif + + if (pip->qkey_violations != 0) + dev->qkey_violations = 0; + +#if 0 + tmp = BF_GET(g.madp, iba_Subn_PortInfo, FIELD_LocalPhyErrors); + if (tmp != lpp->localphyerrors) { + lpp->localphyerrors = tmp; + /* JAG - notify driver here */ + } + + tmp = BF_GET(g.madp, iba_Subn_PortInfo, FIELD_OverrunErrors); + if (tmp != lpp->overrunerrors) { + lpp->overrunerrors = tmp; + /* JAG - notify driver here */ + } +#endif + + dev->subnet_timeout = pip->clientrereg_resv_subnetto & 0x1F; + + if (pip->clientrereg_resv_subnetto & 0x80) { + clientrereg = 1; + event.event = IB_EVENT_LID_CHANGE; + ib_dispatch_event(&event); + } + + ret = recv_subn_get_portinfo(smp, ibdev, port); + + if (clientrereg) + pip->clientrereg_resv_subnetto |= 0x80; + + return ret; +} + +static inline int recv_subn_set_pkeytable(struct ib_smp *smp, + struct ib_device *ibdev) +{ + u32 startpx = 32 * (be32_to_cpu(smp->attr_mod) & 0xffff); + u16 *p = (u16 *) smp->data; + + if (startpx != 0 || + ipath_layer_set_pkeys(to_idev(ibdev)->ib_unit, p) != 0) + smp->status |= IB_SMP_INVALID_FIELD; + + return recv_subn_get_pkeytable(smp, ibdev); +} + +#define IB_PMA_CLASS_PORT_INFO __constant_htons(0x0001) +#define IB_PMA_PORT_SAMPLES_CONTROL __constant_htons(0x0010) +#define IB_PMA_PORT_SAMPLES_RESULT __constant_htons(0x0011) +#define IB_PMA_PORT_COUNTERS __constant_htons(0x0012) +#define IB_PMA_PORT_COUNTERS_EXT __constant_htons(0x001D) +#define IB_PMA_PORT_SAMPLES_RESULT_EXT __constant_htons(0x001E) + +struct ib_perf { + u8 base_version; + u8 mgmt_class; + u8 class_version; + u8 method; + __be16 status; + __be16 unused; + __be64 tid; + __be16 attr_id; + __be16 resv; + __be32 attr_mod; + u8 reserved[40]; + u8 data[192]; +} __attribute__ ((packed)); + +struct ib_pma_classportinfo { + u8 base_version; + u8 class_version; + __be16 cap_mask; + u8 reserved[3]; + u8 resp_time_value; /* only lower 5 bits */ + union ib_gid redirect_gid; + __be32 redirect_tc_sl_fl; /* 8, 4, 20 bits respectively */ + __be16 redirect_lid; + __be16 redirect_pkey; + __be32 redirect_qp; /* only lower 24 bits */ + __be32 redirect_qkey; + union ib_gid trap_gid; + __be32 trap_tc_sl_fl; /* 8, 4, 20 bits respectively */ + __be16 trap_lid; + __be16 trap_pkey; + __be32 trap_hl_qp; /* 8, 24 bits respectively */ + __be32 trap_qkey; +} __attribute__ ((packed)); + +struct ib_pma_portsamplescontrol { + u8 opcode; + u8 port_select; + u8 tick; + u8 counter_width; /* only lower 3 bits */ + __be32 counter_mask0_9; /* 2, 10 * 3, bits */ + __be16 counter_mask10_14; /* 1, 5 * 3, bits */ + u8 sample_mechanisms; + u8 sample_status; /* only lower 2 bits */ + __be64 option_mask; + __be64 vendor_mask; + __be32 sample_start; + __be32 sample_interval; + __be16 tag; + __be16 counter_select[15]; +} __attribute__ ((packed)); + +struct 
ib_pma_portsamplesresult { + __be16 tag; + __be16 sample_status; /* only lower 2 bits */ + __be32 counter[15]; +} __attribute__ ((packed)); + +struct ib_pma_portsamplesresult_ext { + __be16 tag; + __be16 sample_status; /* only lower 2 bits */ + __be32 extended_width; /* only upper 2 bits */ + __be64 counter[15]; +} __attribute__ ((packed)); + +struct ib_pma_portcounters { + u8 reserved; + u8 port_select; + __be16 counter_select; + __be16 symbol_error_counter; + u8 link_error_recovery_counter; + u8 link_downed_counter; + __be16 port_rcv_errors; + __be16 port_rcv_remphys_errors; + __be16 port_rcv_switch_relay_errors; + __be16 port_xmit_discards; + u8 port_xmit_constraint_errors; + u8 port_rcv_constraint_errors; + u8 reserved1; + u8 lli_ebor_errors; /* 4, 4, bits */ + __be16 reserved2; + __be16 vl15_dropped; + __be32 port_xmit_data; + __be32 port_rcv_data; + __be32 port_xmit_packets; + __be32 port_rcv_packets; +} __attribute__ ((packed)); + +struct ib_pma_portcounters_ext { + u8 reserved; + u8 port_select; + __be16 counter_select; + __be32 reserved1; + __be64 port_xmit_data; + __be64 port_rcv_data; + __be64 port_xmit_packets; + __be64 port_rcv_packets; + __be64 port_unicast_xmit_packets; + __be64 port_unicast_rcv_packets; + __be64 port_multicast_xmit_packets; + __be64 port_multicast_rcv_packets; +} __attribute__ ((packed)); + +static int recv_pma_get_classportinfo(struct ib_perf *pmp) +{ + /* + struct ib_pma_classportinfo *p = + (struct ib_pma_classportinfo *)pmp->data; + */ + + memset(pmp->data, 0, sizeof(pmp->data)); + + return reply((struct ib_smp *)pmp, __LINE__); +} + +static int recv_pma_get_portsamplescontrol(struct ib_perf *pmp, + struct ib_device *ibdev, u8 port) +{ + struct ib_pma_portsamplescontrol *p = + (struct ib_pma_portsamplescontrol *)pmp->data; + struct ipath_ibdev *dev = to_idev(ibdev); + unsigned long flags; + + memset(pmp->data, 0, sizeof(pmp->data)); + + p->port_select = port; + p->tick = 0xFA; /* 1 ms. 
*/ + p->counter_width = 4; /* 32 bit counters */ + p->counter_mask0_9 = __constant_htonl(0x09248000); /* counters 0-4 */ + spin_lock_irqsave(&dev->pending_lock, flags); + p->sample_status = dev->pma_sample_status; + p->sample_start = cpu_to_be32(dev->pma_sample_start); + p->sample_interval = cpu_to_be32(dev->pma_sample_interval); + p->tag = cpu_to_be16(dev->pma_tag); + p->counter_select[0] = dev->pma_counter_select[0]; + p->counter_select[1] = dev->pma_counter_select[1]; + p->counter_select[2] = dev->pma_counter_select[2]; + p->counter_select[3] = dev->pma_counter_select[3]; + p->counter_select[4] = dev->pma_counter_select[4]; + spin_unlock_irqrestore(&dev->pending_lock, flags); + + return reply((struct ib_smp *)pmp, __LINE__); +} + +static int recv_pma_set_portsamplescontrol(struct ib_perf *pmp, + struct ib_device *ibdev, u8 port) +{ + struct ib_pma_portsamplescontrol *p = + (struct ib_pma_portsamplescontrol *)pmp->data; + struct ipath_ibdev *dev = to_idev(ibdev); + unsigned long flags; + u32 start = be32_to_cpu(p->sample_start); + + if (pmp->attr_mod == 0 && p->port_select == port && start != 0) { + spin_lock_irqsave(&dev->pending_lock, flags); + if (dev->pma_sample_status == IB_PMA_SAMPLE_STATUS_DONE) { + dev->pma_sample_status = IB_PMA_SAMPLE_STATUS_STARTED; + dev->pma_sample_start = start; + dev->pma_sample_interval = + be32_to_cpu(p->sample_interval); + dev->pma_tag = be16_to_cpu(p->tag); + if (p->counter_select[0]) + dev->pma_counter_select[0] = + p->counter_select[0]; + if (p->counter_select[1]) + dev->pma_counter_select[1] = + p->counter_select[1]; + if (p->counter_select[2]) + dev->pma_counter_select[2] = + p->counter_select[2]; + if (p->counter_select[3]) + dev->pma_counter_select[3] = + p->counter_select[3]; + if (p->counter_select[4]) + dev->pma_counter_select[4] = + p->counter_select[4]; + } + spin_unlock_irqrestore(&dev->pending_lock, flags); + } + return recv_pma_get_portsamplescontrol(pmp, ibdev, port); +} + +static u64 get_counter(struct ipath_ibdev *dev, __be16 sel) +{ + switch (sel) { + case IB_PMA_PORT_XMIT_DATA: + return dev->ipath_sword; + case IB_PMA_PORT_RCV_DATA: + return dev->ipath_rword; + case IB_PMA_PORT_XMIT_PKTS: + return dev->ipath_spkts; + case IB_PMA_PORT_RCV_PKTS: + return dev->ipath_rpkts; + case IB_PMA_PORT_XMIT_WAIT: + default: + return 0; + } +} + +static int recv_pma_get_portsamplesresult(struct ib_perf *pmp, + struct ib_device *ibdev) +{ + struct ib_pma_portsamplesresult *p = + (struct ib_pma_portsamplesresult *)pmp->data; + struct ipath_ibdev *dev = to_idev(ibdev); + int i; + + memset(pmp->data, 0, sizeof(pmp->data)); + p->tag = cpu_to_be16(dev->pma_tag); + p->sample_status = cpu_to_be16(dev->pma_sample_status); + for (i = 0; i < ARRAY_SIZE(dev->pma_counter_select); i++) + p->counter[i] = + cpu_to_be32(get_counter(dev, dev->pma_counter_select[i])); + + return reply((struct ib_smp *)pmp, __LINE__); +} + +static int recv_pma_get_portsamplesresult_ext(struct ib_perf *pmp, + struct ib_device *ibdev) +{ + struct ib_pma_portsamplesresult_ext *p = + (struct ib_pma_portsamplesresult_ext *)pmp->data; + struct ipath_ibdev *dev = to_idev(ibdev); + int i; + + memset(pmp->data, 0, sizeof(pmp->data)); + p->tag = cpu_to_be16(dev->pma_tag); + p->sample_status = cpu_to_be16(dev->pma_sample_status); + p->extended_width = __constant_cpu_to_be32(0x80000000); /* 64 bits */ + for (i = 0; i < ARRAY_SIZE(dev->pma_counter_select); i++) + p->counter[i] = + cpu_to_be64(get_counter(dev, dev->pma_counter_select[i])); + + return reply((struct ib_smp *)pmp, __LINE__); +} +
+static int recv_pma_get_portcounters(struct ib_perf *pmp, + struct ib_device *ibdev, u8 port) +{ + struct ib_pma_portcounters *p = (struct ib_pma_portcounters *)pmp->data; + struct ipath_layer_counters cntrs; + + ipath_layer_get_counters(to_idev(ibdev)->ib_unit, &cntrs); + + memset(pmp->data, 0, sizeof(pmp->data)); + p->port_select = port; + if (cntrs.symbol_error_counter > 0xFFFFUL) + p->symbol_error_counter = 0xFFFF; + else + p->symbol_error_counter = + cpu_to_be16((u16)cntrs.symbol_error_counter); + if (cntrs.link_error_recovery_counter > 0xFFUL) + p->link_error_recovery_counter = 0xFF; + else + p->link_error_recovery_counter = + (u8)cntrs.link_error_recovery_counter; + if (cntrs.link_downed_counter > 0xFFUL) + p->link_downed_counter = 0xFF; + else + p->link_downed_counter = (u8)cntrs.link_downed_counter; + if (cntrs.port_rcv_errors > 0xFFFFUL) + p->port_rcv_errors = 0xFFFF; + else + p->port_rcv_errors = cpu_to_be16((u16)cntrs.port_rcv_errors); + if (cntrs.port_rcv_remphys_errors > 0xFFFFUL) + p->port_rcv_remphys_errors = 0xFFFF; + else + p->port_rcv_remphys_errors = + cpu_to_be16((u16)cntrs.port_rcv_remphys_errors); + if (cntrs.port_xmit_discards > 0xFFFFUL) + p->port_xmit_discards = 0xFFFF; + else + p->port_xmit_discards = + cpu_to_be16((u16)cntrs.port_xmit_discards); + if (cntrs.port_xmit_data > 0xFFFFFFFFUL) + p->port_xmit_data = 0xFFFFFFFF; + else + p->port_xmit_data = cpu_to_be32((u32)cntrs.port_xmit_data); + if (cntrs.port_rcv_data > 0xFFFFFFFFUL) + p->port_rcv_data = 0xFFFFFFFF; + else + p->port_rcv_data = cpu_to_be32((u32)cntrs.port_rcv_data); + if (cntrs.port_xmit_packets > 0xFFFFFFFFUL) + p->port_xmit_packets = 0xFFFFFFFF; + else + p->port_xmit_packets = + cpu_to_be32((u32)cntrs.port_xmit_packets); + if (cntrs.port_rcv_packets > 0xFFFFFFFFUL) + p->port_rcv_packets = 0xFFFFFFFF; + else + p->port_rcv_packets = cpu_to_be32((u32)cntrs.port_rcv_packets); + + return reply((struct ib_smp *)pmp, __LINE__); +} + +static int recv_pma_get_portcounters_ext(struct ib_perf *pmp, + struct ib_device *ibdev, u8 port) +{ + struct ib_pma_portcounters_ext *p = + (struct ib_pma_portcounters_ext *)pmp->data; + struct ipath_ibdev *dev = to_idev(ibdev); + u64 swords, rwords, spkts, rpkts; + + ipath_layer_snapshot_counters(to_idev(ibdev)->ib_unit, + &swords, &rwords, &spkts, &rpkts); + + memset(pmp->data, 0, sizeof(pmp->data)); + p->port_select = port; + p->port_xmit_data = cpu_to_be64(swords); + p->port_rcv_data = cpu_to_be64(rwords); + p->port_xmit_packets = cpu_to_be64(spkts); + p->port_rcv_packets = cpu_to_be64(rpkts); + p->port_unicast_xmit_packets = + cpu_to_be64(spkts - dev->n_multicast_xmit); + p->port_unicast_rcv_packets = + cpu_to_be64(rpkts - dev->n_multicast_rcv); + p->port_multicast_xmit_packets = cpu_to_be64(dev->n_multicast_xmit); + p->port_multicast_rcv_packets = cpu_to_be64(dev->n_multicast_rcv); + + return reply((struct ib_smp *)pmp, __LINE__); +} + +static int recv_pma_set_portcounters(struct ib_perf *pmp, + struct ib_device *ibdev, u8 port) +{ + /* XXX HW counters can't be cleared. */ + return recv_pma_get_portcounters(pmp, ibdev, port); +} + +static int recv_pma_set_portcounters_ext(struct ib_perf *pmp, + struct ib_device *ibdev, u8 port) +{ + /* XXX HW counters can't be cleared. 
*/ + return recv_pma_get_portcounters_ext(pmp, ibdev, port); +} + +static inline int process_subn(struct ib_device *ibdev, int mad_flags, + u8 port_num, struct ib_mad *in_mad, + struct ib_mad *out_mad) +{ + struct ib_smp *smp = (struct ib_smp *)out_mad; + struct ipath_ibdev *dev = to_idev(ibdev); + + /* Is the mkey in the process of expiring? */ + if (dev->mkey_lease_timeout && jiffies >= dev->mkey_lease_timeout) { + dev->mkey_lease_timeout = 0; + dev->mkeyprot_resv_lmc &= 0x3F; + } + + /* + * M_Key checking depends on + * Portinfo:M_Key_protect_bits + */ + if ((mad_flags & IB_MAD_IGNORE_MKEY) == 0 && dev->mkey != 0 && + dev->mkey != smp->mkey && (smp->method != IB_MGMT_METHOD_GET || + (dev->mkeyprot_resv_lmc >> 7) != 0)) { + if (dev->mkey_violations != 0xFFFF) + ++dev->mkey_violations; + if (dev->mkey_lease_timeout || dev->mkey_lease_period == 0) + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + dev->mkey_lease_timeout = jiffies + dev->mkey_lease_period * HZ; + /* Future: Generate a trap notice. */ + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + } + + *out_mad = *in_mad; + switch (smp->method) { + case IB_MGMT_METHOD_GET: + switch (smp->attr_id) { + case IB_SMP_ATTR_NODE_DESC: + return recv_subn_get_nodedescription(smp); + + case IB_SMP_ATTR_NODE_INFO: + return recv_subn_get_nodeinfo(smp, ibdev, port_num); + + case IB_SMP_ATTR_GUID_INFO: + return recv_subn_get_guidinfo(smp, ibdev); + + case IB_SMP_ATTR_PORT_INFO: + return recv_subn_get_portinfo(smp, ibdev, port_num); + + case IB_SMP_ATTR_PKEY_TABLE: + return recv_subn_get_pkeytable(smp, ibdev); + + default: + break; + } + break; + + case IB_MGMT_METHOD_SET: + switch (smp->attr_id) { + case IB_SMP_ATTR_GUID_INFO: + return recv_subn_set_guidinfo(smp, ibdev); + + case IB_SMP_ATTR_PORT_INFO: + return recv_subn_set_portinfo(smp, ibdev, port_num); + + case IB_SMP_ATTR_PKEY_TABLE: + return recv_subn_set_pkeytable(smp, ibdev); + + default: + break; + } + break; + + default: + break; + } + return IB_MAD_RESULT_FAILURE; +} + +static inline int process_perf(struct ib_device *ibdev, u8 port_num, + struct ib_mad *in_mad, struct ib_mad *out_mad) +{ + struct ib_perf *pmp = (struct ib_perf *)out_mad; + + *out_mad = *in_mad; + switch (pmp->method) { + case IB_MGMT_METHOD_GET: + switch (pmp->attr_id) { + case IB_PMA_CLASS_PORT_INFO: + return recv_pma_get_classportinfo(pmp); + + case IB_PMA_PORT_SAMPLES_CONTROL: + return recv_pma_get_portsamplescontrol(pmp, ibdev, + port_num); + + case IB_PMA_PORT_SAMPLES_RESULT: + return recv_pma_get_portsamplesresult(pmp, ibdev); + + case IB_PMA_PORT_SAMPLES_RESULT_EXT: + return recv_pma_get_portsamplesresult_ext(pmp, ibdev); + + case IB_PMA_PORT_COUNTERS: + return recv_pma_get_portcounters(pmp, ibdev, port_num); + + case IB_PMA_PORT_COUNTERS_EXT: + return recv_pma_get_portcounters_ext(pmp, ibdev, + port_num); + + default: + break; + } + break; + + case IB_MGMT_METHOD_SET: + switch (pmp->attr_id) { + case IB_PMA_PORT_SAMPLES_CONTROL: + return recv_pma_set_portsamplescontrol(pmp, ibdev, + port_num); + + case IB_PMA_PORT_COUNTERS: + return recv_pma_set_portcounters(pmp, ibdev, port_num); + + case IB_PMA_PORT_COUNTERS_EXT: + return recv_pma_set_portcounters_ext(pmp, ibdev, + port_num); + + default: + break; + } + break; + + default: + break; + } + return IB_MAD_RESULT_FAILURE; +} + +/* + * Note that the verbs framework has already done the MAD sanity checks, + * and hop count/pointer updating for IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE MADs. 
+ *
+ * Return IB_MAD_RESULT_SUCCESS if this is a MAD that we are not interested
+ * in processing.
+ */
+int ipath_process_mad(struct ib_device *ibdev, int mad_flags, u8 port_num,
+ struct ib_wc *in_wc, struct ib_grh *in_grh,
+ struct ib_mad *in_mad, struct ib_mad *out_mad)
+{
+ switch (in_mad->mad_hdr.mgmt_class) {
+ case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE:
+ case IB_MGMT_CLASS_SUBN_LID_ROUTED:
+ return process_subn(ibdev, mad_flags, port_num,
+ in_mad, out_mad);
+
+ case IB_MGMT_CLASS_PERF_MGMT:
+ return process_perf(ibdev, port_num, in_mad, out_mad);
+
+ default:
+ return IB_MAD_RESULT_SUCCESS;
+ }
+}

--
0.99.9n

From robert.j.woodruff at intel.com Fri Dec 16 16:37:17 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Fri, 16 Dec 2005 16:37:17 -0800
Subject: [openib-general] SDP problem on SVN 4507
In-Reply-To: <20051215220214.GA31463@mellanox.co.il>
Message-ID:

I am seeing a strange problem with SDP on SVN4507. When I run NetPIPE over SDP by itself, it runs just fine. However, if I run MPI over uDAPL/CMA at the same time, I seem to have a problem. I start 2 copies of MPI running the Intel MPI benchmark. Then, I start the NetPIPE server and it starts listening for a connect request. Then I start the client side and it fails (on the connect() call) with an errno of 111. If I then stop the MPI/uDAPL/CMA jobs, SDP/NetPIPE can then connect OK. Not sure if this is an SDP issue or some problem with CMA/CM.

Has anyone else seen similar behavior?

woody

From ianjiang.ict at gmail.com Sat Dec 17 00:43:42 2005
From: ianjiang.ict at gmail.com (Ian Jiang)
Date: Sat, 17 Dec 2005 16:43:42 +0800
Subject: [openib-general] Re: [kDAPL]questions about the LMR creation of different types of memory
In-Reply-To: <20051216180338.GC8493@esmail.cup.hp.com>
References: <7b2fa1820512080626kf4c9c23hdc3f416dcb970f6d@mail.gmail.com> <7b2fa1820512081742j7ef50a27kc2322cbf0e52d908@mail.gmail.com> <20051216180338.GC8493@esmail.cup.hp.com>
Message-ID: <7b2fa1820512170043y7ae0e0ccrc577733b708b6399@mail.gmail.com>

Hi Grant,

Thanks very much. I scanned IO-mapping.txt, DMA-API.txt and DMA-mapping.txt as soon as I could and now have a grasp of the main concepts. As you mentioned, ULPs in OpenIB (e.g. SDP or IPoIB) are responsible for properly mapping and unmapping for DMA use.

AFAIK, SDP is implemented with the native IB verbs. What about kDAPL? In my opinion kDAPL does not do the mapping and unmapping work, so it is the responsibility of the kernel applications using kDAPL. Am I right?

On 12/17/05, Grant Grundler wrote:
>
> While IO-mapping.txt gives a nice introduction into the topic
> of "bus addresses", the answer to the question lies in
> Documentation/DMA-API.txt. IO devices can only use "bus addresses"
> that are handed back by the interfaces described in DMA-API.txt.
> For OpenIB, ULPs (e.g. SDP or IPoIB) are responsible for properly
> mapping and unmapping for DMA use.
>
> While many architectures don't use IOMMU (and thus have 1:1
> between host physical:bus address), virtualization seems to be
> forcing the issue in the "near" future. All DMA access will need
> to be enforced to isolate virtualized guests. This is something
> some platforms with IOMMUs enforce today (e.g. Sparc64, PPC64 and
> PA-RISC).
>
> hth,
> grant
>

--
Ian Jiang
ianjiang.ict at gmail.com
Laboratory of Spatial Information Technology
Division of System Architecture
Institute of Computing Technology
Chinese Academy of Sciences
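To make the mapping rule above concrete, the pattern Grant points at in DMA-API.txt boils down to roughly the sketch below. This is illustrative only; ulp_post_send() is a hypothetical stand-in for whatever verb the ULP or kDAPL consumer actually posts with.

    #include <linux/dma-mapping.h>

    /*
     * Sketch of the DMA-API.txt pattern: map a kernel buffer to get a
     * bus address, hand that address (never virt_to_phys()) to the HCA,
     * then unmap when the work completes.  ulp_post_send() is a
     * hypothetical helper, not a real OpenIB call, and a real ULP
     * would do the unmap from its completion handler.
     */
    static void ulp_send_sketch(struct device *dev, void *buf, size_t len)
    {
            dma_addr_t bus_addr;

            bus_addr = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
            ulp_post_send(bus_addr, len);   /* device DMAs from bus_addr */
            dma_unmap_single(dev, bus_addr, len, DMA_TO_DEVICE);
    }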
From penberg at cs.helsinki.fi Sat Dec 17 04:33:16 2005
From: penberg at cs.helsinki.fi (Pekka Enberg)
Date: Sat, 17 Dec 2005 14:33:16 +0200
Subject: [openib-general] Re: [PATCH 01/13] [RFC] ipath basic headers
In-Reply-To: <200512161548.aLjaDpGm5aqk0k0p@cisco.com>
References: <200512161548.jRuyTS0HPMLd7V81@cisco.com> <200512161548.aLjaDpGm5aqk0k0p@cisco.com>
Message-ID: <84144f020512170433h151a7667o42c382242f81347b@mail.gmail.com>

Hi Roland,

On 12/17/05, Roland Dreier wrote:
> +/*
> + * This file contains defines, structures, etc. that are used
> + * to communicate between kernel and user code.
> + */
> +
> +#ifdef __KERNEL__
> +#include
> +#include
> +#include
> +#else /* !__KERNEL__; user mode */
> +#include
> +#include
> +#include
> +#include
> +
> +/* these aren't implemented for user mode, which is OK until we multi-thread */
> +typedef struct _atomic {
> + uint32_t counter;
> +} atomic_t; /* no atomic_t type in user-land */
> +#define atomic_set(a,v) ((a)->counter = (v))
> +#define atomic_inc_return(a) (++(a)->counter)
> +#define likely(x) (x)
> +#define unlikely(x) (x)
> +
> +#define yield() sched_yield()
> +
> +/*
> + * too horrible to try and use the kernel get_cycles() or equivalent,
> + * so define and inline it here
> + */
> +
> +#if !defined(rdtscll)
> +#if defined(__x86_64) || defined(__i386)
> +#define rdtscll(v) do {uint32_t a,d;asm volatile("rdtsc" : "=a" (a), "=d" (d)); \
> + (v) = ((uint64_t)a) | (((uint64_t)d)<<32); \
> +} while(0)
> +#else
> +#error "No cycle counter routine implemented yet for this platform"
> +#endif
> +#endif /* !defined(rdtscll) */

Do we really need this ugly userspace emulation code in the kernel?

> +/*
> + * this is used for very short copies, usually 1 - 8 bytes,
> + * *NEVER* to the PIO buffers!!!!!!! use ipath_dwordcpy for longer
> + * copies, or any copy to the PIO buffers. Works for 32 and 64 bit
> + * gcc and pathcc
> + */
> +static __inline__ void ipath_shortcopy(void *dest, void *src, uint32_t cnt)
> +{
> + void *ssv, *dsv;
> + uint32_t csv;
> + __asm__ __volatile__("cld\n\trep\n\tmovsb":"=&c"(csv), "=&D"(dsv),
> + "=&S"(ssv)
> + :"0"(cnt), "1"(dest), "2"(src)
> + :"memory");
> +}
> +
> +/*
> + * optimized word copy; good for rev C and later opterons. Among the best for
> + * short copies, and does as well or slightly better than the optimization
> + * guide copies 6 and 8 at 2KB.
> + */
> +void ipath_dwordcpy(uint32_t * dest, uint32_t * src, uint32_t ndwords);

What is this used for? Why can't you use memcpy?

> +#define round_up(v,sz) (((v) + (sz)-1) & ~((sz)-1))

Please use ALIGN().

> +/* These are used in the driver, don't use them elsewhere */
> +#define _IPATH_SIMFUNC_IOCTL_LOW 1
> +#define _IPATH_SIMFUNC_IOCTL_HIGH 7
> +
> +/*
> + * These tell the driver which ioctl's belong to the diags interface.
> + * As above, don't use them elsewhere.
> + */
> +#define _IPATH_DIAG_IOCTL_LOW 100
> +#define _IPATH_DIAG_IOCTL_HIGH 109

[snip, snip]

You seem to be introducing loads of new ioctls. Any reason you can't use sysfs and/or configfs?
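For reference, the sysfs route Pekka suggests would look something like this for one read-only value. It is purely a sketch: ipath_read_stats_word() is a made-up accessor standing in for whatever one of the ioctls returns today.

    static ssize_t show_stats_word(struct device *dev,
                                   struct device_attribute *attr, char *buf)
    {
            /* one value per attribute file, newline-terminated;
             * ipath_read_stats_word() is a hypothetical accessor */
            return sprintf(buf, "%llu\n",
                           (unsigned long long)ipath_read_stats_word(dev));
    }
    static DEVICE_ATTR(stats_word, S_IRUGO, show_stats_word, NULL);

A matching device_create_file(&pdev->dev, &dev_attr_stats_word) call in the probe path would then publish the file.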
> +/* macros for processing rcvhdrq entries */
> +#define ips_get_hdr_err_flags(StartOfBuffer) *(((uint32_t *)(StartOfBuffer))+1)
> +#define ips_get_index(StartOfBuffer) (((*((uint32_t *)(StartOfBuffer))) >> \
> + INFINIPATH_RHF_EGRINDEX_SHIFT) & INFINIPATH_RHF_EGRINDEX_MASK)
> +#define ips_get_rcv_type(StartOfBuffer) ((*(((uint32_t *)(StartOfBuffer))) >> \
> + INFINIPATH_RHF_RCVTYPE_SHIFT) & INFINIPATH_RHF_RCVTYPE_MASK)
> +#define ips_get_length_in_bytes(StartOfBuffer) \
> + (uint32_t)(((*(((uint32_t *)(StartOfBuffer))) >> \
> + INFINIPATH_RHF_LENGTH_SHIFT) & INFINIPATH_RHF_LENGTH_MASK) << 2)
> +#define ips_get_first_protocol_header(StartOfBuffer) (void *) \
> + ((uint32_t *)(StartOfBuffer) + 2)
> +#define ips_get_ips_header(StartOfBuffer) ((ips_message_header_typ *) \
> + ((uint32_t *)(StartOfBuffer) + 2))
> +#define ips_get_ipath_ver(ipath_header) (((ipath_header) >> INFINIPATH_I_VERS_SHIFT) \
> + & INFINIPATH_I_VERS_MASK)

Please use static inlines instead for readability.

From penberg at cs.helsinki.fi Sat Dec 17 04:38:57 2005
From: penberg at cs.helsinki.fi (Pekka Enberg)
Date: Sat, 17 Dec 2005 14:38:57 +0200
Subject: [openib-general] Re: [PATCH 03/13] [RFC] ipath copy routines
In-Reply-To: <200512161548.lRw6KI369ooIXS9o@cisco.com>
References: <200512161548.HbgfRzF2TysjsR2G@cisco.com> <200512161548.lRw6KI369ooIXS9o@cisco.com>
Message-ID: <84144f020512170438p5acbc445v30f275aca2d09afe@mail.gmail.com>

On 12/17/05, Roland Dreier wrote:
> +#define TRUE 1
> +#define FALSE 0

Please kill these.

Pekka

From hch at infradead.org Sat Dec 17 05:14:56 2005
From: hch at infradead.org (Christoph Hellwig)
Date: Sat, 17 Dec 2005 13:14:56 +0000
Subject: [openib-general] Re: [PATCH 01/13] [RFC] ipath basic headers
In-Reply-To: <200512161548.aLjaDpGm5aqk0k0p@cisco.com>
References: <200512161548.jRuyTS0HPMLd7V81@cisco.com> <200512161548.aLjaDpGm5aqk0k0p@cisco.com>
Message-ID: <20051217131456.GA13043@infradead.org>

> + * $Id: ipath_common.h 4491 2005-12-15 22:20:31Z rjwalsh $

please remove RCSIDs everywhere.

> +#ifdef __KERNEL__
> +#include
> +#include
> +#include
> +#else /* !__KERNEL__; user mode */
> +#include
> +#include
> +#include
> +#include
> +
> +/* these aren't implemented for user mode, which is OK until we multi-thread */
> +typedef struct _atomic {
> + uint32_t counter;
> +} atomic_t; /* no atomic_t type in user-land */
> +#define atomic_set(a,v) ((a)->counter = (v))
> +#define atomic_inc_return(a) (++(a)->counter)
> +#define likely(x) (x)
> +#define unlikely(x) (x)
> +
> +#define yield() sched_yield()

Please push this out. It's fine if they reuse kernel-code in userspace this way, but please move the compat wrappers to a separate file that's not in the kernel tree.

> +typedef uint8_t ipath_type;

totally meaningless typedef

> +#ifndef _BITS_PER_BYTE
> +#define _BITS_PER_BYTE 8
> +#endif

WTF?

> +
> +static __inline__ void ipath_shortcopy(void *dest, void *src, uint32_t cnt)
> + __attribute__ ((always_inline));
> +/*
> + * this is used for very short copies, usually 1 - 8 bytes,
> + * *NEVER* to the PIO buffers!!!!!!! use ipath_dwordcpy for longer
> + * copies, or any copy to the PIO buffers. Works for 32 and 64 bit
> + * gcc and pathcc
> + */
> +static __inline__ void ipath_shortcopy(void *dest, void *src, uint32_t cnt)

in kernel land __inline__ includes always_inline. Also no need for a separate prototype for an inline function that immediately follows.
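Concretely, the conversion Pekka asks for might look like the following for one of the macros above. This is a sketch; the RHF shift and mask constants are the ones from the quoted header.

    /* ips_get_index() as a typed static inline rather than a macro */
    static inline u32 ips_get_index(const u32 *rcvhdr)
    {
            /* same bit manipulation as the macro, but type-checked */
            return (*rcvhdr >> INFINIPATH_RHF_EGRINDEX_SHIFT) &
                    INFINIPATH_RHF_EGRINDEX_MASK;
    }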
> +{ > + void *ssv, *dsv; > + uint32_t csv; > + __asm__ __volatile__("cld\n\trep\n\tmovsb":"=&c"(csv), "=&D"(dsv), > + "=&S"(ssv) > + :"0"(cnt), "1"(dest), "2"(src) > + :"memory"); > +} No way we're gonna put assembler code into such a driver. > +struct ipath_int_vec { > + int long long addr; > + uint32_t info; > +}; please always used fixes-size types for user communication. also please avoid ioctls like the rest of the IB codebase. > +/* Similarly, this is the kernel version going back to the user. It's slightly > + * different, in that we want to tell if the driver was built as part of a > + * PathScale release, or from the driver from the OpenIB, kernel.org, or a > + * standard distribution, for support reasons. The high bit is 0 for > + * non-PathScale, and 1 for PathScale-built/supplied. That bit is defined > + * in Makefiles, rather than this file. > + * > + * It's returned by the driver to the user code during initialization > + * in the spi_sw_version field of ipath_base_info, so the user code can > + * in turn check for compatibility with the kernel. > +*/ > +#define IPATH_KERN_SWVERSION ((IPATH_KERN_TYPE<<31) | IPATH_USER_SWVERSION) NACK, there's no way we're gonna put in a way to identify an "official" version. The official version is the last one in mainline always. > +#ifndef PCI_VENDOR_ID_PATHSCALE /* not in pci.ids yet */ > +#define PCI_VENDOR_ID_PATHSCALE 0x1fc1 > +#define PCI_DEVICE_ID_PATHSCALE_INFINIPATH1 0xa > +#define PCI_DEVICE_ID_PATHSCALE_INFINIPATH2 0xd > +#endif so move it there? > +typedef struct _ipath_portdata { please avoid typedefs for struct types. > +/* > + * these should be somewhat dynamic someday, although they are fixed > + * for all users of the device on any given load. > + * > + * NOTE: There is a VM bug in the 2.4 Kernels similar to the one Dave > + * fixed in the 2.6 Kernel. When using large or discontinuous memory, > + * we get random kernel oops. So, in 2.4, we are just going to stick > + * with 4k chunks instead of 64k chunks. > + */ No one cares about 2.4 kernels here. > + * these function similarly to the mlock/munlock system calls. > + * ipath_mlock() is used to pin an address range (if not already pinned), > + * and optionally return the list of physical addresses > + * ipath_munlock() does the obvious, and ipath_mlock() cleans up all > + * private memory, used at driver unload. > + * ipath_mlock_nocopy() is similar to mlock, but only one page, and marks > + * the vm so the page isn't taken away on a fork. > + */ > +int ipath_mlock(unsigned long, size_t, struct page **); > +int ipath_mlock_nocopy(unsigned long, struct page **); this kind of thing definitly doesn't belong into an LLDD. or maybe it's just stale prototypes? > +#ifdef IPATH_COSIM > +extern __u32 sim_readl(const volatile void __iomem * addr); > +extern __u64 sim_readq(const volatile void __iomem * addr); > +extern void sim_writel(__u32 val, volatile void __iomem * addr); > +extern void sim_writeq(__u64 val, volatile void __iomem * addr); > +#define ipath_readl(addr) sim_readl(addr) > +#define ipath_readq(addr) sim_readq(addr) > +#define ipath_writel(val, addr) sim_writel(val, addr) > +#define ipath_writeq(val, addr) sim_writeq(val, addr) > +#else > +#define ipath_readl(addr) readl(addr) > +#define ipath_readq(addr) readq(addr) > +#define ipath_writel(val, addr) writel(val, addr) > +#define ipath_writeq(val, addr) writeq(val, addr) > +#endif Please use the proper functions directly. Your simulator can override them if nessecary. 
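To make Christoph's fixed-size point above concrete, the quoted structure could be declared along these lines. The reserved field is an illustrative addition that pads the size to a multiple of 8 bytes so 32-bit and 64-bit builds agree on the layout.

    /* user/kernel ABI struct with fixed-size types and explicit padding */
    struct ipath_int_vec {
            __u64 addr;             /* fixed-size, same on all ABIs */
            __u32 info;
            __u32 reserved;         /* explicit pad: identical layout on
                                     * 32-bit and 64-bit architectures */
    };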
> +static __inline__ uint32_t ipath_kget_kreg32(const ipath_type stype, > + ipath_kreg regno) > +{ > + volatile uint32_t *kreg32; > + > + if (!devdata[stype].ipath_kregbase) > + return ~0; > + > + kreg32 = (volatile uint32_t *)&devdata[stype].ipath_kregbase[regno]; volatile use is probably always wrong. but this whole functions looks like a very odd wrapper anyway? From hch at infradead.org Sat Dec 17 05:16:14 2005 From: hch at infradead.org (Christoph Hellwig) Date: Sat, 17 Dec 2005 13:16:14 +0000 Subject: [openib-general] Re: [PATCH 00/13] [RFC] IB: PathScale InfiniPath driver In-Reply-To: <200512161548.jRuyTS0HPMLd7V81@cisco.com> References: <20051031150618.627779f1.akpm@osdl.org> <200512161548.jRuyTS0HPMLd7V81@cisco.com> Message-ID: <20051217131614.GB13043@infradead.org> On Fri, Dec 16, 2005 at 03:48:54PM -0800, Roland Dreier wrote: > having sysctls that set values also settable through module parameters > under /sys/module, code inside #ifndef __KERNEL__ so include files can > be shared with other PathScale code, code in ipath_i2c.c that might be > simplified by using drivers/i2c, etc. I'd like to try to get a sense > of whether I'm being too picky or whether PathScale really does need > to fix these up before the driver is merged. Yes, please fix this stuff before. The current driver looks like a horrible mess. Is there some political plot going where pathscale folks are forcing you to send this out in this scheme? Otherwise I couldn't explain the code quality magnitudes lower than normally expected from your merges. From hch at infradead.org Sat Dec 17 05:16:49 2005 From: hch at infradead.org (Christoph Hellwig) Date: Sat, 17 Dec 2005 13:16:49 +0000 Subject: [openib-general] Re: [PATCH 03/13] [RFC] ipath copy routines In-Reply-To: <200512161548.lRw6KI369ooIXS9o@cisco.com> References: <200512161548.HbgfRzF2TysjsR2G@cisco.com> <200512161548.lRw6KI369ooIXS9o@cisco.com> Message-ID: <20051217131649.GC13043@infradead.org> On Fri, Dec 16, 2005 at 03:48:54PM -0800, Roland Dreier wrote: > Copy routines for ipath driver NACK, assembler copy routines don't belong into drivers. From mst at mellanox.co.il Sat Dec 17 07:32:39 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sat, 17 Dec 2005 17:32:39 +0200 Subject: [openib-general] Re: SDP problem on SVN 4507 In-Reply-To: References: Message-ID: <20051217153239.GA17388@mellanox.co.il> Quoting r. Bob Woodruff : > Subject: SDP problem on SVN 4507 > > > I am seeing a strange problem with SDP on SVN4507. > > When I run NetPIPE over SDP by itself, it runs just fine. > However, if I run MPI over uDAPL/CMA at the same > time, I seem to have a problem. > I start 2 copies of MPI running Intel MPI benchmark. Then, > I start the NetPIPE server and it starts to listen waiting for a connect > request. > Then I start the client side and it fails (on the connect() call) with > an > errno of 111. If I then stop the MPI/uDAPL/CMA jobs, > SDP/NetPIPE can then connect OK. Not sure if this is an SDP issue > or some problem with CMA/CM. > > Has anyone else seen a similar behavior ? > > > woody > I hope this will be resolved with the move to CMA. 
-- MST From rdreier at cisco.com Sat Dec 17 07:51:04 2005 From: rdreier at cisco.com (Roland Dreier) Date: Sat, 17 Dec 2005 07:51:04 -0800 Subject: [openib-general] Re: [PATCH 00/13] [RFC] IB: PathScale InfiniPath driver In-Reply-To: <20051217131614.GB13043@infradead.org> (Christoph Hellwig's message of "Sat, 17 Dec 2005 13:16:14 +0000") References: <20051031150618.627779f1.akpm@osdl.org> <200512161548.jRuyTS0HPMLd7V81@cisco.com> <20051217131614.GB13043@infradead.org> Message-ID: Christoph> Is there some political plot going where pathscale Christoph> folks are forcing you to send this out in this scheme? Christoph> Otherwise I couldn't explain the code quality Christoph> magnitudes lower than normally expected from your Christoph> merges. No political plot -- this posting was an RFC in the literal sense, with no expectation that the code is mergable as-is. I just want to get comments early so that we have a better idea of what needs to be fixed. For example, what's your feeling about sysctls in drivers? BTW, Pathscale people -- please respond to the comments that are made about your driver... - R. From mst at mellanox.co.il Sat Dec 17 07:55:44 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sat, 17 Dec 2005 17:55:44 +0200 Subject: [openib-general] Re: [RFC] IB_AT_MOST In-Reply-To: <000201c60281$4cf5c610$6401a8c0@infiniconsys.com> References: <000201c60281$4cf5c610$6401a8c0@infiniconsys.com> Message-ID: <20051217155544.GB17388@mellanox.co.il> Quoting r. Fab Tillier : > Subject: RE: [RFC] IB_AT_MOST > > Hi Michael, > > > From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] > > Sent: Friday, December 16, 2005 5:58 AM > > > > Hi! > > I recently noted that some middleware seems to use the "as much > > as possible" approach, for example, using maximum possible value > > for max_rd_atomic or other fields, in create/modify qp. > > > > An obvious thing could be to perform query_device and use max. > > values from there. However, it turns out that hardware max supported > > values might not be easy to express in terms of a single constant. > > Consider for example the max number of s/g entries supported per > > WQE: mellanox HCAs support different number of these for RC and UD > > QPs. So whatever single number query device reports, using it will > > never achieve what the user wants for all QP types. > > > > Rather than extending the device query for all thinkable hardware > > weirdness, I'd like to propose, instead, the following API extension > > (below): passing a negative value in e.g. qp attribute would have the > > meaning: let hardware use at most the specified value. > > This, as opposed to the usual "at least the specified value" meaning > > for positive values. > > > > How does the following work, for an API? Please comment. > > I don't understand the IB_AT_MOST macro. If someone uses IB_AT_MOST( 1) and > the hardware supports 4, they will get 4, which is definitely not "at most 1". Yes, but we could easily fix this in the hardware provider so that they get 1. > I would rename it to IB_MAX, and define it a -1 or something like that. This is an option, too. 
-- MST From akpm at osdl.org Sat Dec 17 12:38:16 2005 From: akpm at osdl.org (Andrew Morton) Date: Sat, 17 Dec 2005 12:38:16 -0800 Subject: [openib-general] Re: [git patch review 2/7] IB/mthca: correct log2 calculation In-Reply-To: <1134705617067-bb88e1b23a3e36b6@cisco.com> References: <1134705617067-b51dec64cec55f52@cisco.com> <1134705617067-bb88e1b23a3e36b6@cisco.com> Message-ID: <20051217123816.18ad94e0.akpm@osdl.org> Roland Dreier wrote: > > Fix thinko in rd_atomic calculation: ffs(x) - 1 does not find the next > power of 2 -- it should be fls(x - 1). Please use round_up_pow_of_two(). From akpm at osdl.org Sat Dec 17 12:38:27 2005 From: akpm at osdl.org (Andrew Morton) Date: Sat, 17 Dec 2005 12:38:27 -0800 Subject: [openib-general] Re: [PATCH 01/13] [RFC] ipath basic headers In-Reply-To: <200512161548.aLjaDpGm5aqk0k0p@cisco.com> References: <200512161548.jRuyTS0HPMLd7V81@cisco.com> <200512161548.aLjaDpGm5aqk0k0p@cisco.com> Message-ID: <20051217123827.32f119da.akpm@osdl.org> Roland Dreier wrote: > > ... > > +#ifdef __KERNEL__ > +#include > +#include > +#include > +#else /* !__KERNEL__; user mode */ > +#include > +#include > +#include > +#include > + > +/* these aren't implemented for user mode, which is OK until we multi-thread */ > +typedef struct _atomic { > + uint32_t counter; > +} atomic_t; /* no atomic_t type in user-land */ > +#define atomic_set(a,v) ((a)->counter = (v)) > +#define atomic_inc_return(a) (++(a)->counter) > +#define likely(x) (x) > +#define unlikely(x) (x) > + > +#define yield() sched_yield() Some might get upset about what I assume is userspace test harness code or what _is_ this doing?) in a driver. But if the maintainers find it useful we can live with it, > +#ifndef _BITS_PER_BYTE > +#define _BITS_PER_BYTE 8 > +#endif I'd be inclined to stick BITS_PER_BYTE into include/linux/types.h. > +static __inline__ void ipath_shortcopy(void *dest, void *src, uint32_t cnt) > + __attribute__ ((always_inline)); s/__inline__/inline/ throughout. > +#define round_up(v,sz) (((v) + (sz)-1) & ~((sz)-1)) We have ALIGN() > +struct ipath_int_vec { > + int long long addr; long long > + uint32_t info; > +}; > +struct ipath_eeprom_req { > + long long addr; Like this. > + uint16_t len; > + uint16_t offset; > +}; > ... > +#define IPATH_USERINIT _IOW('s', 16, struct ipath_user_info) > +/* init; kernel/chip params to user */ > +#define IPATH_BASEINFO _IOR('s', 17, struct ipath_base_info) > +/* send a packet */ > +#define IPATH_SENDPKT _IOW('s', 18, struct ipath_sendpkt) uh-oh. ioctls. Do we have compat conversions for them all, if needed? > +/* > + * A segment is a linear region of low physical memory. > + * XXX Maybe we should use phys addr here and kmap()/kunmap() > + * Used by the verbs layer. > + */ > +struct ipath_seg { > + void *vaddr; > + u64 length; > +}; Suggest `long' for the length. We don't need 64 bits on 32-bit machines. > +struct ipath_mregion { > + u64 user_base; /* User's address for this region */ void *. > + u64 iova; /* IB start address of this region */ Maybe here too. > +int ipath_mlock(unsigned long, size_t, struct page **); Sometimes it does `int foo()' and sometimes `extern int foo()'. I tend to think the `extern' is a waste of space. > +#define ipath_func_krecord(a) > +#define ipath_func_urecord(a, b) > +#define ipath_func_mrecord(a, b) > +#define ipath_func_rkrecord(a) > +#define ipath_func_rurecord(a, b) > +#define ipath_func_rmrecord(a, b) > +#define ipath_func_rsrecord(a) > +#define ipath_func_rcrecord(a) What are all these doing? 
Might need do{}while(0) for safety. > +#ifdef IPATH_COSIM > +extern __u32 sim_readl(const volatile void __iomem * addr); > +extern __u64 sim_readq(const volatile void __iomem * addr); The driver has a strange mixture of int32_t, s32 and __s32. s32 is preferred. > + */ > +static __inline__ uint32_t ipath_kget_ureg32(const ipath_type stype, > + ipath_ureg regno, int port) > +{ > + uint64_t *ubase; > + > + ubase = (uint64_t *) (devdata[stype].ipath_uregbase > + + (char *)devdata[stype].ipath_kregbase > + + devdata[stype].ipath_palign * port); > + return ubase ? ipath_readl(ubase + regno) : 0; > +} Are all these u64's needed on 32-bit? > +static __inline__ uint64_t ipath_kget_kreg64(const ipath_type stype, > + ipath_kreg regno) > +{ > + if (!devdata[stype].ipath_kregbase) > + return ~0ULL; We don't know that the architecture implements u64 as unsigned long long. Some use unsigned long. Best way of implmenting the all-ones pattern is just `-1'. Gee. Big driver. From akpm at osdl.org Sat Dec 17 12:38:33 2005 From: akpm at osdl.org (Andrew Morton) Date: Sat, 17 Dec 2005 12:38:33 -0800 Subject: [openib-general] Re: [PATCH 03/13] [RFC] ipath copy routines In-Reply-To: <200512161548.lRw6KI369ooIXS9o@cisco.com> References: <200512161548.HbgfRzF2TysjsR2G@cisco.com> <200512161548.lRw6KI369ooIXS9o@cisco.com> Message-ID: <20051217123833.1aa430ab.akpm@osdl.org> Roland Dreier wrote: > > + .globl ipath_dwordcpy > +/* rdi destination, rsi source, rdx count */ > +ipath_dwordcpy: > + movl %edx,%ecx > + shrl $1,%ecx > + andl $1,%edx > + cld > + rep > + movsq > + movl %edx,%ecx > + rep > + movsd > + ret err, we have a portability problem. From akpm at osdl.org Sat Dec 17 12:38:38 2005 From: akpm at osdl.org (Andrew Morton) Date: Sat, 17 Dec 2005 12:38:38 -0800 Subject: [openib-general] Re: [PATCH 04/13] [RFC] ipath LLD core, part 1 In-Reply-To: <200512161548.20XjmmxDHjOZRXcz@cisco.com> References: <200512161548.lRw6KI369ooIXS9o@cisco.com> <200512161548.20XjmmxDHjOZRXcz@cisco.com> Message-ID: <20051217123838.7732c201.akpm@osdl.org> Roland Dreier wrote: > > + if ((ret = copy_from_user(&rpkt, p, sizeof rpkt))) { > + _IPATH_DBG("Failed to copy in pkt struct (%d)\n", ret); > + return ret; > + } The driver does this quite a lot. copy_from_user() will return the number of bytes remaining to copy. So I think we'll be wanting `return -EFAULT;' in lots of places rather than this. From akpm at osdl.org Sat Dec 17 12:38:50 2005 From: akpm at osdl.org (Andrew Morton) Date: Sat, 17 Dec 2005 12:38:50 -0800 Subject: [openib-general] Re: [PATCH 07/13] [RFC] ipath core misc files In-Reply-To: <200512161548.3fqe3fMerrheBMdX@cisco.com> References: <200512161548.KglSM2YESlGlEQfQ@cisco.com> <200512161548.3fqe3fMerrheBMdX@cisco.com> Message-ID: <20051217123850.aa6cfd53.akpm@osdl.org> Roland Dreier wrote: > > ... > +/* > + * This isn't perfect, but it's close enough for timing work. We want this > + * to work on systems where the cycle counter isn't the same as the clock > + * frequency. The one msec spin is OK, since we execute this only once > + * when first loaded. We don't use CURRENT_TIME because on some systems > + * it only has jiffy resolution; we just assume udelay is well calibrated > + * and that we aren't likely to be rescheduled. Do it multiple times, > + * with a yield in between, to try to make sure we get the "true minimum" > + * value. 
> + * _ipath_pico_per_cycle isn't going to lead to completely accurate > + * conversions from timestamps to nanoseconds, but it's close enough > + * for our purposes, which is mainly to allow people to show events with > + * nsecs or usecs if desired, rather than cycles. > + */ > +void ipath_init_picotime(void) > +{ > + int i; > + u_int64_t ts, te, delta = -1ULL; > + > + for (i = 0; i < 5; i++) { > + ts = get_cycles(); > + udelay(250); > + te = get_cycles(); > + if ((te - ts) < delta) > + delta = te - ts; > + yield(); > + } > + _ipath_pico_per_cycle = 250000000 / delta; > +} hm, I hope this is debug code which is going away. If not, we should take a look at what it's trying to do here. > +/* > + * Our version of the kernel mlock function. This function is no longer > + * exposed, so we need to do it ourselves. It takes a given start page > + * (page aligned user virtual address) and pins it and the following specified > + * number of pages. > + * For now, num_pages is always 1, but that will probably change at some > + * point (because caller is doing expected sends on a single virtually > + * contiguous buffer, so we can do all pages at once). > + */ > +int ipath_mlock(unsigned long start_page, size_t num_pages, struct page **p) > +{ > + int n; > + > + _IPATH_VDBG("pin %lx pages from vaddr %lx\n", num_pages, start_page); > + down_read(¤t->mm->mmap_sem); > + n = get_user_pages(current, current->mm, start_page, num_pages, 1, 1, > + p, NULL); > + up_read(¤t->mm->mmap_sem); > + if (n != num_pages) { > + _IPATH_INFO > + ("get_user_pages (0x%lx pages starting at 0x%lx failed with %d\n", > + num_pages, start_page, n); > + if (n < 0) /* it's an errno */ > + return n; > + return -ENOMEM; /* no way to know actual error */ > + } > + > + return 0; > +} OK. It's perhaps not a very well named function. > +/* > + * this is similar to ipath_mlock, but it's always one page, and we mark > + * the page as locked for i/o, and shared. This is used for the user process > + * page that contains the destination address for the rcvhdrq tail update, > + * so we need to have the vma. If we don't do this, the page can be taken > + * away from us on fork, even if the child never touches it, and then > + * the user process never sees the tail register updates. > + */ > +int ipath_mlock_nocopy(unsigned long start_page, struct page **p) > +{ > + int n; > + struct vm_area_struct *vm = NULL; > + > + down_read(¤t->mm->mmap_sem); > + n = get_user_pages(current, current->mm, start_page, 1, 1, 1, p, &vm); > + up_read(¤t->mm->mmap_sem); > + if (n != 1) { > + _IPATH_INFO("get_user_pages for 0x%lx failed with %d\n", > + start_page, n); > + if (n < 0) /* it's an errno */ > + return n; > + return -ENOMEM; /* no way to know actual error */ > + } > + vm->vm_flags |= VM_SHM | VM_LOCKED; > + > + return 0; > +} I don't think we want to be setting the user's VMA's vm_flags in this manner. This is purely to retain the physical page across fork? > +/* > + * Our version of the kernel munlock function. This function is no longer > + * exposed, so we need to do it ourselves. It unpins the start page > + * (a page aligned full user virtual address, not a page number) > + * and pins it and the following specified number of pages. 
> + */
> +int ipath_munlock(size_t num_pages, struct page **p)
> +{
> + int i;
> +
> + for (i = 0; i < num_pages; i++) {
> + _IPATH_MMDBG("%u/%lu put_page %p\n", i, num_pages, p[i]);
> + SetPageDirty(p[i]);
> + put_page(p[i]);
> + }
> + return 0;
> +}

Nope, SetPageDirty() doesn't tell the VM that the page is dirty - it'll never get written out. Use set_page_dirty_lock().

From akpm at osdl.org Sat Dec 17 12:38:56 2005
From: akpm at osdl.org (Andrew Morton)
Date: Sat, 17 Dec 2005 12:38:56 -0800
Subject: [openib-general] Re: [PATCH 08/13] [RFC] ipath core last bit
In-Reply-To: <200512161548.y9KRuNtfMzpZjwni@cisco.com>
References: <200512161548.3fqe3fMerrheBMdX@cisco.com> <200512161548.y9KRuNtfMzpZjwni@cisco.com>
Message-ID: <20051217123856.d16529a5.akpm@osdl.org>

Roland Dreier wrote:
>
> +EXPORT_SYMBOL(ipath_kset_linkstate);
> +EXPORT_SYMBOL(ipath_kset_mtu);
> +EXPORT_SYMBOL(ipath_layer_close);
> +EXPORT_SYMBOL(ipath_layer_get_bcast);
> +EXPORT_SYMBOL(ipath_layer_get_cr_errpkey);
> +EXPORT_SYMBOL(ipath_layer_get_deviceid);
> +EXPORT_SYMBOL(ipath_layer_get_flags);
> +EXPORT_SYMBOL(ipath_layer_get_guid);
> +EXPORT_SYMBOL(ipath_layer_get_ibmtu);
> etc

EXPORT_SYMBOL_GPL?

From mst at mellanox.co.il Sat Dec 17 13:30:30 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sat, 17 Dec 2005 23:30:30 +0200
Subject: [openib-general] Re: [git patch review 2/7] IB/mthca: correct log2 calculation
In-Reply-To: <20051217123816.18ad94e0.akpm@osdl.org>
References: <20051217123816.18ad94e0.akpm@osdl.org>
Message-ID: <20051217213030.GB19246@mellanox.co.il>

Quoting r. Andrew Morton :
> Subject: Re: [git patch review 2/7] IB/mthca: correct log2 calculation
>
> Roland Dreier wrote:
> >
> > Fix thinko in rd_atomic calculation: ffs(x) - 1 does not find the next
> > power of 2 -- it should be fls(x - 1).
>
> Please use round_up_pow_of_two().

Yes, but we want the bit number. roundup_pow_of_two does a shift.

--
MST

From rjwalsh at pathscale.com Sat Dec 17 13:29:23 2005
From: rjwalsh at pathscale.com (Robert Walsh)
Date: Sat, 17 Dec 2005 13:29:23 -0800
Subject: [openib-general] Re: [PATCH 07/13] [RFC] ipath core misc files
In-Reply-To: <20051217123850.aa6cfd53.akpm@osdl.org>
References: <200512161548.KglSM2YESlGlEQfQ@cisco.com> <200512161548.3fqe3fMerrheBMdX@cisco.com> <20051217123850.aa6cfd53.akpm@osdl.org>
Message-ID: <1134854963.20575.17.camel@phosphene.durables.org>

> > +void ipath_init_picotime(void)
> > +{
> > + int i;
> > + u_int64_t ts, te, delta = -1ULL;
> > +
> > + for (i = 0; i < 5; i++) {
> > + ts = get_cycles();
> > + udelay(250);
> > + te = get_cycles();
> > + if ((te - ts) < delta)
> > + delta = te - ts;
> > + yield();
> > + }
> > + _ipath_pico_per_cycle = 250000000 / delta;
> > +}
>
> hm, I hope this is debug code which is going away. If not, we should take
> a look at what it's trying to do here.

This isn't debug code. It's used to calculate the rough number of picoseconds per cycle. This is used in the driver to make sure HyperTransport reads haven't timed out (see ipath_snap_cntr in ipath_driver.c) when reading chip counters.

If you can think of a better way of figuring out whether a read took longer than a certain length of time, I'd be interested in knowing it.

Regards,
Robert.

--
Robert Walsh Email: rjwalsh at pathscale.com
PathScale, Inc. Phone: +1 650 934 8117
2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969
Mountain View, CA 94043.
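Folding Andrew's set_page_dirty_lock() comment back into the quoted function gives roughly the following sketch, with the debug print dropped for brevity:

    int ipath_munlock(size_t num_pages, struct page **p)
    {
            size_t i;

            for (i = 0; i < num_pages; i++) {
                    /* dirty via the locked helper so writeback happens */
                    set_page_dirty_lock(p[i]);
                    put_page(p[i]);
            }
            return 0;
    }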
From rjwalsh at pathscale.com Sat Dec 17 13:33:55 2005 From: rjwalsh at pathscale.com (Robert Walsh) Date: Sat, 17 Dec 2005 13:33:55 -0800 Subject: [openib-general] Re: [PATCH 07/13] [RFC] ipath core misc files In-Reply-To: <20051217123850.aa6cfd53.akpm@osdl.org> References: <200512161548.KglSM2YESlGlEQfQ@cisco.com> <200512161548.3fqe3fMerrheBMdX@cisco.com> <20051217123850.aa6cfd53.akpm@osdl.org> Message-ID: <1134855235.20575.22.camel@phosphene.durables.org> > > +int ipath_mlock(unsigned long start_page, size_t num_pages, struct page **p) > OK. It's perhaps not a very well named function. Really? Suggestion for a better name? > > + } > > + vm->vm_flags |= VM_SHM | VM_LOCKED; > > + > > + return 0; > > +} > > I don't think we want to be setting the user's VMA's vm_flags in this > manner. This is purely to retain the physical page across fork? I didn't write this bit of the driver, but I believe this is the case. Is there a better way of doing this? > > +int ipath_munlock(size_t num_pages, struct page **p) > > +{ > > + int i; > > + > > + for (i = 0; i < num_pages; i++) { > > + _IPATH_MMDBG("%u/%lu put_page %p\n", i, num_pages, p[i]); > > + SetPageDirty(p[i]); > > + put_page(p[i]); > > + } > > + return 0; > > +} > > Nope, SetPageDirty() doesn't tell the VM that the page is dirty - it'll > never get written out. Use set_page_dirty_lock(). OK. Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. From rjwalsh at pathscale.com Sat Dec 17 13:34:36 2005 From: rjwalsh at pathscale.com (Robert Walsh) Date: Sat, 17 Dec 2005 13:34:36 -0800 Subject: [openib-general] Re: [PATCH 04/13] [RFC] ipath LLD core, part 1 In-Reply-To: <20051217123838.7732c201.akpm@osdl.org> References: <200512161548.lRw6KI369ooIXS9o@cisco.com> <200512161548.20XjmmxDHjOZRXcz@cisco.com> <20051217123838.7732c201.akpm@osdl.org> Message-ID: <1134855276.20575.25.camel@phosphene.durables.org> On Sat, 2005-12-17 at 12:38 -0800, Andrew Morton wrote: > Roland Dreier wrote: > > > > + if ((ret = copy_from_user(&rpkt, p, sizeof rpkt))) { > > + _IPATH_DBG("Failed to copy in pkt struct (%d)\n", ret); > > + return ret; > > + } > > The driver does this quite a lot. copy_from_user() will return the number > of bytes remaining to copy. So I think we'll be wanting `return -EFAULT;' > in lots of places rather than this. Thanks. Will fix. Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. From rjwalsh at pathscale.com Sat Dec 17 13:38:43 2005 From: rjwalsh at pathscale.com (Robert Walsh) Date: Sat, 17 Dec 2005 13:38:43 -0800 Subject: [openib-general] Re: [PATCH 03/13] [RFC] ipath copy routines In-Reply-To: <84144f020512170438p5acbc445v30f275aca2d09afe@mail.gmail.com> References: <200512161548.HbgfRzF2TysjsR2G@cisco.com> <200512161548.lRw6KI369ooIXS9o@cisco.com> <84144f020512170438p5acbc445v30f275aca2d09afe@mail.gmail.com> Message-ID: <1134855523.20575.29.camel@phosphene.durables.org> On Sat, 2005-12-17 at 14:38 +0200, Pekka Enberg wrote: > On 12/17/05, Roland Dreier wrote: > > +#define TRUE 1 > > +#define FALSE 0 > > Please kill these. OK. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. From ebiederm at xmission.com Sat Dec 17 13:51:19 2005 From: ebiederm at xmission.com (Eric W. 
Biederman) Date: Sat, 17 Dec 2005 14:51:19 -0700 Subject: [openib-general] Re: [PATCH 01/13] [RFC] ipath basic headers In-Reply-To: <20051217131456.GA13043@infradead.org> (Christoph Hellwig's message of "Sat, 17 Dec 2005 13:14:56 +0000") References: <200512161548.jRuyTS0HPMLd7V81@cisco.com> <200512161548.aLjaDpGm5aqk0k0p@cisco.com> <20051217131456.GA13043@infradead.org> Message-ID: Christoph Hellwig writes: > please always used fixes-size types for user communication. also please > avoid ioctls like the rest of the IB codebase. Could someone please explain to me how the uverbs abuse of write is better that ioctl? Every single command seems to have a __u64 response fields that is a pointer into user space. When you write your commands and read your responses like the netlink layer does I can see the sense of it. But making write an ioctl by another name... One of the scarier comments I have seen lately from ib_user_verbs.h /* * Make sure that all structs defined in this file remain laid out so * that they pack the same way on 32-bit and 64-bit architectures (to * avoid incompatibility between 32-bit userspace and 64-bit kernels). * Specifically: * - Do not use pointer types -- pass pointers in __u64 instead. * - Make sure that any structure larger than 4 bytes is padded to a * multiple of 8 bytes. Otherwise the structure size will be * different between 32-bit and 64-bit architectures. */ The two points that get called out. - Embedded pointers are a large part of what make ioctl a maintenance nightmare. I admit we are 15-20 years away before big machines exhaust the capability of 64bit pointers so we aren't likely to run into size issues soon. But a write that changes your address space is ugly, and unexpected. What looks like a reimplementation of readv/writev using this technique is also scary. - 64bit compilers will not pad every structure to 8 bytes. This only will happen if you happen to have an 8 byte element in your structure that is only aligned to 32bits by a 32bit structure. Unfortunately the 32bit gcc only aligns long long to 32bits on x86, which triggers the described behavior. Eric From bunk at stusta.de Sat Dec 17 13:52:51 2005 From: bunk at stusta.de (Adrian Bunk) Date: Sat, 17 Dec 2005 22:52:51 +0100 Subject: [openib-general] Re: [PATCH 13/13] [RFC] ipath Kconfig and Makefile In-Reply-To: <200512161548.lokgvLraSGi0enUH@cisco.com> References: <200512161548.MdcxE8ZQTy1yj4v1@cisco.com> <200512161548.lokgvLraSGi0enUH@cisco.com> Message-ID: <20051217215251.GV23349@stusta.de> On Fri, Dec 16, 2005 at 03:48:55PM -0800, Roland Dreier wrote: >... > --- /dev/null > +++ b/drivers/infiniband/hw/ipath/Kconfig > @@ -0,0 +1,18 @@ > +config IPATH_CORE > + tristate "PathScale InfiniPath Driver" > + depends on PCI_MSI && X86_64 >... The driver shouldn't use assembler code and therefore no longer depend on X86_64. > --- /dev/null > +++ b/drivers/infiniband/hw/ipath/Makefile > @@ -0,0 +1,15 @@ > +EXTRA_CFLAGS += -Idrivers/infiniband/include > + > +EXTRA_CFLAGS += -Wall -O3 -g3 -Wall is always set when compiling the kernel. -O3 doesn't make much sense since the fight for producing the fastest code is between -O2 and -Os. You don't want to always compile your driver with -g3. > +_ipath_idstr:="$$""Id: kernel.org InfiniPath Release 1.1 $$"" $$""Date: $(shell date +%F-%R)"" $$" > +EXTRA_CFLAGS += -D_IPATH_IDSTR='$(_ipath_idstr)' -DIPATH_KERN_TYPE=0 >... Please move the _IPATH_IDSTR revision tag to a header file and remove IPATH_KERN_TYPE. cu Adrian -- "Is there not promise of rain?" 
Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. "Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

From rjwalsh at pathscale.com Sat Dec 17 13:55:47 2005
From: rjwalsh at pathscale.com (Robert Walsh)
Date: Sat, 17 Dec 2005 13:55:47 -0800
Subject: [openib-general] Re: [PATCH 01/13] [RFC] ipath basic headers
In-Reply-To: <84144f020512170433h151a7667o42c382242f81347b@mail.gmail.com>
References: <200512161548.jRuyTS0HPMLd7V81@cisco.com> <200512161548.aLjaDpGm5aqk0k0p@cisco.com> <84144f020512170433h151a7667o42c382242f81347b@mail.gmail.com>
Message-ID: <1134856547.20575.39.camel@phosphene.durables.org>

> Do we really need this ugly userspace emulation code in the kernel?

Probably not. I'll look into removing it.

> What is this used for? Why can't you use memcpy?

Our chip can only handle double-word copies.

> > +#define round_up(v,sz) (((v) + (sz)-1) & ~((sz)-1))
>
> Please use ALIGN().

Fair enough.

> You seem to be introducing loads of new ioctls. Any reason you can't
> use sysfs and/or configfs?

I'll see what people here think of that idea.

> > +#define ips_get_ipath_ver(ipath_header) (((ipath_header) >> INFINIPATH_I_VERS_SHIFT) \
> > + & INFINIPATH_I_VERS_MASK)
>
> Please use static inlines instead for readability.

OK.

Regards,
Robert.

--
Robert Walsh Email: rjwalsh at pathscale.com
PathScale, Inc. Phone: +1 650 934 8117
2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969
Mountain View, CA 94043.

From rjwalsh at pathscale.com Sat Dec 17 14:19:13 2005
From: rjwalsh at pathscale.com (Robert Walsh)
Date: Sat, 17 Dec 2005 14:19:13 -0800
Subject: [openib-general] Re: [PATCH 01/13] [RFC] ipath basic headers
In-Reply-To: <20051217131456.GA13043@infradead.org>
References: <200512161548.jRuyTS0HPMLd7V81@cisco.com> <200512161548.aLjaDpGm5aqk0k0p@cisco.com> <20051217131456.GA13043@infradead.org>
Message-ID: <1134857953.20575.59.camel@phosphene.durables.org>

> > + * $Id: ipath_common.h 4491 2005-12-15 22:20:31Z rjwalsh $
>
> please remove RCSIDs everywhere.

These are everywhere in the OpenIB code. I was actually asked by one of the OpenIB developers to include them. I'm happy to remove them again, but what do the OpenIB folks think?

> > +#define yield() sched_yield()
>
> Please push this out. It's fine if they reuse kernel-code in userspace
> this way, but please move the compat wrappers to a separate file that's
> not in the kernel tree.

I will do this.

> > +typedef uint8_t ipath_type;
>
> totally meaningless typedef

In what way?

> > +#ifndef _BITS_PER_BYTE
> > +#define _BITS_PER_BYTE 8
> > +#endif
>
> WTF?

Hmm. That is odd. I'll ask the folks here if we can remove this.

> in kernel land __inline__ includes always_inline. Also no need for
> a separate prototype for an inline function that immediately follows.

Fine.

> > +{
> > + void *ssv, *dsv;
> > + uint32_t csv;
> > + __asm__ __volatile__("cld\n\trep\n\tmovsb":"=&c"(csv), "=&D"(dsv),
> > + "=&S"(ssv)
> > + :"0"(cnt), "1"(dest), "2"(src)
> > + :"memory");
> > +}
>
> No way we're gonna put assembler code into such a driver.

Why not? The chip (and therefore the driver) only works with Opterons. It's tied to the HT bus, not PCI or anything like that.

> > +struct ipath_int_vec {
> > + int long long addr;
> > + uint32_t info;
> > +};
>
> please always used fixes-size types for user communication.

OK.

> also please
> avoid ioctls like the rest of the IB codebase.

More complex, but I'll look into it.

> > +/* Similarly, this is the kernel version going back to the user.
It's slightly > > + * different, in that we want to tell if the driver was built as part of a > > + * PathScale release, or from the driver from the OpenIB, kernel.org, or a > > + * standard distribution, for support reasons. The high bit is 0 for > > + * non-PathScale, and 1 for PathScale-built/supplied. That bit is defined > > + * in Makefiles, rather than this file. > > + * > > + * It's returned by the driver to the user code during initialization > > + * in the spi_sw_version field of ipath_base_info, so the user code can > > + * in turn check for compatibility with the kernel. > > +*/ > > +#define IPATH_KERN_SWVERSION ((IPATH_KERN_TYPE<<31) | IPATH_USER_SWVERSION) > > NACK, there's no way we're gonna put in a way to identify an "official" > version. The official version is the last one in mainline always. Why make this hard for vendors? You may only care about the latest mainline, but if we want to sell chips, we have to support this all the way back to 2.6.9 (RHEL). > > +#ifndef PCI_VENDOR_ID_PATHSCALE /* not in pci.ids yet */ > > +#define PCI_VENDOR_ID_PATHSCALE 0x1fc1 > > +#define PCI_DEVICE_ID_PATHSCALE_INFINIPATH1 0xa > > +#define PCI_DEVICE_ID_PATHSCALE_INFINIPATH2 0xd > > +#endif > > so move it there? Sounds like a good idea. I'll submit a separate patch. > > +typedef struct _ipath_portdata { > > please avoid typedefs for struct types. I thought I had, but I must have missed this one. > > +/* > > + * these should be somewhat dynamic someday, although they are fixed > > + * for all users of the device on any given load. > > + * > > + * NOTE: There is a VM bug in the 2.4 Kernels similar to the one Dave > > + * fixed in the 2.6 Kernel. When using large or discontinuous memory, > > + * we get random kernel oops. So, in 2.4, we are just going to stick > > + * with 4k chunks instead of 64k chunks. > > + */ > > No one cares about 2.4 kernels here. Fine. > > + * these function similarly to the mlock/munlock system calls. > > + * ipath_mlock() is used to pin an address range (if not already pinned), > > + * and optionally return the list of physical addresses > > + * ipath_munlock() does the obvious, and ipath_mlock() cleans up all > > + * private memory, used at driver unload. > > + * ipath_mlock_nocopy() is similar to mlock, but only one page, and marks > > + * the vm so the page isn't taken away on a fork. > > + */ > > +int ipath_mlock(unsigned long, size_t, struct page **); > > +int ipath_mlock_nocopy(unsigned long, struct page **); > > this kind of thing definitly doesn't belong into an LLDD. or maybe > it's just stale prototypes? No - they're used. Why do you say they don't belong? > > +#ifdef IPATH_COSIM > > +extern __u32 sim_readl(const volatile void __iomem * addr); > > +extern __u64 sim_readq(const volatile void __iomem * addr); > > +extern void sim_writel(__u32 val, volatile void __iomem * addr); > > +extern void sim_writeq(__u64 val, volatile void __iomem * addr); > > +#define ipath_readl(addr) sim_readl(addr) > > +#define ipath_readq(addr) sim_readq(addr) > > +#define ipath_writel(val, addr) sim_writel(val, addr) > > +#define ipath_writeq(val, addr) sim_writeq(val, addr) > > +#else > > +#define ipath_readl(addr) readl(addr) > > +#define ipath_readq(addr) readq(addr) > > +#define ipath_writel(val, addr) writel(val, addr) > > +#define ipath_writeq(val, addr) writeq(val, addr) > > +#endif > > Please use the proper functions directly. Your simulator can override > them if nessecary. Fine. 
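As an aside on the pci.ids point above: once the constants move to pci_ids.h, the driver's match table would presumably reference them along these lines. This is a sketch; the table name is illustrative.

    static const struct pci_device_id ipath_pci_tbl[] = {
            /* IDs quoted above: vendor 0x1fc1, devices 0xa and 0xd */
            { PCI_DEVICE(PCI_VENDOR_ID_PATHSCALE,
                         PCI_DEVICE_ID_PATHSCALE_INFINIPATH1) },
            { PCI_DEVICE(PCI_VENDOR_ID_PATHSCALE,
                         PCI_DEVICE_ID_PATHSCALE_INFINIPATH2) },
            { 0, }
    };
    MODULE_DEVICE_TABLE(pci, ipath_pci_tbl);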
> > +static __inline__ uint32_t ipath_kget_kreg32(const ipath_type stype, > > + ipath_kreg regno) > > +{ > > + volatile uint32_t *kreg32; > > + > > + if (!devdata[stype].ipath_kregbase) > > + return ~0; > > + > > + kreg32 = (volatile uint32_t *)&devdata[stype].ipath_kregbase[regno]; > > volatile use is probably always wrong. but this whole functions looks like > a very odd wrapper anyway? The volatile is there so the compiler doesn't optimize away the read. This is important, because reads of our hardware have side-effects and cannot be optimized out. Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. From arjan at infradead.org Sat Dec 17 14:25:23 2005 From: arjan at infradead.org (Arjan van de Ven) Date: Sat, 17 Dec 2005 23:25:23 +0100 Subject: [openib-general] Re: [PATCH 01/13] [RFC] ipath basic headers In-Reply-To: <1134857953.20575.59.camel@phosphene.durables.org> References: <200512161548.jRuyTS0HPMLd7V81@cisco.com> <200512161548.aLjaDpGm5aqk0k0p@cisco.com> <20051217131456.GA13043@infradead.org> <1134857953.20575.59.camel@phosphene.durables.org> Message-ID: <1134858323.2997.11.camel@laptopd505.fenrus.org> On Sat, 2005-12-17 at 14:19 -0800, Robert Walsh wrote: > > > +{ > > > + void *ssv, *dsv; > > > + uint32_t csv; > > > + __asm__ __volatile__("cld\n\trep\n\tmovsb":"=&c"(csv), "=&D"(dsv), > > > + "=&S"(ssv) > > > + :"0"(cnt), "1"(dest), "2"(src) > > > + :"memory"); > > > +} > > > > No way we're gonna put assembler code into such a driver. > > Why not? The chip (and therefore the driver) only works with Opterons. > It's tied to the HT bus, but PCI or anything like that. and opterons can already run 2 architectures. And the HT bus is a generic bus.. with public specs. Others than just AMD use it as well. also.. what is wrong with memcpy and co ? > > > +static __inline__ uint32_t ipath_kget_kreg32(const ipath_type stype, > > > + ipath_kreg regno) > > > +{ > > > + volatile uint32_t *kreg32; > > > + > > > + if (!devdata[stype].ipath_kregbase) > > > + return ~0; > > > + > > > + kreg32 = (volatile uint32_t *)&devdata[stype].ipath_kregbase[regno]; > > > > volatile use is probably always wrong. but this whole functions looks like > > a very odd wrapper anyway? > > The volatile is there so the compiler doesn't optimize away the read. > This is important, because reads of our hardware have side-effects and > cannot be optimized out. then you need to use readl() and family most like; they already take care of this anyway. From rjwalsh at pathscale.com Sat Dec 17 14:39:18 2005 From: rjwalsh at pathscale.com (Robert Walsh) Date: Sat, 17 Dec 2005 14:39:18 -0800 Subject: [openib-general] Re: [PATCH 01/13] [RFC] ipath basic headers In-Reply-To: <20051217123827.32f119da.akpm@osdl.org> References: <200512161548.jRuyTS0HPMLd7V81@cisco.com> <200512161548.aLjaDpGm5aqk0k0p@cisco.com> <20051217123827.32f119da.akpm@osdl.org> Message-ID: <1134859158.20575.82.camel@phosphene.durables.org> > > +#define yield() sched_yield() > > Some might get upset about what I assume is userspace test harness code or > what _is_ this doing?) in a driver. But if the maintainers find it useful > we can live with it, That is cosimulator code. It's easy enough to remove. I'll look into it. > > +#ifndef _BITS_PER_BYTE > > +#define _BITS_PER_BYTE 8 > > +#endif > > I'd be inclined to stick BITS_PER_BYTE into include/linux/types.h. Really? 
I was just going to suggest removing it, but if sticking it in types.h works for you, then fine. > > +static __inline__ void ipath_shortcopy(void *dest, void *src, uint32_t cnt) > > + __attribute__ ((always_inline)); > > s/__inline__/inline/ throughout. OK. > > +#define round_up(v,sz) (((v) + (sz)-1) & ~((sz)-1)) > > We have ALIGN() Yup. > > +struct ipath_int_vec { > > + int long long addr; > > long long OK. > > +#define IPATH_USERINIT _IOW('s', 16, struct ipath_user_info) > > +/* init; kernel/chip params to user */ > > +#define IPATH_BASEINFO _IOR('s', 17, struct ipath_base_info) > > +/* send a packet */ > > +#define IPATH_SENDPKT _IOW('s', 18, struct ipath_sendpkt) > > uh-oh. ioctls. Do we have compat conversions for them all, if needed? For those that are needed, I believe we covered them all. Some have suggested removing ioctls. I'm willing to look into alternatives, but if you think they're OK, I'd rather leave them. > > +/* > > + * A segment is a linear region of low physical memory. > > + * XXX Maybe we should use phys addr here and kmap()/kunmap() > > + * Used by the verbs layer. > > + */ > > +struct ipath_seg { > > + void *vaddr; > > + u64 length; > > +}; > > Suggest `long' for the length. We don't need 64 bits on 32-bit machines. OK. > > +struct ipath_mregion { > > + u64 user_base; /* User's address for this region */ > > void *. > > > + u64 iova; /* IB start address of this region */ > > Maybe here too. OK. > > +int ipath_mlock(unsigned long, size_t, struct page **); > > Sometimes it does `int foo()' and sometimes `extern int foo()'. I tend to > think the `extern' is a waste of space. Yup. > > +#define ipath_func_krecord(a) > > +#define ipath_func_urecord(a, b) > > +#define ipath_func_mrecord(a, b) > > +#define ipath_func_rkrecord(a) > > +#define ipath_func_rurecord(a, b) > > +#define ipath_func_rmrecord(a, b) > > +#define ipath_func_rsrecord(a) > > +#define ipath_func_rcrecord(a) > > What are all these doing? Might need do{}while(0) for safety. I'll look at cleaning them out. Probably left-overs from some earlier experiment. > > +#ifdef IPATH_COSIM > > +extern __u32 sim_readl(const volatile void __iomem * addr); > > +extern __u64 sim_readq(const volatile void __iomem * addr); > > The driver has a strange mixture of int32_t, s32 and __s32. s32 is > preferred. Yea - I'll clean that up. > > + */ > > +static __inline__ uint32_t ipath_kget_ureg32(const ipath_type stype, > > + ipath_ureg regno, int port) > > +{ > > + uint64_t *ubase; > > + > > + ubase = (uint64_t *) (devdata[stype].ipath_uregbase > > + + (char *)devdata[stype].ipath_kregbase > > + + devdata[stype].ipath_palign * port); > > + return ubase ? ipath_readl(ubase + regno) : 0; > > +} > > Are all these u64's needed on 32-bit? Don't know - I'll ask around. We don't support the hardware in 32-bit anyway, so... > > +static __inline__ uint64_t ipath_kget_kreg64(const ipath_type stype, > > + ipath_kreg regno) > > +{ > > + if (!devdata[stype].ipath_kregbase) > > + return ~0ULL; > > We don't know that the architecture implements u64 as unsigned long long. > Some use unsigned long. Best way of implmenting the all-ones pattern is > just `-1'. OK. > Gee. Big driver. Tell me about it :-) Basically, we're doing infiniband in software: no offload. Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. 
From rjwalsh at pathscale.com Sat Dec 17 14:40:43 2005 From: rjwalsh at pathscale.com (Robert Walsh) Date: Sat, 17 Dec 2005 14:40:43 -0800 Subject: [openib-general] Re: [PATCH 03/13] [RFC] ipath copy routines In-Reply-To: <20051217123833.1aa430ab.akpm@osdl.org> References: <200512161548.HbgfRzF2TysjsR2G@cisco.com> <200512161548.lRw6KI369ooIXS9o@cisco.com> <20051217123833.1aa430ab.akpm@osdl.org> Message-ID: <1134859243.20575.84.camel@phosphene.durables.org> On Sat, 2005-12-17 at 12:38 -0800, Andrew Morton wrote: > Roland Dreier wrote: > > > > + .globl ipath_dwordcpy > > +/* rdi destination, rsi source, rdx count */ > > +ipath_dwordcpy: > > + movl %edx,%ecx > > + shrl $1,%ecx > > + andl $1,%edx > > + cld > > + rep > > + movsq > > + movl %edx,%ecx > > + rep > > + movsd > > + ret > > err, we have a portability problem. Any chance we could get these moved into the x86_64 arch directory, then? We have to do double-word copies, or our chip gets unhappy. Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. From rjwalsh at pathscale.com Sat Dec 17 14:47:12 2005 From: rjwalsh at pathscale.com (Robert Walsh) Date: Sat, 17 Dec 2005 14:47:12 -0800 Subject: [openib-general] Re: [PATCH 01/13] [RFC] ipath basic headers In-Reply-To: <1134858323.2997.11.camel@laptopd505.fenrus.org> References: <200512161548.jRuyTS0HPMLd7V81@cisco.com> <200512161548.aLjaDpGm5aqk0k0p@cisco.com> <20051217131456.GA13043@infradead.org> <1134857953.20575.59.camel@phosphene.durables.org> <1134858323.2997.11.camel@laptopd505.fenrus.org> Message-ID: <1134859632.20575.92.camel@phosphene.durables.org> > and opterons can already run 2 architectures. And the HT bus is a > generic bus.. with public specs. Others than just AMD use it as well. > > also.. what is wrong with memcpy and co ? Our chips can only handle double-word writes. memcpy isn't guaranteed to do this. > then you need to use readl() and family most like; they already take > care of this anyway. Oh, OK then. Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. From rjwalsh at pathscale.com Sat Dec 17 14:54:44 2005 From: rjwalsh at pathscale.com (Robert Walsh) Date: Sat, 17 Dec 2005 14:54:44 -0800 Subject: [openib-general] Re: [PATCH 13/13] [RFC] ipath Kconfig and Makefile In-Reply-To: <20051217215251.GV23349@stusta.de> References: <200512161548.MdcxE8ZQTy1yj4v1@cisco.com> <200512161548.lokgvLraSGi0enUH@cisco.com> <20051217215251.GV23349@stusta.de> Message-ID: <1134860084.20575.101.camel@phosphene.durables.org> > The driver shouldn't use assembler code and therefore no longer depend > on X86_64. Agreed about the assembler, but one way or the other, x86_64 is the only arch we support. > -Wall is always set when compiling the kernel. Fine. > -O3 doesn't make much sense since the fight for producing the fastest > code is between -O2 and -Os. Makes many nanoseconds of difference to us for our latency numbers. At the low latency numbers we measuring (1.29us), this is a very important difference to our customers. > You don't want to always compile your driver with -g3. Good point. I'll ask around here why we're doing this. > > +_ipath_idstr:="$$""Id: kernel.org InfiniPath Release 1.1 $$"" $$""Date: $(shell date +%F-%R)"" $$" > > +EXTRA_CFLAGS += -D_IPATH_IDSTR='$(_ipath_idstr)' -DIPATH_KERN_TYPE=0 > >... 
> > Please move the _IPATH_IDSTR revision tag to a header file and remove > IPATH_KERN_TYPE. I'll see what I can do. Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. From bunk at stusta.de Sat Dec 17 15:55:54 2005 From: bunk at stusta.de (Adrian Bunk) Date: Sun, 18 Dec 2005 00:55:54 +0100 Subject: [openib-general] Re: [PATCH 13/13] [RFC] ipath Kconfig and Makefile In-Reply-To: <1134860084.20575.101.camel@phosphene.durables.org> References: <200512161548.MdcxE8ZQTy1yj4v1@cisco.com> <200512161548.lokgvLraSGi0enUH@cisco.com> <20051217215251.GV23349@stusta.de> <1134860084.20575.101.camel@phosphene.durables.org> Message-ID: <20051217235554.GW23349@stusta.de> On Sat, Dec 17, 2005 at 02:54:44PM -0800, Robert Walsh wrote: > > The driver shouldn't use assembler code and therefore no longer depend > > on X86_64. > > Agreed about the assembler, but one way or the other, x86_64 is the only > arch we support. >... There's a difference between "technically supported by the driver" and "officially supported for our customers": It's fine if you tell the customers buying your hardware "anything else than 64bit x86_64 kernels is completely unsupported", but for getting your driver included into the kernel it should be 32bit clean [1] and should also work for people using 32bit kernels on an Opteron. > > -O3 doesn't make much sense since the fight for producing the fastest > > code is between -O2 and -Os. > > Makes many nanoseconds of difference to us for our latency numbers. At > the low latency numbers we're measuring (1.29us), this is a very important > difference to our customers. >... There's no doubt that this is important for your customers. What surprises me is that -O3 turned out to be the fastest flag for you. Can you send numbers comparing -Os/-O2/-O3 (without -g3, preferably with gcc 4.0) including a description what and how you are measuring? > Regards, > Robert. cu Adrian [1] not long ago, it used to be the other way round that drivers weren't 64bit clean... -- "Is there not promise of rain?" Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. "Only a promise," Lao Er said. Pearl S. Buck - Dragon Seed From alan at lxorguk.ukuu.org.uk Sat Dec 17 16:27:03 2005 From: alan at lxorguk.ukuu.org.uk (Alan Cox) Date: Sun, 18 Dec 2005 00:27:03 +0000 Subject: [openib-general] Re: [PATCH 13/13] [RFC] ipath Kconfig and Makefile In-Reply-To: <1134860084.20575.101.camel@phosphene.durables.org> References: <200512161548.MdcxE8ZQTy1yj4v1@cisco.com> <200512161548.lokgvLraSGi0enUH@cisco.com> <20051217215251.GV23349@stusta.de> <1134860084.20575.101.camel@phosphene.durables.org> Message-ID: <1134865624.11953.58.camel@localhost.localdomain> On Sad, 2005-12-17 at 14:54 -0800, Robert Walsh wrote: > Agreed about the assembler, but one way or the other, x86_64 is the only > arch we support. If you need a quad only copy then put it into asm/string.h (asm/io.h if its operating on I/O space I guess) or somewhere similar as a generic function that does just that. That allows people to come along and provide the same functions for other platforms if they need it, and also makes it possible for others to use this feature if their hardware has the same feature. Nobody expects you as a vendor to support it on sparc64, x86-32 or whatever, nor to write sparc64 asm functions just to make it possible for someone to do so cleanly later on.
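As a concrete sketch of the generic function Alan describes (and of the portable C fallback David Miller asks for later in this thread): the function name here is hypothetical, and Alan's caveat applies -- the compiler is technically free to split the 64-bit stores on some targets, so an architecture wanting a hard guarantee would still override this with assembler.

    /* Copy ndwords 32-bit words: 64-bit stores for the bulk, one
     * 32-bit store for a trailing odd word.  Assumes 8-byte-aligned
     * pointers so the 64-bit stores are natural, single stores. */
    static void ipath_dwordcpy_c(u32 *dst, const u32 *src,
                                 unsigned long ndwords)
    {
            u64 *d64 = (u64 *) dst;
            const u64 *s64 = (const u64 *) src;
            unsigned long n = ndwords >> 1;

            while (n--)
                    *d64++ = *s64++;
            if (ndwords & 1)
                    dst[ndwords - 1] = src[ndwords - 1];
    }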
From rjwalsh at pathscale.com Sat Dec 17 17:17:04 2005 From: rjwalsh at pathscale.com (Robert Walsh) Date: Sat, 17 Dec 2005 17:17:04 -0800 Subject: [openib-general] Re: [PATCH 13/13] [RFC] ipath Kconfig and Makefile In-Reply-To: <20051217235554.GW23349@stusta.de> References: <200512161548.MdcxE8ZQTy1yj4v1@cisco.com> <200512161548.lokgvLraSGi0enUH@cisco.com> <20051217215251.GV23349@stusta.de> <1134860084.20575.101.camel@phosphene.durables.org> <20051217235554.GW23349@stusta.de> Message-ID: <1134868624.20575.106.camel@phosphene.durables.org> > There's a difference between "technically supported by the driver" and > "officially supported for our customers": > > It's fine if you tell the customers buying your hardware "anything else > than 64bit x86_64 kernels is completely unsupported", but for getting > your driver included into the kernel it should be 32bit clean [1] and > should also work for people using 32bit kernels on an Opteron. Fair enough - I'll see what I can do. > What surprises me is that -O3 turned out to be the fastest flag for you. > > Can you send numbers comparing -Os/-O2/-O3 (without -g3, preferably with > gcc 4.0) including a description what and how you are measuring? I'll try to get around to this after my vacation and after we've had time to absorb and address all the other feedback we received. Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. From akpm at osdl.org Sat Dec 17 19:10:07 2005 From: akpm at osdl.org (Andrew Morton) Date: Sat, 17 Dec 2005 19:10:07 -0800 Subject: [openib-general] Re: [PATCH 07/13] [RFC] ipath core misc files In-Reply-To: <1134855235.20575.22.camel@phosphene.durables.org> References: <200512161548.KglSM2YESlGlEQfQ@cisco.com> <200512161548.3fqe3fMerrheBMdX@cisco.com> <20051217123850.aa6cfd53.akpm@osdl.org> <1134855235.20575.22.camel@phosphene.durables.org> Message-ID: <20051217191007.a77d23af.akpm@osdl.org> Robert Walsh wrote: > > > > +int ipath_mlock(unsigned long start_page, size_t num_pages, struct page **p) > > OK. It's perhaps not a very well named function. > > Really? Suggestion for a better name? > ipath_get_user_pages() would cause the least surprise. > > > + } > > > + vm->vm_flags |= VM_SHM | VM_LOCKED; > > > + > > > + return 0; > > > +} > > I don't think we want to be setting the user's VMA's vm_flags in this > manner. This is purely to retain the physical page across fork? > > I didn't write this bit of the driver, but I believe this is the case. > Is there a better way of doing this? This stuff has been churning a bit lately. I've drawn Hugh Dickins's attention to the patch - he'd have a better handle on what the best approach would be. From rjwalsh at pathscale.com Sat Dec 17 19:13:09 2005 From: rjwalsh at pathscale.com (Robert Walsh) Date: Sat, 17 Dec 2005 19:13:09 -0800 Subject: [openib-general] Re: [PATCH 07/13] [RFC] ipath core misc files In-Reply-To: <20051217191007.a77d23af.akpm@osdl.org> References: <200512161548.KglSM2YESlGlEQfQ@cisco.com> <200512161548.3fqe3fMerrheBMdX@cisco.com> <20051217123850.aa6cfd53.akpm@osdl.org> <1134855235.20575.22.camel@phosphene.durables.org> <20051217191007.a77d23af.akpm@osdl.org> Message-ID: <1134875589.20575.122.camel@phosphene.durables.org> On Sat, 2005-12-17 at 19:10 -0800, Andrew Morton wrote: > Robert Walsh wrote: > > > > > > +int ipath_mlock(unsigned long start_page, size_t num_pages, struct page **p) > > > OK. It's perhaps not a very well named function.
> > > > Really? Suggestion for a better name? > > > > ipath_get_user_pages() would cause the least surprise. Seems reasonable. I'll look at the related functions, too. > > > I don't think we want to be setting the user's VMA's vm_flags in this > > > manner. This is purely to retain the physical page across fork? > > > > I didn't write this bit of the driver, but I believe this is the case. > > Is there a better way of doing this? > > This stuff has been churning a bit lately. I've drawn Hugh Dickins's > attention to the patch - he'd have a better handle on what the best > approach would be. OK then - I'll wait and see. Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. From akpm at osdl.org Sat Dec 17 19:14:17 2005 From: akpm at osdl.org (Andrew Morton) Date: Sat, 17 Dec 2005 19:14:17 -0800 Subject: [openib-general] Re: [PATCH 01/13] [RFC] ipath basic headers In-Reply-To: <1134859158.20575.82.camel@phosphene.durables.org> References: <200512161548.jRuyTS0HPMLd7V81@cisco.com> <200512161548.aLjaDpGm5aqk0k0p@cisco.com> <20051217123827.32f119da.akpm@osdl.org> <1134859158.20575.82.camel@phosphene.durables.org> Message-ID: <20051217191417.f16011bb.akpm@osdl.org> Robert Walsh wrote: > > > I'd be inclined to stick BITS_PER_BYTE into include/linux/types.h. > > Really? I was just going to suggest removing it, but if sticking it in > types.h works for you, then fine. > I think it's a readability thing. x += 8; /* wtf? */ vs x += BITS_PER_BYTE; /* ah! */ From akpm at osdl.org Sat Dec 17 19:19:32 2005 From: akpm at osdl.org (Andrew Morton) Date: Sat, 17 Dec 2005 19:19:32 -0800 Subject: [openib-general] Re: [PATCH 03/13] [RFC] ipath copy routines In-Reply-To: <1134859243.20575.84.camel@phosphene.durables.org> References: <200512161548.HbgfRzF2TysjsR2G@cisco.com> <200512161548.lRw6KI369ooIXS9o@cisco.com> <20051217123833.1aa430ab.akpm@osdl.org> <1134859243.20575.84.camel@phosphene.durables.org> Message-ID: <20051217191932.af2b422c.akpm@osdl.org> Robert Walsh wrote: > > > > + movl %edx,%ecx > > > + shrl $1,%ecx > > > + andl $1,%edx > > > + cld > > > + rep > > > + movsq > > > + movl %edx,%ecx > > > + rep > > > + movsd > > > + ret > > > > err, we have a portability problem. > > Any chance we could get these moved into the x86_64 arch directory, > then? That would make sense. Give it a non-ipath-related name and require that all architectures which wish to run this driver must implement that (documented) function. And, in Kconfig, make sure that architectures which don't implement that library function do not attempt to build this driver. To avoid breaking `make allmodconfig'. > We have to do double-word copies, or our chip gets unhappy. In what form is this chip available? As a standard PCI/PCIX card which people will want to plug into power4/ia64/x86 machines? Or is it in some way exclusively tied to x86_64? From ak at suse.de Sat Dec 17 19:25:27 2005 From: ak at suse.de (Andi Kleen) Date: 18 Dec 2005 04:25:27 +0100 Subject: [openib-general] Re: [PATCH 01/13] [RFC] ipath basic headers In-Reply-To: References: <200512161548.jRuyTS0HPMLd7V81@cisco.com> <200512161548.aLjaDpGm5aqk0k0p@cisco.com> <20051217131456.GA13043@infradead.org> Message-ID: ebiederm at xmission.com (Eric W. Biederman) writes: > Christoph Hellwig writes: > > > please always use fixed-size types for user communication. also please > > avoid ioctls like the rest of the IB codebase.
> > Could someone please explain to me how the uverbs abuse of write > is better than ioctl? It's actually worse because if they have a 32bit compat issue then ioctl can be fixed up, but read/write can't. I wish the people arguing against ioctl all the time would just stop that because the alternatives are usually worse. > - 64bit compilers will not pad every structure to 8 bytes. This > only will happen if you happen to have an 8 byte element in your > structure that is only aligned to 32bits by a 32bit structure. > Unfortunately the 32bit gcc only aligns long long to 32bits on > x86, which triggers the described behavior. Exactly - and driver writers usually don't get that right so we need to have a tool to fix it up in the end. And with ioctl that's easiest. -Andi From ak at suse.de Sat Dec 17 19:27:06 2005 From: ak at suse.de (Andi Kleen) Date: 18 Dec 2005 04:27:06 +0100 Subject: [openib-general] Re: [PATCH 03/13] [RFC] ipath copy routines In-Reply-To: <1134859243.20575.84.camel@phosphene.durables.org> References: <200512161548.HbgfRzF2TysjsR2G@cisco.com> <200512161548.lRw6KI369ooIXS9o@cisco.com> <20051217123833.1aa430ab.akpm@osdl.org> <1134859243.20575.84.camel@phosphene.durables.org> Message-ID: Robert Walsh writes: > > Any chance we could get these moved into the x86_64 arch directory, > then? We have to do double-word copies, or our chip gets unhappy. Standard memcpy will do double word copies if everything is suitably aligned. Just use that. -Andi From bunk at stusta.de Sat Dec 17 19:35:17 2005 From: bunk at stusta.de (Adrian Bunk) Date: Sun, 18 Dec 2005 04:35:17 +0100 Subject: [openib-general] Re: [PATCH 03/13] [RFC] ipath copy routines In-Reply-To: <20051217191932.af2b422c.akpm@osdl.org> References: <200512161548.HbgfRzF2TysjsR2G@cisco.com> <200512161548.lRw6KI369ooIXS9o@cisco.com> <20051217123833.1aa430ab.akpm@osdl.org> <1134859243.20575.84.camel@phosphene.durables.org> <20051217191932.af2b422c.akpm@osdl.org> Message-ID: <20051218033517.GY23349@stusta.de> On Sat, Dec 17, 2005 at 07:19:32PM -0800, Andrew Morton wrote: >... > In what form is this chip available? As a standard PCI/PCIX card which > people will want to plug into power4/ia64/x86 machines? Or is it in some > way exclusively tied to x86_64? Hardware can hardly be exclusively tied to x86_64 without also being available on x86 machines since i386 kernels run on x86_64 hardware. cu Adrian -- "Is there not promise of rain?" Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. "Only a promise," Lao Er said. Pearl S. Buck - Dragon Seed From rjwalsh at pathscale.com Sat Dec 17 21:33:51 2005 From: rjwalsh at pathscale.com (Robert Walsh) Date: Sat, 17 Dec 2005 21:33:51 -0800 Subject: [openib-general] Re: [PATCH 03/13] [RFC] ipath copy routines In-Reply-To: <20051217191932.af2b422c.akpm@osdl.org> References: <200512161548.HbgfRzF2TysjsR2G@cisco.com> <200512161548.lRw6KI369ooIXS9o@cisco.com> <20051217123833.1aa430ab.akpm@osdl.org> <1134859243.20575.84.camel@phosphene.durables.org> <20051217191932.af2b422c.akpm@osdl.org> Message-ID: <1134884031.20575.126.camel@phosphene.durables.org> > > Any chance we could get these moved into the x86_64 arch directory, > > then? > > That would make sense. Give it a non-ipath-related name and require that > all architectures which wish to run this driver must implement that > (documented) function. > > And, in Kconfig, make sure that architectures which don't implement that > library function do not attempt to build this driver.
To avoid breaking > `make allmodconfig'. Sounds good. I'll get something together next week. > > We have to do double-word copies, or our chip gets unhappy. > > In what form is this chip available? As a standard PCI/PCIX card which > people will want to plug into power4/ia64/x86 machines? Or is it in some > way exclusively tied to x86_64? It's a HyperTransport card, not PCI/PCIe/PCIX. It plugs into the HTX slot on a suitably-equipped motherboard. On some machines, it's available on the motherboard itself (e.g. the Linux Networx LS/X.) Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. From rjwalsh at pathscale.com Sat Dec 17 21:36:29 2005 From: rjwalsh at pathscale.com (Robert Walsh) Date: Sat, 17 Dec 2005 21:36:29 -0800 Subject: [openib-general] Re: [PATCH 03/13] [RFC] ipath copy routines In-Reply-To: References: <200512161548.HbgfRzF2TysjsR2G@cisco.com> <200512161548.lRw6KI369ooIXS9o@cisco.com> <20051217123833.1aa430ab.akpm@osdl.org> <1134859243.20575.84.camel@phosphene.durables.org> Message-ID: <1134884189.20575.129.camel@phosphene.durables.org> On Sun, 2005-12-18 at 04:27 +0100, Andi Kleen wrote: > Robert Walsh writes: > > > > Any chance we could get these moved into the x86_64 arch directory, > > then? We have to do double-word copies, or our chip gets unhappy. > > Standard memcpy will do double word copies if everything is suitably > aligned. Just use that. This is dealing with buffers that may be passed in from user space, so there's no guarantee of alignment for either the start address or the length. Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. From ak at suse.de Sat Dec 17 21:41:50 2005 From: ak at suse.de (Andi Kleen) Date: Sun, 18 Dec 2005 06:41:50 +0100 Subject: [openib-general] Re: [PATCH 03/13] [RFC] ipath copy routines In-Reply-To: <1134884189.20575.129.camel@phosphene.durables.org> References: <200512161548.HbgfRzF2TysjsR2G@cisco.com> <200512161548.lRw6KI369ooIXS9o@cisco.com> <20051217123833.1aa430ab.akpm@osdl.org> <1134859243.20575.84.camel@phosphene.durables.org> <1134884189.20575.129.camel@phosphene.durables.org> Message-ID: <20051218054149.GE23384@wotan.suse.de> On Sat, Dec 17, 2005 at 09:36:29PM -0800, Robert Walsh wrote: > On Sun, 2005-12-18 at 04:27 +0100, Andi Kleen wrote: > > Robert Walsh writes: > > > > > > Any chance we could get these moved into the x86_64 arch directory, > > > then? We have to do double-word copies, or our chip gets unhappy. > > > > Standard memcpy will do double word copies if everything is suitably > > aligned. Just use that. > > This is dealing with buffers that may be passed in from user space, so > there's no guarantee of alignment for either the start address or the > length. So how can you do double word access when the length is not a multiple of four? The current x86-64 copy_from_user will use double word access even in that case, except for the end of course. But what you're doing is so deeply unportable it's not funny. I am not sure such an unportable driver even belongs in the kernel. If the code was really intended to run on user space addresses it was totally broken btw because it didn't handle exceptions. -Andi
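Andi's two objections -- arbitrary user alignment/length and unhandled faults -- are conventionally both answered with a bounce buffer. A hypothetical sketch (illustrative names, not from the posted driver), reusing the portable copy sketched earlier in the thread:

    /* copy_from_user() copes with any user alignment and with faulting
     * addresses; the device-mandated double-word copy then runs from
     * an aligned kernel buffer. */
    static int copy_user_dwords(u32 *piobuf, const u32 __user *ubuf,
                                unsigned long ndwords)
    {
            u32 bounce[128] __attribute__((aligned(8)));

            while (ndwords) {
                    unsigned long n = min(ndwords, 128UL);

                    if (copy_from_user(bounce, ubuf, n * sizeof(u32)))
                            return -EFAULT;
                    ipath_dwordcpy_c(piobuf, bounce, n);
                    piobuf += n;
                    ubuf += n;
                    ndwords -= n;
            }
            return 0;
    }

The price is an extra copy per chunk; whether that is tolerable at the latencies discussed earlier in the thread is a separate question.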
From jackm at mellanox.co.il Sat Dec 17 23:52:24 2005 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Sun, 18 Dec 2005 09:52:24 +0200 Subject: [openib-general] [PATCH] mthca: fix WQE size calculation in create-srq Message-ID: <20051218075224.GA1741@mellanox.co.il> Thinko: 64 bytes is the minimum SRQ WQE size (not the maximum). Signed-off-by: Jack Morgenstein Index: linux-kernel/drivers/infiniband/hw/mthca/mthca_srq.c =================================================================== --- linux-kernel.orig/drivers/infiniband/hw/mthca/mthca_srq.c +++ linux-kernel/drivers/infiniband/hw/mthca/mthca_srq.c @@ -198,7 +198,7 @@ int mthca_alloc_srq(struct mthca_dev *de if (mthca_is_memfree(dev)) srq->max = roundup_pow_of_two(srq->max + 1); - ds = min(64UL, + ds = max(64UL, roundup_pow_of_two(sizeof (struct mthca_next_seg) + srq->max_gs * sizeof (struct mthca_data_seg))); srq->wqe_shift = long_log2(ds); From eitan at mellanox.co.il Sun Dec 18 00:53:30 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 18 Dec 2005 10:53:30 +0200 Subject: [openib-general] Re: A couple of questions about osm_lid_mgr.c::__osm_lid_mgr_set_physp_pi In-Reply-To: <1134766491.4338.10299.camel@hal.voltaire.com> References: <1134766491.4338.10299.camel@hal.voltaire.com> Message-ID: <43A5238A.1040702@mellanox.co.il> Hal Rosenstock wrote: > Hi, > > I have a couple of questions about > osm_lid_mgr.c::__osm_lid_mgr_set_physp_pi. > > There is the following code: > > if ( (mtu != ib_port_info_get_mtu_cap( p_old_pi )) || > (op_vls != ib_port_info_get_op_vls(p_old_pi))) > { > if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) > { > osm_log( p_mgr->p_log, OSM_LOG_DEBUG, > "__osm_lid_mgr_set_physp_pi: " > "Sending Link Down due to op_vls or mtu change. > MTU:%u,%u VL_CAP:%u,%u\n", > mtu, ib_port_info_get_mtu_cap( p_old_pi ), > op_vls, ib_port_info_get_op_vls(p_old_pi) > ); > } > ib_port_info_set_port_state( p_pi, IB_LINK_DOWN ); > > This seems a little inconsistent to me. It seems like NeighborMTU would > be the equivalent of OperationalVLs, rather than MTUCap (which is RO). Yes I think we should have checked the NeighborMTU and not the MTUCap. > Also, why does changing the MTU require that the link be taken down ? The behavior of the link when the neighbor MTU changes is not very well defined. So the best way to handle that is to force it down. > > I also noticed a nit in the same function: > > p_pi->m_key_lease_period = p_mgr->p_subn->opt.m_key_lease_period; > /* Check to see if the value we are setting is different than > the value in the port_info. If it is - turn on send_set flag */ > if (cl_memcmp( &p_pi->m_key_lease_period, > &p_old_pi->m_key_lease_period, > sizeof(p_pi->m_key_lease_period) )) > send_set = TRUE; > > Should that be only when the Mkey is non 0 ? Well, I know the lease is not relevant when MKey = 0. But for code clarity I propose to ignore that fact. The effect is only when someone sets a lease period but MKey = 0 which IMO does not make any sense anyway. > > -- Hal > From davem at davemloft.net Sun Dec 18 01:33:41 2005 From: davem at davemloft.net (David S.
Miller) Date: Sun, 18 Dec 2005 01:33:41 -0800 (PST) Subject: [openib-general] Re: [PATCH 03/13] [RFC] ipath copy routines In-Reply-To: <20051217191932.af2b422c.akpm@osdl.org> References: <20051217123833.1aa430ab.akpm@osdl.org> <1134859243.20575.84.camel@phosphene.durables.org> <20051217191932.af2b422c.akpm@osdl.org> Message-ID: <20051218.013341.34772534.davem@davemloft.net> From: Andrew Morton Date: Sat, 17 Dec 2005 19:19:32 -0800 > That would make sense. Give it a non-ipath-related name and require that > all architectures which wish to run this driver must implement that > (documented) function. > > And, in Kconfig, make sure that architectures which don't implement that > library function do not attempt to build this driver. To avoid breaking > `make allmodconfig'. How about we implement a portable version in C that you get by default if you don't implement the assembler routine? Pretty please? :-) From eitan at mellanox.co.il Sun Dec 18 03:53:24 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 18 Dec 2005 13:53:24 +0200 Subject: [openib-general] RE: [PATCH] OpenSM/ib_types.h: Modify ib_port_info_compute_rate Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3618B49@mtlexch01.mtl.com> Hi Hal, The attached patch is fine. Please go ahead and commit it. BTW: In the following commit 4509 you have changed the name of a switch info record field. Note this is an API change and have severe effect on any application using ib_types.h (and there are plenty of these) I would appreciate if you will revert this un-necessary change. Also in the future please post a patch before changing ib_types.h osm_vendor_api.h osm_vendor_sa_api.h and any of the complib H files. ------------------------------------------------------------------------ r4509 | halr | 2005-12-16 22:25:29 +0200 (Fri, 16 Dec 2005) | 5 lines In switchinfo, rename enforce_cap to partition_enf_cap to be closer to it's IBA spec name EZ Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O.
Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Thursday, December 15, 2005 4:33 PM > To: Yael Kalka > Cc: openib-general at openib.org; Eitan Zahavi > Subject: [PATCH] OpenSM/ib_types.h: Modify ib_port_info_compute_rate > > OpenSM/ib_types.h: Modify ib_port_info_compute_rate so that gcc version > 4.0.0 20050519 (Red Hat 4.0.0-8) doesn't complain when compiling > osm_sa_path_record.c as follows: > > osm_sa_path_record.c: In function '__osm_pr_rcv_get_path_parms': > osm_sa_path_record.c:194: warning: control may reach end of non-void function > 'ib_port_info_compute_rate' being inlined > > Signed-off-by: Hal Rosenstock > > Index: include/iba/ib_types.h > =================================================================== > --- include/iba/ib_types.h (revision 4479) > +++ include/iba/ib_types.h (working copy) > @@ -4290,59 +4290,76 @@ static inline uint8_t > ib_port_info_compute_rate( > IN const ib_port_info_t* const p_pi ) > { > + uint8_t rate = 0; > + > switch (ib_port_info_get_link_speed_active(p_pi)) > { > case IB_LINK_SPEED_ACTIVE_2_5: > switch (p_pi->link_width_active) > { > case IB_LINK_WIDTH_ACTIVE_1X: > - return IB_PATH_RECORD_RATE_2_5_GBS; > + rate = IB_PATH_RECORD_RATE_2_5_GBS; > + break; > > case IB_LINK_WIDTH_ACTIVE_4X: > - return IB_PATH_RECORD_RATE_10_GBS; > - > + rate = IB_PATH_RECORD_RATE_10_GBS; > + break; > + > case IB_LINK_WIDTH_ACTIVE_12X: > - return IB_PATH_RECORD_RATE_30_GBS; > - > + rate = IB_PATH_RECORD_RATE_30_GBS; > + break; > + > default: > - return IB_PATH_RECORD_RATE_2_5_GBS; > + rate = IB_PATH_RECORD_RATE_2_5_GBS; > + break; > } > break; > case IB_LINK_SPEED_ACTIVE_5: > switch (p_pi->link_width_active) > { > case IB_LINK_WIDTH_ACTIVE_1X: > - return IB_PATH_RECORD_RATE_5_GBS; > - > + rate = IB_PATH_RECORD_RATE_5_GBS; > + break; > + > case IB_LINK_WIDTH_ACTIVE_4X: > - return IB_PATH_RECORD_RATE_20_GBS; > - > + rate = IB_PATH_RECORD_RATE_20_GBS; > + break; > + > case IB_LINK_WIDTH_ACTIVE_12X: > - return IB_PATH_RECORD_RATE_60_GBS; > - > + rate = IB_PATH_RECORD_RATE_60_GBS; > + break; > + > default: > - return IB_PATH_RECORD_RATE_5_GBS; > + rate = IB_PATH_RECORD_RATE_5_GBS; > + break; > } > break; > case IB_LINK_SPEED_ACTIVE_10: > switch (p_pi->link_width_active) > { > case IB_LINK_WIDTH_ACTIVE_1X: > - return IB_PATH_RECORD_RATE_10_GBS; > - > + rate = IB_PATH_RECORD_RATE_10_GBS; > + break; > + > case IB_LINK_WIDTH_ACTIVE_4X: > - return IB_PATH_RECORD_RATE_40_GBS; > - > + rate = IB_PATH_RECORD_RATE_40_GBS; > + break; > + > case IB_LINK_WIDTH_ACTIVE_12X: > - return IB_PATH_RECORD_RATE_120_GBS; > - > + rate =IB_PATH_RECORD_RATE_120_GBS; > + break; > + > default: > - return IB_PATH_RECORD_RATE_10_GBS; > + rate = IB_PATH_RECORD_RATE_10_GBS; > + break; > } > break; > default: > - return IB_PATH_RECORD_RATE_2_5_GBS; > + rate = IB_PATH_RECORD_RATE_2_5_GBS; > + break; > } > + > + return rate; > } > /* > * PARAMETERS From mst at mellanox.co.il Sun Dec 18 04:07:25 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 18 Dec 2005 14:07:25 +0200 Subject: [openib-general] [PATCH] ipoib: mcast allocation error handling Message-ID: <20051218120725.GE4241@mellanox.co.il> Here's a patch for a potential memory leak in ipoib. BTW, given that mcast group allocations are done here with GFP_ATOMIC, dont we want to do something safer than just print a warning if they fail? --- Warn, and dont leak memory, on allocation failure for broadcast mcast group. Signed-off-by: Eli Cohen Signed-off-by: Michael S. 
Tsirkin Index: latest/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- latest.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ latest/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -795,9 +796,11 @@ void ipoib_mcast_dev_flush(struct net_de &priv->multicast_tree); list_add_tail(&priv->broadcast->list, &remove_list); - } - - priv->broadcast = nmcast; + priv->broadcast = nmcast; + } else + ipoib_warn(priv, "could not reallocate broadcast group " + IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(priv->broadcast->mcmember.mgid)); } spin_unlock_irqrestore(&priv->lock, flags); -- MST From alan at lxorguk.ukuu.org.uk Sun Dec 18 05:25:48 2005 From: alan at lxorguk.ukuu.org.uk (Alan Cox) Date: Sun, 18 Dec 2005 13:25:48 +0000 Subject: [openib-general] Re: [PATCH 03/13] [RFC] ipath copy routines In-Reply-To: References: <200512161548.HbgfRzF2TysjsR2G@cisco.com> <200512161548.lRw6KI369ooIXS9o@cisco.com> <20051217123833.1aa430ab.akpm@osdl.org> <1134859243.20575.84.camel@phosphene.durables.org> Message-ID: <1134912348.26141.19.camel@localhost.localdomain> On Sul, 2005-12-18 at 04:27 +0100, Andi Kleen wrote: > Robert Walsh writes: > > > > Any chance we could get these moved into the x86_64 arch directory, > > then? We have to do double-word copies, or our chip gets unhappy. > > Standard memcpy will do double word copies if everything is suitably > aligned. Just use that. Sorry I have to disagree with that. The standard memcpy may change in future to have different properties. If you really need specific unusual properties then you want to be sure that it is obvious they are there. I'd also like to see Dave's generic C implementation as I don't believe you can create one the compiler isn't allowed at least technically to do differently. From ebiederm at xmission.com Sun Dec 18 07:02:47 2005 From: ebiederm at xmission.com (Eric W. Biederman) Date: Sun, 18 Dec 2005 08:02:47 -0700 Subject: [openib-general] Re: [PATCH 01/13] [RFC] ipath basic headers In-Reply-To: (Andi Kleen's message of "18 Dec 2005 04:25:27 +0100") References: <200512161548.jRuyTS0HPMLd7V81@cisco.com> <200512161548.aLjaDpGm5aqk0k0p@cisco.com> <20051217131456.GA13043@infradead.org> Message-ID: Andi Kleen writes: > ebiederm at xmission.com (Eric W. Biederman) writes: > >> Christoph Hellwig writes: >> >> > please always use fixed-size types for user communication. also please >> > avoid ioctls like the rest of the IB codebase.
In this case I don't see any current problems. But I don't think this is a pattern we want to encourage, and if there is a more maintainable pattern now would be the time to fix it. Eric From halr at voltaire.com Sun Dec 18 07:48:04 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 18 Dec 2005 10:48:04 -0500 Subject: [openib-general] RE: [PATCH] OpenSM/ib_types.h: Modify ib_port_info_compute_rate In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3618B49@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3618B49@mtlexch01.mtl.com> Message-ID: <1134920884.4328.21299.camel@hal.voltaire.com> Hi Eitan, On Sun, 2005-12-18 at 06:53, Eitan Zahavi wrote: > Hi Hal, > > The attached patch is fine. Please go ahead and commit it. > > BTW: > In the following commit 4509 you have changed the name of a switch info > record field. > Note this is an API change and have severe effect on any application > using ib_types.h > (and there are plenty of these) Is this "API" frozen for all time ? How would you propose that this "API" evolve ? I do not see where there is any versioning to the API. > I would appreciate if you will revert this un-necessary change. I reverted this change. > Also in > the future please post a patch before changing ib_types.h > osm_vendor_api.h osm_vendor_sa_api.h and any of the complib H files. The previous request on this was for any non cosmetic changes. This was viewed as a cosmetic change (a simple variable name change).I'll now post any changes to these header files. -- Hal From eitan at mellanox.co.il Sun Dec 18 08:03:32 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 18 Dec 2005 18:03:32 +0200 Subject: [openib-general] RE: [PATCH] OpenSM/ib_types.h: Modify ib_port_info_compute_rate Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3618B50@mtlexch01.mtl.com> Hi Hal, Thanks for reverting the patch. Regarding changing the API, I propose we will discuss every change and try to limit them to really critical ones. I do not know how we can use versioning in Header files. Eitan Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Sunday, December 18, 2005 5:48 PM > To: Eitan Zahavi > Cc: openib-general at openib.org; Yael Kalka > Subject: RE: [PATCH] OpenSM/ib_types.h: Modify ib_port_info_compute_rate > > Hi Eitan, > > On Sun, 2005-12-18 at 06:53, Eitan Zahavi wrote: > > Hi Hal, > > > > The attached patch is fine. Please go ahead and commit it. > > > > BTW: > > In the following commit 4509 you have changed the name of a switch info > > record field. > > Note this is an API change and have severe effect on any application > > using ib_types.h > > (and there are plenty of these) > > Is this "API" frozen for all time ? How would you propose that this > "API" evolve ? I do not see where there is any versioning to the API. > > > I would appreciate if you will revert this un-necessary change. > > I reverted this change. > > > Also in > > the future please post a patch before changing ib_types.h > > osm_vendor_api.h osm_vendor_sa_api.h and any of the complib H files. > > The previous request on this was for any non cosmetic changes. This was > viewed as a cosmetic change (a simple variable name change).I'll now > post any changes to these header files. 
> > -- Hal From halr at voltaire.com Sun Dec 18 08:13:55 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 18 Dec 2005 11:13:55 -0500 Subject: [openib-general] Re: A couple of questions about osm_lid_mgr.c::__osm_lid_mgr_set_physp_pi In-Reply-To: <43A5238A.1040702@mellanox.co.il> References: <1134766491.4338.10299.camel@hal.voltaire.com> <43A5238A.1040702@mellanox.co.il> Message-ID: <1134922434.4328.21565.camel@hal.voltaire.com> On Sun, 2005-12-18 at 03:53, Eitan Zahavi wrote: [snip...] > > This seems a little inconsistent to me. It seems like NeighborMTU would > > be the equivalent of OperationalVLs, rather than MTUCap (which is RO). > Yes I think we should have checked the NeighborMTU and not the MTUCap. OK. I'll fix this. > > Also, why does changing the MTU require that the link be taken down ? > The behavior of the link when the neighbor MTU changes is not very well defined. > So the best way to handle that is to force it down. NeighborMTU is not involved with the link negotiation nor is there a comment in the description like OperationalVLs. What behavior are you referring to ? > > I also noticed a nit in the same function: > > > > p_pi->m_key_lease_period = p_mgr->p_subn->opt.m_key_lease_period; > > /* Check to see if the value we are setting is different than > > the value in the port_info. If it is - turn on send_set flag */ > > if (cl_memcmp( &p_pi->m_key_lease_period, > > &p_old_pi->m_key_lease_period, > > sizeof(p_pi->m_key_lease_period) )) > > send_set = TRUE; > > > > Should that be only when the Mkey is non 0 ? > Well, I know the lease is not relevant when MKey = 0. But for code clarity I > propose to ignore that fact. The effect is only when someone sets a lease period but MKey = 0 > which IMO does not make any sense anyway. I agree it does not make sense but could happen (is it prevented somehow ?) so my take is to minimize the need for sets. As I said this is a nit.
-- Hal From halr at voltaire.com Sun Dec 18 08:18:29 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 18 Dec 2005 11:18:29 -0500 Subject: [openib-general] [PATCH] OpenSM/osm_port.h: In osm_physp_trim_base_lid_to_valid_range, also fix LID 0 Message-ID: <1134922709.4328.21620.camel@hal.voltaire.com> OpenSM/osm_port.h: In osm_physp_trim_base_lid_to_valid_range, fix LID 0 as well as multicast LIDs Signed-off-by: Hal Rosenstock Index: osm_port.h =================================================================== --- osm_port.h (revision 4522) +++ osm_port.h (working copy) @@ -473,7 +473,8 @@ osm_physp_trim_base_lid_to_valid_range( ib_net16_t orig_lid = 0; CL_ASSERT( osm_physp_is_valid( p_physp ) ); - if ( cl_ntoh16( p_physp->port_info.base_lid ) > IB_LID_UCAST_END_HO ) + if ( ( cl_ntoh16( p_physp->port_info.base_lid ) > IB_LID_UCAST_END_HO ) || + ( cl_ntoh16( p_physp->port_info.base_lid ) < IB_LID_UCAST_START_HO ) ) { orig_lid = p_physp->port_info.base_lid; p_physp->port_info.base_lid = 0; From halr at voltaire.com Sun Dec 18 08:29:04 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 18 Dec 2005 11:29:04 -0500 Subject: [openib-general] OpenSM/osm_lid_mgr.c: In __osm_lid_mgr_set_physp_pi, check for change in neighbor MTU rather than MTUCap Message-ID: <1134923343.4328.21735.camel@hal.voltaire.com> OpenSM/osm_lid_mgr.c: In __osm_lid_mgr_set_physp_pi, check for change in neighbor MTU rather than MTUCap in order to determine whether link should be DOWNed Signed-off-by: Hal Rosenstock Index: osm_lid_mgr.c =================================================================== --- osm_lid_mgr.c (revision 4522) +++ osm_lid_mgr.c (working copy) @@ -1095,7 +1095,7 @@ __osm_lid_mgr_set_physp_pi( To reset the port state machine we can send PortInfo.State = DOWN. (see: 7.2.7 p161 lines:10-19.) */ - if ( (mtu != ib_port_info_get_mtu_cap( p_old_pi )) || + if ( (mtu != ib_port_info_get_neighbor_mtu( p_old_pi )) || (op_vls != ib_port_info_get_op_vls(p_old_pi))) { if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) From ftillier at silverstorm.com Sun Dec 18 08:38:42 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Sun, 18 Dec 2005 08:38:42 -0800 Subject: [openib-general] RE: [PATCH] OpenSM/ib_types.h: Modify ib_port_info_compute_rate In-Reply-To: <1134920884.4328.21299.camel@hal.voltaire.com> Message-ID: <000401c603f1$866d4eb0$6401a8c0@infiniconsys.com> Hi Hal, > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Sunday, December 18, 2005 7:48 AM > > On Sun, 2005-12-18 at 06:53, Eitan Zahavi wrote: > > Hi Hal, > > > > The attached patch is fine. Please go ahead and commit it. > > > > BTW: > > In the following commit 4509 you have changed the name of a > > switch info record field. > > Note this is an API change and have severe effect on any > > application using ib_types.h (and there are plenty of these) > > Is this "API" frozen for all time ? How would you propose that this > "API" evolve ? I do not see where there is any versioning to the API. The ib_types.h file in the OpenSM project was originally lifted from the IBAL project. The Windows OpenIB Project, since it's derived from IBAL, has that file. Currently, OpenSM has its own shadow copy of that header, so changes to it must be carefully controlled to keep OpenSM building on Windows. Hopefully that helps explain. - Fab
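On Eitan's question elsewhere in this thread of how header files could be versioned at all: one conventional approach -- purely a sketch, nothing like this exists in ib_types.h -- is a pair of compile-time version macros that every consumer (OpenSM on Linux, the Windows/IBAL tree, external tools) can check:

    /* In the shared header (hypothetical): bump MAJOR on incompatible
     * changes such as renaming a struct field, MINOR on additions. */
    #define IB_TYPES_API_MAJOR  1
    #define IB_TYPES_API_MINOR  2

    /* In a dependent project, fail fast at compile time: */
    #if IB_TYPES_API_MAJOR != 1
    #error "this tool requires ib_types.h API major version 1"
    #endif

That would at least turn a silent field rename into an explicit build error for out-of-tree users.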
From halr at voltaire.com Sun Dec 18 08:43:36 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 18 Dec 2005 11:43:36 -0500 Subject: [openib-general] RE: [PATCH] OpenSM/ib_types.h: Modify ib_port_info_compute_rate In-Reply-To: <000401c603f1$866d4eb0$6401a8c0@infiniconsys.com> References: <000401c603f1$866d4eb0$6401a8c0@infiniconsys.com> Message-ID: <1134924028.4328.21864.camel@hal.voltaire.com> On Sun, 2005-12-18 at 11:38, Fab Tillier wrote: > Hi Hal, > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Sunday, December 18, 2005 7:48 AM > > > > On Sun, 2005-12-18 at 06:53, Eitan Zahavi wrote: > > > Hi Hal, > > > > > > The attached patch is fine. Please go ahead and commit it. > > > > > > BTW: > > > In the following commit 4509 you have changed the name of a > > > switch info record field. > > > Note this is an API change and have severe effect on any > > > application using ib_types.h (and there are plenty of these) > > > > Is this "API" frozen for all time ? How would you propose that this > > "API" evolve ? I do not see where there is any versioning to the API. > > The ib_types.h file in the OpenSM project was originally lifted from the IBAL > project. The Windows OpenIB Project, since it's derived from IBAL, has that > file. Currently, OpenSM has its own shadow copy of that header, so changes to > it must be carefully controlled to keep OpenSM building on Windows. > > Hopefully that helps explain. OpenSM builds. It is other tools (currently non OpenIB) which experience problems. -- Hal From eitan at mellanox.co.il Sun Dec 18 11:20:20 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 18 Dec 2005 21:20:20 +0200 Subject: [openib-general] RE: A couple of questions about osm_lid_mgr.c::__osm_lid_mgr_set_physp_pi Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3618B52@mtlexch01.mtl.com> > On Sun, 2005-12-18 at 03:53, Eitan Zahavi wrote: > > [snip...] > > > > This seems a little inconsistent to me. It seems like NeighborMTU would > > > be the equivalent of OperationalVLs, rather than MTUCap (which is RO). > > > Yes I think we should have checked the NeighborMTU and not the MTUCap. > > OK. I'll fix this. [EZ] Thanks. I have seen the patch. It is fine. > > > > Also, why does changing the MTU require that the link be taken down ? > > > The behavior of the link when the neighbor MTU changes is not very well defined. > > So the best way to handle that is to force it down. > > NeighborMTU is not involved with the link negotiation nor is there a > comment in the description like OperationalVLs. What behavior are you > referring to ? [EZ] I actually do not see any spec note about modifying neighbor MTU during link up. However, I remember we had to add this functionality. I tried to dig this up in the old BitKeeper tree and found the first occurrence of the setting of the port down in version 1.7. But the log does not say why. > > > > I also noticed a nit in the same function: > > > > > > p_pi->m_key_lease_period = p_mgr->p_subn->opt.m_key_lease_period; > > > /* Check to see if the value we are setting is different than > > > the value in the port_info. If it is - turn on send_set flag */ > > > if (cl_memcmp( &p_pi->m_key_lease_period, > > > &p_old_pi->m_key_lease_period, > > > sizeof(p_pi->m_key_lease_period) )) > > > send_set = TRUE; > > > > > > Should that be only when the Mkey is non 0 ? > > > Well, I know the lease is not relevant when MKey = 0. But for code clarity I > > propose to ignore that fact.
The effect is only when someone sets a lease period but > MKey = 0 > > which IMO does not make any sense anyway. > > I agree it does not make sense but could happen (is it prevented somehow > ?) so my take is to minimize the need for sets. As I said this is a nit. [EZ] We could avoid that but I do not think this is required. > > -- Hal From rjwalsh at pathscale.com Sun Dec 18 11:52:05 2005 From: rjwalsh at pathscale.com (Robert Walsh) Date: Sun, 18 Dec 2005 11:52:05 -0800 Subject: [openib-general] Re: [PATCH 03/13] [RFC] ipath copy routines In-Reply-To: <20051218.013341.34772534.davem@davemloft.net> References: <20051217123833.1aa430ab.akpm@osdl.org> <1134859243.20575.84.camel@phosphene.durables.org> <20051217191932.af2b422c.akpm@osdl.org> <20051218.013341.34772534.davem@davemloft.net> Message-ID: <1134935525.5826.0.camel@phosphene.durables.org> > > That would make sense. Give it a non-ipath-related name and require that > > all architectures which wish to run this driver must implement that > > (documented) function. > > > > And, in Kconfig, make sure that architectures which don't implement that > > library function do not attempt to build this driver. To avoid breaking > > `make allmodconfig'. > > How about we implement a portable version in C that you get > by default if you don't implement the assembler routine? > Pretty please? :-) Sure. :-) -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. From sam at ravnborg.org Sun Dec 18 11:23:56 2005 From: sam at ravnborg.org (Sam Ravnborg) Date: Sun, 18 Dec 2005 20:23:56 +0100 Subject: [openib-general] Re: [PATCH 13/13] [RFC] ipath Kconfig and Makefile In-Reply-To: <200512161548.lokgvLraSGi0enUH@cisco.com> References: <200512161548.MdcxE8ZQTy1yj4v1@cisco.com> <200512161548.lokgvLraSGi0enUH@cisco.com> Message-ID: <20051218192356.GB9145@mars.ravnborg.org> On Fri, Dec 16, 2005 at 03:48:55PM -0800, Roland Dreier wrote: > @@ -0,0 +1,15 @@ > +EXTRA_CFLAGS += -Idrivers/infiniband/include If this is needed then some header files should be moved to include/rdma > + > +ipath_core-objs := ipath_copy.o ipath_driver.o \ > + ipath_dwordcpy.o ipath_ht400.o ipath_i2c.o ipath_layer.o \ > + ipath_lib.o ipath_mlock.o > + > +ib_ipath-objs := ipath_mad.o ipath_verbs.o Please use: ipath_core-y := ... ib_ipath-y := ... Use of -y lets you do better Kconfig selection in the makefile, and is preferred compared to -objs Sam From paulmck at us.ibm.com Sun Dec 18 11:59:22 2005 From: paulmck at us.ibm.com (Paul E. McKenney) Date: Sun, 18 Dec 2005 11:59:22 -0800 Subject: [openib-general] Re: [PATCH 10/13] [RFC] ipath verbs, part 1 In-Reply-To: <200512161548.W9sJn4CLmdhnSTcH@cisco.com> References: <200512161548.zxp6FKcabEu47EnS@cisco.com> <200512161548.W9sJn4CLmdhnSTcH@cisco.com> Message-ID: <20051218195922.GC31184@us.ibm.com> On Fri, Dec 16, 2005 at 03:48:55PM -0800, Roland Dreier wrote: > First half of ipath verbs driver Some RCU-related questions interspersed. Basic question is "where is the lock-free read-side traversal?"
Thanx, Paul > --- > > drivers/infiniband/hw/ipath/ipath_verbs.c | 3244 +++++++++++++++++++++++++++++ > 1 files changed, 3244 insertions(+), 0 deletions(-) > create mode 100644 drivers/infiniband/hw/ipath/ipath_verbs.c > > 72075ecec75f8c42e444a7d7d8ffcf340a845b96 > diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c > new file mode 100644 > index 0000000..808326e > --- /dev/null > +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c > @@ -0,0 +1,3244 @@ > +/* > + * Copyright (c) 2005. PathScale, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + * > + * Patent licenses, if any, provided herein do not apply to > + * combinations of this program with other software, or any other > + * product whatsoever. > + * > + * $Id: ipath_verbs.c 4491 2005-12-15 22:20:31Z rjwalsh $ > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include > +#include > + > +#include "ipath_common.h" > +#include "ips_common.h" > +#include "ipath_layer.h" > +#include "ipath_verbs.h" > + > +/* > + * Compare the lower 24 bits of the two values. > + * Returns an integer <, ==, or > than zero. 
> + */ > +static inline int cmp24(u32 a, u32 b) > +{ > + return (((int) a) - ((int) b)) << 8; > +} > + > +#define MODNAME "ib_ipath" > +#define DRIVER_LOAD_MSG "PathScale " MODNAME " loaded: " > +#define PFX MODNAME ": " > + > + > +/* Not static, because we don't want the compiler removing it */ > +const char ipath_verbs_version[] = "ipath_verbs " _IPATH_IDSTR; > + > +unsigned int ib_ipath_qp_table_size = 251; > +module_param(ib_ipath_qp_table_size, uint, 0444); > +MODULE_PARM_DESC(ib_ipath_qp_table_size, "QP table size"); > + > +unsigned int ib_ipath_lkey_table_size = 12; > +module_param(ib_ipath_lkey_table_size, uint, 0444); > +MODULE_PARM_DESC(ib_ipath_lkey_table_size, > + "LKEY table size in bits (2^n, 1 <= n <= 23)"); > + > +unsigned int ib_ipath_debug; /* debug mask */ > +module_param(ib_ipath_debug, uint, 0644); > +MODULE_PARM_DESC(ib_ipath_debug, "Verbs debug mask"); > + > + > +static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_sge_state *ss, > + u32 len, struct ib_send_wr *wr, struct ib_wc *wc); > +static void ipath_ruc_loopback(struct ipath_qp *sqp, struct ib_wc *wc); > +static int ipath_destroy_qp(struct ib_qp *ibqp); > + > +MODULE_LICENSE("GPL"); > +MODULE_AUTHOR("PathScale "); > +MODULE_DESCRIPTION("Pathscale InfiniPath driver"); > + > +enum { > + IPATH_FAULT_RC_DROP_SEND_F = 1, > + IPATH_FAULT_RC_DROP_SEND_M, > + IPATH_FAULT_RC_DROP_SEND_L, > + IPATH_FAULT_RC_DROP_SEND_O, > + IPATH_FAULT_RC_DROP_RDMA_WRITE_F, > + IPATH_FAULT_RC_DROP_RDMA_WRITE_M, > + IPATH_FAULT_RC_DROP_RDMA_WRITE_L, > + IPATH_FAULT_RC_DROP_RDMA_WRITE_O, > + IPATH_FAULT_RC_DROP_RDMA_READ_RESP_F, > + IPATH_FAULT_RC_DROP_RDMA_READ_RESP_M, > + IPATH_FAULT_RC_DROP_RDMA_READ_RESP_L, > + IPATH_FAULT_RC_DROP_RDMA_READ_RESP_O, > + IPATH_FAULT_RC_DROP_ACK, > +}; > + > +enum { > + IPATH_TRANS_INVALID = 0, > + IPATH_TRANS_ANY2RST, > + IPATH_TRANS_RST2INIT, > + IPATH_TRANS_INIT2INIT, > + IPATH_TRANS_INIT2RTR, > + IPATH_TRANS_RTR2RTS, > + IPATH_TRANS_RTS2RTS, > + IPATH_TRANS_SQERR2RTS, > + IPATH_TRANS_ANY2ERR, > + IPATH_TRANS_RTS2SQD, /* XXX Wait for expected ACKs & signal event */ > + IPATH_TRANS_SQD2SQD, /* error if not drained & parameter change */ > + IPATH_TRANS_SQD2RTS, /* error if not drained */ > +}; > + > +enum { > + IPATH_POST_SEND_OK = 0x0001, > + IPATH_POST_RECV_OK = 0x0002, > + IPATH_PROCESS_RECV_OK = 0x0004, > + IPATH_PROCESS_SEND_OK = 0x0008, > +}; > + > +static int state_ops[IB_QPS_ERR + 1] = { > + [IB_QPS_RESET] = 0, > + [IB_QPS_INIT] = IPATH_POST_RECV_OK, > + [IB_QPS_RTR] = IPATH_POST_RECV_OK | IPATH_PROCESS_RECV_OK, > + [IB_QPS_RTS] = IPATH_POST_RECV_OK | IPATH_PROCESS_RECV_OK | > + IPATH_POST_SEND_OK | IPATH_PROCESS_SEND_OK, > + [IB_QPS_SQD] = IPATH_POST_RECV_OK | IPATH_PROCESS_RECV_OK | > + IPATH_POST_SEND_OK, > + [IB_QPS_SQE] = IPATH_POST_RECV_OK | IPATH_PROCESS_RECV_OK, > + [IB_QPS_ERR] = 0, > +}; > + > +/* > + * Convert the AETH credit code into the number of credits. 
> + */ > +static u32 credit_table[31] = { > + 0, /* 0 */ > + 1, /* 1 */ > + 2, /* 2 */ > + 3, /* 3 */ > + 4, /* 4 */ > + 6, /* 5 */ > + 8, /* 6 */ > + 12, /* 7 */ > + 16, /* 8 */ > + 24, /* 9 */ > + 32, /* A */ > + 48, /* B */ > + 64, /* C */ > + 96, /* D */ > + 128, /* E */ > + 192, /* F */ > + 256, /* 10 */ > + 384, /* 11 */ > + 512, /* 12 */ > + 768, /* 13 */ > + 1024, /* 14 */ > + 1536, /* 15 */ > + 2048, /* 16 */ > + 3072, /* 17 */ > + 4096, /* 18 */ > + 6144, /* 19 */ > + 8192, /* 1A */ > + 12288, /* 1B */ > + 16384, /* 1C */ > + 24576, /* 1D */ > + 32768 /* 1E */ > +}; > + > +/* > + * Convert the AETH RNR timeout code into the number of milliseconds. > + */ > +static u32 rnr_table[32] = { > + 656, /* 0 */ > + 1, /* 1 */ > + 1, /* 2 */ > + 1, /* 3 */ > + 1, /* 4 */ > + 1, /* 5 */ > + 1, /* 6 */ > + 1, /* 7 */ > + 1, /* 8 */ > + 1, /* 9 */ > + 1, /* A */ > + 1, /* B */ > + 1, /* C */ > + 1, /* D */ > + 2, /* E */ > + 2, /* F */ > + 3, /* 10 */ > + 4, /* 11 */ > + 6, /* 12 */ > + 8, /* 13 */ > + 11, /* 14 */ > + 16, /* 15 */ > + 21, /* 16 */ > + 31, /* 17 */ > + 41, /* 18 */ > + 62, /* 19 */ > + 82, /* 1A */ > + 123, /* 1B */ > + 164, /* 1C */ > + 246, /* 1D */ > + 328, /* 1E */ > + 492 /* 1F */ > +}; > + > +/* > + * Translate ib_wr_opcode into ib_wc_opcode. > + */ > +static enum ib_wc_opcode wc_opcode[] = { > + [IB_WR_RDMA_WRITE] = IB_WC_RDMA_WRITE, > + [IB_WR_RDMA_WRITE_WITH_IMM] = IB_WC_RDMA_WRITE, > + [IB_WR_SEND] = IB_WC_SEND, > + [IB_WR_SEND_WITH_IMM] = IB_WC_SEND, > + [IB_WR_RDMA_READ] = IB_WC_RDMA_READ, > + [IB_WR_ATOMIC_CMP_AND_SWP] = IB_WC_COMP_SWAP, > + [IB_WR_ATOMIC_FETCH_AND_ADD] = IB_WC_FETCH_ADD > +}; > + > +/* > + * Array of device pointers. > + */ > +static uint32_t number_of_devices; > +static struct ipath_ibdev **ipath_devices; > + > +/* > + * Global table of GID to attached QPs. > + * The table is global to all ipath devices since a send from one QP/device > + * needs to be locally routed to any locally attached QPs on the same > + * or different device. > + */ > +static struct rb_root mcast_tree; > +static spinlock_t mcast_lock = SPIN_LOCK_UNLOCKED; > + > +/* > + * Allocate a structure to link a QP to the multicast GID structure. > + */ > +static struct ipath_mcast_qp *ipath_mcast_qp_alloc(struct ipath_qp *qp) > +{ > + struct ipath_mcast_qp *mqp; > + > + mqp = kmalloc(sizeof(*mqp), GFP_KERNEL); > + if (!mqp) > + return NULL; > + > + mqp->qp = qp; > + atomic_inc(&qp->refcount); > + > + return mqp; > +} > + > +static void ipath_mcast_qp_free(struct ipath_mcast_qp *mqp) > +{ > + struct ipath_qp *qp = mqp->qp; > + > + /* Notify ipath_destroy_qp() if it is waiting. */ > + if (atomic_dec_and_test(&qp->refcount)) > + wake_up(&qp->wait); > + > + kfree(mqp); > +} > + > +/* > + * Allocate a structure for the multicast GID. > + * A list of QPs will be attached to this structure. > + */ > +static struct ipath_mcast *ipath_mcast_alloc(union ib_gid *mgid) > +{ > + struct ipath_mcast *mcast; > + > + mcast = kmalloc(sizeof(*mcast), GFP_KERNEL); > + if (!mcast) > + return NULL; > + > + mcast->mgid = *mgid; > + INIT_LIST_HEAD(&mcast->qp_list); > + init_waitqueue_head(&mcast->wait); > + atomic_set(&mcast->refcount, 0); > + > + return mcast; > +} > + > +static void ipath_mcast_free(struct ipath_mcast *mcast) > +{ > + struct ipath_mcast_qp *p, *tmp; > + > + list_for_each_entry_safe(p, tmp, &mcast->qp_list, list) > + ipath_mcast_qp_free(p); > + > + kfree(mcast); > +} > + > +/* > + * Search the global table for the given multicast GID. > + * Return it or NULL if not found. 
> + * The caller is responsible for decrementing the reference count if found. > + */ > +static struct ipath_mcast *ipath_mcast_find(union ib_gid *mgid) > +{ > + struct rb_node *n; > + unsigned long flags; > + > + spin_lock_irqsave(&mcast_lock, flags); > + n = mcast_tree.rb_node; > + while (n) { > + struct ipath_mcast *mcast; > + int ret; > + > + mcast = rb_entry(n, struct ipath_mcast, rb_node); > + > + ret = memcmp(mgid->raw, mcast->mgid.raw, sizeof(union ib_gid)); > + if (ret < 0) > + n = n->rb_left; > + else if (ret > 0) > + n = n->rb_right; > + else { > + atomic_inc(&mcast->refcount); > + spin_unlock_irqrestore(&mcast_lock, flags); > + return mcast; > + } > + } > + spin_unlock_irqrestore(&mcast_lock, flags); > + > + return NULL; > +} > + > +/* > + * Insert the multicast GID into the table and > + * attach the QP structure. > + * Return zero if both were added. > + * Return EEXIST if the GID was already in the table but the QP was added. > + * Return ESRCH if the QP was already attached and neither structure was added. > + */ > +static int ipath_mcast_add(struct ipath_mcast *mcast, > + struct ipath_mcast_qp *mqp) > +{ > + struct rb_node **n = &mcast_tree.rb_node; > + struct rb_node *pn = NULL; > + unsigned long flags; > + > + spin_lock_irqsave(&mcast_lock, flags); > + > + while (*n) { > + struct ipath_mcast *tmcast; > + struct ipath_mcast_qp *p; > + int ret; > + > + pn = *n; > + tmcast = rb_entry(pn, struct ipath_mcast, rb_node); > + > + ret = memcmp(mcast->mgid.raw, tmcast->mgid.raw, > + sizeof(union ib_gid)); > + if (ret < 0) { > + n = &pn->rb_left; > + continue; > + } > + if (ret > 0) { > + n = &pn->rb_right; > + continue; > + } > + > + /* Search the QP list to see if this is already there. */ > + list_for_each_entry_rcu(p, &tmcast->qp_list, list) { Given that we hold the global mcast_lock, how is RCU helping here? Is there a lock-free read-side traversal path somewhere that I am missing? > + if (p->qp == mqp->qp) { > + spin_unlock_irqrestore(&mcast_lock, flags); > + return ESRCH; > + } > + } > + list_add_tail_rcu(&mqp->list, &tmcast->qp_list); Ditto... > + spin_unlock_irqrestore(&mcast_lock, flags); > + return EEXIST; > + } > + > + list_add_tail_rcu(&mqp->list, &mcast->qp_list); Ditto... > + spin_unlock_irqrestore(&mcast_lock, flags); > + > + atomic_inc(&mcast->refcount); > + rb_link_node(&mcast->rb_node, pn, n); > + rb_insert_color(&mcast->rb_node, &mcast_tree); > + > + spin_unlock_irqrestore(&mcast_lock, flags); > + > + return 0; > +} > + > +static int ipath_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, > + u16 lid) > +{ > + struct ipath_qp *qp = to_iqp(ibqp); > + struct ipath_mcast *mcast; > + struct ipath_mcast_qp *mqp; > + > + /* > + * Allocate data structures since its better to do this outside of > + * spin locks and it will most likely be needed. > + */ > + mcast = ipath_mcast_alloc(gid); > + if (mcast == NULL) > + return -ENOMEM; > + mqp = ipath_mcast_qp_alloc(qp); > + if (mqp == NULL) { > + ipath_mcast_free(mcast); > + return -ENOMEM; > + } > + switch (ipath_mcast_add(mcast, mqp)) { > + case ESRCH: > + /* Neither was used: can't attach the same QP twice. 
*/ > + ipath_mcast_qp_free(mqp); > + ipath_mcast_free(mcast); > + return -EINVAL; > + case EEXIST: /* The mcast wasn't used */ > + ipath_mcast_free(mcast); > + break; > + default: > + break; > + } > + return 0; > +} > + > +static int ipath_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, > + u16 lid) > +{ > + struct ipath_qp *qp = to_iqp(ibqp); > + struct ipath_mcast *mcast = NULL; > + struct ipath_mcast_qp *p, *tmp; > + struct rb_node *n; > + unsigned long flags; > + int last = 0; > + > + spin_lock_irqsave(&mcast_lock, flags); > + > + /* Find the GID in the mcast table. */ > + n = mcast_tree.rb_node; > + while (1) { > + int ret; > + > + if (n == NULL) { > + spin_unlock_irqrestore(&mcast_lock, flags); > + return 0; > + } > + > + mcast = rb_entry(n, struct ipath_mcast, rb_node); > + ret = memcmp(gid->raw, mcast->mgid.raw, sizeof(union ib_gid)); > + if (ret < 0) > + n = n->rb_left; > + else if (ret > 0) > + n = n->rb_right; > + else > + break; > + } > + > + /* Search the QP list. */ > + list_for_each_entry_safe(p, tmp, &mcast->qp_list, list) { > + if (p->qp != qp) > + continue; > + /* > + * We found it, so remove it, but don't poison the forward link > + * until we are sure there are no list walkers. > + */ > + list_del_rcu(&p->list); Ditto... > + spin_unlock_irqrestore(&mcast_lock, flags); > + > + /* If this was the last attached QP, remove the GID too. */ > + if (list_empty(&mcast->qp_list)) { > + rb_erase(&mcast->rb_node, &mcast_tree); > + last = 1; > + } > + break; > + } > + > + spin_unlock_irqrestore(&mcast_lock, flags); > + > + if (p) { > + /* > + * Wait for any list walkers to finish before freeing the > + * list element. > + */ > + wait_event(mcast->wait, atomic_read(&mcast->refcount) <= 1); > + ipath_mcast_qp_free(p); > + } > + if (last) { > + atomic_dec(&mcast->refcount); > + wait_event(mcast->wait, !atomic_read(&mcast->refcount)); > + ipath_mcast_free(mcast); > + } > + > + return 0; > +} > + > +/* > + * Copy data to SGE memory. > + */ > +static void copy_sge(struct ipath_sge_state *ss, void *data, u32 length) > +{ > + struct ipath_sge *sge = &ss->sge; > + > + while (length) { > + u32 len = sge->length; > + > + BUG_ON(len == 0); > + if (len > length) > + len = length; > + memcpy(sge->vaddr, data, len); > + sge->vaddr += len; > + sge->length -= len; > + sge->sge_length -= len; > + if (sge->sge_length == 0) { > + if (--ss->num_sge) > + *sge = *ss->sg_list++; > + } else if (sge->length == 0 && sge->mr != NULL) { > + if (++sge->n >= IPATH_SEGSZ) { > + if (++sge->m >= sge->mr->mapsz) > + break; > + sge->n = 0; > + } > + sge->vaddr = sge->mr->map[sge->m]->segs[sge->n].vaddr; > + sge->length = sge->mr->map[sge->m]->segs[sge->n].length; > + } > + data += len; > + length -= len; > + } > +} > + > +/* > + * Skip over length bytes of SGE memory. 
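Similar locking problem in ipath_multicast_detach() above: when the QP is found, mcast_lock is released inside the loop and then released again after it, and the list_empty()/rb_erase() pair runs after the first unlock. Also, if the loop completes without a match, p is left pointing at the list head, so the later "if (p)" test can free garbage. A sketch that keeps the unlink and rb_erase() under the lock and tracks the match explicitly (untested):

	int found = 0;
	...
	list_for_each_entry_safe(p, tmp, &mcast->qp_list, list) {
		if (p->qp != qp)
			continue;
		/* Unlink, and unhook the GID if this was the last QP,
		 * all while still holding mcast_lock. */
		list_del_rcu(&p->list);
		if (list_empty(&mcast->qp_list)) {
			rb_erase(&mcast->rb_node, &mcast_tree);
			last = 1;
		}
		found = 1;
		break;
	}

	spin_unlock_irqrestore(&mcast_lock, flags);

	if (found) {
		wait_event(mcast->wait, atomic_read(&mcast->refcount) <= 1);
		ipath_mcast_qp_free(p);
	}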
> + */ > +static void skip_sge(struct ipath_sge_state *ss, u32 length) > +{ > + struct ipath_sge *sge = &ss->sge; > + > + while (length > sge->sge_length) { > + length -= sge->sge_length; > + ss->sge = *ss->sg_list++; > + } > + while (length) { > + u32 len = sge->length; > + > + BUG_ON(len == 0); > + if (len > length) > + len = length; > + sge->vaddr += len; > + sge->length -= len; > + sge->sge_length -= len; > + if (sge->sge_length == 0) { > + if (--ss->num_sge) > + *sge = *ss->sg_list++; > + } else if (sge->length == 0 && sge->mr != NULL) { > + if (++sge->n >= IPATH_SEGSZ) { > + if (++sge->m >= sge->mr->mapsz) > + break; > + sge->n = 0; > + } > + sge->vaddr = sge->mr->map[sge->m]->segs[sge->n].vaddr; > + sge->length = sge->mr->map[sge->m]->segs[sge->n].length; > + } > + length -= len; > + } > +} > + > +static inline u32 alloc_qpn(struct ipath_qp_table *qpt) > +{ > + u32 i, offset, max_scan, qpn; > + struct qpn_map *map; > + > + qpn = qpt->last + 1; > + if (qpn >= QPN_MAX) > + qpn = 2; > + offset = qpn & BITS_PER_PAGE_MASK; > + map = &qpt->map[qpn / BITS_PER_PAGE]; > + max_scan = qpt->nmaps - !offset; > + for (i = 0;;) { > + if (unlikely(!map->page)) { > + unsigned long page = get_zeroed_page(GFP_KERNEL); > + unsigned long flags; > + > + /* > + * Free the page if someone raced with us > + * installing it: > + */ > + spin_lock_irqsave(&qpt->lock, flags); > + if (map->page) > + free_page(page); > + else > + map->page = (void *)page; > + spin_unlock_irqrestore(&qpt->lock, flags); > + if (unlikely(!map->page)) > + break; > + } > + if (likely(atomic_read(&map->n_free))) { > + do { > + if (!test_and_set_bit(offset, map->page)) { > + atomic_dec(&map->n_free); > + qpt->last = qpn; > + return qpn; > + } > + offset = find_next_offset(map, offset); > + qpn = mk_qpn(qpt, map, offset); > + /* > + * This test differs from alloc_pidmap(). > + * If find_next_offset() does find a zero bit, > + * we don't need to check for QPN wrapping > + * around past our starting QPN. We > + * just need to be sure we don't loop forever. > + */ > + } while (offset < BITS_PER_PAGE && qpn < QPN_MAX); > + } > + /* > + * In order to keep the number of pages allocated to a minimum, > + * we scan the all existing pages before increasing the size > + * of the bitmap table. > + */ > + if (++i > max_scan) { > + if (qpt->nmaps == QPNMAP_ENTRIES) > + break; > + map = &qpt->map[qpt->nmaps++]; > + offset = 0; > + } else if (map < &qpt->map[qpt->nmaps]) { > + ++map; > + offset = 0; > + } else { > + map = &qpt->map[0]; > + offset = 2; > + } > + qpn = mk_qpn(qpt, map, offset); > + } > + return 0; > +} > + > +static inline void free_qpn(struct ipath_qp_table *qpt, u32 qpn) > +{ > + struct qpn_map *map; > + > + map = qpt->map + qpn / BITS_PER_PAGE; > + if (map->page) > + clear_bit(qpn & BITS_PER_PAGE_MASK, map->page); > + atomic_inc(&map->n_free); > +} > + > +/* > + * Allocate the next available QPN and put the QP into the hash table. > + * The hash table holds a reference to the QP. > + */ > +static int ipath_alloc_qpn(struct ipath_qp_table *qpt, struct ipath_qp *qp, > + enum ib_qp_type type) > +{ > + unsigned long flags; > + u32 qpn; > + > + if (type == IB_QPT_SMI) > + qpn = 0; > + else if (type == IB_QPT_GSI) > + qpn = 1; > + else { > + /* Allocate the next available QPN */ > + qpn = alloc_qpn(qpt); > + if (qpn == 0) { > + return -ENOMEM; > + } > + } > + qp->ibqp.qp_num = qpn; > + > + /* Add the QP to the hash table. 
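mk_qpn(), find_next_offset() and the BITS_PER_PAGE constants are not in this hunk; since the comments in alloc_qpn() reference alloc_pidmap(), presumably they mirror the pidmap helpers, roughly:

	#define BITS_PER_PAGE		(PAGE_SIZE * 8)
	#define BITS_PER_PAGE_MASK	(BITS_PER_PAGE - 1)
	#define mk_qpn(qpt, map, off)	(((map) - (qpt)->map) * \
					 BITS_PER_PAGE + (off))
	#define find_next_offset(map, off) \
		find_next_zero_bit((map)->page, BITS_PER_PAGE, off)

(Names and details assumed from the alloc_pidmap() analogy, not taken from this patch.)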
*/ > + spin_lock_irqsave(&qpt->lock, flags); > + > + qpn %= qpt->max; > + qp->next = qpt->table[qpn]; > + qpt->table[qpn] = qp; > + atomic_inc(&qp->refcount); > + > + spin_unlock_irqrestore(&qpt->lock, flags); > + return 0; > +} > + > +/* > + * Remove the QP from the table so it can't be found asynchronously by > + * the receive interrupt routine. > + */ > +static void ipath_free_qp(struct ipath_qp_table *qpt, struct ipath_qp *qp) > +{ > + struct ipath_qp *q, **qpp; > + unsigned long flags; > + int fnd = 0; > + > + spin_lock_irqsave(&qpt->lock, flags); > + > + /* Remove QP from the hash table. */ > + qpp = &qpt->table[qp->ibqp.qp_num % qpt->max]; > + for (; (q = *qpp) != NULL; qpp = &q->next) { > + if (q == qp) { > + *qpp = qp->next; > + qp->next = NULL; > + atomic_dec(&qp->refcount); > + fnd = 1; > + break; > + } > + } > + > + spin_unlock_irqrestore(&qpt->lock, flags); > + > + if (!fnd) > + return; > + > + /* If QPN is not reserved, mark QPN free in the bitmap. */ > + if (qp->ibqp.qp_num > 1) > + free_qpn(qpt, qp->ibqp.qp_num); > + > + wait_event(qp->wait, !atomic_read(&qp->refcount)); > +} > + > +/* > + * Remove all QPs from the table. > + */ > +static void ipath_free_all_qps(struct ipath_qp_table *qpt) > +{ > + unsigned long flags; > + struct ipath_qp *qp, *nqp; > + u32 n; > + > + for (n = 0; n < qpt->max; n++) { > + spin_lock_irqsave(&qpt->lock, flags); > + qp = qpt->table[n]; > + qpt->table[n] = NULL; > + spin_unlock_irqrestore(&qpt->lock, flags); > + > + while (qp) { > + nqp = qp->next; > + if (qp->ibqp.qp_num > 1) > + free_qpn(qpt, qp->ibqp.qp_num); > + if (!atomic_dec_and_test(&qp->refcount) || > + !ipath_destroy_qp(&qp->ibqp)) > + _VERBS_INFO("QP memory leak!\n"); > + qp = nqp; > + } > + } > + > + for (n = 0; n < ARRAY_SIZE(qpt->map); n++) { > + if (qpt->map[n].page) > + free_page((unsigned long)qpt->map[n].page); > + } > +} > + > +/* > + * Return the QP with the given QPN. > + * The caller is responsible for decrementing the QP reference count when done. > + */ > +static struct ipath_qp *ipath_lookup_qpn(struct ipath_qp_table *qpt, u32 qpn) > +{ > + unsigned long flags; > + struct ipath_qp *qp; > + > + spin_lock_irqsave(&qpt->lock, flags); > + > + for (qp = qpt->table[qpn % qpt->max]; qp; qp = qp->next) { > + if (qp->ibqp.qp_num == qpn) { > + atomic_inc(&qp->refcount); > + break; > + } > + } > + > + spin_unlock_irqrestore(&qpt->lock, flags); > + return qp; > +} > + > +static int ipath_alloc_lkey(struct ipath_lkey_table *rkt, > + struct ipath_mregion *mr) > +{ > + unsigned long flags; > + u32 r; > + u32 n; > + > + spin_lock_irqsave(&rkt->lock, flags); > + > + /* Find the next available LKEY */ > + r = n = rkt->next; > + for (;;) { > + if (rkt->table[r] == NULL) > + break; > + r = (r + 1) & (rkt->max - 1); > + if (r == n) { > + spin_unlock_irqrestore(&rkt->lock, flags); > + _VERBS_INFO("LKEY table full\n"); > + return 0; > + } > + } > + rkt->next = (r + 1) & (rkt->max - 1); > + /* > + * Make sure lkey is never zero which is reserved to indicate an > + * unrestricted LKEY. 
> + */ > + rkt->gen++; > + mr->lkey = (r << (32 - ib_ipath_lkey_table_size)) | > + ((((1 << (24 - ib_ipath_lkey_table_size)) - 1) & rkt->gen) << 8); > + if (mr->lkey == 0) { > + mr->lkey |= 1 << 8; > + rkt->gen++; > + } > + rkt->table[r] = mr; > + spin_unlock_irqrestore(&rkt->lock, flags); > + > + return 1; > +} > + > +static void ipath_free_lkey(struct ipath_lkey_table *rkt, u32 lkey) > +{ > + unsigned long flags; > + u32 r; > + > + if (lkey == 0) > + return; > + r = lkey >> (32 - ib_ipath_lkey_table_size); > + spin_lock_irqsave(&rkt->lock, flags); > + rkt->table[r] = NULL; > + spin_unlock_irqrestore(&rkt->lock, flags); > +} > + > +/* > + * Check the IB SGE for validity and initialize our internal version of it. > + * Return 1 if OK, else zero. > + */ > +static int ipath_lkey_ok(struct ipath_lkey_table *rkt, struct ipath_sge *isge, > + struct ib_sge *sge, int acc) > +{ > + struct ipath_mregion *mr; > + size_t off; > + > + /* > + * We use LKEY == zero to mean a physical kmalloc() address. > + * This is a bit of a hack since we rely on dma_map_single() > + * being reversible by calling bus_to_virt(). > + */ > + if (sge->lkey == 0) { > + isge->mr = NULL; > + isge->vaddr = bus_to_virt(sge->addr); > + isge->length = sge->length; > + isge->sge_length = sge->length; > + return 1; > + } > + spin_lock(&rkt->lock); > + mr = rkt->table[(sge->lkey >> (32 - ib_ipath_lkey_table_size))]; > + spin_unlock(&rkt->lock); > + if (unlikely(mr == NULL || mr->lkey != sge->lkey)) > + return 0; > + > + off = sge->addr - mr->user_base; > + if (unlikely(sge->addr < mr->user_base || > + off + sge->length > mr->length || > + (mr->access_flags & acc) != acc)) > + return 0; > + > + off += mr->offset; > + isge->mr = mr; > + isge->m = 0; > + isge->n = 0; > + while (off >= mr->map[isge->m]->segs[isge->n].length) { > + off -= mr->map[isge->m]->segs[isge->n].length; > + if (++isge->n >= IPATH_SEGSZ) { > + isge->m++; > + isge->n = 0; > + } > + } > + isge->vaddr = mr->map[isge->m]->segs[isge->n].vaddr + off; > + isge->length = mr->map[isge->m]->segs[isge->n].length - off; > + isge->sge_length = sge->length; > + return 1; > +} > + > +/* > + * Initialize the qp->s_sge after a restart. > + * The QP s_lock should be held. > + */ > +static void ipath_init_restart(struct ipath_qp *qp, struct ipath_swqe *wqe) > +{ > + struct ipath_ibdev *dev; > + u32 len; > + > + len = ((qp->s_psn - wqe->psn) & 0xFFFFFF) * > + ib_mtu_enum_to_int(qp->path_mtu); > + qp->s_sge.sge = wqe->sg_list[0]; > + qp->s_sge.sg_list = wqe->sg_list + 1; > + qp->s_sge.num_sge = wqe->wr.num_sge; > + skip_sge(&qp->s_sge, len); > + qp->s_len = wqe->length - len; > + dev = to_idev(qp->ibqp.device); > + spin_lock(&dev->pending_lock); > + if (qp->timerwait.next == LIST_POISON1) > + list_add_tail(&qp->timerwait, > + &dev->pending[dev->pending_index]); > + spin_unlock(&dev->pending_lock); > +} > + > +/* > + * Check the IB virtual address, length, and RKEY. > + * Return 1 if OK, else zero. > + * The QP r_rq.lock should be held. 
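To make the LKEY layout in ipath_alloc_lkey() above concrete: with ib_ipath_lkey_table_size == 12 (an assumed value for illustration; the module parameter is not in this hunk), the expression reduces to

	lkey = (r << 20) | ((rkt->gen & 0xFFF) << 8);

so table index r = 5 with generation 3 yields lkey = 0x00500300: the index in the top 12 bits, the generation count in bits 8-19, and the low byte unused.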
> + */ > +static int ipath_rkey_ok(struct ipath_ibdev *dev, struct ipath_sge_state *ss, > + u32 len, u64 vaddr, u32 rkey, int acc) > +{ > + struct ipath_lkey_table *rkt = &dev->lk_table; > + struct ipath_sge *sge = &ss->sge; > + struct ipath_mregion *mr; > + size_t off; > + > + spin_lock(&rkt->lock); > + mr = rkt->table[(rkey >> (32 - ib_ipath_lkey_table_size))]; > + spin_unlock(&rkt->lock); > + if (unlikely(mr == NULL || mr->lkey != rkey)) > + return 0; > + > + off = vaddr - mr->iova; > + if (unlikely(vaddr < mr->iova || off + len > mr->length || > + (mr->access_flags & acc) == 0)) > + return 0; > + > + off += mr->offset; > + sge->mr = mr; > + sge->m = 0; > + sge->n = 0; > + while (off >= mr->map[sge->m]->segs[sge->n].length) { > + off -= mr->map[sge->m]->segs[sge->n].length; > + if (++sge->n >= IPATH_SEGSZ) { > + sge->m++; > + sge->n = 0; > + } > + } > + sge->vaddr = mr->map[sge->m]->segs[sge->n].vaddr + off; > + sge->length = mr->map[sge->m]->segs[sge->n].length - off; > + sge->sge_length = len; > + ss->sg_list = NULL; > + ss->num_sge = 1; > + return 1; > +} > + > +/* > + * Add a new entry to the completion queue. > + * This may be called with one of the qp->s_lock or qp->r_rq.lock held. > + */ > +static void ipath_cq_enter(struct ipath_cq *cq, struct ib_wc *entry, int sig) > +{ > + unsigned long flags; > + u32 next; > + > + spin_lock_irqsave(&cq->lock, flags); > + > + cq->queue[cq->head] = *entry; > + next = cq->head + 1; > + if (next == cq->ibcq.cqe) > + next = 0; > + if (next != cq->tail) > + cq->head = next; > + else { > + /* XXX - need to mark current wr as having an error... */ > + } > + > + if (cq->notify == IB_CQ_NEXT_COMP || > + (cq->notify == IB_CQ_SOLICITED && sig)) { > + cq->notify = IB_CQ_NONE; > + cq->triggered++; > + /* > + * This will cause send_complete() to be called in > + * another thread. > + */ > + tasklet_schedule(&cq->comptask); > + } > + > + spin_unlock_irqrestore(&cq->lock, flags); > + > + if (entry->status != IB_WC_SUCCESS) > + to_idev(cq->ibcq.device)->n_wqe_errs++; > +} > + > +static void send_complete(unsigned long data) > +{ > + struct ipath_cq *cq = (struct ipath_cq *)data; > + > + /* > + * The completion handler will most likely rearm the notification > + * and poll for all pending entries. If a new completion entry > + * is added while we are in this routine, tasklet_schedule() > + * won't call us again until we return so we check triggered to > + * see if we need to call the handler again. > + */ > + for (;;) { > + u8 triggered = cq->triggered; > + > + cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context); > + > + if (cq->triggered == triggered) > + return; > + } > +} > + > +/* > + * This is the QP state transition table. > + * See ipath_modify_qp() for details. 
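A note on the CQ ring in ipath_cq_enter() above: head == tail means empty and (head + 1) % cqe == tail means full, so one slot is always sacrificed, e.g.:

	/* cq->ibcq.cqe == 4: slots 0..3, capacity 3 completions;
	 * a 4th ipath_cq_enter() before a poll hits the overflow
	 * branch and the entry is silently dropped. */

That silent drop probably wants fixing before this goes in, as the XXX there already notes.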
> + */ > +static const struct { > + int trans; > + u32 req_param[IB_QPT_RAW_IPV6]; > + u32 opt_param[IB_QPT_RAW_IPV6]; > +} qp_state_table[IB_QPS_ERR + 1][IB_QPS_ERR + 1] = { > + [IB_QPS_RESET] = { > + [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST }, > + [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR }, > + [IB_QPS_INIT] = { > + .trans = IPATH_TRANS_RST2INIT, > + .req_param = { > + [IB_QPT_SMI] = (IB_QP_PKEY_INDEX | > + IB_QP_QKEY), > + [IB_QPT_GSI] = (IB_QP_PKEY_INDEX | > + IB_QP_QKEY), > + [IB_QPT_UD] = (IB_QP_PKEY_INDEX | > + IB_QP_PORT | > + IB_QP_QKEY), > + [IB_QPT_UC] = (IB_QP_PKEY_INDEX | > + IB_QP_PORT | > + IB_QP_ACCESS_FLAGS), > + [IB_QPT_RC] = (IB_QP_PKEY_INDEX | > + IB_QP_PORT | > + IB_QP_ACCESS_FLAGS), > + }, > + }, > + }, > + [IB_QPS_INIT] = { > + [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST }, > + [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR }, > + [IB_QPS_INIT] = { > + .trans = IPATH_TRANS_INIT2INIT, > + .opt_param = { > + [IB_QPT_SMI] = (IB_QP_PKEY_INDEX | > + IB_QP_QKEY), > + [IB_QPT_GSI] = (IB_QP_PKEY_INDEX | > + IB_QP_QKEY), > + [IB_QPT_UD] = (IB_QP_PKEY_INDEX | > + IB_QP_PORT | > + IB_QP_QKEY), > + [IB_QPT_UC] = (IB_QP_PKEY_INDEX | > + IB_QP_PORT | > + IB_QP_ACCESS_FLAGS), > + [IB_QPT_RC] = (IB_QP_PKEY_INDEX | > + IB_QP_PORT | > + IB_QP_ACCESS_FLAGS), > + } > + }, > + [IB_QPS_RTR] = { > + .trans = IPATH_TRANS_INIT2RTR, > + .req_param = { > + [IB_QPT_UC] = (IB_QP_AV | > + IB_QP_PATH_MTU | > + IB_QP_DEST_QPN | > + IB_QP_RQ_PSN), > + [IB_QPT_RC] = (IB_QP_AV | > + IB_QP_PATH_MTU | > + IB_QP_DEST_QPN | > + IB_QP_RQ_PSN | > + IB_QP_MAX_DEST_RD_ATOMIC | > + IB_QP_MIN_RNR_TIMER), > + }, > + .opt_param = { > + [IB_QPT_SMI] = (IB_QP_PKEY_INDEX | > + IB_QP_QKEY), > + [IB_QPT_GSI] = (IB_QP_PKEY_INDEX | > + IB_QP_QKEY), > + [IB_QPT_UD] = (IB_QP_PKEY_INDEX | > + IB_QP_QKEY), > + [IB_QPT_UC] = (IB_QP_ALT_PATH | > + IB_QP_ACCESS_FLAGS | > + IB_QP_PKEY_INDEX), > + [IB_QPT_RC] = (IB_QP_ALT_PATH | > + IB_QP_ACCESS_FLAGS | > + IB_QP_PKEY_INDEX), > + } > + } > + }, > + [IB_QPS_RTR] = { > + [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST }, > + [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR }, > + [IB_QPS_RTS] = { > + .trans = IPATH_TRANS_RTR2RTS, > + .req_param = { > + [IB_QPT_SMI] = IB_QP_SQ_PSN, > + [IB_QPT_GSI] = IB_QP_SQ_PSN, > + [IB_QPT_UD] = IB_QP_SQ_PSN, > + [IB_QPT_UC] = IB_QP_SQ_PSN, > + [IB_QPT_RC] = (IB_QP_TIMEOUT | > + IB_QP_RETRY_CNT | > + IB_QP_RNR_RETRY | > + IB_QP_SQ_PSN | > + IB_QP_MAX_QP_RD_ATOMIC), > + }, > + .opt_param = { > + [IB_QPT_SMI] = (IB_QP_CUR_STATE | IB_QP_QKEY), > + [IB_QPT_GSI] = (IB_QP_CUR_STATE | IB_QP_QKEY), > + [IB_QPT_UD] = (IB_QP_CUR_STATE | IB_QP_QKEY), > + [IB_QPT_UC] = (IB_QP_CUR_STATE | > + IB_QP_ALT_PATH | > + IB_QP_ACCESS_FLAGS | > + IB_QP_PKEY_INDEX | > + IB_QP_PATH_MIG_STATE), > + [IB_QPT_RC] = (IB_QP_CUR_STATE | > + IB_QP_ALT_PATH | > + IB_QP_ACCESS_FLAGS | > + IB_QP_PKEY_INDEX | > + IB_QP_MIN_RNR_TIMER | > + IB_QP_PATH_MIG_STATE), > + } > + } > + }, > + [IB_QPS_RTS] = { > + [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST }, > + [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR }, > + [IB_QPS_RTS] = { > + .trans = IPATH_TRANS_RTS2RTS, > + .opt_param = { > + [IB_QPT_SMI] = (IB_QP_CUR_STATE | IB_QP_QKEY), > + [IB_QPT_GSI] = (IB_QP_CUR_STATE | IB_QP_QKEY), > + [IB_QPT_UD] = (IB_QP_CUR_STATE | IB_QP_QKEY), > + [IB_QPT_UC] = (IB_QP_ACCESS_FLAGS | > + IB_QP_ALT_PATH | > + IB_QP_PATH_MIG_STATE), > + [IB_QPT_RC] = (IB_QP_ACCESS_FLAGS | > + IB_QP_ALT_PATH | > + IB_QP_PATH_MIG_STATE | > + IB_QP_MIN_RNR_TIMER), > + } > + }, > + [IB_QPS_SQD] = { > + 
.trans = IPATH_TRANS_RTS2SQD, > + }, > + }, > + [IB_QPS_SQD] = { > + [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST }, > + [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR }, > + [IB_QPS_RTS] = { > + .trans = IPATH_TRANS_SQD2RTS, > + .opt_param = { > + [IB_QPT_SMI] = (IB_QP_CUR_STATE | IB_QP_QKEY), > + [IB_QPT_GSI] = (IB_QP_CUR_STATE | IB_QP_QKEY), > + [IB_QPT_UD] = (IB_QP_CUR_STATE | IB_QP_QKEY), > + [IB_QPT_UC] = (IB_QP_CUR_STATE | > + IB_QP_ALT_PATH | > + IB_QP_ACCESS_FLAGS | > + IB_QP_PATH_MIG_STATE), > + [IB_QPT_RC] = (IB_QP_CUR_STATE | > + IB_QP_ALT_PATH | > + IB_QP_ACCESS_FLAGS | > + IB_QP_MIN_RNR_TIMER | > + IB_QP_PATH_MIG_STATE), > + } > + }, > + [IB_QPS_SQD] = { > + .trans = IPATH_TRANS_SQD2SQD, > + .opt_param = { > + [IB_QPT_SMI] = (IB_QP_CUR_STATE | IB_QP_QKEY), > + [IB_QPT_GSI] = (IB_QP_CUR_STATE | IB_QP_QKEY), > + [IB_QPT_UD] = (IB_QP_PKEY_INDEX | IB_QP_QKEY), > + [IB_QPT_UC] = (IB_QP_AV | > + IB_QP_TIMEOUT | > + IB_QP_CUR_STATE | > + IB_QP_ALT_PATH | > + IB_QP_ACCESS_FLAGS | > + IB_QP_PKEY_INDEX | > + IB_QP_PATH_MIG_STATE), > + [IB_QPT_RC] = (IB_QP_AV | > + IB_QP_TIMEOUT | > + IB_QP_RETRY_CNT | > + IB_QP_RNR_RETRY | > + IB_QP_MAX_QP_RD_ATOMIC | > + IB_QP_MAX_DEST_RD_ATOMIC | > + IB_QP_CUR_STATE | > + IB_QP_ALT_PATH | > + IB_QP_ACCESS_FLAGS | > + IB_QP_PKEY_INDEX | > + IB_QP_MIN_RNR_TIMER | > + IB_QP_PATH_MIG_STATE), > + } > + } > + }, > + [IB_QPS_SQE] = { > + [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST }, > + [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR }, > + [IB_QPS_RTS] = { > + .trans = IPATH_TRANS_SQERR2RTS, > + .opt_param = { > + [IB_QPT_SMI] = (IB_QP_CUR_STATE | IB_QP_QKEY), > + [IB_QPT_GSI] = (IB_QP_CUR_STATE | IB_QP_QKEY), > + [IB_QPT_UD] = (IB_QP_CUR_STATE | IB_QP_QKEY), > + [IB_QPT_UC] = IB_QP_CUR_STATE, > + [IB_QPT_RC] = (IB_QP_CUR_STATE | > + IB_QP_MIN_RNR_TIMER), > + } > + } > + }, > + [IB_QPS_ERR] = { > + [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST }, > + [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR } > + } > +}; > + > +/* > + * Initialize the QP state to the reset state. > + */ > +static void ipath_reset_qp(struct ipath_qp *qp) > +{ > + qp->remote_qpn = 0; > + qp->qkey = 0; > + qp->qp_access_flags = 0; > + qp->s_hdrwords = 0; > + qp->s_psn = 0; > + qp->r_psn = 0; > + atomic_set(&qp->msn, 0); > + if (qp->ibqp.qp_type == IB_QPT_RC) { > + qp->s_state = IB_OPCODE_RC_SEND_LAST; > + qp->r_state = IB_OPCODE_RC_SEND_LAST; > + } else { > + qp->s_state = IB_OPCODE_UC_SEND_LAST; > + qp->r_state = IB_OPCODE_UC_SEND_LAST; > + } > + qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; > + qp->s_nak_state = 0; > + qp->s_rnr_timeout = 0; > + qp->s_head = 0; > + qp->s_tail = 0; > + qp->s_cur = 0; > + qp->s_last = 0; > + qp->s_ssn = 1; > + qp->s_lsn = 0; > + qp->r_rq.head = 0; > + qp->r_rq.tail = 0; > + qp->r_reuse_sge = 0; > +} > + > +/* > + * Flush send work queue. > + * The QP s_lock should be held. > + */ > +static void ipath_sqerror_qp(struct ipath_qp *qp, struct ib_wc *wc) > +{ > + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); > + struct ipath_swqe *wqe = get_swqe_ptr(qp, qp->s_last); > + > + _VERBS_INFO("Send queue error on QP%d/%d: err: %d\n", > + qp->ibqp.qp_num, qp->remote_qpn, wc->status); > + > + spin_lock(&dev->pending_lock); > + /* XXX What if its already removed by the timeout code? 
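Reading one row of the state table above: the RESET -> INIT transition for an RC QP requires exactly IB_QP_PKEY_INDEX, IB_QP_PORT and IB_QP_ACCESS_FLAGS, so a consumer would do something like this sketch:

	struct ib_qp_attr attr = {
		.qp_state	 = IB_QPS_INIT,
		.pkey_index	 = 0,
		.port_num	 = 1,
		.qp_access_flags = IB_ACCESS_REMOTE_WRITE,
	};

	ret = ib_modify_qp(qp, &attr,
			   IB_QP_STATE | IB_QP_PKEY_INDEX |
			   IB_QP_PORT | IB_QP_ACCESS_FLAGS);

ipath_modify_qp() below rejects a mask that is missing a required bit or that carries anything outside req_param | opt_param | IB_QP_STATE.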
*/ > + if (qp->timerwait.next != LIST_POISON1) > + list_del(&qp->timerwait); > + if (qp->piowait.next != LIST_POISON1) > + list_del(&qp->piowait); > + spin_unlock(&dev->pending_lock); > + > + ipath_cq_enter(to_icq(qp->ibqp.send_cq), wc, 1); > + if (++qp->s_last >= qp->s_size) > + qp->s_last = 0; > + > + wc->status = IB_WC_WR_FLUSH_ERR; > + > + while (qp->s_last != qp->s_head) { > + wc->wr_id = wqe->wr.wr_id; > + wc->opcode = wc_opcode[wqe->wr.opcode]; > + ipath_cq_enter(to_icq(qp->ibqp.send_cq), wc, 1); > + if (++qp->s_last >= qp->s_size) > + qp->s_last = 0; > + wqe = get_swqe_ptr(qp, qp->s_last); > + } > + qp->s_cur = qp->s_tail = qp->s_head; > + qp->state = IB_QPS_SQE; > +} > + > +/* > + * Flush both send and receive work queues. > + * QP r_rq.lock and s_lock should be held. > + */ > +static void ipath_error_qp(struct ipath_qp *qp) > +{ > + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); > + struct ib_wc wc; > + > + _VERBS_INFO("QP%d/%d in error state\n", > + qp->ibqp.qp_num, qp->remote_qpn); > + > + spin_lock(&dev->pending_lock); > + /* XXX What if its already removed by the timeout code? */ > + if (qp->timerwait.next != LIST_POISON1) > + list_del(&qp->timerwait); > + if (qp->piowait.next != LIST_POISON1) > + list_del(&qp->piowait); > + spin_unlock(&dev->pending_lock); > + > + wc.status = IB_WC_WR_FLUSH_ERR; > + wc.vendor_err = 0; > + wc.byte_len = 0; > + wc.imm_data = 0; > + wc.qp_num = qp->ibqp.qp_num; > + wc.src_qp = 0; > + wc.wc_flags = 0; > + wc.pkey_index = 0; > + wc.slid = 0; > + wc.sl = 0; > + wc.dlid_path_bits = 0; > + wc.port_num = 0; > + > + while (qp->s_last != qp->s_head) { > + struct ipath_swqe *wqe = get_swqe_ptr(qp, qp->s_last); > + > + wc.wr_id = wqe->wr.wr_id; > + wc.opcode = wc_opcode[wqe->wr.opcode]; > + if (++qp->s_last >= qp->s_size) > + qp->s_last = 0; > + ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, 1); > + } > + qp->s_cur = qp->s_tail = qp->s_head; > + qp->s_hdrwords = 0; > + qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; > + > + wc.opcode = IB_WC_RECV; > + while (qp->r_rq.tail != qp->r_rq.head) { > + wc.wr_id = get_rwqe_ptr(&qp->r_rq, qp->r_rq.tail)->wr_id; > + if (++qp->r_rq.tail >= qp->r_rq.size) > + qp->r_rq.tail = 0; > + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); > + } > +} > + > +static int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, > + int attr_mask) > +{ > + struct ipath_qp *qp = to_iqp(ibqp); > + enum ib_qp_state cur_state, new_state; > + u32 req_param, opt_param; > + unsigned long flags; > + > + if (attr_mask & IB_QP_CUR_STATE) { > + cur_state = attr->cur_qp_state; > + if (cur_state != IB_QPS_RTR && > + cur_state != IB_QPS_RTS && > + cur_state != IB_QPS_SQD && cur_state != IB_QPS_SQE) > + return -EINVAL; > + spin_lock_irqsave(&qp->r_rq.lock, flags); > + spin_lock(&qp->s_lock); > + } else { > + spin_lock_irqsave(&qp->r_rq.lock, flags); > + spin_lock(&qp->s_lock); > + cur_state = qp->state; > + } > + > + if (attr_mask & IB_QP_STATE) { > + new_state = attr->qp_state; > + if (new_state < 0 || new_state > IB_QPS_ERR) > + goto inval; > + } else > + new_state = cur_state; > + > + switch (qp_state_table[cur_state][new_state].trans) { > + case IPATH_TRANS_INVALID: > + goto inval; > + > + case IPATH_TRANS_ANY2RST: > + ipath_reset_qp(qp); > + break; > + > + case IPATH_TRANS_ANY2ERR: > + ipath_error_qp(qp); > + break; > + > + } > + > + req_param = > + qp_state_table[cur_state][new_state].req_param[qp->ibqp.qp_type]; > + opt_param = > + qp_state_table[cur_state][new_state].opt_param[qp->ibqp.qp_type]; > + > + if ((req_param & attr_mask) 
!= req_param) > + goto inval; > + > + if (attr_mask & ~(req_param | opt_param | IB_QP_STATE)) > + goto inval; > + > + if (attr_mask & IB_QP_PKEY_INDEX) { > + struct ipath_ibdev *dev = to_idev(ibqp->device); > + > + if (attr->pkey_index >= ipath_layer_get_npkeys(dev->ib_unit)) > + goto inval; > + qp->s_pkey_index = attr->pkey_index; > + } > + > + if (attr_mask & IB_QP_DEST_QPN) > + qp->remote_qpn = attr->dest_qp_num; > + > + if (attr_mask & IB_QP_SQ_PSN) { > + qp->s_next_psn = attr->sq_psn; > + qp->s_last_psn = qp->s_next_psn - 1; > + } > + > + if (attr_mask & IB_QP_RQ_PSN) > + qp->r_psn = attr->rq_psn; > + > + if (attr_mask & IB_QP_ACCESS_FLAGS) > + qp->qp_access_flags = attr->qp_access_flags; > + > + if (attr_mask & IB_QP_AV) > + qp->remote_ah_attr = attr->ah_attr; > + > + if (attr_mask & IB_QP_PATH_MTU) > + qp->path_mtu = attr->path_mtu; > + > + if (attr_mask & IB_QP_RETRY_CNT) > + qp->s_retry = qp->s_retry_cnt = attr->retry_cnt; > + > + if (attr_mask & IB_QP_RNR_RETRY) { > + qp->s_rnr_retry = attr->rnr_retry; > + if (qp->s_rnr_retry > 7) > + qp->s_rnr_retry = 7; > + qp->s_rnr_retry_cnt = qp->s_rnr_retry; > + } > + > + if (attr_mask & IB_QP_MIN_RNR_TIMER) > + qp->s_min_rnr_timer = attr->min_rnr_timer & 0x1F; > + > + if (attr_mask & IB_QP_QKEY) > + qp->qkey = attr->qkey; > + > + if (attr_mask & IB_QP_PKEY_INDEX) > + qp->s_pkey_index = attr->pkey_index; > + > + qp->state = new_state; > + spin_unlock(&qp->s_lock); > + spin_unlock_irqrestore(&qp->r_rq.lock, flags); > + > + /* > + * Try to move to ARMED if QP1 changed to the RTS state. > + */ > + if (qp->ibqp.qp_num == 1 && new_state == IB_QPS_RTS) { > + struct ipath_ibdev *dev = to_idev(ibqp->device); > + > + /* > + * Bounce the link even if it was active so the SM will > + * reinitialize the SMA's state. > + */ > + ipath_kset_linkstate((dev->ib_unit << 16) | IPATH_IB_LINKDOWN); > + ipath_kset_linkstate((dev->ib_unit << 16) | IPATH_IB_LINKARM); > + } > + return 0; > + > +inval: > + spin_unlock(&qp->s_lock); > + spin_unlock_irqrestore(&qp->r_rq.lock, flags); > + return -EINVAL; > +} > + > +/* > + * Compute the AETH (syndrome + MSN). > + * The QP s_lock should be held. > + */ > +static u32 ipath_compute_aeth(struct ipath_qp *qp) > +{ > + u32 aeth = atomic_read(&qp->msn) & 0xFFFFFF; > + > + if (qp->s_nak_state) { > + aeth |= qp->s_nak_state << 24; > + } else if (qp->ibqp.srq) { > + /* Shared receive queues don't generate credits. */ > + aeth |= 0x1F << 24; > + } else { > + u32 min, max, x; > + u32 credits; > + > + /* > + * Compute the number of credits available (RWQEs). > + * XXX Not holding the r_rq.lock here so there is a small > + * chance that the pair of reads are not atomic. > + */ > + credits = qp->r_rq.head - qp->r_rq.tail; > + if ((int)credits < 0) > + credits += qp->r_rq.size; > + /* Binary search the credit table to find the code to use. 
*/ > + min = 0; > + max = 31; > + for (;;) { > + x = (min + max) / 2; > + if (credit_table[x] == credits) > + break; > + if (credit_table[x] > credits) > + max = x; > + else if (min == x) > + break; > + else > + min = x; > + } > + aeth |= x << 24; > + } > + return cpu_to_be32(aeth); > +} > + > + > +static void no_bufs_available(struct ipath_qp *qp, struct ipath_ibdev *dev) > +{ > + unsigned long flags; > + > + spin_lock_irqsave(&dev->pending_lock, flags); > + if (qp->piowait.next == LIST_POISON1) > + list_add_tail(&qp->piowait, &dev->piowait); > + spin_unlock_irqrestore(&dev->pending_lock, flags); > + /* > + * Note that as soon as ipath_layer_want_buffer() is called and > + * possibly before it returns, ipath_ib_piobufavail() > + * could be called. If we are still in the tasklet function, > + * tasklet_schedule() will not call us until the next time > + * tasklet_schedule() is called. > + * We clear the tasklet flag now since we are committing to return > + * from the tasklet function. > + */ > + tasklet_unlock(&qp->s_task); > + ipath_layer_want_buffer(dev->ib_unit); > + dev->n_piowait++; > +} > + > +/* > + * Process entries in the send work queue until the queue is exhausted. > + * Only allow one CPU to send a packet per QP (tasklet). > + * Otherwise, after we drop the QP lock, two threads could send > + * packets out of order. > + * This is similar to do_rc_send() below except we don't have timeouts or > + * resends. > + */ > +static void do_uc_send(unsigned long data) > +{ > + struct ipath_qp *qp = (struct ipath_qp *)data; > + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); > + struct ipath_swqe *wqe; > + unsigned long flags; > + u16 lrh0; > + u32 hwords; > + u32 nwords; > + u32 extra_bytes; > + u32 bth0; > + u32 bth2; > + u32 pmtu = ib_mtu_enum_to_int(qp->path_mtu); > + u32 len; > + struct ipath_other_headers *ohdr; > + struct ib_wc wc; > + > + if (test_and_set_bit(IPATH_S_BUSY, &qp->s_flags)) > + return; > + > + if (unlikely(qp->remote_ah_attr.dlid == > + ipath_layer_get_lid(dev->ib_unit))) { > + /* Pass in an uninitialized ib_wc to save stack space. */ > + ipath_ruc_loopback(qp, &wc); > + clear_bit(IPATH_S_BUSY, &qp->s_flags); > + return; > + } > + > + ohdr = &qp->s_hdr.u.oth; > + if (qp->remote_ah_attr.ah_flags & IB_AH_GRH) > + ohdr = &qp->s_hdr.u.l.oth; > + > +again: > + /* Check for a constructed packet to be sent. */ > + if (qp->s_hdrwords != 0) { > + /* > + * If no PIO bufs are available, return. > + * An interrupt will call ipath_ib_piobufavail() > + * when one is available. > + */ > + if (ipath_verbs_send(dev->ib_unit, qp->s_hdrwords, > + (uint32_t *) &qp->s_hdr, > + qp->s_cur_size, qp->s_cur_sge)) { > + no_bufs_available(qp, dev); > + return; > + } > + /* Record that we sent the packet and s_hdr is empty. */ > + qp->s_hdrwords = 0; > + } > + > + lrh0 = IPS_LRH_BTH; > + /* header size in 32-bit words LRH+BTH = (8+12)/4. */ > + hwords = 5; > + > + /* > + * The lock is needed to synchronize between > + * setting qp->s_ack_state and post_send(). > + */ > + spin_lock_irqsave(&qp->s_lock, flags); > + > + if (!(state_ops[qp->state] & IPATH_PROCESS_SEND_OK)) > + goto done; > + > + bth0 = ipath_layer_get_pkey(dev->ib_unit, qp->s_pkey_index); > + > + /* Send a request. */ > + wqe = get_swqe_ptr(qp, qp->s_last); > + switch (qp->s_state) { > + default: > + /* Signal the completion of the last send (if there is one). 
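The binary search in ipath_compute_aeth() above rounds the credit count down to the nearest table entry: for credits == 100 the loop visits x = 15 (192 > 100), 7, 11, 13, 14 (128 > 100), then min == x == 13 and it breaks, so code 0xD goes out and 96 credits are advertised, never more than are actually available. A quick userspace check of the same loop (same table, stdio only):

	#include <stdio.h>

	static unsigned t[31] = { 0, 1, 2, 3, 4, 6, 8, 12, 16, 24, 32,
		48, 64, 96, 128, 192, 256, 384, 512, 768, 1024, 1536,
		2048, 3072, 4096, 6144, 8192, 12288, 16384, 24576, 32768 };

	int main(void)
	{
		unsigned credits = 100, min = 0, max = 31, x;

		for (;;) {
			x = (min + max) / 2;
			if (t[x] == credits)
				break;
			if (t[x] > credits)
				max = x;
			else if (min == x)
				break;
			else
				min = x;
		}
		/* Prints: code 0xD -> 96 credits */
		printf("code 0x%X -> %u credits\n", x, t[x]);
		return 0;
	}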
*/ > + if (qp->s_last != qp->s_tail) { > + if (++qp->s_last == qp->s_size) > + qp->s_last = 0; > + if (!test_bit(IPATH_S_SIGNAL_REQ_WR, &qp->s_flags) || > + (wqe->wr.send_flags & IB_SEND_SIGNALED)) { > + wc.wr_id = wqe->wr.wr_id; > + wc.status = IB_WC_SUCCESS; > + wc.opcode = wc_opcode[wqe->wr.opcode]; > + wc.vendor_err = 0; > + wc.byte_len = wqe->length; > + wc.qp_num = qp->ibqp.qp_num; > + wc.src_qp = qp->remote_qpn; > + wc.pkey_index = 0; > + wc.slid = qp->remote_ah_attr.dlid; > + wc.sl = qp->remote_ah_attr.sl; > + wc.dlid_path_bits = 0; > + wc.port_num = 0; > + ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, > + 0); > + } > + wqe = get_swqe_ptr(qp, qp->s_last); > + } > + /* Check if send work queue is empty. */ > + if (qp->s_tail == qp->s_head) > + goto done; > + /* > + * Start a new request. > + */ > + qp->s_psn = wqe->psn = qp->s_next_psn; > + qp->s_sge.sge = wqe->sg_list[0]; > + qp->s_sge.sg_list = wqe->sg_list + 1; > + qp->s_sge.num_sge = wqe->wr.num_sge; > + qp->s_len = len = wqe->length; > + switch (wqe->wr.opcode) { > + case IB_WR_SEND: > + case IB_WR_SEND_WITH_IMM: > + if (len > pmtu) { > + qp->s_state = IB_OPCODE_UC_SEND_FIRST; > + len = pmtu; > + break; > + } > + if (wqe->wr.opcode == IB_WR_SEND) { > + qp->s_state = IB_OPCODE_UC_SEND_ONLY; > + } else { > + qp->s_state = > + IB_OPCODE_UC_SEND_ONLY_WITH_IMMEDIATE; > + /* Immediate data comes after the BTH */ > + ohdr->u.imm_data = wqe->wr.imm_data; > + hwords += 1; > + } > + if (wqe->wr.send_flags & IB_SEND_SOLICITED) > + bth0 |= 1 << 23; > + break; > + > + case IB_WR_RDMA_WRITE: > + case IB_WR_RDMA_WRITE_WITH_IMM: > + ohdr->u.rc.reth.vaddr = > + cpu_to_be64(wqe->wr.wr.rdma.remote_addr); > + ohdr->u.rc.reth.rkey = > + cpu_to_be32(wqe->wr.wr.rdma.rkey); > + ohdr->u.rc.reth.length = cpu_to_be32(len); > + hwords += sizeof(struct ib_reth) / 4; > + if (len > pmtu) { > + qp->s_state = IB_OPCODE_UC_RDMA_WRITE_FIRST; > + len = pmtu; > + break; > + } > + if (wqe->wr.opcode == IB_WR_RDMA_WRITE) { > + qp->s_state = IB_OPCODE_UC_RDMA_WRITE_ONLY; > + } else { > + qp->s_state = > + IB_OPCODE_UC_RDMA_WRITE_ONLY_WITH_IMMEDIATE; > + /* Immediate data comes after the RETH */ > + ohdr->u.rc.imm_data = wqe->wr.imm_data; > + hwords += 1; > + if (wqe->wr.send_flags & IB_SEND_SOLICITED) > + bth0 |= 1 << 23; > + } > + break; > + > + default: > + goto done; > + } > + if (++qp->s_tail >= qp->s_size) > + qp->s_tail = 0; > + break; > + > + case IB_OPCODE_UC_SEND_FIRST: > + qp->s_state = IB_OPCODE_UC_SEND_MIDDLE; > + /* FALLTHROUGH */ > + case IB_OPCODE_UC_SEND_MIDDLE: > + len = qp->s_len; > + if (len > pmtu) { > + len = pmtu; > + break; > + } > + if (wqe->wr.opcode == IB_WR_SEND) > + qp->s_state = IB_OPCODE_UC_SEND_LAST; > + else { > + qp->s_state = IB_OPCODE_UC_SEND_LAST_WITH_IMMEDIATE; > + /* Immediate data comes after the BTH */ > + ohdr->u.imm_data = wqe->wr.imm_data; > + hwords += 1; > + } > + if (wqe->wr.send_flags & IB_SEND_SOLICITED) > + bth0 |= 1 << 23; > + break; > + > + case IB_OPCODE_UC_RDMA_WRITE_FIRST: > + qp->s_state = IB_OPCODE_UC_RDMA_WRITE_MIDDLE; > + /* FALLTHROUGH */ > + case IB_OPCODE_UC_RDMA_WRITE_MIDDLE: > + len = qp->s_len; > + if (len > pmtu) { > + len = pmtu; > + break; > + } > + if (wqe->wr.opcode == IB_WR_RDMA_WRITE) > + qp->s_state = IB_OPCODE_UC_RDMA_WRITE_LAST; > + else { > + qp->s_state = > + IB_OPCODE_UC_RDMA_WRITE_LAST_WITH_IMMEDIATE; > + /* Immediate data comes after the BTH */ > + ohdr->u.imm_data = wqe->wr.imm_data; > + hwords += 1; > + if (wqe->wr.send_flags & IB_SEND_SOLICITED) > + bth0 |= 1 << 23; > + } > + break; > + 
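To make the UC FIRST/MIDDLE/LAST progression above concrete: a 9000-byte IB_WR_SEND with a 4096-byte path MTU leaves here as three packets,

	UC_SEND_FIRST   4096 bytes   (s_len 9000 -> 4904)
	UC_SEND_MIDDLE  4096 bytes   (s_len 4904 ->  808)
	UC_SEND_LAST     808 bytes   (s_len  808 ->    0)

with any immediate data carried only on the LAST (or ONLY) packet.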
} > + bth2 = qp->s_next_psn++ & 0xFFFFFF; > + qp->s_len -= len; > + bth0 |= qp->s_state << 24; > + > + spin_unlock_irqrestore(&qp->s_lock, flags); > + > + /* Construct the header. */ > + extra_bytes = (4 - len) & 3; > + nwords = (len + extra_bytes) >> 2; > + if (unlikely(qp->remote_ah_attr.ah_flags & IB_AH_GRH)) { > + /* Header size in 32-bit words. */ > + hwords += 10; > + lrh0 = IPS_LRH_GRH; > + qp->s_hdr.u.l.grh.version_tclass_flow = > + cpu_to_be32((6 << 28) | > + (qp->remote_ah_attr.grh.traffic_class << 20) | > + qp->remote_ah_attr.grh.flow_label); > + qp->s_hdr.u.l.grh.paylen = > + cpu_to_be16(((hwords - 12) + nwords + SIZE_OF_CRC) << 2); > + qp->s_hdr.u.l.grh.next_hdr = 0x1B; > + qp->s_hdr.u.l.grh.hop_limit = qp->remote_ah_attr.grh.hop_limit; > + /* The SGID is 32-bit aligned. */ > + qp->s_hdr.u.l.grh.sgid.global.subnet_prefix = dev->gid_prefix; > + qp->s_hdr.u.l.grh.sgid.global.interface_id = > + ipath_layer_get_guid(dev->ib_unit); > + qp->s_hdr.u.l.grh.dgid = qp->remote_ah_attr.grh.dgid; > + } > + qp->s_hdrwords = hwords; > + qp->s_cur_sge = &qp->s_sge; > + qp->s_cur_size = len; > + lrh0 |= qp->remote_ah_attr.sl << 4; > + qp->s_hdr.lrh[0] = cpu_to_be16(lrh0); > + /* DEST LID */ > + qp->s_hdr.lrh[1] = cpu_to_be16(qp->remote_ah_attr.dlid); > + qp->s_hdr.lrh[2] = cpu_to_be16(hwords + nwords + SIZE_OF_CRC); > + qp->s_hdr.lrh[3] = cpu_to_be16(ipath_layer_get_lid(dev->ib_unit)); > + bth0 |= extra_bytes << 20; > + ohdr->bth[0] = cpu_to_be32(bth0); > + ohdr->bth[1] = cpu_to_be32(qp->remote_qpn); > + ohdr->bth[2] = cpu_to_be32(bth2); > + > + /* Check for more work to do. */ > + goto again; > + > +done: > + spin_unlock_irqrestore(&qp->s_lock, flags); > + clear_bit(IPATH_S_BUSY, &qp->s_flags); > +} > + > +/* > + * Process entries in the send work queue until credit or queue is exhausted. > + * Only allow one CPU to send a packet per QP (tasklet). > + * Otherwise, after we drop the QP s_lock, two threads could send > + * packets out of order. > + */ > +static void do_rc_send(unsigned long data) > +{ > + struct ipath_qp *qp = (struct ipath_qp *)data; > + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); > + struct ipath_swqe *wqe; > + struct ipath_sge_state *ss; > + unsigned long flags; > + u16 lrh0; > + u32 hwords; > + u32 nwords; > + u32 extra_bytes; > + u32 bth0; > + u32 bth2; > + u32 pmtu = ib_mtu_enum_to_int(qp->path_mtu); > + u32 len; > + struct ipath_other_headers *ohdr; > + char newreq; > + > + if (test_and_set_bit(IPATH_S_BUSY, &qp->s_flags)) > + return; > + > + if (unlikely(qp->remote_ah_attr.dlid == > + ipath_layer_get_lid(dev->ib_unit))) { > + struct ib_wc wc; > + > + /* > + * Pass in an uninitialized ib_wc to be consistent with > + * other places where ipath_ruc_loopback() is called. > + */ > + ipath_ruc_loopback(qp, &wc); > + clear_bit(IPATH_S_BUSY, &qp->s_flags); > + return; > + } > + > + ohdr = &qp->s_hdr.u.oth; > + if (qp->remote_ah_attr.ah_flags & IB_AH_GRH) > + ohdr = &qp->s_hdr.u.l.oth; > + > +again: > + /* Check for a constructed packet to be sent. */ > + if (qp->s_hdrwords != 0) { > + /* > + * If no PIO bufs are available, return. > + * An interrupt will call ipath_ib_piobufavail() > + * when one is available. > + */ > + if (ipath_verbs_send(dev->ib_unit, qp->s_hdrwords, > + (uint32_t *) &qp->s_hdr, > + qp->s_cur_size, qp->s_cur_sge)) { > + no_bufs_available(qp, dev); > + return; > + } > + /* Record that we sent the packet and s_hdr is empty. */ > + qp->s_hdrwords = 0; > + } > + > + lrh0 = IPS_LRH_BTH; > + /* header size in 32-bit words LRH+BTH = (8+12)/4. 
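Sanity-checking the GRH paylen arithmetic above (assuming SIZE_OF_CRC is 1, i.e. one 32-bit ICRC word): a 256-byte UC_SEND_ONLY with a GRH has hwords = 5 + 10 = 15 and nwords = 64, so

	paylen = ((15 - 12) + 64 + 1) << 2 = 272 bytes

which is BTH (12) + payload (256) + ICRC (4); the 12 subtracted is the LRH (2 words) plus the GRH itself (10 words), neither of which belongs in the GRH payload length.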
*/ > + hwords = 5; > + > + /* > + * The lock is needed to synchronize between > + * setting qp->s_ack_state, resend timer, and post_send(). > + */ > + spin_lock_irqsave(&qp->s_lock, flags); > + > + bth0 = ipath_layer_get_pkey(dev->ib_unit, qp->s_pkey_index); > + > + /* Sending responses has higher priority over sending requests. */ > + if (qp->s_ack_state != IB_OPCODE_RC_ACKNOWLEDGE) { > + /* > + * Send a response. > + * Note that we are in the responder's side of the QP context. > + */ > + switch (qp->s_ack_state) { > + case IB_OPCODE_RC_RDMA_READ_REQUEST: > + ss = &qp->s_rdma_sge; > + len = qp->s_rdma_len; > + if (len > pmtu) { > + len = pmtu; > + qp->s_ack_state = > + IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST; > + } else { > + qp->s_ack_state = > + IB_OPCODE_RC_RDMA_READ_RESPONSE_ONLY; > + } > + qp->s_rdma_len -= len; > + bth0 |= qp->s_ack_state << 24; > + ohdr->u.aeth = ipath_compute_aeth(qp); > + hwords++; > + break; > + > + case IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST: > + qp->s_ack_state = > + IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE; > + /* FALLTHROUGH */ > + case IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE: > + ss = &qp->s_rdma_sge; > + len = qp->s_rdma_len; > + if (len > pmtu) { > + len = pmtu; > + } else { > + ohdr->u.aeth = ipath_compute_aeth(qp); > + hwords++; > + qp->s_ack_state = > + IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST; > + } > + qp->s_rdma_len -= len; > + bth0 |= qp->s_ack_state << 24; > + break; > + > + case IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST: > + case IB_OPCODE_RC_RDMA_READ_RESPONSE_ONLY: > + /* > + * We have to prevent new requests from changing > + * the r_sge state while a ipath_verbs_send() > + * is in progress. > + * Changing r_state allows the receiver > + * to continue processing new packets. > + * We do it here now instead of above so > + * that we are sure the packet was sent before > + * changing the state. > + */ > + qp->r_state = IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST; > + qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; > + goto send_req; > + > + case IB_OPCODE_RC_COMPARE_SWAP: > + case IB_OPCODE_RC_FETCH_ADD: > + ss = NULL; > + len = 0; > + qp->r_state = IB_OPCODE_RC_SEND_LAST; > + qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; > + bth0 |= IB_OPCODE_ATOMIC_ACKNOWLEDGE << 24; > + ohdr->u.at.aeth = ipath_compute_aeth(qp); > + ohdr->u.at.atomic_ack_eth = > + cpu_to_be64(qp->s_ack_atomic); > + hwords += sizeof(ohdr->u.at) / 4; > + break; > + > + default: > + /* Send a regular ACK. */ > + ss = NULL; > + len = 0; > + qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; > + bth0 |= qp->s_ack_state << 24; > + ohdr->u.aeth = ipath_compute_aeth(qp); > + hwords++; > + } > + bth2 = qp->s_ack_psn++ & 0xFFFFFF; > + } else { > + send_req: > + if (!(state_ops[qp->state] & IPATH_PROCESS_SEND_OK) || > + qp->s_rnr_timeout) > + goto done; > + > + /* Send a request. */ > + wqe = get_swqe_ptr(qp, qp->s_cur); > + switch (qp->s_state) { > + default: > + /* > + * Resend an old request or start a new one. > + * > + * We keep track of the current SWQE so that > + * we don't reset the "furthest progress" state > + * if we need to back up. > + */ > + newreq = 0; > + if (qp->s_cur == qp->s_tail) { > + /* Check if send work queue is empty. */ > + if (qp->s_tail == qp->s_head) > + goto done; > + qp->s_psn = wqe->psn = qp->s_next_psn; > + newreq = 1; > + } > + /* > + * Note that we have to be careful not to modify the > + * original work request since we may need to resend > + * it. 
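The responder side of do_rc_send() above segments RDMA read responses the same way as the UC sender. For example, a 12288-byte read request with a 4096-byte MTU produces

	RDMA_READ_RESPONSE_FIRST   4096 bytes + AETH
	RDMA_READ_RESPONSE_MIDDLE  4096 bytes
	RDMA_READ_RESPONSE_LAST    4096 bytes + AETH

matching the spec's rule that the AETH appears in FIRST, LAST and ONLY response packets but not in MIDDLE.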
> + */ > + qp->s_sge.sge = wqe->sg_list[0]; > + qp->s_sge.sg_list = wqe->sg_list + 1; > + qp->s_sge.num_sge = wqe->wr.num_sge; > + qp->s_len = len = wqe->length; > + ss = &qp->s_sge; > + bth2 = 0; > + switch (wqe->wr.opcode) { > + case IB_WR_SEND: > + case IB_WR_SEND_WITH_IMM: > + /* If no credit, return. */ > + if (qp->s_lsn != (u32) -1 && > + cmp24(wqe->ssn, qp->s_lsn + 1) > 0) { > + goto done; > + } > + wqe->lpsn = wqe->psn; > + if (len > pmtu) { > + wqe->lpsn += (len - 1) / pmtu; > + qp->s_state = IB_OPCODE_RC_SEND_FIRST; > + len = pmtu; > + break; > + } > + if (wqe->wr.opcode == IB_WR_SEND) { > + qp->s_state = IB_OPCODE_RC_SEND_ONLY; > + } else { > + qp->s_state = > + IB_OPCODE_RC_SEND_ONLY_WITH_IMMEDIATE; > + /* Immediate data comes after the BTH */ > + ohdr->u.imm_data = wqe->wr.imm_data; > + hwords += 1; > + } > + if (wqe->wr.send_flags & IB_SEND_SOLICITED) > + bth0 |= 1 << 23; > + bth2 = 1 << 31; /* Request ACK. */ > + if (++qp->s_cur == qp->s_size) > + qp->s_cur = 0; > + break; > + > + case IB_WR_RDMA_WRITE: > + if (newreq) > + qp->s_lsn++; > + /* FALLTHROUGH */ > + case IB_WR_RDMA_WRITE_WITH_IMM: > + /* If no credit, return. */ > + if (qp->s_lsn != (u32) -1 && > + cmp24(wqe->ssn, qp->s_lsn + 1) > 0) { > + goto done; > + } > + ohdr->u.rc.reth.vaddr = > + cpu_to_be64(wqe->wr.wr.rdma.remote_addr); > + ohdr->u.rc.reth.rkey = > + cpu_to_be32(wqe->wr.wr.rdma.rkey); > + ohdr->u.rc.reth.length = cpu_to_be32(len); > + hwords += sizeof(struct ib_reth) / 4; > + wqe->lpsn = wqe->psn; > + if (len > pmtu) { > + wqe->lpsn += (len - 1) / pmtu; > + qp->s_state = > + IB_OPCODE_RC_RDMA_WRITE_FIRST; > + len = pmtu; > + break; > + } > + if (wqe->wr.opcode == IB_WR_RDMA_WRITE) { > + qp->s_state = > + IB_OPCODE_RC_RDMA_WRITE_ONLY; > + } else { > + qp->s_state = > + IB_OPCODE_RC_RDMA_WRITE_ONLY_WITH_IMMEDIATE; > + /* Immediate data comes after RETH */ > + ohdr->u.rc.imm_data = wqe->wr.imm_data; > + hwords += 1; > + if (wqe->wr. > + send_flags & IB_SEND_SOLICITED) > + bth0 |= 1 << 23; > + } > + bth2 = 1 << 31; /* Request ACK. */ > + if (++qp->s_cur == qp->s_size) > + qp->s_cur = 0; > + break; > + > + case IB_WR_RDMA_READ: > + ohdr->u.rc.reth.vaddr = > + cpu_to_be64(wqe->wr.wr.rdma.remote_addr); > + ohdr->u.rc.reth.rkey = > + cpu_to_be32(wqe->wr.wr.rdma.rkey); > + ohdr->u.rc.reth.length = cpu_to_be32(len); > + qp->s_state = IB_OPCODE_RC_RDMA_READ_REQUEST; > + hwords += sizeof(ohdr->u.rc.reth) / 4; > + if (newreq) { > + qp->s_lsn++; > + /* > + * Adjust s_next_psn to count the > + * expected number of responses. > + */ > + if (len > pmtu) > + qp->s_next_psn += > + (len - 1) / pmtu; > + wqe->lpsn = qp->s_next_psn++; > + } > + ss = NULL; > + len = 0; > + if (++qp->s_cur == qp->s_size) > + qp->s_cur = 0; > + break; > + > + case IB_WR_ATOMIC_CMP_AND_SWP: > + case IB_WR_ATOMIC_FETCH_AND_ADD: > + qp->s_state = > + wqe->wr.opcode == IB_WR_ATOMIC_CMP_AND_SWP ? 
> + IB_OPCODE_RC_COMPARE_SWAP : > + IB_OPCODE_RC_FETCH_ADD; > + ohdr->u.atomic_eth.vaddr = > + cpu_to_be64(wqe->wr.wr.atomic.remote_addr); > + ohdr->u.atomic_eth.rkey = > + cpu_to_be32(wqe->wr.wr.atomic.rkey); > + ohdr->u.atomic_eth.swap_data = > + cpu_to_be64(wqe->wr.wr.atomic.swap); > + ohdr->u.atomic_eth.compare_data = > + cpu_to_be64(wqe->wr.wr.atomic.compare_add); > + hwords += sizeof(struct ib_atomic_eth) / 4; > + if (newreq) { > + qp->s_lsn++; > + wqe->lpsn = wqe->psn; > + } > + if (++qp->s_cur == qp->s_size) > + qp->s_cur = 0; > + ss = NULL; > + len = 0; > + break; > + > + default: > + goto done; > + } > + if (newreq) { > + if (++qp->s_tail >= qp->s_size) > + qp->s_tail = 0; > + } > + bth2 |= qp->s_psn++ & 0xFFFFFF; > + if ((int)(qp->s_psn - qp->s_next_psn) > 0) > + qp->s_next_psn = qp->s_psn; > + spin_lock(&dev->pending_lock); > + if (qp->timerwait.next == LIST_POISON1) { > + list_add_tail(&qp->timerwait, > + &dev->pending[dev-> > + pending_index]); > + } > + spin_unlock(&dev->pending_lock); > + break; > + > + case IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST: > + /* > + * This case can only happen if a send is > + * restarted. See ipath_restart_rc(). > + */ > + ipath_init_restart(qp, wqe); > + /* FALLTHROUGH */ > + case IB_OPCODE_RC_SEND_FIRST: > + qp->s_state = IB_OPCODE_RC_SEND_MIDDLE; > + /* FALLTHROUGH */ > + case IB_OPCODE_RC_SEND_MIDDLE: > + bth2 = qp->s_psn++ & 0xFFFFFF; > + if ((int)(qp->s_psn - qp->s_next_psn) > 0) > + qp->s_next_psn = qp->s_psn; > + ss = &qp->s_sge; > + len = qp->s_len; > + if (len > pmtu) { > + /* > + * Request an ACK every 1/2 MB to avoid > + * retransmit timeouts. > + */ > + if (((wqe->length - len) % (512 * 1024)) == 0) > + bth2 |= 1 << 31; > + len = pmtu; > + break; > + } > + if (wqe->wr.opcode == IB_WR_SEND) > + qp->s_state = IB_OPCODE_RC_SEND_LAST; > + else { > + qp->s_state = > + IB_OPCODE_RC_SEND_LAST_WITH_IMMEDIATE; > + /* Immediate data comes after the BTH */ > + ohdr->u.imm_data = wqe->wr.imm_data; > + hwords += 1; > + } > + if (wqe->wr.send_flags & IB_SEND_SOLICITED) > + bth0 |= 1 << 23; > + bth2 |= 1 << 31; /* Request ACK. */ > + if (++qp->s_cur >= qp->s_size) > + qp->s_cur = 0; > + break; > + > + case IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST: > + /* > + * This case can only happen if a RDMA write is > + * restarted. See ipath_restart_rc(). > + */ > + ipath_init_restart(qp, wqe); > + /* FALLTHROUGH */ > + case IB_OPCODE_RC_RDMA_WRITE_FIRST: > + qp->s_state = IB_OPCODE_RC_RDMA_WRITE_MIDDLE; > + /* FALLTHROUGH */ > + case IB_OPCODE_RC_RDMA_WRITE_MIDDLE: > + bth2 = qp->s_psn++ & 0xFFFFFF; > + if ((int)(qp->s_psn - qp->s_next_psn) > 0) > + qp->s_next_psn = qp->s_psn; > + ss = &qp->s_sge; > + len = qp->s_len; > + if (len > pmtu) { > + /* > + * Request an ACK every 1/2 MB to avoid > + * retransmit timeouts. > + */ > + if (((wqe->length - len) % (512 * 1024)) == 0) > + bth2 |= 1 << 31; > + len = pmtu; > + break; > + } > + if (wqe->wr.opcode == IB_WR_RDMA_WRITE) > + qp->s_state = IB_OPCODE_RC_RDMA_WRITE_LAST; > + else { > + qp->s_state = > + IB_OPCODE_RC_RDMA_WRITE_LAST_WITH_IMMEDIATE; > + /* Immediate data comes after the BTH */ > + ohdr->u.imm_data = wqe->wr.imm_data; > + hwords += 1; > + if (wqe->wr.send_flags & IB_SEND_SOLICITED) > + bth0 |= 1 << 23; > + } > + bth2 |= 1 << 31; /* Request ACK. */ > + if (++qp->s_cur >= qp->s_size) > + qp->s_cur = 0; > + break; > + > + case IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE: > + /* > + * This case can only happen if a RDMA read is > + * restarted. See ipath_restart_rc(). 
> + */ > + ipath_init_restart(qp, wqe); > + len = ((qp->s_psn - wqe->psn) & 0xFFFFFF) * pmtu; > + ohdr->u.rc.reth.vaddr = > + cpu_to_be64(wqe->wr.wr.rdma.remote_addr + len); > + ohdr->u.rc.reth.rkey = > + cpu_to_be32(wqe->wr.wr.rdma.rkey); > + ohdr->u.rc.reth.length = cpu_to_be32(qp->s_len); > + qp->s_state = IB_OPCODE_RC_RDMA_READ_REQUEST; > + hwords += sizeof(ohdr->u.rc.reth) / 4; > + bth2 = qp->s_psn++ & 0xFFFFFF; > + if ((int)(qp->s_psn - qp->s_next_psn) > 0) > + qp->s_next_psn = qp->s_psn; > + ss = NULL; > + len = 0; > + if (++qp->s_cur == qp->s_size) > + qp->s_cur = 0; > + break; > + > + case IB_OPCODE_RC_RDMA_READ_REQUEST: > + case IB_OPCODE_RC_COMPARE_SWAP: > + case IB_OPCODE_RC_FETCH_ADD: > + /* > + * We shouldn't start anything new until this request > + * is finished. The ACK will handle rescheduling us. > + * XXX The number of outstanding ones is negotiated > + * at connection setup time (see pg. 258,289)? > + * XXX Also, if we support multiple outstanding > + * requests, we need to check the WQE IB_SEND_FENCE > + * flag and not send a new request if a RDMA read or > + * atomic is pending. > + */ > + goto done; > + } > + qp->s_len -= len; > + bth0 |= qp->s_state << 24; > + /* XXX queue resend timeout. */ > + } > + /* Make sure it is non-zero before dropping the lock. */ > + qp->s_hdrwords = hwords; > + spin_unlock_irqrestore(&qp->s_lock, flags); > + > + /* Construct the header. */ > + extra_bytes = (4 - len) & 3; > + nwords = (len + extra_bytes) >> 2; > + if (unlikely(qp->remote_ah_attr.ah_flags & IB_AH_GRH)) { > + /* Header size in 32-bit words. */ > + hwords += 10; > + lrh0 = IPS_LRH_GRH; > + qp->s_hdr.u.l.grh.version_tclass_flow = > + cpu_to_be32((6 << 28) | > + (qp->remote_ah_attr.grh.traffic_class << 20) | > + qp->remote_ah_attr.grh.flow_label); > + qp->s_hdr.u.l.grh.paylen = > + cpu_to_be16(((hwords - 12) + nwords + SIZE_OF_CRC) << 2); > + qp->s_hdr.u.l.grh.next_hdr = 0x1B; > + qp->s_hdr.u.l.grh.hop_limit = qp->remote_ah_attr.grh.hop_limit; > + /* The SGID is 32-bit aligned. */ > + qp->s_hdr.u.l.grh.sgid.global.subnet_prefix = dev->gid_prefix; > + qp->s_hdr.u.l.grh.sgid.global.interface_id = > + ipath_layer_get_guid(dev->ib_unit); > + qp->s_hdr.u.l.grh.dgid = qp->remote_ah_attr.grh.dgid; > + qp->s_hdrwords = hwords; > + } > + qp->s_cur_sge = ss; > + qp->s_cur_size = len; > + lrh0 |= qp->remote_ah_attr.sl << 4; > + qp->s_hdr.lrh[0] = cpu_to_be16(lrh0); > + /* DEST LID */ > + qp->s_hdr.lrh[1] = cpu_to_be16(qp->remote_ah_attr.dlid); > + qp->s_hdr.lrh[2] = cpu_to_be16(hwords + nwords + SIZE_OF_CRC); > + qp->s_hdr.lrh[3] = cpu_to_be16(ipath_layer_get_lid(dev->ib_unit)); > + bth0 |= extra_bytes << 20; > + ohdr->bth[0] = cpu_to_be32(bth0); > + ohdr->bth[1] = cpu_to_be32(qp->remote_qpn); > + ohdr->bth[2] = cpu_to_be32(bth2); > + > + /* Check for more work to do. */ > + goto again; > + > +done: > + spin_unlock_irqrestore(&qp->s_lock, flags); > + clear_bit(IPATH_S_BUSY, &qp->s_flags); > +} > + > +static void send_rc_ack(struct ipath_qp *qp) > +{ > + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); > + u16 lrh0; > + u32 bth0; > + u32 hwords; > + struct ipath_other_headers *ohdr; > + > + /* Construct the header. */ > + ohdr = &qp->s_hdr.u.oth; > + lrh0 = IPS_LRH_BTH; > + /* header size in 32-bit words LRH+BTH+AETH = (8+12+4)/4. */ > + hwords = 6; > + if (unlikely(qp->remote_ah_attr.ah_flags & IB_AH_GRH)) { > + ohdr = &qp->s_hdr.u.l.oth; > + /* Header size in 32-bit words. 
*/ > + hwords += 10; > + lrh0 = IPS_LRH_GRH; > + qp->s_hdr.u.l.grh.version_tclass_flow = > + cpu_to_be32((6 << 28) | > + (qp->remote_ah_attr.grh.traffic_class << 20) | > + qp->remote_ah_attr.grh.flow_label); > + qp->s_hdr.u.l.grh.paylen = > + cpu_to_be16(((hwords - 12) + SIZE_OF_CRC) << 2); > + qp->s_hdr.u.l.grh.next_hdr = 0x1B; > + qp->s_hdr.u.l.grh.hop_limit = qp->remote_ah_attr.grh.hop_limit; > + /* The SGID is 32-bit aligned. */ > + qp->s_hdr.u.l.grh.sgid.global.subnet_prefix = dev->gid_prefix; > + qp->s_hdr.u.l.grh.sgid.global.interface_id = > + ipath_layer_get_guid(dev->ib_unit); > + qp->s_hdr.u.l.grh.dgid = qp->remote_ah_attr.grh.dgid; > + } > + bth0 = ipath_layer_get_pkey(dev->ib_unit, qp->s_pkey_index); > + ohdr->u.aeth = ipath_compute_aeth(qp); > + if (qp->s_ack_state >= IB_OPCODE_RC_COMPARE_SWAP) { > + bth0 |= IB_OPCODE_ATOMIC_ACKNOWLEDGE << 24; > + ohdr->u.at.atomic_ack_eth = cpu_to_be64(qp->s_ack_atomic); > + hwords += sizeof(ohdr->u.at.atomic_ack_eth) / 4; > + } else { > + bth0 |= IB_OPCODE_RC_ACKNOWLEDGE << 24; > + } > + lrh0 |= qp->remote_ah_attr.sl << 4; > + qp->s_hdr.lrh[0] = cpu_to_be16(lrh0); > + /* DEST LID */ > + qp->s_hdr.lrh[1] = cpu_to_be16(qp->remote_ah_attr.dlid); > + qp->s_hdr.lrh[2] = cpu_to_be16(hwords + SIZE_OF_CRC); > + qp->s_hdr.lrh[3] = cpu_to_be16(ipath_layer_get_lid(dev->ib_unit)); > + ohdr->bth[0] = cpu_to_be32(bth0); > + ohdr->bth[1] = cpu_to_be32(qp->remote_qpn); > + ohdr->bth[2] = cpu_to_be32(qp->s_ack_psn & 0xFFFFFF); > + > + /* > + * If we can send the ACK, clear the ACK state. > + */ > + if (ipath_verbs_send(dev->ib_unit, hwords, (uint32_t *) &qp->s_hdr, > + 0, NULL) == 0) { > + qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; > + dev->n_rc_qacks++; > + } > +} > + > +/* > + * Back up the requester to resend the last un-ACKed request. > + * The QP s_lock should be held. > + */ > +static void ipath_restart_rc(struct ipath_qp *qp, u32 psn, struct ib_wc *wc) > +{ > + struct ipath_swqe *wqe = get_swqe_ptr(qp, qp->s_last); > + struct ipath_ibdev *dev; > + u32 n; > + > + /* > + * If there are no requests pending, we are done. > + */ > + if (cmp24(psn, qp->s_next_psn) >= 0 || qp->s_last == qp->s_tail) > + goto done; > + > + if (qp->s_retry == 0) { > + wc->wr_id = wqe->wr.wr_id; > + wc->status = IB_WC_RETRY_EXC_ERR; > + wc->opcode = wc_opcode[wqe->wr.opcode]; > + wc->vendor_err = 0; > + wc->byte_len = 0; > + wc->qp_num = qp->ibqp.qp_num; > + wc->src_qp = qp->remote_qpn; > + wc->pkey_index = 0; > + wc->slid = qp->remote_ah_attr.dlid; > + wc->sl = qp->remote_ah_attr.sl; > + wc->dlid_path_bits = 0; > + wc->port_num = 0; > + ipath_sqerror_qp(qp, wc); > + return; > + } > + qp->s_retry--; > + > + /* > + * Remove the QP from the timeout queue. > + * Note: it may already have been removed by ipath_ib_timer(). > + */ > + dev = to_idev(qp->ibqp.device); > + spin_lock(&dev->pending_lock); > + if (qp->timerwait.next != LIST_POISON1) > + list_del(&qp->timerwait); > + spin_unlock(&dev->pending_lock); > + > + if (wqe->wr.opcode == IB_WR_RDMA_READ) > + dev->n_rc_resends++; > + else > + dev->n_rc_resends += (int)qp->s_psn - (int)psn; > + > + /* > + * If we are starting the request from the beginning, let the > + * normal send code handle initialization. 
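cmp24() isn't in this hunk; presumably it compares 24-bit PSNs with wraparound by shifting the difference up into the sign bit, along the lines of (name and exact form assumed):

	static inline int cmp24(u32 a, u32 b)
	{
		/* <0, ==0 or >0 like memcmp(), but modulo 2^24. */
		return (((int) a) - ((int) b)) << 8;
	}

so for example cmp24(0x000005, 0xFFFFFE) > 0: PSN 5 is "after" 0xFFFFFE once the counter wraps.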
> + */ > + qp->s_cur = qp->s_last; > + if (cmp24(psn, wqe->psn) <= 0) { > + qp->s_state = IB_OPCODE_RC_SEND_LAST; > + qp->s_psn = wqe->psn; > + } else { > + n = qp->s_cur; > + for (;;) { > + if (++n == qp->s_size) > + n = 0; > + if (n == qp->s_tail) { > + if (cmp24(psn, qp->s_next_psn) >= 0) { > + qp->s_cur = n; > + wqe = get_swqe_ptr(qp, n); > + } > + break; > + } > + wqe = get_swqe_ptr(qp, n); > + if (cmp24(psn, wqe->psn) < 0) > + break; > + qp->s_cur = n; > + } > + qp->s_psn = psn; > + > + /* > + * Reset the state to restart in the middle of a request. > + * Don't change the s_sge, s_cur_sge, or s_cur_size. > + * See do_rc_send(). > + */ > + switch (wqe->wr.opcode) { > + case IB_WR_SEND: > + case IB_WR_SEND_WITH_IMM: > + qp->s_state = IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST; > + break; > + > + case IB_WR_RDMA_WRITE: > + case IB_WR_RDMA_WRITE_WITH_IMM: > + qp->s_state = IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST; > + break; > + > + case IB_WR_RDMA_READ: > + qp->s_state = IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE; > + break; > + > + default: > + /* > + * This case shouldn't happen since its only > + * one PSN per req. > + */ > + qp->s_state = IB_OPCODE_RC_SEND_LAST; > + } > + } > + > +done: > + tasklet_schedule(&qp->s_task); > +} > + > +/* > + * Handle RC and UC post sends. > + */ > +static int ipath_post_rc_send(struct ipath_qp *qp, struct ib_send_wr *wr) > +{ > + struct ipath_swqe *wqe; > + unsigned long flags; > + u32 next; > + int i, j; > + int acc; > + > + /* > + * Don't allow RDMA reads or atomic operations on UC or > + * undefined operations. > + * Make sure buffer is large enough to hold the result for atomics. > + */ > + if (qp->ibqp.qp_type == IB_QPT_UC) { > + if ((unsigned) wr->opcode >= IB_WR_RDMA_READ) > + return -EINVAL; > + } else if ((unsigned) wr->opcode > IB_WR_ATOMIC_FETCH_AND_ADD) > + return -EINVAL; > + else if (wr->opcode >= IB_WR_ATOMIC_CMP_AND_SWP && > + (wr->num_sge == 0 || wr->sg_list[0].length < sizeof(u64) || > + wr->sg_list[0].addr & 0x7)) > + return -EINVAL; > + > + /* IB spec says that num_sge == 0 is OK. */ > + if (wr->num_sge > qp->s_max_sge) > + return -ENOMEM; > + > + spin_lock_irqsave(&qp->s_lock, flags); > + next = qp->s_head + 1; > + if (next >= qp->s_size) > + next = 0; > + if (next == qp->s_last) { > + spin_unlock_irqrestore(&qp->s_lock, flags); > + return -EINVAL; > + } > + > + wqe = get_swqe_ptr(qp, qp->s_head); > + wqe->wr = *wr; > + wqe->ssn = qp->s_ssn++; > + wqe->sg_list[0].mr = NULL; > + wqe->sg_list[0].vaddr = NULL; > + wqe->sg_list[0].length = 0; > + wqe->sg_list[0].sge_length = 0; > + wqe->length = 0; > + acc = wr->opcode >= IB_WR_RDMA_READ ? IB_ACCESS_LOCAL_WRITE : 0; > + for (i = 0, j = 0; i < wr->num_sge; i++) { > + if (to_ipd(qp->ibqp.pd)->user && wr->sg_list[i].lkey == 0) { > + spin_unlock_irqrestore(&qp->s_lock, flags); > + return -EINVAL; > + } > + if (wr->sg_list[i].length == 0) > + continue; > + if (!ipath_lkey_ok(&to_idev(qp->ibqp.device)->lk_table, > + &wqe->sg_list[j], &wr->sg_list[i], acc)) { > + spin_unlock_irqrestore(&qp->s_lock, flags); > + return -EINVAL; > + } > + wqe->length += wr->sg_list[i].length; > + j++; > + } > + wqe->wr.num_sge = j; > + qp->s_head = next; > + /* > + * Wake up the send tasklet if the QP is not waiting > + * for an RNR timeout. 
> + */ > + next = qp->s_rnr_timeout; > + spin_unlock_irqrestore(&qp->s_lock, flags); > + > + if (next == 0) { > + if (qp->ibqp.qp_type == IB_QPT_UC) > + do_uc_send((unsigned long) qp); > + else > + do_rc_send((unsigned long) qp); > + } > + return 0; > +} > + > +/* > + * Note that we actually send the data as it is posted instead of putting > + * the request into a ring buffer. If we wanted to use a ring buffer, > + * we would need to save a reference to the destination address in the SWQE. > + */ > +static int ipath_post_ud_send(struct ipath_qp *qp, struct ib_send_wr *wr) > +{ > + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); > + struct ipath_other_headers *ohdr; > + struct ib_ah_attr *ah_attr; > + struct ipath_sge_state ss; > + struct ipath_sge *sg_list; > + struct ib_wc wc; > + u32 hwords; > + u32 nwords; > + u32 len; > + u32 extra_bytes; > + u32 bth0; > + u16 lrh0; > + u16 lid; > + int i; > + > + if (!(state_ops[qp->state] & IPATH_PROCESS_SEND_OK)) > + return 0; > + > + /* IB spec says that num_sge == 0 is OK. */ > + if (wr->num_sge > qp->s_max_sge) > + return -EINVAL; > + > + if (wr->num_sge > 1) { > + sg_list = kmalloc((qp->s_max_sge - 1) * sizeof(*sg_list), > + GFP_ATOMIC); > + if (!sg_list) > + return -ENOMEM; > + } else > + sg_list = NULL; > + > + /* Check the buffer to send. */ > + ss.sg_list = sg_list; > + ss.sge.mr = NULL; > + ss.sge.vaddr = NULL; > + ss.sge.length = 0; > + ss.sge.sge_length = 0; > + ss.num_sge = 0; > + len = 0; > + for (i = 0; i < wr->num_sge; i++) { > + /* Check LKEY */ > + if (to_ipd(qp->ibqp.pd)->user && wr->sg_list[i].lkey == 0) > + return -EINVAL; > + > + if (wr->sg_list[i].length == 0) > + continue; > + if (!ipath_lkey_ok(&dev->lk_table, ss.num_sge ? > + sg_list + ss.num_sge : &ss.sge, > + &wr->sg_list[i], 0)) { > + return -EINVAL; > + } > + len += wr->sg_list[i].length; > + ss.num_sge++; > + } > + extra_bytes = (4 - len) & 3; > + nwords = (len + extra_bytes) >> 2; > + > + /* Construct the header. */ > + ah_attr = &to_iah(wr->wr.ud.ah)->attr; > + if (ah_attr->dlid >= 0xC000 && ah_attr->dlid < 0xFFFF) > + dev->n_multicast_xmit++; > + if (unlikely(ah_attr->dlid == ipath_layer_get_lid(dev->ib_unit))) { > + /* Pass in an uninitialized ib_wc to save stack space. */ > + ipath_ud_loopback(qp, &ss, len, wr, &wc); > + goto done; > + } > + if (ah_attr->ah_flags & IB_AH_GRH) { > + /* Header size in 32-bit words. */ > + hwords = 17; > + lrh0 = IPS_LRH_GRH; > + ohdr = &qp->s_hdr.u.l.oth; > + qp->s_hdr.u.l.grh.version_tclass_flow = > + cpu_to_be32((6 << 28) | > + (ah_attr->grh.traffic_class << 20) | > + ah_attr->grh.flow_label); > + qp->s_hdr.u.l.grh.paylen = > + cpu_to_be16(((wr->opcode == > + IB_WR_SEND_WITH_IMM ? 6 : 5) + nwords + > + SIZE_OF_CRC) << 2); > + qp->s_hdr.u.l.grh.next_hdr = 0x1B; > + qp->s_hdr.u.l.grh.hop_limit = ah_attr->grh.hop_limit; > + /* The SGID is 32-bit aligned. */ > + qp->s_hdr.u.l.grh.sgid.global.subnet_prefix = dev->gid_prefix; > + qp->s_hdr.u.l.grh.sgid.global.interface_id = > + ipath_layer_get_guid(dev->ib_unit); > + qp->s_hdr.u.l.grh.dgid = ah_attr->grh.dgid; > + /* > + * Don't worry about sending to locally attached > + * multicast QPs. It is unspecified by the spec. what happens. > + */ > + } else { > + /* Header size in 32-bit words. 
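> + * (LRH is 2 words, BTH 3, DETH 2, for a total of 7; one more word
> + * is added below if an immediate is present.)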
*/ > + hwords = 7; > + lrh0 = IPS_LRH_BTH; > + ohdr = &qp->s_hdr.u.oth; > + } > + if (wr->opcode == IB_WR_SEND_WITH_IMM) { > + ohdr->u.ud.imm_data = wr->imm_data; > + wc.imm_data = wr->imm_data; > + hwords += 1; > + bth0 = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE << 24; > + } else if (wr->opcode == IB_WR_SEND) { > + wc.imm_data = 0; > + bth0 = IB_OPCODE_UD_SEND_ONLY << 24; > + } else > + return -EINVAL; > + lrh0 |= ah_attr->sl << 4; > + if (qp->ibqp.qp_type == IB_QPT_SMI) > + lrh0 |= 0xF000; /* Set VL */ > + qp->s_hdr.lrh[0] = cpu_to_be16(lrh0); > + qp->s_hdr.lrh[1] = cpu_to_be16(ah_attr->dlid); /* DEST LID */ > + qp->s_hdr.lrh[2] = cpu_to_be16(hwords + nwords + SIZE_OF_CRC); > + lid = ipath_layer_get_lid(dev->ib_unit); > + qp->s_hdr.lrh[3] = lid ? cpu_to_be16(lid) : IB_LID_PERMISSIVE; > + if (wr->send_flags & IB_SEND_SOLICITED) > + bth0 |= 1 << 23; > + bth0 |= extra_bytes << 20; > + bth0 |= qp->ibqp.qp_type == IB_QPT_SMI ? IPS_DEFAULT_P_KEY : > + ipath_layer_get_pkey(dev->ib_unit, qp->s_pkey_index); > + ohdr->bth[0] = cpu_to_be32(bth0); > + ohdr->bth[1] = cpu_to_be32(wr->wr.ud.remote_qpn); > + /* XXX Could lose a PSN count but not worth locking */ > + ohdr->bth[2] = cpu_to_be32(qp->s_psn++ & 0xFFFFFF); > + /* > + * Qkeys with the high order bit set mean use the > + * qkey from the QP context instead of the WR. > + */ > + ohdr->u.ud.deth[0] = cpu_to_be32((int)wr->wr.ud.remote_qkey < 0 ? > + qp->qkey : wr->wr.ud.remote_qkey); > + ohdr->u.ud.deth[1] = cpu_to_be32(qp->ibqp.qp_num); > + if (ipath_verbs_send(dev->ib_unit, hwords, (uint32_t *) &qp->s_hdr, > + len, &ss)) > + dev->n_no_piobuf++; > + > +done: > + /* Queue the completion status entry. */ > + if (!test_bit(IPATH_S_SIGNAL_REQ_WR, &qp->s_flags) || > + (wr->send_flags & IB_SEND_SIGNALED)) { > + wc.wr_id = wr->wr_id; > + wc.status = IB_WC_SUCCESS; > + wc.vendor_err = 0; > + wc.opcode = IB_WC_SEND; > + wc.byte_len = len; > + wc.qp_num = qp->ibqp.qp_num; > + wc.src_qp = 0; > + wc.wc_flags = 0; > + /* XXX initialize other fields? */ > + ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, 0); > + } > + kfree(sg_list); > + > + return 0; > +} > + > +/* > + * This may be called from interrupt context. > + */ > +static int ipath_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, > + struct ib_send_wr **bad_wr) > +{ > + struct ipath_qp *qp = to_iqp(ibqp); > + int err = 0; > + > + /* Check that state is OK to post send. */ > + if (!(state_ops[qp->state] & IPATH_POST_SEND_OK)) { > + *bad_wr = wr; > + return -EINVAL; > + } > + > + for (; wr; wr = wr->next) { > + switch (qp->ibqp.qp_type) { > + case IB_QPT_UC: > + case IB_QPT_RC: > + err = ipath_post_rc_send(qp, wr); > + break; > + > + case IB_QPT_SMI: > + case IB_QPT_GSI: > + case IB_QPT_UD: > + err = ipath_post_ud_send(qp, wr); > + break; > + > + default: > + err = -EINVAL; > + } > + if (err) { > + *bad_wr = wr; > + break; > + } > + } > + return err; > +} > + > +/* > + * This may be called from interrupt context. > + */ > +static int ipath_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, > + struct ib_recv_wr **bad_wr) > +{ > + struct ipath_qp *qp = to_iqp(ibqp); > + unsigned long flags; > + > + /* Check that state is OK to post receive. 
*/ > + if (!(state_ops[qp->state] & IPATH_POST_RECV_OK)) { > + *bad_wr = wr; > + return -EINVAL; > + } > + > + for (; wr; wr = wr->next) { > + struct ipath_rwqe *wqe; > + u32 next; > + int i, j; > + > + if (wr->num_sge > qp->r_rq.max_sge) { > + *bad_wr = wr; > + return -ENOMEM; > + } > + > + spin_lock_irqsave(&qp->r_rq.lock, flags); > + next = qp->r_rq.head + 1; > + if (next >= qp->r_rq.size) > + next = 0; > + if (next == qp->r_rq.tail) { > + spin_unlock_irqrestore(&qp->r_rq.lock, flags); > + *bad_wr = wr; > + return -ENOMEM; > + } > + > + wqe = get_rwqe_ptr(&qp->r_rq, qp->r_rq.head); > + wqe->wr_id = wr->wr_id; > + wqe->sg_list[0].mr = NULL; > + wqe->sg_list[0].vaddr = NULL; > + wqe->sg_list[0].length = 0; > + wqe->sg_list[0].sge_length = 0; > + wqe->length = 0; > + for (i = 0, j = 0; i < wr->num_sge; i++) { > + /* Check LKEY */ > + if (to_ipd(qp->ibqp.pd)->user && > + wr->sg_list[i].lkey == 0) { > + spin_unlock_irqrestore(&qp->r_rq.lock, flags); > + *bad_wr = wr; > + return -EINVAL; > + } > + if (wr->sg_list[i].length == 0) > + continue; > + if (!ipath_lkey_ok(&to_idev(qp->ibqp.device)->lk_table, > + &wqe->sg_list[j], &wr->sg_list[i], > + IB_ACCESS_LOCAL_WRITE)) { > + spin_unlock_irqrestore(&qp->r_rq.lock, flags); > + *bad_wr = wr; > + return -EINVAL; > + } > + wqe->length += wr->sg_list[i].length; > + j++; > + } > + wqe->num_sge = j; > + qp->r_rq.head = next; > + spin_unlock_irqrestore(&qp->r_rq.lock, flags); > + } > + return 0; > +} > + > +/* > + * This may be called from interrupt context. > + */ > +static int ipath_post_srq_receive(struct ib_srq *ibsrq, struct ib_recv_wr *wr, > + struct ib_recv_wr **bad_wr) > +{ > + struct ipath_srq *srq = to_isrq(ibsrq); > + struct ipath_ibdev *dev = to_idev(ibsrq->device); > + unsigned long flags; > + > + for (; wr; wr = wr->next) { > + struct ipath_rwqe *wqe; > + u32 next; > + int i, j; > + > + if (wr->num_sge > srq->rq.max_sge) { > + *bad_wr = wr; > + return -ENOMEM; > + } > + > + spin_lock_irqsave(&srq->rq.lock, flags); > + next = srq->rq.head + 1; > + if (next >= srq->rq.size) > + next = 0; > + if (next == srq->rq.tail) { > + spin_unlock_irqrestore(&srq->rq.lock, flags); > + *bad_wr = wr; > + return -ENOMEM; > + } > + > + wqe = get_rwqe_ptr(&srq->rq, srq->rq.head); > + wqe->wr_id = wr->wr_id; > + wqe->sg_list[0].mr = NULL; > + wqe->sg_list[0].vaddr = NULL; > + wqe->sg_list[0].length = 0; > + wqe->sg_list[0].sge_length = 0; > + wqe->length = 0; > + for (i = 0, j = 0; i < wr->num_sge; i++) { > + /* Check LKEY */ > + if (to_ipd(srq->ibsrq.pd)->user && > + wr->sg_list[i].lkey == 0) { > + spin_unlock_irqrestore(&srq->rq.lock, flags); > + *bad_wr = wr; > + return -EINVAL; > + } > + if (wr->sg_list[i].length == 0) > + continue; > + if (!ipath_lkey_ok(&dev->lk_table, > + &wqe->sg_list[j], &wr->sg_list[i], > + IB_ACCESS_LOCAL_WRITE)) { > + spin_unlock_irqrestore(&srq->rq.lock, flags); > + *bad_wr = wr; > + return -EINVAL; > + } > + wqe->length += wr->sg_list[i].length; > + j++; > + } > + wqe->num_sge = j; > + srq->rq.head = next; > + spin_unlock_irqrestore(&srq->rq.lock, flags); > + } > + return 0; > +} > + > +/* > + * This is called from ipath_qp_rcv() to process an incoming UD packet > + * for the given QP. > + * Called at interrupt level.
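> + * (No locks are held on entry; the receive queue lock is only taken
> + * below once a receive WQE is needed for the payload.)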
*/ > +static void ipath_ud_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, > + int has_grh, void *data, u32 tlen, struct ipath_qp *qp) > +{ > + struct ipath_other_headers *ohdr; > + int opcode; > + u32 hdrsize; > + u32 pad; > + unsigned long flags; > + struct ib_wc wc; > + u32 qkey; > + u32 src_qp; > + struct ipath_rq *rq; > + struct ipath_srq *srq; > + struct ipath_rwqe *wqe; > + > + /* Check for GRH */ > + if (!has_grh) { > + ohdr = &hdr->u.oth; > + hdrsize = 8 + 12 + 8; /* LRH + BTH + DETH */ > + qkey = be32_to_cpu(ohdr->u.ud.deth[0]); > + src_qp = be32_to_cpu(ohdr->u.ud.deth[1]); > + } else { > + ohdr = &hdr->u.l.oth; > + hdrsize = 8 + 40 + 12 + 8; /* LRH + GRH + BTH + DETH */ > + /* > + * The header with GRH is 68 bytes and the > + * core driver sets the eager header buffer > + * size to 56 bytes so the last 12 bytes of > + * the IB header are in the data buffer. > + */ > + qkey = be32_to_cpu(((u32 *) data)[1]); > + src_qp = be32_to_cpu(((u32 *) data)[2]); > + data += 12; > + } > + src_qp &= 0xFFFFFF; > + > + /* Check that the qkey matches. */ > + if (unlikely(qkey != qp->qkey)) { > + /* XXX OK to lose a count once in a while. */ > + dev->qkey_violations++; > + dev->n_pkt_drops++; > + return; > + } > + > + /* Get the number of bytes the message was padded by. */ > + pad = (ohdr->bth[0] >> 12) & 3; > + if (unlikely(tlen < (hdrsize + pad + 4))) { > + /* Drop incomplete packets. */ > + dev->n_pkt_drops++; > + return; > + } > + > + /* > + * A GRH is expected to precede the data even if not > + * present on the wire. > + */ > + wc.byte_len = tlen - (hdrsize + pad + 4) + sizeof(struct ib_grh); > + > + /* > + * The opcode is in the low byte when it's in network order > + * (top byte when in host order). > + */ > + opcode = *(u8 *) (&ohdr->bth[0]); > + if (opcode == IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE) { > + if (has_grh) { > + wc.imm_data = *(u32 *) data; > + data += sizeof(u32); > + } else > + wc.imm_data = ohdr->u.ud.imm_data; > + wc.wc_flags = IB_WC_WITH_IMM; > + hdrsize += sizeof(u32); > + } else if (opcode == IB_OPCODE_UD_SEND_ONLY) { > + wc.imm_data = 0; > + wc.wc_flags = 0; > + } else { > + dev->n_pkt_drops++; > + return; > + } > + > + /* > + * Get the next work request entry to find where to put the data. > + * Note that it is safe to drop the lock after changing rq->tail > + * since ipath_post_receive() won't fill the empty slot. > + */ > + if (qp->ibqp.srq) { > + srq = to_isrq(qp->ibqp.srq); > + rq = &srq->rq; > + } else { > + srq = NULL; > + rq = &qp->r_rq; > + } > + spin_lock_irqsave(&rq->lock, flags); > + if (rq->tail == rq->head) { > + spin_unlock_irqrestore(&rq->lock, flags); > + dev->n_pkt_drops++; > + return; > + } > + /* Silently drop packets which are too big.
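> + * (A payload larger than the posted receive WQE, GRH space included,
> + * just bumps n_pkt_drops; no completion entry is generated.)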
*/ > + wqe = get_rwqe_ptr(rq, rq->tail); > + if (wc.byte_len > wqe->length) { > + spin_unlock_irqrestore(&rq->lock, flags); > + dev->n_pkt_drops++; > + return; > + } > + wc.wr_id = wqe->wr_id; > + qp->r_sge.sge = wqe->sg_list[0]; > + qp->r_sge.sg_list = wqe->sg_list + 1; > + qp->r_sge.num_sge = wqe->num_sge; > + if (++rq->tail >= rq->size) > + rq->tail = 0; > + if (srq && srq->ibsrq.event_handler) { > + u32 n; > + > + if (rq->head < rq->tail) > + n = rq->size + rq->head - rq->tail; > + else > + n = rq->head - rq->tail; > + if (n < srq->limit) { > + struct ib_event ev; > + > + srq->limit = 0; > + spin_unlock_irqrestore(&rq->lock, flags); > + ev.device = qp->ibqp.device; > + ev.element.srq = qp->ibqp.srq; > + ev.event = IB_EVENT_SRQ_LIMIT_REACHED; > + srq->ibsrq.event_handler(&ev, srq->ibsrq.srq_context); > + } else > + spin_unlock_irqrestore(&rq->lock, flags); > + } else > + spin_unlock_irqrestore(&rq->lock, flags); > + if (has_grh) { > + copy_sge(&qp->r_sge, &hdr->u.l.grh, sizeof(struct ib_grh)); > + wc.wc_flags |= IB_WC_GRH; > + } else > + skip_sge(&qp->r_sge, sizeof(struct ib_grh)); > + copy_sge(&qp->r_sge, data, wc.byte_len - sizeof(struct ib_grh)); > + wc.status = IB_WC_SUCCESS; > + wc.opcode = IB_WC_RECV; > + wc.vendor_err = 0; > + wc.qp_num = qp->ibqp.qp_num; > + wc.src_qp = src_qp; > + /* XXX do we know which pkey matched? Only needed for GSI. */ > + wc.pkey_index = 0; > + wc.slid = be16_to_cpu(hdr->lrh[3]); > + wc.sl = (be16_to_cpu(hdr->lrh[0]) >> 4) & 0xF; > + wc.dlid_path_bits = 0; > + /* Signal completion event if the solicited bit is set. */ > + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, > + ohdr->bth[0] & __constant_cpu_to_be32(1 << 23)); > +} > + > +/* > + * This is called from ipath_post_ud_send() to forward a WQE addressed > + * to the same HCA. > + */ > +static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_sge_state *ss, > + u32 length, struct ib_send_wr *wr, > + struct ib_wc *wc) > +{ > + struct ipath_ibdev *dev = to_idev(sqp->ibqp.device); > + struct ipath_qp *qp; > + struct ib_ah_attr *ah_attr; > + unsigned long flags; > + struct ipath_rq *rq; > + struct ipath_srq *srq; > + struct ipath_sge_state rsge; > + struct ipath_sge *sge; > + struct ipath_rwqe *wqe; > + > + qp = ipath_lookup_qpn(&dev->qp_table, wr->wr.ud.remote_qpn); > + if (!qp) > + return; > + > + /* Check that the qkey matches. */ > + if (unlikely(wr->wr.ud.remote_qkey != qp->qkey)) { > + /* XXX OK to lose a count once in a while. */ > + dev->qkey_violations++; > + dev->n_pkt_drops++; > + goto done; > + } > + > + /* > + * A GRH is expected to precede the data even if not > + * present on the wire. > + */ > + wc->byte_len = length + sizeof(struct ib_grh); > + > + if (wr->opcode == IB_WR_SEND_WITH_IMM) { > + wc->wc_flags = IB_WC_WITH_IMM; > + wc->imm_data = wr->imm_data; > + } else { > + wc->wc_flags = 0; > + wc->imm_data = 0; > + } > + > + /* > + * Get the next work request entry to find where to put the data. > + * Note that it is safe to drop the lock after changing rq->tail > + * since ipath_post_receive() won't fill the empty slot. > + */ > + if (qp->ibqp.srq) { > + srq = to_isrq(qp->ibqp.srq); > + rq = &srq->rq; > + } else { > + srq = NULL; > + rq = &qp->r_rq; > + } > + spin_lock_irqsave(&rq->lock, flags); > + if (rq->tail == rq->head) { > + spin_unlock_irqrestore(&rq->lock, flags); > + dev->n_pkt_drops++; > + goto done; > + } > + /* Silently drop packets which are too big.
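> + * (Same policy as in ipath_ud_rcv() above: the loopback payload plus
> + * the implicit GRH must fit in the receive WQE.)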
*/ > + wqe = get_rwqe_ptr(rq, rq->tail); > + if (wc->byte_len > wqe->length) { > + spin_unlock_irqrestore(&rq->lock, flags); > + dev->n_pkt_drops++; > + goto done; > + } > + wc->wr_id = wqe->wr_id; > + rsge.sge = wqe->sg_list[0]; > + rsge.sg_list = wqe->sg_list + 1; > + rsge.num_sge = wqe->num_sge; > + if (++rq->tail >= rq->size) > + rq->tail = 0; > + if (srq && srq->ibsrq.event_handler) { > + u32 n; > + > + if (rq->head < rq->tail) > + n = rq->size + rq->head - rq->tail; > + else > + n = rq->head - rq->tail; > + if (n < srq->limit) { > + struct ib_event ev; > + > + srq->limit = 0; > + spin_unlock_irqrestore(&rq->lock, flags); > + ev.device = qp->ibqp.device; > + ev.element.srq = qp->ibqp.srq; > + ev.event = IB_EVENT_SRQ_LIMIT_REACHED; > + srq->ibsrq.event_handler(&ev, srq->ibsrq.srq_context); > + } else > + spin_unlock_irqrestore(&rq->lock, flags); > + } else > + spin_unlock_irqrestore(&rq->lock, flags); > + ah_attr = &to_iah(wr->wr.ud.ah)->attr; > + if (ah_attr->ah_flags & IB_AH_GRH) { > + copy_sge(&rsge, &ah_attr->grh, sizeof(struct ib_grh)); > + wc->wc_flags |= IB_WC_GRH; > + } else > + skip_sge(&rsge, sizeof(struct ib_grh)); > + sge = &ss->sge; > + while (length) { > + u32 len = sge->length; > + > + if (len > length) > + len = length; > + BUG_ON(len == 0); > + copy_sge(&rsge, sge->vaddr, len); > + sge->vaddr += len; > + sge->length -= len; > + sge->sge_length -= len; > + if (sge->sge_length == 0) { > + if (--ss->num_sge) > + *sge = *ss->sg_list++; > + } else if (sge->length == 0 && sge->mr != NULL) { > + if (++sge->n >= IPATH_SEGSZ) { > + if (++sge->m >= sge->mr->mapsz) > + break; > + sge->n = 0; > + } > + sge->vaddr = sge->mr->map[sge->m]->segs[sge->n].vaddr; > + sge->length = sge->mr->map[sge->m]->segs[sge->n].length; > + } > + length -= len; > + } > + wc->status = IB_WC_SUCCESS; > + wc->opcode = IB_WC_RECV; > + wc->vendor_err = 0; > + wc->qp_num = qp->ibqp.qp_num; > + wc->src_qp = sqp->ibqp.qp_num; > + /* XXX do we know which pkey matched? Only needed for GSI. */ > + wc->pkey_index = 0; > + wc->slid = ipath_layer_get_lid(dev->ib_unit); > + wc->sl = ah_attr->sl; > + wc->dlid_path_bits = 0; > + /* Signal completion event if the solicited bit is set. */ > + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), wc, > + wr->send_flags & IB_SEND_SOLICITED); > + > +done: > + if (atomic_dec_and_test(&qp->refcount)) > + wake_up(&qp->wait); > +} > + > +/* > + * Copy the next RWQE into the QP's RWQE. > + * Return zero if no RWQE is available. > + * Called at interrupt level with the QP r_rq.lock held. 
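> + * (For SRQs, the SRQ's rq.lock is additionally taken below, nested
> + * inside r_rq.lock, and dropped before the limit event handler runs.)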
> + */ > +static int get_rwqe(struct ipath_qp *qp, int wr_id_only) > +{ > + struct ipath_rq *rq; > + struct ipath_srq *srq; > + struct ipath_rwqe *wqe; > + > + if (!qp->ibqp.srq) { > + rq = &qp->r_rq; > + if (unlikely(rq->tail == rq->head)) > + return 0; > + wqe = get_rwqe_ptr(rq, rq->tail); > + qp->r_wr_id = wqe->wr_id; > + if (!wr_id_only) { > + qp->r_sge.sge = wqe->sg_list[0]; > + qp->r_sge.sg_list = wqe->sg_list + 1; > + qp->r_sge.num_sge = wqe->num_sge; > + qp->r_len = wqe->length; > + } > + if (++rq->tail >= rq->size) > + rq->tail = 0; > + return 1; > + } > + > + srq = to_isrq(qp->ibqp.srq); > + rq = &srq->rq; > + spin_lock(&rq->lock); > + if (unlikely(rq->tail == rq->head)) { > + spin_unlock(&rq->lock); > + return 0; > + } > + wqe = get_rwqe_ptr(rq, rq->tail); > + qp->r_wr_id = wqe->wr_id; > + if (!wr_id_only) { > + qp->r_sge.sge = wqe->sg_list[0]; > + qp->r_sge.sg_list = wqe->sg_list + 1; > + qp->r_sge.num_sge = wqe->num_sge; > + qp->r_len = wqe->length; > + } > + if (++rq->tail >= rq->size) > + rq->tail = 0; > + if (srq->ibsrq.event_handler) { > + struct ib_event ev; > + u32 n; > + > + if (rq->head < rq->tail) > + n = rq->size + rq->head - rq->tail; > + else > + n = rq->head - rq->tail; > + if (n < srq->limit) { > + srq->limit = 0; > + spin_unlock(&rq->lock); > + ev.device = qp->ibqp.device; > + ev.element.srq = qp->ibqp.srq; > + ev.event = IB_EVENT_SRQ_LIMIT_REACHED; > + srq->ibsrq.event_handler(&ev, srq->ibsrq.srq_context); > + } else > + spin_unlock(&rq->lock); > + } else > + spin_unlock(&rq->lock); > + return 1; > +} > -- > 0.99.9n > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo at vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > > > From rjwalsh at pathscale.com Sun Dec 18 12:05:58 2005 From: rjwalsh at pathscale.com (Robert Walsh) Date: Sun, 18 Dec 2005 12:05:58 -0800 Subject: [openib-general] Re: [PATCH 10/13] [RFC] ipath verbs, part 1 In-Reply-To: <20051218195922.GC31184@us.ibm.com> References: <200512161548.zxp6FKcabEu47EnS@cisco.com> <200512161548.W9sJn4CLmdhnSTcH@cisco.com> <20051218195922.GC31184@us.ibm.com> Message-ID: <1134936358.5826.2.camel@phosphene.durables.org> On Sun, 2005-12-18 at 11:59 -0800, Paul E. McKenney wrote: > On Fri, Dec 16, 2005 at 03:48:55PM -0800, Roland Dreier wrote: > > First half of ipath verbs driver > > Some RCU-related questions interspersed. Basic question is "where is > the lock-free read-side traversal?" Good question. I'll take a closer look. Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. From yael at mellanox.co.il Mon Dec 19 01:19:39 2005 From: yael at mellanox.co.il (Yael Kalka) Date: 19 Dec 2005 11:19:39 +0200 Subject: [openib-general] [PATCH] Opensm - fix segfault on exit - cont. Message-ID: <5z4q55h2ac.fsf@mtl066.yok.mtl.com> Hi Hal, I've noticed that under certain operating systems, when driver isn't loaded, the SM still exits with segfault. The following patch fixes this. Thanks, Yael Signed-off-by: Yael Kalka Index: libvendor/osm_vendor_ibumad.c =================================================================== --- libvendor/osm_vendor_ibumad.c (revision 4522) +++ libvendor/osm_vendor_ibumad.c (working copy) @@ -552,7 +552,7 @@ osm_vendor_delete( /* umad receiver thread ? 
*/ p_ur = (*pp_vend)->receiver; - if (&p_ur->signal) + if (&p_ur->signal != NULL) cl_event_destroy( &p_ur->signal ); cl_spinlock_destroy( &(*pp_vend)->cb_lock ); cl_spinlock_destroy( &(*pp_vend)->match_tbl_lock ); From krkumar2 at in.ibm.com Mon Dec 19 02:46:40 2005 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Mon, 19 Dec 2005 16:16:40 +0530 Subject: [openib-general] [PATCH 07/13] [RFC] ipath core misc files In-Reply-To: <200512161548.3fqe3fMerrheBMdX@cisco.com> Message-ID: Roland Dreier wrote: ... > +int ipath_mlock(unsigned long start_page, size_t num_pages, struct page **p) > +{ > + int n; > + > + _IPATH_VDBG("pin %lx pages from vaddr %lx\n", num_pages, start_page); > + down_read(&current->mm->mmap_sem); > + n = get_user_pages(current, current->mm, start_page, num_pages, 1, 1, > + p, NULL); > + up_read(&current->mm->mmap_sem); > + if (n != num_pages) { > + _IPATH_INFO > + ("get_user_pages (0x%lx pages starting at 0x%lx failed with %d\n", > + num_pages, start_page, n); > + if (n < 0) /* it's an errno */ > + return n; > + return -ENOMEM; /* no way to know actual error */ > + } > + > + return 0; > +} For this routine (where num_pages can be >1), in the error case you need to page_cache_release() the pages that were successfully 'got' (get_page()'d). - KK -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Mon Dec 19 03:37:54 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 19 Dec 2005 06:37:54 -0500 Subject: [openib-general] RE: A couple of questions about osm_lid_mgr.c::__osm_lid_mgr_set_physp_pi In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3618B52@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3618B52@mtlexch01.mtl.com> Message-ID: <1134992274.4328.32896.camel@hal.voltaire.com> Hi Eitan, On Sun, 2005-12-18 at 14:20, Eitan Zahavi wrote: > [EZ] Thanks. I have seen the patch. It is fine. Thanks. I just committed it. > > > > Also, why does changing the MTU require that the link be taken > down ? > > > > > The behavior of the link when a neighbor MTU is changed is not very > well defined. > > > So the best way to handle that is to force it down. > > > > NeighborMTU is not involved with the link negotiation nor is there a > > comment in the description like OperationalVLs. What behavior are you > > referring to ? > [EZ] I actually do not see any spec note about modifying neighbor MTU > during link up. Yes, that was what I was saying. > However, I remember we had to add this functionality. I > try to dig this up in the old bit keeper and found the first occurrence > of the setting of the port down in version 1.7. But the log does not say > why. Thanks for looking. This seems mysterious to me but I would hesitate to remove it even though I don't think it should be required; if it is, some spec comment should be made. I would like to close the loop on this but don't see how. > > > > I also noticed a nit in the same function: > > > > > > > > p_pi->m_key_lease_period = > p_mgr->p_subn->opt.m_key_lease_period; > > > > /* Check to see if the value we are setting is different than > > > > the value in the port_info. If it is - turn on send_set flag > */ > > > > if (cl_memcmp( &p_pi->m_key_lease_period, > > > > &p_old_pi->m_key_lease_period, > > > > sizeof(p_pi->m_key_lease_period) )) > > > > send_set = TRUE; > > > > > > > > Should that be only when the Mkey is non 0 ? > > > > > Well, I know the lease is not relevant when MKey = 0. But for code > clarity I > > > propose to ignore that fact.
The effect is only when someone sets > the lease period but > > MKey = 0 > > > which IMO does not make any sense anyway. > > > > I agree it does not make sense but could happen (is it prevented > somehow > > ?) so my take is to minimize the need for sets. As I said this is a > nit. > [EZ] We could avoid that but I do not think this is required. I agree that it is not required. It's a trivial "optimization". -- Hal From halr at voltaire.com Mon Dec 19 03:41:53 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 19 Dec 2005 06:41:53 -0500 Subject: [openib-general] Re: [PATCH] Opensm - fix segfault on exit - cont. In-Reply-To: <5z4q55h2ac.fsf@mtl066.yok.mtl.com> References: <5z4q55h2ac.fsf@mtl066.yok.mtl.com> Message-ID: <1134992405.4328.32921.camel@hal.voltaire.com> Hi Yael, On Mon, 2005-12-19 at 04:19, Yael Kalka wrote: > Hi Hal, > > I've noticed that under certain operating systems, Seems more likely a compiler difference than an OS difference. Can you mention on which distribution/compiler this made a difference ? > when driver isn't > loaded, the SM still exits with segfault. > The following patch fixes this. > > Thanks, > Yael > > Signed-off-by: Yael Kalka > > Index: libvendor/osm_vendor_ibumad.c > =================================================================== > --- libvendor/osm_vendor_ibumad.c (revision 4522) > +++ libvendor/osm_vendor_ibumad.c (working copy) > @@ -552,7 +552,7 @@ osm_vendor_delete( > > /* umad receiver thread ? */ > p_ur = (*pp_vend)->receiver; > - if (&p_ur->signal) > + if (&p_ur->signal != NULL) If this makes a difference, there are other uses of similar syntax which should be changed :-( > cl_event_destroy( &p_ur->signal ); > cl_spinlock_destroy( &(*pp_vend)->cb_lock ); > cl_spinlock_destroy( &(*pp_vend)->match_tbl_lock ); -- Hal From jackm at mellanox.co.il Mon Dec 19 04:00:49 2005 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Mon, 19 Dec 2005 14:00:49 +0200 Subject: [openib-general] [PATCH] mthca: check return value in mthca_dev_lim call Message-ID: <20051219120049.GA4858@mellanox.co.il> Check error return on call to mthca_dev_lim for Tavor (as is done for memfree). Signed-off-by: Jack Morgenstein Index: linux-kernel/drivers/infiniband/hw/mthca/mthca_main.c =================================================================== --- linux-kernel.orig/drivers/infiniband/hw/mthca/mthca_main.c +++ linux-kernel/drivers/infiniband/hw/mthca/mthca_main.c @@ -261,6 +261,10 @@ static int __devinit mthca_init_tavor(st } err = mthca_dev_lim(mdev, &dev_lim); + if (err) { + mthca_err(mdev, "QUERY_DEV_LIM command failed, aborting.\n"); + goto err_disable; + } profile = default_profile; profile.num_uar = dev_lim.uar_size / PAGE_SIZE; From krkumar2 at in.ibm.com Mon Dec 19 04:05:39 2005 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Mon, 19 Dec 2005 17:35:39 +0530 Subject: [openib-general] [PATCH 07/13] [RFC] ipath core misc files Message-ID: (Please ignore if you see this twice, I am sending this a second time as I got an error on previous send) Roland Dreier wrote: ...
> +int ipath_mlock(unsigned long start_page, size_t num_pages, struct page **p) > +{ > + int n; > + > + _IPATH_VDBG("pin %lx pages from vaddr %lx\n", num_pages, start_page); > + down_read(&current->mm->mmap_sem); > + n = get_user_pages(current, current->mm, start_page, num_pages, 1, 1, > + p, NULL); > + up_read(&current->mm->mmap_sem); > + if (n != num_pages) { > + _IPATH_INFO > + ("get_user_pages (0x%lx pages starting at 0x%lx failed with %d\n", > + num_pages, start_page, n); > + if (n < 0) /* it's an errno */ > + return n; > + return -ENOMEM; /* no way to know actual error */ > + } > + > + return 0; > +} For this routine (where num_pages can be >1), in the error case you need to page_cache_release() the pages that were successfully 'got' (get_page()'d). - KK From jackm at mellanox.co.il Mon Dec 19 05:17:36 2005 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Mon, 19 Dec 2005 15:17:36 +0200 Subject: [openib-general] [PATCH] core/ib_uverbs: fix error flow in ib_uverbs_create_cq Message-ID: <20051219131736.GA8822@mellanox.co.il> ib_uverbs_create_cq did not release the completion channel event file reference count in the error flow. Signed-off-by: Jack Morgenstein Index: linux-kernel/drivers/infiniband/core/uverbs_cmd.c =================================================================== --- linux-kernel.orig/drivers/infiniband/core/uverbs_cmd.c +++ linux-kernel/drivers/infiniband/core/uverbs_cmd.c @@ -593,13 +593,13 @@ ssize_t ib_uverbs_create_cq(struct ib_uv if (cmd.comp_vector >= file->device->num_comp_vectors) return -EINVAL; - if (cmd.comp_channel >= 0) - ev_file = ib_uverbs_lookup_comp_file(cmd.comp_channel); - uobj = kmalloc(sizeof *uobj, GFP_KERNEL); if (!uobj) return -ENOMEM; + if (cmd.comp_channel >= 0) + ev_file = ib_uverbs_lookup_comp_file(cmd.comp_channel); + uobj->uobject.user_handle = cmd.user_handle; uobj->uobject.context = file->ucontext; uobj->uverbs_file = file; @@ -663,6 +663,8 @@ err_up: ib_destroy_cq(cq); err: + if (ev_file) + ib_uverbs_release_ucq(file, ev_file, uobj); kfree(uobj); return ret; } From elylevy at cs.huji.ac.il Mon Dec 19 05:23:53 2005 From: elylevy at cs.huji.ac.il (Ely Levy) Date: Mon, 19 Dec 2005 15:23:53 +0200 (IST) Subject: [openib-general] Problems with dsp Message-ID: I'm trying to run iperf with sdp using svn from few days ago It works find until the speed goes above 1.49GB (by making block size bigger or running 2 iperf) Then I'm starting to get weird debug messages in the log and as wall from root. Any idea what might cause it? This is what I see in /var/log/message: Dec 15 17:34:33 cmos-17 kernel: ib_sdp INIT: SDP module load. Dec 15 17:34:33 cmos-17 kernel: ib_sdp INIT: Initializing /proc filesystem entries. Dec 15 17:34:33 cmos-17 kernel: ib_sdp INIT: Advertisment cache initialization. Dec 15 17:34:33 cmos-17 kernel: ib_sdp INIT: Link level services initialization. Dec 15 17:34:33 cmos-17 kernel: ib_sdp INIT: Main pool initialized. Dec 15 17:34:33 cmos-17 kernel: ib_sdp INIT: Creating connection tables. Dec 15 17:34:33 cmos-17 kernel: ib_sdp INIT: IOCB cache initialization.
Dec 15 17:34:33 cmos-17 kernel: ib_sdp INIT: Started listening for SDP connection requests Dec 15 17:34:33 cmos-17 kernel: NET: Registered protocol family 27 Dec 15 17:34:47 cmos-17 kernel: ib_sdp CRTL: SOCKET: type <1> proto <0> state <1:00000000> Dec 15 17:34:47 cmos-17 kernel: ib_sdp CRTL: <0> <8e01> BIND: family <2> addr <00000000:8913> Dec 15 17:34:47 cmos-17 kernel: ib_sdp CRTL: <0> <8e01> LISTEN: addr <00000000:1389> backlog <0005> Dec 15 17:34:47 cmos-17 kernel: ib_sdp CRTL: <0> <0100> ACCEPT: addr <00000000:1389> Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: event <1> commID <00000014> ID <-180208768> Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: CM REQ. comm <00000014> SID <8913010000000000> ca port <1> Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: Hello BSDH <003f:00:00:0000005c:00000000:00000000> Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: Hello HH Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: <0> <0100> ACCEPT: complete <1> <0a000702:1389><0a000701:8001> Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: <1> <2340> GETNAME: src <0a000702:1389> dst <0a000701:8001> Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: <1> <2340> GETNAME: src <0a000702:1389> dst <0a000701:8001> Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: event <4> commID <00000014> ID <1> Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: <1> <2340> CM ESTABLISHED. commID <00000014> Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: <1> <1171> Passive Establish src <0a000702:1389> dst <0a000701:8001> Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: <1> <1171> Mode request <2> from current mode. <1:1> Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: <0> <0100> ACCEPT: addr <00000000:1389> Dec 15 17:34:51 cmos-17 kernel: <7shent <1> Dec 15 17:34:51 cmos-17 kernel: 6> bytes. Dec 15 17:34:51 cmos-17 kernel: < wrid <2543> of <4096> bytplete1171> RECV BUFF, bytes <4096> Dec 15 17:34:51 cmos-17 kernel: <1171> Read complete Read complete <3888> of <4096> bytes. Dec 15 17:34:51 cmos-17 kernel: POST READ BUFF wrid <3959> of <4096> bytes. Dec 15 17:34:51 cmos-17 kernel: <<1> <1171> POST READ BUFF wrid <4186> of <4096> bytes. Dec 15 17:34:51 cmos-17 kernel: bytes. Dec 15 17:34:51 cmos-17 kernel: 357> of <4096> bytes. Dec 15 17:34:51 cmos-17 kernel: <7ST READ BUFF wrid71> Read complete <4296 DATA: <1>71> POST READib_sdp DATA: <1> <1171> POST READ BUFF wrid <4360> of <4096> bytes. Dec 15 17:34:51 cmos-17 kernel: <7id <5116> of <4096> bytes.te 1> RECV BUFF,te <5066> of <4096> byt1> RECV BUFF, bytes < UFF wr> <1171> Read>> <1171> PTA: <1> <1171> Re. Dec 15 17:34:51 cmos-17 kernel: <7 <1> <1171> Read complete <5507> of <4096> byte <1> <1171> RECV BUFF, bytes <4096> Dec 15 17:34:51 cmos-17 kernel: bytes. Dec 15 17:34:51 cmos-17 kernel: <_sdp DATA: <1> <4096> bytes. Dec 15 17:34:51 cmos-17 kernel: <7 <40, bytes <4096> Dec 15 17:34:51 cmos-17 kernel: 63> of <40<4096> Dec 15 17:34:51 cmos-17 kernel: <7te <6164> of <4096> bytes. Dec 15 17:34:51 cmos-17 kernel: 5> of <4096> bytes. Dec 15 17:34:51 cmos-17 kernel: <74096> Dec 15 17:34:51 cmos-17 kernel: <7d <6450> of <409e <6396> of <4096> bytes. Ely Levy System group Computer Science Hebrew University Jerusalem Israel From mst at mellanox.co.il Mon Dec 19 05:55:45 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 19 Dec 2005 15:55:45 +0200 Subject: [openib-general] Re: Problems with dsp In-Reply-To: References: Message-ID: <20051219135545.GA1677@mellanox.co.il> Quoting r. 
Ely Levy : > Subject: Problems with dsp > > I'm trying to run iperf with sdp using svn from few days ago It works > find > until the speed goes above 1.49GB (by making block size bigger or > running > 2 iperf) > Then I'm starting to get weird debug messages in the log and as wall > from > root. > Any idea what might cause it? > This is what I see in /var/log/message: > > Dec 15 17:34:33 cmos-17 kernel: ib_sdp INIT: SDP module load. > Dec 15 17:34:33 cmos-17 kernel: ib_sdp INIT: Initializing /proc > filesystem > entries. > Dec 15 17:34:33 cmos-17 kernel: ib_sdp INIT: Advertisment cache > initialization. > Dec 15 17:34:33 cmos-17 kernel: ib_sdp INIT: Link level services > initialization. > Dec 15 17:34:33 cmos-17 kernel: ib_sdp INIT: Main pool initialized. > Dec 15 17:34:33 cmos-17 kernel: ib_sdp INIT: Creating connection tables. > Dec 15 17:34:33 cmos-17 kernel: ib_sdp INIT: IOCB cache initialization. > Dec 15 17:34:33 cmos-17 kernel: ib_sdp INIT: Started listening for SDP > connection requests > Dec 15 17:34:33 cmos-17 kernel: NET: Registered protocol family 27 > Dec 15 17:34:47 cmos-17 kernel: ib_sdp CRTL: SOCKET: type <1> proto <0> > state <1:00000000> > Dec 15 17:34:47 cmos-17 kernel: ib_sdp CRTL: <0> <8e01> BIND: family <2> > addr <00000000:8913> > Dec 15 17:34:47 cmos-17 kernel: ib_sdp CRTL: <0> <8e01> LISTEN: addr > <00000000:1389> backlog <0005> > Dec 15 17:34:47 cmos-17 kernel: ib_sdp CRTL: <0> <0100> ACCEPT: addr > <00000000:1389> > Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: event <1> commID <00000014> > ID <-180208768> > Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: CM REQ. comm <00000014> SID > <8913010000000000> ca port <1> > Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: Hello BSDH > <003f:00:00:0000005c:00000000:00000000> > Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: Hello HH > > Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: <0> <0100> ACCEPT: complete > <1> <0a000702:1389><0a000701:8001> > Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: <1> <2340> GETNAME: src > <0a000702:1389> dst <0a000701:8001> > Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: <1> <2340> GETNAME: src > <0a000702:1389> dst <0a000701:8001> > Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: event <4> commID <00000014> > ID <1> > Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: <1> <2340> CM ESTABLISHED. > commID <00000014> > Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: <1> <1171> Passive > Establish > src <0a000702:1389> dst <0a000701:8001> > Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: <1> <1171> Mode request <2> > from current mode. <1:1> > Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: <0> <0100> ACCEPT: addr > <00000000:1389> > Dec 15 17:34:51 cmos-17 kernel: <7shent <1> > Dec 15 17:34:51 cmos-17 kernel: 6> bytes. > Dec 15 17:34:51 cmos-17 kernel: < wrid <2543> of <4096> bytplete1171> > RECV > BUFF, bytes <4096> > Dec 15 17:34:51 cmos-17 kernel: <1171> Read complete READ > BUFF w1171> Read complete <3888> of <4096> bytes. > Dec 15 17:34:51 cmos-17 kernel: POST READ BUFF wrid <3959> of <4096> > bytes. > Dec 15 17:34:51 cmos-17 kernel: <<1> <1171> POST READ BUFF wrid <4186> > of > <4096> bytes. > Dec 15 17:34:51 cmos-17 kernel: bytes. > Dec 15 17:34:51 cmos-17 kernel: 357> of <4096> bytes. > Dec 15 17:34:51 cmos-17 kernel: <7ST READ BUFF wrid71> Read complete > <4296 > DATA: <1>71> POST READib_sdp DATA: <1> > <1171> POST READ BUFF wrid <4360> of <4096> bytes. 
> Dec 15 17:34:51 cmos-17 kernel: <7id <5116> of <4096> bytes.te 1> RECV > BUFF,te <5066> of <4096> byt1> RECV BUFF, bytes < > UFF wr> <1171> Read>> <1171> PTA: <1> <1171> Re. > Dec 15 17:34:51 cmos-17 kernel: <7 <1> <1171> Read complete <5507> of > <4096> byte <1> <1171> RECV BUFF, bytes <4096> > Dec 15 17:34:51 cmos-17 kernel: bytes. > Dec 15 17:34:51 cmos-17 kernel: <_sdp DATA: <1> <4096> bytes. > Dec 15 17:34:51 cmos-17 kernel: <7 <40, bytes <4096> > Dec 15 17:34:51 cmos-17 kernel: 63> of <40<4096> > Dec 15 17:34:51 cmos-17 kernel: <7te <6164> of <4096> bytes. > Dec 15 17:34:51 cmos-17 kernel: 5> of <4096> bytes. > Dec 15 17:34:51 cmos-17 kernel: <74096> > Dec 15 17:34:51 cmos-17 kernel: <7d <6450> of <409e <6396> of <4096> > bytes. > > > Ely Levy > System group > Computer Science > Hebrew University > Jerusalem Israel Are you saying you see these when SDP is built with debug disabled? -- MST From mst at mellanox.co.il Mon Dec 19 05:58:14 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 19 Dec 2005 15:58:14 +0200 Subject: [openib-general] [PATCH applied] sdp: return ECONNREFUSED on illegal address Message-ID: <20051219135814.GB1677@mellanox.co.il> Make more ltp tests pass: attempts to connect to an illegal address should return -ECONNREFUSED. Signed-off-by: Michael S. Tsirkin Index: linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_inet.c =================================================================== --- linux-2.6.14.orig/drivers/infiniband/ulp/sdp/sdp_inet.c 2005-12-19 18:29:43.000000000 +0200 +++ linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_inet.c 2005-12-19 18:30:09.000000000 +0200 @@ -468,7 +468,7 @@ static int sdp_inet_connect(struct socke ZERONET(addr->sin_addr.s_addr) || LOCAL_MCAST(addr->sin_addr.s_addr) || INADDR_ANY == addr->sin_addr.s_addr) - return -EINVAL; + return -ECONNREFUSED; /* * lock socket */ -- MST From elylevy at cs.huji.ac.il Mon Dec 19 06:01:02 2005 From: elylevy at cs.huji.ac.il (Ely Levy) Date: Mon, 19 Dec 2005 16:01:02 +0200 (IST) Subject: [openib-general] Re: Problems with dsp In-Reply-To: <20051219135545.GA1677@mellanox.co.il> References: <20051219135545.GA1677@mellanox.co.il> Message-ID: On Mon, 19 Dec 2005, Michael S. Tsirkin wrote: > Quoting r. Ely Levy : > > Subject: Problems with dsp > > > > I'm trying to run iperf with sdp using svn from few days ago It works > > find > > until the speed goes above 1.49GB (by making block size bigger or > > running > > 2 iperf) > > Then I'm starting to get weird debug messages in the log and as wall > > from > > root. > > Any idea what might cause it? > > This is what I see in /var/log/message: > > > > Dec 15 17:34:33 cmos-17 kernel: ib_sdp INIT: SDP module load. > > Dec 15 17:34:33 cmos-17 kernel: ib_sdp INIT: Initializing /proc > > filesystem > > entries. > > Dec 15 17:34:33 cmos-17 kernel: ib_sdp INIT: Advertisment cache > > initialization. > > Dec 15 17:34:33 cmos-17 kernel: ib_sdp INIT: Link level services > > initialization. > > Dec 15 17:34:33 cmos-17 kernel: ib_sdp INIT: Main pool initialized. > > Dec 15 17:34:33 cmos-17 kernel: ib_sdp INIT: Creating connection tables. > > Dec 15 17:34:33 cmos-17 kernel: ib_sdp INIT: IOCB cache initialization.
> > Dec 15 17:34:33 cmos-17 kernel: ib_sdp INIT: Started listening for SDP > > connection requests > > Dec 15 17:34:33 cmos-17 kernel: NET: Registered protocol family 27 > > Dec 15 17:34:47 cmos-17 kernel: ib_sdp CRTL: SOCKET: type <1> proto <0> > > state <1:00000000> > > Dec 15 17:34:47 cmos-17 kernel: ib_sdp CRTL: <0> <8e01> BIND: family <2> > > addr <00000000:8913> > > Dec 15 17:34:47 cmos-17 kernel: ib_sdp CRTL: <0> <8e01> LISTEN: addr > > <00000000:1389> backlog <0005> > > Dec 15 17:34:47 cmos-17 kernel: ib_sdp CRTL: <0> <0100> ACCEPT: addr > > <00000000:1389> > > Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: event <1> commID <00000014> > > ID <-180208768> > > Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: CM REQ. comm <00000014> SID > > <8913010000000000> ca port <1> > > Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: Hello BSDH > > <003f:00:00:0000005c:00000000:00000000> > > Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: Hello HH > > > > Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: <0> <0100> ACCEPT: complete > > <1> <0a000702:1389><0a000701:8001> > > Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: <1> <2340> GETNAME: src > > <0a000702:1389> dst <0a000701:8001> > > Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: <1> <2340> GETNAME: src > > <0a000702:1389> dst <0a000701:8001> > > Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: event <4> commID <00000014> > > ID <1> > > Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: <1> <2340> CM ESTABLISHED. > > commID <00000014> > > Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: <1> <1171> Passive > > Establish > > src <0a000702:1389> dst <0a000701:8001> > > Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: <1> <1171> Mode request <2> > > from current mode. <1:1> > > Dec 15 17:34:51 cmos-17 kernel: ib_sdp CRTL: <0> <0100> ACCEPT: addr > > <00000000:1389> > > Dec 15 17:34:51 cmos-17 kernel: <7shent <1> > > Dec 15 17:34:51 cmos-17 kernel: 6> bytes. > > Dec 15 17:34:51 cmos-17 kernel: < wrid <2543> of <4096> bytplete1171> > > RECV > > BUFF, bytes <4096> > > Dec 15 17:34:51 cmos-17 kernel: <1171> Read complete > READ > > BUFF w1171> Read complete <3888> of <4096> bytes. > > Dec 15 17:34:51 cmos-17 kernel: POST READ BUFF wrid <3959> of <4096> > > bytes. > > Dec 15 17:34:51 cmos-17 kernel: <<1> <1171> POST READ BUFF wrid <4186> > > of > > <4096> bytes. > > Dec 15 17:34:51 cmos-17 kernel: bytes. > > Dec 15 17:34:51 cmos-17 kernel: 357> of <4096> bytes. > > Dec 15 17:34:51 cmos-17 kernel: <7ST READ BUFF wrid71> Read complete > > <4296 > > DATA: <1>71> POST READib_sdp DATA: <1> > > <1171> POST READ BUFF wrid <4360> of <4096> bytes. > > Dec 15 17:34:51 cmos-17 kernel: <7id <5116> of <4096> bytes.te 1> RECV > > BUFF,te <5066> of <4096> byt1> RECV BUFF, bytes < > > UFF wr> <1171> Read>> <1171> PTA: <1> <1171> Re. > > Dec 15 17:34:51 cmos-17 kernel: <7 <1> <1171> Read complete <5507> of > > <4096> byte <1> <1171> RECV BUFF, bytes <4096> > > Dec 15 17:34:51 cmos-17 kernel: bytes. > > Dec 15 17:34:51 cmos-17 kernel: <_sdp DATA: <1> <4096> bytes. > > Dec 15 17:34:51 cmos-17 kernel: <7 <40, bytes <4096> > > Dec 15 17:34:51 cmos-17 kernel: 63> of <40<4096> > > Dec 15 17:34:51 cmos-17 kernel: <7te <6164> of <4096> bytes. > > Dec 15 17:34:51 cmos-17 kernel: 5> of <4096> bytes. > > Dec 15 17:34:51 cmos-17 kernel: <74096> > > Dec 15 17:34:51 cmos-17 kernel: <7d <6450> of <409e <6396> of <4096> > > bytes. > > > > Are you saying you see these when SDP is build with debug disabled? 
No, DSP debug is enabled but it doesn't start sending those messages as wall unless you get to 1.49GB speed, it also seems to stop at that speed though I thought infiniband can get to higher bw. Also as you can see the debuging info is very messed up so it's hard to tell what was going on. > -- > MST > Ely From mst at mellanox.co.il Mon Dec 19 06:12:10 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 19 Dec 2005 16:12:10 +0200 Subject: [openib-general] Re: Problems with dsp In-Reply-To: References: Message-ID: <20051219141210.GD1677@mellanox.co.il> Quoting Ely Levy : > > > <4096> byte <1> <1171> RECV BUFF, bytes <4096> > > > Dec 15 17:34:51 cmos-17 kernel: bytes. > > > Dec 15 17:34:51 cmos-17 kernel: <_sdp DATA: <1> <4096> bytes. > > > Dec 15 17:34:51 cmos-17 kernel: <7 <40, bytes <4096> > > > Dec 15 17:34:51 cmos-17 kernel: 63> of <40<4096> > > > Dec 15 17:34:51 cmos-17 kernel: <7te <6164> of <4096> bytes. > > > Dec 15 17:34:51 cmos-17 kernel: 5> of <4096> bytes. > > > Dec 15 17:34:51 cmos-17 kernel: <74096> > > > Dec 15 17:34:51 cmos-17 kernel: <7d <6450> of <409e <6396> of <4096> > > > bytes. > > > > > > > Are you saying you see these when SDP is build with debug disabled? > > No, DSP SDP? > debug is enabled but it doesn't start sending those messages as > wall unless you get to 1.49GB speed, Weird. > it also seems to stop at that speed > though I thought infiniband can get to higher bw. It might be a good idea to disable data path debug at compile time if you are benchmarking speed. > Also as you can see the debuging info is very messed up so it's hard to > tell what was going on. Miltiple sockets running in parallel? -- MST From mst at mellanox.co.il Mon Dec 19 06:44:21 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 19 Dec 2005 16:44:21 +0200 Subject: [openib-general] [PATCH applied] return -ENOPROTOOPT on an unsupported socket option Message-ID: <20051219144421.GG1677@mellanox.co.il> Make more ltp tests pass: return -ENOPROTOOPT on an unsupported socket option. Signed-off-by: Michael S. Tsirkin Index: linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_inet.c =================================================================== --- linux-2.6.14.orig/drivers/infiniband/ulp/sdp/sdp_inet.c 2005-12-19 18:30:09.000000000 +0200 +++ linux-2.6.14/drivers/infiniband/ulp/sdp/sdp_inet.c 2005-12-19 19:22:39.000000000 +0200 @@ -1094,6 +1094,7 @@ static int sdp_inet_setopt(struct socket default: sdp_warn("SETSOCKOPT unimplemented option <%d:%d> conn <%d>.", level, optname, conn->hashent); + result = -ENOPROTOOPT; break; } -- MST From mst at mellanox.co.il Mon Dec 19 07:20:56 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 19 Dec 2005 17:20:56 +0200 Subject: [openib-general] [PATCH applied] libsdp.conf: documentation update Message-ID: <20051219152056.GA2232@mellanox.co.il> SIMPLE_LIBSDP must be set to a non-zero string. Document that. Full path in LD_PRELOAD is not a good idea for mixed 64/32 bit environments. Adding the path to LD_LIBRARY_PATH is a better idea. Document that. Signed-off-by: Michael S. Tsirkin Index: openib/src/userspace/libsdp/libsdp.conf =================================================================== --- openib/src/userspace/libsdp/libsdp.conf (revision 4466) +++ openib/src/userspace/libsdp/libsdp.conf (working copy) @@ -66,17 +66,26 @@ # match_both program netserver # match_both program sshd # -# One more function of libsdp.so, is using a simple libsdp by setting the -# envrinoment variable SIMPLE_LIBSDP. 
This definition is significantly -# simpler. It has no configuration and converts all calls to socket(2) -# with a family of AF_INET and a type of SOCK_STREAM into family of +# One more function of libsdp.so, is using a simple libsdp by setting the +# envrinoment variable SIMPLE_LIBSDP to a non-empty value. This definition is +# significantly simpler. It has no configuration and converts all calls to +# socket(2) with a family of AF_INET and a type of SOCK_STREAM into family of # AF_INET_SDP. # # libsdp.so isn't setup automatically. it can # be used in one of 2 ways: # -# 1) LD_PRELOAD environment variable. Setting this to the full path of the +# 1) LD_PRELOAD environment variable. Setting this to the name of the # library you want to use will cause it to be preloaded. -# 2) Adding the full path of the library into /etc/ld.so.preload. This will +# 2) Adding the name of the library into /etc/ld.so.preload. This will # cause the library to be preloaded for every executable that is linked # with libc. +# +# The library should be installed in a directory in which the dynamic loader +# searches for shared libraries (as specified by LD_LIBRARY_PATH, +# /etc/ld.so.conf, etc). +# Alternatively, you can specify the full path to the library that +# you want to use in LD_PRELOAD or /etc/ld.so.preload as described above. +# +# The last option cant be used if you have multiple library versions +# (e.g. 64/32 bit) and want the linker to select between them automatically. -- MST From halr at voltaire.com Mon Dec 19 07:26:45 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 19 Dec 2005 10:26:45 -0500 Subject: [openib-general] Separating SA and SM Keys Message-ID: <1135006004.4328.35130.camel@hal.voltaire.com> Hi, In order to support separate SA and SM keys and make this clearer, I propose to change ib_types.h as follows. -- Hal Index: ib_types.h =================================================================== --- ib_types.h (revision 4540) +++ ib_types.h (working copy) @@ -3618,7 +3618,7 @@ typedef struct _ib_sa_mad ib_net32_t seg_num; ib_net32_t paylen_newwin; - ib_net64_t sm_key; + ib_net64_t sa_key; ib_net16_t attr_offset; ib_net16_t resv3; From rdreier at cisco.com Mon Dec 12 09:30:46 2005 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 12 Dec 2005 09:30:46 -0800 Subject: [openib-general] Re: [Openib-promoters] Next workshop dates? Ideas for agenda??? In-Reply-To: <6.2.5.6.0.20051211171549.0203dd28@lanl.gov> (Steve Poole's message of "Sun, 11 Dec 2005 17:16:37 -0700") References: <6AB138A2AB8C8E4A98B9C0C3D52670E366C44A@mtlexch01.mtl.com> <6.2.3.4.2.20051211160126.03bc3878@mail-lc.llnl.gov> <6.2.5.6.0.20051211171549.0203dd28@lanl.gov> Message-ID: Steve> As long as they merge with the rest of the requirements for Steve> OpenIB, this is great. We will not have several different Steve> versions of OpenIB. Why do you say that? Obviously the whole point is that their requirements are different and don't merge with the requirements we have from other areas. If their requirements were already being met then they wouldn't need to participate. The idea is that we get input from as many different markets as we can so that OpenIB is as broadly applicable as possible. - R. From head.bubba at csfb.com Mon Dec 12 11:48:06 2005 From: head.bubba at csfb.com (Head Bubba) Date: Mon, 12 Dec 2005 19:48:06 -0000 Subject: [Openib-promoters] Re: [openib-general] Next workshopdates? I deas for agenda??? 
Message-ID: I agree with Woody -----Original Message----- From: Bob Woodruff [mailto:robert.j.woodruff at intel.com] Sent: Monday, December 12, 2005 2:37 PM To: 'Bill Boas'; Roland Dreier Cc: Henry Brandt; Peter Haas; Head Bubba; Peter Krey (JP Morgan) (E-mail); openib-general at openib.org; Tim Lyons (Morgan Stanley) (E-mail); openib-promoters at openib.org Subject: RE: [Openib-promoters] Re: [openib-general] Next workshopdates? Ideas for agenda??? Roland wrote, >I'm not sure I see the point in dragging everyone together in early >February. With the holidays coming, realistically we only have maybe >5 weeks to prepare a conference agenda, and I don't see that as being >enough time to set up a productive meeting. Another possibility would be to delay the workshop till early March and have it the day before IDF, as we did last fall. Thoughts ? woody From Thad at Mellanox.com Mon Dec 12 17:33:07 2005 From: Thad at Mellanox.com (Thad Omura) Date: Mon, 12 Dec 2005 17:33:07 -0800 Subject: [Openib-promoters] Re: [openib-general] Next workshop dates? Ideas for agenda??? Message-ID: <25AE7F432672D511B8DC00B0D0DF11DA060BA4C5@MTIEX01> I agree that we should have the event in early FEB that most people can attend (week of the 6th is OK, although it stinks for Superbowl fans...:) ). I see significant updates, even from our last event at the end of AUG, on all fronts: Windows (MemFree, SDP, etc.), Linux (Updates from distros, stack updates), Storage (SRP, iSER/NFSoRDMA over OpenIB directly), Virtualization, MPI, RDS, discussions about QoS, management, etc. Comments from the last event that many wanted to attend many sessions that were concurrent as we had a windows/linux/applications session all running together. The agenda will have to allow flexibility so folks can attend both while not taking too much of people's time. Maybe run the Windows and Linux together one day and applications on another day because it seemed everyone wanted to get there. It will be up to the organizers and presenters to propose an agenda that is "fresh" and reduce the number of repeat slides from previous conferences. I do see the developer community growing, and as a result, some first time attendees will be seeing some info for the first time so a minimal amount of overlap for continuity isn't going to kill anyone. Regards, THAD ======================================================= Thad Omura - Vice President of Product Marketing thad at mellanox.com Mellanox Technologies, 2900 Stender Way, Santa Clara CA 95054 Work: 408-916-0020 Mobile: 408-750-6236 Skype: tomura74 -----Original Message----- From: Bill Boas [mailto:bboas at llnl.gov] Sent: Monday, December 12, 2005 9:55 AM To: Roland Dreier Cc: Henry Brandt; Peter Haas; Head Bubba; Tziporet Koren; Peter Krey (JP Morgan) (E-mail); openib-general at openib.org; Tim Lyons (Morgan Stanley) (E-mail); openib-promoters at openib.org Subject: Re: [Openib-promoters] Re: [openib-general] Next workshop dates? Ideas for agenda??? Roland, These are all excellent perspectives, I hope others will respond with their view points.
Certainly repeating what we have heard already is not a good use of anyone's time or money but I'm under the impression that we will have made some progress toward what we want to work on next as a result of PathForward Phase 2, input from Tom Tucker and others on OpenIB iWARP integration and the HSIR meeting in NYC tomorrow. With respect to "release" of OpenIB rel 1.0, did Doug Ledford effectively do that a week or two ago? I think those of us ( including me) who originally thought OpenIB was actually going be an organization that released and supported code (like RedHat, say) had got it wrong. Now I believe that when a Linux distribution, an IB company or a Tier One OEM decides that is a version of the code that they will support, then that is a "release". OpenIB may be best utilized to try to achieve some consistency in timeframe and content amongst those who wish to "release and support" the code??? Bill. At 09:39 AM 12/12/2005, Roland Dreier wrote: > Bill> I think, subject to others input, it'll be focused on > Bill> wrapping up rel 1.0 of OpenIB, discussing what the > Bill> developers are going to focus on next and validating the > Bill> strategy for RDMA over Ethernet integration at the verbs > Bill> level to lay the foundation for one, consistent RDMA > Bill> structure in Linux, if possible. > >I'm not sure I see the point in dragging everyone together in early >February. With the holidays coming, realistically we only have maybe >5 weeks to prepare a conference agenda, and I don't see that as being >enough time to set up a productive meeting. > >In particular: > > * wrapping up rel 1.0 -- the release process for a "1.0" release has > not even started. About all we could hope to accomplish would be > to pick a release manager and tell that person to go start driving > a release, and I don't see that as a good use of face-to-face > time. It would be much better to pick someone to drive the release > and then give the release manager time to start putting the release > together before getting together, so that we have some idea of what > the real issues that need to be hashed out in person are. > > * iWARP integration -- again, not enough discussion has taken place > in advance. Until the community has a chance to really study the > proposed changes and figure out what the real difficult issues that > need to be sorted out in person are, again it's a waste of time to > meet in person. > > * discuss developers next steps -- perhaps I'm pessimistic but I > think we'll just get the same talks we've already seen twice before > at Sonoma and IDF. > >Sonoma is a short trip for me but given the number of people that will >have to come from the East coast and Israel, I think we should think >hard about whether this conference is the best use of our time. > > - R. >_______________________________________________ >openib-promoters mailing list >openib-promoters at openib.org >http://openib.org/mailman/listinfo/openib-promoters Bill Boas bboas at llnl.gov ICCD LLNL, B-453, R-2018 Wk: 925-422-4110 7000 East Ave, L-555 Cell: 925-337-2224 Livermore, CA 94551 Pgr: 877-203-2248 _______________________________________________ openib-promoters mailing list openib-promoters at openib.org http://openib.org/mailman/listinfo/openib-promoters -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From mshefty at ichips.intel.com Mon Dec 19 10:05:51 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 19 Dec 2005 10:05:51 -0800
Subject: [openib-general] Re: dev_remove in the CMA
In-Reply-To: <1134515244.3764.9.camel@trinity.austin.ammasso.com>
References: <1134515244.3764.9.camel@trinity.austin.ammasso.com>
Message-ID: <43A6F67F.4060906@ichips.intel.com>

Tom Tucker wrote:
> I don't understand the dev_remove usage in the rdma_cm_id. It looks to
> me like if the user calls rdma_resolve_addr, but never calls
> rdma_resolve_route, the device can never be removed. Is this the
> intended behavior?

Once a rdma_cm_id has been bound to a device, the user must destroy that
rdma_cm_id on device removal before the device removal can proceed.

> Is the goal to prevent the user from removing the device if the client
> is in a callback? If so, can't we just increment and decrement in the
> cma_notify_user function? I guess I just don't understand...

Users above the CMA obtain their device pointer from the CMA. The device
pointer must be valid outside of CMA callbacks, so device removal is
delayed until the user releases all resources associated with a device.
Destruction of the rdma_cm_id indicates that the user is no longer using
that device.

- Sean

From mshefty at ichips.intel.com Mon Dec 19 11:32:55 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 19 Dec 2005 11:32:55 -0800
Subject: [openib-general] [RFC] IB_AT_MOST
In-Reply-To: <000201c60281$4cf5c610$6401a8c0@infiniconsys.com>
References: <000201c60281$4cf5c610$6401a8c0@infiniconsys.com>
Message-ID: <43A70AE7.6080604@ichips.intel.com>

Fab Tillier wrote:
> I don't understand the IB_AT_MOST macro. If someone uses IB_AT_MOST( 1 ) and
> the hardware supports 4, they will get 4, which is definitely not "at most 1".
>
> I would rename it to IB_MAX, and define it as -1 or something like that.

I agree with Fab. What the user wants is the maximum that they can get,
and IB_MAX conveys this better than IB_AT_MOST.

- Sean
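A minimal sketch of what the rename being discussed could look like (the
macro name, its value, and the clamping helper are assumptions drawn from
this thread, not code from the tree):

#include <stdint.h>

/* sentinel meaning "whatever the device supports" */
#define IB_MAX ((uint32_t) -1)

/* the layer that owns the device attributes clamps the request, so a
 * plain number is honored as-is and IB_MAX becomes the device maximum */
static uint32_t resolve_limit(uint32_t requested, uint32_t device_max)
{
	return requested == IB_MAX ? device_max : requested;
}

With a sentinel like this, IB_MAX(1)-style surprises cannot happen: a
caller either asks for a specific value or explicitly asks for the
maximum.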
From sean.hefty at intel.com Mon Dec 19 12:07:23 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 19 Dec 2005 12:07:23 -0800
Subject: [openib-general] [PATCH] [CMA] support for SDP + standard protocol
In-Reply-To:
Message-ID:

>Wouldn't it make sense then, to also modify the SDP spec?

You could create a new version of the SDP headers that would match the
CMA, but I don't think that changing an existing, published version of
the headers is really an option.

- Sean

From sean.hefty at intel.com Mon Dec 19 12:18:16 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 19 Dec 2005 12:18:16 -0800
Subject: [openib-general] Re: [PATCH] sdp: replace ip_dev_find withdev_base scan (was Re: ip_dev_find resolution?)
In-Reply-To: <20051213223622.GA7173@mellanox.co.il>
Message-ID:

>> A higher level ULP could map 127.0.0.1 to a specific IP address before
>> calling the CMA, but I'm not sure that's any better.
>
>Ugh. I really would like to hide all the IPv4/IPv6 etc from ULPs.

Not sure that I understand this comment. The ULP needs to provide some
sort of address.

>> From the CMA's perspective, 127.0.0.1 could just as easily
>> map to an iWarp device as an Infiniband device.
>
>Which device to select is a difficult problem.
>I think we might be able to just punt on this for now, selecting
>an arbitrary device of an appropriate type that happens to be up.

Long term, we may be able to add some quality of service values that
help select devices that map to the same address.

>By the way, CMA seems to happily take bits out of the hardware address
>and assume that these include the gid, pkey, etc.
>Shouldn't it check the device type before doing this?

The validation comes later in the CMA, and is needed there due to hotplug
support, even if an earlier check is done. So, ib_addr assumes that the
address maps to an IB device and reads out of the hardware address. The
CMA then does a lookup against known IB devices based on the returned
values.

- Sean

From jlentini at netapp.com Mon Dec 19 12:35:57 2005
From: jlentini at netapp.com (James Lentini)
Date: Mon, 19 Dec 2005 15:35:57 -0500 (EST)
Subject: [openib-general] Re: [kDAPL]questions about the LMR creation of different types of memory
In-Reply-To: <7b2fa1820512170043y7ae0e0ccrc577733b708b6399@mail.gmail.com>
References: <7b2fa1820512080626kf4c9c23hdc3f416dcb970f6d@mail.gmail.com>
	<7b2fa1820512081742j7ef50a27kc2322cbf0e52d908@mail.gmail.com>
	<20051216180338.GC8493@esmail.cup.hp.com>
	<7b2fa1820512170043y7ae0e0ccrc577733b708b6399@mail.gmail.com>
Message-ID:

ian> As you mentioned, ULPs in OpenIB (e.g. SDP or IPoIB) are
ian> responsible for properly mapping and unmapping for DMA use.
ian> AFAIK, SDP is implemented with the IB native verbs. What about
ian> the kDAPL? In my opinion the kDAPL does not do the mapping and
ian> unmapping work. So it is the responsibility of the kernel
ian> applications using the kDAPL. Am I right?

Per the spec, it depends on the flag you pass dat_lmr_kcreate. If you
register using the DAT_MEM_TYPE_PHYSICAL flag, the memory addresses
should be physical addresses. If you register using DAT_MEM_TYPE_IA, the
addresses should be DMA (aka I/O, aka bus) addresses. Currently the code
is assuming that physical addresses are valid DMA addresses and not
performing a translation.

From ralphc at pathscale.com Mon Dec 19 12:50:27 2005
From: ralphc at pathscale.com (Ralph Campbell)
Date: Mon, 19 Dec 2005 12:50:27 -0800
Subject: [openib-general] Re: [PATCH 10/13] [RFC] ipath verbs, part 1
In-Reply-To: <20051218195922.GC31184@us.ibm.com>
References: <200512161548.zxp6FKcabEu47EnS@cisco.com>
	<200512161548.W9sJn4CLmdhnSTcH@cisco.com>
	<20051218195922.GC31184@us.ibm.com>
Message-ID: <1135025427.6397.21.camel@brick.internal.keyresearch.com>

The quick answer is the qp_list is traversed w/o the lock held in
ipath_ib_rcv(). The intent is to be able to do a lookup on the GID to
get a reference to the struct ipath_mcast and then walk the qp_list w/o
locks being held while processing the received packets at interrupt
level.

On Sun, 2005-12-18 at 11:59 -0800, Paul E. McKenney wrote:
> On Fri, Dec 16, 2005 at 03:48:55PM -0800, Roland Dreier wrote:
> > First half of ipath verbs driver
>
> Some RCU-related questions interspersed. Basic question is "where is
> the lock-free read-side traversal?"
>
> Thanx, Paul
>
> > ---
> >
> > drivers/infiniband/hw/ipath/ipath_verbs.c | 3244 +++++++++++++++++++++++++++++
> > 1 files changed, 3244 insertions(+), 0 deletions(-)
> > create mode 100644 drivers/infiniband/hw/ipath/ipath_verbs.c

...

> > +/*
> > + * Insert the multicast GID into the table and
> > + * attach the QP structure.
> > + * Return zero if both were added.
> > + * Return EEXIST if the GID was already in the table but the QP was added.
> > + * Return ESRCH if the QP was already attached and neither structure was added.
> > + */ > > +static int ipath_mcast_add(struct ipath_mcast *mcast, > > + struct ipath_mcast_qp *mqp) > > +{ > > + struct rb_node **n = &mcast_tree.rb_node; > > + struct rb_node *pn = NULL; > > + unsigned long flags; > > + > > + spin_lock_irqsave(&mcast_lock, flags); > > + > > + while (*n) { > > + struct ipath_mcast *tmcast; > > + struct ipath_mcast_qp *p; > > + int ret; > > + > > + pn = *n; > > + tmcast = rb_entry(pn, struct ipath_mcast, rb_node); > > + > > + ret = memcmp(mcast->mgid.raw, tmcast->mgid.raw, > > + sizeof(union ib_gid)); > > + if (ret < 0) { > > + n = &pn->rb_left; > > + continue; > > + } > > + if (ret > 0) { > > + n = &pn->rb_right; > > + continue; > > + } > > + > > + /* Search the QP list to see if this is already there. */ > > + list_for_each_entry_rcu(p, &tmcast->qp_list, list) { > > Given that we hold the global mcast_lock, how is RCU helping here? Its not really. I'm just trying to be consistent where ever the qp_list is traversed. > Is there a lock-free read-side traversal path somewhere that I am > missing? The lock free traversal is in ipath_ib_rcv() which is an interrupt routine. > > + if (p->qp == mqp->qp) { > > + spin_unlock_irqrestore(&mcast_lock, flags); > > + return ESRCH; > > + } > > + } > > + list_add_tail_rcu(&mqp->list, &tmcast->qp_list); > > Ditto... > > > + spin_unlock_irqrestore(&mcast_lock, flags); > > + return EEXIST; > > + } > > + > > + list_add_tail_rcu(&mqp->list, &mcast->qp_list); > > Ditto... > > > + spin_unlock_irqrestore(&mcast_lock, flags); > > + > > + atomic_inc(&mcast->refcount); > > + rb_link_node(&mcast->rb_node, pn, n); > > + rb_insert_color(&mcast->rb_node, &mcast_tree); > > + > > + spin_unlock_irqrestore(&mcast_lock, flags); > > + > > + return 0; > > +} -- Ralph Campbell From mst at mellanox.co.il Mon Dec 19 13:05:19 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 19 Dec 2005 23:05:19 +0200 Subject: [openib-general] Re: Re: [PATCH] sdp: replace ip_dev_find withdev_base scan (was Re: ip_dev_find resolution?) In-Reply-To: References: Message-ID: <20051219210518.GA2694@mellanox.co.il> Quoting r. Sean Hefty : > Subject: RE: Re: [PATCH] sdp: replace ip_dev_find withdev_base scan (was Re: ip_dev_find resolution?) > > >> A higher level ULP could map 127.0.0.1 to a specific IP address > before > >> calling the CMA, but I'm not sure that's any better. > > > >Ugh. I really would like to hide all the IPv4/IPv6 etc from ULPs. > > Not sure that I understand this comment. The ULP needs to provide some > sort of > address. Right but I hoped we could just copy sockaddr from the user to CMA, aboit looking at it in the ULP. > >> From the CMA's perspective, 127.0.0.1 could just as easily > >> map to an iWarp device as an Infiniband device. > > > >Which device to select is a difficult problem. > >I think we might be able to just punt on this for now, selecting > >an arbitrary device of an appropriate type that happens to be up. > > Long term, we may be able to add some quality of service values that > help select > devices that map to the same address. Right. So lets map 127.0.0.1 to an arbitrary local device for now? -- MST From mst at mellanox.co.il Mon Dec 19 13:42:29 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 19 Dec 2005 23:42:29 +0200 Subject: [openib-general] Re: Re: [PATCH] sdp: replace ip_dev_find withdev_base scan (was Re: ip_dev_find resolution?) 
In-Reply-To: <20051219210518.GA2694@mellanox.co.il> References: <20051219210518.GA2694@mellanox.co.il> Message-ID: <20051219214229.GB2694@mellanox.co.il> Quoting Michael S. Tsirkin : > aboit looking at it in the ULP. without looking at it in the ULP. -- MST From mst at mellanox.co.il Mon Dec 19 13:52:54 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 19 Dec 2005 23:52:54 +0200 Subject: [openib-general] outstanding patches Message-ID: <20051219215254.GC2694@mellanox.co.il> Hello, Roland! Please note that there are currently multiple outstanding patches from Mellanox, that have been posted but haven't been reviewed yet. Most of them are small, I have collected them here: https://openib.org/svn/trunk/contrib/mellanox/patches I would be especially interested to get feedback on several ipoib patches, which have been outstanding for a while now. Thanks, -- MST From thomas.duffy.99 at alumni.brown.edu Mon Dec 19 14:50:57 2005 From: thomas.duffy.99 at alumni.brown.edu (Tom Duffy) Date: Mon, 19 Dec 2005 14:50:57 -0800 Subject: [openib-general] Re: [PATCH applied] return -ENOPROTOOPT on an unsupported socket option In-Reply-To: <20051219144421.GG1677@mellanox.co.il> References: <20051219144421.GG1677@mellanox.co.il> Message-ID: <33038B33-7572-463C-B307-B5114E3243A0@alumni.brown.edu> On Dec 19, 2005, at 6:44 AM, Michael S. Tsirkin wrote: > > Make more ltp tests pass: return -ENOPROTOOPT on an unsupported > socket option. This one is a bit controversial. We had the discussion in the past about doing this, but the problem is that some applications won't run if a particular sockopt is not supported on SDP. Some socket opts don't make much sense on SDP/Infiniband, others work as intended. -tduffy From mst at mellanox.co.il Mon Dec 19 15:05:10 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 20 Dec 2005 01:05:10 +0200 Subject: [openib-general] Re: [PATCH applied] return -ENOPROTOOPT on an unsupported socket option In-Reply-To: <33038B33-7572-463C-B307-B5114E3243A0@alumni.brown.edu> References: <33038B33-7572-463C-B307-B5114E3243A0@alumni.brown.edu> Message-ID: <20051219230510.GD2694@mellanox.co.il> Quoting Tom Duffy : > > > > Make more ltp tests pass: return -ENOPROTOOPT on an unsupported > > socket option. > > This one is a bit controversial. We had the discussion in the past > about doing this, but the problem is that some applications won't run > if a particular sockopt is not supported on SDP. Some socket opts > don't make much sense on SDP/Infiniband, others work as intended. Hmm. Which option do you have in mind, specifically? The right thing, in my eyes, is to emulate a TCP socket. So we dont want to support options that TCP doesnt support. -- MST From mshefty at ichips.intel.com Mon Dec 19 15:24:51 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 19 Dec 2005 15:24:51 -0800 Subject: [openib-general] Re: [PATCH] sdp: replace ip_dev_find withdev_base scan (was Re: ip_dev_find resolution?) In-Reply-To: <20051219210518.GA2694@mellanox.co.il> References: <20051219210518.GA2694@mellanox.co.il> Message-ID: <43A74143.30106@ichips.intel.com> Michael S. Tsirkin wrote: > Right. So lets map 127.0.0.1 to an arbitrary local device for now? Sounds good for now. - Sean From mst at mellanox.co.il Mon Dec 19 15:46:57 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 20 Dec 2005 01:46:57 +0200 Subject: [openib-general] Re: [PATCH] sdp: replace ip_dev_find withdev_base scan (was Re: ip_dev_find resolution?) 
In-Reply-To: <43A74143.30106@ichips.intel.com> References: <43A74143.30106@ichips.intel.com> Message-ID: <20051219234657.GE2694@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [PATCH] sdp: replace ip_dev_find withdev_base scan (was Re: ip_dev_find resolution?) > > Michael S. Tsirkin wrote: > > Right. So lets map 127.0.0.1 to an arbitrary local device for now? > > Sounds good for now. OK, thats what the patch did. Something along the following lines would be a good fit for cma, wouldnt it: > 1. Get rid of ip_dev_find. > 2. Add support for local addresses such as 127.0.0.1, > resolving them to an arbitrary rdma device, while we are at it. > > Signed-off-by: Michael S. Tsirkin +static int tryaddrmatch(struct net_device *dev, u32 s_addr, u32 d_addr) +{ + struct in_ifaddr **ifap; + struct in_ifaddr *ifa; + struct in_device *in_dev; + int rc = -ENETUNREACH; + __be32 addr; + + if (dev->type != ARPHRD_INFINIBAND) + return rc; + + in_dev = in_dev_get(dev); + if (!in_dev) + return rc; + + addr = (ZERONET(s_addr) || LOOPBACK(s_addr)) ? d_addr : s_addr; + + /* Hack to enable using SDP on addresses such as 127.0.0.1 */ + if (ZERONET(addr) || LOOPBACK(addr)) { + rc = (dev->flags & IFF_UP) ? 0 : -ENETUNREACH; + goto done; + } + + for (ifap = &in_dev->ifa_list; (ifa = *ifap); ifap = &ifa->ifa_next) { + if (s_addr == ifa->ifa_address) { + rc = 0; + break; /* found */ + } + } + +done: + in_dev_put(in_dev); + return rc; +} + And, after resolving the route: if (dev->flags & IFF_LOOPBACK) { dev_put(dev); read_lock(&dev_base_lock); + for (dev = dev_base; dev; dev = dev->next) + if (!tryaddrmatch(dev, rt->rt_src, rt->rt_dst)) { + dev_hold(dev); + break; + } read_unlock(&dev_base_lock); } -- MST From rjwalsh at pathscale.com Mon Dec 19 16:32:55 2005 From: rjwalsh at pathscale.com (Robert Walsh) Date: Mon, 19 Dec 2005 16:32:55 -0800 Subject: [openib-general] Re: [PATCH 13/13] [RFC] ipath Kconfig and Makefile In-Reply-To: <20051218192356.GB9145@mars.ravnborg.org> References: <200512161548.MdcxE8ZQTy1yj4v1@cisco.com> <200512161548.lokgvLraSGi0enUH@cisco.com> <20051218192356.GB9145@mars.ravnborg.org> Message-ID: <1135038775.7306.18.camel@hematite.internal.keyresearch.com> > > @@ -0,0 +1,15 @@ > > +EXTRA_CFLAGS += -Idrivers/infiniband/include > If this is needed then some header files should be moved to include/rdma Actually, this is done by other IB drivers, too, so I assumed it was OK. Roland, have you any comments on this? > > + > > +ipath_core-objs := ipath_copy.o ipath_driver.o \ > > + ipath_dwordcpy.o ipath_ht400.o ipath_i2c.o ipath_layer.o \ > > + ipath_lib.o ipath_mlock.o > > + > > +ib_ipath-objs := ipath_mad.o ipath_verbs.o > > Please use: > ipath_core-y := ... > ib_ipath-y := ... > > Use of -y let you do better Kconfig selection in the makefile, and is > preferred compared to -objs No problem. Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043 -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 481 bytes Desc: This is a digitally signed message part URL: From ralphc at pathscale.com Mon Dec 19 16:38:52 2005 From: ralphc at pathscale.com (Ralph Campbell) Date: Mon, 19 Dec 2005 16:38:52 -0800 Subject: [openib-general] mthca calls ib_register_mad_agent() and implements ib_device.process_mad()? 
Message-ID: <1135039132.6397.52.camel@brick.internal.keyresearch.com>

Can someone explain why the mthca driver calls ib_register_mad_agent()
and implements ib_device.process_mad()?

It looks like the latter does the actual processing of MAD packets for
the SMA and PMA whereas the former doesn't seem to do anything except
cause the ib_mad module to be modprobe'd.

I understand the need to have ib_mad loaded before ib_mthca since the
call to ib_register_device() will cause ib_mad to create QP 0 & 1.

Normally, it looks like ib_register_mad_agent() is used to tell the
ib_mad module to send MADs to agents like the SM, CM, etc.

-- Ralph Campbell

From mshefty at ichips.intel.com Mon Dec 19 17:09:38 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 19 Dec 2005 17:09:38 -0800
Subject: [openib-general] RFC MPI and app. requirements of OpenIB
Message-ID: <43A759D2.6040101@ichips.intel.com>

I'm soliciting feedback from the MPI and other application developers
regarding which OpenIB APIs they will be targeting with their
implementations. Specifically, myself and some of the other IB
developers are interested in knowing if userspace applications will be
written to the RDMA CMA interface, the IB CM API, or some other
abstraction. This will let us focus our priorities on which features to
expose to userspace.

- Sean

From rjwalsh at pathscale.com Mon Dec 19 17:43:17 2005
From: rjwalsh at pathscale.com (Robert Walsh)
Date: Mon, 19 Dec 2005 17:43:17 -0800
Subject: [openib-general] Re: [PATCH 01/13] [RFC] ipath basic headers
In-Reply-To: <20051217123827.32f119da.akpm@osdl.org>
References: <200512161548.jRuyTS0HPMLd7V81@cisco.com>
	<200512161548.aLjaDpGm5aqk0k0p@cisco.com>
	<20051217123827.32f119da.akpm@osdl.org>
Message-ID: <1135042997.7306.26.camel@hematite.internal.keyresearch.com>

> > +#ifdef IPATH_COSIM
> > +extern __u32 sim_readl(const volatile void __iomem * addr);
> > +extern __u64 sim_readq(const volatile void __iomem * addr);
>
> The driver has a strange mixture of int32_t, s32 and __s32. s32 is
> preferred.

The cosim stuff has been nuked, as it was old code anyway. With those
functions gone, we now use int32_t (and related 8-, 16-, 32- and 64-bit
signed and unsigned versions) consistently throughout the code. We'd
prefer to keep it that way, instead of changing over to s32 and friends,
as some of our header files are used by userland programs. Unless we put
in magic typedefs for s32 and friends in userland, that won't work,
hence we use the C standard typedefs.

Is this a problem?

Regards,
Robert.

--
Robert Walsh
Email: rjwalsh at pathscale.com      PathScale, Inc.
Phone: +1 650 934 8117             2071 Stierlin Court, Suite 200
Fax: +1 650 428 1969               Mountain View, CA 94043
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 481 bytes
Desc: This is a digitally signed message part
URL:
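Robert's constraint can be shown with a small sketch (the struct below is
hypothetical, not from the ipath driver): a header compiled by both the
kernel build and userland tools can rely on the <stdint.h> names
everywhere, while s32 and friends would need extra typedefs on the
userland side.

/* hypothetical header shared between driver and userland tools */
#include <stdint.h>

struct example_port_stats {
	uint64_t rcv_bytes;	/* fixed width on 32- and 64-bit ABIs alike */
	int32_t  unit;		/* s32 here would not compile in userland
				 * unless a kernel-style typedef is supplied */
};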
From halr at voltaire.com Mon Dec 19 17:40:31 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 19 Dec 2005 20:40:31 -0500
Subject: [openib-general] mthca calls ib_register_mad_agent() and implements ib_device.process_mad()?
In-Reply-To: <1135039132.6397.52.camel@brick.internal.keyresearch.com>
References: <1135039132.6397.52.camel@brick.internal.keyresearch.com>
Message-ID: <1135042830.4328.41828.camel@hal.voltaire.com>

On Mon, 2005-12-19 at 19:38, Ralph Campbell wrote:
> Can someone explain why the mthca driver calls
> ib_register_mad_agent() and implements ib_device.process_mad()?

This is because the mthca has the agents in firmware and the driver wants
to see each MAD and give the firmware the right of first refusal on each
received MAD. There was a long thread on this in the early days of
OpenIB.

> It looks like the latter does the actual processing of MAD packets
> for the SMA and PMA

and any other agents that might be implemented in firmware. Right now it
is just SMA and PMA.

> whereas the former doesn't seem to do anything
> except cause the ib_mad module to be modprobe'd.

It also registers a send completion handler for MADs sent. Yes, it would
also have the effect of pulling in ib_mad.

> I understand the need to have ib_mad loaded before ib_mthca
> since the call to ib_register_device() will cause ib_mad
> to create QP 0 & 1.
>
> Normally, it looks like ib_register_mad_agent() is used to
> tell the ib_mad module to send MADs to agents like the
> SM, CM, etc.

If what you mean is demultiplex receive MADs, and also entities such as
SM, SMA, PMA, CM, etc., then yes (as well as handle the completions from
sending MADs which have some resources to clean up).

From ib_mad.h:

/**
 * ib_register_mad_agent - Register to send/receive MADs.
 * @device: The device to register with.
 * @port_num: The port on the specified device to use.
 * @qp_type: Specifies which QP to access. Must be either
 *   IB_QPT_SMI or IB_QPT_GSI.
 * @mad_reg_req: Specifies which unsolicited MADs should be received
 *   by the caller. This parameter may be NULL if the caller only
 *   wishes to receive solicited responses.
 * @rmpp_version: If set, indicates that the client will send
 *   and receive MADs that contain the RMPP header for the given version.
 *   If set to 0, indicates that RMPP is not used by this client.
 * @send_handler: The completion callback routine invoked after a send
 *   request has completed.
 * @recv_handler: The completion callback routine invoked for a received
 *   MAD.
 * @context: User specified context associated with the registration.
 */

-- Hal
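Pieced together from the comment above, a registration might look like
the sketch below (the management class, method bit, and handler bodies
are illustrative, not taken from mthca):

#include <rdma/ib_mad.h>

static void example_send_handler(struct ib_mad_agent *agent,
                                 struct ib_mad_send_wc *send_wc)
{
	/* clean up resources tied to the completed send */
}

static void example_recv_handler(struct ib_mad_agent *agent,
                                 struct ib_mad_recv_wc *recv_wc)
{
	/* consume the MAD, then release it back to the MAD layer */
	ib_free_recv_mad(recv_wc);
}

static struct ib_mad_agent *example_register(struct ib_device *device,
                                             u8 port)
{
	struct ib_mad_reg_req req = {
		.mgmt_class         = IB_MGMT_CLASS_PERF_MGMT,
		.mgmt_class_version = 1,
	};

	/* ask to receive unsolicited Get requests of this class */
	set_bit(IB_MGMT_METHOD_GET, req.method_mask);

	return ib_register_mad_agent(device, port, IB_QPT_GSI, &req, 0,
				     example_send_handler,
				     example_recv_handler, NULL);
}

In mthca's case, per Hal's description above, no unsolicited registration
is needed at all; only the send completion handler (and the side effect
of pulling in ib_mad) matters.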
From yael at mellanox.co.il Mon Dec 19 23:11:12 2005
From: yael at mellanox.co.il (Yael Kalka)
Date: Tue, 20 Dec 2005 09:11:12 +0200
Subject: [openib-general] RE: [PATCH] Opensm - fix segfault on exit - cont.
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30E24BE@mtlexch01.mtl.com>

Hi Hal,
The compiler version is gcc (GCC) 4.0.2
As you see from the patch - something strange is going on with the
compiler (probably). I have a pointer whose value is NULL, but the code
still enters the if statement and doesn't treat the pointer as zero.
There are many more places where we have if statements on pointers
similar to this case, and this compiler change can be very problematic.
Do you/anyone else know about this change in the gcc version? Is this
behavior controlled by some flag?
Thanks,
Yael

-----Original Message-----
From: Hal Rosenstock [mailto:halr at voltaire.com]
Sent: Monday, December 19, 2005 1:42 PM
To: Yael Kalka
Cc: openib-general at openib.org; Eitan Zahavi
Subject: Re: [PATCH] Opensm - fix segfault on exit - cont.

Hi Yael,

On Mon, 2005-12-19 at 04:19, Yael Kalka wrote:
> Hi Hal,
>
> I've noticed that under certain operating systems,

Seems more likely like a compiler difference rather than OS difference.
Can you mention on which distribution/compiler this made a difference ?

> when driver isn't
> loaded, the SM still exits with segfault.
> The following patch fixes this.
>
> Thanks,
> Yael
>
> Signed-off-by: Yael Kalka
>
> Index: libvendor/osm_vendor_ibumad.c
> ===================================================================
> --- libvendor/osm_vendor_ibumad.c (revision 4522)
> +++ libvendor/osm_vendor_ibumad.c (working copy)
> @@ -552,7 +552,7 @@ osm_vendor_delete(
>
>   /* umad receiver thread ? */
>   p_ur = (*pp_vend)->receiver;
> - if (&p_ur->signal)
> + if (&p_ur->signal != NULL)

If this makes a difference, there are other uses of similar syntax which
should be changed :-(

>     cl_event_destroy( &p_ur->signal );
>   cl_spinlock_destroy( &(*pp_vend)->cb_lock );
>   cl_spinlock_destroy( &(*pp_vend)->match_tbl_lock );

-- Hal

From eitan at mellanox.co.il Tue Dec 20 00:17:35 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 20 Dec 2005 10:17:35 +0200
Subject: [openib-general] RE: Separating SA and SM Keys
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3618B69@mtlexch01.mtl.com>

Hi Hal,

I think we need to stick to the IB spec terminology.
Since the spec did not change and only added some note describing the
change we probably just need to add such a comment too.
I can see people looking at the sa_key and trying to find it in the
spec...

Eitan

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Monday, December 19, 2005 5:27 PM
> To: Eitan Zahavi; Yael Kalka
> Cc: openib-general at openib.org
> Subject: Separating SA and SM Keys
>
> Hi,
>
> In order to support separate SA and SM keys and make this clearer, I
> propose to change ib_types.h as follows.
>
> -- Hal
>
> Index: ib_types.h
> ===================================================================
> --- ib_types.h	(revision 4540)
> +++ ib_types.h	(working copy)
> @@ -3618,7 +3618,7 @@ typedef struct _ib_sa_mad
>  	ib_net32_t seg_num;
>  	ib_net32_t paylen_newwin;
>
> -	ib_net64_t sm_key;
> +	ib_net64_t sa_key;
>
>  	ib_net16_t attr_offset;
>  	ib_net16_t resv3;
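If the on-the-wire structure keeps the spec's spelling, the clarification
Eitan suggests could be as small as an in-place comment rather than a
rename (a sketch of that option only, not an agreed patch):

	ib_net64_t sm_key; /* in SA-class MADs this field carries the SA key */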
From halr at voltaire.com Tue Dec 20 03:45:12 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 20 Dec 2005 06:45:12 -0500
Subject: [openib-general] RE: Separating SA and SM Keys
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3618B69@mtlexch01.mtl.com>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E3618B69@mtlexch01.mtl.com>
Message-ID: <1135078986.4328.49300.camel@hal.voltaire.com>

Hi Eitan,

On Tue, 2005-12-20 at 03:17, Eitan Zahavi wrote:
> Hi Hal,
>
> I think we need to stick to the IB spec terminology.
> Since the spec did not change and only added some note describing the
> change we probably just need to add such a comment too.

Ideally the spec would have been changed.

> I can see people looking at the sa_key and trying to find it in the
> spec...

How do you propose handling the setting of the 2 keys differently ? They
need different names at least for configuration purposes.

-- Hal

> Eitan
>
> Eitan Zahavi
> Design Technology Director
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
>
>
> > -----Original Message-----
> > From: Hal Rosenstock [mailto:halr at voltaire.com]
> > Sent: Monday, December 19, 2005 5:27 PM
> > To: Eitan Zahavi; Yael Kalka
> > Cc: openib-general at openib.org
> > Subject: Separating SA and SM Keys
> >
> > Hi,
> >
> > In order to support separate SA and SM keys and make this clearer, I
> > propose to change ib_types.h as follows.
> >
> > -- Hal
> >
> > Index: ib_types.h
> > ===================================================================
> > --- ib_types.h	(revision 4540)
> > +++ ib_types.h	(working copy)
> > @@ -3618,7 +3618,7 @@ typedef struct _ib_sa_mad
> >  	ib_net32_t seg_num;
> >  	ib_net32_t paylen_newwin;
> >
> > -	ib_net64_t sm_key;
> > +	ib_net64_t sa_key;
> >
> >  	ib_net16_t attr_offset;
> >  	ib_net16_t resv3;

From eitan at mellanox.co.il Tue Dec 20 04:05:51 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 20 Dec 2005 14:05:51 +0200
Subject: [openib-general] RE: Separating SA and SM Keys
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3618B70@mtlexch01.mtl.com>

Hi Hal,

OpenSM can support 2 configuration flags. This is different than the wire
protocol described in ib_types.h which should follow the IB spec.

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Tuesday, December 20, 2005 1:45 PM
> To: Eitan Zahavi
> Cc: Yael Kalka; openib-general at openib.org
> Subject: RE: Separating SA and SM Keys
>
> Hi Eitan,
>
> On Tue, 2005-12-20 at 03:17, Eitan Zahavi wrote:
> > Hi Hal,
> >
> > I think we need to stick to the IB spec terminology.
> > Since the spec did not change and only added some note describing the
> > change we probably just need to add such a comment too.
>
> Ideally the spec would have been changed.
>
> > I can see people looking at the sa_key and trying to find it in the
> > spec...
>
> How do you propose handling the setting of the 2 keys differently ? They
> need different names at least for configuration purposes.
>
> -- Hal
>
> > Eitan
> >
> > Eitan Zahavi
> > Design Technology Director
> > Mellanox Technologies LTD
> > Tel:+972-4-9097208
> > Fax:+972-4-9593245
> > P.O.
Box 586 Yokneam 20692 ISRAEL > > > > > > > -----Original Message----- > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > Sent: Monday, December 19, 2005 5:27 PM > > > To: Eitan Zahavi; Yael Kalka > > > Cc: openib-general at openib.org > > > Subject: Separating SA and SM Keys > > > > > > Hi, > > > > > > In order to support separate SA and SM keys and make this clearer, I > > > propose to change ib_types.h as follows. > > > > > > -- Hal > > > > > > Index: ib_types.h > > > =================================================================== > > > --- ib_types.h (revision 4540) > > > +++ ib_types.h (working copy) > > > @@ -3618,7 +3618,7 @@ typedef struct _ib_sa_mad > > > ib_net32_t seg_num; > > > ib_net32_t paylen_newwin; > > > > > > - ib_net64_t sm_key; > > > + ib_net64_t sa_key; > > > > > > ib_net16_t attr_offset; > > > ib_net16_t resv3; From mst at mellanox.co.il Tue Dec 20 04:55:13 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 20 Dec 2005 14:55:13 +0200 Subject: [openib-general] [PATCH repost] option to set node description in sysfs Message-ID: <20051220125513.GC2366@mellanox.co.il> Here's an updated version of the node description patch. changes: Updated to svn rev 4044 Make sure the whole 64 byte description is initialized before passing it to hardware. Signed-off-by: Michael S. Tsirkin This patch does a few things: - Adds node_guid and node_desc fields to struct ib_device - Has mthca set these fields on startup - Extends modify_device method to handle setting node_desc - Exposes node_desc in sysfs - Allows userspace to set node_desc by writing into sysfs file, eg. echo -n `hostname` >> /sys/class/linux-kernel/drivers/infiniband/mthca0/node_desc This should probably be combined with Sean's work to get rid of node_guid queries in ULPs. Comments? - R. 
Index: linux-2.6.14/drivers/infiniband/core/sysfs.c =================================================================== --- linux-2.6.14/drivers/infiniband/core/sysfs.c (revision 4042) +++ linux-2.6.14/drivers/infiniband/core/sysfs.c (working copy) @@ -637,14 +637,42 @@ be16_to_cpu(((__be16 *) &attr.node_guid)[3])); } +static ssize_t show_node_desc(struct class_device *cdev, char *buf) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + + return sprintf(buf, "%.64s\n", dev->node_desc); +} + +static ssize_t set_node_desc(struct class_device *cdev, const char *buf, + size_t count) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + struct ib_device_modify desc = {}; + int ret; + + if (!dev->modify_device) + return -EIO; + + memcpy(desc.node_desc, buf, min_t(int, count, 64)); + ret = ib_modify_device(dev, IB_DEVICE_MODIFY_NODE_DESC, &desc); + if (ret) + return ret; + + return count; +} + static CLASS_DEVICE_ATTR(node_type, S_IRUGO, show_node_type, NULL); static CLASS_DEVICE_ATTR(sys_image_guid, S_IRUGO, show_sys_image_guid, NULL); static CLASS_DEVICE_ATTR(node_guid, S_IRUGO, show_node_guid, NULL); +static CLASS_DEVICE_ATTR(node_desc, S_IRUGO | S_IWUSR, show_node_desc, + set_node_desc); static struct class_device_attribute *ib_class_attributes[] = { &class_device_attr_node_type, &class_device_attr_sys_image_guid, - &class_device_attr_node_guid + &class_device_attr_node_guid, + &class_device_attr_node_desc }; static struct class ib_class = { Index: linux-2.6.14/drivers/infiniband/hw/mthca/mthca_provider.c =================================================================== --- linux-2.6.14/drivers/infiniband/hw/mthca/mthca_provider.c (revision 4042) +++ linux-2.6.14/drivers/infiniband/hw/mthca/mthca_provider.c (working copy) @@ -177,6 +177,23 @@ return err; } +static int mthca_modify_device(struct ib_device *ibdev, + int mask, + struct ib_device_modify *props) +{ + if (mask & ~IB_DEVICE_MODIFY_NODE_DESC) + return -EOPNOTSUPP; + + if (mask & IB_DEVICE_MODIFY_NODE_DESC) { + if (down_interruptible(&to_mdev(ibdev)->cap_mask_mutex)) + return -ERESTARTSYS; + memcpy(ibdev->node_desc, props->node_desc, 64); + up(&to_mdev(ibdev)->cap_mask_mutex); + } + + return 0; +} + static int mthca_modify_port(struct ib_device *ibdev, u8 port, int port_modify_mask, struct ib_port_modify *props) @@ -1071,6 +1088,20 @@ goto out; init_query_mad(in_mad); + in_mad->attr_id = IB_SMP_ATTR_NODE_DESC; + + err = mthca_MAD_IFC(dev, 1, 1, + 1, NULL, NULL, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(dev->ib_dev.node_desc, out_mad->data, 64); + in_mad->attr_id = IB_SMP_ATTR_NODE_INFO; err = mthca_MAD_IFC(dev, 1, 1, @@ -1129,6 +1160,7 @@ dev->ib_dev.class_dev.dev = &dev->pdev->dev; dev->ib_dev.query_device = mthca_query_device; dev->ib_dev.query_port = mthca_query_port; + dev->ib_dev.modify_device = mthca_modify_device; dev->ib_dev.modify_port = mthca_modify_port; dev->ib_dev.query_pkey = mthca_query_pkey; dev->ib_dev.query_gid = mthca_query_gid; Index: linux-2.6.14/drivers/infiniband/hw/mthca/mthca_mad.c =================================================================== --- linux-2.6.14/drivers/infiniband/hw/mthca/mthca_mad.c (revision 4042) +++ linux-2.6.14/drivers/infiniband/hw/mthca/mthca_mad.c (working copy) @@ -106,6 +106,19 @@ } } +static void node_desc_override(struct ib_device *dev, + struct ib_mad *mad) +{ + if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + mad->mad_hdr.mgmt_class 
== IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && + mad->mad_hdr.method == IB_MGMT_METHOD_GET_RESP && + mad->mad_hdr.attr_id == IB_SMP_ATTR_NODE_DESC) { + down(&to_mdev(dev)->cap_mask_mutex); + memcpy(((struct ib_smp *) mad)->data, dev->node_desc, 64); + up(&to_mdev(dev)->cap_mask_mutex); + } +} + static void forward_trap(struct mthca_dev *dev, u8 port_num, struct ib_mad *mad) @@ -204,8 +217,10 @@ return IB_MAD_RESULT_FAILURE; } - if (!out_mad->mad_hdr.status) + if (!out_mad->mad_hdr.status) { smp_snoop(ibdev, port_num, in_mad); + node_desc_override(ibdev, out_mad); + } /* set return bit in status of directed route responses */ if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) Index: linux-2.6.14/drivers/infiniband/include/rdma/ib_verbs.h =================================================================== --- linux-2.6.14/drivers/infiniband/include/rdma/ib_verbs.h (revision 4044) +++ linux-2.6.14/drivers/infiniband/include/rdma/ib_verbs.h (working copy) @@ -231,11 +231,13 @@ }; enum ib_device_modify_flags { - IB_DEVICE_MODIFY_SYS_IMAGE_GUID = 1 + IB_DEVICE_MODIFY_SYS_IMAGE_GUID = 1 << 0, + IB_DEVICE_MODIFY_NODE_DESC = 1 << 1 }; struct ib_device_modify { u64 sys_image_guid; + char node_desc[64]; }; enum ib_port_modify_flags { @@ -959,6 +961,7 @@ u64 uverbs_cmd_mask; int uverbs_abi_ver; + char node_desc[64]; __be64 node_guid; u8 node_type; u8 phys_port_cnt; -- MST From halr at voltaire.com Tue Dec 20 05:17:27 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Dec 2005 08:17:27 -0500 Subject: [openib-general] RE: [PATCH] Opensm - fix segfault on exit - cont. In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30E24BE@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30E24BE@mtlexch01.mtl.com> Message-ID: <1135082027.4328.49841.camel@hal.voltaire.com> Hi Yael, On Tue, 2005-12-20 at 02:11, Yael Kalka wrote: > Hi Hal, > The compiler version is gcc (GCC) 4.0.2 > As you see from the patch - something strange is going on with the > compiler(probably). > I have a pointer that its value is null, but it still enters the if > statement, and > doesn't handle it as zero. > There are many more places where we have if statements on pointers > similar to this > case, and this compiler change can be very problematic. > Do you/anyone else know about this change in the gcc version? Is this > behavior > controlled by some flag? I haven't seen this (but only used up to gcc 4.0.0) and don't know about this change. Have you searched on the web for this ? -- Hal > Thanks, > Yael > > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Monday, December 19, 2005 1:42 PM > To: Yael Kalka > Cc: openib-general at openib.org; Eitan Zahavi > Subject: Re: [PATCH] Opensm - fix segfault on exit - cont. > > > Hi Yael, > > On Mon, 2005-12-19 at 04:19, Yael Kalka wrote: > > Hi Hal, > > > > I've noticed that under certain operating systems, > > Seems more likely like a compiler difference rather than OS difference. > Can you mention on which distribution/compiler this made a difference ? > > > when driver isn't > > loaded, the SM still exits with segfault. > > The following patch fixes this. > > > > Thanks, > > Yael > > > > Signed-off-by: Yael Kalka > > > > Index: libvendor/osm_vendor_ibumad.c > > =================================================================== > > --- libvendor/osm_vendor_ibumad.c (revision 4522) > > +++ libvendor/osm_vendor_ibumad.c (working copy) > > @@ -552,7 +552,7 @@ osm_vendor_delete( > > > > /* umad receiver thread ? 
*/
> > p_ur = (*pp_vend)->receiver;
> > - if (&p_ur->signal)
> > + if (&p_ur->signal != NULL)
>
> If this makes a difference, there are other uses of similar syntax which
> should be changed :-(
>
> > cl_event_destroy( &p_ur->signal );
> > cl_spinlock_destroy( &(*pp_vend)->cb_lock );
> > cl_spinlock_destroy( &(*pp_vend)->match_tbl_lock );

-- Hal

From halr at voltaire.com Tue Dec 20 05:38:22 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 20 Dec 2005 08:38:22 -0500
Subject: [openib-general] [PATCH] OpenSM: Extend default transaction timeout from 100 msec to 1 second
Message-ID: <1135085900.4328.50504.camel@hal.voltaire.com>

OpenSM: Extend default transaction timeout from 100 msec to 1 second.
With the advent of long distance IB and software SMAs, 100 msec is no
longer adequate as a default transaction timeout. Increase this to 1
second so that the default is sufficient in most common cases.

Signed-off-by: Hal Rosenstock

Index: include/opensm/osm_base.h
===================================================================
--- include/opensm/osm_base.h	(revision 4549)
+++ include/opensm/osm_base.h	(working copy)
@@ -246,7 +246,7 @@ BEGIN_C_DECLS
 *
 * SYNOPSIS
 */
-#define OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC 100
+#define OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC 1000
 /***********/

 /****d* OpenSM: Base/OSM_DEFAULT_SUBNET_TIMEOUT

Index: opensm/main.c
===================================================================
--- opensm/main.c	(revision 4549)
+++ opensm/main.c	(working copy)
@@ -153,7 +153,7 @@ show_usage(void)
 " used for transaction timeouts.\n"
 " Specifying -t 0 disables timeouts.\n"
 " Without -t, OpenSM defaults to a timeout value of\n"
-" 100 milliseconds.\n\n" );
+" 1 second (1000 milliseconds).\n\n" );
 printf( "-maxsmps \n"
 " This option specifies the number of VL15 SMP MADs\n"
 " allowed on the wire at any one time.\n"

From halr at voltaire.com Tue Dec 20 05:42:49 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 20 Dec 2005 08:42:49 -0500
Subject: [openib-general] [PATCH repost] option to set node description in sysfs
In-Reply-To: <20051220125513.GC2366@mellanox.co.il>
References: <20051220125513.GC2366@mellanox.co.il>
Message-ID: <1135085984.4328.50518.camel@hal.voltaire.com>

On Tue, 2005-12-20 at 07:55, Michael S. Tsirkin wrote:
> Here's an updated version of the node description patch.
> changes:

I agree that being able to tailor the NodeDescription is a good thing
but how does an SM know that this has changed ?

Without this aspect covered as well, this is only a partial solution.

-- Hal

From mst at mellanox.co.il Tue Dec 20 06:25:23 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 20 Dec 2005 16:25:23 +0200
Subject: [openib-general] [PATCH repost] option to set node description in sysfs
In-Reply-To: <1135085984.4328.50518.camel@hal.voltaire.com>
References: <1135085984.4328.50518.camel@hal.voltaire.com>
Message-ID: <20051220142523.GE2366@mellanox.co.il>

Quoting r. Hal Rosenstock :
> Subject: Re: [openib-general] [PATCH repost] option to set node description in sysfs
>
> On Tue, 2005-12-20 at 07:55, Michael S. Tsirkin wrote:
> > Here's an updated version of the node description patch.
> > changes:
>
> I agree that being able to tailor the NodeDescription is a good thing
> but how does an SM know that this has changed ?

Why does SM need this string? To print it out in the log file?

I want to address a simple need: I need a tool that can
figure out the host name/hca from the GUID, without keeping
a database in some file.
So if I hit the race window while the system is booting, I know
that its booting since the description does not have the format
that I set from userspace, and can simply re-run the tool.

> Without this aspect covered as well, this is only a partial solution.

Its good enough for what I want it for, and it seems that no one proposed
anything better so far.

It should be easy to invent some kind of trap for this, but I dont think
anything in the spec matches.

-- MST

From halr at voltaire.com Tue Dec 20 06:44:16 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 20 Dec 2005 09:44:16 -0500
Subject: [openib-general] [PATCH repost] option to set node description in sysfs
In-Reply-To: <20051220142523.GE2366@mellanox.co.il>
References: <1135085984.4328.50518.camel@hal.voltaire.com>
	<20051220142523.GE2366@mellanox.co.il>
Message-ID: <1135089855.4328.51171.camel@hal.voltaire.com>

On Tue, 2005-12-20 at 09:25, Michael S. Tsirkin wrote:
> Quoting r. Hal Rosenstock :
> > Subject: Re: [openib-general] [PATCH repost] option to set node description in sysfs
> >
> > On Tue, 2005-12-20 at 07:55, Michael S. Tsirkin wrote:
> > > Here's an updated version of the node description patch.
> > > changes:
> >
> > I agree that being able to tailor the NodeDescription is a good thing
> > but how does an SM know that this has changed ?
>
> Why does SM need this string? To print it out in the log file?

It would be used for (upper layer) management and in a similar vein to
how you want to use this (for some diag tool). One could argue that
without this, there is no need for an SM to obtain NodeDescription (e.g.
OpenSM).

> I want to address a simple need: I need a tool that can
> figure out the host name/hca from the GUID, without keeping
> a database in some file.
>
> So if I hit the race window while the system is booting, I know
> that its booting since the description does not have the format
> that I set from userspace, and can simply re-run the tool.
>
> > Without this aspect covered as well, this is only a partial solution.
>
> Its good enough for what I want it for, and it seems that no one proposed
> anything better so far.
>
> It should be easy to invent some kind of trap for this, but I dont think
> anything in the spec matches.

So let's define and implement a trap for this but I don't see a way to
extend the SM traps in this manner. Do you ?

-- Hal

From mst at mellanox.co.il Tue Dec 20 08:29:32 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 20 Dec 2005 18:29:32 +0200
Subject: [openib-general] [PATCH] address handle references in ipoib_multicast.c
Message-ID: <20051220162932.GH2366@mellanox.co.il>

Multiple ipoib_neigh structures on mcast->neigh_list may point to the
same ah. Handle this in ipoib_multicast.c, in the same way as it is
handled in ipoib_main.c for struct ipoib_path.

Signed-off-by: Eli Cohen
Signed-off-by: Michael S.
Tsirkin Index: openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c (revision 4523) +++ openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c (working copy) @@ -95,8 +95,6 @@ static void ipoib_mcast_free(struct ipoi struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_neigh *neigh, *tmp; unsigned long flags; - LIST_HEAD(ah_list); - struct ipoib_ah *ah, *tah; ipoib_dbg_mcast(netdev_priv(dev), "deleting multicast group " IPOIB_GID_FMT "\n", @@ -105,8 +103,14 @@ static void ipoib_mcast_free(struct ipoi spin_lock_irqsave(&priv->lock, flags); list_for_each_entry_safe(neigh, tmp, &mcast->neigh_list, list) { + /* + * It's safe to call ipoib_put_ah() inside priv->lock + * here, because we know that mcast->ah will always + * hold one more reference, so ipoib_put_ah() will + * never do more than decrement the ref count. + */ if (neigh->ah) - list_add_tail(&neigh->ah->list, &ah_list); + ipoib_put_ah(neigh->ah); *to_ipoib_neigh(neigh->neighbour) = NULL; neigh->neighbour->ops->destructor = NULL; kfree(neigh); @@ -114,9 +118,6 @@ static void ipoib_mcast_free(struct ipoi spin_unlock_irqrestore(&priv->lock, flags); - list_for_each_entry_safe(ah, tah, &ah_list, list) - ipoib_put_ah(ah); - if (mcast->ah) ipoib_put_ah(mcast->ah); -- MST From halr at voltaire.com Tue Dec 20 09:28:24 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Dec 2005 12:28:24 -0500 Subject: [openib-general] [PATCH repost] option to set node description in sysfs In-Reply-To: <1135089855.4328.51171.camel@hal.voltaire.com> References: <1135085984.4328.50518.camel@hal.voltaire.com> <20051220142523.GE2366@mellanox.co.il> <1135089855.4328.51171.camel@hal.voltaire.com> Message-ID: <1135099703.4328.52741.camel@hal.voltaire.com> On Tue, 2005-12-20 at 09:44, Hal Rosenstock wrote: > It would be used for (upper layer) management and in a similar vein to > how you want to use this (for some diag tool). One could argue that > without this, there is no need for an SM to obtain NodeDescription (e.g. > OpenSM). More significantly, it is available via SA NodeRecord so is not just in the log file and is available to any node on the network via SA queries. This is indeed used by various SA clients. -- Hal From mst at mellanox.co.il Tue Dec 20 11:01:31 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 20 Dec 2005 21:01:31 +0200 Subject: [openib-general] [PATCH repost 1 of 2] ipoib: neighbour issues Message-ID: <20051220190131.GA14252@mellanox.co.il> This series includes two patches I posted previously: they address two independent issues but happen to affect adjacent lines in ipoib_multicast.c I also renamed the global list of neighbours to ipoib_all_neigh_list. --- Multiple ipoib_neigh structures on mcast->neigh_list may point to the same ah. Handle this in ipoib_multicast.c, in the same way as it is handled in ipoib_main.c for struct ipoib_path. Signed-off-by: Michael S. 
Tsirkin Index: openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- openib.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2005-12-20 20:24:50.000000000 +0200 +++ openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2005-12-20 20:24:44.000000000 +0200 @@ -95,8 +95,6 @@ static void ipoib_mcast_free(struct ipoi struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_neigh *neigh, *tmp; unsigned long flags; - LIST_HEAD(ah_list); - struct ipoib_ah *ah, *tah; ipoib_dbg_mcast(netdev_priv(dev), "deleting multicast group " IPOIB_GID_FMT "\n", @@ -105,8 +103,14 @@ static void ipoib_mcast_free(struct ipoi spin_lock_irqsave(&priv->lock, flags); list_for_each_entry_safe(neigh, tmp, &mcast->neigh_list, list) { + /* + * It's safe to call ipoib_put_ah() inside priv->lock + * here, because we know that mcast->ah will always + * hold one more reference, so ipoib_put_ah() will + * never do more than decrement the ref count. + */ if (neigh->ah) - list_add_tail(&neigh->ah->list, &ah_list); + ipoib_put_ah(neigh->ah); *to_ipoib_neigh(neigh->neighbour) = NULL; neigh->neighbour->ops->destructor = NULL; kfree(neigh); @@ -114,9 +118,6 @@ static void ipoib_mcast_free(struct ipoi spin_unlock_irqrestore(&priv->lock, flags); - list_for_each_entry_safe(ah, tah, &ah_list, list) - ipoib_put_ah(ah); - if (mcast->ah) ipoib_put_ah(mcast->ah); -- MST From mst at mellanox.co.il Tue Dec 20 11:03:10 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 20 Dec 2005 21:03:10 +0200 Subject: [openib-general] [PATCH repost 2 of 2] ipoib: neighbour issues Message-ID: <20051220190310.GA14267@mellanox.co.il> This series includes two patches I posted previously: they address two independent issues but happen to affect adjacent lines in ipoib_multicast.c I also renamed the global list of neighbours to ipoib_all_neigh_list. --- IPoIB uses neighbour ops->destructor to clean up struct ipoib_neigh, but ignores the fact that multiple neighbour objects can share the same ops structure, so setting it to NULL affects multiple neighbours. Fix this, by tracking all ipoib_neigh objects, and only cleaning destructor after no neighbour is going to use it. Note that ops structure isnt per device, so we track them in a global list. Signed-off-by: Michael S. 
Tsirkin Index: openib/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- openib.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2005-12-20 20:25:55.000000000 +0200 +++ openib/drivers/infiniband/ulp/ipoib/ipoib_main.c 2005-12-20 20:28:19.000000000 +0200 @@ -71,6 +71,9 @@ static const u8 ipv4_bcast_addr[] = { struct workqueue_struct *ipoib_workqueue; +static spinlock_t ipoib_all_neigh_list_lock; +static LIST_HEAD(ipoib_all_neigh_list); + static void ipoib_add_one(struct ib_device *device); static void ipoib_remove_one(struct ib_device *device); @@ -244,9 +247,8 @@ static void path_free(struct net_device */ if (neigh->ah) ipoib_put_ah(neigh->ah); - *to_ipoib_neigh(neigh->neighbour) = NULL; - neigh->neighbour->ops->destructor = NULL; - kfree(neigh); + + ipoib_neigh_free(neigh); } spin_unlock_irqrestore(&priv->lock, flags); @@ -474,7 +476,7 @@ static void neigh_add_path(struct sk_buf struct ipoib_path *path; struct ipoib_neigh *neigh; - neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + neigh = ipoib_neigh_alloc(skb->dst->neighbour); if (!neigh) { ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); @@ -482,8 +484,6 @@ static void neigh_add_path(struct sk_buf } skb_queue_head_init(&neigh->queue); - neigh->neighbour = skb->dst->neighbour; - *to_ipoib_neigh(skb->dst->neighbour) = neigh; /* * We can only be called from ipoib_start_xmit, so we're @@ -526,11 +526,8 @@ static void neigh_add_path(struct sk_buf return; err: - *to_ipoib_neigh(skb->dst->neighbour) = NULL; list_del(&neigh->list); - neigh->neighbour->ops->destructor = NULL; - kfree(neigh); - + ipoib_neigh_free(neigh); ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); @@ -757,8 +754,7 @@ static void ipoib_neigh_destructor(struc if (neigh->ah) ah = neigh->ah; list_del(&neigh->list); - *to_ipoib_neigh(n) = NULL; - kfree(neigh); + ipoib_neigh_free(neigh); } spin_unlock_irqrestore(&priv->lock, flags); @@ -767,23 +763,45 @@ static void ipoib_neigh_destructor(struc ipoib_put_ah(ah); } -static int ipoib_neigh_setup(struct neighbour *neigh) +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour) { + struct ipoib_neigh *neigh; + + neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + if (!neigh) + return NULL; + + neigh->neighbour = neighbour; + *to_ipoib_neigh(neighbour) = neigh; + /* * Is this kosher? I can't find anybody in the kernel that * sets neigh->destructor, so we should be able to set it here * without trouble. 
*/ - neigh->ops->destructor = ipoib_neigh_destructor; - - return 0; + spin_lock(&ipoib_all_neigh_list_lock); + list_add_tail(&neigh->all_neigh_list, &ipoib_all_neigh_list); + neigh->neighbour->ops->destructor = ipoib_neigh_destructor; + spin_unlock(&ipoib_all_neigh_list_lock); + return neigh; } -static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) +void ipoib_neigh_free(struct ipoib_neigh *neigh) { - parms->neigh_setup = ipoib_neigh_setup; + struct ipoib_neigh *n; - return 0; + spin_lock(&ipoib_all_neigh_list_lock); + list_del(&neigh->all_neigh_list); + + list_for_each_entry(n, &ipoib_all_neigh_list, all_neigh_list) + if (n->neighbour->ops == neigh->neighbour->ops) + goto found; + + neigh->neighbour->ops->destructor = NULL; +found: + spin_unlock(&ipoib_all_neigh_list_lock); + *to_ipoib_neigh(neigh->neighbour) = NULL; + kfree(neigh); } int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port) @@ -859,7 +877,6 @@ static void ipoib_setup(struct net_devic dev->tx_timeout = ipoib_timeout; dev->hard_header = ipoib_hard_header; dev->set_multicast_list = ipoib_set_mcast_list; - dev->neigh_setup = ipoib_neigh_setup_dev; dev->watchdog_timeo = HZ; @@ -1146,6 +1163,8 @@ static int __init ipoib_init_module(void if (ret) goto err_wq; + spin_lock_init(&ipoib_all_neigh_list_lock); + return 0; err_wq: Index: openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- openib.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2005-12-20 20:26:09.000000000 +0200 +++ openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2005-12-20 20:27:32.000000000 +0200 @@ -111,9 +111,7 @@ static void ipoib_mcast_free(struct ipoi */ if (neigh->ah) ipoib_put_ah(neigh->ah); - *to_ipoib_neigh(neigh->neighbour) = NULL; - neigh->neighbour->ops->destructor = NULL; - kfree(neigh); + ipoib_neigh_free(neigh); } spin_unlock_irqrestore(&priv->lock, flags); @@ -719,13 +717,11 @@ out: if (skb->dst && skb->dst->neighbour && !*to_ipoib_neigh(skb->dst->neighbour)) { - struct ipoib_neigh *neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + struct ipoib_neigh *neigh = ipoib_neigh_alloc(skb->dst->neighbour); if (neigh) { kref_get(&mcast->ah->ref); neigh->ah = mcast->ah; - neigh->neighbour = skb->dst->neighbour; - *to_ipoib_neigh(skb->dst->neighbour) = neigh; list_add_tail(&neigh->list, &mcast->neigh_list); } } Index: openib/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- openib.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2005-12-20 20:25:55.000000000 +0200 +++ openib/drivers/infiniband/ulp/ipoib/ipoib.h 2005-12-20 20:26:25.000000000 +0200 @@ -214,6 +214,7 @@ struct ipoib_neigh { struct neighbour *neighbour; struct list_head list; + struct list_head all_neigh_list; }; static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh) @@ -222,6 +223,9 @@ static inline struct ipoib_neigh **to_ip (offsetof(struct neighbour, ha) & 4)); } +struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh); +void ipoib_neigh_free(struct ipoib_neigh *neigh); + extern struct workqueue_struct *ipoib_workqueue; /* functions */ -- MST From weiny2 at llnl.gov Tue Dec 20 11:15:29 2005 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 20 Dec 2005 11:15:29 -0800 Subject: [openib-general] help with ifconfig output from ipoib Message-ID: <20051220111529.64a885b9.weiny2@llnl.gov> One of my coworkers brought a problem to me which I need guidance from you guys on. 
He pointed out that the HW address for the ipoib interface in ifconfig reported all 0's. 11:12:50 > /sbin/ifconfig ib0 ib0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 ... Using strace, I tracked this down to a difference between the Red Hat kernel and the latest 2.6.14 kernel. The ioctl in Red Hat's 2.6.9 kernel will return EOVERFLOW if the sa_data area is too small. This seemed like an easy enough fix. Change the behavior of the SIOCGIFHWADDR From weiny2 at llnl.gov Tue Dec 20 11:16:46 2005 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 20 Dec 2005 11:16:46 -0800 Subject: [openib-general] help with ifconfig output from ipoib In-Reply-To: <20051220111529.64a885b9.weiny2@llnl.gov> References: <20051220111529.64a885b9.weiny2@llnl.gov> Message-ID: <20051220111646.3d580c71.weiny2@llnl.gov> This was sent before I was done writing it... Please disregard... (stupid gui buttons...) :-( Ira On Tue, 20 Dec 2005 11:15:29 -0800 Ira Weiny wrote: > One of my coworkers brought a problem to me which I need guidance > from you guys on. > > He pointed out that the HW address for the ipoib interface in > ifconfig reported all 0's. > > 11:12:50 > /sbin/ifconfig ib0 > ib0 Link encap:UNSPEC HWaddr > 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 ... > > Using strace, I tracked this down to a difference between the Red Hat > kernel and the latest 2.6.14 kernel. The ioctl in Red Hat's 2.6.9 > kernel will return EOVERFLOW if the sa_data area is too small. This > seemed like an easy enough fix. Change the behavior of the > SIOCGIFHWADDR _______________________________________________ > openib-general mailing list openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From weiny2 at llnl.gov Tue Dec 20 11:21:29 2005 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 20 Dec 2005 11:21:29 -0800 Subject: [openib-general] help with ifconfig output from ipoib Message-ID: <20051220112129.1081953b.weiny2@llnl.gov> One of my coworkers brought a problem to me which I need guidance from you guys on. He pointed out that the HW address for the ipoib interface in ifconfig reported all 0's. 11:12:50 > /sbin/ifconfig ib0 ib0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 ... Using strace, I tracked this down to a difference between the Red Hat kernel and the latest 2.6.14 kernel. The SIOCGIFHWADDR ioctl in Red Hat's 2.6.9 kernel will return EOVERFLOW if the sa_data area is too small. This seemed like an easy enough fix. Change the behavior of the SIOCGIFHWADDR. I did this and this is what I got. # ldev6 /root > /sbin/ifconfig ib0 ib0 Link encap:UNSPEC HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 This is not the GUID of the node? More digging revealed that this "should" work. Am I incorrect in assuming that the HW address should be the node GUID? Does this work for others? Thanks, Ira PS. Sorry about the premature email before. From halr at voltaire.com Tue Dec 20 11:32:26 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 20 Dec 2005 21:32:26 +0200 Subject: [openib-general] help with ifconfig output from ipoib Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589AB97@taurus.voltaire.com> IPoIB hardware addresses are QPN + GID (subnet prefix and GUID). ifconfig only supports shorter HW addresses (I forget what the limit is but it is less than the 20-byte IPoIB HW address). Use the newer ip command to see the proper HW address (something like ip link show ib0). -- Hal ________________________________ From: openib-general-bounces at openib.org on behalf of Ira Weiny Sent: Tue 12/20/2005 2:21 PM To: openib-general at openib.org Subject: [openib-general] help with ifconfig output from ipoib One of my coworkers brought a problem to me which I need guidance from you guys on. He pointed out that the HW address for the ipoib interface in ifconfig reported all 0's. 11:12:50 > /sbin/ifconfig ib0 ib0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 ... Using strace, I tracked this down to a difference between the Red Hat kernel and the latest 2.6.14 kernel. The SIOCGIFHWADDR ioctl in Red Hat's 2.6.9 kernel will return EOVERFLOW if the sa_data area is too small. This seemed like an easy enough fix. Change the behavior of the SIOCGIFHWADDR. I did this and this is what I got. # ldev6 /root > /sbin/ifconfig ib0 ib0 Link encap:UNSPEC HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 This is not the GUID of the node? More digging revealed that this "should" work. Am I incorrect in assuming that the HW address should be the node GUID? Does this work for others? Thanks, Ira PS. Sorry about the premature email before. _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
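To make the size mismatch concrete, here is a small standalone sketch. The struct and its field names are illustrative, not the actual ipoib declarations; it only shows why a 20-byte IPoIB hardware address cannot survive the SIOCGIFHWADDR path, whose result must fit in struct sockaddr's 14-byte sa_data:

#include <stdio.h>
#include <sys/socket.h>

/* Illustrative layout only: 4 bytes of flags + QPN followed by the
 * 16-byte GID (subnet prefix + port GUID), per Hal's description. */
struct ipoib_hw_addr {
	unsigned char qpn[4];
	unsigned char gid[16];
};

int main(void)
{
	printf("IPoIB hw addr: %zu bytes, sockaddr sa_data: %zu bytes\n",
	       sizeof(struct ipoib_hw_addr),
	       sizeof(((struct sockaddr *) 0)->sa_data));
	/* Prints 20 vs 14, which is why ifconfig shows a truncated or
	 * all-zero address and why "ip link show ib0" is needed. */
	return 0;
}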
From lindahl at pathscale.com Tue Dec 20 12:46:38 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Tue, 20 Dec 2005 12:46:38 -0800 Subject: [openib-general] RE: [PATCH] Opensm - fix segfault on exit - cont. In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30E24BE@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30E24BE@mtlexch01.mtl.com> Message-ID: <20051220204638.GA2083@greglaptop.internal.keyresearch.com> On Tue, Dec 20, 2005 at 09:11:12AM +0200, Yael Kalka wrote: > > - if (&p_ur->signal) > > + if (&p_ur->signal != NULL) Aren't these 2 statements required to execute the same according to the C standard? I wrote a tiny test program and gcc4.0.0 as distributed with Fedora Core 3 generated identical assembly code for both. -- greg From eitan at mellanox.co.il Tue Dec 20 13:20:55 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 20 Dec 2005 23:20:55 +0200 Subject: [openib-general] RE: [PATCH] Opensm - fix segfault on exit - cont. Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3618B7C@mtlexch01.mtl.com> Hi Greg, > > On Tue, Dec 20, 2005 at 09:11:12AM +0200, Yael Kalka wrote: > > > > - if (&p_ur->signal) > > > + if (&p_ur->signal != NULL) > > Aren't these 2 statements required to execute the same according to > the C standard? [EZ] We were puzzled too. But there is nothing stronger than seeing it happening. What could break our compiler? Hmmm. > > I wrote a tiny test program and gcc4.0.0 as distributed with Fedora > Core 3 generated identical assembly code for both. > > -- greg > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
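Greg is right that the two forms are equivalent -- and in fact both are always true, which is the real bug. A minimal standalone illustration (hypothetical struct layout, not the actual osm_vendor_ibumad.c types):

#include <stdio.h>
#include <stddef.h>

struct umad_receiver {
	int other_member;
	int signal;	/* stands in for the cl_event_t member */
};

int main(void)
{
	struct umad_receiver *p_ur = NULL;

	/* &p_ur->signal is just p_ur plus the member offset, so it can
	 * never compare equal to NULL, even when p_ur itself is NULL
	 * (and forming it from a NULL base is formally undefined).
	 * Both "if (&p_ur->signal)" and "if (&p_ur->signal != NULL)"
	 * therefore compile to the same always-true test. */
	printf("offset of signal = %zu\n",
	       offsetof(struct umad_receiver, signal));

	/* The guard that actually protects cl_event_destroy() has to
	 * test the pointer itself, as Yael's patch later in this
	 * thread does: */
	if (p_ur)
		printf("safe to touch p_ur->signal\n");

	return 0;
}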
From eitan at mellanox.co.il Tue Dec 20 13:27:07 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 20 Dec 2005 23:27:07 +0200 Subject: [openib-general] RE: [PATCH] OpenSM: Extend default transaction timeout from 100 msec to 1 second Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3618B7D@mtlexch01.mtl.com> Hi Hal, The effect is basically a slowdown in the case of non-responding or lost packets. With a 1 sec timeout, up to 4 sec per lost transaction are added to the SM bringup time. In many clusters I have seen that 100 msec was enough - but I guess you actually have seen such failures. Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Tuesday, December 20, 2005 3:38 PM > To: Yael Kalka; Eitan Zahavi > Cc: openib-general at openib.org > Subject: [PATCH] OpenSM: Extend default transaction timeout from 100 msec to 1 > second > > OpenSM: Extend default transaction timeout from 100 msec to 1 second. > > With the advent of long distance IB and software SMAs, 100 msec is no > longer adequate as a default transaction timeout. Increase this to 1 > second so that the default is sufficient in most common cases. > > Signed-off-by: Hal Rosenstock > > Index: include/opensm/osm_base.h > =================================================================== > --- include/opensm/osm_base.h (revision 4549) > +++ include/opensm/osm_base.h (working copy) > @@ -246,7 +246,7 @@ BEGIN_C_DECLS > * > * SYNOPSIS > */ > -#define OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC 100 > +#define OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC 1000 > /***********/ > > /****d* OpenSM: Base/OSM_DEFAULT_SUBNET_TIMEOUT > Index: opensm/main.c > =================================================================== > --- opensm/main.c (revision 4549) > +++ opensm/main.c (working copy) > @@ -153,7 +153,7 @@ show_usage(void) > " used for transaction timeouts.\n" > " Specifying -t 0 disables timeouts.\n" > " Without -t, OpenSM defaults to a timeout value of\n" > - " 100 milliseconds.\n\n" ); > + " 1 second (1000 milliseconds).\n\n" ); > printf( "-maxsmps \n" > " This option specifies the number of VL15 SMP MADs\n" > " allowed on the wire at any one time.\n" From halr at voltaire.com Tue Dec 20 13:22:44 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 20 Dec 2005 23:22:44 +0200 Subject: [openib-general] RE: [PATCH] OpenSM: Extend default transaction timeout from 100 msec to 1 second Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589ABA1@taurus.voltaire.com> Hi Eitan, Yes, I saw these failures as I mentioned in the original email. Another easy way to see this is to turn on logging on a slow NFS server. Also, wouldn't increasing maxsmps ameliorate this to some degree so maybe that should be done at the same time ? -- Hal ________________________________ From: Eitan Zahavi [mailto:eitan at mellanox.co.il] Sent: Tue 12/20/2005 4:27 PM To: Hal Rosenstock; Yael Kalka Cc: openib-general at openib.org Subject: RE: [PATCH] OpenSM: Extend default transaction timeout from 100 msec to 1 second Hi Hal, The effect is basically a slowdown in the case of non-responding or lost packets.
With a 1 sec timeout, up to 4 sec per lost transaction are added to the SM bringup time. In many clusters I have seen that 100 msec was enough - but I guess you actually have seen such failures. Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Tuesday, December 20, 2005 3:38 PM > To: Yael Kalka; Eitan Zahavi > Cc: openib-general at openib.org > Subject: [PATCH] OpenSM: Extend default transaction timeout from 100 msec to 1 > second > > OpenSM: Extend default transaction timeout from 100 msec to 1 second. > > With the advent of long distance IB and software SMAs, 100 msec is no > longer adequate as a default transaction timeout. Increase this to 1 > second so that the default is sufficient in most common cases. > > Signed-off-by: Hal Rosenstock > > Index: include/opensm/osm_base.h > =================================================================== > --- include/opensm/osm_base.h (revision 4549) > +++ include/opensm/osm_base.h (working copy) > @@ -246,7 +246,7 @@ BEGIN_C_DECLS > * > * SYNOPSIS > */ > -#define OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC 100 > +#define OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC 1000 > /***********/ > > /****d* OpenSM: Base/OSM_DEFAULT_SUBNET_TIMEOUT > Index: opensm/main.c > =================================================================== > --- opensm/main.c (revision 4549) > +++ opensm/main.c (working copy) > @@ -153,7 +153,7 @@ show_usage(void) > " used for transaction timeouts.\n" > " Specifying -t 0 disables timeouts.\n" > " Without -t, OpenSM defaults to a timeout value of\n" > - " 100 milliseconds.\n\n" ); > + " 1 second (1000 milliseconds).\n\n" ); > printf( "-maxsmps \n" > " This option specifies the number of VL15 SMP MADs\n" > " allowed on the wire at any one time.\n" From eitan at mellanox.co.il Tue Dec 20 13:36:12 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 20 Dec 2005 23:36:12 +0200 Subject: [openib-general] RE: [PATCH] OpenSM: Extend default transaction timeout from 100 msec to 1 second Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3618B7E@mtlexch01.mtl.com> Hi Hal, In a way, using parallel MAD sends (maxsmps > 1) would help. But if you count the number of packets that must be sent to every port (NodeInfo, PortInfo, SwitchInfo?, PKey*2, SL2VL, VLArb ....), even a single bad port will slow down the sweep significantly. Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Tuesday, December 20, 2005 11:23 PM > To: Eitan Zahavi; Yael Kalka > Cc: openib-general at openib.org > Subject: RE: [PATCH] OpenSM: Extend default transaction timeout from 100 msec to > 1 second > > Hi Eitan, > > Yes, I saw these failures as I mentioned in the original email. Another easy way to see > this is to turn on logging on a slow NFS server. > > Also, wouldn't increasing maxsmps ameliorate this to some degree so maybe that > should be done at the same time ?
> > -- Hal > > ________________________________ > > From: Eitan Zahavi [mailto:eitan at mellanox.co.il] > Sent: Tue 12/20/2005 4:27 PM > To: Hal Rosenstock; Yael Kalka > Cc: openib-general at openib.org > Subject: RE: [PATCH] OpenSM: Extend default transaction timeout from 100 msec to > 1 second > > > > Hi Hal, > > The effect is basically a slowdown in the case of non-responding or lost > packets. > With a 1 sec timeout, up to 4 sec per lost transaction are added to the SM > bringup time. > > In many clusters I have seen that 100 msec was enough - but I guess you > actually have seen such failures. > > Eitan Zahavi > Design Technology Director > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Tuesday, December 20, 2005 3:38 PM > > To: Yael Kalka; Eitan Zahavi > > Cc: openib-general at openib.org > > Subject: [PATCH] OpenSM: Extend default transaction timeout from 100 > msec to 1 > > second > > > > OpenSM: Extend default transaction timeout from 100 msec to 1 second. > > > > With the advent of long distance IB and software SMAs, 100 msec is no > > longer adequate as a default transaction timeout. Increase this to 1 > > second so that the default is sufficient in most common cases. > > > > Signed-off-by: Hal Rosenstock > > > > Index: include/opensm/osm_base.h > > =================================================================== > > --- include/opensm/osm_base.h (revision 4549) > > +++ include/opensm/osm_base.h (working copy) > > @@ -246,7 +246,7 @@ BEGIN_C_DECLS > > * > > * SYNOPSIS > > */ > > -#define OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC 100 > > +#define OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC 1000 > > /***********/ > > > > /****d* OpenSM: Base/OSM_DEFAULT_SUBNET_TIMEOUT > > Index: opensm/main.c > > =================================================================== > > --- opensm/main.c (revision 4549) > > +++ opensm/main.c (working copy) > > @@ -153,7 +153,7 @@ show_usage(void) > > " used for transaction timeouts.\n" > > " Specifying -t 0 disables timeouts.\n" > > " Without -t, OpenSM defaults to a timeout value > of\n" > > - " 100 milliseconds.\n\n" ); > > + " 1 second (1000 milliseconds).\n\n" ); > > printf( "-maxsmps \n" > > " This option specifies the number of VL15 SMP > MADs\n" > > " allowed on the wire at any one time.\n" From halr at voltaire.com Tue Dec 20 13:32:47 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 20 Dec 2005 23:32:47 +0200 Subject: [openib-general] RE: [PATCH] OpenSM: Extend default transaction timeout from 100 msec to 1 second Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F589ABA2@taurus.voltaire.com> Hi Eitan, Yes, but what is the alternative for slow ports or long links ? -- Hal ________________________________ From: Eitan Zahavi [mailto:eitan at mellanox.co.il] Sent: Tue 12/20/2005 4:36 PM To: Hal Rosenstock; Yael Kalka Cc: openib-general at openib.org Subject: RE: [PATCH] OpenSM: Extend default transaction timeout from 100 msec to 1 second Hi Hal, In a way, using parallel MAD sends (maxsmps > 1) would help. But if you count the number of packets that must be sent to every port (NodeInfo, PortInfo, SwitchInfo?, PKey*2, SL2VL, VLArb ....), even a single bad port will slow down the sweep significantly. Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O.
Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Tuesday, December 20, 2005 11:23 PM > To: Eitan Zahavi; Yael Kalka > Cc: openib-general at openib.org > Subject: RE: [PATCH] OpenSM: Extend default transaction timeout from 100 msec to > 1 second > > Hi Eitan, > > Yes, I saw these failures as I mentioned in the original email. Another easy way to see > this is to turn on logging on a slow NFS server. > > Also, wouldn't increasing maxsmps ameliorate this to some degree so maybe that > should be done at the same time ? > > -- Hal > > ________________________________ > > From: Eitan Zahavi [mailto:eitan at mellanox.co.il] > Sent: Tue 12/20/2005 4:27 PM > To: Hal Rosenstock; Yael Kalka > Cc: openib-general at openib.org > Subject: RE: [PATCH] OpenSM: Extend default transaction timeout from 100 msec to > 1 second > > > > Hi Hal, > > The effect is basically a slowdown in the case of non-responding or lost > packets. > With a 1 sec timeout, up to 4 sec per lost transaction are added to the SM > bringup time. > > In many clusters I have seen that 100 msec was enough - but I guess you > actually have seen such failures. > > Eitan Zahavi > Design Technology Director > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Tuesday, December 20, 2005 3:38 PM > > To: Yael Kalka; Eitan Zahavi > > Cc: openib-general at openib.org > > Subject: [PATCH] OpenSM: Extend default transaction timeout from 100 > msec to 1 > > second > > > > OpenSM: Extend default transaction timeout from 100 msec to 1 second. > > > > With the advent of long distance IB and software SMAs, 100 msec is no > > longer adequate as a default transaction timeout. Increase this to 1 > > second so that the default is sufficient in most common cases. > > > > Signed-off-by: Hal Rosenstock > > > > Index: include/opensm/osm_base.h > > =================================================================== > > --- include/opensm/osm_base.h (revision 4549) > > +++ include/opensm/osm_base.h (working copy) > > @@ -246,7 +246,7 @@ BEGIN_C_DECLS > > * > > * SYNOPSIS > > */ > > -#define OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC 100 > > +#define OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC 1000 > > /***********/ > > > > /****d* OpenSM: Base/OSM_DEFAULT_SUBNET_TIMEOUT > > Index: opensm/main.c > > =================================================================== > > --- opensm/main.c (revision 4549) > > +++ opensm/main.c (working copy) > > @@ -153,7 +153,7 @@ show_usage(void) > > " used for transaction timeouts.\n" > > " Specifying -t 0 disables timeouts.\n" > > " Without -t, OpenSM defaults to a timeout value > of\n" > > - " 100 milliseconds.\n\n" ); > > + " 1 second (1000 milliseconds).\n\n" ); > > printf( "-maxsmps \n" > > " This option specifies the number of VL15 SMP > MADs\n" > > " allowed on the wire at any one time.\n" From thomas.duffy.99 at alumni.brown.edu Tue Dec 20 13:52:57 2005 From: thomas.duffy.99 at alumni.brown.edu (Tom Duffy) Date: Tue, 20 Dec 2005 13:52:57 -0800 Subject: [openib-general] Re: [PATCH applied] return -ENOPROTOOPT on an unsupported socket option In-Reply-To: <20051219230510.GD2694@mellanox.co.il> References: <33038B33-7572-463C-B307-B5114E3243A0@alumni.brown.edu> <20051219230510.GD2694@mellanox.co.il> Message-ID: <5A2FDF41-8407-4C6A-A85E-F26E983C03E1@alumni.brown.edu> On Dec 19, 2005, at 3:05 PM, Michael S.
Tsirkin wrote: > Hmm. Which option do you have in mind, specifically? LINGER comes to mind. But something like SYNCNT, WINDOW_CLAMP, or QUICKACK doesn't make much sense on SDP, so why not just say we support it? > The right thing, in my eyes, is to emulate a TCP socket. > So we don't want to support options that TCP doesn't support. TCP supports many more options. Perhaps we should special case those. -tduffy From mst at mellanox.co.il Tue Dec 20 14:01:37 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 21 Dec 2005 00:01:37 +0200 Subject: [openib-general] Re: [PATCH applied] return -ENOPROTOOPT on an unsupported socket option In-Reply-To: <5A2FDF41-8407-4C6A-A85E-F26E983C03E1@alumni.brown.edu> References: <5A2FDF41-8407-4C6A-A85E-F26E983C03E1@alumni.brown.edu> Message-ID: <20051220220137.GE14598@mellanox.co.il> Quoting r. Tom Duffy : > Subject: Re: [PATCH applied] return -ENOPROTOOPT on an unsupported socket option > > > On Dec 19, 2005, at 3:05 PM, Michael S. Tsirkin wrote: > > Hmm. Which option do you have in mind, specifically? > > LINGER comes to mind. But something like SYNCNT, WINDOW_CLAMP, or > QUICKACK doesn't make much sense on SDP, so why not just say we > support it? > > > The right thing, in my eyes, is to emulate a TCP socket. > > So we don't want to support options that TCP doesn't support. > > TCP supports many more options. Perhaps we should special case those. OK, that's what I'll do longer term. I reverted this patch for now. -- MST
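To sketch what "special casing" the TCP options could look like -- this is a hypothetical outline, not the actual ib_sdp code, and sdp_set_nodelay is an assumed helper -- a setsockopt handler might implement the options SDP really supports, silently accept the TCP options that are harmless no-ops on SDP, and return -ENOPROTOOPT for the rest:

#include <linux/net.h>
#include <linux/tcp.h>
#include <linux/errno.h>

static int sdp_setsockopt(struct socket *sock, int level, int optname,
			  char __user *optval, int optlen)
{
	if (level != SOL_TCP)
		return -ENOPROTOOPT;

	switch (optname) {
	case TCP_NODELAY:
		/* has a real SDP analogue; assumed helper */
		return sdp_set_nodelay(sock, optval, optlen);
	case TCP_SYNCNT:
	case TCP_WINDOW_CLAMP:
	case TCP_QUICKACK:
		/* meaningless on SDP; claim success so TCP-oriented
		 * applications keep working unmodified */
		return 0;
	default:
		return -ENOPROTOOPT;
	}
}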
From rjwalsh at pathscale.com Tue Dec 20 16:00:16 2005 From: rjwalsh at pathscale.com (Robert Walsh) Date: Tue, 20 Dec 2005 16:00:16 -0800 Subject: [openib-general] Re: [PATCH 08/13] [RFC] ipath core last bit In-Reply-To: <20051217123856.d16529a5.akpm@osdl.org> References: <200512161548.3fqe3fMerrheBMdX@cisco.com> <200512161548.y9KRuNtfMzpZjwni@cisco.com> <20051217123856.d16529a5.akpm@osdl.org> Message-ID: <1135123216.13875.5.camel@hematite.internal.keyresearch.com> On Sat, 2005-12-17 at 12:38 -0800, Andrew Morton wrote: > Roland Dreier wrote: > > > > +EXPORT_SYMBOL(ipath_kset_linkstate); > > +EXPORT_SYMBOL(ipath_kset_mtu); > > +EXPORT_SYMBOL(ipath_layer_close); > > +EXPORT_SYMBOL(ipath_layer_get_bcast); > > +EXPORT_SYMBOL(ipath_layer_get_cr_errpkey); > > +EXPORT_SYMBOL(ipath_layer_get_deviceid); > > +EXPORT_SYMBOL(ipath_layer_get_flags); > > +EXPORT_SYMBOL(ipath_layer_get_guid); > > +EXPORT_SYMBOL(ipath_layer_get_ibmtu); > > etc > > EXPORT_SYMBOL_GPL? Hmm, well, nothing else in the infiniband directory uses this, probably because of the dual GPL/BSD license that all files in there have. For consistency, I'll leave it as EXPORT_SYMBOL, but I don't have any real problems with it either way. Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043 From yael at mellanox.co.il Wed Dec 21 05:29:49 2005 From: yael at mellanox.co.il (Yael Kalka) Date: Wed, 21 Dec 2005 15:29:49 +0200 Subject: [openib-general] RE: [PATCH] Opensm - fix segfault on exit - cont. Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30E24D6@mtlexch01.mtl.com> I solved the mystery (at least partially).... What is happening is that p_ur itself is already NULL. I will send a new patch that checks both the p_ur and the signal pointer. Yael -----Original Message----- From: Eitan Zahavi Sent: Tuesday, December 20, 2005 11:21 PM To: Greg Lindahl; openib-general at openib.org Subject: RE: [openib-general] RE: [PATCH] Opensm - fix segfault on exit - cont. Hi Greg, > > On Tue, Dec 20, 2005 at 09:11:12AM +0200, Yael Kalka wrote: > > > > - if (&p_ur->signal) > > > + if (&p_ur->signal != NULL) > > Aren't these 2 statements required to execute the same according to > the C standard? [EZ] We were puzzled too. But there is nothing stronger than seeing it happening. What could break our compiler? Hmmm. > > I wrote a tiny test program and gcc4.0.0 as distributed with Fedora > Core 3 generated identical assembly code for both. > > -- greg > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From yael at mellanox.co.il Wed Dec 21 05:29:57 2005 From: yael at mellanox.co.il (Yael Kalka) Date: 21 Dec 2005 15:29:57 +0200 Subject: [openib-general] Re[PATCH] Opensm - fix segfault on exit - cont. #2 Message-ID: <5z3bkmh92i.fsf@mtl066.yok.mtl.com> Hi Hal, As I wrote in the original thread - when the driver isn't loaded, the p_ur itself is NULL, so before trying to destroy the signal, we need to make sure that p_ur isn't NULL. If it is not NULL, this means it was initialized, and p_ur->signal should have a value. The following patch does this. Thanks, Yael Signed-off-by: Yael Kalka Index: libvendor/osm_vendor_ibumad.c =================================================================== --- libvendor/osm_vendor_ibumad.c (revision 4542) +++ libvendor/osm_vendor_ibumad.c (working copy) @@ -552,7 +552,7 @@ osm_vendor_delete( /* umad receiver thread ? */ p_ur = (*pp_vend)->receiver; - if (&p_ur->signal) + if ( p_ur ) cl_event_destroy( &p_ur->signal ); cl_spinlock_destroy( &(*pp_vend)->cb_lock ); cl_spinlock_destroy( &(*pp_vend)->match_tbl_lock ); From halr at voltaire.com Wed Dec 21 06:30:31 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Dec 2005 09:30:31 -0500 Subject: [openib-general] Re: Re[PATCH] Opensm - fix segfault on exit - cont.
#2 In-Reply-To: <5z3bkmh92i.fsf@mtl066.yok.mtl.com> References: <5z3bkmh92i.fsf@mtl066.yok.mtl.com> Message-ID: <1135175428.4328.65041.camel@hal.voltaire.com> On Wed, 2005-12-21 at 08:29, Yael Kalka wrote: > As I wrote in the original thread - when the driver isn't loaded, > the p_ur itself is NULL, so before trying to destroy the signal, > we need to make sure that p_ur isn't NULL. If it is not NULL, this > means it was initialized, and p_ur->signal should have a value. > The following patch does this. Thanks. Applied. From Thomas.Duffy.99 at alumni.brown.edu Wed Dec 21 18:15:59 2005 From: Thomas.Duffy.99 at alumni.brown.edu (Tom Duffy) Date: Wed, 21 Dec 2005 18:15:59 -0800 Subject: [openib-general] Fwd: SDP References: <43695595.8010709@sun.com> Message-ID: <56DECDA0-F445-45E2-B0E1-1A3CEF605AED@alumni.brown.edu> Nitin, I found this email stuck in the back of my inbox. I am sorry I haven't been able to get back to you. Perhaps somebody on the list can better answer your questions. Begin forwarded message: > From: Nitin Hande > Date: November 2, 2005 4:11:01 PM PST > To: Tom Duffy > Subject: Re: SDP > User-Agent: Mozilla Thunderbird 1.0.2 (X11/20050512) > > Hi Tom > > > Tom Duffy wrote: >> On Oct 4, 2005, at 4:38 PM, Nitin Hande wrote: >>> Tom >>> >>> I have a question. The current SDP implementation on OpenIB I >>> believe now creates a socket of AF_INET_SDP type? > While you have chosen AF_INET_SDP, what does this family imply? > How is it connected to the AF_INET family? (I can tell my reasoning: > While doing ARP, AF_INET_SDP relies on IPoIB to find the peer MAC > address. While IPoIB is an AF_INET family, there is some correlation > between AF_INET_SDP and the AF_INET family. A MAC address resolved > through AF_INET is accepted in AF_INET_SDP. This will not be > acceptable, say, for the AF_IPX family, if there was any reason to do it > like that). What I am trying to find out is: what is that relation? > > My second question is: rather than defining AF_INET_SDP, have you > thought about whether we can open an SDP connection with > socket(AF_INET, SOCK_STREAM, IPPROTO_SDP), which perhaps suits more > than AF_INET_SDP? (It is only the transport protocol that is > different, but the addresses actually belong to the AF_INET family). > > Your comments? > > Nitin From halr at voltaire.com Wed Dec 21 21:18:53 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 22 Dec 2005 00:18:53 -0500 Subject: [openib-general] RE: A couple of questions about osm_lid_mgr.c::__osm_lid_mgr_set_physp_pi In-Reply-To: <1134992274.4328.32896.camel@hal.voltaire.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3618B52@mtlexch01.mtl.com> <1134992274.4328.32896.camel@hal.voltaire.com> Message-ID: <1135228732.4328.73617.camel@hal.voltaire.com> On Mon, 2005-12-19 at 06:37, Hal Rosenstock wrote: > Hi Eitan, > > On Sun, 2005-12-18 at 14:20, Eitan Zahavi wrote: > > [EZ] Thanks. I have seen the patch. It is fine. > > Thanks. I just committed it. > > > > > > Also, why does changing the MTU require that the link be taken > > down ? > > > > > > > The behavior of the link when the neighbor MTU changes is not very > > well defined. > > > > So the best way to handle that is to force it down. > > > > > > NeighborMTU is not involved with the link negotiation nor is there a > > > comment in the description like OperationalVLs. What behavior are you > > > referring to ? > > > [EZ] I actually do not see any spec note about modifying neighbor MTU > > during link up. > > Yes, that was what I was saying.
> > > However, I remember we had to add this functionality. I > > > tried to dig this up in the old BitKeeper and found the first occurrence > > > of the setting of the port down in version 1.7. But the log does not say > > > why. > > > > Thanks for looking. This seems mysterious to me but I would hesitate to > > remove it even though I don't think it should be required, or if it is, > > some spec comment should be made. I would like to close the loop on this > > but don't see how. In looking at this area some more, is it even required for changing OperationalVLs ? The spec component description states "Changing OperationalVLs in certain PortStates may cause flow control update errors which may initiate Link/Phy retraining." So it sounds like the link might retrain (and go down) on its own if needed but the SM needn't take the link down. -- Hal From eitan at mellanox.co.il Wed Dec 21 23:42:16 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 22 Dec 2005 09:42:16 +0200 Subject: [openib-general] RE: A couple of questions about osm_lid_mgr.c::__osm_lid_mgr_set_physp_pi Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3618BB2@mtlexch01.mtl.com> Hi Hal, Yes, when you change OperationalVLs you can run into a situation where the link will retrain. But this retrain might happen anytime, when the watchdog timer expires. Instead of letting it surprise the SM (which thinks the link is armed or active) - I prefer taking the link down and then moving it to arm and active in an orderly manner. Another aspect is that the watchdog timer might expire much later in the bring-up process, causing traps to be injected such that OpenSM will perform extra sweeps which could be avoided. This "preference" of mine was developed over the bad experience of trying to rely on the watchdog timer in the past... EZ Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Thursday, December 22, 2005 7:19 AM > To: Eitan Zahavi > Cc: openib-general at openib.org; Yael Kalka > Subject: Re: [openib-general] RE: A couple of questions > about osm_lid_mgr.c::__osm_lid_mgr_set_physp_pi > > On Mon, 2005-12-19 at 06:37, Hal Rosenstock wrote: > > Hi Eitan, > > > > On Sun, 2005-12-18 at 14:20, Eitan Zahavi wrote: > > > [EZ] Thanks. I have seen the patch. It is fine. > > > > Thanks. I just committed it. > > > > > > > > Also, why does changing the MTU require that the link be taken > > > down ? > > > > > > > > > The behavior of the link when the neighbor MTU changes is not very > > > well defined. > > > > > So the best way to handle that is to force it down. > > > > > > > > NeighborMTU is not involved with the link negotiation nor is there a > > > > comment in the description like OperationalVLs. What behavior are you > > > > referring to ? > > > > > [EZ] I actually do not see any spec note about modifying neighbor MTU > > > during link up. > > > > Yes, that was what I was saying. > > > > > However, I remember we had to add this functionality. I > > > tried to dig this up in the old BitKeeper and found the first occurrence > > > of the setting of the port down in version 1.7. But the log does not say > > > why. > > > > Thanks for looking. This seems mysterious to me but I would hesitate to > > remove it even though I don't think it should be required, or if it is, > > some spec comment should be made. I would like to close the loop on this > > but don't see how.
> > In looking at this area some more, is it even required for changing > OperationalVLs ? The spec component description states "Changing > OperationalVLs in certain PortStates may cause flow control update > errors which may initiate Link/Phy retraining." So it sounds like the > link might retrain (and go down) on its own if needed but the SM needn't > take the link down. > > -- Hal From mst at mellanox.co.il Thu Dec 22 05:11:05 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 22 Dec 2005 15:11:05 +0200 Subject: [openib-general] [PATCH] mthca: error handling fixes Message-ID: <20051222131105.GQ23396@mellanox.co.il> Fix memory leaks in error handling on multicast group operations. Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. Tsirkin Index: latest/drivers/infiniband/hw/mthca/mthca_mcg.c =================================================================== --- latest.orig/drivers/infiniband/hw/mthca/mthca_mcg.c +++ latest/drivers/infiniband/hw/mthca/mthca_mcg.c @@ -109,7 +109,8 @@ static int find_mgm(struct mthca_dev *de goto out; if (status) { mthca_err(dev, "READ_MGM returned status %02x\n", status); - return -EINVAL; + err = -EINVAL; + goto out; } if (!memcmp(mgm->gid, zero_gid, 16)) { @@ -151,8 +152,10 @@ int mthca_multicast_attach(struct ib_qp return PTR_ERR(mailbox); mgm = mailbox->buf; - if (down_interruptible(&dev->mcg_table.sem)) - return -EINTR; + if (down_interruptible(&dev->mcg_table.sem)) { + err = -EINTR; + goto err_sem; + } err = find_mgm(dev, gid->raw, mailbox, &hash, &prev, &index); if (err) @@ -233,6 +236,7 @@ int mthca_multicast_attach(struct ib_qp out: up(&dev->mcg_table.sem); + err_sem: mthca_free_mailbox(dev, mailbox); return err; } @@ -253,8 +257,10 @@ int mthca_multicast_detach(struct ib_qp return PTR_ERR(mailbox); mgm = mailbox->buf; - if (down_interruptible(&dev->mcg_table.sem)) - return -EINTR; + if (down_interruptible(&dev->mcg_table.sem)) { + err = -EINTR; + goto err_sem; + } err = find_mgm(dev, gid->raw, mailbox, &hash, &prev, &index); if (err) @@ -356,6 +362,7 @@ int mthca_multicast_detach(struct ib_qp out: up(&dev->mcg_table.sem); + err_sem: mthca_free_mailbox(dev, mailbox); return err; } -- MST From mst at mellanox.co.il Thu Dec 22 07:18:17 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 22 Dec 2005 17:18:17 +0200 Subject: [openib-general] [PATCH] ipoib: error handling fix Message-ID: <20051222151817.GB1011@mellanox.co.il> The following patch needs to be applied on top of the patch series I posted previously. --- Fix error handling in neigh_add_path: avoid calling list_del on an uninitialized neigh->list.
Signed-off-by: Eli Cohen Signed-off-by: Michael S. Tsirkin Index: openib/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- openib.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2005-12-22 16:47:48.000000000 +0200 +++ openib/drivers/infiniband/ulp/ipoib/ipoib_main.c 2005-12-22 16:51:49.000000000 +0200 @@ -496,7 +496,7 @@ static void neigh_add_path(struct sk_buf path = path_rec_create(dev, (union ib_gid *) (skb->dst->neighbour->ha + 4)); if (!path) - goto err; + goto err_path; __path_add(dev, path); } @@ -527,6 +527,7 @@ static void neigh_add_path(struct sk_buf err: list_del(&neigh->list); +err_path: ipoib_neigh_free(neigh); ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); -- MST From bboas at llnl.gov Thu Dec 22 09:08:40 2005 From: bboas at llnl.gov (Bill Boas) Date: Thu, 22 Dec 2005 09:08:40 -0800 Subject: [openib-general] Please join us at the 2006 Sonoma Workshop Feb 5-8 Message-ID: <6.2.3.4.2.20051222082405.033a4258@mail-lc.llnl.gov> Please join us at the February 5-8, 2006 Open IB 2006 Sonoma Workshop at the Lodge at Sonoma; see http://www.thelodgeatsonoma.com The group room block is now ready to receive individual call-in reservations to Marriott's toll-free number of 888-710-8008, and rooms may also be reserved at: General group room rate quoting code OPAOPAA use: http://marriott.com/property/propertypage/sfols?groupCode=opaopaa&app=resvlink For US Government badge holders quoting code OPAOPAG use: http://marriott.com/property/propertypage/sfols?groupCode=opaopag&app=resvlink The Workshop draft agenda, registration link and these hotel reservation links will also be available on the OpenIB web site soon. We will email you when these are working. For those wishing to watch the Superbowl we'll set up a bar and big screen again! For those who have a problem with these dates, we apologize, but the Lodge is becoming very popular and these are the best dates we could confirm for the number of us that we are hoping will attend this workshop. Please make your hotel reservations as soon as possible - the current cut-off date set by the Lodge for all group reservations is January 9, 2006. The charming historic town of Sonoma is about a 1 hour drive north of both San Francisco (SFO) and Oakland (OAK) Airports and about 2 hours from San Jose in the heart of the Wine Country. For those wishing additional space with their room, we have reserved 1 Lodge suite at $295 per night and one Cottage suite at $450 per night. There are two more of each available if needed.
Please contact Bill Boas if you wish these rooms as the Lodge will assign your name on receipt of your credit card guarantee. The Lodge looks forward to welcoming the Open IB Alliance back! If your members need additional hotel information please contact them. Very best wishes for the Season from Maureen Keesey Fuentes, Group Housing Coordinator, and Nedra Peterson, Senior Account Executive (707-935-8916) The Lodge at Sonoma 1325 Broadway Sonoma, CA 95476 www.thelodgeatsonoma.com Please contact me directly if you have Workshop suggestions or questions. Stay tuned to www.openib.org as the agenda and details come together. Bill. Bill Boas bboas at llnl.gov ICCD LLNL, B-453, R-2018 Wk: 925-422-4110 7000 East Ave, L-555 Cell: 925-337-2224 Livermore, CA 94551 Pgr: 877-203-2248 From sean.hefty at intel.com Thu Dec 22 11:34:24 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 22 Dec 2005 11:34:24 -0800 Subject: [openib-general] RFC MPI and app. requirements of OpenIB In-Reply-To: <43A759D2.6040101@ichips.intel.com> Message-ID: >I'm soliciting feedback from the MPI and other application developers regarding >which OpenIB APIs they will be targeting with their implementations. >Specifically, myself and some of the other IB developers are interested in >knowing if userspace applications will be written to the RDMA CMA interface, >the >IB CM API, or some other abstraction. To help clarify the trade-offs: The CMA allows the use of IP addressing for connection establishment and abstracts device hotplug. It also operates over any type of RDMA device. A disadvantage of using the CMA is that it may not select the best set of paths between two or more nodes. The IB CM also permits path failover on a single HCA. Use of the IB CM requires that clients also interface with the IB SA to obtain path records. My personal recommendation would be for applications to use the CMA, but that does result in losing some flexibility. - Sean From halr at voltaire.com Thu Dec 22 11:53:51 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 22 Dec 2005 14:53:51 -0500 Subject: [openib-general] RFC MPI and app. requirements of OpenIB In-Reply-To: References: Message-ID: <1135281231.4328.81737.camel@hal.voltaire.com> On Thu, 2005-12-22 at 14:34, Sean Hefty wrote: > >I'm soliciting feedback from the MPI and other application developers regarding > >which OpenIB APIs they will be targeting with their implementations. > >Specifically, myself and some of the other IB developers are interested in > >knowing if userspace applications will be written to the RDMA CMA interface, > >the > >IB CM API, or some other abstraction. > > To help clarify the trade-offs: > > The CMA allows the use of IP addressing for connection establishment and > abstracts device hotplug. It also operates over any type of RDMA device. > > A disadvantage of using the CMA is that it may not select the best set of paths > between two or more nodes. What defines best ? Is this preference or disjointedness or something else ? Note path selection may be important in subnets when LMC > 0. > The IB CM also permits path failover on a single > HCA. Use of the IB CM requires that clients also interface with the IB SA to > obtain path records. Note that interaction with the SA will be required for MPI when multicast groups are to be used. > My personal recommendation would be for applications to use the CMA, but that > does result in losing some flexibility. Would the CMA ultimately support path failover ? -- Hal
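For readers unfamiliar with the CMA, here is a rough sketch of the IP-addressed connection setup being discussed (client side). This is not code from the openib tree; it follows the librdmacm API as it later stabilized, so names and signatures may not match the userspace CMA snapshot under discussion, and QP creation and cleanup are elided:

#include <stdio.h>
#include <rdma/rdma_cma.h>

static int wait_event(struct rdma_event_channel *ch,
		      enum rdma_cm_event_type expected)
{
	struct rdma_cm_event *event;

	if (rdma_get_cm_event(ch, &event))
		return -1;
	if (event->event != expected) {
		fprintf(stderr, "got event %d, wanted %d\n",
			event->event, expected);
		rdma_ack_cm_event(event);
		return -1;
	}
	return rdma_ack_cm_event(event);
}

int cma_connect(struct sockaddr *dst, struct rdma_conn_param *param)
{
	struct rdma_event_channel *ch;
	struct rdma_cm_id *id;

	ch = rdma_create_event_channel();
	if (!ch)
		return -1;
	if (rdma_create_id(ch, &id, NULL, RDMA_PS_TCP))
		return -1;

	/* IP address -> remote GID: this is the "IP addressing" step */
	if (rdma_resolve_addr(id, NULL, dst, 2000) ||
	    wait_event(ch, RDMA_CM_EVENT_ADDR_RESOLVED))
		return -1;

	/* GID -> path record; the CMA queries the SA on our behalf,
	 * which is where the "which path gets picked" question arises */
	if (rdma_resolve_route(id, 2000) ||
	    wait_event(ch, RDMA_CM_EVENT_ROUTE_RESOLVED))
		return -1;

	/* ... create and set up a QP on id->verbs here ... */

	if (rdma_connect(id, param) ||
	    wait_event(ch, RDMA_CM_EVENT_ESTABLISHED))
		return -1;
	return 0;
}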
From sean.hefty at intel.com Thu Dec 22 12:14:44 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 22 Dec 2005 12:14:44 -0800 Subject: [openib-general] RFC MPI and app. requirements of OpenIB In-Reply-To: <1135281231.4328.81737.camel@hal.voltaire.com> Message-ID: >> To help clarify the trade-offs: >> >> The CMA allows the use of IP addressing for connection establishment and >> abstracts device hotplug. It also operates over any type of RDMA device. >> >> A disadvantage of using the CMA is that it may not select the best set of >paths >> between two or more nodes. > >What defines best ? Is this preference or disjointedness or something >else ? I was intentionally vague here to leave this up to the application developer to define. The application may decide that a particular path or set of paths is better than another based on whatever criteria they choose. The current CMA provides less control over which paths are selected for connections than if the user queried the SA for paths and selected one based on some algorithm. (I'd be surprised if an app actually did this though.) >> The IB CM also permits path failover on a single >> HCA. Use of the IB CM requires that clients also interface with the IB SA to >> obtain path records. > >Note that interaction with the SA will be required for MPI when >multicast groups are to be used. An alternative is to provide UD and multicast/broadcast support in the CMA. I know that the Intel MPI runs over DAPL, which does not provide multicast support. Can MPI operate with unreliable multicast support? Does MPI plan on using IB multicast? >> My personal recommendation would be for applications to use the CMA, but that >> does result in losing some flexibility. > >Would the CMA ultimately support path failover ? Only if there's enough demand. Since IB failover is restricted to a single HCA, I can see where a more robust failover mechanism would be desirable. - Sean From bboas at llnl.gov Thu Dec 22 12:17:24 2005 From: bboas at llnl.gov (bboas at llnl.gov) Date: Thu, 22 Dec 2005 12:17:24 -0800 (PST) Subject: [openib-general] Please register for Sonoma Workshop Message-ID: <31093253.1135282644296.JavaMail.SYSTEM@acteva-web-01> You may register at www.acteva.com/go/rdma Thank you ========================================================================== ** If you *do not* wish to receive these emails, please forward or send a message to ** ** removeme at acteva.com, and we'll remove you from the private list. ** ========================================================================== From lindahl at pathscale.com Thu Dec 22 12:20:41 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Thu, 22 Dec 2005 12:20:41 -0800 Subject: [openib-general] RFC MPI and app. requirements of OpenIB In-Reply-To: References: <1135281231.4328.81737.camel@hal.voltaire.com> Message-ID: <20051222202041.GA2006@greglaptop.internal.keyresearch.com> On Thu, Dec 22, 2005 at 12:14:44PM -0800, Sean Hefty wrote: > Can MPI operate with unreliable multicast support? Does MPI plan on > using IB multicast? Given the large number of MPI implementations over IB, I don't think there's a single answer.
-- greg From parks at lanl.gov Thu Dec 22 11:52:22 2005 From: parks at lanl.gov (Parks Fields) Date: Thu, 22 Dec 2005 12:52:22 -0700 Subject: [openib-general] Please join us at the 2006 Sonoma Workshop Feb 5-8 In-Reply-To: <6.2.3.4.2.20051222082405.033a4258@mail-lc.llnl.gov> References: <6.2.3.4.2.20051222082405.033a4258@mail-lc.llnl.gov> Message-ID: <6.2.3.4.2.20051222125107.03b1f068@ccn-mail.lanl.gov> So is the main part of the program 2/6-2/8, closing at 5:00? From mamidala at cse.ohio-state.edu Thu Dec 22 13:38:44 2005 From: mamidala at cse.ohio-state.edu (amith rajith mamidala) Date: Thu, 22 Dec 2005 16:38:44 -0500 (EST) Subject: [openib-general] RFC MPI and app. requirements of OpenIB In-Reply-To: Message-ID: > An alternative is to provide UD and multicast/broadcast support in the CMA. I > know that the Intel MPI runs over DAPL, which does not provide multicast > support. Can MPI operate with unreliable multicast support? Does MPI plan on > using IB multicast? Yes, MPI can operate with unreliable multicast support. MVAPICH-0.9.6 has this broadcast support over IB multicast. As Hal suggested earlier, application processes interact with the SA to create/join multicast groups. Thanks, Amith > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From trimmer at silverstorm.com Thu Dec 22 13:54:17 2005 From: trimmer at silverstorm.com (Rimmer, Todd) Date: Thu, 22 Dec 2005 16:54:17 -0500 Subject: [openib-general] RFC MPI and app. requirements of OpenIB Message-ID: <5D78D28F88822E4D8702BB9EEF1A4367D12BF3@mercury.infiniconsys.com> Since the SA views multicast operations at a node level, and applications need to individually operate on multicast groups, it is appropriate for the core stack to provide some multicast management APIs (see the sketch below). These would multiplex requests from all the applications on a node to make the appropriate requests to the SA. Reference counts would need to be maintained in the core stack for each MC group so that the node would remove itself from the group only on the last application exit/unregister. This would also be a good place to handle the "MC group persistence issues", namely rejoining requested groups when ports go up/down, SMs change (client reregister), etc. Todd Rimmer Chief Systems Architect SilverStorm Technologies Voice: 610-233-4852 Fax: 610-233-4777 TRimmer at SilverStorm.com www.SilverStorm.com
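As a rough sketch of the node-level multiplexing Todd describes -- invented names, not an existing openib-core interface; sa_send_join/sa_send_leave stand in for the real SA MCMemberRecord transactions, and locking is omitted -- the core stack could keep one refcounted object per MGID and only talk to the SA on the 0->1 and 1->0 transitions:

#include <linux/list.h>
#include <linux/slab.h>
#include <linux/errno.h>
#include <linux/string.h>
#include <rdma/ib_verbs.h>	/* union ib_gid */

struct mcast_group {
	union ib_gid	 mgid;
	int		 refcount;
	struct list_head list;
};

static LIST_HEAD(mcast_groups);

static struct mcast_group *find_group(union ib_gid *mgid)
{
	struct mcast_group *g;

	list_for_each_entry(g, &mcast_groups, list)
		if (!memcmp(&g->mgid, mgid, sizeof *mgid))
			return g;
	return NULL;
}

int mcast_join(union ib_gid *mgid)
{
	struct mcast_group *g = find_group(mgid);

	if (!g) {
		g = kzalloc(sizeof *g, GFP_KERNEL);
		if (!g)
			return -ENOMEM;
		g->mgid = *mgid;
		list_add(&g->list, &mcast_groups);
	}
	if (g->refcount++ == 0)
		return sa_send_join(mgid);	/* first user on this node */
	return 0;				/* already a member */
}

void mcast_leave(union ib_gid *mgid)
{
	struct mcast_group *g = find_group(mgid);

	if (g && --g->refcount == 0) {
		sa_send_leave(mgid);		/* last user just left */
		list_del(&g->list);
		kfree(g);
	}
}

A client-reregister event would then just walk mcast_groups and rejoin every group with a nonzero refcount, which also covers the persistence cases Todd mentions.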
> -----Original Message----- > From: amith rajith mamidala [mailto:mamidala at cse.ohio-state.edu] > Sent: Thursday, December 22, 2005 4:39 PM > To: Sean Hefty > Cc: openib > Subject: RE: [openib-general] RFC MPI and app. requirements of OpenIB > > > > > An alternative is to provide UD and multicast/broadcast > support in the CMA. I > > know that the Intel MPI runs over DAPL, which does not > provide multicast > > support. Can MPI operate with unreliable multicast > support? Does MPI plan on > > using IB multicast? > > Yes, MPI can operate with unreliable multicast support. > MVAPICH-0.9.6 has this broadcast support over IB multicast. As Hal > suggested earlier, application processes interact with the SA to > create/join > multicast groups. > > Thanks, > Amith > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From arlin.r.davis at intel.com Thu Dec 22 15:20:29 2005 From: arlin.r.davis at intel.com (Arlin Davis) Date: Thu, 22 Dec 2005 15:20:29 -0800 Subject: [openib-general] [RFC][PATCH] OpenIB uDAPL extension proposal - sample immed data and atomic api's Message-ID: James and Arkady, DAPL provides a generalized abstraction to RDMA capable transports. As a generalized abstraction, it cannot exploit the unique properties that many of the underlying platforms/interconnects can provide, so I would like to propose a simple (minimum impact on libdat) extensible interface to uDAPL that will allow vendors to expose such capabilities. I am looking for feedback, especially from the DAT collaborative. I have included both a design document and actual working code as a reference. The patch provides a fully tested DAT and DAPL library (openib_cma) set with the following provider extensions: DAT_RETURN dat_ep_post_write_immed( IN DAT_EP_HANDLE ep_handle, IN DAT_COUNT num_segments, IN DAT_LMR_TRIPLET *local_iov, IN DAT_DTO_COOKIE user_cookie, IN DAT_RMR_TRIPLE *remote_iov, IN DAT_UINT32 immediate_data, IN DAT_COMPLETION_FLAGS completion_flags); DAT_RETURN dat_ep_post_cmp_and_swap( IN DAT_EP_HANDLE ep_handle, IN DAT_UINT64 cmp_value, IN DAT_UINT64 swap_value, IN DAT_LMR_TRIPLE *local_iov, IN DAT_DTO_COOKIE user_cookie, IN DAT_RMR_TRIPLE *remote_iov, IN DAT_COMPLETION_FLAGS completion_flags); DAT_RETURN dat_ep_post_fetch_and_add( IN DAT_EP_HANDLE ep_handle, IN DAT_UINT64 add_value, IN DAT_LMR_TRIPLE *local_iov, IN DAT_DTO_COOKIE user_cookie, IN DAT_RMR_TRIPLE *remote_iov, IN DAT_COMPLETION_FLAGS completion_flags); Also included is a sample program (dtest_ext.c) that can be used as a programming example.
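For readers who want the shape of a call before digging into dtest_ext.c, here is a hypothetical sequence for the fetch-and-add extension, based only on the signatures above (it is not an excerpt from dtest_ext.c, it assumes the patched headers that declare the extension, and it assumes ep is a connected endpoint whose iovs describe registered 8-byte buffers):

DAT_RETURN do_fetch_and_add(DAT_EP_HANDLE ep,
                            DAT_LMR_TRIPLET *local_iov,
                            DAT_RMR_TRIPLE *remote_iov)
{
	DAT_DTO_COOKIE cookie;
	DAT_RETURN ret;

	cookie.as_64 = 0x1234;	/* handed back in the DTO completion */

	/* atomically add 1 to the remote 64-bit value; the previous
	 * value lands in the 8-byte buffer local_iov describes */
	ret = dat_ep_post_fetch_and_add(ep, 1, local_iov, cookie,
					remote_iov,
					DAT_COMPLETION_DEFAULT_FLAG);
	if (ret != DAT_SUCCESS)
		return ret;

	/* the completion is then reaped from the EP's request EVD
	 * with dat_evd_wait()/dat_evd_dequeue() as for any DTO */
	return DAT_SUCCESS;
}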
Thanks, -arlin Signed-off by: Arlin Davis Index: test/dtest/dat.conf =================================================================== --- test/dtest/dat.conf (revision 4589) +++ test/dtest/dat.conf (working copy) @@ -1,11 +1,20 @@ # -# DAT 1.1 and 1.2 configuration file +# DAT 1.2 configuration file # # Each entry should have the following fields: # # \ # # -# Example for openib using the first Mellanox adapter, port 1 and port 2 +# Example for openib_cma and openib_scm +# +# For scm version you specify as actual device name and port +# For cma version you specify as: +# network address, network hostname, or netdev name and 0 for port +# +OpenIB-scm1 u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "mthca0 1" "" +OpenIB-scm2 u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "mthca0 2" "" +OpenIB-cma-ip u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "192.168.0.22 0" "" +OpenIB-cma-name u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "svr1-ib0 0" "" +OpenIB-cma-netdev u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "ib0 0" "" -IB1 u1.2 nonthreadsafe default Index: test/dtest/makefile =================================================================== --- test/dtest/makefile (revision 4589) +++ test/dtest/makefile (working copy) @@ -4,13 +4,18 @@ CFLAGS = -O2 -g DAT_INC = ../../dat/include DAT_LIB = /usr/local/lib -all: dtest +all: dtest dtest_ext clean: - rm -f *.o;touch *.c;rm -f dtest + rm -f *.o;touch *.c;rm -f dtest dtest_ext dtest: ./dtest.c $(CC) $(CFLAGS) ./dtest.c -o dtest \ -DDAPL_PROVIDER='"OpenIB-cma-ip"' \ -I $(DAT_INC) -L $(DAT_LIB) -ldat +dtest_ext: ./dtest_ext.c + $(CC) $(CFLAGS) ./dtest_ext.c -o dtest_ext \ + -DDAPL_PROVIDER='"OpenIB-cma-ip"' \ + -I $(DAT_INC) -L $(DAT_LIB) -ldat + Index: test/dtest/README =================================================================== --- test/dtest/README (revision 4589) +++ test/dtest/README (working copy) @@ -1,10 +1,11 @@ simple dapl test just for initial openIB uDAPL testing... dtest/dtest.c + dtest/dtest_ext.c dtest/makefile dtest/dat.conf -to build (default uDAPL name == IB1, ib device == mthca0, port == 1) +to build (default uDAPL name == OpenIB-cma-ip) edit makefile and change path (DAT_LIB) to appropriate libdat.so edit dat.conf and change path to appropriate libdapl.so cp dat.conf to /etc/dat.conf Index: dapl/include/dapl.h =================================================================== --- dapl/include/dapl.h (revision 4589) +++ dapl/include/dapl.h (working copy) @@ -1,25 +1,28 @@ /* - * Copyright (c) 2002-2003, Network Appliance, Inc. All rights reserved. + * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. * * This Software is licensed under one of the following licenses: - * + * * 1) under the terms of the "Common Public License 1.0" a copy of which is - * available from the Open Source Initiative, see + * in the file LICENSE.txt in the root directory. The license is also + * available from the Open Source Initiative, see * http://www.opensource.org/licenses/cpl.php. - * - * 2) under the terms of the "The BSD License" a copy of which is - * available from the Open Source Initiative, see + * + * 2) under the terms of the "The BSD License" a copy of which is in the file + * LICENSE2.txt in the root directory. 
The license is also available from + * the Open Source Initiative, see * http://www.opensource.org/licenses/bsd-license.php. - * + * * 3) under the terms of the "GNU General Public License (GPL) Version 2" a - * copy of which is available from the Open Source Initiative, see + * copy of which is in the file LICENSE3.txt in the root directory. The + * license is also available from the Open Source Initiative, see * http://www.opensource.org/licenses/gpl-license.php. - * + * * Licensee has the right to choose one of the above licenses. - * + * * Redistributions of source code must retain the above copyright * notice and one of the license notices. - * + * * Redistributions in binary form must reproduce both the above copyright * notice, one of the license notices in the documentation * and/or other materials provided with the distribution. @@ -61,6 +64,8 @@ #include "dapl_dummy_util.h" #elif OPENIB #include "dapl_ib_util.h" +#elif DET +#include "dapl_det_util.h" #endif /********************************************************************* @@ -213,6 +218,10 @@ typedef struct dapl_cookie DAPL_COOKIE; typedef struct dapl_dto_cookie DAPL_DTO_COOKIE; typedef struct dapl_rmr_cookie DAPL_RMR_COOKIE; +#ifdef DAPL_EXTENSIONS +typedef struct dapl_ext_cookie DAPL_EXT_COOKIE; +#endif + typedef struct dapl_private DAPL_PRIVATE; typedef void (*DAPL_CONNECTION_STATE_HANDLER) ( @@ -563,6 +572,13 @@ typedef enum dapl_dto_type DAPL_DTO_TYPE_RECV, DAPL_DTO_TYPE_RDMA_WRITE, DAPL_DTO_TYPE_RDMA_READ, +#ifdef DAPL_EXTENSIONS + DAPL_DTO_TYPE_RDMA_WRITE_IMMED, + DAPL_DTO_TYPE_RECV_IMMED, + DAPL_DTO_TYPE_CMP_AND_SWAP, + DAPL_DTO_TYPE_FETCH_AND_ADD, +#endif + } DAPL_DTO_TYPE; typedef enum dapl_cookie_type @@ -570,6 +586,9 @@ typedef enum dapl_cookie_type DAPL_COOKIE_TYPE_NULL, DAPL_COOKIE_TYPE_DTO, DAPL_COOKIE_TYPE_RMR, +#ifdef DAPL_EXTENSIONS + DAPL_COOKIE_TYPE_EXTENSION, +#endif } DAPL_COOKIE_TYPE; /* DAPL_DTO_COOKIE used as context for DTO WQEs */ @@ -587,6 +606,27 @@ struct dapl_rmr_cookie DAT_RMR_COOKIE cookie; }; +#ifdef DAPL_EXTENSIONS + +/* DAPL extended cookie types */ +typedef enum dapl_ext_type +{ + DAPL_EXT_TYPE_RDMA_WRITE_IMMED, + DAPL_EXT_TYPE_CMP_AND_SWAP, + DAPL_EXT_TYPE_FETCH_AND_ADD, + DAPL_EXT_TYPE_RECV +} DAPL_EXT_TYPE; + +/* DAPL extended cookie */ +struct dapl_ext_cookie +{ + DAPL_EXT_TYPE type; + DAT_DTO_COOKIE cookie; + DAT_COUNT size; /* used RDMA write with immed */ +}; + +#endif + /* DAPL_COOKIE used as context for WQEs */ struct dapl_cookie { @@ -597,6 +637,9 @@ struct dapl_cookie { DAPL_DTO_COOKIE dto; DAPL_RMR_COOKIE rmr; +#ifdef DAPL_EXTENSIONS + DAPL_EXT_COOKIE ext; +#endif } val; }; @@ -1116,6 +1159,15 @@ extern DAT_RETURN dapl_srq_set_lw( IN DAT_SRQ_HANDLE, /* srq_handle */ IN DAT_COUNT); /* low_watermark */ +#ifdef DAPL_EXTENSIONS + +extern DAT_RETURN dapl_extensions( + IN DAT_HANDLE, /* dat_handle */ + IN DAT_EXT_OP, /* extension operation */ + IN va_list ); /* va_list args */ + +#endif + /* * DAPL internal utility function prototpyes */ Index: dapl/udapl/Makefile =================================================================== --- dapl/udapl/Makefile (revision 4589) +++ dapl/udapl/Makefile (working copy) @@ -156,6 +156,7 @@ PROVIDER = $(TOPDIR)/../openib_cma CFLAGS += -DOPENIB CFLAGS += -DCQ_WAIT_OBJECT CFLAGS += -I/usr/local/include/infiniband +CFLAGS += -I/usr/local/include/rdma endif # @@ -168,6 +169,12 @@ endif # VN_MEM_SHARED_VIRTUAL_SUPPORT # CFLAGS += -DVN_MEM_SHARED_VIRTUAL_SUPPORT=1 +# If an implementation supports DAPL extensions +CFLAGS += -DDAPL_EXTENSIONS + +# If an 
implementation supports DAPL provider specific attributes +CFLAGS += -DDAPL_PROVIDER_SPECIFIC_ATTR + CFLAGS += -I. CFLAGS += -I.. CFLAGS += -I../../dat/include @@ -283,6 +290,8 @@ LDFLAGS += -libverbs -lrdmacm LDFLAGS += -rpath /usr/local/lib -L /usr/local/lib PROVIDER_SRCS = dapl_ib_util.c dapl_ib_cq.c dapl_ib_qp.c \ dapl_ib_cm.c dapl_ib_mem.c +# implementation supports DAPL extensions +PROVIDER_SRCS += dapl_ib_extensions.c endif UDAPL_SRCS = dapl_init.c \ Index: dapl/common/dapl_ia_query.c =================================================================== --- dapl/common/dapl_ia_query.c (revision 4589) +++ dapl/common/dapl_ia_query.c (working copy) @@ -167,6 +167,14 @@ dapl_ia_query ( #if !defined(__KDAPL__) provider_attr->pz_support = DAT_PZ_UNIQUE; #endif /* !KDAPL */ + + /* + * Have provider set their own. + */ +#ifdef DAPL_PROVIDER_SPECIFIC_ATTR + dapls_set_provider_specific_attr(provider_attr); +#endif + /* * Set up evd_stream_merging_supported options. Note there is * one bit per allowable combination, using the ordinal Index: dapl/common/dapl_adapter_util.h =================================================================== --- dapl/common/dapl_adapter_util.h (revision 4589) +++ dapl/common/dapl_adapter_util.h (working copy) @@ -256,6 +256,21 @@ dapls_ib_wait_object_wait ( IN u_int32_t timeout); #endif +#ifdef DAPL_PROVIDER_SPECIFIC_ATTR +void +dapls_set_provider_specific_attr( + IN DAT_PROVIDER_ATTR *provider_attr ); +#endif + +#ifdef DAPL_EXTENSIONS +void +dapls_cqe_to_event_extension( + IN DAPL_EP *ep_ptr, + IN DAPL_COOKIE *cookie, + IN ib_work_completion_t *cqe_ptr, + OUT DAT_EVENT *event_ptr); +#endif + /* * Values for provider DAT_NAMED_ATTR */ @@ -272,6 +287,8 @@ dapls_ib_wait_object_wait ( #include "dapl_dummy_dto.h" #elif OPENIB #include "dapl_ib_dto.h" +#elif DET +#include "dapl_det_dto.h" #endif Index: dapl/common/dapl_provider.c =================================================================== --- dapl/common/dapl_provider.c (revision 4589) +++ dapl/common/dapl_provider.c (working copy) @@ -221,7 +221,11 @@ DAT_PROVIDER g_dapl_provider_template = &dapl_srq_post_recv, &dapl_srq_query, &dapl_srq_resize, - &dapl_srq_set_lw + &dapl_srq_set_lw, + +#ifdef DAPL_EXTENSIONS + &dapl_extensions +#endif }; #endif /* __KDAPL__ */ Index: dapl/common/dapl_evd_util.c =================================================================== --- dapl/common/dapl_evd_util.c (revision 4589) +++ dapl/common/dapl_evd_util.c (working copy) @@ -502,6 +502,20 @@ dapli_evd_eh_print_cqe ( #ifdef DAPL_DBG static char *optable[] = { +#ifdef OPENIB + /* different order for openib verbs */ + "OP_RDMA_WRITE", + "OP_RDMA_WRITE_IMM", + "OP_SEND", + "OP_SEND_IMM", + "OP_RDMA_READ", + "OP_COMP_AND_SWAP", + "OP_FETCH_AND_ADD", + "OP_RECEIVE", + "OP_RECEIVE_IMM", + "OP_BIND_MW", + "OP_INVALID", +#else "OP_SEND", "OP_RDMA_READ", "OP_RDMA_WRITE", @@ -509,6 +523,7 @@ dapli_evd_eh_print_cqe ( "OP_FETCH_AND_ADD", "OP_RECEIVE", "OP_BIND_MW", +#endif 0 }; @@ -1113,6 +1128,15 @@ dapli_evd_cqe_to_event ( dapls_cookie_dealloc (&ep_ptr->req_buffer, cookie); break; } + +#ifdef DAPL_EXTENSIONS + case DAPL_COOKIE_TYPE_EXTENSION: + { + dapls_cqe_to_event_extension(ep_ptr, cookie, cqe_ptr, event_ptr); + break; + } +#endif + default: { dapl_os_assert (!"Invalid Operation type"); Index: dapl/openib_cma/dapl_ib_dto.h =================================================================== --- dapl/openib_cma/dapl_ib_dto.h (revision 4589) +++ dapl/openib_cma/dapl_ib_dto.h (working copy) @@ -35,7 +35,7 @@ * * Description: * - * The 
uDAPL openib provider - DTO operations and CQE macros
+ * The OpenIB uCMA provider - DTO operations and CQE macros
 *
 ****************************************************************************
 *    Source Control System Information
@@ -119,7 +119,6 @@ dapls_ib_post_recv (
 	return DAT_SUCCESS;
 }
 
-
 /*
  * dapls_ib_post_send
  *
@@ -191,7 +190,7 @@ dapls_ib_post_send (
 	if (cookie != NULL)
 		cookie->val.dto.size = total_len;
-	
+
 	if ((op_type == OP_RDMA_WRITE) || (op_type == OP_RDMA_READ)) {
 		wr.wr.rdma.remote_addr = remote_iov->target_address;
 		wr.wr.rdma.rkey = remote_iov->rmr_context;
@@ -224,6 +223,152 @@ dapls_ib_post_send (
 	return DAT_SUCCESS;
 }
 
+#ifdef DAPL_EXTENSIONS
+/*
+ * dapls_ib_post_ext_send
+ *
+ * Provider specific extended Post SEND function
+ */
+STATIC _INLINE_ DAT_RETURN
+dapls_ib_post_ext_send (
+	IN  DAPL_EP			*ep_ptr,
+	IN  ib_send_op_type_t		op_type,
+	IN  DAPL_COOKIE			*cookie,
+	IN  DAT_COUNT			segments,
+	IN  DAT_LMR_TRIPLET		*local_iov,
+	IN  const DAT_RMR_TRIPLET	*remote_iov,
+	IN  DAT_UINT32			idata,
+	IN  DAT_UINT64			compare_add,
+	IN  DAT_UINT64			swap,
+	IN  DAT_COMPLETION_FLAGS	completion_flags)
+{
+	ib_data_segment_t ds_array[DEFAULT_DS_ENTRIES];
+	ib_data_segment_t *ds_array_p;
+	struct ibv_send_wr wr;
+	struct ibv_send_wr *bad_wr;
+	ib_hca_transport_t *ibt_ptr =
+		&ep_ptr->header.owner_ia->hca_ptr->ib_trans;
+	DAT_COUNT i, total_len;
+
+	dapl_dbg_log(DAPL_DBG_TYPE_EP,
+		     " post_snd: ep %p op %d ck %p sgs "
+		     "%d l_iov %p r_iov %p f %d\n",
+		     ep_ptr, op_type, cookie, segments, local_iov,
+		     remote_iov, completion_flags);
+
+	dapl_dbg_log(DAPL_DBG_TYPE_EP,
+		     " post_snd: ep %p cookie %p segs %d l_iov %p\n",
+		     ep_ptr, cookie, segments, local_iov);
+
+	if (segments <= DEFAULT_DS_ENTRIES)
+		ds_array_p = ds_array;
+	else
+		ds_array_p =
+			dapl_os_alloc(segments * sizeof(ib_data_segment_t));
+
+	if (NULL == ds_array_p)
+		return (DAT_INSUFFICIENT_RESOURCES);
+
+	/* setup the work request */
+	wr.next = 0;
+	wr.opcode = op_type;
+	wr.num_sge = 0;
+	wr.send_flags = 0;
+	wr.wr_id = (uint64_t)(uintptr_t)cookie;
+	wr.sg_list = ds_array_p;
+	total_len = 0;
+
+	for (i = 0; i < segments; i++ ) {
+		if ( !local_iov[i].segment_length )
+			continue;
+
+		ds_array_p->addr = (uint64_t) local_iov[i].virtual_address;
+		ds_array_p->length = local_iov[i].segment_length;
+		ds_array_p->lkey = local_iov[i].lmr_context;
+
+		dapl_dbg_log(DAPL_DBG_TYPE_EP,
+			     " post_snd: lkey 0x%x va %p len %d\n",
+			     ds_array_p->lkey, ds_array_p->addr,
+			     ds_array_p->length );
+
+		total_len += ds_array_p->length;
+		wr.num_sge++;
+		ds_array_p++;
+	}
+
+	if (cookie != NULL)
+		cookie->val.dto.size = total_len;
+
+	if ((op_type == OP_RDMA_WRITE) ||
+	    (op_type == OP_RDMA_WRITE_IMM) ||
+	    (op_type == OP_RDMA_READ)) {
+		wr.wr.rdma.remote_addr = remote_iov->target_address;
+		wr.wr.rdma.rkey = remote_iov->rmr_context;
+		dapl_dbg_log(DAPL_DBG_TYPE_EP,
+			     " post_snd_rdma: rkey 0x%x va %#016Lx\n",
+			     wr.wr.rdma.rkey, wr.wr.rdma.remote_addr);
+	}
+
+	switch (op_type) {
+	case OP_RDMA_WRITE_IMM:
+		dapl_dbg_log(DAPL_DBG_TYPE_EP,
+			     " post_snd: OP_RDMA_WRITE_IMMED=0x%x\n", idata );
+		wr.imm_data = idata;
+		break;
+	case OP_COMP_AND_SWAP:
+		/* OP_COMP_AND_SWAP has direct IBAL wr_type mapping */
+		dapl_dbg_log(DAPL_DBG_TYPE_EP,
+			     " post_snd: OP_COMP_AND_SWAP=%lx,"
+			     "%lx rkey 0x%x va %#016Lx\n",
+			     compare_add, swap, remote_iov->rmr_context,
+			     remote_iov->target_address);
+
+		wr.wr.atomic.compare_add = compare_add;
+		wr.wr.atomic.swap = swap;
+		wr.wr.atomic.remote_addr = remote_iov->target_address;
+		wr.wr.atomic.rkey = remote_iov->rmr_context;
+		break;
+	case OP_FETCH_AND_ADD:
+		/* OP_FETCH_AND_ADD has direct IBAL wr_type mapping */
+		dapl_dbg_log(DAPL_DBG_TYPE_EP,
+			     " post_snd: OP_FETCH_AND_ADD=%lx"
+			     " rkey 0x%x va %#016Lx\n",
+			     compare_add, remote_iov->rmr_context,
+			     remote_iov->target_address);
+
+		wr.wr.atomic.compare_add = compare_add;
+		wr.wr.atomic.remote_addr = remote_iov->target_address;
+		wr.wr.atomic.rkey = remote_iov->rmr_context;
+		break;
+	default:
+		break;
+	}
+
+	/* inline data for send or write ops */
+	if ((total_len <= ibt_ptr->max_inline_send) &&
+	    ((op_type == OP_SEND) || (op_type == OP_RDMA_WRITE)))
+		wr.send_flags |= IBV_SEND_INLINE;
+
+	/* set completion flags in work request */
+	wr.send_flags |= (DAT_COMPLETION_SUPPRESS_FLAG &
			  completion_flags) ? 0 : IBV_SEND_SIGNALED;
+	wr.send_flags |= (DAT_COMPLETION_BARRIER_FENCE_FLAG &
			  completion_flags) ? IBV_SEND_FENCE : 0;
+	wr.send_flags |= (DAT_COMPLETION_SOLICITED_WAIT_FLAG &
			  completion_flags) ? IBV_SEND_SOLICITED : 0;
+
+	dapl_dbg_log(DAPL_DBG_TYPE_EP,
		     " post_snd: op 0x%x flags 0x%x sglist %p, %d\n",
		     wr.opcode, wr.send_flags, wr.sg_list, wr.num_sge);
+
+	if (ibv_post_send(ep_ptr->qp_handle->cm_id->qp, &wr, &bad_wr))
+		return( dapl_convert_errno(EFAULT,"ibv_send") );
+
+	dapl_dbg_log(DAPL_DBG_TYPE_EP," post_snd: returned\n");
+	return DAT_SUCCESS;
+}
+#endif
+
 STATIC _INLINE_ DAT_RETURN
 dapls_ib_optional_prv_dat(
	IN  DAPL_CR		*cr_ptr,
Index: dapl/openib_cma/dapl_ib_util.c
===================================================================
--- dapl/openib_cma/dapl_ib_util.c	(revision 4589)
+++ dapl/openib_cma/dapl_ib_util.c	(working copy)
@@ -35,7 +35,7 @@
 *
 * Description:
 *
- * The uDAPL openib provider - init, open, close, utilities, work thread
+ * The OpenIB uCMA provider - init, open, close, utilities, work thread
 *
 ****************************************************************************
 *   Source Control System Information
@@ -64,7 +64,6 @@ static const char rcsid[] = "$Id:  $";
 #include		/* for struct ifreq */
 #include		/* for ARPHRD_INFINIBAND */
 
-
 int g_dapl_loopback_connection = 0;
 int g_ib_pipe[2];
 ib_thread_state_t g_ib_thread_state = 0;
@@ -727,7 +726,7 @@ void dapli_thread(void *arg)
	int		ret,idx,fds;
	char		rbuf[2];
	
-	dapl_dbg_log (DAPL_DBG_TYPE_CM,
+	dapl_dbg_log (DAPL_DBG_TYPE_UTIL,
		      " ib_thread(%d,0x%x): ENTER: pipe %d ucma %d\n",
		      getpid(), g_ib_thread, g_ib_pipe[0], rdma_get_fd());
 
@@ -767,7 +766,7 @@ void dapli_thread(void *arg)
			ufds[idx].revents = 0;
			uhca[idx] = hca;
 
-			dapl_dbg_log(DAPL_DBG_TYPE_CM,
+			dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
				     " ib_thread(%d) poll_fd: hca[%d]=%p, async=%d"
				     " pipe=%d cm=%d cq=d\n",
				     getpid(), hca, ufds[idx-1].fd,
@@ -783,14 +782,14 @@ void dapli_thread(void *arg)
		dapl_os_unlock(&g_hca_lock);
		ret = poll(ufds, fds, -1);
		if (ret <= 0) {
-			dapl_dbg_log(DAPL_DBG_TYPE_WARN,
+			dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
				     " ib_thread(%d): ERR %s poll\n",
				     getpid(),strerror(errno));
			dapl_os_lock(&g_hca_lock);
			continue;
		}
 
-		dapl_dbg_log(DAPL_DBG_TYPE_CM,
+		dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
			     " ib_thread(%d) poll_event: "
			     " async=0x%x pipe=0x%x cm=0x%x cq=0x%x\n",
			     getpid(), ufds[idx-1].revents, ufds[0].revents,
@@ -834,3 +833,63 @@ void dapli_thread(void *arg)
	dapl_os_unlock(&g_hca_lock);
 }
 
+#ifdef DAPL_PROVIDER_SPECIFIC_ATTR
+/*
+ * dapls_set_provider_specific_attr
+ *
+ * Input:
+ *	attr_ptr	Pointer to provider attributes
+ *
+ * Output:
+ *	none
+ *
+ * Returns:
+ *	void
+ */
+DAT_NAMED_ATTR	ib_attrs[] = {
+
+#ifdef DAPL_EXTENSIONS
+	{
+		DAT_EXT_ATTR,
+		DAT_EXT_ATTR_TRUE
+	},
+	{
+		DAT_EXT_ATTR_RDMA_WRITE_IMMED,
+		DAT_EXT_ATTR_TRUE
+	},
+	{
DAT_EXT_ATTR_RECV_IMMED, + DAT_EXT_ATTR_TRUE + }, + /* inbound immediate data placed in event, NOT payload */ + { + DAT_EXT_ATTR_RECV_IMMED_EVENT, + DAT_EXT_ATTR_TRUE + }, + { + DAT_EXT_ATTR_FETCH_AND_ADD, + DAT_EXT_ATTR_TRUE + }, + { + DAT_EXT_ATTR_CMP_AND_SWAP, + DAT_EXT_ATTR_TRUE + }, +#else + { + "DAT_EXTENSION_INTERFACE", + "FALSE" + }, +#endif +}; + +#define SPEC_ATTR_SIZE(x) ( sizeof(x)/sizeof(DAT_NAMED_ATTR) ) + +void dapls_set_provider_specific_attr( + IN DAT_PROVIDER_ATTR *attr_ptr ) +{ + attr_ptr->num_provider_specific_attr = SPEC_ATTR_SIZE(ib_attrs); + attr_ptr->provider_specific_attr = ib_attrs; +} + +#endif + Index: dapl/openib_cma/dapl_ib_mem.c =================================================================== --- dapl/openib_cma/dapl_ib_mem.c (revision 4589) +++ dapl/openib_cma/dapl_ib_mem.c (working copy) @@ -25,9 +25,9 @@ /********************************************************************** * - * MODULE: dapl_det_mem.c + * MODULE: dapl_ib_mem.c * - * PURPOSE: Intel DET APIs: Memory windows, registration, + * PURPOSE: OpenIB uCMA provider Memory windows, registration, * and protection domain * * $Id: $ @@ -72,12 +72,10 @@ dapls_convert_privileges(IN DAT_MEM_PRIV access |= IBV_ACCESS_LOCAL_WRITE; if (DAT_MEM_PRIV_REMOTE_WRITE_FLAG & privileges) access |= IBV_ACCESS_REMOTE_WRITE; - if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) - access |= IBV_ACCESS_REMOTE_READ; - if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) - access |= IBV_ACCESS_REMOTE_READ; - if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) + if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) { access |= IBV_ACCESS_REMOTE_READ; + access |= IBV_ACCESS_REMOTE_ATOMIC; + } return access; } Index: dapl/openib_cma/dapl_ib_cm.c =================================================================== --- dapl/openib_cma/dapl_ib_cm.c (revision 4589) +++ dapl/openib_cma/dapl_ib_cm.c (working copy) @@ -35,7 +35,7 @@ * * Description: * - * The uDAPL openib provider - connection management + * The OpenIB uCMA provider - uCMA connection management * **************************************************************************** * Source Control System Information @@ -592,7 +592,11 @@ dapls_ib_setup_conn_listener(IN DAPL_IA if (rdma_bind_addr(conn->cm_id, (struct sockaddr *)&ia_ptr->hca_ptr->hca_address)) { - dat_status = dapl_convert_errno(errno,"setup_listener"); + if (errno == -EBUSY) + dat_status = DAT_CONN_QUAL_IN_USE; + else + dat_status = + dapl_convert_errno(errno,"setup_listener"); goto bail; } Index: dapl/openib_cma/dapl_ib_qp.c =================================================================== --- dapl/openib_cma/dapl_ib_qp.c (revision 4589) +++ dapl/openib_cma/dapl_ib_qp.c (working copy) @@ -25,9 +25,9 @@ /********************************************************************** * - * MODULE: dapl_det_qp.c + * MODULE: dapl_ib_qp.c * - * PURPOSE: QP routines for access to DET Verbs + * PURPOSE: OpenIB uCMA QP routines * * $Id: $ **********************************************************************/ Index: dapl/openib_cma/dapl_ib_util.h =================================================================== --- dapl/openib_cma/dapl_ib_util.h (revision 4589) +++ dapl/openib_cma/dapl_ib_util.h (working copy) @@ -35,7 +35,7 @@ * * Description: * - * The uDAPL openib provider - definitions, prototypes, + * The OpenIB uCMA provider - definitions, prototypes, * **************************************************************************** * Source Control System Information Index: dapl/openib_cma/README 
===================================================================
--- dapl/openib_cma/README	(revision 4589)
+++ dapl/openib_cma/README	(working copy)
@@ -23,15 +23,22 @@ New files for openib_scm provider
	dapl/openib_cma/dapl_ib_util.c
	dapl/openib_cma/dapl_ib_util.h
	dapl/openib_cma/dapl_ib_cm.c
+	dapl/openib_cma/dapl_ib_extensions.c
 
 A simple dapl test just for openib_scm testing...
 
	test/dtest/dtest.c
+	test/dtest/dtest_ext.c
	test/dtest/makefile
 
	server: dtest -s
	client: dtest -h hostname
 
+or with extensions
+
+	server: dtest_ext -s
+	client: dtest_ext -h hostname
+
 known issues: no memory windows support in ibverbs, dat_create_rmr fails.
 
Index: dapl/openib_cma/dapl_ib_cq.c
===================================================================
--- dapl/openib_cma/dapl_ib_cq.c	(revision 4589)
+++ dapl/openib_cma/dapl_ib_cq.c	(working copy)
@@ -35,7 +35,7 @@
 *
 * Description:
 *
- * The uDAPL openib provider - completion queue
+ * The OpenIB uCMA provider - completion queue
 *
 ****************************************************************************
 *   Source Control System Information
@@ -498,7 +498,10 @@ dapls_ib_wait_object_wait(IN ib_wait_obj
	if (timeout != DAT_TIMEOUT_INFINITE)
		timeout_ms = timeout/1000;
 
-	status = poll(&cq_fd, 1, timeout_ms);
+	/* restart the syscall only when interrupted */
+	while ((status = poll(&cq_fd, 1, timeout_ms)) == -1 &&
+	       errno == EINTR)
+		continue;
 
	/* returned event */
	if (status > 0) {
@@ -511,6 +514,8 @@ dapls_ib_wait_obj
	/* timeout */
	} else if (status == 0)
		status = ETIMEDOUT;
+	else
+		status = errno;
 
	dapl_dbg_log(DAPL_DBG_TYPE_UTIL,
		     " cq_object_wait: RET evd %p ibv_cq %p ibv_ctx %p %s\n",
Index: dat/include/dat/dat_redirection.h
===================================================================
--- dat/include/dat/dat_redirection.h	(revision 4589)
+++ dat/include/dat/dat_redirection.h	(working copy)
@@ -59,10 +59,10 @@ typedef struct dat_provider DAT_PROVIDER
 * This would allow a good compiler to avoid indirection overhead when
 * making function calls.
*/ - #define DAT_HANDLE_TO_PROVIDER(handle) (*(DAT_PROVIDER **)(handle)) #endif + #define DAT_IA_QUERY(ia, evd, ia_msk, ia_ptr, p_msk, p_ptr) \ (*DAT_HANDLE_TO_PROVIDER (ia)->ia_query_func) (\ (ia), \ @@ -395,6 +395,12 @@ typedef struct dat_provider DAT_PROVIDER (lbuf), \ (cookie)) +#define DAT_EXTENSION(handle, op, args) \ + (*DAT_HANDLE_TO_PROVIDER (handle)->extension_func) (\ + (handle), \ + (op), \ + (args)) + /*************************************************************** * * FUNCTION PROTOTYPES @@ -720,4 +726,11 @@ typedef DAT_RETURN (*DAT_SRQ_POST_RECV_F IN DAT_LMR_TRIPLET *, /* local_iov */ IN DAT_DTO_COOKIE ); /* user_cookie */ +/* Extension function */ +#include +typedef DAT_RETURN (*DAT_EXTENSION_FUNC) ( + IN DAT_HANDLE, /* dat handle */ + IN DAT_EXT_OP, /* extension operation */ + IN va_list ); /* va_list */ + #endif /* _DAT_REDIRECTION_H_ */ Index: dat/include/dat/dat.h =================================================================== --- dat/include/dat/dat.h (revision 4589) +++ dat/include/dat/dat.h (working copy) @@ -854,11 +854,15 @@ typedef enum dat_event_number DAT_ASYNC_ERROR_EP_BROKEN = 0x08003, DAT_ASYNC_ERROR_TIMED_OUT = 0x08004, DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR = 0x08005, - DAT_SOFTWARE_EVENT = 0x10001 + DAT_SOFTWARE_EVENT = 0x10001, + DAT_EXTENSION_EVENT = 0x20001 + } DAT_EVENT_NUMBER; -/* Union for event Data */ +/* include extension data definitions */ +#include +/* Union for event Data */ typedef union dat_event_data { DAT_DTO_COMPLETION_EVENT_DATA dto_completion_event_data; @@ -867,6 +871,7 @@ typedef union dat_event_data DAT_CONNECTION_EVENT_DATA connect_event_data; DAT_ASYNCH_ERROR_EVENT_DATA asynch_error_event_data; DAT_SOFTWARE_EVENT_DATA software_event_data; + DAT_EXTENSION_DATA extension_data; } DAT_EVENT_DATA; /* Event struct that holds all event information */ @@ -1222,6 +1227,11 @@ extern DAT_RETURN dat_srq_set_lw ( IN DAT_SRQ_HANDLE, /* srq_handle */ IN DAT_COUNT); /* low_watermark */ +extern DAT_RETURN dat_extension( + IN DAT_HANDLE, + IN DAT_EXT_OP, + IN ... ); + /* * DAT registry functions. * Index: dat/include/dat/udat_redirection.h =================================================================== --- dat/include/dat/udat_redirection.h (revision 4589) +++ dat/include/dat/udat_redirection.h (working copy) @@ -199,7 +199,6 @@ typedef DAT_RETURN (*DAT_EVD_SET_UNWAITA typedef DAT_RETURN (*DAT_EVD_CLEAR_UNWAITABLE_FUNC) ( IN DAT_EVD_HANDLE); /* evd_handle */ - #include struct dat_provider @@ -294,6 +293,10 @@ struct dat_provider DAT_SRQ_QUERY_FUNC srq_query_func; DAT_SRQ_RESIZE_FUNC srq_resize_func; DAT_SRQ_SET_LW_FUNC srq_set_lw_func; + + /* extension for provder specific functions */ + DAT_EXTENSION_FUNC extension_func; + }; #endif /* _UDAT_REDIRECTION_H_ */ Index: dat/include/dat/dat_extensions.h =================================================================== --- dat/include/dat/dat_extensions.h (revision 0) +++ dat/include/dat/dat_extensions.h (revision 0) @@ -0,0 +1,209 @@ +/* + * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * in the file LICENSE.txt in the root directory. The license is also + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is in the file + * LICENSE2.txt in the root directory. 
The license is also available from
+ * the Open Source Initiative, see
+ * http://www.opensource.org/licenses/bsd-license.php.
+ *
+ * 3) under the terms of the "GNU General Public License (GPL) Version 2" a
+ * copy of which is in the file LICENSE3.txt in the root directory. The
+ * license is also available from the Open Source Initiative, see
+ * http://www.opensource.org/licenses/gpl-license.php.
+ *
+ * Licensee has the right to choose one of the above licenses.
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice and one of the license notices.
+ *
+ * Redistributions in binary form must reproduce both the above copyright
+ * notice, one of the license notices in the documentation
+ * and/or other materials provided with the distribution.
+ */
+/**********************************************************************
+ *
+ * HEADER: dat_extensions.h
+ *
+ * PURPOSE: defines the extensions to the DAT API for uDAPL.
+ *
+ * Description: Header file for "uDAPL: User Direct Access Programming
+ *		Library, Version: 1.2"
+ *
+ * Mapping rules:
+ *      All global symbols are prepended with "DAT_" or "dat_"
+ *      All DAT objects have an 'api' tag, such as 'ep' or 'lmr'
+ *      The method table is in the provider definition structure.
+ *
+ *
+ **********************************************************************/
+
+#ifndef _DAT_EXTENSIONS_H_
+#define _DAT_EXTENSIONS_H_
+
+extern int dat_extensions;
+
+/*
+ * Provider specific attribute strings for extension support
+ * returned with dat_ia_query() and
+ * DAT_PROVIDER_ATTR_MASK == DAT_PROVIDER_FIELD_PROVIDER_SPECIFIC_ATTR
+ *
+ * DAT_NAMED_ATTR	name == extended operation,
+ *			value == TRUE if extended operation is supported
+ */
+#define DAT_EXT_ATTR			"DAT_EXTENSION_INTERFACE"
+#define DAT_EXT_ATTR_RDMA_WRITE_IMMED	"DAT_EXT_RDMA_WRITE_IMMED"
+#define DAT_EXT_ATTR_RECV_IMMED		"DAT_EXT_RECV_IMMED"
+#define DAT_EXT_ATTR_RECV_IMMED_EVENT	"DAT_EXT_RECV_IMMED_EVENT"
+#define DAT_EXT_ATTR_RECV_IMMED_PAYLOAD	"DAT_EXT_RECV_IMMED_PAYLOAD"
+#define DAT_EXT_ATTR_FETCH_AND_ADD	"DAT_EXT_FETCH_AND_ADD"
+#define DAT_EXT_ATTR_CMP_AND_SWAP	"DAT_EXT_CMP_AND_SWAP"
+#define DAT_EXT_ATTR_TRUE		"TRUE"
+#define DAT_EXT_ATTR_FALSE		"FALSE"
+
+/*
+ * Extension OPERATIONS
+ */
+typedef enum dat_ext_op
+{
+	DAT_EXT_RDMA_WRITE_IMMED,
+	DAT_EXT_RECV_IMMED,
+	DAT_EXT_FETCH_AND_ADD,
+	DAT_EXT_CMP_AND_SWAP,
+
+} DAT_EXT_OP;
+
+/*
+ * Extension completion event TYPES
+ */
+typedef enum dat_ext_event_type
+{
+	DAT_EXT_RDMA_WRITE_IMMED_STATUS = 1,
+	DAT_EXT_RECV_NO_IMMED,
+	DAT_EXT_RECV_IMMED_DATA_EVENT,
+	DAT_EXT_RECV_IMMED_DATA_PAYLOAD,
+	DAT_EXT_FETCH_AND_ADD_STATUS,
+	DAT_EXT_CMP_AND_SWAP_STATUS,
+
+} DAT_EXT_EVENT_TYPE;
+
+/*
+ * Extension completion event DATA
+ */
+typedef struct dat_immediate_data
+{
+	DAT_UINT32	data;
+
+} DAT_RDMA_WRITE_IMMED_DATA;
+
+typedef struct dat_extension_data
+{
+	DAT_DTO_COMPLETION_EVENT_DATA	dto;
+	DAT_EXT_EVENT_TYPE		type;
+	union {
+		DAT_RDMA_WRITE_IMMED_DATA	immed;
+	} val;
+} DAT_EXTENSION_DATA;
+
+typedef enum dat_ext_flags
+{
+	DAT_EXT_WRITE_IMMED_FLAG	= 0x1,
+	DAT_EXT_WRITE_CONFIRM_FLAG	= 0x2,
+
+} DAT_EXT_FLAGS;
+
+/*
+ * Extended API with redirection via DAT extension function
+ */
+
+/*
+ * RDMA Write with IMMEDIATE extension:
+ *
+ * Asynchronous call performs a normal RDMA write to the remote endpoint
+ * followed by a post of an extended immediate data value to the receive
+ * EVD on the remote endpoint. Event completion for the request completes
+ * as a DAT_EXTENSION_EVENT with type set to DAT_EXT_RDMA_WRITE_IMMED_STATUS.
+ * Event completion on the remote endpoint completes as a DAT_EXTENSION_EVENT
+ * with type set to DAT_EXT_RECV_IMMED_DATA_EVENT or
+ * DAT_EXT_RECV_IMMED_DATA_PAYLOAD depending on the provider transport.
+ *
+ * DAT_EXT_WRITE_IMMED_FLAG requests that the supplied
+ * 'immediate' value be sent as the payload of a four byte send following
+ * the RDMA Write, or any transport-dependent equivalent thereof.
+ * For example, on InfiniBand the request should be translated as an
+ * RDMA Write with Immediate.
+ *
+ * DAT_EXT_WRITE_CONFIRM_FLAG requests that this DTO
+ * not complete until receipt by the far end is confirmed.
+ *
+ * Note to Consumers: the immediate data will consume a receive
+ * buffer at the Data Sink.
+ *
+ * Other extension flags:
+ *	n/a
+ */
+#define dat_ep_post_rdma_write_immed(ep, size, lbuf, cookie, rbuf, idata, eflgs, flgs) \
+	dat_extension(	ep, \
+			DAT_EXT_RDMA_WRITE_IMMED, \
+			(size), \
+			(lbuf), \
+			(cookie), \
+			(rbuf), \
+			(idata), \
+			(eflgs), \
+			(flgs))
+
+/*
+ * Call performs a normal post receive message to the local endpoint
+ * that includes additional 32-bit buffer space for immediate data.
+ * Event completion for the request completes as a DAT_EXTENSION_EVENT
+ * with type set to DAT_EXT_RECV_IMMED_DATA_EVENT or
+ * DAT_EXT_RECV_IMMED_DATA_PAYLOAD.
+ */
+#define dat_ep_post_recv_immed(ep, size, lbuf, cookie, flgs) \
+	dat_extension(	ep, \
+			DAT_EXT_RECV_IMMED, \
+			(size), \
+			(lbuf), \
+			(cookie), \
+			(flgs))
+
+/*
+ * This asynchronous call is modeled after the InfiniBand atomic
+ * Fetch and Add operation. The add_value is added to the 64 bit
+ * value stored at the remote memory location specified in remote_iov
+ * and the result is stored in the local_iov.
+ */
+#define dat_ep_post_fetch_and_add(ep, add_val, lbuf, cookie, rbuf, flgs) \
+	dat_extension(	ep, \
+			DAT_EXT_FETCH_AND_ADD, \
+			(add_val), \
+			(lbuf), \
+			(cookie), \
+			(rbuf), \
+			(flgs))
+
+/*
+ * This asynchronous call is modeled after the InfiniBand atomic
+ * Compare and Swap operation. The cmp_value is compared to the 64 bit
+ * value stored at the remote memory location specified in remote_iov.
+ * If the two values are equal, the 64 bit swap_value is stored in
+ * the remote memory location. In all cases, the original 64 bit
+ * value stored in the remote memory location is copied to the local_iov.
+ */
+#define dat_ep_post_cmp_and_swap(ep, cmp_val, swap_val, lbuf, cookie, rbuf, flgs) \
+	dat_extension(	ep, \
+			DAT_EXT_CMP_AND_SWAP, \
+			(cmp_val), \
+			(swap_val), \
+			(lbuf), \
+			(cookie), \
+			(rbuf), \
+			(flgs))
+
+#endif /* _DAT_EXTENSIONS_H_ */
+
Index: dat/common/dat_api.c
===================================================================
--- dat/common/dat_api.c	(revision 4594)
+++ dat/common/dat_api.c	(working copy)
@@ -1142,6 +1142,36 @@ DAT_RETURN dat_srq_set_lw(
			  low_watermark);
 }
 
+DAT_RETURN dat_extension(
+	IN	DAT_HANDLE	handle,
+	IN	DAT_EXT_OP	ext_op,
+	IN	... )
+
+{
+	DAT_RETURN status;
+	va_list args;
+
+	if (handle == NULL)
+	{
+		return DAT_ERROR(DAT_INVALID_HANDLE, DAT_INVALID_HANDLE_EP);
+	}
+
+	/* verify provider extension support */
+	if (!dat_extensions)
+	{
+		return DAT_ERROR(DAT_NOT_IMPLEMENTED, 0);
+	}
+
+	va_start(args, ext_op);
+
+	status = DAT_EXTENSION(handle,
+			       ext_op,
+			       args);
+	va_end(args);
+
+	return status;
+}
+
 /*
  * Local variables:
  *  c-indent-level: 4
Index: dat/udat/udat.c
===================================================================
--- dat/udat/udat.c	(revision 4594)
+++ dat/udat/udat.c	(working copy)
@@ -66,6 +66,10 @@ udat_check_state ( void );
  *                                                                   *
  *********************************************************************/
 
+/*
+ * Use a global so the application gets an unresolved symbol when run
+ * with a pre-extension library
+ */
+int dat_extensions = 0;
 
 /*
  *
@@ -230,13 +234,44 @@ dat_ia_openv (
				       async_event_qlen,
				       async_event_handle,
				       ia_handle);
+
+	/*
+	 * See if the provider supports extensions
+	 */
	if (dat_status == DAT_SUCCESS)
	{
-	    return_handle = dats_set_ia_handle (*ia_handle);
-	    if (return_handle >= 0)
-	    {
-		*ia_handle = (DAT_IA_HANDLE)return_handle;
-	    }
+	    DAT_PROVIDER_ATTR	p_attr;
+	    int			i;
+
+	    return_handle = dats_set_ia_handle (*ia_handle);
+	    if (return_handle >= 0)
+	    {
+		*ia_handle = (DAT_IA_HANDLE)return_handle;
+	    }
+
+	    if ( dat_ia_query( *ia_handle,
+			       NULL,
+			       0,
+			       NULL,
+			       DAT_PROVIDER_FIELD_PROVIDER_SPECIFIC_ATTR,
+			       &p_attr ) == DAT_SUCCESS )
+	    {
+		for ( i = 0; i < p_attr.num_provider_specific_attr; i++ )
+		{
+		    if ( (strcmp( p_attr.provider_specific_attr[i].name,
+				  "DAT_EXTENSION_INTERFACE" ) == 0) &&
+			 (strcmp( p_attr.provider_specific_attr[i].value,
+				  "TRUE" ) == 0) )
+		    {
+			dat_os_dbg_print(DAT_OS_DBG_TYPE_CONSUMER_API,
+					 "DAT Registry: dat_ia_open () "
+					 "DAPL Extension Interface supported!\n");
+
+			dat_extensions = 1;
+			break;
+		    }
+		}
+	    }
	}
	return dat_status;
Index: README
===================================================================
--- README	(revision 4589)
+++ README	(working copy)
@@ -1,5 +1,10 @@
 There are now 3 uDAPL providers for openib (openib,openib_scm,openib_cma).
 
+NEW FEATURES for openib_cma provider:
+API extensions for immediate data and atomic operations have been added.
+See dat/include/dat/dat_extensions.h for the new APIs.
+See dapl/test/dtest/dtest_ext.c for an example test case.
+
 ==========
 1.0 BUILD:

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: DAT_Extensions.pdf
Type: application/pdf
Size: 83940 bytes
Desc: not available
URL: 

From devesh28 at gmail.com  Fri Dec 23 00:59:16 2005
From: devesh28 at gmail.com (Devesh Sharma)
Date: Fri, 23 Dec 2005 14:29:16 +0530
Subject: [openib-general] [Openib-SDP] Integrating in 2.6 kernel
Message-ID: <309a667c0512230059u6d345172m662b097aa1d8870@mail.gmail.com>

Hi all,
I am trying to use the SDP implementation available at
openib.org/svn/gen2/trunk/ and to test some programs on a 2.6 Linux
kernel. I would like help with the following topics:
1) Where can I get a tarball of the SDP code?
2) How do I integrate the SDP code with a 2.6 kernel?
Please help me out.

Devesh

From devesh28 at gmail.com  Fri Dec 23 04:25:35 2005
From: devesh28 at gmail.com (Devesh Sharma)
Date: Fri, 23 Dec 2005 17:55:35 +0530
Subject: [openib-general] Errors in compilation with 2.6.14.4!!
Message-ID: <309a667c0512230425o6869a2a2u8a7f0fd2ede720e8@mail.gmail.com>

Hi, I have downloaded the openib stack from svn at revision 4595 and am
trying to compile it against 2.6.14.4.
I am getting following errors [root at infini00 linux-2.6.14.4]# make CHK include/linux/version.h CHK include/linux/compile.h CHK usr/initramfs_list CC [M] drivers/infiniband/core/cm.o drivers/infiniband/core/cm.c: In function `cm_alloc_msg': drivers/infiniband/core/cm.c:180: error: `IB_MGMT_MAD_HDR' undeclared (first use in this function) drivers/infiniband/core/cm.c:180: error: (Each undeclared identifier is reported only once drivers/infiniband/core/cm.c:180: error: for each function it appears in.) drivers/infiniband/core/cm.c:181: error: too few arguments to function `ib_create_send_mad' drivers/infiniband/core/cm.c:188: error: structure has no member named `ah' drivers/infiniband/core/cm.c:189: error: structure has no member named `retries' drivers/infiniband/core/cm.c: In function `cm_alloc_response_msg': drivers/infiniband/core/cm.c:210: error: `IB_MGMT_MAD_HDR' undeclared (first use in this function) drivers/infiniband/core/cm.c:211: error: too few arguments to function `ib_create_send_mad' drivers/infiniband/core/cm.c:216: error: structure has no member named `ah' drivers/infiniband/core/cm.c: In function `cm_free_msg': drivers/infiniband/core/cm.c:223: error: structure has no member named `ah' drivers/infiniband/core/cm.c: In function `cm_mask_compare_data': drivers/infiniband/core/cm.c:363: error: `IB_CM_PRIVATE_DATA_COMPARE_SIZE' undeclared (first use in this function) drivers/infiniband/core/cm.c: In function `cm_compare_data': drivers/infiniband/core/cm.c:370: error: `IB_CM_PRIVATE_DATA_COMPARE_SIZE' undeclared (first use in this function) drivers/infiniband/core/cm.c:376: error: dereferencing pointer to incomplete type drivers/infiniband/core/cm.c:376: error: dereferencing pointer to incomplete type drivers/infiniband/core/cm.c:377: error: dereferencing pointer to incomplete type drivers/infiniband/core/cm.c:377: error: dereferencing pointer to incomplete type drivers/infiniband/core/cm.c:370: warning: unused variable `src' drivers/infiniband/core/cm.c:371: warning: unused variable `dst' drivers/infiniband/core/cm.c: In function `cm_compare_private_data': drivers/infiniband/core/cm.c:384: error: `IB_CM_PRIVATE_DATA_COMPARE_SIZE' undeclared (first use in this function) drivers/infiniband/core/cm.c:389: error: dereferencing pointer to incomplete type drivers/infiniband/core/cm.c:390: error: dereferencing pointer to incomplete type drivers/infiniband/core/cm.c:384: warning: unused variable `src' drivers/infiniband/core/cm.c: In function `cm_insert_listen': drivers/infiniband/core/cm.c:410: error: structure has no member named `device' drivers/infiniband/core/cm.c:410: error: structure has no member named `device' drivers/infiniband/core/cm.c:414: error: structure has no member named `device' drivers/infiniband/core/cm.c:414: error: structure has no member named `device' drivers/infiniband/core/cm.c:416: error: structure has no member named `device' drivers/infiniband/core/cm.c:416: error: structure has no member named `device' drivers/infiniband/core/cm.c: In function `cm_find_listen': drivers/infiniband/core/cm.c:446: error: structure has no member named `device' drivers/infiniband/core/cm.c:449: error: structure has no member named `device' drivers/infiniband/core/cm.c:451: error: structure has no member named `device' drivers/infiniband/core/cm.c: At top level: drivers/infiniband/core/cm.c:595: error: conflicting types for 'ib_create_cm_id' include/rdma/ib_cm.h:306: error: previous declaration of 'ib_create_cm_id' was here drivers/infiniband/core/cm.c:595: error: 
conflicting types for 'ib_create_cm_id' include/rdma/ib_cm.h:306: error: previous declaration of 'ib_create_cm_id' was here drivers/infiniband/core/cm.c: In function `ib_create_cm_id': drivers/infiniband/core/cm.c:604: error: structure has no member named `device' drivers/infiniband/core/cm.c: In function `ib_destroy_cm_id': drivers/infiniband/core/cm.c:731: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast drivers/infiniband/core/cm.c:739: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast drivers/infiniband/core/cm.c:749: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast drivers/infiniband/core/cm.c:764: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast drivers/infiniband/core/cm.c: At top level: drivers/infiniband/core/cm.c:790: error: conflicting types for 'ib_cm_listen' include/rdma/ib_cm.h:334: error: previous declaration of 'ib_cm_listen' was here drivers/infiniband/core/cm.c:790: error: conflicting types for 'ib_cm_listen' include/rdma/ib_cm.h:334: error: previous declaration of 'ib_cm_listen' was here drivers/infiniband/core/cm.c: In function `ib_cm_listen': drivers/infiniband/core/cm.c:807: error: dereferencing pointer to incomplete type drivers/infiniband/core/cm.c:811: error: dereferencing pointer to incomplete type drivers/infiniband/core/cm.c:812: error: dereferencing pointer to incomplete type drivers/infiniband/core/cm.c:812: error: dereferencing pointer to incomplete type drivers/infiniband/core/cm.c:813: error: `IB_CM_PRIVATE_DATA_COMPARE_SIZE' undeclared (first use in this function) drivers/infiniband/core/cm.c:813: error: dereferencing pointer to incomplete type drivers/infiniband/core/cm.c:813: error: dereferencing pointer to incomplete type drivers/infiniband/core/cm.c:813: error: dereferencing pointer to incomplete type drivers/infiniband/core/cm.c:813: error: dereferencing pointer to incomplete type drivers/infiniband/core/cm.c: In function `ib_send_cm_req': drivers/infiniband/core/cm.c:1003: error: structure has no member named `timeout_ms' drivers/infiniband/core/cm.c:1012: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type drivers/infiniband/core/cm.c:1012: error: too few arguments to function `ib_post_send_mad' drivers/infiniband/core/cm.c: In function `cm_issue_rej': drivers/infiniband/core/cm.c:1057: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type drivers/infiniband/core/cm.c:1057: error: too few arguments to function `ib_post_send_mad' drivers/infiniband/core/cm.c: In function `cm_dup_req_handler': drivers/infiniband/core/cm.c:1265: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type drivers/infiniband/core/cm.c:1265: error: too few arguments to function `ib_post_send_mad' drivers/infiniband/core/cm.c: In function `cm_match_req': drivers/infiniband/core/cm.c:1305: error: structure has no member named `device' drivers/infiniband/core/cm.c: In function `ib_send_cm_rep': drivers/infiniband/core/cm.c:1452: error: structure has no member named `timeout_ms' drivers/infiniband/core/cm.c:1455: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type drivers/infiniband/core/cm.c:1455: error: too few arguments to function `ib_post_send_mad' drivers/infiniband/core/cm.c: In function `ib_send_cm_rtu': drivers/infiniband/core/cm.c:1519: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type drivers/infiniband/core/cm.c:1519: 
error: too few arguments to function `ib_post_send_mad' drivers/infiniband/core/cm.c: In function `cm_dup_rep_handler': drivers/infiniband/core/cm.c:1591: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type drivers/infiniband/core/cm.c:1591: error: too few arguments to function `ib_post_send_mad' drivers/infiniband/core/cm.c: In function `cm_rep_handler': drivers/infiniband/core/cm.c:1659: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast drivers/infiniband/core/cm.c: In function `cm_establish_handler': drivers/infiniband/core/cm.c:1693: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast drivers/infiniband/core/cm.c: In function `cm_rtu_handler': drivers/infiniband/core/cm.c:1732: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast drivers/infiniband/core/cm.c: In function `ib_send_cm_dreq': drivers/infiniband/core/cm.c:1790: error: structure has no member named `timeout_ms' drivers/infiniband/core/cm.c:1793: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type drivers/infiniband/core/cm.c:1793: error: too few arguments to function `ib_post_send_mad' drivers/infiniband/core/cm.c: In function `ib_send_cm_drep': drivers/infiniband/core/cm.c:1856: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type drivers/infiniband/core/cm.c:1856: error: too few arguments to function `ib_post_send_mad' drivers/infiniband/core/cm.c: In function `cm_dreq_handler': drivers/infiniband/core/cm.c:1891: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast drivers/infiniband/core/cm.c:1905: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type drivers/infiniband/core/cm.c:1905: error: too few arguments to function `ib_post_send_mad' drivers/infiniband/core/cm.c: In function `cm_drep_handler': drivers/infiniband/core/cm.c:1952: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast drivers/infiniband/core/cm.c: In function `ib_send_cm_rej': drivers/infiniband/core/cm.c:2020: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type drivers/infiniband/core/cm.c:2020: error: too few arguments to function `ib_post_send_mad' drivers/infiniband/core/cm.c: In function `cm_rej_handler': drivers/infiniband/core/cm.c:2096: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast drivers/infiniband/core/cm.c:2106: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast drivers/infiniband/core/cm.c: In function `ib_send_cm_mra': drivers/infiniband/core/cm.c:2164: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type drivers/infiniband/core/cm.c:2164: error: too few arguments to function `ib_post_send_mad' drivers/infiniband/core/cm.c:2177: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type drivers/infiniband/core/cm.c:2177: error: too few arguments to function `ib_post_send_mad' drivers/infiniband/core/cm.c:2190: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type drivers/infiniband/core/cm.c:2190: error: too few arguments to function `ib_post_send_mad' drivers/infiniband/core/cm.c: In function `cm_mra_handler': drivers/infiniband/core/cm.c:2252: warning: passing arg 2 of `ib_modify_mad' makes integer from pointer without a cast drivers/infiniband/core/cm.c:2259: warning: passing arg 2 of `ib_modify_mad' makes integer from pointer without 
a cast drivers/infiniband/core/cm.c:2267: warning: passing arg 2 of `ib_modify_mad' makes integer from pointer without a cast drivers/infiniband/core/cm.c: In function `ib_send_cm_lap': drivers/infiniband/core/cm.c:2350: error: structure has no member named `timeout_ms' drivers/infiniband/core/cm.c:2353: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type drivers/infiniband/core/cm.c:2353: error: too few arguments to function `ib_post_send_mad' drivers/infiniband/core/cm.c: In function `cm_lap_handler': drivers/infiniband/core/cm.c:2430: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type drivers/infiniband/core/cm.c:2430: error: too few arguments to function `ib_post_send_mad' drivers/infiniband/core/cm.c: In function `ib_send_cm_apr': drivers/infiniband/core/cm.c:2508: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type drivers/infiniband/core/cm.c:2508: error: too few arguments to function `ib_post_send_mad' drivers/infiniband/core/cm.c: In function `cm_apr_handler': drivers/infiniband/core/cm.c:2547: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast drivers/infiniband/core/cm.c: In function `ib_send_cm_sidr_req': drivers/infiniband/core/cm.c:2644: error: structure has no member named `timeout_ms' drivers/infiniband/core/cm.c:2649: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type drivers/infiniband/core/cm.c:2649: error: too few arguments to function `ib_post_send_mad' drivers/infiniband/core/cm.c: In function `cm_sidr_req_handler': drivers/infiniband/core/cm.c:2713: error: structure has no member named `device' drivers/infiniband/core/cm.c: In function `ib_send_cm_sidr_rep': drivers/infiniband/core/cm.c:2785: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type drivers/infiniband/core/cm.c:2785: error: too few arguments to function `ib_post_send_mad' drivers/infiniband/core/cm.c: In function `cm_sidr_rep_handler': drivers/infiniband/core/cm.c:2838: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast drivers/infiniband/core/cm.c: In function `cm_send_handler': drivers/infiniband/core/cm.c:2906: error: structure has no member named `send_buf' make[3]: *** [drivers/infiniband/core/cm.o] Error 1 make[2]: *** [drivers/infiniband/core] Error 2 make[1]: *** [drivers/infiniband] Error 2 make: *** [drivers] Error 2 What is the issue?? From halr at voltaire.com Fri Dec 23 04:31:21 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 23 Dec 2005 07:31:21 -0500 Subject: [openib-general] Errors in compilation with 2.6.14.4!! In-Reply-To: <309a667c0512230425o6869a2a2u8a7f0fd2ede720e8@mail.gmail.com> References: <309a667c0512230425o6869a2a2u8a7f0fd2ede720e8@mail.gmail.com> Message-ID: <1135341078.4328.91384.camel@hal.voltaire.com> On Fri, 2005-12-23 at 07:25, Devesh Sharma wrote: > Hi I have downloaded openib stack from svn with Revision 4595 and > trying to compile it with 2.6.14.4. I am getting following errors > > > [root at infini00 linux-2.6.14.4]# make > CHK include/linux/version.h > CHK include/linux/compile.h > CHK usr/initramfs_list > CC [M] drivers/infiniband/core/cm.o > drivers/infiniband/core/cm.c: In function `cm_alloc_msg': > drivers/infiniband/core/cm.c:180: error: `IB_MGMT_MAD_HDR' undeclared > (first use in this function) > drivers/infiniband/core/cm.c:180: error: (Each undeclared identifier > is reported only once > drivers/infiniband/core/cm.c:180: error: for each function it appears in.) 
`ib_cancel_mad' makes integer from pointer without a cast
> [several hundred additional lines of compiler output snipped; they are
> identical to the log above]
> What is the issue??

You need to link include/rdma in your Linux tree to the OpenIB one,
after moving the Linux one away, as it is not up to date.

-- Hal

From jlentini at netapp.com  Fri Dec 23 07:35:36 2005
From: jlentini at netapp.com (James Lentini)
Date: Fri, 23 Dec 2005 10:35:36 -0500 (EST)
Subject: [openib-general] Re: [RFC][PATCH] OpenIB uDAPL extension proposal - sample immed data and atomic api's
In-Reply-To: 
References: 
Message-ID: 

arlin> DAPL provides a generalized abstraction to RDMA capable
arlin> transports. As a generalized abstraction, it cannot exploit the
arlin> unique properties that many of the underlying
arlin> platforms/interconnects can provide so I would like to propose
arlin> a simple (minimum impact on libdat) extensible interface to
arlin> uDAPL that will allow vendors to expose such capabilities. I am
arlin> looking for feedback, especially from the DAT collaborative. I
arlin> have included both a design document and actual working code as
arlin> a reference.

This is an excellent document and clearly certain applications will
benefit greatly from adding this additional functionality.
Since DAPL's inception, the DAT_PROVIDER structure has contained a field called "extension" of type void *. The purpose of this field was to allow for the kind of provider/platform/interconnect specific extensions you describe. I believe these features can be added without modifications to the current API by defining a particular format for the DAT_PROVIDER's extension data and indicating its presence via a provider attribute. That would require creating an extension document like this one describing an "extension" structure w/ function pointers to the new functions and a well known provider attribute value. Is there a reason this was not feasible? Would minor modifications to the existing framework be sufficient (perhaps an "extension" event type)? james From halr at voltaire.com Fri Dec 23 09:49:16 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 23 Dec 2005 12:49:16 -0500 Subject: [openib-general] [ANNOUNCE] Updated OpenIB diagnostics Message-ID: <1135360153.4328.94443.camel@hal.voltaire.com> Hi, The OpenIB diagnostics (https://openib.org/svn/gen2/trunk/src/userspace/management/diags) have been updated as follows: 1. ibportstate diagnostic tool added to query, disable, and enable switch ports 2. Added error only mode to diagnostic scripts so less data to weed through on a large fabric (also verbose mode to see everything) 3. Tree structure collapsed so all tools in same directory as opposed to individual ones and build simplified Let me know about any comments or issues. Thanks. -- Hal From halr at voltaire.com Fri Dec 23 10:44:17 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 23 Dec 2005 13:44:17 -0500 Subject: [openib-general] PathScale license Message-ID: <1135363454.4328.95007.camel@hal.voltaire.com> Hi, The PathScale OpenIB license includes the following which is beyond the normal OpenIB license: * Patent licenses, if any, provided herein do not apply to * combinations of this program with other software, or any other * product whatsoever. Can you comment/elaborate on this addition to the license ? Thanks. -- Hal From halr at voltaire.com Fri Dec 23 11:01:27 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 23 Dec 2005 14:01:27 -0500 Subject: [openib-general] RFC MPI and app. requirements of OpenIB In-Reply-To: <5D78D28F88822E4D8702BB9EEF1A4367D12BF3@mercury.infiniconsys.com> References: <5D78D28F88822E4D8702BB9EEF1A4367D12BF3@mercury.infiniconsys.com> Message-ID: <1135364466.4328.95174.camel@hal.voltaire.com> On Thu, 2005-12-22 at 16:54, Rimmer, Todd wrote: > Since the SA views multicast operations at a node level, and applications > need to individually operate on multicast groups. It is appropriate for > the core stack to provide some multicast management APIs. These would > multiplex requests from all the applications on a node to make the > appropriate requests to the SA. reference counts would need to be > maintained in the core stack for each MC group so that the node would > remove itself from the group only on last application exit/unregister. > > This would also be a good place to handle the "MC group persistence issues". > Namely rejoining requested groups when ports go up/down, SMs change > (client reregister), etc. Yes, this has been raised in a thread by Eitan started on 11/22 entitled "First Multicast Leave disconnects all other clients" (http://openib.org/pipermail/openib-general/2005-November/014023.html). 
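To make the bookkeeping concrete, the node level reference counting being
suggested could look something like the sketch below (illustrative only:
sa_join()/sa_leave() are placeholders for the real SA interactions, the
table is a fixed array for brevity, and locking and error recovery are
omitted):

#include <stdint.h>
#include <string.h>

/* placeholders for the real SA MAD join/leave exchange */
extern int sa_join(const uint8_t *mgid);
extern int sa_leave(const uint8_t *mgid);

/* hypothetical per-node bookkeeping, one entry per multicast GID */
struct mc_ref {
	uint8_t	mgid[16];	/* multicast GID */
	int	refcnt;		/* local consumers currently joined */
};

#define MAX_GROUPS 64
static struct mc_ref groups[MAX_GROUPS];

static struct mc_ref *find_ref(const uint8_t *mgid, int alloc)
{
	int i, free_slot = -1;

	for (i = 0; i < MAX_GROUPS; i++) {
		if (groups[i].refcnt && !memcmp(groups[i].mgid, mgid, 16))
			return &groups[i];
		if (!groups[i].refcnt && free_slot < 0)
			free_slot = i;
	}
	if (alloc && free_slot >= 0) {
		memcpy(groups[free_slot].mgid, mgid, 16);
		return &groups[free_slot];
	}
	return NULL;
}

/* only the first local consumer triggers a real SA join */
int mc_join(const uint8_t *mgid)
{
	struct mc_ref *g = find_ref(mgid, 1);

	if (!g)
		return -1;
	if (g->refcnt == 0 && sa_join(mgid))
		return -1;
	g->refcnt++;
	return 0;
}

/* only the last local consumer triggers a real SA leave */
void mc_leave(const uint8_t *mgid)
{
	struct mc_ref *g = find_ref(mgid, 0);

	if (g && --g->refcnt == 0)
		sa_leave(mgid);
}

On client reregister or a port up event, replaying sa_join() for every
entry with a nonzero refcnt would also cover the persistence case Todd
mentions.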
-- Hal

> Todd Rimmer
> Chief Systems Architect SilverStorm Technologies
> Voice: 610-233-4852 Fax: 610-233-4777
> TRimmer at SilverStorm.com www.SilverStorm.com
>
> > -----Original Message-----
> > From: amith rajith mamidala [mailto:mamidala at cse.ohio-state.edu]
> > Sent: Thursday, December 22, 2005 4:39 PM
> > To: Sean Hefty
> > Cc: openib
> > Subject: RE: [openib-general] RFC MPI and app. requirements of OpenIB
> >
> > > An alternative is to provide UD and multicast/broadcast
> > > support in the CMA. I know that the Intel MPI runs over DAPL,
> > > which does not provide multicast support. Can MPI operate with
> > > unreliable multicast support? Does MPI plan on using IB multicast?
> >
> > Yes, the MPI can operate with unreliable multicast support.
> > MVAPICH-0.9.6 has this broadcast support over IB multicast. As Hal
> > suggested earlier, application processes interact with SA to
> > create/join multicast groups.
> >
> > Thanks,
> > Amith

From Arkady.Kanevsky at netapp.com Fri Dec 23 11:54:39 2005
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Fri, 23 Dec 2005 14:54:39 -0500
Subject: [openib-general] RE: [RFC][PATCH] OpenIB uDAPL extension proposal - sample immed data and atomic api's
Message-ID:

Arlin, nice proposal, thanks.
I have one high level question and a few specific technical ones.

1. Why do you want to provide this functionality via extension instead of
as part of a new DAT spec, say 2.0? This will allow Consumers to use all
events, operations, and Provider/IA functionality uniformly instead of via
2 separate layers. This will also ensure that this basic functionality can
be provided by all DAPL Providers the same way on the DAPL and DAT layers.
DAPL 2.0 is not done yet so we have time to incorporate that. DAPL 2.0
already introduced new functionality which is easy to beef up for your
proposal. See DAT_DTOS for example. DAT_EVENT is also modified to handle
remote invalidation, so a small addition for Immediate data and Atomic ops
is a sensible addition. This should simplify the proposal significantly,
as you will not need to introduce any new EXT structures.

In general, the extension route was intended for RNIC|HCA providers to
expose HW capabilities beyond the IBTA, iWARP and VIA standards. The
standard RDMA functionality is best handled via a spec addition. DAT 2.0
does it for FMR, remote and local memory invalidation as well as others.
I had posted a complete list of changes/additions to DAT 2.0 about a month
ago. But we have not yet discussed the version change from 1.3 to 2.0, nor
how much backwards compatibility the spec will provide.

2. What is IMMED_EVENT? Is it just immediate data without any payload?
I suggest changing the name so it will not use "EVENT".
Just call it NO_PAYLOAD.
Do you want to support 2 different ways to deliver immediate data, one in
the event and one in the data payload? Why? I would think that just the
event way will do.

3. I suggest beefing up DAT_DTO_COMPLETION_EVENT_DATA and DAT_DTOS to
convey which operation completed and to return Immediate data if the
completed operation had immediate data. Since we already modified these 2
structs as part of the DAT 2.0 change, let's add your proposal to the
change. This will allow Consumers to use a single approach to deal with
completions, an extension to the current one but not a structural one.
No need for DAT_EXTENSION_DATA, DAT_EXT_EVENT_TYPE, DAT_EXT_OP nor the
whole mechanism for extended ops.

4. What is the purpose of DAT_EXT_WRITE_CONFIRM_FLAG? Is it to expose the
IB round-trip semantic? iWARP does not support immediate data. One can try
to format the payload to pass immediate data. Is that what you had in
mind? What is the semantic meaning of the completion with this flag set?
Without the flag set? Are the extended flags additional values for
COMPLETION_FLAGS? 2.4.1 talks about extended flags but where they are
passed in is not defined. DAT 2.0 extended them already for the FMR
barrier. I would prefer to follow that route rather than creating separate
extension completion flags.

5. Why do you need RECV_IMMED? If Immed data is delivered in the event, no
new Recv operation is needed. If the Consumer asks for immediate data in
the payload, where in the payload will it be? If this is needed as a local
match for a remote RDMA_Write to handle immediate data, let's state so.
What happens on a mismatch between the local and remote ops? That is, a
recv was posted for a Send and an RDMA_Write "arrived"? Vice versa?

6. I see an extension for immediate data for rdma_write but not for send.
Is this deliberate? If we are going to extend the DAT semantics to support
Immediate data we may as well support the full IBTA/iWARP functionality
for it.

7. Currently memory registration does not support access to an LMR or RMR
by Atomic ops. Do you propose to extend the meaning of the current
MEM_PRIV for LMR and RMR to cover atomic accesses, or to add new values to
LMR_MEM_PRIV and RMR_MEM_PRIV for atomic operation support?

8. Any alignment requirements for memory used for atomic ops?

9. Any correlation requirements for SRQ buffers to support recv with
immediate data?

Have great holidays,
Arkady

Arkady Kanevsky                       email: arkady at netapp.com
Network Appliance Inc.                phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.          Fax: 781-895-1195
Waltham, MA 02451                     central phone: 781-768-5300

________________________________

From: Arlin Davis [mailto:arlin.r.davis at intel.com]
Sent: Thursday, December 22, 2005 6:20 PM
To: Lentini, James; Kanevsky, Arkady
Cc: openib-general at openib.org; dat-discussions at yahoogroups.com
Subject: [RFC][PATCH] OpenIB uDAPL extension proposal - sample immed data and atomic api's

James and Arkady,

DAPL provides a generalized abstraction to RDMA capable transports. As a
generalized abstraction, it cannot exploit the unique properties that many
of the underlying platforms/interconnects can provide, so I would like to
propose a simple (minimum impact on libdat) extensible interface to uDAPL
that will allow vendors to expose such capabilities. I am looking for
feedback, especially from the DAT collaborative. I have included both a
design document and actual working code as a reference.
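To make the intended usage concrete, here is a minimal consumer-side
sketch built on the calls listed below (a sketch only, not part of the
patch: it assumes an already connected EP and its request EVD, a
registered 8-byte local buffer for the result, the standard uDAPL 1.2
DAT_RMR_TRIPLET spelling for the remote iov, and a made-up helper name;
all error handling beyond return codes is omitted):

#include <dat/udat.h>

/* hypothetical helper: add 'val' to a remote 64-bit counter and wait
 * for the completion; the pre-add value lands in the buffer that
 * 'lmr' describes */
static DAT_RETURN
fetch_add_sync(DAT_EP_HANDLE ep, DAT_EVD_HANDLE evd,
               DAT_LMR_TRIPLET *lmr, DAT_RMR_TRIPLET *rmr,
               DAT_UINT64 val)
{
        DAT_DTO_COOKIE cookie;
        DAT_EVENT      event;
        DAT_COUNT      nmore;
        DAT_RETURN     ret;

        cookie.as_64 = 0;
        ret = dat_ep_post_fetch_and_add(ep, val, lmr, cookie, rmr,
                                        DAT_COMPLETION_DEFAULT_FLAG);
        if (ret != DAT_SUCCESS)
                return ret;

        /* the completion surfaces as a DAT_EXTENSION_EVENT, not as a
         * plain DTO completion */
        ret = dat_evd_wait(evd, DAT_TIMEOUT_INFINITE, 1, &event, &nmore);
        if (ret != DAT_SUCCESS)
                return ret;

        if (event.event_number != DAT_EXTENSION_EVENT ||
            event.event_data.extension_data.type !=
            DAT_EXT_FETCH_AND_ADD_STATUS)
                return DAT_INTERNAL_ERROR;

        return DAT_SUCCESS;
}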
The patch provides a fully tested DAT and DAPL library (openib_cma) set
with the following provider extensions:

DAT_RETURN dat_ep_post_write_immed(
        IN      DAT_EP_HANDLE           ep_handle,
        IN      DAT_COUNT               num_segments,
        IN      DAT_LMR_TRIPLET         *local_iov,
        IN      DAT_DTO_COOKIE          user_cookie,
        IN      DAT_RMR_TRIPLE          *remote_iov,
        IN      DAT_UINT32              immediate_data,
        IN      DAT_COMPLETION_FLAGS    completion_flags);

DAT_RETURN dat_ep_post_cmp_and_swap(
        IN      DAT_EP_HANDLE           ep_handle,
        IN      DAT_UINT64              cmp_value,
        IN      DAT_UINT64              swap_value,
        IN      DAT_LMR_TRIPLE          *local_iov,
        IN      DAT_DTO_COOKIE          user_cookie,
        IN      DAT_RMR_TRIPLE          *remote_iov,
        IN      DAT_COMPLETION_FLAGS    completion_flags);

DAT_RETURN dat_ep_post_fetch_and_add(
        IN      DAT_EP_HANDLE           ep_handle,
        IN      DAT_UINT64              add_value,
        IN      DAT_LMR_TRIPLE          *local_iov,
        IN      DAT_DTO_COOKIE          user_cookie,
        IN      DAT_RMR_TRIPLE          *remote_iov,
        IN      DAT_COMPLETION_FLAGS    completion_flags);

Also, included is a sample program (dtest_ext.c) that can be used as a
programming example.

Thanks,

-arlin

Signed-off-by: Arlin Davis

Index: test/dtest/dat.conf
===================================================================
--- test/dtest/dat.conf (revision 4589)
+++ test/dtest/dat.conf (working copy)
@@ -1,11 +1,20 @@
 #
-# DAT 1.1 and 1.2 configuration file
+# DAT 1.2 configuration file
 #
 # Each entry should have the following fields:
 #
 # <ia_name> <api_version> <threadsafety> <default> <lib_path> \
 #           <provider_version> <ia_params> <platform_params>
 #
-# Example for openib using the first Mellanox adapter, port 1 and port 2
+# Example for openib_cma and openib_scm
+#
+# For scm version you specify <ia_params> as actual device name and port
+# For cma version you specify <ia_params> as:
+#    network address, network hostname, or netdev name and 0 for port
+#
+OpenIB-scm1 u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "mthca0 1" ""
+OpenIB-scm2 u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "mthca0 2" ""
+OpenIB-cma-ip u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "192.168.0.22 0" ""
+OpenIB-cma-name u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "svr1-ib0 0" ""
+OpenIB-cma-netdev u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "ib0 0" ""
-IB1 u1.2 nonthreadsafe default

Index: test/dtest/makefile
===================================================================
--- test/dtest/makefile (revision 4589)
+++ test/dtest/makefile (working copy)
@@ -4,13 +4,18 @@
 CFLAGS = -O2 -g
 DAT_INC = ../../dat/include
 DAT_LIB = /usr/local/lib

-all: dtest
+all: dtest dtest_ext

 clean:
-	rm -f *.o;touch *.c;rm -f dtest
+	rm -f *.o;touch *.c;rm -f dtest dtest_ext

 dtest: ./dtest.c
	$(CC) $(CFLAGS) ./dtest.c -o dtest \
	-DDAPL_PROVIDER='"OpenIB-cma-ip"' \
	-I $(DAT_INC) -L $(DAT_LIB) -ldat

+dtest_ext: ./dtest_ext.c
+	$(CC) $(CFLAGS) ./dtest_ext.c -o dtest_ext \
+	-DDAPL_PROVIDER='"OpenIB-cma-ip"' \
+	-I $(DAT_INC) -L $(DAT_LIB) -ldat
+

Index: test/dtest/README
===================================================================
--- test/dtest/README (revision 4589)
+++ test/dtest/README (working copy)
@@ -1,10 +1,11 @@
 simple dapl test just for initial openIB uDAPL testing...
dtest/dtest.c + dtest/dtest_ext.c dtest/makefile dtest/dat.conf -to build (default uDAPL name == IB1, ib device == mthca0, port == 1) +to build (default uDAPL name == OpenIB-cma-ip) edit makefile and change path (DAT_LIB) to appropriate libdat.so edit dat.conf and change path to appropriate libdapl.so cp dat.conf to /etc/dat.conf Index: dapl/include/dapl.h =================================================================== --- dapl/include/dapl.h (revision 4589) +++ dapl/include/dapl.h (working copy) @@ -1,25 +1,28 @@ /* - * Copyright (c) 2002-2003, Network Appliance, Inc. All rights reserved. + * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. * * This Software is licensed under one of the following licenses: - * + * * 1) under the terms of the "Common Public License 1.0" a copy of which is - * available from the Open Source Initiative, see + * in the file LICENSE.txt in the root directory. The license is also + * available from the Open Source Initiative, see * http://www.opensource.org/licenses/cpl.php. - * - * 2) under the terms of the "The BSD License" a copy of which is - * available from the Open Source Initiative, see + * + * 2) under the terms of the "The BSD License" a copy of which is in the file + * LICENSE2.txt in the root directory. The license is also available from + * the Open Source Initiative, see * http://www.opensource.org/licenses/bsd-license.php. - * + * * 3) under the terms of the "GNU General Public License (GPL) Version 2" a - * copy of which is available from the Open Source Initiative, see + * copy of which is in the file LICENSE3.txt in the root directory. The + * license is also available from the Open Source Initiative, see * http://www.opensource.org/licenses/gpl-license.php. - * + * * Licensee has the right to choose one of the above licenses. - * + * * Redistributions of source code must retain the above copyright * notice and one of the license notices. - * + * * Redistributions in binary form must reproduce both the above copyright * notice, one of the license notices in the documentation * and/or other materials provided with the distribution. 
@@ -61,6 +64,8 @@ #include "dapl_dummy_util.h" #elif OPENIB #include "dapl_ib_util.h" +#elif DET +#include "dapl_det_util.h" #endif /********************************************************************* @@ -213,6 +218,10 @@ typedef struct dapl_cookie DAPL_COOKIE; typedef struct dapl_dto_cookie DAPL_DTO_COOKIE; typedef struct dapl_rmr_cookie DAPL_RMR_COOKIE; +#ifdef DAPL_EXTENSIONS +typedef struct dapl_ext_cookie DAPL_EXT_COOKIE; +#endif + typedef struct dapl_private DAPL_PRIVATE; typedef void (*DAPL_CONNECTION_STATE_HANDLER) ( @@ -563,6 +572,13 @@ typedef enum dapl_dto_type DAPL_DTO_TYPE_RECV, DAPL_DTO_TYPE_RDMA_WRITE, DAPL_DTO_TYPE_RDMA_READ, +#ifdef DAPL_EXTENSIONS + DAPL_DTO_TYPE_RDMA_WRITE_IMMED, + DAPL_DTO_TYPE_RECV_IMMED, + DAPL_DTO_TYPE_CMP_AND_SWAP, + DAPL_DTO_TYPE_FETCH_AND_ADD, +#endif + } DAPL_DTO_TYPE; typedef enum dapl_cookie_type @@ -570,6 +586,9 @@ typedef enum dapl_cookie_type DAPL_COOKIE_TYPE_NULL, DAPL_COOKIE_TYPE_DTO, DAPL_COOKIE_TYPE_RMR, +#ifdef DAPL_EXTENSIONS + DAPL_COOKIE_TYPE_EXTENSION, +#endif } DAPL_COOKIE_TYPE; /* DAPL_DTO_COOKIE used as context for DTO WQEs */ @@ -587,6 +606,27 @@ struct dapl_rmr_cookie DAT_RMR_COOKIE cookie; }; +#ifdef DAPL_EXTENSIONS + +/* DAPL extended cookie types */ +typedef enum dapl_ext_type +{ + DAPL_EXT_TYPE_RDMA_WRITE_IMMED, + DAPL_EXT_TYPE_CMP_AND_SWAP, + DAPL_EXT_TYPE_FETCH_AND_ADD, + DAPL_EXT_TYPE_RECV +} DAPL_EXT_TYPE; + +/* DAPL extended cookie */ +struct dapl_ext_cookie +{ + DAPL_EXT_TYPE type; + DAT_DTO_COOKIE cookie; + DAT_COUNT size; /* used RDMA write with immed */ +}; + +#endif + /* DAPL_COOKIE used as context for WQEs */ struct dapl_cookie { @@ -597,6 +637,9 @@ struct dapl_cookie { DAPL_DTO_COOKIE dto; DAPL_RMR_COOKIE rmr; +#ifdef DAPL_EXTENSIONS + DAPL_EXT_COOKIE ext; +#endif } val; }; @@ -1116,6 +1159,15 @@ extern DAT_RETURN dapl_srq_set_lw( IN DAT_SRQ_HANDLE, /* srq_handle */ IN DAT_COUNT); /* low_watermark */ +#ifdef DAPL_EXTENSIONS + +extern DAT_RETURN dapl_extensions( + IN DAT_HANDLE, /* dat_handle */ + IN DAT_EXT_OP, /* extension operation */ + IN va_list ); /* va_list args */ + +#endif + /* * DAPL internal utility function prototpyes */ Index: dapl/udapl/Makefile =================================================================== --- dapl/udapl/Makefile (revision 4589) +++ dapl/udapl/Makefile (working copy) @@ -156,6 +156,7 @@ PROVIDER = $(TOPDIR)/../openib_cma CFLAGS += -DOPENIB CFLAGS += -DCQ_WAIT_OBJECT CFLAGS += -I/usr/local/include/infiniband +CFLAGS += -I/usr/local/include/rdma endif # @@ -168,6 +169,12 @@ endif # VN_MEM_SHARED_VIRTUAL_SUPPORT # CFLAGS += -DVN_MEM_SHARED_VIRTUAL_SUPPORT=1 +# If an implementation supports DAPL extensions +CFLAGS += -DDAPL_EXTENSIONS + +# If an implementation supports DAPL provider specific attributes +CFLAGS += -DDAPL_PROVIDER_SPECIFIC_ATTR + CFLAGS += -I. CFLAGS += -I.. CFLAGS += -I../../dat/include @@ -283,6 +290,8 @@ LDFLAGS += -libverbs -lrdmacm LDFLAGS += -rpath /usr/local/lib -L /usr/local/lib PROVIDER_SRCS = dapl_ib_util.c dapl_ib_cq.c dapl_ib_qp.c \ dapl_ib_cm.c dapl_ib_mem.c +# implementation supports DAPL extensions +PROVIDER_SRCS += dapl_ib_extensions.c endif UDAPL_SRCS = dapl_init.c \ Index: dapl/common/dapl_ia_query.c =================================================================== --- dapl/common/dapl_ia_query.c (revision 4589) +++ dapl/common/dapl_ia_query.c (working copy) @@ -167,6 +167,14 @@ dapl_ia_query ( #if !defined(__KDAPL__) provider_attr->pz_support = DAT_PZ_UNIQUE; #endif /* !KDAPL */ + + /* + * Have provider set their own. 
+ */ +#ifdef DAPL_PROVIDER_SPECIFIC_ATTR + dapls_set_provider_specific_attr(provider_attr); +#endif + /* * Set up evd_stream_merging_supported options. Note there is * one bit per allowable combination, using the ordinal Index: dapl/common/dapl_adapter_util.h =================================================================== --- dapl/common/dapl_adapter_util.h (revision 4589) +++ dapl/common/dapl_adapter_util.h (working copy) @@ -256,6 +256,21 @@ dapls_ib_wait_object_wait ( IN u_int32_t timeout); #endif +#ifdef DAPL_PROVIDER_SPECIFIC_ATTR +void +dapls_set_provider_specific_attr( + IN DAT_PROVIDER_ATTR *provider_attr ); +#endif + +#ifdef DAPL_EXTENSIONS +void +dapls_cqe_to_event_extension( + IN DAPL_EP *ep_ptr, + IN DAPL_COOKIE *cookie, + IN ib_work_completion_t *cqe_ptr, + OUT DAT_EVENT *event_ptr); +#endif + /* * Values for provider DAT_NAMED_ATTR */ @@ -272,6 +287,8 @@ dapls_ib_wait_object_wait ( #include "dapl_dummy_dto.h" #elif OPENIB #include "dapl_ib_dto.h" +#elif DET +#include "dapl_det_dto.h" #endif Index: dapl/common/dapl_provider.c =================================================================== --- dapl/common/dapl_provider.c (revision 4589) +++ dapl/common/dapl_provider.c (working copy) @@ -221,7 +221,11 @@ DAT_PROVIDER g_dapl_provider_template = &dapl_srq_post_recv, &dapl_srq_query, &dapl_srq_resize, - &dapl_srq_set_lw + &dapl_srq_set_lw, + +#ifdef DAPL_EXTENSIONS + &dapl_extensions +#endif }; #endif /* __KDAPL__ */ Index: dapl/common/dapl_evd_util.c =================================================================== --- dapl/common/dapl_evd_util.c (revision 4589) +++ dapl/common/dapl_evd_util.c (working copy) @@ -502,6 +502,20 @@ dapli_evd_eh_print_cqe ( #ifdef DAPL_DBG static char *optable[] = { +#ifdef OPENIB + /* different order for openib verbs */ + "OP_RDMA_WRITE", + "OP_RDMA_WRITE_IMM", + "OP_SEND", + "OP_SEND_IMM", + "OP_RDMA_READ", + "OP_COMP_AND_SWAP", + "OP_FETCH_AND_ADD", + "OP_RECEIVE", + "OP_RECEIVE_IMM", + "OP_BIND_MW", + "OP_INVALID", +#else "OP_SEND", "OP_RDMA_READ", "OP_RDMA_WRITE", @@ -509,6 +523,7 @@ dapli_evd_eh_print_cqe ( "OP_FETCH_AND_ADD", "OP_RECEIVE", "OP_BIND_MW", +#endif 0 }; @@ -1113,6 +1128,15 @@ dapli_evd_cqe_to_event ( dapls_cookie_dealloc (&ep_ptr->req_buffer, cookie); break; } + +#ifdef DAPL_EXTENSIONS + case DAPL_COOKIE_TYPE_EXTENSION: + { + dapls_cqe_to_event_extension(ep_ptr, cookie, cqe_ptr, event_ptr); + break; + } +#endif + default: { dapl_os_assert (!"Invalid Operation type"); Index: dapl/openib_cma/dapl_ib_dto.h =================================================================== --- dapl/openib_cma/dapl_ib_dto.h (revision 4589) +++ dapl/openib_cma/dapl_ib_dto.h (working copy) @@ -35,7 +35,7 @@ * * Description: * - * The uDAPL openib provider - DTO operations and CQE macros + * The OpenIB uCMA provider - DTO operations and CQE macros * ************************************************************************ **** * Source Control System Information @@ -119,7 +119,6 @@ dapls_ib_post_recv ( return DAT_SUCCESS; } - /* * dapls_ib_post_send * @@ -191,7 +190,7 @@ dapls_ib_post_send ( if (cookie != NULL) cookie->val.dto.size = total_len; - + if ((op_type == OP_RDMA_WRITE) || (op_type == OP_RDMA_READ)) { wr.wr.rdma.remote_addr = remote_iov->target_address; wr.wr.rdma.rkey = remote_iov->rmr_context; @@ -224,6 +223,152 @@ dapls_ib_post_send ( return DAT_SUCCESS; } +#ifdef DAPL_EXTENSIONS +/* + * dapls_ib_post_ext_send + * + * Provider specific extended Post SEND function + */ +STATIC _INLINE_ DAT_RETURN +dapls_ib_post_ext_send ( + IN DAPL_EP 
*ep_ptr, + IN ib_send_op_type_t op_type, + IN DAPL_COOKIE *cookie, + IN DAT_COUNT segments, + IN DAT_LMR_TRIPLET *local_iov, + IN const DAT_RMR_TRIPLET *remote_iov, + IN DAT_UINT32 idata, + IN DAT_UINT64 compare_add, + IN DAT_UINT64 swap, + IN DAT_COMPLETION_FLAGS completion_flags) +{ + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: ep %p op %d ck %p sgs", + "%d l_iov %p r_iov %p f %d\n", + ep_ptr, op_type, cookie, segments, local_iov, + remote_iov, completion_flags); + + ib_data_segment_t ds_array[DEFAULT_DS_ENTRIES]; + ib_data_segment_t *ds_array_p; + struct ibv_send_wr wr; + struct ibv_send_wr *bad_wr; + ib_hca_transport_t *ibt_ptr = + &ep_ptr->header.owner_ia->hca_ptr->ib_trans; + DAT_COUNT i, total_len; + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: ep %p cookie %p segs %d l_iov %p\n", + ep_ptr, cookie, segments, local_iov); + + if(segments <= DEFAULT_DS_ENTRIES) + ds_array_p = ds_array; + else + ds_array_p = + dapl_os_alloc(segments * sizeof(ib_data_segment_t)); + + if (NULL == ds_array_p) + return (DAT_INSUFFICIENT_RESOURCES); + + /* setup the work request */ + wr.next = 0; + wr.opcode = op_type; + wr.num_sge = 0; + wr.send_flags = 0; + wr.wr_id = (uint64_t)(uintptr_t)cookie; + wr.sg_list = ds_array_p; + total_len = 0; + + for (i = 0; i < segments; i++ ) { + if ( !local_iov[i].segment_length ) + continue; + + ds_array_p->addr = (uint64_t) local_iov[i].virtual_address; + ds_array_p->length = local_iov[i].segment_length; + ds_array_p->lkey = local_iov[i].lmr_context; + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: lkey 0x%x va %p len %d\n", + ds_array_p->lkey, ds_array_p->addr, + ds_array_p->length ); + + total_len += ds_array_p->length; + wr.num_sge++; + ds_array_p++; + } + + if (cookie != NULL) + cookie->val.dto.size = total_len; + + if ((op_type == OP_RDMA_WRITE) || + (op_type == OP_RDMA_WRITE_IMM) || + (op_type == OP_RDMA_READ)) { + wr.wr.rdma.remote_addr = remote_iov->target_address; + wr.wr.rdma.rkey = remote_iov->rmr_context; + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd_rdma: rkey 0x%x va %#016Lx\n", + wr.wr.rdma.rkey, wr.wr.rdma.remote_addr); + } + + switch (op_type) { + case OP_RDMA_WRITE_IMM: + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: OP_RDMA_WRITE_IMMED=0x%x\n", idata ); + wr.imm_data = idata; + break; + case OP_COMP_AND_SWAP: + /* OP_COMP_AND_SWAP has direct IBAL wr_type mapping */ + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: OP_COMP_AND_SWAP=%lx," + "%lx rkey 0x%x va %#016Lx\n", + compare_add, swap, remote_iov->rmr_context, + remote_iov->target_address); + + wr.wr.atomic.compare_add = compare_add; + wr.wr.atomic.swap = swap; + wr.wr.atomic.remote_addr = remote_iov->target_address; + wr.wr.atomic.rkey = remote_iov->rmr_context; + break; + case OP_FETCH_AND_ADD: + /* OP_FETCH_AND_ADD has direct IBAL wr_type mapping */ + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: OP_FETCH_AND_ADD=%lx," + "%lx rkey 0x%x va %#016Lx\n", + compare_add, remote_iov->rmr_context, + remote_iov->target_address); + + wr.wr.atomic.compare_add = compare_add; + wr.wr.atomic.remote_addr = remote_iov->target_address; + wr.wr.atomic.rkey = remote_iov->rmr_context; + break; + default: + break; + } + + /* inline data for send or write ops */ + if ((total_len <= ibt_ptr->max_inline_send) && + ((op_type == OP_SEND) || (op_type == OP_RDMA_WRITE))) + wr.send_flags |= IBV_SEND_INLINE; + + /* set completion flags in work request */ + wr.send_flags |= (DAT_COMPLETION_SUPPRESS_FLAG & + completion_flags) ? 0 : IBV_SEND_SIGNALED; + wr.send_flags |= (DAT_COMPLETION_BARRIER_FENCE_FLAG & + completion_flags) ? 
IBV_SEND_FENCE : 0; + wr.send_flags |= (DAT_COMPLETION_SOLICITED_WAIT_FLAG & + completion_flags) ? IBV_SEND_SOLICITED : 0; + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: op 0x%x flags 0x%x sglist %p, %d\n", + wr.opcode, wr.send_flags, wr.sg_list, wr.num_sge); + + if (ibv_post_send(ep_ptr->qp_handle->cm_id->qp, &wr, &bad_wr)) + return( dapl_convert_errno(EFAULT,"ibv_recv") ); + + dapl_dbg_log(DAPL_DBG_TYPE_EP," post_snd: returned\n"); + return DAT_SUCCESS; +} +#endif + STATIC _INLINE_ DAT_RETURN dapls_ib_optional_prv_dat( IN DAPL_CR *cr_ptr, Index: dapl/openib_cma/dapl_ib_util.c =================================================================== --- dapl/openib_cma/dapl_ib_util.c (revision 4589) +++ dapl/openib_cma/dapl_ib_util.c (working copy) @@ -35,7 +35,7 @@ * * Description: * - * The uDAPL openib provider - init, open, close, utilities, work thread + * The OpenIB uCMA provider - init, open, close, utilities, work thread * ************************************************************************ **** * Source Control System Information @@ -64,7 +64,6 @@ static const char rcsid[] = "$Id: $"; #include /* for struct ifreq */ #include /* for ARPHRD_INFINIBAND */ - int g_dapl_loopback_connection = 0; int g_ib_pipe[2]; ib_thread_state_t g_ib_thread_state = 0; @@ -727,7 +726,7 @@ void dapli_thread(void *arg) int ret,idx,fds; char rbuf[2]; - dapl_dbg_log (DAPL_DBG_TYPE_CM, + dapl_dbg_log (DAPL_DBG_TYPE_UTIL, " ib_thread(%d,0x%x): ENTER: pipe %d ucma %d\n", getpid(), g_ib_thread, g_ib_pipe[0], rdma_get_fd()); @@ -767,7 +766,7 @@ void dapli_thread(void *arg) ufds[idx].revents = 0; uhca[idx] = hca; - dapl_dbg_log(DAPL_DBG_TYPE_CM, + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread(%d) poll_fd: hca[%d]=%p, async=%d" " pipe=%d cm=%d cq=d\n", getpid(), hca, ufds[idx-1].fd, @@ -783,14 +782,14 @@ void dapli_thread(void *arg) dapl_os_unlock(&g_hca_lock); ret = poll(ufds, fds, -1); if (ret <= 0) { - dapl_dbg_log(DAPL_DBG_TYPE_WARN, + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread(%d): ERR %s poll\n", getpid(),strerror(errno)); dapl_os_lock(&g_hca_lock); continue; } - dapl_dbg_log(DAPL_DBG_TYPE_CM, + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread(%d) poll_event: " " async=0x%x pipe=0x%x cm=0x%x cq=0x%x\n", getpid(), ufds[idx-1].revents, ufds[0].revents, @@ -834,3 +833,63 @@ void dapli_thread(void *arg) dapl_os_unlock(&g_hca_lock); } +#ifdef DAPL_PROVIDER_SPECIFIC_ATTR +/* + * dapls_set_provider_specific_attr + * + * Input: + * attr_ptr Pointer provider attributes + * + * Output: + * none + * + * Returns: + * void + */ +DAT_NAMED_ATTR ib_attrs[] = { + +#ifdef DAPL_EXTENSIONS + { + DAT_EXT_ATTR, + DAT_EXT_ATTR_TRUE + }, + { + DAT_EXT_ATTR_RDMA_WRITE_IMMED, + DAT_EXT_ATTR_TRUE + }, + { + DAT_EXT_ATTR_RECV_IMMED, + DAT_EXT_ATTR_TRUE + }, + /* inbound immediate data placed in event, NOT payload */ + { + DAT_EXT_ATTR_RECV_IMMED_EVENT, + DAT_EXT_ATTR_TRUE + }, + { + DAT_EXT_ATTR_FETCH_AND_ADD, + DAT_EXT_ATTR_TRUE + }, + { + DAT_EXT_ATTR_CMP_AND_SWAP, + DAT_EXT_ATTR_TRUE + }, +#else + { + "DAT_EXTENSION_INTERFACE", + "FALSE" + }, +#endif +}; + +#define SPEC_ATTR_SIZE(x) ( sizeof(x)/sizeof(DAT_NAMED_ATTR) ) + +void dapls_set_provider_specific_attr( + IN DAT_PROVIDER_ATTR *attr_ptr ) +{ + attr_ptr->num_provider_specific_attr = SPEC_ATTR_SIZE(ib_attrs); + attr_ptr->provider_specific_attr = ib_attrs; +} + +#endif + Index: dapl/openib_cma/dapl_ib_mem.c =================================================================== --- dapl/openib_cma/dapl_ib_mem.c (revision 4589) +++ dapl/openib_cma/dapl_ib_mem.c (working copy) @@ -25,9 
+25,9 @@ /********************************************************************** * - * MODULE: dapl_det_mem.c + * MODULE: dapl_ib_mem.c * - * PURPOSE: Intel DET APIs: Memory windows, registration, + * PURPOSE: OpenIB uCMA provider Memory windows, registration, * and protection domain * * $Id: $ @@ -72,12 +72,10 @@ dapls_convert_privileges(IN DAT_MEM_PRIV access |= IBV_ACCESS_LOCAL_WRITE; if (DAT_MEM_PRIV_REMOTE_WRITE_FLAG & privileges) access |= IBV_ACCESS_REMOTE_WRITE; - if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) - access |= IBV_ACCESS_REMOTE_READ; - if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) - access |= IBV_ACCESS_REMOTE_READ; - if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) + if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) { access |= IBV_ACCESS_REMOTE_READ; + access |= IBV_ACCESS_REMOTE_ATOMIC; + } return access; } Index: dapl/openib_cma/dapl_ib_cm.c =================================================================== --- dapl/openib_cma/dapl_ib_cm.c (revision 4589) +++ dapl/openib_cma/dapl_ib_cm.c (working copy) @@ -35,7 +35,7 @@ * * Description: * - * The uDAPL openib provider - connection management + * The OpenIB uCMA provider - uCMA connection management * ************************************************************************ **** * Source Control System Information @@ -592,7 +592,11 @@ dapls_ib_setup_conn_listener(IN DAPL_IA if (rdma_bind_addr(conn->cm_id, (struct sockaddr *)&ia_ptr->hca_ptr->hca_address)) { - dat_status = dapl_convert_errno(errno,"setup_listener"); + if (errno == -EBUSY) + dat_status = DAT_CONN_QUAL_IN_USE; + else + dat_status = + dapl_convert_errno(errno,"setup_listener"); goto bail; } Index: dapl/openib_cma/dapl_ib_qp.c =================================================================== --- dapl/openib_cma/dapl_ib_qp.c (revision 4589) +++ dapl/openib_cma/dapl_ib_qp.c (working copy) @@ -25,9 +25,9 @@ /********************************************************************** * - * MODULE: dapl_det_qp.c + * MODULE: dapl_ib_qp.c * - * PURPOSE: QP routines for access to DET Verbs + * PURPOSE: OpenIB uCMA QP routines * * $Id: $ **********************************************************************/ Index: dapl/openib_cma/dapl_ib_util.h =================================================================== --- dapl/openib_cma/dapl_ib_util.h (revision 4589) +++ dapl/openib_cma/dapl_ib_util.h (working copy) @@ -35,7 +35,7 @@ * * Description: * - * The uDAPL openib provider - definitions, prototypes, + * The OpenIB uCMA provider - definitions, prototypes, * ************************************************************************ **** * Source Control System Information Index: dapl/openib_cma/README =================================================================== --- dapl/openib_cma/README (revision 4589) +++ dapl/openib_cma/README (working copy) @@ -23,15 +23,22 @@ New files for openib_scm provider dapl/openib_cma/dapl_ib_util.c dapl/openib_cma/dapl_ib_util.h dapl/openib_cma/dapl_ib_cm.c + dapl/openib_cma/dapl_ib_extensions.c A simple dapl test just for openib_scm testing... test/dtest/dtest.c + test/dtest/dtest_ext.c test/dtest/makefile server: dtest -s client: dtest -h hostname +or with extensions + + server: dtest_ext -s + client: dtest_ext -h hostname + known issues: no memory windows support in ibverbs, dat_create_rmr fails. 
Index: dapl/openib_cma/dapl_ib_cq.c =================================================================== --- dapl/openib_cma/dapl_ib_cq.c (revision 4589) +++ dapl/openib_cma/dapl_ib_cq.c (working copy) @@ -35,7 +35,7 @@ * * Description: * - * The uDAPL openib provider - completion queue + * The OpenIB uCMA provider - completion queue * ************************************************************************ **** * Source Control System Information @@ -498,7 +498,10 @@ dapls_ib_wait_object_wait(IN ib_wait_obj if (timeout != DAT_TIMEOUT_INFINITE) timeout_ms = timeout/1000; - status = poll(&cq_fd, 1, timeout_ms); + /* restart syscall */ + while ((status = poll(&cq_fd, 1, timeout_ms)) == -1 ) + if (errno == EINTR) + continue; /* returned event */ if (status > 0) { @@ -511,6 +514,8 @@ dapls_ib_wait_object_wait(IN ib_wait_obj /* timeout */ } else if (status == 0) status = ETIMEDOUT; + else + status = errno; dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " cq_object_wait: RET evd %p ibv_cq %p ibv_ctx %p %s\n", Index: dat/include/dat/dat_redirection.h =================================================================== --- dat/include/dat/dat_redirection.h (revision 4589) +++ dat/include/dat/dat_redirection.h (working copy) @@ -59,10 +59,10 @@ typedef struct dat_provider DAT_PROVIDER * This would allow a good compiler to avoid indirection overhead when * making function calls. */ - #define DAT_HANDLE_TO_PROVIDER(handle) (*(DAT_PROVIDER **)(handle)) #endif + #define DAT_IA_QUERY(ia, evd, ia_msk, ia_ptr, p_msk, p_ptr) \ (*DAT_HANDLE_TO_PROVIDER (ia)->ia_query_func) (\ (ia), \ @@ -395,6 +395,12 @@ typedef struct dat_provider DAT_PROVIDER (lbuf), \ (cookie)) +#define DAT_EXTENSION(handle, op, args) \ + (*DAT_HANDLE_TO_PROVIDER (handle)->extension_func) (\ + (handle), \ + (op), \ + (args)) + /*************************************************************** * * FUNCTION PROTOTYPES @@ -720,4 +726,11 @@ typedef DAT_RETURN (*DAT_SRQ_POST_RECV_F IN DAT_LMR_TRIPLET *, /* local_iov */ IN DAT_DTO_COOKIE ); /* user_cookie */ +/* Extension function */ +#include +typedef DAT_RETURN (*DAT_EXTENSION_FUNC) ( + IN DAT_HANDLE, /* dat handle */ + IN DAT_EXT_OP, /* extension operation */ + IN va_list ); /* va_list */ + #endif /* _DAT_REDIRECTION_H_ */ Index: dat/include/dat/dat.h =================================================================== --- dat/include/dat/dat.h (revision 4589) +++ dat/include/dat/dat.h (working copy) @@ -854,11 +854,15 @@ typedef enum dat_event_number DAT_ASYNC_ERROR_EP_BROKEN = 0x08003, DAT_ASYNC_ERROR_TIMED_OUT = 0x08004, DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR = 0x08005, - DAT_SOFTWARE_EVENT = 0x10001 + DAT_SOFTWARE_EVENT = 0x10001, + DAT_EXTENSION_EVENT = 0x20001 + } DAT_EVENT_NUMBER; -/* Union for event Data */ +/* include extension data definitions */ +#include +/* Union for event Data */ typedef union dat_event_data { DAT_DTO_COMPLETION_EVENT_DATA dto_completion_event_data; @@ -867,6 +871,7 @@ typedef union dat_event_data DAT_CONNECTION_EVENT_DATA connect_event_data; DAT_ASYNCH_ERROR_EVENT_DATA asynch_error_event_data; DAT_SOFTWARE_EVENT_DATA software_event_data; + DAT_EXTENSION_DATA extension_data; } DAT_EVENT_DATA; /* Event struct that holds all event information */ @@ -1222,6 +1227,11 @@ extern DAT_RETURN dat_srq_set_lw ( IN DAT_SRQ_HANDLE, /* srq_handle */ IN DAT_COUNT); /* low_watermark */ +extern DAT_RETURN dat_extension( + IN DAT_HANDLE, + IN DAT_EXT_OP, + IN ... ); + /* * DAT registry functions. 
* Index: dat/include/dat/udat_redirection.h =================================================================== --- dat/include/dat/udat_redirection.h (revision 4589) +++ dat/include/dat/udat_redirection.h (working copy) @@ -199,7 +199,6 @@ typedef DAT_RETURN (*DAT_EVD_SET_UNWAITA typedef DAT_RETURN (*DAT_EVD_CLEAR_UNWAITABLE_FUNC) ( IN DAT_EVD_HANDLE); /* evd_handle */ - #include struct dat_provider @@ -294,6 +293,10 @@ struct dat_provider DAT_SRQ_QUERY_FUNC srq_query_func; DAT_SRQ_RESIZE_FUNC srq_resize_func; DAT_SRQ_SET_LW_FUNC srq_set_lw_func; + + /* extension for provder specific functions */ + DAT_EXTENSION_FUNC extension_func; + }; #endif /* _UDAT_REDIRECTION_H_ */ Index: dat/include/dat/dat_extensions.h =================================================================== --- dat/include/dat/dat_extensions.h (revision 0) +++ dat/include/dat/dat_extensions.h (revision 0) @@ -0,0 +1,209 @@ +/* + * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * in the file LICENSE.txt in the root directory. The license is also + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is in the file + * LICENSE2.txt in the root directory. The license is also available from + * the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is in the file LICENSE3.txt in the root directory. The + * license is also available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + */ +/********************************************************************** + * + * HEADER: dat_extensions.h + * + * PURPOSE: defines the extensions to the DAT API for uDAPL. + * + * Description: Header file for "uDAPL: User Direct Access Programming + * Library, Version: 1.2" + * + * Mapping rules: + * All global symbols are prepended with "DAT_" or "dat_" + * All DAT objects have an 'api' tag which, such as 'ep' or 'lmr' + * The method table is in the provider definition structure. 
+ * + * + **********************************************************************/ + +#ifndef _DAT_EXTENSIONS_H_ + +extern int dat_extensions; + +/* + * Provider specific attribute strings for extension support + * returned with dat_ia_query() and + * DAT_PROVIDER_ATTR_MASK == DAT_PROVIDER_FIELD_PROVIDER_SPECIFIC_ATTR + * + * DAT_NAMED_ATTR name == extended operation, + * value == TRUE if extended operation is supported + */ +#define DAT_EXT_ATTR "DAT_EXTENSION_INTERFACE" +#define DAT_EXT_ATTR_RDMA_WRITE_IMMED "DAT_EXT_RDMA_WRITE_IMMED" +#define DAT_EXT_ATTR_RECV_IMMED "DAT_EXT_RECV_IMMED" +#define DAT_EXT_ATTR_RECV_IMMED_EVENT "DAT_EXT_RECV_IMMED_EVENT" +#define DAT_EXT_ATTR_RECV_IMMED_PAYLOAD "DAT_EXT_RECV_IMMED_PAYLOAD" +#define DAT_EXT_ATTR_FETCH_AND_ADD "DAT_EXT_FETCH_AND_ADD" +#define DAT_EXT_ATTR_CMP_AND_SWAP "DAT_EXT_CMP_AND_SWAP" +#define DAT_EXT_ATTR_TRUE "TRUE" +#define DAT_EXT_ATTR_FALSE "FALSE" + +/* + * Extension OPERATIONS + */ +typedef enum dat_ext_op +{ + DAT_EXT_RDMA_WRITE_IMMED, + DAT_EXT_RECV_IMMED, + DAT_EXT_FETCH_AND_ADD, + DAT_EXT_CMP_AND_SWAP, + +} DAT_EXT_OP; + +/* + * Extension completion event TYPES + */ +typedef enum dat_ext_event_type +{ + DAT_EXT_RDMA_WRITE_IMMED_STATUS = 1, + DAT_EXT_RECV_NO_IMMED, + DAT_EXT_RECV_IMMED_DATA_EVENT, + DAT_EXT_RECV_IMMED_DATA_PAYLOAD, + DAT_EXT_FETCH_AND_ADD_STATUS, + DAT_EXT_CMP_AND_SWAP_STATUS, + +} DAT_EXT_EVENT_TYPE; + +/* + * Extension completion event DATA + */ +typedef struct dat_immediate_data +{ + DAT_UINT32 data; + +} DAT_RDMA_WRITE_IMMED_DATA; + +typedef struct dat_extension_data +{ + DAT_DTO_COMPLETION_EVENT_DATA dto; + DAT_EXT_EVENT_TYPE type; + union { + DAT_RDMA_WRITE_IMMED_DATA immed; + } val; +} DAT_EXTENSION_DATA; + +typedef enum dat_ext_flags +{ + DAT_EXT_WRITE_IMMED_FLAG = 0x1, + DAT_EXT_WRITE_CONFIRM_FLAG = 0x2, + +} DAT_EXT_FLAGS; + +/* + * Extended API with redirection via DAT extension function + */ + +/* + * RDMA Write with IMMEDIATE extension: + * + * Asynchronous call performs a normal RDMA write to the remote endpoint + * followed by a post of an extended immediate data value to the receive + * EVD on the remote endpoint. Event completion for the request completes + * as an DAT_EXTENSION_EVENT with type set to DAT_EXT_RDMA_WRITE_IMMED_STATUS. + * Event completion on the remote endpoint completes as an DAT_EXTENSION_EVENT + * with type set to DAT_EXT_RECV_IMMED_DATA_IN_EVENT or + * DAT_EXT_RECV_IMMED_DATA_IN_PAYLOAD depending on the provider transport. + * + * DAT_EXT_WRITE_IMMED_FLAG requests that the supplied + *'immediate' value be sent as the payload of a four byte send following + * the RDMA Write, or any transport-dependent equivalent thereof. + * For example, on InfiniBand the request should be translated as an + * RDMA Write with Immediate. + * + * DAT_EXT_WRITE_CONFIRM_FLAG requests that this DTO + * not complete until receipt by the far end is confirmed. + * + * Note to Consumers: the immediate data will consume a receive + * buffer at the Data Sink. + * + * Other extension flags: + * n/a + */ +#define dat_ep_post_rdma_write_immed(ep, size, lbuf, cookie, rbuf, idata, eflgs, flgs) \ + dat_extension( ep, \ + DAT_EXT_RDMA_WRITE_IMMED, \ + (size), \ + (lbuf), \ + (cookie), \ + (rbuf), \ + (idata), \ + (eflgs), \ + (flgs)) + +/* + * Call performs a normal post receive message to the local endpoint + * that includes additional 32-bit buffer space for immediate data + * Event completion for the request completes as an + * DAT_EXTENSION_EVENT with type set to DAT_EXT_RDMA_WRITE_IMMED_STATUS. 
+ */ +#define dat_ep_post_recv_immed(ep, size, lbuf, cookie, flgs) \ + dat_extension( ep, \ + DAT_EXT_RECV_IMMED, \ + (size), \ + (lbuf), \ + (cookie), \ + (flgs)) + +/* + * This asynchronous call is modeled after the InfiniBand atomic + * Fetch and Add operation. The add_value is added to the 64 bit + * value stored at the remote memory location specified in remote_iov + * and the result is stored in the local_iov. + */ +#define dat_ep_post_fetch_and_add(ep, add_val, lbuf, cookie, rbuf, flgs) \ + dat_extension( ep, \ + DAT_EXT_FETCH_AND_ADD, \ + (add_val), \ + (lbuf), \ + (cookie), \ + (rbuf), \ + (flgs)) + +/* + * This asynchronous call is modeled after the InfiniBand atomic + * Compare and Swap operation. The cmp_value is compared to the 64 bit + * value stored at the remote memory location specified in remote_iov. + * If the two values are equal, the 64 bit swap_value is stored in + * the remote memory location. In all cases, the original 64 bit + * value stored in the remote memory location is copied to the local_iov. + */ +#define dat_ep_post_cmp_and_swap(ep, cmp_val, swap_val, lbuf, cookie, rbuf, flgs) \ + dat_extension( ep, \ + DAT_EXT_CMP_AND_SWAP, \ + (cmp_val), \ + (swap_val), \ + (lbuf), \ + (cookie), \ + (rbuf), \ + (flgs)) + +#endif /* _DAT_EXTENSIONS_H_ */ + Index: dat/common/dat_api.c =================================================================== --- dat/common/dat_api.c (revision 4594) +++ dat/common/dat_api.c (working copy) @@ -1142,6 +1142,36 @@ DAT_RETURN dat_srq_set_lw( low_watermark); } +DAT_RETURN dat_extension( + IN DAT_HANDLE handle, + IN DAT_EXT_OP ext_op, + IN ... ) + +{ + DAT_RETURN status; + va_list args; + + if (handle == NULL) + { + return DAT_ERROR(DAT_INVALID_HANDLE, DAT_INVALID_HANDLE_EP); + } + + /* verify provider extension support */ + if (!dat_extensions) + { + return DAT_ERROR(DAT_NOT_IMPLEMENTED, 0); + } + + va_start(args, ext_op); + + status = DAT_EXTENSION(handle, + ext_op, + args); + va_end(args); + + return status; +} + /* * Local variables: * c-indent-level: 4 Index: dat/udat/udat.c =================================================================== --- dat/udat/udat.c (revision 4594) +++ dat/udat/udat.c (working copy) @@ -66,6 +66,10 @@ udat_check_state ( void ); * * *********************************************************************/ +/* + * Use a global to get an unresolved when run with pre-extension library + */ +int dat_extensions = 0; /* * @@ -230,13 +234,44 @@ dat_ia_openv ( async_event_qlen, async_event_handle, ia_handle); + + /* + * See if provider supports extensions + */ if (dat_status == DAT_SUCCESS) { - return_handle = dats_set_ia_handle (*ia_handle); - if (return_handle >= 0) - { - *ia_handle = (DAT_IA_HANDLE)return_handle; - } + DAT_PROVIDER_ATTR p_attr; + int i; + + return_handle = dats_set_ia_handle (*ia_handle); + if (return_handle >= 0) + { + *ia_handle = (DAT_IA_HANDLE)return_handle; + } + + if ( dat_ia_query( *ia_handle, + NULL, + 0, + NULL, + DAT_PROVIDER_FIELD_PROVIDER_SPECIFIC_ATTR, + &p_attr ) == DAT_SUCCESS ) + { + for ( i = 0; i < p_attr.num_provider_specific_attr; i++ ) + { + if ( (strcmp( p_attr.provider_specific_attr[i].name, + "DAT_EXTENSION_INTERFACE" ) == 0) && + (strcmp( p_attr.provider_specific_attr[i].value, + "TRUE" ) == 0) ) + { + dat_os_dbg_print(DAT_OS_DBG_TYPE_CONSUMER_API, + "DAT Registry: dat_ia_open () " + "DAPL Extension Interface supported!\n"); + + dat_extensions = 1; + break; + } + } + } } return dat_status; Index: README =================================================================== 
--- README (revision 4589)
+++ README (working copy)
@@ -1,5 +1,10 @@
 There are now 3 uDAPL providers for openib (openib,openib_scm,openib_cma).

+NEW FEATURES for openib_cma provider:
+API extensions for immediate data and atomic operations have been added.
+see dat/include/dat/dat_extensions.h for new APIs.
+see dapl/test/dtest/dtest_ext.c for example test case
+
 ==========
 1.0 BUILD:

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ardavis at ichips.intel.com Fri Dec 23 12:02:10 2005
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Fri, 23 Dec 2005 12:02:10 -0800
Subject: [openib-general] Re: [RFC][PATCH] OpenIB uDAPL extension proposal - sample immed data and atomic api's
In-Reply-To:
References:
Message-ID: <43AC57C2.8050509@ichips.intel.com>

James Lentini wrote:

>arlin> DAPL provides a generalized abstraction to RDMA capable
>arlin> transports. As a generalized abstraction, it cannot exploit the
>arlin> unique properties that many of the underlying
>arlin> platforms/interconnects can provide so I would like to propose
>arlin> a simple (minimum impact on libdat) extensible interface to
>arlin> uDAPL that will allow vendors to expose such capabilities. I am
>arlin> looking for feedback, especially from the DAT collaborative. I
>arlin> have included both a design document and actual working code as
>arlin> a reference.
>
>This is an excellent document and clearly certain applications will
>benefit greatly from adding this additional functionality.
>
>Since DAPL's inception, the DAT_PROVIDER structure has contained a
>field called "extension" of type void *. The purpose of this field was
>to allow for the kind of provider/platform/interconnect specific
>extensions you describe.
>
>I believe these features can be added without modifications to the
>current API by defining a particular format for the DAT_PROVIDER's
>extension data and indicating its presence via a provider attribute.
>That would require creating an extension document like this one
>describing an "extension" structure w/ function pointers to the new
>functions and a well known provider attribute value.
>
>Is there a reason this was not feasible? Would minor modifications to
>the existing framework be sufficient (perhaps an "extension" event
>type)?
>

A single entry point is still there with this patch, I just defined it a
little differently, with a function definition for better DAT API
mappings. The idea was to replace the existing pvoid extension definition
with this new one. Can you give me an idea of how you would map these
extended DAT calls to this pvoid function definition?

What is your opinion on the way I extended event data, dapl event
processing, event types, and cookies?

-arlin

>james

From Arkady.Kanevsky at netapp.com Fri Dec 23 12:07:45 2005
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Fri, 23 Dec 2005 15:07:45 -0500
Subject: [openib-general] RE: [dat-discussions] RE: [RFC][PATCH] OpenIB uDAPL extension proposal - sample immed data and atomic api's
Message-ID:

Actually, iWARP supports neither immediate data nor atomic ops.
Arkady

Arkady Kanevsky                       email: arkady at netapp.com
Network Appliance Inc.                phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.
Fax: 781-895-1195
Waltham, MA 02451                     central phone: 781-768-5300

________________________________

From: Kanevsky, Arkady
Sent: Friday, December 23, 2005 2:55 PM
To: Arlin Davis; Lentini, James
Cc: openib-general at openib.org; dat-discussions at yahoogroups.com
Subject: [dat-discussions] RE: [RFC][PATCH] OpenIB uDAPL extension proposal - sample immed data and atomic api's

Arlin, nice proposal, thanks.
I have one high level question and a few specific technical ones.

1. Why do you want to provide this functionality via extension instead of
as part of a new DAT spec, say 2.0? This will allow Consumers to use all
events, operations, and Provider/IA functionality uniformly instead of via
2 separate layers. This will also ensure that this basic functionality can
be provided by all DAPL Providers the same way on the DAPL and DAT layers.
DAPL 2.0 is not done yet so we have time to incorporate that. DAPL 2.0
already introduced new functionality which is easy to beef up for your
proposal. See DAT_DTOS for example. DAT_EVENT is also modified to handle
remote invalidation, so a small addition for Immediate data and Atomic ops
is a sensible addition. This should simplify the proposal significantly,
as you will not need to introduce any new EXT structures.

In general, the extension route was intended for RNIC|HCA providers to
expose HW capabilities beyond the IBTA, iWARP and VIA standards. The
standard RDMA functionality is best handled via a spec addition. DAT 2.0
does it for FMR, remote and local memory invalidation as well as others.
I had posted a complete list of changes/additions to DAT 2.0 about a month
ago. But we have not yet discussed the version change from 1.3 to 2.0, nor
how much backwards compatibility the spec will provide.

2. What is IMMED_EVENT? Is it just immediate data without any payload?
I suggest changing the name so it will not use "EVENT". Just call it
NO_PAYLOAD. Do you want to support 2 different ways to deliver immediate
data, one in the event and one in the data payload? Why? I would think
that just the event way will do.

3. I suggest beefing up DAT_DTO_COMPLETION_EVENT_DATA and DAT_DTOS to
convey which operation completed and to return Immediate data if the
completed operation had immediate data. Since we already modified these 2
structs as part of the DAT 2.0 change, let's add your proposal to the
change. This will allow Consumers to use a single approach to deal with
completions, an extension to the current one but not a structural one.
No need for DAT_EXTENSION_DATA, DAT_EXT_EVENT_TYPE, DAT_EXT_OP nor the
whole mechanism for extended ops.

4. What is the purpose of DAT_EXT_WRITE_CONFIRM_FLAG? Is it to expose the
IB round-trip semantic? iWARP does not support immediate data. One can try
to format the payload to pass immediate data. Is that what you had in
mind? What is the semantic meaning of the completion with this flag set?
Without the flag set? Are the extended flags additional values for
COMPLETION_FLAGS? 2.4.1 talks about extended flags but where they are
passed in is not defined. DAT 2.0 extended them already for the FMR
barrier. I would prefer to follow that route rather than creating separate
extension completion flags.

5. Why do you need RECV_IMMED? If Immed data is delivered in the event, no
new Recv operation is needed. If the Consumer asks for immediate data in
the payload, where in the payload will it be? If this is needed as a local
match for a remote RDMA_Write to handle immediate data, let's state so.
What happens on a mismatch between the local and remote ops? That is, a
recv was posted for a Send and an RDMA_Write "arrived"? Vice versa?

6. I see an extension for immediate data for rdma_write but not for send.
Is this deliberate? If we are going to extend the DAT semantics to support
Immediate data we may as well support the full IBTA/iWARP functionality
for it.

7. Currently memory registration does not support access to an LMR or RMR
by Atomic ops. Do you propose to extend the meaning of the current
MEM_PRIV for LMR and RMR to cover atomic accesses, or to add new values to
LMR_MEM_PRIV and RMR_MEM_PRIV for atomic operation support?

8. Any alignment requirements for memory used for atomic ops?

9. Any correlation requirements for SRQ buffers to support recv with
immediate data?

Have great holidays,
Arkady

Arkady Kanevsky                       email: arkady at netapp.com
Network Appliance Inc.                phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.          Fax: 781-895-1195
Waltham, MA 02451                     central phone: 781-768-5300

________________________________

From: Arlin Davis [mailto:arlin.r.davis at intel.com]
Sent: Thursday, December 22, 2005 6:20 PM
To: Lentini, James; Kanevsky, Arkady
Cc: openib-general at openib.org; dat-discussions at yahoogroups.com
Subject: [RFC][PATCH] OpenIB uDAPL extension proposal - sample immed data and atomic api's

James and Arkady,

DAPL provides a generalized abstraction to RDMA capable transports. As a
generalized abstraction, it cannot exploit the unique properties that many
of the underlying platforms/interconnects can provide, so I would like to
propose a simple (minimum impact on libdat) extensible interface to uDAPL
that will allow vendors to expose such capabilities. I am looking for
feedback, especially from the DAT collaborative. I have included both a
design document and actual working code as a reference.

The patch provides a fully tested DAT and DAPL library (openib_cma) set
with the following provider extensions:

DAT_RETURN dat_ep_post_write_immed(
        IN      DAT_EP_HANDLE           ep_handle,
        IN      DAT_COUNT               num_segments,
        IN      DAT_LMR_TRIPLET         *local_iov,
        IN      DAT_DTO_COOKIE          user_cookie,
        IN      DAT_RMR_TRIPLE          *remote_iov,
        IN      DAT_UINT32              immediate_data,
        IN      DAT_COMPLETION_FLAGS    completion_flags);

DAT_RETURN dat_ep_post_cmp_and_swap(
        IN      DAT_EP_HANDLE           ep_handle,
        IN      DAT_UINT64              cmp_value,
        IN      DAT_UINT64              swap_value,
        IN      DAT_LMR_TRIPLE          *local_iov,
        IN      DAT_DTO_COOKIE          user_cookie,
        IN      DAT_RMR_TRIPLE          *remote_iov,
        IN      DAT_COMPLETION_FLAGS    completion_flags);

DAT_RETURN dat_ep_post_fetch_and_add(
        IN      DAT_EP_HANDLE           ep_handle,
        IN      DAT_UINT64              add_value,
        IN      DAT_LMR_TRIPLE          *local_iov,
        IN      DAT_DTO_COOKIE          user_cookie,
        IN      DAT_RMR_TRIPLE          *remote_iov,
        IN      DAT_COMPLETION_FLAGS    completion_flags);

Also, included is a sample program (dtest_ext.c) that can be used as a
programming example.
Thanks, -arlin Signed-off by: Arlin Davis Index: test/dtest/dat.conf =================================================================== --- test/dtest/dat.conf (revision 4589) +++ test/dtest/dat.conf (working copy) @@ -1,11 +1,20 @@ # -# DAT 1.1 and 1.2 configuration file +# DAT 1.2 configuration file # # Each entry should have the following fields: # # \ # # -# Example for openib using the first Mellanox adapter, port 1 and port 2 +# Example for openib_cma and openib_scm +# +# For scm version you specify as actual device name and port +# For cma version you specify as: +# network address, network hostname, or netdev name and 0 for port +# +OpenIB-scm1 u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "mthca0 1" "" +OpenIB-scm2 u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "mthca0 2" "" +OpenIB-cma-ip u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "192.168.0.22 0" "" +OpenIB-cma-name u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "svr1-ib0 0" "" +OpenIB-cma-netdev u1.2 nonthreadsafe default /usr/local/openib_dapl/udapl/Target/libdapl.so mv_dapl.1.2 "ib0 0" "" -IB1 u1.2 nonthreadsafe default Index: test/dtest/makefile =================================================================== --- test/dtest/makefile (revision 4589) +++ test/dtest/makefile (working copy) @@ -4,13 +4,18 @@ CFLAGS = -O2 -g DAT_INC = ../../dat/include DAT_LIB = /usr/local/lib -all: dtest +all: dtest dtest_ext clean: - rm -f *.o;touch *.c;rm -f dtest + rm -f *.o;touch *.c;rm -f dtest dtest_ext dtest: ./dtest.c $(CC) $(CFLAGS) ./dtest.c -o dtest \ -DDAPL_PROVIDER='"OpenIB-cma-ip"' \ -I $(DAT_INC) -L $(DAT_LIB) -ldat +dtest_ext: ./dtest_ext.c + $(CC) $(CFLAGS) ./dtest_ext.c -o dtest_ext \ + -DDAPL_PROVIDER='"OpenIB-cma-ip"' \ + -I $(DAT_INC) -L $(DAT_LIB) -ldat + Index: test/dtest/README =================================================================== --- test/dtest/README (revision 4589) +++ test/dtest/README (working copy) @@ -1,10 +1,11 @@ simple dapl test just for initial openIB uDAPL testing... dtest/dtest.c + dtest/dtest_ext.c dtest/makefile dtest/dat.conf -to build (default uDAPL name == IB1, ib device == mthca0, port == 1) +to build (default uDAPL name == OpenIB-cma-ip) edit makefile and change path (DAT_LIB) to appropriate libdat.so edit dat.conf and change path to appropriate libdapl.so cp dat.conf to /etc/dat.conf Index: dapl/include/dapl.h =================================================================== --- dapl/include/dapl.h (revision 4589) +++ dapl/include/dapl.h (working copy) @@ -1,25 +1,28 @@ /* - * Copyright (c) 2002-2003, Network Appliance, Inc. All rights reserved. + * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. * * This Software is licensed under one of the following licenses: - * + * * 1) under the terms of the "Common Public License 1.0" a copy of which is - * available from the Open Source Initiative, see + * in the file LICENSE.txt in the root directory. The license is also + * available from the Open Source Initiative, see * http://www.opensource.org/licenses/cpl.php. - * - * 2) under the terms of the "The BSD License" a copy of which is - * available from the Open Source Initiative, see + * + * 2) under the terms of the "The BSD License" a copy of which is in the file + * LICENSE2.txt in the root directory. 
The license is also available from + * the Open Source Initiative, see * http://www.opensource.org/licenses/bsd-license.php. - * + * * 3) under the terms of the "GNU General Public License (GPL) Version 2" a - * copy of which is available from the Open Source Initiative, see + * copy of which is in the file LICENSE3.txt in the root directory. The + * license is also available from the Open Source Initiative, see * http://www.opensource.org/licenses/gpl-license.php. - * + * * Licensee has the right to choose one of the above licenses. - * + * * Redistributions of source code must retain the above copyright * notice and one of the license notices. - * + * * Redistributions in binary form must reproduce both the above copyright * notice, one of the license notices in the documentation * and/or other materials provided with the distribution. @@ -61,6 +64,8 @@ #include "dapl_dummy_util.h" #elif OPENIB #include "dapl_ib_util.h" +#elif DET +#include "dapl_det_util.h" #endif /********************************************************************* @@ -213,6 +218,10 @@ typedef struct dapl_cookie DAPL_COOKIE; typedef struct dapl_dto_cookie DAPL_DTO_COOKIE; typedef struct dapl_rmr_cookie DAPL_RMR_COOKIE; +#ifdef DAPL_EXTENSIONS +typedef struct dapl_ext_cookie DAPL_EXT_COOKIE; +#endif + typedef struct dapl_private DAPL_PRIVATE; typedef void (*DAPL_CONNECTION_STATE_HANDLER) ( @@ -563,6 +572,13 @@ typedef enum dapl_dto_type DAPL_DTO_TYPE_RECV, DAPL_DTO_TYPE_RDMA_WRITE, DAPL_DTO_TYPE_RDMA_READ, +#ifdef DAPL_EXTENSIONS + DAPL_DTO_TYPE_RDMA_WRITE_IMMED, + DAPL_DTO_TYPE_RECV_IMMED, + DAPL_DTO_TYPE_CMP_AND_SWAP, + DAPL_DTO_TYPE_FETCH_AND_ADD, +#endif + } DAPL_DTO_TYPE; typedef enum dapl_cookie_type @@ -570,6 +586,9 @@ typedef enum dapl_cookie_type DAPL_COOKIE_TYPE_NULL, DAPL_COOKIE_TYPE_DTO, DAPL_COOKIE_TYPE_RMR, +#ifdef DAPL_EXTENSIONS + DAPL_COOKIE_TYPE_EXTENSION, +#endif } DAPL_COOKIE_TYPE; /* DAPL_DTO_COOKIE used as context for DTO WQEs */ @@ -587,6 +606,27 @@ struct dapl_rmr_cookie DAT_RMR_COOKIE cookie; }; +#ifdef DAPL_EXTENSIONS + +/* DAPL extended cookie types */ +typedef enum dapl_ext_type +{ + DAPL_EXT_TYPE_RDMA_WRITE_IMMED, + DAPL_EXT_TYPE_CMP_AND_SWAP, + DAPL_EXT_TYPE_FETCH_AND_ADD, + DAPL_EXT_TYPE_RECV +} DAPL_EXT_TYPE; + +/* DAPL extended cookie */ +struct dapl_ext_cookie +{ + DAPL_EXT_TYPE type; + DAT_DTO_COOKIE cookie; + DAT_COUNT size; /* used RDMA write with immed */ +}; + +#endif + /* DAPL_COOKIE used as context for WQEs */ struct dapl_cookie { @@ -597,6 +637,9 @@ struct dapl_cookie { DAPL_DTO_COOKIE dto; DAPL_RMR_COOKIE rmr; +#ifdef DAPL_EXTENSIONS + DAPL_EXT_COOKIE ext; +#endif } val; }; @@ -1116,6 +1159,15 @@ extern DAT_RETURN dapl_srq_set_lw( IN DAT_SRQ_HANDLE, /* srq_handle */ IN DAT_COUNT); /* low_watermark */ +#ifdef DAPL_EXTENSIONS + +extern DAT_RETURN dapl_extensions( + IN DAT_HANDLE, /* dat_handle */ + IN DAT_EXT_OP, /* extension operation */ + IN va_list ); /* va_list args */ + +#endif + /* * DAPL internal utility function prototpyes */ Index: dapl/udapl/Makefile =================================================================== --- dapl/udapl/Makefile (revision 4589) +++ dapl/udapl/Makefile (working copy) @@ -156,6 +156,7 @@ PROVIDER = $(TOPDIR)/../openib_cma CFLAGS += -DOPENIB CFLAGS += -DCQ_WAIT_OBJECT CFLAGS += -I/usr/local/include/infiniband +CFLAGS += -I/usr/local/include/rdma endif # @@ -168,6 +169,12 @@ endif # VN_MEM_SHARED_VIRTUAL_SUPPORT # CFLAGS += -DVN_MEM_SHARED_VIRTUAL_SUPPORT=1 +# If an implementation supports DAPL extensions +CFLAGS += -DDAPL_EXTENSIONS + +# If an 
implementation supports DAPL provider specific attributes +CFLAGS += -DDAPL_PROVIDER_SPECIFIC_ATTR + CFLAGS += -I. CFLAGS += -I.. CFLAGS += -I../../dat/include @@ -283,6 +290,8 @@ LDFLAGS += -libverbs -lrdmacm LDFLAGS += -rpath /usr/local/lib -L /usr/local/lib PROVIDER_SRCS = dapl_ib_util.c dapl_ib_cq.c dapl_ib_qp.c \ dapl_ib_cm.c dapl_ib_mem.c +# implementation supports DAPL extensions +PROVIDER_SRCS += dapl_ib_extensions.c endif UDAPL_SRCS = dapl_init.c \ Index: dapl/common/dapl_ia_query.c =================================================================== --- dapl/common/dapl_ia_query.c (revision 4589) +++ dapl/common/dapl_ia_query.c (working copy) @@ -167,6 +167,14 @@ dapl_ia_query ( #if !defined(__KDAPL__) provider_attr->pz_support = DAT_PZ_UNIQUE; #endif /* !KDAPL */ + + /* + * Have provider set their own. + */ +#ifdef DAPL_PROVIDER_SPECIFIC_ATTR + dapls_set_provider_specific_attr(provider_attr); +#endif + /* * Set up evd_stream_merging_supported options. Note there is * one bit per allowable combination, using the ordinal Index: dapl/common/dapl_adapter_util.h =================================================================== --- dapl/common/dapl_adapter_util.h (revision 4589) +++ dapl/common/dapl_adapter_util.h (working copy) @@ -256,6 +256,21 @@ dapls_ib_wait_object_wait ( IN u_int32_t timeout); #endif +#ifdef DAPL_PROVIDER_SPECIFIC_ATTR +void +dapls_set_provider_specific_attr( + IN DAT_PROVIDER_ATTR *provider_attr ); +#endif + +#ifdef DAPL_EXTENSIONS +void +dapls_cqe_to_event_extension( + IN DAPL_EP *ep_ptr, + IN DAPL_COOKIE *cookie, + IN ib_work_completion_t *cqe_ptr, + OUT DAT_EVENT *event_ptr); +#endif + /* * Values for provider DAT_NAMED_ATTR */ @@ -272,6 +287,8 @@ dapls_ib_wait_object_wait ( #include "dapl_dummy_dto.h" #elif OPENIB #include "dapl_ib_dto.h" +#elif DET +#include "dapl_det_dto.h" #endif Index: dapl/common/dapl_provider.c =================================================================== --- dapl/common/dapl_provider.c (revision 4589) +++ dapl/common/dapl_provider.c (working copy) @@ -221,7 +221,11 @@ DAT_PROVIDER g_dapl_provider_template = &dapl_srq_post_recv, &dapl_srq_query, &dapl_srq_resize, - &dapl_srq_set_lw + &dapl_srq_set_lw, + +#ifdef DAPL_EXTENSIONS + &dapl_extensions +#endif }; #endif /* __KDAPL__ */ Index: dapl/common/dapl_evd_util.c =================================================================== --- dapl/common/dapl_evd_util.c (revision 4589) +++ dapl/common/dapl_evd_util.c (working copy) @@ -502,6 +502,20 @@ dapli_evd_eh_print_cqe ( #ifdef DAPL_DBG static char *optable[] = { +#ifdef OPENIB + /* different order for openib verbs */ + "OP_RDMA_WRITE", + "OP_RDMA_WRITE_IMM", + "OP_SEND", + "OP_SEND_IMM", + "OP_RDMA_READ", + "OP_COMP_AND_SWAP", + "OP_FETCH_AND_ADD", + "OP_RECEIVE", + "OP_RECEIVE_IMM", + "OP_BIND_MW", + "OP_INVALID", +#else "OP_SEND", "OP_RDMA_READ", "OP_RDMA_WRITE", @@ -509,6 +523,7 @@ dapli_evd_eh_print_cqe ( "OP_FETCH_AND_ADD", "OP_RECEIVE", "OP_BIND_MW", +#endif 0 }; @@ -1113,6 +1128,15 @@ dapli_evd_cqe_to_event ( dapls_cookie_dealloc (&ep_ptr->req_buffer, cookie); break; } + +#ifdef DAPL_EXTENSIONS + case DAPL_COOKIE_TYPE_EXTENSION: + { + dapls_cqe_to_event_extension(ep_ptr, cookie, cqe_ptr, event_ptr); + break; + } +#endif + default: { dapl_os_assert (!"Invalid Operation type"); Index: dapl/openib_cma/dapl_ib_dto.h =================================================================== --- dapl/openib_cma/dapl_ib_dto.h (revision 4589) +++ dapl/openib_cma/dapl_ib_dto.h (working copy) @@ -35,7 +35,7 @@ * * Description: * - * The 
uDAPL openib provider - DTO operations and CQE macros + * The OpenIB uCMA provider - DTO operations and CQE macros * ************************************************************************ **** * Source Control System Information @@ -119,7 +119,6 @@ dapls_ib_post_recv ( return DAT_SUCCESS; } - /* * dapls_ib_post_send * @@ -191,7 +190,7 @@ dapls_ib_post_send ( if (cookie != NULL) cookie->val.dto.size = total_len; - + if ((op_type == OP_RDMA_WRITE) || (op_type == OP_RDMA_READ)) { wr.wr.rdma.remote_addr = remote_iov->target_address; wr.wr.rdma.rkey = remote_iov->rmr_context; @@ -224,6 +223,152 @@ dapls_ib_post_send ( return DAT_SUCCESS; } +#ifdef DAPL_EXTENSIONS +/* + * dapls_ib_post_ext_send + * + * Provider specific extended Post SEND function + */ +STATIC _INLINE_ DAT_RETURN +dapls_ib_post_ext_send ( + IN DAPL_EP *ep_ptr, + IN ib_send_op_type_t op_type, + IN DAPL_COOKIE *cookie, + IN DAT_COUNT segments, + IN DAT_LMR_TRIPLET *local_iov, + IN const DAT_RMR_TRIPLET *remote_iov, + IN DAT_UINT32 idata, + IN DAT_UINT64 compare_add, + IN DAT_UINT64 swap, + IN DAT_COMPLETION_FLAGS completion_flags) +{ + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: ep %p op %d ck %p sgs", + "%d l_iov %p r_iov %p f %d\n", + ep_ptr, op_type, cookie, segments, local_iov, + remote_iov, completion_flags); + + ib_data_segment_t ds_array[DEFAULT_DS_ENTRIES]; + ib_data_segment_t *ds_array_p; + struct ibv_send_wr wr; + struct ibv_send_wr *bad_wr; + ib_hca_transport_t *ibt_ptr = + &ep_ptr->header.owner_ia->hca_ptr->ib_trans; + DAT_COUNT i, total_len; + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: ep %p cookie %p segs %d l_iov %p\n", + ep_ptr, cookie, segments, local_iov); + + if(segments <= DEFAULT_DS_ENTRIES) + ds_array_p = ds_array; + else + ds_array_p = + dapl_os_alloc(segments * sizeof(ib_data_segment_t)); + + if (NULL == ds_array_p) + return (DAT_INSUFFICIENT_RESOURCES); + + /* setup the work request */ + wr.next = 0; + wr.opcode = op_type; + wr.num_sge = 0; + wr.send_flags = 0; + wr.wr_id = (uint64_t)(uintptr_t)cookie; + wr.sg_list = ds_array_p; + total_len = 0; + + for (i = 0; i < segments; i++ ) { + if ( !local_iov[i].segment_length ) + continue; + + ds_array_p->addr = (uint64_t) local_iov[i].virtual_address; + ds_array_p->length = local_iov[i].segment_length; + ds_array_p->lkey = local_iov[i].lmr_context; + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: lkey 0x%x va %p len %d\n", + ds_array_p->lkey, ds_array_p->addr, + ds_array_p->length ); + + total_len += ds_array_p->length; + wr.num_sge++; + ds_array_p++; + } + + if (cookie != NULL) + cookie->val.dto.size = total_len; + + if ((op_type == OP_RDMA_WRITE) || + (op_type == OP_RDMA_WRITE_IMM) || + (op_type == OP_RDMA_READ)) { + wr.wr.rdma.remote_addr = remote_iov->target_address; + wr.wr.rdma.rkey = remote_iov->rmr_context; + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd_rdma: rkey 0x%x va %#016Lx\n", + wr.wr.rdma.rkey, wr.wr.rdma.remote_addr); + } + + switch (op_type) { + case OP_RDMA_WRITE_IMM: + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: OP_RDMA_WRITE_IMMED=0x%x\n", idata ); + wr.imm_data = idata; + break; + case OP_COMP_AND_SWAP: + /* OP_COMP_AND_SWAP has direct IBAL wr_type mapping */ + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: OP_COMP_AND_SWAP=%lx," + "%lx rkey 0x%x va %#016Lx\n", + compare_add, swap, remote_iov->rmr_context, + remote_iov->target_address); + + wr.wr.atomic.compare_add = compare_add; + wr.wr.atomic.swap = swap; + wr.wr.atomic.remote_addr = remote_iov->target_address; + wr.wr.atomic.rkey = remote_iov->rmr_context; + break; + case 
OP_FETCH_AND_ADD: + /* OP_FETCH_AND_ADD has direct IBAL wr_type mapping */ + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: OP_FETCH_AND_ADD=%lx," + "%lx rkey 0x%x va %#016Lx\n", + compare_add, remote_iov->rmr_context, + remote_iov->target_address); + + wr.wr.atomic.compare_add = compare_add; + wr.wr.atomic.remote_addr = remote_iov->target_address; + wr.wr.atomic.rkey = remote_iov->rmr_context; + break; + default: + break; + } + + /* inline data for send or write ops */ + if ((total_len <= ibt_ptr->max_inline_send) && + ((op_type == OP_SEND) || (op_type == OP_RDMA_WRITE))) + wr.send_flags |= IBV_SEND_INLINE; + + /* set completion flags in work request */ + wr.send_flags |= (DAT_COMPLETION_SUPPRESS_FLAG & + completion_flags) ? 0 : IBV_SEND_SIGNALED; + wr.send_flags |= (DAT_COMPLETION_BARRIER_FENCE_FLAG & + completion_flags) ? IBV_SEND_FENCE : 0; + wr.send_flags |= (DAT_COMPLETION_SOLICITED_WAIT_FLAG & + completion_flags) ? IBV_SEND_SOLICITED : 0; + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: op 0x%x flags 0x%x sglist %p, %d\n", + wr.opcode, wr.send_flags, wr.sg_list, wr.num_sge); + + if (ibv_post_send(ep_ptr->qp_handle->cm_id->qp, &wr, &bad_wr)) + return( dapl_convert_errno(EFAULT,"ibv_recv") ); + + dapl_dbg_log(DAPL_DBG_TYPE_EP," post_snd: returned\n"); + return DAT_SUCCESS; +} +#endif + STATIC _INLINE_ DAT_RETURN dapls_ib_optional_prv_dat( IN DAPL_CR *cr_ptr, Index: dapl/openib_cma/dapl_ib_util.c =================================================================== --- dapl/openib_cma/dapl_ib_util.c (revision 4589) +++ dapl/openib_cma/dapl_ib_util.c (working copy) @@ -35,7 +35,7 @@ * * Description: * - * The uDAPL openib provider - init, open, close, utilities, work thread + * The OpenIB uCMA provider - init, open, close, utilities, work thread * ************************************************************************ **** * Source Control System Information @@ -64,7 +64,6 @@ static const char rcsid[] = "$Id: $"; #include /* for struct ifreq */ #include /* for ARPHRD_INFINIBAND */ - int g_dapl_loopback_connection = 0; int g_ib_pipe[2]; ib_thread_state_t g_ib_thread_state = 0; @@ -727,7 +726,7 @@ void dapli_thread(void *arg) int ret,idx,fds; char rbuf[2]; - dapl_dbg_log (DAPL_DBG_TYPE_CM, + dapl_dbg_log (DAPL_DBG_TYPE_UTIL, " ib_thread(%d,0x%x): ENTER: pipe %d ucma %d\n", getpid(), g_ib_thread, g_ib_pipe[0], rdma_get_fd()); @@ -767,7 +766,7 @@ void dapli_thread(void *arg) ufds[idx].revents = 0; uhca[idx] = hca; - dapl_dbg_log(DAPL_DBG_TYPE_CM, + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread(%d) poll_fd: hca[%d]=%p, async=%d" " pipe=%d cm=%d cq=d\n", getpid(), hca, ufds[idx-1].fd, @@ -783,14 +782,14 @@ void dapli_thread(void *arg) dapl_os_unlock(&g_hca_lock); ret = poll(ufds, fds, -1); if (ret <= 0) { - dapl_dbg_log(DAPL_DBG_TYPE_WARN, + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread(%d): ERR %s poll\n", getpid(),strerror(errno)); dapl_os_lock(&g_hca_lock); continue; } - dapl_dbg_log(DAPL_DBG_TYPE_CM, + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread(%d) poll_event: " " async=0x%x pipe=0x%x cm=0x%x cq=0x%x\n", getpid(), ufds[idx-1].revents, ufds[0].revents, @@ -834,3 +833,63 @@ void dapli_thread(void *arg) dapl_os_unlock(&g_hca_lock); } +#ifdef DAPL_PROVIDER_SPECIFIC_ATTR +/* + * dapls_set_provider_specific_attr + * + * Input: + * attr_ptr Pointer provider attributes + * + * Output: + * none + * + * Returns: + * void + */ +DAT_NAMED_ATTR ib_attrs[] = { + +#ifdef DAPL_EXTENSIONS + { + DAT_EXT_ATTR, + DAT_EXT_ATTR_TRUE + }, + { + DAT_EXT_ATTR_RDMA_WRITE_IMMED, + DAT_EXT_ATTR_TRUE + }, + { + 
DAT_EXT_ATTR_RECV_IMMED, + DAT_EXT_ATTR_TRUE + }, + /* inbound immediate data placed in event, NOT payload */ + { + DAT_EXT_ATTR_RECV_IMMED_EVENT, + DAT_EXT_ATTR_TRUE + }, + { + DAT_EXT_ATTR_FETCH_AND_ADD, + DAT_EXT_ATTR_TRUE + }, + { + DAT_EXT_ATTR_CMP_AND_SWAP, + DAT_EXT_ATTR_TRUE + }, +#else + { + "DAT_EXTENSION_INTERFACE", + "FALSE" + }, +#endif +}; + +#define SPEC_ATTR_SIZE(x) ( sizeof(x)/sizeof(DAT_NAMED_ATTR) ) + +void dapls_set_provider_specific_attr( + IN DAT_PROVIDER_ATTR *attr_ptr ) +{ + attr_ptr->num_provider_specific_attr = SPEC_ATTR_SIZE(ib_attrs); + attr_ptr->provider_specific_attr = ib_attrs; +} + +#endif + Index: dapl/openib_cma/dapl_ib_mem.c =================================================================== --- dapl/openib_cma/dapl_ib_mem.c (revision 4589) +++ dapl/openib_cma/dapl_ib_mem.c (working copy) @@ -25,9 +25,9 @@ /********************************************************************** * - * MODULE: dapl_det_mem.c + * MODULE: dapl_ib_mem.c * - * PURPOSE: Intel DET APIs: Memory windows, registration, + * PURPOSE: OpenIB uCMA provider Memory windows, registration, * and protection domain * * $Id: $ @@ -72,12 +72,10 @@ dapls_convert_privileges(IN DAT_MEM_PRIV access |= IBV_ACCESS_LOCAL_WRITE; if (DAT_MEM_PRIV_REMOTE_WRITE_FLAG & privileges) access |= IBV_ACCESS_REMOTE_WRITE; - if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) - access |= IBV_ACCESS_REMOTE_READ; - if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) - access |= IBV_ACCESS_REMOTE_READ; - if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) + if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) { access |= IBV_ACCESS_REMOTE_READ; + access |= IBV_ACCESS_REMOTE_ATOMIC; + } return access; } Index: dapl/openib_cma/dapl_ib_cm.c =================================================================== --- dapl/openib_cma/dapl_ib_cm.c (revision 4589) +++ dapl/openib_cma/dapl_ib_cm.c (working copy) @@ -35,7 +35,7 @@ * * Description: * - * The uDAPL openib provider - connection management + * The OpenIB uCMA provider - uCMA connection management * ************************************************************************ **** * Source Control System Information @@ -592,7 +592,11 @@ dapls_ib_setup_conn_listener(IN DAPL_IA if (rdma_bind_addr(conn->cm_id, (struct sockaddr *)&ia_ptr->hca_ptr->hca_address)) { - dat_status = dapl_convert_errno(errno,"setup_listener"); + if (errno == -EBUSY) + dat_status = DAT_CONN_QUAL_IN_USE; + else + dat_status = + dapl_convert_errno(errno,"setup_listener"); goto bail; } Index: dapl/openib_cma/dapl_ib_qp.c =================================================================== --- dapl/openib_cma/dapl_ib_qp.c (revision 4589) +++ dapl/openib_cma/dapl_ib_qp.c (working copy) @@ -25,9 +25,9 @@ /********************************************************************** * - * MODULE: dapl_det_qp.c + * MODULE: dapl_ib_qp.c * - * PURPOSE: QP routines for access to DET Verbs + * PURPOSE: OpenIB uCMA QP routines * * $Id: $ **********************************************************************/ Index: dapl/openib_cma/dapl_ib_util.h =================================================================== --- dapl/openib_cma/dapl_ib_util.h (revision 4589) +++ dapl/openib_cma/dapl_ib_util.h (working copy) @@ -35,7 +35,7 @@ * * Description: * - * The uDAPL openib provider - definitions, prototypes, + * The OpenIB uCMA provider - definitions, prototypes, * ************************************************************************ **** * Source Control System Information Index: dapl/openib_cma/README 
=================================================================== --- dapl/openib_cma/README (revision 4589) +++ dapl/openib_cma/README (working copy) @@ -23,15 +23,22 @@ New files for openib_scm provider dapl/openib_cma/dapl_ib_util.c dapl/openib_cma/dapl_ib_util.h dapl/openib_cma/dapl_ib_cm.c + dapl/openib_cma/dapl_ib_extensions.c A simple dapl test just for openib_scm testing... test/dtest/dtest.c + test/dtest/dtest_ext.c test/dtest/makefile server: dtest -s client: dtest -h hostname +or with extensions + + server: dtest_ext -s + client: dtest_ext -h hostname + known issues: no memory windows support in ibverbs, dat_create_rmr fails. Index: dapl/openib_cma/dapl_ib_cq.c =================================================================== --- dapl/openib_cma/dapl_ib_cq.c (revision 4589) +++ dapl/openib_cma/dapl_ib_cq.c (working copy) @@ -35,7 +35,7 @@ * * Description: * - * The uDAPL openib provider - completion queue + * The OpenIB uCMA provider - completion queue * ************************************************************************ **** * Source Control System Information @@ -498,7 +498,10 @@ dapls_ib_wait_object_wait(IN ib_wait_obj if (timeout != DAT_TIMEOUT_INFINITE) timeout_ms = timeout/1000; - status = poll(&cq_fd, 1, timeout_ms); + /* restart syscall */ + while ((status = poll(&cq_fd, 1, timeout_ms)) == -1 ) + if (errno == EINTR) + continue; /* returned event */ if (status > 0) { @@ -511,6 +514,8 @@ dapls_ib_wait_object_wait(IN ib_wait_obj /* timeout */ } else if (status == 0) status = ETIMEDOUT; + else + status = errno; dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " cq_object_wait: RET evd %p ibv_cq %p ibv_ctx %p %s\n", Index: dat/include/dat/dat_redirection.h =================================================================== --- dat/include/dat/dat_redirection.h (revision 4589) +++ dat/include/dat/dat_redirection.h (working copy) @@ -59,10 +59,10 @@ typedef struct dat_provider DAT_PROVIDER * This would allow a good compiler to avoid indirection overhead when * making function calls. 
*/ - #define DAT_HANDLE_TO_PROVIDER(handle) (*(DAT_PROVIDER **)(handle)) #endif + #define DAT_IA_QUERY(ia, evd, ia_msk, ia_ptr, p_msk, p_ptr) \ (*DAT_HANDLE_TO_PROVIDER (ia)->ia_query_func) (\ (ia), \ @@ -395,6 +395,12 @@ typedef struct dat_provider DAT_PROVIDER (lbuf), \ (cookie)) +#define DAT_EXTENSION(handle, op, args) \ + (*DAT_HANDLE_TO_PROVIDER (handle)->extension_func) (\ + (handle), \ + (op), \ + (args)) + /*************************************************************** * * FUNCTION PROTOTYPES @@ -720,4 +726,11 @@ typedef DAT_RETURN (*DAT_SRQ_POST_RECV_F IN DAT_LMR_TRIPLET *, /* local_iov */ IN DAT_DTO_COOKIE ); /* user_cookie */ +/* Extension function */ +#include +typedef DAT_RETURN (*DAT_EXTENSION_FUNC) ( + IN DAT_HANDLE, /* dat handle */ + IN DAT_EXT_OP, /* extension operation */ + IN va_list ); /* va_list */ + #endif /* _DAT_REDIRECTION_H_ */ Index: dat/include/dat/dat.h =================================================================== --- dat/include/dat/dat.h (revision 4589) +++ dat/include/dat/dat.h (working copy) @@ -854,11 +854,15 @@ typedef enum dat_event_number DAT_ASYNC_ERROR_EP_BROKEN = 0x08003, DAT_ASYNC_ERROR_TIMED_OUT = 0x08004, DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR = 0x08005, - DAT_SOFTWARE_EVENT = 0x10001 + DAT_SOFTWARE_EVENT = 0x10001, + DAT_EXTENSION_EVENT = 0x20001 + } DAT_EVENT_NUMBER; -/* Union for event Data */ +/* include extension data definitions */ +#include +/* Union for event Data */ typedef union dat_event_data { DAT_DTO_COMPLETION_EVENT_DATA dto_completion_event_data; @@ -867,6 +871,7 @@ typedef union dat_event_data DAT_CONNECTION_EVENT_DATA connect_event_data; DAT_ASYNCH_ERROR_EVENT_DATA asynch_error_event_data; DAT_SOFTWARE_EVENT_DATA software_event_data; + DAT_EXTENSION_DATA extension_data; } DAT_EVENT_DATA; /* Event struct that holds all event information */ @@ -1222,6 +1227,11 @@ extern DAT_RETURN dat_srq_set_lw ( IN DAT_SRQ_HANDLE, /* srq_handle */ IN DAT_COUNT); /* low_watermark */ +extern DAT_RETURN dat_extension( + IN DAT_HANDLE, + IN DAT_EXT_OP, + IN ... ); + /* * DAT registry functions. * Index: dat/include/dat/udat_redirection.h =================================================================== --- dat/include/dat/udat_redirection.h (revision 4589) +++ dat/include/dat/udat_redirection.h (working copy) @@ -199,7 +199,6 @@ typedef DAT_RETURN (*DAT_EVD_SET_UNWAITA typedef DAT_RETURN (*DAT_EVD_CLEAR_UNWAITABLE_FUNC) ( IN DAT_EVD_HANDLE); /* evd_handle */ - #include struct dat_provider @@ -294,6 +293,10 @@ struct dat_provider DAT_SRQ_QUERY_FUNC srq_query_func; DAT_SRQ_RESIZE_FUNC srq_resize_func; DAT_SRQ_SET_LW_FUNC srq_set_lw_func; + + /* extension for provder specific functions */ + DAT_EXTENSION_FUNC extension_func; + }; #endif /* _UDAT_REDIRECTION_H_ */ Index: dat/include/dat/dat_extensions.h =================================================================== --- dat/include/dat/dat_extensions.h (revision 0) +++ dat/include/dat/dat_extensions.h (revision 0) @@ -0,0 +1,209 @@ +/* + * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * in the file LICENSE.txt in the root directory. The license is also + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is in the file + * LICENSE2.txt in the root directory. 
The license is also available from + * the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is in the file LICENSE3.txt in the root directory. The + * license is also available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + */ +/********************************************************************** + * + * HEADER: dat_extensions.h + * + * PURPOSE: defines the extensions to the DAT API for uDAPL. + * + * Description: Header file for "uDAPL: User Direct Access Programming + * Library, Version: 1.2" + * + * Mapping rules: + * All global symbols are prepended with "DAT_" or "dat_" + * All DAT objects have an 'api' tag which, such as 'ep' or 'lmr' + * The method table is in the provider definition structure. + * + * + **********************************************************************/ + +#ifndef _DAT_EXTENSIONS_H_ + +extern int dat_extensions; + +/* + * Provider specific attribute strings for extension support + * returned with dat_ia_query() and + * DAT_PROVIDER_ATTR_MASK == DAT_PROVIDER_FIELD_PROVIDER_SPECIFIC_ATTR + * + * DAT_NAMED_ATTR name == extended operation, + * value == TRUE if extended operation is supported + */ +#define DAT_EXT_ATTR "DAT_EXTENSION_INTERFACE" +#define DAT_EXT_ATTR_RDMA_WRITE_IMMED "DAT_EXT_RDMA_WRITE_IMMED" +#define DAT_EXT_ATTR_RECV_IMMED "DAT_EXT_RECV_IMMED" +#define DAT_EXT_ATTR_RECV_IMMED_EVENT "DAT_EXT_RECV_IMMED_EVENT" +#define DAT_EXT_ATTR_RECV_IMMED_PAYLOAD "DAT_EXT_RECV_IMMED_PAYLOAD" +#define DAT_EXT_ATTR_FETCH_AND_ADD "DAT_EXT_FETCH_AND_ADD" +#define DAT_EXT_ATTR_CMP_AND_SWAP "DAT_EXT_CMP_AND_SWAP" +#define DAT_EXT_ATTR_TRUE "TRUE" +#define DAT_EXT_ATTR_FALSE "FALSE" + +/* + * Extension OPERATIONS + */ +typedef enum dat_ext_op +{ + DAT_EXT_RDMA_WRITE_IMMED, + DAT_EXT_RECV_IMMED, + DAT_EXT_FETCH_AND_ADD, + DAT_EXT_CMP_AND_SWAP, + +} DAT_EXT_OP; + +/* + * Extension completion event TYPES + */ +typedef enum dat_ext_event_type +{ + DAT_EXT_RDMA_WRITE_IMMED_STATUS = 1, + DAT_EXT_RECV_NO_IMMED, + DAT_EXT_RECV_IMMED_DATA_EVENT, + DAT_EXT_RECV_IMMED_DATA_PAYLOAD, + DAT_EXT_FETCH_AND_ADD_STATUS, + DAT_EXT_CMP_AND_SWAP_STATUS, + +} DAT_EXT_EVENT_TYPE; + +/* + * Extension completion event DATA + */ +typedef struct dat_immediate_data +{ + DAT_UINT32 data; + +} DAT_RDMA_WRITE_IMMED_DATA; + +typedef struct dat_extension_data +{ + DAT_DTO_COMPLETION_EVENT_DATA dto; + DAT_EXT_EVENT_TYPE type; + union { + DAT_RDMA_WRITE_IMMED_DATA immed; + } val; +} DAT_EXTENSION_DATA; + +typedef enum dat_ext_flags +{ + DAT_EXT_WRITE_IMMED_FLAG = 0x1, + DAT_EXT_WRITE_CONFIRM_FLAG = 0x2, + +} DAT_EXT_FLAGS; + +/* + * Extended API with redirection via DAT extension function + */ + +/* + * RDMA Write with IMMEDIATE extension: + * + * Asynchronous call performs a normal RDMA write to the remote endpoint + * followed by a post of an extended immediate data value to the receive + * EVD on the remote endpoint. Event completion for the request completes + * as an DAT_EXTENSION_EVENT with type set to DAT_EXT_RDMA_WRITE_IMMED_STATUS. 
+ * Event completion on the remote endpoint completes as an DAT_EXTENSION_EVENT + * with type set to DAT_EXT_RECV_IMMED_DATA_IN_EVENT or + * DAT_EXT_RECV_IMMED_DATA_IN_PAYLOAD depending on the provider transport. + * + * DAT_EXT_WRITE_IMMED_FLAG requests that the supplied + *'immediate' value be sent as the payload of a four byte send following + * the RDMA Write, or any transport-dependent equivalent thereof. + * For example, on InfiniBand the request should be translated as an + * RDMA Write with Immediate. + * + * DAT_EXT_WRITE_CONFIRM_FLAG requests that this DTO + * not complete until receipt by the far end is confirmed. + * + * Note to Consumers: the immediate data will consume a receive + * buffer at the Data Sink. + * + * Other extension flags: + * n/a + */ +#define dat_ep_post_rdma_write_immed(ep, size, lbuf, cookie, rbuf, idata, eflgs, flgs) \ + dat_extension( ep, \ + DAT_EXT_RDMA_WRITE_IMMED, \ + (size), \ + (lbuf), \ + (cookie), \ + (rbuf), \ + (idata), \ + (eflgs), \ + (flgs)) + +/* + * Call performs a normal post receive message to the local endpoint + * that includes additional 32-bit buffer space for immediate data + * Event completion for the request completes as an + * DAT_EXTENSION_EVENT with type set to DAT_EXT_RDMA_WRITE_IMMED_STATUS. + */ +#define dat_ep_post_recv_immed(ep, size, lbuf, cookie, flgs) \ + dat_extension( ep, \ + DAT_EXT_RECV_IMMED, \ + (size), \ + (lbuf), \ + (cookie), \ + (flgs)) + +/* + * This asynchronous call is modeled after the InfiniBand atomic + * Fetch and Add operation. The add_value is added to the 64 bit + * value stored at the remote memory location specified in remote_iov + * and the result is stored in the local_iov. + */ +#define dat_ep_post_fetch_and_add(ep, add_val, lbuf, cookie, rbuf, flgs) \ + dat_extension( ep, \ + DAT_EXT_FETCH_AND_ADD, \ + (add_val), \ + (lbuf), \ + (cookie), \ + (rbuf), \ + (flgs)) + +/* + * This asynchronous call is modeled after the InfiniBand atomic + * Compare and Swap operation. The cmp_value is compared to the 64 bit + * value stored at the remote memory location specified in remote_iov. + * If the two values are equal, the 64 bit swap_value is stored in + * the remote memory location. In all cases, the original 64 bit + * value stored in the remote memory location is copied to the local_iov. + */ +#define dat_ep_post_cmp_and_swap(ep, cmp_val, swap_val, lbuf, cookie, rbuf, flgs) \ + dat_extension( ep, \ + DAT_EXT_CMP_AND_SWAP, \ + (cmp_val), \ + (swap_val), \ + (lbuf), \ + (cookie), \ + (rbuf), \ + (flgs)) + +#endif /* _DAT_EXTENSIONS_H_ */ + Index: dat/common/dat_api.c =================================================================== --- dat/common/dat_api.c (revision 4594) +++ dat/common/dat_api.c (working copy) @@ -1142,6 +1142,36 @@ DAT_RETURN dat_srq_set_lw( low_watermark); } +DAT_RETURN dat_extension( + IN DAT_HANDLE handle, + IN DAT_EXT_OP ext_op, + IN ... 
) + { + DAT_RETURN status; + va_list args; + + if (handle == NULL) + { + return DAT_ERROR(DAT_INVALID_HANDLE, DAT_INVALID_HANDLE_EP); + } + + /* verify provider extension support */ + if (!dat_extensions) + { + return DAT_ERROR(DAT_NOT_IMPLEMENTED, 0); + } + + va_start(args, ext_op); + + status = DAT_EXTENSION(handle, + ext_op, + args); + va_end(args); + + return status; +} + /* * Local variables: * c-indent-level: 4 Index: dat/udat/udat.c =================================================================== --- dat/udat/udat.c (revision 4594) +++ dat/udat/udat.c (working copy) @@ -66,6 +66,10 @@ udat_check_state ( void ); * * *********************************************************************/ +/* + * Use a global to get an unresolved when run with pre-extension library + */ +int dat_extensions = 0; /* * @@ -230,13 +234,44 @@ dat_ia_openv ( async_event_qlen, async_event_handle, ia_handle); + + /* + * See if provider supports extensions + */ if (dat_status == DAT_SUCCESS) { - return_handle = dats_set_ia_handle (*ia_handle); - if (return_handle >= 0) - { - *ia_handle = (DAT_IA_HANDLE)return_handle; - } + DAT_PROVIDER_ATTR p_attr; + int i; + + return_handle = dats_set_ia_handle (*ia_handle); + if (return_handle >= 0) + { + *ia_handle = (DAT_IA_HANDLE)return_handle; + } + + if ( dat_ia_query( *ia_handle, + NULL, + 0, + NULL, + DAT_PROVIDER_FIELD_PROVIDER_SPECIFIC_ATTR, + &p_attr ) == DAT_SUCCESS ) + { + for ( i = 0; i < p_attr.num_provider_specific_attr; i++ ) + { + if ( (strcmp( p_attr.provider_specific_attr[i].name, + "DAT_EXTENSION_INTERFACE" ) == 0) && + (strcmp( p_attr.provider_specific_attr[i].value, + "TRUE" ) == 0) ) + { + dat_os_dbg_print(DAT_OS_DBG_TYPE_CONSUMER_API, + "DAT Registry: dat_ia_open () " + "DAPL Extension Interface supported!\n"); + + dat_extensions = 1; + break; + } + } + } } return dat_status; Index: README =================================================================== --- README (revision 4589) +++ README (working copy) @@ -1,5 +1,10 @@ There are now 3 uDAPL providers for openib (openib,openib_scm,openib_cma). +NEW FEATURES for openib_cma provider: +API extensions for immediate data and atomic operations have been added. +see dat/include/dat/dat_extensions.h for new API's. +see dapl/test/dtest/dtest_ext.c for an example test case + ========== 1.0 BUILD: From caitlinb at broadcom.com Fri Dec 23 13:46:45 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 23 Dec 2005 13:46:45 -0800 Subject: [openib-general] Re: [RFC][PATCH] OpenIB uDAPL extension proposal - sample immed data and atomic api's Message-ID: <54AD0F12E08D1541B826BE97C98F99F114206D@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > arlin> DAPL provides a generalized abstraction to RDMA capable > arlin> transports. As a generalized abstraction, it cannot > exploit the > arlin> unique properties that many of the underlying > arlin> platforms/interconnects can provide so I would like to propose > a arlin> simple (minimum impact on libdat) extensible interface > to uDAPL > arlin> that will allow vendors to expose such > capabilities.
I > am looking > arlin> for feedback, especially from the DAT collaborative. I have > arlin> included both a design document and actual working code as a > arlin> reference. > > This is an excellent document and clearly certain > applications will benefit greatly from adding this additional > functionality. > > Since DAPL's inception, the DAT_PROVIDER structure has > contained a field called "extension" of type void *. The > purpose of this field was to allow for the kind of > provider/platform/interconnect specific extensions you describe. > > I believe these features can be added without modifications > to the current API by defining a particular format for the > DAT_PROVIDER's extension data and indicating its presence via > a provider attribute. > That would require creating an extension document like this > one describing an "extension" structure w/ function pointers > to the new functions and a well known provider attribute value. > > Is there a reason this was not feasible? Would minor > modifications to the existing framework be sufficient > (perhaps an "extension" event type)? > Good points. Promoting something from a provider-specific extension, or even an extension that many providers agree to, creates an expectation that other providers SHOULD implement at least an emulation of this new method if it is at all relevant on their transport. And at the minimum they have to explicitly reject calls to the new method. An extension creates no similar expectations on other Providers. From caitlinb at broadcom.com Fri Dec 23 13:29:56 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 23 Dec 2005 13:29:56 -0800 Subject: [openib-general] RE: [dat-discussions] [RFC][PATCH] OpenIB uDAPL extension proposal - sample immed data and atomic api's Message-ID: <54AD0F12E08D1541B826BE97C98F99F1142064@NT-SJCA-0751.brcm.ad.broadcom.com> dat-discussions at yahoogroups.com wrote: > James and Arkady, > > > > DAPL provides a generalized abstraction to RDMA capable > transports. As a generalized abstraction, it cannot exploit > the unique properties that many of the underlying > platforms/interconnects can provide so I would like to > propose a simple (minimum impact on libdat) extensible > interface to uDAPL that will allow vendors to expose such > capabilities. I am looking for feedback, especially from the > DAT collaborative. I have included both a design document and > actual working code as a reference. > > > > The patch provides a fully tested DAT and DAPL library > (openib_cma) set with the following provider extensions: > > > > DAT_RETURN > dat_ep_post_write_immed( > IN DAT_EP_HANDLE ep_handle, > > IN DAT_COUNT num_segments, > IN DAT_LMR_TRIPLET *local_iov, > IN DAT_DTO_COOKIE user_cookie, > > IN DAT_RMR_TRIPLE *remote_iov, > IN DAT_UINT32 immediate_data, > IN DAT_COMPLETION_FLAGS completion_flags); > > There was an earlier discussion on "extended RDMA Writes" that could optionally include immediate data and/or rdma-peer-confirmed delivery. I think any discussions regarding RDMA Writes should build upon that, since we already outlined the differences between iWARP and IB and how each could be supported. To recap: it is easy to support any of the IB RDMA Write semantics over iWARP, but doing so maps to a variable number of work requests (for example, an RDMA Write *and* an RDMA Send). As I recall, there was insufficient interest to justify the extra complexity. But there is a base of interest in being able to do these "extended RDMA Writes" in a transport-neutral fashion.
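As a rough illustration of that recap (not code from any posted patch; libibverbs names are used only for familiarity), the write-plus-immediate case decomposes into two chained work requests on a transport without a native immediate-data write:

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Illustrative sketch: emulate "RDMA write with immediate" as a bulk
 * write chained to a 4-byte send carrying the immediate value.  The
 * qp and SGEs are assumed to be set up elsewhere. */
static int post_write_then_immed(struct ibv_qp *qp,
                                 struct ibv_sge *payload_sge,
                                 uint64_t remote_addr, uint32_t rkey,
                                 struct ibv_sge *immed_sge /* 4 bytes */)
{
    struct ibv_send_wr wr[2], *bad_wr;

    memset(wr, 0, sizeof wr);

    /* first WR: the RDMA write of the payload, left unsignaled */
    wr[0].opcode = IBV_WR_RDMA_WRITE;
    wr[0].sg_list = payload_sge;
    wr[0].num_sge = 1;
    wr[0].wr.rdma.remote_addr = remote_addr;
    wr[0].wr.rdma.rkey = rkey;
    wr[0].next = &wr[1];

    /* second WR: the 4-byte send standing in for the immediate data;
     * as the proposal notes, it consumes a receive buffer at the
     * data sink */
    wr[1].opcode = IBV_WR_SEND;
    wr[1].sg_list = immed_sge;
    wr[1].num_sge = 1;
    wr[1].send_flags = IBV_SEND_SIGNALED;

    return ibv_post_send(qp, wr, &bad_wr);
}

Posting both WRs in one call keeps them ordered on the send queue, so the peer sees the send only after the write data has been placed.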
> > DAT_RETURN > dat_ep_post_cmp_and_swap( > IN DAT_EP_HANDLE ep_handle, > > IN DAT_UINT64 cmp_value, > IN DAT_UINT64 swap_value, > > IN DAT_LMR_TRIPLE *local_iov, > IN DAT_DTO_COOKIE user_cookie, > > IN DAT_RMR_TRIPLE *remote_iov, > IN DAT_COMPLETION_FLAGS completion_flags); > > > > DAT_RETURN > dat_ep_post_fetch_and_add( > IN DAT_EP_HANDLE ep_handle, > > IN DAT_UINT64 add_value, > > IN DAT_LMR_TRIPLE *local_iov, > IN DAT_DTO_COOKIE user_cookie, > > IN DAT_RMR_TRIPLE *remote_iov, > IN DAT_COMPLETION_FLAGS completion_flags); > IT-API has already made a similar proposal, and the idea that atomic work requests at least exist is present in all of the verbs layer APIs (with the caveat that the response might be "illegal request"). It might be interesting to explore if there is any way that these can be defined so that they could easily be implemented at the driver or Provider layer since they are obviously not a part of an iWARP RNIC. I believe this is possible if there is no other untagged traffic (send/recv) on the same connection. Would transport-neutral atomics with that restriction be of any use at the application layer? Or is just being told "no we don't support hardware-assisted atomics" a simpler attribute? Keep in mind that atomics were not omitted from iWARP by oversight. They would have been a fairly simple opcode to add to the same work queue that handles RDMA Read requests. It was the consensus of the drafters that IB atomics had not justified the silicon they required. The key question is whether the benefits of using atomics when they are hardware optimized can be retained when the overhead of checking to see if they can be used is added. From rminnich at lanl.gov Sat Dec 24 17:34:38 2005 From: rminnich at lanl.gov (Ronald G. Minnich) Date: Sat, 24 Dec 2005 18:34:38 -0700 (MST) Subject: [openib-general] PathScale license In-Reply-To: <1135363454.4328.95007.camel@hal.voltaire.com> References: <1135363454.4328.95007.camel@hal.voltaire.com> Message-ID: <21289.128.165.0.81.1135474478.squirrel@webmail.lanl.gov> > Hi, > > The PathScale OpenIB license includes the following which > is beyond the normal OpenIB license: > > * Patent licenses, if any, provided herein do not apply to > * combinations of this program with other software, or any other > * product whatsoever. ??? What the heck could this mean? This kind of comment, lacking any real explanation, can cause trouble for a vendor -- it causes worries for customers. Customer worries can lead to customers looking to other vendors. I had hoped we had gotten past this 'one wrong move and I kill myself' aspect of the IB vendor community. I think I'd like to see PathScale comments on this as well. Over the last few years, the PathScale position on IP has led to confusion among potential customers, including me, which has led to a concern about the consequences of buying the PathScale HCAs -- and from there, a decision not to buy them. I'm getting set to buy some of these interfaces but could still be scared off by this type of IP confusion -- please help us understand what this means. IANAL, so keep it simple :-) Can I hear from a lawyer if I buy PathScale HCAs, run PathScale software, and then do something to it that involves a patent I've never heard of?
ron From hch at lst.de Sun Dec 25 02:41:23 2005 From: hch at lst.de (Christoph Hellwig) Date: Sun, 25 Dec 2005 11:41:23 +0100 Subject: [openib-general] PathScale license In-Reply-To: <21289.128.165.0.81.1135474478.squirrel@webmail.lanl.gov> References: <1135363454.4328.95007.camel@hal.voltaire.com> <21289.128.165.0.81.1135474478.squirrel@webmail.lanl.gov> Message-ID: <20051225104123.GA7180@lst.de> On Sat, Dec 24, 2005 at 06:34:38PM -0700, Ronald G. Minnich wrote: > > Hi, > > > > The PathScale OpenIB license includes the following which > > is beyond the normal OpenIB license: > > > > * Patent licenses, if any, provided herein do not apply to > > * combinations of this program with other software, or any other > > * product whatsoever. > > ??? What the heck could this mean? This kind of comment, lacking any real > explanation, can cause trouble for a vendor -- it causes worries for > customers. Customer worries can lead to customers looking to other > vendors. I had hoped we had gotten past this 'one wrong move and I kill > myself' aspect of the IB vendor community. Yepp. The second problem is that this is not GPL-compatible, so we definitely couldn't put the code into the kernel tree with this license. From rdreier at cisco.com Sun Dec 25 18:49:19 2005 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 25 Dec 2005 18:49:19 -0800 Subject: [openib-general] Re: [PATCH 13/13] [RFC] ipath Kconfig and Makefile In-Reply-To: <20051218192356.GB9145@mars.ravnborg.org> (Sam Ravnborg's message of "Sun, 18 Dec 2005 20:23:56 +0100") References: <200512161548.MdcxE8ZQTy1yj4v1@cisco.com> <200512161548.lokgvLraSGi0enUH@cisco.com> <20051218192356.GB9145@mars.ravnborg.org> Message-ID: > > +EXTRA_CFLAGS += -Idrivers/infiniband/include > If this is needed then some header files should be moved to include/rdma Sorry, this is really my fault -- it's a remnant to make building our subversion tree easier. It's not needed when the driver is part of the kernel proper, and I'll make sure to remove it when finally merging. Thanks, Roland From rdreier at cisco.com Sun Dec 25 18:53:14 2005 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 25 Dec 2005 18:53:14 -0800 Subject: [openib-general] Re: outstanding patches In-Reply-To: <20051219215254.GC2694@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 19 Dec 2005 23:52:54 +0200") References: <20051219215254.GC2694@mellanox.co.il> Message-ID: Michael> Most of them are small, I have collected them here: Michael> https://openib.org/svn/trunk/contrib/mellanox/patches Great. I don't think I've lost any but this is useful as well. Michael> I would be especially interested to get feedback on Michael> several ipoib patches, which have been outstanding for a Michael> while now. Sorry, since the end of November, I've had vacation, office move, and back on vacation (and I'm still not back really, just stealing time to check my email quickly). I'll be back with a vengeance January 4. - R. From rdreier at cisco.com Sun Dec 25 18:55:55 2005 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 25 Dec 2005 18:55:55 -0800 Subject: [openib-general] mthca calls ib_register_mad_agent() and implements ib_device.process_mad()? In-Reply-To: <1135039132.6397.52.camel@brick.internal.keyresearch.com> (Ralph Campbell's message of "Mon, 19 Dec 2005 16:38:52 -0800") References: <1135039132.6397.52.camel@brick.internal.keyresearch.com> Message-ID: Ralph> Can someone explain why the mthca driver calls Ralph> ib_register_mad_agent() and implements Ralph> ib_device.process_mad()?
It looks like the latter does the Ralph> actual processing of MAD packets for the SMA and PMA Ralph> whereas the former doesn't seem to do anything except cause Ralph> the ib_mad module to be modprobe'd. The MAD agent is created to handle sending MADs that the device itself generates -- traps and notices essentially. Look at mthca_mad.c::forward_trap() to see where it gets used. MADs from the network are passed down to the firmware in the process_mad method, and those replies are returned back up to the MAD layer for sending. - R. Fill vendor_err field in completion with error. Signed-off-by: Michael S. Tsirkin Index: last_stable/drivers/infiniband/hw/mthca/mthca_cq.c =================================================================== --- last_stable.orig/drivers/infiniband/hw/mthca/mthca_cq.c 2005-12-26 13:54:22.000000000 +0200 +++ last_stable/drivers/infiniband/hw/mthca/mthca_cq.c 2005-12-26 13:54:38.000000000 +0200 @@ -128,12 +128,12 @@ struct mthca_err_cqe { __be32 my_qpn; u32 reserved1[3]; u8 syndrome; - u8 reserved2; + u8 vendor_err; __be16 db_cnt; - u32 reserved3; + u32 reserved2; __be32 wqe; u8 opcode; - u8 reserved4[2]; + u8 reserved3[2]; u8 owner; }; @@ -342,8 +342,8 @@ static int handle_error_cqe(struct mthca } /* - * For completions in error, only work request ID, status (and - * freed resource count for RD) have to be set. + * For completions in error, only work request ID, status, vendor error + * (and freed resource count for RD) have to be set. */ switch (cqe->syndrome) { case SYNDROME_LOCAL_LENGTH_ERR: @@ -405,6 +405,8 @@ static int handle_error_cqe(struct mthca break; } + entry->vendor_err = cqe->vendor_err; + /* * Mem-free HCAs always generate one CQE per WQE, even in the * error case, so we don't have to check the doorbell count, etc. -- MST From mst at mellanox.co.il Mon Dec 26 04:10:18 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 26 Dec 2005 14:10:18 +0200 Subject: [openib-general] [PATCH] libmthca: fill vendor_err in completion with error Message-ID: <20051226121018.GS4907@mellanox.co.il> Fill vendor_err field for completion with error. Index: last_stable/src/userspace/libmthca/src/cq.c =================================================================== --- last_stable.orig/src/userspace/libmthca/src/cq.c 2005-12-25 17:04:23.000000000 +0200 +++ last_stable/src/userspace/libmthca/src/cq.c 2005-12-26 13:57:00.000000000 +0200 @@ -112,12 +112,12 @@ struct mthca_err_cqe { uint32_t my_qpn; uint32_t reserved1[3]; uint8_t syndrome; - uint8_t reserved2; + uint8_t vendor_err; uint16_t db_cnt; - uint32_t reserved3; + uint32_t reserved2; uint32_t wqe; uint8_t opcode; - uint8_t reserved4[2]; + uint8_t reserved3[2]; uint8_t owner; }; @@ -197,8 +197,8 @@ static int handle_error_cqe(struct mthca } /* - * For completions in error, only work request ID, status (and - * freed resource count for RD) have to be set. + * For completions in error, only work request ID, status, vendor error + * (and freed resource count for RD) have to be set.
*/ switch (cqe->syndrome) { case SYNDROME_LOCAL_LENGTH_ERR: @@ -260,6 +260,8 @@ static int handle_error_cqe(struct mthca break; } + wc->vendor_err = cqe->vendor_err; + /* * Mem-free HCAs always generate one CQE per WQE, even in the * error case, so we don't have to check the doorbell count, etc. -- MST From devesh28 at gmail.com Mon Dec 26 04:32:46 2005 From: devesh28 at gmail.com (Devesh Sharma) Date: Mon, 26 Dec 2005 18:02:46 +0530 Subject: [openib-general] OpenIB Design Document Message-ID: <309a667c0512260432w20ae94f4u73abdd95959b95f4@mail.gmail.com> Hi all, I have a query regarding the design specification the OpenIB forum is currently following: is it the document available at /trunk/contrib/sourceforge/Documentation/LinuxSAS.1.0.1.pdf or the one at trunk/contrib/voltaire/openib-access.pdf, or is there no difference between the two files? Please help me out on this. Devesh From mst at mellanox.co.il Mon Dec 26 07:08:04 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 26 Dec 2005 17:08:04 +0200 Subject: [openib-general] [PATCH] mthca_mcg: multiple fixes Message-ID: <20051226150804.GW4907@mellanox.co.il> The following patch supersedes the patch for mthca_mcg that I sent previously. Unfortunately the patch has got relatively big, and it's somewhat hard to split it up into smaller chunks. --- Multicast group management fixes: . Don't leak mailbox memory in error handling on multicast group operations . Free AMGM indices at detach and in attach error handling . Fix the amount to shift for aligning next_gid_index in the mailbox: it starts at bit 6, not bit 5 . Allocate AMGM indices after the end of the MGM table, in the range num_mgms to multicast table size - 1. Add some BUG_ON checks to catch cases where the index falls in the MGM hash area. . Initialize the list of QPs in a newly-allocated group from AMGM to 0. This is necessary since when a group is moved from AMGM to MGM (in the case where the MGM entry has been emptied of QPs), the AMGM entry is not reset to 0 (and we don't want an extra command to do that). Signed-off-by: Jack Morgenstein Signed-off-by: Michael S.
Tsirkin Index: latest/drivers/infiniband/hw/mthca/mthca_mcg.c =================================================================== --- latest.orig/drivers/infiniband/hw/mthca/mthca_mcg.c +++ latest/drivers/infiniband/hw/mthca/mthca_mcg.c @@ -109,7 +109,8 @@ static int find_mgm(struct mthca_dev *de goto out; if (status) { mthca_err(dev, "READ_MGM returned status %02x\n", status); - return -EINVAL; + err = -EINVAL; + goto out; } if (!memcmp(mgm->gid, zero_gid, 16)) { @@ -124,7 +125,7 @@ static int find_mgm(struct mthca_dev *de goto out; *prev = *index; - *index = be32_to_cpu(mgm->next_gid_index) >> 5; + *index = be32_to_cpu(mgm->next_gid_index) >> 6; } while (*index); *index = -1; @@ -151,8 +152,10 @@ int mthca_multicast_attach(struct ib_qp return PTR_ERR(mailbox); mgm = mailbox->buf; - if (down_interruptible(&dev->mcg_table.sem)) - return -EINTR; + if (down_interruptible(&dev->mcg_table.sem)) { + err = -EINTR; + goto err_sem; + } err = find_mgm(dev, gid->raw, mailbox, &hash, &prev, &index); if (err) @@ -179,9 +182,8 @@ int mthca_multicast_attach(struct ib_qp err = -EINVAL; goto out; } - + memset(mgm, 0, sizeof *mgm); memcpy(mgm->gid, gid->raw, 16); - mgm->next_gid_index = 0; } for (i = 0; i < MTHCA_QP_PER_MGM; ++i) @@ -207,6 +209,7 @@ int mthca_multicast_attach(struct ib_qp if (status) { mthca_err(dev, "WRITE_MGM returned status %02x\n", status); err = -EINVAL; + goto out; } if (!link) @@ -221,7 +224,7 @@ int mthca_multicast_attach(struct ib_qp goto out; } - mgm->next_gid_index = cpu_to_be32(index << 5); + mgm->next_gid_index = cpu_to_be32(index << 6); err = mthca_WRITE_MGM(dev, prev, mailbox, &status); if (err) @@ -232,7 +235,12 @@ int mthca_multicast_attach(struct ib_qp } out: + if (err && link && index != -1) { + BUG_ON(index < dev->limits.num_mgms); + mthca_free(&dev->mcg_table.alloc, index); + } up(&dev->mcg_table.sem); + err_sem: mthca_free_mailbox(dev, mailbox); return err; } @@ -253,8 +261,10 @@ int mthca_multicast_detach(struct ib_qp return PTR_ERR(mailbox); mgm = mailbox->buf; - if (down_interruptible(&dev->mcg_table.sem)) - return -EINTR; + if (down_interruptible(&dev->mcg_table.sem)) { + err = -EINTR; + goto err_sem; + } err = find_mgm(dev, gid->raw, mailbox, &hash, &prev, &index); if (err) @@ -303,13 +313,11 @@ int mthca_multicast_detach(struct ib_qp if (i != 1) goto out; - goto out; - if (prev == -1) { /* Remove entry from MGM */ - if (be32_to_cpu(mgm->next_gid_index) >> 5) { - err = mthca_READ_MGM(dev, - be32_to_cpu(mgm->next_gid_index) >> 5, + int amgm_index_to_free = be32_to_cpu(mgm->next_gid_index) >> 6; + if (amgm_index_to_free) { + err = mthca_READ_MGM(dev, amgm_index_to_free, mailbox, &status); if (err) goto out; @@ -330,9 +338,13 @@ int mthca_multicast_detach(struct ib_qp err = -EINVAL; goto out; } + if (amgm_index_to_free) { + BUG_ON(amgm_index_to_free < dev->limits.num_mgms); + mthca_free(&dev->mcg_table.alloc, amgm_index_to_free); + } } else { /* Remove entry from AMGM */ - index = be32_to_cpu(mgm->next_gid_index) >> 5; + int curr_next_index = be32_to_cpu(mgm->next_gid_index) >> 6; err = mthca_READ_MGM(dev, prev, mailbox, &status); if (err) goto out; @@ -342,7 +354,7 @@ int mthca_multicast_detach(struct ib_qp goto out; } - mgm->next_gid_index = cpu_to_be32(index << 5); + mgm->next_gid_index = cpu_to_be32(curr_next_index << 6); err = mthca_WRITE_MGM(dev, prev, mailbox, &status); if (err) @@ -352,10 +364,13 @@ int mthca_multicast_detach(struct ib_qp err = -EINVAL; goto out; } + BUG_ON(index < dev->limits.num_mgms); + mthca_free(&dev->mcg_table.alloc, index); } out: 
up(&dev->mcg_table.sem); + err_sem: mthca_free_mailbox(dev, mailbox); return err; } @@ -363,11 +378,12 @@ int mthca_multicast_detach(struct ib_qp int __devinit mthca_init_mcg_table(struct mthca_dev *dev) { int err; + int table_size = dev->limits.num_mgms + dev->limits.num_amgms; err = mthca_alloc_init(&dev->mcg_table.alloc, - dev->limits.num_amgms, - dev->limits.num_amgms - 1, - 0); + table_size, + table_size - 1, + dev->limits.num_mgms); if (err) return err; -- MST From ardavis at ichips.intel.com Mon Dec 26 13:54:09 2005 From: ardavis at ichips.intel.com (Arlin Davis) Date: Mon, 26 Dec 2005 13:54:09 -0800 Subject: [openib-general] Re: [RFC][PATCH] OpenIB uDAPL extension proposal - sample immed data and atomic api's In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F114206D@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F114206D@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <43B06681.2000404@ichips.intel.com> Caitlin Bestler wrote: >openib-general-bounces at openib.org wrote: > > >>arlin> DAPL provides a generalized abstraction to RDMA capable >>arlin> transports. As a generalized abstraction, it cannot >>exploit the >>arlin> unique properties that many of the underlying >>arlin> platforms/interconnects can provide so I would like to propose >>a arlin> simple (minimum impact on libdat) extensible interface >>to uDAPL >>arlin> that will allow vendors to expose such capabilities. I >>am looking >>arlin> for feedback, especially from the DAT collaborative. I have >>arlin> included both a design document and actual working code as a >>arlin> reference. >> >>This is an excellent document and clearly certain >>applications will benefit greatly from adding this additional >>functionality. >> >>Since DAPL's inception, the DAT_PROVIDER structure has >>contained a field called "extension" of type void *. The >>purpose of this field was to allow for the kind of >>provider/platform/interconnect specific extensions you describe. >> >>I believe these features can be added without modifications >>to the current API by defining a particular format for the >>DAT_PROVIDER's extension data and indicating its presence via >>a provider attribute. >>That would require creating an extension document like this >>one describing an "extension" structure w/ function pointers >>to the new functions and a well known provider attribute value. >> >>Is there a reason this was not feasible? Would minor >>modifications to the existing framework be sufficient >>(perhaps an "extension" event type)? >> >> >> > > >Good points. > >Promoting something from a provider-specific extension, or even >an extension that many providers agree to, creates an expectation >that other providers SHOULD implement at least an emulation of >this new method if it is at all relevant on their transport. >And at the minimum they have to explicitly reject calls to >the new method. > > I am not promoting these specific extensions for all providers, they are merely working examples that add value to the IB provider and are included so everyone can see exactly how the "proposed extension interface" can plug into new provider specific entry points (i.e. real code, working patch). Please concentrate on the extension method proposed and not the provider specific calls. I am looking for comments on the new extension function definition, extension event processing, event type and data extensions, and cookie extensions. >An extension creates no similar expectations on other Providers. > > I agree.
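For those evaluating the event-processing piece specifically, here is a sketch of consumer-side dispatch against the proposed DAT_EXTENSION_DATA layout. Only the event number, type enum, and field names come from the patch; the surrounding function is illustrative:

#include <stdio.h>
#include <dat/udat.h>

/* sketch of handling the new event class next to normal DTO events */
static void process_event_example(const DAT_EVENT *event)
{
    const DAT_EXTENSION_DATA *ext;

    if (event->event_number != DAT_EXTENSION_EVENT)
        return;                 /* existing event handling applies */

    ext = &event->event_data.extension_data;
    switch (ext->type) {
    case DAT_EXT_RECV_IMMED_DATA_EVENT:
        /* immediate data arrived in the event itself, not the payload */
        printf("immed 0x%x, %d bytes written\n",
               ext->val.immed.data, (int) ext->dto.transfered_length);
        break;
    case DAT_EXT_FETCH_AND_ADD_STATUS:
    case DAT_EXT_CMP_AND_SWAP_STATUS:
        /* original remote value was placed in the local buffer;
           ext->dto.status carries the DTO completion status */
        break;
    default:
        break;
    }
}

Because DAT_EXTENSION_DATA embeds the standard DAT_DTO_COMPLETION_EVENT_DATA, existing cookie and status bookkeeping keeps working for extended completions.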
>_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From mst at mellanox.co.il Mon Dec 26 23:55:40 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 27 Dec 2005 09:55:40 +0200 Subject: [openib-general] checkstack warnings Message-ID: <20051227075540.GZ4907@mellanox.co.il> Hi! Running make checkstack on the openib tree generates quite a long list of stack hogs (below). Do we want to clean the code with respect to these warnings? drivers/infiniband/ulp/ipoib/ib_ipoib.ko 0x00000e57 ipoib_start_xmit: 232 0x000018c9 ipoib_start_xmit: 232 0x00002629 ipoib_ib_dev_stop: 200 0x000029a6 ipoib_ib_dev_stop: 200 0x00004640 ipoib_mcast_send: 184 0x00004b48 ipoib_mcast_send: 184 0x00005b10 ipoib_init_qp: 184 0x00005caa ipoib_init_qp: 184 0x00005000 ipoib_mcast_join: 136 0x000051df ipoib_mcast_join: 136 0x00000a13 path_rec_completion: 120 0x00000cea path_rec_completion: 120 0x00002360 ipoib_ib_post_receive: 120 0x0000250b ipoib_ib_post_receive: 120 0x00003a80 ipoib_mcast_leave: 120 0x00003bb2 ipoib_mcast_leave: 120 0x00002bc0 ipoib_send: 104 0x00002e7e ipoib_send: 104 0x00003244 ipoib_ib_completion: 104 0x000037f9 ipoib_ib_completion: 104 drivers/infiniband/ulp/sdp/ib_sdp.ko 0x000091bd sdp_inet_send: 312 0x00009df7 sdp_inet_send: 312 0x0000afa5 sdp_inet_recv: 296 0x0000bcc6 sdp_inet_recv: 296 0x00005b08 sdp_proc_dump_conn_data: 248 0x00005d76 sdp_proc_dump_conn_data: 248 0x0000a84f sdp_recv_flush: 248 0x0000af7e sdp_recv_flush: 248 0x00005da5 sdp_proc_dump_conn_main: 232 0x00006016 sdp_proc_dump_conn_main: 232 0x0000ddc0 do_link_path_lookup: 232 0x0000e1a4 do_link_path_lookup: 232 0x000058c0 sdp_proc_dump_conn_rdma: 200 0x00005ae7 sdp_proc_dump_conn_rdma: 200 0x00003320 sdp_inet_connect: 184 0x00003840 sdp_inet_connect: 184 0x00007d90 sdp_send_data_queue_test: 184 0x000087e0 sdp_send_data_queue_test: 184 0x00003ad0 sdp_inet_release: 168 0x00003e5f sdp_inet_release: 168 0x0000cec0 sdp_cm_path_complete: 152 0x0000d2e2 sdp_cm_path_complete: 152 0x00002eaf sdp_inet_accept: 136 0x000032fd sdp_inet_accept: 136 0x00005700 sdp_proc_dump_conn_sopt: 136 0x000058aa sdp_proc_dump_conn_sopt: 136 0x00006e88 sdp_conn_internal_lock: 104 0x00006f4d sdp_conn_internal_lock: 104 0x0000c35d sdp_cm_req_handler: 104 0x0000cba5 sdp_cm_req_handler: 104 drivers/infiniband/ulp/srp/ib_srp.ko 0x00001a26 srp_create_target: 200 0x000020ce srp_create_target: 200 0x000016d7 srp_reconnect_target: 192 0x0000192f srp_reconnect_target: 192 0x000003b7 __srp_post_send: 112 0x0000045e __srp_post_send: 112 drivers/infiniband/core/ib_cm.ko 0x0000123d ib_cm_listen: 168 0x000014a6 ib_cm_listen: 168 0x000030ee ib_destroy_cm_id: 144 0x00003397 ib_destroy_cm_id: 144 0x000034a0 cm_send_handler: 136 0x000035f9 cm_send_handler: 136 0x00001ca0 ib_send_cm_apr: 104 0x00001e46 ib_send_cm_apr: 104 drivers/infiniband/core/ib_core.ko 0x00001372 show_sys_image_guid: 176 0x000013f4 show_sys_image_guid: 176 0x00001412 show_node_guid: 176 0x00001494 show_node_guid: 176 0x00002d30 ib_flush_fmr_pool: 136 0x00002e39 ib_flush_fmr_pool: 136 0x00001950 add_port: 104 0x00001b70 add_port: 104 drivers/infiniband/core/ib_mad.ko 0x00003200 ib_register_mad_agent: 184 0x00003b8a ib_register_mad_agent: 184 0x0000297d ib_unregister_mad_agent: 168 0x00002f05 ib_unregister_mad_agent: 168 0x00001bb2 ib_mad_completion_handler: 136 0x00002149 ib_mad_completion_handler: 136 
0x0000216d ib_post_send_mad: 136 0x00002621 ib_post_send_mad: 136 0x00000620 create_mad_qp: 104 0x000006f4 create_mad_qp: 104 0x000010f4 local_completions: 104 0x00001305 local_completions: 104 drivers/infiniband/core/ib_sa.ko 0x0c60 ib_sa_service_rec_callback: 200 0x0cd4 ib_sa_service_rec_callback: 200 0x0130 update_sm_ah: 120 0x0256 update_sm_ah: 120 drivers/infiniband/core/ib_uat.ko 0x00000d10 ib_uat_paths_by_route: 184 0x00000f59 ib_uat_paths_by_route: 184 0x000009a0 ib_uat_ips_by_gid: 136 0x00000bc5 ib_uat_ips_by_gid: 136 0x000003b0 ib_uat_event: 112 0x000006ff ib_uat_event: 112 0x00000f70 ib_uat_route_by_ip: 104 0x00001177 ib_uat_route_by_ip: 104 drivers/infiniband/core/ib_umad.ko 0x00000c30 ib_umad_ioctl: 136 0x00000e81 ib_umad_ioctl: 136 0x00000e90 ib_umad_write: 136 0x0000119c ib_umad_write: 136 0x000011c1 ib_umad_read: 136 0x000013c3 ib_umad_read: 136 drivers/infiniband/core/ib_uverbs.ko 0x00004431 ib_uverbs_query_device: 368 0x000046b0 ib_uverbs_query_device: 368 0x00002ed0 ib_uverbs_create_qp: 280 0x0000335e ib_uverbs_create_qp: 280 0x00002060 ib_uverbs_create_srq: 168 0x0000233d ib_uverbs_create_srq: 168 0x00002aa0 ib_uverbs_create_ah: 168 0x00002d2c ib_uverbs_create_ah: 168 0x00003370 ib_uverbs_modify_qp: 168 0x0000360f ib_uverbs_modify_qp: 168 0x00003c30 ib_uverbs_reg_mr: 168 0x00003f7f ib_uverbs_reg_mr: 168 0x00000eda ib_uverbs_event_read: 136 0x000010cd ib_uverbs_event_read: 136 0x000046c0 ib_uverbs_get_context: 136 0x00004923 ib_uverbs_get_context: 136 0x000042e1 ib_uverbs_query_port: 112 0x00004410 ib_uverbs_query_port: 112 0x000018c0 ib_uverbs_unmarshall_recv: 104 0x00001ac7 ib_uverbs_unmarshall_recv: 104 0x00002350 ib_uverbs_post_srq_recv: 104 0x000024c7 ib_uverbs_post_srq_recv: 104 0x000024d0 ib_uverbs_post_recv: 104 0x00002647 ib_uverbs_post_recv: 104 drivers/infiniband/core/ib_at.ko 0x000010f0 ib_dev_ats_op: 296 0x00001260 ib_dev_ats_op: 296 0x00001548 resolve_ats_ips: 216 0x0000160a resolve_ats_ips: 216 0x00001758 resolve_ats_route: 216 0x00001819 resolve_ats_route: 216 0x000012f0 ib_at_ats_reg: 200 0x0000152e ib_at_ats_reg: 200 0x00001900 resolve_ip: 152 0x00001c25 resolve_ip: 152 0x000021d0 ib_at_route_by_ip: 104 0x000023ca ib_at_route_by_ip: 104 drivers/infiniband/core/ib_ucm.ko 0x000006a0 ib_ucm_init_qp_attr: 344 0x000007b6 ib_ucm_init_qp_attr: 344 0x00000a93 ib_ucm_destroy_id: 136 0x00000c30 ib_ucm_destroy_id: 136 0x000010b0 ib_ucm_path_get: 104 0x00001145 ib_ucm_path_get: 104 0x000013a0 ib_ucm_send_rep: 104 0x000014b2 ib_ucm_send_rep: 104 drivers/infiniband/core/ib_addr.ko 0x05d0 ib_resolve_addr: 184 0x08a3 ib_resolve_addr: 184 0x0040 addr_resolve_remote: 152 0x01c5 addr_resolve_remote: 152 drivers/infiniband/core/rdma_cm.ko 0x00001430 cma_modify_qp_rtr: 184 0x000014c1 cma_modify_qp_rtr: 184 0x000014d0 cma_modify_qp_rts: 184 0x00001531 cma_modify_qp_rts: 184 0x00001890 rdma_create_qp: 168 0x00001998 rdma_create_qp: 168 0x00000b60 cma_modify_qp_err: 152 0x00000b86 cma_modify_qp_err: 152 0x00000691 cma_remove_one: 136 0x00000999 cma_remove_one: 136 0x000012a0 rdma_resolve_route: 120 0x000013c3 rdma_resolve_route: 120 0x00000c90 rdma_connect: 104 0x00000efb rdma_connect: 104 drivers/infiniband/core/rdma_ucm.ko 0x00000900 ucma_init_qp_attr: 344 0x00000a16 ucma_init_qp_attr: 344 0x00000b20 ucma_accept: 344 0x00000bd9 ucma_accept: 344 0x00000c60 ucma_connect: 344 0x00000d0d ucma_connect: 344 0x00000a80 ucma_reject: 296 0x00000b0f ucma_reject: 296 0x00000e60 ucma_query_route: 264 0x00001067 ucma_query_route: 264 0x00000713 ucma_destroy_id: 136 0x000008b0 
ucma_destroy_id: 136 0x00000480 ucma_get_event: 112 0x0000066f ucma_get_event: 112 drivers/infiniband/hw/mthca/ib_mthca.ko 0x000007e0 mthca_init_one: 1048 0x0000200d mthca_init_one: 1048 0x00009d80 mthca_process_mad: 184 0x0000a178 mthca_process_mad: 184 0x000079a0 mthca_free_qp: 136 0x00007bb3 mthca_free_qp: 136 0x00008473 mthca_modify_qp: 136 0x00008f76 mthca_modify_qp: 136 0x0000b500 mthca_create_qp: 136 0x0000b7a9 mthca_create_qp: 136 0x0000c150 mthca_modify_port: 136 0x0000c232 mthca_modify_port: 136 0x000095d0 mthca_multicast_detach: 120 0x00009878 mthca_multicast_detach: 120 0x0000ae60 mthca_reg_phys_mr: 120 0x0000b1a1 mthca_reg_phys_mr: 120 0x0000cba0 mthca_map_user_db: 120 0x0000cdf2 mthca_map_user_db: 120 0x00004473 mthca_free_cq: 112 0x000046b0 mthca_free_cq: 112 0x0000de23 mthca_free_srq: 112 0x0000e02e mthca_free_srq: 112 0x00001225 mthca_map_cmd: 104 0x000014b1 mthca_map_cmd: 104 0x0000337a mthca_buf_alloc: 104 0x0000367d mthca_buf_alloc: 104 0x00004b4f mthca_poll_cq: 104 0x000051ab mthca_poll_cq: 104 0x0000a6a0 mthca_query_gid: 104 0x0000a851 mthca_query_gid: 104 0x000020f6 mthca_create_eq: 104 0x00002550 mthca_create_eq: 104 -- MST From mst at mellanox.co.il Tue Dec 27 01:11:54 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 27 Dec 2005 11:11:54 +0200 Subject: [openib-general] [PATCH] uverbs: error handling fixes Message-ID: <20051227091153.GC4907@mellanox.co.il> uverbs: fix reference counting in error handling. Signed-off-by: Jack Morgenstein Signed-off-by: Michael S. Tsirkin Index: linux-kernel/drivers/infiniband/core/uverbs_cmd.c =================================================================== --- linux-kernel/drivers/infiniband/core/uverbs_cmd.c (revision 4611) +++ linux-kernel/drivers/infiniband/core/uverbs_cmd.c (working copy) @@ -488,6 +488,7 @@ err_idr: idr_remove(&ib_uverbs_mr_idr, obj->uobject.id); err_unreg: + atomic_dec(&pd->usecnt); ib_dereg_mr(mr); err_up: @@ -934,6 +935,11 @@ err_idr: idr_remove(&ib_uverbs_qp_idr, uobj->uevent.uobject.id); err_destroy: + atomic_dec(&pd->usecnt); + atomic_dec(&attr.send_cq->usecnt); + atomic_dec(&attr.recv_cq->usecnt); + if (attr.srq) + atomic_dec(&attr.srq->usecnt); ib_destroy_qp(qp); err_up: @@ -1728,6 +1734,7 @@ err_idr: idr_remove(&ib_uverbs_srq_idr, uobj->uobject.id); err_destroy: + atomic_dec(&pd->usecnt); ib_destroy_srq(srq); err_up: -- MST From danb at voltaire.com Tue Dec 27 05:56:04 2005 From: danb at voltaire.com (Dan Bar Dov) Date: Tue, 27 Dec 2005 15:56:04 +0200 Subject: [openib-general] iser/iscsi merge Message-ID: With svn r4622 iSER (ib_iser.ko) provides the following two APIs: - iscsi transport class defined by the Linux open-iscsi initiator - SCSI LLD defined by the Linux SCSI Mid Layer The Open-iscsi initiator is an open source project whose website and mailing list are http://open-iscsi.org and http://groups.google.co.il/group/open-iscsi/ . The kernel portion of it is made of two modules, scsi_transport_iscsi and iscsi_tcp, which were pushed upstream and exist in the 2.6.15 rc patches (eg patch-2.6.15-rc6.bz2). The Documentation section of the website and specifically the "Startup, Login, Read/Write" slide provides good intro to the initiator SW architecture. This code has been tested with 2.6.14 / openib r4606 / AMD x86_64 system. It was also compile-tested on x86 and ppc64. 
To compile ulp/iser with 2.6.14 the kernel must be patched with 3 files: scsi_transport_iscsi.h iscsi_proto.h iscsi_if.h The patch with the above files is available in https://openib.org/svn/trunk/contrib/voltaire/patches/iscsi_includes.patch Dan From eitan at mellanox.co.il Tue Dec 27 07:40:09 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: 27 Dec 2005 17:40:09 +0200 Subject: [openib-general] [PATCH] osm: support arbitrary ibumad installation path Message-ID: <86zmmm8s6e.fsf@mtl066.yok.mtl.com> Hi Hal Current build for OpenSM assumed ibumad is located always under /usr/local . Some installations require more flexibility. The patch below allows the user to specify either: --with-umad-prefix (then we use $with_umad_prefix/include and lib) or --with-umad-libs and --with-umad-includes This provides the greatest flexibility. Thanks Eitan Signed-off-by: Eitan Zahavi Index: config/osmvsel.m4 =================================================================== --- config/osmvsel.m4 (revision 4622) +++ config/osmvsel.m4 (working copy) @@ -17,34 +17,56 @@ AC_ARG_WITH(osmv, AC_MSG_NOTICE(Using OSM Vendor Type:$with_osmv), with_osmv="openib") -dnl Define a way for the user to provide the path to the driver installation -AC_ARG_WITH(uldrv, -[ --with-uldrv= define the dir where the user level driver is installed], -AC_MSG_NOTICE(Using user level installation prefix:$with_uldrv), -with_uldrv="") +dnl Define a way for the user to provide the path to the ibumad installation +AC_ARG_WITH(umad-prefix, +[ --with-umad-prefix= define the dir used as prefix for ibumad installation], +AC_MSG_NOTICE(Using ibumad installation prefix:$with_umad_prefix), +with_umad_prefix="") + +dnl Define a way for the user to provide the path to the ibumad includes +AC_ARG_WITH(umad-includes, +[ --with-umad-includes= define the dir where ibumad includes are installed], +AC_MSG_NOTICE(Using ibumad includes from:$with_umad_includes), +with_umad_includes="") + +if test x$with_umad_includes = x; then + if test x$with_umad_prefix != x; then + with_umad_includes=$with_umad_prefix/include + fi +fi -dnl Define a way for the user to provide the path to the simulator installation -AC_ARG_WITH(sim, -[ --with-sim= define the simulator prefix for building sim vendor (/usr)], -AC_MSG_NOTICE(Using Simulator from:$with_sim), -with_sim="/usr") +dnl Define a way for the user to provide the path to the ibumad libs +AC_ARG_WITH(umad-libs, +[ --with-umad-libs= define the dir where ibumad libs are installed], +AC_MSG_NOTICE(Using ibumad libs from:$with_umad_libs), +with_umad_libs="") +if test x$with_umad_libs = x; then + if test x$with_umad_prefix != x; then dnl Should we use lib64 or lib if test "$(uname -m)" = "x86_64"; then - osmv_lib_type="lib64" + with_umad_libs=$with_umad_prefix/lib64 else - osmv_lib_type="lib" + with_umad_libs=$with_umad_prefix/lib + fi fi +fi + +dnl Define a way for the user to provide the path to the simulator installation +AC_ARG_WITH(sim, +[ --with-sim= define the simulator prefix for building sim vendor (/usr)], +AC_MSG_NOTICE(Using Simulator from:$with_sim), +with_sim="/usr") dnl based on the with_osmv we can try the vendor flag if test $with_osmv = "openib"; then OSMV_CFLAGS="-DOSM_VENDOR_INTF_OPENIB" OSMV_INCLUDES="-I\$(srcdir)/../include -I\$(srcdir)/../../libibcommon/include/infiniband -I\$(srcdir)/../../libibumad/include/infiniband" - if test "x$with_uldrv" = "x"; then + if test "x$with_umad_libs" = "x"; then OSMV_LDADD="-libumad" else - OSMV_INCLUDES="-I$with_uldrv/include $OSMV_INCLUDES" - 
OSMV_LDADD="-L$with_uldrv/$osmv_lib_type -libumad" + OSMV_INCLUDES="-I$with_umad_includes $OSMV_INCLUDES" + OSMV_LDADD="-L$with_umad_libs -libumad" fi elif test $with_osmv = "sim" ; then OSMV_CFLAGS="-DOSM_VENDOR_INTF_SIM" From bardov at gmail.com Tue Dec 27 08:38:00 2005 From: bardov at gmail.com (Dan Bar Dov) Date: Tue, 27 Dec 2005 18:38:00 +0200 Subject: [openib-general] iser/iscsi merge In-Reply-To: References: Message-ID: I've copied the 2.6.14 patch necessary for iser compilation to https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/linux-2.6.14-iscsi_includes.diff These headers files are part of the 2.6.15-rc6 and later. Dan On 12/27/05, Dan Bar Dov wrote: > > With svn r4622 iSER (ib_iser.ko) provides the following two APIs: > > - iscsi transport class defined by the Linux open-iscsi initiator > - SCSI LLD defined by the Linux SCSI Mid Layer > > The Open-iscsi initiator is an open source project whose website and mailing list are > http://open-iscsi.org and http://groups.google.co.il/group/open-iscsi/ . The kernel portion > of it is made of two modules, scsi_transport_iscsi and iscsi_tcp, which were pushed > upstream and exist in the 2.6.15 rc patches (eg patch-2.6.15-rc6.bz2). The Documentation > section of the website and specifically the "Startup, Login, Read/Write" slide provides > good intro to the initiator SW architecture. > > This code has been tested with 2.6.14 / openib r4606 / AMD x86_64 system. > It was also compile-tested on x86 and ppc64. > > To compile ulp/iser with 2.6.14 the kernel must be patched with 3 files: > scsi_transport_iscsi.h > iscsi_proto.h > iscsi_if.h > The patch with the above files is available in > https://openib.org/svn/trunk/contrib/voltaire/patches/iscsi_includes.patch > > Dan > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From jlentini at netapp.com Tue Dec 27 08:55:14 2005 From: jlentini at netapp.com (James Lentini) Date: Tue, 27 Dec 2005 11:55:14 -0500 (EST) Subject: [openib-general] Re: [RFC][PATCH] OpenIB uDAPL extension proposal - sample immed data and atomic api's In-Reply-To: <43AC57C2.8050509@ichips.intel.com> References: <43AC57C2.8050509@ichips.intel.com> Message-ID: On Fri, 23 Dec 2005, Arlin Davis wrote: arlin> James Lentini wrote: arlin> arlin> > arlin> DAPL provides a generalized abstraction to RDMA capable arlin> arlin> > transports. As a generalized abstraction, it cannot exploit the arlin> arlin> > unique properties that many of the underlying arlin> arlin> > platforms/interconnects can provide so I would like to propose arlin> a arlin> > simple (minimum impact on libdat) extensible interface to arlin> uDAPL arlin> > that will allow vendors to expose such capabilities. I am arlin> arlin> > looking for feedback, especially from the DAT collaborative. I arlin> arlin> > have included both a design document and actual working code as arlin> arlin> > a reference. arlin> > arlin> > This is an excellent document and clearly certain applications will arlin> > benefit greatly from adding this additional functionality. arlin> > arlin> > Since DAPL's inception, the DAT_PROVIDER structure has contained a arlin> > field called "extension" of type void *. The purpose of this field was arlin> > to allow for the kind of provider/platform/interconnect specific arlin> > extensions you describe. 
arlin> > arlin> > I believe these features can be added without modifications to the arlin> > current API by defining a particular format for the DAT_PROVIDER's arlin> > extension data and indicating its presence via a provider attribute. arlin> > That would require creating an extension document like this one arlin> > describing an "extension" structure w/ function pointers to the new arlin> > functions and a well known provider attribute value. arlin> > Is there a reason this was not feasible? Would minor modifications to arlin> > the existing framework be sufficient (perhaps an "extension" event arlin> > type)? arlin> arlin> A single entry point is still there with this patch, I just arlin> defined it a little different with a function definition for arlin> better DAT API mappings. The idea was to replace the existing arlin> pvoid extension definition with this new one. Can you give me arlin> an idea of how you would map these extended DAT calls to this arlin> pvoid function definition? For uDAPL, the DAT_PROVIDER structure is defined as follows: struct dat_provider { const char * device_name; DAT_PVOID extension; ... You could create a well known extensions API by defining a structure with several function pointers struct dat_atomic_extensions { DAT_RETURN (*cmp_and_swap_func)(IN DAT_EP_HANDLE ep_handle, IN DAT_UINT64 cmp_value, IN DAT_UINT64 swap_value, IN DAT_LMR_TRIPLE *local_iov, IN DAT_DTO_COOKIE user_cookie, IN DAT_RMR_TRIPLE *remote_iov, IN DAT_COMPLETION_FLAGS completion_flags); ... } and require the dat_provider's extensions member to point to your new extension struct. To make the API easier to use, you could also create macros, similar to the standard DAT macros, to reach inside an objects provider structure and call the correct extension function. #define dat_ep_post_cmp_and_swap(ep, cmp, swap, local_iov, cookie, remote_iov, flags) \ (*DAT_HANDLE_TO_EXTENSION (ep)->cmp_and_swap_func) (\ (ep), \ (cmp), \ (swap), \ (local_iov), \ (cookie), \ (remote_iov), \ (flags)) A drawback to this approach is that adding new extensions requires synchronizing with the original extension specification document. To eliminate that issue, you could require that the dat_provider's extension member point to a typed list of these sorts of extension structures. arlin> What is your opinion on the way I extended event data, dapl arlin> event processing, event types, and cookies? These looked good to me. I also agree with Arkady that these functions are of such general utility that they should be considered for inclusion into the DAT specification. From halr at voltaire.com Tue Dec 27 10:24:54 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Dec 2005 13:24:54 -0500 Subject: [openib-general] OpenIB Design Document In-Reply-To: <309a667c0512260432w20ae94f4u73abdd95959b95f4@mail.gmail.com> References: <309a667c0512260432w20ae94f4u73abdd95959b95f4@mail.gmail.com> Message-ID: <1135707839.4328.159780.camel@hal.voltaire.com> On Mon, 2005-12-26 at 07:32, Devesh Sharma wrote: > Hi all, > > I have a query regarding the design specification currently openib > fourm is following is it the document which is available on > /trunk/contrib/sourceforge/Documentation/LinuxSAS.1.0.1.pdf This document pertains to the Intel gen1 implementation (IBAL based). > OR > > trunk/contrib/voltaire/openib-access.pdf This document was a proposal for OpenIB (gen2) architecture. The gen2 architecture roughly follows this but there are numerous changes from this. > Or there is no difference between the two file? 
The document on OpenIB (gen2) architecture can be found in: https://openib.org/svn/gen2/trunk/arch/ It is a high level document and also needs some updating. -- Hal > please help me out on this. > > Devesh > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From halr at voltaire.com Tue Dec 27 12:01:29 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Dec 2005 15:01:29 -0500 Subject: [openib-general] Re: [PATCH] osm: support arbitrary ibumad installation path In-Reply-To: <86zmmm8s6e.fsf@mtl066.yok.mtl.com> References: <86zmmm8s6e.fsf@mtl066.yok.mtl.com> Message-ID: <1135713688.4328.160816.camel@hal.voltaire.com> On Tue, 2005-12-27 at 10:40, Eitan Zahavi wrote: > Hi Hal > > Current build for OpenSM assumed ibumad is located always under > /usr/local . Some installations require more flexibility. > The patch below allows the user to specify either: > --with-umad-prefix (then we use $with_umad_prefix/include and lib) > or > --with-umad-libs and --with-umad-includes > > This provides the greatest flexibility. Thanks. Applied.
From surs at cse.ohio-state.edu Tue Dec 27 15:52:33 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Tue, 27 Dec 2005 18:52:33 -0500 Subject: [openib-general] error modifying QP capabilities Message-ID: <20051227235230.GA8353@cse.ohio-state.edu> Hi, I am trying to modify the QP capabilities (after the QP has been created and transitioned to IBV_QPS_RTS successfully). I am getting a return error value of 22. Are there any other conditions I need to take care of before modifying the QP? I am using Arbel (MemFree) cards with firmware version 5.1.0. The OpenIB svn revision number is 4594 (on linux 2.6.14) <--- Snippet ---> struct ibv_qp_attr attr; int ret = 0; memset(&attr, 0, sizeof(struct ibv_qp_attr)); attr.cap.max_send_wr = 128; /* initial value was 32 */ attr.cap.max_recv_wr = 0; attr.cap.max_send_sge = 1; attr.cap.max_recv_sge = 1; attr.cap.max_inline_data = 128; if((ret = ibv_modify_qp(c->qp_hndl, &attr, IBV_QP_CAP)) != 0) { error_abort_all(GEN_EXIT_ERR, "Couldn't modify QP size, ret %d", ret) } <--- /Snippet ---> TIA, Sayantan. -- http://www.cse.ohio-state.edu/~surs
From johann at pathscale.com Tue Dec 27 18:02:55 2005 From: johann at pathscale.com (Johann George) Date: Tue, 27 Dec 2005 18:02:55 -0800 Subject: [openib-general] PathScale license In-Reply-To: <1135363454.4328.95007.camel@hal.voltaire.com> References: <1135363454.4328.95007.camel@hal.voltaire.com> Message-ID: <20051228020255.GA3280@cuprite.internal.keyresearch.com> We have heard the issues that have been raised regarding the PathScale license. PathScale's intent is solely to protect its hardware IP and not to limit use of the software in any way. PathScale's use of this language is not original. SGI has used, and perhaps originated, the additional language. It currently appears in several files in the Linux kernel. As an example, see fs/xfs/linux-2.6/kmem.c At the bottom of their license, they provide a URL which no longer exists. Going through the web archives, it appears that the following is a facsimile: http://web.archive.org/web/20040311225417/oss.sgi.com/projects/GenInfo/NoticeExplan/ In it they claim that the additional language has been reviewed by the Free Software Foundation which concluded there was no conflict between it and the GPL and LGPL.
PathScale's intent is to comply with the licenses that OpenIB allows for contributions. Nevertheless, in our ongoing efforts to be as friendly to the Open Source community as possible, we will review our language as soon as people return in the new year. We are fully supportive of OpenIB and do not want to see it hindered in any way. Johann
From dotanb at mellanox.co.il Tue Dec 27 22:25:49 2005 From: dotanb at mellanox.co.il (Dotan Barak) Date: Wed, 28 Dec 2005 08:25:49 +0200 Subject: [openib-general] error modifying QP capabilities Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3FA2CDC@mtlexch01.mtl.com> > I am trying to modify the QP capabilities (after the QP has > been created > and transitioned to IBV_QPS_RTS successfully). I am getting a return > error value of 22. Are there any other conditions I need to > take care of > before modifying the QP? I am using Arbel (MemFree) cards > with firmware > version 5.1.0. The OpenIB svn revision number is 4594 (on > linux 2.6.14) I looked at the code of the mthca driver and it doesn't support resize of the QP. Dotan
From yipeeyipeeyipeeyipee at yahoo.com Wed Dec 28 06:16:49 2005 From: yipeeyipeeyipeeyipee at yahoo.com (yipee) Date: Wed, 28 Dec 2005 14:16:49 +0000 (UTC) Subject: [openib-general] understanding mthca_alloc_db() Message-ID: Hi, I'm trying to understand what's going on inside mthca_alloc_db() (in libmthca/src/memfree.c). I understand that this function manages a bitmap-based freelist of doorbells for send, receive & completion queues. Is there a possibility of a memory leak when posix_memalign() steps on a previously allocated 'db_tab->page[i].db_rec' (when all of 'db_tab->page[i].free[j]' is cleared)? Isn't 'i' already out of bounds? The function defines two types/groups of doorbells. Why are these doorbells allocated differently (one group starts at the beginning of the array and the other at the end)? Another thing I noticed is that doorbells are different between Tavor and Arbel HCAs (e.g., see update_cons_index()). Is it correct that Arbel doorbells are only 32 bits wide? thanks, x
From dotanb at mellanox.co.il Wed Dec 28 06:52:33 2005 From: dotanb at mellanox.co.il (Dotan Barak) Date: Wed, 28 Dec 2005 16:52:33 +0200 Subject: [openib-general] srq_pingpong with many QPs and events may never end Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3FA2E7B@mtlexch01.mtl.com> Here is a patch to solve this problem: Empty the CQ on each event (there can be more than 1 QP connected to a CQ in this test) before arming it. Make sure there is a receive work request for each QP.
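In outline, the event handling this moves the test toward looks like the simplified sketch below (not part of the patch itself; process_completions() is a hypothetical stand-in for the test's switch on wr_id):

/* Re-arm the CQ first, then drain it completely: one completion
 * event may cover work completions from several QPs sharing the CQ. */
struct ibv_cq *ev_cq;
void *ev_ctx;
int ne;

if (ibv_get_cq_event(channel, &ev_cq, &ev_ctx))
	return 1;
if (ibv_req_notify_cq(ev_cq, 0))
	return 1;
while ((ne = ibv_poll_cq(ev_cq, num_of_wc, wc_arr)) > 0)
	process_completions(wc_arr, ne);
if (ne < 0)
	return 1;	/* poll failed */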
Signed-off-by: Dotan Barak Index: last_stable/src/userspace/libibverbs/examples/srq_pingpong.c =================================================================== --- last_stable.orig/src/userspace/libibverbs/examples/srq_pingpong.c +++ last_stable/src/userspace/libibverbs/examples/srq_pingpong.c @@ -514,9 +514,11 @@ int main(int argc, char *argv[]) struct pingpong_context *ctx; struct pingpong_dest my_dest[MAX_QP]; struct pingpong_dest *rem_dest; + struct ibv_wc *wc_arr; struct timeval start, end; char *ib_devname = NULL; char *servername = NULL; + int num_of_wc; int port = 18515; int ib_port = 1; int size = 4096; @@ -603,6 +605,13 @@ int main(int argc, char *argv[]) return 1; } + num_of_wc = num_qp + rx_depth; + wc_arr = malloc(num_of_wc * sizeof *wc_arr); + if (!wc_arr) { + fprintf(stderr, "Failed to allocate memory\n"); + return 1; + } + page_size = sysconf(_SC_PAGESIZE); dev_list = ibv_get_device_list(NULL); @@ -714,11 +723,10 @@ int main(int argc, char *argv[]) } { - struct ibv_wc wc[2]; int ne, qp_ind; do { - ne = ibv_poll_cq(ctx->cq, 2, wc); + ne = ibv_poll_cq(ctx->cq, num_of_wc, wc_arr); } while (!use_event && ne < 1); if (ne < 0) { @@ -727,26 +735,26 @@ int main(int argc, char *argv[]) } for (i = 0; i < ne; ++i) { - if (wc[i].status != IBV_WC_SUCCESS) { + if (wc_arr[i].status != IBV_WC_SUCCESS) { fprintf(stderr, "Failed status %d for wr_id %d\n", - wc[i].status, (int) wc[i].wr_id); + wc_arr[i].status, (int) wc_arr[i].wr_id); return 1; } - qp_ind = find_qp(wc[i].qp_num, ctx, num_qp); + qp_ind = find_qp(wc_arr[i].qp_num, ctx, num_qp); if (qp_ind < 0) { fprintf(stderr, "Couldn't find QPN %06x\n", - wc[i].qp_num); + wc_arr[i].qp_num); return 1; } - switch ((int) wc[i].wr_id) { + switch ((int) wc_arr[i].wr_id) { case PINGPONG_SEND_WRID: ++scnt; break; case PINGPONG_RECV_WRID: - if (--routs <= 1) { + if (--routs <= num_qp) { routs += pp_post_recv(ctx, ctx->rx_depth - routs); if (routs < ctx->rx_depth) { fprintf(stderr, @@ -761,11 +769,11 @@ int main(int argc, char *argv[]) default: fprintf(stderr, "Completion for unknown wr_id %d\n", - (int) wc[i].wr_id); + (int) wc_arr[i].wr_id); return 1; } - ctx->pending[qp_ind] &= ~(int) wc[i].wr_id; + ctx->pending[qp_ind] &= ~(int) wc_arr[i].wr_id; if (scnt < iters && !ctx->pending[qp_ind]) { if (pp_post_send(ctx, qp_ind)) { fprintf(stderr, "Couldn't post send\n"); Dotan Barak Software Verification Engineer Mellanox Technologies LTD Tel: +972-4-9097200 Ext: 231 Fax: +972-4-9593245 P.O. Box 86 Yokneam 20692 ISRAEL. Home: +972-77-8841095 Cell: 052-4222383 [ May the fork be with you ] -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Wed Dec 28 10:46:07 2005 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 28 Dec 2005 10:46:07 -0800 Subject: [openib-general] error modifying QP capabilities References: <20051227235230.GA8353@cse.ohio-state.edu> Message-ID: Sayantan> Hi, I am trying to modify the QP capabilities (after the Sayantan> QP has been created and transitioned to IBV_QPS_RTS Sayantan> successfully). I am getting an return error value of Sayantan> 22. Is there any other conditions I need to take care of Sayantan> before modifying the QP? I am using Arbel (MemFree) Sayantan> cards with firmware version 5.1.0. The OpenIB svn Sayantan> revision number is 4594 (on linux 2.6.14) The current mthca driver does not support resizing QPs unfortunately. Making this work would be quite a bit of work... - R. 
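For what it's worth, one workaround is to size the send queue at creation time instead, clamped against the device limit reported by ibv_query_device(). A minimal sketch (context, pd and cq are assumed to exist already):

struct ibv_device_attr dev_attr;
struct ibv_qp_init_attr init_attr;

if (ibv_query_device(context, &dev_attr))
	return NULL;
memset(&init_attr, 0, sizeof init_attr);
init_attr.send_cq = cq;
init_attr.recv_cq = cq;
init_attr.qp_type = IBV_QPT_RC;
/* ask for the worst-case depth up front rather than resizing later */
init_attr.cap.max_send_wr = dev_attr.max_qp_wr < 128 ?
			    dev_attr.max_qp_wr : 128;
init_attr.cap.max_recv_wr = 1;
init_attr.cap.max_send_sge = 1;
init_attr.cap.max_recv_sge = 1;
return ibv_create_qp(pd, &init_attr);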
From bboas at llnl.gov Sat Dec 24 11:00:42 2005 From: bboas at llnl.gov (Bill Boas) Date: Sat, 24 Dec 2005 11:00:42 -0800 Subject: [openib-general] RE: Technical content of Sonoma Workshop Feb 5-8 In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0006728A1D@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0006728A1D@orsmsx408> Message-ID: <6.2.3.4.2.20051224090550.02005458@mail-lc.llnl.gov> Woody, It'll be a longshot for the Pats to get to the Superbowl this year, I think. But I hope! Your list is a great start, but isn't each item you mention in the context of Release 1.0? From the Labs and Wall Street perspectives, the preference is to "tie a ribbon around" Rel 1.0 (both Windows and Linux) as soon as we can, and go to the next stage of the evolution of the stack. So that means making the definition of the content of Rel 2.0 the main technical focus of the workshop. OpenIB PathForward Phase 2, iWARP integration, QOS, improved OpenSM, and more ...... Perhaps Matt, Steve Poole, Hal, Roland, HB, Peter, and others will join in this discussion and express different opinions..... Bill. At 02:56 PM 12/23/2005, you wrote: >I'll give it some thought and try to start a discussion on >the list. Some ideas for a technical track that come to mind are: > >RDS - perhaps we could get someone from Oracle and Silverstorm >to present something on this. There has been some discussion on >the list, but not sure we have everyone aligned on what needs to >be done for this. > >Core S/W update. where we are and where we are going moving forward. > >Generic RDMA support, what is there, what needs to be done. > >iSer update. > >SDP update, what needs to be done before it is ready to be pushed >upstream. > >OpenMPI update > >OpenSM and diags update > >Linux distributor update, RedHat, Suse, ... > >New H/W support, Pathscale, IBM ? > >Why the Patriots didn't win another superbowl, can we give someone else >a turn please... > >Were there any specific topics that the DOE folks would like to >hear on the technical side ? > >I'll be OOP on vacation next week, but will probably >be checking email and perhaps we can start a discussion on the list. > >woody > >-----Original Message----- >From: Bill Boas [mailto:bboas at llnl.gov] >Sent: Friday, December 23, 2005 12:02 PM >To: Woodruff, Robert J >Subject: RE: [openib-general] Please register for Sonoma Workshop > >No agenda yet, and definitely need help......I was planning to send >out ideas... maybe you could start that process, please. > >At 01:42 PM 12/22/2005, you wrote: > > > >Hi Bill, > > > >Do you have a proposed agenda for this yet > >or need any help in putting one together. > > > >Trying to determine who from my team should attend. > > > >woody > >Bill Boas bboas at llnl.gov >ICCD LLNL, B-453, R-2018 Wk: 925-422-4110 >7000 East Ave, L-555 Cell: 925-337-2224 >Livermore, CA 94551 Pgr: 877-203-2248 Bill Boas bboas at llnl.gov ICCD LLNL, B-453, R-2018 Wk: 925-422-4110 7000 East Ave, L-555 Cell: 925-337-2224 Livermore, CA 94551 Pgr: 877-203-2248
From head.bubba at csfb.com Tue Dec 27 04:57:28 2005 From: head.bubba at csfb.com (Head Bubba) Date: Tue, 27 Dec 2005 12:57:28 -0000 Subject: [openib-general] RE: Technical content of Sonoma Workshop Feb 5-8 Message-ID: Windows and Linux were required yesterday...
Since we will have both Mellanox and PathScale for a roundtable session, we should also add any enhancements we need to future HCA and the firmware as well to the discussion - from a little SDP Proof of Concept we did at CSFB, we ended up needing a firmware upgrade As for SDP... At the roundtable we did not go into the gory details of the SDP Proof Of Concept we did at CSFB with Mellanox in which we could dynamically change virtual lane being used, so I think at this we should get Mellanox to go over the details with us to get this something done with SDP in OpenIB to have the real implementation it needs (for those not having the details, contact Nimrod). A better implementation of SDP is needed. This is a good first step to get off of TCP/IP without code changes, but this has also been problematic in our experience. Additionally, it needs to be code better to deliver at near native performance /// we use SDP to eliminate TCP/IP issues, so IPoIB is not viable for us As for RDS, we should all see who has it aside from Bubba which everyone knows about, and whether or not we can get an end-user experience discussed We also would like to virtualize everything... the server, the desktop, the fabric, the storage, etc... to create a Virtual Resource Market (VRM) -----Original Message----- From: Bill Boas [mailto:bboas at llnl.gov] Sent: Saturday, December 24, 2005 2:01 PM To: Woodruff, Robert J Cc: Matt Leininger; Steve Poole; Hal Rosenstock; Roland Dreier; Head Bubba; Peter Krey; openib-promoters at openib.org; openib-windows at openib.org; openib-general at openib.org Subject: RE: Technical content of Sonoma Workshop Feb 5-8 Woody, It'll be a longshot for the Pats to get to theSuperbowl this year, I think. But I hope! Your list is a great start, but isn't each item you mention in the context of Release 1.0? From the Labs and Wall Street perspectives, the preference is to"tie a ribbon around" Rel 1.0 (both Windows and Linux) as soon as we can, and go to the next stage of the evolution of the stack. So that means making the definition of the content of Rel 2.0 the main technical focus of the workshop. OpenIB PathForward Phase 2, iWARP integration, QOS, improved OpenSM, and more ...... Perhaps Matt, Steve Poole, Hal, Roland, HB, Peter,and others will join in this discussion and express different opinions..... Bill. At 02:56 PM 12/23/2005, you wrote: >I'll give it some thought and try to start a discussion on >the list. Some ideas for a technical track that come to mind are: > >RDS - perhaps we could get someone from Oracle and Silverstorm >to present something on this. There has been some discussion on >the list, but not sure we have everyone aligned on what needs to >be done for this. > >Core S/W update. where we are and where we are going moving forward. > >Generic RDMA support, what is there, what needs to be done. > >iSer update. > >SDP update, what needs to be done before it is ready to be pushed >upstream. > >OpenMPI update > >OpenSM and diags update > >Linux distributor update, RedHat, Suse, ... > >New H/W support, Pathscale, IBM ? > >Why the Patriots didn't win another superbowl, can we give someone else >a turn please... > >Were there any specific topics that the DOE folks would like to >hear on the technical side ? > >I'll be OOP on vacation next week, but will probably >be checking email and perhaps we can start a discussion on the list. 
> >woody > >-----Original Message----- >From: Bill Boas [mailto:bboas at llnl.gov] >Sent: Friday, December 23, 2005 12:02 PM >To: Woodruff, Robert J >Subject: RE: [openib-general] Please register for Sonoma Workshop > >No agenda yet, and definitely need help......I was planning to send >out ideas... maybe you could start that process, please. > >At 01:42 PM 12/22/2005, you wrote: > > > >Hi Bill, > > > >Do you have a proposed agenda for this yet > >or need any help in putting one together. > > > >Trying to determine who from my team should attend. > > > >woody > >Bill Boas bboas at llnl.gov >ICCD LLNL, B-453, R-2018 Wk: 925-422-4110 >7000 East Ave, L-555 Cell: 925-337-2224 >Livermore, CA 94551 Pgr: 877-203-2248 Bill Boas bboas at llnl.gov ICCD LLNL, B-453, R-2018 Wk: 925-422-4110 7000 East Ave, L-555 Cell: 925-337-2224 Livermore, CA 94551 Pgr: 877-203-2248 ============================================================================== Please access the attached hyperlink for an important electronic communications disclaimer: http://www.csfb.com/legal_terms/disclaimer_external_email.shtml ============================================================================== From robert.j.woodruff at intel.com Wed Dec 28 10:09:48 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Wed, 28 Dec 2005 10:09:48 -0800 Subject: [openib-general] RE: Technical content of Sonoma Workshop Feb 5-8 In-Reply-To: <6.2.3.4.2.20051224090550.02005458@mail-lc.llnl.gov> Message-ID: Bill Boas wrote, >Woody, >It'll be a longshot for the Pats to get to theSuperbowl this year, I >think. But I hope! I have not watched too much football this year, but I did see the Pats play the other day, and they still look pretty darn good, so I would not count them out yet. >Your list is a great start, but isn't each item you mention in the >context of Release 1.0? I was thinking more about what is coming in the future (for 2.0) for things like the core, opensm, RDMA support etc. >From the Labs and Wall Street perspectives, the preference is to"tie >a ribbon around" Rel 1.0 (both Windows and Linux) as soon as we can, >and go to the next stage of the evolution of the stack. I agree. >So that means making the definition of the content of Rel 2.0 the >main technical focus of the workshop. Agreed. Let's focus on what needs to be done going forward and not too much on what has been done, other than what, if anything, needs to be finished up for a release 1.0. >OpenIB PathForward Phase 2, iWARP integration, QOS, improved OpenSM, >and more ...... woody From robert.j.woodruff at intel.com Wed Dec 28 10:25:12 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Wed, 28 Dec 2005 10:25:12 -0800 Subject: [openib-general] RE: Technical content of Sonoma Workshop Feb 5-8 In-Reply-To: Message-ID: Head Bubba wrote, >Since we will have both Mellanox and PathScale for a roundtable session, we >should also add any enhancements we need to future HCA and the firmware as well >to the discussion - from a little SDP Proof of Concept we did at CSFB, we ended >up needing a firmware upgrade Yes. updates from the hardware vendors and feedback from the user community on what is needed in the hardware/firmware would be great. >As for SDP... 
>At the roundtable we did not go into the gory details of the SDP Proof Of >Concept we did at CSFB with Mellanox in which we could dynamically change >virtual lane being used, so I think at this we should get Mellanox to go >over the details with us to get this something done with SDP in OpenIB to have >the real implementation it needs (for those not having the details, contact >Nimrod).A better implementation of SDP is needed. This is a good first step to >get off of TCP/IP without code changes, but this has >also been problematic in our experience. Additionally, it needs to be code better to deliver at near native performance /// >ie use SDP to eliminate TCP/IP issues, so IPoIB is not viable for us I think that the SDP discussion can be split into 2 areas, what is needed short term to get the code into shape for submission upstream, and second, what features are needed in a future 2.0 release. Feedback from both the maintainers and the user community would be helpful here. >As for RDS, we should all see who has it aside from Bubba which everyone knows >about, and whether or not we can get an end-user experience discussed I think the folks from Oracle and Silverstorm have the most to say about RDS, so perhaps they could present a session on this. Richard Frank/Ranjit Pandit ? >We also would like to virtualize everything... the server, the desktop, the >fabric, the storage, etc... to create a Virtual Resource Market (VRM) Good one. We should have a session on virtualization, what has been done so far and what is needed. I think that Mellanox has already done some work on this. Perhaps they could lead a session on this ? woody
From rdreier at cisco.com Wed Dec 28 10:54:39 2005 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 28 Dec 2005 10:54:39 -0800 Subject: [openib-general] understanding mthca_alloc_db() In-Reply-To: (yipee's message of "Wed, 28 Dec 2005 14:16:49 +0000 (UTC)") References: Message-ID: yipee> Hi, I'm trying to understand what's going on inside yipee> mthca_alloc_db() (in libmthca/ src/memfree.c). I yipee> understand that this function manages a bitmap-based yipee> freelist of doorbells for send, receive & completion yipee> queues. Is there a possibility of a memory leak when yipee> posix_memalign() steps on a previously allocated yipee> 'db_tab->page[i].db_rec' (when all of 'db_tab->page[i]. yipee> free[j]' is cleared)? isn't 'i' already out of bounds? I don't think so. i is being set by the loop for (i = start; i != end; i += dir) so if we fall through the end of loop, then i == end, which is where we want to allocate a new doorbell page. yipee> The functions defines two types/groups of doorbells. Why yipee> are these doorbells allocated differently (one group starts yipee> at the begining of the array and the other at the end)? This is the way the hardware works. yipee> Another thing I noticed is that doorbells are different yipee> between Tavor and Arbel HCA's (e.g. see yipee> update_cons_index(). Is it correct that Arbel doorbells are yipee> only 32 bits wide? Sort of. It is definitely true that Tavor-mode doorbells work differently from Arbel/mem-free-mode doorbells. - R.
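For readers of the thread, the shape of that scan (paraphrased from memory, simplified, and with invented helper names; not the literal libmthca source) is roughly:

/* Group 0 doorbells are carved from the front of the page table and
 * group 1 from the back, mirroring the hardware layout, hence the
 * flipped scan direction. */
if (group == 0) {
	start = 0;
	end = max_group1;
	dir = 1;
} else {
	start = npages - 1;
	end = min_group2;
	dir = -1;
}
for (i = start; i != end; i += dir)
	if (page_has_free_db(db_tab, i))
		return alloc_db_from_page(db_tab, i, group);
/* fall-through: i == end, so a fresh doorbell page is allocated here */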
From mst at mellanox.co.il Wed Dec 28 12:13:29 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 28 Dec 2005 22:13:29 +0200 Subject: [openib-general] Re: error modifying QP capabilities In-Reply-To: References: Message-ID: <20051228201328.GA2720@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: error modifying QP capabilities > > Sayantan> Hi, I am trying to modify the QP capabilities (after the > Sayantan> QP has been created and transitioned to IBV_QPS_RTS > Sayantan> successfully). I am getting a return error value of > Sayantan> 22. Are there any other conditions I need to take care of > Sayantan> before modifying the QP? I am using Arbel (MemFree) > Sayantan> cards with firmware version 5.1.0. The OpenIB svn > Sayantan> revision number is 4594 (on linux 2.6.14) > > The current mthca driver does not support resizing QPs unfortunately. > Making this work would be quite a bit of work... > > - R. Maybe we should be returning EOPNOTSUPP rather than EINVAL then. Makes sense? -- MST
From rdreier at cisco.com Wed Dec 28 12:15:38 2005 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 28 Dec 2005 12:15:38 -0800 Subject: [openib-general] Re: error modifying QP capabilities In-Reply-To: <20051228201328.GA2720@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 28 Dec 2005 22:13:29 +0200") References: <20051228201328.GA2720@mellanox.co.il> Message-ID: Michael> Maybe we should be returning EOPNOTSUPP rather than Michael> EINVAL then. Makes sense? I guess so, but I'm not sure it's really worth worrying about. - R.
From rdreier at cisco.com Wed Dec 28 12:23:51 2005 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 28 Dec 2005 12:23:51 -0800 Subject: [openib-general] Re: checkstack warnings In-Reply-To: <20051227075540.GZ4907@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 27 Dec 2005 09:55:40 +0200") References: <20051227075540.GZ4907@mellanox.co.il> Message-ID: Michael> Hi! Running make checkstack on the openib tree generates Michael> quite a long list of stack hogs (below). Do we want to Michael> clean the code with respect to these warnings? It wouldn't hurt to try and improve things, but I don't see anything really scary here... > 0x000007e0 mthca_init_one: 1048 This one seems pretty bad -- I guess it's -funit-at-a-time going berserk. Are you building for x86_64? What compiler version? With $ gcc --version gcc (GCC) 4.0.2 (Debian 4.0.2-2) I get something slightly different: 0x000005a0 mthca_init_hca: 568 is the top offender, and 0x0000e9a0 mthca_process_mad: 168 is the second-biggest function. In any case it might be worth dynamically allocating some of the large structures used in the init path, just to be safe. - R.
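The kind of change being suggested, sketched against the init path (illustrative only; mthca_init_hca()'s actual locals and flow differ):

static int mthca_init_hca(struct mthca_dev *mdev)
{
	struct mthca_init_hca_param *init_hca;
	int err;

	/* keep the large parameter block off the stack */
	init_hca = kmalloc(sizeof *init_hca, GFP_KERNEL);
	if (!init_hca)
		return -ENOMEM;
	memset(init_hca, 0, sizeof *init_hca);

	/* ... fill in *init_hca, issue the commands, set err ... */

	kfree(init_hca);
	return err;
}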
From mst at mellanox.co.il Wed Dec 28 12:41:31 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 28 Dec 2005 22:41:31 +0200 Subject: [openib-general] Re: checkstack warnings In-Reply-To: References: Message-ID: <20051228204131.GB2720@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: checkstack warnings > > Michael> Hi! Running make checkstack on the openib tree generates > Michael> quite a long list of stack hogs (below). Do we want to > Michael> clean the code with respect to these warnings? > > It wouldn't hurt to try and improve things, but I don't see anything > really scary here... > > > 0x000007e0 mthca_init_one: 1048 > > This one seems pretty bad -- I guess it's -funit-at-a-time going > berserk. Are you building for x86_64? What compiler version? > > With > > $ gcc --version > gcc (GCC) 4.0.2 (Debian 4.0.2-2) > > I get something slightly different: > > 0x000005a0 mthca_init_hca: 568 Yes, it's a slightly old gcc (GCC) 3.3.5 20050117 (prerelease) (SUSE Linux) > is the top offender, and > > 0x0000e9a0 mthca_process_mad: 168 > > is the second-biggest function. I guess gcc 4.0.2 is better at stack utilization. > In any case it might be worth dynamically allocating some of the large > structures used in the init path, just to be safe. > > - R.
> -- MST From halr at voltaire.com Wed Dec 28 12:54:07 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 28 Dec 2005 15:54:07 -0500 Subject: [openib-general] [PATCH] OpenSM/osm_helper.c: In osm_dump_smp_dr_path, display DR LIDs only if DR SMP Message-ID: <1135803246.7258.458.camel@hal.voltaire.com> OpenSM/osm_helper.c: In osm_dump_smp_dr_path, display DR LIDs only if DR SMP Signed-off-by: Hal Rosenstock Index: osm_helper.c =================================================================== --- osm_helper.c (revision 4645) +++ osm_helper.c (working copy) @@ -1457,9 +1457,7 @@ osm_dump_dr_smp( "\t\t\t\tattr_id.................0x%X (%s)\n" "\t\t\t\tresv....................0x%X\n" "\t\t\t\tattr_mod................0x%X\n" - "\t\t\t\tm_key...................0x%016" PRIx64 "\n" - "\t\t\t\tdr_slid.................0x%X\n" - "\t\t\t\tdr_dlid.................0x%X\n", + "\t\t\t\tm_key...................0x%016" PRIx64 "\n", p_smp->hop_ptr, p_smp->hop_count, cl_ntoh64(p_smp->trans_id), @@ -1467,14 +1465,20 @@ osm_dump_dr_smp( ib_get_sm_attr_str( p_smp->attr_id ), cl_ntoh16(p_smp->resv), cl_ntoh32(p_smp->attr_mod), - cl_ntoh64(p_smp->m_key), - cl_ntoh16(p_smp->dr_slid), - cl_ntoh16(p_smp->dr_dlid) + cl_ntoh64(p_smp->m_key) ); strcat( buf, line ); if (p_smp->mgmt_class == IB_MCLASS_SUBN_DIR) { + sprintf( line, + "\t\t\t\tdr_slid.................0x%X\n" + "\t\t\t\tdr_dlid.................0x%X\n", + cl_ntoh16(p_smp->dr_slid), + cl_ntoh16(p_smp->dr_dlid) + ); + strcat( buf, line ); + strcat( buf, "\n\t\t\t\tInitial path: " ); for( i = 0; i <= p_smp->hop_count; i++ ) @@ -1652,7 +1656,7 @@ osm_dump_smp_dr_path( if( osm_log_is_active( p_log, log_level) ) { - sprintf( buf, "Received a SMP on a %u hop path:" + sprintf( buf, "Received SMP on a %u hop path:" "\n\t\t\t\tInitial path = ", p_smp->hop_count ); for( i = 0; i <= p_smp->hop_count; i++ ) From sean.hefty at intel.com Wed Dec 28 14:13:46 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 28 Dec 2005 14:13:46 -0800 Subject: [openib-general] [PATCH] [CMA] support loopback addresses Message-ID: The following patch adds support to the CMA for loopback IP addresses. By default, loopback addresses map to the first port on the first HCA in the CMA's device list. (At some point, quality of service values could be used to select a more appropriate device.) A minor update was also made to how sockaddr values are copied to account for the different sizes between IPv4 and IPv6. Note that this approach differs from the patch previously submitted by Michael. The reasons for the differences are to allow loopback on non-IB RDMA devices and eventual support for QoS. Signed-off-by: Sean Hefty Index: include/rdma/ib_addr.h =================================================================== --- include/rdma/ib_addr.h (revision 4356) +++ include/rdma/ib_addr.h (working copy) @@ -30,6 +30,8 @@ #if !defined(IB_ADDR_H) #define IB_ADDR_H +#include +#include #include #include @@ -68,5 +70,11 @@ int ib_resolve_addr(struct sockaddr *src void ib_addr_cancel(struct ib_addr *addr); +static inline int ip_addr_size(struct sockaddr *addr) +{ + return addr->sa_family == AF_INET6 ? + sizeof(struct sockaddr_in6) : sizeof(struct sockaddr_in); +} + #endif /* IB_ADDR_H */ Index: core/addr.c =================================================================== --- core/addr.c (revision 4356) +++ core/addr.c (working copy) @@ -27,8 +27,6 @@ * notice, one of the license notices in the documentation * and/or other materials provided with the distribution. 
*/ -#include -#include #include #include #include @@ -258,8 +256,8 @@ int ib_resolve_addr(struct sockaddr *src memset(req, 0, sizeof *req); if (src_addr) - req->src_addr = *src_addr; - req->dst_addr = *dst_addr; + memcpy(&req->src_addr, src_addr, ip_addr_size(src_addr)); + memcpy(&req->dst_addr, dst_addr, ip_addr_size(dst_addr)); req->addr = addr; req->callback = callback; req->context = context; Index: core/cma.c =================================================================== --- core/cma.c (revision 4404) +++ core/cma.c (working copy) @@ -55,6 +55,7 @@ static struct ib_client cma_client = { static LIST_HEAD(dev_list); static LIST_HEAD(listen_any_list); static DECLARE_MUTEX(mutex); +static struct workqueue_struct *wq; struct cma_device { struct list_head list; @@ -421,6 +422,12 @@ static inline int cma_any_addr(struct so } } +static inline int cma_loopback_addr(struct sockaddr *addr) +{ + return ((struct sockaddr_in *) addr)->sin_addr.s_addr == + ntohl(INADDR_LOOPBACK); +} + static int cma_get_net_info(void *hdr, enum rdma_port_space ps, u8 *ip_ver, __u16 *port, union cma_ip_addr **src, union cma_ip_addr **dst) @@ -1070,6 +1077,35 @@ err: } EXPORT_SYMBOL(rdma_resolve_route); +static int cma_bind_loopback(struct rdma_id_private *id_priv) +{ + struct cma_device *cma_dev; + int ret; + + down(&mutex); + if (list_empty(&dev_list)) { + ret = -ENODEV; + goto out; + } + + cma_dev = list_entry(dev_list.next, struct cma_device, list); + ret = ib_get_cached_gid(cma_dev->device, 1, 0, + &id_priv->id.route.addr.addr.ibaddr.sgid); + if (ret) + goto out; + + ret = ib_get_cached_pkey(cma_dev->device, 1, 0, + &id_priv->id.route.addr.addr.ibaddr.pkey); + if (ret) + goto out; + + id_priv->id.port_num = 1; + cma_attach_to_dev(id_priv, cma_dev); +out: + up(&mutex); + return ret; +} + static void addr_handler(int status, struct sockaddr *src_addr, struct ib_addr *ibaddr, void *context) { @@ -1092,7 +1128,8 @@ static void addr_handler(int status, str } else { if (!cma_comp_exch(id_priv, CMA_ADDR_QUERY, CMA_ADDR_RESOLVED)) goto out; - id_priv->id.route.addr.src_addr = *src_addr; + memcpy(&id_priv->id.route.addr.src_addr, src_addr, + ip_addr_size(src_addr)); event = RDMA_CM_EVENT_ADDR_RESOLVED; } @@ -1108,6 +1145,57 @@ out: cma_deref_id(id_priv); } +static void loopback_addr_handler(void *data) +{ + struct rdma_id_private *id_priv = data; + + atomic_inc(&id_priv->dev_remove); + + if (!cma_comp_exch(id_priv, CMA_ADDR_QUERY, CMA_ADDR_RESOLVED)) + goto out; + + if (cma_notify_user(id_priv, RDMA_CM_EVENT_ADDR_RESOLVED, 0, NULL, 0)) { + cma_exch(id_priv, CMA_DESTROYING); + cma_release_remove(id_priv); + cma_deref_id(id_priv); + rdma_destroy_id(&id_priv->id); + return; + } +out: + cma_release_remove(id_priv); + cma_deref_id(id_priv); +} + +static int cma_resolve_loopback(struct rdma_id_private *id_priv, + struct sockaddr *src_addr, enum cma_state state) +{ + struct work_struct *work; + int ret; + + work = kmalloc(sizeof *work, GFP_KERNEL); + if (!work) + return -ENOMEM; + + if (state == CMA_IDLE) { + ret = cma_bind_loopback(id_priv); + if (ret) + goto err; + id_priv->id.route.addr.addr.ibaddr.dgid = + id_priv->id.route.addr.addr.ibaddr.sgid; + if (!src_addr || cma_any_addr(src_addr)) + src_addr = &id_priv->id.route.addr.dst_addr; + memcpy(&id_priv->id.route.addr.src_addr, src_addr, + ip_addr_size(src_addr)); + } + + INIT_WORK(work, loopback_addr_handler, id_priv); + queue_work(wq, work); + return 0; +err: + kfree(work); + return ret; +} + int rdma_resolve_addr(struct rdma_cm_id *id, struct sockaddr *src_addr, 
struct sockaddr *dst_addr, int timeout_ms) { @@ -1126,9 +1214,13 @@ int rdma_resolve_addr(struct rdma_cm_id return -EINVAL; atomic_inc(&id_priv->refcount); - id->route.addr.dst_addr = *dst_addr; - ret = ib_resolve_addr(src_addr, dst_addr, &id->route.addr.addr.ibaddr, - timeout_ms, addr_handler, id_priv); + memcpy(&id->route.addr.dst_addr, dst_addr, ip_addr_size(dst_addr)); + if (cma_loopback_addr(dst_addr)) + ret = cma_resolve_loopback(id_priv, src_addr, expected_state); + else + ret = ib_resolve_addr(src_addr, dst_addr, + &id->route.addr.addr.ibaddr, + timeout_ms, addr_handler, id_priv); if (ret) goto err; @@ -1143,7 +1235,7 @@ EXPORT_SYMBOL(rdma_resolve_addr); int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr) { struct rdma_id_private *id_priv; - struct ib_addr *ibaddr = &id->route.addr.addr.ibaddr; + struct ib_addr *ibaddr; int ret; if (addr->sa_family != AF_INET) @@ -1154,9 +1246,11 @@ int rdma_bind_addr(struct rdma_cm_id *id return -EINVAL; if (cma_any_addr(addr)) { - id->route.addr.src_addr = *addr; ret = 0; + } else if (cma_loopback_addr(addr)) { + ret = cma_bind_loopback(id_priv); } else { + ibaddr = &id->route.addr.addr.ibaddr; ret = ib_translate_addr(addr, &ibaddr->sgid, &ibaddr->pkey); if (!ret) ret = cma_acquire_ib_dev(id_priv, &ibaddr->sgid); @@ -1165,7 +1259,7 @@ int rdma_bind_addr(struct rdma_cm_id *id if (ret) goto err; - id->route.addr.src_addr = *addr; + memcpy(&id->route.addr.src_addr, addr, ip_addr_size(addr)); return 0; err: cma_comp_exch(id_priv, CMA_ADDR_BOUND, CMA_IDLE); @@ -1508,12 +1602,17 @@ static void cma_remove_one(struct ib_dev static int cma_init(void) { + wq = create_singlethread_workqueue("rdma_cm"); + if (!wq) + return -ENOMEM; + return ib_register_client(&cma_client); } static void cma_cleanup(void) { ib_unregister_client(&cma_client); + destroy_workqueue(wq); } module_init(cma_init);
From ralphc at pathscale.com Wed Dec 28 14:14:52 2005 From: ralphc at pathscale.com (Ralph Campbell) Date: Wed, 28 Dec 2005 14:14:52 -0800 Subject: [openib-general] minor bug in pingpong programs Message-ID: <1135808092.5081.7.camel@brick.internal.keyresearch.com> The pingpong test programs don't check the return value from ibv_poll_cq() properly. Here is a patch to fix them.
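For readers skimming the diffs below, the reason the check has to move: a negative return from ibv_poll_cq() also satisfies ne < 1, so with events disabled the old check after the loop was unreachable and the busy-poll would spin forever on a persistent error. The fixed shape, with that reasoning as a comment:

do {
	ne = ibv_poll_cq(ctx->cq, 2, wc);
	/* check inside the loop: otherwise ne < 0 keeps satisfying
	 * (!use_event && ne < 1) and the failure is never reported */
	if (ne < 0) {
		fprintf(stderr, "poll CQ failed %d\n", ne);
		return 1;
	}
} while (!use_event && ne < 1);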
Index: rc_pingpong.c =================================================================== --- rc_pingpong.c (revision 4412) +++ rc_pingpong.c (working copy) @@ -644,13 +644,12 @@ do { ne = ibv_poll_cq(ctx->cq, 2, wc); + if (ne < 0) { + fprintf(stderr, "poll CQ failed %d\n", ne); + return 1; + } } while (!use_event && ne < 1); - if (ne < 0) { - fprintf(stderr, "poll CQ failed %d\n", ne); - return 1; - } - for (i = 0; i < ne; ++i) { if (wc[i].status != IBV_WC_SUCCESS) { fprintf(stderr, "Failed status %d for wr_id %d\n", Index: srq_pingpong.c =================================================================== --- srq_pingpong.c (revision 4412) +++ srq_pingpong.c (working copy) @@ -720,13 +720,12 @@ do { ne = ibv_poll_cq(ctx->cq, 2, wc); + if (ne < 0) { + fprintf(stderr, "poll CQ failed %d\n", ne); + return 1; + } } while (!use_event && ne < 1); - if (ne < 0) { - fprintf(stderr, "poll CQ failed %d\n", ne); - return 1; - } - for (i = 0; i < ne; ++i) { if (wc[i].status != IBV_WC_SUCCESS) { fprintf(stderr, "Failed status %d for wr_id %d\n", Index: uc_pingpong.c =================================================================== --- uc_pingpong.c (revision 4412) +++ uc_pingpong.c (working copy) @@ -632,13 +632,12 @@ do { ne = ibv_poll_cq(ctx->cq, 2, wc); + if (ne < 0) { + fprintf(stderr, "poll CQ failed %d\n", ne); + return 1; + } } while (!use_event && ne < 1); - if (ne < 0) { - fprintf(stderr, "poll CQ failed %d\n", ne); - return 1; - } - for (i = 0; i < ne; ++i) { if (wc[i].status != IBV_WC_SUCCESS) { fprintf(stderr, "Failed status %d for wr_id %d\n", Index: ud_pingpong.c =================================================================== --- ud_pingpong.c (revision 4412) +++ ud_pingpong.c (working copy) @@ -640,13 +640,12 @@ do { ne = ibv_poll_cq(ctx->cq, 2, wc); + if (ne < 0) { + fprintf(stderr, "poll CQ failed %d\n", ne); + return 1; + } } while (!use_event && ne < 1); - if (ne < 0) { - fprintf(stderr, "poll CQ failed %d\n", ne); - return 1; - } - for (i = 0; i < ne; ++i) { if (wc[i].status != IBV_WC_SUCCESS) { fprintf(stderr, "Failed status %d for wr_id %d\n", -- Ralph Campbell From surs at cse.ohio-state.edu Wed Dec 28 14:19:29 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Wed, 28 Dec 2005 17:19:29 -0500 Subject: [openib-general] Re: error modifying QP capabilities In-Reply-To: References: <20051228201328.GA2720@mellanox.co.il> Message-ID: <20051228221928.GA9271@cse.ohio-state.edu> Dotan, Roland & Michael, * On Dec,5 Roland Dreier wrote : > Michael> Maybe we should be returning EOPNOTSUPP rather than > Michael> EINVAL then. Makes sense? > > I guess so, but I'm not sure it's really worth worrying about. Thanks for your responses. I can understand that resizing QPs is a complex thing to do for the driver. Sayantan. -- http://www.cse.ohio-state.edu/~surs From mst at mellanox.co.il Wed Dec 28 14:31:51 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 29 Dec 2005 00:31:51 +0200 Subject: [openib-general] Re: [PATCH] [CMA] support loopback addresses In-Reply-To: References: Message-ID: <20051228223151.GA3614@mellanox.co.il> Hello, Sean! Thanks for looking at this. Quoting Sean Hefty : > + INIT_WORK(work, loopback_addr_handler, id_priv); > + queue_work(wq, work); > + return 0; > +err: > + kfree(work); > + return ret; > +} I'm not following this: why are we deferring the work? Cant it be done directly? 
> @@ -1508,12 +1602,17 @@ static void cma_remove_one(struct ib_dev
> 
>  static int cma_init(void)
>  {
> +	wq = create_singlethread_workqueue("rdma_cm");
> +	if (!wq)
> +		return -ENOMEM;
> +
>  	return ib_register_client(&cma_client);
>  }
> 
>  static void cma_cleanup(void)
>  {
>  	ib_unregister_client(&cma_client);
> +	destroy_workqueue(wq);
>  }
> 
>  module_init(cma_init);

It would be nice to avoid adding another workqueue.

-- 
MST

From mst at mellanox.co.il Wed Dec 28 14:42:08 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 29 Dec 2005 00:42:08 +0200
Subject: [openib-general] Re: error modifying QP capabilities
In-Reply-To: 
References: 
Message-ID: <20051228224207.GA3851@mellanox.co.il>

Quoting Roland Dreier :
> Subject: Re: error modifying QP capabilities
> 
> Michael> Maybe we should be returning EOPNOTSUPP rather than
> Michael> EINVAL then. Makes sense?
> 
> I guess so, but I'm not sure it's really worth worrying about.

What about moving most parameter checks from mthca to verbs.c?
Then mthca would only have to worry about what is supported.

-- 
MST

From mshefty at ichips.intel.com Wed Dec 28 14:39:37 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 28 Dec 2005 14:39:37 -0800
Subject: [openib-general] Re: [PATCH] [CMA] support loopback addresses
In-Reply-To: <20051228223151.GA3614@mellanox.co.il>
References: <20051228223151.GA3614@mellanox.co.il>
Message-ID: <43B31429.1080408@ichips.intel.com>

Michael S. Tsirkin wrote:
>>+	INIT_WORK(work, loopback_addr_handler, id_priv);
>>+	queue_work(wq, work);
>>+	return 0;
>>+err:
>>+	kfree(work);
>>+	return ret;
>>+}
> 
> I'm not following this: why are we deferring the work?
> Can't it be done directly?

To keep rdma_resolve_addr() generic, it is an asynchronous call. The work
queue is used to callback the user from a separate thread other than the one
that they called down with. The ib_addr module does something similar when
the destination address is actually a local address, deferring the callback to
another thread.

The alternative is to have the API behave one way for destination addresses that
are local, versus those that are remote, but this complicates applications that
are not aware if an address belongs to the local or a remote system.

- Sean

From mst at mellanox.co.il Wed Dec 28 14:49:10 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 29 Dec 2005 00:49:10 +0200
Subject: [openib-general] Re: Re: [PATCH] [CMA] support loopback addresses
In-Reply-To: <43B31429.1080408@ichips.intel.com>
References: <43B31429.1080408@ichips.intel.com>
Message-ID: <20051228224910.GB3851@mellanox.co.il>

Quoting r. Sean Hefty :
> Subject: Re: Re: [PATCH] [CMA] support loopback addresses
> 
> Michael S. Tsirkin wrote:
> >>+	INIT_WORK(work, loopback_addr_handler, id_priv);
> >>+	queue_work(wq, work);
> >>+	return 0;
> >>+err:
> >>+	kfree(work);
> >>+	return ret;
> >>+}
> > 
> > I'm not following this: why are we deferring the work?
> > Can't it be done directly?
> 
> To keep rdma_resolve_addr() generic, it is an asynchronous call. The work
> queue is used to callback the user from a separate thread other than the one
> that they called down with. The ib_addr module does something similar when
> the destination address is actually a local address, deferring the callback to
> another thread.

Maybe this should be moved to ib_addr then?
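(For reference, the deferral idiom being discussed, queueing a one-shot work
item so that the user's callback runs on the workqueue thread rather than on
the caller's stack, looks roughly like this with the 2.6-era three-argument
INIT_WORK(); the names below are illustrative only, not the actual CMA
symbols:)

    struct deferred_ctx {
        struct work_struct work;
        void *context;          /* whatever the callback needs */
    };

    static void deferred_handler(void *data)
    {
        struct deferred_ctx *ctx = data;
        /* ... invoke the user's callback with ctx->context here ... */
        kfree(ctx);             /* one-shot item; free it once it has run */
    }

    static int defer_callback(struct workqueue_struct *wq, void *context)
    {
        struct deferred_ctx *ctx = kmalloc(sizeof *ctx, GFP_KERNEL);

        if (!ctx)
            return -ENOMEM;
        ctx->context = context;
        INIT_WORK(&ctx->work, deferred_handler, ctx);
        queue_work(wq, &ctx->work);
        return 0;               /* callback fires later, asynchronously */
    }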
> 
> The alternative is to have the API behave one way for destination addresses that
> are local, versus those that are remote, but this complicates applications that
> are not aware if an address belongs to the local or a remote system.
> 
> - Sean

So, maybe we can reuse the ib_addr wq? Having all requests complete from
a single thread could help applications, I think.

-- 
MST

From mshefty at ichips.intel.com Wed Dec 28 15:03:20 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 28 Dec 2005 15:03:20 -0800
Subject: [openib-general] Re: [PATCH] [CMA] support loopback addresses
In-Reply-To: <20051228224910.GB3851@mellanox.co.il>
References: <43B31429.1080408@ichips.intel.com> <20051228224910.GB3851@mellanox.co.il>
Message-ID: <43B319B8.1040100@ichips.intel.com>

Michael S. Tsirkin wrote:
>>To keep rdma_resolve_addr() generic, it is an asynchronous call. The work
>>queue is used to callback the user from a separate thread other than the one
>>that they called down with. The ib_addr module does something similar when
>>the destination address is actually a local address, deferring the callback to
>>another thread.
> 
> Maybe this should be moved to ib_addr then?
> 
>>The alternative is to have the API behave one way for destination addresses that
>>are local, versus those that are remote
> 
> So, maybe we can reuse the ib_addr wq? Having all requests complete from
> a single thread could help applications, I think.

I did consider pushing this down into ib_addr because it has an existing work
queue, and it would have been easier to implement, but ib_addr simply converts
an IP address into a GID/pkey. To keep the CMA RDMA centric, rather than
IB-specific, it seemed better to me to map loopback addresses to an RDMA device
outside of IB specific code. For example, if the mapping were moved into
ib_addr, then loopback addresses would not work for a system containing only
iWarp devices.

If there's concern about using too many work queues in the IB stack, we could
export a single work queue from ib_core that's accessible to multiple modules.
Work queues appear to be used by these modules: ib_mad, rdma_cm, ib_cm, ib_addr,
sdp, and ipoib.

- Sean

From mst at mellanox.co.il Wed Dec 28 15:18:19 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 29 Dec 2005 01:18:19 +0200
Subject: [openib-general] Re: [PATCH] [CMA] support loopback addresses
In-Reply-To: <43B319B8.1040100@ichips.intel.com>
References: <43B319B8.1040100@ichips.intel.com>
Message-ID: <20051228231819.GA4060@mellanox.co.il>

Quoting r. Sean Hefty :
> >>The alternative is to have the API behave one way for destination addresses that
> >>are local, versus those that are remote
> > 
> > So, maybe we can reuse the ib_addr wq? Having all requests complete from
> > a single thread could help applications, I think.
> I did consider pushing this down into ib_addr because it has an existing work
> queue, and it would have been easier to implement, but ib_addr simply converts
> an IP address into a GID/pkey. To keep the CMA RDMA centric, rather than
> IB-specific, it seemed better to me to map loopback addresses to an RDMA device
> outside of IB specific code. For example, if the mapping were moved into
> ib_addr, then loopback addresses would not work for a system containing only
> iWarp devices.
Does it really matter in which module does the function reside?

> If there's concern about using too many work queues in the IB stack, we could
> export a single work queue from ib_core that's accessible to multiple modules.
> Work queues appear to be used by these modules: ib_mad, rdma_cm, ib_cm, ib_addr,
> sdp, and ipoib.
> 
> - Sean
> 

Sounds good, I would start with just rdma_cm, ib_cm, and ib_addr.
But note that exposing the wq means you must be explicit about what runs
on this queue, since otherwise a user can flush the queue from inside it
or other such deadlocking silliness. Need to be careful.

-- 
MST

From mshefty at ichips.intel.com Wed Dec 28 15:37:06 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 28 Dec 2005 15:37:06 -0800
Subject: [openib-general] Re: [PATCH] [CMA] support loopback addresses
In-Reply-To: <20051228231819.GA4060@mellanox.co.il>
References: <43B319B8.1040100@ichips.intel.com> <20051228231819.GA4060@mellanox.co.il>
Message-ID: <43B321A2.8020802@ichips.intel.com>

Michael S. Tsirkin wrote:
>>For example, if the mapping were moved into
>>ib_addr, then loopback addresses would not work for a system containing only
>>iWarp devices.
> 
> Does it really matter in which module does the function reside?

Yes - ib_addr does not have a list of RDMA devices. It simply converts an IP
address to a HW address, then interprets that address as a GID/pkey. Placing
the functionality in the CMA allows loopback operation without ipoib being
loaded, and allows it to operate over non-IB devices.

And I agree that we need to be careful merging any work queues together. It's
probably simple enough to merge ib_addr and rdma_cm work queues that I will go
ahead and try that, but would like to defer merging in the ib_cm or other work
queues for now.

- Sean

From sean.hefty at intel.com Wed Dec 28 16:22:39 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 28 Dec 2005 16:22:39 -0800
Subject: [openib-general] RE: [PATCH] [CMA] support loopback addresses
In-Reply-To: <20051228223151.GA3614@mellanox.co.il>
Message-ID: 

>It would be nice to avoid adding another workqueue.

Here's an updated patch that exports the work queue in ib_addr and has the CMA
make use of that one. Thanks for the feedback - please respond if there are
any other issues.

Signed-off-by: Sean Hefty

Index: include/rdma/ib_addr.h
===================================================================
--- include/rdma/ib_addr.h	(revision 4651)
+++ include/rdma/ib_addr.h	(working copy)
@@ -30,9 +30,13 @@
 #if !defined(IB_ADDR_H)
 #define IB_ADDR_H
 
+#include
+#include
 #include
 #include
 
+extern struct workqueue_struct *rdma_wq;
+
 struct ib_addr {
 	union ib_gid	sgid;
 	union ib_gid	dgid;
@@ -68,5 +72,11 @@
 
 void ib_addr_cancel(struct ib_addr *addr);
 
+static inline int ip_addr_size(struct sockaddr *addr)
+{
+	return addr->sa_family == AF_INET6 ?
+	       sizeof(struct sockaddr_in6) : sizeof(struct sockaddr_in);
+}
+
 #endif /* IB_ADDR_H */

Index: core/addr.c
===================================================================
--- core/addr.c	(revision 4651)
+++ core/addr.c	(working copy)
@@ -27,8 +27,6 @@
  * notice, one of the license notices in the documentation
  * and/or other materials provided with the distribution.
*/ -#include -#include #include #include #include @@ -57,7 +55,8 @@ static DECLARE_MUTEX(mutex); static LIST_HEAD(req_list); static DECLARE_WORK(work, process_req, NULL); -static struct workqueue_struct *wq; +struct workqueue_struct *rdma_wq; +EXPORT_SYMBOL(rdma_wq); static u16 addr_get_pkey(struct net_device *dev) { @@ -90,7 +89,7 @@ if ((long)delay <= 0) delay = 1; - queue_delayed_work(wq, &work, delay); + queue_delayed_work(rdma_wq, &work, delay); } static void queue_req(struct addr_req *req) @@ -258,8 +257,8 @@ memset(req, 0, sizeof *req); if (src_addr) - req->src_addr = *src_addr; - req->dst_addr = *dst_addr; + memcpy(&req->src_addr, src_addr, ip_addr_size(src_addr)); + memcpy(&req->dst_addr, dst_addr, ip_addr_size(dst_addr)); req->addr = addr; req->callback = callback; req->context = context; @@ -333,8 +332,8 @@ static int addr_init(void) { - wq = create_singlethread_workqueue("ib_addr"); - if (!wq) + rdma_wq = create_singlethread_workqueue("rdma_wq"); + if (!rdma_wq) return -ENOMEM; dev_add_pack(&addr_arp); @@ -344,7 +343,7 @@ static void addr_cleanup(void) { dev_remove_pack(&addr_arp); - destroy_workqueue(wq); + destroy_workqueue(rdma_wq); } module_init(addr_init); Index: core/cma.c =================================================================== --- core/cma.c (revision 4651) +++ core/cma.c (working copy) @@ -421,6 +421,12 @@ } } +static inline int cma_loopback_addr(struct sockaddr *addr) +{ + return ((struct sockaddr_in *) addr)->sin_addr.s_addr == + ntohl(INADDR_LOOPBACK); +} + static int cma_get_net_info(void *hdr, enum rdma_port_space ps, u8 *ip_ver, __u16 *port, union cma_ip_addr **src, union cma_ip_addr **dst) @@ -1070,6 +1076,35 @@ } EXPORT_SYMBOL(rdma_resolve_route); +static int cma_bind_loopback(struct rdma_id_private *id_priv) +{ + struct cma_device *cma_dev; + int ret; + + down(&mutex); + if (list_empty(&dev_list)) { + ret = -ENODEV; + goto out; + } + + cma_dev = list_entry(dev_list.next, struct cma_device, list); + ret = ib_get_cached_gid(cma_dev->device, 1, 0, + &id_priv->id.route.addr.addr.ibaddr.sgid); + if (ret) + goto out; + + ret = ib_get_cached_pkey(cma_dev->device, 1, 0, + &id_priv->id.route.addr.addr.ibaddr.pkey); + if (ret) + goto out; + + id_priv->id.port_num = 1; + cma_attach_to_dev(id_priv, cma_dev); +out: + up(&mutex); + return ret; +} + static void addr_handler(int status, struct sockaddr *src_addr, struct ib_addr *ibaddr, void *context) { @@ -1092,7 +1127,8 @@ } else { if (!cma_comp_exch(id_priv, CMA_ADDR_QUERY, CMA_ADDR_RESOLVED)) goto out; - id_priv->id.route.addr.src_addr = *src_addr; + memcpy(&id_priv->id.route.addr.src_addr, src_addr, + ip_addr_size(src_addr)); event = RDMA_CM_EVENT_ADDR_RESOLVED; } @@ -1108,6 +1144,57 @@ cma_deref_id(id_priv); } +static void loopback_addr_handler(void *data) +{ + struct rdma_id_private *id_priv = data; + + atomic_inc(&id_priv->dev_remove); + + if (!cma_comp_exch(id_priv, CMA_ADDR_QUERY, CMA_ADDR_RESOLVED)) + goto out; + + if (cma_notify_user(id_priv, RDMA_CM_EVENT_ADDR_RESOLVED, 0, NULL, 0)) { + cma_exch(id_priv, CMA_DESTROYING); + cma_release_remove(id_priv); + cma_deref_id(id_priv); + rdma_destroy_id(&id_priv->id); + return; + } +out: + cma_release_remove(id_priv); + cma_deref_id(id_priv); +} + +static int cma_resolve_loopback(struct rdma_id_private *id_priv, + struct sockaddr *src_addr, enum cma_state state) +{ + struct work_struct *work; + int ret; + + work = kmalloc(sizeof *work, GFP_KERNEL); + if (!work) + return -ENOMEM; + + if (state == CMA_IDLE) { + ret = cma_bind_loopback(id_priv); + if (ret) + goto 
err; + id_priv->id.route.addr.addr.ibaddr.dgid = + id_priv->id.route.addr.addr.ibaddr.sgid; + if (!src_addr || cma_any_addr(src_addr)) + src_addr = &id_priv->id.route.addr.dst_addr; + memcpy(&id_priv->id.route.addr.src_addr, src_addr, + ip_addr_size(src_addr)); + } + + INIT_WORK(work, loopback_addr_handler, id_priv); + queue_work(rdma_wq, work); + return 0; +err: + kfree(work); + return ret; +} + int rdma_resolve_addr(struct rdma_cm_id *id, struct sockaddr *src_addr, struct sockaddr *dst_addr, int timeout_ms) { @@ -1126,9 +1213,13 @@ return -EINVAL; atomic_inc(&id_priv->refcount); - id->route.addr.dst_addr = *dst_addr; - ret = ib_resolve_addr(src_addr, dst_addr, &id->route.addr.addr.ibaddr, - timeout_ms, addr_handler, id_priv); + memcpy(&id->route.addr.dst_addr, dst_addr, ip_addr_size(dst_addr)); + if (cma_loopback_addr(dst_addr)) + ret = cma_resolve_loopback(id_priv, src_addr, expected_state); + else + ret = ib_resolve_addr(src_addr, dst_addr, + &id->route.addr.addr.ibaddr, + timeout_ms, addr_handler, id_priv); if (ret) goto err; @@ -1143,7 +1234,7 @@ int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr) { struct rdma_id_private *id_priv; - struct ib_addr *ibaddr = &id->route.addr.addr.ibaddr; + struct ib_addr *ibaddr; int ret; if (addr->sa_family != AF_INET) @@ -1154,9 +1245,11 @@ return -EINVAL; if (cma_any_addr(addr)) { - id->route.addr.src_addr = *addr; ret = 0; + } else if (cma_loopback_addr(addr)) { + ret = cma_bind_loopback(id_priv); } else { + ibaddr = &id->route.addr.addr.ibaddr; ret = ib_translate_addr(addr, &ibaddr->sgid, &ibaddr->pkey); if (!ret) ret = cma_acquire_ib_dev(id_priv, &ibaddr->sgid); @@ -1165,7 +1258,7 @@ if (ret) goto err; - id->route.addr.src_addr = *addr; + memcpy(&id->route.addr.src_addr, addr, ip_addr_size(addr)); return 0; err: cma_comp_exch(id_priv, CMA_ADDR_BOUND, CMA_IDLE); From bos at pathscale.com Wed Dec 28 16:31:21 2005 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 28 Dec 2005 16:31:21 -0800 Subject: [openib-general] [PATCH 2 of 20] memcpy32 for x86_64 In-Reply-To: Message-ID: <801287704e408ed65660.1135816281@eng-12.pathscale.com> Introduce an x86_64-specific memcpy32 routine. The routine is similar to memcpy, but is guaranteed to work in units of 32 bits at a time. 
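In plain C, the intended semantics are just a word-at-a-time copy, with count
measured in 32-bit words rather than bytes. A rough, unoptimized reference
sketch (the actual patch below implements this in assembly with rep movsq and
rep movsd for speed):

    #include <linux/types.h>

    /* Reference semantics for memcpy32: copy `count` aligned 32-bit words. */
    static void memcpy32_ref(void *dst, const void *src, size_t count)
    {
        u32 *d = dst;
        const u32 *s = src;

        while (count--)
            *d++ = *s++;
    }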
Signed-off-by: Bryan O'Sullivan diff -r ef833f6712e7 -r 801287704e40 arch/x86_64/kernel/x8664_ksyms.c --- a/arch/x86_64/kernel/x8664_ksyms.c Wed Dec 28 14:19:42 2005 -0800 +++ b/arch/x86_64/kernel/x8664_ksyms.c Wed Dec 28 14:19:42 2005 -0800 @@ -164,6 +164,8 @@ EXPORT_SYMBOL(memcpy); EXPORT_SYMBOL(__memcpy); +EXPORT_SYMBOL_GPL(memcpy32); + #ifdef CONFIG_RWSEM_XCHGADD_ALGORITHM /* prototypes are wrong, these are assembly with custom calling functions */ extern void rwsem_down_read_failed_thunk(void); diff -r ef833f6712e7 -r 801287704e40 arch/x86_64/lib/Makefile --- a/arch/x86_64/lib/Makefile Wed Dec 28 14:19:42 2005 -0800 +++ b/arch/x86_64/lib/Makefile Wed Dec 28 14:19:42 2005 -0800 @@ -9,4 +9,4 @@ lib-y := csum-partial.o csum-copy.o csum-wrappers.o delay.o \ usercopy.o getuser.o putuser.o \ thunk.o clear_page.o copy_page.o bitstr.o bitops.o -lib-y += memcpy.o memmove.o memset.o copy_user.o +lib-y += memcpy.o memcpy32.o memmove.o memset.o copy_user.o diff -r ef833f6712e7 -r 801287704e40 include/asm-x86_64/string.h --- a/include/asm-x86_64/string.h Wed Dec 28 14:19:42 2005 -0800 +++ b/include/asm-x86_64/string.h Wed Dec 28 14:19:42 2005 -0800 @@ -45,6 +45,15 @@ #define __HAVE_ARCH_MEMMOVE void * memmove(void * dest,const void *src,size_t count); +/* + * memcpy32 - copy data, 32 bits at a time + * + * @dst: destination (must be 32-bit aligned) + * @src: source (must be 32-bit aligned) + * @count: number of 32-bit quantities to copy + */ +void memcpy32(void *dst, const void *src, size_t count); + /* Use C out of line version for memcmp */ #define memcmp __builtin_memcmp int memcmp(const void * cs,const void * ct,size_t count); diff -r ef833f6712e7 -r 801287704e40 arch/x86_64/lib/memcpy32.S --- /dev/null Thu Jan 1 00:00:00 1970 +0000 +++ b/arch/x86_64/lib/memcpy32.S Wed Dec 28 14:19:42 2005 -0800 @@ -0,0 +1,23 @@ +/* + * Copyright (c) 2003, 2004, 2005 PathScale, Inc. + */ + +/* + * memcpy32 - Copy a memory block, 32 bits at a time. + * + * Count is number of dwords; it need not be a qword multiple. + * Input: + * rdi destination + * rsi source + * rdx count + */ + + .globl memcpy32 +memcpy32: + movl %edx,%ecx + shrl $1,%ecx + andl $1,%edx + rep movsq + movl %edx,%ecx + rep movsd + ret From bos at pathscale.com Wed Dec 28 16:31:19 2005 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 28 Dec 2005 16:31:19 -0800 Subject: [openib-general] [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver Message-ID: Following Roland's submission of our InfiniPath InfiniBand HCA driver earlier this month, we have responded to people's comments by making a large number of changes to the driver. Here is another set of driver patches for review. Roland is on vacation until January 4, so I'm posting these in his place. Once again, your comments are appreciated. We'd like to submit this driver for inclusion in 2.6.16, so we'll be responding quickly to all feedback. 
A short summary of the changes we have made is as follows: - sparse annotations (yes, it passes "make C=1") - Removed x86_64 specificity from driver - Introduced generic memcpy_toio32 for safe MMIO access - Got rid of release and RCS IDs - Use set_page_dirty_lock instead of SetPageDirty - Fixed misuse of copy_from_user - Removed all sysctls - Removed stuff inside #ifndef __KERNEL__ - Use ALIGN() instead of round_up() - Use static inlines instead of #defines, generally tidied inline functions - Renamed _BITS_PER_BYTE to BITS_PER_BYTE, and moved it into linux/types.h - Got rid of ipath_shortcopy - Use fixed-size types for user/kernel communication - Renamed ipath_mlock to ipath_get_user_pages, fixed some bugs There are a few requested changes we have chosen to omit for now: - The driver still uses EXPORT_SYMBOL, for consistency with other code in drivers/infiniband - Someone asked for the kernel's i2c infrastructure to be used, but our i2c usage is very specialised, and it would be more of a mess to use the kernel's - We're still using ioctls instead of sysfs or configfs in some cases, to maintain userspace compatibility Please note that these patches require a set of OpenIB kernel patches that are awaiting the 2.6.16 submission window in order to compile; in other words, they really are for review only. I'll be happy to provide a suitable jumbo OpenIB patch to anyone who feels a need to compile-test these patches. Message-ID: This routine is an arch-independent building block for memcpy_toio32. It copies data to a memory-mapped I/O region, using 32-bit accesses. This style of access is required by some devices. Signed-off-by: Bryan O'Sullivan diff -r a56fd6a8895d -r ef833f6712e7 include/asm-generic/iomap.h --- a/include/asm-generic/iomap.h Wed Dec 28 14:19:42 2005 -0800 +++ b/include/asm-generic/iomap.h Wed Dec 28 14:19:42 2005 -0800 @@ -56,6 +56,15 @@ extern void fastcall iowrite16_rep(void __iomem *port, const void *buf, unsigned long count); extern void fastcall iowrite32_rep(void __iomem *port, const void *buf, unsigned long count); +/* + * __memcpy_toio32 - copy data to MMIO space, in 32-bit units + * + * @to: destination, in MMIO space (must be 32-bit aligned) + * @from: source (must be 32-bit aligned) + * @count: number of 32-bit quantities to copy + */ +void fastcall __memcpy_toio32(void __iomem *to, const void *from, size_t count); + /* Create a virtual mapping cookie for an IO port range */ extern void __iomem *ioport_map(unsigned long port, unsigned int nr); extern void ioport_unmap(void __iomem *); diff -r a56fd6a8895d -r ef833f6712e7 lib/iomap.c --- a/lib/iomap.c Wed Dec 28 14:19:42 2005 -0800 +++ b/lib/iomap.c Wed Dec 28 14:19:42 2005 -0800 @@ -187,6 +187,17 @@ EXPORT_SYMBOL(iowrite16_rep); EXPORT_SYMBOL(iowrite32_rep); +void fastcall __memcpy_toio32(void __iomem *d, const void *s, size_t count) +{ + u32 __iomem *dst = d; + const u32 *src = s; + size_t i; + + for (i = 0; i < count; i++) + __raw_writel(*src++, dst++); + wmb(); +} + /* Create a virtual mapping cookie for an IO port range */ void __iomem *ioport_map(unsigned long port, unsigned int nr) { From bos at pathscale.com Wed Dec 28 16:31:23 2005 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 28 Dec 2005 16:31:23 -0800 Subject: [openib-general] [PATCH 4 of 20] Define BITS_PER_BYTE In-Reply-To: Message-ID: This can make some arithmetic expressions clearer. 
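A one-line example of the sort of expression this helps (hypothetical usage,
not taken from the patch itself):

    /* number of bits in a u64, without a bare "8" in the arithmetic */
    #define U64_BITS (sizeof(u64) * BITS_PER_BYTE)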
Signed-off-by: Bryan O'Sullivan diff -r b792638cc4bc -r a3a00f637da6 include/linux/types.h --- a/include/linux/types.h Wed Dec 28 14:19:42 2005 -0800 +++ b/include/linux/types.h Wed Dec 28 14:19:42 2005 -0800 @@ -8,6 +8,8 @@ (((bits)+BITS_PER_LONG-1)/BITS_PER_LONG) #define DECLARE_BITMAP(name,bits) \ unsigned long name[BITS_TO_LONGS(bits)] + +#define BITS_PER_BYTE 8 #endif #include From bos at pathscale.com Wed Dec 28 16:31:22 2005 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 28 Dec 2005 16:31:22 -0800 Subject: [openib-general] [PATCH 3 of 20] Add memcpy_toio32 to each arch In-Reply-To: Message-ID: Most arches use the generic __memcpy_toio32 routine, while x86_64 uses memcpy32, which is substantially faster. Signed-off-by: Bryan O'Sullivan diff -r 801287704e40 -r b792638cc4bc include/asm-alpha/io.h --- a/include/asm-alpha/io.h Wed Dec 28 14:19:42 2005 -0800 +++ b/include/asm-alpha/io.h Wed Dec 28 14:19:42 2005 -0800 @@ -504,6 +504,8 @@ extern void memcpy_toio(volatile void __iomem *, const void *, long); extern void _memset_c_io(volatile void __iomem *, unsigned long, long); +#define memcpy_toio32 __memcpy_toio32 + static inline void memset_io(volatile void __iomem *addr, u8 c, long len) { _memset_c_io(addr, 0x0101010101010101UL * c, len); diff -r 801287704e40 -r b792638cc4bc include/asm-arm/io.h --- a/include/asm-arm/io.h Wed Dec 28 14:19:42 2005 -0800 +++ b/include/asm-arm/io.h Wed Dec 28 14:19:42 2005 -0800 @@ -184,6 +184,8 @@ #define memset_io(c,v,l) _memset_io(__mem_pci(c),(v),(l)) #define memcpy_fromio(a,c,l) _memcpy_fromio((a),__mem_pci(c),(l)) #define memcpy_toio(c,a,l) _memcpy_toio(__mem_pci(c),(a),(l)) + +#define memcpy_toio32 __memcpy_toio32 #define eth_io_copy_and_sum(s,c,l,b) \ eth_copy_and_sum((s),__mem_pci(c),(l),(b)) diff -r 801287704e40 -r b792638cc4bc include/asm-cris/io.h --- a/include/asm-cris/io.h Wed Dec 28 14:19:42 2005 -0800 +++ b/include/asm-cris/io.h Wed Dec 28 14:19:42 2005 -0800 @@ -121,6 +121,8 @@ #define memcpy_fromio(a,b,c) memcpy((a),(void *)(b),(c)) #define memcpy_toio(a,b,c) memcpy((void *)(a),(b),(c)) +#define memcpy_toio32 __memcpy_toio32 + /* * Again, CRIS does not require mem IO specific function. 
*/ diff -r 801287704e40 -r b792638cc4bc include/asm-frv/io.h --- a/include/asm-frv/io.h Wed Dec 28 14:19:42 2005 -0800 +++ b/include/asm-frv/io.h Wed Dec 28 14:19:42 2005 -0800 @@ -124,6 +124,8 @@ memcpy((void __force *) dst, src, count); } +#define memcpy_toio32 __memcpy_toio32 + static inline uint8_t inb(unsigned long addr) { return __builtin_read8((void *)addr); diff -r 801287704e40 -r b792638cc4bc include/asm-h8300/io.h --- a/include/asm-h8300/io.h Wed Dec 28 14:19:42 2005 -0800 +++ b/include/asm-h8300/io.h Wed Dec 28 14:19:42 2005 -0800 @@ -209,6 +209,8 @@ #define memcpy_fromio(a,b,c) memcpy((a),(void *)(b),(c)) #define memcpy_toio(a,b,c) memcpy((void *)(a),(b),(c)) +#define memcpy_toio32 __memcpy_toio32 + #define mmiowb() #define inb(addr) ((h8300_buswidth(addr))?readw((addr) & ~1) & 0xff:readb(addr)) diff -r 801287704e40 -r b792638cc4bc include/asm-i386/io.h --- a/include/asm-i386/io.h Wed Dec 28 14:19:42 2005 -0800 +++ b/include/asm-i386/io.h Wed Dec 28 14:19:42 2005 -0800 @@ -203,6 +203,9 @@ { __memcpy((void __force *) dst, src, count); } + +#define memcpy_toio32 __memcpy_toio32 + /* * ISA space is 'always mapped' on a typical x86 system, no need to diff -r 801287704e40 -r b792638cc4bc include/asm-ia64/io.h --- a/include/asm-ia64/io.h Wed Dec 28 14:19:42 2005 -0800 +++ b/include/asm-ia64/io.h Wed Dec 28 14:19:42 2005 -0800 @@ -443,6 +443,8 @@ extern void memcpy_toio(volatile void __iomem *dst, const void *src, long n); extern void memset_io(volatile void __iomem *s, int c, long n); +#define memcpy_toio32 __memcpy_toio32 + #define dma_cache_inv(_start,_size) do { } while (0) #define dma_cache_wback(_start,_size) do { } while (0) #define dma_cache_wback_inv(_start,_size) do { } while (0) diff -r 801287704e40 -r b792638cc4bc include/asm-m32r/io.h --- a/include/asm-m32r/io.h Wed Dec 28 14:19:42 2005 -0800 +++ b/include/asm-m32r/io.h Wed Dec 28 14:19:42 2005 -0800 @@ -216,6 +216,8 @@ memcpy((void __force *) dst, src, count); } +#define memcpy_toio32 __memcpy_toio32 + /* * Convert a physical pointer to a virtual kernel pointer for /dev/mem * access diff -r 801287704e40 -r b792638cc4bc include/asm-m68knommu/io.h --- a/include/asm-m68knommu/io.h Wed Dec 28 14:19:42 2005 -0800 +++ b/include/asm-m68knommu/io.h Wed Dec 28 14:19:42 2005 -0800 @@ -113,6 +113,8 @@ #define memcpy_fromio(a,b,c) memcpy((a),(void *)(b),(c)) #define memcpy_toio(a,b,c) memcpy((void *)(a),(b),(c)) +#define memcpy_toio32 __memcpy_toio32 + #define inb(addr) readb(addr) #define inw(addr) readw(addr) #define inl(addr) readl(addr) diff -r 801287704e40 -r b792638cc4bc include/asm-mips/io.h --- a/include/asm-mips/io.h Wed Dec 28 14:19:42 2005 -0800 +++ b/include/asm-mips/io.h Wed Dec 28 14:19:42 2005 -0800 @@ -534,6 +534,8 @@ memcpy((void __force *) dst, src, count); } +#define memcpy_toio32 __memcpy_toio32 + /* * Memory Mapped I/O */ diff -r 801287704e40 -r b792638cc4bc include/asm-parisc/io.h --- a/include/asm-parisc/io.h Wed Dec 28 14:19:42 2005 -0800 +++ b/include/asm-parisc/io.h Wed Dec 28 14:19:42 2005 -0800 @@ -294,6 +294,8 @@ void memcpy_fromio(void *dst, const volatile void __iomem *src, int count); void memcpy_toio(volatile void __iomem *dst, const void *src, int count); +#define memcpy_toio32 __memcpy_toio32 + /* Support old drivers which don't ioremap. 
* NB this interface is scheduled to disappear in 2.5 */ diff -r 801287704e40 -r b792638cc4bc include/asm-powerpc/io.h --- a/include/asm-powerpc/io.h Wed Dec 28 14:19:42 2005 -0800 +++ b/include/asm-powerpc/io.h Wed Dec 28 14:19:42 2005 -0800 @@ -63,6 +63,8 @@ #define memcpy_fromio(a,b,c) iSeries_memcpy_fromio((a), (b), (c)) #define memcpy_toio(a,b,c) iSeries_memcpy_toio((a), (b), (c)) +#define memcpy_toio32 __memcpy_toio32 + #define inb(addr) readb(((void __iomem *)(long)(addr))) #define inw(addr) readw(((void __iomem *)(long)(addr))) #define inl(addr) readl(((void __iomem *)(long)(addr))) diff -r 801287704e40 -r b792638cc4bc include/asm-ppc/io.h --- a/include/asm-ppc/io.h Wed Dec 28 14:19:42 2005 -0800 +++ b/include/asm-ppc/io.h Wed Dec 28 14:19:42 2005 -0800 @@ -367,6 +367,8 @@ } #endif +#define memcpy_toio32 __memcpy_toio32 + #define eth_io_copy_and_sum(a,b,c,d) eth_copy_and_sum((a),(void __force *)(void __iomem *)(b),(c),(d)) /* diff -r 801287704e40 -r b792638cc4bc include/asm-s390/io.h --- a/include/asm-s390/io.h Wed Dec 28 14:19:42 2005 -0800 +++ b/include/asm-s390/io.h Wed Dec 28 14:19:42 2005 -0800 @@ -99,6 +99,8 @@ #define memcpy_fromio(a,b,c) memcpy((a),__io_virt(b),(c)) #define memcpy_toio(a,b,c) memcpy(__io_virt(a),(b),(c)) +#define memcpy_toio32 __memcpy_toio32 + #define inb_p(addr) readb(addr) #define inb(addr) readb(addr) diff -r 801287704e40 -r b792638cc4bc include/asm-sh/io.h --- a/include/asm-sh/io.h Wed Dec 28 14:19:42 2005 -0800 +++ b/include/asm-sh/io.h Wed Dec 28 14:19:42 2005 -0800 @@ -177,6 +177,8 @@ extern void memcpy_toio(unsigned long, const void *, unsigned long); extern void memset_io(unsigned long, int, unsigned long); +#define memcpy_toio32 __memcpy_toio32 + /* SuperH on-chip I/O functions */ static __inline__ unsigned char ctrl_inb(unsigned long addr) { diff -r 801287704e40 -r b792638cc4bc include/asm-sh64/io.h --- a/include/asm-sh64/io.h Wed Dec 28 14:19:42 2005 -0800 +++ b/include/asm-sh64/io.h Wed Dec 28 14:19:42 2005 -0800 @@ -125,6 +125,8 @@ void memcpy_toio(void __iomem *to, const void *from, long count); void memcpy_fromio(void *to, void __iomem *from, long count); + +#define memcpy_toio32 __memcpy_toio32 #define mmiowb() diff -r 801287704e40 -r b792638cc4bc include/asm-sparc/io.h --- a/include/asm-sparc/io.h Wed Dec 28 14:19:42 2005 -0800 +++ b/include/asm-sparc/io.h Wed Dec 28 14:19:42 2005 -0800 @@ -239,6 +239,8 @@ #define memcpy_toio(d,s,sz) _memcpy_toio(d,s,sz) +#define memcpy_toio32 __memcpy_toio32 + #ifdef __KERNEL__ /* diff -r 801287704e40 -r b792638cc4bc include/asm-sparc64/io.h --- a/include/asm-sparc64/io.h Wed Dec 28 14:19:42 2005 -0800 +++ b/include/asm-sparc64/io.h Wed Dec 28 14:19:42 2005 -0800 @@ -440,6 +440,8 @@ #define memcpy_toio(d,s,sz) _memcpy_toio(d,s,sz) +#define memcpy_toio32 __memcpy_toio32 + static inline int check_signature(void __iomem *io_addr, const unsigned char *signature, int length) diff -r 801287704e40 -r b792638cc4bc include/asm-v850/io.h --- a/include/asm-v850/io.h Wed Dec 28 14:19:42 2005 -0800 +++ b/include/asm-v850/io.h Wed Dec 28 14:19:42 2005 -0800 @@ -130,6 +130,8 @@ #define memcpy_fromio(dst, src, len) memcpy (dst, (void *)src, len) #define memcpy_toio(dst, src, len) memcpy ((void *)dst, src, len) +#define memcpy_toio32 __memcpy_toio32 + /* * Convert a physical pointer to a virtual kernel pointer for /dev/mem * access diff -r 801287704e40 -r b792638cc4bc include/asm-x86_64/io.h --- a/include/asm-x86_64/io.h Wed Dec 28 14:19:42 2005 -0800 +++ b/include/asm-x86_64/io.h Wed Dec 28 14:19:42 2005 -0800 @@ -252,6 
+252,13 @@ __memcpy_toio((unsigned long)to,from,len); } +#include + +static inline void memcpy_toio32(void __iomem *dst, const void *src, size_t count) +{ + memcpy32((void __force *) dst, src, count); +} + void memset_io(volatile void __iomem *a, int b, size_t c); /* diff -r 801287704e40 -r b792638cc4bc include/asm-xtensa/io.h --- a/include/asm-xtensa/io.h Wed Dec 28 14:19:42 2005 -0800 +++ b/include/asm-xtensa/io.h Wed Dec 28 14:19:42 2005 -0800 @@ -159,6 +159,8 @@ #define memcpy_fromio(a,b,c) memcpy((a),(void *)(b),(c)) #define memcpy_toio(a,b,c) memcpy((void *)(a),(b),(c)) +#define memcpy_toio32 __memcpy_toio32 + /* At this point the Xtensa doesn't provide byte swap instructions */ #ifdef __XTENSA_EB__ From bos at pathscale.com Wed Dec 28 16:31:26 2005 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 28 Dec 2005 16:31:26 -0800 Subject: [openib-general] [PATCH 7 of 20] ipath - MMIO copy routines In-Reply-To: Message-ID: Signed-off-by: Bryan O'Sullivan diff -r 9e8d017ed298 -r ffbd416f30d4 drivers/infiniband/hw/ipath/ipath_copy.c --- /dev/null Thu Jan 1 00:00:00 1970 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_copy.c Wed Dec 28 14:19:42 2005 -0800 @@ -0,0 +1,612 @@ +/* + * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. + */ + +/* + * This file provides support for doing sk_buff buffer swapping between + * the low level driver eager buffers, and the network layer. It's part + * of the core driver, rather than the ether driver, because it relies + * on variables and functions in the core driver. It exports a single + * entry point for use in the ipath_ether module. + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include /* we can generate our own crc's for testing */ + +#include "ipath_kernel.h" +#include "ips_common.h" +#include "ipath_layer.h" + +/* + * Allocate a PIO send buffer, initialize the header and copy it out. 
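+ * The send buffer is write-combining MMIO: the mb() barriers below,
+ * placed before and after the final "trigger" word of each chunk,
+ * are what force the WC contents out to the chip in the right order.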
+ */
+static int layer_send_getpiobuf(struct copy_data_s *cdp)
+{
+	uint32_t device = cdp->device;
+	uint32_t extra_bytes;
+	uint32_t len, nwords;
+	uint32_t __iomem *piobuf;
+
+	if (!(piobuf = ipath_getpiobuf(device, NULL))) {
+		cdp->error = -EBUSY;
+		return cdp->error;
+	}
+
+	/*
+	 * Compute the max amount of data that can fit into a PIO buffer.
+	 * buffer size - header size - trigger qword length & flags - CRC
+	 */
+	len = devdata[device].ipath_ibmaxlen -
+		sizeof(struct ether_header_typ) - 8 - (SIZE_OF_CRC << 2);
+	if (len > devdata[device].ipath_rcvegrbufsize)
+		len = devdata[device].ipath_rcvegrbufsize;
+	if (len > (cdp->len + cdp->extra))
+		len = (cdp->len + cdp->extra);
+	/* Compute word alignment (i.e., (len & 3) ? 4 - (len & 3) : 0) */
+	extra_bytes = (4 - len) & 3;
+	nwords = (sizeof(struct ether_header_typ) + len + extra_bytes) >> 2;
+	cdp->hdr->lrh[2] = htons(nwords + SIZE_OF_CRC);
+	cdp->hdr->bth[0] = htonl((OPCODE_ITH4X << 24) + (extra_bytes << 20) +
+				 IPS_DEFAULT_P_KEY);
+	cdp->hdr->sub_opcode = OPCODE_ENCAP;
+
+	cdp->hdr->bth[2] = 0;
+	/* Generate an interrupt on the receive side for the last fragment. */
+	cdp->hdr->iph.pkt_flags = ((cdp->len+cdp->extra) == len) ? INFINIPATH_KPF_INTR : 0;
+	cdp->hdr->iph.chksum = (uint16_t) IPS_LRH_BTH +
+		(uint16_t) (nwords + SIZE_OF_CRC) -
+		(uint16_t) ((cdp->hdr->iph.ver_port_tid_offset >> 16)&0xFFFF) -
+		(uint16_t) (cdp->hdr->iph.ver_port_tid_offset & 0xFFFF) -
+		(uint16_t) cdp->hdr->iph.pkt_flags;
+
+	_IPATH_VDBG("send %d (%x %x %x %x %x %x %x)\n", nwords,
+		    cdp->hdr->lrh[0], cdp->hdr->lrh[1],
+		    cdp->hdr->lrh[2], cdp->hdr->lrh[3],
+		    cdp->hdr->bth[0], cdp->hdr->bth[1], cdp->hdr->bth[2]);
+	/*
+	 * Write len to control qword, no flags.
+	 * +1 is for the qword padding of pbc.
+	 */
+	writeq(nwords + 1ULL, (uint64_t __iomem *) piobuf);
+	/* we have to flush after the PBC for correctness on some cpus
+	 * or WC buffer can be written out of order */
+	mb();
+	piobuf += 2;
+	memcpy_toio32(piobuf, cdp->hdr, sizeof(struct ether_header_typ) >> 2);
+	cdp->csum_pio = &((struct ether_header_typ __iomem *) piobuf)->csum;
+	cdp->to = piobuf + (sizeof(struct ether_header_typ) >> 2);
+	cdp->flen = nwords - (sizeof(struct ether_header_typ) >> 2);
+	cdp->hdr->frag_num++;
+	return 0;
+}
+
+/*
+ * copy the last full dword when that's the "extra" word, preceding it
+ * with a memory fence, so that all prior data is written to the PIO
+ * buffer before the trigger word, to enforce the correct bus ordering
+ * of the WC buffer contents on the bus.
+ */
+static inline unsigned copy_extra_dword(struct copy_data_s *cdp, unsigned dosum)
+{
+	if (!cdp->flen && layer_send_getpiobuf(cdp) < 0)
+		return 1;
+	/* write the checksum before the last PIO write, if requested. */
+	if (dosum && cdp->flen == 1)
+		writel(csum_fold(cdp->csum), cdp->csum_pio);
+	mb();
+	writel(cdp->u.w, cdp->to++);
+	mb();
+	cdp->extra = 0;
+	cdp->flen -= 1;
+	return 0;
+}
+
+/*
+ * copy a PIO buffer's worth (or the skb fragment, at least) to the PIO
+ * buffer, adding a memory fence before the last word. We need the fence
+ * as part of forcing the WC ordering on some cpus, for the cases where
+ * it will be the trigger word. The final fence after the trigger word
+ * will be done either at the next chunk, or on final return from the caller.
+ * Takes max byte count, returns byte count actually done (always rounded
+ * to dword multiple).
+ */ +static uint32_t copy_a_buffer(struct copy_data_s *cdp, void *p, uint32_t n, + unsigned dosum) +{ + uint32_t *p32; + + if (!cdp->flen && layer_send_getpiobuf(cdp) < 0) + return -1; + if (n > cdp->flen) + n = cdp->flen; + if (dosum && cdp->flen == n) + writel(csum_fold(cdp->csum), cdp->csum_pio); + p32 = p; + memcpy_toio32(cdp->to, p32, n-1); + cdp->to += n-1; + mb(); + writel(p32[n-1], cdp->to++); + mb(); + _IPATH_PDBG("trigger write to pio %p\n", &p32[n-1]); + cdp->flen -= n; + n <<= 2; + cdp->offset += n; + cdp->len -= n; + return n; +} + +/* + * Copy data out of one or a chain of sk_buffs, into the PIO buffer. + * Fragment an sk_buff into multiple IB packets if the amount of data is + * more than a single eager send. + * Offset and len are in bytes. + * Note that this function is recursive! + */ +static void copy_bits(const struct sk_buff *skb, unsigned int offset, + unsigned int len, struct copy_data_s *cdp) +{ + unsigned int start = skb_headlen(skb); + unsigned int i, copy; + uint32_t n; + uint8_t *p; + + /* Copy header. */ + if ((int)(copy = start - offset) > 0) { + if (copy > len) + copy = len; + p = skb->data + offset; + offset += copy; + len -= copy; + /* If the alignment buffer is not empty, fill it and write it out. */ + if (cdp->extra) { + if (cdp->extra == 4) { + if (copy_extra_dword(cdp, 0)) + return; + } + else while (copy != 0) { + cdp->u.buf[cdp->extra] = *p++; + copy--; + cdp->offset++; + cdp->len--; + + if (++cdp->extra == 4) { + if (copy_extra_dword(cdp, 0)) + return; + break; + } + } + } + while (copy >= 4) { + n = copy_a_buffer(cdp, p, copy>>2, 0); + if (n == -1) + return; + p += n; + copy -= n; + } + /* + * Either cdp->extra is zero or copy is zero which means that + * the loop here can't cause the alignment buffer to fill up. + */ + while (copy != 0) { + cdp->u.buf[cdp->extra++] = *p++; + copy--; + cdp->offset++; + cdp->len--; + + } + if (len == 0) + return; + } + + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + unsigned int end; + + end = start + frag->size; + if ((int)(copy = end - offset) > 0) { + uint8_t *vaddr; + + if (copy > len) + copy = len; + vaddr = kmap_skb_frag(frag); + p = vaddr + frag->page_offset + offset - start; + offset += copy; + len -= copy; + /* If the alignment buffer is not empty, fill it and write it out. */ + if (cdp->extra) { + if (cdp->extra == 4) { + if (copy_extra_dword(cdp, 0)) + return; + } + else while (copy != 0) { + cdp->u.buf[cdp->extra] = *p++; + copy--; + cdp->offset++; + cdp->len--; + + if (++cdp->extra == 4) { + if (copy_extra_dword(cdp, 0)) + return; + break; + } + } + } + while (copy >= 4) { + n = copy_a_buffer(cdp, p, copy>>2, 0); + if (n == -1) + return; + p += n; + copy -= n; + } + /* + * Either cdp->extra is zero or copy is zero which means that + * the loop here can't cause the alignment buffer to fill up. 
+ */ + while (copy != 0) { + cdp->u.buf[cdp->extra++] = *p++; + copy--; + cdp->offset++; + cdp->len--; + } + kunmap_skb_frag(vaddr); + + if (len == 0) + return; + } + start = end; + } + + if (skb_shinfo(skb)->frag_list) { + struct sk_buff *list = skb_shinfo(skb)->frag_list; + + for (; list; list = list->next) { + unsigned int end; + + end = start + list->len; + if ((int)(copy = end - offset) > 0) { + if (copy > len) + copy = len; + copy_bits(list, offset - start, copy, cdp); + if (cdp->error || (len -= copy) == 0) + return; + } + start = end; + } + } + if (len) + cdp->error = -EFAULT; +} + +/* + * Copy data out of one or a chain of sk_buffs, into the PIO buffer, generating + * the checksum as we go. + * Fragment an sk_buff into multiple IB packets if the amount of data is + * more than a single eager send. + * Offset and len are in bytes. + * Note that this function is recursive! + */ +static void copy_and_csum_bits(const struct sk_buff *skb, unsigned int offset, + unsigned int len, struct copy_data_s *cdp) +{ + unsigned int start = skb_headlen(skb); + unsigned int i, copy; + unsigned int csum2; + uint32_t n; + uint8_t *p; + + /* Copy header. */ + if ((int)(copy = start - offset) > 0) { + if (copy > len) + copy = len; + p = skb->data + offset; + offset += copy; + len -= copy; + if (!cdp->checksum_calc) { + cdp->checksum_calc = 1; + + csum2 = csum_partial(p, copy, 0); + cdp->csum = csum_block_add(cdp->csum, csum2, cdp->pos); + cdp->pos += copy; + } + /* If the alignment buffer is not empty, fill it and write it out. */ + if (cdp->extra) { + if (cdp->extra == 4) { + if (copy_extra_dword(cdp, 1)) + goto done; + } + else while (copy != 0) { + cdp->u.buf[cdp->extra] = *p++; + copy--; + cdp->offset++; + cdp->len--; + if (++cdp->extra == 4) { + if (copy_extra_dword(cdp, 1)) + goto done; + break; + } + } + } + + while (copy >= 4) { + n = copy_a_buffer(cdp, p, copy>>2, 1); + if (n == -1) + goto done; + p += n; + copy -= n; + } + /* + * Either cdp->extra is zero or copy is zero which means that + * the loop here can't cause the alignment buffer to fill up. + */ + while (copy != 0) { + cdp->u.buf[cdp->extra++] = *p++; + copy--; + cdp->offset++; + cdp->len--; + } + + cdp->checksum_calc = 0; + + if (len == 0) + goto done; + } + + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + unsigned int end; + + end = start + frag->size; + if ((int)(copy = end - offset) > 0) { + uint8_t *vaddr; + + if (copy > len) + copy = len; + vaddr = kmap_skb_frag(frag); + p = vaddr + frag->page_offset + offset - start; + offset += copy; + len -= copy; + + if (!cdp->checksum_calc) { + cdp->checksum_calc = 1; + + csum2 = csum_partial(p, copy, 0); + cdp->csum = csum_block_add(cdp->csum, csum2, + cdp->pos); + cdp->pos += copy; + } + /* If the alignment buffer is not empty, fill it and write it out. */ + if (cdp->extra) { + if (cdp->extra == 4) { + if (copy_extra_dword(cdp, 1)) { + kunmap_skb_frag(vaddr); + goto done; + } + } + else while (copy != 0) { + cdp->u.buf[cdp->extra] = *p++; + copy--; + cdp->offset++; + cdp->len--; + + if (++cdp->extra == 4) { + if (copy_extra_dword(cdp, 1)) { + kunmap_skb_frag(vaddr); + goto done; + } + break; + } + } + } + while (copy >= 4) { + n = copy_a_buffer(cdp, p, copy>>2, 1); + if (n == -1) { + kunmap_skb_frag(vaddr); + goto done; + } + p += n; + copy -= n; + } + /* + * Either cdp->extra is zero or copy is zero which means that + * the loop here can't cause the alignment buffer to fill up. 
+ */ + while (copy != 0) { + cdp->u.buf[cdp->extra++] = *p++; + copy--; + cdp->offset++; + cdp->len--; + } + kunmap_skb_frag(vaddr); + + cdp->checksum_calc = 0; + + if (len == 0) + goto done; + } + start = end; + } + + if (skb_shinfo(skb)->frag_list) { + struct sk_buff *list = skb_shinfo(skb)->frag_list; + + for (; list; list = list->next) { + unsigned int end; + + end = start + list->len; + if ((int)(copy = end - offset) > 0) { + if (copy > len) + copy = len; + copy_and_csum_bits(list, offset - start, copy, cdp); + if (cdp->error || (len -= copy) == 0) + goto done; + offset += copy; + } + start = end; + } + } + if (len) + cdp->error = -EFAULT; +done: + /* we have to flush after trigger word for correctness on some cpus + * or WC buffer can be written out of order; needed even if + * there was an error */ + mb(); +} + +/* + * Note that the header should have the unchanging parts + * initialized but the rest of the header is computed as needed in + * order to break up skb data buffers larger than the hardware MTU. + * In other words, the Linux network stack MTU can be larger than the + * hardware MTU. + */ +int ipath_layer_send_skb(struct copy_data_s *cdata) +{ + int ret = 0; + uint16_t vlsllnh; + int device = cdata->device; + + if (device >= infinipath_max) { + _IPATH_INFO("Invalid unit %u, failing\n", device); + return -EINVAL; + } + if (!(devdata[device].ipath_flags & IPATH_RCVHDRSZ_SET)) { + _IPATH_INFO("send while not open\n"); + ret = -EINVAL; + } + else if ((devdata[device].ipath_flags & (IPATH_LINKUNK | IPATH_LINKDOWN)) + || devdata[device].ipath_lid == 0) { + /* lid check is for when sma hasn't yet configured */ + ret = -ENETDOWN; + _IPATH_VDBG("send while not ready, mylid=%u, flags=0x%x\n", + devdata[device].ipath_lid, devdata[device].ipath_flags); + } + vlsllnh = *((uint16_t *) cdata->hdr); + if (vlsllnh != htons(IPS_LRH_BTH)) { + _IPATH_DBG("Warning: lrh[0] wrong (%x, not %x); not sending\n", + vlsllnh, htons(IPS_LRH_BTH)); + ret = -EINVAL; + } + if (ret) + goto done; + + cdata->error = 0; /* clear last calls error */ + + if (cdata->skb->ip_summed == CHECKSUM_HW) { + unsigned int csstart = cdata->skb->h.raw - cdata->skb->data; + + /* + * Computing the checksum is a bit tricky since if we fragment + * the packet, the fragment that should contain the checksum + * will have already been sent. The solution is to store the checksum + * in the header of the last fragment just before we write the + * last data word which triggers the last fragment to be sent. + * The receiver will check the header "tag" field, see that + * there is a checksum, and store the checksum back into the packet. + * + * Save the offset of the two byte checksum. + * Note that we have to add 2 to account for the two bytes of the + * ethernet address we stripped from the packet and put in the header. 
+ */ + cdata->hdr->csum_offset = csstart + cdata->skb->csum + 2; + + if (cdata->offset < csstart) + copy_bits(cdata->skb, cdata->offset, + csstart - cdata->offset, cdata); + + if (cdata->error) { + ret = cdata->error; + goto done; + } + + if (cdata->offset < cdata->skb->len) + copy_and_csum_bits(cdata->skb, cdata->offset, + cdata->skb->len - cdata->offset, cdata); + + if (cdata->error) { + ret = cdata->error; + goto done; + } + + if (cdata->extra) { + while (cdata->extra < 4) + cdata->u.buf[cdata->extra++] = 0; + (void)copy_extra_dword(cdata, 1); + } + } + else { + copy_bits(cdata->skb, cdata->offset, + cdata->skb->len - cdata->offset, cdata); + + if (cdata->error) { + ret = cdata->error; + goto done; + } + + if (cdata->extra) { + while (cdata->extra < 4) + cdata->u.buf[cdata->extra++] = 0; + (void)copy_extra_dword(cdata, 1); + } + } + + if (cdata->error) { + ret = cdata->error; + if (cdata->error != -EBUSY) + _IPATH_UNIT_ERROR(device, + "layer_send copy_bits failed with error %d\n", + -ret); + } + + ipath_stats.sps_ether_spkts++; /* another ether packet sent */ + +done: + /* we have to flush after trigger word for correctness on some cpus + * or WC buffer can be written out of order; needed even if + * there was an error */ + mb(); + return ret; +} + +EXPORT_SYMBOL(ipath_layer_send_skb); + From bos at pathscale.com Wed Dec 28 16:31:27 2005 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 28 Dec 2005 16:31:27 -0800 Subject: [openib-general] [PATCH 8 of 20] ipath - core driver, part 1 of 4 In-Reply-To: Message-ID: Signed-off-by: Bryan O'Sullivan diff -r ffbd416f30d4 -r ddd21709e12c drivers/infiniband/hw/ipath/ipath_driver.c --- /dev/null Thu Jan 1 00:00:00 1970 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Wed Dec 28 14:19:42 2005 -0800 @@ -0,0 +1,1879 @@ +/* + * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. 
+ */ + +#include +#include +#include +#include +#include +#include + +#include /* we can generate our own crc's for testing */ + +#include "ipath_kernel.h" +#include "ips_common.h" +#include "ipath_layer.h" + +/* + * Our LSB-assigned major number, so scripts can figure + * out how to make entry in /dev. + */ + +static int ipath_major = 233; + +/* + * number of buffers reserved for driver (layered drivers and SMA send). + * Reserved at end of buffer list. + */ + +static uint infinipath_kpiobufs = 32; + +/* + * number of ports we are configured to use (to allow for more pio + * buffers per port, etc.) Zero means use chip value. + */ + +static uint infinipath_cfgports; + +/* + * number of units we are configured to use (to allow for bringup on + * multi-chip systems) Zero means use only one for now, but eventually + * will mean to use infinipath_max + */ + +static uint infinipath_cfgunits; + +uint64_t ipath_dummy_val_for_testing; + +static __kernel_pid_t ipath_sma_alive; /* PID of SMA, if it's running */ +static spinlock_t ipath_sma_lock; /* SMA receive */ + +/* max SM received packets we'll queue; we keep the most recent packets. */ + +#define IPATH_NUM_SMAPKTS 16 + +#define IPATH_SMA_HDRSZ (8+12+8) /* LRH+BTH+DETH */ + +static struct _ipath_sma_rpkt { + /* length of received packet; non-zero if queued */ + uint32_t len; + /* unit number of interface packet was received from */ + uint32_t unit; + uint8_t *buf; +} ipath_sma_data[IPATH_NUM_SMAPKTS]; + +static unsigned ipath_sma_first; /* oldest sma packet index */ +static unsigned ipath_sma_next; /* next sma packet index to use */ + +/* + * ipath_sma_data_bufs has one extra, pointed to by ipath_sma_data_spare, + * so we can exchange buffers to do copy_to_user, and not hold the lock + * across the copy_to_user(). + */ + +#define SMA_MAX_PKTSZ (IPATH_SMA_HDRSZ+256) /* max len of an SMA packet */ + +static uint8_t ipath_sma_data_bufs[IPATH_NUM_SMAPKTS + 1][SMA_MAX_PKTSZ]; +static uint8_t *ipath_sma_data_spare; +/* sma waits globally on all units */ +static wait_queue_head_t ipath_sma_wait; +static wait_queue_head_t ipath_sma_state_wait; + +struct infinipath_stats ipath_stats; + +/* + * this will only be used for diags, now that we have enabled the DMA + * of the sendpioavail regs to system memory. + */ + +static inline uint64_t ipath_kget_sreg(const ipath_type stype, + ipath_sreg regno) +{ + uint64_t val; + uint64_t *sbase; + + sbase = (uint64_t *) (devdata[stype].ipath_sregbase + + (char *)devdata[stype].ipath_kregbase); + val = sbase ? 
sbase[regno] : 0ULL; + return val; +} + +static int ipath_do_user_init(struct ipath_portdata *, + struct ipath_user_info __user *); +static int ipath_get_baseinfo(struct ipath_portdata *, + struct ipath_base_info __user *); +static int ipath_get_units(void); +static int ipath_wr_eeprom(struct ipath_portdata *, + struct ipath_eeprom_req __user *); +static int ipath_wait_intr(struct ipath_portdata *, uint32_t); +static int ipath_tid_update(struct ipath_portdata *, struct _tidupd __user *); +static int ipath_tid_free(struct ipath_portdata *, struct _tidupd __user *); +static int ipath_get_counters(ipath_type, struct infinipath_counters __user *); +static int ipath_get_unit_counters(struct infinipath_getunitcounters __user *a); +static int ipath_get_stats(struct infinipath_stats __user *); +static int ipath_set_partkey(struct ipath_portdata *, uint16_t); +static int ipath_manage_rcvq(struct ipath_portdata *, uint16_t); +static void ipath_clean_partkey(struct ipath_portdata *, + struct ipath_devdata *); +static void ipath_disarm_piobufs(const ipath_type, unsigned, unsigned); +static int ipath_create_user_egr(struct ipath_portdata *); +static int ipath_create_port0_egr(struct ipath_portdata *); +static int ipath_create_rcvhdrq(struct ipath_portdata *); +static void ipath_handle_errors(const ipath_type, uint64_t); +static void ipath_update_pio_bufs(const ipath_type); +static int ipath_shutdown_link(const ipath_type); +static int ipath_bringup_link(const ipath_type); +int ipath_bringup_serdes(const ipath_type); +static void ipath_get_faststats(unsigned long); +static int ipath_setup_htconfig(struct pci_dev *, uint64_t *, const ipath_type); +static struct page *ipath_nopage(struct vm_area_struct *, unsigned long, int *); +static irqreturn_t ipath_intr(int irq, void *devid, struct pt_regs *regs); +static void ipath_decode_err(char *, size_t, uint64_t); +void ipath_free_pddata(struct ipath_devdata *, uint32_t, int); +static void ipath_clear_tids(const ipath_type, unsigned); +static void ipath_get_guid(const ipath_type); +static int ipath_sma_ioctl(struct file *, unsigned int, unsigned long); +static int ipath_rcvsma_pkt(struct ipath_sendpkt __user *); +static int ipath_kset_lid(uint32_t); +static int ipath_kset_mlid(uint32_t); +static int ipath_get_mlid(uint32_t __user *); +static int ipath_get_devstatus(uint64_t __user *); +static int ipath_kset_guid(struct ipath_setguid __user *); +static int ipath_get_portinfo(uint32_t __user *); +static int ipath_get_nodeinfo(uint32_t __user *); +#ifdef _IPATH_EXTRA_DEBUG +static void ipath_dump_allregs(char *, ipath_type); +#endif + +static const char ipath_sma_name[] = "infinipath_SMA"; + +/* + * is diags mode enabled? if it is, then things like auto bringup of + * links is disabled + */ + +int ipath_diags_enabled = 0; + +void ipath_chip_done(void) +{ +} + +void ipath_chip_cleanup(struct ipath_devdata * dd) +{ +} + +/* + * cache aligned location + * + * where port 0 rcvhdrtail register is written back; also want + * nothing else sharing the cache line, so make it a cache line in size + * used for all units + * + * This is volatile as it's the target of a DMA from the chip. 
+ */ + +static volatile uint64_t ipath_port0_rcvhdrtail[512] + __attribute__ ((aligned(4096))); + +#define MODNAME "ipath_core" +#define DRIVER_LOAD_MSG "PathScale " MODNAME " loaded: " +#define PFX MODNAME ": " + +/* + * min buffers we want to have per port, after driver + */ + +#define IPATH_MIN_USER_PORT_BUFCNT 8 + +/* The size has to be longer than this string, so we can + * append board/chip information to it in the init code. + */ +static char ipath_core_version[192] = IPATH_IDSTR; +static char *chip_driver_version; +static int chip_driver_size; + +/* mylid and lidbase are to deal with LIDs in "fabric", until SM is working */ + +module_param(infinipath_debug, uint, 0644); +module_param(infinipath_kpiobufs, uint, 0644); +module_param(infinipath_cfgports, uint, 0644); +module_param(infinipath_cfgunits, uint, 0644); + +MODULE_PARM_DESC(infinipath_debug, "mask for debug prints"); +MODULE_PARM_DESC(infinipath_cfgports, "Set max number of ports to use"); +MODULE_PARM_DESC(infinipath_cfgunits, "Set max number of devices to use"); + +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("PathScale "); +MODULE_DESCRIPTION("Pathscale InfiniPath driver"); + +#ifdef IPATH_DIAG +static __kernel_pid_t ipath_diag_alive; /* PID of diags, if running */ +int ipath_diags_ioctl(struct file *, unsigned, unsigned long); +static int ipath_opendiag(struct inode *, struct file *); +#endif + +#if __IPATH_INFO || __IPATH_DBG +static const char *ipath_ibcstatus_str[] = { + "Disabled", + "LinkUp", + "PollActive", + "PollQuiet", + "SleepDelay", + "SleepQuiet", + "LState6", /* unused */ + "LState7", /* unused */ + "CfgDebounce", + "CfgRcvfCfg", + "CfgWaitRmt", + "CfgIdle", + "RecovRetrain", + "LState0xD", /* unused */ + "RecovWaitRmt", + "RecovIdle", +}; +#endif + +static ssize_t show_version(struct device_driver *dev, char *buf) +{ + return snprintf(buf, PAGE_SIZE, "%s", ipath_core_version); +} + +static ssize_t show_status(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct ipath_devdata *dd = dev_get_drvdata(dev); + + if (!dd) + return -EINVAL; + + if (!dd->ipath_statusp) + return -EINVAL; + + return snprintf(buf, PAGE_SIZE, "%llx\n", *(dd->ipath_statusp)); +} + +static const char *ipath_status_str[] = { + "Initted", + "Disabled", + "4", /* unused */ + "OIB_SMA", + "SMA", + "Present", + "IB_link_up", + "IB_configured", + "NoIBcable", + "Fatal_Hardware_Error", + NULL, +}; + +static ssize_t show_status_str(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct ipath_devdata *dd = dev_get_drvdata(dev); + int i, any; + uint64_t s; + + if (!dd) + return -EINVAL; + + if (!dd->ipath_statusp) + return -EINVAL; + + s = *(dd->ipath_statusp); + *buf = '\0'; + for (any = i = 0; s && ipath_status_str[i]; i++) { + if (s & 1) { + if (any && strlcat(buf, " ", PAGE_SIZE) >= PAGE_SIZE) + /* overflow */ + break; + if (strlcat(buf, ipath_status_str[i], + PAGE_SIZE) >= PAGE_SIZE) + break; + any = 1; + } + s >>= 1; + } + if (any) + strlcat(buf, "\n", PAGE_SIZE); + + return strlen(buf); +} + +static ssize_t show_lid(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct ipath_devdata *dd = dev_get_drvdata(dev); + + if (!dd) + return -EINVAL; + + return snprintf(buf, PAGE_SIZE, "%x\n", dd->ipath_lid); +} + +static ssize_t show_mlid(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct ipath_devdata *dd = dev_get_drvdata(dev); + + if (!dd) + return -EINVAL; + + return snprintf(buf, PAGE_SIZE, "%x\n", dd->ipath_mlid); +} + +static ssize_t show_guid(struct 
device *dev, + struct device_attribute *attr, + char *buf) +{ + struct ipath_devdata *dd = dev_get_drvdata(dev); + uint8_t *guid; + + if (!dd) + return -EINVAL; + + guid = (uint8_t *)&(dd->ipath_guid); + + return snprintf(buf, PAGE_SIZE, "%x:%x:%x:%x:%x:%x:%x:%x\n", + guid[0], guid[1], guid[2], guid[3], guid[4], guid[5], + guid[6], guid[7]); +} + +static ssize_t show_nguid(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct ipath_devdata *dd = dev_get_drvdata(dev); + + if (!dd) + return -EINVAL; + + return snprintf(buf, PAGE_SIZE, "%u\n", dd->ipath_nguid); +} + +static ssize_t show_serial(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct ipath_devdata *dd = dev_get_drvdata(dev); + + if (!dd) + return -EINVAL; + + buf[sizeof dd->ipath_serial] = '\0'; + memcpy(buf, dd->ipath_serial, sizeof dd->ipath_serial); + strcat(buf, "\n"); + return strlen(buf); +} + +static ssize_t show_unit(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct ipath_devdata *dd = dev_get_drvdata(dev); + + if (!dd) + return -EINVAL; + + snprintf(buf, PAGE_SIZE, "%u\n", dd->ipath_unit); + return strlen(buf); +} + +static DRIVER_ATTR(version, S_IRUGO, show_version, NULL); +static DEVICE_ATTR(status, S_IRUGO, show_status, NULL); +static DEVICE_ATTR(status_str, S_IRUGO, show_status_str, NULL); +static DEVICE_ATTR(lid, S_IRUGO, show_lid, NULL); +static DEVICE_ATTR(mlid, S_IRUGO, show_mlid, NULL); +static DEVICE_ATTR(guid, S_IRUGO, show_guid, NULL); +static DEVICE_ATTR(nguid, S_IRUGO, show_nguid, NULL); +static DEVICE_ATTR(serial, S_IRUGO, show_serial, NULL); +static DEVICE_ATTR(unit, S_IRUGO, show_unit, NULL); + +/* + * called from add_timer and user counter read calls, to deal with + * counters that wrap in "human time". The words sent and received, and + * the packets sent and received are all that we worry about. For now, + * at least, we don't worry about error counters, because if they wrap + * that quickly, we probably don't care. We may eventually just make this + * handle all the counters. word counters can wrap in about 20 seconds + * of full bandwidth traffic, packet counters in a few hours. + */ + +uint64_t ipath_snap_cntr(const ipath_type t, ipath_creg creg) +{ + uint32_t val; + uint64_t val64, t0, t1; + struct ipath_devdata *dd = &devdata[t]; + static uint64_t one_sec_in_cycles; + extern uint32_t _ipath_pico_per_cycle; + + if (!one_sec_in_cycles && _ipath_pico_per_cycle) + one_sec_in_cycles = 1000000000000UL / _ipath_pico_per_cycle; + + t0 = get_cycles(); + val = ipath_kget_creg32(t, creg); + t1 = get_cycles(); + if ((t1 - t0) > one_sec_in_cycles && val == -1) { + /* + * This is just a way to detect things that are quite broken. + * Normally this should take just a few cycles (the check is + * for long enough that we don't care if we get pre-empted.) + * An Opteron HT O read timeout is 4 seconds with normal + * NB values + */ + + _IPATH_UNIT_ERROR(t, "Error! 
Reading counter 0x%x timed out\n", + creg); + return 0ULL; + } + + if (creg == cr_wordsendcnt) { + if (val != dd->ipath_lastsword) { + dd->ipath_sword += val - dd->ipath_lastsword; + dd->ipath_lastsword = val; + } + val64 = dd->ipath_sword; + } else if (creg == cr_wordrcvcnt) { + if (val != dd->ipath_lastrword) { + dd->ipath_rword += val - dd->ipath_lastrword; + dd->ipath_lastrword = val; + } + val64 = dd->ipath_rword; + } else if (creg == cr_pktsendcnt) { + if (val != dd->ipath_lastspkts) { + dd->ipath_spkts += val - dd->ipath_lastspkts; + dd->ipath_lastspkts = val; + } + val64 = dd->ipath_spkts; + } else if (creg == cr_pktrcvcnt) { + if (val != dd->ipath_lastrpkts) { + dd->ipath_rpkts += val - dd->ipath_lastrpkts; + dd->ipath_lastrpkts = val; + } + val64 = dd->ipath_rpkts; + } else + val64 = (uint64_t) val; + + return val64; +} + +/* + * print the delta of egrfull/hdrqfull errors for kernel ports no more + * than every 5 seconds. User processes are printed at close, but kernel + * doesn't close, so... Separate routine so it may be called from other places + * someday, and so the function name is meaningful when printed by _IPATH_INFO + */ + +static void ipath_qcheck(const ipath_type t) +{ + static uint64_t last_tot_hdrqfull; + size_t blen = 0; + struct ipath_devdata *dd = &devdata[t]; + char buf[128]; + + *buf = 0; + if (dd->ipath_pd[0]->port_hdrqfull != dd->ipath_p0_hdrqfull) { + blen = snprintf(buf, sizeof buf, "port 0 hdrqfull %u", + dd->ipath_pd[0]->port_hdrqfull - + dd->ipath_p0_hdrqfull); + dd->ipath_p0_hdrqfull = dd->ipath_pd[0]->port_hdrqfull; + } + if (ipath_stats.sps_etidfull != dd->ipath_last_tidfull) { + blen += + snprintf(buf + blen, sizeof buf - blen, "%srcvegrfull %llu", + blen ? ", " : "", + ipath_stats.sps_etidfull - dd->ipath_last_tidfull); + dd->ipath_last_tidfull = ipath_stats.sps_etidfull; + } + + /* + * this is actually the number of hdrq full interrupts, not actual + * events, but at the moment that's mostly what I'm interested in. + * Actual count, etc. is in the counters, if needed. For production + * users this won't ordinarily be printed. + */ + + if ((infinipath_debug & (__IPATH_PKTDBG | __IPATH_DBG)) && + ipath_stats.sps_hdrqfull != last_tot_hdrqfull) { + blen += + snprintf(buf + blen, sizeof buf - blen, + "%shdrqfull %llu (all ports)", blen ? ", " : "", + ipath_stats.sps_hdrqfull - last_tot_hdrqfull); + last_tot_hdrqfull = ipath_stats.sps_hdrqfull; + } + if (blen) + _IPATH_DBG("%s\n", buf); + + if (*dd->ipath_hdrqtailptr != dd->ipath_port0head) { + if (dd->ipath_lastport0rcv_cnt == ipath_stats.sps_port0pkts) { + _IPATH_PDBG("missing rcv interrupts? 
port0 hd=%llx tl=%x; port0pkts %llx\n", + *dd->ipath_hdrqtailptr, dd->ipath_port0head,ipath_stats.sps_port0pkts); + ipath_kreceive(t); + } + dd->ipath_lastport0rcv_cnt = ipath_stats.sps_port0pkts; + } +} + +/* + * called from add_timer to get word counters from chip before they + * can overflow + */ + +static void ipath_get_faststats(unsigned long t) +{ + uint32_t val; + struct ipath_devdata *dd = &devdata[t]; + static unsigned cnt; + + /* + * don't access the chip while running diags, or memory diags + * can fail + */ + if (!dd->ipath_kregbase || !(dd->ipath_flags & IPATH_PRESENT) || + ipath_diags_enabled) { + /* but re-arm the timer, for diags case; won't hurt other */ + goto done; + } + + ipath_snap_cntr((ipath_type) t, cr_wordsendcnt); + ipath_snap_cntr((ipath_type) t, cr_wordrcvcnt); + ipath_snap_cntr((ipath_type) t, cr_pktsendcnt); + ipath_snap_cntr((ipath_type) t, cr_pktrcvcnt); + + ipath_qcheck(t); + + /* + * deal with repeat error suppression. Doesn't really matter if + * last error was almost a full interval ago, or just a few usecs + * ago; still won't get more than 2 per interval. We may want + * longer intervals for this eventually, could do with mod, counter + * or separate timer. Also see code in ipath_handle_errors() and + * ipath_handle_hwerrors(). + */ + + if (dd->ipath_lasterror) + dd->ipath_lasterror = 0; + if (dd->ipath_lasthwerror) + dd->ipath_lasthwerror = 0; + if ((devdata[t].ipath_maskederrs & ~devdata[t].ipath_ignorederrs) + && get_cycles() > devdata[t].ipath_unmasktime) { + char ebuf[256]; + ipath_decode_err(ebuf, sizeof ebuf, + (devdata[t].ipath_maskederrs & ~devdata[t]. + ipath_ignorederrs)); + if ((devdata[t].ipath_maskederrs & ~devdata[t]. + ipath_ignorederrs) + & ~(INFINIPATH_E_RRCVEGRFULL | INFINIPATH_E_RRCVHDRFULL)) { + _IPATH_UNIT_ERROR(t, "Re-enabling masked errors (%s)\n", + ebuf); + } else { + /* + * rcvegrfull and rcvhdrqfull are "normal", + * for some types of processes (mostly benchmarks) + * that send huge numbers of messages, while + * not processing them. So only complain about + * these at debug level. 
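
The ipath_snap_cntr() routine above folds each raw 32-bit hardware reading into a 64-bit software total. A self-contained sketch of that folding (demo_* names hypothetical, not driver API):

    #include <stdint.h>

    struct demo_cntr {
            uint32_t last;   /* previous raw 32-bit hardware value */
            uint64_t total;  /* accumulated 64-bit count */
    };

    static uint64_t demo_snap(struct demo_cntr *c, uint32_t hw)
    {
            /* unsigned subtraction is wrap-safe provided the counter
             * advances by less than 2^32 between snapshots */
            c->total += (uint32_t)(hw - c->last);
            c->last = hw;
            return c->total;
    }

That proviso is why the timer resnapshots every 5 seconds, comfortably inside the roughly 20 second wrap time quoted above for the word counters.
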
+ */ + _IPATH_DBG + ("Disabling frequent queue full errors (%s)\n", + ebuf); + } + devdata[t].ipath_maskederrs = devdata[t].ipath_ignorederrs; + ipath_kput_kreg(t, kr_errormask, ~devdata[t].ipath_maskederrs); + } + + if (dd->ipath_flags & IPATH_LINK_SLEEPING) { + uint64_t ibc; + _IPATH_VDBG("linkinitcmd SLEEP, move to POLL\n"); + dd->ipath_flags &= ~IPATH_LINK_SLEEPING; + ibc = dd->ipath_ibcctrl; + /* + * don't put linkinitcmd in ipath_ibcctrl, want that to + * stay a NOP + */ + ibc |= + INFINIPATH_IBCC_LINKINITCMD_POLL << + INFINIPATH_IBCC_LINKINITCMD_SHIFT; + ipath_kput_kreg(t, kr_ibcctrl, ibc); + } + + /* limit qfull messages to ~one per minute per port */ + if ((++cnt & 0x10)) { + for (val = devdata[t].ipath_cfgports - 1; ((int)val) >= 0; + val--) { + if (dd->ipath_lastegrheads[val] != -1) + dd->ipath_lastegrheads[val] = -1; + if (dd->ipath_lastrcvhdrqtails[val] != -1) + dd->ipath_lastrcvhdrqtails[val] = -1; + } + } + + if (dd->ipath_nosma_bufs) { + dd->ipath_nosma_secs += 5; + if (dd->ipath_nosma_secs >= 30) { + _IPATH_SMADBG("No SMA bufs avail %u seconds; cancelling pending sends\n", + dd->ipath_nosma_secs); + ipath_disarm_piobufs(t, dd->ipath_lastport_piobuf, + dd->ipath_piobcnt - dd->ipath_lastport_piobuf); + dd->ipath_nosma_secs = 0; /* start again, if necessary */ + } + else + _IPATH_SMADBG("No SMA bufs avail %u tries, after %u seconds\n", + dd->ipath_nosma_bufs, dd->ipath_nosma_secs); + } + +done: + mod_timer(&dd->ipath_stats_timer, jiffies + HZ * 5); +} + + +static void __devexit infinipath_remove_one(struct pci_dev *); +static int infinipath_init_one(struct pci_dev *, const struct pci_device_id *); + +/* Only needed for registration, nothing else needs this info */ +#define PCI_VENDOR_ID_PATHSCALE 0x1fc1 +#define PCI_DEVICE_ID_PATHSCALE_INFINIPATH_HT 0xd + +const struct pci_device_id infinipath_pci_tbl[] = { + { + PCI_VENDOR_ID_PATHSCALE, PCI_DEVICE_ID_PATHSCALE_INFINIPATH_HT, + PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0}, + {0,} +}; + +MODULE_DEVICE_TABLE(pci, infinipath_pci_tbl); + +static struct pci_driver infinipath_driver = { + .name = MODNAME, + .driver.owner = THIS_MODULE, + .probe = infinipath_init_one, + .remove = __devexit_p(infinipath_remove_one), + .id_table = infinipath_pci_tbl, +}; + +#if defined (pgprot_writecombine) && defined(_PAGE_MA_WC) +int remap_area_pages(unsigned long address, unsigned long phys_addr, + unsigned long size, unsigned long flags); +#endif + +static int infinipath_init_one(struct pci_dev *pdev, + const struct pci_device_id *ent) +{ + int ret, len, j; + static int chip_idx = -1; + unsigned long addr; + uint64_t intconfig; + uint8_t rev; + ipath_type dev; + + /* + * XXX: Right now, we have a hardcoded array of devices. We'll + * change this in a future release, but not just yet. For the + * moment, we're limited to 4 infinipath devices per system. 
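
The probe path below is driven by infinipath_pci_tbl above; conceptually, the PCI core does a linear match of each device's IDs against the table, roughly like this user-space sketch (demo_* names hypothetical):

    #include <stdint.h>

    struct demo_pci_id { uint16_t vendor, device; };

    static const struct demo_pci_id demo_tbl[] = {
            { 0x1fc1, 0xd },        /* PathScale HT-400 */
            { 0, 0 },               /* terminator */
    };

    static const struct demo_pci_id *demo_match(uint16_t ven, uint16_t dev)
    {
            const struct demo_pci_id *id;
            for (id = demo_tbl; id->vendor; id++)
                    if (id->vendor == ven && id->device == dev)
                            return id;
            return 0;               /* no match: probe is not called */
    }
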
+ */ + + dev = ++chip_idx; + + _IPATH_VDBG("initializing unit #%u\n", dev); + if ((!infinipath_cfgunits && (dev >= 1)) || + (infinipath_cfgunits && (dev >= infinipath_cfgunits)) || + (dev >= infinipath_max)) { + _IPATH_ERROR("Trying to initialize unit %u, max is %u\n", + dev, infinipath_max - 1); + return -EINVAL; + } + + devdata[dev].pci_registered = 1; + devdata[dev].ipath_unit = dev; + + if ((ret = pci_enable_device(pdev))) { + _IPATH_DBG("pci_enable unit %u failed: %x\n", dev, ret); + } + + if ((ret = pci_request_regions(pdev, MODNAME))) + _IPATH_INFO("pci_request_regions unit %u fails: %d\n", dev, + ret); + + if ((ret = pci_set_dma_mask(pdev, DMA_64BIT_MASK)) != 0) + _IPATH_INFO("pci_set_dma_mask unit %u fails: %d\n", dev, ret); + + pci_set_master(pdev); /* probably not be needed for HT */ + + addr = pci_resource_start(pdev, 0); + len = pci_resource_len(pdev, 0); + _IPATH_VDBG + ("regbase (0) %lx len %d irq %x, vend %x/%x driver_data %lx\n", + addr, len, pdev->irq, ent->vendor, ent->device, ent->driver_data); + devdata[dev].ipath_deviceid = ent->device; /* save for later use */ + devdata[dev].ipath_vendorid = ent->vendor; + for (j = 0; j < 6; j++) { + if (!pdev->resource[j].start) + continue; + _IPATH_VDBG("BAR %d start %lx, end %lx, len %lx\n", + j, pdev->resource[j].start, + pdev->resource[j].end, pci_resource_len(pdev, j)); + } + + if (!addr) { + _IPATH_UNIT_ERROR(dev, "No valid address in BAR 0!\n"); + return -ENODEV; + } + + if ((ret = pci_read_config_byte(pdev, PCI_REVISION_ID, &rev))) { + _IPATH_UNIT_ERROR(dev, + "Failed to read PCI revision ID unit %u: %d\n", + dev, ret); + return ret; /* shouldn't ever happen */ + } else + devdata[dev].ipath_pcirev = rev; + + devdata[dev].ipath_kregbase = ioremap_nocache(addr, len); +#if defined (pgprot_writecombine) && defined(_PAGE_MA_WC) + printk("Remapping pages WC\n"); + remap_area_pages((unsigned long) devdata[dev].ipath_kregbase + + 1024 * 1024, addr + 1024 * 1024, 1024 * 1024, + _PAGE_MA_WC); + /* devdata[dev].ipath_kregbase = __ioremap(addr, len, _PAGE_MA_WC); */ +#endif + + if (!devdata[dev].ipath_kregbase) { + _IPATH_DBG("Unable to map io addr %lx to kvirt, failing\n", + addr); + ret = -ENOMEM; + goto fail; + } + devdata[dev].ipath_kregend = (uint64_t __iomem *) + ((void __iomem *) devdata[dev].ipath_kregbase + len); + devdata[dev].ipath_physaddr = addr; /* used for io_remap, etc. */ + /* for user mmap */ + devdata[dev].ipath_kregvirt = (uint64_t __iomem *) phys_to_virt(addr); + _IPATH_VDBG("mapped io addr %lx to kregbase %p kregvirt %p\n", addr, + devdata[dev].ipath_kregbase, devdata[dev].ipath_kregvirt); + + /* + * set these up before registering the interrupt handler, just + * in case + */ + devdata[dev].pcidev = pdev; + pci_set_drvdata(pdev, &(devdata[dev])); + + /* + * set up our interrupt handler; SA_SHIRQ probably not needed, + * but won't hurt for now. + */ + + if (!pdev->irq) { + _IPATH_UNIT_ERROR(dev, "irq is 0, failing init\n"); + ret = -EINVAL; + goto fail; + } + if ((ret = request_irq(pdev->irq, ipath_intr, + SA_SHIRQ, MODNAME, &devdata[dev]))) { + _IPATH_UNIT_ERROR(dev, + "Couldn't setup interrupt handler, irq=%u: %d\n", + pdev->irq, ret); + goto fail; + } + + /* + * clear ipath_flags here instead of in ipath_init_chip as it is set + * by ipath_setup_htconfig. 
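
Note that the probe above intentionally logs and carries on after pci_enable_device() or pci_request_regions() failures rather than bailing out immediately. The more conventional shape unwinds each completed step on failure; a sketch under the 2.6-era PCI API (the demo name and error labels are hypothetical):

    static int demo_probe(struct pci_dev *pdev)
    {
            void __iomem *regs;
            int ret;

            ret = pci_enable_device(pdev);
            if (ret)
                    return ret;
            ret = pci_request_regions(pdev, "demo");
            if (ret)
                    goto err_disable;
            regs = ioremap_nocache(pci_resource_start(pdev, 0),
                                   pci_resource_len(pdev, 0));
            if (!regs) {
                    ret = -ENOMEM;
                    goto err_regions;
            }
            return 0;

    err_regions:
            pci_release_regions(pdev);
    err_disable:
            pci_disable_device(pdev);
            return ret;
    }
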
+ */ + devdata[dev].ipath_flags = 0; + if (ipath_setup_htconfig(pdev, &intconfig, dev)) + _IPATH_DBG + ("Failed to setup HT config, continuing anyway for now\n"); + + ret = ipath_init_chip(dev); /* do the chip-specific init */ + if (!ret) { +#ifdef CONFIG_MTRR + uint64_t pioaddr, piolen; + unsigned bits; + /* + * Set the PIO buffers to be WCCOMB, so we get HT bursts + * to the chip. Linux (possibly the hardware) requires + * it to be on a power of 2 address matching the length + * (which has to be a power of 2). For rev1, that means + * the base address, for rev2, it will be just the PIO + * buffers themselves. + */ + pioaddr = addr + devdata[dev].ipath_piobufbase; + piolen = devdata[dev].ipath_piobcnt * + ALIGN(devdata[dev].ipath_piosize, + devdata[dev].ipath_palign); + + for (bits = 0; !(piolen & (1ULL << bits)); bits++) + /* do nothing */; + + if (piolen != (1ULL << bits)) { + _IPATH_DBG("piolen 0x%llx not power of 2, bits=%u\n", + piolen, bits); + piolen >>= bits; + while (piolen >>= 1) + bits++; + piolen = 1ULL << (bits + 1); + _IPATH_DBG("Changed piolen to 0x%llx bits=%u\n", piolen, + bits); + } + if (pioaddr & (piolen - 1)) { + uint64_t atmp; + _IPATH_DBG + ("pioaddr %llx not on right boundary for size %llx, fixing\n", + pioaddr, piolen); + atmp = pioaddr & ~(piolen - 1); + if (atmp < addr || (atmp + piolen) > (addr + len)) { + _IPATH_UNIT_ERROR(dev, + "No way to align address/size (%llx/%llx), no WC mtrr\n", + atmp, piolen << 1); + ret = -ENODEV; + } else { + _IPATH_DBG + ("changing WC base from %llx to %llx, len from %llx to %llx\n", + pioaddr, atmp, piolen, piolen << 1); + pioaddr = atmp; + piolen <<= 1; + } + } + + if (!ret) { + int cookie; + _IPATH_VDBG + ("Setting mtrr for chip to WC (addr %llx, len=0x%llx)\n", + pioaddr, piolen); + cookie = mtrr_add(pioaddr, piolen, MTRR_TYPE_WRCOMB, 0); + if (cookie < 0) { + _IPATH_INFO + ("mtrr_add(%llx,0x%llx,WC,0) failed (%d)\n", + pioaddr, piolen, cookie); + ret = -EINVAL; + } else { + _IPATH_VDBG + ("Set mtrr for chip to WC, cookie is %d\n", + cookie); + devdata[dev].ipath_mtrr = (uint32_t) cookie; + } + } +#endif /* CONFIG_MTRR */ + } + + if (!ret && devdata[dev].ipath_kregbase && (devdata[dev].ipath_flags + & IPATH_PRESENT)) { + /* + * for the hardware, enable interrupts only after + * kr_interruptconfig is written, if we could set it up + */ + if (intconfig) { + /* interrupt address */ + ipath_kput_kreg(dev, kr_interruptconfig, intconfig); + /* enable all interrupts */ + ipath_kput_kreg(dev, kr_intmask, -1LL); + /* force re-interrupt of any pending interrupts. */ + ipath_kput_kreg(dev, kr_intclear, 0ULL); + /* OK, the chip is usable, marked it as initialized */ + *devdata[dev].ipath_statusp |= IPATH_STATUS_INITTED; + } else + _IPATH_UNIT_ERROR(dev, + "No interrupts enabled, couldn't setup interrupt address\n"); + } else if (ret != -EPERM) + _IPATH_INFO("Not configuring unit %u interrupts, init failed\n", + dev); + + device_create_file(&(pdev->dev), &dev_attr_status); + device_create_file(&(pdev->dev), &dev_attr_status_str); + device_create_file(&(pdev->dev), &dev_attr_lid); + device_create_file(&(pdev->dev), &dev_attr_mlid); + device_create_file(&(pdev->dev), &dev_attr_guid); + device_create_file(&(pdev->dev), &dev_attr_nguid); + device_create_file(&(pdev->dev), &dev_attr_serial); + device_create_file(&(pdev->dev), &dev_attr_unit); + + /* + * We used to cleanup here, with pci_release_regions, etc. but that + * can cause other problems if we want to run diags, etc., so instead + * defer that until driver unload. 
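
The MTRR logic above works because a write-combining range must be a power of two in size and base-aligned to that size; the two computations, separated out as a sketch (demo_* names hypothetical):

    #include <stdint.h>

    /* round len up to the next power of two (len must be nonzero) */
    static uint64_t demo_pow2up(uint64_t len)
    {
            uint64_t p = 1;
            while (p < len)
                    p <<= 1;
            return p;
    }

    /* usable as a WC region only if base is aligned to the size */
    static int demo_wc_ok(uint64_t base, uint64_t len)
    {
            uint64_t size = demo_pow2up(len);
            return (base & (size - 1)) == 0;
    }
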
+ */ + +fail: /* after we've done at least some of the pci setup */ + if (ret == -EPERM) /* disabled device, don't want module load error; + * just want to carry status through to this point */ + ret = 0; + + return ret; +} + + + +#define HT_CAPABILITY_ID 0x08 /* HT capabilities not defined in kernel */ +#define HT_INTR_DISC_CONFIG 0x80 /* HT interrupt and discovery cap */ +#define HT_INTR_REG_INDEX 2 /* intconfig requires indirect accesses */ + +/* + * setup the interruptconfig register from the HT config info. + * Also clear CRC errors in HT linkcontrol, if necessary. + * This is done only for the real hardware. It is done before + * chip address space is initted, so can't touch infinipath registers + */ + +static int ipath_setup_htconfig(struct pci_dev *pdev, uint64_t * iaddr, + const ipath_type t) +{ + uint8_t cap_type; + uint32_t int_handler_addr_lower; + uint32_t int_handler_addr_upper; + uint64_t ihandler = 0; + int i, pos, ret = 0; + + *iaddr = 0ULL; /* init to zero in case not able to configure */ + + /* + * Read the capability info to find the interrupt info, and also + * handle clearing CRC errors in linkctrl register if necessary. + * We do this early, before we ever enable errors or hardware errors, + * mostly to avoid causing the chip to enter freeze mode. + */ + if (!(pos = pci_find_capability(pdev, HT_CAPABILITY_ID))) { + _IPATH_UNIT_ERROR(t, + "Couldn't find HyperTransport capability; no interrupts\n"); + return -ENODEV; + } + do { + /* the HT capability type byte is 3 bytes after the + * capability byte. + */ + if (pci_read_config_byte(pdev, pos+3, &cap_type)) { + _IPATH_INFO + ("Couldn't read config command @ %d\n", pos); + continue; + } + if (!(cap_type & 0xE0)) { + /* bits 13-15 of command==0 is slave/primary block. + * Clear any HT CRC errors. We only bother to + * do this at load time, because it's OK if it + * happened before we were loaded (first time + * after boot/reset), but any time after that, + * it's fatal anyway. Also need to not check for + * for upper byte errors if we are in 8 bit mode, + * so figure out our width. For now, at least, + * also complain if it's 8 bit. + */ + uint8_t linkwidth = 0, linkerr, link_a_b_off, link_off; + uint16_t linkctrl = 0; + + devdata[t].ipath_ht_slave_off = pos; + /* command word, master_host bit */ + if ((cap_type >> 2) & 1) /* master host || slave */ + link_a_b_off = 4; + else + link_a_b_off = 0; + _IPATH_VDBG("HT%u (Link %c) connected to processor\n", + link_a_b_off ? 1 : 0, + link_a_b_off ? 'B' : 'A'); + + link_a_b_off += pos; + + /* + * check both link control registers; clear both + * HT CRC sets if necessary. + */ + + for (i = 0; i < 2; i++) { + link_off = pos + i * 4 + 0x4; + if (pci_read_config_word + (pdev, link_off, &linkctrl)) + _IPATH_UNIT_ERROR(t, + "Couldn't read HT link control%d register\n", + i); + else if (linkctrl & (0xf << 8)) { + _IPATH_VDBG + ("Clear linkctrl%d CRC Error bits %x\n", + i, linkctrl & (0xf << 8)); + /* + * now write them back to clear + * the error. + */ + pci_write_config_byte(pdev, link_off, + linkctrl & (0xf << + 8)); + } + } + + /* + * As with HT CRC bits, same for protocol errors + * that might occur during boot. 
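
Both clearing loops here depend on write-one-to-clear semantics: reading the register and writing just the set error bits back clears exactly those bits. In sketch form, with demo_rd/demo_wr standing in for the config-space accessors (both hypothetical):

    #include <stdint.h>

    #define DEMO_CRC_ERRS   (0xfu << 8)     /* per-lane CRC error bits */

    static void demo_clear_w1c(uint16_t (*demo_rd)(int),
                               void (*demo_wr)(int, uint16_t),
                               int off)
    {
            uint16_t v = demo_rd(off);
            if (v & DEMO_CRC_ERRS)
                    /* writing the set bits back clears them */
                    demo_wr(off, v & DEMO_CRC_ERRS);
    }
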
+ */ + + for (i = 0; i < 2; i++) { + link_off = pos + i * 4 + 0xd; + if (pci_read_config_byte + (pdev, link_off, &linkerr)) + _IPATH_INFO + ("Couldn't read linkerror%d of HT slave/primary block\n", + i); + else if (linkerr & 0xf0) { + _IPATH_VDBG + ("HT linkerr%d bits 0x%x set, clearing\n", + linkerr >> 4, i); + /* + * writing the linkerr bits that + * are set will clear them + */ + if (pci_write_config_byte + (pdev, link_off, linkerr)) + _IPATH_DBG + ("Failed write to clear HT linkerror%d\n", + i); + if (pci_read_config_byte + (pdev, link_off, &linkerr)) + _IPATH_INFO + ("Couldn't reread linkerror%d of HT slave/primary block\n", + i); + else if (linkerr & 0xf0) + _IPATH_INFO + ("HT linkerror%d bits 0x%x couldn't be cleared\n", + i, linkerr >> 4); + } + } + + /* + * this is just for our link to the host, not + * devices connected through tunnel. + */ + + if (pci_read_config_byte + (pdev, link_a_b_off + 7, &linkwidth)) + _IPATH_UNIT_ERROR(t, + "Couldn't read HT link width config register\n"); + else { + uint32_t width; + switch (linkwidth & 7) { + case 5: + width = 4; + break; + case 4: + width = 2; + break; + case 3: + width = 32; + break; + case 1: + width = 16; + break; + case 0: + default: /* if wrong, assume 8 bit */ + width = 8; + break; + } + ((struct ipath_devdata *) pci_get_drvdata(pdev))->ipath_htwidth = width; + + if (linkwidth != 0x11) { + _IPATH_UNIT_ERROR(t, + "Not configured for 16 bit HT (%x)\n", + linkwidth); + if (!(linkwidth & 0xf)) { + _IPATH_DBG + ("Will ignore HT lane1 errors\n"); + ((struct ipath_devdata *) pci_get_drvdata(pdev))->ipath_flags |= IPATH_8BIT_IN_HT0; + } + } + } + + /* + * this is just for our link to the host, not + * devices connected through tunnel. + */ + + if (pci_read_config_byte + (pdev, link_a_b_off + 0xd, &linkwidth)) + _IPATH_UNIT_ERROR(t, + "Couldn't read HT link frequency config register\n"); + else { + uint32_t speed; + switch (linkwidth & 0xf) { + case 6: + speed = 1000; + break; + case 5: + speed = 800; + break; + case 4: + speed = 600; + break; + case 3: + speed = 500; + break; + case 2: + speed = 400; + break; + case 1: + speed = 300; + break; + default: + /* + * assume reserved and + * vendor-specific are 200... + */ + case 0: + speed = 200; + break; + } + ((struct ipath_devdata *) pci_get_drvdata(pdev))->ipath_htspeed = speed; + } + } else if (cap_type == HT_INTR_DISC_CONFIG) { + /* use indirection register to get the intr handler */ + uint32_t intvec; + pci_write_config_byte(pdev, pos + HT_INTR_REG_INDEX, + 0x10); + pci_read_config_dword(pdev, pos + 4, + &int_handler_addr_lower); + + pci_write_config_byte(pdev, pos + HT_INTR_REG_INDEX, + 0x11); + pci_read_config_dword(pdev, pos + 4, + &int_handler_addr_upper); + + ihandler = (uint64_t) int_handler_addr_lower | + ((uint64_t) int_handler_addr_upper << 32); + + /* + * I'm unable to find an exported API to get + * the the actual vector, either from the PCI + * infrastructure, or from the APIC + * infrastructure. This heuristic seems to be + * valid for Opteron on 2.6.x kernels, for irq's > 2. + * It may not be universally true... Bug 2338 + * + * Oh well; the heuristic doesn't work for the + * AMI/Iwill BIOS... But the good news is, + * somewhere by 2.6.9, when CONFIG_PCI_MSI is + * enabled, the irq field actually turned into + * the vector number + * We therefore require that MSI be enabled... 
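
The code that follows packs that vector into bits 23:16 of the interrupt-config word; the field update in isolation (sketch, demo name hypothetical):

    #include <stdint.h>

    static uint64_t demo_set_intvec(uint64_t ihandler, uint32_t vec)
    {
            ihandler &= ~0xff0000ULL;                 /* clear old vector field */
            ihandler |= (uint64_t)(vec & 0xff) << 16; /* intrinfo[23:16] */
            return ihandler;
    }
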
+ */ + + intvec = pdev->irq; + /* + * clear any bits there; normally not set but + * we'll overload this for some debug purposes + * (setting the HTC debug register value from + * software, rather than GPIOs), so it might be + * set on a driver reload. + */ + + ihandler &= ~0xff0000; + /* x86 vector goes in intrinfo[23:16] */ + ihandler |= intvec << 16; + _IPATH_VDBG + ("ihandler lower %x, upper %x, intvec %x, interruptconfig %llx\n", + int_handler_addr_lower, int_handler_addr_upper, + intvec, ihandler); + + /* return to caller, can't program yet. */ + *iaddr = ihandler; + /* + * no break, have to be sure we find link control + * stuff also + */ + } + + } while ((pos=pci_find_next_capability(pdev, pos, HT_CAPABILITY_ID))); + + if (!ihandler) { + _IPATH_UNIT_ERROR(t, + "Couldn't find interrupt handler in config space\n"); + ret = -ENODEV; + } + return ret; +} + +/* + * get the GUID from the i2c device + * When we add the multi-chip support, we will probably have to add + * the ability to use the number of guids field, and get the guid from + * the first chip's flash, to use for all of them. + */ + +static void ipath_get_guid(const ipath_type t) +{ + void *buf; + struct ipath_flash *ifp; + uint64_t guid; + int len; + uint8_t csum, *bguid; + + if (t && devdata[0].ipath_nguid > 1 && t <= devdata[0].ipath_nguid) { + uint8_t oguid; + devdata[t].ipath_guid = devdata[0].ipath_guid; + bguid = (uint8_t *) & devdata[t].ipath_guid; + + oguid = bguid[7]; + bguid[7] += t; + if (oguid > bguid[7]) { + if (bguid[6] == 0xff) { + if (bguid[5] == 0xff) { + _IPATH_UNIT_ERROR(t, + "Can't set %s GUID from base GUID, wraps to OUI!\n", + ipath_get_unit_name + (t)); + devdata[t].ipath_guid = 0; + return; + } + bguid[5]++; + } + bguid[6]++; + } + devdata[t].ipath_nguid = 1; + + _IPATH_DBG + ("nguid %u, so adding %u to device 0 guid, for %llx (big-endian)\n", + devdata[0].ipath_nguid, t, devdata[t].ipath_guid); + return; + } + + len = offsetof(struct ipath_flash, if_future); + if (!(buf = vmalloc(len))) { + _IPATH_UNIT_ERROR(t, + "Couldn't allocate memory to read %u bytes from eeprom for GUID\n", + len); + return; + } + + if (ipath_eeprom_read(t, 0, buf, len)) { + _IPATH_UNIT_ERROR(t, "Failed reading GUID from eeprom\n"); + goto done; + } + ifp = (struct ipath_flash *)buf; + + csum = ipath_flash_csum(ifp, 0); + if (csum != ifp->if_csum) { + _IPATH_INFO("Bad I2C flash checksum: 0x%x, not 0x%x\n", + csum, ifp->if_csum); + goto done; + } + if (*(uint64_t *) ifp->if_guid == 0ULL + || *(uint64_t *) ifp->if_guid == -1LL) { + _IPATH_UNIT_ERROR(t, "Invalid GUID %llx from flash; ignoring\n", + *(uint64_t *) ifp->if_guid); + goto done; /* don't allow GUID if all 0 or all 1's */ + } + + /* complain, but allow it */ + if (*(uint64_t *) ifp->if_guid == 0x100007511000000ULL) + _IPATH_INFO + ("Warning, GUID %llx is default, probably not correct!\n", + *(uint64_t *) ifp->if_guid); + + bguid = ifp->if_guid; + if (!bguid[0] && !bguid[1] && !bguid[2]) { + /* original incorrect GUID format in flash; fix in core copy, by + * shifting up 2 octets; don't need to change top octet, since both + * it and shifted are 0.. 
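
The per-unit GUID derivation earlier in ipath_get_guid() is a byte-wise add with carry that must never spill into the 3-octet OUI; the same logic as a standalone sketch (demo name hypothetical):

    #include <stdint.h>

    /* add `unit` to the low octet of a big-endian GUID, carrying
     * into octets 6 and 5 but never into the 3-octet OUI;
     * returns 0 if the carry would corrupt the OUI */
    static int demo_guid_add(uint8_t guid[8], unsigned unit)
    {
            uint8_t old = guid[7];

            guid[7] += (uint8_t)unit;
            if (guid[7] < old) {            /* low octet wrapped */
                    if (guid[6] == 0xff) {
                            if (guid[5] == 0xff)
                                    return 0;
                            guid[5]++;
                    }
                    guid[6]++;
            }
            return 1;
    }
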
*/ + bguid[1] = bguid[3]; + bguid[2] = bguid[4]; + bguid[3] = bguid[4] = 0; + guid = *(uint64_t *)ifp->if_guid; + _IPATH_VDBG("Old GUID format in flash, top 3 zero, shifting 2 octets\n"); + } + else + guid = *(uint64_t *)ifp->if_guid; + devdata[t].ipath_guid = guid; + devdata[t].ipath_nguid = ifp->if_numguid; + memcpy(devdata[t].ipath_serial, ifp->if_serial, sizeof(ifp->if_serial)); + _IPATH_VDBG("Initted GUID to %llx (big-endian) from i2c flash\n", + devdata[t].ipath_guid); + +done: + vfree(buf); +} + +static void __devexit infinipath_remove_one(struct pci_dev *pdev) +{ + struct ipath_devdata *dd; + + _IPATH_VDBG("pci_release, pdev=%p\n", pdev); + if (pdev) { + device_remove_file(&(pdev->dev), &dev_attr_status); + device_remove_file(&(pdev->dev), &dev_attr_status_str); + device_remove_file(&(pdev->dev), &dev_attr_lid); + device_remove_file(&(pdev->dev), &dev_attr_mlid); + device_remove_file(&(pdev->dev), &dev_attr_guid); + device_remove_file(&(pdev->dev), &dev_attr_nguid); + device_remove_file(&(pdev->dev), &dev_attr_serial); + device_remove_file(&(pdev->dev), &dev_attr_unit); + dd = pci_get_drvdata(pdev); + pci_set_drvdata(pdev, NULL); + _IPATH_VDBG + ("Releasing pci memory regions, devdata %p, unit %u\n", dd, + (uint32_t) (dd - devdata)); + if (dd && dd->ipath_kregbase) { + _IPATH_VDBG("Unmapping kregbase %p\n", + dd->ipath_kregbase); + iounmap((volatile void __iomem *) dd->ipath_kregbase); + dd->ipath_kregbase = NULL; + } + pci_release_regions(pdev); + _IPATH_VDBG("calling pci_disable_device\n"); + pci_disable_device(pdev); + } +} + +int ipath_open(struct inode *, struct file *); +static int ipath_opensma(struct inode *, struct file *); +int ipath_close(struct inode *, struct file *); +static unsigned int ipath_poll(struct file *, struct poll_table_struct *); +long ipath_ioctl(struct file *, unsigned int, unsigned long); +static loff_t ipath_llseek(struct file *, loff_t, int); +static int ipath_mmap(struct file *, struct vm_area_struct *); + +static struct file_operations ipath_fops = { + .owner = THIS_MODULE, + .open = ipath_open, + .release = ipath_close, + .poll = ipath_poll, + /* + * all of ours are completely compatible and don't require the + * kernel lock + */ + .compat_ioctl = ipath_ioctl, + /* we don't need kernel lock for our ioctls */ + .unlocked_ioctl = ipath_ioctl, + .llseek = ipath_llseek, + .mmap = ipath_mmap +}; + +static DECLARE_MUTEX(ipath_mutex); /* general driver use */ +spinlock_t ipath_pioavail_lock; + +/* + * For now, at least (and probably forever), we don't require root + * or equivalent permissions to use the device. + */ + +int ipath_open(struct inode *in, struct file *fp) +{ + int ret = 0, minor, i, prefunit=-1, devmax; + int maxofallports, npresent = 0, notup = 0; + ipath_type ndev; + + down(&ipath_mutex); + + minor = iminor(in); + _IPATH_VDBG("open on dev %lx (minor %d)\n", (long)in->i_rdev, minor); + + /* This code is present to allow a knowledgeable person to specify the + * layout of processes to processors before opening this driver, and + * then we'll assign the process to the "closest" HT-400 to + * that processor * (we assume reasonable connectivity, for now). + * This code assumes that if affinity has been set before this + * point, that at most one cpu is set; for now this is reasonable. + * I check for both cpus_empty() and cpus_full(), in case some + * kernel variant sets none of the bits when no affinity is set. + * 2.6.11 and 12 kernels have all present cpus set. + * Some day we'll have to fix it up further to handle a cpu subset. 
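
Given those caveats, the preferred-unit computation in the body below is a simple block mapping of cpus onto chips; in isolation (sketch, with a clamp added for cpu counts that don't divide evenly, demo name hypothetical):

    /* map a cpu index to a preferred unit, assuming units roughly
     * evenly divide the online cpus (e.g. 8 cpus, 2 chips: cpus 0-3
     * prefer unit 0, cpus 4-7 prefer unit 1) */
    static int demo_pref_unit(int curcpu, int ncpus, int nunits)
    {
            int u;

            if (nunits <= 0 || ncpus < nunits)
                    return -1;      /* no sensible mapping */
            u = curcpu / (ncpus / nunits);
            return u < nunits ? u : nunits - 1;
    }
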
+ * This algorithm fails for two HT-400's connected in tunnel fashion. + * Eventually this needs real topology information. + * There may be some issues with dual core numbering as well. This + * needs more work prior to release. + */ + if (minor != IPATH_SMA +#ifdef IPATH_DIAG + && minor != IPATH_DIAG +#endif + && minor != IPATH_CTRL + && !cpus_empty(current->cpus_allowed) + && !cpus_full(current->cpus_allowed)) { + int ncpus = num_online_cpus(), curcpu = -1; + for (i=0; i<ncpus; i++) if (cpu_isset(i, current->cpus_allowed)) { + _IPATH_PRDBG("%s[%u] affinity set for cpu %d\n", + current->comm, current->pid, i); + curcpu = i; + } + if (curcpu != -1) { + for (ndev = 0; ndev < infinipath_max; ndev++) + if ((devdata[ndev].ipath_flags & IPATH_PRESENT) + && devdata[ndev].ipath_kregbase) + npresent++; + if (npresent) { + prefunit = curcpu/(ncpus/npresent); + _IPATH_DBG("%s[%u] %d chips, %d cpus, " + "%d cpus/chip, select unit %d\n", + current->comm, current->pid, + npresent, ncpus, ncpus/npresent, + prefunit); + } + } + } + + if (minor == IPATH_SMA) { + ret = ipath_opensma(in, fp); + /* for ipath_ioctl */ + fp->private_data = (void *)(unsigned long)minor; + goto done; + } +#ifdef IPATH_DIAG + else if (minor == IPATH_DIAG) { + ret = ipath_opendiag(in, fp); + /* for ipath_ioctl */ + fp->private_data = (void *)(unsigned long)minor; + goto done; + } +#endif + else if (minor == IPATH_CTRL) { + /* for ipath_ioctl */ + fp->private_data = (void *)(unsigned long)minor; + ret = 0; + goto done; + } + else if (minor) { + /* + * minor number 0 is used for all chips, we choose available + * chip ourselves, it isn't based on what they open. + */ + + _IPATH_DBG("open on invalid minor %u\n", minor); + ret = -ENXIO; + goto done; + } + + /* + * for now, we use all ports on one, then all ports on the + * next, etc. Eventually we want to tweak this to be cpu/chip + * topology aware, and round-robin across chips that are + * configured and connected, placing processes on the closest + * available processor that isn't already over-allocated. + * multi-HT400 topology could be better handled + */ + + npresent = maxofallports = 0; + for (ndev = 0; ndev < infinipath_max; ndev++) { + if (!(devdata[ndev].ipath_flags & IPATH_PRESENT) || + !devdata[ndev].ipath_kregbase) + continue; + npresent++; + if ((devdata[ndev]. + ipath_flags & (IPATH_LINKDOWN | IPATH_LINKUNK))) { + _IPATH_VDBG("unit %u present, but link not ready\n", + ndev); + notup++; + continue; + } else if (!devdata[ndev].ipath_lid) { + _IPATH_VDBG + ("unit %u present, but LID not assigned, down\n", + ndev); + notup++; + continue; + } + if (devdata[ndev].ipath_cfgports > maxofallports) + maxofallports = devdata[ndev].ipath_cfgports; + } + + /* + * user ports start at 1, kernel port is 0 + * For now, we do round-robin access across all chips + */ + + devmax = prefunit!=-1 ? prefunit+1 : infinipath_max; +recheck: + for (i = 1; i < maxofallports; i++) { + for (ndev = prefunit!=-1?prefunit:0; ndev < devmax; ndev++) { + if (!(devdata[ndev].ipath_flags & IPATH_PRESENT) || + !devdata[ndev].ipath_kregbase + || !devdata[ndev].ipath_lid + || (devdata[ndev]. + ipath_flags & (IPATH_LINKDOWN | IPATH_LINKUNK))) + break; /* can't use this chip */ + if (i >= devdata[ndev].ipath_cfgports) + break; /* max'ed out on users of this chip */ + if (!devdata[ndev].ipath_pd[i]) { + void *p, *ptmp; + p = kmalloc(sizeof(struct ipath_portdata), + GFP_KERNEL); + + /* + * allocate memory for use in + * ipath_tid_update() just once at open, + * not per call. 
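
The port setup just below allocates two buffers and frees both on any failure; kfree(NULL) being a no-op is what lets a single error path cover every combination. The same shape in user space (a sketch; malloc/free stand in for kmalloc/kfree, demo name hypothetical):

    #include <stdlib.h>
    #include <string.h>

    static int demo_alloc_pair(void **a, size_t alen, void **b, size_t blen)
    {
            *a = malloc(alen);
            *b = malloc(blen);
            if (!*a || !*b) {
                    free(*a);       /* free(NULL) is a no-op */
                    free(*b);
                    *a = *b = NULL;
                    return -1;
            }
            memset(*a, 0, alen);
            return 0;
    }
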
Reduces cost of expected + * send setup + */ + + ptmp = + kmalloc(devdata[ndev].ipath_rcvtidcnt * + sizeof(uint16_t) + + + devdata[ndev].ipath_rcvtidcnt * + sizeof(struct page **), GFP_KERNEL); + if (!p || !ptmp) { + _IPATH_UNIT_ERROR(ndev, + "Unable to allocate portdata memory, failing open\n"); + ret = -ENOMEM; + kfree(p); + kfree(ptmp); + goto done; + } + memset(p, 0, sizeof(struct ipath_portdata)); + devdata[ndev].ipath_pd[i] = p; + devdata[ndev].ipath_pd[i]->port_port = i; + devdata[ndev].ipath_pd[i]->port_unit = ndev; + devdata[ndev].ipath_pd[i]->port_tid_pg_list = + ptmp; + init_waitqueue_head(&devdata[ndev].ipath_pd[i]-> + port_wait); + } + if (!devdata[ndev].ipath_pd[i]->port_cnt) { + devdata[ndev].ipath_pd[i]->port_cnt = 1; + fp->private_data = + (void *)devdata[ndev].ipath_pd[i]; + _IPATH_PRDBG("%s[%u] opened unit:port %u:%u\n", + current->comm, current->pid, ndev, + i); + devdata[ndev].ipath_pd[i]->port_pid = + current->pid; + strncpy(devdata[ndev].ipath_pd[i]->port_comm, + current->comm, + sizeof(devdata[ndev].ipath_pd[i]-> + port_comm)); + ipath_stats.sps_ports++; + goto done; + } + } + } + + if (npresent) { + if (notup) { + ret = -ENETDOWN; + _IPATH_DBG + ("No ports available (none initialized and ready)\n"); + } else { + if (prefunit > 0) { /* if we started above unit 0, retry from 0 */ + _IPATH_PRDBG("%s[%u] no ports on prefunit %d, clear and re-check\n", + current->comm, current->pid, prefunit); + devmax = infinipath_max; + prefunit = -1; + goto recheck; + } + ret = -EBUSY; + _IPATH_DBG("No ports available\n"); + } + } else { + ret = -ENXIO; + _IPATH_DBG("No boards found\n"); + } + +done: + up(&ipath_mutex); + return ret; +} + +static int ipath_opensma(struct inode *in, struct file *fp) +{ + ipath_type s; + + if (ipath_sma_alive) { + _IPATH_DBG("SMA already running (pid %u), failing\n", + ipath_sma_alive); + return -EBUSY; + } + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; /* all SMA functions are root-only */ + + for (s = 0; s < infinipath_max; s++) { + /* we need at least one infinipath device to be initialized. */ + if (devdata[s].ipath_flags & IPATH_INITTED) { + ipath_sma_alive = current->pid; + *devdata[s].ipath_statusp |= IPATH_STATUS_SMA; + *devdata[s].ipath_statusp &= ~IPATH_STATUS_OIB_SMA; + } + } + if (ipath_sma_alive) { + _IPATH_SMADBG + ("SMA device now open, SMA active as PID %u\n", + ipath_sma_alive); + return 0; + } + _IPATH_DBG("No hardware yet found and initted, failing\n"); + return -ENODEV; +} + + +#ifdef IPATH_DIAG +static int ipath_opendiag(struct inode *in, struct file *fp) +{ + ipath_type s; + + if (ipath_diag_alive) { + _IPATH_DBG("Diags already running (pid %u), failing\n", + ipath_diag_alive); + return -EBUSY; + } + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; /* all diags functions are root-only */ + + for (s = 0; s < infinipath_max; s++) + /* + * we need at least one infinipath device to be present + * (don't use INITTED, because we want to be able to open + * even if device is in freeze mode, which cleared INITTED). + * There is a small amount of risk to this, which is + * why we also verify kregbase is set. + */ + + if ((devdata[s].ipath_flags & IPATH_PRESENT) + && devdata[s].ipath_kregbase) { + ipath_diag_alive = current->pid; + _IPATH_DBG("diag device now open, active as PID %u\n", + ipath_diag_alive); + return 0; + } + _IPATH_DBG("No hardware yet found and initted, failing diags\n"); + return -ENODEV; +} +#endif + +/* + * clear all TID entries for a port, expected and eager. + * Used from ipath_close(), and at chip initialization. 
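
ipath_clear_tids() below finds each port's TID window by plain offset arithmetic from the mapped chip base; the address computation on its own (sketch, demo name hypothetical):

    #include <stdint.h>

    /* port N's table starts at chip base + table base +
     * N * entries-per-port * entry size */
    static uint64_t *demo_tid_table(char *kregbase, uint64_t tidbase,
                                    unsigned port, unsigned tidcnt)
    {
            return (uint64_t *)(kregbase + tidbase +
                                (uint64_t)port * tidcnt * sizeof(uint64_t));
    }
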
+ */ + +static void ipath_clear_tids(const ipath_type t, unsigned port) +{ + uint64_t __iomem *tidbase; + int i; + struct ipath_devdata *dd; + uint64_t tidval; + dd = &devdata[t]; + + if (!dd->ipath_kregbase) + return; + + /* + * chip errata bug 7358, try to work around it by marking invalid + * tids as having max length + */ + + tidval = + (-1LL & INFINIPATH_RT_BUFSIZE_MASK) << INFINIPATH_RT_BUFSIZE_SHIFT; + + /* + * need to invalidate all of the expected TID entries for this + * port, so we don't have valid entries that might somehow get + * used (early in next use of this port, or through some bug) + * We don't bother with the eager, because they are initialized + * each time before receives are enabled; expected aren't + */ + + tidbase = (uint64_t __iomem *) ((char __iomem *)(dd->ipath_kregbase) + + dd->ipath_rcvtidbase + + port * dd->ipath_rcvtidcnt * + sizeof(*tidbase)); + _IPATH_VDBG("Invalidate expected TIDs for port %u, tidbase=%p\n", port, + tidbase); + for (i = 0; i < dd->ipath_rcvtidcnt; i++) + ipath_kput_memq(t, &tidbase[i], tidval); + yield(); /* don't hog the cpu */ + + /* zero the eager TID entries */ + tidbase = (uint64_t __iomem *)((char __iomem *)(dd->ipath_kregbase) + + dd->ipath_rcvegrbase + + port * dd->ipath_rcvegrcnt * + sizeof(*tidbase)); + + for (i = 0; i < dd->ipath_rcvegrcnt; i++) + ipath_kput_memq(t, &tidbase[i], tidval); + yield(); /* don't hog the cpu */ +} + +int ipath_close(struct inode *in, struct file *fp) +{ + int ret = 0; + struct ipath_portdata *pd; + + _IPATH_VDBG("close on dev %lx, private data %p\n", (long)in->i_rdev, + fp->private_data); + + down(&ipath_mutex); + if (iminor(in) == IPATH_SMA) { + ipath_type s; + + ipath_sma_alive = 0; + _IPATH_SMADBG("Closing SMA device\n"); + for (s = 0; s < infinipath_max; s++) { + if (!(devdata[s].ipath_flags & IPATH_INITTED)) + continue; + *devdata[s].ipath_statusp &= ~IPATH_STATUS_SMA; + if (devdata[s].verbs_layer.l_flags & + IPATH_VERBS_KERNEL_SMA) + *devdata[s].ipath_statusp |= + IPATH_STATUS_OIB_SMA; + } + } +#ifdef IPATH_DIAG + else if (iminor(in) == IPATH_DIAG) { + ipath_diag_alive = 0; + _IPATH_DBG("Closing DIAG device\n"); + } +#endif + else if (fp->private_data && 255UL < (unsigned long)fp->private_data) { + ipath_type t; + unsigned port; + struct ipath_devdata *dd; + + pd = (struct ipath_portdata *) fp->private_data; + port = pd->port_port; + fp->private_data = NULL; + t = pd->port_unit; + if (t > infinipath_max) { + _IPATH_ERROR + ("closing, fp %p, pd %p, but unit %x not valid!\n", + fp, pd, t); + goto done; + } + dd = &devdata[t]; + + if (pd->port_hdrqfull) { + _IPATH_PRDBG + ("%s[%u] had %u rcvhdrqfull errors during run\n", + pd->port_comm, pd->port_pid, pd->port_hdrqfull); + pd->port_hdrqfull = 0; + } + + if (pd->port_rcvwait_to || pd->port_piowait_to + || pd->port_rcvnowait || pd->port_pionowait) { + _IPATH_VDBG + ("port%u, %u rcv, %u pio wait timeo; %u rcv %u, pio already\n", + pd->port_port, pd->port_rcvwait_to, + pd->port_piowait_to, pd->port_rcvnowait, + pd->port_pionowait); + pd->port_rcvwait_to = pd->port_piowait_to = + pd->port_rcvnowait = pd->port_pionowait = 0; + } + if (pd->port_flag) { + _IPATH_DBG("port %u port_flag still set to 0x%x\n", + pd->port_port, pd->port_flag); + pd->port_flag = 0; + } + + if (devdata[t].ipath_kregbase) { + if (pd->port_rcvhdrtail_uaddr) { + pd->port_rcvhdrtail_uaddr = 0; + pd->port_rcvhdrtail_kvaddr = NULL; + ipath_putpages(1, &pd->port_rcvhdrtail_pagep); + pd->port_rcvhdrtail_pagep = NULL; + ipath_stats.sps_pageunlocks++; + } + ipath_kput_kreg_port(t, 
kr_rcvhdrtailaddr, port, 0ULL); + ipath_kput_kreg_port(pd->port_unit, kr_rcvhdraddr, + pd->port_port, 0); + + /* clean up the pkeys for this port user */ + ipath_clean_partkey(pd, dd); + + if (port < dd->ipath_cfgports) { + int i = dd->ipath_pbufsport * (port - 1); + ipath_disarm_piobufs(t, i, dd->ipath_pbufsport); + + /* atomically clear receive enable port. */ + atomic_clear_mask(1U << + (INFINIPATH_R_PORTENABLE_SHIFT + + port), + &devdata[t].ipath_rcvctrl); + ipath_kput_kreg(t, kr_rcvctrl, + devdata[t].ipath_rcvctrl); + + if (dd->ipath_pageshadow) { + /* + * unlock any expected TID + * entries port still had in use + */ + int port_tidbase = + pd->port_port * dd->ipath_rcvtidcnt; + int i, cnt = 0, maxtid = + port_tidbase + dd->ipath_rcvtidcnt; + + _IPATH_VDBG + ("Port %u unlocking any locked expTID pages\n", + pd->port_port); + for (i = port_tidbase; i < maxtid; i++) { + if (dd->ipath_pageshadow[i]) { + ipath_putpages(1, + &dd-> + ipath_pageshadow + [i]); + dd->ipath_pageshadow[i] + = NULL; + cnt++; + ipath_stats. + sps_pageunlocks++; + } + } + if (cnt) + _IPATH_VDBG + ("Port %u had %u expTID entries locked\n", + pd->port_port, cnt); + if (ipath_stats.sps_pagelocks + || ipath_stats.sps_pageunlocks) + _IPATH_VDBG + ("%llu pages locked, %llu unlocked with" + " ipath_m{un}lock\n", + ipath_stats.sps_pagelocks, + ipath_stats. + sps_pageunlocks); + } + ipath_stats.sps_ports--; + _IPATH_PRDBG("%s[%u] closed port %u:%u\n", + pd->port_comm, pd->port_pid, t, + port); + } + } + + pd->port_cnt = 0; + pd->port_pid = 0; + + ipath_clear_tids(t, pd->port_port); + + ipath_free_pddata(dd, pd->port_port, 0); + } + +done: + up(&ipath_mutex); + + return ret; +} From bos at pathscale.com Wed Dec 28 16:31:31 2005 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 28 Dec 2005 16:31:31 -0800 Subject: [openib-general] [PATCH 12 of 20] ipath - misc driver support code In-Reply-To: Message-ID: <5e9b0b7876e2d570f25e.1135816291@eng-12.pathscale.com> Signed-off-by: Bryan O'Sullivan diff -r e8af3873b0d9 -r 5e9b0b7876e2 drivers/infiniband/hw/ipath/ipath_ht400.c --- /dev/null Thu Jan 1 00:00:00 1970 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_ht400.c Wed Dec 28 14:19:43 2005 -0800 @@ -0,0 +1,1137 @@ +/* + * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. + */ + +/* + * The first part of this file is shared with the diags, the second + * part is used only in the kernel. + */ + +#include /* for offsetof */ + +#include +#include +#include +#include +#include "ipath_kernel.h" + +#include "ipath_registers.h" +#include "ipath_common.h" + +/* + * This lists the InfiniPath registers, in the actual chip layout. This + * structure should never be directly accessed. It is included by the + * user mode diags, and so must be able to be compiled in both user + * and kernel mode. + */ +struct _infinipath_do_not_use_kernel_regs { + unsigned long long Revision; + unsigned long long Control; + unsigned long long PageAlign; + unsigned long long PortCnt; + unsigned long long DebugPortSelect; + unsigned long long DebugPort; + unsigned long long SendRegBase; + unsigned long long UserRegBase; + unsigned long long CounterRegBase; + unsigned long long Scratch; + unsigned long long ReservedMisc1; + unsigned long long InterruptConfig; + unsigned long long IntBlocked; + unsigned long long IntMask; + unsigned long long IntStatus; + unsigned long long IntClear; + unsigned long long ErrorMask; + unsigned long long ErrorStatus; + unsigned long long ErrorClear; + unsigned long long HwErrMask; + unsigned long long HwErrStatus; + unsigned long long HwErrClear; + unsigned long long HwDiagCtrl; + unsigned long long MDIO; + unsigned long long IBCStatus; + unsigned long long IBCCtrl; + unsigned long long ExtStatus; + unsigned long long ExtCtrl; + unsigned long long GPIOOut; + unsigned long long GPIOMask; + unsigned long long GPIOStatus; + unsigned long long GPIOClear; + unsigned long long RcvCtrl; + unsigned long long RcvBTHQP; + unsigned long long RcvHdrSize; + unsigned long long RcvHdrCnt; + unsigned long long RcvHdrEntSize; + unsigned long long RcvTIDBase; + unsigned long long RcvTIDCnt; + unsigned long long RcvEgrBase; + unsigned long long RcvEgrCnt; + unsigned long long RcvBufBase; + unsigned long long RcvBufSize; + unsigned long long RxIntMemBase; + unsigned long long RxIntMemSize; + unsigned long long RcvPartitionKey; + unsigned long long ReservedRcv[10]; + unsigned long long SendCtrl; + unsigned long long SendPIOBufBase; + unsigned long long SendPIOSize; + unsigned long long SendPIOBufCnt; + unsigned long long SendPIOAvailAddr; + unsigned long long TxIntMemBase; + unsigned long long TxIntMemSize; + unsigned long long ReservedSend[9]; + unsigned long long SendBufferError; + unsigned long long SendBufferErrorCONT1; + unsigned long long SendBufferErrorCONT2; + unsigned long long SendBufferErrorCONT3; + unsigned long long ReservedSBE[4]; + unsigned long long RcvHdrAddr0; + unsigned long long RcvHdrAddr1; + unsigned long long RcvHdrAddr2; + unsigned long long RcvHdrAddr3; + unsigned long long RcvHdrAddr4; + unsigned long long RcvHdrAddr5; + unsigned long long RcvHdrAddr6; + unsigned long long RcvHdrAddr7; + unsigned long long RcvHdrAddr8; + unsigned long long ReservedRHA[7]; + unsigned long long RcvHdrTailAddr0; + unsigned long long RcvHdrTailAddr1; + unsigned long long RcvHdrTailAddr2; + unsigned long long RcvHdrTailAddr3; + unsigned long long 
RcvHdrTailAddr4; + unsigned long long RcvHdrTailAddr5; + unsigned long long RcvHdrTailAddr6; + unsigned long long RcvHdrTailAddr7; + unsigned long long RcvHdrTailAddr8; + unsigned long long ReservedRHTA[7]; + unsigned long long Sync; /* Software only */ + unsigned long long Dump; /* Software only */ + unsigned long long SimVer; /* Software only */ + unsigned long long ReservedSW[5]; + unsigned long long SerdesConfig0; + unsigned long long SerdesConfig1; + unsigned long long SerdesStatus; + unsigned long long XGXSConfig; + unsigned long long ReservedSW2[4]; +}; + +#define IPATH_KREG_OFFSET(field) (offsetof(struct \ + _infinipath_do_not_use_kernel_regs, field) / sizeof(uint64_t)) +#define IPATH_CREG_OFFSET(field) (offsetof( \ + struct infinipath_counters, field) / sizeof(uint64_t)) + +ipath_kreg + kr_control = IPATH_KREG_OFFSET(Control), + kr_counterregbase = IPATH_KREG_OFFSET(CounterRegBase), + kr_debugport = IPATH_KREG_OFFSET(DebugPort), + kr_debugportselect = IPATH_KREG_OFFSET(DebugPortSelect), + kr_errorclear = IPATH_KREG_OFFSET(ErrorClear), + kr_errormask = IPATH_KREG_OFFSET(ErrorMask), + kr_errorstatus = IPATH_KREG_OFFSET(ErrorStatus), + kr_extctrl = IPATH_KREG_OFFSET(ExtCtrl), + kr_extstatus = IPATH_KREG_OFFSET(ExtStatus), + kr_gpio_clear = IPATH_KREG_OFFSET(GPIOClear), + kr_gpio_mask = IPATH_KREG_OFFSET(GPIOMask), + kr_gpio_out = IPATH_KREG_OFFSET(GPIOOut), + kr_gpio_status = IPATH_KREG_OFFSET(GPIOStatus), + kr_hwdiagctrl = IPATH_KREG_OFFSET(HwDiagCtrl), + kr_hwerrclear = IPATH_KREG_OFFSET(HwErrClear), + kr_hwerrmask = IPATH_KREG_OFFSET(HwErrMask), + kr_hwerrstatus = IPATH_KREG_OFFSET(HwErrStatus), + kr_ibcctrl = IPATH_KREG_OFFSET(IBCCtrl), + kr_ibcstatus = IPATH_KREG_OFFSET(IBCStatus), + kr_intblocked = IPATH_KREG_OFFSET(IntBlocked), + kr_intclear = IPATH_KREG_OFFSET(IntClear), + kr_interruptconfig = IPATH_KREG_OFFSET(InterruptConfig), + kr_intmask = IPATH_KREG_OFFSET(IntMask), + kr_intstatus = IPATH_KREG_OFFSET(IntStatus), + kr_mdio = IPATH_KREG_OFFSET(MDIO), + kr_pagealign = IPATH_KREG_OFFSET(PageAlign), + kr_partitionkey = IPATH_KREG_OFFSET(RcvPartitionKey), + kr_portcnt = IPATH_KREG_OFFSET(PortCnt), + kr_rcvbthqp = IPATH_KREG_OFFSET(RcvBTHQP), + kr_rcvbufbase = IPATH_KREG_OFFSET(RcvBufBase), + kr_rcvbufsize = IPATH_KREG_OFFSET(RcvBufSize), + kr_rcvctrl = IPATH_KREG_OFFSET(RcvCtrl), + kr_rcvegrbase = IPATH_KREG_OFFSET(RcvEgrBase), + kr_rcvegrcnt = IPATH_KREG_OFFSET(RcvEgrCnt), + kr_rcvhdrcnt = IPATH_KREG_OFFSET(RcvHdrCnt), + kr_rcvhdrentsize = IPATH_KREG_OFFSET(RcvHdrEntSize), + kr_rcvhdrsize = IPATH_KREG_OFFSET(RcvHdrSize), + kr_rcvintmembase = IPATH_KREG_OFFSET(RxIntMemBase), + kr_rcvintmemsize = IPATH_KREG_OFFSET(RxIntMemSize), + kr_rcvtidbase = IPATH_KREG_OFFSET(RcvTIDBase), + kr_rcvtidcnt = IPATH_KREG_OFFSET(RcvTIDCnt), + kr_revision = IPATH_KREG_OFFSET(Revision), + kr_scratch = IPATH_KREG_OFFSET(Scratch), + kr_sendbuffererror = IPATH_KREG_OFFSET(SendBufferError), + kr_sendbuffererror1 = IPATH_KREG_OFFSET(SendBufferErrorCONT1), + kr_sendbuffererror2 = IPATH_KREG_OFFSET(SendBufferErrorCONT2), + kr_sendbuffererror3 = IPATH_KREG_OFFSET(SendBufferErrorCONT3), + kr_sendctrl = IPATH_KREG_OFFSET(SendCtrl), + kr_sendpioavailaddr = IPATH_KREG_OFFSET(SendPIOAvailAddr), + kr_sendpiobufbase = IPATH_KREG_OFFSET(SendPIOBufBase), + kr_sendpiobufcnt = IPATH_KREG_OFFSET(SendPIOBufCnt), + kr_sendpiosize = IPATH_KREG_OFFSET(SendPIOSize), + kr_sendregbase = IPATH_KREG_OFFSET(SendRegBase), + kr_txintmembase = IPATH_KREG_OFFSET(TxIntMemBase), + kr_txintmemsize = IPATH_KREG_OFFSET(TxIntMemSize), + 
kr_userregbase = IPATH_KREG_OFFSET(UserRegBase), + kr_serdesconfig0 = IPATH_KREG_OFFSET(SerdesConfig0), + kr_serdesconfig1 = IPATH_KREG_OFFSET(SerdesConfig1), + kr_serdesstatus = IPATH_KREG_OFFSET(SerdesStatus), + kr_xgxsconfig = IPATH_KREG_OFFSET(XGXSConfig), + /* + * last valid direct use register other than diag-only registers + */ + __kr_lastvaliddirect = IPATH_KREG_OFFSET(ReservedSW2[0]), + /* always invalid for initializing */ + __kr_invalid = IPATH_KREG_OFFSET(ReservedSW2[0]) + 1, + /* + * These should not be used directly via ipath_kget_kreg64(), + * use them with ipath_kget_kreg64_port() + */ + kr_rcvhdraddr = IPATH_KREG_OFFSET(RcvHdrAddr0), /* not for direct use */ + /* not for direct use */ + kr_rcvhdrtailaddr = IPATH_KREG_OFFSET(RcvHdrTailAddr0), + /* we define the full set for the diags, the kernel doesn't use them */ + kr_rcvhdraddr1 = IPATH_KREG_OFFSET(RcvHdrAddr1), + kr_rcvhdraddr2 = IPATH_KREG_OFFSET(RcvHdrAddr2), + kr_rcvhdraddr3 = IPATH_KREG_OFFSET(RcvHdrAddr3), + kr_rcvhdraddr4 = IPATH_KREG_OFFSET(RcvHdrAddr4), + kr_rcvhdrtailaddr1 = IPATH_KREG_OFFSET(RcvHdrTailAddr1), + kr_rcvhdrtailaddr2 = IPATH_KREG_OFFSET(RcvHdrTailAddr2), + kr_rcvhdrtailaddr3 = IPATH_KREG_OFFSET(RcvHdrTailAddr3), + kr_rcvhdrtailaddr4 = IPATH_KREG_OFFSET(RcvHdrTailAddr4), + kr_rcvhdraddr5 = IPATH_KREG_OFFSET(RcvHdrAddr5), + kr_rcvhdraddr6 = IPATH_KREG_OFFSET(RcvHdrAddr6), + kr_rcvhdraddr7 = IPATH_KREG_OFFSET(RcvHdrAddr7), + kr_rcvhdraddr8 = IPATH_KREG_OFFSET(RcvHdrAddr8), + kr_rcvhdrtailaddr5 = IPATH_KREG_OFFSET(RcvHdrTailAddr5), + kr_rcvhdrtailaddr6 = IPATH_KREG_OFFSET(RcvHdrTailAddr6), + kr_rcvhdrtailaddr7 = IPATH_KREG_OFFSET(RcvHdrTailAddr7), + kr_rcvhdrtailaddr8 = IPATH_KREG_OFFSET(RcvHdrTailAddr8); + +/* + * first of the pioavail registers, the total number is + * (kr_sendpiobufcnt / 32); each buffer uses 2 bits + * More properly, it's: + * (kr_sendpiobufcnt / ((sizeof(uint64_t)*BITS_PER_BYTE)/2)) + */ +ipath_sreg sr_sendpioavail = 0; + +ipath_creg + cr_badformatcnt = IPATH_CREG_OFFSET(RxBadFormatCnt), + cr_erricrccnt = IPATH_CREG_OFFSET(RxICRCErrCnt), + cr_errlinkcnt = IPATH_CREG_OFFSET(RxLinkProblemCnt), + cr_errlpcrccnt = IPATH_CREG_OFFSET(RxLPCRCErrCnt), + cr_errpkey = IPATH_CREG_OFFSET(RxPKeyMismatchCnt), + cr_errrcvflowctrlcnt = IPATH_CREG_OFFSET(RxFlowCtrlErrCnt), + cr_err_rlencnt = IPATH_CREG_OFFSET(RxLenErrCnt), + cr_errslencnt = IPATH_CREG_OFFSET(TxLenErrCnt), + cr_errtidfull = IPATH_CREG_OFFSET(RxTIDFullErrCnt), + cr_errtidvalid = IPATH_CREG_OFFSET(RxTIDValidErrCnt), + cr_errvcrccnt = IPATH_CREG_OFFSET(RxVCRCErrCnt), + cr_ibstatuschange = IPATH_CREG_OFFSET(IBStatusChangeCnt), + /* calc from Reg_CounterRegBase + offset */ + cr_intcnt = IPATH_CREG_OFFSET(LBIntCnt), + cr_invalidrlencnt = IPATH_CREG_OFFSET(RxMaxMinLenErrCnt), + cr_invalidslencnt = IPATH_CREG_OFFSET(TxMaxMinLenErrCnt), + cr_lbflowstallcnt = IPATH_CREG_OFFSET(LBFlowStallCnt), + cr_pktrcvcnt = IPATH_CREG_OFFSET(RxDataPktCnt), + cr_pktrcvflowctrlcnt = IPATH_CREG_OFFSET(RxFlowPktCnt), + cr_pktsendcnt = IPATH_CREG_OFFSET(TxDataPktCnt), + cr_pktsendflowcnt = IPATH_CREG_OFFSET(TxFlowPktCnt), + cr_portovflcnt = IPATH_CREG_OFFSET(RxP0HdrEgrOvflCnt), + cr_portovflcnt1 = IPATH_CREG_OFFSET(RxP1HdrEgrOvflCnt), + cr_portovflcnt2 = IPATH_CREG_OFFSET(RxP2HdrEgrOvflCnt), + cr_portovflcnt3 = IPATH_CREG_OFFSET(RxP3HdrEgrOvflCnt), + cr_portovflcnt4 = IPATH_CREG_OFFSET(RxP4HdrEgrOvflCnt), + cr_portovflcnt5 = IPATH_CREG_OFFSET(RxP5HdrEgrOvflCnt), + cr_portovflcnt6 = IPATH_CREG_OFFSET(RxP6HdrEgrOvflCnt), + cr_portovflcnt7 = 
IPATH_CREG_OFFSET(RxP7HdrEgrOvflCnt), + cr_portovflcnt8 = IPATH_CREG_OFFSET(RxP8HdrEgrOvflCnt), + cr_rcvebpcnt = IPATH_CREG_OFFSET(RxEBPCnt), + cr_rcvovflcnt = IPATH_CREG_OFFSET(RxBufOvflCnt), + cr_senddropped = IPATH_CREG_OFFSET(TxDroppedPktCnt), + cr_sendstallcnt = IPATH_CREG_OFFSET(TxFlowStallCnt), + cr_sendunderruncnt = IPATH_CREG_OFFSET(TxUnderrunCnt), + cr_wordrcvcnt = IPATH_CREG_OFFSET(RxDwordCnt), + cr_wordsendcnt = IPATH_CREG_OFFSET(TxDwordCnt), + cr_unsupvlcnt = IPATH_CREG_OFFSET(TxUnsupVLErrCnt), + cr_rxdroppktcnt = IPATH_CREG_OFFSET(RxDroppedPktCnt), + cr_iblinkerrrecovcnt = IPATH_CREG_OFFSET(IBLinkErrRecoveryCnt), + cr_iblinkdowncnt = IPATH_CREG_OFFSET(IBLinkDownedCnt), + cr_ibsymbolerrcnt = IPATH_CREG_OFFSET(IBSymbolErrCnt); + +/* kr_sendctrl bits */ +#define INFINIPATH_S_DISARMPIOBUF_MASK 0xFF + +/* kr_rcvctrl bits */ +#define INFINIPATH_R_PORTENABLE_MASK 0x1FF +#define INFINIPATH_R_INTRAVAIL_MASK 0x1FF + +/* kr_intstatus, kr_intclear, kr_intmask bits */ +#define INFINIPATH_I_RCVURG_MASK 0x1FF +#define INFINIPATH_I_RCVAVAIL_MASK 0x1FF + +/* kr_hwerrclear, kr_hwerrmask, kr_hwerrstatus, bits */ +#define INFINIPATH_HWE_HTCMEMPARITYERR_MASK 0x3FFFFFULL +#define INFINIPATH_HWE_HTCLNKABYTE0CRCERR 0x0000000000800000ULL +#define INFINIPATH_HWE_HTCLNKABYTE1CRCERR 0x0000000001000000ULL +#define INFINIPATH_HWE_HTCLNKBBYTE0CRCERR 0x0000000002000000ULL +#define INFINIPATH_HWE_HTCLNKBBYTE1CRCERR 0x0000000004000000ULL +#define INFINIPATH_HWE_HTCMISCERR4 0x0000000008000000ULL +#define INFINIPATH_HWE_HTCMISCERR5 0x0000000010000000ULL +#define INFINIPATH_HWE_HTCMISCERR6 0x0000000020000000ULL +#define INFINIPATH_HWE_HTCMISCERR7 0x0000000040000000ULL +#define INFINIPATH_HWE_MEMBISTFAILED 0x0040000000000000ULL +#define INFINIPATH_HWE_COREPLL_FBSLIP 0x0080000000000000ULL +#define INFINIPATH_HWE_COREPLL_RFSLIP 0x0100000000000000ULL +#define INFINIPATH_HWE_HTBPLL_FBSLIP 0x0200000000000000ULL +#define INFINIPATH_HWE_HTBPLL_RFSLIP 0x0400000000000000ULL +#define INFINIPATH_HWE_HTAPLL_FBSLIP 0x0800000000000000ULL +#define INFINIPATH_HWE_HTAPLL_RFSLIP 0x1000000000000000ULL +#define INFINIPATH_HWE_EXTSERDESPLLFAILED 0x2000000000000000ULL + +/* kr_hwdiagctrl bits */ +#define INFINIPATH_DC_NUMHTMEMS 22 + +/* kr_extstatus bits */ +#define INFINIPATH_EXTS_FREQSEL 0x2 +#define INFINIPATH_EXTS_SERDESSEL 0x4 +#define INFINIPATH_EXTS_MEMBIST_ENDTEST 0x0000000000004000 +#define INFINIPATH_EXTS_MEMBIST_CORRECT 0x0000000000008000 + +/* kr_extctrl bits */ + +/* + * masks and bits that are different in different chips, or present only + * in one + */ +const uint32_t infinipath_i_rcvavail_mask = INFINIPATH_I_RCVAVAIL_MASK; +const uint32_t infinipath_i_rcvurg_mask = INFINIPATH_I_RCVURG_MASK; +const uint64_t infinipath_hwe_htcmemparityerr_mask = + INFINIPATH_HWE_HTCMEMPARITYERR_MASK; + +const uint64_t infinipath_hwe_spibdcmlockfailed_mask = 0ULL; +const uint64_t infinipath_hwe_sphtdcmlockfailed_mask = 0ULL; +const uint64_t infinipath_hwe_htcdcmlockfailed_mask = 0ULL; +const uint64_t infinipath_hwe_htcdcmlockfailed_shift = 0ULL; +const uint64_t infinipath_hwe_sphtdcmlockfailed_shift = 0ULL; +const uint64_t infinipath_hwe_spibdcmlockfailed_shift = 0ULL; + +const uint64_t infinipath_hwe_htclnkabyte0crcerr = + INFINIPATH_HWE_HTCLNKABYTE0CRCERR; +const uint64_t infinipath_hwe_htclnkabyte1crcerr = + INFINIPATH_HWE_HTCLNKABYTE1CRCERR; +const uint64_t infinipath_hwe_htclnkbbyte0crcerr = + INFINIPATH_HWE_HTCLNKBBYTE0CRCERR; +const uint64_t infinipath_hwe_htclnkbbyte1crcerr = + INFINIPATH_HWE_HTCLNKBBYTE1CRCERR; + +const uint64_t 
infinipath_c_bitsextant = + (INFINIPATH_C_FREEZEMODE | INFINIPATH_C_LINKENABLE); + +const uint64_t infinipath_s_bitsextant = + (INFINIPATH_S_ABORT | INFINIPATH_S_PIOINTBUFAVAIL | + INFINIPATH_S_PIOBUFAVAILUPD | INFINIPATH_S_PIOENABLE | + INFINIPATH_S_DISARM | + (INFINIPATH_S_DISARMPIOBUF_MASK << INFINIPATH_S_DISARMPIOBUF_SHIFT)); + +const uint64_t infinipath_r_bitsextant = + ((INFINIPATH_R_PORTENABLE_MASK << INFINIPATH_R_PORTENABLE_SHIFT) | + (INFINIPATH_R_INTRAVAIL_MASK << INFINIPATH_R_INTRAVAIL_SHIFT) | + INFINIPATH_R_TAILUPD); + +const uint64_t infinipath_i_bitsextant = + ((INFINIPATH_I_RCVURG_MASK << INFINIPATH_I_RCVURG_SHIFT) | + (INFINIPATH_I_RCVAVAIL_MASK << INFINIPATH_I_RCVAVAIL_SHIFT) | + INFINIPATH_I_ERROR | INFINIPATH_I_SPIOSENT | + INFINIPATH_I_SPIOBUFAVAIL | INFINIPATH_I_GPIO); + +const uint64_t infinipath_e_bitsextant = + (INFINIPATH_E_RFORMATERR | INFINIPATH_E_RVCRC | INFINIPATH_E_RICRC | + INFINIPATH_E_RMINPKTLEN | INFINIPATH_E_RMAXPKTLEN | + INFINIPATH_E_RLONGPKTLEN | INFINIPATH_E_RSHORTPKTLEN | + INFINIPATH_E_RUNEXPCHAR | INFINIPATH_E_RUNSUPVL | INFINIPATH_E_REBP | + INFINIPATH_E_RIBFLOW | INFINIPATH_E_RBADVERSION | + INFINIPATH_E_RRCVEGRFULL | INFINIPATH_E_RRCVHDRFULL | + INFINIPATH_E_RBADTID | INFINIPATH_E_RHDRLEN | + INFINIPATH_E_RHDR | INFINIPATH_E_RIBLOSTLINK | + INFINIPATH_E_SMINPKTLEN | INFINIPATH_E_SMAXPKTLEN | + INFINIPATH_E_SUNDERRUN | INFINIPATH_E_SPKTLEN | + INFINIPATH_E_SDROPPEDSMPPKT | INFINIPATH_E_SDROPPEDDATAPKT | + INFINIPATH_E_SPIOARMLAUNCH | INFINIPATH_E_SUNEXPERRPKTNUM | + INFINIPATH_E_SUNSUPVL | INFINIPATH_E_IBSTATUSCHANGED | + INFINIPATH_E_INVALIDADDR | INFINIPATH_E_RESET | INFINIPATH_E_HARDWARE); + +const uint64_t infinipath_hwe_bitsextant = + (INFINIPATH_HWE_HTCMEMPARITYERR_MASK << + INFINIPATH_HWE_HTCMEMPARITYERR_SHIFT) | + (INFINIPATH_HWE_TXEMEMPARITYERR_MASK << + INFINIPATH_HWE_TXEMEMPARITYERR_SHIFT) | + (INFINIPATH_HWE_RXEMEMPARITYERR_MASK << + INFINIPATH_HWE_RXEMEMPARITYERR_SHIFT) | + INFINIPATH_HWE_HTCLNKABYTE0CRCERR | + INFINIPATH_HWE_HTCLNKABYTE1CRCERR | INFINIPATH_HWE_HTCLNKBBYTE0CRCERR | + INFINIPATH_HWE_HTCLNKBBYTE1CRCERR | INFINIPATH_HWE_HTCMISCERR4 | + INFINIPATH_HWE_HTCMISCERR5 | INFINIPATH_HWE_HTCMISCERR6 | + INFINIPATH_HWE_HTCMISCERR7 | INFINIPATH_HWE_HTCBUSTREQPARITYERR | + INFINIPATH_HWE_HTCBUSTRESPPARITYERR | + INFINIPATH_HWE_HTCBUSIREQPARITYERR | + INFINIPATH_HWE_RXDSYNCMEMPARITYERR | INFINIPATH_HWE_MEMBISTFAILED | + INFINIPATH_HWE_COREPLL_FBSLIP | INFINIPATH_HWE_COREPLL_RFSLIP | + INFINIPATH_HWE_HTBPLL_FBSLIP | INFINIPATH_HWE_HTBPLL_RFSLIP | + INFINIPATH_HWE_HTAPLL_FBSLIP | INFINIPATH_HWE_HTAPLL_RFSLIP | + INFINIPATH_HWE_EXTSERDESPLLFAILED | + INFINIPATH_HWE_IBCBUSTOSPCPARITYERR | + INFINIPATH_HWE_IBCBUSFRSPCPARITYERR; + +const uint64_t infinipath_dc_bitsextant = + (INFINIPATH_DC_FORCEHTCMEMPARITYERR_MASK << + INFINIPATH_DC_FORCEHTCMEMPARITYERR_SHIFT) | + (INFINIPATH_DC_FORCETXEMEMPARITYERR_MASK << + INFINIPATH_DC_FORCETXEMEMPARITYERR_SHIFT) | + (INFINIPATH_DC_FORCERXEMEMPARITYERR_MASK << + INFINIPATH_DC_FORCERXEMEMPARITYERR_SHIFT) | + INFINIPATH_DC_FORCEHTCBUSTREQPARITYERR | + INFINIPATH_DC_FORCEHTCBUSTRESPPARITYERR | + INFINIPATH_DC_FORCEHTCBUSIREQPARITYERR | + INFINIPATH_DC_FORCERXDSYNCMEMPARITYERR | + INFINIPATH_DC_COUNTERDISABLE | INFINIPATH_DC_COUNTERWREN | + INFINIPATH_DC_FORCEIBCBUSTOSPCPARITYERR | + INFINIPATH_DC_FORCEIBCBUSFRSPCPARITYERR; + +const uint64_t infinipath_ibcc_bitsextant = + (INFINIPATH_IBCC_FLOWCTRLPERIOD_MASK << + INFINIPATH_IBCC_FLOWCTRLPERIOD_SHIFT) | + (INFINIPATH_IBCC_FLOWCTRLWATERMARK_MASK << + 
INFINIPATH_IBCC_FLOWCTRLWATERMARK_SHIFT) | + (INFINIPATH_IBCC_LINKINITCMD_MASK << + INFINIPATH_IBCC_LINKINITCMD_SHIFT) | + (INFINIPATH_IBCC_LINKCMD_MASK << INFINIPATH_IBCC_LINKCMD_SHIFT) | + (INFINIPATH_IBCC_MAXPKTLEN_MASK << INFINIPATH_IBCC_MAXPKTLEN_SHIFT) | + (INFINIPATH_IBCC_PHYERRTHRESHOLD_MASK << + INFINIPATH_IBCC_PHYERRTHRESHOLD_SHIFT) | + (INFINIPATH_IBCC_OVERRUNTHRESHOLD_MASK << + INFINIPATH_IBCC_OVERRUNTHRESHOLD_SHIFT) | + (INFINIPATH_IBCC_CREDITSCALE_MASK << + INFINIPATH_IBCC_CREDITSCALE_SHIFT) | + INFINIPATH_IBCC_LOOPBACK | INFINIPATH_IBCC_LINKDOWNDEFAULTSTATE; + +const uint64_t infinipath_mdio_bitsextant = + (INFINIPATH_MDIO_CLKDIV_MASK << INFINIPATH_MDIO_CLKDIV_SHIFT) | + (INFINIPATH_MDIO_COMMAND_MASK << INFINIPATH_MDIO_COMMAND_SHIFT) | + (INFINIPATH_MDIO_DEVADDR_MASK << INFINIPATH_MDIO_DEVADDR_SHIFT) | + (INFINIPATH_MDIO_REGADDR_MASK << INFINIPATH_MDIO_REGADDR_SHIFT) | + (INFINIPATH_MDIO_DATA_MASK << INFINIPATH_MDIO_DATA_SHIFT) | + INFINIPATH_MDIO_CMDVALID | INFINIPATH_MDIO_RDDATAVALID; + +const uint64_t infinipath_ibcs_bitsextant = + (INFINIPATH_IBCS_LINKTRAININGSTATE_MASK << + INFINIPATH_IBCS_LINKTRAININGSTATE_SHIFT) | + (INFINIPATH_IBCS_LINKSTATE_MASK << INFINIPATH_IBCS_LINKSTATE_SHIFT) | + INFINIPATH_IBCS_TXREADY | INFINIPATH_IBCS_TXCREDITOK; + +const uint64_t infinipath_extc_bitsextant = + (INFINIPATH_EXTC_GPIOINVERT_MASK << INFINIPATH_EXTC_GPIOINVERT_SHIFT) | + (INFINIPATH_EXTC_GPIOOE_MASK << INFINIPATH_EXTC_GPIOOE_SHIFT) | + INFINIPATH_EXTC_SERDESENABLE | INFINIPATH_EXTC_SERDESCONNECT | + INFINIPATH_EXTC_SERDESENTRUNKING | INFINIPATH_EXTC_SERDESDISRXFIFO | + INFINIPATH_EXTC_SERDESENPLPBK1 | INFINIPATH_EXTC_SERDESENPLPBK2 | + INFINIPATH_EXTC_SERDESENENCDEC | INFINIPATH_EXTC_LEDSECPORTGREENON | + INFINIPATH_EXTC_LEDSECPORTYELLOWON | INFINIPATH_EXTC_LEDPRIPORTGREENON | + INFINIPATH_EXTC_LEDPRIPORTYELLOWON | INFINIPATH_EXTC_LEDGBLOKGREENON | + INFINIPATH_EXTC_LEDGBLERRREDOFF; + +/* Start of Documentation block for SerDes registers + * serdes and xgxs register bits; not all have defines, + * since I haven't yet needed them all, and I'm lazy. 
Those that I needed
+ are in ipath_registers.h
+
+serdesConfig0Out (R/W)
+	Default Value
+bit[3:0] - ResetA/B/C/D (4'b1111)
+bit[7:4] - L1PwrdnA/B/C/D (4'b0000)
+bit[11:8] - RxIdleEnX (4'b0000)
+bit[15:12] - TxIdleEnX (4'b0000)
+bit[19:16] - RxDetectEnX (4'b0000)
+bit[23:20] - BeaconTxEnX (4'b0000)
+bit[27:24] - RxTermEnX (4'b0000)
+bit[28] - ResetPLL (1'b0)
+bit[29] - L2Pwrdn (1'b0)
+bit[37:30] - Offset[7:0] (8'b00000000)
+bit[38] - OffsetEn (1'b0)
+bit[39] - ParLPBK (1'b0)
+bit[40] - ParReset (1'b0)
+bit[42:41] - RefSel (2'b10)
+bit[43] - PW (1'b0)
+bit[47:44] - LPBKA/B/C/D (4'b0000)
+bit[49:48] - ClkBufTermAdj (2'b0)
+bit[51:50] - RxTermAdj (2'b0)
+bit[53:52] - TxTermAdj (2'b0)
+bit[55:54] - RxEqCtl (2'b0)
+bit[63:56] - Reserved
+
+cce_wip_serdesConfig1Out[63:0] (R/W)
+bit[3:0] - HiDrvX (4'b0000)
+bit[7:4] - LoDrvX (4'b0000)
+bit[11:8] - DtxA[3:0] (4'b0000)
+bit[15:12] - DtxB[3:0] (4'b0000)
+bit[19:16] - DtxC[3:0] (4'b0000)
+bit[23:20] - DtxD[3:0] (4'b0000)
+bit[27:24] - DeqA[3:0] (4'b0000)
+bit[31:28] - DeqB[3:0] (4'b0000)
+bit[35:32] - DeqC[3:0] (4'b0000)
+bit[39:36] - DeqD[3:0] (4'b0000)
+Framer interface, bits 40-59, not used
+bit[44:40] - FmOffsetA[4:0] (5'b00000)
+bit[49:45] - FmOffsetB[4:0] (5'b00000)
+bit[54:50] - FmOffsetC[4:0] (5'b00000)
+bit[59:55] - FmOffsetD[4:0] (5'b00000)
+bit[63:60] - FmOffsetEnA/B/C/D (4'b0000)
+
+SerdesStatus[63:0] (RO)
+bit[3:0] - TxIdleDetectA/B/C/D
+bit[7:4] - RxDetectA/B/C/D
+bit[11:8] - BeaconDetectA/B/C/D
+bit[63:12] - Reserved
+
+XGXSConfigOut[63:0]
+bit[2:0] - Resets, init to 1; bit 0 unused?
+bit[3] - MDIO, select register bank for vendor specific register
+	(0x1e if set, else 0x1f); vendor-specific status in register 8
+	bits 0-3 lanes0-3 signal detect, 1 if detected
+	bits 4-7 lanes0-3 CTC fifo errors, 1 if detected (latched until read)
+bit[8:4] - MDIO port address
+bit[18:9] - lnk_sync_mask
+bit[22:19] - polarity inv
+
+Documentation end */
+
+/*
+ * General specs:
+ * ExtCtrl[63:48] = EXTC_GPIOOE[15:0]
+ * ExtCtrl[47:32] = EXTC_GPIOInvert[15:0]
+ * ExtStatus[63:48] = GpioIn[15:0]
+ *
+ * GPIO[1] = EEPROM_SDA
+ * GPIO[0] = EEPROM_SCL
+ */
+
+#define _IPATH_GPIO_SDA_NUM 1
+#define _IPATH_GPIO_SCL_NUM 0
+
+#define IPATH_GPIO_SDA \
+	(1UL << (_IPATH_GPIO_SDA_NUM+INFINIPATH_EXTC_GPIOOE_SHIFT))
+#define IPATH_GPIO_SCL \
+	(1UL << (_IPATH_GPIO_SCL_NUM+INFINIPATH_EXTC_GPIOOE_SHIFT))
+
+/*
+ * register bits for selecting i2c direction and values, used for I2C serial
+ * flash
+ */
+const uint16_t ipath_gpio_sda_num = _IPATH_GPIO_SDA_NUM;
+const uint16_t ipath_gpio_scl_num = _IPATH_GPIO_SCL_NUM;
+const uint64_t ipath_gpio_sda = IPATH_GPIO_SDA;
+const uint64_t ipath_gpio_scl = IPATH_GPIO_SCL;
+
+#include
+#include
+#include
+#include
+#include
+#include
+
+/*
+ * This file contains all of the code that is specific to the InfiniPath
+ * HT-400 chip.
+ */
+
+/* we support up to 4 chips per system */
+const uint32_t infinipath_max = 4;
+struct ipath_devdata devdata[4];
+static const char *ipath_unit_names[4] = {
+	"infinipath0", "infinipath1", "infinipath2", "infinipath3"
+};
+
+const char *ipath_get_unit_name(int unit)
+{
+	return ipath_unit_names[unit];
+}
+
+static void ipath_check_htlink(ipath_type t);
+
+/*
+ * display hardware errors. Most hardware errors are catastrophic, but for
+ * right now, we'll print them and continue. We reuse the same message
+ * buffer as ipath_handle_errors() to avoid excessive stack usage.
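+ *
+ * As an illustration only (the real caller is the interrupt handler,
+ * which is not part of this hunk), the expected calling pattern is
+ * roughly:
+ *
+ *	char msg[512];
+ *	if (istat & INFINIPATH_I_ERROR)
+ *		ipath_handle_hwerrors(t, msg, sizeof msg);
+ *
+ * (istat here stands for a hypothetical copy of kr_intstatus)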
+ */
+void ipath_handle_hwerrors(const ipath_type t, char *msg, int msgl)
+{
+	uint64_t hwerrs = ipath_kget_kreg64(t, kr_hwerrstatus);
+	uint32_t bits, ctrl;
+	int isfatal = 0;
+	char bitsmsg[64];
+
+	if (!hwerrs) {
+		_IPATH_VDBG("Called but no hardware errors set\n");
+		/*
+		 * better than printing confusing messages; this seems to
+		 * be related to clearing the crc error, or the pll error,
+		 * during init.
+		 */
+		return;
+	} else if (hwerrs == -1LL) {
+		_IPATH_UNIT_ERROR(t,
+			"Read of hardware error status failed (all bits set); ignoring\n");
+		return;
+	}
+	ipath_stats.sps_hwerrs++;
+
+	/*
+	 * clear the error, regardless of whether we continue or stop using
+	 * the chip.
+	 */
+	ipath_kput_kreg(t, kr_hwerrclear, hwerrs);
+
+	hwerrs &= devdata[t].ipath_hwerrmask;
+
+	/*
+	 * make sure we get this much out, unless told to be quiet,
+	 * or it's occurred within the last 5 seconds
+	 */
+	if ((hwerrs & ~devdata[t].ipath_lasthwerror) ||
+	    (infinipath_debug & __IPATH_VERBDBG))
+		_IPATH_INFO("Hardware error: hwerr=0x%llx (cleared)\n", hwerrs);
+	devdata[t].ipath_lasthwerror |= hwerrs;
+
+	if (hwerrs & ~infinipath_hwe_bitsextant)
+		_IPATH_UNIT_ERROR(t,
+			"hwerror interrupt with unknown errors %llx set\n",
+			hwerrs & ~infinipath_hwe_bitsextant);
+
+	ctrl = ipath_kget_kreg32(t, kr_control);
+	if (ctrl & INFINIPATH_C_FREEZEMODE) {
+		if (hwerrs) {
+			/*
+			 * if any bits are set that we aren't ignoring,
+			 * only make the complaint once, in case it's stuck
+			 * or recurring, and we get here multiple times
+			 */
+			if (devdata[t].ipath_flags & IPATH_INITTED) {
+				_IPATH_UNIT_ERROR(t,
+					"Fatal Error (freezemode), no longer usable\n");
+				isfatal = 1;
+			}
+			*devdata[t].ipath_statusp &= ~IPATH_STATUS_IB_READY;
+			/* mark as having had error */
+			*devdata[t].ipath_statusp |= IPATH_STATUS_HWERROR;
+			/*
+			 * mark as not usable, at a minimum until driver
+			 * is reloaded, probably until reboot, since no
+			 * other reset is possible.
+			 */
+			devdata[t].ipath_flags &= ~IPATH_INITTED;
+		} else {
+			_IPATH_DBG
+			    ("Clearing freezemode on ignored hardware error\n");
+			ctrl &= ~INFINIPATH_C_FREEZEMODE;
+			ipath_kput_kreg(t, kr_control, ctrl);
+		}
+	}
+
+	*msg = '\0';
+
+	/*
+	 * may someday want to decode into which bits are which
+	 * functional area for parity errors, etc.
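+	 * The decode below is plain mask-and-shift extraction; schematically
+	 * (SHIFT and MASK stand in for each field's constants):
+	 *
+	 *	bits = (uint32_t) ((hwerrs >> SHIFT) & MASK);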
+	 */
+	if (hwerrs & (infinipath_hwe_htcmemparityerr_mask
+		      << INFINIPATH_HWE_HTCMEMPARITYERR_SHIFT)) {
+		bits = (uint32_t) ((hwerrs >>
+				    INFINIPATH_HWE_HTCMEMPARITYERR_SHIFT) &
+				   INFINIPATH_HWE_HTCMEMPARITYERR_MASK);
+		snprintf(bitsmsg, sizeof bitsmsg, "[HTC Parity Errs %x] ",
+			 bits);
+		strlcat(msg, bitsmsg, msgl);
+	}
+	if (hwerrs & (INFINIPATH_HWE_RXEMEMPARITYERR_MASK
+		      << INFINIPATH_HWE_RXEMEMPARITYERR_SHIFT)) {
+		bits = (uint32_t) ((hwerrs >>
+				    INFINIPATH_HWE_RXEMEMPARITYERR_SHIFT) &
+				   INFINIPATH_HWE_RXEMEMPARITYERR_MASK);
+		snprintf(bitsmsg, sizeof bitsmsg, "[RXE Parity Errs %x] ",
+			 bits);
+		strlcat(msg, bitsmsg, msgl);
+	}
+	if (hwerrs & (INFINIPATH_HWE_TXEMEMPARITYERR_MASK
+		      << INFINIPATH_HWE_TXEMEMPARITYERR_SHIFT)) {
+		bits = (uint32_t) ((hwerrs >>
+				    INFINIPATH_HWE_TXEMEMPARITYERR_SHIFT) &
+				   INFINIPATH_HWE_TXEMEMPARITYERR_MASK);
+		snprintf(bitsmsg, sizeof bitsmsg, "[TXE Parity Errs %x] ",
+			 bits);
+		strlcat(msg, bitsmsg, msgl);
+	}
+	if (hwerrs & INFINIPATH_HWE_IBCBUSTOSPCPARITYERR)
+		strlcat(msg, "[IB2IPATH Parity]", msgl);
+	if (hwerrs & INFINIPATH_HWE_IBCBUSFRSPCPARITYERR)
+		strlcat(msg, "[IPATH2IB Parity]", msgl);
+	if (hwerrs & INFINIPATH_HWE_HTCBUSIREQPARITYERR)
+		strlcat(msg, "[HTC Ireq Parity]", msgl);
+	if (hwerrs & INFINIPATH_HWE_HTCBUSTREQPARITYERR)
+		strlcat(msg, "[HTC Treq Parity]", msgl);
+	if (hwerrs & INFINIPATH_HWE_HTCBUSTRESPPARITYERR)
+		strlcat(msg, "[HTC Tresp Parity]", msgl);
+
+/* keep the code below somewhat more readable; not used elsewhere */
+#define _IPATH_HTLINK0_CRCBITS (infinipath_hwe_htclnkabyte0crcerr | \
+				infinipath_hwe_htclnkabyte1crcerr)
+#define _IPATH_HTLINK1_CRCBITS (infinipath_hwe_htclnkbbyte0crcerr | \
+				infinipath_hwe_htclnkbbyte1crcerr)
+#define _IPATH_HTLANE0_CRCBITS (infinipath_hwe_htclnkabyte0crcerr | \
+				infinipath_hwe_htclnkbbyte0crcerr)
+#define _IPATH_HTLANE1_CRCBITS (infinipath_hwe_htclnkabyte1crcerr | \
+				infinipath_hwe_htclnkbbyte1crcerr)
+	if (hwerrs & (_IPATH_HTLINK0_CRCBITS | _IPATH_HTLINK1_CRCBITS)) {
+		char bitsmsg[64];
+		uint64_t crcbits = hwerrs &
+		    (_IPATH_HTLINK0_CRCBITS | _IPATH_HTLINK1_CRCBITS);
+		/* don't check if 8bit HT */
+		if (devdata[t].ipath_flags & IPATH_8BIT_IN_HT0)
+			crcbits &= ~infinipath_hwe_htclnkabyte1crcerr;
+		/* don't check if 8bit HT */
+		if (devdata[t].ipath_flags & IPATH_8BIT_IN_HT1)
+			crcbits &= ~infinipath_hwe_htclnkbbyte1crcerr;
+		/*
+		 * we'll want to ignore link errors on a link that is
+		 * not in use, if any. For now, complain about both
+		 */
+		if (crcbits) {
+			uint16_t ctrl0, ctrl1;
+			snprintf(bitsmsg, sizeof bitsmsg,
+				 "[HT%s lane %s CRC (%llx); ignore till reload]",
+				 !(crcbits & _IPATH_HTLINK1_CRCBITS) ?
+				 "0 (A)" : (!(crcbits & _IPATH_HTLINK0_CRCBITS)
+					    ? "1 (B)" : "0+1 (A+B)"),
+				 !(crcbits & _IPATH_HTLANE1_CRCBITS) ? "0"
+				 : (!(crcbits & _IPATH_HTLANE0_CRCBITS) ? "1" :
+				    "0+1"), crcbits);
+			strlcat(msg, bitsmsg, msgl);
+
+			/*
+			 * print extra info for debugging.
+			 * slave/primary config word 4, 8 (link control 0, 1)
+			 */
+
+			if (pci_read_config_word(devdata[t].pcidev,
+						 devdata[t].ipath_ht_slave_off +
+						 0x4, &ctrl0))
+				_IPATH_INFO
+				    ("Couldn't read linkctrl0 of slave/primary config block\n");
+			else if (!(ctrl0 & 1 << 6))	/* not if EOC bit set */
+				_IPATH_DBG("HT linkctrl0 0x%x%s%s\n", ctrl0,
+					   ((ctrl0 >> 8) & 7) ? " CRC" : "",
+					   ((ctrl0 >> 4) & 1) ?
"linkfail" : + ""); + if (pci_read_config_word + (devdata[t].pcidev, + devdata[t].ipath_ht_slave_off + 0x8, &ctrl1)) + _IPATH_INFO + ("Couldn't read linkctrl1 of slave/primary config block\n"); + else if (!(ctrl1 & 1 << 6)) /* not if EOC bit set */ + _IPATH_DBG("HT linkctrl1 0x%x%s%s\n", ctrl1, + ((ctrl1 >> 8) & 7) ? " CRC" : "", + ((ctrl1 >> 4) & 1) ? "linkfail" : + ""); + + /* disable until driver reloaded */ + devdata[t].ipath_hwerrmask &= ~crcbits; + ipath_kput_kreg(t, kr_hwerrmask, + devdata[t].ipath_hwerrmask); + _IPATH_DBG("HT crc errs: %s\n", msg); + } else + _IPATH_DBG + ("ignoring HT crc errors 0x%llx, not in use\n", + hwerrs & (_IPATH_HTLINK0_CRCBITS | + _IPATH_HTLINK1_CRCBITS)); + } + + if (hwerrs & INFINIPATH_HWE_HTCMISCERR5) + strlcat(msg, "[HT core Misc5]", msgl); + if (hwerrs & INFINIPATH_HWE_HTCMISCERR6) + strlcat(msg, "[HT core Misc6]", msgl); + if (hwerrs & INFINIPATH_HWE_HTCMISCERR7) + strlcat(msg, "[HT core Misc7]", msgl); + if (hwerrs & INFINIPATH_HWE_MEMBISTFAILED) { + strlcat(msg, "[Memory BIST test failed, HT-400 unusable]", + msgl); + /* ignore from now on, so disable until driver reloaded */ + devdata[t].ipath_hwerrmask &= ~INFINIPATH_HWE_MEMBISTFAILED; + ipath_kput_kreg(t, kr_hwerrmask, devdata[t].ipath_hwerrmask); + } +#define _IPATH_PLL_FAIL (INFINIPATH_HWE_COREPLL_FBSLIP | \ + INFINIPATH_HWE_COREPLL_RFSLIP | \ + INFINIPATH_HWE_HTBPLL_FBSLIP | \ + INFINIPATH_HWE_HTBPLL_RFSLIP | \ + INFINIPATH_HWE_HTAPLL_FBSLIP | \ + INFINIPATH_HWE_HTAPLL_RFSLIP) + + if (hwerrs & _IPATH_PLL_FAIL) { + snprintf(bitsmsg, sizeof bitsmsg, + "[PLL failed (%llx), HT-400 unusable]", + hwerrs & _IPATH_PLL_FAIL); + strlcat(msg, bitsmsg, msgl); + /* ignore from now on, so disable until driver reloaded */ + devdata[t].ipath_hwerrmask &= ~(hwerrs & _IPATH_PLL_FAIL); + ipath_kput_kreg(t, kr_hwerrmask, devdata[t].ipath_hwerrmask); + } + + if (hwerrs & INFINIPATH_HWE_EXTSERDESPLLFAILED) { + /* + * If it occurs, it is left masked since the eternal interface + * is unused + */ + devdata[t].ipath_hwerrmask &= + ~INFINIPATH_HWE_EXTSERDESPLLFAILED; + ipath_kput_kreg(t, kr_hwerrmask, devdata[t].ipath_hwerrmask); + } + + if (hwerrs & INFINIPATH_HWE_RXDSYNCMEMPARITYERR) + strlcat(msg, "[Rx Dsync]", msgl); + if (hwerrs & INFINIPATH_HWE_SERDESPLLFAILED) + strlcat(msg, "[SerDes PLL]", msgl); + + _IPATH_UNIT_ERROR(t, "%s hardware error\n", msg); + if (isfatal && (!ipath_diags_enabled)) { + if (devdata[t].ipath_freezemsg) { + /* + * for proc status file ; if no trailing } is copied, we'll know + * it was truncated. 
+ */ + snprintf(devdata[t].ipath_freezemsg, + devdata[t].ipath_freezelen, "{%s}", msg); + } + } +} + +/* fill in the board name, based on the board revision register */ +void ipath_ht_get_boardname(const ipath_type t, char *name, size_t namelen) +{ + char *n = NULL; + uint8_t boardrev = devdata[t].ipath_boardrev; + + switch (boardrev) { + case 4: /* Ponderosa is one of the bringup boards */ + n = "Ponderosa"; + break; + case 5: /* HT-460 original production board */ + n = "InfiniPath_HT-460"; + break; + case 7: /* HT-460 small form factor production board */ + n = "InfiniPath_HT-465"; + break; + case 6: + n = "OEM_Board_3"; + break; + case 8: + n = "LS/X-1"; + break; + case 9: /* Comstock bringup test board */ + n = "Comstock"; + break; + case 10: + n = "OEM_Board_2"; + break; + case 11: + n = "HT-470"; + break; + default: /* don't know, just print the number */ + _IPATH_ERROR("Don't yet know about board with ID %u\n", + boardrev); + snprintf(name, namelen, "UnknownBoardRev%u", boardrev); + break; + } + if (n) + snprintf(name, namelen, "%s", n); +} + +int ipath_validate_rev(struct ipath_devdata * dd) +{ + if (dd->ipath_majrev != 3 || dd->ipath_minrev != 2) { + /* + * This version of the driver only supports the HT-400 + * Rev 3.2 + */ + _IPATH_UNIT_ERROR(IPATH_UNIT(dd), + "Unsupported HT-400 revision %u.%u!\n", + dd->ipath_majrev, dd->ipath_minrev); + return 1; + } + if (dd->ipath_htspeed != 800) + _IPATH_UNIT_ERROR(IPATH_UNIT(dd), + "Incorrectly configured for HT @ %uMHz\n", + dd->ipath_htspeed); + if (dd->ipath_boardrev == 7 || dd->ipath_boardrev == 11 || + dd->ipath_boardrev == 6) + dd->ipath_flags |= IPATH_GPIO_INTR; + else if (dd->ipath_boardrev == 8) { /* LS/X-1 */ + uint64_t val; + val = ipath_kget_kreg64(dd->ipath_pd[0]->port_unit, kr_extstatus); + if (val & INFINIPATH_EXTS_SERDESSEL) { /* hardware disabled */ + /* This means that the chip is hardware disabled, and will + * not be able to bring up the link, in any case. We special + * case this and abort early, to avoid later messages. 
We
+		 * also set the DISABLED status bit.
+		 */
+			_IPATH_DBG("Unit %u is hardware-disabled\n",
+				   dd->ipath_pd[0]->port_unit);
+			*dd->ipath_statusp |= IPATH_STATUS_DISABLED;
+			return 2;	/* this value is handled differently */
+		}
+	}
+	return 0;
+}
+
+static void ipath_check_htlink(ipath_type t)
+{
+	uint8_t linkerr, link_off, i;
+
+	for (i = 0; i < 2; i++) {
+		link_off = devdata[t].ipath_ht_slave_off + i * 4 + 0xd;
+		if (pci_read_config_byte(devdata[t].pcidev, link_off, &linkerr))
+			_IPATH_INFO
+			    ("Couldn't read linkerror%d of HT slave/primary block\n",
+			     i);
+		else if (linkerr & 0xf0) {
+			_IPATH_VDBG("HT linkerr%d bits 0x%x set, clearing\n",
+				    i, linkerr >> 4);
+			/*
+			 * writing the linkerr bits that are set should
+			 * clear them
+			 */
+			if (pci_write_config_byte
+			    (devdata[t].pcidev, link_off, linkerr))
+				_IPATH_DBG
+				    ("Failed write to clear HT linkerror%d\n",
+				     i);
+			if (pci_read_config_byte
+			    (devdata[t].pcidev, link_off, &linkerr))
+				_IPATH_INFO
+				    ("Couldn't reread linkerror%d of HT slave/primary block\n",
+				     i);
+			else if (linkerr & 0xf0)
+				_IPATH_INFO
+				    ("HT linkerror%d bits 0x%x couldn't be cleared\n",
+				     i, linkerr >> 4);
+		}
+	}
+}
+
+/*
+ * now that we have finished initializing everything that might reasonably
+ * cause a hardware error, and cleared those error bits as they occur,
+ * we can enable hardware errors in the mask (potentially enabling
+ * freeze mode), and enable hardware errors as errors (along with
+ * everything else) in errormask
+ */
+void ipath_clear_init_hwerrs(ipath_type t)
+{
+	uint64_t val, extsval;
+
+	extsval = ipath_kget_kreg64(t, kr_extstatus);
+
+	if (!(extsval & INFINIPATH_EXTS_MEMBIST_ENDTEST))
+		_IPATH_UNIT_ERROR(t, "MemBIST did not complete!\n");
+
+	ipath_check_htlink(t);
+
+	/* barring bugs, all hwerrors become interrupts, which can */
+	val = -1LL;
+	/* don't look at crc lane1 if 8 bit */
+	if (devdata[t].ipath_flags & IPATH_8BIT_IN_HT0)
+		val &= ~infinipath_hwe_htclnkabyte1crcerr;
+	/* don't look at crc lane1 if 8 bit */
+	if (devdata[t].ipath_flags & IPATH_8BIT_IN_HT1)
+		val &= ~infinipath_hwe_htclnkbbyte1crcerr;
+
+	/*
+	 * disable RXDSYNCMEMPARITY because external serdes is unused,
+	 * and therefore the logic will never be used or initialized,
+	 * and uninitialized state will normally result in this error
+	 * being asserted. Similarly for the external serdes pll
+	 * lock signal.
+	 */
+	val &=
+	    ~(INFINIPATH_HWE_EXTSERDESPLLFAILED |
+	      INFINIPATH_HWE_RXDSYNCMEMPARITYERR);
+
+	/*
+	 * Disable MISCERR4 because of an inversion in the HT core
+	 * logic checking for errors that cause this bit to be set.
+	 * The errata can also cause the protocol error bit to be set
+	 * in the HT config space linkerror register(s).
+	 */
+	val &= ~INFINIPATH_HWE_HTCMISCERR4;
+
+	/*
+	 * PLL ignored because MDIO interface has a logic problem for reads,
+	 * on Comstock and Ponderosa.
BRINGUP + */ + if (devdata[t].ipath_boardrev == 4 || devdata[t].ipath_boardrev == 9) + val &= ~INFINIPATH_HWE_EXTSERDESPLLFAILED; /* BRINGUP */ + devdata[t].ipath_hwerrmask = val; +} + +/* bring up the serdes */ +int ipath_bringup_serdes(ipath_type t) +{ + uint64_t val, config1; + int ret = 0, change = 0; + + _IPATH_DBG("Trying to bringup serdes\n"); + + if (ipath_kget_kreg64(t, kr_hwerrstatus) & + INFINIPATH_HWE_SERDESPLLFAILED) { + _IPATH_DBG + ("At start, serdes PLL failed bit set in hwerrstatus, clearing and continuing\n"); + ipath_kput_kreg(t, kr_hwerrclear, + INFINIPATH_HWE_SERDESPLLFAILED); + } + + val = ipath_kget_kreg64(t, kr_serdesconfig0); + config1 = ipath_kget_kreg64(t, kr_serdesconfig1); + + _IPATH_VDBG + ("Initial serdes status is config0=%llx config1=%llx, sstatus=%llx xgxs %llx\n", + val, config1, ipath_kget_kreg64(t, kr_serdesstatus), + ipath_kget_kreg64(t, kr_xgxsconfig)); + + /* force reset on */ + val |= + INFINIPATH_SERDC0_RESET_PLL /* | INFINIPATH_SERDC0_RESET_MASK */ ; + ipath_kput_kreg(t, kr_serdesconfig0, val); + udelay(15); /* need pll reset set at least for a bit */ + + if (val & INFINIPATH_SERDC0_RESET_PLL) { + uint64_t val2 = val &= ~INFINIPATH_SERDC0_RESET_PLL; + /* set lane resets, and tx idle, during pll reset */ + val2 |= INFINIPATH_SERDC0_RESET_MASK | INFINIPATH_SERDC0_TXIDLE; + _IPATH_VDBG("Clearing serdes PLL reset (writing %llx)\n", val2); + ipath_kput_kreg(t, kr_serdesconfig0, val2); + /* be sure chip saw it */ + val = ipath_kget_kreg64(t, kr_scratch); + /* + * need pll reset clear at least 11 usec before lane resets + * cleared; give it a few more + */ + udelay(15); + val = val2; /* for check below */ + } + + if (val & (INFINIPATH_SERDC0_RESET_PLL | INFINIPATH_SERDC0_RESET_MASK + | INFINIPATH_SERDC0_TXIDLE)) { + val &= + ~(INFINIPATH_SERDC0_RESET_PLL | INFINIPATH_SERDC0_RESET_MASK + | INFINIPATH_SERDC0_TXIDLE); + ipath_kput_kreg(t, kr_serdesconfig0, val); /* clear them */ + } + + val = ipath_kget_kreg64(t, kr_xgxsconfig); + if (((val >> INFINIPATH_XGXS_MDIOADDR_SHIFT) & + INFINIPATH_XGXS_MDIOADDR_MASK) != 3) { + val &= + ~(INFINIPATH_XGXS_MDIOADDR_MASK << + INFINIPATH_XGXS_MDIOADDR_SHIFT); + /* we use address 3 */ + val |= 3ULL << INFINIPATH_XGXS_MDIOADDR_SHIFT; + change = 1; + } + if (val & INFINIPATH_XGXS_RESET) { /* normally true after boot */ + val &= ~INFINIPATH_XGXS_RESET; + change = 1; + } + if (change) + ipath_kput_kreg(t, kr_xgxsconfig, val); + + val = ipath_kget_kreg64(t, kr_serdesconfig0); + + config1 &= ~0x0ffffffff00ULL; /* clear current and de-emphasis bits */ + config1 |= 0x00000000000ULL; /* set current to 20ma */ + config1 |= 0x0cccc000000ULL; /* set de-emphasis to -5.68dB */ + ipath_kput_kreg(t, kr_serdesconfig1, config1); + + _IPATH_VDBG + ("After setup: serdes status is config0=%llx config1=%llx, sstatus=%llx xgxs %llx\n", + val, config1, ipath_kget_kreg64(t, kr_serdesstatus), + ipath_kget_kreg64(t, kr_xgxsconfig)); + + if ((!ipath_waitfor_mdio_cmdready(t))) { + ipath_kput_kreg(t, kr_mdio, IPATH_MDIO_REQ(IPATH_MDIO_CMD_READ, + 31, + IPATH_MDIO_CTRL_XGXS_REG_8, + 0)); + if (ipath_waitfor_complete + (t, kr_mdio, IPATH_MDIO_DATAVALID, &val)) + _IPATH_DBG + ("Never got MDIO data for XGXS status read\n"); + else + _IPATH_VDBG("MDIO Read reg8, 'bank' 31 %x\n", + (uint32_t) val); + } else + _IPATH_DBG("Never got MDIO cmdready for XGXS status read\n"); + + return ret; /* for now, say we always succeeded */ +} + +/* set serdes to txidle; driver is being unloaded */ +void ipath_quiet_serdes(const ipath_type t) +{ + uint64_t val = 
ipath_kget_kreg64(t, kr_serdesconfig0); + + val |= INFINIPATH_SERDC0_TXIDLE; + _IPATH_DBG("Setting TxIdleEn on serdes (config0 = %llx)\n", val); + ipath_kput_kreg(t, kr_serdesconfig0, val); +} + +EXPORT_SYMBOL(ipath_get_unit_name); + diff -r e8af3873b0d9 -r 5e9b0b7876e2 drivers/infiniband/hw/ipath/ipath_i2c.c --- /dev/null Thu Jan 1 00:00:00 1970 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_i2c.c Wed Dec 28 14:19:43 2005 -0800 @@ -0,0 +1,473 @@ +/* + * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "ipath_kernel.h" +#include "ips_common.h" +#include "ipath_layer.h" + +/* + * InfiniPath I2C Driver for a serial flash. Not a generic i2c + * interface. Requires software bitbanging, with readbacks from chip + * to ensure timing (simple udelay is not enough). Specialized enough + * that using the standard kernel i2c bitbanging interface appears as + * though it would make the code longer and harder to maintain, rather + * than simpler. + * Intended to work with parts similar to Atmel AT24C01 (a 1Kbit part, + * that uses no programmable address bits, with address 1010000b). + */ + +typedef enum i2c_line_type_e { + i2c_line_scl = 0, + i2c_line_sda +} ipath_i2c_type; + +typedef enum i2c_line_state_e { + i2c_line_low = 0, + i2c_line_high +} ipath_i2c_state; + +#define READ_CMD 1 +#define WRITE_CMD 0 + +static int ipath_eeprom_init; + +/* + * The gpioval manipulation really should be protected by spinlocks + * or be converted to use atomic operations (unfortunately, atomic.h + * doesn't cover 64 bit ops for some of them). + */ + +int i2c_gpio_set(ipath_type dev, ipath_i2c_type line, + ipath_i2c_state new_line_state); +int i2c_gpio_get(ipath_type dev, ipath_i2c_type line, + ipath_i2c_state * curr_statep); + +/* + * returns 0 if the line was set to the new state successfully, non-zero + * on error. 
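+ * For illustration, a caller toggling SCL low and then (tri-stated)
+ * high would do, sketch only:
+ *
+ *	i2c_gpio_set(dev, i2c_line_scl, i2c_line_low);
+ *	i2c_gpio_set(dev, i2c_line_scl, i2c_line_high);
+ *
+ * (the ipath_scl_out()/ipath_sda_out() wrappers below add the required
+ * settle delay)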
+ */ +int i2c_gpio_set(ipath_type dev, ipath_i2c_type line, + ipath_i2c_state new_line_state) +{ + uint64_t read_val, write_val, mask, *gpioval; + + gpioval = &devdata[dev].ipath_gpio_out; + read_val = ipath_kget_kreg64(dev, kr_extctrl); + if (line == i2c_line_scl) + mask = ipath_gpio_scl; + else + mask = ipath_gpio_sda; + + if (new_line_state == i2c_line_high) + /* tri-state the output rather than force high */ + write_val = read_val & ~mask; + else + /* config line to be an output */ + write_val = read_val | mask; + ipath_kput_kreg(dev, kr_extctrl, write_val); + + /* set high and verify */ + if (new_line_state == i2c_line_high) + write_val = 0x1UL; + else + write_val = 0x0UL; + + if (line == i2c_line_scl) { + write_val <<= ipath_gpio_scl_num; + *gpioval = *gpioval & ~(1UL << ipath_gpio_scl_num); + *gpioval |= write_val; + } else { + write_val <<= ipath_gpio_sda_num; + *gpioval = *gpioval & ~(1UL << ipath_gpio_sda_num); + *gpioval |= write_val; + } + ipath_kput_kreg(dev, kr_gpio_out, *gpioval); + + return 0; +} + +/* + * returns 0 if the line was set to the new state successfully, non-zero + * on error. curr_state is not set on error. + */ +int i2c_gpio_get(ipath_type dev, ipath_i2c_type line, + ipath_i2c_state * curr_statep) +{ + uint64_t read_val, write_val, mask; + + /* check args */ + if (curr_statep == NULL) + return 1; + + read_val = ipath_kget_kreg64(dev, kr_extctrl); + /* config line to be an input */ + if (line == i2c_line_scl) + mask = ipath_gpio_scl; + else + mask = ipath_gpio_sda; + write_val = read_val & ~mask; + ipath_kput_kreg(dev, kr_extctrl, write_val); + read_val = ipath_kget_kreg64(dev, kr_extstatus); + + if (read_val & mask) + *curr_statep = i2c_line_high; + else + *curr_statep = i2c_line_low; + + return 0; +} + +/* + * would prefer to not inline this, to avoid code bloat, and simplify debugging + * But when compiling against 2.6.10 kernel tree, it gets an error, so + * not for now. + */ +static void ipath_i2c_delay(ipath_type, int); + +/* + * we use this instead of udelay directly, so we can make sure + * that previous register writes have been flushed all the way + * to the chip. Since we are delaying anyway, the cost doesn't + * hurt, and makes the bit twiddling more regular + * If delay is negative, we'll do the chip read, to be sure write made it + * to our chip, but won't do udelay() + */ +static void ipath_i2c_delay(ipath_type dev, int dtime) +{ + /* + * This needs to be volatile, so that the compiler doesn't + * optimize away the read to the device's mapped memory. + */ + volatile uint32_t read_val; + if (!dtime) + return; + read_val = ipath_kget_kreg32(dev, kr_scratch); + if (--dtime > 0) /* register read takes about .5 usec, itself */ + udelay(dtime); +} + +static void ipath_scl_out(ipath_type dev, uint8_t bit, int delay) +{ + i2c_gpio_set(dev, i2c_line_scl, bit ? i2c_line_high : i2c_line_low); + + ipath_i2c_delay(dev, delay); +} + +static void ipath_sda_out(ipath_type dev, uint8_t bit, int delay) +{ + i2c_gpio_set(dev, i2c_line_sda, bit ? i2c_line_high : i2c_line_low); + + ipath_i2c_delay(dev, delay); +} + +static uint8_t ipath_sda_in(ipath_type dev, int delay) +{ + ipath_i2c_state bit; + + if (i2c_gpio_get(dev, i2c_line_sda, &bit)) + _IPATH_DBG("get bit failed!\n"); + + ipath_i2c_delay(dev, delay); + + return bit == i2c_line_high ? 
1U : 0;
+}
+
+/* see if ack following write is true */
+static int ipath_i2c_ackrcv(ipath_type dev)
+{
+	uint8_t ack_received;
+
+	/* AT ENTRY SCL = LOW */
+	/* change direction, ignore data */
+	ack_received = ipath_sda_in(dev, 1);
+	ipath_scl_out(dev, i2c_line_high, 1);
+	ack_received = ipath_sda_in(dev, 1) == 0;
+	ipath_scl_out(dev, i2c_line_low, 1);
+	return ack_received;
+}
+
+/*
+ * write a byte, one bit at a time. Returns 0 if we got the following
+ * ack, otherwise 1
+ */
+static int ipath_wr_byte(ipath_type dev, uint8_t data)
+{
+	int bit_cntr;
+	uint8_t bit;
+
+	for (bit_cntr = 7; bit_cntr >= 0; bit_cntr--) {
+		bit = (data >> bit_cntr) & 1;
+		ipath_sda_out(dev, bit, 1);
+		ipath_scl_out(dev, i2c_line_high, 1);
+		ipath_scl_out(dev, i2c_line_low, 1);
+	}
+	if (!ipath_i2c_ackrcv(dev))
+		return 1;
+	return 0;
+}
+
+static void send_ack(ipath_type dev)
+{
+	ipath_sda_out(dev, i2c_line_low, 1);
+	ipath_scl_out(dev, i2c_line_high, 1);
+	ipath_scl_out(dev, i2c_line_low, 1);
+	ipath_sda_out(dev, i2c_line_high, 1);
+}
+
+/*
+ * ipath_i2c_startcmd - Transmit the start condition, followed by
+ * address/cmd
+ * (both clock/data high, clock high, data low while clock is high)
+ */
+static int ipath_i2c_startcmd(ipath_type dev, uint8_t offset_dir)
+{
+	int res;
+
+	/* issue start sequence */
+	ipath_sda_out(dev, i2c_line_high, 1);
+	ipath_scl_out(dev, i2c_line_high, 1);
+	ipath_sda_out(dev, i2c_line_low, 1);
+	ipath_scl_out(dev, i2c_line_low, 1);
+
+	/* issue length and direction byte */
+	res = ipath_wr_byte(dev, offset_dir);
+
+	if (res)
+		_IPATH_VDBG("No ack to complete start\n");
+	return res;
+}
+
+/*
+ * stop_cmd - Transmit the stop condition
+ * (both clock/data low, clock high, data high while clock is high)
+ */
+static void stop_cmd(ipath_type dev)
+{
+	ipath_scl_out(dev, i2c_line_low, 1);
+	ipath_sda_out(dev, i2c_line_low, 1);
+	ipath_scl_out(dev, i2c_line_high, 1);
+	ipath_sda_out(dev, i2c_line_high, 3);
+}
+
+/*
+ * ipath_eeprom_reset - reset I2C communication.
+ *
+ * eeprom: Atmel AT24C01
+ */
+static int ipath_eeprom_reset(ipath_type dev)
+{
+	int clock_cycles_left = 9;
+	uint64_t *gpioval = &devdata[dev].ipath_gpio_out;
+
+	ipath_eeprom_init = 1;
+	*gpioval = ipath_kget_kreg64(dev, kr_gpio_out);
+	_IPATH_VDBG("Resetting i2c flash; initial gpioout reg is %llx\n",
+		    *gpioval);
+
+	/*
+	 * This is to get the i2c into a known state, by first going low,
+	 * then tristate sda (and then tristate scl as first thing in loop)
+	 */
+	ipath_scl_out(dev, i2c_line_low, 1);
+	ipath_sda_out(dev, i2c_line_high, 1);
+
+	while (clock_cycles_left--) {
+		ipath_scl_out(dev, i2c_line_high, 1);
+
+		if (ipath_sda_in(dev, 0)) {
+			ipath_sda_out(dev, i2c_line_low, 1);
+			ipath_scl_out(dev, i2c_line_low, 1);
+			return 0;
+		}
+
+		ipath_scl_out(dev, i2c_line_low, 1);
+	}
+
+	return 1;
+}
+
+/*
+ * ipath_eeprom_read - reads the requested number of bytes from the
+ * eeprom via I2C.
+ *
+ * eeprom: Atmel AT24C01
+ */
+int ipath_eeprom_read(ipath_type dev, uint8_t eeprom_offset, void *buffer,
+		      int len)
+{
+	/* compiler complains unless initialized */
+	uint8_t single_byte = 0;
+	int bit_cntr;
+
+	if (!ipath_eeprom_init)
+		ipath_eeprom_reset(dev);
+
+	eeprom_offset = (eeprom_offset << 1) | READ_CMD;
+
+	if (ipath_i2c_startcmd(dev, eeprom_offset)) {
+		_IPATH_DBG("Failed startcmd\n");
+		stop_cmd(dev);
+		return 1;
+	}
+
+	/*
+	 * flash keeps clocking data out as long as we ack, automatically
+	 * incrementing the address.
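+	 * A hypothetical caller (the offset and length here are
+	 * illustrative only; the real layout lives in struct ipath_flash):
+	 *
+	 *	uint8_t buf[8];
+	 *	if (ipath_eeprom_read(dev, 0, buf, sizeof buf))
+	 *		_IPATH_DBG("eeprom read failed\n");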
+	 */
+	while (len-- > 0) {
+		/* get data */
+		single_byte = 0;
+		for (bit_cntr = 8; bit_cntr; bit_cntr--) {
+			uint8_t bit;
+			ipath_scl_out(dev, i2c_line_high, 1);
+			bit = ipath_sda_in(dev, 0);
+			single_byte |= bit << (bit_cntr - 1);
+			ipath_scl_out(dev, i2c_line_low, 1);
+		}
+
+		/* send ack if not the last byte */
+		if (len)
+			send_ack(dev);
+
+		*((uint8_t *) buffer) = single_byte;
+		buffer = (uint8_t *) buffer + 1;
+	}
+
+	stop_cmd(dev);
+
+	return 0;
+}
+
+/*
+ * ipath_eeprom_write - writes data to the eeprom via I2C.
+ */
+int ipath_eeprom_write(ipath_type dev, uint8_t eeprom_offset, void *buffer,
+		       int len)
+{
+	uint8_t single_byte;
+	int sub_len;
+	uint8_t *bp = buffer;
+	int max_wait_time, i;
+
+	if (!ipath_eeprom_init)
+		ipath_eeprom_reset(dev);
+
+	while (len > 0) {
+		if (ipath_i2c_startcmd(dev, (eeprom_offset << 1) | WRITE_CMD)) {
+			_IPATH_DBG("Failed to start cmd offset %u\n",
+				   eeprom_offset);
+			goto failed_write;
+		}
+
+		sub_len = min(len, 4);
+		eeprom_offset += sub_len;
+		len -= sub_len;
+
+		for (i = 0; i < sub_len; i++) {
+			if (ipath_wr_byte(dev, *bp++)) {
+				_IPATH_DBG
+				    ("no ack after byte %u/%u (%u total remain)\n",
+				     i, sub_len, len + sub_len - i);
+				goto failed_write;
+			}
+		}
+
+		stop_cmd(dev);
+
+		/*
+		 * wait for write complete by waiting for a successful
+		 * read (the chip replies with a zero after the write
+		 * cmd completes, and before it writes to the flash).
+		 * The startcmd for the read will fail the ack until
+		 * the writes have completed. We do this inline to avoid
+		 * the debug prints that are in the real read routine
+		 * if the startcmd fails.
+		 */
+		max_wait_time = 100;
+		while (ipath_i2c_startcmd(dev, READ_CMD)) {
+			stop_cmd(dev);
+			if (!--max_wait_time) {
+				_IPATH_DBG
+				    ("Did not get successful read to complete write\n");
+				goto failed_write;
+			}
+		}
+		/* now read the zero byte */
+		for (i = single_byte = 0; i < 8; i++) {
+			uint8_t bit;
+			ipath_scl_out(dev, i2c_line_high, 1);
+			bit = ipath_sda_in(dev, 0);
+			ipath_scl_out(dev, i2c_line_low, 1);
+			single_byte <<= 1;
+			single_byte |= bit;
+		}
+		stop_cmd(dev);
+	}
+
+	return 0;
+
+failed_write:
+	stop_cmd(dev);
+	return 1;
+}
+
+uint8_t ipath_flash_csum(struct ipath_flash *ifp, int adjust)
+{
+	uint8_t *ip = (uint8_t *) ifp;
+	uint8_t csum = 0, len;
+
+	for (len = 0; len < ifp->if_length; len++)
+		csum += *ip++;
+	csum -= ifp->if_csum;
+	csum = ~csum;
+	if (adjust)
+		ifp->if_csum = csum;
+	return csum;
+}
diff -r e8af3873b0d9 -r 5e9b0b7876e2 drivers/infiniband/hw/ipath/ipath_lib.c
--- /dev/null	Thu Jan 1 00:00:00 1970 +0000
+++ b/drivers/infiniband/hw/ipath/ipath_lib.c	Wed Dec 28 14:19:43 2005 -0800
@@ -0,0 +1,90 @@
+/*
+ * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ *   copyright notice, this list of conditions and the following
+ *   disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ *   copyright notice, this list of conditions and the following
+ *   disclaimer in the documentation and/or other materials
+ *   provided with the distribution.
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. + */ + +/* + * This is library code for the driver, similar to what's in libinfinipath for + * usermode code. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "ipath_kernel.h" + +unsigned infinipath_debug = __IPATH_INFO; + +uint32_t _ipath_pico_per_cycle; /* always present, for now */ + +/* + * This isn't perfect, but it's close enough for timing work. We want this + * to work on systems where the cycle counter isn't the same as the clock + * frequency. The one msec spin is OK, since we execute this only once + * when first loaded. We don't use CURRENT_TIME because on some systems + * it only has jiffy resolution; we just assume udelay is well calibrated + * and that we aren't likely to be rescheduled. Do it multiple times, + * with a yield in between, to try to make sure we get the "true minimum" + * value. + * _ipath_pico_per_cycle isn't going to lead to completely accurate + * conversions from timestamps to nanoseconds, but it's close enough + * for our purposes, which is mainly to allow people to show events with + * nsecs or usecs if desired, rather than cycles. + */ +void ipath_init_picotime(void) +{ + int i; + u_int64_t ts, te, delta = -1ULL; + + for (i = 0; i < 5; i++) { + ts = get_cycles(); + udelay(250); + te = get_cycles(); + if ((te - ts) < delta) + delta = te - ts; + yield(); + } + _ipath_pico_per_cycle = 250000000 / delta; +} diff -r e8af3873b0d9 -r 5e9b0b7876e2 drivers/infiniband/hw/ipath/ipath_upages.c --- /dev/null Thu Jan 1 00:00:00 1970 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_upages.c Wed Dec 28 14:19:43 2005 -0800 @@ -0,0 +1,144 @@ +/* + * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Patent licenses, if any, provided herein do not apply to
+ * combinations of this program with other software, or any other
+ * product whatsoever.
+ */
+
+#include
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include
+#include
+
+#include "ipath_kernel.h"
+
+/*
+ * Our version of the kernel mlock function. This function is no longer
+ * exposed, so we need to do it ourselves. It takes a given start page
+ * (page aligned user virtual address) and pins it and the following specified
+ * number of pages.
+ * For now, num_pages is always 1, but that will probably change at some
+ * point (because caller is doing expected sends on a single virtually
+ * contiguous buffer, so we can do all pages at once).
+ */
+int ipath_get_upages(unsigned long start_page, size_t num_pages, struct page **p)
+{
+	int n;
+
+	_IPATH_VDBG("pin %lx pages from vaddr %lx\n", num_pages, start_page);
+	down_read(&current->mm->mmap_sem);
+	n = get_user_pages(current, current->mm, start_page, num_pages, 1, 1,
+			   p, NULL);
+	up_read(&current->mm->mmap_sem);
+	if (n != num_pages) {
+		_IPATH_INFO
+		    ("get_user_pages (0x%lx pages starting at 0x%lx) failed with %d\n",
+		     num_pages, start_page, n);
+		if (n < 0)	/* it's an errno */
+			return n;
+		/*
+		 * We may have gotten some pages, so unlock those.
+		 * ipath_putpages() correctly handles n==0
+		 */
+		ipath_putpages(n, p);
+		return -ENOMEM;	/* no way to know actual error */
+	}
+
+	return 0;
+}
+
+/*
+ * this is similar to ipath_get_upages, but it's always one page, and we mark
+ * the page as locked for i/o, and shared. This is used for the user process
+ * page that contains the destination address for the rcvhdrq tail update,
+ * so we need to have the vma. If we don't do this, the page can be taken
+ * away from us on fork, even if the child never touches it, and then
+ * the user process never sees the tail register updates.
+ */
+int ipath_get_upages_nocopy(unsigned long start_page, struct page **p)
+{
+	int n;
+	struct vm_area_struct *vm = NULL;
+
+	down_read(&current->mm->mmap_sem);
+	n = get_user_pages(current, current->mm, start_page, 1, 1, 1, p, &vm);
+	up_read(&current->mm->mmap_sem);
+	if (n != 1) {
+		_IPATH_INFO("get_user_pages for 0x%lx failed with %d\n",
+			    start_page, n);
+		if (n < 0)	/* it's an errno */
+			return n;
+		/*
+		 * If we ever ask for more than a single page, we will have to
+		 * free the pages (if any) that we did get, via ipath_get_upages()
+		 * or put_page() directly.
+		 */
+		return -ENOMEM;	/* no way to know actual error */
+	}
+	vm->vm_flags |= VM_SHM | VM_LOCKED;
+
+	return 0;
+}
+
+/*
+ * Unpins the specified number of pages (each previously pinned by
+ * ipath_get_upages() or ipath_get_upages_nocopy()), marking each page
+ * dirty first, since the hardware may have written to it.
+ */
+void ipath_putpages(size_t num_pages, struct page **p)
+{
+	int i;
+
+	for (i = 0; i < num_pages; i++) {
+		_IPATH_MMDBG("%u/%lu put_page %p\n", i, num_pages, p[i]);
+		set_page_dirty_lock(p[i]);
+		put_page(p[i]);
+	}
+}
+
+/*
+ * This routine frees up all the allocations made in this file; it's a nop
+ * now, but I'm leaving it in case we go back to a more sophisticated
+ * implementation later.
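+ *
+ * For reference, the intended pin/use/unpin pairing in callers is
+ * roughly (sketch; addr stands for a page-aligned user virtual
+ * address):
+ *
+ *	struct page *pages[1];
+ *	if (!ipath_get_upages(addr, 1, pages)) {
+ *		... use the page ...
+ *		ipath_putpages(1, pages);
+ *	}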
+ */
+void ipath_upages_cleanup(struct ipath_portdata *pd)
+{
+}

From bos at pathscale.com  Wed Dec 28 16:31:24 2005
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Wed, 28 Dec 2005 16:31:24 -0800
Subject: [openib-general] [PATCH 5 of 20] ipath - driver core header files
In-Reply-To: 
Message-ID: <2d9a3f27a10c8f11df92.1135816284@eng-12.pathscale.com>

Signed-off-by: Bryan O'Sullivan 

diff -r a3a00f637da6 -r 2d9a3f27a10c drivers/infiniband/hw/ipath/ipath_common.h
--- /dev/null	Thu Jan 1 00:00:00 1970 +0000
+++ b/drivers/infiniband/hw/ipath/ipath_common.h	Wed Dec 28 14:19:42 2005 -0800
@@ -0,0 +1,704 @@
+/*
+ * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ *   copyright notice, this list of conditions and the following
+ *   disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ *   copyright notice, this list of conditions and the following
+ *   disclaimer in the documentation and/or other materials
+ *   provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Patent licenses, if any, provided herein do not apply to
+ * combinations of this program with other software, or any other
+ * product whatsoever.
+ */
+
+#ifndef _IPATH_COMMON_H
+#define _IPATH_COMMON_H
+
+/*
+ * This file contains defines, structures, etc. that are used
+ * to communicate between kernel and user code.
+ */
+
+#define IPATH_IDSTR "PathScale 1.1\n"
+
+typedef uint8_t ipath_type;
+
+/* This is the IEEE-assigned OUI for PathScale, Inc. */
+#define IPATH_SRC_OUI_1 0x00
+#define IPATH_SRC_OUI_2 0x11
+#define IPATH_SRC_OUI_3 0x75
+
+/* version of protocol header (known to chip also). In the long run,
+ * we should be able to generate and accept a range of version numbers;
+ * for now we only accept one, and it's compiled in.
+ */
+#define IPS_PROTO_VERSION 2
+
+/*
+ * These are compile time constants that you may want to enable or disable
+ * if you are trying to debug problems with code or performance.
+ * IPATH_VERBOSE_TRACING define as 1 if you want additional tracing in
+ * fastpath code
+ * IPATH_TRACE_REGWRITES define as 1 if you want register writes to be
+ * traced in fastpath code
+ * _IPATH_TRACING define as 0 if you want to remove all tracing in a
+ * compilation unit
+ * _IPATH_DEBUGGING define as 0 if you want to remove debug prints
+ */
+
+/*
+ * These tell the driver which ioctl's belong to the diags interface.
+ * As above, don't use them elsewhere.
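+ * A dispatch routine can route them with a range check of roughly this
+ * form (sketch only; how the request is then handled is not shown):
+ *
+ *	if (nr >= _IPATH_DIAG_IOCTL_LOW && nr <= _IPATH_DIAG_IOCTL_HIGH)
+ *		... hand off to the diag ioctl code ...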
+ */
+#define _IPATH_DIAG_IOCTL_LOW 100
+#define _IPATH_DIAG_IOCTL_HIGH 109
+
+struct ipath_eeprom_req {
+	long long addr;
+	uint16_t len;
+	uint16_t offset;
+};
+
+/*
+ * NOTE: We use compatible ioctls, the same ioctl code for both 32 and 64
+ * bit user mode. For that reason, all structures, etc. used in these
+ * ioctls must have the same size and offsets, in both 32 and 64 bit mode.
+ * We normally use uint64_t to hold pointers for this reason, doing appropriate
+ * casts on both sides before using the data value.
+ */
+
+/* init; user params to kernel */
+#define IPATH_USERINIT _IOW('s', 16, struct ipath_user_info)
+/* init; kernel/chip params to user */
+#define IPATH_BASEINFO _IOR('s', 17, struct ipath_base_info)
+/* send a packet */
+#define IPATH_SENDPKT _IOW('s', 18, struct ipath_sendpkt)
+/*
+ * if arg is 0, disable port, used when flushing after a hdrq overflow.
+ * If arg is 1, re-enable, and return new value of head register
+ */
+#define IPATH_RCVCTRL _IOR('s', 19, uint32_t)
+#define IPATH_READ_EEPROM _IOWR('s', 20, struct ipath_eeprom_req)
+/* set an accepted partition key; up to 4 pkeys can be active at once */
+#define IPATH_SET_PKEY _IOW('s', 21, uint16_t)
+#define IPATH_WRITE_EEPROM _IOWR('s', 22, struct ipath_eeprom_req)
+/* set LID for interface (SMA) */
+#define IPATH_SET_LID _IOW('s', 23, uint32_t)
+/* set IB MTU for interface (SMA) */
+#define IPATH_SET_MTU _IOW('s', 24, uint32_t)
+/* set IB link state for interface (SMA) */
+#define IPATH_SET_LINKSTATE _IOW('s', 25, uint32_t)
+/* send an SMA packet, sps_flags contains "normal" SMA unit and minor number. */
+#define IPATH_SEND_SMA_PKT _IOW('s', 26, struct ipath_sendpkt)
+/* receive an SMA packet */
+#define IPATH_RCV_SMA_PKT _IOW('s', 27, struct ipath_sendpkt)
+/* get the portinfo data (SMA)
+ * takes array of 13, returns port info fields.
Data is in host order, + * not network order; SMA-only fields are not filled in + */ +#define IPATH_GET_PORTINFO _IOWR('s', 28, uint32_t *) +/* + * get the nodeinfo data (SMA) + * takes an array of 10, returns nodeinfo fields in host order + */ +#define IPATH_GET_NODEINFO _IOWR('s', 29, uint32_t *) +/* set GUID on interface (SMA; GUID given in network order) */ +#define IPATH_SET_GUID _IOW('s', 30, struct ipath_setguid) +/* set MLID for interface (SMA) */ +#define IPATH_SET_MLID _IOW('s', 31, uint32_t) +#define IPATH_GET_MLID _IOWR('s', 32, uint32_t *) /* get the MLID (SMA) */ +/* update expected TID entries */ +#define IPATH_UPDM_TID _IOWR('s', 33, struct _tidupd) +/* free expected TID entries */ +#define IPATH_FREE_TID _IOW('s', 34, struct _tidupd) +/* return assigned unit:port */ +#define IPATH_GETPORT _IOR('s', 35, uint32_t) +/* wait for rcv pkt or pioavail */ +#define IPATH_WAIT _IOW('s', 36, uint32_t) +/* return LID for passed in unit */ +#define IPATH_GETLID _IOR('s', 37, uint16_t) +/* return # of units supported by driver */ +#define IPATH_GETUNITS _IO('s', 38) +/* get the device status */ +#define IPATH_GET_DEVSTATUS _IOWR('s', 39, uint64_t *) + +/* available for reuse ('s', 48) */ + +/* diagnostic read */ +#define IPATH_DIAGREAD _IOR('s', 100, struct ipath_diag_info) +/* diagnostic write */ +#define IPATH_DIAGWRITE _IOW('s', 101, struct ipath_diag_info) +/* HT Config read */ +#define IPATH_DIAG_HTREAD _IOR('s', 102, struct ipath_diag_info) +/* HT config write */ +#define IPATH_DIAG_HTWRITE _IOW('s', 103, struct ipath_diag_info) +#define IPATH_DIAGENTER _IO('s', 104) /* Enter diagnostic mode */ +#define IPATH_DIAGLEAVE _IO('s', 105) /* Leave diagnostic mode */ +/* send a packet, sps_flags contains unit and minor number. */ +#define IPATH_SEND_DIAG_PKT _IOW('s', 106, struct ipath_sendpkt) +/* + * read I2C FLASH + * NOTE: To read the I2C device, the _uaddress field should contain + * a pointer to struct ipath_eeprom_req, and _unit must be valid + */ +#define IPATH_DIAG_RD_I2C _IOW('s', 107, struct ipath_diag_info) + +/* + * Monitoring ioctls. All of these work with the main device + * (/dev/ipath), if you don't mind using a port (e.g. you already have + * the device open.) IPATH_GETSTATS and IPATH_GETUNITCOUNTERS also + * work with the control device (/dev/ipath_ctrl), if you don't want to + * use a port. + */ + +/* return chip counters for current unit. */ +#define IPATH_GETCOUNTERS _IOR('s', 40, struct infinipath_counters) +/* return chip stats */ +#define IPATH_GETSTATS _IOR('s', 41, struct infinipath_stats) +/* return chip counters for a particular unit. */ +#define IPATH_GETUNITCOUNTERS _IOR('s', 43, struct infinipath_getunitcounters) + +/* + * unit is incoming unit number. + * data is a pointer to the infinipath_counters structure. + */ +struct infinipath_getunitcounters { + uint16_t unit; + uint16_t fill[3]; /* required for same size struct 32/64 bit */ + uint64_t data; +}; + +/* + * The value in the BTH QP field that InfiniPath uses to differentiate + * an infinipath protocol IB packet vs standard IB transport + */ +#define IPATH_KD_QP 0x656b79 + +/* + * valid states passed to ipath_set_linkstate() user call + * (IPATH_SET_LINKSTATE ioctl) + */ +#define IPATH_IB_LINKDOWN 0 +#define IPATH_IB_LINKARM 1 +#define IPATH_IB_LINKACTIVE 2 +#define IPATH_IB_LINKINIT 3 +#define IPATH_IB_LINKDOWN_POLL 4 +#define IPATH_IB_LINKDOWN_DISABLE 5 + +/* + * stats maintained by the driver. For now, at least, this is global + * to all minor devices. 
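+ *
+ * Userspace can fetch a snapshot of these with the IPATH_GETSTATS
+ * ioctl defined above; e.g. (sketch only, error handling elided; fd
+ * is an open /dev/ipath descriptor):
+ *
+ *	struct infinipath_stats st;
+ *	ioctl(fd, IPATH_GETSTATS, &st);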
+ */ +struct infinipath_stats { + uint64_t sps_ints; /* number of interrupts taken */ + uint64_t sps_errints; /* number of interrupts for errors */ + /* number of errors from chip (not including packet errors or CRC) */ + uint64_t sps_errs; + /* number of packet errors from chip other than CRC */ + uint64_t sps_pkterrs; + /* number of packets with CRC errors (ICRC and VCRC) */ + uint64_t sps_crcerrs; + /* number of hardware errors reported (parity, etc.) */ + uint64_t sps_hwerrs; + /* number of times IB link changed state unexpectedly */ + uint64_t sps_iblink; + uint64_t sps_unused3; /* no longer used; left for compatibility */ + uint64_t sps_port0pkts; /* number of kernel (port0) packets received */ + /* number of "ethernet" packets sent by driver */ + uint64_t sps_ether_spkts; + /* number of "ethernet" packets received by driver */ + uint64_t sps_ether_rpkts; + uint64_t sps_sma_spkts; /* number of SMA packets sent by driver */ + uint64_t sps_sma_rpkts; /* number of SMA packets received by driver */ + /* number of times all ports rcvhdrq was full and packet dropped */ + uint64_t sps_hdrqfull; + /* number of times all ports egrtid was full and packet dropped */ + uint64_t sps_etidfull; + /* + * number of times we tried to send from driver, but no pio + * buffers avail + */ + uint64_t sps_nopiobufs; + uint64_t sps_ports; /* number of ports currently open */ + /* list of pkeys (other than default) accepted (0 means not set) */ + uint16_t sps_pkeys[4]; + /* lids for up to 4 infinipaths, indexed by infinipath # */ + uint16_t sps_lid[4]; + /* number of user ports per chip (not IB ports) */ + uint32_t sps_nports; + uint32_t sps_nullintr; /* not our interrupt, or already handled */ + uint32_t sps_maxpkts_call; /* max number of packets handled per receive call */ + uint32_t sps_avgpkts_call; /* avg number of packets handled per receive call */ + uint64_t sps_pagelocks; /* total number of pages locked */ + uint64_t sps_pageunlocks; /* total number of pages unlocked */ + /* + * Number of packets dropped in kernel other than errors + * (ether packets if ipath not configured, sma/mad, etc.) + */ + uint64_t sps_krdrops; + /* mlids for up to 4 infinipaths, indexed by infinipath # */ + uint16_t sps_mlid[4]; + uint64_t __sps_pad[45]; /* pad for future growth */ +}; + +/* + * These are the status bits returned (in ascii form, 64bit value) + * by the IPATH_GETSTATS ioctl. + */ +#define IPATH_STATUS_INITTED 0x1 /* basic driver initialization done */ +#define IPATH_STATUS_DISABLED 0x2 /* hardware disabled */ +#define IPATH_STATUS_UNUSED 0x4 /* available */ +#define IPATH_STATUS_OIB_SMA 0x8 /* ipath_mad kernel SMA running */ +#define IPATH_STATUS_SMA 0x10 /* user SMA running */ +/* Chip has been found and initted */ +#define IPATH_STATUS_CHIP_PRESENT 0x20 +#define IPATH_STATUS_IB_READY 0x40 /* IB link is at ACTIVE, has LID, + * usable for all VL's */ +/* after link up, LID,MTU,etc. has been configured */ +#define IPATH_STATUS_IB_CONF 0x80 +/* no link established, probably no cable */ +#define IPATH_STATUS_IB_NOCABLE 0x100 +/* A Fatal hardware error has occurred. */ +#define IPATH_STATUS_HWERROR 0x200 + +/* The list of usermode accessible registers. Also see Reg_* later in file */ +typedef enum _ipath_ureg { + ur_rcvhdrtail = 0, /* (RO) DMA RcvHdr to be used next. */ + /* (RW) RcvHdr entry to be processed next by host. */ + ur_rcvhdrhead = 1, + ur_rcvegrindextail = 2, /* (RO) Index of next Eager index to use. 
*/ + ur_rcvegrindexhead = 3, /* (RW) Eager TID to be processed next */ + /* For internal use only; max register number. */ + _IPATH_UregMax +} ipath_ureg; + +/* SMA minor# no portinfo, one for all instances */ +#define IPATH_SMA 128 + +/* Control minor# no portinfo, one for all instances */ +#define IPATH_CTRL 130 + +/* + * This structure is returned by ipath_userinit() immediately after open + * to get implementation-specific info, and info specific to this + * instance. + */ +struct ipath_base_info { + /* version of hardware, for feature checking. */ + uint32_t spi_hw_version; + /* version of software, for feature checking. */ + uint32_t spi_sw_version; + /* InfiniPath port assigned, goes into sent packets */ + uint32_t spi_port; + /* + * IB MTU, packets IB data must be less than this. + * The MTU is in bytes, and will be a multiple of 4 bytes. + */ + uint32_t spi_mtu; + /* + * size of a PIO buffer. Any given packet's total + * size must be less than this (in words). Included is the + * starting control word, so if 513 is returned, then total + * pkt size is 512 words or less. + */ + uint32_t spi_piosize; + /* size of the TID cache in infinipath, in entries */ + uint32_t spi_tidcnt; + /* size of the TID Eager list in infinipath, in entries */ + uint32_t spi_tidegrcnt; + /* size of a single receive header queue entry. */ + uint32_t spi_rcvhdrent_size; + /* Count of receive header queue entries allocated. + * This may be less than the spu_rcvhdrcnt passed in! + */ + uint32_t spi_rcvhdr_cnt; + + uint32_t __32_bit_compatibility_pad; /* DO NOT MOVE OR REMOVE */ + + /* address where receive header queue is mapped into the + * user program. + */ + uint64_t spi_rcvhdr_base; + + /* base address of eager TID receive buffers. + * Allocated by initialization code, not by protocol. + */ + uint64_t spi_rcv_egrbufs; + + /* size of each TID buffer in host memory, + * starting at spi_rcv_egrbufs. It includes spu_egrskip, and is + * at least spi_mtu bytes, and the buffers are virtually contiguous + */ + uint32_t spi_rcv_egrbufsize; + /* + * The special QP (queue pair) value that identifies an infinipath + * protocol packet from standard IB packets. More, probably much + * more, to be added. + */ + uint32_t spi_qpair; + + /* + * user register base for init code, not to be used directly by + * protocol or applications + */ + uint64_t __spi_uregbase; + /* + * maximum buffer size in bytes that can be used in a + * single TID entry (assuming the buffer is aligned to this boundary). + * This is the minimum of what the hardware and software support. + * Guaranteed to be a power of 2. + */ + uint32_t spi_tid_maxsize; + /* + * alignment of each pio send buffer (byte count + * to add to spi_piobufbase to get to second buffer) + */ + uint32_t spi_pioalign; + /* + * the index of the first pio buffer available + * to this process; needed to do lookup in spi_pioavailaddr; not added + * to spi_piobufbase + */ + uint32_t spi_pioindex; + uint32_t spi_piocnt; /* number of buffers mapped for this process */ + + /* + * base address of writeonly pio buffers for this process. + * Each buffer has spi_piosize words, and is aligned on spi_pioalign + * boundaries. spi_piocnt buffers are mapped from this address + */ + uint64_t spi_piobufbase; + + /* + * base address of readonly memory copy of the pioavail registers. + * There are 2 bits for each buffer.
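+ *
+ * Illustration only, not part of the original patch, and the exact
+ * bit layout here is an assumption: with 2 bits per buffer packed
+ * into 64-bit words, buffer i would live in word i / 32, and its busy
+ * bit (INFINIPATH_SENDPIOAVAIL_BUSY_SHIFT, defined later in this
+ * file) could be tested as:
+ *
+ *	const volatile uint64_t *avail =
+ *		(const volatile uint64_t *) spi_pioavailaddr;
+ *	int busy = (avail[i / 32] >>
+ *		(2 * (i % 32) + INFINIPATH_SENDPIOAVAIL_BUSY_SHIFT)) & 1;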
+ */ + uint64_t spi_pioavailaddr; + + /* + * Address where driver updates a copy + * of the interface and driver status (IPATH_STATUS_*) as a 64 bit value. + * It's followed by a string indicating hardware error, if there was one + */ + uint64_t spi_status; + + /* number of chip ports available to user processes */ + uint32_t spi_nports; + uint32_t spi_unit; /* unit number of chip we are using */ + uint32_t spi_rcv_egrperchunk; /* num bufs in each contiguous set */ + /* size in bytes of each contiguous set */ + uint32_t spi_rcv_egrchunksize; + /* total size of mmap to cover full rcvegrbuffers */ + uint32_t spi_rcv_egrbuftotlen; + /* + * ioctl cmd includes struct size, so pad out, and adjust down as + * new fields are added to keep size constant + */ + uint32_t __spi_pad[19]; +} __attribute__ ((aligned(8))); + +#define IPATH_WAIT_RCV 0x1 /* IPATH_WAIT, receive */ +#define IPATH_WAIT_PIO 0x2 /* IPATH_WAIT, PIO */ + +/* + * This version number is given to the driver by the user code during + * initialization in the spu_userversion field of ipath_user_info, so + * the driver can check for compatibility with user code. + * + * The major version changes when data structures + * change in an incompatible way. The driver must be the same or higher + * for initialization to succeed. In some cases, a higher version + * driver will not interoperate with older software, and initialization + * will return an error. + */ +#define IPATH_USER_SWMAJOR 1 + +/* + * Minor version differences are always compatible + * within a major version; however, if user software is newer + * than driver software, some new features and/or structure fields + * may not be implemented; the user code must deal with this if it + * cares, or it must abort after initialization reports the difference + */ +#define IPATH_USER_SWMINOR 2 + +#define IPATH_USER_SWVERSION ((IPATH_USER_SWMAJOR<<16) | IPATH_USER_SWMINOR) + +#define IPATH_KERN_TYPE 0 + +/* Similarly, this is the kernel version going back to the user. It's slightly + * different, in that we want to tell if the driver was built as part of a + * PathScale release, or from the driver from OpenIB, kernel.org, or a + * standard distribution, for support reasons. The high bit is 0 for + * non-PathScale, and 1 for PathScale-built/supplied. + * + * It's returned by the driver to the user code during initialization + * in the spi_sw_version field of ipath_base_info, so the user code can + * in turn check for compatibility with the kernel. + */ +#define IPATH_KERN_SWVERSION ((IPATH_KERN_TYPE<<31) | IPATH_USER_SWVERSION) + +/* + * This structure is passed to ipath_userinit() to tell the driver where + * user code buffers are, sizes, etc. + */ +struct ipath_user_info { + /* + * version of user software, to detect compatibility issues. + * Should be set to IPATH_USER_SWVERSION. + */ + uint32_t spu_userversion; + + /* desired number of receive header queue entries */ + uint32_t spu_rcvhdrcnt; + + /* + * Leave this much unused space at the start of + * each eager buffer for software use. Similar in effect to + * setting K_Offset to this value. Needs to be 'small', on the + * order of one or two cachelines + */ + uint32_t spu_egrskip; + + /* + * number of words in KD protocol header. + * This tells InfiniPath how many words to copy to rcvhdrq. If 0, + * kernel uses a default. Once set, attempts to set any other value + * are an error (EAGAIN) until driver is reloaded.
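+ *
+ * Illustration only, not part of the original patch: user code might
+ * fill in this structure roughly as follows, with 64 receive header
+ * queue entries as an arbitrary example value, and spu_rcvhdrsize
+ * left 0 to take the kernel default:
+ *
+ *	struct ipath_user_info ui = { 0 };
+ *
+ *	ui.spu_userversion = IPATH_USER_SWVERSION;
+ *	ui.spu_rcvhdrcnt = 64;
+ *	ui.spu_rcvhdrsize = 0;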
+ */ + uint32_t spu_rcvhdrsize; + + /* + * cache line aligned (64 byte) user address to + * which the rcvhdrtail register will be written by infinipath + * whenever it changes, so that no chip registers are read in + * the performance path. + */ + uint64_t spu_rcvhdraddr; + + /* + * ioctl cmd includes struct size, so pad out, + * and adjust down as new fields are added to keep size constant + */ + uint32_t __spu_pad[6]; +} __attribute__ ((aligned(8))); + +struct ipath_iovec { + /* Pointer to data, but same size 32 and 64 bit */ + uint64_t iov_base; + + /* + * Length of data; don't need 64 bits, but want + * ipath_sendpkt to remain same size as before 32 bit changes, so... + */ + uint64_t iov_len; +}; + +/* + * Describes a single packet for send. Each packet can have one or more + * buffers, but the total length (exclusive of IB headers) must be less + * than the MTU, and if using the PIO method, entire packet length, + * including IB headers, must be less than the ipath_piosize value (words). + * Use of this necessitates including sys/uio.h + */ +struct ipath_sendpkt { + uint32_t sps_flags; /* flags for packet (TBD) */ + uint32_t sps_cnt; /* number of entries to use in sps_iov */ + /* array of iov's describing packet. TEMPORARY */ + struct ipath_iovec sps_iov[4]; +}; + +struct _tidupd { /* used only in inlined function for ioctl. */ + uint32_t tidcnt; + uint32_t tid__unused; /* make structure same size in 32 and 64 bit */ + uint64_t tidvaddr; /* virtual address of first page in transfer */ + /* pointer (same size 32/64 bit) to uint16_t tid array */ + uint64_t tidlist; + + /* + * pointer (same size 32/64 bit) to bitmap of TIDs used + * for this call; checked for being large enough at open + */ + uint64_t tidmap; +}; + +struct ipath_setguid { /* set GUID for interface */ + uint64_t sguid; /* in network order */ + uint64_t sunit; /* unit number of interface */ +}; + +/* + * Structure used to send data to and receive data from a diags ioctl. + * + * NOTE: For HT reads and writes, we only support byte, word (16bits) and + * dword (32bits). All other sizes for HT are invalid. + */ +struct ipath_diag_info { + uint64_t _base_offset; /* register to start reading from */ + uint64_t _num_bytes; /* number of bytes to read or write */ + /* + * address in user space. + * for reads, this is the address to store the read result(s). + * for writes, it is the address to get the write data from. + * This memory better be valid in user space! + */ + uint64_t _uaddress; + uint64_t _unit; /* Unit ID of chip we are accessing. */ + uint64_t _pad[15]; +}; + +/* + * Data layout in I2C flash (for GUID, etc.)
+ * All fields are little-endian binary unless otherwise stated + */ +#define IPATH_FLASH_VERSION 1 +struct ipath_flash { + uint8_t if_fversion; /* flash layout version (IPATH_FLASH_VERSION) */ + uint8_t if_csum; /* checksum protecting if_length bytes */ + /* + * valid length (in use, protected by if_csum), including if_fversion + * and if_csum themselves + */ + uint8_t if_length; + uint8_t if_guid[8]; /* the GUID, in network order */ + /* number of GUIDs to use, starting from if_guid */ + uint8_t if_numguid; + char if_serial[12]; /* the board serial number, in ASCII */ + char if_mfgdate[8]; /* board mfg date (YYYYMMDD ASCII) */ + /* last board rework/test date (YYYYMMDD ASCII) */ + char if_testdate[8]; + uint8_t if_errcntp[4]; /* logging of error counts, TBD */ + /* powered on hours, updated at driver unload */ + uint8_t if_powerhour[2]; + char if_comment[32]; /* ASCII free-form comment field */ + uint8_t if_future[50]; /* 78 bytes used, min flash size is 128 bytes */ +}; + +uint8_t ipath_flash_csum(struct ipath_flash *, int); + +/* + * These are the counters implemented in the chip, and are listed in order. + * They are returned in this order by the IPATH_GETCOUNTERS ioctl + */ +struct infinipath_counters { + unsigned long long LBIntCnt; + unsigned long long LBFlowStallCnt; + unsigned long long Reserved1; + unsigned long long TxUnsupVLErrCnt; + unsigned long long TxDataPktCnt; + unsigned long long TxFlowPktCnt; + unsigned long long TxDwordCnt; + unsigned long long TxLenErrCnt; + unsigned long long TxMaxMinLenErrCnt; + unsigned long long TxUnderrunCnt; + unsigned long long TxFlowStallCnt; + unsigned long long TxDroppedPktCnt; + unsigned long long RxDroppedPktCnt; + unsigned long long RxDataPktCnt; + unsigned long long RxFlowPktCnt; + unsigned long long RxDwordCnt; + unsigned long long RxLenErrCnt; + unsigned long long RxMaxMinLenErrCnt; + unsigned long long RxICRCErrCnt; + unsigned long long RxVCRCErrCnt; + unsigned long long RxFlowCtrlErrCnt; + unsigned long long RxBadFormatCnt; + unsigned long long RxLinkProblemCnt; + unsigned long long RxEBPCnt; + unsigned long long RxLPCRCErrCnt; + unsigned long long RxBufOvflCnt; + unsigned long long RxTIDFullErrCnt; + unsigned long long RxTIDValidErrCnt; + unsigned long long RxPKeyMismatchCnt; + unsigned long long RxP0HdrEgrOvflCnt; + unsigned long long RxP1HdrEgrOvflCnt; + unsigned long long RxP2HdrEgrOvflCnt; + unsigned long long RxP3HdrEgrOvflCnt; + unsigned long long RxP4HdrEgrOvflCnt; + unsigned long long RxP5HdrEgrOvflCnt; + unsigned long long RxP6HdrEgrOvflCnt; + unsigned long long RxP7HdrEgrOvflCnt; + unsigned long long RxP8HdrEgrOvflCnt; + unsigned long long Reserved6; + unsigned long long Reserved7; + unsigned long long IBStatusChangeCnt; + unsigned long long IBLinkErrRecoveryCnt; + unsigned long long IBLinkDownedCnt; + unsigned long long IBSymbolErrCnt; +}; + +/* + * The next set of defines are for packet headers, and chip register + * and memory bits that are visible to and/or used by user-mode software + * The other bits that are used only by the driver or diags are in + * ipath_registers.h + */ + +/* RcvHdrFlags bits */ +#define INFINIPATH_RHF_LENGTH_MASK 0x7FF +#define INFINIPATH_RHF_LENGTH_SHIFT 0 +#define INFINIPATH_RHF_RCVTYPE_MASK 0x7 +#define INFINIPATH_RHF_RCVTYPE_SHIFT 11 +#define INFINIPATH_RHF_EGRINDEX_MASK 0x7FF +#define INFINIPATH_RHF_EGRINDEX_SHIFT 16 +#define INFINIPATH_RHF_H_ICRCERR 0x80000000 +#define INFINIPATH_RHF_H_VCRCERR 0x40000000 +#define INFINIPATH_RHF_H_PARITYERR 0x20000000 +#define INFINIPATH_RHF_H_LENERR
0x10000000 +#define INFINIPATH_RHF_H_MTUERR 0x08000000 +#define INFINIPATH_RHF_H_IHDRERR 0x04000000 +#define INFINIPATH_RHF_H_TIDERR 0x02000000 +#define INFINIPATH_RHF_H_MKERR 0x01000000 +#define INFINIPATH_RHF_H_IBERR 0x00800000 +#define INFINIPATH_RHF_L_SWA 0x00008000 +#define INFINIPATH_RHF_L_SWB 0x00004000 + +/* infinipath header fields */ +#define INFINIPATH_I_VERS_MASK 0xF +#define INFINIPATH_I_VERS_SHIFT 28 +#define INFINIPATH_I_PORT_MASK 0xF +#define INFINIPATH_I_PORT_SHIFT 24 +#define INFINIPATH_I_TID_MASK 0x7FF +#define INFINIPATH_I_TID_SHIFT 13 +#define INFINIPATH_I_OFFSET_MASK 0x1FFF +#define INFINIPATH_I_OFFSET_SHIFT 0 + +/* K_PktFlags bits */ +#define INFINIPATH_KPF_INTR 0x1 + +/* SendPIO per-buffer control */ +#define INFINIPATH_SP_LENGTHP1_MASK 0x3FF +#define INFINIPATH_SP_LENGTHP1_SHIFT 0 +#define INFINIPATH_SP_INTR 0x80000000 +#define INFINIPATH_SP_TEST 0x40000000 +#define INFINIPATH_SP_TESTEBP 0x20000000 + +/* SendPIOAvail bits */ +#define INFINIPATH_SENDPIOAVAIL_BUSY_SHIFT 1 +#define INFINIPATH_SENDPIOAVAIL_CHECK_SHIFT 0 + +#endif /* _IPATH_COMMON_H */ diff -r a3a00f637da6 -r 2d9a3f27a10c drivers/infiniband/hw/ipath/ipath_kernel.h --- /dev/null Thu Jan 1 00:00:00 1970 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h Wed Dec 28 14:19:42 2005 -0800 @@ -0,0 +1,697 @@ +#ifndef _IPATH_KERNEL_H +#define _IPATH_KERNEL_H +/* + * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. + */ + + + +/* + * This header file is the base header file for infinipath kernel code + * ipath_user.h serves a similar purpose for user code. 
+ */ + +#include "ipath_common.h" +#include "ipath_kdebug.h" +#include "ipath_registers.h" +#include +#include + +/* only s/w major version of InfiniPath we can handle */ +#define IPATH_CHIP_VERS_MAJ 2U + +#define IPATH_CHIP_VERS_MIN 0U /* don't care about this except printing */ + +extern struct infinipath_stats ipath_stats; /* temporary, maybe always */ + +/* only s/w version of chip we can handle for now */ +#define IPATH_CHIP_SWVERSION IPATH_CHIP_VERS_MAJ + +struct ipath_portdata { + /* minor number of devices, for ipath_type use */ + unsigned port_unit; + /* array of struct page pointers */ + struct page **port_rcvegrbuf_pages; + /* array of virtual addresses (from above) */ + void **port_rcvegrbuf_virt; + void *port_rcvhdrq; /* rcvhdrq base, needs mmap before useful */ + /* kernel virtual address where hdrqtail is updated */ + uint64_t *port_rcvhdrtail_kvaddr; + struct page *port_rcvhdrtail_pagep; /* page * used for uaddr */ + /* + * temp buffer for expected send setup, allocated at open, instead + * of each setup call + */ + void *port_tid_pg_list; + wait_queue_head_t port_wait; /* when waiting for rcv or pioavail */ + /* + * rcvegr bufs base, physical, must fit + * in 44 bits (so 32 bit programs mmap64 44 bit works) + */ + unsigned long port_rcvegr_phys; + /* for mmap of hdrq, must fit in 44 bits */ + unsigned long port_rcvhdrq_phys; + /* + * the actual user address that we locked, so we can + * unlock it at close + */ + unsigned long port_rcvhdrtail_uaddr; + /* + * number of opens on this instance (0 or 1; ignoring forks, dup, + * etc. for now) + */ + int port_cnt; + /* + * how much space to leave at start of eager TID entries for protocol + * use, on each TID + */ + unsigned port_egrskip; + unsigned port_port; /* instead of calculating it */ + uint32_t port_piobufs; /* chip offset of PIO buffers for this port */ + /* how many alloc_pages() chunks in port_rcvegrbuf_pages */ + uint32_t port_rcvegrbuf_chunks; + uint32_t port_rcvegrbufs_perchunk; /* how many egrbufs per chunk */ + /* order used with port_rcvegrbuf_pages */ + uint32_t port_rcvegrbuf_order; + uint32_t port_rcvhdrq_order; /* rcvhdrq order (for free_pages) */ + /* next expected TID to check when looking for free */ + uint32_t port_tidcursor; + /* port state flags (IPATH_PORT_WAITING_* values) */ + uint32_t port_flag; + /* WAIT_RCV that timed out, no interrupt */ + uint32_t port_rcvwait_to; + /* WAIT_PIO that timed out, no interrupt */ + uint32_t port_piowait_to; + uint32_t port_rcvnowait; /* WAIT_RCV already happened, no wait */ + uint32_t port_pionowait; /* WAIT_PIO already happened, no wait */ + uint32_t port_hdrqfull; /* total number of rcvhdrqfull errors */ + pid_t port_pid; /* pid of process using this port */ + /* same size as task_struct .comm[], but no define */ + char port_comm[16]; + uint16_t port_pkeys[4]; /* pkeys set by this use of this port */ +}; + +struct sk_buff; + +/* + * control information for layered drivers. + * This is used only as part of devdata via ipath_layer; + */ +struct _ipath_layer { + int (*l_intr) (const ipath_type, uint32_t); + int (*l_rcv) (const ipath_type, void *, struct sk_buff *); + int (*l_rcv_lid) (const ipath_type, void *); + uint16_t l_rcv_opcode; + uint16_t l_rcv_lid_opcode; +}; + +/* Verbs layer interface */ +struct _verbs_layer { + int (*l_piobufavail) (const ipath_type); + void (*l_rcv) (const ipath_type, void *, void *, uint32_t); + void (*l_timer_cb) (const ipath_type); + struct timer_list l_timer; + unsigned l_flags; +}; + +/* + * These are the fields that only exist for
port 0, not per port, so + * they aren't in ipath_portdata + */ +struct ipath_devdata { + /* driver data structures */ + /* mem-mapped pointer to base of chip regs; should always use read/write{lq} + * when accesses are made, or via memcpy32() for PIO buffers */ + uint64_t __iomem *ipath_kregbase; + /* end of mem-mapped chip space; range checking */ + uint64_t __iomem *ipath_kregend; + /* physical address of chip for io_remap, etc. */ + unsigned long ipath_physaddr; + /* base of memory alloced for ipath_kregbase, for free */ + uint64_t *ipath_kregalloc; + /* + * version of kregbase that doesn't have high bits set (for 32 bit + * programs, so mmap64 44 bit works) + */ + uint64_t __iomem *ipath_kregvirt; + struct ipath_portdata **ipath_pd; /* ipath_cfgports pointers */ + /* sk_buffs used by port 0 eager receive queue */ + struct sk_buff **ipath_port0_skbs; + void __iomem *ipath_piobase; /* kvirt address of 1st pio buffer */ + + /* + * virtual address where port0 rcvhdrqtail updated by chip via DMA; + * volatile because we want to be sure compiler always makes a memory + * reference when we dereference it. + */ + volatile uint64_t *ipath_hdrqtailptr; + /* + * points to area where PIOavail registers will be DMA'ed. Has to + * be on a page of its own, because the page will be mapped into user + * program space. Updated by chip via DMA, treated as readonly by software. + * volatile because we want to be sure compiler always makes a memory + * reference when we dereference it. + */ + volatile uint64_t *ipath_pioavailregs_dma; + + /* original address for kfree */ + volatile uint64_t *__ipath_pioavailregs_base; + /* physical address where updates occur */ + unsigned long ipath_pioavailregs_phys; + struct _ipath_layer ipath_layer; + struct _verbs_layer verbs_layer; + /* total dwords sent (summed from counter) */ + uint64_t ipath_sword; + /* total dwords received (summed from counter) */ + uint64_t ipath_rword; + /* total packets sent (summed from counter) */ + uint64_t ipath_spkts; + /* total packets received (summed from counter) */ + uint64_t ipath_rpkts; + /* to make the receive interrupt failsafe */ + uint64_t ipath_lastqtail; + uint64_t _ipath_status; /* ipath_statusp initially points to this. */ + uint64_t ipath_guid; /* GUID for this interface, in network order */ + /* + * aggregate of error bits reported since + * last cleared, for limiting of error reporting + */ + uint64_t ipath_lasterror; + /* + * aggregate of error bits reported + * since last cleared, for limiting of hwerror reporting + */ + uint64_t ipath_lasthwerror; + /* + * errors masked because they occur too fast, + * also includes errors that are always ignored (ipath_ignorederrs) + */ + uint64_t ipath_maskederrs; + /* time at which to re-enable maskederrs */ + cycles_t ipath_unmasktime; + /* + * errors always ignored (masked), at least + * for a given chip/device, because they are wrong or not useful + */ + uint64_t ipath_ignorederrs; + /* count of egrfull errors, combined for all ports */ + uint64_t ipath_last_tidfull; + uint64_t ipath_lastport0rcv_cnt; /* for ipath_qcheck() */ + + uint32_t ipath_kregsize; /* size of memory at ipath_kregbase */ + /* number of registers used for pioavail */ + uint32_t ipath_pioavregs; + uint32_t ipath_flags; /* IPATH_POLL, etc. */ + /* ipath_flags sma is waiting for */ + uint32_t ipath_sma_state_wanted; + /* last buffer for user use, first buf for kernel use is this index.
*/ + uint32_t ipath_lastport_piobuf; + uint32_t pci_registered; /* driver is a registered pci device */ + uint32_t ipath_stats_timer_active; /* is a stats timer active */ + /* dwords sent read from infinipath counter */ + uint32_t ipath_lastsword; + /* dwords received read from infinipath counter */ + uint32_t ipath_lastrword; + /* sent packets read from infinipath counter */ + uint32_t ipath_lastspkts; + /* received packets read from infinipath counter */ + uint32_t ipath_lastrpkts; + uint32_t ipath_pbufsport; /* pio bufs allocated per port */ + /* + * number of ports configured as max; zero is + * set to number chip supports, less gives more pio bufs/port, etc. + */ + uint32_t ipath_cfgports; + /* our idea of the port0 rcvhdrq head offset */ + uint32_t ipath_port0head; + uint32_t ipath_p0_hdrqfull; /* count of port 0 hdrqfull errors */ + + /* + * (*cfgports) used to suppress multiple instances of same port + * staying stuck at same point + */ + uint32_t *ipath_lastrcvhdrqtails; + /* + * (*cfgports) used to suppress multiple instances of same port + * staying stuck at same point + */ + uint32_t *ipath_lastegrheads; + /* + * index of last piobuffer we used. Speeds up searching, by starting + * at this point. Doesn't matter if multiple cpu's use and update, + * last updater is only write that matters. Whenever it wraps, + * we update shadow copies. Need a copy per device when we get to + * multiple devices + */ + uint32_t ipath_lastpioindex; + uint32_t ipath_freezelen; /* max length of freezemsg */ + uint32_t ipath_consec_nopiobuf; /* consecutive times we wanted a PIO buffer + * but were unable to get one */ + uint32_t ipath_upd_pio_shadow; /* hint that we should update + * ipath_pioavailshadow before looking for a PIO buffer */ + uint32_t ipath_nosma_bufs; /* sequential tries for SMA send and no bufs */ + uint32_t ipath_nosma_secs; /* duration (seconds) ipath_nosma_bufs set */ + /* HT/PCI Vendor ID (here for NodeInfo) */ + uint16_t ipath_vendorid; + /* HT/PCI Device ID (here for NodeInfo) */ + uint16_t ipath_deviceid; + /* offset in HT config space of slave/primary interface block */ + uint8_t ipath_ht_slave_off; + int ipath_mtrr; /* registration handle for WRCOMB setting on */ + /* ref count of how many users set each pkey */ + atomic_t ipath_pkeyrefs[4]; + /* shadow copy of all exptids physaddr; used only by funcsim */ + uint64_t *ipath_tidsimshadow; + /* shadow copy of struct page *'s for exp tid pages */ + struct page **ipath_pageshadow; + /* + * IPATH_STATUS_* + * this address is mapped readonly into user processes so they can + * get status cheaply, whenever they want. + */ + uint64_t *ipath_statusp; + char *ipath_freezemsg; /* freeze msg if hw error put chip in freeze */ + struct pci_dev *pcidev; /* pci access data structure */ + /* timer used to prevent stats overflow, error throttling, etc. */ + struct timer_list ipath_stats_timer; + /* only allow one interrupt at a time. 
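+ *
+ * Illustration only, not part of the original patch (dd as a
+ * struct ipath_devdata pointer and t its unit are assumptions): a
+ * receive path could enforce this with the usual atomic bit idiom,
+ * bailing out when a receive is already being processed:
+ *
+ *	if (test_and_set_bit(0, &dd->ipath_rcv_pending))
+ *		return;
+ *	ipath_kreceive(t);
+ *	clear_bit(0, &dd->ipath_rcv_pending);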
*/ + unsigned long ipath_rcv_pending; + + /* + * shadow copies of registers; size indicates read access size + * Most of them are readonly, but some are write-only registers, where + * we manipulate the bits in the shadow copy, and then write the shadow + * copy to infinipath. + * We deliberately make most of these 32 bits, since they have + * restricted range and for any that we read, we want to generate + * 32 bit accesses, since Opteron will generate 2 separate 32 bit + * HT transactions for a 64 bit read, and we want to avoid unnecessary + * HT transactions + */ + + /* This is the 64 bit group */ + /* + * shadow of pioavail, check to be sure it's large enough at + * init time. + */ + uint64_t ipath_pioavailshadow[8]; + uint64_t ipath_gpio_out; /* shadow of kr_gpio_out, for rmw ops */ + /* kr_revision value (also see ipath_majrev) */ + uint64_t ipath_revision; + /* shadow of ibcctrl, for interrupt handling of link changes, etc. */ + uint64_t ipath_ibcctrl; + /* + * last ibcstatus, to suppress "duplicate" status change messages, + * mostly from 2 to 3 + */ + uint64_t ipath_lastibcstat; + /* mask of hardware errors that are enabled */ + uint64_t ipath_hwerrmask; + uint64_t ipath_extctrl; /* shadow the gpio output contents */ + + /* these are the "32 bit" regs */ + /* + * number of GUIDs in the flash for this interface; may need some + * rethinking for setting on other ifaces + */ + uint32_t ipath_nguid; + uint32_t ipath_rcvctrl; /* shadow kr_rcvctrl */ + uint32_t ipath_sendctrl; /* shadow kr_sendctrl */ + uint32_t ipath_rcvhdrcnt; /* value we put in kr_rcvhdrcnt */ + uint32_t ipath_rcvhdrsize; /* value we put in kr_rcvhdrsize */ + uint32_t ipath_rcvhdrentsize; /* value we put in kr_rcvhdrentsize */ + /* byte offset of last entry in rcvhdrq */ + uint32_t ipath_hdrqlast; + uint32_t ipath_portcnt; /* kr_portcnt value */ + uint32_t ipath_palign; /* kr_pagealign value */ + uint32_t ipath_piobcnt; /* kr_sendpiobufcnt value */ + uint32_t ipath_piobufbase; /* kr_sendpiobufbase value */ + uint32_t ipath_piosize; /* kr_sendpiosize */ + uint32_t ipath_rcvegrbase; /* kr_rcvegrbase value */ + uint32_t ipath_rcvegrcnt; /* kr_rcvegrcnt value */ + uint32_t ipath_rcvtidbase; /* kr_rcvtidbase value */ + uint32_t ipath_rcvtidcnt; /* kr_rcvtidcnt value */ + uint32_t ipath_sregbase; /* kr_sendregbase */ + uint32_t ipath_uregbase; /* kr_userregbase */ + uint32_t ipath_cregbase; /* kr_counterregbase */ + uint32_t ipath_control; /* shadow the control register contents */ + uint32_t ipath_pcirev; /* PCI revision register (HTC rev on FPGA) */ + + uint32_t ipath_ibmtu; /* The MTU programmed for this unit */ + /* + * The max size IB packet, including IB headers, that we can send. + * Starts same as ipath_piosize, but is affected when ibmtu is + * changed, or by size of eager buffers + */ + uint32_t ipath_ibmaxlen; + /* + * ibmaxlen at init time, limited by chip and by receive buffer size. + * Not changed after init. + */ + uint32_t ipath_init_ibmaxlen; + /* size we allocate for each rcvegrbuffer */ + uint32_t ipath_rcvegrbufsize; + uint32_t ipath_htwidth; /* width (2,4,8,16,32) from HT config reg */ + uint32_t ipath_htspeed; /* HT speed (200,400,800,1000) from HT config */ + /* bitmap of ports waiting for PIO avail intr */ + uint32_t ipath_portpiowait; + /* + * number of sequential ibcstatus changes for polling active/quiet + * (i.e., link not coming up).
+ */ + uint32_t ipath_ibpollcnt; + uint16_t ipath_mlid; /* MLID programmed for this instance */ + uint16_t ipath_lid; /* LID programmed for this instance */ + /* list of pkeys programmed; 0 means not set */ + uint16_t ipath_pkeys[4]; + uint8_t ipath_serial[12]; /* ASCII serial number, from flash */ + uint8_t ipath_majrev; /* chip major rev, from ipath_revision */ + uint8_t ipath_minrev; /* chip minor rev, from ipath_revision */ + uint8_t ipath_boardrev; /* board rev, from ipath_revision */ + uint8_t ipath_unit; /* Unit number for this chip */ +}; + +/* + * A segment is a linear region of low physical memory. + * XXX Maybe we should use phys addr here and kmap()/kunmap() + * Used by the verbs layer. + */ +struct ipath_seg { + void *vaddr; + size_t length; +}; + +/* The number of ipath_segs that fit in a page. */ +#define IPATH_SEGSZ (PAGE_SIZE / sizeof (struct ipath_seg)) + +struct ipath_segarray { + struct ipath_seg segs[IPATH_SEGSZ]; +}; + +/* + * Used by the verbs layer. + */ +struct ipath_mregion { + uint64_t user_base; /* User's address for this region */ + uint64_t iova; /* IB start address of this region */ + size_t length; + uint32_t lkey; + uint32_t offset; /* offset (bytes) to start of region */ + int access_flags; + uint32_t max_segs; /* number of ipath_segs in all the arrays */ + uint32_t mapsz; /* size of the map array */ + struct ipath_segarray *map[0]; /* the segments */ +}; + +/* + * These keep track of the copy progress within a memory region. + * Used by the verbs layer. + */ +struct ipath_sge { + struct ipath_mregion *mr; + void *vaddr; /* current pointer into the segment */ + uint32_t sge_length; /* length of the SGE */ + uint32_t length; /* remaining length of the segment */ + uint16_t m; /* current index: mr->map[m] */ + uint16_t n; /* current index: mr->map[m]->segs[n] */ +}; + +struct ipath_sge_state { + struct ipath_sge *sg_list; /* next SGE to be used if any */ + struct ipath_sge sge; /* progress state for the current SGE */ + uint8_t num_sge; +}; + +extern struct ipath_devdata devdata[]; +#define IPATH_UNIT(p) ((p)-devdata) +extern const uint32_t infinipath_max; /* number of units (chips) supported */ +extern const char *ipath_minor_names[]; + +extern int ipath_diags_enabled; /* is diags mode enabled? */ + +/* clean up any per-chip chip-specific stuff */ +void ipath_chip_cleanup(struct ipath_devdata *); +void ipath_chip_done(void); /* clean up any chip type-specific stuff */ +void ipath_handle_hwerrors(const ipath_type, char *, int); +int ipath_validate_rev(struct ipath_devdata *); +void ipath_clear_init_hwerrs(const ipath_type); + +/* + * This is here to simplify compatibility with source that supports + * multiple chip types + */ +void ipath_ht_get_boardname(const ipath_type t, char *name, size_t namelen); + +/* these are primarily for SMA, but are also used by diags */ +int ipath_send_smapkt(struct ipath_sendpkt __user *); + +int ipath_wait_linkstate(const ipath_type, uint32_t, int); +void ipath_down_link(const ipath_type); +void ipath_set_ib_lstate(const ipath_type, int); +void ipath_kreceive(const ipath_type); +int ipath_setrcvhdrsize(const ipath_type, unsigned); + +/* for use in system calls, where we want to know device type, etc. 
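+ *
+ * Illustration only, not part of the original patch: port_fp() below
+ * yields NULL when ->private_data holds a small cookie (<= 255, e.g.
+ * an SMA or control minor) instead of a real struct ipath_portdata
+ * pointer, so a system call handler can start with:
+ *
+ *	struct ipath_portdata *pd = port_fp(fp);
+ *
+ *	if (!pd)
+ *		return -EINVAL;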
*/ +#define port_fp(fp) (((fp)->private_data>(void*)255UL)?((struct ipath_portdata *)fp->private_data):NULL) + +/* + * values for ipath_flags + */ +#define IPATH_INITTED 0x2 /* The chip is up and initted */ +#define IPATH_RCVHDRSZ_SET 0x4 /* set if any user code has set kr_rcvhdrsize */ +/* The chip is present and valid for accesses */ +#define IPATH_PRESENT 0x8 +/* HT link0 is only 8 bits wide, ignore upper byte crc errors, etc. */ +#define IPATH_8BIT_IN_HT0 0x10 +/* HT link1 is only 8 bits wide, ignore upper byte crc errors, etc. */ +#define IPATH_8BIT_IN_HT1 0x20 +/* The link is down (or not yet up 0x11 or earlier) */ +#define IPATH_LINKDOWN 0x40 +#define IPATH_LINKINIT 0x80 /* The link level is up (0x11) */ +/* The link is in the armed (0x21) state */ +#define IPATH_LINKARMED 0x100 +/* The link is in the active (0x31) state */ +#define IPATH_LINKACTIVE 0x200 +/* The link was taken down, but no interrupt yet */ +#define IPATH_LINKUNK 0x400 +/* link being moved to armed (0x21) state */ +#define IPATH_LINK_TOARMED 0x800 +/* link being moved to active (0x31) state */ +#define IPATH_LINK_TOACTIVE 0x1000 +/* linkinit cmd is SLEEP, move to POLL */ +#define IPATH_LINK_SLEEPING 0x2000 +/* no IB cable, or no device on IB cable */ +#define IPATH_NOCABLE 0x4000 +/* Supports port zero per packet receive interrupts via GPIO */ +#define IPATH_GPIO_INTR 0x8000 + +/* portdata flag values */ +#define IPATH_PORT_WAITING_RCV 0x4 /* waiting for a packet to arrive */ +/* waiting for a PIO buffer to be available */ +#define IPATH_PORT_WAITING_PIO 0x8 + +int ipath_init_chip(const ipath_type); +/* free up any allocated data at closes */ +void ipath_free_data(struct ipath_portdata *dd); +void ipath_init_picotime(void); /* init cycles to picosecs conversion */ +int ipath_bringup_serdes(const ipath_type); +int ipath_waitfor_mdio_cmdready(const ipath_type); +int ipath_waitfor_complete(const ipath_type, ipath_kreg, uint64_t, uint64_t *); +void ipath_quiet_serdes(const ipath_type); +void ipath_get_boardname(uint8_t, char *, size_t); +uint32_t __iomem *ipath_getpiobuf(int, uint32_t *); +int ipath_bufavail(int); +int ipath_rd_eeprom(const ipath_type port_unit, + struct ipath_eeprom_req __user *); +uint64_t ipath_snap_cntr(const ipath_type, ipath_creg); + +/* + * these should be somewhat dynamic someday, although they are fixed + * for all users of the device on any given load. + */ +/* (words) room for all IB headers and KD proto header */ +#define IPATH_RCVHDRENTSIZE 16 +/* + * 64K, which is about all you can hope to get contiguous. API allows + * users to request a size, for now I'm ignoring that. + */ +#define IPATH_RCVHDRCNT 1024 + +/* + * number of words in KD protocol header if not set by ipath_userinit(); + * this uses the full 64 bytes of rcvhdrentry + */ +#define IPATH_DFLT_RCVHDRSIZE 9 + +#define IPATH_MDIO_CMD_WRITE 1 +#define IPATH_MDIO_CMD_READ 2 +#define IPATH_MDIO_CLD_DIV 25 /* to get 2.5 Mhz mdio clock */ +#define IPATH_MDIO_CMDVALID 0x40000000 /* bit 30 */ +#define IPATH_MDIO_DATAVALID 0x80000000 /* bit 31 */ +#define IPATH_MDIO_CTRL_STD 0x0 + +#define IPATH_MDIO_REQ(cmd,dev,reg,data) ( (((uint64_t)IPATH_MDIO_CLD_DIV) << 32) | \ + ((cmd) << 26) | ((dev)<<21) | ((reg) << 16) | ((data) & 0xFFFF)) + +#define IPATH_MDIO_CTRL_XGXS_REG_8 0x8 /* signal and fifo status, in bank 31 */ + +/* controls loopback, redundancy */ +#define IPATH_MDIO_CTRL_8355_REG_1 0x10 +#define IPATH_MDIO_CTRL_8355_REG_2 0x11 /* premph, encdec, etc. */ +#define IPATH_MDIO_CTRL_8355_REG_6 0x15 /* Kchars, etc. 
*/ +#define IPATH_MDIO_CTRL_8355_REG_9 0x18 +#define IPATH_MDIO_CTRL_8355_REG_10 0x1D + +/* + * ipath_get_upages() is used to pin an address range (if not already pinned), + * and optionally return the list of physical addresses. + * ipath_putpages() does the obvious, and ipath_upages_cleanup() cleans up all + * private memory, used at driver unload. + * ipath_get_upages_nocopy() is similar to ipath_get_upages(), but pins only + * 1 page, and marks the vm so the page isn't taken away on a fork. + */ +int ipath_get_upages(unsigned long, size_t, struct page **); +int ipath_get_upages_nocopy(unsigned long, struct page **); +void ipath_putpages(size_t, struct page **); +void ipath_upages_cleanup(struct ipath_portdata *); +int ipath_eeprom_read(const ipath_type, uint8_t, void *, int); +int ipath_eeprom_write(const ipath_type, uint8_t, void *, int); + +/* these are used for the registers that vary with port */ +void ipath_kput_kreg_port(const ipath_type, ipath_kreg, unsigned, uint64_t); +uint64_t ipath_kget_kreg64_port(const ipath_type, ipath_kreg, unsigned); + +/* + * we could have a single register get/put routine that takes a group + * type, but this is somewhat clearer and cleaner. It also gives us some + * error checking. 64 bit register reads should always work, but are + * inefficient on opteron (the northbridge always generates 2 separate + * HT 32 bit reads), so we use kreg32 wherever possible. + * User register and counter register reads are always 32 bit reads, so only + * one form of those routines. + */ + +/* + * At the moment, none of the s-registers are writable, so no ipath_kput_sreg() + * At the moment, none of the c-registers are writable, so no ipath_kput_creg() + */ + +/* + * return the contents of a register that is virtualized to be per port. + * Returns 0 if the chip isn't mapped (not distinguishable from + * valid contents at runtime; we may add a separate error variable at some + * point). + * This is normally not used by the kernel, but may be for debugging, + * and has a different implementation than user mode, which is why + * it's not in _common.h + */ +static inline uint32_t ipath_kget_ureg32(const ipath_type stype, + ipath_ureg regno, int port) +{ + if (!devdata[stype].ipath_kregbase) + return 0; + + return readl(regno + (uint64_t __iomem *) + (devdata[stype].ipath_uregbase + + (char __iomem *) devdata[stype].ipath_kregbase + + devdata[stype].ipath_palign * port)); +} + +/* + * change the contents of a register that is virtualized to be per port; + * silently does nothing if the chip isn't mapped.
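+ *
+ * Illustration only, not part of the original patch (t, head and port
+ * stand for the usual unit, queue offset and port arguments): e.g.
+ * after processing receive headers for a port, the driver could
+ * advance the head register with
+ *
+ *	ipath_kput_ureg(t, ur_rcvhdrhead, head, port);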
+ */ +static inline void ipath_kput_ureg(const ipath_type stype, ipath_ureg regno, + uint64_t value, int port) +{ + uint64_t __iomem *ubase; + + ubase = (uint64_t __iomem *) + (devdata[stype].ipath_uregbase + + (char __iomem *) devdata[stype].ipath_kregbase + + devdata[stype].ipath_palign * port); + if (devdata[stype].ipath_kregbase) + writeq(value, &ubase[regno]); +} + +static inline uint32_t ipath_kget_kreg32(const ipath_type stype, + ipath_kreg regno) +{ + if (!devdata[stype].ipath_kregbase) + return -1; + return readl((uint32_t __iomem *) &devdata[stype].ipath_kregbase[regno]); +} + +static inline uint64_t ipath_kget_kreg64(const ipath_type stype, + ipath_kreg regno) +{ + if (!devdata[stype].ipath_kregbase) + return -1; + + return readq(&devdata[stype].ipath_kregbase[regno]); +} + +static inline void ipath_kput_kreg(const ipath_type stype, + ipath_kreg regno, uint64_t value) +{ + if (devdata[stype].ipath_kregbase) + writeq(value, &devdata[stype].ipath_kregbase[regno]); +} + +static inline uint32_t ipath_kget_creg32(const ipath_type stype, + ipath_creg regno) +{ + if (!devdata[stype].ipath_kregbase) + return 0; + return readl(regno + (uint64_t __iomem *) + (devdata[stype].ipath_cregbase + + (char __iomem *) devdata[stype].ipath_kregbase)); +} + +/* + * caddr is the destination chip address (full pointer, not offset), + * val is the qword to write there. We only handle a single qword (8 bytes). + * This is not used for copies to the PIO buffer, just TID updates, etc. + * This function localizes all chip mem (as opposed to register) writes. + */ +static inline void ipath_kput_memq(const ipath_type stype, + uint64_t __iomem *caddr, uint64_t val) +{ + if (devdata[stype].ipath_kregbase) + writeq(val, caddr); +} + + +#endif /* _IPATH_KERNEL_H */ diff -r a3a00f637da6 -r 2d9a3f27a10c drivers/infiniband/hw/ipath/ipath_layer.h --- /dev/null Thu Jan 1 00:00:00 1970 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_layer.h Wed Dec 28 14:19:42 2005 -0800 @@ -0,0 +1,134 @@ +/* + * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE.
+ * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. + */ + +#ifndef _IPATH_LAYER_H +#define _IPATH_LAYER_H + +/* + * This header file is for symbols shared between the infinipath driver + * and drivers layered upon it (such as ipath). + */ + +struct sk_buff; +struct ipath_sge_state; + +struct ipath_layer_counters { + uint64_t symbol_error_counter; + uint64_t link_error_recovery_counter; + uint64_t link_downed_counter; + uint64_t port_rcv_errors; + uint64_t port_rcv_remphys_errors; + uint64_t port_xmit_discards; + uint64_t port_xmit_data; + uint64_t port_rcv_data; + uint64_t port_xmit_packets; + uint64_t port_rcv_packets; +}; + +int ipath_layer_register(const ipath_type device, + int (*l_intr) (const ipath_type, uint32_t), + int (*l_rcv) (const ipath_type, void *, + struct sk_buff *), + uint16_t rcv_opcode, + int (*l_rcv_lid) (const ipath_type, void *), + uint16_t rcv_lid_opcode); +int ipath_verbs_register(const ipath_type device, + int (*l_piobufavail) (const ipath_type device), + void (*l_rcv) (const ipath_type device, + void *rhdr, void *data, + uint32_t tlen), + void (*l_timer_cb) (const ipath_type device)); +void ipath_verbs_unregister(const ipath_type device); +int ipath_layer_open(const ipath_type device, uint32_t * pktmax); +uint16_t ipath_layer_get_lid(const ipath_type device); +int ipath_layer_get_mac(const ipath_type device, uint8_t *); +uint16_t ipath_layer_get_bcast(const ipath_type device); +int ipath_layer_get_num_of_dev(void); +int ipath_layer_get_cr_errpkey(const ipath_type device); +int ipath_kset_linkstate(uint32_t arg); +int ipath_kset_mtu(uint32_t); +void ipath_set_sps_lid(const ipath_type, uint32_t); +void ipath_layer_close(const ipath_type device); +int ipath_layer_send(const ipath_type device, void *hdr, void *data, + uint32_t datalen); +int ipath_verbs_send(const ipath_type device, uint32_t hdrwords, + uint32_t *hdr, uint32_t len, + struct ipath_sge_state *ss); +int ipath_layer_send_skb(struct copy_data_s *cdata); +void ipath_layer_set_piointbufavail_int(const ipath_type device); +void ipath_get_boardname(const ipath_type, char *name, size_t namelen); +void ipath_layer_snapshot_counters(const ipath_type t, uint64_t * swords, + uint64_t * rwords, uint64_t * spkts, uint64_t * rpkts); +void ipath_layer_get_counters(const ipath_type device, + struct ipath_layer_counters *cntrs); +void ipath_layer_want_buffer(const ipath_type t); +int ipath_layer_set_guid(const ipath_type t, uint64_t guid); +uint64_t ipath_layer_get_guid(const ipath_type t); +uint32_t ipath_layer_get_nguid(const ipath_type t); +int ipath_layer_query_device(const ipath_type t, uint32_t * vendor, + uint32_t * boardrev, uint32_t * majrev, + uint32_t * minrev); +uint32_t ipath_layer_get_flags(const ipath_type t); +struct device *ipath_layer_get_pcidev(const ipath_type t); +uint16_t ipath_layer_get_deviceid(const ipath_type t); +uint64_t ipath_layer_get_lastibcstat(const ipath_type t); +uint32_t ipath_layer_get_ibmtu(const ipath_type t); +void ipath_layer_enable_timer(const ipath_type t); +void ipath_layer_disable_timer(const ipath_type t); +unsigned ipath_verbs_get_flags(const ipath_type device); +void ipath_verbs_set_flags(const ipath_type device, unsigned flags); +unsigned ipath_layer_get_npkeys(const ipath_type device); +unsigned ipath_layer_get_pkey(const ipath_type device, unsigned index); +void ipath_layer_get_pkeys(const ipath_type device, uint16_t *pkeys); +int ipath_layer_set_pkeys(const 
ipath_type device, uint16_t *pkeys); +int ipath_layer_get_linkdowndefaultstate(const ipath_type device); +int ipath_layer_set_linkdowndefaultstate(const ipath_type device, int sleep); +int ipath_layer_get_phyerrthreshold(const ipath_type device); +int ipath_layer_set_phyerrthreshold(const ipath_type device, unsigned n); +int ipath_layer_get_overrunthreshold(const ipath_type device); +int ipath_layer_set_overrunthreshold(const ipath_type device, unsigned n); + +/* ipath_ether interrupt values */ +#define IPATH_LAYER_INT_IF_UP 0x2 +#define IPATH_LAYER_INT_IF_DOWN 0x4 +#define IPATH_LAYER_INT_LID 0x8 +#define IPATH_LAYER_INT_SEND_CONTINUE 0x10 +#define IPATH_LAYER_INT_BCAST 0x40 + +/* _verbs_layer.l_flags */ +#define IPATH_VERBS_KERNEL_SMA 0x1 + +#endif /* _IPATH_LAYER_H */ diff -r a3a00f637da6 -r 2d9a3f27a10c drivers/infiniband/hw/ipath/ipath_registers.h --- /dev/null Thu Jan 1 00:00:00 1970 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_registers.h Wed Dec 28 14:19:42 2005 -0800 @@ -0,0 +1,355 @@ +/* + * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. + */ + +#ifndef _IPATH_REGISTERS_H +#define _IPATH_REGISTERS_H + +/* + * This file should only be included by kernel source, and by the diags. + * It defines the registers, and their contents, for the InfiniPath HT-400 chip + */ + +/* + * These are the InfiniPath register and buffer bit definitions, + * that are visible to software, and needed only by the kernel + * and diag code. A few, that are visible to protocol and user + * code are in ipath_common.h. 
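+ *
+ * Illustration only, not part of the original patch: fields in these
+ * registers are extracted with the usual shift-and-mask idiom, e.g.
+ * the major chip revision from kr_revision, using the accessor from
+ * ipath_kernel.h with t as a unit number:
+ *
+ *	uint64_t rev = ipath_kget_kreg64(t, kr_revision);
+ *	unsigned majrev = (rev >> INFINIPATH_R_CHIPREVMAJOR_SHIFT) &
+ *		INFINIPATH_R_CHIPREVMAJOR_MASK;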
+ * Some bits are specific + * to a given chip implementation, and have been moved to the + * chip-specific source file + */ + +/* kr_revision bits */ +#define INFINIPATH_R_CHIPREVMINOR_MASK 0xFF +#define INFINIPATH_R_CHIPREVMINOR_SHIFT 0 +#define INFINIPATH_R_CHIPREVMAJOR_MASK 0xFF +#define INFINIPATH_R_CHIPREVMAJOR_SHIFT 8 +#define INFINIPATH_R_ARCH_MASK 0xFF +#define INFINIPATH_R_ARCH_SHIFT 16 +#define INFINIPATH_R_SOFTWARE_MASK 0xFF +#define INFINIPATH_R_SOFTWARE_SHIFT 24 +#define INFINIPATH_R_BOARDID_MASK 0xFF +#define INFINIPATH_R_BOARDID_SHIFT 32 + +/* kr_control bits */ +#define INFINIPATH_C_FREEZEMODE 0x00000002 +#define INFINIPATH_C_LINKENABLE 0x00000004 + +/* kr_sendctrl bits */ +#define INFINIPATH_S_DISARMPIOBUF_SHIFT 16 +#define INFINIPATH_S_ABORT 0x00000001U +#define INFINIPATH_S_PIOINTBUFAVAIL 0x00000002U +#define INFINIPATH_S_PIOBUFAVAILUPD 0x00000004U +#define INFINIPATH_S_PIOENABLE 0x00000008U +#define INFINIPATH_S_DISARM 0x80000000U + +/* kr_rcvctrl bits */ +#define INFINIPATH_R_PORTENABLE_SHIFT 0 +#define INFINIPATH_R_INTRAVAIL_SHIFT 16 +#define INFINIPATH_R_TAILUPD 0x80000000 + +/* kr_intstatus, kr_intclear, kr_intmask bits */ +#define INFINIPATH_I_RCVURG_SHIFT 0 +#define INFINIPATH_I_RCVAVAIL_SHIFT 12 +#define INFINIPATH_I_ERROR 0x80000000 +#define INFINIPATH_I_SPIOSENT 0x40000000 +#define INFINIPATH_I_SPIOBUFAVAIL 0x20000000 +#define INFINIPATH_I_GPIO 0x10000000 + +/* kr_errorstatus, kr_errorclear, kr_errormask bits */ +#define INFINIPATH_E_RFORMATERR 0x0000000000000001ULL +#define INFINIPATH_E_RVCRC 0x0000000000000002ULL +#define INFINIPATH_E_RICRC 0x0000000000000004ULL +#define INFINIPATH_E_RMINPKTLEN 0x0000000000000008ULL +#define INFINIPATH_E_RMAXPKTLEN 0x0000000000000010ULL +#define INFINIPATH_E_RLONGPKTLEN 0x0000000000000020ULL +#define INFINIPATH_E_RSHORTPKTLEN 0x0000000000000040ULL +#define INFINIPATH_E_RUNEXPCHAR 0x0000000000000080ULL +#define INFINIPATH_E_RUNSUPVL 0x0000000000000100ULL +#define INFINIPATH_E_REBP 0x0000000000000200ULL +#define INFINIPATH_E_RIBFLOW 0x0000000000000400ULL +#define INFINIPATH_E_RBADVERSION 0x0000000000000800ULL +#define INFINIPATH_E_RRCVEGRFULL 0x0000000000001000ULL +#define INFINIPATH_E_RRCVHDRFULL 0x0000000000002000ULL +#define INFINIPATH_E_RBADTID 0x0000000000004000ULL +#define INFINIPATH_E_RHDRLEN 0x0000000000008000ULL +#define INFINIPATH_E_RHDR 0x0000000000010000ULL +#define INFINIPATH_E_RIBLOSTLINK 0x0000000000020000ULL +#define INFINIPATH_E_SMINPKTLEN 0x0000000020000000ULL +#define INFINIPATH_E_SMAXPKTLEN 0x0000000040000000ULL +#define INFINIPATH_E_SUNDERRUN 0x0000000080000000ULL +#define INFINIPATH_E_SPKTLEN 0x0000000100000000ULL +#define INFINIPATH_E_SDROPPEDSMPPKT 0x0000000200000000ULL +#define INFINIPATH_E_SDROPPEDDATAPKT 0x0000000400000000ULL +#define INFINIPATH_E_SPIOARMLAUNCH 0x0000000800000000ULL +#define INFINIPATH_E_SUNEXPERRPKTNUM 0x0000001000000000ULL +#define INFINIPATH_E_SUNSUPVL 0x0000002000000000ULL +#define INFINIPATH_E_IBSTATUSCHANGED 0x0001000000000000ULL +#define INFINIPATH_E_INVALIDADDR 0x0002000000000000ULL +#define INFINIPATH_E_RESET 0x0004000000000000ULL +#define INFINIPATH_E_HARDWARE 0x0008000000000000ULL + +/* kr_hwerrclear, kr_hwerrmask, kr_hwerrstatus bits */ +#define INFINIPATH_HWE_HTCMEMPARITYERR_SHIFT 0 +#define INFINIPATH_HWE_TXEMEMPARITYERR_MASK 0xFULL +#define INFINIPATH_HWE_TXEMEMPARITYERR_SHIFT 40 +#define INFINIPATH_HWE_RXEMEMPARITYERR_MASK 0x7FULL +#define INFINIPATH_HWE_RXEMEMPARITYERR_SHIFT 44 +#define INFINIPATH_HWE_HTCBUSTREQPARITYERR 0x0000000080000000ULL +#define
INFINIPATH_HWE_HTCBUSTRESPPARITYERR 0x0000000100000000ULL +#define INFINIPATH_HWE_HTCBUSIREQPARITYERR 0x0000000200000000ULL +#define INFINIPATH_HWE_RXDSYNCMEMPARITYERR 0x0000000400000000ULL +#define INFINIPATH_HWE_SERDESPLLFAILED 0x2000000000000000ULL +#define INFINIPATH_HWE_IBCBUSTOSPCPARITYERR 0x4000000000000000ULL +#define INFINIPATH_HWE_IBCBUSFRSPCPARITYERR 0x8000000000000000ULL + +/* kr_hwdiagctrl bits */ +#define INFINIPATH_DC_FORCEHTCENABLE 0x20 +#define INFINIPATH_DC_FORCEHTCMEMPARITYERR_MASK 0x3FULL +#define INFINIPATH_DC_FORCEHTCMEMPARITYERR_SHIFT 0 +#define INFINIPATH_DC_FORCETXEMEMPARITYERR_MASK 0xFULL +#define INFINIPATH_DC_FORCETXEMEMPARITYERR_SHIFT 40 +#define INFINIPATH_DC_FORCERXEMEMPARITYERR_MASK 0x7FULL +#define INFINIPATH_DC_FORCERXEMEMPARITYERR_SHIFT 44 +#define INFINIPATH_DC_FORCEHTCBUSTREQPARITYERR 0x0000000080000000ULL +#define INFINIPATH_DC_FORCEHTCBUSTRESPPARITYERR 0x0000000100000000ULL +#define INFINIPATH_DC_FORCEHTCBUSIREQPARITYERR 0x0000000200000000ULL +#define INFINIPATH_DC_FORCERXDSYNCMEMPARITYERR 0x0000000400000000ULL +#define INFINIPATH_DC_COUNTERDISABLE 0x1000000000000000ULL +#define INFINIPATH_DC_COUNTERWREN 0x2000000000000000ULL +#define INFINIPATH_DC_FORCEIBCBUSTOSPCPARITYERR 0x4000000000000000ULL +#define INFINIPATH_DC_FORCEIBCBUSFRSPCPARITYERR 0x8000000000000000ULL + +/* kr_ibcctrl bits */ +#define INFINIPATH_IBCC_FLOWCTRLPERIOD_MASK 0xFFULL +#define INFINIPATH_IBCC_FLOWCTRLPERIOD_SHIFT 0 +#define INFINIPATH_IBCC_FLOWCTRLWATERMARK_MASK 0xFFULL +#define INFINIPATH_IBCC_FLOWCTRLWATERMARK_SHIFT 8 +#define INFINIPATH_IBCC_LINKINITCMD_MASK 0x3ULL +#define INFINIPATH_IBCC_LINKINITCMD_DISABLE 1 +/* cycle through TS1/TS2 till OK */ +#define INFINIPATH_IBCC_LINKINITCMD_POLL 2 +#define INFINIPATH_IBCC_LINKINITCMD_SLEEP 3 /* wait for TS1, then go on */ +#define INFINIPATH_IBCC_LINKINITCMD_SHIFT 16 +#define INFINIPATH_IBCC_LINKCMD_MASK 0x3ULL +#define INFINIPATH_IBCC_LINKCMD_INIT 1 /* move to 0x11 */ +#define INFINIPATH_IBCC_LINKCMD_ARMED 2 /* move to 0x21 */ +#define INFINIPATH_IBCC_LINKCMD_ACTIVE 3 /* move to 0x31 */ +#define INFINIPATH_IBCC_LINKCMD_SHIFT 18 +#define INFINIPATH_IBCC_MAXPKTLEN_MASK 0x7FFULL +#define INFINIPATH_IBCC_MAXPKTLEN_SHIFT 20 +#define INFINIPATH_IBCC_PHYERRTHRESHOLD_MASK 0xFULL +#define INFINIPATH_IBCC_PHYERRTHRESHOLD_SHIFT 32 +#define INFINIPATH_IBCC_OVERRUNTHRESHOLD_MASK 0xFULL +#define INFINIPATH_IBCC_OVERRUNTHRESHOLD_SHIFT 36 +#define INFINIPATH_IBCC_CREDITSCALE_MASK 0x7ULL +#define INFINIPATH_IBCC_CREDITSCALE_SHIFT 40 +#define INFINIPATH_IBCC_LOOPBACK 0x8000000000000000ULL +#define INFINIPATH_IBCC_LINKDOWNDEFAULTSTATE 0x4000000000000000ULL + +/* kr_ibcstatus bits */ +#define INFINIPATH_IBCS_LINKTRAININGSTATE_MASK 0xF +#define INFINIPATH_IBCS_LINKTRAININGSTATE_SHIFT 0 +#define INFINIPATH_IBCS_LINKSTATE_MASK 0x7 +#define INFINIPATH_IBCS_LINKSTATE_SHIFT 4 +#define INFINIPATH_IBCS_TXREADY 0x40000000 +#define INFINIPATH_IBCS_TXCREDITOK 0x80000000 + +/* kr_extstatus bits */ +#define INFINIPATH_EXTS_SERDESPLLLOCK 0x1 +#define INFINIPATH_EXTS_GPIOIN_MASK 0xFFFFULL +#define INFINIPATH_EXTS_GPIOIN_SHIFT 48 + +/* kr_extctrl bits */ +#define INFINIPATH_EXTC_GPIOINVERT_MASK 0xFFFFULL +#define INFINIPATH_EXTC_GPIOINVERT_SHIFT 32 +#define INFINIPATH_EXTC_GPIOOE_MASK 0xFFFFULL +#define INFINIPATH_EXTC_GPIOOE_SHIFT 48 +#define INFINIPATH_EXTC_SERDESENABLE 0x80000000ULL +#define INFINIPATH_EXTC_SERDESCONNECT 0x40000000ULL +#define INFINIPATH_EXTC_SERDESENTRUNKING 0x20000000ULL +#define INFINIPATH_EXTC_SERDESDISRXFIFO 0x10000000ULL +#define 
INFINIPATH_EXTC_SERDESENPLPBK1 0x08000000ULL +#define INFINIPATH_EXTC_SERDESENPLPBK2 0x04000000ULL +#define INFINIPATH_EXTC_SERDESENENCDEC 0x02000000ULL +#define INFINIPATH_EXTC_LEDSECPORTGREENON 0x00000020ULL +#define INFINIPATH_EXTC_LEDSECPORTYELLOWON 0x00000010ULL +#define INFINIPATH_EXTC_LEDPRIPORTGREENON 0x00000008ULL +#define INFINIPATH_EXTC_LEDPRIPORTYELLOWON 0x00000004ULL +#define INFINIPATH_EXTC_LEDGBLOKGREENON 0x00000002ULL +#define INFINIPATH_EXTC_LEDGBLERRREDOFF 0x00000001ULL + +/* kr_mdio bits */ +#define INFINIPATH_MDIO_CLKDIV_MASK 0x7FULL +#define INFINIPATH_MDIO_CLKDIV_SHIFT 32 +#define INFINIPATH_MDIO_COMMAND_MASK 0x7ULL +#define INFINIPATH_MDIO_COMMAND_SHIFT 26 +#define INFINIPATH_MDIO_DEVADDR_MASK 0x1FULL +#define INFINIPATH_MDIO_DEVADDR_SHIFT 21 +#define INFINIPATH_MDIO_REGADDR_MASK 0x1FULL +#define INFINIPATH_MDIO_REGADDR_SHIFT 16 +#define INFINIPATH_MDIO_DATA_MASK 0xFFFFULL +#define INFINIPATH_MDIO_DATA_SHIFT 0 +#define INFINIPATH_MDIO_CMDVALID 0x0000000040000000ULL +#define INFINIPATH_MDIO_RDDATAVALID 0x0000000080000000ULL + +/* kr_partitionkey bits */ +#define INFINIPATH_PKEY_SIZE 16 +#define INFINIPATH_PKEY_MASK 0xFFFF +#define INFINIPATH_PKEY_DEFAULT_PKEY 0xFFFF + +/* kr_serdesconfig0 bits */ +#define INFINIPATH_SERDC0_RESET_MASK 0xfULL /* overall reset bits */ +#define INFINIPATH_SERDC0_RESET_PLL 0x10000000ULL /* pll reset */ +#define INFINIPATH_SERDC0_TXIDLE 0xF000ULL /* tx idle enables (per lane) */ + +/* kr_xgxsconfig bits */ +#define INFINIPATH_XGXS_RESET 0x7ULL +#define INFINIPATH_XGXS_MDIOADDR_MASK 0xfULL +#define INFINIPATH_XGXS_MDIOADDR_SHIFT 4 + +/* TID entries (memory) */ +#define INFINIPATH_RT_VALID 0x8000000000000000ULL +#define INFINIPATH_RT_ADDR_MASK 0xFFFFFFFFFFULL +#define INFINIPATH_RT_ADDR_SHIFT 0 +#define INFINIPATH_RT_BUFSIZE_MASK 0x3FFF +#define INFINIPATH_RT_BUFSIZE_SHIFT 48 + +/* mask of defined bits for various registers */ +extern const uint64_t infinipath_c_bitsextant, + infinipath_s_bitsextant, infinipath_r_bitsextant, + infinipath_i_bitsextant, infinipath_e_bitsextant, + infinipath_hwe_bitsextant, infinipath_dc_bitsextant, + infinipath_extc_bitsextant, infinipath_mdio_bitsextant, + infinipath_ibcs_bitsextant, infinipath_ibcc_bitsextant; + +/* masks that are different in different chips */ +extern const uint32_t infinipath_i_rcvavail_mask, infinipath_i_rcvurg_mask; +extern const uint64_t infinipath_hwe_htcmemparityerr_mask; +extern const uint64_t infinipath_hwe_spibdcmlockfailed_mask; +extern const uint64_t infinipath_hwe_sphtdcmlockfailed_mask; +extern const uint64_t infinipath_hwe_htcdcmlockfailed_mask; +extern const uint64_t infinipath_hwe_htcdcmlockfailed_shift; +extern const uint64_t infinipath_hwe_sphtdcmlockfailed_shift; +extern const uint64_t infinipath_hwe_spibdcmlockfailed_shift; + +extern const uint64_t infinipath_hwe_htclnkabyte0crcerr; +extern const uint64_t infinipath_hwe_htclnkabyte1crcerr; +extern const uint64_t infinipath_hwe_htclnkbbyte0crcerr; +extern const uint64_t infinipath_hwe_htclnkbbyte1crcerr; + +/* + * These are the infinipath general register numbers (not offsets). + * The kernel registers are used directly; those beyond the kernel + * registers are calculated from one of the base registers. The use of + * an integer type doesn't allow type-checking as thorough as, say, + * an enum but allows for better hiding of chip differences.
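+ *
+ * Illustration only, not part of the original patch: since a register
+ * name is just an index, chip-specific code can assign the real
+ * offsets at init time while common code stays unchanged, e.g.
+ *
+ *	uint32_t ctrl = ipath_kget_kreg32(t, kr_control);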
+ */ +typedef const uint16_t + ipath_kreg, /* kernel-only, infinipath general registers */ + ipath_creg, /* kernel-only, infinipath counter registers */ + ipath_sreg; /* kernel-only, infinipath send registers */ + +/* + * These are all implemented such that 64 bit accesses work. + * Some implement no more than 32 bits. Because 64 bit reads + * require 2 HT cmds on opteron, we access those with 32 bit + * reads for efficiency (they are written as 64 bits, since + * the extra 32 bits are nearly free on writes, and it slightly reduces + * complexity). The rest are all accessed as 64 bits. + */ +extern ipath_kreg + /* These are the 32 bit group */ + kr_control, kr_counterregbase, kr_intmask, kr_intstatus, + kr_pagealign, kr_portcnt, kr_rcvtidbase, kr_rcvtidcnt, + kr_rcvegrbase, kr_rcvegrcnt, kr_scratch, kr_sendctrl, + kr_sendpiobufbase, kr_sendpiobufcnt, kr_sendpiosize, + kr_sendregbase, kr_userregbase, + /* These are the 64 bit group */ + kr_debugport, kr_debugportselect, kr_errorclear, kr_errormask, + kr_errorstatus, kr_extctrl, kr_extstatus, kr_gpio_clear, kr_gpio_mask, + kr_gpio_out, kr_gpio_status, kr_hwdiagctrl, kr_hwerrclear, + kr_hwerrmask, kr_hwerrstatus, kr_ibcctrl, kr_ibcstatus, kr_intblocked, + kr_intclear, kr_interruptconfig, kr_mdio, kr_partitionkey, kr_rcvbthqp, + kr_rcvbufbase, kr_rcvbufsize, kr_rcvctrl, kr_rcvhdrcnt, + kr_rcvhdrentsize, kr_rcvhdrsize, kr_rcvintmembase, kr_rcvintmemsize, + kr_revision, kr_sendbuffererror, kr_sendbuffererror1, + kr_sendbuffererror2, kr_sendbuffererror3, kr_sendpioavailaddr, + kr_serdesconfig0, kr_serdesconfig1, kr_serdesstatus, kr_txintmembase, + kr_txintmemsize, kr_xgxsconfig, + __kr_invalid, /* a marker for debug, don't use them directly */ + /* a marker for debug, don't use them directly */ + __kr_lastvaliddirect, + /* use only with ipath_k*_kreg64_port(), not *kreg64() */ + kr_rcvhdraddr, + /* use only with ipath_k*_kreg64_port(), not *kreg64() */ + kr_rcvhdrtailaddr, + /* we define the full set for the diags, the kernel doesn't use them */ + kr_rcvhdraddr1, kr_rcvhdraddr2, kr_rcvhdraddr3, kr_rcvhdraddr4, + kr_rcvhdraddr5, kr_rcvhdraddr6, kr_rcvhdraddr7, kr_rcvhdraddr8, + kr_rcvhdrtailaddr1, kr_rcvhdrtailaddr2, kr_rcvhdrtailaddr3, + kr_rcvhdrtailaddr4, kr_rcvhdrtailaddr5, kr_rcvhdrtailaddr6, + kr_rcvhdrtailaddr7, kr_rcvhdrtailaddr8; + +/* + * first of the pioavail registers, the total number is + * (kr_sendpiobufcnt / 32); each buffer uses 2 bits + */ +extern ipath_sreg sr_sendpioavail; + +extern ipath_creg cr_badformatcnt, cr_erricrccnt, cr_errlinkcnt, + cr_errlpcrccnt, cr_errpkey, cr_errrcvflowctrlcnt, + cr_err_rlencnt, cr_errslencnt, cr_errtidfull, + cr_errtidvalid, cr_errvcrccnt, cr_ibstatuschange, + cr_intcnt, cr_invalidrlencnt, cr_invalidslencnt, + cr_lbflowstallcnt, cr_iblinkdowncnt, cr_iblinkerrrecovcnt, + cr_ibsymbolerrcnt, cr_pktrcvcnt, cr_pktrcvflowctrlcnt, + cr_pktsendcnt, cr_pktsendflowcnt, cr_portovflcnt, + cr_portovflcnt1, cr_portovflcnt2, cr_portovflcnt3, cr_portovflcnt4, + cr_portovflcnt5, cr_portovflcnt6, cr_portovflcnt7, cr_portovflcnt8, + cr_rcvebpcnt, cr_rcvovflcnt, cr_rxdroppktcnt, + cr_senddropped, cr_sendstallcnt, cr_sendunderruncnt, + cr_unsupvlcnt, cr_wordrcvcnt, cr_wordsendcnt; + +/* + * register bits for selecting i2c direction and values, used for I2C serial + * flash + */ +extern const uint16_t ipath_gpio_sda_num; +extern const uint16_t ipath_gpio_scl_num; +extern const uint64_t ipath_gpio_sda; +extern const uint64_t ipath_gpio_scl; + +#endif /* _IPATH_REGISTERS_H */ diff -r a3a00f637da6 -r 2d9a3f27a10c 
drivers/infiniband/hw/ipath/ips_common.h --- /dev/null Thu Jan 1 00:00:00 1970 +0000 +++ b/drivers/infiniband/hw/ipath/ips_common.h Wed Dec 28 14:19:42 2005 -0800 @@ -0,0 +1,249 @@ +#ifndef IPS_COMMON_H +#define IPS_COMMON_H +/* + * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. + */ + +#include "ipath_common.h" + +struct ipath_header_typ { + /* + * Version - 4 bits, Port - 4 bits, TID - 10 bits and Offset - 14 bits + * before ECO change ~28 Dec 03. + * After that, Vers 4, Port 3, TID 11, offset 14. + */ + uint32_t ver_port_tid_offset; + uint16_t chksum; + uint16_t pkt_flags; +}; + +struct ips_message_header_typ { + uint16_t lrh[4]; + uint32_t bth[3]; + struct ipath_header_typ iph; + uint8_t sub_opcode; + uint8_t flags; + uint16_t src_rank; + /* 24 bits. The upper 8 bit is available for other use */ + union { + struct { + unsigned ack_seq_num : 24; + unsigned port : 4; + unsigned unused : 4; + }; + uint32_t ack_seq_num_org; + }; + uint8_t expected_tid_session_id; + uint8_t tinylen; /* to aid MPI */ + uint16_t tag; /* to aid MPI */ + union { + uint32_t mpi[4]; /* to aid MPI */ + uint32_t data[4]; + struct { + uint16_t mtu; + uint8_t major_ver; + uint8_t minor_ver; + uint32_t not_used; //free + uint32_t run_id; + uint32_t client_ver; + }; + }; +}; + +struct ether_header_typ { + uint16_t lrh[4]; + uint32_t bth[3]; + struct ipath_header_typ iph; + uint8_t sub_opcode; + uint8_t cmd; + uint16_t lid; + uint16_t mac[3]; + uint8_t frag_num; + uint8_t seq_num; + uint32_t len; + /* MUST be of word size do to PIO write requirements */ + uint32_t csum; + uint16_t csum_offset; + uint16_t flags; + uint16_t first_2_bytes; + uint8_t unused[2]; /* currently unused */ +}; + +/* + * The PIO buffer used for sending infinipath messages must only be written + * in 32-bit words, all the data must be written, and no writes can occur + * after the last word is written (which transfers "ownership" of the buffer + * to the chip and triggers the message to be sent). 
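+ * Concretely: a 100-byte payload goes out as 25 32-bit words; the
+ * first 24 may be written in any order, but the 25th must be the
+ * last store the chip sees, which is why the driver's copy routines
+ * issue a memory barrier and then a single writel() for that final
+ * trigger word (the 100-byte size is purely illustrative).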
+ * Since the Linux sk_buff structure can be recursive, non-aligned, and + * any number of bytes in each segment, we use the following structure + * to keep information about the overall state of the copy operation. + * This is used to save the information needed to store the checksum + * in the right place before sending the last word to the hardware and + * to buffer the last 0-3 bytes of non-word sized segments. + */ +struct copy_data_s { + struct ether_header_typ *hdr; + uint32_t __iomem *csum_pio; /* addr of PIO buf to write csum to */ + uint32_t __iomem *to; /* addr of PIO buf to write data to */ + uint32_t device; /* which device to allocate PIO bufs from */ + int error; /* set if there is an error. */ + int extra; /* amount of data saved in u.buf below */ + unsigned int len; /* total length to send in bytes */ + unsigned int flen; /* frament length in words */ + unsigned int csum; /* partial IP checksum */ + unsigned int pos; /* position for partial checksum */ + unsigned int offset; /* offset to where data currently starts */ + int checksum_calc; /* set to 'true' when the checksum has been calculated */ + struct sk_buff *skb; + union { + uint32_t w; + uint8_t buf[4]; + } u; +}; + +/* IB - LRH header consts */ +#define IPS_LRH_GRH 0x0003 /* 1. word of IB LRH - next header: GRH */ +#define IPS_LRH_BTH 0x0002 /* 1. word of IB LRH - next header: BTH */ + +#define IPS_OFFSET 0 + +/* + * defines the cut-off point between the header queue and eager/expected + * TID queue + */ +#define NUM_OF_EKSTRA_WORDS_IN_HEADER_QUEUE ((sizeof(struct ips_message_header_typ) - offsetof(struct ips_message_header_typ, iph)) >> 2) + +/* OpCodes */ +#define OPCODE_IPS 0xC0 +#define OPCODE_ITH4X 0xC1 + +/* OpCode 30 is use by stand-alone test programs */ +#define OPCODE_RAW_DATA 0xDE +/* last OpCode (31) is reserved for test */ +#define OPCODE_TEST 0xDF + +/* sub OpCodes - ips */ +#define OPCODE_SEQ_DATA 0x01 +#define OPCODE_SEQ_CTRL 0x02 + +#define OPCODE_ACK 0x10 +#define OPCODE_NAK 0x11 + +#define OPCODE_ERR_CHK 0x20 +#define OPCODE_ERR_CHK_PLS 0x21 + +#define OPCODE_STARTUP 0x30 +#define OPCODE_STARTUP_ACK 0x31 +#define OPCODE_STARTUP_NAK 0x32 + +#define OPCODE_STARTUP_EXT 0x34 +#define OPCODE_STARTUP_ACK_EXT 0x35 +#define OPCODE_STARTUP_NAK_EXT 0x36 + +#define OPCODE_TIDS_RELEASE 0x40 +#define OPCODE_TIDS_RELEASE_CONFIRM 0x41 + +#define OPCODE_CLOSE 0x50 +#define OPCODE_CLOSE_ACK 0x51 +/* + * like OPCODE_CLOSE, but no complaint if other side has already closed. Used + * when doing abort(), MPI_Abort(), etc. + */ +#define OPCODE_ABORT 0x52 + +/* sub OpCodes - ith4x */ +#define OPCODE_ENCAP 0x81 +#define OPCODE_LID_ARP 0x82 + +/* Receive Header Queue: receive type (from infinipath) */ +#define RCVHQ_RCV_TYPE_EXPECTED 0 +#define RCVHQ_RCV_TYPE_EAGER 1 +#define RCVHQ_RCV_TYPE_NON_KD 2 +#define RCVHQ_RCV_TYPE_ERROR 3 + +/* misc. 
*/ +#define SIZE_OF_CRC 1 + +#define EAGER_TID_ID INFINIPATH_I_TID_MASK + +#define IPS_DEFAULT_P_KEY 0xFFFF + +/* functions for extracting fields from rcvhdrq entries */ +static inline uint32_t ips_get_hdr_err_flags(uint32_t *rbuf) +{ + return rbuf[1]; +} + +static inline uint32_t ips_get_index(uint32_t *rbuf) +{ + return (rbuf[0] >> INFINIPATH_RHF_EGRINDEX_SHIFT) + & INFINIPATH_RHF_EGRINDEX_MASK; +} + +static inline uint32_t ips_get_rcv_type(uint32_t *rbuf) +{ + return (rbuf[0] >> INFINIPATH_RHF_RCVTYPE_SHIFT) + & INFINIPATH_RHF_RCVTYPE_MASK; +} + +static inline uint32_t ips_get_length_in_bytes(uint32_t *rbuf) +{ + return ((rbuf[0] >> INFINIPATH_RHF_LENGTH_SHIFT) + & INFINIPATH_RHF_LENGTH_MASK) << 2; +} + +static inline void *ips_get_first_protocol_header(uint32_t *rbuf) +{ + return (void *)&rbuf[2]; +} + +static inline struct ips_message_header_typ *ips_get_ips_header(uint32_t *rbuf) +{ + return (struct ips_message_header_typ *)&rbuf[2]; +} + +static inline uint32_t ips_get_ipath_ver(uint32_t hdrword) +{ + return (hdrword >> INFINIPATH_I_VERS_SHIFT) + & INFINIPATH_I_VERS_MASK; +} + +/* + * Copy routine that is guaranteed to work in terms of aligned 32-bit + * quantities. + */ +void ipath_dwordcpy(uint32_t *dest, uint32_t *src, uint32_t ndwords); + +#endif /* IPS_COMMON_H */ From bos at pathscale.com Wed Dec 28 16:31:25 2005 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 28 Dec 2005 16:31:25 -0800 Subject: [openib-general] [PATCH 6 of 20] ipath - driver debugging headers In-Reply-To: Message-ID: <9e8d017ed298d591ea33.1135816285@eng-12.pathscale.com> Signed-off-by: Bryan O'Sullivan diff -r 2d9a3f27a10c -r 9e8d017ed298 drivers/infiniband/hw/ipath/ipath_debug.h --- /dev/null Thu Jan 1 00:00:00 1970 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_debug.h Wed Dec 28 14:19:42 2005 -0800 @@ -0,0 +1,98 @@ +/* + * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. 
+ */ + +#ifndef _IPATH_DEBUG_H +#define _IPATH_DEBUG_H + +#ifndef _IPATH_DEBUGGING /* debugging enabled or not */ +#define _IPATH_DEBUGGING 1 +#endif + +#if _IPATH_DEBUGGING + +/* + * Mask values for debugging. The scheme allows us to compile out any + * of the debug tracing stuff, and if compiled in, to enable or disable + * dynamically. This can be set at modprobe time also: + * modprobe infinipath.ko infinipath_debug=7 + */ + +#define __IPATH_INFO 0x1 /* generic low verbosity stuff */ +#define __IPATH_DBG 0x2 /* generic debug */ +#define __IPATH_TRSAMPLE 0x8 /* generate trace buffer sample entries */ +/* leave some low verbosity spots open */ +#define __IPATH_VERBDBG 0x40 /* very verbose debug */ +#define __IPATH_PKTDBG 0x80 /* print packet data */ +/* print process startup (init)/exit messages */ +#define __IPATH_PROCDBG 0x100 +/* print mmap/nopage stuff, not using VDBG any more */ +#define __IPATH_MMDBG 0x200 +#define __IPATH_USER_SEND 0x1000 /* use user mode send */ +#define __IPATH_KERNEL_SEND 0x2000 /* use kernel mode send */ +#define __IPATH_EPKTDBG 0x4000 /* print ethernet packet data */ +#define __IPATH_SMADBG 0x8000 /* sma packet debug */ +#define __IPATH_IPATHDBG 0x10000 /* Ethernet (IPATH) general debug on */ +#define __IPATH_IPATHWARN 0x20000 /* Ethernet (IPATH) warnings on */ +#define __IPATH_IPATHERR 0x40000 /* Ethernet (IPATH) errors on */ +#define __IPATH_IPATHPD 0x80000 /* Ethernet (IPATH) packet dump on */ +#define __IPATH_IPATHTABLE 0x100000 /* Ethernet (IPATH) table dump on */ + +#else /* _IPATH_DEBUGGING */ + +/* + * define all of these even with debugging off, for the few places that do + * if(infinipath_debug & _IPATH_xyzzy), but in a way that will make the + * compiler eliminate the code + */ + +#define __IPATH_INFO 0x0 /* generic low verbosity stuff */ +#define __IPATH_DBG 0x0 /* generic debug */ +#define __IPATH_TRSAMPLE 0x0 /* generate trace buffer sample entries */ +#define __IPATH_VERBDBG 0x0 /* very verbose debug */ +#define __IPATH_PKTDBG 0x0 /* print packet data */ +#define __IPATH_PROCDBG 0x0 /* print process startup (init)/exit messages */ +/* print mmap/nopage stuff, not using VDBG any more */ +#define __IPATH_MMDBG 0x0 +#define __IPATH_EPKTDBG 0x0 /* print ethernet packet data */ +#define __IPATH_SMADBG 0x0 /* print process startup (init)/exit messages */#define __IPATH_IPATHDBG 0x0 /* Ethernet (IPATH) table dump on */ +#define __IPATH_IPATHWARN 0x0 /* Ethernet (IPATH) warnings on */ +#define __IPATH_IPATHERR 0x0 /* Ethernet (IPATH) errors on */ +#define __IPATH_IPATHPD 0x0 /* Ethernet (IPATH) packet dump on */ +#define __IPATH_IPATHTABLE 0x0 /* Ethernet (IPATH) packet dump on */ + +#endif /* _IPATH_DEBUGGING */ + +#endif /* _IPATH_DEBUG_H */ diff -r 2d9a3f27a10c -r 9e8d017ed298 drivers/infiniband/hw/ipath/ipath_kdebug.h --- /dev/null Thu Jan 1 00:00:00 1970 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_kdebug.h Wed Dec 28 14:19:42 2005 -0800 @@ -0,0 +1,109 @@ +/* + * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. + */ + +#ifndef _IPATH_KDEBUG_H +#define _IPATH_KDEBUG_H + +#include "ipath_debug.h" + +/* + * This file contains lightweight kernel tracing code. + */ + +extern unsigned infinipath_debug; +const char *ipath_get_unit_name(int unit); + +#if _IPATH_DEBUGGING + +#define _IPATH_UNIT_ERROR(unit,fmt,...) \ + printk(KERN_ERR "%s: " fmt, ipath_get_unit_name(unit), ##__VA_ARGS__) + +#define _IPATH_ERROR(fmt,...) printk(KERN_ERR "infinipath: " fmt, ##__VA_ARGS__) + +#define _IPATH_INFO(fmt,...) \ + do { \ + if(unlikely(infinipath_debug & __IPATH_INFO)) \ + printk(KERN_INFO "infinipath: " fmt, ##__VA_ARGS__); \ + } while(0) + +#define __IPATH_DBG_WHICH(which,fmt,...) \ + do { \ + if(unlikely(infinipath_debug&(which))) \ + printk(KERN_DEBUG "%s: " fmt, __func__,##__VA_ARGS__); \ + } while(0) + +#define _IPATH_DBG(fmt,...) __IPATH_DBG_WHICH(__IPATH_DBG,fmt,##__VA_ARGS__) +#define _IPATH_VDBG(fmt,...) __IPATH_DBG_WHICH(__IPATH_VERBDBG,fmt,##__VA_ARGS__) +#define _IPATH_PDBG(fmt,...) __IPATH_DBG_WHICH(__IPATH_PKTDBG,fmt,##__VA_ARGS__) +#define _IPATH_EPDBG(fmt,...) __IPATH_DBG_WHICH(__IPATH_EPKTDBG,fmt,##__VA_ARGS__) +#define _IPATH_PRDBG(fmt,...) __IPATH_DBG_WHICH(__IPATH_PROCDBG,fmt,##__VA_ARGS__) +#define _IPATH_MMDBG(fmt,...) __IPATH_DBG_WHICH(__IPATH_MMDBG,fmt,##__VA_ARGS__) +#define _IPATH_SMADBG(fmt,...) __IPATH_DBG_WHICH(__IPATH_SMADBG,fmt,##__VA_ARGS__) +#define _IPATH_IPATHDBG(fmt,...) __IPATH_DBG_WHICH(__IPATH_IPATHDBG,fmt,##__VA_ARGS__) +#define _IPATH_IPATHWARN(fmt,...) __IPATH_DBG_WHICH(__IPATH_IPATHWARN,fmt,##__VA_ARGS__) +#define _IPATH_IPATHERR(fmt,...) __IPATH_DBG_WHICH(__IPATH_IPATHERR ,fmt,##__VA_ARGS__) +#define _IPATH_IPATHPD(fmt,...) __IPATH_DBG_WHICH(__IPATH_IPATHPD ,fmt,##__VA_ARGS__) +#define _IPATH_IPATHTABLE(fmt,...) __IPATH_DBG_WHICH(__IPATH_IPATHTABLE ,fmt,##__VA_ARGS__) + +#else /* ! _IPATH_DEBUGGING */ + +#define _IPATH_UNIT_ERROR(unit,fmt,...) \ + do { \ + printk(KERN_ERR "%s" fmt, "",##__VA_ARGS__); \ + } while(0) + +#define _IPATH_ERROR(fmt,...) \ + do { \ + printk (KERN_ERR "%s" fmt, "",##__VA_ARGS__); \ + } while(0) + +#define _IPATH_INFO(fmt,...) +#define _IPATH_DBG(fmt,...) 
+#define _IPATH_PDBG(fmt,...) +#define _IPATH_EPDBG(fmt,...) +#define _IPATH_PRDBG(fmt,...) +#define _IPATH_VDBG(fmt,...) +#define _IPATH_MMDBG(fmt,...) +#define _IPATH_SMADBG(fmt,...) +#define _IPATH_IPATHDBG(fmt,...) +#define _IPATH_IPATHWARN(fmt,...) +#define _IPATH_IPATHERR(fmt,...) +#define _IPATH_IPATHPD(fmt,...) +#define _IPATH_IPATHTABLE(fmt,...) + +#endif /* _IPATH_DEBUGGING */ + +#endif /* _IPATH_DEBUG_H */ From bos at pathscale.com Wed Dec 28 16:31:32 2005 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 28 Dec 2005 16:31:32 -0800 Subject: [openib-general] [PATCH 13 of 20] ipath - routines used by upper layer driver code In-Reply-To: Message-ID: Signed-off-by: Bryan O'Sullivan diff -r 5e9b0b7876e2 -r f9bcd9de3548 drivers/infiniband/hw/ipath/ipath_layer.c --- /dev/null Thu Jan 1 00:00:00 1970 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_layer.c Wed Dec 28 14:19:43 2005 -0800 @@ -0,0 +1,1313 @@ +/* + * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. + */ + +/* + * These are the routines used by layered drivers, currently just the + * layered ethernet driver and verbs layer. 
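+ * A layered driver registers its callbacks with
+ * ipath_layer_register() and transmits with ipath_layer_send(); the
+ * verbs layer has a parallel ipath_verbs_register() /
+ * ipath_verbs_send() pair. All four are defined below.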
+ */ + +#include + +#include "ipath_kernel.h" +#include "ips_common.h" +#include "ipath_layer.h" + +/* unit number is already validated in ipath_ioctl() */ +int ipath_kset_linkstate(uint32_t arg) +{ + ipath_type unit = 0xffff & (arg >> 16); + uint32_t lstate; + struct ipath_devdata *dd; + int tryarmed = 0; + + if (unit >= infinipath_max || + !(devdata[unit].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", unit); + return -ENODEV; + } + + dd = &devdata[unit]; + arg &= 0xffff; + switch (arg) { + case IPATH_IB_LINKDOWN: + ipath_down_link(unit); /* really moving it to idle */ + lstate = IPATH_LINKDOWN | IPATH_LINK_SLEEPING; + break; + + case IPATH_IB_LINKDOWN_POLL: + ipath_set_ib_lstate(unit, INFINIPATH_IBCC_LINKINITCMD_POLL << + INFINIPATH_IBCC_LINKINITCMD_SHIFT); + lstate = IPATH_LINKDOWN; + break; + + case IPATH_IB_LINKDOWN_DISABLE: + ipath_set_ib_lstate(unit, INFINIPATH_IBCC_LINKINITCMD_DISABLE << + INFINIPATH_IBCC_LINKINITCMD_SHIFT); + lstate = IPATH_LINKDOWN; + break; + + case IPATH_IB_LINKINIT: + ipath_set_ib_lstate(unit, INFINIPATH_IBCC_LINKCMD_INIT); + lstate = IPATH_LINKINIT; + break; + + case IPATH_IB_LINKARM: + ipath_set_ib_lstate(unit, INFINIPATH_IBCC_LINKCMD_ARMED); + lstate = IPATH_LINKARMED; + break; + + case IPATH_IB_LINKACTIVE: + /* + * because we sometimes go to ARMED, but then back to 0x11 + * (initialized) before the SMA asks us to move to ACTIVE, + * we will try to advance state to ARMED here, if necessary + */ + if (!(dd->ipath_flags & + (IPATH_LINKINIT | IPATH_LINKARMED | IPATH_LINKDOWN | + IPATH_LINK_SLEEPING | IPATH_LINKACTIVE))) { + /* this one is just paranoia */ + _IPATH_DBG + ("don't know current state (flags 0x%x), try anyway\n", + dd->ipath_flags); + tryarmed = 1; + + } + if (!(dd->ipath_flags & (IPATH_LINKARMED | IPATH_LINKACTIVE))) + tryarmed = 1; + if (tryarmed) { + ipath_set_ib_lstate(unit, + INFINIPATH_IBCC_LINKCMD_ARMED); + /* + * give it up to 2 seconds to get to ARMED or + * ACTIVE; continue afterwards even if we fail + */ + if (ipath_wait_linkstate + (unit, IPATH_LINKARMED | IPATH_LINKACTIVE, 2000)) + _IPATH_VDBG + ("try for active, even though didn't get to ARMED\n"); + } + + ipath_set_ib_lstate(unit, INFINIPATH_IBCC_LINKCMD_ACTIVE); + lstate = IPATH_LINKACTIVE; + break; + + default: + _IPATH_DBG("Unknown linkstate 0x%x requested\n", arg); + return -EINVAL; + } + return ipath_wait_linkstate(unit, lstate, 2000); +} + +/* + * we can handle "any" incoming size, the issue here is whether we + * need to restrict our outgoing size. For now, we don't do any + * sanity checking on this, and we don't deal with what happens to + * programs that are already running when the size changes. + * unit number is already validated in ipath_ioctl() + * NOTE: changing the MTU will usually cause the IBC to go back to + * link initialize (0x11) state... + */ +int ipath_kset_mtu(uint32_t arg) +{ + unsigned unit = (arg >> 16) & 0xffff; + uint32_t piosize; + int changed = 0; + + if (unit >= infinipath_max || + !(devdata[unit].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", unit); + return -ENODEV; + } + + arg &= 0xffff; + /* + * mtu is IB data payload max. It's the largest power of 2 less + * than piosize (or even larger, since it only really controls the + * largest we can receive; we can send the max of the mtu and piosize). + * We check that it's one of the valid IB sizes. 
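+ * The unit/value packing matches ipath_kset_linkstate() above: the
+ * unit number rides in the top 16 bits of arg and the MTU in the low
+ * 16, so ipath_kset_mtu((0u << 16) | 2048) requests a 2048-byte MTU
+ * on unit 0. With that value piosize becomes 2048 + 128 = 2176
+ * bytes; the code below drops the 8-byte PBC (2168 bytes = 542
+ * words) and adds one ICRC word, so 543 lands in the IBCC MAXPKTLEN
+ * field. (The worked numbers are illustrative only.)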
+ */ + if (arg != 256 && arg != 512 && arg != 1024 && arg != 2048 && + arg != 4096) { + _IPATH_DBG("Trying to set invalid mtu %u, failing\n", arg); + return -EINVAL; + } + if (devdata[unit].ipath_ibmtu == arg) { + return 0; /* same as current */ + } + + piosize = devdata[unit].ipath_ibmaxlen; + devdata[unit].ipath_ibmtu = arg; + + /* + * the 128 is the max IB header size allowed for in our pio send buffers + * If we are reducing the MTU below that, this doesn't completely make + * sense, but it's OK. + */ + if (arg >= (piosize - 128)) { + /* hasn't been changed */ + if (piosize == devdata[unit].ipath_init_ibmaxlen) + _IPATH_VDBG + ("mtu 0x%x >= ibmaxlen hardware max, nothing to do\n", + arg); + else { + _IPATH_VDBG + ("mtu 0x%x restores ibmaxlen to full amount 0x%x\n", + arg, piosize); + devdata[unit].ipath_ibmaxlen = piosize; + changed = 1; + } + } else if ((arg + 128) == devdata[unit].ipath_ibmaxlen) + _IPATH_VDBG("ibmaxlen %x same as current, no change\n", arg); + else { + piosize = arg + 128; + _IPATH_VDBG("ibmaxlen was 0x%x, setting to 0x%x (mtu 0x%x)\n", + devdata[unit].ipath_ibmaxlen, piosize, arg); + devdata[unit].ipath_ibmaxlen = piosize; + changed = 1; + } + + if (changed) { + /* + * set the IBC maxpktlength to the size of our pio + * buffers in words + */ + uint64_t ibc = devdata[unit].ipath_ibcctrl; + ibc &= ~(INFINIPATH_IBCC_MAXPKTLEN_MASK << + INFINIPATH_IBCC_MAXPKTLEN_SHIFT); + + piosize = piosize - 2 * sizeof(uint32_t); /* ignore pbc */ + devdata[unit].ipath_ibmaxlen = piosize; + piosize /= sizeof(uint32_t); /* in words */ + /* + * for ICRC, which we only send in diag test pkt mode, and we + * don't need to worry about that for mtu + */ + piosize += 1; + + ibc |= piosize << INFINIPATH_IBCC_MAXPKTLEN_SHIFT; + devdata[unit].ipath_ibcctrl = ibc; + ipath_kput_kreg(unit, kr_ibcctrl, devdata[unit].ipath_ibcctrl); + } + return 0; +} + +void ipath_set_sps_lid(const ipath_type unit, uint32_t arg) +{ + if (unit >= infinipath_max || + !(devdata[unit].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", unit); + return; + } + + ipath_stats.sps_lid[unit] = devdata[unit].ipath_lid = arg; + if (devdata[unit].ipath_layer.l_intr) + devdata[unit].ipath_layer.l_intr(unit, IPATH_LAYER_INT_LID); +} + +/* XXX - need to inform anyone who cares this just happened. 
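+ * (Today the only notification hook is the l_intr callback that
+ * ipath_set_sps_lid() above fires with IPATH_LAYER_INT_LID; nothing
+ * comparable fires on a GUID change, hence the XXX.)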
*/ +int ipath_layer_set_guid(const ipath_type device, uint64_t guid) +{ + if (device >= infinipath_max || + !(devdata[device].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", device); + return -ENODEV; + } + devdata[device].ipath_guid = guid; + return 0; +} + +uint64_t ipath_layer_get_guid(const ipath_type device) +{ + if (device >= infinipath_max || + !(devdata[device].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", device); + return 0; + } + return devdata[device].ipath_guid; +} + +uint32_t ipath_layer_get_nguid(const ipath_type device) +{ + if (device >= infinipath_max || + !(devdata[device].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", device); + return 0; + } + return devdata[device].ipath_nguid; +} + +int ipath_layer_query_device(const ipath_type device, uint32_t * vendor, + uint32_t * boardrev, uint32_t * majrev, + uint32_t * minrev) +{ + if (device >= infinipath_max || + !(devdata[device].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", device); + return -ENODEV; + } + + *vendor = devdata[device].ipath_vendorid; + *boardrev = devdata[device].ipath_boardrev; + *majrev = devdata[device].ipath_majrev; + *minrev = devdata[device].ipath_minrev; + + return 0; +} + +uint32_t ipath_layer_get_flags(const ipath_type device) +{ + if (device >= infinipath_max || + !(devdata[device].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", device); + return 0; + } + + return devdata[device].ipath_flags; +} + +struct device *ipath_layer_get_pcidev(const ipath_type device) +{ + if (device >= infinipath_max || + !(devdata[device].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", device); + return NULL; + } + + return &(devdata[device].pcidev->dev); +} + +uint16_t ipath_layer_get_deviceid(const ipath_type device) +{ + if (device >= infinipath_max || + !(devdata[device].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", device); + return 0; + } + + return devdata[device].ipath_deviceid; +} + +uint64_t ipath_layer_get_lastibcstat(const ipath_type device) +{ + if (device >= infinipath_max || + !(devdata[device].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", device); + return 0; + } + + return devdata[device].ipath_lastibcstat; +} + +uint32_t ipath_layer_get_ibmtu(const ipath_type device) +{ + if (device >= infinipath_max || + !(devdata[device].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", device); + return 0; + } + + return devdata[device].ipath_ibmtu; +} + +int ipath_layer_register(const ipath_type device, + int (*l_intr) (const ipath_type, uint32_t), + int (*l_rcv) (const ipath_type, void *, + struct sk_buff *), uint16_t l_rcv_opcode, + int (*l_rcv_lid) (const ipath_type, void *), + uint16_t l_rcv_lid_opcode) +{ + int ret = 0; + + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return 1; + } + if (!(devdata[device].ipath_flags & IPATH_INITTED)) { + _IPATH_VDBG("%s not yet initialized, failing\n", + ipath_get_unit_name(device)); + return 1; + } + + _IPATH_VDBG("intr %p rx %p, rx_lid %p\n", l_intr, l_rcv, l_rcv_lid); + if (devdata[device].ipath_layer.l_intr + || devdata[device].ipath_layer.l_rcv) { + _IPATH_DBG + ("Layered device already registered on unit %u, failing\n", + device); + return 1; + } + + if (!(*devdata[device].ipath_statusp & IPATH_STATUS_SMA)) + *devdata[device].ipath_statusp |= IPATH_STATUS_OIB_SMA; + devdata[device].ipath_layer.l_intr = l_intr; + devdata[device].ipath_layer.l_rcv = l_rcv; + 
devdata[device].ipath_layer.l_rcv_lid = l_rcv_lid; + devdata[device].ipath_layer.l_rcv_opcode = l_rcv_opcode; + devdata[device].ipath_layer.l_rcv_lid_opcode = l_rcv_lid_opcode; + + return ret; +} + +static void ipath_verbs_timer(unsigned long t) +{ + /* + * If port 0 receive packet interrupts are not availabile, + * check the receive queue. + */ + if (!(devdata[t].ipath_flags & IPATH_GPIO_INTR)) + ipath_kreceive(t); + + /* Handle verbs layer timeouts. */ + if (devdata[t].verbs_layer.l_timer_cb) + devdata[t].verbs_layer.l_timer_cb(t); + + mod_timer(&devdata[t].verbs_layer.l_timer, jiffies + 1); +} + +/* Verbs layer registration. */ +int ipath_verbs_register(const ipath_type device, + int (*l_piobufavail) (const ipath_type device), + void (*l_rcv) (const ipath_type device, void *rhdr, + void *data, uint32_t tlen), + void (*l_timer_cb) (const ipath_type device)) +{ + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return 0; + } + if (!(devdata[device].ipath_flags & IPATH_INITTED)) { + _IPATH_VDBG("%s not yet initialized, failing\n", + ipath_get_unit_name(device)); + return 0; + } + + _IPATH_VDBG("piobufavail %p rx %p\n", l_piobufavail, l_rcv); + if (devdata[device].verbs_layer.l_piobufavail || + devdata[device].verbs_layer.l_rcv) { + _IPATH_DBG("Verbs layer already registered on unit %u, " + "failing\n", device); + return 0; + } + + devdata[device].verbs_layer.l_piobufavail = l_piobufavail; + devdata[device].verbs_layer.l_rcv = l_rcv; + devdata[device].verbs_layer.l_timer_cb = l_timer_cb; + devdata[device].verbs_layer.l_flags = 0; + + return 1; +} + +void ipath_verbs_unregister(const ipath_type device) +{ + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return; + } + if (!(devdata[device].ipath_flags & IPATH_INITTED)) { + _IPATH_VDBG("%s not yet initialized, failing\n", + ipath_get_unit_name(device)); + return; + } + + *devdata[device].ipath_statusp &= ~IPATH_STATUS_OIB_SMA; + devdata[device].verbs_layer.l_piobufavail = NULL; + devdata[device].verbs_layer.l_rcv = NULL; + devdata[device].verbs_layer.l_timer_cb = NULL; + devdata[device].verbs_layer.l_flags = 0; +} + +int ipath_layer_open(const ipath_type device, uint32_t * pktmax) +{ + int ret = 0; + uint32_t intval = 0; + + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return 1; + } + if (!devdata[device].ipath_layer.l_intr + || !devdata[device].ipath_layer.l_rcv) { + _IPATH_DBG("layer not registered, failing\n"); + return 1; + } + + if ((ret = + ipath_setrcvhdrsize(device, NUM_OF_EKSTRA_WORDS_IN_HEADER_QUEUE))) + return ret; + + *pktmax = devdata[device].ipath_ibmaxlen; + + if (*devdata[device].ipath_statusp & IPATH_STATUS_IB_READY) + intval |= IPATH_LAYER_INT_IF_UP; + if (ipath_stats.sps_lid[device]) + intval |= IPATH_LAYER_INT_LID; + if (ipath_stats.sps_mlid[device]) + intval |= IPATH_LAYER_INT_BCAST; + /* + * do this on open, in case low level is already up and + * just layered driver was reloaded, etc. + */ + if (intval) + devdata[device].ipath_layer.l_intr(device, intval); + + return ret; +} + +uint16_t ipath_layer_get_lid(const ipath_type device) +{ + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return 0; + } + + _IPATH_VDBG("returning mylid 0x%x for layered dev %d\n", + devdata[device].ipath_lid, device); + return devdata[device].ipath_lid; +} + +/* + * get the MAC address. This is the EUID-64 OUI octets (top 3), then + * skip the next 2 (which should both be zero or 0xff). 
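+ * For example, a GUID of 00:11:75:ff:ff:04:7b:c8 yields the MAC
+ * 00:11:75:04:7b:c8: bytes 0-2 are the OUI, bytes 3-4 are the
+ * all-zero or all-ones EUI-64 filler that gets skipped, and bytes
+ * 5-7 are the low part (the GUID value is illustrative only).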
+ * The returned MAC is in network order + * mac points to at least 6 bytes of buffer + * returns 0 on error (to be consistent with get_lid and get_bcast + * return 1 on success + * We assume that by the time the LID is set, that the GUID is as valid + * as it's ever going to be, rather than adding yet another status bit. + */ + +int ipath_layer_get_mac(const ipath_type device, uint8_t * mac) +{ + uint8_t *guid; + + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u, failing\n", device); + return 0; + } + guid = (uint8_t *) & devdata[device].ipath_guid; + + mac[0] = guid[0]; + mac[1] = guid[1]; + mac[2] = guid[2]; + mac[3] = guid[5]; + mac[4] = guid[6]; + mac[5] = guid[7]; + if ((guid[3] || guid[4]) && !(guid[3] == 0xff && guid[4] == 0xff)) + _IPATH_DBG("Warning, guid bytes 3 and 4 not 0 or 0xffff: %x %x\n", + guid[3], guid[4]); + _IPATH_VDBG("Returning %x:%x:%x:%x:%x:%x\n", + mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]); + return 1; +} + +uint16_t ipath_layer_get_bcast(const ipath_type device) +{ + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u, failing\n", device); + return 0; + } + + _IPATH_VDBG("returning broadcast LID 0x%x for unit %u\n", + devdata[device].ipath_mlid, device); + return devdata[device].ipath_mlid; +} + +int ipath_layer_get_num_of_dev(void) +{ + return infinipath_max; +} + +int ipath_layer_get_cr_errpkey(const ipath_type device) +{ + return ipath_kget_creg32(device, cr_errpkey); +} + +void ipath_layer_close(const ipath_type device) +{ + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return; + } + if (!devdata[device].ipath_layer.l_intr + || !devdata[device].ipath_layer.l_rcv) { + /* normal if not all chips are present */ + _IPATH_VDBG("layer close without open\n"); + } else { + devdata[device].ipath_layer.l_intr = NULL; + devdata[device].ipath_layer.l_rcv = NULL; + devdata[device].ipath_layer.l_rcv_lid = NULL; + devdata[device].ipath_layer.l_rcv_opcode = 0; + devdata[device].ipath_layer.l_rcv_lid_opcode = 0; + } +} + +static inline void copy_aligned(uint32_t __iomem *piobuf, + struct ipath_sge_state *ss, + uint32_t length) +{ + struct ipath_sge *sge = &ss->sge; + + while (length) { + uint32_t len = sge->length; + uint32_t w; + + BUG_ON(len == 0); + if (len > length) + len = length; + /* Need to round up for the last dword in the packet. 
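+ * For example, a final chunk with len == 10 bytes gives
+ * w = (10 + 3) >> 2 == 3 dwords: the first two go out via
+ * memcpy_toio32() and the third, the trigger word, via a single
+ * writel() after the mb(), because the last word written hands the
+ * buffer to the chip (len chosen only for illustration).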
*/ + w = (len + 3) >> 2; + if (length == len) { /* last chunk, trigger word is special */ + uint32_t *src32; + memcpy_toio32(piobuf, sge->vaddr, w-1); + src32 = (w-1)+(uint32_t*)sge->vaddr; + mb(); /* must flush early everything before trigger word */ + writel(*src32, piobuf+w-1); + } + else + memcpy_toio32(piobuf, sge->vaddr, w); + piobuf += w; + sge->vaddr += len; + sge->length -= len; + sge->sge_length -= len; + if (sge->sge_length == 0) { + if (--ss->num_sge) + *sge = *ss->sg_list++; + } else if (sge->length == 0 && sge->mr != NULL) { + if (++sge->n >= IPATH_SEGSZ) { + if (++sge->m >= sge->mr->mapsz) + break; + sge->n = 0; + } + sge->vaddr = sge->mr->map[sge->m]->segs[sge->n].vaddr; + sge->length = sge->mr->map[sge->m]->segs[sge->n].length; + } + length -= len; + } +} + +static inline void copy_unaligned(uint32_t __iomem *piobuf, + struct ipath_sge_state *ss, + uint32_t length) +{ + struct ipath_sge *sge = &ss->sge; + union { + uint8_t wbuf[4]; + uint32_t w; + } u; + int extra = 0; + + while (length) { + uint32_t len = sge->length; + + BUG_ON(len == 0); + if (len > length) + len = length; + length -= len; + while (len) { + u.wbuf[extra++] = *(uint8_t *) sge->vaddr; + sge->vaddr++; + sge->length--; + sge->sge_length--; + if (extra >= 4) { + if (!length && len == 1) + mb(); /* flush all before the trigger word write */ + writel(u.w, piobuf); + extra = 0; + } + len--; + } + if (sge->sge_length == 0) { + if (--ss->num_sge) + *sge = *ss->sg_list++; + } else if (sge->length == 0) { + if (++sge->n >= IPATH_SEGSZ) { + if (++sge->m >= sge->mr->mapsz) + break; + sge->n = 0; + } + sge->vaddr = sge->mr->map[sge->m]->segs[sge->n].vaddr; + sge->length = sge->mr->map[sge->m]->segs[sge->n].length; + } + } + if (extra) { + while (extra < 4) + u.wbuf[extra++] = 0; + mb(); /* flush all before the trigger word write */ + writel(u.w, piobuf); + } +} + +/* + * This is like ipath_send_smapkt() in that we need to be able to send + * packets after the chip is initialized (MADs) but also like + * ipath_layer_send() since its used by the verbs layer. + */ +int ipath_verbs_send(const ipath_type device, uint32_t hdrwords, + uint32_t *hdr, uint32_t len, struct ipath_sge_state *ss) +{ + struct ipath_devdata *dd = &devdata[device]; + uint32_t __iomem *piobuf; + uint32_t plen; + + if (device >= infinipath_max || + !(dd->ipath_flags & IPATH_PRESENT) || !dd->ipath_kregbase) { + _IPATH_DBG("illegal unit %u\n", device); + return -ENODEV; + } + if (!(dd->ipath_flags & IPATH_INITTED)) { + /* no hardware, freeze, etc. */ + _IPATH_DBG("unit %u not usable\n", device); + return -ENODEV; + } + /* +1 is for the qword padding of pbc */ + plen = hdrwords + ((len + 3) >> 2) + 1; + if ((plen << 2) > dd->ipath_ibmaxlen) { + _IPATH_DBG("packet len 0x%x too long, failing\n", plen); + return -EINVAL; + } + + /* Get a PIO buffer to use. */ + if (!(piobuf = ipath_getpiobuf(device, NULL))) + return -EBUSY; + + _IPATH_EPDBG("0x%x+1w pio %p\n", plen - 1, piobuf); + + /* Write len to control qword, no flags. 
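+ * plen was computed above as hdrwords + ((len + 3) >> 2) + 1; for
+ * example a 24-dword header with a 256-byte payload gives
+ * 24 + 64 + 1 = 89 dwords, the +1 being the qword padding of the PBC
+ * written here (sizes illustrative only).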
+ * we have to flush after the PBC for correctness on some cpus + * or WC buffer can be written out of order */ + writeq(plen, piobuf); + mb(); + piobuf += 2; + if (len == 0) { + /* if there is just the header portion, must flush before + * writing last word of header for correctness, and after + * the last header word (trigger word) */ + memcpy_toio32(piobuf, hdr, hdrwords-1); + mb(); + writel(hdr[hdrwords-1], piobuf+hdrwords-1); + mb(); + return 0; + } + memcpy_toio32(piobuf, hdr, hdrwords); + piobuf += hdrwords; + /* + * If we really wanted to check everything, we would have to + * check that each segment starts on a dword boundary and is + * a dword multiple in length. + * Since there can be lots of segments, we only check for a simple + * common case where the amount to copy is contained in one segment. + */ + if (ss->sge.length == len) + copy_aligned(piobuf, ss, len); + else + copy_unaligned(piobuf, ss, len); + mb(); /* be sure trigger word is written */ + return 0; +} + +void ipath_layer_snapshot_counters(const ipath_type device, uint64_t * swords, + uint64_t * rwords, uint64_t * spkts, uint64_t * rpkts) +{ + if (device >= infinipath_max || + !(devdata[device].ipath_flags & IPATH_PRESENT)) { + _IPATH_DBG("illegal unit %u\n", device); + return; + } + if (!(devdata[device].ipath_flags & IPATH_INITTED)) { + /* no hardware, freeze, etc. */ + _IPATH_DBG("unit %u not usable\n", device); + return; + } + *swords = ipath_snap_cntr(device, cr_wordsendcnt); + *rwords = ipath_snap_cntr(device, cr_wordrcvcnt); + *spkts = ipath_snap_cntr(device, cr_pktsendcnt); + *rpkts = ipath_snap_cntr(device, cr_pktrcvcnt); +} + +/* + * Return the counters needed by recv_pma_get_portcounters(). + */ +void ipath_layer_get_counters(const ipath_type device, + struct ipath_layer_counters *cntrs) +{ + if (device >= infinipath_max || + !(devdata[device].ipath_flags & IPATH_PRESENT)) { + _IPATH_DBG("illegal unit %u\n", device); + return; + } + if (!(devdata[device].ipath_flags & IPATH_INITTED)) { + /* no hardware, freeze, etc. 
*/ + _IPATH_DBG("unit %u not usable\n", device); + return; + } + cntrs->symbol_error_counter = + ipath_snap_cntr(device, cr_ibsymbolerrcnt); + cntrs->link_error_recovery_counter = + ipath_snap_cntr(device, cr_iblinkerrrecovcnt); + cntrs->link_downed_counter = ipath_snap_cntr(device, cr_iblinkdowncnt); + cntrs->port_rcv_errors = ipath_snap_cntr(device, cr_err_rlencnt) + + ipath_snap_cntr(device, cr_invalidrlencnt) + + ipath_snap_cntr(device, cr_erricrccnt) + + ipath_snap_cntr(device, cr_errvcrccnt) + + ipath_snap_cntr(device, cr_badformatcnt); + cntrs->port_rcv_remphys_errors = ipath_snap_cntr(device, cr_rcvebpcnt); + cntrs->port_xmit_discards = ipath_snap_cntr(device, cr_unsupvlcnt); + cntrs->port_xmit_data = ipath_snap_cntr(device, cr_wordsendcnt); + cntrs->port_rcv_data = ipath_snap_cntr(device, cr_wordrcvcnt); + cntrs->port_xmit_packets = ipath_snap_cntr(device, cr_pktsendcnt); + cntrs->port_rcv_packets = ipath_snap_cntr(device, cr_pktrcvcnt); +} + +void ipath_layer_want_buffer(const ipath_type device) +{ + atomic_set_mask(INFINIPATH_S_PIOINTBUFAVAIL, + &devdata[device].ipath_sendctrl); + ipath_kput_kreg(device, kr_sendctrl, devdata[device].ipath_sendctrl); +} + +int ipath_layer_send(const ipath_type device, void *hdr, void *data, + uint32_t datawords) +{ + int ret = 0; + uint32_t __iomem *piobuf; + uint32_t plen; + uint16_t vlsllnh; + + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u, failing\n", device); + return -EINVAL; + } + if (!(devdata[device].ipath_flags & IPATH_RCVHDRSZ_SET)) { + _IPATH_DBG("send while not open\n"); + ret = -EINVAL; + } else + if ((devdata[device].ipath_flags & (IPATH_LINKUNK | IPATH_LINKDOWN)) + || devdata[device].ipath_lid == 0) { + /* lid check is for when sma hasn't yet configured */ + ret = -ENETDOWN; + _IPATH_VDBG("send while not ready, mylid=%u, flags=0x%x\n", + devdata[device].ipath_lid, + devdata[device].ipath_flags); + } + /* +1 is for the qword padding of pbc */ + plen = (sizeof(struct ips_message_header_typ) >> 2) + datawords + 1; + if (plen > (devdata[device].ipath_ibmaxlen >> 2)) { + _IPATH_DBG("packet len 0x%x too long, failing\n", plen); + ret = -EINVAL; + } + vlsllnh = *((uint16_t *) hdr); + if (vlsllnh != htons(IPS_LRH_BTH)) { + _IPATH_DBG("Warning: lrh[0] wrong (%x, not %x); not sending\n", + vlsllnh, htons(IPS_LRH_BTH)); + ret = -EINVAL; + } + if (ret) + goto done; + + /* Get a PIO buffer to use. 
*/ + if (!(piobuf = ipath_getpiobuf(device, NULL))) { + ret = -EBUSY; + goto done; + } + + _IPATH_EPDBG("0x%x+1w pio %p\n", plen - 1, piobuf); + + /* len to control qword, no flags */ + writeq(plen, piobuf); + piobuf += 2; + memcpy_toio32(piobuf, hdr, + (sizeof(struct ips_message_header_typ) >> 2)); + piobuf += (sizeof(struct ips_message_header_typ) >> 2); + memcpy_toio32(piobuf, data, datawords); + + ipath_stats.sps_ether_spkts++; /* another ether packet sent */ + +done: + return ret; +} + +void ipath_layer_set_piointbufavail_int(const ipath_type device) +{ + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return; + } + + atomic_set_mask(INFINIPATH_S_PIOINTBUFAVAIL, + &devdata[device].ipath_sendctrl); + + ipath_kput_kreg(device, kr_sendctrl, devdata[device].ipath_sendctrl); +} + +void ipath_layer_enable_timer(const ipath_type device) +{ + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return; + } + + /* + * HT-400 has a design flaw where the chip and kernel idea + * of the tail register don't always agree, and therefore we won't + * get an interrupt on the next packet received. + * If the board supports per packet receive interrupts, use it. + * Otherwise, the timer function periodically checks for packets + * to cover this case. + * Either way, the timer is needed for verbs layer related + * processing. + */ + if (devdata[device].ipath_flags & IPATH_GPIO_INTR) { + ipath_kput_kreg(device, kr_debugportselect, 0x2074076542310UL); + /* Enable GPIO bit 2 interrupt */ + ipath_kput_kreg(device, kr_gpio_mask, (uint64_t)(1 << 2)); + } + + init_timer(&devdata[device].verbs_layer.l_timer); + devdata[device].verbs_layer.l_timer.function = ipath_verbs_timer; + devdata[device].verbs_layer.l_timer.data = (unsigned long)device; + devdata[device].verbs_layer.l_timer.expires = jiffies + 1; + add_timer(&devdata[device].verbs_layer.l_timer); +} + +void ipath_layer_disable_timer(const ipath_type device) +{ + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return; + } + + /* Disable GPIO bit 2 interrupt */ + if (devdata[device].ipath_flags & IPATH_GPIO_INTR) + ipath_kput_kreg(device, kr_gpio_mask, 0); + + del_timer_sync(&devdata[device].verbs_layer.l_timer); +} + +/* + * Get the verbs layer flags. + */ +unsigned ipath_verbs_get_flags(const ipath_type device) +{ + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return 0; + } + + return devdata[device].verbs_layer.l_flags; +} + +/* + * Set the verbs layer flags. + */ +void ipath_verbs_set_flags(const ipath_type device, unsigned flags) +{ + ipath_type s; + + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return; + } + + devdata[device].verbs_layer.l_flags = flags; + + for (s = 0; s < infinipath_max; s++) { + if (!(devdata[s].ipath_flags & IPATH_INITTED)) + continue; + if ((flags & IPATH_VERBS_KERNEL_SMA) && + !(*devdata[s].ipath_statusp & IPATH_STATUS_SMA)) { + *devdata[s].ipath_statusp |= IPATH_STATUS_OIB_SMA; + } else { + *devdata[s].ipath_statusp &= ~IPATH_STATUS_OIB_SMA; + } + } +} + +/* + * Return the size of the PKEY table for port 0. + */ +unsigned ipath_layer_get_npkeys(const ipath_type device) +{ + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return 0; + } + + return ARRAY_SIZE(devdata[device].ipath_pd[0]->port_pkeys); +} + +/* + * Return the indexed PKEY from the port 0 PKEY table. 
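+ * A caller that wants the whole table can equivalently loop:
+ *     for (i = 0; i < ipath_layer_get_npkeys(dev); i++)
+ *             pkeys[i] = ipath_layer_get_pkey(dev, i);
+ * which matches what ipath_layer_get_pkeys() below does with a
+ * single memcpy() (sketch only).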
+ */ +unsigned ipath_layer_get_pkey(const ipath_type device, unsigned index) +{ + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return 0; + } + if (index >= ARRAY_SIZE(devdata[device].ipath_pd[0]->port_pkeys)) + return 0; + + return devdata[device].ipath_pd[0]->port_pkeys[index]; +} + +/* + * Return the PKEY table for port 0. + */ +void ipath_layer_get_pkeys(const ipath_type device, uint16_t *pkeys) +{ + struct ipath_portdata *pd; + + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return; + } + + pd = devdata[device].ipath_pd[0]; + memcpy(pkeys, pd->port_pkeys, sizeof(pd->port_pkeys)); +} + +/* + * Decrement the reference count for the given PKEY. + * Return true if this was the last reference and the hardware table entry + * needs to be changed. + */ +static inline int rm_pkey(struct ipath_devdata *dd, uint16_t key) +{ + int i; + + for (i = 0; i < ARRAY_SIZE(dd->ipath_pkeys); i++) { + if (dd->ipath_pkeys[i] != key) + continue; + if (atomic_dec_and_test(&dd->ipath_pkeyrefs[i])) { + dd->ipath_pkeys[i] = 0; + return 1; + } + break; + } + return 0; +} + +/* + * Add the given PKEY to the hardware table. + * Return an error code if unable to add the entry, zero if no change, + * or 1 if the hardware PKEY register needs to be updated. + */ +static inline int add_pkey(struct ipath_devdata *dd, uint16_t key) +{ + int i; + uint16_t lkey = key & 0x7FFF; + int any = 0; + + for (i = 0; i < ARRAY_SIZE(dd->ipath_pkeys); i++) { + if (!dd->ipath_pkeys[i]) { + any++; + continue; + } + /* If it matches exactly, try to increment the ref count */ + if (dd->ipath_pkeys[i] == key) { + if (atomic_inc_return(&dd->ipath_pkeyrefs[i]) > 1) + return 0; + /* Lost the race. Look for an empty slot below. */ + atomic_dec(&dd->ipath_pkeyrefs[i]); + any++; + } + /* + * It makes no sense to have both the limited and unlimited + * PKEY set at the same time since the unlimited one will + * disable the limited one. + */ + if ((dd->ipath_pkeys[i] & 0x7FFF) == lkey) + return -EEXIST; + } + if (!any) + return -EBUSY; + for (i = 0; i < ARRAY_SIZE(dd->ipath_pkeys); i++) { + if (!dd->ipath_pkeys[i] && + atomic_inc_return(&dd->ipath_pkeyrefs[i]) == 1) { + /* for ipathstats, etc. */ + ipath_stats.sps_pkeys[i] = lkey; + dd->ipath_pkeys[i] = key; + return 1; + } + } + return -EBUSY; +} + +/* + * Set the PKEY table for port 0. + */ +int ipath_layer_set_pkeys(const ipath_type device, uint16_t *pkeys) +{ + struct ipath_portdata *pd; + struct ipath_devdata *dd; + int i; + int changed = 0; + + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return -EINVAL; + } + + dd = &devdata[device]; + pd = dd->ipath_pd[0]; + + for (i = 0; i < ARRAY_SIZE(pd->port_pkeys); i++) { + uint16_t key = pkeys[i]; + uint16_t okey = pd->port_pkeys[i]; + + if (key == okey) + continue; + /* + * The value of this PKEY table entry is changing. + * Remove the old entry in the hardware's array of PKEYs.
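+ * Bit 15 of a PKEY is the full-membership bit, so the "& 0x7FFF"
+ * tests below treat an entry as populated only if its 15 base bits
+ * are non-zero (0x0000 and 0x8000 both mean "no key"). Once anything
+ * changed, the four 16-bit ipath_pkeys[] entries are repacked into
+ * the 64-bit kr_partitionkey register further down; e.g. keys
+ * 0xFFFF, 0x8001, 0, 0 pack to 0x000000008001FFFF.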
+ */ + if (okey & 0x7FFF) + changed |= rm_pkey(dd, okey); + if (key & 0x7FFF) { + int ret = add_pkey(dd, key); + + if (ret < 0) + key = 0; + else + changed |= ret; + } + pd->port_pkeys[i] = key; + } + if (changed) { + uint64_t pkey; + + pkey = (uint64_t) dd->ipath_pkeys[0] | + ((uint64_t) dd->ipath_pkeys[1] << 16) | + ((uint64_t) dd->ipath_pkeys[2] << 32) | + ((uint64_t) dd->ipath_pkeys[3] << 48); + _IPATH_VDBG("p0 new pkey reg %llx\n", pkey); + ipath_kput_kreg(pd->port_unit, kr_partitionkey, pkey); + } + return 0; +} + +/* + * Registers that vary with the chip implementation constants (port) + * use this routine. + */ +uint64_t ipath_kget_kreg64_port(const ipath_type stype, ipath_kreg regno, + unsigned port) +{ + ipath_kreg tmp = + (port < devdata[stype].ipath_portcnt && regno == kr_rcvhdraddr) ? + regno + port : + ((port < devdata[stype].ipath_portcnt + && regno == kr_rcvhdrtailaddr) ? regno + port : __kr_invalid); + return ipath_kget_kreg64(stype, tmp); +} + +/* + * Registers that vary with the chip implementation constants (port) + * use this routine. + */ +void ipath_kput_kreg_port(const ipath_type stype, ipath_kreg regno, + unsigned port, uint64_t value) +{ + ipath_kreg tmp = + (port < devdata[stype].ipath_portcnt && regno == kr_rcvhdraddr) ? + regno + port : + ((port < devdata[stype].ipath_portcnt + && regno == kr_rcvhdrtailaddr) ? regno + port : __kr_invalid); + ipath_kput_kreg(stype, tmp, value); +} + +/* + * Returns zero if the default is POLL, 1 if the default is SLEEP. + */ +int ipath_layer_get_linkdowndefaultstate(const ipath_type device) +{ + struct ipath_devdata *dd; + + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return -EINVAL; + } + + dd = &devdata[device]; + return (dd->ipath_ibcctrl & INFINIPATH_IBCC_LINKDOWNDEFAULTSTATE) ? + 1 : 0; +} + +/* + * Note that this will only take effect when the link state changes. + */ +int ipath_layer_set_linkdowndefaultstate(const ipath_type device, int sleep) +{ + struct ipath_devdata *dd; + + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return -EINVAL; + } + + _IPATH_DBG("state %s\n", sleep ? "SLEEP" : "POLL"); + dd = &devdata[device]; + if (sleep) + dd->ipath_ibcctrl |= INFINIPATH_IBCC_LINKDOWNDEFAULTSTATE; + else + dd->ipath_ibcctrl &= ~INFINIPATH_IBCC_LINKDOWNDEFAULTSTATE; + return 0; +} + +int ipath_layer_get_phyerrthreshold(const ipath_type device) +{ + struct ipath_devdata *dd; + + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return -EINVAL; + } + + dd = &devdata[device]; + return (dd->ipath_ibcctrl >> INFINIPATH_IBCC_PHYERRTHRESHOLD_SHIFT) & + INFINIPATH_IBCC_PHYERRTHRESHOLD_MASK; +} + +/* + * Note that this will only take effect when the link state changes. 
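+ * Both threshold setters share a read-modify-write pattern on the
+ * cached ipath_ibcctrl: clear the 4-bit field at its shift, then OR
+ * in the new value. Neither writes kr_ibcctrl here, which is why the
+ * new threshold waits for the next link state transition.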
+ */ +int ipath_layer_set_phyerrthreshold(const ipath_type device, unsigned n) +{ + struct ipath_devdata *dd; + unsigned v; + + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return -EINVAL; + } + + dd = &devdata[device]; + v = (dd->ipath_ibcctrl >> INFINIPATH_IBCC_PHYERRTHRESHOLD_SHIFT) & + INFINIPATH_IBCC_PHYERRTHRESHOLD_MASK; + if (v != n) { + _IPATH_DBG("error threshold %u\n", n); + dd->ipath_ibcctrl &= + ~(INFINIPATH_IBCC_PHYERRTHRESHOLD_MASK << + INFINIPATH_IBCC_PHYERRTHRESHOLD_SHIFT); + dd->ipath_ibcctrl |= + (uint64_t)n << INFINIPATH_IBCC_PHYERRTHRESHOLD_SHIFT; + } + return 0; +} + +int ipath_layer_get_overrunthreshold(const ipath_type device) +{ + struct ipath_devdata *dd; + + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return -EINVAL; + } + + dd = &devdata[device]; + return (dd->ipath_ibcctrl >> INFINIPATH_IBCC_OVERRUNTHRESHOLD_SHIFT) & + INFINIPATH_IBCC_OVERRUNTHRESHOLD_MASK; +} + +/* + * Note that this will only take effect when the link state changes. + */ +int ipath_layer_set_overrunthreshold(const ipath_type device, unsigned n) +{ + struct ipath_devdata *dd; + unsigned v; + + if (device >= infinipath_max) { + _IPATH_DBG("Invalid unit %u\n", device); + return -EINVAL; + } + + dd = &devdata[device]; + v = (dd->ipath_ibcctrl >> INFINIPATH_IBCC_OVERRUNTHRESHOLD_SHIFT) & + INFINIPATH_IBCC_OVERRUNTHRESHOLD_MASK; + if (v != n) { + _IPATH_DBG("overrun threshold %u\n", n); + dd->ipath_ibcctrl &= + ~(INFINIPATH_IBCC_OVERRUNTHRESHOLD_MASK << + INFINIPATH_IBCC_OVERRUNTHRESHOLD_SHIFT); + dd->ipath_ibcctrl |= + (uint64_t)n << INFINIPATH_IBCC_OVERRUNTHRESHOLD_SHIFT; + } + return 0; +} + +EXPORT_SYMBOL(ipath_kset_linkstate); +EXPORT_SYMBOL(ipath_kset_mtu); +EXPORT_SYMBOL(ipath_layer_close); +EXPORT_SYMBOL(ipath_layer_get_bcast); +EXPORT_SYMBOL(ipath_layer_get_cr_errpkey); +EXPORT_SYMBOL(ipath_layer_get_deviceid); +EXPORT_SYMBOL(ipath_layer_get_flags); +EXPORT_SYMBOL(ipath_layer_get_guid); +EXPORT_SYMBOL(ipath_layer_get_ibmtu); +EXPORT_SYMBOL(ipath_layer_get_lastibcstat); +EXPORT_SYMBOL(ipath_layer_get_lid); +EXPORT_SYMBOL(ipath_layer_get_mac); +EXPORT_SYMBOL(ipath_layer_get_nguid); +EXPORT_SYMBOL(ipath_layer_get_num_of_dev); +EXPORT_SYMBOL(ipath_layer_get_pcidev); +EXPORT_SYMBOL(ipath_layer_open); +EXPORT_SYMBOL(ipath_layer_query_device); +EXPORT_SYMBOL(ipath_layer_register); +EXPORT_SYMBOL(ipath_layer_send); +EXPORT_SYMBOL(ipath_layer_set_guid); +EXPORT_SYMBOL(ipath_layer_set_piointbufavail_int); +EXPORT_SYMBOL(ipath_layer_snapshot_counters); +EXPORT_SYMBOL(ipath_layer_get_counters); +EXPORT_SYMBOL(ipath_layer_want_buffer); +EXPORT_SYMBOL(ipath_verbs_register); +EXPORT_SYMBOL(ipath_verbs_send); +EXPORT_SYMBOL(ipath_verbs_unregister); +EXPORT_SYMBOL(ipath_set_sps_lid); +EXPORT_SYMBOL(ipath_layer_enable_timer); +EXPORT_SYMBOL(ipath_layer_disable_timer); +EXPORT_SYMBOL(ipath_verbs_get_flags); +EXPORT_SYMBOL(ipath_verbs_set_flags); +EXPORT_SYMBOL(ipath_layer_get_npkeys); +EXPORT_SYMBOL(ipath_layer_get_pkey); +EXPORT_SYMBOL(ipath_layer_get_pkeys); +EXPORT_SYMBOL(ipath_layer_set_pkeys); +EXPORT_SYMBOL(ipath_layer_get_linkdowndefaultstate); +EXPORT_SYMBOL(ipath_layer_set_linkdowndefaultstate); +EXPORT_SYMBOL(ipath_layer_get_phyerrthreshold); +EXPORT_SYMBOL(ipath_layer_set_phyerrthreshold); +EXPORT_SYMBOL(ipath_layer_get_overrunthreshold); +EXPORT_SYMBOL(ipath_layer_set_overrunthreshold); From bos at pathscale.com Wed Dec 28 16:31:29 2005 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 28 Dec 2005 16:31:29 -0800 Subject: 
[openib-general] [PATCH 10 of 20] ipath - core driver, part 3 of 4 In-Reply-To: Message-ID: Signed-off-by: Bryan O'Sullivan diff -r dad2e87e21f4 -r c37b118ef806 drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Wed Dec 28 14:19:42 2005 -0800 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Wed Dec 28 14:19:42 2005 -0800 @@ -3878,3 +3878,1533 @@ /* process possible error packets in hdrq */ ipath_kreceive(t); } + +/* must only be called if ipath_pd[port] is known to be allocated */ +static inline void *ipath_get_egrbuf(const ipath_type t, uint32_t bufnum, + int err) +{ + return devdata[t].ipath_port0_skbs ? + (void *)devdata[t].ipath_port0_skbs[bufnum]->data : NULL; + +#ifdef _USE_FOR_DEBUGGING_ONLY + /* + * want routine to be inlined and fast this is here so if we do ports + * other than 0, I don't have to rewrite the code, since it's slightly + * complicated + */ + if (port != 1) { + void *chunkbase; + /* + * This calculation takes about 50 cycles. Could do + * what I did for protocol code, and have an array of + * addresses, getting it down to just a few cycles per + * lookup, at the cost of 16KB of memory. + */ + if (!devdata[t].ipath_pd[port]->port_rcvegrbuf_virt) + return NULL; + chunkbase = devdata[t].ipath_pd[port]->port_rcvegrbuf_virt + [bufnum / + devdata[t].ipath_pd[port]->port_rcvegrbufs_perchunk]; + return (void *)(chunkbase + + (bufnum % + devdata[t].ipath_pd[port]-> + port_rcvegrbufs_perchunk) + * devdata[t].ipath_rcvegrbufsize); + } +#endif +} + +/* receive an sma packet. Separate for better overall optimization */ +static void ipath_rcv_sma(const ipath_type t, uint32_t tlen, + uint64_t * rc, void *ebuf) +{ + int sindex, slen, elen; + void *smbuf; + uint8_t pad, *bthbytes; + + ipath_stats.sps_sma_rpkts++; /* another SMA packet received */ + + bthbytes = (uint8_t *)((struct ips_message_header_typ *) &rc[1])->bth; + + pad = (bthbytes[1] >> 4) & 3; + elen = tlen - (IPATH_SMA_HDRSZ + pad + (uint32_t) sizeof(uint32_t)); + if (elen > (SMA_MAX_PKTSZ - IPATH_SMA_HDRSZ)) + elen = SMA_MAX_PKTSZ - IPATH_SMA_HDRSZ; + + spin_lock_irq(&ipath_sma_lock); + sindex = ipath_sma_next; + smbuf = ipath_sma_data[sindex].buf; + ipath_sma_data[sindex].unit = t; + slen = ipath_sma_data[ipath_sma_next].len; + memcpy(smbuf, &rc[1], IPATH_SMA_HDRSZ); + memcpy(smbuf + IPATH_SMA_HDRSZ, ebuf, elen); + if (slen) { + /* + * overwriting a yet unread old one (buffer wrap), have to + * advance ipath_sma_first to next oldest + */ + + /* count OK packets that we drop */ + ipath_stats.sps_krdrops++; + if (++ipath_sma_first >= IPATH_NUM_SMAPKTS) + ipath_sma_first = 0; + } + slen = ipath_sma_data[sindex].len = elen + IPATH_SMA_HDRSZ; + if (++ipath_sma_next >= IPATH_NUM_SMAPKTS) + ipath_sma_next = 0; + spin_unlock_irq(&ipath_sma_lock); +} + +/* + * receive a packet for the layered (ethernet) driver. + * Separate routine for better overall optimization + */ +static void ipath_rcv_layer(const ipath_type t, uint32_t etail, + uint32_t tlen, struct ether_header_typ * hdr) +{ + uint32_t elen; + uint8_t pad, *bthbytes; + struct sk_buff *skb; + struct sk_buff *nskb; + struct ipath_devdata *dd = &devdata[t]; + struct ipath_portdata *pd; + unsigned long pa, pent; + uint64_t __iomem *egrbase; + uint64_t lenvalid; /* in words */ + + if (dd->ipath_port0_skbs && hdr->sub_opcode == OPCODE_ENCAP) { + /* + * Allocate a new sk_buff to replace the one we give + * to the network stack. 
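That replacement is the classic receive-ring swap: hand the filled buffer up the stack and immediately put a fresh one in its slot so the hardware never starves, as the allocation just below does. A sketch of the swap alone, with placeholder names (the real code additionally rewrites the hardware eager-ring entry with the new buffer's physical address):

/* Sketch: swap a fresh sk_buff into an eager-ring slot and return
 * the full one for delivery.  Returns NULL on allocation failure,
 * in which case the caller drops the packet and keeps the old skb. */
static struct sk_buff *ring_swap(struct sk_buff **ring, unsigned slot,
				 unsigned bufsize)
{
	struct sk_buff *fresh = dev_alloc_skb(bufsize + 4);
	struct sk_buff *full;

	if (!fresh)
		return NULL;
	skb_reserve(fresh, 4);	/* keep the payload aligned, as above */
	full = ring[slot];
	ring[slot] = fresh;	/* repost before handing 'full' upstream */
	return full;
}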
+ */ + if (!(nskb = dev_alloc_skb(dd->ipath_ibmaxlen + 4))) { + /* count OK packets that we drop */ + ipath_stats.sps_krdrops++; + return; + } + + bthbytes = (uint8_t *) hdr->bth; + pad = (bthbytes[1] >> 4) & 3; + /* +CRC32 */ + elen = tlen - (sizeof(*hdr) + pad + sizeof(uint32_t)); + + skb_reserve(nskb, 4); + + skb = dd->ipath_port0_skbs[etail]; + dd->ipath_port0_skbs[etail] = nskb; + skb_put(skb, elen); + + pd = dd->ipath_pd[0]; + lenvalid = (dd->ipath_ibmaxlen - pd->port_egrskip) >> 2; + lenvalid <<= INFINIPATH_RT_BUFSIZE_SHIFT; + lenvalid |= INFINIPATH_RT_VALID; + pa = virt_to_phys(nskb->data); + pa += pd->port_egrskip; + pent = (pa & INFINIPATH_RT_ADDR_MASK) | lenvalid; + /* This is simplified for port 0 */ + egrbase = (uint64_t __iomem *) + ((char __iomem *)(dd->ipath_kregbase) + + dd->ipath_rcvegrbase); + ipath_kput_memq(t, &egrbase[etail], pent); + + dd->ipath_layer.l_rcv(t, hdr, skb); + + /* another ether packet received */ + ipath_stats.sps_ether_rpkts++; + } else if (hdr->sub_opcode == OPCODE_LID_ARP) { + if (dd->ipath_layer.l_rcv_lid) + dd->ipath_layer.l_rcv_lid(t, hdr); + } + +} + +/* called from interrupt handler for errors or receive interrupt */ +void ipath_kreceive(const ipath_type t) +{ + uint64_t *rc; + void *ebuf; + struct ipath_devdata *dd = &devdata[t]; + const uint32_t rsize = dd->ipath_rcvhdrentsize; /* words */ + const uint32_t maxcnt = dd->ipath_rcvhdrcnt * rsize; /* in words */ + uint32_t etail = -1, l, hdrqtail, sma_this_time = 0; + struct ips_message_header_typ *hdr; + uint32_t eflags, i, etype, tlen, pkttot=0; + static uint64_t totcalls; /* stats, may eventually remove */ + char emsg[128]; + + if (!dd->ipath_hdrqtailptr) { + _IPATH_UNIT_ERROR(t, + "hdrqtailptr not set, can't do receives\n"); + return; + } + + if (test_and_set_bit(0, &dd->ipath_rcv_pending)) { + /* There is already a thread processing this queue. */ + return; + } + + if (dd->ipath_port0head == *dd->ipath_hdrqtailptr) + goto done; + +gotmore: + /* + * read only once at start. If in flood situation, this helps + * performance slightly. If more arrive while we are processing, + * we'll come back here and do them + */ + hdrqtail = *dd->ipath_hdrqtailptr; + + for (i = 0, l = dd->ipath_port0head; l != hdrqtail; i++) { + uint32_t qp; + uint8_t *bthbytes; + + + rc = (uint64_t *) (dd->ipath_pd[0]->port_rcvhdrq + (l << 2)); + hdr = (struct ips_message_header_typ *) & rc[1]; + /* + * could make a network order version of IPATH_KD_QP, and + * do the obvious shift before masking to speed this up. + */ + qp = ntohl(hdr->bth[1]) & 0xffffff; + bthbytes = (uint8_t *) hdr->bth; + + eflags = ips_get_hdr_err_flags((uint32_t*)rc); + etype = ips_get_rcv_type((uint32_t*)rc); + tlen = ips_get_length_in_bytes((uint32_t*)rc); /* total length */ + ebuf = NULL; + if (etype != RCVHQ_RCV_TYPE_EXPECTED) { + /* + * it turns out that the chips uses an eager buffer for + * all non-expected packets, whether it "needs" + * one or not. So always get the index, but + * don't set ebuf (so we try to copy data) + * unless the length requires it. + */ + etail = ips_get_index((uint32_t*)rc); + if (tlen > sizeof(*hdr) + || etype == RCVHQ_RCV_TYPE_NON_KD) { + ebuf = ipath_get_egrbuf(t, etail, 0); + } + } + + /* + * both tiderr and ipathhdrerr are set for all plain IB + * packets; only ipathhdrerr should be set. 
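Throughout the receive path above, protocol fields are pulled straight out of the raw BTH bytes: the opcode from byte 0, the pad count from bits 5:4 of byte 1, and the destination QP from the low 24 bits of the second word. A self-contained sketch of that decoding, matching the expressions in the code:

#include <arpa/inet.h>
#include <stdint.h>

/* Decode the BTH fields used above; 'bth' is the 12-byte IB Base
 * Transport Header in network byte order, viewed as 32-bit words. */
static inline uint8_t bth_opcode(const uint32_t *bth)
{
	return ((const uint8_t *) bth)[0];
}

static inline uint8_t bth_padcnt(const uint32_t *bth)
{
	return (((const uint8_t *) bth)[1] >> 4) & 3;	/* pad bits 5:4 */
}

static inline uint32_t bth_dest_qp(const uint32_t *bth)
{
	return ntohl(bth[1]) & 0xffffff;	/* low 24 bits of word 1 */
}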
+ */ + + if (etype != RCVHQ_RCV_TYPE_NON_KD + && etype != RCVHQ_RCV_TYPE_ERROR + && ips_get_ipath_ver(hdr->iph.ver_port_tid_offset) != + IPS_PROTO_VERSION) { + _IPATH_PDBG("Bad InfiniPath protocol version %x\n", + etype); + } + + if (eflags & + ~(INFINIPATH_RHF_H_TIDERR | INFINIPATH_RHF_H_IHDRERR)) { + get_rhf_errstring(eflags, emsg, sizeof emsg); + _IPATH_PDBG + ("RHFerrs %x hdrqtail=%x typ=%u tlen=%x opcode=%x egridx=%x: %s\n", + eflags, l, etype, tlen, bthbytes[0], + ips_get_index((uint32_t*)rc), emsg); + } else if (etype == RCVHQ_RCV_TYPE_NON_KD) { + /* + * If there is a userland SMA and this is a MAD packet, + * then pass it to the userland SMA. + */ + if (ipath_sma_alive && qp <= 1) { + /* + * count OK packets that we drop because + * SMA isn't yet running, or because we + * are in an sma flood (no point in + * constantly acquiring the spin lock, and + * overwriting previous packets). + * Eventually things will recover. + * Similarly if the sma consumer is + * so far behind that we would overwrite + * (yes, it's outside the lock) + */ + if (!ipath_sma_data_spare || + ipath_sma_data[ipath_sma_next].len || + ++sma_this_time > IPATH_NUM_SMAPKTS) { + ipath_stats.sps_krdrops++; + } else if (ebuf) { + ipath_rcv_sma(t, tlen, rc, ebuf); + } + } else if (dd->verbs_layer.l_rcv) { + dd->verbs_layer.l_rcv(t, rc + 1, ebuf, tlen); + } else { + _IPATH_VDBG("received IB packet, not SMA (QP=%x)\n", + qp); + } + } else if (etype == RCVHQ_RCV_TYPE_EAGER) { + if (qp == IPATH_KD_QP && bthbytes[0] == + dd->ipath_layer.l_rcv_opcode && ebuf) + ipath_rcv_layer(t, etail, tlen, + (struct ether_header_typ *)hdr); + else + _IPATH_PDBG + ("typ %x, opcode %x (eager, qp=%x), len %x; ignored\n", + etype, bthbytes[0], qp, tlen); + } else if (etype == RCVHQ_RCV_TYPE_EXPECTED) { + _IPATH_DBG("Bug: Expected TID, opcode %x; ignored\n", + hdr->bth[0] & 0xff); + } else if (eflags & + (INFINIPATH_RHF_H_TIDERR | INFINIPATH_RHF_H_IHDRERR)) + { + /* + * This is a type 3 packet, only the LRH is in + * the rcvhdrq, the rest of the header is in + * the eager buffer. + */ + uint8_t opcode; + if (ebuf) { + bthbytes = (uint8_t *) ebuf; + opcode = *bthbytes; + } else + opcode = 0; + get_rhf_errstring(eflags, emsg, sizeof emsg); + _IPATH_DBG + ("Err %x (%s), opcode %x, egrbuf %x, len %x\n", + eflags, emsg, opcode, etail, tlen); + } else { + /* + * error packet, type of error unknown. + * Probably type 3, but we don't know, so don't + * even try to print the opcode, etc. + */ + _IPATH_DBG + ("Error Pkt, but no eflags! egrbuf %x, len %x\n" + "hdrq@%lx;hdrq+%x rhf: %llx; hdr %llx %llx %llx %llx %llx\n", + etail, tlen, (unsigned long)rc, l, rc[0], rc[1], + rc[2], rc[3], rc[4], rc[5]); + } + l += rsize; + if (l >= maxcnt) + l = 0; + /* + * update for each packet, to help prevent overflows if we have + * lots of packets. 
+ */ + (void)ipath_kput_ureg(t, ur_rcvhdrhead, l, 0); + if (etype != RCVHQ_RCV_TYPE_EXPECTED) + (void)ipath_kput_ureg(t, ur_rcvegrindexhead, etail, 0); + } + + pkttot += i; + + dd->ipath_port0head = l; + + if (hdrqtail != *dd->ipath_hdrqtailptr) + goto gotmore; /* more arrived while we handled first batch */ + + if (pkttot > ipath_stats.sps_maxpkts_call) + ipath_stats.sps_maxpkts_call = pkttot; + ipath_stats.sps_port0pkts += pkttot; + ipath_stats.sps_avgpkts_call = ipath_stats.sps_port0pkts / ++totcalls; + + if (sma_this_time) /* only once at end, not each time */ + wake_up_interruptible(&ipath_sma_wait); + +done: + clear_bit(0, &dd->ipath_rcv_pending); + smp_mb__after_clear_bit(); +} + +/* + * Update our shadow copy of the PIO availability register map, called + * whenever our local copy indicates we have run out of send buffers + * NOTE: This can be called from interrupt context by ipath_bufavail() + * and from non-interrupt context by ipath_getpiobuf(). + */ + +static void ipath_update_pio_bufs(const ipath_type t) +{ + unsigned long flags; + int i; + const unsigned piobregs = (unsigned)devdata[t].ipath_pioavregs; + + /* If the generation (check) bits have changed, then we update the + * busy bit for the corresponding PIO buffer. This algorithm will + * modify positions to the value they already have in some cases + * (i.e., no change), but it's faster than changing only the bits + * that have changed. + * + * We would like to do this atomicly, to avoid spinlocks in the + * critical send path, but that's not really possible, given the + * type of changes, and that this routine could be called on multiple + * cpu's simultaneously, so we lock in this routine only, to avoid + * conflicting updates; all we change is the shadow, and it's a + * single 64 bit memory location, so by definition the update is + * atomic in terms of what other cpu's can see in testing the + * bits. The spin_lock overhead isn't too bad, since it only + * happens when all buffers are in use, so only cpu overhead, + * not latency or bandwidth is affected. + */ +#define _IPATH_ALL_CHECKBITS 0x5555555555555555ULL + if (!devdata[t].ipath_pioavailregs_dma) { + _IPATH_DBG("Update shadow pioavail, but regs_dma NULL!\n"); + return; + } + if (infinipath_debug & __IPATH_VERBDBG) { + /* only if packet debug and verbose */ + _IPATH_PDBG("Refill avail, dma0=%llx shad0=%llx, " + "d1=%llx s1=%llx, d2=%llx s2=%llx, d3=%llx s3=%llx\n", + devdata[t].ipath_pioavailregs_dma[0], + devdata[t].ipath_pioavailshadow[0], + devdata[t].ipath_pioavailregs_dma[1], + devdata[t].ipath_pioavailshadow[1], + devdata[t].ipath_pioavailregs_dma[2], + devdata[t].ipath_pioavailshadow[2], + devdata[t].ipath_pioavailregs_dma[3], + devdata[t].ipath_pioavailshadow[3]); + if (piobregs > 4) + _IPATH_PDBG("2nd group, dma4=%llx shad4=%llx, " + "d5=%llx s5=%llx, d6=%llx s6=%llx, d7=%llx s7=%llx\n", + devdata[t].ipath_pioavailregs_dma[4], + devdata[t].ipath_pioavailshadow[4], + devdata[t].ipath_pioavailregs_dma[5], + devdata[t].ipath_pioavailshadow[5], + devdata[t].ipath_pioavailregs_dma[6], + devdata[t].ipath_pioavailshadow[6], + devdata[t].ipath_pioavailregs_dma[7], + devdata[t].ipath_pioavailshadow[7]); + } + spin_lock_irqsave(&ipath_pioavail_lock, flags); + for (i = 0; i < piobregs; i++) { + uint64_t pchbusy, pchg, piov, pnew; + /* Chip Errata: bug 6641; even and odd qwords>3 are swapped */ + piov = devdata[t].ipath_pioavailregs_dma[i > 3 ? i ^ 1 : i]; + pchg = + _IPATH_ALL_CHECKBITS & ~(devdata[t]. 
+ ipath_pioavailshadow[i] ^ piov); + pchbusy = pchg << INFINIPATH_SENDPIOAVAIL_BUSY_SHIFT; + if (pchg && (pchbusy & devdata[t].ipath_pioavailshadow[i])) { + pnew = devdata[t].ipath_pioavailshadow[i] & ~pchbusy; + pnew |= piov & pchbusy; + devdata[t].ipath_pioavailshadow[i] = pnew; + } + } + spin_unlock_irqrestore(&ipath_pioavail_lock, flags); +} + +static int ipath_do_user_init(struct ipath_portdata *pd, + struct ipath_user_info __user *uinfo) +{ + int ret = 0; + ipath_type t = pd->port_unit; + struct ipath_devdata *dd = &devdata[t]; + struct ipath_user_info kinfo; + + if (copy_from_user(&kinfo, uinfo, sizeof kinfo)) + ret = -EFAULT; + else { + /* for now, if major version is different, bail */ + if ((kinfo.spu_userversion >> 16) != IPATH_USER_SWMAJOR) { + _IPATH_INFO + ("User major version %d not same as driver major %d\n", + kinfo.spu_userversion >> 16, IPATH_USER_SWMAJOR); + ret = -ENODEV; + } else { + if ((kinfo.spu_userversion & 0xffff) != + IPATH_USER_SWMINOR) + _IPATH_DBG + ("User minor version %d not same as driver minor %d\n", + kinfo.spu_userversion & 0xffff, + IPATH_USER_SWMINOR); + if (kinfo.spu_rcvhdrsize) { + if ((ret = + ipath_setrcvhdrsize(t, + kinfo.spu_rcvhdrsize))) + goto done; + } else if (!dd->ipath_rcvhdrsize) { + /* + * first user of field, kernel or user + * code, and using default + */ + dd->ipath_rcvhdrsize = IPATH_DFLT_RCVHDRSIZE; + ipath_kput_kreg(pd->port_unit, kr_rcvhdrsize, + dd->ipath_rcvhdrsize); + _IPATH_VDBG + ("Use default protocol header size %u\n", + dd->ipath_rcvhdrsize); + } + + pd->port_egrskip = kinfo.spu_egrskip; + if (pd->port_egrskip) { + if (pd->port_egrskip & 3) { + _IPATH_DBG + ("eager skip 0x%x invalid, must be word multiple; using 0x%x\n", + pd->port_egrskip, + pd->port_egrskip & ~3); + pd->port_egrskip &= ~3; + } + _IPATH_DBG + ("user reserves 0x%x bytes at start of eager TIDs\n", + pd->port_egrskip); + } + + /* + * for now we do nothing with rcvhdrcnt: + * kinfo.spu_rcvhdrcnt + */ + + /* + * set up for the rcvhdr Q tail register writeback + * to user memory + */ + if (kinfo.spu_rcvhdraddr && + access_ok(VERIFY_WRITE, + (uint64_t __user *) kinfo.spu_rcvhdraddr, + sizeof(uint64_t))) { + uint64_t physaddr, uaddr, off, atmp; + struct page *pagep; + off = offset_in_page(kinfo.spu_rcvhdraddr); + uaddr = + PAGE_MASK & (unsigned long)kinfo. 
+ spu_rcvhdraddr; + if ((ret = ipath_get_upages_nocopy(uaddr, &pagep))) { + _IPATH_INFO + ("Failed to lookup and lock address %llx for rcvhdrtail: errno %d\n", + kinfo.spu_rcvhdraddr, -ret); + goto done; + } + ipath_stats.sps_pagelocks++; + pd->port_rcvhdrtail_uaddr = uaddr; + pd->port_rcvhdrtail_pagep = pagep; + pd->port_rcvhdrtail_kvaddr = + page_address(pagep); + pd->port_rcvhdrtail_kvaddr += off; + physaddr = page_to_phys(pagep) + off; + _IPATH_VDBG + ("port %d user addr %llx hdrtailaddr, %llx physical (off=%llx)\n", + pd->port_port, kinfo.spu_rcvhdraddr, + physaddr, off); + ipath_kput_kreg_port(t, kr_rcvhdrtailaddr, + pd->port_port, physaddr); + atmp = + ipath_kget_kreg64_port(t, kr_rcvhdrtailaddr, + pd->port_port); + if (physaddr != atmp) { + _IPATH_UNIT_ERROR(t, + "Catastrophic software error, RcvHdrTailAddr%u written as %llx, read back as %llx\n", + pd->port_port, + physaddr, atmp); + ret = -EINVAL; + goto done; + } + } else { + _IPATH_DBG + ("Port %d rcvhdrtail addr %llx not valid\n", + pd->port_port, kinfo.spu_rcvhdraddr); + ret = -EINVAL; + goto done; + } + + /* + * for right now, kernel piobufs are at end, + * so port 1 is at 0 + */ + pd->port_piobufs = dd->ipath_piobufbase + + dd->ipath_pbufsport * (pd->port_port - + 1) * dd->ipath_palign; + _IPATH_VDBG("Set base of piobufs for port %u to 0x%x\n", + pd->port_port, pd->port_piobufs); + + /* + * Now allocate the rcvhdr Q and eager TIDs; + * skip the TID array for time being. + * If pd->port_port > chip-supported, we need + * to do extra stuff here to handle by handling + * overflow through port 0, someday + */ + if (!(ret = ipath_create_rcvhdrq(pd))) + ret = ipath_create_user_egr(pd); + if (!ret) { /* enable receives now */ + uint64_t head; + uint32_t head32; + /* atomically set enable bit for this port */ + atomic_set_mask(1U << + (INFINIPATH_R_PORTENABLE_SHIFT + + pd->port_port), + &dd->ipath_rcvctrl); + + /* + * set the head registers for this port + * to the current values of the tail + * pointers, since we don't know if they + * were updated on last use of the port. + */ + head32 = + ipath_kget_ureg32(t, ur_rcvhdrtail, + pd->port_port); + head = (uint64_t) head32; + ipath_kput_ureg(t, ur_rcvhdrhead, head, + pd->port_port); + head32 = + ipath_kget_ureg32(t, ur_rcvegrindextail, + pd->port_port); + ipath_kput_ureg(t, ur_rcvegrindexhead, head32, + pd->port_port); + dd->ipath_lastegrheads[pd->port_port] = -1; + dd->ipath_lastrcvhdrqtails[pd->port_port] = -1; + _IPATH_VDBG + ("Wrote port%d head %llx, egrhead %x from tail regs\n", + pd->port_port, head, head32); + /* start at beginning after open */ + pd->port_tidcursor = 0; + { + /* + * now enable the port; the tail + * registers will be written to + * memory by the chip as soon + * as it sees the write to + * kr_rcvctrl. The update only + * happens on transition from 0 + * to 1, so clear it first, then + * set it as part of enabling + * the port. This will (very + * briefly) affect any other open + * ports, but it shouldn't be long + * enough to be an issue. 
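The clear-then-set just described is a general trick for re-arming a control bit that the chip latches only on a 0-to-1 edge. A minimal sketch, with hypothetical names (put_kreg() stands in for the driver's register write; the shadow value already has the bit set):

#include <linux/types.h>

/* Sketch: force a rising edge on a bit that may already be high. */
static void pulse_bit(u64 shadow, u64 bit, void (*put_kreg)(u64 val))
{
	put_kreg(shadow & ~bit);	/* drive it low first */
	put_kreg(shadow);		/* 0 -> 1 edge the chip will see */
}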
+ */
+				ipath_kput_kreg(t, kr_rcvctrl,
+						dd->ipath_rcvctrl &
+						~INFINIPATH_R_TAILUPD);
+				ipath_kput_kreg(t, kr_rcvctrl,
+						dd->ipath_rcvctrl);
+			}
+		}
+	}
+
+done:
+	return ret;
+}
+
+static int ipath_get_baseinfo(struct ipath_portdata *pd,
+			      struct ipath_base_info __user *ubase)
+{
+	int ret = 0;
+	struct ipath_base_info kbase;
+	struct ipath_devdata *dd = &devdata[pd->port_unit];
+
+	/* be sure anything we don't set is 0ed */
+	memset(&kbase, 0, sizeof kbase);
+	kbase.spi_rcvhdr_cnt = dd->ipath_rcvhdrcnt;
+	kbase.spi_rcvhdrent_size = dd->ipath_rcvhdrentsize;
+	kbase.spi_tidegrcnt = dd->ipath_rcvegrcnt;
+	kbase.spi_rcv_egrbufsize = dd->ipath_rcvegrbufsize;
+	/* have to mmap whole thing */
+	kbase.spi_rcv_egrbuftotlen = pd->port_rcvegrbuf_chunks *
+		PAGE_SIZE * (1 << pd->port_rcvegrbuf_order);
+	kbase.spi_rcv_egrperchunk = pd->port_rcvegrbufs_perchunk;
+	kbase.spi_rcv_egrchunksize = kbase.spi_rcv_egrbuftotlen /
+		pd->port_rcvegrbuf_chunks;
+	kbase.spi_tidcnt = dd->ipath_rcvtidcnt;
+	/*
+	 * for this use, may be ipath_cfgports summed over all chips
+	 * that are configured and present
+	 */
+	kbase.spi_nports = dd->ipath_cfgports;
+	kbase.spi_unit = pd->port_unit;	/* unit (chip/board) our port is on */
+	/* for now, only a single page */
+	kbase.spi_tid_maxsize = PAGE_SIZE;
+
+	/*
+	 * doing this per port, and based on the skip value, etc.
+	 * This has to be the actual buffer size, since the protocol
+	 * code treats it as an array.
+	 *
+	 * These have to be set to user addresses in the user code via mmap.
+	 * These values are used on return to user code for the mmap target
+	 * addresses only.  For 32 bit, same 44 bit address problem, so use
+	 * the physical address, not virtual.  Before 2.6.11, using the
+	 * page_address() macro worked, but in 2.6.11, even that returns
+	 * the full 64 bit address (upper bits all 1's).
+	 * So far, using the physical addresses (or chip offsets, for
+	 * chip mapping) works, but no doubt some future kernel release
+	 * will change that, and we'll be on to yet another method of
+	 * dealing with this.
+	 */
+	kbase.spi_rcvhdr_base = (uint64_t) pd->port_rcvhdrq_phys;
+	kbase.spi_rcv_egrbufs = (uint64_t) pd->port_rcvegr_phys;
+	kbase.spi_pioavailaddr = (uint64_t) dd->ipath_pioavailregs_phys;
+	kbase.spi_status = (uint64_t) kbase.spi_pioavailaddr +
+		(void *)dd->ipath_statusp - (void *)dd->ipath_pioavailregs_dma;
+	kbase.spi_piobufbase = (uint64_t) pd->port_piobufs;
+	kbase.__spi_uregbase =
+		dd->ipath_uregbase + dd->ipath_palign * pd->port_port;
+
+	kbase.spi_pioindex = dd->ipath_pbufsport * (pd->port_port - 1);
+	kbase.spi_piocnt = dd->ipath_pbufsport;
+	kbase.spi_pioalign = dd->ipath_palign;
+
+	kbase.spi_qpair = IPATH_KD_QP;
+	kbase.spi_piosize = dd->ipath_ibmaxlen;
+	kbase.spi_mtu = dd->ipath_ibmaxlen;	/* maxlen, not ibmtu */
+	kbase.spi_port = pd->port_port;
+	kbase.spi_sw_version = IPATH_KERN_SWVERSION;
+	kbase.spi_hw_version = dd->ipath_revision;
+
+	if (copy_to_user(ubase, &kbase, sizeof kbase))
+		ret = -EFAULT;
+
+	return ret;
+}
+
+/*
+ * return the number of units supported by the driver.  This is
+ * infinipath_max, unless there are no initted units.
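Backing up to ipath_get_baseinfo() above: it follows the standard pattern for exporting a fixed-size info block, i.e. zero the whole struct first so reserved fields never leak kernel stack, fill it in, then copy it out in one shot. A stripped-down sketch; the struct and fields here are placeholders, not the driver's real ABI:

#include <linux/types.h>
#include <linux/string.h>
#include <asm/uaccess.h>

struct demo_info {
	u32 version;
	u32 nports;
	u64 reserved[4];
};

static int demo_get_info(struct demo_info __user *uinfo, u32 ver, u32 np)
{
	struct demo_info ki;

	memset(&ki, 0, sizeof(ki));	/* never leak stack bytes to user */
	ki.version = ver;
	ki.nports = np;
	return copy_to_user(uinfo, &ki, sizeof(ki)) ? -EFAULT : 0;
}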
+ */ +static int ipath_get_units(void) +{ + int i; + + for (i = 0; i < infinipath_max; i++) + if (devdata[i].ipath_flags & IPATH_INITTED) + return infinipath_max; + return 0; +} + +/* write data to the EEPROM on the board */ +static int ipath_wr_eeprom(struct ipath_portdata* pd, + struct ipath_eeprom_req __user *req) +{ + int ret = 0; + struct ipath_eeprom_req kreq; + void *buf = NULL; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; /* not just any old user can write flash */ + if (copy_from_user(&kreq, req, sizeof kreq)) + return -EFAULT; + if (!kreq.addr || (kreq.offset + kreq.len) > 128) { + _IPATH_DBG + ("called with NULL addr %llx, or bad cnt %u or offset %u\n", + kreq.addr, kreq.len, kreq.offset); + return -EINVAL; + } + + if (!(buf = vmalloc(kreq.len))) { + ret = -ENOMEM; + _IPATH_UNIT_ERROR(pd->port_unit, + "Couldn't allocate memory to write %u bytes from eeprom\n", + kreq.len); + goto done; + } + if (copy_from_user(buf, (void __user *) kreq.addr, kreq.len)) { + ret = -EFAULT; + goto done; + } + if (ipath_eeprom_write(pd->port_unit, kreq.offset, buf, kreq.len)) { + ret = -ENXIO; + _IPATH_UNIT_ERROR(pd->port_unit, + "Failed write to eeprom %u bytes offset %u\n", + kreq.len, kreq.offset); + } + +done: + if (buf) + vfree(buf); + return ret; +} + +/* read data from the EEPROM on the board */ +int ipath_rd_eeprom(const ipath_type port_unit, + struct ipath_eeprom_req __user *req) +{ + int ret = 0; + struct ipath_eeprom_req kreq; + void *buf = NULL; + + if (copy_from_user(&kreq, req, sizeof kreq)) + return -EFAULT; + if (!kreq.addr || (kreq.offset + kreq.len) > 128) { + _IPATH_DBG + ("called with NULL addr %llx, or bad cnt %u or offset %u\n", + kreq.addr, kreq.len, kreq.offset); + return -EINVAL; + } + + if (!(buf = vmalloc(kreq.len))) { + ret = -ENOMEM; + _IPATH_UNIT_ERROR(port_unit, + "Couldn't allocate memory to read %u bytes from eeprom\n", + kreq.len); + goto done; + } + if (ipath_eeprom_read(port_unit, kreq.offset, buf, kreq.len)) { + ret = -ENXIO; + _IPATH_UNIT_ERROR(port_unit, + "Failed reading %u bytes offset %u from eeprom\n", + kreq.len, kreq.offset); + } + if (copy_to_user((void __user *) kreq.addr, buf, kreq.len)) + ret = -EFAULT; + +done: + if (buf) + vfree(buf); + return ret; +} + +/* + * wait for something to happen on a port. Currently this is + * PIO buffer available, or a packet being received. For now, at + * least, we wait no longer than 1/2 seconds on rcv, 1 tick on PIO, so + * we recover from any bugs (or, as we see in ips.c init and close, cases + * where other side isn't yet ready). + * NOTE: currently called only with PIO or RCV, never both, so path with both + * has not been tested + */ +static int ipath_wait_intr(struct ipath_portdata * pd, uint32_t flag) +{ + struct ipath_devdata *dd = &devdata[pd->port_unit]; + /* stupid compiler can't tell it's initialized */ + uint32_t im = 0; + uint32_t head, tail, timeo = 0, wflag = 0; + + if (!(flag & (IPATH_WAIT_RCV | IPATH_WAIT_PIO))) + return -EINVAL; + if (flag & IPATH_WAIT_RCV) { + head = flag >> 16; + im = (1U << pd->port_port) << INFINIPATH_R_INTRAVAIL_SHIFT; + atomic_set_mask(im, &dd->ipath_rcvctrl); + /* + * now, before blocking, make sure that head is still == tail, + * reading from the chip, so we can be sure the interrupt enable + * has made it to the chip. If not equal, disable + * interrupt again and return immediately. This avoids + * races, and the overhead of the chip read doesn't + * matter much at this point, since we are waiting for + * something anyway. 
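That enable-then-recheck sequence is the standard way to close the "data arrived while I was arming the interrupt" race: the chip read both re-samples the tail and guarantees the enable has reached the hardware. Distilled to a sketch, with callbacks standing in for the register accesses above:

#include <linux/types.h>

/* Sketch: race-free arm-and-check before sleeping. */
static int should_sleep(u32 head, u32 (*read_tail)(void),
			void (*intr_enable)(int on))
{
	intr_enable(1);			/* arm the interrupt first */
	if (read_tail() != head) {	/* data raced in: don't sleep */
		intr_enable(0);
		return 0;
	}
	/*
	 * The read above also flushed the enable to the chip, so any
	 * packet arriving from here on will raise the interrupt.
	 */
	return 1;			/* caller does the actual wait */
}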
+ */ + ipath_kput_kreg(pd->port_unit, kr_rcvctrl, dd->ipath_rcvctrl); + tail = + ipath_kget_ureg32(pd->port_unit, ur_rcvhdrtail, + pd->port_port); + if (tail == head) { + timeo = HZ / 2; + wflag = IPATH_PORT_WAITING_RCV; + } else { + atomic_clear_mask(im, &dd->ipath_rcvctrl); + ipath_kput_kreg(pd->port_unit, kr_rcvctrl, + dd->ipath_rcvctrl); + } + } + if (flag & IPATH_WAIT_PIO) { + /* + * this one's a bit worse than the receive case, in that we + * can't really verify that at least one interrupt + * will happen... + * We do use a really short timeout, however + */ + timeo = 1; /* if both, the short PIO timeout wins */ + atomic_set_mask(1U << pd->port_port, &dd->ipath_portpiowait); + wflag |= IPATH_PORT_WAITING_PIO; + /* + * this has a possible race with the ipath stuff, so do + * it atomicly + */ + atomic_set_mask(INFINIPATH_S_PIOINTBUFAVAIL, + &dd->ipath_sendctrl); + ipath_kput_kreg(pd->port_unit, kr_sendctrl, dd->ipath_sendctrl); + } + if (wflag) { + pd->port_flag |= wflag; + wait_event_interruptible_timeout(pd->port_wait, + (pd->port_flag & wflag) != + wflag, timeo); + if (wflag & pd->port_flag & IPATH_PORT_WAITING_PIO) { + /* timed out, no PIO interrupts */ + atomic_clear_mask(IPATH_PORT_WAITING_PIO, + &pd->port_flag); + pd->port_piowait_to++; + atomic_clear_mask(1U << pd->port_port, + &dd->ipath_portpiowait); + /* + * *don't* clear the pio interrupt enable; + * let that happen in the interrupt handler; + * else we have a race condition. + */ + } + if (wflag & pd->port_flag & IPATH_PORT_WAITING_RCV) { + /* timed out, no packets received */ + atomic_clear_mask(IPATH_PORT_WAITING_RCV, + &pd->port_flag); + pd->port_rcvwait_to++; + atomic_clear_mask(im, &dd->ipath_rcvctrl); + ipath_kput_kreg(pd->port_unit, kr_rcvctrl, + dd->ipath_rcvctrl); + } + } else { + /* else it's already happened, don't do wait_event overhead */ + if (flag & IPATH_WAIT_RCV) + pd->port_rcvnowait++; + if (flag & IPATH_WAIT_PIO) + pd->port_pionowait++; + } + return 0; +} + +/* + * The new implementation as of Oct 2004 is that the driver assigns + * the tid and returns it to the caller. To make it easier to + * catch bugs, and to reduce search time, we keep a cursor for + * each port, walking the shadow tid array to find one that's not + * in use. + * + * For now, if we can't allocate the full list, we fail, although + * in the long run, we'll allocate as many as we can, and the + * caller will deal with that by trying the remaining pages later. + * That means that when we fail, we have to mark the tids as not in + * use again, in our shadow copy. + * + * It's up to the caller to free the tids when they are done. + * We'll unlock the pages as they free them. + * + * Also, right now we are locking one page at a time, but since + * the intended use of this routine is for a single group of + * virtually contiguous pages, that should change to improve + * performance. 
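Stripped of the hardware specifics, the cursor walk described above is a circular scan of the shadow array for a free slot. A sketch (the real code below interleaves this with page pinning and the chip TID writes):

/* Sketch: find a free TID starting from a per-port cursor, wrapping
 * once around the whole table.  Returns the slot or -1 if full. */
static int find_free_tid(struct page **shadow, unsigned tidcnt,
			 unsigned *cursor)
{
	unsigned tid = *cursor;
	unsigned n;

	for (n = 0; n < tidcnt; n++, tid++) {
		if (tid >= tidcnt)
			tid = 0;
		if (!shadow[tid]) {
			*cursor = tid + 1;	/* resume past it next time */
			return tid;
		}
	}
	return -1;			/* wrapped: all TIDs in use */
}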
+ */ +static int ipath_tid_update(struct ipath_portdata * pd, + struct _tidupd __user *tidu) +{ + int ret = 0, ntids; + uint32_t tid, porttid, cnt, i, tidcnt; + struct _tidupd tu; + uint16_t *tidlist; + struct ipath_devdata *dd = &devdata[pd->port_unit]; + uint64_t vaddr, physaddr, lenvalid; + uint64_t __iomem *tidbase; + uint64_t tidmap[8]; + struct page **pagep = NULL; + + tu.tidcnt = 0; /* for early errors */ + if (!dd->ipath_pageshadow) { + ret = -ENOMEM; + goto done; + } + if (copy_from_user(&tu, tidu, sizeof tu)) { + ret = -EFAULT; + goto done; + } + + if (!(cnt = tu.tidcnt)) { + _IPATH_DBG("After copyin, tidcnt 0, tidlist %llx\n", + tu.tidlist); + /* or should we treat as success? likely a bug */ + ret = -EFAULT; + goto done; + } + tidcnt = dd->ipath_rcvtidcnt; + if (cnt >= tidcnt) { /* make sure it all fits in port_tid_pg_list */ + _IPATH_INFO + ("Process tried to allocate %u TIDs, only trying max (%u)\n", + cnt, tidcnt); + cnt = tidcnt; + } + pagep = (struct page **)pd->port_tid_pg_list; + tidlist = (uint16_t *) (&pagep[cnt]); + + memset(tidmap, 0, sizeof(tidmap)); + tid = pd->port_tidcursor; + /* before decrement; chip actual # */ + porttid = pd->port_port * tidcnt; + ntids = tidcnt; + tidbase = (uint64_t __iomem *) + (((char __iomem *) devdata[pd->port_unit].ipath_kregbase) + + devdata[pd->port_unit].ipath_rcvtidbase + + porttid * sizeof(*tidbase)); + + _IPATH_VDBG("Port%u %u tids, cursor %u, tidbase %p\n", pd->port_port, + cnt, tid, tidbase); + + vaddr = tu.tidvaddr; /* virtual address of first page in transfer */ + if (!access_ok(VERIFY_WRITE, (void __user *) vaddr, cnt * PAGE_SIZE)) { + _IPATH_DBG("Fail vaddr %llx, %u pages, !access_ok\n", + vaddr, cnt); + ret = -EFAULT; + goto done; + } + if ((ret = ipath_get_upages((unsigned long)vaddr, cnt, pagep))) { + if (ret == -EBUSY) { + _IPATH_DBG + ("Failed to lock addr %p, %u pages (already locked)\n", + (void *)vaddr, cnt); + /* + * for now, continue, and see what happens + * but with the new implementation, this should + * never happen, unless perhaps the user has + * mpin'ed the pages themselves (something we + * need to test) + */ + ret = 0; + } else { + _IPATH_INFO + ("Failed to lock addr %p, %u pages: errno %d\n", + (void *)vaddr, cnt, -ret); + goto done; + } + } + for (i = 0; i < cnt; i++, vaddr += PAGE_SIZE) { + for (; ntids--; tid++) { + if (tid == tidcnt) + tid = 0; + if (!dd->ipath_pageshadow[porttid + tid]) + break; + } + if (ntids < 0) { + /* + * oops, wrapped all the way through their TIDs, + * and didn't have enough free; see comments at + * start of routine + */ + _IPATH_DBG + ("Not enough free TIDs for %u pages (index %d), failing\n", + cnt, i); + i--; /* last tidlist[i] not filled in */ + ret = -ENOMEM; + break; + } + tidlist[i] = tid; + _IPATH_VDBG("Updating idx %u to TID %u, vaddr %llx\n", + i, tid, vaddr); + /* for now we "know" system pages and TID pages are same size */ + /* for ipath_free_tid */ + dd->ipath_pageshadow[porttid + tid] = pagep[i]; + __set_bit(tid, tidmap); /* don't need atomic or it's overhead */ + physaddr = page_to_phys(pagep[i]); + ipath_stats.sps_pagelocks++; + _IPATH_VDBG("TID %u, vaddr %llx, physaddr %llx pgp %p\n", + tid, vaddr, physaddr, pagep[i]); + /* + * in words (fixed, full page). could make less for very last + * page in transfer, but for now we won't worry about it. 
+ */ + lenvalid = PAGE_SIZE >> 2; + lenvalid <<= INFINIPATH_RT_BUFSIZE_SHIFT; + physaddr |= lenvalid | INFINIPATH_RT_VALID; + ipath_kput_memq(pd->port_unit, &tidbase[tid], physaddr); + /* + * don't check this tid in ipath_portshadow, since we + * just filled it in; start with the next one. + */ + tid++; + } + + if (ret) { + uint32_t limit; + uint64_t tidval; + /* + * chip errata bug 7358, try to work around it by + * marking invalid tids as having max length + */ + tidval = + (-1LL & INFINIPATH_RT_BUFSIZE_MASK) << + INFINIPATH_RT_BUFSIZE_SHIFT; + cleanup: + /* jump here if copy out of updated info failed... */ + _IPATH_DBG("After failure (ret=%d), undo %d of %d entries\n", + -ret, i, cnt); + /* same code that's in ipath_free_tid() */ + if ((limit = sizeof(tidmap) * BITS_PER_BYTE) > tidcnt) + /* just in case size changes in future */ + limit = tidcnt; + tid = find_first_bit((const unsigned long *)tidmap, limit); + /* + * chip errata bug 7358, try to work around it by + * marking invalid tids as having max length + */ + tidval = + (-1LL & INFINIPATH_RT_BUFSIZE_MASK) << + INFINIPATH_RT_BUFSIZE_SHIFT; + for (; tid < limit; tid++) { + if (!test_bit(tid, tidmap)) + continue; + if (dd->ipath_pageshadow[porttid + tid]) { + _IPATH_VDBG("Freeing TID %u\n", tid); + ipath_kput_memq(pd->port_unit, &tidbase[tid], + tidval); + dd->ipath_pageshadow[porttid + tid] = NULL; + ipath_stats.sps_pageunlocks++; + } + } + ipath_putpages(cnt, pagep); + } else { + /* + * copy the updated array, with ipath_tid's filled in, + * back to user. Since we did the copy in already, this + * "should never fail" + * If it does, we have to clean up... + */ + int r; + if ((r = copy_to_user((void __user *) tu.tidlist, tidlist, + cnt * sizeof(*tidlist)))) { + _IPATH_DBG("Failed to copy out %d TIDs (%lx bytes) " + "to %llx (ret %x)\n", cnt, + cnt * sizeof(*tidlist), tu.tidlist, r); + ret = -EFAULT; + goto cleanup; + } + if (copy_to_user((void __user *) tu.tidmap, tidmap, + sizeof tidmap)) { + _IPATH_DBG("Failed to copy out TID map to %llx\n", + tu.tidmap); + ret = -EFAULT; + goto cleanup; + } + if (tid == tidcnt) + tid = 0; + pd->port_tidcursor = tid; + } + +done: + if (ret) + _IPATH_DBG("Failed to map %u TID pages, failing with %d, " + "tidu %p\n", tu.tidcnt, -ret, tidu); + return ret; +} + +/* + * right now we are unlocking one page at a time, but since + * the intended use of this routine is for a single group of + * virtually contiguous pages, that should change to improve + * performance. We check that the TID is in range for this port + * but otherwise don't check validity; if user has an error and + * frees the wrong tid, it's only their own data that can thereby + * be corrupted. We do check that the TID was in use, for sanity + * We always use our idea of the saved address, not the address that + * they pass in to us. 
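The free path in the function that follows walks the caller's bitmap with find_first_bit()/test_bit(). In isolation the walk looks like this (a sketch; 'limit' is pre-clamped to the real table size, as in the code):

#include <linux/bitops.h>

/* Sketch: visit every set bit in the caller-supplied TID bitmap. */
static unsigned free_marked_tids(const unsigned long *map, unsigned limit,
				 void (*free_one)(unsigned tid))
{
	unsigned tid, cnt = 0;

	for (tid = find_first_bit(map, limit); tid < limit; tid++) {
		if (!test_bit(tid, map))
			continue;
		free_one(tid);
		cnt++;		/* compared against the user's count later */
	}
	return cnt;
}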
+ */ + +static int ipath_tid_free(struct ipath_portdata * pd, + struct _tidupd __user *tidu) +{ + int ret = 0; + uint32_t tid, porttid, cnt, limit, tidcnt; + struct _tidupd tu; + struct ipath_devdata *dd = &devdata[pd->port_unit]; + uint64_t __iomem *tidbase; + uint64_t tidmap[8]; + uint64_t tidval; + + tu.tidcnt = 0; /* for early errors */ + if (!dd->ipath_pageshadow) { + ret = -ENOMEM; + goto done; + } + + if (copy_from_user(&tu, tidu, sizeof tu)) { + _IPATH_DBG("copy of tidupd structure failed\n"); + ret = -EFAULT; + goto done; + } + if (copy_from_user(tidmap, (void __user *) tu.tidmap, sizeof tidmap)) { + _IPATH_DBG("copy of tidmap failed\n"); + ret = -EFAULT; + goto done; + } + + porttid = pd->port_port * dd->ipath_rcvtidcnt; + tidbase = (uint64_t __iomem *) + ((char __iomem *) (devdata[pd->port_unit].ipath_kregbase) + + devdata[pd->port_unit].ipath_rcvtidbase + + porttid * sizeof(*tidbase)); + + tidcnt = dd->ipath_rcvtidcnt; + if ((limit = sizeof(tidmap) * BITS_PER_BYTE) > tidcnt) + limit = tidcnt; /* just in case size changes in future */ + tid = find_first_bit((const unsigned long *)tidmap, limit); + _IPATH_VDBG + ("Port%u free %u tids; first bit (max=%d) set is %d, porttid %u\n", + pd->port_port, tu.tidcnt, limit, tid, porttid); + /* + * chip errata bug 7358, try to work around it by marking invalid + * tids as having max length + */ + tidval = + (-1LL & INFINIPATH_RT_BUFSIZE_MASK) << INFINIPATH_RT_BUFSIZE_SHIFT; + for (cnt = 0; tid < limit; tid++) { + /* + * small optimization; if we detect a run of 3 or so without + * any set, use find_first_bit again. That's mainly to + * accelerate the case where we wrapped, so we have some at + * the beginning, and some at the end, and a big gap + * in the middle. + */ + if (!test_bit(tid, tidmap)) + continue; + cnt++; + if (dd->ipath_pageshadow[porttid + tid]) { + _IPATH_VDBG("Freeing TID %u\n", tid); + ipath_kput_memq(pd->port_unit, &tidbase[tid], tidval); + ipath_putpages(1, &dd->ipath_pageshadow[porttid + tid]); + dd->ipath_pageshadow[porttid + tid] = NULL; + ipath_stats.sps_pageunlocks++; + } else + _IPATH_DBG("Unused tid %u, ignoring\n", tid); + } + if (cnt != tu.tidcnt) + _IPATH_DBG("passed in tidcnt %d, only %d bits set in map\n", + tu.tidcnt, cnt); +done: + if (ret) + _IPATH_DBG("Failed to unmap %u TID pages, failing with %d\n", + tu.tidcnt, -ret); + return ret; +} + +/* called from user init code, and also layered driver init */ +int ipath_setrcvhdrsize(const ipath_type mdev, unsigned rhdrsize) +{ + int ret = 0; + if (devdata[mdev].ipath_flags & IPATH_RCVHDRSZ_SET) { + if (devdata[mdev].ipath_rcvhdrsize != rhdrsize) { + _IPATH_INFO + ("Error: can't set protocol header size %u, already %u\n", + rhdrsize, devdata[mdev].ipath_rcvhdrsize); + ret = -EAGAIN; + } else + /* OK if set already, with same value, nothing to do */ + _IPATH_VDBG("Reuse same protocol header size %u\n", + devdata[mdev].ipath_rcvhdrsize); + } else if (rhdrsize > + (devdata[mdev].ipath_rcvhdrentsize - + (sizeof(uint64_t) / sizeof(uint32_t)))) { + _IPATH_DBG + ("Error: can't set protocol header size %u (> max %u)\n", + rhdrsize, + devdata[mdev].ipath_rcvhdrentsize - + (uint32_t) (sizeof(uint64_t) / sizeof(uint32_t))); + ret = -EOVERFLOW; + } else { + devdata[mdev].ipath_flags |= IPATH_RCVHDRSZ_SET; + devdata[mdev].ipath_rcvhdrsize = rhdrsize; + ipath_kput_kreg(mdev, kr_rcvhdrsize, + devdata[mdev].ipath_rcvhdrsize); + _IPATH_VDBG("Set protocol header size to %u\n", + devdata[mdev].ipath_rcvhdrsize); + } + return ret; +} + + +/* + * find an available pio buffer, and do 
appropriate marking as busy, etc. + * returns buffer number if one found (>=0), negative number is error. + * Used by ipath_send_smapkt and ipath_layer_send + */ +uint32_t __iomem *ipath_getpiobuf(int mdev, uint32_t *pbufnum) +{ + int i, j, starti, updated = 0; + unsigned piobcnt, iter; + unsigned long flags; + struct ipath_devdata *dd = &devdata[mdev]; + uint64_t *shadow = dd->ipath_pioavailshadow; + uint32_t __iomem *buf; + + piobcnt = (unsigned)devdata[mdev].ipath_piobcnt; + starti = devdata[mdev].ipath_lastport_piobuf; + iter = piobcnt - starti; + if (dd->ipath_upd_pio_shadow) { + /* + * minor optimization. If we had no buffers on last call, start out + * by doing the update; continue and do scan even if no buffers + * were updated, to be paranoid + */ + ipath_update_pio_bufs(mdev); + updated = 1; /* we scanned here, don't do it at end of scan */ + i = starti; + } + else + i = devdata[mdev].ipath_lastpioindex; + +rescan: + /* + * while test_and_set_bit() is atomic, + * we do that and then the change_bit(), and the pair is not. + * See if this is the cause of the remaining armlaunch errors. + */ + spin_lock_irqsave(&ipath_pioavail_lock, flags); + for (j = 0; j < iter; j++, i++) { + if (i >= piobcnt) + i = starti; + /* + * To avoid bus lock overhead, we first find a candidate + * buffer, then do the test and set, and continue if that fails. + */ + if (test_bit((2 * i) + 1, shadow) || + test_and_set_bit((2 * i) + 1, shadow)) { + continue; + } + /* flip generation bit */ + change_bit(2 * i, shadow); + break; + } + spin_unlock_irqrestore(&ipath_pioavail_lock, flags); + + if (j == iter) { + /* + * first time through; shadow exhausted, but may be + * real buffers available, so go see; if any updated, rescan (once) + */ + if (!updated) { + ipath_update_pio_bufs(mdev); + updated = 1; + i = starti; + goto rescan; + } + dd->ipath_upd_pio_shadow = 1; + /* not atomic, but if we lose one once in a while, that's OK */ + ipath_stats.sps_nopiobufs++; + if (!(++dd->ipath_consec_nopiobuf % 100000)) { + _IPATH_DBG + ("%u pio sends with no bufavail; dmacopy: %llx %llx %llx %llx; shadow: %llx %llx %llx %llx\n", + dd->ipath_consec_nopiobuf, + dd->ipath_pioavailregs_dma[0], + dd->ipath_pioavailregs_dma[1], + dd->ipath_pioavailregs_dma[2], + dd->ipath_pioavailregs_dma[3], + shadow[0], shadow[1], shadow[2], shadow[3]); + /* + * 4 buffers per byte, 4 registers above, cover + * rest below + */ + if (dd->ipath_piobcnt > (sizeof(shadow[0]) + * 4 * 4)) + _IPATH_DBG + ("2nd group: dmacopy: %llx %llx %llx %llx; shadow: %llx %llx %llx %llx\n", + devdata[mdev].ipath_pioavailregs_dma[4], + devdata[mdev].ipath_pioavailregs_dma[5], + devdata[mdev].ipath_pioavailregs_dma[6], + devdata[mdev].ipath_pioavailregs_dma[7], + shadow[4], shadow[5], shadow[6], shadow[7]); + } + return NULL; + } + + if (updated && devdata[mdev].ipath_layer.l_intr) { + /* + * ran out of bufs, now some (at least this one we just got) + * are now available, so tell the layered driver. + */ + dd->ipath_layer.l_intr(mdev, IPATH_LAYER_INT_SEND_CONTINUE); + } + + /* + * set next starting place. 
Since it's just an optimization, + * it doesn't matter who wins on this, so no locking + */ + dd->ipath_lastpioindex = i + 1; + if (dd->ipath_upd_pio_shadow) + dd->ipath_upd_pio_shadow = 0; + if (dd->ipath_consec_nopiobuf) + dd->ipath_consec_nopiobuf = 0; + buf = (uint32_t __iomem *)(dd->ipath_piobase + i * dd->ipath_palign); + _IPATH_VDBG("Return piobuf %u @ %p\n", i, buf); + if (pbufnum) + *pbufnum = i; + return buf; +} + +/* + * this is like ipath_getpiobuf(), except it just probes to see if a buffer + * is available. If it returns that there is one, it's not allocated, + * and so may not be available if caller tries to send. + * NOTE: This can be called from interrupt context by ipath_intr() + * and from non-interrupt context by layer_send_getpiobuf(). + */ +int ipath_bufavail(int mdev) +{ + int i; + unsigned piobcnt; + uint64_t *shadow = devdata[mdev].ipath_pioavailshadow; + + piobcnt = (unsigned)devdata[mdev].ipath_piobcnt; + + for (i = devdata[mdev].ipath_lastport_piobuf; i < piobcnt; i++) + if (!test_bit((2 * i) + 1, shadow)) + return 1; + + /* if none, check for update and rescan if we updated */ + ipath_update_pio_bufs(mdev); + for (i = devdata[mdev].ipath_lastport_piobuf; i < piobcnt; i++) + if (!test_bit((2 * i) + 1, shadow)) + return 1; + _IPATH_PDBG("No bufs avail\n"); + return 0; +} + +/* + * This routine is no longer on any critical paths; it is used only + * for sending SMA packets, and some diagnostic usage. + * Because it's currently sma only, there are no checks to see if the + * link is up; sma must be able to send in the not fully initialized state + */ +int ipath_send_smapkt(struct ipath_sendpkt __user *upkt) +{ + int i, ret = 0; + uint32_t __iomem *piobuf; + uint32_t plen = 0, clen, pbufn; + struct ipath_sendpkt kpkt; + struct ipath_iovec *iov = kpkt.sps_iov; + ipath_type t; + uint32_t *tmpbuf = NULL; + + if (unlikely((copy_from_user(&kpkt, upkt, sizeof kpkt)))) + ret = -EFAULT; + if (ret) { + _IPATH_VDBG("Send failed: error %d\n", -ret); + goto done; + } + t = kpkt.sps_flags; + if (t >= infinipath_max || !(devdata[t].ipath_flags & IPATH_PRESENT) || + !devdata[t].ipath_kregbase) { + _IPATH_SMADBG("illegal unit %u for sma send\n", t); + return -ENODEV; + } + if (!(devdata[t].ipath_flags & IPATH_INITTED)) { + /* no hardware, freeze, etc. 
*/
+		_IPATH_SMADBG("unit %u not usable\n", t);
+		return -ENODEV;
+	}
+
+	/* need total length before first word written */
+	plen = sizeof(uint32_t);	/* +1 word is for the qword padding */
+	for (i = 0; i < kpkt.sps_cnt; i++)
+		/* each must be dword multiple */
+		plen += kpkt.sps_iov[i].iov_len;
+
+	if ((plen + 4) > devdata[t].ipath_ibmaxlen) {
+		_IPATH_DBG("Pkt len 0x%x > ibmaxlen %x\n",
+			   plen - 4, devdata[t].ipath_ibmaxlen);
+		ret = -EINVAL;
+		goto done;	/* before writing pbc */
+	}
+	if (!(tmpbuf = vmalloc(plen))) {
+		_IPATH_INFO("Unable to allocate tmp buffer, failing\n");
+		ret = -ENOMEM;
+		goto done;
+	}
+	plen >>= 2;	/* in words */
+
+	piobuf = ipath_getpiobuf(t, &pbufn);
+	if (!piobuf) {
+		ret = -EBUSY;
+		devdata[t].ipath_nosma_bufs++;
+		_IPATH_SMADBG("No PIO buffers available unit %u %u times\n",
+			      t, devdata[t].ipath_nosma_bufs);
+		goto done;
+	}
+	if (devdata[t].ipath_nosma_bufs) {
+		_IPATH_SMADBG(
+			"Unit %u got SMA send buffer after %u failures, %u seconds\n",
+			t, devdata[t].ipath_nosma_bufs,
+			devdata[t].ipath_nosma_secs);
+		devdata[t].ipath_nosma_bufs = 0;
+		devdata[t].ipath_nosma_secs = 0;
+	}
+	if ((devdata[t].ipath_lastibcstat & 0x11) != 0x11 &&
+	    (devdata[t].ipath_lastibcstat & 0x21) != 0x21) {
+		/* we need to be at least at INIT for SMA packets to go out.
+		 * If we aren't, something has gone wrong, and SMA hasn't
+		 * noticed.  Therefore we'll try to go to INIT here, in hopes
+		 * of fixing up the problem.  First we verify that indeed the
+		 * state is still "bad" (that is, that lastibcstat isn't
+		 * "stale") */
+		uint64_t val;
+		val = ipath_kget_kreg64(t, kr_ibcstatus);
+		if ((val & 0x11) != 0x11 && (val & 0x21) != 0x21) {
+			_IPATH_SMADBG("Invalid Link state 0x%llx unit %u for send, try INIT\n",
+				      val, t);
+			ipath_set_ib_lstate(t, INFINIPATH_IBCC_LINKCMD_INIT);
+			val = ipath_kget_kreg64(t, kr_ibcstatus);
+			if ((val & 0x11) != 0x11 && (val & 0x21) != 0x21)
+				_IPATH_SMADBG("Link state still not OK unit %u (0x%llx) after INIT\n",
+					      t, val);
+			else
+				_IPATH_SMADBG("Link state OK unit %u (0x%llx) after INIT\n",
+					      t, val);
+		}
+		/* and continue, regardless */
+	}
+
+	if (infinipath_debug & __IPATH_PKTDBG)	/* SMA and PKT, both */
+		_IPATH_SMADBG("unit %u 0x%x+1w pio%d, (scnt %d)\n",
+			      t, plen - 1, pbufn, kpkt.sps_cnt);
+
+	/* we have to flush after the PBC for correctness on some cpus,
+	 * or the WC buffer can be written out of order */
+	writeq(plen, piobuf);
+	mb();
+	ret = 0;
+	for (clen = i = 0; i < kpkt.sps_cnt; i++) {
+		if (unlikely(copy_from_user(tmpbuf + clen,
+					    (void __user *) iov->iov_base,
+					    iov->iov_len)))
+			ret = -EFAULT;	/* no break */
+		clen += iov->iov_len >> 2;
+		iov++;
+	}
+	/* copy all but the trigger word, then flush, so it's written
+	 * to the chip before the trigger word, then write the trigger
+	 * word, then flush again, so the packet is sent. */
+	memcpy_toio32(piobuf + 2, tmpbuf, clen - 1);
+	mb();
+	writel(tmpbuf[clen - 1], piobuf + clen + 1);
+	mb();
+
+	if (ret) {
+		/*
+		 * Packet is bad, so we need to use the PIO abort mechanism
+		 * to abort the packet
+		 */
+		uint32_t sendctrl;
+		sendctrl = devdata[t].ipath_sendctrl | INFINIPATH_S_DISARM |
+			(pbufn << INFINIPATH_S_DISARMPIOBUF_SHIFT);
+		_IPATH_DBG("Doing PIO abort on buffer %u after error\n",
+			   pbufn);
+		ipath_kput_kreg(t, kr_sendctrl, sendctrl);
+	}
+
+done:
+	vfree(tmpbuf);
+	return ret;
+}
+
+/*
+ * implementation of the ioctl to get the counter values from the chip.
+ * For the time being, we get all of them when asked, no shadowing.
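An aside on the send path just above, before the counters comment continues: the barriers exist because the PIO buffer is write-combining, so the PBC, the payload, and the trigger (last) word must each be flushed in order or the chip can see them out of sequence. The discipline, condensed into a sketch (memcpy_toio32() is the driver's 32-bit MMIO copy helper, used as above):

/* Sketch: ordered PIO send.  The trigger word must reach the chip
 * strictly after everything else. */
static void pio_send_ordered(u32 __iomem *piobuf, u64 pbc,
			     const u32 *payload, u32 nwords)
{
	writeq(pbc, piobuf);	/* PBC (length/control) first */
	mb();			/* flush PBC before payload */
	memcpy_toio32(piobuf + 2, payload, nwords - 1);
	mb();			/* payload before trigger word */
	writel(payload[nwords - 1], piobuf + nwords + 1);
	mb();			/* push the trigger out of the WC buffer */
}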
+ * We need to shadow the byte counters at a minimum, because otherwise + * they will wrap in just a few seconds at full bandwidth + * The second argument is the user address to which we do the copy_to_user() + */ +static int ipath_get_counters(ipath_type t, + struct infinipath_counters __user *ucounters) +{ + int ret = 0; + uint64_t val; + uint64_t __user *ucreg; + uint16_t vcreg; + + ucreg = (uint64_t __user *) ucounters; + /* + * for now, let's do this one at a time. It's not the most + * optimal method, but it is simple, and has no intermediate + * memory requirements. + */ + for (vcreg = 0; + vcreg < (sizeof(struct infinipath_counters) / sizeof(val)); + vcreg++, ucreg++) { + ipath_creg creg = vcreg; + val = ipath_snap_cntr(t, creg); + if ((ret = copy_to_user(ucreg, &val, sizeof(val)))) { + _IPATH_DBG("copy_to_user error on counter %d\n", creg); + ret = -EFAULT; + break; + } + } + + return ret; +} From bos at pathscale.com Wed Dec 28 16:31:34 2005 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 28 Dec 2005 16:31:34 -0800 Subject: [openib-general] [PATCH 15 of 20] ipath - infiniband verbs support, part 1 of 3 In-Reply-To: Message-ID: <471b7a7a005c6ff26e1f.1135816294@eng-12.pathscale.com> Signed-off-by: Bryan O'Sullivan diff -r 26993cb5faee -r 471b7a7a005c drivers/infiniband/hw/ipath/ipath_verbs.c --- /dev/null Thu Jan 1 00:00:00 1970 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Wed Dec 28 14:19:43 2005 -0800 @@ -0,0 +1,2307 @@ +/* + * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#include "ips_common.h" +#include "ipath_layer.h" +#include "ipath_verbs.h" + +/* + * Compare the lower 24 bits of the two values. + * Returns an integer <, ==, or > than zero. 
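The shift-by-8 in cmp24() below turns a 24-bit modular difference into an ordinary signed comparison: any wraparound falls off the top of the 32-bit value. A self-contained userspace check of the idea (written with unsigned arithmetic to sidestep signed-shift pitfalls; ordering is only meaningful when the PSNs are within 2^23 of each other):

#include <assert.h>
#include <stdint.h>

/* Same idea as cmp24() below: compare 24-bit PSNs modulo 2^24. */
static int cmp24_demo(uint32_t a, uint32_t b)
{
	return (int32_t)((a - b) << 8);
}

int main(void)
{
	assert(cmp24_demo(0x000001, 0xFFFFFF) > 0);	/* 1 follows 0xFFFFFF */
	assert(cmp24_demo(0x000005, 0x000009) < 0);
	assert(cmp24_demo(0x000123, 0x000123) == 0);
	return 0;
}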
+ */ +static inline int cmp24(u32 a, u32 b) +{ + return (((int) a) - ((int) b)) << 8; +} + +#define MODNAME "ib_ipath" +#define DRIVER_LOAD_MSG "PathScale " MODNAME " loaded: " +#define PFX MODNAME ": " + + +#define BITS_PER_PAGE (PAGE_SIZE*BITS_PER_BYTE) +#define BITS_PER_PAGE_MASK (BITS_PER_PAGE-1) +#define mk_qpn(qpt, map, off) (((map) - (qpt)->map)*BITS_PER_PAGE + (off)) +#define find_next_offset(map, off) \ + find_next_zero_bit((map)->page, BITS_PER_PAGE, off) + +/* Not static, because we don't want the compiler removing it */ +const char ipath_verbs_version[] = "ipath_verbs " IPATH_IDSTR; + +unsigned int ib_ipath_qp_table_size = 251; +module_param(ib_ipath_qp_table_size, uint, 0444); +MODULE_PARM_DESC(ib_ipath_qp_table_size, "QP table size"); + +unsigned int ib_ipath_lkey_table_size = 12; +module_param(ib_ipath_lkey_table_size, uint, 0444); +MODULE_PARM_DESC(ib_ipath_lkey_table_size, + "LKEY table size in bits (2^n, 1 <= n <= 23)"); + +unsigned int ib_ipath_debug; /* debug mask */ +module_param(ib_ipath_debug, uint, 0644); +MODULE_PARM_DESC(ib_ipath_debug, "Verbs debug mask"); + + +static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_sge_state *ss, + u32 len, struct ib_send_wr *wr, struct ib_wc *wc); +static void ipath_ruc_loopback(struct ipath_qp *sqp, struct ib_wc *wc); +static int ipath_destroy_qp(struct ib_qp *ibqp); + +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("PathScale "); +MODULE_DESCRIPTION("Pathscale InfiniPath driver"); + +enum { + IPATH_FAULT_RC_DROP_SEND_F = 1, + IPATH_FAULT_RC_DROP_SEND_M, + IPATH_FAULT_RC_DROP_SEND_L, + IPATH_FAULT_RC_DROP_SEND_O, + IPATH_FAULT_RC_DROP_RDMA_WRITE_F, + IPATH_FAULT_RC_DROP_RDMA_WRITE_M, + IPATH_FAULT_RC_DROP_RDMA_WRITE_L, + IPATH_FAULT_RC_DROP_RDMA_WRITE_O, + IPATH_FAULT_RC_DROP_RDMA_READ_RESP_F, + IPATH_FAULT_RC_DROP_RDMA_READ_RESP_M, + IPATH_FAULT_RC_DROP_RDMA_READ_RESP_L, + IPATH_FAULT_RC_DROP_RDMA_READ_RESP_O, + IPATH_FAULT_RC_DROP_ACK, +}; + +enum { + IPATH_TRANS_INVALID = 0, + IPATH_TRANS_ANY2RST, + IPATH_TRANS_RST2INIT, + IPATH_TRANS_INIT2INIT, + IPATH_TRANS_INIT2RTR, + IPATH_TRANS_RTR2RTS, + IPATH_TRANS_RTS2RTS, + IPATH_TRANS_SQERR2RTS, + IPATH_TRANS_ANY2ERR, + IPATH_TRANS_RTS2SQD, /* XXX Wait for expected ACKs & signal event */ + IPATH_TRANS_SQD2SQD, /* error if not drained & parameter change */ + IPATH_TRANS_SQD2RTS, /* error if not drained */ +}; + +enum { + IPATH_POST_SEND_OK = 0x0001, + IPATH_POST_RECV_OK = 0x0002, + IPATH_PROCESS_RECV_OK = 0x0004, + IPATH_PROCESS_SEND_OK = 0x0008, +}; + +static int state_ops[IB_QPS_ERR + 1] = { + [IB_QPS_RESET] = 0, + [IB_QPS_INIT] = IPATH_POST_RECV_OK, + [IB_QPS_RTR] = IPATH_POST_RECV_OK | IPATH_PROCESS_RECV_OK, + [IB_QPS_RTS] = IPATH_POST_RECV_OK | IPATH_PROCESS_RECV_OK | + IPATH_POST_SEND_OK | IPATH_PROCESS_SEND_OK, + [IB_QPS_SQD] = IPATH_POST_RECV_OK | IPATH_PROCESS_RECV_OK | + IPATH_POST_SEND_OK, + [IB_QPS_SQE] = IPATH_POST_RECV_OK | IPATH_PROCESS_RECV_OK, + [IB_QPS_ERR] = 0, +}; + +/* + * Convert the AETH credit code into the number of credits. 
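The table that follows is piecewise exponential: five bits of AETH code span 0 through 32768 credits. To show the intent, here is a hypothetical inverse (the driver itself only decodes), picking the largest code whose value does not exceed a requested credit count, assuming the ascending 31-entry table below:

/* Sketch: credits -> AETH credit code, given credit_table[] below. */
static u32 credits_to_code(const u32 *table, u32 credits)
{
	u32 code = 0;

	while (code < 30 && table[code + 1] <= credits)
		code++;
	return code;	/* 0x00 .. 0x1E */
}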
+ */ +static u32 credit_table[31] = { + 0, /* 0 */ + 1, /* 1 */ + 2, /* 2 */ + 3, /* 3 */ + 4, /* 4 */ + 6, /* 5 */ + 8, /* 6 */ + 12, /* 7 */ + 16, /* 8 */ + 24, /* 9 */ + 32, /* A */ + 48, /* B */ + 64, /* C */ + 96, /* D */ + 128, /* E */ + 192, /* F */ + 256, /* 10 */ + 384, /* 11 */ + 512, /* 12 */ + 768, /* 13 */ + 1024, /* 14 */ + 1536, /* 15 */ + 2048, /* 16 */ + 3072, /* 17 */ + 4096, /* 18 */ + 6144, /* 19 */ + 8192, /* 1A */ + 12288, /* 1B */ + 16384, /* 1C */ + 24576, /* 1D */ + 32768 /* 1E */ +}; + +/* + * Convert the AETH RNR timeout code into the number of milliseconds. + */ +static u32 rnr_table[32] = { + 656, /* 0 */ + 1, /* 1 */ + 1, /* 2 */ + 1, /* 3 */ + 1, /* 4 */ + 1, /* 5 */ + 1, /* 6 */ + 1, /* 7 */ + 1, /* 8 */ + 1, /* 9 */ + 1, /* A */ + 1, /* B */ + 1, /* C */ + 1, /* D */ + 2, /* E */ + 2, /* F */ + 3, /* 10 */ + 4, /* 11 */ + 6, /* 12 */ + 8, /* 13 */ + 11, /* 14 */ + 16, /* 15 */ + 21, /* 16 */ + 31, /* 17 */ + 41, /* 18 */ + 62, /* 19 */ + 82, /* 1A */ + 123, /* 1B */ + 164, /* 1C */ + 246, /* 1D */ + 328, /* 1E */ + 492 /* 1F */ +}; + +/* + * Translate ib_wr_opcode into ib_wc_opcode. + */ +static enum ib_wc_opcode wc_opcode[] = { + [IB_WR_RDMA_WRITE] = IB_WC_RDMA_WRITE, + [IB_WR_RDMA_WRITE_WITH_IMM] = IB_WC_RDMA_WRITE, + [IB_WR_SEND] = IB_WC_SEND, + [IB_WR_SEND_WITH_IMM] = IB_WC_SEND, + [IB_WR_RDMA_READ] = IB_WC_RDMA_READ, + [IB_WR_ATOMIC_CMP_AND_SWP] = IB_WC_COMP_SWAP, + [IB_WR_ATOMIC_FETCH_AND_ADD] = IB_WC_FETCH_ADD +}; + +/* + * Array of device pointers. + */ +static uint32_t number_of_devices; +static struct ipath_ibdev **ipath_devices; + +/* + * Global table of GID to attached QPs. + * The table is global to all ipath devices since a send from one QP/device + * needs to be locally routed to any locally attached QPs on the same + * or different device. + */ +static struct rb_root mcast_tree; +static spinlock_t mcast_lock = SPIN_LOCK_UNLOCKED; + +/* + * Allocate a structure to link a QP to the multicast GID structure. + */ +static struct ipath_mcast_qp *ipath_mcast_qp_alloc(struct ipath_qp *qp) +{ + struct ipath_mcast_qp *mqp; + + mqp = kmalloc(sizeof(*mqp), GFP_KERNEL); + if (!mqp) + return NULL; + + mqp->qp = qp; + atomic_inc(&qp->refcount); + + return mqp; +} + +static void ipath_mcast_qp_free(struct ipath_mcast_qp *mqp) +{ + struct ipath_qp *qp = mqp->qp; + + /* Notify ipath_destroy_qp() if it is waiting. */ + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); + + kfree(mqp); +} + +/* + * Allocate a structure for the multicast GID. + * A list of QPs will be attached to this structure. + */ +static struct ipath_mcast *ipath_mcast_alloc(union ib_gid *mgid) +{ + struct ipath_mcast *mcast; + + mcast = kmalloc(sizeof(*mcast), GFP_KERNEL); + if (!mcast) + return NULL; + + mcast->mgid = *mgid; + INIT_LIST_HEAD(&mcast->qp_list); + init_waitqueue_head(&mcast->wait); + atomic_set(&mcast->refcount, 0); + + return mcast; +} + +static void ipath_mcast_free(struct ipath_mcast *mcast) +{ + struct ipath_mcast_qp *p, *tmp; + + list_for_each_entry_safe(p, tmp, &mcast->qp_list, list) + ipath_mcast_qp_free(p); + + kfree(mcast); +} + +/* + * Search the global table for the given multicast GID. + * Return it or NULL if not found. + * The caller is responsible for decrementing the reference count if found. 
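Since the lookup that follows returns with the reference count already raised, every caller owes a matching drop once it is done with the group. One plausible usage pattern, assuming the wait-queue protocol used by the detach path above (deliver_to_qps() is a placeholder):

/* Sketch: balanced use of a looked-up mcast group. */
static void route_mcast_packet(union ib_gid *mgid, void *packet)
{
	struct ipath_mcast *mcast = ipath_mcast_find(mgid);

	if (!mcast)
		return;				/* no local receivers */
	deliver_to_qps(mcast, packet);		/* ref held: list stable */
	atomic_dec(&mcast->refcount);
	wake_up(&mcast->wait);			/* detach may be waiting */
}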
+ */ +static struct ipath_mcast *ipath_mcast_find(union ib_gid *mgid) +{ + struct rb_node *n; + unsigned long flags; + + spin_lock_irqsave(&mcast_lock, flags); + n = mcast_tree.rb_node; + while (n) { + struct ipath_mcast *mcast; + int ret; + + mcast = rb_entry(n, struct ipath_mcast, rb_node); + + ret = memcmp(mgid->raw, mcast->mgid.raw, sizeof(union ib_gid)); + if (ret < 0) + n = n->rb_left; + else if (ret > 0) + n = n->rb_right; + else { + atomic_inc(&mcast->refcount); + spin_unlock_irqrestore(&mcast_lock, flags); + return mcast; + } + } + spin_unlock_irqrestore(&mcast_lock, flags); + + return NULL; +} + +/* + * Insert the multicast GID into the table and + * attach the QP structure. + * Return zero if both were added. + * Return EEXIST if the GID was already in the table but the QP was added. + * Return ESRCH if the QP was already attached and neither structure was added. + */ +static int ipath_mcast_add(struct ipath_mcast *mcast, + struct ipath_mcast_qp *mqp) +{ + struct rb_node **n = &mcast_tree.rb_node; + struct rb_node *pn = NULL; + unsigned long flags; + + spin_lock_irqsave(&mcast_lock, flags); + + while (*n) { + struct ipath_mcast *tmcast; + struct ipath_mcast_qp *p; + int ret; + + pn = *n; + tmcast = rb_entry(pn, struct ipath_mcast, rb_node); + + ret = memcmp(mcast->mgid.raw, tmcast->mgid.raw, + sizeof(union ib_gid)); + if (ret < 0) { + n = &pn->rb_left; + continue; + } + if (ret > 0) { + n = &pn->rb_right; + continue; + } + + /* Search the QP list to see if this is already there. */ + list_for_each_entry_rcu(p, &tmcast->qp_list, list) { + if (p->qp == mqp->qp) { + spin_unlock_irqrestore(&mcast_lock, flags); + return ESRCH; + } + } + list_add_tail_rcu(&mqp->list, &tmcast->qp_list); + spin_unlock_irqrestore(&mcast_lock, flags); + return EEXIST; + } + + list_add_tail_rcu(&mqp->list, &mcast->qp_list); + + atomic_inc(&mcast->refcount); + rb_link_node(&mcast->rb_node, pn, n); + rb_insert_color(&mcast->rb_node, &mcast_tree); + + spin_unlock_irqrestore(&mcast_lock, flags); + + return 0; +} + +static int ipath_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, + u16 lid) +{ + struct ipath_qp *qp = to_iqp(ibqp); + struct ipath_mcast *mcast; + struct ipath_mcast_qp *mqp; + + /* + * Allocate data structures since its better to do this outside of + * spin locks and it will most likely be needed. + */ + mcast = ipath_mcast_alloc(gid); + if (mcast == NULL) + return -ENOMEM; + mqp = ipath_mcast_qp_alloc(qp); + if (mqp == NULL) { + ipath_mcast_free(mcast); + return -ENOMEM; + } + switch (ipath_mcast_add(mcast, mqp)) { + case ESRCH: + /* Neither was used: can't attach the same QP twice. */ + ipath_mcast_qp_free(mqp); + ipath_mcast_free(mcast); + return -EINVAL; + case EEXIST: /* The mcast wasn't used */ + ipath_mcast_free(mcast); + break; + default: + break; + } + return 0; +} + +static int ipath_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, + u16 lid) +{ + struct ipath_qp *qp = to_iqp(ibqp); + struct ipath_mcast *mcast = NULL; + struct ipath_mcast_qp *p, *tmp; + struct rb_node *n; + unsigned long flags; + int last = 0; + + spin_lock_irqsave(&mcast_lock, flags); + + /* Find the GID in the mcast table. */ + n = mcast_tree.rb_node; + while (1) { + int ret; + + if (n == NULL) { + spin_unlock_irqrestore(&mcast_lock, flags); + return 0; + } + + mcast = rb_entry(n, struct ipath_mcast, rb_node); + ret = memcmp(gid->raw, mcast->mgid.raw, sizeof(union ib_gid)); + if (ret < 0) + n = n->rb_left; + else if (ret > 0) + n = n->rb_right; + else + break; + } + + /* Search the QP list. 
*/ + list_for_each_entry_safe(p, tmp, &mcast->qp_list, list) { + if (p->qp != qp) + continue; + /* + * We found it, so remove it, but don't poison the forward link + * until we are sure there are no list walkers. + */ + list_del_rcu(&p->list); + + /* If this was the last attached QP, remove the GID too. */ + if (list_empty(&mcast->qp_list)) { + rb_erase(&mcast->rb_node, &mcast_tree); + last = 1; + } + break; + } + + spin_unlock_irqrestore(&mcast_lock, flags); + + if (p) { + /* + * Wait for any list walkers to finish before freeing the + * list element. + */ + wait_event(mcast->wait, atomic_read(&mcast->refcount) <= 1); + ipath_mcast_qp_free(p); + } + if (last) { + atomic_dec(&mcast->refcount); + wait_event(mcast->wait, !atomic_read(&mcast->refcount)); + ipath_mcast_free(mcast); + } + + return 0; +} + +/* + * Copy data to SGE memory. + */ +static void copy_sge(struct ipath_sge_state *ss, void *data, u32 length) +{ + struct ipath_sge *sge = &ss->sge; + + while (length) { + u32 len = sge->length; + + BUG_ON(len == 0); + if (len > length) + len = length; + memcpy(sge->vaddr, data, len); + sge->vaddr += len; + sge->length -= len; + sge->sge_length -= len; + if (sge->sge_length == 0) { + if (--ss->num_sge) + *sge = *ss->sg_list++; + } else if (sge->length == 0 && sge->mr != NULL) { + if (++sge->n >= IPATH_SEGSZ) { + if (++sge->m >= sge->mr->mapsz) + break; + sge->n = 0; + } + sge->vaddr = sge->mr->map[sge->m]->segs[sge->n].vaddr; + sge->length = sge->mr->map[sge->m]->segs[sge->n].length; + } + data += len; + length -= len; + } +} + +/* + * Skip over length bytes of SGE memory. + */ +static void skip_sge(struct ipath_sge_state *ss, u32 length) +{ + struct ipath_sge *sge = &ss->sge; + + while (length > sge->sge_length) { + length -= sge->sge_length; + ss->sge = *ss->sg_list++; + } + while (length) { + u32 len = sge->length; + + BUG_ON(len == 0); + if (len > length) + len = length; + sge->vaddr += len; + sge->length -= len; + sge->sge_length -= len; + if (sge->sge_length == 0) { + if (--ss->num_sge) + *sge = *ss->sg_list++; + } else if (sge->length == 0 && sge->mr != NULL) { + if (++sge->n >= IPATH_SEGSZ) { + if (++sge->m >= sge->mr->mapsz) + break; + sge->n = 0; + } + sge->vaddr = sge->mr->map[sge->m]->segs[sge->n].vaddr; + sge->length = sge->mr->map[sge->m]->segs[sge->n].length; + } + length -= len; + } +} + +static inline u32 alloc_qpn(struct ipath_qp_table *qpt) +{ + u32 i, offset, max_scan, qpn; + struct qpn_map *map; + + qpn = qpt->last + 1; + if (qpn >= QPN_MAX) + qpn = 2; + offset = qpn & BITS_PER_PAGE_MASK; + map = &qpt->map[qpn / BITS_PER_PAGE]; + max_scan = qpt->nmaps - !offset; + for (i = 0;;) { + if (unlikely(!map->page)) { + unsigned long page = get_zeroed_page(GFP_KERNEL); + unsigned long flags; + + /* + * Free the page if someone raced with us + * installing it: + */ + spin_lock_irqsave(&qpt->lock, flags); + if (map->page) + free_page(page); + else + map->page = (void *)page; + spin_unlock_irqrestore(&qpt->lock, flags); + if (unlikely(!map->page)) + break; + } + if (likely(atomic_read(&map->n_free))) { + do { + if (!test_and_set_bit(offset, map->page)) { + atomic_dec(&map->n_free); + qpt->last = qpn; + return qpn; + } + offset = find_next_offset(map, offset); + qpn = mk_qpn(qpt, map, offset); + /* + * This test differs from alloc_pidmap(). + * If find_next_offset() does find a zero bit, + * we don't need to check for QPN wrapping + * around past our starting QPN. We + * just need to be sure we don't loop forever. 
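+ *
+ * Bitmap-to-QPN mapping (sketch; mk_qpn() is defined elsewhere in
+ * the driver but presumably inverts the lookup above):
+ *
+ *	qpn = (map - qpt->map) * BITS_PER_PAGE + offset
+ *
+ * so advancing "offset" walks QPNs within one bitmap page and
+ * advancing "map" jumps BITS_PER_PAGE QPNs at a time.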
+ */ + } while (offset < BITS_PER_PAGE && qpn < QPN_MAX); + } + /* + * In order to keep the number of pages allocated to a minimum, + * we scan the all existing pages before increasing the size + * of the bitmap table. + */ + if (++i > max_scan) { + if (qpt->nmaps == QPNMAP_ENTRIES) + break; + map = &qpt->map[qpt->nmaps++]; + offset = 0; + } else if (map < &qpt->map[qpt->nmaps]) { + ++map; + offset = 0; + } else { + map = &qpt->map[0]; + offset = 2; + } + qpn = mk_qpn(qpt, map, offset); + } + return 0; +} + +static inline void free_qpn(struct ipath_qp_table *qpt, u32 qpn) +{ + struct qpn_map *map; + + map = qpt->map + qpn / BITS_PER_PAGE; + if (map->page) + clear_bit(qpn & BITS_PER_PAGE_MASK, map->page); + atomic_inc(&map->n_free); +} + +/* + * Allocate the next available QPN and put the QP into the hash table. + * The hash table holds a reference to the QP. + */ +static int ipath_alloc_qpn(struct ipath_qp_table *qpt, struct ipath_qp *qp, + enum ib_qp_type type) +{ + unsigned long flags; + u32 qpn; + + if (type == IB_QPT_SMI) + qpn = 0; + else if (type == IB_QPT_GSI) + qpn = 1; + else { + /* Allocate the next available QPN */ + qpn = alloc_qpn(qpt); + if (qpn == 0) { + return -ENOMEM; + } + } + qp->ibqp.qp_num = qpn; + + /* Add the QP to the hash table. */ + spin_lock_irqsave(&qpt->lock, flags); + + qpn %= qpt->max; + qp->next = qpt->table[qpn]; + qpt->table[qpn] = qp; + atomic_inc(&qp->refcount); + + spin_unlock_irqrestore(&qpt->lock, flags); + return 0; +} + +/* + * Remove the QP from the table so it can't be found asynchronously by + * the receive interrupt routine. + */ +static void ipath_free_qp(struct ipath_qp_table *qpt, struct ipath_qp *qp) +{ + struct ipath_qp *q, **qpp; + unsigned long flags; + int fnd = 0; + + spin_lock_irqsave(&qpt->lock, flags); + + /* Remove QP from the hash table. */ + qpp = &qpt->table[qp->ibqp.qp_num % qpt->max]; + for (; (q = *qpp) != NULL; qpp = &q->next) { + if (q == qp) { + *qpp = qp->next; + qp->next = NULL; + atomic_dec(&qp->refcount); + fnd = 1; + break; + } + } + + spin_unlock_irqrestore(&qpt->lock, flags); + + if (!fnd) + return; + + /* If QPN is not reserved, mark QPN free in the bitmap. */ + if (qp->ibqp.qp_num > 1) + free_qpn(qpt, qp->ibqp.qp_num); + + wait_event(qp->wait, !atomic_read(&qp->refcount)); +} + +/* + * Remove all QPs from the table. + */ +static void ipath_free_all_qps(struct ipath_qp_table *qpt) +{ + unsigned long flags; + struct ipath_qp *qp, *nqp; + u32 n; + + for (n = 0; n < qpt->max; n++) { + spin_lock_irqsave(&qpt->lock, flags); + qp = qpt->table[n]; + qpt->table[n] = NULL; + spin_unlock_irqrestore(&qpt->lock, flags); + + while (qp) { + nqp = qp->next; + if (qp->ibqp.qp_num > 1) + free_qpn(qpt, qp->ibqp.qp_num); + if (!atomic_dec_and_test(&qp->refcount) || + !ipath_destroy_qp(&qp->ibqp)) + _VERBS_INFO("QP memory leak!\n"); + qp = nqp; + } + } + + for (n = 0; n < ARRAY_SIZE(qpt->map); n++) { + if (qpt->map[n].page) + free_page((unsigned long)qpt->map[n].page); + } +} + +/* + * Return the QP with the given QPN. + * The caller is responsible for decrementing the QP reference count when done. 
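+ *
+ * Illustrative caller pattern (sketch only), matching the
+ * wait_event() on qp->wait in ipath_free_qp():
+ *
+ *	qp = ipath_lookup_qpn(qpt, qpn);
+ *	if (qp != NULL) {
+ *		... process the packet ...
+ *		if (atomic_dec_and_test(&qp->refcount))
+ *			wake_up(&qp->wait);
+ *	}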
+ */ +static struct ipath_qp *ipath_lookup_qpn(struct ipath_qp_table *qpt, u32 qpn) +{ + unsigned long flags; + struct ipath_qp *qp; + + spin_lock_irqsave(&qpt->lock, flags); + + for (qp = qpt->table[qpn % qpt->max]; qp; qp = qp->next) { + if (qp->ibqp.qp_num == qpn) { + atomic_inc(&qp->refcount); + break; + } + } + + spin_unlock_irqrestore(&qpt->lock, flags); + return qp; +} + +static int ipath_alloc_lkey(struct ipath_lkey_table *rkt, + struct ipath_mregion *mr) +{ + unsigned long flags; + u32 r; + u32 n; + + spin_lock_irqsave(&rkt->lock, flags); + + /* Find the next available LKEY */ + r = n = rkt->next; + for (;;) { + if (rkt->table[r] == NULL) + break; + r = (r + 1) & (rkt->max - 1); + if (r == n) { + spin_unlock_irqrestore(&rkt->lock, flags); + _VERBS_INFO("LKEY table full\n"); + return 0; + } + } + rkt->next = (r + 1) & (rkt->max - 1); + /* + * Make sure lkey is never zero which is reserved to indicate an + * unrestricted LKEY. + */ + rkt->gen++; + mr->lkey = (r << (32 - ib_ipath_lkey_table_size)) | + ((((1 << (24 - ib_ipath_lkey_table_size)) - 1) & rkt->gen) << 8); + if (mr->lkey == 0) { + mr->lkey |= 1 << 8; + rkt->gen++; + } + rkt->table[r] = mr; + spin_unlock_irqrestore(&rkt->lock, flags); + + return 1; +} + +static void ipath_free_lkey(struct ipath_lkey_table *rkt, u32 lkey) +{ + unsigned long flags; + u32 r; + + if (lkey == 0) + return; + r = lkey >> (32 - ib_ipath_lkey_table_size); + spin_lock_irqsave(&rkt->lock, flags); + rkt->table[r] = NULL; + spin_unlock_irqrestore(&rkt->lock, flags); +} + +/* + * Check the IB SGE for validity and initialize our internal version of it. + * Return 1 if OK, else zero. + */ +static int ipath_lkey_ok(struct ipath_lkey_table *rkt, struct ipath_sge *isge, + struct ib_sge *sge, int acc) +{ + struct ipath_mregion *mr; + size_t off; + + /* + * We use LKEY == zero to mean a physical kmalloc() address. + * This is a bit of a hack since we rely on dma_map_single() + * being reversible by calling bus_to_virt(). + */ + if (sge->lkey == 0) { + isge->mr = NULL; + isge->vaddr = bus_to_virt(sge->addr); + isge->length = sge->length; + isge->sge_length = sge->length; + return 1; + } + spin_lock(&rkt->lock); + mr = rkt->table[(sge->lkey >> (32 - ib_ipath_lkey_table_size))]; + spin_unlock(&rkt->lock); + if (unlikely(mr == NULL || mr->lkey != sge->lkey)) + return 0; + + off = sge->addr - mr->user_base; + if (unlikely(sge->addr < mr->user_base || + off + sge->length > mr->length || + (mr->access_flags & acc) != acc)) + return 0; + + off += mr->offset; + isge->mr = mr; + isge->m = 0; + isge->n = 0; + while (off >= mr->map[isge->m]->segs[isge->n].length) { + off -= mr->map[isge->m]->segs[isge->n].length; + if (++isge->n >= IPATH_SEGSZ) { + isge->m++; + isge->n = 0; + } + } + isge->vaddr = mr->map[isge->m]->segs[isge->n].vaddr + off; + isge->length = mr->map[isge->m]->segs[isge->n].length - off; + isge->sge_length = sge->length; + return 1; +} + +/* + * Initialize the qp->s_sge after a restart. + * The QP s_lock should be held. 
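+ *
+ * The skip length is simply the number of this WQE's packets that
+ * precede the restart PSN times the path MTU: e.g. restarting at
+ * s_psn == wqe->psn + 3 with a 2048-byte MTU skips
+ * 3 * 2048 = 6144 bytes of payload before resending.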
+ */ +static void ipath_init_restart(struct ipath_qp *qp, struct ipath_swqe *wqe) +{ + struct ipath_ibdev *dev; + u32 len; + + len = ((qp->s_psn - wqe->psn) & 0xFFFFFF) * + ib_mtu_enum_to_int(qp->path_mtu); + qp->s_sge.sge = wqe->sg_list[0]; + qp->s_sge.sg_list = wqe->sg_list + 1; + qp->s_sge.num_sge = wqe->wr.num_sge; + skip_sge(&qp->s_sge, len); + qp->s_len = wqe->length - len; + dev = to_idev(qp->ibqp.device); + spin_lock(&dev->pending_lock); + if (qp->timerwait.next == LIST_POISON1) + list_add_tail(&qp->timerwait, + &dev->pending[dev->pending_index]); + spin_unlock(&dev->pending_lock); +} + +/* + * Check the IB virtual address, length, and RKEY. + * Return 1 if OK, else zero. + * The QP r_rq.lock should be held. + */ +static int ipath_rkey_ok(struct ipath_ibdev *dev, struct ipath_sge_state *ss, + u32 len, u64 vaddr, u32 rkey, int acc) +{ + struct ipath_lkey_table *rkt = &dev->lk_table; + struct ipath_sge *sge = &ss->sge; + struct ipath_mregion *mr; + size_t off; + + spin_lock(&rkt->lock); + mr = rkt->table[(rkey >> (32 - ib_ipath_lkey_table_size))]; + spin_unlock(&rkt->lock); + if (unlikely(mr == NULL || mr->lkey != rkey)) + return 0; + + off = vaddr - mr->iova; + if (unlikely(vaddr < mr->iova || off + len > mr->length || + (mr->access_flags & acc) == 0)) + return 0; + + off += mr->offset; + sge->mr = mr; + sge->m = 0; + sge->n = 0; + while (off >= mr->map[sge->m]->segs[sge->n].length) { + off -= mr->map[sge->m]->segs[sge->n].length; + if (++sge->n >= IPATH_SEGSZ) { + sge->m++; + sge->n = 0; + } + } + sge->vaddr = mr->map[sge->m]->segs[sge->n].vaddr + off; + sge->length = mr->map[sge->m]->segs[sge->n].length - off; + sge->sge_length = len; + ss->sg_list = NULL; + ss->num_sge = 1; + return 1; +} + +/* + * Add a new entry to the completion queue. + * This may be called with one of the qp->s_lock or qp->r_rq.lock held. + */ +static void ipath_cq_enter(struct ipath_cq *cq, struct ib_wc *entry, int sig) +{ + unsigned long flags; + u32 next; + + spin_lock_irqsave(&cq->lock, flags); + + cq->queue[cq->head] = *entry; + next = cq->head + 1; + if (next == cq->ibcq.cqe) + next = 0; + if (likely(next != cq->tail)) + cq->head = next; + else { + spin_unlock_irqrestore(&cq->lock, flags); + if (cq->ibcq.event_handler) { + struct ib_event ev; + + ev.device = cq->ibcq.device; + ev.element.cq = &cq->ibcq; + ev.event = IB_EVENT_CQ_ERR; + cq->ibcq.event_handler(&ev, cq->ibcq.cq_context); + } + return; + } + + if (cq->notify == IB_CQ_NEXT_COMP || + (cq->notify == IB_CQ_SOLICITED && sig)) { + cq->notify = IB_CQ_NONE; + cq->triggered++; + /* + * This will cause send_complete() to be called in + * another thread. + */ + tasklet_schedule(&cq->comptask); + } + + spin_unlock_irqrestore(&cq->lock, flags); + + if (entry->status != IB_WC_SUCCESS) + to_idev(cq->ibcq.device)->n_wqe_errs++; +} + +static void send_complete(unsigned long data) +{ + struct ipath_cq *cq = (struct ipath_cq *)data; + + /* + * The completion handler will most likely rearm the notification + * and poll for all pending entries. If a new completion entry + * is added while we are in this routine, tasklet_schedule() + * won't call us again until we return so we check triggered to + * see if we need to call the handler again. + */ + for (;;) { + u8 triggered = cq->triggered; + + cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context); + + if (cq->triggered == triggered) + return; + } +} + +/* + * This is the QP state transition table. + * See ipath_modify_qp() for details. 
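+ *
+ * Validation sketch, mirroring the checks in ipath_modify_qp():
+ *
+ *	req = qp_state_table[cur][new].req_param[qp->ibqp.qp_type];
+ *	opt = qp_state_table[cur][new].opt_param[qp->ibqp.qp_type];
+ *	if ((req & attr_mask) != req ||
+ *	    (attr_mask & ~(req | opt | IB_QP_STATE)))
+ *		the transition is rejected;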
+ */ +static const struct { + int trans; + u32 req_param[IB_QPT_RAW_IPV6]; + u32 opt_param[IB_QPT_RAW_IPV6]; +} qp_state_table[IB_QPS_ERR + 1][IB_QPS_ERR + 1] = { + [IB_QPS_RESET] = { + [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = IPATH_TRANS_RST2INIT, + .req_param = { + [IB_QPT_SMI] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [IB_QPT_GSI] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [IB_QPT_UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [IB_QPT_UC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [IB_QPT_RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + }, + }, + }, + [IB_QPS_INIT] = { + [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = IPATH_TRANS_INIT2INIT, + .opt_param = { + [IB_QPT_SMI] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [IB_QPT_GSI] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [IB_QPT_UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [IB_QPT_UC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [IB_QPT_RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + } + }, + [IB_QPS_RTR] = { + .trans = IPATH_TRANS_INIT2RTR, + .req_param = { + [IB_QPT_UC] = (IB_QP_AV | + IB_QP_PATH_MTU | + IB_QP_DEST_QPN | + IB_QP_RQ_PSN), + [IB_QPT_RC] = (IB_QP_AV | + IB_QP_PATH_MTU | + IB_QP_DEST_QPN | + IB_QP_RQ_PSN | + IB_QP_MAX_DEST_RD_ATOMIC | + IB_QP_MIN_RNR_TIMER), + }, + .opt_param = { + [IB_QPT_SMI] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [IB_QPT_GSI] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [IB_QPT_UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [IB_QPT_UC] = (IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX), + [IB_QPT_RC] = (IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX), + } + } + }, + [IB_QPS_RTR] = { + [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = IPATH_TRANS_RTR2RTS, + .req_param = { + [IB_QPT_SMI] = IB_QP_SQ_PSN, + [IB_QPT_GSI] = IB_QP_SQ_PSN, + [IB_QPT_UD] = IB_QP_SQ_PSN, + [IB_QPT_UC] = IB_QP_SQ_PSN, + [IB_QPT_RC] = (IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_SQ_PSN | + IB_QP_MAX_QP_RD_ATOMIC), + }, + .opt_param = { + [IB_QPT_SMI] = (IB_QP_CUR_STATE | IB_QP_QKEY), + [IB_QPT_GSI] = (IB_QP_CUR_STATE | IB_QP_QKEY), + [IB_QPT_UD] = (IB_QP_CUR_STATE | IB_QP_QKEY), + [IB_QPT_UC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_PATH_MIG_STATE), + [IB_QPT_RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + } + } + }, + [IB_QPS_RTS] = { + [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = IPATH_TRANS_RTS2RTS, + .opt_param = { + [IB_QPT_SMI] = (IB_QP_CUR_STATE | IB_QP_QKEY), + [IB_QPT_GSI] = (IB_QP_CUR_STATE | IB_QP_QKEY), + [IB_QPT_UD] = (IB_QP_CUR_STATE | IB_QP_QKEY), + [IB_QPT_UC] = (IB_QP_ACCESS_FLAGS | + IB_QP_ALT_PATH | + IB_QP_PATH_MIG_STATE), + [IB_QPT_RC] = (IB_QP_ACCESS_FLAGS | + IB_QP_ALT_PATH | + IB_QP_PATH_MIG_STATE | + IB_QP_MIN_RNR_TIMER), + } + }, + [IB_QPS_SQD] = { + .trans = IPATH_TRANS_RTS2SQD, + }, + }, + [IB_QPS_SQD] = { + [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = IPATH_TRANS_SQD2RTS, + .opt_param = { + [IB_QPT_SMI] = (IB_QP_CUR_STATE | 
IB_QP_QKEY), + [IB_QPT_GSI] = (IB_QP_CUR_STATE | IB_QP_QKEY), + [IB_QPT_UD] = (IB_QP_CUR_STATE | IB_QP_QKEY), + [IB_QPT_UC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PATH_MIG_STATE), + [IB_QPT_RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + } + }, + [IB_QPS_SQD] = { + .trans = IPATH_TRANS_SQD2SQD, + .opt_param = { + [IB_QPT_SMI] = (IB_QP_CUR_STATE | IB_QP_QKEY), + [IB_QPT_GSI] = (IB_QP_CUR_STATE | IB_QP_QKEY), + [IB_QPT_UD] = (IB_QP_PKEY_INDEX | IB_QP_QKEY), + [IB_QPT_UC] = (IB_QP_AV | + IB_QP_TIMEOUT | + IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_PATH_MIG_STATE), + [IB_QPT_RC] = (IB_QP_AV | + IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_MAX_QP_RD_ATOMIC | + IB_QP_MAX_DEST_RD_ATOMIC | + IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + } + } + }, + [IB_QPS_SQE] = { + [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = IPATH_TRANS_SQERR2RTS, + .opt_param = { + [IB_QPT_SMI] = (IB_QP_CUR_STATE | IB_QP_QKEY), + [IB_QPT_GSI] = (IB_QP_CUR_STATE | IB_QP_QKEY), + [IB_QPT_UD] = (IB_QP_CUR_STATE | IB_QP_QKEY), + [IB_QPT_UC] = IB_QP_CUR_STATE, + [IB_QPT_RC] = (IB_QP_CUR_STATE | + IB_QP_MIN_RNR_TIMER), + } + } + }, + [IB_QPS_ERR] = { + [IB_QPS_RESET] = { .trans = IPATH_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = IPATH_TRANS_ANY2ERR } + } +}; + +/* + * Initialize the QP state to the reset state. + */ +static void ipath_reset_qp(struct ipath_qp *qp) +{ + qp->remote_qpn = 0; + qp->qkey = 0; + qp->qp_access_flags = 0; + qp->s_hdrwords = 0; + qp->s_psn = 0; + qp->r_psn = 0; + atomic_set(&qp->msn, 0); + if (qp->ibqp.qp_type == IB_QPT_RC) { + qp->s_state = IB_OPCODE_RC_SEND_LAST; + qp->r_state = IB_OPCODE_RC_SEND_LAST; + } else { + qp->s_state = IB_OPCODE_UC_SEND_LAST; + qp->r_state = IB_OPCODE_UC_SEND_LAST; + } + qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; + qp->s_nak_state = 0; + qp->s_rnr_timeout = 0; + qp->s_head = 0; + qp->s_tail = 0; + qp->s_cur = 0; + qp->s_last = 0; + qp->s_ssn = 1; + qp->s_lsn = 0; + qp->r_rq.head = 0; + qp->r_rq.tail = 0; + qp->r_reuse_sge = 0; +} + +/* + * Flush send work queue. + * The QP s_lock should be held. + */ +static void ipath_sqerror_qp(struct ipath_qp *qp, struct ib_wc *wc) +{ + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + struct ipath_swqe *wqe = get_swqe_ptr(qp, qp->s_last); + + _VERBS_INFO("Send queue error on QP%d/%d: err: %d\n", + qp->ibqp.qp_num, qp->remote_qpn, wc->status); + + spin_lock(&dev->pending_lock); + /* XXX What if its already removed by the timeout code? */ + if (qp->timerwait.next != LIST_POISON1) + list_del(&qp->timerwait); + if (qp->piowait.next != LIST_POISON1) + list_del(&qp->piowait); + spin_unlock(&dev->pending_lock); + + ipath_cq_enter(to_icq(qp->ibqp.send_cq), wc, 1); + if (++qp->s_last >= qp->s_size) + qp->s_last = 0; + + wc->status = IB_WC_WR_FLUSH_ERR; + + while (qp->s_last != qp->s_head) { + wc->wr_id = wqe->wr.wr_id; + wc->opcode = wc_opcode[wqe->wr.opcode]; + ipath_cq_enter(to_icq(qp->ibqp.send_cq), wc, 1); + if (++qp->s_last >= qp->s_size) + qp->s_last = 0; + wqe = get_swqe_ptr(qp, qp->s_last); + } + qp->s_cur = qp->s_tail = qp->s_head; + qp->state = IB_QPS_SQE; +} + +/* + * Flush both send and receive work queues. + * QP r_rq.lock and s_lock should be held. 
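+ * Callers take r_rq.lock first and s_lock second (see
+ * ipath_modify_qp() below), so that is the lock order required here.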
+ */ +static void ipath_error_qp(struct ipath_qp *qp) +{ + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + struct ib_wc wc; + + _VERBS_INFO("QP%d/%d in error state\n", + qp->ibqp.qp_num, qp->remote_qpn); + + spin_lock(&dev->pending_lock); + /* XXX What if its already removed by the timeout code? */ + if (qp->timerwait.next != LIST_POISON1) + list_del(&qp->timerwait); + if (qp->piowait.next != LIST_POISON1) + list_del(&qp->piowait); + spin_unlock(&dev->pending_lock); + + wc.status = IB_WC_WR_FLUSH_ERR; + wc.vendor_err = 0; + wc.byte_len = 0; + wc.imm_data = 0; + wc.qp_num = qp->ibqp.qp_num; + wc.src_qp = 0; + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = 0; + wc.sl = 0; + wc.dlid_path_bits = 0; + wc.port_num = 0; + + while (qp->s_last != qp->s_head) { + struct ipath_swqe *wqe = get_swqe_ptr(qp, qp->s_last); + + wc.wr_id = wqe->wr.wr_id; + wc.opcode = wc_opcode[wqe->wr.opcode]; + if (++qp->s_last >= qp->s_size) + qp->s_last = 0; + ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, 1); + } + qp->s_cur = qp->s_tail = qp->s_head; + qp->s_hdrwords = 0; + qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; + + wc.opcode = IB_WC_RECV; + while (qp->r_rq.tail != qp->r_rq.head) { + wc.wr_id = get_rwqe_ptr(&qp->r_rq, qp->r_rq.tail)->wr_id; + if (++qp->r_rq.tail >= qp->r_rq.size) + qp->r_rq.tail = 0; + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); + } +} + +static int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, + int attr_mask) +{ + struct ipath_qp *qp = to_iqp(ibqp); + enum ib_qp_state cur_state, new_state; + u32 req_param, opt_param; + unsigned long flags; + + if (attr_mask & IB_QP_CUR_STATE) { + cur_state = attr->cur_qp_state; + if (cur_state != IB_QPS_RTR && + cur_state != IB_QPS_RTS && + cur_state != IB_QPS_SQD && cur_state != IB_QPS_SQE) + return -EINVAL; + spin_lock_irqsave(&qp->r_rq.lock, flags); + spin_lock(&qp->s_lock); + } else { + spin_lock_irqsave(&qp->r_rq.lock, flags); + spin_lock(&qp->s_lock); + cur_state = qp->state; + } + + if (attr_mask & IB_QP_STATE) { + new_state = attr->qp_state; + if (new_state < 0 || new_state > IB_QPS_ERR) + goto inval; + } else + new_state = cur_state; + + switch (qp_state_table[cur_state][new_state].trans) { + case IPATH_TRANS_INVALID: + goto inval; + + case IPATH_TRANS_ANY2RST: + ipath_reset_qp(qp); + break; + + case IPATH_TRANS_ANY2ERR: + ipath_error_qp(qp); + break; + + } + + req_param = + qp_state_table[cur_state][new_state].req_param[qp->ibqp.qp_type]; + opt_param = + qp_state_table[cur_state][new_state].opt_param[qp->ibqp.qp_type]; + + if ((req_param & attr_mask) != req_param) + goto inval; + + if (attr_mask & ~(req_param | opt_param | IB_QP_STATE)) + goto inval; + + if (attr_mask & IB_QP_PKEY_INDEX) { + struct ipath_ibdev *dev = to_idev(ibqp->device); + + if (attr->pkey_index >= ipath_layer_get_npkeys(dev->ib_unit)) + goto inval; + qp->s_pkey_index = attr->pkey_index; + } + + if (attr_mask & IB_QP_DEST_QPN) + qp->remote_qpn = attr->dest_qp_num; + + if (attr_mask & IB_QP_SQ_PSN) { + qp->s_next_psn = attr->sq_psn; + qp->s_last_psn = qp->s_next_psn - 1; + } + + if (attr_mask & IB_QP_RQ_PSN) + qp->r_psn = attr->rq_psn; + + if (attr_mask & IB_QP_ACCESS_FLAGS) + qp->qp_access_flags = attr->qp_access_flags; + + if (attr_mask & IB_QP_AV) + qp->remote_ah_attr = attr->ah_attr; + + if (attr_mask & IB_QP_PATH_MTU) + qp->path_mtu = attr->path_mtu; + + if (attr_mask & IB_QP_RETRY_CNT) + qp->s_retry = qp->s_retry_cnt = attr->retry_cnt; + + if (attr_mask & IB_QP_RNR_RETRY) { + qp->s_rnr_retry = attr->rnr_retry; + if (qp->s_rnr_retry > 7) + 
qp->s_rnr_retry = 7; + qp->s_rnr_retry_cnt = qp->s_rnr_retry; + } + + if (attr_mask & IB_QP_MIN_RNR_TIMER) + qp->s_min_rnr_timer = attr->min_rnr_timer & 0x1F; + + if (attr_mask & IB_QP_QKEY) + qp->qkey = attr->qkey; + + if (attr_mask & IB_QP_PKEY_INDEX) + qp->s_pkey_index = attr->pkey_index; + + qp->state = new_state; + spin_unlock(&qp->s_lock); + spin_unlock_irqrestore(&qp->r_rq.lock, flags); + + /* + * Try to move to ARMED if QP1 changed to the RTS state. + */ + if (qp->ibqp.qp_num == 1 && new_state == IB_QPS_RTS) { + struct ipath_ibdev *dev = to_idev(ibqp->device); + + /* + * Bounce the link even if it was active so the SM will + * reinitialize the SMA's state. + */ + ipath_kset_linkstate((dev->ib_unit << 16) | IPATH_IB_LINKDOWN); + ipath_kset_linkstate((dev->ib_unit << 16) | IPATH_IB_LINKARM); + } + return 0; + +inval: + spin_unlock(&qp->s_lock); + spin_unlock_irqrestore(&qp->r_rq.lock, flags); + return -EINVAL; +} + +/* + * Compute the AETH (syndrome + MSN). + * The QP s_lock should be held. + */ +static u32 ipath_compute_aeth(struct ipath_qp *qp) +{ + u32 aeth = atomic_read(&qp->msn) & 0xFFFFFF; + + if (qp->s_nak_state) { + aeth |= qp->s_nak_state << 24; + } else if (qp->ibqp.srq) { + /* Shared receive queues don't generate credits. */ + aeth |= 0x1F << 24; + } else { + u32 min, max, x; + u32 credits; + + /* + * Compute the number of credits available (RWQEs). + * XXX Not holding the r_rq.lock here so there is a small + * chance that the pair of reads are not atomic. + */ + credits = qp->r_rq.head - qp->r_rq.tail; + if ((int)credits < 0) + credits += qp->r_rq.size; + /* Binary search the credit table to find the code to use. */ + min = 0; + max = 31; + for (;;) { + x = (min + max) / 2; + if (credit_table[x] == credits) + break; + if (credit_table[x] > credits) + max = x; + else if (min == x) + break; + else + min = x; + } + aeth |= x << 24; + } + return cpu_to_be32(aeth); +} + + +static void no_bufs_available(struct ipath_qp *qp, struct ipath_ibdev *dev) +{ + unsigned long flags; + + spin_lock_irqsave(&dev->pending_lock, flags); + if (qp->piowait.next == LIST_POISON1) + list_add_tail(&qp->piowait, &dev->piowait); + spin_unlock_irqrestore(&dev->pending_lock, flags); + /* + * Note that as soon as ipath_layer_want_buffer() is called and + * possibly before it returns, ipath_ib_piobufavail() + * could be called. If we are still in the tasklet function, + * tasklet_schedule() will not call us until the next time + * tasklet_schedule() is called. + * We clear the tasklet flag now since we are committing to return + * from the tasklet function. + */ + tasklet_unlock(&qp->s_task); + ipath_layer_want_buffer(dev->ib_unit); + dev->n_piowait++; +} + +/* + * Process entries in the send work queue until the queue is exhausted. + * Only allow one CPU to send a packet per QP (tasklet). + * Otherwise, after we drop the QP lock, two threads could send + * packets out of order. + * This is similar to do_rc_send() below except we don't have timeouts or + * resends. 
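+ *
+ * Single-sender sketch: the IPATH_S_BUSY bit in s_flags acts as a
+ * per-QP try-lock,
+ *
+ *	if (test_and_set_bit(IPATH_S_BUSY, &qp->s_flags))
+ *		return;	/* another CPU already owns the send side */
+ *
+ * so only one instance at a time builds headers for a given QP.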
+ */ +static void do_uc_send(unsigned long data) +{ + struct ipath_qp *qp = (struct ipath_qp *)data; + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + struct ipath_swqe *wqe; + unsigned long flags; + u16 lrh0; + u32 hwords; + u32 nwords; + u32 extra_bytes; + u32 bth0; + u32 bth2; + u32 pmtu = ib_mtu_enum_to_int(qp->path_mtu); + u32 len; + struct ipath_other_headers *ohdr; + struct ib_wc wc; + + if (test_and_set_bit(IPATH_S_BUSY, &qp->s_flags)) + return; + + if (unlikely(qp->remote_ah_attr.dlid == + ipath_layer_get_lid(dev->ib_unit))) { + /* Pass in an uninitialized ib_wc to save stack space. */ + ipath_ruc_loopback(qp, &wc); + clear_bit(IPATH_S_BUSY, &qp->s_flags); + return; + } + + ohdr = &qp->s_hdr.u.oth; + if (qp->remote_ah_attr.ah_flags & IB_AH_GRH) + ohdr = &qp->s_hdr.u.l.oth; + +again: + /* Check for a constructed packet to be sent. */ + if (qp->s_hdrwords != 0) { + /* + * If no PIO bufs are available, return. + * An interrupt will call ipath_ib_piobufavail() + * when one is available. + */ + if (ipath_verbs_send(dev->ib_unit, qp->s_hdrwords, + (uint32_t *) &qp->s_hdr, + qp->s_cur_size, qp->s_cur_sge)) { + no_bufs_available(qp, dev); + return; + } + dev->n_unicast_xmit++; + /* Record that we sent the packet and s_hdr is empty. */ + qp->s_hdrwords = 0; + } + + lrh0 = IPS_LRH_BTH; + /* header size in 32-bit words LRH+BTH = (8+12)/4. */ + hwords = 5; + + /* + * The lock is needed to synchronize between + * setting qp->s_ack_state and post_send(). + */ + spin_lock_irqsave(&qp->s_lock, flags); + + if (!(state_ops[qp->state] & IPATH_PROCESS_SEND_OK)) + goto done; + + bth0 = ipath_layer_get_pkey(dev->ib_unit, qp->s_pkey_index); + + /* Send a request. */ + wqe = get_swqe_ptr(qp, qp->s_last); + switch (qp->s_state) { + default: + /* Signal the completion of the last send (if there is one). */ + if (qp->s_last != qp->s_tail) { + if (++qp->s_last == qp->s_size) + qp->s_last = 0; + if (!test_bit(IPATH_S_SIGNAL_REQ_WR, &qp->s_flags) || + (wqe->wr.send_flags & IB_SEND_SIGNALED)) { + wc.wr_id = wqe->wr.wr_id; + wc.status = IB_WC_SUCCESS; + wc.opcode = wc_opcode[wqe->wr.opcode]; + wc.vendor_err = 0; + wc.byte_len = wqe->length; + wc.qp_num = qp->ibqp.qp_num; + wc.src_qp = qp->remote_qpn; + wc.pkey_index = 0; + wc.slid = qp->remote_ah_attr.dlid; + wc.sl = qp->remote_ah_attr.sl; + wc.dlid_path_bits = 0; + wc.port_num = 0; + ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, + 0); + } + wqe = get_swqe_ptr(qp, qp->s_last); + } + /* Check if send work queue is empty. */ + if (qp->s_tail == qp->s_head) + goto done; + /* + * Start a new request. 
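+ * Each new request is stamped with the current s_next_psn;
+ * multi-packet messages then advance the PSN once per MTU-sized
+ * fragment via "bth2 = qp->s_next_psn++ & 0xFFFFFF" below.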
+ */ + qp->s_psn = wqe->psn = qp->s_next_psn; + qp->s_sge.sge = wqe->sg_list[0]; + qp->s_sge.sg_list = wqe->sg_list + 1; + qp->s_sge.num_sge = wqe->wr.num_sge; + qp->s_len = len = wqe->length; + switch (wqe->wr.opcode) { + case IB_WR_SEND: + case IB_WR_SEND_WITH_IMM: + if (len > pmtu) { + qp->s_state = IB_OPCODE_UC_SEND_FIRST; + len = pmtu; + break; + } + if (wqe->wr.opcode == IB_WR_SEND) { + qp->s_state = IB_OPCODE_UC_SEND_ONLY; + } else { + qp->s_state = + IB_OPCODE_UC_SEND_ONLY_WITH_IMMEDIATE; + /* Immediate data comes after the BTH */ + ohdr->u.imm_data = wqe->wr.imm_data; + hwords += 1; + } + if (wqe->wr.send_flags & IB_SEND_SOLICITED) + bth0 |= 1 << 23; + break; + + case IB_WR_RDMA_WRITE: + case IB_WR_RDMA_WRITE_WITH_IMM: + ohdr->u.rc.reth.vaddr = + cpu_to_be64(wqe->wr.wr.rdma.remote_addr); + ohdr->u.rc.reth.rkey = + cpu_to_be32(wqe->wr.wr.rdma.rkey); + ohdr->u.rc.reth.length = cpu_to_be32(len); + hwords += sizeof(struct ib_reth) / 4; + if (len > pmtu) { + qp->s_state = IB_OPCODE_UC_RDMA_WRITE_FIRST; + len = pmtu; + break; + } + if (wqe->wr.opcode == IB_WR_RDMA_WRITE) { + qp->s_state = IB_OPCODE_UC_RDMA_WRITE_ONLY; + } else { + qp->s_state = + IB_OPCODE_UC_RDMA_WRITE_ONLY_WITH_IMMEDIATE; + /* Immediate data comes after the RETH */ + ohdr->u.rc.imm_data = wqe->wr.imm_data; + hwords += 1; + if (wqe->wr.send_flags & IB_SEND_SOLICITED) + bth0 |= 1 << 23; + } + break; + + default: + goto done; + } + if (++qp->s_tail >= qp->s_size) + qp->s_tail = 0; + break; + + case IB_OPCODE_UC_SEND_FIRST: + qp->s_state = IB_OPCODE_UC_SEND_MIDDLE; + /* FALLTHROUGH */ + case IB_OPCODE_UC_SEND_MIDDLE: + len = qp->s_len; + if (len > pmtu) { + len = pmtu; + break; + } + if (wqe->wr.opcode == IB_WR_SEND) + qp->s_state = IB_OPCODE_UC_SEND_LAST; + else { + qp->s_state = IB_OPCODE_UC_SEND_LAST_WITH_IMMEDIATE; + /* Immediate data comes after the BTH */ + ohdr->u.imm_data = wqe->wr.imm_data; + hwords += 1; + } + if (wqe->wr.send_flags & IB_SEND_SOLICITED) + bth0 |= 1 << 23; + break; + + case IB_OPCODE_UC_RDMA_WRITE_FIRST: + qp->s_state = IB_OPCODE_UC_RDMA_WRITE_MIDDLE; + /* FALLTHROUGH */ + case IB_OPCODE_UC_RDMA_WRITE_MIDDLE: + len = qp->s_len; + if (len > pmtu) { + len = pmtu; + break; + } + if (wqe->wr.opcode == IB_WR_RDMA_WRITE) + qp->s_state = IB_OPCODE_UC_RDMA_WRITE_LAST; + else { + qp->s_state = + IB_OPCODE_UC_RDMA_WRITE_LAST_WITH_IMMEDIATE; + /* Immediate data comes after the BTH */ + ohdr->u.imm_data = wqe->wr.imm_data; + hwords += 1; + if (wqe->wr.send_flags & IB_SEND_SOLICITED) + bth0 |= 1 << 23; + } + break; + } + bth2 = qp->s_next_psn++ & 0xFFFFFF; + qp->s_len -= len; + bth0 |= qp->s_state << 24; + + spin_unlock_irqrestore(&qp->s_lock, flags); + + /* Construct the header. */ + extra_bytes = (4 - len) & 3; + nwords = (len + extra_bytes) >> 2; + if (unlikely(qp->remote_ah_attr.ah_flags & IB_AH_GRH)) { + /* Header size in 32-bit words. */ + hwords += 10; + lrh0 = IPS_LRH_GRH; + qp->s_hdr.u.l.grh.version_tclass_flow = + cpu_to_be32((6 << 28) | + (qp->remote_ah_attr.grh.traffic_class << 20) | + qp->remote_ah_attr.grh.flow_label); + qp->s_hdr.u.l.grh.paylen = + cpu_to_be16(((hwords - 12) + nwords + SIZE_OF_CRC) << 2); + qp->s_hdr.u.l.grh.next_hdr = 0x1B; + qp->s_hdr.u.l.grh.hop_limit = qp->remote_ah_attr.grh.hop_limit; + /* The SGID is 32-bit aligned. 
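+ * (The GRH itself is 40 bytes, hence the "hwords += 10" above;
+ * paylen drops the 2 LRH words and 10 GRH words so that it covers
+ * only the BTH, payload, and ICRC, converted to bytes by the "<< 2".)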
*/ + qp->s_hdr.u.l.grh.sgid.global.subnet_prefix = dev->gid_prefix; + qp->s_hdr.u.l.grh.sgid.global.interface_id = + ipath_layer_get_guid(dev->ib_unit); + qp->s_hdr.u.l.grh.dgid = qp->remote_ah_attr.grh.dgid; + } + qp->s_hdrwords = hwords; + qp->s_cur_sge = &qp->s_sge; + qp->s_cur_size = len; + lrh0 |= qp->remote_ah_attr.sl << 4; + qp->s_hdr.lrh[0] = cpu_to_be16(lrh0); + /* DEST LID */ + qp->s_hdr.lrh[1] = cpu_to_be16(qp->remote_ah_attr.dlid); + qp->s_hdr.lrh[2] = cpu_to_be16(hwords + nwords + SIZE_OF_CRC); + qp->s_hdr.lrh[3] = cpu_to_be16(ipath_layer_get_lid(dev->ib_unit)); + bth0 |= extra_bytes << 20; + ohdr->bth[0] = cpu_to_be32(bth0); + ohdr->bth[1] = cpu_to_be32(qp->remote_qpn); + ohdr->bth[2] = cpu_to_be32(bth2); + + /* Check for more work to do. */ + goto again; + +done: + spin_unlock_irqrestore(&qp->s_lock, flags); + clear_bit(IPATH_S_BUSY, &qp->s_flags); +} + +/* + * Process entries in the send work queue until credit or queue is exhausted. + * Only allow one CPU to send a packet per QP (tasklet). + * Otherwise, after we drop the QP s_lock, two threads could send + * packets out of order. + */ +static void do_rc_send(unsigned long data) +{ + struct ipath_qp *qp = (struct ipath_qp *)data; + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + struct ipath_swqe *wqe; + struct ipath_sge_state *ss; + unsigned long flags; + u16 lrh0; + u32 hwords; + u32 nwords; + u32 extra_bytes; + u32 bth0; + u32 bth2; + u32 pmtu = ib_mtu_enum_to_int(qp->path_mtu); + u32 len; + struct ipath_other_headers *ohdr; + char newreq; + + if (test_and_set_bit(IPATH_S_BUSY, &qp->s_flags)) + return; + + if (unlikely(qp->remote_ah_attr.dlid == + ipath_layer_get_lid(dev->ib_unit))) { + struct ib_wc wc; + + /* + * Pass in an uninitialized ib_wc to be consistent with + * other places where ipath_ruc_loopback() is called. + */ + ipath_ruc_loopback(qp, &wc); + clear_bit(IPATH_S_BUSY, &qp->s_flags); + return; + } + + ohdr = &qp->s_hdr.u.oth; + if (qp->remote_ah_attr.ah_flags & IB_AH_GRH) + ohdr = &qp->s_hdr.u.l.oth; + +again: + /* Check for a constructed packet to be sent. */ + if (qp->s_hdrwords != 0) { + /* + * If no PIO bufs are available, return. + * An interrupt will call ipath_ib_piobufavail() + * when one is available. + */ + if (ipath_verbs_send(dev->ib_unit, qp->s_hdrwords, + (uint32_t *) &qp->s_hdr, + qp->s_cur_size, qp->s_cur_sge)) { + no_bufs_available(qp, dev); + return; + } + dev->n_unicast_xmit++; + /* Record that we sent the packet and s_hdr is empty. */ + qp->s_hdrwords = 0; + } + + lrh0 = IPS_LRH_BTH; + /* header size in 32-bit words LRH+BTH = (8+12)/4. */ + hwords = 5; + + /* + * The lock is needed to synchronize between + * setting qp->s_ack_state, resend timer, and post_send(). + */ + spin_lock_irqsave(&qp->s_lock, flags); + + bth0 = ipath_layer_get_pkey(dev->ib_unit, qp->s_pkey_index); + + /* Sending responses has higher priority over sending requests. */ + if (qp->s_ack_state != IB_OPCODE_RC_ACKNOWLEDGE) { + /* + * Send a response. + * Note that we are in the responder's side of the QP context. 
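+ * Responses are driven by s_ack_state rather than by the send work
+ * queue, so a partially sent RDMA READ response always finishes
+ * before any new requester-side WQE is started.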
+ */ + switch (qp->s_ack_state) { + case IB_OPCODE_RC_RDMA_READ_REQUEST: + ss = &qp->s_rdma_sge; + len = qp->s_rdma_len; + if (len > pmtu) { + len = pmtu; + qp->s_ack_state = + IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST; + } else { + qp->s_ack_state = + IB_OPCODE_RC_RDMA_READ_RESPONSE_ONLY; + } + qp->s_rdma_len -= len; + bth0 |= qp->s_ack_state << 24; + ohdr->u.aeth = ipath_compute_aeth(qp); + hwords++; + break; + + case IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST: + qp->s_ack_state = + IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE; + /* FALLTHROUGH */ + case IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE: + ss = &qp->s_rdma_sge; + len = qp->s_rdma_len; + if (len > pmtu) { + len = pmtu; + } else { + ohdr->u.aeth = ipath_compute_aeth(qp); + hwords++; + qp->s_ack_state = + IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST; + } + qp->s_rdma_len -= len; + bth0 |= qp->s_ack_state << 24; + break; + + case IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST: + case IB_OPCODE_RC_RDMA_READ_RESPONSE_ONLY: + /* + * We have to prevent new requests from changing + * the r_sge state while a ipath_verbs_send() + * is in progress. + * Changing r_state allows the receiver + * to continue processing new packets. + * We do it here now instead of above so + * that we are sure the packet was sent before + * changing the state. + */ + qp->r_state = IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST; + qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; + goto send_req; + + case IB_OPCODE_RC_COMPARE_SWAP: + case IB_OPCODE_RC_FETCH_ADD: + ss = NULL; + len = 0; + qp->r_state = IB_OPCODE_RC_SEND_LAST; + qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; + bth0 |= IB_OPCODE_ATOMIC_ACKNOWLEDGE << 24; + ohdr->u.at.aeth = ipath_compute_aeth(qp); + ohdr->u.at.atomic_ack_eth = + cpu_to_be64(qp->s_ack_atomic); + hwords += sizeof(ohdr->u.at) / 4; + break; + + default: + /* Send a regular ACK. */ + ss = NULL; + len = 0; + qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; + bth0 |= qp->s_ack_state << 24; + ohdr->u.aeth = ipath_compute_aeth(qp); + hwords++; + } + bth2 = qp->s_ack_psn++ & 0xFFFFFF; + } else { + send_req: + if (!(state_ops[qp->state] & IPATH_PROCESS_SEND_OK) || + qp->s_rnr_timeout) + goto done; + + /* Send a request. */ + wqe = get_swqe_ptr(qp, qp->s_cur); + switch (qp->s_state) { + default: + /* + * Resend an old request or start a new one. + * + * We keep track of the current SWQE so that + * we don't reset the "furthest progress" state + * if we need to back up. + */ + newreq = 0; + if (qp->s_cur == qp->s_tail) { + /* Check if send work queue is empty. */ + if (qp->s_tail == qp->s_head) + goto done; + qp->s_psn = wqe->psn = qp->s_next_psn; + newreq = 1; + } + /* + * Note that we have to be careful not to modify the + * original work request since we may need to resend + * it. + */ + qp->s_sge.sge = wqe->sg_list[0]; + qp->s_sge.sg_list = wqe->sg_list + 1; + qp->s_sge.num_sge = wqe->wr.num_sge; + qp->s_len = len = wqe->length; + ss = &qp->s_sge; + bth2 = 0; + switch (wqe->wr.opcode) { + case IB_WR_SEND: + case IB_WR_SEND_WITH_IMM: + /* If no credit, return. 
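+ * (s_lsn == (u32) -1 presumably marks credit-based flow control as
+ * disabled -- compare ipath_compute_aeth() above, where an SRQ
+ * advertises the 0x1F "invalid credit" code -- otherwise cmp24()
+ * orders the 24-bit serial numbers modulo 2^24.)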
*/ + if (qp->s_lsn != (u32) -1 && + cmp24(wqe->ssn, qp->s_lsn + 1) > 0) { + goto done; + } + wqe->lpsn = wqe->psn; + if (len > pmtu) { + wqe->lpsn += (len - 1) / pmtu; + qp->s_state = IB_OPCODE_RC_SEND_FIRST; + len = pmtu; + break; + } + if (wqe->wr.opcode == IB_WR_SEND) { + qp->s_state = IB_OPCODE_RC_SEND_ONLY; + } else { + qp->s_state = + IB_OPCODE_RC_SEND_ONLY_WITH_IMMEDIATE; + /* Immediate data comes after the BTH */ + ohdr->u.imm_data = wqe->wr.imm_data; + hwords += 1; + } + if (wqe->wr.send_flags & IB_SEND_SOLICITED) + bth0 |= 1 << 23; + bth2 = 1 << 31; /* Request ACK. */ + if (++qp->s_cur == qp->s_size) + qp->s_cur = 0; + break; + + case IB_WR_RDMA_WRITE: + if (newreq) + qp->s_lsn++; + /* FALLTHROUGH */ + case IB_WR_RDMA_WRITE_WITH_IMM: + /* If no credit, return. */ + if (qp->s_lsn != (u32) -1 && + cmp24(wqe->ssn, qp->s_lsn + 1) > 0) { + goto done; + } + ohdr->u.rc.reth.vaddr = + cpu_to_be64(wqe->wr.wr.rdma.remote_addr); + ohdr->u.rc.reth.rkey = + cpu_to_be32(wqe->wr.wr.rdma.rkey); + ohdr->u.rc.reth.length = cpu_to_be32(len); + hwords += sizeof(struct ib_reth) / 4; + wqe->lpsn = wqe->psn; + if (len > pmtu) { + wqe->lpsn += (len - 1) / pmtu; + qp->s_state = + IB_OPCODE_RC_RDMA_WRITE_FIRST; + len = pmtu; + break; + } + if (wqe->wr.opcode == IB_WR_RDMA_WRITE) { + qp->s_state = + IB_OPCODE_RC_RDMA_WRITE_ONLY; + } else { + qp->s_state = + IB_OPCODE_RC_RDMA_WRITE_ONLY_WITH_IMMEDIATE; + /* Immediate data comes after RETH */ + ohdr->u.rc.imm_data = wqe->wr.imm_data; + hwords += 1; + if (wqe->wr. + send_flags & IB_SEND_SOLICITED) + bth0 |= 1 << 23; + } + bth2 = 1 << 31; /* Request ACK. */ + if (++qp->s_cur == qp->s_size) + qp->s_cur = 0; + break; + + case IB_WR_RDMA_READ: + ohdr->u.rc.reth.vaddr = + cpu_to_be64(wqe->wr.wr.rdma.remote_addr); + ohdr->u.rc.reth.rkey = + cpu_to_be32(wqe->wr.wr.rdma.rkey); + ohdr->u.rc.reth.length = cpu_to_be32(len); + qp->s_state = IB_OPCODE_RC_RDMA_READ_REQUEST; + hwords += sizeof(ohdr->u.rc.reth) / 4; + if (newreq) { + qp->s_lsn++; + /* + * Adjust s_next_psn to count the + * expected number of responses. + */ + if (len > pmtu) + qp->s_next_psn += + (len - 1) / pmtu; + wqe->lpsn = qp->s_next_psn++; + } + ss = NULL; + len = 0; + if (++qp->s_cur == qp->s_size) + qp->s_cur = 0; + break; + + case IB_WR_ATOMIC_CMP_AND_SWP: + case IB_WR_ATOMIC_FETCH_AND_ADD: + qp->s_state = + wqe->wr.opcode == IB_WR_ATOMIC_CMP_AND_SWP ? + IB_OPCODE_RC_COMPARE_SWAP : + IB_OPCODE_RC_FETCH_ADD; + ohdr->u.atomic_eth.vaddr = + cpu_to_be64(wqe->wr.wr.atomic.remote_addr); + ohdr->u.atomic_eth.rkey = + cpu_to_be32(wqe->wr.wr.atomic.rkey); + ohdr->u.atomic_eth.swap_data = + cpu_to_be64(wqe->wr.wr.atomic.swap); + ohdr->u.atomic_eth.compare_data = + cpu_to_be64(wqe->wr.wr.atomic.compare_add); + hwords += sizeof(struct ib_atomic_eth) / 4; + if (newreq) { + qp->s_lsn++; + wqe->lpsn = wqe->psn; + } + if (++qp->s_cur == qp->s_size) + qp->s_cur = 0; + ss = NULL; + len = 0; + break; + + default: + goto done; + } + if (newreq) { + if (++qp->s_tail >= qp->s_size) + qp->s_tail = 0; + } + bth2 |= qp->s_psn++ & 0xFFFFFF; + if ((int)(qp->s_psn - qp->s_next_psn) > 0) + qp->s_next_psn = qp->s_psn; + spin_lock(&dev->pending_lock); + if (qp->timerwait.next == LIST_POISON1) { + list_add_tail(&qp->timerwait, + &dev->pending[dev-> + pending_index]); + } + spin_unlock(&dev->pending_lock); + break; + + case IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST: + /* + * This case can only happen if a send is + * restarted. See ipath_restart_rc(). 
+ */ + ipath_init_restart(qp, wqe); + /* FALLTHROUGH */ + case IB_OPCODE_RC_SEND_FIRST: + qp->s_state = IB_OPCODE_RC_SEND_MIDDLE; + /* FALLTHROUGH */ + case IB_OPCODE_RC_SEND_MIDDLE: + bth2 = qp->s_psn++ & 0xFFFFFF; + if ((int)(qp->s_psn - qp->s_next_psn) > 0) + qp->s_next_psn = qp->s_psn; + ss = &qp->s_sge; + len = qp->s_len; + if (len > pmtu) { + /* + * Request an ACK every 1/2 MB to avoid + * retransmit timeouts. + */ + if (((wqe->length - len) % (512 * 1024)) == 0) + bth2 |= 1 << 31; + len = pmtu; + break; + } + if (wqe->wr.opcode == IB_WR_SEND) + qp->s_state = IB_OPCODE_RC_SEND_LAST; + else { + qp->s_state = + IB_OPCODE_RC_SEND_LAST_WITH_IMMEDIATE; + /* Immediate data comes after the BTH */ + ohdr->u.imm_data = wqe->wr.imm_data; + hwords += 1; + } + if (wqe->wr.send_flags & IB_SEND_SOLICITED) + bth0 |= 1 << 23; + bth2 |= 1 << 31; /* Request ACK. */ + if (++qp->s_cur >= qp->s_size) + qp->s_cur = 0; + break; + + case IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST: + /* + * This case can only happen if a RDMA write is + * restarted. See ipath_restart_rc(). + */ + ipath_init_restart(qp, wqe); + /* FALLTHROUGH */ + case IB_OPCODE_RC_RDMA_WRITE_FIRST: + qp->s_state = IB_OPCODE_RC_RDMA_WRITE_MIDDLE; + /* FALLTHROUGH */ + case IB_OPCODE_RC_RDMA_WRITE_MIDDLE: + bth2 = qp->s_psn++ & 0xFFFFFF; + if ((int)(qp->s_psn - qp->s_next_psn) > 0) + qp->s_next_psn = qp->s_psn; + ss = &qp->s_sge; + len = qp->s_len; + if (len > pmtu) { + /* + * Request an ACK every 1/2 MB to avoid + * retransmit timeouts. + */ + if (((wqe->length - len) % (512 * 1024)) == 0) + bth2 |= 1 << 31; + len = pmtu; + break; + } + if (wqe->wr.opcode == IB_WR_RDMA_WRITE) + qp->s_state = IB_OPCODE_RC_RDMA_WRITE_LAST; + else { + qp->s_state = + IB_OPCODE_RC_RDMA_WRITE_LAST_WITH_IMMEDIATE; + /* Immediate data comes after the BTH */ + ohdr->u.imm_data = wqe->wr.imm_data; + hwords += 1; + if (wqe->wr.send_flags & IB_SEND_SOLICITED) + bth0 |= 1 << 23; + } + bth2 |= 1 << 31; /* Request ACK. */ + if (++qp->s_cur >= qp->s_size) + qp->s_cur = 0; + break; + + case IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE: + /* + * This case can only happen if a RDMA read is + * restarted. See ipath_restart_rc(). + */ + ipath_init_restart(qp, wqe); + len = ((qp->s_psn - wqe->psn) & 0xFFFFFF) * pmtu; + ohdr->u.rc.reth.vaddr = + cpu_to_be64(wqe->wr.wr.rdma.remote_addr + len); + ohdr->u.rc.reth.rkey = + cpu_to_be32(wqe->wr.wr.rdma.rkey); + ohdr->u.rc.reth.length = cpu_to_be32(qp->s_len); + qp->s_state = IB_OPCODE_RC_RDMA_READ_REQUEST; + hwords += sizeof(ohdr->u.rc.reth) / 4; + bth2 = qp->s_psn++ & 0xFFFFFF; + if ((int)(qp->s_psn - qp->s_next_psn) > 0) + qp->s_next_psn = qp->s_psn; + ss = NULL; + len = 0; + if (++qp->s_cur == qp->s_size) + qp->s_cur = 0; + break; + + case IB_OPCODE_RC_RDMA_READ_REQUEST: + case IB_OPCODE_RC_COMPARE_SWAP: + case IB_OPCODE_RC_FETCH_ADD: + /* + * We shouldn't start anything new until this request + * is finished. The ACK will handle rescheduling us. + * XXX The number of outstanding ones is negotiated + * at connection setup time (see pg. 258,289)? + * XXX Also, if we support multiple outstanding + * requests, we need to check the WQE IB_SEND_FENCE + * flag and not send a new request if a RDMA read or + * atomic is pending. + */ + goto done; + } + qp->s_len -= len; + bth0 |= qp->s_state << 24; + /* XXX queue resend timeout. */ + } + /* Make sure it is non-zero before dropping the lock. */ + qp->s_hdrwords = hwords; + spin_unlock_irqrestore(&qp->s_lock, flags); + + /* Construct the header. 
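+ * Padding example: len == 7 gives extra_bytes = (4 - 7) & 3 = 1 and
+ * nwords = (7 + 1) >> 2 = 2, so the payload is rounded up to whole
+ * 32-bit words and the pad count lands in BTH bits 20-21 via
+ * "bth0 |= extra_bytes << 20".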
*/ + extra_bytes = (4 - len) & 3; + nwords = (len + extra_bytes) >> 2; + if (unlikely(qp->remote_ah_attr.ah_flags & IB_AH_GRH)) { + /* Header size in 32-bit words. */ + hwords += 10; + lrh0 = IPS_LRH_GRH; + qp->s_hdr.u.l.grh.version_tclass_flow = + cpu_to_be32((6 << 28) | + (qp->remote_ah_attr.grh.traffic_class << 20) | + qp->remote_ah_attr.grh.flow_label); + qp->s_hdr.u.l.grh.paylen = + cpu_to_be16(((hwords - 12) + nwords + SIZE_OF_CRC) << 2); + qp->s_hdr.u.l.grh.next_hdr = 0x1B; + qp->s_hdr.u.l.grh.hop_limit = qp->remote_ah_attr.grh.hop_limit; + /* The SGID is 32-bit aligned. */ + qp->s_hdr.u.l.grh.sgid.global.subnet_prefix = dev->gid_prefix; + qp->s_hdr.u.l.grh.sgid.global.interface_id = + ipath_layer_get_guid(dev->ib_unit); + qp->s_hdr.u.l.grh.dgid = qp->remote_ah_attr.grh.dgid; + qp->s_hdrwords = hwords; + } + qp->s_cur_sge = ss; + qp->s_cur_size = len; + lrh0 |= qp->remote_ah_attr.sl << 4; + qp->s_hdr.lrh[0] = cpu_to_be16(lrh0); + /* DEST LID */ + qp->s_hdr.lrh[1] = cpu_to_be16(qp->remote_ah_attr.dlid); + qp->s_hdr.lrh[2] = cpu_to_be16(hwords + nwords + SIZE_OF_CRC); + qp->s_hdr.lrh[3] = cpu_to_be16(ipath_layer_get_lid(dev->ib_unit)); + bth0 |= extra_bytes << 20; + ohdr->bth[0] = cpu_to_be32(bth0); + ohdr->bth[1] = cpu_to_be32(qp->remote_qpn); + ohdr->bth[2] = cpu_to_be32(bth2); + + /* Check for more work to do. */ + goto again; + +done: + spin_unlock_irqrestore(&qp->s_lock, flags); + clear_bit(IPATH_S_BUSY, &qp->s_flags); +} From bos at pathscale.com Wed Dec 28 16:31:39 2005 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 28 Dec 2005 16:31:39 -0800 Subject: [openib-general] [PATCH 20 of 20] ipath - integrate driver into infiniband kbuild infrastructure In-Reply-To: Message-ID: <914136b2b8eed9417ce6.1135816299@eng-12.pathscale.com> Signed-off-by: Bryan O'Sullivan diff -r 07bf9f34e221 -r 914136b2b8ee drivers/infiniband/Kconfig --- a/drivers/infiniband/Kconfig Wed Dec 28 14:19:43 2005 -0800 +++ b/drivers/infiniband/Kconfig Wed Dec 28 14:19:43 2005 -0800 @@ -30,6 +30,7 @@ . source "drivers/infiniband/hw/mthca/Kconfig" +source "drivers/infiniband/hw/ipath/Kconfig" source "drivers/infiniband/ulp/ipoib/Kconfig" diff -r 07bf9f34e221 -r 914136b2b8ee drivers/infiniband/Makefile --- a/drivers/infiniband/Makefile Wed Dec 28 14:19:43 2005 -0800 +++ b/drivers/infiniband/Makefile Wed Dec 28 14:19:43 2005 -0800 @@ -1,4 +1,5 @@ obj-$(CONFIG_INFINIBAND) += core/ obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ +obj-$(CONFIG_IPATH_CORE) += hw/ipath/ obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/ From bos at pathscale.com Wed Dec 28 16:31:30 2005 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 28 Dec 2005 16:31:30 -0800 Subject: [openib-general] [PATCH 11 of 20] ipath - core driver, part 4 of 4 In-Reply-To: Message-ID: Signed-off-by: Bryan O'Sullivan diff -r c37b118ef806 -r e8af3873b0d9 drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Wed Dec 28 14:19:42 2005 -0800 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Wed Dec 28 14:19:43 2005 -0800 @@ -5408,3 +5408,1709 @@ return ret; } + +/* + * implemention of the ioctl to get the stats values from the driver + * The argument is the user address to which we do the copy_to_user() + */ +static int ipath_get_stats(struct infinipath_stats __user *ustats) +{ + int ret = 0; + + if ((ret = copy_to_user(ustats, &ipath_stats, sizeof(ipath_stats)))) { + _IPATH_DBG("copy_to_user error on driver stats\n"); + ret = -EFAULT; + } + + return ret; +} + +/* set a partition key. 
We can have up to 4 active at a time (other than + * the default, which is always allowed). This is somewhat tricky, since + * multiple ports may set the same key, so we reference count them, and + * clean up at exit. All 4 partition keys are packed into a single + * infinipath register. It's an error for a process to set the same + * pkey multiple times. We provide no mechanism to de-allocate a pkey + * at this time, we may eventually need to do that. + * I've used the atomic operations, and no locking, and only make a single + * pass through what's available. This should be more than adequate for + * some time. I'll think about spinlocks or the like if and as it's necessary + */ +static int ipath_set_partkey(struct ipath_portdata *pd, uint16_t key) +{ + struct ipath_devdata *dd; + int i, any = 0, pidx = -1; + uint16_t lkey = key & 0x7FFF; + + dd = &devdata[pd->port_unit]; + + if (lkey == (IPS_DEFAULT_P_KEY & 0x7FFF)) { + /* nothing to do; this key always valid */ + return 0; + } + + _IPATH_VDBG + ("p%u try to set pkey %hx, current keys %hx:%x %hx:%x %hx:%x %hx:%x\n", + pd->port_port, key, dd->ipath_pkeys[0], + atomic_read(&dd->ipath_pkeyrefs[0]), dd->ipath_pkeys[1], + atomic_read(&dd->ipath_pkeyrefs[1]), dd->ipath_pkeys[2], + atomic_read(&dd->ipath_pkeyrefs[2]), dd->ipath_pkeys[3], + atomic_read(&dd->ipath_pkeyrefs[3])); + + if (!lkey) { + _IPATH_PRDBG("p%u tries to set key 0, not allowed\n", + pd->port_port); + return -EINVAL; + } + + /* + * Set the full membership bit, because it has to be + * set in the register or the packet, and it seems + * cleaner to set in the register than to force all + * callers to set it. (see bug 4331) + */ + key |= 0x8000; + + for (i = 0; i < ARRAY_SIZE(pd->port_pkeys); i++) { + if (!pd->port_pkeys[i] && pidx == -1) + pidx = i; + if (pd->port_pkeys[i] == key) { + _IPATH_VDBG + ("p%u tries to set same pkey (%x) more than once\n", + pd->port_port, key); + return -EEXIST; + } + } + if (pidx == -1) { + _IPATH_DBG + ("All pkeys for port %u already in use, can't set %x\n", + pd->port_port, key); + return -EBUSY; + } + for (any = i = 0; i < ARRAY_SIZE(dd->ipath_pkeys); i++) { + if (!dd->ipath_pkeys[i]) { + any++; + continue; + } + if (dd->ipath_pkeys[i] == key) { + if (atomic_inc_return(&dd->ipath_pkeyrefs[i]) > 1) { + pd->port_pkeys[pidx] = key; + _IPATH_VDBG + ("p%u set key %x matches #%d, count now %d\n", + pd->port_port, key, i, + atomic_read(&dd->ipath_pkeyrefs[i])); + return 0; + } else { + /* lost race, decrement count, catch below */ + atomic_dec(&dd->ipath_pkeyrefs[i]); + _IPATH_VDBG + ("Lost race, count was 0, after dec, it's %d\n", + atomic_read(&dd->ipath_pkeyrefs[i])); + any++; + } + } + if ((dd->ipath_pkeys[i] & 0x7FFF) == lkey) { + /* + * It makes no sense to have both the limited and full + * membership PKEY set at the same time since the + * unlimited one will disable the limited one. + */ + return -EEXIST; + } + } + if (!any) { + _IPATH_DBG + ("port %u, all pkeys already in use, can't set %x\n", + pd->port_port, key); + return -EBUSY; + } + for (any = i = 0; i < ARRAY_SIZE(dd->ipath_pkeys); i++) { + if (!dd->ipath_pkeys[i] && + atomic_inc_return(&dd->ipath_pkeyrefs[i]) == 1) { + uint64_t pkey; + + /* for ipathstats, etc. 
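+ * The four 16-bit pkeys are then packed, key 0 in the low word, into
+ * the single 64-bit kr_partitionkey register (p[] abbreviates
+ * dd->ipath_pkeys[]):
+ *
+ *	pkey = p[0] | p[1] << 16 | p[2] << 32 | p[3] << 48;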
*/ + ipath_stats.sps_pkeys[i] = lkey; + pd->port_pkeys[pidx] = dd->ipath_pkeys[i] = key; + pkey = + (uint64_t) dd->ipath_pkeys[0] | + ((uint64_t) dd->ipath_pkeys[1] << 16) | + ((uint64_t) dd->ipath_pkeys[2] << 32) | + ((uint64_t) dd->ipath_pkeys[3] << 48); + _IPATH_PRDBG + ("p%u set key %x in #%d, portidx %d, new pkey reg %llx\n", + pd->port_port, key, i, pidx, pkey); + ipath_kput_kreg(pd->port_unit, kr_partitionkey, pkey); + + return 0; + } + } + _IPATH_DBG + ("port %u, all pkeys already in use 2nd pass, can't set %x\n", + pd->port_port, key); + return -EBUSY; +} + +/* + * stop_start == 0 disables receive on the port, for use in queue overflow + * conditions. stop_start==1 re-enables, and returns value of tail register, + * to be used to re-init the software copy of the head register + */ + +static int ipath_manage_rcvq(struct ipath_portdata * pd, uint16_t start_stop) +{ + struct ipath_devdata *dd; + /* + * This needs to be volatile, so that the compiler doesn't + * optimize away the read to the device's mapped memory. + */ + volatile uint64_t tval; + + dd = &devdata[pd->port_unit]; + _IPATH_PRDBG("%sabling rcv for unit %u port %u\n", + start_stop ? "en" : "dis", pd->port_unit, pd->port_port); + /* atomically clear receive enable port. */ + if (start_stop) { + /* + * on enable, force in-memory copy of the tail register + * to 0, so that protocol code doesn't have to worry + * about whether or not the chip has yet updated + * the in-memory copy or not on return from the system + * call. The chip always resets it's tail register back + * to 0 on a transition from disabled to enabled. + * This could cause a problem if software was broken, + * and did the enable w/o the disable, but eventually + * the in-memory copy will be updated and correct + * itself, even in the face of software bugs. + */ + *pd->port_rcvhdrtail_kvaddr = 0; + atomic_set_mask(1U << + (INFINIPATH_R_PORTENABLE_SHIFT + pd->port_port), + &dd->ipath_rcvctrl); + } else + atomic_clear_mask(1U << + (INFINIPATH_R_PORTENABLE_SHIFT + + pd->port_port), &dd->ipath_rcvctrl); + ipath_kput_kreg(pd->port_unit, kr_rcvctrl, dd->ipath_rcvctrl); + /* now be sure chip saw it before we return */ + tval = ipath_kget_kreg64(pd->port_unit, kr_scratch); + if (start_stop) { + /* + * and try to be sure that tail reg update has happened + * too. This should in theory interlock with the RXE + * changes to the tail register. Don't assign it to + * the tail register in memory copy, since we could + * overwrite an update by the chip if we did. + */ + tval = + ipath_kget_ureg32(pd->port_unit, ur_rcvhdrtail, + pd->port_port); + } + /* always; new head should be equal to new tail; see above */ + return 0; +} + +/* + * This routine is now quite different for user and kernel, because + * the kernel uses skb's, for the accelerated network performance + * This is the user port version + * + * allocate the eager TID buffers and program them into infinipath + * They are no longer completely contiguous, we do multiple + * alloc_pages() calls. 
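+ *
+ * Chunking arithmetic (illustrative numbers only): with 4KB pages
+ * the chunks below are 32KB, so a 4KB eager buffer size gives
+ *
+ *	egrperchunk = 32768 / 4096 = 8
+ *	chunk = (egrcnt + 8 - 1) / 8
+ *
+ * and 512 eager TIDs would need 64 chunks.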
+ */ +static int ipath_create_user_egr(struct ipath_portdata * pd) +{ + char *buf; + struct ipath_devdata *dd = &devdata[pd->port_unit]; + uint64_t __iomem *egrbase; + uint64_t egroff, lenvalid; + unsigned e, egrcnt, alloced, order, egrperchunk, chunk; + unsigned long pa, pent; + + egrcnt = dd->ipath_rcvegrcnt; + egroff = + dd->ipath_rcvegrbase + pd->port_port * egrcnt * sizeof(*egrbase); + egrbase = (uint64_t __iomem *) + ((char __iomem *)(dd->ipath_kregbase) + egroff); + _IPATH_VDBG("Allocating %d egr buffers, at chip offset %llx (%p)\n", + egrcnt, egroff, egrbase); + + /* + * to avoid wasting a lot of memory, we allocate 32KB chunks of + * physically contiguous memory, advance through it until used up + * and then allocate more. Of course, we need memory to store + * those extra pointers, now. Started out with 256KB, but under + * heavy memory pressure (creating large files and then copying + * them over NFS while doing lots of MPI jobs), we hit some + * alloc_pages() failures, even though we can sleep... (2.6.10) + * Still get failures at 64K. 32K is the lowest we can go without + * waiting more memory again. It seems likely that the coalescing + * in free_pages, etc. still has issues (as it has had previously + * during 2.6.x development). + */ + order = get_order(0x8000); + alloced = ALIGN(dd->ipath_rcvegrbufsize * egrcnt, + (1 << order) * PAGE_SIZE); + egrperchunk = ((1 << order) * PAGE_SIZE) / dd->ipath_rcvegrbufsize; + chunk = (egrcnt + egrperchunk - 1) / egrperchunk; + pd->port_rcvegrbuf_chunks = chunk; + pd->port_rcvegrbufs_perchunk = egrperchunk; + pd->port_rcvegrbuf_order = order; + pd->port_rcvegrbuf_pages = + vmalloc(chunk * sizeof(pd->port_rcvegrbuf_pages[0])); + pd->port_rcvegrbuf_virt = + vmalloc(chunk * sizeof(pd->port_rcvegrbuf_virt[0])); + if (!pd->port_rcvegrbuf_pages || !pd->port_rcvegrbuf_pages) { + _IPATH_UNIT_ERROR(pd->port_unit, + "Unable to allocate %u EGR buffer array pointers\n", + chunk); + if (pd->port_rcvegrbuf_pages) { + vfree(pd->port_rcvegrbuf_pages); + pd->port_rcvegrbuf_pages = NULL; + } + return -ENOMEM; + } + for (e = 0; e < pd->port_rcvegrbuf_chunks; e++) { + /* + * GFP_USER, but without GFP_FS, so buffer cache can + * be coalesced (we hope); otherwise, even at order 4, heavy + * filesystem activity makes these fail + */ + if (! + (pd->port_rcvegrbuf_pages[e] = + alloc_pages(__GFP_WAIT | __GFP_IO, order))) { + _IPATH_UNIT_ERROR(pd->port_unit, + "Unable to allocate EGR buffer array %u/%u\n", + e, pd->port_rcvegrbuf_chunks); + vfree(pd->port_rcvegrbuf_pages); + pd->port_rcvegrbuf_pages = NULL; + vfree(pd->port_rcvegrbuf_virt); + pd->port_rcvegrbuf_virt = NULL; + return -ENOMEM; + } + } + + /* + * calculate physical, then phys_to_virt() + * so that we get an address that fits in 64 bits, so we can use + * mmap64 from 32 bit programs on the chip and kernel virtual + * addresses (mmap64 for 32 bit programs on i386 and x86_64 + * only has 44 bits of address, because it uses mmap2()) + * We do this with the first chunk; We don't need a kernel + * virtually contiguous address to give the user virtually + * contiguous mappings. 
It just complicates the nopage routine + * a little tiny bit ;) + */ + buf = page_address(pd->port_rcvegrbuf_pages[0]); + pa = virt_to_phys(buf); + pd->port_rcvegr_phys = pa; + + /* in words */ + lenvalid = (dd->ipath_rcvegrbufsize - pd->port_egrskip) >> 2; + _IPATH_VDBG + ("port%u egrbuf vaddr %p, cpu %d, egrskip %u, len %llx words\n", + pd->port_port, buf, smp_processor_id(), pd->port_egrskip, + lenvalid); + lenvalid <<= INFINIPATH_RT_BUFSIZE_SHIFT; + lenvalid |= INFINIPATH_RT_VALID; + + for (e = chunk = 0; chunk < pd->port_rcvegrbuf_chunks; chunk++) { + int i, n; + struct page *p; + p = pd->port_rcvegrbuf_pages[chunk]; + pa = page_to_phys(p); + buf = page_address(p); + /* + * stash away for later use, since page_address() lookup + * is not cheap + */ + pd->port_rcvegrbuf_virt[chunk] = buf; + if (pa & ~INFINIPATH_RT_ADDR_MASK) + _IPATH_INFO + ("physaddr %lx has more than 40 bits, using only 40!\n", + pa); + n = 1 << pd->port_rcvegrbuf_order; + for (i = 0; i < n; i++) + SetPageReserved(virt_to_page(buf + (i * PAGE_SIZE))); + + /* clear buffer for security, sanity, and, debugging */ + memset(buf, 0, PAGE_SIZE * n); + + for (i = 0; e < egrcnt && i < egrperchunk; e++, i++) { + pent = ((pa + pd->port_egrskip) & + INFINIPATH_RT_ADDR_MASK) | lenvalid; + + ipath_kput_memq(pd->port_unit, &egrbase[e], pent); + _IPATH_VDBG("egr %u phys %lx val %lx\n", e, pa, pent); + pa += dd->ipath_rcvegrbufsize; + } + yield(); /* don't hog the cpu */ + } + + return 0; +} + +/* + * This routine is now quite different for user and kernel, because + * the kernel uses skb's, for the accelerated network performance + * This is the kernel (port0) version + * + * Allocate the eager TID buffers and program them into infinipath. + * We use the network layer alloc_skb() allocator to allocate the memory, and + * either use the buffers as is for things like SMA packets, or pass + * the buffers up to the ipath layered driver and thence the network layer, + * replacing them as we do so (see ipath_kreceive()) + */ +static int ipath_create_port0_egr(struct ipath_portdata * pd) +{ + int ret = 0; + uint64_t __iomem *egrbase; + uint64_t egroff; + unsigned e, egrcnt; + struct ipath_devdata *dd; + struct sk_buff **skbs; + + dd = &devdata[pd->port_unit]; + egrcnt = dd->ipath_rcvegrcnt; + egroff = dd->ipath_rcvegrbase + + pd->port_port * egrcnt * sizeof(*egrbase); + egrbase = (uint64_t __iomem *) ((char __iomem *)(dd->ipath_kregbase) + + egroff); + _IPATH_VDBG + ("unit%u Allocating %d egr buffers, at chip offset %llx (%p)\n", + pd->port_unit, egrcnt, egroff, egrbase); + + skbs = vmalloc(sizeof(*dd->ipath_port0_skbs) * egrcnt); + if (skbs == NULL) + ret = -ENOMEM; + else { + for (e = 0; e < egrcnt; e++) { + /* + * This is a bit tricky in that we allocate + * extra space for 2 bytes of the 14 byte + * ethernet header. These two bytes are passed + * in the ipath header so the rest of the data + * is word aligned. We allocate 4 bytes so that the + * data buffer stays word aligned. + * See ipath_kreceive() for more details. + */ + skbs[e] = + __dev_alloc_skb(dd->ipath_ibmaxlen + 4, GFP_KERNEL); + if (skbs[e] == NULL) { + _IPATH_UNIT_ERROR(pd->port_unit, + "SKB allocation error for eager TID %u\n", + e); + while (e != 0) + dev_kfree_skb(skbs[--e]); + ret = -ENOMEM; + break; + } + skb_reserve(skbs[e], 4); + } + } + /* + * after loop above, so we can test non-NULL + * to see if ready to use at receive, etc. Hope this fixes some + * panics. 
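
The programming loop above packs each eager-TID entry from a 40-bit physical address, a word count, and a valid bit. A sketch of that encoding (the shift and mask values below are placeholders for illustration, not the chip's real register layout):

#include <stdint.h>

#define RT_ADDR_MASK	0xFFFFFFFFFFULL	/* low 40 bits of physical address */
#define RT_BUFSIZE_SHIFT 48		/* placeholder field position */
#define RT_VALID	0x1ULL		/* placeholder valid bit */

static uint64_t make_egr_entry(uint64_t pa, uint32_t len_bytes)
{
	uint64_t words = len_bytes >> 2;	/* hardware counts 32-bit words */

	return (pa & RT_ADDR_MASK) | (words << RT_BUFSIZE_SHIFT) | RT_VALID;
}
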
+ */ + dd->ipath_port0_skbs = skbs; + + /* + * have to tell chip each time we init it + * even if we are re-using previous memory. + */ + if (!ret) { + uint64_t lenvalid; /* in words */ + + lenvalid = (dd->ipath_ibmaxlen - pd->port_egrskip) >> 2; + lenvalid <<= INFINIPATH_RT_BUFSIZE_SHIFT; + lenvalid |= INFINIPATH_RT_VALID; + for (e = 0; e < egrcnt; e++) { + unsigned long pa, pent; + + pa = virt_to_phys(dd->ipath_port0_skbs[e]->data); + pa += pd->port_egrskip; + if (!e && (pa & ~INFINIPATH_RT_ADDR_MASK)) + _IPATH_INFO + ("phys addr %lx has more than 40 bits, using only 40!!!\n", + pa); + pent = (pa & INFINIPATH_RT_ADDR_MASK) | lenvalid; + /* + * don't need this except extreme debugging, + * but leaving to save future typing. + * _IPATH_VDBG("egr[%d] %p <- %lx\n", e, &egrbase[e], pent); + */ + ipath_kput_memq(pd->port_unit, &egrbase[e], pent); + } + yield(); /* don't hog the cpu */ + } + + return ret; +} + +/* + * this *must* be physically contiguous memory, and for now, + * that limits it to what kmalloc can do. + */ +static int ipath_create_rcvhdrq(struct ipath_portdata * pd) +{ + int i, ret = 0, amt, order, pgs; + char *qt; + struct page *p; + unsigned long pa, pa0; + + amt = ALIGN(devdata[pd->port_unit].ipath_rcvhdrcnt * devdata[pd->port_unit].ipath_rcvhdrentsize * sizeof(uint32_t), PAGE_SIZE); + if (!pd->port_rcvhdrq) { + order = get_order(amt); + /* + * not using REPEAT isn't viable; at 128KB, we can easily fail + * this. The problem with REPEAT is we can block here + * "forever". There isn't an inbetween, unfortunately. + * We could reduce the risk by never freeing the rcvhdrq + * except at unload, but even then, the first time a + * port is used, we could delay for some time... + */ + p = alloc_pages(GFP_USER, order); + if (!p) { + _IPATH_UNIT_ERROR(pd->port_unit, + "attempt to allocate order %u memory for port %u rcvhdrq failed\n", + order, pd->port_port); + return -ENOMEM; + } + + /* + * should use kmap (and later kunmap), even though high mem will + * always be mapped on x86_64, to play it safe, but for some + * bizarre reason these aren't exported symbols... + */ + pd->port_rcvhdrq = page_address(p); + if (!virt_addr_valid(pd->port_rcvhdrq)) { + _IPATH_DBG + ("weird, virt_addr_valid false right after alloc_pages\n"); + _IPATH_DBG("__pa(%p) is %lx, num_physpages %lx\n", + pd->port_rcvhdrq, __pa(pd->port_rcvhdrq), + num_physpages); + } + pd->port_rcvhdrq_phys = virt_to_phys(pd->port_rcvhdrq); + pd->port_rcvhdrq_order = order; + + pa0 = pd->port_rcvhdrq_phys; + pgs = amt >> PAGE_SHIFT; + _IPATH_VDBG + ("%d pages at %p (phys %lx) order=%u for port %u rcvhdr Q\n", + pgs, pd->port_rcvhdrq, pa0, pd->port_rcvhdrq_order, + pd->port_port); + + /* + * verify it's really physically contiguous, to be paranoid + * also mark pages as reserved, to avoid problems when + * user process with them mapped then exits. 
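
The paranoia check that follows verifies one invariant: for a physically contiguous region, phys(base + i * PAGE_SIZE) must equal phys(base) + i * PAGE_SIZE for every page. Distilled into a kernel-context sketch (an illustration, not part of the patch):

static int region_is_contiguous(void *base, unsigned int pages)
{
	unsigned long pa0 = virt_to_phys(base);
	unsigned int i;

	for (i = 1; i < pages; i++)
		if (virt_to_phys((char *)base + i * PAGE_SIZE) !=
		    pa0 + i * PAGE_SIZE)
			return 0;	/* a page broke the run */
	return 1;
}
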
+ */ + qt = pd->port_rcvhdrq; + SetPageReserved(virt_to_page(qt)); + qt += PAGE_SIZE; + for (pa = pa0, i = 1; i < pgs; i++, qt += PAGE_SIZE) { + SetPageReserved(virt_to_page(qt)); + pa = virt_to_phys(qt); + if (pa != (pa0 + (i * PAGE_SIZE))) + _IPATH_INFO + ("pg %d at %p phys %lx not contiguous\n", i, + qt, pa); + else + _IPATH_VDBG("pg %d at %p phys %lx\n", i, qt, + pa); + } + } + + /* + * clear for security, sanity, and/or debugging (each time we + * use/reuse) + */ + memset(pd->port_rcvhdrq, 0, amt); + + /* + * tell chip each time we init it, even if we are re-using previous + * memory (we zero it at process close) + */ + _IPATH_VDBG("writing port %d rcvhdraddr as %lx\n", pd->port_port, + pd->port_rcvhdrq_phys); + ipath_kput_kreg_port(pd->port_unit, kr_rcvhdraddr, pd->port_port, + pd->port_rcvhdrq_phys); + + return ret; +} + +#ifdef _IPATH_EXTRA_DEBUG +/* + * occasionally useful to dump the full set of kernel registers for debugging. + */ +static void ipath_dump_allregs(char *what, ipath_type t) +{ + uint16_t reg; + _IPATH_DBG("%s\n", what); + for (reg = 0; reg <= 0x100; reg++) { + uint64_t v = ipath_kget_kreg64(t, reg); + if (!(reg % 4)) + printk("\n%3x: ", reg); + printk("%16llx ", v); + } + printk("\n"); +} +#endif /* _IPATH_EXTRA_DEBUG */ + +/* + * Do the actual initialization sequence on the chip. For the real + * hardware, this is done from the init routine called from the PCI + * infrastructure. + */ +int ipath_init_chip(const ipath_type t) +{ + int ret = 0, i; + uint32_t val32, kpiobufs; + uint64_t val, atmp; + uint32_t __iomem *piobuf; + uint32_t pioincr; + struct ipath_devdata *dd = &devdata[t]; + struct ipath_portdata *pd; + struct page *vpage; + char boardn[32]; + + /* first time only, set after static version info */ + if (!chip_driver_version) { + i = strlen(ipath_core_version); + chip_driver_version = ipath_core_version + i; + chip_driver_size = sizeof ipath_core_version - i; + } + + /* + * have to clear shadow copies of registers at init that are not + * otherwise set here, or all kinds of bizarre things happen with + * driver on chip reset + */ + dd->ipath_rcvhdrsize = 0; + + /* + * don't clear ipath_flags as 8bit mode was set before entering + * this func. 
However, we do set the linkstate to unknown + */ + + /* so we can watch for a transition */ + dd->ipath_flags |= IPATH_LINKUNK; + dd->ipath_flags &= ~(IPATH_LINKACTIVE | IPATH_LINKARMED | IPATH_LINKDOWN + | IPATH_LINKINIT); + + _IPATH_VDBG("Try to read spc chip revision\n"); + dd->ipath_revision = ipath_kget_kreg64(t, kr_revision); + + /* + * set up fundamental info we need to use the chip; we assume if + * the revision reg and these regs are OK, we don't need to special + * case the rest + */ + dd->ipath_sregbase = ipath_kget_kreg32(t, kr_sendregbase); + dd->ipath_cregbase = ipath_kget_kreg32(t, kr_counterregbase); + dd->ipath_uregbase = ipath_kget_kreg32(t, kr_userregbase); + _IPATH_VDBG("ipath_kregbase %p, sendbase %x usrbase %x, cntrbase %x\n", + dd->ipath_kregbase, dd->ipath_sregbase, dd->ipath_uregbase, + dd->ipath_cregbase); + if ((dd->ipath_revision & 0xffffffff) == 0xffffffff || + (dd->ipath_sregbase & 0xffffffff) == 0xffffffff || + (dd->ipath_cregbase & 0xffffffff) == 0xffffffff || + (dd->ipath_uregbase & 0xffffffff) == 0xffffffff) { + _IPATH_UNIT_ERROR(t, + "Register read failures from chip, giving up initialization\n"); + ret = -ENODEV; + goto done; + } + + /* clear the initial reset flag, in case first driver load */ + ipath_kput_kreg(t, kr_errorclear, INFINIPATH_E_RESET); + + dd->ipath_portcnt = ipath_kget_kreg32(t, kr_portcnt); + if (!infinipath_cfgports) + dd->ipath_cfgports = dd->ipath_portcnt; + else if (infinipath_cfgports <= dd->ipath_portcnt) { + dd->ipath_cfgports = infinipath_cfgports; + _IPATH_DBG("Configured to use %u ports out of %u in chip\n", + dd->ipath_cfgports, dd->ipath_portcnt); + } else { + dd->ipath_cfgports = dd->ipath_portcnt; + _IPATH_DBG + ("Tried to configure to use %u ports; chip only supports %u\n", + infinipath_cfgports, dd->ipath_portcnt); + } + dd->ipath_pd = kmalloc(sizeof(*dd->ipath_pd) * dd->ipath_cfgports, + GFP_KERNEL); + if (!dd->ipath_pd) { + _IPATH_UNIT_ERROR(t, + "Unable to allocate portdata array, failing\n"); + ret = -ENOMEM; + goto done; + } + memset(dd->ipath_pd, 0, sizeof(*dd->ipath_pd) * dd->ipath_cfgports); + + dd->ipath_lastegrheads = kmalloc(sizeof(*dd->ipath_lastegrheads) + * dd->ipath_cfgports, GFP_KERNEL); + dd->ipath_lastrcvhdrqtails = kmalloc(sizeof(*dd->ipath_lastrcvhdrqtails) + * dd->ipath_cfgports, GFP_KERNEL); + if (!dd->ipath_lastegrheads || !dd->ipath_lastrcvhdrqtails) { + _IPATH_UNIT_ERROR(t, + "Unable to allocate head arrays, failing\n"); + ret = -ENOMEM; + goto done; + } + memset(dd->ipath_lastrcvhdrqtails, 0, + sizeof(*dd->ipath_lastrcvhdrqtails) + * dd->ipath_cfgports); + memset(dd->ipath_lastegrheads, 0, sizeof(*dd->ipath_lastegrheads) + * dd->ipath_cfgports); + + dd->ipath_pd[0] = kmalloc(sizeof(struct ipath_portdata), GFP_KERNEL); + if (!dd->ipath_pd[0]) { + _IPATH_UNIT_ERROR(t, + "Unable to allocate portdata for port 0, failing\n"); + ret = -ENOMEM; + goto done; + } + memset(dd->ipath_pd[0], 0, sizeof(struct ipath_portdata)); + + pd = dd->ipath_pd[0]; + pd->port_unit = t; + pd->port_port = 0; + pd->port_cnt = 1; + /* The port 0 pkey table is used by the layer interface.
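
Several allocations above use the kmalloc()-then-memset() idiom. Where kcalloc() is available it collapses the pair and also overflow-checks the multiplication; a sketch of the equivalent call for the portdata array (a suggestion for comparison, not what the patch does):

	dd->ipath_pd = kcalloc(dd->ipath_cfgports, sizeof(*dd->ipath_pd),
			       GFP_KERNEL);
	if (!dd->ipath_pd)
		return -ENOMEM;	/* zeroed on success, no memset needed */
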
*/ + pd->port_pkeys[0] = IPS_DEFAULT_P_KEY; + + dd->ipath_rcvtidcnt = ipath_kget_kreg32(t, kr_rcvtidcnt); + dd->ipath_rcvtidbase = ipath_kget_kreg32(t, kr_rcvtidbase); + dd->ipath_rcvegrcnt = ipath_kget_kreg32(t, kr_rcvegrcnt); + dd->ipath_rcvegrbase = ipath_kget_kreg32(t, kr_rcvegrbase); + dd->ipath_palign = ipath_kget_kreg32(t, kr_pagealign); + dd->ipath_piobufbase = ipath_kget_kreg32(t, kr_sendpiobufbase); + dd->ipath_piosize = ipath_kget_kreg32(t, kr_sendpiosize); + dd->ipath_ibmtu = 4096; /* default to largest legal MTU */ + dd->ipath_piobcnt = ipath_kget_kreg32(t, kr_sendpiobufcnt); + dd->ipath_piobase = (((char __iomem *) dd->ipath_kregbase) + + (dd->ipath_piobufbase & 0xffffffff)); + + _IPATH_VDBG + ("Revision %llx (PCI %x), %u ports, %u tids, %u egrtids, %u piobufs\n", + dd->ipath_revision, dd->ipath_pcirev, dd->ipath_portcnt, + dd->ipath_rcvtidcnt, dd->ipath_rcvegrcnt, dd->ipath_piobcnt); + + if (((dd->ipath_revision >> INFINIPATH_R_SOFTWARE_SHIFT) & INFINIPATH_R_SOFTWARE_MASK) != IPATH_CHIP_SWVERSION) { /* >= maybe, someday */ + _IPATH_UNIT_ERROR(t, + "Driver only handles version %d, chip swversion is %d (%llx), failing\n", + IPATH_CHIP_SWVERSION, + (int)(dd-> + ipath_revision >> + INFINIPATH_R_SOFTWARE_SHIFT) & + INFINIPATH_R_SOFTWARE_MASK, + dd->ipath_revision); + ret = -ENOSYS; + goto done; + } + dd->ipath_majrev = (uint8_t) ((dd->ipath_revision >> + INFINIPATH_R_CHIPREVMAJOR_SHIFT) & + INFINIPATH_R_CHIPREVMAJOR_MASK); + dd->ipath_minrev = + (uint8_t) ((dd-> + ipath_revision >> INFINIPATH_R_CHIPREVMINOR_SHIFT) & + INFINIPATH_R_CHIPREVMINOR_MASK); + dd->ipath_boardrev = + (uint8_t) ((dd-> + ipath_revision >> INFINIPATH_R_BOARDID_SHIFT) & + INFINIPATH_R_BOARDID_MASK); + + ipath_get_boardname(t, boardn, sizeof boardn); + + { + snprintf(chip_driver_version, chip_driver_size, + "Driver %u.%u, %s, InfiniPath%u %u.%u, PCI %u, SW Compat %u\n", + IPATH_CHIP_VERS_MAJ, IPATH_CHIP_VERS_MIN, boardn, + (unsigned)(dd-> + ipath_revision >> INFINIPATH_R_ARCH_SHIFT) & + INFINIPATH_R_ARCH_MASK, dd->ipath_majrev, + dd->ipath_minrev, dd->ipath_pcirev, + (unsigned)(dd-> + ipath_revision >> + INFINIPATH_R_SOFTWARE_SHIFT) & + INFINIPATH_R_SOFTWARE_MASK); + + } + + _IPATH_DBG("%s", chip_driver_version); + + /* + * we ignore most issues after reporting them, but have to specially + * handle hardware-disabled chips. + */ + if (ipath_validate_rev(dd) == 2) { + ret = -EPERM; /* unique error, known to infinipath_init_one() */ + goto done; + } + + /* + * zero all the TID entries at startup. We do this for sanity, + * in case of a previous driver crash of some kind, and also + * because the chip powers up with these memories in an unknown + * state. Use portcnt, not cfgports, since this is for the full chip, + * not for current (possibly different) configuration value + * Chip Errata bug 6447 + */ + for (val32 = 0; val32 < dd->ipath_portcnt; val32++) + ipath_clear_tids(t, val32); + + dd->ipath_rcvhdrentsize = IPATH_RCVHDRENTSIZE; + /* we could bump this + * to allow for full rcvegrcnt + rcvtidcnt, but then it no + * longer nicely fits power of two, and since we now use + * alloc_pages, the rest would be wasted.
+ */ + dd->ipath_rcvhdrcnt = dd->ipath_rcvegrcnt; + /* + * setup offset of last valid entry in rcvhdrq, for various tests, to + * avoid calculating each time we need it + */ + dd->ipath_hdrqlast = + dd->ipath_rcvhdrentsize * (dd->ipath_rcvhdrcnt - 1); + ipath_kput_kreg(t, kr_rcvhdrentsize, dd->ipath_rcvhdrentsize); + ipath_kput_kreg(t, kr_rcvhdrcnt, dd->ipath_rcvhdrcnt); + /* + * not in ipath_rcvhdrsize, so user programs can set differently, but + * so any early packets see the default size. + */ + ipath_kput_kreg(t, kr_rcvhdrsize, IPATH_DFLT_RCVHDRSIZE); + + /* + * we "know" that this works + * out OK. It's actually a bit more than we need, but 2048+64 isn't + * quite enough for full size, and we want the +N to be a power of 2 + * to give us reasonable alignment and fit within page_alloc()'ed + * memory + */ + dd->ipath_rcvegrbufsize = dd->ipath_piosize; + + /* + * the min() check here is currently a nop, but it may not always be, + * depending on just how we do ipath_rcvegrbufsize + */ + dd->ipath_ibmaxlen = min(dd->ipath_piosize, dd->ipath_rcvegrbufsize); + dd->ipath_init_ibmaxlen = dd->ipath_ibmaxlen; + + /* + * set up the shadow copies of the piobufavail registers, which + * we compare against the chip registers for now, and the in + * memory DMA'ed copies of the registers. This has to be done + * early, before we calculate lastport, etc. + */ + val = dd->ipath_piobcnt; + /* + * calc number of pioavail registers, and save it; we have 2 bits + * per buffer + */ + dd->ipath_pioavregs = ALIGN(val, sizeof(uint64_t) * BITS_PER_BYTE / 2) / (sizeof(uint64_t) * BITS_PER_BYTE / 2); + if (dd->ipath_pioavregs > + (sizeof(dd->ipath_pioavailshadow) / + sizeof(dd->ipath_pioavailshadow[0]))) { + dd->ipath_pioavregs = + sizeof(dd->ipath_pioavailshadow) / + sizeof(dd->ipath_pioavailshadow[0]); + dd->ipath_piobcnt = dd->ipath_pioavregs * sizeof(uint64_t) * BITS_PER_BYTE >> 1; /* 2 bits/reg */ + _IPATH_INFO + ("Warning: %lld piobufs is too many to fit in shadow, only using %d\n", + val, dd->ipath_piobcnt); + } + + if (!infinipath_kpiobufs) { + /* have to have at least one, for SMA */ + kpiobufs = infinipath_kpiobufs = 1; + } else if (dd->ipath_piobcnt < + (dd->ipath_cfgports * IPATH_MIN_USER_PORT_BUFCNT)) { + _IPATH_INFO + ("Too few PIO buffers (%u) for %u ports to have %u each!\n", + dd->ipath_piobcnt, dd->ipath_cfgports, + IPATH_MIN_USER_PORT_BUFCNT); + kpiobufs = 1; /* reserve just the minimum for SMA/ether */ + } else + kpiobufs = infinipath_kpiobufs; + + if (kpiobufs > + (dd->ipath_piobcnt - + (dd->ipath_cfgports * IPATH_MIN_USER_PORT_BUFCNT))) { + i = dd->ipath_piobcnt - + (dd->ipath_cfgports * IPATH_MIN_USER_PORT_BUFCNT); + if (i < 0) + i = 0; + _IPATH_INFO + ("Allocating %d PIO bufs for kernel leaves too few for %d user ports (%d each); using %u\n", + kpiobufs, dd->ipath_cfgports - 1, + IPATH_MIN_USER_PORT_BUFCNT, i); + /* + * shouldn't change infinipath_kpiobufs, because could be + * different for different devices... + */ + kpiobufs = i; + } + dd->ipath_lastport_piobuf = dd->ipath_piobcnt - kpiobufs; + dd->ipath_pbufsport = dd->ipath_cfgports > 1 ? 
+ dd->ipath_lastport_piobuf / (dd->ipath_cfgports - 1) : 0; + val32 = dd->ipath_lastport_piobuf - + (dd->ipath_pbufsport * (dd->ipath_cfgports - 1)); + if (val32 > 0) { + _IPATH_DBG + ("allocating %u pbufs/port leaves %u unused, add to kernel\n", + dd->ipath_pbufsport, val32); + dd->ipath_lastport_piobuf -= val32; + _IPATH_DBG("%u pbufs/port leaves %u unused, add to kernel\n", + dd->ipath_pbufsport, val32); + } + dd->ipath_lastpioindex = dd->ipath_lastport_piobuf; + _IPATH_VDBG + ("%d PIO bufs %u - %u, %u each for %u user ports\n", + kpiobufs, dd->ipath_lastport_piobuf, dd->ipath_piobcnt, dd->ipath_pbufsport, + dd->ipath_cfgports - 1); + + /* + * this has to be page aligned, and on a page of its own, so we + * can map it into user space. We also use it to give processes + * a copy of ipath_statusp, on a separate cacheline, followed by + * a copy of the freeze error string, if it's happened. Might also + * use that space for other things. + */ + val = ALIGN(2 * L1_CACHE_BYTES + sizeof(*dd->ipath_statusp) + + dd->ipath_pioavregs * sizeof(uint64_t), 2 * PAGE_SIZE); + if (!(dd->ipath_pioavailregs_dma = kmalloc(val * sizeof(uint64_t), + GFP_KERNEL))) { + _IPATH_UNIT_ERROR(t, + "failed to allocate PIOavail reg area in memory\n"); + ret = -ENOMEM; + goto done; + } + if ((PAGE_SIZE - 1) & (uint64_t) dd->ipath_pioavailregs_dma) { + dd->__ipath_pioavailregs_base = dd->ipath_pioavailregs_dma; + dd->ipath_pioavailregs_dma = (uint64_t *) + ALIGN((uint64_t) dd->ipath_pioavailregs_dma, PAGE_SIZE); + } else + dd->__ipath_pioavailregs_base = dd->ipath_pioavailregs_dma; + /* + * zero initial, since whole thing mapped + * into user space, and don't want info leak, or confusing garbage + */ + memset((void *)dd->ipath_pioavailregs_dma, 0, PAGE_SIZE); + + /* + * we really want L2 cache aligned, but for current CPUs of interest, + * they are the same. + */ + dd->ipath_statusp = (uint64_t *) ((char *)dd->ipath_pioavailregs_dma + + ((2 * L1_CACHE_BYTES + + dd->ipath_pioavregs * + sizeof(uint64_t)) & + ~L1_CACHE_BYTES)); + /* copy the current value now that it's really allocated */ + *dd->ipath_statusp = dd->_ipath_status; + /* + * setup buffer to hold freeze msg, accessible to apps, following + * statusp + */ + dd->ipath_freezemsg = (char *)&dd->ipath_statusp[1]; + /* and its length */ + dd->ipath_freezelen = L1_CACHE_BYTES - sizeof(dd->ipath_statusp[0]); + + atmp = virt_to_phys(dd->ipath_pioavailregs_dma); + /* stash physical address for user progs */ + dd->ipath_pioavailregs_phys = atmp; + (void)ipath_kput_kreg(t, kr_sendpioavailaddr, atmp); + /* + * this is to detect s/w errors, which the h/w works around by + * ignoring the low 6 bits of address, if it wasn't aligned. + */ + val = ipath_kget_kreg64(t, kr_sendpioavailaddr); + if (val != atmp) { + _IPATH_UNIT_ERROR(t, + "Catastrophic software error, SendPIOAvailAddr written as %llx, read back as %llx\n", + atmp, val); + ret = -EINVAL; + goto done; + } + + if (t * 64 > (sizeof(ipath_port0_rcvhdrtail) - 64)) { + _IPATH_UNIT_ERROR(t, + "unit %u too large for port 0 rcvhdrtail buffer size\n", + t); + ret = -ENODEV; + } + + /* + * kernel modules loaded into vmalloc'ed memory, + * verify that when we assume that, map to phys, and back to virt, + * that we get the right contents, so we did the mapping right.
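
The sanity check that follows leans on vmalloc_to_page() to map a vmalloc'ed kernel address back to its struct page. The full round trip to a physical address looks like this (a kernel-context sketch, assuming the address is valid and mapped):

static unsigned long vmalloc_phys(void *vaddr)
{
	struct page *p = vmalloc_to_page(vaddr);

	/* page frame's physical address plus the offset within the page */
	return page_to_phys(p) + ((unsigned long)vaddr & ~PAGE_MASK);
}
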
+ */ + vpage = vmalloc_to_page((void *)ipath_port0_rcvhdrtail); + if (vpage == NOPAGE_SIGBUS || vpage == NOPAGE_OOM) { + _IPATH_UNIT_ERROR(t, "vmalloc_to_page for rcvhdrtail fails!\n"); + ret = -ENOMEM; + goto done; + } + + /* + * 64 is driven by cache line size, and also by chip requirement + * that low 6 bits be 0 + */ + val = page_to_phys(vpage) + t * 64; + + /* verify that the alignment requirement was met */ + ipath_kput_kreg_port(t, kr_rcvhdrtailaddr, 0, val); + atmp = ipath_kget_kreg64_port(t, kr_rcvhdrtailaddr, 0); + if (val != atmp) { + _IPATH_UNIT_ERROR(t, + "Catastrophic software error, RcvHdrTailAddr0 written as %llx, read back as %llx from %x\n", + val, atmp, kr_rcvhdrtailaddr); + ret = -EINVAL; + goto done; + } + /* so we can get current tail in ipath_kreceive(), per chip */ + dd->ipath_hdrqtailptr = + &ipath_port0_rcvhdrtail[t * + (64 / sizeof(ipath_port0_rcvhdrtail[0]))]; + + ipath_kput_kreg(t, kr_rcvbthqp, IPATH_KD_QP); + + /* + * make sure we are not in freeze, and PIO send enabled, so + * writes to pbc happen + */ + ipath_kput_kreg(t, kr_hwerrmask, 0ULL); + ipath_kput_kreg(t, kr_hwerrclear, -1LL); + ipath_kput_kreg(t, kr_control, 0ULL); + ipath_kput_kreg(t, kr_sendctrl, INFINIPATH_S_PIOENABLE); + + /* + * write the pbc of each buffer, to be sure it's initialized, then + * cancel all the buffers, and also abort any packets that might + * have been in flight for some reason (the latter is for driver + * unload/reload, but isn't a bad idea at first init). + * PIO send isn't enabled at this point, so there is no danger + * of sending these out on the wire. + * Chip Errata bug 6610 + */ + piobuf = (uint32_t __iomem *) (((char __iomem *)(dd->ipath_kregbase)) + + dd->ipath_piobufbase); + pioincr = devdata[t].ipath_palign / sizeof(*piobuf); + for (i = 0; i < dd->ipath_piobcnt; i++) { + writel(16, piobuf); /* reasonable word count, just to init pbc */ + piobuf += pioincr; + } + /* self-clearing */ + ipath_kput_kreg(t, kr_sendctrl, INFINIPATH_S_ABORT); + + /* + * before error clears, since we expect serdes pll errors during + * this, the first time after reset + */ + if (ipath_bringup_link(t)) { + _IPATH_INFO("Failed to bringup IB link\n"); + ret = -ENETDOWN; + goto done; + } + + /* + * clear any "expected" hwerrs from reset and/or initialization + * clear any that aren't enabled (at least this once), and then + * set the enable mask + */ + ipath_clear_init_hwerrs(t); + ipath_kput_kreg(t, kr_hwerrclear, -1LL); + ipath_kput_kreg(t, kr_hwerrmask, dd->ipath_hwerrmask); + + dd->ipath_maskederrs = dd->ipath_ignorederrs; + ipath_kput_kreg(t, kr_errorclear, -1LL); /* clear all */ + /* enable errors that are masked, at least this first time. */ + ipath_kput_kreg(t, kr_errormask, ~dd->ipath_maskederrs); + /* clear any interrupts up to this point (ints still not enabled) */ + ipath_kput_kreg(t, kr_intclear, -1LL); + + ipath_stats.sps_lid[t] = dd->ipath_lid; + + /* + * allocate the shadow TID array, so we can ipath_putpages + * previous entries.
It may make more sense to move the pageshadow + * to the port data structure, so we only allocate memory for ports + * actually in use, since we are at 8k per port now + */ + dd->ipath_pageshadow = (struct page **) + vmalloc(dd->ipath_cfgports * dd->ipath_rcvtidcnt * + sizeof(struct page *)); + if (!dd->ipath_pageshadow) + _IPATH_UNIT_ERROR(t, + "failed to allocate shadow page * array, no expected sends!\n"); + else + memset(dd->ipath_pageshadow, 0, + dd->ipath_cfgports * dd->ipath_rcvtidcnt * + sizeof(struct page *)); + + /* set up the port 0 (kernel) rcvhdr q and egr TIDs */ + if (!(ret = ipath_create_rcvhdrq(dd->ipath_pd[0]))) + ret = ipath_create_port0_egr(dd->ipath_pd[0]); + if (ret) + _IPATH_UNIT_ERROR(t, + "failed to allocate port 0 (kernel) rcvhdrq and/or egr bufs\n"); + else { + init_waitqueue_head(&ipath_sma_wait); + init_waitqueue_head(&ipath_sma_state_wait); + + ipath_kput_kreg(pd->port_unit, kr_rcvctrl, dd->ipath_rcvctrl); + + ipath_kput_kreg(t, kr_rcvbthqp, IPATH_KD_QP); + + /* Enable PIO send, and update of PIOavail regs to memory. */ + dd->ipath_sendctrl = INFINIPATH_S_PIOENABLE + | INFINIPATH_S_PIOBUFAVAILUPD; + ipath_kput_kreg(t, kr_sendctrl, dd->ipath_sendctrl); + + /* + * enable port 0 receive, and receive interrupt + * other ports done as user opens and inits them + */ + dd->ipath_rcvctrl = INFINIPATH_R_TAILUPD | + (1ULL << INFINIPATH_R_PORTENABLE_SHIFT) | + (1ULL << INFINIPATH_R_INTRAVAIL_SHIFT); + ipath_kput_kreg(t, kr_rcvctrl, dd->ipath_rcvctrl); + + /* + * now ready for use + * this should be cleared whenever we detect a reset, or + * initiate one. + */ + dd->ipath_flags |= IPATH_INITTED; + + /* + * init our shadow copies of head from tail values, and write + * head values to match + */ + val32 = ipath_kget_ureg32(t, ur_rcvegrindextail, 0); + (void)ipath_kput_ureg(t, ur_rcvegrindexhead, val32, 0); + dd->ipath_port0head = ipath_kget_ureg32(t, ur_rcvhdrtail, 0); + (void)ipath_kput_ureg(t, ur_rcvhdrhead, dd->ipath_port0head, 0); + + /* + * by now pioavail updates to memory should have occurred, + * so copy them into our working/shadow registers; this is + * in case something went wrong with abort, but mostly to + * get the initial values of the generation bit correct + */ + for (i = 0; i < dd->ipath_pioavregs; i++) { + /* + * Chip Errata bug 6641; even and odd qwords>3 + * are swapped + */ + if (i > 3) { + if (i & 1) + dd->ipath_pioavailshadow[i] = + dd->ipath_pioavailregs_dma[i - 1]; + else + dd->ipath_pioavailshadow[i] = + dd->ipath_pioavailregs_dma[i + 1]; + } else + dd->ipath_pioavailshadow[i] = + dd->ipath_pioavailregs_dma[i]; + } + /* can get counters, stats, etc.
*/ + dd->ipath_flags |= IPATH_PRESENT; + } + + /* + * cause retrigger of pending interrupts ignored during init, even if + * we had errors + */ + ipath_kput_kreg(t, kr_intclear, 0ULL); + + /* + * set up stats retrieval timer, even if we had errors in last + * portion of setup + */ + init_timer(&dd->ipath_stats_timer); + dd->ipath_stats_timer.function = ipath_get_faststats; + dd->ipath_stats_timer.data = (unsigned long)t; + /* every 5 seconds; */ + dd->ipath_stats_timer.expires = jiffies + 5 * HZ; + /* takes ~16 seconds to overflow at full IB 4x bandwidth */ + add_timer(&dd->ipath_stats_timer); + + dd->ipath_stats_timer_active = 1; + +done: + if (!ret) { + ipath_get_guid(t); + *dd->ipath_statusp |= IPATH_STATUS_CHIP_PRESENT; + if (!ipath_sma_data_spare) { + /* first init, setup SMA data structs */ + ipath_sma_data_spare = + ipath_sma_data_bufs[IPATH_NUM_SMAPKTS]; + for (i = 0; i < IPATH_NUM_SMAPKTS; i++) + ipath_sma_data[i].buf = ipath_sma_data_bufs[i]; + } + /* + * sps_nports is a global, so, we set it to the highest + * number of ports of any of the chips we find; we never + * decrement it, at least for now. + */ + if (dd->ipath_cfgports > ipath_stats.sps_nports) + ipath_stats.sps_nports = dd->ipath_cfgports; + } + /* if ret is non-zero, we probably should do some cleanup here... */ + return ret; +} + +int ipath_waitfor_complete(const ipath_type t, ipath_kreg reg_id, + uint64_t bits_to_wait_for, uint64_t * valp) +{ + uint64_t timeout, lastval, val; + + lastval = ipath_kget_kreg64(t, reg_id); + timeout = get_cycles() + 0x10000000ULL; /* <- ridiculously long time */ + do { + val = ipath_kget_kreg64(t, reg_id); + *valp = val; /* so they have something, even on failures. */ + if ((val & bits_to_wait_for) == bits_to_wait_for) + return 0; + if (val != lastval) + _IPATH_VDBG + ("Changed from %llx to %llx, waiting for %llx bits\n", + lastval, val, bits_to_wait_for); + yield(); + if (get_cycles() > timeout) { + _IPATH_DBG + ("Didn't get bits %llx in register 0x%x, got %llx\n", + bits_to_wait_for, reg_id, *valp); + return ENODEV; + } + } while (1); +} + +/* + * like ipath_waitfor_complete(), but we wait for the CMDVALID bit to go away + * indicating the last command has completed. It doesn't return data + */ +int ipath_waitfor_mdio_cmdready(const ipath_type t) +{ + uint64_t timeout; + uint64_t val; + + timeout = get_cycles() + 0x10000000ULL; /* <- ridiculously long time */ + do { + val = ipath_kget_kreg64(t, kr_mdio); + if (!(val & IPATH_MDIO_CMDVALID)) + return 0; + yield(); + if (get_cycles() > timeout) { + _IPATH_DBG("CMDVALID stuck in mdio reg? (%llx)\n", val); + return ENODEV; + } + } while (1); +} + +void ipath_set_ib_lstate(const ipath_type t, int which) +{ + struct ipath_devdata *dd = &devdata[t]; + char *what; + + /* + * For all cases, we'll either be setting a new value of linkcmd, or + * we want it to be NOP, so clear it here.
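
ipath_waitfor_complete() and ipath_waitfor_mdio_cmdready() above share one shape: poll a register, yield(), and give up after a cycle-count deadline. Distilled into a sketch (read_reg() is a hypothetical accessor; note the sketch returns the conventional negative errno, where the driver above returns a positive ENODEV):

static int poll_until_set(uint64_t (*read_reg)(void), uint64_t bits)
{
	uint64_t timeout = get_cycles() + 0x10000000ULL;

	for (;;) {
		if ((read_reg() & bits) == bits)
			return 0;		/* all requested bits set */
		if (get_cycles() > timeout)
			return -ENODEV;		/* deadline passed */
		yield();			/* don't hog the cpu */
	}
}
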
+ * Similarly, we want the linkinitcmd to be NOP for everything + * other than explicitly changing linkinitcmd, + * and for that case, we want to first clear any existing bits + */ + dd->ipath_ibcctrl &= ~((INFINIPATH_IBCC_LINKCMD_MASK << + INFINIPATH_IBCC_LINKCMD_SHIFT) | + (INFINIPATH_IBCC_LINKINITCMD_MASK << + INFINIPATH_IBCC_LINKINITCMD_SHIFT)); + + if (which == INFINIPATH_IBCC_LINKCMD_INIT) { + dd->ipath_flags &= ~(IPATH_LINK_TOARMED | IPATH_LINK_TOACTIVE + | IPATH_LINK_SLEEPING); + /* so we can watch for a transition */ + dd->ipath_flags |= IPATH_LINKDOWN; + what = "INIT"; + } else if (which == INFINIPATH_IBCC_LINKCMD_ARMED) { + dd->ipath_flags |= IPATH_LINK_TOARMED; + dd->ipath_flags &= ~(IPATH_LINK_TOACTIVE | IPATH_LINK_SLEEPING); + /* + * this is mainly for loopback testing. If INITCMD is + * NOP or SLEEP, the link won't ever come up in loopback... + */ + if (! + (dd-> + ipath_flags & (IPATH_LINKINIT | IPATH_LINKARMED | + IPATH_LINKACTIVE))) { + _IPATH_SMADBG + ("going to armed, but link not yet up, set POLL\n"); + dd->ipath_ibcctrl |= + INFINIPATH_IBCC_LINKINITCMD_POLL << + INFINIPATH_IBCC_LINKINITCMD_SHIFT; + } + what = "ARMED"; + } else if (which == INFINIPATH_IBCC_LINKCMD_ACTIVE) { + dd->ipath_flags |= IPATH_LINK_TOACTIVE; + dd->ipath_flags &= ~(IPATH_LINK_TOARMED | IPATH_LINK_SLEEPING); + what = "ACTIVE"; + } else if (which & (INFINIPATH_IBCC_LINKINITCMD_MASK << INFINIPATH_IBCC_LINKINITCMD_SHIFT)) { /* down, disable, etc. */ + dd->ipath_flags &= ~(IPATH_LINK_TOARMED | IPATH_LINK_TOACTIVE); + if (((which & INFINIPATH_IBCC_LINKINITCMD_MASK) >> + INFINIPATH_IBCC_LINKINITCMD_SHIFT) == + INFINIPATH_IBCC_LINKINITCMD_SLEEP) { + dd->ipath_flags |= IPATH_LINK_SLEEPING | IPATH_LINKDOWN; + } else + dd->ipath_flags |= IPATH_LINKDOWN; + dd->ipath_ibcctrl |= + which & (INFINIPATH_IBCC_LINKINITCMD_MASK << + INFINIPATH_IBCC_LINKINITCMD_SHIFT); + what = "DOWN"; + } else { + what = "UNKNOWN"; + _IPATH_INFO("Unknown link transition requested (which=0x%x)\n", + which); + } + + dd->ipath_ibcctrl |= ((uint64_t) which & INFINIPATH_IBCC_LINKCMD_MASK) + << INFINIPATH_IBCC_LINKCMD_SHIFT; + + _IPATH_SMADBG("Trying to move unit %u to %s, current ltstate is %s\n", + t, what, ipath_ibcstatus_str[(ipath_kget_kreg64(t, kr_ibcstatus) + >> INFINIPATH_IBCS_LINKTRAININGSTATE_SHIFT) + & INFINIPATH_IBCS_LINKTRAININGSTATE_MASK]); + ipath_kput_kreg(t, kr_ibcctrl, dd->ipath_ibcctrl); +} + +static int ipath_bringup_link(const ipath_type t) +{ + struct ipath_devdata *dd = &devdata[t]; + uint64_t val, ibc; + int ret = 0; + + dd->ipath_control &= ~INFINIPATH_C_LINKENABLE; /* hold IBC in reset */ + ipath_kput_kreg(t, kr_control, dd->ipath_control); + + /* + * Note that prior to try 14 or 15 of IB, the credit scaling + * wasn't working, because it was swapped for writes with the + * 1 bit default linkstate field + */ + + /* ignore pbc and align word */ + val = dd->ipath_piosize - 2 * sizeof(uint32_t); + /* + * for ICRC, which we only send in diag test pkt mode, and we don't + * need to worry about that for mtu + */ + val += 1; + /* + * set the IBC maxpktlength to the size of our pio buffers + * the maxpktlength is in words. This is *not* the IB data MTU + */ + ibc = (val / sizeof(uint32_t)) << INFINIPATH_IBCC_MAXPKTLEN_SHIFT; + /* in KB */ + ibc |= 0x5ULL << INFINIPATH_IBCC_FLOWCTRLWATERMARK_SHIFT; + /* how often flowctrl sent + * more or less in usecs; balance against watermark value, so that + * in theory senders always get a flow control update in time to not + * let the IB link go idle.
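
The link-state update in ipath_set_ib_lstate() above is a read-modify-write on a shadowed control register: clear both command fields, OR in the new command, then write the shadow back to the chip. The bare pattern, as a sketch (the mask/shift names and write_reg() below are placeholders):

static void set_link_cmd(uint64_t *shadow, uint64_t cmd)
{
	*shadow &= ~(LINKCMD_MASK << LINKCMD_SHIFT);	/* drop old command */
	*shadow |= (cmd & LINKCMD_MASK) << LINKCMD_SHIFT;
	write_reg(kr_ibcctrl, *shadow);			/* hypothetical writer */
}
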
+ */ + ibc |= 0x3ULL << INFINIPATH_IBCC_FLOWCTRLPERIOD_SHIFT; + /* max error tolerance */ + ibc |= 0xfULL << INFINIPATH_IBCC_PHYERRTHRESHOLD_SHIFT; + /* use "real" buffer space for */ + ibc |= 4ULL << INFINIPATH_IBCC_CREDITSCALE_SHIFT; + /* IB credit flow control. */ + ibc |= 0xfULL << INFINIPATH_IBCC_OVERRUNTHRESHOLD_SHIFT; + /* initially come up waiting for TS1, without sending anything. */ + dd->ipath_ibcctrl = ibc; + /* don't put linkinitcmd in ipath_ibcctrl, want that to stay a NOP */ + ibc |= + INFINIPATH_IBCC_LINKINITCMD_SLEEP << + INFINIPATH_IBCC_LINKINITCMD_SHIFT; + dd->ipath_flags |= IPATH_LINK_SLEEPING; + ipath_kput_kreg(t, kr_ibcctrl, ibc); + + ret = ipath_bringup_serdes(t); + + if (ret) + _IPATH_INFO("Could not initialize SerDes, not usable\n"); + else { + dd->ipath_control |= INFINIPATH_C_LINKENABLE; /* enable IBC */ + ipath_kput_kreg(t, kr_control, dd->ipath_control); + } + + return ret; +} + +/* + * called from ipath_shutdown_link(), and from sma doing a LINKDOWN + * Left as a separate function for historical reasons, and may want + * it to do more than just call ipath_set_ib_lstate() again sometime + * in the future. + */ +void ipath_down_link(const ipath_type t) +{ + ipath_set_ib_lstate(t, INFINIPATH_IBCC_LINKINITCMD_SLEEP << + INFINIPATH_IBCC_LINKINITCMD_SHIFT); +} + +/* + * do this when driver is being unloaded, or perhaps for diags, and + * maybe when we get an interrupt of a fatal link error that requires + * bringing the link down and back up + */ +static int ipath_shutdown_link(const ipath_type t) +{ + uint64_t val; + struct ipath_devdata *dd = &devdata[t]; + int ret = 0; + + _IPATH_DBG("Shutting down the link\n"); + ipath_down_link(t); + + /* + * we are shutting down, so tell the layered driver. We don't + * do this on just a link state change, much like ethernet, + * a cable unplug, etc. doesn't change driver state + */ + if (dd->ipath_layer.l_intr) + dd->ipath_layer.l_intr(t, IPATH_LAYER_INT_IF_DOWN); + + dd->ipath_control &= ~INFINIPATH_C_LINKENABLE; /* disable IBC */ + ipath_kput_kreg(t, kr_control, dd->ipath_control); + + *dd->ipath_statusp &= ~(IPATH_STATUS_IB_CONF | IPATH_STATUS_IB_READY); + + /* + * clear SerdesEnable and turn the leds off; do this here because + * we are unloading, so don't count on interrupts to move along + */ + + ipath_quiet_serdes(t); + val = dd->ipath_extctrl & + ~(INFINIPATH_EXTC_LEDPRIPORTGREENON | + INFINIPATH_EXTC_LEDPRIPORTYELLOWON); + dd->ipath_extctrl = val; + ipath_kput_kreg(t, kr_extctrl, val); + + if (dd->ipath_stats_timer_active) { + del_timer_sync(&dd->ipath_stats_timer); + dd->ipath_stats_timer_active = 0; + } + if (*dd->ipath_statusp & IPATH_STATUS_CHIP_PRESENT) { + /* can't do anything more with chip */ + /* needs re-init */ + *dd->ipath_statusp &= ~IPATH_STATUS_CHIP_PRESENT; + if (dd->ipath_kregbase) { + /* + * if we haven't already cleaned up before these + * are to ensure any register reads/writes "fail" + * until re-init + */ + dd->ipath_kregbase = NULL; + dd->ipath_kregvirt = NULL; + dd->ipath_uregbase = 0ULL; + dd->ipath_sregbase = 0ULL; + dd->ipath_cregbase = 0ULL; + dd->ipath_kregsize = 0; + } +#ifdef CONFIG_MTRR + if (dd->ipath_mtrr) { + _IPATH_VDBG("undoing WCCOMB on pio buffers\n"); + mtrr_del(dd->ipath_mtrr, 0, 0); + dd->ipath_mtrr = 0; + } +#endif + } + + return ret; +} + +/* + * when closing, free up any allocated data for a port, if the + * reference count goes to zero + * Note: this also frees the portdata itself!
+ */ +void ipath_free_pddata(struct ipath_devdata * dd, uint32_t port, int freehdrq) +{ + struct ipath_portdata *pd = dd->ipath_pd[port]; + + if (!pd) + return; + if (freehdrq) + /* + * only clear and free portdata if we are going to + * also release the hdrq, otherwise we leak the hdrq on each + * open/close cycle + */ + dd->ipath_pd[port] = NULL; + /* cleanup locked pages private data structures */ + ipath_upages_cleanup(pd); + if (freehdrq && pd->port_rcvhdrq) { + int i, n = 1 << pd->port_rcvhdrq_order; + _IPATH_VDBG("free closed port %d rcvhdrq @ %p (order=%u)\n", + pd->port_port, pd->port_rcvhdrq, + pd->port_rcvhdrq_order); + for (i = 0; i < n; i++) + ClearPageReserved(virt_to_page + (pd->port_rcvhdrq + (i * PAGE_SIZE))); + free_pages((unsigned long)pd->port_rcvhdrq, + pd->port_rcvhdrq_order); + pd->port_rcvhdrq = NULL; + } + if (port && pd->port_rcvegrbuf_pages) { /* always free this, however */ + void *virt; + unsigned e, i, n = 1 << pd->port_rcvegrbuf_order; + if (pd->port_rcvegrbuf_virt) { + for (e = 0; e < pd->port_rcvegrbuf_chunks; e++) { + virt = pd->port_rcvegrbuf_virt[e]; + for (i = 0; i < n; i++) + ClearPageReserved(virt_to_page + (virt + + (i * PAGE_SIZE))); + _IPATH_VDBG + ("egrbuf free_pages(%p, %x), chunk %u/%u\n", + virt, pd->port_rcvegrbuf_order, e, + pd->port_rcvegrbuf_chunks); + free_pages((unsigned long)virt, + pd->port_rcvegrbuf_order); + } + vfree(pd->port_rcvegrbuf_virt); + pd->port_rcvegrbuf_virt = NULL; + } + pd->port_rcvegrbuf_chunks = 0; + _IPATH_VDBG("free closed port %d rcvegrbufs ptr array\n", + pd->port_port); + /* now the pointer array. */ + vfree(pd->port_rcvegrbuf_pages); + pd->port_rcvegrbuf_pages = NULL; + } else if (port == 0 && dd->ipath_port0_skbs) { + unsigned e; + struct sk_buff **skbs = dd->ipath_port0_skbs; + + dd->ipath_port0_skbs = NULL; + _IPATH_VDBG("free closed port %d ipath_port0_skbs @ %p\n", + pd->port_port, skbs); + for (e = 0; e < dd->ipath_rcvegrcnt; e++) + if (skbs[e]) + dev_kfree_skb(skbs[e]); + vfree(skbs); + } + if (freehdrq) { + kfree(pd->port_tid_pg_list); + kfree(pd); + } +} + +int __init infinipath_init(void) +{ + int r = 0, i; + + _IPATH_DBG(KERN_INFO DRIVER_LOAD_MSG "%s", ipath_core_version); + + ipath_init_picotime(); /* init cycles -> pico conversion */ + + /* + * initialize the statusp to temporary storage so we can use it + * everywhere without first checking. When we "really" assign it, + * we copy from _ipath_status + */ + for (i = 0; i < infinipath_max; i++) + devdata[i].ipath_statusp = &devdata[i]._ipath_status; + + /* + * init these early, in case we take an interrupt as soon as the irq + * is setup. Saw a spinlock panic once that appeared to be due to that + * problem, when they were initted later on. + */ + spin_lock_init(&ipath_pioavail_lock); + spin_lock_init(&ipath_sma_lock); + + pci_register_driver(&infinipath_driver); + + driver_create_file(&(infinipath_driver.driver), &driver_attr_version); + + if ((r = register_chrdev(ipath_major, MODNAME, &ipath_fops))) + _IPATH_ERROR("Unable to register %s device\n", MODNAME); + + + /* + * never return an error, since we could have stuff registered, + * resources used, etc., even if no hardware found. This way we + * can clean up through unload. + */ + return 0; +} + +/* + * note: if for some reason the unload fails after this routine, and leaves + * the driver enterable by user code, we'll almost certainly crash and burn... 
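
infinipath_init() above registers a fixed character-device major with register_chrdev() and deliberately returns 0 even on failure, so a later unload can still clean up. The minimal registration/teardown pairing, for reference (MY_MAJOR, "mydev" and my_fops are hypothetical stand-ins):

static int __init example_init(void)
{
	int r = register_chrdev(MY_MAJOR, "mydev", &my_fops);

	if (r)
		printk(KERN_ERR "register_chrdev failed: %d\n", r);
	return 0;	/* keep going so unload can clean up, as above */
}

static void __exit example_exit(void)
{
	unregister_chrdev(MY_MAJOR, "mydev");
}
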
+ */ +static void __exit infinipath_cleanup(void) +{ + int r, m, port; + + driver_remove_file(&(infinipath_driver.driver), &driver_attr_version); + if ((r = unregister_chrdev(ipath_major, MODNAME))) + _IPATH_DBG("unregister of device failed: %d\n", r); + + + /* + * turn off rcv, send, and interrupts for all ports, all drivers + * should also hard reset the chip here? + * free up port 0 (kernel) rcvhdr, egr bufs, and eventually tid bufs + * for all versions of the driver, if they were allocated + */ + for (m = 0; m < infinipath_max; m++) { + uint64_t val; + struct ipath_devdata *dd = &devdata[m]; + if (dd->ipath_kregbase) { + /* in case unload fails, be consistent */ + dd->ipath_rcvctrl = 0U; + ipath_kput_kreg(m, kr_rcvctrl, dd->ipath_rcvctrl); + + /* + * gracefully stop all sends allowing any in + * progress to trickle out first. + */ + ipath_kput_kreg(m, kr_sendctrl, 0ULL); + val = ipath_kget_kreg64(m, kr_scratch); /* flush it */ + /* + * enough for anything that's going to trickle + * out to have actually done so. + */ + udelay(5); + + /* + * abort any armed or launched PIO buffers that + * didn't go. (self clearing). Will cause any + * packet currently being transmitted to go out + * with an EBP, and may also cause a short packet + * error on the receiver. + */ + ipath_kput_kreg(m, kr_sendctrl, INFINIPATH_S_ABORT); + + /* mask interrupts, but not errors */ + ipath_kput_kreg(m, kr_intmask, 0ULL); + ipath_shutdown_link(m); + + /* + * clear all interrupts and errors. Next time + * driver is loaded, we know that whatever is + * set happened while we were unloaded + */ + ipath_kput_kreg(m, kr_hwerrclear, -1LL); + ipath_kput_kreg(m, kr_errorclear, -1LL); + ipath_kput_kreg(m, kr_intclear, -1LL); + if (dd->__ipath_pioavailregs_base) { + kfree((void *)dd->__ipath_pioavailregs_base); + dd->__ipath_pioavailregs_base = NULL; + dd->ipath_pioavailregs_dma = NULL; + } + + if (dd->ipath_pageshadow) { + struct page **tmpp = dd->ipath_pageshadow; + int i, cnt = 0; + + _IPATH_VDBG + ("Unlocking any expTID pages still locked\n"); + for (port = 0; port < dd->ipath_cfgports; + port++) { + int port_tidbase = + port * dd->ipath_rcvtidcnt; + int maxtid = + port_tidbase + dd->ipath_rcvtidcnt; + for (i = port_tidbase; i < maxtid; i++) { + if (tmpp[i]) { + ipath_putpages(1, + &tmpp[i]); + tmpp[i] = NULL; + cnt++; + } + } + } + if (cnt) { + ipath_stats.sps_pageunlocks += cnt; + _IPATH_VDBG + ("There were still %u expTID entries locked\n", + cnt); + } + if (ipath_stats.sps_pagelocks + || ipath_stats.sps_pageunlocks) + _IPATH_VDBG + ("%llu pages locked, %llu unlocked via ipath_m{un}lock\n", + ipath_stats.sps_pagelocks, + ipath_stats.sps_pageunlocks); + + _IPATH_VDBG + ("Free shadow page tid array at %p\n", + dd->ipath_pageshadow); + vfree(dd->ipath_pageshadow); + dd->ipath_pageshadow = NULL; + } + + /* + * free any resources still in use (usually just + * kernel ports) at unload + */ + for (port = 0; port < dd->ipath_cfgports; port++) + ipath_free_pddata(dd, port, 1); + kfree(dd->ipath_pd); + /* + * debuggability, in case some cleanup path + * tries to use it after this + */ + dd->ipath_pd = NULL; + } + + if (dd->pcidev) { + if (dd->pcidev->irq) { + _IPATH_VDBG("unit %u free_irq of irq %x\n", m, + dd->pcidev->irq); + free_irq(dd->pcidev->irq, dd); + } else + _IPATH_DBG + ("irq is 0, not doing free_irq for unit %u\n", + m); + dd->pcidev = NULL; + } + if (dd->pci_registered) { + _IPATH_VDBG + ("Unregistering pci infrastructure unit %u\n", m); + pci_unregister_driver(&infinipath_driver); + dd->pci_registered = 0; + } 
else + _IPATH_VDBG + ("unit %u: no pci unreg, wasn't registered\n", m); + ipath_chip_cleanup(dd); /* clean up any per-chip chip-specific stuff */ + } + /* + * clean up any chip-specific stuff for now, only one type of chip + * for any given driver + */ + ipath_chip_done(); + + /* cleanup all our locked pages private data structures */ + ipath_upages_cleanup(NULL); +} + +/* This is a generic function here, so it can return device-specific + * info. This allows keeping in sync with the version that supports + * multiple chip types. +*/ +void ipath_get_boardname(const ipath_type t, char *name, size_t namelen) +{ + ipath_ht_get_boardname(t, name, namelen); +} + +module_init(infinipath_init); +module_exit(infinipath_cleanup); + +EXPORT_SYMBOL(infinipath_debug); +EXPORT_SYMBOL(ipath_get_boardname); + From bos at pathscale.com Wed Dec 28 16:31:28 2005 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 28 Dec 2005 16:31:28 -0800 Subject: [openib-general] [PATCH 9 of 20] ipath - core driver, part 2 of 4 In-Reply-To: Message-ID: Signed-off-by: Bryan O'Sullivan diff -r ddd21709e12c -r dad2e87e21f4 drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Wed Dec 28 14:19:42 2005 -0800 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Wed Dec 28 14:19:42 2005 -0800 @@ -1877,3 +1877,2004 @@ return ret; } + +/* + * cancel a range of PIO buffers, used when they might be armed, but + * not triggered. Used at init to ensure buffer state, and also user + * process close, in case it died while writing to a PIO buffer + */ + +static void ipath_disarm_piobufs(const ipath_type t, unsigned first, + unsigned cnt) +{ + unsigned i, last = first + cnt; + uint64_t sendctrl; + for (i = first; i < last; i++) { + sendctrl = devdata[t].ipath_sendctrl | INFINIPATH_S_DISARM | + (i << INFINIPATH_S_DISARMPIOBUF_SHIFT); + ipath_kput_kreg(t, kr_sendctrl, sendctrl); + } +} + +static void ipath_clean_partkey(struct ipath_portdata * pd, + struct ipath_devdata * dd) +{ + int i, j, pchanged = 0; + uint64_t oldpkey; + + /* for debugging only */ + oldpkey = + (uint64_t) dd->ipath_pkeys[0] | ((uint64_t) dd-> + ipath_pkeys[1] << 16) + | ((uint64_t) dd->ipath_pkeys[2] << 32) + | ((uint64_t) dd->ipath_pkeys[3] << 48); + + for (i = 0; i < (sizeof(pd->port_pkeys) / sizeof(pd->port_pkeys[0])); + i++) { + if (!pd->port_pkeys[i]) + continue; + _IPATH_VDBG("look for key[%d] %hx in pkeys\n", i, + pd->port_pkeys[i]); + for (j = 0; + j < (sizeof(dd->ipath_pkeys) / sizeof(dd->ipath_pkeys[0])); + j++) { + /* check for match independent of the global bit */ + if ((dd->ipath_pkeys[j] & 0x7fff) == + (pd->port_pkeys[i] & 0x7fff)) { + if (atomic_dec_and_test(&dd->ipath_pkeyrefs[j])) { + _IPATH_VDBG + ("p%u clear key %x matches #%d\n", + pd->port_port, pd->port_pkeys[i], + j); + ipath_stats.sps_pkeys[j] = + dd->ipath_pkeys[j] = 0; + pchanged++; + } else + _IPATH_VDBG + ("p%u key %x matches #%d, but ref still %d\n", + pd->port_port, pd->port_pkeys[i], + j, + atomic_read(&dd-> + ipath_pkeyrefs[j])); + break; + } + } + pd->port_pkeys[i] = 0; + } + if (pchanged) { + uint64_t pkey; + pkey = + (uint64_t) dd->ipath_pkeys[0] | ((uint64_t) dd-> + ipath_pkeys[1] << 16) + | ((uint64_t) dd->ipath_pkeys[2] << 32) + | ((uint64_t) dd->ipath_pkeys[3] << 48); + _IPATH_VDBG("p%u old pkey reg %llx, new pkey reg %llx\n", + pd->port_port, oldpkey, pkey); + ipath_kput_kreg(pd->port_unit, kr_partitionkey, pkey); + } +} + +static unsigned int ipath_poll(struct file *fp, struct poll_table_struct *pt) +{ + int ret; + struct ipath_portdata 
*pd; + + pd = port_fp(fp); + /* nothing for select/poll in this driver, at least for now */ + ret = 0; + + return ret; +} + +/* + * wait up to msecs milliseconds for IB link state change to occur + * for now, take the easy polling route. Currently used only by + * the SMA ioctls. Returns 0 if state reached, otherwise -ETIMEDOUT + * state can have multiple states set, for any of several transitions. + */ + +int ipath_wait_linkstate(const ipath_type t, uint32_t state, int msecs) +{ + devdata[t].ipath_sma_state_wanted = state; + wait_event_interruptible_timeout(ipath_sma_state_wait, + (devdata[t].ipath_flags & state), + msecs_to_jiffies(msecs)); + devdata[t].ipath_sma_state_wanted = 0; + + if (!(devdata[t].ipath_flags & state)) + _IPATH_DBG + ("Didn't reach linkstate %s within %u ms (ibcc %llx %s)\n", + /* test INIT ahead of DOWN, both can be set */ + (state & IPATH_LINKINIT) ? "INIT" : + ((state & IPATH_LINKDOWN) ? "DOWN" : + ((state & IPATH_LINKARMED) ? "ARM" : "ACTIVE")), + msecs, ipath_kget_kreg64(t, kr_ibcctrl), + ipath_ibcstatus_str[ipath_kget_kreg64(t, kr_ibcstatus) & + 0xf]); + return (devdata[t].ipath_flags & state) ? 0 : -ETIMEDOUT; +} + +/* unit number is already validated in ipath_ioctl() */ +static int ipath_kset_lid(uint32_t arg) +{ + unsigned unit = (arg >> 16) & 0xffff; + + if (unit >= infinipath_max + || !(devdata[unit].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", unit); + return -ENODEV; + } + + arg &= 0xffff; + _IPATH_SMADBG("Unit %u setting lid to 0x%x, was 0x%x\n", unit, arg, + devdata[unit].ipath_lid); + ipath_set_sps_lid(unit, arg); + return 0; +} + +static int ipath_kset_mlid(uint32_t arg) +{ + unsigned unit = (arg >> 16) & 0xffff; + + if (unit >= infinipath_max + || !(devdata[unit].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", unit); + return -ENODEV; + } + + arg &= 0xffff; + _IPATH_SMADBG("Unit %u setting mlid to 0x%x, was 0x%x\n", unit, arg, + devdata[unit].ipath_mlid); + ipath_stats.sps_mlid[unit] = devdata[unit].ipath_mlid = arg; + if (devdata[unit].ipath_layer.l_intr) + devdata[unit].ipath_layer.l_intr(unit, IPATH_LAYER_INT_BCAST); + return 0; +} + +/* unit number is in incoming, overwritten on return with data */ + +static int ipath_get_devstatus(uint64_t __user *a) +{ + int ret; + uint64_t unit64; + uint32_t unit; + uint64_t devstatus; + + if ((ret = copy_from_user(&unit64, a, sizeof unit64))) { + _IPATH_DBG("Failed to copy in unit: %d\n", ret); + return -EFAULT; + } + unit = unit64; + if (unit >= infinipath_max + || !(devdata[unit].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", unit); + return -ENODEV; + } + + devstatus = *devdata[unit].ipath_statusp; + + if ((ret = copy_to_user(a, &devstatus, sizeof devstatus))) { + _IPATH_DBG("Failed to copy out device status: %d\n", ret); + ret = -EFAULT; + } + return ret; +} + +/* unit number is in incoming, overwritten on return with data */ + +static int ipath_get_mlid(uint32_t __user *a) +{ + int ret; + uint32_t unit; + uint32_t mlid; + + if ((ret = copy_from_user(&unit, a, sizeof unit))) { + _IPATH_DBG("Failed to copy in mlid: %d\n", ret); + return -EFAULT; + } + if (unit >= infinipath_max + || !(devdata[unit].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", unit); + return -ENODEV; + } + + mlid = devdata[unit].ipath_mlid; + + if ((ret = copy_to_user(a, &mlid, sizeof mlid))) { + _IPATH_DBG("Failed to copy out MLID: %d\n", ret); + ret = -EFAULT; + } + return ret; +} + +static int ipath_kset_guid(struct ipath_setguid __user *a) +{ + struct 
ipath_setguid setguid; + int ret; + + if ((ret = copy_from_user(&setguid, a, sizeof setguid))) { + _IPATH_DBG("Failed to copy in guid info: %d\n", ret); + return -EFAULT; + } + if (setguid.sunit >= infinipath_max || + !(devdata[setguid.sunit].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %llu\n", setguid.sunit); + return -ENODEV; + } + if (setguid.sguid == 0ULL || setguid.sguid == -1LL) { + /* + * use INFO, not DBG, because ipath_mux doesn't yet + * complain about errors on this + */ + + _IPATH_INFO("Ignoring attempt to set invalid GUID %llx\n", + setguid.sguid); + return -EINVAL; + } + devdata[setguid.sunit].ipath_guid = setguid.sguid; + devdata[setguid.sunit].ipath_nguid = 1; + _IPATH_DBG("SMA set hardware GUID unit %llu to %llx (network order)\n", + setguid.sunit, devdata[setguid.sunit].ipath_guid); + return 0; +} + +/* + * receive an IB packet with QP 0 or 1. For now, we have no timeout implemented + * We put the actual received count into the iov on return, and the unit we + * received from goes into the lower 16 bits of sps_flags. + * This receives from all/any of the active chips, and we currently do not + * allow specifying just one (we could, by filling in unit in the library + * before the syscall, and checking here). + */ + +static int ipath_rcvsma_pkt(struct ipath_sendpkt __user *p) +{ + struct ipath_sendpkt rpkt; + int i, any, ret; + unsigned long flags; + + if ((ret = copy_from_user(&rpkt, p, sizeof rpkt))) { + _IPATH_DBG("Failed to copy in pkt struct (%d)\n", ret); + return -EFAULT; + } + if (!ipath_sma_data_spare) { + _IPATH_DBG("can't do receive, sma not initialized\n"); + return -ENETDOWN; + } + + for (any = i = 0; i < infinipath_max; i++) + if (devdata[i].ipath_flags & IPATH_INITTED) + any++; + if (!any) { /* no hardware, freeze, etc. */ + _IPATH_SMADBG("Didn't find any initialized and usable chips\n"); + return -ENODEV; + } + + wait_event_interruptible(ipath_sma_wait, + ipath_sma_data[ipath_sma_first].len); + + spin_lock_irqsave(&ipath_sma_lock, flags); + if (ipath_sma_data[ipath_sma_first].len) { + int len; + uint32_t slen; + uint8_t *sdata; + struct _ipath_sma_rpkt *smpkt = + &ipath_sma_data[ipath_sma_first]; + + /* + * we swap out the buffer we are going to use with the + * spare buffer and set spare to that buffer. This code + * is the only code that ever manipulates spare, other + * than the initialization code. This code should never + * be entered by more than one process at a time, and + * if it is, the user code doing so deserves what it gets; + * it won't break anything in the driver by doing so. + * We do it this way to avoid holding a lock across the + * copy_to_user, which could fault, or delay a long time + * while paging occurs; ditto for printks + */ + + slen = smpkt->len; + sdata = smpkt->buf; + rpkt.sps_flags = smpkt->unit; + smpkt->buf = ipath_sma_data_spare; + ipath_sma_data_spare = sdata; + smpkt->len = 0; /* it's available again */ + if (++ipath_sma_first >= IPATH_NUM_SMAPKTS) + ipath_sma_first = 0; + spin_unlock_irqrestore(&ipath_sma_lock, flags); + + len = min((uint32_t) rpkt.sps_iov[0].iov_len, slen); + ret = copy_to_user((void __user *) rpkt.sps_iov[0].iov_base, + sdata, len); + _IPATH_VDBG("SMA packet (index=%d), len %d (actual %d) " + "buf %p, ubuf %llx\n", ipath_sma_first, slen, + len, sdata, rpkt.sps_iov[0].iov_base); + if (!ret) { + /* actual length read. 
*/ + rpkt.sps_iov[0].iov_len = len; + rpkt.sps_cnt = 1; /* received one packet */ + if ((ret = copy_to_user(p, &rpkt, sizeof rpkt))) { + _IPATH_DBG("Failed to copy out pkt struct " + "(%d)\n", ret); + ret = -EFAULT; + } + } else { + _IPATH_DBG("copyout failed: %d\n", ret); + ret = -EFAULT; + } + } else { + /* usually means SMA process received a signal */ + spin_unlock_irqrestore(&ipath_sma_lock, flags); + return -EAGAIN; + } + + return ret; +} + +/* unit number is in first word incoming, overwritten on return with data */ +static int ipath_get_portinfo(uint32_t __user *a) +{ + int ret; + uint32_t unit, tmp, tmp2; + struct ipath_devdata *dd; + uint32_t portinfo[13]; /* just the data for Portinfo, in host order */ + + if ((ret = copy_from_user(&unit, a, sizeof unit))) { + _IPATH_DBG("Failed to copy in portinfo: %d\n", ret); + return -EFAULT; + } + if (unit >= infinipath_max + || !(devdata[unit].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG("Invalid unit %u\n", unit); + return -ENODEV; + } + dd = &devdata[unit]; + /* so we only initialize non-zero fields. */ + memset(portinfo, 0, sizeof portinfo); + + /* + * Notimpl yet M_Key (64) + * Notimpl yet GID (64) + */ + + portinfo[4] = (dd->ipath_lid << 16); + + /* + * Notimpl yet SMLID (should we store this in the driver, in + * case SMA dies?) + * CapabilityMask is 0, we don't support any of these + * DiagCode is 0; we don't store any diag info for now + * Notimpl yet M_KeyLeasePeriod (we don't support M_Key) + */ + + /* LocalPortNum is whichever port number they ask for */ + portinfo[7] = (unit << 24) + /* LinkWidthEnabled */ + |(2 << 16) + /* LinkWidthSupported (really 2, but that's not IB valid...) */ + |(3 << 8) + /* LinkWidthActive */ + |(2 << 0); + tmp = dd->ipath_lastibcstat & 0xff; + tmp2 = 5; + if (tmp == 0x11) + tmp = 2; + else if (tmp == 0x21) + tmp = 3; + else if (tmp == 0x31) + tmp = 4; + else { + tmp = 0; /* down */ + tmp2 = tmp & 0xf; + } + portinfo[8] = (1 << 28) /* LinkSpeedSupported */ + |(tmp << 24) /* PortState */ + |(tmp2 << 20) /* PortPhysicalState */ + |(2 << 16) + + /* LinkDownDefaultState */ + /* M_KeyProtectBits == 0 */ + /* NotImpl yet LMC == 0 (we can support all values) */ + |(1 << 4) /* LinkSpeedActive */ + |(1 << 0); /* LinkSpeedEnabled */ + switch (dd->ipath_ibmtu) { + case 4096: + tmp = 5; + break; + case 2048: + tmp = 4; + break; + case 1024: + tmp = 3; + break; + case 512: + tmp = 2; + break; + case 256: + tmp = 1; + break; + default: /* oops, something is wrong */ + _IPATH_DBG + ("Problem, ipath_ibmtu 0x%x not a valid IB MTU, treat as 2048\n", + dd->ipath_ibmtu); + tmp = 4; + break; + } + portinfo[9] = (tmp << 28) + /* NeighborMTU */ + /* Notimpl MasterSMSL */ + |(1 << 20) + + /* VLCap */ + /* Notimpl InitType (actually, an SMA decision) */ + /* VLHighLimit is 0 (only one VL) */ + ; /* VLArbitrationHighCap is 0 (only one VL) */ + portinfo[10] = /* VLArbitrationLowCap is 0 (only one VL) */ + /* InitTypeReply is SMA decision */ + (5 << 16) /* MTUCap 4096 */ + |(7 << 13) /* VLStallCount */ + |(0x1f << 8) /* HOQLife */ + |(1 << 4) /* OperationalVLs 0 */ + + /* PartitionEnforcementInbound */ + /* PartitionEnforcementOutbound not enforced */ + /* FilterRawinbound not enforced */ + ; /* FilterRawOutbound not enforced */ + /* M_KeyViolations are not counted by hardware, SMA can count */ + tmp = ipath_kget_creg32(unit, cr_errpkey); + /* P_KeyViolations are counted by hardware.
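
Each PortInfo word above is hand-packed with shifts and ORs in host order. The general shape of that packing, sketched with field positions taken from the code above:

static uint32_t pack_portinfo_word8(uint32_t speed_sup, uint32_t state,
				    uint32_t physstate)
{
	return (speed_sup << 28)	/* LinkSpeedSupported */
	     | (state << 24)		/* PortState */
	     | (physstate << 20);	/* PortPhysicalState */
}
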
*/
+	portinfo[11] = ((tmp & 0xffff) << 0);
+	portinfo[12] =
+		/* Q_KeyViolations are not counted by hardware */
+		(1 << 8)
+
+		/* GUIDCap */
+		/* SubnetTimeOut handled by SMA */
+		/* RespTimeValue handled by SMA */
+		;
+	/* LocalPhyErrors are programmed to max */
+	portinfo[12] |= (0xf << 20)
+		|(0xf << 16)	/* OverRunErrors are programmed to max */
+		;
+
+	if ((ret = copy_to_user(a, portinfo, sizeof portinfo))) {
+		_IPATH_DBG("Failed to copy out portinfo: %d\n", ret);
+		ret = -EFAULT;
+	}
+	return ret;
+}
+
+/* unit number is in first word incoming, overwritten on return with data */
+static int ipath_get_nodeinfo(uint32_t __user *a)
+{
+	int ret;
+	uint32_t unit;	/*, tmp, tmp2; */
+	struct ipath_devdata *dd;
+	uint32_t nodeinfo[10];	/* just the data for Nodeinfo, in host order */
+
+	if ((ret = copy_from_user(&unit, a, sizeof unit))) {
+		_IPATH_DBG("Failed to copy in nodeinfo: %d\n", ret);
+		return -EFAULT;
+	}
+	if (unit >= infinipath_max
+	    || !(devdata[unit].ipath_flags & IPATH_INITTED)) {
+		/* VDBG because sma normally probes for all possible units */
+		_IPATH_VDBG("Invalid unit %u\n", unit);
+		return -ENODEV;
+	}
+	dd = &devdata[unit];
+
+	/* so we only initialize non-zero fields. */
+	memset(nodeinfo, 0, sizeof nodeinfo);
+
+	nodeinfo[0] =	/* BaseVersion is SMA */
+		/* ClassVersion is SMA */
+		(1 << 8)	/* NodeType */
+		|(1 << 0);	/* NumPorts */
+	nodeinfo[1] = (uint32_t) (dd->ipath_guid >> 32);
+	nodeinfo[2] = (uint32_t) (dd->ipath_guid & 0xffffffff);
+	nodeinfo[3] = nodeinfo[1];	/* PortGUID == SystemImageGUID for us */
+	nodeinfo[4] = nodeinfo[2];	/* PortGUID == SystemImageGUID for us */
+	nodeinfo[5] = nodeinfo[3];	/* PortGUID == NodeGUID for us */
+	nodeinfo[6] = nodeinfo[4];	/* PortGUID == NodeGUID for us */
+	nodeinfo[7] = (4 << 16)	/* we support 4 pkeys */
+		|(dd->ipath_deviceid << 0);
+	/* our chip version as 16 bits major, 16 bits minor */
+	nodeinfo[8] = dd->ipath_minrev | (dd->ipath_majrev << 16);
+	nodeinfo[9] = (unit << 24) | (dd->ipath_vendorid << 0);
+
+	if ((ret = copy_to_user(a, nodeinfo, sizeof nodeinfo))) {
+		_IPATH_DBG("Failed to copy out nodeinfo: %d\n", ret);
+		ret = -EFAULT;
+	}
+	return ret;
+}
+
+static int ipath_sma_ioctl(struct file *fp, unsigned int cmd, unsigned long a)
+{
+	int ret = 0;
+	switch (cmd) {
+	case IPATH_SEND_SMA_PKT:	/* send SMA packet */
+		if (!(ret = ipath_send_smapkt((struct ipath_sendpkt __user *) a)))
+			/* another SMA packet sent */
+			ipath_stats.sps_sma_spkts++;
+		break;
+	case IPATH_RCV_SMA_PKT:	/* receive an SMA or MAD packet */
+		ret = ipath_rcvsma_pkt((struct ipath_sendpkt __user *) a);
+		break;
+	case IPATH_SET_LID:	/* set our lid, (SMA) */
+		ret = ipath_kset_lid((uint32_t) a);
+		break;
+	case IPATH_SET_MTU:	/* set the IB mtu (not maxpktlen) (SMA) */
+		ret = ipath_kset_mtu((uint32_t) a);
+		break;
+	case IPATH_SET_LINKSTATE:
+		/* walk through the linkstate states (SMA) */
+		ret = ipath_kset_linkstate((uint32_t) a);
+		break;
+	case IPATH_GET_PORTINFO:	/* get the SMA portinfo */
+		ret = ipath_get_portinfo((uint32_t __user *) a);
+		break;
+	case IPATH_GET_NODEINFO:	/* get the SMA nodeinfo */
+		ret = ipath_get_nodeinfo((uint32_t __user *) a);
+		break;
+	case IPATH_SET_GUID:
+		/*
+		 * set our guid, (SMA). This is not normally
+		 * used, but provides a way to set the GUID when the i2c flash
+		 * has a problem, or for special testing.
+ */ + ret = ipath_kset_guid((struct ipath_setguid __user *) a); + break; + case IPATH_SET_MLID: /* set multicast LID for ipath broadcast */ + ret = ipath_kset_mlid((uint32_t) a); + break; + case IPATH_GET_MLID: /* get multicast LID for ipath broadcast */ + ret = ipath_get_mlid((uint32_t __user *) a); + break; + case IPATH_GET_DEVSTATUS: /* get device status */ + ret = ipath_get_devstatus((uint64_t __user *) a); + break; + default: + _IPATH_DBG("%x not a valid SMA ioctl for infinipath\n", cmd); + ret = -EINVAL; + break; + } + return ret; +} + +static int ipath_get_unit_counters(struct infinipath_getunitcounters __user *a) +{ + struct infinipath_getunitcounters c; + + if (copy_from_user(&c, a, sizeof c)) + return -EFAULT; + + if (c.unit >= infinipath_max || + !(devdata[c.unit].ipath_flags & IPATH_PRESENT)) + return -ENODEV; + + return ipath_get_counters(c.unit, + (struct infinipath_counters __user *) c.data); +} + +/* + * ioctls for the control device, which is useful when you don't want + * to open the main device and use up a port. + */ + +static int ipath_ctrl_ioctl(struct file *fp, unsigned int cmd, unsigned long a) +{ + int ret = 0; + + switch (cmd) { + case IPATH_GETSTATS: /* return driver stats */ + ret = ipath_get_stats((struct infinipath_stats __user *) a); + break; + case IPATH_GETUNITCOUNTERS: /* return chip counters */ + ret = ipath_get_unit_counters((struct infinipath_getunitcounters __user *) a); + break; + default: + _IPATH_DBG("%x not a valid CTRL ioctl for infinipath\n", cmd); + ret = -EINVAL; + break; + } + + return ret; +} + +long ipath_ioctl(struct file *fp, unsigned int cmd, unsigned long a) +{ + int ret = 0; + struct ipath_portdata *pd; + ipath_type unit; + uint32_t tmp, i, nactive = 0; + + if (cmd == IPATH_GETUNITS) { + /* + * Return number of units supported. This is called + * here as this ioctl is needed via both the normal and + * diags interface, and it does not need the device to + * be opened. + */ + return ipath_get_units(); + } + + pd = port_fp(fp); + if (!pd) { + if (IPATH_SMA == (unsigned long)fp->private_data) + /* sma separate; no pd */ + return (long)ipath_sma_ioctl(fp, cmd, a); +#ifdef IPATH_DIAG + else if (IPATH_DIAG == (unsigned long)fp->private_data) + /* diags separate; no pd */ + return (long)ipath_diags_ioctl(fp, cmd, a); +#endif + else if (IPATH_CTRL == (unsigned long)fp->private_data) + /* ctrl separate; no pd */ + return (long)ipath_ctrl_ioctl(fp, cmd, a); + else { + _IPATH_DBG("NULL pd from fp (%p), cmd=%x\n", fp, cmd); + return -ENODEV; /* bad; shouldn't ever happen */ + } + } + + unit = pd->port_unit; + + if ((devdata[unit].ipath_flags & IPATH_PRESENT) + && (cmd == IPATH_GETCOUNTERS || cmd == IPATH_GETSTATS + || cmd == IPATH_READ_EEPROM || cmd == IPATH_WRITE_EEPROM)) { + /* allowed to do these, as long as chip is accessible */ + } else if (!(devdata[unit].ipath_flags & IPATH_INITTED)) { + _IPATH_DBG + ("%s not initialized (flags=0x%x), failing ioctl #%u\n", + ipath_get_unit_name(unit), devdata[unit].ipath_flags, + _IOC_NR(cmd)); + ret = -ENODEV; + } else + if ((devdata[unit]. 
+		   ipath_flags & (IPATH_LINKDOWN | IPATH_LINKUNK))) {
+		_IPATH_DBG("%s link is down, failing ioctl #%u\n",
+			   ipath_get_unit_name(unit), _IOC_NR(cmd));
+		ret = -ENETDOWN;
+	}
+
+	if (ret)
+		return ret;
+
+	switch (cmd) {
+	case IPATH_USERINIT:
+		/* real application is starting on a port */
+		ret = ipath_do_user_init(pd, (struct ipath_user_info __user *) a);
+		break;
+	case IPATH_BASEINFO:
+		/* it's done the init, now return the info it needs */
+		ret = ipath_get_baseinfo(pd, (struct ipath_base_info __user *) a);
+		break;
+	case IPATH_GETPORT:
+		/*
+		 * just return the unit:port that we were assigned,
+		 * and the number of active chips. This is used for
+		 * doing sched_setaffinity() before initialization.
+		 */
+		for (i = 0; i < infinipath_max; i++)
+			if ((devdata[i].ipath_flags & IPATH_PRESENT)
+			    && devdata[i].ipath_kregbase
+			    && devdata[i].ipath_lid
+			    && !(devdata[i].ipath_flags &
+				 (IPATH_LINKDOWN | IPATH_LINKUNK)))
+				nactive++;
+		tmp = (nactive << 24) | (unit << 16) | pd->port_port;
+		if (copy_to_user((void __user *) a, &tmp, sizeof(tmp)))
+			ret = -EFAULT;
+		break;
+	case IPATH_GETLID:
+		/* get LID for given unit # */
+		ret = ipath_layer_get_lid(a);
+		break;
+	case IPATH_UPDM_TID:	/* update expected TID entries */
+		ret = ipath_tid_update(pd, (struct _tidupd __user *) a);
+		break;
+	case IPATH_FREE_TID:	/* free expected TID entries */
+		ret = ipath_tid_free(pd, (struct _tidupd __user *) a);
+		break;
+	case IPATH_GETCOUNTERS:	/* return chip counters */
+		ret = ipath_get_counters(unit, (struct infinipath_counters __user *) a);
+		break;
+	case IPATH_GETSTATS:	/* return driver stats */
+		ret = ipath_get_stats((struct infinipath_stats __user *) a);
+		break;
+	case IPATH_GETUNITCOUNTERS:	/* return chip counters */
+		ret = ipath_get_unit_counters(
+			(struct infinipath_getunitcounters __user *) a);
+		break;
+	case IPATH_SET_PKEY:	/* set a partition key */
+		ret = ipath_set_partkey(pd, (uint16_t) a);
+		break;
+	case IPATH_RCVCTRL:	/* error handling to manage the rcvq */
+		ret = ipath_manage_rcvq(pd, (uint16_t) a);
+		break;
+	case IPATH_WRITE_EEPROM:
+		/* write the eeprom (for GUID) */
+		ret = ipath_wr_eeprom(pd,
+				      (struct ipath_eeprom_req __user *) a);
+		break;
+	case IPATH_READ_EEPROM:	/* read the eeprom (for GUID) */
+		ret = ipath_rd_eeprom(pd->port_unit,
+				      (struct ipath_eeprom_req __user *) a);
+		break;
+	case IPATH_WAIT:
+		/*
+		 * wait for a receive intr for this port, or PIO avail
+		 */
+		ret = ipath_wait_intr(pd, (uint32_t) a);
+		break;
+
+	default:
+		_IPATH_DBG("cmd %x (%c,%u) not a valid ioctl\n", cmd,
+			   _IOC_TYPE(cmd), _IOC_NR(cmd));
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
+static loff_t ipath_llseek(struct file *fp, loff_t off, int whence)
+{
+	loff_t ret;
+
+	/* range checking is done where offset is used, not here. */
+	down(&fp->f_dentry->d_inode->i_sem);
+	if (!whence)
+		ret = fp->f_pos = off;
+	else if (whence == 1) {
+		fp->f_pos += off;
+		ret = fp->f_pos;
+	} else
+		ret = -EINVAL;
+	up(&fp->f_dentry->d_inode->i_sem);
+	_IPATH_DBG("New offset %llx from seek %llx whence=%d\n", fp->f_pos, off,
+		   whence);
+
+	return ret;
+}
+
+/*
+ * We use this to have a shared buffer between the kernel and the user
+ * code for the rcvhdr queue, egr buffers, and the per-port user regs and pio
+ * buffers in the chip. We have the open and close entries so we can bump
+ * the ref count and keep the driver from being unloaded while still mapped.
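+ *
+ * (Editorial sketch, not part of the original patch: userspace reaches
+ * these mappings with ordinary mmap() calls whose file offset is the
+ * address token handed back by the IPATH_BASEINFO ioctl, e.g.
+ *
+ *	uregs = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
+ *		     MAP_SHARED, fd, uregbase);
+ *
+ * where fd is the opened device and uregbase stands in for whatever
+ * ipath_get_baseinfo() reported; the names are illustrative only.)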
+ */ + +static struct vm_operations_struct ipath_vmops = { + .nopage = ipath_nopage, +}; + +static int ipath_mmap(struct file *fp, struct vm_area_struct *vm) +{ + int setlen = 0, ret = -EINVAL; + struct ipath_portdata *pd; + + if (fp->private_data && 255UL < (unsigned long)fp->private_data) { + pd = port_fp(fp); + { + /* + * This is the ipath_do_user_init() code, + * mapping the shared buffers into the user + * process. The address referred to by vm_pgoff + * is the virtual, not physical, address; we only + * do one mmap for each space mapped. + */ + uint64_t pgaddr, ureg; + + pgaddr = vm->vm_pgoff << PAGE_SHIFT; + + /* + * note that ureg does *NOT* have the kregvirt + * as part of it, to be sure that for 32 bit + * programs, we don't end up trying to map + * a > 44 address. Has to match ipath_get_baseinfo() + * code that sets __spi_uregbase + */ + + ureg = devdata[pd->port_unit].ipath_uregbase + + devdata[pd->port_unit].ipath_palign * pd->port_port; + + _IPATH_MMDBG + ("ushare: pgaddr %llx vm_start=%lx, vmlen %lx\n", + pgaddr, vm->vm_start, vm->vm_end - vm->vm_start); + + if (pgaddr == ureg) { + /* it's the real hardware, so io_remap works */ + unsigned long phys; + if ((vm->vm_end - vm->vm_start) > PAGE_SIZE) { + _IPATH_INFO + ("FAIL mmap userreg: reqlen %lx > PAGE\n", + vm->vm_end - vm->vm_start); + ret = -EFAULT; + } else { + phys = + devdata[pd->port_unit]. + ipath_physaddr + ureg; + vm->vm_page_prot = + pgprot_noncached(vm->vm_page_prot); + + vm->vm_flags |= + VM_DONTCOPY | VM_DONTEXPAND | VM_IO + | VM_SHM | VM_LOCKED; + ret = + io_remap_pfn_range(vm, vm->vm_start, phys >> PAGE_SHIFT, + vm->vm_end - vm->vm_start, + vm->vm_page_prot); + } + } else if (pgaddr == pd->port_piobufs) { + /* + * We use io_remap, so there is not a + * nopage handler for this case! + * when we map the PIO buffers, we want + * to map them as writeonly, no read possible. + */ + + unsigned long phys; + if ((vm->vm_end - vm->vm_start) > + (devdata[pd->port_unit].ipath_pbufsport * + devdata[pd->port_unit].ipath_palign)) { + _IPATH_INFO + ("FAIL mmap piobufs: reqlen %lx > PAGE\n", + vm->vm_end - vm->vm_start); + ret = -EFAULT; + } else { + phys = + devdata[pd->port_unit]. + ipath_physaddr + pd->port_piobufs; + /* + * Do *NOT* mark this as + * non-cached (PWT bit), or we + * don't get the write combining + * behavior we want on the + * PIO buffers! + * vm->vm_page_prot = pgprot_noncached(vm->vm_page_prot); + */ + +#if defined (pgprot_writecombine) && defined(_PAGE_MA_WC) + /* Enable WC */ + vm->vm_page_prot = + pgprot_writecombine(vm-> + vm_page_prot); +#endif + + if (vm->vm_flags & VM_READ) { + _IPATH_INFO + ("Can't map piobufs as readable (flags=%lx)\n", + vm->vm_flags); + ret = -EPERM; + } else { + /* + * don't allow them to + * later change to readable + * with mprotect + */ + + vm->vm_flags &= ~VM_MAYWRITE; + + vm->vm_flags |= + VM_DONTCOPY | VM_DONTEXPAND + | VM_IO | VM_SHM | + VM_LOCKED; + ret = + io_remap_pfn_range(vm, vm->vm_start, phys >> PAGE_SHIFT, + vm->vm_end - vm->vm_start, + vm->vm_page_prot); + } + } + } else if (pgaddr == (uint64_t) pd->port_rcvegr_phys) { + if (!pd->port_rcvegrbuf_virt) + return -EFAULT; + /* + * page_alloc'ed egr memory, not + * physically contiguous + * *BUT* to work around the 32 bit mmap64 + * only handling 44 bits, we have remapped + * the first page to kernel virtual, so + * we have to do the conversion here to + * get back to the original virtual + * address (not contig pages) so we have + * to mark this for special handling. 
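+			 * (The 44 bit limit is just arithmetic: a
+			 * 32 bit vm_pgoff shifted left by PAGE_SHIFT,
+			 * 12 with 4K pages, gives at most a 2^44 byte
+			 * offset.)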
+			 */
+
+			/*
+			 * not egrbufs * egrsize since they are
+			 * no longer virtually contiguous.
+			 */
+			setlen = pd->port_rcvegrbuf_chunks * PAGE_SIZE *
+				(1 << pd->port_rcvegrbuf_order);
+			if ((vm->vm_end - vm->vm_start) > setlen) {
+				_IPATH_INFO
+				    ("FAIL on egr bufs: reqlen %lx > actual %x\n",
+				     vm->vm_end - vm->vm_start, setlen);
+				ret = -EFAULT;
+			} else {
+				vm->vm_ops = &ipath_vmops;
+				vm->vm_private_data =
+					(void *)(3 | (uint64_t) pd);
+				if (vm->vm_flags & VM_WRITE) {
+					_IPATH_INFO
+					    ("Can't map eager buffers as writable (flags=%lx)\n",
+					     vm->vm_flags);
+					ret = -EPERM;
+				} else {
+					/*
+					 * don't allow them to
+					 * later change to writeable
+					 * with mprotect
+					 */
+
+					vm->vm_flags &= ~VM_MAYWRITE;
+					_IPATH_MMDBG
+					    ("egrbufs, set private to %p, not %llx\n",
+					     vm->vm_private_data, pgaddr);
+					ret = 0;
+				}
+			}
+		} else if (pgaddr == (uint64_t) pd->port_rcvhdrq_phys) {
+			/*
+			 * kmalloc'ed memory, physically
+			 * contiguous; this is from
+			 * spi_rcvhdr_base; we allow user to
+			 * map read-write so they can write
+			 * hdrq entries to allow protocol code
+			 * to directly poll whether a hdrq entry
+			 * has been written.
+			 */
+			setlen = ALIGN(devdata[pd->port_unit].ipath_rcvhdrcnt *
+				       devdata[pd->port_unit].ipath_rcvhdrentsize *
+				       sizeof(uint32_t), PAGE_SIZE);
+			if ((vm->vm_end - vm->vm_start) > setlen) {
+				_IPATH_INFO
+				    ("FAIL on rcvhdrq: reqlen %lx > actual %x\n",
+				     vm->vm_end - vm->vm_start, setlen);
+				ret = -EFAULT;
+			} else {
+				vm->vm_ops = &ipath_vmops;
+				vm->vm_private_data = (void *)(pgaddr | 1);
+				ret = 0;
+			}
+		}
+		/*
+		 * when we map the PIO bufferavail registers,
+		 * we want to map them as readonly, no write
+		 * possible.
+		 */
+		else if (pgaddr == devdata[pd->port_unit].ipath_pioavailregs_phys) {
+			/*
+			 * kmalloc'ed memory, physically
+			 * contiguous, one page only, readonly
+			 */
+			setlen = PAGE_SIZE;
+			if ((vm->vm_end - vm->vm_start) > setlen) {
+				_IPATH_INFO
+				    ("FAIL on pioavailregs_dma: reqlen %lx > actual %x\n",
+				     vm->vm_end - vm->vm_start, setlen);
+				ret = -EFAULT;
+			} else if (vm->vm_flags & VM_WRITE) {
+				_IPATH_INFO
+				    ("Can't map pioavailregs as writable (flags=%lx)\n",
+				     vm->vm_flags);
+				ret = -EPERM;
+			} else {
+				/*
+				 * don't allow them to later
+				 * change with mprotect
+				 */
+				vm->vm_flags &= ~VM_MAYWRITE;
+				vm->vm_ops = &ipath_vmops;
+				vm->vm_private_data = (void *)(pgaddr | 2);
+				ret = 0;
+			}
+		}
+		if (!ret && setlen) {
+			/* keep page(s) from being swapped, etc. */
+			vm->vm_flags |=
+				VM_DONTEXPAND | VM_DONTCOPY | VM_RESERVED |
+				VM_IO | VM_SHM;
+		} else {
+			/* failure, or io_remap case */
+			vm->vm_private_data = NULL;
+			if (ret)
+				_IPATH_INFO
+				    ("Failure %d, setlen %d, on addr %lx, off %lx\n",
+				     ret, setlen, vm->vm_start, vm->vm_pgoff);
+		}
+		}
+	} else		/* something very wrong */
+		_IPATH_INFO("fp_private wasn't set, no mmapping\n");
+
+	return ret;
+}
+
+/* page fault handler. For each page that is first faulted in from the
+ * mmap'ed shared address buffer, this routine is called.
+ * It's always for a single page.
+ * We use the low bits of the private_data field to tell us which case
+ * we are dealing with.
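+ *
+ * For clarity (editorial note): the tags set in ipath_mmap() above are
+ * case 1 for the rcvhdrq, stored as (pgaddr | 1); case 2 for the PIO
+ * bufferavail registers, stored as (pgaddr | 2); and case 3 for the
+ * eager buffers, stored as (3 | pd) with pd the port data pointer.
+ * The handler recovers the tag and the original value with
+ *
+ *	avirt = (unsigned long)vma->vm_private_data;
+ *	which = avirt & 3;
+ *	avirt &= ~3ULL;
+ *
+ * which is safe only because all three values are at least 4-byte
+ * aligned.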
+ */ + +static struct page *ipath_nopage(struct vm_area_struct *vma, unsigned long addr, + int *type) +{ + unsigned long avirt, /* the original [kv]malloc virtual address */ + paddr, /* physical address */ + off; /* calculated page offset */ + uint32_t which, chunk; + void *vaddr = NULL; + struct ipath_portdata *pd; + struct page *vpage = NOPAGE_SIGBUS; + + if (!(avirt = (unsigned long)vma->vm_private_data)) { + _IPATH_DBG("NULL private_data, vm_pgoff %lx\n", vma->vm_pgoff); + which = 0; /* quiet incorrect gcc warning */ + goto done; + } + which = avirt & 3; + avirt &= ~3ULL; + + if (addr > vma->vm_end) { + _IPATH_DBG("trying to fault in addr %lx past end\n", addr); + goto done; + } + + /* + * most of our memory is vmalloc'ed, but rcvhdr Q is physically + * contiguous, either from kmalloc or alloc_pages() + * pgoff is virtual. + */ + switch (which) { + case 1: /* rcvhdrq_phys */ + /* should always be 0 */ + off = vma->vm_pgoff - (avirt >> PAGE_SHIFT); + paddr = addr - vma->vm_start + (off << PAGE_SHIFT) + avirt; + _IPATH_MMDBG("hdrq %lx (u=%lx)\n", paddr, addr); + vpage = pfn_to_page(paddr >> PAGE_SHIFT); + break; + case 2: /* PIO buffer avail regs */ + /* should always be 0 */ + off = vma->vm_pgoff - (avirt >> PAGE_SHIFT); + paddr = (addr - vma->vm_start + (off << PAGE_SHIFT) + avirt); + _IPATH_MMDBG("pioav %lx\n", paddr); + vpage = pfn_to_page(paddr >> PAGE_SHIFT); + break; + case 3: + /* + * rcvegrbufs; page_alloc()'ed like rcvhdrq, but we + * have to pick out which page_alloc()'ed chunk it is. + */ + pd = (struct ipath_portdata *) avirt; + /* this should always be 0 */ + off = + vma->vm_pgoff - + ((unsigned long)pd->port_rcvegr_phys >> PAGE_SHIFT); + off = (addr - vma->vm_start + (off << PAGE_SHIFT)); + + chunk = off / (PAGE_SIZE * (1 << pd->port_rcvegrbuf_order)); + if (chunk > pd->port_rcvegrbuf_chunks) + _IPATH_DBG("Bad egrbuf chunk %u (max %u); off = %lx\n", + chunk, pd->port_rcvegrbuf_chunks, off); + vaddr = pd->port_rcvegrbuf_virt[chunk] + + off % (PAGE_SIZE * (1 << pd->port_rcvegrbuf_order)); + paddr = virt_to_phys(vaddr); + vpage = pfn_to_page(paddr >> PAGE_SHIFT); + _IPATH_MMDBG("egrb %p,%lx\n", vaddr, paddr); + break; + default: + _IPATH_DBG + ("trying to fault in mmap addr %lx (avirt %lx) that isn't known (case %u)\n", + addr, avirt, which); + } + +done: + if (vpage != NOPAGE_SIGBUS && vpage != NOPAGE_OOM) { + if (which == 2) + /* + * media/video/video-buf.c doesn't do get_page() for + * buffer from alloc_page(). Hmmm. + * + * keep it from being swapped, complaints if + * process exits before we [vf]free it, etc, + * and keep shared page counts correct, etc. + */ + get_page(vpage); + mark_page_accessed(vpage); + if (type) + *type = VM_FAULT_MINOR; + } else + _IPATH_DBG("faultin of addr %lx vaddr %p avirt %lx failed\n", + addr, vaddr, avirt); + + return vpage; +} + +/* this is separate to allow for better optimization of ipath_intr() */ + +static void ipath_bad_intr(const ipath_type t, uint32_t * unexpectp) +{ + struct ipath_devdata *dd = &devdata[t]; + + /* + * sometimes happen during driver init and unload, don't want + * to process any interrupts at that point + */ + + /* this is just a bandaid, not a fix, if something goes badly wrong */ + if (++*unexpectp > 100) { + if (++*unexpectp > 105) { + /* + * ok, we must be taking somebody else's interrupts, + * due to a messed up mptable and/or PIRQ table, so + * unregister the interrupt. We've seen this + * during linuxbios development work, and it + * may happen in the future again. 
+ */ + if (dd->pcidev && dd->pcidev->irq) { + _IPATH_UNIT_ERROR(t, + "Now %u unexpected interrupts, unregistering interrupt handler\n", + *unexpectp); + _IPATH_DBG("free_irq of irq %x\n", + dd->pcidev->irq); + free_irq(dd->pcidev->irq, dd); + dd->pcidev->irq = 0; + } + } + if (ipath_kget_kreg32(t, kr_intmask)) { + _IPATH_UNIT_ERROR(t, + "%u unexpected interrupts, disabling interrupts completely\n", + *unexpectp); + /* disable all interrupts, something is very wrong */ + ipath_kput_kreg(t, kr_intmask, 0ULL); + } + } else if (*unexpectp > 1) + _IPATH_DBG + ("Interrupt when not ready, should not happen, ignoring\n"); +} + +/* separate routine, for better optimization of ipath_intr() */ + +static void ipath_bad_regread(const ipath_type t) +{ + static int allbits; + struct ipath_devdata *dd = &devdata[t]; + + /* + * We print the message and disable interrupts, in hope of + * having a better chance of debugging the problem. + */ + _IPATH_UNIT_ERROR(t, + "Read of interrupt status failed (all bits set)\n"); + if (allbits++) { + /* disable all interrupts, something is very wrong */ + ipath_kput_kreg(t, kr_intmask, 0ULL); + if (allbits == 2) { + _IPATH_UNIT_ERROR(t, + "Still bad interrupt status, unregistering interrupt\n"); + free_irq(dd->pcidev->irq, dd); + dd->pcidev->irq = 0; + } else if (allbits > 2) { + if ((allbits % 10000) == 0) + printk("."); + } else + _IPATH_UNIT_ERROR(t, + "Disabling interrupts, multiple errors\n"); + } +} + +static irqreturn_t ipath_intr(int irq, void *data, struct pt_regs *regs) +{ + struct ipath_devdata *dd = data; + const ipath_type t = IPATH_UNIT(dd); + uint32_t istat = ipath_kget_kreg32(t, kr_intstatus); + uint64_t estat = 0; + static unsigned unexpected = 0; + + if (unlikely(!istat)) { + ipath_stats.sps_nullintr++; + /* not our interrupt, or already handled */ + return IRQ_NONE; + } + if (unlikely(istat == -1)) { + ipath_bad_regread(t); + /* don't know if it was our interrupt or not */ + return IRQ_NONE; + } + + ipath_stats.sps_ints++; + + /* + * this needs to be flags&initted, not statusp, so we keep + * taking interrupts even after link goes down, etc. + * Also, we *must* clear the interrupt at some point, or we won't + * take it again, which can be real bad for errors, etc... + */ + + if (!(dd->ipath_flags & IPATH_INITTED)) { + ipath_bad_intr(t, &unexpected); + return IRQ_NONE; + } + if (unexpected) + unexpected = 0; + + if (istat & ~infinipath_i_bitsextant) + _IPATH_UNIT_ERROR(t, + "interrupt with unknown interrupts %x set\n", + istat & (uint32_t) ~ infinipath_i_bitsextant); + + if (istat & INFINIPATH_I_ERROR) { + ipath_stats.sps_errints++; + estat = ipath_kget_kreg64(t, kr_errorstatus); + if (!estat) + _IPATH_INFO + ("error interrupt (%x), but no error bits set!\n", + istat); + else if (estat == -1LL) + /* + * should we try clearing all, or hope next read + * works? + */ + _IPATH_UNIT_ERROR(t, + "Read of error status failed (all bits set); ignoring\n"); + else + ipath_handle_errors(t, estat); + } + + if (istat & INFINIPATH_I_GPIO) { + /* Clear GPIO status bit 2 */ + ipath_kput_kreg(t, kr_gpio_clear, (uint64_t)(1 << 2)); + + /* + * Packets are available in the port 0 receive queue. + * Eventually this needs to be generalized to check + * IPATH_GPIO_INTR, and the specific GPIO bit, when + * GPIO interrupts start being used for other things. + * We skip that now to improve performance. 
+ */ + ipath_kreceive(t); + } + + /* + * clear the ones we will deal with on this round + * We clear it early, mostly for receive interrupts, so we + * know the chip will have seen this by the time we process + * the queue, and will re-interrupt if necessary. The processor + * itself won't take the interrupt again until we return. + */ + ipath_kput_kreg(t, kr_intclear, istat); + + if (istat & INFINIPATH_I_SPIOBUFAVAIL) { + atomic_clear_mask(INFINIPATH_S_PIOINTBUFAVAIL, + &dd->ipath_sendctrl); + ipath_kput_kreg(t, kr_sendctrl, dd->ipath_sendctrl); + + if (dd->ipath_portpiowait) { + uint32_t i; + /* + * start from port 1, since for now port 0 is + * never using wait_event for PIO + */ + for (i = 1; + dd->ipath_portpiowait && i < dd->ipath_cfgports; + i++) { + if (dd->ipath_pd[i] + && dd->ipath_portpiowait & (1U << i)) { + atomic_clear_mask(1U << i, + &dd-> + ipath_portpiowait); + if (dd->ipath_pd[i]-> + port_flag & IPATH_PORT_WAITING_PIO) + { + dd->ipath_pd[i]->port_flag &= + ~IPATH_PORT_WAITING_PIO; + wake_up_interruptible(&dd-> + ipath_pd + [i]-> + port_wait); + } + } + } + } + + if (dd->ipath_layer.l_intr) { + if (dd->ipath_layer.l_intr(t, + IPATH_LAYER_INT_SEND_CONTINUE)) { + atomic_set_mask(INFINIPATH_S_PIOINTBUFAVAIL, + &dd->ipath_sendctrl); + ipath_kput_kreg(t, kr_sendctrl, + dd->ipath_sendctrl); + } + } + + if (dd->verbs_layer.l_piobufavail) { + if (!dd->verbs_layer.l_piobufavail(t)) { + atomic_set_mask(INFINIPATH_S_PIOINTBUFAVAIL, + &dd->ipath_sendctrl); + ipath_kput_kreg(t, kr_sendctrl, + dd->ipath_sendctrl); + } + } + } + + /* + * we check for both transition from empty to non-empty, and urgent + * packets (those with the interrupt bit set in the header) + */ + + if (istat & ((infinipath_i_rcvavail_mask << INFINIPATH_I_RCVAVAIL_SHIFT) + | (infinipath_i_rcvurg_mask << INFINIPATH_I_RCVURG_SHIFT))) { + uint64_t portr; + int i; + uint32_t rcvdint = 0; + + portr = ((istat >> INFINIPATH_I_RCVAVAIL_SHIFT) & + infinipath_i_rcvavail_mask) + | ((istat >> INFINIPATH_I_RCVURG_SHIFT) & + infinipath_i_rcvurg_mask); + for (i = 0; i < dd->ipath_cfgports; i++) { + if (portr & (1 << i) && dd->ipath_pd[i]) { + if (i == 0) + ipath_kreceive(t); + else if (dd->ipath_pd[i]-> + port_flag & IPATH_PORT_WAITING_RCV) { + atomic_clear_mask + (IPATH_PORT_WAITING_RCV, + &dd->ipath_pd[i]->port_flag); + wake_up_interruptible(&dd->ipath_pd[i]-> + port_wait); + rcvdint |= 1U << i; + } + } + } + if (rcvdint) { + /* + * only want to take one interrupt, so turn off + * the rcv interrupt for all the ports that we + * did the wakeup on (but never for kernel port) + */ + atomic_clear_mask(rcvdint << + INFINIPATH_R_INTRAVAIL_SHIFT, + &dd->ipath_rcvctrl); + ipath_kput_kreg(t, kr_rcvctrl, dd->ipath_rcvctrl); + } + } + + return IRQ_HANDLED; +} + +static void ipath_decode_err(char *buf, size_t blen, uint64_t err) +{ + *buf = '\0'; + if (err & INFINIPATH_E_RHDRLEN) + strlcat(buf, "rhdrlen ", blen); + if (err & INFINIPATH_E_RBADTID) + strlcat(buf, "rbadtid ", blen); + if (err & INFINIPATH_E_RBADVERSION) + strlcat(buf, "rbadversion ", blen); + if (err & INFINIPATH_E_RHDR) + strlcat(buf, "rhdr ", blen); + if (err & INFINIPATH_E_RLONGPKTLEN) + strlcat(buf, "rlongpktlen ", blen); + if (err & INFINIPATH_E_RSHORTPKTLEN) + strlcat(buf, "rshortpktlen ", blen); + if (err & INFINIPATH_E_RMAXPKTLEN) + strlcat(buf, "rmaxpktlen ", blen); + if (err & INFINIPATH_E_RMINPKTLEN) + strlcat(buf, "rminpktlen ", blen); + if (err & INFINIPATH_E_RFORMATERR) + strlcat(buf, "rformaterr ", blen); + if (err & INFINIPATH_E_RUNSUPVL) + strlcat(buf, 
"runsupvl ", blen); + if (err & INFINIPATH_E_RUNEXPCHAR) + strlcat(buf, "runexpchar ", blen); + if (err & INFINIPATH_E_RIBFLOW) + strlcat(buf, "ribflow ", blen); + if (err & INFINIPATH_E_REBP) + strlcat(buf, "EBP ", blen); + if (err & INFINIPATH_E_SUNDERRUN) + strlcat(buf, "sunderrun ", blen); + if (err & INFINIPATH_E_SPIOARMLAUNCH) + strlcat(buf, "spioarmlaunch ", blen); + if (err & INFINIPATH_E_SUNEXPERRPKTNUM) + strlcat(buf, "sunexperrpktnum ", blen); + if (err & INFINIPATH_E_SDROPPEDDATAPKT) + strlcat(buf, "sdroppeddatapkt ", blen); + if (err & INFINIPATH_E_SDROPPEDSMPPKT) + strlcat(buf, "sdroppedsmppkt ", blen); + if (err & INFINIPATH_E_SMAXPKTLEN) + strlcat(buf, "smaxpktlen ", blen); + if (err & INFINIPATH_E_SMINPKTLEN) + strlcat(buf, "sminpktlen ", blen); + if (err & INFINIPATH_E_SUNSUPVL) + strlcat(buf, "sunsupVL ", blen); + if (err & INFINIPATH_E_SPKTLEN) + strlcat(buf, "spktlen ", blen); + if (err & INFINIPATH_E_INVALIDADDR) + strlcat(buf, "invalidaddr ", blen); + if (err & INFINIPATH_E_RICRC) + strlcat(buf, "CRC ", blen); + if (err & INFINIPATH_E_RVCRC) + strlcat(buf, "VCRC ", blen); + if (err & INFINIPATH_E_RRCVEGRFULL) + strlcat(buf, "rcvegrfull ", blen); + if (err & INFINIPATH_E_RRCVHDRFULL) + strlcat(buf, "rcvhdrfull ", blen); + if (err & INFINIPATH_E_IBSTATUSCHANGED) + strlcat(buf, "ibcstatuschg ", blen); + if (err & INFINIPATH_E_RIBLOSTLINK) + strlcat(buf, "riblostlink ", blen); + if (err & INFINIPATH_E_HARDWARE) + strlcat(buf, "hardware ", blen); + if (err & INFINIPATH_E_RESET) + strlcat(buf, "reset ", blen); +} + +/* decode RHF errors; only used one place now, may want more later */ +static void get_rhf_errstring(uint32_t err, char *msg, size_t len) +{ + /* if no errors, and so don't need to check what's first */ + *msg = '\0'; + + if (err & INFINIPATH_RHF_H_ICRCERR) + strlcat(msg, "icrcerr ", len); + if (err & INFINIPATH_RHF_H_VCRCERR) + strlcat(msg, "vcrcerr ", len); + if (err & INFINIPATH_RHF_H_PARITYERR) + strlcat(msg, "parityerr ", len); + if (err & INFINIPATH_RHF_H_LENERR) + strlcat(msg, "lenerr ", len); + if (err & INFINIPATH_RHF_H_MTUERR) + strlcat(msg, "mtuerr ", len); + if (err & INFINIPATH_RHF_H_IHDRERR) + /* infinipath hdr checksum error */ + strlcat(msg, "ipathhdrerr ", len); + if (err & INFINIPATH_RHF_H_TIDERR) + strlcat(msg, "tiderr ", len); + if (err & INFINIPATH_RHF_H_MKERR) + /* bad port, offset, etc. 
*/ + strlcat(msg, "invalid ipathhdr ", len); + if (err & INFINIPATH_RHF_H_IBERR) + strlcat(msg, "iberr ", len); + if (err & INFINIPATH_RHF_L_SWA) + strlcat(msg, "swA ", len); + if (err & INFINIPATH_RHF_L_SWB) + strlcat(msg, "swB ", len); +} + +static void ipath_handle_errors(const ipath_type t, uint64_t errs) +{ + char msg[512]; + uint32_t piobcnt; + uint64_t sbuf[4], ignore_this_time = 0; + int i; + int chkerrpkts = 0, noprint = 0; + cycles_t nc; + static cycles_t nextmsg_time; + static unsigned nmsgs, supp_msgs; + struct ipath_devdata *dd = &devdata[t]; + +#define E_SUM_PKTERRS (INFINIPATH_E_RHDRLEN | INFINIPATH_E_RBADTID \ + | INFINIPATH_E_RBADVERSION \ + | INFINIPATH_E_RHDR | INFINIPATH_E_RLONGPKTLEN | INFINIPATH_E_RSHORTPKTLEN \ + | INFINIPATH_E_RMAXPKTLEN | INFINIPATH_E_RMINPKTLEN \ + | INFINIPATH_E_RFORMATERR | INFINIPATH_E_RUNSUPVL | INFINIPATH_E_RUNEXPCHAR \ + | INFINIPATH_E_REBP) + +#define E_SUM_ERRS ( INFINIPATH_E_SPIOARMLAUNCH \ + | INFINIPATH_E_SUNEXPERRPKTNUM | INFINIPATH_E_SDROPPEDDATAPKT \ + | INFINIPATH_E_SDROPPEDSMPPKT | INFINIPATH_E_SMAXPKTLEN \ + | INFINIPATH_E_SUNSUPVL | INFINIPATH_E_SMINPKTLEN | INFINIPATH_E_SPKTLEN \ + | INFINIPATH_E_INVALIDADDR) + + /* + * throttle back "fast" messages to no more than 10 per 5 seconds + * (1.4-2GHz clock). This isn't perfect, but it's a reasonable + * heuristic + * If we get more than 10, give a 5x longer delay + */ + nc = get_cycles(); + if (nmsgs > 10) { + if (nc < nextmsg_time) { + noprint = 1; + if (!supp_msgs++) + nextmsg_time = nc + 50000000000ULL; + } else if (supp_msgs) { + /* + * Print the message unless it's ibc status + * change only, which happens so often we never + * want to count it. + */ + if (dd->ipath_lasterror & ~INFINIPATH_E_IBSTATUSCHANGED) { + ipath_decode_err(msg, sizeof msg, + dd-> + ipath_lasterror & + ~INFINIPATH_E_IBSTATUSCHANGED); + if (dd-> + ipath_lasterror & ~(INFINIPATH_E_RRCVEGRFULL + | + INFINIPATH_E_RRCVHDRFULL)) + _IPATH_UNIT_ERROR(t, + "Suppressed %u messages for fast-repeating errors (%s) (%llx)\n", + supp_msgs, msg, + dd->ipath_lasterror); + else { + /* + * rcvegrfull and rcvhdrqfull are + * "normal", for some types of + * processes (mostly benchmarks) + * that send huge numbers of + * messages, while not processing + * them. So only complain about + * these at debug level. 
+ */ + _IPATH_DBG + ("Suppressed %u messages for %s\n", + supp_msgs, msg); + } + } + supp_msgs = 0; + nmsgs = 0; + } + } else if (!nmsgs++ || nc > nextmsg_time) /* start timer */ + nextmsg_time = nc + 10000000000ULL; + + /* + * don't report errors that are masked (includes those always + * ignored) + */ + errs &= ~dd->ipath_maskederrs; + + /* do these first, they are most important */ + if (errs & INFINIPATH_E_HARDWARE) { + /* reuse same msg buf */ + ipath_handle_hwerrors(t, msg, sizeof msg); + } + + if (!noprint && (errs & ~infinipath_e_bitsextant)) + _IPATH_UNIT_ERROR(t, + "error interrupt with unknown errors %llx set\n", + errs & ~infinipath_e_bitsextant); + + if (errs & E_SUM_ERRS) { + /* if possible that sendbuffererror could be valid */ + piobcnt = dd->ipath_piobcnt; + /* read these before writing errorclear */ + sbuf[0] = ipath_kget_kreg64(t, kr_sendbuffererror); + sbuf[1] = ipath_kget_kreg64(t, kr_sendbuffererror + 1); + if (piobcnt > 128) { + sbuf[2] = ipath_kget_kreg64(t, kr_sendbuffererror + 2); + sbuf[3] = ipath_kget_kreg64(t, kr_sendbuffererror + 3); + } + + if (sbuf[0] || sbuf[1] + || (piobcnt > 128 && (sbuf[2] || sbuf[3]))) { + _IPATH_PDBG("SendbufErrs %llx %llx ", sbuf[0], sbuf[1]); + if (infinipath_debug & __IPATH_PKTDBG && piobcnt > 128) + printk("%llx %llx ", sbuf[2], sbuf[3]); + for (i = 0; i < piobcnt; i++) { + if (test_bit(i, sbuf)) { + uint32_t sendctrl; + if (infinipath_debug & __IPATH_PKTDBG) + printk("%u ", i); + sendctrl = + dd-> + ipath_sendctrl | INFINIPATH_S_DISARM + | (i << + INFINIPATH_S_DISARMPIOBUF_SHIFT); + ipath_kput_kreg(t, kr_sendctrl, + sendctrl); + } + } + if (infinipath_debug & __IPATH_PKTDBG) + printk("\n"); + } + if ((errs & + (INFINIPATH_E_SDROPPEDDATAPKT | INFINIPATH_E_SDROPPEDSMPPKT + | INFINIPATH_E_SMINPKTLEN)) + && !(dd->ipath_flags & IPATH_LINKACTIVE)) { + /* + * This can happen when SMA is trying to bring + * the link up, but the IB link changes state + * at the "wrong" time. The IB logic then + * complains that the packet isn't valid. + * We don't want to confuse people, so we just + * don't print them, except at debug + */ + _IPATH_DBG + ("Ignoring pktsend errors %llx, because not yet active\n", + errs); + ignore_this_time |= + INFINIPATH_E_SDROPPEDDATAPKT | + INFINIPATH_E_SDROPPEDSMPPKT | + INFINIPATH_E_SMINPKTLEN; + } + } + + if (supp_msgs == 250000) { + /* + * It's not entirely reasonable assuming that the errors + * set in the last clear period are all responsible for + * the problem, but the alternative is to assume it's the only + * ones on this particular interrupt, which also isn't great + */ + dd->ipath_maskederrs |= dd->ipath_lasterror | errs; + ipath_kput_kreg(t, kr_errormask, ~dd->ipath_maskederrs); + ipath_decode_err(msg, sizeof msg, + (dd->ipath_maskederrs & ~dd-> + ipath_ignorederrs)); + + if ((dd->ipath_maskederrs & ~dd->ipath_ignorederrs) + & ~(INFINIPATH_E_RRCVEGRFULL | INFINIPATH_E_RRCVHDRFULL)) + _IPATH_UNIT_ERROR(t, + "Disabling error(s) %llx because occuring too frequently (%s)\n", + (dd->ipath_maskederrs & ~dd-> + ipath_ignorederrs), msg); + else { + /* + * rcvegrfull and rcvhdrqfull are "normal", + * for some types of processes (mostly benchmarks) + * that send huge numbers of messages, while not + * processing them. So only complain about + * these at debug level. + */ + _IPATH_DBG + ("Disabling frequent queue full errors (%s)\n", + msg); + } + + /* + * re-enable the masked errors after around 3 minutes. + * in ipath_get_faststats(). 
If we have a series of + * fast repeating but different errors, the interval will keep + * stretching out, but that's OK, as that's pretty catastrophic. + */ + dd->ipath_unmasktime = nc + 400000000000ULL; + } + + ipath_kput_kreg(t, kr_errorclear, errs); + if (ignore_this_time) + errs &= ~ignore_this_time; + if (errs & ~dd->ipath_lasterror) { + errs &= ~dd->ipath_lasterror; + /* never suppress duplicate hwerrors or ibstatuschange */ + dd->ipath_lasterror |= errs & + ~(INFINIPATH_E_HARDWARE | INFINIPATH_E_IBSTATUSCHANGED); + } + if (!errs) + return; + + if (!noprint) + /* the ones we mask off are handled specially below or above */ + ipath_decode_err(msg, sizeof msg, + errs & ~(INFINIPATH_E_IBSTATUSCHANGED | + INFINIPATH_E_RRCVEGRFULL | + INFINIPATH_E_RRCVHDRFULL | + INFINIPATH_E_HARDWARE)); + else + /* so we don't need if (!noprint) at strlcat's below */ + *msg = 0; + + if (errs & E_SUM_PKTERRS) { + ipath_stats.sps_pkterrs++; + chkerrpkts = 1; + } + if (errs & E_SUM_ERRS) + ipath_stats.sps_errs++; + + if (errs & (INFINIPATH_E_RICRC | INFINIPATH_E_RVCRC)) { + ipath_stats.sps_crcerrs++; + chkerrpkts = 1; + } + + /* + * We don't want to print these two as they happen, or we can make + * the situation even worse, because it takes so long to print messages. + * to serial consoles. kernel ports get printed from fast_stats, no + * more than every 5 seconds, user ports get printed on close + */ + if (errs & INFINIPATH_E_RRCVHDRFULL) { + int any; + uint32_t hd, tl; + ipath_stats.sps_hdrqfull++; + for (any = i = 0; i < dd->ipath_cfgports; i++) { + if (i == 0) { + hd = dd->ipath_port0head; + tl = *dd->ipath_hdrqtailptr; + } else if (dd->ipath_pd[i] && + dd->ipath_pd[i]->port_rcvhdrtail_kvaddr) { + /* + * don't report same point multiple times, + * except kernel + */ + tl = (uint32_t) * + dd->ipath_pd[i]->port_rcvhdrtail_kvaddr; + if (tl == dd->ipath_lastrcvhdrqtails[i]) + continue; + hd = ipath_kget_ureg32(t, ur_rcvhdrhead, i); + } else + continue; + if (hd == (tl + 1) || (!hd && tl == dd->ipath_hdrqlast)) { + dd->ipath_lastrcvhdrqtails[i] = tl; + dd->ipath_pd[i]->port_hdrqfull++; + if (i == 0) + chkerrpkts = 1; + } + } + } + if (errs & INFINIPATH_E_RRCVEGRFULL) { + /* + * since this is of less importance and not likely to + * happen without also getting hdrfull, only count + * occurrences; don't check each port (or even the kernel + * vs user) + */ + ipath_stats.sps_etidfull++; + if (dd->ipath_port0head != *dd->ipath_hdrqtailptr) + chkerrpkts = 1; + } + + /* + * do this before IBSTATUSCHANGED, in case both bits set in a single + * interrupt; we want the STATUSCHANGE to "win", so we do our + * internal copy of state machine correctly + */ + if (errs & INFINIPATH_E_RIBLOSTLINK) { + /* force through block below */ + errs |= INFINIPATH_E_IBSTATUSCHANGED; + ipath_stats.sps_iblink++; + dd->ipath_flags |= IPATH_LINKDOWN; + dd->ipath_flags &= ~(IPATH_LINKUNK | IPATH_LINKINIT + | IPATH_LINKARMED | IPATH_LINKACTIVE); + if (!noprint) + _IPATH_DBG("Lost link, link now down (%s)\n", + ipath_ibcstatus_str[ipath_kget_kreg64 + (t, + kr_ibcstatus) & 0xf]); + } + + if ((errs & INFINIPATH_E_IBSTATUSCHANGED) && (!ipath_diags_enabled)) { + uint64_t val; + uint32_t ltstate; + + val = ipath_kget_kreg64(t, kr_ibcstatus); + ltstate = val & 0xff; + if (ltstate == 0x11 || ltstate == 0x21 || ltstate == 0x31) + _IPATH_DBG("Link state changed unit %u to 0x%x, last was 0x%llx\n", + t, ltstate, dd->ipath_lastibcstat); + else { + ltstate = dd->ipath_lastibcstat & 0xff; + if (ltstate == 0x11 || ltstate == 0x21 || ltstate == 0x31) + 
_IPATH_DBG("Link state unit %u changed to down state 0x%llx, last was 0x%llx\n", + t, val, dd->ipath_lastibcstat); + else + _IPATH_VDBG("Link state unit %u changed to 0x%llx from one of down states\n", + t, val); + } + ltstate = (val >> INFINIPATH_IBCS_LINKTRAININGSTATE_SHIFT) & + INFINIPATH_IBCS_LINKTRAININGSTATE_MASK; + + if (ltstate == 2 || ltstate == 3) { + uint32_t last_ltstate; + + /* + * ignore cycling back and forth from states 2 to 3 + * while waiting for other end of link to come up + * except that if it keeps happening, we switch between + * linkinitstate SLEEP and POLL. While we cycle + * back and forth between them, we aren't seeing + * any other device, either no cable plugged in, + * other device powered off, other device is + * switch that hasn't yet polled us, etc. + */ + last_ltstate = (dd->ipath_lastibcstat >> + INFINIPATH_IBCS_LINKTRAININGSTATE_SHIFT) + & INFINIPATH_IBCS_LINKTRAININGSTATE_MASK; + if (last_ltstate == 2 || last_ltstate == 3) { + if (++dd->ipath_ibpollcnt > 4) { + uint64_t ibc; + dd->ipath_flags |= + IPATH_LINK_SLEEPING | IPATH_NOCABLE; + *dd->ipath_statusp |= + IPATH_STATUS_IB_NOCABLE; + _IPATH_VDBG + ("linkinitcmd POLL, move to SLEEP\n"); + ibc = dd->ipath_ibcctrl; + ibc |= INFINIPATH_IBCC_LINKINITCMD_SLEEP + << + INFINIPATH_IBCC_LINKINITCMD_SHIFT; + /* + * don't put linkinitcmd in + * ipath_ibcctrl, want that to + * stay a NOP + */ + ipath_kput_kreg(t, kr_ibcctrl, ibc); + dd->ipath_ibpollcnt = 0; + } + goto skip_ibchange; + } + } + /* some state other than 2 or 3 */ + dd->ipath_ibpollcnt = 0; + ipath_stats.sps_iblink++; + /* + * Note: We try to match the Mellanox HCA LED behavior + * as best we can. That changed around Oct 2003. + * Green indicates link state (something is plugged in, + * and we can train). Amber indicates the link is + * logically up (ACTIVE). Mellanox further blinks the + * amber LED to indicate data packet activity, but we + * have no hardware support for that, so it would require + * waking up every 10-20 msecs and checking the counters + * on the chip, and then turning the LED off if + * appropriate. That's visible overhead, so not something + * we will do. 
+ */ + if (ltstate != 1 || ((dd->ipath_lastibcstat & 0x30) == 0x30 && + (val & 0x30) != 0x30)) { + dd->ipath_flags |= IPATH_LINKDOWN; + dd->ipath_flags &= ~(IPATH_LINKUNK | IPATH_LINKINIT + | IPATH_LINKACTIVE | + IPATH_LINKARMED); + *dd->ipath_statusp &= ~IPATH_STATUS_IB_READY; + if (!noprint) { + if ((dd->ipath_lastibcstat & 0x30) == 0x30) + /* if from up to down be more vocal */ + _IPATH_DBG("Link unit %u is now down (%s)\n", + t, ipath_ibcstatus_str + [ltstate]); + else + _IPATH_VDBG("Link unit %u is down (%s)\n", + t, ipath_ibcstatus_str + [ltstate]); + } + + if (val & 0x30) { + /* leave just green on, 0x11 and 0x21 */ + dd->ipath_extctrl &= + ~INFINIPATH_EXTC_LEDPRIPORTYELLOWON; + dd->ipath_extctrl |= + INFINIPATH_EXTC_LEDPRIPORTGREENON; + } else /* not up at all, so turn the leds off */ + dd->ipath_extctrl &= + ~(INFINIPATH_EXTC_LEDPRIPORTGREENON | + INFINIPATH_EXTC_LEDPRIPORTYELLOWON); + ipath_kput_kreg(t, kr_extctrl, + (uint64_t) dd->ipath_extctrl); + if (ltstate == 1 + && (dd-> + ipath_flags & (IPATH_LINK_TOARMED | + IPATH_LINK_TOACTIVE))) { + ipath_set_ib_lstate(t, + INFINIPATH_IBCC_LINKCMD_INIT); + } + } else if ((val & 0x31) == 0x31) { + if (!noprint) + _IPATH_DBG("Link unit %u is now in active state\n", t); + dd->ipath_flags |= IPATH_LINKACTIVE; + dd->ipath_flags &= + ~(IPATH_LINKUNK | IPATH_LINKINIT | IPATH_LINKDOWN | + IPATH_LINKARMED | IPATH_NOCABLE | + IPATH_LINK_TOACTIVE | IPATH_LINK_SLEEPING); + *dd->ipath_statusp &= ~IPATH_STATUS_IB_NOCABLE; + *dd->ipath_statusp |= + IPATH_STATUS_IB_READY | IPATH_STATUS_IB_CONF; + /* set the externally visible LEDs to indicate state */ + dd->ipath_extctrl |= INFINIPATH_EXTC_LEDPRIPORTGREENON + | INFINIPATH_EXTC_LEDPRIPORTYELLOWON; + ipath_kput_kreg(t, kr_extctrl, + (uint64_t) dd->ipath_extctrl); + + /* + * since we are now active, set the linkinitcmd + * to NOP (0) it was probably either POLL or SLEEP + */ + dd->ipath_ibcctrl &= + ~(INFINIPATH_IBCC_LINKINITCMD_MASK << + INFINIPATH_IBCC_LINKINITCMD_SHIFT); + ipath_kput_kreg(t, kr_ibcctrl, dd->ipath_ibcctrl); + + if (devdata[t].ipath_layer.l_intr) + devdata[t].ipath_layer.l_intr(t, + IPATH_LAYER_INT_IF_UP); + } else if ((val & 0x31) == 0x11) { + /* + * set set INIT and DOWN. Down is checked by + * most of the other code, but INIT is useful + * to know in a few places. + */ + dd->ipath_flags |= IPATH_LINKINIT | IPATH_LINKDOWN; + dd->ipath_flags &= + ~(IPATH_LINKUNK | IPATH_LINKACTIVE | IPATH_LINKARMED + | IPATH_NOCABLE | IPATH_LINK_SLEEPING); + *dd->ipath_statusp &= ~(IPATH_STATUS_IB_NOCABLE + | IPATH_STATUS_IB_READY); + + /* set the externally visible LEDs to indicate state */ + dd->ipath_extctrl &= + ~INFINIPATH_EXTC_LEDPRIPORTYELLOWON; + dd->ipath_extctrl |= INFINIPATH_EXTC_LEDPRIPORTGREENON; + ipath_kput_kreg(t, kr_extctrl, + (uint64_t) dd->ipath_extctrl); + if (dd-> + ipath_flags & (IPATH_LINK_TOARMED | + IPATH_LINK_TOACTIVE)) { + /* + * if we got here while trying to bring + * the link up, try again, but only once more! 
+ */ + ipath_set_ib_lstate(t, + INFINIPATH_IBCC_LINKCMD_ARMED); + dd->ipath_flags &= + ~(IPATH_LINK_TOARMED | IPATH_LINK_TOACTIVE); + } + } else if ((val & 0x31) == 0x21) { + dd->ipath_flags |= IPATH_LINKARMED; + dd->ipath_flags &= + ~(IPATH_LINKUNK | IPATH_LINKDOWN | IPATH_LINKINIT | + IPATH_LINKACTIVE | IPATH_NOCABLE | + IPATH_LINK_TOARMED | IPATH_LINK_SLEEPING); + *dd->ipath_statusp &= ~(IPATH_STATUS_IB_NOCABLE + | IPATH_STATUS_IB_READY); + /* + * set the externally visible LEDs to indicate + * state (same as 0x11) + */ + dd->ipath_extctrl &= + ~INFINIPATH_EXTC_LEDPRIPORTYELLOWON; + dd->ipath_extctrl |= INFINIPATH_EXTC_LEDPRIPORTGREENON; + ipath_kput_kreg(t, kr_extctrl, + (uint64_t) dd->ipath_extctrl); + if (dd->ipath_flags & IPATH_LINK_TOACTIVE) { + /* + * if we got here while trying to bring + * the link up, try again, but only once more! + */ + ipath_set_ib_lstate(t, + INFINIPATH_IBCC_LINKCMD_ACTIVE); + dd->ipath_flags &= ~IPATH_LINK_TOACTIVE; + } + } else { + if (dd-> + ipath_flags & (IPATH_LINK_TOARMED | + IPATH_LINK_TOACTIVE)) + ipath_set_ib_lstate(t, + INFINIPATH_IBCC_LINKCMD_INIT); + else if (!noprint) + _IPATH_DBG("IBstatuschange unit %u: %s\n", + t, ipath_ibcstatus_str[ltstate]); + } + dd->ipath_lastibcstat = val; + } + +skip_ibchange: + + if (errs & INFINIPATH_E_RESET) { + if (!noprint) + _IPATH_UNIT_ERROR(t, + "Got reset, requires re-initialization (unload and reload driver)\n"); + dd->ipath_flags &= ~IPATH_INITTED; /* needs re-init */ + /* mark as having had error */ + *dd->ipath_statusp |= IPATH_STATUS_HWERROR; + *dd->ipath_statusp &= ~IPATH_STATUS_IB_CONF; + } + + if (!noprint && *msg) + _IPATH_UNIT_ERROR(t, "%s error\n", msg); + if (dd->ipath_sma_state_wanted & dd->ipath_flags) { + _IPATH_VDBG("sma wanted state %x, iflags now %x, waking\n", + dd->ipath_sma_state_wanted, dd->ipath_flags); + wake_up_interruptible(&ipath_sma_state_wait); + } + + if (chkerrpkts) + /* process possible error packets in hdrq */ + ipath_kreceive(t); +} From bos at pathscale.com Wed Dec 28 16:31:38 2005 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 28 Dec 2005 16:31:38 -0800 Subject: [openib-general] [PATCH 19 of 20] ipath - kbuild infrastructure In-Reply-To: Message-ID: <07bf9f34e2218a4b26e3.1135816298@eng-12.pathscale.com> Signed-off-by: Bryan O'Sullivan diff -r e7cabc7a2e78 -r 07bf9f34e221 drivers/infiniband/hw/ipath/Kconfig --- /dev/null Thu Jan 1 00:00:00 1970 +0000 +++ b/drivers/infiniband/hw/ipath/Kconfig Wed Dec 28 14:19:43 2005 -0800 @@ -0,0 +1,18 @@ +config IPATH_CORE + tristate "PathScale InfiniPath Driver" + depends on PCI_MSI + ---help--- + This is a low-level driver for PathScale InfiniPath host + channel adapters (HCAs) based on the HT-400 chip, including the + InfiniPath HT-460, the small form factor InfiniPath HT-460, + the InfiniPath HT-470 and the Linux Networx LS/X. + +config INFINIBAND_IPATH + tristate "PathScale InfiniPath Verbs Driver" + depends on IPATH_CORE && INFINIBAND + ---help--- + This is a driver that provides InfiniBand verbs support for + PathScale InfiniPath host channel adapters (HCAs). This + allows these devices to be used with both kernel upper level + protocols such as IP-over-InfiniBand as well as with userspace + applications (in conjunction with InfiniBand userspace access). 
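[A note for readers trying these patches: a minimal kernel .config
fragment satisfying the dependencies declared in the Kconfig above would
look roughly like the following; this is an illustration, not part of
the patch, and assumes the rest of the configuration is already sane:

    CONFIG_PCI_MSI=y
    CONFIG_INFINIBAND=m
    CONFIG_IPATH_CORE=m
    CONFIG_INFINIBAND_IPATH=m

The two tristate options build the ipath_core and ib_ipath modules wired
up in the Makefile below.]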
diff -r e7cabc7a2e78 -r 07bf9f34e221 drivers/infiniband/hw/ipath/Makefile --- /dev/null Thu Jan 1 00:00:00 1970 +0000 +++ b/drivers/infiniband/hw/ipath/Makefile Wed Dec 28 14:19:43 2005 -0800 @@ -0,0 +1,7 @@ +obj-$(CONFIG_IPATH_CORE) += ipath_core.o +obj-$(CONFIG_INFINIBAND_IPATH) += ib_ipath.o + +ipath_core-y := ipath_copy.o ipath_driver.o ipath_ht400.o ipath_i2c.o \ + ipath_layer.o ipath_lib.o ipath_upages.o + +ib_ipath-y := ipath_mad.o ipath_verbs.o From bos at pathscale.com Wed Dec 28 16:31:33 2005 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 28 Dec 2005 16:31:33 -0800 Subject: [openib-general] [PATCH 14 of 20] ipath - infiniband verbs header In-Reply-To: Message-ID: <26993cb5faeef807a840.1135816293@eng-12.pathscale.com> Signed-off-by: Bryan O'Sullivan diff -r f9bcd9de3548 -r 26993cb5faee drivers/infiniband/hw/ipath/ipath_verbs.h --- /dev/null Thu Jan 1 00:00:00 1970 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h Wed Dec 28 14:19:43 2005 -0800 @@ -0,0 +1,532 @@ +/* + * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. + */ + +#ifndef IPATH_VERBS_H +#define IPATH_VERBS_H + +#include +#include +#include +#include +#include + +#include "ipath_kernel.h" +#include "verbs_debug.h" + +#define CTL_IPATH_VERBS 0x70736e68 /* "spin" as a hex value, top level */ +#define CTL_IPATH_VERBS_FAULT 1 +#define CTL_IPATH_VERBS_DEBUG 2 + +#define QPN_MAX (1 << 24) +#define QPNMAP_ENTRIES (QPN_MAX / PAGE_SIZE / BITS_PER_BYTE) + +/* + * Increment this value if any changes that break userspace ABI + * compatibility are made. + */ +#define IPATH_UVERBS_ABI_VERSION 1 + +/* + * Define an ib_cq_notify value that is not valid so we know when CQ + * notifications are armed. 
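+ * (At the time of this patch, enum ib_cq_notify has just
+ * IB_CQ_SOLICITED and IB_CQ_NEXT_COMP, so one past IB_CQ_NEXT_COMP can
+ * never be requested by a consumer.)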
+ */ +#define IB_CQ_NONE (IB_CQ_NEXT_COMP + 1) + +enum { + IB_RNR_NAK = 0x20, + + IB_NAK_PSN_ERROR = 0x60, + IB_NAK_INVALID_REQUEST = 0x61, + IB_NAK_REMOTE_ACCESS_ERROR = 0x62, + IB_NAK_REMOTE_OPERATIONAL_ERROR = 0x63, + IB_NAK_INVALID_RD_REQUEST = 0x64 +}; + +/* IB Performance Manager status values */ +enum { + IB_PMA_SAMPLE_STATUS_DONE = 0x00, + IB_PMA_SAMPLE_STATUS_STARTED = 0x01, + IB_PMA_SAMPLE_STATUS_RUNNING = 0x02 +}; + +/* Mandatory IB performance counter select values. */ +#define IB_PMA_PORT_XMIT_DATA __constant_htons(0x0001) +#define IB_PMA_PORT_RCV_DATA __constant_htons(0x0002) +#define IB_PMA_PORT_XMIT_PKTS __constant_htons(0x0003) +#define IB_PMA_PORT_RCV_PKTS __constant_htons(0x0004) +#define IB_PMA_PORT_XMIT_WAIT __constant_htons(0x0005) + +struct ib_reth { + u64 vaddr; + u32 rkey; + u32 length; +} __attribute__ ((packed)); + +struct ib_atomic_eth { + u64 vaddr; + u32 rkey; + u64 swap_data; + u64 compare_data; +} __attribute__ ((packed)); + +struct ipath_other_headers { + u32 bth[3]; + union { + struct { + u32 deth[2]; + u32 imm_data; + } ud; + struct { + struct ib_reth reth; + u32 imm_data; + } rc; + struct { + u32 aeth; + u64 atomic_ack_eth; + } at; + u32 imm_data; + u32 aeth; + struct ib_atomic_eth atomic_eth; + } u; +} __attribute__ ((packed)); + +/* + * Note that UD packets with a GRH header are 8+40+12+8 = 68 bytes long + * (72 w/ imm_data). + * Only the first 56 bytes of the IB header will be in the + * eager header buffer. The remaining 12 or 16 bytes are in the data buffer. + */ +struct ipath_ib_header { + u16 lrh[4]; + union { + struct { + struct ib_grh grh; + struct ipath_other_headers oth; + } l; + struct ipath_other_headers oth; + } u; +} __attribute__ ((packed)); + +/* + * There is one struct ipath_mcast for each multicast GID. + * All attached QPs are then stored as a list of + * struct ipath_mcast_qp. + */ +struct ipath_mcast_qp { + struct list_head list; + struct ipath_qp *qp; +}; + +struct ipath_mcast { + struct rb_node rb_node; + union ib_gid mgid; + struct list_head qp_list; + wait_queue_head_t wait; + atomic_t refcount; +}; + +/* Memory region */ +struct ipath_mr { + struct ib_mr ibmr; + struct ipath_mregion mr; /* must be last */ +}; + +/* Fast memory region */ +struct ipath_fmr { + struct ib_fmr ibfmr; + u8 page_size; + struct ipath_mregion mr; /* must be last */ +}; + +/* Protection domain */ +struct ipath_pd { + struct ib_pd ibpd; + int user; /* non-zero if created from user space */ +}; + +/* Address Handle */ +struct ipath_ah { + struct ib_ah ibah; + struct ib_ah_attr attr; +}; + +/* + * Quick description of our CQ/QP locking scheme: + * + * We have one global lock that protects dev->cq/qp_table. Each + * struct ipath_cq/qp also has its own lock. An individual qp lock + * may be taken inside of an individual cq lock. Both cqs attached to + * a qp may be locked, with the send cq locked first. No other + * nesting should be done. + * + * Each struct ipath_cq/qp also has an atomic_t ref count. The + * pointer from the cq/qp_table to the struct counts as one reference. + * This reference also is good for access through the consumer API, so + * modifying the CQ/QP etc doesn't need to take another reference. + * Access because of a completion being polled does need a reference. + * + * Finally, each struct ipath_cq/qp has a wait_queue_head_t for the + * destroy function to sleep on. + * + * This means that access from the consumer API requires nothing but + * taking the struct's lock. 
 *
+ * Access because of a completion event should go as follows:
+ * - lock cq/qp_table and look up struct
+ * - increment ref count in struct
+ * - drop cq/qp_table lock
+ * - lock struct, do your thing, and unlock struct
+ * - decrement ref count; if zero, wake up waiters
+ *
+ * To destroy a CQ/QP, we can do the following:
+ * - lock cq/qp_table, remove pointer, unlock cq/qp_table lock
+ * - decrement ref count
+ * - wait_event until ref count is zero
+ *
+ * It is the consumer's responsibility to make sure that no QP
+ * operations (WQE posting or state modification) are pending when the
+ * QP is destroyed. Also, the consumer must make sure that calls to
+ * qp_modify are serialized.
+ *
+ * Possible optimizations (wait for profile data to see if/where we
+ * have locks bouncing between CPUs):
+ * - split cq/qp table lock into n separate (cache-aligned) locks,
+ *   indexed (say) by the page in the table
+ */
+
+struct ipath_cq {
+	struct ib_cq ibcq;
+	struct tasklet_struct comptask;
+	spinlock_t lock;
+	u8 notify;
+	u8 triggered;
+	u32 head;		/* new records added to the head */
+	u32 tail;		/* poll_cq() reads from here. */
+	struct ib_wc queue[1];	/* this is actually ibcq.cqe + 1 */
+};
+
+/*
+ * Send work request queue entry.
+ * The size of the sg_list is determined when the QP is created and stored
+ * in qp->s_max_sge.
+ */
+struct ipath_swqe {
+	struct ib_send_wr wr;	/* don't use wr.sg_list */
+	u32 psn;		/* first packet sequence number */
+	u32 lpsn;		/* last packet sequence number */
+	u32 ssn;		/* send sequence number */
+	u32 length;		/* total length of data in sg_list */
+	struct ipath_sge sg_list[0];
+};
+
+/*
+ * Receive work request queue entry.
+ * The size of the sg_list is determined when the QP is created and stored
+ * in qp->r_max_sge.
+ */
+struct ipath_rwqe {
+	u64 wr_id;
+	u32 length;		/* total length of data in sg_list */
+	u8 num_sge;
+	struct ipath_sge sg_list[0];
+};
+
+struct ipath_rq {
+	spinlock_t lock;
+	u32 head;		/* new work requests posted to the head */
+	u32 tail;		/* receives pull requests from here. */
+	u32 size;		/* size of RWQE array */
+	u8 max_sge;
+	struct ipath_rwqe *wq;	/* RWQE array */
+};
+
+struct ipath_srq {
+	struct ib_srq ibsrq;
+	struct ipath_rq rq;
+	u32 limit;		/* send signal when number of RWQEs < limit */
+};
+
+/*
+ * Variables prefixed with s_ are for the requester (sender).
+ * Variables prefixed with r_ are for the responder (receiver).
+ * Variables prefixed with ack_ are for responder replies.
+ *
+ * Common variables are protected by both r_rq.lock and s_lock in that
+ * order; taking both only happens in modify_qp() or when changing the
+ * QP 'state'.
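+ *
+ * As a sketch of that ordering (an illustration, not code from this
+ * patch), code touching state covered by both locks would do:
+ *
+ *	spin_lock_irqsave(&qp->r_rq.lock, flags);
+ *	spin_lock(&qp->s_lock);
+ *	... modify common QP state ...
+ *	spin_unlock(&qp->s_lock);
+ *	spin_unlock_irqrestore(&qp->r_rq.lock, flags);
+ *
+ * always r_rq.lock first and s_lock second, never the reverse.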
+ */ +struct ipath_qp { + struct ib_qp ibqp; + struct ipath_qp *next; /* link list for QPN hash table */ + struct list_head piowait; /* link for wait PIO buf */ + struct list_head timerwait; /* link for waiting for timeouts */ + struct ib_ah_attr remote_ah_attr; + struct ipath_ib_header s_hdr; /* next packet header to send */ + atomic_t refcount; + wait_queue_head_t wait; + struct tasklet_struct s_task; + struct ipath_sge_state *s_cur_sge; + struct ipath_sge_state s_sge; /* current send request data */ + struct ipath_sge_state s_rdma_sge; /* current RDMA read send data */ + struct ipath_sge_state r_sge; /* current receive data */ + spinlock_t s_lock; + int s_flags; + u32 s_hdrwords; /* size of s_hdr in 32 bit words */ + u32 s_cur_size; /* size of send packet in bytes */ + u32 s_len; /* total length of s_sge */ + u32 s_rdma_len; /* total length of s_rdma_sge */ + u32 s_next_psn; /* PSN for next request */ + u32 s_last_psn; /* last response PSN processed */ + u32 s_psn; /* current packet sequence number */ + u32 s_rnr_timeout; /* number of milliseconds for RNR timeout */ + u32 s_ack_psn; /* PSN for next ACK or RDMA_READ */ + u64 s_ack_atomic; /* data for atomic ACK */ + u64 r_wr_id; /* ID for current receive WQE */ + u64 r_atomic_data; /* data for last atomic op */ + u32 r_atomic_psn; /* PSN of last atomic op */ + u32 r_len; /* total length of r_sge */ + u32 r_rcv_len; /* receive data len processed */ + u32 r_psn; /* expected rcv packet sequence number */ + u8 state; /* QP state */ + u8 s_state; /* opcode of last packet sent */ + u8 s_ack_state; /* opcode of packet to ACK */ + u8 s_nak_state; /* non-zero if NAK is pending */ + u8 r_state; /* opcode of last packet received */ + u8 r_reuse_sge; /* for UC receive errors */ + u8 r_sge_inx; /* current index into sg_list */ + u8 s_max_sge; /* size of s_wq->sg_list */ + u8 qp_access_flags; + u8 s_retry_cnt; /* number of times to retry */ + u8 s_rnr_retry_cnt; + u8 s_min_rnr_timer; + u8 s_retry; /* requester retry counter */ + u8 s_rnr_retry; /* requester RNR retry counter */ + u8 s_pkey_index; /* PKEY index to use */ + enum ib_mtu path_mtu; + atomic_t msn; /* message sequence number */ + u32 remote_qpn; + u32 qkey; /* QKEY for this QP (for UD or RD) */ + u32 s_size; /* send work queue size */ + u32 s_head; /* new entries added here */ + u32 s_tail; /* next entry to process */ + u32 s_cur; /* current work queue entry */ + u32 s_last; /* last un-ACK'ed entry */ + u32 s_ssn; /* SSN of tail entry */ + u32 s_lsn; /* limit sequence number (credit) */ + struct ipath_swqe *s_wq; /* send work queue */ + struct ipath_rq r_rq; /* receive work queue */ +}; + +/* + * Bit definitions for s_flags. + */ +#define IPATH_S_BUSY 0 +#define IPATH_S_SIGNAL_REQ_WR 1 + +/* + * Since struct ipath_swqe is not a fixed size, we can't simply index into + * struct ipath_qp.s_wq. This function does the array index computation. + */ +static inline struct ipath_swqe *get_swqe_ptr(struct ipath_qp *qp, unsigned n) +{ + return (struct ipath_swqe *)((char *) qp->s_wq + + (sizeof(struct ipath_swqe) + + qp->s_max_sge * sizeof(struct ipath_sge)) * n); +} + +/* + * Since struct ipath_rwqe is not a fixed size, we can't simply index into + * struct ipath_rq.wq. This function does the array index computation. 
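+ * Each entry occupies a stride of sizeof(struct ipath_rwqe) +
+ * rq->max_sge * sizeof(struct ipath_sge) bytes, so entry n starts at
+ * wq + n * stride; get_swqe_ptr() above uses the same scheme for the
+ * send queue.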
+ */
+static inline struct ipath_rwqe *get_rwqe_ptr(struct ipath_rq *rq, unsigned n)
+{
+	return (struct ipath_rwqe *)((char *) rq->wq +
+				     (sizeof(struct ipath_rwqe) +
+				      rq->max_sge *
+				      sizeof(struct ipath_sge)) * n);
+}
+
+/*
+ * QPN-map pages start out as NULL; they get allocated upon
+ * first use and are never deallocated. This way,
+ * large bitmaps are not allocated unless large numbers of QPs are used.
+ */
+struct qpn_map {
+	atomic_t n_free;
+	void *page;
+};
+
+struct ipath_qp_table {
+	spinlock_t lock;
+	u32 last;		/* last QP number allocated */
+	u32 max;		/* size of the hash table */
+	u32 nmaps;		/* size of the map table */
+	struct ipath_qp **table;
+	struct qpn_map map[QPNMAP_ENTRIES];	/* bit map of free numbers */
+};
+
+struct ipath_lkey_table {
+	spinlock_t lock;
+	u32 next;		/* next unused index (speeds search) */
+	u32 gen;		/* generation count */
+	u32 max;		/* size of the table */
+	struct ipath_mregion **table;
+};
+
+struct ipath_opcode_stats {
+	u64 n_packets;		/* number of packets */
+	u64 n_bytes;		/* total number of bytes */
+};
+
+struct ipath_ibdev {
+	struct ib_device ibdev;
+	ipath_type ib_unit;	/* This is the device number */
+	u16 sm_lid;		/* in host order */
+	u8 sm_sl;
+	u8 mkeyprot_resv_lmc;
+	unsigned long mkey_lease_timeout;	/* non-zero when timer is set */
+
+	/* The following fields are really per port. */
+	struct ipath_qp_table qp_table;
+	struct ipath_lkey_table lk_table;
+	struct list_head pending[3];	/* FIFO of QPs waiting for ACKs */
+	struct list_head piowait;	/* list for wait PIO buf */
+	struct list_head rnrwait;	/* list of QPs waiting for RNR timer */
+	spinlock_t pending_lock;
+	__be64 sys_image_guid;	/* in network order */
+	__be64 gid_prefix;	/* in network order */
+	__be64 mkey;
+	u64 ipath_sword;	/* total dwords sent (sample result) */
+	u64 ipath_rword;	/* total dwords received (sample result) */
+	u64 ipath_spkts;	/* total packets sent (sample result) */
+	u64 ipath_rpkts;	/* total packets received (sample result) */
+	u64 n_unicast_xmit;	/* total unicast packets sent */
+	u64 n_unicast_rcv;	/* total unicast packets received */
+	u64 n_multicast_xmit;	/* total multicast packets sent */
+	u64 n_multicast_rcv;	/* total multicast packets received */
+	u64 n_symbol_error_counter;	/* starting count for PMA */
+	u64 n_link_error_recovery_counter;	/* starting count for PMA */
+	u64 n_link_downed_counter;	/* starting count for PMA */
+	u64 n_port_rcv_errors;	/* starting count for PMA */
+	u64 n_port_rcv_remphys_errors;	/* starting count for PMA */
+	u64 n_port_xmit_discards;	/* starting count for PMA */
+	u64 n_port_xmit_data;	/* starting count for PMA */
+	u64 n_port_rcv_data;	/* starting count for PMA */
+	u64 n_port_xmit_packets;	/* starting count for PMA */
+	u64 n_port_rcv_packets;	/* starting count for PMA */
+	u32 n_rc_resends;
+	u32 n_rc_acks;
+	u32 n_rc_qacks;
+	u32 n_seq_naks;
+	u32 n_rdma_seq;
+	u32 n_rnr_naks;
+	u32 n_other_naks;
+	u32 n_timeouts;
+	u32 n_pkt_drops;
+	u32 n_wqe_errs;
+	u32 n_rdma_dup_busy;
+	u32 n_piowait;
+	u32 n_no_piobuf;
+	u32 port_cap_flags;
+	u32 pma_sample_start;
+	u32 pma_sample_interval;
+	__be16 pma_counter_select[5];
+	u16 pma_tag;
+	u16 qkey_violations;
+	u16 mkey_violations;
+	u16 mkey_lease_period;
+	u16 pending_index;	/* which pending queue is active */
+	u8 pma_sample_status;
+	u8 subnet_timeout;
+	struct ipath_opcode_stats opstats[128];
+};
+
+struct ipath_ucontext {
+	struct ib_ucontext ibucontext;
+};
+
+static inline struct ipath_mr *to_imr(struct ib_mr *ibmr)
+{
+	return container_of(ibmr, struct ipath_mr, ibmr);
+}
+
+static inline struct ipath_fmr *to_ifmr(struct ib_fmr *ibfmr)
+{
+	return container_of(ibfmr, struct ipath_fmr, ibfmr);
+}
+
+static inline struct ipath_pd *to_ipd(struct ib_pd *ibpd)
+{
+	return container_of(ibpd, struct ipath_pd, ibpd);
+}
+
+static inline struct ipath_ah *to_iah(struct ib_ah *ibah)
+{
+	return container_of(ibah, struct ipath_ah, ibah);
+}
+
+static inline struct ipath_cq *to_icq(struct ib_cq *ibcq)
+{
+	return container_of(ibcq, struct ipath_cq, ibcq);
+}
+
+static inline struct ipath_srq *to_isrq(struct ib_srq *ibsrq)
+{
+	return container_of(ibsrq, struct ipath_srq, ibsrq);
+}
+
+static inline struct ipath_qp *to_iqp(struct ib_qp *ibqp)
+{
+	return container_of(ibqp, struct ipath_qp, ibqp);
+}
+
+static inline struct ipath_ibdev *to_idev(struct ib_device *ibdev)
+{
+	return container_of(ibdev, struct ipath_ibdev, ibdev);
+}
+
+int ipath_process_mad(struct ib_device *ibdev,
+		      int mad_flags,
+		      u8 port_num,
+		      struct ib_wc *in_wc,
+		      struct ib_grh *in_grh,
+		      struct ib_mad *in_mad, struct ib_mad *out_mad);
+
+static inline struct ipath_ucontext *to_iucontext(struct ib_ucontext
+						  *ibucontext)
+{
+	return container_of(ibucontext, struct ipath_ucontext, ibucontext);
+}
+
+#endif				/* IPATH_VERBS_H */
diff -r f9bcd9de3548 -r 26993cb5faee drivers/infiniband/hw/ipath/verbs_debug.h
--- /dev/null	Thu Jan 1 00:00:00 1970 +0000
+++ b/drivers/infiniband/hw/ipath/verbs_debug.h	Wed Dec 28 14:19:43 2005 -0800
@@ -0,0 +1,106 @@
+/*
+ * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Patent licenses, if any, provided herein do not apply to
+ * combinations of this program with other software, or any other
+ * product whatsoever.
+ */
+
+#ifndef _VERBS_DEBUG_H
+#define _VERBS_DEBUG_H
+
+/*
+ * This file contains tracing code for the ib_ipath kernel module.
+ */
+#ifndef _VERBS_DEBUGGING	/* tracing enabled or not */
+#define _VERBS_DEBUGGING 1
+#endif
+
+extern unsigned ib_ipath_debug;
+
+#define _VERBS_ERROR(fmt,...) \
+	do { \
+		printk(KERN_ERR "%s: " fmt, "ib_ipath", ##__VA_ARGS__); \
+	} while(0)
+
+#define _VERBS_UNIT_ERROR(unit,fmt,...) \
+	do { \
+		printk(KERN_ERR "%s: " fmt, "ib_ipath", ##__VA_ARGS__); \
+	} while(0)
+
+#if _VERBS_DEBUGGING
+
+/*
+ * Mask values for debugging.  The scheme allows us to compile out any of
+ * the debug tracing stuff, and if compiled in, to enable or disable it
+ * dynamically.  This can be set at modprobe time also:
+ *	modprobe ib_ipath ib_ipath_debug=3
+ */
+
+#define __VERBS_INFO	0x1	/* generic low verbosity stuff */
+#define __VERBS_DBG	0x2	/* generic debug */
+#define __VERBS_VDBG	0x4	/* verbose debug */
+#define __VERBS_SMADBG	0x8000	/* sma packet debug */
+
+#define _VERBS_INFO(fmt,...) \
+	do { \
+		if(unlikely(ib_ipath_debug&__VERBS_INFO)) \
+			printk(KERN_INFO "%s: " fmt,"ib_ipath",##__VA_ARGS__); \
+	} while(0)
+
+#define _VERBS_DBG(fmt,...) \
+	do { \
+		if(unlikely(ib_ipath_debug&__VERBS_DBG)) \
+			printk(KERN_DEBUG "%s: " fmt, __func__,##__VA_ARGS__); \
+	} while(0)
+
+#define _VERBS_VDBG(fmt,...) \
+	do { \
+		if(unlikely(ib_ipath_debug&__VERBS_VDBG)) \
+			printk(KERN_DEBUG "%s: " fmt, __func__,##__VA_ARGS__); \
+	} while(0)
+
+#define _VERBS_SMADBG(fmt,...) \
+	do { \
+		if(unlikely(ib_ipath_debug&__VERBS_SMADBG)) \
+			printk(KERN_DEBUG "%s: " fmt, __func__,##__VA_ARGS__); \
+	} while(0)
+
+#else				/* ! _VERBS_DEBUGGING */
+
+#define _VERBS_INFO(fmt,...)
+#define _VERBS_DBG(fmt,...)
+#define _VERBS_VDBG(fmt,...)
+#define _VERBS_SMADBG(fmt,...)
+
+#endif				/* _VERBS_DEBUGGING */
+
+#endif				/* _VERBS_DEBUG_H */

From bos at pathscale.com Wed Dec 28 16:31:36 2005
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Wed, 28 Dec 2005 16:31:36 -0800
Subject: [openib-general] [PATCH 17 of 20] ipath - infiniband verbs support, part 3 of 3
In-Reply-To: 
Message-ID: <584777b6f4dc5269fa89.1135816296@eng-12.pathscale.com>

Signed-off-by: Bryan O'Sullivan

diff -r fc067af322a1 -r 584777b6f4dc drivers/infiniband/hw/ipath/ipath_verbs.c
--- a/drivers/infiniband/hw/ipath/ipath_verbs.c	Wed Dec 28 14:19:43 2005 -0800
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c	Wed Dec 28 14:19:43 2005 -0800
@@ -4815,3 +4815,1393 @@
 	/* Call do_rc_send() in another thread. */
 	tasklet_schedule(&qp->s_task);
 }
+
+/*
+ * This is called from ipath_ib_rcv() to process an incoming packet
+ * for the given QP.
+ * Called at interrupt level.
+ */
+static inline void ipath_qp_rcv(struct ipath_ibdev *dev,
+				struct ipath_ib_header *hdr, int has_grh,
+				void *data, u32 tlen, struct ipath_qp *qp)
+{
+	/* Check for valid receive state. */
+	if (!(state_ops[qp->state] & IPATH_PROCESS_RECV_OK)) {
+		dev->n_pkt_drops++;
+		return;
+	}
+
+	switch (qp->ibqp.qp_type) {
+	case IB_QPT_SMI:
+	case IB_QPT_GSI:
+	case IB_QPT_UD:
+		ipath_ud_rcv(dev, hdr, has_grh, data, tlen, qp);
+		break;
+
+	case IB_QPT_RC:
+		ipath_rc_rcv(dev, hdr, has_grh, data, tlen, qp);
+		break;
+
+	case IB_QPT_UC:
+		ipath_uc_rcv(dev, hdr, has_grh, data, tlen, qp);
+		break;
+
+	default:
+		break;
+	}
+}
+
+/*
+ * This is called from ipath_kreceive() to process an incoming packet at
+ * interrupt level. Tlen is the length of the header + data + CRC in bytes.
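+ *
+ * The QP lookup below follows the reference counting scheme described
+ * at the top of ipath_verbs.h: the lookup takes a reference on the QP
+ * (or multicast group), the packet is handed to ipath_qp_rcv(), and the
+ * reference is then dropped, waking ipath_destroy_qp() (or
+ * ipath_multicast_detach()) if it is waiting for the count to reach
+ * zero.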
+ */
+static void ipath_ib_rcv(const ipath_type t, void *rhdr, void *data, u32 tlen)
+{
+	struct ipath_ibdev *dev = ipath_devices[t];
+	struct ipath_ib_header *hdr = rhdr;
+	struct ipath_other_headers *ohdr;
+	struct ipath_qp *qp;
+	u32 qp_num;
+	int lnh;
+	u8 opcode;
+
+	if (dev == NULL)
+		return;
+
+	if (tlen < 24) {	/* LRH+BTH+CRC */
+		dev->n_pkt_drops++;
+		return;
+	}
+
+	/* Check for GRH */
+	lnh = be16_to_cpu(hdr->lrh[0]) & 3;
+	if (lnh == IPS_LRH_BTH)
+		ohdr = &hdr->u.oth;
+	else if (lnh == IPS_LRH_GRH)
+		ohdr = &hdr->u.l.oth;
+	else {
+		dev->n_pkt_drops++;
+		return;
+	}
+
+	opcode = *(u8 *) (&ohdr->bth[0]);
+	dev->opstats[opcode].n_bytes += tlen;
+	dev->opstats[opcode].n_packets++;
+
+	/* Get the destination QP number. */
+	qp_num = be32_to_cpu(ohdr->bth[1]) & 0xFFFFFF;
+	if (qp_num == 0xFFFFFF) {
+		struct ipath_mcast *mcast;
+		struct ipath_mcast_qp *p;
+
+		mcast = ipath_mcast_find(&hdr->u.l.grh.dgid);
+		if (mcast == NULL) {
+			dev->n_pkt_drops++;
+			return;
+		}
+		dev->n_multicast_rcv++;
+		list_for_each_entry_rcu(p, &mcast->qp_list, list)
+			ipath_qp_rcv(dev, hdr, lnh == IPS_LRH_GRH, data, tlen,
+				     p->qp);
+		/*
+		 * Notify ipath_multicast_detach() if it is waiting for us
+		 * to finish.
+		 */
+		if (atomic_dec_return(&mcast->refcount) <= 1)
+			wake_up(&mcast->wait);
+	} else if ((qp = ipath_lookup_qpn(&dev->qp_table, qp_num)) != NULL) {
+		dev->n_unicast_rcv++;
+		ipath_qp_rcv(dev, hdr, lnh == IPS_LRH_GRH, data, tlen, qp);
+		/*
+		 * Notify ipath_destroy_qp() if it is waiting for us to finish.
+		 */
+		if (atomic_dec_and_test(&qp->refcount))
+			wake_up(&qp->wait);
+	} else
+		dev->n_pkt_drops++;
+}
+
+/*
+ * This is called from ipath_do_rcv_timer() at interrupt level
+ * to check for QPs which need retransmits and to collect performance numbers.
+ */
+static void ipath_ib_timer(const ipath_type t)
+{
+	struct ipath_ibdev *dev = ipath_devices[t];
+	struct ipath_qp *resend = NULL;
+	struct ipath_qp *rnr = NULL;
+	struct list_head *last;
+	struct ipath_qp *qp;
+	unsigned long flags;
+
+	if (dev == NULL)
+		return;
+
+	spin_lock_irqsave(&dev->pending_lock, flags);
+	/* Start filling the next pending queue. */
+	if (++dev->pending_index >= ARRAY_SIZE(dev->pending))
+		dev->pending_index = 0;
+	/* Save any requests still in the new queue, they have timed out.
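+	 * The pending[] queues are rotated once per timer tick, so a QP
+	 * still linked on the queue that has just become current has been
+	 * waiting for at least one full tick.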
+	 */
+	last = &dev->pending[dev->pending_index];
+	while (!list_empty(last)) {
+		qp = list_entry(last->next, struct ipath_qp, timerwait);
+		if (last->next == LIST_POISON1 ||
+		    last->next != &qp->timerwait ||
+		    qp->timerwait.prev != last) {
+			INIT_LIST_HEAD(last);
+		} else {
+			list_del(&qp->timerwait);
+			qp->timerwait.prev = (struct list_head *) resend;
+			resend = qp;
+			atomic_inc(&qp->refcount);
+		}
+	}
+	last = &dev->rnrwait;
+	if (!list_empty(last)) {
+		qp = list_entry(last->next, struct ipath_qp, timerwait);
+		if (--qp->s_rnr_timeout == 0) {
+			do {
+				if (last->next == LIST_POISON1 ||
+				    last->next != &qp->timerwait ||
+				    qp->timerwait.prev != last) {
+					INIT_LIST_HEAD(last);
+					break;
+				}
+				list_del(&qp->timerwait);
+				qp->timerwait.prev = (struct list_head *) rnr;
+				rnr = qp;
+				if (list_empty(last))
+					break;
+				qp = list_entry(last->next, struct ipath_qp,
+						timerwait);
+			} while (qp->s_rnr_timeout == 0);
+		}
+	}
+	/* We should only be in the started state if pma_sample_start != 0 */
+	if (dev->pma_sample_status == IB_PMA_SAMPLE_STATUS_STARTED &&
+	    --dev->pma_sample_start == 0) {
+		dev->pma_sample_status = IB_PMA_SAMPLE_STATUS_RUNNING;
+		ipath_layer_snapshot_counters(dev->ib_unit, &dev->ipath_sword,
+					      &dev->ipath_rword,
+					      &dev->ipath_spkts,
+					      &dev->ipath_rpkts);
+	}
+	if (dev->pma_sample_status == IB_PMA_SAMPLE_STATUS_RUNNING) {
+		if (dev->pma_sample_interval == 0) {
+			u64 ta, tb, tc, td;
+
+			dev->pma_sample_status = IB_PMA_SAMPLE_STATUS_DONE;
+			ipath_layer_snapshot_counters(dev->ib_unit,
+						      &ta, &tb, &tc, &td);
+
+			dev->ipath_sword = ta - dev->ipath_sword;
+			dev->ipath_rword = tb - dev->ipath_rword;
+			dev->ipath_spkts = tc - dev->ipath_spkts;
+			dev->ipath_rpkts = td - dev->ipath_rpkts;
+		} else {
+			dev->pma_sample_interval--;
+		}
+	}
+	spin_unlock_irqrestore(&dev->pending_lock, flags);
+
+	/* XXX What if timer fires again while this is running? */
+	for (qp = resend; qp != NULL;
+	     qp = (struct ipath_qp *) qp->timerwait.prev) {
+		struct ib_wc wc;
+
+		spin_lock_irqsave(&qp->s_lock, flags);
+		if (qp->s_last != qp->s_tail && qp->state == IB_QPS_RTS) {
+			dev->n_timeouts++;
+			ipath_restart_rc(qp, qp->s_last_psn + 1, &wc);
+		}
+		spin_unlock_irqrestore(&qp->s_lock, flags);
+
+		/* Notify ipath_destroy_qp() if it is waiting. */
+		if (atomic_dec_and_test(&qp->refcount))
+			wake_up(&qp->wait);
+	}
+	for (qp = rnr; qp != NULL;
+	     qp = (struct ipath_qp *) qp->timerwait.prev) {
+		tasklet_schedule(&qp->s_task);
+	}
+}
+
+/*
+ * This is called from ipath_intr() at interrupt level when a PIO buffer
+ * is available after ipath_verbs_send() returned an error that no
+ * buffers were available.
+ * Return 0 if we consumed all the PIO buffers and we still have QPs
+ * waiting for buffers (for now, just do a tasklet_schedule and return one).
+ */
+static int ipath_ib_piobufavail(const ipath_type t)
+{
+	struct ipath_ibdev *dev = ipath_devices[t];
+	struct ipath_qp *qp;
+	unsigned long flags;
+
+	if (dev == NULL)
+		return 1;
+
+	spin_lock_irqsave(&dev->pending_lock, flags);
+	while (!list_empty(&dev->piowait)) {
+		qp = list_entry(dev->piowait.next, struct ipath_qp, piowait);
+		list_del(&qp->piowait);
+		tasklet_schedule(&qp->s_task);
+	}
+	spin_unlock_irqrestore(&dev->pending_lock, flags);
+
+	return 1;
+}
+
+static struct ib_qp *ipath_create_qp(struct ib_pd *ibpd,
+				     struct ib_qp_init_attr *init_attr,
+				     struct ib_udata *udata)
+{
+	struct ipath_qp *qp;
+	int err;
+	struct ipath_swqe *swq = NULL;
+	struct ipath_ibdev *dev;
+	size_t sz;
+
+	if (init_attr->cap.max_send_sge > 255 ||
+	    init_attr->cap.max_recv_sge > 255)
+		return ERR_PTR(-ENOMEM);
+
+	switch (init_attr->qp_type) {
+	case IB_QPT_UC:
+	case IB_QPT_RC:
+		sz = sizeof(struct ipath_sge) * init_attr->cap.max_send_sge +
+		     sizeof(struct ipath_swqe);
+		swq = vmalloc((init_attr->cap.max_send_wr + 1) * sz);
+		if (swq == NULL)
+			return ERR_PTR(-ENOMEM);
+		/* FALLTHROUGH */
+	case IB_QPT_UD:
+	case IB_QPT_SMI:
+	case IB_QPT_GSI:
+		qp = kmalloc(sizeof(*qp), GFP_KERNEL);
+		if (!qp)
+			return ERR_PTR(-ENOMEM);
+		qp->r_rq.size = init_attr->cap.max_recv_wr + 1;
+		sz = sizeof(struct ipath_sge) * init_attr->cap.max_recv_sge +
+		     sizeof(struct ipath_rwqe);
+		qp->r_rq.wq = vmalloc(qp->r_rq.size * sz);
+		if (!qp->r_rq.wq) {
+			kfree(qp);
+			return ERR_PTR(-ENOMEM);
+		}
+
+		/*
+		 * ib_create_qp() will initialize qp->ibqp
+		 * except for qp->ibqp.qp_num.
+		 */
+		spin_lock_init(&qp->s_lock);
+		spin_lock_init(&qp->r_rq.lock);
+		atomic_set(&qp->refcount, 0);
+		init_waitqueue_head(&qp->wait);
+		tasklet_init(&qp->s_task,
+			     init_attr->qp_type == IB_QPT_RC ? do_rc_send :
+			     do_uc_send, (unsigned long)qp);
+		qp->piowait.next = LIST_POISON1;
+		qp->piowait.prev = LIST_POISON2;
+		qp->timerwait.next = LIST_POISON1;
+		qp->timerwait.prev = LIST_POISON2;
+		qp->state = IB_QPS_RESET;
+		qp->s_wq = swq;
+		qp->s_size = init_attr->cap.max_send_wr + 1;
+		qp->s_max_sge = init_attr->cap.max_send_sge;
+		qp->r_rq.max_sge = init_attr->cap.max_recv_sge;
+		qp->s_flags = init_attr->sq_sig_type == IB_SIGNAL_REQ_WR ?
+			      1 << IPATH_S_SIGNAL_REQ_WR : 0;
+		dev = to_idev(ibpd->device);
+		err = ipath_alloc_qpn(&dev->qp_table, qp, init_attr->qp_type);
+		if (err) {
+			vfree(swq);
+			vfree(qp->r_rq.wq);
+			kfree(qp);
+			return ERR_PTR(err);
+		}
+		ipath_reset_qp(qp);
+
+		/* Tell the core driver that the kernel SMA is present. */
+		if (qp->ibqp.qp_type == IB_QPT_SMI)
+			ipath_verbs_set_flags(dev->ib_unit,
+					      IPATH_VERBS_KERNEL_SMA);
+		break;
+
+	default:
+		/* Don't support raw QPs */
+		return ERR_PTR(-ENOSYS);
+	}
+
+	init_attr->cap.max_inline_data = 0;
+
+	return &qp->ibqp;
+}
+
+/*
+ * Note that this can be called while the QP is actively sending or receiving!
+ */
+static int ipath_destroy_qp(struct ib_qp *ibqp)
+{
+	struct ipath_qp *qp = to_iqp(ibqp);
+	struct ipath_ibdev *dev = to_idev(ibqp->device);
+	unsigned long flags;
+
+	/* Tell the core driver that the kernel SMA is gone. */
+	if (qp->ibqp.qp_type == IB_QPT_SMI)
+		ipath_verbs_set_flags(dev->ib_unit, 0);
+
+	spin_lock_irqsave(&qp->r_rq.lock, flags);
+	spin_lock(&qp->s_lock);
+	qp->state = IB_QPS_ERR;
+	spin_unlock(&qp->s_lock);
+	spin_unlock_irqrestore(&qp->r_rq.lock, flags);
+
+	/* Stop the sending tasklet. */
+	tasklet_kill(&qp->s_task);
+
+	/* Make sure the QP isn't on the timeout list. */
+	spin_lock_irqsave(&dev->pending_lock, flags);
+	if (qp->timerwait.next != LIST_POISON1)
+		list_del(&qp->timerwait);
+	if (qp->piowait.next != LIST_POISON1)
+		list_del(&qp->piowait);
+	spin_unlock_irqrestore(&dev->pending_lock, flags);
+
+	/*
+	 * Make sure that the QP is not in the QPN table so receive interrupts
+	 * will discard packets for this QP.
+	 * XXX Also remove QP from multicast table.
+	 */
+	if (atomic_read(&qp->refcount) != 0)
+		ipath_free_qp(&dev->qp_table, qp);
+
+	vfree(qp->s_wq);
+	vfree(qp->r_rq.wq);
+	kfree(qp);
+	return 0;
+}
+
+static struct ib_srq *ipath_create_srq(struct ib_pd *ibpd,
+				       struct ib_srq_init_attr *srq_init_attr,
+				       struct ib_udata *udata)
+{
+	struct ipath_srq *srq;
+	u32 sz;
+
+	if (srq_init_attr->attr.max_sge < 1)
+		return ERR_PTR(-EINVAL);
+
+	srq = kmalloc(sizeof(*srq), GFP_KERNEL);
+	if (!srq)
+		return ERR_PTR(-ENOMEM);
+
+	/* Need to use vmalloc() if we want to support large #s of entries. */
+	srq->rq.size = srq_init_attr->attr.max_wr + 1;
+	sz = sizeof(struct ipath_sge) * srq_init_attr->attr.max_sge +
+	     sizeof(struct ipath_rwqe);
+	srq->rq.wq = vmalloc(srq->rq.size * sz);
+	if (!srq->rq.wq) {
+		kfree(srq);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	/*
+	 * ib_create_srq() will initialize srq->ibsrq.
+	 */
+	spin_lock_init(&srq->rq.lock);
+	srq->rq.head = 0;
+	srq->rq.tail = 0;
+	srq->rq.max_sge = srq_init_attr->attr.max_sge;
+	srq->limit = srq_init_attr->attr.srq_limit;
+
+	return &srq->ibsrq;
+}
+
+int ipath_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr,
+		     enum ib_srq_attr_mask attr_mask)
+{
+	struct ipath_srq *srq = to_isrq(ibsrq);
+	unsigned long flags;
+
+	if (attr_mask & IB_SRQ_LIMIT) {
+		spin_lock_irqsave(&srq->rq.lock, flags);
+		srq->limit = attr->srq_limit;
+		spin_unlock_irqrestore(&srq->rq.lock, flags);
+	}
+	if (attr_mask & IB_SRQ_MAX_WR) {
+		u32 size = attr->max_wr + 1;
+		struct ipath_rwqe *wq, *p;
+		u32 n;
+		u32 sz;
+
+		if (attr->max_sge < srq->rq.max_sge)
+			return -EINVAL;
+
+		sz = sizeof(struct ipath_rwqe) +
+		     attr->max_sge * sizeof(struct ipath_sge);
+		wq = vmalloc(size * sz);
+		if (!wq)
+			return -ENOMEM;
+
+		spin_lock_irqsave(&srq->rq.lock, flags);
+		if (srq->rq.head < srq->rq.tail)
+			n = srq->rq.size + srq->rq.head - srq->rq.tail;
+		else
+			n = srq->rq.head - srq->rq.tail;
+		if (size <= n || size <= srq->limit) {
+			spin_unlock_irqrestore(&srq->rq.lock, flags);
+			vfree(wq);
+			return -EINVAL;
+		}
+		n = 0;
+		p = wq;
+		while (srq->rq.tail != srq->rq.head) {
+			struct ipath_rwqe *wqe;
+			int i;
+
+			wqe = get_rwqe_ptr(&srq->rq, srq->rq.tail);
+			p->wr_id = wqe->wr_id;
+			p->length = wqe->length;
+			p->num_sge = wqe->num_sge;
+			for (i = 0; i < wqe->num_sge; i++)
+				p->sg_list[i] = wqe->sg_list[i];
+			n++;
+			p = (struct ipath_rwqe *)((char *) p + sz);
+			if (++srq->rq.tail >= srq->rq.size)
+				srq->rq.tail = 0;
+		}
+		vfree(srq->rq.wq);
+		srq->rq.wq = wq;
+		srq->rq.size = size;
+		srq->rq.head = n;
+		srq->rq.tail = 0;
+		srq->rq.max_sge = attr->max_sge;
+		spin_unlock_irqrestore(&srq->rq.lock, flags);
+	}
+	return 0;
+}
+
+static int ipath_destroy_srq(struct ib_srq *ibsrq)
+{
+	struct ipath_srq *srq = to_isrq(ibsrq);
+
+	vfree(srq->rq.wq);
+	kfree(srq);
+
+	return 0;
+}
+
+/*
+ * This may be called from interrupt context.
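+ *
+ * The CQ is a ring buffer: entries are consumed from cq->tail until it
+ * catches up with cq->head, wrapping at cq->ibcq.cqe. A typical caller
+ * loop (an illustrative sketch only; handle_wc() stands in for whatever
+ * the consumer does with a completion) looks like:
+ *
+ *	struct ib_wc wc[4];
+ *	int i, n;
+ *
+ *	while ((n = ib_poll_cq(ibcq, 4, wc)) > 0)
+ *		for (i = 0; i < n; i++)
+ *			handle_wc(&wc[i]);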
+ */
+static int ipath_poll_cq(struct ib_cq *ibcq, int num_entries,
+			 struct ib_wc *entry)
+{
+	struct ipath_cq *cq = to_icq(ibcq);
+	unsigned long flags;
+	int npolled;
+
+	spin_lock_irqsave(&cq->lock, flags);
+
+	for (npolled = 0; npolled < num_entries; ++npolled, ++entry) {
+		if (cq->tail == cq->head)
+			break;
+		*entry = cq->queue[cq->tail];
+		if (++cq->tail == cq->ibcq.cqe)
+			cq->tail = 0;
+	}
+
+	spin_unlock_irqrestore(&cq->lock, flags);
+
+	return npolled;
+}
+
+static struct ib_cq *ipath_create_cq(struct ib_device *ibdev, int entries,
+				     struct ib_ucontext *context,
+				     struct ib_udata *udata)
+{
+	struct ipath_cq *cq;
+
+	/* Need to use vmalloc() if we want to support large #s of entries. */
+	cq = vmalloc(sizeof(*cq) + entries * sizeof(*cq->queue));
+	if (!cq)
+		return ERR_PTR(-ENOMEM);
+	/*
+	 * ib_create_cq() will initialize cq->ibcq except for cq->ibcq.cqe.
+	 * The number of entries should be >= the number requested or
+	 * return an error.
+	 */
+	cq->ibcq.cqe = entries + 1;
+	cq->notify = IB_CQ_NONE;
+	cq->triggered = 0;
+	spin_lock_init(&cq->lock);
+	tasklet_init(&cq->comptask, send_complete, (unsigned long)cq);
+	cq->head = 0;
+	cq->tail = 0;
+
+	return &cq->ibcq;
+}
+
+static int ipath_destroy_cq(struct ib_cq *ibcq)
+{
+	struct ipath_cq *cq = to_icq(ibcq);
+
+	tasklet_kill(&cq->comptask);
+	vfree(cq);
+
+	return 0;
+}
+
+/*
+ * This may be called from interrupt context.
+ */
+static int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
+{
+	struct ipath_cq *cq = to_icq(ibcq);
+	unsigned long flags;
+
+	spin_lock_irqsave(&cq->lock, flags);
+	/*
+	 * Don't change IB_CQ_NEXT_COMP to IB_CQ_SOLICITED but allow
+	 * any other transitions.
+	 */
+	if (cq->notify != IB_CQ_NEXT_COMP)
+		cq->notify = notify;
+	spin_unlock_irqrestore(&cq->lock, flags);
+	return 0;
+}
+
+static int ipath_query_device(struct ib_device *ibdev,
+			      struct ib_device_attr *props)
+{
+	struct ipath_ibdev *dev = to_idev(ibdev);
+	uint32_t vendor, boardrev, majrev, minrev;
+
+	memset(props, 0, sizeof(*props));
+
+	props->device_cap_flags = IB_DEVICE_BAD_PKEY_CNTR |
+		IB_DEVICE_BAD_QKEY_CNTR | IB_DEVICE_SHUTDOWN_PORT |
+		IB_DEVICE_SYS_IMAGE_GUID;
+	ipath_layer_query_device(dev->ib_unit, &vendor, &boardrev,
+				 &majrev, &minrev);
+	props->vendor_id = vendor;
+	props->vendor_part_id = boardrev;
+	props->hw_ver = boardrev << 16 | majrev << 8 | minrev;
+
+	props->sys_image_guid = dev->sys_image_guid;
+	props->node_guid = ipath_layer_get_guid(dev->ib_unit);
+
+	props->max_mr_size = ~0ull;
+	props->max_qp = 0xffff;
+	props->max_qp_wr = 0xffff;
+	props->max_sge = 255;
+	props->max_cq = 0xffff;
+	props->max_cqe = 0xffff;
+	props->max_mr = 0xffff;
+	props->max_pd = 0xffff;
+	props->max_qp_rd_atom = 1;
+	props->max_qp_init_rd_atom = 1;
+	/* props->max_res_rd_atom */
+	props->max_srq = 0xffff;
+	props->max_srq_wr = 0xffff;
+	props->max_srq_sge = 255;
+	/* props->local_ca_ack_delay */
+	props->atomic_cap = IB_ATOMIC_HCA;
+	props->max_pkeys = ipath_layer_get_npkeys(dev->ib_unit);
+	props->max_mcast_grp = 0xffff;
+	props->max_mcast_qp_attach = 0xffff;
+	props->max_total_mcast_qp_attach = props->max_mcast_qp_attach *
+		props->max_mcast_grp;
+
+	return 0;
+}
+
+static int ipath_query_port(struct ib_device *ibdev,
+			    u8 port, struct ib_port_attr *props)
+{
+	struct ipath_ibdev *dev = to_idev(ibdev);
+	uint32_t flags = ipath_layer_get_flags(dev->ib_unit);
+	enum ib_mtu mtu;
+	uint32_t l;
+	uint16_t lid = ipath_layer_get_lid(dev->ib_unit);
+
+	memset(props, 0, sizeof(*props));
+	props->lid = lid ? lid : IB_LID_PERMISSIVE;
+	props->lmc = dev->mkeyprot_resv_lmc & 7;
+	props->sm_lid = dev->sm_lid;
+	props->sm_sl = dev->sm_sl;
+	if (flags & IPATH_LINKDOWN)
+		props->state = IB_PORT_DOWN;
+	else if (flags & IPATH_LINKARMED)
+		props->state = IB_PORT_ARMED;
+	else if (flags & IPATH_LINKACTIVE)
+		props->state = IB_PORT_ACTIVE;
+	else if (flags & IPATH_LINK_SLEEPING)
+		props->state = IB_PORT_ACTIVE_DEFER;
+	else
+		props->state = IB_PORT_NOP;
+	/* See phys_state_show() */
+	props->phys_state = 5;	/* LinkUp */
+	props->port_cap_flags = dev->port_cap_flags;
+	props->gid_tbl_len = 1;
+	props->max_msg_sz = 4096;
+	props->pkey_tbl_len = ipath_layer_get_npkeys(dev->ib_unit);
+	props->bad_pkey_cntr = ipath_layer_get_cr_errpkey(dev->ib_unit);
+	props->qkey_viol_cntr = dev->qkey_violations;
+	props->active_width = IB_WIDTH_4X;
+	/* See rate_show() */
+	props->active_speed = 1;	/* 2.5 Gbps per lane, i.e. 10 Gbps at 4X */
+	props->max_vl_num = 1;	/* VLCap = VL0 */
+	props->init_type_reply = 0;
+
+	props->max_mtu = IB_MTU_4096;
+	l = ipath_layer_get_ibmtu(dev->ib_unit);
+	switch (l) {
+	case 4096:
+		mtu = IB_MTU_4096;
+		break;
+	case 2048:
+		mtu = IB_MTU_2048;
+		break;
+	case 1024:
+		mtu = IB_MTU_1024;
+		break;
+	case 512:
+		mtu = IB_MTU_512;
+		break;
+	case 256:
+		mtu = IB_MTU_256;
+		break;
+	default:
+		mtu = IB_MTU_2048;
+	}
+	props->active_mtu = mtu;
+	props->subnet_timeout = dev->subnet_timeout;
+
+	return 0;
+}
+
+static int ipath_modify_device(struct ib_device *device,
+			       int device_modify_mask,
+			       struct ib_device_modify *device_modify)
+{
+	if (device_modify_mask & IB_DEVICE_MODIFY_SYS_IMAGE_GUID)
+		to_idev(device)->sys_image_guid = device_modify->sys_image_guid;
+
+	return 0;
+}
+
+static int ipath_modify_port(struct ib_device *ibdev,
+			     u8 port, int port_modify_mask,
+			     struct ib_port_modify *props)
+{
+	struct ipath_ibdev *dev = to_idev(ibdev);
+
+	atomic_set_mask(props->set_port_cap_mask, &dev->port_cap_flags);
+	atomic_clear_mask(props->clr_port_cap_mask, &dev->port_cap_flags);
+	if (port_modify_mask & IB_PORT_SHUTDOWN)
+		ipath_kset_linkstate(dev->ib_unit << 16 | IPATH_IB_LINKDOWN);
+	if (port_modify_mask & IB_PORT_RESET_QKEY_CNTR)
+		dev->qkey_violations = 0;
+	return 0;
+}
+
+static int ipath_query_pkey(struct ib_device *ibdev,
+			    u8 port, u16 index, u16 *pkey)
+{
+	struct ipath_ibdev *dev = to_idev(ibdev);
+
+	if (index >= ipath_layer_get_npkeys(dev->ib_unit))
+		return -EINVAL;
+	*pkey = ipath_layer_get_pkey(dev->ib_unit, index);
+	return 0;
+}
+
+static int ipath_query_gid(struct ib_device *ibdev, u8 port,
+			   int index, union ib_gid *gid)
+{
+	struct ipath_ibdev *dev = to_idev(ibdev);
+
+	if (index >= 1)
+		return -EINVAL;
+	gid->global.subnet_prefix = dev->gid_prefix;
+	gid->global.interface_id = ipath_layer_get_guid(dev->ib_unit);
+
+	return 0;
+}
+
+static struct ib_pd *ipath_alloc_pd(struct ib_device *ibdev,
+				    struct ib_ucontext *context,
+				    struct ib_udata *udata)
+{
+	struct ipath_pd *pd;
+
+	pd = kmalloc(sizeof *pd, GFP_KERNEL);
+	if (!pd)
+		return ERR_PTR(-ENOMEM);
+
+	/* ib_alloc_pd() will initialize pd->ibpd. */
+	pd->user = udata != NULL;
+
+	return &pd->ibpd;
+}
+
+static int ipath_dealloc_pd(struct ib_pd *ibpd)
+{
+	struct ipath_pd *pd = to_ipd(ibpd);
+
+	kfree(pd);
+
+	return 0;
+}
+
+/*
+ * This may be called from interrupt context.
+ */
+static struct ib_ah *ipath_create_ah(struct ib_pd *pd,
+				     struct ib_ah_attr *ah_attr)
+{
+	struct ipath_ah *ah;
+
+	ah = kmalloc(sizeof *ah, GFP_ATOMIC);
+	if (!ah)
+		return ERR_PTR(-ENOMEM);
+
+	/* ib_create_ah() will initialize ah->ibah. */
+	ah->attr = *ah_attr;
+
+	return &ah->ibah;
+}
+
+/*
+ * This may be called from interrupt context.
+ */
+static int ipath_destroy_ah(struct ib_ah *ibah)
+{
+	struct ipath_ah *ah = to_iah(ibah);
+
+	kfree(ah);
+
+	return 0;
+}
+
+static struct ib_mr *ipath_get_dma_mr(struct ib_pd *pd, int acc)
+{
+	struct ipath_mr *mr;
+
+	mr = kmalloc(sizeof *mr, GFP_KERNEL);
+	if (!mr)
+		return ERR_PTR(-ENOMEM);
+
+	/* ib_get_dma_mr() will initialize mr->ibmr except for lkey and rkey. */
+	memset(mr, 0, sizeof *mr);
+	mr->mr.access_flags = acc;
+	return &mr->ibmr;
+}
+
+static struct ib_mr *ipath_reg_phys_mr(struct ib_pd *pd,
+				       struct ib_phys_buf *buffer_list,
+				       int num_phys_buf,
+				       int acc, u64 *iova_start)
+{
+	struct ipath_mr *mr;
+	int n, m, i;
+
+	/* Allocate struct plus pointers to first level page tables. */
+	m = (num_phys_buf + IPATH_SEGSZ - 1) / IPATH_SEGSZ;
+	mr = kmalloc(sizeof *mr + m * sizeof mr->mr.map[0], GFP_KERNEL);
+	if (!mr)
+		return ERR_PTR(-ENOMEM);
+
+	/* Allocate first level page tables. */
+	for (i = 0; i < m; i++) {
+		mr->mr.map[i] = kmalloc(sizeof *mr->mr.map[0], GFP_KERNEL);
+		if (!mr->mr.map[i]) {
+			while (i)
+				kfree(mr->mr.map[--i]);
+			kfree(mr);
+			return ERR_PTR(-ENOMEM);
+		}
+	}
+	mr->mr.mapsz = m;
+
+	/*
+	 * ib_reg_phys_mr() will initialize mr->ibmr except for
+	 * lkey and rkey.
+	 */
+	if (!ipath_alloc_lkey(&to_idev(pd->device)->lk_table, &mr->mr)) {
+		while (i)
+			kfree(mr->mr.map[--i]);
+		kfree(mr);
+		return ERR_PTR(-ENOMEM);
+	}
+	mr->ibmr.rkey = mr->ibmr.lkey = mr->mr.lkey;
+	mr->mr.user_base = *iova_start;
+	mr->mr.iova = *iova_start;
+	mr->mr.length = 0;
+	mr->mr.offset = 0;
+	mr->mr.access_flags = acc;
+	mr->mr.max_segs = num_phys_buf;
+	m = 0;
+	n = 0;
+	for (i = 0; i < num_phys_buf; i++) {
+		mr->mr.map[m]->segs[n].vaddr =
+			phys_to_virt(buffer_list[i].addr);
+		mr->mr.map[m]->segs[n].length = buffer_list[i].size;
+		mr->mr.length += buffer_list[i].size;
+		if (++n == IPATH_SEGSZ) {
+			m++;
+			n = 0;
+		}
+	}
+	return &mr->ibmr;
+}
+
+static struct ib_mr *ipath_reg_user_mr(struct ib_pd *pd,
+				       struct ib_umem *region,
+				       int mr_access_flags,
+				       struct ib_udata *udata)
+{
+	struct ipath_mr *mr;
+	struct ib_umem_chunk *chunk;
+	int n, m, i;
+
+	n = 0;
+	list_for_each_entry(chunk, &region->chunk_list, list)
+		n += chunk->nents;
+
+	/* Allocate struct plus pointers to first level page tables. */
+	m = (n + IPATH_SEGSZ - 1) / IPATH_SEGSZ;
+	mr = kmalloc(sizeof *mr + m * sizeof mr->mr.map[0], GFP_KERNEL);
+	if (!mr)
+		return ERR_PTR(-ENOMEM);
+
+	/* Allocate first level page tables. */
+	for (i = 0; i < m; i++) {
+		mr->mr.map[i] = kmalloc(sizeof *mr->mr.map[0], GFP_KERNEL);
+		if (!mr->mr.map[i]) {
+			while (i)
+				kfree(mr->mr.map[--i]);
+			kfree(mr);
+			return ERR_PTR(-ENOMEM);
+		}
+	}
+	mr->mr.mapsz = m;
+
+	/*
+	 * ib_uverbs_reg_mr() will initialize mr->ibmr except for
+	 * lkey and rkey.
+	 */
+	if (!ipath_alloc_lkey(&to_idev(pd->device)->lk_table, &mr->mr)) {
+		while (i)
+			kfree(mr->mr.map[--i]);
+		kfree(mr);
+		return ERR_PTR(-ENOMEM);
+	}
+	mr->ibmr.rkey = mr->ibmr.lkey = mr->mr.lkey;
+	mr->mr.user_base = region->user_base;
+	mr->mr.iova = region->virt_base;
+	mr->mr.length = region->length;
+	mr->mr.offset = region->offset;
+	mr->mr.access_flags = mr_access_flags;
+	mr->mr.max_segs = n;
+	m = 0;
+	n = 0;
+	list_for_each_entry(chunk, &region->chunk_list, list) {
+		for (i = 0; i < chunk->nmap; i++) {
+			mr->mr.map[m]->segs[n].vaddr =
+				page_address(chunk->page_list[i].page);
+			mr->mr.map[m]->segs[n].length = region->page_size;
+			if (++n == IPATH_SEGSZ) {
+				m++;
+				n = 0;
+			}
+		}
+	}
+	return &mr->ibmr;
+}
+
+/*
+ * Note that this is called to free MRs created by
+ * ipath_get_dma_mr() or ipath_reg_user_mr().
+ */
+static int ipath_dereg_mr(struct ib_mr *ibmr)
+{
+	struct ipath_mr *mr = to_imr(ibmr);
+	int i;
+
+	ipath_free_lkey(&to_idev(ibmr->device)->lk_table, ibmr->lkey);
+	i = mr->mr.mapsz;
+	while (i)
+		kfree(mr->mr.map[--i]);
+	kfree(mr);
+	return 0;
+}
+
+static struct ib_fmr *ipath_alloc_fmr(struct ib_pd *pd,
+				      int mr_access_flags,
+				      struct ib_fmr_attr *fmr_attr)
+{
+	struct ipath_fmr *fmr;
+	int m, i;
+
+	/* Allocate struct plus pointers to first level page tables. */
+	m = (fmr_attr->max_pages + IPATH_SEGSZ - 1) / IPATH_SEGSZ;
+	fmr = kmalloc(sizeof *fmr + m * sizeof fmr->mr.map[0], GFP_KERNEL);
+	if (!fmr)
+		return ERR_PTR(-ENOMEM);
+
+	/* Allocate first level page tables. */
+	for (i = 0; i < m; i++) {
+		fmr->mr.map[i] = kmalloc(sizeof *fmr->mr.map[0], GFP_KERNEL);
+		if (!fmr->mr.map[i]) {
+			while (i)
+				kfree(fmr->mr.map[--i]);
+			kfree(fmr);
+			return ERR_PTR(-ENOMEM);
+		}
+	}
+	fmr->mr.mapsz = m;
+
+	/* ib_alloc_fmr() will initialize fmr->ibfmr except for lkey & rkey. */
+	if (!ipath_alloc_lkey(&to_idev(pd->device)->lk_table, &fmr->mr)) {
+		while (i)
+			kfree(fmr->mr.map[--i]);
+		kfree(fmr);
+		return ERR_PTR(-ENOMEM);
+	}
+	fmr->ibfmr.rkey = fmr->ibfmr.lkey = fmr->mr.lkey;
+	/* Resources are allocated but no valid mapping (RKEY can't be used). */
+	fmr->mr.user_base = 0;
+	fmr->mr.iova = 0;
+	fmr->mr.length = 0;
+	fmr->mr.offset = 0;
+	fmr->mr.access_flags = mr_access_flags;
+	fmr->mr.max_segs = fmr_attr->max_pages;
+	fmr->page_size = fmr_attr->page_size;
+	return &fmr->ibfmr;
+}
+
+/*
+ * This may be called from interrupt context.
+ * XXX Can we ever be called to map a portion of the RKEY space?
+ */
+static int ipath_map_phys_fmr(struct ib_fmr *ibfmr,
+			      u64 * page_list, int list_len, u64 iova)
+{
+	struct ipath_fmr *fmr = to_ifmr(ibfmr);
+	struct ipath_lkey_table *rkt;
+	unsigned long flags;
+	int m, n, i;
+	u32 ps;
+
+	if (list_len > fmr->mr.max_segs)
+		return -EINVAL;
+	rkt = &to_idev(ibfmr->device)->lk_table;
+	spin_lock_irqsave(&rkt->lock, flags);
+	fmr->mr.user_base = iova;
+	fmr->mr.iova = iova;
+	ps = 1 << fmr->page_size;
+	fmr->mr.length = list_len * ps;
+	m = 0;
+	n = 0;
+	ps = 1 << fmr->page_size;
+	for (i = 0; i < list_len; i++) {
+		fmr->mr.map[m]->segs[n].vaddr = phys_to_virt(page_list[i]);
+		fmr->mr.map[m]->segs[n].length = ps;
+		if (++n == IPATH_SEGSZ) {
+			m++;
+			n = 0;
+		}
+	}
+	spin_unlock_irqrestore(&rkt->lock, flags);
+	return 0;
+}
+
+static int ipath_unmap_fmr(struct list_head *fmr_list)
+{
+	struct ipath_fmr *fmr;
+
+	list_for_each_entry(fmr, fmr_list, ibfmr.list) {
+		fmr->mr.user_base = 0;
+		fmr->mr.iova = 0;
+		fmr->mr.length = 0;
+	}
+	return 0;
+}
+
+static int ipath_dealloc_fmr(struct ib_fmr *ibfmr)
+{
+	struct ipath_fmr *fmr = to_ifmr(ibfmr);
+	int i;
+
+	ipath_free_lkey(&to_idev(ibfmr->device)->lk_table, ibfmr->lkey);
+	i = fmr->mr.mapsz;
+	while (i)
+		kfree(fmr->mr.map[--i]);
+	kfree(fmr);
+	return 0;
+}
+
+static ssize_t show_rev(struct class_device *cdev, char *buf)
+{
+	struct ipath_ibdev *dev =
+		container_of(cdev, struct ipath_ibdev, ibdev.class_dev);
+	int vendor, boardrev, majrev, minrev;
+
+	ipath_layer_query_device(dev->ib_unit, &vendor, &boardrev,
+				 &majrev, &minrev);
+	return sprintf(buf, "%d.%d\n", majrev, minrev);
+}
+
+static ssize_t show_hca(struct class_device *cdev, char *buf)
+{
+	struct ipath_ibdev *dev =
+		container_of(cdev, struct ipath_ibdev, ibdev.class_dev);
+	int vendor, boardrev, majrev, minrev;
+
+	ipath_layer_query_device(dev->ib_unit, &vendor, &boardrev,
+				 &majrev, &minrev);
+	ipath_get_boardname(dev->ib_unit, buf, 128);
+	strcat(buf, "\n");
+	return strlen(buf);
+}
+
+static ssize_t show_board(struct class_device *cdev, char *buf)
+{
+	struct ipath_ibdev *dev =
+		container_of(cdev, struct ipath_ibdev, ibdev.class_dev);
+	int vendor, boardrev, majrev, minrev;
+
+	ipath_layer_query_device(dev->ib_unit, &vendor, &boardrev,
+				 &majrev, &minrev);
+	ipath_get_boardname(dev->ib_unit, buf, 128);
+	strcat(buf, "\n");
+	return strlen(buf);
+}
+
+static ssize_t show_stats(struct class_device *cdev, char *buf)
+{
+	struct ipath_ibdev *dev =
+		container_of(cdev, struct ipath_ibdev, ibdev.class_dev);
+	char *p;
+	int i;
+
+	sprintf(buf,
+		"RC resends %d\n"
+		"RC QACKs %d\n"
+		"RC ACKs %d\n"
+		"RC SEQ NAKs %d\n"
+		"RC RDMA seq %d\n"
+		"RC RNR NAKs %d\n"
+		"RC OTH NAKs %d\n"
+		"RC timeouts %d\n"
+		"RC RDMA dup %d\n"
+		"piobuf wait %d\n"
+		"no piobuf %d\n"
+		"PKT drops %d\n"
+		"WQE errs %d\n",
+		dev->n_rc_resends, dev->n_rc_qacks, dev->n_rc_acks,
+		dev->n_seq_naks, dev->n_rdma_seq, dev->n_rnr_naks,
+		dev->n_other_naks, dev->n_timeouts, dev->n_rdma_dup_busy,
+		dev->n_piowait, dev->n_no_piobuf, dev->n_pkt_drops,
+		dev->n_wqe_errs);
+	p = buf;
+	for (i = 0; i < ARRAY_SIZE(dev->opstats); i++) {
+		if (!dev->opstats[i].n_packets && !dev->opstats[i].n_bytes)
+			continue;
+		p += strlen(p);
+		sprintf(p, "%02x %llu/%llu\n",
+			i, dev->opstats[i].n_packets, dev->opstats[i].n_bytes);
+	}
+	return strlen(buf);
+}
+
+static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL);
+static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL);
+static CLASS_DEVICE_ATTR(board_id, S_IRUGO, show_board, NULL);
+static CLASS_DEVICE_ATTR(stats, S_IRUGO, show_stats, NULL);
+
+static struct class_device_attribute *ipath_class_attributes[] = {
+	&class_device_attr_hw_rev,
+	&class_device_attr_hca_type,
+	&class_device_attr_board_id,
+	&class_device_attr_stats
+};
+
+/*
+ * Allocate a ucontext.
+ */
+
+static struct ib_ucontext *ipath_alloc_ucontext(struct ib_device *ibdev,
+						struct ib_udata *udata)
+{
+	struct ipath_ucontext *context;
+
+	context = kmalloc(sizeof *context, GFP_KERNEL);
+	if (!context)
+		return ERR_PTR(-ENOMEM);
+
+	return &context->ibucontext;
+}
+
+static int ipath_dealloc_ucontext(struct ib_ucontext *context)
+{
+	kfree(to_iucontext(context));
+	return 0;
+}
+
+/*
+ * Register our device with the infiniband core.
+ */
+static int ipath_register_ib_device(const ipath_type t)
+{
+	struct ipath_ibdev *idev;
+	struct ib_device *dev;
+	int i;
+	int ret;
+
+	idev = (struct ipath_ibdev *)ib_alloc_device(sizeof *idev);
+	if (idev == NULL)
+		return -ENOMEM;
+
+	dev = &idev->ibdev;
+
+	/* Only need to initialize non-zero fields. */
+	spin_lock_init(&idev->qp_table.lock);
+	spin_lock_init(&idev->lk_table.lock);
+	idev->sm_lid = IB_LID_PERMISSIVE;
+	idev->gid_prefix = __constant_cpu_to_be64(0xfe80000000000000UL);
+	idev->qp_table.last = 1;	/* QPN 0 and 1 are special. */
+	idev->qp_table.max = ib_ipath_qp_table_size;
+	idev->qp_table.nmaps = 1;
+	idev->qp_table.table = kmalloc(idev->qp_table.max *
+				       sizeof(*idev->qp_table.table),
+				       GFP_KERNEL);
+	if (idev->qp_table.table == NULL) {
+		ret = -ENOMEM;
+		goto err_qp;
+	}
+	memset(idev->qp_table.table, 0,
+	       idev->qp_table.max * sizeof(*idev->qp_table.table));
+	for (i = 0; i < ARRAY_SIZE(idev->qp_table.map); i++) {
+		atomic_set(&idev->qp_table.map[i].n_free, BITS_PER_PAGE);
+		idev->qp_table.map[i].page = NULL;
+	}
+	/*
+	 * The top ib_ipath_lkey_table_size bits are used to index the table.
+	 * The lower 8 bits can be owned by the user (copied from the LKEY).
+	 * The remaining bits act as a generation number or tag.
+	 */
+	idev->lk_table.max = 1 << ib_ipath_lkey_table_size;
+	idev->lk_table.table = kmalloc(idev->lk_table.max *
+				       sizeof(*idev->lk_table.table),
+				       GFP_KERNEL);
+	if (idev->lk_table.table == NULL) {
+		ret = -ENOMEM;
+		goto err_lk;
+	}
+	memset(idev->lk_table.table, 0,
+	       idev->lk_table.max * sizeof(*idev->lk_table.table));
+	spin_lock_init(&idev->pending_lock);
+	INIT_LIST_HEAD(&idev->pending[0]);
+	INIT_LIST_HEAD(&idev->pending[1]);
+	INIT_LIST_HEAD(&idev->pending[2]);
+	INIT_LIST_HEAD(&idev->piowait);
+	INIT_LIST_HEAD(&idev->rnrwait);
+	idev->pending_index = 0;
+	idev->port_cap_flags =
+		IB_PORT_SYS_IMAGE_GUID_SUP | IB_PORT_CLIENT_REG_SUP;
+	idev->pma_counter_select[0] = IB_PMA_PORT_XMIT_DATA;
+	idev->pma_counter_select[1] = IB_PMA_PORT_RCV_DATA;
+	idev->pma_counter_select[2] = IB_PMA_PORT_XMIT_PKTS;
+	idev->pma_counter_select[3] = IB_PMA_PORT_RCV_PKTS;
+	idev->pma_counter_select[4] = IB_PMA_PORT_XMIT_WAIT;
+
+	/*
+	 * The system image GUID is supposed to be the same for all
+	 * IB HCAs in a single system.
+	 * Note that this code assumes device zero is found first.
+	 */
+	idev->sys_image_guid =
+		t ? ipath_devices[t]->sys_image_guid : ipath_layer_get_guid(t);
+	idev->ib_unit = t;
+
+	strlcpy(dev->name, "ipath%d", IB_DEVICE_NAME_MAX);
+	dev->node_guid = ipath_layer_get_guid(t);
+	dev->uverbs_abi_ver = IPATH_UVERBS_ABI_VERSION;
+	dev->uverbs_cmd_mask =
+		(1ull << IB_USER_VERBS_CMD_GET_CONTEXT) |
+		(1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) |
+		(1ull << IB_USER_VERBS_CMD_QUERY_PORT) |
+		(1ull << IB_USER_VERBS_CMD_ALLOC_PD) |
+		(1ull << IB_USER_VERBS_CMD_DEALLOC_PD) |
+		(1ull << IB_USER_VERBS_CMD_CREATE_AH) |
+		(1ull << IB_USER_VERBS_CMD_DESTROY_AH) |
+		(1ull << IB_USER_VERBS_CMD_REG_MR) |
+		(1ull << IB_USER_VERBS_CMD_DEREG_MR) |
+		(1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) |
+		(1ull << IB_USER_VERBS_CMD_CREATE_CQ) |
+		(1ull << IB_USER_VERBS_CMD_DESTROY_CQ) |
+		(1ull << IB_USER_VERBS_CMD_POLL_CQ) |
+		(1ull << IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) |
+		(1ull << IB_USER_VERBS_CMD_CREATE_QP) |
+		(1ull << IB_USER_VERBS_CMD_MODIFY_QP) |
+		(1ull << IB_USER_VERBS_CMD_DESTROY_QP) |
+		(1ull << IB_USER_VERBS_CMD_POST_SEND) |
+		(1ull << IB_USER_VERBS_CMD_POST_RECV) |
+		(1ull << IB_USER_VERBS_CMD_ATTACH_MCAST) |
+		(1ull << IB_USER_VERBS_CMD_DETACH_MCAST) |
+		(1ull << IB_USER_VERBS_CMD_CREATE_SRQ) |
+		(1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) |
+		(1ull << IB_USER_VERBS_CMD_DESTROY_SRQ) |
+		(1ull << IB_USER_VERBS_CMD_POST_SRQ_RECV);
+	dev->node_type = IB_NODE_CA;
+	dev->phys_port_cnt = 1;
+	dev->dma_device = ipath_layer_get_pcidev(t);
+	dev->class_dev.dev = dev->dma_device;
+	dev->query_device = ipath_query_device;
+	dev->modify_device = ipath_modify_device;
+	dev->query_port = ipath_query_port;
+	dev->modify_port = ipath_modify_port;
+	dev->query_pkey = ipath_query_pkey;
+	dev->query_gid = ipath_query_gid;
+	dev->alloc_ucontext = ipath_alloc_ucontext;
+	dev->dealloc_ucontext = ipath_dealloc_ucontext;
+	dev->alloc_pd = ipath_alloc_pd;
+	dev->dealloc_pd = ipath_dealloc_pd;
+	dev->create_ah = ipath_create_ah;
+	dev->destroy_ah = ipath_destroy_ah;
+	dev->create_srq = ipath_create_srq;
+	dev->modify_srq = ipath_modify_srq;
+	dev->destroy_srq = ipath_destroy_srq;
+	dev->create_qp = ipath_create_qp;
+	dev->modify_qp = ipath_modify_qp;
+	dev->destroy_qp = ipath_destroy_qp;
+	dev->post_send = ipath_post_send;
+	dev->post_recv = ipath_post_receive;
+	dev->post_srq_recv = ipath_post_srq_receive;
+	dev->create_cq = ipath_create_cq;
+	dev->destroy_cq = ipath_destroy_cq;
+	dev->poll_cq = ipath_poll_cq;
+	dev->req_notify_cq = ipath_req_notify_cq;
+	dev->get_dma_mr = ipath_get_dma_mr;
+	dev->reg_phys_mr = ipath_reg_phys_mr;
+	dev->reg_user_mr = ipath_reg_user_mr;
+	dev->dereg_mr = ipath_dereg_mr;
+	dev->alloc_fmr = ipath_alloc_fmr;
+	dev->map_phys_fmr = ipath_map_phys_fmr;
+	dev->unmap_fmr = ipath_unmap_fmr;
+	dev->dealloc_fmr = ipath_dealloc_fmr;
+	dev->attach_mcast = ipath_multicast_attach;
+	dev->detach_mcast = ipath_multicast_detach;
+	dev->process_mad = ipath_process_mad;
+
+	ret = ib_register_device(dev);
+	if (ret)
+		goto err_reg;
+
+	/*
+	 * We don't need to register a MAD agent, we just need to create
+	 * a linker dependency on ib_mad so the module is loaded before
+	 * this module is initialized.  The call to ib_register_device()
+	 * above will then cause ib_mad to create QP 0 & 1.
+	 */
+	(void) ib_register_mad_agent(dev, 1, (enum ib_qp_type) 2,
+				     NULL, 0, NULL, NULL, NULL);
+
+	for (i = 0; i < ARRAY_SIZE(ipath_class_attributes); ++i) {
+		ret = class_device_create_file(&dev->class_dev,
+					       ipath_class_attributes[i]);
+		if (ret)
+			goto err_class;
+	}
+
+	ipath_layer_enable_timer(t);
+
+	ipath_devices[t] = idev;
+	return 0;
+
+err_class:
+	ib_unregister_device(dev);
+err_reg:
+	kfree(idev->lk_table.table);
+err_lk:
+	kfree(idev->qp_table.table);
+err_qp:
+	ib_dealloc_device(dev);
+	return ret;
+}
+
+static void ipath_unregister_ib_device(struct ipath_ibdev *dev)
+{
+	struct ib_device *ibdev = &dev->ibdev;
+
+	ipath_layer_disable_timer(dev->ib_unit);
+
+	ib_unregister_device(ibdev);
+
+	if (!list_empty(&dev->pending[0]) || !list_empty(&dev->pending[1]) ||
+	    !list_empty(&dev->pending[2]))
+		_VERBS_ERROR("ipath%d pending list not empty!\n", dev->ib_unit);
+	if (!list_empty(&dev->piowait))
+		_VERBS_ERROR("ipath%d piowait list not empty!\n", dev->ib_unit);
+	if (!list_empty(&dev->rnrwait))
+		_VERBS_ERROR("ipath%d rnrwait list not empty!\n", dev->ib_unit);
+	if (mcast_tree.rb_node != NULL)
+		_VERBS_ERROR("ipath%d multicast table memory leak!\n",
+			     dev->ib_unit);
+	/*
+	 * Note that ipath_unregister_ib_device() can be called before all
+	 * the QPs are destroyed!
+	 */
+	ipath_free_all_qps(&dev->qp_table);
+	kfree(dev->qp_table.table);
+	kfree(dev->lk_table.table);
+	ib_dealloc_device(ibdev);
+}
+
+int __init ipath_verbs_init(void)
+{
+	int i;
+
+	number_of_devices = ipath_layer_get_num_of_dev();
+	i = number_of_devices * sizeof(struct ipath_ibdev *);
+	ipath_devices = kmalloc(i, GFP_ATOMIC);
+	if (ipath_devices == NULL)
+		return -ENOMEM;
+
+	for (i = 0; i < number_of_devices; i++) {
+		int ret = ipath_verbs_register(i, ipath_ib_piobufavail,
+					       ipath_ib_rcv, ipath_ib_timer);
+
+		if (ret == 0)
+			ipath_devices[i] = NULL;
+		else if ((ret = ipath_register_ib_device(i)) != 0) {
+			_VERBS_ERROR("ib_ipath%d cannot register ib device "
+				     "(%d)!\n", i, ret);
+			ipath_verbs_unregister(i);
+			ipath_devices[i] = NULL;
+		}
+	}
+
+	return 0;
+}
+
+void __exit ipath_verbs_cleanup(void)
+{
+	int i;
+
+	for (i = 0; i < number_of_devices; i++)
+		if (ipath_devices[i]) {
+			ipath_unregister_ib_device(ipath_devices[i]);
+			ipath_verbs_unregister(i);
+		}
+
+	kfree(ipath_devices);
+}
+
+module_init(ipath_verbs_init);
+module_exit(ipath_verbs_cleanup);

From bos at pathscale.com Wed Dec 28 16:31:35 2005
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Wed, 28 Dec 2005 16:31:35 -0800
Subject: [openib-general] [PATCH 16 of 20] ipath - infiniband verbs support, part 2 of 3
In-Reply-To: 
Message-ID: 

Signed-off-by: Bryan O'Sullivan

diff -r 471b7a7a005c -r fc067af322a1 drivers/infiniband/hw/ipath/ipath_verbs.c
--- a/drivers/infiniband/hw/ipath/ipath_verbs.c	Wed Dec 28 14:19:43 2005 -0800
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c	Wed Dec 28 14:19:43 2005 -0800
@@ -2305,3 +2305,2513 @@
 	spin_unlock_irqrestore(&qp->s_lock, flags);
 	clear_bit(IPATH_S_BUSY, &qp->s_flags);
 }
+
+static void send_rc_ack(struct ipath_qp *qp)
+{
+	struct ipath_ibdev *dev = to_idev(qp->ibqp.device);
+	u16 lrh0;
+	u32 bth0;
+	u32 hwords;
+	struct ipath_other_headers *ohdr;
+
+	/* Construct the header. */
+	ohdr = &qp->s_hdr.u.oth;
+	lrh0 = IPS_LRH_BTH;
+	/* header size in 32-bit words LRH+BTH+AETH = (8+12+4)/4. */
+	hwords = 6;
+	if (unlikely(qp->remote_ah_attr.ah_flags & IB_AH_GRH)) {
+		ohdr = &qp->s_hdr.u.l.oth;
+		/* Header size in 32-bit words.
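+		 * A GRH is 40 bytes, i.e. ten more 32-bit words on top
+		 * of the six words of LRH+BTH+AETH counted above.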
+		 */
+		hwords += 10;
+		lrh0 = IPS_LRH_GRH;
+		qp->s_hdr.u.l.grh.version_tclass_flow =
+			cpu_to_be32((6 << 28) |
+				    (qp->remote_ah_attr.grh.traffic_class << 20) |
+				    qp->remote_ah_attr.grh.flow_label);
+		qp->s_hdr.u.l.grh.paylen =
+			cpu_to_be16(((hwords - 12) + SIZE_OF_CRC) << 2);
+		qp->s_hdr.u.l.grh.next_hdr = 0x1B;
+		qp->s_hdr.u.l.grh.hop_limit = qp->remote_ah_attr.grh.hop_limit;
+		/* The SGID is 32-bit aligned. */
+		qp->s_hdr.u.l.grh.sgid.global.subnet_prefix = dev->gid_prefix;
+		qp->s_hdr.u.l.grh.sgid.global.interface_id =
+			ipath_layer_get_guid(dev->ib_unit);
+		qp->s_hdr.u.l.grh.dgid = qp->remote_ah_attr.grh.dgid;
+	}
+	bth0 = ipath_layer_get_pkey(dev->ib_unit, qp->s_pkey_index);
+	ohdr->u.aeth = ipath_compute_aeth(qp);
+	if (qp->s_ack_state >= IB_OPCODE_RC_COMPARE_SWAP) {
+		bth0 |= IB_OPCODE_ATOMIC_ACKNOWLEDGE << 24;
+		ohdr->u.at.atomic_ack_eth = cpu_to_be64(qp->s_ack_atomic);
+		hwords += sizeof(ohdr->u.at.atomic_ack_eth) / 4;
+	} else {
+		bth0 |= IB_OPCODE_RC_ACKNOWLEDGE << 24;
+	}
+	lrh0 |= qp->remote_ah_attr.sl << 4;
+	qp->s_hdr.lrh[0] = cpu_to_be16(lrh0);
+	/* DEST LID */
+	qp->s_hdr.lrh[1] = cpu_to_be16(qp->remote_ah_attr.dlid);
+	qp->s_hdr.lrh[2] = cpu_to_be16(hwords + SIZE_OF_CRC);
+	qp->s_hdr.lrh[3] = cpu_to_be16(ipath_layer_get_lid(dev->ib_unit));
+	ohdr->bth[0] = cpu_to_be32(bth0);
+	ohdr->bth[1] = cpu_to_be32(qp->remote_qpn);
+	ohdr->bth[2] = cpu_to_be32(qp->s_ack_psn & 0xFFFFFF);
+
+	/*
+	 * If we can send the ACK, clear the ACK state.
+	 */
+	if (ipath_verbs_send(dev->ib_unit, hwords, (uint32_t *) &qp->s_hdr,
+			     0, NULL) == 0) {
+		qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE;
+		dev->n_rc_qacks++;
+		dev->n_unicast_xmit++;
+	}
+}
+
+/*
+ * Back up the requester to resend the last un-ACKed request.
+ * The QP s_lock should be held.
+ */
+static void ipath_restart_rc(struct ipath_qp *qp, u32 psn, struct ib_wc *wc)
+{
+	struct ipath_swqe *wqe = get_swqe_ptr(qp, qp->s_last);
+	struct ipath_ibdev *dev;
+	u32 n;
+
+	/*
+	 * If there are no requests pending, we are done.
+	 */
+	if (cmp24(psn, qp->s_next_psn) >= 0 || qp->s_last == qp->s_tail)
+		goto done;
+
+	if (qp->s_retry == 0) {
+		wc->wr_id = wqe->wr.wr_id;
+		wc->status = IB_WC_RETRY_EXC_ERR;
+		wc->opcode = wc_opcode[wqe->wr.opcode];
+		wc->vendor_err = 0;
+		wc->byte_len = 0;
+		wc->qp_num = qp->ibqp.qp_num;
+		wc->src_qp = qp->remote_qpn;
+		wc->pkey_index = 0;
+		wc->slid = qp->remote_ah_attr.dlid;
+		wc->sl = qp->remote_ah_attr.sl;
+		wc->dlid_path_bits = 0;
+		wc->port_num = 0;
+		ipath_sqerror_qp(qp, wc);
+		return;
+	}
+	qp->s_retry--;
+
+	/*
+	 * Remove the QP from the timeout queue.
+	 * Note: it may already have been removed by ipath_ib_timer().
+	 */
+	dev = to_idev(qp->ibqp.device);
+	spin_lock(&dev->pending_lock);
+	if (qp->timerwait.next != LIST_POISON1)
+		list_del(&qp->timerwait);
+	spin_unlock(&dev->pending_lock);
+
+	if (wqe->wr.opcode == IB_WR_RDMA_READ)
+		dev->n_rc_resends++;
+	else
+		dev->n_rc_resends += (int)qp->s_psn - (int)psn;
+
+	/*
+	 * If we are starting the request from the beginning, let the
+	 * normal send code handle initialization.
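+	 *
+	 * Otherwise, walk the send queue below to find the request that
+	 * contains "psn", point s_cur and s_psn at it, and pick an s_state
+	 * that lets do_rc_send() resume in the middle of that request.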
+	 */
+	qp->s_cur = qp->s_last;
+	if (cmp24(psn, wqe->psn) <= 0) {
+		qp->s_state = IB_OPCODE_RC_SEND_LAST;
+		qp->s_psn = wqe->psn;
+	} else {
+		n = qp->s_cur;
+		for (;;) {
+			if (++n == qp->s_size)
+				n = 0;
+			if (n == qp->s_tail) {
+				if (cmp24(psn, qp->s_next_psn) >= 0) {
+					qp->s_cur = n;
+					wqe = get_swqe_ptr(qp, n);
+				}
+				break;
+			}
+			wqe = get_swqe_ptr(qp, n);
+			if (cmp24(psn, wqe->psn) < 0)
+				break;
+			qp->s_cur = n;
+		}
+		qp->s_psn = psn;
+
+		/*
+		 * Reset the state to restart in the middle of a request.
+		 * Don't change the s_sge, s_cur_sge, or s_cur_size.
+		 * See do_rc_send().
+		 */
+		switch (wqe->wr.opcode) {
+		case IB_WR_SEND:
+		case IB_WR_SEND_WITH_IMM:
+			qp->s_state = IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST;
+			break;
+
+		case IB_WR_RDMA_WRITE:
+		case IB_WR_RDMA_WRITE_WITH_IMM:
+			qp->s_state = IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST;
+			break;
+
+		case IB_WR_RDMA_READ:
+			qp->s_state = IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE;
+			break;
+
+		default:
+			/*
+			 * This case shouldn't happen since it's only
+			 * one PSN per req.
+			 */
+			qp->s_state = IB_OPCODE_RC_SEND_LAST;
+		}
+	}
+
+done:
+	tasklet_schedule(&qp->s_task);
+}
+
+/*
+ * Handle RC and UC post sends.
+ */
+static int ipath_post_rc_send(struct ipath_qp *qp, struct ib_send_wr *wr)
+{
+	struct ipath_swqe *wqe;
+	unsigned long flags;
+	u32 next;
+	int i, j;
+	int acc;
+
+	/*
+	 * Don't allow RDMA reads or atomic operations on UC or
+	 * undefined operations.
+	 * Make sure buffer is large enough to hold the result for atomics.
+	 */
+	if (qp->ibqp.qp_type == IB_QPT_UC) {
+		if ((unsigned) wr->opcode >= IB_WR_RDMA_READ)
+			return -EINVAL;
+	} else if ((unsigned) wr->opcode > IB_WR_ATOMIC_FETCH_AND_ADD)
+		return -EINVAL;
+	else if (wr->opcode >= IB_WR_ATOMIC_CMP_AND_SWP &&
+		 (wr->num_sge == 0 || wr->sg_list[0].length < sizeof(u64) ||
+		  wr->sg_list[0].addr & 0x7))
+		return -EINVAL;
+
+	/* IB spec says that num_sge == 0 is OK. */
+	if (wr->num_sge > qp->s_max_sge)
+		return -ENOMEM;
+
+	spin_lock_irqsave(&qp->s_lock, flags);
+	next = qp->s_head + 1;
+	if (next >= qp->s_size)
+		next = 0;
+	if (next == qp->s_last) {
+		spin_unlock_irqrestore(&qp->s_lock, flags);
+		return -EINVAL;
+	}
+
+	wqe = get_swqe_ptr(qp, qp->s_head);
+	wqe->wr = *wr;
+	wqe->ssn = qp->s_ssn++;
+	wqe->sg_list[0].mr = NULL;
+	wqe->sg_list[0].vaddr = NULL;
+	wqe->sg_list[0].length = 0;
+	wqe->sg_list[0].sge_length = 0;
+	wqe->length = 0;
+	acc = wr->opcode >= IB_WR_RDMA_READ ? IB_ACCESS_LOCAL_WRITE : 0;
+	for (i = 0, j = 0; i < wr->num_sge; i++) {
+		if (to_ipd(qp->ibqp.pd)->user && wr->sg_list[i].lkey == 0) {
+			spin_unlock_irqrestore(&qp->s_lock, flags);
+			return -EINVAL;
+		}
+		if (wr->sg_list[i].length == 0)
+			continue;
+		if (!ipath_lkey_ok(&to_idev(qp->ibqp.device)->lk_table,
+				   &wqe->sg_list[j], &wr->sg_list[i], acc)) {
+			spin_unlock_irqrestore(&qp->s_lock, flags);
+			return -EINVAL;
+		}
+		wqe->length += wr->sg_list[i].length;
+		j++;
+	}
+	wqe->wr.num_sge = j;
+	qp->s_head = next;
+	/*
+	 * Wake up the send tasklet if the QP is not waiting
+	 * for an RNR timeout.
+	 */
+	next = qp->s_rnr_timeout;
+	spin_unlock_irqrestore(&qp->s_lock, flags);
+
+	if (next == 0) {
+		if (qp->ibqp.qp_type == IB_QPT_UC)
+			do_uc_send((unsigned long) qp);
+		else
+			do_rc_send((unsigned long) qp);
+	}
+	return 0;
+}
+
+/*
+ * Note that we actually send the data as it is posted instead of putting
+ * the request into a ring buffer.  If we wanted to use a ring buffer,
+ * we would need to save a reference to the destination address in the SWQE.
+ */
+static int ipath_post_ud_send(struct ipath_qp *qp, struct ib_send_wr *wr)
+{
+	struct ipath_ibdev *dev = to_idev(qp->ibqp.device);
+	struct ipath_other_headers *ohdr;
+	struct ib_ah_attr *ah_attr;
+	struct ipath_sge_state ss;
+	struct ipath_sge *sg_list;
+	struct ib_wc wc;
+	u32 hwords;
+	u32 nwords;
+	u32 len;
+	u32 extra_bytes;
+	u32 bth0;
+	u16 lrh0;
+	u16 lid;
+	int i;
+
+	if (!(state_ops[qp->state] & IPATH_PROCESS_SEND_OK))
+		return 0;
+
+	/* IB spec says that num_sge == 0 is OK. */
+	if (wr->num_sge > qp->s_max_sge)
+		return -EINVAL;
+
+	if (wr->num_sge > 1) {
+		sg_list = kmalloc((qp->s_max_sge - 1) * sizeof(*sg_list),
+				  GFP_ATOMIC);
+		if (!sg_list)
+			return -ENOMEM;
+	} else
+		sg_list = NULL;
+
+	/* Check the buffer to send. */
+	ss.sg_list = sg_list;
+	ss.sge.mr = NULL;
+	ss.sge.vaddr = NULL;
+	ss.sge.length = 0;
+	ss.sge.sge_length = 0;
+	ss.num_sge = 0;
+	len = 0;
+	for (i = 0; i < wr->num_sge; i++) {
+		/* Check LKEY */
+		if (to_ipd(qp->ibqp.pd)->user && wr->sg_list[i].lkey == 0)
+			return -EINVAL;
+
+		if (wr->sg_list[i].length == 0)
+			continue;
+		if (!ipath_lkey_ok(&dev->lk_table, ss.num_sge ?
+				   sg_list + ss.num_sge : &ss.sge,
+				   &wr->sg_list[i], 0)) {
+			return -EINVAL;
+		}
+		len += wr->sg_list[i].length;
+		ss.num_sge++;
+	}
+	extra_bytes = (4 - len) & 3;
+	nwords = (len + extra_bytes) >> 2;
+
+	/* Construct the header. */
+	ah_attr = &to_iah(wr->wr.ud.ah)->attr;
+	if (ah_attr->dlid >= 0xC000 && ah_attr->dlid < 0xFFFF)
+		dev->n_multicast_xmit++;
+	else
+		dev->n_unicast_xmit++;
+	if (unlikely(ah_attr->dlid == ipath_layer_get_lid(dev->ib_unit))) {
+		/* Pass in an uninitialized ib_wc to save stack space. */
+		ipath_ud_loopback(qp, &ss, len, wr, &wc);
+		goto done;
+	}
+	if (ah_attr->ah_flags & IB_AH_GRH) {
+		/* Header size in 32-bit words. */
+		hwords = 17;
+		lrh0 = IPS_LRH_GRH;
+		ohdr = &qp->s_hdr.u.l.oth;
+		qp->s_hdr.u.l.grh.version_tclass_flow =
+			cpu_to_be32((6 << 28) |
+				    (ah_attr->grh.traffic_class << 20) |
+				    ah_attr->grh.flow_label);
+		qp->s_hdr.u.l.grh.paylen =
+			cpu_to_be16(((wr->opcode ==
+				      IB_WR_SEND_WITH_IMM ? 6 : 5) + nwords +
+				     SIZE_OF_CRC) << 2);
+		qp->s_hdr.u.l.grh.next_hdr = 0x1B;
+		qp->s_hdr.u.l.grh.hop_limit = ah_attr->grh.hop_limit;
+		/* The SGID is 32-bit aligned. */
+		qp->s_hdr.u.l.grh.sgid.global.subnet_prefix = dev->gid_prefix;
+		qp->s_hdr.u.l.grh.sgid.global.interface_id =
+			ipath_layer_get_guid(dev->ib_unit);
+		qp->s_hdr.u.l.grh.dgid = ah_attr->grh.dgid;
+		/*
+		 * Don't worry about sending to locally attached
+		 * multicast QPs.  It is unspecified by the spec what happens.
+		 */
+	} else {
+		/* Header size in 32-bit words. */
+		hwords = 7;
+		lrh0 = IPS_LRH_BTH;
+		ohdr = &qp->s_hdr.u.oth;
+	}
+	if (wr->opcode == IB_WR_SEND_WITH_IMM) {
+		ohdr->u.ud.imm_data = wr->imm_data;
+		wc.imm_data = wr->imm_data;
+		hwords += 1;
+		bth0 = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE << 24;
+	} else if (wr->opcode == IB_WR_SEND) {
+		wc.imm_data = 0;
+		bth0 = IB_OPCODE_UD_SEND_ONLY << 24;
+	} else
+		return -EINVAL;
+	lrh0 |= ah_attr->sl << 4;
+	if (qp->ibqp.qp_type == IB_QPT_SMI)
+		lrh0 |= 0xF000;	/* Set VL */
+	qp->s_hdr.lrh[0] = cpu_to_be16(lrh0);
+	qp->s_hdr.lrh[1] = cpu_to_be16(ah_attr->dlid);	/* DEST LID */
+	qp->s_hdr.lrh[2] = cpu_to_be16(hwords + nwords + SIZE_OF_CRC);
+	lid = ipath_layer_get_lid(dev->ib_unit);
+	qp->s_hdr.lrh[3] = lid ? cpu_to_be16(lid) : IB_LID_PERMISSIVE;
+	if (wr->send_flags & IB_SEND_SOLICITED)
+		bth0 |= 1 << 23;
+	bth0 |= extra_bytes << 20;
+	bth0 |= qp->ibqp.qp_type == IB_QPT_SMI ? IPS_DEFAULT_P_KEY :
+		ipath_layer_get_pkey(dev->ib_unit, qp->s_pkey_index);
+	ohdr->bth[0] = cpu_to_be32(bth0);
+	ohdr->bth[1] = cpu_to_be32(wr->wr.ud.remote_qpn);
+	/* XXX Could lose a PSN count but not worth locking */
+	ohdr->bth[2] = cpu_to_be32(qp->s_psn++ & 0xFFFFFF);
+	/*
+	 * Qkeys with the high order bit set mean use the
+	 * qkey from the QP context instead of the WR (see 10.2.5).
+	 */
+	ohdr->u.ud.deth[0] = cpu_to_be32((int)wr->wr.ud.remote_qkey < 0 ?
+					 qp->qkey : wr->wr.ud.remote_qkey);
+	ohdr->u.ud.deth[1] = cpu_to_be32(qp->ibqp.qp_num);
+	if (ipath_verbs_send(dev->ib_unit, hwords, (uint32_t *) &qp->s_hdr,
+			     len, &ss))
+		dev->n_no_piobuf++;
+
+done:
+	/* Queue the completion status entry. */
+	if (!test_bit(IPATH_S_SIGNAL_REQ_WR, &qp->s_flags) ||
+	    (wr->send_flags & IB_SEND_SIGNALED)) {
+		wc.wr_id = wr->wr_id;
+		wc.status = IB_WC_SUCCESS;
+		wc.vendor_err = 0;
+		wc.opcode = IB_WC_SEND;
+		wc.byte_len = len;
+		wc.qp_num = qp->ibqp.qp_num;
+		wc.src_qp = 0;
+		wc.wc_flags = 0;
+		/* XXX initialize other fields? */
+		ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, 0);
+	}
+	kfree(sg_list);
+
+	return 0;
+}
+
+/*
+ * This may be called from interrupt context.
+ */
+static int ipath_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
+			   struct ib_send_wr **bad_wr)
+{
+	struct ipath_qp *qp = to_iqp(ibqp);
+	int err = 0;
+
+	/* Check that state is OK to post send. */
+	if (!(state_ops[qp->state] & IPATH_POST_SEND_OK)) {
+		*bad_wr = wr;
+		return -EINVAL;
+	}
+
+	for (; wr; wr = wr->next) {
+		switch (qp->ibqp.qp_type) {
+		case IB_QPT_UC:
+		case IB_QPT_RC:
+			err = ipath_post_rc_send(qp, wr);
+			break;
+
+		case IB_QPT_SMI:
+		case IB_QPT_GSI:
+		case IB_QPT_UD:
+			err = ipath_post_ud_send(qp, wr);
+			break;
+
+		default:
+			err = -EINVAL;
+		}
+		if (err) {
+			*bad_wr = wr;
+			break;
+		}
+	}
+	return err;
+}
+
+/*
+ * This may be called from interrupt context.
+ */
+static int ipath_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr,
+			      struct ib_recv_wr **bad_wr)
+{
+	struct ipath_qp *qp = to_iqp(ibqp);
+	unsigned long flags;
+
+	/* Check that state is OK to post receive. */
+	if (!(state_ops[qp->state] & IPATH_POST_RECV_OK)) {
+		*bad_wr = wr;
+		return -EINVAL;
+	}
+
+	for (; wr; wr = wr->next) {
+		struct ipath_rwqe *wqe;
+		u32 next;
+		int i, j;
+
+		if (wr->num_sge > qp->r_rq.max_sge) {
+			*bad_wr = wr;
+			return -ENOMEM;
+		}
+
+		spin_lock_irqsave(&qp->r_rq.lock, flags);
+		next = qp->r_rq.head + 1;
+		if (next >= qp->r_rq.size)
+			next = 0;
+		if (next == qp->r_rq.tail) {
+			spin_unlock_irqrestore(&qp->r_rq.lock, flags);
+			*bad_wr = wr;
+			return -ENOMEM;
+		}
+
+		wqe = get_rwqe_ptr(&qp->r_rq, qp->r_rq.head);
+		wqe->wr_id = wr->wr_id;
+		wqe->sg_list[0].mr = NULL;
+		wqe->sg_list[0].vaddr = NULL;
+		wqe->sg_list[0].length = 0;
+		wqe->sg_list[0].sge_length = 0;
+		wqe->length = 0;
+		for (i = 0, j = 0; i < wr->num_sge; i++) {
+			/* Check LKEY */
+			if (to_ipd(qp->ibqp.pd)->user &&
+			    wr->sg_list[i].lkey == 0) {
+				spin_unlock_irqrestore(&qp->r_rq.lock, flags);
+				*bad_wr = wr;
+				return -EINVAL;
+			}
+			if (wr->sg_list[i].length == 0)
+				continue;
+			if (!ipath_lkey_ok(&to_idev(qp->ibqp.device)->lk_table,
+					   &wqe->sg_list[j], &wr->sg_list[i],
+					   IB_ACCESS_LOCAL_WRITE)) {
+				spin_unlock_irqrestore(&qp->r_rq.lock, flags);
+				*bad_wr = wr;
+				return -EINVAL;
+			}
+			wqe->length += wr->sg_list[i].length;
+			j++;
+		}
+		wqe->num_sge = j;
+		qp->r_rq.head = next;
+		spin_unlock_irqrestore(&qp->r_rq.lock, flags);
+	}
+	return 0;
+}
+
+/*
+ * This may be called from interrupt context.
+ */ +static int ipath_post_srq_receive(struct ib_srq *ibsrq, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + struct ipath_srq *srq = to_isrq(ibsrq); + struct ipath_ibdev *dev = to_idev(ibsrq->device); + unsigned long flags; + + for (; wr; wr = wr->next) { + struct ipath_rwqe *wqe; + u32 next; + int i, j; + + if (wr->num_sge > srq->rq.max_sge) { + *bad_wr = wr; + return -ENOMEM; + } + + spin_lock_irqsave(&srq->rq.lock, flags); + next = srq->rq.head + 1; + if (next >= srq->rq.size) + next = 0; + if (next == srq->rq.tail) { + spin_unlock_irqrestore(&srq->rq.lock, flags); + *bad_wr = wr; + return -ENOMEM; + } + + wqe = get_rwqe_ptr(&srq->rq, srq->rq.head); + wqe->wr_id = wr->wr_id; + wqe->sg_list[0].mr = NULL; + wqe->sg_list[0].vaddr = NULL; + wqe->sg_list[0].length = 0; + wqe->sg_list[0].sge_length = 0; + wqe->length = 0; + for (i = 0, j = 0; i < wr->num_sge; i++) { + /* Check LKEY */ + if (to_ipd(srq->ibsrq.pd)->user && + wr->sg_list[i].lkey == 0) { + spin_unlock_irqrestore(&srq->rq.lock, flags); + *bad_wr = wr; + return -EINVAL; + } + if (wr->sg_list[i].length == 0) + continue; + if (!ipath_lkey_ok(&dev->lk_table, + &wqe->sg_list[j], &wr->sg_list[i], + IB_ACCESS_LOCAL_WRITE)) { + spin_unlock_irqrestore(&srq->rq.lock, flags); + *bad_wr = wr; + return -EINVAL; + } + wqe->length += wr->sg_list[i].length; + j++; + } + wqe->num_sge = j; + srq->rq.head = next; + spin_unlock_irqrestore(&srq->rq.lock, flags); + } + return 0; +} + +/* + * This is called from ipath_qp_rcv() to process an incoming UD packet + * for the given QP. + * Called at interrupt level. + */ +static void ipath_ud_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, + int has_grh, void *data, u32 tlen, struct ipath_qp *qp) +{ + struct ipath_other_headers *ohdr; + int opcode; + u32 hdrsize; + u32 pad; + unsigned long flags; + struct ib_wc wc; + u32 qkey; + u32 src_qp; + struct ipath_rq *rq; + struct ipath_srq *srq; + struct ipath_rwqe *wqe; + + /* Check for GRH */ + if (!has_grh) { + ohdr = &hdr->u.oth; + hdrsize = 8 + 12 + 8; /* LRH + BTH + DETH */ + qkey = be32_to_cpu(ohdr->u.ud.deth[0]); + src_qp = be32_to_cpu(ohdr->u.ud.deth[1]); + } else { + ohdr = &hdr->u.l.oth; + hdrsize = 8 + 40 + 12 + 8; /* LRH + GRH + BTH + DETH */ + /* + * The header with GRH is 68 bytes and the + * core driver sets the eager header buffer + * size to 56 bytes so the last 12 bytes of + * the IB header are in the data buffer. + */ + qkey = be32_to_cpu(((u32 *) data)[1]); + src_qp = be32_to_cpu(((u32 *) data)[2]); + data += 12; + } + src_qp &= 0xFFFFFF; + + /* Check that the qkey matches (except for QP0, see 9.6.1.4.1). */ + if (unlikely(qp->ibqp.qp_num && qkey != qp->qkey)) { + /* XXX OK to lose a count once in a while. */ + dev->qkey_violations++; + dev->n_pkt_drops++; + return; + } + + /* Get the number of bytes the message was padded by. */ + pad = (ohdr->bth[0] >> 12) & 3; + if (unlikely(tlen < (hdrsize + pad + 4))) { + /* Drop incomplete packets. */ + dev->n_pkt_drops++; + return; + } + + /* + * A GRH is expected to precede the data even if not + * present on the wire. + */ + wc.byte_len = tlen - (hdrsize + pad + 4) + sizeof(struct ib_grh); + + /* + * The opcode is in the low byte when it's in network order + * (top byte when in host order).
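+ *
+ * (Editor's note, illustrative only: BTH dword 0 is big-endian with
+ * the opcode in bits 31:24, so reading the first byte in host memory
+ * below is equivalent to, but cheaper than,
+ *
+ *	opcode = be32_to_cpu(ohdr->bth[0]) >> 24;
+ * )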
+ */ + opcode = *(u8 *) (&ohdr->bth[0]); + if (opcode == IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE) { + if (has_grh) { + wc.imm_data = *(u32 *) data; + data += sizeof(u32); + } else + wc.imm_data = ohdr->u.ud.imm_data; + wc.wc_flags = IB_WC_WITH_IMM; + hdrsize += sizeof(u32); + } else if (opcode == IB_OPCODE_UD_SEND_ONLY) { + wc.imm_data = 0; + wc.wc_flags = 0; + } else { + dev->n_pkt_drops++; + return; + } + + /* + * Get the next work request entry to find where to put the data. + * Note that it is safe to drop the lock after changing rq->tail + * since ipath_post_receive() won't fill the empty slot. + */ + if (qp->ibqp.srq) { + srq = to_isrq(qp->ibqp.srq); + rq = &srq->rq; + } else { + srq = NULL; + rq = &qp->r_rq; + } + spin_lock_irqsave(&rq->lock, flags); + if (rq->tail == rq->head) { + spin_unlock_irqrestore(&rq->lock, flags); + dev->n_pkt_drops++; + return; + } + /* Silently drop packets which are too big. */ + wqe = get_rwqe_ptr(rq, rq->tail); + if (wc.byte_len > wqe->length) { + spin_unlock_irqrestore(&rq->lock, flags); + dev->n_pkt_drops++; + return; + } + wc.wr_id = wqe->wr_id; + qp->r_sge.sge = wqe->sg_list[0]; + qp->r_sge.sg_list = wqe->sg_list + 1; + qp->r_sge.num_sge = wqe->num_sge; + if (++rq->tail >= rq->size) + rq->tail = 0; + if (srq && srq->ibsrq.event_handler) { + u32 n; + + if (rq->head < rq->tail) + n = rq->size + rq->head - rq->tail; + else + n = rq->head - rq->tail; + if (n < srq->limit) { + struct ib_event ev; + + srq->limit = 0; + spin_unlock_irqrestore(&rq->lock, flags); + ev.device = qp->ibqp.device; + ev.element.srq = qp->ibqp.srq; + ev.event = IB_EVENT_SRQ_LIMIT_REACHED; + srq->ibsrq.event_handler(&ev, srq->ibsrq.srq_context); + } else + spin_unlock_irqrestore(&rq->lock, flags); + } else + spin_unlock_irqrestore(&rq->lock, flags); + if (has_grh) { + copy_sge(&qp->r_sge, &hdr->u.l.grh, sizeof(struct ib_grh)); + wc.wc_flags |= IB_WC_GRH; + } else + skip_sge(&qp->r_sge, sizeof(struct ib_grh)); + copy_sge(&qp->r_sge, data, wc.byte_len - sizeof(struct ib_grh)); + wc.status = IB_WC_SUCCESS; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.qp_num = qp->ibqp.qp_num; + wc.src_qp = src_qp; + /* XXX do we know which pkey matched? Only needed for GSI. */ + wc.pkey_index = 0; + wc.slid = be16_to_cpu(hdr->lrh[3]); + wc.sl = (be16_to_cpu(hdr->lrh[0]) >> 4) & 0xF; + wc.dlid_path_bits = 0; + /* Signal completion event if the solicited bit is set. */ + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, + ohdr->bth[0] & __constant_cpu_to_be32(1 << 23)); +} + +/* + * This is called from ipath_post_ud_send() to forward a WQE addressed + * to the same HCA. + */ +static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_sge_state *ss, + u32 length, struct ib_send_wr *wr, + struct ib_wc *wc) +{ + struct ipath_ibdev *dev = to_idev(sqp->ibqp.device); + struct ipath_qp *qp; + struct ib_ah_attr *ah_attr; + unsigned long flags; + struct ipath_rq *rq; + struct ipath_srq *srq; + struct ipath_sge_state rsge; + struct ipath_sge *sge; + struct ipath_rwqe *wqe; + + qp = ipath_lookup_qpn(&dev->qp_table, wr->wr.ud.remote_qpn); + if (!qp) + return; + + /* + * Check that the qkey matches (except for QP0, see 9.6.1.4.1). + * Qkeys with the high order bit set mean use the + * qkey from the QP context instead of the WR (see 10.2.5). + */ + if (unlikely(qp->ibqp.qp_num && ((int)wr->wr.ud.remote_qkey < 0 ? + qp->qkey : wr->wr.ud.remote_qkey) != qp->qkey)) { + /* XXX OK to lose a count once in a while. 
*/ + dev->qkey_violations++; + dev->n_pkt_drops++; + goto done; + } + + /* + * A GRH is expected to preceed the data even if not + * present on the wire. + */ + wc->byte_len = length + sizeof(struct ib_grh); + + if (wr->opcode == IB_WR_SEND_WITH_IMM) { + wc->wc_flags = IB_WC_WITH_IMM; + wc->imm_data = wr->imm_data; + } else { + wc->wc_flags = 0; + wc->imm_data = 0; + } + + /* + * Get the next work request entry to find where to put the data. + * Note that it is safe to drop the lock after changing rq->tail + * since ipath_post_receive() won't fill the empty slot. + */ + if (qp->ibqp.srq) { + srq = to_isrq(qp->ibqp.srq); + rq = &srq->rq; + } else { + srq = NULL; + rq = &qp->r_rq; + } + spin_lock_irqsave(&rq->lock, flags); + if (rq->tail == rq->head) { + spin_unlock_irqrestore(&rq->lock, flags); + dev->n_pkt_drops++; + goto done; + } + /* Silently drop packets which are too big. */ + wqe = get_rwqe_ptr(rq, rq->tail); + if (wc->byte_len > wqe->length) { + spin_unlock_irqrestore(&rq->lock, flags); + dev->n_pkt_drops++; + goto done; + } + wc->wr_id = wqe->wr_id; + rsge.sge = wqe->sg_list[0]; + rsge.sg_list = wqe->sg_list + 1; + rsge.num_sge = wqe->num_sge; + if (++rq->tail >= rq->size) + rq->tail = 0; + if (srq && srq->ibsrq.event_handler) { + u32 n; + + if (rq->head < rq->tail) + n = rq->size + rq->head - rq->tail; + else + n = rq->head - rq->tail; + if (n < srq->limit) { + struct ib_event ev; + + srq->limit = 0; + spin_unlock_irqrestore(&rq->lock, flags); + ev.device = qp->ibqp.device; + ev.element.srq = qp->ibqp.srq; + ev.event = IB_EVENT_SRQ_LIMIT_REACHED; + srq->ibsrq.event_handler(&ev, srq->ibsrq.srq_context); + } else + spin_unlock_irqrestore(&rq->lock, flags); + } else + spin_unlock_irqrestore(&rq->lock, flags); + ah_attr = &to_iah(wr->wr.ud.ah)->attr; + if (ah_attr->ah_flags & IB_AH_GRH) { + copy_sge(&rsge, &ah_attr->grh, sizeof(struct ib_grh)); + wc->wc_flags |= IB_WC_GRH; + } else + skip_sge(&rsge, sizeof(struct ib_grh)); + sge = &ss->sge; + while (length) { + u32 len = sge->length; + + if (len > length) + len = length; + BUG_ON(len == 0); + copy_sge(&rsge, sge->vaddr, len); + sge->vaddr += len; + sge->length -= len; + sge->sge_length -= len; + if (sge->sge_length == 0) { + if (--ss->num_sge) + *sge = *ss->sg_list++; + } else if (sge->length == 0 && sge->mr != NULL) { + if (++sge->n >= IPATH_SEGSZ) { + if (++sge->m >= sge->mr->mapsz) + break; + sge->n = 0; + } + sge->vaddr = sge->mr->map[sge->m]->segs[sge->n].vaddr; + sge->length = sge->mr->map[sge->m]->segs[sge->n].length; + } + length -= len; + } + wc->status = IB_WC_SUCCESS; + wc->opcode = IB_WC_RECV; + wc->vendor_err = 0; + wc->qp_num = qp->ibqp.qp_num; + wc->src_qp = sqp->ibqp.qp_num; + /* XXX do we know which pkey matched? Only needed for GSI. */ + wc->pkey_index = 0; + wc->slid = ipath_layer_get_lid(dev->ib_unit); + wc->sl = ah_attr->sl; + wc->dlid_path_bits = 0; + /* Signal completion event if the solicited bit is set. */ + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), wc, + wr->send_flags & IB_SEND_SOLICITED); + +done: + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); +} + +/* + * Copy the next RWQE into the QP's RWQE. + * Return zero if no RWQE is available. + * Called at interrupt level with the QP r_rq.lock held. 
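+ *
+ * (Editor's note: for the SRQ case below, the number of RWQEs still
+ * queued is the ring occupancy,
+ *
+ *	n = (head - tail + size) % size;
+ *
+ * written out with an explicit wrap test in the code; once n drops
+ * below srq->limit an IB_EVENT_SRQ_LIMIT_REACHED event is delivered
+ * and the limit is disarmed by zeroing it.)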
+ */ +static int get_rwqe(struct ipath_qp *qp, int wr_id_only) +{ + struct ipath_rq *rq; + struct ipath_srq *srq; + struct ipath_rwqe *wqe; + + if (!qp->ibqp.srq) { + rq = &qp->r_rq; + if (unlikely(rq->tail == rq->head)) + return 0; + wqe = get_rwqe_ptr(rq, rq->tail); + qp->r_wr_id = wqe->wr_id; + if (!wr_id_only) { + qp->r_sge.sge = wqe->sg_list[0]; + qp->r_sge.sg_list = wqe->sg_list + 1; + qp->r_sge.num_sge = wqe->num_sge; + qp->r_len = wqe->length; + } + if (++rq->tail >= rq->size) + rq->tail = 0; + return 1; + } + + srq = to_isrq(qp->ibqp.srq); + rq = &srq->rq; + spin_lock(&rq->lock); + if (unlikely(rq->tail == rq->head)) { + spin_unlock(&rq->lock); + return 0; + } + wqe = get_rwqe_ptr(rq, rq->tail); + qp->r_wr_id = wqe->wr_id; + if (!wr_id_only) { + qp->r_sge.sge = wqe->sg_list[0]; + qp->r_sge.sg_list = wqe->sg_list + 1; + qp->r_sge.num_sge = wqe->num_sge; + qp->r_len = wqe->length; + } + if (++rq->tail >= rq->size) + rq->tail = 0; + if (srq->ibsrq.event_handler) { + struct ib_event ev; + u32 n; + + if (rq->head < rq->tail) + n = rq->size + rq->head - rq->tail; + else + n = rq->head - rq->tail; + if (n < srq->limit) { + srq->limit = 0; + spin_unlock(&rq->lock); + ev.device = qp->ibqp.device; + ev.element.srq = qp->ibqp.srq; + ev.event = IB_EVENT_SRQ_LIMIT_REACHED; + srq->ibsrq.event_handler(&ev, srq->ibsrq.srq_context); + } else + spin_unlock(&rq->lock); + } else + spin_unlock(&rq->lock); + return 1; +} + +/* + * This is called from ipath_qp_rcv() to process an incoming UC packet + * for the given QP. + * Called at interrupt level. + */ +static void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, + int has_grh, void *data, u32 tlen, struct ipath_qp *qp) +{ + struct ipath_other_headers *ohdr; + int opcode; + u32 hdrsize; + u32 psn; + u32 pad; + unsigned long flags; + struct ib_wc wc; + u32 pmtu = ib_mtu_enum_to_int(qp->path_mtu); + struct ib_reth *reth; + + /* Check for GRH */ + if (!has_grh) { + ohdr = &hdr->u.oth; + hdrsize = 8 + 12; /* LRH + BTH */ + psn = be32_to_cpu(ohdr->bth[2]); + } else { + ohdr = &hdr->u.l.oth; + hdrsize = 8 + 40 + 12; /* LRH + GRH + BTH */ + /* + * The header with GRH is 60 bytes and the + * core driver sets the eager header buffer + * size to 56 bytes so the last 4 bytes of + * the BTH header (PSN) are in the data buffer. + */ + psn = be32_to_cpu(((u32 *) data)[0]); + data += sizeof(u32); + } + /* + * The opcode is in the low byte when it's in network order + * (top byte when in host order). + */ + opcode = *(u8 *) (&ohdr->bth[0]); + + wc.imm_data = 0; + wc.wc_flags = 0; + + spin_lock_irqsave(&qp->r_rq.lock, flags); + + /* Compare the PSN against the expected PSN. */ + if (unlikely(cmp24(psn, qp->r_psn) != 0)) { + /* + * Handle a sequence error. + * Silently drop any current message. + */ + qp->r_psn = psn; + inv: + qp->r_state = IB_OPCODE_UC_SEND_LAST; + switch (opcode) { + case IB_OPCODE_UC_SEND_FIRST: + case IB_OPCODE_UC_SEND_ONLY: + case IB_OPCODE_UC_SEND_ONLY_WITH_IMMEDIATE: + goto send_first; + + case IB_OPCODE_UC_RDMA_WRITE_FIRST: + case IB_OPCODE_UC_RDMA_WRITE_ONLY: + case IB_OPCODE_UC_RDMA_WRITE_ONLY_WITH_IMMEDIATE: + goto rdma_first; + + default: + dev->n_pkt_drops++; + goto done; + } + } + + /* Check for opcode sequence errors.
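+ *
+ * (Editor's note: a UC message must arrive as FIRST, zero or more
+ * MIDDLE packets, then LAST, or as a single ONLY packet.  r_state
+ * remembers the previously accepted opcode, so after e.g. SEND_FIRST
+ * only SEND_MIDDLE or SEND_LAST[_WITH_IMMEDIATE] is legal; anything
+ * else jumps back to the "inv" label above and resynchronizes.)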
*/ + switch (qp->r_state) { + case IB_OPCODE_UC_SEND_FIRST: + case IB_OPCODE_UC_SEND_MIDDLE: + if (opcode == IB_OPCODE_UC_SEND_MIDDLE || + opcode == IB_OPCODE_UC_SEND_LAST || + opcode == IB_OPCODE_UC_SEND_LAST_WITH_IMMEDIATE) + break; + goto inv; + + case IB_OPCODE_UC_RDMA_WRITE_FIRST: + case IB_OPCODE_UC_RDMA_WRITE_MIDDLE: + if (opcode == IB_OPCODE_UC_RDMA_WRITE_MIDDLE || + opcode == IB_OPCODE_UC_RDMA_WRITE_LAST || + opcode == IB_OPCODE_UC_RDMA_WRITE_LAST_WITH_IMMEDIATE) + break; + goto inv; + + default: + if (opcode == IB_OPCODE_UC_SEND_FIRST || + opcode == IB_OPCODE_UC_SEND_ONLY || + opcode == IB_OPCODE_UC_SEND_ONLY_WITH_IMMEDIATE || + opcode == IB_OPCODE_UC_RDMA_WRITE_FIRST || + opcode == IB_OPCODE_UC_RDMA_WRITE_ONLY || + opcode == IB_OPCODE_UC_RDMA_WRITE_ONLY_WITH_IMMEDIATE) + break; + goto inv; + } + + /* OK, process the packet. */ + switch (opcode) { + case IB_OPCODE_UC_SEND_FIRST: + case IB_OPCODE_UC_SEND_ONLY: + case IB_OPCODE_UC_SEND_ONLY_WITH_IMMEDIATE: + send_first: + if (qp->r_reuse_sge) { + qp->r_reuse_sge = 0; + qp->r_sge = qp->s_rdma_sge; + } else if (!get_rwqe(qp, 0)) { + dev->n_pkt_drops++; + goto done; + } + /* Save the WQE so we can reuse it in case of an error. */ + qp->s_rdma_sge = qp->r_sge; + qp->r_rcv_len = 0; + if (opcode == IB_OPCODE_UC_SEND_ONLY) + goto send_last; + else if (opcode == IB_OPCODE_UC_SEND_ONLY_WITH_IMMEDIATE) + goto send_last_imm; + /* FALLTHROUGH */ + case IB_OPCODE_UC_SEND_MIDDLE: + /* Check for invalid length PMTU or posted rwqe len. */ + if (unlikely(tlen != (hdrsize + pmtu + 4))) { + qp->r_reuse_sge = 1; + dev->n_pkt_drops++; + goto done; + } + qp->r_rcv_len += pmtu; + if (unlikely(qp->r_rcv_len > qp->r_len)) { + qp->r_reuse_sge = 1; + dev->n_pkt_drops++; + goto done; + } + copy_sge(&qp->r_sge, data, pmtu); + break; + + case IB_OPCODE_UC_SEND_LAST_WITH_IMMEDIATE: + send_last_imm: + if (has_grh) { + wc.imm_data = *(u32 *) data; + data += sizeof(u32); + } else { + /* Immediate data comes after BTH */ + wc.imm_data = ohdr->u.imm_data; + } + hdrsize += 4; + wc.wc_flags = IB_WC_WITH_IMM; + /* FALLTHROUGH */ + case IB_OPCODE_UC_SEND_LAST: + send_last: + /* Get the number of bytes the message was padded by. */ + pad = (ohdr->bth[0] >> 12) & 3; + /* Check for invalid length. */ + /* XXX LAST len should be >= 1 */ + if (unlikely(tlen < (hdrsize + pad + 4))) { + qp->r_reuse_sge = 1; + dev->n_pkt_drops++; + goto done; + } + /* Don't count the CRC. */ + tlen -= (hdrsize + pad + 4); + wc.byte_len = tlen + qp->r_rcv_len; + if (unlikely(wc.byte_len > qp->r_len)) { + qp->r_reuse_sge = 1; + dev->n_pkt_drops++; + goto done; + } + /* XXX Need to free SGEs */ + last_imm: + copy_sge(&qp->r_sge, data, tlen); + wc.wr_id = qp->r_wr_id; + wc.status = IB_WC_SUCCESS; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.qp_num = qp->ibqp.qp_num; + wc.src_qp = qp->remote_qpn; + wc.pkey_index = 0; + wc.slid = qp->remote_ah_attr.dlid; + wc.sl = qp->remote_ah_attr.sl; + wc.dlid_path_bits = 0; + wc.port_num = 0; + /* Signal completion event if the solicited bit is set. 
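+ *
+ * (Editor's note: bit 23 of BTH dword 0 is the Solicited Event flag;
+ * masking the still-big-endian word, as below, tests it without
+ * byte-swapping the header:
+ *
+ *	se = ohdr->bth[0] & __constant_cpu_to_be32(1 << 23);
+ * )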
*/ + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, + ohdr->bth[0] & __constant_cpu_to_be32(1 << 23)); + break; + + case IB_OPCODE_UC_RDMA_WRITE_FIRST: + case IB_OPCODE_UC_RDMA_WRITE_ONLY: + case IB_OPCODE_UC_RDMA_WRITE_ONLY_WITH_IMMEDIATE: /* consume RWQE */ + rdma_first: + /* RETH comes after BTH */ + if (!has_grh) + reth = &ohdr->u.rc.reth; + else { + reth = (struct ib_reth *)data; + data += sizeof(*reth); + } + hdrsize += sizeof(*reth); + qp->r_len = be32_to_cpu(reth->length); + qp->r_rcv_len = 0; + if (qp->r_len != 0) { + u32 rkey = be32_to_cpu(reth->rkey); + u64 vaddr = be64_to_cpu(reth->vaddr); + + /* Check rkey */ + if (unlikely(!ipath_rkey_ok(dev, &qp->r_sge, qp->r_len, + vaddr, rkey, + IB_ACCESS_REMOTE_WRITE))) { + dev->n_pkt_drops++; + goto done; + } + } else { + qp->r_sge.sg_list = NULL; + qp->r_sge.sge.mr = NULL; + qp->r_sge.sge.vaddr = NULL; + qp->r_sge.sge.length = 0; + qp->r_sge.sge.sge_length = 0; + } + if (unlikely(!(qp->qp_access_flags & IB_ACCESS_REMOTE_WRITE))) { + dev->n_pkt_drops++; + goto done; + } + if (opcode == IB_OPCODE_UC_RDMA_WRITE_ONLY) + goto rdma_last; + else if (opcode == IB_OPCODE_UC_RDMA_WRITE_ONLY_WITH_IMMEDIATE) + goto rdma_last_imm; + /* FALLTHROUGH */ + case IB_OPCODE_UC_RDMA_WRITE_MIDDLE: + /* Check for invalid length PMTU or posted rwqe len. */ + if (unlikely(tlen != (hdrsize + pmtu + 4))) { + dev->n_pkt_drops++; + goto done; + } + qp->r_rcv_len += pmtu; + if (unlikely(qp->r_rcv_len > qp->r_len)) { + dev->n_pkt_drops++; + goto done; + } + copy_sge(&qp->r_sge, data, pmtu); + break; + + case IB_OPCODE_UC_RDMA_WRITE_LAST_WITH_IMMEDIATE: + rdma_last_imm: + /* Get the number of bytes the message was padded by. */ + pad = (ohdr->bth[0] >> 12) & 3; + /* Check for invalid length. */ + /* XXX LAST len should be >= 1 */ + if (unlikely(tlen < (hdrsize + pad + 4))) { + dev->n_pkt_drops++; + goto done; + } + /* Don't count the CRC. */ + tlen -= (hdrsize + pad + 4); + if (unlikely(tlen + qp->r_rcv_len != qp->r_len)) { + dev->n_pkt_drops++; + goto done; + } + if (qp->r_reuse_sge) { + qp->r_reuse_sge = 0; + } else if (!get_rwqe(qp, 1)) { + dev->n_pkt_drops++; + goto done; + } + if (has_grh) { + wc.imm_data = *(u32 *) data; + data += sizeof(u32); + } else { + /* Immediate data comes after BTH */ + wc.imm_data = ohdr->u.imm_data; + } + hdrsize += 4; + wc.wc_flags = IB_WC_WITH_IMM; + wc.byte_len = 0; + goto last_imm; + + case IB_OPCODE_UC_RDMA_WRITE_LAST: + rdma_last: + /* Get the number of bytes the message was padded by. */ + pad = (ohdr->bth[0] >> 12) & 3; + /* Check for invalid length. */ + /* XXX LAST len should be >= 1 */ + if (unlikely(tlen < (hdrsize + pad + 4))) { + dev->n_pkt_drops++; + goto done; + } + /* Don't count the CRC. */ + tlen -= (hdrsize + pad + 4); + if (unlikely(tlen + qp->r_rcv_len != qp->r_len)) { + dev->n_pkt_drops++; + goto done; + } + copy_sge(&qp->r_sge, data, tlen); + break; + + default: + /* Drop packet for unknown opcodes. */ + spin_unlock_irqrestore(&qp->r_rq.lock, flags); + dev->n_pkt_drops++; + return; + } + qp->r_psn++; + qp->r_state = opcode; +done: + spin_unlock_irqrestore(&qp->r_rq.lock, flags); +} + +/* + * Put this QP on the RNR timeout list for the device. + * XXX Use a simple list for now. We might need a priority + * queue if we have lots of QPs waiting for RNR timeouts + * but that should be rare. 
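+ *
+ * (Editor's note: the timeouts on this list are delta-encoded; a new
+ * entry has the timeouts of the entries it queues behind subtracted
+ * from it, so that, presumably, only the head entry needs to be
+ * decremented per timer tick.  E.g. queueing QPs with timeouts 3 and
+ * then 9 stores 3 and 6, since 9 = 3 + 6.)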
+ */ +static void insert_rnr_queue(struct ipath_qp *qp) +{ + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + unsigned long flags; + + spin_lock_irqsave(&dev->pending_lock, flags); + if (list_empty(&dev->rnrwait)) + list_add(&qp->timerwait, &dev->rnrwait); + else { + struct list_head *l = &dev->rnrwait; + struct ipath_qp *nqp = list_entry(l->next, struct ipath_qp, + timerwait); + + while (qp->s_rnr_timeout >= nqp->s_rnr_timeout) { + qp->s_rnr_timeout -= nqp->s_rnr_timeout; + l = l->next; + if (l->next == &dev->rnrwait) + break; + nqp = list_entry(l->next, struct ipath_qp, timerwait); + } + list_add(&qp->timerwait, l); + } + spin_unlock_irqrestore(&dev->pending_lock, flags); +} + +/* + * This is called from do_uc_send() or do_rc_send() to forward a WQE addressed + * to the same HCA. + * Note that although we are single threaded due to the tasklet, we still + * have to protect against post_send(). We don't have to worry about + * receive interrupts since this is a connected protocol and all packets + * will pass through here. + */ +static void ipath_ruc_loopback(struct ipath_qp *sqp, struct ib_wc *wc) +{ + struct ipath_ibdev *dev = to_idev(sqp->ibqp.device); + struct ipath_qp *qp; + struct ipath_swqe *wqe; + struct ipath_sge *sge; + unsigned long flags; + u64 sdata; + + qp = ipath_lookup_qpn(&dev->qp_table, sqp->remote_qpn); + if (!qp) { + dev->n_pkt_drops++; + return; + } + +again: + spin_lock_irqsave(&sqp->s_lock, flags); + + if (!(state_ops[sqp->state] & IPATH_PROCESS_SEND_OK)) { + spin_unlock_irqrestore(&sqp->s_lock, flags); + goto done; + } + + /* Get the next send request. */ + if (sqp->s_last == sqp->s_head) { + /* Send work queue is empty. */ + spin_unlock_irqrestore(&sqp->s_lock, flags); + goto done; + } + + /* + * We can rely on the entry not changing without the s_lock + * being held until we update s_last. 
+ */ + wqe = get_swqe_ptr(sqp, sqp->s_last); + spin_unlock_irqrestore(&sqp->s_lock, flags); + + wc->wc_flags = 0; + wc->imm_data = 0; + + sqp->s_sge.sge = wqe->sg_list[0]; + sqp->s_sge.sg_list = wqe->sg_list + 1; + sqp->s_sge.num_sge = wqe->wr.num_sge; + sqp->s_len = wqe->length; + switch (wqe->wr.opcode) { + case IB_WR_SEND_WITH_IMM: + wc->wc_flags = IB_WC_WITH_IMM; + wc->imm_data = wqe->wr.imm_data; + /* FALLTHROUGH */ + case IB_WR_SEND: + spin_lock_irqsave(&qp->r_rq.lock, flags); + if (!get_rwqe(qp, 0)) { + rnr_nak: + spin_unlock_irqrestore(&qp->r_rq.lock, flags); + /* Handle RNR NAK */ + if (qp->ibqp.qp_type == IB_QPT_UC) + goto send_comp; + if (sqp->s_rnr_retry == 0) { + wc->status = IB_WC_RNR_RETRY_EXC_ERR; + goto err; + } + if (sqp->s_rnr_retry_cnt < 7) + sqp->s_rnr_retry--; + dev->n_rnr_naks++; + sqp->s_rnr_timeout = rnr_table[sqp->s_min_rnr_timer]; + insert_rnr_queue(sqp); + goto done; + } + spin_unlock_irqrestore(&qp->r_rq.lock, flags); + break; + + case IB_WR_RDMA_WRITE_WITH_IMM: + wc->wc_flags = IB_WC_WITH_IMM; + wc->imm_data = wqe->wr.imm_data; + spin_lock_irqsave(&qp->r_rq.lock, flags); + if (!get_rwqe(qp, 1)) + goto rnr_nak; + spin_unlock_irqrestore(&qp->r_rq.lock, flags); + /* FALLTHROUGH */ + case IB_WR_RDMA_WRITE: + if (wqe->length == 0) + break; + if (unlikely(!ipath_rkey_ok(dev, &qp->r_sge, wqe->length, + wqe->wr.wr.rdma.remote_addr, + wqe->wr.wr.rdma.rkey, + IB_ACCESS_REMOTE_WRITE))) { + acc_err: + wc->status = IB_WC_REM_ACCESS_ERR; + err: + wc->wr_id = wqe->wr.wr_id; + wc->opcode = wc_opcode[wqe->wr.opcode]; + wc->vendor_err = 0; + wc->byte_len = 0; + wc->qp_num = sqp->ibqp.qp_num; + wc->src_qp = sqp->remote_qpn; + wc->pkey_index = 0; + wc->slid = sqp->remote_ah_attr.dlid; + wc->sl = sqp->remote_ah_attr.sl; + wc->dlid_path_bits = 0; + wc->port_num = 0; + ipath_sqerror_qp(sqp, wc); + goto done; + } + break; + + case IB_WR_RDMA_READ: + if (unlikely(!ipath_rkey_ok(dev, &sqp->s_sge, wqe->length, + wqe->wr.wr.rdma.remote_addr, + wqe->wr.wr.rdma.rkey, + IB_ACCESS_REMOTE_READ))) { + goto acc_err; + } + if (unlikely(!(qp->qp_access_flags & IB_ACCESS_REMOTE_READ))) + goto acc_err; + qp->r_sge.sge = wqe->sg_list[0]; + qp->r_sge.sg_list = wqe->sg_list + 1; + qp->r_sge.num_sge = wqe->wr.num_sge; + break; + + case IB_WR_ATOMIC_CMP_AND_SWP: + case IB_WR_ATOMIC_FETCH_AND_ADD: + if (unlikely(!ipath_rkey_ok(dev, &qp->r_sge, sizeof(u64), + wqe->wr.wr.rdma.remote_addr, + wqe->wr.wr.rdma.rkey, + IB_ACCESS_REMOTE_ATOMIC))) { + goto acc_err; + } + /* Perform atomic OP and save result. 
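+ *
+ * (Editor's summary of the two atomics implemented below, where "old"
+ * is returned to the requester in both cases:
+ *
+ *	old = *vaddr;
+ *	FETCH_ADD:	*vaddr = old + operand;		// unconditional
+ *	COMPARE_SWAP:	if (old == compare) *vaddr = swap;
+ * )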
*/ + sdata = wqe->wr.wr.atomic.swap; + spin_lock_irqsave(&dev->pending_lock, flags); + qp->r_atomic_data = *(u64 *) qp->r_sge.sge.vaddr; + if (wqe->wr.opcode == IB_WR_ATOMIC_FETCH_AND_ADD) { + *(u64 *) qp->r_sge.sge.vaddr = + qp->r_atomic_data + sdata; + } else if (qp->r_atomic_data == wqe->wr.wr.atomic.compare_add) { + *(u64 *) qp->r_sge.sge.vaddr = sdata; + } + spin_unlock_irqrestore(&dev->pending_lock, flags); + *(u64 *) sqp->s_sge.sge.vaddr = qp->r_atomic_data; + goto send_comp; + + default: + goto done; + } + + sge = &sqp->s_sge.sge; + while (sqp->s_len) { + u32 len = sqp->s_len; + + if (len > sge->length) + len = sge->length; + BUG_ON(len == 0); + copy_sge(&qp->r_sge, sge->vaddr, len); + sge->vaddr += len; + sge->length -= len; + sge->sge_length -= len; + if (sge->sge_length == 0) { + if (--sqp->s_sge.num_sge) + *sge = *sqp->s_sge.sg_list++; + } else if (sge->length == 0 && sge->mr != NULL) { + if (++sge->n >= IPATH_SEGSZ) { + if (++sge->m >= sge->mr->mapsz) + break; + sge->n = 0; + } + sge->vaddr = sge->mr->map[sge->m]->segs[sge->n].vaddr; + sge->length = sge->mr->map[sge->m]->segs[sge->n].length; + } + sqp->s_len -= len; + } + + if (wqe->wr.opcode == IB_WR_RDMA_WRITE || + wqe->wr.opcode == IB_WR_RDMA_READ) + goto send_comp; + + if (wqe->wr.opcode == IB_WR_RDMA_WRITE_WITH_IMM) + wc->opcode = IB_WC_RECV_RDMA_WITH_IMM; + else + wc->opcode = IB_WC_RECV; + wc->wr_id = qp->r_wr_id; + wc->status = IB_WC_SUCCESS; + wc->vendor_err = 0; + wc->byte_len = wqe->length; + wc->qp_num = qp->ibqp.qp_num; + wc->src_qp = qp->remote_qpn; + /* XXX do we know which pkey matched? Only needed for GSI. */ + wc->pkey_index = 0; + wc->slid = qp->remote_ah_attr.dlid; + wc->sl = qp->remote_ah_attr.sl; + wc->dlid_path_bits = 0; + /* Signal completion event if the solicited bit is set. */ + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), wc, + wqe->wr.send_flags & IB_SEND_SOLICITED); + +send_comp: + sqp->s_rnr_retry = sqp->s_rnr_retry_cnt; + + if (!test_bit(IPATH_S_SIGNAL_REQ_WR, &sqp->s_flags) || + (wqe->wr.send_flags & IB_SEND_SIGNALED)) { + wc->wr_id = wqe->wr.wr_id; + wc->status = IB_WC_SUCCESS; + wc->opcode = wc_opcode[wqe->wr.opcode]; + wc->vendor_err = 0; + wc->byte_len = wqe->length; + wc->qp_num = sqp->ibqp.qp_num; + wc->src_qp = 0; + wc->pkey_index = 0; + wc->slid = 0; + wc->sl = 0; + wc->dlid_path_bits = 0; + wc->port_num = 0; + ipath_cq_enter(to_icq(sqp->ibqp.send_cq), wc, 0); + } + + /* Update s_last now that we are finished with the SWQE */ + spin_lock_irqsave(&sqp->s_lock, flags); + if (++sqp->s_last >= sqp->s_size) + sqp->s_last = 0; + spin_unlock_irqrestore(&sqp->s_lock, flags); + goto again; + +done: + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); +} + +/* + * Process the AETH credit field of an incoming ACK and update the + * limit sequence number (s_lsn). + * The QP s_lock should be held. + */ +static void ipath_get_credit(struct ipath_qp *qp, u32 aeth) +{ + u32 credit = (aeth >> 24) & 0x1F; + + /* + * If credit == 0x1F, credit is invalid and we can send + * as many packets as we like. Otherwise, we have to + * honor the credit field. + */ + if (credit == 0x1F) { + qp->s_lsn = (u32) -1; + } else if (qp->s_lsn != (u32) -1) { + /* Compute new LSN (i.e., MSN + credit) */ + credit = (aeth + credit_table[credit]) & 0xFFFFFF; + if (cmp24(credit, qp->s_lsn) > 0) + qp->s_lsn = credit; + } + + /* Restart sending if it was blocked due to lack of credits.
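+ *
+ * (Editor's note: ipath_get_credit() above converts the 5-bit AETH
+ * credit code into a limit sequence number,
+ *
+ *	s_lsn = (MSN + credit_table[code]) & 0xFFFFFF;
+ *
+ * with code 0x1F meaning "unlimited", stored as s_lsn == (u32) -1;
+ * the check below lets a blocked sender resume once the SSN of the
+ * next SWQE falls within that limit.)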
*/ + if (qp->s_cur != qp->s_head && + (qp->s_lsn == (u32) -1 || + cmp24(get_swqe_ptr(qp, qp->s_cur)->ssn, qp->s_lsn + 1) <= 0)) { + tasklet_schedule(&qp->s_task); + } +} + +/* + * This is called from ipath_rc_rcv() to process an incomming RC ACK + * for the given QP. + * Called at interrupt level with the QP s_lock held. + * Returns 1 if OK, 0 if current operation should be aborted (NAK). + */ +static int do_rc_ack(struct ipath_qp *qp, u32 aeth, u32 psn, int opcode) +{ + struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + struct ib_wc wc; + struct ipath_swqe *wqe; + + /* + * Remove the QP from the timeout queue (or RNR timeout queue). + * If ipath_ib_timer() has already removed it, + * it's OK since we hold the QP s_lock and ipath_restart_rc() + * just won't find anything to restart if we ACK everything. + */ + spin_lock(&dev->pending_lock); + if (qp->timerwait.next != LIST_POISON1) + list_del(&qp->timerwait); + spin_unlock(&dev->pending_lock); + + /* + * Note that NAKs implicitly ACK outstanding SEND and + * RDMA write requests and implicitly NAK RDMA read and + * atomic requests issued before the NAK'ed request. + * The MSN won't include the NAK'ed request but will include + * an ACK'ed request(s). + */ + wqe = get_swqe_ptr(qp, qp->s_last); + + /* Nothing is pending to ACK/NAK. */ + if (qp->s_last == qp->s_tail) + return 0; + + /* + * The MSN might be for a later WQE than the PSN indicates so + * only complete WQEs that the PSN finishes. + */ + while (cmp24(psn, wqe->lpsn) >= 0) { + /* If we are ACKing a WQE, the MSN should be >= the SSN. */ + if (cmp24(aeth, wqe->ssn) < 0) + break; + /* + * If this request is a RDMA read or atomic, and the ACK is + * for a later operation, this ACK NAKs the RDMA read or atomic. + * In other words, only a RDMA_READ_LAST or ONLY can ACK + * a RDMA read and likewise for atomic ops. + * Note that the NAK case can only happen if relaxed ordering + * is used and requests are sent after an RDMA read + * or atomic is sent but before the response is received. + */ + if ((wqe->wr.opcode == IB_WR_RDMA_READ && + opcode != IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST) || + ((wqe->wr.opcode == IB_WR_ATOMIC_CMP_AND_SWP || + wqe->wr.opcode == IB_WR_ATOMIC_FETCH_AND_ADD) && + (opcode != IB_OPCODE_RC_ATOMIC_ACKNOWLEDGE || + cmp24(wqe->psn, psn) != 0))) { + /* The last valid PSN seen is the previous request's. */ + qp->s_last_psn = wqe->psn - 1; + /* Retry this request. */ + ipath_restart_rc(qp, wqe->psn, &wc); + /* + * No need to process the ACK/NAK since we are + * restarting an earlier request. + */ + return 0; + } + /* Post a send completion queue entry if requested. */ + if (!test_bit(IPATH_S_SIGNAL_REQ_WR, &qp->s_flags) || + (wqe->wr.send_flags & IB_SEND_SIGNALED)) { + wc.wr_id = wqe->wr.wr_id; + wc.status = IB_WC_SUCCESS; + wc.opcode = wc_opcode[wqe->wr.opcode]; + wc.vendor_err = 0; + wc.byte_len = wqe->length; + wc.qp_num = qp->ibqp.qp_num; + wc.src_qp = qp->remote_qpn; + wc.pkey_index = 0; + wc.slid = qp->remote_ah_attr.dlid; + wc.sl = qp->remote_ah_attr.sl; + wc.dlid_path_bits = 0; + wc.port_num = 0; + ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, 0); + } + qp->s_retry = qp->s_retry_cnt; + /* + * If we are completing a request which is in the process + * of being resent, we can stop resending it since we know + * the responder has already seen it. 
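+ *
+ * (Editor's summary of the AETH decode performed below:
+ *
+ *	aeth[31:29]  0 = ACK, 1 = RNR NAK, 3 = NAK, 2 = reserved
+ *	aeth[28:24]  credit code, RNR timer code, or NAK reason
+ *		     (0 PSN sequence error, 1 invalid request,
+ *		      2 remote access error, 3 remote operation error)
+ *	aeth[23:0]   MSN
+ * )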
+ */ + if (qp->s_last == qp->s_cur) { + if (++qp->s_cur >= qp->s_size) + qp->s_cur = 0; + wqe = get_swqe_ptr(qp, qp->s_cur); + qp->s_state = IB_OPCODE_RC_SEND_LAST; + qp->s_psn = wqe->psn; + } + if (++qp->s_last >= qp->s_size) + qp->s_last = 0; + wqe = get_swqe_ptr(qp, qp->s_last); + if (qp->s_last == qp->s_tail) + break; + } + + switch (aeth >> 29) { + case 0: /* ACK */ + dev->n_rc_acks++; + /* If this is a partial ACK, reset the retransmit timer. */ + if (qp->s_last != qp->s_tail) { + spin_lock(&dev->pending_lock); + list_add_tail(&qp->timerwait, + &dev->pending[dev->pending_index]); + spin_unlock(&dev->pending_lock); + } + ipath_get_credit(qp, aeth); + qp->s_rnr_retry = qp->s_rnr_retry_cnt; + qp->s_retry = qp->s_retry_cnt; + qp->s_last_psn = psn; + return 1; + + case 1: /* RNR NAK */ + dev->n_rnr_naks++; + if (qp->s_rnr_retry == 0) { + if (qp->s_last == qp->s_tail) + return 0; + + wc.status = IB_WC_RNR_RETRY_EXC_ERR; + goto class_b; + } + if (qp->s_rnr_retry_cnt < 7) + qp->s_rnr_retry--; + if (qp->s_last == qp->s_tail) + return 0; + + /* The last valid PSN seen is the previous request's. */ + qp->s_last_psn = wqe->psn - 1; + + /* Restart this request after the RNR timeout. */ + wqe = get_swqe_ptr(qp, qp->s_last); + + dev->n_rc_resends += (int)qp->s_psn - (int)psn; + + /* + * If we are starting the request from the beginning, let the + * normal send code handle initialization. + */ + qp->s_cur = qp->s_last; + if (cmp24(psn, wqe->psn) <= 0) { + qp->s_state = IB_OPCODE_RC_SEND_LAST; + qp->s_psn = wqe->psn; + } else { + u32 n; + + n = qp->s_cur; + for (;;) { + if (++n == qp->s_size) + n = 0; + if (n == qp->s_tail) { + if (cmp24(psn, qp->s_next_psn) >= 0) { + qp->s_cur = n; + wqe = get_swqe_ptr(qp, n); + } + break; + } + wqe = get_swqe_ptr(qp, n); + if (cmp24(psn, wqe->psn) < 0) + break; + qp->s_cur = n; + } + qp->s_psn = psn; + + /* + * Set the state to restart in the middle of a request. + * Don't change the s_sge, s_cur_sge, or s_cur_size. + * See do_rc_send(). + */ + switch (wqe->wr.opcode) { + case IB_WR_SEND: + case IB_WR_SEND_WITH_IMM: + qp->s_state = + IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST; + break; + + case IB_WR_RDMA_WRITE: + case IB_WR_RDMA_WRITE_WITH_IMM: + qp->s_state = + IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST; + break; + + case IB_WR_RDMA_READ: + qp->s_state = + IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE; + break; + + default: + /* + * This case shouldn't happen since its only + * one PSN per req. + */ + qp->s_state = IB_OPCODE_RC_SEND_LAST; + } + } + + qp->s_rnr_timeout = rnr_table[(aeth >> 24) & 0x1F]; + insert_rnr_queue(qp); + return 0; + + case 3: /* NAK */ + /* The last valid PSN seen is the previous request's. */ + if (qp->s_last != qp->s_tail) + qp->s_last_psn = wqe->psn - 1; + switch ((aeth >> 24) & 0x1F) { + case 0: /* PSN sequence error */ + dev->n_seq_naks++; + /* + * Back up to the responder's expected PSN. + * XXX Note that we might get a NAK in the + * middle of an RDMA READ response which + * terminates the RDMA READ. + */ + if (qp->s_last == qp->s_tail) + break; + + if (cmp24(psn, wqe->psn) < 0) { + break; + } + /* Retry the request. 
*/ + ipath_restart_rc(qp, psn, &wc); + break; + + case 1: /* Invalid Request */ + wc.status = IB_WC_REM_INV_REQ_ERR; + dev->n_other_naks++; + goto class_b; + + case 2: /* Remote Access Error */ + wc.status = IB_WC_REM_ACCESS_ERR; + dev->n_other_naks++; + goto class_b; + + case 3: /* Remote Operation Error */ + wc.status = IB_WC_REM_OP_ERR; + dev->n_other_naks++; + class_b: + wc.wr_id = wqe->wr.wr_id; + wc.opcode = wc_opcode[wqe->wr.opcode]; + wc.vendor_err = 0; + wc.byte_len = 0; + wc.qp_num = qp->ibqp.qp_num; + wc.src_qp = qp->remote_qpn; + wc.pkey_index = 0; + wc.slid = qp->remote_ah_attr.dlid; + wc.sl = qp->remote_ah_attr.sl; + wc.dlid_path_bits = 0; + wc.port_num = 0; + ipath_sqerror_qp(qp, &wc); + break; + + default: + /* Ignore other reserved NAK error codes */ + goto reserved; + } + qp->s_rnr_retry = qp->s_rnr_retry_cnt; + return 0; + + default: /* 2: reserved */ + reserved: + /* Ignore reserved NAK codes. */ + return 0; + } +} + +/* + * This is called from ipath_qp_rcv() to process an incoming RC packet + * for the given QP. + * Called at interrupt level. + */ +static void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, + int has_grh, void *data, u32 tlen, struct ipath_qp *qp) +{ + struct ipath_other_headers *ohdr; + int opcode; + u32 hdrsize; + u32 psn; + u32 pad; + unsigned long flags; + struct ib_wc wc; + u32 pmtu = ib_mtu_enum_to_int(qp->path_mtu); + int diff; + struct ib_reth *reth; + + /* Check for GRH */ + if (!has_grh) { + ohdr = &hdr->u.oth; + hdrsize = 8 + 12; /* LRH + BTH */ + psn = be32_to_cpu(ohdr->bth[2]); + } else { + ohdr = &hdr->u.l.oth; + hdrsize = 8 + 40 + 12; /* LRH + GRH + BTH */ + /* + * The header with GRH is 60 bytes and the + * core driver sets the eager header buffer + * size to 56 bytes so the last 4 bytes of + * the BTH header (PSN) are in the data buffer. + */ + psn = be32_to_cpu(((u32 *) data)[0]); + data += sizeof(u32); + } + /* + * The opcode is in the low byte when it's in network order + * (top byte when in host order). + */ + opcode = *(u8 *) (&ohdr->bth[0]); + + /* + * Process responses (ACKs) before anything else. + * Note that the packet sequence number will be for something + * in the send work queue rather than the expected receive + * packet sequence number. In other words, this QP is the + * requester. + */ + if (opcode >= IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST && + opcode <= IB_OPCODE_RC_ATOMIC_ACKNOWLEDGE) { + + spin_lock_irqsave(&qp->s_lock, flags); + + /* Ignore invalid responses. */ + if (cmp24(psn, qp->s_next_psn) >= 0) { + goto ack_done; + } + + /* Ignore duplicate responses.
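+ *
+ * (Editor's note: a response whose PSN is at or before s_last_psn is
+ * old, but a duplicate plain ACK with diff == 0 may still advertise
+ * newer flow-control credits, so the code below harvests credits from
+ * such "ghost" ACKs before discarding them.)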
*/ + diff = cmp24(psn, qp->s_last_psn); + if (unlikely(diff <= 0)) { + /* Update credits for "ghost" ACKs */ + if (diff == 0 && opcode == IB_OPCODE_RC_ACKNOWLEDGE) { + if (!has_grh) { + pad = be32_to_cpu(ohdr->u.aeth); + } else { + pad = be32_to_cpu(((u32 *) data)[0]); + data += sizeof(u32); + } + if ((pad >> 29) == 0) { + ipath_get_credit(qp, pad); + } + } + goto ack_done; + } + + switch (opcode) { + case IB_OPCODE_RC_ACKNOWLEDGE: + case IB_OPCODE_RC_ATOMIC_ACKNOWLEDGE: + case IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST: + if (!has_grh) { + pad = be32_to_cpu(ohdr->u.aeth); + } else { + pad = be32_to_cpu(((u32 *) data)[0]); + data += sizeof(u32); + } + if (opcode == IB_OPCODE_RC_ATOMIC_ACKNOWLEDGE) { + *(u64 *) qp->s_sge.sge.vaddr = *(u64 *) data; + } + if (!do_rc_ack(qp, pad, psn, opcode) || + opcode != IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST) { + goto ack_done; + } + hdrsize += 4; + /* + * do_rc_ack() has already checked the PSN so skip + * the sequence check. + */ + goto rdma_read; + + case IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE: + /* no AETH, no ACK */ + if (unlikely(cmp24(psn, qp->s_last_psn + 1) != 0)) { + dev->n_rdma_seq++; + ipath_restart_rc(qp, qp->s_last_psn + 1, &wc); + goto ack_done; + } + rdma_read: + if (unlikely(qp->s_state != + IB_OPCODE_RC_RDMA_READ_REQUEST)) + goto ack_done; + if (unlikely(tlen != (hdrsize + pmtu + 4))) + goto ack_done; + if (unlikely(pmtu >= qp->s_len)) + goto ack_done; + /* We got a response so update the timeout. */ + if (unlikely(qp->s_last == qp->s_tail || + get_swqe_ptr(qp, qp->s_last)->wr.opcode != + IB_WR_RDMA_READ)) + goto ack_done; + spin_lock(&dev->pending_lock); + if (qp->s_rnr_timeout == 0 && + qp->timerwait.next != LIST_POISON1) { + list_move_tail(&qp->timerwait, + &dev->pending[dev-> + pending_index]); + } + spin_unlock(&dev->pending_lock); + /* + * Update the RDMA receive state but do the copy w/o + * holding the locks and blocking interrupts. + * XXX Yet another place that affects relaxed + * RDMA order since we don't want s_sge modified. + */ + qp->s_len -= pmtu; + qp->s_last_psn = psn; + spin_unlock_irqrestore(&qp->s_lock, flags); + copy_sge(&qp->s_sge, data, pmtu); + return; + + case IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST: + /* ACKs READ req. */ + if (unlikely(cmp24(psn, qp->s_last_psn + 1) != 0)) { + dev->n_rdma_seq++; + ipath_restart_rc(qp, qp->s_last_psn + 1, &wc); + goto ack_done; + } + /* FALLTHROUGH */ + case IB_OPCODE_RC_RDMA_READ_RESPONSE_ONLY: + if (unlikely(qp->s_state != + IB_OPCODE_RC_RDMA_READ_REQUEST)) { + goto ack_done; + } + /* Get the number of bytes the message was padded by. */ + pad = (ohdr->bth[0] >> 12) & 3; + /* + * Check that the data size is >= 1 && <= pmtu. + * Remember to account for the AETH header (4) + * and ICRC (4). + */ + if (unlikely(tlen <= (hdrsize + pad + 8))) { + /* XXX Need to generate an error CQ entry. */ + goto ack_done; + } + tlen -= hdrsize + pad + 8; + if (unlikely(tlen != qp->s_len)) { + /* XXX Need to generate an error CQ entry. */ + goto ack_done; + } + if (!has_grh) { + pad = be32_to_cpu(ohdr->u.aeth); + } else { + pad = be32_to_cpu(((u32 *) data)[0]); + data += sizeof(u32); + } + copy_sge(&qp->s_sge, data, tlen); + if (do_rc_ack(qp, pad, psn, + IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST)) { + /* + * Change the state so we contimue + * processing new requests. + */ + qp->s_state = IB_OPCODE_RC_SEND_LAST; + } + goto ack_done; + } + ack_done: + spin_unlock_irqrestore(&qp->s_lock, flags); + return; + } + + spin_lock_irqsave(&qp->r_rq.lock, flags); + + /* Compute 24 bits worth of difference. 
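+ *
+ * (Editor's sketch: cmp24() is defined elsewhere in the driver, not
+ * shown here; a plausible implementation comparing PSNs modulo 2^24
+ * is
+ *
+ *	static int cmp24(u32 a, u32 b)
+ *	{
+ *		return (int)((a - b) << 8) >> 8;   // sign-extend bit 23
+ *	}
+ *
+ * returning negative, zero, or positive like strcmp().)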
*/ + diff = cmp24(psn, qp->r_psn); + if (unlikely(diff)) { + if (diff > 0) { + /* + * Packet sequence error. + * A NAK will ACK earlier sends and RDMA writes. + * Don't queue the NAK if a RDMA read, atomic, or + * NAK is pending though. + */ + spin_lock(&qp->s_lock); + if ((qp->s_ack_state >= + IB_OPCODE_RC_RDMA_READ_REQUEST && + qp->s_ack_state != IB_OPCODE_ACKNOWLEDGE) || + qp->s_nak_state != 0) { + spin_unlock(&qp->s_lock); + goto done; + } + qp->s_ack_state = IB_OPCODE_RC_SEND_ONLY; + qp->s_nak_state = IB_NAK_PSN_ERROR; + /* Use the expected PSN. */ + qp->s_ack_psn = qp->r_psn; + goto resched; + } + + /* + * Handle a duplicate request. + * Don't re-execute SEND, RDMA write or atomic op. + * Don't NAK errors, just silently drop the duplicate request. + * Note that r_sge, r_len, and r_rcv_len may be + * in use so don't modify them. + * + * We are supposed to ACK the earliest duplicate PSN + * but we can coalesce an outstanding duplicate ACK. + * We have to send the earliest so that RDMA reads + * can be restarted at the requester's expected PSN. + */ + spin_lock(&qp->s_lock); + if (qp->s_ack_state != IB_OPCODE_ACKNOWLEDGE && + cmp24(psn, qp->s_ack_psn) >= 0) { + if (qp->s_ack_state < IB_OPCODE_RDMA_READ_REQUEST) + qp->s_ack_psn = psn; + spin_unlock(&qp->s_lock); + goto done; + } + switch (opcode) { + case IB_OPCODE_RC_RDMA_READ_REQUEST: + /* + * We have to be careful to not change s_rdma_sge + * while do_rc_send() is using it and not holding + * the s_lock. + */ + if (qp->s_ack_state != IB_OPCODE_RC_ACKNOWLEDGE && + qp->s_ack_state >= IB_OPCODE_RDMA_READ_REQUEST) { + spin_unlock(&qp->s_lock); + dev->n_rdma_dup_busy++; + goto done; + } + /* RETH comes after BTH */ + if (!has_grh) + reth = &ohdr->u.rc.reth; + else { + reth = (struct ib_reth *)data; + data += sizeof(*reth); + } + qp->s_rdma_len = be32_to_cpu(reth->length); + if (qp->s_rdma_len != 0) { + u32 rkey = be32_to_cpu(reth->rkey); + u64 vaddr = be64_to_cpu(reth->vaddr); + + /* + * Address range must be a subset of the + * original request and start on pmtu + * boundaries. + */ + if (unlikely(!ipath_rkey_ok(dev, + &qp->s_rdma_sge, + qp->s_rdma_len, + vaddr, rkey, + IB_ACCESS_REMOTE_READ))) + { + goto done; + } + } else { + qp->s_rdma_sge.sg_list = NULL; + qp->s_rdma_sge.num_sge = 0; + qp->s_rdma_sge.sge.mr = NULL; + qp->s_rdma_sge.sge.vaddr = NULL; + qp->s_rdma_sge.sge.length = 0; + qp->s_rdma_sge.sge.sge_length = 0; + } + break; + + case IB_OPCODE_RC_COMPARE_SWAP: + case IB_OPCODE_RC_FETCH_ADD: + /* + * Check for the PSN of the last atomic operations + * performed and resend the result if found. + */ + if ((psn & 0xFFFFFF) != qp->r_atomic_psn) { + spin_unlock(&qp->s_lock); + goto done; + } + qp->s_ack_atomic = qp->r_atomic_data; + break; + } + qp->s_ack_state = opcode; + qp->s_nak_state = 0; + qp->s_ack_psn = psn; + goto resched; + } + + /* Check for opcode sequence errors. */ + switch (qp->r_state) { + case IB_OPCODE_RC_SEND_FIRST: + case IB_OPCODE_RC_SEND_MIDDLE: + if (opcode == IB_OPCODE_RC_SEND_MIDDLE || + opcode == IB_OPCODE_RC_SEND_LAST || + opcode == IB_OPCODE_RC_SEND_LAST_WITH_IMMEDIATE) + break; + nack_inv: + /* + * A NAK will ACK earlier sends and RDMA writes. + * Don't queue the NAK if a RDMA read, atomic, or + * NAK is pending though. 
+ */ + spin_lock(&qp->s_lock); + if (qp->s_ack_state >= IB_OPCODE_RC_RDMA_READ_REQUEST && + qp->s_ack_state != IB_OPCODE_ACKNOWLEDGE) { + spin_unlock(&qp->s_lock); + goto done; + } + /* XXX Flush WQEs */ + qp->state = IB_QPS_ERR; + qp->s_ack_state = IB_OPCODE_RC_SEND_ONLY; + qp->s_nak_state = IB_NAK_INVALID_REQUEST; + qp->s_ack_psn = qp->r_psn; + goto resched; + + case IB_OPCODE_RC_RDMA_WRITE_FIRST: + case IB_OPCODE_RC_RDMA_WRITE_MIDDLE: + if (opcode == IB_OPCODE_RC_RDMA_WRITE_MIDDLE || + opcode == IB_OPCODE_RC_RDMA_WRITE_LAST || + opcode == IB_OPCODE_RC_RDMA_WRITE_LAST_WITH_IMMEDIATE) + break; + goto nack_inv; + + case IB_OPCODE_RC_RDMA_READ_REQUEST: + case IB_OPCODE_RC_COMPARE_SWAP: + case IB_OPCODE_RC_FETCH_ADD: + /* + * Drop all new requests until a response has been sent. + * A new request then ACKs the RDMA response we sent. + * Relaxed ordering would allow new requests to be + * processed but we would need to keep a queue + * of rwqe's for all that are in progress. + * Note that we can't RNR NAK this request since the RDMA + * READ or atomic response is already queued to be sent + * (unless we implement a response send queue). + */ + goto done; + + default: + if (opcode == IB_OPCODE_RC_SEND_MIDDLE || + opcode == IB_OPCODE_RC_SEND_LAST || + opcode == IB_OPCODE_RC_SEND_LAST_WITH_IMMEDIATE || + opcode == IB_OPCODE_RC_RDMA_WRITE_MIDDLE || + opcode == IB_OPCODE_RC_RDMA_WRITE_LAST || + opcode == IB_OPCODE_RC_RDMA_WRITE_LAST_WITH_IMMEDIATE) + goto nack_inv; + break; + } + + wc.imm_data = 0; + wc.wc_flags = 0; + + /* OK, process the packet. */ + switch (opcode) { + case IB_OPCODE_RC_SEND_FIRST: + if (!get_rwqe(qp, 0)) { + rnr_nak: + /* + * A RNR NAK will ACK earlier sends and RDMA writes. + * Don't queue the NAK if a RDMA read or atomic + * is pending though. + */ + spin_lock(&qp->s_lock); + if (qp->s_ack_state >= IB_OPCODE_RC_RDMA_READ_REQUEST && + qp->s_ack_state != IB_OPCODE_ACKNOWLEDGE) { + spin_unlock(&qp->s_lock); + goto done; + } + qp->s_ack_state = IB_OPCODE_RC_SEND_ONLY; + qp->s_nak_state = IB_RNR_NAK | qp->s_min_rnr_timer; + qp->s_ack_psn = qp->r_psn; + goto resched; + } + qp->r_rcv_len = 0; + /* FALLTHROUGH */ + case IB_OPCODE_RC_SEND_MIDDLE: + case IB_OPCODE_RC_RDMA_WRITE_MIDDLE: + send_middle: + /* Check for invalid length PMTU or posted rwqe len. */ + if (unlikely(tlen != (hdrsize + pmtu + 4))) { + goto nack_inv; + } + qp->r_rcv_len += pmtu; + if (unlikely(qp->r_rcv_len > qp->r_len)) { + goto nack_inv; + } + copy_sge(&qp->r_sge, data, pmtu); + break; + + case IB_OPCODE_RC_RDMA_WRITE_LAST_WITH_IMMEDIATE: + /* consume RWQE */ + if (!get_rwqe(qp, 1)) + goto rnr_nak; + goto send_last_imm; + + case IB_OPCODE_RC_SEND_ONLY: + case IB_OPCODE_RC_SEND_ONLY_WITH_IMMEDIATE: + if (!get_rwqe(qp, 0)) + goto rnr_nak; + qp->r_rcv_len = 0; + if (opcode == IB_OPCODE_RC_SEND_ONLY) + goto send_last; + /* FALLTHROUGH */ + case IB_OPCODE_RC_SEND_LAST_WITH_IMMEDIATE: + send_last_imm: + if (has_grh) { + wc.imm_data = *(u32 *) data; + data += sizeof(u32); + } else { + /* Immediate data comes after BTH */ + wc.imm_data = ohdr->u.imm_data; + } + hdrsize += 4; + wc.wc_flags = IB_WC_WITH_IMM; + /* FALLTHROUGH */ + case IB_OPCODE_RC_SEND_LAST: + case IB_OPCODE_RC_RDMA_WRITE_LAST: + send_last: + /* Get the number of bytes the message was padded by. */ + pad = (ohdr->bth[0] >> 12) & 3; + /* Check for invalid length. */ + /* XXX LAST len should be >= 1 */ + if (unlikely(tlen < (hdrsize + pad + 4))) { + goto nack_inv; + } + /* Don't count the CRC. 
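+ *
+ * (Editor's note: the payload is what remains after the headers, the
+ * pad bytes, and the 4-byte ICRC; e.g. a SEND_LAST with no GRH and
+ * 2 pad bytes carries
+ *
+ *	payload = tlen - (8 + 12) - 2 - 4;	// LRH + BTH, pad, ICRC
+ * )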
*/ + tlen -= (hdrsize + pad + 4); + wc.byte_len = tlen + qp->r_rcv_len; + if (unlikely(wc.byte_len > qp->r_len)) { + goto nack_inv; + } + /* XXX Need to free SGEs */ + copy_sge(&qp->r_sge, data, tlen); + atomic_inc(&qp->msn); + if (opcode == IB_OPCODE_RC_RDMA_WRITE_LAST || + opcode == IB_OPCODE_RC_RDMA_WRITE_ONLY) + break; + wc.wr_id = qp->r_wr_id; + wc.status = IB_WC_SUCCESS; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.qp_num = qp->ibqp.qp_num; + wc.src_qp = qp->remote_qpn; + wc.pkey_index = 0; + wc.slid = qp->remote_ah_attr.dlid; + wc.sl = qp->remote_ah_attr.sl; + wc.dlid_path_bits = 0; + wc.port_num = 0; + /* Signal completion event if the solicited bit is set. */ + ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, + ohdr->bth[0] & __constant_cpu_to_be32(1 << 23)); + break; + + case IB_OPCODE_RC_RDMA_WRITE_FIRST: + case IB_OPCODE_RC_RDMA_WRITE_ONLY: + case IB_OPCODE_RC_RDMA_WRITE_ONLY_WITH_IMMEDIATE: + /* consume RWQE */ + /* RETH comes after BTH */ + if (!has_grh) + reth = &ohdr->u.rc.reth; + else { + reth = (struct ib_reth *)data; + data += sizeof(*reth); + } + hdrsize += sizeof(*reth); + qp->r_len = be32_to_cpu(reth->length); + qp->r_rcv_len = 0; + if (qp->r_len != 0) { + u32 rkey = be32_to_cpu(reth->rkey); + u64 vaddr = be64_to_cpu(reth->vaddr); + + /* Check rkey & NAK */ + if (unlikely(!ipath_rkey_ok(dev, &qp->r_sge, qp->r_len, + vaddr, rkey, + IB_ACCESS_REMOTE_WRITE))) { + nack_acc: + /* + * A NAK will ACK earlier sends and RDMA + * writes. + * Don't queue the NAK if a RDMA read, + * atomic, or NAK is pending though. + */ + spin_lock(&qp->s_lock); + if (qp->s_ack_state >= + IB_OPCODE_RC_RDMA_READ_REQUEST && + qp->s_ack_state != IB_OPCODE_ACKNOWLEDGE) { + spin_unlock(&qp->s_lock); + goto done; + } + /* XXX Flush WQEs */ + qp->state = IB_QPS_ERR; + qp->s_ack_state = IB_OPCODE_RC_RDMA_WRITE_ONLY; + qp->s_nak_state = IB_NAK_REMOTE_ACCESS_ERROR; + qp->s_ack_psn = qp->r_psn; + goto resched; + } + } else { + qp->r_sge.sg_list = NULL; + qp->r_sge.sge.mr = NULL; + qp->r_sge.sge.vaddr = NULL; + qp->r_sge.sge.length = 0; + qp->r_sge.sge.sge_length = 0; + } + if (unlikely(!(qp->qp_access_flags & IB_ACCESS_REMOTE_WRITE))) + goto nack_acc; + if (opcode == IB_OPCODE_RC_RDMA_WRITE_FIRST) + goto send_middle; + else if (opcode == IB_OPCODE_RC_RDMA_WRITE_ONLY) + goto send_last; + if (!get_rwqe(qp, 1)) + goto rnr_nak; + goto send_last_imm; + + case IB_OPCODE_RC_RDMA_READ_REQUEST: + /* RETH comes after BTH */ + if (!has_grh) + reth = &ohdr->u.rc.reth; + else { + reth = (struct ib_reth *)data; + data += sizeof(*reth); + } + spin_lock(&qp->s_lock); + if (qp->s_ack_state != IB_OPCODE_RC_ACKNOWLEDGE && + qp->s_ack_state >= IB_OPCODE_RDMA_READ_REQUEST) { + spin_unlock(&qp->s_lock); + goto done; + } + qp->s_rdma_len = be32_to_cpu(reth->length); + if (qp->s_rdma_len != 0) { + u32 rkey = be32_to_cpu(reth->rkey); + u64 vaddr = be64_to_cpu(reth->vaddr); + + /* Check rkey & NAK */ + if (unlikely(!ipath_rkey_ok(dev, &qp->s_rdma_sge, + qp->s_rdma_len, + vaddr, rkey, + IB_ACCESS_REMOTE_READ))) { + spin_unlock(&qp->s_lock); + goto nack_acc; + } + /* + * Update the next expected PSN. + * We add 1 later below, so only add the remainder here. 
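+ *
+ * (Editor's note: an N-byte RDMA READ produces ceil(N / pmtu)
+ * response packets, one PSN each.  With the increment applied further
+ * down, the adjustment below uses
+ *
+ *	(N - 1) / pmtu == ceil(N / pmtu) - 1	// for N > 0
+ *
+ * e.g. N = 8192, pmtu = 2048 advances r_psn by 4 in total.)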
+ */ + if (qp->s_rdma_len > pmtu) + qp->r_psn += (qp->s_rdma_len - 1) / pmtu; + } else { + qp->s_rdma_sge.sg_list = NULL; + qp->s_rdma_sge.num_sge = 0; + qp->s_rdma_sge.sge.mr = NULL; + qp->s_rdma_sge.sge.vaddr = NULL; + qp->s_rdma_sge.sge.length = 0; + qp->s_rdma_sge.sge.sge_length = 0; + } + if (unlikely(!(qp->qp_access_flags & IB_ACCESS_REMOTE_READ))) + goto nack_acc; + /* + * We need to increment the MSN here instead of when we + * finish sending the result since a duplicate request would + * increment it more than once. + */ + atomic_inc(&qp->msn); + qp->s_ack_state = opcode; + qp->s_nak_state = 0; + qp->s_ack_psn = psn; + qp->r_psn++; + qp->r_state = opcode; + goto rdmadone; + + case IB_OPCODE_RC_COMPARE_SWAP: + case IB_OPCODE_RC_FETCH_ADD:{ + struct ib_atomic_eth *ateth; + u64 vaddr; + u64 sdata; + u32 rkey; + + if (!has_grh) + ateth = &ohdr->u.atomic_eth; + else { + ateth = (struct ib_atomic_eth *)data; + data += sizeof(*ateth); + } + vaddr = be64_to_cpu(ateth->vaddr); + if (unlikely(vaddr & 0x7)) + goto nack_inv; + rkey = be32_to_cpu(ateth->rkey); + /* Check rkey & NAK */ + if (unlikely(!ipath_rkey_ok(dev, &qp->r_sge, + sizeof(u64), vaddr, rkey, + IB_ACCESS_REMOTE_ATOMIC))) { + goto nack_acc; + } + if (unlikely(!(qp->qp_access_flags & + IB_ACCESS_REMOTE_ATOMIC))) + goto nack_acc; + /* Perform atomic OP and save result. */ + sdata = be64_to_cpu(ateth->swap_data); + spin_lock(&dev->pending_lock); + qp->r_atomic_data = *(u64 *) qp->r_sge.sge.vaddr; + if (opcode == IB_OPCODE_RC_FETCH_ADD) { + *(u64 *) qp->r_sge.sge.vaddr = + qp->r_atomic_data + sdata; + } else if (qp->r_atomic_data == + be64_to_cpu(ateth->compare_data)) { + *(u64 *) qp->r_sge.sge.vaddr = sdata; + } + spin_unlock(&dev->pending_lock); + atomic_inc(&qp->msn); + qp->r_atomic_psn = psn & 0xFFFFFF; + psn |= 1 << 31; + break; + } + + default: + /* Drop packet for unknown opcodes. */ + spin_unlock_irqrestore(&qp->r_rq.lock, flags); + return; + } + qp->r_psn++; + qp->r_state = opcode; + /* Send an ACK if requested or required. */ + if (psn & (1 << 31)) { + /* + * Coalesce ACKs unless there is a RDMA READ or + * ATOMIC pending. + */ + spin_lock(&qp->s_lock); + if (qp->s_ack_state == IB_OPCODE_RC_ACKNOWLEDGE || + qp->s_ack_state < IB_OPCODE_RDMA_READ_REQUEST) { + qp->s_ack_state = opcode; + qp->s_nak_state = 0; + qp->s_ack_psn = psn; + qp->s_ack_atomic = qp->r_atomic_data; + goto resched; + } + spin_unlock(&qp->s_lock); + } +done: + spin_unlock_irqrestore(&qp->r_rq.lock, flags); + return; + +resched: + /* Try to send ACK right away but not if do_rc_send() is active. */ + if (qp->s_hdrwords == 0 && + (qp->s_ack_state < IB_OPCODE_RDMA_READ_REQUEST || + qp->s_ack_state >= IB_OPCODE_COMPARE_SWAP)) + send_rc_ack(qp); + +rdmadone: + spin_unlock(&qp->s_lock); + spin_unlock_irqrestore(&qp->r_rq.lock, flags); + + /* Call do_rc_send() in another thread. */ + tasklet_schedule(&qp->s_task); +} From bos at pathscale.com Wed Dec 28 16:31:37 2005 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 28 Dec 2005 16:31:37 -0800 Subject: [openib-general] [PATCH 18 of 20] ipath - infiniband management datagram support In-Reply-To: Message-ID: Signed-off-by: Bryan O'Sullivan diff -r 584777b6f4dc -r e7cabc7a2e78 drivers/infiniband/hw/ipath/ipath_mad.c --- /dev/null Thu Jan 1 00:00:00 1970 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_mad.c Wed Dec 28 14:19:43 2005 -0800 @@ -0,0 +1,1144 @@ +/* + * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Patent licenses, if any, provided herein do not apply to + * combinations of this program with other software, or any other + * product whatsoever. + */ + +#include +#include + +#include "ips_common.h" +#include "ipath_verbs.h" +#include "ipath_layer.h" + + +#define IB_SMP_INVALID_FIELD __constant_htons(0x001C) + +static int reply(struct ib_smp *smp, int line) +{ + + /* + * The verbs framework will handle the directed/LID route + * packet changes. + */ + smp->method = IB_MGMT_METHOD_GET_RESP; + if (smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + smp->status |= IB_SMP_DIRECTION; + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY; +} + +static inline int recv_subn_get_nodedescription(struct ib_smp *smp) +{ + + strncpy(smp->data, "Infinipath", sizeof(smp->data)); + + return reply(smp, __LINE__); +} + +struct nodeinfo { + u8 base_version; + u8 class_version; + u8 node_type; + u8 num_ports; + __be64 sys_guid; + __be64 node_guid; + __be64 port_guid; + __be16 partition_cap; + __be16 device_id; + __be32 revision; + u8 local_port_num; + u8 vendor_id[3]; +} __attribute__ ((packed)); + +/* + * XXX The num_ports value will need a layer function to get the value + * if we ever have more than one IB port on a chip. + * We will also need to get the GUID for the port. 
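+ *
+ * (Editor's note: struct nodeinfo above is the packed wire layout of
+ * the NodeInfo attribute, so every multi-byte field is stored
+ * big-endian; vendor_id[] holds the 24-bit OUI most significant byte
+ * first, and the code below assumes the OUI fits in the low 16 bits,
+ * leaving vendor_id[0] zero.)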
+ */ +static inline int recv_subn_get_nodeinfo(struct ib_smp *smp, + struct ib_device *ibdev, u8 port) +{ + struct nodeinfo *nip = (struct nodeinfo *)&smp->data; + ipath_type t = to_idev(ibdev)->ib_unit; + uint32_t vendor, boardid, majrev, minrev; + + nip->base_version = 1; + nip->class_version = 1; + nip->node_type = 1; /* channel adapter */ + nip->num_ports = 1; + /* This is already in network order */ + nip->sys_guid = to_idev(ibdev)->sys_image_guid; + nip->node_guid = ipath_layer_get_guid(t); + nip->port_guid = nip->sys_guid; + nip->partition_cap = cpu_to_be16(ipath_layer_get_npkeys(t)); + nip->device_id = cpu_to_be16(ipath_layer_get_deviceid(t)); + ipath_layer_query_device(t, &vendor, &boardid, &majrev, &minrev); + nip->revision = cpu_to_be32((majrev << 16) | minrev); + nip->local_port_num = port; + nip->vendor_id[0] = 0; + nip->vendor_id[1] = vendor >> 8; + nip->vendor_id[2] = vendor; + + return reply(smp, __LINE__); +} + +static int recv_subn_get_guidinfo(struct ib_smp *smp, struct ib_device *ibdev) +{ + uint32_t t = to_idev(ibdev)->ib_unit; + u32 startgx = 8 * be32_to_cpu(smp->attr_mod); + u64 *p = (u64 *) smp->data; + + /* 32 blocks of 8 64-bit GUIDs per block */ + + memset(smp->data, 0, sizeof(smp->data)); + + /* + * We only support one GUID for now. + * If this changes, the portinfo.guid_cap field needs to be updated too. + */ + if (startgx == 0) { + /* The first is a copy of the read-only HW GUID. */ + *p = ipath_layer_get_guid(t); + } + + return reply(smp, __LINE__); +} + +struct port_info { + __be64 mkey; + __be64 gid_prefix; + __be16 lid; + __be16 sm_lid; + __be32 cap_mask; + __be16 diag_code; + __be16 mkey_lease_period; + u8 local_port_num; + u8 link_width_enabled; + u8 link_width_supported; + u8 link_width_active; + u8 linkspeed_portstate; /* 4 bits, 4 bits */ + u8 portphysstate_linkdown; /* 4 bits, 4 bits */ + u8 mkeyprot_resv_lmc; /* 2 bits, 3 bits, 3 bits */ + u8 linkspeedactive_enabled; /* 4 bits, 4 bits */ + u8 neighbormtu_mastersmsl; /* 4 bits, 4 bits */ + u8 vlcap_inittype; /* 4 bits, 4 bits */ + u8 vl_high_limit; + u8 vl_arb_high_cap; + u8 vl_arb_low_cap; + u8 inittypereply_mtucap; /* 4 bits, 4 bits */ + u8 vlstallcnt_hoqlife; /* 3 bits, 5 bits */ + u8 operationalvl_pei_peo_fpi_fpo; /* 4 bits, 1, 1, 1, 1 */ + __be16 mkey_violations; + __be16 pkey_violations; + __be16 qkey_violations; + u8 guid_cap; + u8 clientrereg_resv_subnetto; /* 1 bit, 2 bits, 5 bits */ + u8 resv_resptimevalue; /* 3 bits, 5 bits */ + u8 localphyerrors_overrunerrors; /* 4 bits, 4 bits */ + __be16 max_credit_hint; + u8 resv; + u8 link_roundtrip_latency[3]; +} __attribute__ ((packed)); + +static int recv_subn_get_portinfo(struct ib_smp *smp, struct ib_device *ibdev, + u8 port) +{ + u32 lportnum = be32_to_cpu(smp->attr_mod); + struct ipath_ibdev *dev; + struct port_info *pip = (struct port_info *)smp->data; + u32 tmp, tmp2; + + if (lportnum == 0) { + lportnum = port; + smp->attr_mod = cpu_to_be32(lportnum); + } + + if (lportnum < 1 || lportnum > ibdev->phys_port_cnt) + return IB_MAD_RESULT_FAILURE; + + dev = to_idev(ibdev); + + /* Clear all fields. Only set the non-zero fields. */ + memset(smp->data, 0, sizeof(smp->data)); + + /* Only return the mkey if the protection field allows it. */ + if ((dev->mkeyprot_resv_lmc >> 6) == 0) + pip->mkey = dev->mkey; + else + pip->mkey = 0; + pip->gid_prefix = dev->gid_prefix; + tmp = ipath_layer_get_lid(dev->ib_unit); + pip->lid = tmp ? 
cpu_to_be16(tmp) : IB_LID_PERMISSIVE; + pip->sm_lid = cpu_to_be16(dev->sm_lid); + pip->cap_mask = cpu_to_be32(dev->port_cap_flags); + /* pip->diag_code; */ + pip->mkey_lease_period = cpu_to_be16(dev->mkey_lease_period); + pip->local_port_num = port; + pip->link_width_enabled = 2; /* 4x */ + pip->link_width_supported = 3; /* 1x or 4x */ + pip->link_width_active = 2; /* 4x */ + pip->linkspeed_portstate = 0x10; /* 2.5Gbps */ + tmp = ipath_layer_get_lastibcstat(dev->ib_unit) & 0xff; + tmp2 = 5; /* link up */ + if (tmp == 0x11) + pip->linkspeed_portstate |= 2; /* initialize */ + else if (tmp == 0x21) + pip->linkspeed_portstate |= 3; /* armed */ + else if (tmp == 0x31) + pip->linkspeed_portstate |= 4; /* active */ + else { + pip->linkspeed_portstate |= 1; /* down */ + tmp2 = tmp & 0xf; + } + pip->portphysstate_linkdown = (tmp2 << 4) | + (ipath_layer_get_linkdowndefaultstate(dev->ib_unit) ? 1 : 2); + pip->mkeyprot_resv_lmc = dev->mkeyprot_resv_lmc; + pip->linkspeedactive_enabled = 0x11; /* 2.5Gbps, 2.5Gbps */ + switch (ipath_layer_get_ibmtu(dev->ib_unit)) { + case 4096: + tmp = IB_MTU_4096; + break; + case 2048: + tmp = IB_MTU_2048; + break; + case 1024: + tmp = IB_MTU_1024; + break; + case 512: + tmp = IB_MTU_512; + break; + case 256: + tmp = IB_MTU_256; + break; + default: /* oops, something is wrong */ + tmp = IB_MTU_2048; + break; + } + pip->neighbormtu_mastersmsl = (tmp << 4) | dev->sm_sl; + pip->vlcap_inittype = 0x10; /* VLCap = VL0, InitType = 0 */ + /* pip->vl_high_limit; // only one VL */ + /* pip->vl_arb_high_cap; // only one VL */ + /* pip->vl_arb_low_cap; // only one VL */ + pip->inittypereply_mtucap = IB_MTU_4096; /* InitTypeReply = 0 */ + /* pip->vlstallcnt_hoqlife; // HCAs ignore VLStallCount and HOQLife */ + pip->operationalvl_pei_peo_fpi_fpo = 0x10; /* OVLs = 1 */ + pip->mkey_violations = cpu_to_be16(dev->mkey_violations); + /* P_KeyViolations are counted by hardware. */ + tmp = ipath_layer_get_cr_errpkey(dev->ib_unit) & 0xFFFF; + pip->pkey_violations = cpu_to_be16(tmp); + pip->qkey_violations = cpu_to_be16(dev->qkey_violations); + /* Only the hardware GUID is supported for now */ + pip->guid_cap = 1; + pip->clientrereg_resv_subnetto = dev->subnet_timeout; + /* 32.768 usec. response time (guessing) */ + pip->resv_resptimevalue = 3; + pip->localphyerrors_overrunerrors = + (ipath_layer_get_phyerrthreshold(dev->ib_unit) << 4) | + ipath_layer_get_overrunthreshold(dev->ib_unit); + /* pip->max_credit_hint; */ + /* pip->link_roundtrip_latency[3]; */ + + return reply(smp, __LINE__); +} + +static int recv_subn_get_pkeytable(struct ib_smp *smp, struct ib_device *ibdev) +{ + u32 startpx = 32 * (be32_to_cpu(smp->attr_mod) & 0xffff); + u16 *p = (u16 *) smp->data; + + /* 64 blocks of 32 16-bit P_Key entries */ + + memset(smp->data, 0, sizeof(smp->data)); + if (startpx == 0) + ipath_layer_get_pkeys(to_idev(ibdev)->ib_unit, p); + else + smp->status |= IB_SMP_INVALID_FIELD; + + return reply(smp, __LINE__); +} + +static inline int recv_subn_set_guidinfo(struct ib_smp *smp, + struct ib_device *ibdev) +{ + /* The only GUID we support is the first read-only entry. 
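The MTU translation that recv_subn_get_portinfo() open-codes above (and that recv_subn_set_portinfo() later inverts) is a fixed five-entry mapping; a table-driven equivalent, with illustrative names that are not in the patch, could look like this:

/* Sketch only: byte count <-> IB_MTU_* mapping used by the two
 * portinfo switch statements. */
static const struct {
	int mtu_enum;
	u32 bytes;
} ib_mtu_map[] = {
	{ IB_MTU_256, 256 },
	{ IB_MTU_512, 512 },
	{ IB_MTU_1024, 1024 },
	{ IB_MTU_2048, 2048 },
	{ IB_MTU_4096, 4096 },
};

static int ib_mtu_from_bytes(u32 bytes)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(ib_mtu_map); i++)
		if (ib_mtu_map[i].bytes == bytes)
			return ib_mtu_map[i].mtu_enum;
	return IB_MTU_2048;	/* same fallback the Get handler uses */
}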
*/ + return recv_subn_get_guidinfo(smp, ibdev); +} + +static inline int recv_subn_set_portinfo(struct ib_smp *smp, + struct ib_device *ibdev, u8 port) +{ + struct port_info *pip = (struct port_info *)smp->data; + uint32_t lportnum = be32_to_cpu(smp->attr_mod); + struct ib_event event; + struct ipath_ibdev *dev; + uint32_t flags; + char clientrereg = 0; + u32 tmp; + u32 tmp2; + int ret; + + if (lportnum == 0) { + lportnum = port; + smp->attr_mod = cpu_to_be32(lportnum); + } + + if (lportnum < 1 || lportnum > ibdev->phys_port_cnt) + return IB_MAD_RESULT_FAILURE; + + dev = to_idev(ibdev); + event.device = ibdev; + event.element.port_num = port; + + if (dev->mkey != pip->mkey) + dev->mkey = pip->mkey; + + if (pip->gid_prefix != dev->gid_prefix) + dev->gid_prefix = pip->gid_prefix; + + tmp = be16_to_cpu(pip->lid); + if (tmp != ipath_layer_get_lid(dev->ib_unit)) { + ipath_set_sps_lid(dev->ib_unit, tmp); + event.event = IB_EVENT_LID_CHANGE; + ib_dispatch_event(&event); + } + + tmp = be16_to_cpu(pip->sm_lid); + if (tmp != dev->sm_lid) { + dev->sm_lid = tmp; + event.event = IB_EVENT_SM_CHANGE; + ib_dispatch_event(&event); + } + + dev->mkey_lease_period = be16_to_cpu(pip->mkey_lease_period); + +#if 0 + tmp = pip->link_width_enabled; + if (tmp && (tmp != lpp->linkwidthenabled)) { + lpp->linkwidthenabled = tmp; + /* JAG - notify driver here */ + } +#endif + + tmp = pip->portphysstate_linkdown & 0xF; + if (tmp == 1) { + /* SLEEP */ + if (ipath_layer_set_linkdowndefaultstate(dev->ib_unit, 1)) + return IB_MAD_RESULT_FAILURE; + } else if (tmp == 2) { + /* POLL */ + if (ipath_layer_set_linkdowndefaultstate(dev->ib_unit, 0)) + return IB_MAD_RESULT_FAILURE; + } else if (tmp) + return IB_MAD_RESULT_FAILURE; + + dev->mkeyprot_resv_lmc = pip->mkeyprot_resv_lmc; + +#if 0 + tmp = BF_GET(g.madp, iba_Subn_PortInfo, FIELD_LinkSpeedEnabled); + if (tmp && (tmp != lpp->linkspeedenabled)) { + lpp->linkspeedenabled = tmp; + /* JAG - notify driver here */ + } +#endif + + switch ((pip->neighbormtu_mastersmsl >> 4) & 0xF) { + case IB_MTU_256: + tmp = 256; + break; + case IB_MTU_512: + tmp = 512; + break; + case IB_MTU_1024: + tmp = 1024; + break; + case IB_MTU_2048: + tmp = 2048; + break; + case IB_MTU_4096: + tmp = 4096; + break; + default: + /* XXX We have already partially updated our state! */ + return IB_MAD_RESULT_FAILURE; + } + ipath_kset_mtu(dev->ib_unit << 16 | tmp); + + dev->sm_sl = pip->neighbormtu_mastersmsl & 0xF; + +#if 0 + tmp = BF_GET(g.madp, iba_Subn_PortInfo, FIELD_VLHighLimit); + if (tmp != lpp->vlhighlimit) { + lpp->vlhighlimit = tmp; + /* JAG - notify driver here */ + } + + lpp->inittypereply = + BF_GET(g.madp, iba_Subn_PortInfo, FIELD_InitTypeReply); + + tmp = BF_GET(g.madp, iba_Subn_PortInfo, FIELD_OperationalVLs); + if (tmp && (tmp != lpp->operationalvls)) { + lpp->operationalvls = tmp; + /* JAG - notify driver here */ + } +#endif + + if (pip->mkey_violations != 0) + dev->mkey_violations = 0; +#if 0 + /* XXX Hardware counter can't be reset. 
*/ + if (pip->pkey_violations != 0) + dev->pkey_violations = 0; +#endif + + if (pip->qkey_violations != 0) + dev->qkey_violations = 0; + + tmp = (pip->localphyerrors_overrunerrors >> 4) & 0xF; + if (ipath_layer_set_phyerrthreshold(dev->ib_unit, tmp)) + return IB_MAD_RESULT_FAILURE; + + tmp = pip->localphyerrors_overrunerrors & 0xF; + if (ipath_layer_set_overrunthreshold(dev->ib_unit, tmp)) + return IB_MAD_RESULT_FAILURE; + + dev->subnet_timeout = pip->clientrereg_resv_subnetto & 0x1F; + + if (pip->clientrereg_resv_subnetto & 0x80) { + clientrereg = 1; + event.event = IB_EVENT_LID_CHANGE; + ib_dispatch_event(&event); + } + + /* + * Do the port state change now that the other link parameters + * have been set. + * Changing the port physical state only makes sense if the link + * is down or is being set to down. + */ + tmp = pip->linkspeed_portstate & 0xF; + flags = ipath_layer_get_flags(dev->ib_unit); + tmp2 = (pip->portphysstate_linkdown >> 4) & 0xF; + if (tmp2) { + if (tmp != IB_PORT_DOWN && !(flags & IPATH_LINKDOWN)) + return IB_MAD_RESULT_FAILURE; + tmp = IB_PORT_DOWN; + tmp2 = IB_PORT_NOP; + } else if (flags & IPATH_LINKDOWN) + tmp2 = IB_PORT_DOWN; + else if (flags & IPATH_LINKINIT) + tmp2 = IB_PORT_INIT; + else if (flags & IPATH_LINKARMED) + tmp2 = IB_PORT_ARMED; + else if (flags & IPATH_LINKACTIVE) + tmp2 = IB_PORT_ACTIVE; + else + tmp2 = IB_PORT_NOP; + + if (tmp && tmp != tmp2) { + switch (tmp) { + case IB_PORT_DOWN: + tmp = (pip->portphysstate_linkdown >> 4) & 0xF; + if (tmp <= 1) + tmp = IPATH_IB_LINKDOWN; + else if (tmp == 2) + tmp = IPATH_IB_LINKDOWN_POLL; + else if (tmp == 3) + tmp = IPATH_IB_LINKDOWN_DISABLE; + else + return IB_MAD_RESULT_FAILURE; + ipath_kset_linkstate(dev->ib_unit << 16 | tmp); + if (tmp2 == IB_PORT_ACTIVE) { + event.event = IB_EVENT_PORT_ERR; + ib_dispatch_event(&event); + } + break; + + case IB_PORT_INIT: + ipath_kset_linkstate(dev->ib_unit << 16 | + IPATH_IB_LINKINIT); + if (tmp2 == IB_PORT_ACTIVE) { + event.event = IB_EVENT_PORT_ERR; + ib_dispatch_event(&event); + } + break; + + case IB_PORT_ARMED: + ipath_kset_linkstate(dev->ib_unit << 16 | + IPATH_IB_LINKARM); + if (tmp2 == IB_PORT_ACTIVE) { + event.event = IB_EVENT_PORT_ERR; + ib_dispatch_event(&event); + } + break; + + case IB_PORT_ACTIVE: + ipath_kset_linkstate(dev->ib_unit << 16 | + IPATH_IB_LINKACTIVE); + event.event = IB_EVENT_PORT_ACTIVE; + ib_dispatch_event(&event); + break; + + default: + /* XXX We have already partially updated our state! 
*/ + return IB_MAD_RESULT_FAILURE; + } + } + + ret = recv_subn_get_portinfo(smp, ibdev, port); + + if (clientrereg) + pip->clientrereg_resv_subnetto |= 0x80; + + return ret; +} + +static inline int recv_subn_set_pkeytable(struct ib_smp *smp, + struct ib_device *ibdev) +{ + u32 startpx = 32 * (be32_to_cpu(smp->attr_mod) & 0xffff); + u16 *p = (u16 *) smp->data; + + if (startpx != 0 || + ipath_layer_set_pkeys(to_idev(ibdev)->ib_unit, p) != 0) + smp->status |= IB_SMP_INVALID_FIELD; + + return recv_subn_get_pkeytable(smp, ibdev); +} + +#define IB_PMA_CLASS_PORT_INFO __constant_htons(0x0001) +#define IB_PMA_PORT_SAMPLES_CONTROL __constant_htons(0x0010) +#define IB_PMA_PORT_SAMPLES_RESULT __constant_htons(0x0011) +#define IB_PMA_PORT_COUNTERS __constant_htons(0x0012) +#define IB_PMA_PORT_COUNTERS_EXT __constant_htons(0x001D) +#define IB_PMA_PORT_SAMPLES_RESULT_EXT __constant_htons(0x001E) + +struct ib_perf { + u8 base_version; + u8 mgmt_class; + u8 class_version; + u8 method; + __be16 status; + __be16 unused; + __be64 tid; + __be16 attr_id; + __be16 resv; + __be32 attr_mod; + u8 reserved[40]; + u8 data[192]; +} __attribute__ ((packed)); + +struct ib_pma_classportinfo { + u8 base_version; + u8 class_version; + __be16 cap_mask; + u8 reserved[3]; + u8 resp_time_value; /* only lower 5 bits */ + union ib_gid redirect_gid; + __be32 redirect_tc_sl_fl; /* 8, 4, 20 bits respectively */ + __be16 redirect_lid; + __be16 redirect_pkey; + __be32 redirect_qp; /* only lower 24 bits */ + __be32 redirect_qkey; + union ib_gid trap_gid; + __be32 trap_tc_sl_fl; /* 8, 4, 20 bits respectively */ + __be16 trap_lid; + __be16 trap_pkey; + __be32 trap_hl_qp; /* 8, 24 bits respectively */ + __be32 trap_qkey; +} __attribute__ ((packed)); + +struct ib_pma_portsamplescontrol { + u8 opcode; + u8 port_select; + u8 tick; + u8 counter_width; /* only lower 3 bits */ + __be32 counter_mask0_9; /* 2, 10 * 3, bits */ + __be16 counter_mask10_14; /* 1, 5 * 3, bits */ + u8 sample_mechanisms; + u8 sample_status; /* only lower 2 bits */ + __be64 option_mask; + __be64 vendor_mask; + __be32 sample_start; + __be32 sample_interval; + __be16 tag; + __be16 counter_select[15]; +} __attribute__ ((packed)); + +struct ib_pma_portsamplesresult { + __be16 tag; + __be16 sample_status; /* only lower 2 bits */ + __be32 counter[15]; +} __attribute__ ((packed)); + +struct ib_pma_portsamplesresult_ext { + __be16 tag; + __be16 sample_status; /* only lower 2 bits */ + __be32 extended_width; /* only upper 2 bits */ + __be64 counter[15]; +} __attribute__ ((packed)); + +struct ib_pma_portcounters { + u8 reserved; + u8 port_select; + __be16 counter_select; + __be16 symbol_error_counter; + u8 link_error_recovery_counter; + u8 link_downed_counter; + __be16 port_rcv_errors; + __be16 port_rcv_remphys_errors; + __be16 port_rcv_switch_relay_errors; + __be16 port_xmit_discards; + u8 port_xmit_constraint_errors; + u8 port_rcv_constraint_errors; + u8 reserved1; + u8 lli_ebor_errors; /* 4, 4, bits */ + __be16 reserved2; + __be16 vl15_dropped; + __be32 port_xmit_data; + __be32 port_rcv_data; + __be32 port_xmit_packets; + __be32 port_rcv_packets; +} __attribute__ ((packed)); + +#define IB_PMA_SEL_SYMBOL_ERROR __constant_htons(0x0001) +#define IB_PMA_SEL_LINK_ERROR_RECOVERY __constant_htons(0x0002) +#define IB_PMA_SEL_LINK_DOWNED __constant_htons(0x0004) +#define IB_PMA_SEL_PORT_RCV_ERRORS __constant_htons(0x0008) +#define IB_PMA_SEL_PORT_RCV_REMPHYS_ERRORS __constant_htons(0x0010) +#define IB_PMA_SEL_PORT_XMIT_DISCARDS __constant_htons(0x0040) +#define 
IB_PMA_SEL_PORT_XMIT_DATA __constant_htons(0x1000) +#define IB_PMA_SEL_PORT_RCV_DATA __constant_htons(0x2000) +#define IB_PMA_SEL_PORT_XMIT_PACKETS __constant_htons(0x4000) +#define IB_PMA_SEL_PORT_RCV_PACKETS __constant_htons(0x8000) + +struct ib_pma_portcounters_ext { + u8 reserved; + u8 port_select; + __be16 counter_select; + __be32 reserved1; + __be64 port_xmit_data; + __be64 port_rcv_data; + __be64 port_xmit_packets; + __be64 port_rcv_packets; + __be64 port_unicast_xmit_packets; + __be64 port_unicast_rcv_packets; + __be64 port_multicast_xmit_packets; + __be64 port_multicast_rcv_packets; +} __attribute__ ((packed)); + +#define IB_PMA_SELX_PORT_XMIT_DATA __constant_htons(0x0001) +#define IB_PMA_SELX_PORT_RCV_DATA __constant_htons(0x0002) +#define IB_PMA_SELX_PORT_XMIT_PACKETS __constant_htons(0x0004) +#define IB_PMA_SELX_PORT_RCV_PACKETS __constant_htons(0x0008) +#define IB_PMA_SELX_PORT_UNI_XMIT_PACKETS __constant_htons(0x0010) +#define IB_PMA_SELX_PORT_UNI_RCV_PACKETS __constant_htons(0x0020) +#define IB_PMA_SELX_PORT_MULTI_XMIT_PACKETS __constant_htons(0x0040) +#define IB_PMA_SELX_PORT_MULTI_RCV_PACKETS __constant_htons(0x0080) + +static int recv_pma_get_classportinfo(struct ib_perf *pmp) +{ + /* + struct ib_pma_classportinfo *p = + (struct ib_pma_classportinfo *)pmp->data; + */ + + memset(pmp->data, 0, sizeof(pmp->data)); + + return reply((struct ib_smp *)pmp, __LINE__); +} + +static int recv_pma_get_portsamplescontrol(struct ib_perf *pmp, + struct ib_device *ibdev, u8 port) +{ + struct ib_pma_portsamplescontrol *p = + (struct ib_pma_portsamplescontrol *)pmp->data; + struct ipath_ibdev *dev = to_idev(ibdev); + unsigned long flags; + + memset(pmp->data, 0, sizeof(pmp->data)); + + p->port_select = port; + p->tick = 0xFA; /* 1 ms. */ + p->counter_width = 4; /* 32 bit counters */ + p->counter_mask0_9 = __constant_htonl(0x09248000); /* counters 0-4 */ + spin_lock_irqsave(&dev->pending_lock, flags); + p->sample_status = dev->pma_sample_status; + p->sample_start = cpu_to_be32(dev->pma_sample_start); + p->sample_interval = cpu_to_be32(dev->pma_sample_interval); + p->tag = cpu_to_be16(dev->pma_tag); + p->counter_select[0] = dev->pma_counter_select[0]; + p->counter_select[1] = dev->pma_counter_select[1]; + p->counter_select[2] = dev->pma_counter_select[2]; + p->counter_select[3] = dev->pma_counter_select[3]; + p->counter_select[4] = dev->pma_counter_select[4]; + spin_unlock_irqrestore(&dev->pending_lock, flags); + + return reply((struct ib_smp *)pmp, __LINE__); +} + +static int recv_pma_set_portsamplescontrol(struct ib_perf *pmp, + struct ib_device *ibdev, u8 port) +{ + struct ib_pma_portsamplescontrol *p = + (struct ib_pma_portsamplescontrol *)pmp->data; + struct ipath_ibdev *dev = to_idev(ibdev); + unsigned long flags; + u32 start = be32_to_cpu(p->sample_start); + + if (pmp->attr_mod == 0 && p->port_select == port && start != 0) { + spin_lock_irqsave(&dev->pending_lock, flags); + if (dev->pma_sample_status == IB_PMA_SAMPLE_STATUS_DONE) { + dev->pma_sample_status = IB_PMA_SAMPLE_STATUS_STARTED; + dev->pma_sample_start = start; + dev->pma_sample_interval = + be32_to_cpu(p->sample_interval); + dev->pma_tag = be16_to_cpu(p->tag); + if (p->counter_select[0]) + dev->pma_counter_select[0] = + p->counter_select[0]; + if (p->counter_select[1]) + dev->pma_counter_select[1] = + p->counter_select[1]; + if (p->counter_select[2]) + dev->pma_counter_select[2] = + p->counter_select[2]; + if (p->counter_select[3]) + dev->pma_counter_select[3] = + p->counter_select[3]; + if (p->counter_select[4]) + 
dev->pma_counter_select[4] = + p->counter_select[4]; + } + spin_unlock_irqrestore(&dev->pending_lock, flags); + } + return recv_pma_get_portsamplescontrol(pmp, ibdev, port); +} + +static u64 get_counter(struct ipath_ibdev *dev, __be16 sel) +{ + switch (sel) { + case IB_PMA_PORT_XMIT_DATA: + return dev->ipath_sword; + case IB_PMA_PORT_RCV_DATA: + return dev->ipath_rword; + case IB_PMA_PORT_XMIT_PKTS: + return dev->ipath_spkts; + case IB_PMA_PORT_RCV_PKTS: + return dev->ipath_rpkts; + case IB_PMA_PORT_XMIT_WAIT: + default: + return 0; + } +} + +static int recv_pma_get_portsamplesresult(struct ib_perf *pmp, + struct ib_device *ibdev) +{ + struct ib_pma_portsamplesresult *p = + (struct ib_pma_portsamplesresult *)pmp->data; + struct ipath_ibdev *dev = to_idev(ibdev); + int i; + + memset(pmp->data, 0, sizeof(pmp->data)); + p->tag = cpu_to_be16(dev->pma_tag); + p->sample_status = cpu_to_be16(dev->pma_sample_status); + for (i = 0; i < ARRAY_SIZE(dev->pma_counter_select); i++) + p->counter[i] = + cpu_to_be32(get_counter(dev, dev->pma_counter_select[i])); + + return reply((struct ib_smp *)pmp, __LINE__); +} + +static int recv_pma_get_portsamplesresult_ext(struct ib_perf *pmp, + struct ib_device *ibdev) +{ + struct ib_pma_portsamplesresult_ext *p = + (struct ib_pma_portsamplesresult_ext *)pmp->data; + struct ipath_ibdev *dev = to_idev(ibdev); + int i; + + memset(pmp->data, 0, sizeof(pmp->data)); + p->tag = cpu_to_be16(dev->pma_tag); + p->sample_status = cpu_to_be16(dev->pma_sample_status); + p->extended_width = __constant_cpu_to_be32(0x80000000); /* 64 bits; field is __be32, so be32, not be16 */ + for (i = 0; i < ARRAY_SIZE(dev->pma_counter_select); i++) + p->counter[i] = + cpu_to_be64(get_counter(dev, dev->pma_counter_select[i])); + + return reply((struct ib_smp *)pmp, __LINE__); +} + +static int recv_pma_get_portcounters(struct ib_perf *pmp, + struct ib_device *ibdev, u8 port) +{ + struct ib_pma_portcounters *p = (struct ib_pma_portcounters *)pmp->data; + struct ipath_ibdev *dev = to_idev(ibdev); + struct ipath_layer_counters cntrs; + + ipath_layer_get_counters(dev->ib_unit, &cntrs); + + /* Adjust counters for any resets done.
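One thing to note: get_counter() above switches on IB_PMA_PORT_XMIT_DATA and friends, but those #defines do not appear in the posted excerpt (only the IB_PMA_SEL_* CounterSelect masks do). Per the PMA PortSamplesControl attribute they would be the small CounterSelect codes; the values below follow that assumption and should be checked against the real header:

/* Assumed PortSamplesControl CounterSelect codes for get_counter();
 * not visible in the posted patch. */
#define IB_PMA_PORT_XMIT_DATA	__constant_htons(0x0001)
#define IB_PMA_PORT_RCV_DATA	__constant_htons(0x0002)
#define IB_PMA_PORT_XMIT_PKTS	__constant_htons(0x0003)
#define IB_PMA_PORT_RCV_PKTS	__constant_htons(0x0004)
#define IB_PMA_PORT_XMIT_WAIT	__constant_htons(0x0005)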
*/ + cntrs.symbol_error_counter -= dev->n_symbol_error_counter; + cntrs.link_error_recovery_counter -= dev->n_link_error_recovery_counter; + cntrs.link_downed_counter -= dev->n_link_downed_counter; + cntrs.port_rcv_errors -= dev->n_port_rcv_errors; + cntrs.port_rcv_remphys_errors -= dev->n_port_rcv_remphys_errors; + cntrs.port_xmit_discards -= dev->n_port_xmit_discards; + cntrs.port_xmit_data -= dev->n_port_xmit_data; + cntrs.port_rcv_data -= dev->n_port_rcv_data; + cntrs.port_xmit_packets -= dev->n_port_xmit_packets; + cntrs.port_rcv_packets -= dev->n_port_rcv_packets; + + memset(pmp->data, 0, sizeof(pmp->data)); + p->port_select = port; + if (cntrs.symbol_error_counter > 0xFFFFUL) + p->symbol_error_counter = 0xFFFF; + else + p->symbol_error_counter = + cpu_to_be16((u16)cntrs.symbol_error_counter); + if (cntrs.link_error_recovery_counter > 0xFFUL) + p->link_error_recovery_counter = 0xFF; + else + p->link_error_recovery_counter = + (u8)cntrs.link_error_recovery_counter; + if (cntrs.link_downed_counter > 0xFFUL) + p->link_downed_counter = 0xFF; + else + p->link_downed_counter = (u8)cntrs.link_downed_counter; + if (cntrs.port_rcv_errors > 0xFFFFUL) + p->port_rcv_errors = 0xFFFF; + else + p->port_rcv_errors = cpu_to_be16((u16)cntrs.port_rcv_errors); + if (cntrs.port_rcv_remphys_errors > 0xFFFFUL) + p->port_rcv_remphys_errors = 0xFFFF; + else + p->port_rcv_remphys_errors = + cpu_to_be16((u16)cntrs.port_rcv_remphys_errors); + if (cntrs.port_xmit_discards > 0xFFFFUL) + p->port_xmit_discards = 0xFFFF; + else + p->port_xmit_discards = + cpu_to_be16((u16)cntrs.port_xmit_discards); + if (cntrs.port_xmit_data > 0xFFFFFFFFUL) + p->port_xmit_data = 0xFFFFFFFF; + else + p->port_xmit_data = cpu_to_be32((u32)cntrs.port_xmit_data); + if (cntrs.port_rcv_data > 0xFFFFFFFFUL) + p->port_rcv_data = 0xFFFFFFFF; + else + p->port_rcv_data = cpu_to_be32((u32)cntrs.port_rcv_data); + if (cntrs.port_xmit_packets > 0xFFFFFFFFUL) + p->port_xmit_packets = 0xFFFFFFFF; + else + p->port_xmit_packets = + cpu_to_be32((u32)cntrs.port_xmit_packets); + if (cntrs.port_rcv_packets > 0xFFFFFFFFUL) + p->port_rcv_packets = 0xFFFFFFFF; + else + p->port_rcv_packets = cpu_to_be32((u32)cntrs.port_rcv_packets); + + return reply((struct ib_smp *)pmp, __LINE__); +} + +static int recv_pma_get_portcounters_ext(struct ib_perf *pmp, + struct ib_device *ibdev, u8 port) +{ + struct ib_pma_portcounters_ext *p = + (struct ib_pma_portcounters_ext *)pmp->data; + struct ipath_ibdev *dev = to_idev(ibdev); + u64 swords, rwords, spkts, rpkts; + + ipath_layer_snapshot_counters(dev->ib_unit, + &swords, &rwords, &spkts, &rpkts); + + /* Adjust counters for any resets done. 
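Two idioms recur throughout the PortCounters handlers above and below: the hardware counters cannot be cleared, so a saved baseline is subtracted on every read and updated on a "clear"; and each 64-bit count is saturated down to its 8-, 16- or 32-bit MAD field. Both fit in a few lines (sketch only, names not from the driver):

/* Software-clear: report hw - baseline; "clearing" moves the baseline. */
struct sw_counter {
	u64 baseline;
};

static u64 swc_read(const struct sw_counter *c, u64 hw)
{
	return hw - c->baseline;
}

static void swc_clear(struct sw_counter *c, u64 hw)
{
	c->baseline = hw;
}

/* Saturating narrow, e.g.
 *	p->port_rcv_errors = cpu_to_be16(sat(cntrs.port_rcv_errors, 0xFFFF));
 */
static inline u64 sat(u64 v, u64 max)
{
	return v > max ? max : v;
}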
*/ + swords -= dev->n_port_xmit_data; + rwords -= dev->n_port_rcv_data; + spkts -= dev->n_port_xmit_packets; + rpkts -= dev->n_port_rcv_packets; + + memset(pmp->data, 0, sizeof(pmp->data)); + p->port_select = port; + p->port_xmit_data = cpu_to_be64(swords); + p->port_rcv_data = cpu_to_be64(rwords); + p->port_xmit_packets = cpu_to_be64(spkts); + p->port_rcv_packets = cpu_to_be64(rpkts); + p->port_unicast_xmit_packets = cpu_to_be64(dev->n_unicast_xmit); + p->port_unicast_rcv_packets = cpu_to_be64(dev->n_unicast_rcv); + p->port_multicast_xmit_packets = cpu_to_be64(dev->n_multicast_xmit); + p->port_multicast_rcv_packets = cpu_to_be64(dev->n_multicast_rcv); + + return reply((struct ib_smp *)pmp, __LINE__); +} + +static int recv_pma_set_portcounters(struct ib_perf *pmp, + struct ib_device *ibdev, u8 port) +{ + struct ib_pma_portcounters *p = (struct ib_pma_portcounters *)pmp->data; + struct ipath_ibdev *dev = to_idev(ibdev); + struct ipath_layer_counters cntrs; + + /* + * Since the HW doesn't support clearing counters, we save the + * current count and subtract it from future responses. + */ + ipath_layer_get_counters(dev->ib_unit, &cntrs); + + if (p->counter_select & IB_PMA_SEL_SYMBOL_ERROR) + dev->n_symbol_error_counter = cntrs.symbol_error_counter; + + if (p->counter_select & IB_PMA_SEL_LINK_ERROR_RECOVERY) + dev->n_link_error_recovery_counter = + cntrs.link_error_recovery_counter; + + if (p->counter_select & IB_PMA_SEL_LINK_DOWNED) + dev->n_link_downed_counter = cntrs.link_downed_counter; + + if (p->counter_select & IB_PMA_SEL_PORT_RCV_ERRORS) + dev->n_port_rcv_errors = cntrs.port_rcv_errors; + + if (p->counter_select & IB_PMA_SEL_PORT_RCV_REMPHYS_ERRORS) + dev->n_port_rcv_remphys_errors = cntrs.port_rcv_remphys_errors; + + if (p->counter_select & IB_PMA_SEL_PORT_XMIT_DISCARDS) + dev->n_port_xmit_discards = cntrs.port_xmit_discards; + + if (p->counter_select & IB_PMA_SEL_PORT_XMIT_DATA) + dev->n_port_xmit_data = cntrs.port_xmit_data; + + if (p->counter_select & IB_PMA_SEL_PORT_RCV_DATA) + dev->n_port_rcv_data = cntrs.port_rcv_data; + + if (p->counter_select & IB_PMA_SEL_PORT_XMIT_PACKETS) + dev->n_port_xmit_packets = cntrs.port_xmit_packets; + + if (p->counter_select & IB_PMA_SEL_PORT_RCV_PACKETS) + dev->n_port_rcv_packets = cntrs.port_rcv_packets; + + return recv_pma_get_portcounters(pmp, ibdev, port); +} + +static int recv_pma_set_portcounters_ext(struct ib_perf *pmp, + struct ib_device *ibdev, u8 port) +{ + struct ib_pma_portcounters *p = (struct ib_pma_portcounters *)pmp->data; + struct ipath_ibdev *dev = to_idev(ibdev); + u64 swords, rwords, spkts, rpkts; + + ipath_layer_snapshot_counters(dev->ib_unit, + &swords, &rwords, &spkts, &rpkts); + + if (p->counter_select & IB_PMA_SELX_PORT_XMIT_DATA) + dev->n_port_xmit_data = swords; + + if (p->counter_select & IB_PMA_SELX_PORT_RCV_DATA) + dev->n_port_rcv_data = rwords; + + if (p->counter_select & IB_PMA_SELX_PORT_XMIT_PACKETS) + dev->n_port_xmit_packets = spkts; + + if (p->counter_select & IB_PMA_SELX_PORT_RCV_PACKETS) + dev->n_port_rcv_packets = rpkts; + + if (p->counter_select & IB_PMA_SELX_PORT_UNI_XMIT_PACKETS) + dev->n_unicast_xmit = 0; + + if (p->counter_select & IB_PMA_SELX_PORT_UNI_RCV_PACKETS) + dev->n_unicast_rcv = 0; + + if (p->counter_select & IB_PMA_SELX_PORT_MULTI_XMIT_PACKETS) + dev->n_multicast_xmit = 0; + + if (p->counter_select & IB_PMA_SELX_PORT_MULTI_RCV_PACKETS) + dev->n_multicast_rcv = 0; + + return recv_pma_get_portcounters_ext(pmp, ibdev, port); +} + +static inline int process_subn(struct ib_device *ibdev, int 
mad_flags, + u8 port_num, struct ib_mad *in_mad, + struct ib_mad *out_mad) +{ + struct ib_smp *smp = (struct ib_smp *)out_mad; + struct ipath_ibdev *dev = to_idev(ibdev); + + /* Is the mkey in the process of expiring? */ + if (dev->mkey_lease_timeout && jiffies >= dev->mkey_lease_timeout) { + dev->mkey_lease_timeout = 0; + dev->mkeyprot_resv_lmc &= 0x3F; + } + + /* + * M_Key checking depends on + * Portinfo:M_Key_protect_bits + */ + if ((mad_flags & IB_MAD_IGNORE_MKEY) == 0 && dev->mkey != 0 && + dev->mkey != smp->mkey && (smp->method != IB_MGMT_METHOD_GET || + (dev->mkeyprot_resv_lmc >> 7) != 0)) { + if (dev->mkey_violations != 0xFFFF) + ++dev->mkey_violations; + if (dev->mkey_lease_timeout || dev->mkey_lease_period == 0) + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + dev->mkey_lease_timeout = jiffies + dev->mkey_lease_period * HZ; + /* Future: Generate a trap notice. */ + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + } + + *out_mad = *in_mad; + switch (smp->method) { + case IB_MGMT_METHOD_GET: + switch (smp->attr_id) { + case IB_SMP_ATTR_NODE_DESC: + return recv_subn_get_nodedescription(smp); + + case IB_SMP_ATTR_NODE_INFO: + return recv_subn_get_nodeinfo(smp, ibdev, port_num); + + case IB_SMP_ATTR_GUID_INFO: + return recv_subn_get_guidinfo(smp, ibdev); + + case IB_SMP_ATTR_PORT_INFO: + return recv_subn_get_portinfo(smp, ibdev, port_num); + + case IB_SMP_ATTR_PKEY_TABLE: + return recv_subn_get_pkeytable(smp, ibdev); + + default: + break; + } + break; + + case IB_MGMT_METHOD_SET: + switch (smp->attr_id) { + case IB_SMP_ATTR_GUID_INFO: + return recv_subn_set_guidinfo(smp, ibdev); + + case IB_SMP_ATTR_PORT_INFO: + return recv_subn_set_portinfo(smp, ibdev, port_num); + + case IB_SMP_ATTR_PKEY_TABLE: + return recv_subn_set_pkeytable(smp, ibdev); + + default: + break; + } + break; + + default: + break; + } + return IB_MAD_RESULT_FAILURE; +} + +static inline int process_perf(struct ib_device *ibdev, u8 port_num, + struct ib_mad *in_mad, struct ib_mad *out_mad) +{ + struct ib_perf *pmp = (struct ib_perf *)out_mad; + + *out_mad = *in_mad; + switch (pmp->method) { + case IB_MGMT_METHOD_GET: + switch (pmp->attr_id) { + case IB_PMA_CLASS_PORT_INFO: + return recv_pma_get_classportinfo(pmp); + + case IB_PMA_PORT_SAMPLES_CONTROL: + return recv_pma_get_portsamplescontrol(pmp, ibdev, + port_num); + + case IB_PMA_PORT_SAMPLES_RESULT: + return recv_pma_get_portsamplesresult(pmp, ibdev); + + case IB_PMA_PORT_SAMPLES_RESULT_EXT: + return recv_pma_get_portsamplesresult_ext(pmp, ibdev); + + case IB_PMA_PORT_COUNTERS: + return recv_pma_get_portcounters(pmp, ibdev, port_num); + + case IB_PMA_PORT_COUNTERS_EXT: + return recv_pma_get_portcounters_ext(pmp, ibdev, + port_num); + + default: + break; + } + break; + + case IB_MGMT_METHOD_SET: + switch (pmp->attr_id) { + case IB_PMA_PORT_SAMPLES_CONTROL: + return recv_pma_set_portsamplescontrol(pmp, ibdev, + port_num); + + case IB_PMA_PORT_COUNTERS: + return recv_pma_set_portcounters(pmp, ibdev, port_num); + + case IB_PMA_PORT_COUNTERS_EXT: + return recv_pma_set_portcounters_ext(pmp, ibdev, + port_num); + + default: + break; + } + break; + + default: + break; + } + return IB_MAD_RESULT_FAILURE; +} + +/* + * Note that the verbs framework has already done the MAD sanity checks, + * and hop count/pointer updating for IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE MADs. + * + * Return IB_MAD_RESULT_SUCCESS if this is a MAD that we are not interested + * in processing. 
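The M_Key test at the top of process_subn() above is easier to read as the positive predicate it negates: a MAD passes if the framework says to ignore M_Keys, if no key is configured, if the keys match, or if it is a SubnGet and the protection level (top bit of mkeyprot_resv_lmc) still permits Get with a wrong key. A restatement, sketch only, not part of the patch:

static int mkey_check_passes(int mad_flags, __be64 dev_mkey,
			     u8 mkeyprot_resv_lmc, const struct ib_smp *smp)
{
	if (mad_flags & IB_MAD_IGNORE_MKEY)
		return 1;		/* framework told us to skip it */
	if (dev_mkey == 0)
		return 1;		/* no M_Key configured */
	if (dev_mkey == smp->mkey)
		return 1;		/* key matches */
	/* wrong key: Get is still allowed at protect levels 0 and 1 */
	return smp->method == IB_MGMT_METHOD_GET &&
		(mkeyprot_resv_lmc >> 7) == 0;
}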
+ */ +int ipath_process_mad(struct ib_device *ibdev, int mad_flags, u8 port_num, + struct ib_wc *in_wc, struct ib_grh *in_grh, + struct ib_mad *in_mad, struct ib_mad *out_mad) +{ + switch (in_mad->mad_hdr.mgmt_class) { + case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: + case IB_MGMT_CLASS_SUBN_LID_ROUTED: + return process_subn(ibdev, mad_flags, port_num, + in_mad, out_mad); + + case IB_MGMT_CLASS_PERF_MGMT: + return process_perf(ibdev, port_num, in_mad, out_mad); + + default: + return IB_MAD_RESULT_SUCCESS; + } +} From rdreier at cisco.com Wed Dec 28 18:19:49 2005 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 28 Dec 2005 18:19:49 -0800 Subject: [openib-general] [PATCH 11 of 20] ipath - core driver, part 4 of 4 In-Reply-To: (Bryan O'Sullivan's message of "Wed, 28 Dec 2005 16:31:30 -0800") References: Message-ID: I didn't notice this before: > + * This is volatile as it's the target of a DMA from the chip. > + */ > + > +static volatile uint64_t ipath_port0_rcvhdrtail[512] > + __attribute__ ((aligned(4096))); ... and then much later ... > + /* > + * kernel modules loaded into vmalloc'ed memory, > + * verify that when we assume that, map to phys, and back to virt, > + * that we get the right contents, so we did the mapping right. > + */ > + vpage = vmalloc_to_page((void *)ipath_port0_rcvhdrtail); > + if (vpage == NOPAGE_SIGBUS || vpage == NOPAGE_OOM) { > + _IPATH_UNIT_ERROR(t, "vmalloc_to_page for rcvhdrtail fails!\n"); > + ret = -ENOMEM; > + goto done; > + } This seems very wrong to me: there's no guarantee that a module will be loaded into memory that can be used as a DMA target. For example, on a non-cache-coherent architecture, I think this memory must be accessed through a non-cached mapping. I think the correct solution is to allocate a buffer for each device with pci_alloc_consistent() (or maybe dma_alloc_coherent()). (As a general comment, I'm still unhappy about how your driver has a static, fixed-size table of devices rather than allocating per-device data structures dynamically) - R. From penberg at cs.helsinki.fi Thu Dec 29 00:18:03 2005 From: penberg at cs.helsinki.fi (Pekka Enberg) Date: Thu, 29 Dec 2005 10:18:03 +0200 Subject: [openib-general] Re: [PATCH 5 of 20] ipath - driver core header files In-Reply-To: <2d9a3f27a10c8f11df92.1135816284@eng-12.pathscale.com> References: <2d9a3f27a10c8f11df92.1135816284@eng-12.pathscale.com> Message-ID: <84144f020512290018q189b0e34pc5cba9b251a8914b@mail.gmail.com> Hi Bryan, On 12/29/05, Bryan O'Sullivan wrote: > +/* > + * Copy routine that is guaranteed to work in terms of aligned 32-bit > + * quantities. > + */ > +void ipath_dwordcpy(uint32_t *dest, uint32_t *src, uint32_t ndwords); Wasn't this supposed to be killed? Pekka From penberg at cs.helsinki.fi Thu Dec 29 00:21:02 2005 From: penberg at cs.helsinki.fi (Pekka Enberg) Date: Thu, 29 Dec 2005 10:21:02 +0200 Subject: [openib-general] Re: [PATCH 14 of 20] ipath - infiniband verbs header In-Reply-To: <26993cb5faeef807a840.1135816293@eng-12.pathscale.com> References: <26993cb5faeef807a840.1135816293@eng-12.pathscale.com> Message-ID: <84144f020512290021x34a028eck290c238a24bd14d1@mail.gmail.com> On 12/29/05, Bryan O'Sullivan wrote: > diff -r f9bcd9de3548 -r 26993cb5faee drivers/infiniband/hw/ipath/verbs_debug.h > --- /dev/null Thu Jan 1 00:00:00 1970 +0000 > +++ b/drivers/infiniband/hw/ipath/verbs_debug.h Wed Dec 28 14:19:43 2005 -0800 > +#ifndef _VERBS_DEBUG_H > +#define _VERBS_DEBUG_H > + > +/* > + * This file contains tracing code for the ib_ipath kernel module. 
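Roland's per-device allocation suggestion above would take roughly this shape; the size (512 tail words) matches the static array he quotes, but the function and field names are illustrative, not from the driver. The matching teardown would use dma_free_coherent() with the same size and handle.

#include <linux/dma-mapping.h>

/* Sketch: give each device a coherent DMA buffer for the port-0
 * receive-header tail instead of static module memory. */
static int alloc_port0_rcvhdrtail(struct pci_dev *pdev, u64 **tail,
				  dma_addr_t *tail_dma)
{
	*tail = dma_alloc_coherent(&pdev->dev, 512 * sizeof(u64),
				   tail_dma, GFP_KERNEL);
	return *tail ? 0 : -ENOMEM;
}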
> + */ > +#ifndef _VERBS_DEBUGGING /* tracing enabled or not */ > +#define _VERBS_DEBUGGING 1 > +#endif > + > +extern unsigned ib_ipath_debug; > + > +#define _VERBS_ERROR(fmt,...) \ > + do { \ > + printk(KERN_ERR "%s: " fmt, "ib_ipath", ##__VA_ARGS__); \ > + } while(0) [snip, snip] Please consider using dev_dbg, dev_err, and friends from <linux/device.h>. Pekka From penberg at cs.helsinki.fi Thu Dec 29 00:22:26 2005 From: penberg at cs.helsinki.fi (Pekka Enberg) Date: Thu, 29 Dec 2005 10:22:26 +0200 Subject: [openib-general] Re: [PATCH 6 of 20] ipath - driver debugging headers In-Reply-To: <9e8d017ed298d591ea33.1135816285@eng-12.pathscale.com> References: <9e8d017ed298d591ea33.1135816285@eng-12.pathscale.com> Message-ID: <84144f020512290022i20504893n95eb01484de62e3f@mail.gmail.com> On 12/29/05, Bryan O'Sullivan wrote: > +#endif /* _IPATH_DEBUG_H */ > diff -r 2d9a3f27a10c -r 9e8d017ed298 drivers/infiniband/hw/ipath/ipath_kdebug.h > --- /dev/null Thu Jan 1 00:00:00 1970 +0000 > +++ b/drivers/infiniband/hw/ipath/ipath_kdebug.h Wed Dec 28 14:19:42 2005 -0800 > @@ -0,0 +1,109 @@ > +#ifndef _IPATH_KDEBUG_H > +#define _IPATH_KDEBUG_H > + > +#include "ipath_debug.h" > + > +/* > + * This file contains lightweight kernel tracing code. > + */ > + > +extern unsigned infinipath_debug; > +const char *ipath_get_unit_name(int unit); > + > +#if _IPATH_DEBUGGING > + > +#define _IPATH_UNIT_ERROR(unit,fmt,...) \ > + printk(KERN_ERR "%s: " fmt, ipath_get_unit_name(unit), ##__VA_ARGS__) > + > +#define _IPATH_ERROR(fmt,...) printk(KERN_ERR "infinipath: " fmt, ##__VA_ARGS__) > + > +#define _IPATH_INFO(fmt,...) \ > + do { \ > + if(unlikely(infinipath_debug & __IPATH_INFO)) \ > + printk(KERN_INFO "infinipath: " fmt, ##__VA_ARGS__); \ > + } while(0) > + [snip, snip] Please consider using dev_dbg, dev_err, et al. from <linux/device.h>. Pekka From eitan at mellanox.co.il Thu Dec 29 00:46:06 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 29 Dec 2005 10:46:06 +0200 Subject: [openib-general] RE: [PATCH] OpenSM/osm_helper.c: In osm_dump_smp_dr_path, display DRLIDs only if DR SMP Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3618BF9@mtlexch01.mtl.com> Hi Hal, This is a fine nit. Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O.
Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Wednesday, December 28, 2005 10:54 PM > To: Yael Kalka; Eitan Zahavi > Cc: openib-general at openib.org > Subject: [PATCH] OpenSM/osm_helper.c: In osm_dump_smp_dr_path, display > DRLIDs only if DR SMP > > OpenSM/osm_helper.c: In osm_dump_smp_dr_path, display DR LIDs only if DR > SMP > > Signed-off-by: Hal Rosenstock > > Index: osm_helper.c > =================================================================== > --- osm_helper.c (revision 4645) > +++ osm_helper.c (working copy) > @@ -1457,9 +1457,7 @@ osm_dump_dr_smp( > "\t\t\t\tattr_id.................0x%X (%s)\n" > "\t\t\t\tresv....................0x%X\n" > "\t\t\t\tattr_mod................0x%X\n" > - "\t\t\t\tm_key...................0x%016" PRIx64 "\n" > - "\t\t\t\tdr_slid.................0x%X\n" > - "\t\t\t\tdr_dlid.................0x%X\n", > + "\t\t\t\tm_key...................0x%016" PRIx64 "\n", > p_smp->hop_ptr, > p_smp->hop_count, > cl_ntoh64(p_smp->trans_id), > @@ -1467,14 +1465,20 @@ > ib_get_sm_attr_str( p_smp->attr_id ), > cl_ntoh16(p_smp->resv), > cl_ntoh32(p_smp->attr_mod), > - cl_ntoh64(p_smp->m_key), > - cl_ntoh16(p_smp->dr_slid), > - cl_ntoh16(p_smp->dr_dlid) > + cl_ntoh64(p_smp->m_key) > ); > strcat( buf, line ); > > if (p_smp->mgmt_class == IB_MCLASS_SUBN_DIR) > { > + sprintf( line, > + "\t\t\t\tdr_slid.................0x%X\n" > + "\t\t\t\tdr_dlid.................0x%X\n", > + cl_ntoh16(p_smp->dr_slid), > + cl_ntoh16(p_smp->dr_dlid) > + ); > + strcat( buf, line ); > + > strcat( buf, "\n\t\t\t\tInitial path: " ); > > for( i = 0; i <= p_smp->hop_count; i++ ) > @@ -1652,7 +1656,7 @@ osm_dump_smp_dr_path( > > if( osm_log_is_active( p_log, log_level) ) > { > - sprintf( buf, "Received a SMP on a %u hop path:" > + sprintf( buf, "Received SMP on a %u hop path:" > "\n\t\t\t\tInitial path = ", p_smp->hop_count ); > > for( i = 0; i <= p_smp->hop_count; i++ ) > From oferg at mellanox.co.il Thu Dec 29 02:20:31 2005 From: oferg at mellanox.co.il (Ofer Gigi) Date: Thu, 29 Dec 2005 12:20:31 +0200 Subject: [openib-general] [PATCH] osm: support for trivial PKey manager Message-ID: <074q4sfbm8.fsf@swlab25.yok.mtl.com> Hi Hal, My name is Ofer Gigi, and I am a new software engineer at Mellanox working on OpenSM. This patch provides a new manager that solves the following problem: OpenSM is not currently compliant with the spec statement C14-62.1.1 (Table 183, p870 l34): "However, the SM shall ensure that one of the P_KeyTable entries in every node contains either the value 0xFFFF (the default P_Key, full membership) or the value 0x7FFF (the default P_Key, partial membership)." Luckily, all IB devices come up from reset with a preconfigured 0xffff key; this was discovered during the last plugfest. To overcome this limitation I implemented an elementary PKey manager that enforces the above rule (currently it adds 0xffff where missing). This manager would serve as the basis for a full PKey policy manager in the future. We have tested this patch. Thanks, Ofer G. Signed-off-by: Ofer Gigi Index: include/opensm/osm_pkey_mgr.h =================================================================== --- include/opensm/osm_pkey_mgr.h (revision 0) +++ include/opensm/osm_pkey_mgr.h (revision 0) @@ -0,0 +1,259 @@ +/* + * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses.
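The C14-62.1.1 rule Ofer quotes only constrains the 15 P_Key base bits: 0xFFFF (full member) and 0x7FFF (limited member) share the base 0x7FFF and differ only in the membership bit. That is why the ib_types.h hunk below fixes IB_PKEY_BASE_MASK to CL_HTON16, and it means the two osm_physp_has_pkey() calls in the patch could equivalently be a single base comparison (sketch only, not part of the patch):

/* TRUE if pkey (network order) is the default P_Key, full or limited. */
static inline boolean_t pkey_is_default_base(ib_net16_t pkey)
{
	return (pkey & IB_PKEY_BASE_MASK) == IB_DEFAULT_PARTIAL_PKEY;
}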
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: osm_pkey_mgr.h 1743 2005-12-15 09:38:35Z oferg $ + */ + + +/* + * Abstract: + * Declaration of osm_pkey_mgr_t. + * This object represents the P_Key Manager object. + * This object is part of the OpenSM family of objects. + * + * Environment: + * Linux User Mode + * + * $Revision: 1.4 $ + */ + + +#ifndef _OSM_PKEY_MGR_H_ +#define _OSM_PKEY_MGR_H_ + +#include +#include +#include +#include + +#ifdef __cplusplus +# define BEGIN_C_DECLS extern "C" { +# define END_C_DECLS } +#else /* !__cplusplus */ +# define BEGIN_C_DECLS +# define END_C_DECLS +#endif /* __cplusplus */ + +BEGIN_C_DECLS + +/****h* OpenSM/P_Key Manager +* NAME +* P_Key Manager +* +* DESCRIPTION +* The P_Key Manager object manage the p_key tables of all +* objects in the subnet +* +* AUTHOR +* Ofer Gigi, Mellanox +* +*********/ +/****s* OpenSM: P_Key Manager/osm_pkey_mgr_t +* NAME +* osm_pkey_mgr_t +* +* DESCRIPTION +* p_Key Manager structure. +* +* +* SYNOPSIS +*/ + +typedef struct _osm_pkey_mgr +{ + osm_subn_t *p_subn; + osm_log_t *p_log; + osm_req_t *p_req; + cl_plock_t *p_lock; + +} osm_pkey_mgr_t; + +/* +* FIELDS +* p_subn +* Pointer to the Subnet object for this subnet. +* +* p_log +* Pointer to the log object. +* +* p_req +* Pointer to the Request object. +* +* p_lock +* Pointer to the serializing lock. +* +* SEE ALSO +* P_Key Manager object +*********/ + +/****** OpenSM: P_Key Manager/osm_pkey_mgr_construct +* NAME +* osm_pkey_mgr_construct +* +* DESCRIPTION +* This function constructs a P_Key Manager object. +* +* SYNOPSIS +*/ +void +osm_pkey_mgr_construct( + IN osm_pkey_mgr_t* const p_mgr ); +/* +* PARAMETERS +* p_mgr +* [in] Pointer to a P_Key Manager object to construct. +* +* RETURN VALUE +* This function does not return a value. +* +* NOTES +* Allows calling osm_pkey_mgr_init, osm_pkey_mgr_destroy +* +* Calling osm_pkey_mgr_construct is a prerequisite to calling any other +* method except osm_pkey_mgr_init. +* +* SEE ALSO +* P_Key Manager object, osm_pkey_mgr_init, +* osm_pkey_mgr_destroy +*********/ + +/****f* OpenSM: P_Key Manager/osm_pkey_mgr_destroy +* NAME +* osm_pkey_mgr_destroy +* +* DESCRIPTION +* The osm_pkey_mgr_destroy function destroys the object, releasing +* all resources. 
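For orientation, the call sequence this header prescribes is construct, init, process, destroy; a minimal usage sketch, assuming the p_subn/p_log/p_req/p_lock pointers come from the enclosing SM object the way osm_sm.c wires them later in this patch:

static osm_signal_t run_pkey_mgr_once(osm_subn_t *p_subn, osm_log_t *p_log,
				      osm_req_t *p_req, cl_plock_t *p_lock)
{
	osm_pkey_mgr_t mgr;
	osm_signal_t sig = OSM_SIGNAL_NONE;

	osm_pkey_mgr_construct(&mgr);
	if (osm_pkey_mgr_init(&mgr, p_subn, p_log, p_req, p_lock) ==
	    IB_SUCCESS)
		sig = osm_pkey_mgr_process(&mgr);	/* DONE or DONE_PENDING */
	osm_pkey_mgr_destroy(&mgr);
	return sig;
}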
+* +* SYNOPSIS +*/ +void +osm_pkey_mgr_destroy( + IN osm_pkey_mgr_t* const p_mgr ); +/* +* PARAMETERS +* p_mgr +* [in] Pointer to the object to destroy. +* +* RETURN VALUE +* This function does not return a value. +* +* NOTES +* Performs any necessary cleanup of the specified +* P_Key Manager object. +* Further operations should not be attempted on the destroyed object. +* This function should only be called after a call to +* osm_pkey_mgr_construct or osm_pkey_mgr_init. +* +* SEE ALSO +* P_Key Manager object, osm_pkey_mgr_construct, +* osm_pkey_mgr_init +*********/ + +/****f* OpenSM: P_Key Manager/osm_pkey_mgr_init +* NAME +* osm_pkey_mgr_init +* +* DESCRIPTION +* The osm_pkey_mgr_init function initializes a +* P_Key Manager object for use. +* +* SYNOPSIS +*/ +ib_api_status_t +osm_pkey_mgr_init( + IN osm_pkey_mgr_t* const p_mgr, + IN osm_subn_t* const p_subn, + IN osm_log_t* const p_log, + IN osm_req_t* const p_req, + IN cl_plock_t* const p_lock ); +/* +* PARAMETERS +* p_mgr +* [in] Pointer to an osm_pkey_mgr_t object to initialize. +* +* p_subn +* [in] Pointer to the Subnet object for this subnet. +* +* p_log +* [in] Pointer to the log object. +* +* p_req +* [in] Pointer to an osm_req_t object. +* +* p_lock +* [in] Pointer to the OpenSM serializing lock. +* +* RETURN VALUES +* IB_SUCCESS if the P_Key Manager object was initialized +* successfully. +* +* NOTES +* Allows calling other P_Key Manager methods. +* +* SEE ALSO +* P_Key Manager object, osm_pkey_mgr_construct, +* osm_pkey_mgr_destroy +*********/ + +/****f* OpenSM: P_Key Manager/osm_pkey_mgr_process +* NAME +* osm_pkey_mgr_process +* +* DESCRIPTION +* This function enforce pkey rules on the SM DB. +* +* SYNOPSIS +*/ +osm_signal_t +osm_pkey_mgr_process( + IN const osm_pkey_mgr_t* const p_mgr ); +/* +* PARAMETERS +* p_mgr +* [in] Pointer to an osm_pkey_mgr_t object. +* +* RETURN VALUES +* None +* +* NOTES +* Current Operations: +* - Inserts IB_DEFAULT_PKEY to all node objects that don't have +* IB_DEFAULT_PARTIAL_PKEY or IB_DEFAULT_PKEY as part +* of their p_key table +* +* SEE ALSO +* P_Key Manager +*********/ + +END_C_DECLS + +#endif /* _OSM_PKEY_MGR_H_ */ Index: include/opensm/osm_state_mgr.h =================================================================== --- include/opensm/osm_state_mgr.h (revision 4651) +++ include/opensm/osm_state_mgr.h (working copy) @@ -60,6 +60,7 @@ #include #include #include +#include #include #include @@ -112,6 +113,7 @@ typedef struct _osm_state_mgr osm_mcast_mgr_t *p_mcast_mgr; osm_link_mgr_t *p_link_mgr; osm_drop_mgr_t *p_drop_mgr; + osm_pkey_mgr_t *p_pkey_mgr; osm_req_t *p_req; osm_stats_t *p_stats; struct _osm_sm_state_mgr *p_sm_state_mgr; @@ -149,6 +151,9 @@ typedef struct _osm_state_mgr * p_drop_mgr * Pointer to the Drop Manager object. * +* p_pkey_mgr +* Pointer to the P_Key Manager object. +* * p_req * Pointer to the Requester object sending SMPs. * @@ -374,6 +379,7 @@ osm_state_mgr_init( IN osm_mcast_mgr_t* const p_mcast_mgr, IN osm_link_mgr_t* const p_link_mgr, IN osm_drop_mgr_t* const p_drop_mgr, + IN osm_pkey_mgr_t* const p_pkey_mgr, IN osm_req_t* const p_req, IN osm_stats_t* const p_stats, IN struct _osm_sm_state_mgr* const p_sm_state_mgr, @@ -405,6 +411,9 @@ osm_state_mgr_init( * p_drop_mgr * [in] Pointer to the Drop Manager object. * +* p_pkey_mgr +* [in] Pointer to the P_Key Manager object. +* * p_req * [in] Pointer to the Request Controller object. 
* Index: include/opensm/osm_base.h =================================================================== --- include/opensm/osm_base.h (revision 4651) +++ include/opensm/osm_base.h (working copy) @@ -578,8 +578,10 @@ typedef enum _osm_sm_state OSM_SM_STATE_PROCESS_REQUEST_WAIT, OSM_SM_STATE_PROCESS_REQUEST_DONE, OSM_SM_STATE_MASTER_OR_HIGHER_SM_DETECTED, + OSM_SM_STATE_SET_PKEY, + OSM_SM_STATE_SET_PKEY_WAIT, + OSM_SM_STATE_SET_PKEY_DONE, OSM_SM_STATE_MAX - } osm_sm_state_t; /***********/ Index: include/opensm/osm_sm.h =================================================================== --- include/opensm/osm_sm.h (revision 4651) +++ include/opensm/osm_sm.h (working copy) @@ -74,6 +74,7 @@ #include #include #include +#include #include #include #include @@ -161,6 +162,7 @@ typedef struct _osm_sm osm_link_mgr_t link_mgr; osm_state_mgr_t state_mgr; osm_drop_mgr_t drop_mgr; + osm_pkey_mgr_t pkey_mgr; osm_lft_rcv_t lft_rcv; osm_lft_rcv_ctrl_t lft_rcv_ctrl; osm_mft_rcv_t mft_rcv; Index: include/opensm/osm_pkey.h =================================================================== --- include/opensm/osm_pkey.h (revision 4651) +++ include/opensm/osm_pkey.h (working copy) @@ -164,7 +164,7 @@ void osm_pkey_tbl_destroy( * osm_pkey_get_num_blocks * * DESCRIPTION -* Obtain the pointer to the IB PKey table block stored in the object +* Obtain the number of blocks in IB PKey table * * SYNOPSIS */ Index: include/iba/ib_types.h =================================================================== --- include/iba/ib_types.h (revision 4651) +++ include/iba/ib_types.h (working copy) @@ -368,7 +368,7 @@ BEGIN_C_DECLS * * SOURCE */ -#define IB_PKEY_BASE_MASK (CL_NTOH16(0x7FFF)) +#define IB_PKEY_BASE_MASK (CL_HTON16(0x7FFF)) /*********/ /****d* IBA Base: Constants/IB_PKEY_TYPE_MASK @@ -383,6 +383,18 @@ BEGIN_C_DECLS #define IB_PKEY_TYPE_MASK (CL_NTOH16(0x8000)) /*********/ +/****d* IBA Base: Constants/IB_DEFAULT_PARTIAL_PKEY +* NAME +* IB_DEFAULT_PARTIAL_PKEY +* +* DESCRIPTION +* 0x7fff in network order +* +* SOURCE +*/ +#define IB_DEFAULT_PARTIAL_PKEY (CL_HTON16(0x7fff)) +/**********/ + /****d* IBA Base: Constants/IB_MCLASS_SUBN_LID * NAME * IB_MCLASS_SUBN_LID Index: opensm/osm_state_mgr.c =================================================================== --- opensm/osm_state_mgr.c (revision 4651) +++ opensm/osm_state_mgr.c (working copy) @@ -108,6 +108,7 @@ osm_state_mgr_init( IN osm_mcast_mgr_t * const p_mcast_mgr, IN osm_link_mgr_t * const p_link_mgr, IN osm_drop_mgr_t * const p_drop_mgr, + IN osm_pkey_mgr_t * const p_pkey_mgr, IN osm_req_t * const p_req, IN osm_stats_t * const p_stats, IN osm_sm_state_mgr_t * const p_sm_state_mgr, @@ -127,6 +128,7 @@ osm_state_mgr_init( CL_ASSERT( p_mcast_mgr ); CL_ASSERT( p_link_mgr ); CL_ASSERT( p_drop_mgr ); + CL_ASSERT( p_pkey_mgr ); CL_ASSERT( p_req ); CL_ASSERT( p_stats ); CL_ASSERT( p_sm_state_mgr ); @@ -143,6 +145,7 @@ osm_state_mgr_init( p_mgr->p_mcast_mgr = p_mcast_mgr; p_mgr->p_link_mgr = p_link_mgr; p_mgr->p_drop_mgr = p_drop_mgr; + p_mgr->p_pkey_mgr = p_pkey_mgr; p_mgr->p_mad_ctrl = p_mad_ctrl; p_mgr->p_req = p_req; p_mgr->p_stats = p_stats; @@ -2216,9 +2219,11 @@ osm_state_mgr_process( } } } + /* Need to continue with lid assigning */ osm_drop_mgr_process( p_mgr->p_drop_mgr ); - p_mgr->state = OSM_SM_STATE_SET_SM_UCAST_LID; + + p_mgr->state = OSM_SM_STATE_SET_PKEY; /* * If we are not MASTER already - this means that we are @@ -2229,6 +2234,62 @@ osm_state_mgr_process( osm_sm_state_mgr_process( p_mgr->p_sm_state_mgr, OSM_SM_SIGNAL_DISCOVERY_COMPLETED ); + /* signal = 
osm_lid_mgr_process_sm( p_mgr->p_lid_mgr ); */ + /* the returned signal might be DONE or DONE_PENDING */ + signal = osm_pkey_mgr_process( p_mgr->p_pkey_mgr ); + break; + + default: + __osm_state_mgr_signal_error( p_mgr, signal ); + signal = OSM_SIGNAL_NONE; + break; + } + break; + + case OSM_SM_STATE_SET_PKEY: + switch ( signal ) + { + case OSM_SIGNAL_DONE: + p_mgr->state = OSM_SM_STATE_SET_PKEY_DONE; + break; + + case OSM_SIGNAL_DONE_PENDING: + /* + * There are outstanding transactions, so we + * must wait for the wire to clear. + */ + p_mgr->state = OSM_SM_STATE_SET_PKEY_WAIT; + signal = OSM_SIGNAL_NONE; + break; + + default: + __osm_state_mgr_signal_error( p_mgr, signal ); + signal = OSM_SIGNAL_NONE; + break; + } + break; + + case OSM_SM_STATE_SET_PKEY_WAIT: + switch ( signal ) + { + case OSM_SIGNAL_NO_PENDING_TRANSACTIONS: + p_mgr->state = OSM_SM_STATE_SET_PKEY_DONE; + break; + + default: + __osm_state_mgr_signal_error( p_mgr, signal ); + signal = OSM_SIGNAL_NONE; + break; + } + break; + + case OSM_SM_STATE_SET_PKEY_DONE: + switch ( signal ) + { + case OSM_SIGNAL_NO_PENDING_TRANSACTIONS: + case OSM_SIGNAL_DONE: + + p_mgr->state = OSM_SM_STATE_SET_SM_UCAST_LID; signal = osm_lid_mgr_process_sm( p_mgr->p_lid_mgr ); break; @@ -2237,6 +2298,7 @@ osm_state_mgr_process( signal = OSM_SIGNAL_NONE; break; } + break; case OSM_SM_STATE_SET_SM_UCAST_LID: Index: opensm/osm_helper.c =================================================================== --- opensm/osm_helper.c (revision 4651) +++ opensm/osm_helper.c (working copy) @@ -1698,19 +1698,22 @@ const char* const __osm_sm_state_str[] = "OSM_SM_STATE_SET_LINK_PORTS", /* 19 */ "OSM_SM_STATE_SET_LINK_PORTS_WAIT", /* 20 */ "OSM_SM_STATE_SET_LINK_PORTS_DONE", /* 21 */ - "OSM_SM_STATE_SET_ARMED", /* 19 */ - "OSM_SM_STATE_SET_ARMED_WAIT", /* 20 */ - "OSM_SM_STATE_SET_ARMED_DONE", /* 21 */ - "OSM_SM_STATE_SET_ACTIVE", /* 22 */ - "OSM_SM_STATE_SET_ACTIVE_WAIT", /* 23 */ - "OSM_SM_STATE_LOST_NEGOTIATION", /* 24 */ - "OSM_SM_STATE_STANDBY", /* 25 */ - "OSM_SM_STATE_SUBNET_UP", /* 26 */ - "OSM_SM_STATE_PROCESS_REQUEST", /* 27 */ - "OSM_SM_STATE_PROCESS_REQUEST_WAIT", /* 28 */ - "OSM_SM_STATE_PROCESS_REQUEST_DONE", /* 29 */ - "OSM_SM_STATE_MASTER_OR_HIGHER_SM_DETECTED", /* 30 */ - "UNKNOWN STATE!!" /* 31 */ + "OSM_SM_STATE_SET_ARMED", /* 22 */ + "OSM_SM_STATE_SET_ARMED_WAIT", /* 23 */ + "OSM_SM_STATE_SET_ARMED_DONE", /* 24 */ + "OSM_SM_STATE_SET_ACTIVE", /* 25 */ + "OSM_SM_STATE_SET_ACTIVE_WAIT", /* 26 */ + "OSM_SM_STATE_LOST_NEGOTIATION", /* 27 */ + "OSM_SM_STATE_STANDBY", /* 28 */ + "OSM_SM_STATE_SUBNET_UP", /* 29 */ + "OSM_SM_STATE_PROCESS_REQUEST", /* 30 */ + "OSM_SM_STATE_PROCESS_REQUEST_WAIT", /* 31 */ + "OSM_SM_STATE_PROCESS_REQUEST_DONE", /* 32 */ + "OSM_SM_STATE_MASTER_OR_HIGHER_SM_DETECTED",/* 33 */ + "OSM_SM_STATE_SET_PKEY", /* 34 */ + "OSM_SM_STATE_SET_PKEY_WAIT", /* 35 */ + "OSM_SM_STATE_SET_PKEY_DONE", /* 36 */ + "UNKNOWN STATE!!" 
/* 37 */ }; const char* const __osm_sm_signal_str[] = Index: opensm/osm_sm.c =================================================================== --- opensm/osm_sm.c (revision 4651) +++ opensm/osm_sm.c (working copy) @@ -67,6 +67,7 @@ #include #include #include +#include #include #include #include @@ -164,6 +165,7 @@ osm_sm_construct( osm_state_mgr_construct( &p_sm->state_mgr ); osm_state_mgr_ctrl_construct( &p_sm->state_mgr_ctrl ); osm_drop_mgr_construct( &p_sm->drop_mgr ); + osm_pkey_mgr_construct( &p_sm->pkey_mgr ); osm_lft_rcv_construct( &p_sm->lft_rcv ); osm_lft_rcv_ctrl_construct( &p_sm->lft_rcv_ctrl ); osm_mft_rcv_construct( &p_sm->mft_rcv ); @@ -253,6 +255,7 @@ osm_sm_destroy( osm_ucast_mgr_destroy( &p_sm->ucast_mgr ); osm_link_mgr_destroy( &p_sm->link_mgr ); osm_drop_mgr_destroy( &p_sm->drop_mgr ); + osm_pkey_mgr_destroy( &p_sm->pkey_mgr ); osm_lft_rcv_destroy( &p_sm->lft_rcv ); osm_mft_rcv_destroy( &p_sm->mft_rcv ); osm_slvl_rcv_destroy( &p_sm->slvl_rcv ); @@ -410,6 +413,7 @@ osm_sm_init( &p_sm->mcast_mgr, &p_sm->link_mgr, &p_sm->drop_mgr, + &p_sm->pkey_mgr, &p_sm->req, p_stats, &p_sm->sm_state_mgr, @@ -432,6 +436,12 @@ osm_sm_init( if( status != IB_SUCCESS ) goto Exit; + status = osm_pkey_mgr_init( &p_sm->pkey_mgr, + p_sm->p_subn, + p_sm->p_log, &p_sm->req, p_sm->p_lock ); + if( status != IB_SUCCESS ) + goto Exit; + status = osm_lft_rcv_init( &p_sm->lft_rcv, p_subn, p_log, p_lock ); if( status != IB_SUCCESS ) goto Exit; Index: opensm/osm_indent =================================================================== --- opensm/osm_indent (revision 4651) +++ opensm/osm_indent (working copy) @@ -63,8 +63,8 @@ # -i3 Substitute indent with 3 spaces # -npcs No space after procedure calls # -prs Space after parenthesis -# -nsai No space after if keyword -# -nsaw No space after while keyword +# -nsai No space after if keyword - removed +# -nsaw No space after while keyword - removed # -sc Put * at left of comments in a block comment style # -nsob Don't swallow unnecessary blank lines # -ts3 Tab size is 3 @@ -81,7 +81,7 @@ for sourcefile in $*; do perl -piW -e's/\x0D//' "$sourcefile" echo Processing $sourcefile indent -bad -bap -bbb -nbbo -bl -bli0 -bls -cbi0 -ci3 -cli0 -ncs \ - -hnl -i3 -npcs -prs -nsai -nsaf -nsaw -sc -nsob -ts3 -psl -bfda -nut $sourcefile + -hnl -i3 -npcs -prs -sc -nsob -ts3 -psl -bfda -nut $sourcefile rm ${sourcefile}W Index: opensm/Makefile.am =================================================================== --- opensm/Makefile.am (revision 4651) +++ opensm/Makefile.am (working copy) @@ -32,7 +32,7 @@ opensm_SOURCES = main.c osm_console.c os osm_mcm_port.c osm_mtree.c osm_multicast.c osm_node.c \ osm_node_desc_rcv.c osm_node_desc_rcv_ctrl.c \ osm_node_info_rcv.c osm_node_info_rcv_ctrl.c \ - osm_opensm.c osm_pkey.c osm_pkey_rcv.c \ + osm_opensm.c osm_pkey.c osm_pkey_mgr.c osm_pkey_rcv.c \ osm_pkey_rcv_ctrl.c osm_port.c \ osm_port_info_rcv.c osm_port_info_rcv_ctrl.c \ osm_remote_sm.c osm_req.c osm_req_ctrl.c \ Index: opensm/osm_pkey_mgr.c =================================================================== --- opensm/osm_pkey_mgr.c (revision 0) +++ opensm/osm_pkey_mgr.c (revision 0) @@ -0,0 +1,294 @@ +/* + * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: osm_pkey_mgr.c 4278 2005-12-15 13:50:55Z oferg $ + */ + + +/* + * Abstract: + * Implementation of osm_pkey_mgr_t. + * This object represents the P_Key Manager object. + * This object is part of the opensm family of objects. + * + * Environment: + * Linux User Mode + * + * $Revision: 1.7 $ + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include +#include +#include + +/********************************************************************** + **********************************************************************/ +void +osm_pkey_mgr_construct( + IN osm_pkey_mgr_t * const p_mgr ) +{ + CL_ASSERT( p_mgr ); + cl_memclr( p_mgr, sizeof( *p_mgr ) ); +} + +/********************************************************************** + **********************************************************************/ +void +osm_pkey_mgr_destroy( + IN osm_pkey_mgr_t * const p_mgr ) +{ + CL_ASSERT( p_mgr ); + + OSM_LOG_ENTER( p_mgr->p_log, osm_pkey_mgr_destroy ); + + OSM_LOG_EXIT( p_mgr->p_log ); +} + +/********************************************************************** + **********************************************************************/ +ib_api_status_t +osm_pkey_mgr_init( + IN osm_pkey_mgr_t * const p_mgr, + IN osm_subn_t * const p_subn, + IN osm_log_t * const p_log, + IN osm_req_t * const p_req, + IN cl_plock_t * const p_lock ) +{ + ib_api_status_t status = IB_SUCCESS; + + OSM_LOG_ENTER( p_log, osm_pkey_mgr_init ); + + osm_pkey_mgr_construct( p_mgr ); + + p_mgr->p_log = p_log; + p_mgr->p_subn = p_subn; + p_mgr->p_lock = p_lock; + p_mgr->p_req = p_req; + + OSM_LOG_EXIT( p_mgr->p_log ); + return ( status ); +} + +/********************************************************************** + **********************************************************************/ +boolean_t +__osm_pkey_mgr_process_physical_port( + IN const osm_pkey_mgr_t * const p_mgr, + IN osm_node_t * p_node, + IN uint8_t port_num, + IN osm_physp_t * p_physp ) +{ + boolean_t return_val = FALSE; /* TRUE if IB_DEFAULT_PKEY was inserted */ + osm_madw_context_t context; + ib_pkey_table_t *block; + uint16_t block_index; + uint16_t num_of_blocks; + const osm_pkey_tbl_t *p_pkey_tbl; + uint32_t attr_mod; + uint32_t i; + ib_net16_t pkey; + ib_api_status_t status; + boolean_t 
block_with_empty_entry_found; + + OSM_LOG_ENTER( p_mgr->p_log, __osm_pkey_mgr_process_physical_port ); + + /* + * Send a new entry for the pkey table for this node that includes + * IB_DEFAULT_PKEY when IB_DEFAULT_PARTIAL_PKEY or IB_DEFAULT_PKEY + * doesn't exist + */ + + if ( ( osm_physp_has_pkey( p_mgr->p_log, + IB_DEFAULT_PKEY, + p_physp ) == FALSE ) && + ( osm_physp_has_pkey( p_mgr->p_log, + IB_DEFAULT_PARTIAL_PKEY, p_physp ) == FALSE ) ) + { + context.pkey_context.node_guid = osm_node_get_node_guid( p_node ); + context.pkey_context.port_guid = osm_physp_get_port_guid( p_physp ); + context.pkey_context.set_method = TRUE; + + p_pkey_tbl = osm_physp_get_pkey_tbl( p_physp ); + num_of_blocks = osm_pkey_tbl_get_num_blocks( p_pkey_tbl ); + block_with_empty_entry_found = FALSE; + + for ( block_index = 0; block_index < num_of_blocks; block_index++ ) + { + block = osm_pkey_tbl_block_get( p_pkey_tbl, block_index ); + for ( i = 0; i < IB_NUM_PKEY_ELEMENTS_IN_BLOCK; i++ ) + { + pkey = block->pkey_entry[i]; + if ( ib_pkey_is_invalid( pkey ) ) + { + block->pkey_entry[i] = IB_DEFAULT_PKEY; + block_with_empty_entry_found = TRUE; + break; + } + } + if ( block_with_empty_entry_found ) + { + break; + } + } + + if ( block_with_empty_entry_found == FALSE ) + { + osm_log( p_mgr->p_log, OSM_LOG_ERROR, + "__osm_pkey_mgr_process_physical_port: ERR 0501: " + "No empty block was found to insert IB_DEFAULT_PKEY for node " + "0x%016" PRIx64 " and port %u.\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), port_num ); + } + else + { + /* Building the attribute modifier */ + if ( osm_node_get_type( p_node ) == IB_NODE_TYPE_SWITCH ) + { + /* Port num | Block Index */ + attr_mod = port_num << 16 | block_index; + } + else + { + attr_mod = block_index; + } + + status = osm_req_set( p_mgr->p_req, + osm_physp_get_dr_path_ptr( p_physp ), + ( uint8_t * ) block, + sizeof( *block ), + IB_MAD_ATTR_P_KEY_TABLE, + cl_hton32( attr_mod ), + CL_DISP_MSGID_NONE, &context ); + return_val = TRUE; /*IB_DEFAULT_PKEY was inserted */ + + if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) + { + osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, + "__osm_pkey_mgr_process_physical_port: " + "IB_DEFAULT_PKEY was inserted for node 0x%016" PRIx64 + " and port %u.\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + port_num ); + } + } + } + else + { + /* default key or partial default key already exist */ + if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) + { + osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, + "__osm_pkey_mgr_process_physical_port: " + "No need to insert IB_DEFAULT_PKEY for node 0x%016" PRIx64 + " port %u.\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), port_num ); + } + } + + OSM_LOG_EXIT( p_mgr->p_log ); + return ( return_val ); +} + +/********************************************************************** + **********************************************************************/ +osm_signal_t +osm_pkey_mgr_process( + IN const osm_pkey_mgr_t * const p_mgr ) +{ + cl_qmap_t *p_node_guid_tbl; + osm_node_t *p_node; + osm_node_t *p_next_node; + + uint32_t port_num; + osm_physp_t *p_physp; + osm_signal_t result = OSM_SIGNAL_DONE; + + CL_ASSERT( p_mgr ); + + OSM_LOG_ENTER( p_mgr->p_log, osm_pkey_mgr_process ); + + p_node_guid_tbl = &p_mgr->p_subn->node_guid_tbl; + + CL_PLOCK_EXCL_ACQUIRE( p_mgr->p_lock ); + + p_next_node = ( osm_node_t * ) cl_qmap_head( p_node_guid_tbl ); + while ( p_next_node != ( osm_node_t * ) cl_qmap_end( p_node_guid_tbl ) ) + { + p_node = p_next_node; + p_next_node = ( osm_node_t * ) cl_qmap_next( 
&p_next_node->map_item ); + + for ( port_num = 0; port_num < osm_node_get_num_physp( p_node ); + port_num++ ) + { + p_physp = osm_node_get_physp_ptr( p_node, port_num ); + if ( osm_physp_is_valid( p_physp ) ) + { + + if ( __osm_pkey_mgr_process_physical_port + ( p_mgr, p_node, port_num, p_physp ) ) + { + if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) + { + osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, + "osm_pkey_mgr_process: " + "Adding IB_DEFAULT_PKEY for pkey table of node " + "0x%016" PRIx64 " port %u.\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + port_num ); + } + result = OSM_SIGNAL_DONE_PENDING; + } + } + else + { + osm_log( p_mgr->p_log, OSM_LOG_ERROR, + "osm_pkey_mgr_process: ERR 0502: " + "Non-Valid physical port for node 0x%016" PRIx64 + " port %u.\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + port_num ); + } + } + } + + + CL_PLOCK_RELEASE( p_mgr->p_lock ); + OSM_LOG_EXIT( p_mgr->p_log ); + return ( result ); +} From bos at pathscale.com Thu Dec 29 06:15:40 2005 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Dec 2005 06:15:40 -0800 Subject: [openib-general] Re: [PATCH 5 of 20] ipath - driver core header files In-Reply-To: <84144f020512290018q189b0e34pc5cba9b251a8914b@mail.gmail.com> References: <2d9a3f27a10c8f11df92.1135816284@eng-12.pathscale.com> <84144f020512290018q189b0e34pc5cba9b251a8914b@mail.gmail.com> Message-ID: <1135865740.7790.0.camel@localhost.localdomain> On Thu, 2005-12-29 at 10:18 +0200, Pekka Enberg wrote: > > +void ipath_dwordcpy(uint32_t *dest, uint32_t *src, uint32_t ndwords); > Wasn't this supposed to be killed? The routine itself is dead, but the prototype survived. Thanks for spotting that. References: Message-ID: <1135866098.7790.2.camel@localhost.localdomain> On Wed, 2005-12-28 at 18:19 -0800, Roland Dreier wrote: > This seems very wrong to me: there's no guarantee that a module will > be loaded into memory that can be used as a DMA target. Right. I think we're just getting lucky on x86_64. I'll fix this. Thanks, From bos at pathscale.com Thu Dec 29 06:22:58 2005 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Dec 2005 06:22:58 -0800 Subject: [openib-general] Re: [PATCH 14 of 20] ipath - infiniband verbs header In-Reply-To: <84144f020512290021x34a028eck290c238a24bd14d1@mail.gmail.com> References: <26993cb5faeef807a840.1135816293@eng-12.pathscale.com> <84144f020512290021x34a028eck290c238a24bd14d1@mail.gmail.com> Message-ID: <1135866178.7790.5.camel@localhost.localdomain> On Thu, 2005-12-29 at 10:21 +0200, Pekka Enberg wrote: > Please consider using dev_dbg, dev_err, and friends from . Will do, thanks. References: <074q4sfbm8.fsf@swlab25.yok.mtl.com> Message-ID: <1135866212.4341.2334.camel@hal.voltaire.com> Hi Ofer, On Thu, 2005-12-29 at 05:20, Ofer Gigi wrote: > Hi Hal, > > My name is Ofer Gigi, and I am a new software engineer in Mellanox > working on OpenSM. > This patch provides a new manager that solves the following problem: Thanks! I am starting to work on integrating this. I have a couple of questions and some comments and questions embedded below. How are the PKeys configured ? Does the SM always go through the extra states to set PKeys ? What if there are none to be set ? > OpenSM is not currently compliant to the spec statement: > C14.62.1.1 Table 183 p870 l34: > "However, the SM shall ensure that one of the P_KeyTable entries in every > node contains either the value 0xFFFF (the default P_Key, full membership) > or the value 0x7FFF (the default P_Key, partial membership)." 
> > Luckily, all IB devices comes up from reset with preconfigured 0xffff key. > This was discovered during last plugfest. > To overcome this limitation I implemented a simple elementary PKey manager > that will enforce the above rule (currently adds 0xffff if missing). Might 0x7fff be better in this case ? > This additional manager would be used for a full PKey policy manager > in the future. Can you elaborate on any plans in this area as to timing and what "full" means ? > We have tested this patch Can you elaborate on the testing ? Thanks again. -- Hal From hch at lst.de Thu Dec 29 06:42:22 2005 From: hch at lst.de (Christoph Hellwig) Date: Thu, 29 Dec 2005 15:42:22 +0100 Subject: [openib-general] PathScale license In-Reply-To: <20051228020255.GA3280@cuprite.internal.keyresearch.com> References: <1135363454.4328.95007.camel@hal.voltaire.com> <20051228020255.GA3280@cuprite.internal.keyresearch.com> Message-ID: <20051229144221.GA29260@lst.de> On Tue, Dec 27, 2005 at 06:02:55PM -0800, Johann George wrote: > We have heard the issues that have been raised regarding the PathScale > license. PathScale's intent is solely to protect its hardware IP and not to > limit use of the software in any way. > > PathScale's use of this language is not original. SGI has used, and perhaps > originated, the additional language. It currently appears in several files > in the Linux kernel. As an example, see fs/xfs/linux-2.6/kmem.c XFS has been switched to a normal short GPL boilerplate exactly because this wording is not okay. From oferg at mellanox.co.il Thu Dec 29 06:56:30 2005 From: oferg at mellanox.co.il (Ofer Gigi) Date: Thu, 29 Dec 2005 16:56:30 +0200 Subject: [openib-general] RE: [PATCH] osm: support for trivial PKey manager Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3FA311D@mtlexch01.mtl.com> Hi Hal, Every time, between the stages of "Heavy Sweep" and "Set Ucast Lids", the stage of the PKey Manager is done. During this stage, the SM goes through its database and verifies that each valid physical port has either 0x7fff (IB_DEFAULT_PARTIAL_PKEY) or 0xffff (IB_DEFAULT_PKEY). If it does have, nothing is done. If it doesn't, a set request is sent to the physical port with an updated block containing 0xffff. Please email me back if you have further questions. Thanks! Ofer -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Thursday, December 29, 2005 4:24 PM To: Ofer Gigi Cc: OPENIB Subject: Re: [PATCH] osm: support for trivial PKey manager Hi Ofer, On Thu, 2005-12-29 at 05:20, Ofer Gigi wrote: > Hi Hal, > > My name is Ofer Gigi, and I am a new software engineer in Mellanox > working on OpenSM. > This patch provides a new manager that solves the following problem: Thanks! I am starting to work on integrating this. I have a couple of questions and some comments and questions embedded below. How are the PKeys configured ? Does the SM always go through the extra states to set PKeys ? What if there are none to be set ? > OpenSM is not currently compliant to the spec statement: > C14.62.1.1 Table 183 p870 l34: > "However, the SM shall ensure that one of the P_KeyTable entries in every > node contains either the value 0xFFFF (the default P_Key, full membership) > or the value 0x7FFF (the default P_Key, partial membership)." > > Luckily, all IB devices comes up from reset with preconfigured 0xffff key. > This was discovered during last plugfest. 
> To overcome this limitation I implemented a simple elementary PKey manager
> that will enforce the above rule (currently adds 0xffff if missing).

Might 0x7fff be better in this case ?

> This additional manager would be used for a full PKey policy manager
> in the future.

Can you elaborate on any plans in this area as to timing and what "full"
means ?

> We have tested this patch

Can you elaborate on the testing ?

Thanks again.

-- Hal

From oferg at mellanox.co.il  Thu Dec 29 08:14:01 2005
From: oferg at mellanox.co.il (Ofer Gigi)
Date: Thu, 29 Dec 2005 18:14:01 +0200
Subject: [openib-general] RE: [PATCH] osm: support for trivial PKey manager
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3FA314E@mtlexch01.mtl.com>

Hi Hal,
Sorry, I didn't see there are other questions below....

Here are the answers:

1. Might 0x7fff be better in this case ?
As I said, currently all IB devices come up from reset with preconfigured
0xffff key. Since we don't want to be inconsistent, we decided to insert
0xffff.

2. Can you elaborate on any plans in this area as to timing and what "full"
means ?
Since all IB devices come up from reset with 0xffff, and anyway, we insert
it when 0x7fff or 0xffff doesn't exist, currently all physical ports belong
to the same Pkey group, so there is no meaning to pkey groups.
In the future, we would like to have a Pkey manager that will manage
different Pkey groups (that don't share the same pkeys) for the physical
ports.

3. Can you elaborate on the testing ?
In the test, the following steps were done:
1. Remove 0x7fff or 0xffff from all physical ports
2. Force full sweep.
3. Verified that all physical ports have now 0xffff

Thanks!

Ofer

-----Original Message-----
From: Hal Rosenstock [mailto:halr at voltaire.com]
Sent: Thursday, December 29, 2005 4:24 PM
To: Ofer Gigi
Cc: OPENIB
Subject: Re: [PATCH] osm: support for trivial PKey manager

Hi Ofer,

On Thu, 2005-12-29 at 05:20, Ofer Gigi wrote:
> Hi Hal,
>
> My name is Ofer Gigi, and I am a new software engineer in Mellanox
> working on OpenSM.
> This patch provides a new manager that solves the following problem:

Thanks! I am starting to work on integrating this. I have a couple of
questions and some comments and questions embedded below.

How are the PKeys configured ?

Does the SM always go through the extra states to set PKeys ? What if
there are none to be set ?
> OpenSM is not currently compliant to the spec statement: > C14.62.1.1 Table 183 p870 l34: > "However, the SM shall ensure that one of the P_KeyTable entries in every > node contains either the value 0xFFFF (the default P_Key, full membership) > or the value 0x7FFF (the default P_Key, partial membership)." > > Luckily, all IB devices comes up from reset with preconfigured 0xffff key. > This was discovered during last plugfest. > To overcome this limitation I implemented a simple elementary PKey manager > that will enforce the above rule (currently adds 0xffff if missing). Might 0x7fff be better in this case ? > This additional manager would be used for a full PKey policy manager > in the future. Can you elaborate on any plans in this area as to timing and what "full" means ? > We have tested this patch Can you elaborate on the testing ? Thanks again. -- Hal From halr at voltaire.com Thu Dec 29 08:21:38 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Dec 2005 11:21:38 -0500 Subject: [openib-general] RE: [PATCH] osm: support for trivial PKey manager In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3FA314E@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3FA314E@mtlexch01.mtl.com> Message-ID: <1135873298.4341.3078.camel@hal.voltaire.com> Hi again Ofer, On Thu, 2005-12-29 at 11:14, Ofer Gigi wrote: > Hi Hal, > Sorry, I didn't see there are other questions below.... > > Here are the answers: > 1. Might 0x7fff be better in this case ? > As I said, currently all IB devices come up from reset with > preconfigured 0xffff key. Since we don't want to be inconsistent, we > decided to insert 0xffff. In the long run, 0x7fff might be better for these ports. Making every port in the default full partition makes them accessible to everyone. > 2. Can you elaborate on any plans in this area as to timing and what > "full" > means ? > Since all IB devices come up from reset with 0xffff, and anyway, we > insert it when 0x7fff or 0xffff doesn't exist, currently all physical > ports belong to the same Pkey group, so there is no meaning to pkey > groups. > In the future, we would like to have a Pkey manager that will manage > different Pkey groups (that don't share the same pkeys) for the physical > ports. So what is there now just ensures the full (or limited) default partition is configured on every port where pkeys are supported. > 3. Can you elaborate on the testing ? > In the test, the following steps were done: > 1. Remove 0x7fff or 0xffff from all physical ports How is this step done ? > 2. Force full sweep. > 3. Verified that all physical ports have now 0xffff Thanks. -- Hal > Thanks! > Ofer From vonbrand at inf.utfsm.cl Thu Dec 29 11:01:53 2005 From: vonbrand at inf.utfsm.cl (Horst von Brand) Date: Thu, 29 Dec 2005 16:01:53 -0300 Subject: [openib-general] Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver In-Reply-To: Message from "Bryan O'Sullivan" of "Wed, 28 Dec 2005 16:31:19 -0800." Message-ID: <200512291901.jBTJ1rOm017519@laptop11.inf.utfsm.cl> Bryan O'Sullivan wrote: > Following Roland's submission of our InfiniPath InfiniBand HCA driver > earlier this month, we have responded to people's comments by making a > large number of changes to the driver. Many thanks! > Here is another set of driver patches for review. Roland is on > vacation until January 4, so I'm posting these in his place. Once > again, your comments are appreciated. We'd like to submit this driver > for inclusion in 2.6.16, so we'll be responding quickly to all > feedback. 
> > A short summary of the changes we have made is as follows: Some comments, just based on this: [...] > - Renamed _BITS_PER_BYTE to BITS_PER_BYTE, and moved it into > linux/types.h Haven't come across anything with this not 8 for a /long/ time now, and no Linux on that in sight. [...] > There are a few requested changes we have chosen to omit for now: > > - The driver still uses EXPORT_SYMBOL, for consistency with other > code in drivers/infiniband I'd suppose that is your choice... > - Someone asked for the kernel's i2c infrastructure to be used, but > our i2c usage is very specialised, and it would be more of a mess > to use the kernel's Problem with that is that if everybody and Aunt Tillie does the same, the kernel as a whole gets to be a mess. > - We're still using ioctls instead of sysfs or configfs in some > cases, to maintain userspace compatibility With what? You can very well ask people to upgrade to the latest userland utilities, and even make them run the old versions when they find that the new interface isn't there. Happened recently with modprobe/modutils. -- Dr. Horst H. von Brand User #22616 counter.li.org Departamento de Informatica Fono: +56 32 654431 Universidad Tecnica Federico Santa Maria +56 32 654239 Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513 From rlrevell at joe-job.com Thu Dec 29 11:26:24 2005 From: rlrevell at joe-job.com (Lee Revell) Date: Thu, 29 Dec 2005 14:26:24 -0500 Subject: [openib-general] Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver In-Reply-To: <200512291901.jBTJ1rOm017519@laptop11.inf.utfsm.cl> References: <200512291901.jBTJ1rOm017519@laptop11.inf.utfsm.cl> Message-ID: <1135884385.6804.0.camel@mindpipe> On Thu, 2005-12-29 at 16:01 -0300, Horst von Brand wrote: > > - Someone asked for the kernel's i2c infrastructure to be used,but > > our i2c usage is very specialised, and it would be more of a mess > > to use the kernel's > > Problem with that is that if everybody and Aunt Tillie does the same, > the kernel as a whole gets to be a mess. ALSA does the exact same thing for the exact same reason. Maybe an indication that the kernel's i2c layer is too heavy? Lee From penberg at cs.helsinki.fi Thu Dec 29 11:24:38 2005 From: penberg at cs.helsinki.fi (Pekka Enberg) Date: Thu, 29 Dec 2005 21:24:38 +0200 Subject: [openib-general] Re: [PATCH 17 of 20] ipath - infiniband verbs support, part 3 of 3 In-Reply-To: <584777b6f4dc5269fa89.1135816296@eng-12.pathscale.com> References: <584777b6f4dc5269fa89.1135816296@eng-12.pathscale.com> Message-ID: <84144f020512291124sd895dfbp87ca9fd75552d671@mail.gmail.com> Hi, [Copy-paste reuse alert!] On 12/29/05, Bryan O'Sullivan wrote: > +static struct ib_mr *ipath_reg_phys_mr(struct ib_pd *pd, > + struct ib_phys_buf *buffer_list, > + int num_phys_buf, > + int acc, u64 *iova_start) > +{ > + struct ipath_mr *mr; > + int n, m, i; > + > + /* Allocate struct plus pointers to first level page tables. */ > + m = (num_phys_buf + IPATH_SEGSZ - 1) / IPATH_SEGSZ; > + mr = kmalloc(sizeof *mr + m * sizeof mr->mr.map[0], GFP_KERNEL); > + if (!mr) > + return ERR_PTR(-ENOMEM); > + > + /* Allocate first level page tables. 
*/ > + for (i = 0; i < m; i++) { > + mr->mr.map[i] = kmalloc(sizeof *mr->mr.map[0], GFP_KERNEL); > + if (!mr->mr.map[i]) { > + while (i) > + kfree(mr->mr.map[--i]); > + kfree(mr); > + return ERR_PTR(-ENOMEM); > + } > + } > + mr->mr.mapsz = m; [snip, snip] > +static struct ib_mr *ipath_reg_user_mr(struct ib_pd *pd, > + struct ib_umem *region, > + int mr_access_flags, > + struct ib_udata *udata) > +{ > + struct ipath_mr *mr; > + struct ib_umem_chunk *chunk; > + int n, m, i; > + > + n = 0; > + list_for_each_entry(chunk, ®ion->chunk_list, list) > + n += chunk->nents; > + > + /* Allocate struct plus pointers to first level page tables. */ > + m = (n + IPATH_SEGSZ - 1) / IPATH_SEGSZ; > + mr = kmalloc(sizeof *mr + m * sizeof mr->mr.map[0], GFP_KERNEL); > + if (!mr) > + return ERR_PTR(-ENOMEM); > + > + /* Allocate first level page tables. */ > + for (i = 0; i < m; i++) { > + mr->mr.map[i] = kmalloc(sizeof *mr->mr.map[0], GFP_KERNEL); > + if (!mr->mr.map[i]) { > + while (i) > + kfree(mr->mr.map[--i]); > + kfree(mr); > + return ERR_PTR(-ENOMEM); > + } > + } > + mr->mr.mapsz = m; [snip, more duplicate code] The above fragment is repeated at least three times. Please factor out the common code into separate functions. Pekka From ralphc at pathscale.com Thu Dec 29 13:00:06 2005 From: ralphc at pathscale.com (Ralph Campbell) Date: Thu, 29 Dec 2005 13:00:06 -0800 Subject: [openib-general] Uninitialized structure field in ib_uverbs_create_ah() Message-ID: <1135890006.5081.60.camel@brick.internal.keyresearch.com> The attr.ah_flags field is not being initialized. Here is a patch: Index: uverbs_cmd.c =================================================================== --- uverbs_cmd.c (revision 4654) +++ uverbs_cmd.c (working copy) @@ -1448,6 +1448,7 @@ attr.sl = cmd.attr.sl; attr.src_path_bits = cmd.attr.src_path_bits; attr.static_rate = cmd.attr.static_rate; + attr.ah_flags = cmd.attr.is_global ? IB_AH_GRH : 0; attr.port_num = cmd.attr.port_num; attr.grh.flow_label = cmd.attr.grh.flow_label; attr.grh.sgid_index = cmd.attr.grh.sgid_index; -- Ralph Campbell From mshefty at ichips.intel.com Thu Dec 29 14:21:00 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 29 Dec 2005 14:21:00 -0800 Subject: [openib-general] Re: [PATCH] fix umad object lifetime stuff In-Reply-To: <528xwdqn4x.fsf@cisco.com> References: <528xwdqn4x.fsf@cisco.com> Message-ID: <43B4614C.7060509@ichips.intel.com> Roland Dreier wrote: > I just committed the following patch for user_mad.c, which fixes > various issues with possibly freeing various data structures before > the last reference is gone. For example, cdev_del() might return > before the last reference to the cdev is gone, so freeing a structure > containing the cdev is wrong at that point. (Side note: it's > essentially impossible to use cdev_init() safely unless the cdev in > question is statically allocated as part of the module). > > Something like this is probably required for ucm and anything else > that exports a character device, since everyone seems to have copied > my bad user_mad code. But I haven't had a chance to do anything > beyond user_mad and uverbs so far... I'm just now getting back to looking at this issue. If I understand the problem in the ucm correctly, struct cdev is freed as part of struct ib_ucm_device after cdev_del() returns; however, a user could still have a reference on the cdev. Also, the user could still make calls into the driver. Is this correct? If this is the case, isn't more protection needed that simply preventing access to cdev? I.e. 
what prevents the user from invoking a call that tries to access the
underlying ib_device? Does every file operation need synchronization
with device removal to ensure that the underlying hardware is still
there? (This appears to be what user_mad now does.)

Assuming that my understanding is correct (which is a stretch), it seems
that there has to be a better way to handle this that is or can be
integrated with the kernel, rather than adding complex reference
counting, synchronization, and clean-up code to every driver that wants
to handle device removal...

- Sean

From caitlinb at broadcom.com  Thu Dec 29 15:06:36 2005
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Thu, 29 Dec 2005 15:06:36 -0800
Subject: [openib-general] PathScale license
Message-ID: <54AD0F12E08D1541B826BE97C98F99F11421DE@NT-SJCA-0751.brcm.ad.broadcom.com>

openib-general-bounces at openib.org wrote:
> On Tue, Dec 27, 2005 at 06:02:55PM -0800, Johann George wrote:
>> We have heard the issues that have been raised regarding the
>> PathScale license. PathScale's intent is solely to protect its
>> hardware IP and not to limit use of the software in any way.
>>
>> PathScale's use of this language is not original. SGI has used, and
>> perhaps originated, the additional language. It currently appears in
>> several files in the Linux kernel. As an example, see
>> fs/xfs/linux-2.6/kmem.c
>
> XFS has been switched to a normal short GPL boilerplate
> exactly because this wording is not okay.
>

The best statement I could google explicitly states that patented code
*can* be submitted *if* it has a license. The plain reading of
PathScale's license grants an unencumbered license *to the code*. It
merely refrains from waiving any related hardware rights.

As I read it, the code *may* be used with alternative hardware. If the
alternative hardware violates the patent, the driver code is
irrelevant. The code is not being restricted to work only with the
patented hardware, correct?

From mst at mellanox.co.il  Thu Dec 29 15:50:00 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 30 Dec 2005 01:50:00 +0200
Subject: [openib-general] backwards compatibility
Message-ID: <20051229235000.GB13951@mellanox.co.il>

Hi!
I'm reading a thread on lkml about backwards compatibility
http://lkml.org/lkml/2005/12/29/204
and I wonder whether we should work harder on supporting older
userspace library ABIs in kernel?

Currently, we only implement backwards compatibility in user-space, so
that you always have to upgrade userspace when upgrading the kernel,
but we could do this in kernel, too.

We would just need the kernel to return a pair of ABI revision numbers:
minimal and maximal ABI supported.

What do you guys think?

-- 
MST

From mshefty at ichips.intel.com  Thu Dec 29 15:55:39 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 29 Dec 2005 15:55:39 -0800
Subject: [openib-general] backwards compatibility
In-Reply-To: <20051229235000.GB13951@mellanox.co.il>
References: <20051229235000.GB13951@mellanox.co.il>
Message-ID: <43B4777B.1090500@ichips.intel.com>

Michael S. Tsirkin wrote:
> We would just need the kernel to return a pair of ABI revision numbers:
> minimal and maximal ABI supported.
>
> What do you guys think?

I thought that some of the kernel modules had this, but I don't see
where now. I think this becomes more important as the kernel and user
interfaces become better defined and are used by more people.
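
To make the idea concrete, here is a rough sketch of the scheme MST
describes -- the names and the sysfs/ioctl channel are made up for
illustration, nothing like this exists in the tree today:

struct abi_range {
	unsigned int min_abi;	/* oldest userspace ABI still accepted */
	unsigned int max_abi;	/* newest userspace ABI implemented */
};

/* Library side: after reading the kernel's advertised range, pick the
 * newest revision that both ends understand. */
static int pick_abi(struct abi_range kern,
		    unsigned int lib_min, unsigned int lib_max)
{
	unsigned int lo = kern.min_abi > lib_min ? kern.min_abi : lib_min;
	unsigned int hi = kern.max_abi < lib_max ? kern.max_abi : lib_max;

	return lo <= hi ? (int) hi : -1;	/* -1: no revision in common */
}

A library built for ABIs 3-5 against a kernel advertising 2-4 would
then settle on revision 4 instead of refusing to load.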
But I don't think it's worth trying to maintain backwards compatibility
in cases where an API must change in order to fix a serious bug.

- Sean

From mst at mellanox.co.il  Thu Dec 29 16:06:25 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 30 Dec 2005 02:06:25 +0200
Subject: [openib-general] Re: backwards compatibility
In-Reply-To: <43B4777B.1090500@ichips.intel.com>
References: <43B4777B.1090500@ichips.intel.com>
Message-ID: <20051230000625.GC13951@mellanox.co.il>

Quoting Sean Hefty :
> But I don't think it's worth trying to maintain backwards compatibility in
> cases where an API must change in order to fix a serious bug.

Yea ... hope we won't have this anytime soon.

-- 
MST

From nacc at us.ibm.com  Thu Dec 29 16:43:13 2005
From: nacc at us.ibm.com (Nishanth Aravamudan)
Date: Thu, 29 Dec 2005 16:43:13 -0800
Subject: [openib-general] Userspace testing results (2.6.15-rc7-git2 with modules)
Message-ID: <20051230004313.GA8111@us.ibm.com>

Hello all,

After quite a bit of struggling, I have added automated userspace
testing, as described below, to my daily kernel builds. Effectively,
with the combinations of kernel.org -git trees and svn kernel code and
modular or built-in CONFIG_ settings, I have added the combinations of
64-bit and 32-bit userspace, thus leading to 16 builds/test runs daily.
This does make the overall time a little excessive, but gives
reasonable coverage.

Currently, I am running netpipe, iperf and netperf (these three tests
are giving horrible results but we are pretty sure that it is a local
issue, as both eth1 and ib0 based tests lead to poor performance) and
also netpipe with a patch from Shirley Ma to run over native IB [1].
Additionally, I am running the 4 pingpong tests (rc, srq, uc, ud) and
the two perftest tests: rdma_lat and rdma_bw. There are some issues
with some size combinations; or, at least, that is how it seems to me.

Here are the results, where each row's heading indicates server-client
size (e.g. 32-64 is a 32-bit server and a 64-bit client), only related
to userspace; that is, both machines are running (identical) 64-bit
kernels. The userspace svn revision is 4651 (or right around there):

netpipe over IB

rdma_write      avg b/w (Mbps)    peak b/w (Mbps)
32-32           1035.6            1839.98
32-64           Failed with [2]
64-32           Failed with [3]
64-64           1037.19           1839.99

rdma_write with immediate
32-32           Failed with [4]
32-64           Failed with [4]
64-32           Failed with [4]
64-64           Failed with [4]

send_recv
32-32           Failed with [4]
32-64           Failed with [4]
64-32           Failed with [4]
64-64           Failed with [4]

send_recv with immediate
32-32           Failed with [4]
32-64           Failed with [4]
64-32           Failed with [4]
64-64           Failed with [4]

The type [4] failures are due to programmatic errors in the netpipe
patch, I think... If someone has guidance there, I'd appreciate it.
Type [2] & [3] may be a legitimate bug, though, as they seem to be tied
to mixing server and client word-size?

pingpong        b/w (Mbps)
rc
32-32           959.49
32-64           974.79
64-32           973.25
64-64           982.77

srq
32-32           3246.28
32-64           3273.20
64-32           3390.38
64-64           3383.03

uc
32-32           958.89
32-64           978.62
64-32           979.79
64-64           987.14

ud
32-32           464.78
32-64           468.09
64-32           468.47
64-64           471.62

I do have the timing numbers, but didn't include them here, if you'd
like those in the results as well, please don't hesitate to ask.
perftest

rdma_lat   peak b/w (MBps)   avg b/w       peak SD (cycles/KB)   avg SD
32-32      4.34445e-07       4.34445e-07   10486                 0
32-64      1866.19           1866.17       837                   837
64-32      4.34504e-07       4.34504e-07   52284                 0
64-64      1866.19           1866.17       837                   837

rdma_bw    typical (us)      best          worst
32-32      3.26619e+09       3.21854e+09   4.29282e+10
32-64      0.755625          0.742812      9.81156
64-32      3.23062e+09       3.18364e+09   5.64641e+10
64-64      0.743437          0.731563      10.1022

I am pretty convinced that having a 32-bit client with these tests
seems to cause errors, those numbers simply don't look right
(regardless of the server word-size).

Eventually (as more jobs come in and thus I have more data to present),
I will try to create some nice and pretty graphs to track regressions
in performance.

Hopefully, this data is useful for someone...

Thanks,
Nish

P.S. Yes, I am running the same set of tests for mainline with CONFIG_
=m, and using the subversion kernel code with =y and =m, but those
tests are still going and I wanted to post *some* numbers while I had
them :)

[1] Here is the patch I am using right now to the 3.6.2 version of
NetPIPE:

diff -urpN NetPIPE_3.6.2/makefile NetPIPE_3.6.2.patch/makefile
--- NetPIPE_3.6.2/makefile	2004-06-09 12:46:35.000000000 -0700
+++ NetPIPE_3.6.2.patch/makefile	2005-12-15 11:51:52.000000000 -0800
@@ -20,9 +20,10 @@
 #
 ########################################################################
 
-CC      = cc
-CFLAGS  = -O
+CC      = cc
+CFLAGS  = -O
 SRC     = ./src
+LDFLAGS = -L/usr/local/lib
 
 # For MPI, mpicc will set up the proper include and library paths
 
@@ -229,6 +230,10 @@ ib: $(SRC)/ib.c $(SRC)/netpipe.c $(SRC)/
 	   -DINFINIBAND -DTCP -I $(VAPI_INC) -L $(VAPI_LIB) \
 	   -lmpga -lvapi -lpthread
 
+ibv: $(SRC)/ibv.c $(SRC)/netpipe.c $(SRC)/netpipe.h
+	$(CC) $(CFLAGS) $(SRC)/ibv.c $(SRC)/netpipe.c -o NPibv \
+	-DOPENIB -DTCP $(LDFLAGS) -libverbs
+
 atoll: $(SRC)/atoll.c $(SRC)/netpipe.c $(SRC)/netpipe.h
 	$(CC) $(CFLAGS) -DATOLL $(SRC)/netpipe.c \
 	$(SRC)/atoll.c -o NPatoll \
diff -urpN NetPIPE_3.6.2/src/ibv.c NetPIPE_3.6.2.patch/src/ibv.c
--- NetPIPE_3.6.2/src/ibv.c	1969-12-31 16:00:00.000000000 -0800
+++ NetPIPE_3.6.2.patch/src/ibv.c	2005-12-15 11:57:10.000000000 -0800
@@ -0,0 +1,1074 @@
+/*****************************************************************************/
+/* "NetPIPE" -- Network Protocol Independent Performance Evaluator.         */
+/* Copyright 1997, 1998 Iowa State University Research Foundation, Inc.     */
+/*                                                                           */
+/* This program is free software; you can redistribute it and/or modify     */
+/* it under the terms of the GNU General Public License as published by     */
+/* the Free Software Foundation.  You should have received a copy of the    */
+/* GNU General Public License along with this program; if not, write to the */
+/* Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.  */
+/*                                                                           */
+/* ibv.c ---- Infiniband module for OpenIB verbs                            */
+/*****************************************************************************/
+
+#define USE_VOLATILE_RPTR	/* needed for polling on last byte of recv buffer */
+#include "netpipe.h"
+#include
+#include
+#include
+
+/* Debugging output macro */
+
+FILE* logfile;
+
+#if 0
+#define LOGPRINTF(_format, _aa...) fprintf(logfile, "%s: " _format, __func__ , ##_aa); fflush(logfile)
+#else
+#define LOGPRINTF(_format, _aa...)
+#endif + +/* Header files needed for Infiniband */ + +#include +/* Global vars */ + +static struct ibv_device *hca; +static struct ibv_context *ctx; +static struct ibv_comp_channel *channel; +static struct ibv_port_attr hca_port; +static int port_num; +static uint16_t lid; +static uint16_t d_lid; +static struct ibv_pd *pd_hndl; +static int num_cqe; +static int act_num_cqe; +static struct ibv_cq *s_cq_hndl; +static struct ibv_cq *r_cq_hndl; +static struct ibv_mr *s_mr_hndl; +static struct ibv_mr *r_mr_hndl; +static struct ibv_qp_init_attr qp_init_attr; +static struct ibv_qp *qp_hndl; +static uint32_t d_qp_num; +static struct ibv_qp_attr qp_attr; +static struct ibv_wc wc; +static int max_wq=50000; +static void* remote_address; +static uint32_t remote_key; +static volatile int receive_complete; +static pthread_t thread; + +/* Function definitions */ + +void Init(ArgStruct *p, int* pargc, char*** pargv) +{ + /* Set defaults + */ + p->prot.ib_mtu = IBV_MTU_1024; /* 1024 Byte MTU */ + p->prot.commtype = NP_COMM_RDMAWRITE; /* Use RDMA write communications */ + p->prot.comptype = NP_COMP_LOCALPOLL; /* Use local polling for completion */ + p->tr = 0; /* I am not the transmitter */ + p->rcv = 1; /* I am the receiver */ +} + +void Setup(ArgStruct *p) +{ + + int one = 1; + int sockfd; + struct sockaddr_in *lsin1, *lsin2; /* ptr to sockaddr_in in ArgStruct */ + char *host; + struct hostent *addr; + struct protoent *proto; + int send_size, recv_size, sizeofint = sizeof(int); + struct sigaction sigact1; + char logfilename[80]; + + /* Sanity check */ + if( p->prot.commtype == NP_COMM_RDMAWRITE && + p->prot.comptype != NP_COMP_LOCALPOLL ) { + fprintf(stderr, "Error, RDMA Write may only be used with local polling.\n"); + fprintf(stderr, "Try using RDMA Write With Immediate Data with vapi polling\n"); + fprintf(stderr, "or event completion\n"); + exit(-1); + } + + if( p->prot.commtype != NP_COMM_RDMAWRITE && + p->prot.comptype == NP_COMP_LOCALPOLL ) { + fprintf(stderr, "Error, local polling may only be used with RDMA Write.\n"); + fprintf(stderr, "Try using vapi polling or event completion\n"); + exit(-1); + } + + /* Open log file */ + sprintf(logfilename, ".iblog%d", 1 - p->tr); + logfile = fopen(logfilename, "w"); + + host = p->host; /* copy ptr to hostname */ + + lsin1 = &(p->prot.sin1); + lsin2 = &(p->prot.sin2); + + bzero((char *) lsin1, sizeof(*lsin1)); + bzero((char *) lsin2, sizeof(*lsin2)); + + if ( (sockfd = socket(AF_INET, SOCK_STREAM, 0)) < 0){ + printf("NetPIPE: can't open stream socket! errno=%d\n", errno); + exit(-4); + } + + if(!(proto = getprotobyname("tcp"))){ + printf("NetPIPE: protocol 'tcp' unknown!\n"); + exit(555); + } + + if (p->tr){ /* if client i.e., Sender */ + + + if (atoi(host) > 0) { /* Numerical IP address */ + lsin1->sin_family = AF_INET; + lsin1->sin_addr.s_addr = inet_addr(host); + + } else { + + if ((addr = gethostbyname(host)) == NULL){ + printf("NetPIPE: invalid hostname '%s'\n", host); + exit(-5); + } + + lsin1->sin_family = addr->h_addrtype; + bcopy(addr->h_addr, (char*) &(lsin1->sin_addr.s_addr), addr->h_length); + } + + lsin1->sin_port = htons(p->port); + + } else { /* we are the receiver (server) */ + + bzero((char *) lsin1, sizeof(*lsin1)); + lsin1->sin_family = AF_INET; + lsin1->sin_addr.s_addr = htonl(INADDR_ANY); + lsin1->sin_port = htons(p->port); + + if (bind(sockfd, (struct sockaddr *) lsin1, sizeof(*lsin1)) < 0){ + printf("NetPIPE: server: bind on local address failed! 
errno=%d", errno); + exit(-6); + } + + } + + if(p->tr) + p->commfd = sockfd; + else + p->servicefd = sockfd; + + + + /* Establish tcp connections */ + + establish(p); + + /* Initialize Mellanox Infiniband */ + + if(initIB(p) == -1) { + CleanUp(p); + exit(-1); + } +} + +void event_handler(struct ibv_cq *cq); + +void *EventThread(void *unused) +{ + struct ibv_cq *cq; + void *data; + + while (1) { + if (ibv_get_cq_event(channel, &cq, &data)) { + fprintf(stderr, "Failed to get CQ event\n"); + return NULL; + } + event_handler(cq); + } +} + +int initIB(ArgStruct *p) +{ + struct ibv_device **dev_list; + int ret; + + dev_list = ibv_get_device_list(NULL); + + hca = *dev_list; + if (!hca) { + fprintf(stderr, "Couldn't find any InfiniBand devices\n"); + return -1; + } else { + LOGPRINTF("Found Infiniband HCA %s\n", ibv_get_device_name(hca)); + } + + ctx = ibv_open_device(hca); + if (!ctx) { + fprintf(stderr, "Couldn't create InfiniBand context\n"); + return -1; + } else { + LOGPRINTF("Found Infiniband HCA %s\n", ibv_get_device_name(hca)); +// channel = ibv_create_comp_channel(ctx); + channel = NULL; + } + + /* Get HCA properties */ + + port_num=1; + ret = ibv_query_port(ctx, port_num, &hca_port); + if(ret) { + fprintf(stderr, "Error querying Infiniband HCA\n"); + return -1; + } else { + LOGPRINTF("Queried Infiniband HCA\n"); + } + lid = hca_port.lid; + LOGPRINTF(" lid = %d\n", lid); + + + /* Allocate Protection Domain */ + + pd_hndl = ibv_alloc_pd(ctx); + if(!pd_hndl) { + fprintf(stderr, "Error allocating PD\n"); + return -1; + } else { + LOGPRINTF("Allocated Protection Domain\n"); + } + + + /* Create send completion queue */ + + num_cqe = 30000; /* Requested number of completion q elements */ + s_cq_hndl = ibv_create_cq(ctx, num_cqe, NULL, channel, 0); + if(!s_cq_hndl) { + fprintf(stderr, "Error creating send CQ\n"); + return -1; + } else { + act_num_cqe = s_cq_hndl->cqe; + LOGPRINTF("Created Send Completion Queue with %d elements\n", act_num_cqe); + } + + + /* Create recv completion queue */ + + num_cqe = 20000; /* Requested number of completion q elements */ + r_cq_hndl = ibv_create_cq(ctx, num_cqe, NULL, channel, 0); + if(!r_cq_hndl) { + fprintf(stderr, "Error creating send CQ\n"); + return -1; + } else { + act_num_cqe = r_cq_hndl->cqe; + LOGPRINTF("Created Recv Completion Queue with %d elements\n", act_num_cqe); + } + + + /* Placeholder for MR */ + + + /* Create Queue Pair */ + + qp_init_attr.cap.max_recv_wr = max_wq; /* Max outstanding WR on RQ */ + qp_init_attr.cap.max_send_wr = max_wq; /* Max outstanding WR on SQ */ + qp_init_attr.cap.max_recv_sge = 1; /* Max scatter/gather entries on RQ */ + qp_init_attr.cap.max_send_sge = 1; /* Max scatter/gather entries on SQ */ + qp_init_attr.recv_cq = r_cq_hndl; /* CQ handle for RQ */ + qp_init_attr.send_cq = s_cq_hndl; /* CQ handle for SQ */ + qp_init_attr.sq_sig_all = 0; /* Signalling type */ + qp_init_attr.qp_type = IBV_QPT_RC; /* Transmission type */ + + qp_hndl = ibv_create_qp(pd_hndl, &qp_init_attr); + if(!qp_hndl) { + fprintf(stderr, "Error creating Queue Pair\n"); + return -1; + } else { + LOGPRINTF("Created Queue Pair\n"); + } + + + /* Exchange lid and qp_num with other node */ + + if( write(p->commfd, &lid, sizeof(lid) ) != sizeof(lid) ) { + fprintf(stderr, "Failed to send lid over socket\n"); + return -1; + } + if( write(p->commfd, &qp_hndl->qp_num, sizeof(qp_hndl->qp_num) ) != sizeof(qp_hndl->qp_num) ) { + fprintf(stderr, "Failed to send qpnum over socket\n"); + return -1; + } + if( read(p->commfd, &d_lid, sizeof(d_lid) ) != sizeof(d_lid) ) { 
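+      /* Without the peer's LID the QP cannot be moved to RTR below, so give up. */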
+ fprintf(stderr, "Failed to read lid from socket\n"); + return -1; + } + if( read(p->commfd, &d_qp_num, sizeof(d_qp_num) ) != sizeof(d_qp_num) ) { + fprintf(stderr, "Failed to read qpnum from socket\n"); + return -1; + } + + LOGPRINTF("Local: lid=%d qp_num=%d Remote: lid=%d qp_num=%d\n", + lid, qp_hndl->qp_num, d_lid, d_qp_num); + + + /* Bring up Queue Pair */ + + /******* INIT state ******/ + + qp_attr.qp_state = IBV_QPS_INIT; + qp_attr.pkey_index = 0; + qp_attr.port_num = port_num; + qp_attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ; + + ret = ibv_modify_qp(qp_hndl, &qp_attr, + IBV_QP_STATE | + IBV_QP_PKEY_INDEX | + IBV_QP_PORT | + IBV_QP_ACCESS_FLAGS); + if(ret) { + fprintf(stderr, "Error modifying QP to INIT\n"); + return -1; + } + + LOGPRINTF("Modified QP to INIT\n"); + + /******* RTR (Ready-To-Receive) state *******/ + + qp_attr.qp_state = IBV_QPS_RTR; + qp_attr.max_dest_rd_atomic = 1; + qp_attr.dest_qp_num = d_qp_num; + qp_attr.ah_attr.sl = 0; + qp_attr.ah_attr.is_global = 0; + qp_attr.ah_attr.dlid = d_lid; + qp_attr.ah_attr.static_rate = 0; + qp_attr.ah_attr.src_path_bits = 0; + qp_attr.ah_attr.port_num = port_num; + qp_attr.path_mtu = p->prot.ib_mtu; + qp_attr.rq_psn = 0; + qp_attr.pkey_index = 0; + qp_attr.min_rnr_timer = 5; + + ret = ibv_modify_qp(qp_hndl, &qp_attr, + IBV_QP_STATE | + IBV_QP_AV | + IBV_QP_PATH_MTU | + IBV_QP_DEST_QPN | + IBV_QP_RQ_PSN | + IBV_QP_MAX_DEST_RD_ATOMIC | + IBV_QP_MIN_RNR_TIMER); + + if(ret) { + fprintf(stderr, "Error modifying QP to RTR\n"); + return -1; + } + + LOGPRINTF("Modified QP to RTR\n"); + + /* Sync before going to RTS state */ + Sync(p); + + /******* RTS (Ready-to-Send) state *******/ + + qp_attr.qp_state = IBV_QPS_RTS; + qp_attr.sq_psn = 0; + qp_attr.timeout = 31; + qp_attr.retry_cnt = 1; + qp_attr.rnr_retry = 1; + qp_attr.max_rd_atomic = 1; + + ret = ibv_modify_qp(qp_hndl, &qp_attr, + IBV_QP_STATE | + IBV_QP_TIMEOUT | + IBV_QP_RETRY_CNT | + IBV_QP_RNR_RETRY | + IBV_QP_SQ_PSN | + IBV_QP_MAX_QP_RD_ATOMIC); + + if(ret) { + fprintf(stderr, "Error modifying QP to RTS\n"); + return -1; + } + + LOGPRINTF("Modified QP to RTS\n"); + + /* If using event completion, request the initial notification */ + if( p->prot.comptype == NP_COMP_EVENT ) { + if (pthread_create(&thread, NULL, EventThread, NULL)) { + fprintf(stderr, "Couldn't start event thread\n"); + return -1; + } + ibv_req_notify_cq(r_cq_hndl, 0); + } + + return 0; +} + +int finalizeIB(ArgStruct *p) +{ + int ret; + + LOGPRINTF("Finalizing IB stuff\n"); + + if(qp_hndl) { + LOGPRINTF("Destroying QP\n"); + ret = ibv_destroy_qp(qp_hndl); + if(ret) { + fprintf(stderr, "Error destroying Queue Pair\n"); + } + } + + if(r_cq_hndl) { + LOGPRINTF("Destroying Recv CQ\n"); + ret = ibv_destroy_cq(r_cq_hndl); + if(ret) { + fprintf(stderr, "Error destroying recv CQ\n"); + } + } + + if(s_cq_hndl) { + LOGPRINTF("Destroying Send CQ\n"); + ret = ibv_destroy_cq(s_cq_hndl); + if(ret) { + fprintf(stderr, "Error destroying send CQ\n"); + } + } + + /* Check memory registrations just in case user bailed out */ + if(s_mr_hndl) { + LOGPRINTF("Deregistering send buffer\n"); + ret = ibv_dereg_mr(s_mr_hndl); + if(ret) { + fprintf(stderr, "Error deregistering send mr\n"); + } + } + + if(r_mr_hndl) { + LOGPRINTF("Deregistering recv buffer\n"); + ret = ibv_dereg_mr(r_mr_hndl); + if(ret) { + fprintf(stderr, "Error deregistering recv mr\n"); + } + } + + if(pd_hndl) { + LOGPRINTF("Deallocating PD\n"); + ret = ibv_dealloc_pd(pd_hndl); + if(ret) { + fprintf(stderr, "Error deallocating PD\n"); + } + } + + 
/* Application code should not close HCA, just release handle */ + + if(ctx) { + LOGPRINTF("Releasing HCA\n"); + ret = ibv_close_device(ctx); + if(ret) { + fprintf(stderr, "Error releasing HCA\n"); + } + } + + return 0; +} + +void event_handler(struct ibv_cq *cq) +{ + int ret; + + while(1) { + + ret = ibv_poll_cq(cq, 1, &wc); + + if(ret == 0) { + LOGPRINTF("Empty completion queue, requesting next notification\n"); + ibv_req_notify_cq(r_cq_hndl, 0); + return; + } else if(ret < 0) { + fprintf(stderr, "Error in event_handler, polling cq\n"); + exit(-1); + } else if(wc.status != IBV_WC_SUCCESS) { + fprintf(stderr, "Error in event_handler, on returned work completion " + "status: %d\n", wc.status); + exit(-1); + } + + LOGPRINTF("Retrieved work completion\n"); + + /* For ping-pong mode at least, this check shouldn't be needed for + * normal operation, but it will help catch any bugs with multiple + * sends coming through when we're only expecting one. + */ + if(receive_complete == 1) { + + while(receive_complete != 0) sched_yield(); + + } + + receive_complete = 1; + + } + +} + +static int +readFully(int fd, void *obuf, int len) +{ + int bytesLeft = len; + char *buf = (char *) obuf; + int bytesRead = 0; + + while (bytesLeft > 0 && + (bytesRead = read(fd, (void *) buf, bytesLeft)) > 0) + { + bytesLeft -= bytesRead; + buf += bytesRead; + } + if (bytesRead <= 0) + return bytesRead; + return len; +} + +void Sync(ArgStruct *p) +{ + char s[] = "SyncMe"; + char response[7]; + + if (write(p->commfd, s, strlen(s)) < 0 || + readFully(p->commfd, response, strlen(s)) < 0) + { + perror("NetPIPE: error writing or reading synchronization string"); + exit(3); + } + if (strncmp(s, response, strlen(s))) + { + fprintf(stderr, "NetPIPE: Synchronization string incorrect!\n"); + exit(3); + } +} + +void PrepareToReceive(ArgStruct *p) +{ + int ret; /* Return code */ + struct ibv_recv_wr rr; /* Receive request */ + struct ibv_recv_wr *bad_wr; + struct ibv_sge sg_entry; /* Scatter/Gather list - holds buff addr */ + + /* We don't need to post a receive if doing RDMA write with local polling */ + + if( p->prot.commtype == NP_COMM_RDMAWRITE && + p->prot.comptype == NP_COMP_LOCALPOLL ) + return; + + rr.num_sge = 1; + rr.sg_list = &sg_entry; + rr.next = NULL; + + sg_entry.lkey = r_mr_hndl->lkey; + sg_entry.length = p->bufflen; + sg_entry.addr = (uintptr_t)p->r_ptr; + + ret = ibv_post_recv(qp_hndl, &rr, &bad_wr); + if(ret) { + fprintf(stderr, "Error posting recv request\n"); + CleanUp(p); + exit(-1); + } else { + LOGPRINTF("Posted recv request\n"); + } + + /* Set receive flag to zero and request event completion + * notification for this receive so the event handler will + * be triggered when the receive completes. 
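+    * (The flag only matters for event completion: with RDMA Write plus
+    * local polling, RecvData watches the last data byte directly.)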
+ */ + if( p->prot.comptype == NP_COMP_EVENT ) { + receive_complete = 0; + } +} + +void SendData(ArgStruct *p) +{ + int ret; /* Return code */ + struct ibv_send_wr sr; /* Send request */ + struct ibv_send_wr *bad_wr; + struct ibv_sge sg_entry; /* Scatter/Gather list - holds buff addr */ + + /* Fill in send request struct */ + + if(p->prot.commtype == NP_COMM_SENDRECV) { + sr.opcode = IBV_WR_SEND; + LOGPRINTF("Doing regular send\n"); + } else if(p->prot.commtype == NP_COMM_SENDRECV_WITH_IMM) { + sr.opcode = IBV_WR_SEND_WITH_IMM; + LOGPRINTF("Doing regular send with imm\n"); + } else if(p->prot.commtype == NP_COMM_RDMAWRITE) { + sr.opcode = IBV_WR_RDMA_WRITE; + sr.wr.rdma.remote_addr = (uintptr_t)(remote_address + (p->s_ptr - p->s_buff)); + sr.wr.rdma.rkey = remote_key; + LOGPRINTF("Doing RDMA write (raddr=%p)\n", sr.wr.rdma.remote_addr); + } else if(p->prot.commtype == NP_COMM_RDMAWRITE_WITH_IMM) { + sr.opcode = IBV_WR_RDMA_WRITE_WITH_IMM; + sr.wr.rdma.remote_addr = (uintptr_t)(remote_address + (p->s_ptr - p->s_buff)); + sr.wr.rdma.rkey = remote_key; + LOGPRINTF("Doing RDMA write with imm (raddr=%p)\n", sr.wr.rdma.remote_addr); + } else { + fprintf(stderr, "Error, invalid communication type in SendData\n"); + exit(-1); + } + + sr.send_flags = 0; /* This needed due to a bug in Mellanox HW rel a-0 */ + + sr.num_sge = 1; + sr.sg_list = &sg_entry; + sr.next = NULL; + + sg_entry.lkey = s_mr_hndl->lkey; /* Local memory region key */ + sg_entry.length = p->bufflen; + sg_entry.addr = (uintptr_t)p->s_ptr; + + ret = ibv_post_send(qp_hndl, &sr, &bad_wr); + if(ret) { + fprintf(stderr, "Error posting send request\n"); + } else { + LOGPRINTF("Posted send request\n"); + } + +} + +void RecvData(ArgStruct *p) +{ + int ret; + + /* Busy wait for incoming data */ + + LOGPRINTF("Receiving at buffer address %p\n", p->r_ptr); + + /* + * Unsignaled receives are not supported, so we must always poll the + * CQ, except when using RDMA writes. + */ + if( p->prot.commtype == NP_COMM_RDMAWRITE ) { + + /* Poll for receive completion locally on the receive data */ + + LOGPRINTF("Waiting for last byte of data to arrive\n"); + + while(p->r_ptr[p->bufflen-1] != 'a' + (p->cache ? 1 - p->tr : 1) ) + { + /* BUSY WAIT -- this should be fine since we + * declared r_ptr with volatile qualifier */ + } + + /* Reset last byte */ + p->r_ptr[p->bufflen-1] = 'a' + (p->cache ? p->tr : 0); + + LOGPRINTF("Received all of data\n"); + + } else if( p->prot.comptype != NP_COMP_EVENT ) { + + /* Poll for receive completion using VAPI poll function */ + + LOGPRINTF("Polling completion queue for VAPI work completion\n"); + + ret = 0; + while(ret == 0) + ret = ibv_poll_cq(r_cq_hndl, 1, &wc); + + if(ret < 0) { + fprintf(stderr, "Error in RecvData, polling for completion\n"); + exit(-1); + } + + if(wc.status != IBV_WC_SUCCESS) { + fprintf(stderr, "Error in status of returned completion: %d\n", + wc.status); + exit(-1); + } + + LOGPRINTF("Retrieved successful completion\n"); + + } else if( p->prot.comptype == NP_COMP_EVENT ) { + + /* Instead of polling directly on data or VAPI completion queue, + * let the VAPI event completion handler set a flag when the receive + * completes, and poll on that instead. Could try using semaphore here + * as well to eliminate busy polling + */ + + LOGPRINTF("Polling receive flag\n"); + + while( receive_complete == 0 ) + { + /* BUSY WAIT */ + } + + /* If in prepost-burst mode, we won't be calling PrepareToReceive + * between ping-pongs, so we need to reset the receive_complete + * flag here. 
+ */ + if( p->preburst ) receive_complete = 0; + + LOGPRINTF("Receive completed\n"); + } +} + +/* Reset is used after a trial to empty the work request queues so we + have enough room for the next trial to run */ +void Reset(ArgStruct *p) +{ + + int ret; /* Return code */ + struct ibv_send_wr sr; /* Send request */ + struct ibv_send_wr *bad_sr; + struct ibv_recv_wr rr; /* Recv request */ + struct ibv_recv_wr *bad_rr; + + /* If comptype is event, then we'll use event handler to detect receive, + * so initialize receive_complete flag + */ + if(p->prot.comptype == NP_COMP_EVENT) receive_complete = 0; + + /* Prepost receive */ + rr.num_sge = 0; + rr.next = NULL; + + LOGPRINTF("Posting recv request in Reset\n"); + ret = ibv_post_recv(qp_hndl, &rr, &bad_rr); + if(ret) { + fprintf(stderr, " Error posting recv request\n"); + CleanUp(p); + exit(-1); + } + + /* Make sure both nodes have preposted receives */ + Sync(p); + + /* Post Send */ + sr.opcode = IBV_WR_SEND; + sr.send_flags = IBV_SEND_SIGNALED; + sr.num_sge = 0; + sr.next = NULL; + + LOGPRINTF("Posting send request \n"); + ret = ibv_post_send(qp_hndl, &sr, &bad_sr); + if(ret) { + fprintf(stderr, " Error posting send request in Reset\n"); + exit(-1); + } + if(wc.status != IBV_WC_SUCCESS) { + fprintf(stderr, " Error in completion status: %d\n", + wc.status); + exit(-1); + } + + LOGPRINTF("Polling for completion of send request\n"); + ret = 0; + while(ret == 0) + ret = ibv_poll_cq(s_cq_hndl, 1, &wc); + + if(ret < 0) { + fprintf(stderr, "Error polling CQ for send in Reset\n"); + exit(-1); + } + if(wc.status != IBV_WC_SUCCESS) { + fprintf(stderr, " Error in completion status: %d\n", + wc.status); + exit(-1); + } + + LOGPRINTF("Status of send completion: %d\n", wc.status); + + if(p->prot.comptype == NP_COMP_EVENT) { + /* If using event completion, the event handler will set receive_complete + * when it gets the completion event. + */ + LOGPRINTF("Waiting for receive_complete flag\n"); + while(receive_complete == 0) { /* BUSY WAIT */ } + } else { + LOGPRINTF("Polling for completion of recv request\n"); + ret = 0; + while(ret == 0) + ret = ibv_poll_cq(r_cq_hndl, 1, &wc); + + if(ret < 0) { + fprintf(stderr, "Error polling CQ for recv in Reset"); + exit(-1); + } + if(wc.status != IBV_WC_SUCCESS) { + fprintf(stderr, " Error in completion status: %d\n", + wc.status); + exit(-1); + } + + LOGPRINTF("Status of recv completion: %d\n", wc.status); + } + LOGPRINTF("Done with reset\n"); +} + +void SendTime(ArgStruct *p, double *t) +{ + uint32_t ltime, ntime; + + /* + Multiply the number of seconds by 1e6 to get time in microseconds + and convert value to an unsigned 32-bit integer. 
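+      (At 2^32 microseconds this wraps after roughly 4295 seconds,
+      far longer than any single NetPIPE timing.)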
+  */
+  ltime = (uint32_t)(*t * 1.e6);
+
+  /* Send time in network order */
+  ntime = htonl(ltime);
+  if (write(p->commfd, (char *)&ntime, sizeof(uint32_t)) < 0)
+  {
+    printf("NetPIPE: write failed in SendTime: errno=%d\n", errno);
+    exit(301);
+  }
+}
+
+void RecvTime(ArgStruct *p, double *t)
+{
+  uint32_t ltime, ntime;
+  int bytesRead;
+
+  bytesRead = readFully(p->commfd, (void *)&ntime, sizeof(uint32_t));
+  if (bytesRead < 0)
+  {
+    printf("NetPIPE: read failed in RecvTime: errno=%d\n", errno);
+    exit(302);
+  }
+  else if (bytesRead != sizeof(uint32_t))
+  {
+    fprintf(stderr, "NetPIPE: partial read in RecvTime of %d bytes\n",
+            bytesRead);
+    exit(303);
+  }
+  ltime = ntohl(ntime);
+
+  /* Result is ltime (in microseconds) divided by 1.0e6 to get seconds */
+  *t = (double)ltime / 1.0e6;
+}
+
+void SendRepeat(ArgStruct *p, int rpt)
+{
+  uint32_t lrpt, nrpt;
+
+  lrpt = rpt;
+  /* Send repeat count as an unsigned 32-bit integer in network order */
+  nrpt = htonl(lrpt);
+  if (write(p->commfd, (void *) &nrpt, sizeof(uint32_t)) < 0)
+  {
+    printf("NetPIPE: write failed in SendRepeat: errno=%d\n", errno);
+    exit(304);
+  }
+}
+
+void RecvRepeat(ArgStruct *p, int *rpt)
+{
+  uint32_t lrpt, nrpt;
+  int bytesRead;
+
+  bytesRead = readFully(p->commfd, (void *)&nrpt, sizeof(uint32_t));
+  if (bytesRead < 0)
+  {
+    printf("NetPIPE: read failed in RecvRepeat: errno=%d\n", errno);
+    exit(305);
+  }
+  else if (bytesRead != sizeof(uint32_t))
+  {
+    fprintf(stderr, "NetPIPE: partial read in RecvRepeat of %d bytes\n",
+            bytesRead);
+    exit(306);
+  }
+  lrpt = ntohl(nrpt);
+
+  *rpt = lrpt;
+}
+
+void establish(ArgStruct *p)
+{
+  socklen_t clen;
+
+  clen = sizeof(p->prot.sin2);
+  if(p->tr){
+    if(connect(p->commfd, (struct sockaddr *) &(p->prot.sin1),
+               sizeof(p->prot.sin1)) < 0){
+      printf("Client: Cannot Connect! errno=%d\n",errno);
+      exit(-10);
+    }
+  }
+  else {
+    /* SERVER */
+    listen(p->servicefd, 5);
+    p->commfd = accept(p->servicefd, (struct sockaddr *) &(p->prot.sin2),
+                       &clen);
+
+    if(p->commfd < 0){
+      printf("Server: Accept Failed! errno=%d\n",errno);
+      exit(-12);
+    }
+  }
+}
+
+void CleanUp(ArgStruct *p)
+{
+  char quit[] = "QUIT";
+  if (p->tr)
+  {
+    write(p->commfd, quit, 5);
+    read(p->commfd, quit, 5);
+    close(p->commfd);
+  }
+  else
+  {
+    read(p->commfd, quit, 5);
+    write(p->commfd, quit, 5);
+    close(p->commfd);
+    close(p->servicefd);
+  }
+
+  finalizeIB(p);
+}
+
+
+void AfterAlignmentInit(ArgStruct *p)
+{
+  int bytesRead;
+
+  /* Exchange buffer pointers and remote infiniband keys if doing rdma. Do
+   * the exchange in this function because this will happen after any
+   * memory alignment is done, which is important for getting the
+   * correct remote address.
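+   * Note that the buffer pointer received from the peer is used only as
+   * the remote_addr of RDMA write work requests; it is never dereferenced
+   * locally.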
+ */ + if( p->prot.commtype == NP_COMM_RDMAWRITE || + p->prot.commtype == NP_COMM_RDMAWRITE_WITH_IMM ) { + + /* Send my receive buffer address + */ + if(write(p->commfd, (void *)&p->r_buff, sizeof(void*)) < 0) { + perror("NetPIPE: write of buffer address failed in AfterAlignmentInit"); + exit(-1); + } + + LOGPRINTF("Sent buffer address: %p\n", p->r_buff); + + /* Send my remote key for accessing + * my remote buffer via IB RDMA + */ + if(write(p->commfd, (void *)&r_mr_hndl->rkey, sizeof(uint32_t)) < 0) { + perror("NetPIPE: write of remote key failed in AfterAlignmentInit"); + exit(-1); + } + + LOGPRINTF("Sent remote key: %d\n", r_mr_hndl->rkey); + + /* Read the sent data + */ + bytesRead = readFully(p->commfd, (void *)&remote_address, sizeof(void*)); + if (bytesRead < 0) { + perror("NetPIPE: read of buffer address failed in AfterAlignmentInit"); + exit(-1); + } else if (bytesRead != sizeof(void*)) { + perror("NetPIPE: partial read of buffer address in AfterAlignmentInit"); + exit(-1); + } + + LOGPRINTF("Received remote address from other node: %p\n", remote_address); + + bytesRead = readFully(p->commfd, (void *)&remote_key, sizeof(uint32_t)); + if (bytesRead < 0) { + perror("NetPIPE: read of remote key failed in AfterAlignmentInit"); + exit(-1); + } else if (bytesRead != sizeof(uint32_t)) { + perror("NetPIPE: partial read of remote key in AfterAlignmentInit"); + exit(-1); + } + + LOGPRINTF("Received remote key from other node: %d\n", remote_key); + + } +} + + +void MyMalloc(ArgStruct *p, int bufflen, int soffset, int roffset) +{ + /* Allocate buffers */ + + p->r_buff = malloc(bufflen+MAX(soffset,roffset)); + if(p->r_buff == NULL) { + fprintf(stderr, "Error malloc'ing buffer\n"); + exit(-1); + } + + if(p->cache) { + + /* Infiniband spec says we can register same memory region + * more than once, so just copy buffer address. We will register + * the same buffer twice with Infiniband. + */ + p->s_buff = p->r_buff; + + } else { + + p->s_buff = malloc(bufflen+soffset); + if(p->s_buff == NULL) { + fprintf(stderr, "Error malloc'ing buffer\n"); + exit(-1); + } + + } + + /* Register buffers with Infiniband */ + + r_mr_hndl = ibv_reg_mr(pd_hndl, p->r_buff, bufflen + MAX(soffset, roffset), + IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE); + if(!r_mr_hndl) + { + fprintf(stderr, "Error registering recv buffer\n"); + exit(-1); + } + else + { + LOGPRINTF("Registered Recv Buffer\n"); + } + + s_mr_hndl = ibv_reg_mr(pd_hndl, p->s_buff, bufflen+soffset, IBV_ACCESS_LOCAL_WRITE); + if(!s_mr_hndl) { + fprintf(stderr, "Error registering send buffer\n"); + exit(-1); + } else { + LOGPRINTF("Registered Send Buffer\n"); + } + +} +void FreeBuff(char *buff1, char *buff2) +{ + int ret; + + if(s_mr_hndl) { + LOGPRINTF("Deregistering send buffer\n"); + ret = ibv_dereg_mr(s_mr_hndl); + if(ret) { + fprintf(stderr, "Error deregistering send mr\n"); + } else { + s_mr_hndl = NULL; + } + } + + if(r_mr_hndl) { + LOGPRINTF("Deregistering recv buffer\n"); + ret = ibv_dereg_mr(r_mr_hndl); + if(ret) { + fprintf(stderr, "Error deregistering recv mr\n"); + } else { + r_mr_hndl = NULL; + } + } + + if(buff1 != NULL) + free(buff1); + + if(buff2 != NULL) + free(buff2); +} + diff -urpN NetPIPE_3.6.2/src/netpipe.c NetPIPE_3.6.2.patch/src/netpipe.c --- NetPIPE_3.6.2/src/netpipe.c 2004-06-22 12:38:41.000000000 -0700 +++ NetPIPE_3.6.2.patch/src/netpipe.c 2005-12-15 11:51:52.000000000 -0800 @@ -142,7 +142,7 @@ int main(int argc, char **argv) case 's': streamopt = 1; printf("Streaming in one direction only.\n\n"); -#if defined(TCP) && ! 
defined(INFINIBAND) +#if defined(TCP) && ! defined(INFINIBAND) && !defined(OPENIB) printf("Sockets are reset between trials to avoid\n"); printf("degradation from a collapsing window size.\n\n"); #endif @@ -168,7 +168,7 @@ int main(int argc, char **argv) case 'u': end = atoi(optarg); break; -#if defined(TCP) && ! defined(INFINIBAND) +#if defined(TCP) && ! defined(INFINIBAND) && !defined(OPENIB) case 'b': /* -b # resets the buffer size, -b 0 keeps system defs */ args.prot.sndbufsz = args.prot.rcvbufsz = atoi(optarg); break; @@ -178,7 +178,7 @@ int main(int argc, char **argv) /* end will be maxed at sndbufsz+rcvbufsz */ printf("Passing data in both directions simultaneously.\n"); printf("Output is for the combined bandwidth.\n"); -#if defined(TCP) && ! defined(INFINIBAND) +#if defined(TCP) && ! defined(INFINIBAND) && !defined(OPENIB) printf("The socket buffer size limits the maximum test size.\n\n"); #endif if( streamopt ) { @@ -270,7 +270,29 @@ int main(int argc, char **argv) exit(-1); } break; +#endif + +#if defined(OPENIB) + case 'm': switch(atoi(optarg)) { + case 256: args.prot.ib_mtu = IBV_MTU_256; + break; + case 512: args.prot.ib_mtu = IBV_MTU_512; + break; + case 1024: args.prot.ib_mtu = IBV_MTU_1024; + break; + case 2048: args.prot.ib_mtu = IBV_MTU_2048; + break; + case 4096: args.prot.ib_mtu = IBV_MTU_4096; + break; + default: + fprintf(stderr, "Invalid MTU size, must be one of " + "256, 512, 1024, 2048, 4096\n"); + exit(-1); + } + break; +#endif +#if defined(OPENIB) || defined(INFINIBAND) case 't': if( !strcmp(optarg, "send_recv") ) { printf("Using Send/Receive communications\n"); args.prot.commtype = NP_COMM_SENDRECV; @@ -317,7 +339,7 @@ int main(int argc, char **argv) case 'n': nrepeat_const = atoi(optarg); break; -#if defined(TCP) && ! defined(INFINIBAND) +#if defined(TCP) && ! defined(INFINIBAND) && !defined(OPENIB) case 'r': args.reset_conn = 1; printf("Resetting connection after every trial\n"); break; @@ -331,7 +353,7 @@ int main(int argc, char **argv) #endif /* ! defined TCGMSG */ -#if defined(INFINIBAND) +#if defined(OPENIB) || defined(INFINIBAND) asyncReceive = 1; fprintf(stderr, "Preposting asynchronous receives (required for Infiniband)\n"); if(args.bidir && ( @@ -377,7 +399,7 @@ int main(int argc, char **argv) end = args.upper; if( args.tr ) { printf("The upper limit is being set to %d Bytes\n", end); -#if defined(TCP) && ! defined(INFINIBAND) +#if defined(TCP) && ! defined(INFINIBAND) && !defined(OPENIB) printf("due to socket buffer size limitations\n\n"); #endif } } @@ -990,7 +1012,7 @@ void VerifyIntegrity(ArgStruct *p) void PrintUsage() { printf("\n NETPIPE USAGE \n\n"); -#if ! defined(INFINIBAND) +#if ! defined(INFINIBAND) && !defined(OPENIB) printf("a: asynchronous receive (a.k.a. 
preposted receive)\n"); #endif printf("B: burst all preposts before measuring performance\n"); @@ -998,7 +1020,7 @@ void PrintUsage() printf("b: specify TCP send/receive socket buffer sizes\n"); #endif -#if defined(INFINIBAND) +#if defined(INFINIBAND) || defined(OPENIB) printf("c: specify type of completion <-c type>\n" " valid types: local_poll, vapi_poll, event\n" " default: local_poll\n"); @@ -1010,7 +1032,7 @@ void PrintUsage() printf(" all MPI-2 implementations\n"); #endif -#if defined(TCP) || defined(INFINIBAND) +#if defined(TCP) || defined(INFINIBAND) || defined(OPENIB) printf("h: specify hostname of the receiver <-h host>\n"); #endif @@ -1019,7 +1041,7 @@ void PrintUsage() printf("i: Do an integrity check instead of measuring performance\n"); printf("l: lower bound start value e.g. <-l 1>\n"); -#if defined(INFINIBAND) +#if defined(INFINIBAND) || defined(OPENIB) printf("m: set MTU for Infiniband adapter <-m mtu_size>\n"); printf(" valid sizes: 256, 512, 1024, 2048, 4096 (default 1024)\n"); #endif @@ -1030,7 +1052,7 @@ void PrintUsage() printf("p: set the perturbation number <-p 1>\n" " (default = 3 Bytes, set to 0 for no perturbations)\n"); -#if defined(TCP) && ! defined(INFINIBAND) +#if defined(TCP) && ! defined(INFINIBAND) && !defined(OPENIB) printf("r: reset sockets for every trial\n"); #endif @@ -1039,7 +1061,7 @@ void PrintUsage() printf("S: Use synchronous sends.\n"); #endif -#if defined(INFINIBAND) +#if defined(INFINIBAND) || defined(OPENIB) printf("t: specify type of communications <-t type>\n" " valid types: send_recv, send_recv_with_imm,\n" " rdma_write, rdma_write_with_imm\n" @@ -1056,7 +1078,7 @@ void PrintUsage() #if defined(MPI) printf(" May need to use -a to choose asynchronous communications for MPI/n"); #endif -#if defined(TCP) && !defined(INFINIBAND) +#if defined(TCP) && !defined(INFINIBAND) && !defined(OPENIB) printf(" The maximum test size is limited by the TCP buffer size/n"); #endif printf("\n"); @@ -1131,7 +1153,7 @@ void InitBufferData(ArgStruct *p, int nb memset(p->s_buff, 'b', nbytes+soffset); } -#if !defined(INFINIBAND) && !defined(ARMCI) && !defined(LAPI) && !defined(GPSHMEM) && !defined(SHMEM) && !defined(GM) +#if !defined(OPENIB) && !defined(INFINIBAND) && !defined(ARMCI) && !defined(LAPI) && !defined(GPSHMEM) && !defined(SHMEM) && !defined(GM) void MyMalloc(ArgStruct *p, int bufflen, int soffset, int roffset) { diff -urpN NetPIPE_3.6.2/src/netpipe.h NetPIPE_3.6.2.patch/src/netpipe.h --- NetPIPE_3.6.2/src/netpipe.h 2004-06-22 12:38:41.000000000 -0700 +++ NetPIPE_3.6.2.patch/src/netpipe.h 2005-12-15 11:51:52.000000000 -0800 @@ -27,6 +27,10 @@ #include /* ib_mtu_t */ #endif +#ifdef OPENIB +#include /* enum ibv_mtu */ +#endif + #ifdef FINAL #define TRIALS 7 #define RUNTM 0.25 @@ -73,9 +77,14 @@ int commtype; /* Communications type */ int comptype; /* Completion type */ #endif +#if defined(OPENIB) + enum ibv_mtu ib_mtu; /* MTU Size for Infiniband HCA */ + int commtype; /* Communications type */ + int comptype; /* Completion type */ +#endif }; -#if defined(INFINIBAND) +#if defined(INFINIBAND) || defined(OPENIB) enum completion_types { NP_COMP_LOCALPOLL, /* Poll locally on last byte of data */ NP_COMP_VAPIPOLL, /* Poll using vapi function */ [2] Preposting asynchronous receives (required for Infiniband) NetPIPE: error writing or reading synchronization string: Connection reset by peer Using RDMA Write communications [3] Preposting asynchronous receives (required for Infiniband) NetPIPE: Synchronization string incorrect! 
Using RDMA Write communications [4] All are similar to: Preposting asynchronous receives (required for Infiniband) Error, local polling may only be used with RDMA Write. Try using vapi polling or event completion Using RDMA Write communications with immediate data With the differences being in the last line. From mshefty at ichips.intel.com Thu Dec 29 16:53:09 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 29 Dec 2005 16:53:09 -0800 Subject: [openib-general] [PATCH] iWARP Support added to the CMA In-Reply-To: <1134669336.7186.2.camel@trinity.austin.ammasso.com> References: <1134669336.7186.2.camel@trinity.austin.ammasso.com> Message-ID: <43B484F5.7030603@ichips.intel.com> Tom Tucker wrote: I'm a lot slow to review this, but comments below. I'll start to address some of them that affect the generic code next week, in particular changes to ib_addr. > Index: core/cm.c > =================================================================== > --- core/cm.c (revision 4186) > +++ core/cm.c (working copy) > @@ -3227,6 +3227,10 @@ > int ret; > u8 i; > > + /* Ignore RNIC devices */ > + if (device->node_type == IB_NODE_RNIC) > + return; > + > cm_dev = kmalloc(sizeof(*cm_dev) + sizeof(*port) * > device->phys_port_cnt, GFP_KERNEL); > if (!cm_dev) > @@ -3291,6 +3295,10 @@ > if (!cm_dev) > return; > > + /* Ignore RNIC devices */ > + if (device->node_type == IB_NODE_RNIC) > + return; > + > write_lock_irqsave(&cm.device_lock, flags); > list_del(&cm_dev->list); > write_unlock_irqrestore(&cm.device_lock, flags); The changes to cm_remove_one() are not needed. ib_get_client_data() should return NULL because IB_NODE_RNIC is skipped in cm_add_one(). > Index: core/addr.c > =================================================================== > --- core/addr.c (revision 4186) > +++ core/addr.c (working copy) > @@ -73,8 +73,13 @@ > if (!dev) > return -EADDRNOTAVAIL; > > - *gid = *(union ib_gid *) (dev->dev_addr + 4); > - *pkey = addr_get_pkey(dev); > + if (dev->type == ARPHRD_INFINIBAND) { > + *gid = *(union ib_gid *) (dev->dev_addr + 4); > + *pkey = addr_get_pkey(dev); > + } else { > + *gid = *(union ib_gid *) (dev->dev_addr); > + *pkey = 0; > + } > dev_put(dev); > return 0; > } If this call is being used, we should consider changing it to something more generic, rather than returning a "gid" as the hardware address for a non-IB device. One possibility is to make the API return the hardware address of a given IP address. The CMA can then break that address into a GID/pkey pair if needed. > @@ -476,8 +498,22 @@ > state = cma_exch(id_priv, CMA_DESTROYING); > cma_cancel_operation(id_priv, state); > > - if (id_priv->cm_id && !IS_ERR(id_priv->cm_id)) > - ib_destroy_cm_id(id_priv->cm_id); > + if (id->device) { > + switch (id->device->node_type) { > + case IB_NODE_RNIC: > + if (id_priv->cm_id.iw && !IS_ERR(id_priv->cm_id.iw)) { > + iw_destroy_cm_id(id_priv->cm_id.iw); > + id_priv->cm_id.iw = 0; > + } > + break; > + default: > + if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) { > + ib_destroy_cm_id(id_priv->cm_id.ib); > + id_priv->cm_id.ib = 0; The iw/ib devices should be set to NULL instead of assigned 0. > + ret = cma_notify_user(id_priv, > + event_type, > + event->status, > + event->private_data, > + event->private_data_len); > + if (ret) { > + /* Destroy the CM ID by returning a non-zero value. 
*/ > + id_priv->cm_id.iw = NULL; > + cma_exch(id_priv, CMA_DESTROYING); > + cma_release_remove(id_priv); > + rdma_destroy_id(&id_priv->id); > + return ret; > + } > + > + cma_release_remove(id_priv); > + return ret; > +} This looks different than the cma_ib_handler, and it makes me think that the cma_ib_handler has a bug where it doesn't decrement dev_remove. > +static int iw_conn_req_handler(struct iw_cm_id *cm_id, > + struct iw_cm_event *iw_event) > +{ > + struct rdma_cm_id* new_cm_id; > + struct rdma_id_private *listen_id, *conn_id; > + struct sockaddr_in* sin; > + int ret; > + > + listen_id = cm_id->context; > + atomic_inc(&listen_id->dev_remove); > + if (!cma_comp(listen_id, CMA_LISTEN)) { > + ret = -ECONNABORTED; > + goto out; > + } > + > + /* Create a new RDMA id the new IW CM ID */ > + new_cm_id = rdma_create_id(listen_id->id.event_handler, > + listen_id->id.context); > + if (!new_cm_id) { > + ret = -ENOMEM; > + goto out; > + } > + conn_id = container_of(new_cm_id, struct rdma_id_private, id); > + atomic_inc(&conn_id->dev_remove); > + conn_id->state = CMA_CONNECT; > + > + /* New connection inherits device from parent */ > + cma_attach_to_dev(conn_id, listen_id->cma_dev); cma_attach_to_dev doesn't provide synchronization around cma_dev->id_list. Access to that list needs to be protected with 'mutex'. Other than that, I think that this works fine. > @@ -785,8 +950,9 @@ > goto out; > > list_add_tail(&id_priv->list, &listen_any_list); > - list_for_each_entry(cma_dev, &dev_list, list) > + list_for_each_entry(cma_dev, &dev_list, list) { > cma_listen_on_dev(id_priv, cma_dev); > + } Please drop the extra braces. > @@ -796,7 +962,6 @@ > { > struct rdma_id_private *id_priv; > int ret; > - > id_priv = container_of(id, struct rdma_id_private, id); > if (!cma_comp_exch(id_priv, CMA_ADDR_BOUND, CMA_LISTEN)) > return -EINVAL; Please keep the blank line. > @@ -890,6 +1058,30 @@ > +static int cma_resolve_iw_route(struct rdma_id_private *id_priv, int timeout_ms) > +{ > + enum rdma_cm_event_type event = RDMA_CM_EVENT_ROUTE_RESOLVED; > + int rc; Please use 'ret' instead of 'rc' to match the rest of the code. > + > + atomic_inc(&id_priv->dev_remove); > + > + if (!cma_comp_exch(id_priv, CMA_ROUTE_QUERY, CMA_ROUTE_RESOLVED)) > + BUG_ON(1); The device associated with the id could have been removed while the user was in the process of making this call. We should simply fail the call here. > + > + rc = cma_notify_user(id_priv, event, 0, NULL, 0); > + if (rc) { > + cma_exch(id_priv, CMA_DESTROYING); > + cma_release_remove(id_priv); > + cma_deref_id(id_priv); > + rdma_destroy_id(&id_priv->id); > + return rc; > + } The callback needs to come from another thread other than the one that the user called down with. Calling the user back in their own thread can make it difficult for them to provide synchronization. You can use the rdma_wq that's been exposed in ib_addr. Also, the user is likely to call this routine from within a CMA callback (such as after ib_resolve_addr), so deadlock will occur if you try to destroy the id. 
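For example, something along these lines (an untested sketch; the work field
on rdma_id_private and the handler name are made up for illustration, and it
assumes the usual 3-argument INIT_WORK of current kernels):

	/* Runs on rdma_wq, so the user callback never happens on the
	 * thread that called down, and destroying the id is safe. */
	static void cma_iw_route_handler(void *context)
	{
		struct rdma_id_private *id_priv = context;

		if (cma_notify_user(id_priv, RDMA_CM_EVENT_ROUTE_RESOLVED,
				    0, NULL, 0)) {
			cma_exch(id_priv, CMA_DESTROYING);
			cma_release_remove(id_priv);
			cma_deref_id(id_priv);
			rdma_destroy_id(&id_priv->id);
			return;
		}
		cma_release_remove(id_priv);
		cma_deref_id(id_priv);
	}

	/* ...and in cma_resolve_iw_route(), instead of notifying inline: */
	INIT_WORK(&id_priv->work, cma_iw_route_handler, id_priv);
	queue_work(rdma_wq, &id_priv->work);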
> + > + cma_release_remove(id_priv); > + cma_deref_id(id_priv); > + return rc; > +} > + > int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms) > { > struct rdma_id_private *id_priv; > @@ -952,20 +1147,133 @@ > + > +/* Find the local interface with a route to the specified address and > + * bind the CM ID to this interface's CMA device > + */ > +static int cma_acquire_iw_dev(struct rdma_cm_id* id, struct sockaddr* addr) > +{ > + int ret = -ENOENT; > + struct cma_device* cma_dev; > + struct rdma_id_private *id_priv; > + struct sockaddr_in* sin; > + struct rtable *rt = 0; > + struct flowi fl; > + struct net_device* netdev; > + struct in_addr src_ip; > + unsigned char* dev_addr; > + > + sin = (struct sockaddr_in*)addr; > + if (sin->sin_family != AF_INET) > + return -EINVAL; > + > + id_priv = container_of(id, struct rdma_id_private, id); > + > + /* If the address is local, use the device. If it is remote, > + * look up a route to get the local address > + */ > + netdev = ip_dev_find(sin->sin_addr.s_addr); > + if (netdev) { > + src_ip = sin->sin_addr; > + dev_addr = netdev->dev_addr; > + dev_put(netdev); > + } else { > + memset(&fl, 0, sizeof(fl)); > + fl.nl_u.ip4_u.daddr = sin->sin_addr.s_addr; > + if (ip_route_output_key(&rt, &fl)) { > + return -ENETUNREACH; > + } > + dev_addr = rt->idev->dev->dev_addr; > + src_ip.s_addr = rt->rt_src; > + > + ip_rt_put(rt); > + } Can we push the above code into ib_addr? > + down(&mutex); > + > + list_for_each_entry(cma_dev, &dev_list, list) { > + if (memcmp(dev_addr, > + &cma_dev->node_guid, > + sizeof(cma_dev->node_guid)) == 0) { > + /* If we find the device, then check if this > + * is an iWARP device. If it is, then call the > + * callback handler immediately because we > + * already have the native address > + */ I'm not following this comment. What callback is being invoked? > + if (cma_dev->device->node_type == IB_NODE_RNIC) { > + struct sockaddr_in* cm_sin; > + /* Set our source address */ > + cm_sin = (struct sockaddr_in*) > + &id_priv->id.route.addr.src_addr; > + cm_sin->sin_family = AF_INET; > + cm_sin->sin_addr.s_addr = src_ip.s_addr; > + > + /* Claim the device in the mutex */ > + cma_attach_to_dev(id_priv, cma_dev); > + ret = 0; > + break; > + } > + } > + } > + up(&mutex); > + > + return ret; > +} I'd like to see if it's possible to merge this call with cma_acquire_ib_dev() and create a new routine, cma_acquire_dev() that can walk the device list and check for both. Maybe if ib_addr returned the full hardware address, along with a device type it might be possible. (Although the likelihood of a hardware address collision seems near impossible.) 
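Roughly like this (an illustrative sketch only, assuming ib_addr can hand
back the raw hardware address for either device type):

	static int cma_acquire_dev(struct rdma_id_private *id_priv,
				   unsigned char *dev_addr)
	{
		struct cma_device *cma_dev;
		int ret = -ENOENT;

		down(&mutex);
		list_for_each_entry(cma_dev, &dev_list, list) {
			/* node_guid doubles as the hardware address for
			 * both IB HCAs and RNICs in this comparison. */
			if (!memcmp(dev_addr, &cma_dev->node_guid,
				    sizeof(cma_dev->node_guid))) {
				cma_attach_to_dev(id_priv, cma_dev);
				ret = 0;
				break;
			}
		}
		up(&mutex);
		return ret;
	}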
> int rdma_resolve_addr(struct rdma_cm_id *id, struct sockaddr *src_addr, > struct sockaddr *dst_addr, int timeout_ms) > { > struct rdma_id_private *id_priv; > - int ret; > + int ret = 0; > > id_priv = container_of(id, struct rdma_id_private, id); > if (!cma_comp_exch(id_priv, CMA_IDLE, CMA_ADDR_QUERY)) > return -EINVAL; > > atomic_inc(&id_priv->refcount); > + > id->route.addr.dst_addr = *dst_addr; > - ret = ib_resolve_addr(src_addr, dst_addr, &id->route.addr.addr.ibaddr, > - timeout_ms, addr_handler, id_priv); > + > + if (cma_acquire_iw_dev(id, dst_addr)==0) { > + > + enum rdma_cm_event_type event; > + > + cma_exch(id_priv, CMA_ADDR_RESOLVED); > + > + atomic_inc(&id_priv->dev_remove); > + > + event = RDMA_CM_EVENT_ADDR_RESOLVED; > + if (cma_notify_user(id_priv, event, 0, NULL, 0)) { > + cma_exch(id_priv, CMA_DESTROYING); > + cma_deref_id(id_priv); > + cma_release_remove(id_priv); > + rdma_destroy_id(&id_priv->id); > + return -EINVAL; > + } Similar to other comments. Callbacks should be scheduled to a separate thread. The behavior is also slightly different. The IB code will return a source IP address that may be used to connect to the destination address if one is not given. This is needed in order to perform the reverse resolution on the remote side. > + cma_release_remove(id_priv); > + cma_deref_id(id_priv); > + > + } else { > + > + ret = ib_resolve_addr(src_addr, > + dst_addr, &id->route.addr.addr.ibaddr, > + timeout_ms, addr_handler, id_priv); We might be able to make this call generic by replacing ibaddr with source and destination hardware addresses. Users that really want to know what 'gid' they're on could be given functions that extract the gid and pkey from the hardware addresses. I.e. struct ib_addr could be renamed to struct rdma_addr and contain source and destination hardware addresses. > @@ -980,10 +1288,13 @@ > int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr) > { > struct rdma_id_private *id_priv; > + struct sockaddr_in* sin; > struct ib_addr *ibaddr = &id->route.addr.addr.ibaddr; > int ret; > > - if (addr->sa_family != AF_INET) > + sin = (struct sockaddr_in*)addr; > + > + if (sin->sin_family != AF_INET) > return -EINVAL; Please remove this change. Right now, the check's only there because the code does not fully support IPv6 addressing. > int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) > { > struct rdma_id_private *id_priv; > int ret; > > id_priv = container_of(id, struct rdma_id_private, id); > - if (!cma_comp_exch(id_priv, CMA_ROUTE_RESOLVED, CMA_CONNECT)) > + if (!cma_comp_exch(id_priv, CMA_ROUTE_RESOLVED, CMA_CONNECT)) Please undo extra white space at the end of the line. > @@ -1190,7 +1551,6 @@ > { > struct rdma_id_private *id_priv; > int ret; > - Please add blank line back in. 
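To make the rdma_addr idea above concrete, the renamed structure might look
something like this (field and helper names are illustrative only, not an
actual patch):

	struct rdma_addr {
		struct sockaddr src_addr;
		struct sockaddr dst_addr;
		unsigned char   src_dev_addr[MAX_ADDR_LEN];
		unsigned char   dst_dev_addr[MAX_ADDR_LEN];
	};

	/* IB-aware users could still extract the GID, e.g. from an
	 * IPoIB-style hardware address where the GID starts at byte 4: */
	static inline void rdma_addr_get_sgid(struct rdma_addr *addr,
					      union ib_gid *gid)
	{
		memcpy(gid, addr->src_dev_addr + 4, sizeof *gid);
	}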
> id_priv = container_of(id, struct rdma_id_private, id); > if (!cma_comp(id_priv, CMA_CONNECT)) > return -EINVAL; - Sean From nacc at us.ibm.com Thu Dec 29 17:31:28 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Thu, 29 Dec 2005 17:31:28 -0800 Subject: [openib-general] ISER fails to build on 2.6.15-rc7-git3 (svn=4654) Message-ID: <20051230013128.GB8111@us.ibm.com> Hi, Building 2.6.15-rc7-git3 on ppc64 with svn=4654 kernel components leads to: drivers/infiniband/ulp/iser/iscsi_iser.c: In function `iscsi_iser_conn_set_param': drivers/infiniband/ulp/iser/iscsi_iser.c:1436: error: `ISCSI_PARAM_RDMAEXTENSIONS' undeclared (first use in this function) drivers/infiniband/ulp/iser/iscsi_iser.c:1436: error: (Each undeclared identifier is reported only once drivers/infiniband/ulp/iser/iscsi_iser.c:1436: error: for each function it appears in.) drivers/infiniband/ulp/iser/iscsi_iser.c: In function `iscsi_iser_conn_get_param': drivers/infiniband/ulp/iser/iscsi_iser.c:1496: error: `ISCSI_PARAM_RDMAEXTENSIONS' undeclared (first use in this function) drivers/infiniband/ulp/iser/iscsi_iser.c: At top level: drivers/infiniband/ulp/iser/iscsi_iser.c:1634: error: unknown field `af' specified in initializer drivers/infiniband/ulp/iser/iscsi_iser.c:1634: warning: initialization makes pointer from integer without a cast drivers/infiniband/ulp/iser/iscsi_iser.c:1635: error: unknown field `rdma' specified in initializer make[3]: *** [drivers/infiniband/ulp/iser/iscsi_iser.o] Error 1 make[2]: *** [drivers/infiniband/ulp/iser] Error 2 make[1]: *** [drivers/infiniband] Error 2 make: *** [drivers] Error 2 Thanks, Nish From nacc at us.ibm.com Thu Dec 29 17:36:06 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Thu, 29 Dec 2005 17:36:06 -0800 Subject: [openib-general] ISER fails to build on 2.6.15-rc7-git3 (svn=4654) In-Reply-To: <20051230013128.GB8111@us.ibm.com> References: <20051230013128.GB8111@us.ibm.com> Message-ID: <20051230013606.GC8111@us.ibm.com> On 29.12.2005 [17:31:28 -0800], Nishanth Aravamudan wrote: > Hi, > > Building 2.6.15-rc7-git3 on ppc64 with svn=4654 kernel components leads > to: > > drivers/infiniband/ulp/iser/iscsi_iser.c: In function `iscsi_iser_conn_set_param': > drivers/infiniband/ulp/iser/iscsi_iser.c:1436: error: `ISCSI_PARAM_RDMAEXTENSIONS' undeclared (first use in this function) > drivers/infiniband/ulp/iser/iscsi_iser.c:1436: error: (Each undeclared identifier is reported only once > drivers/infiniband/ulp/iser/iscsi_iser.c:1436: error: for each function it appears in.) 
> drivers/infiniband/ulp/iser/iscsi_iser.c: In function `iscsi_iser_conn_get_param': > drivers/infiniband/ulp/iser/iscsi_iser.c:1496: error: `ISCSI_PARAM_RDMAEXTENSIONS' undeclared (first use in this function) > drivers/infiniband/ulp/iser/iscsi_iser.c: At top level: > drivers/infiniband/ulp/iser/iscsi_iser.c:1634: error: unknown field `af' specified in initializer > drivers/infiniband/ulp/iser/iscsi_iser.c:1634: warning: initialization makes pointer from integer without a cast > drivers/infiniband/ulp/iser/iscsi_iser.c:1635: error: unknown field `rdma' specified in initializer > make[3]: *** [drivers/infiniband/ulp/iser/iscsi_iser.o] Error 1 > make[2]: *** [drivers/infiniband/ulp/iser] Error 2 > make[1]: *** [drivers/infiniband] Error 2 > make: *** [drivers] Error 2 Forgot one detail, this is with all relevant CONFIG_INFINIBAND options as modules: CONFIG_INFINIBAND_ISER=m CONFIG_INFINIBAND_IPOIB=m CONFIG_INFINIBAND_SRP=m CONFIG_INFINIBAND_SDP=m CONFIG_INFINIBAND_SDP_SEND_ZCOPY=m CONFIG_INFINIBAND_SDP_RECV_ZCOPY=m CONFIG_KDAPL_INFINIBAND=m CONFIG_KDAPL=m CONFIG_INFINIBAND_EHCA=m CONFIG_IPATH_CORE=m CONFIG_IPATH_ETHER=m CONFIG_INFINIBAND_IPATH=m CONFIG_INFINIBAND_MTHCA=m CONFIG_INFINIBAND=m CONFIG_INFINIBAND_USER_MAD=m CONFIG_INFINIBAND_USER_ACCESS=m CONFIG_MODULES=y Thanks, Nish From johann at pathscale.com Thu Dec 29 18:56:27 2005 From: johann at pathscale.com (Johann George) Date: Thu, 29 Dec 2005 18:56:27 -0800 Subject: [openib-general] PathScale license In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F11421DE@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F11421DE@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <20051230025627.GA2706@cuprite.internal.keyresearch.com> > The plain reading of pathscale's license grants an unencumbered license > *to the code*. It merely refrains from waving any related hardware rights. That is exactly PathScale's intentions. > The code is not being restricted to work only with the patented hardware, > correct? Absolutely. PathScale is not trying to restrict use of the code in any way. Johann From bos at pathscale.com Thu Dec 29 19:17:29 2005 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 29 Dec 2005 19:17:29 -0800 Subject: [openib-general] Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver In-Reply-To: <200512291901.jBTJ1rOm017519@laptop11.inf.utfsm.cl> References: <200512291901.jBTJ1rOm017519@laptop11.inf.utfsm.cl> Message-ID: <1135912649.7790.11.camel@localhost.localdomain> On Thu, 2005-12-29 at 16:01 -0300, Horst von Brand wrote: > > - Renamed _BITS_PER_BYTE to BITS_PER_BYTE, and moved it into > > linux/types.h > Haven't come across anything with this not 8 for a /long/ time now, and no > Linux on that in sight. The point isn't that it might change, but that it makes code clearer to use BITS_PER_BYTE in arithmetic than to have the magic number 8 sprinkled around mysteriously. References: <584777b6f4dc5269fa89.1135816296@eng-12.pathscale.com> <84144f020512291124sd895dfbp87ca9fd75552d671@mail.gmail.com> Message-ID: <1135912781.7790.15.camel@localhost.localdomain> On Thu, 2005-12-29 at 21:24 +0200, Pekka Enberg wrote: > [Copy-paste reuse alert!] Yep, thanks for pointing that out. 
The source file in question is about to go on a serious diet :-) References: <20051230013128.GB8111@us.ibm.com> Message-ID: <20051230033131.GA3608@cuprite.internal.keyresearch.com> > Building 2.6.15-rc7-git3 on ppc64 with svn=4654 kernel components leads > to: > > drivers/infiniband/ulp/iser/iscsi_iser.c: In function `iscsi_iser_conn_set_param': Also having trouble compiling iser on revision 4655 under Linux 2.6.14.4 on x86_64. Some of the errors: In file included from drivers/infiniband/ulp/iser/iser.h:47, from drivers/infiniband/ulp/iser/iser_mod.c:52: drivers/infiniband/ulp/iser/iscsi_iser.h:27:30: scsi/iscsi_proto.h: No such file or directory In file included from drivers/infiniband/ulp/iser/iser.h:47, from drivers/infiniband/ulp/iser/iser_mod.c:52: drivers/infiniband/ulp/iser/iscsi_iser.h:216: error: field `tmhdr' has incomplete type drivers/infiniband/ulp/iser/iscsi_iser.h:240: error: field `hdr' has incomplete type Johann From halr at voltaire.com Thu Dec 29 20:29:10 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Dec 2005 23:29:10 -0500 Subject: [openib-general] ISER fails to build on 2.6.15-rc7-git3 (svn=4654) In-Reply-To: <20051230013128.GB8111@us.ibm.com> References: <20051230013128.GB8111@us.ibm.com> Message-ID: <1135916950.4331.618.camel@hal.voltaire.com> Hi Nish, On Thu, 2005-12-29 at 20:31, Nishanth Aravamudan wrote: > Hi, > > Building 2.6.15-rc7-git3 on ppc64 with svn=4654 kernel components leads > to: > > drivers/infiniband/ulp/iser/iscsi_iser.c: In function `iscsi_iser_conn_set_param': > drivers/infiniband/ulp/iser/iscsi_iser.c:1436: error: `ISCSI_PARAM_RDMAEXTENSIONS' undeclared (first use in this function) > drivers/infiniband/ulp/iser/iscsi_iser.c:1436: error: (Each undeclared identifier is reported only once > drivers/infiniband/ulp/iser/iscsi_iser.c:1436: error: for each function it appears in.) > drivers/infiniband/ulp/iser/iscsi_iser.c: In function `iscsi_iser_conn_get_param': > drivers/infiniband/ulp/iser/iscsi_iser.c:1496: error: `ISCSI_PARAM_RDMAEXTENSIONS' undeclared (first use in this function) > drivers/infiniband/ulp/iser/iscsi_iser.c: At top level: > drivers/infiniband/ulp/iser/iscsi_iser.c:1634: error: unknown field `af' specified in initializer > drivers/infiniband/ulp/iser/iscsi_iser.c:1634: warning: initialization makes pointer from integer without a cast > drivers/infiniband/ulp/iser/iscsi_iser.c:1635: error: unknown field `rdma' specified in initializer > make[3]: *** [drivers/infiniband/ulp/iser/iscsi_iser.o] Error 1 > make[2]: *** [drivers/infiniband/ulp/iser] Error 2 > make[1]: *** [drivers/infiniband] Error 2 > make: *** [drivers] Error 2 There is an iscsi patch required for this as iser requires an open-iscsi version which is subsequent to what is in 2.6.15-rc7-git3. I'm not sure the best way to handle this yet as the build is different for 2.6.14 which does not contain open-iscsi. 
-- Hal From halr at voltaire.com Thu Dec 29 20:33:01 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Dec 2005 23:33:01 -0500 Subject: [openib-general] ISER fails to build on 2.6.15-rc7-git3 (svn=4654) In-Reply-To: <20051230033131.GA3608@cuprite.internal.keyresearch.com> References: <20051230013128.GB8111@us.ibm.com> <20051230033131.GA3608@cuprite.internal.keyresearch.com> Message-ID: <1135917181.4331.628.camel@hal.voltaire.com> Hi Johann, On Thu, 2005-12-29 at 22:31, Johann George wrote: > > Building 2.6.15-rc7-git3 on ppc64 with svn=4654 kernel components leads > > to: > > > > drivers/infiniband/ulp/iser/iscsi_iser.c: In function `iscsi_iser_conn_set_param': > > Also having trouble compiling iser on revision 4655 under Linux 2.6.14.4 on > x86_64. Some of the errors: > > In file included from drivers/infiniband/ulp/iser/iser.h:47, > from drivers/infiniband/ulp/iser/iser_mod.c:52: > drivers/infiniband/ulp/iser/iscsi_iser.h:27:30: scsi/iscsi_proto.h: No such file or directory > In file included from drivers/infiniband/ulp/iser/iser.h:47, > from drivers/infiniband/ulp/iser/iser_mod.c:52: > drivers/infiniband/ulp/iser/iscsi_iser.h:216: error: field `tmhdr' has incomplete type > drivers/infiniband/ulp/iser/iscsi_iser.h:240: error: field `hdr' has incomplete type This is different from Nish's issue in that open-iscsi is not part of 2.6.14 but will be in 2.6.15. I will post updated directions for this on the iser wiki tomorrow. Can you give this another shot after that ? Thanks. -- Hal From nacc at us.ibm.com Thu Dec 29 21:04:21 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Thu, 29 Dec 2005 21:04:21 -0800 Subject: [openib-general] ISER fails to build on 2.6.15-rc7-git3 (svn=4654) In-Reply-To: <1135916950.4331.618.camel@hal.voltaire.com> References: <20051230013128.GB8111@us.ibm.com> <1135916950.4331.618.camel@hal.voltaire.com> Message-ID: <20051230050421.GA6431@us.ibm.com> On 29.12.2005 [23:29:10 -0500], Hal Rosenstock wrote: > Hi Nish, > > On Thu, 2005-12-29 at 20:31, Nishanth Aravamudan wrote: > > Hi, > > > > Building 2.6.15-rc7-git3 on ppc64 with svn=4654 kernel components leads > > to: > > > > drivers/infiniband/ulp/iser/iscsi_iser.c: In function `iscsi_iser_conn_set_param': > > drivers/infiniband/ulp/iser/iscsi_iser.c:1436: error: `ISCSI_PARAM_RDMAEXTENSIONS' undeclared (first use in this function) > > drivers/infiniband/ulp/iser/iscsi_iser.c:1436: error: (Each undeclared identifier is reported only once > > drivers/infiniband/ulp/iser/iscsi_iser.c:1436: error: for each function it appears in.) > > drivers/infiniband/ulp/iser/iscsi_iser.c: In function `iscsi_iser_conn_get_param': > > drivers/infiniband/ulp/iser/iscsi_iser.c:1496: error: `ISCSI_PARAM_RDMAEXTENSIONS' undeclared (first use in this function) > > drivers/infiniband/ulp/iser/iscsi_iser.c: At top level: > > drivers/infiniband/ulp/iser/iscsi_iser.c:1634: error: unknown field `af' specified in initializer > > drivers/infiniband/ulp/iser/iscsi_iser.c:1634: warning: initialization makes pointer from integer without a cast > > drivers/infiniband/ulp/iser/iscsi_iser.c:1635: error: unknown field `rdma' specified in initializer > > make[3]: *** [drivers/infiniband/ulp/iser/iscsi_iser.o] Error 1 > > make[2]: *** [drivers/infiniband/ulp/iser] Error 2 > > make[1]: *** [drivers/infiniband] Error 2 > > make: *** [drivers] Error 2 > > There is an iscsi patch required for this as iser requires an open-iscsi > version which is subsequent to what is in 2.6.15-rc7-git3. 
I'm not sure > the best way to handle this yet as the build is different for 2.6.14 > which does not contain open-iscsi. Where can I find this patch? I can temporarily add it to the build-path for the svn-based builds, until a better solution is found. Thanks, Nish From johann at pathscale.com Thu Dec 29 21:06:03 2005 From: johann at pathscale.com (Johann George) Date: Thu, 29 Dec 2005 21:06:03 -0800 Subject: [openib-general] ISER fails to build on 2.6.15-rc7-git3 (svn=4654) In-Reply-To: <1135917181.4331.628.camel@hal.voltaire.com> References: <20051230013128.GB8111@us.ibm.com> <20051230033131.GA3608@cuprite.internal.keyresearch.com> <1135917181.4331.628.camel@hal.voltaire.com> Message-ID: <20051230050603.GA5879@cuprite.internal.keyresearch.com> > I will post updated directions for this on the iser wiki tomorrow. Can > you give this another shot after that ? Certainly. Thanks, Hal. Johann From halr at voltaire.com Fri Dec 30 04:26:34 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Dec 2005 07:26:34 -0500 Subject: [openib-general] ISER fails to build on 2.6.15-rc7-git3 (svn=4654) In-Reply-To: <20051230050421.GA6431@us.ibm.com> References: <20051230013128.GB8111@us.ibm.com> <1135916950.4331.618.camel@hal.voltaire.com> <20051230050421.GA6431@us.ibm.com> Message-ID: <1135945593.4331.1109.camel@hal.voltaire.com> Hi Nish, On Fri, 2005-12-30 at 00:04, Nishanth Aravamudan wrote: > On 29.12.2005 [23:29:10 -0500], Hal Rosenstock wrote: > > Hi Nish, > > > > On Thu, 2005-12-29 at 20:31, Nishanth Aravamudan wrote: > > > Hi, > > > > > > Building 2.6.15-rc7-git3 on ppc64 with svn=4654 kernel components leads > > > to: > > > > > > drivers/infiniband/ulp/iser/iscsi_iser.c: In function `iscsi_iser_conn_set_param': > > > drivers/infiniband/ulp/iser/iscsi_iser.c:1436: error: `ISCSI_PARAM_RDMAEXTENSIONS' undeclared (first use in this function) > > > drivers/infiniband/ulp/iser/iscsi_iser.c:1436: error: (Each undeclared identifier is reported only once > > > drivers/infiniband/ulp/iser/iscsi_iser.c:1436: error: for each function it appears in.) > > > drivers/infiniband/ulp/iser/iscsi_iser.c: In function `iscsi_iser_conn_get_param': > > > drivers/infiniband/ulp/iser/iscsi_iser.c:1496: error: `ISCSI_PARAM_RDMAEXTENSIONS' undeclared (first use in this function) > > > drivers/infiniband/ulp/iser/iscsi_iser.c: At top level: > > > drivers/infiniband/ulp/iser/iscsi_iser.c:1634: error: unknown field `af' specified in initializer > > > drivers/infiniband/ulp/iser/iscsi_iser.c:1634: warning: initialization makes pointer from integer without a cast > > > drivers/infiniband/ulp/iser/iscsi_iser.c:1635: error: unknown field `rdma' specified in initializer > > > make[3]: *** [drivers/infiniband/ulp/iser/iscsi_iser.o] Error 1 > > > make[2]: *** [drivers/infiniband/ulp/iser] Error 2 > > > make[1]: *** [drivers/infiniband] Error 2 > > > make: *** [drivers] Error 2 > > > > There is an iscsi patch required for this as iser requires an open-iscsi > > version which is subsequent to what is in 2.6.15-rc7-git3. I'm not sure > > the best way to handle this yet as the build is different for 2.6.14 > > which does not contain open-iscsi. > > Where can I find this patch? I can temporarily add it to the build-path > for the svn-based builds, until a better solution is found. I am attaching the patch for this. Note that this patch is for 2.6.15-rc and not 2.6.14 variants. It has been tested with 2.6.15-rc6. Please let me know if it works for you. Thanks. 
-- Hal diff -ru -x '*.o*' -x '*.ko' linux-2.6.15-rc7/drivers/scsi/iscsi_tcp.c linux-2.6.15-rc7-iser/drivers/scsi/iscsi_tcp.c --- linux-2.6.15-rc7/drivers/scsi/iscsi_tcp.c 2005-12-28 18:21:13.000000000 +0200 +++ linux-2.6.15-rc7-iser/drivers/scsi/iscsi_tcp.c 2005-12-29 08:55:55.000000000 +0200 @@ -3590,6 +3590,8 @@ .name = "tcp", .caps = CAP_RECOVERY_L0 | CAP_MULTI_R2T | CAP_HDRDGST | CAP_DATADGST, + .af = AF_INET, + .rdma = 0, .host_template = &iscsi_sht, .hostdata_size = sizeof(struct iscsi_session), .max_conn = 1, diff -ru -x '*.o*' -x '*.ko' linux-2.6.15-rc7/drivers/scsi/scsi_transport_iscsi.c linux-2.6.15-rc7-iser/drivers/scsi/scsi_transport_iscsi.c --- linux-2.6.15-rc7/drivers/scsi/scsi_transport_iscsi.c 2005-12-28 18:21:13.000000000 +0200 +++ linux-2.6.15-rc7-iser/drivers/scsi/scsi_transport_iscsi.c 2005-12-29 08:57:44.000000000 +0200 @@ -117,6 +117,8 @@ show_transport_attr(max_lun, "%d"); show_transport_attr(max_conn, "%d"); show_transport_attr(max_cmd_len, "%d"); +show_transport_attr(af, "%d"); +show_transport_attr(rdma, "%d"); static struct attribute *iscsi_transport_attrs[] = { &class_device_attr_handle.attr, @@ -124,6 +126,8 @@ &class_device_attr_max_lun.attr, &class_device_attr_max_conn.attr, &class_device_attr_max_cmd_len.attr, + &class_device_attr_af.attr, + &class_device_attr_rdma.attr, NULL, }; diff -ru -x '*.o*' -x '*.ko' linux-2.6.15-rc7/include/scsi/iscsi_if.h linux-2.6.15-rc7-iser/include/scsi/iscsi_if.h --- linux-2.6.15-rc7/include/scsi/iscsi_if.h 2005-12-28 18:21:14.000000000 +0200 +++ linux-2.6.15-rc7-iser/include/scsi/iscsi_if.h 2005-12-29 09:02:43.000000000 +0200 @@ -160,8 +160,9 @@ ISCSI_PARAM_ERL = 11, ISCSI_PARAM_IFMARKER_EN = 12, ISCSI_PARAM_OFMARKER_EN = 13, + ISCSI_PARAM_RDMAEXTENSIONS = 14, }; -#define ISCSI_PARAM_MAX 14 +#define ISCSI_PARAM_MAX 15 typedef uint64_t iscsi_sessionh_t; /* iSCSI Data-Path session handle */ typedef uint64_t iscsi_connh_t; /* iSCSI Data-Path connection handle */ diff -ru -x '*.o*' -x '*.ko' linux-2.6.15-rc7/include/scsi/scsi_transport_iscsi.h linux-2.6.15-rc7-iser/include/scsi/scsi_transport_iscsi.h --- linux-2.6.15-rc7/include/scsi/scsi_transport_iscsi.h 2005-12-28 18:21:14.000000000 +0200 +++ linux-2.6.15-rc7-iser/include/scsi/scsi_transport_iscsi.h 2005-12-29 09:00:18.000000000 +0200 @@ -30,6 +30,8 @@ * * @name: transport name * @caps: iSCSI Data-Path capabilities + * @af socket address family + * @rdma indicates if transport supports RDMA operations * @create_session: create new iSCSI session object * @destroy_session: destroy existing iSCSI session object * @create_conn: create new iSCSI connection @@ -47,6 +49,8 @@ struct module *owner; char *name; unsigned int caps; + unsigned short af; + unsigned short rdma; struct scsi_host_template *host_template; int hostdata_size; int max_lun; From halr at voltaire.com Fri Dec 30 04:36:40 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Dec 2005 07:36:40 -0500 Subject: [openib-general] ISER fails to build on 2.6.15-rc7-git3 (svn=4654) In-Reply-To: <20051230050603.GA5879@cuprite.internal.keyresearch.com> References: <20051230013128.GB8111@us.ibm.com> <20051230033131.GA3608@cuprite.internal.keyresearch.com> <1135917181.4331.628.camel@hal.voltaire.com> <20051230050603.GA5879@cuprite.internal.keyresearch.com> Message-ID: <1135946199.4331.1131.camel@hal.voltaire.com> On Fri, 2005-12-30 at 00:06, Johann George wrote: > > I will post updated directions for this on the iser wiki tomorrow. Can > > you give this another shot after that ? > > Certainly. Thanks, Hal. 
This part is already dealt with in the OpenIB iser wiki: There is a patch
(linux-2.6.14-iscsi_includes.diff) in linux-kernel/patches which will get you
past the compile problem you reported.

The iSER wiki states:

When building with a 2.6.14 kernel, iSER requires open_iscsi header files
that are not part of the 2.6.14 kernel tree. The header files are available
at
https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/linux-2.6.14-iscsi_includes.diff

When building with these headers but without actually building open_iscsi,
the build produces the following warnings:

*** Warning: "iscsi_unregister_transport" [drivers/infiniband/ulp/iser/ib_iser.ko] undefined!
*** Warning: "iscsi_recv_pdu" [drivers/infiniband/ulp/iser/ib_iser.ko] undefined!
*** Warning: "iscsi_conn_error" [drivers/infiniband/ulp/iser/ib_iser.ko] undefined!
*** Warning: "iscsi_register_transport" [drivers/infiniband/ulp/iser/ib_iser.ko] undefined!

These are normal. It means the module will not load since those symbols are
missing and need to be provided by open_iscsi's scsi_transport_iscsi.ko.

The 2.6.15-rc6 kernel and later include open_iscsi and the header files, but
require an iser patch (to be available soon). The open-iscsi code is
available at http://www.open-iscsi.org/.

Please let me know if this gets you to this point or something different.
Thanks.

I am considering adding more instructions on open-iscsi to the OpenIB iser
wiki as I think this will help with 2.6.14 and earlier kernels (although
open-iscsi is only supported at 2.6.11 and beyond).

-- Hal

From halr at voltaire.com Fri Dec 30 06:51:33 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 30 Dec 2005 09:51:33 -0500
Subject: [openib-general] Re: [PATCH] osm: support for trivial PKey manager
In-Reply-To: <074q4sfbm8.fsf@swlab25.yok.mtl.com>
References: <074q4sfbm8.fsf@swlab25.yok.mtl.com>
Message-ID: <1135954292.4331.1348.camel@hal.voltaire.com>

Hi again Ofer,

On Thu, 2005-12-29 at 05:20, Ofer Gigi wrote:
> Hi Hal,
> My name is Ofer Gigi, and I am a new software engineer in Mellanox
> working on OpenSM.
> This patch provides a new manager that solves the following problem:
>
> OpenSM is not currently compliant with the spec statement:
> C14.62.1.1 Table 183 p870 l34:
> "However, the SM shall ensure that one of the P_KeyTable entries in every
> node contains either the value 0xFFFF (the default P_Key, full membership)
> or the value 0x7FFF (the default P_Key, partial membership)."
>
> Luckily, all IB devices come up from reset with a preconfigured 0xffff key.
> This was discovered during the last plugfest.
>
> To overcome this limitation I implemented a simple elementary PKey manager
> that will enforce the above rule (currently it adds 0xffff if missing).
>
> This additional manager would be used for a full PKey policy manager
> in the future.
>
> We have tested this patch.
>
> Thanks

Thanks. Applied. Some mechanical comments below (and also embedded).

The general rule is one thought per patch. osm_indent is separate from this.

Please try to ensure there is no extra whitespace at the end of the lines.
There were several places where it was present.

-- Hal

> Ofer G.
>
> Signed-off-by: Ofer Gigi

[snip...]
> Index: opensm/osm_state_mgr.c > =================================================================== > --- opensm/osm_state_mgr.c (revision 4651) > +++ opensm/osm_state_mgr.c (working copy) > @@ -2216,9 +2219,11 @@ osm_state_mgr_process( > } > } > } > + > /* Need to continue with lid assigning */ > osm_drop_mgr_process( p_mgr->p_drop_mgr ); > - p_mgr->state = OSM_SM_STATE_SET_SM_UCAST_LID; > + > + p_mgr->state = OSM_SM_STATE_SET_PKEY; > > /* > * If we are not MASTER already - this means that we are > @@ -2229,6 +2234,62 @@ osm_state_mgr_process( > osm_sm_state_mgr_process( p_mgr->p_sm_state_mgr, > OSM_SM_SIGNAL_DISCOVERY_COMPLETED ); > > + /* signal = osm_lid_mgr_process_sm( p_mgr->p_lid_mgr ); */ Why add this commented out line ? [I think this was also in one other place as well.] > + /* the returned signal might be DONE or DONE_PENDING */ > + signal = osm_pkey_mgr_process( p_mgr->p_pkey_mgr ); > + break; > + > + default: > + __osm_state_mgr_signal_error( p_mgr, signal ); > + signal = OSM_SIGNAL_NONE; > + break; > + } > + break; > + [snip...] > Index: opensm/osm_indent > =================================================================== > --- opensm/osm_indent (revision 4651) > +++ opensm/osm_indent (working copy) > @@ -63,8 +63,8 @@ > # -i3 Substitute indent with 3 spaces > # -npcs No space after procedure calls > # -prs Space after parenthesis > -# -nsai No space after if keyword > -# -nsaw No space after while keyword > +# -nsai No space after if keyword - removed > +# -nsaw No space after while keyword - removed Should these comments just be removed ? > # -sc Put * at left of comments in a block comment style > # -nsob Don't swallow unnecessary blank lines > # -ts3 Tab size is 3 > @@ -81,7 +81,7 @@ for sourcefile in $*; do > perl -piW -e's/\x0D//' "$sourcefile" > echo Processing $sourcefile > indent -bad -bap -bbb -nbbo -bl -bli0 -bls -cbi0 -ci3 -cli0 -ncs \ > - -hnl -i3 -npcs -prs -nsai -nsaf -nsaw -sc -nsob -ts3 -psl -bfda -nut $sourcefile > + -hnl -i3 -npcs -prs -sc -nsob -ts3 -psl -bfda -nut $sourcefile > > rm ${sourcefile}W > From brilong at cisco.com Fri Dec 30 07:05:54 2005 From: brilong at cisco.com (Brian Long) Date: Fri, 30 Dec 2005 10:05:54 -0500 Subject: [openib-general] Kickstart over OpenIB? Message-ID: <1135955154.5243.34.camel@brilong-lnx> Hello, I would like to know how difficult it would be to modify kickstart such that it would work over Infiniband. I've asked the Red Hat anaconda developers about this and, as you can see in the attached email, they believe IB only accepts netlink and no ioctls for network setup. Is this true? I am a system admin testing OpenIB on RHEL 4 Update 3 beta which was released a few days ago. I was able to kickstart over my standard Ethernet and all the IB drivers are loaded upon boot. How difficult would it be to get Anaconda to be able to use an IB link for kickstart such that a host's only connection to the outside world was IB? Thanks! /Brian/ -- Brian Long | | | IT Data Center Systems | .|||. .|||. Cisco Linux Developer | ..:|||||||:...:|||||||:.. Phone: (919) 392-7363 | C i s c o S y s t e m s -------------- next part -------------- An embedded message was scrubbed... From: Jeremy Katz Subject: RE: Kickstart over OpenIB? 
Date: Tue, 20 Dec 2005 11:54:13 -0500 Size: 3096 URL: From tom at opengridcomputing.com Fri Dec 30 08:08:03 2005 From: tom at opengridcomputing.com (Tom Tucker) Date: Fri, 30 Dec 2005 10:08:03 -0600 Subject: [openib-general] PathScale license In-Reply-To: <20051230025627.GA2706@cuprite.internal.keyresearch.com> References: <54AD0F12E08D1541B826BE97C98F99F11421DE@NT-SJCA-0751.brcm.ad.broadcom.com> <20051230025627.GA2706@cuprite.internal.keyresearch.com> Message-ID: <1135958883.8578.47.camel@strider.opengridcomputing.com> Does the iPod manual that documents how to operate the volume give the user rights to the iPod interface patent? Unfortunately -- no, thus all the goofy knock-offs. EVERY hardware vendor has patents related to its interface and implementation. EVERY hardware vendor that submits open source documents, demonstrates, and utilizes that hardware interface. To presume that the vendor is summarily dismissing its rights to its related patents is -- well -- wrong. Let's suppose that I'm wrong. If there *really* is a legal risk, then *every* hardware vendor shares that risk and a clause should be added to the generic GPL template. Otherwise, it does nothing but confuse and expand the legal encumbrance users assume when they download -- and that is very very bad. Linux absolutely needs hardware vendors and therefore must be sensitive to protecting their legal rights, but the Linux community must also be diligent in keeping the open source legal landscape as simple and uniform as possible -- or the value of open source dies one clause at a time. Thu, 2005-12-29 at 18:56 -0800, Johann George wrote: > > The plain reading of pathscale's license grants an unencumbered license > > *to the code*. It merely refrains from waving any related hardware rights. > > That is exactly PathScale's intentions. > > > The code is not being restricted to work only with the patented hardware, > > correct? > > Absolutely. PathScale is not trying to restrict use of the code in any way. > > Johann > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Fri Dec 30 08:03:13 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Dec 2005 11:03:13 -0500 Subject: [openib-general] Some opensm/osm_vl15intf.c questions Message-ID: <1135958592.4331.1438.camel@hal.voltaire.com> Hi Eitan, In chasing an issue with a trap repress not being sent in a certain scenario, I stumbled across the following questions about opensm/osm_vl15intf.c. 1. 
osm_vl15_post increments qp0_mads_outstanding when a response is expected (rfifo) and not when unsolicited (ufifo) (what appears to be called unicasts): osm_vl15_post: if( p_madw->resp_expected == TRUE ) { cl_qlist_insert_tail( &p_vl->rfifo, (cl_list_item_t*)p_madw ); cl_atomic_inc( &p_vl->p_stats->qp0_mads_outstanding ); } else { cl_qlist_insert_tail( &p_vl->ufifo, (cl_list_item_t*)p_madw ); } osm_vl15_shutdown retires all outstanding MADs as follows: osm_vl15_shutdown: while ( p_madw != (osm_madw_t*)cl_qlist_end( &p_vl->ufifo ) ) { if( osm_log_is_active( p_vl->p_log, OSM_LOG_DEBUG ) ) { osm_log( p_vl->p_log, OSM_LOG_DEBUG, "osm_vl15_shutdown: " "Releasing Response p_madw = %p\n", p_madw ); } osm_mad_pool_put( p_mad_pool, p_madw ); cl_atomic_dec( &p_vl->p_stats->qp0_mads_outstanding ); p_madw = (osm_madw_t*)cl_qlist_remove_head( &p_vl->ufifo ); } Either post should increment qp0_mads_outstanding for unsolicited or shutdown shouldn't decrement it when removing from ufifo. If you agree, which should it be ? 2. In the case of a failure from osm_vendor_send, __osm_vl15_poller decrements qp0_mads_outstanding regardless of whether a response is expected. This is inconsistent with the increment. This leads me to believe that this should also be incremented for unsolicited (unicasts) as well as those for which responses are expected. Is this correct or am I missing something ? So my conclusion is that in osm_vl15_post, it should be: if( p_madw->resp_expected == TRUE ) { cl_qlist_insert_tail( &p_vl->rfifo, (cl_list_item_t*)p_madw ); } else { cl_qlist_insert_tail( &p_vl->ufifo, (cl_list_item_t*)p_madw ); } cl_atomic_inc( &p_vl->p_stats->qp0_mads_outstanding ); If you agree, I will generate a patch for this. Thanks. -- Hal From nacc at us.ibm.com Fri Dec 30 08:13:26 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Fri, 30 Dec 2005 08:13:26 -0800 Subject: [openib-general] ISER fails to build on 2.6.15-rc7-git3 (svn=4654) In-Reply-To: <1135945593.4331.1109.camel@hal.voltaire.com> References: <20051230013128.GB8111@us.ibm.com> <1135916950.4331.618.camel@hal.voltaire.com> <20051230050421.GA6431@us.ibm.com> <1135945593.4331.1109.camel@hal.voltaire.com> Message-ID: <20051230161326.GD6431@us.ibm.com> On 30.12.2005 [07:26:34 -0500], Hal Rosenstock wrote: > Hi Nish, > > On Fri, 2005-12-30 at 00:04, Nishanth Aravamudan wrote: > > On 29.12.2005 [23:29:10 -0500], Hal Rosenstock wrote: > > > Hi Nish, > > > > > > On Thu, 2005-12-29 at 20:31, Nishanth Aravamudan wrote: > > > > Hi, > > > > > > > > Building 2.6.15-rc7-git3 on ppc64 with svn=4654 kernel components leads > > > > to: > > > > > > > > drivers/infiniband/ulp/iser/iscsi_iser.c: In function `iscsi_iser_conn_set_param': > > > > drivers/infiniband/ulp/iser/iscsi_iser.c:1436: error: `ISCSI_PARAM_RDMAEXTENSIONS' undeclared (first use in this function) > > > > drivers/infiniband/ulp/iser/iscsi_iser.c:1436: error: (Each undeclared identifier is reported only once > > > > drivers/infiniband/ulp/iser/iscsi_iser.c:1436: error: for each function it appears in.) 
> > > > drivers/infiniband/ulp/iser/iscsi_iser.c: In function `iscsi_iser_conn_get_param': > > > > drivers/infiniband/ulp/iser/iscsi_iser.c:1496: error: `ISCSI_PARAM_RDMAEXTENSIONS' undeclared (first use in this function) > > > > drivers/infiniband/ulp/iser/iscsi_iser.c: At top level: > > > > drivers/infiniband/ulp/iser/iscsi_iser.c:1634: error: unknown field `af' specified in initializer > > > > drivers/infiniband/ulp/iser/iscsi_iser.c:1634: warning: initialization makes pointer from integer without a cast > > > > drivers/infiniband/ulp/iser/iscsi_iser.c:1635: error: unknown field `rdma' specified in initializer > > > > make[3]: *** [drivers/infiniband/ulp/iser/iscsi_iser.o] Error 1 > > > > make[2]: *** [drivers/infiniband/ulp/iser] Error 2 > > > > make[1]: *** [drivers/infiniband] Error 2 > > > > make: *** [drivers] Error 2 > > > > > > There is an iscsi patch required for this as iser requires an open-iscsi > > > version which is subsequent to what is in 2.6.15-rc7-git3. I'm not sure > > > the best way to handle this yet as the build is different for 2.6.14 > > > which does not contain open-iscsi. > > > > Where can I find this patch? I can temporarily add it to the build-path > > for the svn-based builds, until a better solution is found. > > I am attaching the patch for this. Note that this patch is for > 2.6.15-rc and not 2.6.14 variants. It has been tested with > 2.6.15-rc6. Please let me know if it works for you. Thanks. Great, I will start some jobs with the patch now and let you know. Thanks, Nish From nacc at us.ibm.com Fri Dec 30 09:23:04 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Fri, 30 Dec 2005 09:23:04 -0800 Subject: [openib-general] ISER fails to build on 2.6.15-rc7-git3 (svn=4654) In-Reply-To: <20051230161326.GD6431@us.ibm.com> References: <20051230013128.GB8111@us.ibm.com> <1135916950.4331.618.camel@hal.voltaire.com> <20051230050421.GA6431@us.ibm.com> <1135945593.4331.1109.camel@hal.voltaire.com> <20051230161326.GD6431@us.ibm.com> Message-ID: <20051230172304.GE6431@us.ibm.com> On 30.12.2005 [08:13:26 -0800], Nishanth Aravamudan wrote: > On 30.12.2005 [07:26:34 -0500], Hal Rosenstock wrote: > > Hi Nish, > > > > On Fri, 2005-12-30 at 00:04, Nishanth Aravamudan wrote: > > > On 29.12.2005 [23:29:10 -0500], Hal Rosenstock wrote: > > > > Hi Nish, > > > > > > > > On Thu, 2005-12-29 at 20:31, Nishanth Aravamudan wrote: > > > > > Hi, > > > > > > > > > > Building 2.6.15-rc7-git3 on ppc64 with svn=4654 kernel components leads > > > > > to: > > > > > > > > > > drivers/infiniband/ulp/iser/iscsi_iser.c: In function `iscsi_iser_conn_set_param': > > > > > drivers/infiniband/ulp/iser/iscsi_iser.c:1436: error: `ISCSI_PARAM_RDMAEXTENSIONS' undeclared (first use in this function) > > > > > drivers/infiniband/ulp/iser/iscsi_iser.c:1436: error: (Each undeclared identifier is reported only once > > > > > drivers/infiniband/ulp/iser/iscsi_iser.c:1436: error: for each function it appears in.) 
> > > > > drivers/infiniband/ulp/iser/iscsi_iser.c: In function `iscsi_iser_conn_get_param': > > > > > drivers/infiniband/ulp/iser/iscsi_iser.c:1496: error: `ISCSI_PARAM_RDMAEXTENSIONS' undeclared (first use in this function) > > > > > drivers/infiniband/ulp/iser/iscsi_iser.c: At top level: > > > > > drivers/infiniband/ulp/iser/iscsi_iser.c:1634: error: unknown field `af' specified in initializer > > > > > drivers/infiniband/ulp/iser/iscsi_iser.c:1634: warning: initialization makes pointer from integer without a cast > > > > > drivers/infiniband/ulp/iser/iscsi_iser.c:1635: error: unknown field `rdma' specified in initializer > > > > > make[3]: *** [drivers/infiniband/ulp/iser/iscsi_iser.o] Error 1 > > > > > make[2]: *** [drivers/infiniband/ulp/iser] Error 2 > > > > > make[1]: *** [drivers/infiniband] Error 2 > > > > > make: *** [drivers] Error 2 > > > > > > > > There is an iscsi patch required for this as iser requires an open-iscsi > > > > version which is subsequent to what is in 2.6.15-rc7-git3. I'm not sure > > > > the best way to handle this yet as the build is different for 2.6.14 > > > > which does not contain open-iscsi. > > > > > > Where can I find this patch? I can temporarily add it to the build-path > > > for the svn-based builds, until a better solution is found. > > > > I am attaching the patch for this. Note that this patch is for > > 2.6.15-rc and not 2.6.14 variants. It has been tested with > > 2.6.15-rc6. Please let me know if it works for you. Thanks. > > Great, I will start some jobs with the patch now and let you know. The patch works, but does fail at LD time with: drivers/built-in.o(.text+0xfd784): In function `iscsi_iser_conn_failure': drivers/infiniband/ulp/iser/iscsi_iser.c:220: undefined reference to `.iscsi_conn_error' drivers/built-in.o(.text+0x1003b0): In function `iscsi_iser_exit': drivers/infiniband/ulp/iser/iscsi_iser.c:1989: undefined reference to `.iscsi_unregister_transport' drivers/built-in.o(.text+0x10046c): In function `iscsi_iser_init': drivers/infiniband/ulp/iser/iscsi_iser.c:1978: undefined reference to `.iscsi_register_transport' drivers/built-in.o(.text+0x100a18): In function `iscsi_iser_control_notify': drivers/infiniband/ulp/iser/iscsi_iser.c:1855: undefined reference to `.iscsi_recv_pdu' drivers/built-in.o(.text+0x100bc0):drivers/infiniband/ulp/iser/iscsi_iser.c:1899: undefined reference to `.iscsi_recv_pdu' drivers/built-in.o(.text+0x100cd4):drivers/infiniband/ulp/iser/iscsi_iser.c:1921: undefined reference to `.iscsi_recv_pdu' Unfortunately, this causes make to fail, killing the job (I realize these are technically non-critical errors). So, I'm just going to disable ISER for now in my tests. Thanks, Nish From halr at voltaire.com Fri Dec 30 09:30:18 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Dec 2005 12:30:18 -0500 Subject: [openib-general] ISER fails to build on 2.6.15-rc7-git3 (svn=4654) In-Reply-To: <20051230172304.GE6431@us.ibm.com> References: <20051230013128.GB8111@us.ibm.com> <1135916950.4331.618.camel@hal.voltaire.com> <20051230050421.GA6431@us.ibm.com> <1135945593.4331.1109.camel@hal.voltaire.com> <20051230161326.GD6431@us.ibm.com> <20051230172304.GE6431@us.ibm.com> Message-ID: <1135963817.4331.1614.camel@hal.voltaire.com> Hi again Nish, On Fri, 2005-12-30 at 12:23, Nishanth Aravamudan wrote: [snip...] > > > I am attaching the patch for this. Note that this patch is for > > > 2.6.15-rc and not 2.6.14 variants. It has been tested with > > > 2.6.15-rc6. Please let me know if it works for you. Thanks. 
> > > > Great, I will start some jobs with the patch now and let you know. > > The patch works, but does fail at LD time with: > > drivers/built-in.o(.text+0xfd784): In function `iscsi_iser_conn_failure': > drivers/infiniband/ulp/iser/iscsi_iser.c:220: undefined reference to `.iscsi_conn_error' > drivers/built-in.o(.text+0x1003b0): In function `iscsi_iser_exit': > drivers/infiniband/ulp/iser/iscsi_iser.c:1989: undefined reference to `.iscsi_unregister_transport' > drivers/built-in.o(.text+0x10046c): In function `iscsi_iser_init': > drivers/infiniband/ulp/iser/iscsi_iser.c:1978: undefined reference to `.iscsi_register_transport' > drivers/built-in.o(.text+0x100a18): In function `iscsi_iser_control_notify': > drivers/infiniband/ulp/iser/iscsi_iser.c:1855: undefined reference to `.iscsi_recv_pdu' > drivers/built-in.o(.text+0x100bc0):drivers/infiniband/ulp/iser/iscsi_iser.c:1899: undefined reference to `.iscsi_recv_pdu' > drivers/built-in.o(.text+0x100cd4):drivers/infiniband/ulp/iser/iscsi_iser.c:1921: undefined reference to `.iscsi_recv_pdu' > > Unfortunately, this causes make to fail, killing the job (I realize > these are technically non-critical errors). > > So, I'm just going to disable ISER for now in my tests. OK but I think you can get this all to build with a little more as follows: Enable the cryptographic API (as built in rather than as module) with CRC32c CRC algorithm, and iSCSI (under Device Drivers/SCSI device support/SCSI low level drivers) in your kernel configuration. That should cause scsi_transport_iscsi (and iscsi_tcp) to be built which should take care of the unresolved symbols you are seeing. -- Hal From nacc at us.ibm.com Fri Dec 30 09:43:41 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Fri, 30 Dec 2005 09:43:41 -0800 Subject: [openib-general] ISER fails to build on 2.6.15-rc7-git3 (svn=4654) In-Reply-To: <1135963817.4331.1614.camel@hal.voltaire.com> References: <20051230013128.GB8111@us.ibm.com> <1135916950.4331.618.camel@hal.voltaire.com> <20051230050421.GA6431@us.ibm.com> <1135945593.4331.1109.camel@hal.voltaire.com> <20051230161326.GD6431@us.ibm.com> <20051230172304.GE6431@us.ibm.com> <1135963817.4331.1614.camel@hal.voltaire.com> Message-ID: <20051230174341.GF6431@us.ibm.com> On 30.12.2005 [12:30:18 -0500], Hal Rosenstock wrote: > Hi again Nish, > > On Fri, 2005-12-30 at 12:23, Nishanth Aravamudan wrote: > [snip...] > > > > > I am attaching the patch for this. Note that this patch is for > > > > 2.6.15-rc and not 2.6.14 variants. It has been tested with > > > > 2.6.15-rc6. Please let me know if it works for you. Thanks. > > > > > > Great, I will start some jobs with the patch now and let you know. 
> > > > The patch works, but does fail at LD time with: > > > > drivers/built-in.o(.text+0xfd784): In function `iscsi_iser_conn_failure': > > drivers/infiniband/ulp/iser/iscsi_iser.c:220: undefined reference to `.iscsi_conn_error' > > drivers/built-in.o(.text+0x1003b0): In function `iscsi_iser_exit': > > drivers/infiniband/ulp/iser/iscsi_iser.c:1989: undefined reference to `.iscsi_unregister_transport' > > drivers/built-in.o(.text+0x10046c): In function `iscsi_iser_init': > > drivers/infiniband/ulp/iser/iscsi_iser.c:1978: undefined reference to `.iscsi_register_transport' > > drivers/built-in.o(.text+0x100a18): In function `iscsi_iser_control_notify': > > drivers/infiniband/ulp/iser/iscsi_iser.c:1855: undefined reference to `.iscsi_recv_pdu' > > drivers/built-in.o(.text+0x100bc0):drivers/infiniband/ulp/iser/iscsi_iser.c:1899: undefined reference to `.iscsi_recv_pdu' > > drivers/built-in.o(.text+0x100cd4):drivers/infiniband/ulp/iser/iscsi_iser.c:1921: undefined reference to `.iscsi_recv_pdu' > > > > Unfortunately, this causes make to fail, killing the job (I realize > > these are technically non-critical errors). > > > > So, I'm just going to disable ISER for now in my tests. > > OK but I think you can get this all to build with a little more as > follows: > > Enable the cryptographic API (as built in rather than as module) with > CRC32c CRC algorithm, and iSCSI (under Device Drivers/SCSI device > support/SCSI low level drivers) > > in your kernel configuration. > > That should cause scsi_transport_iscsi (and iscsi_tcp) to be built which > should take care of the unresolved symbols you are seeing. Ok, I'll try that as well. Thanks, Nish From greg at kroah.com Fri Dec 30 00:25:05 2005 From: greg at kroah.com (Greg KH) Date: Fri, 30 Dec 2005 00:25:05 -0800 Subject: [openib-general] Re: [PATCH 12 of 20] ipath - misc driver support code In-Reply-To: <5e9b0b7876e2d570f25e.1135816291@eng-12.pathscale.com> References: <5e9b0b7876e2d570f25e.1135816291@eng-12.pathscale.com> Message-ID: <20051230082505.GC7438@kroah.com> On Wed, Dec 28, 2005 at 04:31:31PM -0800, Bryan O'Sullivan wrote: > Signed-off-by: Bryan O'Sullivan No description of what the patch does? > +struct _infinipath_do_not_use_kernel_regs { > + unsigned long long Revision; u64? > + unsigned long long Control; > + unsigned long long PageAlign; > + unsigned long long PortCnt; And what's with the InterCapsNamingScheme of these variables? > +/* > + * would prefer to not inline this, to avoid code bloat, and simplify debugging > + * But when compiling against 2.6.10 kernel tree, it gets an error, so > + * not for now. > + */ > +static void ipath_i2c_delay(ipath_type, int); You aren't compiling this for a 2.6.10 kernel anymore :) > +/* > + * we use this instead of udelay directly, so we can make sure > + * that previous register writes have been flushed all the way > + * to the chip. Since we are delaying anyway, the cost doesn't > + * hurt, and makes the bit twiddling more regular > + * If delay is negative, we'll do the chip read, to be sure write made it > + * to our chip, but won't do udelay() > + */ > +static void ipath_i2c_delay(ipath_type dev, int dtime) > +{ > + /* > + * This needs to be volatile, so that the compiler doesn't > + * optimize away the read to the device's mapped memory. > + */ > + volatile uint32_t read_val; > + if (!dtime) > + return; > + read_val = ipath_kget_kreg32(dev, kr_scratch); > + if (--dtime > 0) /* register read takes about .5 usec, itself */ > + udelay(dtime); > +} Huh? 
After reading your comment, I still don't understand why you can't just use udelay(). Or are you counting on calling this function with only "1" being set for dtime? Ah, in looking at your code, that is exactly what is happening. That's a mess, just delay and everything will work properly on the next rev of the hardware where the time to read that register will have dropped to 1/8 the time it does today... > +/* > + * write a byte, one bit at a time. Returns 0 if we got the following > + * ack, otherwise 1 > + */ > +static int ipath_wr_byte(ipath_type dev, uint8_t data) > +{ > + int bit_cntr; > + uint8_t bit; > + > + for (bit_cntr = 7; bit_cntr >= 0; bit_cntr--) { > + bit = (data >> bit_cntr) & 1; > + ipath_sda_out(dev, bit, 1); > + ipath_scl_out(dev, i2c_line_high, 1); > + ipath_scl_out(dev, i2c_line_low, 1); > + } > + if (!ipath_i2c_ackrcv(dev)) > + return 1; > + return 0; > +} Ah, isn't it fun to write bit-banging functions... And the in-kernel i2c code is messier than doing this by hand? > +/* > + * ipath_eeprom_read - Receives x # byte from the eeprom via I2C. > + * > + * eeprom: Atmel AT24C01 > + * > + */ > + > +int ipath_eeprom_read(ipath_type dev, uint8_t eeprom_offset, void *buffer, > + int len) Odd function comment style. Please fix this to be in kerneldoc format. > diff -r e8af3873b0d9 -r 5e9b0b7876e2 drivers/infiniband/hw/ipath/ipath_lib.c > --- /dev/null Thu Jan 1 00:00:00 1970 +0000 > +++ b/drivers/infiniband/hw/ipath/ipath_lib.c Wed Dec 28 14:19:43 2005 -0800 > @@ -0,0 +1,90 @@ > +/* > + * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + * > + * Patent licenses, if any, provided herein do not apply to > + * combinations of this program with other software, or any other > + * product whatsoever. > + */ > + > +/* > + * This is library code for the driver, similar to what's in libinfinipath for > + * usermode code. > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include Are you _sure_ you need all of these for the one function in this file? 
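A sketch of the simpler pattern Greg is arguing for above: do one read back to flush the posted MMIO writes, then trust udelay() for the full requested time. The register accessor and scratch register names are taken from the quoted driver code; the wrapper itself is invented for illustration.

static void example_i2c_delay(ipath_type dev, unsigned int usecs)
{
	/*
	 * A single register read forces all previously posted MMIO
	 * writes out to the chip; no need to fold the read's latency
	 * into the delay arithmetic.
	 */
	(void)ipath_kget_kreg32(dev, kr_scratch);
	udelay(usecs);	/* then simply delay the full requested time */
}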
> + > +#include "ipath_kernel.h" > + > +unsigned infinipath_debug = __IPATH_INFO; > + > +uint32_t _ipath_pico_per_cycle; /* always present, for now */ > + > +/* > + * This isn't perfect, but it's close enough for timing work. We want this > + * to work on systems where the cycle counter isn't the same as the clock > + * frequency. The one msec spin is OK, since we execute this only once > + * when first loaded. We don't use CURRENT_TIME because on some systems > + * it only has jiffy resolution; we just assume udelay is well calibrated > + * and that we aren't likely to be rescheduled. Do it multiple times, > + * with a yield in between, to try to make sure we get the "true minimum" > + * value. > + * _ipath_pico_per_cycle isn't going to lead to completely accurate > + * conversions from timestamps to nanoseconds, but it's close enough > + * for our purposes, which is mainly to allow people to show events with > + * nsecs or usecs if desired, rather than cycles. > + */ > +void ipath_init_picotime(void) > +{ > + int i; > + u_int64_t ts, te, delta = -1ULL; > + > + for (i = 0; i < 5; i++) { > + ts = get_cycles(); > + udelay(250); > + te = get_cycles(); > + if ((te - ts) < delta) > + delta = te - ts; > + yield(); > + } > + _ipath_pico_per_cycle = 250000000 / delta; > +} Ick. A whole file for one function and 2 public variables? And a horrible timing function too? Please just use the core kernel timing functions, which will work all the time on all arches... > diff -r e8af3873b0d9 -r 5e9b0b7876e2 drivers/infiniband/hw/ipath/ipath_upages.c > --- /dev/null Thu Jan 1 00:00:00 1970 +0000 > +++ b/drivers/infiniband/hw/ipath/ipath_upages.c Wed Dec 28 14:19:43 2005 -0800 > @@ -0,0 +1,144 @@ > +/* > + * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + * > + * Patent licenses, if any, provided herein do not apply to > + * combinations of this program with other software, or any other > + * product whatsoever. > + */ > + > +#include Where is this file being pulled in from? 
> + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include > +#include > + > +#include "ipath_kernel.h" > + > +/* > + * Our version of the kernel mlock function. This function is no longer > + * exposed, so we need to do it ourselves. Woah, um, don't you think that you should either export the main mlock function itself, or fix your code to not need it? Rolling it yourself isn't a good idea... thanks, greg k-h From greg at kroah.com Fri Dec 30 00:00:02 2005 From: greg at kroah.com (Greg KH) Date: Fri, 30 Dec 2005 00:00:02 -0800 Subject: [openib-general] Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver In-Reply-To: References: Message-ID: <20051230080002.GA7438@kroah.com> On Wed, Dec 28, 2005 at 04:31:19PM -0800, Bryan O'Sullivan wrote: > > There are a few requested changes we have chosen to omit for now: > > - The driver still uses EXPORT_SYMBOL, for consistency with other > code in drivers/infiniband Why would that matter? > - Someone asked for the kernel's i2c infrastructure to be used, but > our i2c usage is very specialised, and it would be more of a mess > to use the kernel's Why is this? What is so messy about the in-kernel i2c interfaces? (yeah, I know that there are some oddities, just want to know what you specifically are not liking...) > - We're still using ioctls instead of sysfs or configfs in some > cases, to maintain userspace compatibility Compatibility with what? The driver isn't in the kernel tree yet, so there's no old kernel versions to remain compatibile with :) I also noticed that you are still using the uint64_t type variable types, can you please switch to the proper kernel types instead (u64 in this specific example.) thanks, greg k-h From arjan at infradead.org Fri Dec 30 10:15:47 2005 From: arjan at infradead.org (Arjan van de Ven) Date: Fri, 30 Dec 2005 19:15:47 +0100 Subject: [openib-general] Re: [PATCH 12 of 20] ipath - misc driver support code In-Reply-To: <5e9b0b7876e2d570f25e.1135816291@eng-12.pathscale.com> References: <5e9b0b7876e2d570f25e.1135816291@eng-12.pathscale.com> Message-ID: <1135966547.2941.30.camel@laptopd505.fenrus.org> > +int ipath_get_upages_nocopy(unsigned long start_page, struct page **p) > +{ > + int n; > + struct vm_area_struct *vm = NULL; > + > + down_read(¤t->mm->mmap_sem); > + n = get_user_pages(current, current->mm, start_page, 1, 1, 1, p, &vm); > + up_read(¤t->mm->mmap_sem); > + if (n != 1) { > + _IPATH_INFO("get_user_pages for 0x%lx failed with %d\n", > + start_page, n); > + if (n < 0) /* it's an errno */ > + return n; > + /* > + * If we ever ask for more than a single page, we will have to > + * free the pages (if any) that we did get, via ipath_get_upages() > + * or put_page() directly. > + */ > + return -ENOMEM; /* no way to know actual error */ > + } > + vm->vm_flags |= VM_SHM | VM_LOCKED; > + > + return 0; > +} I hope you're not depending on the VM_LOCKED thing.. since the user can just undo that easily! 
(this is also why all this "sys_mlock from the driver" is traditionally buggy to the point of being a roothole, things like some of the binary 3D drivers have had this security hole for a long time, as did some of the early infiniband drivers) From nacc at us.ibm.com Fri Dec 30 10:44:02 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Fri, 30 Dec 2005 10:44:02 -0800 Subject: [openib-general] ISER fails to build on 2.6.15-rc7-git3 (svn=4654) In-Reply-To: <1135963817.4331.1614.camel@hal.voltaire.com> References: <20051230013128.GB8111@us.ibm.com> <1135916950.4331.618.camel@hal.voltaire.com> <20051230050421.GA6431@us.ibm.com> <1135945593.4331.1109.camel@hal.voltaire.com> <20051230161326.GD6431@us.ibm.com> <20051230172304.GE6431@us.ibm.com> <1135963817.4331.1614.camel@hal.voltaire.com> Message-ID: <20051230184402.GG6431@us.ibm.com> On 30.12.2005 [12:30:18 -0500], Hal Rosenstock wrote: > Hi again Nish, > > On Fri, 2005-12-30 at 12:23, Nishanth Aravamudan wrote: > [snip...] > > > > > I am attaching the patch for this. Note that this patch is for > > > > 2.6.15-rc and not 2.6.14 variants. It has been tested with > > > > 2.6.15-rc6. Please let me know if it works for you. Thanks. > > > > > > Great, I will start some jobs with the patch now and let you know. > > > > The patch works, but does fail at LD time with: > > > > drivers/built-in.o(.text+0xfd784): In function `iscsi_iser_conn_failure': > > drivers/infiniband/ulp/iser/iscsi_iser.c:220: undefined reference to `.iscsi_conn_error' > > drivers/built-in.o(.text+0x1003b0): In function `iscsi_iser_exit': > > drivers/infiniband/ulp/iser/iscsi_iser.c:1989: undefined reference to `.iscsi_unregister_transport' > > drivers/built-in.o(.text+0x10046c): In function `iscsi_iser_init': > > drivers/infiniband/ulp/iser/iscsi_iser.c:1978: undefined reference to `.iscsi_register_transport' > > drivers/built-in.o(.text+0x100a18): In function `iscsi_iser_control_notify': > > drivers/infiniband/ulp/iser/iscsi_iser.c:1855: undefined reference to `.iscsi_recv_pdu' > > drivers/built-in.o(.text+0x100bc0):drivers/infiniband/ulp/iser/iscsi_iser.c:1899: undefined reference to `.iscsi_recv_pdu' > > drivers/built-in.o(.text+0x100cd4):drivers/infiniband/ulp/iser/iscsi_iser.c:1921: undefined reference to `.iscsi_recv_pdu' > > > > Unfortunately, this causes make to fail, killing the job (I realize > > these are technically non-critical errors). > > > > So, I'm just going to disable ISER for now in my tests. > > OK but I think you can get this all to build with a little more as > follows: > > Enable the cryptographic API (as built in rather than as module) with > CRC32c CRC algorithm, and iSCSI (under Device Drivers/SCSI device > support/SCSI low level drivers) > > in your kernel configuration. > > That should cause scsi_transport_iscsi (and iscsi_tcp) to be built which > should take care of the unresolved symbols you are seeing. This seems to work, I'll report on the numbers once the tests are done. Thanks, Nish From torvalds at osdl.org Fri Dec 30 10:46:06 2005 From: torvalds at osdl.org (Linus Torvalds) Date: Fri, 30 Dec 2005 10:46:06 -0800 (PST) Subject: [openib-general] Re: [PATCH 10 of 20] ipath - core driver, part 3 of 4 In-Reply-To: References: Message-ID: All your user page lookup/pinning code is terminally broken. You can't do it that way. You have serveral major conceptual bugs, like keeping track of pages without incrementing their page count, and just expecting that they are magically "pinned" even you do nothing at all to pin them. 
The process exits or does an munmap, and the page will be used for
something else, and you'll just corrupt totally random memory.

Similarly, you do page_address() on the page, which just can't work on
highmem pages.

Crap like this must not be merged. Drivers aren't supposed to play VM
tricks in the first place - even if they were to get it right (which
they never do). Don't do it.

		Linus

From jlentini at netapp.com  Fri Dec 30 12:54:17 2005
From: jlentini at netapp.com (James Lentini)
Date: Fri, 30 Dec 2005 15:54:17 -0500 (EST)
Subject: [openib-general] [kernel verbs] u64 vs dma_addr_t
Message-ID:

Why does the kernel verbs API use u64 as the type for I/O virtual
addresses instead of dma_addr_t (e.g. mthca_mr_alloc_phys()'s iova
param)?

james

From trimmer at silverstorm.com  Fri Dec 30 13:11:00 2005
From: trimmer at silverstorm.com (Rimmer, Todd)
Date: Fri, 30 Dec 2005 16:11:00 -0500
Subject: [openib-general] [kernel verbs] u64 vs dma_addr_t
Message-ID: <5D78D28F88822E4D8702BB9EEF1A43670A0944@mercury.infiniconsys.com>

James wrote:
> Why does the kernel verbs API use u64 as the type for I/O virtual
> addresses instead of dma_addr_t (e.g. mthca_mr_alloc_phys()'s iova
> param)?

dma_addr_t is the appropriate datatype for an address on the local
CPU/system.

However, in Infiniband IO virtual addresses can also be exchanged
across the wire for use by RDMA. Hence the address must be 64 bits
[the size defined by Infiniband RDMA addressing protocols on the
wire] even if the local system is a 32 bit CPU. This is necessary
to support the case where the remote system has a 64 bit address
space.

Todd Rimmer

From jlentini at netapp.com  Fri Dec 30 13:25:44 2005
From: jlentini at netapp.com (James Lentini)
Date: Fri, 30 Dec 2005 16:25:44 -0500 (EST)
Subject: [openib-general] [kernel verbs] u64 vs dma_addr_t
In-Reply-To: <5D78D28F88822E4D8702BB9EEF1A43670A0944@mercury.infiniconsys.com>
References: <5D78D28F88822E4D8702BB9EEF1A43670A0944@mercury.infiniconsys.com>
Message-ID:

On Fri, 30 Dec 2005, Rimmer, Todd wrote:

> James wrote:
> > Why does the kernel verbs API use u64 as the type for I/O virtual
> > addresses instead of dma_addr_t (e.g. mthca_mr_alloc_phys()'s iova
> > param)?
>
> dma_addr_t is the appropriate datatype for an address on the local
> CPU/system.
>
> However, in Infiniband IO virtual addresses can also be exchanged
> across the wire for use by RDMA. Hence the address must be 64 bits
> [the size defined by Infiniband RDMA addressing protocols on the
> wire] even if the local system is a 32 bit CPU. This is necessary
> to support the case where the remote system has a 64 bit address
> space.

Thanks Todd. That makes sense now.
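To illustrate Todd's point with a sketch (the struct and helper below are invented for illustration, not part of any verbs header): the address a peer targets with an RDMA operation travels on the wire, so its width is fixed by the protocol at 64 bits, while dma_addr_t is free to be 32 bits on a given host.

#include <linux/types.h>
#include <asm/byteorder.h>

/* hypothetical wire format for an RDMA segment a peer will target */
struct example_rdma_segment {
	__be64 remote_addr;	/* always 64 bits on the wire */
	__be32 rkey;
	__be32 length;
};

static void example_fill_segment(struct example_rdma_segment *seg,
				 u64 iova, u32 rkey, u32 length)
{
	/* dma_addr_t may be 32 bits on this host; the wire field never shrinks */
	seg->remote_addr = cpu_to_be64(iova);
	seg->rkey = cpu_to_be32(rkey);
	seg->length = cpu_to_be32(length);
}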
From jlentini at netapp.com Fri Dec 30 14:06:44 2005 From: jlentini at netapp.com (James Lentini) Date: Fri, 30 Dec 2005 17:06:44 -0500 (EST) Subject: [openib-general] [kernel verbs] u64 vs dma_addr_t In-Reply-To: References: <5D78D28F88822E4D8702BB9EEF1A43670A0944@mercury.infiniconsys.com> Message-ID: One more question on this topic. Why is the ib_sge's addr a u64 and not a dma_addr_t? From mshefty at ichips.intel.com Fri Dec 30 14:09:21 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 30 Dec 2005 14:09:21 -0800 Subject: [openib-general] [kernel verbs] u64 vs dma_addr_t In-Reply-To: References: <5D78D28F88822E4D8702BB9EEF1A43670A0944@mercury.infiniconsys.com> Message-ID: <43B5B011.6000100@ichips.intel.com> James Lentini wrote: > One more question on this topic. > > Why is the ib_sge's addr a u64 and not a dma_addr_t? It's the same address that the user can transfer to the remote side. Also, if inline sends are being used, the address is not necessarily a DMA address. - Sean From tom at opengridcomputing.com Fri Dec 30 14:39:11 2005 From: tom at opengridcomputing.com (Tom Tucker) Date: Fri, 30 Dec 2005 16:39:11 -0600 Subject: [openib-general] [PATCH] iWARP Support added to the CMA In-Reply-To: <43B484F5.7030603@ichips.intel.com> References: <1134669336.7186.2.camel@trinity.austin.ammasso.com> <43B484F5.7030603@ichips.intel.com> Message-ID: <1135982351.7698.2.camel@strider.opengridcomputing.com> Sean: Thanks for the comments -- great stuff. I'm in the process of merging with the latest CMA changes in the trunk and will address your comments in the next patch. Thanks for the review! On Thu, 2005-12-29 at 16:53 -0800, Sean Hefty wrote: > Tom Tucker wrote: > > I'm a lot slow to review this, but comments below. I'll start to address some > of them that affect the generic code next week, in particular changes to ib_addr. > > > > Index: core/cm.c > > =================================================================== > > --- core/cm.c (revision 4186) > > +++ core/cm.c (working copy) > > @@ -3227,6 +3227,10 @@ > > int ret; > > u8 i; > > > > + /* Ignore RNIC devices */ > > + if (device->node_type == IB_NODE_RNIC) > > + return; > > + > > cm_dev = kmalloc(sizeof(*cm_dev) + sizeof(*port) * > > device->phys_port_cnt, GFP_KERNEL); > > if (!cm_dev) > > @@ -3291,6 +3295,10 @@ > > if (!cm_dev) > > return; > > > > + /* Ignore RNIC devices */ > > + if (device->node_type == IB_NODE_RNIC) > > + return; > > + > > write_lock_irqsave(&cm.device_lock, flags); > > list_del(&cm_dev->list); > > write_unlock_irqrestore(&cm.device_lock, flags); > > The changes to cm_remove_one() are not needed. ib_get_client_data() should > return NULL because IB_NODE_RNIC is skipped in cm_add_one(). > > > Index: core/addr.c > > =================================================================== > > --- core/addr.c (revision 4186) > > +++ core/addr.c (working copy) > > @@ -73,8 +73,13 @@ > > if (!dev) > > return -EADDRNOTAVAIL; > > > > - *gid = *(union ib_gid *) (dev->dev_addr + 4); > > - *pkey = addr_get_pkey(dev); > > + if (dev->type == ARPHRD_INFINIBAND) { > > + *gid = *(union ib_gid *) (dev->dev_addr + 4); > > + *pkey = addr_get_pkey(dev); > > + } else { > > + *gid = *(union ib_gid *) (dev->dev_addr); > > + *pkey = 0; > > + } > > dev_put(dev); > > return 0; > > } > > If this call is being used, we should consider changing it to something more > generic, rather than returning a "gid" as the hardware address for a non-IB > device. One possibility is to make the API return the hardware address of a > given IP address. 
The CMA can then break that address into a GID/pkey pair if > needed. > > > @@ -476,8 +498,22 @@ > > state = cma_exch(id_priv, CMA_DESTROYING); > > cma_cancel_operation(id_priv, state); > > > > - if (id_priv->cm_id && !IS_ERR(id_priv->cm_id)) > > - ib_destroy_cm_id(id_priv->cm_id); > > + if (id->device) { > > + switch (id->device->node_type) { > > + case IB_NODE_RNIC: > > + if (id_priv->cm_id.iw && !IS_ERR(id_priv->cm_id.iw)) { > > + iw_destroy_cm_id(id_priv->cm_id.iw); > > + id_priv->cm_id.iw = 0; > > + } > > + break; > > + default: > > + if (id_priv->cm_id.ib && !IS_ERR(id_priv->cm_id.ib)) { > > + ib_destroy_cm_id(id_priv->cm_id.ib); > > + id_priv->cm_id.ib = 0; > > The iw/ib devices should be set to NULL instead of assigned 0. > > > + ret = cma_notify_user(id_priv, > > + event_type, > > + event->status, > > + event->private_data, > > + event->private_data_len); > > + if (ret) { > > + /* Destroy the CM ID by returning a non-zero value. */ > > + id_priv->cm_id.iw = NULL; > > + cma_exch(id_priv, CMA_DESTROYING); > > + cma_release_remove(id_priv); > > + rdma_destroy_id(&id_priv->id); > > + return ret; > > + } > > + > > + cma_release_remove(id_priv); > > + return ret; > > +} > > This looks different than the cma_ib_handler, and it makes me think that the > cma_ib_handler has a bug where it doesn't decrement dev_remove. > > > +static int iw_conn_req_handler(struct iw_cm_id *cm_id, > > + struct iw_cm_event *iw_event) > > +{ > > + struct rdma_cm_id* new_cm_id; > > + struct rdma_id_private *listen_id, *conn_id; > > + struct sockaddr_in* sin; > > + int ret; > > + > > + listen_id = cm_id->context; > > + atomic_inc(&listen_id->dev_remove); > > + if (!cma_comp(listen_id, CMA_LISTEN)) { > > + ret = -ECONNABORTED; > > + goto out; > > + } > > + > > + /* Create a new RDMA id the new IW CM ID */ > > + new_cm_id = rdma_create_id(listen_id->id.event_handler, > > + listen_id->id.context); > > + if (!new_cm_id) { > > + ret = -ENOMEM; > > + goto out; > > + } > > + conn_id = container_of(new_cm_id, struct rdma_id_private, id); > > + atomic_inc(&conn_id->dev_remove); > > + conn_id->state = CMA_CONNECT; > > + > > + /* New connection inherits device from parent */ > > + cma_attach_to_dev(conn_id, listen_id->cma_dev); > > > cma_attach_to_dev doesn't provide synchronization around cma_dev->id_list. > Access to that list needs to be protected with 'mutex'. Other than that, I > think that this works fine. > > > @@ -785,8 +950,9 @@ > > goto out; > > > > list_add_tail(&id_priv->list, &listen_any_list); > > - list_for_each_entry(cma_dev, &dev_list, list) > > + list_for_each_entry(cma_dev, &dev_list, list) { > > cma_listen_on_dev(id_priv, cma_dev); > > + } > > Please drop the extra braces. > > > @@ -796,7 +962,6 @@ > > { > > struct rdma_id_private *id_priv; > > int ret; > > - > > id_priv = container_of(id, struct rdma_id_private, id); > > if (!cma_comp_exch(id_priv, CMA_ADDR_BOUND, CMA_LISTEN)) > > return -EINVAL; > > Please keep the blank line. > > > @@ -890,6 +1058,30 @@ > > +static int cma_resolve_iw_route(struct rdma_id_private *id_priv, int timeout_ms) > > +{ > > + enum rdma_cm_event_type event = RDMA_CM_EVENT_ROUTE_RESOLVED; > > + int rc; > > Please use 'ret' instead of 'rc' to match the rest of the code. > > > + > > + atomic_inc(&id_priv->dev_remove); > > + > > + if (!cma_comp_exch(id_priv, CMA_ROUTE_QUERY, CMA_ROUTE_RESOLVED)) > > + BUG_ON(1); > > The device associated with the id could have been removed while the user was in > the process of making this call. We should simply fail the call here. 
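A sketch of the failure path Sean is suggesting in place of the BUG_ON, using the state helpers visible in the quoted patch (the exact cleanup required may differ from this):

	atomic_inc(&id_priv->dev_remove);

	if (!cma_comp_exch(id_priv, CMA_ROUTE_QUERY, CMA_ROUTE_RESOLVED)) {
		/* state changed under us (e.g. device removal): fail, don't BUG */
		cma_release_remove(id_priv);
		return -EINVAL;
	}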
> > > + > > + rc = cma_notify_user(id_priv, event, 0, NULL, 0); > > + if (rc) { > > + cma_exch(id_priv, CMA_DESTROYING); > > + cma_release_remove(id_priv); > > + cma_deref_id(id_priv); > > + rdma_destroy_id(&id_priv->id); > > + return rc; > > + } > > The callback needs to come from another thread other than the one that the user > called down with. Calling the user back in their own thread can make it > difficult for them to provide synchronization. You can use the rdma_wq that's > been exposed in ib_addr. Also, the user is likely to call this routine from > within a CMA callback (such as after ib_resolve_addr), so deadlock will occur if > you try to destroy the id. > > > + > > + cma_release_remove(id_priv); > > + cma_deref_id(id_priv); > > + return rc; > > +} > > + > > int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms) > > { > > struct rdma_id_private *id_priv; > > @@ -952,20 +1147,133 @@ > > + > > +/* Find the local interface with a route to the specified address and > > + * bind the CM ID to this interface's CMA device > > + */ > > +static int cma_acquire_iw_dev(struct rdma_cm_id* id, struct sockaddr* addr) > > +{ > > + int ret = -ENOENT; > > + struct cma_device* cma_dev; > > + struct rdma_id_private *id_priv; > > + struct sockaddr_in* sin; > > + struct rtable *rt = 0; > > + struct flowi fl; > > + struct net_device* netdev; > > + struct in_addr src_ip; > > + unsigned char* dev_addr; > > + > > + sin = (struct sockaddr_in*)addr; > > + if (sin->sin_family != AF_INET) > > + return -EINVAL; > > + > > + id_priv = container_of(id, struct rdma_id_private, id); > > + > > + /* If the address is local, use the device. If it is remote, > > + * look up a route to get the local address > > + */ > > + netdev = ip_dev_find(sin->sin_addr.s_addr); > > + if (netdev) { > > + src_ip = sin->sin_addr; > > + dev_addr = netdev->dev_addr; > > + dev_put(netdev); > > + } else { > > + memset(&fl, 0, sizeof(fl)); > > + fl.nl_u.ip4_u.daddr = sin->sin_addr.s_addr; > > + if (ip_route_output_key(&rt, &fl)) { > > + return -ENETUNREACH; > > + } > > + dev_addr = rt->idev->dev->dev_addr; > > + src_ip.s_addr = rt->rt_src; > > + > > + ip_rt_put(rt); > > + } > > Can we push the above code into ib_addr? > > > + down(&mutex); > > + > > + list_for_each_entry(cma_dev, &dev_list, list) { > > + if (memcmp(dev_addr, > > + &cma_dev->node_guid, > > + sizeof(cma_dev->node_guid)) == 0) { > > + /* If we find the device, then check if this > > + * is an iWARP device. If it is, then call the > > + * callback handler immediately because we > > + * already have the native address > > + */ > > I'm not following this comment. What callback is being invoked? > > > + if (cma_dev->device->node_type == IB_NODE_RNIC) { > > + struct sockaddr_in* cm_sin; > > + /* Set our source address */ > > + cm_sin = (struct sockaddr_in*) > > + &id_priv->id.route.addr.src_addr; > > + cm_sin->sin_family = AF_INET; > > + cm_sin->sin_addr.s_addr = src_ip.s_addr; > > + > > + /* Claim the device in the mutex */ > > + cma_attach_to_dev(id_priv, cma_dev); > > + ret = 0; > > + break; > > + } > > + } > > + } > > + up(&mutex); > > + > > + return ret; > > +} > > I'd like to see if it's possible to merge this call with cma_acquire_ib_dev() > and create a new routine, cma_acquire_dev() that can walk the device list and > check for both. Maybe if ib_addr returned the full hardware address, along with > a device type it might be possible. (Although the likelihood of a hardware > address collision seems near impossible.) 
> > > int rdma_resolve_addr(struct rdma_cm_id *id, struct sockaddr *src_addr, > > struct sockaddr *dst_addr, int timeout_ms) > > { > > struct rdma_id_private *id_priv; > > - int ret; > > + int ret = 0; > > > > id_priv = container_of(id, struct rdma_id_private, id); > > if (!cma_comp_exch(id_priv, CMA_IDLE, CMA_ADDR_QUERY)) > > return -EINVAL; > > > > atomic_inc(&id_priv->refcount); > > + > > id->route.addr.dst_addr = *dst_addr; > > - ret = ib_resolve_addr(src_addr, dst_addr, &id->route.addr.addr.ibaddr, > > - timeout_ms, addr_handler, id_priv); > > + > > + if (cma_acquire_iw_dev(id, dst_addr)==0) { > > + > > + enum rdma_cm_event_type event; > > + > > + cma_exch(id_priv, CMA_ADDR_RESOLVED); > > + > > + atomic_inc(&id_priv->dev_remove); > > + > > + event = RDMA_CM_EVENT_ADDR_RESOLVED; > > + if (cma_notify_user(id_priv, event, 0, NULL, 0)) { > > + cma_exch(id_priv, CMA_DESTROYING); > > + cma_deref_id(id_priv); > > + cma_release_remove(id_priv); > > + rdma_destroy_id(&id_priv->id); > > + return -EINVAL; > > + } > > Similar to other comments. Callbacks should be scheduled to a separate thread. > The behavior is also slightly different. The IB code will return a source IP > address that may be used to connect to the destination address if one is not > given. This is needed in order to perform the reverse resolution on the remote > side. > > > + cma_release_remove(id_priv); > > + cma_deref_id(id_priv); > > + > > + } else { > > + > > + ret = ib_resolve_addr(src_addr, > > + dst_addr, &id->route.addr.addr.ibaddr, > > + timeout_ms, addr_handler, id_priv); > > We might be able to make this call generic by replacing ibaddr with source and > destination hardware addresses. Users that really want to know what 'gid' > they're on could be given functions that extract the gid and pkey from the > hardware addresses. I.e. struct ib_addr could be renamed to struct rdma_addr > and contain source and destination hardware addresses. > > > @@ -980,10 +1288,13 @@ > > int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr) > > { > > struct rdma_id_private *id_priv; > > + struct sockaddr_in* sin; > > struct ib_addr *ibaddr = &id->route.addr.addr.ibaddr; > > int ret; > > > > - if (addr->sa_family != AF_INET) > > + sin = (struct sockaddr_in*)addr; > > + > > + if (sin->sin_family != AF_INET) > > return -EINVAL; > > Please remove this change. Right now, the check's only there because the code > does not fully support IPv6 addressing. > > > int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) > > { > > struct rdma_id_private *id_priv; > > int ret; > > > > id_priv = container_of(id, struct rdma_id_private, id); > > - if (!cma_comp_exch(id_priv, CMA_ROUTE_RESOLVED, CMA_CONNECT)) > > + if (!cma_comp_exch(id_priv, CMA_ROUTE_RESOLVED, CMA_CONNECT)) > > Please undo extra white space at the end of the line. > > > @@ -1190,7 +1551,6 @@ > > { > > struct rdma_id_private *id_priv; > > int ret; > > - > > Please add blank line back in. > > > id_priv = container_of(id, struct rdma_id_private, id); > > if (!cma_comp(id_priv, CMA_CONNECT)) > > return -EINVAL; > > > - Sean From caitlinb at broadcom.com Fri Dec 30 14:44:22 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 30 Dec 2005 14:44:22 -0800 Subject: [openib-general] [kernel verbs] u64 vs dma_addr_t Message-ID: <54AD0F12E08D1541B826BE97C98F99F1142243@NT-SJCA-0751.brcm.ad.broadcom.com> openib-general-bounces at openib.org wrote: > One more question on this topic. > > Why is the ib_sge's addr a u64 and not a dma_addr_t? 
Because the hardware may need for it to be a 64 bit IO Address accessible on the system bus. That applies to the whole system bus, no matter how many PCI roots or virtual OSs there are. In particular there could be a guest OS that was running in 32-bit mode, and the RDMA hardware receiving fast path requests will not support different work request formats for each guest OS. From greg at kroah.com Fri Dec 30 00:12:18 2005 From: greg at kroah.com (Greg KH) Date: Fri, 30 Dec 2005 00:12:18 -0800 Subject: [openib-general] Re: [PATCH 11 of 20] ipath - core driver, part 4 of 4 In-Reply-To: References: Message-ID: <20051230081218.GB7438@kroah.com> On Wed, Dec 28, 2005 at 04:31:30PM -0800, Bryan O'Sullivan wrote: > Signed-off-by: Bryan O'Sullivan > > diff -r c37b118ef806 -r e8af3873b0d9 drivers/infiniband/hw/ipath/ipath_driver.c > --- a/drivers/infiniband/hw/ipath/ipath_driver.c Wed Dec 28 14:19:42 2005 -0800 > +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Wed Dec 28 14:19:43 2005 -0800 > @@ -5408,3 +5408,1709 @@ Clever use of 4 patches to just add onto the same file. This has grown into a huge file, can't you split it up into smaller pieces? > +int __init infinipath_init(void) > +{ > + int r = 0, i; > + > + _IPATH_DBG(KERN_INFO DRIVER_LOAD_MSG "%s", ipath_core_version); > + > + ipath_init_picotime(); /* init cycles -> pico conversion */ > + > + /* > + * initialize the statusp to temporary storage so we can use it > + * everywhere without first checking. When we "really" assign it, > + * we copy from _ipath_status > + */ > + for (i = 0; i < infinipath_max; i++) > + devdata[i].ipath_statusp = &devdata[i]._ipath_status; > + > + /* > + * init these early, in case we take an interrupt as soon as the irq > + * is setup. Saw a spinlock panic once that appeared to be due to that > + * problem, when they were initted later on. > + */ > + spin_lock_init(&ipath_pioavail_lock); > + spin_lock_init(&ipath_sma_lock); > + > + pci_register_driver(&infinipath_driver); > + > + driver_create_file(&(infinipath_driver.driver), &driver_attr_version); > + > + if ((r = register_chrdev(ipath_major, MODNAME, &ipath_fops))) > + _IPATH_ERROR("Unable to register %s device\n", MODNAME); Why even save off the return value if you don't do anything with it? And please don't put assignments in the middle of if statements, that's just messy and harder to read (the fact that gcc made you put an extra () should be a hint that you were doing something wrong...) And does your driver work with udev? I didn't see where you were exporting the major:minor number of your devices to sysfs, but I might have missed it. > + /* > + * never return an error, since we could have stuff registered, > + * resources used, etc., even if no hardware found. This way we > + * can clean up through unload. > + */ > + return 0; > +} Are you sure that's a good idea? Please do the proper thing and tear down your infrastructure if something fails, that's the correct thing to do. That way you can actually recover if something that you call in this function fails (like driver_create_file(), or pci_register_driver().) Functions don't return error values just so you can ignore them :) > +/* > + * note: if for some reason the unload fails after this routine, and leaves > + * the driver enterable by user code, we'll almost certainly crash and burn... > + */ See, you admit that what you are doing isn't the wisest thing, which should tell you something... 
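A sketch of the unwind pattern Greg is asking for in infinipath_init: tear down everything already set up when a later step fails, instead of returning 0 unconditionally. This is generic; the calls mirror the quoted code but the driver object and labels are invented.

static int __init example_init(void)
{
	int ret;

	ret = pci_register_driver(&example_driver);
	if (ret)
		goto bail;

	ret = driver_create_file(&example_driver.driver, &driver_attr_version);
	if (ret)
		goto bail_pci;

	/* example_major is nonzero, so register_chrdev() returns 0 on success */
	ret = register_chrdev(example_major, "example", &example_fops);
	if (ret < 0)
		goto bail_attr;

	return 0;

bail_attr:
	driver_remove_file(&example_driver.driver, &driver_attr_version);
bail_pci:
	pci_unregister_driver(&example_driver);
bail:
	return ret;
}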
> +static void __exit infinipath_cleanup(void) > +{ > + int r, m, port; > + > + driver_remove_file(&(infinipath_driver.driver), &driver_attr_version); > + if ((r = unregister_chrdev(ipath_major, MODNAME))) > + _IPATH_DBG("unregister of device failed: %d\n", r); > + > + > + /* > + * turn off rcv, send, and interrupts for all ports, all drivers > + * should also hard reset the chip here? > + * free up port 0 (kernel) rcvhdr, egr bufs, and eventually tid bufs > + * for all versions of the driver, if they were allocated > + */ > + for (m = 0; m < infinipath_max; m++) { > + uint64_t val; > + struct ipath_devdata *dd = &devdata[m]; > + if (dd->ipath_kregbase) { > + /* in case unload fails, be consistent */ > + dd->ipath_rcvctrl = 0U; > + ipath_kput_kreg(m, kr_rcvctrl, dd->ipath_rcvctrl); > + > + /* > + * gracefully stop all sends allowing any in > + * progress to trickle out first. > + */ > + ipath_kput_kreg(m, kr_sendctrl, 0ULL); > + val = ipath_kget_kreg64(m, kr_scratch); /* flush it */ > + /* > + * enough for anything that's going to trickle > + * out to have actually done so. > + */ > + udelay(5); > + > + /* > + * abort any armed or launched PIO buffers that > + * didn't go. (self clearing). Will cause any > + * packet currently being transmitted to go out > + * with an EBP, and may also cause a short packet > + * error on the receiver. > + */ > + ipath_kput_kreg(m, kr_sendctrl, INFINIPATH_S_ABORT); > + > + /* mask interrupts, but not errors */ > + ipath_kput_kreg(m, kr_intmask, 0ULL); > + ipath_shutdown_link(m); > + > + /* > + * clear all interrupts and errors. Next time > + * driver is loaded, we know that whatever is > + * set happened while we were unloaded > + */ > + ipath_kput_kreg(m, kr_hwerrclear, -1LL); > + ipath_kput_kreg(m, kr_errorclear, -1LL); > + ipath_kput_kreg(m, kr_intclear, -1LL); > + if (dd->__ipath_pioavailregs_base) { > + kfree((void *)dd->__ipath_pioavailregs_base); > + dd->__ipath_pioavailregs_base = NULL; > + dd->ipath_pioavailregs_dma = NULL; > + } > + > + if (dd->ipath_pageshadow) { > + struct page **tmpp = dd->ipath_pageshadow; > + int i, cnt = 0; > + > + _IPATH_VDBG > + ("Unlocking any expTID pages still locked\n"); > + for (port = 0; port < dd->ipath_cfgports; > + port++) { > + int port_tidbase = > + port * dd->ipath_rcvtidcnt; > + int maxtid = > + port_tidbase + dd->ipath_rcvtidcnt; > + for (i = port_tidbase; i < maxtid; i++) { > + if (tmpp[i]) { > + ipath_putpages(1, > + &tmpp[i]); > + tmpp[i] = NULL; > + cnt++; > + } > + } > + } > + if (cnt) { > + ipath_stats.sps_pageunlocks += cnt; > + _IPATH_VDBG > + ("There were still %u expTID entries locked\n", > + cnt); > + } > + if (ipath_stats.sps_pagelocks > + || ipath_stats.sps_pageunlocks) > + _IPATH_VDBG > + ("%llu pages locked, %llu unlocked via ipath_m{un}lock\n", > + ipath_stats.sps_pagelocks, > + ipath_stats.sps_pageunlocks); > + > + _IPATH_VDBG > + ("Free shadow page tid array at %p\n", > + dd->ipath_pageshadow); > + vfree(dd->ipath_pageshadow); > + dd->ipath_pageshadow = NULL; > + } > + > + /* > + * free any resources still in use (usually just > + * kernel ports) at unload > + */ > + for (port = 0; port < dd->ipath_cfgports; port++) > + ipath_free_pddata(dd, port, 1); > + kfree(dd->ipath_pd); > + /* > + * debuggability, in case some cleanup path > + * tries to use it after this > + */ > + dd->ipath_pd = NULL; > + } > + > + if (dd->pcidev) { > + if (dd->pcidev->irq) { > + _IPATH_VDBG("unit %u free_irq of irq %x\n", m, > + dd->pcidev->irq); > + free_irq(dd->pcidev->irq, dd); > + } else > + _IPATH_DBG > + ("irq 
is 0, not doing free_irq for unit %u\n", > + m); > + dd->pcidev = NULL; > + } > + if (dd->pci_registered) { > + _IPATH_VDBG > + ("Unregistering pci infrastructure unit %u\n", m); > + pci_unregister_driver(&infinipath_driver); This is the call that should have cleaned up all of the memory and other stuff that you do above. If not, then your driver will not work in any hotplug pci systems, which would not be a good thing. Please do like Roland says and put your resources and stuff in the device specific structures, like the rest of the kernel drivers do. You know, we do things like this for a reason, not just because we like to be difficult :) > + dd->pci_registered = 0; > + } else > + _IPATH_VDBG > + ("unit %u: no pci unreg, wasn't registered\n", m); > + ipath_chip_cleanup(dd); /* clean up any per-chip chip-specific stuff */ > + } > + /* > + * clean up any chip-specific stuff for now, only one type of chip > + * for any given driver > + */ > + ipath_chip_done(); > + > + /* cleanup all our locked pages private data structures */ > + ipath_upages_cleanup(NULL); > +} > + > +/* This is a generic function here, so it can return device-specific > + * info. This allows keeping in sync with the version that supports > + * multiple chip types. > +*/ > +void ipath_get_boardname(const ipath_type t, char *name, size_t namelen) > +{ > + ipath_ht_get_boardname(t, name, namelen); > +} Why not just export ipath_ht_get_boardname instead? > +module_init(infinipath_init); > +module_exit(infinipath_cleanup); > + > +EXPORT_SYMBOL(infinipath_debug); > +EXPORT_SYMBOL(ipath_get_boardname); EXPORT_SYMBOL_GPL() ? And put them next to the functions themselves, it's easier to notice that way. thanks, greg k-h From greg at kroah.com Fri Dec 30 00:39:28 2005 From: greg at kroah.com (Greg KH) Date: Fri, 30 Dec 2005 00:39:28 -0800 Subject: [openib-general] Re: [PATCH 8 of 20] ipath - core driver, part 1 of 4 In-Reply-To: References: Message-ID: <20051230083928.GD7438@kroah.com> On Wed, Dec 28, 2005 at 04:31:27PM -0800, Bryan O'Sullivan wrote: > Signed-off-by: Bryan O'Sullivan > > diff -r ffbd416f30d4 -r ddd21709e12c drivers/infiniband/hw/ipath/ipath_driver.c > --- /dev/null Thu Jan 1 00:00:00 1970 +0000 > +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Wed Dec 28 14:19:42 2005 -0800 > @@ -0,0 +1,1879 @@ > +/* > + * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + * > + * Patent licenses, if any, provided herein do not apply to > + * combinations of this program with other software, or any other > + * product whatsoever. > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include /* we can generate our own crc's for testing */ > + > +#include "ipath_kernel.h" > +#include "ips_common.h" > +#include "ipath_layer.h" > + > +/* > + * Our LSB-assigned major number, so scripts can figure > + * out how to make entry in /dev. > + */ > + > +static int ipath_major = 233; > + > +/* > + * number of buffers reserved for driver (layered drivers and SMA send). > + * Reserved at end of buffer list. > + */ > + > +static uint infinipath_kpiobufs = 32; > + > +/* > + * number of ports we are configured to use (to allow for more pio > + * buffers per port, etc.) Zero means use chip value. > + */ > + > +static uint infinipath_cfgports; > + > +/* > + * number of units we are configured to use (to allow for bringup on > + * multi-chip systems) Zero means use only one for now, but eventually > + * will mean to use infinipath_max > + */ > + > +static uint infinipath_cfgunits; > + > +uint64_t ipath_dummy_val_for_testing; > + > +static __kernel_pid_t ipath_sma_alive; /* PID of SMA, if it's running */ > +static spinlock_t ipath_sma_lock; /* SMA receive */ > + > +/* max SM received packets we'll queue; we keep the most recent packets. */ > + > +#define IPATH_NUM_SMAPKTS 16 > + > +#define IPATH_SMA_HDRSZ (8+12+8) /* LRH+BTH+DETH */ > + > +static struct _ipath_sma_rpkt { > + /* length of received packet; non-zero if queued */ > + uint32_t len; > + /* unit number of interface packet was received from */ > + uint32_t unit; > + uint8_t *buf; > +} ipath_sma_data[IPATH_NUM_SMAPKTS]; > + > +static unsigned ipath_sma_first; /* oldest sma packet index */ > +static unsigned ipath_sma_next; /* next sma packet index to use */ > + > +/* > + * ipath_sma_data_bufs has one extra, pointed to by ipath_sma_data_spare, > + * so we can exchange buffers to do copy_to_user, and not hold the lock > + * across the copy_to_user(). > + */ > + > +#define SMA_MAX_PKTSZ (IPATH_SMA_HDRSZ+256) /* max len of an SMA packet */ > + > +static uint8_t ipath_sma_data_bufs[IPATH_NUM_SMAPKTS + 1][SMA_MAX_PKTSZ]; > +static uint8_t *ipath_sma_data_spare; > +/* sma waits globally on all units */ > +static wait_queue_head_t ipath_sma_wait; > +static wait_queue_head_t ipath_sma_state_wait; > + > +struct infinipath_stats ipath_stats; > + > +/* > + * this will only be used for diags, now that we have enabled the DMA > + * of the sendpioavail regs to system memory. > + */ > + > +static inline uint64_t ipath_kget_sreg(const ipath_type stype, > + ipath_sreg regno) > +{ > + uint64_t val; > + uint64_t *sbase; > + > + sbase = (uint64_t *) (devdata[stype].ipath_sregbase > + + (char *)devdata[stype].ipath_kregbase); > + val = sbase ? 
sbase[regno] : 0ULL; > + return val; > +} > + > +static int ipath_do_user_init(struct ipath_portdata *, > + struct ipath_user_info __user *); > +static int ipath_get_baseinfo(struct ipath_portdata *, > + struct ipath_base_info __user *); > +static int ipath_get_units(void); > +static int ipath_wr_eeprom(struct ipath_portdata *, > + struct ipath_eeprom_req __user *); > +static int ipath_wait_intr(struct ipath_portdata *, uint32_t); > +static int ipath_tid_update(struct ipath_portdata *, struct _tidupd __user *); > +static int ipath_tid_free(struct ipath_portdata *, struct _tidupd __user *); > +static int ipath_get_counters(ipath_type, struct infinipath_counters __user *); > +static int ipath_get_unit_counters(struct infinipath_getunitcounters __user *a); > +static int ipath_get_stats(struct infinipath_stats __user *); > +static int ipath_set_partkey(struct ipath_portdata *, uint16_t); > +static int ipath_manage_rcvq(struct ipath_portdata *, uint16_t); > +static void ipath_clean_partkey(struct ipath_portdata *, > + struct ipath_devdata *); > +static void ipath_disarm_piobufs(const ipath_type, unsigned, unsigned); > +static int ipath_create_user_egr(struct ipath_portdata *); > +static int ipath_create_port0_egr(struct ipath_portdata *); > +static int ipath_create_rcvhdrq(struct ipath_portdata *); > +static void ipath_handle_errors(const ipath_type, uint64_t); > +static void ipath_update_pio_bufs(const ipath_type); > +static int ipath_shutdown_link(const ipath_type); > +static int ipath_bringup_link(const ipath_type); > +int ipath_bringup_serdes(const ipath_type); > +static void ipath_get_faststats(unsigned long); > +static int ipath_setup_htconfig(struct pci_dev *, uint64_t *, const ipath_type); > +static struct page *ipath_nopage(struct vm_area_struct *, unsigned long, int *); > +static irqreturn_t ipath_intr(int irq, void *devid, struct pt_regs *regs); > +static void ipath_decode_err(char *, size_t, uint64_t); > +void ipath_free_pddata(struct ipath_devdata *, uint32_t, int); > +static void ipath_clear_tids(const ipath_type, unsigned); > +static void ipath_get_guid(const ipath_type); > +static int ipath_sma_ioctl(struct file *, unsigned int, unsigned long); > +static int ipath_rcvsma_pkt(struct ipath_sendpkt __user *); > +static int ipath_kset_lid(uint32_t); > +static int ipath_kset_mlid(uint32_t); > +static int ipath_get_mlid(uint32_t __user *); > +static int ipath_get_devstatus(uint64_t __user *); > +static int ipath_kset_guid(struct ipath_setguid __user *); > +static int ipath_get_portinfo(uint32_t __user *); > +static int ipath_get_nodeinfo(uint32_t __user *); > +#ifdef _IPATH_EXTRA_DEBUG > +static void ipath_dump_allregs(char *, ipath_type); > +#endif > + > +static const char ipath_sma_name[] = "infinipath_SMA"; > + > +/* > + * is diags mode enabled? if it is, then things like auto bringup of > + * links is disabled > + */ > + > +int ipath_diags_enabled = 0; > + > +void ipath_chip_done(void) > +{ > +} > + > +void ipath_chip_cleanup(struct ipath_devdata * dd) > +{ > +} What are these two empty functions for? > +/* > + * cache aligned location > + * > + * where port 0 rcvhdrtail register is written back; also want > + * nothing else sharing the cache line, so make it a cache line in size > + * used for all units > + * > + * This is volatile as it's the target of a DMA from the chip. 
> + */ > + > +static volatile uint64_t ipath_port0_rcvhdrtail[512] > + __attribute__ ((aligned(4096))); > + > +#define MODNAME "ipath_core" > +#define DRIVER_LOAD_MSG "PathScale " MODNAME " loaded: " > +#define PFX MODNAME ": " > + > +/* > + * min buffers we want to have per port, after driver > + */ > + > +#define IPATH_MIN_USER_PORT_BUFCNT 8 > + > +/* The size has to be longer than this string, so we can > + * append board/chip information to it in the init code. > + */ > +static char ipath_core_version[192] = IPATH_IDSTR; > +static char *chip_driver_version; > +static int chip_driver_size; > + > +/* mylid and lidbase are to deal with LIDs in "fabric", until SM is working */ > + > +module_param(infinipath_debug, uint, 0644); > +module_param(infinipath_kpiobufs, uint, 0644); > +module_param(infinipath_cfgports, uint, 0644); > +module_param(infinipath_cfgunits, uint, 0644); > + > +MODULE_PARM_DESC(infinipath_debug, "mask for debug prints"); > +MODULE_PARM_DESC(infinipath_cfgports, "Set max number of ports to use"); > +MODULE_PARM_DESC(infinipath_cfgunits, "Set max number of devices to use"); > + > +MODULE_LICENSE("GPL"); > +MODULE_AUTHOR("PathScale "); > +MODULE_DESCRIPTION("Pathscale InfiniPath driver"); > + > +#ifdef IPATH_DIAG > +static __kernel_pid_t ipath_diag_alive; /* PID of diags, if running */ > +int ipath_diags_ioctl(struct file *, unsigned, unsigned long); > +static int ipath_opendiag(struct inode *, struct file *); > +#endif > + > +#if __IPATH_INFO || __IPATH_DBG > +static const char *ipath_ibcstatus_str[] = { > + "Disabled", > + "LinkUp", > + "PollActive", > + "PollQuiet", > + "SleepDelay", > + "SleepQuiet", > + "LState6", /* unused */ > + "LState7", /* unused */ > + "CfgDebounce", > + "CfgRcvfCfg", > + "CfgWaitRmt", > + "CfgIdle", > + "RecovRetrain", > + "LState0xD", /* unused */ > + "RecovWaitRmt", > + "RecovIdle", > +}; > +#endif > + > +static ssize_t show_version(struct device_driver *dev, char *buf) > +{ > + return snprintf(buf, PAGE_SIZE, "%s", ipath_core_version); > +} > + > +static ssize_t show_status(struct device *dev, > + struct device_attribute *attr, > + char *buf) > +{ > + struct ipath_devdata *dd = dev_get_drvdata(dev); > + > + if (!dd) > + return -EINVAL; > + > + if (!dd->ipath_statusp) > + return -EINVAL; > + > + return snprintf(buf, PAGE_SIZE, "%llx\n", *(dd->ipath_statusp)); > +} > + > +static const char *ipath_status_str[] = { > + "Initted", > + "Disabled", > + "4", /* unused */ > + "OIB_SMA", > + "SMA", > + "Present", > + "IB_link_up", > + "IB_configured", > + "NoIBcable", > + "Fatal_Hardware_Error", > + NULL, > +}; > + > +static ssize_t show_status_str(struct device *dev, > + struct device_attribute *attr, > + char *buf) > +{ > + struct ipath_devdata *dd = dev_get_drvdata(dev); > + int i, any; > + uint64_t s; > + > + if (!dd) > + return -EINVAL; > + > + if (!dd->ipath_statusp) > + return -EINVAL; > + > + s = *(dd->ipath_statusp); > + *buf = '\0'; > + for (any = i = 0; s && ipath_status_str[i]; i++) { > + if (s & 1) { > + if (any && strlcat(buf, " ", PAGE_SIZE) >= PAGE_SIZE) > + /* overflow */ > + break; > + if (strlcat(buf, ipath_status_str[i], > + PAGE_SIZE) >= PAGE_SIZE) > + break; > + any = 1; > + } > + s >>= 1; > + } > + if (any) > + strlcat(buf, "\n", PAGE_SIZE); > + > + return strlen(buf); > +} how big can this "status string" be? If it's even getting close to PAGE_SIZE, this doesn't need to be a sysfs attribute, but you should break it up into its individual pieces. Based on the table above, this function can get much simpler... 
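For illustration, one possible shape of the simplification being hinted at here, letting scnprintf() do the bounds handling; this is only a sketch, and assumes the ipath_status_str[] table and the devdata accessor from the patch above:

static ssize_t show_status_str(struct device *dev,
                               struct device_attribute *attr, char *buf)
{
        struct ipath_devdata *dd = dev_get_drvdata(dev);
        ssize_t len = 0;
        uint64_t s;
        int i;

        if (!dd || !dd->ipath_statusp)
                return -EINVAL;

        /* one pass over the bit table; scnprintf() can't overrun PAGE_SIZE */
        for (i = 0, s = *dd->ipath_statusp; s && ipath_status_str[i];
             i++, s >>= 1)
                if (s & 1)
                        len += scnprintf(buf + len, PAGE_SIZE - len, "%s%s",
                                         len ? " " : "", ipath_status_str[i]);
        if (len)
                len += scnprintf(buf + len, PAGE_SIZE - len, "\n");
        return len;
}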
> + > +static ssize_t show_lid(struct device *dev, > + struct device_attribute *attr, > + char *buf) > +{ > + struct ipath_devdata *dd = dev_get_drvdata(dev); > + > + if (!dd) > + return -EINVAL; > + > + return snprintf(buf, PAGE_SIZE, "%x\n", dd->ipath_lid); > +} > + > +static ssize_t show_mlid(struct device *dev, > + struct device_attribute *attr, > + char *buf) > +{ > + struct ipath_devdata *dd = dev_get_drvdata(dev); > + > + if (!dd) > + return -EINVAL; > + > + return snprintf(buf, PAGE_SIZE, "%x\n", dd->ipath_mlid); > +} > + > +static ssize_t show_guid(struct device *dev, > + struct device_attribute *attr, > + char *buf) > +{ > + struct ipath_devdata *dd = dev_get_drvdata(dev); > + uint8_t *guid; > + > + if (!dd) > + return -EINVAL; > + > + guid = (uint8_t *)&(dd->ipath_guid); > + > + return snprintf(buf, PAGE_SIZE, "%x:%x:%x:%x:%x:%x:%x:%x\n", > + guid[0], guid[1], guid[2], guid[3], guid[4], guid[5], > + guid[6], guid[7]); > +} > + > +static ssize_t show_nguid(struct device *dev, > + struct device_attribute *attr, > + char *buf) > +{ > + struct ipath_devdata *dd = dev_get_drvdata(dev); > + > + if (!dd) > + return -EINVAL; > + > + return snprintf(buf, PAGE_SIZE, "%u\n", dd->ipath_nguid); > +} > + > +static ssize_t show_serial(struct device *dev, > + struct device_attribute *attr, > + char *buf) > +{ > + struct ipath_devdata *dd = dev_get_drvdata(dev); > + > + if (!dd) > + return -EINVAL; > + > + buf[sizeof dd->ipath_serial] = '\0'; > + memcpy(buf, dd->ipath_serial, sizeof dd->ipath_serial); > + strcat(buf, "\n"); > + return strlen(buf); > +} > + > +static ssize_t show_unit(struct device *dev, > + struct device_attribute *attr, > + char *buf) > +{ > + struct ipath_devdata *dd = dev_get_drvdata(dev); > + > + if (!dd) > + return -EINVAL; Don't you mean -ENODEV? > + > + snprintf(buf, PAGE_SIZE, "%u\n", dd->ipath_unit); > + return strlen(buf); return the snprintf() call instead of calling strlen() all the time please. > +} > + > +static DRIVER_ATTR(version, S_IRUGO, show_version, NULL); > +static DEVICE_ATTR(status, S_IRUGO, show_status, NULL); > +static DEVICE_ATTR(status_str, S_IRUGO, show_status_str, NULL); > +static DEVICE_ATTR(lid, S_IRUGO, show_lid, NULL); > +static DEVICE_ATTR(mlid, S_IRUGO, show_mlid, NULL); > +static DEVICE_ATTR(guid, S_IRUGO, show_guid, NULL); > +static DEVICE_ATTR(nguid, S_IRUGO, show_nguid, NULL); > +static DEVICE_ATTR(serial, S_IRUGO, show_serial, NULL); > +static DEVICE_ATTR(unit, S_IRUGO, show_unit, NULL); > + > +/* > + * called from add_timer and user counter read calls, to deal with > + * counters that wrap in "human time". The words sent and received, and > + * the packets sent and received are all that we worry about. For now, > + * at least, we don't worry about error counters, because if they wrap > + * that quickly, we probably don't care. We may eventually just make this > + * handle all the counters. word counters can wrap in about 20 seconds > + * of full bandwidth traffic, packet counters in a few hours. 
> + */ > + > +uint64_t ipath_snap_cntr(const ipath_type t, ipath_creg creg) > +{ > + uint32_t val; > + uint64_t val64, t0, t1; > + struct ipath_devdata *dd = &devdata[t]; > + static uint64_t one_sec_in_cycles; > + extern uint32_t _ipath_pico_per_cycle; > + > + if (!one_sec_in_cycles && _ipath_pico_per_cycle) > + one_sec_in_cycles = 1000000000000UL / _ipath_pico_per_cycle; > + > + t0 = get_cycles(); > + val = ipath_kget_creg32(t, creg); > + t1 = get_cycles(); > + if ((t1 - t0) > one_sec_in_cycles && val == -1) { > + /* > + * This is just a way to detect things that are quite broken. > + * Normally this should take just a few cycles (the check is > + * for long enough that we don't care if we get pre-empted.) > + * An Opteron HT O read timeout is 4 seconds with normal > + * NB values > + */ > + > + _IPATH_UNIT_ERROR(t, "Error! Reading counter 0x%x timed out\n", > + creg); > + return 0ULL; > + } > + > + if (creg == cr_wordsendcnt) { > + if (val != dd->ipath_lastsword) { > + dd->ipath_sword += val - dd->ipath_lastsword; > + dd->ipath_lastsword = val; > + } > + val64 = dd->ipath_sword; > + } else if (creg == cr_wordrcvcnt) { > + if (val != dd->ipath_lastrword) { > + dd->ipath_rword += val - dd->ipath_lastrword; > + dd->ipath_lastrword = val; > + } > + val64 = dd->ipath_rword; > + } else if (creg == cr_pktsendcnt) { > + if (val != dd->ipath_lastspkts) { > + dd->ipath_spkts += val - dd->ipath_lastspkts; > + dd->ipath_lastspkts = val; > + } > + val64 = dd->ipath_spkts; > + } else if (creg == cr_pktrcvcnt) { > + if (val != dd->ipath_lastrpkts) { > + dd->ipath_rpkts += val - dd->ipath_lastrpkts; > + dd->ipath_lastrpkts = val; > + } > + val64 = dd->ipath_rpkts; > + } else > + val64 = (uint64_t) val; > + > + return val64; > +} > + > +/* > + * print the delta of egrfull/hdrqfull errors for kernel ports no more > + * than every 5 seconds. User processes are printed at close, but kernel > + * doesn't close, so... Separate routine so may call from other places > + * someday, and so function name when printed by _IPATH_INFO is meaningfull > + */ > + > +static void ipath_qcheck(const ipath_type t) > +{ > + static uint64_t last_tot_hdrqfull; > + size_t blen = 0; > + struct ipath_devdata *dd = &devdata[t]; > + char buf[128]; > + > + *buf = 0; > + if (dd->ipath_pd[0]->port_hdrqfull != dd->ipath_p0_hdrqfull) { > + blen = snprintf(buf, sizeof buf, "port 0 hdrqfull %u", > + dd->ipath_pd[0]->port_hdrqfull - > + dd->ipath_p0_hdrqfull); > + dd->ipath_p0_hdrqfull = dd->ipath_pd[0]->port_hdrqfull; > + } > + if (ipath_stats.sps_etidfull != dd->ipath_last_tidfull) { > + blen += > + snprintf(buf + blen, sizeof buf - blen, "%srcvegrfull %llu", > + blen ? ", " : "", > + ipath_stats.sps_etidfull - dd->ipath_last_tidfull); > + dd->ipath_last_tidfull = ipath_stats.sps_etidfull; > + } > + > + /* > + * this is actually the number of hdrq full interrupts, not actual > + * events, but at the moment that's mostly what I'm interested in. > + * Actual count, etc. is in the counters, if needed. For production > + * users this won't ordinarily be printed. > + */ > + > + if ((infinipath_debug & (__IPATH_PKTDBG | __IPATH_DBG)) && > + ipath_stats.sps_hdrqfull != last_tot_hdrqfull) { > + blen += > + snprintf(buf + blen, sizeof buf - blen, > + "%shdrqfull %llu (all ports)", blen ? 
", " : "", > + ipath_stats.sps_hdrqfull - last_tot_hdrqfull); > + last_tot_hdrqfull = ipath_stats.sps_hdrqfull; > + } > + if (blen) > + _IPATH_DBG("%s\n", buf); > + > + if (*dd->ipath_hdrqtailptr != dd->ipath_port0head) { > + if (dd->ipath_lastport0rcv_cnt == ipath_stats.sps_port0pkts) { > + _IPATH_PDBG("missing rcv interrupts? port0 hd=%llx tl=%x; port0pkts %llx\n", > + *dd->ipath_hdrqtailptr, dd->ipath_port0head,ipath_stats.sps_port0pkts); > + ipath_kreceive(t); > + } > + dd->ipath_lastport0rcv_cnt = ipath_stats.sps_port0pkts; > + } > +} > + > +/* > + * called from add_timer to get word counters from chip before they > + * can overflow > + */ > + > +static void ipath_get_faststats(unsigned long t) > +{ > + uint32_t val; > + struct ipath_devdata *dd = &devdata[t]; > + static unsigned cnt; > + > + /* > + * don't access the chip while running diags, or memory diags > + * can fail > + */ > + if (!dd->ipath_kregbase || !(dd->ipath_flags & IPATH_PRESENT) || > + ipath_diags_enabled) { > + /* but re-arm the timer, for diags case; won't hurt other */ > + goto done; > + } > + > + ipath_snap_cntr((ipath_type) t, cr_wordsendcnt); > + ipath_snap_cntr((ipath_type) t, cr_wordrcvcnt); > + ipath_snap_cntr((ipath_type) t, cr_pktsendcnt); > + ipath_snap_cntr((ipath_type) t, cr_pktrcvcnt); > + > + ipath_qcheck(t); > + > + /* > + * deal with repeat error suppression. Doesn't really matter if > + * last error was almost a full interval ago, or just a few usecs > + * ago; still won't get more than 2 per interval. We may want > + * longer intervals for this eventually, could do with mod, counter > + * or separate timer. Also see code in ipath_handle_errors() and > + * ipath_handle_hwerrors(). > + */ > + > + if (dd->ipath_lasterror) > + dd->ipath_lasterror = 0; > + if (dd->ipath_lasthwerror) > + dd->ipath_lasthwerror = 0; > + if ((devdata[t].ipath_maskederrs & ~devdata[t].ipath_ignorederrs) > + && get_cycles() > devdata[t].ipath_unmasktime) { > + char ebuf[256]; > + ipath_decode_err(ebuf, sizeof ebuf, > + (devdata[t].ipath_maskederrs & ~devdata[t]. > + ipath_ignorederrs)); > + if ((devdata[t].ipath_maskederrs & ~devdata[t]. > + ipath_ignorederrs) > + & ~(INFINIPATH_E_RRCVEGRFULL | INFINIPATH_E_RRCVHDRFULL)) { > + _IPATH_UNIT_ERROR(t, "Re-enabling masked errors (%s)\n", > + ebuf); > + } else { > + /* > + * rcvegrfull and rcvhdrqfull are "normal", > + * for some types of processes (mostly benchmarks) > + * that send huge numbers of messages, while > + * not processing them. So only complain about > + * these at debug level. 
> + */ > + _IPATH_DBG > + ("Disabling frequent queue full errors (%s)\n", > + ebuf); > + } > + devdata[t].ipath_maskederrs = devdata[t].ipath_ignorederrs; > + ipath_kput_kreg(t, kr_errormask, ~devdata[t].ipath_maskederrs); > + } > + > + if (dd->ipath_flags & IPATH_LINK_SLEEPING) { > + uint64_t ibc; > + _IPATH_VDBG("linkinitcmd SLEEP, move to POLL\n"); > + dd->ipath_flags &= ~IPATH_LINK_SLEEPING; > + ibc = dd->ipath_ibcctrl; > + /* > + * don't put linkinitcmd in ipath_ibcctrl, want that to > + * stay a NOP > + */ > + ibc |= > + INFINIPATH_IBCC_LINKINITCMD_POLL << > + INFINIPATH_IBCC_LINKINITCMD_SHIFT; > + ipath_kput_kreg(t, kr_ibcctrl, ibc); > + } > + > + /* limit qfull messages to ~one per minute per port */ > + if ((++cnt & 0x10)) { > + for (val = devdata[t].ipath_cfgports - 1; ((int)val) >= 0; > + val--) { > + if (dd->ipath_lastegrheads[val] != -1) > + dd->ipath_lastegrheads[val] = -1; > + if (dd->ipath_lastrcvhdrqtails[val] != -1) > + dd->ipath_lastrcvhdrqtails[val] = -1; > + } > + } > + > + if (dd->ipath_nosma_bufs) { > + dd->ipath_nosma_secs += 5; > + if (dd->ipath_nosma_secs >= 30) { > + _IPATH_SMADBG("No SMA bufs avail %u seconds; cancelling pending sends\n", > + dd->ipath_nosma_secs); > + ipath_disarm_piobufs(t, dd->ipath_lastport_piobuf, > + dd->ipath_piobcnt - dd->ipath_lastport_piobuf); > + dd->ipath_nosma_secs = 0; /* start again, if necessary */ > + } > + else > + _IPATH_SMADBG("No SMA bufs avail %u tries, after %u seconds\n", > + dd->ipath_nosma_bufs, dd->ipath_nosma_secs); > + } > + > +done: > + mod_timer(&dd->ipath_stats_timer, jiffies + HZ * 5); > +} > + > + > +static void __devexit infinipath_remove_one(struct pci_dev *); > +static int infinipath_init_one(struct pci_dev *, const struct pci_device_id *); > + > +/* Only needed for registration, nothing else needs this info */ > +#define PCI_VENDOR_ID_PATHSCALE 0x1fc1 > +#define PCI_DEVICE_ID_PATHSCALE_INFINIPATH_HT 0xd > + > +const struct pci_device_id infinipath_pci_tbl[] = { > + { > + PCI_VENDOR_ID_PATHSCALE, PCI_DEVICE_ID_PATHSCALE_INFINIPATH_HT, > + PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0}, PCI_DEVICE() instead? > + {0,} {}, is all that is needed here. > +}; > + > +MODULE_DEVICE_TABLE(pci, infinipath_pci_tbl); > + > +static struct pci_driver infinipath_driver = { > + .name = MODNAME, > + .driver.owner = THIS_MODULE, This line is not needed, you can remove it. > + .probe = infinipath_init_one, > + .remove = __devexit_p(infinipath_remove_one), > + .id_table = infinipath_pci_tbl, > +}; > + > +#if defined (pgprot_writecombine) && defined(_PAGE_MA_WC) > +int remap_area_pages(unsigned long address, unsigned long phys_addr, > + unsigned long size, unsigned long flags); > +#endif > + > +static int infinipath_init_one(struct pci_dev *pdev, > + const struct pci_device_id *ent) > +{ > + int ret, len, j; > + static int chip_idx = -1; > + unsigned long addr; > + uint64_t intconfig; > + uint8_t rev; > + ipath_type dev; > + > + /* > + * XXX: Right now, we have a hardcoded array of devices. We'll > + * change this in a future release, but not just yet. For the > + * moment, we're limited to 4 infinipath devices per system. 
> + */ > + > + dev = ++chip_idx; > + > + _IPATH_VDBG("initializing unit #%u\n", dev); > + if ((!infinipath_cfgunits && (dev >= 1)) || > + (infinipath_cfgunits && (dev >= infinipath_cfgunits)) || > + (dev >= infinipath_max)) { > + _IPATH_ERROR("Trying to initialize unit %u, max is %u\n", > + dev, infinipath_max - 1); > + return -EINVAL; > + } > + > + devdata[dev].pci_registered = 1; > + devdata[dev].ipath_unit = dev; > + > + if ((ret = pci_enable_device(pdev))) { > + _IPATH_DBG("pci_enable unit %u failed: %x\n", dev, ret); > + } {} not needed here. > + > + if ((ret = pci_request_regions(pdev, MODNAME))) > + _IPATH_INFO("pci_request_regions unit %u fails: %d\n", dev, > + ret); > + > + if ((ret = pci_set_dma_mask(pdev, DMA_64BIT_MASK)) != 0) > + _IPATH_INFO("pci_set_dma_mask unit %u fails: %d\n", dev, ret); > + > + pci_set_master(pdev); /* probably not be needed for HT */ > + > + addr = pci_resource_start(pdev, 0); > + len = pci_resource_len(pdev, 0); > + _IPATH_VDBG > + ("regbase (0) %lx len %d irq %x, vend %x/%x driver_data %lx\n", > + addr, len, pdev->irq, ent->vendor, ent->device, ent->driver_data); > + devdata[dev].ipath_deviceid = ent->device; /* save for later use */ > + devdata[dev].ipath_vendorid = ent->vendor; > + for (j = 0; j < 6; j++) { > + if (!pdev->resource[j].start) > + continue; > + _IPATH_VDBG("BAR %d start %lx, end %lx, len %lx\n", > + j, pdev->resource[j].start, > + pdev->resource[j].end, pci_resource_len(pdev, j)); > + } > + > + if (!addr) { > + _IPATH_UNIT_ERROR(dev, "No valid address in BAR 0!\n"); > + return -ENODEV; > + } > + > + if ((ret = pci_read_config_byte(pdev, PCI_REVISION_ID, &rev))) { > + _IPATH_UNIT_ERROR(dev, > + "Failed to read PCI revision ID unit %u: %d\n", > + dev, ret); > + return ret; /* shouldn't ever happen */ > + } else > + devdata[dev].ipath_pcirev = rev; > + > + devdata[dev].ipath_kregbase = ioremap_nocache(addr, len); > +#if defined (pgprot_writecombine) && defined(_PAGE_MA_WC) > + printk("Remapping pages WC\n"); No KERN_ level? > + remap_area_pages((unsigned long) devdata[dev].ipath_kregbase + > + 1024 * 1024, addr + 1024 * 1024, 1024 * 1024, > + _PAGE_MA_WC); > + /* devdata[dev].ipath_kregbase = __ioremap(addr, len, _PAGE_MA_WC); */ > +#endif > + > + if (!devdata[dev].ipath_kregbase) { > + _IPATH_DBG("Unable to map io addr %lx to kvirt, failing\n", > + addr); > + ret = -ENOMEM; > + goto fail; > + } > + devdata[dev].ipath_kregend = (uint64_t __iomem *) > + ((void __iomem *) devdata[dev].ipath_kregbase + len); > + devdata[dev].ipath_physaddr = addr; /* used for io_remap, etc. */ > + /* for user mmap */ > + devdata[dev].ipath_kregvirt = (uint64_t __iomem *) phys_to_virt(addr); > + _IPATH_VDBG("mapped io addr %lx to kregbase %p kregvirt %p\n", addr, > + devdata[dev].ipath_kregbase, devdata[dev].ipath_kregvirt); > + > + /* > + * set these up before registering the interrupt handler, just > + * in case > + */ > + devdata[dev].pcidev = pdev; > + pci_set_drvdata(pdev, &(devdata[dev])); It's not a "just in case" type thing, you have to do this before you register that interrupt handler, as you can be instantly called here. Are you sure everything else is set up properly here before calling that function? > + > + /* > + * set up our interrupt handler; SA_SHIRQ probably not needed, > + * but won't hurt for now. 
> + */ > + > + if (!pdev->irq) { > + _IPATH_UNIT_ERROR(dev, "irq is 0, failing init\n"); > + ret = -EINVAL; > + goto fail; > + } > + if ((ret = request_irq(pdev->irq, ipath_intr, > + SA_SHIRQ, MODNAME, &devdata[dev]))) { > + _IPATH_UNIT_ERROR(dev, > + "Couldn't setup interrupt handler, irq=%u: %d\n", > + pdev->irq, ret); > + goto fail; > + } > + > + /* > + * clear ipath_flags here instead of in ipath_init_chip as it is set > + * by ipath_setup_htconfig. > + */ > + devdata[dev].ipath_flags = 0; > + if (ipath_setup_htconfig(pdev, &intconfig, dev)) > + _IPATH_DBG > + ("Failed to setup HT config, continuing anyway for now\n"); > + > + ret = ipath_init_chip(dev); /* do the chip-specific init */ > + if (!ret) { > +#ifdef CONFIG_MTRR > + uint64_t pioaddr, piolen; > + unsigned bits; > + /* > + * Set the PIO buffers to be WCCOMB, so we get HT bursts > + * to the chip. Linux (possibly the hardware) requires > + * it to be on a power of 2 address matching the length > + * (which has to be a power of 2). For rev1, that means > + * the base address, for rev2, it will be just the PIO > + * buffers themselves. > + */ > + pioaddr = addr + devdata[dev].ipath_piobufbase; > + piolen = devdata[dev].ipath_piobcnt * > + ALIGN(devdata[dev].ipath_piosize, > + devdata[dev].ipath_palign); > + > + for (bits = 0; !(piolen & (1ULL << bits)); bits++) > + /* do nothing */; > + > + if (piolen != (1ULL << bits)) { > + _IPATH_DBG("piolen 0x%llx not power of 2, bits=%u\n", > + piolen, bits); > + piolen >>= bits; > + while (piolen >>= 1) > + bits++; > + piolen = 1ULL << (bits + 1); > + _IPATH_DBG("Changed piolen to 0x%llx bits=%u\n", piolen, > + bits); > + } > + if (pioaddr & (piolen - 1)) { > + uint64_t atmp; > + _IPATH_DBG > + ("pioaddr %llx not on right boundary for size %llx, fixing\n", > + pioaddr, piolen); > + atmp = pioaddr & ~(piolen - 1); > + if (atmp < addr || (atmp + piolen) > (addr + len)) { > + _IPATH_UNIT_ERROR(dev, > + "No way to align address/size (%llx/%llx), no WC mtrr\n", > + atmp, piolen << 1); > + ret = -ENODEV; > + } else { > + _IPATH_DBG > + ("changing WC base from %llx to %llx, len from %llx to %llx\n", > + pioaddr, atmp, piolen, piolen << 1); > + pioaddr = atmp; > + piolen <<= 1; > + } > + } > + > + if (!ret) { > + int cookie; > + _IPATH_VDBG > + ("Setting mtrr for chip to WC (addr %llx, len=0x%llx)\n", > + pioaddr, piolen); > + cookie = mtrr_add(pioaddr, piolen, MTRR_TYPE_WRCOMB, 0); > + if (cookie < 0) { > + _IPATH_INFO > + ("mtrr_add(%llx,0x%llx,WC,0) failed (%d)\n", > + pioaddr, piolen, cookie); > + ret = -EINVAL; > + } else { > + _IPATH_VDBG > + ("Set mtrr for chip to WC, cookie is %d\n", > + cookie); > + devdata[dev].ipath_mtrr = (uint32_t) cookie; > + } > + } > +#endif /* CONFIG_MTRR */ > + } > + > + if (!ret && devdata[dev].ipath_kregbase && (devdata[dev].ipath_flags > + & IPATH_PRESENT)) { > + /* > + * for the hardware, enable interrupts only after > + * kr_interruptconfig is written, if we could set it up > + */ > + if (intconfig) { > + /* interrupt address */ > + ipath_kput_kreg(dev, kr_interruptconfig, intconfig); > + /* enable all interrupts */ > + ipath_kput_kreg(dev, kr_intmask, -1LL); > + /* force re-interrupt of any pending interrupts. 
*/
> +		ipath_kput_kreg(dev, kr_intclear, 0ULL);
> +		/* OK, the chip is usable, marked it as initialized */
> +		*devdata[dev].ipath_statusp |= IPATH_STATUS_INITTED;
> +	} else
> +		_IPATH_UNIT_ERROR(dev,
> +			"No interrupts enabled, couldn't setup interrupt address\n");
> +	} else if (ret != -EPERM)
> +		_IPATH_INFO("Not configuring unit %u interrupts, init failed\n",
> +			dev);
> +
> +	device_create_file(&(pdev->dev), &dev_attr_status);
> +	device_create_file(&(pdev->dev), &dev_attr_status_str);
> +	device_create_file(&(pdev->dev), &dev_attr_lid);
> +	device_create_file(&(pdev->dev), &dev_attr_mlid);
> +	device_create_file(&(pdev->dev), &dev_attr_guid);
> +	device_create_file(&(pdev->dev), &dev_attr_nguid);
> +	device_create_file(&(pdev->dev), &dev_attr_serial);
> +	device_create_file(&(pdev->dev), &dev_attr_unit);

Why not use an attribute array?  Makes for proper error handling if one
of those calls does not work...

> +	/*
> +	 * We used to cleanup here, with pci_release_regions, etc. but that
> +	 * can cause other problems if we want to run diags, etc., so instead
> +	 * defer that until driver unload.
> +	 */

So memory leaks are acceptable?

> +fail:	/* after we've done at least some of the pci setup */
> +	if (ret == -EPERM)	/* disabled device, don't want module load error;
> +				 * just want to carry status through to this point */
> +		ret = 0;

Module load error does not happen no matter what kind of return value
you send back from this function.  So the comment is wrong, and the fact
that you failed initializing the device is also wrong, please don't do
this.

thanks,

greg k-h

From jlentini at netapp.com Fri Dec 30 15:07:52 2005
From: jlentini at netapp.com (James Lentini)
Date: Fri, 30 Dec 2005 18:07:52 -0500 (EST)
Subject: [openib-general] [kernel verbs] u64 vs dma_addr_t
In-Reply-To: <43B5B011.6000100@ichips.intel.com>
References: <5D78D28F88822E4D8702BB9EEF1A43670A0944@mercury.infiniconsys.com>
	<43B5B011.6000100@ichips.intel.com>
Message-ID: 

sean> James Lentini wrote:
sean> > Why is the ib_sge's addr a u64 and not a dma_addr_t?
sean>
sean> It's the same address that the user can transfer to the remote
sean> side.

It can be the same address, but does it have to be?

A user can directly map local addresses to InfiniBand I/O virtual
addresses, but I don't think it is a requirement. In other words, I
thought that a user could register address x and request an InfiniBand
I/O virtual address of y, x != y, for the mapping.

I understand why the ib_send_wr's rdma.remote_addr needs to be a u64,
since it ultimately winds up on the wire.

In the case of the ib_sge's addr, I didn't think these values left the
local node. My assumption (based on looking at the mthca driver) is
that they are supposed to contain "local" I/O addresses (bus
addresses). Therefore, my confusion over why dma_addr_t wasn't used.

sean> Also, if inline sends are being used, the address is not
sean> necessarily a DMA address.

Which ib_wr_opcode[s] are "inline sends"? IB_WR_SEND,
IB_WR_SEND_WITH_IMM, ...?

My expectation was that all of the scatter/gather data for both sends
(of all flavors: send, rdma read, rdma write,...) and recvs would be
DMA addresses. The "not necessarily" part makes me worry. How can I
determine in a device independent way which buffers need to be
DMA'able and which do not? Is the only safe option to assume that all
do?
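To make the question concrete, here is roughly how a kernel consumer
fills an ib_sge on the send side today; this is a sketch of mthca-era
usage, not a statement of what the verbs API guarantees, and it assumes
dev (a struct ib_device *), buf, len, and a registered mr already exist:

	struct ib_sge sge;
	dma_addr_t dma;

	/* map the buffer for device access; 'dma' is a bus address */
	dma = dma_map_single(dev->dma_device, buf, len, DMA_TO_DEVICE);

	sge.addr   = (u64) dma;	/* widened to 64 bits, whatever dma_addr_t is */
	sge.length = len;
	sge.lkey   = mr->lkey;	/* lkey of the region covering 'buf' */

The cast is lossless on every configuration, which is one practical
argument for the u64 field; what it leaves open is exactly the question
above, namely whether the field is defined to hold a bus address or an
address that the memory region translates.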
From bos at pathscale.com Fri Dec 30 15:10:09 2005 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 30 Dec 2005 15:10:09 -0800 Subject: [openib-general] Re: [PATCH 12 of 20] ipath - misc driver support code In-Reply-To: <20051230082505.GC7438@kroah.com> References: <5e9b0b7876e2d570f25e.1135816291@eng-12.pathscale.com> <20051230082505.GC7438@kroah.com> Message-ID: <1135984209.13318.47.camel@serpentine.pathscale.com> On Fri, 2005-12-30 at 00:25 -0800, Greg KH wrote: > No description of what the patch does? Ahem. Oops. > > +struct _infinipath_do_not_use_kernel_regs { > > + unsigned long long Revision; > > u64? Right. > > + unsigned long long Control; > > + unsigned long long PageAlign; > > + unsigned long long PortCnt; > > And what's with the InterCapsNamingScheme of these variables? They're taken straight from the register names in our chip spec. I can squish them to lowercase-only, if that seems important. > > +/* > > + * would prefer to not inline this, to avoid code bloat, and simplify debugging > > + * But when compiling against 2.6.10 kernel tree, it gets an error, so > > + * not for now. > > + */ > > +static void ipath_i2c_delay(ipath_type, int); > > You aren't compiling this for a 2.6.10 kernel anymore :) Yes, that hunk is redundant. Thanks for spotting it. > > +static void ipath_i2c_delay(ipath_type dev, int dtime) > Huh? After reading your comment, I still don't understand why you can't > just use udelay(). Or are you counting on calling this function with > only "1" being set for dtime? It's usually called with a dtime of 1, but there's an added delay in one place. I just rewrote that routine, so it's now a one-liner that does a read which waits for writes to the chip to complete. The sole caller that wanted an added wait calls udelay itself now. > Ah, isn't it fun to write bit-banging functions... And the in-kernel > i2c code is messier than doing this by hand? >From looking at it, it will make the i2c part of the driver longer, rather than shorter. There's nothing objectionable about the kernel i2c interfaces per se, but our bit-banging code is pretty small and specialised. > Odd function comment style. Please fix this to be in kerneldoc format. Sure. > Are you _sure_ you need all of these for the one function in this file? That file will be taken out and put to sleep. > > +#include > > Where is this file being pulled in from? Ugh, braino. > Woah, um, don't you think that you should either export the main mlock > function itself, or fix your code to not need it? Rolling it yourself > isn't a good idea... Other people have pointed out that our page-pinning code is horked. We'll find a saner alternative. Thanks for the comments, Greg. Message-ID: >Thanks for the comments -- great stuff. I'm in the process of merging >with the latest CMA changes in the trunk and will address your comments >in the next patch. I'm testing some changes, but here is a proposed change to ib_addr.h that I hope will help support iWarp. 
- Sean Index: ib_addr.h =================================================================== --- ib_addr.h (revision 4654) +++ ib_addr.h (working copy) @@ -32,26 +32,28 @@ #include #include +#include #include #include extern struct workqueue_struct *rdma_wq; -struct ib_addr { - union ib_gid sgid; - union ib_gid dgid; - u16 pkey; +struct rdma_dev_addr { + unsigned char src_dev_addr[MAX_ADDR_LEN]; + unsigned char dst_dev_addr[MAX_ADDR_LEN]; + unsigned char broadcast[MAX_ADDR_LEN]; + enum ib_node_type dev_type; }; /** - * ib_translate_addr - Translate a local IP address to an Infiniband GID and - * PKey. + * rdma_translate_ip - Translate a local IP address to an RDMA hardware + * address. */ -int ib_translate_addr(struct sockaddr *addr, union ib_gid *gid, u16 *pkey); +int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr); /** - * ib_resolve_addr - Resolve source and destination IP addresses to - * Infiniband network addresses. + * rdma_resolve_ip - Resolve source and destination IP addresses to + * RDMA hardware addresses. * @src_addr: An optional source address to use in the resolution. If a * source address is not provided, a usable address will be returned via * the callback. @@ -64,13 +66,13 @@ * or been canceled. A status of 0 indicates success. * @context: User-specified context associated with the call. */ -int ib_resolve_addr(struct sockaddr *src_addr, struct sockaddr *dst_addr, - struct ib_addr *addr, int timeout_ms, +int rdma_resolve_ip(struct sockaddr *src_addr, struct sockaddr *dst_addr, + struct rdma_dev_addr *addr, int timeout_ms, void (*callback)(int status, struct sockaddr *src_addr, - struct ib_addr *addr, void *context), + struct rdma_dev_addr *addr, void *context), void *context); -void ib_addr_cancel(struct ib_addr *addr); +void rdma_addr_cancel(struct rdma_dev_addr *addr); static inline int ip_addr_size(struct sockaddr *addr) { @@ -78,5 +80,38 @@ sizeof(struct sockaddr_in6) : sizeof(struct sockaddr_in); } +static inline u16 ib_addr_get_pkey(struct rdma_dev_addr *dev_addr) +{ + return ((u16)dev_addr->broadcast[8] << 8) | (u16)dev_addr->broadcast[9]; +} + +static inline void ib_addr_set_pkey(struct rdma_dev_addr *dev_addr, u16 pkey) +{ + dev_addr->broadcast[8] = pkey >> 8; + dev_addr->broadcast[9] = (unsigned char) pkey; +} + +static inline union ib_gid* ib_addr_get_sgid(struct rdma_dev_addr *dev_addr) +{ + return (union ib_gid *) (dev_addr->src_dev_addr + 4); +} + +static inline void ib_addr_set_sgid(struct rdma_dev_addr *dev_addr, + union ib_gid *gid) +{ + memcpy(dev_addr->src_dev_addr + 4, gid, sizeof *gid); +} + +static inline union ib_gid* ib_addr_get_dgid(struct rdma_dev_addr *dev_addr) +{ + return (union ib_gid *) (dev_addr->dst_dev_addr + 4); +} + +static inline void ib_addr_set_dgid(struct rdma_dev_addr *dev_addr, + union ib_gid *gid) +{ + memcpy(dev_addr->dst_dev_addr + 4, gid, sizeof *gid); +} + #endif /* IB_ADDR_H */ From bos at pathscale.com Fri Dec 30 15:11:44 2005 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 30 Dec 2005 15:11:44 -0800 Subject: [openib-general] Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver In-Reply-To: <20051230080002.GA7438@kroah.com> References: <20051230080002.GA7438@kroah.com> Message-ID: <1135984304.13318.50.camel@serpentine.pathscale.com> On Fri, 2005-12-30 at 00:00 -0800, Greg KH wrote: > > - The driver still uses EXPORT_SYMBOL, for consistency with other > > code in drivers/infiniband > > Why would that matter? 
I don't want to do something gratuitously different to the prevailing set of code in which it lives. > > - We're still using ioctls instead of sysfs or configfs in some > > cases, to maintain userspace compatibility > > Compatibility with what? The driver isn't in the kernel tree yet, so > there's no old kernel versions to remain compatibile with :) We already ship userspace code to customers that relies on the ioctl interfaces. > I also noticed that you are still using the uint64_t type variable > types, can you please switch to the proper kernel types instead (u64 in > this specific example.) Yes, we'll use u64 for internal variables, and __u64 for stuff exported to userspace, etc. References: <20051230081218.GB7438@kroah.com> Message-ID: <1135984675.13318.58.camel@serpentine.pathscale.com> On Fri, 2005-12-30 at 00:12 -0800, Greg KH wrote: > This has grown > into a huge file, can't you split it up into smaller pieces? Absolutely. > Why even save off the return value if you don't do anything with it? I think that's just a throwback to an earlier rev of the driver. > And please don't put assignments in the middle of if statements, that's > just messy and harder to read (the fact that gcc made you put an extra > () should be a hint that you were doing something wrong...) OK. > And does your driver work with udev? I didn't see where you were > exporting the major:minor number of your devices to sysfs, but I might > have missed it. It was written in a pre-udev world, so it still uses a fixed major and minor number. How important is this to you? Is it "nice to have", or "blocker"? :-) > Are you sure that's a good idea? Please do the proper thing and tear > down your infrastructure if something fails, that's the correct thing to > do. That way you can actually recover if something that you call in > this function fails (like driver_create_file(), or > pci_register_driver().) Functions don't return error values just so you > can ignore them :) This will take a bit of cleaning up, but it's a reasonable request. > > +/* > > + * note: if for some reason the unload fails after this routine, and leaves > > + * the driver enterable by user code, we'll almost certainly crash and burn... > > + */ > > See, you admit that what you are doing isn't the wisest thing, which > should tell you something... Indeed. > This is the call that should have cleaned up all of the memory and other > stuff that you do above. If not, then your driver will not work in any > hotplug pci systems, which would not be a good thing. Please do like > Roland says and put your resources and stuff in the device specific > structures, like the rest of the kernel drivers do. I'm working on the appropriate hearts and minds as we speak :-) > Why not just export ipath_ht_get_boardname instead? Because that's too specific to HT for my personal liking. > > +module_init(infinipath_init); > > +module_exit(infinipath_cleanup); > > + > > +EXPORT_SYMBOL(infinipath_debug); > > +EXPORT_SYMBOL(ipath_get_boardname); > > EXPORT_SYMBOL_GPL() ? I don't see a problem with that. > And put them next to the functions themselves, it's easier to notice > that way. OK. Thanks again for the review, openib-general-bounces at openib.org wrote: > sean> James Lentini wrote: > sean> > Why is the ib_sge's addr a u64 and not a dma_addr_t? > sean> > sean> It's the same address that the user can transfer to the remote > sean> side. > > It can be the same address, but does it have to be? 
> > A user can directly map local addresses to InfiniBand I/O > virtual addresses, but I don't think it is a requirement. In > other words, I thought that user could register address x and > request an InfiniBand I/O virtual address of y, x != y, for > the mapping. > > I understand why the ib_send_wr's rdma.remote_addr needs to > be a u64, since it ultimately winds up on the wire. > > In the case of the ib_sge's addr, I didn't think these values > left the local node. My assumption (based on looking at the > mthca driver) is that they are supposed to contain "local" > I/O addresses (bus addresses). Therefore, my confusion over > why dma_addr_t wasn't used. > A privileged user, such as an NFS Daemon or iSER iSCSI Target, can and will create Memory Regions that are not part of its own address space out of page buffers. Even running on a 32-bit kernel it might create a memory region larger than 2**32. Admittedly, that isn't very likely unless it is the *only* daemon running on the machine. But it is legal. From bos at pathscale.com Fri Dec 30 15:47:07 2005 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 30 Dec 2005 15:47:07 -0800 Subject: [openib-general] Re: [PATCH 8 of 20] ipath - core driver, part 1 of 4 In-Reply-To: <20051230083928.GD7438@kroah.com> References: <20051230083928.GD7438@kroah.com> Message-ID: <1135986427.13318.79.camel@serpentine.pathscale.com> On Fri, 2005-12-30 at 00:39 -0800, Greg KH wrote: > > +void ipath_chip_done(void) > > +{ > > +} > > + > > +void ipath_chip_cleanup(struct ipath_devdata * dd) > > +{ > > +} > > What are these two empty functions for? They're just as dead as they look. > > +static ssize_t show_status_str(struct device *dev, > how big can this "status string" be? Just a few dozen bytes. > If it's even getting close to > PAGE_SIZE, this doesn't need to be a sysfs attribute, but you should > break it up into its individual pieces. Do you think that's still warranted, given this? > > +static ssize_t show_unit(struct device *dev, > Don't you mean -ENODEV? Yes, thanks. > > + snprintf(buf, PAGE_SIZE, "%u\n", dd->ipath_unit); > > + return strlen(buf); > > return the snprintf() call instead of calling strlen() all the time > please. OK. > > +const struct pci_device_id infinipath_pci_tbl[] = { > > + { > > + PCI_VENDOR_ID_PATHSCALE, PCI_DEVICE_ID_PATHSCALE_INFINIPATH_HT, > > + PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0}, > > PCI_DEVICE() instead? OK. > > + {0,} > > {}, > is all that is needed here. OK. > > + .driver.owner = THIS_MODULE, > > This line is not needed, you can remove it. OK. > {} not needed here. OK. > > +#if defined (pgprot_writecombine) && defined(_PAGE_MA_WC) > > + printk("Remapping pages WC\n"); > > No KERN_ level? That should just become a debug statement. > > + /* > > + * set these up before registering the interrupt handler, just > > + * in case > > + */ > > + devdata[dev].pcidev = pdev; > > + pci_set_drvdata(pdev, &(devdata[dev])); > > It's not a "just in case" type thing, you have to do this before you > register that interrupt handler, as you can be instantly called here. OK, I'll remove the misleading comment. > Are you sure everything else is set up properly here before calling that > function? I believe so. I'll double check. 
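To spell out the ordering being insisted on here, a hypothetical
rearrangement using the names from the posted driver; everything the
interrupt handler dereferences must be live before request_irq(),
because on a shared line the handler can run before request_irq() even
returns:

	/* handler state first ... */
	devdata[dev].pcidev = pdev;
	pci_set_drvdata(pdev, &devdata[dev]);

	/* ... only then expose the handler */
	ret = request_irq(pdev->irq, ipath_intr, SA_SHIRQ, MODNAME,
			  &devdata[dev]);
	if (ret)
		goto fail;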
> > +	device_create_file(&(pdev->dev), &dev_attr_status);
> > +	device_create_file(&(pdev->dev), &dev_attr_status_str);
> > +	device_create_file(&(pdev->dev), &dev_attr_lid);
> > +	device_create_file(&(pdev->dev), &dev_attr_mlid);
> > +	device_create_file(&(pdev->dev), &dev_attr_guid);
> > +	device_create_file(&(pdev->dev), &dev_attr_nguid);
> > +	device_create_file(&(pdev->dev), &dev_attr_serial);
> > +	device_create_file(&(pdev->dev), &dev_attr_unit);
> 
> Why not use an attribute array?  Makes for proper error handling if one
> of those calls does not work...

OK, thanks.

> > +	/*
> > +	 * We used to cleanup here, with pci_release_regions, etc. but that
> > +	 * can cause other problems if we want to run diags, etc., so instead
> > +	 * defer that until driver unload.
> > +	 */
> 
> So memory leaks are acceptable?

That clearly needs a bit of attention.

> > +fail:	/* after we've done at least some of the pci setup */
> > +	if (ret == -EPERM)	/* disabled device, don't want module load error;
> > +				 * just want to carry status through to this point */
> > +		ret = 0;
> 
> Module load error does not happen no matter what kind of return value
> you send back from this function.  So the comment is wrong, and the fact
> that you failed initializing the device is also wrong, please don't do
> this.

OK.

Thanks for the extensive comments,

References: 
Message-ID: <1135986615.13318.82.camel@serpentine.pathscale.com>

On Fri, 2005-12-30 at 10:46 -0800, Linus Torvalds wrote:

> All your user page lookup/pinning code is terminally broken.

Yes, this has been pointed out by a few others.

> Crap like this must not be merged.

I'm already busy decrappifying it...

References: <54AD0F12E08D1541B826BE97C98F99F1142243@NT-SJCA-0751.brcm.ad.broadcom.com>
Message-ID: 

caitlin> > One more question on this topic.
caitlin> >
caitlin> > Why is the ib_sge's addr a u64 and not a dma_addr_t?
caitlin>
caitlin> Because the hardware may need for it to be a 64 bit
caitlin> IO Address accessible on the system bus. That applies
caitlin> to the whole system bus, no matter how many PCI roots
caitlin> or virtual OSs there are.
caitlin>
caitlin> In particular there could be a guest OS that was
caitlin> running in 32-bit mode, and the RDMA hardware receiving
caitlin> fast path requests will not support different
caitlin> work request formats for each guest OS.

Let me back up a step and explain the context for this question.

As you know, our goal is to use the Linux IB verbs as a
hardware/protocol independent RDMA API. I'm reviewing my use of the
API to make sure that it does not make any particular assumptions.

Roland stated that a scatter/gather list's address value should be a
bus address:

http://openib.org/pipermail/openib-general/2005-August/009748.html

This made me question why the type wasn't dma_addr_t and whether there
was anything protocol/hardware specific about the choice of u64.

At this point, I'm still not sure why dma_addr_t wouldn't be correct
and how a transport/hardware independent consumer of this API should
set this field.
From greg at kroah.com Fri Dec 30 16:13:36 2005 From: greg at kroah.com (Greg KH) Date: Fri, 30 Dec 2005 16:13:36 -0800 Subject: [openib-general] Re: [PATCH 12 of 20] ipath - misc driver support code In-Reply-To: <1135984209.13318.47.camel@serpentine.pathscale.com> References: <5e9b0b7876e2d570f25e.1135816291@eng-12.pathscale.com> <20051230082505.GC7438@kroah.com> <1135984209.13318.47.camel@serpentine.pathscale.com> Message-ID: <20051231001336.GD20314@kroah.com> On Fri, Dec 30, 2005 at 03:10:09PM -0800, Bryan O'Sullivan wrote: > On Fri, 2005-12-30 at 00:25 -0800, Greg KH wrote: > > > + unsigned long long Control; > > > + unsigned long long PageAlign; > > > + unsigned long long PortCnt; > > > > And what's with the InterCapsNamingScheme of these variables? > > They're taken straight from the register names in our chip spec. I can > squish them to lowercase-only, if that seems important. No, but document it that this is the reason for it (along with a pointer to your chip spec, if possible.) thanks, greg k-h From greg at kroah.com Fri Dec 30 16:08:58 2005 From: greg at kroah.com (Greg KH) Date: Fri, 30 Dec 2005 16:08:58 -0800 Subject: [openib-general] Re: [PATCH 11 of 20] ipath - core driver, part 4 of 4 In-Reply-To: <1135984675.13318.58.camel@serpentine.pathscale.com> References: <20051230081218.GB7438@kroah.com> <1135984675.13318.58.camel@serpentine.pathscale.com> Message-ID: <20051231000858.GA20314@kroah.com> On Fri, Dec 30, 2005 at 03:17:55PM -0800, Bryan O'Sullivan wrote: > On Fri, 2005-12-30 at 00:12 -0800, Greg KH wrote: > > And does your driver work with udev? I didn't see where you were > > exporting the major:minor number of your devices to sysfs, but I might > > have missed it. > > It was written in a pre-udev world, so it still uses a fixed major and > minor number. How important is this to you? Is it "nice to have", or > "blocker"? :-) Well, depends on if you want your driver to work with any of the major distros that rely on udev (RHEL, SLES, etc...) If not, fine, you don't need it :) thanks, greg k-h From greg at kroah.com Fri Dec 30 16:10:51 2005 From: greg at kroah.com (Greg KH) Date: Fri, 30 Dec 2005 16:10:51 -0800 Subject: [openib-general] Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver In-Reply-To: <1135984304.13318.50.camel@serpentine.pathscale.com> References: <20051230080002.GA7438@kroah.com> <1135984304.13318.50.camel@serpentine.pathscale.com> Message-ID: <20051231001051.GB20314@kroah.com> On Fri, Dec 30, 2005 at 03:11:44PM -0800, Bryan O'Sullivan wrote: > On Fri, 2005-12-30 at 00:00 -0800, Greg KH wrote: > > > - The driver still uses EXPORT_SYMBOL, for consistency with other > > > code in drivers/infiniband > > > > Why would that matter? > > I don't want to do something gratuitously different to the prevailing > set of code in which it lives. > > > > - We're still using ioctls instead of sysfs or configfs in some > > > cases, to maintain userspace compatibility > > > > Compatibility with what? The driver isn't in the kernel tree yet, so > > there's no old kernel versions to remain compatibile with :) > > We already ship userspace code to customers that relies on the ioctl > interfaces. But we (the kernel community), don't really accept that as a valid reason to accept this kind of code, sorry. 
Why not just update your userspace code and ship that out to your
customers, as you know exactly who they are due to the lack of the
driver in the mainline kernel tree :)

thanks,

greg k-h

From greg at kroah.com Fri Dec 30 16:12:12 2005
From: greg at kroah.com (Greg KH)
Date: Fri, 30 Dec 2005 16:12:12 -0800
Subject: [openib-general] Re: [PATCH 8 of 20] ipath - core driver, part 1 of 4
In-Reply-To: <1135986427.13318.79.camel@serpentine.pathscale.com>
References: <20051230083928.GD7438@kroah.com>
	<1135986427.13318.79.camel@serpentine.pathscale.com>
Message-ID: <20051231001212.GC20314@kroah.com>

On Fri, Dec 30, 2005 at 03:47:07PM -0800, Bryan O'Sullivan wrote:
> On Fri, 2005-12-30 at 00:39 -0800, Greg KH wrote:
> 
> > > +void ipath_chip_done(void)
> > > +{
> > > +}
> > > +
> > > +void ipath_chip_cleanup(struct ipath_devdata * dd)
> > > +{
> > > +}
> > 
> > What are these two empty functions for?
> 
> They're just as dead as they look.

Then you might want to remove them :)

> > > +static ssize_t show_status_str(struct device *dev,
> 
> > how big can this "status string" be?
> 
> Just a few dozen bytes.
> 
> > If it's even getting close to
> > PAGE_SIZE, this doesn't need to be a sysfs attribute, but you should
> > break it up into its individual pieces.
> 
> Do you think that's still warranted, given this?

No I don't, unless you think that message will grow somehow...

thanks,

greg k-h

From bos at pathscale.com Fri Dec 30 17:40:50 2005
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Fri, 30 Dec 2005 17:40:50 -0800
Subject: [openib-general] Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver
In-Reply-To: <20051231001051.GB20314@kroah.com>
References: <20051230080002.GA7438@kroah.com>
	<1135984304.13318.50.camel@serpentine.pathscale.com>
	<20051231001051.GB20314@kroah.com>
Message-ID: <1135993250.13318.94.camel@serpentine.pathscale.com>

On Fri, 2005-12-30 at 16:10 -0800, Greg KH wrote:

> But we (the kernel community), don't really accept that as a valid
> reason to accept this kind of code, sorry.

Fair enough. I'd like some guidance in that case.

Some of our ioctls access the hardware more or less directly, while
others do things like read or reset counters. Which of these kinds of
operations are appropriate to retain as ioctls, in your eyes, and which
are best converted to sysfs or configfs alternatives?

As an example, take a look at ipath_sma_ioctl. It seems to me that
receiving or sending subnet management packets ought to remain as
ioctls, while getting port or node data could be turned into sysfs
attributes. Lane identification could live in configfs.
If you think otherwise, please let me know what's more appropriate. The less blind I am in doing these conversions, the fewer rounds we'll have to go in reviewing humongous driver submission patches :-) Thanks, References: <54AD0F12E08D1541B826BE97C98F99F1142243@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <469958e00512301942h29490c39xe82bbe5118371732@mail.gmail.com> On 12/30/05, James Lentini wrote: > > > caitlin> > One more question on this topic. > caitlin> > > caitlin> > Why is the ib_sge's addr a u64 and not a dma_addr_t? > caitlin> > caitlin> Because the hardware may need for it to be a 64 bit > caitlin> IO Address accessible on the system bus. That applies > caitlin> to the whole system bus, no matter how many PCI roots > caitlin> or virtual OSs there are. > caitlin> > caitlin> In particular there could be a guest OS that was > caitlin> running in 32-bit mode, and the RDMA hardware receiving > caitlin> fast path requests will not support different > caitlin> work request formats for each guest OS. > > Let me back up a step and explain the context for this question. > > As you know, our goal is to use the Linux IB verbs as a > hardware/protocol independent RDMA API. I'm reviewing my use of the > API to make sure that it does not make any particular assumptions. > > Roland stated that a scatter/gather list's address value should be a > bus address: > > http://openib.org/pipermail/openib-general/2005-August/009748.html > That depends on whether it is part of a registered memory space, or being used to specify a new registered memory space (i.e. it is for a memory register operation). When *using* an already established memory region, the address is interpreted in the context of that memory region. The size of address within an RDMA managed memory regions is always 64 bits. No matter which transport or what processor. That is extremely unlikely to change (in fact I think the R-Key/L-Key/ STag size would increase to 64-bits before the address size itself changed. But I'm expecting that a 96-bit logical address space should be adequate for quite some time). When creating a memory region the "physical address" is really a bus address, which on a strictly local basis could be 32 or 64 bits. If you were trying to generalize that, the "physical address" is a "RDMA Device accessible address", which on anything even vaguely PCI-ish is a bus address. But just as the distinction between "physical address" and "bus address" would not have been anticipated in the past, there may be some other distinction that we are not anticipating yet. So, in that context, the Memory Region defines the translation from logical addresses with the context of a Memory Region (most typically a subset of an existing virtual memory map) to addresses that the RDMA device can use to access the same memory. Whatever that distinction is, I'm sure it will be relevant before another decade goes by. 
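A short sketch of the two address flavors being described, in terms of
the kernel verbs as they stand: the per-buffer addresses handed to
ib_reg_phys_mr() must be device-reachable (bus) addresses, while
iova_start selects the 64-bit logical address the region answers to
afterwards, which is why registering x and addressing it as y, x != y,
is legal. pd and page are assumed to exist, and bus == physical is
assumed for brevity:

	struct ib_phys_buf pbuf = {
		.addr = page_to_phys(page),	/* device-reachable buffer address */
		.size = PAGE_SIZE,
	};
	u64 iova = 0x100000000ULL;		/* 64-bit logical start for the MR */
	struct ib_mr *mr;

	mr = ib_reg_phys_mr(pd, &pbuf, 1,
			    IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ,
			    &iova);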
From jengelh at linux01.gwdg.de Fri Dec 30 21:36:18 2005 From: jengelh at linux01.gwdg.de (Jan Engelhardt) Date: Sat, 31 Dec 2005 06:36:18 +0100 (MET) Subject: [openib-general] Re: [PATCH 0 of 20] [RFC] ipath - PathScale InfiniPath driver In-Reply-To: <1135884385.6804.0.camel@mindpipe> References: <200512291901.jBTJ1rOm017519@laptop11.inf.utfsm.cl> <1135884385.6804.0.camel@mindpipe> Message-ID: >> > - Someone asked for the kernel's i2c infrastructure to be used,but >> > our i2c usage is very specialised, and it would be more of a mess >> > to use the kernel's >> >> Problem with that is that if everybody and Aunt Tillie does the same, >> the kernel as a whole gets to be a mess. > >ALSA does the exact same thing for the exact same reason. Maybe an >indication that the kernel's i2c layer is too heavy? Sounds like a discussion a while back why jfs/xfs/reiser3/reiser4 all have their own journalling - compared to ext3-jbd. Jan Engelhardt -- From iod00d at hp.com Fri Dec 30 23:12:50 2005 From: iod00d at hp.com (Grant Grundler) Date: Fri, 30 Dec 2005 23:12:50 -0800 Subject: [openib-general] backwards compatibility In-Reply-To: <20051229235000.GB13951@mellanox.co.il> References: <20051229235000.GB13951@mellanox.co.il> Message-ID: <20051231071250.GE32607@esmail.cup.hp.com> On Fri, Dec 30, 2005 at 01:50:00AM +0200, Michael S. Tsirkin wrote: > Hi! > I'm reading a thread on lkml about backwards compatibility > http://lkml.org/lkml/2005/12/29/204 > and I wander whether we should work harder on supporting > older userspace library ABIs in kernel? > Currently, we only implement backwards compatibility in user-space, > so that you always have to upgrade userspace when upgrading the kernel, > but we could do this in kernel, too. > > We would just need the kernel to return a pair of ABI revision numbers: > minimal and maximal ABI supported. > > What do you guys think? I don't expect this fly with kernel.org. Stuff like this just clutters up the kernel source tree for the most part. Userspace suffers for it (a la glibc) but I still think it's easier to deal with there. grant From arjan at infradead.org Sat Dec 31 00:36:24 2005 From: arjan at infradead.org (Arjan van de Ven) Date: Sat, 31 Dec 2005 09:36:24 +0100 Subject: [openib-general] Re: [PATCH 10 of 20] ipath - core driver, part 3 of 4 In-Reply-To: <1135986615.13318.82.camel@serpentine.pathscale.com> References: <1135986615.13318.82.camel@serpentine.pathscale.com> Message-ID: <1136018184.2901.6.camel@laptopd505.fenrus.org> On Fri, 2005-12-30 at 15:50 -0800, Bryan O'Sullivan wrote: > On Fri, 2005-12-30 at 10:46 -0800, Linus Torvalds wrote: > > > All your user page lookup/pinning code is terminally broken. > > Yes, this has been pointed out by a few others. > > > Crap like this must not be merged. > > I'm already busy decrappifying it... the point I think also was the fact that it exists is already wrong :) makes it easier for you.. "rm" is a very powerful decrappify tool, as is "block delete" in just about any editor ;) From eitan at mellanox.co.il Sat Dec 31 01:42:45 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sat, 31 Dec 2005 11:42:45 +0200 Subject: [openib-general] RE: Some opensm/osm_vl15intf.c questions Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30102B3E4@mtlexch01.mtl.com> Hi Hal, As Yael was working on the ref-counting issues (a month or two ago) I will let her answer. It is very possible we are missing some. Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. 
Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Friday, December 30, 2005 6:03 PM > To: Eitan Zahavi > Cc: openib-general at openib.org > Subject: Some opensm/osm_vl15intf.c questions > > Hi Eitan, > > In chasing an issue with a trap repress not being sent in a certain > scenario, I stumbled across the following questions about > opensm/osm_vl15intf.c. > > 1. osm_vl15_post increments qp0_mads_outstanding when a response is > expected (rfifo) and not when unsolicited (ufifo) (what appears to be > called unicasts): > > osm_vl15_post: > if( p_madw->resp_expected == TRUE ) > { > cl_qlist_insert_tail( &p_vl->rfifo, (cl_list_item_t*)p_madw ); > cl_atomic_inc( &p_vl->p_stats->qp0_mads_outstanding ); > } > else > { > cl_qlist_insert_tail( &p_vl->ufifo, (cl_list_item_t*)p_madw ); > } > > osm_vl15_shutdown retires all outstanding MADs as follows: > > osm_vl15_shutdown: > while ( p_madw != (osm_madw_t*)cl_qlist_end( &p_vl->ufifo ) ) > { > if( osm_log_is_active( p_vl->p_log, OSM_LOG_DEBUG ) ) > { > osm_log( p_vl->p_log, OSM_LOG_DEBUG, > "osm_vl15_shutdown: " > "Releasing Response p_madw = %p\n", p_madw ); > } > > osm_mad_pool_put( p_mad_pool, p_madw ); > cl_atomic_dec( &p_vl->p_stats->qp0_mads_outstanding ); > > p_madw = (osm_madw_t*)cl_qlist_remove_head( &p_vl->ufifo ); > } > > Either post should increment qp0_mads_outstanding for unsolicited or > shutdown shouldn't decrement it when removing from ufifo. If you agree, > which should it be ? > > 2. In the case of a failure from osm_vendor_send, __osm_vl15_poller > decrements qp0_mads_outstanding regardless of whether a response is > expected. This is inconsistent with the increment. This leads me to > believe that this should also be incremented for unsolicited (unicasts) > as well as those for which responses are expected. Is this correct or am > I missing something ? > > So my conclusion is that in osm_vl15_post, it should be: > > if( p_madw->resp_expected == TRUE ) > { > cl_qlist_insert_tail( &p_vl->rfifo, (cl_list_item_t*)p_madw ); > } > else > { > cl_qlist_insert_tail( &p_vl->ufifo, (cl_list_item_t*)p_madw ); > } > cl_atomic_inc( &p_vl->p_stats->qp0_mads_outstanding ); > > If you agree, I will generate a patch for this. Thanks. > > -- Hal From rminnich at lanl.gov Sat Dec 31 07:37:01 2005 From: rminnich at lanl.gov (Ronald G Minnich) Date: Sat, 31 Dec 2005 08:37:01 -0700 Subject: [openib-general] PathScale license In-Reply-To: <1135958883.8578.47.camel@strider.opengridcomputing.com> References: <54AD0F12E08D1541B826BE97C98F99F11421DE@NT-SJCA-0751.brcm.ad.broadcom.com> <20051230025627.GA2706@cuprite.internal.keyresearch.com> <1135958883.8578.47.camel@strider.opengridcomputing.com> Message-ID: <43B6A59D.3000906@lanl.gov> is there any chance that pathscale could reword that to be less confusing? It clearly caused a lot of confusion and worry for folks on this list. 
Is there any chance that PathScale could reword that to be less
confusing? It clearly caused a lot of confusion and worry for folks on
this list.

ron

From bos at pathscale.com Sat Dec 31 08:27:31 2005
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Sat, 31 Dec 2005 08:27:31 -0800
Subject: [openib-general] PathScale license
In-Reply-To: <43B6A59D.3000906@lanl.gov>
References: <54AD0F12E08D1541B826BE97C98F99F11421DE@NT-SJCA-0751.brcm.ad.broadcom.com>
	<20051230025627.GA2706@cuprite.internal.keyresearch.com>
	<1135958883.8578.47.camel@strider.opengridcomputing.com>
	<43B6A59D.3000906@lanl.gov>
Message-ID: <1136046452.18623.2.camel@localhost.localdomain>

On Sat, 2005-12-31 at 08:37 -0700, Ronald G Minnich wrote:
> is there any chance that pathscale could reword that to be less
> confusing? It clearly caused a lot of confusion and worry for folks
> on this list.

We're looking into it. All the vampires^H^H^H^H^H^H^H^Hlawyers are on
their winter break at the moment, so it will take a bit to clear it up.

Hi,

The OpenIB diagnostics
(https://openib.org/svn/gen2/trunk/src/userspace/management/diags)
have been updated as follows:

1. discover.pl diagnostic tool added

discover.pl uses a topology file created by ibnetdiscover, a
discover.map file which the network administrator creates to indicate
the nodes expected to be present, and a discover.topo file which
describes the expected connectivity. It produces a new connectivity
file (discover.topo.new) and outputs the changes to stdout. The
network administrator can choose to replace the "old" topo file with
the new one or to merge selected changes in.

The syntax of the discover.map file is:

<node guid>|port|"Text for node"|# comment

e.g.

8f10400410015|8|"ISR 6000"|# SW-6IB4 Voltaire port 0 lid 5
8f10403960558|2|"HCA 1"|# MT23108 InfiniHost Mellanox Technologies

The syntax of the old and new topo files (discover.topo and
discover.topo.new) is:

<lid>|<node guid>|<port>|<remote node guid>

e.g.

10|5442ba00003080|1|8f10400410015

These topo files are produced by the discover.pl tool.

2. ibportstate diagnostic tool added to query, disable, and enable
switch ports

3. Added an errors-only mode to the diagnostic scripts so there is
less data to weed through on a large fabric (and a verbose mode to see
everything)

4. Tree structure collapsed so all tools are in the same directory
rather than individual ones, and the build was simplified

Let me know about any comments or issues. Thanks.

-- Hal

_______________________________________________
openib-general mailing list
openib-general at openib.org
http://openib.org/mailman/listinfo/openib-general
To unsubscribe, please visit
http://openib.org/mailman/listinfo/openib-general
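To illustrate the map file layout concretely, here is a minimal parser
for one line. The field breakdown is inferred from the examples above;
discover.pl itself is a Perl script, so this C fragment is purely
illustrative and is not code from the diags package:

/* Parse one discover.map line of the form
 *   <node guid>|<port>|"<description>"|# comment
 * Illustration of the format only; not code from the diags package. */
#include <stdio.h>
#include <inttypes.h>

int main(void)
{
	const char *line = "8f10400410015|8|\"ISR 6000\"|# SW-6IB4 Voltaire";
	uint64_t guid;
	unsigned int port;
	char desc[64];

	if (sscanf(line, "%" SCNx64 "|%u|\"%63[^\"]\"|",
		   &guid, &port, desc) == 3)
		printf("guid 0x%" PRIx64 " port %u: %s\n", guid, port, desc);
	return 0;
}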
From hozer at hozed.org Sat Dec 31 18:40:00 2005
From: hozer at hozed.org (Troy Benjegerdes)
Date: Sat, 31 Dec 2005 20:40:00 -0600
Subject: [openib-general] Userspace testing results (2.6.15-rc7-git2 with modules)
In-Reply-To: <20051230004313.GA8111@us.ibm.com>
References: <20051230004313.GA8111@us.ibm.com>
Message-ID: <20060101024000.GG14100@narn.hozed.org>

> Currently, I am running netpipe, iperf and netperf (these three tests
> are giving horrible results, but we are pretty sure that it is a
> local issue, as both eth1 and ib0 based tests lead to poor
> performance) and also netpipe with a patch from Shirley Ma to run
> over native IB [1]. Additionally, I am running the 4 pingpong tests
> (rc, srq, uc, ud) and the two perftest tests: rdma_lat and rdma_bw.
> There are some issues with some size combinations; or, at least,
> that is how it seems to me.

I assume this is the same patch I have here:

http://source.scl.ameslab.gov/hg/netpipe3-dev

Brad Benton also found that the way NetPIPE polls to see if a message
has arrived does not work well with the relaxed memory consistency
model the Power5 systems with ehca seem to use. His workaround was to
add a 'dcbf' instruction in NetPIPE, but that's obviously not
portable. Shouldn't we have a function in OpenIB to force immediate
cache consistency checks?
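The portable way to express what the 'dcbf' hack is doing is an
explicit read barrier between observing the "message arrived" flag and
reading the payload. Note the caveat: a barrier only orders accesses,
while 'dcbf' actually flushes the cache line, so this sketch assumes
cache-coherent DMA. The rmb() definitions and the buffer layout below
are invented for illustration, not an existing OpenIB API:

/* Illustrative only: polling a DMA'd completion flag with an explicit
 * read barrier. Assumes cache-coherent DMA; rmb() and the buffer
 * layout are invented for this sketch. */
#if defined(__powerpc__) || defined(__powerpc64__)
#define rmb()	__asm__ __volatile__ ("sync" : : : "memory")
#else
#define rmb()	__asm__ __volatile__ ("" : : : "memory")
#endif

struct msg {
	volatile unsigned int ready;	/* sender writes this last */
	char payload[1024];
};

static void wait_for_msg(struct msg *m)
{
	while (!m->ready)
		;	/* spin until the flag is observed */
	rmb();		/* order the flag read before any payload reads */
}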
From spoole at lanl.gov Thu Dec 29 07:52:43 2005
From: spoole at lanl.gov (Steve Poole)
Date: Thu, 29 Dec 2005 08:52:43 -0700
Subject: [openib-general] RE: Technical content of Sonoma Workshop Feb 5-8
In-Reply-To: 
References: 
Message-ID: <6.2.5.6.0.20051229085139.01ff3ec8@lanl.gov>

Matt and I are working on a list of what the PF OpenIB project has
paid for and what is in the queue. I have a document on what we will
need for 12X-QDR IB, and more of what we will need for iSCSI and the
like. It is coming together. The original doc is 2 yrs old. :-(

Steve...

At 05:57 AM 12/27/2005, Head Bubba wrote:
> Windows and Linux were required yesterday...
>
> Since we will have both Mellanox and PathScale for a roundtable
> session, we should also add any enhancements we need in future HCAs
> and the firmware to the discussion - from a little SDP Proof of
> Concept we did at CSFB, we ended up needing a firmware upgrade.
>
> As for SDP...
>
> At the roundtable we did not go into the gory details of the SDP
> Proof of Concept we did at CSFB with Mellanox, in which we could
> dynamically change the virtual lane being used, so I think at this
> one we should get Mellanox to go over the details with us, so that
> something gets done with SDP in OpenIB to give it the real
> implementation it needs (for those not having the details, contact
> Nimrod).
>
> A better implementation of SDP is needed. This is a good first step
> to get off of TCP/IP without code changes, but it has also been
> problematic in our experience. Additionally, it needs to be coded
> better to deliver near-native performance. We use SDP to eliminate
> TCP/IP issues, so IPoIB is not viable for us.
>
> As for RDS, we should all see who has it aside from Bubba, which
> everyone knows about, and whether or not we can get an end-user
> experience discussed.
>
> We also would like to virtualize everything... the server, the
> desktop, the fabric, the storage, etc... to create a Virtual
> Resource Market (VRM).
>
>
> -----Original Message-----
> From: Bill Boas [mailto:bboas at llnl.gov]
> Sent: Saturday, December 24, 2005 2:01 PM
> To: Woodruff, Robert J
> Cc: Matt Leininger; Steve Poole; Hal Rosenstock; Roland Dreier; Head
> Bubba; Peter Krey; openib-promoters at openib.org;
> openib-windows at openib.org; openib-general at openib.org
> Subject: RE: Technical content of Sonoma Workshop Feb 5-8
>
>
> Woody,
>
> It'll be a longshot for the Pats to get to the Superbowl this year,
> I think. But I hope!
>
> Your list is a great start, but isn't each item you mention in the
> context of Release 1.0?
>
> From the Labs and Wall Street perspectives, the preference is to
> "tie a ribbon around" Rel 1.0 (both Windows and Linux) as soon as we
> can, and go to the next stage of the evolution of the stack.
>
> So that means making the definition of the content of Rel 2.0 the
> main technical focus of the workshop.
>
> OpenIB PathForward Phase 2, iWARP integration, QOS, improved OpenSM,
> and more ......
>
> Perhaps Matt, Steve Poole, Hal, Roland, HB, Peter, and others will
> join in this discussion and express different opinions.....
>
> Bill.
>
> At 02:56 PM 12/23/2005, you wrote:
> > I'll give it some thought and try to start a discussion on the
> > list. Some ideas for a technical track that come to mind are:
> >
> > RDS - perhaps we could get someone from Oracle and Silverstorm to
> > present something on this. There has been some discussion on the
> > list, but not sure we have everyone aligned on what needs to be
> > done for this.
> >
> > Core S/W update: where we are and where we are going moving
> > forward.
> >
> > Generic RDMA support: what is there, what needs to be done.
> >
> > iSer update.
> >
> > SDP update: what needs to be done before it is ready to be pushed
> > upstream.
> >
> > OpenMPI update.
> >
> > OpenSM and diags update.
> >
> > Linux distributor update: RedHat, Suse, ...
> >
> > New H/W support: Pathscale, IBM ?
> >
> > Why the Patriots didn't win another superbowl; can we give someone
> > else a turn please...
> >
> > Were there any specific topics that the DOE folks would like to
> > hear on the technical side ?
> >
> > I'll be OOP on vacation next week, but will probably be checking
> > email and perhaps we can start a discussion on the list.
> >
> > woody
> >
> > -----Original Message-----
> > From: Bill Boas [mailto:bboas at llnl.gov]
> > Sent: Friday, December 23, 2005 12:02 PM
> > To: Woodruff, Robert J
> > Subject: RE: [openib-general] Please register for Sonoma Workshop
> >
> > No agenda yet, and definitely need help...... I was planning to
> > send out ideas... maybe you could start that process, please.
> >
> > At 01:42 PM 12/22/2005, you wrote:
> > > Hi Bill,
> > >
> > > Do you have a proposed agenda for this yet
> > > or need any help in putting one together.
> > >
> > > Trying to determine who from my team should attend.
> > >
> > > woody
> >
> > Bill Boas                  bboas at llnl.gov
> > ICCD LLNL, B-453, R-2018   Wk: 925-422-4110
> > 7000 East Ave, L-555       Cell: 925-337-2224
> > Livermore, CA 94551        Pgr: 877-203-2248
>
> Bill Boas                  bboas at llnl.gov
> ICCD LLNL, B-453, R-2018   Wk: 925-422-4110
> 7000 East Ave, L-555       Cell: 925-337-2224
> Livermore, CA 94551        Pgr: 877-203-2248