From halr at voltaire.com Wed Sep 1 07:58:45 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Wed, 01 Sep 2004 10:58:45 -0400
Subject: [Fwd: Re: [openib-general] ib_verbs.h ib_device_attr device type]
Message-ID: <1094050724.1832.4.camel@localhost.localdomain>

Here's a patch to ib_verbs.h define the flags for the ib_device
structure:

Index: ib_verbs.h
===================================================================
--- ib_verbs.h (revision 711)
+++ ib_verbs.h (working copy)
@@ -612,6 +612,11 @@
 	IB_MAD_IGNORE_MKEY = 1
 };
 
+enum ib_device_flags {
+	IB_DEVICE_IS_SWITCH = 1,
+	IB_DEVICE_IS_ROUTER = 2
+};
+
 #define IB_DEVICE_NAME_MAX 64
 
 struct ib_device {

-----Forwarded Message-----
From: Hal Rosenstock
To: Roland Dreier
Cc: openib-general at openib.org
Subject: Re: [openib-general] ib_verbs.h ib_device_attr device type
Date: Wed, 25 Aug 2004 12:59:30 -0400

On Wed, 2004-08-25 at 12:41, Roland Dreier wrote:
>     Hal> Is there a way to determine whether a device is a HCA,
>     Hal> switch, or router ? Does there need to be another field in
>     Hal> ib_device_attr for this ?
>
> I would use the flags member of struct ib_device... add something like
>
> 	enum {
> 		IB_DEV_FLAG_IS_SWITCH = 1,
> 		/* etc */
> 	};

Sounds like a good solution to me.

-- Hal

_______________________________________________
openib-general mailing list
openib-general at openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From halr at voltaire.com Wed Sep 1 08:51:07 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Wed, 01 Sep 2004 11:51:07 -0400
Subject: [openib-general] ib_mad.h ib_mad_agent.hi_tid
Message-ID: <1094053866.1832.14.camel@localhost.localdomain>

I have some questions related to TIDs in ib_mad.h:

 * ib_mad_agent - Used to track MAD registration with the access layer.
 * @hi_tid - Access layer assigned transaction ID for this client.
 * Unsolicited MADs sent by this client will have the upper 32-bits
 * of their TID set to this value.

Is it the access layer client's or access layer's responsibility to set
the high 32 bits of the TID to this value ?

In this context, what is the definition of unsolicited ? Are these MADs
to be sent which have the R bit (response) bit off ?

-- Hal

From mshefty at ichips.intel.com Wed Sep 1 07:52:42 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 1 Sep 2004 07:52:42 -0700
Subject: [Fwd: Re: [openib-general] ib_verbs.h ib_device_attr device type]
In-Reply-To: <1094050724.1832.4.camel@localhost.localdomain>
References: <1094050724.1832.4.camel@localhost.localdomain>
Message-ID: <20040901075242.59395b28.mshefty@ichips.intel.com>

On Wed, 01 Sep 2004 10:58:45 -0400
Hal Rosenstock wrote:

> Here's a patch to ib_verbs.h define the flags for the ib_device
> structure:
>
> Index: ib_verbs.h
> ===================================================================
> --- ib_verbs.h (revision 711)
> +++ ib_verbs.h (working copy)
> @@ -612,6 +612,11 @@
> 	IB_MAD_IGNORE_MKEY = 1
> };
>
> +enum ib_device_flags {
> +	IB_DEVICE_IS_SWITCH = 1,
> +	IB_DEVICE_IS_ROUTER = 2
> +};
> +
> #define IB_DEVICE_NAME_MAX 64
>
> struct ib_device {

Thanks for the patch. One question though, where are these flags set in the
ib_device structure? I was looking at this request yesterday and thought
about extending the ib_device_cap_flags, but these seemed a little
different. Maybe we could call the enum ib_device_type and add a
new field to ib_device?
- Sean

From halr at voltaire.com Wed Sep 1 09:15:53 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Wed, 01 Sep 2004 12:15:53 -0400
Subject: [Fwd: Re: [openib-general] ib_verbs.h ib_device_attr device type]
In-Reply-To: <20040901075242.59395b28.mshefty@ichips.intel.com>
References: <1094050724.1832.4.camel@localhost.localdomain> <20040901075242.59395b28.mshefty@ichips.intel.com>
Message-ID: <1094055352.1832.30.camel@localhost.localdomain>

On Wed, 2004-09-01 at 10:52, Sean Hefty wrote:
> On Wed, 01 Sep 2004 10:58:45 -0400
> One question though, where are these flags set in the ib_device
> structure?

I would think that this would be done when the device driver is notified
by PCI that the device is present (and would be done inside
mthca_provider.c:mthca_register_device() where it is setting up all the
other ib_device fields). Since the entire mthca_dev structure is cleared
via memset (in mthca_main.c:mthca_init_one), the flags are 0 as is
correct for an HCA.

> I was looking at this request yesterday and thought
> about extending the ib_device_cap_flags, but these seemed a little
> different. Maybe we could call the enum ib_device_type and add a
> new field to ib_device?

device_cap is different from this (and device_type is better). It could
be done this way as well (as using the device flags).

-- Hal

From halr at voltaire.com Wed Sep 1 09:30:25 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Wed, 01 Sep 2004 12:30:25 -0400
Subject: [openib-general] Re: [PATCH] ib_mad.h: Add in IB management class definitions
In-Reply-To: <20040831145923.661c12f6.mshefty@ichips.intel.com>
References: <1093988376.1830.131.camel@localhost.localdomain> <20040831145923.661c12f6.mshefty@ichips.intel.com>
Message-ID: <1094056224.1832.36.camel@localhost.localdomain>

On Tue, 2004-08-31 at 17:59, Sean Hefty wrote:
> On Tue, 31 Aug 2004 17:39:37 -0400
> some nit-picky comments below...
>
> > +/* Management classes */
> > +#define IB_MGMT_CLASS_PERF 0x04
> > +#define IB_MGMT_CLASS_BM 0x05
> > +#define IB_MGMT_CLASS_DEV_MGT 0x06
> > +#define IB_MGMT_CLASS_COM_MGT 0x07
>
> I've gone back and forth on the names here. The names closest to the spec
> would be what you have: PERF, BM, DEV_MGT, and COM_MGT.

Yes, the names were chosen based on the descriptions in the table on
management class methods in chapter 13 of the spec.

> For API consistency, we use MGMT,
> instead of MGT, and DEVICE instead of DEV.

OK.

> And I'm guessing that the resulting CM API will
> use "cm" or "conn" in its name.

I also think it would likely use CM.

> Anyway, I'm inclined to go with:
> PERF_MGMT (or PM), BM, DEVICE_MGMT (or DM), and CM. Does anyone care or have an opinion?

Fine with me.

-- Hal

From mshefty at ichips.intel.com Wed Sep 1 08:35:26 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 1 Sep 2004 08:35:26 -0700
Subject: [Fwd: Re: [openib-general] ib_verbs.h ib_device_attr device type]
In-Reply-To: <1094055352.1832.30.camel@localhost.localdomain>
References: <1094050724.1832.4.camel@localhost.localdomain> <20040901075242.59395b28.mshefty@ichips.intel.com> <1094055352.1832.30.camel@localhost.localdomain>
Message-ID: <20040901083526.722f644d.mshefty@ichips.intel.com>

On Wed, 01 Sep 2004 12:15:53 -0400
Hal Rosenstock wrote:

> On Wed, 2004-09-01 at 10:52, Sean Hefty wrote:
> > On Wed, 01 Sep 2004 10:58:45 -0400
> > One question though, where are these flags set in the ib_device
> > structure?
>
> I would think that this would be done when the device driver is notified
> by PCI that the device is present (and would be done inside
> mthca_provider.c:mthca_register_device() where it is setting up all the
> other ib_device fields). Since the entire mthca_dev structure is cleared
> via memset (in mthca_main.c:mthca_init_one), the flags are 0 as is
> correct for an HCA.

I was asking which field in what structure these would use.
Were you intending to use the "flags" field in ib_device? Or were you hoping
to set a field in the ib_device_attr structure? (I was thinking the latter,
but see that you meant the former now.)

What values are currently being set in the ib_device::flags field? Is there
a reason that field is u32 and not an int?

How about using a type field that matches up with
NodeInfo::NodeType values?

- Sean

From mshefty at ichips.intel.com Wed Sep 1 08:39:21 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 1 Sep 2004 08:39:21 -0700
Subject: [openib-general] Re: ib_mad.h ib_mad_agent.hi_tid
In-Reply-To: <1094053866.1832.14.camel@localhost.localdomain>
References: <1094053866.1832.14.camel@localhost.localdomain>
Message-ID: <20040901083921.088092f2.mshefty@ichips.intel.com>

On Wed, 01 Sep 2004 11:51:07 -0400
Hal Rosenstock wrote:

> I have some questions related to TIDs in ib_mad.h:
> * ib_mad_agent - Used to track MAD registration with the access layer.
> * @hi_tid - Access layer assigned transaction ID for this client.
> * Unsolicited MADs sent by this client will have the upper 32-bits
> * of their TID set to this value.
>
> Is it the access layer client's or access layer's responsibility to set
> the high 32 bits of the TID to this value ?

I was thinking that the access layer would, but could go either way on this.

> In this context, what is the definition of unsolicited ? Are these MADs
> to be sent which have the R bit (response) bit off ?

In general, unsolicited are MADs with the R bit set to 0.
MADs with the R bit set, and all RMPP MADs would be considered solicited.

The CM is a little different, in that most of its MADs are solicited,
but do not have the R bit set. This is a case where clean layering breaks
down a little.
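
A minimal sketch of the send-side rule described in the message above, for the general (non-CM, non-RMPP) case only. It assumes the R bit is the top bit of the 8-bit MAD method field and keeps the TID in host byte order for readability; the names MAD_METHOD_RESP, mad_is_unsolicited, mad_set_hi_tid and mad_send_tid are hypothetical and are not taken from ib_mad.h.

```c
#include <stdint.h>

/* Assumed encoding: the R (response) bit is the top bit of the method field. */
#define MAD_METHOD_RESP 0x80

/* Hypothetical helper: nonzero when the R bit is clear. */
static int mad_is_unsolicited(uint8_t method)
{
	return (method & MAD_METHOD_RESP) == 0;
}

/*
 * Hypothetical helper: put the agent's hi_tid in the upper 32 bits of the
 * TID and keep the client's lower 32 bits. Host byte order is used here;
 * a real MAD header carries the TID in network byte order.
 */
static uint64_t mad_set_hi_tid(uint64_t tid, uint32_t hi_tid)
{
	return ((uint64_t)hi_tid << 32) | (tid & 0xffffffffULL);
}

/* TID to place in an outgoing MAD: rewritten only when the R bit is clear. */
static uint64_t mad_send_tid(uint8_t method, uint64_t tid, uint32_t hi_tid)
{
	return mad_is_unsolicited(method) ? mad_set_hi_tid(tid, hi_tid) : tid;
}
```

A response has to echo the TID of the request it answers, which is why the sketch leaves the TID untouched when the R bit is set. The CM case mentioned above is exactly where this simple test is not sufficient.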
From halr at voltaire.com Wed Sep 1 09:48:01 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Wed, 01 Sep 2004 12:48:01 -0400
Subject: [Fwd: Re: [openib-general] ib_verbs.h ib_device_attr device type]
In-Reply-To: <20040901083526.722f644d.mshefty@ichips.intel.com>
References: <1094050724.1832.4.camel@localhost.localdomain> <20040901075242.59395b28.mshefty@ichips.intel.com> <1094055352.1832.30.camel@localhost.localdomain> <20040901083526.722f644d.mshefty@ichips.intel.com>
Message-ID: <1094057280.1832.41.camel@localhost.localdomain>

On Wed, 2004-09-01 at 11:35, Sean Hefty wrote:
> How about using a type field that matches up with
> NodeInfo::NodeType values?

Sounds good to me. That is the information we are representing.

Is ib_device.flags used for anything else ? If not, should it be
eliminated ?

-- Hal

From mshefty at ichips.intel.com Wed Sep 1 08:45:49 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 1 Sep 2004 08:45:49 -0700
Subject: [openib-general] [PATCH] optional functions, ib_mad update
Message-ID: <20040901084549.279d22bb.mshefty@ichips.intel.com>

Here's a patch that checks for optional functions provided by the device.
It also includes a patch from Hal for IB management classes.
- Sean

--
Index: ib_verbs.h
===================================================================
--- ib_verbs.h (revision 705)
+++ ib_verbs.h (working copy)
@@ -29,6 +29,7 @@
 #include
 #include
 #include
+#include
 
 struct ib_mad;
 
@@ -763,22 +764,14 @@
 	return device->query_pkey(device, port_num, index, pkey);
 }
 
-static inline int ib_modify_device(struct ib_device *device,
-				   int device_modify_mask,
-				   struct ib_device_modify *device_modify)
-{
-	return device->modify_device(device, device_modify_mask,
-				     device_modify);
-}
-
-static inline int ib_modify_port(struct ib_device *device,
-				 u8 port_num,
-				 int port_modify_mask,
-				 struct ib_port_modify *port_modify)
-{
-	return device->modify_port(device, port_num, port_modify_mask,
-				   port_modify);
-}
+int ib_modify_device(struct ib_device *device,
+		     int device_modify_mask,
+		     struct ib_device_modify *device_modify);
+
+int ib_modify_port(struct ib_device *device,
+		   u8 port_num,
+		   int port_modify_mask,
+		   struct ib_port_modify *port_modify);
 
 struct ib_pd *ib_alloc_pd(struct ib_device *device);
 
@@ -787,17 +780,11 @@
 struct ib_ah *ib_create_ah(struct ib_pd *pd,
 			   struct ib_ah_attr *ah_attr);
 
-static inline int ib_modify_ah(struct ib_ah *ah,
-			       struct ib_ah_attr *ah_attr)
-{
-	return ah->device->modify_ah(ah, ah_attr);
-}
+int ib_modify_ah(struct ib_ah *ah,
+		 struct ib_ah_attr *ah_attr);
 
-static inline int ib_query_ah(struct ib_ah *ah,
-			      struct ib_ah_attr *ah_attr)
-{
-	return ah->device->query_ah(ah, ah_attr);
-}
+int ib_query_ah(struct ib_ah *ah,
+		struct ib_ah_attr *ah_attr);
 
 int ib_destroy_ah(struct ib_ah *ah);
 
@@ -813,13 +800,10 @@
 	return qp->device->modify_qp(qp, qp_attr, qp_attr_mask, qp_cap);
 }
 
-static inline int ib_query_qp(struct ib_qp *qp,
-			      struct ib_qp_attr *qp_attr,
-			      int qp_attr_mask,
-			      struct ib_qp_init_attr *qp_init_attr)
-{
-	return qp->device->query_qp(qp, qp_attr, qp_attr_mask, qp_init_attr);
-}
+int ib_query_qp(struct ib_qp *qp,
+		struct ib_qp_attr *qp_attr,
+		int qp_attr_mask,
+		struct ib_qp_init_attr
*qp_init_attr);
 
 int ib_destroy_qp(struct ib_qp *qp);
 
@@ -827,19 +811,13 @@
 			     void *srq_context,
 			     struct ib_srq_attr *srq_attr);
 
-static inline int ib_query_srq(struct ib_srq *srq,
-			       struct ib_srq_attr *srq_attr)
-{
-	return srq->device->query_srq(srq, srq_attr);
-}
+int ib_modify_srq(struct ib_srq *srq,
+		  struct ib_pd *pd,
+		  struct ib_srq_attr *srq_attr,
+		  int srq_attr_mask);
 
-static inline int ib_modify_srq(struct ib_srq *srq,
-				struct ib_pd *pd,
-				struct ib_srq_attr *srq_attr,
-				int srq_attr_mask)
-{
-	return srq->device->modify_srq(srq, pd, srq_attr, srq_attr_mask);
-}
+int ib_query_srq(struct ib_srq *srq,
+		 struct ib_srq_attr *srq_attr);
 
 static inline int ib_post_srq(struct ib_srq *srq,
 			      struct ib_recv_wr *recv_wr,
@@ -855,11 +833,8 @@
 			   void *cq_context,
 			   int cqe);
 
-static inline int ib_resize_cq(struct ib_cq *cq,
-			       int cqe)
-{
-	return cq->device->resize_cq(cq, cqe);
-}
+int ib_resize_cq(struct ib_cq *cq,
+		 int cqe);
 
 int ib_destroy_cq(struct ib_cq *cq);
 
@@ -869,11 +844,8 @@
 			     int mr_access_flags,
 			     u64 *iova_start);
 
-static inline int ib_query_mr(struct ib_mr *mr,
-			      struct ib_mr_attr *mr_attr)
-{
-	return mr->device->query_mr(mr, mr_attr);
-}
+int ib_query_mr(struct ib_mr *mr,
+		struct ib_mr_attr *mr_attr);
 
 int ib_dereg_mr(struct ib_mr *mr);
 
@@ -916,19 +888,13 @@
 
 int ib_dealloc_fmr(struct ib_fmr *fmr);
 
-static inline int ib_attach_mcast(struct ib_qp *qp,
-				  union ib_gid *gid,
-				  u16 lid)
-{
-	return qp->device->attach_mcast(qp, gid, lid);
-}
-
-static inline int ib_detach_mcast(struct ib_qp *qp,
-				  union ib_gid *gid,
-				  u16 lid)
-{
-	return qp->device->detach_mcast(qp, gid, lid);
-}
+int ib_attach_mcast(struct ib_qp *qp,
+		    union ib_gid *gid,
+		    u16 lid);
+
+int ib_detach_mcast(struct ib_qp *qp,
+		    union ib_gid *gid,
+		    u16 lid);
 
 static inline int ib_post_send(struct ib_qp *qp,
 			       struct ib_send_wr *send_wr,
@@ -966,7 +932,9 @@
 static inline int ib_peek_cq(struct ib_cq *cq,
 			     int wc_cnt)
 {
-	return cq->device->peek_cq(cq, wc_cnt);
+	return cq->device->peek_cq ?
+		cq->device->peek_cq(cq, wc_cnt) :
+		-ENOSYS;
 }
 
 /**
@@ -984,7 +952,9 @@
 static inline int ib_req_n_notify_cq(struct ib_cq *cq,
 				     int wc_cnt)
 {
-	return cq->device->req_n_notify_cq(cq, wc_cnt);
+	return cq->device->req_n_notify_cq ?
+		cq->device->req_n_notify_cq(cq, wc_cnt) :
+		-ENOSYS;
 }
 
 #endif /* IB_VERBS_H */
Index: ib_mad.h
===================================================================
--- ib_mad.h (revision 706)
+++ ib_mad.h (working copy)
@@ -26,7 +26,17 @@
 #if !defined( IB_MAD_H )
 #define IB_MAD_H
 
-#include "ib_verbs.h"
+#include
+
+/* Management classes */
+#define IB_MGMT_CLASS_SUBN_LID_ROUTED 0x01
+#define IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE 0x81
+#define IB_MGMT_CLASS_SUBN_ADM 0x03
+#define IB_MGMT_CLASS_PERF_MGMT 0x04
+#define IB_MGMT_CLASS_BM 0x05
+#define IB_MGMT_CLASS_DEVICE_MGMT 0x06
+#define IB_MGMT_CLASS_CM 0x07
+#define IB_MGMT_CLASS_SNMP 0x08
 
 #define IB_QP0 0
 #define IB_QP1 cpu_to_be32(1)
Index: TODO
===================================================================
--- TODO (revision 705)
+++ TODO (working copy)
@@ -9,6 +9,8 @@
  - Should ib_unmap_fmr take fmr_array as input, or just fmr? What should the
    restriction on the fmr_array be? All from same device?
+ - Change ib_map_fmr to ib_map_phys_fmr.
+ - Add way to determine if a device is an HCA, router, or switch.
 
 MAD TODOs:
  - Need to define queuing model for ib_mad_post_send.
Index: ib_verbs.c
===================================================================
--- ib_verbs.c (revision 708)
+++ ib_verbs.c (working copy)
@@ -26,6 +26,35 @@
 #include
 #include
 
+/* Device */
+
+int ib_modify_device(struct ib_device *device,
+		     int device_modify_mask,
+		     struct ib_device_modify *device_modify)
+{
+	if (!device->modify_device)
+		return -ENOSYS;
+
+	return device->modify_device(device, device_modify_mask,
+				     device_modify);
+}
+EXPORT_SYMBOL(ib_modify_device);
+
+int ib_modify_port(struct ib_device *device,
+		   u8 port_num,
+		   int port_modify_mask,
+		   struct ib_port_modify *port_modify)
+{
+	if (!device->modify_port)
+		return -ENOSYS;
+
+	return device->modify_port(device, port_num, port_modify_mask,
+				   port_modify);
+}
+EXPORT_SYMBOL(ib_modify_port);
+
+/* Protection domain */
+
 struct ib_pd *ib_alloc_pd(struct ib_device *device)
 {
 	struct ib_pd *pd;
@@ -50,6 +79,8 @@
 }
 EXPORT_SYMBOL(ib_dealloc_pd);
 
+/* Address handle */
+
 struct ib_ah *ib_create_ah(struct ib_pd *pd,
 			   struct ib_ah_attr *ah_attr)
 {
@@ -67,6 +98,26 @@
 }
 EXPORT_SYMBOL(ib_create_ah);
 
+int ib_modify_ah(struct ib_ah *ah,
+		 struct ib_ah_attr *ah_attr)
+{
+	if (!ah->device->modify_ah)
+		return -ENOSYS;
+
+	return ah->device->modify_ah(ah, ah_attr);
+}
+EXPORT_SYMBOL(ib_modify_ah);
+
+int ib_query_ah(struct ib_ah *ah,
+		struct ib_ah_attr *ah_attr)
+{
+	if (!ah->device->query_ah)
+		return -ENOSYS;
+
+	return ah->device->query_ah(ah, ah_attr);
+}
+EXPORT_SYMBOL(ib_query_ah);
+
 int ib_destroy_ah(struct ib_ah *ah)
 {
 	struct ib_pd *pd;
@@ -82,6 +133,8 @@
 }
 EXPORT_SYMBOL(ib_destroy_ah);
 
+/* Queue pair */
+
 struct ib_qp *ib_create_qp(struct ib_pd *pd,
 			   struct ib_qp_init_attr *qp_init_attr,
 			   struct ib_qp_cap *qp_cap)
@@ -109,6 +162,18 @@
 }
 EXPORT_SYMBOL(ib_create_qp);
 
+int ib_query_qp(struct ib_qp *qp,
+		struct ib_qp_attr *qp_attr,
+		int qp_attr_mask,
+		struct ib_qp_init_attr *qp_init_attr)
+{
+	if (!qp->device->query_qp)
+		return -ENOSYS;
+
+	return qp->device->query_qp(qp, qp_attr, qp_attr_mask,
qp_init_attr);
+}
+EXPORT_SYMBOL(ib_query_qp);
+
 int ib_destroy_qp(struct ib_qp *qp)
 {
 	struct ib_pd *pd;
@@ -134,6 +199,8 @@
 }
 EXPORT_SYMBOL(ib_destroy_qp);
 
+/* Shared receive queue */
+
 struct ib_srq *ib_create_srq(struct ib_pd *pd,
 			     void *srq_context,
 			     struct ib_srq_attr *srq_attr)
@@ -157,6 +224,28 @@
 }
 EXPORT_SYMBOL(ib_create_srq);
 
+int ib_modify_srq(struct ib_srq *srq,
+		  struct ib_pd *pd,
+		  struct ib_srq_attr *srq_attr,
+		  int srq_attr_mask)
+{
+	if (!srq->device->modify_srq)
+		return -ENOSYS;
+
+	return srq->device->modify_srq(srq, pd, srq_attr, srq_attr_mask);
+}
+EXPORT_SYMBOL(ib_modify_srq);
+
+int ib_query_srq(struct ib_srq *srq,
+		 struct ib_srq_attr *srq_attr)
+{
+	if (!srq->device->query_srq)
+		return -ENOSYS;
+
+	return srq->device->query_srq(srq, srq_attr);
+}
+EXPORT_SYMBOL(ib_modify_srq);
+
 int ib_destroy_srq(struct ib_srq *srq)
 {
 	struct ib_pd *pd;
@@ -175,6 +264,8 @@
 }
 EXPORT_SYMBOL(ib_destroy_srq);
 
+/* Completion queue */
+
 struct ib_cq *ib_create_cq(struct ib_device *device,
 			   ib_comp_handler comp_handler,
 			   void *cq_context,
@@ -195,6 +286,16 @@
 }
 EXPORT_SYMBOL(ib_create_cq);
 
+int ib_resize_cq(struct ib_cq *cq,
+		 int cqe)
+{
+	if (!cq->device->resize_cq)
+		return -ENOSYS;
+
+	return cq->device->resize_cq(cq, cqe);
+}
+EXPORT_SYMBOL(ib_resize_cq);
+
 int ib_destroy_cq(struct ib_cq *cq)
 {
 	if (atomic_read(&cq->usecnt))
@@ -204,6 +305,8 @@
 }
 EXPORT_SYMBOL(ib_destroy_cq);
 
+/* Memory region */
+
 struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd,
 			     struct ib_phys_buf *phys_buf_array,
 			     int num_phys_buf,
@@ -226,6 +329,16 @@
 }
 EXPORT_SYMBOL(ib_reg_phys_mr);
 
+int ib_query_mr(struct ib_mr *mr,
+		struct ib_mr_attr *mr_attr)
+{
+	if (!mr->device->query_mr)
+		return -ENOSYS;
+
+	return mr->device->query_mr(mr, mr_attr);
+}
+EXPORT_SYMBOL(ib_query_mr);
+
 int ib_dereg_mr(struct ib_mr *mr)
 {
 	struct ib_pd *pd;
@@ -255,6 +368,9 @@
 	struct ib_pd *old_pd;
 	int ret;
 
+	if (!mr->device->rereg_phys_mr)
+		return -ENOSYS;
+
 	if (atomic_read(&mr->usecnt))
 		return -EBUSY;
 
@@ -273,10 +389,14
@@
 }
 EXPORT_SYMBOL(ib_rereg_phys_mr);
 
+/* Memory window */
 struct ib_mw *ib_alloc_mw(struct ib_pd *pd)
 {
 	struct ib_mw *mw;
 
+	if (!pd->device->alloc_mw)
+		return ERR_PTR(-ENOSYS);
+
 	mw = pd->device->alloc_mw(pd);
 
 	if (!IS_ERR(mw)) {
@@ -304,12 +424,17 @@
 }
 EXPORT_SYMBOL(ib_dealloc_mw);
 
+/* Fast-memory registration */
+
 struct ib_fmr *ib_alloc_fmr(struct ib_pd *pd,
 			    int mr_access_flags,
 			    struct ib_fmr_attr *fmr_attr)
 {
 	struct ib_fmr *fmr;
 
+	if (!pd->device->alloc_fmr)
+		return ERR_PTR(-ENOSYS);
+
 	fmr = pd->device->alloc_fmr(pd, mr_access_flags, fmr_attr);
 
 	if (!IS_ERR(fmr)) {
@@ -336,3 +461,22 @@
 	return ret;
 }
 EXPORT_SYMBOL(ib_dealloc_fmr);
+
+/* Multicast */
+
+int ib_attach_mcast(struct ib_qp *qp,
+		    union ib_gid *gid,
+		    u16 lid)
+{
+	if (!qp->device->attach_mcast)
+		return -ENOSYS;
+
+	return qp->device->attach_mcast(qp, gid, lid);
+}
+
+int ib_detach_mcast(struct ib_qp *qp,
+		    union ib_gid *gid,
+		    u16 lid)
+{
+	return qp->device->detach_mcast(qp, gid, lid);
+}

From mshefty at ichips.intel.com Wed Sep 1 09:05:03 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 1 Sep 2004 09:05:03 -0700
Subject: [Fwd: Re: [openib-general] ib_verbs.h ib_device_attr device type]
In-Reply-To: <1094057280.1832.41.camel@localhost.localdomain>
References: <1094050724.1832.4.camel@localhost.localdomain> <20040901075242.59395b28.mshefty@ichips.intel.com> <1094055352.1832.30.camel@localhost.localdomain> <20040901083526.722f644d.mshefty@ichips.intel.com> <1094057280.1832.41.camel@localhost.localdomain>
Message-ID: <20040901090503.226c9c08.mshefty@ichips.intel.com>

On Wed, 01 Sep 2004 12:48:01 -0400
Hal Rosenstock wrote:

> On Wed, 2004-09-01 at 11:35, Sean Hefty wrote:
> > How about using a type field that matches up with
> > NodeInfo::NodeType values?
>
> Sounds good to me. That is the information we are representing.
>
> Is ib_device.flags used for anything else ? If not, should it be
> eliminated ?
I checked Roland's stack, and he's using this field to determine if
a device is a switch, among a couple of other things.

I was thinking of adding this field to ib_device_attr, but this
means that you'd need access to the attributes if you needed to check
for these fields, which his stack does.

From halr at voltaire.com Wed Sep 1 10:10:25 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Wed, 01 Sep 2004 13:10:25 -0400
Subject: [openib-general] Re: ib_mad.h ib_mad_agent.hi_tid
In-Reply-To: <20040901083921.088092f2.mshefty@ichips.intel.com>
References: <1094053866.1832.14.camel@localhost.localdomain> <20040901083921.088092f2.mshefty@ichips.intel.com>
Message-ID: <1094058624.1969.50.camel@localhost.localdomain>

On Wed, 2004-09-01 at 11:39, Sean Hefty wrote:
> On Wed, 01 Sep 2004 11:51:07 -0400
> Hal Rosenstock wrote:
> > Is it the access layer client's or access layer's responsibility to set
> > the high 32 bits of the TID to this value ?
>
> I was thinking that the access layer would, but could go either way on this.
>
> > In this context, what is the definition of unsolicited ? Are these MADs
> > to be sent which have the R bit (response) bit off ?
>
> In general, unsolicited are MADs with the R bit set to 0.
> MADs with the R bit set, and all RMPP MADs would be considered solicited.
>
> The CM is a little different, in that most of its MADs are solicited,
> but do not have the R bit set. This is a case where clean layering breaks
> down a little.

If the access layer were to set the high 32 bits of the TID for
unsolicited MADs, would it do it for all but CM MADs or should it also
handle the CM case as well ?

Which CM MADs are unsolicited ?
-- Hal

From mshefty at ichips.intel.com Wed Sep 1 09:16:11 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 1 Sep 2004 09:16:11 -0700
Subject: [openib-general] Re: ib_mad.h ib_mad_agent.hi_tid
In-Reply-To: <1094058624.1969.50.camel@localhost.localdomain>
References: <1094053866.1832.14.camel@localhost.localdomain> <20040901083921.088092f2.mshefty@ichips.intel.com> <1094058624.1969.50.camel@localhost.localdomain>
Message-ID: <20040901091611.6876f0ce.mshefty@ichips.intel.com>

On Wed, 01 Sep 2004 13:10:25 -0400
Hal Rosenstock wrote:

> If the access layer were to set the high 32 bits of the TID for
> unsolicited MADs, would it do it for all but CM MADs or should it also
> handle the CM case as well ?

I think... have it do it for the CM MADs as well.

> Which CM MADs are unsolicited ?

See C12-5.1.2. REQ, LAP, DREQ, and SIDR_REQ are "unsolicited", according to
how I've been describing it. REP, MRA, REJ, RTU, APR, DREP, and SIDR_REP are
response MADs that need their TID to match the MAD that they are sent in
response to.

- Sean

From halr at voltaire.com Wed Sep 1 10:20:50 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Wed, 01 Sep 2004 13:20:50 -0400
Subject: [Fwd: Re: [openib-general] ib_verbs.h ib_device_attr device type]
In-Reply-To: <20040901090503.226c9c08.mshefty@ichips.intel.com>
References: <1094050724.1832.4.camel@localhost.localdomain> <20040901075242.59395b28.mshefty@ichips.intel.com> <1094055352.1832.30.camel@localhost.localdomain> <20040901083526.722f644d.mshefty@ichips.intel.com> <1094057280.1832.41.camel@localhost.localdomain> <20040901090503.226c9c08.mshefty@ichips.intel.com>
Message-ID: <1094059250.1832.60.camel@localhost.localdomain>

On Wed, 2004-09-01 at 12:05, Sean Hefty wrote:
> I was thinking of adding this field to ib_device_attr, but this
> means that you'd need access to the attributes if you needed to check
> for these fields, which his stack does.
I think this has now come full circle as that was my original proposal
for this (to add another field to ib_device_attr). Roland then responded
to add it into the ib_device flags. I don't really have a strong
preference for which way.

-- Hal

From roland at topspin.com Wed Sep 1 10:25:16 2004
From: roland at topspin.com (Roland Dreier)
Date: Wed, 01 Sep 2004 10:25:16 -0700
Subject: [Fwd: Re: [openib-general] ib_verbs.h ib_device_attr device type]
In-Reply-To: <1094057280.1832.41.camel@localhost.localdomain> (Hal Rosenstock's message of "Wed, 01 Sep 2004 12:48:01 -0400")
References: <1094050724.1832.4.camel@localhost.localdomain> <20040901075242.59395b28.mshefty@ichips.intel.com> <1094055352.1832.30.camel@localhost.localdomain> <20040901083526.722f644d.mshefty@ichips.intel.com> <1094057280.1832.41.camel@localhost.localdomain>
Message-ID: <52r7pl296r.fsf@topspin.com>

    Hal> Is ib_device.flags used for anything else ? If not, should it
    Hal> be eliminated ?

In the Topspin stack, it is used to mark various quirks in SMA/SMI
handling. For example Tavor requires 0-hop DR SMPs be passed to the
process_mad method (actually sending them won't work); however for the
Anafa2 switch, there is a significant performance penalty for using
process_mad to handle 0-hop DR SMPs and we want to send them via QP0.
- Roland

From roland at topspin.com Wed Sep 1 10:27:40 2004
From: roland at topspin.com (Roland Dreier)
Date: Wed, 01 Sep 2004 10:27:40 -0700
Subject: [Fwd: Re: [openib-general] ib_verbs.h ib_device_attr device type]
In-Reply-To: <1094059250.1832.60.camel@localhost.localdomain> (Hal Rosenstock's message of "Wed, 01 Sep 2004 13:20:50 -0400")
References: <1094050724.1832.4.camel@localhost.localdomain> <20040901075242.59395b28.mshefty@ichips.intel.com> <1094055352.1832.30.camel@localhost.localdomain> <20040901083526.722f644d.mshefty@ichips.intel.com> <1094057280.1832.41.camel@localhost.localdomain> <20040901090503.226c9c08.mshefty@ichips.intel.com> <1094059250.1832.60.camel@localhost.localdomain>
Message-ID: <52n009292r.fsf@topspin.com>

    Hal> I think this has now come full circle as that was my original
    Hal> proposal for this (to add another field to
    Hal> ib_device_attr). Roland then responded to add it into the
    Hal> ib_device flags. I don't really have a strong preference for
    Hal> which way.

I don't have a strong preference either. However I would put the
field in struct ib_device rather than ib_device_attr -- no point
forcing the consumer to allocate a ib_device_attr and call the query
function just to find out something that can't ever change.

- R.

From mshefty at ichips.intel.com Wed Sep 1 09:28:49 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 1 Sep 2004 09:28:49 -0700
Subject: [openib-general] MAD queuing model
Message-ID: <20040901092849.10f6a8ab.mshefty@ichips.intel.com>

Does anyone have any thoughts about where MAD queuing should occur?
If a work request for a MAD cannot be immediately posted to a QP,
should a call to ib_mad_post_send fail, or should the work request be
queued for later?

Along this same line, should a MAD requiring RMPP post multiple work
requests or post a single request at a time, until it completes?
(By completion, I mean the work request only, and not an RMPP response.)
My initial thoughts are to queue the MADs in the access layer. But timers
would not start until the work request had actually been posted.

--

From halr at voltaire.com Wed Sep 1 10:34:11 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Wed, 01 Sep 2004 13:34:11 -0400
Subject: [openib-general] Re: ib_mad.h ib_mad_agent.hi_tid
In-Reply-To: <20040901091611.6876f0ce.mshefty@ichips.intel.com>
References: <1094053866.1832.14.camel@localhost.localdomain> <20040901083921.088092f2.mshefty@ichips.intel.com> <1094058624.1969.50.camel@localhost.localdomain> <20040901091611.6876f0ce.mshefty@ichips.intel.com>
Message-ID: <1094060051.1969.73.camel@localhost.localdomain>

On Wed, 2004-09-01 at 12:16, Sean Hefty wrote:
> On Wed, 01 Sep 2004 13:10:25 -0400
> Hal Rosenstock wrote:
>
> > If the access layer were to set the high 32 bits of the TID for
> > unsolicited MADs, would it do it for all but CM MADs or should it also
> > handle the CM case as well ?
>
> I think... have it do it for the CM MADs as well.
>
> > Which CM MADs are unsolicited ?
>
> See C12-5.1.2. REQ, LAP, DREQ, and SIDR_REQ are "unsolicited", according to
> how I've been describing it. REP, MRA, REJ, RTU, APR, DREP, and SIDR_REP are
> response MADs that need their TID to match the MAD that they are sent in response to.

Makes sense.

The one additional complication appears to be when a REJ is sent by the
active side from the Timeout or REP Wait state as a result of CM
protocol timeout. The CM protocol timeout is indicated by Message
REJected = 2. Reason code is 4 for timeout. Does the access layer need
to make exceptions for those REJs ?
-- Hal

From mshefty at ichips.intel.com Wed Sep 1 09:32:41 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 1 Sep 2004 09:32:41 -0700
Subject: [Fwd: Re: [openib-general] ib_verbs.h ib_device_attr device type]
In-Reply-To: <52n009292r.fsf@topspin.com>
References: <1094050724.1832.4.camel@localhost.localdomain> <20040901075242.59395b28.mshefty@ichips.intel.com> <1094055352.1832.30.camel@localhost.localdomain> <20040901083526.722f644d.mshefty@ichips.intel.com> <1094057280.1832.41.camel@localhost.localdomain> <20040901090503.226c9c08.mshefty@ichips.intel.com> <1094059250.1832.60.camel@localhost.localdomain> <52n009292r.fsf@topspin.com>
Message-ID: <20040901093241.0b9549f8.mshefty@ichips.intel.com>

On Wed, 01 Sep 2004 10:27:40 -0700
Roland Dreier wrote:

> Hal> I think this has now come full circle as that was my original
> Hal> proposal for this (to add another field to
> Hal> ib_device_attr). Roland then responded to add it into the
> Hal> ib_device flags. I don't really have a strong preference for
> Hal> which way.

I'm just slow...

> I don't have a strong preference either. However I would put the
> field in struct ib_device rather than ib_device_attr -- no point
> forcing the consumer to allocate a ib_device_attr and call the query
> function just to find out something that can't ever change.

I'd vote for a new field in ib_device that matches the NodeType values.
I think this should allow for a fairly easy transition in Roland's stack as well.
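
A minimal sketch of what a node type field matching the NodeInfo NodeType values could look like. The numeric values follow the NodeType encoding (1 = channel adapter, 2 = switch, 3 = router); the enum name, the constant names and struct ib_device_sketch are hypothetical stand-ins chosen for this illustration, not the definitions that were eventually committed.

```c
#include <stdint.h>

/* Values mirror the NodeInfo NodeType attribute component. */
enum ib_node_type_sketch {
	IB_NODE_SKETCH_CA     = 1,
	IB_NODE_SKETCH_SWITCH = 2,
	IB_NODE_SKETCH_ROUTER = 3
};

/* Cut-down stand-in for struct ib_device; only the proposed field is shown. */
struct ib_device_sketch {
	uint8_t node_type;
};

/* The check can be made on the device itself, with no attribute query. */
static int ib_device_sketch_is_switch(const struct ib_device_sketch *dev)
{
	return dev->node_type == IB_NODE_SKETCH_SWITCH;
}
```

One consequence worth noting: with the earlier flags proposal a zeroed structure already described an HCA, whereas with NodeType values zero is not a valid type, so under this sketch the driver has to set the field explicitly when it registers the device.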
From halr at voltaire.com Wed Sep 1 10:42:42 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Wed, 01 Sep 2004 13:42:42 -0400
Subject: [Fwd: Re: [openib-general] ib_verbs.h ib_device_attr device type]
In-Reply-To: <20040901093241.0b9549f8.mshefty@ichips.intel.com>
References: <1094050724.1832.4.camel@localhost.localdomain> <20040901075242.59395b28.mshefty@ichips.intel.com> <1094055352.1832.30.camel@localhost.localdomain> <20040901083526.722f644d.mshefty@ichips.intel.com> <1094057280.1832.41.camel@localhost.localdomain> <20040901090503.226c9c08.mshefty@ichips.intel.com> <1094059250.1832.60.camel@localhost.localdomain> <52n009292r.fsf@topspin.com> <20040901093241.0b9549f8.mshefty@ichips.intel.com>
Message-ID: <1094060561.1832.78.camel@localhost.localdomain>

On Wed, 2004-09-01 at 12:32, Sean Hefty wrote:
> On Wed, 01 Sep 2004 10:27:40 -0700
> I'd vote for a new field in ib_device that matches the NodeType values.
> I think this should allow for a fairly easy transition in Roland's stack as well.

That's fine with me (a node_type field (separated out of flags) in the
ib_device structure).

-- Hal

From mshefty at ichips.intel.com Wed Sep 1 10:02:17 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 1 Sep 2004 10:02:17 -0700
Subject: [openib-general] Re: ib_mad.h ib_mad_agent.hi_tid
In-Reply-To: <1094060051.1969.73.camel@localhost.localdomain>
References: <1094053866.1832.14.camel@localhost.localdomain> <20040901083921.088092f2.mshefty@ichips.intel.com> <1094058624.1969.50.camel@localhost.localdomain> <20040901091611.6876f0ce.mshefty@ichips.intel.com> <1094060051.1969.73.camel@localhost.localdomain>
Message-ID: <20040901100217.203cff62.mshefty@ichips.intel.com>

On Wed, 01 Sep 2004 13:34:11 -0400
Hal Rosenstock wrote:

> The one additional complication appears to be when a REJ is sent by the
> active side from the Timeout or REP Wait state as a result of CM
> protocol timeout. The CM protocol timeout is indicated by Message
> REJected = 2.
Reason code is 4 for timeout. Does the access layer need > to make exceptions for those REJs ? ...gurgle... Here's what we did for the SF stack. All CM MADs were routed at the receive side using a dispatch table based on "unsolicited" registration. (There was a special check for the CM class.) The impact was that a client other than the CM could not initiate a connection request by sending a REQ directly. It would be nice to avoid this restriction, which should allow multiple "active" side CMs to co-exist, but definitely isn't a requirement, and probably involves embedding a good deal of the CM protocol into the access layer. On the send side, the access layer did not change the TID for CM MADs. My preference is for the access layer to know as little about the CM protocol as possible. I think it would be nice if the access layer could examine something as simple as the R bit to determine if it needed to set the upper TID, but that's not possible given the CM protocol. As an alternative, we could extend the mad_flags to indicate if a sent MAD should be treated as: default (no flag set, so check the R bit), solicited (do not set TID), or unsolicited (set TID). Of course, if the client needs to set a flag, it could just set the TID directly and be done... Maybe this is the way to go on the send side. What are your thoughts on this? 
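The two send-side options described above can be sketched in a few lines. This is a rough illustration under stated assumptions: the `example_` names and the three-valued mode are hypothetical, not part of ib_mad.h, and the TID is treated as a host-order 64-bit value (conversion to wire byte order is not shown).

```c
#include <stdint.h>

/* Hypothetical per-send hint, modelled on the mad_flags idea. */
enum example_tid_mode {
	EXAMPLE_TID_DEFAULT,     /* decide from the R (response) bit */
	EXAMPLE_TID_SOLICITED,   /* response: leave the TID untouched */
	EXAMPLE_TID_UNSOLICITED  /* request: stamp the upper 32 bits */
};

/* Option 1: the client builds the full TID itself from hi_tid. */
uint64_t example_make_tid(uint32_t hi_tid, uint32_t lo_tid)
{
	return ((uint64_t)hi_tid << 32) | lo_tid;
}

/* Recover the upper 32 bits, e.g. to route a response to its client. */
uint32_t example_tid_hi(uint64_t tid)
{
	return (uint32_t)(tid >> 32);
}

/* Option 2: the access layer decides, guided by the mode and R bit. */
uint64_t example_apply_tid(enum example_tid_mode mode, int r_bit,
			   uint32_t hi_tid, uint64_t tid)
{
	int set_upper;

	switch (mode) {
	case EXAMPLE_TID_SOLICITED:
		set_upper = 0;
		break;
	case EXAMPLE_TID_UNSOLICITED:
		set_upper = 1;
		break;
	default:
		set_upper = !r_bit;  /* no hint: treat non-responses as unsolicited */
		break;
	}

	if (set_upper)
		tid = example_make_tid(hi_tid, (uint32_t)tid);
	return tid;
}
```

With option 1 the access layer never has to classify a MAD on the send side, which sidesteps the CM cases where the R bit alone is not enough.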
From halr at voltaire.com Wed Sep 1 11:09:43 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 01 Sep 2004 14:09:43 -0400 Subject: [openib-general] [PATCH] optional functions, ib_mad update In-Reply-To: <20040901084549.279d22bb.mshefty@ichips.intel.com> References: <20040901084549.279d22bb.mshefty@ichips.intel.com> Message-ID: <1094062182.1832.107.camel@localhost.localdomain> One minor typo: Index: ib_verbs.c =================================================================== --- ib_verbs.c (revision 712) +++ ib_verbs.c (working copy) @@ -244,7 +244,7 @@ return srq->device->query_srq(srq, srq_attr); } -EXPORT_SYMBOL(ib_modify_srq); +EXPORT_SYMBOL(ib_query_srq); int ib_destroy_srq(struct ib_srq *srq) { From mshefty at ichips.intel.com Wed Sep 1 10:12:13 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 1 Sep 2004 10:12:13 -0700 Subject: [openib-general] [PATCH] optional functions, ib_mad update In-Reply-To: <1094062182.1832.107.camel@localhost.localdomain> References: <20040901084549.279d22bb.mshefty@ichips.intel.com> <1094062182.1832.107.camel@localhost.localdomain> Message-ID: <20040901101213.3f8d9cd3.mshefty@ichips.intel.com> On Wed, 01 Sep 2004 14:09:43 -0400 Hal Rosenstock wrote: > One minor typo: > -EXPORT_SYMBOL(ib_modify_srq); > +EXPORT_SYMBOL(ib_query_srq); Good catch... 
thanks From mshefty at ichips.intel.com Wed Sep 1 10:34:34 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 1 Sep 2004 10:34:34 -0700 Subject: [Fwd: Re: [openib-general] ib_verbs.h ib_device_attr device type] In-Reply-To: <1094050724.1832.4.camel@localhost.localdomain> References: <1094050724.1832.4.camel@localhost.localdomain> Message-ID: <20040901103434.4ac1ca34.mshefty@ichips.intel.com> On Wed, 01 Sep 2004 10:58:45 -0400 Hal Rosenstock wrote: > Here's a patch to ib_verbs.h define the flags for the ib_device > structure: > > +enum ib_device_flags { > + IB_DEVICE_IS_SWITCH = 1, > + IB_DEVICE_IS_ROUTER = 2 > +}; > + Here's an updated patch based on our conversations. I debated about whether to define node_type as a u8 or an enum, but went with u8 to match the NodeInfo record. Index: ib_verbs.h =================================================================== --- ib_verbs.h (revision 712) +++ ib_verbs.h (working copy) @@ -609,6 +609,12 @@ IB_CQ_NEXT_COMP }; +enum ib_node_type { + IB_NODE_CA = 1, + IB_NODE_SWITCH, + IB_NODE_ROUTER +}; + enum ib_process_mad_flags { IB_MAD_IGNORE_MKEY = 1 }; @@ -733,6 +739,7 @@ struct ib_mad *out_mad); struct class_device class_dev; + u8 node_type; }; static inline int ib_query_device(struct ib_device *device, From halr at voltaire.com Wed Sep 1 11:57:03 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 01 Sep 2004 14:57:03 -0400 Subject: [openib-general] Re: ib_mad.h ib_mad_agent.hi_tid In-Reply-To: <20040901100217.203cff62.mshefty@ichips.intel.com> References: <1094053866.1832.14.camel@localhost.localdomain> <20040901083921.088092f2.mshefty@ichips.intel.com> <1094058624.1969.50.camel@localhost.localdomain> <20040901091611.6876f0ce.mshefty@ichips.intel.com> <1094060051.1969.73.camel@localhost.localdomain> <20040901100217.203cff62.mshefty@ichips.intel.com> Message-ID: <1094065022.1832.211.camel@localhost.localdomain> On Wed, 2004-09-01 at 13:02, Sean Hefty wrote: > On Wed, 01 Sep 2004 13:34:11 -0400 > Hal 
Rosenstock wrote: > > > The one additional complication appears to be when a REJ is sent by the > > active side from the Timeout or REP Wait state as a result of CM > > protocol timeout. The CM protocol timeout is indicated by Message > > REJected = 2. Reason code is 4 for timeout. Does the access layer need > > to make exceptions for those REJs ? > > ...gurgle... Was that the proverbial straw :-( > Here's what we did for the SF stack. All CM MADs were routed at the > receive side using a dispatch table based on "unsolicited" registration. > (There was a special check for the CM class.) > > The impact was that a client other than the CM could not initiate a > connection request by sending a REQ directly. It would be nice to avoid > this restriction, which should allow multiple "active" side CMs to co-exist, > but definitely isn't a requirement, and probably involves embedding a good > deal of the CM protocol into the access layer. As long as the CM clients use the hi_tid assigned by the access layer in forming their TIDs, I don't see a problem with multiple active CMs. > On the send side, > the access layer did not change the TID for CM MADs. > > My preference is for the access layer to know as little about the CM protocol > as possible. Mine as well. > I think it would be nice if the access layer could examine > something as simple as the R bit to determine if it needed to set the upper TID, > but that's not possible given the CM protocol. This may not only be an issue with CM but also vendor and application classes too (as they too can use Send methods (non request/response)). The other classes use of Send is more abstract than CM. > As an alternative, we could extend the mad_flags to indicate if sent a MAD > should be treated as: default (no flag set, so check the R bit), solicited > (do not set TID), or unsolicited. (set TID). > > Of course, if the client needs to set a flag, it could just set the TID directly > and be done... 
Maybe this is the way to go on the send side. > > What are your thoughts on this? I'm back to where I was on this as adding a flag seems of minimal benefit as you point out. So I would have the access layer set unsolicited TIDs with the exception of Send methods (CM and other clients using the Send method need to handle their own TIDs on send). If that is deemed too unclean, then all send TID handling is back at the client :-( Also, should the access layer validate that the client uses the hi_tid supplied in the mad_agent in the TIDs of sent MADs ? -- Hal From mshefty at ichips.intel.com Wed Sep 1 11:15:35 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 1 Sep 2004 11:15:35 -0700 Subject: [openib-general] Re: ib_mad.h ib_mad_agent.hi_tid In-Reply-To: <1094065022.1832.211.camel@localhost.localdomain> References: <1094053866.1832.14.camel@localhost.localdomain> <20040901083921.088092f2.mshefty@ichips.intel.com> <1094058624.1969.50.camel@localhost.localdomain> <20040901091611.6876f0ce.mshefty@ichips.intel.com> <1094060051.1969.73.camel@localhost.localdomain> <20040901100217.203cff62.mshefty@ichips.intel.com> <1094065022.1832.211.camel@localhost.localdomain> Message-ID: <20040901111535.14c6bfa2.mshefty@ichips.intel.com> On Wed, 01 Sep 2004 14:57:03 -0400 Hal Rosenstock wrote: > As long as the CM clients use the hi_tid assigned by the access layer in > forming their TIDs, I don't see a problem with multiple active CMs. The problem is routing MADs back to the proper CM without the access layer knowing the CM protocol. E.g. when a REP is received by the access layer, it can route the REP to the client registered to receive CM MADs based on the management class, or it needs to look at the management class, see that it's the CM, check the method, and then route based on the TID to the proper client. And getting the DREQ to the right CM would be difficult, so requires the CM to be active for all message sequences (REQ, DREQ, and LAP). 
I'm slowly remembering why we treated CM MADs differently and routed them to the one and only CM in all cases. > > My preference is for the access layer to know as little about the CM protocol > > as possible. > > Mine as well. I only wish that the architects had shared this view... > I'm back to where I was on this as adding a flag seems of minimal > benefit as you point out. So I would have the access layer set > unsolicited TIDs with the exception of Send methods (CM and other > clients using the Send method need to handle their own TIDs on send). > If that is deemed too unclean, then all send TID handling is back at > the client :-( At this point, my vote is for the clients to set the TID in all cases, but are required to use the upper TID assigned to them by the access layer. Clients already need to manage and set the lower TID, so this isn't a big issue. > Also, should the access layer validate that the client uses the hi_tid > supplied in the mad_agent in the TIDs of sent MADs ? It's probably not worth it. Since response TIDs have to match requests, adding this checks puts us back to having the access layer determine if a MAD is unsolicited before the check is done, which brings back all of the CM issues... From halr at voltaire.com Wed Sep 1 12:30:38 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 01 Sep 2004 15:30:38 -0400 Subject: [openib-general] MAD queuing model In-Reply-To: <20040901092849.10f6a8ab.mshefty@ichips.intel.com> References: <20040901092849.10f6a8ab.mshefty@ichips.intel.com> Message-ID: <1094067037.1969.249.camel@localhost.localdomain> On Wed, 2004-09-01 at 12:28, Sean Hefty wrote: > Does anyone have any thoughts about where MAD queuing should occur? I am just thinking "out loud" in my answers below. > If a work request for a MAD cannot be immediately posted to a QP, > should a call to ib_mad_post_send fail, or should the work request > be queued for later? 
It could be put on a deferred send queue that would be processed when the next send completion occurs. This would make the access layer implementation a little more complicated but save the client from rerequesting. > Along this same line, should a MAD requiring > RMPP post multiple work requests or post a single request at a time, > until it completes? (By completion, I mean the work request only, > and not an RMPP response.) Either (up to client) but there may be timing considerations if the WRs associated with an RMPP MAD are posted separately. Meaning that in the case where some WRs were posted and others deferred, some RMPP timeout could occur. It could cause a transaction too long and an ABORT would be received in response. Are there others ? It seems safer to post all WRs associated with an RMPP MAD as close together as possible (ideally it would be "atomic"). > My initial thoughts are to queue the MADs in the access layer. > But timers would not start until the work request had actually > been posted. I agree that if deferred sending were to be done, request/response timeouts (ib_mad_send_wr.timeout_ms) should not start until the MAD is actually posted. This brings another question to mind. Should timeout_ms be ignored for send methods if supplied in the send WR ? -- Hal From roland at topspin.com Wed Sep 1 12:18:32 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 01 Sep 2004 12:18:32 -0700 Subject: [openib-general] MAD queuing model In-Reply-To: <20040901092849.10f6a8ab.mshefty@ichips.intel.com> (Sean Hefty's message of "Wed, 1 Sep 2004 09:28:49 -0700") References: <20040901092849.10f6a8ab.mshefty@ichips.intel.com> Message-ID: <521xhl23xz.fsf@topspin.com> Sean> Does anyone have any thoughts about where MAD queuing should Sean> occur? If a work request for a MAD cannot be immediately Sean> posted to a QP, should a call to ib_mad_post_send fail, or Sean> should the work request be queued for later? 
Along this Sean> same line, should a MAD requiring RMPP post multiple work Sean> requests or post a single request at a time, until it Sean> completes? I definitely think the queueing should be done in the access layer so that consumers don't have to deal with the send queue full condition. (Otherwise we would have to call consumers back when space was available, and there still wouldn't be a way to ensure fair queueing and avoid the possibility of starving one consumer indefinitely). For RMPP, we could have the RMPP layer wait for each send to complete before posting another send. However I'm not sure this gains much and it seems simpler just to let the RMPP queue all its sends as they are ready and let the MAD layer handle posting as many sends as will fit in the work queue. - R. From halr at voltaire.com Wed Sep 1 12:38:15 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 01 Sep 2004 15:38:15 -0400 Subject: [openib-general] MAD queuing model In-Reply-To: <521xhl23xz.fsf@topspin.com> References: <20040901092849.10f6a8ab.mshefty@ichips.intel.com> <521xhl23xz.fsf@topspin.com> Message-ID: <1094067495.1969.256.camel@localhost.localdomain> On Wed, 2004-09-01 at 15:18, Roland Dreier wrote: > For RMPP, we could have the RMPP layer wait for each send to complete > before posting another send. However I'm not sure this gains much and > it seems simpler just to let the RMPP queue all its sends as they are > ready and let the MAD layer handle posting as many sends as will fit > in the work queue. If an RMPP send is n WRs, is there a way for the access layer to know before posting whether all the WRs for this will fit ? If so, is this "worth" the expense of this operation or should they just be posted until not fitting and the rest deferred (and RMPP will do the correct recovery either locally or remotely) ? (I obviously haven't thought this through in detail yet...) 
-- Hal From mshefty at ichips.intel.com Wed Sep 1 11:39:49 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 1 Sep 2004 11:39:49 -0700 Subject: [openib-general] MAD queuing model In-Reply-To: <1094067037.1969.249.camel@localhost.localdomain> References: <20040901092849.10f6a8ab.mshefty@ichips.intel.com> <1094067037.1969.249.camel@localhost.localdomain> Message-ID: <20040901113949.7e05d225.mshefty@ichips.intel.com> On Wed, 01 Sep 2004 15:30:38 -0400 Hal Rosenstock wrote: > > If a work request for a MAD cannot be immediately posted to a QP, > > should a call to ib_mad_post_send fail, or should the work request > > be queued for later? > > It could be put on a deferred send queue that would be processed when > the next send completion occurs. This would make the access layer > implementation a little more complicated but save the client from > rerequesting. I think we're in agreement that the access layer should perform the queuing, and that the queuing should be for actual work requests, which implies some of the internal layering of the code. > > Along this same line, should a MAD requiring > > RMPP post multiple work requests or post a single request at a time, > > until it completes? (By completion, I mean the work request only, > > and not an RMPP response.) I asked this more based on what the response to the first question was, along with how the layering worked. It also comes down to some fairness, since while a large RMPP request is being sent, responses for other MADs may be queued behind it, which could result in timeouts on other MADs. Ideally, correct RMPP windowing would avoid this type of condition. > This brings another question to mind. Should timeout_ms be ignored for > send methods if supplied in the send WR ? I didn't quite follow this question. 
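The queuing model the thread converges on (the access layer queues work requests, and timers start only once a request is really posted) might look roughly like the sketch below. The structures and function names are hypothetical, and the QP is reduced to a counter of occupied send slots. It is not the actual ib_mad implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one MAD send work request. */
struct example_send {
	struct example_send *next;  /* link in the deferred FIFO */
	int posted;                 /* nonzero once handed to the QP; a
				     * response timer would start here */
};

/* Hypothetical per-QP send state kept by the access layer. */
struct example_send_queue {
	unsigned int max_active;    /* send queue depth of the QP */
	unsigned int active;        /* work requests currently posted */
	struct example_send *head;  /* oldest deferred request */
	struct example_send *tail;  /* newest deferred request */
};

/* Stand-in for the real post to the QP. */
void example_post_to_qp(struct example_send_queue *q, struct example_send *s)
{
	s->posted = 1;
	q->active++;
}

/* Never fails for a full queue: defers the request instead. */
void example_post_send(struct example_send_queue *q, struct example_send *s)
{
	s->next = NULL;
	s->posted = 0;

	/* Post directly only if nothing older is waiting (keeps FIFO order). */
	if (q->head == NULL && q->active < q->max_active) {
		example_post_to_qp(q, s);
		return;
	}

	if (q->tail)
		q->tail->next = s;
	else
		q->head = s;
	q->tail = s;
}

/* Called on each send completion: refill the QP from the deferred list. */
void example_send_done(struct example_send_queue *q)
{
	assert(q->active > 0);
	q->active--;

	while (q->head && q->active < q->max_active) {
		struct example_send *s = q->head;

		q->head = s->next;
		if (q->head == NULL)
			q->tail = NULL;
		example_post_to_qp(q, s);
	}
}
```

A real implementation would also need locking and the actual verbs post call. The sketch only shows where the deferral and the timer start would sit.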
From mshefty at ichips.intel.com Wed Sep 1 11:43:17 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 1 Sep 2004 11:43:17 -0700 Subject: [openib-general] MAD queuing model In-Reply-To: <1094067495.1969.256.camel@localhost.localdomain> References: <20040901092849.10f6a8ab.mshefty@ichips.intel.com> <521xhl23xz.fsf@topspin.com> <1094067495.1969.256.camel@localhost.localdomain> Message-ID: <20040901114317.7ea6e33d.mshefty@ichips.intel.com> On Wed, 01 Sep 2004 15:38:15 -0400 Hal Rosenstock wrote: > If an RMPP send is n WRs, is there a way for the access layer to know > before posting whether all the WRs for this will fit ? If so, is this 'n' may be smaller than the size of the QP. > "worth" the expense of this operation or should they just be posted > until not fitting and the rest deferred (and RMPP will do the correct > recovery either locally or remotely) ? (I obviously haven't thought this > through in detail yet...) As long as we can come close to maintaining the data rate, I'm not sure that deferring the sends helps anything. I think we want to try to keep the queue as full as possible when it's busy. 
From halr at voltaire.com Wed Sep 1 12:53:50 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 01 Sep 2004 15:53:50 -0400 Subject: [openib-general] Re: ib_mad.h ib_mad_agent.hi_tid In-Reply-To: <20040901111535.14c6bfa2.mshefty@ichips.intel.com> References: <1094053866.1832.14.camel@localhost.localdomain> <20040901083921.088092f2.mshefty@ichips.intel.com> <1094058624.1969.50.camel@localhost.localdomain> <20040901091611.6876f0ce.mshefty@ichips.intel.com> <1094060051.1969.73.camel@localhost.localdomain> <20040901100217.203cff62.mshefty@ichips.intel.com> <1094065022.1832.211.camel@localhost.localdomain> <20040901111535.14c6bfa2.mshefty@ichips.intel.com> Message-ID: <1094068430.1832.272.camel@localhost.localdomain> On Wed, 2004-09-01 at 14:15, Sean Hefty wrote: > On Wed, 01 Sep 2004 14:57:03 -0400 > Hal Rosenstock wrote: > > > As long as the CM clients use the hi_tid assigned by the access layer in > > forming their TIDs, I don't see a problem with multiple active CMs. > > The problem is routing MADs back to the proper CM without the access layer > knowing the CM protocol. E.g. when a REP is received by the access layer, > it can route the REP to the client registered to receive CM MADs based > on the management class, or it needs to look at the management class, > see that it's the CM, check the method, and then route based on the > TID to the proper client. I was thinking the latter. > And getting the DREQ to the right CM would be difficult, so requires the CM > to be active for all message sequences (REQ, DREQ, and LAP). Yup, that's the crux of this. The passive CM is not always so passive :-( > I'm slowly remembering why we treated CM MADs differently and routed them > to the one and only CM in all cases. Me too. CM is a hairball. I think it is acceptable to run a single CM at any one time. This doesn't preclude running different CM implementations if more than one is interesting to OpenIB; just not at the same time. 
If anyone thinks otherwise, now is the time to speak your piece. > > > My preference is for the access layer to know as little about the CM protocol > > > as possible. > > > > Mine as well. > > I only wish that the architects had shared this view... It would be nice to rewrite history on this... > > I'm back to where I was on this as adding a flag seems of minimal > > benefit as you point out. So I would have the access layer set > > unsolicited TIDs with the exception of Send methods (CM and other > > clients using the Send method need to handle their own TIDs on send). > > If that is deemed too unclean, then all send TID handling is back at > > the client :-( > > At this point, my vote is for the clients to set the TID in all cases, > but are required to use the upper TID assigned to them by the access layer. > Clients already need to manage and set the lower TID, so this isn't a big issue. OK. > > Also, should the access layer validate that the client uses the hi_tid > > supplied in the mad_agent in the TIDs of sent MADs ? > > It's probably not worth it. Since response TIDs have to match requests, > adding this checks puts us back to having the access layer determine if a MAD is > unsolicited before the check is done, which brings back all of the CM issues... Right. 
-- Hal From halr at voltaire.com Wed Sep 1 13:09:10 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 01 Sep 2004 16:09:10 -0400 Subject: [openib-general] MAD queuing model In-Reply-To: <20040901114317.7ea6e33d.mshefty@ichips.intel.com> References: <20040901092849.10f6a8ab.mshefty@ichips.intel.com> <521xhl23xz.fsf@topspin.com> <1094067495.1969.256.camel@localhost.localdomain> <20040901114317.7ea6e33d.mshefty@ichips.intel.com> Message-ID: <1094069349.1832.293.camel@localhost.localdomain> On Wed, 2004-09-01 at 14:43, Sean Hefty wrote: > On Wed, 01 Sep 2004 15:38:15 -0400 > Hal Rosenstock wrote: > > > If an RMPP send is n WRs, is there a way for the access layer to know > > before posting whether all the WRs for this will fit ? If so, is this > > 'n' may be smaller than the size of the QP. Do you mean larger rather than smaller so they all wouldn't fit ? > > "worth" the expense of this operation or should they just be posted > > until not fitting and the rest deferred (and RMPP will do the correct > > recovery either locally or remotely) ? (I obviously haven't thought this > > through in detail yet...) > > As long as we can come close to maintaining the data rate, Not sure how "fast" things will be. > I'm not sure that defering the sends helps anything. Certainly not in the case you cite above where all the RMPP WRs won't fit on the QP at one time. > I think we want to try to keep the queue as full as possible when it's busy. OK and the main issue is fairness. 
-- Hal From halr at voltaire.com Wed Sep 1 13:31:52 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 01 Sep 2004 16:31:52 -0400 Subject: [openib-general] MAD queuing model In-Reply-To: <20040901113949.7e05d225.mshefty@ichips.intel.com> References: <20040901092849.10f6a8ab.mshefty@ichips.intel.com> <1094067037.1969.249.camel@localhost.localdomain> <20040901113949.7e05d225.mshefty@ichips.intel.com> Message-ID: <1094070711.1832.318.camel@localhost.localdomain> On Wed, 2004-09-01 at 14:39, Sean Hefty wrote: > > > Along this same line, should a MAD requiring > > > RMPP post multiple work requests or post a single request at a time, > > > until it completes? (By completion, I mean the work request only, > > > and not an RMPP response.) > > I asked this more based on what the response to the first question was, > along with how the layering worked. It also comes down to some fairness, > since while a large RMPP request is being sent, responses for other MADs > may be queued behind it, which could result in timeouts on other MADs. > Ideally, correct RMPP windowing would avoid this type of condition. This is mainly an issue with the SA (responses which can transfer a lot of data). (There may be other proprietary uses of RMPP too). The RMPP window could help with this but is this sufficient for achieving fairness ? If it is, great. Is fairness primarily an issue once sends are being deferred ? It seems that in order to be fair some progress should be made on non RMPP sends and perhaps this can be a simple as 1 non RMPP send to every n RMPP sends where n is a compile time parameter ? Other ideas ? > > This brings another question to mind. Should timeout_ms be ignored for > > send methods if supplied in the send WR ? > > I didn't quite follow this question. The previous comment "if deferred sending were to be done, request/response timeouts (ib_mad_send_wr.timeout_ms) should not start until the MAD is actually posted." 
brought to mind the meaning of ib_mad_send_wr.timeout_ms in light of the discussion on solicited and unsolicited messages. The header file says the following: Timeout value, in milliseconds, to wait for a response message. Set to 0 if no response is expected. I was trying to ask if the MAD method is Send and timeout_ms is not 0, whether this timeout should be honored or ignored by the access layer. -- Hal From roland at topspin.com Wed Sep 1 13:42:58 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 01 Sep 2004 13:42:58 -0700 Subject: [openib-general] MAD queuing model In-Reply-To: <20040901113949.7e05d225.mshefty@ichips.intel.com> (Sean Hefty's message of "Wed, 1 Sep 2004 11:39:49 -0700") References: <20040901092849.10f6a8ab.mshefty@ichips.intel.com> <1094067037.1969.249.camel@localhost.localdomain> <20040901113949.7e05d225.mshefty@ichips.intel.com> Message-ID: <52u0uhzpnx.fsf@topspin.com> Sean> I asked this more based on what the response to the first Sean> question was, along with how the layering worked. It also Sean> comes down to some fairness, since while a large RMPP Sean> request is being sent, responses for other MADs may be Sean> queued behind it, which could result in timeouts on other Sean> MADs. Ideally, correct RMPP windowing would avoid this type Sean> of condition. I'd be inclined not to worry about this type of fairness for our initial implementation. If it turns out that RMPP sends with 10 megabyte windows are starving other sends, then we can add more sophisticated queue processing (since our MAD send queue handling will be nicely encapsulated in the core MAD layer ;). For example one could have a scheduler that limits the number of consecutive sends with the same TID if other sends are waiting -- however as I said I don't think this should be in our first version. - R. 
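For reference, the kind of scheduler mentioned above (limit consecutive sends with the same TID while other sends are waiting) could be sketched as below. Everything here is hypothetical: the names, the burst limit of 4, and the representation of the pending queue as an array of TIDs in FIFO order. It is a sketch of the idea only, not something proposed for the first version.

```c
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_MAX_BURST 4  /* assumed limit on back-to-back sends per TID */

/* Hypothetical scheduler state: what was sent last and how many times. */
struct example_sched {
	uint64_t last_tid;  /* TID of the most recently chosen send */
	unsigned int run;   /* consecutive sends chosen with last_tid */
};

/*
 * Choose which pending send to post next. pending_tid[] lists the TIDs
 * of waiting sends, oldest first. Returns an index into the array, or
 * -1 if nothing is waiting. Normally the oldest send wins; once a TID
 * has used up its burst, the oldest send with a different TID is
 * chosen instead, if there is one.
 */
int example_pick_next(struct example_sched *s,
		      const uint64_t *pending_tid, size_t n)
{
	size_t pick = 0;
	size_t i;

	if (n == 0)
		return -1;

	if (s->run >= EXAMPLE_MAX_BURST && pending_tid[0] == s->last_tid) {
		for (i = 1; i < n; i++) {
			if (pending_tid[i] != s->last_tid) {
				pick = i;
				break;
			}
		}
	}

	if (s->run > 0 && pending_tid[pick] == s->last_tid) {
		s->run++;
	} else {
		s->last_tid = pending_tid[pick];
		s->run = 1;
	}
	return (int)pick;
}
```

If every waiting send shares the same TID, the sketch simply keeps sending in FIFO order, so a lone RMPP transfer is never throttled.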
From halr at voltaire.com Wed Sep 1 14:01:51 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 01 Sep 2004 17:01:51 -0400 Subject: [openib-general] MAD queuing model In-Reply-To: <52u0uhzpnx.fsf@topspin.com> References: <20040901092849.10f6a8ab.mshefty@ichips.intel.com> <1094067037.1969.249.camel@localhost.localdomain> <20040901113949.7e05d225.mshefty@ichips.intel.com> <52u0uhzpnx.fsf@topspin.com> Message-ID: <1094072511.1832.326.camel@localhost.localdomain> On Wed, 2004-09-01 at 16:42, Roland Dreier wrote: > Sean> I asked this more based on what the response to the first > Sean> question was, along with how the layering worked. It also > Sean> comes down to some fairness, since while a large RMPP > Sean> request is being sent, responses for other MADs may be > Sean> queued behind it, which could result in timeouts on other > Sean> MADs. Ideally, correct RMPP windowing would avoid this type > Sean> of condition. > > I'd be inclined not to worry about this type of fairness for our > initial implementation. If it turns out that RMPP sends with 10 > megabyte windows are starving other sends, then we can add more > sophisticated queue processing (since our MAD send queue handling will > be nicely encapsulated in the core MAD layer ;). > > For example one could have a scheduler that limits the number of > consecutive sends with the same TID if other sends are waiting -- > however as I said I don't think this should be in our first version. 
I'm also in favor of deferring this to wait for implementation experience :-) -- Hal From ftillier at infiniconsys.com Wed Sep 1 14:05:33 2004 From: ftillier at infiniconsys.com (Fab Tillier) Date: Wed, 1 Sep 2004 14:05:33 -0700 Subject: [openib-general] MAD queuing model In-Reply-To: <52u0uhzpnx.fsf@topspin.com> Message-ID: <000001c49067$6ca79610$655aa8c0@infiniconsys.com> > From: Roland Dreier [mailto:roland at topspin.com] > Sent: Wednesday, September 01, 2004 1:43 PM > > Sean> I asked this more based on what the response to the first > Sean> question was, along with how the layering worked. It also > Sean> comes down to some fairness, since while a large RMPP > Sean> request is being sent, responses for other MADs may be > Sean> queued behind it, which could result in timeouts on other > Sean> MADs. Ideally, correct RMPP windowing would avoid this type > Sean> of condition. > > I'd be inclined not to worry about this type of fairness for our > initial implementation. If it turns out that RMPP sends with 10 > megabyte windows are starving other sends, then we can add more > sophisticated queue processing (since our MAD send queue handling will > be nicely encapsulated in the core MAD layer ;). > > For example one could have a scheduler that limits the number of > consecutive sends with the same TID if other sends are waiting -- > however as I said I don't think this should be in our first version. > Can we limit a RMPP send to a single WQE outstanding at a time? That is, all RMPP sends would require a single slot in the send queue of the QP. Each segment of the RMPP is sent one at a time, with subsequent segments waiting for the previous one's WR to complete. While this is not as efficient as posting many work requests, especially in the case where the send queue is nearly empty, it does provide for simple fairness. I would expect that the QP would turn these things around fast enough that it should be suitable. Thoughts? 
- Fab From roland at topspin.com Wed Sep 1 14:11:01 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 01 Sep 2004 14:11:01 -0700 Subject: [openib-general] MAD queuing model In-Reply-To: <000001c49067$6ca79610$655aa8c0@infiniconsys.com> (Fab Tillier's message of "Wed, 1 Sep 2004 14:05:33 -0700") References: <000001c49067$6ca79610$655aa8c0@infiniconsys.com> Message-ID: <52ekllzod6.fsf@topspin.com> Fab> Can we limit a RMPP send to a single WQE outstanding at a Fab> time? That is, all RMPP sends would require a single slot in Fab> the send queue of the QP. Each segment of the RMPP is sent Fab> one at a time, with subsequent segments waiting for the Fab> previous one's WR to complete. While this is not as Fab> efficient as posting many work requests, especially in the Fab> case where the send queue is nearly empty, it does provide Fab> for simple fairness. I would expect that the QP would turn Fab> these things around fast enough that it should be suitable. Sure, this is a possibility too. However this now creates another potential fairness problem: so many MADs are being posted that by the time the RMPP layer can get a completion and have its next send executed, a timeout has occurred. I would prefer to have consumers queue up as many MADs as they have ready to send as early as possible. This maximizes the possibilities for adding a smart queue scheduler in the future. - R. From roland at topspin.com Wed Sep 1 14:31:28 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 01 Sep 2004 14:31:28 -0700 Subject: [Fwd: Re: [openib-general] ib_verbs.h ib_device_attr device type] In-Reply-To: <20040901103434.4ac1ca34.mshefty@ichips.intel.com> (Sean Hefty's message of "Wed, 1 Sep 2004 10:34:34 -0700") References: <1094050724.1832.4.camel@localhost.localdomain> <20040901103434.4ac1ca34.mshefty@ichips.intel.com> Message-ID: <52acw9znf3.fsf@topspin.com> I'm applying this patch to bring my branch up to date. - R. 
Index: infiniband/ulp/ipoib/ipoib_verbs.c =================================================================== --- infiniband/ulp/ipoib/ipoib_verbs.c (revision 714) +++ infiniband/ulp/ipoib/ipoib_verbs.c (working copy) @@ -301,7 +301,7 @@ return; } - if (device->flags & IB_DEVICE_IS_SWITCH) { + if (device->node_type == IB_NODE_SWITCH) { if (try_module_get(device->owner)) ipoib_add_port("ib%d", device, 0); } else { Index: infiniband/include/ib_verbs.h =================================================================== --- infiniband/include/ib_verbs.h (revision 714) +++ infiniband/include/ib_verbs.h (working copy) @@ -47,6 +47,12 @@ } global; }; +enum ib_node_type { + IB_NODE_CA = 1, + IB_NODE_SWITCH, + IB_NODE_ROUTER +}; + enum ib_device_cap_flags { IB_DEVICE_RESIZE_MAX_WR = 1, IB_DEVICE_BAD_PKEY_CNTR = (1<<1), @@ -684,6 +690,8 @@ ib_mad_process_func mad_process; struct class_device class_dev; + + u8 node_type; }; int ib_query_device(struct ib_device *device, Index: infiniband/include/ts_ib_core_types.h =================================================================== --- infiniband/include/ts_ib_core_types.h (revision 714) +++ infiniband/include/ts_ib_core_types.h (working copy) @@ -154,7 +154,6 @@ /* structures */ enum { - IB_DEVICE_IS_SWITCH = 1 << 0, IB_MAD_NO_HOP_POINTER_INCR = 1 << 1, IB_MAD_LOCAL_USE_QP = 1 << 2 }; Index: infiniband/core/mad_main.c =================================================================== --- infiniband/core/mad_main.c (revision 714) +++ infiniband/core/mad_main.c (working copy) @@ -181,7 +181,7 @@ device->mad = priv; priv->ib_dev = device; - priv->num_port = device->flags & IB_DEVICE_IS_SWITCH ? + priv->num_port = device->node_type == IB_NODE_SWITCH ? 
1 : prop.phys_port_cnt; priv->pd = ib_alloc_pd(device); @@ -235,7 +235,7 @@ int start_port, end_port; int p, q, i; - if (device->flags & IB_DEVICE_IS_SWITCH) { + if (device->node_type == IB_NODE_SWITCH) { start_port = end_port = 0; } else { start_port = 1; Index: infiniband/core/mad_filter.c =================================================================== --- infiniband/core/mad_filter.c (revision 714) +++ infiniband/core/mad_filter.c (working copy) @@ -132,7 +132,7 @@ /* C14-9:2 */ if (hop_pointer != 0 && hop_pointer < hop_count) { - if (!(device->flags & IB_DEVICE_IS_SWITCH)) { + if (device->node_type != IB_NODE_SWITCH) { return 0; // Drop intermediate hop on non-switch. } else { /* XXX switch */ @@ -148,7 +148,7 @@ (TS_IB_MAD_SMP_DR_PAYLOAD(mad))->return_path[hop_pointer] = mad->port; ++mad->route.directed.hop_pointer; - if (device->flags & IB_DEVICE_IS_SWITCH) { + if (device->node_type == IB_NODE_SWITCH) { /* XXX switch */ TS_REPORT_WARN(MOD_KERNEL_IB, "Need to handle DrMad on switch"); return 0; @@ -198,7 +198,7 @@ /* C14-13:2 */ if (hop_count != 0 && 2 <= hop_pointer && hop_pointer <= hop_count) { - if (!(device->flags & IB_DEVICE_IS_SWITCH)) { + if (device->node_type != IB_NODE_SWITCH) { return 0; // Drop intermediate hop on non-switch. 
} else { /* XXX switch */ @@ -212,7 +212,7 @@ if (hop_pointer == 1) { --mad->route.directed.hop_pointer; - if (device->flags & IB_DEVICE_IS_SWITCH) { + if (device->node_type == IB_NODE_SWITCH) { /* XXX switch */ TS_REPORT_WARN(MOD_KERNEL_IB, "Need to handle DrMad on switch"); return 0; Index: infiniband/core/core_device.c =================================================================== --- infiniband/core/core_device.c (revision 714) +++ infiniband/core/core_device.c (working copy) @@ -162,7 +162,7 @@ memcpy(priv->node_guid, &prop.node_guid, sizeof (tTS_IB_GUID)); - if (device->flags & IB_DEVICE_IS_SWITCH) { + if (device->node_type == IB_NODE_SWITCH) { priv->start_port = priv->end_port = 0; } else { priv->start_port = 1; @@ -207,7 +207,7 @@ goto out_free_cache; } - ret = ib_proc_setup(device, !!(device->flags & IB_DEVICE_IS_SWITCH)); + ret = ib_proc_setup(device, device->node_type == IB_NODE_SWITCH); if (ret) { TS_REPORT_WARN(MOD_KERNEL_IB, "Couldn't create /proc dir for %s", Index: infiniband/hw/mthca/mthca_provider.c =================================================================== --- infiniband/hw/mthca/mthca_provider.c (revision 714) +++ infiniband/hw/mthca/mthca_provider.c (working copy) @@ -517,6 +517,7 @@ int mthca_register_device(struct mthca_dev *dev) { strlcpy(dev->ib_dev.name, "mthca%d", IB_DEVICE_NAME_MAX); + dev->ib_dev.node_type = IB_NODE_CA; dev->ib_dev.owner = THIS_MODULE; dev->ib_dev.dma_device = dev->pdev; dev->ib_dev.provider = "mthca"; From gdror at mellanox.co.il Wed Sep 1 14:59:42 2004 From: gdror at mellanox.co.il (Dror Goldenberg) Date: Thu, 2 Sep 2004 00:59:42 +0300 Subject: [openib-general] [PATCH] mthca updates (2.6.8 dependent) Message-ID: <506C3D7B14CDD411A52C00025558DED605E00550@mtlex01.yok.mtl.com> > -----Original Message----- > From: Grant Grundler [mailto:iod00d at hp.com] > Sent: Tuesday, August 31, 2004 8:23 AM > > On Mon, Aug 30, 2004 at 09:10:35PM -0700, Roland Dreier wrote: > > The Intel E7500 Xeon chipset PCI bridge 
datasheet does say that an > > interrupt message causes the bridge to flush its write buffers to > > preserve precisely this ordering, but I don't know whether everyone > > followed or will follow this example (I'm sure we'll see > many more MSI > > implementations on PCI Express). > > Maybe the flush is required because of DMA write coalescing > in the E7500 chipset? > > Ie any DMA writes which can't be coalesced will cause this > kind of a flushing behavior. I don't really know since I > don't how E7500 handles cache coherency. > > And yes, I'm certain some chipsets get DMA write coalescing > wrong. Look at drivers/net/tg3.c and search for > TG3_FLAG_MBOX_WRITE_REORDER in tg3_get_invariants(). > > grant I think that one of the most important intentions of the MSI was to avoid the need to perform the MMIO read, which is expensive in CPU time. I think that for mainstream systems you can assume that there is ordering between MSI and other DMA writes. If some systems have related bugs, then in this case they can add a workaround (which may be either to perform an MMIO read, or to use good old regular interrupts). Another thing worth mentioning is that the InfiniHost will fire an interrupt again if the interrupt handler didn't clean up the event queue. I.e. in the rare (or nonexistent) case of the MSI bypassing the preceding DMA write, if the driver doesn't see the EQE in memory at the time it rearms the EQ, it'll get an immediate interrupt. So, you should be safe anyway... BTW, here's an excerpt from the PCI spec* explaining that the driver need not do any MMIO read from the device in order to ensure that the data is in memory at the time the MSI is received: "An MSI or MSI-X message, by virtue of being a posted memory write (PMW) transaction, is prohibited by PCI ordering rules from passing PMW transactions sent earlier by the function. 
The system must guarantee that an interrupt service routine invoked as a result of a given message will observe any updates performed by PMW transactions arriving prior to that message. Thus, the interrupt service routine of a device driver is not required to read from a device register in order to ensure data consistency with previous PMW transactions. However, if multiple MSI-X Table entries share the same vector, the interrupt service routine may need to read from some device specific register to determine which interrupt sources need servicing." * Section 6.8.3.6 at the end in http://www.pcisig.com/specifications/conventional/msi-x_ecn.pdf -Dror From gdror at mellanox.co.il Wed Sep 1 15:06:43 2004 From: gdror at mellanox.co.il (Dror Goldenberg) Date: Thu, 2 Sep 2004 01:06:43 +0300 Subject: [openib-general] [PATCH] mthca updates (2.6.8 dependent) Message-ID: <506C3D7B14CDD411A52C00025558DED605E00554@mtlex01.yok.mtl.com> > -----Original Message----- > From: Grant Grundler [mailto:iod00d at hp.com] > Sent: Tuesday, August 31, 2004 7:00 AM > > On Mon, Aug 16, 2004 at 09:27:13PM +0300, Dror Goldenberg wrote: > > In PCI/PCIX, the interrupt is a wire, so it is not > > Dror, > I'm pretty sure you understand the issues but are using confusing > terminology: > o posted write. CPU does not wait for completion of write to > IO device o PIO write. Programmed IO - CPU write to IO > device. May or not be posted > and typically depends on chipset and which "space" (MMIO vs I/O > Port) is the target. > o PIO read. Programmed IO - CPU read stalls until completion > (may be MMIO > or I/O port space). 
> o DMA write: IO Device write to host memory (aka upstream) > o DMA read: Device command to retrieve data from host memory > (downstream) o DMA read return: completion portion of DMA > read command (upstream) > > A PIO Read "flushes" inflight DMA writes from a CPU > perspective because the CPU stalls until the PIO read completes. > Yes... it sounds much better. Thanks ! > > In PCI-Express, the interrupt is a message, so it will work. The > > interrupt will just flush the data to the memory because it > maintain > > ordering with posted writes upstream. > > The MSI/MSI-X interrupt doesn't do anything. > The interrupt transaction is just another DMA Write and must > follow the PCI ordering rules like any other DMA write. The > destination address is just not a regular host memory location. I was talking about regular interrupts in PCI express. For PCI express, MSI/MSI-X are plain DMA writes. However, good old interrupts in PCI express don't go on an external wire. They just go on the same bus as the data; for that they use special PCI express messages. And they maintain ordering like any other DMA writes. So, although the same semantics of "old interrupts" is preserved, the behavior is a bit different in PCI express. From gdror at mellanox.co.il Wed Sep 1 15:06:44 2004 From: gdror at mellanox.co.il (Dror Goldenberg) Date: Thu, 2 Sep 2004 01:06:44 +0300 Subject: [openib-general] Multicast address aliasing in IPoIB Message-ID: <506C3D7B14CDD411A52C00025558DED605E00555@mtlex01.yok.mtl.com> > -----Original Message----- > From: Roland Dreier [mailto:roland at topspin.com] > Sent: Monday, August 30, 2004 11:35 PM > > Dror> For gen2, will it be possible to define a new medium for the > Dror> IPoIB driver (not ARPHRD_ETHER), such that arp_mc_map() will > Dror> map the entire IP address into the HW address ? 
Today it > Dror> looks impossible, because arp_mc_map() just overrides bits > Dror> 31:24 of the IP address. > > I guess when we merge the IPoIB driver we will need to > include a patch to the networking core that treats > ARPHRD_INFINIBAND properly for IPv4 and IPv6 multicast addresses. > I also believe that this is the right way to go. Is anyone working on pushing this kind of change to the Linux kernel ? Thanks Dror From mst at mellanox.co.il Wed Sep 1 15:12:30 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 2 Sep 2004 01:12:30 +0300 Subject: [openib-general] MAD queuing model In-Reply-To: <20040901113949.7e05d225.mshefty@ichips.intel.com> References: <20040901092849.10f6a8ab.mshefty@ichips.intel.com> <1094067037.1969.249.camel@localhost.localdomain> <20040901113949.7e05d225.mshefty@ichips.intel.com> Message-ID: <20040901221230.GD26044@mellanox.co.il> Hello! Quoting r. Sean Hefty (mshefty at ichips.intel.com) "Re: [openib-general] MAD queuing model": > On Wed, 01 Sep 2004 15:30:38 -0400 > Hal Rosenstock wrote: > > > > If a work request for a MAD cannot be immediately posted to a QP, > > > should a call to ib_mad_post_send fail, or should the work request > > > be queued for later? > > > > It could be put on a deferred send queue that would be processed when > > the next send completion occurs. This would make the access layer > > implementation a little more complicated but save the client from > > rerequesting. > > I think we're in agreement that the access layer should perform the > queuing, and that the queuing should be for actual work requests, which > implies some of the internal layering of the code. > I'd like to propose a simpler approach: if the Q is full, drop the MAD without any indication to the user (pretend the MAD was sent), and let RMPP retry. Note that you need retries anyway since the hardware is unreliable. 
If the Q is big enough, this won't happen a lot, and it's much simpler, I think, than queueing in the layer. MST From roland at topspin.com Wed Sep 1 15:11:03 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 01 Sep 2004 15:11:03 -0700 Subject: [openib-general] Multicast address aliasing in IPoIB In-Reply-To: <506C3D7B14CDD411A52C00025558DED605E00555@mtlex01.yok.mtl.com> (Dror Goldenberg's message of "Thu, 2 Sep 2004 01:06:44 +0300") References: <506C3D7B14CDD411A52C00025558DED605E00555@mtlex01.yok.mtl.com> Message-ID: <521xhlzll4.fsf@topspin.com> Dror> I also believe that this is the right way to go. Anyone is Dror> working in pushing this kind of change to the Linux kernel ? I don't think there's much hope of getting a patch like that merged until the whole IPoIB driver is ready for merging. However it might be worth it for someone to work on the patch now -- we can keep it in our tree (as part of the kernel patch we already have for the drivers/ Kconfig and Makefile) until we're ready to merge upstream. - R. From iod00d at hp.com Wed Sep 1 15:13:37 2004 From: iod00d at hp.com (Grant Grundler) Date: Wed, 1 Sep 2004 15:13:37 -0700 Subject: [openib-general] [PATCH] optional functions, ib_mad update In-Reply-To: <20040901084549.279d22bb.mshefty@ichips.intel.com> References: <20040901084549.279d22bb.mshefty@ichips.intel.com> Message-ID: <20040901221337.GM5403@cup.hp.com> On Wed, Sep 01, 2004 at 08:45:49AM -0700, Sean Hefty wrote: > Here's a patch that checks for optional functions provided by the device. > It also includes a patch from Hal for IB management classes. Nit: in general, combining patches is a nono. (And yes, I'm guilty of that too) > -static inline int ib_modify_device(struct ib_device *device, > - int device_modify_mask, > - struct ib_device_modify *device_modify) > -{ > - return device->modify_device(device, device_modify_mask, > - device_modify); > -} Why not leave this as an inline function? 
static inline int ib_modify_device(struct ib_device *device, int modify_mask, struct ib_device_modify *modify) { return (device->modify_device ? device->modify_device(device, modify_mask, modify) : -ENOSYS); } Or does anyone expect this to get more complicated? thanks, grant From mshefty at ichips.intel.com Wed Sep 1 14:32:36 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 1 Sep 2004 14:32:36 -0700 Subject: [openib-general] [PATCH] optional functions, ib_mad update In-Reply-To: <20040901221337.GM5403@cup.hp.com> References: <20040901084549.279d22bb.mshefty@ichips.intel.com> <20040901221337.GM5403@cup.hp.com> Message-ID: <20040901143236.35e0a86d.mshefty@ichips.intel.com> On Wed, 1 Sep 2004 15:13:37 -0700 Grant Grundler wrote: > Nit: in general, combining patches is a nono. > (And yes, I'm guilty of that too) Yes, and I'm aware that I've combined several patches lately. Sorry, I'll try to stop. > > -static inline int ib_modify_device(struct ib_device *device, > > - int device_modify_mask, > > - struct ib_device_modify *device_modify) > > -{ > > - return device->modify_device(device, device_modify_mask, > > - device_modify); > > -} > > Why not leave this as an inline function? > > Or does anyone expect this to get more complicated? I thought it might be more complicated as additional checks were added, but can switch it (and similarly formatted calls) back to inline for now, then move them later, if needed. Thanks for the feedback! From mshefty at ichips.intel.com Wed Sep 1 14:56:42 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 1 Sep 2004 14:56:42 -0700 Subject: [openib-general] [PATCH] optional functions, ib_mad update In-Reply-To: <20040901221337.GM5403@cup.hp.com> References: <20040901084549.279d22bb.mshefty@ichips.intel.com> <20040901221337.GM5403@cup.hp.com> Message-ID: <20040901145642.52b07c75.mshefty@ichips.intel.com> On Wed, 1 Sep 2004 15:13:37 -0700 Grant Grundler wrote: > Why not leave this as an inline function? 
Here's a patch that makes optional function inline. - Sean Index: ib_verbs.h =================================================================== --- ib_verbs.h (revision 716) +++ ib_verbs.h (working copy) @@ -771,14 +771,26 @@ return device->query_pkey(device, port_num, index, pkey); } -int ib_modify_device(struct ib_device *device, - int device_modify_mask, - struct ib_device_modify *device_modify); - -int ib_modify_port(struct ib_device *device, - u8 port_num, - int port_modify_mask, - struct ib_port_modify *port_modify); +static inline int ib_modify_device(struct ib_device *device, + int device_modify_mask, + struct ib_device_modify *device_modify) +{ + return device->modify_device ? + device->modify_device(device, device_modify_mask, + device_modify) : + -ENOSYS; +} + +static inline int ib_modify_port(struct ib_device *device, + u8 port_num, + int port_modify_mask, + struct ib_port_modify *port_modify) +{ + return device->modify_port ? + device->modify_port(device, port_num, port_modify_mask, + port_modify) : + -ENOSYS; +} struct ib_pd *ib_alloc_pd(struct ib_device *device); @@ -787,11 +799,21 @@ struct ib_ah *ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr); -int ib_modify_ah(struct ib_ah *ah, - struct ib_ah_attr *ah_attr); +static inline int ib_modify_ah(struct ib_ah *ah, + struct ib_ah_attr *ah_attr) +{ + return ah->device->modify_ah ? + ah->device->modify_ah(ah, ah_attr) : + -ENOSYS; +} -int ib_query_ah(struct ib_ah *ah, - struct ib_ah_attr *ah_attr); +static inline int ib_query_ah(struct ib_ah *ah, + struct ib_ah_attr *ah_attr) +{ + return ah->device->query_ah ? 
+ ah->device->query_ah(ah, ah_attr) : + -ENOSYS; +} int ib_destroy_ah(struct ib_ah *ah); @@ -807,10 +829,16 @@ return qp->device->modify_qp(qp, qp_attr, qp_attr_mask, qp_cap); } -int ib_query_qp(struct ib_qp *qp, - struct ib_qp_attr *qp_attr, - int qp_attr_mask, - struct ib_qp_init_attr *qp_init_attr); +static inline int ib_query_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask, + struct ib_qp_init_attr *qp_init_attr) +{ + return qp->device->query_qp ? + qp->device->query_qp(qp, qp_attr, qp_attr_mask, qp_init_attr) : + -ENOSYS; + +} int ib_destroy_qp(struct ib_qp *qp); @@ -818,13 +846,24 @@ void *srq_context, struct ib_srq_attr *srq_attr); -int ib_modify_srq(struct ib_srq *srq, - struct ib_pd *pd, - struct ib_srq_attr *srq_attr, - int srq_attr_mask); +static inline int ib_modify_srq(struct ib_srq *srq, + struct ib_pd *pd, + struct ib_srq_attr *srq_attr, + int srq_attr_mask) +{ + return srq->device->modify_srq ? + srq->device->modify_srq(srq, pd, srq_attr, srq_attr_mask) : + -ENOSYS; +} + +static inline int ib_query_srq(struct ib_srq *srq, + struct ib_srq_attr *srq_attr) +{ + return srq->device->query_srq ? + srq->device->query_srq(srq, srq_attr) : + -ENOSYS; +} -int ib_query_srq(struct ib_srq *srq, - struct ib_srq_attr *srq_attr); static inline int ib_post_srq(struct ib_srq *srq, struct ib_recv_wr *recv_wr, @@ -840,8 +879,13 @@ void *cq_context, int cqe); -int ib_resize_cq(struct ib_cq *cq, - int cqe); +static inline int ib_resize_cq(struct ib_cq *cq, + int cqe) +{ + return cq->device->resize_cq ? + cq->device->resize_cq(cq, cqe) : + -ENOSYS; +} int ib_destroy_cq(struct ib_cq *cq); @@ -851,8 +895,13 @@ int mr_access_flags, u64 *iova_start); -int ib_query_mr(struct ib_mr *mr, - struct ib_mr_attr *mr_attr); +static inline int ib_query_mr(struct ib_mr *mr, + struct ib_mr_attr *mr_attr) +{ + return mr->device->query_mr ? 
+ mr->device->query_mr(mr, mr_attr) : + -ENOSYS; +} int ib_dereg_mr(struct ib_mr *mr); @@ -895,13 +944,21 @@ int ib_dealloc_fmr(struct ib_fmr *fmr); -int ib_attach_mcast(struct ib_qp *qp, - union ib_gid *gid, - u16 lid); - -int ib_detach_mcast(struct ib_qp *qp, - union ib_gid *gid, - u16 lid); +static inline int ib_attach_mcast(struct ib_qp *qp, + union ib_gid *gid, + u16 lid) +{ + return qp->device->attach_mcast ? + qp->device->attach_mcast(qp, gid, lid) : + -ENOSYS; +} + +static inline int ib_detach_mcast(struct ib_qp *qp, + union ib_gid *gid, + u16 lid) +{ + return qp->device->detach_mcast(qp, gid, lid); +} static inline int ib_post_send(struct ib_qp *qp, struct ib_send_wr *send_wr, Index: ib_verbs.c =================================================================== --- ib_verbs.c (revision 714) +++ ib_verbs.c (working copy) @@ -26,33 +26,6 @@ #include #include -/* Device */ - -int ib_modify_device(struct ib_device *device, - int device_modify_mask, - struct ib_device_modify *device_modify) -{ - if (!device->modify_device) - return -ENOSYS; - - return device->modify_device(device, device_modify_mask, - device_modify); -} -EXPORT_SYMBOL(ib_modify_device); - -int ib_modify_port(struct ib_device *device, - u8 port_num, - int port_modify_mask, - struct ib_port_modify *port_modify) -{ - if (!device->modify_port) - return -ENOSYS; - - return device->modify_port(device, port_num, port_modify_mask, - port_modify); -} -EXPORT_SYMBOL(ib_modify_port); - /* Protection domain */ struct ib_pd *ib_alloc_pd(struct ib_device *device) @@ -98,26 +71,6 @@ } EXPORT_SYMBOL(ib_create_ah); -int ib_modify_ah(struct ib_ah *ah, - struct ib_ah_attr *ah_attr) -{ - if (!ah->device->modify_ah) - return -ENOSYS; - - return ah->device->modify_ah(ah, ah_attr); -} -EXPORT_SYMBOL(ib_modify_ah); - -int ib_query_ah(struct ib_ah *ah, - struct ib_ah_attr *ah_attr) -{ - if (!ah->device->query_ah) - return -ENOSYS; - - return ah->device->query_ah(ah, ah_attr); -} -EXPORT_SYMBOL(ib_query_ah); - int 
ib_destroy_ah(struct ib_ah *ah) { struct ib_pd *pd; @@ -162,18 +115,6 @@ } EXPORT_SYMBOL(ib_create_qp); -int ib_query_qp(struct ib_qp *qp, - struct ib_qp_attr *qp_attr, - int qp_attr_mask, - struct ib_qp_init_attr *qp_init_attr) -{ - if (!qp->device->query_qp) - return -ENOSYS; - - return qp->device->query_qp(qp, qp_attr, qp_attr_mask, qp_init_attr); -} -EXPORT_SYMBOL(ib_query_qp); - int ib_destroy_qp(struct ib_qp *qp) { struct ib_pd *pd; @@ -224,28 +165,6 @@ } EXPORT_SYMBOL(ib_create_srq); -int ib_modify_srq(struct ib_srq *srq, - struct ib_pd *pd, - struct ib_srq_attr *srq_attr, - int srq_attr_mask) -{ - if (!srq->device->modify_srq) - return -ENOSYS; - - return srq->device->modify_srq(srq, pd, srq_attr, srq_attr_mask); -} -EXPORT_SYMBOL(ib_modify_srq); - -int ib_query_srq(struct ib_srq *srq, - struct ib_srq_attr *srq_attr) -{ - if (!srq->device->query_srq) - return -ENOSYS; - - return srq->device->query_srq(srq, srq_attr); -} -EXPORT_SYMBOL(ib_query_srq); - int ib_destroy_srq(struct ib_srq *srq) { struct ib_pd *pd; @@ -286,16 +205,6 @@ } EXPORT_SYMBOL(ib_create_cq); -int ib_resize_cq(struct ib_cq *cq, - int cqe) -{ - if (!cq->device->resize_cq) - return -ENOSYS; - - return cq->device->resize_cq(cq, cqe); -} -EXPORT_SYMBOL(ib_resize_cq); - int ib_destroy_cq(struct ib_cq *cq) { if (atomic_read(&cq->usecnt)) @@ -329,16 +238,6 @@ } EXPORT_SYMBOL(ib_reg_phys_mr); -int ib_query_mr(struct ib_mr *mr, - struct ib_mr_attr *mr_attr) -{ - if (!mr->device->query_mr) - return -ENOSYS; - - return mr->device->query_mr(mr, mr_attr); -} -EXPORT_SYMBOL(ib_query_mr); - int ib_dereg_mr(struct ib_mr *mr) { struct ib_pd *pd; @@ -461,22 +360,3 @@ return ret; } EXPORT_SYMBOL(ib_dealloc_fmr); - -/* Multicast */ - -int ib_attach_mcast(struct ib_qp *qp, - union ib_gid *gid, - u16 lid) -{ - if (!qp->device->attach_mcast) - return -ENOSYS; - - return qp->device->attach_mcast(qp, gid, lid); -} - -int ib_detach_mcast(struct ib_qp *qp, - union ib_gid *gid, - u16 lid) -{ - return 
qp->device->detach_mcast(qp, gid, lid); -} From iod00d at hp.com Wed Sep 1 15:59:16 2004 From: iod00d at hp.com (Grant Grundler) Date: Wed, 1 Sep 2004 15:59:16 -0700 Subject: [openib-general] [PATCH] optional functions, ib_mad update In-Reply-To: <20040901143236.35e0a86d.mshefty@ichips.intel.com> References: <20040901084549.279d22bb.mshefty@ichips.intel.com> <20040901221337.GM5403@cup.hp.com> <20040901143236.35e0a86d.mshefty@ichips.intel.com> Message-ID: <20040901225916.GO5403@cup.hp.com> On Wed, Sep 01, 2004 at 02:32:36PM -0700, Sean Hefty wrote: > Yes, and I'm aware that I've combined several patches lately. > Sorry, I'll try to stop. Well, if whoever is committing the patches isn't complaining, it's not a big deal. People have gotten irritated with me when I've combined trivial patches. grant From roland at topspin.com Wed Sep 1 15:56:24 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 01 Sep 2004 15:56:24 -0700 Subject: [openib-general] MAD queuing model In-Reply-To: <20040901221230.GD26044@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 2 Sep 2004 01:12:30 +0300") References: <20040901092849.10f6a8ab.mshefty@ichips.intel.com> <1094067037.1969.249.camel@localhost.localdomain> <20040901113949.7e05d225.mshefty@ichips.intel.com> <20040901221230.GD26044@mellanox.co.il> Message-ID: <52wtzdy4x3.fsf@topspin.com> Michael> I'd like to propose a simpler approach: if the Q is full, Michael> drop the MAD without any indication to the user (pretend Michael> the MAD was sent), and let RMPP retry. It's an interesting idea (sort of the most trivial possible queue scheduler). However I don't see how you handle the fact that the consumer will be waiting for a send completion to free the MAD, and I think the consumer will rely on completions occurring in order. - R. 
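[Editor's sketch] The ordering concern raised above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: every name here (send_queue, post_or_drop, hw_complete_oldest, SQ_DEPTH) is hypothetical and none of it is the OpenIB MAD API. It assumes a two-deep send queue and a "drop when full, pretend it was sent" policy, and records the order in which completions reach the consumer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not the OpenIB MAD API. */
#define SQ_DEPTH 2                  /* assumed send queue depth */
#define MAX_LOG  8                  /* size of the completion log */

struct mad {
	int id;
};

struct send_queue {
	struct mad *slot[SQ_DEPTH];     /* MADs posted and awaiting completion */
	size_t      used;
	int         completed[MAX_LOG]; /* ids in the order completions were reported */
	size_t      ncompleted;
};

/* Record that the consumer was told this MAD is done. */
static void report_completion(struct send_queue *sq, struct mad *m)
{
	if (sq->ncompleted < MAX_LOG)
		sq->completed[sq->ncompleted++] = m->id;
}

/*
 * Post a MAD. If the queue is full, the MAD is silently dropped and a
 * successful completion is reported right away, leaving recovery to the
 * sender's retry logic. Returns 1 if really posted, 0 if dropped.
 */
static int post_or_drop(struct send_queue *sq, struct mad *m)
{
	if (sq->used == SQ_DEPTH) {
		report_completion(sq, m);
		return 0;
	}
	sq->slot[sq->used++] = m;
	return 1;
}

/* Model the hardware completing the oldest posted MAD. */
static void hw_complete_oldest(struct send_queue *sq)
{
	size_t i;

	if (sq->used == 0)
		return;
	report_completion(sq, sq->slot[0]);
	for (i = 1; i < sq->used; i++)
		sq->slot[i - 1] = sq->slot[i];
	sq->used--;
}

/* Post three MADs into the two-deep queue, then let the hardware drain it. */
void demo(struct send_queue *sq, struct mad mads[3])
{
	size_t i;

	for (i = 0; i < 3; i++)
		post_or_drop(sq, &mads[i]);
	hw_complete_oldest(sq);
	hw_complete_oldest(sq);
}
```

In this model the third MAD is reported complete before the first two, so a consumer that relies on in-order completions would be surprised unless the dropped MAD is held back until the earlier sends finish.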
From roland at topspin.com Wed Sep 1 16:00:30 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 01 Sep 2004 16:00:30 -0700 Subject: [openib-general] [PATCH] optional functions, ib_mad update In-Reply-To: <20040901143236.35e0a86d.mshefty@ichips.intel.com> (Sean Hefty's message of "Wed, 1 Sep 2004 14:32:36 -0700") References: <20040901084549.279d22bb.mshefty@ichips.intel.com> <20040901221337.GM5403@cup.hp.com> <20040901143236.35e0a86d.mshefty@ichips.intel.com> Message-ID: <52oekpy4q9.fsf@topspin.com> Sean> I thought it might be more complicated as additional checks Sean> were added, but can switch it (and similarly formatted Sean> calls) back to inline for now, then move them later, if Sean> needed. I'd rather not go that route -- this is exactly why the kernel currently has a bunch of bloated inline functions that should never have been inline. If something isn't in the fast path, let's not inline it. - R. From mshefty at ichips.intel.com Wed Sep 1 15:10:06 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 1 Sep 2004 15:10:06 -0700 Subject: [openib-general] [PATCH] optional functions, ib_mad update In-Reply-To: <52oekpy4q9.fsf@topspin.com> References: <20040901084549.279d22bb.mshefty@ichips.intel.com> <20040901221337.GM5403@cup.hp.com> <20040901143236.35e0a86d.mshefty@ichips.intel.com> <52oekpy4q9.fsf@topspin.com> Message-ID: <20040901151006.129d5d1d.mshefty@ichips.intel.com> On Wed, 01 Sep 2004 16:00:30 -0700 Roland Dreier wrote: > Sean> I thought it might be more complicated as additional checks > Sean> were added, but can switch it (and similarly formatted > Sean> calls) back to inline for now, then move them later, if > Sean> needed. > > I'd rather not go that route -- this is exactly why the kernel > currently has a bunch of bloated inline functions that should never > have been inline. If something isn't in the fast path, let's not > inline it. Note that I didn't commit that last patch. I wanted to get some other responses first. 
The only optional calls that I did leave inline in my original tree were the CQ ones that I considered to be speed path operations. I can easily just ignore my last patch, and I'm open either way here... From halr at voltaire.com Wed Sep 1 16:23:51 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 01 Sep 2004 19:23:51 -0400 Subject: [openib-general] MAD queuing model In-Reply-To: <52wtzdy4x3.fsf@topspin.com> References: <20040901092849.10f6a8ab.mshefty@ichips.intel.com> <1094067037.1969.249.camel@localhost.localdomain> <20040901113949.7e05d225.mshefty@ichips.intel.com> <20040901221230.GD26044@mellanox.co.il> <52wtzdy4x3.fsf@topspin.com> Message-ID: <1094081031.1830.1.camel@localhost.localdomain> On Wed, 2004-09-01 at 18:56, Roland Dreier wrote: > Michael> I'd like to propose a simpler approach: if the Q is full, > Michael> drop the MAD without any indication to the user (pretend > Michael> the MAD was sent), and let RMPP retry. > > It's an interesting idea (sort of the most trivial possible queue > scheduler). However I don't see how you handle the fact that the > consumer will be waiting for a send completion to free the MAD, and I > think the consumer will rely on completions occurring in order. Couldn't a send completion with some error status code be fudged for this (if we were to go this way) ? 
-- Hal From mshefty at ichips.intel.com Wed Sep 1 15:26:11 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 1 Sep 2004 15:26:11 -0700 Subject: [openib-general] MAD queuing model In-Reply-To: <52wtzdy4x3.fsf@topspin.com> References: <20040901092849.10f6a8ab.mshefty@ichips.intel.com> <1094067037.1969.249.camel@localhost.localdomain> <20040901113949.7e05d225.mshefty@ichips.intel.com> <20040901221230.GD26044@mellanox.co.il> <52wtzdy4x3.fsf@topspin.com> Message-ID: <20040901152611.4ddf8409.mshefty@ichips.intel.com> On Wed, 01 Sep 2004 15:56:24 -0700 Roland Dreier wrote: > Michael> I'd like to propose a simpler approach: if the Q is full, > Michael> drop the MAD without any indication to the user (pretend > Michael> the MAD was sent), and let RMPP retry. > > It's an interesting idea (sort of the most trivial possible queue > scheduler). However I don't see how you handle the fact that the > consumer will be waiting for a send completion to free the MAD, and I > think the consumer will rely on completions occurring in order. I thought about this approach as well, and I think it could work (and probably fairly well for small to medium configurations). I'm not sure that I would consider it a "best effort" approach, but it avoids a lot of complexities trying to queue the MADs. It also puts less stress on the fabric and remote receive queues, avoids excessively long queue times, and consumes fewer system resources. (Maybe I undervalued this approach...) We could define the MAD interface such that the completions could come in any order. That would give us the most flexibility regarding queuing. I think out-of-order completions are necessary for MADs that require responses anyway. 
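[Editor's sketch] For comparison, the deferred send queue that Hal suggested earlier in the thread might look roughly like the following. It is a user-space model under assumptions, not the access layer implementation: mad_port, mad_send, mad_send_done and SQ_DEPTH are hypothetical names, and the send queue depth of two is arbitrary. MADs that do not fit on the QP are linked onto an overflow list and posted, oldest first, each time a send completion frees a slot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not the OpenIB MAD API. */
#define SQ_DEPTH 2                /* assumed QP send queue depth */
#define MAX_LOG  8                /* size of the posting log */

struct mad {
	int         id;
	struct mad *next;             /* link on the deferred list */
};

struct mad_port {
	size_t      posted;           /* work requests currently on the send queue */
	struct mad *defer_head;       /* oldest deferred MAD */
	struct mad *defer_tail;       /* newest deferred MAD */
	int         sent[MAX_LOG];    /* ids in the order they reached the send queue */
	size_t      nsent;
};

/* Model posting a work request to the QP. */
static void hw_post(struct mad_port *p, struct mad *m)
{
	p->posted++;
	if (p->nsent < MAX_LOG)
		p->sent[p->nsent++] = m->id;
}

/* Returns 1 if the MAD was posted immediately, 0 if it was deferred. */
int mad_send(struct mad_port *p, struct mad *m)
{
	m->next = NULL;
	/* Anything already deferred must go first to keep FIFO order. */
	if (p->posted < SQ_DEPTH && p->defer_head == NULL) {
		hw_post(p, m);
		return 1;
	}
	if (p->defer_tail)
		p->defer_tail->next = m;
	else
		p->defer_head = m;
	p->defer_tail = m;
	return 0;
}

/* Called on each send completion: free a slot, then post one deferred MAD. */
void mad_send_done(struct mad_port *p)
{
	struct mad *m;

	if (p->posted == 0)
		return;
	p->posted--;

	m = p->defer_head;
	if (m) {
		p->defer_head = m->next;
		if (p->defer_head == NULL)
			p->defer_tail = NULL;
		m->next = NULL;
		hw_post(p, m);
	}
}
```

Because deferred MADs are posted strictly in arrival order, this variant keeps send completions in order at the cost of holding the extra list, which is the trade-off being debated in this thread.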
From roland at topspin.com Wed Sep 1 16:29:40 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 01 Sep 2004 16:29:40 -0700 Subject: [openib-general] MAD queuing model In-Reply-To: <1094081031.1830.1.camel@localhost.localdomain> (Hal Rosenstock's message of "Wed, 01 Sep 2004 19:23:51 -0400") References: <20040901092849.10f6a8ab.mshefty@ichips.intel.com> <1094067037.1969.249.camel@localhost.localdomain> <20040901113949.7e05d225.mshefty@ichips.intel.com> <20040901221230.GD26044@mellanox.co.il> <52wtzdy4x3.fsf@topspin.com> <1094081031.1830.1.camel@localhost.localdomain> Message-ID: <52k6vdy3dn.fsf@topspin.com> Hal> Couldn't a send completion with some error status code be Hal> fudged for this (of we were to go this way) ? You don't need to even fudge an error code -- a successful status would be fine (since a MAD might be dropped after it's sent). The problem is that to deliver the completions in order, you would have to keep the MADs to be dropped around on a queue until the MADs on the send queue have completed. And if you do that, you might as well just post the queued MADs to the send queue anyway. - R. From mshefty at ichips.intel.com Wed Sep 1 16:10:51 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 1 Sep 2004 16:10:51 -0700 Subject: [openib-general] [PATCH] fix exporting ib_attach_mcast/ib_detach_mcast Message-ID: <20040901161051.237ea563.mshefty@ichips.intel.com> This patch fixes exporting the multicast functions. 
- Sean -- Index: ib_verbs.h =================================================================== --- ib_verbs.h (revision 716) +++ ib_verbs.h (working copy) @@ -899,9 +899,12 @@ union ib_gid *gid, u16 lid); -int ib_detach_mcast(struct ib_qp *qp, - union ib_gid *gid, - u16 lid); +static inline int ib_detach_mcast(struct ib_qp *qp, + union ib_gid *gid, + u16 lid) +{ + return qp->device->detach_mcast(qp, gid, lid); +} static inline int ib_post_send(struct ib_qp *qp, struct ib_send_wr *send_wr, Index: ib_verbs.c =================================================================== --- ib_verbs.c (revision 714) +++ ib_verbs.c (working copy) @@ -473,10 +473,4 @@ return qp->device->attach_mcast(qp, gid, lid); } - -int ib_detach_mcast(struct ib_qp *qp, - union ib_gid *gid, - u16 lid) -{ - return qp->device->detach_mcast(qp, gid, lid); -} +EXPORT_SYMBOL(ib_attach_mcast); From roland at topspin.com Wed Sep 1 17:36:38 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 01 Sep 2004 17:36:38 -0700 Subject: [openib-general] [PATCH] optional functions, ib_mad update In-Reply-To: <20040901151006.129d5d1d.mshefty@ichips.intel.com> (Sean Hefty's message of "Wed, 1 Sep 2004 15:10:06 -0700") References: <20040901084549.279d22bb.mshefty@ichips.intel.com> <20040901221337.GM5403@cup.hp.com> <20040901143236.35e0a86d.mshefty@ichips.intel.com> <52oekpy4q9.fsf@topspin.com> <20040901151006.129d5d1d.mshefty@ichips.intel.com> Message-ID: <524qmhy0a1.fsf@topspin.com> Sean> Note that I didn't commit that last patch. I wanted to get Sean> some other responses first. The only optional calls that I Sean> did leave inline in my original tree were the CQ ones that I Sean> considered to be speed path operations. I can easily just Sean> ignore my last patch, and I'm open either way here... 
I don't think it's a big deal either way, but having non-data path functions be inline makes ib_verbs.h messier and harder to read and starts us down the path towards having bloated slow path functions stay inline. So I would vote not to commit this change. - R. From mshefty at ichips.intel.com Wed Sep 1 16:54:54 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 1 Sep 2004 16:54:54 -0700 Subject: [openib-general] [PATCH] Change ib_map_fmr to ib_map_phys_fmr Message-ID: <20040901165454.372f5008.mshefty@ichips.intel.com> This patch replaces ib_map_fmr with ib_map_phys_fmr. - Sean -- Index: ib_verbs.h =================================================================== --- ib_verbs.h (revision 718) +++ ib_verbs.h (working copy) @@ -710,10 +710,10 @@ struct ib_fmr * (*alloc_fmr)(struct ib_pd *pd, int mr_access_flags, struct ib_fmr_attr *fmr_attr); - int (*map_fmr)(struct ib_fmr *fmr, void *addr, u64 size); int (*map_phys_fmr)(struct ib_fmr *fmr, struct ib_phys_buf *phys_buf_array, - int num_phys_buf); + int num_phys_buf, + u64 *iova_start); int (*unmap_fmr)(struct ib_fmr **fmr_array, int fmr_cnt); int (*dealloc_fmr)(struct ib_fmr *fmr); int (*attach_mcast)(struct ib_qp *qp, union ib_gid *gid, @@ -879,11 +879,13 @@ int mr_access_flags, struct ib_fmr_attr *fmr_attr); -static inline int ib_map_fmr(struct ib_fmr *fmr, - void *addr, - u64 size) +static inline int ib_map_phys_fmr(struct ib_fmr *fmr, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + u64 *iova_start) { - return fmr->device->map_fmr(fmr, addr, size); + return fmr->device->map_phys_fmr(fmr, phys_buf_array, num_phys_buf, + iova_start); } static inline int ib_unmap_fmr(struct ib_fmr **fmr_array, From mshefty at ichips.intel.com Wed Sep 1 17:04:18 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 1 Sep 2004 17:04:18 -0700 Subject: [openib-general] ib_unmap_fmr Message-ID: <20040901170418.01d3fae3.mshefty@ichips.intel.com> The current API for ib_unmap_fmr takes an array. 
A couple of questions: Is it worth it for ib_unmap_fmr to operate on more than a single fmr? If so, should a linked list be used instead? And, should the access layer API assume that the FMRs all come from the same device? My initial thought is to restrict ib_unmap_fmr to a single fmr, _but_ if there's a reason to unmap multiples, update struct ib_fmr to allow for a linked list. The access layer would assume all FMRs come from the same device in such a case, to avoid walking the list. - Sean -- From roland at topspin.com Wed Sep 1 18:41:44 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 01 Sep 2004 18:41:44 -0700 Subject: [openib-general] ib_unmap_fmr In-Reply-To: <20040901170418.01d3fae3.mshefty@ichips.intel.com> (Sean Hefty's message of "Wed, 1 Sep 2004 17:04:18 -0700") References: <20040901170418.01d3fae3.mshefty@ichips.intel.com> Message-ID: <52zn49wip3.fsf@topspin.com> Sean> The current API for ib_unmap_fmr takes an array. A couple Sean> of questions: Is it worth it for ib_unmap_fmr to operate on Sean> more than a single fmr? If so, should a linked list be used Sean> instead? And, should the access layer API assume that the Sean> FMRs all come from the same device? My initial thought is Sean> to restrict ib_unmap_fmr to a single fmr, _but_ if there's a Sean> reason to unmap multiples, update struct ib_fmr to allow for Sean> a linked list. The access layer would assume all FMRs come Sean> from the same device in such a case, to avoid walking the Sean> list. One of the main reasons that FMRs (in the Mellanox Tavor sense) are a win is that the cost of flushing the HCA's memory mapping cache can be amortized across multiple FMRs with a single unmap call. So having unmap_fmr operate on multiple FMRs at once is central to its usefulness. With that said I prefer using a linked list (and that's what I have on my branch), and I agree with the assumption that all the FMRs come from the same device. - R. 
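The amortization argument above can be illustrated with a small userspace sketch. This is not the access layer or mthca code: `mock_hca`, `mock_fmr`, and `mock_unmap_fmr_list` are hypothetical names, a plain `next` pointer stands in for the kernel's `struct list_head`, and the "flush" is only a counter. It assumes, as agreed in this thread, that every FMR on the chain belongs to the same device.

```c
#include <stddef.h>

/* Hypothetical stand-in for the HCA; it only counts expensive cache flushes. */
struct mock_hca {
	int flush_count;
};

/* Hypothetical, reduced FMR chained through a plain next pointer. */
struct mock_fmr {
	struct mock_hca *hca;
	struct mock_fmr *next;   /* NULL terminates the chain */
	int              mapped;
};

/*
 * Unmap every FMR on the chain, then flush the mapping cache once.
 * Returns the number of FMRs that were unmapped.
 */
static int mock_unmap_fmr_list(struct mock_fmr *head)
{
	struct mock_fmr *fmr;
	int n = 0;

	if (!head)
		return 0;

	for (fmr = head; fmr; fmr = fmr->next) {
		fmr->mapped = 0;
		++n;
	}

	/* All FMRs are assumed to share head's device, so one flush covers them. */
	head->hca->flush_count++;
	return n;
}
```

With three FMRs chained together, the sketch performs one flush rather than three, which is the point of letting unmap take more than one FMR per call.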
From iod00d at hp.com Wed Sep 1 20:32:24 2004 From: iod00d at hp.com (Grant Grundler) Date: Wed, 1 Sep 2004 20:32:24 -0700 Subject: [openib-general] [PATCH] mthca updates (2.6.8 dependent) In-Reply-To: <506C3D7B14CDD411A52C00025558DED605E00554@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED605E00554@mtlex01.yok.mtl.com> Message-ID: <20040902033224.GB8922@cup.hp.com> On Thu, Sep 02, 2004 at 01:06:43AM +0300, Dror Goldenberg wrote: > I was talking about regular interrupts in PCI express. yes - sorry. I had to re-read that paragraph. > For PCI express > MSI/MSI-X are plain DMA writes. However, good old interrupts in PCI > express don't go on external wire. They just go on the same bus like > the data, for that they use special PCI express messages. And, they > maintain ordering like any other DMA writes. So, although the > same semantics of "old interrupts" is preserved, the behavior is a bit > different in PCI express. Yes - I'm slightly aware of it but haven't worked with PCI express directly yet (i.e., collected or looked at PCI-express bus traces). thanks for the correction, grant From iod00d at hp.com Wed Sep 1 22:02:55 2004 From: iod00d at hp.com (Grant Grundler) Date: Wed, 1 Sep 2004 22:02:55 -0700 Subject: [openib-general] [PATCH] optional functions, ib_mad update In-Reply-To: <524qmhy0a1.fsf@topspin.com> References: <20040901084549.279d22bb.mshefty@ichips.intel.com> <20040901221337.GM5403@cup.hp.com> <20040901143236.35e0a86d.mshefty@ichips.intel.com> <52oekpy4q9.fsf@topspin.com> <20040901151006.129d5d1d.mshefty@ichips.intel.com> <524qmhy0a1.fsf@topspin.com> Message-ID: <20040902050255.GA9311@cup.hp.com> On Wed, Sep 01, 2004 at 05:36:38PM -0700, Roland Dreier wrote: > I don't think it's a big deal either way, but having non-data path > functions be inline makes ib_verbs.h messier and harder to read and > starts us down the path towards having bloated slow path functions > stay inline. So I would vote not to commit this change.
Roland, "bloated slow path" makes sense to me too and answers my original question. "messier and harder to read" - I don't buy that. Since most are inlined currently, whoever commits the change please add that comment about "bloated slow path" to the patch/commit that changes "non-data path" inline functions to regular functions. (then I promise I won't ask again :^) thanks, grant From mshefty at ichips.intel.com Thu Sep 2 07:58:04 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 2 Sep 2004 07:58:04 -0700 Subject: [openib-general] ib_unmap_fmr In-Reply-To: <52zn49wip3.fsf@topspin.com> References: <20040901170418.01d3fae3.mshefty@ichips.intel.com> <52zn49wip3.fsf@topspin.com> Message-ID: <20040902075804.25845963.mshefty@ichips.intel.com> On Wed, 01 Sep 2004 18:41:44 -0700 Roland Dreier wrote: > One of the main reasons that FMRs (in the Mellanox Tavor sense) are a > win is that the cost of flushing the HCA's memory mapping cache can be > amortized across multiple FMRs with a single unmap call. So having > unmap_fmr operate on multiple FMRs at once is central to its > usefulness. Got it. I guess I need to look closer at the implementation. Does the flushing occur when unmap is called, or is the flushing done at some later point? I was under the impression that it was the latter, but I could be wrong. > With that said I prefer using a linked list (and that's what I have on > my branch), and I agree with the assumption that all the FMRs come > from the same device. I did notice that your branch used a linked list, which seemed to make more sense to me than using an array.
From mshefty at ichips.intel.com Thu Sep 2 09:02:45 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 2 Sep 2004 09:02:45 -0700 Subject: [openib-general] [PATCH] for discussion - combine ib_send_mad_wr with ib_send_wr Message-ID: <20040902090245.21a4a939.mshefty@ichips.intel.com> Here's a patch that would combine ib_send_mad_wr with ib_send_wr, to optimize sending and completing MADs. As part of the change, there is a minor adjustment to the ib_wc structure that allows casting it to an ib_mad_send_wc structure to enable an additional optimization on MAD send completion processing. This patch is simply for discussion purposes at this point, to see if this is desirable. Thanks, - Sean -- Index: ib_verbs.h =================================================================== --- ib_verbs.h (revision 724) +++ ib_verbs.h (working copy) @@ -504,7 +504,8 @@ IB_SEND_FENCE = 1, IB_SEND_SIGNALED = (1<<1), IB_SEND_SOLICITED = (1<<2), - IB_SEND_INLINE = (1<<3) + IB_SEND_INLINE = (1<<3), + IB_SEND_MAD_GRH_VALID = (1<<4) }; struct ib_sge { @@ -537,6 +538,7 @@ u32 remote_qpn; u32 remote_qkey; u16 pkey_index; /* valid for GSI only */ + int timeout_ms; /* valid for MADs only */ } ud; } wr; }; @@ -591,8 +593,8 @@ struct ib_wc { u64 wr_id; enum ib_wc_status status; - enum ib_wc_opcode opcode; u32 vendor_err; + enum ib_wc_opcode opcode; u32 byte_len; u32 imm_data; u32 src_qp; Index: ib_mad.h =================================================================== --- ib_mad.h (revision 712) +++ ib_mad.h (working copy) @@ -128,56 +128,23 @@ u32 hi_tid; }; -enum ib_mad_flags { - IB_MAD_GRH_VALID = 1 -}; - -/** - * ib_mad_send_wr - send MAD work request. - * @list - Allows chaining together multiple requests. - * @context - User-controlled work request context. - * @sg_list - An array of scatter-gather entries, referencing the MAD's - * data buffer(s). The first entry must reference the standard MAD - * header, plus any RMPP header, if used. 
- * @num_sge - The number of scatter-gather entries. - * @mad_flags - Flags used to control the send operation. - * @ah - Address handle for the destination. - * @timeout_ms - Timeout value, in milliseconds, to wait for a response - * message. Set to 0 if no response is expected. - * @remote_qpn - Destination QP. - * @remote_qkey - Specifies the qkey used by remote QP. - * @pkey_index - Pkey index to use. Required when sending on QP1 only. - */ -struct ib_mad_send_wr { - struct list_head list; - void *context; - struct ib_sge *sg_list; - int num_sge; - int mad_flags; - struct ib_ah *ah; - int timeout_ms; - u32 remote_qpn; - u32 remote_qkey; - u16 pkey_index; -}; - /** * ib_mad_send_wc - MAD send completion information. - * @context - Context associated with the send MAD request. + * @wr_id - Work request identifier associated with the send MAD request. * @status - Completion status. * @vendor_err - Optional vendor error information returned with a failed * request. */ struct ib_mad_send_wc { - void *context; + u64 wr_id; enum ib_wc_status status; u32 vendor_err; }; /** * ib_mad_recv_wc - received MAD information. - * @context - For received response, set to the context specified for - * the corresponding send request. + * @wr_id - For received response, set to the work request identifier specified + * for the corresponding send request. * @grh - References a data buffer containing the global route header. * The data refereced by this buffer is only valid if the GRH is * valid. @@ -194,7 +161,7 @@ * An RMPP receive will be coalesced into a single data buffer. */ struct ib_mad_recv_wc { - void *context; + u64 wr_id; struct ib_grh *grh; struct ib_mad *mad; u32 length; @@ -263,10 +230,10 @@ * ib_mad_post_send - Posts a MAD to the send queue of the QP associated * with the registered client. * @mad_agent - Specifies the associated registration to post the send to. - * @mad_send_wr - Specifies the information needed to send the MAD. 
+ * @send_wr - Specifies the information needed to send the MAD. */ int ib_mad_post_send(struct ib_mad_agent *mad_agent, - struct ib_mad_send_wr *mad_send_wr); + struct ib_send_wr *send_wr); /** * ib_mad_qp_redir - Registers a QP for MAD services. From roland at topspin.com Thu Sep 2 09:12:00 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 02 Sep 2004 09:12:00 -0700 Subject: [openib-general] ib_unmap_fmr In-Reply-To: <20040902075804.25845963.mshefty@ichips.intel.com> (Sean Hefty's message of "Thu, 2 Sep 2004 07:58:04 -0700") References: <20040901170418.01d3fae3.mshefty@ichips.intel.com> <52zn49wip3.fsf@topspin.com> <20040902075804.25845963.mshefty@ichips.intel.com> Message-ID: <52brgowsz3.fsf@topspin.com> Sean> Got it. I guess I need to look closer at the Sean> implementation. Does the flushing occur when unmap is Sean> called, or is the flushing done at some later point? I was Sean> under the impression that it was the latter, but I could be Sean> wrong. unmap is the flush operation. One can also call map again on an fmr without unmapping it first, but the remap operation might leave the old mapping in the HCA's context. That's the tradeoff with this style of FMR: higher performance but less protection. - R. From halr at voltaire.com Thu Sep 2 11:08:55 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 02 Sep 2004 14:08:55 -0400 Subject: [openib-general] [PATCH] for discussion - combine ib_send_mad_wr with ib_send_wr In-Reply-To: <20040902090245.21a4a939.mshefty@ichips.intel.com> References: <20040902090245.21a4a939.mshefty@ichips.intel.com> Message-ID: <1094148534.1830.36.camel@localhost.localdomain> On Thu, 2004-09-02 at 12:02, Sean Hefty wrote: > Here's a patch that would combine ib_send_mad_wr with ib_send_wr, > to optimize sending and completing MADs. Sounds like a good thing to do.
> As part of the change, > there is a minor adjustment to the ib_wc structure that allows > casting it to an ib_mad_send_wc structure to enable an additional > optimization on MAD send completion processing. > > This patch is simply for discussion purposes at this point, > to see if this is desirable. Just a minor point/question: I forget why IB_SEND_MAD_GRH_VALID is needed. Isn't the grh setup properly in the AH which is supplied with the send WR ? However, the notion of IB_SEND_MAD is needed to determine whether the timeout_ms parameter is to be honored or not. -- Hal From mshefty at ichips.intel.com Thu Sep 2 10:17:39 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 2 Sep 2004 10:17:39 -0700 Subject: [openib-general] [PATCH] for discussion - combine ib_send_mad_wr with ib_send_wr In-Reply-To: <1094148534.1830.36.camel@localhost.localdomain> References: <20040902090245.21a4a939.mshefty@ichips.intel.com> <1094148534.1830.36.camel@localhost.localdomain> Message-ID: <20040902101739.7978f8e8.mshefty@ichips.intel.com> On Thu, 02 Sep 2004 14:08:55 -0400 Hal Rosenstock wrote: > Just a minor point/question: > > I forget why IB_SEND_MAD_GRH_VALID is needed. Isn't the grh setup > properly in the AH which is supplied with the send WR ? However, the > notion of IB_SEND_MAD is needed to determine whether the timeout_ms > parameter is to be honored or not. I think that you'll need to know if the GRH is being sent in order to do RMPP correctly. MADs are still sent by calling ib_mad_post_send, so we shouldn't need a new opcode to determine if timeout_ms is valid. From mshefty at ichips.intel.com Thu Sep 2 10:22:20 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 2 Sep 2004 10:22:20 -0700 Subject: [openib-general] [PATCH] allow chaining ib_fmr structures in a list Message-ID: <20040902102220.46d0dd05.mshefty@ichips.intel.com> This patch changes ib_unmap_fmr from taking an array to using a linked list. 
- Sean -- Index: ib_verbs.h =================================================================== --- ib_verbs.h (revision 724) +++ ib_verbs.h (working copy) @@ -117,9 +117,9 @@ struct ib_fmr { struct ib_device *device; struct ib_pd *pd; + struct list_head list; u32 lkey; u32 rkey; - atomic_t usecnt; }; enum ib_device_cap_flags { @@ -714,7 +714,7 @@ struct ib_phys_buf *phys_buf_array, int num_phys_buf, u64 *iova_start); - int (*unmap_fmr)(struct ib_fmr **fmr_array, int fmr_cnt); + int (*unmap_fmr)(struct ib_fmr *fmr); int (*dealloc_fmr)(struct ib_fmr *fmr); int (*attach_mcast)(struct ib_qp *qp, union ib_gid *gid, u16 lid); @@ -888,11 +888,9 @@ iova_start); } -static inline int ib_unmap_fmr(struct ib_fmr **fmr_array, - int fmr_cnt) +static inline int ib_unmap_fmr(struct ib_fmr *fmr) { - /* Requires all FMRs to come from same device. */ - return fmr_array[0]->device->unmap_fmr(fmr_array, fmr_cnt); + return fmr->device->unmap_fmr(fmr); } int ib_dealloc_fmr(struct ib_fmr *fmr); From roland at topspin.com Thu Sep 2 10:56:41 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 02 Sep 2004 10:56:41 -0700 Subject: [openib-general] [PATCH] Basic driver model/sysfs support Message-ID: <527jrcwo4m.fsf@topspin.com> This patch starts to add driver model/sysfs support. It adds ib_alloc_device/ib_dealloc_device so that the access layer can manage reference counting of struct ib_device (and free the ib_device in the driver model ->release() method). It also creates a basic "infiniband" class that devices belong to. Next step is to add attributes to the device and start killing off some of the info in /proc (and redundant ioctls). 
- Roland Index: src/linux-kernel/infiniband/include/ib_verbs.h =================================================================== --- src/linux-kernel/infiniband/include/ib_verbs.h (revision 715) +++ src/linux-kernel/infiniband/include/ib_verbs.h (working copy) @@ -593,7 +593,6 @@ char name[IB_DEVICE_NAME_MAX]; char *provider; - void *private; struct list_head core_list; void *core; void *mad; @@ -691,9 +690,20 @@ struct class_device class_dev; + enum { + IB_DEV_UNINITIALIZED, + IB_DEV_REGISTERED, + IB_DEV_UNREGISTERED + } reg_state; + u8 node_type; }; +struct ib_device *ib_alloc_device(size_t size); +void ib_dealloc_device(struct ib_device *device); +int ib_register_device (struct ib_device *device); +int ib_deregister_device(struct ib_device *device); + int ib_query_device(struct ib_device *device, struct ib_device_attr *device_attr); Index: src/linux-kernel/infiniband/include/ts_ib_provider.h =================================================================== --- src/linux-kernel/infiniband/include/ts_ib_provider.h (revision 715) +++ src/linux-kernel/infiniband/include/ts_ib_provider.h (working copy) @@ -26,18 +26,8 @@ #include -int ib_device_register (struct ib_device *device); -int ib_device_deregister(struct ib_device *device); - -void ib_completion_event_dispatch(struct ib_cq *cq); void ib_async_event_dispatch(struct ib_async_event_record *event_record); -/* Defines to support legacy code -- don't use the tsIb names in new code. 
*/ -#define tsIbDeviceRegister ib_device_register -#define tsIbDeviceDeregister ib_device_deregister -#define tsIbCompletionEventDispatch ib_completion_event_dispatch -#define tsIbAsyncEventDispatch ib_async_event_dispatch - #endif /* _TS_IB_PROVIDER_H */ /* Index: src/linux-kernel/infiniband/core/Makefile =================================================================== --- src/linux-kernel/infiniband/core/Makefile (revision 715) +++ src/linux-kernel/infiniband/core/Makefile (working copy) @@ -35,6 +35,7 @@ header_main.o \ header_ud.o \ ib_verbs.o \ + ib_sysfs.o \ core_main.o \ core_device.o \ core_fmr_pool.o \ Index: src/linux-kernel/infiniband/core/core_main.c =================================================================== --- src/linux-kernel/infiniband/core/core_main.c (revision 715) +++ src/linux-kernel/infiniband/core/core_main.c (working copy) @@ -21,17 +21,11 @@ $Id$ */ -#include "core_priv.h" - -#include "ts_kernel_trace.h" -#include "ts_kernel_services.h" - -#include #include - #include -#include +#include "core_priv.h" + MODULE_AUTHOR("Roland Dreier"); MODULE_DESCRIPTION("core kernel IB API"); MODULE_LICENSE("Dual BSD/GPL"); @@ -40,30 +34,26 @@ { int ret; - TS_REPORT_INIT(MOD_KERNEL_IB, - "Initializing core IB layer"); + ret = ib_sysfs_setup(); + if (ret) { + printk(KERN_WARNING "Couldn't create InfiniBand device class\n"); + return ret; + } ret = ib_create_proc_dir(); if (ret) { - TS_REPORT_WARN(MOD_KERNEL_IB, "Couldn't create IB core proc directory"); + ib_sysfs_cleanup(); + printk(KERN_WARNING "Couldn't create IB core proc directory\n"); return ret; } - TS_REPORT_INIT(MOD_KERNEL_IB, - "core IB layer initialized"); - return 0; } static void __exit ib_core_cleanup(void) { - TS_REPORT_CLEANUP(MOD_KERNEL_IB, - "Unloading core IB layer"); - ib_remove_proc_dir(); - - TS_REPORT_CLEANUP(MOD_KERNEL_IB, - "core IB layer unloaded"); + ib_sysfs_cleanup(); } module_init(ib_core_init); Index: src/linux-kernel/infiniband/core/core_priv.h 
=================================================================== --- src/linux-kernel/infiniband/core/core_priv.h (revision 715) +++ src/linux-kernel/infiniband/core/core_priv.h (working copy) @@ -80,6 +80,11 @@ void ib_completion_thread(struct list_head *entry, void *device_ptr); void ib_async_thread(struct list_head *entry, void *device_ptr); +int ib_device_register_sysfs(struct ib_device *device); +void ib_device_deregister_sysfs(struct ib_device *device); +int ib_sysfs_setup(void); +void ib_sysfs_cleanup(void); + #endif /* _CORE_PRIV_H */ /* Index: src/linux-kernel/infiniband/core/ib_sysfs.c =================================================================== --- src/linux-kernel/infiniband/core/ib_sysfs.c (revision 0) +++ src/linux-kernel/infiniband/core/ib_sysfs.c (revision 0) @@ -0,0 +1,70 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Topspin Corporation. All rights reserved. 
+*/ + +#include "core_priv.h" + +static void ib_device_release(struct class_device *cdev) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + + kfree(dev); +} + +static int ib_device_hotplug(struct class_device *dev, char **envp, + int num_envp, char *buffer, int buffer_size) +{ + /* What do we want to pass to userspace? GUID=? */ + return 0; +} + +static struct class ib_class = { + .name = "infiniband", + .release = ib_device_release, + .hotplug = ib_device_hotplug, +}; + +int ib_device_register_sysfs(struct ib_device *device) +{ + struct class_device *class_dev = &device->class_dev; + + class_dev->class = &ib_class; + class_dev->class_data = device; + strlcpy(class_dev->class_id, device->name, BUS_ID_SIZE); + + return class_device_register(class_dev); + + +} + +void ib_device_deregister_sysfs(struct ib_device *device) +{ + class_device_unregister(&device->class_dev); +} + +int ib_sysfs_setup(void) +{ + return class_register(&ib_class); +} + +void ib_sysfs_cleanup(void) +{ + class_unregister(&ib_class); +} Property changes on: src/linux-kernel/infiniband/core/ib_sysfs.c ___________________________________________________________________ Name: svn:keywords + Id Index: src/linux-kernel/infiniband/core/core_device.c =================================================================== --- src/linux-kernel/infiniband/core/core_device.c (revision 715) +++ src/linux-kernel/infiniband/core/core_device.c (working copy) @@ -21,9 +21,6 @@ $Id$ */ -#include "core_priv.h" - -#include "ts_kernel_trace.h" #include "ts_kernel_services.h" #include @@ -33,6 +30,8 @@ #include +#include "core_priv.h" + static LIST_HEAD(device_list); static LIST_HEAD(notifier_list); static DECLARE_MUTEX(device_lock); @@ -68,9 +67,8 @@ for (i = 0; i < sizeof mandatory_table / sizeof mandatory_table[0]; ++i) { if (!*(void **) ((void *) device + mandatory_table[i].offset)) { - TS_REPORT_WARN(MOD_KERNEL_IB, - "Device %s is missing mandatory function %s", - device->name, 
mandatory_table[i].name); + printk(KERN_WARNING "Device %s is missing mandatory function %s\n", + device->name, mandatory_table[i].name); return -EINVAL; } } @@ -122,8 +120,38 @@ return 0; } -int ib_device_register(struct ib_device *device) +struct ib_device *ib_alloc_device(size_t size) { + void *dev; + + BUG_ON(size < sizeof (struct ib_device)); + + dev = kmalloc(size, GFP_KERNEL); + if (!dev) + return NULL; + + memset(dev, 0, size); + + return dev; +} +EXPORT_SYMBOL(ib_alloc_device); + +void ib_dealloc_device(struct ib_device *device) +{ + if (device->reg_state == IB_DEV_UNINITIALIZED) { + kfree(device); + return; + } + + printk(KERN_ERR "device->reg_state = %d\n", device->reg_state); + BUG_ON(device->reg_state != IB_DEV_UNREGISTERED); + + ib_device_deregister_sysfs(device); +} +EXPORT_SYMBOL(ib_dealloc_device); + +int ib_register_device(struct ib_device *device) +{ struct ib_device_private *priv; struct ib_device_attr prop; int ret; @@ -143,8 +171,7 @@ priv = kmalloc(sizeof *priv, GFP_KERNEL); if (!priv) { - TS_REPORT_WARN(MOD_KERNEL_IB, - "Couldn't allocate private struct for %s", + printk(KERN_WARNING "Couldn't allocate private struct for %s\n", device->name); ret = -ENOMEM; goto out; @@ -154,8 +181,7 @@ ret = device->query_device(device, &prop); if (ret) { - TS_REPORT_WARN(MOD_KERNEL_IB, - "query_device failed for %s", + printk(KERN_WARNING "query_device failed for %s\n", device->name); goto out_free; } @@ -172,8 +198,7 @@ priv->port_data = kmalloc((priv->end_port + 1) * sizeof (struct ib_port_data), GFP_KERNEL); if (!priv->port_data) { - TS_REPORT_WARN(MOD_KERNEL_IB, - "Couldn't allocate port info for %s", + printk(KERN_WARNING "Couldn't allocate port info for %s\n", device->name); goto out_free; } @@ -190,8 +215,7 @@ ret = ib_cache_setup(device); if (ret) { - TS_REPORT_WARN(MOD_KERNEL_IB, - "Couldn't create device info cache for %s", + printk(KERN_WARNING "Couldn't create device info cache for %s\n", device->name); goto out_free_port; } @@ -201,20 +225,24 
@@ device, &priv->async_thread); if (ret) { - TS_REPORT_WARN(MOD_KERNEL_IB, - "Couldn't start async thread for %s", + printk(KERN_WARNING "Couldn't start async thread for %s\n", device->name); goto out_free_cache; } ret = ib_proc_setup(device, device->node_type == IB_NODE_SWITCH); if (ret) { - TS_REPORT_WARN(MOD_KERNEL_IB, - "Couldn't create /proc dir for %s", + printk(KERN_WARNING "Couldn't create /proc dir for %s\n", device->name); goto out_stop_async; } + if (ib_device_register_sysfs(device)) { + printk(KERN_WARNING "Couldn't register device %s with driver model\n", + device->name); + goto out_proc; + } + list_add_tail(&device->core_list, &device_list); { struct list_head *ptr; @@ -226,9 +254,14 @@ } } + device->reg_state = IB_DEV_REGISTERED; + up(&device_lock); return 0; + out_proc: + ib_proc_cleanup(device); + out_stop_async: tsKernelQueueThreadStop(priv->async_thread); @@ -245,17 +278,18 @@ up(&device_lock); return ret; } -EXPORT_SYMBOL(ib_device_register); +EXPORT_SYMBOL(ib_register_device); -int ib_device_deregister(struct ib_device *device) +int ib_deregister_device(struct ib_device *device) { struct ib_device_private *priv; + printk(KERN_ERR "device->reg_state = %d\n", device->reg_state); + priv = device->core; if (tsKernelQueueThreadStop(priv->async_thread)) { - TS_REPORT_WARN(MOD_KERNEL_IB, - "tsKernelThreadStop failed for %s async thread", + printk(KERN_WARNING "tsKernelThreadStop failed for %s async thread\n", device->name); } @@ -278,9 +312,12 @@ kfree(priv->port_data); kfree(priv); + device->reg_state = IB_DEV_UNREGISTERED; + printk(KERN_ERR "device->reg_state = %d\n", device->reg_state); + return 0; } -EXPORT_SYMBOL(ib_device_deregister); +EXPORT_SYMBOL(ib_deregister_device); struct ib_device *ib_device_get_by_name(const char *name) { Index: src/linux-kernel/infiniband/hw/mthca/mthca_dev.h =================================================================== --- src/linux-kernel/infiniband/hw/mthca/mthca_dev.h (revision 715) +++ 
src/linux-kernel/infiniband/hw/mthca/mthca_dev.h (working copy) @@ -178,7 +178,8 @@ }; struct mthca_dev { - struct pci_dev *pdev; + struct ib_device ib_dev; + struct pci_dev *pdev; int hca_type; unsigned long mthca_flags; @@ -209,8 +210,6 @@ struct mthca_pd driver_pd; struct mthca_mr driver_mr; - - struct ib_device ib_dev; }; #define mthca_printk(level, mdev, format, arg...) \ Index: src/linux-kernel/infiniband/hw/mthca/mthca_main.c =================================================================== --- src/linux-kernel/infiniband/hw/mthca/mthca_main.c (revision 715) +++ src/linux-kernel/infiniband/hw/mthca/mthca_main.c (working copy) @@ -495,14 +495,13 @@ } } - mdev = kmalloc(sizeof *mdev, GFP_KERNEL); + mdev = (struct mthca_dev *) ib_alloc_device(sizeof *mdev); if (!mdev) { dev_err(&pdev->dev, "Device struct alloc failed, " "aborting.\n"); err = -ENOMEM; goto err_out_free_res; } - memset(mdev, 0, sizeof *mdev); mdev->pdev = pdev; mdev->hca_type = id->driver_data; @@ -617,7 +616,7 @@ if (mdev->mthca_flags & MTHCA_FLAG_MSI) pci_disable_msi(pdev); - kfree(mdev); + ib_dealloc_device((struct ib_device *) mdev); err_out_free_res: mthca_release_regions(pdev, ddr_hidden); @@ -663,7 +662,7 @@ if (mdev->mthca_flags & MTHCA_FLAG_MSI) pci_disable_msi(pdev); - kfree(mdev); + ib_dealloc_device((struct ib_device *) mdev); mthca_release_regions(pdev, mdev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN); pci_disable_device(pdev); Index: src/linux-kernel/infiniband/hw/mthca/mthca_provider.c =================================================================== --- src/linux-kernel/infiniband/hw/mthca/mthca_provider.c (revision 715) +++ src/linux-kernel/infiniband/hw/mthca/mthca_provider.c (working copy) @@ -545,12 +545,12 @@ dev->ib_dev.detach_mcast = mthca_multicast_detach; dev->ib_dev.mad_process = mthca_process_mad; - return ib_device_register(&dev->ib_dev); + return ib_register_device(&dev->ib_dev); } void mthca_deregister_device(struct mthca_dev *dev) { - 
ib_device_deregister(&dev->ib_dev); + ib_deregister_device(&dev->ib_dev); } /* From roland at topspin.com Thu Sep 2 11:14:01 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 02 Sep 2004 11:14:01 -0700 Subject: [openib-general] [PATCH] for discussion - combine ib_send_mad_wr with ib_send_wr In-Reply-To: <1094148534.1830.36.camel@localhost.localdomain> (Hal Rosenstock's message of "Thu, 02 Sep 2004 14:08:55 -0400") References: <20040902090245.21a4a939.mshefty@ichips.intel.com> <1094148534.1830.36.camel@localhost.localdomain> Message-ID: <523c20wnbq.fsf@topspin.com> Hal> I forget why IB_SEND_MAD_GRH_VALID is needed. Isn't the grh Hal> setup properly in the AH which is supplied with the send WR ? Hal> However, the notion of IB_SEND_MAD is needed to determine Hal> whether the timeout_ms parameter is to be honored or not. I don't follow why we need IB_SEND_MAD... we're still going to have a separate entry point different from ib_post_send() for sending MADs (it doesn't make sense to add tests to a key data path function just for the sake of combining entry points). - Roland From roland at topspin.com Thu Sep 2 11:22:29 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 02 Sep 2004 11:22:29 -0700 Subject: [openib-general] [PATCH] for discussion - combine ib_send_mad_wr with ib_send_wr In-Reply-To: <20040902101739.7978f8e8.mshefty@ichips.intel.com> (Sean Hefty's message of "Thu, 2 Sep 2004 10:17:39 -0700") References: <20040902090245.21a4a939.mshefty@ichips.intel.com> <1094148534.1830.36.camel@localhost.localdomain> <20040902101739.7978f8e8.mshefty@ichips.intel.com> Message-ID: <52y8jsv8d6.fsf@topspin.com> Sean> I think that you'll need to know if the GRH is being sent in Sean> order to do RMPP correctly. I don't see why... can you expand on this point? 
Thanks, Roland From mshefty at ichips.intel.com Thu Sep 2 10:48:22 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 2 Sep 2004 10:48:22 -0700 Subject: [openib-general] [PATCH] for discussion - combine ib_send_mad_wr with ib_send_wr In-Reply-To: <52y8jsv8d6.fsf@topspin.com> References: <20040902090245.21a4a939.mshefty@ichips.intel.com> <1094148534.1830.36.camel@localhost.localdomain> <20040902101739.7978f8e8.mshefty@ichips.intel.com> <52y8jsv8d6.fsf@topspin.com> Message-ID: <20040902104822.2d257388.mshefty@ichips.intel.com> On Thu, 02 Sep 2004 11:22:29 -0700 Roland Dreier wrote: > Sean> I think that you'll need to know if the GRH is being sent in > Sean> order to do RMPP correctly. > > I don't see why... can you expand on this point? I was remembering from the SF stack where it was provided on sends, but it looks like that was only done in cases where the access layer would create the address handle for the user. So, it doesn't look like that flag is needed for sends, but only on receive completions. Thanks for catching this. From halr at voltaire.com Thu Sep 2 12:31:41 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 02 Sep 2004 15:31:41 -0400 Subject: [openib-general] [PATCH] for discussion - combine ib_send_mad_wr with ib_send_wr In-Reply-To: <20040902104822.2d257388.mshefty@ichips.intel.com> References: <20040902090245.21a4a939.mshefty@ichips.intel.com> <1094148534.1830.36.camel@localhost.localdomain> <20040902101739.7978f8e8.mshefty@ichips.intel.com> <52y8jsv8d6.fsf@topspin.com> <20040902104822.2d257388.mshefty@ichips.intel.com> Message-ID: <1094153500.1836.41.camel@localhost.localdomain> On Thu, 2004-09-02 at 13:48, Sean Hefty wrote: > On Thu, 02 Sep 2004 11:22:29 -0700 > Roland Dreier wrote: > > > Sean> I think that you'll need to know if the GRH is being sent in > > Sean> order to do RMPP correctly. > > > > I don't see why... can you expand on this point? 
> > I was remembering from the SF stack where it was provided on sends, but it looks like that was only done in cases where the access layer would create the address handle for the user. So, it doesn't look like that flag is needed for sends, but only on receive completions. > > Thanks for catching this. Assuming the change as proposed will not be done: Will ib_mad_flags be going away based on the above ? Also, will the MAD and normal send WCs be made more similar as proposed ? -- Hal From mshefty at ichips.intel.com Thu Sep 2 11:48:35 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 2 Sep 2004 11:48:35 -0700 Subject: [openib-general] [PATCH] for discussion - combine ib_send_mad_wr with ib_send_wr In-Reply-To: <1094153500.1836.41.camel@localhost.localdomain> References: <20040902090245.21a4a939.mshefty@ichips.intel.com> <1094148534.1830.36.camel@localhost.localdomain> <20040902101739.7978f8e8.mshefty@ichips.intel.com> <52y8jsv8d6.fsf@topspin.com> <20040902104822.2d257388.mshefty@ichips.intel.com> <1094153500.1836.41.camel@localhost.localdomain> Message-ID: <20040902114835.7a6ca0da.mshefty@ichips.intel.com> On Thu, 02 Sep 2004 15:31:41 -0400 Hal Rosenstock wrote: > On Thu, 2004-09-02 at 13:48, Sean Hefty wrote: > > On Thu, 02 Sep 2004 11:22:29 -0700 > > Roland Dreier wrote: > > > > > Sean> I think that you'll need to know if the GRH is being sent in > > > Sean> order to do RMPP correctly. > > > > > > I don't see why... can you expand on this point? > > > > I was remembering from the SF stack where it was provided on sends, but it looks like that was only done in cases where the access layer would create the address handle for the user. So, it doesn't look like that flag is needed for sends, but only on receive completions. > > > > Thanks for catching this. > > Assuming the change as proposed will not be done: > > Will ib_mad_flags be going away based on the above ? 
In my working copy, I set ib_send_flags back as they were, and added ib_mad_flags back for use with ib_mad_recv_wc. I have another TODO to check on zero-copy RMPP receives, so ib_mad_recv_wc (and/or ib_recv_wc) could change slightly in the future. > Also, will the MAD and normal send WCs be made more similar as proposed This requires a very minor change to the send WC (swapping two fields), so I think it makes sense to do this. From halr at voltaire.com Thu Sep 2 12:59:39 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 02 Sep 2004 15:59:39 -0400 Subject: [openib-general] [PATCH] for discussion - combine ib_send_mad_wr with ib_send_wr In-Reply-To: <20040902114835.7a6ca0da.mshefty@ichips.intel.com> References: <20040902090245.21a4a939.mshefty@ichips.intel.com> <1094148534.1830.36.camel@localhost.localdomain> <20040902101739.7978f8e8.mshefty@ichips.intel.com> <52y8jsv8d6.fsf@topspin.com> <20040902104822.2d257388.mshefty@ichips.intel.com> <1094153500.1836.41.camel@localhost.localdomain> <20040902114835.7a6ca0da.mshefty@ichips.intel.com> Message-ID: <1094155178.1830.45.camel@localhost.localdomain> On Thu, 2004-09-02 at 14:48, Sean Hefty wrote: > > Also, will the MAD and normal send WCs be made more similar as proposed > > This requires a very minor change to the send WC (swapping two fields), > so I think it makes sense to do this. Also, will void *context become u64 wr_id in the send WC ? 
-- Hal From mshefty at ichips.intel.com Thu Sep 2 12:09:30 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 2 Sep 2004 12:09:30 -0700 Subject: [openib-general] [PATCH] for discussion - combine ib_send_mad_wr with ib_send_wr In-Reply-To: <1094155178.1830.45.camel@localhost.localdomain> References: <20040902090245.21a4a939.mshefty@ichips.intel.com> <1094148534.1830.36.camel@localhost.localdomain> <20040902101739.7978f8e8.mshefty@ichips.intel.com> <52y8jsv8d6.fsf@topspin.com> <20040902104822.2d257388.mshefty@ichips.intel.com> <1094153500.1836.41.camel@localhost.localdomain> <20040902114835.7a6ca0da.mshefty@ichips.intel.com> <1094155178.1830.45.camel@localhost.localdomain> Message-ID: <20040902120930.362a0c2c.mshefty@ichips.intel.com> On Thu, 02 Sep 2004 15:59:39 -0400 Hal Rosenstock wrote: > On Thu, 2004-09-02 at 14:48, Sean Hefty wrote: > > > Also, will the MAD and normal send WCs be made more similar as proposed > > > > This requires a very minor change to the send WC (swapping two fields), > > so I think it makes sense to do this. > > Also, will void *context become u64 wr_id in the send WC ? Aye - I believe this is in the patch. From krkumar at us.ibm.com Thu Sep 2 14:40:57 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Thu, 2 Sep 2004 14:40:57 -0700 (PDT) Subject: [openib-general] [PATCH] Basic driver model/sysfs support In-Reply-To: <527jrcwo4m.fsf@topspin.com> Message-ID: Looks like kfree(device) is missing after ib_device_deregister_sysfs. Maybe it can be at a label at the end, which the top code can jump to if the device state is uninitialized. 
Rest looks fine to my sysfs-untrained eyes :-) - KK +void ib_dealloc_device(struct ib_device *device) +{ + if (device->reg_state == IB_DEV_UNINITIALIZED) + goto out; + + printk(KERN_ERR "device->reg_state = %d\n", device->reg_state); + BUG_ON(device->reg_state != IB_DEV_UNREGISTERED); + + ib_device_deregister_sysfs(device); + +out: + kfree(device); +} From roland at topspin.com Thu Sep 2 15:18:09 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 02 Sep 2004 15:18:09 -0700 Subject: [openib-general] [PATCH] Basic driver model/sysfs support In-Reply-To: (Krishna Kumar's message of "Thu, 2 Sep 2004 14:40:57 -0700 (PDT)") References: Message-ID: <52pt54tivy.fsf@topspin.com> Krishna> Looks like kfree(device) is missing after Krishna> ib_device_deregister_sysfs. Maybe it can be at a label Krishna> at the end, which the top code can jump to if the device Krishna> state is uninitialized. This is intentional -- ib_device_deregister_sysfs() calls class_device_unregister(), which will decrement the class_device's reference count. When it hits 0, ib_device_release() will be called to actually free the structure. (We only need the kfree in the uninitialized case, since that is when the class_device has not been registered with the driver model) - Roland From mshefty at ichips.intel.com Thu Sep 2 14:32:46 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 2 Sep 2004 14:32:46 -0700 Subject: [openib-general] [PATCH] for discussion - combine ib_send_mad_wr with ib_send_wr In-Reply-To: <20040902090245.21a4a939.mshefty@ichips.intel.com> References: <20040902090245.21a4a939.mshefty@ichips.intel.com> Message-ID: <20040902143246.740f942c.mshefty@ichips.intel.com> On Thu, 2 Sep 2004 09:02:45 -0700 Sean Hefty wrote: > Here's a patch that would combine ib_send_mad_wr with ib_send_wr, to optimize sending and completing MADs. 
As part of the change, there is a minor adjustment to the ib_wc structure that allows casting it to an ib_mad_send_wc structure to enable an additional optimization on MAD send completion processing. This is an updated patch based on our previous e-mails. If no one objects, I will be committing this change within the next day. - Sean Index: ib_verbs.h =================================================================== --- ib_verbs.h (revision 729) +++ ib_verbs.h (working copy) @@ -537,6 +537,7 @@ u32 remote_qpn; u32 remote_qkey; u16 pkey_index; /* valid for GSI only */ + int timeout_ms; /* valid for MADs only */ } ud; } wr; }; @@ -591,8 +592,8 @@ struct ib_wc { u64 wr_id; enum ib_wc_status status; - enum ib_wc_opcode opcode; u32 vendor_err; + enum ib_wc_opcode opcode; u32 byte_len; u32 imm_data; u32 src_qp; Index: ib_mad.h =================================================================== --- ib_mad.h (revision 728) +++ ib_mad.h (working copy) @@ -133,51 +133,22 @@ }; /** - * ib_mad_send_wr - send MAD work request. - * @list - Allows chaining together multiple requests. - * @context - User-controlled work request context. - * @sg_list - An array of scatter-gather entries, referencing the MAD's - * data buffer(s). The first entry must reference the standard MAD - * header, plus any RMPP header, if used. - * @num_sge - The number of scatter-gather entries. - * @mad_flags - Flags used to control the send operation. - * @ah - Address handle for the destination. - * @timeout_ms - Timeout value, in milliseconds, to wait for a response - * message. Set to 0 if no response is expected. - * @remote_qpn - Destination QP. - * @remote_qkey - Specifies the qkey used by remote QP. - * @pkey_index - Pkey index to use. Required when sending on QP1 only. 
- */ -struct ib_mad_send_wr { - struct list_head list; - void *context; - struct ib_sge *sg_list; - int num_sge; - int mad_flags; - struct ib_ah *ah; - int timeout_ms; - u32 remote_qpn; - u32 remote_qkey; - u16 pkey_index; -}; - -/** * ib_mad_send_wc - MAD send completion information. - * @context - Context associated with the send MAD request. + * @wr_id - Work request identifier associated with the send MAD request. * @status - Completion status. * @vendor_err - Optional vendor error information returned with a failed * request. */ struct ib_mad_send_wc { - void *context; + u64 wr_id; enum ib_wc_status status; u32 vendor_err; }; /** * ib_mad_recv_wc - received MAD information. - * @context - For received response, set to the context specified for - * the corresponding send request. + * @wr_id - For received response, set to the work request identifier specified + * for the corresponding send request. * @grh - References a data buffer containing the global route header. * The data refereced by this buffer is only valid if the GRH is * valid. @@ -194,7 +165,7 @@ * An RMPP receive will be coalesced into a single data buffer. */ struct ib_mad_recv_wc { - void *context; + u64 wr_id; struct ib_grh *grh; struct ib_mad *mad; u32 length; @@ -263,10 +234,10 @@ * ib_mad_post_send - Posts a MAD to the send queue of the QP associated * with the registered client. * @mad_agent - Specifies the associated registration to post the send to. - * @mad_send_wr - Specifies the information needed to send the MAD. + * @send_wr - Specifies the information needed to send the MAD. */ int ib_mad_post_send(struct ib_mad_agent *mad_agent, - struct ib_mad_send_wr *mad_send_wr); + struct ib_send_wr *send_wr); /** * ib_mad_qp_redir - Registers a QP for MAD services. 
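The field swap in struct ib_wc above is what lets the first three members line up with struct ib_mad_send_wc. A minimal user-space sketch of that layout check follows. It assumes stand-in enum definitions and a trimmed struct ib_wc, and the helper mad_send_wc_from_wc() is hypothetical (not part of the patch). The sketch copies the shared prefix with memcpy() to stay within ISO C, where the patch description talks about casting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in enums so the sketch compiles outside the kernel tree. */
enum ib_wc_status { IB_WC_SUCCESS, IB_WC_GENERAL_ERR };
enum ib_wc_opcode { IB_WC_SEND, IB_WC_RECV };

/* Leading members of struct ib_wc, in the order the patch gives them. */
struct ib_wc {
	uint64_t          wr_id;
	enum ib_wc_status status;
	uint32_t          vendor_err;
	enum ib_wc_opcode opcode;
	uint32_t          byte_len;
};

/* MAD send completion as defined by the patch. */
struct ib_mad_send_wc {
	uint64_t          wr_id;
	enum ib_wc_status status;
	uint32_t          vendor_err;
};

/* Every shared member must sit at the same offset in both structures. */
static_assert(offsetof(struct ib_wc, wr_id) ==
	      offsetof(struct ib_mad_send_wc, wr_id), "wr_id offset differs");
static_assert(offsetof(struct ib_wc, status) ==
	      offsetof(struct ib_mad_send_wc, status), "status offset differs");
static_assert(offsetof(struct ib_wc, vendor_err) ==
	      offsetof(struct ib_mad_send_wc, vendor_err),
	      "vendor_err offset differs");
static_assert(sizeof(struct ib_mad_send_wc) <= sizeof(struct ib_wc),
	      "MAD send completion must fit inside ib_wc");

/* Hypothetical helper: extract the MAD send completion from a generic one. */
static struct ib_mad_send_wc mad_send_wc_from_wc(const struct ib_wc *wc)
{
	struct ib_mad_send_wc out;

	/* Copy only the shared prefix; opcode and later members are dropped. */
	memcpy(&out, wc, sizeof out);
	return out;
}
```

With opcode left ahead of vendor_err, as in the pre-patch layout, the vendor_err assertion would fail at compile time.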
From krkumar at us.ibm.com Thu Sep 2 15:39:56 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Thu, 2 Sep 2004 15:39:56 -0700 (PDT) Subject: [openib-general] [PATCH] Basic driver model/sysfs support In-Reply-To: <52pt54tivy.fsf@topspin.com> Message-ID: > This is intentional -- ib_device_deregister_sysfs() calls > class_device_unregister(), which will decrement the class_device's > reference count. When it hits 0, ib_device_release() will be called > to actually free the structure. OK, this makes my eyes a little less sysfs-untrained :-) Thanks for the clarification. - KK From roland at topspin.com Thu Sep 2 19:28:48 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 02 Sep 2004 19:28:48 -0700 Subject: [openib-general] Async event handlers: per consumer or per QP/CQ? Message-ID: <524qmgt7a7.fsf@topspin.com> I'm starting to work on implementing a new async event handling scheme to match our new API, and I'm starting to think that our initial design is somewhat suboptimal. To recap, the plan was to put the async handler in struct ib_device and give each consumer a different copy of each device's struct ib_device. However, as I think about how to actually implement this, it starts to look like not such a good idea. If we want to give each consumer a copy of struct ib_device and still minimize the amount of pointer chasing that fast path functions do, it seems like this copy has to have all of the low-level driver's private state copied as well. If we do this, we now have the problem of coherency between different copies, etc. I'm sure these problems could be solved, but it seems like it will make things much more complicated than they need to be. I would propose putting a list of async handlers in struct ib_device for unaffiliated async events and putting an async handler function pointer in the QP/CQ struct. The argument against this was that it adds overhead to have all these duplicated function pointers. 
Right now (if I remove the refcnt and wait members, based on my plan to implement a better locking scheme in mthca), the sizes are: 32-bit 64-bit struct mthca_cq 76 bytes 104 bytes struct mthca_qp 156 bytes 224 bytes add in the fact that every CQ and QP will have at least a page of memory dedicated to the actual queues, and the overhead of one more pointer (4 or 8 bytes depending on architecture) seems like it's lost in the noise. In fact, since right now mthca is just using kmalloc() to allocate these structures, the sizes are getting rounded up to a power of 2 anyway, so adding another pointer member really will have zero impact on our memory usage. If/when we switch to having separate slab caches, the worst effect would be dropping from 39 mthca_cqs per 4K page down to 36 mthca_cqs on a 64-bit arch. In my mind the big simplification of the code far outweighs the slight additional memory usage. What are other people's thoughts? Thanks, Roland From ftillier at infiniconsys.com Thu Sep 2 20:41:41 2004 From: ftillier at infiniconsys.com (Fab Tillier) Date: Thu, 2 Sep 2004 20:41:41 -0700 Subject: [openib-general] Async event handlers: per consumer or per QP/CQ? In-Reply-To: <524qmgt7a7.fsf@topspin.com> Message-ID: <000101c49167$ed6450d0$655aa8c0@infiniconsys.com> > From: Roland Dreier [mailto:roland at topspin.com] > Sent: Thursday, September 02, 2004 7:29 PM > > In my mind the big simplification of the code far outweighs the slight > additional memory usage. What are other people's thoughts? > Memory is cheap so I agree with you - simple is better. 
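The sizes quoted above can be checked with a few lines of arithmetic. This is a rough sketch only: it models kmalloc() as rounding up to plain powers of two, as described in the message, and the function names, the 32-byte minimum bucket and the 4096-byte page are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Model of kmalloc() size classes as plain powers of two.
 * The 32-byte minimum is an assumption made for this sketch.
 */
static size_t kmalloc_bucket(size_t size)
{
	size_t bucket = 32;

	assert(size > 0);

	while (bucket < size)
		bucket <<= 1;
	return bucket;
}

/* Objects of obj_size bytes that fit in one page of a dedicated cache. */
static size_t objs_per_page(size_t obj_size, size_t page_size)
{
	assert(obj_size > 0);

	return page_size / obj_size;
}
```

Under that model a 104-byte mthca_cq and a 112-byte one both land in the 128-byte bucket, and a 224-byte mthca_qp and a 232-byte one both land in the 256-byte bucket, so the extra pointer costs nothing while kmalloc() is used. With a dedicated cache, objs_per_page(104, 4096) gives 39 and objs_per_page(112, 4096) gives 36, matching the figures in the message.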
- Fab From roland at topspin.com Fri Sep 3 07:32:01 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 03 Sep 2004 07:32:01 -0700 Subject: [openib-general] Re: [PATCH] Basic driver model/sysfs support In-Reply-To: <20040903103326.GA5257@kroah.com> (Greg KH's message of "Fri, 3 Sep 2004 12:33:26 +0200") References: <527jrcwo4m.fsf@topspin.com> <20040903103326.GA5257@kroah.com> Message-ID: <52vfevs9su.fsf@topspin.com> Greg> At first glance this looks good. Greg> What does 'tree /sys/class/infiniband/' look like with this Greg> code? pretty minimal: /sys/class/infiniband/ `-- mthca0 Is it better style to set class_dev.dev in the low-level driver or in our infiniband core layer when the low-level driver registers a device? With class_dev.dev set properly it's a little more interesting: /sys/class/infiniband/ `-- mthca0 |-- device -> ../../../devices/pci0000:00/0000:00:02.0/0000:01:1f.0/0000:03:01.0/0000:04:00.0 `-- driver -> ../../../bus/pci/drivers/ib_mthca Thanks, Roland From halr at voltaire.com Fri Sep 3 09:43:28 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 03 Sep 2004 12:43:28 -0400 Subject: [openib-general] [PATCH] ib_mad.h: Add IB_MGMT_MAX_METHODS definition Message-ID: <1094229808.1747.12.camel@localhost.localdomain> Add IB_MGMT_MAX_METHODS definition to ib_mad.h Index: ib_mad.h =================================================================== --- ib_mad.h (revision 733) +++ ib_mad.h (working copy) @@ -38,6 +38,8 @@ #define IB_MGMT_CLASS_CM 0x07 #define IB_MGMT_CLASS_SNMP 0x08 +#define IB_MGMT_MAX_METHODS 128 + #define IB_QP0 0 #define IB_QP1 cpu_to_be32(1) #define IB_QP1_QKEY cpu_to_be32(0x80010000) @@ -220,7 +222,7 @@ struct ib_mad_reg_req { u8 mgmt_class; u8 mgmt_class_version; - DECLARE_BITMAP(method_mask, 128); + DECLARE_BITMAP(method_mask, IB_MGMT_MAX_METHODS); }; /** From greg at kroah.com Fri Sep 3 03:33:26 2004 From: greg at kroah.com (Greg KH) Date: Fri, 3 Sep 2004 12:33:26 +0200 Subject: [openib-general] Re: [PATCH] Basic 
driver model/sysfs support In-Reply-To: <527jrcwo4m.fsf@topspin.com> References: <527jrcwo4m.fsf@topspin.com> Message-ID: <20040903103326.GA5257@kroah.com> On Thu, Sep 02, 2004 at 10:56:41AM -0700, Roland Dreier wrote: > This patch starts to add driver model/sysfs support. It adds > ib_alloc_device/ib_dealloc_device so that the access layer can manage > reference counting of struct ib_device (and free the ib_device in the > driver model ->release() method). It also creates a basic > "infiniband" class that devices belong to. > > Next step is to add attributes to the device and start killing off > some of the info in /proc (and redundant ioctls). At first glance this looks good. What does 'tree /sys/class/infiniband/' look like with this code? thanks, greg k-h From halr at voltaire.com Fri Sep 3 10:55:06 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 03 Sep 2004 13:55:06 -0400 Subject: [openib-general] [PATCH] ib_verbs.h: private in ib_device struct is unused currently Message-ID: <1094234105.1747.24.camel@localhost.localdomain> ib_verbs.h: private in ib_device struct is unused currently Index: ib_verbs.h =================================================================== --- ib_verbs.h (revision 736) +++ ib_verbs.h (working copy) @@ -627,7 +627,6 @@ char name[IB_DEVICE_NAME_MAX]; char *provider; - void *private; struct list_head core_list; void *core; void *mad; From sean.hefty at intel.com Fri Sep 3 11:08:22 2004 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 3 Sep 2004 11:08:22 -0700 Subject: [openib-general] Async event handlers: per consumer or per QP/CQ? In-Reply-To: <524qmgt7a7.fsf@topspin.com> Message-ID: FYI - I'm out on vacation today, so may be slow to respond. >To recap, the plan was to put the async handler in struct ib_device >give each consumer a different copy of each device's struct >ib_device. However, as I think about how to actually implement this, >it starts to look like not such a good idea. 
If we want to give each >consumer a copy of struct ib_device and still minimize the amount of >pointer chasing that fast path functions do, it seems like this copy >has to have all of the low-level driver's private state copied as well. Hmm... it seems like you'd want to reference a single data structure, rather than having copies. But, yes, then you end up following one more pointer to get to needed data. >In my mind the big simplification of the code far outweighs the slight >additional memory usage. What are other people's thoughts? I'm not opposed to this, especially if it seems like the way to go. From libor at topspin.com Fri Sep 3 16:07:45 2004 From: libor at topspin.com (Libor Michalek) Date: Fri, 3 Sep 2004 16:07:45 -0700 Subject: [openib-general] get_user_pages() vs. sys_mlock() and 2.6 kernel In-Reply-To: <4134FD31.4060301@ammasso.com>; from timur.tabi@ammasso.com on Tue, Aug 31, 2004 at 05:35:29PM -0500 References: <4134FD31.4060301@ammasso.com> Message-ID: <20040903160745.A6309@topspin.com> On Tue, Aug 31, 2004 at 05:35:29PM -0500, Timur Tabi wrote: > What is the reason that sys_mlock() is used instead of get_user_pages()? > I know that sys_mlock() is used because other methods of locking pages > didn't really lock the pages (i.e. there were still situations where the > page would be swapped out). Does get_user_pages() have that problem > also? If so, has anyone checked to see if it's been fixed in the 2.6 kernel? Yes, get_user_pages() is the call that has the problem you describe in the 2.4 kernel, which resulted in the use of sys_mlock. Actually it does the correct thing: it forces the pages to be resident and increments the reference count, but the swapper does not honor the reference count. The pte that points to that page can get unmapped and then mapped to another page, even though the reference count indicates that it should not happen. 
Also, the LOCK flag for the VM is checked before the pte is unmapped, so in the case of sys_mlock the problem does not happen. I've seen the problem in test cases, so it definitely can happen in 2.4. Looking at the 2.6 code the problem appears to be fixed, but I have not had a chance to run tests to verify it. A good place to take a look, if you are interested, is launder_page() and try_to_unmap() in the kernel. -Libor From greg at kroah.com Sat Sep 4 01:07:48 2004 From: greg at kroah.com (Greg KH) Date: Sat, 4 Sep 2004 10:07:48 +0200 Subject: [openib-general] Re: [PATCH] Basic driver model/sysfs support In-Reply-To: <52vfevs9su.fsf@topspin.com> References: <527jrcwo4m.fsf@topspin.com> <20040903103326.GA5257@kroah.com> <52vfevs9su.fsf@topspin.com> Message-ID: <20040904080747.GA21430@kroah.com> On Fri, Sep 03, 2004 at 07:32:01AM -0700, Roland Dreier wrote: > Greg> At first glance this looks good. > > Greg> What does 'tree /sys/class/infiniband/' look like with this > Greg> code? > > pretty minimal: > > /sys/class/infiniband/ > `-- mthca0 > > Is it better style to set class_dev.dev in the low-level driver or in > our infiniband core layer when the low-level driver registers a > device? The way the other classes work is that the low-level driver sets the struct device pointer up in the structure before registering with the class. This is because the low-level driver has access to the struct device, and usually the class knows nothing about it (actually, the class should not care about it at all.) > With class_dev.dev set properly it's a little more > interesting: > > /sys/class/infiniband/ > `-- mthca0 > |-- device -> ../../../devices/pci0000:00/0000:00:02.0/0000:01:1f.0/0000:03:01.0/0000:04:00.0 > `-- driver -> ../../../bus/pci/drivers/ib_mthca Yes, that's better :) Looks good to me. thanks, greg k-h p.s. Can someone please turn the "closed list" option off? If you all want to be a open mailing list, it's pretty rude to hold emails from non-list members. 
Almost all Linux development mailing lists accept email from anyone, list-member or not. p.s.s. And no, spam is not a valid reason for having such a policy, that can be handled properly by filters on the mail server, or filters by the users themselves. From mst at mellanox.co.il Sun Sep 5 00:47:09 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 5 Sep 2004 10:47:09 +0300 Subject: [openib-general] Multicast address aliasing in IPoIB In-Reply-To: <506C3D7B14CDD411A52C00025558DED605E00235@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED605E00235@mtlex01.yok.mtl.com> Message-ID: <20040905074709.GA29254@mellanox.co.il> Hello! Quoting Dror Goldenberg (gdror at mellanox.co.il) "[openib-general] Multicast address aliasing in IPoIB": > IPoIB defines no aliasing in the mapping of IP multicast address into IPoIB HW > addresses. > In Ethernet, there is an aliasing, i.e. more than one IP address can map into > the same > Ethernet multicast MAC address. > > In short: IP to Ether takes 24 LSbits from the IP address > IP to IB takes 28 LSbits from the IP address (which are > essentially the whole > IP address, the remaining 4 bits are "class D prefix"). > > The problem is that the current IPoIB driver interfaces the Linux kernel as if > it were an Ethernet driver. > Therefore, the IP layer will not notify the net_device when > a new MC > address is added if it maps to the same MAC address. It will rather increment > the > reference count of the MAC address (net_device->mc_list->dmi_user) and won't > call > net_device->set_multicast_list(). > Therefore, if a user just adds itself to an IP MC group (setsockopt with > IP_ADD_MEMBERSHIP), then if the IPoIB driver already has this Ether MAC address > in its filter because of a previous registration to another IP MC group, then > the IPoIB driver > will not get any notification, and the user will not get registered to the MCG. 
> > I was wondering what should be the solution for that in the current kernels > (gen1) and > in future kernels (gen2). > What about registering for all possible IB multicast groups, up front? There are 2^(28-24)=16 options, so you end up being registered in 16 multicast groups, which is not that huge an overhead. Upper layers of the IP protocol will filter the right packets as they do for ethernet. This is essentially what we do with IP over IB anyway - emulating broadcast with multicast. MST From roland at topspin.com Sun Sep 5 14:12:44 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 05 Sep 2004 14:12:44 -0700 Subject: [openib-general] Re: [PATCH] Basic driver model/sysfs support In-Reply-To: <20040904080747.GA21430@kroah.com> (Greg KH's message of "Sat, 4 Sep 2004 10:07:48 +0200") References: <527jrcwo4m.fsf@topspin.com> <20040903103326.GA5257@kroah.com> <52vfevs9su.fsf@topspin.com> <20040904080747.GA21430@kroah.com> Message-ID: <52vfespghf.fsf@topspin.com> Greg> The way the other classes work is that the low-level driver Greg> sets the struct device pointer up in the structure before Greg> registering with the class. This is because the low-level Greg> driver has access to the struct device, and usually the Greg> class knows nothing about it (actually, the class should not Greg> care about it at all.) OK, I've done it that way. Greg> p.s. Can someone please turn the "closed list" option off? Greg> If you all want to be a open mailing list, it's pretty rude Greg> to hold emails from non-list members. Almost all Linux Greg> development mailing lists accept email from anyone, Greg> list-member or not. Yes, I agree. Matt, can we make it so? 
Thanks, Roland From roland at topspin.com Sun Sep 5 14:22:35 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 05 Sep 2004 14:22:35 -0700 Subject: [openib-general] [PATCH] More sysfs support In-Reply-To: <20040904080747.GA21430@kroah.com> (Greg KH's message of "Sat, 4 Sep 2004 10:07:48 +0200") References: <527jrcwo4m.fsf@topspin.com> <20040903103326.GA5257@kroah.com> <52vfevs9su.fsf@topspin.com> <20040904080747.GA21430@kroah.com> Message-ID: <52r7pgpg10.fsf_-_@topspin.com> I implemented some attributes to fill out the sysfs directory. There's still a fair number more attributes to expose but I think this is the complete framework. It should be pretty quick to finish up so that the current /proc/infiniband/core tree can be killed. Greg, is it cool to create kobjects the way I do so that we can get the hierarchy of ports in sysfs (like mthca0/ports/1)? Also is there any better way to dynamically create the GID table attribute group? The code creates a class tree that looks like: /sys/class/infiniband/ `-- mthca0 |-- device -> ../../../devices/pci0000:00/0000:00:1f.0/0000:01:01.0/0000:02:00.0 |-- driver -> ../../../bus/pci/drivers/ib_mthca |-- node_guid `-- ports |-- 1 | |-- gids | | |-- 0 | | |-- 1 ... 29 more entries 2...30 snipped ... | | `-- 31 | |-- lid | `-- state `-- 2 |-- gids | |-- 0 | |-- 1 ... 29 more entries 2...30 snipped ... | `-- 31 |-- lid `-- state And the contents of the files look like: # cat /sys/class/infiniband/mthca0/node_guid 0005:ad00:0001:8211 # cat /sys/class/infiniband/mthca0/ports/1/state 4: ACTIVE # cat /sys/class/infiniband/mthca0/ports/1/gids/0 fe80:0000:0000:0000:0005:ad00:0001:8212 By the way, I chose the plurals "ports" and "gids" to match other sysfs usage like "devices" and "drivers." However I'm wondering if it might not make more sense to use "port" and "gid" so that one could have a path like mthca0/port/1/gid/0. Any opinions? 
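The gids/N output shown above is the 16 raw GID bytes printed as eight big-endian 16-bit groups. A minimal user-space sketch of that formatting follows, assuming a hypothetical helper format_gid() (the patch itself uses sprintf() with be16_to_cpu() in the kernel). Unlike the sysfs file, the sketch does not append a trailing newline.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a 16-byte GID as eight colon-separated
 * groups of four hex digits. Returns the number of characters written
 * (excluding the terminating NUL), or -1 if buf is too small.
 */
static int format_gid(const uint8_t raw[16], char *buf, size_t len)
{
	size_t used = 0;
	int i;

	assert(raw != NULL && buf != NULL);

	for (i = 0; i < 8; ++i) {
		/* Combine two bytes in network (big-endian) order. */
		unsigned int group =
			((unsigned int) raw[2 * i] << 8) | raw[2 * i + 1];
		int ret = snprintf(buf + used, len - used, "%s%04x",
				   i ? ":" : "", group);

		if (ret < 0 || (size_t) ret >= len - used)
			return -1;
		used += (size_t) ret;
	}

	return (int) used;
}
```

Assembling each group from individual bytes avoids the u16 pointer cast and the byte swap, so the same code works regardless of host endianness.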
Thanks, Roland Index: infiniband/include/ib_verbs.h =================================================================== --- infiniband/include/ib_verbs.h (revision 727) +++ infiniband/include/ib_verbs.h (working copy) @@ -689,6 +689,8 @@ ib_mad_process_func mad_process; struct class_device class_dev; + struct kobject ports_parent; + struct list_head port_list; enum { IB_DEV_UNINITIALIZED, Index: infiniband/core/ib_sysfs.c =================================================================== --- infiniband/core/ib_sysfs.c (revision 727) +++ infiniband/core/ib_sysfs.c (working copy) @@ -21,6 +21,139 @@ #include "core_priv.h" +struct ib_port { + struct kobject kobj; + struct ib_device *ibdev; + struct attribute_group gid_group; + struct attribute **gid_attr; + u8 port_num; +}; + +struct port_attribute { + struct attribute attr; + ssize_t (*show)(struct ib_port *, struct port_attribute *, char *buf); + ssize_t (*store)(struct ib_port *, struct port_attribute *, const char *buf, size_t count); +}; + +#define PORT_ATTR(_name, _mode, _show, _store) \ +struct port_attribute port_attr_##_name = { \ + .attr = { .name = __stringify(_name), .mode = _mode, .owner = THIS_MODULE }, \ + .show = _show, \ + .store = _store \ +} + +struct port_table_attribute { + struct port_attribute attr; + int index; +}; + +static ssize_t port_attr_show(struct kobject *kobj, + struct attribute *attr, char *buf) +{ + struct port_attribute *port_attr = + container_of(attr, struct port_attribute, attr); + struct ib_port *p = container_of(kobj, struct ib_port, kobj); + + if (!port_attr->show) + return 0; + + return port_attr->show(p, port_attr, buf); +} + +static struct sysfs_ops port_sysfs_ops = { + .show = port_attr_show +}; + +static ssize_t show_port_state(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + static const char *state_name[] = { + [IB_PORT_NOP] = "NOP", + [IB_PORT_DOWN] = "DOWN", + [IB_PORT_INIT] = "INIT", + [IB_PORT_ARMED] = 
"ARMED", + [IB_PORT_ACTIVE] = "ACTIVE", + [IB_PORT_ACTIVE_DEFER] = "ACTIVE_DEFER" + }; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "%d: %s\n", attr.state, + attr.state >= 0 && attr.state < ARRAY_SIZE(state_name) ? + state_name[attr.state] : "UNKNOWN"); +} + +static ssize_t show_port_lid(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "0x%x\n", attr.lid); +} + +PORT_ATTR(state, S_IRUGO, show_port_state, NULL); +PORT_ATTR(lid, S_IRUGO, show_port_lid, NULL); + +static struct attribute *port_default_attrs[] = { + &port_attr_state.attr, + &port_attr_lid.attr, + NULL +}; + +static ssize_t show_port_gid(struct ib_port *p, struct port_attribute *attr, + char *buf) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + union ib_gid gid; + ssize_t ret; + + ret = ib_query_gid(p->ibdev, p->port_num, tab_attr->index, &gid); + if (ret) + return ret; + + return sprintf(buf, "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", + be16_to_cpu(((u16 *) gid.raw)[0]), + be16_to_cpu(((u16 *) gid.raw)[1]), + be16_to_cpu(((u16 *) gid.raw)[2]), + be16_to_cpu(((u16 *) gid.raw)[3]), + be16_to_cpu(((u16 *) gid.raw)[4]), + be16_to_cpu(((u16 *) gid.raw)[5]), + be16_to_cpu(((u16 *) gid.raw)[6]), + be16_to_cpu(((u16 *) gid.raw)[7])); +} + +static void ib_port_release(struct kobject *kobj) +{ + struct ib_port *p = container_of(kobj, struct ib_port, kobj); + struct attribute *a; + int i; + + for (i = 0; (a = p->gid_attr[i]); ++i) { + kfree(a->name); + kfree(a); + } + + kfree(p->gid_attr); + kfree(p); +} + +static struct kobj_type port_type = { + .release = ib_port_release, + .sysfs_ops = &port_sysfs_ops, + .default_attrs = port_default_attrs +}; + static void ib_device_release(struct class_device *cdev) { struct ib_device *dev = 
container_of(cdev, struct ib_device, class_dev); @@ -35,6 +168,126 @@ return 0; } +static int add_port(struct ib_device *device, int port_num) +{ + struct ib_port *p; + struct ib_port_attr attr; + struct port_table_attribute **gid_attr; + int i; + int ret; + + ret = ib_query_port(device, port_num, &attr); + if (ret) + return ret; + + p = kmalloc(sizeof *p, GFP_KERNEL); + if (!p) + return -ENOMEM; + memset(p, 0, sizeof *p); + + p->ibdev = device; + p->port_num = port_num; + p->kobj.ktype = &port_type; + + p->kobj.parent = kobject_get(&device->ports_parent); + if (!p->kobj.parent) { + ret = -EBUSY; + goto err; + } + + ret = kobject_set_name(&p->kobj, "%d", port_num); + if (ret) + goto err_put; + + ret = kobject_register(&p->kobj); + if (ret) + goto err_put; + + p->gid_attr = kmalloc((1 + attr.gid_tbl_len) * sizeof *p->gid_attr, + GFP_KERNEL); + if (!p->gid_attr) { + ret = -ENOMEM; + goto err; + } + memset(p->gid_attr, 0, (1 + attr.gid_tbl_len) * sizeof *p->gid_attr); + + p->gid_group.name = "gids"; + p->gid_group.attrs = p->gid_attr; + + gid_attr = (struct port_table_attribute **) p->gid_attr; + + for (i = 0; i < attr.gid_tbl_len; ++i) { + gid_attr[i] = kmalloc(sizeof *gid_attr[i], GFP_KERNEL); + if (!gid_attr[i]) { + ret = -ENOMEM; + goto err_free; + } + memset(gid_attr[i], 0, sizeof *gid_attr[i]); + gid_attr[i]->attr.attr.name = kmalloc(8, GFP_KERNEL); + if (!gid_attr[i]->attr.attr.name) { + ret = -ENOMEM; + goto err_free; + } + + if (snprintf(gid_attr[i]->attr.attr.name, 8, "%d", i) >= 8) { + ret = -ENOMEM; + goto err_free; + } + + gid_attr[i]->attr.attr.mode = S_IRUGO; + gid_attr[i]->attr.attr.owner = THIS_MODULE; + gid_attr[i]->attr.show = show_port_gid; + gid_attr[i]->index = i; + } + + ret = sysfs_create_group(&p->kobj, &p->gid_group); + if (ret) + goto err_free; + + list_add_tail(&p->kobj.entry, &device->port_list); + + return 0; + +err_free: + for (i = 0; i < attr.gid_tbl_len; ++i) { + if (p->gid_attr[i]) + kfree(p->gid_attr[i]->name); + 
kfree(p->gid_attr[i]); + } + + kfree(p->gid_attr); + +err_put: + kobject_put(&device->ports_parent); + +err: + kfree(p); + return ret; +} + +static ssize_t show_node_guid(struct class_device *cdev, char *buf) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + struct ib_device_attr attr; + ssize_t ret; + + ret = ib_query_device(dev, &attr); + if (ret) + return ret; + + return sprintf(buf, "%04x:%04x:%04x:%04x\n", + be16_to_cpu(((u16 *) &attr.node_guid)[0]), + be16_to_cpu(((u16 *) &attr.node_guid)[1]), + be16_to_cpu(((u16 *) &attr.node_guid)[2]), + be16_to_cpu(((u16 *) &attr.node_guid)[3])); +} + +CLASS_DEVICE_ATTR(node_guid, S_IRUGO, show_node_guid, NULL); + +static struct class_device_attribute *ib_class_attributes[] = { + &class_device_attr_node_guid +}; + static struct class ib_class = { .name = "infiniband", .release = ib_device_release, @@ -44,18 +297,93 @@ int ib_device_register_sysfs(struct ib_device *device) { struct class_device *class_dev = &device->class_dev; + int ret; + int i; class_dev->class = &ib_class; class_dev->class_data = device; strlcpy(class_dev->class_id, device->name, BUS_ID_SIZE); - return class_device_register(class_dev); + INIT_LIST_HEAD(&device->port_list); + ret = class_device_register(class_dev); + if (ret) + goto err; + for (i = 0; i < ARRAY_SIZE(ib_class_attributes); ++i) { + ret = class_device_create_file(class_dev, ib_class_attributes[i]); + if (ret) + goto err_unregister; + } + + device->ports_parent.parent = kobject_get(&class_dev->kobj); + if (!device->ports_parent.parent) { + ret = -EBUSY; + goto err_unregister; + } + ret = kobject_set_name(&device->ports_parent, "ports"); + if (ret) + goto err_put; + ret = kobject_register(&device->ports_parent); + if (ret) + goto err_put; + + if (device->node_type == IB_NODE_SWITCH) { + ret = add_port(device, 0); + if (ret) + goto err_put; + } else { + struct ib_device_attr attr; + int i; + + ret = ib_query_device(device, &attr); + if (ret) + goto err_put; + + for 
(i = 1; i <= attr.phys_port_cnt; ++i) { + ret = add_port(device, i); + if (ret) + goto err_put; + } + } + + return 0; + +err_put: + { + struct kobject *p, *t; + struct ib_port *port; + + list_for_each_entry_safe(p, t, &device->port_list, entry) { + list_del(&p->entry); + port = container_of(p, struct ib_port, kobj); + sysfs_remove_group(p, &port->gid_group); + kobject_unregister(p); + } + } + + kobject_put(&class_dev->kobj); + +err_unregister: + class_device_unregister(class_dev); + +err: + return ret; } void ib_device_deregister_sysfs(struct ib_device *device) { + struct kobject *p, *t; + struct ib_port *port; + + list_for_each_entry_safe(p, t, &device->port_list, entry) { + list_del(&p->entry); + port = container_of(p, struct ib_port, kobj); + sysfs_remove_group(p,&port->gid_group); + kobject_unregister(p); + } + + kobject_unregister(&device->ports_parent); class_device_unregister(&device->class_dev); } From roland at topspin.com Sun Sep 5 14:24:30 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 05 Sep 2004 14:24:30 -0700 Subject: [openib-general] Async event handlers: per consumer or per QP/CQ? In-Reply-To: (Sean Hefty's message of "Fri, 3 Sep 2004 11:08:22 -0700") References: Message-ID: <52n004pfxt.fsf@topspin.com> Sean> Hmm... it seems like you'd want to reference a single data Sean> structure, rather than having copies. But, yes, then you Sean> end up following one more pointer to get to needed data. Given the amount of thought being put into optimizing slow path MAD operations, it definitely seems worth it to me to save one pointer dereference for data path operations.... Roland> In my mind the big simplification of the code far Roland> outweighs the slight additional memory usage. What are Roland> other people's thoughts? Sean> I'm not opposed to this, especially if it seems like the way Sean> to go. I think it makes the implementation much cleaner. I'm going to go ahead and implement it so that we have something concrete to criticize. -R. 
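The trade-off in this exchange can be made concrete with a small userspace sketch (not code from the tree). It contrasts keeping a copy of the handler data in every object with pointing every object at one shared structure. The thread does not show the actual structures, so every name here (`event_ctx`, `qp_copy`, `qp_shared`, `dispatch_*`) is hypothetical.

```c
#include <stddef.h>

/* Hypothetical context holding an async event handler and its argument. */
struct event_ctx {
	void (*handler)(int event, void *context);
	void *context;
};

/* Layout A: each object embeds its own copy (more memory, one load fewer). */
struct qp_copy {
	int qp_num;
	struct event_ctx ev;
};

/* Layout B: objects reference one shared structure (one extra dereference). */
struct qp_shared {
	int qp_num;
	struct event_ctx *ev;
};

/* Records the last event seen so the two dispatch paths can be compared. */
static int last_event;

static void record_event(int event, void *context)
{
	(void) context;
	last_event = event;
}

/* Handler data is read directly out of the object. */
static void dispatch_copy(struct qp_copy *qp, int event)
{
	qp->ev.handler(event, qp->ev.context);
}

/* Handler data is reached through the shared pointer first. */
static void dispatch_shared(struct qp_shared *qp, int event)
{
	qp->ev->handler(event, qp->ev->context);
}
```

Both paths call the same handler; the only difference is whether the handler pointer lives inside the object or behind `qp->ev`, which is the extra pointer chase mentioned above.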
From yaronh at voltaire.com Mon Sep 6 06:29:18 2004 From: yaronh at voltaire.com (Yaron Haviv) Date: Mon, 6 Sep 2004 16:29:18 +0300 Subject: [openib-general] [PATCH] More sysfs support Message-ID: <35EA21F54A45CB47B879F21A91F4862F1DED9F@taurus.voltaire.com> Probably need to add P_Key's and some other node/port attributes Yaron > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Roland Dreier > Sent: Monday, September 06, 2004 12:23 AM > To: openib-general at openib.org > Cc: Greg KH > Subject: [openib-general] [PATCH] More sysfs support > > I implemented some attributes to fill out the sysfs directory. > There's still a fair number more attributes to expose but I think this > is the complete framework. It should be pretty quick to finish up so > that the current /proc/infiniband/core tree can be killed. > > Greg, is it cool to create kobjects the way I do so that we can get > the hierarchy of ports in sysfs (like mthca0/ports/1)? Also is there > any better way to dynamically create the GID table attribute group? > > > The code creates a class tree that looks like: > > /sys/class/infiniband/ > `-- mthca0 > |-- device -> > ../../../devices/pci0000:00/0000:00:1f.0/0000:01:01.0/0000:02:00.0 > |-- driver -> ../../../bus/pci/drivers/ib_mthca > |-- node_guid > `-- ports > |-- 1 > | |-- gids > | | |-- 0 > | | |-- 1 > ... 29 more entries 2...30 snipped ... > | | `-- 31 > | |-- lid > | `-- state > `-- 2 > |-- gids > | |-- 0 > | |-- 1 > ... 29 more entries 2...30 snipped ... > | `-- 31 > |-- lid > `-- state > > And the contents of the files look like: > > # cat /sys/class/infiniband/mthca0/node_guid > 0005:ad00:0001:8211 > > # cat /sys/class/infiniband/mthca0/ports/1/state > 4: ACTIVE > > # cat /sys/class/infiniband/mthca0/ports/1/gids/0 > fe80:0000:0000:0000:0005:ad00:0001:8212 > > By the way, I chose the plurals "ports" and "gids" to match other > sysfs usage like "devices" and "drivers." 
However I'm wondering if it > might not make more sense to use "port" and "gid" so that one could > have a path like mthca0/port/1/gid/0. Any opinions? > > Thanks, > Roland > > Index: infiniband/include/ib_verbs.h > =================================================================== > --- infiniband/include/ib_verbs.h (revision 727) > +++ infiniband/include/ib_verbs.h (working copy) > @@ -689,6 +689,8 @@ > ib_mad_process_func mad_process; > > struct class_device class_dev; > + struct kobject ports_parent; > + struct list_head port_list; > > enum { > IB_DEV_UNINITIALIZED, > Index: infiniband/core/ib_sysfs.c > =================================================================== > --- infiniband/core/ib_sysfs.c (revision 727) > +++ infiniband/core/ib_sysfs.c (working copy) > @@ -21,6 +21,139 @@ > > #include "core_priv.h" > > +struct ib_port { > + struct kobject kobj; > + struct ib_device *ibdev; > + struct attribute_group gid_group; > + struct attribute **gid_attr; > + u8 port_num; > +}; > + > +struct port_attribute { > + struct attribute attr; > + ssize_t (*show)(struct ib_port *, struct port_attribute *, char > *buf); > + ssize_t (*store)(struct ib_port *, struct port_attribute *, const > char *buf, size_t count); > +}; > + > +#define PORT_ATTR(_name, _mode, _show, _store) \ > +struct port_attribute port_attr_##_name = { \ > + .attr = { .name = __stringify(_name), .mode = _mode, .owner = > THIS_MODULE }, \ > + .show = _show, \ > + .store = _store \ > +} > + > +struct port_table_attribute { > + struct port_attribute attr; > + int index; > +}; > + > +static ssize_t port_attr_show(struct kobject *kobj, > + struct attribute *attr, char *buf) > +{ > + struct port_attribute *port_attr = > + container_of(attr, struct port_attribute, attr); > + struct ib_port *p = container_of(kobj, struct ib_port, kobj); > + > + if (!port_attr->show) > + return 0; > + > + return port_attr->show(p, port_attr, buf); > +} > + > +static struct sysfs_ops port_sysfs_ops = { > + .show = 
port_attr_show > +}; > + > +static ssize_t show_port_state(struct ib_port *p, struct port_attribute > *unused, > + char *buf) > +{ > + struct ib_port_attr attr; > + ssize_t ret; > + > + static const char *state_name[] = { > + [IB_PORT_NOP] = "NOP", > + [IB_PORT_DOWN] = "DOWN", > + [IB_PORT_INIT] = "INIT", > + [IB_PORT_ARMED] = "ARMED", > + [IB_PORT_ACTIVE] = "ACTIVE", > + [IB_PORT_ACTIVE_DEFER] = "ACTIVE_DEFER" > + }; > + > + ret = ib_query_port(p->ibdev, p->port_num, &attr); > + if (ret) > + return ret; > + > + return sprintf(buf, "%d: %s\n", attr.state, > + attr.state >= 0 && attr.state <= ARRAY_SIZE(state_name) > ? > + state_name[attr.state] : "UNKNOWN"); > +} > + > +static ssize_t show_port_lid(struct ib_port *p, struct port_attribute > *unused, > + char *buf) > +{ > + struct ib_port_attr attr; > + ssize_t ret; > + > + ret = ib_query_port(p->ibdev, p->port_num, &attr); > + if (ret) > + return ret; > + > + return sprintf(buf, "0x%x\n", attr.lid); > +} > + > +PORT_ATTR(state, S_IRUGO, show_port_state, NULL); > +PORT_ATTR(lid, S_IRUGO, show_port_lid, NULL); > + > +static struct attribute *port_default_attrs[] = { > + &port_attr_state.attr, > + &port_attr_lid.attr, > + NULL > +}; > + > +static ssize_t show_port_gid(struct ib_port *p, struct port_attribute > *attr, > + char *buf) > +{ > + struct port_table_attribute *tab_attr = > + container_of(attr, struct port_table_attribute, attr); > + union ib_gid gid; > + ssize_t ret; > + > + ret = ib_query_gid(p->ibdev, p->port_num, tab_attr->index, &gid); > + if (ret) > + return ret; > + > + return sprintf(buf, "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", > + be16_to_cpu(((u16 *) gid.raw)[0]), > + be16_to_cpu(((u16 *) gid.raw)[1]), > + be16_to_cpu(((u16 *) gid.raw)[2]), > + be16_to_cpu(((u16 *) gid.raw)[3]), > + be16_to_cpu(((u16 *) gid.raw)[4]), > + be16_to_cpu(((u16 *) gid.raw)[5]), > + be16_to_cpu(((u16 *) gid.raw)[6]), > + be16_to_cpu(((u16 *) gid.raw)[7])); > +} > + > +static void ib_port_release(struct kobject *kobj) > 
+{ > + struct ib_port *p = container_of(kobj, struct ib_port, kobj); > + struct attribute *a; > + int i; > + > + for (i = 0; (a = p->gid_attr[i]); ++i) { > + kfree(a->name); > + kfree(a); > + } > + > + kfree(p->gid_attr); > + kfree(p); > +} > + > +static struct kobj_type port_type = { > + .release = ib_port_release, > + .sysfs_ops = &port_sysfs_ops, > + .default_attrs = port_default_attrs > +}; > + > static void ib_device_release(struct class_device *cdev) > { > struct ib_device *dev = container_of(cdev, struct ib_device, > class_dev); > @@ -35,6 +168,126 @@ > return 0; > } > > +static int add_port(struct ib_device *device, int port_num) > +{ > + struct ib_port *p; > + struct ib_port_attr attr; > + struct port_table_attribute **gid_attr; > + int i; > + int ret; > + > + ret = ib_query_port(device, port_num, &attr); > + if (ret) > + return ret; > + > + p = kmalloc(sizeof *p, GFP_KERNEL); > + if (!p) > + return -ENOMEM; > + memset(p, 0, sizeof *p); > + > + p->ibdev = device; > + p->port_num = port_num; > + p->kobj.ktype = &port_type; > + > + p->kobj.parent = kobject_get(&device->ports_parent); > + if (!p->kobj.parent) { > + ret = -EBUSY; > + goto err; > + } > + > + ret = kobject_set_name(&p->kobj, "%d", port_num); > + if (ret) > + goto err_put; > + > + ret = kobject_register(&p->kobj); > + if (ret) > + goto err_put; > + > + p->gid_attr = kmalloc((1 + attr.gid_tbl_len) * sizeof *p->gid_attr, > + GFP_KERNEL); > + if (!p->gid_attr) { > + ret = -ENOMEM; > + goto err; > + } > + memset(p->gid_attr, 0, (1 + attr.gid_tbl_len) * sizeof *p- > >gid_attr); > + > + p->gid_group.name = "gids"; > + p->gid_group.attrs = p->gid_attr; > + > + gid_attr = (struct port_table_attribute **) p->gid_attr; > + > + for (i = 0; i < attr.gid_tbl_len; ++i) { > + gid_attr[i] = kmalloc(sizeof *gid_attr[i], GFP_KERNEL); > + if (!gid_attr[i]) { > + ret = -ENOMEM; > + goto err_free; > + } > + memset(gid_attr[i], 0, sizeof *gid_attr[i]); > + gid_attr[i]->attr.attr.name = kmalloc(8, GFP_KERNEL); > + if 
(!gid_attr[i]->attr.attr.name) { > + ret = -ENOMEM; > + goto err_free; > + } > + > + if (snprintf(gid_attr[i]->attr.attr.name, 8, "%d", i) >= 8) { > + ret = -ENOMEM; > + goto err_free; > + } > + > + gid_attr[i]->attr.attr.mode = S_IRUGO; > + gid_attr[i]->attr.attr.owner = THIS_MODULE; > + gid_attr[i]->attr.show = show_port_gid; > + gid_attr[i]->index = i; > + } > + > + ret = sysfs_create_group(&p->kobj, &p->gid_group); > + if (ret) > + goto err_free; > + > + list_add_tail(&p->kobj.entry, &device->port_list); > + > + return 0; > + > +err_free: > + for (i = 0; i < attr.gid_tbl_len; ++i) { > + if (p->gid_attr[i]) > + kfree(p->gid_attr[i]->name); > + kfree(p->gid_attr[i]); > + } > + > + kfree(p->gid_attr); > + > +err_put: > + kobject_put(&device->ports_parent); > + > +err: > + kfree(p); > + return ret; > +} > + > +static ssize_t show_node_guid(struct class_device *cdev, char *buf) > +{ > + struct ib_device *dev = container_of(cdev, struct ib_device, > class_dev); > + struct ib_device_attr attr; > + ssize_t ret; > + > + ret = ib_query_device(dev, &attr); > + if (ret) > + return ret; > + > + return sprintf(buf, "%04x:%04x:%04x:%04x\n", > + be16_to_cpu(((u16 *) &attr.node_guid)[0]), > + be16_to_cpu(((u16 *) &attr.node_guid)[1]), > + be16_to_cpu(((u16 *) &attr.node_guid)[2]), > + be16_to_cpu(((u16 *) &attr.node_guid)[3])); > +} > + > +CLASS_DEVICE_ATTR(node_guid, S_IRUGO, show_node_guid, NULL); > + > +static struct class_device_attribute *ib_class_attributes[] = { > + &class_device_attr_node_guid > +}; > + > static struct class ib_class = { > .name = "infiniband", > .release = ib_device_release, > @@ -44,18 +297,93 @@ > int ib_device_register_sysfs(struct ib_device *device) > { > struct class_device *class_dev = &device->class_dev; > + int ret; > + int i; > > class_dev->class = &ib_class; > class_dev->class_data = device; > strlcpy(class_dev->class_id, device->name, BUS_ID_SIZE); > > - return class_device_register(class_dev); > + INIT_LIST_HEAD(&device->port_list); > > + 
ret = class_device_register(class_dev); > + if (ret) > + goto err; > > + for (i = 0; i < ARRAY_SIZE(ib_class_attributes); ++i) { > + ret = class_device_create_file(class_dev, > ib_class_attributes[i]); > + if (ret) > + goto err_unregister; > + } > + > + device->ports_parent.parent = kobject_get(&class_dev->kobj); > + if (!device->ports_parent.parent) { > + ret = -EBUSY; > + goto err_unregister; > + } > + ret = kobject_set_name(&device->ports_parent, "ports"); > + if (ret) > + goto err_put; > + ret = kobject_register(&device->ports_parent); > + if (ret) > + goto err_put; > + > + if (device->node_type == IB_NODE_SWITCH) { > + ret = add_port(device, 0); > + if (ret) > + goto err_put; > + } else { > + struct ib_device_attr attr; > + int i; > + > + ret = ib_query_device(device, &attr); > + if (ret) > + goto err_put; > + > + for (i = 1; i <= attr.phys_port_cnt; ++i) { > + ret = add_port(device, i); > + if (ret) > + goto err_put; > + } > + } > + > + return 0; > + > +err_put: > + { > + struct kobject *p, *t; > + struct ib_port *port; > + > + list_for_each_entry_safe(p, t, &device->port_list, entry) { > + list_del(&p->entry); > + port = container_of(p, struct ib_port, kobj); > + sysfs_remove_group(p, &port->gid_group); > + kobject_unregister(p); > + } > + } > + > + kobject_put(&class_dev->kobj); > + > +err_unregister: > + class_device_unregister(class_dev); > + > +err: > + return ret; > } > > void ib_device_deregister_sysfs(struct ib_device *device) > { > + struct kobject *p, *t; > + struct ib_port *port; > + > + list_for_each_entry_safe(p, t, &device->port_list, entry) { > + list_del(&p->entry); > + port = container_of(p, struct ib_port, kobj); > + sysfs_remove_group(p,&port->gid_group); > + kobject_unregister(p); > + } > + > + kobject_unregister(&device->ports_parent); > class_device_unregister(&device->class_dev); > } > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > 
http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib- > general From roland at topspin.com Mon Sep 6 07:49:49 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 06 Sep 2004 07:49:49 -0700 Subject: [openib-general] [PATCH] More sysfs support In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F1DED9F@taurus.voltaire.com> (Yaron Haviv's message of "Mon, 6 Sep 2004 16:29:18 +0300") References: <35EA21F54A45CB47B879F21A91F4862F1DED9F@taurus.voltaire.com> Message-ID: <52wtz7o3jm.fsf@topspin.com> Yaron> Probably need to add P_Key's and some other node/port attributes Thanks, you're right. In fact, that's exactly what I was referring to in the second sentence of my email when I wrote: Roland> There's still a fair number more attributes to expose ;) - Roland From halr at voltaire.com Mon Sep 6 09:56:36 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 06 Sep 2004 12:56:36 -0400 Subject: [openib-general] Multicast address aliasing in IPoIB In-Reply-To: <20040905074709.GA29254@mellanox.co.il> References: <506C3D7B14CDD411A52C00025558DED605E00235@mtlex01.yok.mtl.com> <20040905074709.GA29254@mellanox.co.il> Message-ID: <1094489795.1850.15.camel@localhost.localdomain> On Sun, 2004-09-05 at 03:47, Michael S. Tsirkin wrote: > Hello! > Quoting Dror Goldenberg (gdror at mellanox.co.il) "[openib-general] Multicast address aliasing in IPoIB": > > IPoIB defines no aliasing in the mapping of IP multicast address into IPoIB HW > > addresses. > > In Ethernet, there is an aliasing, i.e. more than one IP address can map into > > the same > > Ethernet multicast MAC address. > > > > In short: IP to Ether takes 24 LSbits from the IP address > > IP to IB takes 28 LSbits from the IP address (which are > > essentially the whole > > IP address, the remaining 4 bits are "class D prefix"). > > > > The problem is that the current IPoIB driver interfaces the Linux kernel as if > > it were an Ethernet driver. 
> > Therefore, the IP layer will not notify the net_device when > > a new MC > > address is added if it maps to the same MAC address. It will rather increment > > the > > reference count of the MAC address (net_device->mc_list->dmi_user) and won't > > call > > net_device->set_multicast_list(). > > Therefore, if a user just adds itself to an IP MC group (setsockopt with > > IP_ADD_MEMBERSHIP), then if the IPoIB driver already has this Ether MAC address > > in its filter because of a previous registration to another IP MC group, then > > the IPoIB driver > > will not get any notification, and the user will not get registered to the MCG. > > > > I was wondering what should be the solution for that in the current kernels > > (gen1) and > > in future kernels (gen2). > > > > > What about registering for all possible IB multicast groups, up front? > There are 2^(28-24)=16 options, so you end up being registered in > 16 multicast groups, which is not that huge an overhead. There are a lot of groups and many of them are transient groups. Many of these are for data not just control (so the packet rates are higher). > Upper layers of the IP protocol will filter the right packets > as they do for ethernet. > > This is essentially what > we do with IP over IB anyway - emulating broadcast with multicast. That is the primary disadvantage of broadcast: the fact that unwanted packets need filtering. There is much less filtering (CPU waste) which is needed for true multicast as only the groups which share the same link multicast address and are not desired need to be filtered. Ethernet uses one scheme to share multicast MAC addresses. That is the one which the RFC describes. IB multicast groups (MGIDs) can share MLIDs, but only when the group characteristics are the same. 
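The aliasing discussed in this thread can be sketched in a few lines of C. One caveat: the standard IPv4-over-Ethernet mapping (RFC 1112) copies the low-order 23 bits of the group address, not 24, so 2^(28-23) = 32 groups share each Ethernet multicast MAC rather than 16. The helper names below are made up for illustration, and the IPoIB side is reduced to extracting the 28-bit group ID instead of building a full MGID.

```c
#include <stdint.h>

/* Map an IPv4 multicast address (host byte order) to its Ethernet MAC:
 * fixed prefix 01:00:5e followed by the low 23 bits of the address. */
static void ip_mc_to_ether(uint32_t group, uint8_t mac[6])
{
	mac[0] = 0x01;
	mac[1] = 0x00;
	mac[2] = 0x5e;
	mac[3] = (group >> 16) & 0x7f;
	mac[4] = (group >> 8) & 0xff;
	mac[5] = group & 0xff;
}

/* The 28 bits left after the class D prefix; IPoIB keeps all of them. */
static uint32_t ip_mc_group_id(uint32_t group)
{
	return group & 0x0fffffff;
}

/* Two distinct groups alias on Ethernet when their low 23 bits match. */
static int ether_aliased(uint32_t a, uint32_t b)
{
	return a != b && ((a ^ b) & 0x007fffff) == 0;
}
```

For example, 224.1.1.1 and 225.129.1.1 both map to 01:00:5e:01:01:01 but have different 28-bit group IDs. That is exactly the case where an Ethernet-style multicast list only bumps a reference count and the driver never hears about the second group.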
-- Hal > > MST From halr at voltaire.com Mon Sep 6 14:07:08 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 06 Sep 2004 17:07:08 -0400 Subject: [openib-general] ib_mad.h ib_mad_post_send questions and a minor commentary change Message-ID: <1094504827.1832.28.camel@localhost.localdomain> Currently (or maybe this should RSN) in ib_mad_post_send, a linked list of send WRs is supported and the routine returns int as follows: int ib_mad_post_send(struct ib_mad_agent *mad_agent, struct ib_send_wr *mad_send_wr) That seems fine when the WRs are not linked. What happens when they are linked and there is some error on one of the linked WRs ? In that case, some send WRs get posted and others do not. Does there need to be another parameter indicating how far in the list was posted so the ib_mad client knows what to repost ? I don't think that all errors can be hidden from the ib_mad client. Also, if this were to occur in the middle of an RMPP transaction, should this be detected and any special actions taken ? Or would this just rely on normal RMPP handling at the other end to detect any issues ? I also have a related implementation question. The ib_mad client supplies wr_id in the send WR. If it turns out that it might be better to use wr_ids in some special encoded way, is it acceptable to do that as long as the client wr_id is returned in the send WC ? Also, here is a proposed minor commentary change to ib_mad.h: /* * ib_mad_post_send - Posts MAD(s) to the send queue of the QP associated * with the registered client. * @mad_agent - Specifies the associated registration to post the send to. * @mad_send_wr - Specifies the information needed to send the MAD(s). 
*/ rather than: * ib_mad_post_send - Posts a MAD to the send queue of the QP associated * with the registered client. * @mad_agent - Specifies the associated registration to post the send to. * @send_wr - Specifies the information needed to send the MAD. -- Hal From roland at topspin.com Mon Sep 6 15:32:11 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 06 Sep 2004 15:32:11 -0700 Subject: [openib-general] [PATCH] Yet more sysfs support In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F1DED9F@taurus.voltaire.com> (Yaron Haviv's message of "Mon, 6 Sep 2004 16:29:18 +0300") References: <35EA21F54A45CB47B879F21A91F4862F1DED9F@taurus.voltaire.com> Message-ID: <52fz5vni50.fsf@topspin.com> OK, I just committed this. It implements a pretty full set of sysfs attributes, including GID and P_Key tables. The only things in my tree exposed in /proc and not exposed through sysfs now are the PMA counters -- it will be very easy to implement them, but I'm holding off until the new MAD API is defined and implemented. 
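For reference, a userspace consumer of these attributes might look like the sketch below. It parses the `state` file, whose contents are written by `show_port_state` with the format "%d: %s\n". The function names `parse_port_state` and `read_port_state` are hypothetical, and the path passed in is only an example, such as /sys/class/infiniband/mthca0/ports/1/state.

```c
#include <stdio.h>
#include <string.h>

/* Parse a line such as "4: ACTIVE\n" into a state number and a name.
 * Returns 0 on success, -1 if the line is malformed or the name does
 * not fit into the caller's buffer. */
static int parse_port_state(const char *line, int *state,
			    char *name, size_t name_len)
{
	char buf[32];

	if (sscanf(line, "%d: %31s", state, buf) != 2)
		return -1;
	if (strlen(buf) >= name_len)
		return -1;
	strcpy(name, buf);
	return 0;
}

/* Read the first line of a sysfs state file and parse it.
 * Returns -1 if the file cannot be opened or read. */
static int read_port_state(const char *path, int *state,
			   char *name, size_t name_len)
{
	char line[64];
	FILE *f = fopen(path, "r");
	int ret;

	if (!f)
		return -1;
	ret = fgets(line, sizeof line, f) ?
		parse_port_state(line, state, name, name_len) : -1;
	fclose(f);
	return ret;
}
```

The other single-value attributes (`lid`, `sm_lid`, `cap_mask`, and the `pkeys` entries) are plain hex or decimal numbers and could be read the same way with a simpler `sscanf` format.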
Thanks, Roland Index: infiniband/include/ib_verbs.h =================================================================== --- infiniband/include/ib_verbs.h (revision 742) +++ infiniband/include/ib_verbs.h (revision 743) @@ -689,6 +689,8 @@ ib_mad_process_func mad_process; struct class_device class_dev; + struct kobject ports_parent; + struct list_head port_list; enum { IB_DEV_UNINITIALIZED, Index: infiniband/core/ib_sysfs.c =================================================================== --- infiniband/core/ib_sysfs.c (revision 742) +++ infiniband/core/ib_sysfs.c (revision 743) @@ -21,6 +21,221 @@ #include "core_priv.h" +struct ib_port { + struct kobject kobj; + struct ib_device *ibdev; + struct attribute_group gid_group; + struct attribute **gid_attr; + struct attribute_group pkey_group; + struct attribute **pkey_attr; + u8 port_num; +}; + +struct port_attribute { + struct attribute attr; + ssize_t (*show)(struct ib_port *, struct port_attribute *, char *buf); + ssize_t (*store)(struct ib_port *, struct port_attribute *, const char *buf, size_t count); +}; + +#define PORT_ATTR(_name, _mode, _show, _store) \ +struct port_attribute port_attr_##_name = { \ + .attr = { .name = __stringify(_name), .mode = _mode, .owner = THIS_MODULE }, \ + .show = _show, \ + .store = _store \ +} + +struct port_table_attribute { + struct port_attribute attr; + int index; +}; + +static ssize_t port_attr_show(struct kobject *kobj, + struct attribute *attr, char *buf) +{ + struct port_attribute *port_attr = + container_of(attr, struct port_attribute, attr); + struct ib_port *p = container_of(kobj, struct ib_port, kobj); + + if (!port_attr->show) + return 0; + + return port_attr->show(p, port_attr, buf); +} + +static struct sysfs_ops port_sysfs_ops = { + .show = port_attr_show +}; + +static ssize_t show_port_state(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + static const char *state_name[] = { + [IB_PORT_NOP] = "NOP", + 
[IB_PORT_DOWN] = "DOWN", + [IB_PORT_INIT] = "INIT", + [IB_PORT_ARMED] = "ARMED", + [IB_PORT_ACTIVE] = "ACTIVE", + [IB_PORT_ACTIVE_DEFER] = "ACTIVE_DEFER" + }; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "%d: %s\n", attr.state, + attr.state >= 0 && attr.state < ARRAY_SIZE(state_name) ? + state_name[attr.state] : "UNKNOWN"); +} + +static ssize_t show_port_lid(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "0x%x\n", attr.lid); +} + +static ssize_t show_port_lmc(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "%d\n", attr.lmc); +} + +static ssize_t show_port_sm_lid(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "0x%x\n", attr.sm_lid); +} + +static ssize_t show_port_sm_sl(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "%d\n", attr.sm_sl); +} + +static ssize_t show_port_cap(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "0x%08x\n", attr.port_cap_flags); +} + +static PORT_ATTR(state, S_IRUGO, show_port_state, NULL); +static PORT_ATTR(lid, S_IRUGO, show_port_lid, NULL); +static PORT_ATTR(lid_mask_count, S_IRUGO, show_port_lmc, NULL); +static PORT_ATTR(sm_lid, S_IRUGO, show_port_sm_lid, NULL); +static 
PORT_ATTR(sm_sl, S_IRUGO, show_port_sm_sl, NULL); +static PORT_ATTR(cap_mask, S_IRUGO, show_port_cap, NULL); + +static struct attribute *port_default_attrs[] = { + &port_attr_state.attr, + &port_attr_lid.attr, + &port_attr_lid_mask_count.attr, + &port_attr_sm_lid.attr, + &port_attr_sm_sl.attr, + &port_attr_cap_mask.attr, + NULL +}; + +static ssize_t show_port_gid(struct ib_port *p, struct port_attribute *attr, + char *buf) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + union ib_gid gid; + ssize_t ret; + + ret = ib_query_gid(p->ibdev, p->port_num, tab_attr->index, &gid); + if (ret) + return ret; + + return sprintf(buf, "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", + be16_to_cpu(((u16 *) gid.raw)[0]), + be16_to_cpu(((u16 *) gid.raw)[1]), + be16_to_cpu(((u16 *) gid.raw)[2]), + be16_to_cpu(((u16 *) gid.raw)[3]), + be16_to_cpu(((u16 *) gid.raw)[4]), + be16_to_cpu(((u16 *) gid.raw)[5]), + be16_to_cpu(((u16 *) gid.raw)[6]), + be16_to_cpu(((u16 *) gid.raw)[7])); +} + +static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr, + char *buf) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + u16 pkey; + ssize_t ret; + + ret = ib_query_pkey(p->ibdev, p->port_num, tab_attr->index, &pkey); + if (ret) + return ret; + + return sprintf(buf, "0x%04x\n", pkey); +} + +static void ib_port_release(struct kobject *kobj) +{ + struct ib_port *p = container_of(kobj, struct ib_port, kobj); + struct attribute *a; + int i; + + for (i = 0; (a = p->gid_attr[i]); ++i) { + kfree(a->name); + kfree(a); + } + + for (i = 0; (a = p->pkey_attr[i]); ++i) { + kfree(a->name); + kfree(a); + } + + kfree(p->gid_attr); + kfree(p); +} + +static struct kobj_type port_type = { + .release = ib_port_release, + .sysfs_ops = &port_sysfs_ops, + .default_attrs = port_default_attrs +}; + static void ib_device_release(struct class_device *cdev) { struct ib_device *dev = container_of(cdev, struct 
ib_device, class_dev); @@ -35,6 +250,189 @@ return 0; } +static int alloc_group(struct attribute ***attr, + ssize_t (*show)(struct ib_port *, + struct port_attribute *, char *buf), + int len) +{ + struct port_table_attribute ***tab_attr = + (struct port_table_attribute ***) attr; + int i; + int ret; + + *tab_attr = kmalloc((1 + len) * sizeof *tab_attr, GFP_KERNEL); + if (!*tab_attr) + return -ENOMEM; + + memset(*tab_attr, 0, (1 + len) * sizeof *tab_attr); + + for (i = 0; i < len; ++i) { + (*tab_attr)[i] = kmalloc(sizeof *(*tab_attr)[i], GFP_KERNEL); + if (!(*tab_attr)[i]) { + ret = -ENOMEM; + goto err; + } + memset((*tab_attr)[i], 0, sizeof *(*tab_attr)[i]); + (*tab_attr)[i]->attr.attr.name = kmalloc(8, GFP_KERNEL); + if (!(*tab_attr)[i]->attr.attr.name) { + ret = -ENOMEM; + goto err; + } + + if (snprintf((*tab_attr)[i]->attr.attr.name, 8, "%d", i) >= 8) { + ret = -ENOMEM; + goto err; + } + + (*tab_attr)[i]->attr.attr.mode = S_IRUGO; + (*tab_attr)[i]->attr.attr.owner = THIS_MODULE; + (*tab_attr)[i]->attr.show = show; + (*tab_attr)[i]->index = i; + } + + return 0; + +err: + for (i = 0; i < len; ++i) { + if ((*tab_attr)[i]) + kfree((*tab_attr)[i]->attr.attr.name); + kfree((*tab_attr)[i]); + } + + kfree(*tab_attr); + + return ret; +} + +static int add_port(struct ib_device *device, int port_num) +{ + struct ib_port *p; + struct ib_port_attr attr; + int i; + int ret; + + ret = ib_query_port(device, port_num, &attr); + if (ret) + return ret; + + p = kmalloc(sizeof *p, GFP_KERNEL); + if (!p) + return -ENOMEM; + memset(p, 0, sizeof *p); + + p->ibdev = device; + p->port_num = port_num; + p->kobj.ktype = &port_type; + + p->kobj.parent = kobject_get(&device->ports_parent); + if (!p->kobj.parent) { + ret = -EBUSY; + goto err; + } + + ret = kobject_set_name(&p->kobj, "%d", port_num); + if (ret) + goto err_put; + + ret = kobject_register(&p->kobj); + if (ret) + goto err_put; + + ret = alloc_group(&p->gid_attr, show_port_gid, attr.gid_tbl_len); + if (ret) + goto err_put; + + 
p->gid_group.name = "gids"; + p->gid_group.attrs = p->gid_attr; + + ret = sysfs_create_group(&p->kobj, &p->gid_group); + if (ret) + goto err_free_gid; + + ret = alloc_group(&p->pkey_attr, show_port_pkey, attr.pkey_tbl_len); + if (ret) + goto err_remove_gid; + + p->pkey_group.name = "pkeys"; + p->pkey_group.attrs = p->pkey_attr; + + ret = sysfs_create_group(&p->kobj, &p->pkey_group); + if (ret) + goto err_free_pkey; + + list_add_tail(&p->kobj.entry, &device->port_list); + + return 0; + +err_free_pkey: + for (i = 0; i < attr.pkey_tbl_len; ++i) { + kfree(p->pkey_attr[i]->name); + kfree(p->pkey_attr[i]); + } + + kfree(p->pkey_attr); + +err_remove_gid: + sysfs_remove_group(&p->kobj, &p->gid_group); + +err_free_gid: + for (i = 0; i < attr.gid_tbl_len; ++i) { + kfree(p->gid_attr[i]->name); + kfree(p->gid_attr[i]); + } + + kfree(p->gid_attr); + +err_put: + kobject_put(&device->ports_parent); + +err: + kfree(p); + return ret; +} + +static ssize_t show_sys_image_guid(struct class_device *cdev, char *buf) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + struct ib_device_attr attr; + ssize_t ret; + + ret = ib_query_device(dev, &attr); + if (ret) + return ret; + + return sprintf(buf, "%04x:%04x:%04x:%04x\n", + be16_to_cpu(((u16 *) &attr.sys_image_guid)[0]), + be16_to_cpu(((u16 *) &attr.sys_image_guid)[1]), + be16_to_cpu(((u16 *) &attr.sys_image_guid)[2]), + be16_to_cpu(((u16 *) &attr.sys_image_guid)[3])); +} + +static ssize_t show_node_guid(struct class_device *cdev, char *buf) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + struct ib_device_attr attr; + ssize_t ret; + + ret = ib_query_device(dev, &attr); + if (ret) + return ret; + + return sprintf(buf, "%04x:%04x:%04x:%04x\n", + be16_to_cpu(((u16 *) &attr.node_guid)[0]), + be16_to_cpu(((u16 *) &attr.node_guid)[1]), + be16_to_cpu(((u16 *) &attr.node_guid)[2]), + be16_to_cpu(((u16 *) &attr.node_guid)[3])); +} + +static CLASS_DEVICE_ATTR(sys_image_guid, S_IRUGO, 
show_sys_image_guid, NULL); +static CLASS_DEVICE_ATTR(node_guid, S_IRUGO, show_node_guid, NULL); + +static struct class_device_attribute *ib_class_attributes[] = { + &class_device_attr_sys_image_guid, + &class_device_attr_node_guid +}; + static struct class ib_class = { .name = "infiniband", .release = ib_device_release, @@ -44,18 +442,94 @@ int ib_device_register_sysfs(struct ib_device *device) { struct class_device *class_dev = &device->class_dev; + int ret; + int i; class_dev->class = &ib_class; class_dev->class_data = device; strlcpy(class_dev->class_id, device->name, BUS_ID_SIZE); - return class_device_register(class_dev); + INIT_LIST_HEAD(&device->port_list); + ret = class_device_register(class_dev); + if (ret) + goto err; + for (i = 0; i < ARRAY_SIZE(ib_class_attributes); ++i) { + ret = class_device_create_file(class_dev, ib_class_attributes[i]); + if (ret) + goto err_unregister; + } + + device->ports_parent.parent = kobject_get(&class_dev->kobj); + if (!device->ports_parent.parent) { + ret = -EBUSY; + goto err_unregister; + } + ret = kobject_set_name(&device->ports_parent, "ports"); + if (ret) + goto err_put; + ret = kobject_register(&device->ports_parent); + if (ret) + goto err_put; + + if (device->node_type == IB_NODE_SWITCH) { + ret = add_port(device, 0); + if (ret) + goto err_put; + } else { + struct ib_device_attr attr; + int i; + + ret = ib_query_device(device, &attr); + if (ret) + goto err_put; + + for (i = 1; i <= attr.phys_port_cnt; ++i) { + ret = add_port(device, i); + if (ret) + goto err_put; + } + } + + return 0; + +err_put: + { + struct kobject *p, *t; + struct ib_port *port; + + list_for_each_entry_safe(p, t, &device->port_list, entry) { + list_del(&p->entry); + port = container_of(p, struct ib_port, kobj); + sysfs_remove_group(p, &port->pkey_group); + sysfs_remove_group(p, &port->gid_group); + kobject_unregister(p); + } + } + + kobject_put(&class_dev->kobj); + +err_unregister: + class_device_unregister(class_dev); + +err: + return ret; } void 
ib_device_deregister_sysfs(struct ib_device *device) { + struct kobject *p, *t; + struct ib_port *port; + + list_for_each_entry_safe(p, t, &device->port_list, entry) { + list_del(&p->entry); + port = container_of(p, struct ib_port, kobj); + sysfs_remove_group(p,&port->gid_group); + kobject_unregister(p); + } + + kobject_unregister(&device->ports_parent); class_device_unregister(&device->class_dev); } From roland at topspin.com Mon Sep 6 15:36:27 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 06 Sep 2004 15:36:27 -0700 Subject: [openib-general] ib_mad.h ib_mad_post_send questions and a minor commentary change In-Reply-To: <1094504827.1832.28.camel@localhost.localdomain> (Hal Rosenstock's message of "Mon, 06 Sep 2004 17:07:08 -0400") References: <1094504827.1832.28.camel@localhost.localdomain> Message-ID: <52brgjnhxw.fsf@topspin.com> Hal> That seems fine when the WRs are not linked. What happens Hal> when they are linked and there is some error on one of the Hal> linked WRs ? In that case, some send WRs get posted and Hal> others do not. Does there need to be another parameter Hal> indicating how far in the list was posted so the ib_mad Hal> client knows what to repost ? I don't think that all errors Hal> can be hidden from the ib_mad client. I guess we should add a parameter like 'struct ib_send_mad_wr **bad_wr' to match the way that ib_post_send() is defined... Hal> Also, if this were to occur in the middle of an RMPP Hal> transaction, should this be detected and any special actions Hal> taken ? Or would this just rely on normal RMPP handling at Hal> the other end to detect any issues ? There's no reason for a send request to fail in normal operation, so I don't see much that the MAD layer can try to do to recover. In any case, I'd like to see some real code that actually can be used, at least in the normal case as soon as possible. So I would suggest deferring secondary issues like this until after we have a working implementation. 
Hal> I also have a related implementation question. The ib_mad Hal> client supplies wr_id in the send WR. If it turns out that it Hal> might be better to use wr_ids in some special encoded way, is Hal> it acceptable to do that as long as the client wr_id is Hal> returned in the send WC ? I don't see any reason why not. Do you see any problems? - R. From roland at topspin.com Mon Sep 6 15:39:21 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 06 Sep 2004 15:39:21 -0700 Subject: [openib-general] [PATCH] mthca-specific sysfs attributes In-Reply-To: <52fz5vni50.fsf@topspin.com> (Roland Dreier's message of "Mon, 06 Sep 2004 15:32:11 -0700") References: <35EA21F54A45CB47B879F21A91F4862F1DED9F@taurus.voltaire.com> <52fz5vni50.fsf@topspin.com> Message-ID: <527jr7nht2.fsf_-_@topspin.com> This adds some extra fields to sysfs that the low-level driver knows how to format: FW version and HW revision. - R. Index: infiniband/hw/mthca/mthca_provider.c =================================================================== --- infiniband/hw/mthca/mthca_provider.c (revision 737) +++ infiniband/hw/mthca/mthca_provider.c (working copy) @@ -71,7 +71,8 @@ 0xffffff; props->vendor_part_id = be16_to_cpup((u16 *) (out_mad->payload + 70)); props->hw_ver = be16_to_cpup((u16 *) (out_mad->payload + 72)); - memcpy(&props->node_guid, out_mad->payload + 52, 8); + memcpy(&props->sys_image_guid, out_mad->payload + 44, 8); + memcpy(&props->node_guid, out_mad->payload + 52, 8); err = 0; out: @@ -514,8 +515,46 @@ return 0; } +static ssize_t show_rev(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + return sprintf(buf, "%x\n", dev->rev_id); +} + +static ssize_t show_fw_ver(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + return sprintf(buf, "%x.%x.%x\n", (int) (dev->fw_ver >> 32), + (int) (dev->fw_ver >> 16) & 0xffff, + (int) dev->fw_ver & 
0xffff); +} + +static ssize_t show_hca(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + switch (dev->hca_type) { + case TAVOR: return sprintf(buf, "MT23108\n"); + case ARBEL_COMPAT: return sprintf(buf, "MT25208 (MT23108 compat mode)\n"); + case ARBEL_NATIVE: return sprintf(buf, "MT25208\n"); + default: return sprintf(buf, "unknown\n"); + } +} + +static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); +static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); +static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); + +static struct class_device_attribute *mthca_class_attributes[] = { + &class_device_attr_hw_rev, + &class_device_attr_fw_ver, + &class_device_attr_hca_type +}; + int mthca_register_device(struct mthca_dev *dev) { + int ret; + int i; + strlcpy(dev->ib_dev.name, "mthca%d", IB_DEVICE_NAME_MAX); dev->ib_dev.node_type = IB_NODE_CA; dev->ib_dev.owner = THIS_MODULE; @@ -546,7 +585,20 @@ dev->ib_dev.detach_mcast = mthca_multicast_detach; dev->ib_dev.mad_process = mthca_process_mad; - return ib_register_device(&dev->ib_dev); + ret = ib_register_device(&dev->ib_dev); + if (ret) + return ret; + + for (i = 0; i < ARRAY_SIZE(mthca_class_attributes); ++i) { + ret = class_device_create_file(&dev->ib_dev.class_dev, + mthca_class_attributes[i]); + if (ret) { + ib_deregister_device(&dev->ib_dev); + return ret; + } + } + + return 0; } void mthca_deregister_device(struct mthca_dev *dev) Index: infiniband/hw/mthca/ChangeLog =================================================================== --- infiniband/hw/mthca/ChangeLog (revision 736) +++ infiniband/hw/mthca/ChangeLog (working copy) @@ -51,6 +51,7 @@ - Fix mapping of HCA memory on JS20 (don't trust what the HCA tells us about PCI addresses). - Don't request SRQ events if FW doesn't support SRQ. + - Add firmware version and revision ID to sysfs. 
 Local Variables:
 mode: text

From roland at topspin.com Mon Sep 6 15:53:49 2004
From: roland at topspin.com (Roland Dreier)
Date: Mon, 06 Sep 2004 15:53:49 -0700
Subject: [openib-general] [PATCH] Yet more sysfs support
In-Reply-To: <52fz5vni50.fsf@topspin.com> (Roland Dreier's message of "Mon, 06 Sep 2004 15:32:11 -0700")
References: <35EA21F54A45CB47B879F21A91F4862F1DED9F@taurus.voltaire.com> <52fz5vni50.fsf@topspin.com>
Message-ID: <523c1vnh4y.fsf@topspin.com>

By the way, since /sys/class/infiniband/mthca0/ports/1/cap_mask just
contains a hex number (such as '0x00500a68'), I wrote this quick
python script to decode the capabilities.  You can use it as:

    # python ~roland/bin/decode-cap mthca0 2
    Capabilities for mthca0 port 2:
            IsTrapSupported
            IsAutomaticMigrationSupported
            IsSLMappingSupported
            IsLEDInfoSupported
            IsSystemImageGUIDSupported
            IsVendorClassSupported
            IsCapabilityMaskNoticeSupported

 - R.

#!/usr/bin/python

import sys

dev = sys.argv[1]
port = sys.argv[2]

# Bit position in the port capability mask -> capability name
cap_name = {
    1: 'IsSM',
    2: 'IsNoticeSupported',
    3: 'IsTrapSupported',
    5: 'IsAutomaticMigrationSupported',
    6: 'IsSLMappingSupported',
    7: 'IsMKeyNVRAM',
    8: 'IsPKeyNVRAM',
    9: 'IsLEDInfoSupported',
    10: 'IsSMdisabled',
    11: 'IsSystemImageGUIDSupported',
    12: 'IsPKeySwitchExternalPortTrapSupported',
    16: 'IsCommunicationManagementSupported',
    17: 'IsSNMPTunnelingSupported',
    18: 'IsReinitSupported',
    19: 'IsDeviceManagementSupported',
    20: 'IsVendorClassSupported',
    21: 'IsDRNoticeSupported',
    22: 'IsCapabilityMaskNoticeSupported',
    23: 'IsBootManagementSupported'
}

f = open('/sys/class/infiniband/' + dev + '/ports/' + port + '/cap_mask')
# Base 0 lets int() honor the '0x' prefix in the sysfs file
cap = int(f.read(), 0)

print('Capabilities for ' + dev + ' port ' + port + ':')
for k in sorted(cap_name):
    if cap & (1 << k) != 0:
        print('\t' + cap_name[k])

From halr at voltaire.com Mon Sep 6 16:08:17 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Mon, 06 Sep 2004 19:08:17 -0400
Subject: [openib-general] ib_mad.h ib_mad_post_send questions and a minor commentary change
In-Reply-To:
<52brgjnhxw.fsf@topspin.com> References: <1094504827.1832.28.camel@localhost.localdomain> <52brgjnhxw.fsf@topspin.com> Message-ID: <1094512096.1864.39.camel@localhost.localdomain> On Mon, 2004-09-06 at 18:36, Roland Dreier wrote: > Hal> That seems fine when the WRs are not linked. What happens > Hal> when they are linked and there is some error on one of the > Hal> linked WRs ? In that case, some send WRs get posted and > Hal> others do not. Does there need to be another parameter > Hal> indicating how far in the list was posted so the ib_mad > Hal> client knows what to repost ? I don't think that all errors > Hal> can be hidden from the ib_mad client. > > I guess we should add a parameter like 'struct ib_send_mad_wr **bad_wr' > to match the way that ib_post_send() is defined... Thanks for pointing that out. I had missed that (the fact that this API was different from ib_post_send). Patch for this to follow shortly... > Hal> Also, if this were to occur in the middle of an RMPP > Hal> transaction, should this be detected and any special actions > Hal> taken ? Or would this just rely on normal RMPP handling at > Hal> the other end to detect any issues ? > > There's no reason for a send request to fail in normal operation, so I > don't see much that the MAD layer can try to do to recover. Yes, but there are a number of errors which can occur (not just the posting of the send failing) as more information needs tracking per send so that there is some memory allocation involved too which can fail. > In any case, I'd like to see some real code that actually can be used, at > least in the normal case as soon as possible. So I would suggest > deferring secondary issues like this until after we have a working > implementation. I get the message :-( Sorry this is taking longer than I had projected. We can discuss the "schedule" and how to transition on Thursday. 
I will focus on the "normal" code path and leave comments on what else
needs to be filled in/addressed in a subsequent pass.

>     Hal> I also have a related implementation question. The ib_mad
>     Hal> client supplies wr_id in the send WR. If it turns out that it
>     Hal> might be better to use wr_ids in some special encoded way, is
>     Hal> it acceptable to do that as long as the client wr_id is
>     Hal> returned in the send WC ?
>
> I don't see any reason why not. Do you see any problems?

No problems; I was just wondering whether there would be any
aesthetic objections.

-- Hal

From roland at topspin.com Mon Sep 6 16:06:31 2004
From: roland at topspin.com (Roland Dreier)
Date: Mon, 06 Sep 2004 16:06:31 -0700
Subject: [openib-general] [PATCH] kill some ioctls
In-Reply-To: <527jr7nht2.fsf_-_@topspin.com> (Roland Dreier's message of "Mon, 06 Sep 2004 15:39:21 -0700")
References: <35EA21F54A45CB47B879F21A91F4862F1DED9F@taurus.voltaire.com> <52fz5vni50.fsf@topspin.com> <527jr7nht2.fsf_-_@topspin.com>
Message-ID: <52y8jnm1zc.fsf_-_@topspin.com>

Get rid of some ioctls now that the info is in sysfs...

 - R.

Index: infiniband/include/ts_ib_useraccess.h =================================================================== --- infiniband/include/ts_ib_useraccess.h (revision 715) +++ infiniband/include/ts_ib_useraccess.h (working copy) @@ -37,8 +37,6 @@ */ typedef struct ib_user_mad_filter tTS_IB_USER_MAD_FILTER_STRUCT, *tTS_IB_USER_MAD_FILTER; -typedef struct ib_get_port_info_ioctl tTS_IB_GET_PORT_INFO_IOCTL_STRUCT, - *tTS_IB_GET_PORT_INFO_IOCTL; typedef struct ib_set_port_info_ioctl tTS_IB_SET_PORT_INFO_IOCTL_STRUCT, *tTS_IB_SET_PORT_INFO_IOCTL; typedef struct ib_mad_process_ioctl tTS_IB_MAD_PROCESS_IOCTL_STRUCT, @@ -47,8 +45,6 @@ *tTS_IB_QP_REGISTER_IOCTL; typedef struct ib_path_record_ioctl tTS_IB_PATH_RECORD_IOCTL_STRUCT, *tTS_IB_PATH_RECORD_IOCTL; -typedef struct ib_gid_entry_ioctl tTS_IB_GID_ENTRY_IOCTL_STRUCT, - *tTS_IB_GID_ENTRY_IOCTL; struct ib_user_mad_filter { tTS_IB_PORT port; @@ -61,11 +57,6 @@ tTS_IB_USER_MAD_FILTER_HANDLE handle; }; -struct ib_get_port_info_ioctl { - tTS_IB_PORT port; - struct ib_port_attr port_info; -}; - struct ib_set_port_info_ioctl { tTS_IB_PORT port; int port_modify_mask; @@ -82,16 +73,6 @@ struct ib_path_record *path_record; }; -struct ib_gid_entry_ioctl { - u8 port; - int index; - union ib_gid gid_entry; -}; - -struct ib_get_dev_info_ioctl { - struct ib_device_attr dev_info; -}; - /* Old useraccess module used magic 0xbb; we change it here so old binaries don't fail silently in strange ways. 
*/ #define TS_IB_IOCTL_MAGIC 0xbc @@ -104,10 +85,6 @@ #define TS_IB_IOCSMADFILTDEL _IOW(TS_IB_IOCTL_MAGIC, 2, \ tTS_IB_USER_MAD_FILTER_HANDLE *) -/* Get port info */ -#define TS_IB_IOCGPORTINFO _IOR(TS_IB_IOCTL_MAGIC, 3, \ - struct ib_get_port_info_ioctl *) - /* Set port info */ #define TS_IB_IOCSPORTINFO _IOW(TS_IB_IOCTL_MAGIC, 4, \ struct ib_set_port_info_ioctl *) @@ -130,12 +107,4 @@ #define TS_IB_IOCGPATHRECORD _IOWR(TS_IB_IOCTL_MAGIC, 9, \ struct ib_path_record_ioctl *) -/* Fetch a GID */ -#define TS_IB_IOCGGIDENTRY _IOWR(TS_IB_IOCTL_MAGIC, 10, \ - struct ib_gid_entry_ioctl *) - -/* Get device info */ -#define TS_IB_IOCGDEVINFO _IOR(TS_IB_IOCTL_MAGIC, 11, \ - struct ib_get_dev_info_ioctl *) - #endif /* _TS_IB_USERACCESS_H */ Index: infiniband/core/useraccess_ioctl.c =================================================================== --- infiniband/core/useraccess_ioctl.c (revision 715) +++ infiniband/core/useraccess_ioctl.c (working copy) @@ -75,31 +75,6 @@ return tsIbUserFilterDel(priv, handle); } -static int _tsIbUserIoctlGetPortInfo(struct ib_useraccess_private *priv, - unsigned long arg) -{ - struct ib_get_port_info_ioctl get_port_info_ioctl; - int ret; - - if (copy_from_user(&get_port_info_ioctl, - (struct ib_get_port_info_ioctl *) arg, - sizeof get_port_info_ioctl)) { - return -EFAULT; - } - - ret = ib_query_port(priv->device->ib_device, - get_port_info_ioctl.port, - &get_port_info_ioctl.port_info); - - if (ret) { - return -EFAULT; - } - - return copy_to_user((struct ib_get_port_info_ioctl *) arg, - &get_port_info_ioctl, - sizeof get_port_info_ioctl) ? 
-EFAULT : 0; -} - static int _tsIbUserIoctlSetPortInfo(struct ib_useraccess_private *priv, unsigned long arg) { @@ -221,30 +196,6 @@ return 0; } -static int _tsIbUserIoctlGetGidEntry(struct ib_useraccess_private *priv, - unsigned long arg) -{ - struct ib_gid_entry_ioctl gid_ioctl; - int ret; - - if (copy_from_user(&gid_ioctl, (void *)arg, sizeof(gid_ioctl))) { - return -EFAULT; - } - - ret = ib_query_gid(priv->device->ib_device, gid_ioctl.port, - gid_ioctl.index, &gid_ioctl.gid_entry); - - if (ret) { - return ret; - } - - if (copy_to_user((void *)arg, &gid_ioctl, sizeof(gid_ioctl))) { - return -EFAULT; - } - - return 0; -} - static int _tsIbUserIoctlMadProcess(struct ib_useraccess_private *priv, unsigned long arg) { @@ -297,32 +248,6 @@ return ret; } -static int ib_user_ioctl_get_dev_info(struct ib_useraccess_private *priv, - unsigned long arg) -{ - struct ib_get_dev_info_ioctl *get_dev_info_ioctl; - int ret; - - get_dev_info_ioctl = kmalloc(sizeof *get_dev_info_ioctl, - GFP_KERNEL); - if (!get_dev_info_ioctl) - return -ENOMEM; - - ret = ib_query_device(priv->device->ib_device, - &get_dev_info_ioctl->dev_info); - - if (ret) - goto out; - - ret = copy_to_user((struct ib_get_dev_info_ioctl *) arg, - get_dev_info_ioctl, - sizeof *get_dev_info_ioctl) ? 
-EFAULT : 0; - -out: - kfree(get_dev_info_ioctl); - return ret; -} - static const struct { int cmd; int (*function) (struct ib_useraccess_private *, unsigned long); @@ -334,9 +259,6 @@ { .cmd = TS_IB_IOCSMADFILTDEL, .function = _tsIbUserIoctlMadFiltDel, .name = "delete MAD filter" }, - { .cmd = TS_IB_IOCGPORTINFO, - .function = _tsIbUserIoctlGetPortInfo, - .name = "get port info" }, { .cmd = TS_IB_IOCSPORTINFO, .function = _tsIbUserIoctlSetPortInfo, .name = "set port info" }, @@ -346,15 +268,9 @@ { .cmd = TS_IB_IOCSRCVQUEUELENGTH, .function = _tsIbUserIoctlSetReceiveQueueLength, .name = "set receive queue length" }, - { .cmd = TS_IB_IOCGGIDENTRY, - .function = _tsIbUserIoctlGetGidEntry, - .name = "get GID entry" }, { .cmd = TS_IB_IOCMADPROCESS, .function = _tsIbUserIoctlMadProcess, .name = "process MAD" }, - { .cmd = TS_IB_IOCGDEVINFO, - .function = ib_user_ioctl_get_dev_info, - .name = "get device info" } }; static const int num_ioctl = sizeof ioctl_table / sizeof ioctl_table[0]; From halr at voltaire.com Mon Sep 6 16:18:56 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 06 Sep 2004 19:18:56 -0400 Subject: [openib-general] [PATCH] ib_mad.h: Add bad_send_wr parameter to ib_mad_post_send() Message-ID: <1094512736.1864.43.camel@localhost.localdomain> Add bad_send_wr parameter to ib_mad_post_send() Also modified the ib_mad_send_wc (but not the recv one nor eliminated the ib_mad_send_wr structure. Index: ib_mad.h =================================================================== --- ib_mad.h (revision 739) +++ ib_mad.h (working copy) @@ -163,13 +163,13 @@ /** * ib_mad_send_wc - MAD send completion information. - * @context - Context associated with the send MAD request. + * @wr_id - Work request identifier associated with the send MAD request. * @status - Completion status. * @vendor_err - Optional vendor error information returned with a failed * request. 
  */
 struct ib_mad_send_wc {
-        void                    *context;
+        u64                     wr_id;
         enum ib_wc_status       status;
         u32                     vendor_err;
 };
@@ -260,13 +260,15 @@
 int ib_mad_dereg(struct ib_mad_agent *mad_agent);
 
 /**
- * ib_mad_post_send - Posts a MAD to the send queue of the QP associated
+ * ib_mad_post_send - Posts MAD(s) to the send queue of the QP associated
  *   with the registered client.
  * @mad_agent - Specifies the associated registration to post the send to.
- * @mad_send_wr - Specifies the information needed to send the MAD.
+ * @send_wr - Specifies the information needed to send the MAD(s).
+ * @bad_send_wr - Specifies the MAD on which an error was encountered.
  */
 int ib_mad_post_send(struct ib_mad_agent *mad_agent,
-                     struct ib_mad_send_wr *mad_send_wr);
+                     struct ib_send_wr *send_wr,
+                     struct ib_send_wr **bad_send_wr);
 
 /**
  * ib_mad_qp_redir - Registers a QP for MAD services.

From mst at mellanox.co.il Mon Sep 6 22:55:10 2004
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 7 Sep 2004 08:55:10 +0300
Subject: [openib-general] [PATCH] Yet more sysfs support
In-Reply-To: <523c1vnh4y.fsf@topspin.com>
References: <35EA21F54A45CB47B879F21A91F4862F1DED9F@taurus.voltaire.com> <52fz5vni50.fsf@topspin.com> <523c1vnh4y.fsf@topspin.com>
Message-ID: <20040907055510.GA8185@mellanox.co.il>

Hello!
Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] [PATCH] Yet more sysfs support":
> By the way, since /sys/class/infiniband/mthca0/ports/1/cap_mask just
> contains a hex number (such as '0x00500a68'), I wrote this quick
> python script to decode the capabilities.  You can use it as:
>
>     # python ~roland/bin/decode-cap mthca0 2
>     Capabilities for mthca0 port 2:
>             IsTrapSupported
>             IsAutomaticMigrationSupported
>             IsSLMappingSupported
>             IsLEDInfoSupported
>             IsSystemImageGUIDSupported
>             IsVendorClassSupported
>             IsCapabilityMaskNoticeSupported
>
>  - R.

Wouldn't it be nicer to have one file per bit here?

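For anyone who would rather keep the single cap_mask file, the decoding
done by the decode-cap script posted earlier in this thread is small
enough to do in C as well. What follows is only a minimal sketch, not code
from the tree: the bit positions and names are copied from that script,
while cap_names and decode_cap_mask() are hypothetical names picked for
illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit positions and names copied from the decode-cap script in this thread. */
const struct {
    unsigned bit;
    const char *name;
} cap_names[] = {
    {  1, "IsSM" },
    {  2, "IsNoticeSupported" },
    {  3, "IsTrapSupported" },
    {  5, "IsAutomaticMigrationSupported" },
    {  6, "IsSLMappingSupported" },
    {  7, "IsMKeyNVRAM" },
    {  8, "IsPKeyNVRAM" },
    {  9, "IsLEDInfoSupported" },
    { 10, "IsSMdisabled" },
    { 11, "IsSystemImageGUIDSupported" },
    { 12, "IsPKeySwitchExternalPortTrapSupported" },
    { 16, "IsCommunicationManagementSupported" },
    { 17, "IsSNMPTunnelingSupported" },
    { 18, "IsReinitSupported" },
    { 19, "IsDeviceManagementSupported" },
    { 20, "IsVendorClassSupported" },
    { 21, "IsDRNoticeSupported" },
    { 22, "IsCapabilityMaskNoticeSupported" },
    { 23, "IsBootManagementSupported" },
};

/*
 * Hypothetical helper (not part of any IB header): store the names of
 * the bits set in cap into out[], lowest bit first, writing at most max
 * entries.  Returns the number of known bits that were set, which may
 * exceed max if out[] was too small.
 */
size_t decode_cap_mask(uint32_t cap, const char **out, size_t max)
{
    size_t i, n = 0;

    for (i = 0; i < sizeof cap_names / sizeof cap_names[0]; ++i) {
        if (!(cap & (UINT32_C(1) << cap_names[i].bit)))
            continue;
        if (n < max)
            out[n] = cap_names[i].name;
        ++n;
    }
    return n;
}
```

A caller would read the hex string from the cap_mask file, convert it with
strtoul(str, NULL, 0), and pass the result in. The sketch silently ignores
bits that are missing from the table.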
From halr at voltaire.com Tue Sep 7 04:18:41 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Tue, 07 Sep 2004 07:18:41 -0400
Subject: [openib-general] ib_mad.h ib_mad_post_send PCI mapping
Message-ID: <1094555920.1864.50.camel@localhost.localdomain>

I have another ib_mad_post_send question, this one about PCI mapping.
Is it correct to assume that, as with ib_post_send, any PCI mapping (and
cache synchronization) is to be performed in the client rather than in
ib_mad_post_send itself ?

-- Hal

From roland at topspin.com Tue Sep 7 07:44:09 2004
From: roland at topspin.com (Roland Dreier)
Date: Tue, 07 Sep 2004 07:44:09 -0700
Subject: [openib-general] [PATCH] Yet more sysfs support
In-Reply-To: <20040907055510.GA8185@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 7 Sep 2004 08:55:10 +0300")
References: <35EA21F54A45CB47B879F21A91F4862F1DED9F@taurus.voltaire.com> <52fz5vni50.fsf@topspin.com> <523c1vnh4y.fsf@topspin.com> <20040907055510.GA8185@mellanox.co.il>
Message-ID: <52n002m952.fsf@topspin.com>

    Michael> Wouldn't it be nicer to have one file per bit here?

I thought about it, but it seemed to be excessive to me.  If you send
a patch I'd probably apply it though.

 - R.

From mst at mellanox.co.il Tue Sep 7 07:48:14 2004
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 7 Sep 2004 17:48:14 +0300
Subject: [openib-general] [PATCH] Yet more sysfs support
In-Reply-To: <52n002m952.fsf@topspin.com>
References: <35EA21F54A45CB47B879F21A91F4862F1DED9F@taurus.voltaire.com> <52fz5vni50.fsf@topspin.com> <523c1vnh4y.fsf@topspin.com> <20040907055510.GA8185@mellanox.co.il> <52n002m952.fsf@topspin.com>
Message-ID: <20040907144814.GB1340@mellanox.co.il>

Hello!
Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] [PATCH] Yet more sysfs support":
>     Michael> Wouldn't it be nicer to have one file per bit here?
>
> I thought about it, but it seemed to be excessive to me.  If you send
> a patch I'd probably apply it though.
>
>  - R.

OK.
Once we start writing to bits (like the IsSM bit), having a separate flag
will be better, as it will guarantee atomicity.

MST

From roland at topspin.com Tue Sep 7 07:51:53 2004
From: roland at topspin.com (Roland Dreier)
Date: Tue, 07 Sep 2004 07:51:53 -0700
Subject: [openib-general] [PATCH] Yet more sysfs support
In-Reply-To: <20040907144814.GB1340@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 7 Sep 2004 17:48:14 +0300")
References: <35EA21F54A45CB47B879F21A91F4862F1DED9F@taurus.voltaire.com> <52fz5vni50.fsf@topspin.com> <523c1vnh4y.fsf@topspin.com> <20040907055510.GA8185@mellanox.co.il> <52n002m952.fsf@topspin.com> <20040907144814.GB1340@mellanox.co.il>
Message-ID: <52fz5um8s6.fsf@topspin.com>

    Michael> When we start writing to bits (like IsSM bit) to have a
    Michael> separate flag will be better as it will guarantee
    Michael> atomicity.

I'm not convinced sysfs is the right interface to set capability
bits.  You lose the ability to clean up when an app exits uncleanly
(for example leaving the IsSM bit set with no SM running).

 - Roland

From roland at topspin.com Tue Sep 7 07:53:56 2004
From: roland at topspin.com (Roland Dreier)
Date: Tue, 07 Sep 2004 07:53:56 -0700
Subject: [openib-general] ib_mad.h ib_mad_post_send questions and a minor commentary change
In-Reply-To: <1094512096.1864.39.camel@localhost.localdomain> (Hal Rosenstock's message of "Mon, 06 Sep 2004 19:08:17 -0400")
References: <1094504827.1832.28.camel@localhost.localdomain> <52brgjnhxw.fsf@topspin.com> <1094512096.1864.39.camel@localhost.localdomain>
Message-ID: <52brgim8or.fsf@topspin.com>

    Hal> I get the message :-( Sorry this is taking longer than I had
    Hal> projected. We can discuss the "schedule" and how to
    Hal> transition on Thursday.

Don't take it personally -- I wasn't criticizing you in particular.  I
just think it's important to get the new MAD API usable soon or else
we'll lose the little momentum we have left.

 - R.

From mst at mellanox.co.il Tue Sep 7 07:53:56 2004
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 7 Sep 2004 17:53:56 +0300
Subject: [openib-general] [PATCH] Yet more sysfs support
In-Reply-To: <52fz5um8s6.fsf@topspin.com>
References: <35EA21F54A45CB47B879F21A91F4862F1DED9F@taurus.voltaire.com> <52fz5vni50.fsf@topspin.com> <523c1vnh4y.fsf@topspin.com> <20040907055510.GA8185@mellanox.co.il> <52n002m952.fsf@topspin.com> <20040907144814.GB1340@mellanox.co.il> <52fz5um8s6.fsf@topspin.com>
Message-ID: <20040907145356.GC1340@mellanox.co.il>

Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] [PATCH] Yet more sysfs support":
>     Michael> When we start writing to bits (like IsSM bit) to have a
>     Michael> separate flag will be better as it will guarantee
>     Michael> atomicity.
>
> I'm not convinced sysfs is the right interface to set capability
> bits.  You lose the ability to clean up when an app exits uncleanly
> (for example leaving the IsSM bit set with no SM running).
>
>  - Roland

Interesting. It seems we need a way to catch close on the sysfs file,
and there does not seem to be an easy way to do this.
But there *is* the hard way - by finding and replacing the f_ops
pointer, no?

MST

From roland at topspin.com Tue Sep 7 08:05:35 2004
From: roland at topspin.com (Roland Dreier)
Date: Tue, 07 Sep 2004 08:05:35 -0700
Subject: [openib-general] [PATCH] Yet more sysfs support
In-Reply-To: <20040907145356.GC1340@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 7 Sep 2004 17:53:56 +0300")
References: <35EA21F54A45CB47B879F21A91F4862F1DED9F@taurus.voltaire.com> <52fz5vni50.fsf@topspin.com> <523c1vnh4y.fsf@topspin.com> <20040907055510.GA8185@mellanox.co.il> <52n002m952.fsf@topspin.com> <20040907144814.GB1340@mellanox.co.il> <52fz5um8s6.fsf@topspin.com> <20040907145356.GC1340@mellanox.co.il>
Message-ID: <527jr6m85c.fsf@topspin.com>

    Michael> Interesting.
It seems we need a way to catch close on the Michael> sysfs file and there does not seem to exist an easy way Michael> to do this. Right. So if we want to catch the close, we need our own file (either a /dev node or in a driver-specific filesystem). Michael> But there *is* the hard way - by finding and replacing Michael> the f_ops pointer, no? That's technically possible but too ugly to consider. - R. From halr at voltaire.com Tue Sep 7 08:16:30 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 07 Sep 2004 11:16:30 -0400 Subject: [openib-general] ib_mad.h ib_mad_post_send questions and a minor commentary change In-Reply-To: <52brgim8or.fsf@topspin.com> References: <1094504827.1832.28.camel@localhost.localdomain> <52brgjnhxw.fsf@topspin.com> <1094512096.1864.39.camel@localhost.localdomain> <52brgim8or.fsf@topspin.com> Message-ID: <1094570190.3500.75.camel@localhost.localdomain> On Tue, 2004-09-07 at 10:53, Roland Dreier wrote: > Don't take it personally -- I wasn't criticizing you in particular. I > just think it's important to get the new MAD API usable soon or else > we'll lose the little momentum we have left. Thanks. I'm not taking it personally but I don't like making commitments (or in this case an estimate) and not living up to it. In any case, we can go over the current design on Thursday if we have time if that is of group interest. I would like to discuss the transition approach to this API. -- Hal From maik.franke at s1999.tu-chemnitz.de Tue Sep 7 04:05:21 2004 From: maik.franke at s1999.tu-chemnitz.de (Maik Franke) Date: Tue, 7 Sep 2004 13:05:21 +0200 Subject: [openib-general] differences between IB stacks (sourceforge, openib, ..)? Message-ID: <1094555121.413d95f17bced@mail.tu-chemnitz.de> Is there somewhere more precise information about the differences between the different stacks (Sourceforge, OpenIB, TopSpin...)? I found only the file OLS-2004.pdf of the Linux Symposium. Thanks a lot. 
Maik From roland at topspin.com Tue Sep 7 08:53:41 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 07 Sep 2004 08:53:41 -0700 Subject: [openib-general] differences between IB stacks (sourceforge, openib, ..)? In-Reply-To: <1094555121.413d95f17bced@mail.tu-chemnitz.de> (Maik Franke's message of "Tue, 7 Sep 2004 13:05:21 +0200") References: <1094555121.413d95f17bced@mail.tu-chemnitz.de> Message-ID: <52u0uakrcq.fsf@topspin.com> Maik> Is there somewhere more precise information about the Maik> differences between the different stacks (Sourceforge, Maik> OpenIB, TopSpin...)? I found only the file OLS-2004.pdf of Maik> the Linux Symposium. The most precise info is the source code, which is all checked in to OpenIB subversion. Since Woody presented his opinion at OLS, the controversy has been settled. All open source development is focused on the "gen2" OpenIB stack with API specified in https://openib.org/svn/trunk/contrib/intel - Roland From mshefty at ichips.intel.com Tue Sep 7 08:15:57 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 7 Sep 2004 08:15:57 -0700 Subject: [openib-general] Async event handlers: per consumer or per QP/CQ? In-Reply-To: <52n004pfxt.fsf@topspin.com> References: <52n004pfxt.fsf@topspin.com> Message-ID: <20040907081557.55028f64.mshefty@ichips.intel.com> On Sun, 05 Sep 2004 14:24:30 -0700 Roland Dreier wrote: > I think it makes the implementation much cleaner. I'm going to go > ahead and implement it so that we have something concrete to > criticize. If you can provide a patch, I'll apply it to my tree. Otherwise, I'll back port the changes. 
From mshefty at ichips.intel.com Tue Sep 7 08:25:01 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 7 Sep 2004 08:25:01 -0700 Subject: [openib-general] Re: [PATCH] ib_mad.h: Add IB_MGMT_MAX_METHODS definition In-Reply-To: <1094229808.1747.12.camel@localhost.localdomain> References: <1094229808.1747.12.camel@localhost.localdomain> Message-ID: <20040907082501.729b0361.mshefty@ichips.intel.com> On Fri, 03 Sep 2004 12:43:28 -0400 Hal Rosenstock wrote: > Add IB_MGMT_MAX_METHODS definition to ib_mad.h Applied - thanks! From mshefty at ichips.intel.com Tue Sep 7 08:37:46 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 7 Sep 2004 08:37:46 -0700 Subject: [openib-general] ib_mad.h ib_mad_post_send questions and a minor commentary change In-Reply-To: <52brgjnhxw.fsf@topspin.com> References: <1094504827.1832.28.camel@localhost.localdomain> <52brgjnhxw.fsf@topspin.com> Message-ID: <20040907083746.4731f3b1.mshefty@ichips.intel.com> On Mon, 06 Sep 2004 15:36:27 -0700 Roland Dreier wrote: > Hal> Also, if this were to occur in the middle of an RMPP > Hal> transaction, should this be detected and any special actions > Hal> taken ? Or would this just rely on normal RMPP handling at > Hal> the other end to detect any issues ? > > There's no reason for a send request to fail in normal operation, so I > don't see much that the MAD layer can try to do to recover. In any > case, I'd like to see some real code that actually can be used, at > least in the normal case as soon as possible. So I would suggest > deferring secondary issues like this until after we have a working > implementation. In such a case, we can just complete the MAD with an error. > Hal> I also have a related implementation question. The ib_mad > Hal> client supplies wr_id in the send WR. If it turns out that it > Hal> might be better to use wr_ids in some special encoded way, is > Hal> it acceptable to do that as long as the client wr_id is > Hal> returned in the send WC ? 
> > I don't see any reason why not. Do you see any problems? I think that's perfectly acceptable. It may be nice to avoid saving/restoring wr_id, but I'm not sure if it's possible to do that and track RMPP and request/responses. From iod00d at hp.com Tue Sep 7 09:37:33 2004 From: iod00d at hp.com (Grant Grundler) Date: Tue, 7 Sep 2004 09:37:33 -0700 Subject: [openib-general] [PATCH] Yet more sysfs support In-Reply-To: <20040907055510.GA8185@mellanox.co.il> References: <35EA21F54A45CB47B879F21A91F4862F1DED9F@taurus.voltaire.com> <52fz5vni50.fsf@topspin.com> <523c1vnh4y.fsf@topspin.com> <20040907055510.GA8185@mellanox.co.il> Message-ID: <20040907163733.GA6341@cup.hp.com> On Tue, Sep 07, 2004 at 08:55:10AM +0300, Michael S. Tsirkin wrote: > > # python ~roland/bin/decode-cap mthca0 2 > > Capabilities for mthca0 port 2: > > IsTrapSupported > > IsAutomaticMigrationSupported > > IsSLMappingSupported > > IsLEDInfoSupported > > IsSystemImageGUIDSupported > > IsVendorClassSupported > > IsCapabilityMaskNoticeSupported > > Wouldnt it be nicer to have one file per bit here? I think it's a bit excessive too. My gut feeling was anything that's parsing those bits is probably written in C. grant From mshefty at ichips.intel.com Tue Sep 7 08:53:52 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 7 Sep 2004 08:53:52 -0700 Subject: [openib-general] Re: ib_mad.h ib_mad_post_send PCI mapping In-Reply-To: <1094555920.1864.50.camel@localhost.localdomain> References: <1094555920.1864.50.camel@localhost.localdomain> Message-ID: <20040907085352.7b1b9653.mshefty@ichips.intel.com> On Tue, 07 Sep 2004 07:18:41 -0400 Hal Rosenstock wrote: > I have another ib_mad_post_send question relative to PCI mapping. Is it > correct to assume that like ib_post_send that any PCI mapping (and cache > synchronization) is to be performed in the client rather than in > ib_mad_post_send itself ? Not entirely sure what you mean here. Are you asking if the buffers are registered with the HCA? 
If so, then my assumption is that the client did this.

From roland at topspin.com Tue Sep 7 10:20:04 2004
From: roland at topspin.com (Roland Dreier)
Date: Tue, 07 Sep 2004 10:20:04 -0700
Subject: [openib-general] Re: ib_mad.h ib_mad_post_send PCI mapping
In-Reply-To: <20040907085352.7b1b9653.mshefty@ichips.intel.com> (Sean Hefty's message of "Tue, 7 Sep 2004 08:53:52 -0700")
References: <1094555920.1864.50.camel@localhost.localdomain> <20040907085352.7b1b9653.mshefty@ichips.intel.com>
Message-ID: <52hdqakncr.fsf@topspin.com>

    Hal> I have another ib_mad_post_send question relative to PCI
    Hal> mapping. Is it correct to assume that like ib_post_send that
    Hal> any PCI mapping (and cache synchronization) is to be
    Hal> performed in the client rather than in ib_mad_post_send
    Hal> itself ?

    Sean> Not entirely sure what you mean here. Are you asking if the
    Sean> buffers are registered with the HCA? If so, then my
    Sean> assumption is that the client did this.

Actually I think Hal is asking who calls pci_map()/pci_unmap() (I
think the plan for memory registration is just for the MAD layer to
register all of RAM or use the reserved L_Key).

I agree that it makes sense for the consumer to do the PCI mapping
since that allows consumers to do things like allocate MADs with
pci_pool_alloc() and avoid calling pci_map at all.

 - R.

From mshefty at ichips.intel.com Tue Sep 7 10:03:44 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 7 Sep 2004 10:03:44 -0700
Subject: [openib-general] [PATCH] ib_mad.h: Add bad_send_wr parameter to ib_mad_post_send()
In-Reply-To: <1094512736.1864.43.camel@localhost.localdomain>
References: <1094512736.1864.43.camel@localhost.localdomain>
Message-ID: <20040907100344.74e61210.mshefty@ichips.intel.com>

On Mon, 06 Sep 2004 19:18:56 -0400
Hal Rosenstock wrote:

> Add bad_send_wr parameter to ib_mad_post_send()
> Also modified the ib_mad_send_wc (but not the recv one nor eliminated
> the ib_mad_send_wr structure.

Thanks!
applied

From iod00d at hp.com Tue Sep 7 11:08:04 2004
From: iod00d at hp.com (Grant Grundler)
Date: Tue, 7 Sep 2004 11:08:04 -0700
Subject: [openib-general] Re: ib_mad.h ib_mad_post_send PCI mapping
In-Reply-To: <52hdqakncr.fsf@topspin.com>
References: <1094555920.1864.50.camel@localhost.localdomain> <20040907085352.7b1b9653.mshefty@ichips.intel.com> <52hdqakncr.fsf@topspin.com>
Message-ID: <20040907180804.GD6341@cup.hp.com>

On Tue, Sep 07, 2004 at 10:20:04AM -0700, Roland Dreier wrote:
> Actually I think Hal is asking who calls pci_map()/pci_unmap() (I
> think the plan for memory registration is just for the MAD layer to
> register all of RAM or use the reserved L_Key).
>
> I agree that it makes sense for the consumer to do the PCI mapping
> since that allows consumers to do things like allocate MADs with
> pci_pool_alloc() and avoid calling pci_map at all.

pci_pool_alloc() is NOT a substitute for pci_map/unmap.
See drivers/base/dmapool.c:
 * dma_pool_create - Creates a pool of consistent memory blocks, for dma.

The HCA driver can use pci_pool_alloc() to manage data shared with
the HCA. But payload data needs to be mapped/unmapped for each DMA.

In April/May I exchanged emails on this topic as well. The thinking
then was the client (e.g. IPoIB) would handle DMA mapping to be more
like a traditional (NIC) driver. I'm still not sure that's the right
approach but see no reason why that can't work.

grant

From halr at voltaire.com Tue Sep 7 11:37:48 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Tue, 07 Sep 2004 14:37:48 -0400
Subject: [openib-general] ib_verbs.h: ib_recv_flags definition
Message-ID: <1094582267.3500.129.camel@localhost.localdomain>

Do we need to add ib_recv_flags to ib_verbs.h for use in setting
ib_recv_wr.recv_flags ?
enum ib_recv_flags { IB_RECV_SIGNALED = 1 }; -- Hal From mshefty at ichips.intel.com Tue Sep 7 10:45:16 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 7 Sep 2004 10:45:16 -0700 Subject: [openib-general] Re: ib_verbs.h: ib_recv_flags definition In-Reply-To: <1094582267.3500.129.camel@localhost.localdomain> References: <1094582267.3500.129.camel@localhost.localdomain> Message-ID: <20040907104516.699e8fb9.mshefty@ichips.intel.com> On Tue, 07 Sep 2004 14:37:48 -0400 Hal Rosenstock wrote: > Do we need to add ib_recv_flags to ib_verbs.h for use in setting > ib_recv_wr.recv_flags ? > > enum ib_recv_flags { > IB_RECV_SIGNALED = 1 > }; Can/will mthca support this flag? From halr at voltaire.com Tue Sep 7 12:08:05 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 07 Sep 2004 15:08:05 -0400 Subject: [openib-general] Re: ib_verbs.h: ib_recv_flags definition In-Reply-To: <20040907104516.699e8fb9.mshefty@ichips.intel.com> References: <1094582267.3500.129.camel@localhost.localdomain> <20040907104516.699e8fb9.mshefty@ichips.intel.com> Message-ID: <1094584084.3500.150.camel@localhost.localdomain> On Tue, 2004-09-07 at 13:45, Sean Hefty wrote: > On Tue, 07 Sep 2004 14:37:48 -0400 > Hal Rosenstock wrote: > > > Do we need to add ib_recv_flags to ib_verbs.h for use in setting > > ib_recv_wr.recv_flags ? > > > > enum ib_recv_flags { > > IB_RECV_SIGNALED = 1 > > }; > > Can/will mthca support this flag? It's already there: /home/hal/openib/gen2/branches/roland-merge/src/linux-kernel/infiniband [hal at localhost infiniband]$ grep IB_RECV_SIGN */*.h include/ib_verbs.h: IB_RECV_SIGNALED = 1 [hal at localhost infiniband]$ grep IB_RECV_SIGN */*.c core/mad_ib.c: receive_param.recv_flags = IB_RECV_SIGNALED; [hal at localhost infiniband]$ grep IB_RECV_SIGN */*/*.c hw/mthca/mthca_qp.c: (wr->recv_flags & IB_RECV_SIGNALED) ? 
ulp/ipoib/ipoib_ib.c: .recv_flags = IB_RECV_SIGNALED ulp/sdp/sdp_recv.c: receive_param.recv_flags = IB_RECV_SIGNALED; ulp/srp/srptp.c: rcv_param.recv_flags = IB_RECV_SIGNALED; -- Hal From roland at topspin.com Tue Sep 7 12:44:06 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 07 Sep 2004 12:44:06 -0700 Subject: [openib-general] Re: ib_mad.h ib_mad_post_send PCI mapping In-Reply-To: <20040907180804.GD6341@cup.hp.com> (Grant Grundler's message of "Tue, 7 Sep 2004 11:08:04 -0700") References: <1094555920.1864.50.camel@localhost.localdomain> <20040907085352.7b1b9653.mshefty@ichips.intel.com> <52hdqakncr.fsf@topspin.com> <20040907180804.GD6341@cup.hp.com> Message-ID: <52zn41kgop.fsf@topspin.com> Grant> pci_pool_alloc() is NOT a substitute for pci_map/unmap. Grant> See drivers/base/dmapool.c: * dma_pool_create - Creates a Grant> pool of consistent memory blocks, for dma. It's not a direct substitute but at a high level, both pci_map and pci_pool_alloc are a way to get a usable DMA address for a chunk of kernel memory. Grant> The HCA driver can use pci_pool_alloc() to manage data Grant> shared with the HCA. But payload data needs to be Grant> mapped/unmapped for each DMA. Actually that's up to the consumer. It might be more convenient for a consumer to use the pci_pool stuff to keep some consistent buffers around and use them instead of dynamically mapping for every DMA. Grant> In April/May I exchanged emails on this topic as well. The Grant> thinking then was the client (eg IPoIB) would handle DMA Grant> mapping to be more like a traditional (NIC) driver. I'm Grant> still not sure that's the right approach but see no reason Grant> why that can't work. It is working in the sense of both IPoIB and the MAD layer implementing that approach and working on a large range of Linux archs. 
 - Roland

From iod00d at hp.com Tue Sep 7 12:57:10 2004
From: iod00d at hp.com (Grant Grundler)
Date: Tue, 7 Sep 2004 12:57:10 -0700
Subject: [openib-general] Re: ib_mad.h ib_mad_post_send PCI mapping
In-Reply-To: <52zn41kgop.fsf@topspin.com>
References: <1094555920.1864.50.camel@localhost.localdomain> <20040907085352.7b1b9653.mshefty@ichips.intel.com> <52hdqakncr.fsf@topspin.com> <20040907180804.GD6341@cup.hp.com> <52zn41kgop.fsf@topspin.com>
Message-ID: <20040907195710.GI6341@cup.hp.com>

On Tue, Sep 07, 2004 at 12:44:06PM -0700, Roland Dreier wrote:
>     Grant> pci_pool_alloc() is NOT a substitute for pci_map/unmap.
>     Grant> See drivers/base/dmapool.c: * dma_pool_create - Creates a
>     Grant> pool of consistent memory blocks, for dma.
>
> It's not a direct substitute but at a high level, both pci_map and
> pci_pool_alloc are a way to get a usable DMA address for a chunk of
> kernel memory.

But the semantics of "consistent" DMA mappings are different from those
of "streaming" DMA mappings. pci_map/unmap() deals with the latter
while pci_pool_alloc() only deals with the former.

>     Grant> The HCA driver can use pci_pool_alloc() to manage data
>     Grant> shared with the HCA. But payload data needs to be
>     Grant> mapped/unmapped for each DMA.
>
> Actually that's up to the consumer. It might be more convenient for a
> consumer to use the pci_pool stuff to keep some consistent buffers
> around and use them instead of dynamically mapping for every DMA.

oic. But please remember "consistent" and "streaming" mappings are
very different things. If you are working on a client, please read
Documentation/DMA-mapping.txt to understand the differences.

> It is working in the sense of both IPoIB and the MAD layer
> implementing that approach and working on a large range of Linux
> archs.
*nod* thanks, grant From mshefty at ichips.intel.com Tue Sep 7 11:58:40 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 7 Sep 2004 11:58:40 -0700 Subject: [openib-general] Re: ib_verbs.h: ib_recv_flags definition In-Reply-To: <1094584084.3500.150.camel@localhost.localdomain> References: <1094582267.3500.129.camel@localhost.localdomain> <20040907104516.699e8fb9.mshefty@ichips.intel.com> <1094584084.3500.150.camel@localhost.localdomain> Message-ID: <20040907115840.63928721.mshefty@ichips.intel.com> On Tue, 07 Sep 2004 15:08:05 -0400 Hal Rosenstock wrote: > On Tue, 2004-09-07 at 13:45, Sean Hefty wrote: > > On Tue, 07 Sep 2004 14:37:48 -0400 > > Hal Rosenstock wrote: > > > > > Do we need to add ib_recv_flags to ib_verbs.h for use in setting > > > ib_recv_wr.recv_flags ? > > > > > > enum ib_recv_flags { > > > IB_RECV_SIGNALED = 1 > > > }; > > > > Can/will mthca support this flag? All-righty then... I've added this to ib_verbs.h and checked it in. -- From halr at voltaire.com Tue Sep 7 14:57:36 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 07 Sep 2004 17:57:36 -0400 Subject: [openib-general] ib_mad.h ib_mad_recv_wc.mad_flags Message-ID: <1094594256.3500.170.camel@localhost.localdomain> Is ib_mad_recv_wc.mad_flags needed ? The only mad_flag defined is IB_MAD_GRH_VALID and that is redundant with ib_mad_recv_wc.grh == NULL. -- Hal From mshefty at ichips.intel.com Tue Sep 7 14:12:27 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 7 Sep 2004 14:12:27 -0700 Subject: [openib-general] Re: ib_mad.h ib_mad_recv_wc.mad_flags In-Reply-To: <1094594256.3500.170.camel@localhost.localdomain> References: <1094594256.3500.170.camel@localhost.localdomain> Message-ID: <20040907141227.05dcbabb.mshefty@ichips.intel.com> On Tue, 07 Sep 2004 17:57:36 -0400 Hal Rosenstock wrote: > Is ib_mad_recv_wc.mad_flags needed ? The only mad_flag defined is > IB_MAD_GRH_VALID and that is redundant with ib_mad_recv_wc.grh == NULL. It may be redundant. 
But, I need to think more about whether it can be removed. There may be a case (with QP redirection perhaps?) where we'd want to return the GRH buffer to the user, but mark that the data's invalid. I will see if there's a way to have the MAD receive completion match closer with the normal receive completion case. For example, ib_wc is the only structure that uses bit fields for flags. Other structures in the API use an integer set using enum values. From mshefty at ichips.intel.com Tue Sep 7 16:10:55 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 7 Sep 2004 16:10:55 -0700 Subject: [openib-general] Re: ib_mad.h ib_mad_recv_wc.mad_flags In-Reply-To: <20040907141227.05dcbabb.mshefty@ichips.intel.com> References: <1094594256.3500.170.camel@localhost.localdomain> <20040907141227.05dcbabb.mshefty@ichips.intel.com> Message-ID: <20040907161055.7533d9b2.mshefty@ichips.intel.com> On Tue, 7 Sep 2004 14:12:27 -0700 Sean Hefty wrote: > I will see if there's a way to have the MAD receive completion match closer with the normal receive completion case. For example, ib_wc is the only structure that uses bit fields for flags. Other structures in the API use an integer set using enum values. Here's a patch for discussion that changes ib_mad_recv_wc to use ib_wc. The intent of the patch is to allow an API that: supports 0-copy receives, is efficient for processing non-RMPP receives, and helps simplify the underlying MAD implementation. Along these same lines, I thought about removing ib_mad_send_wc, and using ib_wc directly. I held off on this change for now. I also thought about replacing the bit fields in ib_wc with a single flag field. If I do that I'll submit a patch for Roland's stack, since it changes the verbs API. Comments? 
- Sean Index: ib_mad.h =================================================================== --- ib_mad.h (revision 750) +++ ib_mad.h (working copy) @@ -130,10 +130,6 @@ u32 hi_tid; }; -enum ib_mad_flags { - IB_MAD_GRH_VALID = 1 -}; - /** * ib_mad_send_wc - MAD send completion information. * @wr_id - Work request identifier associated with the send MAD request. @@ -148,36 +144,32 @@ }; /** - * ib_mad_recv_wc - received MAD information. - * @wr_id - For received response, set to the work request identifier specified - * for the corresponding send request. + * ib_mad_recv_buf - received MAD buffer information. + * @list - Reference to next data buffer for a received RMPP MAD. * @grh - References a data buffer containing the global route header. * The data refereced by this buffer is only valid if the GRH is * valid. * @mad - References the start of the received MAD. - * @length - Specifies the size of the received MAD. - * @mad_flags - Flags used to specify information about the received MAD. + */ +struct ib_mad_recv_buf { + struct list_head list; + struct ib_grh *grh; + struct ib_mad *mad; +}; + +/** + * ib_mad_recv_wc - received MAD information. + * @wc - Completion information for the received data. + * @recv_buf - Specifies the location of the received data buffer(s). * @mad_len - The length of the received MAD, without duplicated headers. - * @src_qpn - Source QP. - * @pkey_index - Pkey index. - * @slid - LID of remote QP. - * @sl - Service level of source for a received message. - * @dlid_path_bits - Path bits of source for a received message. * - * An RMPP receive will be coalesced into a single data buffer. + * For received response, the wr_id field of the wc is set to the wr_id + * for the corresponding send request. 
*/ struct ib_mad_recv_wc { - u64 wr_id; - struct ib_grh *grh; - struct ib_mad *mad; - u32 length; - int mad_flags; - u32 mad_len; - u32 src_qp; - u16 pkey_index; - u16 slid; - u8 sl; - u8 dlid_path_bits; + struct ib_wc *wc; + struct ib_mad_recv_buf *recv_buf; + int mad_len; }; /** From mst at mellanox.co.il Tue Sep 7 21:44:37 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 8 Sep 2004 07:44:37 +0300 Subject: [openib-general] test pls ignore Message-ID: <20040908044437.GA4991@mellanox.co.il> sorry From mshefty at ichips.intel.com Wed Sep 8 13:07:17 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 8 Sep 2004 13:07:17 -0700 Subject: [openib-general] [PATCH] replace grh_flag with ah_flags in ib_ah Message-ID: <20040908130717.03cf8684.mshefty@ichips.intel.com> This patch replaces the u8 grh_flag with a more generic ah_flags for consistency with the rest of the API. - Sean -- Index: ulp/ipoib/ipoib_arp.c =================================================================== --- ulp/ipoib/ipoib_arp.c (revision 756) +++ ulp/ipoib/ipoib_arp.c (working copy) @@ -411,7 +411,7 @@ .sl = path->sl, .src_path_bits = 0, .static_rate = 0, - .grh_flag = 0, + .ah_flags = 0, .port_num = priv->port }; Index: ulp/ipoib/ipoib_multicast.c =================================================================== --- ulp/ipoib/ipoib_multicast.c (revision 756) +++ ulp/ipoib/ipoib_multicast.c (working copy) @@ -247,7 +247,7 @@ .sl = mcast->mcast_member.sl, .src_path_bits = 0, .static_rate = 0, - .grh_flag = 1, + .ah_flags = IB_AH_GRH, .grh = { .flow_label = mcast->mcast_member.flowlabel, .hop_limit = mcast->mcast_member.hoplmt, Index: include/ib_verbs.h =================================================================== --- include/ib_verbs.h (revision 756) +++ include/ib_verbs.h (working copy) @@ -214,13 +214,17 @@ u8 traffic_class; }; +enum ib_ah_flags { + IB_AH_GRH = 1 +}; + struct ib_ah_attr { struct ib_global_route grh; u16 dlid; u8 sl; u8 src_path_bits; u8 
static_rate; - u8 grh_flag; + u8 ah_flags; u8 port_num; }; Index: core/cm_path_migration.c =================================================================== --- core/cm_path_migration.c (revision 756) +++ core/cm_path_migration.c (working copy) @@ -57,7 +57,7 @@ qp_attr->ah_attr.dlid = connection->alternate_path.dlid; qp_attr->ah_attr.src_path_bits = connection->alternate_path.slid & 0x7f; qp_attr->ah_attr.static_rate = 0; - qp_attr->ah_attr.grh_flag = 0; + qp_attr->ah_attr.ah_flags = 0; qp_attr->path_mig_state = IB_MIG_REARM; if (ib_cached_gid_find(connection->alternate_path.sgid, NULL, Index: core/mad_ib.c =================================================================== --- core/mad_ib.c (revision 756) +++ core/mad_ib.c (working copy) @@ -73,11 +73,11 @@ av.dlid = mad->dlid; av.port_num = mad->port; av.src_path_bits = 0; - av.grh_flag = mad->has_grh; + av.ah_flags = (mad->has_grh ? IB_AH_GRH : 0); av.sl = mad->sl; av.static_rate = 0; - if (av.grh_flag) { + if (mad.has_grh) { av.grh.sgid_index = mad->gid_index; av.grh.flow_label = mad->flow_label; av.grh.hop_limit = mad->hop_limit; Index: core/cm_passive.c =================================================================== --- core/cm_passive.c (revision 756) +++ core/cm_passive.c (working copy) @@ -167,7 +167,7 @@ qp_attr->ah_attr.dlid = connection->primary_path.dlid; qp_attr->ah_attr.src_path_bits = connection->primary_path.slid & 0x7f; qp_attr->ah_attr.static_rate = 0; - qp_attr->ah_attr.grh_flag = 0; + qp_attr->ah_attr.ah_flags = 0; attr_mask = IB_QP_STATE | @@ -283,7 +283,7 @@ qp_attr->alt_ah_attr.dlid = connection->alternate_path.dlid; qp_attr->alt_ah_attr.src_path_bits = connection->alternate_path.slid & 0x7f; qp_attr->alt_ah_attr.static_rate = 0; - qp_attr->alt_ah_attr.grh_flag = 0; + qp_attr->alt_ah_attr.ah_flags = 0; qp_attr->path_mig_state = IB_MIG_REARM; ib_cached_gid_find(connection->alternate_path.sgid, NULL, &qp_attr->alt_port_num, NULL); Index: core/cm_active.c 
=================================================================== --- core/cm_active.c (revision 756) +++ core/cm_active.c (working copy) @@ -288,7 +288,7 @@ qp_attr->alt_ah_attr.dlid = connection->alternate_path.dlid; qp_attr->alt_ah_attr.src_path_bits = connection->alternate_path.slid & 0x7f; qp_attr->alt_ah_attr.static_rate = 0; - qp_attr->alt_ah_attr.grh_flag = 0; + qp_attr->alt_ah_attr.ah_flags = 0; qp_attr->path_mig_state = IB_MIG_REARM; ib_cached_gid_find(connection->alternate_path.sgid, NULL, @@ -418,7 +418,7 @@ qp_attr->ah_attr.dlid = connection->primary_path.dlid; qp_attr->ah_attr.src_path_bits = connection->primary_path.slid & 0x7f; qp_attr->ah_attr.static_rate = 0; - qp_attr->ah_attr.grh_flag = 0; + qp_attr->ah_attr.ah_flags = 0; attr_mask = IB_QP_STATE | Index: hw/mthca/mthca_av.c =================================================================== --- hw/mthca/mthca_av.c (revision 756) +++ hw/mthca/mthca_av.c (working copy) @@ -82,12 +82,13 @@ memset(av, 0, MTHCA_AV_SIZE); av->port_pd = cpu_to_be32(pd->pd_num | (ah_attr->port_num << 24)); - av->g_slid = (!!ah_attr->grh_flag << 7) | ah_attr->src_path_bits; - av->dlid = cpu_to_be16(ah_attr->dlid); - av->msg_sr = (3 << 4) | /* 2K message */ + av->g_slid = ah_attr->src_path_bits; + av->dlid = cpu_to_be16(ah_attr->dlid); + av->msg_sr = (3 << 4) | /* 2K message */ ah_attr->static_rate; av->sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28); - if (ah_attr->grh_flag) { + if (ah_attr->ah_flags & IB_AH_GRH) { + av->g_slid |= 0x80; av->gid_index = (ah_attr->port_num - 1) * dev->limits.gid_table_len + ah_attr->grh.sgid_index; av->hop_limit = ah_attr->grh.hop_limit; Index: hw/mthca/mthca_qp.c =================================================================== --- hw/mthca/mthca_qp.c (revision 756) +++ hw/mthca/mthca_qp.c (working copy) @@ -593,7 +593,7 @@ qp_context->pri_path.g_mylmc = attr->ah_attr.src_path_bits & 0x7f; qp_context->pri_path.rlid = cpu_to_be16(attr->ah_attr.dlid); 
qp_context->pri_path.static_rate = (!!attr->ah_attr.static_rate) << 3; - if (attr->ah_attr.grh_flag) { + if (attr->ah_attr.ah_flags & IB_AH_GRH) { qp_context->pri_path.g_mylmc |= 1 << 7; qp_context->pri_path.mgid_index = attr->ah_attr.grh.sgid_index; qp_context->pri_path.hop_limit = attr->ah_attr.grh.hop_limit; From roland at topspin.com Wed Sep 8 14:38:59 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 08 Sep 2004 14:38:59 -0700 Subject: [openib-general] Re: [PATCH] replace grh_flag with ah_flags in ib_ah In-Reply-To: <20040908130717.03cf8684.mshefty@ichips.intel.com> (Sean Hefty's message of "Wed, 8 Sep 2004 13:07:17 -0700") References: <20040908130717.03cf8684.mshefty@ichips.intel.com> Message-ID: <52656oigp8.fsf@topspin.com> Thanks, I've applied this. In general though I think we should try and focus on the major issues in getting a working stack rather than making this kind of fiddly change. - R. From mshefty at ichips.intel.com Wed Sep 8 13:57:22 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 8 Sep 2004 13:57:22 -0700 Subject: [openib-general] Re: [PATCH] replace grh_flag with ah_flags in ib_ah In-Reply-To: <52656oigp8.fsf@topspin.com> References: <20040908130717.03cf8684.mshefty@ichips.intel.com> <52656oigp8.fsf@topspin.com> Message-ID: <20040908135722.456a578f.mshefty@ichips.intel.com> On Wed, 08 Sep 2004 14:38:59 -0700 Roland Dreier wrote: > Thanks, I've applied this. > > In general though I think we should try and focus on the major issues > in getting a working stack rather than making this kind of fiddly change. Well, there will be one more fiddly change coming to convert the WC bit fields to flags, and I would much rather get API changes in sooner rather than later. Both of these were only noticed while revisiting the GSI implementation issues. I do agree that we want to focus on a working stack, and I'm expecting that tomorrow at least Hal and I will discuss how to coordinate the MAD work. 
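The flag-word convention that the ah_flags patch above adopts, and that Sean proposes extending to the work completion, can be sketched in a few lines of userspace C. This is only an illustration: the enum values are copied from the patches in this thread, while struct demo_wc and the demo_* helpers are hypothetical names invented here and are not part of ib_verbs.h. One practical wrinkle with the older "int grh_flag:1" form is that C leaves it implementation-defined whether a plain int bit field is signed, so a stored 1 may read back as -1; masking an int flag word avoids that question.

```c
#include <assert.h>

/* Flag values as they appear in the patches in this thread. */
enum ib_wc_flags {
	IB_WC_GRH      = 1,
	IB_WC_WITH_IMM = (1 << 1)
};

/* Hypothetical cut-down completion entry; the real struct ib_wc has more fields. */
struct demo_wc {
	int wc_flags;
};

/* Hypothetical helper: build the flag word from two booleans. */
static int demo_make_wc_flags(int has_grh, int has_imm)
{
	int flags = 0;

	if (has_grh)
		flags |= IB_WC_GRH;
	if (has_imm)
		flags |= IB_WC_WITH_IMM;
	return flags;
}

/* Hypothetical helpers: test a single flag, normalised to 0 or 1. */
static int demo_wc_has_grh(const struct demo_wc *wc)
{
	return (wc->wc_flags & IB_WC_GRH) != 0;
}

static int demo_wc_has_imm(const struct demo_wc *wc)
{
	return (wc->wc_flags & IB_WC_WITH_IMM) != 0;
}
```

A caller would test a flag with a mask, for example "if (wc->wc_flags & IB_WC_GRH)", which is the same shape as the mad_ib.c and mthca_qp.c hunks in the patches.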
From mshefty at ichips.intel.com Wed Sep 8 14:28:47 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 8 Sep 2004 14:28:47 -0700 Subject: [openib-general] Re: [PATCH] replace grh_flag with ah_flags in ib_ah In-Reply-To: <20040908135722.456a578f.mshefty@ichips.intel.com> References: <20040908130717.03cf8684.mshefty@ichips.intel.com> <52656oigp8.fsf@topspin.com> <20040908135722.456a578f.mshefty@ichips.intel.com> Message-ID: <20040908142847.3ff5a3e6.mshefty@ichips.intel.com> On Wed, 8 Sep 2004 13:57:22 -0700 Sean Hefty wrote: > convert the WC bit fields to flags This patch should replace the WC bit fields with flags. - Sean Index: infiniband/include/ib_verbs.h =================================================================== --- infiniband/include/ib_verbs.h (revision 756) +++ infiniband/include/ib_verbs.h (working copy) @@ -263,6 +263,11 @@ IB_WC_RECV_RDMA_WITH_IMM }; +enum ib_wc_flags { + IB_WC_GRH = 1, + IB_WC_WITH_IMM = (1<<1) +}; + struct ib_wc { u64 wr_id; enum ib_wc_status status; @@ -271,8 +276,7 @@ u32 byte_len; u32 imm_data; u32 src_qp; - int grh_flag:1; - int imm_data_valid:1; + int wc_flags; u16 pkey_index; u16 slid; u8 sl; Index: infiniband/core/mad_ib.c =================================================================== --- infiniband/core/mad_ib.c (revision 756) +++ infiniband/core/mad_ib.c (working copy) @@ -230,7 +230,7 @@ mad->sqpn = entry->src_qp; mad->dqpn = wrid.field.qpn; - if (entry->grh_flag) { + if (entry->wc_flags & IB_WC_GRH) { u32 *grh = (void *) mad - IB_MAD_GRH_SIZE; mad->has_grh = 1; /* First 32 bytes of GRH have 4 bits of Index: infiniband/hw/mthca/mthca_cq.c =================================================================== --- infiniband/hw/mthca/mthca_cq.c (revision 756) +++ infiniband/hw/mthca/mthca_cq.c (working copy) @@ -459,18 +459,18 @@ switch (cqe->opcode & 0x1f) { case IB_OPCODE_SEND_LAST_WITH_IMMEDIATE: case IB_OPCODE_SEND_ONLY_WITH_IMMEDIATE: - entry->imm_data_valid = 1; + entry->wc_flags = IB_WC_WITH_IMM; 
entry->imm_data = cqe->imm_etype_pkey_eec; entry->opcode = IB_WC_RECV; break; case IB_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE: case IB_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE: - entry->imm_data_valid = 1; + entry->wc_flags = IB_WC_WITH_IMM; entry->imm_data = cqe->imm_etype_pkey_eec; entry->opcode = IB_WC_RECV_RDMA_WITH_IMM; break; default: - entry->imm_data_valid = 0; + entry->wc_flags = 0; entry->opcode = IB_WC_RECV; break; } @@ -479,7 +479,8 @@ entry->src_qp = be32_to_cpu(cqe->rqpn) & 0xffffff; entry->dlid_path_bits = be16_to_cpu(cqe->sl_g_mlpath) & 0x7f; entry->pkey_index = be32_to_cpu(cqe->imm_etype_pkey_eec) >> 16; - entry->grh_flag = !!(be16_to_cpu(cqe->sl_g_mlpath) & 0x80); + entry->wc_flags |= ((be16_to_cpu(cqe->sl_g_mlpath) & 0x80) ? + IB_WC_GRH : 0); } entry->status = IB_WC_SUCCESS; From mshefty at ichips.intel.com Wed Sep 8 14:45:05 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 8 Sep 2004 14:45:05 -0700 Subject: [openib-general] merging software stacks Message-ID: <20040908144505.437029b4.mshefty@ichips.intel.com> Does anyone have any objection to merging the gen2 files from /trunk/contrib/intel with /gen2/branches/openib-candidate/src/linux-kernel/infiniband? This will merge the verbs definitions with the GSI implementation being developed, to avoid duplicating the ib_verbs.h, ib_verbs.c, and ib_mad.h files between those two branches. - Sean -- From greg at kroah.com Wed Sep 8 16:02:55 2004 From: greg at kroah.com (Greg KH) Date: Wed, 8 Sep 2004 16:02:55 -0700 Subject: [openib-general] Re: [PATCH] Yet more sysfs support In-Reply-To: <52fz5vni50.fsf@topspin.com> References: <35EA21F54A45CB47B879F21A91F4862F1DED9F@taurus.voltaire.com> <52fz5vni50.fsf@topspin.com> Message-ID: <20040908230255.GA30603@kroah.com> On Mon, Sep 06, 2004 at 03:32:11PM -0700, Roland Dreier wrote: > +#define PORT_ATTR(_name, _mode, _show, _store) \ Try using the __ATTR and __ATTR_RO macros to make these and the places you use it, simpler. 
Man, it really would be nice if it was easier to create subdirectories
in the driver model without being forced to drop down to the kobject
layer, wouldn't it :(

Patches look good, nice job.

thanks,

greg k-h

From mshefty at ichips.intel.com Wed Sep 8 16:47:39 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 8 Sep 2004 16:47:39 -0700
Subject: [openib-general] ib_mad.c comments
Message-ID: <20040908164739.3e9c8723.mshefty@ichips.intel.com>

Here's a list of comments from reviewing ib_mad.c. I'll use this list as kind of my to-do list for the GSI. Several of these can be delayed when implementing. After we meet tomorrow, I will begin creating patches. Overall, it's a good start.

ib_mad_reg():
- Need to lock when checking/setting version/class/methods.
- Need to support registrations for "all" methods of a given class.
  (We may want the initial implementation to only do this for now, to shorten the development time.)
- Should we reference qp0 and qp1 with the registration?
- Need to ensure unique tids in case of wrapping.

ib_mad_post_send():
- We should return the error code from ib_post_send in order to handle overruns differently.
- The print level should be lowered from error.
- Should we avoid casting the list_head to a structure where possible?

allocate_method_table():
- Can just use memset to clear the table.

check_class_table():
- Has an extra '{'.

ib_mad_recv_done_handler():
ib_mad_send_done_handler():
- Not sure why these calls search for the corresponding work request.

ib_mad_post_receive_mads():
- I think we can just pass &qp0 or &qp1, rather than a type to ib_mad_post_receive_mad.
- Print level should be lowered from error.
- We can track the number of posted receives to avoid posting overruns.

struct ib_mad_device_private:
- If we make qp0 and qp1 an array, it may simplify the code and remove several checks from the code.

ib_mad_device_open():
- A nit, but it's logically initializing a port on the device.
- Remove +20 to CQ size.
- We could change to using 1 PD/device, versus 1 PD/port. - Not sure if we need to support max_sge on the send queue. This may be substantially larger than what we need. At a minimum, I think that we need 2 for optimal RMPP support. I'm not sure where the trade-off is between SGE versus copying into a single buffer lies. - Sean -- From tduffy at sun.com Wed Sep 8 18:48:18 2004 From: tduffy at sun.com (Tom Duffy) Date: Wed, 08 Sep 2004 18:48:18 -0700 Subject: [openib-general] [PATCH][TRIVIAL] Fix build of mad_ib.c In-Reply-To: <20040908214705.82A802283D6@openib.ca.sandia.gov> References: <20040908214705.82A802283D6@openib.ca.sandia.gov> Message-ID: <1094694498.17038.12.camel@duffman> On Wed, 2004-09-08 at 14:47 -0700, roland at openib.org wrote: > Change grh_flag to generic ah_flags (patch from Sean Hefty) > Modified: gen2/branches/roland-merge/src/linux-kernel/infiniband/core/mad_ib.c > =================================================================== > --- gen2/branches/roland-merge/src/linux-kernel/infiniband/core/mad_ib.c 2004-09-08 08:22:45 UTC (rev 756) > +++ gen2/branches/roland-merge/src/linux-kernel/infiniband/core/mad_ib.c 2004-09-08 21:47:04 UTC (rev 757) > @@ -73,11 +73,11 @@ > av.dlid = mad->dlid; > av.port_num = mad->port; > av.src_path_bits = 0; > - av.grh_flag = mad->has_grh; > + av.ah_flags = mad->has_grh ? 
IB_AH_GRH : 0;
>  	av.sl = mad->sl;
>  	av.static_rate = 0;
>
> -	if (av.grh_flag) {
> +	if (mad.has_grh) {
>  		av.grh.sgid_index = mad->gid_index;
>  		av.grh.flow_label = mad->flow_label;
>  		av.grh.hop_limit = mad->hop_limit;

Pointers, references, who's counting ;)

Index: drivers/infiniband/core/mad_ib.c
===================================================================
--- drivers/infiniband/core/mad_ib.c	(revision 757)
+++ drivers/infiniband/core/mad_ib.c	(working copy)
@@ -77,7 +77,7 @@
 	av.sl = mad->sl;
 	av.static_rate = 0;
 
-	if (mad.has_grh) {
+	if (mad->has_grh) {
 		av.grh.sgid_index = mad->gid_index;
 		av.grh.flow_label = mad->flow_label;
 		av.grh.hop_limit = mad->hop_limit;

From roland at topspin.com Wed Sep 8 20:07:17 2004
From: roland at topspin.com (Roland Dreier)
Date: Wed, 08 Sep 2004 20:07:17 -0700
Subject: [openib-general] [PATCH][TRIVIAL] Fix build of mad_ib.c
In-Reply-To: <1094694498.17038.12.camel@duffman> (Tom Duffy's message of "Wed, 08 Sep 2004 18:48:18 -0700")
References: <20040908214705.82A802283D6@openib.ca.sandia.gov> <1094694498.17038.12.camel@duffman>
Message-ID: <52656ogmxm.fsf@topspin.com>

Thanks, applied.

From roland at topspin.com Wed Sep 8 20:07:23 2004
From: roland at topspin.com (Roland Dreier)
Date: Wed, 08 Sep 2004 20:07:23 -0700
Subject: [openib-general] ib_mad.c comments
In-Reply-To: <20040908164739.3e9c8723.mshefty@ichips.intel.com> (Sean Hefty's message of "Wed, 8 Sep 2004 16:47:39 -0700")
References: <20040908164739.3e9c8723.mshefty@ichips.intel.com>
Message-ID: <524qm8gmxg.fsf@topspin.com>

Huh... I didn't even notice it was checked in... anyway, my comments
follow after some comments on Sean's comments:

    Sean> Need to lock when checking/setting version/class/methods.

I agree for the initial implementation.
Ultimately RCU seems better but I would recommend sticking with locking to start with since it's much easier to code correctly. Sean> We should return the error code from ib_post_send in order Sean> to handle overruns differently. What did we decide about how to handle someone posting more sends than the underlying work queue can hold? In any case I agree with this. Sean> Should we avoid casting the list_head to a structure where Sean> possible? Yes, definitely. It's much better to do something like &mystruct->list rather than relying on the fact that mystruct has a struct list_head as its first member. In fact the usage of list.h is pretty broken throughout ib_mad.c, see below. Sean> Not sure why these calls search for the corresponding work request. Yes -- we know the next request to complete will always be the oldest one we have around, right? Sean> Not sure if we need to support max_sge on the send queue. Sean> This may be substantially larger than what we need. At a Sean> minimum, I think that we need 2 for optimal RMPP support. Sean> I'm not sure where the trade-off is between SGE versus Sean> copying into a single buffer lies. I'm not sure there's much practical difference between copying and using the HCA to do a gather/scatter on a buffer of size 256. The big difference is memory per WQE (at least for mthca): supporting the max_sge means each WQE will be about 1 KB, while using a smaller number means each WQE could be about 128 bytes. OK, my comments (which are based on only a quick read and therefore focused mostly on low-level coding details): kmem_cache_t *ib_mad_cache; seems to be unused -- should be static anyway. 
static u32 ib_mad_client_id = 0; needs to be protected by a lock when used later #define IB_MAD_DEVICE_LIST_LOCK_VAR unsigned long ib_mad_device_list_sflags #define IB_MAD_DEVICE_LIST_LOCK() spin_lock_irqsave(&ib_mad_device_list_lock, ib_mad_device_list_sflags) #define IB_MAD_DEVICE_LIST_UNLOCK() spin_unlock_irqrestore(&ib_mad_device_list_lock, ib_mad_device_list_sflags) Don't use this idiom ... just use the spinlock functions directly. It makes locking code harder to read and review, it leads to wasteful stuff like the below (in ib_mad_reg()): IB_MAD_DEVICE_LIST_LOCK_VAR; IB_MAD_AGENT_LIST_LOCK_VAR; and besides, Documentation/CodingStyle says "macros that depend on having a local variable with a magic name might look like a good thing, but it's confusing as hell when one reads the code and it's prone to breakage from seemingly innocent changes." /* * ib_mad_reg - Register to send/receive MADs. * @device - The device to register with. Start with /** for kernel doc to pick this up. Might be better to put it in a header file so that it's easier to find the documentation (but it's OK to leave it in a .c). struct ib_mad_device_private *entry, *priv = NULL, *head = (struct ib_mad_device_private *) &ib_mad_device_list; This definition of head is totally broken, since ib_mad_device_list is declared as: static struct list_head ib_mad_device_list; so trying to use it as a struct ib_mad_device_private is just going off into random memory. However there's no reason to even have a variable named head, since it seems you only use it in: list_for_each(entry, head) { This really should be list_for_each_entry(entry, &ib_mad_device_list, list) { and the definition of struct ib_mad_device_private needs to be fixed from struct ib_mad_device_private { struct ib_mad_device_private *next; to struct ib_mad_device_private { struct list_head list; (you don't have to use the name list for your struct list_head member; that's just my habit). 
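The list.h usage recommended above can be tried outside the kernel with a small mock-up. What follows is only a userspace sketch: the list helpers are simplified re-implementations of the kernel macros of the same name (relying on the GCC/Clang __typeof__ extension), and struct sketch_mad_device, sketch_device_list and sketch_find_port() are hypothetical names invented for illustration, not code from ib_mad.c. The list_head is deliberately not the first member, to show that nothing depends on casting.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-ins for <linux/list.h>. */
struct list_head {
	struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Walk a list of structures that embed a struct list_head. */
#define list_for_each_entry(pos, head, member)				\
	for (pos = container_of((head)->next, __typeof__(*pos), member);	\
	     &pos->member != (head);					\
	     pos = container_of(pos->member.next, __typeof__(*pos), member))

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = entry->prev = entry;	/* self-link instead of poisoning */
}

/* Hypothetical per-port structure; the list member is NOT first. */
struct sketch_mad_device {
	int port;
	struct list_head list;
};

static struct list_head sketch_device_list = LIST_HEAD_INIT(sketch_device_list);

/* Look up a device by port number, or return NULL if none is registered. */
static struct sketch_mad_device *sketch_find_port(int port)
{
	struct sketch_mad_device *entry;

	list_for_each_entry(entry, &sketch_device_list, list) {
		if (entry->port == port)
			return entry;
	}
	return NULL;
}
```

Because the caller already holds a pointer to the entry it added, removal is just list_del(&entry->list); no second walk of the list is needed.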
list_for_each(entry2, head2) { if (entry2->agent == mad_agent_priv->agent) { list_del((struct list_head *)entry2); break; } } This is broken for a couple of reasons: misuse of list_for_each as just described; also, you can't delete items from a list while walking through it with list_for_each (use list_for_each_safe instead); finally, there's no reason to walk a list to find the entry you just added in the same function -- just call list_del on the entry directly, since you should still have it around. Pretty much all of these comments apply to all use of the list.h macros in the file -- most look wrong. What context is it allowed to call ib_mad_post_send() from? We never discussed this, but since the current implementation allocates work requests with mad_send_wr = kmalloc(sizeof *mad_send_wr, GFP_KERNEL); right now it can only be called from process context with no locks held. This seems like it violates the principle of least surprise, because ib_post_send() can be called from any context. Also, the failure case if (!mad_send_wr) { printk(KERN_ERR "No memory for ib_mad_send_wr_private\n"); return -ENOMEM; } needs to set bad_send_wr. ib_mad_recv_done_handler() seems to be missing a call to pci_unmap_single(). static u8 convert_mgmt_class(struct ib_mad_reg_req *mad_reg_req) { u8 mgmt_class; /* Alias IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE to 0 */ if (mad_reg_req->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { mgmt_class = 0; } else { mgmt_class = mad_reg_req->mgmt_class; } return mgmt_class; } I'd rewrite this as static inline u8 convert_mgmt_class(struct ib_mad_reg_req *mad_reg_req) { return mad_reg_req->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE ? 0 : mad_reg_req->mgmt_class; } or just open code it in the two places it's used. static int allocate_method_table(struct ib_mad_mgmt_method_table **method) { /* .. 
*/ return ENOMEM; probably should be -ENOMEM; static void ib_mad_completion_handler(struct ib_mad_device_private *priv) { /* * For stack overflow safety reason, WC is static here. * This callback may not be called more than once at the same time. */ static struct ib_wc wc; Seems like a bad plan to me -- on an SMP machine with multiple HCAs (or even multiple ports on a single HCA) it seems like we want to multithread MAD processing rather than serializing it (In fact Yaron has made a lot of noise about running on giant SGI NUMA machines with millions of HCAs, where this looks especially bad). Also the comment seems to be wrong -- there seems to be one thread per HCA so multiple copies of the callback can run at once. static int ib_mad_thread(void *param) { struct ib_mad_device_private *priv = param; struct ib_mad_thread_data *thread_data = &priv->thread_data; lock_kernel(); daemonize("ib_mad-%-6s-%-2d", priv->device->name, priv->port); unlock_kernel(); Just use kthread_create() to start your thread and handle all this. Even though the current Topspin stack uses a MAD processing thread per HCA, I'm not sure it's the best design. Why do we need to defer work to process context? sema_init(&thread_data->sem, 0); Seems like a race condition here ... what happens if someone else tries to use the semaphore before the thread has gotten a chance to run? In any case... while (1) { if (down_interruptible(&thread_data->sem)) { printk(KERN_DEBUG "Exiting ib_mad thread\n"); break; } I don't think it's a good idea to use a semaphore and signals to control the worker thread. Better would be a wait queue and something like wait_event(). #define IB_MAD_DEVICE_SET_UP(__device__) {\ IB_MAD_DEVICE_LIST_LOCK_VAR;\ IB_MAD_DEVICE_LIST_LOCK();\ (__device__)->up = 1;\ IB_MAD_DEVICE_LIST_UNLOCK();} #define IB_MAD_DEVICE_SET_DOWN(__device__) {\ IB_MAD_DEVICE_LIST_LOCK_VAR;\ IB_MAD_DEVICE_LIST_LOCK();\ (__device__)->up = 0;\ IB_MAD_DEVICE_LIST_UNLOCK();} These don't seem to merit being macros. 
If you really want they could be inline functions but I don't see any use of the "up" member outside of the macros anyway, so maybe you can just kill them. It seems hard to think of how to test "up" in a way that's not racy. for (i = 0; i < num_ports; i++) { ret = ib_mad_device_open(device, i); This is wrong -- for a CA you need to handle ports 1 ... num_ports, while a switch just uses port 0. Thanks, Roland From roland at topspin.com Wed Sep 8 20:13:44 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 08 Sep 2004 20:13:44 -0700 Subject: [openib-general] Re: [PATCH] replace grh_flag with ah_flags in ib_ah In-Reply-To: <20040908142847.3ff5a3e6.mshefty@ichips.intel.com> (Sean Hefty's message of "Wed, 8 Sep 2004 14:28:47 -0700") References: <20040908130717.03cf8684.mshefty@ichips.intel.com> <52656oigp8.fsf@topspin.com> <20040908135722.456a578f.mshefty@ichips.intel.com> <20040908142847.3ff5a3e6.mshefty@ichips.intel.com> Message-ID: <52zn40f82f.fsf@topspin.com> Thanks, applied. - R. From roland at topspin.com Wed Sep 8 20:16:44 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 08 Sep 2004 20:16:44 -0700 Subject: [openib-general] Re: [PATCH] Yet more sysfs support In-Reply-To: <20040908230255.GA30603@kroah.com> (Greg KH's message of "Wed, 8 Sep 2004 16:02:55 -0700") References: <35EA21F54A45CB47B879F21A91F4862F1DED9F@taurus.voltaire.com> <52fz5vni50.fsf@topspin.com> <20040908230255.GA30603@kroah.com> Message-ID: <52vfeof7xf.fsf@topspin.com> Greg> Try using the __ATTR and __ATTR_RO macros to make these and Greg> the places you use it, simpler. Thanks (I was still looking at 2.6.7 headers so I missed them). 
The only place I found to use the macros was doing: #define PORT_ATTR(_name, _mode, _show, _store) \ -struct port_attribute port_attr_##_name = { \ - .attr = { .name = __stringify(_name), .mode = _mode, .owner = THIS_MODULE }, \ - .show = _show, \ - .store = _store \ -} +struct port_attribute port_attr_##_name = __ATTR(_name, _mode, _show, _store) Or do you think it's worth defining PORT_ATTR_RO()? I notice you didn't bother in linux/device.h. Greg> Man, it really would be nice if it was easier to create Greg> subdirectories in the driver model without being forced to Greg> drop down to the kobject layer, wouldn't it :( Well, it was a good learning experience once to have to go all the way down to kobjects. Next time I'd like a better way, please ;) - R. From tduffy at sun.com Wed Sep 8 20:37:19 2004 From: tduffy at sun.com (Tom Duffy) Date: Wed, 08 Sep 2004 20:37:19 -0700 Subject: [openib-general] Re: [openib-commits] r759 - in gen2/branches/roland-merge/src/linux-kernel/infiniband: core hw/mthca include In-Reply-To: <20040909032319.D152A2283D6@openib.ca.sandia.gov> References: <20040909032319.D152A2283D6@openib.ca.sandia.gov> Message-ID: <1094701039.17038.16.camel@duffman> On Wed, 2004-09-08 at 20:23 -0700, roland at openib.org wrote: > Log: > Change ib_wc bitfield members to a single flags member (patch from Sean Duffy) I am one with Sean...Muhahaha!!!1!!ONE!! -tduffy
From roland at topspin.com Wed Sep 8 20:54:32 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 08 Sep 2004 20:54:32 -0700 Subject: [openib-general] Re: [openib-commits] r759 - in gen2/branches/roland-merge/src/linux-kernel/infiniband: core hw/mthca include In-Reply-To: <1094701039.17038.16.camel@duffman> (Tom Duffy's message of "Wed, 08 Sep 2004 20:37:19 -0700") References: <20040909032319.D152A2283D6@openib.ca.sandia.gov> <1094701039.17038.16.camel@duffman> Message-ID: <52r7pcf66f.fsf@topspin.com> Roland> Log: Change ib_wc bitfield members to a single flags member Roland> (patch from Sean Duffy) Tom> I am one with Sean...Muhahaha!!!1!!ONE!! What can I say... I was committing patches from both of you... both names end in 'y'... $ svn propset --revprop -r 759 svn:log "Change ib_wc bitfield members to a single flags member (patch from Sean Hefty)" svn: DAV request failed; it's possible that the repository's pre-revprop-change hook either failed or is non-existent looks like we're stuck with it for the moment... - R. From greg at kroah.com Wed Sep 8 22:13:18 2004 From: greg at kroah.com (Greg KH) Date: Wed, 8 Sep 2004 22:13:18 -0700 Subject: [openib-general] Re: [PATCH] Yet more sysfs support In-Reply-To: <52vfeof7xf.fsf@topspin.com> References: <35EA21F54A45CB47B879F21A91F4862F1DED9F@taurus.voltaire.com> <52fz5vni50.fsf@topspin.com> <20040908230255.GA30603@kroah.com> <52vfeof7xf.fsf@topspin.com> Message-ID: <20040909051318.GA11782@kroah.com> On Wed, Sep 08, 2004 at 08:16:44PM -0700, Roland Dreier wrote: > Greg> Try using the __ATTR and __ATTR_RO macros to make these and > Greg> the places you use it, simpler. > > Thanks (I was still looking at 2.6.7 headers so I missed them).
The > only place I found to use the macros was doing: > > #define PORT_ATTR(_name, _mode, _show, _store) \ > -struct port_attribute port_attr_##_name = { \ > - .attr = { .name = __stringify(_name), .mode = _mode, .owner = THIS_MODULE }, \ > - .show = _show, \ > - .store = _store \ > -} > +struct port_attribute port_attr_##_name = __ATTR(_name, _mode, _show, _store) Yes, that's a good place for it. Especially if we rework the attribute code, as will happen pretty soon... > Or do you think it's worth defining PORT_ATTR_RO()? I notice you > didn't bother in linux/device.h. As it looks like that's pretty much all you use it for, wouldn't it make sense for your situation? :) > Greg> Man, it really would be nice if it was easier to create > Greg> subdirectories in the driver model without being forced to > Greg> drop down to the kobject layer, wouldn't it :( > > Well, it was a good learning experience once to have to go all the way > down to kobjects. Next time I'd like a better way, please ;) Yes, it's on my list of things to do... greg k-h From gdror at mellanox.co.il Wed Sep 8 22:47:36 2004 From: gdror at mellanox.co.il (Dror Goldenberg) Date: Thu, 9 Sep 2004 08:47:36 +0300 Subject: [openib-general] IETF and Multicast Aliasing Message-ID: <506C3D7B14CDD411A52C00025558DED605E00E50@mtlex01.yok.mtl.com> There is an IETF meeting in the beginning of November. I was wondering whether it makes sense to bring up the multicast aliasing issue there ? I believe that without any patch to current kernels, it'll be pretty hard to support multicast aliasing which is different than the Ethernet aliasing. There are two options I'd consider: 1) OpenIB stack, once merged into the kernel, will include the kernel patch to fix it. Existing (gen1) implementations will not be able to interoperate with it (for multicast). 2) Discuss with IETF a change in the specification to accommodate similar aliasing to the one that is in Ethernet. It might be too late, though. 
I don't know in depth how this is supposed to work in other operating systems. They may or may not include flexible mechanisms to handle longer MAC addresses for multicast. However, if some of them also don't support it, that may be a valid argument for the IETF too. Thanks Dror From roland at topspin.com Wed Sep 8 22:50:14 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 08 Sep 2004 22:50:14 -0700 Subject: [openib-general] Re: [PATCH] Yet more sysfs support In-Reply-To: <20040909051318.GA11782@kroah.com> (Greg KH's message of "Wed, 8 Sep 2004 22:13:18 -0700") References: <35EA21F54A45CB47B879F21A91F4862F1DED9F@taurus.voltaire.com> <52fz5vni50.fsf@topspin.com> <20040908230255.GA30603@kroah.com> <52vfeof7xf.fsf@topspin.com> <20040909051318.GA11782@kroah.com> Message-ID: <52n000f0tl.fsf@topspin.com> Greg> As it looks like that's pretty much all you use it for, Greg> wouldn't it make sense for your situation? :) Fair enough... Want some DEVICE_ATTR_RO() etc. patches for the kernel? Seems like they would get used all over the place too. - R. From greg at kroah.com Wed Sep 8 23:00:00 2004 From: greg at kroah.com (Greg KH) Date: Wed, 8 Sep 2004 23:00:00 -0700 Subject: [openib-general] Re: [PATCH] Yet more sysfs support In-Reply-To: <52n000f0tl.fsf@topspin.com> References: <35EA21F54A45CB47B879F21A91F4862F1DED9F@taurus.voltaire.com> <52fz5vni50.fsf@topspin.com> <20040908230255.GA30603@kroah.com> <52vfeof7xf.fsf@topspin.com> <20040909051318.GA11782@kroah.com> <52n000f0tl.fsf@topspin.com> Message-ID: <20040909055959.GA8813@kroah.com> On Wed, Sep 08, 2004 at 10:50:14PM -0700, Roland Dreier wrote: > Greg> As it looks like that's pretty much all you use it for, > Greg> wouldn't it make sense for your situation? :) > > Fair enough... > > Want some DEVICE_ATTR_RO() etc. patches for the kernel? Seems like > they would get used all over the place too. Sure, I'll gladly take them.
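For readers following the PORT_ATTR_RO() idea in this thread, here is a rough userspace mock-up of how such a wrapper could be layered on an __ATTR_RO()-style initializer. Everything prefixed sketch_ is a hypothetical stand-in (the real struct attribute and __ATTR_RO() live in the kernel headers and have more fields), and the show() signature is simplified for illustration. The assumption being modelled is that a read-only attribute named X gets mode 0444 and is wired to a function called X_show.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_STRINGIFY_1(x) #x
#define SKETCH_STRINGIFY(x) SKETCH_STRINGIFY_1(x)

/* Hypothetical, reduced stand-in for the kernel's struct attribute. */
struct sketch_attribute {
	const char *name;
	unsigned int mode;
};

/* Hypothetical stand-in for struct port_attribute from the patch. */
struct sketch_port_attribute {
	struct sketch_attribute attr;
	int (*show)(char *buf, size_t len);
	int (*store)(const char *buf, size_t len);
};

/* Read-only initializer: mode 0444, show = <name>_show, no store. */
#define SKETCH_ATTR_RO(_name) {						\
	.attr = { .name = SKETCH_STRINGIFY(_name), .mode = 0444 },	\
	.show = _name##_show,						\
}

/* The wrapper under discussion: one line per read-only port attribute. */
#define PORT_ATTR_RO(_name) \
	struct sketch_port_attribute port_attr_##_name = SKETCH_ATTR_RO(_name)

/* Example attribute; the value printed here is made up. */
static int state_show(char *buf, size_t len)
{
	return snprintf(buf, len, "ACTIVE\n");
}

static PORT_ATTR_RO(state);
```

With this shape, each read-only attribute costs one show function plus one PORT_ATTR_RO(name) line, and the store pointer is left NULL by the designated initializer.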
thanks, greg k-h From David.Brean at Sun.COM Thu Sep 9 05:52:20 2004 From: David.Brean at Sun.COM (David M. Brean) Date: Thu, 09 Sep 2004 08:52:20 -0400 Subject: [openib-general] Multicast address aliasing in IPoIB In-Reply-To: <506C3D7B14CDD411A52C00025558DED605E00235@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED605E00235@mtlex01.yok.mtl.com> Message-ID: <41405204.1080500@sun.com> Hello, [I saw your recent email regarding a potential request to change the I-D at the next IETF meeting, so decided to reply using this email, since it describes the problem.] The IBTA allows multiple MGIDs to map to a MLID. That is the aliasing mechanism on IB. As a result, it seems ok to have IP multicast addresses to map 1-1 to MGIDs. As you know, the IPoIB multicast hardware address is a special format of the MGID. So, the OS should not need to map more than one IP multicast address to a IPoIB hardware address. [There are other steps needed to program the CA interface and obviously the SM must perform some internal mapping management.] Does that solve the problem? -David Dror Goldenberg wrote: > IPoIB defines no aliasing in the mapping of IP multicast address into > IPoIB HW addresses. > In Ethernet, there is an aliasing, i.e. more than one IP address can > map into the same > Ethernet multicast MAC address. > > In short: IP to Ether takes 23 LSbits from the IP address > IP to IB takes 28 LSbits from the IP address (which are > essentially the whole > IP address, the remaining 4 bits are "class D prefix"). > > The problem is that the current IPoIB driver interfaces the Linux > kernel as if it were > an Ethernet driver. Therefore, the IP layer will not notify the > net_device when a new MC > address is added if it maps to the same MAC address. It will rather > increment the > reference count of the MAC address (net_device->mc_list->dmi_user) and > won't call > net_device->set_multicast_list().
> Therefore, if a user just adds itself to an IP MC group (setsockopt with > IP_ADD_MEMBERSHIP), then if the IPoIB driver already has this Ether > MAC address > in its filter because of a previous registration to another IP MC > group, then the IPoIB driver > will not get any notification, and the user will not get registered to > the MCG. > > I was wondering what should be the solution for that in the current > kernels (gen1) and > in future kernels (gen2). > > For gen2, will it be possible to define a new medium for the IPoIB > driver (not ARPHRD_ETHER), > such that arp_mc_map() will map the entire IP address into the HW > address ? Today it looks > impossible, because arp_mc_map() just overrides bits 31:24 of the IP > address. > For gen1, what is it that we can do ? Is there a way to obtain such an > event from the in_device ? > If not, then I don't see any clean escape. Is it possible to > periodically check out the in_device > multicast list and see if anything has changed ? Would that cause any > problem during the transition > periods ? Any other idea of how to do that without a kernel patch ? > > - Dror > > * For reference: > The algorithm for mapping IP mcast address to Ether mcast address > is defined in > RFC 1112 section 6.4, and for IB in > draft-ietf-ipoib-ip-over-infiniband-07.txt section 4.0. From gdror at mellanox.co.il Thu Sep 9 08:15:53 2004 From: gdror at mellanox.co.il (Dror Goldenberg) Date: Thu, 9 Sep 2004 18:15:53 +0300 Subject: [openib-general] Multicast address aliasing in IPoIB Message-ID: <506C3D7B14CDD411A52C00025558DED605E00F19@mtlex01.yok.mtl.com> > -----Original Message----- > From: David M.
Brean [mailto:David.Brean at Sun.COM] > Sent: Thursday, September 09, 2004 3:52 PM > > The IBTA allows multiple MGIDs to map to a MLID. That is the > aliasing > mechanism on IB. As a result, it seems ok to have IP multicast > addresses to map 1-1 to MGIDs. As you know, the IPoIB multicast > hardware address is a special format of the MGID. Right, but this only goes out on the wire. I.e. in the case of such aliasing of n MGIDs to a single MLID, it'll be filtered by the HCAs, because QPs in the HCA are connected to specific MGIDs not LIDs. > > So, the OS should not need to map more than one IP multicast > address to > a IPoIB hardware address. [There are other steps needed to > program the > CA interface and obviously the SM must perform some internal mapping > management.] The problem is based on how Linux works. The IPoIB driver plugs into the OS as an Ethernet driver. Therefore, it is subject to the Ethernet rules, and from the OS, it only gets lists of "Ethernet" multicast addresses. The notification for an additional HW multicast address will only be delivered to the IPoIB driver if there is no Ethernet aliasing. Otherwise, Linux will just maintain a reference count per HW address. Therefore, there is a problem with the implementation of the IPoIB driver as an Ethernet driver. Fixing this problem requires a kernel patch (or a very ugly solution). -Dror
From tduffy at sun.com Thu Sep 9 11:55:23 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 09 Sep 2004 11:55:23 -0700 Subject: [openib-general] Re: [PATCH] remove ts_kernel_trace from client_query* In-Reply-To: <20040831015242.GE27631@cup.hp.com> References: <1092438376.32316.30.camel@localhost> <52r7qay0zh.fsf@topspin.com> <1092672183.2752.4.camel@duffman> <52k6vzoxla.fsf@topspin.com> <1092678729.2752.12.camel@duffman> <52n00kpcv1.fsf@topspin.com> <1093380548.13962.15.camel@duffman> <52acwjdvdz.fsf@topspin.com> <20040825195034.GB2399@mellanox.co.il> <52oekzc7oi.fsf@topspin.com> <20040831015242.GE27631@cup.hp.com> Message-ID: <1094756123.10728.4.camel@duffman> [Now I am catching up on back email after vacation :)] On Mon, 2004-08-30 at 18:52 -0700, Grant Grundler wrote: > > We've found it very useful for debugging to be able to turn on debug > > output after a problem is detected without having to disturb the system. > > It could be another one of the "CONFIG_EMBEDDED" options. I don't see how this follows at all. How does debugging have anything to do with embedded versions of the kernel? -tduffy
From iod00d at hp.com Thu Sep 9 13:34:25 2004 From: iod00d at hp.com (Grant Grundler) Date: Thu, 9 Sep 2004 13:34:25 -0700 Subject: [openib-general] Re: [PATCH] remove ts_kernel_trace from client_query* In-Reply-To: <1094756123.10728.4.camel@duffman> References: <1092672183.2752.4.camel@duffman> <52k6vzoxla.fsf@topspin.com> <1092678729.2752.12.camel@duffman> <52n00kpcv1.fsf@topspin.com> <1093380548.13962.15.camel@duffman> <52acwjdvdz.fsf@topspin.com> <20040825195034.GB2399@mellanox.co.il> <52oekzc7oi.fsf@topspin.com> <20040831015242.GE27631@cup.hp.com> <1094756123.10728.4.camel@duffman> Message-ID: <20040909203425.GA22199@cup.hp.com> On Thu, Sep 09, 2004 at 11:55:23AM -0700, Tom Duffy wrote: > On Mon, 2004-08-30 at 18:52 -0700, Grant Grundler wrote: > > > We've found it very useful for debugging to be able to turn on debug > > > output after a problem is detected without having to disturb the system. > > > > It could be another one of the "CONFIG_EMBEDDED" options. > > I don't see how this follows at all. How does debugging have anything to > do with embedded versions of the kernel? s/debugging/support/ or whatever your favorite term is for more verbose output. e.g. EMBEDDED was just considered as a "trigger" to enable/disable the question. SCSI subsystem offers CONFIG_SCSI_CONSTANTS and CONFIG_SCSI_LOGGING. I'd be just as happy to make the default Y for IB_CONSTANTS (I just made that up) and not depend on CONFIG_EMBEDDED. thanks, grant From mst at mellanox.co.il Thu Sep 9 14:03:39 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 10 Sep 2004 00:03:39 +0300 Subject: [openib-general] useful procmail receipe Message-ID: <20040909210339.GA18523@mellanox.co.il> Hi! I've long been annoyed that mail from the openib list for some reason has its Status: field set to O, which causes my MUA to treat it as old.
I finally solved this by pipeing mail through formail -I Status: I also wrote a procmail receipe to fix this for me before saving to the appropriate folder. Here it is (below) just in case its useful to someone else. HTH MST :0 * ^TOopenib-general at openib.org { :0hf |formail -I Status: :0: openib } From roland at topspin.com Thu Sep 9 19:43:29 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 09 Sep 2004 19:43:29 -0700 Subject: [openib-general] Multicast address aliasing in IPoIB In-Reply-To: <506C3D7B14CDD411A52C00025558DED605E00F19@mtlex01.yok.mtl.com> (Dror Goldenberg's message of "Thu, 9 Sep 2004 18:15:53 +0300") References: <506C3D7B14CDD411A52C00025558DED605E00F19@mtlex01.yok.mtl.com> Message-ID: <52wtz2etda.fsf@topspin.com> Dror> The problem is based on how Linux works. The IPoIB plugs Dror> into the OS as an Ethernet driver. Therefore, it is subject Dror> to the Ethernet rules, and from the OS, it only get lists of Dror> "Ethernet" multicast addresses. The notification for Dror> additional HW multicast address, will only be delivered to Dror> the IPoIB driver if there is no Ethernet Dror> aliasing. Otherwise, Linux will just maintain a reference Dror> count per HW address. Therefore, there is a problem with the Dror> implementation of the IPoIB driver as an Ethernet Dror> driver. Fixing this problem requires a kernel patch (or a Dror> very ugly solution). I see no reason to modify the IPoIB multicast mapping just because the Linux kernel does not yet have support for it. For customers that have to run older unpatched kernels, it is possible to support the full IPoIB spec via ugly hacks (such as the one found in the gen1 Topspin stack's IPoIB implementation). For newer kernels, there is no reason that the IPoIB driver has to masquerade at an ethernet driver -- we should be aiming for a fully native driver that sets its dev->type field to ARPHRD_INFINIBAND. 
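The aliasing difference at the heart of this thread can be shown with a few lines of arithmetic. The sketch below is illustrative only: the sketch_* function names are made up, group addresses are taken in host byte order, and it assumes the Ethernet mapping keeps the low-order 23 bits of the group address behind the 01:00:5e prefix (RFC 1112, section 6.4) while IPoIB carries the low-order 28 bits. It does not build a full MGID or 20-byte IPoIB hardware address.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Ethernet: fixed 01:00:5e prefix plus the low-order 23 bits of the group. */
static void sketch_eth_mc_map(uint32_t group, uint8_t mac[6])
{
	mac[0] = 0x01;
	mac[1] = 0x00;
	mac[2] = 0x5e;
	mac[3] = (group >> 16) & 0x7f;	/* top bit of this byte is dropped */
	mac[4] = (group >> 8) & 0xff;
	mac[5] = group & 0xff;
}

/* IPoIB: only the 4-bit class D prefix is dropped; 28 bits survive. */
static uint32_t sketch_ipoib_group_bits(uint32_t group)
{
	return group & 0x0fffffffu;
}

/* Returns 1 when two groups share one Ethernet multicast MAC address. */
static int sketch_eth_aliases(uint32_t a, uint32_t b)
{
	uint8_t ma[6], mb[6];

	sketch_eth_mc_map(a, ma);
	sketch_eth_mc_map(b, mb);
	return memcmp(ma, mb, sizeof ma) == 0;
}
```

For example, 224.1.1.1 (0xE0010101) and 225.1.1.1 (0xE1010101) produce the same Ethernet MAC, so a driver that only sees Ethernet-style addresses gets no notification for the second join, whereas their 28-bit IPoIB group bits differ.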
Once the OpenIB IPoIB driver is ready to be merged upstream, it should be no problem to get the trivial changes required in the core networking code merged. As far as I can tell, the only changes needed would be: implement an ip_ib_mc_map() function and add case ARPHRD_INFINIBAND: ip_ib_mc_map(addr, haddr); return 0; to arp_mc_map() in net/ipv4/arp.c and to make the analogous addition for ndisc_mc_map() in net/ipv6/ndisc.c. (ARPHRD_INFINIBAND is already defined in the Linux headers -- I got that merged back in early 2003) - R. From halr at voltaire.com Fri Sep 10 09:31:05 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 10 Sep 2004 12:31:05 -0400 Subject: [openib-general] OpenIB SWG Face to Face 9/9 Meeting Message-ID: <1094833865.1746.26.camel@localhost.localdomain> Here are my notes from yesterday's OpenIB SWG meeting: Attendees Roland Dreier Tom Duffy Gunnar Gunnarsson Sean Hefty Matt Leininger Libor Michalek Hal Rosenstock SM support is needed for initial deliverable for those without an embedded SM (e.g. back to back HCAs). OpenSM obtains access to kernel via character device ioctls Is this acceptable ? There were differing views on this. The kernel support for this is in core/useraccess_xxx.c as is done via read/write (to send and receive MADs). Needs to be made 32/64 clean and saner prior to kernel submission. OSM (OpenSM) build needs to be based on autoconf/etc. This can be deferred. There are also patches which have gone into the gen1 branch which are not in gen2. mthca is missing RDMA support and will be added beyond phase 1. No Arbel options are supported (only Tavor compatibility mode). Perhaps can bind CQs to different MSI-X lines ? Initial deliverable is by SuperComputing (mid November) to get IPoIB working in an OpenIB stack. This includes mthca (driver), access layer (MAD), SMI, SA client API for IPoIB, and IPoIB. (Also, hopefully OpenSM although that is not for the kernel). MAD layer is tall pole. 
Once working one available will port to this. Goal is to have this in 2 weeks. Other OpenIB porting would occur once the MAD layer is said to be working. SMI also needs to be developed as does OpenIB SA client support for IPoIB which is just PathRecord (get) and MCMemberRecord (set/get). IPoIB changes should be minor. Should we go into Andrew Morton's tree first ? What about OSDL ? Who can supply IB hardware for them ? Some number of HCAs and a switch would be needed. They also may have a budget for this. Need to make sure that the multicast mapping solution is acceptable to Linux kernel network maintainer if this impacts the core networking files. Once there is a working MAD layer, there will be a new official gen2 branch with just phase 1 deliverables. Then stable and development branches. Development would add in CM, etc. We will need to make things easier for kernel developers. Should there be a bkbits tree ? There needs to be a way to easily pull IB and add to kernel including building outside the kernel. There was also discussion about svn gateways to accomplish this. Also, there should be a nightly cron job to tar up appropriate branches and put them in openib.org downloads directory. openib.org website is confusing. The downloads are no longer applicable. There is nothing that indicates what everything is (both gen1 and gen2). Matt is working on improving this. Roland indicates he will shortly post flash support including building with autoconf. This requires no kernel support. It may need to be just GPL'd only as it use pcitools. -- Hal From roland at topspin.com Fri Sep 10 10:06:14 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Sep 2004 10:06:14 -0700 Subject: [openib-general] [ANNOUNCE] tvflash: userspace-only Mellanox HCA flash tool Message-ID: <52d60udpfd.fsf@topspin.com> I just checked in source for "tvflash," a tool for updating the firmware flash on Mellanox HCA's. 
It is available from: https://openib.org/svn/gen2/branches/roland-merge/src/userspace/tvflash This tool operates either by mmap()ing /dev/mem to get access to PCI memory, or (for systems such as IBM pSeries where this doesn't work) by peeking and poking at the PCI configuration header. I have used it successfully on a variety of systems but you should be prepared for it to fail and corrupt the flash on your HCA. Since it links with the GPLed pciutils library, tvflash is licensed under the GPL only. - Roland From roland at topspin.com Fri Sep 10 10:19:10 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Sep 2004 10:19:10 -0700 Subject: [openib-general] OpenIB SWG Face to Face 9/9 Meeting In-Reply-To: <1094833865.1746.26.camel@localhost.localdomain> (Hal Rosenstock's message of "Fri, 10 Sep 2004 12:31:05 -0400") References: <1094833865.1746.26.camel@localhost.localdomain> Message-ID: <528ybidott.fsf@topspin.com> A few comments (since I was not around for the whole meeting): Hal> SM support is needed for initial deliverable for those Hal> without an embedded SM (e.g. back to back HCAs). OpenSM Hal> obtains access to kernel via character device ioctls Is this Hal> acceptable ? There were differing views on this. The kernel Hal> support for this is in core/useraccess_xxx.c as is done via Hal> read/write (to send and receive MADs). Needs to be made 32/64 Hal> clean and saner prior to kernel submission. I'm pretty sure sane ioctls will be acceptable. If the semantics of the character device are read() for receive, write() for send and ioctl() to set parameters, I don't see much problem. If there's resistance to this we can look for alternatives (eg a new socket family and socket options instead of ioctl?). Actually the userspace stuff pretty much needs a complete rewrite for sysfs-ication etc. However I don't think this is much work and I can bang it out pretty quickly. Hal> mthca is missing RDMA support and will be added beyond phase Hal> 1. 
No Arbel options are supported (only Tavor compatibility
    Hal> mode). Perhaps can bind CQs to different MSI-X lines ?

No firmware that supports Arbel native mode is available yet anyway.
I've been intending to implement multiple CQ event handlers for a
while but I wanted to defer design/discussion of the API until after
we had the basics working (and Mellanox has not released firmware with
MSI-X support either).

    Hal> Should we go into Andrew Morton's tree first ?

I don't think this is a decision for us to make. When we get closer to
having patches to submit, we can work with Linus and Andrew to figure
out the best way to get them merged. My feeling is that since our
changes are pretty self-contained (the only core change I know of is
to add ARPHRD_INFINIBAND multicast support) there's not much risk in
merging upstream, so there's not much need to go through the -mm tree.
However if Andrew does want to have our patches soak in -mm for a
while, that's not a problem either.

    Hal> What about OSDL ? Who can supply IB hardware for them ? Some
    Hal> number of HCAs and a switch would be needed. They also may
    Hal> have a budget for this.

OSDL does not accept donated equipment -- they insist on buying
everything they use.

    Hal> Need to make sure that the multicast mapping solution is
    Hal> acceptable to Linux kernel network maintainer if this impacts
    Hal> the core networking files.

I don't expect any problem here but it wouldn't hurt for someone to
actually write the patch and run it by Dave Miller and
netdev at oss.sgi.com in advance.

    Hal> Roland indicates he will shortly post flash support including
    Hal> building with autoconf. This requires no kernel support. It
    Hal> may need to be GPL'd only as it uses pciutils.

Just posted. It is indeed GPLed because it links with pciutils.

 - R.
From iod00d at hp.com Fri Sep 10 11:32:13 2004
From: iod00d at hp.com (Grant Grundler)
Date: Fri, 10 Sep 2004 11:32:13 -0700
Subject: [openib-general] OpenIB SWG Face to Face 9/9 Meeting
In-Reply-To: <1094833865.1746.26.camel@localhost.localdomain>
References: <1094833865.1746.26.camel@localhost.localdomain>
Message-ID: <20040910183213.GC27652@cup.hp.com>

Hal,
thanks for posting the meeting notes.

On Fri, Sep 10, 2004 at 12:31:05PM -0400, Hal Rosenstock wrote:
> What about OSDL ? Who can supply IB hardware for them ? Some number of
> HCAs and a switch would be needed. They also may have a budget for this.

I already have high-level mgt agreement to fund this.
I have temporary funding issues right now but it's not the primary
problem. Tim Witham (OSDL) at the IB BOF (OLS) was clear he couldn't
host OpenIB.org work until the "only promoters" SVN policy is changed.
I.e., folks who've contributed money to OpenIB get write access to SVN.
Until Tim says he's happy, I don't see any point in ordering HW.

...
> We will need to make things easier for kernel developers. Should there
> be a bkbits tree?

No. Just post the recipe to include openib.org SVN in any given
source tree. Roland already sent me the recipe to include his code
and it looks trivial (sorry - I'm still fighting other fires and
haven't actually tried it yet...over a week ago *sigh*).

> There needs to be a way to easily pull IB and add to
> kernel including building outside the kernel.

Right now, "build outside the kernel" is irrelevant given
dependencies on 2.6.8 kernels. Or do I have that wrong?

> openib.org website is confusing. The downloads are no longer applicable.
> There is nothing that indicates what everything is (both gen1 and gen2).
> Matt is working on improving this.

Yes - that would be good.
thanks again,
grant

From roland at topspin.com Fri Sep 10 11:56:45 2004
From: roland at topspin.com (Roland Dreier)
Date: Fri, 10 Sep 2004 11:56:45 -0700
Subject: [openib-general] OpenIB SWG Face to Face 9/9 Meeting
In-Reply-To: <20040910183213.GC27652@cup.hp.com> (Grant Grundler's message of "Fri, 10 Sep 2004 11:32:13 -0700")
References: <1094833865.1746.26.camel@localhost.localdomain> <20040910183213.GC27652@cup.hp.com>
Message-ID: <52zn3yc5qq.fsf@topspin.com>

    Hal> There needs to be a way to easily pull IB and add to kernel
    Hal> including building outside the kernel.

    Grant> Right now, "build outside the kernel" is irrelevant given
    Grant> dependencies on 2.6.8 kernels. Or do I have that wrong?

As Grant said, it's easy to pull and build as part of a kernel tree.
Not sure how important building outside of a kernel tree is, but it
shouldn't be too hard to do: the only issue is all the use of
CONFIG_INFINIBAND_blah in the Makefiles, and one could simply set
environment variables appropriately before building with
"make -C M=`pwd`".

I haven't tried it yet but I'll give it a shot and post a recipe if it
works.
(By the way, separate object directories with "O=xxx" work fine, and as part of my testing I cross-compile 5 different archs out of one kernel tree) - Roland From halr at voltaire.com Fri Sep 10 11:57:46 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 10 Sep 2004 14:57:46 -0400 Subject: [openib-general] [PATCH] [TRIVIAL] ib_mad.c: change clearing of management method table Message-ID: <1094842666.1794.183.camel@localhost.localdomain> ib_mad.c: change clearing of management method table Index: ib_mad.c =================================================================== --- ib_mad.c (revision 760) +++ ib_mad.c (working copy) @@ -413,8 +413,6 @@ static int allocate_method_table(struct ib_mad_mgmt_method_table **method) { - int i; - /* Allocate management method table */ *method = kmalloc(sizeof **method, GFP_KERNEL); if (!*method) { @@ -422,9 +420,8 @@ return ENOMEM; } /* Clear management method table */ - for (i = 0; i < IB_MGMT_MAX_METHODS; i++) { - (*method)->agent[i] = NULL; - } + memset(*method, 0, sizeof **method); + return 0; } From halr at voltaire.com Fri Sep 10 12:11:42 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 10 Sep 2004 15:11:42 -0400 Subject: [openib-general] [PATCH] ib_mad: In struct ib_mad_device_private, make qp0 and qp1 an array Message-ID: <1094843501.1752.204.camel@localhost.localdomain> ib_mad: In struct ib_mad_device_private, make qp0 and qp1 an array Index: ib_mad.c =================================================================== --- ib_mad.c (revision 761) +++ ib_mad.c (working copy) @@ -243,10 +243,7 @@ mad_agent->recv_handler = recv_handler; mad_agent->send_handler = send_handler; mad_agent->context = context; - if (qp_type == IB_QPT_GSI) - mad_agent->qp = priv->qp1; - else - mad_agent->qp = priv->qp0; + mad_agent->qp = priv->qp[qp_type]; mad_agent->hi_tid = ++ib_mad_client_id; /* Add to mad agent list */ @@ -788,7 +785,6 @@ { struct ib_mad_private *mad_priv; struct ib_sge sg_list; - struct ib_qp *qp; struct 
ib_recv_wr recv_wr; struct ib_recv_wr *bad_recv_wr; IB_MAD_RECV_LIST_LOCK_VAR; @@ -817,11 +813,6 @@ recv_wr.recv_flags = IB_RECV_SIGNALED; recv_wr.wr_id = (unsigned long)mad_priv; - if (qp_type == IB_QPT_GSI) - qp = priv->qp1; - else - qp = priv->qp0; - /* Link receive WR into posted receive MAD list */ IB_MAD_RECV_LIST_LOCK(priv); list_add_tail((struct list_head *)mad_priv, &priv->recv_posted_mad_list); @@ -830,7 +821,7 @@ pci_unmap_addr_set(&mad_priv->header.buf, mapping, sg_list.addr); /* Now, post receive WR */ - if (ib_post_recv(qp, &recv_wr, &bad_recv_wr)) { + if (ib_post_recv(priv->qp[qp_type], &recv_wr, &bad_recv_wr)) { /* Unlink from posted receive MAD list */ IB_MAD_RECV_LIST_LOCK(priv); list_del((struct list_head *)mad_priv); @@ -1039,14 +1030,9 @@ static int ib_mad_device_start(struct ib_mad_device_private *priv) { int ret, i; - struct ib_qp *qp; for (i = 0; i < 2; i++) { - if (i == 0) - qp = priv->qp0; - else - qp = priv->qp1; - ret = ib_mad_change_qp_state_to_init(qp, priv->port); + ret = ib_mad_change_qp_state_to_init(priv->qp[i], priv->port); if (ret) { printk(KERN_ERR "Could not change QP%d state to INIT\n", i); return ret; @@ -1066,17 +1052,13 @@ } for (i = 0; i < 2; i++) { - if (i == 0) - qp = priv->qp0; - else - qp = priv->qp1; - ret = ib_mad_change_qp_state_to_rtr(qp); + ret = ib_mad_change_qp_state_to_rtr(priv->qp[i]); if (ret) { printk(KERN_ERR "Could not change QP%d state to RTR\n", i); goto error; } - ret = ib_mad_change_qp_state_to_rts(qp); + ret = ib_mad_change_qp_state_to_rts(priv->qp[i]); if (ret) { printk(KERN_ERR "Could not change QP%d state to RTS\n", i); goto error; @@ -1089,11 +1071,7 @@ error: ib_mad_return_posted_recv_mads(priv); for (i = 0; i < 2; i++) { - if (i == 0) - qp = priv->qp0; - else - qp = priv->qp1; - ib_mad_change_qp_state_to_reset(qp); + ib_mad_change_qp_state_to_reset(priv->qp[i]); } return ret; @@ -1105,16 +1083,11 @@ static void ib_mad_device_stop(struct ib_mad_device_private *priv) { int i; - struct ib_qp *qp; 
IB_MAD_DEVICE_SET_DOWN(priv); for (i = 0; i < 2; i++) { - if (i == 0) - qp = priv->qp0; - else - qp = priv->qp1; - ib_mad_change_qp_state_to_reset(qp); + ib_mad_change_qp_state_to_reset(priv->qp[i]); } ib_mad_return_posted_recv_mads(priv); @@ -1151,7 +1124,6 @@ .size = (unsigned long) high_memory - PAGE_OFFSET }; struct ib_device_attr device_attr; - struct ib_qp *qp; struct ib_qp_init_attr qp_init_attr; struct ib_qp_cap qp_cap; struct ib_mad_device_private *entry, *priv = NULL, @@ -1219,10 +1191,6 @@ } for (i = 0; i < 2; i++) { - if (i == 0) - qp = priv->qp0; - else - qp = priv->qp1; memset(&qp_init_attr, 0, sizeof qp_init_attr); qp_init_attr.send_cq = priv->cq; qp_init_attr.recv_cq = priv->cq; @@ -1237,16 +1205,16 @@ else qp_init_attr.qp_type = IB_QPT_GSI; qp_init_attr.port_num = priv->port; - qp = ib_create_qp(priv->pd, &qp_init_attr, &qp_cap); - if (IS_ERR(qp)) { + priv->qp[i] = ib_create_qp(priv->pd, &qp_init_attr, &qp_cap); + if (IS_ERR(priv->qp[i])) { printk(KERN_ERR "Could not create ib_mad QP%d\n", i); - ret = PTR_ERR(qp); + ret = PTR_ERR(priv->qp[i]); if (i == 0) goto error6; else goto error7; } - printk(KERN_DEBUG "Created ib_mad QP %d\n", qp->qp_num); + printk(KERN_DEBUG "Created ib_mad QP %d\n", priv->qp[i]->qp_num); } spin_lock_init(&priv->recv_list_lock); @@ -1268,9 +1236,9 @@ return 0; error8: - ib_destroy_qp(priv->qp1); + ib_destroy_qp(priv->qp[1]); error7: - ib_destroy_qp(priv->qp0); + ib_destroy_qp(priv->qp[0]); error6: ib_dereg_mr(priv->mr); error5: @@ -1313,8 +1281,8 @@ ib_mad_device_stop(priv); ib_mad_thread_stop(priv); - ib_destroy_qp(priv->qp1); - ib_destroy_qp(priv->qp0); + ib_destroy_qp(priv->qp[1]); + ib_destroy_qp(priv->qp[0]); ib_dereg_mr(priv->mr); ib_dealloc_pd(priv->pd); ib_destroy_cq(priv->cq); Index: ib_mad_priv.h =================================================================== --- ib_mad_priv.h (revision 760) +++ ib_mad_priv.h (working copy) @@ -125,8 +125,7 @@ int port; int up; struct ib_mad_mgmt_class_table 
*version[MAX_MGMT_VERSION]; - struct ib_qp *qp0; - struct ib_qp *qp1; + struct ib_qp *qp[2]; struct ib_cq *cq; struct ib_pd *pd; struct ib_mr *mr; From halr at voltaire.com Fri Sep 10 12:26:51 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 10 Sep 2004 15:26:51 -0400 Subject: [openib-general] [PATCH] [TRIVIAL] ib_mad.c: change device to port routine names where appropriate Message-ID: <1094844410.1752.223.camel@localhost.localdomain> ib_mad.c: change device to port routine names where appropriate Index: ib_mad.c =================================================================== --- ib_mad.c (revision 762) +++ ib_mad.c (working copy) @@ -1025,7 +1025,7 @@ IB_MAD_DEVICE_LIST_UNLOCK();} /* - * Start the device + * Start the device */ static int ib_mad_device_start(struct ib_mad_device_private *priv) { @@ -1112,10 +1112,10 @@ } /* - * Open the device + * Open the port * Create the QP, PD, MR, and CQ if needed */ -static int ib_mad_device_open(struct ib_device *device, int port) +static int ib_mad_port_open(struct ib_device *device, int port) { int ret, cq_size, i; u64 iova = 0; @@ -1130,7 +1130,7 @@ *head = (struct ib_mad_device_private *) &ib_mad_device_list; IB_MAD_DEVICE_LIST_LOCK_VAR; - /* First, check if device already open at MAD layer */ + /* First, check if port already open at MAD layer */ IB_MAD_DEVICE_LIST_LOCK(); list_for_each(entry, head) { if (entry->device == device && entry->port == port) { @@ -1140,7 +1140,7 @@ } IB_MAD_DEVICE_LIST_UNLOCK(); if (priv) { - printk(KERN_DEBUG "Device already open\n"); + printk(KERN_DEBUG "Port already open\n"); return 0; } @@ -1252,11 +1252,11 @@ } /* - * Close the device - * If there are no classes using the device, free the device - * resources (CQ, MR, PD, QP) and remove the device info structure + * Close the port + * If there are no classes using the port, free the port + * resources (CQ, MR, PD, QP) and remove the port's info structure */ -static int ib_mad_device_close(struct ib_device *device, int 
port) +static int ib_mad_port_close(struct ib_device *device, int port) { struct ib_mad_device_private *entry, *priv = NULL, *head = (struct ib_mad_device_private *)&ib_mad_device_list; @@ -1271,7 +1271,7 @@ } if (priv == NULL) { - printk(KERN_ERR "Device not found\n"); + printk(KERN_ERR "Port not found\n"); IB_MAD_DEVICE_LIST_UNLOCK(); return -ENODEV; } @@ -1310,7 +1310,7 @@ num_ports = device_attr.phys_port_cnt; } for (i = 0; i < num_ports; i++) { - ret = ib_mad_device_open(device, i); + ret = ib_mad_port_open(device, i); if (ret) { printk(KERN_ERR "Could not open device port %d\n", i); goto error_device_open; @@ -1321,7 +1321,7 @@ error_device_open: while (i > 0) { - ret2 = ib_mad_device_close(device, i); + ret2 = ib_mad_port_close(device, i); if (ret2) { printk(KERN_ERR "Could not close device port %d\n", i); } @@ -1346,7 +1346,7 @@ /* num_ports should also be based on device type! */ num_ports = device_attr.phys_port_cnt; for (i = 0; i < num_ports; i++) { - ret2 = ib_mad_device_close(device, i); + ret2 = ib_mad_port_close(device, i); if (ret2) { printk(KERN_ERR "Could not close device port %d\n", i); if (!ret) From halr at voltaire.com Fri Sep 10 12:32:14 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 10 Sep 2004 15:32:14 -0400 Subject: [openib-general] [PATCH] [TRIVIAL] ib_mad.c: ib_mad.c: Allocate CQ entries without any extra ones Message-ID: <1094844733.1752.231.camel@localhost.localdomain> ib_mad.c: Allocate CQ entries without any extra ones Index: ib_mad.c =================================================================== --- ib_mad.c (revision 766) +++ ib_mad.c (working copy) @@ -1159,7 +1159,7 @@ priv->version[i] = NULL; } - cq_size = IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE + 20; + cq_size = IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE; priv->cq = ib_create_cq(priv->device, (ib_comp_handler) ib_mad_thread_completion_handler, priv, cq_size); From robert.j.woodruff at intel.com Fri Sep 10 12:37:47 2004 From: robert.j.woodruff at intel.com 
(Woodruff, Robert J)
Date: Fri, 10 Sep 2004 12:37:47 -0700
Subject: [openib-general] OpenIB SWG Face to Face 9/9 Meeting
Message-ID: <1AC79F16F5C5284499BB9591B33D6F000205E90C@orsmsx408>

Grant wrote,
> I already have high-level mgt agreement to fund this.
> I have temporary funding issues right now but it's not the primary
> problem. Tim Witham (OSDL) at the IB BOF (OLS) was clear he couldn't
> host OpenIB.org work until the "only promoters" SVN policy is changed.
> Ie folks who've contributed money to OpenIB get write access to SVN.
> Until Tim says he's happy, I don't see any point in ordering HW.

Jim, is there anything in the openib.org promoters agreement or bylaws
that prevents openib.org maintainers from giving others (non-promoters)
write access to the SVN tree ? If not, then I think that as long as all
the people on the list want to allow someone to have write access, I
see no reason why we could not allow it.

What do other people think ?

woody

From mshefty at ichips.intel.com Fri Sep 10 12:17:55 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 10 Sep 2004 12:17:55 -0700
Subject: [openib-general] OpenIB SWG Face to Face 9/9 Meeting
In-Reply-To: <1094833865.1746.26.camel@localhost.localdomain>
References: <1094833865.1746.26.camel@localhost.localdomain>
Message-ID: <20040910121755.1620a62a.mshefty@ichips.intel.com>

On Fri, 10 Sep 2004 12:31:05 -0400
Hal Rosenstock wrote:

> SM support is needed for initial deliverable for those without an
> embedded SM (e.g. back to back HCAs).

During the DevCon, Kevin Deierling (from Mellanox) mentioned that they
would port opensm to the gen2 interface once it was defined and ready.
Support for just opensm in user-mode is easier than exporting the
hardware capabilities directly to user-mode clients. It should be noted
that, after additional discussions with Hal, the user-mode MAD API may
be slightly different than the kernel-mode API, since zero-copy is not
as achievable.
> Initial deliverable is by SuperComputing (mid November) to get IPoIB
> working in an OpenIB stack. This includes mthca (driver), access layer
> (MAD), SMI, SA client API for IPoIB, and IPoIB. (Also, hopefully OpenSM
> although that is not for the kernel).

Delivering code by SuperComputing was requested by Matt to set a
specific deadline for the code. There wasn't a hard requirement for
this date, but it did seem reasonable. It requires defining the
user-mode MAD API and a minimal query API. A simple way to join/leave
multicast groups may also be desirable.

Other notes that I have from the meeting not already mentioned are:

uDAPL was mentioned as being a priority.

Need to define the call context for all calls.

Testing of the stack is going to be based primarily on normal stack
usage, with no plans for creating a formal test suite. Once some
additional ULPs (uDAPL/MPI) are available, their tests can be leveraged.

OSU/labs would likely port MPI to the gen2 stack.

The labs expressed interest in diagnostic tools.

From roland at topspin.com Fri Sep 10 13:35:18 2004
From: roland at topspin.com (Roland Dreier)
Date: Fri, 10 Sep 2004 13:35:18 -0700
Subject: [openib-general] Proposed device enumeration & async event APIs
Message-ID: <52mzzxdfqx.fsf@topspin.com>

OK, here is my proposal for how to handle device enumeration and async
events. I actually have all of this coded and working on my branch;
I'll post the full diff shortly. However I want to pull out the API
changes so we can discuss them more easily.

First of all, here is how a kernel client finds out about the devices
in the system:

	struct ib_client {
		void (*add)   (struct ib_device *);
		void (*remove)(struct ib_device *);

		struct list_head list;
	};

	int ib_register_client  (struct ib_client *client);
	int ib_unregister_client(struct ib_client *client);

When a client calls ib_register_client, the add method is called for
each of the devices that already exist.
Conversely, on unregister, remove is called for all the remaining
devices. When a new device is added, add methods are called in the
order the clients registered; when a device is removed, remove methods
are called in the opposite order. This allows initialization and
cleanup to happen properly (for example, IPoIB knows that the MAD
layer will initialize before it and clean up after it).

For unaffiliated events, my API is as follows:

	struct ib_event_handler {
		struct ib_device *device;
		void            (*handler)(struct ib_event_handler *, struct ib_event *);
		struct list_head  list;
	};

	int ib_register_event_handler  (struct ib_device *device,
					struct ib_event_handler *event_handler);
	int ib_unregister_event_handler(struct ib_event_handler *event_handler);

	void ib_dispatch_event(struct ib_event *event);

This is pretty simple: everyone that wants to know about unaffiliated
(i.e. not relating to a QP or CQ) events registers a struct
ib_event_handler. The callback doesn't take a context parameter
because I'm assuming the struct ib_event_handler will be embedded in
the client's context and used with container_of (this is similar to ).
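
[Editorial sketch, not part of the original message.] The embedding
pattern described above can be tried out in plain userspace C. The
following is only a minimal sketch under stated assumptions: the
structures are cut down to the fields needed here, container_of is
redefined locally as a stand-in for the kernel macro, and struct
my_client, my_client_event(), dispatch() and demo_last_event() are
hypothetical names invented for illustration. None of them appear in
the proposal or in the OpenIB tree.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *) ((char *) (ptr) - offsetof(type, member)))

/* Reduced stand-ins for the structures in the proposal. */
struct ib_event {
	int event;			/* hypothetical event code */
};

struct ib_event_handler {
	void (*handler)(struct ib_event_handler *, struct ib_event *);
};

/* Hypothetical client context that embeds the handler. */
struct my_client {
	int events_seen;
	int last_event;
	struct ib_event_handler event_handler;
};

/* Callback: no context argument, so recover it from the handler. */
static void my_client_event(struct ib_event_handler *handler,
			    struct ib_event *event)
{
	struct my_client *client =
		container_of(handler, struct my_client, event_handler);

	client->events_seen++;
	client->last_event = event->event;
}

/* Stand-in for the core invoking one registered handler. */
static void dispatch(struct ib_event_handler *handler,
		     struct ib_event *event)
{
	handler->handler(handler, event);
}

/* Deliver one event to a fresh client and report what it recorded. */
int demo_last_event(int code)
{
	struct my_client client = { 0 };
	struct ib_event event = { code };

	client.event_handler.handler = my_client_event;
	dispatch(&client.event_handler, &event);

	assert(client.events_seen == 1);
	return client.last_event;
}
```

The point of the design is that the core only ever sees the embedded
struct ib_event_handler, so no separate void * context has to be
stored or passed; the client gets back to its own state purely by
pointer arithmetic on the handler it registered.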
Finally, I added event_handler members to struct ib_cq and struct
ib_qp and added support for setting them on creation:

	struct ib_cq {
		struct ib_device *device;
		ib_comp_handler   comp_handler;
		void             (*event_handler)(struct ib_event *, void *);
		void *            context;
		int               cqe;
		atomic_t          usecnt; /* count number of work queues */
	};

	struct ib_cq *ib_create_cq(struct ib_device *device,
				   ib_comp_handler comp_handler,
				   void (*event_handler)(struct ib_event *, void *),
				   void *cq_context, int cqe);

	struct ib_qp {
		struct ib_device *device;
		struct ib_pd     *pd;
		struct ib_cq     *send_cq;
		struct ib_cq     *recv_cq;
		struct ib_srq    *srq;
		void            (*event_handler)(struct ib_event *, void *);
		void             *qp_context;
		u32               qp_num;
	};

	struct ib_qp_init_attr {
		void                  (*event_handler)(struct ib_event *, void *);
		void                   *qp_context;
		struct ib_cq           *send_cq;
		struct ib_cq           *recv_cq;
		struct ib_srq          *srq;
		struct ib_qp_cap        cap;
		enum ib_sig_type        sq_sig_type;
		enum ib_sig_type        rq_sig_type;
		enum ib_qp_type         qp_type;
		u8                      port_num; /* special QP types only */
	};

These do get passed the context to match what we did with the
comp_handler member of struct ib_cq.

Comments?

Thanks,
  Roland

From roland at topspin.com Fri Sep 10 13:36:11 2004
From: roland at topspin.com (Roland Dreier)
Date: Fri, 10 Sep 2004 13:36:11 -0700
Subject: [openib-general] [PATCH] remove some /proc files
Message-ID: <52isaldfpg.fsf@topspin.com>

This gets rid of everything in /proc that's also in sysfs (only the
PMA counters still need to be moved, and I'm waiting for the new MAD
API before I do that work).

 - R.

Index: infiniband/core/core_proc.c =================================================================== --- infiniband/core/core_proc.c (revision 759) +++ infiniband/core/core_proc.c (working copy) @@ -54,7 +54,6 @@ int index; char dev_dir_name[16]; struct proc_dir_entry *dev_dir; - struct proc_dir_entry *dev_info; struct ib_port_proc *port; }; @@ -62,343 +61,12 @@ struct ib_device *device; int port_num; struct proc_dir_entry *port_dir; - struct proc_dir_entry *port_data; - struct proc_dir_entry *gid_table; - struct proc_dir_entry *pkey_table; struct proc_dir_entry *counters; }; static int index = 1; static struct proc_dir_entry *core_dir; -static void *ib_dev_info_seq_start(struct seq_file *file, - loff_t *pos) -{ - if (*pos) - return NULL; - else - return (void *) 1UL; -} - -static void *ib_dev_info_seq_next(struct seq_file *file, - void *iter_ptr, - loff_t *pos) -{ - (*pos)++; - return NULL; -} - -static void ib_dev_info_seq_stop(struct seq_file *file, - void *iter_ptr) -{ - /* nothing for now */ -} - -static int ib_dev_info_seq_show(struct seq_file *file, - void *iter_ptr) -{ - struct ib_device_attr prop; - struct ib_device *proc_device = file->private; - - seq_printf(file, "name: %s\n", proc_device->name); - seq_printf(file, "provider: %s\n", proc_device->provider); - - if (proc_device->query_device(proc_device, &prop)) - return 0; - - seq_printf(file, "node GUID: %04x:%04x:%04x:%04x\n", - be16_to_cpu(((u16 *) &prop.node_guid)[0]), - be16_to_cpu(((u16 *) &prop.node_guid)[1]), - be16_to_cpu(((u16 *) &prop.node_guid)[2]), - be16_to_cpu(((u16 *) &prop.node_guid)[3])); - seq_printf(file, "ports: %d\n", prop.phys_port_cnt); - seq_printf(file, "vendor ID: 0x%x\n", prop.vendor_id); - seq_printf(file, "device ID: 0x%x\n", prop.vendor_part_id); - seq_printf(file, "HW revision: 0x%x\n", prop.hw_ver); - seq_printf(file, "FW revision: 0x%" TS_U64_FMT "x\n", prop.fw_ver); - - return 0; -} - -static struct seq_operations dev_info_seq_ops = { - .start = ib_dev_info_seq_start, 
- .next = ib_dev_info_seq_next, - .stop = ib_dev_info_seq_stop, - .show = ib_dev_info_seq_show -}; - -static int ib_dev_info_open(struct inode *inode, - struct file *file) -{ - int ret; - - ret = seq_open(file, &dev_info_seq_ops); - if (ret) - return ret; - - ((struct seq_file *) file->private_data)->private = PDE(inode)->data; - - return 0; -} - -static void *ib_port_info_seq_start(struct seq_file *file, - loff_t *pos) -{ - if (*pos) - return NULL; - else - return (void *) 1UL; -} - -static void *ib_port_info_seq_next(struct seq_file *file, - void *iter_ptr, - loff_t *pos) -{ - (*pos)++; - return NULL; -} - -static void ib_port_info_seq_stop(struct seq_file *file, - void *iter_ptr) -{ - /* nothing for now */ -} - -static int ib_port_info_seq_show(struct seq_file *file, - void *iter_ptr) -{ - struct ib_port_attr prop; - struct ib_port_proc *proc_port = file->private; - - if (proc_port->device->query_port(proc_port->device, - proc_port->port_num, &prop)) { - return 0; - } - - seq_printf(file, "state: "); - switch (prop.state) { - case IB_PORT_NOP: - seq_printf(file, "NOP\n"); - break; - case IB_PORT_DOWN: - seq_printf(file, "DOWN\n"); - break; - case IB_PORT_INIT: - seq_printf(file, "INITIALIZE\n"); - break; - case IB_PORT_ARMED: - seq_printf(file, "ARMED\n"); - break; - case IB_PORT_ACTIVE: - seq_printf(file, "ACTIVE\n"); - break; - default: - seq_printf(file, "UNKNOWN\n"); - break; - } - - seq_printf(file, "LID: 0x%04x\n", prop.lid); - seq_printf(file, "LMC: 0x%04x\n", prop.lmc); - seq_printf(file, "SM LID: 0x%04x\n", prop.sm_lid); - seq_printf(file, "SM SL: 0x%04x\n", prop.sm_sl); - seq_printf(file, "Capabilities: "); - if (prop.port_cap_flags) { - static const char *cap_name[] = { - [1] = "IsSM", - [2] = "IsNoticeSupported", - [3] = "IsTrapSupported", - [5] = "IsAutomaticMigrationSupported", - [6] = "IsSLMappingSupported", - [7] = "IsMKeyNVRAM", - [8] = "IsPKeyNVRAM", - [9] = "IsLEDInfoSupported", - [10] = "IsSMdisabled", - [11] = "IsSystemImageGUIDSupported", - 
[12] = "IsPKeySwitchExternalPortTrapSupported", - [16] = "IsCommunicationManagementSupported", - [17] = "IsSNMPTunnelingSupported", - [18] = "IsReinitSupported", - [19] = "IsDeviceManagementSupported", - [20] = "IsVendorClassSupported", - [21] = "IsDRNoticeSupported", - [22] = "IsCapabilityMaskNoticeSupported", - [23] = "IsBootManagementSupported" - }; - int i; - int f = 0; - - for (i = 0; i < ARRAY_SIZE(cap_name); ++i) { - if (prop.port_cap_flags & (1 << i)) { - if (f++) { - seq_puts(file, " "); - } - if (cap_name[i]) { - seq_printf(file, "%s\n", cap_name[i]); - } else { - seq_printf(file, "RESERVED (%d)\n", i); - } - } - } - } else { - seq_puts(file, "NONE\n"); - } - - - return 0; -} - -static struct seq_operations port_data_seq_ops = { - .start = ib_port_info_seq_start, - .next = ib_port_info_seq_next, - .stop = ib_port_info_seq_stop, - .show = ib_port_info_seq_show -}; - -static int ib_port_info_open(struct inode *inode, - struct file *file) -{ - int ret; - - ret = seq_open(file, &port_data_seq_ops); - if (ret) - return ret; - - ((struct seq_file *) file->private_data)->private = PDE(inode)->data; - - return 0; -} - -static void *ib_gid_table_seq_start(struct seq_file *file, - loff_t *pos) -{ - if (*pos) - return NULL; - else - return (void *) 1UL; -} - -static void *ib_gid_table_seq_next(struct seq_file *file, - void *iter_ptr, - loff_t *pos) -{ - (*pos)++; - return NULL; -} - -static void ib_gid_table_seq_stop(struct seq_file *file, - void *iter_ptr) -{ - /* nothing for now */ -} - -static int ib_gid_table_seq_show(struct seq_file *file, - void *iter_ptr) -{ - int i, j; - struct ib_port_proc *proc_port = file->private; - static const tTS_IB_GID null_gid; - tTS_IB_GID gid; - - for (i = 0; !ib_cached_gid_get(proc_port->device, proc_port->port_num, i, gid); ++i) { - if (memcmp(&null_gid[8], &gid[8], 8)) { - seq_printf(file, "[%3d] ", i); - - for (j = 0; j < sizeof (tTS_IB_GID) / 2; ++j) { - if (j) - seq_putc(file, ':'); - seq_printf(file, "%04x", - 
be16_to_cpu(((u16 *) gid)[j])); - } - seq_putc(file, '\n'); - } - } - - return 0; -} - -static struct seq_operations gid_table_seq_ops = { - .start = ib_gid_table_seq_start, - .next = ib_gid_table_seq_next, - .stop = ib_gid_table_seq_stop, - .show = ib_gid_table_seq_show -}; - -static int ib_gid_table_open(struct inode *inode, - struct file *file) -{ - int ret; - - ret = seq_open(file, &gid_table_seq_ops); - if (ret) - return ret; - - ((struct seq_file *) file->private_data)->private = PDE(inode)->data; - - return 0; -} - -static void *ib_pkey_table_seq_start(struct seq_file *file, - loff_t *pos) -{ - if (*pos) { - return NULL; - } else { - return (void *) 1UL; - } -} - -static void *ib_pkey_table_seq_next(struct seq_file *file, - void *iter_ptr, - loff_t *pos) -{ - (*pos)++; - return NULL; -} - -static void ib_pkey_table_seq_stop(struct seq_file *file, - void *iter_ptr) -{ - /* nothing for now */ -} - -static int ib_pkey_table_seq_show(struct seq_file *file, - void *iter_ptr) -{ - int i; - struct ib_port_proc *proc_port = file->private; - u16 pkey; - - for (i = 0; !ib_cached_pkey_get(proc_port->device, proc_port->port_num, i, &pkey); ++i) { - if (pkey & 0x7fff) { - seq_printf(file, "[%3d] %04x\n", i, pkey); - } - } - - return 0; -} - -static struct seq_operations pkey_table_seq_ops = { - .start = ib_pkey_table_seq_start, - .next = ib_pkey_table_seq_next, - .stop = ib_pkey_table_seq_stop, - .show = ib_pkey_table_seq_show -}; - -static int ib_pkey_table_open( - struct inode *inode, - struct file *file - ) { - int ret; - - ret = seq_open(file, &pkey_table_seq_ops); - if (ret) { - return ret; - } - ((struct seq_file *) file->private_data)->private = PDE(inode)->data; - - return 0; -} - static void *ib_counters_seq_start(struct seq_file *file, loff_t *pos) { @@ -541,38 +209,6 @@ return seq_release(inode, file); } -static struct file_operations dev_info_ops = { - .owner = THIS_MODULE, - .open = ib_dev_info_open, - .read = seq_read, - .llseek = seq_lseek, - .release = 
ib_proc_file_release -}; - -static struct file_operations port_data_ops = { - .owner = THIS_MODULE, - .open = ib_port_info_open, - .read = seq_read, - .llseek = seq_lseek, - .release = ib_proc_file_release -}; - -static struct file_operations gid_table_ops = { - .owner = THIS_MODULE, - .open = ib_gid_table_open, - .read = seq_read, - .llseek = seq_lseek, - .release = ib_proc_file_release -}; - -static struct file_operations pkey_table_ops = { - .owner = THIS_MODULE, - .open = ib_pkey_table_open, - .read = seq_read, - .llseek = seq_lseek, - .release = ib_proc_file_release -}; - static struct file_operations counters_ops = { .owner = THIS_MODULE, .open = ib_counters_open, @@ -606,65 +242,29 @@ goto out_free; } - core_proc->dev_info = create_proc_entry("info", S_IRUGO, core_proc->dev_dir); - if (!core_proc->dev_info) { - goto out_topdir; - } - core_proc->dev_info->proc_fops = &dev_info_ops; - core_proc->dev_info->data = device; - core_proc->port = kmalloc((priv->end_port + 1) * sizeof (struct ib_port_proc), GFP_KERNEL); - if (!core_proc->port) { - goto out_info; - } + if (!core_proc->port) + goto out_topdir; for (p = priv->start_port; p <= priv->end_port; ++p) { core_proc->port[p].device = device; core_proc->port[p].port_num = p; core_proc->port[p].port_dir = NULL; - core_proc->port[p].port_data = NULL; - core_proc->port[p].gid_table = NULL; - core_proc->port[p].pkey_table = NULL; core_proc->port[p].counters = NULL; } for (p = priv->start_port; p <= priv->end_port; ++p) { snprintf(port_name, sizeof port_name, "port%d", p); core_proc->port[p].port_dir = proc_mkdir(port_name, core_proc->dev_dir); - if (!core_proc->port[p].port_dir) { + if (!core_proc->port[p].port_dir) goto out_port; - } - core_proc->port[p].port_data = create_proc_entry("info", S_IRUGO, - core_proc->port[p].port_dir); - if (!core_proc->port[p].port_data) { - goto out_port; - } - core_proc->port[p].port_data->proc_fops = &port_data_ops; - core_proc->port[p].port_data->data = &core_proc->port[p]; - - 
core_proc->port[p].gid_table = create_proc_entry("gid_table", S_IRUGO, - core_proc->port[p].port_dir); - if (!core_proc->port[p].gid_table) { - goto out_port; - } - core_proc->port[p].gid_table->proc_fops = &gid_table_ops; - core_proc->port[p].gid_table->data = &core_proc->port[p]; - - core_proc->port[p].pkey_table = create_proc_entry("pkey_table", S_IRUGO, - core_proc->port[p].port_dir); - if (!core_proc->port[p].pkey_table) { - goto out_port; - } - core_proc->port[p].pkey_table->proc_fops = &pkey_table_ops; - core_proc->port[p].pkey_table->data = &core_proc->port[p]; - core_proc->port[p].counters = create_proc_entry("counters", S_IRUGO, core_proc->port[p].port_dir); - if (!core_proc->port[p].counters) { + if (!core_proc->port[p].counters) goto out_port; - } + core_proc->port[p].counters->proc_fops = &counters_ops; core_proc->port[p].counters->data = &core_proc->port[p]; } @@ -675,21 +275,8 @@ out_port: for (p = priv->start_port; p <= priv->end_port; ++p) { - if (core_proc->port[p].port_data) { - remove_proc_entry("info", core_proc->port[p].port_dir); - } - - if (core_proc->port[p].gid_table) { - remove_proc_entry("gid_table", core_proc->port[p].port_dir); - } - - if (core_proc->port[p].pkey_table) { - remove_proc_entry("pkey_table", core_proc->port[p].port_dir); - } - - if (core_proc->port[p].counters) { + if (core_proc->port[p].counters) remove_proc_entry("counters", core_proc->port[p].port_dir); - } if (core_proc->port[p].port_dir) { snprintf(port_name, sizeof port_name, "port%d", p); @@ -697,9 +284,6 @@ } } - out_info: - remove_proc_entry("info", core_proc->dev_dir); - out_topdir: remove_proc_entry(core_proc->dev_dir_name, core_dir); @@ -717,14 +301,10 @@ for (p = priv->start_port; p <= priv->end_port; ++p) { remove_proc_entry("counters", core_proc->port[p].port_dir); - remove_proc_entry("pkey_table", core_proc->port[p].port_dir); - remove_proc_entry("gid_table", core_proc->port[p].port_dir); - remove_proc_entry("info", core_proc->port[p].port_dir); 
 		snprintf(port_name, sizeof port_name, "port%d", p);
 		remove_proc_entry(port_name, core_proc->dev_dir);
 	}
-	remove_proc_entry("info", core_proc->dev_dir);
 	remove_proc_entry(core_proc->dev_dir_name, core_dir);
 	kfree(priv->proc);

From roland at topspin.com Fri Sep 10 13:36:53 2004
From: roland at topspin.com (Roland Dreier)
Date: Fri, 10 Sep 2004 13:36:53 -0700
Subject: [openib-general] [PATCH] implement device enumeration & async events
Message-ID: <52ekl9dfoa.fsf@topspin.com>

This rather large diff implements the device enumeration and async
event API that I just described.

 - The MAD module and IPoIB both fully use the ib_client device
   enumeration method. You can now load and unload ib_mthca and
   ib_ipoib in any order and watch IPoIB interfaces appear and
   disappear appropriately. This also fixes crashes/hangs when
   unloading IPoIB with interfaces still configured. (And there was
   much rejoicing!)

 - Switching SDP and SRP over to the new ib_client stuff still needs
   doing. To remind me I added "__deprecated" to the old
   ib_device_get_by_index API that they're using, so there's a warning
   every build.

 - This finally kills off the ts_ib_provider.h file. Yay!

If no one objects to the API I'll commit this to my branch shortly.
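
The "load and unload in any order" behaviour described above can be illustrated with a small userspace sketch. This is not the code from the patch: every `mock_` name is hypothetical, fixed-size arrays stand in for the kernel's list_head lists, and there is no locking (the patch serializes these paths with device_sem). It only assumes the general idea that registering a client replays add() for devices already present, and registering a device calls add() on clients already present.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_MAX 8	/* arbitrary capacity chosen for this sketch */

/* Hypothetical stand-ins for struct ib_device and struct ib_client. */
struct mock_device {
	const char *name;
};

struct mock_client {
	void (*add)(struct mock_device *);
	void (*remove)(struct mock_device *);
};

static struct mock_device *mock_devices[MOCK_MAX];
static struct mock_client *mock_clients[MOCK_MAX];

/* Record the device, then tell every registered client about it. */
int mock_register_device(struct mock_device *dev)
{
	size_t i, slot;

	assert(dev != NULL);
	for (slot = 0; slot < MOCK_MAX && mock_devices[slot]; ++slot)
		;
	if (slot == MOCK_MAX)
		return -1;
	mock_devices[slot] = dev;

	for (i = 0; i < MOCK_MAX; ++i)
		if (mock_clients[i] && mock_clients[i]->add)
			mock_clients[i]->add(dev);
	return 0;
}

/* Let every client drop the device, then forget it. */
void mock_unregister_device(struct mock_device *dev)
{
	size_t i;

	for (i = 0; i < MOCK_MAX; ++i)
		if (mock_clients[i] && mock_clients[i]->remove)
			mock_clients[i]->remove(dev);
	for (i = 0; i < MOCK_MAX; ++i)
		if (mock_devices[i] == dev)
			mock_devices[i] = NULL;
}

/* Record the client, then replay add() for devices already present. */
int mock_register_client(struct mock_client *client)
{
	size_t i, slot;

	assert(client != NULL);
	for (slot = 0; slot < MOCK_MAX && mock_clients[slot]; ++slot)
		;
	if (slot == MOCK_MAX)
		return -1;
	mock_clients[slot] = client;

	for (i = 0; i < MOCK_MAX; ++i)
		if (mock_devices[i] && client->add)
			client->add(mock_devices[i]);
	return 0;
}

/* Call remove() for every device still present, then forget the client. */
void mock_unregister_client(struct mock_client *client)
{
	size_t i;

	for (i = 0; i < MOCK_MAX; ++i)
		if (mock_devices[i] && client->remove)
			client->remove(mock_devices[i]);
	for (i = 0; i < MOCK_MAX; ++i)
		if (mock_clients[i] == client)
			mock_clients[i] = NULL;
}

/* A toy "ULP" that keeps one interface per device it has been handed. */
int mock_interface_count;

void mock_ulp_add(struct mock_device *dev)
{
	(void)dev;
	++mock_interface_count;
}

void mock_ulp_remove(struct mock_device *dev)
{
	(void)dev;
	--mock_interface_count;
}
```

With a client built from mock_ulp_add and mock_ulp_remove, the interface count tracks the device whichever side registers first, which is the property the first bullet above is describing.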
- Roland Index: infiniband/ulp/ipoib/ipoib_verbs.c =================================================================== --- infiniband/ulp/ipoib/ipoib_verbs.c (revision 759) +++ infiniband/ulp/ipoib/ipoib_verbs.c (working copy) @@ -214,7 +214,7 @@ return -ENODEV; } - priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, dev, + priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, IPOIB_TX_RING_SIZE + IPOIB_RX_RING_SIZE + 1); if (IS_ERR(priv->cq)) { TS_REPORT_FATAL(MOD_IB_NET, "%s: failed to create CQ", @@ -256,7 +256,6 @@ out_free_pd: ib_dealloc_pd(priv->pd); - module_put(priv->ca->owner); return -ENODEV; } @@ -274,101 +273,36 @@ } if (ib_dereg_mr(priv->mr)) - TS_REPORT_WARN(MOD_IB_NET, - "%s: ib_dereg_mr failed", dev->name); + printk(KERN_WARNING "%s: ib_dereg_mr failed\n", dev->name); if (ib_destroy_cq(priv->cq)) - TS_REPORT_WARN(MOD_IB_NET, - "%s: ib_cq_destroy failed", dev->name); + printk(KERN_WARNING "%s: ib_cq_destroy failed\n", dev->name); if (ib_dealloc_pd(priv->pd)) - TS_REPORT_WARN(MOD_IB_NET, - "%s: ib_dealloc_pd failed", dev->name); - - module_put(priv->ca->owner); + printk(KERN_WARNING "%s: ib_dealloc_pd failed\n", dev->name); } -static void ipoib_device_notifier(struct ib_device_notifier *self, - struct ib_device *device, int event) +static void ipoib_event(struct ib_event_handler *handler, + struct ib_event *record) { - struct ib_device_attr props; - int port; + struct ipoib_dev_priv *priv = + container_of(handler, struct ipoib_dev_priv, event_handler); - switch (event) { - case IB_DEVICE_NOTIFIER_ADD: - if (ib_query_device(device, &props)) { - TS_REPORT_WARN(MOD_IB_NET, "ib_device_properties_get failed"); - return; - } - - if (device->node_type == IB_NODE_SWITCH) { - if (try_module_get(device->owner)) - ipoib_add_port("ib%d", device, 0); - } else { - for (port = 1; port <= props.phys_port_cnt; ++port) - if (try_module_get(device->owner)) - ipoib_add_port("ib%d", device, port); - } - break; - - case IB_DEVICE_NOTIFIER_REMOVE: - /* Yikes! 
We don't support devices going away from - underneath us yet! */ - TS_REPORT_WARN(MOD_IB_NET, - "IPoIB driver can't handle removal of device %s", - device->name); - break; - - default: - TS_REPORT_WARN(MOD_IB_NET, "Unknown device notifier event %d."); - break; - } -} - -static struct ib_device_notifier ipoib_notifier = { - .notifier = ipoib_device_notifier -}; - -int ipoib_transport_create_devices(void) -{ - ib_device_notifier_register(&ipoib_notifier); - return 0; -} - -void ipoib_transport_cleanup(void) -{ - ib_device_notifier_deregister(&ipoib_notifier); -} - -static void ipoib_async_event(struct ib_async_event_record *record, - void *priv_ptr) -{ - struct ipoib_dev_priv *priv = priv_ptr; - if (record->event == IB_EVENT_PORT_ACTIVE) { TS_TRACE(MOD_IB_NET, T_VERBOSE, TRACE_IB_NET_GEN, - "%s: Port active Event", priv->dev.name); - - ipoib_ib_dev_flush(&priv->dev); - } else - TS_REPORT_WARN(MOD_IB_NET, - "%s: Unexpected event %d", priv->dev.name, - record->event); + "%s: Port active event", priv->dev.name); + schedule_work(&priv->flush_task); + } } int ipoib_port_monitor_dev_start(struct net_device *dev) { struct ipoib_dev_priv *priv = dev->priv; - struct ib_async_event_record event_record = { - .device = priv->ca, - .event = IB_EVENT_PORT_ACTIVE, - }; - if (ib_async_event_handler_register(&event_record, - ipoib_async_event, - priv, &priv->active_handler)) { - TS_REPORT_FATAL(MOD_IB_NET, - "ib_async_event_handler_register failed for TS_IB_PORT_ACTIVE"); + priv->event_handler.handler = ipoib_event; + + if (ib_register_event_handler(priv->ca, &priv->event_handler)) { + printk(KERN_WARNING "ib_handler_register_event failed\n"); return -EINVAL; } @@ -379,7 +313,7 @@ { struct ipoib_dev_priv *priv = dev->priv; - ib_async_event_handler_deregister(priv->active_handler); + ib_unregister_event_handler(&priv->event_handler); } /* Index: infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- infiniband/ulp/ipoib/ipoib_main.c 
(revision 759) +++ infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -605,6 +605,7 @@ INIT_LIST_HEAD(&priv->child_intfs); INIT_LIST_HEAD(&priv->multicast_list); + INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush, &priv->dev); INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task, &priv->dev); priv->dev.priv = priv; @@ -691,22 +692,57 @@ return result; } +static void ipoib_add_one(struct ib_device *device) +{ + struct ib_device_attr props; + int port; + + if (ib_query_device(device, &props)) { + TS_REPORT_WARN(MOD_IB_NET, "ib_device_properties_get failed"); + return; + } + + if (device->node_type == IB_NODE_SWITCH) + ipoib_add_port("ib%d", device, 0); + else + for (port = 1; port <= props.phys_port_cnt; ++port) + ipoib_add_port("ib%d", device, port); +} + +static void ipoib_remove_one(struct ib_device *device) +{ + struct ipoib_dev_priv *priv, *tmp; + + LIST_HEAD(delete); + + down(&ipoib_device_mutex); + list_for_each_entry_safe(priv, tmp, &ipoib_device_list, list) { + list_del(&priv->list); + list_add_tail(&priv->list, &delete); + } + up(&ipoib_device_mutex); + + list_for_each_entry_safe(priv, tmp, &delete, list) { + unregister_netdev(&priv->dev); + ipoib_port_monitor_dev_stop(&priv->dev); + ipoib_dev_cleanup(&priv->dev); + kfree(priv); + } +} + +static struct ib_client ipoib_client = { + .add = ipoib_add_one, + .remove = ipoib_remove_one +}; + static int __init ipoib_init_module(void) { int ret; - ret = ipoib_transport_create_devices(); + ret = ib_register_client(&ipoib_client); if (ret) return ret; - down(&ipoib_device_mutex); - if (list_empty(&ipoib_device_list)) { - up(&ipoib_device_mutex); - ipoib_transport_cleanup(); - return -ENODEV; - } - up(&ipoib_device_mutex); - ipoib_vlan_init(); return 0; @@ -714,22 +750,8 @@ static void __exit ipoib_cleanup_module(void) { - struct ipoib_dev_priv *priv, *tpriv; - ipoib_vlan_cleanup(); - ipoib_transport_cleanup(); - - down(&ipoib_device_mutex); - list_for_each_entry_safe(priv, tpriv, &ipoib_device_list, list) { - 
ipoib_port_monitor_dev_stop(&priv->dev); - ipoib_dev_cleanup(&priv->dev); - unregister_netdev(&priv->dev); - - list_del(&priv->list); - - kfree(priv); - } - up(&ipoib_device_mutex); + ib_unregister_client(&ipoib_client); } module_init(ipoib_init_module); Index: infiniband/ulp/ipoib/ipoib.h =================================================================== --- infiniband/ulp/ipoib/ipoib.h (revision 759) +++ infiniband/ulp/ipoib/ipoib.h (working copy) @@ -117,6 +117,7 @@ atomic_t mcast_joins; + struct work_struct flush_task; struct work_struct restart_task; struct ib_device *ca; @@ -152,7 +153,7 @@ struct proc_dir_entry *arp_proc_entry; struct proc_dir_entry *mcast_proc_entry; - struct ib_async_event_handler *active_handler; + struct ib_event_handler event_handler; struct net_device_stats stats; }; @@ -253,10 +254,6 @@ int ipoib_add_port(const char *format, struct ib_device *device, tTS_IB_PORT port); -int ipoib_transport_create_devices(void); - -void ipoib_transport_cleanup(void); - int ipoib_port_monitor_dev_start(struct net_device *dev); void ipoib_port_monitor_dev_stop(struct net_device *dev); Index: infiniband/ulp/ipoib/ip2pr_link.c =================================================================== --- infiniband/ulp/ipoib/ip2pr_link.c (revision 759) +++ infiniband/ulp/ipoib/ip2pr_link.c (working copy) @@ -27,8 +27,7 @@ static tTS_KERNEL_TIMER_STRUCT _tsIp2prPathTimer; static tIP2PR_PATH_LOOKUP_ID _tsIp2prPathLookupId = 0; -static struct ib_async_event_handler *_tsIp2prAsyncErrHandle[IP2PR_MAX_HCAS]; -static struct ib_async_event_handler *_tsIp2prAsyncActHandle[IP2PR_MAX_HCAS]; +static struct ib_event_handler _tsIp2prEventHandle[IP2PR_MAX_HCAS]; static unsigned int ip2pr_total_req = 0; static unsigned int ip2pr_arp_timeout = 0; @@ -1311,9 +1310,9 @@ return 0; } -/* ip2pr_async_event_func -- IB async event handler, for clearing caches */ -static void ip2pr_async_event_func(struct ib_async_event_record *record, - void *arg) +/* ip2pr_event_func -- IB async event 
handler, for clearing caches */ +static void ip2pr_event_func(struct ib_event_handler *handler, + struct ib_event *record) { struct ip2pr_path_element *path_elmt; s32 result; @@ -1321,15 +1320,10 @@ unsigned long flags; struct ip2pr_gid_pr_element *prn_elmt; - if (NULL == record) { - - TS_TRACE(MOD_IP2PR, T_VERBOSE, TRACE_FLOW_WARN, - "ASYNC: Event with no record of what happened?"); + if (record->event != IB_EVENT_PORT_ACTIVE && + record->event != IB_EVENT_PORT_ERR) return; - } - /* if */ - TS_TRACE(MOD_IP2PR, T_VERBOSE, TRACE_FLOW_WARN, - "ASYNC: Event <%d> reported, clearing cache."); + /* * destroy all cached path record elements. */ @@ -1346,18 +1340,18 @@ for (sgid_elmt = _tsIp2prLinkRoot.src_gid_list; NULL != sgid_elmt; sgid_elmt = sgid_elmt->next) { if ((sgid_elmt->ca == record->device) && - (sgid_elmt->port == record->modifier.port)) { + (sgid_elmt->port == record->element.port_num)) { sgid_elmt->port_state = record->event == IB_EVENT_PORT_ACTIVE ? IB_PORT_ACTIVE : IB_PORT_DOWN; /* Gid could have changed. Get the gid */ if (ib_cached_gid_get(record->device, - record->modifier.port, + record->element.port_num, 0, sgid_elmt->gid)) { TS_TRACE(MOD_IP2PR, T_VERBOSE, TRACE_FLOW_WARN, "Could not get GID: on hca=<%d>,port=<%d>, event=%d", - record->device, record->modifier.port, + record->device, record->element.port_num, record->event); /* for now zero it. 
Will get it, when user queries */ @@ -1375,7 +1369,7 @@ TS_TRACE(MOD_IP2PR, T_VERBOSE, TRACE_FLOW_WARN, "Async Port Event on hca=<%d>,port=<%d>, event=%d", - record->device, record->modifier.port, record->event); + record->device, record->element.port_num, record->event); return; } @@ -2074,7 +2068,6 @@ s32 ip2pr_link_addr_init(void) { s32 result = 0; - struct ib_async_event_record evt_rec; int i; struct ib_device *hca_device; @@ -2138,43 +2131,17 @@ * Install async event handler, to clear cache on port down */ - for (i = 0; i < IP2PR_MAX_HCAS; i++) { - _tsIp2prAsyncErrHandle[i] = TS_IP2PR_INVALID_ASYNC_HANDLE; - _tsIp2prAsyncActHandle[i] = TS_IP2PR_INVALID_ASYNC_HANDLE; - } - for (i = 0; ((hca_device = ib_device_get_by_index(i)) != NULL); ++i) { - evt_rec.device = hca_device; - evt_rec.event = IB_PORT_ERROR; - result = ib_async_event_handler_register(&evt_rec, - ip2pr_async_event_func, - NULL, - &_tsIp2prAsyncErrHandle - [i]); - if (0 != result) { - + _tsIp2prEventHandle[i].handler = ip2pr_event_func; + result = ib_register_event_handler(hca_device, &_tsIp2prEventHandle[i]); + if (result) { TS_TRACE(MOD_IP2PR, T_VERBOSE, TRACE_FLOW_WARN, "INIT: Error <%d> registering event handler.", result); goto error_async; } - /* if */ - evt_rec.device = hca_device; - evt_rec.event = IB_EVENT_PORT_ACTIVE; - result = ib_async_event_handler_register(&evt_rec, - ip2pr_async_event_func, - NULL, - &_tsIp2prAsyncActHandle - [i]); - if (0 != result) { + } - TS_TRACE(MOD_IP2PR, T_VERBOSE, TRACE_FLOW_WARN, - "INIT: Error <%d> registering event handler.", - result); - goto error_async; - } /* if */ - } /* for */ - /* * create timer for pruning path record cache. 
*/ @@ -2198,16 +2165,9 @@ return 0; error_async: - for (i = 0; i < IP2PR_MAX_HCAS; i++) { - if (_tsIp2prAsyncErrHandle[i] != TS_IP2PR_INVALID_ASYNC_HANDLE) { - ib_async_event_handler_deregister(_tsIp2prAsyncErrHandle - [i]); - } - if (_tsIp2prAsyncActHandle[i] != TS_IP2PR_INVALID_ASYNC_HANDLE) { - ib_async_event_handler_deregister(_tsIp2prAsyncActHandle - [i]); - } - } + for (i = 0; i < IP2PR_MAX_HCAS; i++) + if (_tsIp2prEventHandle[i].device) + ib_unregister_event_handler(&_tsIp2prEventHandle[i]); kmem_cache_destroy(_tsIp2prLinkRoot.user_req); error_user: @@ -2243,16 +2203,9 @@ /* * release async event handler(s) */ - for (i = 0; i < IP2PR_MAX_HCAS; i++) { - if (_tsIp2prAsyncErrHandle[i] != TS_IP2PR_INVALID_ASYNC_HANDLE) { - ib_async_event_handler_deregister(_tsIp2prAsyncErrHandle - [i]); - } - if (_tsIp2prAsyncActHandle[i] != TS_IP2PR_INVALID_ASYNC_HANDLE) { - ib_async_event_handler_deregister(_tsIp2prAsyncActHandle - [i]); - } - } + for (i = 0; i < IP2PR_MAX_HCAS; i++) + if (_tsIp2prEventHandle[i].device) + ib_unregister_event_handler(&_tsIp2prEventHandle[i]); /* * clear wait list Index: infiniband/ulp/srp/srp_host.c =================================================================== --- infiniband/ulp/srp/srp_host.c (revision 759) +++ infiniband/ulp/srp/srp_host.c (working copy) @@ -563,6 +563,7 @@ target->cqs_hndl[hca_index] = ib_create_cq(hca->ca_hndl, cq_send_event, + NULL, target, MAX_SEND_WQES); @@ -583,6 +584,7 @@ target->cqr_hndl[hca_index] = ib_create_cq(hca->ca_hndl, cq_recv_event, + NULL, target, MAX_RECV_WQES); Index: infiniband/ulp/srp/srp_dm.c =================================================================== --- infiniband/ulp/srp/srp_dm.c (revision 759) +++ infiniband/ulp/srp/srp_dm.c (working copy) @@ -1388,7 +1388,8 @@ return (status); } -void srp_hca_async_event_handler(struct ib_async_event_record *event, void *arg) +void srp_hca_async_event_handler(struct ib_event_handler *handler, + struct ib_event *event) { int hca_index; 
srp_host_port_params_t *port; @@ -1406,7 +1407,7 @@ } hca = &hca_params[hca_index]; - port = &hca->port[event->modifier.port - 1]; + port = &hca->port[event->element.port_num - 1]; switch (event->event) { @@ -1418,7 +1419,7 @@ */ TS_REPORT_WARN(MOD_SRPTP, "Port active event for hca %d port %d", - hca_index + 1, event->modifier.port); + hca_index + 1, event->element.port_num); if (!port->valid) break; @@ -1434,7 +1435,7 @@ up(&driver_params.sema); break; - case IB_LOCAL_CATASTROPHIC_ERROR: + case IB_EVENT_DEVICE_FATAL: { int port_index; @@ -1454,17 +1455,17 @@ if (!hca->port[port_index].valid) break; - event->event = IB_PORT_ERROR; + event->event = IB_EVENT_PORT_ERR; - event->modifier.port = + event->element.port_num = hca->port[port_index].local_port; - srp_hca_async_event_handler(event, NULL); + srp_hca_async_event_handler(handler, event); } } break; - case IB_PORT_ERROR: + case IB_EVENT_PORT_ERR: { u32 i; int ioc_index; @@ -1473,7 +1474,7 @@ TS_REPORT_WARN(MOD_SRPTP, "Port error event for hca %d port %d", - hca_index + 1, event->modifier.port); + hca_index + 1, event->element.port_num); if (!port->valid) break; @@ -1554,24 +1555,15 @@ } break; - case IB_LID_CHANGE: - break; - - case IB_PKEY_CHANGE: - break; - default: - TS_REPORT_FATAL(MOD_SRPTP, "Unsupported event type %d", - event->event); break; } } int srp_dm_init(void) { - int i, async_event_index, hca_index, status; + int hca_index, status; srp_host_hca_params_t *hca; - struct ib_async_event_record async_record; max_path_record_cache = max_srp_targets * MAX_LOCAL_PORTS; @@ -1610,27 +1602,15 @@ "Registering async events handler for HCA %d", hca->hca_index); - async_record.device = hca->ca_hndl; + hca->event_handler.handler = srp_hca_async_event_handler; + status = ib_register_event_handler(hca->ca_hndl, &hca->event_handler); - async_event_index = IB_LOCAL_CATASTROPHIC_ERROR; - for (i = 0; i < MAX_ASYNC_EVENT_HANDLES; i++) { - async_record.event = async_event_index; - status = 
ib_async_event_handler_register(&async_record, - srp_hca_async_event_handler, - hca, - &hca-> - async_handles - [i]); - - if (status) { - TS_REPORT_FATAL(MOD_SRPTP, - "Registration of async event " - "%d on hca %d failed", - i, hca->hca_index, status); - return (-EINVAL); - } - - async_event_index++; + if (status) { + TS_REPORT_FATAL(MOD_SRPTP, + "Registration of async event " + "hca %d failed", + hca->hca_index, status); + return (-EINVAL); } } @@ -1646,7 +1626,7 @@ void srp_dm_unload(void) { srp_host_hca_params_t *hca; - int i, hca_index; + int hca_index; /* * Unegister for async events on the HCA @@ -1665,9 +1645,7 @@ * Loop through the async handles for the HCA and * deregister them. */ - for (i = 0; i < MAX_ASYNC_EVENT_HANDLES; i++) { - ib_async_event_handler_deregister(hca->async_handles[i]); - } + ib_unregister_event_handler(&hca->event_handler); } /* Register with DM to register for async notification */ Index: infiniband/ulp/srp/srp_host.h =================================================================== --- infiniband/ulp/srp/srp_host.h (revision 759) +++ infiniband/ulp/srp/srp_host.h (working copy) @@ -161,7 +161,7 @@ struct _srp_host_port_params port[MAX_LOCAL_PORTS_PER_HCA]; - struct ib_async_event_handler *async_handles[MAX_ASYNC_EVENT_HANDLES]; + struct ib_event_handler event_handler; } srp_host_hca_params_t; Index: infiniband/ulp/srp/srptp.c =================================================================== --- infiniband/ulp/srp/srptp.c (revision 759) +++ infiniband/ulp/srp/srptp.c (working copy) @@ -681,6 +681,8 @@ init_attr.rq_sig_type = IB_SIGNAL_ALL_WR; init_attr.qp_type = IB_QPT_RC; + init_attr.event_handler = NULL; + conn->qp_hndl = ib_create_qp(hca->pd_hndl, &init_attr, &qp_cap); if (IS_ERR(conn->qp_hndl)) { TS_REPORT_FATAL(MOD_SRPTP, "QP Create failed %d", Index: infiniband/ulp/sdp/sdp_conn.c =================================================================== --- infiniband/ulp/sdp/sdp_conn.c (revision 759) +++ 
infiniband/ulp/sdp/sdp_conn.c (working copy) @@ -1065,6 +1065,7 @@ if (!conn->send_cq) { conn->send_cq = ib_create_cq(conn->ca, sdp_cq_event_handler, + NULL, (void *)(unsigned long)conn->hashent, conn->send_cq_size); if (IS_ERR(conn->send_cq)) { @@ -1091,6 +1092,7 @@ if (!conn->recv_cq) { conn->recv_cq = ib_create_cq(conn->ca, sdp_cq_event_handler, + NULL, (void *)(unsigned long)conn->hashent, conn->recv_cq_size); Index: infiniband/include/ib_verbs.h =================================================================== --- infiniband/include/ib_verbs.h (revision 759) +++ infiniband/include/ib_verbs.h (working copy) @@ -206,6 +206,39 @@ u8 init_type; }; +enum ib_event_type { + IB_EVENT_CQ_ERR, + IB_EVENT_QP_FATAL, + IB_EVENT_QP_REQ_ERR, + IB_EVENT_QP_ACCESS_ERR, + IB_EVENT_COMM_EST, + IB_EVENT_SQ_DRAINED, + IB_EVENT_PATH_MIG, + IB_EVENT_PATH_MIG_ERR, + IB_EVENT_DEVICE_FATAL, + IB_EVENT_PORT_ACTIVE, + IB_EVENT_PORT_ERR, + IB_EVENT_LID_CHANGE, + IB_EVENT_PKEY_CHANGE, + IB_EVENT_SM_CHANGE +}; + +struct ib_event { + struct ib_device *device; + union { + struct ib_cq *cq; + struct ib_qp *qp; + u8 port_num; + } element; + enum ib_event_type event; +}; + +struct ib_event_handler { + struct ib_device *device; + void (*handler)(struct ib_event_handler *, struct ib_event *); + struct list_head list; +}; + struct ib_global_route { union ib_gid dgid; u32 flow_label; @@ -316,6 +349,7 @@ }; struct ib_qp_init_attr { + void (*event_handler)(struct ib_event *, void *); void *qp_context; struct ib_cq *send_cq; struct ib_cq *recv_cq; @@ -549,6 +583,7 @@ struct ib_cq { struct ib_device *device; ib_comp_handler comp_handler; + void (*event_handler)(struct ib_event *, void *); void * context; int cqe; atomic_t usecnt; /* count number of work queues */ @@ -567,6 +602,7 @@ struct ib_cq *send_cq; struct ib_cq *recv_cq; struct ib_srq *srq; + void (*event_handler)(struct ib_event *, void *); void *qp_context; u32 qp_num; }; @@ -600,7 +636,10 @@ struct pci_dev *dma_device; char 
name[IB_DEVICE_NAME_MAX]; - char *provider; + + struct list_head event_handler_list; + spinlock_t event_handler_lock; + struct list_head core_list; void *core; void *mad; @@ -709,11 +748,27 @@ u8 node_type; }; +struct ib_client { + void (*add) (struct ib_device *); + void (*remove)(struct ib_device *); + + struct list_head list; +}; + struct ib_device *ib_alloc_device(size_t size); void ib_dealloc_device(struct ib_device *device); + int ib_register_device (struct ib_device *device); -int ib_deregister_device(struct ib_device *device); +int ib_unregister_device(struct ib_device *device); +int ib_register_client (struct ib_client *client); +int ib_unregister_client(struct ib_client *client); + +int ib_register_event_handler (struct ib_device *device, + struct ib_event_handler *event_handler); +int ib_unregister_event_handler(struct ib_event_handler *event_handler); +void ib_dispatch_event(struct ib_event *event); + int ib_query_device(struct ib_device *device, struct ib_device_attr *device_attr); @@ -774,6 +829,7 @@ struct ib_cq *ib_create_cq(struct ib_device *device, ib_comp_handler comp_handler, + void (*event_handler)(struct ib_event *, void *), void *cq_context, int cqe); int ib_resize_cq(struct ib_cq *cq, int cqe); Index: infiniband/include/ts_ib_core_types.h =================================================================== --- infiniband/include/ts_ib_core_types.h (revision 759) +++ infiniband/include/ts_ib_core_types.h (working copy) @@ -74,59 +74,12 @@ #ifdef __KERNEL__ -enum ib_async_event { - IB_QP_PATH_MIGRATED, - IB_EEC_PATH_MIGRATED, - IB_QP_COMMUNICATION_ESTABLISHED, - IB_EEC_COMMUNICATION_ESTABLISHED, - IB_SEND_QUEUE_DRAINED, - IB_CQ_ERROR, - IB_LOCAL_WQ_INVALID_REQUEST_ERROR, - IB_LOCAL_WQ_ACCESS_VIOLATION_ERROR, - IB_LOCAL_WQ_CATASTROPHIC_ERROR, - IB_PATH_MIGRATION_ERROR, - IB_LOCAL_EEC_CATASTROPHIC_ERROR, - IB_LOCAL_CATASTROPHIC_ERROR, - IB_PORT_ERROR, - IB_EVENT_PORT_ACTIVE, - IB_LID_CHANGE, - IB_PKEY_CHANGE, -}; - -struct ib_async_event_handler; 
/* actual definition in core_async.c */ - -struct ib_async_event_record { - struct ib_device *device; - enum ib_async_event event; - union { - struct ib_qp *qp; - struct ib_eec *eec; - struct ib_cq *cq; - int port; - } modifier; -}; - -typedef void (*ib_async_event_handler_func)(struct ib_async_event_record *record, - void *arg); - /* enum definitions */ #define IB_MULTICAST_QPN 0xffffff /* structures */ -enum { - IB_DEVICE_NOTIFIER_ADD, - IB_DEVICE_NOTIFIER_REMOVE -}; - -struct ib_device_notifier { - void (*notifier)(struct ib_device_notifier *self, - struct ib_device *device, - int event); - struct list_head list; -}; - struct ib_sm_path { u16 sm_lid; tTS_IB_SL sm_sl; Index: infiniband/include/ts_ib_core.h =================================================================== --- infiniband/include/ts_ib_core.h (revision 759) +++ infiniband/include/ts_ib_core.h (working copy) @@ -38,17 +38,9 @@ } } -struct ib_device *ib_device_get_by_name(const char *name); -struct ib_device *ib_device_get_by_index(int index); -int ib_device_notifier_register(struct ib_device_notifier *notifier); -int ib_device_notifier_deregister(struct ib_device_notifier *notifier); +struct ib_device *ib_device_get_by_name(const char *name) __deprecated; +struct ib_device *ib_device_get_by_index(int index) __deprecated; -int ib_async_event_handler_register(struct ib_async_event_record *record, - ib_async_event_handler_func function, - void *arg, - struct ib_async_event_handler **handle); -int ib_async_event_handler_deregister(struct ib_async_event_handler *handle); - int ib_cached_node_guid_get(struct ib_device *device, tTS_IB_GUID node_guid); int ib_cached_port_properties_get(struct ib_device *device, Index: infiniband/include/ts_ib_provider.h =================================================================== --- infiniband/include/ts_ib_provider.h (revision 759) +++ infiniband/include/ts_ib_provider.h (working copy) @@ -1,38 +0,0 @@ -/* - This software is available to you under a choice of one 
of two - licenses. You may choose to be licensed under the terms of the GNU - General Public License (GPL) Version 2, available at - , or the OpenIB.org BSD - license, available in the LICENSE.TXT file accompanying this - software. These details are also available at - . - - THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - SOFTWARE. - - Copyright (c) 2004 Topspin Communications. All rights reserved. - - $Id$ -*/ - -#ifndef _TS_IB_PROVIDER_H -#define _TS_IB_PROVIDER_H - -#include - -void ib_async_event_dispatch(struct ib_async_event_record *event_record); - -#endif /* _TS_IB_PROVIDER_H */ - -/* - Local Variables: - c-file-style: "linux" - indent-tabs-mode: t - End: -*/ Index: infiniband/core/Makefile =================================================================== --- infiniband/core/Makefile (revision 759) +++ infiniband/core/Makefile (working copy) @@ -36,10 +36,9 @@ header_ud.o \ ib_verbs.o \ ib_sysfs.o \ + ib_device.o \ core_main.o \ - core_device.o \ core_fmr_pool.o \ - core_async.o \ core_cache.o \ core_proc.o Index: infiniband/core/core_cache.c =================================================================== --- infiniband/core/core_cache.c (revision 759) +++ infiniband/core/core_cache.c (working copy) @@ -260,71 +260,9 @@ } EXPORT_SYMBOL(ib_cached_pkey_find); -int ib_cache_setup(struct ib_device *device) +static void ib_cache_update(struct ib_device *device, + tTS_IB_PORT port) { - struct ib_device_private *priv = device->core; - struct ib_port_attr prop; - int p; - int ret; - - for (p = priv->start_port; p <= priv->end_port; ++p) { - 
priv->port_data[p].gid_table = NULL; - priv->port_data[p].pkey_table = NULL; - } - - for (p = priv->start_port; p <= priv->end_port; ++p) { - seqcount_init(&priv->port_data[p].lock); - ret = device->query_port(device, p, &prop); - if (ret) { - TS_REPORT_WARN(MOD_KERNEL_IB, - "query_port failed for %s", - device->name); - goto error; - } - priv->port_data[p].gid_table_alloc_length = prop.gid_tbl_len; - priv->port_data[p].gid_table = kmalloc(prop.gid_tbl_len * sizeof (tTS_IB_GID), - GFP_KERNEL); - if (!priv->port_data[p].gid_table) { - ret = -ENOMEM; - goto error; - } - - priv->port_data[p].pkey_table_alloc_length = prop.pkey_tbl_len; - priv->port_data[p].pkey_table = kmalloc(prop.pkey_tbl_len * sizeof (u16), - GFP_KERNEL); - if (!priv->port_data[p].pkey_table) { - ret = -ENOMEM; - goto error; - } - - ib_cache_update(device, p); - } - - return 0; - - error: - for (p = priv->start_port; p <= priv->end_port; ++p) { - kfree(priv->port_data[p].gid_table); - kfree(priv->port_data[p].pkey_table); - } - - return ret; -} - -void ib_cache_cleanup(struct ib_device *device) -{ - struct ib_device_private *priv = device->core; - int p; - - for (p = priv->start_port; p <= priv->end_port; ++p) { - kfree(priv->port_data[p].gid_table); - kfree(priv->port_data[p].pkey_table); - } -} - -void ib_cache_update(struct ib_device *device, - tTS_IB_PORT port) -{ struct ib_device_private *priv = device->core; struct ib_port_data *info = &priv->port_data[port]; struct ib_port_attr *tprops = NULL; @@ -405,6 +343,102 @@ kfree(tgid); } +static void ib_cache_task(void *port_ptr) +{ + struct ib_port_data *port_data = port_ptr; + + ib_cache_update(port_data->device, port_data->port_num); +} + +static void ib_cache_event(struct ib_event_handler *handler, + struct ib_event *event) +{ + if (event->event == IB_EVENT_PORT_ERR || + event->event == IB_EVENT_PORT_ACTIVE || + event->event == IB_EVENT_LID_CHANGE || + event->event == IB_EVENT_PKEY_CHANGE || + event->event == IB_EVENT_SM_CHANGE) { + struct 
ib_device_private *priv = event->device->core; + schedule_work(&priv->port_data[event->element.port_num].refresh_task); + } +} + +int ib_cache_setup(struct ib_device *device) +{ + struct ib_device_private *priv = device->core; + struct ib_port_attr prop; + int p; + int ret; + + for (p = priv->start_port; p <= priv->end_port; ++p) { + priv->port_data[p].device = device; + priv->port_data[p].port_num = p; + INIT_WORK(&priv->port_data[p].refresh_task, + ib_cache_task, &priv->port_data[p]); + priv->port_data[p].gid_table = NULL; + priv->port_data[p].pkey_table = NULL; + priv->port_data[p].event_handler.device = NULL; + } + + for (p = priv->start_port; p <= priv->end_port; ++p) { + seqcount_init(&priv->port_data[p].lock); + ret = device->query_port(device, p, &prop); + if (ret) { + TS_REPORT_WARN(MOD_KERNEL_IB, + "query_port failed for %s", + device->name); + goto error; + } + priv->port_data[p].gid_table_alloc_length = prop.gid_tbl_len; + priv->port_data[p].gid_table = kmalloc(prop.gid_tbl_len * sizeof (tTS_IB_GID), + GFP_KERNEL); + if (!priv->port_data[p].gid_table) { + ret = -ENOMEM; + goto error; + } + + priv->port_data[p].pkey_table_alloc_length = prop.pkey_tbl_len; + priv->port_data[p].pkey_table = kmalloc(prop.pkey_tbl_len * sizeof (u16), + GFP_KERNEL); + if (!priv->port_data[p].pkey_table) { + ret = -ENOMEM; + goto error; + } + + ib_cache_update(device, p); + + priv->port_data[p].event_handler.handler = ib_cache_event; + ret = ib_register_event_handler(device, + &priv->port_data[p].event_handler); + if (ret) + goto error; + } + + return 0; + + error: + for (p = priv->start_port; p <= priv->end_port; ++p) { + if (priv->port_data[p].event_handler.device) + ib_unregister_event_handler(&priv->port_data[p].event_handler); + kfree(priv->port_data[p].gid_table); + kfree(priv->port_data[p].pkey_table); + } + + return ret; +} + +void ib_cache_cleanup(struct ib_device *device) +{ + struct ib_device_private *priv = device->core; + int p; + + for (p = priv->start_port; p <= 
priv->end_port; ++p) { + ib_unregister_event_handler(&priv->port_data[p].event_handler); + kfree(priv->port_data[p].gid_table); + kfree(priv->port_data[p].pkey_table); + } +} + /* Local Variables: c-file-style: "linux" Index: infiniband/core/core_priv.h =================================================================== --- infiniband/core/core_priv.h (revision 759) +++ infiniband/core/core_priv.h (working copy) @@ -24,16 +24,14 @@ #ifndef _CORE_PRIV_H #define _CORE_PRIV_H +#include +#include + #include -#include "ts_ib_provider.h" #include "ts_kernel_services.h" #include "ts_kernel_thread.h" -#include -#include -#include - enum { IB_PORT_CAP_SM, IB_PORT_CAP_SNMP_TUN, @@ -48,18 +46,19 @@ tTS_IB_GUID node_guid; struct ib_port_data *port_data; - struct list_head async_handler_list; - spinlock_t async_handler_lock; - tTS_KERNEL_QUEUE_THREAD async_thread; struct ib_core_proc *proc; }; struct ib_port_data { + struct ib_device *device; spinlock_t port_cap_lock; int port_cap_count[IB_PORT_CAP_NUM]; + struct ib_event_handler event_handler; + struct work_struct refresh_task; + seqcount_t lock; struct ib_port_attr properties; struct ib_sm_path sm_path; @@ -68,11 +67,11 @@ u16 pkey_table_alloc_length; union ib_gid *gid_table; u16 *pkey_table; + u8 port_num; }; int ib_cache_setup(struct ib_device *device); void ib_cache_cleanup(struct ib_device *device); -void ib_cache_update(struct ib_device *device, tTS_IB_PORT port); int ib_proc_setup(struct ib_device *device, int is_switch); void ib_proc_cleanup(struct ib_device *device); int ib_create_proc_dir(void); @@ -81,7 +80,7 @@ void ib_async_thread(struct list_head *entry, void *device_ptr); int ib_device_register_sysfs(struct ib_device *device); -void ib_device_deregister_sysfs(struct ib_device *device); +void ib_device_unregister_sysfs(struct ib_device *device); int ib_sysfs_setup(void); void ib_sysfs_cleanup(void); Index: infiniband/core/ib_device.c =================================================================== --- 
infiniband/core/ib_device.c (revision 715) +++ infiniband/core/ib_device.c (working copy) @@ -21,8 +21,6 @@ $Id$ */ -#include "ts_kernel_services.h" - #include #include #include @@ -33,9 +31,17 @@ #include "core_priv.h" static LIST_HEAD(device_list); -static LIST_HEAD(notifier_list); -static DECLARE_MUTEX(device_lock); +static LIST_HEAD(client_list); +/* + * device_sem protects access to both device_list and client_list. + * There's no real point to using multiple locks or something fancier + * like an rwsem: we always access both lists, and we're always + * modifying one list or the other list. In any case this is not a + * hot path so there's no point in trying to optimize. + */ +static DECLARE_MUTEX(device_sem); + static int ib_device_check_mandatory(struct ib_device *device) { #define IB_MANDATORY_FUNC(x) { offsetof(struct ib_device, x), #x } @@ -145,7 +151,7 @@ BUG_ON(device->reg_state != IB_DEV_UNREGISTERED); - ib_device_deregister_sysfs(device); + ib_device_unregister_sysfs(device); } EXPORT_SYMBOL(ib_dealloc_device); @@ -156,18 +162,19 @@ int ret; int p; - if (ib_device_check_mandatory(device)) { - return -EINVAL; - } + down(&device_sem); - down(&device_lock); - if (strchr(device->name, '%')) { ret = alloc_name(device->name); if (ret) goto out; } + if (ib_device_check_mandatory(device)) { + ret = -EINVAL; + goto out; + } + priv = kmalloc(sizeof *priv, GFP_KERNEL); if (!priv) { printk(KERN_WARNING "Couldn't allocate private struct for %s\n", @@ -209,8 +216,8 @@ device->core = priv; - INIT_LIST_HEAD(&priv->async_handler_list); - spin_lock_init(&priv->async_handler_lock); + INIT_LIST_HEAD(&device->event_handler_list); + spin_lock_init(&device->event_handler_lock); ret = ib_cache_setup(device); if (ret) { @@ -219,21 +226,11 @@ goto out_free_port; } - ret = tsKernelQueueThreadStart("ts_ib_async", - ib_async_thread, - device, - &priv->async_thread); - if (ret) { - printk(KERN_WARNING "Couldn't start async thread for %s\n", - device->name); - goto out_free_cache; 
- } - ret = ib_proc_setup(device, device->node_type == IB_NODE_SWITCH); if (ret) { printk(KERN_WARNING "Couldn't create /proc dir for %s\n", device->name); - goto out_stop_async; + goto out_free_cache; } if (ib_device_register_sysfs(device)) { @@ -243,27 +240,23 @@ } list_add_tail(&device->core_list, &device_list); + + device->reg_state = IB_DEV_REGISTERED; + { - struct list_head *ptr; - struct ib_device_notifier *notifier; + struct ib_client *client; - list_for_each(ptr, &notifier_list) { - notifier = list_entry(ptr, struct ib_device_notifier, list); - notifier->notifier(notifier, device, IB_DEVICE_NOTIFIER_ADD); - } + list_for_each_entry(client, &client_list, list) + if (client->add) + client->add(device); } - device->reg_state = IB_DEV_REGISTERED; - - up(&device_lock); + up(&device_sem); return 0; out_proc: ib_proc_cleanup(device); - out_stop_async: - tsKernelQueueThreadStop(priv->async_thread); - out_free_cache: ib_cache_cleanup(device); @@ -274,38 +267,29 @@ kfree(priv); out: - up(&device_lock); + up(&device_sem); return ret; } EXPORT_SYMBOL(ib_register_device); -int ib_deregister_device(struct ib_device *device) +int ib_unregister_device(struct ib_device *device) { - struct ib_device_private *priv; + struct ib_device_private *priv = device->core; + struct ib_client *client; - priv = device->core; + down(&device_sem); - if (tsKernelQueueThreadStop(priv->async_thread)) { - printk(KERN_WARNING "tsKernelThreadStop failed for %s async thread\n", - device->name); - } + list_for_each_entry_reverse(client, &client_list, list) + if (client->remove) + client->remove(device); + list_del(&device->core_list); + + up(&device_sem); + ib_proc_cleanup(device); ib_cache_cleanup(device); - down(&device_lock); - list_del(&device->core_list); - { - struct list_head *ptr; - struct ib_device_notifier *notifier; - - list_for_each_prev(ptr, &notifier_list) { - notifier = list_entry(ptr, struct ib_device_notifier, list); - notifier->notifier(notifier, device, IB_DEVICE_NOTIFIER_REMOVE); - } 
- } - up(&device_lock); - kfree(priv->port_data); kfree(priv); @@ -313,15 +297,15 @@ return 0; } -EXPORT_SYMBOL(ib_deregister_device); +EXPORT_SYMBOL(ib_unregister_device); struct ib_device *ib_device_get_by_name(const char *name) { struct ib_device *device; - down(&device_lock); + down(&device_sem); device = __ib_device_get_by_name(name); - up(&device_lock); + up(&device_sem); return device; } @@ -335,7 +319,7 @@ if (index < 0) return NULL; - down(&device_lock); + down(&device_sem); list_for_each(ptr, &device_list) { device = list_entry(ptr, struct ib_device, core_list); if (!index) @@ -345,38 +329,86 @@ device = NULL; out: - up(&device_lock); + up(&device_sem); return device; } EXPORT_SYMBOL(ib_device_get_by_index); -int ib_device_notifier_register(struct ib_device_notifier *notifier) +int ib_register_client(struct ib_client *client) { - struct list_head *ptr; struct ib_device *device; - down(&device_lock); - list_add_tail(&notifier->list, &notifier_list); - list_for_each(ptr, &device_list) { - device = list_entry(ptr, struct ib_device, core_list); - notifier->notifier(notifier, device, IB_DEVICE_NOTIFIER_ADD); - } - up(&device_lock); + down(&device_sem); + list_add_tail(&client->list, &client_list); + list_for_each_entry(device, &device_list, core_list) + if (client->add) + client->add(device); + + up(&device_sem); + return 0; } -EXPORT_SYMBOL(ib_device_notifier_register); +EXPORT_SYMBOL(ib_register_client); -int ib_device_notifier_deregister(struct ib_device_notifier *notifier) +int ib_unregister_client(struct ib_client *client) { - down(&device_lock); - list_del(&notifier->list); - up(&device_lock); + struct ib_device *device; + down(&device_sem); + + list_for_each_entry(device, &device_list, core_list) + if (client->remove) + client->remove(device); + list_del(&client->list); + + up(&device_sem); + return 0; } -EXPORT_SYMBOL(ib_device_notifier_deregister); +EXPORT_SYMBOL(ib_unregister_client); +int ib_register_event_handler (struct ib_device *device, + struct 
ib_event_handler *event_handler) +{ + unsigned long flags; + + event_handler->device = device; + + spin_lock_irqsave(&device->event_handler_lock, flags); + list_add_tail(&event_handler->list, &device->event_handler_list); + spin_unlock_irqrestore(&device->event_handler_lock, flags); + + return 0; +} +EXPORT_SYMBOL(ib_register_event_handler); + +int ib_unregister_event_handler(struct ib_event_handler *event_handler) +{ + unsigned long flags; + + spin_lock_irqsave(&event_handler->device->event_handler_lock, flags); + list_del(&event_handler->list); + spin_unlock_irqrestore(&event_handler->device->event_handler_lock, flags); + + return 0; +} +EXPORT_SYMBOL(ib_unregister_event_handler); + +void ib_dispatch_event(struct ib_event *event) +{ + unsigned long flags; + struct ib_event_handler *handler; + + spin_lock_irqsave(&event->device->event_handler_lock, flags); + + list_for_each_entry(handler, &event->device->event_handler_list, list) + handler->handler(handler, event); + + spin_unlock_irqrestore(&event->device->event_handler_lock, flags); +} +EXPORT_SYMBOL(ib_dispatch_event); + int ib_query_device(struct ib_device *device, struct ib_device_attr *device_attr) { Index: infiniband/core/mad_main.c =================================================================== --- infiniband/core/mad_main.c (revision 759) +++ infiniband/core/mad_main.c (working copy) @@ -23,11 +23,6 @@ #include -#include "mad_priv.h" - -#include "ts_kernel_trace.h" -#include "ts_kernel_services.h" - #include #include @@ -37,10 +32,10 @@ /* Need the definition of high_memory: */ #include -#ifdef CONFIG_KMOD -#include -#endif +#include "ts_kernel_services.h" +#include "mad_priv.h" + MODULE_AUTHOR("Roland Dreier"); MODULE_DESCRIPTION("kernel IB MAD API"); MODULE_LICENSE("Dual BSD/GPL"); @@ -60,11 +55,11 @@ *mr = ib_reg_phys_mr(pd, &buffer_list, 1, /* list_len */ IB_ACCESS_LOCAL_WRITE, &iova); if (IS_ERR(*mr)) { - TS_REPORT_WARN(MOD_KERNEL_IB, - "ib_reg_phys_mr failed " - "size 0x%016" TS_U64_FMT "x, iova 
0x%016" TS_U64_FMT "x" - " (return code %d)", - buffer_list.size, iova, PTR_ERR(*mr)); + printk(KERN_WARNING "ib_reg_phys_mr failed " + "size 0x%016llx, iova 0x%016llx " + "(return code %ld)\n", + (unsigned long long) buffer_list.size, + (unsigned long long) iova, PTR_ERR(*mr)); return PTR_ERR(*mr); } @@ -82,10 +77,6 @@ int attr_mask; int ret; - TS_TRACE(MOD_KERNEL_IB, T_VERY_VERBOSE, TRACE_KERNEL_IB_GEN, - "Creating port %d QPN %d for device %s", - port, qpn, device->name); - { struct ib_qp_init_attr init_attr = { .send_cq = priv->cq, @@ -105,10 +96,10 @@ priv->qp[port][qpn] = ib_create_qp(priv->pd, &init_attr, &qp_cap); if (IS_ERR(priv->qp[port][qpn])) { - TS_REPORT_FATAL(MOD_KERNEL_IB, - "ib_special_qp_create failed for %s port %d QPN %d (%d)", - device->name, port, qpn, - PTR_ERR(priv->qp[port][qpn])); + printk(KERN_WARNING "ib_special_qp_create failed " + "for %s port %d QPN %d (%ld)\n", + device->name, port, qpn, + PTR_ERR(priv->qp[port][qpn])); return PTR_ERR(priv->qp[port][qpn]); } } @@ -125,9 +116,9 @@ ret = ib_modify_qp(priv->qp[port][qpn], &qp_attr, attr_mask, &qp_cap); if (ret) { - TS_REPORT_FATAL(MOD_KERNEL_IB, - "ib_modify_qp -> INIT failed for %s port %d QPN %d (%d)", - device->name, port, qpn, ret); + printk(KERN_WARNING "ib_modify_qp -> INIT failed " + "for %s port %d QPN %d (%d)\n", + device->name, port, qpn, ret); return ret; } @@ -135,9 +126,9 @@ attr_mask = IB_QP_STATE; ret = ib_modify_qp(priv->qp[port][qpn], &qp_attr, attr_mask, &qp_cap); if (ret) { - TS_REPORT_FATAL(MOD_KERNEL_IB, - "ib_modify_qp -> RTR failed for %s port %d QPN %d (%d)", - device->name, port, qpn, ret); + printk(KERN_WARNING "ib_modify_qp -> RTR failed " + "for %s port %d QPN %d (%d)\n", + device->name, port, qpn, ret); return ret; } @@ -148,16 +139,16 @@ IB_QP_SQ_PSN; ret = ib_modify_qp(priv->qp[port][qpn], &qp_attr, attr_mask, &qp_cap); if (ret) { - TS_REPORT_FATAL(MOD_KERNEL_IB, - "ib_modify_qp -> RTS failed for %s port %d QPN %d (%d)", - device->name, port, qpn, ret); + 
printk(KERN_WARNING "ib_modify_qp -> RTS failed " + "for %s port %d QPN %d (%d)\n", + device->name, port, qpn, ret); return ret; } return 0; } -static int ib_mad_init_one(struct ib_device *device) +static void ib_mad_add_one(struct ib_device *device) { struct ib_mad_private *priv; struct ib_device_attr prop; @@ -165,18 +156,13 @@ ret = ib_query_device(device, &prop); if (ret) - return ret; + return; - TS_TRACE(MOD_KERNEL_IB, T_VERY_VERBOSE, TRACE_KERNEL_IB_GEN, - "Setting up device %s, %d ports", - device->name, prop.phys_port_cnt); - priv = kmalloc(sizeof *priv, GFP_KERNEL); if (!priv) { - TS_REPORT_WARN(MOD_KERNEL_IB, - "Couldn't allocate private structure for %s", - device->name); - return -ENOMEM; + printk(KERN_WARNING "Couldn't allocate MAD private structure for %s\n", + device->name); + return; } device->mad = priv; @@ -187,9 +173,8 @@ priv->pd = ib_alloc_pd(device); if (IS_ERR(priv->pd)) { - TS_REPORT_FATAL(MOD_KERNEL_IB, - "Failed to allocate PD for %s", - device->name); + printk(KERN_WARNING "Failed to allocate MAD PD for %s\n", + device->name); goto error; } @@ -198,11 +183,10 @@ (IB_MAD_RECEIVES_PER_QP + IB_MAD_SENDS_PER_QP) * priv->num_port; priv->cq = ib_create_cq(device, ib_mad_completion, - device, entries); + NULL, device, entries); if (IS_ERR(priv->cq)) { - TS_REPORT_FATAL(MOD_KERNEL_IB, - "Failed to allocate CQ for %s", - device->name); + printk(KERN_WARNING "Failed to allocate MAD CQ for %s\n", + device->name); goto error_free_pd; } } @@ -214,9 +198,8 @@ INIT_WORK(&priv->cq_work, ib_mad_drain_cq, device); if (ib_mad_register_memory(priv->pd, &priv->mr, &priv->lkey)) { - TS_REPORT_FATAL(MOD_KERNEL_IB, - "Failed to allocate MR for %s", - device->name); + printk(KERN_WARNING "Failed to allocate MAD MR for %s\n", + device->name); goto error_free_cq; } @@ -225,9 +208,8 @@ device, &priv->work_thread); if (ret) { - TS_REPORT_WARN(MOD_KERNEL_IB, - "Couldn't start completion thread for %s", - device->name); + printk(KERN_WARNING "Couldn't start completion 
thread for %s\n", + device->name); goto error_free_mr; } @@ -272,7 +254,7 @@ } } - return 0; + return; error_free_qp: { @@ -307,7 +289,6 @@ error: kfree(priv); - return ret; } static void ib_mad_remove_one(struct ib_device *device) @@ -346,39 +327,15 @@ } } -static void ib_mad_device_notifier(struct ib_device_notifier *self, - struct ib_device *device, - int event) -{ - switch (event) { - case IB_DEVICE_NOTIFIER_ADD: - if (ib_mad_init_one(device)) - TS_REPORT_WARN(MOD_KERNEL_IB, - "Failed to initialize device."); - break; - - case IB_DEVICE_NOTIFIER_REMOVE: - ib_mad_remove_one(device); - break; - - default: - TS_REPORT_WARN(MOD_KERNEL_IB, - "Unknown device notifier event %d."); - break; - } -} - -static struct ib_device_notifier mad_notifier = { - .notifier = ib_mad_device_notifier +static struct ib_client mad_client = { + .add = ib_mad_add_one, + .remove = ib_mad_remove_one }; static int __init ib_mad_init(void) { int ret; - TS_REPORT_INIT(MOD_KERNEL_IB, - "Initializing IB MAD layer"); - ret = ib_mad_proc_setup(); if (ret) return ret; @@ -391,34 +348,25 @@ NULL, NULL); if (!mad_cache) { - TS_REPORT_FATAL(MOD_KERNEL_IB, - "Couldn't create MAD slab cache"); + printk(KERN_ERR "Couldn't create MAD slab cache\n"); ib_mad_proc_cleanup(); return -ENOMEM; } - ib_device_notifier_register(&mad_notifier); + if (ib_register_client(&mad_client)) { - TS_REPORT_INIT(MOD_KERNEL_IB, - "IB MAD layer initialized"); + } return 0; } static void __exit ib_mad_cleanup(void) { - TS_REPORT_CLEANUP(MOD_KERNEL_IB, - "Unloading IB MAD layer"); - - ib_device_notifier_deregister(&mad_notifier); + ib_unregister_client(&mad_client); ib_mad_proc_cleanup(); if (kmem_cache_destroy(mad_cache)) - TS_REPORT_WARN(MOD_KERNEL_IB, - "Failed to destroy MAD slab cache (memory leak?)"); - - TS_REPORT_CLEANUP(MOD_KERNEL_IB, - "IB MAD layer unloaded"); + printk(KERN_WARNING "Failed to destroy MAD slab cache (memory leak?)\n"); } module_init(ib_mad_init); Index: infiniband/core/mad_priv.h 
=================================================================== --- infiniband/core/mad_priv.h (revision 759) +++ infiniband/core/mad_priv.h (working copy) @@ -26,7 +26,6 @@ #include "ts_ib_mad.h" #include -#include "ts_ib_provider.h" #include #include "ts_kernel_thread.h" Index: infiniband/core/core_device.c =================================================================== --- infiniband/core/core_device.c (revision 759) +++ infiniband/core/core_device.c (working copy) @@ -1,432 +0,0 @@ -/* - This software is available to you under a choice of one of two - licenses. You may choose to be licensed under the terms of the GNU - General Public License (GPL) Version 2, available at - , or the OpenIB.org BSD - license, available in the LICENSE.TXT file accompanying this - software. These details are also available at - . - - THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - SOFTWARE. - - Copyright (c) 2004 Topspin Communications. All rights reserved. 
- - $Id$ -*/ - -#include "ts_kernel_services.h" - -#include -#include -#include -#include - -#include - -#include "core_priv.h" - -static LIST_HEAD(device_list); -static LIST_HEAD(notifier_list); -static DECLARE_MUTEX(device_lock); - -static int ib_device_check_mandatory(struct ib_device *device) -{ -#define IB_MANDATORY_FUNC(x) { offsetof(struct ib_device, x), #x } - static const struct { - size_t offset; - char *name; - } mandatory_table[] = { - IB_MANDATORY_FUNC(query_device), - IB_MANDATORY_FUNC(query_port), - IB_MANDATORY_FUNC(query_pkey), - IB_MANDATORY_FUNC(query_gid), - IB_MANDATORY_FUNC(alloc_pd), - IB_MANDATORY_FUNC(dealloc_pd), - IB_MANDATORY_FUNC(create_ah), - IB_MANDATORY_FUNC(destroy_ah), - IB_MANDATORY_FUNC(create_qp), - IB_MANDATORY_FUNC(modify_qp), - IB_MANDATORY_FUNC(destroy_qp), - IB_MANDATORY_FUNC(post_send), - IB_MANDATORY_FUNC(post_recv), - IB_MANDATORY_FUNC(create_cq), - IB_MANDATORY_FUNC(destroy_cq), - IB_MANDATORY_FUNC(poll_cq), - IB_MANDATORY_FUNC(req_notify_cq), - IB_MANDATORY_FUNC(reg_phys_mr), - IB_MANDATORY_FUNC(dereg_mr) - }; - int i; - - for (i = 0; i < sizeof mandatory_table / sizeof mandatory_table[0]; ++i) { - if (!*(void **) ((void *) device + mandatory_table[i].offset)) { - printk(KERN_WARNING "Device %s is missing mandatory function %s\n", - device->name, mandatory_table[i].name); - return -EINVAL; - } - } - - return 0; -} - -static struct ib_device *__ib_device_get_by_name(const char *name) -{ - struct ib_device *device; - - list_for_each_entry(device, &device_list, core_list) - if (!strncmp(name, device->name, IB_DEVICE_NAME_MAX)) - return device; - - return NULL; -} - - -static int alloc_name(char *name) -{ - long *inuse; - char buf[IB_DEVICE_NAME_MAX]; - struct ib_device *device; - int i; - - inuse = (long *) get_zeroed_page(GFP_KERNEL); - if (!inuse) - return -ENOMEM; - - list_for_each_entry(device, &device_list, core_list) { - if (!sscanf(device->name, name, &i)) - continue; - if (i < 0 || i >= PAGE_SIZE * 8) - continue; 
- snprintf(buf, sizeof buf, name, i); - if (!strncmp(buf, device->name, IB_DEVICE_NAME_MAX)) - set_bit(i, inuse); - } - - i = find_first_zero_bit(inuse, PAGE_SIZE * 8); - free_page((unsigned long) inuse); - snprintf(buf, sizeof buf, name, i); - - if (__ib_device_get_by_name(buf)) - return -ENFILE; - - strlcpy(name, buf, IB_DEVICE_NAME_MAX); - return 0; -} - -struct ib_device *ib_alloc_device(size_t size) -{ - void *dev; - - BUG_ON(size < sizeof (struct ib_device)); - - dev = kmalloc(size, GFP_KERNEL); - if (!dev) - return NULL; - - memset(dev, 0, size); - - return dev; -} -EXPORT_SYMBOL(ib_alloc_device); - -void ib_dealloc_device(struct ib_device *device) -{ - if (device->reg_state == IB_DEV_UNINITIALIZED) { - kfree(device); - return; - } - - BUG_ON(device->reg_state != IB_DEV_UNREGISTERED); - - ib_device_deregister_sysfs(device); -} -EXPORT_SYMBOL(ib_dealloc_device); - -int ib_register_device(struct ib_device *device) -{ - struct ib_device_private *priv; - struct ib_device_attr prop; - int ret; - int p; - - if (ib_device_check_mandatory(device)) { - return -EINVAL; - } - - down(&device_lock); - - if (strchr(device->name, '%')) { - ret = alloc_name(device->name); - if (ret) - goto out; - } - - priv = kmalloc(sizeof *priv, GFP_KERNEL); - if (!priv) { - printk(KERN_WARNING "Couldn't allocate private struct for %s\n", - device->name); - ret = -ENOMEM; - goto out; - } - - *priv = (struct ib_device_private) { 0 }; - - ret = device->query_device(device, &prop); - if (ret) { - printk(KERN_WARNING "query_device failed for %s\n", - device->name); - goto out_free; - } - - memcpy(priv->node_guid, &prop.node_guid, sizeof (tTS_IB_GUID)); - - if (device->node_type == IB_NODE_SWITCH) { - priv->start_port = priv->end_port = 0; - } else { - priv->start_port = 1; - priv->end_port = prop.phys_port_cnt; - } - - priv->port_data = kmalloc((priv->end_port + 1) * sizeof (struct ib_port_data), - GFP_KERNEL); - if (!priv->port_data) { - printk(KERN_WARNING "Couldn't allocate port info for 
%s\n", - device->name); - goto out_free; - } - - for (p = priv->start_port; p <= priv->end_port; ++p) { - spin_lock_init(&priv->port_data[p].port_cap_lock); - memset(priv->port_data[p].port_cap_count, 0, IB_PORT_CAP_NUM * sizeof (int)); - } - - device->core = priv; - - INIT_LIST_HEAD(&priv->async_handler_list); - spin_lock_init(&priv->async_handler_lock); - - ret = ib_cache_setup(device); - if (ret) { - printk(KERN_WARNING "Couldn't create device info cache for %s\n", - device->name); - goto out_free_port; - } - - ret = tsKernelQueueThreadStart("ts_ib_async", - ib_async_thread, - device, - &priv->async_thread); - if (ret) { - printk(KERN_WARNING "Couldn't start async thread for %s\n", - device->name); - goto out_free_cache; - } - - ret = ib_proc_setup(device, device->node_type == IB_NODE_SWITCH); - if (ret) { - printk(KERN_WARNING "Couldn't create /proc dir for %s\n", - device->name); - goto out_stop_async; - } - - if (ib_device_register_sysfs(device)) { - printk(KERN_WARNING "Couldn't register device %s with driver model\n", - device->name); - goto out_proc; - } - - list_add_tail(&device->core_list, &device_list); - { - struct list_head *ptr; - struct ib_device_notifier *notifier; - - list_for_each(ptr, &notifier_list) { - notifier = list_entry(ptr, struct ib_device_notifier, list); - notifier->notifier(notifier, device, IB_DEVICE_NOTIFIER_ADD); - } - } - - device->reg_state = IB_DEV_REGISTERED; - - up(&device_lock); - return 0; - - out_proc: - ib_proc_cleanup(device); - - out_stop_async: - tsKernelQueueThreadStop(priv->async_thread); - - out_free_cache: - ib_cache_cleanup(device); - - out_free_port: - kfree(priv->port_data); - - out_free: - kfree(priv); - - out: - up(&device_lock); - return ret; -} -EXPORT_SYMBOL(ib_register_device); - -int ib_deregister_device(struct ib_device *device) -{ - struct ib_device_private *priv; - - priv = device->core; - - if (tsKernelQueueThreadStop(priv->async_thread)) { - printk(KERN_WARNING "tsKernelThreadStop failed for %s async 
thread\n", - device->name); - } - - ib_proc_cleanup(device); - ib_cache_cleanup(device); - - down(&device_lock); - list_del(&device->core_list); - { - struct list_head *ptr; - struct ib_device_notifier *notifier; - - list_for_each_prev(ptr, &notifier_list) { - notifier = list_entry(ptr, struct ib_device_notifier, list); - notifier->notifier(notifier, device, IB_DEVICE_NOTIFIER_REMOVE); - } - } - up(&device_lock); - - kfree(priv->port_data); - kfree(priv); - - device->reg_state = IB_DEV_UNREGISTERED; - - return 0; -} -EXPORT_SYMBOL(ib_deregister_device); - -struct ib_device *ib_device_get_by_name(const char *name) -{ - struct ib_device *device; - - down(&device_lock); - device = __ib_device_get_by_name(name); - up(&device_lock); - - return device; -} -EXPORT_SYMBOL(ib_device_get_by_name); - -struct ib_device *ib_device_get_by_index(int index) -{ - struct list_head *ptr; - struct ib_device *device; - - if (index < 0) - return NULL; - - down(&device_lock); - list_for_each(ptr, &device_list) { - device = list_entry(ptr, struct ib_device, core_list); - if (!index) - goto out; - --index; - } - - device = NULL; - out: - up(&device_lock); - return device; -} -EXPORT_SYMBOL(ib_device_get_by_index); - -int ib_device_notifier_register(struct ib_device_notifier *notifier) -{ - struct list_head *ptr; - struct ib_device *device; - - down(&device_lock); - list_add_tail(&notifier->list, &notifier_list); - list_for_each(ptr, &device_list) { - device = list_entry(ptr, struct ib_device, core_list); - notifier->notifier(notifier, device, IB_DEVICE_NOTIFIER_ADD); - } - up(&device_lock); - - return 0; -} -EXPORT_SYMBOL(ib_device_notifier_register); - -int ib_device_notifier_deregister(struct ib_device_notifier *notifier) -{ - down(&device_lock); - list_del(&notifier->list); - up(&device_lock); - - return 0; -} -EXPORT_SYMBOL(ib_device_notifier_deregister); - -int ib_query_device(struct ib_device *device, - struct ib_device_attr *device_attr) -{ - return device->query_device(device, device_attr); -} 
-EXPORT_SYMBOL(ib_query_device); - -int ib_query_port(struct ib_device *device, - u8 port_num, - struct ib_port_attr *port_attr) -{ - return device->query_port(device, port_num, port_attr); -} -EXPORT_SYMBOL(ib_query_port); - -int ib_query_gid(struct ib_device *device, - u8 port_num, int index, union ib_gid *gid) -{ - return device->query_gid(device, port_num, index, gid); -} -EXPORT_SYMBOL(ib_query_gid); - -int ib_query_pkey(struct ib_device *device, - u8 port_num, u16 index, u16 *pkey) -{ - return device->query_pkey(device, port_num, index, pkey); -} -EXPORT_SYMBOL(ib_query_pkey); - -int ib_modify_device(struct ib_device *device, - int device_modify_mask, - struct ib_device_modify *device_modify) -{ - return device->modify_device(device, device_modify_mask, - device_modify); -} -EXPORT_SYMBOL(ib_modify_device); - -int ib_modify_port(struct ib_device *device, - u8 port_num, int port_modify_mask, - struct ib_port_modify *port_modify) -{ - return device->modify_port(device, port_num, port_modify_mask, - port_modify); -} -EXPORT_SYMBOL(ib_modify_port); - -/* - Local Variables: - c-file-style: "linux" - indent-tabs-mode: t - End: -*/ Index: infiniband/core/mad_static.c =================================================================== --- infiniband/core/mad_static.c (revision 759) +++ infiniband/core/mad_static.c (working copy) @@ -22,7 +22,6 @@ */ #include "mad_priv.h" -#include "ts_ib_provider.h" #include "smp_access.h" #include "ts_kernel_trace.h" @@ -167,12 +166,12 @@ { /* Generate an artificial port error event so that cached info is updated for this port */ - struct ib_async_event_record record; + struct ib_event record; - record.device = device; - record.event = IB_PORT_ERROR; - record.modifier.port = port; - ib_async_event_dispatch(&record); + record.device = device; + record.event = IB_EVENT_PORT_ERR; + record.element.port_num = port; + ib_dispatch_event(&record); } } Index: infiniband/core/ib_verbs.c 
=================================================================== --- infiniband/core/ib_verbs.c (revision 759) +++ infiniband/core/ib_verbs.c (working copy) @@ -113,12 +113,13 @@ qp = pd->device->create_qp(pd, qp_init_attr, qp_cap); if (!IS_ERR(qp)) { - qp->device = pd->device; - qp->pd = pd; - qp->send_cq = qp_init_attr->send_cq; - qp->recv_cq = qp_init_attr->recv_cq; - qp->srq = qp_init_attr->srq; - qp->qp_context = qp_init_attr->qp_context; + qp->device = pd->device; + qp->pd = pd; + qp->send_cq = qp_init_attr->send_cq; + qp->recv_cq = qp_init_attr->recv_cq; + qp->srq = qp_init_attr->srq; + qp->event_handler = qp_init_attr->event_handler; + qp->qp_context = qp_init_attr->qp_context; atomic_inc(&pd->usecnt); atomic_inc(&qp_init_attr->send_cq->usecnt); atomic_inc(&qp_init_attr->recv_cq->usecnt); @@ -179,6 +180,7 @@ struct ib_cq *ib_create_cq(struct ib_device *device, ib_comp_handler comp_handler, + void (*event_handler)(struct ib_event *, void *), void *cq_context, int cqe) { struct ib_cq *cq; @@ -186,9 +188,10 @@ cq = device->create_cq(device, cqe); if (!IS_ERR(cq)) { - cq->device = device; - cq->comp_handler = comp_handler; - cq->context = cq_context; + cq->device = device; + cq->comp_handler = comp_handler; + cq->event_handler = event_handler; + cq->context = cq_context; atomic_set(&cq->usecnt, 0); } Index: infiniband/hw/mthca/mthca_dev.h =================================================================== --- infiniband/hw/mthca/mthca_dev.h (revision 759) +++ infiniband/hw/mthca/mthca_dev.h (working copy) @@ -283,7 +283,7 @@ void mthca_cleanup_mcg_table(struct mthca_dev *dev); int mthca_register_device(struct mthca_dev *dev); -void mthca_deregister_device(struct mthca_dev *dev); +void mthca_unregister_device(struct mthca_dev *dev); int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd); void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd); @@ -308,7 +308,7 @@ void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn); void 
mthca_qp_event(struct mthca_dev *dev, u32 qpn, - enum ib_async_event event); + enum ib_event_type event_type); int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, struct ib_qp_cap *qp_cap); int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, Index: infiniband/hw/mthca/mthca_main.c =================================================================== --- infiniband/hw/mthca/mthca_main.c (revision 759) +++ infiniband/hw/mthca/mthca_main.c (working copy) @@ -638,7 +638,7 @@ int p; if (mdev) { - mthca_deregister_device(mdev); + mthca_unregister_device(mdev); for (p = 1; p <= mdev->limits.num_ports; ++p) mthca_CLOSE_IB(mdev, p, &status); Index: infiniband/hw/mthca/mthca_provider.c =================================================================== --- infiniband/hw/mthca/mthca_provider.c (revision 759) +++ infiniband/hw/mthca/mthca_provider.c (working copy) @@ -560,7 +560,6 @@ dev->ib_dev.owner = THIS_MODULE; dev->ib_dev.dma_device = dev->pdev; dev->ib_dev.class_dev.dev = &dev->pdev->dev; - dev->ib_dev.provider = "mthca"; dev->ib_dev.query_device = mthca_query_device; dev->ib_dev.query_port = mthca_query_port; dev->ib_dev.modify_port = mthca_modify_port; @@ -593,7 +592,7 @@ ret = class_device_create_file(&dev->ib_dev.class_dev, mthca_class_attributes[i]); if (ret) { - ib_deregister_device(&dev->ib_dev); + ib_unregister_device(&dev->ib_dev); return ret; } } @@ -601,9 +600,9 @@ return 0; } -void mthca_deregister_device(struct mthca_dev *dev) +void mthca_unregister_device(struct mthca_dev *dev) { - ib_deregister_device(&dev->ib_dev); + ib_unregister_device(&dev->ib_dev); } /* Index: infiniband/hw/mthca/mthca_provider.h =================================================================== --- infiniband/hw/mthca/mthca_provider.h (revision 759) +++ infiniband/hw/mthca/mthca_provider.h (working copy) @@ -24,7 +24,6 @@ #ifndef MTHCA_PROVIDER_H #define MTHCA_PROVIDER_H -#include #include #define MTHCA_MPT_FLAG_ATOMIC (1 << 14) Index: 
infiniband/hw/mthca/mthca_mad.c =================================================================== --- infiniband/hw/mthca/mthca_mad.c (revision 759) +++ infiniband/hw/mthca/mthca_mad.c (working copy) @@ -46,24 +46,24 @@ static void smp_snoop(struct ib_device *ibdev, struct ib_mad *mad) { - struct ib_async_event_record record; + struct ib_event event; if (mad->dqpn == 0 && (mad->mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || mad->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && mad->r_method == IB_MGMT_METHOD_SET) { if (mad->attribute_id == cpu_to_be16(IB_SM_PORT_INFO)) { - record.device = ibdev; - record.event = IB_LID_CHANGE; - record.modifier.port = mad->port; - ib_async_event_dispatch(&record); + event.device = ibdev; + event.event = IB_EVENT_LID_CHANGE; + event.element.port_num = mad->port; + ib_dispatch_event(&event); } if (mad->attribute_id == cpu_to_be16(IB_SM_PKEY_TABLE)) { - record.device = ibdev; - record.event = IB_PKEY_CHANGE; - record.modifier.port = mad->port; - ib_async_event_dispatch(&record); + event.device = ibdev; + event.event = IB_EVENT_PKEY_CHANGE; + event.element.port_num = mad->port; + ib_dispatch_event(&event); } } } Index: infiniband/hw/mthca/mthca_eq.c =================================================================== --- infiniband/hw/mthca/mthca_eq.c (revision 759) +++ infiniband/hw/mthca/mthca_eq.c (working copy) @@ -200,16 +200,16 @@ static void port_change(struct mthca_dev *dev, int port, int active) { - struct ib_async_event_record record; + struct ib_event record; mthca_dbg(dev, "Port change to %s for port %d\n", active ? "active" : "down", port); record.device = &dev->ib_dev; - record.event = active ? IB_EVENT_PORT_ACTIVE : IB_PORT_ERROR; - record.modifier.port = port; + record.event = active ? 
IB_EVENT_PORT_ACTIVE : IB_EVENT_PORT_ERR; + record.element.port_num = port; - ib_async_event_dispatch(&record); + ib_dispatch_event(&record); } static void mthca_eq_int(struct mthca_dev *dev, struct mthca_eq *eq) @@ -234,37 +234,37 @@ case MTHCA_EVENT_TYPE_PATH_MIG: mthca_qp_event(dev, be32_to_cpu(eqe->qp.qpn) & 0xffffff, - IB_QP_PATH_MIGRATED); + IB_EVENT_PATH_MIG); break; case MTHCA_EVENT_TYPE_COMM_EST: mthca_qp_event(dev, be32_to_cpu(eqe->qp.qpn) & 0xffffff, - IB_QP_COMMUNICATION_ESTABLISHED); + IB_EVENT_COMM_EST); break; case MTHCA_EVENT_TYPE_SQ_DRAINED: mthca_qp_event(dev, be32_to_cpu(eqe->qp.qpn) & 0xffffff, - IB_SEND_QUEUE_DRAINED); + IB_EVENT_SQ_DRAINED); break; case MTHCA_EVENT_TYPE_WQ_CATAS_ERROR: mthca_qp_event(dev, be32_to_cpu(eqe->qp.qpn) & 0xffffff, - IB_LOCAL_WQ_CATASTROPHIC_ERROR); + IB_EVENT_QP_FATAL); break; case MTHCA_EVENT_TYPE_PATH_MIG_FAILED: mthca_qp_event(dev, be32_to_cpu(eqe->qp.qpn) & 0xffffff, - IB_PATH_MIGRATION_ERROR); + IB_EVENT_PATH_MIG_ERR); break; case MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR: mthca_qp_event(dev, be32_to_cpu(eqe->qp.qpn) & 0xffffff, - IB_LOCAL_WQ_INVALID_REQUEST_ERROR); + IB_EVENT_QP_REQ_ERR); break; case MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR: mthca_qp_event(dev, be32_to_cpu(eqe->qp.qpn) & 0xffffff, - IB_LOCAL_WQ_ACCESS_VIOLATION_ERROR); + IB_EVENT_QP_ACCESS_ERR); break; case MTHCA_EVENT_TYPE_CMD: Index: infiniband/hw/mthca/mthca_qp.c =================================================================== --- infiniband/hw/mthca/mthca_qp.c (revision 759) +++ infiniband/hw/mthca/mthca_qp.c (working copy) @@ -261,10 +261,10 @@ } void mthca_qp_event(struct mthca_dev *dev, u32 qpn, - enum ib_async_event event) + enum ib_event_type event_type) { struct mthca_qp *qp; - struct ib_async_event_record event_record; + struct ib_event event; spin_lock(&dev->qp_table.lock); qp = mthca_array_get(&dev->qp_table.qp, qpn & (dev->limits.num_qps - 1)); @@ -277,10 +277,11 @@ return; } - event_record.device = &dev->ib_dev; - event_record.event = 
event; - event_record.modifier.qp = (struct ib_qp *) qp; - ib_async_event_dispatch(&event_record); + event.device = &dev->ib_dev; + event.event = event_type; + event.element.qp = &qp->ibqp; + if (qp->ibqp.event_handler) + qp->ibqp.event_handler(&event, qp->ibqp.qp_context); if (atomic_dec_and_test(&qp->refcount)) wake_up(&qp->wait); From roland at topspin.com Fri Sep 10 13:37:41 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Sep 2004 13:37:41 -0700 Subject: [openib-general] [PATCH] simplify sysfs slightly Message-ID: <52acvxdfmy.fsf@topspin.com> Based on Greg's suggestion, I'm using the new __ATTR and __ATTR_RO macros to shrink the code slightly: Index: infiniband/core/ib_sysfs.c =================================================================== --- infiniband/core/ib_sysfs.c (revision 759) +++ infiniband/core/ib_sysfs.c (working copy) @@ -37,13 +37,12 @@ ssize_t (*store)(struct ib_port *, struct port_attribute *, const char *buf, size_t count); }; -#define PORT_ATTR(_name, _mode, _show, _store) \ -struct port_attribute port_attr_##_name = { \ - .attr = { .name = __stringify(_name), .mode = _mode, .owner = THIS_MODULE }, \ - .show = _show, \ - .store = _store \ -} +#define PORT_ATTR(_name, _mode, _show, _store) \ +struct port_attribute port_attr_##_name = __ATTR(_name, _mode, _show, _store) +#define PORT_ATTR_RO(_name) \ +struct port_attribute port_attr_##_name = __ATTR_RO(_name) + struct port_table_attribute { struct port_attribute attr; int index; @@ -66,8 +65,8 @@ .show = port_attr_show }; -static ssize_t show_port_state(struct ib_port *p, struct port_attribute *unused, - char *buf) +static ssize_t state_show(struct ib_port *p, struct port_attribute *unused, + char *buf) { struct ib_port_attr attr; ssize_t ret; @@ -90,8 +89,8 @@ state_name[attr.state] : "UNKNOWN"); } -static ssize_t show_port_lid(struct ib_port *p, struct port_attribute *unused, - char *buf) +static ssize_t lid_show(struct ib_port *p, struct port_attribute *unused, + char *buf) 
{ struct ib_port_attr attr; ssize_t ret; @@ -103,8 +102,9 @@ return sprintf(buf, "0x%x\n", attr.lid); } -static ssize_t show_port_lmc(struct ib_port *p, struct port_attribute *unused, - char *buf) +static ssize_t lid_mask_count_show(struct ib_port *p, + struct port_attribute *unused, + char *buf) { struct ib_port_attr attr; ssize_t ret; @@ -116,8 +116,8 @@ return sprintf(buf, "%d\n", attr.lmc); } -static ssize_t show_port_sm_lid(struct ib_port *p, struct port_attribute *unused, - char *buf) +static ssize_t sm_lid_show(struct ib_port *p, struct port_attribute *unused, + char *buf) { struct ib_port_attr attr; ssize_t ret; @@ -129,8 +129,8 @@ return sprintf(buf, "0x%x\n", attr.sm_lid); } -static ssize_t show_port_sm_sl(struct ib_port *p, struct port_attribute *unused, - char *buf) +static ssize_t sm_sl_show(struct ib_port *p, struct port_attribute *unused, + char *buf) { struct ib_port_attr attr; ssize_t ret; @@ -142,7 +142,7 @@ return sprintf(buf, "%d\n", attr.sm_sl); } -static ssize_t show_port_cap(struct ib_port *p, struct port_attribute *unused, +static ssize_t cap_mask_show(struct ib_port *p, struct port_attribute *unused, char *buf) { struct ib_port_attr attr; @@ -155,12 +155,12 @@ return sprintf(buf, "0x%08x\n", attr.port_cap_flags); } -static PORT_ATTR(state, S_IRUGO, show_port_state, NULL); -static PORT_ATTR(lid, S_IRUGO, show_port_lid, NULL); -static PORT_ATTR(lid_mask_count, S_IRUGO, show_port_lmc, NULL); -static PORT_ATTR(sm_lid, S_IRUGO, show_port_sm_lid, NULL); -static PORT_ATTR(sm_sl, S_IRUGO, show_port_sm_sl, NULL); -static PORT_ATTR(cap_mask, S_IRUGO, show_port_cap, NULL); +static PORT_ATTR_RO(state); +static PORT_ATTR_RO(lid); +static PORT_ATTR_RO(lid_mask_count); +static PORT_ATTR_RO(sm_lid); +static PORT_ATTR_RO(sm_sl); +static PORT_ATTR_RO(cap_mask); static struct attribute *port_default_attrs[] = { &port_attr_state.attr, @@ -517,7 +517,7 @@ return ret; } -void ib_device_deregister_sysfs(struct ib_device *device) +void 
ib_device_unregister_sysfs(struct ib_device *device) { struct kobject *p, *t; struct ib_port *port; From halr at voltaire.com Fri Sep 10 13:42:00 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 10 Sep 2004 16:42:00 -0400 Subject: [openib-general] [PATCH] [TRIVIAL] ib_mad: change device to port routine names where appropriate (part 2) Message-ID: <1094848919.1746.380.camel@localhost.localdomain> ib_mad: change device to port routine names where appropriate (part 2 of this) Index: ib_mad.c =================================================================== --- ib_mad.c (revision 767) +++ ib_mad.c (working copy) @@ -66,7 +66,7 @@ kmem_cache_t *ib_mad_cache; -static struct list_head ib_mad_device_list; +static struct list_head ib_mad_port_list; static struct list_head ib_mad_agent_list; static u32 ib_mad_client_id = 0; @@ -75,10 +75,10 @@ */ /* Device list lock */ -static spinlock_t ib_mad_device_list_lock = SPIN_LOCK_UNLOCKED; -#define IB_MAD_DEVICE_LIST_LOCK_VAR unsigned long ib_mad_device_list_sflags -#define IB_MAD_DEVICE_LIST_LOCK() spin_lock_irqsave(&ib_mad_device_list_lock, ib_mad_device_list_sflags) -#define IB_MAD_DEVICE_LIST_UNLOCK() spin_unlock_irqrestore(&ib_mad_device_list_lock, ib_mad_device_list_sflags) +static spinlock_t ib_mad_port_list_lock = SPIN_LOCK_UNLOCKED; +#define IB_MAD_PORT_LIST_LOCK_VAR unsigned long ib_mad_port_list_sflags +#define IB_MAD_PORT_LIST_LOCK() spin_lock_irqsave(&ib_mad_port_list_lock, ib_mad_port_list_sflags) +#define IB_MAD_PORT_LIST_UNLOCK() spin_unlock_irqrestore(&ib_mad_port_list_lock, ib_mad_port_list_sflags) /* Agent list lock */ static spinlock_t ib_mad_agent_list_lock = SPIN_LOCK_UNLOCKED; @@ -103,8 +103,8 @@ static int add_mad_reg_req(struct ib_mad_reg_req *mad_reg_req, struct ib_mad_agent_private *priv); static void remove_mad_reg_req(struct ib_mad_agent_private *priv); -static int ib_mad_device_restart(struct ib_mad_device_private *priv); -static int ib_mad_post_receive_mads(struct ib_mad_device_private 
*priv); +static int ib_mad_port_restart(struct ib_mad_port_private *priv); +static int ib_mad_post_receive_mads(struct ib_mad_port_private *priv); /* @@ -134,8 +134,8 @@ ib_mad_recv_handler recv_handler, void *context) { - struct ib_mad_device_private *entry, *priv = NULL, - *head = (struct ib_mad_device_private *) &ib_mad_device_list; + struct ib_mad_port_private *entry, *priv = NULL, + *head = (struct ib_mad_port_private *) &ib_mad_port_list; struct ib_mad_agent_private *entry2, *head2 = (struct ib_mad_agent_private *)&ib_mad_agent_list; struct ib_mad_agent *mad_agent, *ret; @@ -144,7 +144,7 @@ struct ib_mad_mgmt_class_table *class; struct ib_mad_mgmt_method_table *method; int ret2; - IB_MAD_DEVICE_LIST_LOCK_VAR; + IB_MAD_PORT_LIST_LOCK_VAR; IB_MAD_AGENT_LIST_LOCK_VAR; u8 mgmt_class; @@ -184,14 +184,14 @@ } /* Validate device and port */ - IB_MAD_DEVICE_LIST_LOCK(); + IB_MAD_PORT_LIST_LOCK(); list_for_each(entry, head) { if (entry->device == device && entry->port == port) { priv = entry; break; } } - IB_MAD_DEVICE_LIST_UNLOCK(); + IB_MAD_PORT_LIST_UNLOCK(); if (!priv) { ret = ERR_PTR(-ENODEV); goto error1; @@ -358,16 +358,16 @@ wr.send_flags = IB_SEND_SIGNALED; /* cur_send_wr->send_flags ? 
*/ /* Link send WR into posted send MAD list */ - IB_MAD_SEND_LIST_LOCK(((struct ib_mad_device_private *)mad_agent->device->mad)); + IB_MAD_SEND_LIST_LOCK(((struct ib_mad_port_private *)mad_agent->device->mad)); list_add_tail((struct list_head *)mad_send_wr, - &((struct ib_mad_device_private *)mad_agent->device->mad)->send_posted_mad_list); - IB_MAD_SEND_LIST_UNLOCK(((struct ib_mad_device_private *)mad_agent->device->mad)); + &((struct ib_mad_port_private *)mad_agent->device->mad)->send_posted_mad_list); + IB_MAD_SEND_LIST_UNLOCK(((struct ib_mad_port_private *)mad_agent->device->mad)); if (ib_post_send(mad_agent->qp, &wr, &bad_wr)) { /* Unlink from posted send MAD list */ - IB_MAD_SEND_LIST_LOCK(((struct ib_mad_device_private *)mad_agent->device->mad)); + IB_MAD_SEND_LIST_LOCK(((struct ib_mad_port_private *)mad_agent->device->mad)); list_del((struct list_head *)send_wr); - IB_MAD_SEND_LIST_UNLOCK(((struct ib_mad_device_private *)mad_agent->device->mad)); + IB_MAD_SEND_LIST_UNLOCK(((struct ib_mad_port_private *)mad_agent->device->mad)); *bad_send_wr = cur_send_wr; printk(KERN_ERR "ib_mad_post_send failed\n"); return -EINVAL; @@ -466,7 +466,7 @@ static int add_mad_reg_req(struct ib_mad_reg_req *mad_reg_req, struct ib_mad_agent_private *priv) { - struct ib_mad_device_private *private; + struct ib_mad_port_private *private; struct ib_mad_mgmt_class_table **class; struct ib_mad_mgmt_method_table **method; @@ -540,7 +540,7 @@ static void remove_mad_reg_req(struct ib_mad_agent_private *priv) { - struct ib_mad_device_private *private; + struct ib_mad_port_private *private; struct ib_mad_mgmt_class_table *class; struct ib_mad_mgmt_method_table *method; u8 mgmt_class; @@ -576,7 +576,7 @@ } } -static void ib_mad_recv_done_handler(struct ib_mad_device_private *priv, +static void ib_mad_recv_done_handler(struct ib_mad_port_private *priv, struct ib_wc *wc) { struct ib_mad_recv_wc recv_wc; @@ -626,7 +626,7 @@ /* Receive reposting ? !!! 
*/ } -static void ib_mad_send_done_handler(struct ib_mad_device_private *priv, +static void ib_mad_send_done_handler(struct ib_mad_port_private *priv, struct ib_wc *wc) { struct ib_mad_send_wr_private *entry, *send_wr = NULL, @@ -660,7 +660,7 @@ /* * IB MAD completion callback */ -static void ib_mad_completion_handler(struct ib_mad_device_private *priv) +static void ib_mad_completion_handler(struct ib_mad_port_private *priv) { /* @@ -707,7 +707,7 @@ } if (err_status) { - ib_mad_device_restart(priv); + ib_mad_port_restart(priv); } else { ib_mad_post_receive_mads(priv); ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP); @@ -719,7 +719,7 @@ */ static int ib_mad_thread(void *param) { - struct ib_mad_device_private *priv = param; + struct ib_mad_port_private *priv = param; struct ib_mad_thread_data *thread_data = &priv->thread_data; lock_kernel(); @@ -745,7 +745,7 @@ /* * Initialize the IB MAD thread */ -static void ib_mad_thread_init(struct ib_mad_device_private *priv) +static void ib_mad_thread_init(struct ib_mad_port_private *priv) { struct ib_mad_thread_data *thread_data = &priv->thread_data; @@ -756,7 +756,7 @@ /* * Wake up the IB MAD thread */ -static void ib_mad_thread_signal(struct ib_mad_device_private *priv) +static void ib_mad_thread_signal(struct ib_mad_port_private *priv) { struct ib_mad_thread_data *thread_data = &priv->thread_data; @@ -766,7 +766,7 @@ /* * Stop the IB MAD thread */ -static void ib_mad_thread_stop(struct ib_mad_device_private *priv) +static void ib_mad_thread_stop(struct ib_mad_port_private *priv) { struct ib_mad_thread_data *thread_data = &priv->thread_data; @@ -780,7 +780,7 @@ ib_mad_thread_signal(cq->cq_context); } -static int ib_mad_post_receive_mad(struct ib_mad_device_private *priv, +static int ib_mad_post_receive_mad(struct ib_mad_port_private *priv, enum ib_qp_type qp_type) { struct ib_mad_private *mad_priv; @@ -842,7 +842,7 @@ /* * Get receive MADs and post receive WRs for them */ -static int ib_mad_post_receive_mads(struct 
ib_mad_device_private *priv) +static int ib_mad_post_receive_mads(struct ib_mad_port_private *priv) { int i; @@ -864,7 +864,7 @@ /* * Return all the posted receive MADs */ -static void ib_mad_return_posted_recv_mads(struct ib_mad_device_private *priv) +static void ib_mad_return_posted_recv_mads(struct ib_mad_port_private *priv) { IB_MAD_RECV_LIST_LOCK_VAR; @@ -881,7 +881,7 @@ /* * Return all the posted send MADs */ -static void ib_mad_return_posted_send_mads(struct ib_mad_device_private *priv) +static void ib_mad_return_posted_send_mads(struct ib_mad_port_private *priv) { IB_MAD_SEND_LIST_LOCK_VAR; @@ -1012,22 +1012,22 @@ return ret; } -#define IB_MAD_DEVICE_SET_UP(__device__) {\ - IB_MAD_DEVICE_LIST_LOCK_VAR;\ - IB_MAD_DEVICE_LIST_LOCK();\ - (__device__)->up = 1;\ - IB_MAD_DEVICE_LIST_UNLOCK();} +#define IB_MAD_PORT_SET_UP(__port__) {\ + IB_MAD_PORT_LIST_LOCK_VAR;\ + IB_MAD_PORT_LIST_LOCK();\ + (__port__)->up = 1;\ + IB_MAD_PORT_LIST_UNLOCK();} -#define IB_MAD_DEVICE_SET_DOWN(__device__) {\ - IB_MAD_DEVICE_LIST_LOCK_VAR;\ - IB_MAD_DEVICE_LIST_LOCK();\ - (__device__)->up = 0;\ - IB_MAD_DEVICE_LIST_UNLOCK();} +#define IB_MAD_PORT_SET_DOWN(__port__) {\ + IB_MAD_PORT_LIST_LOCK_VAR;\ + IB_MAD_PORT_LIST_LOCK();\ + (__port__)->up = 0;\ + IB_MAD_PORT_LIST_UNLOCK();} /* - * Start the device + * Start the port */ -static int ib_mad_device_start(struct ib_mad_device_private *priv) +static int ib_mad_port_start(struct ib_mad_port_private *priv) { int ret, i; @@ -1065,7 +1065,7 @@ } } - IB_MAD_DEVICE_SET_UP(priv); + IB_MAD_PORT_SET_UP(priv); return 0; error: @@ -1078,13 +1078,13 @@ } /* - * Stop the device + * Stop the port */ -static void ib_mad_device_stop(struct ib_mad_device_private *priv) +static void ib_mad_port_stop(struct ib_mad_port_private *priv) { int i; - IB_MAD_DEVICE_SET_DOWN(priv); + IB_MAD_PORT_SET_DOWN(priv); for (i = 0; i < 2; i++) { ib_mad_change_qp_state_to_reset(priv->qp[i]); @@ -1095,16 +1095,16 @@ } /* - * Restart the device + * Restart the port */ 
-static int ib_mad_device_restart(struct ib_mad_device_private *priv) +static int ib_mad_port_restart(struct ib_mad_port_private *priv) { int ret; - ib_mad_device_stop(priv); - ret = ib_mad_device_start(priv); + ib_mad_port_stop(priv); + ret = ib_mad_port_start(priv); if (ret) { - printk(KERN_ERR "Could not start device %s/%d\n", + printk(KERN_ERR "Could not restart port%s/%d\n", priv->device->name, priv->port); } @@ -1126,19 +1126,19 @@ struct ib_device_attr device_attr; struct ib_qp_init_attr qp_init_attr; struct ib_qp_cap qp_cap; - struct ib_mad_device_private *entry, *priv = NULL, - *head = (struct ib_mad_device_private *) &ib_mad_device_list; - IB_MAD_DEVICE_LIST_LOCK_VAR; + struct ib_mad_port_private *entry, *priv = NULL, + *head = (struct ib_mad_port_private *) &ib_mad_port_list; + IB_MAD_PORT_LIST_LOCK_VAR; /* First, check if port already open at MAD layer */ - IB_MAD_DEVICE_LIST_LOCK(); + IB_MAD_PORT_LIST_LOCK(); list_for_each(entry, head) { if (entry->device == device && entry->port == port) { priv = entry; break; } } - IB_MAD_DEVICE_LIST_UNLOCK(); + IB_MAD_PORT_LIST_UNLOCK(); if (priv) { printk(KERN_DEBUG "Port already open\n"); return 0; @@ -1147,7 +1147,7 @@ /* Create new device info */ priv = kmalloc(sizeof *priv, GFP_KERNEL); if (!priv) { - printk(KERN_ERR "No memory for ib_mad_device_private\n"); + printk(KERN_ERR "No memory for ib_mad_port_private\n"); return -ENOMEM; } @@ -1223,15 +1223,15 @@ INIT_LIST_HEAD(&priv->send_posted_mad_list); ib_mad_thread_init(priv); - ret = ib_mad_device_start(priv); + ret = ib_mad_port_start(priv); if (ret) { - printk(KERN_ERR "Could not start device\n"); + printk(KERN_ERR "Could not start port\n"); goto error8; } - IB_MAD_DEVICE_LIST_LOCK(); - list_add_tail((struct list_head *)priv, &ib_mad_device_list); - IB_MAD_DEVICE_LIST_UNLOCK(); + IB_MAD_PORT_LIST_LOCK(); + list_add_tail((struct list_head *)priv, &ib_mad_port_list); + IB_MAD_PORT_LIST_UNLOCK(); return 0; @@ -1258,11 +1258,11 @@ */ static int 
ib_mad_port_close(struct ib_device *device, int port) { - struct ib_mad_device_private *entry, *priv = NULL, - *head = (struct ib_mad_device_private *)&ib_mad_device_list; - IB_MAD_DEVICE_LIST_LOCK_VAR; + struct ib_mad_port_private *entry, *priv = NULL, + *head = (struct ib_mad_port_private *)&ib_mad_port_list; + IB_MAD_PORT_LIST_LOCK_VAR; - IB_MAD_DEVICE_LIST_LOCK(); + IB_MAD_PORT_LIST_LOCK(); list_for_each(entry, head) { if (entry->device == device && entry->port == port) { priv = entry; @@ -1272,14 +1272,14 @@ if (priv == NULL) { printk(KERN_ERR "Port not found\n"); - IB_MAD_DEVICE_LIST_UNLOCK(); + IB_MAD_PORT_LIST_UNLOCK(); return -ENODEV; } list_del((struct list_head *)priv); - IB_MAD_DEVICE_LIST_UNLOCK(); + IB_MAD_PORT_LIST_UNLOCK(); - ib_mad_device_stop(priv); + ib_mad_port_stop(priv); ib_mad_thread_stop(priv); ib_destroy_qp(priv->qp[1]); ib_destroy_qp(priv->qp[0]); @@ -1312,7 +1312,7 @@ for (i = 0; i < num_ports; i++) { ret = ib_mad_port_open(device, i); if (ret) { - printk(KERN_ERR "Could not open device port %d\n", i); + printk(KERN_ERR "Could not open port %d\n", i); goto error_device_open; } } @@ -1323,7 +1323,7 @@ while (i > 0) { ret2 = ib_mad_port_close(device, i); if (ret2) { - printk(KERN_ERR "Could not close device port %d\n", i); + printk(KERN_ERR "Could not close port %d\n", i); } i--; } @@ -1348,7 +1348,7 @@ for (i = 0; i < num_ports; i++) { ret2 = ib_mad_port_close(device, i); if (ret2) { - printk(KERN_ERR "Could not close device port %d\n", i); + printk(KERN_ERR "Could not close port %d\n", i); if (!ret) ret = ret2; } @@ -1393,7 +1393,7 @@ return -ENOMEM; } - INIT_LIST_HEAD(&ib_mad_device_list); + INIT_LIST_HEAD(&ib_mad_port_list); INIT_LIST_HEAD(&ib_mad_agent_list); ib_device_notifier_register(&mad_notifier); Index: ib_mad_priv.h =================================================================== --- ib_mad_priv.h (revision 762) +++ ib_mad_priv.h (working copy) @@ -119,8 +119,8 @@ int run; }; -struct ib_mad_device_private { - struct 
ib_mad_device_private *next; +struct ib_mad_port_private { + struct ib_mad_port_private *next; struct ib_device *device; int port; int up; From mlleinin at hpcn.ca.sandia.gov Fri Sep 10 13:42:15 2004 From: mlleinin at hpcn.ca.sandia.gov (Matt L. Leininger) Date: Fri, 10 Sep 2004 13:42:15 -0700 Subject: [openib-general] OpenIB SWG Face to Face 9/9 Meeting In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F000205E90C@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F000205E90C@orsmsx408> Message-ID: <1094848934.4321.484.camel@trinity> On Fri, 2004-09-10 at 12:37, Woodruff, Robert J wrote: > Grant wrote, > I've already have high level mgt agreement to fund this. > >I have temporary funding issues right now but it's not the primary > problem. Tim Witham > >(OSDL) at the IB BOF (OLS) was clear he couldn't host OpenIB.org work > until the "only > >promoters" SVN policy is changed. Ie folks who've contributed money to > OpenIB get write > >access to SVN. Until Tim says he's happy, I don't see any point in > ordering HW. > > Jim, is there anything in the openib.org promoters agreement or bylaws > that prevents openib.org maintainers from giving others (non-promoters) > write access to the SVN tree ? If not, then I think that as long as all > the > people on the list want to allow someone to have write access, I see > know > reason why we could not allow it ? > > What do other people think ? > I'd take a different approach. If the promoters agreement forbids the OpenIB SW working group/maintainers from granting write access to non-promoters, then the agreement is broken and needs to be fixed in the bylaws. In the meantime we should start using a policy to grant code repository write access that makes the most sense for an open source development project. If a non-promoters needs write access, and the SW WG has no problems with the individual requesting write access, then write access should be granted. 
If I don't hear any strong objections (with good reasons) by Monday, the
SW WG can start taking requests for write access to the OpenIB code
repository.

  - Matt

From roland at topspin.com Fri Sep 10 13:49:05 2004
From: roland at topspin.com (Roland Dreier)
Date: Fri, 10 Sep 2004 13:49:05 -0700
Subject: [openib-general] semantics of process_mad?
Message-ID: <521xh9df3y.fsf@topspin.com>

I'm now looking at implementing the process_mad method:

    int (*process_mad)(struct ib_device *device,
                       int process_mad_flags,
                       struct ib_mad *in_mad,
                       struct ib_mad *out_mad);

First of all, it seems that a port_num parameter needs to be added.
A QP number parameter might also be good, but maybe we can rely on the
access layer to ensure that the MAD's class and QP number match
appropriately.

Finally, what should the return value be? There are several
independent things that the process_mad function needs to tell the
access layer:

 - Did the operation succeed?
 - Was the MAD consumed or should it be dispatched to a registered consumer?
 - Was a reply generated or should no reply be sent?
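Since the three questions above are independent yes/no answers, one
conventional way to carry them in a single int is a set of bit flags
that the access layer decodes after the call. The following is only a
minimal sketch of that idea: every name in it (demo_mad_result,
demo_mad_action, demo_decode) is hypothetical and not taken from any
existing header, and the choice to read each answer from its own bit
without cross-checking is an assumption.

```c
#include <stdbool.h>

/*
 * Hypothetical flag values, for illustration only; the names are
 * assumptions and not part of any existing header.
 */
enum demo_mad_result {
	DEMO_MAD_FAILURE  = 0,       /* no bits set: processing failed */
	DEMO_MAD_SUCCESS  = 1 << 0,  /* the MAD was processed */
	DEMO_MAD_REPLY    = 1 << 1,  /* out_mad holds a reply to send */
	DEMO_MAD_CONSUMED = 1 << 2   /* do not dispatch to other consumers */
};

/* What the caller would do next, decoded from one return value. */
struct demo_mad_action {
	bool failed;      /* report an error */
	bool send_reply;  /* post out_mad as the response */
	bool dispatch;    /* hand in_mad to a registered consumer */
};

/* Each answer is read from its own bit; the bits are not cross-checked. */
static struct demo_mad_action demo_decode(int result)
{
	struct demo_mad_action act;

	act.failed     = !(result & DEMO_MAD_SUCCESS);
	act.send_reply = (result & DEMO_MAD_REPLY) != 0;
	act.dispatch   = !(result & DEMO_MAD_CONSUMED);
	return act;
}
```

With this encoding a driver that answers a request itself would return
DEMO_MAD_SUCCESS | DEMO_MAD_REPLY | DEMO_MAD_CONSUMED, while one that
only snoops the MAD would return DEMO_MAD_SUCCESS alone.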
In the Topspin drivers, we handled this by returning an enum
ib_mad_result as below:

    enum ib_mad_result {
        IB_MAD_RESULT_FAILURE  = 0,      // (!SUCCESS is the important flag)
        IB_MAD_RESULT_SUCCESS  = 1 << 0, // MAD was successfully processed
        IB_MAD_RESULT_REPLY    = 1 << 1, // Reply packet needs to be sent
        IB_MAD_RESULT_CONSUMED = 1 << 2  // Packet consumed: stop processing
    };

Thanks,
  Roland

From iod00d at hp.com Fri Sep 10 13:49:46 2004
From: iod00d at hp.com (Grant Grundler)
Date: Fri, 10 Sep 2004 13:49:46 -0700
Subject: [openib-general] OpenIB SWG Face to Face 9/9 Meeting
In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F000205E90C@orsmsx408>
References: <1AC79F16F5C5284499BB9591B33D6F000205E90C@orsmsx408>
Message-ID: <20040910204946.GB28896@cup.hp.com>

On Fri, Sep 10, 2004 at 12:37:47PM -0700, Woodruff, Robert J wrote:
> Jim, is there anything in the openib.org promoters agreement or bylaws
> that prevents openib.org maintainers from giving others (non-promoters)
> write access to the SVN tree?

My comments are based on what I've read here:
    http://openib.org/pipermail/openib-general/2004-June/002712.html

I'm sure Tim Witham would be pleased to hear/see something different.
(and kudos to whoever switched the mail archive to pipermail)

> If not, then I think that as long as all
> the people on the list want to allow someone to have write access, I see
> no reason why we could not allow it ?

Most open source project maintainers quietly build consensus based on
the quality/frequency of patches (and based on those metrics I don't
qualify for write access). In other words, only active developers *with*
write access have a say in who else gets write access.
hth, grant From halr at voltaire.com Fri Sep 10 13:55:39 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 10 Sep 2004 16:55:39 -0400 Subject: [openib-general] [PATCH] ib_mad.c: In ib_mad_post_send, return error code from ib_post_send Message-ID: <1094849738.1746.415.camel@localhost.localdomain> ib_mad.c: In ib_mad_post_send, return error code from ib_post_send rather than "overwrite" with EINVAL Index: ib_mad.c =================================================================== --- ib_mad.c (revision 769) +++ ib_mad.c (working copy) @@ -319,6 +319,7 @@ struct ib_send_wr *send_wr, struct ib_send_wr **bad_send_wr) { + int ret; struct ib_send_wr *cur_send_wr, *next_send_wr; struct ib_send_wr wr; struct ib_send_wr *bad_wr; @@ -363,14 +364,15 @@ &((struct ib_mad_port_private *)mad_agent->device->mad)->send_posted_mad_list); IB_MAD_SEND_LIST_UNLOCK(((struct ib_mad_port_private *)mad_agent->device->mad)); - if (ib_post_send(mad_agent->qp, &wr, &bad_wr)) { + ret = ib_post_send(mad_agent->qp, &wr, &bad_wr); + if (ret) { /* Unlink from posted send MAD list */ IB_MAD_SEND_LIST_LOCK(((struct ib_mad_port_private *)mad_agent->device->mad)); list_del((struct list_head *)send_wr); IB_MAD_SEND_LIST_UNLOCK(((struct ib_mad_port_private *)mad_agent->device->mad)); *bad_send_wr = cur_send_wr; printk(KERN_ERR "ib_mad_post_send failed\n"); - return -EINVAL; + return ret; } cur_send_wr= next_send_wr; } From roland at topspin.com Fri Sep 10 14:03:16 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Sep 2004 14:03:16 -0700 Subject: [openib-general] OpenIB SWG Face to Face 9/9 Meeting In-Reply-To: <20040910204946.GB28896@cup.hp.com> (Grant Grundler's message of "Fri, 10 Sep 2004 13:49:46 -0700") References: <1AC79F16F5C5284499BB9591B33D6F000205E90C@orsmsx408> <20040910204946.GB28896@cup.hp.com> Message-ID: <52sm9pbzvv.fsf@topspin.com> Grant> Most open source project maintainers quietly build Grant> consensus based on the quality/frequency of patches (and Grant> based 
    Grant> on those metrics I don't qualify for write
    Grant> access). In other words, only active developers *with*
    Grant> write access have a say in who else gets write access.

I agree with this completely. In fact, based on the usual standards
for committers, it will probably be a while before we want to give
anyone new write access. The important thing is just to make it clear
that it's not required to pay to get commit access.

 - R.

From ftillier at infiniconsys.com Fri Sep 10 14:09:45 2004
From: ftillier at infiniconsys.com (Fab Tillier)
Date: Fri, 10 Sep 2004 14:09:45 -0700
Subject: [openib-general] Proposed device enumeration & async event APIs
In-Reply-To: <52mzzxdfqx.fsf@topspin.com>
Message-ID: <000001c4977a$8085c070$655aa8c0@infiniconsys.com>

> From: Roland Dreier [mailto:roland at topspin.com]
> Sent: Friday, September 10, 2004 1:35 PM
>
> OK, here is my proposal for how to handle device enumeration and async
> events. I actually have all of this coded and working on my branch;
> I'll post the full diff shortly. However I want to pull out the API
> changes so we can discuss them more easily.

Thanks, this makes it a lot easier to comment on (at least for me).

>
> First of all, here is how a kernel client finds out about the devices
> in the system:
>
>     struct ib_client {
>         void (*add)   (struct ib_device *);
>         void (*remove)(struct ib_device *);
>
>         struct list_head list;
>     };
>
>     int ib_register_client  (struct ib_client *client);
>     int ib_unregister_client(struct ib_client *client);
>
> When a client calls ib_register_client, the add method is called
> for each of the devices that already exist. Conversely, on
> unregister, remove is called for all the remaining devices. When
> a new device is added, add methods are called in the order the
> clients registered; when a device is removed, remove methods are
> called in the opposite order.
> This allows initialization and cleanup
> to happen properly (for example, IPoIB knows that the MAD layer will
> initialize before it and clean up after it).

This sounds sane. Are existing device notifications invoked from the
thread context of the ib_register_client function? In other words, does
the ib_register_client function return before or after the client has
received notifications of existing devices? Does ib_unregister_client
synchronize with callback delivery? Does ib_unregister_client send
"pretend" removal events?

The advantage of having ib_unregister_client send these fake events is
that it allows clients to have their state driven entirely by these
callbacks. This would imply that the remove events get sent in the
context of the caller.

>
> For unaffiliated events, my API is as follows:
>
>     struct ib_event_handler {
>         struct ib_device *device;
>         void            (*handler)(struct ib_event_handler *, struct ib_event *);
>         struct list_head  list;
>     };
>
>     int ib_register_event_handler  (struct ib_device *device,
>                                     struct ib_event_handler *event_handler);

Why does ib_register_event_handler take a device as input? Is this
device the same as event_handler.device? Why not just use the event
handler's device instead?

>     int ib_unregister_event_handler(struct ib_event_handler *event_handler);
>     void ib_dispatch_event(struct ib_event *event);
>
> This is pretty simple: everyone that wants to know about unaffiliated
> (ie not relating to a QP or CQ) events registers a struct
> ib_event_handler. The callback doesn't take a context parameter
> because I'm assuming the struct ib_event_handler will be embedded in
> the client's context and used with container_of (this is similar to
> ).

I think assuming the handler is embedded is sane.
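To make the embedding assumption concrete, here is a minimal userspace
sketch of a handler structure embedded in client state and recovered
with container_of. The demo_* types are simplified, hypothetical
stand-ins and not the proposed kernel structures, and the macro is a
plain offsetof-based substitute for the kernel's container_of.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(). */
#define demo_container_of(ptr, type, member) \
	((type *) ((char *) (ptr) - offsetof(type, member)))

/* Hypothetical, simplified event: only the fields this sketch needs. */
struct demo_event {
	int event;     /* event code */
	int port_num;  /* port the event refers to */
};

/* Hypothetical handler: a callback that receives the handler itself. */
struct demo_event_handler {
	void (*handler)(struct demo_event_handler *, struct demo_event *);
};

/* Hypothetical client state with the handler embedded in it. */
struct demo_client {
	int last_port;                 /* updated from the callback */
	struct demo_event_handler eh;  /* embedded, not pointed to */
};

static void demo_event_cb(struct demo_event_handler *eh,
			  struct demo_event *ev)
{
	/* Recover the enclosing client; no separate context argument needed. */
	struct demo_client *c = demo_container_of(eh, struct demo_client, eh);

	c->last_port = ev->port_num;
}

/* Stand-in for the core delivering one event to one handler. */
static void demo_deliver(struct demo_event_handler *eh,
			 struct demo_event *ev)
{
	eh->handler(eh, ev);
}
```

The design point is that the callback needs no void * context: the
pointer it is handed already identifies the client, as long as the
handler really is a member of the client's structure.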
>
> Finally, I added event_handler members to struct ib_cq and struct
> ib_qp and added support for setting them on creation:
>
>     struct ib_cq {
>         struct ib_device *device;
>         ib_comp_handler   comp_handler;
>         void            (*event_handler)(struct ib_event *, void *);
>         void *            context;
>         int               cqe;
>         atomic_t          usecnt; /* count number of work queues */
>     };
>
>     struct ib_cq *ib_create_cq(struct ib_device *device,
>                                ib_comp_handler comp_handler,
>                                void (*event_handler)(struct ib_event *, void *),
>                                void *cq_context, int cqe);
>
>     struct ib_qp {
>         struct ib_device *device;
>         struct ib_pd     *pd;
>         struct ib_cq     *send_cq;
>         struct ib_cq     *recv_cq;
>         struct ib_srq    *srq;
>         void            (*event_handler)(struct ib_event *, void *);
>         void             *qp_context;
>         u32               qp_num;
>     };
>
>     struct ib_qp_init_attr {
>         void            (*event_handler)(struct ib_event *, void *);
>         void             *qp_context;
>         struct ib_cq     *send_cq;
>         struct ib_cq     *recv_cq;
>         struct ib_srq    *srq;
>         struct ib_qp_cap  cap;
>         enum ib_sig_type  sq_sig_type;
>         enum ib_sig_type  rq_sig_type;
>         enum ib_qp_type   qp_type;
>         u8                port_num; /* special QP types only */
>     };
>
> These do get passed the context to match what we did with the
> comp_handler member of struct ib_cq.

Looks sane to me too.

- Fab

From halr at voltaire.com Fri Sep 10 14:15:19 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Fri, 10 Sep 2004 17:15:19 -0400
Subject: [openib-general] semantics of process_mad?
In-Reply-To: <521xh9df3y.fsf@topspin.com>
References: <521xh9df3y.fsf@topspin.com>
Message-ID: <1094850918.1746.474.camel@localhost.localdomain>

On Fri, 2004-09-10 at 16:49, Roland Dreier wrote:
> I'm now looking at implementing the process_mad method:
>
>     int (*process_mad)(struct ib_device *device,
>                        int process_mad_flags,
>                        struct ib_mad *in_mad,
>                        struct ib_mad *out_mad);
>
> First of all, it seems that a port_num parameter needs to be added.

Agreed. How did this get missed 'till now ? (That's rhetorical)...
> A QP number parameter might also be good, but maybe we can rely on the
> access layer to ensure that the MAD's class and QP number match
> appropriately.

Are you saying to just make sure SM class is QP0 and all GS classes are
not QP0 (QP1 is insufficient when redirection is supported) ?

Because of redirection, it seems adding QP number as a parameter is a
better solution.

> Finally, what should the return value be? There are several
> independent things that the process_mad function needs to tell the
> access layer:
>
>  - Did the operation succeed?
>  - Was the MAD consumed or should it be dispatched to a registered consumer?
>  - Was a reply generated or should no reply be sent?
>
> In the Topspin drivers, we handled this by returning an enum
> ib_mad_result as below:
>
>     enum ib_mad_result {
>         IB_MAD_RESULT_FAILURE  = 0,      // (!SUCCESS is the important flag)
>         IB_MAD_RESULT_SUCCESS  = 1 << 0, // MAD was successfully processed
>         IB_MAD_RESULT_REPLY    = 1 << 1, // Reply packet needs to be sent
>         IB_MAD_RESULT_CONSUMED = 1 << 2  // Packet consumed: stop processing
>     };

I think the only difference with OpenIB right now is that a MAD cannot
be multiply consumed as it can in the Topspin implementation.

-- Hal

From ftillier at infiniconsys.com Fri Sep 10 14:26:02 2004
From: ftillier at infiniconsys.com (Fab Tillier)
Date: Fri, 10 Sep 2004 14:26:02 -0700
Subject: [openib-general] semantics of process_mad?
In-Reply-To: <521xh9df3y.fsf@topspin.com>
Message-ID: <000101c4977c$c6767f50$655aa8c0@infiniconsys.com>

> From: Roland Dreier [mailto:roland at topspin.com]
> Sent: Friday, September 10, 2004 1:49 PM
> To: openib-general at openib.org
> Subject: [openib-general] semantics of process_mad?
>
> I'm now looking at implementing the process_mad method:
>
>     int (*process_mad)(struct ib_device *device,
>                        int process_mad_flags,
>                        struct ib_mad *in_mad,
>                        struct ib_mad *out_mad);
>
> A QP number parameter might also be good, but maybe we can rely on the
> access layer to ensure that the MAD's class and QP number match
> appropriately.

I'm confused as to what the QP parameter would be used for. Can you
clarify?

>
> Finally, what should the return value be? There are several
> independent things that the process_mad function needs to tell the
> access layer:
>
>  - Did the operation succeed?
>  - Was the MAD consumed or should it be dispatched to a registered
>    consumer?
>  - Was a reply generated or should no reply be sent?

Why not have the out_mad be an output parameter? If the process_mad has
output data, it would allocate a mad, fill it in, and return it. Just a
random thought, so it may be totally stupid.

- Fab

From roland at topspin.com Fri Sep 10 14:28:43 2004
From: roland at topspin.com (Roland Dreier)
Date: Fri, 10 Sep 2004 14:28:43 -0700
Subject: [openib-general] Proposed device enumeration & async event APIs
In-Reply-To: <000001c4977a$8085c070$655aa8c0@infiniconsys.com> (Fab Tillier's message of "Fri, 10 Sep 2004 14:09:45 -0700")
References: <000001c4977a$8085c070$655aa8c0@infiniconsys.com>
Message-ID: <52k6v1bypg.fsf@topspin.com>

    Fab> This sounds sane. Are existing device notifications invoked
    Fab> from the thread context of the ib_register_client function?
    Fab> In other words, does the ib_register_client function return
    Fab> before or after the client has receive notifications of
    Fab> existing events? Does ib_unregister_client synchronize with
    Fab> callback delivery? Does ib_unregister_client send "pretend"
    Fab> removal events?

Yes to all of these questions.
Basically register does:

    lock device list
    add client to client list
    generate add callbacks for existing devices
    unlock device list

and unregister does

    lock device list
    remove client from client list
    generate remove callbacks for existing devices
    unlock device list

Adding a device does

    lock device list
    add device to device list
    generate add callbacks for existing clients
    unlock device list

and removing a device does

    lock device list
    remove device from device list
    generate remove callbacks for existing clients
    unlock device list

So everything is synchronized by the lock and all fake events are
generated.

    Fab> Why does ib_register_event_handler take a device as input?
    Fab> Is this device the same as event_handler.device? Why not
    Fab> just use the event handler's device instead?

No good reason -- in fact I'll get rid of the parameter.

Thanks for the comments.

 - Roland

From mlleinin at hpcn.ca.sandia.gov Fri Sep 10 14:32:18 2004
From: mlleinin at hpcn.ca.sandia.gov (Matt L. Leininger)
Date: Fri, 10 Sep 2004 14:32:18 -0700
Subject: [openib-general] Re: [PATCH] Basic driver model/sysfs support
In-Reply-To: <20040904080747.GA21430@kroah.com>
References: <527jrcwo4m.fsf@topspin.com> <20040903103326.GA5257@kroah.com> <52vfevs9su.fsf@topspin.com> <20040904080747.GA21430@kroah.com>
Message-ID: <1094851938.4319.490.camel@trinity>

On Sat, 2004-09-04 at 01:07, Greg KH wrote:
>
> p.s. Can someone please turn the "closed list" option off? If you all
> want to be a open mailing list, it's pretty rude to hold emails from
> non-list members. Almost all Linux development mailing lists accept
> email from anyone, list-member or not.
>
> p.s.s. And no, spam is not a valid reason for having such a policy, that
> can be handled properly by filters on the mail server, or filters by the
> users themselves.

We got the spam filters up and the list is now open so folks can post
w/o being list-members.
- Matt From halr at voltaire.com Fri Sep 10 14:43:36 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 10 Sep 2004 17:43:36 -0400 Subject: [openib-general] Re: ib_mad.c comments In-Reply-To: <20040908164739.3e9c8723.mshefty@ichips.intel.com> References: <20040908164739.3e9c8723.mshefty@ichips.intel.com> Message-ID: <1094852615.1794.538.camel@localhost.localdomain> Hi Sean, Thanks for reviewing this early version which is incomplete. On Wed, 2004-09-08 at 19:47, Sean Hefty wrote: > Here's a list of comments from reviewing ib_mad.c. I'll use this list as kind of my to do list for the GSI. Several of these can be delayed when implementing. After we meet tomorrow, I will begin creating patches. Overall, it's a good start. > > ib_mad_reg(): > - Need to lock when checking/setting version/class/methods. This item is on my TODO list for this (ib_mad.c). > - Need to support registrations for "all" methods of a given class. > (We may want the initial implementation to only do this for now, > to shorten the development time.) This can be done today depending on the definition of "all". All is the entire method bit mask. If "all" means all the ones valid for a class, that is a specific bit mask per class. Is this proposing a shortcut way to do this ? If so, is this much of a savings for the client ? Do you think it adds that much more complexity to have multiple clients for methods of the same class ? Wouldn't this be a problem to run SM and SMA concurrently if just all methods for a class was supported initially ? > - Should we reference qp0 and qp1 with the registration? Either way (struct ib_qp * (for qp0 or qp1) or enum ib_qp_type) is fine with me. It seems to me that the enum approach is easier for the client. Does the client need a pointer to the QP for something else ? > - Need to ensure unique tids in case of wrapping. Are you referring to making sure that the high TID is not in use ?
It is on my TODO list to change this to use client_id as an index (to save walking a linked list and just index into a table based on the client_id). This will be done after all the more straightforward changes. > ib_mad_post_send(): > - We should return the error code from ib_post_send in order to > handle overruns differently. Fixed. Posted patch for this. > - The print level should be lowered from error. There are 2 prints. Not sure which ones should be lowered. Both or only the one after ib_post_send ? > - Should we avoid casting the list_head to a structure where possible? Will address as part of response to Roland's comments on ib_mad.c > allocate_method_table(): > - Can just use memset to clear the table. I changed this to use memset. Posted patch for this. > check_class_table(): > - Has an extra '{'. Where's the extra { ? > ib_mad_recv_done_handler(): > ib_mad_send_done_handler(): > - Not sure why these calls search for the corresponding work request. On the send side, it should just be the head entry. I will change that. Right now on the receive side, there is one receive list, and the receives are posted for QPs 0 and 1 (and ultimately other QPs) so the right post needs to be found. I think there is additional information in the WR which is needed for the callbacks. I didn't get far enough on the receive side so I will defer this part of the answer for now. > ib_mad_post_receive_mads(): > - I think we can just pass &qp0 or &qp1, rather than a type to > ib_mad_post_receive_mad. Right now, the QP type is saved as part of the private MAD. This may be unnecessary. Won't be sure until I complete the receive side. > - Print level should be lowered from error There are 2 prints. Not sure which ones should be lowered. Both or only the one after ib_post_recv ? > - We can track the number of posted receives to avoid posting overruns. This seems like an optimization. I will put it on my TODO list for now.
> struct ib_mad_device_private: > - If we make qp0 and qp1 an array, it may simply the code and remove > several checks from the code. Good point. It does (simplify the code :-) Already posted patch for this. > ib_mad_device_open(): > - A nit, but it's logically initializing a port on the device. Fixed. Posted patch for this. > - Remove +20 to CQ size. Fixed. Posted patch for this. > - We could change to using 1 PD/device, versus 1 PD/port. Is there any advantage/disadvantage one way or the other ? > - Not sure if we need to support max_sge on the send queue. This may > be substantially larger than what we need. At a minimum, I think > that we need 2 for optimal RMPP support. I'm not sure where the > trade-off is between SGE versus copying into a single buffer lies. I'm not following where the minimum of 2 for optimal RMPP support comes from. -- Hal From halr at voltaire.com Fri Sep 10 14:56:10 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 10 Sep 2004 17:56:10 -0400 Subject: [openib-general] [PATCH] ib_mad.c: In ib_mad_post_send, set bad_send_wr on errors Message-ID: <1094853370.1752.565.camel@localhost.localdomain> ib_mad.c: In ib_mad_post_send, set bad_send_wr on errors Index: ib_mad.c =================================================================== --- ib_mad.c (revision 770) +++ ib_mad.c (working copy) @@ -326,9 +326,12 @@ struct ib_mad_send_wr_private *mad_send_wr; IB_MAD_SEND_LIST_LOCK_VAR; + cur_send_wr = send_wr; /* Validate supplied parameters */ - if (!mad_agent || !send_wr) + if (!mad_agent || !send_wr) { + *bad_send_wr = cur_send_wr; return -EINVAL; + } /* Walk list of send WRs and post each one on send list */ cur_send_wr = send_wr; @@ -338,6 +341,7 @@ /* Allocate MAD send WR tracking structure */ mad_send_wr = kmalloc(sizeof *mad_send_wr, GFP_KERNEL); if (!mad_send_wr) { + *bad_send_wr = cur_send_wr; printk(KERN_ERR "No memory for ib_mad_send_wr_private\n"); return -ENOMEM; } From ftillier at infiniconsys.com Fri Sep 10 14:57:35 2004 
From: ftillier at infiniconsys.com (Fab Tillier) Date: Fri, 10 Sep 2004 14:57:35 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <52k6v1bypg.fsf@topspin.com> Message-ID: <000201c49781$2ee32530$655aa8c0@infiniconsys.com> > From: Roland Dreier [mailto:roland at topspin.com] > Sent: Friday, September 10, 2004 2:29 PM > > Fab> This sounds sane. Are existing device notifications invoked > Fab> from the thread context of the ib_register_client function? > Fab> In other words, does the ib_register_client function return > Fab> before or after the client has receive notifications of > Fab> existing events? Does ib_unregister_client synchronize with > Fab> callback delivery? Does ib_unregister_client send "pretend" > Fab> removal events? > > Yes to all of these questions. ... > > So everything is synchronized by the lock and all fake events are > generated. Great! I didn't see a way for a client to associate a context or some such thing with a device when the device is added. I would think this would be beneficial in order to avoid requiring clients to search a list for a matching device. 
I'm suggesting something like this: struct ib_device_reg { void (*remove)(struct ib_device_reg *); struct list_head reg_list; struct list_head dev_list; }; struct ib_client { struct ib_device_reg * (*add) (struct ib_device *); struct list_head list; struct list_head reg_list; }; Adding a device does lock device list add device to device list generate add callbacks for existing clients insert returned value into device's reg_list insert returned value into client's reg_list unlock device list and removing a device does lock device list for each entry in device's reg_list { generate remove callback remove device_reg.dev_list and device_reg.reg_list from lists } unlock device list and unregister does lock device list remove client from client list for each entry in client's reg_list { generate remove callbacks remove device_reg.dev_list and device_reg.reg_list from lists } unlock device list This does two things: it provides a way for clients to get some sort of context back for removal, allowing them to embed a struct ib_device_reg in whatever they allocate for the device; it allows clients to suppress a remove event for a device for which they failed to allocate stuff or don't care about. Thoughts? - Fab From mshefty at ichips.intel.com Fri Sep 10 14:02:35 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 10 Sep 2004 14:02:35 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <52mzzxdfqx.fsf@topspin.com> References: <52mzzxdfqx.fsf@topspin.com> Message-ID: <20040910140235.5139e440.mshefty@ichips.intel.com> On Fri, 10 Sep 2004 13:35:18 -0700 Roland Dreier wrote: > struct ib_client { > void (*add) (struct ib_device *); > void (*remove)(struct ib_device *); > > struct list_head list; > }; > > int ib_register_client (struct ib_client *client); > int ib_unregister_client(struct ib_client *client); I agree with the behavior that you've defined. 
I would use the names ib_reg_client/ib_dereg_client to better match the existing APIs. (Same comments with "register/unregister" names below.) > int ib_register_event_handler (struct ib_device *device, > struct ib_event_handler *event_handler); > int ib_unregister_event_handler(struct ib_event_handler *event_handler); > void ib_dispatch_event(struct ib_event *event); What use do you see for ib_dispatch_event? From halr at voltaire.com Fri Sep 10 15:29:07 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 10 Sep 2004 18:29:07 -0400 Subject: [openib-general] [PATCH] ib_mad.c: Reimplement convert_mgmt_class function Message-ID: <1094855346.1752.639.camel@localhost.localdomain> [PATCH] ib_mad.c: Reimplement convert_mgmt_class function Index: ib_mad.c =================================================================== --- ib_mad.c (revision 771) +++ ib_mad.c (working copy) @@ -105,6 +105,7 @@ static void remove_mad_reg_req(struct ib_mad_agent_private *priv); static int ib_mad_port_restart(struct ib_mad_port_private *priv); static int ib_mad_post_receive_mads(struct ib_mad_port_private *priv); +static inline u8 convert_mgmt_class(struct ib_mad_reg_req *mad_reg_req); /* @@ -385,17 +386,11 @@ } EXPORT_SYMBOL(ib_mad_post_send); -static u8 convert_mgmt_class(struct ib_mad_reg_req *mad_reg_req) +static inline u8 convert_mgmt_class(struct ib_mad_reg_req *mad_reg_req) { - u8 mgmt_class; - /* Alias IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE to 0 */ - if (mad_reg_req->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { - mgmt_class = 0; - } else { - mgmt_class = mad_reg_req->mgmt_class; - } - return mgmt_class; + return mad_reg_req->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE ? 
+ 0 : mad_reg_req->mgmt_class; } static int is_method_in_use(struct ib_mad_mgmt_method_table **method, From halr at voltaire.com Fri Sep 10 15:35:31 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 10 Sep 2004 18:35:31 -0400 Subject: [openib-general] [PATCH] ib_mad.c: Make error return codes are negative Message-ID: <1094855730.1794.657.camel@localhost.localdomain> ib_mad.c: Make error return codes are negative Index: ib_mad.c =================================================================== --- ib_mad.c (revision 772) +++ ib_mad.c (working copy) @@ -403,7 +403,7 @@ i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, 1+i)) { if ((*method)->agent[i]) { printk(KERN_ERR "Method %d already in use\n", i); - return EINVAL; + return -EINVAL; } } return 0; @@ -415,7 +415,7 @@ *method = kmalloc(sizeof **method, GFP_KERNEL); if (!*method) { printk(KERN_ERR "No memory for ib_mad_mgmt_method_table\n"); - return ENOMEM; + return -ENOMEM; } /* Clear management method table */ memset(*method, 0, sizeof **method); @@ -528,13 +528,13 @@ kfree(*method); *method = NULL; } - ret = EINVAL; + ret = -EINVAL; goto error; error2: kfree(*class); *class = NULL; error1: - ret = ENOMEM; + ret = -ENOMEM; error: return ret; } From halr at voltaire.com Fri Sep 10 15:46:15 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 10 Sep 2004 18:46:15 -0400 Subject: [openib-general] [PATCH] ib_mad.c: In ib_mad_completion_handler, eliminate static WC Message-ID: <1094856374.1746.684.camel@localhost.localdomain> ib_mad.c: In ib_mad_completion_handler, eliminate static WC Index: ib_mad.c =================================================================== --- ib_mad.c (revision 773) +++ ib_mad.c (working copy) @@ -663,12 +663,7 @@ */ static void ib_mad_completion_handler(struct ib_mad_port_private *priv) { - - /* - * For stack overflow safety reason, WC is static here. - * This callback may not be called more than once at the same time. 
- */ - static struct ib_wc wc; + struct ib_wc wc; int err_status = 0; while (!ib_poll_cq(priv->cq, 1, &wc)) { From iod00d at hp.com Fri Sep 10 15:50:32 2004 From: iod00d at hp.com (Grant Grundler) Date: Fri, 10 Sep 2004 15:50:32 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <20040910140235.5139e440.mshefty@ichips.intel.com> References: <52mzzxdfqx.fsf@topspin.com> <20040910140235.5139e440.mshefty@ichips.intel.com> Message-ID: <20040910225032.GA29616@cup.hp.com> On Fri, Sep 10, 2004 at 02:02:35PM -0700, Sean Hefty wrote: > > int ib_register_client (struct ib_client *client); > > int ib_unregister_client(struct ib_client *client); > > ... use the names ib_reg_client/ib_dereg_client to better match the > existing APIs. (Same comments with "register/unregister" names below.) register/unregister is pretty obvious and follows other linux APIs. Try "fgrep register_driver include/linux/*" in linux-2.6 source tree. thanks, grant From ftillier at infiniconsys.com Fri Sep 10 15:56:10 2004 From: ftillier at infiniconsys.com (Fab Tillier) Date: Fri, 10 Sep 2004 15:56:10 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <52k6v1bypg.fsf@topspin.com> Message-ID: <000301c49789$5e247da0$655aa8c0@infiniconsys.com> > From: Roland Dreier [mailto:roland at topspin.com] > Sent: Friday, September 10, 2004 2:29 PM > ... > > So everything is synchronized by the lock and all fake events are > generated. Will clients be able to allocate HCA resources from the callback? Will this work if there's a lock held during callbacks? I'm not too familiar with the mthca code, but VAPI didn't let you do anything while holding a lock. 
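To make the locking question concrete, the register/unregister flow described earlier in this thread can be modelled in user space roughly as follows. This is only a sketch under stated assumptions: every ex_* name is made up for illustration, the "lock" is a plain counter rather than a kernel lock, device removal is left out for brevity, and none of this is the actual OpenIB or mthca code. It just shows that the add/remove callbacks run while the device-list lock is held.

```c
#include <stddef.h>

#define EX_MAX 8	/* arbitrary table size for the sketch */

struct ex_device { const char *name; };

struct ex_client {
	void (*add)(struct ex_device *);
	void (*remove)(struct ex_device *);
};

static struct ex_device *ex_devices[EX_MAX];
static struct ex_client *ex_clients[EX_MAX];
static int ex_ndev, ex_nclient;
static int ex_lock_depth;	/* counter standing in for the device-list lock */

static void ex_lock(void)   { ex_lock_depth++; }
static void ex_unlock(void) { ex_lock_depth--; }
int ex_lock_is_held(void)   { return ex_lock_depth > 0; }

/* register: add the client, then replay "add" for every existing device */
int ex_register_client(struct ex_client *c)
{
	int i;

	if (!c || ex_nclient == EX_MAX)
		return -1;
	ex_lock();
	ex_clients[ex_nclient++] = c;
	for (i = 0; i < ex_ndev; i++)
		if (c->add)
			c->add(ex_devices[i]);
	ex_unlock();
	return 0;
}

/* unregister: drop the client, then send "remove" for every existing device */
int ex_unregister_client(struct ex_client *c)
{
	int i, found = -1;

	ex_lock();
	for (i = 0; i < ex_nclient; i++)
		if (ex_clients[i] == c)
			found = i;
	if (found >= 0) {
		ex_clients[found] = ex_clients[--ex_nclient];
		for (i = 0; i < ex_ndev; i++)
			if (c->remove)
				c->remove(ex_devices[i]);
	}
	ex_unlock();
	return found >= 0 ? 0 : -1;
}

/* add device: record it, then notify every registered client */
int ex_add_device(struct ex_device *d)
{
	int i;

	if (!d || ex_ndev == EX_MAX)
		return -1;
	ex_lock();
	ex_devices[ex_ndev++] = d;
	for (i = 0; i < ex_nclient; i++)
		if (ex_clients[i]->add)
			ex_clients[i]->add(d);
	ex_unlock();
	return 0;
}

/* Example client: counts callbacks and how many ran with the lock held */
static int ex_adds, ex_removes, ex_locked_calls;

static void ex_count_add(struct ex_device *d)
{
	(void)d;
	ex_adds++;
	ex_locked_calls += ex_lock_is_held();
}

static void ex_count_remove(struct ex_device *d)
{
	(void)d;
	ex_removes++;
	ex_locked_calls += ex_lock_is_held();
}
```

In this model every callback observes the lock as held, so anything a client does from add/remove has to be safe in that context; whether that rules out allocating HCA resources depends on what kind of lock the real implementation uses.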
- Fab From mshefty at ichips.intel.com Fri Sep 10 16:00:19 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 10 Sep 2004 16:00:19 -0700 Subject: [openib-general] Re: ib_mad.c comments In-Reply-To: <1094852615.1794.538.camel@localhost.localdomain> References: <20040908164739.3e9c8723.mshefty@ichips.intel.com> <1094852615.1794.538.camel@localhost.localdomain> Message-ID: <20040910160019.5b064909.mshefty@ichips.intel.com> On Fri, 10 Sep 2004 17:43:36 -0400 Hal Rosenstock wrote: > > - Need to support registrations for "all" methods of a given class. > > (We may want the initial implementation to only do this for now, > > to shorten the development time.) > This can be done today depending on the definition of "all". All is the > entire method bit mask. If "all" means all the ones valid for a class, > that is a specific bit mask per class. Is this proposing a shortcut way > to do this ? If so, is this much of a savings for the client ? From the client's perspective, I don't think it's a big deal. I was thinking more about the implementation and avoiding checking on the method if someone wanted all MADs for a given class. This may not be much of a savings - not sure how fast the method checks would be. > Do you think it adds that much more complexity to have multiple clients > for methods of the same class ? Wouldn't this be a problem to run SM and > SMA concurrently if just all methods for a class was supported initially > ? Since you already have an implementation that operates on the methods, I wouldn't change it. > > - Should we reference qp0 and qp1 with the registration? > Either way (struct ib_qp * (for qp0 or qp1) or enum ib_qp_type) is fine > with me. It seems to me that the enum approach is easier for the client. > Does the client need a pointer to the QP for something else ? I used struct ib_qp* inside the ib_mad_agent for QP redirection purposes, and allows them to query the QP for its attributes, such as the number of supported SGEs.
> > - The print level should be lowered from error. > There are 2 prints. Not sure which ones should be lowered. Both or only > the one after ib_post_send ? one after ib_post_send - see below > > - Has an extra '{'. > Where's the extra { ? My bad - the "return j" and "}" are swapped. > On the send side, it should just be the head entry. I will change that. > Right now on the receive side, there is one receive list, and the I think we need separate receive lists for QP 0 and 1. Or the wr_id should just point to the corresponding receive information. > > - Print level should be lowered from error > There are 2 prints. Not sure which ones should be lowered. Both or only > the one after ib_post_recv ? one after ib_post_recv - see below > > - We can track the number of posted receives to avoid posting overruns. > This seems like an optimization. I will put it on my TODO list for now. It has an effect on the implementation. If we call ib_post_send/recv until it fails, then we need to treat those failures as expected, and not true errors. We spoke about this some yesterday, but for others on the list, I think that the current implementation of ib_post_send needs to be moved down and renamed. A call to ib_post_send could then call that routine and take whatever action is appropriate to handle an overrun case, such as queuing the request, ignoring the overrun, etc. > > - We could change to using 1 PD/device, versus 1 PD/port. > Is there any advantage/disadvantage one way or the other ? Just a small optimization. I'd ignore for now. > > - Not sure if we need to support max_sge on the send queue. This may > > be substantially larger than what we need. At a minimum, I think > > that we need 2 for optimal RMPP support. I'm not sure where the > > trade-off is between SGE versus copying into a single buffer lies. > I'm not following where the minimum of 2 for optimal RMPP support comes > from. The first SGE would reference the MAD/RMPP header. 
The second SGE would reference the MAD data. We've seen out of memory issues in our testing due to the number of SGEs allocated per QP, so limiting the QP size is probably worthwhile. As long as a user can get the max SGEs supported by the QPs, we should be okay. From mshefty at ichips.intel.com Fri Sep 10 16:04:38 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 10 Sep 2004 16:04:38 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <000201c49781$2ee32530$655aa8c0@infiniconsys.com> References: <52k6v1bypg.fsf@topspin.com> <000201c49781$2ee32530$655aa8c0@infiniconsys.com> Message-ID: <20040910160438.73266327.mshefty@ichips.intel.com> On Fri, 10 Sep 2004 14:57:35 -0700 "Fab Tillier" wrote: > Great! I didn't see a way for a client to associate a context or some such > thing with a device when the device is added. I would think this would be > beneficial in order to avoid requiring clients to search a list for a > matching device. I'm suggesting something like this: I think that this is a useful feature. Although, unless we're expecting a lot of devices or insertion/removals, forcing clients to search for their context isn't a huge deal. From halr at voltaire.com Fri Sep 10 16:10:26 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 10 Sep 2004 19:10:26 -0400 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <20040910225032.GA29616@cup.hp.com> References: <52mzzxdfqx.fsf@topspin.com> <20040910140235.5139e440.mshefty@ichips.intel.com> <20040910225032.GA29616@cup.hp.com> Message-ID: <1094857825.1794.735.camel@localhost.localdomain> On Fri, 2004-09-10 at 18:50, Grant Grundler wrote: > On Fri, Sep 10, 2004 at 02:02:35PM -0700, Sean Hefty wrote: > > > int ib_register_client (struct ib_client *client); > > > int ib_unregister_client(struct ib_client *client); > > > > ... use the names ib_reg_client/ib_dereg_client to better match the > > existing APIs. 
(Same comments with "register/unregister" names below.) > > register/unregister is pretty obvious and follows other linux APIs. > Try "fgrep register_driver include/linux/*" in linux-2.6 source tree. So should we change reg/dereg to register/deregister (in ib_mad.h) ? -- Hal From mshefty at ichips.intel.com Fri Sep 10 16:12:12 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 10 Sep 2004 16:12:12 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <20040910225032.GA29616@cup.hp.com> References: <52mzzxdfqx.fsf@topspin.com> <20040910140235.5139e440.mshefty@ichips.intel.com> <20040910225032.GA29616@cup.hp.com> Message-ID: <20040910161212.6ccbeee4.mshefty@ichips.intel.com> On Fri, 10 Sep 2004 15:50:32 -0700 Grant Grundler wrote: > On Fri, Sep 10, 2004 at 02:02:35PM -0700, Sean Hefty wrote: > > > int ib_register_client (struct ib_client *client); > > > int ib_unregister_client(struct ib_client *client); > > > > ... use the names ib_reg_client/ib_dereg_client to better match the > > existing APIs. (Same comments with "register/unregister" names below.) > > register/unregister is pretty obvious and follows other linux APIs. > Try "fgrep register_driver include/linux/*" in linux-2.6 source tree. *nods* I was referring to the ib_verb APIs, but see that register/unregister is more common in the kernel code.
thanks for the reference From halr at voltaire.com Fri Sep 10 16:20:59 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 10 Sep 2004 19:20:59 -0400 Subject: [openib-general] [PATCH] ib_mad.c: Change print level when ib_post_send/recv fails Message-ID: <1094858459.1752.771.camel@localhost.localdomain> ib_mad.c: Change print level when ib_post_send/recv fails Index: ib_mad.c =================================================================== --- ib_mad.c (revision 774) +++ ib_mad.c (working copy) @@ -376,7 +376,7 @@ list_del((struct list_head *)send_wr); IB_MAD_SEND_LIST_UNLOCK(((struct ib_mad_port_private *)mad_agent->device->mad)); *bad_send_wr = cur_send_wr; - printk(KERN_ERR "ib_mad_post_send failed\n"); + printk(KERN_NOTICE "ib_mad_post_send failed\n"); return ret; } cur_send_wr= next_send_wr; @@ -828,7 +828,7 @@ sizeof *mad_priv - sizeof mad_priv->header, PCI_DMA_FROMDEVICE); kfree(mad_priv); - printk(KERN_ERR "ib_post_recv failed\n"); + printk(KERN_NOTICE "ib_post_recv failed\n"); return -EINVAL; } From halr at voltaire.com Fri Sep 10 16:26:07 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 10 Sep 2004 19:26:07 -0400 Subject: [openib-general] [PATCH] ib_mad: Reduce QP send max_sge Message-ID: <1094858767.1752.790.camel@localhost.localdomain> Reduce QP send max_sge Index: ib_mad.c =================================================================== --- ib_mad.c (revision 775) +++ ib_mad.c (working copy) @@ -1119,7 +1119,6 @@ .addr = 0, .size = (unsigned long) high_memory - PAGE_OFFSET }; - struct ib_device_attr device_attr; struct ib_qp_init_attr qp_init_attr; struct ib_qp_cap qp_cap; struct ib_mad_port_private *entry, *priv = NULL, @@ -1180,12 +1179,6 @@ goto error5; } - /* Query device to obtain max_sge */ - if (ib_query_device(device, &device_attr)) { - printk(KERN_ERR "Could not ib_query_device\n"); - device_attr.max_sge = IB_MAD_SEND_REQ_MAX_SG; - } - for (i = 0; i < 2; i++) { memset(&qp_init_attr, 0, sizeof qp_init_attr); 
qp_init_attr.send_cq = priv->cq; @@ -1194,7 +1187,7 @@ qp_init_attr.rq_sig_type = IB_SIGNAL_ALL_WR; qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE; qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE; - qp_init_attr.cap.max_send_sge = device_attr.max_sge; + qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG; qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; if (i == 0) qp_init_attr.qp_type = IB_QPT_SMI; Index: ib_mad_priv.h =================================================================== --- ib_mad_priv.h (revision 769) +++ ib_mad_priv.h (working copy) @@ -62,7 +62,7 @@ /* QP and CQ parameters */ #define IB_MAD_QP_SEND_SIZE 2048 #define IB_MAD_QP_RECV_SIZE 512 -#define IB_MAD_SEND_REQ_MAX_SG 1 +#define IB_MAD_SEND_REQ_MAX_SG 2 #define IB_MAD_RECV_REQ_MAX_SG 1 #define IB_MAD_SEND_Q_PSN 0 From halr at voltaire.com Fri Sep 10 16:33:37 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 10 Sep 2004 19:33:37 -0400 Subject: [openib-general] [PATCH] [TRIVIAL] ib_mad.c: Fix check_class_table function Message-ID: <1094859217.1746.818.camel@localhost.localdomain> ib_mad.c: Fix check_class_table function Index: ib_mad.c =================================================================== --- ib_mad.c (revision 776) +++ ib_mad.c (working copy) @@ -446,9 +446,8 @@ if (class->method_table[i]) { j++; } - return j; } - + return j; } static void remove_methods_mad_agent(struct ib_mad_mgmt_method_table *method, From halr at voltaire.com Fri Sep 10 17:00:51 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 10 Sep 2004 20:00:51 -0400 Subject: [openib-general] [PATCH] ib_mad: Change ib_mad_post_receive_mad to take QP pointer rather than type Message-ID: <1094860851.1794.904.camel@localhost.localdomain> ib_mad: Change ib_mad_post_receive_mad to take QP pointer rather than type. Also, Eliminate qp_type from ib_mad_private structure. 
Index: ib_mad.c =================================================================== --- ib_mad.c (revision 777) +++ ib_mad.c (working copy) @@ -776,7 +776,7 @@ } static int ib_mad_post_receive_mad(struct ib_mad_port_private *priv, - enum ib_qp_type qp_type) + struct ib_qp *qp) { struct ib_mad_private *mad_priv; struct ib_sge sg_list; @@ -791,7 +791,6 @@ return -ENOMEM; } mad_priv->header.next = NULL; - mad_priv->header.qp_type = qp_type; /* Setup scatter list */ sg_list.addr = pci_map_single(priv->device->dma_device, @@ -816,7 +815,7 @@ pci_unmap_addr_set(&mad_priv->header.buf, mapping, sg_list.addr); /* Now, post receive WR */ - if (ib_post_recv(priv->qp[qp_type], &recv_wr, &bad_recv_wr)) { + if (ib_post_recv(qp, &recv_wr, &bad_recv_wr)) { /* Unlink from posted receive MAD list */ IB_MAD_RECV_LIST_LOCK(priv); list_del((struct list_head *)mad_priv); @@ -839,18 +838,14 @@ */ static int ib_mad_post_receive_mads(struct ib_mad_port_private *priv) { - int i; + int i, j; for (i = 0; i < IB_MAD_QP_RECV_SIZE; i++) { - /* Post SMI receive */ - if (ib_mad_post_receive_mad(priv, IB_QPT_SMI)) { - printk(KERN_ERR "SMI receive post %d failed\n", i + 1); + for (j = 0; j < 2; j++) { + if (ib_mad_post_receive_mad(priv, priv->qp[j])) { + printk(KERN_ERR "receive post %d failed\n", i + 1); + } } - - /* Post GSI receive */ - if (ib_mad_post_receive_mad(priv, IB_QPT_GSI)) { - printk(KERN_ERR "GSI receive post %d failed\n", i + 1); - } } return 0; Index: ib_mad_priv.h =================================================================== --- ib_mad_priv.h (revision 776) +++ ib_mad_priv.h (working copy) @@ -79,7 +79,6 @@ struct ib_mad_private_header { struct ib_mad_private_header *next; - enum ib_qp_type qp_type; struct ib_mad_buf buf; } __attribute__ ((packed)); From halr at voltaire.com Fri Sep 10 17:12:16 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 10 Sep 2004 20:12:16 -0400 Subject: [openib-general] Re: ib_mad.c comments In-Reply-To: 
<20040910160019.5b064909.mshefty@ichips.intel.com> References: <20040908164739.3e9c8723.mshefty@ichips.intel.com> <1094852615.1794.538.camel@localhost.localdomain> <20040910160019.5b064909.mshefty@ichips.intel.com> Message-ID: <1094861535.1752.939.camel@localhost.localdomain> On Fri, 2004-09-10 at 19:00, Sean Hefty wrote: > > > - Should we reference qp0 and qp1 with the registration? > > Either way (struct ib_qp * (for qp0 or qp1) or enum ib_qp_type) is fine > > with me. It seems to me that the enum approach is easier for the client. > > Does the client need a pointer to the QP for something else ? > > I used struct ib_qp* inside the ib_mad_agent for QP redirection purposes, > and allows them to query the QP for its attributes, such as the number of > supported SGEs. But the client doesn't get the mad_agent pointer until it registers. > > > - The print level should be lowered from error. > > There are 2 prints. Not sure which ones should be lowered. Both or only > > the one after ib_post_send ? > > one after ib_post_send - see below Changed to NOTICE for normal but significant condition. > > > - Has an extra '{'. > > Where's the extra { ? > > My bad - the "return j" and "}" are swapped. No my bad... Thanks for catching this. Patch sent. > > On the send side, it should just be the head entry. I will change that. > > Right now on the receive side, there is one receive list, and the > > I think we need separate receive lists for QP 0 and 1. I was thinking this (separate receive lists for QPs 0 and 1) but was not quite there. I will now be doing this shortly. This also seems to mean a list for every redirected QP too :-( That is a future item. > Or the wr_id should just point to the corresponding receive information. This would be perfect except that I think lists (or some structures) are still needed in case the posted receives need to be returned. > > > - Print level should be lowered from error > > There are 2 prints. Not sure which ones should be lowered. 
Both or only > > the one after ib_post_recv ? > > one after ib_post_recv - see below Changed to NOTICE for normal but significant condition. > > > - We can track the number of posted receives to avoid posting overruns. > > This seems like an optimization. I will put it on my TODO list for now. > > It has an effect on the implementation. If we call ib_post_send/recv until it fails, > then we need to treat those failures as expected, and not true errors. E.g. To implement a deferred send when room becomes available. > We spoke about this some yesterday, but for others on the list, I think that the current > implementation of ib_post_send needs to be moved down and renamed. A call to ib_post_send > could then call that routine and take whatever action is appropriate to handle an overrun case, > such as queuing the request, ignoring the overrun, etc. Do you mean ib_mad_post_send rather than ib_post_send ? > > > - We could change to using 1 PD/device, versus 1 PD/port. > > Is there any advantage/disadvantage one way or the other ? > > Just a small optimization. I'd ignore for now. I'll put it in the revisit section of the TODO list. > > > - Not sure if we need to support max_sge on the send queue. This may > > > be substantially larger than what we need. At a minimum, I think > > > that we need 2 for optimal RMPP support. I'm not sure where the > > > trade-off is between SGE versus copying into a single buffer lies. > > I'm not following where the minimum of 2 for optimal RMPP support comes > > from. > > The first SGE would reference the MAD/RMPP header. The second SGE would reference the MAD data. > We've seen out of memory issues in our testing due to the number of SGEs allocated per QP, > so limiting the QP size is probably worthwhile. As long as a user can get the max SGEs > supported by the QPs, we should be okay. I set the QP send max sge to 2. Patch sent for this. Thanks. 
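As an illustration of the two-SGE layout discussed above (first entry for the MAD/RMPP header, second for the MAD data), here is a minimal user-space sketch. It is not code from ib_mad.c: struct example_sge, EXAMPLE_SEND_MAX_SG and example_build_send_sg() are hypothetical names, and the header and data lengths are left to the caller rather than taken from any header file.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a scatter/gather entry; not the ib_verbs.h type. */
struct example_sge {
	uint64_t addr;		/* address of the buffer segment */
	uint32_t length;	/* segment length in bytes */
	uint32_t lkey;		/* local key covering the segment */
};

#define EXAMPLE_SEND_MAX_SG 2	/* mirrors the value 2 chosen in this thread */

/*
 * Fill a gather list of up to two entries: entry 0 covers the header,
 * entry 1 covers the payload. Returns the number of entries used
 * (1 for a header-only send, 2 otherwise), or -1 on unusable arguments.
 */
int example_build_send_sg(struct example_sge *sg,
			  const void *hdr, uint32_t hdr_len,
			  const void *data, uint32_t data_len,
			  uint32_t lkey)
{
	if (!sg || !hdr || hdr_len == 0)
		return -1;

	sg[0].addr = (uint64_t)(uintptr_t)hdr;
	sg[0].length = hdr_len;
	sg[0].lkey = lkey;

	if (!data || data_len == 0)
		return 1;	/* header-only send needs a single entry */

	sg[1].addr = (uint64_t)(uintptr_t)data;
	sg[1].length = data_len;
	sg[1].lkey = lkey;
	return 2;
}
```

With a layout like this the header and the payload can live in separate buffers and no copy into a single contiguous buffer is needed, which is the trade-off being weighed against a larger max_send_sge.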
-- Hal From halr at voltaire.com Fri Sep 10 17:57:53 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 10 Sep 2004 20:57:53 -0400 Subject: [openib-general] [PATCH] ib_mad.c: Fix HCA and switch port numbering Message-ID: <1094864272.1746.1039.camel@localhost.localdomain> ib_mad.c: Fix HCA and switch port numbering Index: ib_mad.c =================================================================== --- ib_mad.c (revision 780) +++ ib_mad.c (working copy) @@ -1257,7 +1257,7 @@ static int ib_mad_init_device(struct ib_device *device) { - int ret, num_ports, i, ret2; + int ret, num_ports, cur_port, i, ret2; struct ib_device_attr device_attr; ret = ib_query_device(device, &device_attr); @@ -1268,13 +1268,15 @@ if (device->node_type == IB_NODE_SWITCH) { num_ports = 1; + cur_port = 0; } else { num_ports = device_attr.phys_port_cnt; + cur_port = 1; } - for (i = 0; i < num_ports; i++) { - ret = ib_mad_port_open(device, i); + for (i = 0; i < num_ports; i++, cur_port++) { + ret = ib_mad_port_open(device, cur_port); if (ret) { - printk(KERN_ERR "Could not open port %d\n", i); + printk(KERN_ERR "Could not open port %d\n", cur_port); goto error_device_open; } } @@ -1283,9 +1285,10 @@ error_device_open: while (i > 0) { - ret2 = ib_mad_port_close(device, i); + cur_port--; + ret2 = ib_mad_port_close(device, cur_port); if (ret2) { - printk(KERN_ERR "Could not close port %d\n", i); + printk(KERN_ERR "Could not close port %d\n", cur_port); } i--; } @@ -1296,7 +1299,7 @@ static int ib_mad_remove_device(struct ib_device *device) { - int ret, i, num_ports, ret2; + int ret, i, num_ports, cur_port, ret2; struct ib_device_attr device_attr; ret = ib_query_device(device, &device_attr); @@ -1305,12 +1308,17 @@ goto error_device_query; } - /* num_ports should also be based on device type! 
*/ - num_ports = device_attr.phys_port_cnt; - for (i = 0; i < num_ports; i++) { - ret2 = ib_mad_port_close(device, i); + if (device->node_type == IB_NODE_SWITCH) { + num_ports = 1; + cur_port = 0; + } else { + num_ports = device_attr.phys_port_cnt; + cur_port = 1; + } + for (i = 0; i < num_ports; i++, cur_port++) { + ret2 = ib_mad_port_close(device, cur_port); if (ret2) { - printk(KERN_ERR "Could not close port %d\n", i); + printk(KERN_ERR "Could not close port %d\n", cur_port); if (!ret) ret = ret2; } From iod00d at hp.com Fri Sep 10 18:12:07 2004 From: iod00d at hp.com (Grant Grundler) Date: Fri, 10 Sep 2004 18:12:07 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <1094857825.1794.735.camel@localhost.localdomain> References: <52mzzxdfqx.fsf@topspin.com> <20040910140235.5139e440.mshefty@ichips.intel.com> <20040910225032.GA29616@cup.hp.com> <1094857825.1794.735.camel@localhost.localdomain> Message-ID: <20040911011207.GE29616@cup.hp.com> On Fri, Sep 10, 2004 at 07:10:26PM -0400, Hal Rosenstock wrote: > > register/unregister is pretty obvious and follows other linux APIs. > > Try "fgrep register_driver include/linux/*" in linux-2.6 source tree. > > So should we can reg/dereg to register/deregister (in ib_mad.h) ? Sorry - I couldn't find function declarations with "_reg_" in ib_mad.h. I checked out the branch with: svn co https://openib.org/svn/gen2/branches/openib-candidate Am I looking in the wrong place? Two other functions are declared using _reg_ (in gsi.h and ib_verbs.h) and it's debatable if those should change. I did find a pile of other functions which use *_register. (e.g. find -name '*.h' | xargs fgrep register ) But this also turns up two "deregister" and not sure what to make of them: ib_core.h ib_device_notifier_deregister() rmpp_vendal.h rmpp_vendal_deregister() Overall, it's just easier to remember if all used register/unregister. And would be easier to find when folks need to look up the parameters. 
I can't assert those are sufficient reasons to change them. thanks, grant From halr at voltaire.com Fri Sep 10 18:25:28 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 10 Sep 2004 21:25:28 -0400 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <20040911011207.GE29616@cup.hp.com> References: <52mzzxdfqx.fsf@topspin.com> <20040910140235.5139e440.mshefty@ichips.intel.com> <20040910225032.GA29616@cup.hp.com> <1094857825.1794.735.camel@localhost.localdomain> <20040911011207.GE29616@cup.hp.com> Message-ID: <1094865928.1752.1093.camel@localhost.localdomain> On Fri, 2004-09-10 at 21:12, Grant Grundler wrote: > On Fri, Sep 10, 2004 at 07:10:26PM -0400, Hal Rosenstock wrote: > > > register/unregister is pretty obvious and follows other linux APIs. > > > Try "fgrep register_driver include/linux/*" in linux-2.6 source tree. > > > > So should we can reg/dereg to register/deregister (in ib_mad.h) ? > > Sorry - I couldn't find function declarations with "_reg_" > in ib_mad.h. > I checked out the branch with: > svn co https://openib.org/svn/gen2/branches/openib-candidate > > Am I looking in the wrong place? That's the right place. There should be ib_mad.h under src/linux-kernel/infiniband/include. This along with ib_verbs.h makes up the OpenIB access layer API. In there, there's ib_mad_reg/dereg. > Two other functions are declared using _reg_ (in gsi.h and ib_verbs.h) > and it's debatable if those should change. Most of this directory will be orphaned. That includes gsi.h. > > I did find a pile of other functions which use *_register. > (e.g. find -name '*.h' | xargs fgrep register ) > > But this also turns up two "deregister" and not sure > what to make of them: > ib_core.h ib_device_notifier_deregister() > rmpp_vendal.h rmpp_vendal_deregister() > > Overall, it's just easier to remember if all used register/unregister. > And would be easier to find when folks need to look up the parameters. 
> I can't assert those are sufficient reasons to change them. rmpp_vendal.h is to be orphaned. ib_core.h is a header so that the code for device insertion/removal could be implemented prior to integrating with mthca. This is going away. The only other file that will remain is ib_mad.c (and also possibly ib_verbs.c). Sorry for the confusion. -- Hal From halr at voltaire.com Fri Sep 10 18:37:56 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 10 Sep 2004 21:37:56 -0400 Subject: [openib-general] [PATCH] ib_mad.c: Eliminate potential race condition with thread semaphore Message-ID: <1094866675.1794.1122.camel@localhost.localdomain> ib_mad.c: Eliminate potential race condition with thread semaphore Index: ib_mad.c =================================================================== --- ib_mad.c (revision 781) +++ ib_mad.c (working copy) @@ -700,7 +700,6 @@ daemonize("ib_mad-%-6s-%-2d", priv->device->name, priv->port); unlock_kernel(); - sema_init(&thread_data->sem, 0); while (1) { if (down_interruptible(&thread_data->sem)) { printk(KERN_DEBUG "Exiting ib_mad thread\n"); @@ -723,6 +722,7 @@ { struct ib_mad_thread_data *thread_data = &priv->thread_data; + sema_init(&thread_data->sem, 0); thread_data->run = 1; kernel_thread(ib_mad_thread, priv, 0); } From iod00d at hp.com Fri Sep 10 18:43:25 2004 From: iod00d at hp.com (Grant Grundler) Date: Fri, 10 Sep 2004 18:43:25 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <1094865928.1752.1093.camel@localhost.localdomain> References: <52mzzxdfqx.fsf@topspin.com> <20040910140235.5139e440.mshefty@ichips.intel.com> <20040910225032.GA29616@cup.hp.com> <1094857825.1794.735.camel@localhost.localdomain> <20040911011207.GE29616@cup.hp.com> <1094865928.1752.1093.camel@localhost.localdomain> Message-ID: <20040911014325.GI29616@cup.hp.com> On Fri, Sep 10, 2004 at 09:25:28PM -0400, Hal Rosenstock wrote: > > That's the right place.
There should be ib_mad.h under > src/linux-kernel/infiniband/include. This along with ib_verbs.h makes up > the OpenIB access layer API. > > In there, there's ib_mad_reg/dereg. ah ok. I was looking for "_reg_" (with trailing underscore). Personally, I would rename those two to register/unregister since it's only the two functions. > ...This is going away. > > The only other file that will remain is ib_mad.c (and also possibly > ib_verbs.c). ok. > Sorry for the confusion. Until everyone is working on one branch, this confusion will persist. thanks, grant From halr at voltaire.com Fri Sep 10 18:57:17 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 10 Sep 2004 21:57:17 -0400 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <20040911014325.GI29616@cup.hp.com> References: <52mzzxdfqx.fsf@topspin.com> <20040910140235.5139e440.mshefty@ichips.intel.com> <20040910225032.GA29616@cup.hp.com> <1094857825.1794.735.camel@localhost.localdomain> <20040911011207.GE29616@cup.hp.com> <1094865928.1752.1093.camel@localhost.localdomain> <20040911014325.GI29616@cup.hp.com> Message-ID: <1094867837.1794.1155.camel@localhost.localdomain> On Fri, 2004-09-10 at 21:43, Grant Grundler wrote: > Until everyone is working on one branch, this confusion > will persist. Once the MAD layer is working, we will all finally be getting on one branch building up the initial components (mthca driver, access layer, SMI, SA client for IPoIB, and IPoIB). OpenSM will also start to be ported over to the OpenIB access layer once it is working. That will form the base on which to build other components (CM) and ULPs.
-- Hal From roland at topspin.com Fri Sep 10 19:25:04 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Sep 2004 19:25:04 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <000201c49781$2ee32530$655aa8c0@infiniconsys.com> (Fab Tillier's message of "Fri, 10 Sep 2004 14:57:35 -0700") References: <000201c49781$2ee32530$655aa8c0@infiniconsys.com> Message-ID: <52d60tbkzj.fsf@topspin.com> Fab> Great! I didn't see a way for a client to associate a Fab> context or some such thing with a device when the device is Fab> added. I would think this would be beneficial in order to Fab> avoid requiring clients to search a list for a matching Fab> device. I'm suggesting something like this: Hmm... it's an interesting idea but as far as I can see it just hides the linear search in the access layer. It feels to me like it ends up making the code more complicated for a pretty nominal gain. - Roland From roland at topspin.com Fri Sep 10 19:29:42 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Sep 2004 19:29:42 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <20040910140235.5139e440.mshefty@ichips.intel.com> (Sean Hefty's message of "Fri, 10 Sep 2004 14:02:35 -0700") References: <52mzzxdfqx.fsf@topspin.com> <20040910140235.5139e440.mshefty@ichips.intel.com> Message-ID: <528ybhbkrt.fsf@topspin.com> Sean> What use do you see for ib_dispatch_event? The low-level driver calls it when an unaffiliated event occurs. Walking the list of event handlers and calling each one is pretty trivial but it feels better to me to keep the details of how the list of event handlers is implemented encapsulated in the access layer. - R. 
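[Editorial illustration, not part of the original message.] The dispatch step described above, where the low-level driver hands an unaffiliated event to the access layer and the access layer walks its own list of handlers, might look roughly like the following userspace sketch. All demo_* names are hypothetical and do not reflect the proposed ib_dispatch_event signature; locking of the handler list, which a kernel implementation would need, is left out.

```c
#include <stddef.h>

/* Hypothetical event record; the field names are assumptions. */
struct demo_event {
	int type;
	int port;
};

/* One registered handler; the list linkage is private to the "access layer". */
struct demo_event_handler {
	void (*handler)(struct demo_event_handler *self,
			const struct demo_event *event);
	void *context;
	struct demo_event_handler *next;
};

struct demo_device {
	struct demo_event_handler *handlers;	/* head of the handler list */
};

/* Add a handler to the front of the device's handler list. */
void demo_register_event_handler(struct demo_device *dev,
				 struct demo_event_handler *h)
{
	h->next = dev->handlers;
	dev->handlers = h;
}

/*
 * Called by the "low-level driver": invoke every registered handler.
 * The driver never touches the list itself. Returns the number of
 * handlers that were called.
 */
int demo_dispatch_event(struct demo_device *dev,
			const struct demo_event *event)
{
	struct demo_event_handler *h;
	int called = 0;

	for (h = dev->handlers; h != NULL; h = h->next) {
		h->handler(h, event);
		called++;
	}
	return called;
}

/* Example handler: counts events in the int that context points to. */
void demo_count_handler(struct demo_event_handler *self,
			const struct demo_event *event)
{
	(void)event;
	(*(int *)self->context)++;
}
```

Because the driver only ever calls demo_dispatch_event(), the list could later become an array or an RCU-protected list without any driver changes, which is the encapsulation argument made in the message.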
From roland at topspin.com Fri Sep 10 19:34:26 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Sep 2004 19:34:26 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <20040910225032.GA29616@cup.hp.com> (Grant Grundler's message of "Fri, 10 Sep 2004 15:50:32 -0700") References: <52mzzxdfqx.fsf@topspin.com> <20040910140235.5139e440.mshefty@ichips.intel.com> <20040910225032.GA29616@cup.hp.com> Message-ID: <524qm5bkjx.fsf@topspin.com> Sean> ... use the names ib_reg_client/ib_dereg_client to better Sean> match the existing APIs. (Same comments with Sean> "register/unregister" names below.) Grant> register/unregister is pretty obvious and follows other Grant> linux APIs. Try "fgrep register_driver include/linux/*" in Grant> linux-2.6 source tree. Yeah, the use of register/unregister fully spelled out is pretty standard in the Linux kernel. (And the non-word "unregister" seems to be preferred to the non-word "deregister" as an antonym for register by a pretty wide margin). It makes sense to me to use ib_reg_phys_mr, ib_dereg_mr, etc as names of verbs since the IB spec uses the word "deregister" and the rest of the verb functions are pretty abbreviated. Registering a memory region seems somehow a little different from registering a callback or a client too. In any case I would like to stick with the ib_register_client function names. In fact in ib_mad.h maybe we should change ib_mad_reg and ib_mad_dereg to ib_register_mad_agent and ib_unregister_mad_agent (ib_mad_reg seems backwards from the rest of the API, where the verb comes before the noun -- eg ib_create_qp). - R. 
From roland at topspin.com Fri Sep 10 19:35:23 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Sep 2004 19:35:23 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <20040911011207.GE29616@cup.hp.com> (Grant Grundler's message of "Fri, 10 Sep 2004 18:12:07 -0700") References: <52mzzxdfqx.fsf@topspin.com> <20040910140235.5139e440.mshefty@ichips.intel.com> <20040910225032.GA29616@cup.hp.com> <1094857825.1794.735.camel@localhost.localdomain> <20040911011207.GE29616@cup.hp.com> Message-ID: <52zn3xa5xw.fsf@topspin.com> Grant> But this also turns up two "deregister" and not sure what Grant> to make of them: ib_core.h ib_device_notifier_deregister() This one is my old API, which I'm improving to ib_unregister_client... (not checked in yet, pending consensus) - R. From roland at topspin.com Fri Sep 10 19:36:25 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Sep 2004 19:36:25 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <000301c49789$5e247da0$655aa8c0@infiniconsys.com> (Fab Tillier's message of "Fri, 10 Sep 2004 15:56:10 -0700") References: <000301c49789$5e247da0$655aa8c0@infiniconsys.com> Message-ID: <52vfela5w6.fsf@topspin.com> Fab> Will clients be able to allocate HCA resources from the Fab> callback? Will this work if there's a lock held during Fab> callbacks? I'm not too familiar with the mthca code, but Fab> VAPI didn't let you do anything while holding a lock. Yes. I was a little vague about the type of lock -- it's a semaphore, which means it's safe to sleep while holding the lock. The only functions you can't call would be functions that deadlock against the same lock, ie register/deregister a client or a device. - R. From roland at topspin.com Fri Sep 10 19:38:52 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Sep 2004 19:38:52 -0700 Subject: [openib-general] semantics of process_mad? 
In-Reply-To: <1094850918.1746.474.camel@localhost.localdomain> (Hal Rosenstock's message of "Fri, 10 Sep 2004 17:15:19 -0400") References: <521xh9df3y.fsf@topspin.com> <1094850918.1746.474.camel@localhost.localdomain> Message-ID: <52r7p9a5s3.fsf@topspin.com> Hal> Are you saying to just make sure SM class is QP0 and all GS Hal> classes are not QP0 (QP1 is insufficient when redirection is Hal> supported) ? Because of redirection, it seems adding QP Hal> number as a parameter is a better solution. Actually I think redirection is a great argument for putting the filtering outside the low-level driver. How can the low-level driver know whether or not to respond to PMA requests on QP5? So I think the right answer is to put the tests for SM on QP0, GS on non-QP0 etc outside the low-level driver and in particular outside process_mad. Hal> I think the only difference with OpenIB right now is that a Hal> MAD right now cannot be multiply consumed as it can in the Hal> Topspin implementation. If the low-level driver returns IB_MAD_RESULT_CONSUMED that's the end of it in the Topspin driver (the multiple consumers come after the low-level driver has its shot at the MAD). From roland at topspin.com Fri Sep 10 19:40:42 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Sep 2004 19:40:42 -0700 Subject: [openib-general] semantics of process_mad? In-Reply-To: <000101c4977c$c6767f50$655aa8c0@infiniconsys.com> (Fab Tillier's message of "Fri, 10 Sep 2004 14:26:02 -0700") References: <000101c4977c$c6767f50$655aa8c0@infiniconsys.com> Message-ID: <52mzzxa5p1.fsf@topspin.com> Fab> I'm confused as to what the QP parameter would be used for. Fab> Can you clarify? To make sure we don't process e.g. MADs with SM class received on QP1. Fab> Why not have the out_mad be an output parameter? If the Fab> process_mad has output data, it would allocate a mad, fill it Fab> in, and return it. Just a random thought, so it may be Fab> totally stupid.
My objection to this is that it's generally a bad idea to have allocation take place in one layer and freeing happen in another layer. Which would imply that we would need to add another low-level driver entry point for "free response MAD," and that just seems a bit silly to me. - R. From halr at voltaire.com Sat Sep 11 04:22:35 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sat, 11 Sep 2004 07:22:35 -0400 Subject: [openib-general] ib_mad.c comments In-Reply-To: <524qm8gmxg.fsf@topspin.com> References: <20040908164739.3e9c8723.mshefty@ichips.intel.com> <524qm8gmxg.fsf@topspin.com> Message-ID: <1094901754.1752.1173.camel@localhost.localdomain> On Wed, 2004-09-08 at 23:07, Roland Dreier wrote: > Huh... I didn't even notice it was checked in... It was checked in for potential use at the SWG meeting. Thanks for the comments on the "early" code. They will help get us there more quickly. > anyway, my comments follow after some comments on Sean's comments: > > Sean> Need to lock when checking/setting version/class/methods. > > I agree for the initial implementation. Ultimately RCU seems better > but I would recommend sticking with locking to start with since it's > much easier to code correctly. I will put replacing locking with RCU on the futures list for this. > Sean> We should return the error code from ib_post_send in order > Sean> to handle overruns differently. > > What did we decide about how to handle someone posting more sends than > the underlying work queue can hold? Last I recall, we deferred this issue. Maybe we should just defer the implementation but decide what should be done. A conservative implementation would ensure all sends could be posted before allowing any of them to be posted. This has an extra cost to determine whether they all can be posted and doesn't keep the QP as full as possible. That's one end of the spectrum. What happens if only some of the sends get out initially ? The client would need to repost the remainder. 
If the remainder couldn't be posted soon enough, some timeout might occur. This seems like a more optimistic implementation. Does anyone see any issues with this ? Is it reasonable to start with this approach ? An intermediate approach would defer the sends which couldn't be posted. Perhaps there would be a timeout after which, if not all of the sends could be posted, it is treated as an error. > In any case I agree with this. Sent patch for this. > Sean> Should we avoid casting the list_head to a structure where > Sean> possible? > > Yes, definitely. It's much better to do something like > &mystruct->list rather than relying on the fact that mystruct has a > struct list_head as its first member. In fact the usage of list.h is > pretty broken throughout ib_mad.c, see below. Fixing this is on my short term TODO list. > Sean> Not sure why these calls search for the corresponding work request. > > Yes -- we know the next request to complete will always be the oldest > one we have around, right? Yes on the send side, but that's not the case on the receive side as there are posts for multiple QPs. Maybe there should be a list per QP and then this would be true, which eliminates the need to walk the list. The implementation is rapidly heading towards this. > Sean> Not sure if we need to support max_sge on the send queue. > Sean> This may be substantially larger than what we need. At a > Sean> minimum, I think that we need 2 for optimal RMPP support. > Sean> I'm not sure where the trade-off is between SGE versus > Sean> copying into a single buffer lies. > > I'm not sure there's much practical difference between copying and > using the HCA to do a gather/scatter on a buffer of size 256. Any idea at what buffer size there is a difference ? > The big difference is memory per WQE (at least for mthca): supporting the > max_sge means each WQE will be about 1 KB, while using a smaller > number means each WQE could be about 128 bytes. I fixed this and sent a patch.
> OK, my comments (which are based on only a quick read and therefore > focused mostly on low-level coding details): > > kmem_cache_t *ib_mad_cache; > > seems to be unused -- should be static anyway. I changed this to static. I will remove this if it remains unused. > static u32 ib_mad_client_id = 0; > > needs to be protected by a lock when used later This will be locked. Rather than a linked list, this will become an indexed table as this will be more efficient when looking up the client in the receive path. > #define IB_MAD_DEVICE_LIST_LOCK_VAR unsigned long ib_mad_device_list_sflags > #define IB_MAD_DEVICE_LIST_LOCK() spin_lock_irqsave(&ib_mad_device_list_lock, ib_mad_device_list_sflags) > #define IB_MAD_DEVICE_LIST_UNLOCK() spin_unlock_irqrestore(&ib_mad_device_list_lock, ib_mad_device_list_sflags) > > Don't use this idiom ... just use the spinlock functions directly. It > makes locking code harder to read and review, it leads to wasteful > stuff like the below (in ib_mad_reg()): > > IB_MAD_DEVICE_LIST_LOCK_VAR; > IB_MAD_AGENT_LIST_LOCK_VAR; > > and besides, Documentation/CodingStyle says > > "macros that depend on having a local variable with a magic name > might look like a good thing, but it's confusing as hell when one > reads the code and it's prone to breakage from seemingly innocent > changes." This is on my short term TODO list. > /* > * ib_mad_reg - Register to send/receive MADs. > * @device - The device to register with. > > Start with /** for kernel doc to pick this up. Might be better to put > it in a header file so that it's easier to find the documentation (but > it's OK to leave it in a .c). It's a copy of what is in ib_mad.h.
I will eliminate it from ib_mad.c > struct ib_mad_device_private *entry, *priv = NULL, > *head = (struct ib_mad_device_private *) &ib_mad_device_list; > > This definition of head is totally broken, since ib_mad_device_list is > declared as: > > static struct list_head ib_mad_device_list; > > so trying to use it as a struct ib_mad_device_private is just going > off into random memory. However there's no reason to even have a > variable named head, since it seems you only use it in: > > list_for_each(entry, head) { > > This really should be > > list_for_each_entry(entry, &ib_mad_device_list, list) { > > and the definition of struct ib_mad_device_private needs to be fixed from > > struct ib_mad_device_private { > struct ib_mad_device_private *next; > > to > > struct ib_mad_device_private { > struct list_head list; > > (you don't have to use the name list for your struct list_head member; > that's just my habit). > > list_for_each(entry2, head2) { > if (entry2->agent == mad_agent_priv->agent) { > list_del((struct list_head *)entry2); > break; > } > } > > This is broken for a couple of reasons: misuse of list_for_each as > just described; also, you can't delete items from a list while walking > through it with list_for_each (use list_for_each_safe instead); > finally, there's no reason to walk a list to find the entry you just > added in the same function -- just call list_del on the entry > directly, since you should still have it around. > > Pretty much all of these comments apply to all use of the list.h > macros in the file -- most look wrong. This is on my short term TODO list to fix this. > What context is it allowed to call ib_mad_post_send() from? We never > discussed this, but since the current implementation allocates work > requests with > > mad_send_wr = kmalloc(sizeof *mad_send_wr, GFP_KERNEL); > > right now it can only be called from process context with no locks > held. 
This seems like it violates the principle of least surprise, > because ib_post_send() can be called from any context. I will make ib_mad_post_send also callable from any context. > Also, the failure case > > if (!mad_send_wr) { > printk(KERN_ERR "No memory for ib_mad_send_wr_private\n"); > return -ENOMEM; > } > > needs to set bad_send_wr. Fixed. Sent patch for this. > ib_mad_recv_done_handler() seems to be missing a call to pci_unmap_single(). Yes, I didn't get that far (the receive path is incomplete). I will finish it this weekend. > static u8 convert_mgmt_class(struct ib_mad_reg_req *mad_reg_req) > { > u8 mgmt_class; > > /* Alias IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE to 0 */ > if (mad_reg_req->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { > mgmt_class = 0; > } else { > mgmt_class = mad_reg_req->mgmt_class; > } > return mgmt_class; > } > > I'd rewrite this as > > static inline u8 convert_mgmt_class(struct ib_mad_reg_req *mad_reg_req) > { > return mad_reg_req->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE ? > 0 : mad_reg_req->mgmt_class; > } > > or just open code it in the two places it's used. I chose the inline approach. Sent patch for this. > static int allocate_method_table(struct ib_mad_mgmt_method_table **method) > { > /* .. */ > return ENOMEM; > > probably should be -ENOMEM; I also found more that should have been negative. Sent patch for this. > static void ib_mad_completion_handler(struct ib_mad_device_private *priv) > { > > /* > * For stack overflow safety reason, WC is static here. > * This callback may not be called more than once at the same time. > */ > static struct ib_wc wc; > > Seems like a bad plan to me -- on an SMP machine with multiple HCAs > (or even multiple ports on a single HCA) it seems like we want to > multithread MAD processing rather than serializing it (In fact Yaron > has made a lot of noise about running on giant SGI NUMA machines with > millions of HCAs, where this looks especially bad). 
Also the comment > seems to be wrong -- there seems to be one thread per HCA so multiple > copies of the callback can run at once. There used to be a problem with not doing this with some old Linux kernels. I will eliminate the static WC and the comment. Sent patch for this. > static int ib_mad_thread(void *param) > { > struct ib_mad_device_private *priv = param; > struct ib_mad_thread_data *thread_data = &priv->thread_data; > > lock_kernel(); > daemonize("ib_mad-%-6s-%-2d", priv->device->name, priv->port); > unlock_kernel(); > > Just use kthread_create() to start your thread and handle all this. > Even though the current Topspin stack uses a MAD processing thread per > HCA, I'm not sure it's the best design. This design uses a thread per port. > Why do we need to defer work to process context? To minimize time spent in non process context. Is this an overly conservative approach ? > sema_init(&thread_data->sem, 0); > > Seems like a race condition here ... what happens if someone else > tries to use the semaphore before the thread has gotten a chance to > run? Eliminated this race by moving the semaphore init out of the thread itself and into the code that starts it. Sent patch. > In any case... > > while (1) { > if (down_interruptible(&thread_data->sem)) { > printk(KERN_DEBUG "Exiting ib_mad thread\n"); > break; > } > > I don't think it's a good idea to use a semaphore and signals to > control the worker thread. Better would be a wait queue and something > like wait_event(). Is there a problem with this or is this an efficiency issue ? > #define IB_MAD_DEVICE_SET_UP(__device__) {\ > IB_MAD_DEVICE_LIST_LOCK_VAR;\ > IB_MAD_DEVICE_LIST_LOCK();\ > (__device__)->up = 1;\ > IB_MAD_DEVICE_LIST_UNLOCK();} > > #define IB_MAD_DEVICE_SET_DOWN(__device__) {\ > IB_MAD_DEVICE_LIST_LOCK_VAR;\ > IB_MAD_DEVICE_LIST_LOCK();\ > (__device__)->up = 0;\ > IB_MAD_DEVICE_LIST_UNLOCK();} > > These don't seem to merit being macros.
If you really want they > could be inline functions but I don't see any use of the "up" member > outside of the macros anyway, so maybe you can just kill them. It > seems hard to think of how to test "up" in a way that's not racy. up was to be used to qualify whether to proceed with posting MADs to send. Why is it racy if the check of it also takes the lock ? > for (i = 0; i < num_ports; i++) { > ret = ib_mad_device_open(device, i); > > This is wrong -- for a CA you need to handle ports 1 ... num_ports, > while a switch just uses port 0. Fixed. Sent patch. -- Hal From halr at voltaire.com Sat Sep 11 05:02:19 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sat, 11 Sep 2004 08:02:19 -0400 Subject: [openib-general] [PATCH] ib_mad.c: Eliminate macro use Message-ID: <1094904139.1794.1176.camel@localhost.localdomain> ib_mad.c: Eliminate macro use Index: ib_mad.c =================================================================== --- ib_mad.c (revision 781) +++ ib_mad.c (working copy) @@ -74,28 +74,13 @@ * Locks */ -/* Device list lock */ +/* Port list lock */ static spinlock_t ib_mad_port_list_lock = SPIN_LOCK_UNLOCKED; -#define IB_MAD_PORT_LIST_LOCK_VAR unsigned long ib_mad_port_list_sflags -#define IB_MAD_PORT_LIST_LOCK() spin_lock_irqsave(&ib_mad_port_list_lock, ib_mad_port_list_sflags) -#define IB_MAD_PORT_LIST_UNLOCK() spin_unlock_irqrestore(&ib_mad_port_list_lock, ib_mad_port_list_sflags) /* Agent list lock */ static spinlock_t ib_mad_agent_list_lock = SPIN_LOCK_UNLOCKED; -#define IB_MAD_AGENT_LIST_LOCK_VAR unsigned long ib_mad_agent_list_sflags -#define IB_MAD_AGENT_LIST_LOCK() spin_lock_irqsave(&ib_mad_agent_list_lock, ib_mad_agent_list_sflags) -#define IB_MAD_AGENT_LIST_UNLOCK() spin_unlock_irqrestore(&ib_mad_agent_list_lock, ib_mad_agent_list_sflags) -/* Send and receive list locks */ -#define IB_MAD_SEND_LIST_LOCK_VAR unsigned long ib_mad_send_list_sflags -#define IB_MAD_SEND_LIST_LOCK(priv) spin_lock_irqsave(&priv->send_list_lock, 
ib_mad_send_list_sflags) -#define IB_MAD_SEND_LIST_UNLOCK(priv) spin_unlock_irqrestore(&priv->send_list_lock, ib_mad_send_list_sflags) -#define IB_MAD_RECV_LIST_LOCK_VAR unsigned long ib_mad_recv_list_sflags -#define IB_MAD_RECV_LIST_LOCK(priv) spin_lock_irqsave(&priv->recv_list_lock, ib_mad_recv_list_sflags) -#define IB_MAD_RECV_LIST_UNLOCK(priv) spin_unlock_irqrestore(&priv->recv_list_lock, ib_mad_recv_list_sflags) - - /* Forward declarations */ static u8 convert_mgmt_class(struct ib_mad_reg_req *mad_reg_req); static int is_method_in_use(struct ib_mad_mgmt_method_table **method, @@ -109,7 +94,7 @@ /* - * ib_mad_reg - Register to send/receive MADs. + * ib_mad_reg - Register to send/receive MADs */ struct ib_mad_agent *ib_mad_reg(struct ib_device *device, u8 port, @@ -130,8 +115,8 @@ struct ib_mad_mgmt_class_table *class; struct ib_mad_mgmt_method_table *method; int ret2; - IB_MAD_PORT_LIST_LOCK_VAR; - IB_MAD_AGENT_LIST_LOCK_VAR; + unsigned long ib_mad_port_list_sflags; + unsigned long ib_mad_agent_list_sflags; u8 mgmt_class; /* Validate parameters */ @@ -170,14 +155,14 @@ } /* Validate device and port */ - IB_MAD_PORT_LIST_LOCK(); + spin_lock_irqsave(&ib_mad_port_list_lock, ib_mad_port_list_sflags); list_for_each(entry, head) { if (entry->device == device && entry->port == port) { priv = entry; break; } } - IB_MAD_PORT_LIST_UNLOCK(); + spin_unlock_irqrestore(&ib_mad_port_list_lock, ib_mad_port_list_sflags); if (!priv) { ret = ERR_PTR(-ENODEV); goto error1; @@ -233,9 +218,9 @@ mad_agent->hi_tid = ++ib_mad_client_id; /* Add to mad agent list */ - IB_MAD_AGENT_LIST_LOCK(); + spin_lock_irqsave(&ib_mad_agent_list_lock, ib_mad_agent_list_sflags); list_add_tail((struct list_head *) mad_agent_priv, &ib_mad_agent_list); - IB_MAD_AGENT_LIST_UNLOCK(); + spin_unlock_irqrestore(&ib_mad_agent_list_lock, ib_mad_agent_list_sflags); ret2 = add_mad_reg_req(mad_reg_req, mad_agent_priv); if (ret2) { @@ -247,14 +232,14 @@ error3: /* Remove from mad agent list */ - 
IB_MAD_AGENT_LIST_LOCK(); + spin_lock_irqsave(&ib_mad_agent_list_lock, ib_mad_agent_list_sflags); list_for_each(entry2, head2) { if (entry2->agent == mad_agent_priv->agent) { list_del((struct list_head *)entry2); break; } } - IB_MAD_AGENT_LIST_UNLOCK(); + spin_unlock_irqrestore(&ib_mad_agent_list_lock, ib_mad_agent_list_sflags); kfree(reg_req); error2: kfree(mad_agent); @@ -265,15 +250,15 @@ EXPORT_SYMBOL(ib_mad_reg); /* - * ib_mad_dereg - Deregisters a client from using MAD services. + * ib_mad_dereg - Deregisters a client from using MAD services */ int ib_mad_dereg(struct ib_mad_agent *mad_agent) { struct ib_mad_agent_private *entry, *head = (struct ib_mad_agent_private *)&ib_mad_agent_list; - IB_MAD_AGENT_LIST_LOCK_VAR; + unsigned long ib_mad_agent_list_sflags; - IB_MAD_AGENT_LIST_LOCK(); + spin_lock_irqsave(&ib_mad_agent_list_lock, ib_mad_agent_list_sflags); list_for_each(entry, head) { if (entry->agent == mad_agent) { remove_mad_reg_req(entry); @@ -285,7 +270,7 @@ break; } } - IB_MAD_AGENT_LIST_UNLOCK(); + spin_unlock_irqrestore(&ib_mad_agent_list_lock, ib_mad_agent_list_sflags); return 0; } @@ -293,7 +278,7 @@ /* * ib_mad_post_send - Posts MAD(s) to the send queue of the QP associated - * with the registered client. + * with the registered client */ int ib_mad_post_send(struct ib_mad_agent *mad_agent, struct ib_send_wr *send_wr, @@ -304,7 +289,7 @@ struct ib_send_wr wr; struct ib_send_wr *bad_wr; struct ib_mad_send_wr_private *mad_send_wr; - IB_MAD_SEND_LIST_LOCK_VAR; + unsigned long ib_mad_send_list_sflags; cur_send_wr = send_wr; /* Validate supplied parameters */ @@ -343,17 +328,17 @@ wr.send_flags = IB_SEND_SIGNALED; /* cur_send_wr->send_flags ? 
*/ /* Link send WR into posted send MAD list */ - IB_MAD_SEND_LIST_LOCK(((struct ib_mad_port_private *)mad_agent->device->mad)); + spin_lock_irqsave(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, ib_mad_send_list_sflags); list_add_tail((struct list_head *)mad_send_wr, &((struct ib_mad_port_private *)mad_agent->device->mad)->send_posted_mad_list); - IB_MAD_SEND_LIST_UNLOCK(((struct ib_mad_port_private *)mad_agent->device->mad)); + spin_unlock_irqrestore(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, ib_mad_send_list_sflags); ret = ib_post_send(mad_agent->qp, &wr, &bad_wr); if (ret) { /* Unlink from posted send MAD list */ - IB_MAD_SEND_LIST_LOCK(((struct ib_mad_port_private *)mad_agent->device->mad)); + spin_lock_irqsave(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, ib_mad_send_list_sflags); list_del((struct list_head *)send_wr); - IB_MAD_SEND_LIST_UNLOCK(((struct ib_mad_port_private *)mad_agent->device->mad)); + spin_unlock_irqrestore(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, ib_mad_send_list_sflags); *bad_send_wr = cur_send_wr; printk(KERN_NOTICE "ib_mad_post_send failed\n"); return ret; @@ -562,10 +547,10 @@ struct ib_mad_private_header *entry, *head = (struct ib_mad_private_header *)&priv->recv_posted_mad_list; struct ib_mad_private *recv = NULL; - IB_MAD_RECV_LIST_LOCK_VAR; + unsigned long ib_mad_recv_list_sflags; /* Find entry on posted MAD receive list which corresponds to this completion */ - IB_MAD_RECV_LIST_LOCK(priv); + spin_lock_irqsave(&priv->recv_list_lock, ib_mad_recv_list_sflags); list_for_each(entry, head) { if ((unsigned long)entry == wc->wr_id) { recv = (struct ib_mad_private *)entry; @@ -574,7 +559,7 @@ break; } } - IB_MAD_RECV_LIST_UNLOCK(priv); + spin_unlock_irqrestore(&priv->recv_list_lock, ib_mad_recv_list_sflags); if (!recv) { printk(KERN_ERR "No matching posted receive WR 0x%Lx\n", wc->wr_id); } @@ -610,10 +595,10 @@ { struct
ib_mad_send_wr_private *entry, *send_wr = NULL, *head = (struct ib_mad_send_wr_private *)&priv->send_posted_mad_list; - IB_MAD_SEND_LIST_LOCK_VAR; + unsigned long ib_mad_send_list_sflags; /* Find entry on posted MAD send list which corresponds to this completion */ - IB_MAD_SEND_LIST_LOCK(priv); + spin_lock_irqsave(&priv->send_list_lock, ib_mad_send_list_sflags); list_for_each(entry, head) { if (entry->wr_id == wc->wr_id) { send_wr = entry; @@ -622,7 +607,7 @@ break; } } - IB_MAD_SEND_LIST_UNLOCK(priv); + spin_unlock_irqrestore(&priv->send_list_lock, ib_mad_send_list_sflags); if (!send_wr) { printk(KERN_ERR "No matching posted send WR 0x%Lx\n", wc->wr_id); } else { @@ -700,7 +685,6 @@ daemonize("ib_mad-%-6s-%-2d", priv->device->name, priv->port); unlock_kernel(); - sema_init(&thread_data->sem, 0); while (1) { if (down_interruptible(&thread_data->sem)) { printk(KERN_DEBUG "Exiting ib_mad thread\n"); @@ -723,6 +707,7 @@ { struct ib_mad_thread_data *thread_data = &priv->thread_data; + sema_init(&thread_data->sem, 0); thread_data->run = 1; kernel_thread(ib_mad_thread, priv, 0); } @@ -761,7 +746,7 @@ struct ib_sge sg_list; struct ib_recv_wr recv_wr; struct ib_recv_wr *bad_recv_wr; - IB_MAD_RECV_LIST_LOCK_VAR; + unsigned long ib_mad_recv_list_sflags; /* Allocate memory for receive MAD (and private header) */ mad_priv = kmalloc(sizeof *mad_priv, GFP_KERNEL); @@ -787,18 +772,18 @@ recv_wr.wr_id = (unsigned long)mad_priv; /* Link receive WR into posted receive MAD list */ - IB_MAD_RECV_LIST_LOCK(priv); + spin_lock_irqsave(&priv->recv_list_lock, ib_mad_recv_list_sflags); list_add_tail((struct list_head *)mad_priv, &priv->recv_posted_mad_list); - IB_MAD_RECV_LIST_UNLOCK(priv); + spin_unlock_irqrestore(&priv->recv_list_lock, ib_mad_recv_list_sflags); pci_unmap_addr_set(&mad_priv->header.buf, mapping, sg_list.addr); /* Now, post receive WR */ if (ib_post_recv(qp, &recv_wr, &bad_recv_wr)) { /* Unlink from posted receive MAD list */ - IB_MAD_RECV_LIST_LOCK(priv); + 
spin_lock_irqsave(&priv->recv_list_lock, ib_mad_recv_list_sflags); list_del((struct list_head *)mad_priv); - IB_MAD_RECV_LIST_UNLOCK(priv); + spin_unlock_irqrestore(&priv->recv_list_lock, ib_mad_recv_list_sflags); pci_unmap_single(priv->device->dma_device, pci_unmap_addr(&mad_priv->header.buf, mapping), @@ -835,16 +820,16 @@ */ static void ib_mad_return_posted_recv_mads(struct ib_mad_port_private *priv) { - IB_MAD_RECV_LIST_LOCK_VAR; + unsigned long ib_mad_recv_list_sflags; /* PCI mapping ? */ - IB_MAD_RECV_LIST_LOCK(priv); + spin_lock_irqsave(&priv->recv_list_lock, ib_mad_recv_list_sflags); while (!list_empty(&priv->recv_posted_mad_list)) { } INIT_LIST_HEAD(&priv->recv_posted_mad_list); - IB_MAD_RECV_LIST_UNLOCK(priv); + spin_unlock_irqrestore(&priv->recv_list_lock, ib_mad_recv_list_sflags); } /* @@ -852,17 +837,17 @@ */ static void ib_mad_return_posted_send_mads(struct ib_mad_port_private *priv) { - IB_MAD_SEND_LIST_LOCK_VAR; + unsigned long ib_mad_send_list_sflags; /* PCI mapping ? */ - IB_MAD_SEND_LIST_LOCK(priv); + spin_lock_irqsave(&priv->send_list_lock, ib_mad_send_list_sflags); while (!list_empty(&priv->send_posted_mad_list)) { list_del(priv->send_posted_mad_list.next); /* Call completion handler ? 
*/ } INIT_LIST_HEAD(&priv->send_posted_mad_list); - IB_MAD_SEND_LIST_UNLOCK(priv); + spin_unlock_irqrestore(&priv->send_list_lock, ib_mad_send_list_sflags); } /* @@ -981,24 +966,13 @@ return ret; } -#define IB_MAD_PORT_SET_UP(__port__) {\ - IB_MAD_PORT_LIST_LOCK_VAR;\ - IB_MAD_PORT_LIST_LOCK();\ - (__port__)->up = 1;\ - IB_MAD_PORT_LIST_UNLOCK();} - -#define IB_MAD_PORT_SET_DOWN(__port__) {\ - IB_MAD_PORT_LIST_LOCK_VAR;\ - IB_MAD_PORT_LIST_LOCK();\ - (__port__)->up = 0;\ - IB_MAD_PORT_LIST_UNLOCK();} - /* * Start the port */ static int ib_mad_port_start(struct ib_mad_port_private *priv) { int ret, i; + unsigned long ib_mad_port_list_sflags; for (i = 0; i < 2; i++) { ret = ib_mad_change_qp_state_to_init(priv->qp[i], priv->port); @@ -1034,8 +1008,9 @@ } } - IB_MAD_PORT_SET_UP(priv); - + spin_lock_irqsave(&ib_mad_port_list_lock, ib_mad_port_list_sflags); + priv->up = 1; + spin_unlock_irqrestore(&ib_mad_port_list_lock, ib_mad_port_list_sflags); return 0; error: ib_mad_return_posted_recv_mads(priv); @@ -1052,8 +1027,11 @@ static void ib_mad_port_stop(struct ib_mad_port_private *priv) { int i; + unsigned long ib_mad_port_list_sflags; - IB_MAD_PORT_SET_DOWN(priv); + spin_lock_irqsave(&ib_mad_port_list_lock, ib_mad_port_list_sflags); + priv->up = 0; + spin_unlock_irqrestore(&ib_mad_port_list_lock, ib_mad_port_list_sflags); for (i = 0; i < 2; i++) { ib_mad_change_qp_state_to_reset(priv->qp[i]); @@ -1096,17 +1074,17 @@ struct ib_qp_cap qp_cap; struct ib_mad_port_private *entry, *priv = NULL, *head = (struct ib_mad_port_private *) &ib_mad_port_list; - IB_MAD_PORT_LIST_LOCK_VAR; + unsigned long ib_mad_port_list_sflags; /* First, check if port already open at MAD layer */ - IB_MAD_PORT_LIST_LOCK(); + spin_lock_irqsave(&ib_mad_port_list_lock, ib_mad_port_list_sflags); list_for_each(entry, head) { if (entry->device == device && entry->port == port) { priv = entry; break; } } - IB_MAD_PORT_LIST_UNLOCK(); + spin_unlock_irqrestore(&ib_mad_port_list_lock, ib_mad_port_list_sflags); if 
(priv) { printk(KERN_DEBUG "Port already open\n"); return 0; @@ -1191,9 +1169,9 @@ goto error8; } - IB_MAD_PORT_LIST_LOCK(); + spin_lock_irqsave(&ib_mad_port_list_lock, ib_mad_port_list_sflags); list_add_tail((struct list_head *)priv, &ib_mad_port_list); - IB_MAD_PORT_LIST_UNLOCK(); + spin_unlock_irqrestore(&ib_mad_port_list_lock, ib_mad_port_list_sflags); return 0; @@ -1222,9 +1200,9 @@ { struct ib_mad_port_private *entry, *priv = NULL, *head = (struct ib_mad_port_private *)&ib_mad_port_list; - IB_MAD_PORT_LIST_LOCK_VAR; + unsigned long ib_mad_port_list_sflags; - IB_MAD_PORT_LIST_LOCK(); + spin_lock_irqsave(&ib_mad_port_list_lock, ib_mad_port_list_sflags); list_for_each(entry, head) { if (entry->device == device && entry->port == port) { priv = entry; @@ -1234,12 +1212,12 @@ if (priv == NULL) { printk(KERN_ERR "Port not found\n"); - IB_MAD_PORT_LIST_UNLOCK(); + spin_unlock_irqrestore(&ib_mad_port_list_lock, ib_mad_port_list_sflags); return -ENODEV; } list_del((struct list_head *)priv); - IB_MAD_PORT_LIST_UNLOCK(); + spin_unlock_irqrestore(&ib_mad_port_list_lock, ib_mad_port_list_sflags); ib_mad_port_stop(priv); ib_mad_thread_stop(priv); From roland at topspin.com Sat Sep 11 08:54:49 2004 From: roland at topspin.com (Roland Dreier) Date: Sat, 11 Sep 2004 08:54:49 -0700 Subject: [openib-general] ib_mad.c comments In-Reply-To: <1094901754.1752.1173.camel@localhost.localdomain> (Hal Rosenstock's message of "Sat, 11 Sep 2004 07:22:35 -0400") References: <20040908164739.3e9c8723.mshefty@ichips.intel.com> <524qm8gmxg.fsf@topspin.com> <1094901754.1752.1173.camel@localhost.localdomain> Message-ID: <52isakajhy.fsf@topspin.com> Roland> What did we decide about how to handle someone posting Roland> more sends than the underlying work queue can hold? Hal> Last I recall, we deferred this issue. Maybe we should just Hal> defer the implementation but decide what should be done. I don't think we can really defer it. 
It's pretty ugly for consumers to have to worry about overflowing a queue that they aren't the only ones posting to and that they don't know the length of. My vote would be to have another queue of pending MADs that get sent as previous sends complete. However dropping (and returning a fake successful completion) might be OK too. Roland> Yes -- we know the next request to complete will always be Roland> the oldest one we have around, right? Hal> On the send side but that's not the case on the receive side Hal> as there are posts for multiple QPs. Maybe there should be a Hal> list per QP and then this would be true which eliminates the Hal> need to walk the list. The implementation is rapidly heading Hal> towards this. Good -- it seems silly to throw away the information about which queue a receive was posted to. Roland> I'm not sure there's much practical difference between Roland> copying and using the HCA to do a gather/scatter on a Roland> buffer of size 256. Hal> Any idea at what buffer size there is a difference ? You probably won't see much difference until you get to a buffer size where copying becomes expensive. I'm thinking multiple KB. Roland> static u32 ib_mad_client_id = 0; Hal> This will be locked. Rather than a linked list, this will Hal> become an indexed table as this will be more efficient when Hal> looking up the client in the receive path. Hmm... seems like we wouldn't want a static limit on the table size. Maybe we should use to handle allocating the 32-bit IDs. Hal> There used to be a problem with not doing this with some old Hal> Linux kernels. I will eliminate the static WC and the Hal> comment. Sent patch for this. What was the problem? If anything with the 4K stack option on i386 for kernel 2.6 we need to be even more aware of how much stack space we use. Hal> To minimize time spent in non process context. Is this an Hal> overly conservative approach ? It's probably OK for now. 
Probably the correct answer is to use tasklets and run in softirq context though (since process context can be starved for an arbitrarily long time). Roland> I don't think it's a good idea to use a semaphore and Roland> signals to control the worker thread. Better would be a Roland> wait queue and something like wait_event(). Hal> Is there a problem with this or is this an efficiency issue ? Using signals tends to be a bad idea because your thread can get killed when init sends SIGKILL to all processes while shutting down, and then if you need to send MADs later, say to unmount IB-attached storage, you're in trouble. The semaphore is not so good because if 100 wakeups come before the thread runs, it will wake up 100 times in a row (even if it does all the work on the first iteration). Plus it's just not idiomatic in the kernel so it makes the code harder to review and maintain. Roland> These don't seem to merit being macros. If you really Roland> want they could be inline functions but I don't see any Roland> use of the "up" member outside of the macros anyway, so Roland> maybe you can just kill them. It seems hard to think of Roland> how to test "up" in a way that's not racy. Hal> up was to be used to qualify whether to proceed with posting Hal> MADs to send. Why is it racy if the check of it also takes Hal> the lock ? Well, you have to be careful of

    context #1                      context #2

    lock
    see that up is set
    unlock
                                    lock
                                    clear up
                                    unlock
    do work relying on up
    being set...oops

Basically it means that anything that relies on up being set has to hold the lock across the whole operation. This includes posting sends I assume, which means that the lock has to be a spinlock (since the send posting can happen in interrupt context). But that means you can't do any operations that sleep and rely on up staying consistent. So it starts to look messy to me (it does seem doable but there may be details I'm missing). - R.
From roland at topspin.com Sat Sep 11 08:57:51 2004 From: roland at topspin.com (Roland Dreier) Date: Sat, 11 Sep 2004 08:57:51 -0700 Subject: [openib-general] [PATCH] ib_mad.c: Eliminate macro use In-Reply-To: <1094904139.1794.1176.camel@localhost.localdomain> (Hal Rosenstock's message of "Sat, 11 Sep 2004 08:02:19 -0400") References: <1094904139.1794.1176.camel@localhost.localdomain> Message-ID: <52ekl8ajcw.fsf@topspin.com> Hal> ib_mad.c: Eliminate macro use This is definitely good... Hal> + unsigned long ib_mad_port_list_sflags; Hal> + unsigned long ib_mad_agent_list_sflags; but you never need more than one flags variable (you can reuse the same variable for different locks, and if you're nesting locks, the inner lock doesn't need a flags variable since you know for sure IRQs are already disabled). It's not a big deal but it's wasting some stack space for no reason. It's probably better just to follow kernel idiom and call the variable "flags" rather than putting the lock name in as well. - R. From ftillier at infiniconsys.com Sat Sep 11 10:27:14 2004 From: ftillier at infiniconsys.com (Fab Tillier) Date: Sat, 11 Sep 2004 10:27:14 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <52d60tbkzj.fsf@topspin.com> Message-ID: <000401c49824$95505150$655aa8c0@infiniconsys.com> > From: Roland Dreier [mailto:roland at topspin.com] > Sent: Friday, September 10, 2004 7:25 PM > > Fab> Great! I didn't see a way for a client to associate a > Fab> context or some such thing with a device when the device is > Fab> added. I would think this would be beneficial in order to > Fab> avoid requiring clients to search a list for a matching > Fab> device. I'm suggesting something like this: > > Hmm... it's an interesting idea but as far as I can see it just hides > the linear search in the access layer. It feels to me like it ends up > making the code more complicated for a pretty nominal gain. 
> Like Sean said, if the common case is going to be a single device, then the gain is likely not worth the complication in the code. I don't feel really strongly about it, it was just a thought based on how the device related notifications worked in IBAL, which I found to be useful. No biggie either way, and it should be simple enough to change down the road if we see a need. - Fab From halr at voltaire.com Sat Sep 11 11:12:17 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sat, 11 Sep 2004 14:12:17 -0400 Subject: [openib-general] ib_mad.c comments In-Reply-To: <52isakajhy.fsf@topspin.com> References: <20040908164739.3e9c8723.mshefty@ichips.intel.com> <524qm8gmxg.fsf@topspin.com> <1094901754.1752.1173.camel@localhost.localdomain> <52isakajhy.fsf@topspin.com> Message-ID: <1094926336.1794.1259.camel@localhost.localdomain> On Sat, 2004-09-11 at 11:54, Roland Dreier wrote: > Roland> static u32 ib_mad_client_id = 0; > > Hal> This will be locked. Rather than a linked list, this will > Hal> become an indexed table as this will be more efficient when > Hal> looking up the client in the receive path. > > Hmm... seems like we wouldn't want a static limit on the table size. > Maybe we should use to handle allocating the 32-bit IDs. There will be a static limit to the number of clients. We could either use this as a direct index (or have an additional level of indirection) and have the client IDs be from a larger ID space. As I don't see a benefit to this right now (I don't think clients will be very dynamic), I will start with the direct mapping approach. It can easily be changed if this is the wrong decision. > Hal> There used to be a problem with not doing this with some old > Hal> Linux kernels. I will eliminate the static WC and the > Hal> comment. Sent patch for this. > > What was the problem? I don't know the exact problem, just what I wrote about the "origin" of this. The precise origin of this has not been explained to me (and it is not worth chasing down). 
> Hal> To minimize time spent in non process context. Is this an > Hal> overly conservative approach ? > > It's probably OK for now. Probably the correct answer is to use > tasklets and run in softirq context though (since process context can > be starved for an arbitrarily long time). OK. I will put changing this on the futures list for now. > Roland> I don't think it's a good idea to use a semaphore and > Roland> signals to control the worker thread. Better would be a > Roland> wait queue and something like wait_event(). > > Hal> Is there a problem with this or is this an efficiency issue ? > > Using signals tends to be a bad idea because your thread can get > killed when init sends SIGKILL to all processes while shutting down, > and then if you need to send MADs later, say to unmount IB-attached > storage, you're in trouble. The semaphore is not so good because if > 100 wakeups come before the thread runs, it will wake up 100 times in > a row (even if it does all the work on the first iteration). Plus > it's just not idiomatic in the kernel so it makes the code harder to > review and maintain. Changing this is now on my short term list. > Roland> These don't seem to merit being macros. If you really > Roland> want they could be inline functions but I don't see any > Roland> use of the "up" member outside of the macros anyway, so > Roland> maybe you can just kill them. It seems hard to think of > Roland> how to test "up" in a way that's not racy. > > Hal> up was to be used to qualify whether to proceed with posting > Hal> MADs to send. Why is it racy if the check of it also takes > Hal> the lock ? > > Well, you have to be careful of > > context #1 context #2 > > lock > see that up is set > unlock > lock > clear up > unlock > do work relying on up > being set...oops > > Basically it means that anything that relies on up being set has to > hold the lock across the whole operation.
This includes posting sends > I assume, which means that the lock has to be a spinlock (since the > send posting can happen in interrupt context). But that means you > can't do any operations that sleep and rely on up staying consistent. > So it starts to look messy to me (it does seem doable but there may be > details I'm missing). Got it. It will be removed. I think it means there may be some other cleanup cases to deal with but they might have been there anyway. Thanks. -- Hal From halr at voltaire.com Sat Sep 11 11:25:04 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sat, 11 Sep 2004 14:25:04 -0400 Subject: [openib-general] [PATCH] ib_mad.c: Consolidate flags variables for spinlocks Message-ID: <1094927104.1794.1263.camel@localhost.localdomain> ib_mad.c: Consolidate flags variables for spinlocks Also, update TODO list (OpenIB MAD and Old GSI sections) Index: ib_mad.c =================================================================== --- ib_mad.c (revision 783) +++ ib_mad.c (working copy) @@ -115,8 +115,7 @@ struct ib_mad_mgmt_class_table *class; struct ib_mad_mgmt_method_table *method; int ret2; - unsigned long ib_mad_port_list_sflags; - unsigned long ib_mad_agent_list_sflags; + unsigned long flags; u8 mgmt_class; /* Validate parameters */ @@ -155,14 +154,14 @@ } /* Validate device and port */ - spin_lock_irqsave(&ib_mad_port_list_lock, ib_mad_port_list_sflags); + spin_lock_irqsave(&ib_mad_port_list_lock, flags); list_for_each(entry, head) { if (entry->device == device && entry->port == port) { priv = entry; break; } } - spin_unlock_irqrestore(&ib_mad_port_list_lock, ib_mad_port_list_sflags); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); if (!priv) { ret = ERR_PTR(-ENODEV); goto error1; @@ -218,9 +217,9 @@ mad_agent->hi_tid = ++ib_mad_client_id; /* Add to mad agent list */ - spin_lock_irqsave(&ib_mad_agent_list_lock, ib_mad_agent_list_sflags); + spin_lock_irqsave(&ib_mad_agent_list_lock, flags); list_add_tail((struct list_head *) 
mad_agent_priv, &ib_mad_agent_list); - spin_unlock_irqrestore(&ib_mad_agent_list_lock, ib_mad_agent_list_sflags); + spin_unlock_irqrestore(&ib_mad_agent_list_lock, flags); ret2 = add_mad_reg_req(mad_reg_req, mad_agent_priv); if (ret2) { @@ -232,14 +231,14 @@ error3: /* Remove from mad agent list */ - spin_lock_irqsave(&ib_mad_agent_list_lock, ib_mad_agent_list_sflags); + spin_lock_irqsave(&ib_mad_agent_list_lock, flags); list_for_each(entry2, head2) { if (entry2->agent == mad_agent_priv->agent) { list_del((struct list_head *)entry2); break; } } - spin_unlock_irqrestore(&ib_mad_agent_list_lock, ib_mad_agent_list_sflags); + spin_unlock_irqrestore(&ib_mad_agent_list_lock, flags); kfree(reg_req); error2: kfree(mad_agent); @@ -256,9 +255,9 @@ { struct ib_mad_agent_private *entry, *head = (struct ib_mad_agent_private *)&ib_mad_agent_list; - unsigned long ib_mad_agent_list_sflags; + unsigned long flags; - spin_lock_irqsave(&ib_mad_agent_list_lock, ib_mad_agent_list_sflags); + spin_lock_irqsave(&ib_mad_agent_list_lock, flags); list_for_each(entry, head) { if (entry->agent == mad_agent) { remove_mad_reg_req(entry); @@ -270,7 +269,7 @@ break; } } - spin_unlock_irqrestore(&ib_mad_agent_list_lock, ib_mad_agent_list_sflags); + spin_unlock_irqrestore(&ib_mad_agent_list_lock, flags); return 0; } @@ -289,7 +288,7 @@ struct ib_send_wr wr; struct ib_send_wr *bad_wr; struct ib_mad_send_wr_private *mad_send_wr; - unsigned long ib_mad_send_list_sflags; + unsigned long flags; cur_send_wr = send_wr; /* Validate supplied parameters */ @@ -328,17 +327,17 @@ wr.send_flags = IB_SEND_SIGNALED; /* cur_send_wr->send_flags ? 
*/ /* Link send WR into posted send MAD list */ - spin_lock_irqsave(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, ib_mad_send_list_sflags); + spin_lock_irqsave(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, flags); list_add_tail((struct list_head *)mad_send_wr, &((struct ib_mad_port_private *)mad_agent->device->mad)->send_posted_mad_list); - spin_unlock_irqrestore(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, ib_mad_send_list_sflags); + spin_unlock_irqrestore(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, flags); ret = ib_post_send(mad_agent->qp, &wr, &bad_wr); if (ret) { /* Unlink from posted send MAD list */ - spin_unlock_irqrestore(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, ib_mad_send_list_sflags); + spin_lock_irqsave(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, flags); list_del((struct list_head *)send_wr); - spin_unlock_irqrestore(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, ib_mad_send_list_sflags); + spin_unlock_irqrestore(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, flags); *bad_send_wr = cur_send_wr; printk(KERN_NOTICE "ib_mad_post_send failed\n"); return ret; @@ -547,10 +546,10 @@ struct ib_mad_private_header *entry, *head = (struct ib_mad_private_header *)&priv->recv_posted_mad_list; struct ib_mad_private *recv = NULL; - unsigned long ib_mad_recv_list_sflags; + unsigned long flags; /* Find entry on posted MAD receive list which corresponds to this completion */ - spin_lock_irqsave(&priv->recv_list_lock, ib_mad_recv_list_sflags); + spin_lock_irqsave(&priv->recv_list_lock, flags); list_for_each(entry, head) { if ((unsigned long)entry == wc->wr_id) { recv = (struct ib_mad_private *)entry; @@ -559,7 +558,7 @@ break; } } - spin_unlock_irqrestore(&priv->recv_list_lock, ib_mad_recv_list_sflags); +
spin_unlock_irqrestore(&priv->recv_list_lock, flags); if (!recv) { printk(KERN_ERR "No matching posted receive WR 0x%Lx\n", wc->wr_id); } @@ -595,10 +594,10 @@ { struct ib_mad_send_wr_private *entry, *send_wr = NULL, *head = (struct ib_mad_send_wr_private *)&priv->send_posted_mad_list; - unsigned long ib_mad_send_list_sflags; + unsigned long flags; /* Find entry on posted MAD send list which corresponds to this completion */ - spin_lock_irqsave(&priv->send_list_lock, ib_mad_send_list_sflags); + spin_lock_irqsave(&priv->send_list_lock, flags); list_for_each(entry, head) { if (entry->wr_id == wc->wr_id) { send_wr = entry; @@ -607,7 +606,7 @@ break; } } - spin_unlock_irqrestore(&priv->send_list_lock, ib_mad_send_list_sflags); + spin_unlock_irqrestore(&priv->send_list_lock, flags); if (!send_wr) { printk(KERN_ERR "No matching posted send WR 0x%Lx\n", wc->wr_id); } else { @@ -746,7 +745,7 @@ struct ib_sge sg_list; struct ib_recv_wr recv_wr; struct ib_recv_wr *bad_recv_wr; - unsigned long ib_mad_recv_list_sflags; + unsigned long flags; /* Allocate memory for receive MAD (and private header) */ mad_priv = kmalloc(sizeof *mad_priv, GFP_KERNEL); @@ -772,18 +771,18 @@ recv_wr.wr_id = (unsigned long)mad_priv; /* Link receive WR into posted receive MAD list */ - spin_lock_irqsave(&priv->recv_list_lock, ib_mad_recv_list_sflags); + spin_lock_irqsave(&priv->recv_list_lock, flags); list_add_tail((struct list_head *)mad_priv, &priv->recv_posted_mad_list); - spin_unlock_irqrestore(&priv->recv_list_lock, ib_mad_recv_list_sflags); + spin_unlock_irqrestore(&priv->recv_list_lock, flags); pci_unmap_addr_set(&mad_priv->header.buf, mapping, sg_list.addr); /* Now, post receive WR */ if (ib_post_recv(qp, &recv_wr, &bad_recv_wr)) { /* Unlink from posted receive MAD list */ - spin_lock_irqsave(&priv->recv_list_lock, ib_mad_recv_list_sflags); + spin_lock_irqsave(&priv->recv_list_lock, flags); list_del((struct list_head *)mad_priv); - spin_unlock_irqrestore(&priv->recv_list_lock, 
ib_mad_recv_list_sflags); + spin_unlock_irqrestore(&priv->recv_list_lock, flags); pci_unmap_single(priv->device->dma_device, pci_unmap_addr(&mad_priv->header.buf, mapping), @@ -820,16 +819,16 @@ */ static void ib_mad_return_posted_recv_mads(struct ib_mad_port_private *priv) { - unsigned long ib_mad_recv_list_sflags; + unsigned long flags; /* PCI mapping ? */ - spin_lock_irqsave(&priv->recv_list_lock, ib_mad_recv_list_sflags); + spin_lock_irqsave(&priv->recv_list_lock, flags); while (!list_empty(&priv->recv_posted_mad_list)) { } INIT_LIST_HEAD(&priv->recv_posted_mad_list); - spin_unlock_irqrestore(&priv->recv_list_lock, ib_mad_recv_list_sflags); + spin_unlock_irqrestore(&priv->recv_list_lock, flags); } /* @@ -837,17 +836,17 @@ */ static void ib_mad_return_posted_send_mads(struct ib_mad_port_private *priv) { - unsigned long ib_mad_send_list_sflags; + unsigned long flags; /* PCI mapping ? */ - spin_lock_irqsave(&priv->send_list_lock, ib_mad_send_list_sflags); + spin_lock_irqsave(&priv->send_list_lock, flags); while (!list_empty(&priv->send_posted_mad_list)) { list_del(priv->send_posted_mad_list.next); /* Call completion handler ? 
*/ } INIT_LIST_HEAD(&priv->send_posted_mad_list); - spin_unlock_irqrestore(&priv->send_list_lock, ib_mad_send_list_sflags); + spin_unlock_irqrestore(&priv->send_list_lock, flags); } /* @@ -972,7 +971,7 @@ static int ib_mad_port_start(struct ib_mad_port_private *priv) { int ret, i; - unsigned long ib_mad_port_list_sflags; + unsigned long flags; for (i = 0; i < 2; i++) { ret = ib_mad_change_qp_state_to_init(priv->qp[i], priv->port); @@ -1008,9 +1007,9 @@ } } - spin_lock_irqsave(&ib_mad_port_list_lock, ib_mad_port_list_sflags); + spin_lock_irqsave(&ib_mad_port_list_lock, flags); priv->up = 1; - spin_unlock_irqrestore(&ib_mad_port_list_lock, ib_mad_port_list_sflags); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); return 0; error: ib_mad_return_posted_recv_mads(priv); @@ -1027,11 +1026,11 @@ static void ib_mad_port_stop(struct ib_mad_port_private *priv) { int i; - unsigned long ib_mad_port_list_sflags; + unsigned long flags; - spin_lock_irqsave(&ib_mad_port_list_lock, ib_mad_port_list_sflags); + spin_lock_irqsave(&ib_mad_port_list_lock, flags); priv->up = 0; - spin_unlock_irqrestore(&ib_mad_port_list_lock, ib_mad_port_list_sflags); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); for (i = 0; i < 2; i++) { ib_mad_change_qp_state_to_reset(priv->qp[i]); @@ -1074,17 +1073,17 @@ struct ib_qp_cap qp_cap; struct ib_mad_port_private *entry, *priv = NULL, *head = (struct ib_mad_port_private *) &ib_mad_port_list; - unsigned long ib_mad_port_list_sflags; + unsigned long flags; /* First, check if port already open at MAD layer */ - spin_lock_irqsave(&ib_mad_port_list_lock, ib_mad_port_list_sflags); + spin_lock_irqsave(&ib_mad_port_list_lock, flags); list_for_each(entry, head) { if (entry->device == device && entry->port == port) { priv = entry; break; } } - spin_unlock_irqrestore(&ib_mad_port_list_lock, ib_mad_port_list_sflags); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); if (priv) { printk(KERN_DEBUG "Port already open\n"); return 0; @@ -1169,9 
+1168,9 @@ goto error8; } - spin_lock_irqsave(&ib_mad_port_list_lock, ib_mad_port_list_sflags); + spin_lock_irqsave(&ib_mad_port_list_lock, flags); list_add_tail((struct list_head *)priv, &ib_mad_port_list); - spin_unlock_irqrestore(&ib_mad_port_list_lock, ib_mad_port_list_sflags); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); return 0; @@ -1200,9 +1199,9 @@ { struct ib_mad_port_private *entry, *priv = NULL, *head = (struct ib_mad_port_private *)&ib_mad_port_list; - unsigned long ib_mad_port_list_sflags; + unsigned long flags; - spin_lock_irqsave(&ib_mad_port_list_lock, ib_mad_port_list_sflags); + spin_lock_irqsave(&ib_mad_port_list_lock, flags); list_for_each(entry, head) { if (entry->device == device && entry->port == port) { priv = entry; @@ -1212,12 +1211,12 @@ if (priv == NULL) { printk(KERN_ERR "Port not found\n"); - spin_unlock_irqrestore(&ib_mad_port_list_lock, ib_mad_port_list_sflags); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); return -ENODEV; } list_del((struct list_head *)priv); - spin_unlock_irqrestore(&ib_mad_port_list_lock, ib_mad_port_list_sflags); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); ib_mad_port_stop(priv); ib_mad_thread_stop(priv); Index: ib_mad_priv.h =================================================================== --- ib_mad_priv.h (revision 781) +++ ib_mad_priv.h (working copy) @@ -133,6 +133,7 @@ spinlock_t recv_list_lock; struct list_head send_posted_mad_list; struct list_head recv_posted_mad_list; + struct ib_mad_thread_data thread_data; }; Index: TODO =================================================================== --- TODO (revision 781) +++ TODO (working copy) @@ -1,5 +1,29 @@ -8/31/04 +9/11/04 +OpenIB MAD Layer + +Short Term +Track count of posted sends and receives +Support call of ib_mad_post_send from any context +Receive list per QP rather than 1 receive list +Fix list handling +Use wait queue and wait_event rather than signals and semaphores +Finish coding receive path + +Revisit 
+Handle post send overruns +PD per device rather than per port +Use tasklets/softirq rather than process context + +Futures +RMPP support +Redirection support (including receive list per QP) +Replace locking with RCU + + + +(Old) GSI (will be orphaned) + Update API to proposed openib GSI interface (ib_mad.h) Makefile needs to use standard kbuild Sync with latest ib_verbs.h when appropriate @@ -10,5 +34,5 @@ Add GRH support for RMPP (low priority) Static rate handling (low priority) -Migrate from /proc to /sysfs (may only apply to old GSI) +Migrate from /proc to /sysfs (only applies to original GSI) From halr at voltaire.com Sat Sep 11 11:29:11 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sat, 11 Sep 2004 14:29:11 -0400 Subject: [openib-general] [PATCH] [TRIVIAL] ib_mad: Eliminate up variable in port structure Message-ID: <1094927350.1752.1266.camel@localhost.localdomain> ib_mad: Eliminate up variable in port structure Index: ib_mad.c =================================================================== --- ib_mad.c (revision 784) +++ ib_mad.c (working copy) @@ -971,7 +971,6 @@ static int ib_mad_port_start(struct ib_mad_port_private *priv) { int ret, i; - unsigned long flags; for (i = 0; i < 2; i++) { ret = ib_mad_change_qp_state_to_init(priv->qp[i], priv->port); @@ -1007,9 +1006,6 @@ } } - spin_lock_irqsave(&ib_mad_port_list_lock, flags); - priv->up = 1; - spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); return 0; error: ib_mad_return_posted_recv_mads(priv); @@ -1026,12 +1022,7 @@ static void ib_mad_port_stop(struct ib_mad_port_private *priv) { int i; - unsigned long flags; - spin_lock_irqsave(&ib_mad_port_list_lock, flags); - priv->up = 0; - spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); - for (i = 0; i < 2; i++) { ib_mad_change_qp_state_to_reset(priv->qp[i]); } Index: ib_mad_priv.h =================================================================== --- ib_mad_priv.h (revision 784) +++ ib_mad_priv.h (working copy) @@ -122,7 +122,6 @@ struct 
ib_mad_port_private *next; struct ib_device *device; int port; - int up; struct ib_mad_mgmt_class_table *version[MAX_MGMT_VERSION]; struct ib_qp *qp[2]; struct ib_cq *cq; From halr at voltaire.com Sat Sep 11 11:38:10 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sat, 11 Sep 2004 14:38:10 -0400 Subject: [openib-general] [PATCH] ib_mad: Use IB_MAD_QPS_CORE and IB_MAD_QPS_SUPPORTED definitions Message-ID: <1094927890.1794.1269.camel@localhost.localdomain> ib_mad: Use IB_MAD_QPS_CORE and IB_MAD_QPS_SUPPORTED definitions rather than hard coded constant Index: ib_mad.c =================================================================== --- ib_mad.c (revision 785) +++ ib_mad.c (working copy) @@ -972,7 +972,7 @@ { int ret, i; - for (i = 0; i < 2; i++) { + for (i = 0; i < IB_MAD_QPS_CORE; i++) { ret = ib_mad_change_qp_state_to_init(priv->qp[i], priv->port); if (ret) { printk(KERN_ERR "Could not change QP%d state to INIT\n", i); @@ -992,7 +992,7 @@ goto error; } - for (i = 0; i < 2; i++) { + for (i = 0; i < IB_MAD_QPS_CORE; i++) { ret = ib_mad_change_qp_state_to_rtr(priv->qp[i]); if (ret) { printk(KERN_ERR "Could not change QP%d state to RTR\n", i); @@ -1009,7 +1009,7 @@ return 0; error: ib_mad_return_posted_recv_mads(priv); - for (i = 0; i < 2; i++) { + for (i = 0; i < IB_MAD_QPS_CORE; i++) { ib_mad_change_qp_state_to_reset(priv->qp[i]); } @@ -1023,7 +1023,7 @@ { int i; - for (i = 0; i < 2; i++) { + for (i = 0; i < IB_MAD_QPS_CORE; i++) { ib_mad_change_qp_state_to_reset(priv->qp[i]); } @@ -1120,7 +1120,7 @@ goto error5; } - for (i = 0; i < 2; i++) { + for (i = 0; i < IB_MAD_QPS_CORE; i++) { memset(&qp_init_attr, 0, sizeof qp_init_attr); qp_init_attr.send_cq = priv->cq; qp_init_attr.recv_cq = priv->cq; Index: ib_mad_priv.h =================================================================== --- ib_mad_priv.h (revision 785) +++ ib_mad_priv.h (working copy) @@ -59,6 +59,9 @@ #include +#define IB_MAD_QPS_CORE 2 /* Always QP0 and QP1 */ +#define IB_MAD_QPS_SUPPORTED 2 + 
/* QP and CQ parameters */ #define IB_MAD_QP_SEND_SIZE 2048 #define IB_MAD_QP_RECV_SIZE 512 @@ -123,7 +126,7 @@ struct ib_device *device; int port; struct ib_mad_mgmt_class_table *version[MAX_MGMT_VERSION]; - struct ib_qp *qp[2]; + struct ib_qp *qp[IB_MAD_QPS_SUPPORTED]; struct ib_cq *cq; struct ib_pd *pd; struct ib_mr *mr; From halr at voltaire.com Sat Sep 11 11:57:08 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sat, 11 Sep 2004 14:57:08 -0400 Subject: [openib-general] [PATCH]: ib_mad: Make receive list be per QP rather than just 1 for all QPs Message-ID: <1094929028.1746.1275.camel@localhost.localdomain> ib_mad: Make receive list be per QP rather than just 1 for all QPs Index: ib_mad.c =================================================================== --- ib_mad.c (revision 786) +++ ib_mad.c (working copy) @@ -772,7 +772,7 @@ /* Link receive WR into posted receive MAD list */ spin_lock_irqsave(&priv->recv_list_lock, flags); - list_add_tail((struct list_head *)mad_priv, &priv->recv_posted_mad_list); + list_add_tail((struct list_head *)mad_priv, &priv->recv_posted_mad_list[qp->qp_num]); /* This works now as only QP0 and 1 (no redirection)!!! */ spin_unlock_irqrestore(&priv->recv_list_lock, flags); pci_unmap_addr_set(&mad_priv->header.buf, mapping, sg_list.addr); @@ -819,16 +819,19 @@ */ static void ib_mad_return_posted_recv_mads(struct ib_mad_port_private *priv) { + int i; unsigned long flags; /* PCI mapping ? 
*/ - spin_lock_irqsave(&priv->recv_list_lock, flags); - while (!list_empty(&priv->recv_posted_mad_list)) { + for (i = 0; i < IB_MAD_QPS_SUPPORTED; i++) { + spin_lock_irqsave(&priv->recv_list_lock, flags); + while (!list_empty(&priv->recv_posted_mad_list[i])) { + } + INIT_LIST_HEAD(&priv->recv_posted_mad_list[i]); + spin_unlock_irqrestore(&priv->recv_list_lock, flags); } - INIT_LIST_HEAD(&priv->recv_posted_mad_list); - spin_unlock_irqrestore(&priv->recv_list_lock, flags); } /* @@ -1149,8 +1152,10 @@ spin_lock_init(&priv->recv_list_lock); spin_lock_init(&priv->send_list_lock); - INIT_LIST_HEAD(&priv->recv_posted_mad_list); INIT_LIST_HEAD(&priv->send_posted_mad_list); + for (i = 0; i < IB_MAD_QPS_SUPPORTED; i++) { + INIT_LIST_HEAD(&priv->recv_posted_mad_list[i]); + } ib_mad_thread_init(priv); ret = ib_mad_port_start(priv); Index: ib_mad_priv.h =================================================================== --- ib_mad_priv.h (revision 786) +++ ib_mad_priv.h (working copy) @@ -134,7 +134,7 @@ spinlock_t send_list_lock; spinlock_t recv_list_lock; struct list_head send_posted_mad_list; - struct list_head recv_posted_mad_list; + struct list_head recv_posted_mad_list[IB_MAD_QPS_SUPPORTED]; struct ib_mad_thread_data thread_data; }; Index: TODO =================================================================== --- TODO (revision 784) +++ TODO (working copy) @@ -5,7 +5,7 @@ Short Term Track count of posted sends and receives Support call of ib_mad_post_send from any context -Receive list per QP rather than 1 receive list +Encode QP number in WRID of receive WRs Fix list handling Use wait queue and wait_event rather than signals and semaphores Finish coding receive path From halr at voltaire.com Sat Sep 11 11:59:51 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sat, 11 Sep 2004 14:59:51 -0400 Subject: [openib-general] ib_mad receive locking Message-ID: <1094929190.1794.1279.camel@localhost.localdomain> Hi, Now that there is a receive list per MAD layer QP, should 
there also be a receive lock per QP to minimize lock contention ? Right now, there is one receive lock for all MAD QPs (which is just QP0 and 1 now but will ultimately include redirected QPs). And I know I can defer this as it is not critical path for getting this working but I also don't want to forget it either... Thanks. -- Hal From halr at voltaire.com Sat Sep 11 12:50:09 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sat, 11 Sep 2004 15:50:09 -0400 Subject: [openib-general] [PATCH] ib_mad: Track count of posted sends and receives Message-ID: <1094932208.1794.1317.camel@localhost.localdomain> ib_mad: Track count of posted sends and receives Also, Encode QP number in WRID of receive WRs Updated TODO (with this and added sysfs support to futures) Index: ib_mad.c =================================================================== --- ib_mad.c (revision 787) +++ ib_mad.c (working copy) @@ -330,6 +330,7 @@ spin_lock_irqsave(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, flags); list_add_tail((struct list_head *)mad_send_wr, &((struct ib_mad_port_private *)mad_agent->device->mad)->send_posted_mad_list); + ((struct ib_mad_port_private *)mad_agent->device->mad)->send_posted_mad_count++; spin_unlock_irqrestore(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, flags); ret = ib_post_send(mad_agent->qp, &wr, &bad_wr); @@ -337,6 +338,7 @@ /* Unlink from posted send MAD list */ spin_unlock_irqrestore(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, flags); list_del((struct list_head *)send_wr); + ((struct ib_mad_port_private *)mad_agent->device->mad)->send_posted_mad_count--; spin_unlock_irqrestore(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, flags); *bad_send_wr = cur_send_wr; printk(KERN_NOTICE "ib_mad_post_send failed\n"); @@ -539,6 +541,20 @@ } } +static int convert_qpnum(u32 qp_num) +{ + /* + * No redirection currently!!! 
+ * QP0 and QP1 only + * Ultimately, will need table of QP numbers and table index + * as QP numbers will not be packed once redirection supported + */ + if (qp_num > 1) { + printk(KERN_ERR "QP number %d invalid\n", qp_num); + } + return qp_num; +} + static void ib_mad_recv_done_handler(struct ib_mad_port_private *priv, struct ib_wc *wc) { @@ -547,7 +563,11 @@ *head = (struct ib_mad_private_header *)&priv->recv_posted_mad_list; struct ib_mad_private *recv = NULL; unsigned long flags; + u32 qp_num; + /* WC WRID is the QP number */ + qp_num = wc->wr_id; + /* Find entry on posted MAD receive list which corresponds to this completion */ spin_lock_irqsave(&priv->recv_list_lock, flags); list_for_each(entry, head) { @@ -555,6 +575,7 @@ recv = (struct ib_mad_private *)entry; /* Remove from posted receive MAD list */ list_del((struct list_head *)entry); + priv->recv_posted_mad_count[convert_qpnum(qp_num)]--; break; } } @@ -603,6 +624,7 @@ send_wr = entry; /* Remove from posted send MAD list */ list_del((struct list_head *)entry); + priv->send_posted_mad_count--; break; } } @@ -768,11 +790,12 @@ recv_wr.sg_list = &sg_list; recv_wr.num_sge = 1; recv_wr.recv_flags = IB_RECV_SIGNALED; - recv_wr.wr_id = (unsigned long)mad_priv; + recv_wr.wr_id = qp->qp_num; /* 32 bits left */ /* Link receive WR into posted receive MAD list */ spin_lock_irqsave(&priv->recv_list_lock, flags); - list_add_tail((struct list_head *)mad_priv, &priv->recv_posted_mad_list[qp->qp_num]); /* This works now as only QP0 and 1 (no redirection)!!! 
*/ + list_add_tail((struct list_head *)mad_priv, &priv->recv_posted_mad_list[convert_qpnum(qp->qp_num)]); + priv->recv_posted_mad_count[convert_qpnum(qp->qp_num)]++; spin_unlock_irqrestore(&priv->recv_list_lock, flags); pci_unmap_addr_set(&mad_priv->header.buf, mapping, sg_list.addr); @@ -782,6 +805,7 @@ /* Unlink from posted receive MAD list */ spin_lock_irqsave(&priv->recv_list_lock, flags); list_del((struct list_head *)mad_priv); + priv->recv_posted_mad_count[convert_qpnum(qp->qp_num)]--; spin_unlock_irqrestore(&priv->recv_list_lock, flags); pci_unmap_single(priv->device->dma_device, @@ -830,6 +854,7 @@ } INIT_LIST_HEAD(&priv->recv_posted_mad_list[i]); + priv->recv_posted_mad_count[i] = 0; spin_unlock_irqrestore(&priv->recv_list_lock, flags); } } @@ -846,9 +871,10 @@ spin_lock_irqsave(&priv->send_list_lock, flags); while (!list_empty(&priv->send_posted_mad_list)) { list_del(priv->send_posted_mad_list.next); - /* Call completion handler ? */ + /* Call completion handler with some status ? 
*/ } INIT_LIST_HEAD(&priv->send_posted_mad_list); + priv->send_posted_mad_count = 0; spin_unlock_irqrestore(&priv->send_list_lock, flags); } @@ -1153,8 +1179,10 @@ spin_lock_init(&priv->recv_list_lock); spin_lock_init(&priv->send_list_lock); INIT_LIST_HEAD(&priv->send_posted_mad_list); + priv->send_posted_mad_count = 0; for (i = 0; i < IB_MAD_QPS_SUPPORTED; i++) { INIT_LIST_HEAD(&priv->recv_posted_mad_list[i]); + priv->recv_posted_mad_count[i] = 0; } ib_mad_thread_init(priv); Index: ib_mad_priv.h =================================================================== --- ib_mad_priv.h (revision 787) +++ ib_mad_priv.h (working copy) @@ -132,9 +132,12 @@ struct ib_mr *mr; spinlock_t send_list_lock; + struct list_head send_posted_mad_list; + int send_posted_mad_count; + spinlock_t recv_list_lock; - struct list_head send_posted_mad_list; struct list_head recv_posted_mad_list[IB_MAD_QPS_SUPPORTED]; + int recv_posted_mad_count[IB_MAD_QPS_SUPPORTED]; struct ib_mad_thread_data thread_data; }; Index: TODO =================================================================== --- TODO (revision 787) +++ TODO (working copy) @@ -3,9 +3,7 @@ OpenIB MAD Layer Short Term -Track count of posted sends and receives Support call of ib_mad_post_send from any context -Encode QP number in WRID of receive WRs Fix list handling Use wait queue and wait_event rather than signals and semaphores Finish coding receive path @@ -16,6 +14,7 @@ Use tasklets/softirq rather than process context Futures +sysfs support for MAD layer (statistics, debug support, etc.) 
RMPP support Redirection support (including receive list per QP) Replace locking with RCU From halr at voltaire.com Sat Sep 11 12:57:32 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sat, 11 Sep 2004 15:57:32 -0400 Subject: [openib-general] [PATCH] ib_mad: Support call of ib_mad_post_send from any context Message-ID: <1094932652.1794.1320.camel@localhost.localdomain> ib_mad: Support call of ib_mad_post_send from any context Index: ib_mad.c =================================================================== --- ib_mad.c (revision 788) +++ ib_mad.c (working copy) @@ -58,6 +58,7 @@ #include "ib_mad_priv.h" #include "ib_core.h" /* for IB_DEVICE_NOTIFIER for now!!! */ #include +#include MODULE_LICENSE("Dual BSD/GPL"); @@ -303,7 +304,9 @@ next_send_wr = (struct ib_send_wr *)cur_send_wr->list.next; /* Allocate MAD send WR tracking structure */ - mad_send_wr = kmalloc(sizeof *mad_send_wr, GFP_KERNEL); + mad_send_wr = kmalloc(sizeof *mad_send_wr, + (in_atomic() || irqs_disabled()) ? + GFP_ATOMIC : GFP_KERNEL); if (!mad_send_wr) { *bad_send_wr = cur_send_wr; printk(KERN_ERR "No memory for ib_mad_send_wr_private\n"); Index: TODO =================================================================== --- TODO (revision 788) +++ TODO (working copy) @@ -3,7 +3,6 @@ OpenIB MAD Layer Short Term -Support call of ib_mad_post_send from any context Fix list handling Use wait queue and wait_event rather than signals and semaphores Finish coding receive path From mst at mellanox.co.il Sun Sep 12 01:01:09 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 12 Sep 2004 11:01:09 +0300 Subject: [openib-general] semantics of process_mad? In-Reply-To: <1094850918.1746.474.camel@localhost.localdomain> References: <521xh9df3y.fsf@topspin.com> <1094850918.1746.474.camel@localhost.localdomain> Message-ID: <20040912080109.GB16320@mellanox.co.il> Hello! Quoting r. 
Hal Rosenstock (halr at voltaire.com) "Re: [openib-general] semantics of process_mad?": > On Fri, 2004-09-10 at 16:49, Roland Dreier wrote: > > I'm now looking at implementing the process_mad method: > > > > int (*process_mad)(struct ib_device *device, > > int process_mad_flags, > > struct ib_mad *in_mad, > > struct ib_mad *out_mad); > > > > First of all, it seems that a port_num parameter needs to be added. > Agreed. How did this get missed 'till now ? (That's rhetorical)... What happens if a trap needs to be sent as a result of this mad (consider, e.g. if there is a bkey violation)? I think you might need to pass additional info needed by the trap to be generated. MST From halr at voltaire.com Sun Sep 12 04:35:02 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sun, 12 Sep 2004 07:35:02 -0400 Subject: [openib-general] semantics of process_mad? In-Reply-To: <20040912080109.GB16320@mellanox.co.il> References: <521xh9df3y.fsf@topspin.com> <1094850918.1746.474.camel@localhost.localdomain> <20040912080109.GB16320@mellanox.co.il> Message-ID: <1094988901.1752.1333.camel@localhost.localdomain> On Sun, 2004-09-12 at 04:01, Michael S. Tsirkin wrote: > What happens if a trap needs to be sent as a result of this mad > (consider, e.g. if there is a bkey violation)? > I think you might need to pass additional info needed by the trap to be > generated. Is there more info needed for trap generation that is not contained in the incoming mad ? Thanks.
-- Hal From halr at voltaire.com Sun Sep 12 05:04:34 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sun, 12 Sep 2004 08:04:34 -0400 Subject: [openib-general] [PATCH] ib_mad: Version/Class/Method table locking Message-ID: <1094990673.1746.1336.camel@localhost.localdomain> ib_mad: Version/Class/Method table locking Index: ib_mad.c =================================================================== --- ib_mad.c (revision 789) +++ ib_mad.c (working copy) @@ -168,24 +168,6 @@ goto error1; } - /* - * Make sure MAD registration (if supplied) - * is non overlapping with any existing ones - */ - if (mad_reg_req) { - class = priv->version[mad_reg_req->mgmt_class_version]; - if (class) { - mgmt_class = convert_mgmt_class(mad_reg_req); - method = class->method_table[mgmt_class]; - if (method) { - if (is_method_in_use(&method, mad_reg_req)) { - ret = ERR_PTR(-EINVAL); - goto error1; - } - } - } - } - /* Allocate structures */ mad_agent = kmalloc(sizeof *mad_agent, GFP_KERNEL); mad_agent_priv = kmalloc(sizeof *mad_agent_priv, GFP_KERNEL); @@ -204,6 +186,27 @@ memcpy(reg_req, mad_reg_req, sizeof *reg_req); } + spin_lock_irqsave(&priv->reg_lock, flags); + + /* + * Make sure MAD registration (if supplied) + * is non overlapping with any existing ones + */ + if (mad_reg_req) { + class = priv->version[mad_reg_req->mgmt_class_version]; + if (class) { + mgmt_class = convert_mgmt_class(mad_reg_req); + method = class->method_table[mgmt_class]; + if (method) { + if (is_method_in_use(&method, mad_reg_req)) { + spin_unlock_irqrestore(&priv->reg_lock, flags); + ret = ERR_PTR(-EINVAL); + goto error2; + } + } + } + } + /* Now, fill in the various structures */ memset(mad_agent_priv, 0, sizeof *mad_agent_priv); mad_agent_priv->agent = mad_agent; @@ -223,6 +226,7 @@ spin_unlock_irqrestore(&ib_mad_agent_list_lock, flags); ret2 = add_mad_reg_req(mad_reg_req, mad_agent_priv); + spin_unlock_irqrestore(&priv->reg_lock, flags); if (ret2) { ret = ERR_PTR(ret2); goto error3; @@ -1123,6 +1127,7 
@@ device->mad = priv; priv->device = device; priv->port = port; + spin_lock_init(&priv->reg_lock); for (i = 0; i < MAX_MGMT_VERSION; i++) { priv->version[i] = NULL; } Index: ib_mad_priv.h =================================================================== --- ib_mad_priv.h (revision 788) +++ ib_mad_priv.h (working copy) @@ -125,12 +125,14 @@ struct ib_mad_port_private *next; struct ib_device *device; int port; - struct ib_mad_mgmt_class_table *version[MAX_MGMT_VERSION]; struct ib_qp *qp[IB_MAD_QPS_SUPPORTED]; struct ib_cq *cq; struct ib_pd *pd; struct ib_mr *mr; + spinlock_t reg_lock; + struct ib_mad_mgmt_class_table *version[MAX_MGMT_VERSION]; + spinlock_t send_list_lock; struct list_head send_posted_mad_list; int send_posted_mad_count; Index: TODO =================================================================== --- TODO (revision 790) +++ TODO (working copy) @@ -4,7 +4,6 @@ Short Term Fix list handling -Version/Class/Method table locking Client ID table Use wait queue and wait_event rather than signals and semaphores Finish receive path coding From roland at topspin.com Sun Sep 12 09:05:14 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 12 Sep 2004 09:05:14 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <000401c49824$95505150$655aa8c0@infiniconsys.com> (Fab Tillier's message of "Sat, 11 Sep 2004 10:27:14 -0700") References: <000401c49824$95505150$655aa8c0@infiniconsys.com> Message-ID: <528ybfa2x1.fsf@topspin.com> Fab> Like Sean said, if the common case is going to be a single Fab> device, then the gain is likely not worth the complication in Fab> the code. I don't feel really strongly about it, it was just Fab> a thought based on how the device related notifications Fab> worked in IBAL, which I found to be useful. No biggie either Fab> way, and it should be simple enough to change down the road Fab> if we see a need. 
OK, I thought about it some more and I decided that it's better to have the API from the beginning so that more efficient implementations can be added without changing the client code. However, I implemented the following API: void *ib_get_client_data(struct ib_device *device, struct ib_client *client); int ib_set_client_data(struct ib_device *device, struct ib_client *client, void *data); I think this is equivalent to what you proposed but simpler to implement and use. It also mimics the API in : void *pci_get_drvdata (struct pci_dev *pdev); void pci_set_drvdata (struct pci_dev *pdev, void *data); (my set function returns an int because it does an allocation, so it can fail) Comments? - R. From roland at topspin.com Sun Sep 12 09:07:36 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 12 Sep 2004 09:07:36 -0700 Subject: [openib-general] ib_mad receive locking In-Reply-To: <1094929190.1794.1279.camel@localhost.localdomain> (Hal Rosenstock's message of "Sat, 11 Sep 2004 14:59:51 -0400") References: <1094929190.1794.1279.camel@localhost.localdomain> Message-ID: <524qm3a2t3.fsf@topspin.com> Hal> Hi, Now that there is a receive list per MAD layer QP, should Hal> there also be a receive lock per QP to minimize lock Hal> contention ? Right now, there is one receive lock for all MAD Hal> QPs (which is just QP0 and 1 now but will ultimately include Hal> redirected QPs). Yes, I think the locking should be pretty fine-grained. We definitely don't want independent HCAs to serialize against each other, and it's probably good even for different ports on the same HCA to be independent. - R. 
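The per-QP receive locking discussed in the message above can be sketched in userspace C. This is only an illustration under stated assumptions: pthread mutexes stand in for the kernel's spinlock_t and spin_lock_irqsave(), and the names recv_queue, mad_port, port_init, post_recv and recv_done are hypothetical and do not come from ib_mad.c. The point it tries to show is that each QP carries its own lock and count, so work on QP0 never waits on QP1.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for IB_MAD_QPS_SUPPORTED (QP0 and QP1 only). */
#define QPS_SUPPORTED 2

/* One receive queue per QP, each with its own lock (assumption: a
 * pthread mutex replaces the kernel spinlock for this userspace sketch). */
struct recv_queue {
	pthread_mutex_t lock;	/* protects only this QP's posted_count */
	int posted_count;	/* number of receives currently posted */
};

struct mad_port {
	struct recv_queue recv[QPS_SUPPORTED];
};

/* Initialize every per-QP lock and counter; returns 0 on success. */
static int port_init(struct mad_port *port)
{
	int i;

	for (i = 0; i < QPS_SUPPORTED; i++) {
		if (pthread_mutex_init(&port->recv[i].lock, NULL) != 0)
			return -1;
		port->recv[i].posted_count = 0;
	}
	return 0;
}

/* Account for a newly posted receive, taking only that QP's lock. */
static int post_recv(struct mad_port *port, unsigned int qp_index)
{
	struct recv_queue *q;

	if (qp_index >= QPS_SUPPORTED)
		return -1;
	q = &port->recv[qp_index];
	pthread_mutex_lock(&q->lock);
	q->posted_count++;
	pthread_mutex_unlock(&q->lock);
	return 0;
}

/* Account for a receive completion; fails if nothing was posted. */
static int recv_done(struct mad_port *port, unsigned int qp_index)
{
	struct recv_queue *q;
	int ret = -1;

	if (qp_index >= QPS_SUPPORTED)
		return -1;
	q = &port->recv[qp_index];
	pthread_mutex_lock(&q->lock);
	if (q->posted_count > 0) {
		q->posted_count--;
		ret = 0;
	}
	assert(q->posted_count >= 0);	/* the count must never go negative */
	pthread_mutex_unlock(&q->lock);
	return ret;
}
```

A caller would run port_init() once per port and then call post_recv() and recv_done() with the QP index. In kernel code the equivalent would presumably be a spinlock_t embedded next to each recv_posted_mad_list entry, but that layout is an assumption rather than something the thread settles.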
From halr at voltaire.com Sun Sep 12 10:05:24 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sun, 12 Sep 2004 13:05:24 -0400 Subject: [openib-general] [PATCH] ib_mad: Fix list handling Message-ID: <1095008723.1794.1396.camel@localhost.localdomain> Fix list handling (hopefully :-) Index: ib_mad.c =================================================================== --- ib_mad.c (revision 791) +++ ib_mad.c (working copy) @@ -106,10 +106,7 @@ ib_mad_recv_handler recv_handler, void *context) { - struct ib_mad_port_private *entry, *priv = NULL, - *head = (struct ib_mad_port_private *) &ib_mad_port_list; - struct ib_mad_agent_private *entry2, - *head2 = (struct ib_mad_agent_private *)&ib_mad_agent_list; + struct ib_mad_port_private *entry, *priv = NULL; struct ib_mad_agent *mad_agent, *ret; struct ib_mad_agent_private *mad_agent_priv; struct ib_mad_reg_req *reg_req = NULL; @@ -148,7 +145,7 @@ goto error1; } } else if (mad_reg_req->mgmt_class == 0) { - /* class 0 is reserved and used for aliasing IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE */ + /* Class 0 is reserved and used for aliasing IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE */ ret = ERR_PTR(-EINVAL); goto error1; } @@ -156,7 +153,7 @@ /* Validate device and port */ spin_lock_irqsave(&ib_mad_port_list_lock, flags); - list_for_each(entry, head) { + list_for_each_entry(entry, &ib_mad_port_list, port_list) { if (entry->device == device && entry->port == port) { priv = entry; break; @@ -222,7 +219,7 @@ /* Add to mad agent list */ spin_lock_irqsave(&ib_mad_agent_list_lock, flags); - list_add_tail((struct list_head *) mad_agent_priv, &ib_mad_agent_list); + list_add_tail(&mad_agent_priv->agent_list, &ib_mad_agent_list); spin_unlock_irqrestore(&ib_mad_agent_list_lock, flags); ret2 = add_mad_reg_req(mad_reg_req, mad_agent_priv); @@ -237,13 +234,10 @@ error3: /* Remove from mad agent list */ spin_lock_irqsave(&ib_mad_agent_list_lock, flags); - list_for_each(entry2, head2) { - if (entry2->agent == mad_agent_priv->agent) { - 
list_del((struct list_head *)entry2); - break; - } - } + list_del(&mad_agent_priv->agent_list); spin_unlock_irqrestore(&ib_mad_agent_list_lock, flags); + + /* Release allocated structures */ kfree(reg_req); error2: kfree(mad_agent); @@ -258,15 +252,15 @@ */ int ib_mad_dereg(struct ib_mad_agent *mad_agent) { - struct ib_mad_agent_private *entry, - *head = (struct ib_mad_agent_private *)&ib_mad_agent_list; + struct ib_mad_agent_private *entry, *temp; unsigned long flags; spin_lock_irqsave(&ib_mad_agent_list_lock, flags); - list_for_each(entry, head) { + list_for_each_entry_safe(entry, temp, &ib_mad_agent_list, agent_list) { if (entry->agent == mad_agent) { remove_mad_reg_req(entry); - list_del((struct list_head *)entry); + list_del(&entry->agent_list); + /* Release allocated structures */ kfree(entry->reg_req); kfree(entry->agent); @@ -316,8 +310,7 @@ printk(KERN_ERR "No memory for ib_mad_send_wr_private\n"); return -ENOMEM; } - /* Initialized MAD send WR tracking structure */ - mad_send_wr->next = NULL; + /* Initialize MAD send WR tracking structure */ mad_send_wr->agent = mad_agent; mad_send_wr->wr_id = cur_send_wr->wr_id; mad_send_wr->timeout_ms = cur_send_wr->wr.ud.timeout_ms; @@ -335,7 +328,7 @@ /* Link send WR into posted send MAD list */ spin_lock_irqsave(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, flags); - list_add_tail((struct list_head *)mad_send_wr, + list_add_tail(&mad_send_wr->send_list, &((struct ib_mad_port_private *)mad_agent->device->mad)->send_posted_mad_list); ((struct ib_mad_port_private *)mad_agent->device->mad)->send_posted_mad_count++; spin_unlock_irqrestore(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, flags); @@ -344,7 +337,7 @@ if (ret) { /* Unlink from posted send MAD list */ spin_unlock_irqrestore(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, flags); - list_del((struct list_head *)send_wr); + list_del(&mad_send_wr->send_list); ((struct ib_mad_port_private 
*)mad_agent->device->mad)->send_posted_mad_count--; spin_unlock_irqrestore(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, flags); *bad_send_wr = cur_send_wr; @@ -566,8 +559,7 @@ struct ib_wc *wc) { struct ib_mad_recv_wc recv_wc; - struct ib_mad_private_header *entry, - *head = (struct ib_mad_private_header *)&priv->recv_posted_mad_list; + struct ib_mad_private_header *entry, *temp; struct ib_mad_private *recv = NULL; unsigned long flags; u32 qp_num; @@ -577,11 +569,13 @@ /* Find entry on posted MAD receive list which corresponds to this completion */ spin_lock_irqsave(&priv->recv_list_lock, flags); - list_for_each(entry, head) { + list_for_each_entry_safe(entry, temp, + &priv->recv_posted_mad_list[convert_qpnum(qp_num)], + mad_list) { if ((unsigned long)entry == wc->wr_id) { recv = (struct ib_mad_private *)entry; /* Remove from posted receive MAD list */ - list_del((struct list_head *)entry); + list_del(&entry->mad_list); priv->recv_posted_mad_count[convert_qpnum(qp_num)]--; break; } @@ -620,17 +614,17 @@ static void ib_mad_send_done_handler(struct ib_mad_port_private *priv, struct ib_wc *wc) { - struct ib_mad_send_wr_private *entry, *send_wr = NULL, - *head = (struct ib_mad_send_wr_private *)&priv->send_posted_mad_list; + struct ib_mad_send_wr_private *entry, *temp, *send_wr = NULL; unsigned long flags; /* Find entry on posted MAD send list which corresponds to this completion */ spin_lock_irqsave(&priv->send_list_lock, flags); - list_for_each(entry, head) { + list_for_each_entry_safe(entry, temp, + &priv->send_posted_mad_list, send_list) { if (entry->wr_id == wc->wr_id) { send_wr = entry; /* Remove from posted send MAD list */ - list_del((struct list_head *)entry); + list_del(&entry->send_list); priv->send_posted_mad_count--; break; } @@ -782,7 +776,6 @@ printk(KERN_ERR "No memory for receive MAD\n"); return -ENOMEM; } - mad_priv->header.next = NULL; /* Setup scatter list */ sg_list.addr = pci_map_single(priv->device->dma_device, @@ 
-801,7 +794,7 @@ /* Link receive WR into posted receive MAD list */ spin_lock_irqsave(&priv->recv_list_lock, flags); - list_add_tail((struct list_head *)mad_priv, &priv->recv_posted_mad_list[convert_qpnum(qp->qp_num)]); + list_add_tail(&mad_priv->header.mad_list, &priv->recv_posted_mad_list[convert_qpnum(qp->qp_num)]); priv->recv_posted_mad_count[convert_qpnum(qp->qp_num)]++; spin_unlock_irqrestore(&priv->recv_list_lock, flags); @@ -811,7 +804,7 @@ if (ib_post_recv(qp, &recv_wr, &bad_recv_wr)) { /* Unlink from posted receive MAD list */ spin_lock_irqsave(&priv->recv_list_lock, flags); - list_del((struct list_head *)mad_priv); + list_del(&mad_priv->header.mad_list); priv->recv_posted_mad_count[convert_qpnum(qp->qp_num)]--; spin_unlock_irqrestore(&priv->recv_list_lock, flags); @@ -877,7 +870,7 @@ spin_lock_irqsave(&priv->send_list_lock, flags); while (!list_empty(&priv->send_posted_mad_list)) { - list_del(priv->send_posted_mad_list.next); + list_del(&priv->send_posted_mad_list); /* Call completion handler with some status ? 
*/ } INIT_LIST_HEAD(&priv->send_posted_mad_list); @@ -1098,13 +1091,12 @@ }; struct ib_qp_init_attr qp_init_attr; struct ib_qp_cap qp_cap; - struct ib_mad_port_private *entry, *priv = NULL, - *head = (struct ib_mad_port_private *) &ib_mad_port_list; + struct ib_mad_port_private *entry, *priv = NULL; unsigned long flags; /* First, check if port already open at MAD layer */ spin_lock_irqsave(&ib_mad_port_list_lock, flags); - list_for_each(entry, head) { + list_for_each_entry(entry, &ib_mad_port_list, port_list) { if (entry->device == device && entry->port == port) { priv = entry; break; @@ -1201,7 +1193,7 @@ } spin_lock_irqsave(&ib_mad_port_list_lock, flags); - list_add_tail((struct list_head *)priv, &ib_mad_port_list); + list_add_tail(&priv->port_list, &ib_mad_port_list); spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); return 0; @@ -1229,12 +1221,11 @@ */ static int ib_mad_port_close(struct ib_device *device, int port) { - struct ib_mad_port_private *entry, *priv = NULL, - *head = (struct ib_mad_port_private *)&ib_mad_port_list; + struct ib_mad_port_private *entry, *priv = NULL; unsigned long flags; spin_lock_irqsave(&ib_mad_port_list_lock, flags); - list_for_each(entry, head) { + list_for_each_entry(entry, &ib_mad_port_list, port_list) { if (entry->device == device && entry->port == port) { priv = entry; break; @@ -1247,7 +1238,7 @@ return -ENODEV; } - list_del((struct list_head *)priv); + list_del(&priv->port_list); spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); ib_mad_port_stop(priv); @@ -1257,7 +1248,7 @@ ib_dereg_mr(priv->mr); ib_dealloc_pd(priv->pd); ib_destroy_cq(priv->cq); - /* Handle MAD registration tables!!! */ + /* Handle deallocation of MAD registration tables!!! 
*/ kfree(priv); device->mad = NULL; Index: ib_mad_priv.h =================================================================== --- ib_mad_priv.h (revision 791) +++ ib_mad_priv.h (working copy) @@ -81,7 +81,7 @@ }; struct ib_mad_private_header { - struct ib_mad_private_header *next; + struct list_head mad_list; struct ib_mad_buf buf; } __attribute__ ((packed)); @@ -95,14 +95,14 @@ } __attribute__ ((packed)); struct ib_mad_agent_private { - struct ib_mad_agent_private *next; + struct list_head agent_list; struct ib_mad_agent *agent; struct ib_mad_reg_req *reg_req; u8 rmpp_version; }; struct ib_mad_send_wr_private { - struct ib_mad_send_wr_private *next; + struct list_head send_list; struct ib_mad_agent *agent; u64 wr_id; int timeout_ms; @@ -122,7 +122,7 @@ }; struct ib_mad_port_private { - struct ib_mad_port_private *next; + struct list_head port_list; struct ib_device *device; int port; struct ib_qp *qp[IB_MAD_QPS_SUPPORTED]; Index: TODO =================================================================== --- TODO (revision 791) +++ TODO (working copy) @@ -3,7 +3,6 @@ OpenIB MAD Layer Short Term -Fix list handling Client ID table Use wait queue and wait_event rather than signals and semaphores Finish receive path coding From halr at voltaire.com Sun Sep 12 10:12:26 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sun, 12 Sep 2004 13:12:26 -0400 Subject: [openib-general] ib_mad.c comments In-Reply-To: <1094926336.1794.1259.camel@localhost.localdomain> References: <20040908164739.3e9c8723.mshefty@ichips.intel.com> <524qm8gmxg.fsf@topspin.com> <1094901754.1752.1173.camel@localhost.localdomain> <52isakajhy.fsf@topspin.com> <1094926336.1794.1259.camel@localhost.localdomain> Message-ID: <1095009146.1752.1407.camel@localhost.localdomain> On Sat, 2004-09-11 at 14:12, Hal Rosenstock wrote: > > Hal> To minimize time spent in non process context. Is this an > > Hal> overly conservative approach ? > > > > It's probably OK for now. 
Probably the correct answer is to use > > tasklets and run in softirq context though (since process context can > > be starved for an arbitrarily long time). > > OK. I will put changing this on the futures list for now. One further thought although I think this was already discussed on the list but I forgot the answer... Another specific issue will come into play for the CM in that the callbacks will want to invoke ib_create/modify_qp. Will those be callable from non process context ? Just want to be sure... -- Hal From roland at topspin.com Sun Sep 12 10:25:39 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 12 Sep 2004 10:25:39 -0700 Subject: [openib-general] ib_mad.c comments In-Reply-To: <1095009146.1752.1407.camel@localhost.localdomain> (Hal Rosenstock's message of "Sun, 12 Sep 2004 13:12:26 -0400") References: <20040908164739.3e9c8723.mshefty@ichips.intel.com> <524qm8gmxg.fsf@topspin.com> <1094901754.1752.1173.camel@localhost.localdomain> <52isakajhy.fsf@topspin.com> <1094926336.1794.1259.camel@localhost.localdomain> <1095009146.1752.1407.camel@localhost.localdomain> Message-ID: <52zn3v8kmk.fsf@topspin.com> Hal> Another specific issue will come into play for the CM in that Hal> the callbacks will want to invoke ib_create/modify_qp. Will Hal> those be callable from non process context ? Just want to be Hal> sure... No, ib_create_qp and ib_modify_qp are not callable from non-process context. It's up to the CM to defer processing to a process context (probably using a workqueue). This is exactly the same thing the CM has to do with timeouts, since Linux timers are called from timer interrupt context.
- Roland From roland at topspin.com Sun Sep 12 10:27:12 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 12 Sep 2004 10:27:12 -0700 Subject: [openib-general] [PATCH] ib_mad: Fix list handling In-Reply-To: <1095008723.1794.1396.camel@localhost.localdomain> (Hal Rosenstock's message of "Sun, 12 Sep 2004 13:05:24 -0400") References: <1095008723.1794.1396.camel@localhost.localdomain> Message-ID: <52vfej8kjz.fsf@topspin.com> This looks good to me. - R. From ftillier at infiniconsys.com Sun Sep 12 11:36:54 2004 From: ftillier at infiniconsys.com (Fab Tillier) Date: Sun, 12 Sep 2004 11:36:54 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <528ybfa2x1.fsf@topspin.com> Message-ID: <000501c498f7$7af46e80$655aa8c0@infiniconsys.com> > From: Roland Dreier [mailto:roland at topspin.com] > Sent: Sunday, September 12, 2004 9:05 AM > > Fab> Like Sean said, if the common case is going to be a single > Fab> device, then the gain is likely not worth the complication in > Fab> the code. I don't feel really strongly about it, it was just > Fab> a thought based on how the device related notifications > Fab> worked in IBAL, which I found to be useful. No biggie either > Fab> way, and it should be simple enough to change down the road > Fab> if we see a need. > > OK, I thought about it some more and I decided that it's better to > have the API from the beginning so that more efficient implementations > can be added without changing the client code. However, I implemented > the following API: > > void *ib_get_client_data(struct ib_device *device, struct ib_client > *client); > int ib_set_client_data(struct ib_device *device, struct ib_client > *client, > void *data); > > I think this is equivalent to what you proposed but simpler to > implement and use. 
It also mimics the API in : I think making things behave like other Linux APIs is the right way to go - the more we do to make it simpler for kernel devs to find their way around IB the better. > > void *pci_get_drvdata (struct pci_dev *pdev); > void pci_set_drvdata (struct pci_dev *pdev, void *data); > > (my set function returns an int because it does an allocation, so it > can fail) How do the above pci_xxx_drvdata functions avoid a malloc? Can we do something similar? Either way, I think your proposed API will work fine. Is it valid for a client to call the get function if it did not call the set function? Does that just result in NULL being returned? Thanks, - Fab From roland at topspin.com Sun Sep 12 12:46:39 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 12 Sep 2004 12:46:39 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <000501c498f7$7af46e80$655aa8c0@infiniconsys.com> (Fab Tillier's message of "Sun, 12 Sep 2004 11:36:54 -0700") References: <000501c498f7$7af46e80$655aa8c0@infiniconsys.com> Message-ID: <52r7p78e3k.fsf@topspin.com> Fab> How do the above pci_xxx_drvdata functions avoid a malloc? Fab> Can we do something similar? Either way, I think your Fab> proposed API will work fine. The pci functions only support a single driver, so they can just use a single void * driver_data member of the struct device. Since IB allows multiple consumers, the options are to have a table with a static limit, or to dynamically allocate context for each consumer as it gets used. I really don't like having static limits so I chose the second option. The chances of a small kmalloc (with GFP_KERNEL) failing are pretty small, and when such an allocation fails you're usually pretty screwed anyway. Fab> Is it valid for a client to call the get function if it did Fab> not call the set function? Does that just result in NULL Fab> being returned? Yes. - R.
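The trade-off discussed above (one driver_data slot for a single owner versus a context allocated per client on first use) can be sketched in userspace C. This is only an illustration under stated assumptions: every demo_* name is hypothetical, malloc stands in for kmalloc, and the locking a kernel version would need is left out.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; these are not the kernel structures. */
struct demo_client { const char *name; };

struct demo_client_data {
	struct demo_client_data *next;
	struct demo_client *client;
	void *data;
};

struct demo_device {
	void *driver_data;                 /* PCI style: one slot, one owner */
	struct demo_client_data *contexts; /* IB style: one node per client */
};

/* Single-owner case: nothing is allocated, so the setter cannot fail. */
void demo_set_drvdata(struct demo_device *dev, void *data)
{
	dev->driver_data = data;
}

void *demo_get_drvdata(struct demo_device *dev)
{
	return dev->driver_data;
}

/* Find the context node belonging to this client, if it has one. */
static struct demo_client_data *demo_find(struct demo_device *dev,
					  struct demo_client *client)
{
	struct demo_client_data *ctx;

	for (ctx = dev->contexts; ctx; ctx = ctx->next)
		if (ctx->client == client)
			return ctx;
	return NULL;
}

/* Returns NULL when the client never called the set function. */
void *demo_get_client_data(struct demo_device *dev, struct demo_client *client)
{
	struct demo_client_data *ctx = demo_find(dev, client);

	return ctx ? ctx->data : NULL;
}

/* The first set for a client allocates a node, hence the int return. */
int demo_set_client_data(struct demo_device *dev, struct demo_client *client,
			 void *data)
{
	struct demo_client_data *ctx = demo_find(dev, client);

	if (!ctx) {
		ctx = malloc(sizeof *ctx);
		if (!ctx)
			return -ENOMEM;
		ctx->client = client;
		ctx->next = dev->contexts;
		dev->contexts = ctx;
	}
	ctx->data = data;
	return 0;
}
```

With a zero-initialized demo_device, a get for a client that never called set yields NULL, and two clients can each keep their own pointer on the same device without any fixed limit on the number of clients.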
From mst at mellanox.co.il Sun Sep 12 12:58:54 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 12 Sep 2004 22:58:54 +0300 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <52r7p78e3k.fsf@topspin.com> References: <000501c498f7$7af46e80$655aa8c0@infiniconsys.com> <52r7p78e3k.fsf@topspin.com> Message-ID: <20040912195854.GB17841@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] Proposed device enumeration & async event APIs": > Fab> How do the above pci_xxx_drvdata functions avoid a malloc? > Fab> Can we do something similar? Either way, I think your > Fab> proposed API will work fine. > > The pci functions only support a single driver, so they can just use a > single void * driver_data member of the struct device. Since IB > allows multiple consumers, the options are to have a table with a > static limit, or to dynamically allocate context for each consumer as > it gets used. I really don't like having static limits so I chose the > second option. Am I right that you need the ib core loaded always (in a way similar to how you need ip core loaded always) and then ulps plug in from above and devices from below? I'm still not sure how ULPs are loaded on a hotplug event when a device is added, though. MST From roland at topspin.com Sun Sep 12 13:04:47 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 12 Sep 2004 13:04:47 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <20040912195854.GB17841@mellanox.co.il> (Michael S.
Tsirkin's message of "Sun, 12 Sep 2004 22:58:54 +0300") References: <000501c498f7$7af46e80$655aa8c0@infiniconsys.com> <52r7p78e3k.fsf@topspin.com> <20040912195854.GB17841@mellanox.co.il> Message-ID: <52mzzv8d9c.fsf@topspin.com> Michael> Am I right that you need the ib core loaded always (in a Michael> way similiar to how you need ip core loaded always) and Michael> then ulps plug in from above and devices from below? Yes (or it could be loaded based on module dependencies when hotplug loads the first low-level driver). Michael> I'm still not sure how ULPs are loaded on a hotplug event Michael> when a device is added, though. No need -- you can load the ULP before any devices are present (its add() method will get called when low-level drivers create devices). Or one could add /etc/hotplug/infiniband.agent or something like that to do any initialization required. - R. From mst at mellanox.co.il Sun Sep 12 13:09:51 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 12 Sep 2004 23:09:51 +0300 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <52mzzv8d9c.fsf@topspin.com> References: <000501c498f7$7af46e80$655aa8c0@infiniconsys.com> <52r7p78e3k.fsf@topspin.com> <20040912195854.GB17841@mellanox.co.il> <52mzzv8d9c.fsf@topspin.com> Message-ID: <20040912200951.GB18013@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] Proposed device enumeration & async event APIs": > Michael> Am I right that you need the ib core loaded always (in a > Michael> way similiar to how you need ip core loaded always) and > Michael> then ulps plug in from above and devices from below? > > Yes (or it could be loaded based on module dependencies when hotplug > loads the first low-level driver). > > Michael> I'm still not sure how ULPs are loaded on a hotplug event > Michael> when a device is added, though. 
> > No need -- you can load the ULP before any devices are present (its > add() method will get called when low-level drivers create devices). > Or one could add /etc/hotplug/infiniband.agent or something like that > to do any initialization required. > But I think ULP initialisation normally fails if there are no devices. So you have to load them when the devices *are* there. For example, it seems I won't be able to ifconfig ib0 up before there's a device with at least one ib port. mst From roland at topspin.com Sun Sep 12 13:21:54 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 12 Sep 2004 13:21:54 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <20040912200951.GB18013@mellanox.co.il> (Michael S. Tsirkin's message of "Sun, 12 Sep 2004 23:09:51 +0300") References: <000501c498f7$7af46e80$655aa8c0@infiniconsys.com> <52r7p78e3k.fsf@topspin.com> <20040912195854.GB17841@mellanox.co.il> <52mzzv8d9c.fsf@topspin.com> <20040912200951.GB18013@mellanox.co.il> Message-ID: <52isaj8cgt.fsf@topspin.com> Michael> But I think ULP initialisation normally fails if there Michael> are no devices. So you have to load them when the Michael> devices *are* there. For example, it seems I won't be Michael> able to ifconfig ib0 up before there's a device with at Michael> least one ib port. That's the old world. In my current tree you can load and unload ipoib and the low-level driver in any order. - R.
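The load-order independence described in this exchange can be modelled roughly as follows. This is a userspace sketch, not the core code: the demo_* names and the fixed-size tables are hypothetical, and it assumes that registering a client replays add() for devices that already exist, which is what loading "in any order" seems to imply.

```c
#include <stddef.h>

#define DEMO_MAX 8 /* fixed-size tables only to keep the sketch short */

struct demo_device { const char *name; };

struct demo_client {
	void (*add)(struct demo_device *dev);
	void (*remove)(struct demo_device *dev);
};

static struct demo_device *demo_devices[DEMO_MAX];
static struct demo_client *demo_clients[DEMO_MAX];
static int demo_num_devices, demo_num_clients;

/* A ULP registering: it is told about every device already present. */
int demo_register_client(struct demo_client *client)
{
	int i;

	if (demo_num_clients == DEMO_MAX)
		return -1;
	demo_clients[demo_num_clients++] = client;
	for (i = 0; i < demo_num_devices; i++)
		client->add(demo_devices[i]);
	return 0;
}

/* A low-level driver creating a device: every registered ULP is told. */
int demo_register_device(struct demo_device *dev)
{
	int i;

	if (demo_num_devices == DEMO_MAX)
		return -1;
	demo_devices[demo_num_devices++] = dev;
	for (i = 0; i < demo_num_clients; i++)
		demo_clients[i]->add(dev);
	return 0;
}

/* A device going away: each ULP gets remove() before the slot is reused. */
void demo_unregister_device(struct demo_device *dev)
{
	int i, j;

	for (i = 0; i < demo_num_devices; i++) {
		if (demo_devices[i] != dev)
			continue;
		for (j = 0; j < demo_num_clients; j++)
			demo_clients[j]->remove(dev);
		demo_devices[i] = demo_devices[--demo_num_devices];
		return;
	}
}

/* Example ULP that only counts the devices it currently knows about. */
int demo_devices_seen;

static void demo_ulp_add(struct demo_device *dev)
{
	(void)dev;
	demo_devices_seen++;
}

static void demo_ulp_remove(struct demo_device *dev)
{
	(void)dev;
	demo_devices_seen--;
}

struct demo_client demo_ulp = { demo_ulp_add, demo_ulp_remove };
```

In this model a ULP that registers with no devices present simply succeeds and sees nothing yet; its add() callback runs later, once per device, as low-level drivers register them.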
From halr at voltaire.com Sun Sep 12 13:23:40 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sun, 12 Sep 2004 16:23:40 -0400 Subject: [openib-general] ib_mad.c comments In-Reply-To: <52zn3v8kmk.fsf@topspin.com> References: <20040908164739.3e9c8723.mshefty@ichips.intel.com> <524qm8gmxg.fsf@topspin.com> <1094901754.1752.1173.camel@localhost.localdomain> <52isakajhy.fsf@topspin.com> <1094926336.1794.1259.camel@localhost.localdomain> <1095009146.1752.1407.camel@localhost.localdomain> <52zn3v8kmk.fsf@topspin.com> Message-ID: <1095020619.1794.1784.camel@localhost.localdomain> On Sun, 2004-09-12 at 13:25, Roland Dreier wrote: > Hal> Another specific issue will come into play for the CM in that > Hal> the callbacks will want to invoke ib_create/modify_qp. Will > Hal> those be callable from non process context ? Just want to be > Hal> sure... > > No, ib_create_qp and ib_modify_qp are not callable from non-process > context. It's up to the CM to defer processing to a process context > (probably using a workqueue). This is exactly the same thing the CM > has to do with timeouts, since Linux timers are called from timer > interrupt context. At some point, we may want to consider what it would take to make ib_[create modify]_qp run from other than process context as this would help with CM performance.
-- Hal From halr at voltaire.com Sun Sep 12 13:45:25 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sun, 12 Sep 2004 16:45:25 -0400 Subject: [openib-general] [PATCH] ib_mad: Use kthread_create rather than daemonize Message-ID: <1095021924.1746.1831.camel@localhost.localdomain> ib_mad: Use kthread_create rather than daemonize Index: ib_mad.c =================================================================== --- ib_mad.c (revision 792) +++ ib_mad.c (working copy) @@ -703,10 +703,6 @@ struct ib_mad_port_private *priv = param; struct ib_mad_thread_data *thread_data = &priv->thread_data; - lock_kernel(); - daemonize("ib_mad-%-6s-%-2d", priv->device->name, priv->port); - unlock_kernel(); - while (1) { if (down_interruptible(&thread_data->sem)) { printk(KERN_DEBUG "Exiting ib_mad thread\n"); @@ -725,13 +721,22 @@ /* * Initialize the IB MAD thread */ -static void ib_mad_thread_init(struct ib_mad_port_private *priv) +static int ib_mad_thread_init(struct ib_mad_port_private *priv) { struct ib_mad_thread_data *thread_data = &priv->thread_data; sema_init(&thread_data->sem, 0); thread_data->run = 1; - kernel_thread(ib_mad_thread, priv, 0); + priv->mad_thread = kthread_create(ib_mad_thread, + priv, + "ib_mad-%-6s-%-2d", + priv->device->name, + priv->port); + if (IS_ERR(priv->mad_thread)) { + printk(KERN_ERR "couldn't start mad thread\n"); + return 1; + } + return 0; } /* @@ -1185,10 +1190,13 @@ priv->recv_posted_mad_count[i] = 0; } - ib_mad_thread_init(priv); + ret = ib_mad_thread_init(priv); + if (ret) + goto error8; + ret = ib_mad_port_start(priv); if (ret) { - printk(KERN_ERR "Could not start port\n"); + printk(KERN_ERR "Couldn't start port\n"); goto error8; } Index: ib_mad_priv.h =================================================================== --- ib_mad_priv.h (revision 792) +++ ib_mad_priv.h (working copy) @@ -57,6 +57,7 @@ #define __IB_MAD_PRIV_H_ #include +#include #define IB_MAD_QPS_CORE 2 /* Always QP0 and QP1 */ @@ -123,6 +124,7 @@ struct 
ib_mad_port_private { struct list_head port_list; + struct task_struct *mad_thread; struct ib_device *device; int port; struct ib_qp *qp[IB_MAD_QPS_SUPPORTED]; From ftillier at infiniconsys.com Sun Sep 12 15:00:38 2004 From: ftillier at infiniconsys.com (Tillier, Fabian) Date: Sun, 12 Sep 2004 18:00:38 -0400 Subject: [openib-general] Proposed device enumeration & async event APIs Message-ID: <5D78D28F88822E4D8702BB9EEF1A4367062389@mercury.infiniconsys.com> > From: Roland Dreier [mailto:roland at topspin.com] > Sent: Sunday, September 12, 2004 12:47 PM > > Fab> How do the above pci_xxx_drvdata functions avoid a malloc? > Fab> Can we do something similar? Either way, I think your > Fab> proposed API will work fine. > > The pci functions only support a single driver, so they can just use a > single void * driver_data member of the struct device. Since IB > allows multiple consumers, the options are to have a table with a > static limit, or to dynamically allocate context for each consumer as > it gets used. I really don't like having static limits so I chose the > second option. > > The chances of a small kmalloc (with GFP_KERNEL) failing are pretty > small, and when such an allocation fails you're usually pretty screwed > anyway. Do you plan on having an allocation per client, or having an array of allocations that you grow dynamically as needed? I would think the array would be beneficial since you end up with fewer allocations. It does require synchronization and a counter to keep track of the array size. Resize would then just do a malloc, copy, and free. > > Fab> Is it valid for a client to call the get function if it did > Fab> not call the set function? Does that just result in NULL > Fab> being retuned? > > Yes. > Great! 
- Fab From halr at voltaire.com Sun Sep 12 17:21:29 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sun, 12 Sep 2004 20:21:29 -0400 Subject: [openib-general] [PATCH] ib_mad: Replace MAD agent list with client ID table Message-ID: <1095034889.1746.2270.camel@localhost.localdomain> ib_mad: Replace MAD agent list with client ID table Index: ib_mad.c =================================================================== --- ib_mad.c (revision 794) +++ ib_mad.c (working copy) @@ -68,8 +68,7 @@ static kmem_cache_t *ib_mad_cache; static struct list_head ib_mad_port_list; -static struct list_head ib_mad_agent_list; -static u32 ib_mad_client_id = 0; +static struct ib_mad_agent_private *ib_mad_client_table[IB_MAD_MAX_CLIENTS]; /* * Locks @@ -78,8 +77,8 @@ /* Port list lock */ static spinlock_t ib_mad_port_list_lock = SPIN_LOCK_UNLOCKED; -/* Agent list lock */ -static spinlock_t ib_mad_agent_list_lock = SPIN_LOCK_UNLOCKED; +/* Client list lock */ +static spinlock_t ib_mad_client_table_lock = SPIN_LOCK_UNLOCKED; /* Forward declarations */ @@ -112,7 +111,7 @@ struct ib_mad_reg_req *reg_req = NULL; struct ib_mad_mgmt_class_table *class; struct ib_mad_mgmt_method_table *method; - int ret2; + int ret2, i; unsigned long flags; u8 mgmt_class; @@ -215,12 +214,17 @@ mad_agent->send_handler = send_handler; mad_agent->context = context; mad_agent->qp = priv->qp[qp_type]; - mad_agent->hi_tid = ++ib_mad_client_id; - /* Add to mad agent list */ - spin_lock_irqsave(&ib_mad_agent_list_lock, flags); - list_add_tail(&mad_agent_priv->agent_list, &ib_mad_agent_list); - spin_unlock_irqrestore(&ib_mad_agent_list_lock, flags); + /* Add mad agent into client table */ + spin_lock_irqsave(&ib_mad_client_table_lock, flags); + for (i = 0; i < IB_MAD_MAX_CLIENTS; i++) { + if (ib_mad_client_table[i] == NULL) { + mad_agent->hi_tid = i; + ib_mad_client_table[i] = mad_agent_priv; + break; + } + } + spin_unlock_irqrestore(&ib_mad_client_table_lock, flags); ret2 = add_mad_reg_req(mad_reg_req, 
mad_agent_priv); spin_unlock_irqrestore(&priv->reg_lock, flags); @@ -232,10 +236,10 @@ return mad_agent; error3: - /* Remove from mad agent list */ - spin_lock_irqsave(&ib_mad_agent_list_lock, flags); - list_del(&mad_agent_priv->agent_list); - spin_unlock_irqrestore(&ib_mad_agent_list_lock, flags); + /* Remove mad agent from client table */ + spin_lock_irqsave(&ib_mad_client_table_lock, flags); + ib_mad_client_table[mad_agent->hi_tid] = NULL; + spin_unlock_irqrestore(&ib_mad_client_table_lock, flags); /* Release allocated structures */ kfree(reg_req); @@ -252,23 +256,24 @@ */ int ib_mad_dereg(struct ib_mad_agent *mad_agent) { - struct ib_mad_agent_private *entry, *temp; + int i; unsigned long flags; - spin_lock_irqsave(&ib_mad_agent_list_lock, flags); - list_for_each_entry_safe(entry, temp, &ib_mad_agent_list, agent_list) { - if (entry->agent == mad_agent) { - remove_mad_reg_req(entry); - list_del(&entry->agent_list); + spin_lock_irqsave(&ib_mad_client_table_lock, flags); + for (i = 0; i < IB_MAD_MAX_CLIENTS; i++) { + if (ib_mad_client_table[i]->agent == mad_agent) { + remove_mad_reg_req(ib_mad_client_table[i]); + + /* Release allocated structures */ + kfree(ib_mad_client_table[i]->reg_req); + kfree(ib_mad_client_table[i]->agent); + kfree(ib_mad_client_table[i]); - /* Release allocated structures */ - kfree(entry->reg_req); - kfree(entry->agent); - kfree(entry); + ib_mad_client_table[i] = NULL; break; } } - spin_unlock_irqrestore(&ib_mad_agent_list_lock, flags); + spin_unlock_irqrestore(&ib_mad_client_table_lock, flags); return 0; } @@ -1257,6 +1262,7 @@ ib_dealloc_pd(priv->pd); ib_destroy_cq(priv->cq); /* Handle deallocation of MAD registration tables!!! 
*/ + kfree(priv); device->mad = NULL; @@ -1360,6 +1366,8 @@ static int __init ib_mad_init_module(void) { + int i; + ib_mad_cache = kmem_cache_create("ib_mad", sizeof(struct ib_mad_private), 0, @@ -1372,8 +1380,11 @@ } INIT_LIST_HEAD(&ib_mad_port_list); - INIT_LIST_HEAD(&ib_mad_agent_list); + for (i = 0; i < IB_MAD_MAX_CLIENTS; i++) { + ib_mad_client_table[i] = NULL; + } + ib_device_notifier_register(&mad_notifier); return 0; Index: ib_mad_priv.h =================================================================== --- ib_mad_priv.h (revision 794) +++ ib_mad_priv.h (working copy) @@ -60,7 +60,9 @@ #include -#define IB_MAD_QPS_CORE 2 /* Always QP0 and QP1 */ +#define IB_MAD_MAX_CLIENTS 32 + +#define IB_MAD_QPS_CORE 2 /* Always QP0 and QP1 as a minimum */ #define IB_MAD_QPS_SUPPORTED 2 /* QP and CQ parameters */ @@ -96,7 +98,6 @@ } __attribute__ ((packed)); struct ib_mad_agent_private { - struct list_head agent_list; struct ib_mad_agent *agent; struct ib_mad_reg_req *reg_req; u8 rmpp_version; Index: TODO =================================================================== --- TODO (revision 794) +++ TODO (working copy) @@ -3,7 +3,6 @@ OpenIB MAD Layer Short Term -Client ID table Use wait queue and wait_event rather than signals and semaphores Finish receive path coding From halr at voltaire.com Sun Sep 12 17:30:55 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sun, 12 Sep 2004 20:30:55 -0400 Subject: [openib-general] ib_mad receive locking In-Reply-To: <524qm3a2t3.fsf@topspin.com> References: <1094929190.1794.1279.camel@localhost.localdomain> <524qm3a2t3.fsf@topspin.com> Message-ID: <1095035454.1746.2292.camel@localhost.localdomain> On Sun, 2004-09-12 at 12:07, Roland Dreier wrote: > Yes, I think the locking should be pretty fine-grained. We definitely > don't want independent HCAs to serialize against each other, and it's > probably good even for different ports on the same HCA to be independent. Agreed. 
It is that "fine" grained already: the locking is per port for receive currently. I was asking about even "finer" graininess (where the receive lock would be per receive list (per QP) but this doesn't make sense until there are multiple threads handling receive which is not the case now. -- Hal From halr at voltaire.com Sun Sep 12 19:52:25 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sun, 12 Sep 2004 22:52:25 -0400 Subject: [openib-general] core_device.c ib_dealloc_device() Message-ID: <1095043944.18370.2.camel@localhost.localdomain> In core_device.c, ib_dealloc_device() calls ib_device_deregister_sysfs(). Where is this defined ? Should this be ib_device_unregister_sysfs() ? -- Hal From roland at topspin.com Sun Sep 12 20:35:33 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 12 Sep 2004 20:35:33 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <5D78D28F88822E4D8702BB9EEF1A4367062389@mercury.infiniconsys.com> (Fabian Tillier's message of "Sun, 12 Sep 2004 18:00:38 -0400") References: <5D78D28F88822E4D8702BB9EEF1A4367062389@mercury.infiniconsys.com> Message-ID: <52ekl696yi.fsf@topspin.com> Fabian> Do you plan on having an allocation per client, or having Fabian> an array of allocations that you grow dynamically as Fabian> needed? I would think the array would be beneficial since Fabian> you end up with fewer allocations. It does require Fabian> synchronization and a counter to keep track of the array Fabian> size. Resize would then just do a malloc, copy, and free. I allocate per client per device. I just wanted to do the dumbest, simplest possible thing, given that the number of clients and devices are probably going to be in single digits on almost all machines. However, now that we have an API defined, we can make whatever changes to the implementation are required later painlessly. 
If you're curious, here's the actual implementation: void *ib_get_client_data(struct ib_device *device, struct ib_client *client) { struct ib_client_data *context; void *ret = NULL; unsigned long flags; spin_lock_irqsave(&device->client_data_lock, flags); list_for_each_entry(context, &device->client_data_list, list) if (context->client == client) { ret = context->data; break; } spin_unlock_irqrestore(&device->client_data_lock, flags); return ret; } int ib_set_client_data(struct ib_device *device, struct ib_client *client, void *data) { struct ib_client_data *context; int ret = 0; unsigned long flags; spin_lock_irqsave(&device->client_data_lock, flags); list_for_each_entry(context, &device->client_data_list, list) if (context->client == client) { context->data = data; spin_unlock_irqrestore(&device->client_data_lock, flags); return 0; } spin_unlock_irqrestore(&device->client_data_lock, flags); context = kmalloc(sizeof *context, GFP_KERNEL); if (!context) return -ENOMEM; context->client = client; context->data = data; spin_lock_irqsave(&device->client_data_lock, flags); list_add(&context->list, &device->client_data_list); spin_unlock_irqrestore(&device->client_data_lock, flags); return 0; } From roland at topspin.com Sun Sep 12 20:37:00 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 12 Sep 2004 20:37:00 -0700 Subject: [openib-general] core_device.c ib_dealloc_device() In-Reply-To: <1095043944.18370.2.camel@localhost.localdomain> (Hal Rosenstock's message of "Sun, 12 Sep 2004 22:52:25 -0400") References: <1095043944.18370.2.camel@localhost.localdomain> Message-ID: <52acvu96w3.fsf@topspin.com> Hal> In core_device.c, ib_dealloc_device() calls Hal> ib_device_deregister_sysfs(). Where is this defined ? Should Hal> this be ib_device_unregister_sysfs() ? Sorry, my fault -- I committed a change to ib_sysfs.c without committing the rest of my tree. 
The simplest thing to do for the moment is to change the function back to ib_device_deregister_sysfs() in ib_sysfs.c. When I check in the rest of my work tomorrow then my branch will compile again. - R. From roland at topspin.com Sun Sep 12 20:41:59 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 12 Sep 2004 20:41:59 -0700 Subject: [openib-general] [PATCH] My current enumeration/async event diff Message-ID: <52656i96ns.fsf@topspin.com> Here's the latest version of my tree. I think the main change since the last version I posted is the addition of ib_get_client_data() and ib_set_client_data(). I'm planning on committing this on Monday unless someone complains. So far all the feedback I've gotten has been positive except that there was some question about register/unregister vs. reg/dereg in function names. If we decide my function names need to change I have no problem with it but it doesn't seem worth waiting to resolve the question before committing. Thanks, Roland Index: infiniband/ulp/ipoib/ipoib_verbs.c =================================================================== --- infiniband/ulp/ipoib/ipoib_verbs.c (revision 759) +++ infiniband/ulp/ipoib/ipoib_verbs.c (working copy) @@ -214,7 +214,7 @@ return -ENODEV; } - priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, dev, + priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, IPOIB_TX_RING_SIZE + IPOIB_RX_RING_SIZE + 1); if (IS_ERR(priv->cq)) { TS_REPORT_FATAL(MOD_IB_NET, "%s: failed to create CQ", @@ -256,7 +256,6 @@ out_free_pd: ib_dealloc_pd(priv->pd); - module_put(priv->ca->owner); return -ENODEV; } @@ -274,101 +273,37 @@ } if (ib_dereg_mr(priv->mr)) - TS_REPORT_WARN(MOD_IB_NET, - "%s: ib_dereg_mr failed", dev->name); + printk(KERN_WARNING "%s: ib_dereg_mr failed\n", dev->name); if (ib_destroy_cq(priv->cq)) - TS_REPORT_WARN(MOD_IB_NET, - "%s: ib_cq_destroy failed", dev->name); + printk(KERN_WARNING "%s: ib_cq_destroy failed\n", dev->name); if (ib_dealloc_pd(priv->pd)) - 
TS_REPORT_WARN(MOD_IB_NET, - "%s: ib_dealloc_pd failed", dev->name); - - module_put(priv->ca->owner); + printk(KERN_WARNING "%s: ib_dealloc_pd failed\n", dev->name); } -static void ipoib_device_notifier(struct ib_device_notifier *self, - struct ib_device *device, int event) +static void ipoib_event(struct ib_event_handler *handler, + struct ib_event *record) { - struct ib_device_attr props; - int port; + struct ipoib_dev_priv *priv = + container_of(handler, struct ipoib_dev_priv, event_handler); - switch (event) { - case IB_DEVICE_NOTIFIER_ADD: - if (ib_query_device(device, &props)) { - TS_REPORT_WARN(MOD_IB_NET, "ib_device_properties_get failed"); - return; - } - - if (device->node_type == IB_NODE_SWITCH) { - if (try_module_get(device->owner)) - ipoib_add_port("ib%d", device, 0); - } else { - for (port = 1; port <= props.phys_port_cnt; ++port) - if (try_module_get(device->owner)) - ipoib_add_port("ib%d", device, port); - } - break; - - case IB_DEVICE_NOTIFIER_REMOVE: - /* Yikes! We don't support devices going away from - underneath us yet! 
*/ - TS_REPORT_WARN(MOD_IB_NET, - "IPoIB driver can't handle removal of device %s", - device->name); - break; - - default: - TS_REPORT_WARN(MOD_IB_NET, "Unknown device notifier event %d."); - break; - } -} - -static struct ib_device_notifier ipoib_notifier = { - .notifier = ipoib_device_notifier -}; - -int ipoib_transport_create_devices(void) -{ - ib_device_notifier_register(&ipoib_notifier); - return 0; -} - -void ipoib_transport_cleanup(void) -{ - ib_device_notifier_deregister(&ipoib_notifier); -} - -static void ipoib_async_event(struct ib_async_event_record *record, - void *priv_ptr) -{ - struct ipoib_dev_priv *priv = priv_ptr; - if (record->event == IB_EVENT_PORT_ACTIVE) { TS_TRACE(MOD_IB_NET, T_VERBOSE, TRACE_IB_NET_GEN, - "%s: Port active Event", priv->dev.name); - - ipoib_ib_dev_flush(&priv->dev); - } else - TS_REPORT_WARN(MOD_IB_NET, - "%s: Unexpected event %d", priv->dev.name, - record->event); + "%s: Port active event", priv->dev.name); + schedule_work(&priv->flush_task); + } } int ipoib_port_monitor_dev_start(struct net_device *dev) { struct ipoib_dev_priv *priv = dev->priv; - struct ib_async_event_record event_record = { - .device = priv->ca, - .event = IB_EVENT_PORT_ACTIVE, - }; - if (ib_async_event_handler_register(&event_record, - ipoib_async_event, - priv, &priv->active_handler)) { - TS_REPORT_FATAL(MOD_IB_NET, - "ib_async_event_handler_register failed for TS_IB_PORT_ACTIVE"); + INIT_IB_EVENT_HANDLER(&priv->event_handler, + priv->ca, ipoib_event); + + if (ib_register_event_handler(&priv->event_handler)) { + printk(KERN_WARNING "ib_handler_register_event failed\n"); return -EINVAL; } @@ -379,7 +314,7 @@ { struct ipoib_dev_priv *priv = dev->priv; - ib_async_event_handler_deregister(priv->active_handler); + ib_unregister_event_handler(&priv->event_handler); } /* Index: infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- infiniband/ulp/ipoib/ipoib_main.c (revision 759) +++ 
infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -605,6 +605,7 @@ INIT_LIST_HEAD(&priv->child_intfs); INIT_LIST_HEAD(&priv->multicast_list); + INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush, &priv->dev); INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task, &priv->dev); priv->dev.priv = priv; @@ -691,22 +692,59 @@ return result; } +static void ipoib_add_one(struct ib_device *device) +{ + struct ib_device_attr props; + int port; + + if (ib_query_device(device, &props)) { + TS_REPORT_WARN(MOD_IB_NET, "ib_device_properties_get failed"); + return; + } + + if (device->node_type == IB_NODE_SWITCH) + ipoib_add_port("ib%d", device, 0); + else + for (port = 1; port <= props.phys_port_cnt; ++port) + ipoib_add_port("ib%d", device, port); +} + +static void ipoib_remove_one(struct ib_device *device) +{ + struct ipoib_dev_priv *priv, *tmp; + + LIST_HEAD(delete); + + down(&ipoib_device_mutex); + list_for_each_entry_safe(priv, tmp, &ipoib_device_list, list) { + if (priv->ca == device) { + list_del(&priv->list); + list_add_tail(&priv->list, &delete); + } + } + up(&ipoib_device_mutex); + + list_for_each_entry_safe(priv, tmp, &delete, list) { + unregister_netdev(&priv->dev); + ipoib_port_monitor_dev_stop(&priv->dev); + ipoib_dev_cleanup(&priv->dev); + kfree(priv); + } +} + +static struct ib_client ipoib_client = { + .add = ipoib_add_one, + .remove = ipoib_remove_one +}; + static int __init ipoib_init_module(void) { int ret; - ret = ipoib_transport_create_devices(); + ret = ib_register_client(&ipoib_client); if (ret) return ret; - down(&ipoib_device_mutex); - if (list_empty(&ipoib_device_list)) { - up(&ipoib_device_mutex); - ipoib_transport_cleanup(); - return -ENODEV; - } - up(&ipoib_device_mutex); - ipoib_vlan_init(); return 0; @@ -714,22 +752,8 @@ static void __exit ipoib_cleanup_module(void) { - struct ipoib_dev_priv *priv, *tpriv; - ipoib_vlan_cleanup(); - ipoib_transport_cleanup(); - - down(&ipoib_device_mutex); - list_for_each_entry_safe(priv, tpriv, &ipoib_device_list, 
list) { - ipoib_port_monitor_dev_stop(&priv->dev); - ipoib_dev_cleanup(&priv->dev); - unregister_netdev(&priv->dev); - - list_del(&priv->list); - - kfree(priv); - } - up(&ipoib_device_mutex); + ib_unregister_client(&ipoib_client); } module_init(ipoib_init_module); Index: infiniband/ulp/ipoib/ipoib.h =================================================================== --- infiniband/ulp/ipoib/ipoib.h (revision 759) +++ infiniband/ulp/ipoib/ipoib.h (working copy) @@ -117,6 +117,7 @@ atomic_t mcast_joins; + struct work_struct flush_task; struct work_struct restart_task; struct ib_device *ca; @@ -152,7 +153,7 @@ struct proc_dir_entry *arp_proc_entry; struct proc_dir_entry *mcast_proc_entry; - struct ib_async_event_handler *active_handler; + struct ib_event_handler event_handler; struct net_device_stats stats; }; @@ -253,10 +254,6 @@ int ipoib_add_port(const char *format, struct ib_device *device, tTS_IB_PORT port); -int ipoib_transport_create_devices(void); - -void ipoib_transport_cleanup(void); - int ipoib_port_monitor_dev_start(struct net_device *dev); void ipoib_port_monitor_dev_stop(struct net_device *dev); Index: infiniband/ulp/ipoib/ip2pr_link.c =================================================================== --- infiniband/ulp/ipoib/ip2pr_link.c (revision 759) +++ infiniband/ulp/ipoib/ip2pr_link.c (working copy) @@ -27,8 +27,7 @@ static tTS_KERNEL_TIMER_STRUCT _tsIp2prPathTimer; static tIP2PR_PATH_LOOKUP_ID _tsIp2prPathLookupId = 0; -static struct ib_async_event_handler *_tsIp2prAsyncErrHandle[IP2PR_MAX_HCAS]; -static struct ib_async_event_handler *_tsIp2prAsyncActHandle[IP2PR_MAX_HCAS]; +static struct ib_event_handler _tsIp2prEventHandle[IP2PR_MAX_HCAS]; static unsigned int ip2pr_total_req = 0; static unsigned int ip2pr_arp_timeout = 0; @@ -1311,9 +1310,9 @@ return 0; } -/* ip2pr_async_event_func -- IB async event handler, for clearing caches */ -static void ip2pr_async_event_func(struct ib_async_event_record *record, - void *arg) +/* ip2pr_event_func -- IB 
async event handler, for clearing caches */ +static void ip2pr_event_func(struct ib_event_handler *handler, + struct ib_event *record) { struct ip2pr_path_element *path_elmt; s32 result; @@ -1321,15 +1320,10 @@ unsigned long flags; struct ip2pr_gid_pr_element *prn_elmt; - if (NULL == record) { - - TS_TRACE(MOD_IP2PR, T_VERBOSE, TRACE_FLOW_WARN, - "ASYNC: Event with no record of what happened?"); + if (record->event != IB_EVENT_PORT_ACTIVE && + record->event != IB_EVENT_PORT_ERR) return; - } - /* if */ - TS_TRACE(MOD_IP2PR, T_VERBOSE, TRACE_FLOW_WARN, - "ASYNC: Event <%d> reported, clearing cache."); + /* * destroy all cached path record elements. */ @@ -1346,18 +1340,18 @@ for (sgid_elmt = _tsIp2prLinkRoot.src_gid_list; NULL != sgid_elmt; sgid_elmt = sgid_elmt->next) { if ((sgid_elmt->ca == record->device) && - (sgid_elmt->port == record->modifier.port)) { + (sgid_elmt->port == record->element.port_num)) { sgid_elmt->port_state = record->event == IB_EVENT_PORT_ACTIVE ? IB_PORT_ACTIVE : IB_PORT_DOWN; /* Gid could have changed. Get the gid */ if (ib_cached_gid_get(record->device, - record->modifier.port, + record->element.port_num, 0, sgid_elmt->gid)) { TS_TRACE(MOD_IP2PR, T_VERBOSE, TRACE_FLOW_WARN, "Could not get GID: on hca=<%d>,port=<%d>, event=%d", - record->device, record->modifier.port, + record->device, record->element.port_num, record->event); /* for now zero it. 
Will get it, when user queries */ @@ -1375,7 +1369,7 @@ TS_TRACE(MOD_IP2PR, T_VERBOSE, TRACE_FLOW_WARN, "Async Port Event on hca=<%d>,port=<%d>, event=%d", - record->device, record->modifier.port, record->event); + record->device, record->element.port_num, record->event); return; } @@ -2074,7 +2068,6 @@ s32 ip2pr_link_addr_init(void) { s32 result = 0; - struct ib_async_event_record evt_rec; int i; struct ib_device *hca_device; @@ -2138,43 +2131,18 @@ * Install async event handler, to clear cache on port down */ - for (i = 0; i < IP2PR_MAX_HCAS; i++) { - _tsIp2prAsyncErrHandle[i] = TS_IP2PR_INVALID_ASYNC_HANDLE; - _tsIp2prAsyncActHandle[i] = TS_IP2PR_INVALID_ASYNC_HANDLE; - } - for (i = 0; ((hca_device = ib_device_get_by_index(i)) != NULL); ++i) { - evt_rec.device = hca_device; - evt_rec.event = IB_PORT_ERROR; - result = ib_async_event_handler_register(&evt_rec, - ip2pr_async_event_func, - NULL, - &_tsIp2prAsyncErrHandle - [i]); - if (0 != result) { - + INIT_IB_EVENT_HANDLER(&_tsIp2prEventHandle[i], + hca_device, ip2pr_event_func); + result = ib_register_event_handler(&_tsIp2prEventHandle[i]); + if (result) { TS_TRACE(MOD_IP2PR, T_VERBOSE, TRACE_FLOW_WARN, "INIT: Error <%d> registering event handler.", result); goto error_async; } - /* if */ - evt_rec.device = hca_device; - evt_rec.event = IB_EVENT_PORT_ACTIVE; - result = ib_async_event_handler_register(&evt_rec, - ip2pr_async_event_func, - NULL, - &_tsIp2prAsyncActHandle - [i]); - if (0 != result) { + } - TS_TRACE(MOD_IP2PR, T_VERBOSE, TRACE_FLOW_WARN, - "INIT: Error <%d> registering event handler.", - result); - goto error_async; - } /* if */ - } /* for */ - /* * create timer for pruning path record cache. 
*/ @@ -2198,16 +2166,9 @@ return 0; error_async: - for (i = 0; i < IP2PR_MAX_HCAS; i++) { - if (_tsIp2prAsyncErrHandle[i] != TS_IP2PR_INVALID_ASYNC_HANDLE) { - ib_async_event_handler_deregister(_tsIp2prAsyncErrHandle - [i]); - } - if (_tsIp2prAsyncActHandle[i] != TS_IP2PR_INVALID_ASYNC_HANDLE) { - ib_async_event_handler_deregister(_tsIp2prAsyncActHandle - [i]); - } - } + for (i = 0; i < IP2PR_MAX_HCAS; i++) + if (_tsIp2prEventHandle[i].device) + ib_unregister_event_handler(&_tsIp2prEventHandle[i]); kmem_cache_destroy(_tsIp2prLinkRoot.user_req); error_user: @@ -2243,16 +2204,9 @@ /* * release async event handler(s) */ - for (i = 0; i < IP2PR_MAX_HCAS; i++) { - if (_tsIp2prAsyncErrHandle[i] != TS_IP2PR_INVALID_ASYNC_HANDLE) { - ib_async_event_handler_deregister(_tsIp2prAsyncErrHandle - [i]); - } - if (_tsIp2prAsyncActHandle[i] != TS_IP2PR_INVALID_ASYNC_HANDLE) { - ib_async_event_handler_deregister(_tsIp2prAsyncActHandle - [i]); - } - } + for (i = 0; i < IP2PR_MAX_HCAS; i++) + if (_tsIp2prEventHandle[i].device) + ib_unregister_event_handler(&_tsIp2prEventHandle[i]); /* * clear wait list Index: infiniband/ulp/srp/srp_host.c =================================================================== --- infiniband/ulp/srp/srp_host.c (revision 759) +++ infiniband/ulp/srp/srp_host.c (working copy) @@ -563,6 +563,7 @@ target->cqs_hndl[hca_index] = ib_create_cq(hca->ca_hndl, cq_send_event, + NULL, target, MAX_SEND_WQES); @@ -583,6 +584,7 @@ target->cqr_hndl[hca_index] = ib_create_cq(hca->ca_hndl, cq_recv_event, + NULL, target, MAX_RECV_WQES); Index: infiniband/ulp/srp/srp_dm.c =================================================================== --- infiniband/ulp/srp/srp_dm.c (revision 759) +++ infiniband/ulp/srp/srp_dm.c (working copy) @@ -1388,7 +1388,8 @@ return (status); } -void srp_hca_async_event_handler(struct ib_async_event_record *event, void *arg) +void srp_hca_async_event_handler(struct ib_event_handler *handler, + struct ib_event *event) { int hca_index; 
srp_host_port_params_t *port; @@ -1406,7 +1407,7 @@ } hca = &hca_params[hca_index]; - port = &hca->port[event->modifier.port - 1]; + port = &hca->port[event->element.port_num - 1]; switch (event->event) { @@ -1418,7 +1419,7 @@ */ TS_REPORT_WARN(MOD_SRPTP, "Port active event for hca %d port %d", - hca_index + 1, event->modifier.port); + hca_index + 1, event->element.port_num); if (!port->valid) break; @@ -1434,7 +1435,7 @@ up(&driver_params.sema); break; - case IB_LOCAL_CATASTROPHIC_ERROR: + case IB_EVENT_DEVICE_FATAL: { int port_index; @@ -1454,17 +1455,17 @@ if (!hca->port[port_index].valid) break; - event->event = IB_PORT_ERROR; + event->event = IB_EVENT_PORT_ERR; - event->modifier.port = + event->element.port_num = hca->port[port_index].local_port; - srp_hca_async_event_handler(event, NULL); + srp_hca_async_event_handler(handler, event); } } break; - case IB_PORT_ERROR: + case IB_EVENT_PORT_ERR: { u32 i; int ioc_index; @@ -1473,7 +1474,7 @@ TS_REPORT_WARN(MOD_SRPTP, "Port error event for hca %d port %d", - hca_index + 1, event->modifier.port); + hca_index + 1, event->element.port_num); if (!port->valid) break; @@ -1554,24 +1555,15 @@ } break; - case IB_LID_CHANGE: - break; - - case IB_PKEY_CHANGE: - break; - default: - TS_REPORT_FATAL(MOD_SRPTP, "Unsupported event type %d", - event->event); break; } } int srp_dm_init(void) { - int i, async_event_index, hca_index, status; + int hca_index, status; srp_host_hca_params_t *hca; - struct ib_async_event_record async_record; max_path_record_cache = max_srp_targets * MAX_LOCAL_PORTS; @@ -1610,27 +1602,16 @@ "Registering async events handler for HCA %d", hca->hca_index); - async_record.device = hca->ca_hndl; + INIT_IB_EVENT_HANDLER(&hca->event_handler, hca->ca_hndl, + srp_hca_async_event_handler); + status = ib_register_event_handler(&hca->event_handler); - async_event_index = IB_LOCAL_CATASTROPHIC_ERROR; - for (i = 0; i < MAX_ASYNC_EVENT_HANDLES; i++) { - async_record.event = async_event_index; - status = 
ib_async_event_handler_register(&async_record, - srp_hca_async_event_handler, - hca, - &hca-> - async_handles - [i]); - - if (status) { - TS_REPORT_FATAL(MOD_SRPTP, - "Registration of async event " - "%d on hca %d failed", - i, hca->hca_index, status); - return (-EINVAL); - } - - async_event_index++; + if (status) { + TS_REPORT_FATAL(MOD_SRPTP, + "Registration of async event " + "hca %d failed", + hca->hca_index, status); + return (-EINVAL); } } @@ -1646,7 +1627,7 @@ void srp_dm_unload(void) { srp_host_hca_params_t *hca; - int i, hca_index; + int hca_index; /* * Unegister for async events on the HCA @@ -1665,9 +1646,7 @@ * Loop through the async handles for the HCA and * deregister them. */ - for (i = 0; i < MAX_ASYNC_EVENT_HANDLES; i++) { - ib_async_event_handler_deregister(hca->async_handles[i]); - } + ib_unregister_event_handler(&hca->event_handler); } /* Register with DM to register for async notification */ Index: infiniband/ulp/srp/srp_host.h =================================================================== --- infiniband/ulp/srp/srp_host.h (revision 759) +++ infiniband/ulp/srp/srp_host.h (working copy) @@ -161,7 +161,7 @@ struct _srp_host_port_params port[MAX_LOCAL_PORTS_PER_HCA]; - struct ib_async_event_handler *async_handles[MAX_ASYNC_EVENT_HANDLES]; + struct ib_event_handler event_handler; } srp_host_hca_params_t; Index: infiniband/ulp/srp/srptp.c =================================================================== --- infiniband/ulp/srp/srptp.c (revision 759) +++ infiniband/ulp/srp/srptp.c (working copy) @@ -681,6 +681,8 @@ init_attr.rq_sig_type = IB_SIGNAL_ALL_WR; init_attr.qp_type = IB_QPT_RC; + init_attr.event_handler = NULL; + conn->qp_hndl = ib_create_qp(hca->pd_hndl, &init_attr, &qp_cap); if (IS_ERR(conn->qp_hndl)) { TS_REPORT_FATAL(MOD_SRPTP, "QP Create failed %d", Index: infiniband/ulp/sdp/sdp_conn.c =================================================================== --- infiniband/ulp/sdp/sdp_conn.c (revision 759) +++ 
infiniband/ulp/sdp/sdp_conn.c (working copy) @@ -1065,6 +1065,7 @@ if (!conn->send_cq) { conn->send_cq = ib_create_cq(conn->ca, sdp_cq_event_handler, + NULL, (void *)(unsigned long)conn->hashent, conn->send_cq_size); if (IS_ERR(conn->send_cq)) { @@ -1091,6 +1092,7 @@ if (!conn->recv_cq) { conn->recv_cq = ib_create_cq(conn->ca, sdp_cq_event_handler, + NULL, (void *)(unsigned long)conn->hashent, conn->recv_cq_size); Index: infiniband/include/ib_verbs.h =================================================================== --- infiniband/include/ib_verbs.h (revision 759) +++ infiniband/include/ib_verbs.h (working copy) @@ -206,6 +206,46 @@ u8 init_type; }; +enum ib_event_type { + IB_EVENT_CQ_ERR, + IB_EVENT_QP_FATAL, + IB_EVENT_QP_REQ_ERR, + IB_EVENT_QP_ACCESS_ERR, + IB_EVENT_COMM_EST, + IB_EVENT_SQ_DRAINED, + IB_EVENT_PATH_MIG, + IB_EVENT_PATH_MIG_ERR, + IB_EVENT_DEVICE_FATAL, + IB_EVENT_PORT_ACTIVE, + IB_EVENT_PORT_ERR, + IB_EVENT_LID_CHANGE, + IB_EVENT_PKEY_CHANGE, + IB_EVENT_SM_CHANGE +}; + +struct ib_event { + struct ib_device *device; + union { + struct ib_cq *cq; + struct ib_qp *qp; + u8 port_num; + } element; + enum ib_event_type event; +}; + +struct ib_event_handler { + struct ib_device *device; + void (*handler)(struct ib_event_handler *, struct ib_event *); + struct list_head list; +}; + +#define INIT_IB_EVENT_HANDLER(_ptr, _device, _handler) \ + do { \ + (_ptr)->device = _device; \ + (_ptr)->handler = _handler; \ + INIT_LIST_HEAD(&(_ptr)->list); \ + } while (0) + struct ib_global_route { union ib_gid dgid; u32 flow_label; @@ -316,6 +356,7 @@ }; struct ib_qp_init_attr { + void (*event_handler)(struct ib_event *, void *); void *qp_context; struct ib_cq *send_cq; struct ib_cq *recv_cq; @@ -549,6 +590,7 @@ struct ib_cq { struct ib_device *device; ib_comp_handler comp_handler; + void (*event_handler)(struct ib_event *, void *); void * context; int cqe; atomic_t usecnt; /* count number of work queues */ @@ -567,6 +609,7 @@ struct ib_cq *send_cq; struct ib_cq 
*recv_cq; struct ib_srq *srq; + void (*event_handler)(struct ib_event *, void *); void *qp_context; u32 qp_num; }; @@ -600,8 +643,14 @@ struct pci_dev *dma_device; char name[IB_DEVICE_NAME_MAX]; - char *provider; + + struct list_head event_handler_list; + spinlock_t event_handler_lock; + struct list_head core_list; + struct list_head client_data_list; + spinlock_t client_data_lock; + void *core; void *mad; u32 flags; @@ -709,11 +758,30 @@ u8 node_type; }; +struct ib_client { + void (*add) (struct ib_device *); + void (*remove)(struct ib_device *); + + struct list_head list; +}; + struct ib_device *ib_alloc_device(size_t size); void ib_dealloc_device(struct ib_device *device); -int ib_register_device (struct ib_device *device); -int ib_deregister_device(struct ib_device *device); +int ib_register_device (struct ib_device *device); +void ib_unregister_device(struct ib_device *device); + +int ib_register_client (struct ib_client *client); +void ib_unregister_client(struct ib_client *client); + +void *ib_get_client_data(struct ib_device *device, struct ib_client *client); +int ib_set_client_data(struct ib_device *device, struct ib_client *client, + void *data); + +int ib_register_event_handler (struct ib_event_handler *event_handler); +int ib_unregister_event_handler(struct ib_event_handler *event_handler); +void ib_dispatch_event(struct ib_event *event); + int ib_query_device(struct ib_device *device, struct ib_device_attr *device_attr); @@ -774,6 +842,7 @@ struct ib_cq *ib_create_cq(struct ib_device *device, ib_comp_handler comp_handler, + void (*event_handler)(struct ib_event *, void *), void *cq_context, int cqe); int ib_resize_cq(struct ib_cq *cq, int cqe); Index: infiniband/include/ts_ib_core_types.h =================================================================== --- infiniband/include/ts_ib_core_types.h (revision 759) +++ infiniband/include/ts_ib_core_types.h (working copy) @@ -74,59 +74,12 @@ #ifdef __KERNEL__ -enum ib_async_event { - IB_QP_PATH_MIGRATED, - 
IB_EEC_PATH_MIGRATED, - IB_QP_COMMUNICATION_ESTABLISHED, - IB_EEC_COMMUNICATION_ESTABLISHED, - IB_SEND_QUEUE_DRAINED, - IB_CQ_ERROR, - IB_LOCAL_WQ_INVALID_REQUEST_ERROR, - IB_LOCAL_WQ_ACCESS_VIOLATION_ERROR, - IB_LOCAL_WQ_CATASTROPHIC_ERROR, - IB_PATH_MIGRATION_ERROR, - IB_LOCAL_EEC_CATASTROPHIC_ERROR, - IB_LOCAL_CATASTROPHIC_ERROR, - IB_PORT_ERROR, - IB_EVENT_PORT_ACTIVE, - IB_LID_CHANGE, - IB_PKEY_CHANGE, -}; - -struct ib_async_event_handler; /* actual definition in core_async.c */ - -struct ib_async_event_record { - struct ib_device *device; - enum ib_async_event event; - union { - struct ib_qp *qp; - struct ib_eec *eec; - struct ib_cq *cq; - int port; - } modifier; -}; - -typedef void (*ib_async_event_handler_func)(struct ib_async_event_record *record, - void *arg); - /* enum definitions */ #define IB_MULTICAST_QPN 0xffffff /* structures */ -enum { - IB_DEVICE_NOTIFIER_ADD, - IB_DEVICE_NOTIFIER_REMOVE -}; - -struct ib_device_notifier { - void (*notifier)(struct ib_device_notifier *self, - struct ib_device *device, - int event); - struct list_head list; -}; - struct ib_sm_path { u16 sm_lid; tTS_IB_SL sm_sl; Index: infiniband/include/ts_ib_core.h =================================================================== --- infiniband/include/ts_ib_core.h (revision 759) +++ infiniband/include/ts_ib_core.h (working copy) @@ -38,17 +38,9 @@ } } -struct ib_device *ib_device_get_by_name(const char *name); -struct ib_device *ib_device_get_by_index(int index); -int ib_device_notifier_register(struct ib_device_notifier *notifier); -int ib_device_notifier_deregister(struct ib_device_notifier *notifier); +struct ib_device *ib_device_get_by_name(const char *name) __deprecated; +struct ib_device *ib_device_get_by_index(int index) __deprecated; -int ib_async_event_handler_register(struct ib_async_event_record *record, - ib_async_event_handler_func function, - void *arg, - struct ib_async_event_handler **handle); -int ib_async_event_handler_deregister(struct ib_async_event_handler 
*handle); - int ib_cached_node_guid_get(struct ib_device *device, tTS_IB_GUID node_guid); int ib_cached_port_properties_get(struct ib_device *device, Index: infiniband/include/ts_ib_provider.h =================================================================== --- infiniband/include/ts_ib_provider.h (revision 715) +++ infiniband/include/ts_ib_provider.h (working copy) @@ -1,38 +0,0 @@ -/* - This software is available to you under a choice of one of two - licenses. You may choose to be licensed under the terms of the GNU - General Public License (GPL) Version 2, available at - , or the OpenIB.org BSD - license, available in the LICENSE.TXT file accompanying this - software. These details are also available at - . - - THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - SOFTWARE. - - Copyright (c) 2004 Topspin Communications. All rights reserved. 
- - $Id$ -*/ - -#ifndef _TS_IB_PROVIDER_H -#define _TS_IB_PROVIDER_H - -#include - -void ib_async_event_dispatch(struct ib_async_event_record *event_record); - -#endif /* _TS_IB_PROVIDER_H */ - -/* - Local Variables: - c-file-style: "linux" - indent-tabs-mode: t - End: -*/ Index: infiniband/core/Makefile =================================================================== --- infiniband/core/Makefile (revision 759) +++ infiniband/core/Makefile (working copy) @@ -36,10 +36,9 @@ header_ud.o \ ib_verbs.o \ ib_sysfs.o \ + ib_device.o \ core_main.o \ - core_device.o \ core_fmr_pool.o \ - core_async.o \ core_cache.o \ core_proc.o Index: infiniband/core/core_cache.c =================================================================== --- infiniband/core/core_cache.c (revision 759) +++ infiniband/core/core_cache.c (working copy) @@ -260,71 +260,9 @@ } EXPORT_SYMBOL(ib_cached_pkey_find); -int ib_cache_setup(struct ib_device *device) +static void ib_cache_update(struct ib_device *device, + tTS_IB_PORT port) { - struct ib_device_private *priv = device->core; - struct ib_port_attr prop; - int p; - int ret; - - for (p = priv->start_port; p <= priv->end_port; ++p) { - priv->port_data[p].gid_table = NULL; - priv->port_data[p].pkey_table = NULL; - } - - for (p = priv->start_port; p <= priv->end_port; ++p) { - seqcount_init(&priv->port_data[p].lock); - ret = device->query_port(device, p, &prop); - if (ret) { - TS_REPORT_WARN(MOD_KERNEL_IB, - "query_port failed for %s", - device->name); - goto error; - } - priv->port_data[p].gid_table_alloc_length = prop.gid_tbl_len; - priv->port_data[p].gid_table = kmalloc(prop.gid_tbl_len * sizeof (tTS_IB_GID), - GFP_KERNEL); - if (!priv->port_data[p].gid_table) { - ret = -ENOMEM; - goto error; - } - - priv->port_data[p].pkey_table_alloc_length = prop.pkey_tbl_len; - priv->port_data[p].pkey_table = kmalloc(prop.pkey_tbl_len * sizeof (u16), - GFP_KERNEL); - if (!priv->port_data[p].pkey_table) { - ret = -ENOMEM; - goto error; - } - - 
ib_cache_update(device, p); - } - - return 0; - - error: - for (p = priv->start_port; p <= priv->end_port; ++p) { - kfree(priv->port_data[p].gid_table); - kfree(priv->port_data[p].pkey_table); - } - - return ret; -} - -void ib_cache_cleanup(struct ib_device *device) -{ - struct ib_device_private *priv = device->core; - int p; - - for (p = priv->start_port; p <= priv->end_port; ++p) { - kfree(priv->port_data[p].gid_table); - kfree(priv->port_data[p].pkey_table); - } -} - -void ib_cache_update(struct ib_device *device, - tTS_IB_PORT port) -{ struct ib_device_private *priv = device->core; struct ib_port_data *info = &priv->port_data[port]; struct ib_port_attr *tprops = NULL; @@ -405,6 +343,104 @@ kfree(tgid); } +static void ib_cache_task(void *port_ptr) +{ + struct ib_port_data *port_data = port_ptr; + + ib_cache_update(port_data->device, port_data->port_num); +} + +static void ib_cache_event(struct ib_event_handler *handler, + struct ib_event *event) +{ + if (event->event == IB_EVENT_PORT_ERR || + event->event == IB_EVENT_PORT_ACTIVE || + event->event == IB_EVENT_LID_CHANGE || + event->event == IB_EVENT_PKEY_CHANGE || + event->event == IB_EVENT_SM_CHANGE) { + struct ib_device_private *priv = event->device->core; + schedule_work(&priv->port_data[event->element.port_num].refresh_task); + } +} + +int ib_cache_setup(struct ib_device *device) +{ + struct ib_device_private *priv = device->core; + struct ib_port_attr prop; + int p; + int ret; + + for (p = priv->start_port; p <= priv->end_port; ++p) { + priv->port_data[p].device = device; + priv->port_data[p].port_num = p; + INIT_WORK(&priv->port_data[p].refresh_task, + ib_cache_task, &priv->port_data[p]); + priv->port_data[p].gid_table = NULL; + priv->port_data[p].pkey_table = NULL; + priv->port_data[p].event_handler.device = NULL; + } + + for (p = priv->start_port; p <= priv->end_port; ++p) { + seqcount_init(&priv->port_data[p].lock); + ret = device->query_port(device, p, &prop); + if (ret) { + 
TS_REPORT_WARN(MOD_KERNEL_IB, + "query_port failed for %s", + device->name); + goto error; + } + priv->port_data[p].gid_table_alloc_length = prop.gid_tbl_len; + priv->port_data[p].gid_table = kmalloc(prop.gid_tbl_len * sizeof (tTS_IB_GID), + GFP_KERNEL); + if (!priv->port_data[p].gid_table) { + ret = -ENOMEM; + goto error; + } + + priv->port_data[p].pkey_table_alloc_length = prop.pkey_tbl_len; + priv->port_data[p].pkey_table = kmalloc(prop.pkey_tbl_len * sizeof (u16), + GFP_KERNEL); + if (!priv->port_data[p].pkey_table) { + ret = -ENOMEM; + goto error; + } + + ib_cache_update(device, p); + + INIT_IB_EVENT_HANDLER(&priv->port_data[p].event_handler, + device, ib_cache_event); + ret = ib_register_event_handler(&priv->port_data[p].event_handler); + if (ret) { + priv->port_data[p].event_handler.device = NULL; + goto error; + } + } + + return 0; + + error: + for (p = priv->start_port; p <= priv->end_port; ++p) { + if (priv->port_data[p].event_handler.device) + ib_unregister_event_handler(&priv->port_data[p].event_handler); + kfree(priv->port_data[p].gid_table); + kfree(priv->port_data[p].pkey_table); + } + + return ret; +} + +void ib_cache_cleanup(struct ib_device *device) +{ + struct ib_device_private *priv = device->core; + int p; + + for (p = priv->start_port; p <= priv->end_port; ++p) { + ib_unregister_event_handler(&priv->port_data[p].event_handler); + kfree(priv->port_data[p].gid_table); + kfree(priv->port_data[p].pkey_table); + } +} + /* Local Variables: c-file-style: "linux" Index: infiniband/core/core_priv.h =================================================================== --- infiniband/core/core_priv.h (revision 759) +++ infiniband/core/core_priv.h (working copy) @@ -24,16 +24,14 @@ #ifndef _CORE_PRIV_H #define _CORE_PRIV_H +#include +#include + #include -#include "ts_ib_provider.h" #include "ts_kernel_services.h" #include "ts_kernel_thread.h" -#include -#include -#include - enum { IB_PORT_CAP_SM, IB_PORT_CAP_SNMP_TUN, @@ -48,18 +46,17 @@ tTS_IB_GUID 
node_guid; struct ib_port_data *port_data; - struct list_head async_handler_list; - spinlock_t async_handler_lock; - - tTS_KERNEL_QUEUE_THREAD async_thread; - struct ib_core_proc *proc; }; struct ib_port_data { + struct ib_device *device; spinlock_t port_cap_lock; int port_cap_count[IB_PORT_CAP_NUM]; + struct ib_event_handler event_handler; + struct work_struct refresh_task; + seqcount_t lock; struct ib_port_attr properties; struct ib_sm_path sm_path; @@ -68,11 +65,11 @@ u16 pkey_table_alloc_length; union ib_gid *gid_table; u16 *pkey_table; + u8 port_num; }; int ib_cache_setup(struct ib_device *device); void ib_cache_cleanup(struct ib_device *device); -void ib_cache_update(struct ib_device *device, tTS_IB_PORT port); int ib_proc_setup(struct ib_device *device, int is_switch); void ib_proc_cleanup(struct ib_device *device); int ib_create_proc_dir(void); @@ -81,7 +78,7 @@ void ib_async_thread(struct list_head *entry, void *device_ptr); int ib_device_register_sysfs(struct ib_device *device); -void ib_device_deregister_sysfs(struct ib_device *device); +void ib_device_unregister_sysfs(struct ib_device *device); int ib_sysfs_setup(void); void ib_sysfs_cleanup(void); Index: infiniband/core/ib_device.c =================================================================== --- infiniband/core/ib_device.c (revision 715) +++ infiniband/core/ib_device.c (working copy) @@ -1,28 +1,26 @@ /* - This software is available to you under a choice of one of two - licenses. You may choose to be licensed under the terms of the GNU - General Public License (GPL) Version 2, available at - , or the OpenIB.org BSD - license, available in the LICENSE.TXT file accompanying this - software. These details are also available at - . + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id$ + */ - THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - SOFTWARE. - - Copyright (c) 2004 Topspin Communications. All rights reserved. - - $Id$ -*/ - -#include "ts_kernel_services.h" - #include #include #include @@ -32,10 +30,24 @@ #include "core_priv.h" +struct ib_client_data { + struct list_head list; + struct ib_client *client; + void * data; +}; + static LIST_HEAD(device_list); -static LIST_HEAD(notifier_list); -static DECLARE_MUTEX(device_lock); +static LIST_HEAD(client_list); +/* + * device_sem protects access to both device_list and client_list. + * There's no real point to using multiple locks or something fancier + * like an rwsem: we always access both lists, and we're always + * modifying one list or the other list. 
In any case this is not a + * hot path so there's no point in trying to optimize. + */ +static DECLARE_MUTEX(device_sem); + static int ib_device_check_mandatory(struct ib_device *device) { #define IB_MANDATORY_FUNC(x) { offsetof(struct ib_device, x), #x } @@ -145,7 +157,7 @@ BUG_ON(device->reg_state != IB_DEV_UNREGISTERED); - ib_device_deregister_sysfs(device); + ib_device_unregister_sysfs(device); } EXPORT_SYMBOL(ib_dealloc_device); @@ -156,18 +168,19 @@ int ret; int p; - if (ib_device_check_mandatory(device)) { - return -EINVAL; - } + down(&device_sem); - down(&device_lock); - if (strchr(device->name, '%')) { ret = alloc_name(device->name); if (ret) goto out; } + if (ib_device_check_mandatory(device)) { + ret = -EINVAL; + goto out; + } + priv = kmalloc(sizeof *priv, GFP_KERNEL); if (!priv) { printk(KERN_WARNING "Couldn't allocate private struct for %s\n", @@ -209,8 +222,10 @@ device->core = priv; - INIT_LIST_HEAD(&priv->async_handler_list); - spin_lock_init(&priv->async_handler_lock); + INIT_LIST_HEAD(&device->event_handler_list); + INIT_LIST_HEAD(&device->client_data_list); + spin_lock_init(&device->event_handler_lock); + spin_lock_init(&device->client_data_lock); ret = ib_cache_setup(device); if (ret) { @@ -219,21 +234,11 @@ goto out_free_port; } - ret = tsKernelQueueThreadStart("ts_ib_async", - ib_async_thread, - device, - &priv->async_thread); - if (ret) { - printk(KERN_WARNING "Couldn't start async thread for %s\n", - device->name); - goto out_free_cache; - } - ret = ib_proc_setup(device, device->node_type == IB_NODE_SWITCH); if (ret) { printk(KERN_WARNING "Couldn't create /proc dir for %s\n", device->name); - goto out_stop_async; + goto out_free_cache; } if (ib_device_register_sysfs(device)) { @@ -243,27 +248,23 @@ } list_add_tail(&device->core_list, &device_list); + + device->reg_state = IB_DEV_REGISTERED; + { - struct list_head *ptr; - struct ib_device_notifier *notifier; + struct ib_client *client; - list_for_each(ptr, &notifier_list) { - notifier = 
list_entry(ptr, struct ib_device_notifier, list); - notifier->notifier(notifier, device, IB_DEVICE_NOTIFIER_ADD); - } + list_for_each_entry(client, &client_list, list) + if (client->add) + client->add(device); } - device->reg_state = IB_DEV_REGISTERED; - - up(&device_lock); + up(&device_sem); return 0; out_proc: ib_proc_cleanup(device); - out_stop_async: - tsKernelQueueThreadStop(priv->async_thread); - out_free_cache: ib_cache_cleanup(device); @@ -274,54 +275,50 @@ kfree(priv); out: - up(&device_lock); + up(&device_sem); return ret; } EXPORT_SYMBOL(ib_register_device); -int ib_deregister_device(struct ib_device *device) +void ib_unregister_device(struct ib_device *device) { - struct ib_device_private *priv; + struct ib_device_private *priv = device->core; + struct ib_client *client; + struct ib_client_data *context, *tmp; + unsigned long flags; - priv = device->core; + down(&device_sem); - if (tsKernelQueueThreadStop(priv->async_thread)) { - printk(KERN_WARNING "tsKernelThreadStop failed for %s async thread\n", - device->name); - } + list_for_each_entry_reverse(client, &client_list, list) + if (client->remove) + client->remove(device); + list_del(&device->core_list); + + up(&device_sem); + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry_safe(context, tmp, &device->client_data_list, list) + kfree(context); + spin_unlock_irqrestore(&device->client_data_lock, flags); + ib_proc_cleanup(device); ib_cache_cleanup(device); - down(&device_lock); - list_del(&device->core_list); - { - struct list_head *ptr; - struct ib_device_notifier *notifier; - - list_for_each_prev(ptr, &notifier_list) { - notifier = list_entry(ptr, struct ib_device_notifier, list); - notifier->notifier(notifier, device, IB_DEVICE_NOTIFIER_REMOVE); - } - } - up(&device_lock); - kfree(priv->port_data); kfree(priv); device->reg_state = IB_DEV_UNREGISTERED; - - return 0; } -EXPORT_SYMBOL(ib_deregister_device); +EXPORT_SYMBOL(ib_unregister_device); struct ib_device 
*ib_device_get_by_name(const char *name) { struct ib_device *device; - down(&device_lock); + down(&device_sem); device = __ib_device_get_by_name(name); - up(&device_lock); + up(&device_sem); return device; } @@ -335,7 +332,7 @@ if (index < 0) return NULL; - down(&device_lock); + down(&device_sem); list_for_each(ptr, &device_list) { device = list_entry(ptr, struct ib_device, core_list); if (!index) @@ -345,38 +342,142 @@ device = NULL; out: - up(&device_lock); + up(&device_sem); return device; } EXPORT_SYMBOL(ib_device_get_by_index); -int ib_device_notifier_register(struct ib_device_notifier *notifier) +int ib_register_client(struct ib_client *client) { - struct list_head *ptr; struct ib_device *device; - down(&device_lock); - list_add_tail(&notifier->list, &notifier_list); - list_for_each(ptr, &device_list) { - device = list_entry(ptr, struct ib_device, core_list); - notifier->notifier(notifier, device, IB_DEVICE_NOTIFIER_ADD); + down(&device_sem); + + list_add_tail(&client->list, &client_list); + list_for_each_entry(device, &device_list, core_list) + if (client->add) + client->add(device); + + up(&device_sem); + + return 0; +} +EXPORT_SYMBOL(ib_register_client); + +void ib_unregister_client(struct ib_client *client) +{ + struct ib_client_data *context, *tmp; + struct ib_device *device; + unsigned long flags; + + down(&device_sem); + + list_for_each_entry(device, &device_list, core_list) { + if (client->remove) + client->remove(device); + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry_safe(context, tmp, &device->client_data_list, list) + if (context->client == client) { + list_del(&context->list); + kfree(context); + } + spin_unlock_irqrestore(&device->client_data_lock, flags); } - up(&device_lock); + list_del(&client->list); + up(&device_sem); +} +EXPORT_SYMBOL(ib_unregister_client); + +void *ib_get_client_data(struct ib_device *device, struct ib_client *client) +{ + struct ib_client_data *context; + void *ret = NULL; + unsigned long flags; 
+ + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry(context, &device->client_data_list, list) + if (context->client == client) { + ret = context->data; + break; + } + spin_unlock_irqrestore(&device->client_data_lock, flags); + + return ret; +} +EXPORT_SYMBOL(ib_get_client_data); + +int ib_set_client_data(struct ib_device *device, struct ib_client *client, + void *data) +{ + struct ib_client_data *context; + int ret = 0; + unsigned long flags; + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry(context, &device->client_data_list, list) + if (context->client == client) { + context->data = data; + spin_unlock_irqrestore(&device->client_data_lock, flags); + return 0; + } + + spin_unlock_irqrestore(&device->client_data_lock, flags); + context = kmalloc(sizeof *context, GFP_KERNEL); + if (!context) + return -ENOMEM; + + context->client = client; + context->data = data; + + spin_lock_irqsave(&device->client_data_lock, flags); + list_add(&context->list, &device->client_data_list); + spin_unlock_irqrestore(&device->client_data_lock, flags); + return 0; } -EXPORT_SYMBOL(ib_device_notifier_register); +EXPORT_SYMBOL(ib_set_client_data); -int ib_device_notifier_deregister(struct ib_device_notifier *notifier) +int ib_register_event_handler (struct ib_event_handler *event_handler) { - down(&device_lock); - list_del(¬ifier->list); - up(&device_lock); + unsigned long flags; + spin_lock_irqsave(&event_handler->device->event_handler_lock, flags); + list_add_tail(&event_handler->list, + &event_handler->device->event_handler_list); + spin_unlock_irqrestore(&event_handler->device->event_handler_lock, flags); + return 0; } -EXPORT_SYMBOL(ib_device_notifier_deregister); +EXPORT_SYMBOL(ib_register_event_handler); +int ib_unregister_event_handler(struct ib_event_handler *event_handler) +{ + unsigned long flags; + + spin_lock_irqsave(&event_handler->device->event_handler_lock, flags); + list_del(&event_handler->list); + 
spin_unlock_irqrestore(&event_handler->device->event_handler_lock, flags); + + return 0; +} +EXPORT_SYMBOL(ib_unregister_event_handler); + +void ib_dispatch_event(struct ib_event *event) +{ + unsigned long flags; + struct ib_event_handler *handler; + + spin_lock_irqsave(&event->device->event_handler_lock, flags); + + list_for_each_entry(handler, &event->device->event_handler_list, list) + handler->handler(handler, event); + + spin_unlock_irqrestore(&event->device->event_handler_lock, flags); +} +EXPORT_SYMBOL(ib_dispatch_event); + int ib_query_device(struct ib_device *device, struct ib_device_attr *device_attr) { Index: infiniband/core/core_async.c =================================================================== --- infiniband/core/core_async.c (revision 715) +++ infiniband/core/core_async.c (working copy) @@ -1,251 +0,0 @@ -/* - This software is available to you under a choice of one of two - licenses. You may choose to be licensed under the terms of the GNU - General Public License (GPL) Version 2, available at - , or the OpenIB.org BSD - license, available in the LICENSE.TXT file accompanying this - software. These details are also available at - . - - THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - SOFTWARE. - - Copyright (c) 2004 Topspin Communications. All rights reserved. 
- - $Id$ -*/ - -#include "core_priv.h" - -#include "ts_kernel_trace.h" -#include "ts_kernel_services.h" - -#include -#include - -#include -#include - -struct ib_async_event_handler { - struct ib_async_event_record record; - ib_async_event_handler_func function; - void *arg; - struct list_head list; - spinlock_t *list_lock; -}; - -struct ib_async_event_list { - struct ib_async_event_record record; - struct list_head list; -}; - -/* Table of modifiers for async events */ -static struct { - enum { - QP, - EEC, - CQ, - PORT, - NONE - } mod; - char *desc; -} event_table[] = { - [IB_QP_PATH_MIGRATED] = { QP, "QP Path Migrated" }, - [IB_EEC_PATH_MIGRATED] = { EEC, "EEC Path Migrated" }, - [IB_QP_COMMUNICATION_ESTABLISHED] = { QP, "QP Communication Established" }, - [IB_EEC_COMMUNICATION_ESTABLISHED] = { EEC, "EEC Communication Established" }, - [IB_SEND_QUEUE_DRAINED] = { QP, "Send Queue Drained" }, - [IB_CQ_ERROR] = { CQ, "CQ Error" }, - [IB_LOCAL_WQ_INVALID_REQUEST_ERROR] = { QP, "Local WQ Invalid Request Error" }, - [IB_LOCAL_WQ_ACCESS_VIOLATION_ERROR] = { QP, "Local WQ Access Violation Error" }, - [IB_LOCAL_WQ_CATASTROPHIC_ERROR] = { QP, "Local WQ Catastrophic Error" }, - [IB_PATH_MIGRATION_ERROR] = { QP, "Path Migration Error" }, - [IB_LOCAL_EEC_CATASTROPHIC_ERROR] = { EEC, "Local EEC Catastrophic Error" }, - [IB_LOCAL_CATASTROPHIC_ERROR] = { NONE, "Local Catastrophic Error" }, - [IB_PORT_ERROR] = { PORT, "Port Error" }, - [IB_EVENT_PORT_ACTIVE] = { PORT, "Port Active" }, - [IB_LID_CHANGE] = { PORT, "LID Change" }, - [IB_PKEY_CHANGE] = { PORT, "P_Key Change" } -}; - -int ib_async_event_handler_register(struct ib_async_event_record *record, - ib_async_event_handler_func function, - void *arg, - struct ib_async_event_handler **handle) -{ - struct ib_async_event_handler *handler; - int ret; - unsigned long flags; - - if (record->event < 0 || record->event >= ARRAY_SIZE(event_table)) { - TS_REPORT_WARN(MOD_KERNEL_IB, - "Attempt to register handler for invalid async event 
%d", - record->event); - return -EINVAL; - } - - handler = kmalloc(sizeof *handler, GFP_KERNEL); - if (!handler) { - return -ENOMEM; - } - - handler->record = *record; - handler->function = function; - handler->arg = arg; - - switch (event_table[record->event].mod) { - case QP: - break; - - case CQ: - printk(KERN_WARNING "Async events for CQs not supported\n"); - break; - - case EEC: - TS_REPORT_WARN(MOD_KERNEL_IB, - "Async events for EECs not supported yet"); - ret = -EINVAL; - goto error; - - case PORT: - case NONE: - { - struct ib_device_private *priv = ((struct ib_device *) record->device)->core; - - spin_lock_irqsave(&priv->async_handler_lock, flags); - handler->list_lock = &priv->async_handler_lock; - list_add_tail(&handler->list, &priv->async_handler_list); - spin_unlock_irqrestore(&priv->async_handler_lock, flags); - } - break; - } - - *handle = handler; - return 0; - - error: - kfree(handler); - return ret; -} -EXPORT_SYMBOL(ib_async_event_handler_register); - -int ib_async_event_handler_deregister(struct ib_async_event_handler *handle) -{ - struct ib_async_event_handler *handler = handle; - unsigned long flags; - - spin_lock_irqsave(handler->list_lock, flags); - list_del(&handler->list); - spin_unlock_irqrestore(handler->list_lock, flags); - - kfree(handle); - return 0; -} -EXPORT_SYMBOL(ib_async_event_handler_deregister); - -void ib_async_event_dispatch(struct ib_async_event_record *event_record) -{ - struct ib_async_event_list *event; - struct ib_device_private *priv = event_record->device->core; - - switch (event_table[event_record->event].mod) { - default: - break; - } - - event = kmalloc(sizeof *event, GFP_ATOMIC); - if (!event) { - return; - } - - event->record = *event_record; - - tsKernelQueueThreadAdd(priv->async_thread, &event->list); -} -EXPORT_SYMBOL(ib_async_event_dispatch); - -void ib_async_thread(struct list_head *entry, - void *device_ptr) -{ - struct ib_async_event_list *event; - struct ib_device_private *priv; - char mod_buf[32]; - 
struct list_head *handler_list = NULL; - spinlock_t *handler_lock = NULL; - struct list_head *pos; - struct list_head *n; - struct ib_async_event_handler *handler; - ib_async_event_handler_func function; - void *arg; - - event = list_entry(entry, struct ib_async_event_list, list); - priv = ((struct ib_device *) event->record.device)->core; - - switch (event_table[event->record.event].mod) { - case QP: - sprintf(mod_buf, " (QP %p)", event->record.modifier.qp); - break; - - case CQ: - sprintf(mod_buf, " (CQ %p)", event->record.modifier.cq); - break; - - case EEC: - sprintf(mod_buf, " (EEC %p)", event->record.modifier.eec); - break; - - case PORT: - sprintf(mod_buf, " (port %d)", event->record.modifier.port); - handler_list = &priv->async_handler_list; - handler_lock = &priv->async_handler_lock; - - /* Update cached port info */ - ib_cache_update(event->record.device, event->record.modifier.port); - break; - - case NONE: - mod_buf[0] = '\0'; - handler_list = &priv->async_handler_list; - handler_lock = &priv->async_handler_lock; - break; - } - - TS_TRACE(MOD_KERNEL_IB, T_VERY_VERBOSE, TRACE_KERNEL_IB_GEN, - "Received %s event for %s%s", - event_table[event->record.event].desc, - ((struct ib_device *) event->record.device)->name, - mod_buf); - - if (!handler_list) - return; - - spin_lock_irq(handler_lock); - - list_for_each_safe(pos, n, handler_list) { - handler = list_entry(pos, struct ib_async_event_handler, list); - if (handler->record.event == event->record.event) { - function = handler->function; - arg = handler->arg; - - spin_unlock_irq(handler_lock); - function(&event->record, arg); - spin_lock_irq(handler_lock); - } - } - - spin_unlock_irq(handler_lock); - kfree(event); -} - -/* - Local Variables: - c-file-style: "linux" - indent-tabs-mode: t - End: -*/ Index: infiniband/core/mad_main.c =================================================================== --- infiniband/core/mad_main.c (revision 759) +++ infiniband/core/mad_main.c (working copy) @@ -23,11 +23,6 @@ 
#include -#include "mad_priv.h" - -#include "ts_kernel_trace.h" -#include "ts_kernel_services.h" - #include #include @@ -37,10 +32,10 @@ /* Need the definition of high_memory: */ #include -#ifdef CONFIG_KMOD -#include -#endif +#include "ts_kernel_services.h" +#include "mad_priv.h" + MODULE_AUTHOR("Roland Dreier"); MODULE_DESCRIPTION("kernel IB MAD API"); MODULE_LICENSE("Dual BSD/GPL"); @@ -60,11 +55,11 @@ *mr = ib_reg_phys_mr(pd, &buffer_list, 1, /* list_len */ IB_ACCESS_LOCAL_WRITE, &iova); if (IS_ERR(*mr)) { - TS_REPORT_WARN(MOD_KERNEL_IB, - "ib_reg_phys_mr failed " - "size 0x%016" TS_U64_FMT "x, iova 0x%016" TS_U64_FMT "x" - " (return code %d)", - buffer_list.size, iova, PTR_ERR(*mr)); + printk(KERN_WARNING "ib_reg_phys_mr failed " + "size 0x%016llx, iova 0x%016llx " + "(return code %ld)\n", + (unsigned long long) buffer_list.size, + (unsigned long long) iova, PTR_ERR(*mr)); return PTR_ERR(*mr); } @@ -82,10 +77,6 @@ int attr_mask; int ret; - TS_TRACE(MOD_KERNEL_IB, T_VERY_VERBOSE, TRACE_KERNEL_IB_GEN, - "Creating port %d QPN %d for device %s", - port, qpn, device->name); - { struct ib_qp_init_attr init_attr = { .send_cq = priv->cq, @@ -105,10 +96,10 @@ priv->qp[port][qpn] = ib_create_qp(priv->pd, &init_attr, &qp_cap); if (IS_ERR(priv->qp[port][qpn])) { - TS_REPORT_FATAL(MOD_KERNEL_IB, - "ib_special_qp_create failed for %s port %d QPN %d (%d)", - device->name, port, qpn, - PTR_ERR(priv->qp[port][qpn])); + printk(KERN_WARNING "ib_special_qp_create failed " + "for %s port %d QPN %d (%ld)\n", + device->name, port, qpn, + PTR_ERR(priv->qp[port][qpn])); return PTR_ERR(priv->qp[port][qpn]); } } @@ -125,9 +116,9 @@ ret = ib_modify_qp(priv->qp[port][qpn], &qp_attr, attr_mask, &qp_cap); if (ret) { - TS_REPORT_FATAL(MOD_KERNEL_IB, - "ib_modify_qp -> INIT failed for %s port %d QPN %d (%d)", - device->name, port, qpn, ret); + printk(KERN_WARNING "ib_modify_qp -> INIT failed " + "for %s port %d QPN %d (%d)\n", + device->name, port, qpn, ret); return ret; } @@ -135,9 +126,9 @@ 
attr_mask = IB_QP_STATE; ret = ib_modify_qp(priv->qp[port][qpn], &qp_attr, attr_mask, &qp_cap); if (ret) { - TS_REPORT_FATAL(MOD_KERNEL_IB, - "ib_modify_qp -> RTR failed for %s port %d QPN %d (%d)", - device->name, port, qpn, ret); + printk(KERN_WARNING "ib_modify_qp -> RTR failed " + "for %s port %d QPN %d (%d)\n", + device->name, port, qpn, ret); return ret; } @@ -148,16 +139,16 @@ IB_QP_SQ_PSN; ret = ib_modify_qp(priv->qp[port][qpn], &qp_attr, attr_mask, &qp_cap); if (ret) { - TS_REPORT_FATAL(MOD_KERNEL_IB, - "ib_modify_qp -> RTS failed for %s port %d QPN %d (%d)", - device->name, port, qpn, ret); + printk(KERN_WARNING "ib_modify_qp -> RTS failed " + "for %s port %d QPN %d (%d)\n", + device->name, port, qpn, ret); return ret; } return 0; } -static int ib_mad_init_one(struct ib_device *device) +static void ib_mad_add_one(struct ib_device *device) { struct ib_mad_private *priv; struct ib_device_attr prop; @@ -165,18 +156,13 @@ ret = ib_query_device(device, &prop); if (ret) - return ret; + return; - TS_TRACE(MOD_KERNEL_IB, T_VERY_VERBOSE, TRACE_KERNEL_IB_GEN, - "Setting up device %s, %d ports", - device->name, prop.phys_port_cnt); - priv = kmalloc(sizeof *priv, GFP_KERNEL); if (!priv) { - TS_REPORT_WARN(MOD_KERNEL_IB, - "Couldn't allocate private structure for %s", - device->name); - return -ENOMEM; + printk(KERN_WARNING "Couldn't allocate MAD private structure for %s\n", + device->name); + return; } device->mad = priv; @@ -187,9 +173,8 @@ priv->pd = ib_alloc_pd(device); if (IS_ERR(priv->pd)) { - TS_REPORT_FATAL(MOD_KERNEL_IB, - "Failed to allocate PD for %s", - device->name); + printk(KERN_WARNING "Failed to allocate MAD PD for %s\n", + device->name); goto error; } @@ -198,11 +183,10 @@ (IB_MAD_RECEIVES_PER_QP + IB_MAD_SENDS_PER_QP) * priv->num_port; priv->cq = ib_create_cq(device, ib_mad_completion, - device, entries); + NULL, device, entries); if (IS_ERR(priv->cq)) { - TS_REPORT_FATAL(MOD_KERNEL_IB, - "Failed to allocate CQ for %s", - device->name); + 
printk(KERN_WARNING "Failed to allocate MAD CQ for %s\n", + device->name); goto error_free_pd; } } @@ -214,9 +198,8 @@ INIT_WORK(&priv->cq_work, ib_mad_drain_cq, device); if (ib_mad_register_memory(priv->pd, &priv->mr, &priv->lkey)) { - TS_REPORT_FATAL(MOD_KERNEL_IB, - "Failed to allocate MR for %s", - device->name); + printk(KERN_WARNING "Failed to allocate MAD MR for %s\n", + device->name); goto error_free_cq; } @@ -225,9 +208,8 @@ device, &priv->work_thread); if (ret) { - TS_REPORT_WARN(MOD_KERNEL_IB, - "Couldn't start completion thread for %s", - device->name); + printk(KERN_WARNING "Couldn't start completion thread for %s\n", + device->name); goto error_free_mr; } @@ -272,7 +254,7 @@ } } - return 0; + return; error_free_qp: { @@ -307,7 +289,6 @@ error: kfree(priv); - return ret; } static void ib_mad_remove_one(struct ib_device *device) @@ -346,39 +327,15 @@ } } -static void ib_mad_device_notifier(struct ib_device_notifier *self, - struct ib_device *device, - int event) -{ - switch (event) { - case IB_DEVICE_NOTIFIER_ADD: - if (ib_mad_init_one(device)) - TS_REPORT_WARN(MOD_KERNEL_IB, - "Failed to initialize device."); - break; - - case IB_DEVICE_NOTIFIER_REMOVE: - ib_mad_remove_one(device); - break; - - default: - TS_REPORT_WARN(MOD_KERNEL_IB, - "Unknown device notifier event %d."); - break; - } -} - -static struct ib_device_notifier mad_notifier = { - .notifier = ib_mad_device_notifier +static struct ib_client mad_client = { + .add = ib_mad_add_one, + .remove = ib_mad_remove_one }; static int __init ib_mad_init(void) { int ret; - TS_REPORT_INIT(MOD_KERNEL_IB, - "Initializing IB MAD layer"); - ret = ib_mad_proc_setup(); if (ret) return ret; @@ -391,34 +348,25 @@ NULL, NULL); if (!mad_cache) { - TS_REPORT_FATAL(MOD_KERNEL_IB, - "Couldn't create MAD slab cache"); + printk(KERN_ERR "Couldn't create MAD slab cache\n"); ib_mad_proc_cleanup(); return -ENOMEM; } - ib_device_notifier_register(&mad_notifier); + if (ib_register_client(&mad_client)) { - 
TS_REPORT_INIT(MOD_KERNEL_IB, - "IB MAD layer initialized"); + } return 0; } static void __exit ib_mad_cleanup(void) { - TS_REPORT_CLEANUP(MOD_KERNEL_IB, - "Unloading IB MAD layer"); - - ib_device_notifier_deregister(&mad_notifier); + ib_unregister_client(&mad_client); ib_mad_proc_cleanup(); if (kmem_cache_destroy(mad_cache)) - TS_REPORT_WARN(MOD_KERNEL_IB, - "Failed to destroy MAD slab cache (memory leak?)"); - - TS_REPORT_CLEANUP(MOD_KERNEL_IB, - "IB MAD layer unloaded"); + printk(KERN_WARNING "Failed to destroy MAD slab cache (memory leak?)\n"); } module_init(ib_mad_init); Index: infiniband/core/mad_priv.h =================================================================== --- infiniband/core/mad_priv.h (revision 759) +++ infiniband/core/mad_priv.h (working copy) @@ -26,7 +26,6 @@ #include "ts_ib_mad.h" #include -#include "ts_ib_provider.h" #include #include "ts_kernel_thread.h" Index: infiniband/core/core_device.c =================================================================== --- infiniband/core/core_device.c (revision 715) +++ infiniband/core/core_device.c (working copy) @@ -1,432 +0,0 @@ -/* - This software is available to you under a choice of one of two - licenses. You may choose to be licensed under the terms of the GNU - General Public License (GPL) Version 2, available at - , or the OpenIB.org BSD - license, available in the LICENSE.TXT file accompanying this - software. These details are also available at - . - - THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - SOFTWARE. - - Copyright (c) 2004 Topspin Communications. All rights reserved. 
- - $Id$ -*/ - -#include "ts_kernel_services.h" - -#include -#include -#include -#include - -#include - -#include "core_priv.h" - -static LIST_HEAD(device_list); -static LIST_HEAD(notifier_list); -static DECLARE_MUTEX(device_lock); - -static int ib_device_check_mandatory(struct ib_device *device) -{ -#define IB_MANDATORY_FUNC(x) { offsetof(struct ib_device, x), #x } - static const struct { - size_t offset; - char *name; - } mandatory_table[] = { - IB_MANDATORY_FUNC(query_device), - IB_MANDATORY_FUNC(query_port), - IB_MANDATORY_FUNC(query_pkey), - IB_MANDATORY_FUNC(query_gid), - IB_MANDATORY_FUNC(alloc_pd), - IB_MANDATORY_FUNC(dealloc_pd), - IB_MANDATORY_FUNC(create_ah), - IB_MANDATORY_FUNC(destroy_ah), - IB_MANDATORY_FUNC(create_qp), - IB_MANDATORY_FUNC(modify_qp), - IB_MANDATORY_FUNC(destroy_qp), - IB_MANDATORY_FUNC(post_send), - IB_MANDATORY_FUNC(post_recv), - IB_MANDATORY_FUNC(create_cq), - IB_MANDATORY_FUNC(destroy_cq), - IB_MANDATORY_FUNC(poll_cq), - IB_MANDATORY_FUNC(req_notify_cq), - IB_MANDATORY_FUNC(reg_phys_mr), - IB_MANDATORY_FUNC(dereg_mr) - }; - int i; - - for (i = 0; i < sizeof mandatory_table / sizeof mandatory_table[0]; ++i) { - if (!*(void **) ((void *) device + mandatory_table[i].offset)) { - printk(KERN_WARNING "Device %s is missing mandatory function %s\n", - device->name, mandatory_table[i].name); - return -EINVAL; - } - } - - return 0; -} - -static struct ib_device *__ib_device_get_by_name(const char *name) -{ - struct ib_device *device; - - list_for_each_entry(device, &device_list, core_list) - if (!strncmp(name, device->name, IB_DEVICE_NAME_MAX)) - return device; - - return NULL; -} - - -static int alloc_name(char *name) -{ - long *inuse; - char buf[IB_DEVICE_NAME_MAX]; - struct ib_device *device; - int i; - - inuse = (long *) get_zeroed_page(GFP_KERNEL); - if (!inuse) - return -ENOMEM; - - list_for_each_entry(device, &device_list, core_list) { - if (!sscanf(device->name, name, &i)) - continue; - if (i < 0 || i >= PAGE_SIZE * 8) - continue; 
- snprintf(buf, sizeof buf, name, i); - if (!strncmp(buf, device->name, IB_DEVICE_NAME_MAX)) - set_bit(i, inuse); - } - - i = find_first_zero_bit(inuse, PAGE_SIZE * 8); - free_page((unsigned long) inuse); - snprintf(buf, sizeof buf, name, i); - - if (__ib_device_get_by_name(buf)) - return -ENFILE; - - strlcpy(name, buf, IB_DEVICE_NAME_MAX); - return 0; -} - -struct ib_device *ib_alloc_device(size_t size) -{ - void *dev; - - BUG_ON(size < sizeof (struct ib_device)); - - dev = kmalloc(size, GFP_KERNEL); - if (!dev) - return NULL; - - memset(dev, 0, size); - - return dev; -} -EXPORT_SYMBOL(ib_alloc_device); - -void ib_dealloc_device(struct ib_device *device) -{ - if (device->reg_state == IB_DEV_UNINITIALIZED) { - kfree(device); - return; - } - - BUG_ON(device->reg_state != IB_DEV_UNREGISTERED); - - ib_device_deregister_sysfs(device); -} -EXPORT_SYMBOL(ib_dealloc_device); - -int ib_register_device(struct ib_device *device) -{ - struct ib_device_private *priv; - struct ib_device_attr prop; - int ret; - int p; - - if (ib_device_check_mandatory(device)) { - return -EINVAL; - } - - down(&device_lock); - - if (strchr(device->name, '%')) { - ret = alloc_name(device->name); - if (ret) - goto out; - } - - priv = kmalloc(sizeof *priv, GFP_KERNEL); - if (!priv) { - printk(KERN_WARNING "Couldn't allocate private struct for %s\n", - device->name); - ret = -ENOMEM; - goto out; - } - - *priv = (struct ib_device_private) { 0 }; - - ret = device->query_device(device, &prop); - if (ret) { - printk(KERN_WARNING "query_device failed for %s\n", - device->name); - goto out_free; - } - - memcpy(priv->node_guid, &prop.node_guid, sizeof (tTS_IB_GUID)); - - if (device->node_type == IB_NODE_SWITCH) { - priv->start_port = priv->end_port = 0; - } else { - priv->start_port = 1; - priv->end_port = prop.phys_port_cnt; - } - - priv->port_data = kmalloc((priv->end_port + 1) * sizeof (struct ib_port_data), - GFP_KERNEL); - if (!priv->port_data) { - printk(KERN_WARNING "Couldn't allocate port info for 
%s\n", - device->name); - goto out_free; - } - - for (p = priv->start_port; p <= priv->end_port; ++p) { - spin_lock_init(&priv->port_data[p].port_cap_lock); - memset(priv->port_data[p].port_cap_count, 0, IB_PORT_CAP_NUM * sizeof (int)); - } - - device->core = priv; - - INIT_LIST_HEAD(&priv->async_handler_list); - spin_lock_init(&priv->async_handler_lock); - - ret = ib_cache_setup(device); - if (ret) { - printk(KERN_WARNING "Couldn't create device info cache for %s\n", - device->name); - goto out_free_port; - } - - ret = tsKernelQueueThreadStart("ts_ib_async", - ib_async_thread, - device, - &priv->async_thread); - if (ret) { - printk(KERN_WARNING "Couldn't start async thread for %s\n", - device->name); - goto out_free_cache; - } - - ret = ib_proc_setup(device, device->node_type == IB_NODE_SWITCH); - if (ret) { - printk(KERN_WARNING "Couldn't create /proc dir for %s\n", - device->name); - goto out_stop_async; - } - - if (ib_device_register_sysfs(device)) { - printk(KERN_WARNING "Couldn't register device %s with driver model\n", - device->name); - goto out_proc; - } - - list_add_tail(&device->core_list, &device_list); - { - struct list_head *ptr; - struct ib_device_notifier *notifier; - - list_for_each(ptr, &notifier_list) { - notifier = list_entry(ptr, struct ib_device_notifier, list); - notifier->notifier(notifier, device, IB_DEVICE_NOTIFIER_ADD); - } - } - - device->reg_state = IB_DEV_REGISTERED; - - up(&device_lock); - return 0; - - out_proc: - ib_proc_cleanup(device); - - out_stop_async: - tsKernelQueueThreadStop(priv->async_thread); - - out_free_cache: - ib_cache_cleanup(device); - - out_free_port: - kfree(priv->port_data); - - out_free: - kfree(priv); - - out: - up(&device_lock); - return ret; -} -EXPORT_SYMBOL(ib_register_device); - -int ib_deregister_device(struct ib_device *device) -{ - struct ib_device_private *priv; - - priv = device->core; - - if (tsKernelQueueThreadStop(priv->async_thread)) { - printk(KERN_WARNING "tsKernelThreadStop failed for %s async 
thread\n", - device->name); - } - - ib_proc_cleanup(device); - ib_cache_cleanup(device); - - down(&device_lock); - list_del(&device->core_list); - { - struct list_head *ptr; - struct ib_device_notifier *notifier; - - list_for_each_prev(ptr, &notifier_list) { - notifier = list_entry(ptr, struct ib_device_notifier, list); - notifier->notifier(notifier, device, IB_DEVICE_NOTIFIER_REMOVE); - } - } - up(&device_lock); - - kfree(priv->port_data); - kfree(priv); - - device->reg_state = IB_DEV_UNREGISTERED; - - return 0; -} -EXPORT_SYMBOL(ib_deregister_device); - -struct ib_device *ib_device_get_by_name(const char *name) -{ - struct ib_device *device; - - down(&device_lock); - device = __ib_device_get_by_name(name); - up(&device_lock); - - return device; -} -EXPORT_SYMBOL(ib_device_get_by_name); - -struct ib_device *ib_device_get_by_index(int index) -{ - struct list_head *ptr; - struct ib_device *device; - - if (index < 0) - return NULL; - - down(&device_lock); - list_for_each(ptr, &device_list) { - device = list_entry(ptr, struct ib_device, core_list); - if (!index) - goto out; - --index; - } - - device = NULL; - out: - up(&device_lock); - return device; -} -EXPORT_SYMBOL(ib_device_get_by_index); - -int ib_device_notifier_register(struct ib_device_notifier *notifier) -{ - struct list_head *ptr; - struct ib_device *device; - - down(&device_lock); - list_add_tail(&notifier->list, &notifier_list); - list_for_each(ptr, &device_list) { - device = list_entry(ptr, struct ib_device, core_list); - notifier->notifier(notifier, device, IB_DEVICE_NOTIFIER_ADD); - } - up(&device_lock); - - return 0; -} -EXPORT_SYMBOL(ib_device_notifier_register); - -int ib_device_notifier_deregister(struct ib_device_notifier *notifier) -{ - down(&device_lock); - list_del(&notifier->list); - up(&device_lock); - - return 0; -} -EXPORT_SYMBOL(ib_device_notifier_deregister); - -int ib_query_device(struct ib_device *device, - struct ib_device_attr *device_attr) -{ - return device->query_device(device, device_attr); -} 
-EXPORT_SYMBOL(ib_query_device); - -int ib_query_port(struct ib_device *device, - u8 port_num, - struct ib_port_attr *port_attr) -{ - return device->query_port(device, port_num, port_attr); -} -EXPORT_SYMBOL(ib_query_port); - -int ib_query_gid(struct ib_device *device, - u8 port_num, int index, union ib_gid *gid) -{ - return device->query_gid(device, port_num, index, gid); -} -EXPORT_SYMBOL(ib_query_gid); - -int ib_query_pkey(struct ib_device *device, - u8 port_num, u16 index, u16 *pkey) -{ - return device->query_pkey(device, port_num, index, pkey); -} -EXPORT_SYMBOL(ib_query_pkey); - -int ib_modify_device(struct ib_device *device, - int device_modify_mask, - struct ib_device_modify *device_modify) -{ - return device->modify_device(device, device_modify_mask, - device_modify); -} -EXPORT_SYMBOL(ib_modify_device); - -int ib_modify_port(struct ib_device *device, - u8 port_num, int port_modify_mask, - struct ib_port_modify *port_modify) -{ - return device->modify_port(device, port_num, port_modify_mask, - port_modify); -} -EXPORT_SYMBOL(ib_modify_port); - -/* - Local Variables: - c-file-style: "linux" - indent-tabs-mode: t - End: -*/ Index: infiniband/core/mad_static.c =================================================================== --- infiniband/core/mad_static.c (revision 759) +++ infiniband/core/mad_static.c (working copy) @@ -22,7 +22,6 @@ */ #include "mad_priv.h" -#include "ts_ib_provider.h" #include "smp_access.h" #include "ts_kernel_trace.h" @@ -167,12 +166,12 @@ { /* Generate an artificial port error event so that cached info is updated for this port */ - struct ib_async_event_record record; + struct ib_event record; - record.device = device; - record.event = IB_PORT_ERROR; - record.modifier.port = port; - ib_async_event_dispatch(&record); + record.device = device; + record.event = IB_EVENT_PORT_ERR; + record.element.port_num = port; + ib_dispatch_event(&record); } } Index: infiniband/core/ib_verbs.c 
=================================================================== --- infiniband/core/ib_verbs.c (revision 759) +++ infiniband/core/ib_verbs.c (working copy) @@ -113,12 +113,13 @@ qp = pd->device->create_qp(pd, qp_init_attr, qp_cap); if (!IS_ERR(qp)) { - qp->device = pd->device; - qp->pd = pd; - qp->send_cq = qp_init_attr->send_cq; - qp->recv_cq = qp_init_attr->recv_cq; - qp->srq = qp_init_attr->srq; - qp->qp_context = qp_init_attr->qp_context; + qp->device = pd->device; + qp->pd = pd; + qp->send_cq = qp_init_attr->send_cq; + qp->recv_cq = qp_init_attr->recv_cq; + qp->srq = qp_init_attr->srq; + qp->event_handler = qp_init_attr->event_handler; + qp->qp_context = qp_init_attr->qp_context; atomic_inc(&pd->usecnt); atomic_inc(&qp_init_attr->send_cq->usecnt); atomic_inc(&qp_init_attr->recv_cq->usecnt); @@ -179,6 +180,7 @@ struct ib_cq *ib_create_cq(struct ib_device *device, ib_comp_handler comp_handler, + void (*event_handler)(struct ib_event *, void *), void *cq_context, int cqe) { struct ib_cq *cq; @@ -186,9 +188,10 @@ cq = device->create_cq(device, cqe); if (!IS_ERR(cq)) { - cq->device = device; - cq->comp_handler = comp_handler; - cq->context = cq_context; + cq->device = device; + cq->comp_handler = comp_handler; + cq->event_handler = event_handler; + cq->context = cq_context; atomic_set(&cq->usecnt, 0); } Index: infiniband/hw/mthca/mthca_dev.h =================================================================== --- infiniband/hw/mthca/mthca_dev.h (revision 759) +++ infiniband/hw/mthca/mthca_dev.h (working copy) @@ -283,7 +283,7 @@ void mthca_cleanup_mcg_table(struct mthca_dev *dev); int mthca_register_device(struct mthca_dev *dev); -void mthca_deregister_device(struct mthca_dev *dev); +void mthca_unregister_device(struct mthca_dev *dev); int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd); void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd); @@ -308,7 +308,7 @@ void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn); void 
mthca_qp_event(struct mthca_dev *dev, u32 qpn, - enum ib_async_event event); + enum ib_event_type event_type); int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, struct ib_qp_cap *qp_cap); int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, Index: infiniband/hw/mthca/mthca_main.c =================================================================== --- infiniband/hw/mthca/mthca_main.c (revision 759) +++ infiniband/hw/mthca/mthca_main.c (working copy) @@ -638,7 +638,7 @@ int p; if (mdev) { - mthca_deregister_device(mdev); + mthca_unregister_device(mdev); for (p = 1; p <= mdev->limits.num_ports; ++p) mthca_CLOSE_IB(mdev, p, &status); Index: infiniband/hw/mthca/mthca_provider.c =================================================================== --- infiniband/hw/mthca/mthca_provider.c (revision 759) +++ infiniband/hw/mthca/mthca_provider.c (working copy) @@ -560,7 +560,6 @@ dev->ib_dev.owner = THIS_MODULE; dev->ib_dev.dma_device = dev->pdev; dev->ib_dev.class_dev.dev = &dev->pdev->dev; - dev->ib_dev.provider = "mthca"; dev->ib_dev.query_device = mthca_query_device; dev->ib_dev.query_port = mthca_query_port; dev->ib_dev.modify_port = mthca_modify_port; @@ -593,7 +592,7 @@ ret = class_device_create_file(&dev->ib_dev.class_dev, mthca_class_attributes[i]); if (ret) { - ib_deregister_device(&dev->ib_dev); + ib_unregister_device(&dev->ib_dev); return ret; } } @@ -601,9 +600,9 @@ return 0; } -void mthca_deregister_device(struct mthca_dev *dev) +void mthca_unregister_device(struct mthca_dev *dev) { - ib_deregister_device(&dev->ib_dev); + ib_unregister_device(&dev->ib_dev); } /* Index: infiniband/hw/mthca/mthca_provider.h =================================================================== --- infiniband/hw/mthca/mthca_provider.h (revision 759) +++ infiniband/hw/mthca/mthca_provider.h (working copy) @@ -24,7 +24,6 @@ #ifndef MTHCA_PROVIDER_H #define MTHCA_PROVIDER_H -#include #include #define MTHCA_MPT_FLAG_ATOMIC (1 << 14) Index: 
infiniband/hw/mthca/mthca_mad.c =================================================================== --- infiniband/hw/mthca/mthca_mad.c (revision 759) +++ infiniband/hw/mthca/mthca_mad.c (working copy) @@ -46,24 +46,24 @@ static void smp_snoop(struct ib_device *ibdev, struct ib_mad *mad) { - struct ib_async_event_record record; + struct ib_event event; if (mad->dqpn == 0 && (mad->mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || mad->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && mad->r_method == IB_MGMT_METHOD_SET) { if (mad->attribute_id == cpu_to_be16(IB_SM_PORT_INFO)) { - record.device = ibdev; - record.event = IB_LID_CHANGE; - record.modifier.port = mad->port; - ib_async_event_dispatch(&record); + event.device = ibdev; + event.event = IB_EVENT_LID_CHANGE; + event.element.port_num = mad->port; + ib_dispatch_event(&event); } if (mad->attribute_id == cpu_to_be16(IB_SM_PKEY_TABLE)) { - record.device = ibdev; - record.event = IB_PKEY_CHANGE; - record.modifier.port = mad->port; - ib_async_event_dispatch(&record); + event.device = ibdev; + event.event = IB_EVENT_PKEY_CHANGE; + event.element.port_num = mad->port; + ib_dispatch_event(&event); } } } Index: infiniband/hw/mthca/mthca_eq.c =================================================================== --- infiniband/hw/mthca/mthca_eq.c (revision 759) +++ infiniband/hw/mthca/mthca_eq.c (working copy) @@ -200,16 +200,16 @@ static void port_change(struct mthca_dev *dev, int port, int active) { - struct ib_async_event_record record; + struct ib_event record; mthca_dbg(dev, "Port change to %s for port %d\n", active ? "active" : "down", port); record.device = &dev->ib_dev; - record.event = active ? IB_EVENT_PORT_ACTIVE : IB_PORT_ERROR; - record.modifier.port = port; + record.event = active ? 
IB_EVENT_PORT_ACTIVE : IB_EVENT_PORT_ERR; + record.element.port_num = port; - ib_async_event_dispatch(&record); + ib_dispatch_event(&record); } static void mthca_eq_int(struct mthca_dev *dev, struct mthca_eq *eq) @@ -234,37 +234,37 @@ case MTHCA_EVENT_TYPE_PATH_MIG: mthca_qp_event(dev, be32_to_cpu(eqe->qp.qpn) & 0xffffff, - IB_QP_PATH_MIGRATED); + IB_EVENT_PATH_MIG); break; case MTHCA_EVENT_TYPE_COMM_EST: mthca_qp_event(dev, be32_to_cpu(eqe->qp.qpn) & 0xffffff, - IB_QP_COMMUNICATION_ESTABLISHED); + IB_EVENT_COMM_EST); break; case MTHCA_EVENT_TYPE_SQ_DRAINED: mthca_qp_event(dev, be32_to_cpu(eqe->qp.qpn) & 0xffffff, - IB_SEND_QUEUE_DRAINED); + IB_EVENT_SQ_DRAINED); break; case MTHCA_EVENT_TYPE_WQ_CATAS_ERROR: mthca_qp_event(dev, be32_to_cpu(eqe->qp.qpn) & 0xffffff, - IB_LOCAL_WQ_CATASTROPHIC_ERROR); + IB_EVENT_QP_FATAL); break; case MTHCA_EVENT_TYPE_PATH_MIG_FAILED: mthca_qp_event(dev, be32_to_cpu(eqe->qp.qpn) & 0xffffff, - IB_PATH_MIGRATION_ERROR); + IB_EVENT_PATH_MIG_ERR); break; case MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR: mthca_qp_event(dev, be32_to_cpu(eqe->qp.qpn) & 0xffffff, - IB_LOCAL_WQ_INVALID_REQUEST_ERROR); + IB_EVENT_QP_REQ_ERR); break; case MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR: mthca_qp_event(dev, be32_to_cpu(eqe->qp.qpn) & 0xffffff, - IB_LOCAL_WQ_ACCESS_VIOLATION_ERROR); + IB_EVENT_QP_ACCESS_ERR); break; case MTHCA_EVENT_TYPE_CMD: Index: infiniband/hw/mthca/mthca_qp.c =================================================================== --- infiniband/hw/mthca/mthca_qp.c (revision 759) +++ infiniband/hw/mthca/mthca_qp.c (working copy) @@ -261,10 +261,10 @@ } void mthca_qp_event(struct mthca_dev *dev, u32 qpn, - enum ib_async_event event) + enum ib_event_type event_type) { struct mthca_qp *qp; - struct ib_async_event_record event_record; + struct ib_event event; spin_lock(&dev->qp_table.lock); qp = mthca_array_get(&dev->qp_table.qp, qpn & (dev->limits.num_qps - 1)); @@ -277,10 +277,11 @@ return; } - event_record.device = &dev->ib_dev; - event_record.event = 
event; - event_record.modifier.qp = (struct ib_qp *) qp; - ib_async_event_dispatch(&event_record); + event.device = &dev->ib_dev; + event.event = event_type; + event.element.qp = &qp->ibqp; + if (qp->ibqp.event_handler) + qp->ibqp.event_handler(&event, qp->ibqp.qp_context); if (atomic_dec_and_test(&qp->refcount)) wake_up(&qp->wait); From mst at mellanox.co.il Sun Sep 12 23:30:28 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 13 Sep 2004 09:30:28 +0300 Subject: TODO patch (was Re: [openib-general] qp lock in mthca_poll_cq) In-Reply-To: <527jrzqf7u.fsf@topspin.com> References: <20040812072806.GA803@mellanox.co.il> <20040812074643.GB803@mellanox.co.il> <523c2s2xre.fsf@topspin.com> <20040815125027.GA410@mellanox.co.il> <527jrzqf7u.fsf@topspin.com> Message-ID: <20040913063028.GC17530@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] qp lock in mthca_poll_cq": > Michael> Wouldnt locking the cq while QP is being destroyed also > Michael> work? And maybe the eq which gets the async events. > > Yes, that's a good idea, trading some locking in the slow destroy path > for removing an atomic access in the data path. EQ access is > currently lock-free, but replacing the atomic_t refcounting of > individual resources with a per-EQ spinlock should if anything be a > little more cache friendly. I'll add this to my TODO list (I need to > take care that the locking hierarchies are OK to avoid deadlocks). Here's a TODO patch. Index: src/linux-kernel/TODO =================================================================== --- src/linux-kernel/TODO (revision 795) +++ src/linux-kernel/TODO (working copy) @@ -44,3 +44,12 @@ - rewrite client_query/sa_client/dm_client so that they are more general (better support for component mask, RMPP, etc) and more maintainable. + +mthca specific tasks: + - Reduce the number of locks on poll cq path. 
Locking the cq while qp + is being destroyed might just work (and maybe the eq which gets the + async events), trading some locking in the slow destroy path + for removing an atomic access in the data path. eq access is + currently lock-free, but replacing the atomic_t refcounting of + individual resources with a per-eq spinlock should if anything be a + little more cache friendly. From halr at voltaire.com Mon Sep 13 05:43:14 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 13 Sep 2004 08:43:14 -0400 Subject: [openib-general] [PATCH] ib_mad.c: On completion, use first entry rather than walking send/receive lists Message-ID: <1095079394.2002.1.camel@localhost.localdomain> ib_mad.c: On completion, use first entry rather than walking send/receive lists Index: ib_mad.c =================================================================== --- ib_mad.c (revision 795) +++ ib_mad.c (working copy) @@ -144,7 +144,10 @@ goto error1; } } else if (mad_reg_req->mgmt_class == 0) { - /* Class 0 is reserved and used for aliasing IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE */ + /* + * Class 0 is reserved in IBA and is used here for + * aliasing of IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE + */ ret = ERR_PTR(-EINVAL); goto error1; } @@ -564,31 +567,31 @@ struct ib_wc *wc) { struct ib_mad_recv_wc recv_wc; - struct ib_mad_private_header *entry, *temp; - struct ib_mad_private *recv = NULL; + struct ib_mad_private *recv; unsigned long flags; u32 qp_num; /* WC WRID is the QP number */ qp_num = wc->wr_id; - /* Find entry on posted MAD receive list which corresponds to this completion */ + /* + * Completion corresponds to first entry on + * posted MAD receive list based on WRID in completion + */ spin_lock_irqsave(&priv->recv_list_lock, flags); - list_for_each_entry_safe(entry, temp, - &priv->recv_posted_mad_list[convert_qpnum(qp_num)], - mad_list) { - if ((unsigned long)entry == wc->wr_id) { - recv = (struct ib_mad_private *)entry; - /* Remove from posted receive MAD list */ - 
list_del(&entry->mad_list); - priv->recv_posted_mad_count[convert_qpnum(qp_num)]--; - break; - } + if (!list_empty(&priv->recv_posted_mad_list[convert_qpnum(qp_num)])) { + recv = list_entry(priv->recv_posted_mad_list[convert_qpnum(qp_num)].next, + struct ib_mad_private, + header.mad_list); + /* Remove from posted receive MAD list */ + list_del(&recv->header.mad_list); + priv->recv_posted_mad_count[convert_qpnum(qp_num)]--; + } else { + printk(KERN_ERR "Receive completion WR ID 0x%Lx on QP %d with no posted receive\n", wc->wr_id, qp_num); + spin_unlock_irqrestore(&priv->recv_list_lock, flags); + return; } spin_unlock_irqrestore(&priv->recv_list_lock, flags); - if (!recv) { - printk(KERN_ERR "No matching posted receive WR 0x%Lx\n", wc->wr_id); - } /* Setup MAD receive work completion from normal one */ recv_wc.wr_id = wc->wr_id; @@ -609,43 +612,51 @@ recv_wc.sl = wc->sl; recv_wc.dlid_path_bits = wc->dlid_path_bits; - /* Need to figure out MAD agent !!! */ + /* Determine corresponding MAD agent for incoming receive MAD */ /* Invoke client receive callback */ - /* Receive reposting ? !!! */ + /* When to repost receive request ? !!! 
*/ } static void ib_mad_send_done_handler(struct ib_mad_port_private *priv, struct ib_wc *wc) { - struct ib_mad_send_wr_private *entry, *temp, *send_wr = NULL; + struct ib_mad_send_wr_private *send_wr; unsigned long flags; - /* Find entry on posted MAD send list which corresponds to this completion */ + /* Completion corresponds to first entry on posted MAD send list */ spin_lock_irqsave(&priv->send_list_lock, flags); - list_for_each_entry_safe(entry, temp, - &priv->send_posted_mad_list, send_list) { - if (entry->wr_id == wc->wr_id) { - send_wr = entry; - /* Remove from posted send MAD list */ - list_del(&entry->send_list); - priv->send_posted_mad_count--; - break; + if (!list_empty(&priv->send_posted_mad_list)) { + send_wr = list_entry(priv->send_posted_mad_list.next, + struct ib_mad_send_wr_private, + send_list); + if (send_wr->wr_id != wc->wr_id) { + printk(KERN_ERR "Send completion WR ID 0x%Lx doesn't match posted send WR ID 0x%Lx\n", wc->wr_id, send_wr->wr_id); + + goto error; } + /* Remove from posted send MAD list */ + list_del(&send_wr->send_list); + priv->send_posted_mad_count--; + } else { + printk(KERN_ERR "Send completion WR ID 0x%Lx but send list is empty\n", wc->wr_id); + goto error; } spin_unlock_irqrestore(&priv->send_list_lock, flags); - if (!send_wr) { - printk(KERN_ERR "No matching posted send WR 0x%Lx\n", wc->wr_id); - } else { - /* Restore client wr_id in WC */ - wc->wr_id = send_wr->wr_id; - /* Invoke client send callback */ - send_wr->agent->send_handler(send_wr->agent, - (struct ib_mad_send_wc *)wc); - /* Release send MAD WR tracking structure */ - kfree(send_wr); - } + + /* Restore client wr_id in WC */ + wc->wr_id = send_wr->wr_id; + /* Invoke client send callback */ + send_wr->agent->send_handler(send_wr->agent, + (struct ib_mad_send_wc *)wc); + /* Release send MAD WR tracking structure */ + kfree(send_wr); + return; + +error: + spin_unlock_irqrestore(&priv->send_list_lock, flags); + return; } /* From halr at voltaire.com Mon Sep 13 08:35:10 
2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 13 Sep 2004 11:35:10 -0400 Subject: [openib-general] [PATCH] ib_mad.h: Add in definition of management methods Message-ID: <1095089709.1947.56.camel@localhost.localdomain> ib_mad.h: Add in definition of management methods Index: ib_mad.h =================================================================== --- ib_mad.h (revision 799) +++ ib_mad.h (working copy) @@ -38,6 +38,19 @@ #define IB_MGMT_CLASS_CM 0x07 #define IB_MGMT_CLASS_SNMP 0x08 +/* Management methods */ +#define IB_MGMT_METHOD_GET 0x01 +#define IB_MGMT_METHOD_SET 0x02 +#define IB_MGMT_METHOD_GET_RESP 0x81 +#define IB_MGMT_METHOD_SEND 0x03 +#define IB_MGMT_METHOD_TRAP 0x05 +#define IB_MGMT_METHOD_REPORT 0x06 +#define IB_MGMT_METHOD_REPORT_RESP 0x86 +#define IB_MGMT_METHOD_TRAP_REPRESS 0x07 + +#define IB_MGMT_METHOD_RESP 0x80 + + #define IB_MGMT_MAX_METHODS 128 #define IB_QP0 0 From iod00d at hp.com Mon Sep 13 08:42:06 2004 From: iod00d at hp.com (Grant Grundler) Date: Mon, 13 Sep 2004 08:42:06 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <528ybfa2x1.fsf@topspin.com> References: <000401c49824$95505150$655aa8c0@infiniconsys.com> <528ybfa2x1.fsf@topspin.com> Message-ID: <20040913154206.GA11533@cup.hp.com> On Sun, Sep 12, 2004 at 09:05:14AM -0700, Roland Dreier wrote: > OK, I thought about it some more and I decided that it's better to > have the API from the beginning so that more efficient implementations > can be added without changing the client code. However, I implemented > the following API: > > void *ib_get_client_data(struct ib_device *device, struct ib_client *client); > int ib_set_client_data(struct ib_device *device, struct ib_client *client, > void *data); Can a client talk to multiple devices? ie is the "data" more closely associated with the ib_device or with the ib_client? Someplace, there has to be a 1:1 mapping or ib_get_client_data() can't work. 
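For reference, a minimal userspace sketch of a lookup keyed on the (device, client) pair. It assumes each device keeps its own list of per-client entries; the sketch_* names and struct layouts are hypothetical stand-ins, not the ib_device.c code posted to the list, and locking is left out.

```c
#include <stdlib.h>

/* Hypothetical stand-ins; only the fields this sketch needs. */
struct sketch_client {
	const char *name;
};

struct sketch_client_data {
	struct sketch_client_data *next;
	struct sketch_client *client;	/* which client owns this entry */
	void *data;			/* that client's private pointer */
};

struct sketch_device {
	/* One entry per client that has stored data on this device. */
	struct sketch_client_data *client_data_list;
};

static struct sketch_client_data *
sketch_find_entry(struct sketch_device *device, struct sketch_client *client)
{
	struct sketch_client_data *e;

	for (e = device->client_data_list; e; e = e->next)
		if (e->client == client)
			return e;
	return NULL;
}

/* Return the pointer stored for this (device, client) pair, or NULL. */
void *sketch_get_client_data(struct sketch_device *device,
			     struct sketch_client *client)
{
	struct sketch_client_data *e = sketch_find_entry(device, client);

	return e ? e->data : NULL;
}

/* Return 0 on success, -1 if a new entry could not be allocated. */
int sketch_set_client_data(struct sketch_device *device,
			   struct sketch_client *client, void *data)
{
	struct sketch_client_data *e = sketch_find_entry(device, client);

	if (!e) {
		e = malloc(sizeof(*e));
		if (!e)
			return -1;
		e->client = client;
		e->next = device->client_data_list;
		device->client_data_list = e;
	}
	e->data = data;
	return 0;
}
```

With this layout the same client can store a different pointer on each device, and two clients can each store their own pointer on the same device. The set call can fail only when it has to allocate a new entry.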
> I think this is equivalent to what you proposed but simpler to > implement and use. It also mimics the API in : > > void *pci_get_drvdata (struct pci_dev *pdev); > void pci_set_drvdata (struct pci_dev *pdev, void *data); pci set/get hides the mechanism of where/how this info is stored so fewer details of the pci data structure are public. And in fact, the implementation makes use of generic device support. Everything is handed in and pci has a 1:1 - so life is simple. To answer Fab's question: How do PCI functions avoid a malloc? It's a 1:1 relationship and the PCI discovery sets up all the main data structures. > (my set function returns an int because it does an allocation, so it > can fail) This feels like it shouldn't be necessary. PCI has dealt with the 1:N relationship well before a driver gets associated with a device and might call pci_set_drvdata(). The 1:N relationship (ib_device:ib_client) needs to be dealt with at a different level or earlier in the initialization sequence. Is that not possible? hth, grant From roland at topspin.com Mon Sep 13 09:48:42 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Sep 2004 09:48:42 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <20040913154206.GA11533@cup.hp.com> (Grant Grundler's message of "Mon, 13 Sep 2004 08:42:06 -0700") References: <000401c49824$95505150$655aa8c0@infiniconsys.com> <528ybfa2x1.fsf@topspin.com> <20040913154206.GA11533@cup.hp.com> Message-ID: <52vfei6ro5.fsf@topspin.com> Grant> Can a client talk to multiple devices? ie is the "data" Grant> more closely associated with the ib_device or with the Grant> ib_client? Someplace, there has to be a 1:1 mapping or Grant> ib_get_client_data() can't work. I don't quite follow this question. The data is associated with the pair (device, client). 
One client can talk to multiple devices (eg IPoIB using all the HCAs in a system), and multiple clients can talk to a single device (eg IPoIB and SDP sharing an HCA). If you look at the implementation of ib_get_client_data() that I posted it should be pretty clear how things work. I chose to keep a list of client data in each device, but that's just an implementation detail. Grant> This feels like it shouldn't be necessary. PCI has dealt Grant> with the 1:N relationship well before a driver gets Grant> associated with a device and might call pci_set_drvdata(). I'm not sure what you're referring to here. What is the 1 and what is the M in this relationship? Grant> The 1:N relationship (ib_device:ib_client) needs to be Grant> dealt with at a different level or earlier in the Grant> initialization sequence. Is that not possible? It's actually an M:N relationship. I don't see a way to avoid dynamic allocation, since both devices and clients can be added and removed at any time in any order, and we have no way of knowing in advance what the maximum number of devices or clients is going to be. Thanks, Roland From mshefty at ichips.intel.com Mon Sep 13 09:51:05 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 13 Sep 2004 09:51:05 -0700 Subject: [openib-general] ib_mad.c comments In-Reply-To: <1094901754.1752.1173.camel@localhost.localdomain> References: <20040908164739.3e9c8723.mshefty@ichips.intel.com> <524qm8gmxg.fsf@topspin.com> <1094901754.1752.1173.camel@localhost.localdomain> Message-ID: <20040913095105.31b16c06.mshefty@ichips.intel.com> On Sat, 11 Sep 2004 07:22:35 -0400 Hal Rosenstock wrote: > > Sean> We should return the error code from ib_post_send in order > > Sean> to handle overruns differently. > > > > What did we decide about how to handle someone posting more sends than > > the underlying work queue can hold? > > Last I recall, we deferred this issue. Maybe we should just defer the > implementation but decide what should be done. 
I believe that we agreed that queuing should be done within the access layer. I don't think that we came to any conclusion about an implementation beyond that. Personally, for an initial implementation, I'd just go with posting work requests, and generate completions for sends that cannot be posted. This should be fairly trivial to implement, yet still work. > On the send side but that's not the case on the receive side as there > are posts for multiple QPs. Maybe there should be a list per QP and then > this would be true which eliminates the need to walk the list. The > implementation is rapidly heading towards this. Yes, I'd go with a list per QP. From roland at topspin.com Mon Sep 13 10:01:28 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Sep 2004 10:01:28 -0700 Subject: [openib-general] Re: TODO patch In-Reply-To: <20040913063028.GC17530@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 13 Sep 2004 09:30:28 +0300") References: <20040812072806.GA803@mellanox.co.il> <20040812074643.GB803@mellanox.co.il> <523c2s2xre.fsf@topspin.com> <20040815125027.GA410@mellanox.co.il> <527jrzqf7u.fsf@topspin.com> <20040913063028.GC17530@mellanox.co.il> Message-ID: <52mzzu6r2v.fsf@topspin.com> I added the following to the hw/mthca/TODO. Patches that contain the actual implementation are a lot more useful to me than TODO patches though ;) - R. Index: TODO =================================================================== --- TODO (revision 759) +++ TODO (working copy) @@ -11,6 +11,11 @@ into HCA memory. Miscellaneous verbs: At least query AH, query QP and resize CQ are not implemented. + Reduce locking for CQ poll: The reference counting used to prevent + CQs and QPs from being destroyed while events are being + dispatched could be eliminated by locking EQs (and, for QPs, + the associated CQs) during destroy operations. This saves an + atomic access in the CQ poll fast path. 
Medium projects (well understood but require a fair amount of code): MW support: ib_mthca does not support memory windows. From mshefty at ichips.intel.com Mon Sep 13 10:07:55 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 13 Sep 2004 10:07:55 -0700 Subject: [openib-general] Re: ib_mad.c comments In-Reply-To: <1094861535.1752.939.camel@localhost.localdomain> References: <20040908164739.3e9c8723.mshefty@ichips.intel.com> <1094852615.1794.538.camel@localhost.localdomain> <20040910160019.5b064909.mshefty@ichips.intel.com> <1094861535.1752.939.camel@localhost.localdomain> Message-ID: <20040913100755.2ee70dee.mshefty@ichips.intel.com> On Fri, 10 Sep 2004 20:12:16 -0400 Hal Rosenstock wrote: > > I used struct ib_qp* inside the ib_mad_agent for QP redirection purposes, > > and allows them to query the QP for its attributes, such as the number of > > supported SGEs. > > But the client doesn't get the mad_agent pointer until it registers. ib_mad_reg(ister? :) is used to send/receive MADs on QP 0 or 1. ib_mad_qp_redir is used to send/receive MADs on a different QP. Both calls return the mad_agent pointer, but in the case of ib_mad_qp_redir, the qp pointer references the qp specified by the API. > I was thinking this (separate receive lists for QPs 0 and 1) but was not > quite there. I will now be doing this shortly. This also seems to mean a > list for every redirected QP too :-( That is a future item. I'm not sure that we need to maintain lists for redirected QPs. The user owns posting the receives on that QP, so will have set the wr_id to something use for them. When the redirected QP is destroyed, the user should get completions for all posted receives. For QP0/1, the access layer owns the receive lists, so should be able to set the wr_id to whatever it needs in order to recover the buffers. > > We spoke about this some yesterday, but for others on the list, I think that the current > > implementation of ib_post_send needs to be moved down and renamed. 
A call to ib_post_send > > could then call that routine and take whatever action is appropriate to handle an overrun case, > > such as queuing the request, ignoring the overrun, etc. > > Do you mean ib_mad_post_send rather than ib_post_send ? I did... yes From mshefty at ichips.intel.com Mon Sep 13 10:10:05 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 13 Sep 2004 10:10:05 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <52zn3xa5xw.fsf@topspin.com> References: <52mzzxdfqx.fsf@topspin.com> <20040910140235.5139e440.mshefty@ichips.intel.com> <20040910225032.GA29616@cup.hp.com> <1094857825.1794.735.camel@localhost.localdomain> <20040911011207.GE29616@cup.hp.com> <52zn3xa5xw.fsf@topspin.com> Message-ID: <20040913101005.06cc86ca.mshefty@ichips.intel.com> On Fri, 10 Sep 2004 19:35:23 -0700 Roland Dreier wrote: > Grant> But this also turns up two "deregister" and not sure what > Grant> to make of them: ib_core.h ib_device_notifier_deregister() > > This one is my old API, which I'm improving to ib_unregister_client... > (not checked in yet, pending consensus) I agree that the names you have are fine, so I vote to just check in the code. From mshefty at ichips.intel.com Mon Sep 13 10:14:35 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 13 Sep 2004 10:14:35 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <524qm5bkjx.fsf@topspin.com> References: <52mzzxdfqx.fsf@topspin.com> <20040910140235.5139e440.mshefty@ichips.intel.com> <20040910225032.GA29616@cup.hp.com> <524qm5bkjx.fsf@topspin.com> Message-ID: <20040913101435.1ac6bc7b.mshefty@ichips.intel.com> On Fri, 10 Sep 2004 19:34:26 -0700 Roland Dreier wrote: > It makes sense to me to use ib_reg_phys_mr, ib_dereg_mr, etc as names > of verbs since the IB spec uses the word "deregister" and the rest of > the verb functions are pretty abbreviated. 
Registering a memory > region seems somehow a little different from registering a callback or > a client too. I think I'll leave those as is for now then. A simple rename, if needed later, is easier to deal with than changes to parameters. > In any case I would like to stick with the ib_register_client function > names. In fact in ib_mad.h maybe we should change ib_mad_reg and > ib_mad_dereg to ib_register_mad_agent and ib_unregister_mad_agent > (ib_mad_reg seems backwards from the rest of the API, where the verb > comes before the noun -- eg ib_create_qp). I agree. The MAD names should change. I'll submit a patch for this shortly. From mshefty at ichips.intel.com Mon Sep 13 10:16:02 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 13 Sep 2004 10:16:02 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <528ybhbkrt.fsf@topspin.com> References: <52mzzxdfqx.fsf@topspin.com> <20040910140235.5139e440.mshefty@ichips.intel.com> <528ybhbkrt.fsf@topspin.com> Message-ID: <20040913101602.7172450a.mshefty@ichips.intel.com> On Fri, 10 Sep 2004 19:29:42 -0700 Roland Dreier wrote: > Sean> What use do you see for ib_dispatch_event? > > The low-level driver calls it when an unaffiliated event occurs. > Walking the list of event handlers and calling each one is pretty > trivial but it feels better to me to keep the details of how the list > of event handlers is implemented encapsulated in the access layer. Understood and agree. (I was thinking that you wanted the call for the ULPs, and was just trying to understand why.) From mshefty at ichips.intel.com Mon Sep 13 10:18:09 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 13 Sep 2004 10:18:09 -0700 Subject: [openib-general] semantics of process_mad? 
In-Reply-To: <521xh9df3y.fsf@topspin.com> References: <521xh9df3y.fsf@topspin.com> Message-ID: <20040913101809.0d7131f9.mshefty@ichips.intel.com> On Fri, 10 Sep 2004 13:49:05 -0700 Roland Dreier wrote: > enum ib_mad_result { > IB_MAD_RESULT_FAILURE = 0, // (!SUCCESS is the important flag) > IB_MAD_RESULT_SUCCESS = 1 << 0, // MAD was successfully processed > IB_MAD_RESULT_REPLY = 1 << 1, // Reply packet needs to be sent > IB_MAD_RESULT_CONSUMED = 1 << 2 // Packet consumed: stop processing > }; I think that using these return codes would work fine. From halr at voltaire.com Mon Sep 13 10:22:16 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 13 Sep 2004 13:22:16 -0400 Subject: [openib-general] ib_mad.c comments In-Reply-To: <20040913095105.31b16c06.mshefty@ichips.intel.com> References: <20040908164739.3e9c8723.mshefty@ichips.intel.com> <524qm8gmxg.fsf@topspin.com> <1094901754.1752.1173.camel@localhost.localdomain> <20040913095105.31b16c06.mshefty@ichips.intel.com> Message-ID: <1095096135.2002.87.camel@localhost.localdomain> On Mon, 2004-09-13 at 12:51, Sean Hefty wrote: > Personally, for an initial implementation, I'd just go with > posting work requests, and generate completions for sends that > cannot be posted. This should be fairly trivial to implement, > yet still work. The two implementation "issues" I see with this approach are: 1. Which completion code to hand back ? Is IB_WC_GENERAL_ERR the best one ? This is a nit. 2. The context for the callback might be different between the overrun of post send and a real send completion. If this would cause extra code to be written to handle this, I would just prefer to hand back the post_send error for now as this is not expected to be the normal case. I will handle deferred sends once things are basically working. > Yes, I'd go with a list per QP. The implementation has been changed to be a list per QP. 
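For reference, a userspace sketch of the list-per-QP idea: each QP gets its own posted list, and a completion simply takes the first entry instead of walking the list. It assumes completions on a QP arrive in the order the work requests were posted. The list helpers and the struct and function names below are made up for the sketch; they are not the ib_mad.c code.

```c
#include <stddef.h>

/* Userspace stand-in for a circular doubly linked list with a head node. */
struct sketch_list {
	struct sketch_list *next, *prev;
};

#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void sketch_list_init(struct sketch_list *head)
{
	head->next = head->prev = head;
}

static int sketch_list_empty(const struct sketch_list *head)
{
	return head->next == head;
}

static void sketch_list_add_tail(struct sketch_list *node,
				 struct sketch_list *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

static void sketch_list_del(struct sketch_list *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
	node->next = node->prev = node;
}

#define SKETCH_NUM_QPS 2	/* QP0 and QP1 */

struct sketch_posted_recv {
	struct sketch_list list;
	unsigned long long wr_id;
};

struct sketch_port {
	struct sketch_list posted[SKETCH_NUM_QPS];	/* one FIFO per QP */
	int posted_count[SKETCH_NUM_QPS];
};

void sketch_port_init(struct sketch_port *port)
{
	int i;

	for (i = 0; i < SKETCH_NUM_QPS; i++) {
		sketch_list_init(&port->posted[i]);
		port->posted_count[i] = 0;
	}
}

/* Record a posted receive at the tail of its QP's list. */
void sketch_post_recv(struct sketch_port *port, int qp,
		      struct sketch_posted_recv *recv)
{
	sketch_list_add_tail(&recv->list, &port->posted[qp]);
	port->posted_count[qp]++;
}

/* On completion, take the oldest posted receive for that QP, or NULL. */
struct sketch_posted_recv *sketch_complete_recv(struct sketch_port *port,
						int qp)
{
	struct sketch_list *head = &port->posted[qp];
	struct sketch_posted_recv *recv;

	if (sketch_list_empty(head))
		return NULL;
	/* The first entry is head->next; the head itself is not an entry. */
	recv = sketch_container_of(head->next, struct sketch_posted_recv, list);
	sketch_list_del(&recv->list);
	port->posted_count[qp]--;
	return recv;
}
```

Because each QP has its own list, a completion on one QP never has to skip over entries posted on the other, so the lookup is constant time.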
-- Hal From halr at voltaire.com Mon Sep 13 10:35:04 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 13 Sep 2004 13:35:04 -0400 Subject: [openib-general] Re: ib_mad.c comments In-Reply-To: <20040913100755.2ee70dee.mshefty@ichips.intel.com> References: <20040908164739.3e9c8723.mshefty@ichips.intel.com> <1094852615.1794.538.camel@localhost.localdomain> <20040910160019.5b064909.mshefty@ichips.intel.com> <1094861535.1752.939.camel@localhost.localdomain> <20040913100755.2ee70dee.mshefty@ichips.intel.com> Message-ID: <1095096903.2002.100.camel@localhost.localdomain> On Mon, 2004-09-13 at 13:07, Sean Hefty wrote: > I'm not sure that we need to maintain lists for redirected QPs. The user owns > posting the receives on that QP, so will have set the wr_id to something use > for them. When the redirected QP is destroyed, the user should get completions > for all posted receives. Which component's responsibility is it to generate the completions on QP destruction ? > For QP0/1, the access layer owns the receive lists, so should be able to > set the wr_id to whatever it needs in order to recover the buffers. That's what's already being done. For QP0 and 1, wr_id is only meaningful internally. 
-- Hal From mshefty at ichips.intel.com Mon Sep 13 10:54:07 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 13 Sep 2004 10:54:07 -0700 Subject: [openib-general] ib_mad.c comments In-Reply-To: <1095096135.2002.87.camel@localhost.localdomain> References: <20040908164739.3e9c8723.mshefty@ichips.intel.com> <524qm8gmxg.fsf@topspin.com> <1094901754.1752.1173.camel@localhost.localdomain> <20040913095105.31b16c06.mshefty@ichips.intel.com> <1095096135.2002.87.camel@localhost.localdomain> Message-ID: <20040913105407.726e9a33.mshefty@ichips.intel.com> On Mon, 13 Sep 2004 13:22:16 -0400 Hal Rosenstock wrote: > On Mon, 2004-09-13 at 12:51, Sean Hefty wrote: > > Personally, for an initial implementation, I'd just go with > > posting work requests, and generate completions for sends that > > cannot be posted. This should be fairly trivial to implement, > > yet still work. > The two implementation "issues" I see with this approach are: > 1. Which completion code to hand back ? Is IB_WC_GENERAL_ERR the best > one ? This is a nit. > 2. The context for the callback might be different between the overrun > of post send and a real send completion. If this would cause extra code > to be written to handle this, I would just prefer to hand back the > post_send error for now as this is not expected to be the normal case. I > will handle deferred sends once things are basically working. To clarify: A user would call ib_mad_post_send. Internally, if the call failed because of an overrun, we would simply return success. This would be true even if the MAD were queued. Internally, after posting the send or not, we would track the send as normal (do RMPP, wait for timeouts, etc.). After any timeout, we would invoke the user's send_handler to complete the MAD. If we later change the implementation to queue the send work requests, nothing changes from the user's perspective. I can help write up the code for this, but probably wouldn't be able to start on it for a couple of days. 
I'm just trying to think of what we can do that would work and would consist of a quick but reasonable implementation. From mshefty at ichips.intel.com Mon Sep 13 10:58:34 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 13 Sep 2004 10:58:34 -0700 Subject: [openib-general] Re: ib_mad.c comments In-Reply-To: <1095096903.2002.100.camel@localhost.localdomain> References: <20040908164739.3e9c8723.mshefty@ichips.intel.com> <1094852615.1794.538.camel@localhost.localdomain> <20040910160019.5b064909.mshefty@ichips.intel.com> <1094861535.1752.939.camel@localhost.localdomain> <20040913100755.2ee70dee.mshefty@ichips.intel.com> <1095096903.2002.100.camel@localhost.localdomain> Message-ID: <20040913105834.662d2c0b.mshefty@ichips.intel.com> On Mon, 13 Sep 2004 13:35:04 -0400 Hal Rosenstock wrote: > On Mon, 2004-09-13 at 13:07, Sean Hefty wrote: > > I'm not sure that we need to maintain lists for redirected QPs. The user owns > > posting the receives on that QP, so will have set the wr_id to something use > > for them. When the redirected QP is destroyed, the user should get completions > > for all posted receives. > > Which component's responsibility is it to generate the completions on QP > destruction ? I should have more specifically said when the QP is modified to the error state. I was using "destroyed" here loosely to refer to a process of cleaning up the QP. From roland at topspin.com Mon Sep 13 11:08:06 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Sep 2004 11:08:06 -0700 Subject: [openib-general] semantics of process_mad? In-Reply-To: <20040913101809.0d7131f9.mshefty@ichips.intel.com> (Sean Hefty's message of "Mon, 13 Sep 2004 10:18:09 -0700") References: <521xh9df3y.fsf@topspin.com> <20040913101809.0d7131f9.mshefty@ichips.intel.com> Message-ID: <52isai6nzt.fsf@topspin.com> It seems that the process_mad method wants one more parameter too: the source LID of the MAD being processed. 
This is used to tell if an SM trap is generated locally (by the Tavor) and should be forwarded to the SM or if it's a real trap received from the fabric. Should I add this parameter too or does anyone else see a better API? Thanks, Roland From halr at voltaire.com Mon Sep 13 11:38:07 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 13 Sep 2004 14:38:07 -0400 Subject: [openib-general] [PATCH] [TRIVIAL]: ib_device.c: Eliminate unused variable Message-ID: <1095100686.2002.132.camel@localhost.localdomain> ib_device.c: Eliminate unused variable Index: ib_device.c =================================================================== --- ib_device.c (revision 805) +++ ib_device.c (working copy) @@ -412,7 +412,6 @@ void *data) { struct ib_client_data *context; - int ret = 0; unsigned long flags; spin_lock_irqsave(&device->client_data_lock, flags); From iod00d at hp.com Mon Sep 13 11:46:07 2004 From: iod00d at hp.com (Grant Grundler) Date: Mon, 13 Sep 2004 11:46:07 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <52vfei6ro5.fsf@topspin.com> References: <000401c49824$95505150$655aa8c0@infiniconsys.com> <528ybfa2x1.fsf@topspin.com> <20040913154206.GA11533@cup.hp.com> <52vfei6ro5.fsf@topspin.com> Message-ID: <20040913184607.GE11533@cup.hp.com> On Mon, Sep 13, 2004 at 09:48:42AM -0700, Roland Dreier wrote: > I don't quite follow this question. The data is associated with the > pair (device, client). One client can talk to multiple devices (eg > IPoIB using all the HCAs in a system), and multiple clients can talk > to a single device (eg IPoIB and SDP sharing an HCA). This sounds no different than say tulip driver which might claim multiple 100BT cards below different PCI Busses. I expect each client to have some data structure to track state info and each "parent" device must know something about it. Ie when the parent device goes away, each related client instance must be notified. Or is that not true either? 
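To make the notification question concrete, here is a small userspace sketch. It assumes the access layer keeps one list of devices and one list of clients, and that each client supplies add and remove callbacks. All names are hypothetical and locking is omitted; this is not the API under discussion, only one way the bookkeeping could look.

```c
#include <stddef.h>

struct sketch_dev {
	const char *name;
	struct sketch_dev *next;
};

struct sketch_cli {
	const char *name;
	void (*add)(struct sketch_dev *dev);	/* device became available */
	void (*remove)(struct sketch_dev *dev);	/* device is going away */
	struct sketch_cli *next;
};

static struct sketch_dev *sketch_dev_list;
static struct sketch_cli *sketch_cli_list;

/* A new device is announced to every client already registered. */
void sketch_register_device(struct sketch_dev *dev)
{
	struct sketch_cli *c;

	dev->next = sketch_dev_list;
	sketch_dev_list = dev;
	for (c = sketch_cli_list; c; c = c->next)
		if (c->add)
			c->add(dev);
}

/* Every client is told before the device is unlinked. */
void sketch_unregister_device(struct sketch_dev *dev)
{
	struct sketch_dev **pp;
	struct sketch_cli *c;

	for (c = sketch_cli_list; c; c = c->next)
		if (c->remove)
			c->remove(dev);
	for (pp = &sketch_dev_list; *pp; pp = &(*pp)->next)
		if (*pp == dev) {
			*pp = dev->next;
			break;
		}
}

/* A new client is told about every device that already exists. */
void sketch_register_client(struct sketch_cli *client)
{
	struct sketch_dev *d;

	client->next = sketch_cli_list;
	sketch_cli_list = client;
	for (d = sketch_dev_list; d; d = d->next)
		if (client->add)
			client->add(d);
}

/* A departing client gets a remove call for each remaining device. */
void sketch_unregister_client(struct sketch_cli *client)
{
	struct sketch_cli **pp;
	struct sketch_dev *d;

	for (d = sketch_dev_list; d; d = d->next)
		if (client->remove)
			client->remove(d);
	for (pp = &sketch_cli_list; *pp; pp = &(*pp)->next)
		if (*pp == client) {
			*pp = client->next;
			break;
		}
}
```

In this arrangement devices and clients can come and go in any order, and the add callback is a natural place for a client to allocate its per-device state rather than doing so lazily in a set-data call.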
> If you look at > the implementation of ib_get_client_data() that I posted it should be > pretty clear how things work. I chose to keep a list of client data > in each device, but that's just an implementation detail. I did look - but the implementation doesn't tell me about the N:M vs 1:1. > Grant> This feels like it shouldn't be necessary. PCI has dealt > Grant> with the 1:N relationship well before a driver gets > Grant> associated with a device and might call pci_set_drvdata(). > > I'm not sure what you're referring to here. What is the 1 and what is > the M in this relationship? PCI Host bus controller is "1" and PCI devices are "N". > > Grant> The 1:N relationship (ib_device:ib_client) needs to be > Grant> dealt with at a different level or earlier in the > Grant> initialization sequence. Is that not possible? > > It's actually an M:N relationship. At a higher level it is. But it seems that each client must have its own data structure in order to track events/activity. > I don't see a way to avoid dynamic > allocation, since both devices and clients can be added and removed at > any time in any order, and we have no way of knowing in advance what > the maximum number of devices or clients is going to be. I agree we can't avoid the dynamic allocation. But we can choose when it happens. I'm just questioning if it really needs to happen in get/set private data. I'm thinking it should happen during some part of the initialization and removal of specific instances of a client. thanks, grant From mshefty at ichips.intel.com Mon Sep 13 12:02:46 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 13 Sep 2004 12:02:46 -0700 Subject: [openib-general] semantics of process_mad? 
In-Reply-To: <52isai6nzt.fsf@topspin.com> References: <521xh9df3y.fsf@topspin.com> <20040913101809.0d7131f9.mshefty@ichips.intel.com> <52isai6nzt.fsf@topspin.com> Message-ID: <20040913120246.1a2b89c9.mshefty@ichips.intel.com> On Mon, 13 Sep 2004 11:08:06 -0700 Roland Dreier wrote: > It seems that the process_mad method wants one more parameter too: the > source LID of the MAD being processed. This used to tell if an SM > trap is generated locally (by the Tavor) and should be forwarded to > the SM or if it's a real trap received from the fabric. > > Should I add this parameter too or does anyone else see a better API? Should we look at sending in a structure with some of this information? Right now, I think we have 4 parameters (device, flags, in_mad, out_mad). And we want to add in QP number, port number, and LID. Am I missing something? Port number seems like it could have its own parameter. For the source LID, would it be possible to use a flag for locally generated MADs? (Not suggesting this over using the source LID, just trying to see what other options there are.) Also, I didn't quite follow the reasoning for the QP number parameter. From roland at topspin.com Mon Sep 13 12:34:34 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Sep 2004 12:34:34 -0700 Subject: [openib-general] semantics of process_mad? In-Reply-To: <20040913120246.1a2b89c9.mshefty@ichips.intel.com> (Sean Hefty's message of "Mon, 13 Sep 2004 12:02:46 -0700") References: <521xh9df3y.fsf@topspin.com> <20040913101809.0d7131f9.mshefty@ichips.intel.com> <52isai6nzt.fsf@topspin.com> <20040913120246.1a2b89c9.mshefty@ichips.intel.com> Message-ID: <52ekl66jzp.fsf@topspin.com> Sean> Should we look at sending in a structure with some of this Sean> information? Right now, I think we have 4 parameters Sean> (device, flags, in_mad, out_mad). And we want to add in QP Sean> number, port number, and LID. Am I missing something? I don't think QP number is required. 
So we just need port and LID, which I don't think merits a struct. Sean> Port number seems like it could have its own parameter. For Sean> the source LID, would it be possible to use a flag for Sean> locally generated MADs? (Not suggesting this over using the Sean> source LID, just trying to see what other options there Sean> are.) I think the issue with using a flag is that it puts the knowledge that SLID==0 means locally generated (which is really Tavor-specific) into the access layer. Sean> Also, I didn't quite follow the reasoning for the QP number Sean> parameter. Actually I was convinced that we don't want the QP number (it should be up to the access layer to filter out class/QPN mismatches like SMPs received on QP1). - R. From roland at topspin.com Mon Sep 13 12:38:53 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Sep 2004 12:38:53 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <20040913184607.GE11533@cup.hp.com> (Grant Grundler's message of "Mon, 13 Sep 2004 11:46:07 -0700") References: <000401c49824$95505150$655aa8c0@infiniconsys.com> <528ybfa2x1.fsf@topspin.com> <20040913154206.GA11533@cup.hp.com> <52vfei6ro5.fsf@topspin.com> <20040913184607.GE11533@cup.hp.com> Message-ID: <52acvu6jsi.fsf@topspin.com> Grant> This sounds no different than say tulip driver which might Grant> claim multiple 100BT cards below different PCI Busses. The difference is that not only can a client have multiple devices, but also a device can have multiple clients. So PCI devices can get away with a single slot to track driver context. Grant> I expect each client to have some data structure to track Grant> state info and each "parent" device must know something Grant> about it. Ie when the parent device goes away, each related Grant> client instance must be notified. Or is that not true Grant> either? There's a list of devices and a list of clients.
When a device is added or removed, every client on the list gets a callback (ib_device is sort of like class_device and ib_client is sort of like class_interface in the driver model). Grant> I agree we can't avoid the dynamic allocation. But we can Grant> choose when it happens. I'm just questioning if it really Grant> needs to happen in get/set private data. I'm thinking it Grant> should happen during some part of the initialization and Grant> removal of specific instances of a client. Sure, we could allocate client context before we call the client back for a device add (and just not even call the client back if the allocation failed). Does that seem better? - R. From gdror at mellanox.co.il Mon Sep 13 12:46:01 2004 From: gdror at mellanox.co.il (Dror Goldenberg) Date: Mon, 13 Sep 2004 22:46:01 +0300 Subject: [openib-general] Multicast address aliasing in IPoIB Message-ID: <506C3D7B14CDD411A52C00025558DED605F9CD53@mtlex01.yok.mtl.com> > -----Original Message----- > From: Roland Dreier [mailto:roland at topspin.com] > Sent: Friday, September 10, 2004 5:43 AM > > > Dror> The problem is based on how Linux works. The IPoIB plugs > Dror> into the OS as an Ethernet driver. Therefore, it is subject > Dror> to the Ethernet rules, and from the OS, it only get lists of > Dror> "Ethernet" multicast addresses. The notification for > Dror> additional HW multicast address, will only be delivered to > Dror> the IPoIB driver if there is no Ethernet > Dror> aliasing. Otherwise, Linux will just maintain a reference > Dror> count per HW address. Therefore, there is a problem with the > Dror> implementation of the IPoIB driver as an Ethernet > Dror> driver. Fixing this problem requires a kernel patch (or a > Dror> very ugly solution). > > I see no reason to modify the IPoIB multicast mapping just > because the Linux kernel does not yet have support for it. 
> For customers that have to run older unpatched kernels, it is > possible to support the full IPoIB spec via ugly hacks (such > as the one found in the gen1 Topspin stack's IPoIB implementation). Well, I am not sure that this solution works. It does peek into the full IP addresses when transmitting IP MC packets and when an MC list update is triggered. However, when the kernel just updates an existing entry in the HW MC list by incrementing a ref count, it'll not call set_multicast_list() on the net_device. For example, try registering to two aliasing addresses (register only: setsockopt(IP_ADD_MEMBERSHIP), but don't send a packet). You'll see that you end up being registered only to one address. This is the case where I feel that you'd need a much uglier solution in order to make it work in gen1 stacks. > > For newer kernels, there is no reason that the IPoIB driver > has to masquerade as an ethernet driver -- we should be > aiming for a fully native driver that sets its dev->type > field to ARPHRD_INFINIBAND. For Gen 2, I strongly agree. From halr at voltaire.com Mon Sep 13 12:55:14 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 13 Sep 2004 15:55:14 -0400 Subject: [openib-general] semantics of process_mad? In-Reply-To: <52ekl66jzp.fsf@topspin.com> References: <521xh9df3y.fsf@topspin.com> <20040913101809.0d7131f9.mshefty@ichips.intel.com> <52isai6nzt.fsf@topspin.com> <20040913120246.1a2b89c9.mshefty@ichips.intel.com> <52ekl66jzp.fsf@topspin.com> Message-ID: <1095105313.1947.158.camel@localhost.localdomain> On Mon, 2004-09-13 at 15:34, Roland Dreier wrote: > Sean> Should we look at sending in a structure with some of this > Sean> information? Right now, I think we have 4 parameters > Sean> (device, flags, in_mad, out_mad). And we want to add in QP > Sean> number, port number, and LID. Am I missing something? > > I don't think QP number is required.
My bad (for getting someone started down this path) :-( > So we just need port and LID, which I don't think merits a struct. > > Sean> Port number seems like it could have its own parameter. For > Sean> the source LID, would it be possible to use a flag for > Sean> locally generated MADs? (Not suggesting this over using the > Sean> source LID, just trying to see what other options there > Sean> are.) > > I think the issue with using a flag is that it puts the knowledge that > SLID==0 means locally generated (which is really Tavor-specific) into > the access layer. That seems better. I still don't quite have the full picture and want to see how this works out for other HCAs which might not do this and also for proper layering for Tavor. > Sean> Also, I didn't quite follow the reasoning for the QP number > Sean> parameter. > > Actually I was convinced that we don't want the QP number (it should > be up to the access layer to filter out class/QPN mismatches like SMPs > received on QP1). I'm just about to code this into the receive path now :-) I'm assuming that the transmit side doesn't need this paranoid checking for miscoded clients. -- Hal From iod00d at hp.com Mon Sep 13 13:10:03 2004 From: iod00d at hp.com (Grant Grundler) Date: Mon, 13 Sep 2004 13:10:03 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <52acvu6jsi.fsf@topspin.com> References: <000401c49824$95505150$655aa8c0@infiniconsys.com> <528ybfa2x1.fsf@topspin.com> <20040913154206.GA11533@cup.hp.com> <52vfei6ro5.fsf@topspin.com> <20040913184607.GE11533@cup.hp.com> <52acvu6jsi.fsf@topspin.com> Message-ID: <20040913201003.GH11533@cup.hp.com> On Mon, Sep 13, 2004 at 12:38:53PM -0700, Roland Dreier wrote: > Grant> This sounds no different than say tulip driver which might > Grant> claim multiple 100BT cards below different PCI Busses. > > The difference is that not only can only a client have multiple > devices, but also a device can have multiple clients. 
So PCI devices > can get away with a single slot to track driver context. I'm suggesting there should be an equivalent 1:1 for each IPoIB instance. I.e., each IPoIB client is dealing with a virtual NIC. > Grant> I expect each client to have some data structure to track > Grant> state info and each "parent" device must know something > Grant> about it. Ie when the parent device goes away, each related > Grant> client instance must be notified. Or is that not true > Grant> either? > > There's a list of devices and a list of clients. When a device is > added or removed, every client on the list gets a callback (ib_device > is sort of like class_device and ib_client is sort of like > class_interface in the driver model). We don't know which client instances are bound to a particular device? > > Grant> I agree we can't avoid the dynamic allocation. But we can > Grant> choose when it happens. I'm just questioning if it really > Grant> needs to happen in get/set private data. I'm thinking it > Grant> should happen during some part of the initialization and > Grant> removal of specific instances of a client. > > Sure, we could allocate client context before we call the client back > for a device add (and just not even call the client back if the > allocation failed). Does that seem better? Yeah - I think so. It would match the analogy to PCI support a lot better. And the IB support interfaces would in turn behave like those for PCI as well - and not just look like them. But I don't want to turn the IB world upside down to make that work. If it doesn't make sense, do it the way you proposed originally.
thanks, grant From roland at topspin.com Mon Sep 13 13:16:07 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Sep 2004 13:16:07 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <20040913201003.GH11533@cup.hp.com> (Grant Grundler's message of "Mon, 13 Sep 2004 13:10:03 -0700") References: <000401c49824$95505150$655aa8c0@infiniconsys.com> <528ybfa2x1.fsf@topspin.com> <20040913154206.GA11533@cup.hp.com> <52vfei6ro5.fsf@topspin.com> <20040913184607.GE11533@cup.hp.com> <52acvu6jsi.fsf@topspin.com> <20040913201003.GH11533@cup.hp.com> Message-ID: <52656h7wmw.fsf@topspin.com> Grant> We don't know which client instances are bound to a Grant> particular device? Nope... think of class_devices and class_interfaces in the current driver model. Grant> But I don't want to turn the IB world upside down to make Grant> that work. If if doesn't make sense, do it the way you Grant> proposed originally. It's easy... like this: Index: infiniband/core/ib_device.c =================================================================== --- infiniband/core/ib_device.c (revision 803) +++ infiniband/core/ib_device.c (working copy) @@ -161,6 +161,28 @@ } EXPORT_SYMBOL(ib_dealloc_device); +static int add_client_context(struct ib_device *device, struct ib_client *client) +{ + struct ib_client_data *context; + unsigned long flags; + + context = kmalloc(sizeof *context, GFP_KERNEL); + if (!context) { + printk(KERN_WARNING "Couldn't allocate client context for %s/%s\n", + device->name, client->name); + return -ENOMEM; + } + + context->client = client; + context->data = NULL; + + spin_lock_irqsave(&device->client_data_lock, flags); + list_add(&context->list, &device->client_data_list); + spin_unlock_irqrestore(&device->client_data_lock, flags); + + return 0; +} + int ib_register_device(struct ib_device *device) { struct ib_device_private *priv; @@ -234,17 +256,10 @@ goto out_free_port; } - ret = ib_proc_setup(device, device->node_type == 
IB_NODE_SWITCH); - if (ret) { - printk(KERN_WARNING "Couldn't create /proc dir for %s\n", - device->name); - goto out_free_cache; - } - if (ib_device_register_sysfs(device)) { printk(KERN_WARNING "Couldn't register device %s with driver model\n", device->name); - goto out_proc; + goto out_free_cache; } list_add_tail(&device->core_list, &device_list); @@ -255,16 +270,13 @@ struct ib_client *client; list_for_each_entry(client, &client_list, list) - if (client->add) + if (client->add && !add_client_context(device, client)) client->add(device); } up(&device_sem); return 0; - out_proc: - ib_proc_cleanup(device); - out_free_cache: ib_cache_cleanup(device); @@ -302,7 +314,6 @@ kfree(context); spin_unlock_irqrestore(&device->client_data_lock, flags); - ib_proc_cleanup(device); ib_cache_cleanup(device); kfree(priv->port_data); @@ -355,7 +366,7 @@ list_add_tail(&client->list, &client_list); list_for_each_entry(device, &device_list, core_list) - if (client->add) + if (client->add && !add_client_context(device, client)) client->add(device); up(&device_sem); @@ -408,34 +419,23 @@ } EXPORT_SYMBOL(ib_get_client_data); -int ib_set_client_data(struct ib_device *device, struct ib_client *client, - void *data) +void ib_set_client_data(struct ib_device *device, struct ib_client *client, + void *data) { struct ib_client_data *context; - int ret = 0; unsigned long flags; spin_lock_irqsave(&device->client_data_lock, flags); list_for_each_entry(context, &device->client_data_list, list) if (context->client == client) { context->data = data; - spin_unlock_irqrestore(&device->client_data_lock, flags); - return 0; + break; } spin_unlock_irqrestore(&device->client_data_lock, flags); - context = kmalloc(sizeof *context, GFP_KERNEL); - if (!context) - return -ENOMEM; - context->client = client; - context->data = data; - - spin_lock_irqsave(&device->client_data_lock, flags); - list_add(&context->list, &device->client_data_list); - spin_unlock_irqrestore(&device->client_data_lock, flags); - - 
return 0; + printk(KERN_WARNING "No client context found for %s/%s\n", + device->name, client->name); } EXPORT_SYMBOL(ib_set_client_data); From halr at voltaire.com Mon Sep 13 13:22:13 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 13 Sep 2004 16:22:13 -0400 Subject: [openib-general] semantics of process_mad? In-Reply-To: <52ekl66jzp.fsf@topspin.com> References: <521xh9df3y.fsf@topspin.com> <20040913101809.0d7131f9.mshefty@ichips.intel.com> <52isai6nzt.fsf@topspin.com> <20040913120246.1a2b89c9.mshefty@ichips.intel.com> <52ekl66jzp.fsf@topspin.com> Message-ID: <1095106933.2002.207.camel@localhost.localdomain> On Mon, 2004-09-13 at 15:34, Roland Dreier wrote: > Actually I was convinced that we don't want the QP number (it should > be up to the access layer to filter out class/QPN mismatches like SMPs > received on QP1). I now have code in the receive path which filters out SMI packets sent to other than QP0 and GSI packets. There is one more set of related checks on the SMI/GSI. SMI packets are supposed to be received on VL15. GSI packets are on other than VL15. The WC contains the SL but not the VL. How is this accomplished ? Is this level of the filtering done in the firmware ? -- Hal From roland at topspin.com Mon Sep 13 13:28:25 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Sep 2004 13:28:25 -0700 Subject: [openib-general] semantics of process_mad? In-Reply-To: <1095106933.2002.207.camel@localhost.localdomain> (Hal Rosenstock's message of "Mon, 13 Sep 2004 16:22:13 -0400") References: <521xh9df3y.fsf@topspin.com> <20040913101809.0d7131f9.mshefty@ichips.intel.com> <52isai6nzt.fsf@topspin.com> <20040913120246.1a2b89c9.mshefty@ichips.intel.com> <52ekl66jzp.fsf@topspin.com> <1095106933.2002.207.camel@localhost.localdomain> Message-ID: <521xh57w2e.fsf@topspin.com> Hal> There is one more set of related checks on the SMI/GSI. SMI Hal> packets are supposed to be received on VL15. GSI packets are Hal> on other than VL15. 
The WC contains the SL but not the Hal> VL. How is this accomplished ? Is this level of the filtering Hal> done in the firmware ? Yes, you can assume that the VL15 check is done below the level of verbs. - R. From roland at topspin.com Mon Sep 13 13:31:49 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Sep 2004 13:31:49 -0700 Subject: [openib-general] semantics of process_mad? In-Reply-To: <1095106933.2002.207.camel@localhost.localdomain> (Hal Rosenstock's message of "Mon, 13 Sep 2004 16:22:13 -0400") References: <521xh9df3y.fsf@topspin.com> <20040913101809.0d7131f9.mshefty@ichips.intel.com> <52isai6nzt.fsf@topspin.com> <20040913120246.1a2b89c9.mshefty@ichips.intel.com> <52ekl66jzp.fsf@topspin.com> <1095106933.2002.207.camel@localhost.localdomain> Message-ID: <52wtyx6hca.fsf@topspin.com> By the way, for the SMI implementation, you may want to look at ib_mad_validate_dr_smp() in mad_filter.c in my branch. Ted Wilcox spent a fair bit of time going through the real IB compliance suite and making sure that all the required checks were done. It's probably worth cribbing as much as possible rather than trying to do it independently. - R.
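As an illustration of the class/QPN filtering discussed in this thread, here is a minimal sketch in C. It assumes only what the thread and the IB management model state: the two subnet management classes (the IB_MGMT_CLASS_SUBN_* values from ib_mad.h) belong on QP0, and every other management class belongs on QP1 or another non-zero QP. The helper names is_smp_class() and mad_class_matches_qp() are hypothetical and are not part of the access layer API. The VL15 check is left out because, per the reply above, it can be assumed to happen below the verbs level.

```c
#include <stdbool.h>
#include <stdint.h>

/* Management class values as defined in ib_mad.h in this thread. */
#define IB_MGMT_CLASS_SUBN_LID_ROUTED      0x01
#define IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE  0x81

/* Hypothetical helper: true if the class is a subnet management class. */
static bool is_smp_class(uint8_t mgmt_class)
{
	return mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED ||
	       mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE;
}

/*
 * Hypothetical helper: true if a MAD of the given class is acceptable
 * on the QP it was received on.  SMPs are only valid on QP0, and QP0
 * only accepts SMPs; everything else is treated as GSI traffic.
 */
static bool mad_class_matches_qp(uint8_t mgmt_class, uint32_t qp_num)
{
	bool smp = is_smp_class(mgmt_class);

	if (qp_num == 0)
		return smp;	/* QP0: SMPs only */
	return !smp;		/* QP1 and others: no SMPs */
}
```

A receive path could drop any MAD for which mad_class_matches_qp() returns false before dispatching to registered agents, which keeps the QP number out of the process_mad parameter list.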
From halr at voltaire.com Mon Sep 13 13:34:22 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 13 Sep 2004 16:34:22 -0400 Subject: [openib-general] [PATCH] ib_mad.h: Add management base version definition Message-ID: <1095107661.3896.233.camel@localhost.localdomain> ib_mad.h: Add management base version definition Index: ../include/ib_mad.h =================================================================== --- ../include/ib_mad.h (revision 805) +++ ../include/ib_mad.h (working copy) @@ -28,6 +28,9 @@ #include +/* Management base version */ +#define IB_MGMT_BASE_VERSION 1 + /* Management classes */ #define IB_MGMT_CLASS_SUBN_LID_ROUTED 0x01 #define IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE 0x81 From iod00d at hp.com Mon Sep 13 13:40:00 2004 From: iod00d at hp.com (Grant Grundler) Date: Mon, 13 Sep 2004 13:40:00 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <52656h7wmw.fsf@topspin.com> References: <000401c49824$95505150$655aa8c0@infiniconsys.com> <528ybfa2x1.fsf@topspin.com> <20040913154206.GA11533@cup.hp.com> <52vfei6ro5.fsf@topspin.com> <20040913184607.GE11533@cup.hp.com> <52acvu6jsi.fsf@topspin.com> <20040913201003.GH11533@cup.hp.com> <52656h7wmw.fsf@topspin.com> Message-ID: <20040913204000.GJ11533@cup.hp.com> On Mon, Sep 13, 2004 at 01:16:07PM -0700, Roland Dreier wrote: > Grant> We don't know which client instances are bound to a > Grant> particular device? > > Nope... think of class_devices and class_interfaces in the current > driver model. I thought the whole point of class_devices and class_interfaces in the driver model was to bind devices to drivers. One difference from the PCI analogy is that IPoIB is "creating" the virtual device when it talks to the IB HCA driver and sets up the various queues. I don't know the details of which types of queues are necessary for one IPoIB instance. But it seems natural for an IPoIB "device" to have its own set of queues for RX/TX and any admin stuff.
> Grant> But I don't want to turn the IB world upside down to make > Grant> that work. If if doesn't make sense, do it the way you > Grant> proposed originally. > > It's easy... like this: Looks good to me - thanks! grant From mshefty at ichips.intel.com Mon Sep 13 13:51:44 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 13 Sep 2004 13:51:44 -0700 Subject: [openib-general] [PATCH] renaming of MAD APIs Message-ID: <20040913135144.5b229716.mshefty@ichips.intel.com> This patch changes the names of the MAD routines as suggested by Roland. This change is made to the files in gen2/branches/openib-candidate/src/linux-kernel/infiniband. I would like to avoid further updates to the files in trunk/contrib/intel, to keep everyone's life a little simpler. - Sean -- Index: access/ib_mad.c =================================================================== --- access/ib_mad.c (revision 806) +++ access/ib_mad.c (working copy) @@ -94,16 +94,16 @@ /* - * ib_mad_reg - Register to send/receive MADs + * ib_register_mad_agent eg - Register to send/receive MADs */ -struct ib_mad_agent *ib_mad_reg(struct ib_device *device, - u8 port, - enum ib_qp_type qp_type, - struct ib_mad_reg_req *mad_reg_req, - u8 rmpp_version, - ib_mad_send_handler send_handler, - ib_mad_recv_handler recv_handler, - void *context) +struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device, + u8 port, + enum ib_qp_type qp_type, + struct ib_mad_reg_req *mad_reg_req, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context) { struct ib_mad_port_private *entry, *priv = NULL; struct ib_mad_agent *mad_agent, *ret; @@ -252,12 +252,12 @@ error1: return ret; } -EXPORT_SYMBOL(ib_mad_reg); +EXPORT_SYMBOL(ib_register_mad_agent); /* - * ib_mad_dereg - Deregisters a client from using MAD services + * ib_unregister_mad_agent - Unregisters a client from using MAD services */ -int ib_mad_dereg(struct ib_mad_agent *mad_agent) +int 
ib_unregister_mad_agent(struct ib_mad_agent *mad_agent) { int i; unsigned long flags; @@ -280,13 +280,13 @@ return 0; } -EXPORT_SYMBOL(ib_mad_dereg); +EXPORT_SYMBOL(ib_unregister_mad_agent); /* - * ib_mad_post_send - Posts MAD(s) to the send queue of the QP associated + * ib_post_send_mad - Posts MAD(s) to the send queue of the QP associated * with the registered client */ -int ib_mad_post_send(struct ib_mad_agent *mad_agent, +int ib_post_send_mad(struct ib_mad_agent *mad_agent, struct ib_send_wr *send_wr, struct ib_send_wr **bad_send_wr) { @@ -349,7 +349,7 @@ ((struct ib_mad_port_private *)mad_agent->device->mad)->send_posted_mad_count--; spin_unlock_irqrestore(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, flags); *bad_send_wr = cur_send_wr; - printk(KERN_NOTICE "ib_mad_post_send failed\n"); + printk(KERN_NOTICE "ib_post_mad_send failed\n"); return ret; } cur_send_wr= next_send_wr; @@ -357,7 +357,7 @@ return 0; } -EXPORT_SYMBOL(ib_mad_post_send); +EXPORT_SYMBOL(ib_post_send_mad); static inline u8 convert_mgmt_class(struct ib_mad_reg_req *mad_reg_req) { Index: include/ib_mad.h =================================================================== --- include/ib_mad.h (revision 804) +++ include/ib_mad.h (working copy) @@ -189,7 +189,7 @@ }; /** - * ib_mad_reg - Register to send/receive MADs. + * ib_register_mad_agent - Register to send/receive MADs. * @device - The device to register with. * @port - The port on the specified device to use. * @qp_type - Specifies which QP to access. Must be either @@ -206,37 +206,37 @@ * MAD. * @context - User specified context associated with the registration. 
*/ -struct ib_mad_agent *ib_mad_reg(struct ib_device *device, - u8 port, - enum ib_qp_type qp_type, - struct ib_mad_reg_req *mad_reg_req, - u8 rmpp_version, - ib_mad_send_handler send_handler, - ib_mad_recv_handler recv_handler, - void *context); +struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device, + u8 port, + enum ib_qp_type qp_type, + struct ib_mad_reg_req *mad_reg_req, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context); /** - * ib_mad_dereg - Deregisters a client from using MAD services. + * ib_unregister_mad_agent - Deregisters a client from using MAD services. * @mad_agent - Corresponding MAD registration request to deregister. * * After invoking this routine, MAD services are no longer usable by the * client on the associated QP. */ -int ib_mad_dereg(struct ib_mad_agent *mad_agent); +int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent); /** - * ib_mad_post_send - Posts MAD(s) to the send queue of the QP associated + * ib_post_send_mad - Posts MAD(s) to the send queue of the QP associated * with the registered client. * @mad_agent - Specifies the associated registration to post the send to. * @send_wr - Specifies the information needed to send the MAD(s). * @bad_send_wr - Specifies the MAD on which an error was encountered. */ -int ib_mad_post_send(struct ib_mad_agent *mad_agent, +int ib_post_send_mad(struct ib_mad_agent *mad_agent, struct ib_send_wr *send_wr, struct ib_send_wr **bad_send_wr); /** - * ib_mad_qp_redir - Registers a QP for MAD services. + * ib_redirect_mad_qp - Registers a QP for MAD services. * @qp - Reference to a QP that requires MAD services. * @rmpp_version - If set, indicates that the client will send * and receive MADs that contain the RMPP header for the given version. @@ -251,14 +251,14 @@ * on user-owned QPs. After calling this routine, users may send * MADs on the specified QP by calling ib_mad_post_send. 
*/ -struct ib_mad_agent *ib_mad_qp_redir(struct ib_qp *qp, - u8 rmpp_version, - ib_mad_send_handler send_handler, - ib_mad_recv_handler recv_handler, - void *context); +struct ib_mad_agent *ib_redirect_mad_qp(struct ib_qp *qp, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context); /** - * ib_mad_process_wc - Processes a work completion associated with a + * ib_process_mad_wc - Processes a work completion associated with a * MAD sent or received on a redirected QP. * @mad_agent - Specifies the registered MAD service using the redirected QP. * @wc - References a work completion associated with a sent or received @@ -272,7 +272,7 @@ * process an inbound or outbound RMPP transfer, or to match a response MAD * with its corresponding request. */ -int ib_mad_process_wc(struct ib_mad_agent *mad_agent, +int ib_process_mad_wc(struct ib_mad_agent *mad_agent, struct ib_wc *wc); #endif /* IB_MAD_H */ From mshefty at ichips.intel.com Mon Sep 13 13:53:34 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 13 Sep 2004 13:53:34 -0700 Subject: [openib-general] semantics of process_mad? In-Reply-To: <52wtyx6hca.fsf@topspin.com> References: <521xh9df3y.fsf@topspin.com> <20040913101809.0d7131f9.mshefty@ichips.intel.com> <52isai6nzt.fsf@topspin.com> <20040913120246.1a2b89c9.mshefty@ichips.intel.com> <52ekl66jzp.fsf@topspin.com> <1095106933.2002.207.camel@localhost.localdomain> <52wtyx6hca.fsf@topspin.com> Message-ID: <20040913135334.3b290651.mshefty@ichips.intel.com> On Mon, 13 Sep 2004 13:31:49 -0700 Roland Dreier wrote: > By the way, for SMI implementation, you may want to look > ib_mad_validate_dr_smp() in mad_filter.c in my branch. Ted Wilcox > spent a fair bit of time going through the real IB compliance suite > and making sure that all the required checks were done. It's probably > worth cribbing as much as possible rather than trying to do it > independently. Thanks for the reference. 
Btw, I was going to start with the SMI implementation in your stack and port that to the updated APIs. - Sean From halr at voltaire.com Mon Sep 13 14:03:58 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 13 Sep 2004 17:03:58 -0400 Subject: [openib-general] semantics of process_mad? In-Reply-To: <1095105313.1947.158.camel@localhost.localdomain> References: <521xh9df3y.fsf@topspin.com> <20040913101809.0d7131f9.mshefty@ichips.intel.com> <52isai6nzt.fsf@topspin.com> <20040913120246.1a2b89c9.mshefty@ichips.intel.com> <52ekl66jzp.fsf@topspin.com> <1095105313.1947.158.camel@localhost.localdomain> Message-ID: <1095109437.3896.291.camel@localhost.localdomain> On Mon, 2004-09-13 at 15:55, Hal Rosenstock wrote: > > Actually I was convinced that we don't want the QP number (it should > > be up to the access layer to filter out class/QPN mismatches like SMPs > > received on QP1). > > I'm just about to code this into the receive path now :-) I'm assuming > that the transmit side doesn't need this paranoid checking for miscoded > clients. Is DQPN (from the BTH) missing in order to do the class/QPN filtering ? Should it be added to the end of the ib_mad struct ? -- Hal From halr at voltaire.com Mon Sep 13 14:26:29 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 13 Sep 2004 17:26:29 -0400 Subject: [openib-general] Re: [PATCH] renaming of MAD APIs In-Reply-To: <20040913135144.5b229716.mshefty@ichips.intel.com> References: <20040913135144.5b229716.mshefty@ichips.intel.com> Message-ID: <1095110788.3896.329.camel@localhost.localdomain> On Mon, 2004-09-13 at 16:51, Sean Hefty wrote: > This patch changes the names of the MAD routines as suggested by Roland. > > This change is made to the files in gen2/branches/openib-candidate/src/linux-kernel/infiniband. I would like to avoid further updates to the files in trunk/contrib/intel, to keep everyone's life a little simpler. 
Looks and sounds good to me :-) -- Hal From roland at topspin.com Mon Sep 13 14:16:18 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Sep 2004 14:16:18 -0700 Subject: [openib-general] semantics of process_mad? In-Reply-To: <1095109437.3896.291.camel@localhost.localdomain> (Hal Rosenstock's message of "Mon, 13 Sep 2004 17:03:58 -0400") References: <521xh9df3y.fsf@topspin.com> <20040913101809.0d7131f9.mshefty@ichips.intel.com> <52isai6nzt.fsf@topspin.com> <20040913120246.1a2b89c9.mshefty@ichips.intel.com> <52ekl66jzp.fsf@topspin.com> <1095105313.1947.158.camel@localhost.localdomain> <1095109437.3896.291.camel@localhost.localdomain> Message-ID: <52oek96fa5.fsf@topspin.com> Hal> Is DQPN (from the BTH) missing in order to do the class/QPN Hal> filtering ? Should it be added to the end of the ib_mad Hal> struct ? I think the consumer should know which queue pair a receive occurred on. - R. From mshefty at ichips.intel.com Mon Sep 13 15:16:49 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 13 Sep 2004 15:16:49 -0700 Subject: [openib-general] [PATCH] review for new MAD APIs Message-ID: <20040913151649.51cf14c8.mshefty@ichips.intel.com> The following patch adds two new APIs to better support zero-copy receives on MADs. The first call copies a chain of RMPP MADs into a single data buffer, ignoring the extra MAD headers. The second call returns the receive MAD buffers and chained completion structures to the access layer, where they were allocated. Comments? - Sean -- Index: include/ib_mad.h =================================================================== --- include/ib_mad.h (revision 808) +++ include/ib_mad.h (working copy) @@ -249,6 +249,29 @@ struct ib_send_wr **bad_send_wr); /** + * ib_coalesce_recv_mad - Coalesces received MAD data into a single buffer. + * @mad_recv_wc - Work completion information for a received MAD. + * @buf - User-provided data buffer to receive the coalesced buffers. 
The + * referenced buffer should be at least the size of the mad_len specified + * by @mad_recv_wc. + * + * This call copies a chain of received RMPP MADs into a single data buffer, + * removing duplicated headers. + */ +void ib_coalesce_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, + void *buf); + +/** + * ib_free_recv_mad - Returns data buffers used to receive a MAD to the + * access layer. + * @mad_recv_wc - Work completion information for a received MAD. + * + * Clients receiving MADs through their ib_mad_recv_handler must call this + * routine to return the work completion buffers to the access layer. + */ +void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc); + +/** * ib_redirect_mad_qp - Registers a QP for MAD services. * @qp - Reference to a QP that requires MAD services. * @rmpp_version - If set, indicates that the client will send From halr at voltaire.com Mon Sep 13 17:51:32 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 13 Sep 2004 20:51:32 -0400 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <52mzzxdfqx.fsf@topspin.com> References: <52mzzxdfqx.fsf@topspin.com> Message-ID: <1095123092.2002.372.camel@localhost.localdomain> On Fri, 2004-09-10 at 16:35, Roland Dreier wrote: > Finally, I added event_handler members to struct ib_cq and struct > ib_qp and added support for setting them on creation: > > struct ib_cq { > struct ib_device *device; > ib_comp_handler comp_handler; > void (*event_handler)(struct ib_event *, void *); > void * context; > int cqe; > atomic_t usecnt; /* count number of work queues */ > }; > > struct ib_cq *ib_create_cq(struct ib_device *device, > ib_comp_handler comp_handler, > void (*event_handler)(struct ib_event *, void *), > void *cq_context, int cqe); > > struct ib_qp { > struct ib_device *device; > struct ib_pd *pd; > struct ib_cq *send_cq; > struct ib_cq *recv_cq; > struct ib_srq *srq; > void (*event_handler)(struct ib_event *, void *); > void *qp_context; > u32 qp_num; > 
}; > > struct ib_qp_init_attr { > void (*event_handler)(struct ib_event *, void *); > void *qp_context; > struct ib_cq *send_cq; > struct ib_cq *recv_cq; > struct ib_srq *srq; > struct ib_qp_cap cap; > enum ib_sig_type sq_sig_type; > enum ib_sig_type rq_sig_type; > enum ib_qp_type qp_type; > u8 port_num; /* special QP types only */ > }; > > These do get passed the context to match what we did with the > comp_handler member of struct ib_cq. > > Comments? Is there a way to do the following things: 1. Can a consumer other than the QP owner obtain QP async events (or perhaps only certain ones)? Certain events might need to go to more than one place (I think that two is sufficient). Perhaps this is a second event handler which could be set at QP create time or at QP modify time (with a new virtual QP attribute to add or remove it). 2. An optimization would be to be able to obtain async events for all QPs rather than needing to register for each QP as they come and go. Perhaps something like ib_set/clear_global_qp_handler() would set/clear the second event handler in all existing QPs, and if a global handler is set, it would be used at QP create time. Sorry for the delay in responding to this. -- Hal From roland at topspin.com Mon Sep 13 20:22:57 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Sep 2004 20:22:57 -0700 Subject: [openib-general] [PATCH] Update to new process_mad API Message-ID: <52k6ux5yb2.fsf@topspin.com> This updates my branch to the new API for process_mad. I still need to fix up the definition of struct ib_mad in ts_ib_mad_types.h (will be done shortly). - R.
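As a rough illustration of the result bits that the new process_mad returns, here is a minimal standalone sketch. The enum values are copied from the ib_verbs.h hunk of this patch so that the fragment compiles on its own; the helper names mad_result_ok(), mad_result_has_reply() and mad_result_consumed() are hypothetical and are not part of the patch. The reply check mirrors the test that mad_static.c applies to the return value.

```c
#include <stdbool.h>

/* Local copy of the result bits from the ib_verbs.h hunk, for a standalone build. */
enum ib_mad_result {
	IB_MAD_RESULT_FAILURE  = 0,      /* (!SUCCESS is the important flag) */
	IB_MAD_RESULT_SUCCESS  = 1 << 0, /* MAD was successfully processed */
	IB_MAD_RESULT_REPLY    = 1 << 1, /* Reply packet needs to be sent */
	IB_MAD_RESULT_CONSUMED = 1 << 2  /* Packet consumed: stop processing */
};

/* Hypothetical helper: true when the SUCCESS bit is set in the result. */
static bool mad_result_ok(int result)
{
	return (result & IB_MAD_RESULT_SUCCESS) != 0;
}

/*
 * Hypothetical helper: true only when both SUCCESS and REPLY are set,
 * i.e. out_mad holds a response that should be sent.
 */
static bool mad_result_has_reply(int result)
{
	const int want = IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY;

	return (result & want) == want;
}

/* Hypothetical helper: true when the provider consumed the packet. */
static bool mad_result_consumed(int result)
{
	return (result & IB_MAD_RESULT_CONSUMED) != 0;
}
```

Since process_mad now returns a plain int, a caller would pass that value straight to these helpers; a result with REPLY set but SUCCESS clear is treated as a failure, which matches the combined mask comparison in mad_static.c.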
Index: infiniband/include/ib_verbs.h =================================================================== --- infiniband/include/ib_verbs.h (revision 803) +++ infiniband/include/ib_verbs.h (working copy) @@ -636,6 +636,19 @@ u32 rkey; }; +struct ib_mad; + +enum ib_process_mad_flags { + IB_MAD_IGNORE_MKEY = 1 +}; + +enum ib_mad_result { + IB_MAD_RESULT_FAILURE = 0, /* (!SUCCESS is the important flag) */ + IB_MAD_RESULT_SUCCESS = 1 << 0, /* MAD was successfully processed */ + IB_MAD_RESULT_REPLY = 1 << 1, /* Reply packet needs to be sent */ + IB_MAD_RESULT_CONSUMED = 1 << 2 /* Packet consumed: stop processing */ +}; + #define IB_DEVICE_NAME_MAX 64 struct ib_device { @@ -743,7 +756,12 @@ int (*detach_mcast)(struct ib_qp *qp, union ib_gid *gid, u16 lid); - ib_mad_process_func mad_process; + int (*process_mad)(struct ib_device *device, + int process_mad_flags, + u8 port_num, + u16 source_lid, + struct ib_mad *in_mad, + struct ib_mad *out_mad); struct class_device class_dev; struct kobject ports_parent; Index: infiniband/include/ib_mad.h =================================================================== --- infiniband/include/ib_mad.h (revision 0) +++ infiniband/include/ib_mad.h (revision 0) @@ -0,0 +1,293 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * $Id$ + */ + +#if !defined( IB_MAD_H ) +#define IB_MAD_H + +#include + +/* Management classes */ +#define IB_MGMT_CLASS_SUBN_LID_ROUTED 0x01 +#define IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE 0x81 +#define IB_MGMT_CLASS_SUBN_ADM 0x03 +#define IB_MGMT_CLASS_PERF_MGMT 0x04 +#define IB_MGMT_CLASS_BM 0x05 +#define IB_MGMT_CLASS_DEVICE_MGMT 0x06 +#define IB_MGMT_CLASS_CM 0x07 +#define IB_MGMT_CLASS_SNMP 0x08 + +/* Management methods */ +#define IB_MGMT_METHOD_GET 0x01 +#define IB_MGMT_METHOD_SET 0x02 +#define IB_MGMT_METHOD_GET_RESP 0x81 +#define IB_MGMT_METHOD_SEND 0x03 +#define IB_MGMT_METHOD_TRAP 0x05 +#define IB_MGMT_METHOD_REPORT 0x06 +#define IB_MGMT_METHOD_REPORT_RESP 0x86 +#define IB_MGMT_METHOD_TRAP_REPRESS 0x07 + +#define IB_MGMT_METHOD_RESP 0x80 + + +#define IB_MGMT_MAX_METHODS 128 + +#define IB_QP0 0 +#define IB_QP1 cpu_to_be32(1) +#define IB_QP1_QKEY cpu_to_be32(0x80010000) + +struct ib_grh { + u32 version_tclass_flow; + u16 paylen; + u8 next_hdr; + u8 hop_limit; + union ib_gid sgid; + union ib_gid dgid; +} __attribute__ ((packed)); + +struct ib_mad_hdr { + u8 base_version; + u8 mgmt_class; + u8 class_version; + u8 method; + u16 status; + u16 class_specific; + u64 tid; + u16 attr_id; + u16 resv; + u32 attr_mod; +}; + +struct ib_rmpp_hdr { + u8 rmpp_version; + u8 rmpp_type; + u8 rmpp_rtime_flags; + u8 rmpp_status; + u32 seg_num; + u32 paylen_newwin; +}; + 
+struct ib_mad { + struct ib_mad_hdr mad_hdr; + u8 data[232]; +}; + +struct ib_rmpp_mad { + struct ib_mad_hdr mad_hdr; + struct ib_rmpp_hdr rmpp_hdr; + u8 data[220]; +}; + +struct ib_mad_agent; +struct ib_mad_send_wc; +struct ib_mad_recv_wc; + +/** + * ib_mad_send_handler - callback handler for a sent MAD. + * @mad_agent - MAD agent that sent the MAD. + * @mad_send_wc - Send work completion information on the sent MAD. + */ +typedef void (*ib_mad_send_handler)(struct ib_mad_agent *mad_agent, + struct ib_mad_send_wc *mad_send_wc); + +/** + * ib_mad_recv_handler - callback handler for a received MAD. + * @mad_agent - MAD agent requesting the received MAD. + * @mad_recv_wc - Received work completion information on the received MAD. + * + * MADs received in response to a send request operation will be handed to + * the user after the send operation completes. All data buffers given + * to the user through this routine are owned by the receiving client. + */ +typedef void (*ib_mad_recv_handler)(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc); + +/** + * ib_mad_agent - Used to track MAD registration with the access layer. + * @device - Reference to device registration is on. + * @qp - Reference to QP used for sending and receiving MADs. + * @recv_handler - Callback handler for a received MAD. + * @send_handler - Callback handler for a sent MAD. + * @context - User-specified context associated with this registration. + * @hi_tid - Access layer assigned transaction ID for this client. + * Unsolicited MADs sent by this client will have the upper 32-bits + * of their TID set to this value. + */ +struct ib_mad_agent { + struct ib_device *device; + struct ib_qp *qp; + ib_mad_recv_handler recv_handler; + ib_mad_send_handler send_handler; + void *context; + u32 hi_tid; +}; + +/** + * ib_mad_send_wc - MAD send completion information. + * @wr_id - Work request identifier associated with the send MAD request. + * @status - Completion status. 
+ * @vendor_err - Optional vendor error information returned with a failed + * request. + */ +struct ib_mad_send_wc { + u64 wr_id; + enum ib_wc_status status; + u32 vendor_err; +}; + +/** + * ib_mad_recv_buf - received MAD buffer information. + * @list - Reference to next data buffer for a received RMPP MAD. + * @grh - References a data buffer containing the global route header. + * The data referenced by this buffer is only valid if the GRH is + * valid. + * @mad - References the start of the received MAD. + */ +struct ib_mad_recv_buf { + struct list_head list; + struct ib_grh *grh; + struct ib_mad *mad; +}; + +/** + * ib_mad_recv_wc - received MAD information. + * @wc - Completion information for the received data. + * @recv_buf - Specifies the location of the received data buffer(s). + * @mad_len - The length of the received MAD, without duplicated headers. + * + * For a received response, the wr_id field of the wc is set to the wr_id + * for the corresponding send request. + */ +struct ib_mad_recv_wc { + struct ib_wc *wc; + struct ib_mad_recv_buf *recv_buf; + int mad_len; +}; + +/** + * ib_mad_reg_req - MAD registration request + * @mgmt_class - Indicates which management class of MADs should be received + * by the caller. This field is only required if the user wishes to + * receive unsolicited MADs, otherwise it should be 0. + * @mgmt_class_version - Indicates which version of MADs for the given + * management class to receive. + * @method_mask - The caller will receive unsolicited MADs for any method + * where @method_mask = 1. + */ +struct ib_mad_reg_req { + u8 mgmt_class; + u8 mgmt_class_version; + DECLARE_BITMAP(method_mask, IB_MGMT_MAX_METHODS); +}; + +/** + * ib_register_mad_agent - Register to send/receive MADs. + * @device - The device to register with. + * @port - The port on the specified device to use. + * @qp_type - Specifies which QP to access. Must be either + * IB_QPT_SMI or IB_QPT_GSI.
+ * @mad_reg_req - Specifies which unsolicited MADs should be received + * by the caller. This parameter may be NULL if the caller only + * wishes to receive solicited responses. + * @rmpp_version - If set, indicates that the client will send + * and receive MADs that contain the RMPP header for the given version. + * If set to 0, indicates that RMPP is not used by this client. + * @send_handler - The completion callback routine invoked after a send + * request has completed. + * @recv_handler - The completion callback routine invoked for a received + * MAD. + * @context - User specified context associated with the registration. + */ +struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device, + u8 port, + enum ib_qp_type qp_type, + struct ib_mad_reg_req *mad_reg_req, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context); + +/** + * ib_unregister_mad_agent - Unregisters a client from using MAD services. + * @mad_agent - Corresponding MAD registration request to deregister. + * + * After invoking this routine, MAD services are no longer usable by the + * client on the associated QP. + */ +int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent); + +/** + * ib_post_send_mad - Posts MAD(s) to the send queue of the QP associated + * with the registered client. + * @mad_agent - Specifies the associated registration to post the send to. + * @send_wr - Specifies the information needed to send the MAD(s). + * @bad_send_wr - Specifies the MAD on which an error was encountered. + */ +int ib_post_send_mad(struct ib_mad_agent *mad_agent, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr); + +/** + * ib_redirect_mad_qp - Registers a QP for MAD services. + * @qp - Reference to a QP that requires MAD services. + * @rmpp_version - If set, indicates that the client will send + * and receive MADs that contain the RMPP header for the given version. 
+ * If set to 0, indicates that RMPP is not used by this client. + * @send_handler - The completion callback routine invoked after a send + * request has completed. + * @recv_handler - The completion callback routine invoked for a received + * MAD. + * @context - User specified context associated with the registration. + * + * Use of this call allows clients to use MAD services, such as RMPP, + * on user-owned QPs. After calling this routine, users may send + * MADs on the specified QP by calling ib_post_send_mad. + */ +struct ib_mad_agent *ib_redirect_mad_qp(struct ib_qp *qp, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context); + +/** + * ib_process_mad_wc - Processes a work completion associated with a + * MAD sent or received on a redirected QP. + * @mad_agent - Specifies the registered MAD service using the redirected QP. + * @wc - References a work completion associated with a sent or received + * MAD segment. + * + * This routine is used to complete or continue processing on a MAD request. + * If the work completion is associated with a send operation, calling + * this routine is required to continue an RMPP transfer or to wait for a + * corresponding response, if it is a request. If the work completion is + * associated with a receive operation, calling this routine is required to + * process an inbound or outbound RMPP transfer, or to match a response MAD + * with its corresponding request.
+ */ +int ib_process_mad_wc(struct ib_mad_agent *mad_agent, + struct ib_wc *wc); + +#endif /* IB_MAD_H */ Index: infiniband/include/ts_ib_core_types.h =================================================================== --- infiniband/include/ts_ib_core_types.h (revision 803) +++ infiniband/include/ts_ib_core_types.h (working copy) @@ -90,20 +90,6 @@ u8 lmc; }; -struct ib_mad; - -enum ib_mad_result { - IB_MAD_RESULT_FAILURE = 0, // (!SUCCESS is the important flag) - IB_MAD_RESULT_SUCCESS = 1 << 0, // MAD was successfully processed - IB_MAD_RESULT_REPLY = 1 << 1, // Reply packet needs to be sent - IB_MAD_RESULT_CONSUMED = 1 << 2 // Packet consumed: stop processing -}; - -typedef enum ib_mad_result (*ib_mad_process_func)(struct ib_device *device, - int ignore_mkey, - struct ib_mad *in_mad, - struct ib_mad *response_mad); - /* structures */ enum { Index: infiniband/core/mad_thread.c =================================================================== --- infiniband/core/mad_thread.c (revision 803) +++ infiniband/core/mad_thread.c (working copy) @@ -58,7 +58,7 @@ /* If this is an outgoing 0-hop SMP, we have a mad_process method and the provider hasn't told use to use QP0 for this, just process the MAD directly. 
*/ - if (device->mad_process && + if (device->process_mad && !(device->flags & IB_MAD_LOCAL_USE_QP) && !mad->route.directed.hop_count) { void *response_buf = kmalloc(sizeof (struct ib_mad) + @@ -85,7 +85,8 @@ work->type = IB_MAD_WORK_SEND_DONE; work->index = -1; - result = device->mad_process(device, 0, mad, response); + result = device->process_mad(device, 0, mad->port, mad->slid, + mad, response); *reuse = 1; Index: infiniband/core/useraccess_ioctl.c =================================================================== --- infiniband/core/useraccess_ioctl.c (revision 803) +++ infiniband/core/useraccess_ioctl.c (working copy) @@ -26,14 +26,6 @@ #include #include "ts_ib_mad.h" -/* - We include ts_ib_provider_types.h so that we can access the - mad_process member of a device struct. This is sort of an ugly - violation of our layering (since the useraccess module should - probably only use devices through device handles) but seems like the - least bad solution. -*/ - #include "ts_kernel_trace.h" #include "ts_kernel_services.h" @@ -206,9 +198,8 @@ /* Here's the ugly layering violation mentioned above: */ struct ib_device *device = priv->device->ib_device; - if (!device->mad_process) { + if (!device->process_mad) return -ENOSYS; - } mad = kmalloc(2 * sizeof *mad, GFP_KERNEL); if (!mad) { @@ -221,7 +212,7 @@ } mad->device = priv->device->ib_device; - result = device->mad_process(device, 1, mad, mad + 1); + result = device->process_mad(device, 1, mad->port, mad->slid, mad, mad + 1); if (copy_to_user ((void *)arg + TS_USER_MAD_SIZE, &result, sizeof result)) { Index: infiniband/core/mad_filter.c =================================================================== --- infiniband/core/mad_filter.c (revision 803) +++ infiniband/core/mad_filter.c (working copy) @@ -284,7 +284,8 @@ !ib_mad_validate_dr_smp(mad, device)) ret = IB_MAD_RESULT_SUCCESS; // As if device ignored packet. 
else - ret = device->mad_process(device, 0, mad, response); + ret = device->process_mad(device, 0, mad->port, mad->slid, + mad, response); if (!(ret & IB_MAD_RESULT_SUCCESS)) TS_REPORT_WARN(MOD_KERNEL_IB, Index: infiniband/core/mad_static.c =================================================================== --- infiniband/core/mad_static.c (revision 803) +++ infiniband/core/mad_static.c (working copy) @@ -104,13 +104,11 @@ { struct ib_mad *mad_in, *mad_out; - if (!device->mad_process) { + if (!device->process_mad) return; - } - if (!lid_base) { + if (!lid_base) ib_mad_static_compute_base(); - } mad_in = kmem_cache_alloc(mad_cache, GFP_KERNEL); if (!mad_in) { @@ -133,14 +131,12 @@ mad_in->class_version = 1; mad_in->r_method = IB_MGMT_METHOD_GET; mad_in->attribute_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); - mad_in->port = port; - mad_in->slid = 0xffff; /* Request port info from the device */ - if ((device->mad_process(device, 1, mad_in, mad_out) & + if ((device->process_mad(device, IB_MAD_IGNORE_MKEY, port, 0xffff, mad_in, mad_out) & (IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY)) != (IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY)) { - TS_REPORT_FATAL(MOD_KERNEL_IB, "%s: mad_process failed for port %d", + TS_REPORT_FATAL(MOD_KERNEL_IB, "%s: process_mad failed for port %d", device->name, port); return; } @@ -149,13 +145,11 @@ ib_smp_port_info_lid_set(IB_MAD_TO_SMP_DATA(mad_out), lid_base); ++lid_base; mad_out->r_method = IB_MGMT_METHOD_SET; - mad_out->port = port; - mad_out->slid = 0xffff; /* Update the port info on the device */ - if (!(device->mad_process(device, 1, mad_out, mad_in) & + if (!(device->process_mad(device, IB_MAD_IGNORE_MKEY, port, 0xffff, mad_out, mad_in) & IB_MAD_RESULT_SUCCESS)) { - TS_REPORT_FATAL(MOD_KERNEL_IB, "%s: mad_process failed for port %d", + TS_REPORT_FATAL(MOD_KERNEL_IB, "%s: process_mad failed for port %d", device->name, port); return; } Index: infiniband/hw/mthca/mthca_dev.h 
=================================================================== --- infiniband/hw/mthca/mthca_dev.h (revision 803) +++ infiniband/hw/mthca/mthca_dev.h (working copy) @@ -346,10 +346,12 @@ int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); -enum ib_mad_result mthca_process_mad(struct ib_device *ibdev, - int ignore_mkey, - struct ib_mad *in_mad, - struct ib_mad *response_mad); +int mthca_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + u16 slid, + struct ib_mad *in_mad, + struct ib_mad *out_mad); static inline struct mthca_dev *to_mdev(struct ib_device *ibdev) { Index: infiniband/hw/mthca/mthca_provider.c =================================================================== --- infiniband/hw/mthca/mthca_provider.c (revision 803) +++ infiniband/hw/mthca/mthca_provider.c (working copy) @@ -582,7 +582,7 @@ dev->ib_dev.dereg_mr = mthca_dereg_mr; dev->ib_dev.attach_mcast = mthca_multicast_attach; dev->ib_dev.detach_mcast = mthca_multicast_detach; - dev->ib_dev.mad_process = mthca_process_mad; + dev->ib_dev.process_mad = mthca_process_mad; ret = ib_register_device(&dev->ib_dev); if (ret) Index: infiniband/hw/mthca/mthca_mad.c =================================================================== --- infiniband/hw/mthca/mthca_mad.c (revision 803) +++ infiniband/hw/mthca/mthca_mad.c (working copy) @@ -44,7 +44,8 @@ * synthesize LID change and P_Key change events. 
*/ static void smp_snoop(struct ib_device *ibdev, - struct ib_mad *mad) + struct ib_mad *mad, + u8 port_num) { struct ib_event event; @@ -55,35 +56,36 @@ if (mad->attribute_id == cpu_to_be16(IB_SM_PORT_INFO)) { event.device = ibdev; event.event = IB_EVENT_LID_CHANGE; - event.element.port_num = mad->port; + event.element.port_num = port_num; ib_dispatch_event(&event); } if (mad->attribute_id == cpu_to_be16(IB_SM_PKEY_TABLE)) { event.device = ibdev; event.event = IB_EVENT_PKEY_CHANGE; - event.element.port_num = mad->port; + event.element.port_num = port_num; ib_dispatch_event(&event); } } } -enum ib_mad_result mthca_process_mad(struct ib_device *ibdev, - int ignore_mkey, - struct ib_mad *in_mad, - struct ib_mad *response_mad) +int mthca_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + u16 slid, + struct ib_mad *in_mad, + struct ib_mad *out_mad) { int err; u8 status; /* Forward locally generated traps to the SM */ - if (in_mad->dqpn == 0 && - in_mad->mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED && + if (in_mad->mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED && in_mad->r_method == IB_MGMT_METHOD_TRAP && - in_mad->slid == 0) { + slid == 0) { struct ib_sm_path sm_path; - ib_cached_sm_path_get(ibdev, in_mad->port, &sm_path); + ib_cached_sm_path_get(ibdev, port_num, &sm_path); if (sm_path.sm_lid) { in_mad->sqpn = 0; in_mad->dlid = sm_path.sm_lid; @@ -94,13 +96,17 @@ return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; } - /* Only handle SM gets, sets and trap represses for QP0 */ - if (in_mad->dqpn == 0) { - if ((in_mad->mgmt_class != IB_MGMT_CLASS_SUBN_LID_ROUTED && - in_mad->mgmt_class != IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) || - (in_mad->r_method != IB_MGMT_METHOD_GET && - in_mad->r_method != IB_MGMT_METHOD_SET && - in_mad->r_method != IB_MGMT_METHOD_TRAP_REPRESS)) + /* + * Only handle SM gets, sets and trap represses for SM class + * + * Only handle PMA and Mellanox vendor-specific class gets and + * sets for other classes. 
+ */ + if (in_mad->mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + in_mad->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + if (in_mad->r_method != IB_MGMT_METHOD_GET && + in_mad->r_method != IB_MGMT_METHOD_SET && + in_mad->r_method != IB_MGMT_METHOD_TRAP_REPRESS) return IB_MAD_RESULT_SUCCESS; /* @@ -110,22 +116,18 @@ if (be16_to_cpu(in_mad->attribute_id) == IB_SM_SM_INFO || be16_to_cpu(in_mad->attribute_id) >= IB_SM_VENDOR_START) return IB_MAD_RESULT_SUCCESS; - } - - /* - * Only handle PMA and Mellanox vendor-specific class gets and - * sets on QP1 - */ - if (in_mad->dqpn == 1 && - ((in_mad->mgmt_class != IB_MGMT_CLASS_PERF && - in_mad->mgmt_class != MTHCA_VENDOR_CLASS1 && - in_mad->mgmt_class != MTHCA_VENDOR_CLASS2) || - (in_mad->r_method != IB_MGMT_METHOD_GET && - in_mad->r_method != IB_MGMT_METHOD_SET))) + } else if (in_mad->mgmt_class == IB_MGMT_CLASS_PERF || + in_mad->mgmt_class == MTHCA_VENDOR_CLASS1 || + in_mad->mgmt_class == MTHCA_VENDOR_CLASS2) { + if (in_mad->r_method != IB_MGMT_METHOD_GET && + in_mad->r_method != IB_MGMT_METHOD_SET) + return IB_MAD_RESULT_SUCCESS; + } else return IB_MAD_RESULT_SUCCESS; - err = mthca_MAD_IFC(to_mdev(ibdev), ignore_mkey, - in_mad->port, in_mad, response_mad, + err = mthca_MAD_IFC(to_mdev(ibdev), + !!(mad_flags & IB_MAD_IGNORE_MKEY), + port_num, in_mad, out_mad, &status); if (err) { mthca_err(to_mdev(ibdev), "MAD_IFC failed\n"); @@ -139,11 +141,11 @@ return IB_MAD_RESULT_FAILURE; } - smp_snoop(ibdev, in_mad); + smp_snoop(ibdev, in_mad, port_num); /* set return bit in status of directed route responses */ if (in_mad->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) - response_mad->status |= cpu_to_be16(1 << 15); + out_mad->status |= cpu_to_be16(1 << 15); if (in_mad->r_method == IB_MGMT_METHOD_TRAP_REPRESS) /* no response for trap repress */ From roland at topspin.com Mon Sep 13 20:26:59 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Sep 2004 20:26:59 -0700 Subject: [openib-general] [PATCH] Have 
ib_set_client_data return void Message-ID: <52fz5l5y4c.fsf@topspin.com> Allocate client context sooner so that ib_set_client_data() can return void. Index: infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- infiniband/ulp/ipoib/ipoib_main.c (revision 812) +++ infiniband/ulp/ipoib/ipoib_main.c (revision 813) @@ -733,6 +733,7 @@ } static struct ib_client ipoib_client = { + .name = "ipoib", .add = ipoib_add_one, .remove = ipoib_remove_one }; Index: infiniband/include/ib_verbs.h =================================================================== --- infiniband/include/ib_verbs.h (revision 812) +++ infiniband/include/ib_verbs.h (revision 813) @@ -777,6 +777,7 @@ }; struct ib_client { + char *name; void (*add) (struct ib_device *); void (*remove)(struct ib_device *); @@ -793,7 +794,7 @@ void ib_unregister_client(struct ib_client *client); void *ib_get_client_data(struct ib_device *device, struct ib_client *client); -int ib_set_client_data(struct ib_device *device, struct ib_client *client, +void ib_set_client_data(struct ib_device *device, struct ib_client *client, void *data); int ib_register_event_handler (struct ib_event_handler *event_handler); Index: infiniband/core/ib_device.c =================================================================== --- infiniband/core/ib_device.c (revision 812) +++ infiniband/core/ib_device.c (revision 813) @@ -161,6 +161,28 @@ } EXPORT_SYMBOL(ib_dealloc_device); +static int add_client_context(struct ib_device *device, struct ib_client *client) +{ + struct ib_client_data *context; + unsigned long flags; + + context = kmalloc(sizeof *context, GFP_KERNEL); + if (!context) { + printk(KERN_WARNING "Couldn't allocate client context for %s/%s\n", + device->name, client->name); + return -ENOMEM; + } + + context->client = client; + context->data = NULL; + + spin_lock_irqsave(&device->client_data_lock, flags); + list_add(&context->list, &device->client_data_list); + 
spin_unlock_irqrestore(&device->client_data_lock, flags); + + return 0; +} + int ib_register_device(struct ib_device *device) { struct ib_device_private *priv; @@ -234,17 +256,10 @@ goto out_free_port; } - ret = ib_proc_setup(device, device->node_type == IB_NODE_SWITCH); - if (ret) { - printk(KERN_WARNING "Couldn't create /proc dir for %s\n", - device->name); - goto out_free_cache; - } - if (ib_device_register_sysfs(device)) { printk(KERN_WARNING "Couldn't register device %s with driver model\n", device->name); - goto out_proc; + goto out_free_cache; } list_add_tail(&device->core_list, &device_list); @@ -255,16 +270,13 @@ struct ib_client *client; list_for_each_entry(client, &client_list, list) - if (client->add) + if (client->add && !add_client_context(device, client)) client->add(device); } up(&device_sem); return 0; - out_proc: - ib_proc_cleanup(device); - out_free_cache: ib_cache_cleanup(device); @@ -302,7 +314,6 @@ kfree(context); spin_unlock_irqrestore(&device->client_data_lock, flags); - ib_proc_cleanup(device); ib_cache_cleanup(device); kfree(priv->port_data); @@ -355,7 +366,7 @@ list_add_tail(&client->list, &client_list); list_for_each_entry(device, &device_list, core_list) - if (client->add) + if (client->add && !add_client_context(device, client)) client->add(device); up(&device_sem); @@ -408,34 +419,23 @@ } EXPORT_SYMBOL(ib_get_client_data); -int ib_set_client_data(struct ib_device *device, struct ib_client *client, - void *data) +void ib_set_client_data(struct ib_device *device, struct ib_client *client, + void *data) { struct ib_client_data *context; - int ret = 0; unsigned long flags; spin_lock_irqsave(&device->client_data_lock, flags); list_for_each_entry(context, &device->client_data_list, list) if (context->client == client) { context->data = data; - spin_unlock_irqrestore(&device->client_data_lock, flags); - return 0; + break; } spin_unlock_irqrestore(&device->client_data_lock, flags); - context = kmalloc(sizeof *context, GFP_KERNEL); - if 
(!context) - return -ENOMEM; - context->client = client; - context->data = data; - - spin_lock_irqsave(&device->client_data_lock, flags); - list_add(&context->list, &device->client_data_list); - spin_unlock_irqrestore(&device->client_data_lock, flags); - - return 0; + printk(KERN_WARNING "No client context found for %s/%s\n", + device->name, client->name); } EXPORT_SYMBOL(ib_set_client_data); Index: infiniband/core/mad_main.c =================================================================== --- infiniband/core/mad_main.c (revision 812) +++ infiniband/core/mad_main.c (revision 813) @@ -328,6 +328,7 @@ } static struct ib_client mad_client = { + .name = "mad", .add = ib_mad_add_one, .remove = ib_mad_remove_one }; From roland at topspin.com Mon Sep 13 20:28:59 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Sep 2004 20:28:59 -0700 Subject: [openib-general] [PATCH] move PMA counters to sysfs Message-ID: <52brg95y10.fsf@topspin.com> This puts the PMA counters under /sys/class/infiniband/DEV/ports/N/counters/ and kills off the last of /proc/infiniband/core. - R. Index: infiniband/core/Makefile =================================================================== --- infiniband/core/Makefile (revision 803) +++ infiniband/core/Makefile (working copy) @@ -39,8 +39,7 @@ ib_device.o \ core_main.o \ core_fmr_pool.o \ - core_cache.o \ - core_proc.o + core_cache.o ib_mad-objs := \ mad_main.o \ Index: infiniband/core/core_proc.c =================================================================== --- infiniband/core/core_proc.c (revision 803) +++ infiniband/core/core_proc.c (working copy) @@ -1,329 +0,0 @@ -/* - This software is available to you under a choice of one of two - licenses. You may choose to be licensed under the terms of the GNU - General Public License (GPL) Version 2, available at - , or the OpenIB.org BSD - license, available in the LICENSE.TXT file accompanying this - software. These details are also available at - . 
- - THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - SOFTWARE. - - Copyright (c) 2004 Topspin Communications. All rights reserved. - - $Id$ -*/ - -/* - We want to make a directory tree like: - - /proc/infiniband/core/ - hca1/ - info - port1/ - counters - info - gid_table - pkey_table - port2... - hca2... -*/ - -#include "core_priv.h" - -#include "pm_access.h" - -#include "ts_kernel_trace.h" -#include "ts_kernel_services.h" - -#include -#include - -#include -#include -#include - -struct ib_core_proc { - int index; - char dev_dir_name[16]; - struct proc_dir_entry *dev_dir; - struct ib_port_proc *port; -}; - -struct ib_port_proc { - struct ib_device *device; - int port_num; - struct proc_dir_entry *port_dir; - struct proc_dir_entry *counters; -}; - -static int index = 1; -static struct proc_dir_entry *core_dir; - -static void *ib_counters_seq_start(struct seq_file *file, - loff_t *pos) -{ - if (*pos) - return NULL; - else - return (void *) 1UL; -} - -static void *ib_counters_seq_next(struct seq_file *file, - void *iter_ptr, - loff_t *pos) -{ - (*pos)++; - return NULL; -} - -static void ib_counters_seq_stop(struct seq_file *file, - void *iter_ptr) -{ - /* nothing for now */ -} - -static int ib_counters_seq_show(struct seq_file *file, - void *iter_ptr) -{ - struct ib_port_proc *proc_port = file->private; - struct ib_mad *in_mad = NULL; - struct ib_mad *out_mad = NULL; - - struct ib_pm_port_counters *counters = NULL; - - if (!proc_port->device->mad_process) { - seq_puts(file, "\n"); - return 0; - } - - in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); - out_mad = 
kmalloc(sizeof *in_mad, GFP_KERNEL); - if (!in_mad || !out_mad) { - seq_puts(file, "\n"); - goto out; - } - - counters = kmalloc(sizeof *counters, GFP_KERNEL); - if (!counters) { - seq_puts(file, "\n"); - goto out; - } - - memset(in_mad, 0, sizeof *in_mad); - in_mad->format_version = 1; - in_mad->mgmt_class = IB_MGMT_CLASS_PERF; - in_mad->class_version = 1; - in_mad->r_method = IB_MGMT_METHOD_GET; - in_mad->attribute_id = cpu_to_be16(IB_PM_ATTRIB_PORT_COUNTERS); - in_mad->dqpn = 1; - in_mad->port = proc_port->port_num; - - memset(counters, 0, sizeof *counters); - counters->port_select = proc_port->port_num; - ib_pm_port_counters_pack(counters, IB_MAD_TO_PM_DATA(in_mad)); - - if ((proc_port->device->mad_process(proc_port->device, - 1, - in_mad, - out_mad) & - (IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY)) != - (IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY)) { - seq_puts(file, "\n"); - goto out; - } - - ib_pm_port_counters_unpack(IB_MAD_TO_PM_DATA(out_mad), counters); - - seq_printf(file, "Symbol error counter: %10u\n", - counters->symbol_error_counter); - seq_printf(file, "Link error recovery counter: %10u\n", - counters->link_error_recovery_counter); - seq_printf(file, "Link downed counter: %10u\n", - counters->link_downed_counter); - seq_printf(file, "Port receive errors: %10u\n", - counters->port_rcv_errors); - seq_printf(file, "Port receive remote physical errors: %10u\n", - counters->port_rcv_remote_physical_errors); - seq_printf(file, "Port receive switch relay errors: %10u\n", - counters->port_rcv_switch_relay_errors); - seq_printf(file, "Port transmit discards: %10u\n", - counters->port_xmit_discards); - seq_printf(file, "Port transmit constrain errors: %10u\n", - counters->port_xmit_constrain_errors); - seq_printf(file, "Port receive constrain errors: %10u\n", - counters->port_rcv_constrain_errors); - seq_printf(file, "Local link integrity errors: %10u\n", - counters->local_link_integrity_errors); - seq_printf(file, "Excessive buffer overrun errors: %10u\n", 
- counters->excessive_buffer_overrun_errors); - seq_printf(file, "VL15 dropped: %10u\n", - counters->vl15_dropped); - seq_printf(file, "Port transmit data: %10u\n", - counters->port_xmit_data); - seq_printf(file, "Port receive data: %10u\n", - counters->port_rcv_data); - seq_printf(file, "Port transmit packets: %10u\n", - counters->port_xmit_pkts); - seq_printf(file, "Port receive packets: %10u\n", - counters->port_rcv_pkts); - - out: - kfree(in_mad); - kfree(out_mad); - kfree(counters); - return 0; -} - -static struct seq_operations counters_seq_ops = { - .start = ib_counters_seq_start, - .next = ib_counters_seq_next, - .stop = ib_counters_seq_stop, - .show = ib_counters_seq_show -}; - -static int ib_counters_open(struct inode *inode, - struct file *file) -{ - int ret; - - ret = seq_open(file, &counters_seq_ops); - if (ret) { - return ret; - } - ((struct seq_file *) file->private_data)->private = PDE(inode)->data; - - return 0; -} - -static int ib_proc_file_release(struct inode *inode, - struct file *file) -{ - return seq_release(inode, file); -} - -static struct file_operations counters_ops = { - .owner = THIS_MODULE, - .open = ib_counters_open, - .read = seq_read, - .llseek = seq_lseek, - .release = ib_proc_file_release -}; - -int ib_proc_setup(struct ib_device *device, - int is_switch) -{ - struct ib_device_private *priv = device->core; - struct ib_core_proc *core_proc; - char port_name[] = "portNN"; - int p; - - core_proc = kmalloc(sizeof *core_proc, GFP_KERNEL); - if (!core_proc) { - return -ENOMEM; - } - - core_proc->index = index; - - if (is_switch) { - sprintf(core_proc->dev_dir_name, "switch%d", index); - } else { - sprintf(core_proc->dev_dir_name, "ca%d", index); - } - core_proc->dev_dir = proc_mkdir(core_proc->dev_dir_name, core_dir); - if (!core_proc) { - goto out_free; - } - - core_proc->port = kmalloc((priv->end_port + 1) * sizeof (struct ib_port_proc), - GFP_KERNEL); - if (!core_proc->port) - goto out_topdir; - - for (p = priv->start_port; p <= 
priv->end_port; ++p) { - core_proc->port[p].device = device; - core_proc->port[p].port_num = p; - core_proc->port[p].port_dir = NULL; - core_proc->port[p].counters = NULL; - } - - for (p = priv->start_port; p <= priv->end_port; ++p) { - snprintf(port_name, sizeof port_name, "port%d", p); - core_proc->port[p].port_dir = proc_mkdir(port_name, core_proc->dev_dir); - if (!core_proc->port[p].port_dir) - goto out_port; - - core_proc->port[p].counters = create_proc_entry("counters", S_IRUGO, - core_proc->port[p].port_dir); - if (!core_proc->port[p].counters) - goto out_port; - - core_proc->port[p].counters->proc_fops = &counters_ops; - core_proc->port[p].counters->data = &core_proc->port[p]; - } - - priv->proc = core_proc; - ++index; - return 0; - - out_port: - for (p = priv->start_port; p <= priv->end_port; ++p) { - if (core_proc->port[p].counters) - remove_proc_entry("counters", core_proc->port[p].port_dir); - - if (core_proc->port[p].port_dir) { - snprintf(port_name, sizeof port_name, "port%d", p); - remove_proc_entry(port_name, core_proc->dev_dir); - } - } - - out_topdir: - remove_proc_entry(core_proc->dev_dir_name, core_dir); - - out_free: - kfree(core_proc); - return -ENOMEM; -} - -void ib_proc_cleanup(struct ib_device *device) -{ - struct ib_device_private *priv = device->core; - struct ib_core_proc *core_proc = priv->proc; - char port_name[] = "portNN"; - int p; - - for (p = priv->start_port; p <= priv->end_port; ++p) { - remove_proc_entry("counters", core_proc->port[p].port_dir); - snprintf(port_name, sizeof port_name, "port%d", p); - remove_proc_entry(port_name, core_proc->dev_dir); - } - - remove_proc_entry(core_proc->dev_dir_name, core_dir); - - kfree(priv->proc); -} - -int ib_create_proc_dir(void) -{ - core_dir = proc_mkdir("core", tsKernelProcDirGet()); - return core_dir ? 
0 : -ENOMEM; -} - -void ib_remove_proc_dir(void) -{ - remove_proc_entry("core", tsKernelProcDirGet()); -} - -/* - Local Variables: - c-file-style: "linux" - indent-tabs-mode: t - End: -*/ Index: infiniband/core/core_main.c =================================================================== --- infiniband/core/core_main.c (revision 803) +++ infiniband/core/core_main.c (working copy) @@ -35,24 +35,14 @@ int ret; ret = ib_sysfs_setup(); - if (ret) { + if (ret) printk(KERN_WARNING "Couldn't create InfiniBand device class\n"); - return ret; - } - ret = ib_create_proc_dir(); - if (ret) { - ib_sysfs_cleanup(); - printk(KERN_WARNING "Couldn't create IB core proc directory\n"); - return ret; - } - - return 0; + return ret; } static void __exit ib_core_cleanup(void) { - ib_remove_proc_dir(); ib_sysfs_cleanup(); } Index: infiniband/core/core_priv.h =================================================================== --- infiniband/core/core_priv.h (revision 803) +++ infiniband/core/core_priv.h (working copy) @@ -45,8 +45,6 @@ int end_port; tTS_IB_GUID node_guid; struct ib_port_data *port_data; - - struct ib_core_proc *proc; }; struct ib_port_data { @@ -70,10 +68,6 @@ int ib_cache_setup(struct ib_device *device); void ib_cache_cleanup(struct ib_device *device); -int ib_proc_setup(struct ib_device *device, int is_switch); -void ib_proc_cleanup(struct ib_device *device); -int ib_create_proc_dir(void); -void ib_remove_proc_dir(void); void ib_completion_thread(struct list_head *entry, void *device_ptr); void ib_async_thread(struct list_head *entry, void *device_ptr); Index: infiniband/core/ib_sysfs.c =================================================================== --- infiniband/core/ib_sysfs.c (revision 803) +++ infiniband/core/ib_sysfs.c (working copy) @@ -23,6 +23,8 @@ #include "core_priv.h" +#include + struct ib_port { struct kobject kobj; struct ib_device *ibdev; @@ -212,6 +214,119 @@ return sprintf(buf, "0x%04x\n", pkey); } +#define PORT_PMA_ATTR(_name, _counter, _width, 
_offset) \ +struct port_table_attribute port_pma_attr_##_name = { \ + .attr = __ATTR(_name, S_IRUGO, show_pma_counter, NULL), \ + .index = (_offset) | ((_width) << 16) | ((_counter) << 24) \ +} + +static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr, + char *buf) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + int offset = tab_attr->index & 0xffff; + int width = (tab_attr->index >> 16) & 0xff; + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + ssize_t ret; + + if (!p->ibdev->process_mad) + return sprintf(buf, "N/A (no PMA)\n"); + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + if (!in_mad || !out_mad) { + ret = -ENOMEM; + goto out; + } + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_PERF_MGMT; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(0x12); /* PortCounters */ + + in_mad->data[41] = p->port_num; /* PortSelect field */ + + if ((p->ibdev->process_mad(p->ibdev, IB_MAD_IGNORE_MKEY, p->port_num, 0xffff, + in_mad, out_mad) & + (IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY)) != + (IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY)) { + ret = -EINVAL; + goto out; + } + + switch (width) { + case 4: + ret = sprintf(buf, "%d\n", (out_mad->data[40 + offset / 8] >> + (offset % 4)) & 0xf); + break; + case 8: + ret = sprintf(buf, "%d\n", out_mad->data[40 + offset / 8]); + break; + case 16: + ret = sprintf(buf, "%d\n", + be16_to_cpup((u16 *)(out_mad->data + 40 + offset / 8))); + break; + case 32: + ret = sprintf(buf, "%d\n", + be32_to_cpup((u32 *)(out_mad->data + 40 + offset / 8))); + break; + default: + ret = 0; + } + +out: + kfree(in_mad); + kfree(out_mad); + + return ret; +} + +static PORT_PMA_ATTR(symbol_error, 0, 16, 32); +static PORT_PMA_ATTR(link_error_recovery, 1, 8, 48); +static 
PORT_PMA_ATTR(link_downed, 2, 8, 56); +static PORT_PMA_ATTR(port_rcv_errors, 3, 16, 64); +static PORT_PMA_ATTR(port_rcv_remote_physical_errors, 4, 16, 80); +static PORT_PMA_ATTR(port_rcv_switch_relay_errors, 5, 16, 96); +static PORT_PMA_ATTR(port_xmit_discards, 6, 16, 112); +static PORT_PMA_ATTR(port_xmit_constraint_errors, 7, 8, 128); +static PORT_PMA_ATTR(port_rcv_constraint_errors, 8, 8, 136); +static PORT_PMA_ATTR(local_link_integrity_errors, 9, 4, 152); +static PORT_PMA_ATTR(excessive_buffer_overrun_errors, 10, 4, 156); +static PORT_PMA_ATTR(VL15_dropped, 11, 16, 176); +static PORT_PMA_ATTR(port_xmit_data, 12, 32, 192); +static PORT_PMA_ATTR(port_rcv_data, 13, 32, 224); +static PORT_PMA_ATTR(port_xmit_packets, 14, 32, 256); +static PORT_PMA_ATTR(port_rcv_packets, 15, 32, 288); + +static struct attribute *pma_attrs[] = { + &port_pma_attr_symbol_error.attr.attr, + &port_pma_attr_link_error_recovery.attr.attr, + &port_pma_attr_link_downed.attr.attr, + &port_pma_attr_port_rcv_errors.attr.attr, + &port_pma_attr_port_rcv_remote_physical_errors.attr.attr, + &port_pma_attr_port_rcv_switch_relay_errors.attr.attr, + &port_pma_attr_port_xmit_discards.attr.attr, + &port_pma_attr_port_xmit_constraint_errors.attr.attr, + &port_pma_attr_port_rcv_constraint_errors.attr.attr, + &port_pma_attr_local_link_integrity_errors.attr.attr, + &port_pma_attr_excessive_buffer_overrun_errors.attr.attr, + &port_pma_attr_VL15_dropped.attr.attr, + &port_pma_attr_port_xmit_data.attr.attr, + &port_pma_attr_port_rcv_data.attr.attr, + &port_pma_attr_port_xmit_packets.attr.attr, + &port_pma_attr_port_rcv_packets.attr.attr, + NULL +}; + +static struct attribute_group pma_group = { + .name = "counters", + .attrs = pma_attrs +}; + static void ib_port_release(struct kobject *kobj) { struct ib_port *p = container_of(kobj, struct ib_port, kobj); @@ -373,10 +488,14 @@ if (ret) goto err_put; - ret = alloc_group(&p->gid_attr, show_port_gid, attr.gid_tbl_len); + ret = sysfs_create_group(&p->kobj, 
&pma_group); if (ret) goto err_put; + ret = alloc_group(&p->gid_attr, show_port_gid, attr.gid_tbl_len); + if (ret) + goto err_remove_pma; + p->gid_group.name = "gids"; p->gid_group.attrs = p->gid_attr; @@ -418,6 +537,9 @@ kfree(p->gid_attr); +err_remove_pma: + sysfs_remove_group(&p->kobj, &pma_group); + err_put: kobject_put(&device->ports_parent); @@ -537,6 +659,7 @@ list_for_each_entry_safe(p, t, &device->port_list, entry) { list_del(&p->entry); port = container_of(p, struct ib_port, kobj); + sysfs_remove_group(p, &pma_group); sysfs_remove_group(p, &port->pkey_group); sysfs_remove_group(p, &port->gid_group); kobject_unregister(p); @@ -560,7 +683,9 @@ list_for_each_entry_safe(p, t, &device->port_list, entry) { list_del(&p->entry); port = container_of(p, struct ib_port, kobj); - sysfs_remove_group(p,&port->gid_group); + sysfs_remove_group(p, &pma_group); + sysfs_remove_group(p, &port->pkey_group); + sysfs_remove_group(p, &port->gid_group); kobject_unregister(p); } From roland at topspin.com Mon Sep 13 20:41:20 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Sep 2004 20:41:20 -0700 Subject: [openib-general] Proposed device enumeration & async event APIs In-Reply-To: <1095123092.2002.372.camel@localhost.localdomain> (Hal Rosenstock's message of "Mon, 13 Sep 2004 20:51:32 -0400") References: <52mzzxdfqx.fsf@topspin.com> <1095123092.2002.372.camel@localhost.localdomain> Message-ID: <527jqx5xgf.fsf@topspin.com> Hal> Is there a way to do the following things: Hal> 1. Can other than the QP owner obtain any QP async events (or Hal> perhaps only certain ones) ? Certain events might need to go Hal> to more than 1 place (I think that 2 is sufficient). Hal> (perhaps this is a second event handler which could be at Hal> qp_create time or qp modify time (with a new virtual QP Hal> attribute to add or remove this). Hal> 2. An optimization would be to be able to obtain all QP async Hal> events rather than needing to do this for all QPs as they Hal> come and go. 
Perhaps something like
    Hal> ib_set/clear_global_qp_handler() would set/clear the second
    Hal> event handler in all existing QPs and if a global handler is
    Hal> set, this would be used at QP create time.

No, there is not a way to do what you ask at the moment. I assume
that your motivation is the CM, specifically for communication
established and path migrated events.

With that perspective, #2 is not that useful because it means that the
CM needs to keep yet another table mapping local QP to CM context,
which is a hassle and a potential performance hit.

#1 is somewhat suboptimal as well, because there is no ordering
between when the CM might process a communication established event
and when the client might receive the actual receive completion. I
think we've found that it's actually easier for the CM client to
explicitly synchronize with the CM by calling into the CM when it gets
an event prior to the CM entering the established state (rather than
trying to drive the CM via async events).

With all that said, I'm not that opposed to implementing 1 or 2 as
appropriate, but I would like to see a consumer of the API that shows
a real benefit before going either of those two ways.

 - R.

From roland at topspin.com Mon Sep 13 21:34:18 2004
From: roland at topspin.com (Roland Dreier)
Date: Mon, 13 Sep 2004 21:34:18 -0700
Subject: [openib-general] [PATCH] Update MAD header fields to new API
Message-ID: <52zn3t4gfp.fsf@topspin.com>

This rather giant diff updates struct ib_mad in my tree's
ts_ib_mad_types.h to have the same member names as in ib_mad.h. This
means that it should now be possible to build and use the new SMI code
(once it's ready) with mthca.

 - R.

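For readers following the rename, here is a minimal user-space sketch
of the new header layout. It is not part of the patch: the names
mad_hdr_sketch, mad_sketch, get_be16, put_be16, smp_hop_pointer and
smp_hop_count are hypothetical stand-ins, and the field list is copied
from the struct ib_mad_hdr hunk in the diff. It illustrates two things
under those assumptions: the header comes to 24 bytes on common ABIs
without any packed attributes, and the old route.directed.hop_pointer
and hop_count members now live in the two bytes of class_specific. The
low byte is masked with 0xff here on the assumption that hop_count
keeps the full 8-bit width of the old uint8_t member.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical local copy of the header fields from the diff.
 * Multi-byte members hold wire (big-endian) values. */
struct mad_hdr_sketch {
	uint8_t  base_version;
	uint8_t  mgmt_class;
	uint8_t  class_version;
	uint8_t  method;
	uint16_t status;
	uint16_t class_specific;
	uint64_t tid;
	uint16_t attr_id;
	uint16_t resv;
	uint32_t attr_mod;
};

struct mad_sketch {
	struct mad_hdr_sketch mad_hdr;
	uint8_t data[232];
};

/* Every member is naturally aligned, so no padding is expected. */
static_assert(sizeof(struct mad_hdr_sketch) == 24, "MAD header is 24 bytes");
static_assert(sizeof(struct mad_sketch) == 256, "MAD is 256 bytes");

/* Read a 16-bit big-endian value byte by byte, so the result does
 * not depend on host endianness. */
static uint16_t get_be16(const void *p)
{
	const uint8_t *b = p;

	return (uint16_t)((b[0] << 8) | b[1]);
}

/* Store a 16-bit value in big-endian byte order. */
static void put_be16(void *p, uint16_t v)
{
	uint8_t *b = p;

	b[0] = (uint8_t)(v >> 8);
	b[1] = (uint8_t)(v & 0xff);
}

/* Old API: mad->route.directed.hop_pointer (first byte on the wire). */
static uint8_t smp_hop_pointer(const struct mad_hdr_sketch *hdr)
{
	return (uint8_t)(get_be16(&hdr->class_specific) >> 8);
}

/* Old API: mad->route.directed.hop_count (second byte on the wire). */
static uint8_t smp_hop_count(const struct mad_hdr_sketch *hdr)
{
	return (uint8_t)(get_be16(&hdr->class_specific) & 0xff);
}
```

In kernel code the same extraction would normally go through
be16_to_cpu() on mad_hdr.class_specific rather than byte-wise helpers;
the byte-wise form is used here only to keep the sketch portable.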
Index: infiniband/include/ts_ib_mad_smi_types.h =================================================================== --- infiniband/include/ts_ib_mad_smi_types.h (revision 803) +++ infiniband/include/ts_ib_mad_smi_types.h (working copy) @@ -30,7 +30,7 @@ #define TS_IB_MAD_SMP_PAYLOAD(mad) ((struct ib_mad_payload_smp *)(mad)->payload) /* Directed route payload */ #define TS_IB_MAD_SMP_DR_PAYLOAD(mad) \ - ((struct ib_mad_payload_smp_dr *)(mad)->payload) + ((struct ib_mad_payload_smp_dr *)(mad)->data) /* Macro to get the SMP DATA field from a MAD payload * (Same for LID or Directed Route SMPs.) Index: infiniband/include/ts_ib_mad_types.h =================================================================== --- infiniband/include/ts_ib_mad_types.h (revision 803) +++ infiniband/include/ts_ib_mad_types.h (working copy) @@ -46,34 +46,9 @@ #define TS_IB_MAD_DR_DIRECTION_RETURN 0x8000 #define TS_IB_MAD_DR_RETURNING(mad) \ - ((mad)->status & cpu_to_be16(TS_IB_MAD_DR_DIRECTION_RETURN)) + ((mad)->mad_hdr.status & cpu_to_be16(TS_IB_MAD_DR_DIRECTION_RETURN)) #define TS_IB_MAD_DR_OUTGOING(mad) (!(TS_IB_MAD_DR_RETURNING(mad))) -/* 13.4.4 */ -typedef enum ib_mgmt_class { - IB_MGMT_CLASS_SUBN_LID_ROUTED = 0x01, - IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE = 0x81, - IB_MGMT_CLASS_SUBN_ADM = 0x03, - IB_MGMT_CLASS_PERF = 0x04, - IB_MGMT_CLASS_BM = 0x05, - IB_MGMT_CLASS_DEV_MGT = 0x06, - IB_MGMT_CLASS_COMM_MGT = 0x07, - IB_MGMT_CLASS_SNMP = 0x08, - IB_MGMT_CLASS_VENDOR_TOPSPIN = 0x30 -} tTS_IB_MGMT_CLASS; - -/* 13.4.5 */ -typedef enum ib_mgmt_method { - IB_MGMT_METHOD_GET = 0x01, - IB_MGMT_METHOD_SET = 0x02, - IB_MGMT_METHOD_GET_RESPONSE = 0x81, - IB_MGMT_METHOD_SEND = 0x03, - IB_MGMT_METHOD_TRAP = 0x05, - IB_MGMT_METHOD_REPORT = 0x06, - IB_MGMT_METHOD_REPORT_RESPONSE = 0x86, - IB_MGMT_METHOD_TRAP_REPRESS = 0x07 -} tTS_IB_MGMT_METHOD; - typedef enum ib_mad_filter_mask { TS_IB_MAD_FILTER_DEVICE = 1 << 0, TS_IB_MAD_FILTER_PORT = 1 << 1, @@ -89,6 +64,26 @@ TS_IB_MAD_DIRECTION_OUT } 
tTS_IB_MAD_DIRECTION; +/* Management classes */ +#define IB_MGMT_CLASS_SUBN_LID_ROUTED 0x01 +#define IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE 0x81 +#define IB_MGMT_CLASS_SUBN_ADM 0x03 +#define IB_MGMT_CLASS_PERF_MGMT 0x04 +#define IB_MGMT_CLASS_BM 0x05 +#define IB_MGMT_CLASS_DEVICE_MGMT 0x06 +#define IB_MGMT_CLASS_CM 0x07 +#define IB_MGMT_CLASS_SNMP 0x08 + +/* Management methods */ +#define IB_MGMT_METHOD_GET 0x01 +#define IB_MGMT_METHOD_SET 0x02 +#define IB_MGMT_METHOD_GET_RESP 0x81 +#define IB_MGMT_METHOD_SEND 0x03 +#define IB_MGMT_METHOD_TRAP 0x05 +#define IB_MGMT_METHOD_REPORT 0x06 +#define IB_MGMT_METHOD_REPORT_RESP 0x86 +#define IB_MGMT_METHOD_TRAP_REPRESS 0x07 + /* function types */ typedef void (*ib_mad_completion_func)(int result, @@ -96,26 +91,22 @@ /* structs */ +struct ib_mad_hdr { + u8 base_version; + u8 mgmt_class; + u8 class_version; + u8 method; + u16 status; + u16 class_specific; + u64 tid; + u16 attr_id; + u16 resv; + u32 attr_mod; +}; + struct ib_mad { - uint8_t format_version __attribute__((packed)); - uint8_t mgmt_class __attribute__((packed)); - uint8_t class_version __attribute__((packed)); - uint8_t r_method __attribute__((packed)); - uint16_t status __attribute__((packed)); - union { - struct { - uint8_t hop_pointer __attribute__((packed)); - uint8_t hop_count __attribute__((packed)); - } directed __attribute__((packed)); - struct { - uint16_t class_specific __attribute__((packed)); - } lid __attribute__((packed)); - } route __attribute__((packed)); - uint64_t transaction_id __attribute__((packed)); - uint16_t attribute_id __attribute__((packed)); - uint16_t reserved __attribute__((packed)); - uint32_t attribute_modifier __attribute__((packed)); - uint8_t payload[232] __attribute__((packed)); + struct ib_mad_hdr mad_hdr; + u8 data[232]; struct ib_device *device; tTS_IB_PORT port; Index: infiniband/core/dm_client_svc_entries.c =================================================================== --- infiniband/core/dm_client_svc_entries.c (revision 
803) +++ infiniband/core/dm_client_svc_entries.c (working copy) @@ -67,10 +67,10 @@ case TS_IB_CLIENT_RESPONSE_OK: { struct ib_dm_svc_entries *svc_entries_ptr = - (struct ib_dm_svc_entries *) & packet->payload; + (struct ib_dm_svc_entries *) & packet->data; struct ib_svc_entries svc_entries; u32 attribute_modifier = - be32_to_cpu(packet->attribute_modifier); + be32_to_cpu(packet->mad_hdr.attr_mod); int i; svc_entries.controller_id = @@ -124,7 +124,7 @@ case TS_IB_CLIENT_RESPONSE_ERROR: TS_TRACE(MOD_KERNEL_DM, T_VERBOSE, TRACE_KERNEL_IB_DM_GEN, "DM client Svc Entries MAD slid= 0x%04x, status 0x%04x", - packet->slid, be16_to_cpu(packet->status)); + packet->slid, be16_to_cpu(packet->mad_hdr.status)); if (query->completion_func) { query->completion_func(query->transaction_id, -EINVAL, @@ -185,12 +185,12 @@ TS_IB_DM_METHOD_GET, TS_IB_DM_ATTRIBUTE_SVC_ENTRIES, attrib_mod); - query->transaction_id = mad.transaction_id; + query->transaction_id = mad.mad_hdr.tid; query->dlid = dst_port_lid; query->completion_func = completion_func; query->completion_arg = completion_arg; - *transaction_id = mad.transaction_id; + *transaction_id = mad.mad_hdr.tid; ib_client_query(&mad, timeout_jiffies, ib_svc_entries_response, query); Index: infiniband/core/dm_client_main.c =================================================================== --- infiniband/core/dm_client_main.c (revision 803) +++ infiniband/core/dm_client_main.c (working copy) @@ -52,7 +52,7 @@ return -EINVAL; } - ib_client_mad_handler_register(IB_MGMT_CLASS_DEV_MGT, + ib_client_mad_handler_register(IB_MGMT_CLASS_DEVICE_MGMT, ib_dm_async_notify_handler, NULL); return 0; @@ -60,7 +60,7 @@ static void __exit ib_dm_client_cleanup_module(void) { - ib_client_mad_handler_register(IB_MGMT_CLASS_DEV_MGT, NULL, NULL); + ib_client_mad_handler_register(IB_MGMT_CLASS_DEVICE_MGMT, NULL, NULL); ib_dm_client_query_cleanup(); } Index: infiniband/core/cm_main.c =================================================================== --- 
infiniband/core/cm_main.c (revision 803) +++ infiniband/core/cm_main.c (working copy) @@ -152,7 +152,7 @@ ib_cm_count_receive(packet); - attribute_id = be16_to_cpu(packet->attribute_id); + attribute_id = be16_to_cpu(packet->mad_hdr.attr_id); if (attribute_id >= ARRAY_SIZE(dispatch_table)) { TS_REPORT_WARN(MOD_IB_CM, @@ -199,7 +199,7 @@ struct ib_mad_filter filter = { NULL }; filter.qpn = 1; - filter.mgmt_class = IB_MGMT_CLASS_COMM_MGT; + filter.mgmt_class = IB_MGMT_CLASS_CM; filter.direction = TS_IB_MAD_DIRECTION_IN; filter.mask = (TS_IB_MAD_FILTER_QPN | TS_IB_MAD_FILTER_MGMT_CLASS | Index: infiniband/core/cm_common.c =================================================================== --- infiniband/core/cm_common.c (revision 803) +++ infiniband/core/cm_common.c (working copy) @@ -40,23 +40,23 @@ void ib_mad_build_header(struct ib_mad *packet) { - packet->format_version = 1; - packet->mgmt_class = IB_MGMT_CLASS_COMM_MGT; - packet->class_version = 2; /* IB Spec version 1.1 */ - packet->r_method = IB_MGMT_METHOD_SEND; - packet->status = 0; - packet->route.lid.class_specific = 0; + packet->mad_hdr.base_version = 1; + packet->mad_hdr.mgmt_class = IB_MGMT_CLASS_CM; + packet->mad_hdr.class_version = 2; /* IB Spec version 1.1 */ + packet->mad_hdr.method = IB_MGMT_METHOD_SEND; + packet->mad_hdr.status = 0; + packet->mad_hdr.class_specific = 0; /* caller will fill in */ packet->sl = 0; - packet->attribute_id = 0; - packet->transaction_id = 0; + packet->mad_hdr.attr_id = 0; + packet->mad_hdr.tid = 0; - packet->reserved = 0; - packet->attribute_modifier = 0; + packet->mad_hdr.resv = 0; + packet->mad_hdr.attr_mod = 0; /* clear the payload */ - memset(packet->payload, 0, sizeof packet->payload); + memset(packet->data, 0, sizeof packet->data); return; } @@ -347,8 +347,8 @@ ib_mad_build_header(&connection->mad); - connection->mad.attribute_id = cpu_to_be16(IB_COM_MGT_MRA); - connection->mad.transaction_id = cpu_to_be64(connection->transaction_id); + 
connection->mad.mad_hdr.attr_id = cpu_to_be16(IB_COM_MGT_MRA); + connection->mad.mad_hdr.tid = cpu_to_be64(connection->transaction_id); connection->mad.device = connection->local_cm_device; connection->mad.port = connection->local_cm_port; @@ -403,8 +403,8 @@ if (reply_data && reply_size > 0) memcpy(ib_cm_rej_private_data_get(packet), reply_data, reply_size); - packet->attribute_id = cpu_to_be16(IB_COM_MGT_REJ); - packet->transaction_id = cpu_to_be64(transaction_id); + packet->mad_hdr.attr_id = cpu_to_be16(IB_COM_MGT_REJ); + packet->mad_hdr.tid = cpu_to_be64(transaction_id); ib_cm_rej_local_comm_id_set (packet, local_comm_id); ib_cm_rej_remote_comm_id_set(packet, remote_comm_id); @@ -479,8 +479,8 @@ ib_mad_build_header(&connection->mad); - connection->mad.attribute_id = cpu_to_be16(IB_COM_MGT_DREQ); - connection->mad.transaction_id = ib_cm_tid_generate(); + connection->mad.mad_hdr.attr_id = cpu_to_be16(IB_COM_MGT_DREQ); + connection->mad.mad_hdr.tid = ib_cm_tid_generate(); ib_cm_dreq_local_comm_id_set (&connection->mad, connection->local_comm_id); ib_cm_dreq_remote_comm_id_set(&connection->mad, connection->remote_comm_id); @@ -703,8 +703,8 @@ ib_mad_build_header(drep); - drep->attribute_id = cpu_to_be16(IB_COM_MGT_DREP); - drep->transaction_id = packet->transaction_id; + drep->mad_hdr.attr_id = cpu_to_be16(IB_COM_MGT_DREP); + drep->mad_hdr.tid = packet->mad_hdr.tid; ib_cm_drep_local_comm_id_set (drep, ib_cm_dreq_remote_comm_id_get(packet)); ib_cm_drep_remote_comm_id_set(drep, ib_cm_drep_local_comm_id_get (packet)); @@ -741,8 +741,8 @@ ib_mad_build_header(&connection->mad); - connection->mad.attribute_id = cpu_to_be16(IB_COM_MGT_DREP); - connection->mad.transaction_id = packet->transaction_id; + connection->mad.mad_hdr.attr_id = cpu_to_be16(IB_COM_MGT_DREP); + connection->mad.mad_hdr.tid = packet->mad_hdr.tid; ib_cm_drep_local_comm_id_set (&connection->mad, connection->local_comm_id); ib_cm_drep_remote_comm_id_set(&connection->mad, connection->remote_comm_id); 
Index: infiniband/core/sa_client_service.c =================================================================== --- infiniband/core/sa_client_service.c (revision 803) +++ infiniband/core/sa_client_service.c (working copy) @@ -88,9 +88,9 @@ case TS_IB_CLIENT_RESPONSE_OK: { struct ib_sa_payload *sa_payload = - (struct ib_sa_payload *) & packet->payload; + (struct ib_sa_payload *) & packet->data; struct ib_sa_service *mad_service = - (struct ib_sa_service *) sa_payload->admin_data; + (struct ib_sa_service *) sa_payload->admin_data; struct ib_common_attrib_service service; TS_TRACE(MOD_KERNEL_IB, T_VERBOSE, TRACE_KERNEL_IB_GEN, @@ -120,7 +120,7 @@ case TS_IB_CLIENT_RESPONSE_ERROR: TS_TRACE(MOD_KERNEL_IB, T_VERBOSE, TRACE_KERNEL_IB_GEN, "SA client service MAD status 0x%04x", - be16_to_cpu(packet->status)); + be16_to_cpu(packet->mad_hdr.status)); if (query->completion_func) query->completion_func(query->transaction_id, -EINVAL, @@ -171,14 +171,14 @@ return -ENOMEM; } - sa_payload = (struct ib_sa_payload *) & mad->payload; + sa_payload = (struct ib_sa_payload *) & mad->data; mad_service = (struct ib_sa_service *) sa_payload->admin_data; /* MAD initialization */ tsIbSaClientMadInit(mad, device, port); - mad->r_method = method; - mad->attribute_id = cpu_to_be16(TS_IB_SA_ATTRIBUTE_SERVICE_RECORD); - mad->attribute_modifier = 0; + mad->mad_hdr.method = method; + mad->mad_hdr.attr_id = cpu_to_be16(TS_IB_SA_ATTRIBUTE_SERVICE_RECORD); + mad->mad_hdr.attr_mod = 0; /* SA header */ sa_payload->sa_header.component_mask = cpu_to_be64(comp_mask); @@ -196,9 +196,9 @@ memcpy(mad_service->service_data32, service->service_data32, 16); memcpy(mad_service->service_data64, service->service_data64, 16); - *transaction_id = mad->transaction_id; + *transaction_id = mad->mad_hdr.tid; - query->transaction_id = mad->transaction_id; + query->transaction_id = mad->mad_hdr.tid; query->completion_func = completion_func; query->completion_arg = completion_arg; @@ -269,9 +269,9 @@ case 
TS_IB_CLIENT_RESPONSE_OK: { struct ib_sa_payload *sa_payload = - (struct ib_sa_payload *) & packet->payload; + (struct ib_sa_payload *) & packet->data; struct ib_sa_service *mad_service = - (struct ib_sa_service *) sa_payload->admin_data; + (struct ib_sa_service *) sa_payload->admin_data; TS_TRACE(MOD_KERNEL_IB, T_VERBOSE, TRACE_KERNEL_IB_GEN, "SA client tsIbServiceAtsGetGidResponse() status OK\n"); @@ -290,7 +290,7 @@ case TS_IB_CLIENT_RESPONSE_ERROR: TS_TRACE(MOD_KERNEL_IB, T_VERBOSE, TRACE_KERNEL_IB_GEN, "SA client tsIbServiceAtsGetGidResponse status 0x%04x", - be16_to_cpu(packet->status)); + be16_to_cpu(packet->mad_hdr.status)); if (query->completion_func) { query->completion_func(query->transaction_id, -EINVAL, @@ -333,9 +333,9 @@ case TS_IB_CLIENT_RESPONSE_OK: { struct ib_sa_payload *sa_payload = - (struct ib_sa_payload *) & packet->payload; + (struct ib_sa_payload *) & packet->data; struct ib_sa_service *mad_service = - (struct ib_sa_service *) sa_payload->admin_data; + (struct ib_sa_service *) sa_payload->admin_data; TS_TRACE(MOD_KERNEL_IB, T_VERBOSE, TRACE_KERNEL_IB_GEN, "SA client tsIbServiceAtsGetIpResponse() status OK\n"); @@ -355,7 +355,7 @@ case TS_IB_CLIENT_RESPONSE_ERROR: TS_TRACE(MOD_KERNEL_IB, T_VERBOSE, TRACE_KERNEL_IB_GEN, "SA client tsIbServiceAtsGetIpResponse status 0x%04x", - be16_to_cpu(packet->status)); + be16_to_cpu(packet->mad_hdr.status)); if (query->completion_func) { query->completion_func(query->transaction_id, -EINVAL, @@ -397,9 +397,9 @@ { struct ib_mad mad; struct ib_sa_payload *sa_payload = - (struct ib_sa_payload *) & mad.payload; + (struct ib_sa_payload *) & mad.data; struct ib_sa_service *service = - (struct ib_sa_service *) sa_payload->admin_data; + (struct ib_sa_service *) sa_payload->admin_data; struct ib_sa_service_query *query; TS_TRACE(MOD_KERNEL_IB, T_VERBOSE, TRACE_KERNEL_IB_GEN, @@ -417,9 +417,9 @@ /* MAD initialization */ tsIbSaClientMadInit(&mad, device, port); - mad.r_method = IB_MGMT_METHOD_SET; - mad.attribute_id 
= cpu_to_be16(TS_IB_SA_ATTRIBUTE_SERVICE_RECORD); - mad.attribute_modifier = 0; + mad.mad_hdr.method = IB_MGMT_METHOD_SET; + mad.mad_hdr.attr_id = cpu_to_be16(TS_IB_SA_ATTRIBUTE_SERVICE_RECORD); + mad.mad_hdr.attr_mod = 0; /* SA header */ sa_payload->sa_header.component_mask = @@ -439,12 +439,12 @@ if (!query) { return -ENOMEM; } - query->transaction_id = mad.transaction_id; + query->transaction_id = mad.mad_hdr.tid; query->completion_func = completion_func; query->completion_arg = completion_arg; /* Subscribe/unsubscribe to SA */ - *transaction_id = mad.transaction_id; + *transaction_id = mad.mad_hdr.tid; ib_client_query(&mad, timeout_jiffies, ib_service_response, query); return 0; @@ -461,9 +461,9 @@ { struct ib_mad mad; struct ib_sa_payload *sa_payload = - (struct ib_sa_payload *) & mad.payload; + (struct ib_sa_payload *) & mad.data; struct ib_sa_service *service = - (struct ib_sa_service *) sa_payload->admin_data; + (struct ib_sa_service *) sa_payload->admin_data; struct ib_sa_service_get_gid_query *query; TS_TRACE(MOD_KERNEL_IB, T_VERBOSE, TRACE_KERNEL_IB_GEN, @@ -477,9 +477,9 @@ /* MAD initialization */ tsIbSaClientMadInit(&mad, device, port); - mad.r_method = IB_MGMT_METHOD_GET; - mad.attribute_id = cpu_to_be16(TS_IB_SA_ATTRIBUTE_SERVICE_RECORD); - mad.attribute_modifier = 0; + mad.mad_hdr.method = IB_MGMT_METHOD_GET; + mad.mad_hdr.attr_id = cpu_to_be16(TS_IB_SA_ATTRIBUTE_SERVICE_RECORD); + mad.mad_hdr.attr_mod = 0; /* SA header */ sa_payload->sa_header.component_mask = @@ -496,12 +496,12 @@ if (!query) { return -ENOMEM; } - query->transaction_id = mad.transaction_id; + query->transaction_id = mad.mad_hdr.tid; query->completion_func = completion_func; query->completion_arg = completion_arg; /* Subscribe/unsubscribe to SA */ - *transaction_id = mad.transaction_id; + *transaction_id = mad.mad_hdr.tid; ib_client_query(&mad, timeout_jiffies, _tsIbServiceAtsGetGidResponse, query); @@ -519,9 +519,9 @@ { struct ib_mad mad; struct ib_sa_payload *sa_payload = - 
(struct ib_sa_payload *) & mad.payload; + (struct ib_sa_payload *) & mad.data; struct ib_sa_service *service = - (struct ib_sa_service *) sa_payload->admin_data; + (struct ib_sa_service *) sa_payload->admin_data; struct ib_sa_service_get_ip_query *query; TS_TRACE(MOD_KERNEL_IB, T_VERBOSE, TRACE_KERNEL_IB_GEN, @@ -534,9 +534,9 @@ /* MAD initialization */ tsIbSaClientMadInit(&mad, device, port); - mad.r_method = IB_MGMT_METHOD_GET; - mad.attribute_id = cpu_to_be16(TS_IB_SA_ATTRIBUTE_SERVICE_RECORD); - mad.attribute_modifier = 0; + mad.mad_hdr.method = IB_MGMT_METHOD_GET; + mad.mad_hdr.attr_id = cpu_to_be16(TS_IB_SA_ATTRIBUTE_SERVICE_RECORD); + mad.mad_hdr.attr_mod = 0; /* SA header */ sa_payload->sa_header.component_mask = @@ -553,12 +553,12 @@ if (!query) { return -ENOMEM; } - query->transaction_id = mad.transaction_id; + query->transaction_id = mad.mad_hdr.tid; query->completion_func = completion_func; query->completion_arg = completion_arg; /* Subscribe/unsubscribe to SA */ - *transaction_id = mad.transaction_id; + *transaction_id = mad.mad_hdr.tid; ib_client_query(&mad, timeout_jiffies, _tsIbServiceAtsGetIpResponse, query); Index: infiniband/core/dm_client_async_notify.c =================================================================== --- infiniband/core/dm_client_async_notify.c (revision 803) +++ infiniband/core/dm_client_async_notify.c (working copy) @@ -75,7 +75,7 @@ void ib_dm_async_notify_handler(struct ib_mad *packet, void *arg) { struct ib_dm_notice *mad_notice = - (struct ib_dm_notice *) packet->payload; + (struct ib_dm_notice *) packet->data; struct ib_common_attrib_notice notice; /* Convert to host order */ Index: infiniband/core/cm_path_migration.c =================================================================== --- infiniband/core/cm_path_migration.c (revision 803) +++ infiniband/core/cm_path_migration.c (working copy) @@ -145,8 +145,8 @@ ib_mad_build_header(&connection->mad); - connection->mad.attribute_id = cpu_to_be16(IB_COM_MGT_LAP); - 
connection->mad.transaction_id = ib_cm_tid_generate(); + connection->mad.mad_hdr.attr_id = cpu_to_be16(IB_COM_MGT_LAP); + connection->mad.mad_hdr.tid = ib_cm_tid_generate(); connection->mad.device = connection->local_cm_device; connection->mad.port = connection->local_cm_port; @@ -282,8 +282,8 @@ out_send_apr: ib_mad_build_header(&connection->mad); - connection->mad.attribute_id = cpu_to_be16(IB_COM_MGT_APR); - connection->mad.transaction_id = packet->transaction_id; + connection->mad.mad_hdr.attr_id = cpu_to_be16(IB_COM_MGT_APR); + connection->mad.mad_hdr.tid = packet->mad_hdr.tid; connection->mad.device = connection->local_cm_device; connection->mad.port = connection->local_cm_port; Index: infiniband/core/dm_client_iou_info.c =================================================================== --- infiniband/core/dm_client_iou_info.c (revision 803) +++ infiniband/core/dm_client_iou_info.c (working copy) @@ -60,7 +60,7 @@ case TS_IB_CLIENT_RESPONSE_OK: { struct ib_dm_iou_info *iou_info_ptr = - (struct ib_dm_iou_info *) & packet->payload; + (struct ib_dm_iou_info *) & packet->data; struct ib_iou_info iou_info; iou_info.lid = packet->slid; @@ -92,7 +92,7 @@ case TS_IB_CLIENT_RESPONSE_ERROR: TS_TRACE(MOD_KERNEL_DM, T_VERBOSE, TRACE_KERNEL_IB_DM_GEN, "DM client IOU Info MAD status 0x%04x", - be16_to_cpu(packet->status)); + be16_to_cpu(packet->mad_hdr.status)); if (query->completion_func) { query->completion_func(query->transaction_id, -EINVAL, @@ -145,12 +145,12 @@ TS_IB_DM_METHOD_GET, TS_IB_DM_ATTRIBUTE_IOU_INFO, 0); - query->transaction_id = mad.transaction_id; + query->transaction_id = mad.mad_hdr.tid; query->dlid = dst_port_lid; query->completion_func = completion_func; query->completion_arg = completion_arg; - *transaction_id = mad.transaction_id; + *transaction_id = mad.mad_hdr.tid; ib_client_query(&mad, timeout_jiffies, ib_iou_info_response, query); Index: infiniband/core/sa_client_multicast.c =================================================================== 
--- infiniband/core/sa_client_multicast.c (revision 803) +++ infiniband/core/sa_client_multicast.c (working copy) @@ -73,7 +73,7 @@ case TS_IB_CLIENT_RESPONSE_OK: { struct ib_sa_payload *sa_payload = - (struct ib_sa_payload *) & packet->payload; + (struct ib_sa_payload *) & packet->data; struct ib_sa_multicast_member *mc_ptr = (struct ib_sa_multicast_member *) sa_payload->admin_data; @@ -122,7 +122,7 @@ case TS_IB_CLIENT_RESPONSE_ERROR: TS_TRACE(MOD_KERNEL_IB, T_VERBOSE, TRACE_KERNEL_IB_GEN, "SA client multicast member MAD status 0x%04x", - be16_to_cpu(packet->status)); + be16_to_cpu(packet->mad_hdr.status)); if (query->completion_func) { query->completion_func(query->transaction_id, -EINVAL, @@ -285,9 +285,9 @@ { struct ib_mad mad; struct ib_sa_payload *sa_payload = - (struct ib_sa_payload *) & mad.payload; + (struct ib_sa_payload *) & mad.data; struct ib_sa_multicast_member *mc_member = - (struct ib_sa_multicast_member *) sa_payload->admin_data; + (struct ib_sa_multicast_member *) sa_payload->admin_data; struct ib_sa_multicast_member_query *query; query = kmalloc(sizeof *query, GFP_ATOMIC); @@ -296,11 +296,11 @@ } tsIbSaClientMadInit(&mad, device, port); - mad.r_method = IB_MGMT_METHOD_SET; - mad.attribute_id = cpu_to_be16(TS_IB_SA_ATTRIBUTE_MC_MEMBER_RECORD); - mad.attribute_modifier = 0xffffffff; /* match attributes */ + mad.mad_hdr.method = IB_MGMT_METHOD_SET; + mad.mad_hdr.attr_id = cpu_to_be16(TS_IB_SA_ATTRIBUTE_MC_MEMBER_RECORD); + mad.mad_hdr.attr_mod = 0xffffffff; /* match attributes */ - query->transaction_id = mad.transaction_id; + query->transaction_id = mad.mad_hdr.tid; query->completion_func = completion_func; query->completion_arg = completion_arg; @@ -354,7 +354,7 @@ ib_client_query(&mad, timeout_jiffies, _tsIbMulticastJoinResponse, query); - *transaction_id = mad.transaction_id; + *transaction_id = mad.mad_hdr.tid; return 0; } @@ -378,9 +378,9 @@ { struct ib_mad mad; struct ib_sa_payload *sa_payload = - (struct ib_sa_payload *) & mad.payload; + 
(struct ib_sa_payload *) & mad.data; struct ib_sa_multicast_member *mc_member = - (struct ib_sa_multicast_member *) sa_payload->admin_data; + (struct ib_sa_multicast_member *) sa_payload->admin_data; struct ib_sa_multicast_group_table_query *query; struct ib_client_rmpp_mad *rmpp_mad; @@ -390,16 +390,16 @@ } tsIbSaClientMadInit(&mad, device, port); - mad.r_method = TS_IB_SA_METHOD_GET_TABLE; - mad.attribute_id = cpu_to_be16(TS_IB_SA_ATTRIBUTE_MC_MEMBER_RECORD); - mad.attribute_modifier = 0; + mad.mad_hdr.method = TS_IB_SA_METHOD_GET_TABLE; + mad.mad_hdr.attr_id = cpu_to_be16(TS_IB_SA_ATTRIBUTE_MC_MEMBER_RECORD); + mad.mad_hdr.attr_mod = 0; /* rmpp header init */ rmpp_mad = (struct ib_client_rmpp_mad *) & mad; rmpp_mad->version = 1; rmpp_mad->type = TS_IB_CLIENT_RMPP_TYPE_DATA; - query->transaction_id = mad.transaction_id; + query->transaction_id = mad.mad_hdr.tid; query->completion_func = completion_func; query->completion_arg = completion_arg; @@ -407,7 +407,7 @@ sa_payload->sa_header.component_mask = cpu_to_be64(0x1ULL << 7); mc_member->pkey = cpu_to_be16(partition); - *transaction_id = mad.transaction_id; + *transaction_id = mad.mad_hdr.tid; ib_rmpp_client_query(&mad, timeout_jiffies, sizeof(struct ib_sa_header), _tsIbMcastGroupTableResponse, query); Index: infiniband/core/mad_thread.c =================================================================== --- infiniband/core/mad_thread.c (revision 812) +++ infiniband/core/mad_thread.c (working copy) @@ -42,7 +42,7 @@ static inline int ib_mad_smp_is_outgoing(struct ib_mad *mad) { - return !(be16_to_cpu(mad->status) & 0x8000); + return !(be16_to_cpu(mad->mad_hdr.status) & 0x8000); } static inline int ib_mad_smp_send(struct ib_device *device, @@ -52,7 +52,11 @@ struct ib_mad_private *priv = device->mad; struct ib_mad *mad = work->buf; enum ib_mad_result result; + u8 hop_pointer, hop_count; + hop_pointer = be16_to_cpu(mad->mad_hdr.class_specific) >> 8; + hop_count = be16_to_cpu(mad->mad_hdr.class_specific) & 0xff; + if
(ib_mad_smp_is_outgoing(mad)) { /* If this is an outgoing 0-hop SMP, we have a @@ -60,7 +64,7 @@ to use QP0 for this, just process the MAD directly. */ if (device->process_mad && !(device->flags & IB_MAD_LOCAL_USE_QP) && - !mad->route.directed.hop_count) { + !hop_count) { void *response_buf = kmalloc(sizeof (struct ib_mad) + IB_MAD_GRH_SIZE, GFP_KERNEL); @@ -111,16 +115,16 @@ return 1; } else { - if (!mad->route.directed.hop_pointer) { + if (!hop_pointer) { /* Some devices (eg Anafa2) do the hop pointer increment themselves. */ if (!(device->flags & IB_MAD_NO_HOP_POINTER_INCR)) - ++mad->route.directed.hop_pointer; + ++hop_pointer; } else { /* Discard (IB Spec 14.2.2.2 #2) */ TS_REPORT_WARN(MOD_KERNEL_IB, "Discarding outgoing DR SMP with hop_pointer %d", - mad->route.directed.hop_pointer); + hop_pointer); work->type = IB_MAD_WORK_SEND_DONE; work->status = -EINVAL; work->index = -1; @@ -130,11 +134,13 @@ } } else { /* Process returning SMP: (IB Spec 14.2.2.4 is relevant) */ - if (mad->route.directed.hop_count && - mad->route.directed.hop_pointer > 1) - --mad->route.directed.hop_pointer; + if (hop_count && + hop_pointer > 1) + --hop_pointer; } + mad->mad_hdr.class_specific = cpu_to_be16(hop_count | (hop_pointer << 8)); + return 0; } @@ -215,9 +221,9 @@ ib_mad_invoke_filters(mad, TS_IB_MAD_DIRECTION_OUT); /* Handle directed route SMPs */ - if (mad->dqpn == 0 && - mad->dlid == IB_LID_PERMISSIVE && - mad->mgmt_class == IB_SM_DIRECTED_ROUTE) + if (mad->dqpn == 0 && + mad->dlid == IB_LID_PERMISSIVE && + mad->mad_hdr.mgmt_class == IB_SM_DIRECTED_ROUTE) if (ib_mad_smp_send(device, work, &reuse)) break; Index: infiniband/core/generate_cm_packet.pl =================================================================== --- infiniband/core/generate_cm_packet.pl (revision 803) +++ infiniband/core/generate_cm_packet.pl (working copy) @@ -75,23 +75,23 @@ if ($offset == 0) { print <<"END_DEF_SHORT"; static inline $type ib_cm_${packet_type}_${name}_get(const struct ib_mad *packet) { - 
return packet->payload[$byte] & $mask; + return packet->data[$byte] & $mask; } static inline void ib_cm_${packet_type}_${name}_set(struct ib_mad *packet, ${type} value) { - packet->payload[$byte] = value | (packet->payload[$byte] & ~$mask); + packet->data[$byte] = value | (packet->data[$byte] & ~$mask); } END_DEF_SHORT } else { print <<"END_DEF_SHORT_OFFSET"; static inline $type ib_cm_${packet_type}_${name}_get(const struct ib_mad *packet) { - return (packet->payload[$byte] >> $offset) & $mask; + return (packet->data[$byte] >> $offset) & $mask; } static inline void ib_cm_${packet_type}_${name}_set(struct ib_mad *packet, ${type} value) { - packet->payload[$byte] = (value << $offset) | - (packet->payload[$byte] & ~($mask << $offset)); + packet->data[$byte] = (value << $offset) | + (packet->data[$byte] & ~($mask << $offset)); } END_DEF_SHORT_OFFSET @@ -103,11 +103,11 @@ if ($linelist[2] == 8) { print <<"END_DEF8"; static inline $type ib_cm_${packet_type}_${name}_get(const struct ib_mad *packet) { - return packet->payload[$linelist[1]]; + return packet->data[$linelist[1]]; } static inline void ib_cm_${packet_type}_${name}_set(struct ib_mad *packet, ${type} value) { - packet->payload[$linelist[1]] = value; + packet->data[$linelist[1]] = value; } END_DEF8 @@ -119,11 +119,11 @@ print <<"END_DEF16"; static inline $type ib_cm_${packet_type}_${name}_get(const struct ib_mad *packet) { - return be16_to_cpu(*(($type *) &packet->payload[$linelist[1]])); + return be16_to_cpu(*(($type *) &packet->data[$linelist[1]])); } static inline void ib_cm_${packet_type}_${name}_set(struct ib_mad *packet, ${type} value) { - *(($type *) &packet->payload[$linelist[1]]) = cpu_to_be16(value); + *(($type *) &packet->data[$linelist[1]]) = cpu_to_be16(value); } END_DEF16 @@ -135,13 +135,13 @@ print <<"END_DEF20"; static inline $type ib_cm_${packet_type}_${name}_get(const struct ib_mad *packet) { - return be32_to_cpu(*(($type *) &packet->payload[$linelist[1]])) >> 12; + return be32_to_cpu(*(($type *) 
&packet->data[$linelist[1]])) >> 12; } static inline void ib_cm_${packet_type}_${name}_set(struct ib_mad *packet, ${type} value) { - *(($type *) &packet->payload[$linelist[1]]) = + *(($type *) &packet->data[$linelist[1]]) = cpu_to_be32((value << 12) | - (*(($type *) &packet->payload[$linelist[1]]) & 0x00000fff)); + (*(($type *) &packet->data[$linelist[1]]) & 0x00000fff)); } END_DEF20 @@ -153,12 +153,12 @@ print <<"END_DEF24"; static inline $type ib_cm_${packet_type}_${name}_get(const struct ib_mad *packet) { - return be32_to_cpu(*(($type *) &packet->payload[$linelist[1]])) >> 8; + return be32_to_cpu(*(($type *) &packet->data[$linelist[1]])) >> 8; } static inline void ib_cm_${packet_type}_${name}_set(struct ib_mad *packet, ${type} value) { - *(($type *) &packet->payload[$linelist[1]]) = - cpu_to_be32((value << 8) | packet->payload[$linelist[1] + 3]); + *(($type *) &packet->data[$linelist[1]]) = + cpu_to_be32((value << 8) | packet->data[$linelist[1] + 3]); } END_DEF24 @@ -170,11 +170,11 @@ print <<"END_DEF32"; static inline $type ib_cm_${packet_type}_${name}_get(const struct ib_mad *packet) { - return be32_to_cpu(*(($type *) &packet->payload[$linelist[1]])); + return be32_to_cpu(*(($type *) &packet->data[$linelist[1]])); } static inline void ib_cm_${packet_type}_${name}_set(struct ib_mad *packet, ${type} value) { - *(($type *) &packet->payload[$linelist[1]]) = cpu_to_be32(value); + *(($type *) &packet->data[$linelist[1]]) = cpu_to_be32(value); } END_DEF32 @@ -186,7 +186,7 @@ print <<"END_DEFPOINTER"; static inline void *ib_cm_${packet_type}_${name}_get(struct ib_mad *packet) { - return &packet->payload[$linelist[1]]; + return &packet->data[$linelist[1]]; } static inline int ib_cm_${packet_type}_${name}_get_length(void) { Index: infiniband/core/dm_client_class_port_info.c =================================================================== --- infiniband/core/dm_client_class_port_info.c (revision 803) +++ infiniband/core/dm_client_class_port_info.c (working copy) @@ 
-73,7 +73,7 @@ case TS_IB_CLIENT_RESPONSE_OK: { struct ib_dm_class_port_info *cpi_ptr = - (struct ib_dm_class_port_info *) & packet->payload; + (struct ib_dm_class_port_info *) & packet->data; struct ib_common_attrib_cpi cpi; cpi.base_version = cpi_ptr->base_version; @@ -134,7 +134,7 @@ case TS_IB_CLIENT_RESPONSE_ERROR: TS_TRACE(MOD_KERNEL_DM, T_VERBOSE, TRACE_KERNEL_IB_DM_GEN, "DM client Class Port Info query status 0x%04x", - be16_to_cpu(packet->status)); + be16_to_cpu(packet->mad_hdr.status)); if (query->completion_func) { query->completion_func(query->transaction_id, -EINVAL, @@ -193,18 +193,18 @@ TS_IB_DM_ATTRIBUTE_CLASS_PORTINFO, 0); /* Set trap LID to notify TS SRP Mgr to forward trap info */ - cpi = (struct ib_dm_class_port_info *) mad.payload; + cpi = (struct ib_dm_class_port_info *) mad.data; ib_cached_lid_get(device, port, &lid_info); cpi->trap_lid = cpu_to_be16(lid_info.lid); ib_cached_gid_get(device, port, 0, gid); memcpy(cpi->trap_gid, gid, sizeof(tTS_IB_GID)); - query->transaction_id = mad.transaction_id; + query->transaction_id = mad.mad_hdr.tid; query->dlid = dst_port_lid; query->completion_func = completion_func; query->completion_arg = completion_arg; - *transaction_id = mad.transaction_id; + *transaction_id = mad.mad_hdr.tid; ib_client_query(&mad, timeout_jiffies, ib_dm_class_port_info_response, query); Index: infiniband/core/sa_client_notice.c =================================================================== --- infiniband/core/sa_client_notice.c (revision 803) +++ infiniband/core/sa_client_notice.c (working copy) @@ -216,9 +216,9 @@ void tsIbSaNoticeHandler(struct ib_mad *mad, void *arg) { struct ib_sa_payload *sa_payload = - (struct ib_sa_payload *) mad->payload; + (struct ib_sa_payload *) mad->data; struct ib_sa_notice *mad_notice = - (struct ib_sa_notice *) sa_payload->admin_data; + (struct ib_sa_notice *) sa_payload->admin_data; struct ib_common_attrib_notice notice; tTS_IB_SA_NOTICE_HANDLER_FUNC handler; void *handler_arg; Index: 
infiniband/core/client_query.c =================================================================== --- infiniband/core/client_query.c (revision 803) +++ infiniband/core/client_query.c (working copy) @@ -239,8 +239,8 @@ query->callback_running = 0; - query->transaction_id = packet->transaction_id; - query->r_method = packet->r_method; + query->transaction_id = packet->mad_hdr.tid; + query->r_method = packet->mad_hdr.method; query->timeout_jiffies = timeout_jiffies; query->callback.function = function; query->arg = arg; @@ -278,8 +278,8 @@ "ib_client_rmpp_query_new()\n"); query->callback_running = 0; - query->transaction_id = packet->transaction_id; - query->r_method = packet->r_method; + query->transaction_id = packet->mad_hdr.tid; + query->r_method = packet->mad_hdr.method; query->timeout_jiffies = timeout_jiffies; query->callback.rmpp_function = function; query->arg = arg; @@ -346,9 +346,9 @@ int bytes_to_copy; /* Convert to host format */ - status = be16_to_cpu(mad->status); - attribute_id = be16_to_cpu(mad->attribute_id); - attribute_modifier = be32_to_cpu(mad->attribute_modifier); + status = be16_to_cpu(mad->mad_hdr.status); + attribute_id = be16_to_cpu(mad->mad_hdr.attr_id); + attribute_modifier = be32_to_cpu(mad->mad_hdr.attr_mod); flag = rmpp_mad->resp_time__flags & 0x0F; TS_TRACE(MOD_KERNEL_IB, T_VERBOSE, TRACE_KERNEL_IB_GEN, @@ -488,22 +488,22 @@ static void ib_client_get_response(struct ib_mad *mad) { struct ib_client_query *query = - ib_client_query_find(mad->transaction_id); + ib_client_query_find(mad->mad_hdr.tid); tTS_IB_CLIENT_RESPONSE_STATUS resp_status; if (!query) { TS_TRACE(MOD_KERNEL_IB, T_VERBOSE, TRACE_KERNEL_IB_GEN, "packet received for unknown TID 0x%016" TS_U64_FMT - "x", mad->transaction_id); + "x", mad->mad_hdr.tid); return; } - if (query->r_method == mad->r_method) { + if (query->r_method == mad->mad_hdr.method) { ib_client_query_put(query); return; } - if (mad->status) { + if (mad->mad_hdr.status) { resp_status = 
TS_IB_CLIENT_RESPONSE_ERROR; } else { resp_status = TS_IB_CLIENT_RESPONSE_OK; @@ -547,8 +547,9 @@ } } } + /* Send out MAD */ - tsIbMadSend(packet); + ib_mad_send(packet); return 0; } @@ -586,7 +587,7 @@ { TS_TRACE(MOD_KERNEL_IB, T_VERY_VERBOSE, TRACE_KERNEL_IB_GEN, "query packet received, TID 0x%016" TS_U64_FMT "x", - mad->transaction_id); + mad->mad_hdr.tid); if (0) { int i; @@ -602,15 +603,15 @@ } } } - switch (mad->r_method) { + switch (mad->mad_hdr.method) { case IB_MGMT_METHOD_REPORT: { /* Send back REPORT RESPONSE */ struct ib_mad report_resp; memcpy(&report_resp, mad, sizeof(report_resp)); - report_resp.r_method = - IB_MGMT_METHOD_REPORT_RESPONSE; + report_resp.mad_hdr.method = + IB_MGMT_METHOD_REPORT_RESP; report_resp.slid = mad->dlid; report_resp.dlid = mad->slid; report_resp.sqpn = mad->dqpn; @@ -626,12 +627,11 @@ case IB_MGMT_METHOD_TRAP_REPRESS: { ib_mad_dispatch_func dispatch = - ib_client_async_mad_handler_get(mad->mgmt_class); + ib_client_async_mad_handler_get(mad->mad_hdr.mgmt_class); if (dispatch) dispatch(mad, - ib_client_async_mad_handler_arg_get(mad-> - mgmt_class)); + ib_client_async_mad_handler_arg_get(mad->mad_hdr.mgmt_class)); return; } @@ -700,7 +700,7 @@ ib_client_query_put(query); - tsIbMadSend(packet); + ib_mad_send(packet); return 0; } Index: infiniband/core/mad_filter.c =================================================================== --- infiniband/core/mad_filter.c (revision 812) +++ infiniband/core/mad_filter.c (working copy) @@ -55,13 +55,13 @@ filter->qpn == qpn) && (!(filter->mask & TS_IB_MAD_FILTER_MGMT_CLASS) || - filter->mgmt_class == mad->mgmt_class) && + filter->mgmt_class == mad->mad_hdr.mgmt_class) && (!(filter->mask & TS_IB_MAD_FILTER_R_METHOD) || - filter->r_method == mad->r_method) && + filter->r_method == mad->mad_hdr.method) && (!(filter->mask & TS_IB_MAD_FILTER_ATTRIBUTE_ID) || - filter->attribute_id == be16_to_cpu(mad->attribute_id)) && + filter->attribute_id == be16_to_cpu(mad->mad_hdr.attr_id)) && 
(!(filter->mask & TS_IB_MAD_FILTER_DIRECTION) || filter->direction == direction); @@ -115,8 +115,8 @@ { u8 hop_pointer, hop_count; - hop_pointer = mad->route.directed.hop_pointer; - hop_count = mad->route.directed.hop_count; + hop_pointer = be16_to_cpu(mad->mad_hdr.class_specific) >> 8; + hop_count = be16_to_cpu(mad->mad_hdr.class_specific) & 0xff; /* * Outgoing MAD processing. "Outgoing" means from initiator to responder. @@ -146,7 +146,9 @@ if (hop_pointer == hop_count) { if (hop_count != 0) (TS_IB_MAD_SMP_DR_PAYLOAD(mad))->return_path[hop_pointer] = mad->port; - ++mad->route.directed.hop_pointer; + ++hop_pointer; + mad->mad_hdr.class_specific = + cpu_to_be16(hop_count | (hop_pointer << 8)); if (device->node_type == IB_NODE_SWITCH) { /* XXX switch */ @@ -210,7 +212,9 @@ /* C14-13:3 -- We're at the end of the DR segment of path */ if (hop_pointer == 1) { - --mad->route.directed.hop_pointer; + --hop_pointer; + mad->mad_hdr.class_specific = + cpu_to_be16(hop_count | (hop_pointer << 8)); if (device->node_type == IB_NODE_SWITCH) { /* XXX switch */ @@ -242,7 +246,7 @@ return 1; /* Check for unreasonable hop pointer. (C14-13:5) */ - if (mad->route.directed.hop_pointer > mad->route.directed.hop_count + 1) + if (hop_pointer > hop_count + 1) return 0; } return 1; @@ -280,7 +284,7 @@ } /* If MAD is Directed Route, we need to validate it and fix it up. */ - if ((mad->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && + if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && !ib_mad_validate_dr_smp(mad, device)) ret = IB_MAD_RESULT_SUCCESS; // As if device ignored packet. else @@ -294,8 +298,8 @@ device->name, mad->port, mad->dqpn, - mad->mgmt_class, - be16_to_cpu(mad->attribute_id)); + mad->mad_hdr.mgmt_class, + be16_to_cpu(mad->mad_hdr.attr_id)); /* If the packet was consumed, we don't want to let anyone else look at it.
* This is a special case for hardware (tavor) which uses the input queue Index: infiniband/core/dm_client_ioc_profile.c =================================================================== --- infiniband/core/dm_client_ioc_profile.c (revision 803) +++ infiniband/core/dm_client_ioc_profile.c (working copy) @@ -81,13 +81,12 @@ case TS_IB_CLIENT_RESPONSE_OK: { struct ib_dm_ioc_profile *ioc_profile_ptr = - (struct ib_dm_ioc_profile *) & packet->payload; + (struct ib_dm_ioc_profile *) & packet->data; struct ib_ioc_profile ioc_profile; ioc_profile.controller_id = TS_IB_DM_IOCPROFILE_GET_CONTROLLER_ID(be32_to_cpu - (packet-> - attribute_modifier)); + (packet->mad_hdr.attr_mod)); memcpy(ioc_profile.guid, ioc_profile_ptr->guid, sizeof(tTS_IB_GUID)); ioc_profile.vendor_id = @@ -153,7 +152,7 @@ case TS_IB_CLIENT_RESPONSE_ERROR: TS_TRACE(MOD_KERNEL_DM, T_VERBOSE, TRACE_KERNEL_IB_DM_GEN, "DM client IOU Info MAD status 0x%04x", - be16_to_cpu(packet->status)); + be16_to_cpu(packet->mad_hdr.status)); if (query->completion_func) { query->completion_func(query->transaction_id, -EINVAL, @@ -208,12 +207,12 @@ TS_IB_DM_IOCPROFILE_GET_CONTROLLER_ID (controller_id)); - query->transaction_id = mad.transaction_id; + query->transaction_id = mad.mad_hdr.tid; query->dlid = dst_port_lid; query->completion_func = completion_func; query->completion_arg = completion_arg; - *transaction_id = mad.transaction_id; + *transaction_id = mad.mad_hdr.tid; ib_client_query(&mad, timeout_jiffies, ib_ioc_profile_response, query); Index: infiniband/core/mad_static.c =================================================================== --- infiniband/core/mad_static.c (revision 812) +++ infiniband/core/mad_static.c (working copy) @@ -126,11 +126,11 @@ } memset(mad_in, 0, sizeof *mad_in); - mad_in->format_version = 1; - mad_in->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; - mad_in->class_version = 1; - mad_in->r_method = IB_MGMT_METHOD_GET; - mad_in->attribute_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); + 
mad_in->mad_hdr.base_version = 1; + mad_in->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + mad_in->mad_hdr.class_version = 1; + mad_in->mad_hdr.method = IB_MGMT_METHOD_GET; + mad_in->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); /* Request port info from the device */ if ((device->process_mad(device, IB_MAD_IGNORE_MKEY, port, 0xffff, mad_in, mad_out) & @@ -144,7 +144,7 @@ /* Edit the lid field in the returned port info. */ ib_smp_port_info_lid_set(IB_MAD_TO_SMP_DATA(mad_out), lid_base); ++lid_base; - mad_out->r_method = IB_MGMT_METHOD_SET; + mad_out->mad_hdr.method = IB_MGMT_METHOD_SET; /* Update the port info on the device */ if (!(device->process_mad(device, IB_MAD_IGNORE_MKEY, port, 0xffff, mad_out, mad_in) & Index: infiniband/core/sa_client_path_record.c =================================================================== --- infiniband/core/sa_client_path_record.c (revision 803) +++ infiniband/core/sa_client_path_record.c (working copy) @@ -68,7 +68,7 @@ case TS_IB_CLIENT_RESPONSE_OK: { struct ib_sa_payload *sa_payload = - (struct ib_sa_payload *) & packet->payload; + (struct ib_sa_payload *) & packet->data; struct ib_sa_path_record *path_ptr = (struct ib_sa_path_record *) sa_payload->admin_data; struct ib_path_record path_rec; @@ -107,7 +107,7 @@ case TS_IB_CLIENT_RESPONSE_ERROR: TS_TRACE(MOD_KERNEL_IB, T_VERBOSE, TRACE_KERNEL_IB_GEN, "SA client path record MAD status 0x%04x", - be16_to_cpu(packet->status)); + be16_to_cpu(packet->mad_hdr.status)); if (query->completion_func) { query->completion_func(query->transaction_id, -EINVAL, @@ -152,7 +152,7 @@ { struct ib_mad mad; struct ib_sa_payload *sa_payload = - (struct ib_sa_payload *) & mad.payload; + (struct ib_sa_payload *) & mad.data; struct ib_sa_path_record *path_rec = (struct ib_sa_path_record *) sa_payload->admin_data; struct ib_sa_path_record_query *query; @@ -164,11 +164,11 @@ tsIbSaClientMadInit(&mad, device, port); - mad.r_method = IB_MGMT_METHOD_GET; - mad.attribute_id = 
cpu_to_be16(TS_IB_SA_ATTRIBUTE_PATH_RECORD); - mad.attribute_modifier = 0xffffffff; /* match attributes */ + mad.mad_hdr.method = IB_MGMT_METHOD_GET; + mad.mad_hdr.attr_id = cpu_to_be16(TS_IB_SA_ATTRIBUTE_PATH_RECORD); + mad.mad_hdr.attr_mod = 0xffffffff; /* match attributes */ - query->transaction_id = mad.transaction_id; + query->transaction_id = mad.mad_hdr.tid; query->completion_func = completion_func; query->completion_arg = completion_arg; @@ -184,7 +184,7 @@ ib_client_query(&mad, timeout_jiffies, _tsIbPathRecordResponse, query); - *transaction_id = mad.transaction_id; + *transaction_id = mad.mad_hdr.tid; return 0; } Index: infiniband/core/sa_client_inform.c =================================================================== --- infiniband/core/sa_client_inform.c (revision 803) +++ infiniband/core/sa_client_inform.c (working copy) @@ -82,9 +82,9 @@ case TS_IB_CLIENT_RESPONSE_OK: { struct ib_sa_payload *sa_payload = - (struct ib_sa_payload *) & packet->payload; + (struct ib_sa_payload *) & packet->data; struct ib_sa_inform_info *mad_inform_info = - (struct ib_sa_inform_info *) sa_payload->admin_data; + (struct ib_sa_inform_info *) sa_payload->admin_data; struct ib_common_attrib_inform inform_info; TS_TRACE(MOD_KERNEL_IB, T_VERBOSE, TRACE_KERNEL_IB_GEN, @@ -133,7 +133,7 @@ case TS_IB_CLIENT_RESPONSE_ERROR: TS_TRACE(MOD_KERNEL_IB, T_VERBOSE, TRACE_KERNEL_IB_GEN, "SA client inform info MAD status 0x%04x", - be16_to_cpu(packet->status)); + be16_to_cpu(packet->mad_hdr.status)); if (query->completion_func) { query->completion_func(query->transaction_id, -EINVAL, @@ -174,9 +174,9 @@ { struct ib_mad mad; struct ib_sa_payload *sa_payload = - (struct ib_sa_payload *) & mad.payload; + (struct ib_sa_payload *) & mad.data; struct ib_sa_inform_info *mad_inform_info = - (struct ib_sa_inform_info *) sa_payload->admin_data; + (struct ib_sa_inform_info *) sa_payload->admin_data; struct ib_sa_inform_info_query *query; query = kmalloc(sizeof *query, GFP_ATOMIC); @@ -185,11 
+185,11 @@ } tsIbSaClientMadInit(&mad, device, port); - mad.r_method = IB_MGMT_METHOD_SET; - mad.attribute_id = cpu_to_be16(TS_IB_SA_ATTRIBUTE_INFORM_INFO); - mad.attribute_modifier = 0; + mad.mad_hdr.method = IB_MGMT_METHOD_SET; + mad.mad_hdr.attr_id = cpu_to_be16(TS_IB_SA_ATTRIBUTE_INFORM_INFO); + mad.mad_hdr.attr_mod = 0; - query->transaction_id = mad.transaction_id; + query->transaction_id = mad.mad_hdr.tid; query->completion_func = completion_func; query->completion_arg = completion_arg; @@ -218,7 +218,7 @@ cpu_to_be32(inform_info->define.vendor.vendor_id); } - *transaction_id = mad.transaction_id; + *transaction_id = mad.mad_hdr.tid; ib_client_query(&mad, timeout_jiffies, _tsIbInformResponse, query); Index: infiniband/core/sa_client_port_info.c =================================================================== --- infiniband/core/sa_client_port_info.c (revision 803) +++ infiniband/core/sa_client_port_info.c (working copy) @@ -88,9 +88,9 @@ case TS_IB_CLIENT_RESPONSE_OK: { struct ib_sa_payload *sa_payload = - (struct ib_sa_payload *) & packet->payload; + (struct ib_sa_payload *) & packet->data; struct ib_sa_port_info *port_info_ptr = - (struct ib_sa_port_info *) sa_payload->admin_data; + (struct ib_sa_port_info *) sa_payload->admin_data; struct ib_port_info port_info; memcpy(port_info.mkey, port_info_ptr->mkey, 8); @@ -199,7 +199,7 @@ case TS_IB_CLIENT_RESPONSE_ERROR: TS_TRACE(MOD_KERNEL_IB, T_VERBOSE, TRACE_KERNEL_IB_GEN, "SA client port info MAD status 0x%04x", - be16_to_cpu(packet->status)); + be16_to_cpu(packet->mad_hdr.status)); if (query->completion_func) { query->completion_func(query->transaction_id, -EINVAL, @@ -466,9 +466,9 @@ { struct ib_mad mad; struct ib_sa_payload *sa_payload = - (struct ib_sa_payload *) & mad.payload; + (struct ib_sa_payload *) & mad.data; struct ib_sa_port_info *port_info = - (struct ib_sa_port_info *) sa_payload->admin_data; + (struct ib_sa_port_info *) sa_payload->admin_data; struct ib_sa_port_info_query *query; query = 
kmalloc(sizeof *query, GFP_ATOMIC); @@ -477,19 +477,19 @@ } tsIbSaClientMadInit(&mad, device, port); - mad.r_method = IB_MGMT_METHOD_GET; - mad.attribute_id = cpu_to_be16(TS_IB_SA_ATTRIBUTE_PORT_INFO_RECORD); - mad.attribute_modifier = 0; + mad.mad_hdr.method = IB_MGMT_METHOD_GET; + mad.mad_hdr.attr_id = cpu_to_be16(TS_IB_SA_ATTRIBUTE_PORT_INFO_RECORD); + mad.mad_hdr.attr_mod = 0; sa_payload->sa_header.component_mask = cpu_to_be64(0x1ULL); /* port LID */ port_info->port_lid = cpu_to_be16(port_lid); - query->transaction_id = mad.transaction_id; + query->transaction_id = mad.mad_hdr.tid; query->completion_func = completion_func; query->completion_arg = completion_arg; - *transaction_id = mad.transaction_id; + *transaction_id = mad.mad_hdr.tid; ib_client_query(&mad, timeout_jiffies, _tsIbPortInfoResponse, query); @@ -514,20 +514,20 @@ } tsIbSaClientMadInit(&mad, device, port); - mad.r_method = TS_IB_SA_METHOD_GET_TABLE; - mad.attribute_id = cpu_to_be16(TS_IB_SA_ATTRIBUTE_PORT_INFO_RECORD); - mad.attribute_modifier = 0; + mad.mad_hdr.method = TS_IB_SA_METHOD_GET_TABLE; + mad.mad_hdr.attr_id = cpu_to_be16(TS_IB_SA_ATTRIBUTE_PORT_INFO_RECORD); + mad.mad_hdr.attr_mod = 0; /* rmpp header init */ rmpp_mad = (struct ib_client_rmpp_mad *) & mad; rmpp_mad->version = 1; rmpp_mad->type = TS_IB_CLIENT_RMPP_TYPE_DATA; - query->transaction_id = mad.transaction_id; + query->transaction_id = mad.mad_hdr.tid; query->completion_func = completion_func; query->completion_arg = completion_arg; - *transaction_id = mad.transaction_id; + *transaction_id = mad.mad_hdr.tid; ib_rmpp_client_query(&mad, timeout_jiffies, sizeof(struct ib_sa_header), _tsIbPortInfoTblResponse, query); Index: infiniband/core/sa_client_node_info.c =================================================================== --- infiniband/core/sa_client_node_info.c (revision 803) +++ infiniband/core/sa_client_node_info.c (working copy) @@ -70,9 +70,9 @@ case TS_IB_CLIENT_RESPONSE_OK: { struct ib_sa_payload *sa_payload = - 
(struct ib_sa_payload *) & packet->payload; + (struct ib_sa_payload *) & packet->data; struct ib_sa_node_info *node_info_ptr = - (struct ib_sa_node_info *) sa_payload->admin_data; + (struct ib_sa_node_info *) sa_payload->admin_data; struct ib_node_info node_info; memset(&node_info, 0, sizeof(node_info)); @@ -115,7 +115,7 @@ case TS_IB_CLIENT_RESPONSE_ERROR: TS_TRACE(MOD_KERNEL_IB, T_VERBOSE, TRACE_KERNEL_IB_GEN, "SA client node info MAD status 0x%04x", - be16_to_cpu(packet->status)); + be16_to_cpu(packet->mad_hdr.status)); if (query->completion_func) { query->completion_func(query->transaction_id, -EINVAL, @@ -156,9 +156,9 @@ { struct ib_mad mad; struct ib_sa_payload *sa_payload = - (struct ib_sa_payload *) & mad.payload; + (struct ib_sa_payload *) & mad.data; struct ib_sa_node_info *node_info = - (struct ib_sa_node_info *) sa_payload->admin_data; + (struct ib_sa_node_info *) sa_payload->admin_data; struct ib_sa_node_info_query *query; query = kmalloc(sizeof *query, GFP_ATOMIC); @@ -167,11 +167,11 @@ } tsIbSaClientMadInit(&mad, device, port); - mad.r_method = IB_MGMT_METHOD_GET; - mad.attribute_id = cpu_to_be16(TS_IB_SA_ATTRIBUTE_NODE_RECORD); - mad.attribute_modifier = 0; + mad.mad_hdr.method = IB_MGMT_METHOD_GET; + mad.mad_hdr.attr_id = cpu_to_be16(TS_IB_SA_ATTRIBUTE_NODE_RECORD); + mad.mad_hdr.attr_mod = 0; - query->transaction_id = mad.transaction_id; + query->transaction_id = mad.mad_hdr.tid; query->completion_func = completion_func; query->completion_arg = completion_arg; @@ -179,7 +179,7 @@ node_info->port_lid = port_lid; - *transaction_id = mad.transaction_id; + *transaction_id = mad.mad_hdr.tid; ib_client_query(&mad, timeout_jiffies, _tsIbNodeInfoResponse, query); Index: infiniband/core/useraccess_mad.c =================================================================== --- infiniband/core/useraccess_mad.c (revision 803) +++ infiniband/core/useraccess_mad.c (working copy) @@ -185,8 +185,8 @@ mad->slid, mad->port, mad->dqpn, - mad->mgmt_class, - 
mad->r_method, be16_to_cpu(mad->attribute_id)); + mad->mad_hdr.mgmt_class, + mad->mad_hdr.method, be16_to_cpu(mad->mad_hdr.attr_id)); down(&filter->priv->mad_sem); if (filter->priv->mad_queue_length < filter->priv->max_mad_queue_length) { Index: infiniband/core/cm_proc.c =================================================================== --- infiniband/core/cm_proc.c (revision 803) +++ infiniband/core/cm_proc.c (working copy) @@ -60,21 +60,21 @@ void ib_cm_count_receive(struct ib_mad *packet) { - u16 attribute_id = be16_to_cpu(packet->attribute_id); + u16 attribute_id = be16_to_cpu(packet->mad_hdr.attr_id); if (attribute_id < max_id) atomic_inc(&cm_packet_count[attribute_id].received); } void ib_cm_count_send(struct ib_mad *packet) { - u16 attribute_id = be16_to_cpu(packet->attribute_id); + u16 attribute_id = be16_to_cpu(packet->mad_hdr.attr_id); if (attribute_id < max_id) atomic_inc(&cm_packet_count[attribute_id].sent); } void ib_cm_count_resend(struct ib_mad *packet) { - u16 attribute_id = be16_to_cpu(packet->attribute_id); + u16 attribute_id = be16_to_cpu(packet->mad_hdr.attr_id); if (attribute_id < max_id) { atomic_inc(&cm_packet_count[attribute_id].sent); atomic_inc(&cm_packet_count[attribute_id].resent); Index: infiniband/core/generate_pkt_access.pl =================================================================== --- infiniband/core/generate_pkt_access.pl (revision 803) +++ infiniband/core/generate_pkt_access.pl (working copy) @@ -132,7 +132,7 @@ }; #endif // _IB_PRINTER_DEFINED_ -#define IB_MAD_TO_${class_type}_DATA(mad) (&(mad)->payload[$payload_offset]) +#define IB_MAD_TO_${class_type}_DATA(mad) (&(mad)->data[$payload_offset]) HEADER_TOP @@ -1184,7 +1184,7 @@ print "\n"; print "void $func_name(struct ib_printer *printer, struct ib_mad *mad)\n"; print "{\n"; - print " __u16 attribute = be16_to_cpu(mad->attribute_id);\n\n"; + print " __u16 attribute = be16_to_cpu(mad->mad_hdr.attr_id);\n\n"; print " switch (attribute) {\n"; foreach $attrib_num (sort 
numerically keys(%known_attributes)) { Index: infiniband/core/dm_client_query.c =================================================================== --- infiniband/core/dm_client_query.c (revision 803) +++ infiniband/core/dm_client_query.c (working copy) @@ -42,10 +42,10 @@ { memset(packet, 0, sizeof *packet); - packet->format_version = 1; - packet->mgmt_class = IB_MGMT_CLASS_DEV_MGT; - packet->class_version = TS_IB_DM_CLASS_VERSION; - packet->transaction_id = ib_client_alloc_tid(); + packet->mad_hdr.base_version = 1; + packet->mad_hdr.mgmt_class = IB_MGMT_CLASS_DEVICE_MGMT; + packet->mad_hdr.class_version = TS_IB_DM_CLASS_VERSION; + packet->mad_hdr.tid = ib_client_alloc_tid(); packet->device = device; packet->pkey_index = 0; @@ -55,9 +55,9 @@ packet->sl = 0; packet->sqpn = 1; packet->dqpn = dst_qpn; - packet->r_method = r_method; - packet->attribute_id = cpu_to_be16(attribute_id); - packet->attribute_modifier = cpu_to_be32(attribute_modifier); + packet->mad_hdr.method = r_method; + packet->mad_hdr.attr_id = cpu_to_be16(attribute_id); + packet->mad_hdr.attr_mod = cpu_to_be32(attribute_modifier); packet->has_grh = 0; Index: infiniband/core/sa_client_query.c =================================================================== --- infiniband/core/sa_client_query.c (revision 803) +++ infiniband/core/sa_client_query.c (working copy) @@ -44,11 +44,11 @@ memset(packet, 0, sizeof *packet); - packet->format_version = 1; - packet->mgmt_class = IB_MGMT_CLASS_SUBN_ADM; - packet->class_version = TS_IB_SA_CLASS_VERSION; - packet->transaction_id = ib_client_alloc_tid(); - packet->attribute_modifier = 0xffffffff; + packet->mad_hdr.base_version = 1; + packet->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_ADM; + packet->mad_hdr.class_version = TS_IB_SA_CLASS_VERSION; + packet->mad_hdr.tid = ib_client_alloc_tid(); + packet->mad_hdr.attr_mod = 0xffffffff; packet->device = device; packet->port = port; Index: infiniband/core/cm_passive.c 
=================================================================== --- infiniband/core/cm_passive.c (revision 803) +++ infiniband/core/cm_passive.c (working copy) @@ -60,8 +60,8 @@ if (reply_data && reply_size > 0) memcpy(ib_cm_rep_private_data_get(&connection->mad), reply_data, reply_size); - connection->mad.attribute_id = cpu_to_be16(IB_COM_MGT_REP); - connection->mad.transaction_id = cpu_to_be64(connection->transaction_id); + connection->mad.mad_hdr.attr_id = cpu_to_be16(IB_COM_MGT_REP); + connection->mad.mad_hdr.tid = cpu_to_be64(connection->transaction_id); ib_cm_rep_local_comm_id_set (&connection->mad, connection->local_comm_id); ib_cm_rep_remote_comm_id_set (&connection->mad, connection->remote_comm_id); @@ -372,7 +372,7 @@ connection->state = IB_CM_STATE_REQ_RECEIVED; connection->cm_retry_count = 0; - connection->transaction_id = be64_to_cpu(packet->transaction_id); + connection->transaction_id = be64_to_cpu(packet->mad_hdr.tid); ib_cm_connection_insert_remote(connection); } @@ -395,7 +395,7 @@ packet->pkey_index, packet->slid, packet->sqpn, - be64_to_cpu(packet->transaction_id), + be64_to_cpu(packet->mad_hdr.tid), 0, ib_cm_req_local_comm_id_get(packet), IB_REJ_REQ, @@ -483,7 +483,7 @@ then treat this REQ as a resend. Otherwise our connection is a stale connection. 
(See section 12.9.8.3.1 of the IB spec) */ if (ib_cm_req_local_comm_id_get(packet) == connection->remote_comm_id && - be16_to_cpu(connection->mad.attribute_id) == IB_COM_MGT_REP && + be16_to_cpu(connection->mad.mad_hdr.attr_id) == IB_COM_MGT_REP && time_after(connection->establish_jiffies + ib_cm_timeout_to_jiffies(connection->cm_response_timeout), jiffies)) { Index: infiniband/core/cm_active.c =================================================================== --- infiniband/core/cm_active.c (revision 803) +++ infiniband/core/cm_active.c (working copy) @@ -59,8 +59,8 @@ ib_mad_build_header(&connection->mad); - connection->mad.attribute_id = cpu_to_be16(IB_COM_MGT_REQ); - connection->mad.transaction_id = cpu_to_be64(connection->transaction_id); + connection->mad.mad_hdr.attr_id = cpu_to_be16(IB_COM_MGT_REQ); + connection->mad.mad_hdr.tid = cpu_to_be64(connection->transaction_id); /* Fields are in order of the IB spec. 12.6.5 */ @@ -237,8 +237,8 @@ ib_mad_build_header(&connection->mad); - connection->mad.attribute_id = cpu_to_be16(IB_COM_MGT_RTU); - connection->mad.transaction_id = cpu_to_be64(connection->transaction_id); + connection->mad.mad_hdr.attr_id = cpu_to_be16(IB_COM_MGT_RTU); + connection->mad.mad_hdr.tid = cpu_to_be64(connection->transaction_id); ib_cm_rtu_local_comm_id_set (&connection->mad, connection->local_comm_id); ib_cm_rtu_remote_comm_id_set(&connection->mad, connection->remote_comm_id); @@ -374,7 +374,7 @@ if (connection->state == IB_CM_STATE_ESTABLISHED) { /* Resend RTU if connection is established, but make sure we haven't already sent some other kind of CM packet. 
*/ - if (connection->mad.attribute_id == cpu_to_be16(IB_COM_MGT_RTU)) { + if (connection->mad.mad_hdr.attr_id == cpu_to_be16(IB_COM_MGT_RTU)) { ib_cm_count_resend(&connection->mad); result = ib_mad_send(&connection->mad); if (result) { @@ -484,7 +484,7 @@ packet->pkey_index, packet->slid, packet->sqpn, - be64_to_cpu(packet->transaction_id), + be64_to_cpu(packet->mad_hdr.tid), ib_cm_rep_remote_comm_id_get(packet), ib_cm_rep_local_comm_id_get(packet), IB_REJ_REP, Index: infiniband/hw/mthca/mthca_provider.c =================================================================== --- infiniband/hw/mthca/mthca_provider.c (revision 812) +++ infiniband/hw/mthca/mthca_provider.c (working copy) @@ -51,11 +51,11 @@ props->fw_ver = to_mdev(ibdev)->fw_ver; memset(in_mad, 0, sizeof *in_mad); - in_mad->format_version = 1; - in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; - in_mad->class_version = 1; - in_mad->r_method = IB_MGMT_METHOD_GET; - in_mad->attribute_id = cpu_to_be16(IB_SMP_ATTRIB_NODE_INFO); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_NODE_INFO); err = mthca_MAD_IFC(to_mdev(ibdev), 1, 1, in_mad, out_mad, @@ -67,12 +67,12 @@ goto out; } - props->vendor_id = be32_to_cpup((u32 *) (out_mad->payload + 76)) & + props->vendor_id = be32_to_cpup((u32 *) (out_mad->data + 76)) & 0xffffff; - props->vendor_part_id = be16_to_cpup((u16 *) (out_mad->payload + 70)); - props->hw_ver = be16_to_cpup((u16 *) (out_mad->payload + 72)); - memcpy(&props->sys_image_guid, out_mad->payload + 44, 8); - memcpy(&props->node_guid, out_mad->payload + 52, 8); + props->vendor_part_id = be16_to_cpup((u16 *) (out_mad->data + 70)); + props->hw_ver = be16_to_cpup((u16 *) (out_mad->data + 72)); + memcpy(&props->sys_image_guid, out_mad->data + 44, 8); + memcpy(&props->node_guid, out_mad->data + 52, 8); err = 0; out: @@ 
-95,12 +95,12 @@ goto out; memset(in_mad, 0, sizeof *in_mad); - in_mad->format_version = 1; - in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; - in_mad->class_version = 1; - in_mad->r_method = IB_MGMT_METHOD_GET; - in_mad->attribute_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); - in_mad->attribute_modifier = cpu_to_be32(port); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(port); err = mthca_MAD_IFC(to_mdev(ibdev), 1, port, in_mad, out_mad, @@ -112,15 +112,15 @@ goto out; } - props->lid = be16_to_cpup((u16 *) (out_mad->payload + 56)); - props->lmc = (*(u8 *) (out_mad->payload + 74)) & 0x7; - props->sm_lid = be16_to_cpup((u16 *) (out_mad->payload + 58)); - props->sm_sl = (*(u8 *) (out_mad->payload + 76)) & 0xf; - props->state = (*(u8 *) (out_mad->payload + 72)) & 0xf; - props->port_cap_flags = be32_to_cpup((u32 *) (out_mad->payload + 60)); + props->lid = be16_to_cpup((u16 *) (out_mad->data + 56)); + props->lmc = (*(u8 *) (out_mad->data + 74)) & 0x7; + props->sm_lid = be16_to_cpup((u16 *) (out_mad->data + 58)); + props->sm_sl = (*(u8 *) (out_mad->data + 76)) & 0xf; + props->state = (*(u8 *) (out_mad->data + 72)) & 0xf; + props->port_cap_flags = be32_to_cpup((u32 *) (out_mad->data + 60)); props->gid_tbl_len = to_mdev(ibdev)->limits.gid_table_len; props->pkey_tbl_len = to_mdev(ibdev)->limits.pkey_table_len; - props->qkey_viol_cntr = be16_to_cpup((u16 *) (out_mad->payload + 88)); + props->qkey_viol_cntr = be16_to_cpup((u16 *) (out_mad->data + 88)); out: kfree(in_mad); @@ -149,12 +149,12 @@ goto out; memset(in_mad, 0, sizeof *in_mad); - in_mad->format_version = 1; - in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; - in_mad->class_version = 1; - in_mad->r_method = IB_MGMT_METHOD_GET; - in_mad->attribute_id = 
cpu_to_be16(IB_SMP_ATTRIB_PKEY_TABLE); - in_mad->attribute_modifier = cpu_to_be32(index / 32); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PKEY_TABLE); + in_mad->mad_hdr.attr_mod = cpu_to_be32(index / 32); err = mthca_MAD_IFC(to_mdev(ibdev), 1, port, in_mad, out_mad, @@ -166,7 +166,7 @@ goto out; } - *pkey = ((u16 *) (out_mad->payload + 40))[index % 32]; + *pkey = ((u16 *) (out_mad->data + 40))[index % 32]; out: kfree(in_mad); @@ -188,12 +188,12 @@ goto out; memset(in_mad, 0, sizeof *in_mad); - in_mad->format_version = 1; - in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; - in_mad->class_version = 1; - in_mad->r_method = IB_MGMT_METHOD_GET; - in_mad->attribute_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); - in_mad->attribute_modifier = cpu_to_be32(port); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(port); err = mthca_MAD_IFC(to_mdev(ibdev), 1, port, in_mad, out_mad, @@ -205,15 +205,15 @@ goto out; } - memcpy(gid->raw, out_mad->payload + 48, 8); + memcpy(gid->raw, out_mad->data + 48, 8); memset(in_mad, 0, sizeof *in_mad); - in_mad->format_version = 1; - in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; - in_mad->class_version = 1; - in_mad->r_method = IB_MGMT_METHOD_GET; - in_mad->attribute_id = cpu_to_be16(IB_SMP_ATTRIB_GUID_INFO); - in_mad->attribute_modifier = cpu_to_be32(index / 8); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_GUID_INFO); + 
in_mad->mad_hdr.attr_mod = cpu_to_be32(index / 8); err = mthca_MAD_IFC(to_mdev(ibdev), 1, port, in_mad, out_mad, @@ -225,7 +225,7 @@ goto out; } - memcpy(gid->raw + 8, out_mad->payload + 40 + (index % 8) * 16, 8); + memcpy(gid->raw + 8, out_mad->data + 40 + (index % 8) * 16, 8); out: kfree(in_mad); Index: infiniband/hw/mthca/mthca_mad.c =================================================================== --- infiniband/hw/mthca/mthca_mad.c (revision 812) +++ infiniband/hw/mthca/mthca_mad.c (working copy) @@ -50,17 +50,17 @@ struct ib_event event; if (mad->dqpn == 0 && - (mad->mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || - mad->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && - mad->r_method == IB_MGMT_METHOD_SET) { - if (mad->attribute_id == cpu_to_be16(IB_SM_PORT_INFO)) { + (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && + mad->mad_hdr.method == IB_MGMT_METHOD_SET) { + if (mad->mad_hdr.attr_id == cpu_to_be16(IB_SM_PORT_INFO)) { event.device = ibdev; event.event = IB_EVENT_LID_CHANGE; event.element.port_num = port_num; ib_dispatch_event(&event); } - if (mad->attribute_id == cpu_to_be16(IB_SM_PKEY_TABLE)) { + if (mad->mad_hdr.attr_id == cpu_to_be16(IB_SM_PKEY_TABLE)) { event.device = ibdev; event.event = IB_EVENT_PKEY_CHANGE; event.element.port_num = port_num; @@ -80,8 +80,8 @@ u8 status; /* Forward locally generated traps to the SM */ - if (in_mad->mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED && - in_mad->r_method == IB_MGMT_METHOD_TRAP && + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED && + in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP && slid == 0) { struct ib_sm_path sm_path; @@ -102,25 +102,25 @@ * Only handle PMA and Mellanox vendor-specific class gets and * sets for other classes. 
*/ - if (in_mad->mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || - in_mad->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { - if (in_mad->r_method != IB_MGMT_METHOD_GET && - in_mad->r_method != IB_MGMT_METHOD_SET && - in_mad->r_method != IB_MGMT_METHOD_TRAP_REPRESS) + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_TRAP_REPRESS) return IB_MAD_RESULT_SUCCESS; /* * Don't process SMInfo queries or vendor-specific * MADs -- the SMA can't handle them. */ - if (be16_to_cpu(in_mad->attribute_id) == IB_SM_SM_INFO || - be16_to_cpu(in_mad->attribute_id) >= IB_SM_VENDOR_START) + if (be16_to_cpu(in_mad->mad_hdr.attr_id) == IB_SM_SM_INFO || + be16_to_cpu(in_mad->mad_hdr.attr_id) >= IB_SM_VENDOR_START) return IB_MAD_RESULT_SUCCESS; - } else if (in_mad->mgmt_class == IB_MGMT_CLASS_PERF || - in_mad->mgmt_class == MTHCA_VENDOR_CLASS1 || - in_mad->mgmt_class == MTHCA_VENDOR_CLASS2) { - if (in_mad->r_method != IB_MGMT_METHOD_GET && - in_mad->r_method != IB_MGMT_METHOD_SET) + } else if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT || + in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS1 || + in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS2) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET) return IB_MAD_RESULT_SUCCESS; } else return IB_MAD_RESULT_SUCCESS; @@ -144,10 +144,10 @@ smp_snoop(ibdev, in_mad, port_num); /* set return bit in status of directed route responses */ - if (in_mad->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) - out_mad->status |= cpu_to_be16(1 << 15); + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + out_mad->mad_hdr.status |= cpu_to_be16(1 << 15); - if (in_mad->r_method == IB_MGMT_METHOD_TRAP_REPRESS) + if (in_mad->mad_hdr.method == 
IB_MGMT_METHOD_TRAP_REPRESS) /* no response for trap repress */ return IB_MAD_RESULT_SUCCESS; From halr at voltaire.com Mon Sep 13 21:46:37 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 14 Sep 2004 00:46:37 -0400 Subject: [openib-general] [PATCH] ib_verbs.h: Update to new process_mad AP Message-ID: <1095137196.1947.413.camel@localhost.localdomain> ib_verbs.h: Update to new process_mad API Integrated Roland's patch for this from his branch onto openib-candidate branch Index: ib_verbs.h =================================================================== --- ib_verbs.h (revision 817) +++ ib_verbs.h (revision 816) @@ -632,13 +632,6 @@ IB_MAD_IGNORE_MKEY = 1 }; -enum ib_mad_result { - IB_MAD_RESULT_FAILURE = 0, /* (!SUCCESS is the important flag) */ - IB_MAD_RESULT_SUCCESS = 1 << 0, /* MAD was successfully processed */ - IB_MAD_RESULT_REPLY = 1 << 1, /* Reply packet needs to be sent */ - IB_MAD_RESULT_CONSUMED = 1 << 2 /* Packet consumed: stop processing */ -}; - #define IB_DEVICE_NAME_MAX 64 struct ib_device { @@ -755,8 +748,6 @@ int (*req_n_notify_cq)(struct ib_cq *cq, int wc_cnt); int (*process_mad)(struct ib_device *device, int process_mad_flags, - u8 port_num, - u16 source_lid, struct ib_mad *in_mad, struct ib_mad *out_mad); From mst at mellanox.co.il Mon Sep 13 23:49:36 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 14 Sep 2004 09:49:36 +0300 Subject: [openib-general] ib_mad.c comments In-Reply-To: <20040913095105.31b16c06.mshefty@ichips.intel.com> References: <20040908164739.3e9c8723.mshefty@ichips.intel.com> <524qm8gmxg.fsf@topspin.com> <1094901754.1752.1173.camel@localhost.localdomain> <20040913095105.31b16c06.mshefty@ichips.intel.com> Message-ID: <20040914064936.GC25611@mellanox.co.il> Hello! Quoting r. Sean Hefty (mshefty at ichips.intel.com) "Re: [openib-general] ib_mad.c comments": > Personally, for an initial implementation, I'd just go with posting > work requests, and generate completions for sends that cannot be > posted. 
This should be fairly trivial to implement, yet still work.

In that case, it would be sufficient to return an error code to the caller.
If the caller wants to re-use the completion routine at this point,
let him.

MST

From sean.hefty at intel.com Tue Sep 14 00:19:36 2004
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 14 Sep 2004 00:19:36 -0700
Subject: [openib-general] ib_mad.c comments
In-Reply-To: <20040914064936.GC25611@mellanox.co.il>
Message-ID:

>> Personally, for an initial implementation, I'd just go with posting
>> work requests, and generate completions for sends that cannot be
>> posted. This should be fairly trivial to implement, yet still work.
>
>In that case, it would be sufficient to return an error code to the caller.
>If the caller wants to re-use the completion routine at this point,
>let him.

For sends without a timeout specified, I agree that it makes sense to just
return a failure, but for sends with a timeout, this would prevent the
caller from taking advantage of the access layer tracking the timeout
period.

For example, if the caller tries to resend the MAD immediately, it's likely
to fail again. It seems like it would be beneficial to indicate the failure
after a given time period, which may give a retry a better chance of
succeeding.

Thinking about this a little more, I'm guessing that the access layer will
already have to allocate a structure to track all sends in order to match
responses with requests, so queuing sends is probably just as easy to
implement now as not.

From mst at mellanox.co.il Tue Sep 14 00:42:27 2004
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 14 Sep 2004 10:42:27 +0300
Subject: [openib-general] ib_mad.c comments
In-Reply-To:
References: <20040914064936.GC25611@mellanox.co.il>
Message-ID: <20040914074227.GE25611@mellanox.co.il>

Hello, Sean!
I am not arguing against queuing sends and processing them later
(if only because it guarantees non-starving processing of the mads).
It's only generating fake failed completions that does not make sense to me.

Quoting r. Sean Hefty (sean.hefty at intel.com) "RE: [openib-general] ib_mad.c comments":
> >> Personally, for an initial implementation, I'd just go with posting
> >> work requests, and generate completions for sends that cannot be
> >> posted. This should be fairly trivial to implement, yet still work.
> >
> >In that case, it would be sufficient to return an error code to the caller.
> >If the caller wants to re-use the completion routine at this point,
> >let him.
>
> For sends without a timeout specified, I agree that it makes sense to just
> return a failure, but for sends with a timeout, this would prevent the
> caller from taking advantage of the access layer tracking the timeout
> period.
>
> For example, if the caller tries to resend the MAD immediately, it's likely
> to fail again. It seems like it would be beneficial to indicate the failure
> after a given time period, which may give a retry a better chance of
> succeeding.

But this is a question of policy, is it not?
If we need a generic timeout function, let's have it
and let the caller use it if that's what is needed.
No need to force the policy.

> Thinking about this a little more, I'm guessing that the access layer will
> already have to allocate a structure to track all sends in order to match
> responses with requests, so queuing sends is probably just as easy to
> implement now as not.

Here you are talking about the option to queue the sends,
and although you likely need a separate queue for these,
I am not against it, it's only generating fake failed completions that
does not make sense to me.
MST From halr at voltaire.com Tue Sep 14 05:54:14 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 14 Sep 2004 08:54:14 -0400 Subject: [openib-general] [PATCH] [TRIVIAL] ib_mad.h: Add in management base version definition Message-ID: <1095166453.1830.1.camel@localhost.localdomain> ib_mad.h: Add in management base version definition (This is to Roland's branch) Index: ib_mad.h =================================================================== --- ib_mad.h (revision 821) +++ ib_mad.h (working copy) @@ -30,6 +30,9 @@ #include +/* Management base version */ +#define IB_MGMT_BASE_VERSION 1 + /* Management classes */ #define IB_MGMT_CLASS_SUBN_LID_ROUTED 0x01 #define IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE 0x81 From halr at voltaire.com Tue Sep 14 07:29:12 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 14 Sep 2004 10:29:12 -0400 Subject: [openib-general] [PATCH] ib_mad: Consolidate registration and agent locks Message-ID: <1095172151.2285.29.camel@localhost.localdomain> ib_mad: Consolidate registration and agent locks Index: access/ib_mad.c =================================================================== --- access/ib_mad.c (revision 821) +++ access/ib_mad.c (working copy) @@ -86,7 +86,7 @@ /* - * ib_register_mad_agent eg - Register to send/receive MADs + * ib_register_mad_agent - Register to send/receive MADs */ struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device, u8 port, @@ -130,14 +130,14 @@ goto error1; } if (mad_reg_req->mgmt_class >= MAX_MGMT_CLASS) { - /* IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE is only one currently allowed */ + /* IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE is the only one currently allowed */ if (mad_reg_req->mgmt_class != IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { ret = ERR_PTR(-EINVAL); goto error1; } } else if (mad_reg_req->mgmt_class == 0) { /* - * Class 0 is reserved in IBA and is used here for + * Class 0 is reserved in IBA and is used for * aliasing of IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE */ ret = ERR_PTR(-EINVAL); @@ 
-213,24 +213,21 @@ mad_agent->hi_tid = ++ib_mad_client_id; /* Add mad agent into agent list */ - spin_lock_irqsave(&port_priv->agent_lock, flags); list_add_tail(&mad_agent_priv->agent_list, &port_priv->agent_list); - spin_unlock_irqrestore(&port_priv->agent_lock, flags); ret2 = add_mad_reg_req(mad_reg_req, mad_agent_priv); - spin_unlock_irqrestore(&port_priv->reg_lock, flags); if (ret2) { ret = ERR_PTR(ret2); goto error3; } + spin_unlock_irqrestore(&port_priv->reg_lock, flags); return mad_agent; error3: /* Remove mad agent from agent list */ - spin_lock_irqsave(&port_priv->agent_lock, flags); list_del(&mad_agent_priv->agent_list); - spin_unlock_irqrestore(&port_priv->agent_lock, flags); + spin_unlock_irqrestore(&port_priv->reg_lock, flags); /* Release allocated structures */ kfree(reg_req); @@ -252,9 +249,14 @@ int not_found = 1; unsigned long flags, flags2; + /* + * Rather than walk all the mad agent lists on all the mad ports, + * might use device in mad_agent and port number from mad agent QP + * but this approach has some downsides + */ spin_lock_irqsave(&ib_mad_port_list_lock, flags); list_for_each_entry(entry, &ib_mad_port_list, port_list) { - spin_lock_irqsave(&entry->agent_lock, flags2); + spin_lock_irqsave(&entry->reg_lock, flags2); list_for_each_entry_safe(entry2, temp, &entry->agent_list, agent_list) { if (entry2->agent == mad_agent) { @@ -269,7 +271,7 @@ break; } } - spin_unlock_irqrestore(&entry->agent_lock, flags2); + spin_unlock_irqrestore(&entry->reg_lock, flags2); } spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); return not_found; @@ -664,7 +666,7 @@ { struct ib_mad_recv_wc recv_wc; struct ib_mad_private *recv; - unsigned long flags, flags2; + unsigned long flags; u32 qp_num; struct ib_mad_agent_private *mad_agent; int solicited; @@ -711,7 +713,6 @@ /* Determine corresponding MAD agent for incoming receive MAD */ spin_lock_irqsave(&port_priv->reg_lock, flags); - spin_lock_irqsave(&port_priv->agent_lock, flags2); /* First, determine whether 
MAD was solicited */ solicited = solicited_mad(recv->header.recv_buf.mad); /* Now, find the mad agent */ @@ -730,7 +731,6 @@ mad_agent->agent->recv_handler(mad_agent->agent, &recv_wc); } - spin_unlock_irqrestore(&port_priv->agent_lock, flags2); spin_unlock_irqrestore(&port_priv->reg_lock, flags); /* Repost receive request */ @@ -1324,7 +1324,6 @@ port_priv->qp[i]->qp_num); } - spin_lock_init(&port_priv->agent_lock); spin_lock_init(&port_priv->reg_lock); spin_lock_init(&port_priv->recv_list_lock); spin_lock_init(&port_priv->send_list_lock); Index: access/ib_mad_priv.h =================================================================== --- access/ib_mad_priv.h (revision 821) +++ access/ib_mad_priv.h (working copy) @@ -133,11 +133,9 @@ struct ib_pd *pd; struct ib_mr *mr; - spinlock_t agent_lock; - struct list_head agent_list; - spinlock_t reg_lock; struct ib_mad_mgmt_class_table *version[MAX_MGMT_VERSION]; + struct list_head agent_list; spinlock_t send_list_lock; struct list_head send_posted_mad_list; From halr at voltaire.com Tue Sep 14 08:23:55 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 14 Sep 2004 11:23:55 -0400 Subject: [openib-general] [PATCH] ib_mad.c: Filter SMI and GSI packets if received on wrong QPN Message-ID: <1095175435.1830.78.camel@localhost.localdomain> ib_mad.c: Filter SMI and GSI packets if received on wrong QPN Index: ib_mad.c =================================================================== --- ib_mad.c (revision 822) +++ ib_mad.c (working copy) @@ -317,6 +317,7 @@ /* Initialize MAD send WR tracking structure */ mad_send_wr->agent = mad_agent; mad_send_wr->wr_id = cur_send_wr->wr_id; + /* Timeout valid only when MAD is a request !!! 
*/ mad_send_wr->timeout_ms = cur_send_wr->wr.ud.timeout_ms; wr.list.next = NULL; @@ -636,7 +637,7 @@ return mad_agent; } -static int validate_mad(struct ib_mad *mad) +static int validate_mad(struct ib_mad *mad, u32 qp_num) { int valid = 0; @@ -647,14 +648,15 @@ goto ret; } - /* Need DQPN from incoming MAD !!! */ /* Filter SMI packets sent to other than QP0 */ if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED) || (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)) { - + if (qp_num == 0) + valid = 1; } else { /* Filter GSI packets sent to QP0 */ - + if (qp_num != 0) + valid = 1; } ret: @@ -708,7 +710,7 @@ } /* Validate MAD */ - if (!validate_mad(recv->header.recv_buf.mad)) + if (!validate_mad(recv->header.recv_buf.mad, qp_num)) goto ret; /* Determine corresponding MAD agent for incoming receive MAD */ @@ -757,6 +759,8 @@ goto error; } + /* Check whether timeout was requested !!! */ + /* Remove from posted send MAD list */ list_del(&send_wr->send_list); port_priv->send_posted_mad_count--; From mshefty at ichips.intel.com Tue Sep 14 08:52:46 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 14 Sep 2004 08:52:46 -0700 Subject: [openib-general] [PATCH] ib_mad.c: Filter SMI and GSI packets if received on wrong QPN In-Reply-To: <1095175435.1830.78.camel@localhost.localdomain> References: <1095175435.1830.78.camel@localhost.localdomain> Message-ID: <20040914085246.0216ced6.mshefty@ichips.intel.com> On Tue, 14 Sep 2004 11:23:55 -0400 Hal Rosenstock wrote: > ib_mad.c: Filter SMI and GSI packets if received on wrong QPN Looks good to me. By the way, looking at the various SMI implementations available, I think that it makes sense to just merge that code into ib_mad.c. A good portion of the SMI validates the MAD and determines if the MAD should be routed to the local device via the process_mad routine. Thoughts? Also, does anyone know if we have a need to perform software loopback in the access layer? 
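Editorial illustration (not part of the original thread): the filtering rule in the patch above can be tried outside the kernel. What follows is a minimal user-space sketch under stated assumptions, not the code in ib_mad.c. The helper name `mad_class_matches_qp` is hypothetical; the two class values are the `IB_MGMT_CLASS_SUBN_*` definitions quoted from ib_mad.h earlier in this archive. It models only the class/QPN comparison and none of the other checks that validate_mad may perform.

```c
#include <stdint.h>

/* Class values as quoted from ib_mad.h in this thread. */
#define IB_MGMT_CLASS_SUBN_LID_ROUTED     0x01
#define IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE 0x81

/*
 * Hypothetical helper (name is an assumption, not from the patch):
 * returns 1 when a MAD of the given management class is acceptable
 * on the given QP number, 0 otherwise.
 *
 * SMI classes are accepted only on QP0; every other class is treated
 * as GSI and accepted on any QP except QP0.
 */
int mad_class_matches_qp(uint8_t mgmt_class, uint32_t qp_num)
{
	if (mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED ||
	    mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)
		return qp_num == 0;

	return qp_num != 0;
}
```

Expressing the rule as a single comparison per branch gives the same accept/reject decision as the patch's "start at valid = 0, set to 1 on a match" form for these two cases.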
From mshefty at ichips.intel.com Tue Sep 14 09:00:31 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 14 Sep 2004 09:00:31 -0700
Subject: [openib-general] ib_mad.c comments
In-Reply-To: <20040914074227.GE25611@mellanox.co.il>
References: <20040914064936.GC25611@mellanox.co.il> <20040914074227.GE25611@mellanox.co.il>
Message-ID: <20040914090031.27b9566b.mshefty@ichips.intel.com>

On Tue, 14 Sep 2004 10:42:27 +0300
"Michael S. Tsirkin" wrote:

> But this is a question of policy, is it not?
> If we need a generic timeout function, let's have it
> and let the caller use it if that's what is needed.
> No need to force the policy.

There's not really any policy for this in the access layer. The generic
timeout "function" is invoked by the caller by setting the timeout_ms
field in the ib_send_wr. After waiting the specified time period, if no
response has been received, the client is notified asynchronously of the
failure. What they do after that is up to them.

From roland at topspin.com Tue Sep 14 09:02:22 2004
From: roland at topspin.com (Roland Dreier)
Date: Tue, 14 Sep 2004 09:02:22 -0700
Subject: [openib-general] [PATCH] ib_mad.c: Filter SMI and GSI packets if received on wrong QPN
In-Reply-To: <20040914085246.0216ced6.mshefty@ichips.intel.com> (Sean Hefty's message of "Tue, 14 Sep 2004 08:52:46 -0700")
References: <1095175435.1830.78.camel@localhost.localdomain> <20040914085246.0216ced6.mshefty@ichips.intel.com>
Message-ID: <52r7p44z5d.fsf@topspin.com>

    Sean> Looks good to me. By the way, looking at the various SMI
    Sean> implementations available, I think that it makes sense to
    Sean> just merge that code into ib_mad.c. A good portion of the
    Sean> SMI validates the MAD and determines if the MAD should be
    Sean> routed to the local device via the process_mad routine.
    Sean> Thoughts? Also, does anyone know if we have a need to
    Sean> perform software loopback in the access layer?
If by software loopback you mean passing 0-hop DR SMPs to the
process_mad method, then yes, this is required. On Tavor posting such
a MAD to a work queue will not work.

 - Roland

From halr at voltaire.com Tue Sep 14 09:10:31 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Tue, 14 Sep 2004 12:10:31 -0400
Subject: [openib-general] [PATCH] ib_mad.c: Filter SMI and GSI packets if received on wrong QPN
In-Reply-To: <20040914085246.0216ced6.mshefty@ichips.intel.com>
References: <1095175435.1830.78.camel@localhost.localdomain> <20040914085246.0216ced6.mshefty@ichips.intel.com>
Message-ID: <1095178230.1830.116.camel@localhost.localdomain>

On Tue, 2004-09-14 at 11:52, Sean Hefty wrote:
> By the way, looking at the various SMI implementations available,
> I think that it makes sense to just merge that code into ib_mad.c.
> A good portion of the SMI validates the MAD and determines if the
> MAD should be routed to the local device via the process_mad
> routine. Thoughts?

Here's my $0.02 worth...

I would prefer to keep this in at least a separate file (something like
ib_smi.c) if possible. Right now, there is a validate_mad routine in the
MAD receive path. That routine could call validate_smi which could
perform the validation based on whether it is an incoming LR or DR
packet. I see no reason for the SMI and MAD layer to be in separate
modules.
-- Hal

From halr at voltaire.com Tue Sep 14 10:19:05 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Tue, 14 Sep 2004 13:19:05 -0400
Subject: [openib-general] [PATCH] [TRIVIAL] ib_verbs.h : Add timeout_ms to ib_send_wr ud struct
Message-ID: <1095182345.1830.161.camel@localhost.localdomain>

ib_verbs.h : Add timeout_ms to ib_send_wr ud struct
(This is also to Roland's branch)

Index: ib_verbs.h
===================================================================
--- ib_verbs.h (revision 826)
+++ ib_verbs.h (working copy)
@@ -520,6 +520,7 @@
 			u32 remote_qpn;
 			u32 remote_qkey;
 			u16 pkey_index; /* valid for GSI only */
+			int timeout_ms; /* valid for MADs only */
 		} ud;
 	} wr;
 };

From halr at voltaire.com Tue Sep 14 10:49:16 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Tue, 14 Sep 2004 13:49:16 -0400
Subject: [openib-general] ib_verbs.h ib_send/recv_wr struct discrepancy
Message-ID: <1095184155.2285.193.camel@localhost.localdomain>

I think the ib_send_wr struct needs updating on Roland's branch from:

struct ib_send_wr {
	struct ib_send_wr *next;

to

struct ib_send_wr {
	struct list_head list;
	u64 wr_id;

to be in sync with the latest ib_verbs.h.

Same thing is true for ib_recv_wr struct.

Not sure what other changes are required for this.

Thanks.
-- Hal From halr at voltaire.com Tue Sep 14 11:05:08 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 14 Sep 2004 14:05:08 -0400 Subject: [openib-general] [PATCH] Changes for async events to openib-candidate branch Message-ID: <1095185108.1864.197.camel@localhost.localdomain> Changes for async events to openib-candidate branch Index: access/ib_verbs_priv.h =================================================================== --- access/ib_verbs_priv.h (revision 823) +++ access/ib_verbs_priv.h (working copy) @@ -26,21 +26,6 @@ #if !defined( IB_VERBS_PRIV_H ) #define IB_VERBS_PRIV_H -struct ib_client { - char *name; - void (*add) (struct ib_device *); - void (*remove)(struct ib_device *); - - struct list_head list; -}; - -int ib_register_client (struct ib_client *client); -void ib_unregister_client(struct ib_client *client); - -void *ib_get_client_data(struct ib_device *device, struct ib_client *client); -int ib_set_client_data(struct ib_device *device, struct ib_client *client, - void *data); - struct ib_mad; static inline int ib_process_mad(struct ib_device *device, Index: access/ib_mad.c =================================================================== --- access/ib_mad.c (revision 823) +++ access/ib_mad.c (working copy) @@ -1277,7 +1277,7 @@ cq_size = IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE; port_priv->cq = ib_create_cq(port_priv->device, (ib_comp_handler) ib_mad_thread_completion_handler, - port_priv, cq_size); + NULL, port_priv, cq_size); if (IS_ERR(port_priv->cq)) { printk(KERN_ERR "Could not create ib_mad CQ\n"); ret = PTR_ERR(port_priv->cq); Index: access/ib_verbs.c =================================================================== --- access/ib_verbs.c (revision 823) +++ access/ib_verbs.c (working copy) @@ -268,16 +268,18 @@ struct ib_cq *ib_create_cq(struct ib_device *device, ib_comp_handler comp_handler, + void (*event_handler)(struct ib_event *, void *), void *cq_context, int cqe) { struct ib_cq *cq; - cq = device->create_cq(device, 
comp_handler, cq_context, cqe); + cq = device->create_cq(device, cqe); if (!IS_ERR(cq)) { cq->device = device; cq->comp_handler = comp_handler; + cq->event_handler = event_handler; cq->cq_context = cq_context; atomic_set(&cq->usecnt, 0); } Index: include/ib_verbs.h =================================================================== --- include/ib_verbs.h (revision 823) +++ include/ib_verbs.h (working copy) @@ -78,6 +78,7 @@ struct ib_cq { struct ib_device *device; ib_comp_handler comp_handler; + void (*event_handler)(struct ib_event *, void *); void *cq_context; int cqe; atomic_t usecnt; @@ -702,9 +703,7 @@ int (*post_srq)(struct ib_srq *srq, struct ib_recv_wr *recv_wr, struct ib_recv_wr **bad_recv_wr); - struct ib_cq * (*create_cq)(struct ib_device *device, - ib_comp_handler comp_handler, - void *cq_context, int cqe); + struct ib_cq * (*create_cq)(struct ib_device *device, int cqe); int (*resize_cq)(struct ib_cq *cq, int cqe); int (*destroy_cq)(struct ib_cq *cq); struct ib_mr * (*reg_phys_mr)(struct ib_pd *pd, @@ -764,6 +763,21 @@ u8 node_type; }; +struct ib_client { + char *name; + void (*add) (struct ib_device *); + void (*remove)(struct ib_device *); + + struct list_head list; +}; + +int ib_register_client (struct ib_client *client); +void ib_unregister_client(struct ib_client *client); + +void *ib_get_client_data(struct ib_device *device, struct ib_client *client); +void ib_set_client_data(struct ib_device *device, struct ib_client *client, + void *data); + static inline int ib_query_device(struct ib_device *device, struct ib_device_attr *device_attr) { @@ -859,6 +873,7 @@ struct ib_cq *ib_create_cq(struct ib_device *device, ib_comp_handler comp_handler, + void (*event_handler)(struct ib_event *, void *), void *cq_context, int cqe); From roland at topspin.com Tue Sep 14 11:07:16 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 14 Sep 2004 11:07:16 -0700 Subject: [openib-general] ib_verbs.h ib_send/recv_wr struct discrepancy In-Reply-To: 
<1095184155.2285.193.camel@localhost.localdomain> (Hal Rosenstock's message of "Tue, 14 Sep 2004 13:49:16 -0400") References: <1095184155.2285.193.camel@localhost.localdomain> Message-ID: <523c1k4td7.fsf@topspin.com> Hal> I think the ib_send_wr struct needs updating on Roland's Hal> branch from: Hal> struct ib_send_wr { Hal> struct ib_send_wr *next; Hal> to Hal> struct ib_send_wr { Hal> struct list_head list; Hal> u64 wr_id; Actually I would prefer not to make this change. If we use the list.h linked list implementation for work requests then consumers need to allocate an extra struct list_head and the low-level driver has to chase an extra pointer, even to post a single work request (the common case). - R. From mshefty at ichips.intel.com Tue Sep 14 11:10:37 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 14 Sep 2004 11:10:37 -0700 Subject: [openib-general] ib_verbs.h ib_send/recv_wr struct discrepancy In-Reply-To: <523c1k4td7.fsf@topspin.com> References: <1095184155.2285.193.camel@localhost.localdomain> <523c1k4td7.fsf@topspin.com> Message-ID: <20040914111037.1adfb993.mshefty@ichips.intel.com> On Tue, 14 Sep 2004 11:07:16 -0700 Roland Dreier wrote: > Actually I would prefer not to make this change. If we use the list.h > linked list implementation for work requests then consumers need to > allocate an extra struct list_head and the low-level driver has to > chase an extra pointer, even to post a single work request (the common > case). I agree. 
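A minimal user-space sketch of the point made above about keeping a plain next pointer in the work request. The toy_* names are hypothetical stand-ins and are not part of ib_verbs.h; the struct carries only the two fields needed for illustration. It shows that with an embedded next pointer the common single-request case needs no separate list head, and that the poster walks the chain by following next directly.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for struct ib_send_wr (assumption: only two fields shown). */
struct toy_send_wr {
	struct toy_send_wr *next;   /* NULL terminates the chain */
	uint64_t            wr_id;
};

/*
 * Toy stand-in for a low-level driver's post_send: walk the chain through
 * the embedded next pointer, record the last wr_id seen, and return the
 * number of requests visited.
 */
int toy_post_send(struct toy_send_wr *wr, uint64_t *last_wr_id)
{
	int posted = 0;

	for (; wr != NULL; wr = wr->next) {
		*last_wr_id = wr->wr_id;
		posted++;
	}
	return posted;
}

/* Common case: one request on the stack, nothing else to allocate. */
int toy_post_one(uint64_t wr_id, uint64_t *last_wr_id)
{
	struct toy_send_wr wr = { .next = NULL, .wr_id = wr_id };

	return toy_post_send(&wr, last_wr_id);
}

/* Less common case: two requests chained by hand through next. */
int toy_post_two(uint64_t first_id, uint64_t second_id, uint64_t *last_wr_id)
{
	struct toy_send_wr second = { .next = NULL, .wr_id = second_id };
	struct toy_send_wr first  = { .next = &second, .wr_id = first_id };

	return toy_post_send(&first, last_wr_id);
}
```

With a list.h-style struct list_head embedded instead, the caller would also need a list head to anchor the requests, which is the extra allocation and pointer chase the thread wants to avoid.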
From halr at voltaire.com Tue Sep 14 11:11:53 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 14 Sep 2004 14:11:53 -0400 Subject: [openib-general] [PATCH] review for new MAD APIs In-Reply-To: <20040913151649.51cf14c8.mshefty@ichips.intel.com> References: <20040913151649.51cf14c8.mshefty@ichips.intel.com> Message-ID: <1095185512.2285.199.camel@localhost.localdomain> On Mon, 2004-09-13 at 18:16, Sean Hefty wrote: > The following patch adds two new APIs > to better support zero-copy receives on MADs. Thanks. Committed. -- Hal From halr at voltaire.com Tue Sep 14 11:45:37 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 14 Sep 2004 14:45:37 -0400 Subject: [openib-general] [PATCH] Change ib_send/recv_wr struct back to use next member rather than list Message-ID: <1095187536.1830.203.camel@localhost.localdomain> Change ib_send/recv_wr struct back to use next member rather than list (on openib-candidate branch) Index: access/ib_mad.c =================================================================== --- access/ib_mad.c (revision 829) +++ access/ib_mad.c (working copy) @@ -303,7 +303,7 @@ /* Walk list of send WRs and post each one on send list */ cur_send_wr = send_wr; while (cur_send_wr) { - next_send_wr = (struct ib_send_wr *)cur_send_wr->list.next; + next_send_wr = (struct ib_send_wr *)cur_send_wr->next; /* Allocate MAD send WR tracking structure */ mad_send_wr = kmalloc(sizeof *mad_send_wr, @@ -320,7 +320,7 @@ /* Timeout valid only when MAD is a request !!! */ mad_send_wr->timeout_ms = cur_send_wr->wr.ud.timeout_ms; - wr.list.next = NULL; + wr.next = NULL; wr.opcode = IB_WR_SEND; /* cur_send_wr->opcode ? 
*/ wr.wr_id = (unsigned long)mad_send_wr; wr.sg_list = cur_send_wr->sg_list; @@ -932,7 +932,7 @@ sg_list.lkey = (*port_priv->mr).lkey; /* Setup receive WR */ - recv_wr.list.next = NULL; + recv_wr.next = NULL; recv_wr.sg_list = &sg_list; recv_wr.num_sge = 1; recv_wr.recv_flags = IB_RECV_SIGNALED; Index: include/ib_verbs.h =================================================================== --- include/ib_verbs.h (revision 829) +++ include/ib_verbs.h (working copy) @@ -519,7 +519,7 @@ }; struct ib_send_wr { - struct list_head list; + struct ib_send_wr *next; u64 wr_id; struct ib_sge *sg_list; int num_sge; @@ -552,7 +552,7 @@ }; struct ib_recv_wr { - struct list_head list; + struct ib_recv_wr *next; u64 wr_id; struct ib_sge *sg_list; int num_sge; From halr at voltaire.com Tue Sep 14 11:54:38 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 14 Sep 2004 14:54:38 -0400 Subject: [openib-general] [PATCH] [TRIVIAL] In ib_cq struct, change member name from context to cq_context Message-ID: <1095188077.2285.207.camel@localhost.localdomain> In ib_cq struct, change member name from context to cq_context (This is on Roland's branch). This is for consistency with other context member names (and the openib-candidate version of this). Note that I did not check the ULPs to see if they were affected. 
Index: include/ib_verbs.h =================================================================== --- include/ib_verbs.h (revision 826) +++ include/ib_verbs.h (working copy) @@ -591,7 +592,7 @@ struct ib_device *device; ib_comp_handler comp_handler; void (*event_handler)(struct ib_event *, void *); - void * context; + void *cq_context; int cqe; atomic_t usecnt; /* count number of work queues */ }; Index: core/ib_verbs.c =================================================================== --- core/ib_verbs.c (revision 826) +++ core/ib_verbs.c (working copy) @@ -191,7 +191,7 @@ cq->device = device; cq->comp_handler = comp_handler; cq->event_handler = event_handler; - cq->context = cq_context; + cq->cq_context = cq_context; atomic_set(&cq->usecnt, 0); } Index: hw/mthca/mthca_cq.c =================================================================== --- hw/mthca/mthca_cq.c (revision 826) +++ hw/mthca/mthca_cq.c (working copy) @@ -177,7 +177,7 @@ return; } - cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.context); + cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context); if (atomic_dec_and_test(&cq->refcount)) wake_up(&cq->wait); From tduffy at sun.com Tue Sep 14 11:57:13 2004 From: tduffy at sun.com (Tom Duffy) Date: Tue, 14 Sep 2004 11:57:13 -0700 Subject: [openib-general] [ANNOUNCE] tvflash: userspace-only Mellanox HCA flash tool In-Reply-To: <52d60udpfd.fsf@topspin.com> References: <52d60udpfd.fsf@topspin.com> Message-ID: <1095188233.6945.8.camel@duffman> On Fri, 2004-09-10 at 10:06 -0700, Roland Dreier wrote: > I just checked in source for "tvflash," a tool for updating the > firmware flash on Mellanox HCA's. It is available from: > > https://openib.org/svn/gen2/branches/roland-merge/src/userspace/tvflash > > This tool operates either by mmap()ing /dev/mem to get access to PCI > memory, or (for systems such as IBM pSeries where this doesn't work) > by peeking and poking at the PCI configuration header. 
> > I have used it successfully on a variety of systems but you should be > prepared for it to fail and corrupt the flash on your HCA. > > Since it links with the GPLed pciutils library, tvflash is licensed > under the GPL only. I get this error trying to compile this on both x86_64 (64bit) and sparc64 (32bit) gcc -Wall -g -O2 -o tvflash tvflash.o -lpci tvflash.o(.text+0x37f4): In function `main': /build2/tduffy/tvflash/src/tvflash.c:2044: undefined reference to `__sizeof_vsd_data_doesnt_equal_sizeof_vsd_raw' on both, I have vsd.data=232 and vsd.raw=224 -tduffy From roland at topspin.com Tue Sep 14 12:05:54 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 14 Sep 2004 12:05:54 -0700 Subject: [openib-general] [ANNOUNCE] tvflash: userspace-only Mellanox HCA flash tool In-Reply-To: <1095188233.6945.8.camel@duffman> (Tom Duffy's message of "Tue, 14 Sep 2004 11:57:13 -0700") References: <52d60udpfd.fsf@topspin.com> <1095188233.6945.8.camel@duffman> Message-ID: <52y8jc3c31.fsf@topspin.com> Tom> I get this error trying to compile this on both x86_64 Tom> (64bit) and sparc64 (32bit) Tom> gcc -Wall -g -O2 -o tvflash tvflash.o -lpci Tom> tvflash.o(.text+0x37f4): In function `main': Tom> /build2/tduffy/tvflash/src/tvflash.c:2044: undefined Tom> reference to `__sizeof_vsd_data_doesnt_equal_sizeof_vsd_raw' Tom> on both, I have vsd.data=232 and vsd.raw=224 Hmm, I'll check it out. For some odd reason it worked for me on i386 and ppc64... - R.
From roland at topspin.com Tue Sep 14 12:09:04 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 14 Sep 2004 12:09:04 -0700 Subject: [openib-general] [PATCH] [TRIVIAL] In ib_cq struct, change member name from context to cq_context In-Reply-To: <1095188077.2285.207.camel@localhost.localdomain> (Hal Rosenstock's message of "Tue, 14 Sep 2004 14:54:38 -0400") References: <1095188077.2285.207.camel@localhost.localdomain> Message-ID: <52u0u03bxr.fsf@topspin.com> Hal> Note that I did not check the ULPs to see if they were Hal> affected. No ULPs but both mthca and the access layer... Index: infiniband/include/ib_verbs.h =================================================================== --- infiniband/include/ib_verbs.h (revision 828) +++ infiniband/include/ib_verbs.h (working copy) @@ -592,7 +592,7 @@ struct ib_device *device; ib_comp_handler comp_handler; void (*event_handler)(struct ib_event *, void *); - void * context; + void * cq_context; int cqe; atomic_t usecnt; /* count number of work queues */ }; Index: infiniband/core/ib_verbs.c =================================================================== --- infiniband/core/ib_verbs.c (revision 803) +++ infiniband/core/ib_verbs.c (working copy) @@ -191,7 +191,7 @@ cq->device = device; cq->comp_handler = comp_handler; cq->event_handler = event_handler; - cq->context = cq_context; + cq->cq_context = cq_context; atomic_set(&cq->usecnt, 0); } Index: infiniband/hw/mthca/mthca_cq.c =================================================================== --- infiniband/hw/mthca/mthca_cq.c (revision 803) +++ infiniband/hw/mthca/mthca_cq.c (working copy) @@ -177,7 +177,7 @@ return; } - cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.context); + cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context); if (atomic_dec_and_test(&cq->refcount)) wake_up(&cq->wait); From halr at voltaire.com Tue Sep 14 12:22:10 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 14 Sep 2004 15:22:10 -0400 Subject: [openib-general] [PATCH] Allow for 
support of alternative MAD layers in build process Message-ID: <1095189730.1864.217.camel@localhost.localdomain> Allow for support of alternative MAD layers in build process This is just to get started as ib_mad_send will need replacing with an ib_post_send_mad (from OpenIB). (This patch is for Roland's branch.) Index: mthca_mad.c =================================================================== --- mthca_mad.c (revision 831) +++ mthca_mad.c (working copy) @@ -90,7 +90,9 @@ in_mad->sqpn = 0; in_mad->dlid = sm_path.sm_lid; in_mad->completion_func = NULL; +#ifndef OPENIB_ACCESS_LAYER ib_mad_send(in_mad); +#endif } return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; Index: Makefile =================================================================== --- Makefile (revision 831) +++ Makefile (working copy) @@ -4,6 +4,10 @@ EXTRA_CFLAGS += -DDEBUG endif +ifdef CONFIG_INFINIBAND_ACCESS_LAYER +EXTRA_CFLAGS += -DOPENIB_ACCESS_LAYER +endif + obj-$(CONFIG_INFINIBAND_MTHCA) += ib_mthca.o ib_mthca-objs := \ From roland at topspin.com Tue Sep 14 13:04:15 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 14 Sep 2004 13:04:15 -0700 Subject: [openib-general] [ANNOUNCE] tvflash: userspace-only Mellanox HCA flash tool In-Reply-To: <1095188233.6945.8.camel@duffman> (Tom Duffy's message of "Tue, 14 Sep 2004 11:57:13 -0700") References: <52d60udpfd.fsf@topspin.com> <1095188233.6945.8.camel@duffman> Message-ID: <52pt4o39ds.fsf@topspin.com> OK, can you give it another shot now? I added an "__attribute__((packed))" for the Topspin VSD struct. When I split things up refactoring some code I changed the alignment of some fields so that it no longer gets packed naturally on 64-bit archs. - R.
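A small sketch of the kind of size mismatch described above, under stated assumptions: the toy_* types are hypothetical and do not reproduce the real Topspin VSD layout, and the check here uses C11 static_assert rather than whatever mechanism tvflash.c uses (the symbol name in the link error only suggests a link-time size check). It shows how a 4-byte field followed by an 8-byte field can pick up padding under natural alignment, and how __attribute__((packed)), a GCC/Clang extension, pins the struct to the size of its raw byte view.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: naturally aligned, so the compiler may insert padding. */
struct toy_vsd_natural {
	uint32_t signature;
	uint64_t serial;
};

/* Same fields, but packed: no padding between or after members. */
struct toy_vsd_packed {
	uint32_t signature;
	uint64_t serial;
} __attribute__((packed));

/* Structured view and raw byte view of the same 12 bytes. */
union toy_vsd {
	struct toy_vsd_packed data;
	uint8_t               raw[12];
};

/* Compile-time check that both views have the same size. */
static_assert(sizeof(((union toy_vsd *)0)->data) ==
	      sizeof(((union toy_vsd *)0)->raw),
	      "structured and raw VSD views must have the same size");

/* Number of padding bytes the natural layout adds on this ABI. */
size_t toy_vsd_padding(void)
{
	return sizeof(struct toy_vsd_natural) - sizeof(struct toy_vsd_packed);
}
```

On ABIs that align 64-bit integers to 8 bytes the natural struct is typically 16 bytes, so toy_vsd_padding() returns 4; on ABIs that align them to 4 bytes inside structs it returns 0. Such a check can therefore pass on one architecture and fail on another.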
c From roland at topspin.com Tue Sep 14 13:06:35 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 14 Sep 2004 13:06:35 -0700 Subject: [openib-general] [PATCH] Start IPoIB cleanup Message-ID: <52llfc399w.fsf@topspin.com> Start cleaning up IPoIB: use alloc_netdev and free_netdev properly, and set the underlying device with SET_NETDEV_DEV(). - R. Index: infiniband/ulp/ipoib/ipoib_verbs.c =================================================================== --- infiniband/ulp/ipoib/ipoib_verbs.c (revision 803) +++ infiniband/ulp/ipoib/ipoib_verbs.c (working copy) @@ -29,7 +29,7 @@ /*.. ipoib_pkey_dev_check_presence - Check for the interface P_Key presence */ void ipoib_pkey_dev_check_presence(struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); u16 pkey_index = 0; if (ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index)) @@ -40,7 +40,7 @@ int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_attr *qp_attr; struct ib_qp_cap qp_cap; int attr_mask; @@ -85,7 +85,7 @@ int ipoib_mcast_detach(struct net_device *dev, u16 mlid, union ib_gid *mgid) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); int ret; down(&priv->mcast_mutex); @@ -101,7 +101,7 @@ int ipoib_qp_create(struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); int ret; u16 pkey_index; @@ -194,7 +194,7 @@ void ipoib_qp_destroy(struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); if (ib_destroy_qp(priv->qp)) TS_REPORT_WARN(MOD_IB_NET, @@ -205,7 +205,7 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); 
priv->pd = ib_alloc_pd(priv->ca); if (IS_ERR(priv->pd)) { @@ -261,7 +261,7 @@ void ipoib_transport_dev_cleanup(struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); if (priv->qp != NULL) { if (ib_destroy_qp(priv->qp)) @@ -290,14 +290,14 @@ if (record->event == IB_EVENT_PORT_ACTIVE) { TS_TRACE(MOD_IB_NET, T_VERBOSE, TRACE_IB_NET_GEN, - "%s: Port active event", priv->dev.name); + "%s: Port active event", priv->dev->name); schedule_work(&priv->flush_task); } } int ipoib_port_monitor_dev_start(struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); INIT_IB_EVENT_HANDLER(&priv->event_handler, priv->ca, ipoib_event); @@ -312,7 +312,7 @@ void ipoib_port_monitor_dev_stop(struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); ib_unregister_event_handler(&priv->event_handler); } Index: infiniband/ulp/ipoib/ipoib_arp.c =================================================================== --- infiniband/ulp/ipoib/ipoib_arp.c (revision 803) +++ infiniband/ulp/ipoib/ipoib_arp.c (working copy) @@ -24,7 +24,6 @@ #include "ipoib.h" #include "ts_kernel_trace.h" -#include "ts_kernel_services.h" #include "ts_ib_sa_client.h" @@ -198,7 +197,7 @@ static struct ipoib_sarp *__ipoib_sarp_find(struct net_device *dev, const uint8_t *hash) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_sarp *entry; list_for_each_entry(entry, &priv->sarp_cache->table[hash[0]], @@ -223,7 +222,7 @@ static struct ipoib_sarp *_ipoib_sarp_find(struct net_device *dev, const uint8_t *hash) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_sarp *entry; unsigned long flags; @@ -238,7 +237,7 @@ /*..ipoib_sarp_iter_init -- create new ARP iterator */ struct ipoib_sarp_iter *ipoib_sarp_iter_init(struct net_device *dev) { 
- struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_sarp_iter *iter; iter = kmalloc(sizeof(*iter), GFP_KERNEL); @@ -273,7 +272,7 @@ /*..ipoib_sarp_iter_next -- incr. iter. -- return non-zero at end */ int ipoib_sarp_iter_next(struct ipoib_sarp_iter *iter) { - struct ipoib_dev_priv *priv = iter->dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(iter->dev); while (1) { iter->cur = iter->cur->next; @@ -314,7 +313,7 @@ struct ipoib_sarp *ipoib_sarp_add(struct net_device *dev, union ib_gid *gid, u32 qpn) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); uint8_t hash[IPOIB_ADDRESS_HASH_BYTES]; struct ipoib_sarp *entry; unsigned long flags; @@ -366,7 +365,7 @@ /*..ipoib_sarp_delete -- delete shadow ARP cache entry */ int ipoib_sarp_delete(struct net_device *dev, const uint8_t *hash) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_sarp *entry; unsigned long flags; @@ -398,7 +397,7 @@ { struct ipoib_sarp *entry = entry_ptr; struct net_device *dev = entry->dev; - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); TS_TRACE(MOD_IB_NET, T_VERY_VERBOSE, TRACE_IB_NET_ARP, "%s: path record lookup done, status %d", dev->name, status); @@ -474,7 +473,7 @@ { struct ipoib_sarp *entry = entry_ptr; struct net_device *dev = entry->dev; - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); tTS_IB_CLIENT_QUERY_TID tid; ipoib_sarp_get(entry); @@ -729,7 +728,7 @@ /*..ipoib_sarp_rewrite_send -- rewrite and send ARP packet */ int ipoib_sarp_rewrite_send(struct net_device *dev, struct sk_buff *skb) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); unsigned char broadcast_mac_addr[] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff }; struct sk_buff *new_skb; @@ -961,7 +960,7 @@ /*..ipoib_sarp_dev_init -- initialize ARP 
cache */ int ipoib_sarp_dev_init(struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); int i; priv->sarp_cache = kmalloc(sizeof(*priv->sarp_cache), GFP_KERNEL); @@ -978,7 +977,7 @@ /*..ipoib_sarp_dev_flush -- flush ARP cache */ void ipoib_sarp_dev_flush(struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_sarp *entry, *tentry; LIST_HEAD(delete_list); unsigned long flags; @@ -1041,7 +1040,7 @@ /*..ipoib_sarp_dev_destroy -- destroy ARP cache */ static void ipoib_sarp_dev_destroy(struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_sarp *entry, *tentry; LIST_HEAD(delete_list); unsigned long flags; @@ -1072,7 +1071,7 @@ /*..ipoib_sarp_dev_cleanup -- clean up ARP cache */ void ipoib_sarp_dev_cleanup(struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); TS_REPORT_CLEANUP(MOD_IB_NET, "%s: cleaning up ARP table", dev->name); Index: infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- infiniband/ulp/ipoib/ipoib_main.c (revision 813) +++ infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -23,7 +23,6 @@ #include "ipoib.h" -#include "ts_kernel_services.h" #include "ts_kernel_trace.h" #include @@ -115,7 +114,7 @@ int ipoib_device_handle(struct net_device *dev, struct ib_device **ca, tTS_IB_PORT *port, tTS_IB_GID gid, u16 *pkey) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); *ca = priv->ca; *port = priv->port; @@ -128,7 +127,7 @@ int ipoib_dev_open(struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); TS_TRACE(MOD_IB_NET, T_VERBOSE, TRACE_FLOW_CONFIG, "%s: bringing up interface", dev->name); @@ -152,11 +151,11 @@ 
list_for_each_entry(cpriv, &priv->child_intfs, list) { int flags; - flags = cpriv->dev.flags; + flags = cpriv->dev->flags; if (flags & IFF_UP) continue; - ipoib_change_flags(&cpriv->dev, flags | IFF_UP); + ipoib_change_flags(cpriv->dev, flags | IFF_UP); } up(&ipoib_device_mutex); } @@ -168,7 +167,7 @@ static int _ipoib_dev_stop(struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); TS_TRACE(MOD_IB_NET, T_VERBOSE, TRACE_FLOW_CONFIG, "%s: stopping interface", dev->name); @@ -188,11 +187,11 @@ list_for_each_entry(cpriv, &priv->child_intfs, list) { int flags; - flags = cpriv->dev.flags; + flags = cpriv->dev->flags; if (!(flags & IFF_UP)) continue; - ipoib_change_flags(&cpriv->dev, flags & ~IFF_UP); + ipoib_change_flags(cpriv->dev, flags & ~IFF_UP); } up(&ipoib_device_mutex); } @@ -215,7 +214,7 @@ static int _ipoib_dev_change_mtu(struct net_device *dev, int new_mtu) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); if (new_mtu > IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN) return -EINVAL; @@ -234,7 +233,7 @@ static int _ipoib_dev_xmit(struct sk_buff *skb, struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); uint16_t ethertype; int ret; @@ -374,14 +373,14 @@ struct net_device_stats *_ipoib_dev_get_stats(struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); return &priv->stats; } static void _ipoib_dev_timeout(struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); if (priv->tx_free && !test_bit(IPOIB_FLAG_TIMEOUT, &priv->flags)) { char ring[IPOIB_TX_RING_SIZE + 1]; @@ -435,51 +434,15 @@ static void _ipoib_dev_set_mcast_list(struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); schedule_work(&priv->restart_task); } int 
ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); - TS_REPORT_INOUT(MOD_IB_NET, "%s: initializing device", dev->name); - - dev->open = ipoib_dev_open; - dev->stop = _ipoib_dev_stop; - dev->do_ioctl = _ipoib_dev_ioctl; - dev->change_mtu = _ipoib_dev_change_mtu; - dev->set_config = _ipoib_dev_set_config; - dev->hard_start_xmit = _ipoib_dev_xmit; - dev->get_stats = _ipoib_dev_get_stats; - dev->tx_timeout = _ipoib_dev_timeout; - dev->hard_header = _ipoib_dev_hard_header; - dev->set_multicast_list = _ipoib_dev_set_mcast_list; - dev->watchdog_timeo = HZ; - - dev->rebuild_header = NULL; - dev->set_mac_address = NULL; - dev->header_cache_update = NULL; - - dev->flags |= IFF_BROADCAST | IFF_MULTICAST; - - dev->hard_header_len = ETH_HLEN; - dev->addr_len = IPOIB_ADDRESS_HASH_BYTES; - dev->type = ARPHRD_ETHER; - dev->tx_queue_len = IPOIB_TX_RING_SIZE * 2; - /* MTU will be reset when mcast join happens */ - dev->mtu = IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN; - priv->mcast_mtu = priv->admin_mtu = dev->mtu; - - memset(dev->broadcast, 0xff, dev->addr_len); - - netif_carrier_off(dev); - - SET_MODULE_OWNER(dev); - - spin_lock_init(&priv->lock); - if (ipoib_sarp_dev_init(dev)) goto out; @@ -540,14 +503,14 @@ void ipoib_dev_cleanup(struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv, *cpriv, *tcpriv; + struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv, *tcpriv; int i; /* Delete any child interfaces first */ /* Safe since it's either protected by ipoib_device_mutex or empty */ list_for_each_entry_safe(cpriv, tcpriv, &priv->child_intfs, list) { - ipoib_dev_cleanup(&cpriv->dev); - unregister_netdev(&cpriv->dev); + ipoib_dev_cleanup(cpriv->dev); + unregister_netdev(cpriv->dev); list_del(&cpriv->list); @@ -585,72 +548,98 @@ } } -struct ipoib_dev_priv *ipoib_intf_alloc(void) +static void ipoib_setup(struct net_device *dev) { - struct ipoib_dev_priv *priv; + 
struct ipoib_dev_priv *priv = netdev_priv(dev); - priv = kmalloc(sizeof(*priv), GFP_KERNEL); - if (!priv) { - TS_REPORT_FATAL(MOD_IB_NET, - "failed to allocate private struct"); - return NULL; - } + dev->open = ipoib_dev_open; + dev->stop = _ipoib_dev_stop; + dev->do_ioctl = _ipoib_dev_ioctl; + dev->change_mtu = _ipoib_dev_change_mtu; + dev->set_config = _ipoib_dev_set_config; + dev->hard_start_xmit = _ipoib_dev_xmit; + dev->get_stats = _ipoib_dev_get_stats; + dev->tx_timeout = _ipoib_dev_timeout; + dev->hard_header = _ipoib_dev_hard_header; + dev->set_multicast_list = _ipoib_dev_set_mcast_list; + dev->watchdog_timeo = HZ; - memset(priv, 0, sizeof(*priv)); + dev->rebuild_header = NULL; + dev->set_mac_address = NULL; + dev->header_cache_update = NULL; - sema_init(&priv->mcast_mutex, 1); + dev->flags |= IFF_BROADCAST | IFF_MULTICAST; + dev->hard_header_len = ETH_HLEN; + dev->addr_len = IPOIB_ADDRESS_HASH_BYTES; + dev->type = ARPHRD_ETHER; + dev->tx_queue_len = IPOIB_TX_RING_SIZE * 2; + + /* MTU will be reset when mcast join happens */ + dev->mtu = IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN; + priv->mcast_mtu = priv->admin_mtu = dev->mtu; + + memset(dev->broadcast, 0xff, dev->addr_len); + + netif_carrier_off(dev); + + SET_MODULE_OWNER(dev); + + priv->dev = dev; + + spin_lock_init(&priv->lock); + + sema_init(&priv->mcast_mutex, 1); atomic_set(&priv->mcast_joins, 0); INIT_LIST_HEAD(&priv->child_intfs); INIT_LIST_HEAD(&priv->multicast_list); - INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush, &priv->dev); - INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task, &priv->dev); + INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush, priv->dev); + INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task, priv->dev); - priv->dev.priv = priv; +} - return priv; +struct ipoib_dev_priv *ipoib_intf_alloc(const char *name) +{ + struct net_device *dev; + + dev = alloc_netdev((int) sizeof (struct ipoib_dev_priv), name, + ipoib_setup); + if (!dev) + return NULL; + + return netdev_priv(dev); } -int 
ipoib_add_port(const char *format, struct ib_device *hca, tTS_IB_PORT port) +static int ipoib_add_port(const char *format, struct ib_device *hca, u8 port) { struct ipoib_dev_priv *priv; int result = -ENOMEM; - priv = ipoib_intf_alloc(); + priv = ipoib_intf_alloc(format); if (!priv) goto alloc_mem_failed; - priv->pkey = 0xffff; + SET_NETDEV_DEV(priv->dev, &hca->dma_device->dev); -#if 0 - /* We'll probably use something like this in the future */ - result = ib_pkey_entry_get(hca, port, 0, &priv->pkey); + result = ib_query_pkey(hca, port, 0, &priv->pkey); if (result) { TS_REPORT_FATAL(MOD_IB_NET, "%s: ib_pkey_entry_get failed (ret = %d)", - priv->dev.name, result); - goto dev_pkey_get_failed; + priv->dev->name, result); + goto alloc_mem_failed; } -#endif - result = dev_alloc_name(&priv->dev, format); + result = ipoib_dev_init(priv->dev, hca, port); if (result < 0) { TS_REPORT_FATAL(MOD_IB_NET, - "failed to get device name (ret = %d)", result); - goto dev_alloc_failed; - } - - result = ipoib_dev_init(&priv->dev, hca, port); - if (result < 0) { - TS_REPORT_FATAL(MOD_IB_NET, "failed to initialize net device %d, port %d (ret = %d)", hca, port, result); goto device_init_failed; } - result = ipoib_port_monitor_dev_start(&priv->dev); + result = ipoib_port_monitor_dev_start(priv->dev); if (result < 0) { TS_REPORT_FATAL(MOD_IB_NET, "failed to setup port monitor for device %d, " @@ -659,11 +648,11 @@ goto port_monitor_failed; } - result = register_netdev(&priv->dev); + result = register_netdev(priv->dev); if (result) { TS_REPORT_FATAL(MOD_IB_NET, "%s: failed to initialize; error %i", - priv->dev.name, result); + priv->dev->name, result); goto register_failed; } @@ -674,20 +663,14 @@ return 0; register_failed: - ipoib_port_monitor_dev_stop(&priv->dev); + ipoib_port_monitor_dev_stop(priv->dev); port_monitor_failed: - ipoib_dev_cleanup(&priv->dev); + ipoib_dev_cleanup(priv->dev); device_init_failed: - /* - * Nothing to do since the device name only gets finally added - * to the 
linked list in register_netdev - */ + free_netdev(priv->dev); -dev_alloc_failed: - kfree(priv); - alloc_mem_failed: return result; } @@ -725,10 +708,10 @@ up(&ipoib_device_mutex); list_for_each_entry_safe(priv, tmp, &delete, list) { - unregister_netdev(&priv->dev); - ipoib_port_monitor_dev_stop(&priv->dev); - ipoib_dev_cleanup(&priv->dev); - kfree(priv); + unregister_netdev(priv->dev); + ipoib_port_monitor_dev_stop(priv->dev); + ipoib_dev_cleanup(priv->dev); + free_netdev(priv->dev); } } Index: infiniband/ulp/ipoib/ipoib.h =================================================================== --- infiniband/ulp/ipoib/ipoib.h (revision 824) +++ infiniband/ulp/ipoib/ipoib.h (working copy) @@ -26,9 +26,10 @@ #include #include -#include /* struct device, and other headers */ +#include #include #include +#include #include #include @@ -40,7 +41,6 @@ #include -#include #include /* constants */ @@ -97,10 +97,8 @@ struct ipoib_dev_priv { spinlock_t lock; - struct net_device dev; - struct list_head list; - struct list_head child_intfs; + struct net_device *dev; unsigned long flags; @@ -156,6 +154,9 @@ struct ib_event_handler event_handler; struct net_device_stats stats; + + struct list_head list; + struct list_head child_intfs; }; /* list of IPoIB network devices */ @@ -172,7 +173,7 @@ ipoib_tx_callback_t callback, void *ptr, struct ib_ah *address, u32 qpn); -struct ipoib_dev_priv *ipoib_intf_alloc(void); +struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); void ipoib_ib_dev_flush(void *dev); @@ -251,9 +252,6 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca); void ipoib_transport_dev_cleanup(struct net_device *dev); -int ipoib_add_port(const char *format, struct ib_device *device, - tTS_IB_PORT port); - int ipoib_port_monitor_dev_start(struct net_device *dev); void ipoib_port_monitor_dev_stop(struct net_device *dev); Index: infiniband/ulp/ipoib/ipoib_ib.c 
=================================================================== --- infiniband/ulp/ipoib/ipoib_ib.c (revision 824) +++ infiniband/ulp/ipoib/ipoib_ib.c (working copy) @@ -53,7 +53,7 @@ /*.._ipoib_ib_post_receive -- post a receive buffer */ static int _ipoib_ib_post_receive(struct net_device *dev, int id) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); struct sk_buff *skb; dma_addr_t addr; int ret; @@ -104,7 +104,7 @@ static void ipoib_ib_handle_wc(struct net_device *dev, struct ib_wc *entry) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); unsigned int work_request_id = (unsigned int) entry->wr_id; TS_REPORT_DATA(MOD_IB_NET, @@ -270,7 +270,7 @@ void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr) { struct net_device *dev = (struct net_device *) dev_ptr; - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); int n, i; ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); @@ -312,7 +312,7 @@ ipoib_tx_callback_t callback, void *ptr, struct ib_ah *address, u32 qpn) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_tx_buf *tx_req; dma_addr_t addr; @@ -430,7 +430,7 @@ int ipoib_ib_dev_up(struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); set_bit(IPOIB_FLAG_OPER_UP, &priv->flags); @@ -441,7 +441,7 @@ /*..ipoib_ib_dev_down -- remove from multicast, etc */ int ipoib_ib_dev_down(struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); int count = 0; clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags); @@ -489,7 +489,7 @@ /*..ipoib_ib_dev_stop -- cleanup QP and RX ring */ int ipoib_ib_dev_stop(struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); int i; /* Kill the existing QP and allocate a new one */ @@ -510,7 
+510,7 @@ /*..ipoib_ib_dev_init -- set up IB resources for iface */ int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); priv->ca = ca; priv->port = port; @@ -539,7 +539,7 @@ void ipoib_ib_dev_flush(void *_dev) { struct net_device *dev = (struct net_device *)_dev; - struct ipoib_dev_priv *priv = dev->priv, *cpriv; + struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv; if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) return; @@ -565,7 +565,7 @@ /*..ipoib_ib_dev_cleanup -- clean up IB resources for iface */ void ipoib_ib_dev_cleanup(struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); TS_REPORT_CLEANUP(MOD_IB_NET, "%s: cleaning up IB resources", dev->name); @@ -602,7 +602,7 @@ /*..ipoib_pkey_dev_start_thread -- Start the P_Key thread */ int ipoib_pkey_dev_start_thread(struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); char thread_name[sizeof("ibX.YYYY_pkey")]; int ret = 0; @@ -635,7 +635,7 @@ /*..ipoib_pkey_dev_stop_thread -- Stop the P_Key thread */ int ipoib_pkey_dev_stop_thread(struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); int ret = 0; TS_TRACE(MOD_IB_NET, T_VERBOSE, TRACE_IB_NET_GEN, @@ -660,7 +660,7 @@ static void _ipoib_pkey_thread(void *dev_ptr) { struct net_device *dev = dev_ptr; - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); /* P_Key already assigned */ if (test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) @@ -689,7 +689,7 @@ /*..ipoib_pkey_dev_delay_open -- wait for pkey to be set */ int ipoib_pkey_dev_delay_open(struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); /* Look for the interface pkey value in the IB Port P_Key table 
and */ /* set the interface pkey assigment flag */ Index: infiniband/ulp/ipoib/ipoib_vlan.c =================================================================== --- infiniband/ulp/ipoib/ipoib_vlan.c (revision 803) +++ infiniband/ulp/ipoib/ipoib_vlan.c (working copy) @@ -21,11 +21,6 @@ $Id$ */ -#include "ipoib.h" - -#include "ts_kernel_services.h" -#include "ts_kernel_trace.h" - #include #include @@ -34,6 +29,11 @@ #include +#include "ipoib.h" + +#include "ts_kernel_services.h" +#include "ts_kernel_trace.h" + struct ipoib_vlan_iter { struct list_head *pintf_cur; struct list_head *intf_cur; @@ -50,7 +50,7 @@ if (!capable(CAP_NET_ADMIN)) return -EPERM; - ppriv = pdev->priv; + ppriv = netdev_priv(pdev); /* * First ensure this isn't a duplicate. We check the parent device and @@ -68,7 +68,7 @@ } up(&ipoib_device_mutex); - priv = ipoib_intf_alloc(); + priv = ipoib_intf_alloc(intf_name); if (!priv) goto alloc_mem_failed; @@ -76,9 +76,7 @@ priv->pkey = pkey; - strncpy(priv->dev.name, intf_name, sizeof(priv->dev.name)); - - result = ipoib_dev_init(&priv->dev, ppriv->ca, ppriv->port); + result = ipoib_dev_init(priv->dev, ppriv->ca, ppriv->port); if (result < 0) { TS_REPORT_FATAL(MOD_IB_NET, "failed to initialize net device %d, port %d", @@ -86,11 +84,11 @@ goto device_init_failed; } - result = register_netdev(&priv->dev); + result = register_netdev(priv->dev); if (result) { TS_REPORT_FATAL(MOD_IB_NET, "%s: failed to initialize; error %i", - priv->dev.name, result); + priv->dev->name, result); goto register_failed; } @@ -101,10 +99,10 @@ return 0; register_failed: - ipoib_dev_cleanup(&priv->dev); + ipoib_dev_cleanup(priv->dev); device_init_failed: - kfree(priv); + free_netdev(priv->dev); alloc_mem_failed: return result; @@ -117,18 +115,18 @@ if (!capable(CAP_NET_ADMIN)) return -EPERM; - ppriv = pdev->priv; + ppriv = netdev_priv(pdev); down(&ipoib_device_mutex); list_for_each_entry_safe(priv, tpriv, &ppriv->child_intfs, list) { if (priv->pkey == pkey) { - if (priv->dev.flags & 
IFF_UP) { + if (priv->dev->flags & IFF_UP) { up(&ipoib_device_mutex); return -EBUSY; } - ipoib_dev_cleanup(&priv->dev); - unregister_netdev(&priv->dev); + ipoib_dev_cleanup(priv->dev); + unregister_netdev(priv->dev); list_del(&priv->list); @@ -230,7 +228,7 @@ ppriv = list_entry(iter->pintf_cur, struct ipoib_dev_priv, list); if (!iter->intf_cur) - seq_printf(file, "%s 0x%04x\n", ppriv->dev.name, + seq_printf(file, "%s 0x%04x\n", ppriv->dev->name, ppriv->pkey); else { struct ipoib_dev_priv *priv; @@ -238,8 +236,8 @@ priv = list_entry(iter->intf_cur, struct ipoib_dev_priv, list); - seq_printf(file, " %s %s 0x%04x\n", ppriv->dev.name, - priv->dev.name, priv->pkey); + seq_printf(file, " %s %s 0x%04x\n", ppriv->dev->name, + priv->dev->name, priv->pkey); } } Index: infiniband/ulp/ipoib/ipoib_proc.c =================================================================== --- infiniband/ulp/ipoib/ipoib_proc.c (revision 803) +++ infiniband/ulp/ipoib/ipoib_proc.c (working copy) @@ -21,17 +21,18 @@ $Id$ */ -#include "ipoib.h" - -#include "ts_kernel_trace.h" -#include "ts_kernel_services.h" - #include #include +#include #include #include +#include "ipoib.h" + +#include "ts_kernel_trace.h" +#include "ts_kernel_services.h" + /* * ARP proc file stuff */ @@ -425,7 +426,7 @@ /*..ipoib_proc_dev_init -- set up ipoib_arp in /proc */ int ipoib_proc_dev_init(struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); char name[sizeof(ipoib_arp_proc_entry_name) + sizeof (dev->name)]; snprintf(name, sizeof(name) - 1, ipoib_arp_proc_entry_name, dev->name); @@ -465,7 +466,7 @@ /*..ipoib_proc_dev_cleanup -- unregister /proc file */ void ipoib_proc_dev_cleanup(struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); char name[sizeof(ipoib_arp_proc_entry_name) + sizeof(dev->name)]; if (priv->arp_proc_entry) { Index: infiniband/ulp/ipoib/ipoib_multicast.c 
=================================================================== --- infiniband/ulp/ipoib/ipoib_multicast.c (revision 803) +++ infiniband/ulp/ipoib/ipoib_multicast.c (working copy) @@ -25,7 +25,6 @@ #include "ts_ib_sa_client.h" -#include "ts_kernel_services.h" #include "ts_kernel_trace.h" #include @@ -137,7 +136,7 @@ /*..__ipoib_mcast_find - find multicast group */ struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, union ib_gid *mgid) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); struct rb_node *n = priv->multicast_tree.rb_node; while (n) { @@ -165,7 +164,7 @@ struct ipoib_mcast *_ipoib_mcast_find(struct net_device *dev, union ib_gid *mgid) { struct ipoib_mcast *mcast; - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); unsigned long flags; spin_lock_irqsave(&priv->lock, flags); @@ -179,7 +178,7 @@ /*..__ipoib_mcast_add -- add multicast group to rbtree */ static int __ipoib_mcast_add(struct net_device *dev, struct ipoib_mcast *mcast) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); struct rb_node **n = &priv->multicast_tree.rb_node, *pn = NULL; while (*n) { @@ -210,7 +209,7 @@ struct ib_multicast_member *member_ptr) { struct net_device *dev = mcast->dev; - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); int ret; mcast->mcast_member = *member_ptr; @@ -301,7 +300,7 @@ { struct ipoib_mcast *mcast = mcast_ptr; struct net_device *dev = mcast->dev; - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); mcast->tid = TS_IB_CLIENT_QUERY_TID_INVALID; @@ -338,7 +337,7 @@ static int _ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast) { struct net_device *dev = mcast->dev; - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); tTS_IB_CLIENT_QUERY_TID tid; int ret = 0; @@ -399,7 +398,7 @@ { struct ipoib_mcast 
*mcast = mcast_ptr; struct net_device *dev = mcast->dev; - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); priv->mcast_tid = TS_IB_CLIENT_QUERY_TID_INVALID; @@ -415,7 +414,7 @@ /*..__ipoib_mcast_join - join multicast group for iface */ static int __ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); int status; TS_TRACE(MOD_IB_NET, T_VERY_VERBOSE, TRACE_IB_NET_MULTICAST, @@ -508,7 +507,7 @@ static void _ipoib_mcast_join_thread(void *dev_ptr) { struct net_device *dev = dev_ptr; - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_sarp *entry; unsigned long flags; int ret = 0; @@ -612,7 +611,7 @@ /*..ipoib_mcast_start_thread -- start multicast thread */ int ipoib_mcast_start_thread(struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); char thread_name[64]; int ret = 0; @@ -646,7 +645,7 @@ /*..ipoib_mcast_stop_thread -- stop multicast join */ int ipoib_mcast_stop_thread(struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); int ret = 0; TS_TRACE(MOD_IB_NET, T_VERBOSE, TRACE_IB_NET_MULTICAST, @@ -685,7 +684,7 @@ /*..ipoib_mcast_leave -- leave multicast group */ int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); int result; if (!test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) @@ -739,7 +738,7 @@ static int _ipoib_mcast_delete(struct net_device *dev, union ib_gid *mgid) { struct ipoib_mcast *mcast; - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); unsigned long flags; spin_lock_irqsave(&priv->lock, flags); @@ -765,7 +764,7 @@ union ib_gid *mgid, struct ipoib_mcast 
**mmcast) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_mcast *mcast; unsigned long flags; int ret = 0; @@ -847,7 +846,7 @@ /*..ipoib_mcast_dev_flush -- flush joins and address vectors */ void ipoib_mcast_dev_flush(struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); LIST_HEAD(remove_list); struct ipoib_mcast *mcast, *tmcast, *nmcast; unsigned long flags; @@ -914,7 +913,7 @@ /*..ipoib_mcast_dev_down -- delete broadcast group */ void ipoib_mcast_dev_down(struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); /* Delete broadcast since it will be recreated */ if (priv->broadcast) { @@ -928,7 +927,7 @@ void ipoib_mcast_restart_task(void *dev_ptr) { struct net_device *dev = dev_ptr; - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); struct in_device *in_dev = in_dev_get(dev); struct ip_mc_list *im; struct ipoib_mcast *mcast, *tmcast; @@ -1064,7 +1063,7 @@ /*..ipoib_mcast_iter_init -- create new multicast iterator */ struct ipoib_mcast_iter *ipoib_mcast_iter_init(struct net_device *dev) { - struct ipoib_dev_priv *priv = dev->priv; + struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_mcast_iter *iter; struct rb_node *node, *parent = NULL; From tduffy at sun.com Tue Sep 14 13:28:04 2004 From: tduffy at sun.com (Tom Duffy) Date: Tue, 14 Sep 2004 13:28:04 -0700 Subject: [openib-general] [ANNOUNCE] tvflash: userspace-only Mellanox HCA flash tool In-Reply-To: <52pt4o39ds.fsf@topspin.com> References: <52d60udpfd.fsf@topspin.com> <1095188233.6945.8.camel@duffman> <52pt4o39ds.fsf@topspin.com> Message-ID: <1095193684.6945.29.camel@duffman> On Tue, 2004-09-14 at 13:04 -0700, Roland Dreier wrote: > OK, can you give it another shot now? I added an "__attribute__((packed))" > for the Topspin VSD struct. 
When I split things up refactoring some > code I changed the alignment of some fields so that it no longer gets > packed naturally on 64-bit archs. Yeah, it looks good as far as compiling goes now. Thanks. I still am trying to figure out why running it on sparc64 is hanging in the do loop of flash_write_cmd(): tat:~# strace /tmp/tvflash -i execve("/tmp/tvflash", ["/tmp/tvflash", "-i"], [/* 16 vars */]) = 0 uname({sys="Linux", node="tat", ...}) = 0 brk(0) = 0x268f0 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory) open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory) open("/etc/ld.so.cache", O_RDONLY) = 3 fstat64(3, {st_mode=S_IFREG|0644, st_size=50853, ...}) = 0 mmap(NULL, 50853, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7002c000 close(3) = 0 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory) open("/usr/lib/libpci.so.2", O_RDONLY) = 3 read(3, "\177ELF\1\2\1\0\0\0\0\0\0\0\0\0\0\3\0\2\0\0\0\1\0\0\30"..., 512) = 512 fstat64(3, {st_mode=S_IFREG|0644, st_size=33288, ...}) = 0 mmap(NULL, 97784, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x7003c000 mprotect(0x70044000, 65016, PROT_NONE) = 0 mmap(0x7004c000, 32768, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0x7004c000 close(3) = 0 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory) open("/lib/libc.so.6", O_RDONLY) = 3 read(3, "\177ELF\1\2\1\0\0\0\0\0\0\0\0\0\0\3\0\2\0\0\0\1\0\1\316"..., 512) = 512fstat64(3, {st_mode=S_IFREG|0644, st_size=1291820, ...}) = 0 mmap(NULL, 1361736, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x70054000 mprotect(0x70188000, 100168, PROT_NONE) = 0 mmap(0x70194000, 49152, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0x130000) = 0x70194000 mmap(0x701a0000, 1864, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x701a0000 close(3) = 0 munmap(0x7002c000, 50853) = 0 ioctl(0, 0x40087468, 0xeffffcb0) = 0 brk(0) = 0x268f0 brk(0x488f0) = 0x488f0 brk(0) = 0x488f0 
brk(0x4a000) = 0x4a000 access("/sys/bus/pci/devices", R_OK) = -1 ENOENT (No such file or directory) uname({sys="Linux", node="tat", ...}) = 0 access("/proc/bus/pci", R_OK) = 0 open("/proc/bus/pci/devices", O_RDONLY) = 3 fstat64(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7001a000 read(3, "0000\t108e8000\t0\t 0"..., 8192) = 2060 open("/proc/bus/pci/00/00.0", O_RDONLY) = 4 pread(4, "\0", 1, 14) = 1 close(4) = 0 open("/proc/bus/pci/00/01.0", O_RDONLY) = 4 pread(4, "\200", 1, 14) = 1 close(4) = 0 open("/proc/bus/pci/00/01.1", O_RDONLY) = 4 pread(4, "\200", 1, 14) = 1 close(4) = 0 open("/proc/bus/pci/00/03.0", O_RDONLY) = 4 pread(4, "\200", 1, 14) = 1 close(4) = 0 open("/proc/bus/pci/00/03.1", O_RDONLY) = 4 pread(4, "\200", 1, 14) = 1 close(4) = 0 open("/proc/bus/pci/01/00.0", O_RDONLY) = 4 pread(4, "\0", 1, 14) = 1 close(4) = 0 open("/proc/bus/pci/01/01.0", O_RDONLY) = 4 pread(4, "\1", 1, 14) = 1 close(4) = 0 open("/proc/bus/pci/02/00.0", O_RDONLY) = 4 pread(4, "\0", 1, 14) = 1 read(3, "", 8192) = 0 close(3) = 0 munmap(0x7001a000, 8192) = 0 open("/dev/mem", O_RDWR) = 3 mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0x100000) = 0x701a4000 close(3) = 0 ....hangs here.... # gdb tvflash GNU gdb 6.1-debian Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "sparc-linux"...Using host libthread_db library "/lib/libthread_db.so.1". (gdb) run -i Starting program: /build2/tduffy/tvflash/src/tvflash -i Program received signal SIGTSTP, Stopped (user). 
0x00011b0c in READ_CFG (addr=983460) at tvflash.c:499
499     } /* READ_CFG */
(gdb) bt
#0  0x00011b0c in READ_CFG (addr=983460) at tvflash.c:499
#1  0x00011ca0 in flash_write_cmd (addr=0) at tvflash.c:555
#2  0x00012df4 in identify_hca (num=0, tvdev=0x26b40, identify_mode=IDENTIFY_EXTENDED) at tvflash.c:1153
#3  0x000132f0 in identify_hcas (hca=-1, identify_mode=IDENTIFY_EXTENDED) at tvflash.c:1325
#4  0x00014e98 in main (argc=2, argv=0xeffffd64) at tvflash.c:2166

-- 
"When they took the 4th Amendment, I was quiet because I didn't deal drugs.
When they took the 6th Amendment, I was quiet because I am innocent.
When they took the 2nd Amendment, I was quiet because I don't own a gun.
Now they have taken the 1st Amendment, and I can only be quiet."
--Lyle Myhr
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL:

From roland at topspin.com Tue Sep 14 13:32:53 2004
From: roland at topspin.com (Roland Dreier)
Date: Tue, 14 Sep 2004 13:32:53 -0700
Subject: [openib-general] [ANNOUNCE] tvflash: userspace-only Mellanox HCA flash tool
In-Reply-To: <1095193684.6945.29.camel@duffman> (Tom Duffy's message of "Tue, 14 Sep 2004 13:28:04 -0700")
References: <52d60udpfd.fsf@topspin.com> <1095188233.6945.8.camel@duffman> <52pt4o39ds.fsf@topspin.com> <1095193684.6945.29.camel@duffman>
Message-ID: <52hdq03822.fsf@topspin.com>

    Tom> I still am trying to figure out why running it on sparc64 is
    Tom> hanging in the do loop of flash_write_cmd():

Can you try it with -c as well (to use PCI config header rather than
mmap'ed /dev/mem)?

 - R.
From tduffy at sun.com Tue Sep 14 13:43:30 2004
From: tduffy at sun.com (Tom Duffy)
Date: Tue, 14 Sep 2004 13:43:30 -0700
Subject: [openib-general] [ANNOUNCE] tvflash: userspace-only Mellanox HCA flash tool
In-Reply-To: <52hdq03822.fsf@topspin.com>
References: <52d60udpfd.fsf@topspin.com> <1095188233.6945.8.camel@duffman> <52pt4o39ds.fsf@topspin.com> <1095193684.6945.29.camel@duffman> <52hdq03822.fsf@topspin.com>
Message-ID: <1095194610.6945.31.camel@duffman>

On Tue, 2004-09-14 at 13:32 -0700, Roland Dreier wrote:
> Tom> I still am trying to figure out why running it on sparc64 is
> Tom> hanging in the do loop of flash_write_cmd():
>
> Can you try it with -c as well (to use PCI config header rather than
> mmap'ed /dev/mem)?

Yeah, ok, that seems to work for the -i. Now, should I be brave and try
to update the firmware?

# ./tvflash -c -i
HCA #0: Found MT23108, Cougar, revision A1
  Primary image is valid, unknown source
  Secondary image is valid, unknown source

-tduffy
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL:

From roland at topspin.com Tue Sep 14 13:47:00 2004
From: roland at topspin.com (Roland Dreier)
Date: Tue, 14 Sep 2004 13:47:00 -0700
Subject: [openib-general] [ANNOUNCE] tvflash: userspace-only Mellanox HCA flash tool
In-Reply-To: <1095194610.6945.31.camel@duffman> (Tom Duffy's message of "Tue, 14 Sep 2004 13:43:30 -0700")
References: <52d60udpfd.fsf@topspin.com> <1095188233.6945.8.camel@duffman> <52pt4o39ds.fsf@topspin.com> <1095193684.6945.29.camel@duffman> <52hdq03822.fsf@topspin.com> <1095194610.6945.31.camel@duffman>
Message-ID: <52d60o37ej.fsf@topspin.com>

    Tom> Yeah, ok, that seems to work for the -i. Now, should I be
    Tom> brave and try to update the firmware?

Up to you :)

Worst case you'll have to pull the card, add a jumper to disable the
flash, and reburn it with another tool.

 - R.
From halr at voltaire.com Tue Sep 14 14:09:13 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Tue, 14 Sep 2004 17:09:13 -0400
Subject: [openib-general] Multicast address aliasing in IPoIB
In-Reply-To: <52wtz2etda.fsf@topspin.com>
References: <506C3D7B14CDD411A52C00025558DED605E00F19@mtlex01.yok.mtl.com> <52wtz2etda.fsf@topspin.com>
Message-ID: <1095196150.1830.288.camel@localhost.localdomain>

On Thu, 2004-09-09 at 22:43, Roland Dreier wrote:
> I see no reason to modify the IPoIB multicast mapping just because the
> Linux kernel does not yet have support for it.

Agreed.

> For newer kernels, there is no reason that the IPoIB driver has to
> masquerade at an ethernet driver -- we should be aiming for a fully
> native driver that sets its dev->type field to ARPHRD_INFINIBAND.

Hopefully we are moving to this if not already there.

> Once the OpenIB IPoIB driver is ready to be merged upstream, it should
> be no problem to get the trivial changes required in the core
> networking code merged. As far as I can tell, the only changes needed
> would be: implement an ip_ib_mc_map() function and add
>
>     case ARPHRD_INFINIBAND:
>             ip_ib_mc_map(addr, haddr);
>             return 0;
>
> to arp_mc_map() in net/ipv4/arp.c and to make the analogous addition
> for ndisc_mc_map() in net/ipv6/ndisc.c. (ARPHRD_INFINIBAND is already
> defined in the Linux headers -- I got that merged back in early 2003)

The only issue I see is where the scope and PKey would come from to form
the MGID. The scope could default to link local (2). I would presume the
PKey would come from the IPoIB interface somehow.

-- Hal

From mst at mellanox.co.il Tue Sep 14 14:10:06 2004
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 15 Sep 2004 00:10:06 +0300
Subject: [openib-general] [ANNOUNCE] tvflash: userspace-only Mellanox HCA flash tool
In-Reply-To: <1095194610.6945.31.camel@duffman>
References: <1095194610.6945.31.camel@duffman>
Message-ID: <20040914211005.GA2965@mellanox.co.il>

Hello!
Quoting r.
Tom Duffy (tduffy at sun.com) "Re: [openib-general] [ANNOUNCE] tvflash: userspace-only Mellanox HCA flash tool":
> _______________________________________________
> On Tue, 2004-09-14 at 13:32 -0700, Roland Dreier wrote:
> > Tom> I still am trying to figure out why running it on sparc64 is
> > Tom> hanging in the do loop of flash_write_cmd():
> >
> > Can you try it with -c as well (to use PCI config header rather than
> > mmap'ed /dev/mem)?
>
> Yeah, ok, that seems to work for the -i. Now, should I be brave and try
> to update the firmware?
>
> # ./tvflash -c -i
> HCA #0: Found MT23108, Cougar, revision A1
> Primary image is valid, unknown source
> Secondary image is valid, unknown source
>
> -tduffy

Doesnt flint work for you?

MST

From Tom.Duffy at Sun.COM Tue Sep 14 14:23:09 2004
From: Tom.Duffy at Sun.COM (Tom Duffy)
Date: Tue, 14 Sep 2004 14:23:09 -0700
Subject: [openib-general] [ANNOUNCE] tvflash: userspace-only Mellanox HCA flash tool
In-Reply-To: <52d60o37ej.fsf@topspin.com>
References: <52d60udpfd.fsf@topspin.com> <1095188233.6945.8.camel@duffman> <52pt4o39ds.fsf@topspin.com> <1095193684.6945.29.camel@duffman> <52hdq03822.fsf@topspin.com> <1095194610.6945.31.camel@duffman> <52d60o37ej.fsf@topspin.com>
Message-ID: <1095196989.6945.43.camel@duffman>

On Tue, 2004-09-14 at 13:47 -0700, Roland Dreier wrote:
> Up to you :)
>
> Worst case you'll have to pull the card, add a jumper to disable the
> flash, and reburn it with another tool.

No worst case, flashing works on sparc64!

# uname -a
Linux tat 2.6.9-rc2 #1 SMP Tue Sep 14 13:27:46 PDT 2004 sparc64 GNU/Linux

# ./tvflash -c /tmp/fw-23108-a1-3.2.0.cougar.bin
New Node GUID = 0002c90109765fd0
New Port1 GUID = 0002c90109765fd1
New Port2 GUID = 0002c90109765fd2
Programming Tavor Microcode...
Flash Image Size = 342968
Failsafe [==================================================================]
Erasing [==================================================================]
Writing [==================================================================]
Verifying [==================================================================]
Flash verify passed!

# reboot

# dmesg | tail
ib_mthca: Mellanox InfiniBand HCA driver v0.05-pre (June 13, 2004)
ib_mthca: Initializing (0000:81:00.0)
ib_mthca 0000:81:00.0: Warning: couldn't set 64-bit PCI DMA mask.
ib_mthca 0000:81:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask.

# cat /sys/class/infiniband/mthca0/fw_ver
3.2.0

----

However, I am getting a bunch of these coming in my logs...

Sep 14 14:15:36 localhost kernel: [KERNEL_IB][ib_mad_handle_wc][/build1/tduffy/openib-work/linux-2.6.9-rc2-openib/drivers/infiniband/core/mad_ib.c:179]completion status 4 for mthca0 index 0 port 1 qpn 0 send 0 bytes 2

For index 0 - 127.

And my port 1 never goes into ACTIVE, stays in INIT.

-tduffy
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL:

From mst at mellanox.co.il Tue Sep 14 14:29:50 2004
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 15 Sep 2004 00:29:50 +0300
Subject: [openib-general] [ANNOUNCE] tvflash: userspace-only Mellanox HCA flash tool
In-Reply-To: <1095196989.6945.43.camel@duffman>
References: <1095196989.6945.43.camel@duffman>
Message-ID: <20040914212950.GB2965@mellanox.co.il>

Most likely the case of firmware/board mismatch.

Quoting r.
Tom Duffy (Tom.Duffy at Sun.COM) "Re: [openib-general] [ANNOUNCE] tvflash: userspace-only Mellanox HCA flash tool":
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>
> Subject:
> Date: Wed, 15 Sep 2004 00:22:09 +0300
>
>
> On Tue, 2004-09-14 at 13:47 -0700, Roland Dreier wrote:
> > Up to you :)
> >
> > Worst case you'll have to pull the card, add a jumper to disable the
> > flash, and reburn it with another tool.
>
> No worst case, flashing works on sparc64!
>
> # uname -a
> Linux tat 2.6.9-rc2 #1 SMP Tue Sep 14 13:27:46 PDT 2004 sparc64 GNU/Linux
>
> # ./tvflash -c /tmp/fw-23108-a1-3.2.0.cougar.bin
> New Node GUID = 0002c90109765fd0
> New Port1 GUID = 0002c90109765fd1
> New Port2 GUID = 0002c90109765fd2
> Programming Tavor Microcode... Flash Image Size = 342968
> Failsafe [==================================================================]
> Erasing [==================================================================]
> Writing [==================================================================]
> Verifying [==================================================================]
> Flash verify passed!
>
> # reboot
>
> # dmesg | tail
> ib_mthca: Mellanox InfiniBand HCA driver v0.05-pre (June 13, 2004)
> ib_mthca: Initializing (0000:81:00.0)
> ib_mthca 0000:81:00.0: Warning: couldn't set 64-bit PCI DMA mask.
> ib_mthca 0000:81:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask.
>
> # cat /sys/class/infiniband/mthca0/fw_ver
> 3.2.0
>
> ----
>
> However, I am getting a bunch of these coming in my logs...
>
> Sep 14 14:15:36 localhost kernel: [KERNEL_IB][ib_mad_handle_wc][/build1/tduffy/openib-work/linux-2.6.9-rc2-openib/drivers/infiniband/core/mad_ib.c:179]completion status 4 for mthca0 index 0 port 1 qpn 0 send 0 bytes 2
>
> For index 0 - 127.
>
> And my port 1 never goes into ACTIVE, stays in INIT.
>
> -tduffy

From tduffy at sun.com Tue Sep 14 14:55:45 2004
From: tduffy at sun.com (Tom Duffy)
Date: Tue, 14 Sep 2004 14:55:45 -0700
Subject: [openib-general] [ANNOUNCE] tvflash: userspace-only Mellanox HCA flash tool
In-Reply-To: <20040914211005.GA2965@mellanox.co.il>
References: <1095194610.6945.31.camel@duffman> <20040914211005.GA2965@mellanox.co.il>
Message-ID: <1095198945.6945.47.camel@duffman>

On Wed, 2004-09-15 at 00:10 +0300, Michael S. Tsirkin wrote:
> Doesnt flint work for you?

Honestly, I haven't really tried that. And when I try to run the build
scripts, I get:

Unsupported platform: sparc64

-tduffy

-- 
"When they took the 4th Amendment, I was quiet because I didn't deal drugs.
When they took the 6th Amendment, I was quiet because I am innocent.
When they took the 2nd Amendment, I was quiet because I don't own a gun.
Now they have taken the 1st Amendment, and I can only be quiet."
--Lyle Myhr
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL:

From tduffy at sun.com Tue Sep 14 14:57:58 2004
From: tduffy at sun.com (Tom Duffy)
Date: Tue, 14 Sep 2004 14:57:58 -0700
Subject: [openib-general] [ANNOUNCE] tvflash: userspace-only Mellanox HCA flash tool
In-Reply-To: <20040914212950.GB2965@mellanox.co.il>
References: <1095196989.6945.43.camel@duffman> <20040914212950.GB2965@mellanox.co.il>
Message-ID: <1095199078.6945.52.camel@duffman>

On Wed, 2004-09-15 at 00:29 +0300, Michael S. Tsirkin wrote:
> Most likely the case of firmware/board mismatch.

What is? And I have all a1 cougars...so, I don't think this is the
case.

-tduffy

-- 
"When they took the 4th Amendment, I was quiet because I didn't deal drugs.
When they took the 6th Amendment, I was quiet because I am innocent.
When they took the 2nd Amendment, I was quiet because I don't own a gun.
Now they have taken the 1st Amendment, and I can only be quiet."
--Lyle Myhr
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL:

From roland at topspin.com Tue Sep 14 15:01:45 2004
From: roland at topspin.com (Roland Dreier)
Date: Tue, 14 Sep 2004 15:01:45 -0700
Subject: [openib-general] [ANNOUNCE] tvflash: userspace-only Mellanox HCA flash tool
In-Reply-To: <1095196989.6945.43.camel@duffman> (Tom Duffy's message of "Tue, 14 Sep 2004 14:23:09 -0700")
References: <52d60udpfd.fsf@topspin.com> <1095188233.6945.8.camel@duffman> <52pt4o39ds.fsf@topspin.com> <1095193684.6945.29.camel@duffman> <52hdq03822.fsf@topspin.com> <1095194610.6945.31.camel@duffman> <52d60o37ej.fsf@topspin.com> <1095196989.6945.43.camel@duffman>
Message-ID: <52zn3s1pdi.fsf@topspin.com>

    Tom> ib_mthca 0000:81:00.0: Warning: couldn't set 64-bit PCI DMA mask.
    Tom> ib_mthca 0000:81:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask.

Hmm, I wonder what's up with that. Does sparc64 limit PCI addresses to
the low 4G or something?

    Tom> [KERNEL_IB][ib_mad_handle_wc][/build1/tduffy/openib-work/linux-2.6.9-rc2-openib/drivers/infiniband/core/mad_ib.c:179]completion status 4 for mthca0 index 0 port 1 qpn 0 send 0 bytes 2

Probably the kernel isn't mapping RAM to DMA addresses in the range
0...(high_memory - PAGE_OFFSET) (ib_mad_register_memory() kind of
implicitly assumes this). It might be interesting to dump the value
of high_memory - PAGE_OFFSET, and what scatter_list.addr is getting
back from pci_map_single in ib_mad_post_receive().

I know how to fix this properly, it just requires a couple small API
extensions.

 - R.
From mshefty at ichips.intel.com Tue Sep 14 15:18:39 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 14 Sep 2004 15:18:39 -0700
Subject: [openib-general] [PATCH] beginnings of SMI
Message-ID: <20040914151839.0ece83e6.mshefty@ichips.intel.com>

This patch begins to add in functionality needed by the SMI implementation.

- Sean

--
Index: access/ib_smi.c
===================================================================
--- access/ib_smi.c (revision 0)
+++ access/ib_smi.c (revision 0)
@@ -0,0 +1,125 @@
+/*
+  This software is available to you under a choice of one of two
+  licenses. You may choose to be licensed under the terms of the GNU
+  General Public License (GPL) Version 2, available at
+  , or the OpenIB.org BSD
+  license, available in the LICENSE.TXT file accompanying this
+  software. These details are also available at
+  .
+
+  THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+  EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+  MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+  NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+  BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+  ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+  CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+  SOFTWARE.
+
+  Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved.
+  Copyright (c) 2004 Infinicon Corporation. All rights reserved.
+  Copyright (c) 2004 Intel Corporation. All rights reserved.
+  Copyright (c) 2004 Topspin Corporation. All rights reserved.
+  Copyright (c) 2004 Voltaire Corporation. All rights reserved.
+*/
+
+#include
+#include "ib_mad_priv.h"
+
+int smi_process_dr_smp(struct ib_mad_port_private *port_priv,
+		       struct ib_smp *smp)
+{
+	u8 hop_ptr, hop_cnt;
+
+	hop_ptr = smp->hop_ptr;
+	hop_cnt = smp->hop_cnt;
+
+	/*
+	 * Outgoing MAD processing. "Outgoing" means from initiator to responder.
+	 * Section 14.2.2.2, Vol 1 IB spec
+	 */
+	if (!ib_get_smp_direction(smp)) {
+		/* C14-9:1 */
+		if (hop_ptr == 0 && hop_cnt)
+			return 0;
+
+		/* C14-9:2 */
+		if (hop_ptr && hop_ptr < hop_cnt) {
+			if (port_priv->device->node_type == IB_NODE_SWITCH) {
+				printk(KERN_NOTICE,
+				       "Need to handle DR Mad on switch");
+			}
+			return 0;
+		}
+
+		/* C14-9:3 -- We're at the end of the DR segment of path */
+		if (hop_ptr == hop_cnt) {
+			if (hop_cnt)
+				smp->return_path[hop_ptr] = port_priv->port;
+			smp->hop_ptr++;
+
+			if (port_priv->device->node_type == IB_NODE_SWITCH) {
+				printk(KERN_NOTICE,
+				       "Need to handle DR Mad on switch");
+				return 0;
+			} else if (smp->dr_dlid != IB_LID_PERMISSIVE) {
+				return 0;
+			}
+
+			return 1;
+		}
+
+		/* C14-9:4 -- Hop Pointer = Hop Count + 1 -> give to SMA/SM. */
+		if (hop_ptr == hop_cnt + 1)
+			return 1;
+
+		/* C14-9:5 -- Check for unreasonable hop pointer. */
+		if (hop_ptr > hop_cnt + 1)
+			return 0;
+
+		/* There should be no way of getting here, since one of the if
+		 * statements above should have matched, and should have
+		 * returned a value.
+		 */
+		printk(KERN_ERR, "Unhandled Outgoing DR MAD case.");
+		return 0;
+	} else { /* Returning MAD (From responder to initiator) */
+
+		/* C14-13:1 */
+		if (hop_cnt && hop_ptr == hop_cnt + 1)
+			return 0;
+
+		/* C14-13:2 */
+		if (2 <= hop_ptr && hop_ptr <= hop_cnt) {
+			if (port_priv->device->node_type == IB_NODE_SWITCH) {
+				printk(KERN_NOTICE,
+				       "Need to handle DR Mad on switch");
+			}
+			return 0;
+		}
+
+		/* C14-13:3 -- We're at the end of the DR segment of path */
+		if (hop_ptr == 1) {
+			smp->hop_ptr--;
+
+			if (port_priv->device->node_type == IB_NODE_SWITCH) {
+				printk(KERN_NOTICE,
+				       "Need to handle DR Mad on switch");
+				return 0;
+			} else if (smp->dr_dlid != IB_LID_PERMISSIVE) {
+				return 0;
+			}
+
+			return 1;
+		}
+
+		/* C14-13:4 -- Hop Pointer = 0 -> give to SM. */
+		if (hop_ptr == 0)
+			return 1;
+
+		/* C14-13:5 -- Check for unreasonable hop pointer. */
+		if (hop_ptr > hop_cnt + 1)
+			return 0;
+	}
+	return 1;
+}
Index: include/ib_mad.h
===================================================================
--- include/ib_mad.h (revision 836)
+++ include/ib_mad.h (working copy)
@@ -60,6 +60,11 @@
 #define IB_QP1 cpu_to_be32(1)
 #define IB_QP1_QKEY cpu_to_be32(0x80010000)
 
+#define IB_LID_PERMISSIVE 0xFFFF
+
+#define IB_SMP_DATA_SIZE 64
+#define IB_SMP_MAX_PATH_HOPS 64
+
 struct ib_grh {
 	u32 version_tclass_flow;
 	u16 paylen;
@@ -82,6 +87,35 @@
 	u32 attr_mod;
 } __attribute__ ((packed));
 
+struct ib_smp {
+	u8 base_version;
+	u8 mgmt_class;
+	u8 class_version;
+	u8 method;
+	u16 status;
+	u8 hop_ptr;
+	u8 hop_cnt;
+	u64 tid;
+	u16 attr_id;
+	u16 resv;
+	u32 attr_mod;
+	u64 mkey;
+	u16 dr_slid;
+	u16 dr_dlid;
+	u8 reserved[28];
+	u8 data[IB_SMP_DATA_SIZE];
+	u8 initial_path[IB_SMP_MAX_PATH_HOPS];
+	u8 return_path[IB_SMP_MAX_PATH_HOPS];
+} __attribute__ ((packed));
+
+#define IB_SMP_DIRECTION cpu_to_be16(0x8000)
+
+static inline u8
+ib_get_smp_direction(struct ib_smp *smp)
+{
+	return ((smp->status & IB_SMP_DIRECTION) == IB_SMP_DIRECTION);
+}
+
 struct ib_rmpp_hdr {
 	u8 rmpp_version;
 	u8 rmpp_type;

From johannes at erdfelt.com Tue Sep 14 15:26:14 2004
From: johannes at erdfelt.com (Johannes Erdfelt)
Date: Tue, 14 Sep 2004 15:26:14 -0700
Subject: [openib-general] [PATCH] beginnings of SMI
In-Reply-To: <20040914151839.0ece83e6.mshefty@ichips.intel.com>
References: <20040914151839.0ece83e6.mshefty@ichips.intel.com>
Message-ID: <20040914222614.GI6255@sventech.com>

On Tue, Sep 14, 2004, Sean Hefty wrote:
> +				printk(KERN_NOTICE,
> +				       "Need to handle DR Mad on switch");

[snip]

> +				printk(KERN_NOTICE,
> +				       "Need to handle DR Mad on switch");

[snip]

> +		printk(KERN_ERR, "Unhandled Outgoing DR MAD case.");

[more snipped]

The comma between the logging level (KERN_NOTICE, KERN_ERR, etc) and the
text should not be there.
JE From mshefty at ichips.intel.com Tue Sep 14 15:29:37 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 14 Sep 2004 15:29:37 -0700 Subject: [openib-general] [PATCH] beginnings of SMI In-Reply-To: <20040914222614.GI6255@sventech.com> References: <20040914151839.0ece83e6.mshefty@ichips.intel.com> <20040914222614.GI6255@sventech.com> Message-ID: <20040914152937.49c6c781.mshefty@ichips.intel.com> On Tue, 14 Sep 2004 15:26:14 -0700 Johannes Erdfelt wrote: > On Tue, Sep 14, 2004, Sean Hefty wrote: > > + printk(KERN_NOTICE, > > + "Need to handle DR Mad on switch"); > [snip] > > + printk(KERN_NOTICE, > > + "Need to handle DR Mad on switch"); > [snip] > > + printk(KERN_ERR, "Unhandled Outgoing DR MAD case."); > [more snipped] > > The comma between the logging level (KERN_NOTICE, KERN_ERR, etc) and the > text should not be there. Thanks for catching this. Obviously I hadn't added this to the build yet, since it can't do anything useful yet. I'll update patch and resubmit. From halr at voltaire.com Tue Sep 14 15:35:02 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 14 Sep 2004 18:35:02 -0400 Subject: [openib-general] Re: [PATCH] beginnings of SMI In-Reply-To: <20040914151839.0ece83e6.mshefty@ichips.intel.com> References: <20040914151839.0ece83e6.mshefty@ichips.intel.com> Message-ID: <1095201301.2285.307.camel@localhost.localdomain> On Tue, 2004-09-14 at 18:18, Sean Hefty wrote: > This patch begins to add in functionality needed by the SMI > implementation. Any objections to the SMI header changes going into a separate ib_smi.h rather than into ib_mad.h ? 
-- Hal From tduffy at sun.com Tue Sep 14 15:36:31 2004 From: tduffy at sun.com (Tom Duffy) Date: Tue, 14 Sep 2004 15:36:31 -0700 Subject: [openib-general] [ANNOUNCE] tvflash: userspace-only Mellanox HCA flash tool In-Reply-To: <52zn3s1pdi.fsf@topspin.com> References: <52d60udpfd.fsf@topspin.com> <1095188233.6945.8.camel@duffman> <52pt4o39ds.fsf@topspin.com> <1095193684.6945.29.camel@duffman> <52hdq03822.fsf@topspin.com> <1095194610.6945.31.camel@duffman> <52d60o37ej.fsf@topspin.com> <1095196989.6945.43.camel@duffman> <52zn3s1pdi.fsf@topspin.com> Message-ID: <1095201391.6945.66.camel@duffman> On Tue, 2004-09-14 at 15:01 -0700, Roland Dreier wrote: > Tom> ib_mthca 0000:81:00.0: Warning: couldn't set 64-bit PCI DMA mask. > Tom> ib_mthca 0000:81:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask. > > Hmm, I wonder what's up with that. Does sparc64 limit PCI addresses > to the low 4G or something? from (mainline) pci.c: int pci_set_dma_mask(struct pci_dev *dev, u64 mask) { if (!pci_dma_supported(dev, mask)) return -EIO; ... and sparc's implementation of pci_dma_supported: ... if (device_mask >= (1UL << 32UL)) return 0; ... which seems to me that it does not support >4G. > Tom> [KERNEL_IB][ib_mad_handle_wc][/build1/tduffy/openib-work/linux-2.6.9-rc2-openib/drivers/infiniband/core/mad_ib.c:179]completion status 4 for mthca0 index 0 port 1 qpn 0 send 0 bytes 2 > > Probably the kernel isn't mapping RAM to DMA addresses in the range > 0...(high_memory - PAGE_OFFSET) (ib_mad_register_memory() kind of > implicitly assumes this). It might be interesting to dump the value of > high_memory - PAGE_OFFSET, and what scatter_list.addr is getting back > from pci_map_single in ib_mad_post_receive(). > > I know how to fix this properly, it just requires a couple small API > extensions. high_memory - PAGE_OFFSET = 0x000000007fecc000 scatter_list.addr = 0x00000000c3b7a360 through 00000000c3f96d80 -tduffy -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From mshefty at ichips.intel.com Tue Sep 14 15:38:30 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 14 Sep 2004 15:38:30 -0700 Subject: [openib-general] Re: [PATCH] beginnings of SMI In-Reply-To: <1095201301.2285.307.camel@localhost.localdomain> References: <20040914151839.0ece83e6.mshefty@ichips.intel.com> <1095201301.2285.307.camel@localhost.localdomain> Message-ID: <20040914153830.2eb8db7d.mshefty@ichips.intel.com> On Tue, 14 Sep 2004 18:35:02 -0400 Hal Rosenstock wrote: > On Tue, 2004-09-14 at 18:18, Sean Hefty wrote: > > This patch begins to add in functionality needed by the SMI > > implementation. > > Any objections to the SMI header changes going into a separate ib_smi.h > rather than into ib_mad.h ? Not really. I can move those definitions. I placed them in ib_mad.h to locate the SMP definition close to the MAD header definition, but there's no reason I can't create a new file for it. 
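For readers following the printk() remark earlier in this thread, here is a minimal userspace sketch of why the comma after the log level is a bug. It assumes the 2.6-era definition of KERN_NOTICE as the string literal "<5>"; fake_printk() is a hypothetical stand-in that formats into a buffer and is not a kernel function.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Assumption: 2.6-era log levels are plain string literals. */
#define KERN_NOTICE "<5>"

/* Hypothetical stand-in for printk(): formats into a caller's buffer. */
static int fake_printk(char *buf, size_t len, const char *fmt, ...)
{
	va_list ap;
	int n;

	assert(buf != NULL && len > 0);
	va_start(ap, fmt);
	n = vsnprintf(buf, len, fmt, ap);
	va_end(ap);
	return n;
}

/* No comma: the level and the text concatenate into one format string. */
static int log_without_comma(char *buf, size_t len)
{
	return fake_printk(buf, len,
			   KERN_NOTICE "Need to handle DR Mad on switch");
}

/* With the comma, the level alone is the format; the text is ignored. */
static int log_with_comma(char *buf, size_t len)
{
	return fake_printk(buf, len,
			   KERN_NOTICE, "Need to handle DR Mad on switch");
}
```

With the comma in place only the level prefix would reach the log, which is likely why the mistake is easy to miss until the code is actually built and run.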
From johannes at erdfelt.com Tue Sep 14 15:42:03 2004 From: johannes at erdfelt.com (Johannes Erdfelt) Date: Tue, 14 Sep 2004 15:42:03 -0700 Subject: [openib-general] [PATCH] beginnings of SMI In-Reply-To: <20040914152937.49c6c781.mshefty@ichips.intel.com> References: <20040914151839.0ece83e6.mshefty@ichips.intel.com> <20040914222614.GI6255@sventech.com> <20040914152937.49c6c781.mshefty@ichips.intel.com> Message-ID: <20040914224203.GJ6255@sventech.com> On Tue, Sep 14, 2004, Sean Hefty wrote: > On Tue, 14 Sep 2004 15:26:14 -0700 > Johannes Erdfelt wrote: > > > On Tue, Sep 14, 2004, Sean Hefty wrote: > > > + printk(KERN_NOTICE, > > > + "Need to handle DR Mad on switch"); > > [snip] > > > + printk(KERN_NOTICE, > > > + "Need to handle DR Mad on switch"); > > [snip] > > > + printk(KERN_ERR, "Unhandled Outgoing DR MAD case."); > > [more snipped] > > > > The comma between the logging level (KERN_NOTICE, KERN_ERR, etc) and the > > text should not be there. > > Thanks for catching this. Obviously I hadn't added this to the build yet, > since it can't do anything useful yet. I'll update patch and resubmit. Oh, and don't forget the \n (newline) :) JE From iod00d at hp.com Tue Sep 14 15:48:58 2004 From: iod00d at hp.com (Grant Grundler) Date: Tue, 14 Sep 2004 15:48:58 -0700 Subject: [openib-general] [ANNOUNCE] tvflash: userspace-only Mellanox HCA flash tool In-Reply-To: <52zn3s1pdi.fsf@topspin.com> References: <52d60udpfd.fsf@topspin.com> <1095188233.6945.8.camel@duffman> <52pt4o39ds.fsf@topspin.com> <1095193684.6945.29.camel@duffman> <52hdq03822.fsf@topspin.com> <1095194610.6945.31.camel@duffman> <52d60o37ej.fsf@topspin.com> <1095196989.6945.43.camel@duffman> <52zn3s1pdi.fsf@topspin.com> Message-ID: <20040914224858.GB19535@cup.hp.com> On Tue, Sep 14, 2004 at 03:01:45PM -0700, Roland Dreier wrote: > Tom> ib_mthca 0000:81:00.0: Warning: couldn't set 64-bit PCI DMA mask. > Tom> ib_mthca 0000:81:00.0: Warning: couldn't set 64-bit consistent PCI DMA mask. 
> > Hmm, I wonder what's up with that. Does sparc64 limit PCI addresses > to the low 4G or something? It depends on the IOMMU support which is chipset specific. AFAIK, sparc64/alpha/parisc platforms only support 32-bit DMA and force everything through an IOMMU. > Probably the kernel isn't mapping RAM to DMA addresses in the range > 0...(high_memory - PAGE_OFFSET) (ib_mad_register_memory() kind of > implicitly assumes this). This is a broken assumption. The DMA support can return *anything*. chipsets can use addresses > physical RAM present in the system (assuming < 4GB RAM). > I know how to fix this properly, it just requires a couple small API > extensions. cool. grant From roland at topspin.com Tue Sep 14 15:56:07 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 14 Sep 2004 15:56:07 -0700 Subject: DMA mapping on sparc64 (was Re: [openib-general] [ANNOUNCE] tvflash: userspace-only Mellanox HCA flash tool) In-Reply-To: <1095201391.6945.66.camel@duffman> (Tom Duffy's message of "Tue, 14 Sep 2004 15:36:31 -0700") References: <52d60udpfd.fsf@topspin.com> <1095188233.6945.8.camel@duffman> <52pt4o39ds.fsf@topspin.com> <1095193684.6945.29.camel@duffman> <52hdq03822.fsf@topspin.com> <1095194610.6945.31.camel@duffman> <52d60o37ej.fsf@topspin.com> <1095196989.6945.43.camel@duffman> <52zn3s1pdi.fsf@topspin.com> <1095201391.6945.66.camel@duffman> Message-ID: <52vfeg1muw.fsf_-_@topspin.com> Tom> high_memory - PAGE_OFFSET = 0x000000007fecc000 Tom> scatter_list.addr = 0x00000000c3b7a360 through 00000000c3f96d80 Yup, that's the problem. As a quick hack you can change the .size in ib_mad_register_memory to (1ULL << 32). There's some similar code in ipoib_transport_dev_init() that needs the same treatment too. That should make things work. The correct fix is to add a new operation to return an L_Key that covers any DMA address. I'll post some patches to do that soon (I hope this week). - R. 
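The numbers Tom reported make the failure easy to check by hand. The sketch below is a userspace illustration only: region_covers() is a hypothetical helper, and the 256-byte buffer length is an assumption based on the standard MAD size. It tests whether a DMA address falls inside a region registered as [0, size).

```c
#include <assert.h>
#include <stdint.h>

/* Values taken from Tom's report earlier in the thread. */
#define OLD_REGION_SIZE 0x000000007fecc000ULL /* high_memory - PAGE_OFFSET */
#define HACK_REGION_SIZE (1ULL << 32)          /* the quick-hack .size */
#define FIRST_DMA_ADDR  0x00000000c3b7a360ULL
#define LAST_DMA_ADDR   0x00000000c3f96d80ULL

/* Assumed buffer length: one 256-byte MAD. */
#define MAD_LEN 256ULL

/* Returns 1 if [addr, addr + len) lies inside the region [0, size). */
static int region_covers(uint64_t size, uint64_t addr, uint64_t len)
{
	assert(len > 0);
	return addr < size && len <= size - addr;
}
```

Under these assumptions both reported addresses fall outside the old region but inside the 4 GB one, which matches the quick hack working on this machine while still not being a general fix.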
From iod00d at hp.com Tue Sep 14 15:59:24 2004 From: iod00d at hp.com (Grant Grundler) Date: Tue, 14 Sep 2004 15:59:24 -0700 Subject: [openib-general] [PATCH] beginnings of SMI In-Reply-To: <20040914151839.0ece83e6.mshefty@ichips.intel.com> References: <20040914151839.0ece83e6.mshefty@ichips.intel.com> Message-ID: <20040914225924.GC19535@cup.hp.com> On Tue, Sep 14, 2004 at 03:18:39PM -0700, Sean Hefty wrote: > This patch begins to add in functionality needed by the SMI implementation. ... > + /* There should be no way of getting here, since one of the if > + * statements above should have matched, and should have > + * returned a value. > + */ > + printk(KERN_ERR, "Unhandled Outgoing DR MAD case."); Would it be useful to print something related to the input parameters? The above message really doesn't tell us anything about the unhandled request or where it came from. grant From mshefty at ichips.intel.com Tue Sep 14 16:10:45 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 14 Sep 2004 16:10:45 -0700 Subject: [openib-general] Re: [PATCH] beginnings of SMI In-Reply-To: <1095201301.2285.307.camel@localhost.localdomain> References: <20040914151839.0ece83e6.mshefty@ichips.intel.com> <1095201301.2285.307.camel@localhost.localdomain> Message-ID: <20040914161045.3b390c8b.mshefty@ichips.intel.com> On Tue, 14 Sep 2004 18:35:02 -0400 Hal Rosenstock wrote: > On Tue, 2004-09-14 at 18:18, Sean Hefty wrote: > > This patch begins to add in functionality needed by the SMI > > implementation. > > Any objections to the SMI header changes going into a separate ib_smi.h > rather than into ib_mad.h ? Btw, I *think* that I will eventually need access to ib_mad_cache, along with some minor tweaks to ib_mad_priv.h (e.g. adding smp to ib_mad_private?). I'm also having to anticipate a little how the MAD code will be structured, so that the SMI can send responses. I'll send patches for any changes. 
From tduffy at sun.com Tue Sep 14 16:12:18 2004 From: tduffy at sun.com (Tom Duffy) Date: Tue, 14 Sep 2004 16:12:18 -0700 Subject: DMA mapping on sparc64 (was Re: [openib-general] [ANNOUNCE] tvflash: userspace-only Mellanox HCA flash tool) In-Reply-To: <52vfeg1muw.fsf_-_@topspin.com> References: <52d60udpfd.fsf@topspin.com> <1095188233.6945.8.camel@duffman> <52pt4o39ds.fsf@topspin.com> <1095193684.6945.29.camel@duffman> <52hdq03822.fsf@topspin.com> <1095194610.6945.31.camel@duffman> <52d60o37ej.fsf@topspin.com> <1095196989.6945.43.camel@duffman> <52zn3s1pdi.fsf@topspin.com> <1095201391.6945.66.camel@duffman> <52vfeg1muw.fsf_-_@topspin.com> Message-ID: <1095203538.6945.70.camel@duffman> On Tue, 2004-09-14 at 15:56 -0700, Roland Dreier wrote: > As a quick hack you can change the .size in ib_mad_register_memory to > (1ULL << 32). There's some similar code in ipoib_transport_dev_init() > that needs the same treatment too. That should make things work. This fixed it. Now, the port goes active and I can see my sparc64 node from the SM. Yippie! -tduffy -- "When they took the 4th Amendment, I was quiet because I didn't deal drugs. When they took the 6th Amendment, I was quiet because I am innocent. When they took the 2nd Amendment, I was quiet because I don't own a gun. Now they have taken the 1st Amendment, and I can only be quiet." --Lyle Myhr -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Tue Sep 14 16:14:28 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 14 Sep 2004 16:14:28 -0700 Subject: [openib-general] Re: DMA mapping on sparc64 In-Reply-To: <1095203538.6945.70.camel@duffman> (Tom Duffy's message of "Tue, 14 Sep 2004 16:12:18 -0700") References: <52d60udpfd.fsf@topspin.com> <1095188233.6945.8.camel@duffman> <52pt4o39ds.fsf@topspin.com> <1095193684.6945.29.camel@duffman> <52hdq03822.fsf@topspin.com> <1095194610.6945.31.camel@duffman> <52d60o37ej.fsf@topspin.com> <1095196989.6945.43.camel@duffman> <52zn3s1pdi.fsf@topspin.com> <1095201391.6945.66.camel@duffman> <52vfeg1muw.fsf_-_@topspin.com> <1095203538.6945.70.camel@duffman> Message-ID: <52r7p41m0b.fsf@topspin.com> Tom> This fixed it. Now, the port goes active and I can see my Tom> sparc64 node from the SM. Yippie! Cool. Thanks for testing -- running on as many different platforms as possible definitely makes our code better. - R. From halr at voltaire.com Tue Sep 14 16:18:01 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 14 Sep 2004 19:18:01 -0400 Subject: [openib-general] Re: [PATCH] beginnings of SMI In-Reply-To: <20040914161045.3b390c8b.mshefty@ichips.intel.com> References: <20040914151839.0ece83e6.mshefty@ichips.intel.com> <1095201301.2285.307.camel@localhost.localdomain> <20040914161045.3b390c8b.mshefty@ichips.intel.com> Message-ID: <1095203880.1830.327.camel@localhost.localdomain> On Tue, 2004-09-14 at 19:10, Sean Hefty wrote: > Btw, I *think* that I will eventually need access to ib_mad_cache, > along with some minor tweaks to ib_mad_priv.h (e.g. adding smp > to ib_mad_private?). OK. > I'm also having to anticipate a little how the MAD code will > be structured, so that the SMI can send responses. Are you referring to turning a buffer around ? > I'll send patches for any changes. Thanks. 
-- Hal From robert.j.woodruff at intel.com Tue Sep 14 16:58:38 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 14 Sep 2004 16:58:38 -0700 Subject: [openib-general] CLI command that disables the SM in a Topspin switch Message-ID: <1AC79F16F5C5284499BB9591B33D6F000205E925@orsmsx408> Does anyone know the CLI command to enter into a Topspin90 switch to disable the SM in the switch ? I want to try to use the OpenSM, but need to disable the SM in the switch first. After finding the right cable, I was able to log in as super user but cannot figure out from the documentation or command line help how to disable the SM. woody From roland at topspin.com Tue Sep 14 17:02:15 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 14 Sep 2004 17:02:15 -0700 Subject: [openib-general] CLI command that disables the SM in a Topspin switch In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F000205E925@orsmsx408> (Robert J. Woodruff's message of "Tue, 14 Sep 2004 16:58:38 -0700") References: <1AC79F16F5C5284499BB9591B33D6F000205E925@orsmsx408> Message-ID: <52mzzs1jso.fsf@topspin.com> Robert> Does anyone know the CLI command to enter into a Topspin90 Robert> switch to disable the SM in the switch ? I want to try to Robert> use the OpenSM, but need to disable the SM in the switch Robert> first. I think this should work: enable configure no ib sm subnet-prefix fe:80:00:00:00:00:00:00 - R. 
From roland at topspin.com Tue Sep 14 20:38:53 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 14 Sep 2004 20:38:53 -0700 Subject: [openib-general] Reserved L_Key API (was Re: DMA mapping on sparc64) In-Reply-To: <52r7p41m0b.fsf@topspin.com> (Roland Dreier's message of "Tue, 14 Sep 2004 16:14:28 -0700") References: <52d60udpfd.fsf@topspin.com> <1095188233.6945.8.camel@duffman> <52pt4o39ds.fsf@topspin.com> <1095193684.6945.29.camel@duffman> <52hdq03822.fsf@topspin.com> <1095194610.6945.31.camel@duffman> <52d60o37ej.fsf@topspin.com> <1095196989.6945.43.camel@duffman> <52zn3s1pdi.fsf@topspin.com> <1095201391.6945.66.camel@duffman> <52vfeg1muw.fsf_-_@topspin.com> <1095203538.6945.70.camel@duffman> <52r7p41m0b.fsf@topspin.com> Message-ID: <52ekl419rm.fsf_-_@topspin.com> Based on Tom's sparc64 testing, I'd like to design an API for consumers (MAD layer, IPoIB, etc) who want to do local DMA to arbitrary addresses. Our current hack of registering all of memory by assuming that DMA addresses will be between 0 and (high_memory - PAGE_OFFSET) is not valid (as sparc64 shows) and probably won't be accepted into the kernel. For new HCAs that support the base memory management extensions, the consumer can just use the reserved L_Key. It is almost possible to simulate this with Tavor: one can create a memory region that does not perform any address translation (and just uses the address given in a work request as a PCI bus address), but it is not possible to turn off PD enforcement. This means we need an API that allows a consumer to get a "no translation" MR for a given PD. My proposal would be as follows: The low-level driver entry point would just be: struct ib_mr *(*get_dma_mr)(struct ib_pd *); And the client-exposed entry point: struct ib_mr *ib_get_dma_mr(struct ib_pd *); Only the L_Key of this MR would be valid, and it would always have local write access (to match the semantics of reserved L_Key). 
If the HCA supports reserved L_Key, it can just return the same L_Key for every consumer. If need be it can take the PD into account. It is required for the consumer to call ib_dereg_mr() on this MR when exiting, but this can be a NOP for HCAs that support reserved L_Key. I would argue that this entry point should replace reg_phys_mr as a mandatory low-level driver function; this will simplify the implementation of consumers that use the API. Devices that can't even simulate reserved L_Key like Tavor (and I don't know of any such devices -- even on Topspin's embedded platforms I could implement this API) could just register a giant address range in a normal physical MR (and even use pci_set_dma_mask() to limit the size of the MR to 4 GB if they're really limited). Comments? Better naming ideas? Thanks, Roland From mst at mellanox.co.il Tue Sep 14 21:00:34 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 15 Sep 2004 07:00:34 +0300 Subject: DMA mapping on sparc64 (was Re: [openib-general] [ANNOUNCE] tvflash: userspace-only Mellanox HCA flash tool) In-Reply-To: <52vfeg1muw.fsf_-_@topspin.com> References: <1095188233.6945.8.camel@duffman> <52pt4o39ds.fsf@topspin.com> <1095193684.6945.29.camel@duffman> <52hdq03822.fsf@topspin.com> <1095194610.6945.31.camel@duffman> <52d60o37ej.fsf@topspin.com> <1095196989.6945.43.camel@duffman> <52zn3s1pdi.fsf@topspin.com> <1095201391.6945.66.camel@duffman> <52vfeg1muw.fsf_-_@topspin.com> Message-ID: <20040915040034.GC10214@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "DMA mapping on sparc64 (was Re: [openib-general] [ANNOUNCE] tvflash: userspace-only Mellanox HCA flash tool)": > Tom> high_memory - PAGE_OFFSET = 0x000000007fecc000 > > Tom> scatter_list.addr = 0x00000000c3b7a360 through 00000000c3f96d80 > > Yup, that's the problem. > > As a quick hack you can change the .size in ib_mad_register_memory to > (1ULL << 32). 
There's some similar code in ipoib_transport_dev_init() > that needs the same treatment too. That should make things work. > > The correct fix is to add a new operation to return an L_Key that > covers any DMA address. I'll post some patches to do that soon (I > hope this week). > > - R. I think this is going to break anyway when you actually have more than 4G memory that you can DMA into. I think it's more portable to allocate MADs out of some pool, registering this pool with an lkey. MST From roland at topspin.com Tue Sep 14 21:04:45 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 14 Sep 2004 21:04:45 -0700 Subject: [openib-general] Re: DMA mapping on sparc64 In-Reply-To: <20040915040034.GC10214@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 15 Sep 2004 07:00:34 +0300") References: <1095188233.6945.8.camel@duffman> <52pt4o39ds.fsf@topspin.com> <1095193684.6945.29.camel@duffman> <52hdq03822.fsf@topspin.com> <1095194610.6945.31.camel@duffman> <52d60o37ej.fsf@topspin.com> <1095196989.6945.43.camel@duffman> <52zn3s1pdi.fsf@topspin.com> <1095201391.6945.66.camel@duffman> <52vfeg1muw.fsf_-_@topspin.com> <20040915040034.GC10214@mellanox.co.il> Message-ID: <52vfegyy76.fsf@topspin.com> Michael> I think this is going to break anyway when you actually Michael> have more than 4G memory that you can DMA into. I think Michael> it's more portable to allocate MADs out of some pool, Michael> registering this pool with an lkey. Of course my hack breaks if we get DMA addresses above 4G. That's why I call it a quick hack (for a specific situation) and not a general solution. It's possible to allocate a chunk of memory and use it as a MAD pool. However, I think it's better to use an L_Key with translation off and use any DMA address the kernel wants to give us. See the email I just sent for my proposal on how to do this. - R. 
From roland at topspin.com Tue Sep 14 21:13:04 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 14 Sep 2004 21:13:04 -0700 Subject: [openib-general] Re: DMA mapping on sparc64 In-Reply-To: <20040915040034.GC10214@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 15 Sep 2004 07:00:34 +0300") References: <1095188233.6945.8.camel@duffman> <52pt4o39ds.fsf@topspin.com> <1095193684.6945.29.camel@duffman> <52hdq03822.fsf@topspin.com> <1095194610.6945.31.camel@duffman> <52d60o37ej.fsf@topspin.com> <1095196989.6945.43.camel@duffman> <52zn3s1pdi.fsf@topspin.com> <1095201391.6945.66.camel@duffman> <52vfeg1muw.fsf_-_@topspin.com> <20040915040034.GC10214@mellanox.co.il> Message-ID: <52r7p4yxtb.fsf@topspin.com> Michael> I think this is going to break anyway when you actually Michael> have more than 4G memory that you can DMA into. I think Michael> its more portable to allocate mads out of some pool, Michael> registering this pool with an lkey. Oh yeah... I forgot to mention that the MAD pool idea doesn't help for IPoIB, SDP, ... (where we can't control where the buffers we need to use are allocated) - R. From mst at mellanox.co.il Tue Sep 14 21:23:39 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 15 Sep 2004 07:23:39 +0300 Subject: [openib-general] Re: DMA mapping on sparc64 In-Reply-To: <52r7p4yxtb.fsf@topspin.com> References: <1095193684.6945.29.camel@duffman> <52hdq03822.fsf@topspin.com> <1095194610.6945.31.camel@duffman> <52d60o37ej.fsf@topspin.com> <1095196989.6945.43.camel@duffman> <52zn3s1pdi.fsf@topspin.com> <1095201391.6945.66.camel@duffman> <52vfeg1muw.fsf_-_@topspin.com> <20040915040034.GC10214@mellanox.co.il> <52r7p4yxtb.fsf@topspin.com> Message-ID: <20040915042339.GD10214@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: DMA mapping on sparc64": > Michael> I think this is going to break anyway when you actually > Michael> have more than 4G memory that you can DMA into. 
I think > Michael> its more portable to allocate mads out of some pool, > Michael> registering this pool with an lkey. > > Oh yeah... I forgot to mention that the MAD pool idea doesn't help for > IPoIB, SDP, ... (where we can't control where the buffers we need to > use are allocated) > > - R. Maybe for IP over IB you can have a pool of pre-allocated buffers. From mst at mellanox.co.il Tue Sep 14 21:34:55 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 15 Sep 2004 07:34:55 +0300 Subject: [openib-general] Reserved L_Key API (was Re: DMA mapping on sparc64) In-Reply-To: <52ekl419rm.fsf_-_@topspin.com> References: <52hdq03822.fsf@topspin.com> <1095194610.6945.31.camel@duffman> <52d60o37ej.fsf@topspin.com> <1095196989.6945.43.camel@duffman> <52zn3s1pdi.fsf@topspin.com> <1095201391.6945.66.camel@duffman> <52vfeg1muw.fsf_-_@topspin.com> <1095203538.6945.70.camel@duffman> <52r7p41m0b.fsf@topspin.com> <52ekl419rm.fsf_-_@topspin.com> Message-ID: <20040915043455.GE10214@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "[openib-general] Reserved L_Key API (was Re: DMA mapping on sparc64)": > Based on Tom's sparc64 testing, I'd like to design an API for > consumers (MAD layer, IPoIB, etc) who want to do local DMA to > arbitrary addresses. Our current hack of registering all of memory by > assuming that DMA addresses will be between 0 and (high_memory - > PAGE_OFFSET) is not valid (as sparc64 shows) and probably won't be > accepted into the kernel. > > For new HCAs that support the base memory management extensions, the > consumer can just use the reserved L_Key. It is almost possible to > simulate this with Tavor: one can create a memory region that does not > perform any address translation (and just uses the address given in a > work request as a PCI bus address), but it is not possible to turn off > PD enforcement. > > This means we need an API that allows a consumer to get a "no > translation" MR for a given PD. 
My proposal would be as follows: > > The low-level driver entry point would just be: > > struct ib_mr *(*get_dma_mr)(struct ib_pd *); > > And the client-exposed entry point: > > struct ib_mr *ib_get_dma_mr(struct ib_pd *); > > Only the L_Key of this MR would be valid, and it would always have > local write access (to match the semantics of reserved L_Key). If the > HCA supports reserved L_Key, it can just return the same L_Key for > every consumer. If need be it can take the PD into account. > > It is required for the consumer to call ib_dereg_mr() on this MR when > exiting, but this can be a NOP for HCAs that support reserved L_Key. > > I would argue that this entry point should replace reg_phys_mr as a > mandatory low-level driver function; this will simplify the > implementation of consumers that use the API. Devices that can't even > simulate reserved L_Key like Tavor (and I don't know of any such > devices -- even on Topspin's embedded platforms I could implement this > API) could just register a giant address range in a normal physical MR > (and even use pci_set_dma_mask() to limit the size of the MR to 4 GB > if they're really limited). > > Comments? Better naming ideas? > > Thanks, > Roland Don't you want to basically create a physical memory region covering the whole 64-bit range, and then post full physical addresses in the WQE? Can't you do exactly that with the existing API? Thanks, MST From roland at topspin.com Tue Sep 14 21:34:51 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 14 Sep 2004 21:34:51 -0700 Subject: [openib-general] Re: DMA mapping on sparc64 In-Reply-To: <20040915042339.GD10214@mellanox.co.il> (Michael S. 
Tsirkin's message of "Wed, 15 Sep 2004 07:23:39 +0300") References: <1095193684.6945.29.camel@duffman> <52hdq03822.fsf@topspin.com> <1095194610.6945.31.camel@duffman> <52d60o37ej.fsf@topspin.com> <1095196989.6945.43.camel@duffman> <52zn3s1pdi.fsf@topspin.com> <1095201391.6945.66.camel@duffman> <52vfeg1muw.fsf_-_@topspin.com> <20040915040034.GC10214@mellanox.co.il> <52r7p4yxtb.fsf@topspin.com> <20040915042339.GD10214@mellanox.co.il> Message-ID: <52mzzsywt0.fsf@topspin.com> Michael> Maybe for IP over IB you can have a pool of pre-allocated buffers. You can't control the buffers the network stack passes down for you to send. In fact you can't even control where the skbuffs you allocate for receives will be. If you want to use pre-allocated buffers for IPoIB then you're stuck copying to and from your own buffers. What's wrong with using an L_Key with translation off? It solves all these problems very simply and it works with all known hardware. - R. From mst at mellanox.co.il Tue Sep 14 21:42:49 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 15 Sep 2004 07:42:49 +0300 Subject: [openib-general] Re: DMA mapping on sparc64 In-Reply-To: <52mzzsywt0.fsf@topspin.com> References: <1095194610.6945.31.camel@duffman> <52d60o37ej.fsf@topspin.com> <1095196989.6945.43.camel@duffman> <52zn3s1pdi.fsf@topspin.com> <1095201391.6945.66.camel@duffman> <52vfeg1muw.fsf_-_@topspin.com> <20040915040034.GC10214@mellanox.co.il> <52r7p4yxtb.fsf@topspin.com> <20040915042339.GD10214@mellanox.co.il> <52mzzsywt0.fsf@topspin.com> Message-ID: <20040915044248.GF10214@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: DMA mapping on sparc64": > Michael> Maybe for IP over IB you can have a pool of pre-allocated buffers. > > You can't control the buffers the network stack passes down for you to > send. In fact you can't even control where the skbuffs you allocate > for receives will be. 
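To make the semantics of the proposed entry point concrete, here is a userspace mock. Everything in it is hypothetical: the mock_* types and functions are stand-ins for the kernel structures, and MOCK_RESERVED_LKEY is a made-up value. It only illustrates the behaviour described in the proposal: a device with reserved L_Key may hand every consumer the same L_Key, while a Tavor-like device creates one "no translation" MR per PD.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MOCK_RESERVED_LKEY 0x100u /* made-up value, illustration only */

/* Hypothetical stand-ins; not the kernel's ib_device/ib_pd/ib_mr. */
struct mock_device {
	int has_reserved_lkey; /* supports base memory management ext. */
	uint32_t next_lkey;    /* next per-PD L_Key to hand out */
};

struct mock_pd {
	struct mock_device *dev;
};

struct mock_mr {
	struct mock_pd *pd;
	uint32_t lkey;
	int local_write; /* always set, matching reserved L_Key semantics */
};

/* Mock of the proposed ib_get_dma_mr(): only the L_Key is meaningful. */
static struct mock_mr *mock_get_dma_mr(struct mock_pd *pd)
{
	struct mock_mr *mr;

	assert(pd != NULL && pd->dev != NULL);
	mr = malloc(sizeof(*mr));
	if (!mr)
		return NULL;
	mr->pd = pd;
	mr->local_write = 1;
	if (pd->dev->has_reserved_lkey)
		mr->lkey = MOCK_RESERVED_LKEY;   /* same key for everyone */
	else
		mr->lkey = pd->dev->next_lkey++; /* one MR per PD */
	return mr;
}

/* Consumers always deregister; nothing device-side to undo in the mock. */
static void mock_dereg_mr(struct mock_mr *mr)
{
	free(mr);
}
```

A consumer would call the get function once per PD at startup, use the returned L_Key in every local work request, and deregister on exit regardless of which kind of device is underneath.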
If you want to use pre-allocated buffers for > IPoIB then you're stuck copying to and from your own buffers. > > What's wrong with using an L_Key with translation off? It solves all > these problems very simply and it works with all known hardware. > > - R. Sorry - that was before I read that proposal. From ftillier at infiniconsys.com Tue Sep 14 21:45:59 2004 From: ftillier at infiniconsys.com (Tillier, Fabian) Date: Wed, 15 Sep 2004 00:45:59 -0400 Subject: [openib-general] Reserved L_Key API (was Re: DMA mapping on sparc64) Message-ID: <5D78D28F88822E4D8702BB9EEF1A43670623A1@mercury.infiniconsys.com> > From: Roland Dreier [mailto:roland at topspin.com] > Sent: Tuesday, September 14, 2004 8:39 PM > > Only the L_Key of this MR would be valid, and it would always have > local write access (to match the semantics of reserved L_Key). If the > HCA supports reserved L_Key, it can just return the same L_Key for > every consumer. If need be it can take the PD into account. > > It is required for the consumer to call ib_dereg_mr() on this MR when > exiting, but this can be a NOP for HCAs that support reserved L_Key. I think this makes a lot of sense and looks good. > > I would argue that this entry point should replace reg_phys_mr as a > mandatory low-level driver function; this will simplify the > implementation of consumers that use the API. Devices that can't even > simulate reserved L_Key like Tavor (and I don't know of any such > devices -- even on Topspin's embedded platforms I could implement this > API) could just register a giant address range in a normal physical MR > (and even use pci_set_dma_mask() to limit the size of the MR to 4 GB > if they're really limited). I think you still need reg_phys_mr (or some way to get an RKEY) for kernel clients that do RDMA (SRP or kernel SDP, for example). Something like an RKEY with translation off but PD enforcement (I don't think you want to get rid of PD enforcement for that kind of usage). 
Given this, I would suggest keeping semantics similar to memory registration. Since we need an input PD anyway, I suggest having a call like: struct ib_mr *ib_reg_dma_mr(struct ib_pd *pd, int mr_access_flags); Depending on the mr_access_flags, the returned MR could have a valid RKEY with PD enforcement. If only local access is needed, the LKEY could be the reserved LKEY if the device supports it. This would enable both your desired usage as well as future usage by kernel clients that perform RDMA. With the above call, I think ib_reg_phys_mr can be eliminated. Thoughts? - Fab From roland at topspin.com Tue Sep 14 21:54:45 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 14 Sep 2004 21:54:45 -0700 Subject: [openib-general] Reserved L_Key API In-Reply-To: <20040915043455.GE10214@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 15 Sep 2004 07:34:55 +0300") References: <52hdq03822.fsf@topspin.com> <1095194610.6945.31.camel@duffman> <52d60o37ej.fsf@topspin.com> <1095196989.6945.43.camel@duffman> <52zn3s1pdi.fsf@topspin.com> <1095201391.6945.66.camel@duffman> <52vfeg1muw.fsf_-_@topspin.com> <1095203538.6945.70.camel@duffman> <52r7p41m0b.fsf@topspin.com> <52ekl419rm.fsf_-_@topspin.com> <20040915043455.GE10214@mellanox.co.il> Message-ID: <52isagyvvu.fsf@topspin.com> Michael> Don't you want to basically create a physical memory Michael> region covering the whole 64-bit range, and then post Michael> full physical addresses in the WQE? Can't you do exactly that Michael> with the existing API? Yes, that's the natural thing to try. However, as I understand it, Tavor is limited to 2 GB pages, which means covering the whole 64-bit range is going to take 8 billion pages, which clearly will overflow any conceivable context memory. Other devices are likely to have similar limitations.
- Roland From roland at topspin.com Tue Sep 14 21:59:09 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 14 Sep 2004 21:59:09 -0700 Subject: [openib-general] Reserved L_Key API In-Reply-To: <5D78D28F88822E4D8702BB9EEF1A43670623A1@mercury.infiniconsys.com> (Fabian Tillier's message of "Wed, 15 Sep 2004 00:45:59 -0400") References: <5D78D28F88822E4D8702BB9EEF1A43670623A1@mercury.infiniconsys.com> Message-ID: <52ekl4yvoi.fsf@topspin.com> Fabian> I think you still need reg_phys_mr (or some way to get an Fabian> RKEY) for kernel clients that do RDMA (SRP or kernel SDP, Fabian> for example). Something like an RKEY with translation off Fabian> but PD enforcement (I don't think you want to get rid of Fabian> PD enforcement for that kind of usage). Given this, I Fabian> would suggest keeping similar semantics as memory Fabian> registration. First, just to be clear, I'm not suggesting that we get rid of reg_phys_mr (although it would make sense for a low-level driver for a stupid TCA not to support the operation). I don't think consumers ever really want to pass remote entities an R_Key with translation off (which would allow RDMA to arbitrary addresses). I think the solution for creating R_Keys is FMRs (either Tavor-style or verbs extension-style). - R. From ftillier at infiniconsys.com Tue Sep 14 22:24:55 2004 From: ftillier at infiniconsys.com (Tillier, Fabian) Date: Wed, 15 Sep 2004 01:24:55 -0400 Subject: [openib-general] Reserved L_Key API Message-ID: <5D78D28F88822E4D8702BB9EEF1A43670623A2@mercury.infiniconsys.com> > From: Roland Dreier [mailto:roland at topspin.com] > Sent: Tuesday, September 14, 2004 9:59 PM > > I don't think consumers ever really want to pass remote entities an > R_Key with translation off (which would allow RDMA to arbitrary > addresses). I think the solution for creating R_Keys is FMRs (either > Tavor-style or verbs extension-style). Are the ramifications of such an RKEY any worse than those of locally attached DMA-capable adapters?
If your FC HBA goes haywire and decides to write all over memory, there's not much you can do. SRP just uses IB as a bus extension of sorts. What am I missing? Thanks, - Fab From mst at mellanox.co.il Tue Sep 14 22:35:09 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 15 Sep 2004 08:35:09 +0300 Subject: [openib-general] Reserved L_Key API In-Reply-To: <52ekl4yvoi.fsf@topspin.com> References: <5D78D28F88822E4D8702BB9EEF1A43670623A1@mercury.infiniconsys.com> <52ekl4yvoi.fsf@topspin.com> Message-ID: <20040915053509.GH10214@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] Reserved L_Key API": > Fabian> I think you still need reg_phys_mr (or some way to get an > Fabian> RKEY) for kernel clients that do RDMA (SRP or kernel SDP, > Fabian> for example). Something like an RKEY with translation off > Fabian> but PD enforcement (I don't think you want to get rid of > Fabian> PD enforcement for that kind of usage). Given this, I > Fabian> would suggest keeping similar semantics as memory > Fabian> registration. > > First, just to be clear, I'm not suggesting that we get rid of > reg_phys_mr (although it would make sense for a low-level driver for > a stupid TCA not to support the operation). > > I don't think consumers ever really want to pass remote entities an > R_Key with translation off (which would allow RDMA to arbitrary > addresses). I think the solution for creating R_Keys is FMRs (either > Tavor-style or verbs extension-style). > > - R. Why isn't PD protection sufficient?
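[Editorial sketch, not part of the original thread.] The two entry points debated above differ mainly in whether the caller passes access flags. A minimal user-space mock of Fabian's ib_reg_dma_mr(pd, mr_access_flags) idea might look like the following. Every name prefixed with sketch_ or SKETCH_, the reserved L_Key value, and the key allocation scheme are assumptions made up for illustration; none of this is the real ib_verbs.h API or any driver's behavior.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the verbs types discussed in the thread. */
enum sketch_access_flags {
	SKETCH_ACCESS_LOCAL_WRITE  = 1 << 0,
	SKETCH_ACCESS_REMOTE_READ  = 1 << 1,
	SKETCH_ACCESS_REMOTE_WRITE = 1 << 2
};

#define SKETCH_RESERVED_LKEY 0x100u	/* assumed value, illustration only */

struct sketch_device {
	int has_reserved_lkey;		/* device implements reserved L_Key */
	uint32_t next_key;		/* toy allocator for per-MR keys */
};

struct sketch_pd {
	struct sketch_device *dev;
};

struct sketch_mr {
	struct sketch_pd *pd;
	uint32_t lkey;
	uint32_t rkey;
	int rkey_valid;			/* R_Key handed out only on request */
};

/*
 * Local-only requests on a capable device share the reserved L_Key.
 * Anything asking for remote access gets its own keys, tied to the PD.
 */
struct sketch_mr *sketch_reg_dma_mr(struct sketch_pd *pd, int access)
{
	int remote = access &
		(SKETCH_ACCESS_REMOTE_READ | SKETCH_ACCESS_REMOTE_WRITE);
	struct sketch_mr *mr;

	if (!pd || !pd->dev)
		return NULL;

	mr = calloc(1, sizeof *mr);
	if (!mr)
		return NULL;

	mr->pd = pd;
	if (!remote && pd->dev->has_reserved_lkey)
		mr->lkey = SKETCH_RESERVED_LKEY;
	else
		mr->lkey = pd->dev->next_key++;

	if (remote) {
		mr->rkey = pd->dev->next_key++;
		mr->rkey_valid = 1;
	}
	return mr;
}

/* Releasing the MR is cheap here, matching the "can be a NOP" remark. */
void sketch_dereg_mr(struct sketch_mr *mr)
{
	free(mr);
}
```

In this mock, two local-only callers on a device with reserved L_Key support end up with the same L_Key, while a caller asking for remote write gets a distinct L_Key plus an R_Key. Whether an R_Key with translation off should ever be handed out is exactly the open question in the thread, so the remote branch is shown only to make the proposal concrete.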
From gdror at mellanox.co.il Tue Sep 14 23:39:12 2004 From: gdror at mellanox.co.il (Dror Goldenberg) Date: Wed, 15 Sep 2004 09:39:12 +0300 Subject: [openib-general] Reserved L_Key API Message-ID: <506C3D7B14CDD411A52C00025558DED605F9CF76@mtlex01.yok.mtl.com> > -----Original Message----- > From: Roland Dreier [mailto:roland at topspin.com] > Sent: Wednesday, September 15, 2004 7:55 AM > > > Michael> Don't you want to basically create a physical memory > Michael> region covering the whole 64-bit range, and then post > Michael> full physical addresses in the WQE? Can't you do exactly that > Michael> with the existing API? > > Yes, that's the natural thing to try. However, as I understand > it, Tavor is limited to 2 GB pages, which means covering the > whole 64-bit range is going to take 8 billion pages, which > clearly will overflow any conceivable context memory. Other > devices are likely to have similar limitations. > > - Roland If you're creating a physical MR with 1:1 mapping, just turn on the MPT.pa bit. If you do that, you don't need to care about pages and you don't need to set up MTTs. Your region will be bounded by MPT.start_address and MPT.start_address + MPT.reg_wnd_len. This way you can use MRs longer than 2GB. However, if you do believe that devices are likely to have this kind of limitation, then I believe that it needs to be reflected in the API. When you request this dma_mr, you'd need to specify which addresses are to be covered in it. One more question. If the device does support the memory management extensions, will ib_get_dma_mr(NULL) be the right way to obtain the dma_mr that will not be PD protected ? And if that fails, it means that you need a valid PD. -Dror
From gdror at mellanox.co.il Tue Sep 14 23:48:22 2004 From: gdror at mellanox.co.il (Dror Goldenberg) Date: Wed, 15 Sep 2004 09:48:22 +0300 Subject: [openib-general] Multicast address aliasing in IPoIB Message-ID: <506C3D7B14CDD411A52C00025558DED605F9CF7E@mtlex01.yok.mtl.com> > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Wednesday, September 15, 2004 12:09 AM > > On Thu, 2004-09-09 at 22:43, Roland Dreier wrote: > > I see no reason to modify the IPoIB multicast mapping just > because the > > Linux kernel does not yet have support for it. > > Agreed. OK. Fine with me. Note that this means that Gen2 will be fully compliant. Gen1 may not be interoperable / IPoIB compliant - because it just can't be without a very ugly hack. > > The only issue I see is where the scope and PKey would come from to > form the MGID. The scope could default to link local (2). I > would presume the PKey would come from the IPoIB interface somehow. > > -- Hal > Can we leave the pkey part of the MGID blank (e.g. zeroes) to be filled by the underlying driver ? If so, then it should be OK. From halr at voltaire.com Wed Sep 15 04:54:08 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 15 Sep 2004 07:54:08 -0400 Subject: [openib-general] [PATCH] review for new MAD APIs In-Reply-To: <20040913151649.51cf14c8.mshefty@ichips.intel.com> References: <20040913151649.51cf14c8.mshefty@ichips.intel.com> Message-ID: <1095249247.1973.8.camel@localhost.localdomain> On Mon, 2004-09-13 at 18:16, Sean Hefty wrote: > The following patch adds two new APIs to better support zero-copy receives on MADs. > The second call returns the receive MAD buffers and chained completion structures > to the access layer, where they were allocated. > > Comments? > +/** > + * ib_free_recv_mad - Returns data buffers used to receive a MAD to the > + * access layer.
> + * @mad_recv_wc - Work completion information for a received MAD. > + * > + * Clients receiving MADs through their ib_mad_recv_handler must call this > + * routine to return the work completion buffers to the access layer. > + */ > +void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc); > + > +/** Either the client needs to explictly do this or the MAD layer would assume this is the case upon return from the receive callback. They both have their downsides so I'm not sure which is better. -- Hal From halr at voltaire.com Wed Sep 15 06:41:37 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 15 Sep 2004 09:41:37 -0400 Subject: [openib-general] [PATCH] ib_mad: PCI mapping/unmapping and some buffering fixes Message-ID: <1095255696.1973.11.camel@localhost.localdomain> ib_mad: PCI mapping/unmapping and some buffering fixes Index: access/ib_mad_priv.h =================================================================== --- access/ib_mad_priv.h (revision 837) +++ access/ib_mad_priv.h (working copy) @@ -106,8 +106,10 @@ struct ib_mad_send_wr_private { struct list_head send_list; struct ib_mad_agent *agent; - u64 wr_id; + u64 wr_id; /* client WRID */ int timeout_ms; + struct ib_mad_buf *buf; + u32 buf_len; }; struct ib_mad_mgmt_method_table { Index: access/ib_mad.c =================================================================== --- access/ib_mad.c (revision 837) +++ access/ib_mad.c (working copy) @@ -81,6 +81,8 @@ struct ib_mad_agent_private *priv); static void remove_mad_reg_req(struct ib_mad_agent_private *priv); static int ib_mad_port_restart(struct ib_mad_port_private *priv); +static int ib_mad_post_receive_mad(struct ib_mad_port_private *port_priv, + struct ib_qp *qp); static int ib_mad_post_receive_mads(struct ib_mad_port_private *priv); static inline u8 convert_mgmt_class(u8 mgmt_class); @@ -291,6 +293,7 @@ struct ib_send_wr wr; struct ib_send_wr *bad_wr; struct ib_mad_send_wr_private *mad_send_wr; + struct ib_sge gather_list; unsigned long flags; 
cur_send_wr = send_wr; @@ -300,7 +303,7 @@ return -EINVAL; } - /* Walk list of send WRs and post each one on send list */ + /* Walk list of send WRs and post each on send list */ cur_send_wr = send_wr; while (cur_send_wr) { next_send_wr = (struct ib_send_wr *)cur_send_wr->next; @@ -314,16 +317,26 @@ printk(KERN_ERR "No memory for ib_mad_send_wr_private\n"); return -ENOMEM; } + + /* Setup gather list */ + gather_list.addr = pci_map_single(mad_agent->device->dma_device, + cur_send_wr->sg_list->addr, + cur_send_wr->sg_list->length, + PCI_DMA_TODEVICE); + gather_list.length = cur_send_wr->sg_list->length; + gather_list.lkey = cur_send_wr->sg_list->lkey; + /* Initialize MAD send WR tracking structure */ mad_send_wr->agent = mad_agent; mad_send_wr->wr_id = cur_send_wr->wr_id; /* Timeout valid only when MAD is a request !!! */ mad_send_wr->timeout_ms = cur_send_wr->wr.ud.timeout_ms; + mad_send_wr->buf_len = gather_list.length; wr.next = NULL; wr.opcode = IB_WR_SEND; /* cur_send_wr->opcode ? */ wr.wr_id = (unsigned long)mad_send_wr; - wr.sg_list = cur_send_wr->sg_list; + wr.sg_list = &gather_list; wr.num_sge = cur_send_wr->num_sge; wr.wr.ud.remote_qpn = cur_send_wr->wr.ud.remote_qpn; wr.wr.ud.remote_qkey = cur_send_wr->wr.ud.remote_qkey; @@ -338,8 +351,17 @@ ((struct ib_mad_port_private *)mad_agent->device->mad)->send_posted_mad_count++; spin_unlock_irqrestore(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, flags); + pci_unmap_addr_set(&mad_send_wr->buf, mapping, + gather_list.addr); + ret = ib_post_send(mad_agent->qp, &wr, &bad_wr); if (ret) { + pci_unmap_single(mad_agent->device->dma_device, + pci_unmap_addr(&mad_send_wr->buf, + mapping), + gather_list.length, + PCI_DMA_TODEVICE); + /* Unlink from posted send MAD list */ spin_unlock_irqrestore(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, flags); list_del(&mad_send_wr->send_list); @@ -685,9 +707,12 @@ recv = 
list_entry(&port_priv->recv_posted_mad_list[convert_qpnum(qp_num)], struct ib_mad_private, header.mad_list); + /* Remove from posted receive MAD list */ list_del(&recv->header.mad_list); + port_priv->recv_posted_mad_count[convert_qpnum(qp_num)]--; + } else { printk(KERN_ERR "Receive completion WR ID 0x%Lx on QP %d with no posted receive\n", wc->wr_id, qp_num); spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); @@ -695,6 +720,11 @@ } spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); + pci_unmap_single(port_priv->device->dma_device, + pci_unmap_addr(&recv->buf, mapping), + sizeof(struct ib_mad_private) - sizeof(struct ib_mad_private_header), + PCI_DMA_FROMDEVICE); + /* Setup MAD receive work completion from "normal" work completion */ recv_wc.wc = wc; recv_wc.mad_len = sizeof(struct ib_mad); /* Should this be based on wc->byte_len ? Also, RMPP !!! */ @@ -735,8 +765,8 @@ } spin_unlock_irqrestore(&port_priv->reg_lock, flags); - /* Repost receive request */ - /* Client must be done with receive */ + /* Post another receive request for this QP */ + ib_mad_post_receive_mad(port_priv, port_priv->qp[qp_num]); ret: return; @@ -754,22 +784,30 @@ send_wr = list_entry(&port_priv->send_posted_mad_list, struct ib_mad_send_wr_private, send_list); + if (send_wr->wr_id != wc->wr_id) { printk(KERN_ERR "Send completion WR ID 0x%Lx doesn't match posted send WR ID 0x%Lx\n", wc->wr_id, send_wr->wr_id); goto error; } + /* Check whether timeout was requested !!! 
*/ /* Remove from posted send MAD list */ list_del(&send_wr->send_list); port_priv->send_posted_mad_count--; + } else { printk(KERN_ERR "Send completion WR ID 0x%Lx but send list is empty\n", wc->wr_id); goto error; } spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + pci_unmap_single(port_priv->device->dma_device, + pci_unmap_addr(&send_wr->buf, mapping), + send_wr->buf_len, + PCI_DMA_TODEVICE); + /* Restore client wr_id in WC */ wc->wr_id = send_wr->wr_id; /* Invoke client send callback */ @@ -916,10 +954,18 @@ struct ib_recv_wr *bad_recv_wr; unsigned long flags; - /* Allocate memory for receive MAD (and private header) */ - mad_priv = kmalloc(sizeof *mad_priv, GFP_KERNEL); + /* + * Allocate memory for receive buffer. + * This is for both MAD and private header + * which serves as the receive tracking structure. + * By prepending thisheader, there is one rather + * than two memory allocations. + */ + mad_priv = kmalloc(sizeof *mad_priv, + (in_atomic() || irqs_disabled()) ? + GFP_ATOMIC : GFP_KERNEL); if (!mad_priv) { - printk(KERN_ERR "No memory for receive MAD\n"); + printk(KERN_ERR "No memory for receive buffer\n"); return -ENOMEM; } @@ -949,16 +995,18 @@ /* Now, post receive WR */ if (ib_post_recv(qp, &recv_wr, &bad_recv_wr)) { + + pci_unmap_single(port_priv->device->dma_device, + pci_unmap_addr(&mad_priv->header.buf, mapping), + sizeof *mad_priv - sizeof mad_priv->header, + PCI_DMA_FROMDEVICE); + /* Unlink from posted receive MAD list */ spin_lock_irqsave(&port_priv->recv_list_lock, flags); list_del(&mad_priv->header.mad_list); port_priv->recv_posted_mad_count[convert_qpnum(qp->qp_num)]--; spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); - pci_unmap_single(port_priv->device->dma_device, - pci_unmap_addr(&mad_priv->header.buf, mapping), - sizeof *mad_priv - sizeof mad_priv->header, - PCI_DMA_FROMDEVICE); kfree(mad_priv); printk(KERN_NOTICE "ib_post_recv failed\n"); return -EINVAL; @@ -968,14 +1016,14 @@ } /* - * Get receive MADs and post 
receive WRs for them + * Allocate receive MADs and post receive WRs for them */ static int ib_mad_post_receive_mads(struct ib_mad_port_private *port_priv) { int i, j; for (i = 0; i < IB_MAD_QP_RECV_SIZE; i++) { - for (j = 0; j < 2; j++) { + for (j = 0; j < IB_MAD_QPS_CORE; j++) { if (ib_mad_post_receive_mad(port_priv, port_priv->qp[j])) { printk(KERN_ERR "receive post %d failed\n", i + 1); @@ -994,12 +1042,12 @@ int i; unsigned long flags; - /* PCI mapping ? */ - for (i = 0; i < IB_MAD_QPS_SUPPORTED; i++) { spin_lock_irqsave(&port_priv->recv_list_lock, flags); while (!list_empty(&port_priv->recv_posted_mad_list[i])) { + /* PCI mapping !!! */ + } INIT_LIST_HEAD(&port_priv->recv_posted_mad_list[i]); port_priv->recv_posted_mad_count[i] = 0; @@ -1014,11 +1062,13 @@ { unsigned long flags; - /* PCI mapping ? */ - spin_lock_irqsave(&port_priv->send_list_lock, flags); while (!list_empty(&port_priv->send_posted_mad_list)) { + + /* PCI mapping !!! */ + list_del(&port_priv->send_posted_mad_list); + /* Call completion handler with some status ? */ } From mvonwyl at bluewin.ch Wed Sep 15 07:42:06 2004 From: mvonwyl at bluewin.ch (mvonwyl at bluewin.ch) Date: Wed, 15 Sep 2004 16:42:06 +0200 Subject: [openib-general] 2.4 compilation problems... In-Reply-To: <1095255696.1973.11.camel@localhost.localdomain> Message-ID: <412EB53F000A6E28@mssbzhb-int.msg.bluewin.ch> Good afternoon, I'm trying to compile a 2.4.20 kernel using the linux-2.4.20-8 patch and the infiniband-kernel-24-2004-03-17 package... After applying the patch, I only had to correct one error in /drivers/Makefile (putting infiniband at the end of line 11...)
After a make dep and a make bzImage without any problems I get this error with make modules: gcc -D__KERNEL__ -I/home/vonwyl/InfiniBand/kernel/rtlinux-3.2-pre2/linux/include -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fno-strict-aliasing -fno-common -pipe -mpreferred-stack-boundary=2 -march=i586 -DMODULE -DMODVERSIONS -include /home/vonwyl/InfiniBand/kernel/rtlinux-3.2-pre2/linux/include/linux/modversions.h -I/home/vonwyl/InfiniBand/kernel/rtlinux-3.2-pre2/linux/drivers/infiniband/include -DIN_TREE_BUILD -DTS_HOST_DRIVER -D_NO_DATA_PATH_TRACE -nostdinc -iwithprefix include -DKBUILD_BASENAME=services_export -DEXPORT_SYMTAB -c services_export.c services_export.c:81: `rb_replace_node' undeclared here (not in a function) services_export.c:81: initializer element is not constant services_export.c:81: (near initialization for `__ksymtab_rb_replace_node.value') make[3]: *** [services_export.o] Error 1 make[3]: Leaving directory `/home/vonwyl/InfiniBand/kernel/rtlinux-3.2-pre2/linux/drivers/infiniband/core' make[2]: *** [_modsubdir_core] Error 2 make[2]: Leaving directory `/home/vonwyl/InfiniBand/kernel/rtlinux-3.2-pre2/linux/drivers/infiniband' make[1]: *** [_modsubdir_infiniband] Error 2 make[1]: Leaving directory `/home/vonwyl/InfiniBand/kernel/rtlinux-3.2-pre2/linux/drivers' make: *** [_mod_drivers] Error 2 So I modified the services_export.c file: #if !defined(TS_KERNEL_2_6) //EXPORT_SYMBOL(rb_replace_node); #endif at line 81... And then I get this: gcc -D__KERNEL__ -I/home/vonwyl/InfiniBand/kernel/rtlinux-3.2-pre2/linux/include -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fno-strict-aliasing -fno-common -pipe -mpreferred-stack-boundary=2 -march=i586 -DMODULE -DMODVERSIONS -include /home/vonwyl/InfiniBand/kernel/rtlinux-3.2-pre2/linux/include/linux/modversions.h -I/home/vonwyl/InfiniBand/kernel/rtlinux-3.2-pre2/linux/drivers/infiniband/include -I/home/vonwyl/InfiniBand/kernel/rtlinux-3.2-pre2/linux/drivers/infiniband/hw/mellanox-hca/include -I.
-DMT_KERNEL -D__LINUX__ -DMAX_ERROR=4 -DMTL_MODULE=MOSAL "-DREMAP_PREFIX=current->mm->mmap," -DMT_LITTLE_ENDIAN -nostdinc -iwithprefix include -DKBUILD_BASENAME=mosal_mem -c -o mosal_mem.o mosal_mem.c mosal_mem.c: In function `MOSAL_map_phys_addr': mosal_mem.c:218: warning: passing arg 1 of `remap_page_range_Rsmp_69d01e73' makes integer from pointer without a cast mosal_mem.c:218: incompatible type for argument 4 of `remap_page_range_Rsmp_69d01e73' mosal_mem.c:218: too many arguments to function `remap_page_range_Rsmp_69d01e73' make[5]: *** [mosal_mem.o] Error 1 make[5]: Leaving directory `/home/vonwyl/InfiniBand/kernel/rtlinux-3.2-pre2/linux/drivers/infiniband/hw/mellanox-hca/mosal' make[4]: *** [_modsubdir_mosal] Error 2 make[4]: Leaving directory `/home/vonwyl/InfiniBand/kernel/rtlinux-3.2-pre2/linux/drivers/infiniband/hw/mellanox-hca' make[3]: *** [_modsubdir_mellanox-hca] Error 2 make[3]: Leaving directory `/home/vonwyl/InfiniBand/kernel/rtlinux-3.2-pre2/linux/drivers/infiniband/hw' make[2]: *** [_modsubdir_hw] Error 2 make[2]: Leaving directory `/home/vonwyl/InfiniBand/kernel/rtlinux-3.2-pre2/linux/drivers/infiniband' make[1]: *** [_modsubdir_infiniband] Error 2 make[1]: Leaving directory `/home/vonwyl/InfiniBand/kernel/rtlinux-3.2-pre2/linux/drivers' make: *** [_mod_drivers] Error 2 I tried gcc 2.95.3, 2.95.4 and 3.2, but without success... So can anyone help me? Thanks in advance... From roland at topspin.com Wed Sep 15 08:41:14 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Sep 2004 08:41:14 -0700 Subject: [openib-general] Reserved L_Key API In-Reply-To: <506C3D7B14CDD411A52C00025558DED605F9CF76@mtlex01.yok.mtl.com> (Dror Goldenberg's message of "Wed, 15 Sep 2004 09:39:12 +0300") References: <506C3D7B14CDD411A52C00025558DED605F9CF76@mtlex01.yok.mtl.com> Message-ID: <52y8jby1yd.fsf@topspin.com> Dror> If you're creating a physical MR with 1:1 mapping, just turn Dror> on MPT.pa bit.
If you do that, you don't need to care about Dror> pages and you don't need to set up MTTs. Your region will Dror> be bounded by the MPT.start_address and MPT.start_address + Dror> MPT.reg_wnd_len. This way you can use MRs longer than 2GB. Yes, I understand. This is my proposal for how to handle Tavor: use MRs with translation off (meaning MPT.pa == 1). It's an interesting idea to use this as an optimization in reg_phys_mr. However I'm not sure if this is quite right for the "DMA MR" that MAD, IPoIB et al need, because it doesn't seem portable to all possible HCAs to assume that a physical MR covering the whole 64-bit address space can be registered. That's why I'd like to hide the details in a new function. By the way, I notice that neither the API nor the Tavor MPT format seems to allow creating memory regions of size 2^64 since they use a 64-bit value for the size (which means they're limited to 2^64 - 1). I don't think this is a serious issue though. Dror> One more question. If the device does support the memory Dror> management extensions, will ib_get_dma_mr(NULL) be the right Dror> way to obtain the dma_mr that will not be PD protected ? And Dror> if that fails, it means that you need a valid PD. No, I think we should just have a way to get the reserved L_Key directly without allocating an MR. - R. From roland at topspin.com Wed Sep 15 08:42:53 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Sep 2004 08:42:53 -0700 Subject: [openib-general] Reserved L_Key API In-Reply-To: <5D78D28F88822E4D8702BB9EEF1A43670623A2@mercury.infiniconsys.com> (Fabian Tillier's message of "Wed, 15 Sep 2004 01:24:55 -0400") References: <5D78D28F88822E4D8702BB9EEF1A43670623A2@mercury.infiniconsys.com> Message-ID: <52u0tzy1vm.fsf@topspin.com> Roland> I don't think consumers ever really want to pass remote Roland> entities an R_Key with translation off (which would allow Roland> RDMA to arbitrary addresses).
I think the solution for Roland> creating R_Keys is FMRs (either Tavor-style or verbs Roland> extension-style). Fabian> Are the ramifications of such an RKEY any worse than those Fabian> of locally attached DMA-capable adapters? If your FC HBA Fabian> goes haywire and decides to write all over memory, there's Fabian> not much you can do. SRP just uses IB as a bus extension Fabian> of sorts. All I can say is that in the real world people don't like it. That's why features like remote invalidate get added. - R. From roland at topspin.com Wed Sep 15 08:44:19 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Sep 2004 08:44:19 -0700 Subject: [openib-general] [PATCH] ib_mad: PCI mapping/unmapping and some buffering fixes In-Reply-To: <1095255696.1973.11.camel@localhost.localdomain> (Hal Rosenstock's message of "Wed, 15 Sep 2004 09:41:37 -0400") References: <1095255696.1973.11.camel@localhost.localdomain> Message-ID: <52pt4ny1t8.fsf@topspin.com> I thought we said the consumer was responsible for PCI mapping in the send path? - R. From halr at voltaire.com Wed Sep 15 08:48:12 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 15 Sep 2004 11:48:12 -0400 Subject: [openib-general] [PATCH] ib_mad: PCI mapping/unmapping and some buffering fixes In-Reply-To: <52pt4ny1t8.fsf@topspin.com> References: <1095255696.1973.11.camel@localhost.localdomain> <52pt4ny1t8.fsf@topspin.com> Message-ID: <1095263292.1831.1.camel@localhost.localdomain> On Wed, 2004-09-15 at 11:44, Roland Dreier wrote: > I thought we said the consumer was responsible for PCI mapping in the > send path? I don't recall but in any case... Is that the best choice ? Is there an advantage to the client doing this over the MAD layer ? 
-- Hal From roland at topspin.com Wed Sep 15 08:48:09 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Sep 2004 08:48:09 -0700 Subject: [openib-general] Reserved L_Key API In-Reply-To: <5D78D28F88822E4D8702BB9EEF1A43670623A1@mercury.infiniconsys.com> (Fabian Tillier's message of "Wed, 15 Sep 2004 00:45:59 -0400") References: <5D78D28F88822E4D8702BB9EEF1A43670623A1@mercury.infiniconsys.com> Message-ID: <52llfby1mu.fsf@topspin.com> Fabian> Since we need an input PD anyway, I suggest having a call Fabian> like: Fabian> struct ib_mr *ib_reg_dma_mr(struct ib_pd *pd, int Fabian> mr_access_flags ); Fabian> Depending on the mr_access_flags, the returned MR could Fabian> have a valid RKEY with PD enforcement. If only local Fabian> access is needed, the LKEY could be the reserved LKEY if Fabian> the device supports it. This would enable both your Fabian> desired usage as well as future usage by kernel clients Fabian> that perform RDMA. With the above call, I think Fabian> ib_reg_phys_mr can be eliminated. Thinking about this some more, I don't have a problem with adding the mr_access_flags. However I think reg_phys_mr is still required, because it allows things like combining multiple buffers into a single virtually contiguous region and setting the IO virtual address (and will be needed to implement userspace memory registration anyway). - R. From mshefty at ichips.intel.com Wed Sep 15 08:58:14 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 15 Sep 2004 08:58:14 -0700 Subject: [openib-general] [PATCH] review for new MAD APIs In-Reply-To: <1095249247.1973.8.camel@localhost.localdomain> References: <20040913151649.51cf14c8.mshefty@ichips.intel.com> <1095249247.1973.8.camel@localhost.localdomain> Message-ID: <20040915085814.250163ad.mshefty@ichips.intel.com> On Wed, 15 Sep 2004 07:54:08 -0400 Hal Rosenstock wrote: > > +/** > > + * ib_free_recv_mad - Returns data buffers used to receive a MAD to the > > + * access layer. 
> > + * @mad_recv_wc - Work completion information for a received MAD. > > + * > > + * Clients receiving MADs through their ib_mad_recv_handler must call this > > + * routine to return the work completion buffers to the access layer. > > + */ > > +void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc); > > + > > +/** > > Either the client needs to explictly do this or the MAD layer would > assume this is the case upon return from the receive callback. > > They both have their downsides so I'm not sure which is better. My initial thought is to make this an asynchronous operation. This gives clients more flexibility with respect to their MAD handling, while still avoiding a data copy on the receive path. E.g. something like the CM can schedule a separate thread to perform whatever processing it requires before returning the MAD. As a side note, when QP redirection is supported, this call is still needed to free the work completion structures, but would not free the data buffers (which are owned by the client). From roland at topspin.com Wed Sep 15 08:59:25 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Sep 2004 08:59:25 -0700 Subject: [openib-general] [PATCH] ib_mad: PCI mapping/unmapping and some buffering fixes In-Reply-To: <1095263292.1831.1.camel@localhost.localdomain> (Hal Rosenstock's message of "Wed, 15 Sep 2004 11:48:12 -0400") References: <1095255696.1973.11.camel@localhost.localdomain> <52pt4ny1t8.fsf@topspin.com> <1095263292.1831.1.camel@localhost.localdomain> Message-ID: <52hdpzy142.fsf@topspin.com> Hal> I don't recall but in any case... Here's the thread: http://thread.gmane.org/gmane.linux.drivers.openib/4428 Hal> Is that the best choice ? Is there an advantage to the client Hal> doing this over the MAD layer ? I see a few reasons. 
First, principle of least surprise: consumers need to do the mapping for ib_post_send, so it makes sense for them to do it for ib_mad_post_send (especially since we're reusing the same work request structure). Second, avoiding bugs: if we hide the PCI mapping from consumers, then they may not realize it's happening at all and pass non-DMA-able memory (eg from vmalloc) into ib_mad_post_send. Finally, flexibility: a consumer may want to use other DMA mapping functions and we may as well allow it. Also I didn't notice before but the current code in ib_mad.c is wrong. It does pci_map_single on the first gather entry but potentially the consumer can pass in a gather list with more than one entry. - R. From mshefty at ichips.intel.com Wed Sep 15 09:01:10 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 15 Sep 2004 09:01:10 -0700 Subject: [openib-general] [PATCH] ib_mad: PCI mapping/unmapping and some buffering fixes In-Reply-To: <52pt4ny1t8.fsf@topspin.com> References: <1095255696.1973.11.camel@localhost.localdomain> <52pt4ny1t8.fsf@topspin.com> Message-ID: <20040915090110.1beac8b3.mshefty@ichips.intel.com> On Wed, 15 Sep 2004 08:44:19 -0700 Roland Dreier wrote: > I thought we said the consumer was responsible for PCI mapping in the > send path? I believe that this was the case as well. From halr at voltaire.com Wed Sep 15 09:02:47 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 15 Sep 2004 12:02:47 -0400 Subject: [openib-general] [PATCH] ib_mad: PCI mapping/unmapping and some buffering fixes In-Reply-To: <1095263292.1831.1.camel@localhost.localdomain> References: <1095255696.1973.11.camel@localhost.localdomain> <52pt4ny1t8.fsf@topspin.com> <1095263292.1831.1.camel@localhost.localdomain> Message-ID: <1095264167.1831.4.camel@localhost.localdomain> On Wed, 2004-09-15 at 11:48, Hal Rosenstock wrote: > On Wed, 2004-09-15 at 11:44, Roland Dreier wrote: > > I thought we said the consumer was responsible for PCI mapping in the > > send path? 
> > I don't recall but in any case... > > Is that the best choice ? Is there an advantage to the client doing this > over the MAD layer ? Never mind. I found your response on this: "it makes sense for the consumer to do the PCI mapping since that allows consumers to do things like allocate MADs with pci_pool_alloc() and avoid calling pci_map at all." -- Hal From halr at voltaire.com Wed Sep 15 09:10:50 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 15 Sep 2004 12:10:50 -0400 Subject: [openib-general] [PATCH] ib_mad: Undo send side PCI mapping/unmapping Message-ID: <1095264650.1865.7.camel@localhost.localdomain> ib_mad: Undo send side PCI mapping/unmapping Index: ib_mad_priv.h =================================================================== --- ib_mad_priv.h (revision 839) +++ ib_mad_priv.h (working copy) @@ -108,8 +108,6 @@ struct ib_mad_agent *agent; u64 wr_id; /* client WRID */ int timeout_ms; - struct ib_mad_buf *buf; - u32 buf_len; }; struct ib_mad_mgmt_method_table { Index: ib_mad.c =================================================================== --- ib_mad.c (revision 839) +++ ib_mad.c (working copy) @@ -293,7 +293,6 @@ struct ib_send_wr wr; struct ib_send_wr *bad_wr; struct ib_mad_send_wr_private *mad_send_wr; - struct ib_sge gather_list; unsigned long flags; cur_send_wr = send_wr; @@ -318,25 +317,16 @@ return -ENOMEM; } - /* Setup gather list */ - gather_list.addr = pci_map_single(mad_agent->device->dma_device, - cur_send_wr->sg_list->addr, - cur_send_wr->sg_list->length, - PCI_DMA_TODEVICE); - gather_list.length = cur_send_wr->sg_list->length; - gather_list.lkey = cur_send_wr->sg_list->lkey; - /* Initialize MAD send WR tracking structure */ mad_send_wr->agent = mad_agent; mad_send_wr->wr_id = cur_send_wr->wr_id; /* Timeout valid only when MAD is a request !!! */ mad_send_wr->timeout_ms = cur_send_wr->wr.ud.timeout_ms; - mad_send_wr->buf_len = gather_list.length; wr.next = NULL; wr.opcode = IB_WR_SEND; /* cur_send_wr->opcode ?
*/ wr.wr_id = (unsigned long)mad_send_wr; - wr.sg_list = &gather_list; + wr.sg_list = cur_send_wr->sg_list; wr.num_sge = cur_send_wr->num_sge; wr.wr.ud.remote_qpn = cur_send_wr->wr.ud.remote_qpn; wr.wr.ud.remote_qkey = cur_send_wr->wr.ud.remote_qkey; @@ -351,17 +341,8 @@ ((struct ib_mad_port_private *)mad_agent->device->mad)->send_posted_mad_count++; spin_unlock_irqrestore(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, flags); - pci_unmap_addr_set(&mad_send_wr->buf, mapping, - gather_list.addr); - ret = ib_post_send(mad_agent->qp, &wr, &bad_wr); if (ret) { - pci_unmap_single(mad_agent->device->dma_device, - pci_unmap_addr(&mad_send_wr->buf, - mapping), - gather_list.length, - PCI_DMA_TODEVICE); - /* Unlink from posted send MAD list */ spin_unlock_irqrestore(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, flags); list_del(&mad_send_wr->send_list); @@ -803,11 +784,6 @@ } spin_unlock_irqrestore(&port_priv->send_list_lock, flags); - pci_unmap_single(port_priv->device->dma_device, - pci_unmap_addr(&send_wr->buf, mapping), - send_wr->buf_len, - PCI_DMA_TODEVICE); - /* Restore client wr_id in WC */ wc->wr_id = send_wr->wr_id; /* Invoke client send callback */ @@ -1065,7 +1041,7 @@ spin_lock_irqsave(&port_priv->send_list_lock, flags); while (!list_empty(&port_priv->send_posted_mad_list)) { - /* PCI mapping !!! */ + /* PCI mapping ? 
*/ list_del(&port_priv->send_posted_mad_list); From halr at voltaire.com Wed Sep 15 09:38:24 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 15 Sep 2004 12:38:24 -0400 Subject: [openib-general] [PATCH] ib_mad: Implement ib_free_recv_mad routine Message-ID: <1095266304.1865.11.camel@localhost.localdomain> ib_mad: Implement ib_free_recv_mad routine Index: ib_mad_priv.h =================================================================== --- ib_mad_priv.h (revision 841) +++ ib_mad_priv.h (working copy) @@ -82,6 +82,7 @@ }; struct ib_mad_private_header { + struct ib_mad_recv_wc recv_wc; /* must be first member (for now !!!) */ struct list_head mad_list; struct ib_mad_buf buf; struct ib_mad_recv_buf recv_buf; Index: ib_mad.c =================================================================== --- ib_mad.c (revision 841) +++ ib_mad.c (working copy) @@ -359,6 +359,16 @@ } EXPORT_SYMBOL(ib_post_send_mad); +/* + * ib_free_recv_mad - Returns data buffers used to receive a MAD to the + * access layer. + */ +void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc) +{ + kfree(mad_recv_wc); /* RMPP !!! */ +} +EXPORT_SYMBOL(ib_free_recv_mad); + static inline u8 convert_mgmt_class(u8 mgmt_class) { /* Alias IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE to 0 */ @@ -669,7 +679,6 @@ static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, struct ib_wc *wc) { - struct ib_mad_recv_wc recv_wc; struct ib_mad_private *recv; unsigned long flags; u32 qp_num; @@ -707,9 +716,9 @@ PCI_DMA_FROMDEVICE); /* Setup MAD receive work completion from "normal" work completion */ - recv_wc.wc = wc; - recv_wc.mad_len = sizeof(struct ib_mad); /* Should this be based on wc->byte_len ? Also, RMPP !!! */ - recv_wc.recv_buf = &recv->header.recv_buf; + recv->header.recv_wc.wc = wc; + recv->header.recv_wc.mad_len = sizeof(struct ib_mad); /* Should this be based on wc->byte_len ? Also, RMPP !!! 
*/ + recv->header.recv_wc.recv_buf = &recv->header.recv_buf; /* Setup MAD receive buffer */ recv->header.recv_buf.list.next = NULL; /* Until RMPP implemented !!! */ @@ -742,7 +751,7 @@ /* Invoke receive callback */ mad_agent->agent->recv_handler(mad_agent->agent, - &recv_wc); + &recv->header.recv_wc); } spin_unlock_irqrestore(&port_priv->reg_lock, flags); @@ -933,9 +942,10 @@ /* * Allocate memory for receive buffer. * This is for both MAD and private header - * which serves as the receive tracking structure. - * By prepending thisheader, there is one rather - * than two memory allocations. + * which contains the receive tracking structure. + * By prepending this header, there is one rather + * than multiple memory allocations. + * Ths will need revisiting for RMPP !!! */ mad_priv = kmalloc(sizeof *mad_priv, (in_atomic() || irqs_disabled()) ? From mshefty at ichips.intel.com Wed Sep 15 09:45:10 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 15 Sep 2004 09:45:10 -0700 Subject: [openib-general] [PATCH] beginnings of SMI In-Reply-To: <20040914151839.0ece83e6.mshefty@ichips.intel.com> References: <20040914151839.0ece83e6.mshefty@ichips.intel.com> Message-ID: <20040915094510.25f72bab.mshefty@ichips.intel.com> On Tue, 14 Sep 2004 15:18:39 -0700 Sean Hefty wrote: > This patch begins to add in functionality needed by the SMI implementation. Let's try this again. Moved SMI specific code from ib_mad.h to ib_smi.h. Updated printk from previous patch. And replaced the Makefile. (Feel free to ignore the Makefile changes or to replace it with yours. The current one didn't seem out of date.) 
- Sean Index: access/ib_mad_priv.h =================================================================== --- access/ib_mad_priv.h (revision 841) +++ access/ib_mad_priv.h (working copy) @@ -58,7 +58,8 @@ #include #include - +#include +#include #define IB_MAD_QPS_CORE 2 /* Always QP0 and QP1 as a minimum */ #define IB_MAD_QPS_SUPPORTED 2 @@ -93,6 +94,7 @@ union { struct ib_mad mad; struct ib_rmpp_mad rmpp_mad; + struct ib_smp smp; } mad; } __attribute__ ((packed)); Index: access/ib_smi.c =================================================================== --- access/ib_smi.c (revision 0) +++ access/ib_smi.c (revision 0) @@ -0,0 +1,125 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. All rights reserved. 
+*/ + +#include +#include "ib_mad_priv.h" + +int smi_process_dr_smp(struct ib_mad_port_private *port_priv, + struct ib_smp *smp) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + /* + * Outgoing MAD processing. "Outgoing" means from initiator to responder. + * Section 14.2.2.2, Vol 1 IB spec + */ + if (!ib_get_smp_direction(smp)) { + /* C14-9:1 */ + if (hop_ptr == 0 && hop_cnt) + return 0; + + /* C14-9:2 */ + if (hop_ptr && hop_ptr < hop_cnt) { + if (port_priv->device->node_type == IB_NODE_SWITCH) { + printk(KERN_NOTICE + "Need to handle DR Mad on switch\n"); + } + return 0; + } + + /* C14-9:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) { + if (hop_cnt) + smp->return_path[hop_ptr] = port_priv->port; + smp->hop_ptr++; + + if (port_priv->device->node_type == IB_NODE_SWITCH) { + printk(KERN_NOTICE + "Need to handle DR Mad on switch\n"); + return 0; + } else if (smp->dr_dlid != IB_LID_PERMISSIVE) { + return 0; + } + + return 1; + } + + /* C14-9:4 -- Hop Pointer = Hop Count + 1 -> give to SMA/SM. */ + if (hop_ptr == hop_cnt + 1) + return 1; + + /* C14-9:5 -- Check for unreasonable hop pointer. */ + if (hop_ptr > hop_cnt + 1) + return 0; + + /* There should be no way of getting here, since one of the if + * statements above should have matched, and should have + * returned a value. 
+ */ + printk(KERN_ERR "Unhandled Outgoing DR MAD case\n"); + return 0; + } else { /* Returning MAD (From responder to initiator) */ + + /* C14-13:1 */ + if (hop_cnt && hop_ptr == hop_cnt + 1) + return 0; + + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) { + if (port_priv->device->node_type == IB_NODE_SWITCH) { + printk(KERN_NOTICE + "Need to handle DR Mad on switch\n"); + } + return 0; + } + + /* C14-13:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == 1) { + smp->hop_ptr--; + + if (port_priv->device->node_type == IB_NODE_SWITCH) { + printk(KERN_NOTICE + "Need to handle DR Mad on switch\n"); + return 0; + } else if (smp->dr_dlid != IB_LID_PERMISSIVE) { + return 0; + } + + return 1; + } + + /* C14-13:4 -- Hop Pointer = 0 -> give to SM. */ + if (hop_ptr == 0) + return 1; + + /* C14-13:5 -- Check for unreasonable hop pointer. */ + if (hop_ptr > hop_cnt + 1) + return 0; + } + return 1; +} Index: access/Makefile =================================================================== --- access/Makefile (revision 841) +++ access/Makefile (working copy) @@ -1,27 +1,9 @@ -# Makefile for the kernel module +EXTRA_CFLAGS += -Idrivers/infiniband/include -INCDIRS := -I. -I../include -I/lib/modules/$(shell uname -r)/build/include -#INCDIRS := -I. 
-I../include -I/lib/modules/$(shell uname -r)/build/include -I../../include +obj-$(CONFIG_INFINIBAND) += ib_access.o -MODULE := gsi -OBJS := gsi_main.o +ib_access-objs := \ + ib_verbs.o \ + ib_mad.o \ + ib_smi.o -#For RMPP support: -OBJS += gsi_rmpp_vendal.o rmpp/rmpp_module.o - -#CFLAGS := -W -O2 -DVD_MODULE_NAME=GSI -DVD_TRACE_LEVEL=6 -DVD_ENTERLEAVE_LEVEL=3 -DMODULE -D__KERNEL__ -DKBUILD_BASENAME=$(MODULE) -DKBUILD_MODNAME=$(MODULE) $(INCDIRS) -CFLAGS := -W -O2 -DVD_MODULE_NAME=GSI -DMODULE -D__KERNEL__ -DKBUILD_BASENAME=$(MODULE) -DKBUILD_MODNAME=$(MODULE) -DGSI_RMPP_SUPPORT $(INCDIRS) - -all: gsi.o - -rmpp/rmpp_module.o: - make -C rmpp - -gsi.o: $(OBJS) - $(LD) $(LDFLAGS) -o $@ $(OBJS) - -include ./make.rules -CFLAGS := ${CFLAGS:-W=} -CFLAGS := ${CFLAGS:-Wshadow=} - -# DO NOT DELETE Index: include/ib_smi.h =================================================================== --- include/ib_smi.h (revision 0) +++ include/ib_smi.h (revision 0) @@ -0,0 +1,66 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. 
+ Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. All rights reserved. +*/ + +#if !defined( IB_SMI_H ) +#define IB_SMI_H + +#include + +#define IB_LID_PERMISSIVE 0xFFFF + +#define IB_SMP_DATA_SIZE 64 +#define IB_SMP_MAX_PATH_HOPS 64 + +struct ib_smp { + u8 base_version; + u8 mgmt_class; + u8 class_version; + u8 method; + u16 status; + u8 hop_ptr; + u8 hop_cnt; + u64 tid; + u16 attr_id; + u16 resv; + u32 attr_mod; + u64 mkey; + u16 dr_slid; + u16 dr_dlid; + u8 reserved[28]; + u8 data[IB_SMP_DATA_SIZE]; + u8 initial_path[IB_SMP_MAX_PATH_HOPS]; + u8 return_path[IB_SMP_MAX_PATH_HOPS]; +} __attribute__ ((packed)); + +#define IB_SMP_DIRECTION cpu_to_be16(0x8000) + +static inline u8 +ib_get_smp_direction(struct ib_smp *smp) +{ + return ((smp->status & IB_SMP_DIRECTION) == IB_SMP_DIRECTION); +} + + +#endif /* IB_SMI_H */ From halr at voltaire.com Wed Sep 15 10:02:22 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 15 Sep 2004 13:02:22 -0400 Subject: [openib-general] [PATCH] beginnings of SMI In-Reply-To: <20040915094510.25f72bab.mshefty@ichips.intel.com> References: <20040914151839.0ece83e6.mshefty@ichips.intel.com> <20040915094510.25f72bab.mshefty@ichips.intel.com> Message-ID: <1095267741.1865.17.camel@localhost.localdomain> On Wed, 2004-09-15 at 12:45, Sean Hefty wrote: > On Tue, 14 Sep 2004 15:18:39 -0700 > Sean Hefty wrote: > > > This patch begins to add in functionality needed by the SMI implementation. > > Let's try this again. Moved SMI specific code from ib_mad.h to > ib_smi.h. Updated printk from previous patch. Looks good. Just one nit: In ib_smi.c, don't need to include ib_smi.h as this is included by ib_mad_priv.h. > And replaced the Makefile. (Feel free to ignore the Makefile changes or to replace it with yours. The current one didn't seem out of date.) I would prefer to hold off on the Makefile for now. 
I have a similar one for the tree but we are not quite there yet and I like to prebuild out of the tree right now. Should I apply the changes (except the Makefile) or do you want to ? -- Hal From halr at voltaire.com Wed Sep 15 10:05:26 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 15 Sep 2004 13:05:26 -0400 Subject: [openib-general] [PATCH] beginnings of SMI In-Reply-To: <1095267741.1865.17.camel@localhost.localdomain> References: <20040914151839.0ece83e6.mshefty@ichips.intel.com> <20040915094510.25f72bab.mshefty@ichips.intel.com> <1095267741.1865.17.camel@localhost.localdomain> Message-ID: <1095267925.1831.20.camel@localhost.localdomain> On Wed, 2004-09-15 at 13:02, Hal Rosenstock wrote: > On Wed, 2004-09-15 at 12:45, Sean Hefty wrote: > > On Tue, 14 Sep 2004 15:18:39 -0700 > > Sean Hefty wrote: > > > > > This patch begins to add in functionality needed by the SMI implementation. > > > > Let's try this again. Moved SMI specific code from ib_mad.h to > > ib_smi.h. Updated printk from previous patch. > > Looks good. Just one nit: > In ib_smi.c, don't need to include ib_smi.h as this is included by > ib_mad_priv.h. > > > And replaced the Makefile. (Feel free to ignore the Makefile changes or to replace it with yours. The current one didn't seem out of date.) > I would prefer to hold off on the Makefile for now. I have a similar one > for the tree but we are not quite there yet and I like to prebuild out > of the tree right now. > > Should I apply the changes (except the Makefile) or do you want to ? One more thing on the Makefile I forgot to mention (I hit send too quickly :-( As ib_verbs.c is in Roland's core right now, if that is the one we should use, it doesn't need to be part of this build. 
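-- Hal From mshefty at ichips.intel.com Wed Sep 15 10:17:01 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 15 Sep 2004 10:17:01 -0700 Subject: [openib-general] [PATCH] beginnings of SMI In-Reply-To: <1095267741.1865.17.camel@localhost.localdomain> References: <20040914151839.0ece83e6.mshefty@ichips.intel.com> <20040915094510.25f72bab.mshefty@ichips.intel.com> <1095267741.1865.17.camel@localhost.localdomain> Message-ID: <20040915101701.176ffde0.mshefty@ichips.intel.com> On Wed, 15 Sep 2004 13:02:22 -0400 Hal Rosenstock wrote: > > And replaced the Makefile. (Feel free to ignore the Makefile changes or to replace it with yours. The current one didn't seem out of date.) > I would prefer to hold off on the Makefile for now. I have a similar one > for the tree but we are not quite there yet and I like to prebuild out > of the tree right now. > > Should I apply the changes (except the Makefile) or do you want to ? I'll ignore the Makefile for now and commit the changes. Hopefully, most future short term changes will be limited to just the SMI files. From mshefty at ichips.intel.com Wed Sep 15 10:57:00 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 15 Sep 2004 10:57:00 -0700 Subject: [openib-general] [PATCH] ib_mad: Implement ib_free_recv_mad routine In-Reply-To: <1095266304.1865.11.camel@localhost.localdomain> References: <1095266304.1865.11.camel@localhost.localdomain> Message-ID: <20040915105700.0217f4be.mshefty@ichips.intel.com> On Wed, 15 Sep 2004 12:38:24 -0400 Hal Rosenstock wrote: > ib_mad: Implement ib_free_recv_mad routine > =================================================================== > +void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc) > +{ > + kfree(mad_recv_wc); /* RMPP !!! */ > +} > +EXPORT_SYMBOL(ib_free_recv_mad); > + > static inline u8 convert_mgmt_class(u8 mgmt_class) > { > /* Alias IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE to 0 */ > @@ -669,7 +679,6 @@ > static void ib_mad_recv_done_handler(struct ib_mad_port_private > *port_priv, > struct ib_wc *wc) > { > - struct ib_mad_recv_wc recv_wc; > struct ib_mad_private *recv; > unsigned long flags; > u32 qp_num; > @@ -707,9 +716,9 @@ > PCI_DMA_FROMDEVICE); > > /* Setup MAD receive work completion from "normal" work completion */ > - recv_wc.wc = wc; > - recv_wc.mad_len = sizeof(struct ib_mad); /* Should this be based on > wc->byte_len ? Also, RMPP !!! */ > - recv_wc.recv_buf = &recv->header.recv_buf; > + recv->header.recv_wc.wc = wc; > + recv->header.recv_wc.mad_len = sizeof(struct ib_mad); /* Should this > be based on wc->byte_len ? Also, RMPP !!! */ > + recv->header.recv_wc.recv_buf = &recv->header.recv_buf; > Would it make more sense for ib_mad_recv_wc to have an embedded recv_buf, rather than a pointer to a recv_buf? Seems like a minor optimization/simplification. From roland at topspin.com Wed Sep 15 11:46:58 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Sep 2004 11:46:58 -0700 Subject: [openib-general] [PATCH] Move SDP to dynamic device enumeration Message-ID: <52r7p3wesd.fsf@topspin.com> This moves the SDP in my tree to using the struct ib_client method for device enumeration. There are still problems with adding and removing devices because the ip2pr module is still using static methods, but I think this fixes up SDP. Libor, seem OK to commit?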
*/ > +} > +EXPORT_SYMBOL(ib_free_recv_mad); > + > static inline u8 convert_mgmt_class(u8 mgmt_class) > { > /* Alias IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE to 0 */ > @@ -669,7 +679,6 @@ > static void ib_mad_recv_done_handler(struct ib_mad_port_private > *port_priv, > struct ib_wc *wc) > { > - struct ib_mad_recv_wc recv_wc; > struct ib_mad_private *recv; > unsigned long flags; > u32 qp_num; > @@ -707,9 +716,9 @@ > PCI_DMA_FROMDEVICE); > > /* Setup MAD receive work completion from "normal" work completion */ > - recv_wc.wc = wc; > - recv_wc.mad_len = sizeof(struct ib_mad); /* Should this be based on > wc->byte_len ? Also, RMPP !!! */ > - recv_wc.recv_buf = &recv->header.recv_buf; > + recv->header.recv_wc.wc = wc; > + recv->header.recv_wc.mad_len = sizeof(struct ib_mad); /* Should this > be based on wc->byte_len ? Also, RMPP !!! */ > + recv->header.recv_wc.recv_buf = &recv->header.recv_buf; > Would it make more sense for ib_mad_recv_wc to have an embedded recv_buf, rather than a a pointer to a recv_buf? Seems like a minor optimization/simplification. From roland at topspin.com Wed Sep 15 11:46:58 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Sep 2004 11:46:58 -0700 Subject: [openib-general] [PATCH] Move SDP to dynamic device enumeration Message-ID: <52r7p3wesd.fsf@topspin.com> This moves the SDP in my tree to using the struct ib_client method for device enumeration. There are still problems with adding and removing devices because the ip2pr module is still using static methods, but I think this fixes up SDP. Libor, seem OK to commit? 
Thanks, Roland Index: infiniband/ulp/sdp/sdp_conn.c =================================================================== --- infiniband/ulp/sdp/sdp_conn.c (revision 836) +++ infiniband/ulp/sdp/sdp_conn.c (working copy) @@ -28,6 +28,16 @@ static char _recv_pool_name[] = TS_SDP_SOCK_RECV_DATA_NAME; static struct sdev_root _dev_root_s; + +static void sdp_device_init_one(struct ib_device *device); +static void sdp_device_remove_one(struct ib_device *device); + +static struct ib_client sdp_client = { + .name = "sdp", + .add = sdp_device_init_one, + .remove = sdp_device_remove_one +}; + /* --------------------------------------------------------------------- */ /* */ /* module specific functions */ @@ -1016,27 +1026,17 @@ /* * look up correct HCA and port */ - for (hca = _dev_root_s.hca_list; NULL != hca; hca = hca->next) { + hca = ib_get_client_data(device, &sdp_client); + if (!hca) + return -ERANGE; - if (device == hca->ca) { - - for (port = hca->port_list; NULL != port; - port = port->next) { - - if (hw_port == port->index) { - - break; - } - } - + for (port = hca->port_list; NULL != port; port = port->next) + if (hw_port == port->index) break; - } - } - if (NULL == hca || NULL == port) { - + if (!port) return -ERANGE; - } + /* * allocate creation parameters */ @@ -1815,12 +1815,6 @@ off_t start_index, long *end_index) { - struct sdev_hca_port *port; - struct sdev_hca *hca; - u64 subnet_prefix; - u64 guid; - int hca_count; - int port_count; int offset = 0; TS_CHECK_NULL(buffer, -EINVAL); @@ -1857,40 +1851,8 @@ offset += sprintf((buffer + offset), "max receive buffered: <%d>\n", _dev_root_s.recv_buff_max); - - offset += sprintf((buffer + offset), "HCAs:\n"); } - /* - * HCA loop - */ - for (hca = _dev_root_s.hca_list, hca_count = 0; - NULL != hca; hca = hca->next, hca_count++) { - offset += sprintf((buffer + offset), - " hca %02x: ca <%p> pd <%p> mem <%p> " - "l_key <%08x>\n", - hca_count, hca->ca, hca->pd, - hca->mem_h, hca->l_key); - - for (port = hca->port_list, 
port_count = 0; - NULL != port; port = port->next, port_count++) { - - subnet_prefix = cpu_to_be64(*(u64 *) (port->gid)); - guid = cpu_to_be64(*(u64 *)(port->gid + sizeof(u64))); - - offset += sprintf((buffer + offset), - " port %02x: index <%d> gid " - "<%08x%08x:%08x%08x>\n", - port_count, - port->index, - (u32)((subnet_prefix >> 32) & - 0xffffffff), - (u32)(subnet_prefix & 0xffffffff), - (u32)((guid >> 32) & 0xffffffff), - (u32)(guid & 0xffffffff)); - } - } - return offset; } /* sdp_proc_dump_device */ @@ -1899,232 +1861,200 @@ /* initialization/cleanup functions */ /* */ /* --------------------------------------------------------------------- */ + /* ========================================================================= */ /*.._sdp_device_table_init -- create hca list */ -static s32 _sdp_device_table_init(struct sdev_root *dev_root) +static void sdp_device_init_one(struct ib_device *device) { #ifdef _TS_SDP_AIO_SUPPORT struct ib_fmr_pool_param fmr_param_s; #endif struct ib_phys_buf buffer_list; struct ib_device_attr node_info; - struct ib_device *hca_handle; struct sdev_hca_port *port; struct sdev_hca *hca; - s32 result; - s32 hca_count; - s32 port_count; - s32 fmr_size; + int result; + int port_count; - TS_CHECK_NULL(dev_root, -EINVAL); + result = ib_query_device(device, &node_info); + if (0 != result) { - TS_TRACE(MOD_LNX_SDP, T_VERY_VERBOSE, TRACE_FLOW_INOUT, - "INIT: Probing HCA/Port list."); + TS_TRACE(MOD_LNX_SDP, T_VERBOSE, TRACE_FLOW_FATAL, + "INIT: Error <%d> fetching HCA <%s> type.", + result, device->name); + return; + } /* - * first count number of HCA's + * allocate per-HCA structure */ - hca_count = 0; + hca = kmalloc(sizeof(struct sdev_hca), GFP_KERNEL); + if (NULL == hca) { - while (ib_device_get_by_index(hca_count)) { - - hca_count++; + TS_TRACE(MOD_LNX_SDP, T_VERBOSE, TRACE_FLOW_FATAL, + "INIT: Error allocating HCA <%s> memory.", + device->name); + return; } + /* + * init and insert into list. 
+ */ + memset(hca, 0, sizeof(struct sdev_hca)); - fmr_size = TS_SDP_FMR_POOL_SIZE / hca_count; + hca->fmr_pool = NULL; + hca->mem_h = NULL; + hca->pd = NULL; + hca->ca = device; + /* + * protection domain + */ + hca->pd = ib_alloc_pd(hca->ca); + if (IS_ERR(hca->pd)) { - for (hca_count = 0; - (hca_handle = ib_device_get_by_index(hca_count)) != NULL; - hca_count++) { - if (!hca_handle || !try_module_get(hca_handle->owner)) - continue; + TS_TRACE(MOD_LNX_SDP, T_TERSE, TRACE_FLOW_FATAL, + "INIT: Error <%d> creating HCA <%s> protection domain.", + PTR_ERR(hca->pd), device->name); + goto error; + } + /* + * memory registration + */ + buffer_list.addr = 0; + buffer_list.size = (unsigned long)high_memory - PAGE_OFFSET; - result = ib_query_device(hca_handle, &node_info); - if (0 != result) { + hca->iova = 0; - TS_TRACE(MOD_LNX_SDP, T_VERBOSE, TRACE_FLOW_FATAL, - "INIT: Error <%d> fetching HCA <%x:%d> type.", - result, hca_handle, hca_count); - goto error; - } - /* - * allocate per-HCA structure - */ - hca = kmalloc(sizeof(struct sdev_hca), GFP_KERNEL); - if (NULL == hca) { + hca->mem_h = ib_reg_phys_mr(hca->pd, + &buffer_list, + 1, /* list_len */ + IB_ACCESS_LOCAL_WRITE, + &hca->iova); + if (IS_ERR(hca->mem_h)) { + result = PTR_ERR(hca->mem_h); + TS_TRACE(MOD_LNX_SDP, T_TERSE, TRACE_FLOW_FATAL, + "INIT: Error <%d> registering HCA <%s> memory.", + result, device->name); + goto error; + } - TS_TRACE(MOD_LNX_SDP, T_VERBOSE, TRACE_FLOW_FATAL, - "INIT: Error allocating HCA <%x:%d> memory.", - hca_handle, hca_count); - result = -ENOMEM; - goto error; - } - /* - * init and insert into list. 
- */ - memset(hca, 0, sizeof(struct sdev_hca)); + hca->l_key = hca->mem_h->lkey; + hca->r_key = hca->mem_h->rkey; - hca->next = dev_root->hca_list; - dev_root->hca_list = hca; +#ifdef _TS_SDP_AIO_SUPPORT + /* + * FMR allocation + */ + fmr_param_s.pool_size = TS_SDP_FMR_POOL_SIZE; + fmr_param_s.dirty_watermark = TS_SDP_FMR_DIRTY_SIZE; + fmr_param_s.cache = 1; + fmr_param_s.max_pages_per_fmr = TS_SDP_IOCB_PAGES_MAX; + fmr_param_s.access = (IB_ACCESS_LOCAL_WRITE | + IB_ACCESS_REMOTE_WRITE | + IB_ACCESS_REMOTE_READ); - hca->fmr_pool = NULL; - hca->mem_h = NULL; - hca->pd = NULL; - hca->ca = hca_handle; - /* - * protection domain - */ - hca->pd = ib_alloc_pd(hca->ca); - if (IS_ERR(hca->pd)) { + fmr_param_s.flush_function = NULL; + /* + * create SDP memory pool + */ + result = ib_create_fmr_pool(hca->pd, + &fmr_param_s, + &hca->fmr_pool); + if (0 > result) { - TS_TRACE(MOD_LNX_SDP, T_TERSE, TRACE_FLOW_FATAL, - "INIT: Error <%d> creating HCA <%x:%d> protection domain.", - PTR_ERR(hca->pd), hca_handle, hca_count); - goto error; - } - /* - * memory registration - */ - buffer_list.addr = 0; - buffer_list.size = (unsigned long)high_memory - PAGE_OFFSET; + TS_TRACE(MOD_LNX_SDP, T_TERSE, TRACE_FLOW_FATAL, + "INIT: Error <%d> creating HCA <%s> fast memory pool.", + result, hca->ca); + goto error; + } +#endif /* _TS_SDP_AIO_SUPPORT */ + /* + * port allocation + */ + for (port_count = 0; port_count < node_info.phys_port_cnt; port_count++) { - hca->iova = 0; + port = kmalloc(sizeof(struct sdev_hca_port), + GFP_KERNEL); + if (NULL == port) { - hca->mem_h = ib_reg_phys_mr(hca->pd, - &buffer_list, - 1, /* list_len */ - IB_ACCESS_LOCAL_WRITE, - &hca->iova); - if (IS_ERR(hca->mem_h)) { - result = PTR_ERR(hca->mem_h); - TS_TRACE(MOD_LNX_SDP, T_TERSE, TRACE_FLOW_FATAL, - "INIT: Error <%d> registering HCA <%x:%d> memory.", - result, hca_handle, hca_count); + TS_TRACE(MOD_LNX_SDP, T_VERBOSE, + TRACE_FLOW_FATAL, + "INIT: Error allocating HCA <%s> port <%d:%d> memory.", + device->name, 
port_count, + node_info.phys_port_cnt); + goto error; } - hca->l_key = hca->mem_h->lkey; - hca->r_key = hca->mem_h->rkey; + memset(port, 0, sizeof(struct sdev_hca_port)); -#ifdef _TS_SDP_AIO_SUPPORT - /* - * FMR allocation - */ - fmr_param_s.pool_size = fmr_size; - fmr_param_s.dirty_watermark = TS_SDP_FMR_DIRTY_SIZE; - fmr_param_s.cache = 1; - fmr_param_s.max_pages_per_fmr = TS_SDP_IOCB_PAGES_MAX; - fmr_param_s.access = (IB_ACCESS_LOCAL_WRITE | - IB_ACCESS_REMOTE_WRITE | - IB_ACCESS_REMOTE_READ); + port->index = port_count + 1; + port->next = hca->port_list; + hca->port_list = port; - fmr_param_s.flush_function = NULL; - /* - * create SDP memory pool - */ - result = ib_create_fmr_pool(hca->pd, - &fmr_param_s, - &hca->fmr_pool); - if (0 > result) { + result = ib_query_gid(hca->ca, + port->index, + 0, /* index */ + (union ib_gid *) port->gid); + if (0 != result) { - TS_TRACE(MOD_LNX_SDP, T_TERSE, TRACE_FLOW_FATAL, - "INIT: Error <%d> creating HCA <%d:%d> fast memory pool.", - result, hca->ca, hca_count); + TS_TRACE(MOD_LNX_SDP, T_VERBOSE, + TRACE_FLOW_FATAL, + "INIT: Error <%d> getting GID for HCA <%s> port <%d:%d>", + result, hca->ca, + port->index, node_info.phys_port_cnt); goto error; } -#endif /* _TS_SDP_AIO_SUPPORT */ - /* - * port allocation - */ - for (port_count = 0; port_count < node_info.phys_port_cnt; - port_count++) { + } - port = kmalloc(sizeof(struct sdev_hca_port), - GFP_KERNEL); - if (NULL == port) { + ib_set_client_data(device, &sdp_client, hca); - TS_TRACE(MOD_LNX_SDP, T_VERBOSE, - TRACE_FLOW_FATAL, - "INIT: Error allocating HCA <%d:%d> port <%x:%d> memory.", - hca_handle, hca_count, port_count, - node_info.phys_port_cnt); + return; - result = -ENOMEM; - goto error; - } +error: + for (port = hca->port_list; NULL != port; port = hca->port_list) { + hca->port_list = port->next; + port->next = NULL; - memset(port, 0, sizeof(struct sdev_hca_port)); + kfree(port); + } - port->index = port_count + 1; - port->next = hca->port_list; - hca->port_list = port; 
+ if (hca->fmr_pool) + (void)ib_destroy_fmr_pool(hca->fmr_pool); - result = ib_query_gid(hca->ca, - port->index, - 0, /* index */ - (union ib_gid *) port->gid); - if (0 != result) { + if (hca->mem_h) + (void)ib_dereg_mr(hca->mem_h); - TS_TRACE(MOD_LNX_SDP, T_VERBOSE, - TRACE_FLOW_FATAL, - "INIT: Error <%d> getting GID for HCA <%d:%d> port <%d:%d>", - result, hca->ca, hca_count, - port->index, node_info.phys_port_cnt); - goto error; - } - } - } + if (hca->pd) + (void)ib_dealloc_pd(hca->pd); - return 0; -error: - return result; + kfree(hca); } /* _sdp_device_table_init */ /* ========================================================================= */ /*.._sdp_device_table_cleanup -- delete hca list */ -static s32 _sdp_device_table_cleanup(struct sdev_root *dev_root) +static void sdp_device_remove_one(struct ib_device *device) { struct sdev_hca_port *port; struct sdev_hca *hca; - TS_CHECK_NULL(dev_root, -EINVAL); - /* - * free all hca/ports - */ - for (hca = dev_root->hca_list; NULL != hca; hca = dev_root->hca_list) { + hca = ib_get_client_data(device, &sdp_client); - dev_root->hca_list = hca->next; - hca->next = NULL; + for (port = hca->port_list; NULL != port; port = hca->port_list) { + hca->port_list = port->next; + port->next = NULL; - for (port = hca->port_list; NULL != port; port = hca->port_list) { + kfree(port); + } - hca->port_list = port->next; - port->next = NULL; + if (hca->fmr_pool) + (void)ib_destroy_fmr_pool(hca->fmr_pool); - kfree(port); - } + if (hca->mem_h) + (void)ib_dereg_mr(hca->mem_h); - if (NULL != hca->fmr_pool) { + if (hca->pd) + (void)ib_dealloc_pd(hca->pd); - (void)ib_destroy_fmr_pool(hca->fmr_pool); - } - - if (hca->mem_h) { - - (void)ib_dereg_mr(hca->mem_h); - } - - if (hca->pd) { - - (void)ib_dealloc_pd(hca->pd); - } - - if (hca->ca) - module_put(hca->ca->owner); - - kfree(hca); - } - - return 0; + kfree(hca); } /* _sdp_device_table_cleanup */ /* ========================================================================= */ @@ -2170,11 
+2100,11 @@ /* * Get HCA/port list */ - result = _sdp_device_table_init(&_dev_root_s); + result = ib_register_client(&sdp_client); if (0 > result) { TS_TRACE(MOD_LNX_SDP, T_TERSE, TRACE_FLOW_FATAL, - "INIT: Error <%d> building HCA/port list.", result); + "INIT: Error <%d> registering SDP client.", result); goto error_hca; } /* @@ -2281,8 +2211,8 @@ free_pages((unsigned long)_dev_root_s.sk_array, _dev_root_s.sk_ordr); error_array: error_size: + ib_unregister_client(&sdp_client); error_hca: - (void)_sdp_device_table_cleanup(&_dev_root_s); return result; } /* sdp_conn_table_init */ @@ -2302,7 +2232,7 @@ /* * delete list of HCAs/PORTs */ - (void)_sdp_device_table_cleanup(&_dev_root_s); + ib_unregister_client(&sdp_client); /* * drop socket table */ Index: infiniband/ulp/sdp/sdp_dev.h =================================================================== --- infiniband/ulp/sdp/sdp_dev.h (revision 803) +++ infiniband/ulp/sdp/sdp_dev.h (working copy) @@ -167,12 +167,8 @@ int send_buff_max; int send_usig_max; /* - * devices. list of installed HCA's and some associated parameters - */ - struct sdev_hca *hca_list; - /* * connections. The table is a simple linked list, since it does not - * need to require fast lookup capabilities. + * need fast lookup capabilities. */ u32 sk_size; /* socket array size */ u32 sk_ordr; /* order size of region. */ From tduffy at sun.com Wed Sep 15 13:44:54 2004 From: tduffy at sun.com (Tom Duffy) Date: Wed, 15 Sep 2004 13:44:54 -0700 Subject: [openib-general] [BUG] Byte swapped Pkeys Message-ID: <1095281095.22224.26.camel@duffman> So, I am trying to use the Sun subnet manager on my IB network (to test compatibility with openib). Our SM uses a pkey of 7FFF. This runs on Solaris on sparc64 (BE system). What I have noticed is that since switching to mthca (from the old ib_tavor), openib is (not) byte swapping the pkey properly on LE systems. We have hooked up an IB analyzer to indeed verify that openib is sending FF7F over the wire for the pkey. 
I am still investigating why this is happening, but I would like to suggest that we change all the pkeys in openib to use 0x0BB0 other than 0xFFFF from now on so that we can be sure to see clearly when things are not going out network endian ;) -tduffy From halr at voltaire.com Wed Sep 15 14:03:48 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 15 Sep 2004 17:03:48 -0400 Subject: [openib-general] [PATCH] ib_mad: Implement ib_free_recv_mad routine In-Reply-To: <20040915105700.0217f4be.mshefty@ichips.intel.com> References: <1095266304.1865.11.camel@localhost.localdomain> <20040915105700.0217f4be.mshefty@ichips.intel.com> Message-ID: <1095282228.1831.12.camel@localhost.localdomain> On Wed, 2004-09-15 at 13:57, Sean Hefty wrote: > On Wed, 15 Sep 2004 12:38:24 -0400 > Hal Rosenstock wrote: > > > ib_mad: Implement ib_free_recv_mad routine > > =================================================================== > > +void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc) > > +{ > > + kfree(mad_recv_wc); /* RMPP !!! */ > > +} > > +EXPORT_SYMBOL(ib_free_recv_mad); > > + > > static inline u8 convert_mgmt_class(u8 mgmt_class) > > { > > /* Alias IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE to 0 */ > > @@ -669,7 +679,6 @@ > > static void ib_mad_recv_done_handler(struct ib_mad_port_private > > *port_priv, > > struct ib_wc *wc) > > { > > - struct ib_mad_recv_wc recv_wc; > > struct ib_mad_private *recv; > > unsigned long flags; > > u32 qp_num; > > @@ -707,9 +716,9 @@ > > PCI_DMA_FROMDEVICE); > > > > /* Setup MAD receive work completion from "normal" work completion */ > > - recv_wc.wc = wc; > > - recv_wc.mad_len = sizeof(struct ib_mad); /* Should this be based on > > wc->byte_len ? Also, RMPP !!! 
*/ > > - recv_wc.recv_buf = &recv->header.recv_buf; > > + recv->header.recv_wc.wc = wc; > > + recv->header.recv_wc.mad_len = sizeof(struct ib_mad); /* Should this > > be based on wc->byte_len ? Also, RMPP !!! */ > > + recv->header.recv_wc.recv_buf = &recv->header.recv_buf; > > > > Would it make more sense for ib_mad_recv_wc to have an embedded > recv_buf, rather than a a pointer to a recv_buf? > Seems like a minor optimization/simplification. I think that optimization would only work short term. Once RMPP is supported, it doesn't so I would leave the API as is. -- Hal From halr at voltaire.com Wed Sep 15 14:09:20 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 15 Sep 2004 17:09:20 -0400 Subject: [openib-general] [BUG] Byte swapped Pkeys In-Reply-To: <1095281095.22224.26.camel@duffman> References: <1095281095.22224.26.camel@duffman> Message-ID: <1095282559.1831.18.camel@localhost.localdomain> On Wed, 2004-09-15 at 16:44, Tom Duffy wrote: > We have hooked up an IB analyzer to indeed verify that openib is sending > FF7F over the wire for the pkey. Are you referring to OpenSM here ? > I am still investigating why this is happening, but I would like > to suggest that we change all the pkeys in openib to use 0x0BB0 > other than 0xFFFF from now on so that we can be sure to see > clearly when things are not going out network endian ;) I'm not familiar with "0x0BB0 other than 0xFFFF" but clearly you mean a PKey which is different in big and little endian. We probably would still want to use a full rather than limited PKey. Also, I'm not sure what the state of OpenSM is relative to using PKeys other than the default partition (assuming you were referring to OpenSM). 
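For anyone following along, here is a small userspace sketch of the two properties being discussed: whether a P_Key value looks different after a byte swap, and whether it is a full or limited member (top bit set means full). The helper names are made up for illustration, and 0x8BB0 is only an example of a value that has both properties, not something proposed in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: bit 15 of a P_Key is the membership bit. */
#define PKEY_FULL_MEMBER 0x8000u

/* Swap the two bytes of a 16-bit value. */
static uint16_t pkey_swap16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}

/* Nonzero if the key has the full-membership bit set. */
static int pkey_is_full(uint16_t pkey)
{
	return (pkey & PKEY_FULL_MEMBER) != 0;
}

/* Nonzero if a missing byte swap would be visible on the wire. */
static int pkey_reveals_swap(uint16_t pkey)
{
	return pkey_swap16(pkey) != pkey;
}

/* Hypothetical self-check over the values mentioned in the thread. */
void pkey_selfcheck(void)
{
	/* The default key reads the same either way, so it hides the bug. */
	assert(!pkey_reveals_swap(0xFFFF));
	/* 0x7FFF swapped is the FF7F seen on the analyzer. */
	assert(pkey_swap16(0x7FFF) == 0xFF7F);
	/* 0x0BB0 shows a swap but is a limited member. */
	assert(pkey_reveals_swap(0x0BB0) && !pkey_is_full(0x0BB0));
	/* 0x8BB0 (illustrative only) shows a swap and is a full member. */
	assert(pkey_reveals_swap(0x8BB0) && pkey_is_full(0x8BB0));
}
```

Any test key would need both properties to cover the concern raised here: asymmetric under a byte swap, and top bit set.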
-- Hal From halr at voltaire.com Wed Sep 15 14:23:37 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 15 Sep 2004 17:23:37 -0400 Subject: [openib-general] Re: ib_mad.c comments In-Reply-To: <20040913100755.2ee70dee.mshefty@ichips.intel.com> References: <20040908164739.3e9c8723.mshefty@ichips.intel.com> <1094852615.1794.538.camel@localhost.localdomain> <20040910160019.5b064909.mshefty@ichips.intel.com> <1094861535.1752.939.camel@localhost.localdomain> <20040913100755.2ee70dee.mshefty@ichips.intel.com> Message-ID: <1095283416.1837.23.camel@localhost.localdomain> On Mon, 2004-09-13 at 13:07, Sean Hefty wrote: > > I was thinking this (separate receive lists for QPs 0 and 1) but was not > > quite there. I will now be doing this shortly. This also seems to mean a > > list for every redirected QP too :-( That is a future item. > > I'm not sure that we need to maintain lists for redirected QPs. > The user owns posting the receives on that QP, so will have set > the wr_id to something use for them. When the redirected QP is > destroyed, the user should get completions for all posted receives. > > For QP0/1, the access layer owns the receive lists, so > should be able to set the wr_id to whatever it needs in > order to recover the buffers. What about send lists for redirected QPs ? I presume the MAD layer would maintain those (for response/request matching, timeouts). 
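For reference, the wr_id approach quoted above for the receive side could look roughly like the following userspace sketch: the owner of the list stores the buffer address in the 64-bit wr_id when posting, and casts it back when the completion arrives. The structure and function names are invented stand-ins, not the real ib_mad types, and nothing here addresses the send-side question.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a posted receive buffer; not the real ib_mad structure. */
struct demo_recv_buf {
	int qp_index;
	unsigned char data[256];
};

/* Pack the buffer address into the 64-bit work request ID. */
static uint64_t buf_to_wr_id(struct demo_recv_buf *buf)
{
	return (uint64_t)(uintptr_t)buf;
}

/* Recover the buffer when the completion hands the wr_id back. */
static struct demo_recv_buf *wr_id_to_buf(uint64_t wr_id)
{
	return (struct demo_recv_buf *)(uintptr_t)wr_id;
}

/* Hypothetical round-trip check. */
void wr_id_selfcheck(void)
{
	struct demo_recv_buf buf = { .qp_index = 1 };
	uint64_t wr_id = buf_to_wr_id(&buf);

	assert(wr_id_to_buf(wr_id) == &buf);
	assert(wr_id_to_buf(wr_id)->qp_index == 1);
}
```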
-- Hal From mshefty at ichips.intel.com Wed Sep 15 14:33:51 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 15 Sep 2004 14:33:51 -0700 Subject: [openib-general] [PATCH] ib_mad: Implement ib_free_recv_mad routine In-Reply-To: <1095282228.1831.12.camel@localhost.localdomain> References: <1095266304.1865.11.camel@localhost.localdomain> <20040915105700.0217f4be.mshefty@ichips.intel.com> <1095282228.1831.12.camel@localhost.localdomain> Message-ID: <20040915143351.43e7e56f.mshefty@ichips.intel.com> On Wed, 15 Sep 2004 17:03:48 -0400 Hal Rosenstock wrote: > > Would it make more sense for ib_mad_recv_wc to have an embedded > > recv_buf, rather than a a pointer to a recv_buf? > > Seems like a minor optimization/simplification. > > I think that optimization would only work short term. Once RMPP is > supported, it doesn't so I would leave the API as is. I think that the optimization works even with RMPP. Assuming that most receives are for a one MAD, it allocates a single structure, rather than two that then need to be chained together. For RMPP, you can still have more recv_buf's chained together. It's just that the first recv_buf is included with the mad_recv_wc, rather than being linked off of it. From mshefty at ichips.intel.com Wed Sep 15 14:41:21 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 15 Sep 2004 14:41:21 -0700 Subject: [openib-general] Re: ib_mad.c comments In-Reply-To: <1095283416.1837.23.camel@localhost.localdomain> References: <20040908164739.3e9c8723.mshefty@ichips.intel.com> <1094852615.1794.538.camel@localhost.localdomain> <20040910160019.5b064909.mshefty@ichips.intel.com> <1094861535.1752.939.camel@localhost.localdomain> <20040913100755.2ee70dee.mshefty@ichips.intel.com> <1095283416.1837.23.camel@localhost.localdomain> Message-ID: <20040915144121.230fecf4.mshefty@ichips.intel.com> On Wed, 15 Sep 2004 17:23:37 -0400 Hal Rosenstock wrote: > What about send lists for redirected QPs ? 
I presume the MAD layer would > maintain those (for response/request matching, timeouts). I need to think more about the send side for redirected QPs, but I think that you're right. From halr at voltaire.com Wed Sep 15 14:42:05 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 15 Sep 2004 17:42:05 -0400 Subject: [openib-general] [PATCH] ib_mad: Implement ib_free_recv_mad routine In-Reply-To: <20040915143351.43e7e56f.mshefty@ichips.intel.com> References: <1095266304.1865.11.camel@localhost.localdomain> <20040915105700.0217f4be.mshefty@ichips.intel.com> <1095282228.1831.12.camel@localhost.localdomain> <20040915143351.43e7e56f.mshefty@ichips.intel.com> Message-ID: <1095284525.1837.32.camel@localhost.localdomain> On Wed, 2004-09-15 at 17:33, Sean Hefty wrote: > On Wed, 15 Sep 2004 17:03:48 -0400 > Hal Rosenstock wrote: > > > > Would it make more sense for ib_mad_recv_wc to have an embedded > > > recv_buf, rather than a a pointer to a recv_buf? > > > Seems like a minor optimization/simplification. > > > > I think that optimization would only work short term. Once RMPP is > > supported, it doesn't so I would leave the API as is. > > I think that the optimization works even with RMPP. Assuming that > most receives are for a one MAD, it allocates a single structure, > rather than two that then need to be chained together. For RMPP, > you can still have more recv_buf's chained together. It's just > that the first recv_buf is included with the mad_recv_wc, > rather than being linked off of it. OK. It is probably worth the extra memory to optimize for the single buffer case. 
-- Hal From roland at topspin.com Wed Sep 15 14:43:04 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Sep 2004 14:43:04 -0700 Subject: [openib-general] [BUG] Byte swapped Pkeys In-Reply-To: <1095281095.22224.26.camel@duffman> (Tom Duffy's message of "Wed, 15 Sep 2004 13:44:54 -0700") References: <1095281095.22224.26.camel@duffman> Message-ID: <52d60nw6mv.fsf@topspin.com> Tom> So, I am trying to use the Sun subnet manager on my IB Tom> network (to test compatibility with openib). Our SM uses a Tom> pkey of 7FFF. This runs on Solaris on sparc64 (BE system). Tom> What I have noticed is that since switching to mthca (from Tom> the old ib_tavor), openib is (not) byte swapping the pkey Tom> properly on LE systems. Tom> We have hooked up an IB analyzer to indeed verify that openib Tom> is sending FF7F over the wire for the pkey. Clearly there's a bug in mthca. Which packets have the wrong byte order? MAD packets? Is it working well enough to be able to send other traffic? Tom> I am still investigating why this is happening, but I would Tom> like to suggest that we change all the pkeys in openib to use Tom> 0x0BB0 other than 0xFFFF from now on so that we can be sure Tom> to see clearly when things are not going out network endian It's a good idea... but it's really up to the SM to assign P_Keys. I'll try to update my test network config to do this. - R. 
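A userspace sketch of how this kind of symptom can arise, in case it helps whoever hunts for it: a 16-bit big-endian field read straight out of a buffer with a host-order load comes out swapped on LE machines, while assembling it from bytes (or passing the raw value through ntohs()) gives the same answer everywhere. The two-byte buffer and the function names are invented for the example and do not correspond to any real MAD layout.

```c
#include <arpa/inet.h>	/* ntohs() */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Raw host-order load: only correct as-is on big-endian hosts. */
static uint16_t load16_host(const uint8_t *p)
{
	uint16_t v;

	memcpy(&v, p, sizeof v);
	return v;
}

/* Byte-wise big-endian load: same result on any host. */
static uint16_t load16_be(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/* Hypothetical check using P_Key 0x7FFF as it would sit in a wire buffer. */
void wire_load_selfcheck(void)
{
	const uint8_t wire[2] = { 0x7f, 0xff };

	assert(load16_be(wire) == 0x7FFF);
	/* Converting the raw load from network order fixes it up. */
	assert(ntohs(load16_host(wire)) == 0x7FFF);
}
```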
From mshefty at ichips.intel.com Wed Sep 15 14:45:41 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 15 Sep 2004 14:45:41 -0700 Subject: [openib-general] [PATCH] ib_mad: Implement ib_free_recv_mad routine In-Reply-To: <1095284525.1837.32.camel@localhost.localdomain> References: <1095266304.1865.11.camel@localhost.localdomain> <20040915105700.0217f4be.mshefty@ichips.intel.com> <1095282228.1831.12.camel@localhost.localdomain> <20040915143351.43e7e56f.mshefty@ichips.intel.com> <1095284525.1837.32.camel@localhost.localdomain> Message-ID: <20040915144541.1c437e87.mshefty@ichips.intel.com> On Wed, 15 Sep 2004 17:42:05 -0400 Hal Rosenstock wrote: > On Wed, 2004-09-15 at 17:33, Sean Hefty wrote: > > On Wed, 15 Sep 2004 17:03:48 -0400 > > Hal Rosenstock wrote: > > > > I think that the optimization works even with RMPP. Assuming that > > most receives are for a one MAD, it allocates a single structure, > > rather than two that then need to be chained together. For RMPP, > > you can still have more recv_buf's chained together. It's just > > that the first recv_buf is included with the mad_recv_wc, > > rather than being linked off of it. > > OK. It is probably worth the extra memory to optimize for the single > buffer case. Actually, this change should decrease memory usage -- saves a whole pointer. From roland at topspin.com Wed Sep 15 14:47:10 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Sep 2004 14:47:10 -0700 Subject: [openib-general] [BUG] Byte swapped Pkeys In-Reply-To: <1095281095.22224.26.camel@duffman> (Tom Duffy's message of "Wed, 15 Sep 2004 13:44:54 -0700") References: <1095281095.22224.26.camel@duffman> Message-ID: <528ybbw6g1.fsf@topspin.com> I see one problem, I think. Can you try this? 
Thanks, Roland Index: infiniband/hw/mthca/mthca_provider.c =================================================================== --- infiniband/hw/mthca/mthca_provider.c (revision 824) +++ infiniband/hw/mthca/mthca_provider.c (working copy) @@ -166,7 +166,7 @@ goto out; } - *pkey = ((u16 *) (out_mad->data + 40))[index % 32]; + *pkey = be16_to_cpu(((u16 *) (out_mad->data + 40))[index % 32]); out: kfree(in_mad); From mshefty at ichips.intel.com Wed Sep 15 14:53:50 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 15 Sep 2004 14:53:50 -0700 Subject: [openib-general] [BUG] Byte swapped Pkeys In-Reply-To: <1095281095.22224.26.camel@duffman> References: <1095281095.22224.26.camel@duffman> Message-ID: <20040915145350.6c168472.mshefty@ichips.intel.com> On Wed, 15 Sep 2004 13:44:54 -0700 Tom Duffy wrote: > I am still investigating why this is happening, but I would like to > suggest that we change all the pkeys in openib to use 0x0BB0 other than > 0xFFFF from now on so that we can be sure to see clearly when things are > not going out network endian ;) Is there a reason why pkeys shouldn't be big endian all the time? From ftillier at infiniconsys.com Wed Sep 15 15:02:02 2004 From: ftillier at infiniconsys.com (Fab Tillier) Date: Wed, 15 Sep 2004 15:02:02 -0700 Subject: [openib-general] [BUG] Byte swapped Pkeys In-Reply-To: <20040915145350.6c168472.mshefty@ichips.intel.com> Message-ID: <000001c49b6f$a25374b0$655aa8c0@infiniconsys.com> > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Wednesday, September 15, 2004 2:54 PM > > On Wed, 15 Sep 2004 13:44:54 -0700 > Tom Duffy wrote: > > > I am still investigating why this is happening, but I would like to > > suggest that we change all the pkeys in openib to use 0x0BB0 other than > > 0xFFFF from now on so that we can be sure to see clearly when things are > > not going out network endian ;) > > Is there a reason why pkeys shouldn't be big endian all the time? 
Personally, I would like to see all wire values handled in network order in the stack rather than swapped. For things like pkeys, QPNs, etc where the value is really opaque for the user, there's no value in putting it in host order. - Fab From roland at topspin.com Wed Sep 15 15:28:16 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Sep 2004 15:28:16 -0700 Subject: [openib-general] [BUG] Byte swapped Pkeys In-Reply-To: <000001c49b6f$a25374b0$655aa8c0@infiniconsys.com> (Fab Tillier's message of "Wed, 15 Sep 2004 15:02:02 -0700") References: <000001c49b6f$a25374b0$655aa8c0@infiniconsys.com> Message-ID: <524qlzw4jj.fsf@topspin.com> Fab> Personally, I would like to see all wire values handled in Fab> network order in the stack rather than swapped. For things Fab> like pkeys, QPNs, etc where the value is really opaque for Fab> the user, there's no value in putting it in host order. I don't think it's a big deal one way or another. I will point out that neither P_Key nor QPN values are opaque (the MSB of a P_Key value determines full membership or not, QPN is used in CM peer compare, etc). Are you volunteering to go through the current code base and fix up the endian issues? I don't have much stomach to go through everything and change the byte order around, and I don't have much confidence that I wouldn't introduce a ton of bugs anyway. Thanks, Roland From mshefty at ichips.intel.com Wed Sep 15 15:50:35 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 15 Sep 2004 15:50:35 -0700 Subject: [openib-general] [BUG] Byte swapped Pkeys In-Reply-To: <524qlzw4jj.fsf@topspin.com> References: <000001c49b6f$a25374b0$655aa8c0@infiniconsys.com> <524qlzw4jj.fsf@topspin.com> Message-ID: <20040915155035.5f89c943.mshefty@ichips.intel.com> On Wed, 15 Sep 2004 15:28:16 -0700 Roland Dreier wrote: > Fab> Personally, I would like to see all wire values handled in > Fab> network order in the stack rather than swapped. 
For things > Fab> like pkeys, QPNs, etc where the value is really opaque for > Fab> the user, there's no value in putting it in host order. > > I don't think it's a big deal one way or another. I will point out > that neither P_Key nor QPN values are opaque (the MSB of a P_Key value > determines full membership or not, QPN is used in CM peer compare, etc). I think it would make the interface much easier to deal with. Today, clients must deal with these values in two different formats, depending on how they get their data. It's also not clear what format clients should use. Glancing at the stack, it seems that clients receive data in big endian format via MADs, convert it to host order, then have it converted back into big endian format for use with the hardware. Comparisons in little/big endian format doesn't matter, as long as the data is consistent. > Are you volunteering to go through the current code base and fix up > the endian issues? I don't have much stomach to go through everything > and change the byte order around, and I don't have much confidence > that I wouldn't introduce a ton of bugs anyway. I don't really have time for this now either, but I do think removing unnecessary byte-swapping is the right thing to do. From tduffy at sun.com Wed Sep 15 16:40:53 2004 From: tduffy at sun.com (Tom Duffy) Date: Wed, 15 Sep 2004 16:40:53 -0700 Subject: [openib-general] [BUG] Byte swapped Pkeys In-Reply-To: <528ybbw6g1.fsf@topspin.com> References: <1095281095.22224.26.camel@duffman> <528ybbw6g1.fsf@topspin.com> Message-ID: <1095291653.15622.7.camel@duffman> On Wed, 2004-09-15 at 14:47 -0700, Roland Dreier wrote: > I see one problem, I think. Can you try this? OK, it looks like the pkey is going out BE now. Cool. Thanks. I am still not able to get ipoib to work, but I am seeing the openib node on the network from the SM side. -tduffy
From tduffy at sun.com Wed Sep 15 17:11:24 2004 From: tduffy at sun.com (Tom Duffy) Date: Wed, 15 Sep 2004 17:11:24 -0700 Subject: [openib-general] [OOPS] unloading ib_mthca with ib0 up Message-ID: <1095293484.15622.13.camel@duffman> I probably shouldn't have done this, but I tried to unload ib_mthca with ib_ipoib loaded and ib0 up. Anyways, it caused an oops. Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: {:ib_ipoib:ipoib_ib_dev_stop+66} PML4 30d9b067 PGD 34b53067 PMD 0 Oops: 0000 [1] SMP CPU 0 Modules linked in: ib_ipoib ib_sa_client ib_client_query ib_poll ib_mthca ib_mad ib_core ib_services nfs lockd md5 ipv6 parport_pc lp parport autofs4 rfcomm l2c ap bluetooth ds yenta_socket pcmcia_core sunrpc tg3 floppy sg ext3 jbd dm_mod oh ci_hcd button battery asus_acpi ac xfs mptscsih mptbase sd_mod scsi_mod Pid: 22107, comm: rmmod Not tainted 2.6.9-rc2openib RIP: 0010:[] {:ib_ipoib:ipoib_ib_dev_stop+66 } RSP: 0018:0000010030e1dca8 EFLAGS: 00010256 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000015 RDX: 0000000000000000 RSI: 0000000000000216 RDI: 0000010031107000 RBP: 0000000000000000 R08: 0000000000000003 R09: 0000000000000002 R10: 00000100311070c8 R11: ffffffff8040b080 R12: 0000010031107380 R13: ffffffff804aa1c0 R14: 00000100097e5380 R15: 00000100097e5000 FS: 0000002a9557b4c0(0000) GS:ffffffff80473d80(0000) knlGS:0000000055579540 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0 Process rmmod (pid: 22107, threadinfo 0000010030e1c000, task 000001007ee2e030) Stack: 000001003fefb200 ffffffff802bc0fe 0000010031107380 0000010031107380 0000010031107000 00000100097e5390 00000100097e5748 ffffffffa02b4311 ffffffffa02be2aa 0000010031107000 Call Trace:{skb_queue_purge+46} {:ib_ipoib:_ ipoib_dev_stop+113} {dev_close+118} {unregister_netdevice +181} 
{unregister_netdev+17} {:ib_ipoib:ipo ib_dev_cleanup+88} {netdev_run_todo+298} {:ib_ipoib:ipoi b_remove_one+253} {:ib_core:ib_unregister_device+94} {:ib_mthca:mthca_remove_one+46} {pci_ device_remove+44} {device_release_driver+94} {driver_de tach+41} {bus_remove_driver+42} {driver_unregi ster+9} {pci_unregister_driver+13} {sys_delet e_module+331} {error_exit+0} {system_call+126} Code: 48 8b 3c 02 48 85 ff 0f 84 9a 00 00 00 65 48 8b 04 25 18 00 RIP {:ib_ipoib:ipoib_ib_dev_stop+66} RSP <0000010030e1dca8> CR2: 0000000000000000 Message from syslogd at sins-stinger-10 at Wed Sep 15 17:57:55 2004 ... sins-stinger-10 kernel: Oops: 0000 [1] SMP Message from syslogd at sins-stinger-10 at Wed Sep 15 17:57:55 2004 ... sins-stinger-10 kernel: CR2: 0000000000000000 From iod00d at hp.com Wed Sep 15 17:12:12 2004 From: iod00d at hp.com (Grant Grundler) Date: Wed, 15 Sep 2004 17:12:12 -0700 Subject: [openib-general] Reserved L_Key API In-Reply-To: <52u0tzy1vm.fsf@topspin.com> References: <5D78D28F88822E4D8702BB9EEF1A43670623A2@mercury.infiniconsys.com> <52u0tzy1vm.fsf@topspin.com> Message-ID: <20040916001212.GH24931@cup.hp.com> On Wed, Sep 15, 2004 at 08:42:53AM -0700, Roland Dreier wrote: > Fabian> Are the ramifications of such an RKEY any worse than those > Fabian> of locally attached DMA-capable adapters? Not if one can guarantee it's well behaved. > Fabian> If your FC HBA > Fabian> goes haywire and decides to write all over memory, there's > Fabian> not much you can do. That's not true. There are platforms that can isolate the FC and prevent the FC HBA from scribbling to anything the respective driver didn't previously get explicite write permission. ie we can guarantee the containment. 
However, for platforms that have 1:1 IO/CPU addressing (aka PCI is identity mapped with CPU physical) and no error containment, it would work as well as anything. > All I can say is that in the real world people don't like it. That's > why features like remote invalidate get added. In the real world, things occasionally do go haywire and that's why sometimes it's useful to have error containment. Deciding to add features that don't work with error containment is ok - just make sure there is workable alternative. Caveat: I'll have to re-read this thread in the morning. I'm pretty sure I'm not registering some of the acronyms right now (e.g. PD). thanks, grant From halr at voltaire.com Wed Sep 15 17:21:56 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 15 Sep 2004 20:21:56 -0400 Subject: [openib-general] [PATCH] ib_mad: Implement ib_free_recv_mad routine In-Reply-To: <20040915144541.1c437e87.mshefty@ichips.intel.com> References: <1095266304.1865.11.camel@localhost.localdomain> <20040915105700.0217f4be.mshefty@ichips.intel.com> <1095282228.1831.12.camel@localhost.localdomain> <20040915143351.43e7e56f.mshefty@ichips.intel.com> <1095284525.1837.32.camel@localhost.localdomain> <20040915144541.1c437e87.mshefty@ichips.intel.com> Message-ID: <1095294115.1841.36.camel@localhost.localdomain> On Wed, 2004-09-15 at 17:45, Sean Hefty wrote: > > OK. It is probably worth the extra memory to optimize for the single > > buffer case. > > Actually, this change should decrease memory usage -- saves a whole > pointer. Isn't that only for the first buffer ? Don't buffers 2-n have an unneeded WC structure ? Also, do you see a way to turn a receive buffer handed to the client around ? That seems to be useful for the agents in order to respond to a request without copying into a send buffer. 
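To put rough numbers on the pointer-versus-embedded question, here is a userspace sketch with simplified stand-in structures (the real ib_mad structures have more fields, and all names below are hypothetical). It compares the bytes needed for an n-segment receive under three assumptions: the completion points at a separate first buffer; the first buffer is embedded and segments 2..n are bare buffers; or the first buffer is embedded and every segment is allocated as a full completion, which is the case where buffers 2..n carry an unused WC.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins; not the real ib_mad definitions. */
struct demo_list {
	struct demo_list *next, *prev;
};

struct demo_recv_buf {
	struct demo_list list;
	void *mad;
};

/* Variant A: completion points at a separately allocated first buffer. */
struct demo_wc_ptr {
	void *wc;
	int mad_len;
	struct demo_recv_buf *recv_buf;
};

/* Variant B: the first buffer is embedded in the completion. */
struct demo_wc_embedded {
	void *wc;
	int mad_len;
	struct demo_recv_buf recv_buf;
};

/* Bytes for an n-segment receive (n >= 1) with variant A. */
static size_t bytes_ptr(size_t n)
{
	return sizeof(struct demo_wc_ptr) + n * sizeof(struct demo_recv_buf);
}

/* Variant B when segments 2..n are bare buffers (n >= 1). */
static size_t bytes_embedded_mixed(size_t n)
{
	return sizeof(struct demo_wc_embedded) +
	       (n - 1) * sizeof(struct demo_recv_buf);
}

/* Variant B when every segment is allocated as a full completion. */
static size_t bytes_embedded_uniform(size_t n)
{
	return n * sizeof(struct demo_wc_embedded);
}

/* Hypothetical self-check of the trade-off. */
void layout_selfcheck(void)
{
	/* Single-MAD case: embedding drops the pointer. */
	assert(bytes_embedded_mixed(1) < bytes_ptr(1));
	/* Multi-segment case: uniform allocations carry unused WC fields. */
	assert(bytes_embedded_uniform(4) > bytes_embedded_mixed(4));
}
```

Under these assumptions the embedded layout only saves memory for multi-segment receives if segments 2..n can be allocated without the WC part; with one allocation size for everything, the extra WC per segment outweighs the saved pointer once n is 2 or more.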
-- Hal From roland at topspin.com Wed Sep 15 18:14:45 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Sep 2004 18:14:45 -0700 Subject: [openib-general] [OOPS] unloading ib_mthca with ib0 up In-Reply-To: <1095293484.15622.13.camel@duffman> (Tom Duffy's message of "Wed, 15 Sep 2004 17:11:24 -0700") References: <1095293484.15622.13.camel@duffman> Message-ID: <52sm9jui9m.fsf@topspin.com> Tom> I probably shouldn't have done this, but I tried to unload Tom> ib_mthca with ib_ipoib loaded and ib0 up. Anyways, it caused Tom> an oops. Actually that should be fine now (and works for me). I'll try to figure out what's going on here.... - R. From roland at topspin.com Wed Sep 15 18:18:30 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Sep 2004 18:18:30 -0700 Subject: [openib-general] [BUG] Byte swapped Pkeys In-Reply-To: <20040915155035.5f89c943.mshefty@ichips.intel.com> (Sean Hefty's message of "Wed, 15 Sep 2004 15:50:35 -0700") References: <000001c49b6f$a25374b0$655aa8c0@infiniconsys.com> <524qlzw4jj.fsf@topspin.com> <20040915155035.5f89c943.mshefty@ichips.intel.com> Message-ID: <52oek7ui3d.fsf@topspin.com> Sean> I think it would make the interface much easier to deal Sean> with. Today, clients must deal with these values in two Sean> different formats, depending on how they get their data. Sean> It's also not clear what format clients should use. Sean> Glancing at the stack, it seems that clients receive data in Sean> big endian format via MADs, convert it to host order, then Sean> have it converted back into big endian format for use with Sean> the hardware. Mostly true, although for mthca things like QPN are used (in host order) as indexes into HCA context tables, so passing values like QPN in big endian format is going to add a fair amount of swapping to mthca. Basically this all goes back to Mellanox's original choice to use host byte order. 
I guess we don't care about VAPI compatibility any more but I think a lot of people are used to the byte ordering. Sean> Comparisons in little/big endian format doesn't matter, as Sean> long as the data is consistent. Mostly true although eg CM peer compare requires QPN in host order. Sean> I don't really have time for this now either, but I do think Sean> removing unnecessary byte-swapping is the right thing to do. If this is something we want to do, I think the sanest way to go about it is to use the new __be16, __be32 etc. types (in current Linus bk, will be in kernel 2.6.9) and make sure the whole tree is sparse clean. For example change all P_Key APIs to take __be16 and then go through and fix all the sparse warnings. - R. From Tom.Duffy at Sun.COM Wed Sep 15 18:38:03 2004 From: Tom.Duffy at Sun.COM (Tom Duffy) Date: Wed, 15 Sep 2004 18:38:03 -0700 Subject: [openib-general] [OOPS] unloading ib_mthca with ib0 up In-Reply-To: <52sm9jui9m.fsf@topspin.com> References: <1095293484.15622.13.camel@duffman> <52sm9jui9m.fsf@topspin.com> Message-ID: <1095298683.15622.19.camel@duffman> On Wed, 2004-09-15 at 18:14 -0700, Roland Dreier wrote: > Tom> I probably shouldn't have done this, but I tried to unload > Tom> ib_mthca with ib_ipoib loaded and ib0 up. Anyways, it caused > Tom> an oops. > > Actually that should be fine now (and works for me). I'll try to figure > out what's going on here.... There is a difference in my config that I did not mention in the previous email. Solaris IPoIB uses pkey 0x8001, so on Linux I created a pseudo ib0 device with ipoibcfg and had *that* interface up (ib0.8001). Don't know if that makes a difference. -tduffy
From roland at topspin.com Wed Sep 15 18:56:53 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Sep 2004 18:56:53 -0700 Subject: [openib-general] [OOPS] unloading ib_mthca with ib0 up References: <1095293484.15622.13.camel@duffman> <52sm9jui9m.fsf@topspin.com> <1095298683.15622.19.camel@duffman> Message-ID: <52ekl3ugbe.fsf@topspin.com> Tom> There is a difference in my config that I did not mention in Tom> the previous email. Solaris IPoIB uses pkey 0x8001, so on Tom> Linux I created a pseudo ib0 device with ipoibcfg and had Tom> *that* interface up (ib0.8001). Don't know if that makes a Tom> difference. It probably does because of how the interfaces are handled. Thanks, this should make it pretty easy to fix. - Roland From roland at topspin.com Wed Sep 15 20:36:45 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Sep 2004 20:36:45 -0700 Subject: [openib-general] [OOPS] unloading ib_mthca with ib0 up In-Reply-To: <1095293484.15622.13.camel@duffman> (Tom Duffy's message of "Wed, 15 Sep 2004 17:11:24 -0700") References: <1095293484.15622.13.camel@duffman> Message-ID: <52acvqvq9e.fsf@topspin.com> Does this fix the oops? 
Thanks, Roland Index: infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- infiniband/ulp/ipoib/ipoib_main.c (revision 835) +++ infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -509,8 +509,8 @@ /* Delete any child interfaces first */ /* Safe since it's either protected by ipoib_device_mutex or empty */ list_for_each_entry_safe(cpriv, tcpriv, &priv->child_intfs, list) { - ipoib_dev_cleanup(cpriv->dev); unregister_netdev(cpriv->dev); + ipoib_dev_cleanup(cpriv->dev); list_del(&cpriv->list); From Tom.Duffy at Sun.COM Wed Sep 15 21:22:15 2004 From: Tom.Duffy at Sun.COM (Tom Duffy) Date: Wed, 15 Sep 2004 21:22:15 -0700 Subject: [openib-general] [OOPS] unloading ib_mthca with ib0 up In-Reply-To: <52acvqvq9e.fsf@topspin.com> References: <1095293484.15622.13.camel@duffman> <52acvqvq9e.fsf@topspin.com> Message-ID: <1095308535.15622.23.camel@duffman> On Wed, 2004-09-15 at 20:36 -0700, Roland Dreier wrote: > Does this fix the oops? Yup. Rock on, man. -tduffy From gdror at mellanox.co.il Thu Sep 16 04:24:52 2004 From: gdror at mellanox.co.il (Dror Goldenberg) Date: Thu, 16 Sep 2004 14:24:52 +0300 Subject: [openib-general] Reserved L_Key API Message-ID: <506C3D7B14CDD411A52C00025558DED605F9D0B9@mtlex01.yok.mtl.com> > -----Original Message----- > From: Roland Dreier [mailto:roland at topspin.com] > Sent: Wednesday, September 15, 2004 6:41 PM > > By the way, I notice that neither the API nor the Tavor MPT > format seem to allow creating memory regions of size 2^64 > since they use a 64-bit value for the size (which means > they're limited to 2^64 - 1). I don't think this is a serious > issue though. > Think of Tavor as a software friendly design. 
Use: start address = 0x0000000000000001 length = 0xffffffffffffffff pa = 1 And then you got all memory space mapped + protection violation if you try to access a NULL pointer :) From mvonwyl at bluewin.ch Thu Sep 16 05:08:46 2004 From: mvonwyl at bluewin.ch (mvonwyl at bluewin.ch) Date: Thu, 16 Sep 2004 14:08:46 +0200 Subject: [openib-general] 2.4 compilation problems... In-Reply-To: <412EB53F000A6E28@mssbzhb-int.msg.bluewin.ch> Message-ID: <412EB53F000AFCF3@mssbzhb-int.msg.bluewin.ch> Sorry, but it seems that someone answered me and the mail was deleted by mistake... Can he just send it back, please? Sorry, my fault... From halr at voltaire.com Thu Sep 16 05:46:55 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 16 Sep 2004 08:46:55 -0400 Subject: [openib-general] [PATCH] ib_mad.c: In ib_free_recv_mad, walk list of buffers to be returned Message-ID: <1095338814.3946.3.camel@localhost.localdomain> ib_mad.c: In ib_free_recv_mad, walk list of buffers to be returned Index: ib_mad.c =================================================================== --- ib_mad.c (revision 847) +++ ib_mad.c (working copy) @@ -365,7 +365,20 @@ */ void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc) { - kfree(mad_recv_wc); /* RMPP !!! */ + struct ib_mad_recv_buf *entry; + struct ib_mad_private *buffer = (struct ib_mad_private *)mad_recv_wc; + + /* + * Walk receive buffer list associated with this WC + * No need to remove them from list of receive buffers + */ + list_for_each_entry(entry, &mad_recv_wc->recv_buf->list, list) { + /* Free previous receive buffer */ + kfree(buffer); + buffer = (void *)entry - sizeof(struct ib_mad_private_header); + } + /* Free last buffer */ + kfree(buffer); } EXPORT_SYMBOL(ib_free_recv_mad); @@ -944,8 +957,7 @@ * This is for both MAD and private header * which contains the receive tracking structure. 
* By prepending this header, there is one rather - * than multiple memory allocations. - * Ths will need revisiting for RMPP !!! + * than two memory allocations. */ mad_priv = kmalloc(sizeof *mad_priv, (in_atomic() || irqs_disabled()) ? From halr at voltaire.com Thu Sep 16 05:53:19 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 16 Sep 2004 08:53:19 -0400 Subject: [openib-general] [PATCH] ib_mad: Consolidate receive lists Message-ID: <1095339198.3919.5.camel@localhost.localdomain> ib_mad: Consolidate receive lists Index: ib_mad_priv.h =================================================================== --- ib_mad_priv.h (revision 847) +++ ib_mad_priv.h (working copy) @@ -84,9 +84,8 @@ struct ib_mad_private_header { struct ib_mad_recv_wc recv_wc; /* must be first member (for now !!!) */ - struct list_head mad_list; - struct ib_mad_buf buf; struct ib_mad_recv_buf recv_buf; + struct ib_mad_buf buf; } __attribute__ ((packed)); struct ib_mad_private { Index: ib_mad.c =================================================================== --- ib_mad.c (revision 848) +++ ib_mad.c (working copy) @@ -709,10 +709,10 @@ if (!list_empty(&port_priv->recv_posted_mad_list[convert_qpnum(qp_num)])) { recv = list_entry(&port_priv->recv_posted_mad_list[convert_qpnum(qp_num)], struct ib_mad_private, - header.mad_list); + header.recv_buf.list); /* Remove from posted receive MAD list */ - list_del(&recv->header.mad_list); + list_del(&recv->header.recv_buf.list); port_priv->recv_posted_mad_count[convert_qpnum(qp_num)]--; @@ -984,7 +984,7 @@ /* Link receive WR into posted receive MAD list */ spin_lock_irqsave(&port_priv->recv_list_lock, flags); - list_add_tail(&mad_priv->header.mad_list, + list_add_tail(&mad_priv->header.recv_buf.list, &port_priv->recv_posted_mad_list[convert_qpnum(qp->qp_num)]); port_priv->recv_posted_mad_count[convert_qpnum(qp->qp_num)]++; spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); @@ -1001,7 +1001,7 @@ /* Unlink from posted receive MAD list */ 
spin_lock_irqsave(&port_priv->recv_list_lock, flags); - list_del(&mad_priv->header.mad_list); + list_del(&mad_priv->header.recv_buf.list); port_priv->recv_posted_mad_count[convert_qpnum(qp->qp_num)]--; spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); From halr at voltaire.com Thu Sep 16 07:00:36 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 16 Sep 2004 10:00:36 -0400 Subject: [openib-general] [PATCH] ib_mad.c: Use mad cache for receive Message-ID: <1095343236.3919.8.camel@localhost.localdomain> ib_mad.c: Use mad cache for receive Index: ib_mad.c =================================================================== --- ib_mad.c (revision 849) +++ ib_mad.c (working copy) @@ -374,11 +374,11 @@ */ list_for_each_entry(entry, &mad_recv_wc->recv_buf->list, list) { /* Free previous receive buffer */ - kfree(buffer); + kmem_cache_free(ib_mad_cache, buffer); buffer = (void *)entry - sizeof(struct ib_mad_private_header); } /* Free last buffer */ - kfree(buffer); + kmem_cache_free(ib_mad_cache, buffer); } EXPORT_SYMBOL(ib_free_recv_mad); @@ -959,9 +959,9 @@ * By prepending this header, there is one rather * than two memory allocations. */ - mad_priv = kmalloc(sizeof *mad_priv, - (in_atomic() || irqs_disabled()) ? - GFP_ATOMIC : GFP_KERNEL); + mad_priv = kmem_cache_alloc(ib_mad_cache, + (in_atomic() || irqs_disabled()) ? 
+ GFP_ATOMIC : GFP_KERNEL); if (!mad_priv) { printk(KERN_ERR "No memory for receive buffer\n"); return -ENOMEM; @@ -1005,7 +1005,7 @@ port_priv->recv_posted_mad_count[convert_qpnum(qp->qp_num)]--; spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); - kfree(mad_priv); + kmem_cache_free(ib_mad_cache, mad_priv); printk(KERN_NOTICE "ib_post_recv failed\n"); return -EINVAL; } From roland at topspin.com Thu Sep 16 08:56:12 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 16 Sep 2004 08:56:12 -0700 Subject: [openib-general] Reserved L_Key API In-Reply-To: <506C3D7B14CDD411A52C00025558DED605F9D0B9@mtlex01.yok.mtl.com> (Dror Goldenberg's message of "Thu, 16 Sep 2004 14:24:52 +0300") References: <506C3D7B14CDD411A52C00025558DED605F9D0B9@mtlex01.yok.mtl.com> Message-ID: <52sm9itdgj.fsf@topspin.com> Dror> Think of Tavor as a software friendly design. Use: start Dror> address = 0x0000000000000001 length = 0xffffffffffffffff pa Dror> = 1 And then you got all memory space mapped + protection Dror> violation if you try to access a NULL pointer :) That's cute :) But I think 0x0 is more likely to be a valid DMA address than 0xffffffffffffffff (I seem to remember this coming up when pci_dma_mapping_error() was added to the kernel). - R. 
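As an illustration of the exchange above, here is a minimal userspace sketch of the range arithmetic. It is not driver code: struct mem_region and region_contains() are hypothetical names invented for this example and do not reflect the Tavor MPT layout. The only assumption is that a region is described by a 64-bit start address and a 64-bit length, as stated in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical region descriptor: 64-bit start and 64-bit length. */
struct mem_region {
	uint64_t start;
	uint64_t length;	/* at most 2^64 - 1, so 2^64 bytes cannot be expressed */
};

/*
 * Return true if the byte range [addr, addr + len) lies entirely inside
 * the region. The comparisons are ordered so that nothing overflows.
 */
static bool region_contains(const struct mem_region *mr,
			    uint64_t addr, uint64_t len)
{
	uint64_t offset;

	assert(len > 0);

	if (addr < mr->start)
		return false;

	offset = addr - mr->start;
	if (offset >= mr->length)
		return false;

	/* Bytes remaining in the region from addr onwards. */
	return len <= mr->length - offset;
}
```

With start = 0 and length = 0xffffffffffffffff, the last byte of the address space falls outside the region. With start = 1 and the same length, every address except 0 is covered, so an access through a NULL pointer is the one that gets rejected.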
From halr at voltaire.com Thu Sep 16 10:23:24 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 16 Sep 2004 13:23:24 -0400 Subject: [openib-general] [PATCH] ib_mad: Use wait queue and wait_event rather than signals and semaphores Message-ID: <1095355403.3946.19.camel@localhost.localdomain> ib_mad: Use wait queue and wait_event rather than signals and semaphores Index: ib_mad_priv.h =================================================================== --- ib_mad_priv.h (revision 849) +++ ib_mad_priv.h (working copy) @@ -120,9 +120,8 @@ struct ib_mad_mgmt_method_table *method_table[MAX_MGMT_CLASS]; }; -struct ib_mad_thread_data { - struct semaphore sem; - int run; +struct ib_mad_thread_private { + wait_queue_head_t wait; }; struct ib_mad_port_private { @@ -147,7 +146,7 @@ struct list_head recv_posted_mad_list[IB_MAD_QPS_SUPPORTED]; int recv_posted_mad_count[IB_MAD_QPS_SUPPORTED]; - struct ib_mad_thread_data thread_data; + struct ib_mad_thread_private mad_thread_private; }; #endif /* __IB_MAD_PRIV_H__ */ Index: TODO =================================================================== --- TODO (revision 847) +++ TODO (working copy) @@ -1,9 +1,8 @@ -9/15/04 +9/16/04 OpenIB MAD Layer Short Term -Use wait queue and wait_event rather than signals and semaphores Send timeout support Treat send overruns as timeouts for now (once timeout support implemented) ? 
Index: ib_mad.c =================================================================== --- ib_mad.c (revision 852) +++ ib_mad.c (working copy) @@ -879,21 +879,21 @@ static int ib_mad_thread(void *param) { struct ib_mad_port_private *port_priv = param; - struct ib_mad_thread_data *thread_data = &port_priv->thread_data; + struct ib_mad_thread_private *mad_thread_priv = &port_priv->mad_thread_private; + int ret; while (1) { - if (down_interruptible(&thread_data->sem)) { - printk(KERN_DEBUG "Exiting ib_mad thread\n"); - break; + while (!signal_pending(current)) { + ret = wait_event_interruptible(mad_thread_priv->wait, 0); + if (ret) { + printk(KERN_ERR "ib_mad thread exiting\n"); + return 0; + } + + ib_mad_completion_handler(port_priv); } - if (!thread_data->run) - break; - - ib_mad_completion_handler(port_priv); } - - return 0; } /* @@ -901,10 +901,10 @@ */ static int ib_mad_thread_init(struct ib_mad_port_private *port_priv) { - struct ib_mad_thread_data *thread_data = &port_priv->thread_data; + struct ib_mad_thread_private *mad_thread_priv = &port_priv->mad_thread_private; - sema_init(&thread_data->sem, 0); - thread_data->run = 1; + init_waitqueue_head(&mad_thread_priv->wait); + port_priv->mad_thread = kthread_create(ib_mad_thread, port_priv, "ib_mad-%-6s-%-2d", @@ -918,30 +918,19 @@ } /* - * Wake up the IB MAD thread - */ -static void ib_mad_thread_signal(struct ib_mad_port_private *port_priv) -{ - struct ib_mad_thread_data *thread_data = &port_priv->thread_data; - - up(&thread_data->sem); -} - -/* * Stop the IB MAD thread */ static void ib_mad_thread_stop(struct ib_mad_port_private *port_priv) { - struct ib_mad_thread_data *thread_data = &port_priv->thread_data; - - thread_data->run = 0; - ib_mad_thread_signal(port_priv); - schedule(); + kthread_stop(port_priv->mad_thread); } static void ib_mad_thread_completion_handler(struct ib_cq *cq) { - ib_mad_thread_signal(cq->cq_context); + struct ib_mad_port_private *port_priv = cq->cq_context; + struct ib_mad_thread_private 
*mad_thread_priv = &port_priv->mad_thread_private; + + wake_up_interruptible(&mad_thread_priv->wait); } static int ib_mad_post_receive_mad(struct ib_mad_port_private *port_priv, From Tom.Duffy at Sun.COM Thu Sep 16 11:02:12 2004 From: Tom.Duffy at Sun.COM (Tom Duffy) Date: Thu, 16 Sep 2004 11:02:12 -0700 Subject: [openib-general] [PATCH] fix ipoib proc entries WAS: [OOPS] unloading ib_mthca with ib0 up In-Reply-To: <52acvqvq9e.fsf@topspin.com> References: <1095293484.15622.13.camel@duffman> <52acvqvq9e.fsf@topspin.com> Message-ID: <1095357732.17548.13.camel@duffman> On Wed, 2004-09-15 at 20:36 -0700, Roland Dreier wrote: > Does this fix the oops? Although, I do get this when I remove ib_services. Badness in remove_proc_entry at /build1/tduffy/openib-work/linux-2.6.9-rc2-openib/fs/proc/generic.c:688 Call Trace:{remove_proc_entry+207} {sys_delete_module+331} {sys_munmap+90} {system_call+126} This stems from ipoib making bad entries: # ls /proc/infiniband/ ipoib_arp_ib%d ipoib_mcast_ib%d ipoib_vlan poll_counts ipoib_arp_ib%d ipoib_mcast_ib%d mad tracelevel This is because the proc entries are created *before* register_netdev() is called which sets the correct name. So, this patch moves the proc entry creation after register_netdev(). 
-tduffy Index: drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- drivers/infiniband/ulp/ipoib/ipoib_main.c (revision 853) +++ drivers/infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -459,14 +459,8 @@ if (ipoib_ib_dev_init(dev, ca, port)) goto out_tx_ring_cleanup; - if (ipoib_proc_dev_init(dev)) - goto out_ib_cleanup; - return 0; -out_ib_cleanup: - ipoib_ib_dev_cleanup(dev); - out_tx_ring_cleanup: kfree(priv->tx_ring); @@ -620,12 +614,18 @@ goto register_failed; } + if (ipoib_proc_dev_init(priv->dev)) + goto proc_failed; + down(&ipoib_device_mutex); list_add_tail(&priv->list, &ipoib_device_list); up(&ipoib_device_mutex); return 0; +proc_failed: + unregister_netdev(priv->dev); + register_failed: ipoib_port_monitor_dev_stop(priv->dev); From sean.hefty at intel.com Thu Sep 16 11:26:55 2004 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 16 Sep 2004 11:26:55 -0700 Subject: [openib-general] [PATCH] ib_mad: Implement ib_free_recv_mad routine In-Reply-To: <1095294115.1841.36.camel@localhost.localdomain> Message-ID: >> Actually, this change should decrease memory usage -- saves a whole >> pointer. > >Isn't that only for the first buffer ? Don't buffers 2-n have an >unneeded WC structure ? What I'm suggesting is that instead of the ib_mad_recv_wc pointing to the first ib_mad_recv_buf structure, it contains it. Buffers 2-n would still be referenced by ib_mad_recv_buf structures chained off of the recv_buf in the ib_mad_recv_wc. The user would only receive one ib_mad_recv_wc. >Also, do you see a way to turn a receive buffer handed to the client >around ? That seems to be useful for the agents in order to respond to a >request without copying into a send buffer.
As long as the user calls ib_free_recv_mad after the send operation completes, they should be able to reuse the data buffer from the receive. We could also give some thought to some sort of ib_reply_to_recv_mad() routine that could take an ib_mad_recv_wc as input (with the buffer modified) and generate a send to the source of the receive. The call could allocate an address handle, format a send work request, post the send, and deallocate the address handle once the send completed. Something like this could easily enough be added on later though. From roland at topspin.com Thu Sep 16 12:14:33 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 16 Sep 2004 12:14:33 -0700 Subject: [openib-general] [PATCH] fix ipoib proc entries WAS: [OOPS] unloading ib_mthca with ib0 up In-Reply-To: <1095357732.17548.13.camel@duffman> (Tom Duffy's message of "Thu, 16 Sep 2004 11:02:12 -0700") References: <1095293484.15622.13.camel@duffman> <52acvqvq9e.fsf@topspin.com> <1095357732.17548.13.camel@duffman> Message-ID: <52zn3qvxeu.fsf@topspin.com> Thanks, I committed this (plus some more fixes to dev->name usage before register_netdev). - R. From halr at voltaire.com Thu Sep 16 12:29:45 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 16 Sep 2004 15:29:45 -0400 Subject: [openib-general] [RFC] ib_mad Message-ID: <1095362984.3919.36.camel@localhost.localdomain> Now that I think ib_mad is far enough along, I would like to request any code comments. The current primary feature limitations are as follows: No deferring of post sends which overrun No timeout support No RMPP support No redirection support Thanks. -- Hal From sean.hefty at intel.com Thu Sep 16 14:38:32 2004 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 16 Sep 2004 14:38:32 -0700 Subject: [openib-general] [RFC] ib_mad In-Reply-To: <1095362984.3919.36.camel@localhost.localdomain> Message-ID: >Now that I think ib_mad is far enough along, I would like to request any >code comments. 
> >The current primary feature limitations are as follows: > No deferring of post sends which overrun > No timeout support > No RMPP support > No redirection support As an FYI, I plan on reviewing the code in a fair amount of detail within the next few days. (I'm out of the office until next Wednesday, so I may be slow to respond during the day.) - Sean From mshefty at ichips.intel.com Thu Sep 16 15:15:14 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Sep 2004 15:15:14 -0700 Subject: [openib-general] [RFC] ib_mad In-Reply-To: <1095362984.3919.36.camel@localhost.localdomain> References: <1095362984.3919.36.camel@localhost.localdomain> Message-ID: <20040916151514.17c5ea50.mshefty@ichips.intel.com> On Thu, 16 Sep 2004 15:29:45 -0400 Hal Rosenstock wrote: > Now that I think ib_mad is far enough along, I would like to request any > code comments. It would be terribly convenient for the SMI to know how many ports are on a device without having to query for it. I have some other options to get around this issue, but I'm wondering if phys_port_cnt for a device can't either be stored by the MAD layer somewhere, or if it would make sense to move it into ib_device. (Random thought... is there a reason why ib_device_attr or a pointer can't be stored in ib_device? Is it already stored that way internally?) Along similar lines, renaming ib_mad_port_private::port to port_num would let me think less, and my head already hurts from studying the SMI portion of the spec. (I'll submit a patch for this with my latest SMI changes.) 
- Sean From tduffy at sun.com Thu Sep 16 16:12:13 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 16 Sep 2004 16:12:13 -0700 Subject: [openib-general] Re: [openib-commits] r857 - gen2/branches/roland-merge/src/linux-kernel/infiniband/ulp/ipoib In-Reply-To: <20040916211633.55D0C2283D5@openib.ca.sandia.gov> References: <20040916211633.55D0C2283D5@openib.ca.sandia.gov> Message-ID: <1095376333.17548.23.camel@duffman> On Thu, 2004-09-16 at 14:16 -0700, roland at openib.org wrote: > Remove use of ts_kernel_trace.h Wow, that made a 20% size improvement! Before: [root at sins-stinger-10 x86_64]# lsmod Module Size Used by ib_ipoib 60440 0 After: [root at sins-stinger-10 x86_64]# lsmod | head Module Size Used by ib_ipoib 49688 0 -tduffy From halr at voltaire.com Thu Sep 16 16:14:48 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 16 Sep 2004 19:14:48 -0400 Subject: [openib-general] [RFC] ib_mad In-Reply-To: <20040916151514.17c5ea50.mshefty@ichips.intel.com> References: <1095362984.3919.36.camel@localhost.localdomain> <20040916151514.17c5ea50.mshefty@ichips.intel.com> Message-ID: <1095376488.3946.107.camel@localhost.localdomain> On Thu, 2004-09-16 at 18:15, Sean Hefty wrote: > It would be terribly convenient for the SMI to know how many ports > are on a device without having to query for it. I have some other > options to get around this issue, but I'm wondering if > phys_port_cnt for a device can't either be stored by the > MAD layer somewhere, or if it would make sense to move it into > ib_device. It could be saved in the ib_mad_port_private struct but this seems wrong to me since this is a per port structure (and the MAD layer only uses the IB device at start and stop time). > (Random thought...
is there a reason why > ib_device_attr or a pointer can't be stored in ib_device? > Is it already stored that way internally?) IMO it better belongs in the ib_device if it were to go anywhere but if stored we would need a way to indicate changes if ports were hot swappable. > Along similar lines, renaming ib_mad_port_private::port to > port_num would let me think less, and my head already hurts > from studying the SMI portion of the spec. (I'll submit a > patch for this with my latest SMI changes.) Using port rather than port_num may have stemmed from ib_register_mad_agent(). -- Hal From roland at topspin.com Thu Sep 16 16:49:20 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 16 Sep 2004 16:49:20 -0700 Subject: [openib-general] Re: [openib-commits] r857 - gen2/branches/roland-merge/src/linux-kernel/infiniband/ulp/ipoib In-Reply-To: <1095376333.17548.23.camel@duffman> (Tom Duffy's message of "Thu, 16 Sep 2004 16:12:13 -0700") References: <20040916211633.55D0C2283D5@openib.ca.sandia.gov> <1095376333.17548.23.camel@duffman> Message-ID: <52r7p1wz9b.fsf@topspin.com> Tom> Wow, that made a 20% size improvement! That's a surprise... is the after with CONFIG_INFINIBAND_IPOIB_DEBUG=y? - R. From roland at topspin.com Thu Sep 16 17:46:59 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 16 Sep 2004 17:46:59 -0700 Subject: [openib-general] [RFC] ib_mad In-Reply-To: <20040916151514.17c5ea50.mshefty@ichips.intel.com> (Sean Hefty's message of "Thu, 16 Sep 2004 15:15:14 -0700") References: <1095362984.3919.36.camel@localhost.localdomain> <20040916151514.17c5ea50.mshefty@ichips.intel.com> Message-ID: <52mzzpwwl8.fsf@topspin.com> Sean> It would be terribly convenient for the SMI to know how many Sean> ports are on a device without having to query for it. 
I Sean> have some other options to get around this issue, but I'm Sean> wondering if phys_port_cnt for a device can't either be Sean> stored by the MAD layer somewhere, or if it would make sense Sean> to move it into ib_device. (Random thought... is there a Sean> reason why ib_device_attr or a pointer can't be stored in Sean> ib_device? Is it already stored that way internally?) It makes sense to me to add the device attr to ib_device. The only fields that I see that might change are system_image_guid and phys_port_cnt (if hot pluggable ports ever come about). We could fix this by requiring the low-level driver to keep the system_image_guid field up to date (since it can only be changed through that driver), and defining phys_port_cnt to be the maximum possible number of ports (pluggable ports that aren't present would just always be down). In any case pretty much everyone's code would be broken by phys_port_cnt changing under them. If we did this we could just kill off the query_device method entirely, which would be nice. - Roland From roland at topspin.com Thu Sep 16 17:51:27 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 16 Sep 2004 17:51:27 -0700 Subject: [openib-general] [RFC] ib_mad In-Reply-To: <52mzzpwwl8.fsf@topspin.com> (Roland Dreier's message of "Thu, 16 Sep 2004 17:46:59 -0700") References: <1095362984.3919.36.camel@localhost.localdomain> <20040916151514.17c5ea50.mshefty@ichips.intel.com> <52mzzpwwl8.fsf@topspin.com> Message-ID: <52isadwwds.fsf@topspin.com> By the way, I'm not sure I totally understand the IB spec on pluggable ports. Is the idea that the number of ports can change or just that the GUID of a port might change? - R. 
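A rough sketch of the idea discussed in this thread: keep the attributes in the device structure itself, so that a consumer such as the SMI can read phys_port_cnt directly instead of calling a query method. All names here (sketch_device, sketch_device_attr, sketch_port_num_valid, sketch_set_system_image_guid) are hypothetical and are not the ib_verbs.h definitions. The 1-based port numbering is also an assumption of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_DEVICE_NAME_MAX 64

/* Hypothetical subset of the device attributes mentioned in the thread. */
struct sketch_device_attr {
	uint64_t system_image_guid;	/* kept current by the low-level driver */
	uint8_t  phys_port_cnt;		/* maximum possible number of ports */
};

/* Hypothetical device structure with the attributes stored inline. */
struct sketch_device {
	char name[SKETCH_DEVICE_NAME_MAX];
	struct sketch_device_attr attr;
};

/*
 * Validate a port number against the cached count; no query call needed.
 * Assumes ports are numbered 1..phys_port_cnt.
 */
static int sketch_port_num_valid(const struct sketch_device *dev,
				 unsigned int port_num)
{
	assert(dev != NULL);
	return port_num >= 1 && port_num <= dev->attr.phys_port_cnt;
}

/* The low-level driver is the only writer of the cached GUID. */
static void sketch_set_system_image_guid(struct sketch_device *dev,
					 uint64_t guid)
{
	assert(dev != NULL);
	dev->attr.system_image_guid = guid;
}
```

Defining phys_port_cnt as the maximum possible number of ports, as proposed above, means the bound used by sketch_port_num_valid() never changes after the device is registered; a pluggable port that is absent would simply report as down.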
From roland at topspin.com Thu Sep 16 18:48:22 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 16 Sep 2004 18:48:22 -0700 Subject: [openib-general] Re: [openib-commits] r857 - gen2/branches/roland-merge/src/linux-kernel/infiniband/ulp/ipoib In-Reply-To: <52r7p1wz9b.fsf@topspin.com> (Roland Dreier's message of "Thu, 16 Sep 2004 16:49:20 -0700") References: <20040916211633.55D0C2283D5@openib.ca.sandia.gov> <1095376333.17548.23.camel@duffman> <52r7p1wz9b.fsf@topspin.com> Message-ID: <52acvpwtqx.fsf@topspin.com> Roland> That's a surprise... is the after with Roland> CONFIG_INFINIBAND_IPOIB_DEBUG=y? err... it helps if I do my commit from a high enough directory that the new Kconfig gets committed. - R. From mshefty at ichips.intel.com Thu Sep 16 19:14:38 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Sep 2004 19:14:38 -0700 Subject: [openib-general] [PATCH] SMI update Message-ID: <20040916191438.47391b89.mshefty@ichips.intel.com> After spending a couple of days floundering in SMI-related code, I finally gave in and studied that part of the spec, comparing it against the existing implementations. I tried to separate the SMI requirements into send and receive handling of SMPs. There's some pseudo-code near the end of ib_smi.c (to be converted into real code) that describes how the SMI checks will eventually work as the code is merged with the ib_mad.c routines. I'd appreciate comments from anyone, but particularly someone who's worked on SMI code before, to make sure that I'm not completely off here. (Btw, you can pretty much ignore the diffs for ib_smi.c. The previous code matched to the wrong function.)
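For readers who want to follow the directed-route checks outside the kernel, here is a small userspace sketch that mirrors the outgoing-direction branch (direction bit clear) of smi_handle_dr_smp_send() from the patch in this message. It is an illustration, not the patch itself: struct sketch_smp, the SKETCH_* constants and sketch_dr_send_outgoing() are hypothetical names, and the permissive LID value of 0xffff and the 64-entry path array are assumptions of this example. The C14-9 labels are the ones the patch cites.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_LID_PERMISSIVE	0xffff	/* assumed permissive LID value */
#define SKETCH_MAX_PATH_HOPS	64	/* assumed size of the path arrays */

enum sketch_node_type {
	SKETCH_NODE_CA,
	SKETCH_NODE_SWITCH,
	SKETCH_NODE_ROUTER
};

/* Hypothetical, reduced view of a directed-route SMP. */
struct sketch_smp {
	uint8_t  hop_ptr;
	uint8_t  hop_cnt;
	uint16_t dr_dlid;
	uint8_t  initial_path[SKETCH_MAX_PATH_HOPS];
};

/*
 * Send-side fixup for an outgoing directed-route SMP.
 * Returns 1 if the SMP may be sent, 0 if it should be discarded.
 */
static int sketch_dr_send_outgoing(struct sketch_smp *smp,
				   enum sketch_node_type node_type,
				   uint8_t port_num)
{
	uint8_t hop_ptr = smp->hop_ptr;
	uint8_t hop_cnt = smp->hop_cnt;

	assert(hop_cnt < SKETCH_MAX_PATH_HOPS);

	/* C14-9:1 -- first hop: advance and check the exit port */
	if (hop_cnt && hop_ptr == 0) {
		smp->hop_ptr++;
		return smp->initial_path[smp->hop_ptr] == port_num;
	}

	/* C14-9:2 -- intermediate hop: only a switch forwards */
	if (hop_ptr && hop_ptr < hop_cnt) {
		if (node_type != SKETCH_NODE_SWITCH)
			return 0;
		smp->hop_ptr++;
		return smp->initial_path[smp->hop_ptr] == port_num;
	}

	/* C14-9:3 -- end of the directed-route segment of the path */
	if (hop_ptr == hop_cnt) {
		smp->hop_ptr++;
		return node_type != SKETCH_NODE_CA ||
		       smp->dr_dlid == SKETCH_LID_PERMISSIVE;
	}

	/* C14-9:4 -- hop_ptr == hop_cnt + 1 goes to the SMA/SM */
	/* C14-9:5 -- anything larger is an unreasonable hop pointer */
	return hop_ptr == hop_cnt + 1;
}
```

Note that the function both validates and modifies the SMP: hop_ptr is advanced before the exit port is compared, which matches the patch's comment that the sender increments hop_ptr and the receiver records the return path.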
- Sean -- Index: access/ib_mad_priv.h =================================================================== --- access/ib_mad_priv.h (revision 859) +++ access/ib_mad_priv.h (working copy) @@ -128,7 +128,7 @@ struct list_head port_list; struct task_struct *mad_thread; struct ib_device *device; - int port; + int port_num; struct ib_qp *qp[IB_MAD_QPS_SUPPORTED]; struct ib_cq *cq; struct ib_pd *pd; Index: access/ib_smi.c =================================================================== --- access/ib_smi.c (revision 859) +++ access/ib_smi.c (working copy) @@ -26,91 +26,293 @@ #include #include "ib_mad_priv.h" -int smi_process_dr_smp(struct ib_mad_port_private *port_priv, - struct ib_smp *smp) +/* + * Fixup a directed route SMP for sending. Return 0 if the SMP should be + * discarded. + */ +static int smi_handle_dr_smp_send(struct ib_mad_port_private *port_priv, + struct ib_smp *smp) { u8 hop_ptr, hop_cnt; hop_ptr = smp->hop_ptr; hop_cnt = smp->hop_cnt; - /* - * Outgoing MAD processing. "Outgoing" means from initiator to responder. 
- * Section 14.2.2.2, Vol 1 IB spec - */ + /* See section 14.2.2.2, Vol 1 IB spec */ if (!ib_get_smp_direction(smp)) { /* C14-9:1 */ - if (hop_ptr == 0 && hop_cnt) - return 0; + if (hop_cnt && hop_ptr == 0) { + smp->hop_ptr++; + return (smp->initial_path[smp->hop_ptr] == + port_priv->port_num); + } /* C14-9:2 */ if (hop_ptr && hop_ptr < hop_cnt) { - if (port_priv->device->node_type == IB_NODE_SWITCH) { - printk(KERN_NOTICE - "Need to handle DR Mad on switch\n"); - } - return 0; + if (port_priv->device->node_type != IB_NODE_SWITCH) + return 0; + + /* smp->return_path set when received */ + smp->hop_ptr++; + return (smp->initial_path[smp->hop_ptr] == + port_priv->port_num); } /* C14-9:3 -- We're at the end of the DR segment of path */ if (hop_ptr == hop_cnt) { - if (hop_cnt) - smp->return_path[hop_ptr] = port_priv->port; + /* smp->return_path set when received */ smp->hop_ptr++; + return (port_priv->device->node_type != IB_NODE_CA || + smp->dr_dlid == IB_LID_PERMISSIVE); + } - if (port_priv->device->node_type == IB_NODE_SWITCH) { - printk(KERN_NOTICE - "Need to handle DR Mad on switch\n"); - return 0; - } else if (smp->dr_dlid != IB_LID_PERMISSIVE) { + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM. */ + /* C14-9:5 -- Fail unreasonable hop pointer. */ + return (hop_ptr == hop_cnt + 1); + + } else { + /* C14-13:1 */ + if (hop_cnt && hop_ptr == hop_cnt + 1) { + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_priv->port_num); + } + + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) { + if (port_priv->device->node_type != IB_NODE_SWITCH) return 0; - } - return 1; + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_priv->port_num); } - /* C14-9:4 -- Hop Pointer = Hop Count + 1 -> give to SMA/SM. */ - /* C14-9:5 -- Check for unreasonable hop pointer. 
*/ - if (hop_ptr > hop_cnt + 1) + /* C14-13:3 -- at the end of the DR segment of path */ + if (hop_ptr == 1) { + smp->hop_ptr--; + /* C14-13:3 -- SMPs destined for SM shouldn't be here */ + return (port_priv->device->node_type == IB_NODE_SWITCH && + smp->dr_slid != IB_LID_PERMISSIVE); + } + + /* C14-13:4 -- hop_ptr = 0 -> should have gone to SM. */ + /* C14-13:5 -- Check for unreasonable hop pointer. */ + return 0; + } +} + +/* + * Sender side handling of outgoing SMPs. Fixup the SMP as required by + * the spec. Return 0 if the SMP should be dropped. + */ +static int smi_handle_smp_send(struct ib_mad_port_private *port_priv, + struct ib_smp *smp) +{ + switch (smp->mgmt_class) + { + case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: + return smi_handle_dr_smp_send(port_priv, smp); + default: + return 0; /* write me... */ + } +} + +/* + * Return 1 if the SMP should be handled by the local SMA via process_mad. + */ +static inline int smi_check_local_smp(struct ib_mad_port_private *port_priv, + struct ib_smp *smp) +{ + /* C14-9:3 -- We're at the end of the DR segment of path */ + /* C14-9:4 -- Hop Pointer = Hop Count + 1 -> give to SMA/SM. */ + return (port_priv->device->process_mad && + !ib_get_smp_direction(smp) && + (smp->hop_ptr == smp->hop_cnt + 1)); +} + +/* + * Adjust information for a received SMP. Return 0 if the SMP should be + * dropped. 
+ */ +static int smi_handle_dr_smp_recv(struct ib_mad_port_private *port_priv, + struct ib_smp *smp) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + /* See section 14.2.2.2, Vol 1 IB spec */ + if (!ib_get_smp_direction(smp)) { + /* C14-9:1 -- sender should have incremented hop_ptr */ + if (hop_cnt && hop_ptr == 0) return 0; - } else { /* Returning MAD (From responder to initiator) */ + /* C14-9:2 -- intermediate hop */ + if (hop_ptr && hop_ptr < hop_cnt) { + if (port_priv->device->node_type != IB_NODE_SWITCH) + return 0; - /* C14-13:1 */ + smp->return_path[hop_ptr] = port_priv->port_num; + /* smp->hop_ptr updated when sending */ + return 1; /*(smp->initial_path[hop_ptr+1] <= + port_priv->device->phys_port_cnt); */ + } + + /* C14-9:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) { + if (hop_cnt) + smp->return_path[hop_ptr] = port_priv->port_num; + /* smp->hop_ptr updated when sending */ + + return (port_priv->device->node_type != IB_NODE_CA || + smp->dr_dlid == IB_LID_PERMISSIVE); + } + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM. */ + /* C14-9:5 -- fail unreasonable hop pointer. 
*/ + return (hop_ptr == hop_cnt + 1); + + } else { + + /* C14-13:1 -- sender should have decremented hop_ptr */ if (hop_cnt && hop_ptr == hop_cnt + 1) return 0; /* C14-13:2 */ if (2 <= hop_ptr && hop_ptr <= hop_cnt) { - if (port_priv->device->node_type == IB_NODE_SWITCH) { - printk(KERN_NOTICE - "Need to handle DR Mad on switch\n"); - } - return 0; + if (port_priv->device->node_type != IB_NODE_SWITCH) + return 0; + + /* smp->hop_ptr updated when sending */ + return 1; /*(smp->return_path[hop_ptr-1] <= + port_priv->device->phys_port_cnt); */ } /* C14-13:3 -- We're at the end of the DR segment of path */ if (hop_ptr == 1) { - smp->hop_ptr--; - - if (port_priv->device->node_type == IB_NODE_SWITCH) { - printk(KERN_NOTICE - "Need to handle DR Mad on switch\n"); - return 0; - } else if (smp->dr_dlid != IB_LID_PERMISSIVE) { - return 0; + if (smp->dr_slid == IB_LID_PERMISSIVE) { + /* giving SMP to SM - update hop_ptr */ + smp->hop_ptr--; + return 1; } + /* smp->hop_ptr updated when sending */ + return (port_priv->device->node_type != IB_NODE_CA); + } + + /* C14-13:4 -- hop_ptr = 0 -> give to SM. */ + /* C14-13:5 -- Check for unreasonable hop pointer. */ + return (hop_ptr == 0); + } +} +/* + * Receive side handling SMPs. Save receive information as required by + * the spec. Return 0 if the SMP should be dropped. + */ +static int smi_handle_smp_recv(struct ib_mad_port_private *port_priv, + struct ib_smp *smp) +{ + switch (smp->mgmt_class) + { + case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: + return smi_handle_dr_smp_recv(port_priv, smp); + default: + return 0; /* write me... */ + } +} + +/* + * Return 1 if the received DR SMP should be forwarded to the send queue. + * Return 0 if the SMP should be completed up the stack. 
+ */ +static int smi_check_forward_dr_smp(struct ib_mad_port_private *port_priv, + struct ib_smp *smp) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + if (!ib_get_smp_direction(smp)) { + /* C14-9:2 -- intermediate hop */ + if (hop_ptr && hop_ptr < hop_cnt) return 1; - } - /* C14-13:4 -- Hop Pointer = 0 -> give to SM. */ - if (hop_ptr == 0) + /* C14-9:3 -- at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) + return (smp->dr_dlid == IB_LID_PERMISSIVE); + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM. */ + if (hop_ptr == hop_cnt + 1) + return 1; + } else { + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) return 1; - /* C14-13:5 -- Check for unreasonable hop pointer. */ - if (hop_ptr > hop_cnt + 1) - return 0; + /* C14-13:3 -- at the end of the DR segment of path */ + if (hop_ptr == 1) + return (smp->dr_slid != IB_LID_PERMISSIVE); + } + return 0; +} + +/* + * Return 1 if the received SMP should be forwarded to the send queue. + * Return 0 if the SMP should be completed up the stack. + */ +static int smi_check_forward_smp(struct ib_mad_port_private *port_priv, + struct ib_smp *smp) +{ + switch (smp->mgmt_class) + { + case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: + return smi_check_forward_dr_smp(port_priv, smp); + default: + return 0; /* write me... */ + } +} + +/* +static int smi_process_local(struct ib_mad_port_private *port_priv, + struct ib_smp *smp) +{ + port_priv->device->process_mad( ... 
); +} + +int smi_send_smp(struct ib_mad_port_private *port_priv, + struct ib_smp *smp) +{ + if (!smi_handle_smp_send(port_priv, smp)) { + smi_fail_send() + return 0; + } + + if (smi_check_local_smp(port_priv, smp)) { + smi_process_local(port_priv, smp); + return 0; + } + + * Post the send on the QP * + return 1; +} + +int smi_recv_smp(struct ib_mad_port_private *port_priv, + struct ib_smp *smp) +{ + if (!smi_handle_smp_recv(port_priv, smp)) { + smi_fail_recv(); + return 0; + } + + if (smi_check_forward_smp(port_priv, smp)) { + smi_send_smp(port_priv, smp); + return 0; } + + * Complete receive up stack * return 1; } +*/ Index: access/ib_mad.c =================================================================== --- access/ib_mad.c (revision 859) +++ access/ib_mad.c (working copy) @@ -91,7 +91,7 @@ * ib_register_mad_agent - Register to send/receive MADs */ struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device, - u8 port, + u8 port_num, enum ib_qp_type qp_type, struct ib_mad_reg_req *mad_reg_req, u8 rmpp_version, @@ -150,7 +150,7 @@ /* Validate device and port */ spin_lock_irqsave(&ib_mad_port_list_lock, flags); list_for_each_entry(entry, &ib_mad_port_list, port_list) { - if (entry->device == device && entry->port == port) { + if (entry->device == device && entry->port_num == port_num) { port_priv = entry; break; } @@ -372,7 +372,7 @@ * Walk receive buffer list associated with this WC * No need to remove them from list of receive buffers */ - list_for_each_entry(entry, &mad_recv_wc->recv_buf->list, list) { + list_for_each_entry(entry, &mad_recv_wc->recv_buf.list, list) { /* Free previous receive buffer */ kmem_cache_free(ib_mad_cache, buffer); buffer = (void *)entry - sizeof(struct ib_mad_private_header); @@ -909,7 +909,7 @@ port_priv, "ib_mad-%-6s-%-2d", port_priv->device->name, - port_priv->port); + port_priv->port_num); if (IS_ERR(port_priv->mad_thread)) { printk(KERN_ERR "couldn't start mad thread\n"); return 1; @@ -1068,7 +1068,7 @@ /* * Modify QP 
into Init state */ -static inline int ib_mad_change_qp_state_to_init(struct ib_qp *qp, int port) +static inline int ib_mad_change_qp_state_to_init(struct ib_qp *qp, int port_num) { int ret; struct ib_qp_attr *attr = NULL; @@ -1087,7 +1087,7 @@ * one is needed for the Reset to Init transition. */ attr->pkey_index = 0; - attr->port_num = port; + attr->port_num = port_num; /* QKey is 0 for QP0 */ if (qp->qp_num == 0) attr->qkey = 0; @@ -1190,7 +1190,7 @@ for (i = 0; i < IB_MAD_QPS_CORE; i++) { ret = ib_mad_change_qp_state_to_init(port_priv->qp[i], - port_priv->port); + port_priv->port_num); if (ret) { printk(KERN_ERR "Could not change QP%d state to INIT\n", i); return ret; @@ -1259,7 +1259,7 @@ ret = ib_mad_port_start(port_priv); if (ret) { printk(KERN_ERR "Could not restart port%s/%d\n", - port_priv->device->name, port_priv->port); + port_priv->device->name, port_priv->port_num); } return ret; @@ -1269,7 +1269,7 @@ * Open the port * Create the QP, PD, MR, and CQ if needed */ -static int ib_mad_port_open(struct ib_device *device, int port) +static int ib_mad_port_open(struct ib_device *device, int port_num) { int ret, cq_size, i; u64 iova = 0; @@ -1285,7 +1285,7 @@ /* First, check if port already open at MAD layer */ spin_lock_irqsave(&ib_mad_port_list_lock, flags); list_for_each_entry(entry, &ib_mad_port_list, port_list) { - if (entry->device == device && entry->port == port) { + if (entry->device == device && entry->port_num == port_num) { port_priv = entry; break; } @@ -1306,7 +1306,7 @@ memset(port_priv, 0, sizeof *port_priv); device->mad = port_priv; port_priv->device = device; - port_priv->port = port; + port_priv->port_num = port_num; spin_lock_init(&port_priv->reg_lock); for (i = 0; i < MAX_MGMT_VERSION; i++) { port_priv->version[i] = NULL; @@ -1351,7 +1351,7 @@ qp_init_attr.qp_type = IB_QPT_SMI; else qp_init_attr.qp_type = IB_QPT_GSI; - qp_init_attr.port_num = port_priv->port; + qp_init_attr.port_num = port_priv->port_num; port_priv->qp[i] = 
ib_create_qp(port_priv->pd, &qp_init_attr, &qp_cap); if (IS_ERR(port_priv->qp[i])) { @@ -1414,14 +1414,14 @@ * If there are no classes using the port, free the port * resources (CQ, MR, PD, QP) and remove the port's info structure */ -static int ib_mad_port_close(struct ib_device *device, int port) +static int ib_mad_port_close(struct ib_device *device, int port_num) { struct ib_mad_port_private *entry, *port_priv = NULL; unsigned long flags; spin_lock_irqsave(&ib_mad_port_list_lock, flags); list_for_each_entry(entry, &ib_mad_port_list, port_list) { - if (entry->device == device && entry->port == port) { + if (entry->device == device && entry->port_num == port_num) { port_priv = entry; break; } Index: include/ib_mad.h =================================================================== --- include/ib_mad.h (revision 859) +++ include/ib_mad.h (working copy) @@ -207,7 +207,7 @@ /** * ib_register_mad_agent - Register to send/receive MADs. * @device - The device to register with. - * @port - The port on the specified device to use. + * @port_num - The port on the specified device to use. * @qp_type - Specifies which QP to access. Must be either * IB_QPT_SMI or IB_QPT_GSI. * @mad_reg_req - Specifies which unsolicited MADs should be received @@ -223,7 +223,7 @@ * @context - User specified context associated with the registration. 
*/ struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device, - u8 port, + u8 port_num, enum ib_qp_type qp_type, struct ib_mad_reg_req *mad_reg_req, u8 rmpp_version, From halr at voltaire.com Thu Sep 16 19:28:29 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 16 Sep 2004 22:28:29 -0400 Subject: [openib-general] [RFC] ib_mad In-Reply-To: <52isadwwds.fsf@topspin.com> References: <1095362984.3919.36.camel@localhost.localdomain> <20040916151514.17c5ea50.mshefty@ichips.intel.com> <52mzzpwwl8.fsf@topspin.com> <52isadwwds.fsf@topspin.com> Message-ID: <1095388108.3946.146.camel@localhost.localdomain> On Thu, 2004-09-16 at 20:51, Roland Dreier wrote: > By the way, I'm not sure I totally understand the IB spec on pluggable > ports. Is the idea that the number of ports can change or just that > the GUID of a port might change? The short answer is that both can change. While the IB spec indicates GUIDs (there are 3 levels of management GUIDs: system image, node, and port) are manufacturer assigned, it is up to the manufacturer as to how they are assigned. Clearly, system image GUIDs can change as there is an optional trap for this. I am not sure there is an explicit requirement to keep node and port GUIDs the same. However, changing them has an effect on IB management (e.g. SM) and should not be taken lightly. For any hot swappable "component", there needs to be some persistent uniquifier to identify its "physical"ness. In Ethernet that is the MAC address. In IB, it is likely a GUID. That would be the ModuleGUID (in vol 2). ChassisGUID would be the chassis into which the module is plugged. If the IB device (xCA, switch) is permanently affixed to a chassis (there are compliances for this), the Chassis and ModuleGUIDs can be the same. These are contained in ChassisInfo and ModuleInfo attributes obtainable by the BM. Based on the (IB) module and chassis type, one or both of these are required. (I'm much less "fluent" here). Does that make sense ? 
Or is it more confusing ? Does that answer your question ? -- Hal From halr at voltaire.com Thu Sep 16 19:36:02 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 16 Sep 2004 22:36:02 -0400 Subject: [openib-general] [RFC] ib_mad In-Reply-To: <52mzzpwwl8.fsf@topspin.com> References: <1095362984.3919.36.camel@localhost.localdomain> <20040916151514.17c5ea50.mshefty@ichips.intel.com> <52mzzpwwl8.fsf@topspin.com> Message-ID: <1095388561.3946.151.camel@localhost.localdomain> On Thu, 2004-09-16 at 20:46, Roland Dreier wrote: > It makes sense to me to add the device attr to ib_device. The only > fields that I see that might change are system_image_guid and > phys_port_cnt (if hot pluggable ports ever come about). We could fix > this by requiring the low-level driver to keep the system_image_guid > field up to date (since it can only be changed through that driver), > and defining phys_port_cnt to be the maximum possible number of ports > (pluggable ports that aren't present would just always be down). Why would the max possible number of ports need to be known ahead of time ? This might change as "modules" fit in the same chassis with higher port densities over time. It would be better if we didn't need to artificially limit it. > In any case pretty much everyone's code would be broken by phys_port_cnt > changing under them. Agreed. We would need some event to indicate a port count change and a lot of code would need to use this as a trigger. > If we did this we could just kill off the query_device method > entirely, which would be nice. Sounds good to me. 
-- Hal From halr at voltaire.com Thu Sep 16 19:49:51 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 16 Sep 2004 22:49:51 -0400 Subject: [openib-general] [PATCH] SMI update In-Reply-To: <20040916191438.47391b89.mshefty@ichips.intel.com> References: <20040916191438.47391b89.mshefty@ichips.intel.com> Message-ID: <1095389391.3919.158.camel@localhost.localdomain> On Thu, 2004-09-16 at 22:14, Sean Hefty wrote: > I'd appreciate comments from anyone, > but particularly someone who's worked on SMI code before, > to make sure that I'm not completely off here. Just one comment on the ib_mad changes: You half got in the change we have been discussing: ib_mad.h didn't change the definition of struct ib_mad_recv_wc but ib_mad.c had the following change: @@ -372,7 +372,7 @@ * Walk receive buffer list associated with this WC * No need to remove them from list of receive buffers */ - list_for_each_entry(entry, &mad_recv_wc->recv_buf->list, list) { + list_for_each_entry(entry, &mad_recv_wc->recv_buf.list, list) { /* Free previous receive buffer */ kmem_cache_free(ib_mad_cache, buffer); buffer = (void *)entry - sizeof(struct ib_mad_private_header); I will review the SMI code as I have time over the next few days. -- Hal From roland at topspin.com Thu Sep 16 20:09:04 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 16 Sep 2004 20:09:04 -0700 Subject: [openib-general] [RFC] ib_mad In-Reply-To: <1095388108.3946.146.camel@localhost.localdomain> (Hal Rosenstock's message of "Thu, 16 Sep 2004 22:28:29 -0400") References: <1095362984.3919.36.camel@localhost.localdomain> <20040916151514.17c5ea50.mshefty@ichips.intel.com> <52mzzpwwl8.fsf@topspin.com> <52isadwwds.fsf@topspin.com> <1095388108.3946.146.camel@localhost.localdomain> Message-ID: <52656dwq0f.fsf@topspin.com> Roland> By the way, I'm not sure I totally understand the IB spec Roland> on pluggable ports. Is the idea that the number of ports Roland> can change or just that the GUID of a port might change? 
Hal> The short answer is that both can change. Hal> ... Hal> Does that make sense ? Or is it more confusing ? Does that Hal> answer your question ? I don't think that answered my question. However the existence of a System Image GUID change trap and the nonexistence of a "number of ports change" trap seems to imply nodes changing their number of ports is not well supported. For example what's going to happen if a SM discovers an 8 port switch, which then turns into a 12 port switch and sends a trap for port 10 before the SM sweeps again? Since no devices that I know of will change their number of ports, I don't think we should worry about it now. - R. From roland at topspin.com Thu Sep 16 20:13:19 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 16 Sep 2004 20:13:19 -0700 Subject: [openib-general] [RFC] ib_mad In-Reply-To: <1095388561.3946.151.camel@localhost.localdomain> (Hal Rosenstock's message of "Thu, 16 Sep 2004 22:36:02 -0400") References: <1095362984.3919.36.camel@localhost.localdomain> <20040916151514.17c5ea50.mshefty@ichips.intel.com> <52mzzpwwl8.fsf@topspin.com> <1095388561.3946.151.camel@localhost.localdomain> Message-ID: <521xh1wptc.fsf@topspin.com> Hal> Why would the max possible number of ports need to be known Hal> ahead of time ? This might change as "modules" fit in the Hal> same chassis with higher port densities over time. It would Hal> be better if we didn't need to artificially limit it. I don't think this corresponds to anything realistic. Keep in mind we're really only concerned about end ports, and a CA is _very_ unlikely to have a variable number of ports. Even current chassis based switches that I know of don't report themselves as a single node. Everything right now is coded assuming that the number of ports stays constant. If someday that breaks we can go back and enumerate ports with add and remove methods but for now that feels like overengineering to me. - R. 
From halr at voltaire.com Thu Sep 16 22:06:48 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 17 Sep 2004 01:06:48 -0400 Subject: [openib-general] [PATCH] SMI update In-Reply-To: <20040916191438.47391b89.mshefty@ichips.intel.com> References: <20040916191438.47391b89.mshefty@ichips.intel.com> Message-ID: <1095397608.17793.161.camel@localhost.localdomain> On Thu, 2004-09-16 at 22:14, Sean Hefty wrote: > After spending a couple of days floundering in SMI-related code, > I finally gave in and studied that part of the spec, > comparing it against the existing implementations. > I tried to separate the SMI requirements into send and > receive handling of SMPs. There's some pseudo-code near the > end of ib_smi.c (to be converted into real code) that describes > how the SMI checks will eventually work as the code is merged > with the ib_mad.c routines. Sounds like you used the written sections (and compliances) rather than the flow chart. Is that true ? -- Hal From halr at voltaire.com Thu Sep 16 22:11:29 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 17 Sep 2004 01:11:29 -0400 Subject: [openib-general] [RFC] ib_mad In-Reply-To: <52656dwq0f.fsf@topspin.com> References: <1095362984.3919.36.camel@localhost.localdomain> <20040916151514.17c5ea50.mshefty@ichips.intel.com> <52mzzpwwl8.fsf@topspin.com> <52isadwwds.fsf@topspin.com> <1095388108.3946.146.camel@localhost.localdomain> <52656dwq0f.fsf@topspin.com> Message-ID: <1095397888.17793.167.camel@localhost.localdomain> On Thu, 2004-09-16 at 23:09, Roland Dreier wrote: > I don't think that answered my question. However the existence of a > System Image GUID change trap and the nonexistence of a "number of > ports change" trap seems to imply nodes changing their number of ports > is not well supported. The same is true for a number of components which can change but there are no associated traps. This makes it harder for the SM to deal with these changes (polling rather than trap based). 
> For example what's going to happen if a SM > discovers an 8 port switch, which then turns into a 12 port switch and > sends a trap for port 10 before the SM sweeps again? > > Since no devices that I know of will change their number of ports, I > don't think we should worry about it now. Since switch and CA chips themselves don't change their number of ports, there is only one case of this: a box made up of a group of switch chips which was reporting itself as one switch in terms of IBA. I agree that we shouldn't worry about this case now. -- Hal From halr at voltaire.com Thu Sep 16 22:14:02 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 17 Sep 2004 01:14:02 -0400 Subject: [openib-general] [RFC] ib_mad In-Reply-To: <521xh1wptc.fsf@topspin.com> References: <1095362984.3919.36.camel@localhost.localdomain> <20040916151514.17c5ea50.mshefty@ichips.intel.com> <52mzzpwwl8.fsf@topspin.com> <1095388561.3946.151.camel@localhost.localdomain> <521xh1wptc.fsf@topspin.com> Message-ID: <1095398041.17793.171.camel@localhost.localdomain> On Thu, 2004-09-16 at 23:13, Roland Dreier wrote: > Hal> Why would the max possible number of ports need to be known > Hal> ahead of time ? This might change as "modules" fit in the > Hal> same chassis with higher port densities over time. It would > Hal> be better if we didn't need to artificially limit it. > > I don't think this corresponds to anything realistic. Keep in mind > we're really only concerned about end ports, and a CA is _very_ > unlikely to have a variable number of ports. Even current chassis > based switches that I know of don't report themselves as a single node. That's the only case of this that I'm aware of too. > Everything right now is coded assuming that the number of ports stays > constant. If someday that breaks we can go back and enumerate ports > with add and remove methods but for now that feels like > overengineering to me. Agreed. 
-- Hal From tduffy at sun.com Fri Sep 17 09:14:10 2004 From: tduffy at sun.com (Tom Duffy) Date: Fri, 17 Sep 2004 09:14:10 -0700 Subject: [openib-general] Re: [openib-commits] r857 - gen2/branches/roland-merge/src/linux-kernel/infiniband/ulp/ipoib In-Reply-To: <52acvpwtqx.fsf@topspin.com> References: <20040916211633.55D0C2283D5@openib.ca.sandia.gov> <1095376333.17548.23.camel@duffman> <52r7p1wz9b.fsf@topspin.com> <52acvpwtqx.fsf@topspin.com> Message-ID: <1095437650.25760.4.camel@duffman> On Thu, 2004-09-16 at 18:48 -0700, Roland Dreier wrote: > Roland> That's a surprise... is the after with > Roland> CONFIG_INFINIBAND_IPOIB_DEBUG=y? > > err... it helps if I do my commit from a high enough directory that > the new Kconfig gets committed. OK, that makes more sense, now with both DEBUG turned on: [root at sins-stinger-10 x86_64]# lsmod | head Module Size Used by ib_ipoib 56608 0 From sean.hefty at intel.com Fri Sep 17 09:53:34 2004 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 17 Sep 2004 09:53:34 -0700 Subject: [openib-general] [PATCH] SMI update In-Reply-To: <1095389391.3919.158.camel@localhost.localdomain> Message-ID: >Just one comment on the ib_mad changes: > >You half got in the change we have been discussing: > >ib_mad.h didn't change the definition of struct ib_mad_recv_wc but >ib_mad.c had the following change: I had this change to the header file in my tree, but backed it out before sending the patch after seeing how the code in ib_mad.c was done. It's not my intention to make this change at this point. 
From sean.hefty at intel.com Fri Sep 17 10:17:55 2004 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 17 Sep 2004 10:17:55 -0700 Subject: [openib-general] [PATCH] SMI update In-Reply-To: <1095397608.17793.161.camel@localhost.localdomain> Message-ID: >> After spending a couple of days floundering in SMI-related code, >> I finally gave in and studied that part of the spec, >> comparing it against the existing implementations. >> I tried to separate the SMI requirements into send and >> receive handling of SMPs. There's some pseudo-code near the >> end of ib_smi.c (to be converted into real code) that describes >> how the SMI checks will eventually work as the code is merged >> with the ib_mad.c routines. > >Sounds like you used the written sections (and compliances) rather than >the flow chart. Is that true ? Correct - the code is based on the compliance statements, although the two seem to match. From what I could tell, neither of these areas separated the handling of sending a SMP from receiving one, which is what I tried to have the code do. The receive side handling of SMPs is based off the Topspin stack, but was updated to include switches. From roland at topspin.com Fri Sep 17 10:30:23 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 17 Sep 2004 10:30:23 -0700 Subject: [openib-general] [PATCH] SMI update In-Reply-To: (Sean Hefty's message of "Fri, 17 Sep 2004 10:17:55 -0700") References: Message-ID: <52acvovm4w.fsf@topspin.com> Sean> The receive side handling of SMPs is based off the Topspin Sean> stack, but was updated to include switches. How are you handling switches? We don't have any API to specify the output port when we send a directed route MAD. 
- Roland From halr at voltaire.com Fri Sep 17 10:32:36 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 17 Sep 2004 13:32:36 -0400 Subject: [openib-general] [PATCH] SMI update In-Reply-To: <52acvovm4w.fsf@topspin.com> References: <52acvovm4w.fsf@topspin.com> Message-ID: <1095442356.3946.191.camel@localhost.localdomain> On Fri, 2004-09-17 at 13:30, Roland Dreier wrote: > Sean> The receive side handling of SMPs is based off the Topspin > Sean> stack, but was updated to include switches. > > How are you handling switches? We don't have any API to specify the > output port when we send a directed route MAD. Doesn't the mad_agent supplied in the ib_mad_post_send take care of that ? -- Hal From roland at topspin.com Fri Sep 17 10:42:21 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 17 Sep 2004 10:42:21 -0700 Subject: [openib-general] [PATCH] SMI update In-Reply-To: <1095442356.3946.191.camel@localhost.localdomain> (Hal Rosenstock's message of "Fri, 17 Sep 2004 13:32:36 -0400") References: <52acvovm4w.fsf@topspin.com> <1095442356.3946.191.camel@localhost.localdomain> Message-ID: <52656cvlky.fsf@topspin.com> Hal> Doesn't the mad_agent supplied in the ib_mad_post_send take Hal> care of that ? I don't see how it can. What am I missing? - R. From halr at voltaire.com Fri Sep 17 10:51:46 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 17 Sep 2004 13:51:46 -0400 Subject: [openib-general] [PATCH] SMI update References: <52acvovm4w.fsf@topspin.com> <1095442356.3946.191.camel@localhost.localdomain> <52656cvlky.fsf@topspin.com> Message-ID: <000701c49cdf$01950540$6401a8c0@comcast.net> Roland Dreier wrote: > Hal> Doesn't the mad_agent supplied in the ib_mad_post_send take > Hal> care of that ? > > I don't see how it can. What am I missing? You're right. It's not the switch output port. Another parameter is needed to support switches when sending DR packets. It can be ignored for CAs. -- Hal. 
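The point Hal concedes here (the output port has to travel with each send, and only matters for directed-route SMPs on a switch) can be sketched in plain user-space C. This is only an illustration under stated assumptions: struct sketch_ud_wr, its port_num field, and sketch_out_port() are hypothetical names for this note and are not taken from ib_verbs.h or ib_mad.h.

```c
#include <stdint.h>

/* Hypothetical, trimmed-down view of the UD part of a send work request. */
struct sketch_ud_wr {
	uint32_t remote_qkey;
	uint16_t pkey_index;
	uint8_t  port_num;	/* hypothetical per-send output port */
};

/*
 * Pick the port a MAD leaves on.
 * Assumption: only a switch sending a directed-route SMP honours the
 * per-send port; every other case falls back to the port the agent
 * registered on.
 */
uint8_t sketch_out_port(int is_switch, int is_dr_smp,
			const struct sketch_ud_wr *ud, uint8_t agent_port)
{
	if (is_switch && is_dr_smp)
		return ud->port_num;
	return agent_port;
}
```

On a CA the per-send value is simply ignored, which matches the "It can be ignored for CAs" remark above.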
From halr at voltaire.com Fri Sep 17 12:11:25 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 17 Sep 2004 15:11:25 -0400 Subject: [openib-general] [PATCH] SMI update In-Reply-To: <000701c49cdf$01950540$6401a8c0@comcast.net> References: <52acvovm4w.fsf@topspin.com> <1095442356.3946.191.camel@localhost.localdomain> <52656cvlky.fsf@topspin.com> <000701c49cdf$01950540$6401a8c0@comcast.net> Message-ID: <1095448285.3946.208.camel@localhost.localdomain> On Fri, 2004-09-17 at 13:51, Hal Rosenstock wrote: > Roland Dreier wrote: > > Hal> Doesn't the mad_agent supplied in the ib_mad_post_send take > > Hal> care of that ? > > > > I don't see how it can. What am I missing? > > You're right. It's not the switch output port. Another parameter is needed > to support switches when sending DR packets. It can be ignored for CAs. On second thought, it can work. Since there is only one port on the switch (base or enhanced switch port 0), the port_num parameter could be overloaded to mean the output port for DR SMI packets and ignored otherwise (LR) for switches. -- Hal From roland at topspin.com Fri Sep 17 12:17:45 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 17 Sep 2004 12:17:45 -0700 Subject: [openib-general] [PATCH] SMI update In-Reply-To: <1095448285.3946.208.camel@localhost.localdomain> (Hal Rosenstock's message of "Fri, 17 Sep 2004 15:11:25 -0400") References: <52acvovm4w.fsf@topspin.com> <1095442356.3946.191.camel@localhost.localdomain> <52656cvlky.fsf@topspin.com> <000701c49cdf$01950540$6401a8c0@comcast.net> <1095448285.3946.208.camel@localhost.localdomain> Message-ID: <521xh0vh5y.fsf@topspin.com> Hal> On second thought, it can work. Since there is only one port Hal> on the switch (base or enhanced switch port 0), the port_num Hal> parameter could be overloaded to mean the output port for DR Hal> SMI packets and ignored otherwise (LR) for switches. Which port_num parameter? 
(I don't see any that look useful) - Roland From halr at voltaire.com Fri Sep 17 12:23:25 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 17 Sep 2004 15:23:25 -0400 Subject: [openib-general] [PATCH] SMI update In-Reply-To: <521xh0vh5y.fsf@topspin.com> References: <52acvovm4w.fsf@topspin.com> <1095442356.3946.191.camel@localhost.localdomain> <52656cvlky.fsf@topspin.com> <000701c49cdf$01950540$6401a8c0@comcast.net> <1095448285.3946.208.camel@localhost.localdomain> <521xh0vh5y.fsf@topspin.com> Message-ID: <1095449004.17793.215.camel@localhost.localdomain> On Fri, 2004-09-17 at 15:17, Roland Dreier wrote: > Hal> On second thought, it can work. Since there is only one port > Hal> on the switch (base or enhanced switch port 0), the port_num > Hal> parameter could be overloaded to mean the output port for DR > Hal> SMI packets and ignored otherwise (LR) for switches. > > Which port_num parameter? (I don't see any that look useful) My bad again :-( port_num needs to be at send time not registration time. An extra parameter is needed for ib_mad_post_send and ib_post_send for this. Do you see a way around it ? -- Hal From roland at topspin.com Fri Sep 17 12:31:41 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 17 Sep 2004 12:31:41 -0700 Subject: [openib-general] [PATCH] SMI update In-Reply-To: <1095449004.17793.215.camel@localhost.localdomain> (Hal Rosenstock's message of "Fri, 17 Sep 2004 15:23:25 -0400") References: <52acvovm4w.fsf@topspin.com> <1095442356.3946.191.camel@localhost.localdomain> <52656cvlky.fsf@topspin.com> <000701c49cdf$01950540$6401a8c0@comcast.net> <1095448285.3946.208.camel@localhost.localdomain> <521xh0vh5y.fsf@topspin.com> <1095449004.17793.215.camel@localhost.localdomain> Message-ID: <52wtysu1ya.fsf@topspin.com> Hal> My bad again :-( port_num needs to be at send time not Hal> registration time. An extra parameter is needed for Hal> ib_mad_post_send and ib_post_send for this. Do you see a way Hal> around it ? Nope. 
I just committed the following on my branch. Index: ib_verbs.h =================================================================== --- ib_verbs.h (revision 833) +++ ib_verbs.h (working copy) @@ -521,6 +521,7 @@ u32 remote_qkey; int timeout_ms; /* valid for MADs only */ u16 pkey_index; /* valid for GSI only */ + u8 port_num; /* valid for DR SMPs on switch only */ } ud; } wr; }; From halr at voltaire.com Fri Sep 17 13:46:34 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 17 Sep 2004 16:46:34 -0400 Subject: [openib-general] [PATCH] SMI update In-Reply-To: <52wtysu1ya.fsf@topspin.com> References: <52acvovm4w.fsf@topspin.com> <1095442356.3946.191.camel@localhost.localdomain> <52656cvlky.fsf@topspin.com> <000701c49cdf$01950540$6401a8c0@comcast.net> <1095448285.3946.208.camel@localhost.localdomain> <521xh0vh5y.fsf@topspin.com> <1095449004.17793.215.camel@localhost.localdomain> <52wtysu1ya.fsf@topspin.com> Message-ID: <1095453994.17793.225.camel@localhost.localdomain> On Fri, 2004-09-17 at 15:31, Roland Dreier wrote: > Hal> My bad again :-( port_num needs to be at send time not > Hal> registration time. An extra parameter is needed for > Hal> ib_mad_post_send and ib_post_send for this. Do you see a way > Hal> around it ? > > Nope. I just committed the following on my branch. > > Index: ib_verbs.h > =================================================================== > --- ib_verbs.h (revision 833) > +++ ib_verbs.h (working copy) > @@ -521,6 +521,7 @@ > u32 remote_qkey; > int timeout_ms; /* valid for MADs only */ > u16 pkey_index; /* valid for GSI only */ > + u8 port_num; /* valid for DR SMPs on switch only */ > } ud; > } wr; > }; Similarly on openib-candidate branch (and committed)... 
svn diff ib_verbs.h Index: ib_verbs.h =================================================================== --- ib_verbs.h (revision 846) +++ ib_verbs.h (working copy) @@ -543,6 +543,7 @@ u32 remote_qkey; int timeout_ms; /* valid for MADs only */ u16 pkey_index; /* valid for GSI only */ + u8 port_num; /* valid for DR SMPs on switch only */ } ud; } wr; }; From sean.hefty at intel.com Fri Sep 17 14:17:37 2004 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 17 Sep 2004 14:17:37 -0700 Subject: [openib-general] [PATCH] SMI update In-Reply-To: <1095449004.17793.215.camel@localhost.localdomain> Message-ID: >> Hal> On second thought, it can work. Since there is only one port >> Hal> on the switch (base or enhanced switch port 0), the port_num >> Hal> parameter could be overloaded to mean the output port for DR >> Hal> SMI packets and ignored otherwise (LR) for switches. >> >> Which port_num parameter? (I don't see any that look useful) > >My bad again :-( port_num needs to be at send time not registration >time. An extra parameter is needed for ib_mad_post_send and ib_post_send >for this. Do you see a way around it ? Here's what I was thinking, so let me know if I'm way off here. If a received SMP is valid, the SMI forwards it out the initial_path[hop_ptr] port. The send path updates the SMP's hop_ptr. Right now I only added the checks to validate received SMPs, but those checks should include if the device is a switch. (I asked for a phys_port_cnt to validate initial_path[hop_ptr], but we can let this fail by verbs.) I still need to write the actual forwarding code. Looking ahead in the mail thread, it looks like the verb definitions have been fixed for switches. 
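The receive-side check Sean refers to can be tried out in user space. The sketch below mirrors the hop pointer tests of smi_check_forward_dr_smp() as posted in the ib_smi.c patch earlier in this thread; it has not been checked against the compliance statements themselves. Assumptions: struct sketch_smp is a simplified stand-in for struct ib_smp, the returning flag stands in for ib_get_smp_direction(), and the permissive LID is taken to be 0xFFFF in host byte order.

```c
#include <stdint.h>

#define SKETCH_LID_PERMISSIVE 0xFFFF	/* assumed host-order permissive LID */

/* Simplified stand-in for struct ib_smp: only the fields the check reads. */
struct sketch_smp {
	uint8_t  hop_ptr;
	uint8_t  hop_cnt;
	int      returning;	/* stand-in for ib_get_smp_direction() */
	uint16_t dr_slid;
	uint16_t dr_dlid;
};

/* Return 1 where the posted check returns 1, and 0 otherwise. */
int sketch_check_forward_dr_smp(const struct sketch_smp *smp)
{
	uint8_t hop_ptr = smp->hop_ptr;
	uint8_t hop_cnt = smp->hop_cnt;

	if (!smp->returning) {
		/* C14-9:2 -- intermediate hop */
		if (hop_ptr && hop_ptr < hop_cnt)
			return 1;
		/* C14-9:3 -- at the end of the DR segment of path */
		if (hop_ptr == hop_cnt)
			return smp->dr_dlid == SKETCH_LID_PERMISSIVE;
		/* C14-9:4 -- hop_ptr = hop_cnt + 1 */
		if (hop_ptr == hop_cnt + 1)
			return 1;
	} else {
		/* C14-13:2 */
		if (2 <= hop_ptr && hop_ptr <= hop_cnt)
			return 1;
		/* C14-13:3 -- at the end of the DR segment of path */
		if (hop_ptr == 1)
			return smp->dr_slid != SKETCH_LID_PERMISSIVE;
	}
	return 0;
}
```

A switch or phys_port_cnt check, as mentioned above, would be an extra condition on the intermediate-hop branches; it is left out here because the posted code does not have it yet.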
Thanks From mshefty at ichips.intel.com Fri Sep 17 14:27:02 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 17 Sep 2004 14:27:02 -0700 Subject: [openib-general] [PATCH] SMI update In-Reply-To: <20040916191438.47391b89.mshefty@ichips.intel.com> References: <20040916191438.47391b89.mshefty@ichips.intel.com> Message-ID: <20040917142702.1ab2cf94.mshefty@ichips.intel.com> On Thu, 16 Sep 2004 19:14:38 -0700 Sean Hefty wrote: > I'd appreciate comments from anyone, but particularly someone who's worked on SMI code before, to make sure that I'm not completely off here. (Btw, you can pretty much ignore the diffs for ib_smi.c. The previous code matched to the wrong function.) After correcting the issue pointed out by Hal wrt changes to ib_mad.c, these changes have been committed. - Sean From halr at voltaire.com Fri Sep 17 15:40:31 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 17 Sep 2004 18:40:31 -0400 Subject: [openib-general] [PATCH] core_cache.c: Eliminate use of ts_kernel_[services trace].h Message-ID: <1095460830.17793.259.camel@localhost.localdomain> core_cache.c: Eliminate use of ts_kernel_[services trace].h Index: core_cache.c =================================================================== --- core_cache.c (revision 871) +++ core_cache.c (working copy) @@ -23,9 +23,6 @@ #include "core_priv.h" -#include "ts_kernel_trace.h" -#include "ts_kernel_services.h" - #include #include @@ -177,8 +174,8 @@ } } - TS_REPORT_WARN(MOD_KERNEL_IB, - "No match for source GID " + printk(KERN_WARNING + "No match for source GID " "%02x%02x:%02x%02x:%02x%02x:%02x%02x:" "%02x%02x:%02x%02x:%02x%02x:%02x%02x", gid[ 0], gid[ 1], gid[ 2], gid[ 3], @@ -271,8 +268,8 @@ int i; int ret; - TS_TRACE(MOD_KERNEL_IB, T_VERY_VERBOSE, TRACE_KERNEL_IB_GEN, - "Updating cached port info for %s port %d", + printk(KERN_DEBUG + "Updating cached port info for %s port %d", device->name, port); tprops = kmalloc(sizeof *tprops, GFP_KERNEL); @@ -281,7 +278,7 @@ ret = 
device->query_port(device, port, tprops); if (ret) { - TS_REPORT_WARN(MOD_KERNEL_IB, + printk(KERN_WARNING "query_port failed (%d) for %s", ret, device->name); goto out; @@ -296,7 +293,7 @@ for (i = 0; i < tprops->gid_tbl_len; ++i) { ret = device->query_gid(device, port, i, tgid + i); if (ret) { - TS_REPORT_WARN(MOD_KERNEL_IB, + printk(KERN_WARNING "query_gid failed (%d) for %s (index %d)", ret, device->name, i); goto out; @@ -313,7 +310,7 @@ for (i = 0; i < tprops->pkey_tbl_len; ++i) { ret = device->query_pkey(device, port, i, &tpkey[i]); if (ret) { - TS_REPORT_WARN(MOD_KERNEL_IB, + printk(KERN_WARNING "query_pkey failed (%d) for %s, port %d, index %d", ret, device->name, port, i); goto out; @@ -384,7 +381,7 @@ seqcount_init(&priv->port_data[p].lock); ret = device->query_port(device, p, &prop); if (ret) { - TS_REPORT_WARN(MOD_KERNEL_IB, + printk(KERN_WARNING "query_port failed for %s", device->name); goto error; From roland at topspin.com Fri Sep 17 17:44:57 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 17 Sep 2004 17:44:57 -0700 Subject: [openib-general] [PATCH] core_cache.c: Eliminate use of ts_kernel_[services trace].h In-Reply-To: <1095460830.17793.259.camel@localhost.localdomain> (Hal Rosenstock's message of "Fri, 17 Sep 2004 18:40:31 -0400") References: <1095460830.17793.259.camel@localhost.localdomain> Message-ID: <527jqstng6.fsf@topspin.com> Thanks, however I haven't been cleaning things up this way because changes like this: > - TS_TRACE(MOD_KERNEL_IB, T_VERY_VERBOSE, TRACE_KERNEL_IB_GEN, > - "Updating cached port info for %s port %d", > + printk(KERN_DEBUG > + "Updating cached port info for %s port %d", mean that we spew debugging output without any way of turning it off. Also I'm not sure whether the core_cache.c code has a future in the real OpenIB access layer anyway so I don't know if it's worth cleaning it up. 
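One general way to keep such messages switchable is to gate them twice: once at compile time and once at run time. The user-space sketch below shows only that pattern and is not the ipoib code. All names are hypothetical: SKETCH_DEBUG stands in for a Kconfig option, sketch_debug_level for a module parameter, and fprintf-style output for printk.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical compile-time switch, standing in for a Kconfig option. */
#ifndef SKETCH_DEBUG
#define SKETCH_DEBUG 1
#endif

/* Hypothetical run-time switch, standing in for a module parameter. */
int sketch_debug_level = 0;

/* Print only when the run-time level allows it; return 1 if printed. */
int sketch_dbg_print(int level, const char *fmt, ...)
{
	va_list ap;

	if (level > sketch_debug_level)
		return 0;
	va_start(ap, fmt);
	vfprintf(stderr, fmt, ap);
	va_end(ap);
	return 1;
}

#if SKETCH_DEBUG
/* Debug build: defer to the run-time level check. */
#define sketch_dbg(level, ...) sketch_dbg_print((level), __VA_ARGS__)
#else
/* Non-debug build: the call site compiles away to a constant. */
#define sketch_dbg(level, ...) 0
#endif
```

With sketch_debug_level left at 0 nothing is printed; raising it to 1 lets a call such as sketch_dbg(1, "Updating cached port info for %s port %d\n", name, port) through.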
You can look at what I did in my ipoib directory to see how I made the debugging output be controllable at both compile- and run-time. - R. From Tom.Duffy at Sun.COM Fri Sep 17 18:09:37 2004 From: Tom.Duffy at Sun.COM (Tom Duffy) Date: Fri, 17 Sep 2004 18:09:37 -0700 Subject: [openib-general] Reserved L_Key API (was Re: DMA mapping on sparc64) In-Reply-To: <52ekl419rm.fsf_-_@topspin.com> References: <52d60udpfd.fsf@topspin.com> <1095188233.6945.8.camel@duffman> <52pt4o39ds.fsf@topspin.com> <1095193684.6945.29.camel@duffman> <52hdq03822.fsf@topspin.com> <1095194610.6945.31.camel@duffman> <52d60o37ej.fsf@topspin.com> <1095196989.6945.43.camel@duffman> <52zn3s1pdi.fsf@topspin.com> <1095201391.6945.66.camel@duffman> <52vfeg1muw.fsf_-_@topspin.com> <1095203538.6945.70.camel@duffman> <52r7p41m0b.fsf@topspin.com> <52ekl419rm.fsf_-_@topspin.com> Message-ID: <414B8AD1.3020008@sun.com> Roland Dreier wrote: > Comments? Better naming ideas? I think this is a good idea. Even if Tavor is limited on big mem systems, this would be fine. There are many devices that cannot DMA into high memory regions for various reasons. And the naming seems fine. 
Thanks, -tduffy From roland at topspin.com Fri Sep 17 18:22:57 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 17 Sep 2004 18:22:57 -0700 Subject: [openib-general] Reserved L_Key API In-Reply-To: <414B8AD1.3020008@sun.com> (Tom Duffy's message of "Fri, 17 Sep 2004 18:09:37 -0700") References: <52d60udpfd.fsf@topspin.com> <1095188233.6945.8.camel@duffman> <52pt4o39ds.fsf@topspin.com> <1095193684.6945.29.camel@duffman> <52hdq03822.fsf@topspin.com> <1095194610.6945.31.camel@duffman> <52d60o37ej.fsf@topspin.com> <1095196989.6945.43.camel@duffman> <52zn3s1pdi.fsf@topspin.com> <1095201391.6945.66.camel@duffman> <52vfeg1muw.fsf_-_@topspin.com> <1095203538.6945.70.camel@duffman> <52r7p41m0b.fsf@topspin.com> <52ekl419rm.fsf_-_@topspin.com> <414B8AD1.3020008@sun.com> Message-ID: <523c1gtlou.fsf@topspin.com> Tom> I think this is a good idea. Even if Tavor is limited on big Tom> mem systems, this would be fine. There are many devices that Tom> cannot DMA into high memory regions for various reasons. Tavor wouldn't be limited since it can create arbitrarily large memory regions with translation off. I'm thinking of other devices -- for example, I have no documentation on the Fujitsu HCA (and in fact I've never even seen one) but from looking at the InfiniCon/Agilent driver in svn, it seems that it is limited to memory regions with pages of max size 4 MB. In my scheme they would be limited to DMA from the low 4G. On the other hand even there I can think of hacks to get around the problem -- the only systems where this limitation really causes problems would be Intel IA-32 and EMT64T systems with > 4 GB of RAM (since Intel has no IOMMU). And for those systems the assumption that RAM starts at PCI address 0 is valid, so you could just register a region from 0 up to 0xfffffffff (ie the low 64 GB) and be happy. In any case -- I'll go ahead and code this up early next week. - R. 
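As a quick check of the address arithmetic in the message above, the sketch below computes the size of a region that starts at 0 and ends at a given last address, and tests whether a buffer falls inside it. The helper names region_size() and region_covers() are made up for illustration and are not part of any IB API.

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of the region [0, last_addr], e.g. last_addr = 0xfffffffff. */
static uint64_t region_size(uint64_t last_addr)
{
	assert(last_addr != UINT64_MAX);	/* size would not fit in 64 bits */
	return last_addr + 1;
}

/* Nonzero if the buffer [addr, addr + len) lies entirely inside [0, last_addr]. */
static int region_covers(uint64_t last_addr, uint64_t addr, uint64_t len)
{
	return len != 0 && addr <= last_addr && len - 1 <= last_addr - addr;
}
```

For example, region_size(0xfffffffffULL) is 2^36 bytes, i.e. the low 64 GB, while a region ending at 0xffffffff covers only the low 4 GB and so would not cover a buffer at the 5 GB mark.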
From iod00d at hp.com Fri Sep 17 23:10:51 2004 From: iod00d at hp.com (Grant Grundler) Date: Fri, 17 Sep 2004 23:10:51 -0700 Subject: [openib-general] [PATCH] core_cache.c: Eliminate use of ts_kernel_[services trace].h In-Reply-To: <527jqstng6.fsf@topspin.com> References: <1095460830.17793.259.camel@localhost.localdomain> <527jqstng6.fsf@topspin.com> Message-ID: <20040918061051.GB11259@cup.hp.com> On Fri, Sep 17, 2004 at 05:44:57PM -0700, Roland Dreier wrote: > Thanks, however I haven't been cleaning things up this way because > changes like this: > > > - TS_TRACE(MOD_KERNEL_IB, T_VERY_VERBOSE, TRACE_KERNEL_IB_GEN, > > - "Updating cached port info for %s port %d", > > + printk(KERN_DEBUG > > + "Updating cached port info for %s port %d", > > mean that we spew debugging output without any way of turning it off. not true. We've gone round on this issue on the linux-ia64 mailing list in regard to misaligned accesses causing console output. I don't have a URL handy. The conclusion was it's pretty simple to filter console output based on priority levels. This doesn't mean every message is relevant - just we have high level tools to filter them. grant From roland at topspin.com Sat Sep 18 11:22:34 2004 From: roland at topspin.com (Roland Dreier) Date: Sat, 18 Sep 2004 11:22:34 -0700 Subject: [openib-general] [PATCH] core_cache.c: Eliminate use of ts_kernel_[services trace].h In-Reply-To: <20040918061051.GB11259@cup.hp.com> (Grant Grundler's message of "Fri, 17 Sep 2004 23:10:51 -0700") References: <1095460830.17793.259.camel@localhost.localdomain> <527jqstng6.fsf@topspin.com> <20040918061051.GB11259@cup.hp.com> Message-ID: <52oek3sahh.fsf@topspin.com> Grant> not true. We've gone round on this issue on the linux-ia64 Grant> mailing list in regard to misaligned accesses causing Grant> console output. I don't have a URL handy. The conclusion Grant> was it's pretty simple to filter console output based on Grant> priority levels. 
I don't think the ia64 "unaligned access" message is a good
counterexample because:

 - they're not just an unconditional printk -- they're ratelimited
 - they can be turned off per-process with "prctl --unaligned=silent"
 - people actually complain about the message spew all the time
 - setting your console loglevel so you don't see them still leaves
   them filling up the logbuffer (and the unaligned access message is
   at KERN_WARNING so setting your loglevel so you don't see them
   blocks a lot of other messages you probably do want to see)

The example in my original email is even worse than the "unaligned
access" message in that it's a completely benign informational message
-- there's not even the justification of "you're hurting performance,
fix the program doing unaligned accesses."

 - R.

From gdror at mellanox.co.il Sat Sep 18 13:37:43 2004
From: gdror at mellanox.co.il (Dror Goldenberg)
Date: Sat, 18 Sep 2004 23:37:43 +0300
Subject: [openib-general] Reserved L_Key API
Message-ID: <506C3D7B14CDD411A52C00025558DED605F9D1BC@mtlex01.yok.mtl.com>

> -----Original Message-----
> From: Roland Dreier [mailto:roland at topspin.com]
> Sent: Saturday, September 18, 2004 4:23 AM
>
> On the other hand even there I can think of hacks to get
> around the problem -- the only systems where this limitation
> really causes problems would be Intel IA-32 and EMT64T
> systems with > 4 GB of RAM (since Intel has no IOMMU). And
> for those systems the assumption that RAM starts at PCI
> address 0 is valid, so you could just register a region from
> 0 up to 0xfffffffff (ie the low 64 GB) and be happy.
>

Not sure I understand. If we do plan on supporting architectures that
can't map the whole address space to DMA, then I think that the API
should reflect it. When you try to create a DMA MR, then you should be
able to request which is the address space that you need.
Am I missing something ?
From roland at topspin.com Sat Sep 18 13:47:18 2004
From: roland at topspin.com (Roland Dreier)
Date: Sat, 18 Sep 2004 13:47:18 -0700
Subject: [openib-general] Reserved L_Key API
In-Reply-To: <506C3D7B14CDD411A52C00025558DED605F9D1BC@mtlex01.yok.mtl.com> (Dror Goldenberg's message of "Sat, 18 Sep 2004 23:37:43 +0300")
References: <506C3D7B14CDD411A52C00025558DED605F9D1BC@mtlex01.yok.mtl.com>
Message-ID: <52k6urs3s9.fsf@topspin.com>

    Roland> On the other hand even there I can think of hacks to get
    Roland> around the problem -- the only systems where this
    Roland> limitation really causes problems would be Intel IA-32 and
    Roland> EMT64T systems with > 4 GB of RAM (since Intel has no
    Roland> IOMMU). And for those systems the assumption that RAM
    Roland> starts at PCI address 0 is valid, so you could just
    Roland> register a region from 0 up to 0xfffffffff (ie the low 64
    Roland> GB) and be happy.

    Dror> Not sure I understand. If we do plan on supporting
    Dror> architectures that can't map the whole address space to DMA,
    Dror> then I think that the API should reflect it. When you try to
    Dror> create a DMA MR, then you should be able to request which is
    Dror> the address space that you need. Am I missing something ?

Yes, you're looking at things backwards. The device knows what its
DMA mask is so there's no problem if an architecture limits it to DMA
only in the low 4G. There's no need to put this in the API because in
fact it's the consumer that doesn't know the limits on DMA addresses.

The issue is a device that cannot create a MR that spans the full 64
bit address space (as I said, based on my quick look at the
InfiniCon/Agilent driver, the Fujitsu HCA seems to be in this
category). For systems with an IOMMU, the driver can just set its DMA
mask to the low 4G and be happy. For systems without IOMMU (ie Intel
IA32/EMT64) the driver can set its DMA mask to the full 64 bits and
create an MR spanning the low 64G and still be happy.
 - Roland

From mst at mellanox.co.il Sun Sep 19 12:18:28 2004
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 19 Sep 2004 22:18:28 +0300
Subject: [openib-general] Reserved L_Key API
In-Reply-To: <20040916001212.GH24931@cup.hp.com>
References: <5D78D28F88822E4D8702BB9EEF1A43670623A2@mercury.infiniconsys.com> <52u0tzy1vm.fsf@topspin.com> <20040916001212.GH24931@cup.hp.com>
Message-ID: <20040919191828.GB17659@mellanox.co.il>

Hello!
Quoting r. Grant Grundler (iod00d at hp.com) "Re: [openib-general] Reserved L_Key API":
> On Wed, Sep 15, 2004 at 08:42:53AM -0700, Roland Dreier wrote:
> > Fabian> Are the ramifications of such an RKEY any worse than those
> > Fabian> of locally attached DMA-capable adapters?
>
> Not if one can guarantee it's well behaved.
>
> > Fabian> If your FC HBA
> > Fabian> goes haywire and decides to write all over memory, there's
> > Fabian> not much you can do.
>
> That's not true. There are platforms that can isolate the FC and
> prevent the FC HBA from scribbling to anything the respective driver
> didn't previously get explicit write permission. ie we can guarantee
> the containment.

In such platforms, my understanding is you don't really need the rkey
protection in the HCA - the dma API will do this for you.

From mst at mellanox.co.il Sun Sep 19 12:20:40 2004
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 19 Sep 2004 22:20:40 +0300
Subject: [openib-general] Reserved L_Key API
In-Reply-To: <52sm9itdgj.fsf@topspin.com>
References: <506C3D7B14CDD411A52C00025558DED605F9D0B9@mtlex01.yok.mtl.com> <52sm9itdgj.fsf@topspin.com>
Message-ID: <20040919192040.GC17659@mellanox.co.il>

Hello!
Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] Reserved L_Key API":
> Dror> Think of Tavor as a software friendly design.
Use: start > Dror> address = 0x0000000000000001 length = 0xffffffffffffffff pa > Dror> = 1 And then you got all memory space mapped + protection > Dror> violation if you try to access a NULL pointer :) > > That's cute :) But I think 0x0 is more likely to be a valid DMA > address than 0xffffffffffffffff (I seem to remember this coming up > when pci_dma_mapping_error() was added to the kernel). > > - R. In this case, you could cover all space with two (overlapping) keys, and add a function to map address/size pair to the correct key. From iod00d at hp.com Sun Sep 19 20:56:46 2004 From: iod00d at hp.com (Grant Grundler) Date: Sun, 19 Sep 2004 20:56:46 -0700 Subject: [openib-general] [PATCH] core_cache.c: Eliminate use of ts_kernel_[services trace].h In-Reply-To: <52oek3sahh.fsf@topspin.com> References: <1095460830.17793.259.camel@localhost.localdomain> <527jqstng6.fsf@topspin.com> <20040918061051.GB11259@cup.hp.com> <52oek3sahh.fsf@topspin.com> Message-ID: <20040920035646.GA20464@cup.hp.com> On Sat, Sep 18, 2004 at 11:22:34AM -0700, Roland Dreier wrote: > - they're not just an unconditional printk -- they're ratelimited > - they can be turned off per-process with "prctl --unaligned=silent" Sorry - you are right - bad comparison. I had forgotten both of those. grant From iod00d at hp.com Sun Sep 19 21:40:31 2004 From: iod00d at hp.com (Grant Grundler) Date: Sun, 19 Sep 2004 21:40:31 -0700 Subject: [openib-general] Reserved L_Key API In-Reply-To: <52k6urs3s9.fsf@topspin.com> References: <506C3D7B14CDD411A52C00025558DED605F9D1BC@mtlex01.yok.mtl.com> <52k6urs3s9.fsf@topspin.com> Message-ID: <20040920044031.GB20464@cup.hp.com> On Sat, Sep 18, 2004 at 01:47:18PM -0700, Roland Dreier wrote: > The device knows what its DMA mask is It does? The driver knows. But doesn't have to tell the device anything if the mask happens to be smaller than number of physical bits the device can use for DMA. > so there's no problem > if an architecture limits it to DMA only in the low 4G. 
It's the chipset that limits the DMA, not the arch.
e.g. if a PCI Host bus controller doesn't support DAC (Dual address cycle),
then the PCI device can only use 32-bit DMA. DMA mapping services or higher
level OS code has to make sure we never hand back a DMA address > 32-bits.

...
> For systems with an IOMMU, the driver can just set its DMA
> mask to the low 4G and be happy. For systems without IOMMU (ie Intel
> IA32/EMT64) the driver can set its DMA mask to the full 64 bits and
> create an MR spanning the low 64G and still be happy.

The world can mostly be divided between with and without IOMMU.
But it's not a real clean division given some exceptions.
Let me list three that are well known:

o ZX1 and SX1000 chipsets (HP): ZX1 chipset can allow 64-bit devices
  to bypass the iommu. DMA mapping services just needs to know
  if the device is 64-bit capable and then it will allow that.
  ZX1 and SX1000 are used for both parisc (PA8800) and IA64.

o "EMT64" (AMD64): Has a GART which is also used as an IOMMU
  by 32-bit PCI devices. 64-bit devices can DMA directly into RAM.
  But the DMA services are involved at setup time.
  There's probably a bit more to it.

o SGI Altix (IA64) also has an IOMMU but has different requirements
  for PCI-X than PCI devices. It allows "64-bit" DMA but then uses
  some of the upper bits as "hints" to the chipset.

The IB code should only need to know what the largest acceptable DMA
mask is and then it can build a window based on that mask.
Or only attempt to build a "global" window if all 64-bits are
"available" for DMA. As noted above, that may not work on SGI Altix
but I suspect other major problems (broken PCI ordering rules)
will make IB hard to get working properly there.

hth,
grant

From tduffy at sun.com Mon Sep 20 13:20:16 2004
From: tduffy at sun.com (Tom Duffy)
Date: Mon, 20 Sep 2004 13:20:16 -0700
Subject: [openib-general] [PATCH] Being more anal about iospace accesses..
Message-ID: <1095711616.32648.33.camel@duffman> This may not be *all* of it, but it gets rid of the build warnings on gcc 3.3 on x86. Also built cleanly on x86_64 and sparc64. Signed-off by: Tom Duffy with permission from Sun legal. Index: drivers/infiniband/hw/mthca/mthca_dev.h =================================================================== --- drivers/infiniband/hw/mthca/mthca_dev.h (revision 872) +++ drivers/infiniband/hw/mthca/mthca_dev.h (working copy) @@ -145,7 +145,7 @@ struct mthca_eq_table { struct mthca_alloc alloc; - unsigned long clr_int; + void __iomem *clr_int; u32 clr_mask; struct mthca_eq eq[MTHCA_NUM_EQ]; int have_irq; @@ -169,7 +169,7 @@ struct pci_pool *pool; int num_ddr_avs; u64 ddr_av_base; - unsigned long av_map; + void __iomem *av_map; struct mthca_alloc alloc; }; @@ -195,9 +195,9 @@ MTHCA_DECLARE_DOORBELL_LOCK(doorbell_lock) - unsigned long hcr; - unsigned long clr_base; - unsigned long kar; + void __iomem *hcr; + void __iomem *clr_base; + void __iomem *kar; struct mthca_cmd cmd; struct mthca_limits limits; Index: drivers/infiniband/hw/mthca/mthca_main.c =================================================================== --- drivers/infiniband/hw/mthca/mthca_main.c (revision 872) +++ drivers/infiniband/hw/mthca/mthca_main.c (working copy) @@ -535,17 +535,15 @@ mdev->cmd.use_events = 0; mthca_base = pci_resource_start(pdev, 0); - mdev->hcr = (unsigned long) ioremap(mthca_base + MTHCA_HCR_BASE, - MTHCA_MAP_HCR_SIZE); + mdev->hcr = ioremap(mthca_base + MTHCA_HCR_BASE, MTHCA_MAP_HCR_SIZE); if (!mdev->hcr) { mthca_err(mdev, "Couldn't map command register, " "aborting.\n"); err = -ENOMEM; goto err_out_free_dev; } - mdev->clr_base = - (unsigned long) ioremap(mthca_base + MTHCA_CLR_INT_BASE, - MTHCA_CLR_INT_SIZE); + mdev->clr_base = ioremap(mthca_base + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); if (!mdev->clr_base) { mthca_err(mdev, "Couldn't map command register, " "aborting.\n"); @@ -554,9 +552,7 @@ } mthca_base = pci_resource_start(pdev, 2); 
- mdev->kar = (unsigned long) ioremap(mthca_base + - PAGE_SIZE * MTHCA_KAR_PAGE, - PAGE_SIZE); + mdev->kar = ioremap(mthca_base + PAGE_SIZE * MTHCA_KAR_PAGE, PAGE_SIZE); if (!mdev->kar) { mthca_err(mdev, "Couldn't map kernel access region, " "aborting.\n"); Index: drivers/infiniband/hw/mthca/mthca_av.c =================================================================== --- drivers/infiniband/hw/mthca/mthca_av.c (revision 872) +++ drivers/infiniband/hw/mthca/mthca_av.c (working copy) @@ -175,12 +175,12 @@ goto out_free_alloc; if (!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { - dev->av_table.av_map = - (unsigned long) ioremap(pci_resource_start(dev->pdev, 4) + - dev->av_table.ddr_av_base - - dev->ddr_start, - dev->av_table.num_ddr_avs * - MTHCA_AV_SIZE); + dev->av_table.av_map = ioremap(pci_resource_start(dev->pdev, + 4) + + dev->av_table.ddr_av_base - + dev->ddr_start, + dev->av_table.num_ddr_avs * + MTHCA_AV_SIZE); if (!dev->av_table.av_map) goto out_free_pool; } else Index: drivers/infiniband/hw/mthca/mthca_doorbell.h =================================================================== --- drivers/infiniband/hw/mthca/mthca_doorbell.h (revision 872) +++ drivers/infiniband/hw/mthca/mthca_doorbell.h (working copy) @@ -42,7 +42,7 @@ #define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) #define MTHCA_GET_DOORBELL_LOCK(ptr) (0) -static inline void mthca_write64(u32 val[2], unsigned long dest, +static inline void mthca_write64(u32 val[2], void __iomem *dest, spinlock_t *doorbell_lock) { writeq(cpu_to_le64p((u64 *) val), dest); @@ -70,7 +70,7 @@ preempt_enable(); } -static inline void mthca_write64(u32 val[2], unsigned long dest, +static inline void mthca_write64(u32 val[2], void __iomem *dest, spinlock_t *doorbell_lock) { /* i386 stack is aligned to 8 bytes, so this should be OK: */ @@ -98,7 +98,7 @@ #define MTHCA_INIT_DOORBELL_LOCK(ptr) spin_lock_init(ptr) #define MTHCA_GET_DOORBELL_LOCK(ptr) (ptr) -static inline void mthca_write64(u32 val[2], unsigned long dest, 
+static inline void mthca_write64(u32 val[2], void __iomem *dest,
 				 spinlock_t *doorbell_lock)
 {
 	unsigned long flags;

From mst at mellanox.co.il Mon Sep 20 14:30:06 2004
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 21 Sep 2004 00:30:06 +0300
Subject: [openib-general] mwrite64 - need for uar object in access layer
Message-ID: <20040920213006.GC30626@mellanox.co.il>

Hello!
Roland, just looking at mwrite64/mread64 I would like to point out
that you can avoid the whole complexity and the need to write 64 bit
atomically for a doorbell if you simply allocate a separate UAR to
the resource that you want to ring the doorbell on.

If you do this, tavor hardware will be able to detect and handle the
case where two doorbells are inter-mixed.

For example if you give a separate UAR to each EQ you'll be able to
avoid these spinlocks in interrupt path. The cost of UAR is not too
high - by default we have normally 2K UAR pages available - so I
think it's OK to give IP over IB its own UAR for its QP/CQ and then
it won't need any atomicity, either.

For things like SDP I think it makes sense to have UAR, say, per
process.

Looking at the non-SSE, 32 bit path, I would think you want 3
spinlocks per UAR (for CQ, Send Q and Receive Q), and not a global
doorbell_lock. Granted, I don't know how relevant any 32-bit non-SSE
architectures are ...

I conclude the access layer needs some interface to manage UARs.
It might seem tavor specific but I think something like UAR is what
everyone has to do anyway to have OS bypass, right?

Opinions?
MST From halr at voltaire.com Mon Sep 20 16:54:31 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 20 Sep 2004 19:54:31 -0400 Subject: [openib-general] mthca startup problem Message-ID: <1095724470.1830.5.camel@localhost.localdomain> When insmod'ing ib_mthca.ko, I get the following error: Failed to initialize queue pair table, aborting. It seems to fail on configuring special QP0 with a return code of -16 (which appears to be bad QE state). Any idea on what is wrong ? Here is the complete log: Sep 20 08:22:51 cn5 kernel: ib_mthca: Mellanox InfiniBand HCA driver v0.05-pre (June 13, 2004) Sep 20 08:22:51 cn5 kernel: ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:02:00.0) Sep 20 08:22:51 cn5 kernel: ib_mthca 0000:02:00.0: Found bridge: Mellanox Technology MT23108 PCI Bridge (0000:01:03.0) Sep 20 08:22:53 cn5 kernel: ib_mthca 0000:02:00.0: FW version 000300000000, max_cmds 64 Sep 20 08:22:53 cn5 kernel: ib_mthca 0000:02:00.0: FW size 6143 KB (start f7a00000, end f7ffffff) Sep 20 08:22:53 cn5 kernel: ib_mthca 0000:02:00.0: HCA memory size 131071 KB (start f0000000, end f7ffffff) Sep 20 08:22:53 cn5 kernel: ib_mthca 0000:02:00.0: Max QPs: 16777216, reserved QPs: 16, entry size: 256 Sep 20 08:22:53 cn5 kernel: ib_mthca 0000:02:00.0: Max CQs: 16777216, reserved CQs: 128, entry size: 64 Sep 20 08:22:53 cn5 kernel: ib_mthca 0000:02:00.0: Max EQs: 64, reserved EQs: 1, entry size: 64 Sep 20 08:22:53 cn5 kernel: ib_mthca 0000:02:00.0: reserved MPTs: 16, reserved MTTs: 16 Sep 20 08:22:53 cn5 kernel: ib_mthca 0000:02:00.0: Max PDs: 16777216, reserved PDs: 0, reserved UARs: 1 Sep 20 08:22:53 cn5 kernel: ib_mthca 0000:02:00.0: Max QP/MCG: 16777216, reserved MGMs: 0 Sep 20 08:22:53 cn5 kernel: ib_mthca 0000:02:00.0: Flags: 00370307 Sep 20 08:22:53 cn5 kernel: ib_mthca 0000:02:00.0: profile[ 0]--10/20 @ 0x f0000000 (size 0x 4000000) Sep 20 08:22:53 cn5 kernel: ib_mthca 0000:02:00.0: profile[ 1]-- 0/16 @ 0x f4000000 (size 0x 1000000) Sep 20 08:22:53 cn5 
kernel: ib_mthca 0000:02:00.0: profile[ 2]-- 7/18 @ 0x f5000000 (size 0x 800000) Sep 20 08:22:53 cn5 kernel: ib_mthca 0000:02:00.0: profile[ 3]-- 9/17 @ 0x f5800000 (size 0x 800000) Sep 20 08:22:53 cn5 kernel: ib_mthca 0000:02:00.0: profile[ 4]-- 3/16 @ 0x f6000000 (size 0x 400000) Sep 20 08:22:53 cn5 kernel: ib_mthca 0000:02:00.0: profile[ 5]-- 4/16 @ 0x f6400000 (size 0x 200000) Sep 20 08:22:53 cn5 kernel: ib_mthca 0000:02:00.0: profile[ 6]--12/15 @ 0x f6600000 (size 0x 100000) Sep 20 08:22:53 cn5 kernel: ib_mthca 0000:02:00.0: profile[ 7]-- 8/13 @ 0x f6700000 (size 0x 80000) Sep 20 08:22:53 cn5 kernel: ib_mthca 0000:02:00.0: profile[ 8]--11/11 @ 0x f6780000 (size 0x 10000) Sep 20 08:22:53 cn5 kernel: ib_mthca 0000:02:00.0: profile[ 9]-- 6/ 5 @ 0x f6790000 (size 0x 800) Sep 20 08:22:53 cn5 kernel: ib_mthca 0000:02:00.0: HCA memory: allocated 106050 KB/124928 KB (18878 KB free) Sep 20 08:22:53 cn5 kernel: ib_mthca 0000:02:00.0: Allocated EQ 1 with 65536 entries Sep 20 08:22:53 cn5 kernel: ib_mthca 0000:02:00.0: Allocated EQ 2 with 128 entries Sep 20 08:22:53 cn5 kernel: ib_mthca 0000:02:00.0: Allocated EQ 3 with 128 entries Sep 20 08:22:53 cn5 kernel: ib_mthca 0000:02:00.0: Setting mask 000000000003c3fe for eqn 2 Sep 20 08:22:53 cn5 kernel: ib_mthca 0000:02:00.0: Setting mask 0000000000000400 for eqn 3 Sep 20 08:23:53 cn5 kernel: ib_mthca 0000:02:00.0: Failed to initialize queue pair table, aborting. Sep 20 08:23:53 cn5 kernel: ib_mthca 0000:02:00.0: Clearing mask 000000000003c3fe for eqn 2 Sep 20 08:23:53 cn5 kernel: ib_mthca 0000:02:00.0: Clearing mask 0000000000000400 for eqn 3 Sep 20 08:23:53 cn5 kernel: ib_mthca: probe of 0000:02:00.0 failed with error -16 -- Hal From roland at topspin.com Mon Sep 20 19:50:31 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 20 Sep 2004 19:50:31 -0700 Subject: [openib-general] [PATCH] Being more anal about iospace accesses.. 
In-Reply-To: <1095711616.32648.33.camel@duffman> (Tom Duffy's message of "Mon, 20 Sep 2004 13:20:16 -0700") References: <1095711616.32648.33.camel@duffman> Message-ID: <52pt4gqqrs.fsf@topspin.com> Thanks... I was meaning to do this once 2.6.9 comes out but it's great to have it now. I added a temporary #define to keep things building against 2.6.8.1, like this: - R. Index: infiniband/hw/mthca/mthca_dev.h =================================================================== --- infiniband/hw/mthca/mthca_dev.h (revision 862) +++ infiniband/hw/mthca/mthca_dev.h (working copy) @@ -29,6 +29,14 @@ #include #include +/* + * Backwards compatibility for kernel 2.6.8.1. Remove when 2.6.9 is + * officially released with support for __iomem annotations. + */ +#ifndef __iomem +#define __iomem +#endif + #include "mthca_provider.h" #include "mthca_doorbell.h" @@ -145,7 +153,7 @@ struct mthca_eq_table { struct mthca_alloc alloc; - unsigned long clr_int; + void __iomem *clr_int; u32 clr_mask; struct mthca_eq eq[MTHCA_NUM_EQ]; int have_irq; @@ -169,7 +177,7 @@ struct pci_pool *pool; int num_ddr_avs; u64 ddr_av_base; - unsigned long av_map; + void __iomem *av_map; struct mthca_alloc alloc; }; @@ -195,9 +203,9 @@ MTHCA_DECLARE_DOORBELL_LOCK(doorbell_lock) - unsigned long hcr; - unsigned long clr_base; - unsigned long kar; + void __iomem *hcr; + void __iomem *clr_base; + void __iomem *kar; struct mthca_cmd cmd; struct mthca_limits limits; Index: infiniband/hw/mthca/mthca_main.c =================================================================== --- infiniband/hw/mthca/mthca_main.c (revision 873) +++ infiniband/hw/mthca/mthca_main.c (working copy) @@ -535,17 +535,15 @@ mdev->cmd.use_events = 0; mthca_base = pci_resource_start(pdev, 0); - mdev->hcr = (unsigned long) ioremap(mthca_base + MTHCA_HCR_BASE, - MTHCA_MAP_HCR_SIZE); + mdev->hcr = ioremap(mthca_base + MTHCA_HCR_BASE, MTHCA_MAP_HCR_SIZE); if (!mdev->hcr) { mthca_err(mdev, "Couldn't map command register, " "aborting.\n"); err = 
-ENOMEM; goto err_out_free_dev; } - mdev->clr_base = - (unsigned long) ioremap(mthca_base + MTHCA_CLR_INT_BASE, - MTHCA_CLR_INT_SIZE); + mdev->clr_base = ioremap(mthca_base + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); if (!mdev->clr_base) { mthca_err(mdev, "Couldn't map command register, " "aborting.\n"); @@ -554,9 +552,7 @@ } mthca_base = pci_resource_start(pdev, 2); - mdev->kar = (unsigned long) ioremap(mthca_base + - PAGE_SIZE * MTHCA_KAR_PAGE, - PAGE_SIZE); + mdev->kar = ioremap(mthca_base + PAGE_SIZE * MTHCA_KAR_PAGE, PAGE_SIZE); if (!mdev->kar) { mthca_err(mdev, "Couldn't map kernel access region, " "aborting.\n"); Index: infiniband/hw/mthca/mthca_av.c =================================================================== --- infiniband/hw/mthca/mthca_av.c (revision 824) +++ infiniband/hw/mthca/mthca_av.c (working copy) @@ -175,12 +175,11 @@ goto out_free_alloc; if (!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { - dev->av_table.av_map = - (unsigned long) ioremap(pci_resource_start(dev->pdev, 4) + - dev->av_table.ddr_av_base - - dev->ddr_start, - dev->av_table.num_ddr_avs * - MTHCA_AV_SIZE); + dev->av_table.av_map = ioremap(pci_resource_start(dev->pdev, 4) + + dev->av_table.ddr_av_base - + dev->ddr_start, + dev->av_table.num_ddr_avs * + MTHCA_AV_SIZE); if (!dev->av_table.av_map) goto out_free_pool; } else Index: infiniband/hw/mthca/mthca_doorbell.h =================================================================== --- infiniband/hw/mthca/mthca_doorbell.h (revision 803) +++ infiniband/hw/mthca/mthca_doorbell.h (working copy) @@ -42,7 +42,7 @@ #define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) #define MTHCA_GET_DOORBELL_LOCK(ptr) (0) -static inline void mthca_write64(u32 val[2], unsigned long dest, +static inline void mthca_write64(u32 val[2], void __iomem *dest, spinlock_t *doorbell_lock) { writeq(cpu_to_le64p((u64 *) val), dest); @@ -70,7 +70,7 @@ preempt_enable(); } -static inline void mthca_write64(u32 val[2], unsigned long dest, +static inline void 
mthca_write64(u32 val[2], void __iomem *dest, spinlock_t *doorbell_lock) { /* i386 stack is aligned to 8 bytes, so this should be OK: */ @@ -98,7 +98,7 @@ #define MTHCA_INIT_DOORBELL_LOCK(ptr) spin_lock_init(ptr) #define MTHCA_GET_DOORBELL_LOCK(ptr) (ptr) -static inline void mthca_write64(u32 val[2], unsigned long dest, +static inline void mthca_write64(u32 val[2], void __iomem *dest, spinlock_t *doorbell_lock) { unsigned long flags; From roland at topspin.com Mon Sep 20 19:52:05 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 20 Sep 2004 19:52:05 -0700 Subject: [openib-general] mwrite64 - need for uar object in access layer In-Reply-To: <20040920213006.GC30626@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 21 Sep 2004 00:30:06 +0300") References: <20040920213006.GC30626@mellanox.co.il> Message-ID: <52llf4qqp6.fsf@topspin.com> Michael> Hello! Roland, just looking at mwrite64/mread64 I would Michael> like to point out that you can avoid the whole complexity Michael> and the need to write 64 bit atomically for a doorbell if Michael> you simply allocate a separate UAR to the resource that Michael> you want to ring the doorbell on. Michael> If you do this, tavor hardware will be able to detect and Michael> handle the case where two doorbells are inter-mixed. Michael> For example if you give a separate UAR to each EQ you'll Michael> be able to avoid these spinlocks in interrupt path. The Michael> cost of UAR is not too high - by default we have normally Michael> 2K UAR pages available - so I think, its OK to give IP Michael> over IB its own UAR for its QP/CQ and then it wont need Michael> any atomicity, either. Michael> For things like SDP I think it makes sence to have UAR, Michael> say, per process. Michael> Looking at the non-SSE, 32 bit path, I would think you Michael> want 3 spinlocks per UAR (for CQ, Send Q and Receive Q), Michael> and not a global doorbell_lock. 
Granted, I dont know how Michael> relevant any 32-bit non-SSE architectures are ... Michael> I conclude the access layer needs some interface to Michael> manage UARs. It might seem tavor specific but I think Michael> something like UAR is what everyone has to do anyway to Michael> have OS bypass, right? Michael> Opinions? Sounds interesting... I would be curious to see benchmarks for various schemes. Exposing UARs to consumers seems a little bit of a layering violation but if it turns out to be a big win then it might be worth it. Of course I'm not sure how interested in optimizing 32-bit architectures anyone is... - R. From roland at topspin.com Mon Sep 20 19:54:25 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 20 Sep 2004 19:54:25 -0700 Subject: [openib-general] mthca startup problem In-Reply-To: <1095724470.1830.5.camel@localhost.localdomain> (Hal Rosenstock's message of "Mon, 20 Sep 2004 19:54:31 -0400") References: <1095724470.1830.5.camel@localhost.localdomain> Message-ID: <52hdpsqqla.fsf@topspin.com> Hal> When insmod'ing ib_mthca.ko, I get the following error: Hal> Failed to initialize queue pair table, aborting. Hal> It seems to fail on configuring special QP0 with a return Hal> code of -16 (which appears to be bad QE state). Hal> Any idea on what is wrong ? Actually -16 is -EBUSY, which means that a FW command timed out. In fact the CONF_SPECIAL_QP command in mthca_init_qp_table() is the first FW command where we try to use interrupt-driven mode. So it seems that the driver is not receiving interrupts from the device. Typically this problem means you have interrupt routing issues with your BIOS or kernel. You can try the usual things like booting with "acpi=off", looking for a BIOS update, etc. 
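For readers decoding such return values, a small userspace sketch of the mapping described above: a negative kernel-style return code is the negated errno value, and on Linux EBUSY is 16. The helper name describe_ret() is made up for illustration; the text returned by strerror() depends on the C library.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Map a kernel-style return code (0 or a negated errno value) to a description. */
static const char *describe_ret(int ret)
{
	assert(ret <= 0);
	return ret == 0 ? "success" : strerror(-ret);
}
```

So a probe failure "with error -16" corresponds to describe_ret(-16), i.e. the C library's description of EBUSY.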
 - Roland

From hozer at hozed.org Mon Sep 20 23:11:18 2004
From: hozer at hozed.org (Troy Benjegerdes)
Date: Tue, 21 Sep 2004 01:11:18 -0500
Subject: [openib-general] mwrite64 - need for uar object in access layer
In-Reply-To: <52llf4qqp6.fsf@topspin.com>
References: <20040920213006.GC30626@mellanox.co.il> <52llf4qqp6.fsf@topspin.com>
Message-ID: <20040921061118.GP6837@kalmia.hozed.org>

> Michael> Looking at the non-SSE, 32 bit path, I would think you
> Michael> want 3 spinlocks per UAR (for CQ, Send Q and Receive Q),
> Michael> and not a global doorbell_lock. Granted, I dont know how
> Michael> relevant any 32-bit non-SSE architectures are ...
>
> Michael> I conclude the access layer needs some interface to
> Michael> manage UARs. It might seem tavor specific but I think
> Michael> something like UAR is what everyone has to do anyway to
> Michael> have OS bypass, right?
>
> Michael> Opinions?
>
> Sounds interesting... I would be curious to see benchmarks for various
> schemes. Exposing UARs to consumers seems a little bit of a layering
> violation but if it turns out to be a big win then it might be worth
> it. Of course I'm not sure how interested in optimizing 32-bit
> architectures anyone is...

How does just using the floating point unit compare to the SSE codepath?
In a past life I had to get a flash driver for a 32 bit PPC board
working that *had* to have 64 bit access to flash.

The gross solution was using the FP unit to do all the load/stores. This
also meant I had to do __enable_fpu() and __disable_fpu() all
over the place and compile the .c file with '-mhard-float'.

The other advantage to this solution is it doesn't introduce a whole
bunch more ASM code we may not want...
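For comparison with the FPU and SSE approaches, here is a rough userspace sketch of the lock-based fallback this thread keeps returning to: the 64-bit doorbell is written as two 32-bit stores, serialized by a lock so that halves from different writers cannot interleave. Everything here is a stand-in: struct fake_doorbell, doorbell_init() and doorbell_write64() are hypothetical names, a plain volatile array plays the role of the device register, and a C11 atomic_flag spin loop replaces the kernel spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a 64-bit device register that only accepts 32-bit stores. */
struct fake_doorbell {
	atomic_flag lock;		/* plays the role of doorbell_lock */
	volatile uint32_t reg[2];	/* low and high halves of the register */
};

static void doorbell_init(struct fake_doorbell *db)
{
	atomic_flag_clear(&db->lock);
	db->reg[0] = 0;
	db->reg[1] = 0;
}

/* Write both halves while holding the lock so no other writer can interleave. */
static void doorbell_write64(struct fake_doorbell *db, const uint32_t val[2])
{
	assert(db != NULL && val != NULL);
	while (atomic_flag_test_and_set_explicit(&db->lock, memory_order_acquire))
		;	/* spin until the lock is free */
	db->reg[0] = val[0];
	db->reg[1] = val[1];
	atomic_flag_clear_explicit(&db->lock, memory_order_release);
}
```

One lock per doorbell object, rather than one global lock, mirrors the suggestion of finer-grained locking per UAR; whether that beats an SSE or FPU store is exactly the benchmarking question raised above.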
From roland at topspin.com Tue Sep 21 07:55:58 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 21 Sep 2004 07:55:58 -0700 Subject: [openib-general] mwrite64 - need for uar object in access layer In-Reply-To: <20040921061118.GP6837@kalmia.hozed.org> (Troy Benjegerdes's message of "Tue, 21 Sep 2004 01:11:18 -0500") References: <20040920213006.GC30626@mellanox.co.il> <52llf4qqp6.fsf@topspin.com> <20040921061118.GP6837@kalmia.hozed.org> Message-ID: <52brfzr7r5.fsf@topspin.com> Troy> How does just using the floating point unit compare the the Troy> SSE codepath? In a past life I had to get a flash driver for Troy> a 32 bit PPC board working that *had* to have 64 bit access Troy> to flash. It's a pretty huge loss, because saving/restoring the FPU state requires writing/reading something like 170 bytes. With SSE we can just save the 8 bytes of XMM register that we actually use. Even so I'm not convinced SSE is a win over just using a lock because saving CR0 is so expensive. As I said, I'd be curious to see benchmarks of other approaches. I think there's definitely room for improvement if someone is interested in working on this. - R. From mst at mellanox.co.il Tue Sep 21 10:17:14 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 21 Sep 2004 20:17:14 +0300 Subject: [openib-general] mwrite64 - need for uar object in access layer In-Reply-To: <52brfzr7r5.fsf@topspin.com> References: <20040920213006.GC30626@mellanox.co.il> <52llf4qqp6.fsf@topspin.com> <20040921061118.GP6837@kalmia.hozed.org> <52brfzr7r5.fsf@topspin.com> Message-ID: <20040921171714.GB6624@mellanox.co.il> Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] mwrite64 - need for uar object in access layer": > Troy> How does just using the floating point unit compare the the > Troy> SSE codepath? In a past life I had to get a flash driver for > Troy> a 32 bit PPC board working that *had* to have 64 bit access > Troy> to flash. 
> > It's a pretty huge loss, because saving/restoring the FPU state > requires writing/reading something like 170 bytes. With SSE we can > just save the 8 bytes of XMM register that we actually use. Even so > I'm not convinced SSE is a win over just using a lock because saving > CR0 is so expensive. > > As I said, I'd be curious to see benchmarks of other approaches. I > think there's definitely room for improvement if someone is interested > in working on this. > Well, profiling is always nice but hard to do. But just doing a two word write is clearly a win, no? MST From roland at topspin.com Tue Sep 21 10:29:32 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 21 Sep 2004 10:29:32 -0700 Subject: [openib-general] mwrite64 - need for uar object in access layer In-Reply-To: <20040921171714.GB6624@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 21 Sep 2004 20:17:14 +0300") References: <20040920213006.GC30626@mellanox.co.il> <52llf4qqp6.fsf@topspin.com> <20040921061118.GP6837@kalmia.hozed.org> <52brfzr7r5.fsf@topspin.com> <20040921171714.GB6624@mellanox.co.il> Message-ID: <52sm9bpm2r.fsf@topspin.com> Michael> Well, profiling is always nice but hard to do. But just Michael> doing a two word write is clearly a win, no? Sure, but I'd be much more interested in creating an abstraction for UARs, teaching ULPs about it, adding locking in ULPs, etc. for a 25% improvement than for a 0.01% improvement. I'm pretty busy just trying to get everything working at a basic level for merging upstream. But the field is wide open for anyone else to write patches and do some benchmarking. Thanks, Roland From volta104 at mail.netvision.net.il Tue Sep 21 11:42:10 2004 From: volta104 at mail.netvision.net.il (volta104 at mail.netvision.net.il) Date: Tue, 21 Sep 2004 14:42:10 -0400 Subject: [openib-general] mthca startup problem Message-ID: <59650-220049221184210158@M2W099.mail2web.com> [Roland wrote:] Actually -16 is -EBUSY, which means that a FW command timed out. 
In fact the CONF_SPECIAL_QP command in mthca_init_qp_table() is the first FW command where we try to use interrupt-driven mode. So it seems that the driver is not receiving interrupts from the device. Typically this problem means you have interrupt routing issues with your BIOS or kernel. You can try the usual things like booting with "acpi=off", looking for a BIOS update, etc. I have been running with acpi=off on my boot line (and also disabled ACPI in the BIOS). Is that what the BIOS update would be for ? The machine is a dual processor Opteron running in 32 bit mode. In any case, I have tried numerous combinations and still can't seem to get that first interrupt from the HCA. How do I go about debugging this ? Thanks. -- Hal -------------------------------------------------------------------- mail2web - Check your email from the web at http://mail2web.com/ . From roland at topspin.com Tue Sep 21 11:59:59 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 21 Sep 2004 11:59:59 -0700 Subject: [openib-general] mthca startup problem In-Reply-To: <59650-220049221184210158@M2W099.mail2web.com> (volta's message of "Tue, 21 Sep 2004 14:42:10 -0400") References: <59650-220049221184210158@M2W099.mail2web.com> Message-ID: <52ekkvphw0.fsf@topspin.com> hal> I have been running with acpi=off on my boot line (and also hal> disabled ACPI in the BIOS). Is that what the BIOS update hal> would be for ? If you've always had acpi=off on your bootline then getting rid of it and turning ACPI on may be helpful (I'm not sure what "have been running" means exactly). The kernel needs to find out how the interrupt lines are wired from the PCI slots to the interrupt controller. ACPI information from the BIOS is the most modern way, but BIOS vendors frequently have bugs. The virtual PCI bridge in the HCA often confuses the BIOS. That's why a BIOS update may be helpful. 
hal> In any case, I have tried numerous combinations and still hal> can't seem to get that first interrupt from the HCA. How do hal> I go about debugging this ? If you've ever had an HCA driver working on this machine, you can look at /proc/interrupts and see what IRQ the HCA is assigned. Then add a printk to mthca_eq.c before request_irq (the second, non-MSIX request_irq) and see what the value of dev->pdev->irq is. If they're different then that's the problem. If you've never had the HCA working then you can put a different PCI card in the same slot as the HCA and see what IRQ it gets, and compare that to the IRQ assigned to the HCA. Assuming the wrong IRQ is being assigned to the HCA, then you need to work with your BIOS vendor and/or the Linux ACPI maintainers to fix things up. There's also a slim chance that turning on CONFIG_PCI_MSI may help. (However make sure you don't enable msi or msi_x for ib_mthca -- MSI/MSI-X doesn't work on any current Opteron systems). - R. From volta104 at mail.netvision.net.il Tue Sep 21 12:28:31 2004 From: volta104 at mail.netvision.net.il (volta104 at mail.netvision.net.il) Date: Tue, 21 Sep 2004 15:28:31 -0400 Subject: [openib-general] mthca startup problem Message-ID: <220970-220049221192831799@M2W066.mail2web.com> [Roland wrote:] If you've always had acpi=off on your bootline then getting rid of it and turning ACPI on may be helpful (I'm not sure what "have been running" means exactly). I've run with all combinations (acpi on/off in BIOS, boot line, and configured/built into and not configured not built into 2.6.8.1 kernel). [Roland wrote:] If you've ever had an HCA driver working on this machine, you can look at /proc/interrupts and see what IRQ the HCA is assigned. Then add a printk to mthca_eq.c before request_irq (the second, non-MSIX request_irq) and see what the value of dev->pdev->irq is. If they're different then that's the problem. Yes, the HCA worked with the Voltaire stack at 2.4. 
[Roland wrote:] There's also a slim chance that turning on CONFIG_PCI_MSI may help. (However make sure you don't enable msi or msi_x for ib_mthca -- MSI/MSI-X doesn't work on any current Opteron systems). I've tried the latter to no avail. Where's the MSI config to mthca ? I can't seem to find it. One more thing: I noticed that when mthca starts up it displays the following: Sep 20 08:22:53 cn5 kernel: ib_mthca 0000:02:00.0: FW version 000300000000, max_cmds 64 Does the firmware rev need to be 3.2 ? More later as I gather more info. Thanks again. -- Hal -------------------------------------------------------------------- mail2web - Check your email from the web at http://mail2web.com/ . From roland at topspin.com Tue Sep 21 12:34:30 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 21 Sep 2004 12:34:30 -0700 Subject: [openib-general] mthca startup problem In-Reply-To: <220970-220049221192831799@M2W066.mail2web.com> (volta's message of "Tue, 21 Sep 2004 15:28:31 -0400") References: <220970-220049221192831799@M2W066.mail2web.com> Message-ID: <52y8j3o1q1.fsf@topspin.com> hal> Yes, the HCA worked with the Voltaire stack at 2.4. Good, you should be able to compare the IRQ assigned under kernel 2.4 with the IRQ you're getting now. (It might be worth trying the Voltaire stack with your 2.6.8.1 kernel to see if it has the same IRQ problem) hal> I've tried the latter to no avail. Where's the MSI config hal> to mthca ? I can't seem to find it. It's not an mthca option. In the main kernel config, it's under something like "bus options". hal> One more thing: I noticed that when mthca starts up it hal> displays the following: Sep 20 08:22:53 cn5 kernel: hal> ib_mthca 0000:02:00.0: FW version 000300000000, max_cmds 64 hal> Does the firmware rev need to be 3.2 ? It wouldn't hurt but I doubt it would help with this problem. 
- Roland From halr at voltaire.com Wed Sep 22 09:18:26 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 22 Sep 2004 12:18:26 -0400 Subject: [openib-general] mthca startup problem References: <220970-220049221192831799@M2W066.mail2web.com> <52y8j3o1q1.fsf@topspin.com> Message-ID: <005101c4a0bf$d93b33f0$4302000a@Gripen> Roland Dreier wrote: > Good, you should be able to compare the IRQ assigned under kernel 2.4 > with the IRQ you're getting now. (It might be worth trying the > Voltaire stack with your 2.6.8.1 kernel to see if it has the same IRQ > problem) IRQs are different. For 2.6.8.1, IRQ is 177 whereas for 2.4 it is IRQ28 for 2.4.19. dmesg in both cases shows the relevant IRQ -> 2.0. lspci for 02:00.0 shows pin A routed to the proper IRQ. > hal> I've tried the latter to no avail. Where's the MSI config > hal> to mthca ? I can't seem to find it. > > It's not an mthca option. In the main kernel config, it's under > something like "bus options". I've played with that to no avail. > hal> One more thing: I noticed that when mthca starts up it > hal> displays the following: Sep 20 08:22:53 cn5 kernel: > hal> ib_mthca 0000:02:00.0: FW version 000300000000, max_cmds 64 > > hal> Does the firmware rev need to be 3.2 ? > > It wouldn't hurt but I doubt it would help with this problem. I'll wait on this until I get past this. -- Hal From mst at mellanox.co.il Wed Sep 22 09:23:27 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 22 Sep 2004 18:23:27 +0200 Subject: [openib-general] mthca startup problem In-Reply-To: <005101c4a0bf$d93b33f0$4302000a@Gripen> References: <220970-220049221192831799@M2W066.mail2web.com> <52y8j3o1q1.fsf@topspin.com> <005101c4a0bf$d93b33f0$4302000a@Gripen> Message-ID: <20040922162327.GC16142@mellanox.co.il> Quoting r. 
Hal Rosenstock (halr at voltaire.com) "Re: [openib-general] mthca startup problem": > Roland Dreier wrote: > > Good, you should be able to compare the IRQ assigned under kernel 2.4 > > with the IRQ you're getting now. (It might be worth trying the > > Voltaire stack with your 2.6.8.1 kernel to see if it has the same IRQ > > problem) > > IRQs are different. For 2.6.8.1, IRQ is 177 whereas for 2.4 it is IRQ28 for > 2.4.19. dmesg in both cases shows the relevant IRQ -> 2.0. lspci for 02:00.0 > shows pin A routed to the proper IRQ. > > > hal> I've tried the latter to no avail. Where's the MSI config > > hal> to mthca ? I can't seem to find it. > > > > It's not an mthca option. In the main kernel config, it's under > > something like "bus options". > > I've played with that to no avail. > > > hal> One more thing: I noticed that when mthca starts up it > > hal> displays the following: Sep 20 08:22:53 cn5 kernel: > > hal> ib_mthca 0000:02:00.0: FW version 000300000000, max_cmds 64 > > > > hal> Does the firmware rev need to be 3.2 ? > > > > It wouldn't hurt but I doubt it would help with this problem. > > I'll wait on this until I get past this. > > -- Hal Hal, I suggest you try calling the interrupt handler after the timeout and see if this helps. If it does, this means you missed the interrupt. If it does not, there's some other problem. MST From roland at topspin.com Wed Sep 22 09:22:16 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 22 Sep 2004 09:22:16 -0700 Subject: [openib-general] mthca startup problem In-Reply-To: <005101c4a0bf$d93b33f0$4302000a@Gripen> (Hal Rosenstock's message of "Wed, 22 Sep 2004 12:18:26 -0400") References: <220970-220049221192831799@M2W066.mail2web.com> <52y8j3o1q1.fsf@topspin.com> <005101c4a0bf$d93b33f0$4302000a@Gripen> Message-ID: <52ekkumfyf.fsf@topspin.com> Hal> IRQs are different. For 2.6.8.1, IRQ is 177 whereas for 2.4 Hal> it is IRQ28 for 2.4.19. dmesg in both cases shows the Hal> relevant IRQ -> 2.0. 
lspci for 02:00.0 shows pin A routed to Hal> the proper IRQ. IRQ 177 seems like you have CONFIG_PCI_MSI turned on (it looks like a vector rather than a regular IRQ). What do you get with CONFIG_PCI_MSI=n? - Roland From mshefty at ichips.intel.com Wed Sep 22 15:37:33 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 22 Sep 2004 15:37:33 -0700 Subject: [openib-general] [RFC] ib_mad In-Reply-To: <1095362984.3919.36.camel@localhost.localdomain> References: <1095362984.3919.36.camel@localhost.localdomain> Message-ID: <20040922153733.6d76b3df.mshefty@ichips.intel.com> On Thu, 16 Sep 2004 15:29:45 -0400 Hal Rosenstock wrote: > Now that I think ib_mad is far enough along, I would like to request any > code comments. I'll start submitting comments in the form of patches. I'll try to keep each patch as small and contained as possible. This first one is fairly easy. I didn't see where struct ib_mad_buf was used. If it's needed, can it be merged directly into ib_mad_private_header? My next few patches will be more useful... - Sean Index: access/ib_mad_priv.h =================================================================== --- access/ib_mad_priv.h (revision 877) +++ access/ib_mad_priv.h (working copy) @@ -76,16 +76,9 @@ #define MAX_MGMT_CLASS 80 #define MAX_MGMT_VERSION 8 - -struct ib_mad_buf { - void *mad_buf; - DECLARE_PCI_UNMAP_ADDR(mapping) -}; - struct ib_mad_private_header { struct ib_mad_recv_wc recv_wc; /* must be first member (for now !!!) 
*/ struct ib_mad_recv_buf recv_buf; - struct ib_mad_buf buf; } __attribute__ ((packed)); struct ib_mad_private { From mshefty at ichips.intel.com Wed Sep 22 16:01:39 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 22 Sep 2004 16:01:39 -0700 Subject: [openib-general] [RFC] ib_mad In-Reply-To: <1095362984.3919.36.camel@localhost.localdomain> References: <1095362984.3919.36.camel@localhost.localdomain> Message-ID: <20040922160139.2cf79338.mshefty@ichips.intel.com> On Thu, 16 Sep 2004 15:29:45 -0400 Hal Rosenstock wrote: > Now that I think ib_mad is far enough along, I would like to request any > code comments. This patch embeds struct ib_mad_agent within struct ib_mad_agent_private and fixes up some cleanup issues related to the MAD registration. - Sean Index: access/ib_mad_priv.h =================================================================== --- access/ib_mad_priv.h (revision 877) +++ access/ib_mad_priv.h (working copy) @@ -100,7 +100,7 @@ struct ib_mad_agent_private { struct list_head agent_list; - struct ib_mad_agent *agent; + struct ib_mad_agent agent; struct ib_mad_reg_req *reg_req; u8 rmpp_version; }; Index: access/ib_mad.c =================================================================== --- access/ib_mad.c (revision 877) +++ access/ib_mad.c (working copy) @@ -100,7 +100,7 @@ void *context) { struct ib_mad_port_private *entry, *port_priv = NULL; - struct ib_mad_agent *mad_agent, *ret; + struct ib_mad_agent *ret; struct ib_mad_agent_private *mad_agent_priv; struct ib_mad_reg_req *reg_req = NULL; struct ib_mad_mgmt_class_table *class; @@ -162,11 +162,10 @@ } /* Allocate structures */ - mad_agent = kmalloc(sizeof *mad_agent, GFP_KERNEL); mad_agent_priv = kmalloc(sizeof *mad_agent_priv, GFP_KERNEL); - if (!mad_agent || !mad_agent_priv) { + if (!mad_agent_priv) { ret = ERR_PTR(-ENOMEM); - goto error2; + goto error1; } if (mad_reg_req) { @@ -203,38 +202,33 @@ /* Now, fill in the various structures */ memset(mad_agent_priv, 0, sizeof 
*mad_agent_priv); - mad_agent_priv->agent = mad_agent; mad_agent_priv->reg_req = reg_req; mad_agent_priv->rmpp_version = rmpp_version; - memset(mad_agent, 0, sizeof *mad_agent); - mad_agent->device = device; - mad_agent->recv_handler = recv_handler; - mad_agent->send_handler = send_handler; - mad_agent->context = context; - mad_agent->qp = port_priv->qp[qp_type]; - mad_agent->hi_tid = ++ib_mad_client_id; - - /* Add mad agent into agent list */ - list_add_tail(&mad_agent_priv->agent_list, &port_priv->agent_list); + mad_agent_priv->agent.device = device; + mad_agent_priv->agent.recv_handler = recv_handler; + mad_agent_priv->agent.send_handler = send_handler; + mad_agent_priv->agent.context = context; + mad_agent_priv->agent.qp = port_priv->qp[qp_type]; + mad_agent_priv->agent.hi_tid = ++ib_mad_client_id; ret2 = add_mad_reg_req(mad_reg_req, mad_agent_priv); if (ret2) { + spin_unlock_irqrestore(&port_priv->reg_lock, flags); ret = ERR_PTR(ret2); goto error3; } - spin_unlock_irqrestore(&port_priv->reg_lock, flags); - return mad_agent; + /* Add mad agent into agent list */ + list_add_tail(&mad_agent_priv->agent_list, &port_priv->agent_list); -error3: - /* Remove mad agent from agent list */ - list_del(&mad_agent_priv->agent_list); spin_unlock_irqrestore(&port_priv->reg_lock, flags); - /* Release allocated structures */ - kfree(reg_req); + return &mad_agent_priv->agent; + +error3: + if (reg_req) + kfree(reg_req); error2: - kfree(mad_agent); kfree(mad_agent_priv); error1: return ret; @@ -248,7 +242,6 @@ { struct ib_mad_port_private *entry; struct ib_mad_agent_private *entry2, *temp; - int not_found = 1; unsigned long flags, flags2; /* @@ -261,22 +254,23 @@ spin_lock_irqsave(&entry->reg_lock, flags2); list_for_each_entry_safe(entry2, temp, &entry->agent_list, agent_list) { - if (entry2->agent == mad_agent) { + if (&entry2->agent == mad_agent) { remove_mad_reg_req(entry2); list_del(&entry2->agent_list); + spin_unlock_irqrestore(&entry->reg_lock, flags2); + 
spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); /* Release allocated structures */ - kfree(entry2->reg_req); - kfree(entry2->agent); + if (entry2->reg_req) + kfree(entry2->reg_req); kfree(entry2); - not_found = 0; - break; + return 0; } } spin_unlock_irqrestore(&entry->reg_lock, flags2); } spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); - return not_found; + return 1; } EXPORT_SYMBOL(ib_unregister_mad_agent); @@ -472,7 +466,7 @@ /* Make sure MAD registration request supplied */ if (!mad_reg_req) return 0; - private = priv->agent->device->mad; + private = priv->agent.device->mad; class = &private->version[mad_reg_req->mgmt_class_version]; mgmt_class = convert_mgmt_class(mad_reg_req->mgmt_class); if (!*class) { @@ -546,7 +540,7 @@ return; } - port_priv = agent_priv->agent->device->mad; + port_priv = agent_priv->agent.device->mad; class = port_priv->version[agent_priv->reg_req->mgmt_class_version]; if (!class) { printk(KERN_ERR "No class table yet MAD registration request supplied\n"); @@ -629,7 +623,7 @@ /* Routing is based on high 32 bits of transaction ID of MAD */ hi_tid = mad->mad_hdr.tid >> 32; list_for_each_entry(entry, &port_priv->agent_list, agent_list) { - if (entry->agent->hi_tid == hi_tid) { + if (entry->agent.hi_tid == hi_tid) { mad_agent = entry; break; } @@ -763,8 +757,8 @@ } /* Invoke receive callback */ - mad_agent->agent->recv_handler(mad_agent->agent, - &recv->header.recv_wc); + mad_agent->agent.recv_handler(&mad_agent->agent, + &recv->header.recv_wc); } spin_unlock_irqrestore(&port_priv->reg_lock, flags); From mst at mellanox.co.il Wed Sep 22 16:59:03 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 23 Sep 2004 01:59:03 +0200 Subject: [openib-general] pls subscribe archive@mail-archive.com Message-ID: <20040922235903.GA16647@mellanox.co.il> To list administrator - could you subscribe archive at mail-archive.com to the list? 
This would get a list archive at http://www.mail-archive.com - I find it much faster for web access than gmane is, and it has RSS feeds too which is nice. Thanks, MST From mlleinin at hpcn.ca.sandia.gov Wed Sep 22 21:28:45 2004 From: mlleinin at hpcn.ca.sandia.gov (Matt L. Leininger) Date: Wed, 22 Sep 2004 21:28:45 -0700 Subject: [openib-general] pls subscribe archive@mail-archive.com In-Reply-To: <20040922235903.GA16647@mellanox.co.il> References: <20040922235903.GA16647@mellanox.co.il> Message-ID: <1095913724.4743.6.camel@trinity> On Wed, 2004-09-22 at 16:59, Michael S. Tsirkin wrote: > To list administrator - could you subscribe archive at mail-archive.com > to the list? > This would get a list archive at http://www.mail-archive.com - I find > it much faster for web access than gmane is, and it has RSS feeds too > which is nice. > Done. - Matt From halr at voltaire.com Wed Sep 22 21:34:47 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 23 Sep 2004 00:34:47 -0400 Subject: [openib-general] [RFC] ib_mad In-Reply-To: <20040922153733.6d76b3df.mshefty@ichips.intel.com> References: <1095362984.3919.36.camel@localhost.localdomain> <20040922153733.6d76b3df.mshefty@ichips.intel.com> Message-ID: <1095914028.1872.90.camel@localhost.localdomain> On Wed, 2004-09-22 at 18:37, Sean Hefty wrote: > I'll start submitting comments in the form of patches. > I'll try to keep each patch as small > and contained as possible. OK. Thanks. > This first one is fairly easy. > I didn't see where struct ib_mad_buf was used. It's needed in the PCI mapping/unmapping. In i386, PPC, and some other architectures, the mapping is a nop so it might appear unused. I don't think the void *buf pointer is needed. > If it's needed, can it be merged directly into ib_mad_private_header? Sure. I will post a patch for this. 
-- Hal From halr at voltaire.com Wed Sep 22 22:11:20 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 23 Sep 2004 01:11:20 -0400 Subject: [openib-general] [RFC] ib_mad In-Reply-To: <20040922160139.2cf79338.mshefty@ichips.intel.com> References: <1095362984.3919.36.camel@localhost.localdomain> <20040922160139.2cf79338.mshefty@ichips.intel.com> Message-ID: <1095916280.1872.112.camel@localhost.localdomain> On Wed, 2004-09-22 at 19:01, Sean Hefty wrote: > This patch embeds struct ib_mad_agent within struct > ib_mad_agent_private and fixes up some cleanup issues > related to the MAD registration. Thanks! Applied. -- Hal From halr at voltaire.com Wed Sep 22 22:23:20 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 23 Sep 2004 01:23:20 -0400 Subject: [openib-general] [RFC] ib_mad In-Reply-To: <1095914028.1872.90.camel@localhost.localdomain> References: <1095362984.3919.36.camel@localhost.localdomain> <20040922153733.6d76b3df.mshefty@ichips.intel.com> <1095914028.1872.90.camel@localhost.localdomain> Message-ID: <1095917000.1872.115.camel@localhost.localdomain> On Thu, 2004-09-23 at 00:34, Hal Rosenstock wrote: > On Wed, 2004-09-22 at 18:37, Sean Hefty wrote: > > I'll start submitting comments in the form of patches. > > I'll try to keep each patch as small > > and contained as possible. > > OK. Thanks. > > > This first one is fairly easy. > > I didn't see where struct ib_mad_buf was used. > > It's needed in the PCI mapping/unmapping. In i386, PPC, and some > other architectures, the mapping is a nop so it might appear unused. > I don't think the void *buf pointer is needed. > > > If it's needed, can it be merged directly into ib_mad_private_header? > > Sure. I will post a patch for this. 
Here's the patch to fix PCI receive mapping: Index: ib_mad_priv.h =================================================================== --- ib_mad_priv.h (revision 879) +++ ib_mad_priv.h (working copy) @@ -77,15 +77,10 @@ #define MAX_MGMT_VERSION 8 -struct ib_mad_buf { - void *mad_buf; - DECLARE_PCI_UNMAP_ADDR(mapping) -}; - struct ib_mad_private_header { struct ib_mad_recv_wc recv_wc; /* must be first member (for now !!!) */ struct ib_mad_recv_buf recv_buf; - struct ib_mad_buf buf; + DECLARE_PCI_UNMAP_ADDR(mapping) } __attribute__ ((packed)); struct ib_mad_private { Index: ib_mad.c =================================================================== --- ib_mad.c (revision 879) +++ ib_mad.c (working copy) @@ -719,7 +719,7 @@ spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); pci_unmap_single(port_priv->device->dma_device, - pci_unmap_addr(&recv->buf, mapping), + pci_unmap_addr(&recv->header, mapping), sizeof(struct ib_mad_private) - sizeof(struct ib_mad_private_header), PCI_DMA_FROMDEVICE); @@ -976,14 +976,14 @@ port_priv->recv_posted_mad_count[convert_qpnum(qp->qp_num)]++; spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); - pci_unmap_addr_set(&mad_priv->header.buf, mapping, sg_list.addr); + pci_unmap_addr_set(&mad_priv->header, mapping, sg_list.addr); /* Now, post receive WR */ ret = ib_post_recv(qp, &recv_wr, &bad_recv_wr); if (ret) { pci_unmap_single(port_priv->device->dma_device, - pci_unmap_addr(&mad_priv->header.buf, mapping), + pci_unmap_addr(&mad_priv->header, mapping), sizeof *mad_priv - sizeof mad_priv->header, PCI_DMA_FROMDEVICE); From halr at voltaire.com Thu Sep 23 06:40:34 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 23 Sep 2004 09:40:34 -0400 Subject: [openib-general] mthca startup problem References: <220970-220049221192831799@M2W066.mail2web.com> <52y8j3o1q1.fsf@topspin.com> <005101c4a0bf$d93b33f0$4302000a@Gripen> <52ekkumfyf.fsf@topspin.com> Message-ID: <003901c4a172$ec3a3d80$4302000a@Gripen> Roland Dreier wrote: > IRQ 
177 seems like you have CONFIG_PCI_MSI turned on (it looks like a > vector rather than a regular IRQ). What do you get with > CONFIG_PCI_MSI=n? I finally gave up fighting that machine. I may get back to that one. I got another machine on which I was able to get further and bring mthca up. I had some other issues and questions: 1. mthca_cmd.c has a number of compile warnings (built with debug configured). 2. ipoib_main.c line 48 has a compile error indicating directives may not be used inside macro arguments. This is with both ipoib debugs on. MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0" #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA " and data path tracing if > 1" #endif ); Something like the below works: #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0 and data path tracing if > 1"); #else MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0"); #endif 3. mad_static.c:96 complains "Couldn't find suitable network device; setting lid_base to 1". Is this OK ? Also, why is this done and can it be shut off ? Thanks. -- Hal From halr at voltaire.com Thu Sep 23 07:10:04 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 23 Sep 2004 10:10:04 -0400 Subject: [openib-general] IPoIB Loading and Starting Message-ID: <004d01c4a177$08257830$4302000a@Gripen> Hi, I have a couple of questions about IPoIB loading and starting. 1. It appears I need the following modules in this order: core/ib_client_query.ko core/ib_sa_client.ko ulp/ipoib/ib_ipoib.ko ulp/ipoib/ib_ip2pr.ko Any others ? I presume ib_ip2pr is needed. What happens if it is not present ? 2. After all the above modules are loaded, do I just configure the ib0/1 interfaces ? Something like: /sbin/ifconfig ib0 192.168.0.101 netmask 255.255.255.0 up Does anything else need to be done before it should work (other than an SM bringing these links to active) ? Thanks. -- Hal -------------- next part -------------- An HTML attachment was scrubbed...
URL: From Tom.Duffy at Sun.COM Thu Sep 23 07:24:08 2004 From: Tom.Duffy at Sun.COM (Tom Duffy) Date: Thu, 23 Sep 2004 07:24:08 -0700 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <004d01c4a177$08257830$4302000a@Gripen> References: <004d01c4a177$08257830$4302000a@Gripen> Message-ID: <4152DC88.7000700@sun.com> Hal Rosenstock wrote: > Hi, > > I have a couple of questions about IPoIB loading and starting. > > 1. It appears I need the following modules in this order: > core/ib_client_query.ko > core/ib_sa_client.ko > ulp/ipoib/ib_ipoib.ko > ulp/ipoib/ib_ip2pr.ko > Any others ? I presume ib_pr2pr is needed. What happens if it is not > present ? ib_ip2pr is not needed for normal operation. modprobe ib_ipoib should pick up all the dependencies (if you have the modules installed in /lib/modules/`uname -r`) > 2. After all the above modules are loaded, do I just configure the ib0/1 > interfaces ? > Something like: > /sbin/ifconfig ib0 192.168.0.101 netmask 255.255.255.0 up Yup. > Does anything else need to be done before it should work (other than an > SM bringing these links to active) ? It also depends what pkey your subnet manager uses for the IP network. Sun's setup uses pkey 0x8001, so I need to setup a virtual NIC using the ipoibcfg command. -tduffy From krause at cup.hp.com Thu Sep 23 08:34:35 2004 From: krause at cup.hp.com (Michael Krause) Date: Thu, 23 Sep 2004 08:34:35 -0700 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <4152DC88.7000700@sun.com> References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> Message-ID: <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> At 07:24 AM 9/23/2004, Tom Duffy wrote: >Hal Rosenstock wrote: >>Hi, >> >>I have a couple of questions about IPoIB loading and starting. >> >>1. It appears I need the following modules in this order: >>core/ib_client_query.ko >>core/ib_sa_client.ko >>ulp/ipoib/ib_ipoib.ko >>ulp/ipoib/ib_ip2pr.ko >>Any others ? I presume ib_pr2pr is needed. 
What happens if it is not >>present ? > >ib_ip2pr is not needed for normal operation. modprobe ib_ipoib should >pick up all the dependencies (if you have the modules installed in >/lib/modules/`uname -r`) > >>2. After all the above modules are loaded, do I just configure the ib0/1 >>interfaces ? >>Something like: >>/sbin/ifconfig ib0 192.168.0.101 netmask 255.255.255.0 up > >Yup. > >>Does anything else need to be done before it should work (other than an >>SM bringing these links to active) ? > >It also depends what pkey your subnet manager uses for the IP network. >Sun's setup uses pkey 0x8001, so I need to setup a virtual NIC using the >ipoibcfg command. I would have thought this would be part of the IP over IB driver. Communicate with the SM to acquire the P_Key and then use that to perform Arp / ND. Strange to require a command since this gets a bit cumbersome when there are many nodes in the fabric. Mike >-tduffy > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From halr at voltaire.com Thu Sep 23 12:52:36 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 23 Sep 2004 15:52:36 -0400 Subject: [openib-general] [PATCH] [TRIVIAL] ipoib: Fix compile problem with data path debug on Message-ID: <000201c4a1a9$80ed8050$4302000a@Gripen> ipoib: Fix compile problem with data path debug on Index: ipoib_main.c =================================================================== --- ipoib_main.c (revision 880) +++ ipoib_main.c (working copy) @@ -43,12 +43,14 @@ int debug_level; module_param(debug_level, int, 0644); +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0" -#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA - " and data path tracing if > 1" + " and data path tracing if > 1"); +#else +MODULE_PARM_DESC(debug_level, + "Enable debug tracing if > 0"); #endif - ); int mcast_debug_level; -------------- next part -------------- An HTML attachment was scrubbed... URL: From tduffy at sun.com Thu Sep 23 14:05:11 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 23 Sep 2004 14:05:11 -0700 Subject: [openib-general] [PATCH] [TRIVIAL] ipoib: Fix compile problem with data path debug on In-Reply-To: <000201c4a1a9$80ed8050$4302000a@Gripen> References: <000201c4a1a9$80ed8050$4302000a@Gripen> Message-ID: <1095973511.25438.4.camel@duffman> your patch did not apply correctly. perhaps your MUA is changing tabs to spaces? [tduffy at duffman ipoib]$ patch ipoib_main.c /tmp/ipoib.hal patching file ipoib_main.c Hunk #1 FAILED at 43.
1 out of 1 hunk FAILED -- saving rejects to file ipoib_main.c.rej On Thu, 2004-09-23 at 15:52 -0400, Hal Rosenstock wrote: > ipoib: Fix compile problem with data path debug on > > Index: ipoib_main.c > =================================================================== > --- ipoib_main.c (revision 880) > +++ ipoib_main.c (working copy) > @@ -43,12 +43,14 @@ > int debug_level; > > module_param(debug_level, int, 0644); > +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA > MODULE_PARM_DESC(debug_level, > "Enable debug tracing if > 0" > -#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA > - " and data path tracing if > 1" > + " and data path tracing if > 1"); > +#else > +MODULE_PARM_DESC(debug_level, > + "Enable debug tracing if > 0"); > #endif > - ); > > int mcast_debug_level; > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- "When they took the 4th Amendment, I was quiet because I didn't deal drugs. When they took the 6th Amendment, I was quiet because I am innocent. When they took the 2nd Amendment, I was quiet because I don't own a gun. Now they have taken the 1st Amendment, and I can only be quiet." --Lyle Myhr -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From halr at voltaire.com Thu Sep 23 13:31:14 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 23 Sep 2004 16:31:14 -0400 Subject: [openib-general] [PATCH] [TRIVIAL] ipoib: Fix compile problem with data path debug on In-Reply-To: <1095973511.25438.4.camel@duffman> References: <000201c4a1a9$80ed8050$4302000a@Gripen> <1095973511.25438.4.camel@duffman> Message-ID: <1095971474.2644.0.camel@hpc-1> On Thu, 2004-09-23 at 17:05, Tom Duffy wrote: > your patch did apply correctly. perhaps your MUA is changing tabs to > spaces? I hope that's it. Let me try it again. Sorry. -- Hal > > [tduffy at duffman ipoib]$ patch ipoib_main.c /tmp/ipoib.hal > patching file ipoib_main.c > Hunk #1 FAILED at 43. > 1 out of 1 hunk FAILED -- saving rejects to file ipoib_main.c.rej > > > On Thu, 2004-09-23 at 15:52 -0400, Hal Rosenstock wrote: > > ipoib: Fix compile problem with data path debug on > > > > Index: ipoib_main.c > > =================================================================== > > --- ipoib_main.c (revision 880) > > +++ ipoib_main.c (working copy) > > @@ -43,12 +43,14 @@ > > int debug_level; > > > > module_param(debug_level, int, 0644); > > +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA > > MODULE_PARM_DESC(debug_level, > > "Enable debug tracing if > 0" > > -#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA > > - " and data path tracing if > 1" > > + " and data path tracing if > 1"); > > +#else > > +MODULE_PARM_DESC(debug_level, > > + "Enable debug tracing if > 0"); > > #endif > > - ); > > > > int mcast_debug_level; > > > > > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Thu Sep 23 13:34:18 2004 From: halr at voltaire.com 
(Hal Rosenstock) Date: Thu, 23 Sep 2004 16:34:18 -0400 Subject: [openib-general] [PATCH] [TRIVIAL] ipoib: Fix compile problem with data path debug on Take 2 Message-ID: <1095971657.2644.4.camel@hpc-1> ipoib: Fix compile problem with data path debug on [Take 2} Index: ipoib_main.c =================================================================== --- ipoib_main.c (revision 880) +++ ipoib_main.c (working copy) @@ -43,12 +43,14 @@ int debug_level; module_param(debug_level, int, 0644); +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0" -#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA - " and data path tracing if > 1" + " and data path tracing if > 1"); +#else +MODULE_PARM_DESC(debug_level, + "Enable debug tracing if > 0"); #endif - ); int mcast_debug_level; From tduffy at sun.com Thu Sep 23 14:23:19 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 23 Sep 2004 14:23:19 -0700 Subject: [openib-general] [PATCH] [TRIVIAL] ipoib: Fix compile problem with data path debug on Take 2 In-Reply-To: <1095971657.2644.4.camel@hpc-1> References: <1095971657.2644.4.camel@hpc-1> Message-ID: <1095974599.25438.5.camel@duffman> On Thu, 2004-09-23 at 16:34 -0400, Hal Rosenstock wrote: > ipoib: Fix compile problem with data path debug on [Take 2} Now it patched fine: [tduffy at duffman ipoib]$ patch ipoib_main.c /tmp/ipoib.hal.take2 patching file ipoib_main.c -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From mshefty at ichips.intel.com Thu Sep 23 17:17:45 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 23 Sep 2004 17:17:45 -0700 Subject: [openib-general] [PATCH] separates QP0/1 interactions Message-ID: <20040923171745.75e46fe9.mshefty@ichips.intel.com> The following patch separates the interactions between QP 0 and 1 in the MAD code. 
Each QP now has its own queuing, locking, completion handling, error handling, etc.

I have a list of several changes for the MAD code that I will try to get to tomorrow. Please let me know if you have any questions.

- Sean

--
Index: access/ib_mad_priv.h
===================================================================
--- access/ib_mad_priv.h	(revision 880)
+++ access/ib_mad_priv.h	(working copy)
@@ -95,6 +95,7 @@
 
 struct ib_mad_agent_private {
 	struct list_head agent_list;
+	struct ib_mad_qp_info *qp_info;
 	struct ib_mad_agent agent;
 	struct ib_mad_reg_req *reg_req;
 	u8 rmpp_version;
@@ -115,17 +116,27 @@
 	struct ib_mad_mgmt_method_table *method_table[MAX_MGMT_CLASS];
 };
 
-struct ib_mad_thread_private {
+struct ib_mad_qp_info {
+	struct ib_mad_port_private *port_priv;
+	struct ib_qp *qp;
+	struct ib_cq *cq;
+
+	spinlock_t send_list_lock;
+	struct list_head send_posted_mad_list;
+	int send_posted_mad_count;
+
+	spinlock_t recv_list_lock;
+	struct list_head recv_posted_mad_list;
+	int recv_posted_mad_count;
+
+	struct task_struct *mad_thread;
 	wait_queue_head_t wait;
 };
 
 struct ib_mad_port_private {
 	struct list_head port_list;
-	struct task_struct *mad_thread;
 	struct ib_device *device;
 	int port_num;
-	struct ib_qp *qp[IB_MAD_QPS_SUPPORTED];
-	struct ib_cq *cq;
 	struct ib_pd *pd;
 	struct ib_mr *mr;
 
@@ -133,15 +144,7 @@
 	struct ib_mad_mgmt_class_table *version[MAX_MGMT_VERSION];
 	struct list_head agent_list;
 
-	spinlock_t send_list_lock;
-	struct list_head send_posted_mad_list;
-	int send_posted_mad_count;
-
-	spinlock_t recv_list_lock;
-	struct list_head recv_posted_mad_list[IB_MAD_QPS_SUPPORTED];
-	int recv_posted_mad_count[IB_MAD_QPS_SUPPORTED];
-
-	struct ib_mad_thread_private mad_thread_private;
+	struct ib_mad_qp_info qp_info[IB_MAD_QPS_CORE];
 };
 
 #endif	/* __IB_MAD_PRIV_H__ */
Index: access/ib_mad.c
===================================================================
--- access/ib_mad.c	(revision 880)
+++ access/ib_mad.c	(working copy)
@@ -81,12 +81,10 @@
 static int
add_mad_reg_req(struct ib_mad_reg_req *mad_reg_req, struct ib_mad_agent_private *priv); static void remove_mad_reg_req(struct ib_mad_agent_private *priv); -static int ib_mad_port_restart(struct ib_mad_port_private *priv); -static int ib_mad_post_receive_mad(struct ib_mad_port_private *port_priv, - struct ib_qp *qp); -static int ib_mad_post_receive_mads(struct ib_mad_port_private *priv); +static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info); +static int ib_mad_post_receive_mad(struct ib_mad_qp_info *qp_info); static inline u8 convert_mgmt_class(u8 mgmt_class); - +static int ib_mad_restart_qp(struct ib_mad_qp_info *qp_info); /* * ib_register_mad_agent - Register to send/receive MADs @@ -205,11 +203,12 @@ memset(mad_agent_priv, 0, sizeof *mad_agent_priv); mad_agent_priv->reg_req = reg_req; mad_agent_priv->rmpp_version = rmpp_version; + mad_agent_priv->qp_info = &port_priv->qp_info[qp_type]; mad_agent_priv->agent.device = device; mad_agent_priv->agent.recv_handler = recv_handler; mad_agent_priv->agent.send_handler = send_handler; mad_agent_priv->agent.context = context; - mad_agent_priv->agent.qp = port_priv->qp[qp_type]; + mad_agent_priv->agent.qp = port_priv->qp_info[qp_type].qp; mad_agent_priv->agent.hi_tid = ++ib_mad_client_id; ret2 = add_mad_reg_req(mad_reg_req, mad_agent_priv); @@ -287,6 +286,7 @@ struct ib_send_wr *cur_send_wr, *next_send_wr; struct ib_send_wr wr; struct ib_send_wr *bad_wr; + struct ib_mad_agent_private *mad_agent_priv; struct ib_mad_send_wr_private *mad_send_wr; unsigned long flags; @@ -297,6 +297,9 @@ return -EINVAL; } + mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, + agent); + /* Walk list of send WRs and post each on send list */ cur_send_wr = send_wr; while (cur_send_wr) { @@ -330,19 +333,22 @@ wr.send_flags = IB_SEND_SIGNALED; /* cur_send_wr->send_flags ? 
*/ /* Link send WR into posted send MAD list */ - spin_lock_irqsave(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, flags); + spin_lock_irqsave(&mad_agent_priv->qp_info->send_list_lock, flags); list_add_tail(&mad_send_wr->send_list, - &((struct ib_mad_port_private *)mad_agent->device->mad)->send_posted_mad_list); - ((struct ib_mad_port_private *)mad_agent->device->mad)->send_posted_mad_count++; - spin_unlock_irqrestore(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, flags); + &mad_agent_priv->qp_info->send_posted_mad_list); + mad_agent_priv->qp_info->send_posted_mad_count++; + spin_unlock_irqrestore(&mad_agent_priv->qp_info->send_list_lock, + flags); ret = ib_post_send(mad_agent->qp, &wr, &bad_wr); if (ret) { /* Unlink from posted send MAD list */ - spin_unlock_irqrestore(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, flags); + spin_unlock_irqrestore( + &mad_agent_priv->qp_info->send_list_lock, flags); list_del(&mad_send_wr->send_list); - ((struct ib_mad_port_private *)mad_agent->device->mad)->send_posted_mad_count--; - spin_unlock_irqrestore(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, flags); + mad_agent_priv->qp_info->send_posted_mad_count--; + spin_unlock_irqrestore( + &mad_agent_priv->qp_info->send_list_lock, flags); *bad_send_wr = cur_send_wr; printk(KERN_NOTICE "ib_post_mad_send failed\n"); return ret; @@ -361,19 +367,32 @@ void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc) { struct ib_mad_recv_buf *entry; - struct ib_mad_private *buffer = (struct ib_mad_private *)mad_recv_wc; + struct ib_mad_private_header *mad_private_header; + struct ib_mad_private *mad_private; /* * Walk receive buffer list associated with this WC * No need to remove them from list of receive buffers */ + mad_private_header = container_of(mad_recv_wc, + struct ib_mad_private_header, + recv_wc); + mad_private = container_of(mad_private_header, + struct ib_mad_private, + header); 
+ list_for_each_entry(entry, &mad_recv_wc->recv_buf->list, list) { /* Free previous receive buffer */ - kmem_cache_free(ib_mad_cache, buffer); - buffer = (void *)entry - sizeof(struct ib_mad_private_header); + kmem_cache_free(ib_mad_cache, mad_private); + mad_private_header = container_of(mad_recv_wc, + struct ib_mad_private_header, + recv_wc); + mad_private = container_of(mad_private_header, + struct ib_mad_private, + header); } /* Free last buffer */ - kmem_cache_free(ib_mad_cache, buffer); + kmem_cache_free(ib_mad_cache, mad_private); } EXPORT_SYMBOL(ib_free_recv_mad); @@ -567,20 +586,6 @@ } } -static int convert_qpnum(u32 qp_num) -{ - /* - * No redirection currently!!! - * QP0 and QP1 only - * Ultimately, will need table of QP numbers and table index - * as QP numbers will not be packed once redirection supported - */ - if (qp_num > 1) { - printk(KERN_ERR "QP number %d invalid\n", qp_num); - } - return qp_num; -} - static int response_mad(struct ib_mad *mad) { /* Trap represses are responses although response bit is reset */ @@ -622,7 +627,7 @@ /* Whether MAD was solicited determines type of routing to MAD client */ if (solicited) { /* Routing is based on high 32 bits of transaction ID of MAD */ - hi_tid = mad->mad_hdr.tid >> 32; + hi_tid = (u32)(mad->mad_hdr.tid >> 32); list_for_each_entry(entry, &port_priv->agent_list, agent_list) { if (entry->agent.hi_tid == hi_tid) { mad_agent = entry; @@ -631,7 +636,7 @@ } if (!mad_agent) { printk(KERN_ERR "No client 0x%x for received MAD\n", - (u32)(mad->mad_hdr.tid >> 32)); + hi_tid); goto ret; } } else { @@ -643,12 +648,14 @@ } version = port_priv->version[mad->mad_hdr.class_version]; if (!version) { - printk(KERN_ERR "MAD received for class version %d with no client\n", mad->mad_hdr.class_version); + printk(KERN_ERR "MAD received for class version %d with no client\n", + mad->mad_hdr.class_version); goto ret; } class = version->method_table[convert_mgmt_class(mad->mad_hdr.mgmt_class)]; if (!class) { - printk(KERN_ERR 
"MAD receive for class %d with no client\n", mad->mad_hdr.mgmt_class); + printk(KERN_ERR "MAD receive for class %d with no client\n", + mad->mad_hdr.mgmt_class); goto ret; } mad_agent = class->agent[mad->mad_hdr.method & ~IB_MGMT_METHOD_RESP]; @@ -684,48 +691,43 @@ return valid; } -static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, +static void ib_mad_recv_done_handler(struct ib_mad_qp_info *qp_info, struct ib_wc *wc) { struct ib_mad_private *recv; unsigned long flags; - u32 qp_num; struct ib_mad_agent_private *mad_agent; int solicited; - /* For receive, WC WRID is the QP number */ - qp_num = wc->wr_id; - /* * Completion corresponds to first entry on * posted MAD receive list based on WRID in completion */ - spin_lock_irqsave(&port_priv->recv_list_lock, flags); - if (!list_empty(&port_priv->recv_posted_mad_list[convert_qpnum(qp_num)])) { - recv = list_entry(&port_priv->recv_posted_mad_list[convert_qpnum(qp_num)], + spin_lock_irqsave(&qp_info->recv_list_lock, flags); + if (!list_empty(&qp_info->recv_posted_mad_list)) { + recv = list_entry(&qp_info->recv_posted_mad_list, struct ib_mad_private, header.recv_buf.list); /* Remove from posted receive MAD list */ list_del(&recv->header.recv_buf.list); - - port_priv->recv_posted_mad_count[convert_qpnum(qp_num)]--; + qp_info->recv_posted_mad_count--; } else { - printk(KERN_ERR "Receive completion WR ID 0x%Lx on QP %d with no posted receive\n", wc->wr_id, qp_num); - spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); + printk(KERN_ERR "Receive completion with no posted receive\n"); + spin_unlock_irqrestore(&qp_info->recv_list_lock, flags); return; } - spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); + spin_unlock_irqrestore(&qp_info->recv_list_lock, flags); - pci_unmap_single(port_priv->device->dma_device, + pci_unmap_single(qp_info->port_priv->device->dma_device, pci_unmap_addr(&recv->header, mapping), sizeof(struct ib_mad_private) - sizeof(struct ib_mad_private_header), 
PCI_DMA_FROMDEVICE); /* Setup MAD receive work completion from "normal" work completion */ recv->header.recv_wc.wc = wc; - recv->header.recv_wc.mad_len = sizeof(struct ib_mad); /* Should this be based on wc->byte_len ? Also, RMPP !!! */ + recv->header.recv_wc.mad_len = sizeof(struct ib_mad); /* ignore GRH size */ recv->header.recv_wc.recv_buf = &recv->header.recv_buf; /* Setup MAD receive buffer */ @@ -738,15 +740,15 @@ } /* Validate MAD */ - if (!validate_mad(recv->header.recv_buf.mad, qp_num)) - goto ret; + if (!validate_mad(recv->header.recv_buf.mad, qp_info->qp->qp_num)) + return; /* Determine corresponding MAD agent for incoming receive MAD */ - spin_lock_irqsave(&port_priv->reg_lock, flags); + spin_lock_irqsave(&qp_info->port_priv->reg_lock, flags); /* First, determine whether MAD was solicited */ solicited = solicited_mad(recv->header.recv_buf.mad); /* Now, find the mad agent */ - mad_agent = find_mad_agent(port_priv, + mad_agent = find_mad_agent(qp_info->port_priv, recv->header.recv_buf.mad, solicited); if (!mad_agent) { @@ -757,49 +759,40 @@ printk(KERN_DEBUG "Currently unsupported solicited MAD received\n"); } + /* Release locking before callback... 
*/ /* Invoke receive callback */ mad_agent->agent.recv_handler(&mad_agent->agent, &recv->header.recv_wc); } - spin_unlock_irqrestore(&port_priv->reg_lock, flags); - - /* Post another receive request for this QP */ - ib_mad_post_receive_mad(port_priv, port_priv->qp[qp_num]); - -ret: - return; + spin_unlock_irqrestore(&qp_info->port_priv->reg_lock, flags); } -static void ib_mad_send_done_handler(struct ib_mad_port_private *port_priv, +static void ib_mad_send_done_handler(struct ib_mad_qp_info *qp_info, struct ib_wc *wc) { struct ib_mad_send_wr_private *send_wr; unsigned long flags; /* Completion corresponds to first entry on posted MAD send list */ - spin_lock_irqsave(&port_priv->send_list_lock, flags); - if (!list_empty(&port_priv->send_posted_mad_list)) { - send_wr = list_entry(&port_priv->send_posted_mad_list, + spin_lock_irqsave(&qp_info->send_list_lock, flags); + if (!list_empty(&qp_info->send_posted_mad_list)) { + send_wr = list_entry(&qp_info->send_posted_mad_list, struct ib_mad_send_wr_private, send_list); - if (send_wr->wr_id != wc->wr_id) { - printk(KERN_ERR "Send completion WR ID 0x%Lx doesn't match posted send WR ID 0x%Lx\n", wc->wr_id, send_wr->wr_id); - - goto error; - } - /* Check whether timeout was requested !!! */ /* Remove from posted send MAD list */ list_del(&send_wr->send_list); - port_priv->send_posted_mad_count--; + qp_info->send_posted_mad_count--; } else { - printk(KERN_ERR "Send completion WR ID 0x%Lx but send list is empty\n", wc->wr_id); + printk(KERN_ERR "Send completion WR ID 0x%Lx but send list is empty\n", wc->wr_id); goto error; } - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + spin_unlock_irqrestore(&qp_info->send_list_lock, flags); + + /* Synchronize with deregistration... 
*/ /* Restore client wr_id in WC */ wc->wr_id = send_wr->wr_id; @@ -811,20 +804,19 @@ return; error: - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + spin_unlock_irqrestore(&qp_info->send_list_lock, flags); return; } /* * IB MAD completion callback */ -static void ib_mad_completion_handler(struct ib_mad_port_private *port_priv) +static void ib_mad_completion_handler(struct ib_mad_qp_info *qp_info) { struct ib_wc wc; int err_status = 0; - while (!ib_poll_cq(port_priv->cq, 1, &wc)) { - printk(KERN_DEBUG "Completion - WR ID = 0x%Lx\n", wc.wr_id); + while (!ib_poll_cq(qp_info->cq, 1, &wc)) { if (wc.status != IB_WC_SUCCESS) { switch (wc.opcode) { @@ -846,10 +838,11 @@ switch (wc.opcode) { case IB_WC_SEND: - ib_mad_send_done_handler(port_priv, &wc); + ib_mad_send_done_handler(qp_info, &wc); break; case IB_WC_RECV: - ib_mad_recv_done_handler(port_priv, &wc); + ib_mad_recv_done_handler(qp_info, &wc); + ib_mad_post_receive_mad(qp_info); break; default: printk(KERN_ERR "Wrong Opcode: %d\n", wc.opcode); @@ -861,76 +854,43 @@ } if (err_status) { - ib_mad_port_restart(port_priv); + ib_mad_restart_qp(qp_info); } else { - ib_mad_post_receive_mads(port_priv); - ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); + ib_req_notify_cq(qp_info->cq, IB_CQ_NEXT_COMP); } } /* * IB MAD thread */ -static int ib_mad_thread(void *param) +static int ib_mad_thread_handler(void *param) { - struct ib_mad_port_private *port_priv = param; - struct ib_mad_thread_private *mad_thread_priv = &port_priv->mad_thread_private; + struct ib_mad_qp_info *qp_info = param; int ret; while (1) { while (!signal_pending(current)) { - ret = wait_event_interruptible(mad_thread_priv->wait, 0); + ret = wait_event_interruptible(qp_info->wait, 0); if (ret) { printk(KERN_ERR "ib_mad thread exiting\n"); return 0; } - ib_mad_completion_handler(port_priv); - + ib_mad_completion_handler(qp_info); } } } -/* - * Initialize the IB MAD thread - */ -static int ib_mad_thread_init(struct ib_mad_port_private *port_priv) 
-{ - struct ib_mad_thread_private *mad_thread_priv = &port_priv->mad_thread_private; - - init_waitqueue_head(&mad_thread_priv->wait); - - port_priv->mad_thread = kthread_create(ib_mad_thread, - port_priv, - "ib_mad-%-6s-%-2d", - port_priv->device->name, - port_priv->port_num); - if (IS_ERR(port_priv->mad_thread)) { - printk(KERN_ERR "Couldn't start mad thread for %s port %d\n", - port_priv->device->name, port_priv->port_num); - return 1; - } - return 0; -} - -/* - * Stop the IB MAD thread - */ -static void ib_mad_thread_stop(struct ib_mad_port_private *port_priv) -{ - kthread_stop(port_priv->mad_thread); /* !!! */ -} static void ib_mad_thread_completion_handler(struct ib_cq *cq) { - struct ib_mad_port_private *port_priv = cq->cq_context; - struct ib_mad_thread_private *mad_thread_priv = &port_priv->mad_thread_private; + struct ib_mad_qp_info *qp_info; - wake_up_interruptible(&mad_thread_priv->wait); + qp_info = (struct ib_mad_qp_info*)cq->cq_context; + wake_up_interruptible(&qp_info->wait); } -static int ib_mad_post_receive_mad(struct ib_mad_port_private *port_priv, - struct ib_qp *qp) +static int ib_mad_post_receive_mad(struct ib_mad_qp_info *qp_info) { struct ib_mad_private *mad_priv; struct ib_sge sg_list; @@ -955,43 +915,42 @@ } /* Setup scatter list */ - sg_list.addr = pci_map_single(port_priv->device->dma_device, + sg_list.addr = pci_map_single(qp_info->port_priv->device->dma_device, &mad_priv->grh, sizeof *mad_priv - sizeof mad_priv->header, PCI_DMA_FROMDEVICE); sg_list.length = sizeof *mad_priv - sizeof mad_priv->header; - sg_list.lkey = (*port_priv->mr).lkey; + sg_list.lkey = qp_info->port_priv->mr->lkey; /* Setup receive WR */ recv_wr.next = NULL; recv_wr.sg_list = &sg_list; recv_wr.num_sge = 1; recv_wr.recv_flags = IB_RECV_SIGNALED; - recv_wr.wr_id = qp->qp_num; /* 32 bits left */ /* Link receive WR into posted receive MAD list */ - spin_lock_irqsave(&port_priv->recv_list_lock, flags); + spin_lock_irqsave(&qp_info->recv_list_lock, flags); 
list_add_tail(&mad_priv->header.recv_buf.list, - &port_priv->recv_posted_mad_list[convert_qpnum(qp->qp_num)]); - port_priv->recv_posted_mad_count[convert_qpnum(qp->qp_num)]++; - spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); + &qp_info->recv_posted_mad_list); + qp_info->recv_posted_mad_count++; + spin_unlock_irqrestore(&qp_info->recv_list_lock, flags); pci_unmap_addr_set(&mad_priv->header, mapping, sg_list.addr); /* Now, post receive WR */ - ret = ib_post_recv(qp, &recv_wr, &bad_recv_wr); + ret = ib_post_recv(qp_info->qp, &recv_wr, &bad_recv_wr); if (ret) { - pci_unmap_single(port_priv->device->dma_device, + pci_unmap_single(qp_info->port_priv->device->dma_device, pci_unmap_addr(&mad_priv->header, mapping), sizeof *mad_priv - sizeof mad_priv->header, PCI_DMA_FROMDEVICE); /* Unlink from posted receive MAD list */ - spin_lock_irqsave(&port_priv->recv_list_lock, flags); + spin_lock_irqsave(&qp_info->recv_list_lock, flags); list_del(&mad_priv->header.recv_buf.list); - port_priv->recv_posted_mad_count[convert_qpnum(qp->qp_num)]--; - spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); + qp_info->recv_posted_mad_count--; + spin_unlock_irqrestore(&qp_info->recv_list_lock, flags); kmem_cache_free(ib_mad_cache, mad_priv); printk(KERN_NOTICE "ib_post_recv failed ret = %d\n", ret); @@ -1004,65 +963,61 @@ /* * Allocate receive MADs and post receive WRs for them */ -static int ib_mad_post_receive_mads(struct ib_mad_port_private *port_priv) +static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info) { - int i, j; + int i, ret = 0; - for (i = 0; i < IB_MAD_QP_RECV_SIZE; i++) { - for (j = 0; j < IB_MAD_QPS_CORE; j++) { - if (ib_mad_post_receive_mad(port_priv, - port_priv->qp[j])) { - printk(KERN_ERR "receive post %d failed on %s port %d\n", - i + 1, port_priv->device->name, - port_priv->port_num); - } + for (i = qp_info->recv_posted_mad_count; i < IB_MAD_QP_RECV_SIZE; i++) { + ret = ib_mad_post_receive_mad(qp_info); + if (ret) { + printk(KERN_ERR 
"receive post %d failed on %s port %d\n", + i + 1, qp_info->port_priv->device->name, + qp_info->port_priv->port_num); + break; } } - return 0; + return ret; } /* * Return all the posted receive MADs */ -static void ib_mad_return_posted_recv_mads(struct ib_mad_port_private *port_priv) +static void ib_mad_return_posted_recv_mads(struct ib_mad_qp_info *qp_info) { - int i; unsigned long flags; - for (i = 0; i < IB_MAD_QPS_SUPPORTED; i++) { - spin_lock_irqsave(&port_priv->recv_list_lock, flags); - while (!list_empty(&port_priv->recv_posted_mad_list[i])) { - - /* PCI mapping !!! */ + spin_lock_irqsave(&qp_info->recv_list_lock, flags); + while (!list_empty(&qp_info->recv_posted_mad_list)) { - } - INIT_LIST_HEAD(&port_priv->recv_posted_mad_list[i]); - port_priv->recv_posted_mad_count[i] = 0; - spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); + /* PCI mapping !!! */ + list_del(&qp_info->recv_posted_mad_list); } + INIT_LIST_HEAD(&qp_info->recv_posted_mad_list); + qp_info->recv_posted_mad_count = 0; + spin_unlock_irqrestore(&qp_info->recv_list_lock, flags); } /* * Return all the posted send MADs */ -static void ib_mad_return_posted_send_mads(struct ib_mad_port_private *port_priv) +static void ib_mad_return_posted_send_mads(struct ib_mad_qp_info *qp_info) { unsigned long flags; - spin_lock_irqsave(&port_priv->send_list_lock, flags); - while (!list_empty(&port_priv->send_posted_mad_list)) { + spin_lock_irqsave(&qp_info->send_list_lock, flags); + while (!list_empty(&qp_info->send_posted_mad_list)) { /* PCI mapping ? */ - list_del(&port_priv->send_posted_mad_list); + list_del(&qp_info->send_posted_mad_list); /* Call completion handler with some status ? 
*/ } - INIT_LIST_HEAD(&port_priv->send_posted_mad_list); - port_priv->send_posted_mad_count = 0; - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + INIT_LIST_HEAD(&qp_info->send_posted_mad_list); + qp_info->send_posted_mad_count = 0; + spin_unlock_irqrestore(&qp_info->send_list_lock, flags); } /* @@ -1087,13 +1042,12 @@ * one is needed for the Reset to Init transition. */ attr->pkey_index = 0; - attr->port_num = port_num; /* QKey is 0 for QP0 */ if (qp->qp_num == 0) attr->qkey = 0; else attr->qkey = IB_QP1_QKEY; - attr_mask = IB_QP_STATE | IB_QP_PKEY_INDEX | IB_QP_PORT | IB_QP_QKEY; + attr_mask = IB_QP_STATE | IB_QP_PKEY_INDEX | IB_QP_QKEY; ret = ib_modify_qp(qp, attr, attr_mask, &qp_cap); kfree(attr); @@ -1182,93 +1136,180 @@ } /* - * Start the port + * Halt operations on the specified QP. */ -static int ib_mad_port_start(struct ib_mad_port_private *port_priv) +static void ib_mad_stop_qp(struct ib_mad_qp_info *qp_info) { - int ret, i, ret2; + int ret; - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_change_qp_state_to_init(port_priv->qp[i], - port_priv->port_num); - if (ret) { - printk(KERN_ERR "Could not change QP%d state to INIT\n", i); - return ret; - } + ret = ib_mad_change_qp_state_to_reset(qp_info->qp); + if (ret) { + printk(KERN_ERR "ib_mad_qp_stop: Could not change %s port %d QP%d state to RESET\n", + qp_info->port_priv->device->name, + qp_info->port_priv->port_num, qp_info->qp->qp_num); + } + + ib_mad_return_posted_recv_mads(qp_info); + ib_mad_return_posted_send_mads(qp_info); +} + +/* + * Start operations on the specified QP. 
+ */ +static int ib_mad_start_qp(struct ib_mad_qp_info *qp_info) +{ + int ret; + + ret = ib_mad_change_qp_state_to_init(qp_info->qp, + qp_info->port_priv->port_num); + if (ret) { + printk(KERN_ERR "Could not change QP%d state to INIT\n", + qp_info->qp->qp_num); + return ret; } - ret = ib_mad_post_receive_mads(port_priv); + ret = ib_mad_post_receive_mads(qp_info); if (ret) { printk(KERN_ERR "Could not post receive requests\n"); goto error; } - ret = ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); + ret = ib_mad_change_qp_state_to_rtr(qp_info->qp); if (ret) { - printk(KERN_ERR "Failed to request completion notification\n"); + printk(KERN_ERR "Could not change QP%d state to RTR\n", + qp_info->qp->qp_num); goto error; } - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_change_qp_state_to_rtr(port_priv->qp[i]); - if (ret) { - printk(KERN_ERR "Could not change QP%d state to RTR\n", i); - goto error; - } + ret = ib_mad_change_qp_state_to_rts(qp_info->qp); + if (ret) { + printk(KERN_ERR "Could not change QP%d state to RTS\n", + qp_info->qp->qp_num); + goto error; + } - ret = ib_mad_change_qp_state_to_rts(port_priv->qp[i]); - if (ret) { - printk(KERN_ERR "Could not change QP%d state to RTS\n", i); - goto error; - } + /* Don't report receive completions until we're ready to send. */ + ret = ib_req_notify_cq(qp_info->cq, IB_CQ_NEXT_COMP); + if (ret) { + printk(KERN_ERR "Failed to request completion notification\n"); + goto error; } return 0; -error: - ib_mad_return_posted_recv_mads(port_priv); - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret2 = ib_mad_change_qp_state_to_reset(port_priv->qp[i]); - if (ret2) { - printk(KERN_ERR "ib_mad_port_start: Could not change QP%d state to RESET\n", i); - } - } +error: + ib_mad_stop_qp(qp_info); return ret; } /* - * Stop the port + * Restart operations on the specified QP. 
*/ -static void ib_mad_port_stop(struct ib_mad_port_private *port_priv) +static int ib_mad_restart_qp(struct ib_mad_qp_info *qp_info) { - int i, ret; + int ret; - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_change_qp_state_to_reset(port_priv->qp[i]); - if (ret) { - printk(KERN_ERR "ib_mad_port_stop: Could not change %s port %d QP%d state to RESET\n", - port_priv->device->name, port_priv->port_num, i); - } - } + /* Need to synchronize this against user's posting MADs... */ + ib_mad_stop_qp(qp_info); + ret = ib_mad_start_qp(qp_info); + if (ret) { + printk(KERN_ERR "Could not restart %s port %d QP %d\n", + qp_info->port_priv->device->name, + qp_info->port_priv->port_num, qp_info->qp->qp_num); + } + + return ret; +} + + +static void ib_mad_destroy_qp(struct ib_mad_qp_info *qp_info) +{ + /* Stop processing completions. */ + kthread_stop(qp_info->mad_thread); + ib_mad_stop_qp(qp_info); - ib_mad_return_posted_recv_mads(port_priv); - ib_mad_return_posted_send_mads(port_priv); + ib_destroy_qp(qp_info->qp); + ib_destroy_cq(qp_info->cq); } -/* - * Restart the port - */ -static int ib_mad_port_restart(struct ib_mad_port_private *port_priv) +static int ib_mad_init_qp(struct ib_mad_port_private *port_priv, + struct ib_mad_qp_info *qp_info, + enum ib_qp_type qp_type) { - int ret; + int ret, cq_size; + struct ib_qp_init_attr qp_init_attr; + struct ib_qp_cap qp_cap; - ib_mad_port_stop(port_priv); - ret = ib_mad_port_start(port_priv); - if (ret) { - printk(KERN_ERR "Could not restart %s port %d\n", - port_priv->device->name, port_priv->port_num); + qp_info->port_priv = port_priv; + + /* Allocate CQ */ + cq_size = IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE; + qp_info->cq = ib_create_cq(port_priv->device, + (ib_comp_handler)ib_mad_thread_completion_handler, + NULL, qp_info, cq_size); + if (IS_ERR(qp_info->cq)) { + printk(KERN_ERR "Could not create ib_mad CQ\n"); + return PTR_ERR(qp_info->cq); + } + + /* Allocate QP */ + memset(&qp_init_attr, 0, sizeof qp_init_attr); + 
qp_init_attr.send_cq = qp_info->cq; + qp_init_attr.recv_cq = qp_info->cq; + qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.rq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE; + qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE; + qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG; + qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; + qp_init_attr.qp_type = qp_type; + qp_init_attr.port_num = port_priv->port_num; + + qp_info->qp = ib_create_qp(port_priv->pd, &qp_init_attr, &qp_cap); + if (IS_ERR(qp_info->qp)) { + printk(KERN_ERR "Could not create ib_mad QP%d\n", + qp_info->qp->qp_num); + ret = PTR_ERR(qp_info->qp); + goto error1; + } + + spin_lock_init(&qp_info->send_list_lock); + INIT_LIST_HEAD(&qp_info->send_posted_mad_list); + qp_info->send_posted_mad_count = 0; + + spin_lock_init(&qp_info->recv_list_lock); + INIT_LIST_HEAD(&qp_info->recv_posted_mad_list); + qp_info->recv_posted_mad_count = 0; + + /* Startup the completion thread. */ + init_waitqueue_head(&qp_info->wait); + qp_info->mad_thread = kthread_create(ib_mad_thread_handler, + qp_info, + "ib_mad-%-6s-%-2d-%-4d", + qp_info->port_priv->device->name, + qp_info->port_priv->port_num, + qp_info->qp->qp_num); + if (IS_ERR(qp_info->mad_thread)) { + printk(KERN_ERR "Couldn't start mad thread for %s port %d\n", + qp_info->port_priv->device->name, + qp_info->port_priv->port_num); + ret = PTR_ERR(qp_info->mad_thread); + goto error2; } + /* Start the QP. 
*/ + ret = ib_mad_start_qp(qp_info); + if (ret) + goto error3; + + return 0; + +error3: + kthread_stop(qp_info->mad_thread); +error2: + ib_destroy_qp(qp_info->qp); +error1: + ib_destroy_cq(qp_info->cq); return ret; } @@ -1278,14 +1319,12 @@ */ static int ib_mad_port_open(struct ib_device *device, int port_num) { - int ret, cq_size, i; + int ret, i, qp; u64 iova = 0; struct ib_phys_buf buf_list = { .addr = 0, .size = (unsigned long) high_memory - PAGE_OFFSET }; - struct ib_qp_init_attr qp_init_attr; - struct ib_qp_cap qp_cap; struct ib_mad_port_private *entry, *port_priv = NULL; unsigned long flags; @@ -1320,21 +1359,11 @@ port_priv->version[i] = NULL; } - cq_size = IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE; - port_priv->cq = ib_create_cq(port_priv->device, - (ib_comp_handler) ib_mad_thread_completion_handler, - NULL, port_priv, cq_size); - if (IS_ERR(port_priv->cq)) { - printk(KERN_ERR "Could not create ib_mad CQ\n"); - ret = PTR_ERR(port_priv->cq); - goto error3; - } - port_priv->pd = ib_alloc_pd(device); if (IS_ERR(port_priv->pd)) { printk(KERN_ERR "Could not create ib_mad PD\n"); ret = PTR_ERR(port_priv->pd); - goto error4; + goto error1; } port_priv->mr = ib_reg_phys_mr(port_priv->pd, &buf_list, 1, @@ -1342,58 +1371,19 @@ if (IS_ERR(port_priv->mr)) { printk(KERN_ERR "Could not register ib_mad MR\n"); ret = PTR_ERR(port_priv->mr); - goto error5; + goto error2; } - for (i = 0; i < IB_MAD_QPS_CORE; i++) { - memset(&qp_init_attr, 0, sizeof qp_init_attr); - qp_init_attr.send_cq = port_priv->cq; - qp_init_attr.recv_cq = port_priv->cq; - qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; - qp_init_attr.rq_sig_type = IB_SIGNAL_ALL_WR; - qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE; - qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE; - qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG; - qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; - if (i == 0) - qp_init_attr.qp_type = IB_QPT_SMI; - else - qp_init_attr.qp_type = IB_QPT_GSI; - qp_init_attr.port_num = 
port_priv->port_num; - port_priv->qp[i] = ib_create_qp(port_priv->pd, &qp_init_attr, - &qp_cap); - if (IS_ERR(port_priv->qp[i])) { - printk(KERN_ERR "Could not create ib_mad QP%d\n", i); - ret = PTR_ERR(port_priv->qp[i]); - if (i == 0) - goto error6; - else - goto error7; - } - printk(KERN_DEBUG "Created ib_mad QP %d\n", - port_priv->qp[i]->qp_num); + for (qp = 0; qp < IB_MAD_QPS_CORE; qp++) { + ret = ib_mad_init_qp(port_priv, + &port_priv->qp_info[qp], + qp ? IB_QPT_GSI : IB_QPT_SMI); + if (ret) + goto error3; } spin_lock_init(&port_priv->reg_lock); - spin_lock_init(&port_priv->recv_list_lock); - spin_lock_init(&port_priv->send_list_lock); INIT_LIST_HEAD(&port_priv->agent_list); - INIT_LIST_HEAD(&port_priv->send_posted_mad_list); - port_priv->send_posted_mad_count = 0; - for (i = 0; i < IB_MAD_QPS_SUPPORTED; i++) { - INIT_LIST_HEAD(&port_priv->recv_posted_mad_list[i]); - port_priv->recv_posted_mad_count[i] = 0; - } - - ret = ib_mad_thread_init(port_priv); - if (ret) - goto error8; - - ret = ib_mad_port_start(port_priv); - if (ret) { - printk(KERN_ERR "Couldn't start port\n"); - goto error8; - } spin_lock_irqsave(&ib_mad_port_list_lock, flags); list_add_tail(&port_priv->port_list, &ib_mad_port_list); @@ -1401,17 +1391,14 @@ return 0; -error8: - ib_destroy_qp(port_priv->qp[1]); -error7: - ib_destroy_qp(port_priv->qp[0]); -error6: +error3: + while (qp > 0) { + ib_mad_destroy_qp(&port_priv->qp_info[--qp]); + } ib_dereg_mr(port_priv->mr); -error5: +error2: ib_dealloc_pd(port_priv->pd); -error4: - ib_destroy_cq(port_priv->cq); -error3: +error1: kfree(port_priv); return ret; @@ -1426,6 +1413,7 @@ { struct ib_mad_port_private *entry, *port_priv = NULL; unsigned long flags; + int i; spin_lock_irqsave(&ib_mad_port_list_lock, flags); list_for_each_entry(entry, &ib_mad_port_list, port_list) { @@ -1444,13 +1432,12 @@ list_del(&port_priv->port_list); spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); - ib_mad_port_stop(port_priv); - ib_mad_thread_stop(port_priv); - 
ib_destroy_qp(port_priv->qp[1]); - ib_destroy_qp(port_priv->qp[0]); + for (i = 0; i < IB_MAD_QPS_CORE; i++) { + ib_mad_destroy_qp(&port_priv->qp_info[i]); + } + ib_dereg_mr(port_priv->mr); ib_dealloc_pd(port_priv->pd); - ib_destroy_cq(port_priv->cq); /* Handle deallocation of MAD registration tables!!! */ kfree(port_priv); @@ -1461,7 +1448,7 @@ static void ib_mad_init_device(struct ib_device *device) { - int ret, num_ports, cur_port, i, ret2; + int ret, num_ports, i, ret2; struct ib_device_attr device_attr; ret = ib_query_device(device, &device_attr); @@ -1472,16 +1459,14 @@ if (device->node_type == IB_NODE_SWITCH) { num_ports = 1; - cur_port = 0; } else { num_ports = device_attr.phys_port_cnt; - cur_port = 1; } - for (i = 0; i < num_ports; i++, cur_port++) { - ret = ib_mad_port_open(device, cur_port); + for (i = 0; i < num_ports; i++) { + ret = ib_mad_port_open(device, i+1); if (ret) { printk(KERN_ERR "Could not open %s port %d\n", - device->name, cur_port); + device->name, i+1); goto error_device_open; } } @@ -1490,11 +1475,10 @@ error_device_open: while (i > 0) { - cur_port--; - ret2 = ib_mad_port_close(device, cur_port); + ret2 = ib_mad_port_close(device, i); if (ret2) { printk(KERN_ERR "Could not close %s port %d\n", - device->name, cur_port); + device->name, i); } i--; } @@ -1505,7 +1489,7 @@ static void ib_mad_remove_device(struct ib_device *device) { - int ret, i, num_ports, cur_port, ret2; + int ret, i, num_ports, ret2; struct ib_device_attr device_attr; ret = ib_query_device(device, &device_attr); @@ -1516,16 +1500,14 @@ if (device->node_type == IB_NODE_SWITCH) { num_ports = 1; - cur_port = 0; } else { num_ports = device_attr.phys_port_cnt; - cur_port = 1; } - for (i = 0; i < num_ports; i++, cur_port++) { - ret2 = ib_mad_port_close(device, cur_port); + for (i = 0; i < num_ports; i++) { + ret2 = ib_mad_port_close(device, i+1); if (ret2) { printk(KERN_ERR "Could not close %s port %d\n", - device->name, cur_port); + device->name, i+1); if (!ret) ret = ret2; 
}
From sean.hefty at intel.com Thu Sep 23 19:59:54 2004
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 23 Sep 2004 19:59:54 -0700
Subject: [openib-general] [PATCH] separates QP0/1 interactions
In-Reply-To: <20040923171745.75e46fe9.mshefty@ichips.intel.com>
Message-ID:

>@@ -205,11 +203,12 @@
>	memset(mad_agent_priv, 0, sizeof *mad_agent_priv);
>	mad_agent_priv->reg_req = reg_req;
>	mad_agent_priv->rmpp_version = rmpp_version;
>+	mad_agent_priv->qp_info = &port_priv->qp_info[qp_type];
>	mad_agent_priv->agent.device = device;
>	mad_agent_priv->agent.recv_handler = recv_handler;
>	mad_agent_priv->agent.send_handler = send_handler;
>	mad_agent_priv->agent.context = context;
>-	mad_agent_priv->agent.qp = port_priv->qp[qp_type];
>+	mad_agent_priv->agent.qp = port_priv->qp_info[qp_type].qp;

The use of qp_type as an index is incorrect. Maybe we should change
ib_verbs.h to set IB_QPT_SMI to 0 and IB_QPT_GSI to 1, so that their types
match their QP numbers.

>+		&mad_agent_priv->qp_info->send_posted_mad_list);
>+	mad_agent_priv->qp_info->send_posted_mad_count++;
>+	spin_unlock_irqrestore(&mad_agent_priv->qp_info->send_list_lock,
>+			       flags);
>
>	ret = ib_post_send(mad_agent->qp, &wr, &bad_wr);

As a side note, it's my intention to optimize the send path to avoid copying
the work request in the case where the send queue is not full.

>-static int convert_qpnum(u32 qp_num)
>-{
>-	/*
>-	 * No redirection currently!!!

This call wasn't needed with these changes. When we do get to redirection,
I think that we can handle it more efficiently by placing needed queues and
such off the mad_agent structure.

>	/* Setup MAD receive work completion from "normal" work completion */
>	recv->header.recv_wc.wc = wc;
>-	recv->header.recv_wc.mad_len = sizeof(struct ib_mad); /* Should this
>be based on wc->byte_len ? Also, RMPP !!!
*/
>+	recv->header.recv_wc.mad_len = sizeof(struct ib_mad); /* ignore GRH size */
>	recv->header.recv_wc.recv_buf = &recv->header.recv_buf;

I think that we should be able to eliminate the copying of work completion
structures when processing receive completions. Doing so would require
having separate CQs for the send and receive queues, which isn't necessarily
a bad idea. We could give priority to receive completion processing.

>+	/* Release locking before callback... */
>	/* Invoke receive callback */
>	mad_agent->agent.recv_handler(&mad_agent->agent,
>				      &recv->header.recv_wc);
>	}
> ...
>+	spin_unlock_irqrestore(&qp_info->port_priv->reg_lock, flags);

I need to think about how to handle this more, but I don't think that we
want to hold any spin locks while invoking callbacks. This will likely
require reference counting on the mad agent. I think we may be able to
conceptually follow the CQ/QP event handling code.

>-	spin_unlock_irqrestore(&port_priv->send_list_lock, flags);
>+	spin_unlock_irqrestore(&qp_info->send_list_lock, flags);
>+
>+	/* Synchronize with deregistration... */

Send completion callbacks were invoked without holding a lock. Same comment
as above.

>	while (1) {
>		while (!signal_pending(current)) {
>-			ret = wait_event_interruptible(mad_thread_priv->wait, 0);
>+			ret = wait_event_interruptible(qp_info->wait, 0);
>			if (ret) {
>				printk(KERN_ERR "ib_mad thread exiting\n");
>				return 0;
>			}
>
>-			ib_mad_completion_handler(port_priv);
>-
>+			ib_mad_completion_handler(qp_info);
>		}

I wanted to come back to this. Should a thread be able to just spin on the
inner loop?

>-	attr_mask = IB_QP_STATE | IB_QP_PKEY_INDEX | IB_QP_PORT | IB_QP_QKEY;
>+	attr_mask = IB_QP_STATE | IB_QP_PKEY_INDEX | IB_QP_QKEY;

Oops, this should have been a separate patch. Port numbers cannot be
modified for special QPs. I can resubmit this separately.
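
Coming back to the locking comments earlier in this message (no spin locks
held across the receive or send callbacks, with reference counting on the
agent instead), the idea can be sketched roughly as below. This is a
single-threaded userspace model, not kernel code: struct agent_model,
model_deliver(), model_put() and the lock_held flag are all hypothetical
stand-ins, and none of them correspond to the actual ib_mad.c structures
or locking primitives.

```c
/* Hypothetical model of a MAD agent; NOT the real ib_mad.c structure. */
struct agent_model {
	int lock_held;		/* models reg_lock: nonzero while "held" */
	int refcount;		/* 1 for the registration + 1 per in-flight callback */
	int released;		/* set once the last reference is dropped */
	int callbacks;		/* callbacks delivered so far */
	int locked_callbacks;	/* callbacks that ran with the lock held */
};

static void model_lock(struct agent_model *a)   { a->lock_held = 1; }
static void model_unlock(struct agent_model *a) { a->lock_held = 0; }

/* Drop one reference; the last put is where deregistration could finish. */
static void model_put(struct agent_model *a)
{
	model_lock(a);
	if (--a->refcount == 0)
		a->released = 1;
	model_unlock(a);
}

/* Stand-in for the client's recv_handler/send_handler. */
static void model_callback(struct agent_model *a)
{
	a->callbacks++;
	if (a->lock_held)
		a->locked_callbacks++;
}

/*
 * Take a reference under the lock, drop the lock, then call back.
 * Returns 0 on delivery, -1 if the agent was already deregistered.
 */
static int model_deliver(struct agent_model *a)
{
	model_lock(a);
	if (a->refcount == 0) {
		model_unlock(a);
		return -1;
	}
	a->refcount++;
	model_unlock(a);

	model_callback(a);	/* no lock held here */
	model_put(a);
	return 0;
}
```

In this model, deregistration would be the final model_put() of the
registration reference; a real implementation would also have to wait for
in-flight callbacks to finish, which the sketch does not attempt.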
>+	init_waitqueue_head(&qp_info->wait);
>+	qp_info->mad_thread = kthread_create(ib_mad_thread_handler,
>+					     qp_info,
>+					     "ib_mad-%-6s-%-2d-%-4d",
>+					     qp_info->port_priv->device->name,
>+					     qp_info->port_priv->port_num,
>+					     qp_info->qp->qp_num);

Right now, this changes the threading to be one per QP. (It was the easiest
change.) We can discuss how we want the threading to be: one per
system/device/port/QP/CPU... Also, we can have separate completion handlers
for QP 0 and QP 1 if needed. (I doubt it will be.)

From halr at voltaire.com Fri Sep 24 06:58:07 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Fri, 24 Sep 2004 09:58:07 -0400
Subject: [openib-general] OpenIB and IBTA CIWG Testing/Plugfest
Message-ID: <1096034287.2280.12.camel@hpc-1>

Hi,

Just wanted to report on some goings on at the CIWG which may be of
interest.

There is a big push at the CIWG to deal with OpenIB compliance and testing
although there is still some confusion relative to OpenIB. They are working
on understanding this and the group leader asked the CIWG interop subgroup
to discuss this and come up with recommendations. There was a subgroup
meeting on this yesterday in which I participated.

Clearly the goal is for OpenIB to achieve IL (integrator list) status. Next
Plugfest is in April and the subsequent one is in October.

Here is a brief synopsis:

1. DTA has been a stumbling block. DTA will be frozen at rev 0.95. There was
discussion on the availability of Open Source DTAs. From the OpenIB side, I
think this requires a CM and user space access.

2. There is a push for application validation but this is longer term.
Shorter term is ULP interop. Currently there are IPoIB procedures which have
been run at the last 2 Plugfests. The major missing one is DHCP. There is an
action item to coordinate this with the IETF IPoIB WG.

3. Discussion about new ULPs for next Plugfest. Someone has signed up to
develop interop procedures for SDP.
uDAPL is being investigated, although it has its own Plugfest currently
(and works over more than just IB).

I will keep this list posted if there is interest and as news occurs.

-- Hal

From halr at voltaire.com Fri Sep 24 08:16:21 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Fri, 24 Sep 2004 11:16:21 -0400
Subject: [openib-general] ib_mad shutdown WC status code ?
Message-ID: <1096038980.2391.3.camel@hpc-1>

Should there be a separate WC status code to be returned when the ib_mad
layer is shutdown and there are MADs queued on the send queue ? We added
IB_WC_RESP_TIMEOUT_ERR. Should there be an IB_WC_SHUTDOWN added too ? Or
is there some other status code to use for this scenario ?

Thanks.

-- Hal

From halr at voltaire.com Fri Sep 24 08:21:13 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Fri, 24 Sep 2004 11:21:13 -0400
Subject: [openib-general] [PATCH] separates QP0/1 interactions
In-Reply-To:
References:
Message-ID: <1096039273.2391.9.camel@hpc-1>

On Thu, 2004-09-23 at 22:59, Sean Hefty wrote:
> >@@ -205,11 +203,12 @@
> >	memset(mad_agent_priv, 0, sizeof *mad_agent_priv);
> >	mad_agent_priv->reg_req = reg_req;
> >	mad_agent_priv->rmpp_version = rmpp_version;
> >+	mad_agent_priv->qp_info = &port_priv->qp_info[qp_type];
> >	mad_agent_priv->agent.device = device;
> >	mad_agent_priv->agent.recv_handler = recv_handler;
> >	mad_agent_priv->agent.send_handler = send_handler;
> >	mad_agent_priv->agent.context = context;
> >-	mad_agent_priv->agent.qp = port_priv->qp[qp_type];
> >+	mad_agent_priv->agent.qp = port_priv->qp_info[qp_type].qp;
>
> The use of qp_type as an index is incorrect. Maybe we should change
> ib_verbs.h to set IB_QPT_SMI to 0 and IB_QPT_GSI to 1, so that their types
> match their QP numbers.

Correct. This stems from when those QP types were in a separate special QP
types enum and they were correct then :-( I missed this when special QP API
was eliminated and this was combined with the normal ones.

I'm all for changing these as proposed.
It saves a minor amount of code here. I don't think it matters to mthca.

I will respond to the rest of the comments in a series of responses over
the next few days.

-- Hal

From ftillier at infiniconsys.com Fri Sep 24 08:23:50 2004
From: ftillier at infiniconsys.com (Fab Tillier)
Date: Fri, 24 Sep 2004 08:23:50 -0700
Subject: [openib-general] ib_mad shutdown WC status code ?
In-Reply-To: <1096038980.2391.3.camel@hpc-1>
Message-ID: <000001c4a24a$7eef2d20$655aa8c0@infiniconsys.com>

> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Friday, September 24, 2004 8:16 AM
>
> Should there be a separate WC status code to be returned when the ib_mad
> layer is shutdown and there are MADs queued on the send queue ? We added
> IB_WC_RESP_TIMEOUT_ERR. Should there be an IB_WC_SHUTDOWN added too ? Or
> is there some other status code to use for this scenario ?
>

I would use IB_WC_WR_FLUSHED_ERR to indicate that a WR got flushed without
having been sent. This is consistent with how WRs get flushed on QPs.

- Fab

From halr at voltaire.com Fri Sep 24 08:37:03 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Fri, 24 Sep 2004 11:37:03 -0400
Subject: [openib-general] ib_mad shutdown WC status code ?
References: <000001c4a24a$7eef2d20$655aa8c0@infiniconsys.com>
Message-ID: <007d01c4a24c$5ca92660$4302000a@Gripen>

Fab Tillier wrote:
> I would use IB_WC_WR_FLUSHED_ERR to indicate that a WR got flushed
> without having been sent. This is consistent with how WRs get
> flushed on QPs.

That seems fine as long as a client would never need to distinguish the 2
cases.
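
To illustrate what "distinguish the 2 cases" would mean for a client, here
is a rough sketch of a send completion handler that switches on the status.
The enum names only mirror the status codes mentioned in this thread; the
MODEL_* identifiers, their values, and model_send_handler() are hypothetical
and are not taken from ib_verbs.h or ib_mad.h.

```c
/*
 * Hypothetical status codes mirroring the names discussed above;
 * the values are NOT taken from ib_verbs.h.
 */
enum model_wc_status {
	MODEL_WC_SUCCESS,
	MODEL_WC_RESP_TIMEOUT_ERR,
	MODEL_WC_WR_FLUSHED_ERR
};

/* What the (hypothetical) client decides to do with a completed send. */
enum model_action {
	MODEL_DONE,	/* nothing more to do */
	MODEL_RETRY,	/* resend the request */
	MODEL_GIVE_UP	/* release the request; it never went out */
};

static enum model_action model_send_handler(enum model_wc_status status)
{
	switch (status) {
	case MODEL_WC_SUCCESS:
		return MODEL_DONE;
	case MODEL_WC_RESP_TIMEOUT_ERR:
		return MODEL_RETRY;
	case MODEL_WC_WR_FLUSHED_ERR:
		/* Same path whether a QP was flushed or the MAD layer shut down. */
		return MODEL_GIVE_UP;
	}
	return MODEL_GIVE_UP;
}
```

With a single flushed status the handler has exactly one branch for both
situations, so reusing it is only a problem if some client would want to
react differently to a MAD layer shutdown.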
-- Hal

From libor at topspin.com Fri Sep 24 08:45:40 2004
From: libor at topspin.com (Libor Michalek)
Date: Fri, 24 Sep 2004 08:45:40 -0700
Subject: [openib-general] Re: [PATCH] Move SDP to dynamic device enumeration
In-Reply-To: <52r7p3wesd.fsf@topspin.com>; from roland@topspin.com on Wed, Sep 15, 2004 at 11:46:58AM -0700
References: <52r7p3wesd.fsf@topspin.com>
Message-ID: <20040924084540.A2899@topspin.com>

Roland Dreier wrote:
>
> This moves the SDP in my tree to using the struct ib_client method for
> device enumeration. There are still problems with adding and removing
> devices because the ip2pr module is still using static methods, but I
> think this fixes up SDP.
>
> Libor, seem OK to commit?

Roland,

Sorry that I missed this earlier. This looks good. Although, the change
in FMR initialization, once implemented, will result in more virtual
address space utilization on systems with multiple HCAs...

-Libor

> Index: infiniband/ulp/sdp/sdp_conn.c
> ===================================================================
> --- infiniband/ulp/sdp/sdp_conn.c	(revision 836)
> +++ infiniband/ulp/sdp/sdp_conn.c	(working copy)
> @@ -28,6 +28,16 @@
>  static char _recv_pool_name[] = TS_SDP_SOCK_RECV_DATA_NAME;
>
>  static struct sdev_root _dev_root_s;
> +
> +static void sdp_device_init_one(struct ib_device *device);
> +static void sdp_device_remove_one(struct ib_device *device);
> +
> +static struct ib_client sdp_client = {
> +	.name = "sdp",
> +	.add = sdp_device_init_one,
> +	.remove = sdp_device_remove_one
> +};
> +
>  /* --------------------------------------------------------------------- */
>  /* */
>  /* module specific functions */
> @@ -1016,27 +1026,17 @@
>  	/*
>  	 * look up correct HCA and port
>  	 */
> -	for (hca = _dev_root_s.hca_list; NULL != hca; hca = hca->next) {
> +	hca = ib_get_client_data(device, &sdp_client);
> +	if (!hca)
> +		return -ERANGE;
>
> -		if (device == hca->ca) {
> -
> -			for (port = hca->port_list; NULL != port;
> -			     port = port->next) {
> -
> -				if
(hw_port == port->index) { > - > - break; > - } > - } > - > + for (port = hca->port_list; NULL != port; port = port->next) > + if (hw_port == port->index) > break; > - } > - } > > - if (NULL == hca || NULL == port) { > - > + if (!port) > return -ERANGE; > - } > + > /* > * allocate creation parameters > */ > @@ -1815,12 +1815,6 @@ > off_t start_index, > long *end_index) > { > - struct sdev_hca_port *port; > - struct sdev_hca *hca; > - u64 subnet_prefix; > - u64 guid; > - int hca_count; > - int port_count; > int offset = 0; > > TS_CHECK_NULL(buffer, -EINVAL); > @@ -1857,40 +1851,8 @@ > offset += sprintf((buffer + offset), > "max receive buffered: <%d>\n", > _dev_root_s.recv_buff_max); > - > - offset += sprintf((buffer + offset), "HCAs:\n"); > } > - /* > - * HCA loop > - */ > - for (hca = _dev_root_s.hca_list, hca_count = 0; > - NULL != hca; hca = hca->next, hca_count++) { > > - offset += sprintf((buffer + offset), > - " hca %02x: ca <%p> pd <%p> mem <%p> " > - "l_key <%08x>\n", > - hca_count, hca->ca, hca->pd, > - hca->mem_h, hca->l_key); > - > - for (port = hca->port_list, port_count = 0; > - NULL != port; port = port->next, port_count++) { > - > - subnet_prefix = cpu_to_be64(*(u64 *) (port->gid)); > - guid = cpu_to_be64(*(u64 *)(port->gid + sizeof(u64))); > - > - offset += sprintf((buffer + offset), > - " port %02x: index <%d> gid " > - "<%08x%08x:%08x%08x>\n", > - port_count, > - port->index, > - (u32)((subnet_prefix >> 32) & > - 0xffffffff), > - (u32)(subnet_prefix & 0xffffffff), > - (u32)((guid >> 32) & 0xffffffff), > - (u32)(guid & 0xffffffff)); > - } > - } > - > return offset; > } /* sdp_proc_dump_device */ > > @@ -1899,232 +1861,200 @@ > /* initialization/cleanup functions */ > /* */ > /* --------------------------------------------------------------------- */ > + > /* ========================================================================= */ > /*.._sdp_device_table_init -- create hca list */ > -static s32 _sdp_device_table_init(struct sdev_root *dev_root) 
> +static void sdp_device_init_one(struct ib_device *device) > { > #ifdef _TS_SDP_AIO_SUPPORT > struct ib_fmr_pool_param fmr_param_s; > #endif > struct ib_phys_buf buffer_list; > struct ib_device_attr node_info; > - struct ib_device *hca_handle; > struct sdev_hca_port *port; > struct sdev_hca *hca; > - s32 result; > - s32 hca_count; > - s32 port_count; > - s32 fmr_size; > + int result; > + int port_count; > > - TS_CHECK_NULL(dev_root, -EINVAL); > + result = ib_query_device(device, &node_info); > + if (0 != result) { > > - TS_TRACE(MOD_LNX_SDP, T_VERY_VERBOSE, TRACE_FLOW_INOUT, > - "INIT: Probing HCA/Port list."); > + TS_TRACE(MOD_LNX_SDP, T_VERBOSE, TRACE_FLOW_FATAL, > + "INIT: Error <%d> fetching HCA <%s> type.", > + result, device->name); > + return; > + } > /* > - * first count number of HCA's > + * allocate per-HCA structure > */ > - hca_count = 0; > + hca = kmalloc(sizeof(struct sdev_hca), GFP_KERNEL); > + if (NULL == hca) { > > - while (ib_device_get_by_index(hca_count)) { > - > - hca_count++; > + TS_TRACE(MOD_LNX_SDP, T_VERBOSE, TRACE_FLOW_FATAL, > + "INIT: Error allocating HCA <%s> memory.", > + device->name); > + return; > } > + /* > + * init and insert into list. 
> + */ > + memset(hca, 0, sizeof(struct sdev_hca)); > > - fmr_size = TS_SDP_FMR_POOL_SIZE / hca_count; > + hca->fmr_pool = NULL; > + hca->mem_h = NULL; > + hca->pd = NULL; > + hca->ca = device; > + /* > + * protection domain > + */ > + hca->pd = ib_alloc_pd(hca->ca); > + if (IS_ERR(hca->pd)) { > > - for (hca_count = 0; > - (hca_handle = ib_device_get_by_index(hca_count)) != NULL; > - hca_count++) { > - if (!hca_handle || !try_module_get(hca_handle->owner)) > - continue; > + TS_TRACE(MOD_LNX_SDP, T_TERSE, TRACE_FLOW_FATAL, > + "INIT: Error <%d> creating HCA <%s> protection domain.", > + PTR_ERR(hca->pd), device->name); > + goto error; > + } > + /* > + * memory registration > + */ > + buffer_list.addr = 0; > + buffer_list.size = (unsigned long)high_memory - PAGE_OFFSET; > > - result = ib_query_device(hca_handle, &node_info); > - if (0 != result) { > + hca->iova = 0; > > - TS_TRACE(MOD_LNX_SDP, T_VERBOSE, TRACE_FLOW_FATAL, > - "INIT: Error <%d> fetching HCA <%x:%d> type.", > - result, hca_handle, hca_count); > - goto error; > - } > - /* > - * allocate per-HCA structure > - */ > - hca = kmalloc(sizeof(struct sdev_hca), GFP_KERNEL); > - if (NULL == hca) { > + hca->mem_h = ib_reg_phys_mr(hca->pd, > + &buffer_list, > + 1, /* list_len */ > + IB_ACCESS_LOCAL_WRITE, > + &hca->iova); > + if (IS_ERR(hca->mem_h)) { > + result = PTR_ERR(hca->mem_h); > + TS_TRACE(MOD_LNX_SDP, T_TERSE, TRACE_FLOW_FATAL, > + "INIT: Error <%d> registering HCA <%s> memory.", > + result, device->name); > + goto error; > + } > > - TS_TRACE(MOD_LNX_SDP, T_VERBOSE, TRACE_FLOW_FATAL, > - "INIT: Error allocating HCA <%x:%d> memory.", > - hca_handle, hca_count); > - result = -ENOMEM; > - goto error; > - } > - /* > - * init and insert into list. 
> - */ > - memset(hca, 0, sizeof(struct sdev_hca)); > + hca->l_key = hca->mem_h->lkey; > + hca->r_key = hca->mem_h->rkey; > > - hca->next = dev_root->hca_list; > - dev_root->hca_list = hca; > +#ifdef _TS_SDP_AIO_SUPPORT > + /* > + * FMR allocation > + */ > + fmr_param_s.pool_size = TS_SDP_FMR_POOL_SIZE; > + fmr_param_s.dirty_watermark = TS_SDP_FMR_DIRTY_SIZE; > + fmr_param_s.cache = 1; > + fmr_param_s.max_pages_per_fmr = TS_SDP_IOCB_PAGES_MAX; > + fmr_param_s.access = (IB_ACCESS_LOCAL_WRITE | > + IB_ACCESS_REMOTE_WRITE | > + IB_ACCESS_REMOTE_READ); > > - hca->fmr_pool = NULL; > - hca->mem_h = NULL; > - hca->pd = NULL; > - hca->ca = hca_handle; > - /* > - * protection domain > - */ > - hca->pd = ib_alloc_pd(hca->ca); > - if (IS_ERR(hca->pd)) { > + fmr_param_s.flush_function = NULL; > + /* > + * create SDP memory pool > + */ > + result = ib_create_fmr_pool(hca->pd, > + &fmr_param_s, > + &hca->fmr_pool); > + if (0 > result) { > > - TS_TRACE(MOD_LNX_SDP, T_TERSE, TRACE_FLOW_FATAL, > - "INIT: Error <%d> creating HCA <%x:%d> protection domain.", > - PTR_ERR(hca->pd), hca_handle, hca_count); > - goto error; > - } > - /* > - * memory registration > - */ > - buffer_list.addr = 0; > - buffer_list.size = (unsigned long)high_memory - PAGE_OFFSET; > + TS_TRACE(MOD_LNX_SDP, T_TERSE, TRACE_FLOW_FATAL, > + "INIT: Error <%d> creating HCA <%s> fast memory pool.", > + result, hca->ca); > + goto error; > + } > +#endif /* _TS_SDP_AIO_SUPPORT */ > + /* > + * port allocation > + */ > + for (port_count = 0; port_count < node_info.phys_port_cnt; port_count++) { > > - hca->iova = 0; > + port = kmalloc(sizeof(struct sdev_hca_port), > + GFP_KERNEL); > + if (NULL == port) { > > - hca->mem_h = ib_reg_phys_mr(hca->pd, > - &buffer_list, > - 1, /* list_len */ > - IB_ACCESS_LOCAL_WRITE, > - &hca->iova); > - if (IS_ERR(hca->mem_h)) { > - result = PTR_ERR(hca->mem_h); > - TS_TRACE(MOD_LNX_SDP, T_TERSE, TRACE_FLOW_FATAL, > - "INIT: Error <%d> registering HCA <%x:%d> memory.", > - result, hca_handle, 
hca_count); > + TS_TRACE(MOD_LNX_SDP, T_VERBOSE, > + TRACE_FLOW_FATAL, > + "INIT: Error allocating HCA <%s> port <%d:%d> memory.", > + device->name, port_count, > + node_info.phys_port_cnt); > + > goto error; > } > > - hca->l_key = hca->mem_h->lkey; > - hca->r_key = hca->mem_h->rkey; > + memset(port, 0, sizeof(struct sdev_hca_port)); > > -#ifdef _TS_SDP_AIO_SUPPORT > - /* > - * FMR allocation > - */ > - fmr_param_s.pool_size = fmr_size; > - fmr_param_s.dirty_watermark = TS_SDP_FMR_DIRTY_SIZE; > - fmr_param_s.cache = 1; > - fmr_param_s.max_pages_per_fmr = TS_SDP_IOCB_PAGES_MAX; > - fmr_param_s.access = (IB_ACCESS_LOCAL_WRITE | > - IB_ACCESS_REMOTE_WRITE | > - IB_ACCESS_REMOTE_READ); > + port->index = port_count + 1; > + port->next = hca->port_list; > + hca->port_list = port; > > - fmr_param_s.flush_function = NULL; > - /* > - * create SDP memory pool > - */ > - result = ib_create_fmr_pool(hca->pd, > - &fmr_param_s, > - &hca->fmr_pool); > - if (0 > result) { > + result = ib_query_gid(hca->ca, > + port->index, > + 0, /* index */ > + (union ib_gid *) port->gid); > + if (0 != result) { > > - TS_TRACE(MOD_LNX_SDP, T_TERSE, TRACE_FLOW_FATAL, > - "INIT: Error <%d> creating HCA <%d:%d> fast memory pool.", > - result, hca->ca, hca_count); > + TS_TRACE(MOD_LNX_SDP, T_VERBOSE, > + TRACE_FLOW_FATAL, > + "INIT: Error <%d> getting GID for HCA <%s> port <%d:%d>", > + result, hca->ca, > + port->index, node_info.phys_port_cnt); > goto error; > } > -#endif /* _TS_SDP_AIO_SUPPORT */ > - /* > - * port allocation > - */ > - for (port_count = 0; port_count < node_info.phys_port_cnt; > - port_count++) { > + } > > - port = kmalloc(sizeof(struct sdev_hca_port), > - GFP_KERNEL); > - if (NULL == port) { > + ib_set_client_data(device, &sdp_client, hca); > > - TS_TRACE(MOD_LNX_SDP, T_VERBOSE, > - TRACE_FLOW_FATAL, > - "INIT: Error allocating HCA <%d:%d> port <%x:%d> memory.", > - hca_handle, hca_count, port_count, > - node_info.phys_port_cnt); > + return; > > - result = -ENOMEM; > - goto error; 
> - } > +error: > + for (port = hca->port_list; NULL != port; port = hca->port_list) { > + hca->port_list = port->next; > + port->next = NULL; > > - memset(port, 0, sizeof(struct sdev_hca_port)); > + kfree(port); > + } > > - port->index = port_count + 1; > - port->next = hca->port_list; > - hca->port_list = port; > + if (hca->fmr_pool) > + (void)ib_destroy_fmr_pool(hca->fmr_pool); > > - result = ib_query_gid(hca->ca, > - port->index, > - 0, /* index */ > - (union ib_gid *) port->gid); > - if (0 != result) { > + if (hca->mem_h) > + (void)ib_dereg_mr(hca->mem_h); > > - TS_TRACE(MOD_LNX_SDP, T_VERBOSE, > - TRACE_FLOW_FATAL, > - "INIT: Error <%d> getting GID for HCA <%d:%d> port <%d:%d>", > - result, hca->ca, hca_count, > - port->index, node_info.phys_port_cnt); > - goto error; > - } > - } > - } > + if (hca->pd) > + (void)ib_dealloc_pd(hca->pd); > > - return 0; > -error: > - return result; > + kfree(hca); > } /* _sdp_device_table_init */ > > /* ========================================================================= */ > /*.._sdp_device_table_cleanup -- delete hca list */ > -static s32 _sdp_device_table_cleanup(struct sdev_root *dev_root) > +static void sdp_device_remove_one(struct ib_device *device) > { > struct sdev_hca_port *port; > struct sdev_hca *hca; > > - TS_CHECK_NULL(dev_root, -EINVAL); > - /* > - * free all hca/ports > - */ > - for (hca = dev_root->hca_list; NULL != hca; hca = dev_root->hca_list) { > + hca = ib_get_client_data(device, &sdp_client); > > - dev_root->hca_list = hca->next; > - hca->next = NULL; > + for (port = hca->port_list; NULL != port; port = hca->port_list) { > + hca->port_list = port->next; > + port->next = NULL; > > - for (port = hca->port_list; NULL != port; port = hca->port_list) { > + kfree(port); > + } > > - hca->port_list = port->next; > - port->next = NULL; > + if (hca->fmr_pool) > + (void)ib_destroy_fmr_pool(hca->fmr_pool); > > - kfree(port); > - } > + if (hca->mem_h) > + (void)ib_dereg_mr(hca->mem_h); > > - if (NULL != 
hca->fmr_pool) { > + if (hca->pd) > + (void)ib_dealloc_pd(hca->pd); > > - (void)ib_destroy_fmr_pool(hca->fmr_pool); > - } > - > - if (hca->mem_h) { > - > - (void)ib_dereg_mr(hca->mem_h); > - } > - > - if (hca->pd) { > - > - (void)ib_dealloc_pd(hca->pd); > - } > - > - if (hca->ca) > - module_put(hca->ca->owner); > - > - kfree(hca); > - } > - > - return 0; > + kfree(hca); > } /* _sdp_device_table_cleanup */ > > /* ========================================================================= */ > @@ -2170,11 +2100,11 @@ > /* > * Get HCA/port list > */ > - result = _sdp_device_table_init(&_dev_root_s); > + result = ib_register_client(&sdp_client); > if (0 > result) { > > TS_TRACE(MOD_LNX_SDP, T_TERSE, TRACE_FLOW_FATAL, > - "INIT: Error <%d> building HCA/port list.", result); > + "INIT: Error <%d> registering SDP client.", result); > goto error_hca; > } > /* > @@ -2281,8 +2211,8 @@ > free_pages((unsigned long)_dev_root_s.sk_array, _dev_root_s.sk_ordr); > error_array: > error_size: > + ib_unregister_client(&sdp_client); > error_hca: > - (void)_sdp_device_table_cleanup(&_dev_root_s); > return result; > } /* sdp_conn_table_init */ > > @@ -2302,7 +2232,7 @@ > /* > * delete list of HCAs/PORTs > */ > - (void)_sdp_device_table_cleanup(&_dev_root_s); > + ib_unregister_client(&sdp_client); > /* > * drop socket table > */ > Index: infiniband/ulp/sdp/sdp_dev.h > =================================================================== > --- infiniband/ulp/sdp/sdp_dev.h (revision 803) > +++ infiniband/ulp/sdp/sdp_dev.h (working copy) > @@ -167,12 +167,8 @@ > int send_buff_max; > int send_usig_max; > /* > - * devices. list of installed HCA's and some associated parameters > - */ > - struct sdev_hca *hca_list; > - /* > * connections. The table is a simple linked list, since it does not > - * need to require fast lookup capabilities. > + * need fast lookup capabilities. > */ > u32 sk_size; /* socket array size */ > u32 sk_ordr; /* order size of region. 
*/

From halr at voltaire.com Fri Sep 24 09:37:42 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Fri, 24 Sep 2004 12:37:42 -0400
Subject: [openib-general] [PATCH] separates QP0/1 interactions
In-Reply-To:
References:
Message-ID: <1096043861.5690.3.camel@hpc-1>

On Thu, 2004-09-23 at 22:59, Sean Hefty wrote:
> >-static int convert_qpnum(u32 qp_num)
> >-{
> >-	/*
> >-	 * No redirection currently!!!
>
> This call wasn't needed with these changes. When we do get to redirection,
> I think that we can handle it more efficiently by placing needed queues and
> such off the mad_agent structure.

Not sure what you mean by not needed with these changes. The idea was to
put some infrastructure in place so that when redirection is implemented,
most of the code would not change. The thinking behind this routine was
that there would be a table which packed the indexes as the QPNs of the
redirected QPs could be sparse. The choice once again will be some list to
walk or a table lookup.

-- Hal

From mshefty at ichips.intel.com Fri Sep 24 09:40:32 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 24 Sep 2004 09:40:32 -0700
Subject: [openib-general] ib_mad shutdown WC status code ?
In-Reply-To: <000001c4a24a$7eef2d20$655aa8c0@infiniconsys.com>
References: <1096038980.2391.3.camel@hpc-1> <000001c4a24a$7eef2d20$655aa8c0@infiniconsys.com>
Message-ID: <20040924094032.368c1a64.mshefty@ichips.intel.com>

On Fri, 24 Sep 2004 08:23:50 -0700 "Fab Tillier" wrote:
> > From: Hal Rosenstock [mailto:halr at voltaire.com]
> > Sent: Friday, September 24, 2004 8:16 AM
> >
> > Should there be a separate WC status code to be returned when the ib_mad
> > layer is shutdown and there are MADs queued on the send queue ? We added
> > IB_WC_RESP_TIMEOUT_ERR. Should there be an IB_WC_SHUTDOWN added too ? Or
> > is there some other status code to use for this scenario ?
> >
>
> I would use IB_WC_WR_FLUSHED_ERR to indicate that a WR got flushed without
> having been sent.
This is consistent with how WRs get flushed on QPs. I think that flushed makes sense. With queuing in the access layer, we should be able to stop and restart the QP in most error cases without affecting the user. If the link goes down, flushing the requests seems reasonable. What other cases would cause the MAD layer to shut down? - Sean From mshefty at ichips.intel.com Fri Sep 24 09:44:30 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 24 Sep 2004 09:44:30 -0700 Subject: [openib-general] [PATCH] set CQ size for MAD layer Message-ID: <20040924094430.39b2dd71.mshefty@ichips.intel.com> This patch is for revision 880. If QP0/1 traffic is not separated, the CQ size needs to double to prevent overruns. - Sean -- Index: access/ib_mad.c =================================================================== --- access/ib_mad.c (revision 880) +++ access/ib_mad.c (working copy) @@ -1320,7 +1320,7 @@ port_priv->version[i] = NULL; } - cq_size = IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE; + cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; port_priv->cq = ib_create_cq(port_priv->device, (ib_comp_handler) ib_mad_thread_completion_handler, NULL, port_priv, cq_size); From mshefty at ichips.intel.com Fri Sep 24 10:38:03 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 24 Sep 2004 10:38:03 -0700 Subject: [openib-general] [PATCH] change SMI/GSI QP types to match QP index values Message-ID: <20040924103803.2cb4f18b.mshefty@ichips.intel.com> Index: include/ib_verbs.h =================================================================== --- include/ib_verbs.h (revision 880) +++ include/ib_verbs.h (working copy) @@ -320,11 +320,11 @@ }; enum ib_qp_type { + IB_QPT_SMI, /* SMI type = QP index 0 */ + IB_QPT_GSI, /* GSI type = QP index 1 */ IB_QPT_RC, IB_QPT_UC, IB_QPT_UD, - IB_QPT_SMI, - IB_QPT_GSI, IB_QPT_RAW_IPV6, IB_QPT_RAW_ETY }; -- From mshefty at ichips.intel.com Fri Sep 24 10:40:20 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 24 Sep 
2004 10:40:20 -0700 Subject: [openib-general] [PATCH] minor optimization to solicited_mad check Message-ID: <20040924104020.74650c55.mshefty@ichips.intel.com> Minor patch to remove a branch, jump, and stack variable. Index: ib_mad.c =================================================================== --- ib_mad.c (revision 880) +++ ib_mad.c (working copy) @@ -584,30 +584,22 @@ static int response_mad(struct ib_mad *mad) { /* Trap represses are responses although response bit is reset */ - if ((mad->mad_hdr.method == IB_MGMT_METHOD_TRAP_REPRESS) || - (mad->mad_hdr.method & IB_MGMT_METHOD_RESP)) { - return 1; - } - return 0; + return ((mad->mad_hdr.method == IB_MGMT_METHOD_TRAP_REPRESS) || + (mad->mad_hdr.method & IB_MGMT_METHOD_RESP)); } static int solicited_mad(struct ib_mad *mad) { - int solicited = 0; - /* CM MADs are never solicited */ if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_CM) { - goto ret; + return 0; } /* Determine whether MAD is using RMPP !!! */ /* Not using RMPP */ /* Is this MAD a response to a previous MAD ? */ - solicited = response_mad(mad); - -ret: - return solicited; + return response_mad(mad); } static struct ib_mad_agent_private *find_mad_agent(struct ib_mad_port_private *port_priv, -- From halr at voltaire.com Fri Sep 24 10:41:44 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 24 Sep 2004 13:41:44 -0400 Subject: [openib-general] ib_mad shutdown WC status code ? References: <1096038980.2391.3.camel@hpc-1> <000001c4a24a$7eef2d20$655aa8c0@infiniconsys.com> <20040924094032.368c1a64.mshefty@ichips.intel.com> Message-ID: <009b01c4a25d$c4772f60$4302000a@Gripen> Sean Hefty wrote: > I think that flushed makes sense. With queuing in the access layer, > we should be able to stop and restart the QP in most error cases > without affecting the user. If the link goes down, flushing the > requests seems reasonable. What other cases would cause the MAD > layer to shut down? Module removal/shutdown. 
-- Hal From mshefty at ichips.intel.com Fri Sep 24 10:45:04 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 24 Sep 2004 10:45:04 -0700 Subject: [openib-general] ib_mad shutdown WC status code ? In-Reply-To: <009b01c4a25d$c4772f60$4302000a@Gripen> References: <1096038980.2391.3.camel@hpc-1> <000001c4a24a$7eef2d20$655aa8c0@infiniconsys.com> <20040924094032.368c1a64.mshefty@ichips.intel.com> <009b01c4a25d$c4772f60$4302000a@Gripen> Message-ID: <20040924104504.3f859809.mshefty@ichips.intel.com> On Fri, 24 Sep 2004 13:41:44 -0400 Hal Rosenstock wrote: > Sean Hefty wrote: > > I think that flushed makes sense. With queuing in the access layer, > > we should be able to stop and restart the QP in most error cases > > without affecting the user. If the link goes down, flushing the > > requests seems reasonable. What other cases would cause the MAD > > layer to shut down? > > Module removal/shutdown. Are we expecting to be able to do this with active users of the access layer? Even in this case, I think flushed works, since the QP is being destroyed. From mshefty at ichips.intel.com Fri Sep 24 10:53:04 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 24 Sep 2004 10:53:04 -0700 Subject: [openib-general] [PATCH] remove modifying port number on special QPs Message-ID: <20040924105304.1c3530ca.mshefty@ichips.intel.com> Index: access/ib_mad.c =================================================================== --- access/ib_mad.c (revision 880) +++ access/ib_mad.c (working copy) @@ -1068,7 +1068,7 @@ /* * Modify QP into Init state */ -static inline int ib_mad_change_qp_state_to_init(struct ib_qp *qp, int port_num) +static inline int ib_mad_change_qp_state_to_init(struct ib_qp *qp) { int ret; struct ib_qp_attr *attr = NULL; @@ -1087,13 +1087,12 @@ * one is needed for the Reset to Init transition. 
*/ attr->pkey_index = 0; - attr->port_num = port_num; /* QKey is 0 for QP0 */ if (qp->qp_num == 0) attr->qkey = 0; else attr->qkey = IB_QP1_QKEY; - attr_mask = IB_QP_STATE | IB_QP_PKEY_INDEX | IB_QP_PORT | IB_QP_QKEY; + attr_mask = IB_QP_STATE | IB_QP_PKEY_INDEX | IB_QP_QKEY; ret = ib_modify_qp(qp, attr, attr_mask, &qp_cap); kfree(attr); @@ -1189,8 +1188,7 @@ int ret, i, ret2; for (i = 0; i < IB_MAD_QPS_CORE; i++) { - ret = ib_mad_change_qp_state_to_init(port_priv->qp[i], - port_priv->port_num); + ret = ib_mad_change_qp_state_to_init(port_priv->qp[i]); if (ret) { printk(KERN_ERR "Could not change QP%d state to INIT\n", i); return ret; -- From halr at voltaire.com Fri Sep 24 12:51:59 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 24 Sep 2004 15:51:59 -0400 Subject: [openib-general] Re: [PATCH] set CQ size for MAD layer In-Reply-To: <20040924094430.39b2dd71.mshefty@ichips.intel.com> References: <20040924094430.39b2dd71.mshefty@ichips.intel.com> Message-ID: <1096055519.5573.4.camel@hpc-1> On Fri, 2004-09-24 at 12:44, Sean Hefty wrote: > This patch is for revision 880. If QP0/1 traffic is not separated, > the CQ size needs to double to prevent overruns. Thanks. Applied. -- Hal From halr at voltaire.com Fri Sep 24 12:54:28 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 24 Sep 2004 15:54:28 -0400 Subject: [openib-general] Re: [PATCH] minor optimization to solicited_mad check In-Reply-To: <20040924104020.74650c55.mshefty@ichips.intel.com> References: <20040924104020.74650c55.mshefty@ichips.intel.com> Message-ID: <1096055668.5573.9.camel@hpc-1> On Fri, 2004-09-24 at 13:40, Sean Hefty wrote: > Minor patch to remove a branch, jump, and stack variable. Thanks. Applied. 
-- Hal From halr at voltaire.com Fri Sep 24 12:58:23 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 24 Sep 2004 15:58:23 -0400 Subject: [openib-general] Re: [PATCH] remove modifying port number on special QPs In-Reply-To: <20040924105304.1c3530ca.mshefty@ichips.intel.com> References: <20040924105304.1c3530ca.mshefty@ichips.intel.com> Message-ID: <1096055903.5573.13.camel@hpc-1> On Fri, 2004-09-24 at 13:53, Sean Hefty wrote: Thanks. I think there is one more piece to this puzzle. I will make the change for when the special QPs are created, that the port number is supplied. -- Hal From halr at voltaire.com Fri Sep 24 13:02:28 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 24 Sep 2004 16:02:28 -0400 Subject: [openib-general] Re: [PATCH] remove modifying port number on special QPs In-Reply-To: <1096055903.5573.13.camel@hpc-1> References: <20040924105304.1c3530ca.mshefty@ichips.intel.com> <1096055903.5573.13.camel@hpc-1> Message-ID: <1096056148.5573.15.camel@hpc-1> On Fri, 2004-09-24 at 15:58, Hal Rosenstock wrote: > On Fri, 2004-09-24 at 13:53, Sean Hefty wrote: > Thanks. I think there is one more piece to this puzzle. I will make the > change for when the special QPs are created, that the port number is > supplied. Never mind. That part was already done. -- Hal From halr at voltaire.com Fri Sep 24 13:05:52 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 24 Sep 2004 16:05:52 -0400 Subject: [openib-general] Re: [PATCH] remove modifying port number on special QPs In-Reply-To: <20040924105304.1c3530ca.mshefty@ichips.intel.com> References: <20040924105304.1c3530ca.mshefty@ichips.intel.com> Message-ID: <1096056352.5573.17.camel@hpc-1> On Fri, 2004-09-24 at 13:53, Sean Hefty wrote: Thanks. Applied. -- Hal From halr at voltaire.com Fri Sep 24 13:33:30 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 24 Sep 2004 16:33:30 -0400 Subject: [openib-general] ib_mad shutdown WC status code ? 
In-Reply-To: <20040924104504.3f859809.mshefty@ichips.intel.com> References: <1096038980.2391.3.camel@hpc-1> <000001c4a24a$7eef2d20$655aa8c0@infiniconsys.com> <20040924094032.368c1a64.mshefty@ichips.intel.com> <009b01c4a25d$c4772f60$4302000a@Gripen> <20040924104504.3f859809.mshefty@ichips.intel.com> Message-ID: <1096058010.2475.1.camel@hpc-1> On Fri, 2004-09-24 at 13:45, Sean Hefty wrote: > > Module removal/shutdown. > > Are we expecting to be able to do this with active users of the access layer? Not with active users, but if all the users of the module are shut down, then the access layer can be (and should be able to be) removed safely. -- Hal From halr at voltaire.com Fri Sep 24 13:51:43 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 24 Sep 2004 16:51:43 -0400 Subject: [openib-general] [PATCH] Change SMI/GSI QP types to match QP index values Message-ID: <1096059103.2392.0.camel@hpc-1> Change SMI/GSI QP types to match QP index values Roland's branch Index: ib_verbs.h =================================================================== --- ib_verbs.h (revision 880) +++ ib_verbs.h (working copy) @@ -346,11 +346,11 @@ }; enum ib_qp_type { + IB_QPT_SMI, /* SMI type = QP index 0 */ + IB_QPT_GSI, /* GSI type = QP index 1 */ IB_QPT_RC, IB_QPT_UC, IB_QPT_UD, - IB_QPT_SMI, - IB_QPT_GSI, IB_QPT_RAW_IPV6, IB_QPT_RAW_ETY }; From mshefty at ichips.intel.com Fri Sep 24 13:53:43 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 24 Sep 2004 13:53:43 -0700 Subject: [openib-general] ib_mad shutdown WC status code ?
In-Reply-To: <1096058010.2475.1.camel@hpc-1> References: <1096038980.2391.3.camel@hpc-1> <000001c4a24a$7eef2d20$655aa8c0@infiniconsys.com> <20040924094032.368c1a64.mshefty@ichips.intel.com> <009b01c4a25d$c4772f60$4302000a@Gripen> <20040924104504.3f859809.mshefty@ichips.intel.com> <1096058010.2475.1.camel@hpc-1> Message-ID: <20040924135343.7e1b79d4.mshefty@ichips.intel.com> On Fri, 24 Sep 2004 16:33:30 -0400 Hal Rosenstock wrote: > On Fri, 2004-09-24 at 13:45, Sean Hefty wrote: > > > Module removal/shutdown. > > > > Are we expecting to be able to do this with active users of the access layer? > > Not with active users but if all the users of the module are shut down, > then the access layer can be (and should be able to be) removed safely. I agree, but then there shouldn't be any work completions to report in those cases. I guess we can just use flush now and create a new status code if we find that we need it later. From mshefty at ichips.intel.com Fri Sep 24 13:56:57 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 24 Sep 2004 13:56:57 -0700 Subject: [openib-general] [PATCH] change SMI/GSI QP types to match QP index values In-Reply-To: <20040924103803.2cb4f18b.mshefty@ichips.intel.com> References: <20040924103803.2cb4f18b.mshefty@ichips.intel.com> Message-ID: <20040924135657.1926385f.mshefty@ichips.intel.com> On Fri, 24 Sep 2004 10:38:03 -0700 Sean Hefty wrote: > > Index: include/ib_verbs.h > =================================================================== > --- include/ib_verbs.h (revision 880) > +++ include/ib_verbs.h (working copy) > @@ -320,11 +320,11 @@ > }; > > enum ib_qp_type { > + IB_QPT_SMI, /* SMI type = QP index 0 */ > + IB_QPT_GSI, /* GSI type = QP index 1 */ I've committed this change. From robert.j.woodruff at intel.com Fri Sep 24 14:06:59 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Fri, 24 Sep 2004 14:06:59 -0700 Subject: [openib-general] [ANNOUNCE] New Release of IBAL posted to the sourceforge web site. 
Message-ID: <1AC79F16F5C5284499BB9591B33D6F000205E963@orsmsx408> A new Beta release tar ball of the IBAL InfiniBand stack has been posted to sourceforge InfiniBand project. This release is based on Bitkeeper change set 284, plus one additional complib bug fix that will be pushed to bitkeeper soon. http://sourceforge.net/projects/infiniband The code has a couple of more kernel panic bug fixes and a new feature to prevent applications from pinning down too much memory and causing the kernel to hang as a result. See the Release Notes.txt in the tar ball for details. From mshefty at ichips.intel.com Fri Sep 24 16:10:54 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 24 Sep 2004 16:10:54 -0700 Subject: [openib-general] [PATCH] reference counting added to ib_mad_agent Message-ID: <20040924161054.050a0d5e.mshefty@ichips.intel.com> This patch adds reference counting for MAD agents to protect against deregistration while a callback is being invoked. As part of the structure changes to support reference counting, deregistration code has been simplified, and a bug has been fixed where multiple port structures were being stored in the same pointer. Note that when sending MADs, the code currently holds a reference count from the time that the send is posted, until it completes and is returned to the user. 
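The reference-holding scheme described above can be sketched in userspace C. This is only an illustration under stated assumptions, not the code from the patch: C11 atomics and a pthread condition variable stand in for the kernel's atomic_t and wait queue, and the names struct agent, agent_get, agent_put, agent_unregister, and demo_unregister_waits are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the private MAD agent structure. */
struct agent {
    atomic_int refcount;     /* starts at 1, owned by the registration */
    pthread_mutex_t lock;    /* protects the condition variable */
    pthread_cond_t wait;     /* plays the role of the wait queue */
    int callbacks_run;       /* counts invoked "handlers" */
};

static void agent_init(struct agent *a)
{
    atomic_init(&a->refcount, 1);
    pthread_mutex_init(&a->lock, NULL);
    pthread_cond_init(&a->wait, NULL);
    a->callbacks_run = 0;
}

/* Take a reference before posting a send or invoking a callback. */
static void agent_get(struct agent *a)
{
    atomic_fetch_add(&a->refcount, 1);
}

/* Drop a reference; wake the unregistering thread on the last put. */
static void agent_put(struct agent *a)
{
    if (atomic_fetch_sub(&a->refcount, 1) == 1) {
        pthread_mutex_lock(&a->lock);
        pthread_cond_broadcast(&a->wait);
        pthread_mutex_unlock(&a->lock);
    }
}

/* Drop the registration reference, then block until all users are done. */
static void agent_unregister(struct agent *a)
{
    agent_put(a);
    pthread_mutex_lock(&a->lock);
    while (atomic_load(&a->refcount) != 0)
        pthread_cond_wait(&a->wait, &a->lock);
    pthread_mutex_unlock(&a->lock);
}

/* Simulated completion: run the handler, then release the reference. */
static void *completion_thread(void *arg)
{
    struct agent *a = arg;

    a->callbacks_run++;
    agent_put(a);
    return NULL;
}

/* Returns the number of callbacks that ran before unregistration finished. */
int demo_unregister_waits(void)
{
    struct agent a;
    pthread_t t;
    int ran;

    agent_init(&a);
    agent_get(&a);           /* reference held from "post" to "completion" */
    pthread_create(&t, NULL, completion_thread, &a);
    agent_unregister(&a);    /* cannot return before the completion's put */
    assert(atomic_load(&a.refcount) == 0);

    /* The join ensures the waker no longer touches 'a' before teardown. */
    pthread_join(t, NULL);
    ran = a.callbacks_run;
    pthread_cond_destroy(&a.wait);
    pthread_mutex_destroy(&a.lock);
    return ran;
}
```

In this sketch the unregistering thread only proceeds once the count reaches zero, so the handler has always finished by then. A real implementation would also need to make sure the thread issuing the wake-up is done with the structure before it is freed; the sketch sidesteps that with pthread_join.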
- Sean -- Index: access/ib_mad_priv.h =================================================================== --- access/ib_mad_priv.h (revision 885) +++ access/ib_mad_priv.h (working copy) @@ -97,6 +97,9 @@ struct list_head agent_list; struct ib_mad_agent agent; struct ib_mad_reg_req *reg_req; + struct ib_mad_port_private *port_priv; + atomic_t refcount; + wait_queue_head_t wait; u8 rmpp_version; }; Index: access/ib_mad.c =================================================================== --- access/ib_mad.c (revision 885) +++ access/ib_mad.c (working copy) @@ -221,9 +221,12 @@ /* Add mad agent into agent list */ list_add_tail(&mad_agent_priv->agent_list, &port_priv->agent_list); - spin_unlock_irqrestore(&port_priv->reg_lock, flags); + atomic_set(&mad_agent_priv->refcount, 1); + init_waitqueue_head(&mad_agent_priv->wait); + mad_agent_priv->port_priv = port_priv; + return &mad_agent_priv->agent; error3: @@ -241,37 +244,28 @@ */ int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent) { - struct ib_mad_port_private *entry; - struct ib_mad_agent_private *entry2, *temp; - unsigned long flags, flags2; + struct ib_mad_agent_private *mad_agent_priv; + unsigned long flags; - /* - * Rather than walk all the mad agent lists on all the mad ports, - * might use device in mad_agent and port number from mad agent QP - * but this approach has some downsides - */ - spin_lock_irqsave(&ib_mad_port_list_lock, flags); - list_for_each_entry(entry, &ib_mad_port_list, port_list) { - spin_lock_irqsave(&entry->reg_lock, flags2); - list_for_each_entry_safe(entry2, temp, - &entry->agent_list, agent_list) { - if (&entry2->agent == mad_agent) { - remove_mad_reg_req(entry2); - list_del(&entry2->agent_list); - - spin_unlock_irqrestore(&entry->reg_lock, flags2); - spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); - /* Release allocated structures */ - if (entry2->reg_req) - kfree(entry2->reg_req); - kfree(entry2); - return 0; - } - } - spin_unlock_irqrestore(&entry->reg_lock, flags2); - } - 
spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); - return 1; + mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, + agent); + + /* Cleanup outstanding sends/pending receives for this agent... */ + + spin_lock_irqsave(&mad_agent_priv->port_priv->reg_lock, flags); + remove_mad_reg_req(mad_agent_priv); + list_del(&mad_agent_priv->agent_list); + spin_unlock_irqrestore(&mad_agent_priv->port_priv->reg_lock, flags); + + atomic_dec(&mad_agent_priv->refcount); + wait_event(mad_agent_priv->wait, + !atomic_read(&mad_agent_priv->refcount)); + + if (mad_agent_priv->reg_req) + kfree(mad_agent_priv->reg_req); + kfree(mad_agent_priv); + + return 0; } EXPORT_SYMBOL(ib_unregister_mad_agent); @@ -287,7 +281,9 @@ struct ib_send_wr *cur_send_wr, *next_send_wr; struct ib_send_wr wr; struct ib_send_wr *bad_wr; - struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_port_private *port_priv; unsigned long flags; cur_send_wr = send_wr; @@ -297,6 +293,10 @@ return -EINVAL; } + mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, + agent); + port_priv = mad_agent_priv->port_priv; + /* Walk list of send WRs and post each on send list */ cur_send_wr = send_wr; while (cur_send_wr) { @@ -330,20 +330,25 @@ wr.send_flags = IB_SEND_SIGNALED; /* cur_send_wr->send_flags ? 
*/ /* Link send WR into posted send MAD list */ - spin_lock_irqsave(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, flags); + spin_lock_irqsave(&port_priv->send_list_lock, flags); list_add_tail(&mad_send_wr->send_list, - &((struct ib_mad_port_private *)mad_agent->device->mad)->send_posted_mad_list); - ((struct ib_mad_port_private *)mad_agent->device->mad)->send_posted_mad_count++; - spin_unlock_irqrestore(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, flags); + &port_priv->send_posted_mad_list); + port_priv->send_posted_mad_count++; + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + + /* Reference MAD agent until send completes. */ + atomic_inc(&mad_agent_priv->refcount); ret = ib_post_send(mad_agent->qp, &wr, &bad_wr); if (ret) { /* Unlink from posted send MAD list */ - spin_unlock_irqrestore(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, flags); + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); list_del(&mad_send_wr->send_list); - ((struct ib_mad_port_private *)mad_agent->device->mad)->send_posted_mad_count--; - spin_unlock_irqrestore(&((struct ib_mad_port_private *)mad_agent->device->mad)->send_list_lock, flags); + port_priv->send_posted_mad_count--; + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); *bad_send_wr = cur_send_wr; + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); printk(KERN_NOTICE "ib_post_mad_send failed\n"); return ret; } @@ -467,7 +472,7 @@ /* Make sure MAD registration request supplied */ if (!mad_reg_req) return 0; - private = priv->agent.device->mad; + private = priv->port_priv; class = &private->version[mad_reg_req->mgmt_class_version]; mgmt_class = convert_mgmt_class(mad_reg_req->mgmt_class); if (!*class) { @@ -541,7 +546,7 @@ return; } - port_priv = agent_priv->agent.device->mad; + port_priv = agent_priv->port_priv; class = port_priv->version[agent_priv->reg_req->mgmt_class_version]; if 
(!class) { printk(KERN_ERR "No class table yet MAD registration request supplied\n"); @@ -742,8 +747,11 @@ recv->header.recv_buf.mad, solicited); if (!mad_agent) { + spin_unlock_irqrestore(&port_priv->reg_lock, flags); printk(KERN_ERR "No matching mad agent found for receive MAD\n"); } else { + atomic_inc(&mad_agent->refcount); + spin_unlock_irqrestore(&port_priv->reg_lock, flags); if (solicited) { /* Walk the send posted list to find the match !!! */ printk(KERN_DEBUG "Currently unsupported solicited MAD received\n"); @@ -752,8 +760,10 @@ /* Invoke receive callback */ mad_agent->agent.recv_handler(&mad_agent->agent, &recv->header.recv_wc); + + if (atomic_dec_and_test(&mad_agent->refcount)) + wake_up(&mad_agent->wait); } - spin_unlock_irqrestore(&port_priv->reg_lock, flags); /* Post another receive request for this QP */ ib_mad_post_receive_mad(port_priv, port_priv->qp[qp_num]); @@ -765,7 +775,8 @@ static void ib_mad_send_done_handler(struct ib_mad_port_private *port_priv, struct ib_wc *wc) { - struct ib_mad_send_wr_private *send_wr; + struct ib_mad_send_wr_private *send_wr; + struct ib_mad_agent_private *mad_agent_priv; unsigned long flags; /* Completion corresponds to first entry on posted MAD send list */ @@ -781,6 +792,8 @@ goto error; } + mad_agent_priv = container_of(send_wr->agent, + struct ib_mad_agent_private, agent); /* Check whether timeout was requested !!! */ /* Remove from posted send MAD list */ @@ -795,10 +808,15 @@ /* Restore client wr_id in WC */ wc->wr_id = send_wr->wr_id; + /* Invoke client send callback */ send_wr->agent->send_handler(send_wr->agent, - (struct ib_mad_send_wc *)wc); - /* Release send MAD WR tracking structure */ + (struct ib_mad_send_wc *)wc); + + /* Release reference taken when sending. 
*/ + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + kfree(send_wr); return; @@ -1302,7 +1320,6 @@ } memset(port_priv, 0, sizeof *port_priv); - device->mad = port_priv; port_priv->device = device; port_priv->port_num = port_num; spin_lock_init(&port_priv->reg_lock); @@ -1444,7 +1461,6 @@ /* Handle deallocation of MAD registration tables!!! */ kfree(port_priv); - device->mad = NULL; return 0; } From roland at topspin.com Sat Sep 25 09:12:45 2004 From: roland at topspin.com (Roland Dreier) Date: Sat, 25 Sep 2004 09:12:45 -0700 Subject: [openib-general] mthca startup problem In-Reply-To: <003901c4a172$ec3a3d80$4302000a@Gripen> (Hal Rosenstock's message of "Thu, 23 Sep 2004 09:40:34 -0400") References: <220970-220049221192831799@M2W066.mail2web.com> <52y8j3o1q1.fsf@topspin.com> <005101c4a0bf$d93b33f0$4302000a@Gripen> <52ekkumfyf.fsf@topspin.com> <003901c4a172$ec3a3d80$4302000a@Gripen> Message-ID: <52brful43m.fsf@topspin.com> Hal> 1. mthca_cmd.c has a number of compile warnings (built with Hal> debug configured). I fixed the ones around line 480 ("SYS_EN DDR..."). Do you get any others? Hal> 2. ipoib_main.c line 48 has a compile error indicating Hal> directives may not be use inside macro arguments. This is Hal> with both ipoib debugs on. OK, I fixed this (slightly differently from your suggestion). Hal> 3. mad_statuc:96 complains about line 96 "Couldn't find Hal> suitable network device; setting lid_base to 1". Is this OK ? Hal> Also, why is this done and can it be shut off ? The warning is benign. This was done because there are applications that want to make loopback connections before the SM has configured the local LID, but not have them break when the SM does assign a LID. It can be turned off by setting IB_ASSIGN_STATIC_LID to 0 in core/mad_priv.h. By the way, what compiler are you using? I had been doing all my testing with gcc 3.3 and 3.4, which is why I didn't see the problems you reported. 
I've now added gcc 2.95 to the set of compilers I test build with so I should be able to avoid most of these problems in the future. However to get gcc 2.95 working I had to fix a few problems you didn't report... - Roland From roland at topspin.com Sat Sep 25 09:13:49 2004 From: roland at topspin.com (Roland Dreier) Date: Sat, 25 Sep 2004 09:13:49 -0700 Subject: [openib-general] Re: IPoIB Loading and Starting In-Reply-To: <004d01c4a177$08257830$4302000a@Gripen> (Hal Rosenstock's message of "Thu, 23 Sep 2004 10:10:04 -0400") References: <004d01c4a177$08257830$4302000a@Gripen> Message-ID: <527jqil41u.fsf@topspin.com> Hal> 1. It appears I need the following modules in this order: Hal> core/ib_client_query.ko core/ib_sa_client.ko Hal> ulp/ipoib/ib_ipoib.ko ulp/ipoib/ib_ip2pr.ko Don't worry about which modules or what order. Just do "modprobe ib_ipoib" and "modprobe ib_mthca" (in either order) and it should work fine (assuming you run depmod first). Hal> 2. After all the above modules are loaded, do I just Hal> configure the ib0/1 interfaces ? Something like: Hal> /sbin/ifconfig ib0 192.168.0.101 netmask 255.255.255.0 up Hal> Does anything else need to be done before it should work Hal> (other than an SM bringing these links to active) ? Nope, that's it. - Roland From roland at topspin.com Sat Sep 25 09:16:49 2004 From: roland at topspin.com (Roland Dreier) Date: Sat, 25 Sep 2004 09:16:49 -0700 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> (Michael Krause's message of "Thu, 23 Sep 2004 08:34:35 -0700") References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> Message-ID: <523c16l3wu.fsf@topspin.com> Michael> I would have thought this would be part of the IP over IB Michael> driver. Communicate with the SM to acquire the P_Key and Michael> then use that to perform Arp / ND. 
Strange to require a Michael> command since this gets a bit cumbersome when there are Michael> many nodes in the fabric. That is a nice idea in theory but it falls down in practice. First, the IPoIB driver is often loaded during the boot process, and init scripts may want to configure IPoIB interfaces before the SM has discovered the local node. Also, a node may have many P_Keys assigned, only a few of which are used for IP traffic. Administrators don't like to have to deal with 50+ network devices. Thanks, Roland From roland at topspin.com Sat Sep 25 09:17:03 2004 From: roland at topspin.com (Roland Dreier) Date: Sat, 25 Sep 2004 09:17:03 -0700 Subject: [openib-general] [PATCH] [TRIVIAL] ipoib: Fix compile problem with data path debug on Take 2 In-Reply-To: <1095971657.2644.4.camel@hpc-1> (Hal Rosenstock's message of "Thu, 23 Sep 2004 16:34:18 -0400") References: <1095971657.2644.4.camel@hpc-1> Message-ID: <52y8iyjpc0.fsf@topspin.com> Thanks, I fixed this slightly differently already. - Roland From roland at topspin.com Sat Sep 25 09:21:15 2004 From: roland at topspin.com (Roland Dreier) Date: Sat, 25 Sep 2004 09:21:15 -0700 Subject: [openib-general] Re: [PATCH] Move SDP to dynamic device enumeration In-Reply-To: <20040924084540.A2899@topspin.com> (Libor Michalek's message of "Fri, 24 Sep 2004 08:45:40 -0700") References: <52r7p3wesd.fsf@topspin.com> <20040924084540.A2899@topspin.com> Message-ID: <52u0tmjp50.fsf@topspin.com> Libor> Sorry that I missed this earlier. This looks Libor> good. Although, the change in FMR initialization, once Libor> implemented, will result in more virtual address space Libor> utilization on systems with multiple HCAs... True, but I don't see any way around it given that someone could hotplug 50 HCAs after the SDP module is loaded. 
Also it's possible to implement FMRs to use vmalloc space more efficiently than the original VAPI code did, so I don't think this should be a problem in practice (the 1024 FMRs SDP wants should only take about 128 KB of vmalloc space). And any non-obsolete architecture has unlimited vmalloc space... - Roland From mst at mellanox.co.il Sat Sep 25 14:06:13 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sat, 25 Sep 2004 23:06:13 +0200 Subject: [openib-general] [PATCH] [TRIVIAL] ipoib: Fix compile problem with data path debug on In-Reply-To: <1095973511.25438.4.camel@duffman> References: <1095973511.25438.4.camel@duffman> Message-ID: <20040925210613.GB7717@mellanox.co.il> Tom, Not that I advocate using MUAs that break tabs, but you do know you can use patch -l to solve (or at least check for) tab/space issues, don't you? Quoting r. Tom Duffy (tduffy at sun.com) "Re: [openib-general] [PATCH] [TRIVIAL] ipoib: Fix compile problem with data path debug on": > Subject: > Date: Thu, 23 Sep 2004 23:04:11 +0200 > > > your patch did not apply correctly. perhaps your MUA is changing tabs to > spaces? > > [tduffy at duffman ipoib]$ patch ipoib_main.c /tmp/ipoib.hal > patching file ipoib_main.c > Hunk #1 FAILED at 43.
> 1 out of 1 hunk FAILED -- saving rejects to file ipoib_main.c.rej > > > On Thu, 2004-09-23 at 15:52 -0400, Hal Rosenstock wrote: > > ipoib: Fix compile problem with data path debug on > > > > Index: ipoib_main.c > > =================================================================== > > --- ipoib_main.c (revision 880) > > +++ ipoib_main.c (working copy) > > @@ -43,12 +43,14 @@ > > int debug_level; > > > > module_param(debug_level, int, 0644); > > +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA > > MODULE_PARM_DESC(debug_level, > > "Enable debug tracing if > 0" > > -#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA > > - " and data path tracing if > 1" > > + " and data path tracing if > 1"); > > +#else > > +MODULE_PARM_DESC(debug_level, > > + "Enable debug tracing if > 0"); > > #endif > > - ); > > > > int mcast_debug_level; > -- > "When they took the 4th Amendment, I was quiet because I didn't deal > drugs. When they took the 6th Amendment, I was quiet because I am > innocent. When they took the 2nd Amendment, I was quiet because I don't > own a gun. Now they have taken the 1st Amendment, and I can only be > quiet." --Lyle Myhr From mst at mellanox.co.il Sat Sep 25 14:12:45 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sat, 25 Sep 2004 23:12:45 +0200 Subject: [openib-general] [PATCH] change SMI/GSI QP types to match QP index values In-Reply-To: <20040924135657.1926385f.mshefty@ichips.intel.com> References: <20040924103803.2cb4f18b.mshefty@ichips.intel.com> <20040924135657.1926385f.mshefty@ichips.intel.com> Message-ID: <20040925211244.GC7717@mellanox.co.il> Quoting r.
Sean Hefty (mshefty at ichips.intel.com) "Re: [openib-general] [PATCH] change SMI/GSI QP types to match QP index values": > On Fri, 24 Sep 2004 10:38:03 -0700 > Sean Hefty wrote: > > > > > Index: include/ib_verbs.h > > =================================================================== > > --- include/ib_verbs.h (revision 880) > > +++ include/ib_verbs.h (working copy) > > @@ -320,11 +320,11 @@ > > }; > > > > enum ib_qp_type { > > + IB_QPT_SMI, /* SMI type = QP index 0 */ > > + IB_QPT_GSI, /* GSI type = QP index 1 */ > > I've committed this change. > _______________________________________________ I think you really want: enum ib_qp_type { IB_QPT_SMI=0, IB_QPT_GSI=1 to stress the fact that you really want to assign specific values to IB_QPT_SMI/IB_QPT_GSI or the code would not work correctly. MST From halr at voltaire.com Sun Sep 26 04:58:54 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Sun, 26 Sep 2004 07:58:54 -0400 Subject: [openib-general] mthca startup problem In-Reply-To: <52brful43m.fsf@topspin.com> References: <220970-220049221192831799@M2W066.mail2web.com> <52y8j3o1q1.fsf@topspin.com> <005101c4a0bf$d93b33f0$4302000a@Gripen> <52ekkumfyf.fsf@topspin.com> <003901c4a172$ec3a3d80$4302000a@Gripen> <52brful43m.fsf@topspin.com> Message-ID: <1096199932.1835.46.camel@localhost.localdomain> On Sat, 2004-09-25 at 12:12, Roland Dreier wrote: > Hal> 1. mthca_cmd.c has a number of compile warnings (built with > Hal> debug configured). > > I fixed the ones around line 480 ("SYS_EN DDR..."). > Do you get any others? Nope; just there. > Hal> 2. ipoib_main.c line 48 has a compile error indicating > Hal> directives may not be use inside macro arguments. This is > Hal> with both ipoib debugs on. > > OK, I fixed this (slightly differently from your suggestion). > > Hal> 3. mad_statuc:96 complains about line 96 "Couldn't find > Hal> suitable network device; setting lid_base to 1". Is this OK ? > Hal> Also, why is this done and can it be shut off ? 
> > The warning is benign. > > This was done because there are applications that want to make > loopback connections before the SM has configured the local LID, but > not have them break when the SM does assign a LID. Thanks. Assuming OpenIB uses this IPoIB implementation, would this be preserved ? It seems like it would remain based on your comments to Mike on the rationale for this. > It can be turned off by setting IB_ASSIGN_STATIC_LID to 0 in core/mad_priv.h. > > > By the way, what compiler are you using? I'm using gcc 3.2.2 and on an older machine 3.2. > I had been doing all my > testing with gcc 3.3 and 3.4, which is why I didn't see the problems > you reported. I've now added gcc 2.95 to the set of compilers I test > build with so I should be able to avoid most of these problems in the > future. However to get gcc 2.95 working I had to fix a few problems > you didn't report... -- Hal From gdror at mellanox.co.il Sun Sep 26 11:17:08 2004 From: gdror at mellanox.co.il (Dror Goldenberg) Date: Sun, 26 Sep 2004 20:17:08 +0200 Subject: [openib-general] ib_mad shutdown WC status code ? Message-ID: <506C3D7B14CDD411A52C00025558DED605F9DAA5@mtlex01.yok.mtl.com> > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Friday, September 24, 2004 6:41 PM > I think that flushed makes sense. With queuing in the access > layer, we should be able to stop and restart the QP in most > error cases without affecting the user. If the link goes > down, flushing the requests seems reasonable. What other > cases would cause the MAD layer to shut down? > If link goes down, UD QP will go on "sending" the packets as if link was up all the time. You wouldn't get flushed WQEs in this case for a UD QP. In general, for UD QPs, you'll get flushed WQEs when the QP moves to the error state and that seldom happens. 
The exact scenarios for this to happen (hopefully I don't miss anything): - s/g element points at a wrong address/protection error - HCA can't fetch a WQE (e.g. bus error) -Dror > - Sean -------------- next part -------------- An HTML attachment was scrubbed... URL: From gdror at mellanox.co.il Sun Sep 26 12:53:21 2004 From: gdror at mellanox.co.il (Dror Goldenberg) Date: Sun, 26 Sep 2004 21:53:21 +0200 Subject: [openib-general] ib_mad shutdown WC status code ? Message-ID: <506C3D7B14CDD411A52C00025558DED605F9DAB7@mtlex01.yok.mtl.com> completing the list... -----Original Message----- From: Dror Goldenberg [mailto:gdror at mellanox.co.il] Sent: Sunday, September 26, 2004 8:17 PM To: Sean Hefty; Fab Tillier Cc: openib-general at openib.org Subject: RE: [openib-general] ib_mad shutdown WC status code ? > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com ] > Sent: Friday, September 24, 2004 6:41 PM > I think that flushed makes sense. With queuing in the access > layer, we should be able to stop and restart the QP in most > error cases without affecting the user. If the link goes > down, flushing the requests seems reasonable. What other > cases would cause the MAD layer to shut down? > If link goes down, UD QP will go on "sending" the packets as if link was up all the time. You wouldn't get flushed WQEs in this case for a UD QP. Receive queue is unaffected by the link going down. In general, for UD QPs, you'll get flushed WQEs when the QP moves to the error state and that seldom happens. The exact scenarios for this to happen (hopefully I don't miss anything): - s/g element points at a wrong address/protection error - HCA can't fetch a WQE (e.g. bus error) - manually move the QP to the error state (modify QP) -Dror > - Sean -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From iod00d at hp.com Sun Sep 26 21:45:20 2004 From: iod00d at hp.com (Grant Grundler) Date: Sun, 26 Sep 2004 21:45:20 -0700 Subject: [openib-general] mthca startup problem In-Reply-To: <52brful43m.fsf@topspin.com> References: <220970-220049221192831799@M2W066.mail2web.com> <52y8j3o1q1.fsf@topspin.com> <005101c4a0bf$d93b33f0$4302000a@Gripen> <52ekkumfyf.fsf@topspin.com> <003901c4a172$ec3a3d80$4302000a@Gripen> <52brful43m.fsf@topspin.com> Message-ID: <20040927044520.GG24541@cup.hp.com> On Sat, Sep 25, 2004 at 09:12:45AM -0700, Roland Dreier wrote: > I've now added gcc 2.95 to the set of compilers I test > build with so I should be able to avoid most of these problems in the > future. I'm not sure this is useful. 2.9x is certainly not interesting to kernel.org. I also don't think any supported distros ship gcc v2.9x. (A few arches on Debian Stable might be...but that's 2+ years old and on its last legs). grant From yaronh at voltaire.com Mon Sep 27 02:56:20 2004 From: yaronh at voltaire.com (Yaron Haviv) Date: Mon, 27 Sep 2004 11:56:20 +0200 Subject: [openib-general] IPoIB Loading and Starting Message-ID: <35EA21F54A45CB47B879F21A91F4862F1DF5FB@taurus.voltaire.com> > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier > Sent: Saturday, September 25, 2004 6:17 PM > To: Michael Krause > Cc: openib-general at openib.org > Subject: Re: [openib-general] IPoIB Loading and Starting > > Michael> I would have thought this would be part of the IP over IB > Michael> driver. Communicate with the SM to acquire the P_Key and > Michael> then use that to perform Arp / ND. Strange to require a > Michael> command since this gets a bit cumbersome when there are > Michael> many nodes in the fabric. > > That is a nice idea in theory but it falls down in practice.
> First, the IPoIB driver is often loaded during the boot > process, and init scripts may want to configure IPoIB > interfaces before the SM has discovered the local node. > Also, a node may have many P_Keys assigned, only a few of > which are used for IP traffic. Administrators don't like to > have to deal with 50+ network devices. I tend to agree with Mike that configuration here may be redundant. IPoIB interfaces can pop up like USB devices when the P_Key table is filled (in a default DHCP client mode), with one interface per P_Key by default (unless configured otherwise). Anyway, a system/interface cannot be used if the port is not initialized by the SM, so it is not critical to bring it up sooner. If the interface is configured with an IP + Mask + .. it should be saved (associated with the right P_Key, as in the VLAN case). The next time the system boots, when the P_Key is discovered it will try to match it with the existing config; if none is found, it is configured to use DHCP. I don't see many partitions (if any) that don't have an IPoIB interface associated with them. Anyway, the user may also configure IPoIB to skip partitions, i.e. not bring up an IPoIB interface for a certain partition. (It will also happen automatically: in case you don't want to have an IPoIB interface for a certain partition, just choose one that doesn't have any DHCP server listening on it; that will result in a non-active interface on that P_Key, since no one will configure it.) The benefits are clear: Zero configuration and P&P No need for manual user intervention for the default (DHCP) case Limit configuration errors (e.g.
conflicting parameters) Works better in a utility computing type of environment or large stateless clusters which are the sweet spot for IB Yaron From halr at voltaire.com Mon Sep 27 06:00:27 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 27 Sep 2004 09:00:27 -0400 Subject: [openib-general] [PATCH] ib_mad: Receive path fixes Message-ID: <1096290027.1836.14.camel@localhost.localdomain> ib_mad: Receive path fixes I have now successfully handled a receive SMP without a matching registration including the reposting of receive buffer (working around (not included in this patch) a bug relative to the receive list handling). I think the receive side will be working once I get past that :-) Index: ib_mad_priv.h =================================================================== --- ib_mad_priv.h (revision 885) +++ ib_mad_priv.h (working copy) @@ -77,6 +77,14 @@ #define MAX_MGMT_VERSION 8 +union ib_mad_recv_wrid { + u64 wrid; + struct { + u32 index; + u32 qpn; + } wrid_field; +}; + struct ib_mad_private_header { struct ib_mad_recv_wc recv_wc; /* must be first member (for now !!!) 
*/ struct ib_mad_recv_buf recv_buf; @@ -117,11 +125,11 @@ struct ib_mad_thread_private { wait_queue_head_t wait; + atomic_t completion_event; }; struct ib_mad_port_private { struct list_head port_list; - struct task_struct *mad_thread; struct ib_device *device; int port_num; struct ib_qp *qp[IB_MAD_QPS_SUPPORTED]; @@ -140,7 +148,9 @@ spinlock_t recv_list_lock; struct list_head recv_posted_mad_list[IB_MAD_QPS_SUPPORTED]; int recv_posted_mad_count[IB_MAD_QPS_SUPPORTED]; + u32 recv_wr_index[IB_MAD_QPS_SUPPORTED]; + struct task_struct *mad_thread; struct ib_mad_thread_private mad_thread_private; }; Index: ib_mad.c =================================================================== --- ib_mad.c (revision 885) +++ ib_mad.c (working copy) @@ -219,7 +219,7 @@ goto error3; } - /* Add mad agent into agent list */ + /* Add mad agent into port's agent list */ list_add_tail(&mad_agent_priv->agent_list, &port_priv->agent_list); spin_unlock_irqrestore(&port_priv->reg_lock, flags); @@ -257,6 +257,10 @@ &entry->agent_list, agent_list) { if (&entry2->agent == mad_agent) { remove_mad_reg_req(entry2); + + /* Check for any pending send MADs for this agent !!! 
*/ + + /* Remove mad agent from port's agent list */ list_del(&entry2->agent_list); spin_unlock_irqrestore(&entry->reg_lock, flags2); @@ -576,7 +580,7 @@ * as QP numbers will not be packed once redirection supported */ if (qp_num > 1) { - printk(KERN_ERR "QP number %d invalid\n", qp_num); + return -1; } return qp_num; } @@ -680,28 +684,36 @@ struct ib_wc *wc) { struct ib_mad_private *recv; + union ib_mad_recv_wrid wrid; unsigned long flags; u32 qp_num; struct ib_mad_agent_private *mad_agent; - int solicited; + int solicited, qpn; + int callback = 0; - /* For receive, WC WRID is the QP number */ - qp_num = wc->wr_id; - + /* For receive, QP number is field in the WC WRID */ + wrid.wrid = wc->wr_id; + qp_num = wrid.wrid_field.qpn; + qpn = convert_qpnum(qp_num); + if (qpn == -1) { + printk(KERN_ERR "Packet received on unknown QPN %d\n", qp_num); + return; + } + /* * Completion corresponds to first entry on * posted MAD receive list based on WRID in completion */ spin_lock_irqsave(&port_priv->recv_list_lock, flags); - if (!list_empty(&port_priv->recv_posted_mad_list[convert_qpnum(qp_num)])) { - recv = list_entry(&port_priv->recv_posted_mad_list[convert_qpnum(qp_num)], + if (!list_empty(&port_priv->recv_posted_mad_list[qpn])) { + recv = list_entry(&port_priv->recv_posted_mad_list[qpn], struct ib_mad_private, header.recv_buf.list); /* Remove from posted receive MAD list */ list_del(&recv->header.recv_buf.list); - port_priv->recv_posted_mad_count[convert_qpnum(qp_num)]--; + port_priv->recv_posted_mad_count[qpn]--; } else { printk(KERN_ERR "Receive completion WR ID 0x%Lx on QP %d with no posted receive\n", wc->wr_id, qp_num); @@ -724,7 +736,7 @@ recv->header.recv_buf.list.next = NULL; /* Until RMPP implemented !!! 
*/ recv->header.recv_buf.mad = (struct ib_mad *)&recv->mad; if (wc->wc_flags & IB_WC_GRH) { - recv->header.recv_buf.grh = (struct ib_grh *)&recv->grh; + recv->header.recv_buf.grh = &recv->grh; } else { recv->header.recv_buf.grh = NULL; } @@ -746,19 +758,25 @@ } else { if (solicited) { /* Walk the send posted list to find the match !!! */ - printk(KERN_DEBUG "Currently unsupported solicited MAD received\n"); + printk(KERN_DEBUG "Receive solicited MAD currently unsupported\n"); } + callback = 1; /* Invoke receive callback */ mad_agent->agent.recv_handler(&mad_agent->agent, &recv->header.recv_wc); } spin_unlock_irqrestore(&port_priv->reg_lock, flags); +ret: + if (!callback) { + /* Should this case be optimized ? */ + kmem_cache_free(ib_mad_cache, recv); + } + /* Post another receive request for this QP */ ib_mad_post_receive_mad(port_priv, port_priv->qp[qp_num]); -ret: return; } @@ -814,22 +832,21 @@ { struct ib_wc wc; int err_status = 0; - - while (!ib_poll_cq(port_priv->cq, 1, &wc)) { - printk(KERN_DEBUG "Completion - WR ID = 0x%Lx\n", wc.wr_id); - + + while (ib_poll_cq(port_priv->cq, 1, &wc) == 1) { + printk(KERN_DEBUG "Completion opcode 0x%x WRID 0x%Lx\n", wc.opcode, wc.wr_id); if (wc.status != IB_WC_SUCCESS) { switch (wc.opcode) { case IB_WC_SEND: - printk(KERN_ERR "Send completion error: %d\n", + printk(KERN_ERR "Send completion error %d\n", wc.status); break; case IB_WC_RECV: - printk(KERN_ERR "Recv completion error: %d\n", + printk(KERN_ERR "Recv completion error %d\n", wc.status); break; default: - printk(KERN_ERR "Unknown completion: %d with error\n", wc.opcode); + printk(KERN_ERR "Unknown completion %d with error %d\n", wc.opcode, wc.status); break; } err_status = 1; @@ -844,9 +861,9 @@ ib_mad_recv_done_handler(port_priv, &wc); break; default: - printk(KERN_ERR "Wrong Opcode: %d\n", wc.opcode); + printk(KERN_ERR "Wrong Opcode %d on completion\n", wc.opcode); if (wc.status) { - printk(KERN_ERR "Completion error: %d\n", wc.status); + printk(KERN_ERR 
"Completion error %d\n", wc.status); } } @@ -855,7 +872,6 @@ if (err_status) { ib_mad_port_restart(port_priv); } else { - ib_mad_post_receive_mads(port_priv); ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); } } @@ -871,7 +887,9 @@ while (1) { while (!signal_pending(current)) { - ret = wait_event_interruptible(mad_thread_priv->wait, 0); + ret = wait_event_interruptible(mad_thread_priv->wait, + atomic_read(&mad_thread_priv->completion_event) > 0); + atomic_set(&mad_thread_priv->completion_event, 0); if (ret) { printk(KERN_ERR "ib_mad thread exiting\n"); return 0; @@ -890,6 +908,7 @@ { struct ib_mad_thread_private *mad_thread_priv = &port_priv->mad_thread_private; + atomic_set(&mad_thread_priv->completion_event, 0); init_waitqueue_head(&mad_thread_priv->wait); port_priv->mad_thread = kthread_create(ib_mad_thread, @@ -898,10 +917,11 @@ port_priv->device->name, port_priv->port_num); if (IS_ERR(port_priv->mad_thread)) { - printk(KERN_ERR "Couldn't start mad thread for %s port %d\n", + printk(KERN_ERR "Couldn't start ib_mad thread for %s port %d\n", port_priv->device->name, port_priv->port_num); return 1; } + wake_up_process(port_priv->mad_thread); return 0; } @@ -918,6 +938,7 @@ struct ib_mad_port_private *port_priv = cq->cq_context; struct ib_mad_thread_private *mad_thread_priv = &port_priv->mad_thread_private; + atomic_inc(&mad_thread_priv->completion_event); wake_up_interruptible(&mad_thread_priv->wait); } @@ -930,7 +951,16 @@ struct ib_recv_wr *bad_recv_wr; unsigned long flags; int ret; + union ib_mad_recv_wrid wrid; + int qpn; + + qpn = convert_qpnum(qp->qp_num); + if (qpn == -1) { + printk(KERN_ERR "Post receive to invalid QPN %d\n", qp->qp_num); + return -EINVAL; + } + /* * Allocate memory for receive buffer. 
* This is for both MAD and private header @@ -949,7 +979,7 @@ /* Setup scatter list */ sg_list.addr = pci_map_single(port_priv->device->dma_device, &mad_priv->grh, - sizeof *mad_priv - sizeof mad_priv->header, + sizeof *mad_priv - sizeof mad_priv->header, PCI_DMA_FROMDEVICE); sg_list.length = sizeof *mad_priv - sizeof mad_priv->header; sg_list.lkey = (*port_priv->mr).lkey; @@ -959,13 +989,15 @@ recv_wr.sg_list = &sg_list; recv_wr.num_sge = 1; recv_wr.recv_flags = IB_RECV_SIGNALED; - recv_wr.wr_id = qp->qp_num; /* 32 bits left */ + wrid.wrid_field.index = port_priv->recv_wr_index[qpn]++; + wrid.wrid_field.qpn = qp->qp_num; + recv_wr.wr_id = wrid.wrid; /* Link receive WR into posted receive MAD list */ spin_lock_irqsave(&port_priv->recv_list_lock, flags); list_add_tail(&mad_priv->header.recv_buf.list, - &port_priv->recv_posted_mad_list[convert_qpnum(qp->qp_num)]); - port_priv->recv_posted_mad_count[convert_qpnum(qp->qp_num)]++; + &port_priv->recv_posted_mad_list[qpn]); + port_priv->recv_posted_mad_count[qpn]++; spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); pci_unmap_addr_set(&mad_priv->header, mapping, sg_list.addr); @@ -982,11 +1014,11 @@ /* Unlink from posted receive MAD list */ spin_lock_irqsave(&port_priv->recv_list_lock, flags); list_del(&mad_priv->header.recv_buf.list); - port_priv->recv_posted_mad_count[convert_qpnum(qp->qp_num)]--; + port_priv->recv_posted_mad_count[qpn]--; spin_unlock_irqrestore(&port_priv->recv_list_lock, flags); kmem_cache_free(ib_mad_cache, mad_priv); - printk(KERN_NOTICE "ib_post_recv failed ret = %d\n", ret); + printk(KERN_NOTICE "ib_post_recv WRID 0x%Lx failed ret = %d\n", recv_wr.wr_id, ret); return -EINVAL; } @@ -1028,6 +1060,8 @@ /* PCI mapping !!! 
*/ + list_del(&port_priv->recv_posted_mad_list[i]); + } INIT_LIST_HEAD(&port_priv->recv_posted_mad_list[i]); port_priv->recv_posted_mad_count[i] = 0; @@ -1045,11 +1079,9 @@ spin_lock_irqsave(&port_priv->send_list_lock, flags); while (!list_empty(&port_priv->send_posted_mad_list)) { - /* PCI mapping ? */ - list_del(&port_priv->send_posted_mad_list); - /* Call completion handler with some status ? */ + /* Call completion handler with flushed status !!! */ } INIT_LIST_HEAD(&port_priv->send_posted_mad_list); @@ -1345,6 +1377,7 @@ qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE; qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG; qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; + /* Until Roland's ib_verbs.h ip_qp_types enum reordered !!! */ if (i == 0) qp_init_attr.qp_type = IB_QPT_SMI; else @@ -1373,6 +1406,7 @@ for (i = 0; i < IB_MAD_QPS_SUPPORTED; i++) { INIT_LIST_HEAD(&port_priv->recv_posted_mad_list[i]); port_priv->recv_posted_mad_count[i] = 0; + port_priv->recv_wr_index[i] = 0; } ret = ib_mad_thread_init(port_priv); From krause at cup.hp.com Mon Sep 27 06:51:43 2004 From: krause at cup.hp.com (Michael Krause) Date: Mon, 27 Sep 2004 06:51:43 -0700 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <523c16l3wu.fsf@topspin.com> References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> <523c16l3wu.fsf@topspin.com> Message-ID: <6.1.2.0.2.20040927064203.01e60488@esmail.cup.hp.com> At 09:16 AM 9/25/2004, Roland Dreier wrote: > Michael> I would have thought this would be part of the IP over IB > Michael> driver. Communicate with the SM to acquire the P_Key and > Michael> then use that to perform Arp / ND. Strange to require a > Michael> command since this gets a bit cumbersome when there are > Michael> many nodes in the fabric. > >That is a nice idea in theory but it falls down in practice. 
First, >the IPoIB driver is often loaded during the boot process, and init >scripts may want to configure IPoIB interfaces before the SM has >discovered the local node. If the SM isn't up or hasn't discovered the local node, then the node isn't going to be communicating with anything else until this occurs. To get around this race condition, one simply has an event handler associated with the change in state of the CA, i.e. the SM has configured the CA so there is a clear communication path and P_Key can be discovered. A CA should know when the link is operational as it needs to find the other routing information in order to operate. Hence, boot isn't really an issue if using the IB management methods provided in the specifications. Note: There is nothing that prevents using a default P_Key (same applies to Q_Key) - that wasn't the point. > Also, a node may have many P_Keys assigned, only a few of which are used > for IP traffic. Administrators >don't like to have to deal with 50+ network devices. What does this have to do with anything? I'm not advocating having a large number of P_Key for IP traffic. The SM manages the P_Key space for IP - can be single P_Key or multiple. The SM has to understand whether a CA is part of a given P_Key domain (same for switch ports). This is a single point of management technology to start with so I don't see the 50+ or even 48K network device problem. The IPoverIB spec allows for multiple P_Key to be used or just one. This should all be done by the SM and the CA should be asking it for what value to use. It should not be done within the CA else the implementation is establishing management policies that were clearly not the intent of the IB specifications and could lead to interoperability problems or more administrative overhead. Mike -------------- next part -------------- An HTML attachment was scrubbed...
URL: From halr at voltaire.com Mon Sep 27 07:32:13 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 27 Sep 2004 10:32:13 -0400 Subject: [openib-general] Re: [PATCH] reference counting added to ib_mad_agent In-Reply-To: <20040924161054.050a0d5e.mshefty@ichips.intel.com> References: <20040924161054.050a0d5e.mshefty@ichips.intel.com> Message-ID: <1096295533.1830.50.camel@localhost.localdomain> On Fri, 2004-09-24 at 19:10, Sean Hefty wrote: > This patch adds reference counting for MAD agents to protect > against deregistration while a callback is being invoked. > As part of the structure changes to support reference counting, > deregistration code has been simplified, and a bug has been fixed > where multiple port structures were being stored in the same > pointer. > > Note that when sending MADs, the code currently holds a reference > count from the time that the send is posted, until it completes > and is returned to the user. Thanks! Applied. Please verify that the deregistration code is as desired, as it was rejected due to other changes that went in before the patch was applied. -- Hal From roland at topspin.com Mon Sep 27 07:55:29 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Sep 2004 07:55:29 -0700 Subject: [openib-general] IPoIB Loading and Starting References: <35EA21F54A45CB47B879F21A91F4862F1DF5FB@taurus.voltaire.com> Message-ID: <52wtyfhice.fsf@topspin.com> Yaron> IPoIB interfaces can pop up like USB devices when the P_Key Yaron> table is filled (in a default DHCP client mode) With one Yaron> Interface per P_Key by default (unless configured Yaron> otherwise) Anyway a system/interface cannot be used if the Yaron> Port is not initialized by the SM so it is not critical to Yaron> bring it up sooner Thinking about this, you're mostly right. I was stuck in the world of /etc/sysconfig/network-scripts, but we can take advantage of the fact that new network interfaces cause hotplug events.
There is a small issue of how to name the interfaces to solve but we should be able to work that out. Thanks, Roland From roland at topspin.com Mon Sep 27 07:55:28 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Sep 2004 07:55:28 -0700 Subject: [openib-general] [PATCH] change SMI/GSI QP types to match QP index values References: <20040924103803.2cb4f18b.mshefty@ichips.intel.com> <20040924135657.1926385f.mshefty@ichips.intel.com> <20040925211244.GC7717@mellanox.co.il> Message-ID: <523c13iwwv.fsf@topspin.com> Michael> I think you really want: Michael> enum ib_qp_type { IB_QPT_SMI=0, IB_QPT_GSI=1 Michael> to stress the fact that you really want to assign Michael> specific values to IB_QPT_SMI/IB_QPT_GSI or the code Michael> would not work correctly. Hmm, this seems even more dangerous -- what happens if someone adds another type like below? enum ib_qp_type { IB_QPT_RD, IB_QPT_SMI=0, IB_QPT_GSI=1 In fact I don't like this abuse of enum ib_qp_type at all -- it looks to me like we're introducing a very brittle overloading of the enum for a quite minimal gain. - Roland From roland at topspin.com Mon Sep 27 07:55:30 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Sep 2004 07:55:30 -0700 Subject: [openib-general] IPoIB Loading and Starting References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> <523c16l3wu.fsf@topspin.com> <6.1.2.0.2.20040927064203.01e60488@esmail.cup.hp.com> Message-ID: <52r7onhicd.fsf@topspin.com> Roland> Also, a node may have many P_Keys assigned, only a few of Roland> which are used for IP traffic. Administrators don't like Roland> to have to deal with 50+ network devices. Michael> What does this have to do with anything? I'm not Michael> advocating having a large number of P_Key for IP traffic. The issue comes up on nodes that belong to many partitions. 
If we expect the IPoIB driver to create a network interface for every P_Key (the only sane thing the kernel can do if we don't want userspace to have to explicitly configure P_Keys) then a single HCA with a 64-entry P_Key table and 2 ports could create 128 IPoIB interfaces. In any case this is really a cosmetic issue. - Roland From roland at topspin.com Mon Sep 27 07:55:32 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Sep 2004 07:55:32 -0700 Subject: [openib-general] mthca startup problem References: <220970-220049221192831799@M2W066.mail2web.com> <52y8j3o1q1.fsf@topspin.com> <005101c4a0bf$d93b33f0$4302000a@Gripen> <52ekkumfyf.fsf@topspin.com> <003901c4a172$ec3a3d80$4302000a@Gripen> <52brful43m.fsf@topspin.com> <1096199932.1835.46.camel@localhost.localdomain> Message-ID: <52llevhicb.fsf@topspin.com> Roland> This was done because there are applications that want to Roland> make loopback connections before the SM has configured the Roland> local LID, but not have them break when the SM does assign Roland> a LID. Hal> Thanks. Assuming OpenIB uses this IPoIB implementation, would Hal> this be preserved ? It seems like it would remain based on Hal> your comments to Mike on the rationale for this. This is independent of IPoIB -- the actual code is in mad_static.c. (Not sure what my response about P_Keys has to do with this) In any case my expectation would be that initial drops would not have static (pre-SM) LID assignment. Once we have enough implemented to run apps that care about this, we can look for a better way to assign LIDs (preferably in userspace) -- I don't like the current IP-address based hack and I opposed it when it was added to Topspin's drivers. 
- Roland From roland at topspin.com Mon Sep 27 07:55:33 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Sep 2004 07:55:33 -0700 Subject: [openib-general] mthca startup problem References: <220970-220049221192831799@M2W066.mail2web.com> <52y8j3o1q1.fsf@topspin.com> <005101c4a0bf$d93b33f0$4302000a@Gripen> <52ekkumfyf.fsf@topspin.com> <003901c4a172$ec3a3d80$4302000a@Gripen> <52brful43m.fsf@topspin.com> <20040927044520.GG24541@cup.hp.com> Message-ID: <52fz53hica.fsf@topspin.com> Roland> I've now added gcc 2.95 to the set of compilers I test Roland> build with so I should be able to avoid most of these Roland> problems in the future. Grant> I'm not sure this is useful. 2.9x is certainly not Grant> interesting to kernel.org. Check Linus's changelogs -- I can find "fix build for gcc-2.95" changesets even in the last month or two. Also Documentation/Changes still lists gcc 2.95.3 as the minimal supported version. I can guarantee that if this code gets merged upstream with gcc 2.95 problems, someone will complain and I'll have to fix it. So I might as well avoid problems now. Thanks, Roland From mst at mellanox.co.il Mon Sep 27 08:05:43 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 27 Sep 2004 17:05:43 +0200 Subject: [openib-general] [PATCH] change SMI/GSI QP types to match QP index values In-Reply-To: <523c13iwwv.fsf@topspin.com> References: <20040924103803.2cb4f18b.mshefty@ichips.intel.com> <20040924135657.1926385f.mshefty@ichips.intel.com> <20040925211244.GC7717@mellanox.co.il> <523c13iwwv.fsf@topspin.com> Message-ID: <20040927150543.GA31313@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] [PATCH] change SMI/GSI QP types to match QP index values": > Michael> I think you really want: > > Michael> enum ib_qp_type { IB_QPT_SMI=0, IB_QPT_GSI=1 But it would be easier to spot than what we have now. 
> Michael> to stress the fact that you really want to assign > Michael> specific values to IB_QPT_SMI/IB_QPT_GSI or the code > Michael> would not work correctly. > > Hmm, this seems even more dangerous -- what happens if someone adds > another type like below? > > enum ib_qp_type { > IB_QPT_RD, > IB_QPT_SMI=0, > IB_QPT_GSI=1 > > In fact I don't like this abuse of enum ib_qp_type at all -- it looks > to me like we're introducing a very brittle overloading of the enum > for a quite minimal gain. > > - Roland I agree, the proper way is to have an inline function to get QP number by type. From yaronh at voltaire.com Mon Sep 27 08:14:45 2004 From: yaronh at voltaire.com (Yaron Haviv) Date: Mon, 27 Sep 2004 17:14:45 +0200 Subject: [openib-general] IPoIB Loading and Starting Message-ID: <35EA21F54A45CB47B879F21A91F4862F1DF639@taurus.voltaire.com> > > There is a small issue of how to name the interfaces to solve > but we should be able to work that out. > We can use the Linux VLAN convention of: IF_name.VLAN_tag And do for e.g. ipoib0.P_Key_value This way P_Key's can look familiar to IT people working with VLAN's And we also don't need a special place to store the P_Key value (since its part of the IF name) Yaron From krause at cup.hp.com Mon Sep 27 08:13:21 2004 From: krause at cup.hp.com (Michael Krause) Date: Mon, 27 Sep 2004 08:13:21 -0700 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <52r7onhicd.fsf@topspin.com> References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> <523c16l3wu.fsf@topspin.com> <6.1.2.0.2.20040927064203.01e60488@esmail.cup.hp.com> <52r7onhicd.fsf@topspin.com> Message-ID: <6.1.2.0.2.20040927081008.01da2090@esmail.cup.hp.com> At 07:55 AM 9/27/2004, Roland Dreier wrote: > Roland> Also, a node may have many P_Keys assigned, only a few of > Roland> which are used for IP traffic. Administrators don't like > Roland> to have to deal with 50+ network devices. 
> > Michael> What does this have to do with anything? I'm not > Michael> advocating having a large number of P_Key for IP traffic. > >The issue comes up on nodes that belong to many partitions. If we >expect the IPoIB driver to create a network interface for every P_Key >(the only sane thing the kernel can do if we don't want userspace to >have to explicitly configure P_Keys) then a single HCA with a 64-entry >P_Key table and 2 ports could create 128 IPoIB interfaces. > >In any case this is really a cosmetic issue. I don't follow why this is cosmetic. The SM will configure whatever number of P_Key per port (CA or switch) that it desires. IPoverIB does not and should not care about this. IPoverIB should probe the SM/SA to determine what and how many P_Key it should use. It should then establish unique interfaces for each P_Key. This is no different than what is done for 802.1Q where a separate driver instance exists per tag. This was our intent when we wrote the specification. Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From roland at topspin.com Mon Sep 27 08:36:54 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Sep 2004 08:36:54 -0700 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <6.1.2.0.2.20040927081008.01da2090@esmail.cup.hp.com> (Michael Krause's message of "Mon, 27 Sep 2004 08:13:21 -0700") References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> <523c16l3wu.fsf@topspin.com> <6.1.2.0.2.20040927064203.01e60488@esmail.cup.hp.com> <52r7onhicd.fsf@topspin.com> <6.1.2.0.2.20040927081008.01da2090@esmail.cup.hp.com> Message-ID: <52oejrg1ux.fsf@topspin.com> Michael> I don't follow why this is cosmetic. The SM will Michael> configure whatever number of P_Key per port (CA or Michael> switch) that it desires. IPoverIB does not and should Michael> not care about this. 
IPoverIB should probe the SM/SA to Michael> determine what and how many P_Key it should use. It Michael> should then establish unique interfaces for each P_Key. Michael> This is no different than what is done for 802.1Q where a Michael> separate driver instance exists per tag. This was our Michael> intent when we wrote the specification. As far as I know the SM doesn't know anything about IPoIB. What information can the IPoIB driver get from the SM/SA beyond what P_Keys are assigned to its local ports? (And this is already available from the local P_Key table) By the way, it's interesting that you mention 802.1q. The Linux implementation of VLAN tagging requires a userspace program (vlanconfig) to be run to create tagged interfaces. Thanks, Roland From roland at topspin.com Mon Sep 27 08:44:57 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Sep 2004 08:44:57 -0700 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F1DF639@taurus.voltaire.com> (Yaron Haviv's message of "Mon, 27 Sep 2004 17:14:45 +0200") References: <35EA21F54A45CB47B879F21A91F4862F1DF639@taurus.voltaire.com> Message-ID: <52hdpjg1hi.fsf@topspin.com> Roland> There is a small issue of how to name the interfaces to Roland> solve but we should be able to work that out. Yaron> We can use the Linux VLAN convention of: IF_name.VLAN_tag Yaron> And do for e.g. ipoib0.P_Key_value This way P_Key's can Yaron> look familiar to IT people working with VLAN's And we also Yaron> don't need a special place to store the P_Key value (since Yaron> its part of the IF name) Yes, that is one idea that makes sense. Implementing it seems to be a little ugly though since assigning the "ipoib0" part of the name uniquely seems to require grunging through the list of network devices. One option is to name the interface corresponding to P_Key table index 0 without the P_Key (just "ipoib0"). 
By the way, the Linux 802.1q code supports several different naming schemes (look for "vlan_name_type" in vlan.c). Thanks, Roland From mshefty at ichips.intel.com Mon Sep 27 09:05:11 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 27 Sep 2004 09:05:11 -0700 Subject: [openib-general] [PATCH] ib_mad: Receive path fixes In-Reply-To: <1096290027.1836.14.camel@localhost.localdomain> References: <1096290027.1836.14.camel@localhost.localdomain> Message-ID: <20040927090511.025c41da.mshefty@ichips.intel.com> On Mon, 27 Sep 2004 09:00:27 -0400 Hal Rosenstock wrote: > +union ib_mad_recv_wrid { > + u64 wrid; > + struct { > + u32 index; > + u32 qpn; > + } wrid_field; > +}; > + If you accept the patch that separates QP 0/1 traffic from each other, we don't need this, and it would allow for additional optimizations. Is there any benefit to having a single queue of data buffers for receives that are posted to separate queue pairs? As the code is currently structured, an error on QP 1 will reset QP 0, and vice versa. > struct ib_mad_private_header { > struct ib_mad_recv_wc recv_wc; /* must be first member (for now > !!!) */ I don't believe that recv_wc needs to be the first member anymore, but that may depend on which patches were accepted. > + u32 recv_wr_index[IB_MAD_QPS_SUPPORTED]; The access layer won't be posting receives on redirected QPs, so I think we can eliminate the "_SUPPORTED" constant, and just use "_CORE" instead. For redirected QPs, the access layer should only need to perform segmentation for the user, or reassembly of received data.
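For readers following the WR ID discussion, here is a minimal userspace sketch of what the quoted union does: it packs a QP number and a per-QP index into one 64-bit work request ID. The type and helper names (mad_recv_wrid, make_recv_wrid() and friends) are hypothetical, and fixed-width stdint types stand in for the kernel's u32/u64.

```c
#include <stdint.h>

/* Same layout as the union in the quoted patch, with stdint types. */
union mad_recv_wrid {
    uint64_t wrid;
    struct {
        uint32_t index;
        uint32_t qpn;
    } wrid_field;
};

/* Hypothetical helper: build a WR ID from a QP number and an index. */
static uint64_t make_recv_wrid(uint32_t qpn, uint32_t index)
{
    union mad_recv_wrid w;

    w.wrid_field.index = index;
    w.wrid_field.qpn = qpn;
    return w.wrid;
}

/* Hypothetical helpers: recover the two fields from a completed WR ID. */
static uint32_t recv_wrid_qpn(uint64_t wrid)
{
    union mad_recv_wrid w;

    w.wrid = wrid;
    return w.wrid_field.qpn;
}

static uint32_t recv_wrid_index(uint64_t wrid)
{
    union mad_recv_wrid w;

    w.wrid = wrid;
    return w.wrid_field.index;
}
```

The numeric value of the combined ID depends on host byte order, so it is only meaningful to the code that built it. That is fine for a WR ID, which the hardware hands back unmodified.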
From krause at cup.hp.com Mon Sep 27 09:07:03 2004 From: krause at cup.hp.com (Michael Krause) Date: Mon, 27 Sep 2004 09:07:03 -0700 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <52oejrg1ux.fsf@topspin.com> References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> <523c16l3wu.fsf@topspin.com> <6.1.2.0.2.20040927064203.01e60488@esmail.cup.hp.com> <52r7onhicd.fsf@topspin.com> <6.1.2.0.2.20040927081008.01da2090@esmail.cup.hp.com> <52oejrg1ux.fsf@topspin.com> Message-ID: <6.1.2.0.2.20040927085419.01f423c8@esmail.cup.hp.com> At 08:36 AM 9/27/2004, Roland Dreier wrote: > Michael> I don't follow why this is cosmetic. The SM will > Michael> configure whatever number of P_Key per port (CA or > Michael> switch) that it desires. IPoverIB does not and should > Michael> not care about this. IPoverIB should probe the SM/SA to > Michael> determine what and how many P_Key it should use. It > Michael> should then establish unique interfaces for each P_Key. > Michael> This is no different than what is done for 802.1Q where a > Michael> separate driver instance exists per tag. This was our > Michael> intent when we wrote the specification. > >As far as I know the SM doesn't know anything about IPoIB. What >information can the IPoIB driver get from the SM/SA beyond what P_Keys >are assigned to its local ports? (And this is already available from >the local P_Key table) The SM only knows what it configures in each port. The SA is responsible for service management and it works with the SM to map a given service to a P_Key. The SA also sets up the all node multicast group. IPoverIB joins this group in order to issue ARP/ND messages and therefore automatically discovers the P_Key to use. IPoverIB is required to inquire what groups are available and optionally set up event notification to be informed when groups are added for its particular service. 
This eliminates the need for local P_Key management. >By the way, it's interesting that you mention 802.1q. The Linux >implementation of VLAN tagging requires a userspace program (vlanconfig) >to be run to create tagged interfaces. In general, the IPoverIB driver should treat each new all-nodes multicast group with a unique P_Key as a virtual hot-plug event (this was our intent both within the IETF and in the IBTA). This should be linked into whatever OS management interfaces are required allowing ifconfig / dev file creation / etc. to be executed. This is orthogonal to P_Key management which was the original point of debate earlier in this string. Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From roland at topspin.com Mon Sep 27 09:23:18 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Sep 2004 09:23:18 -0700 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <6.1.2.0.2.20040927085419.01f423c8@esmail.cup.hp.com> (Michael Krause's message of "Mon, 27 Sep 2004 09:07:03 -0700") References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> <523c16l3wu.fsf@topspin.com> <6.1.2.0.2.20040927064203.01e60488@esmail.cup.hp.com> <52r7onhicd.fsf@topspin.com> <6.1.2.0.2.20040927081008.01da2090@esmail.cup.hp.com> <52oejrg1ux.fsf@topspin.com> <6.1.2.0.2.20040927085419.01f423c8@esmail.cup.hp.com> Message-ID: <52d607fzpl.fsf@topspin.com> Michael> The SM only knows what it configures in each port. The Michael> SA is responsible for service management and it works Michael> with the SM to map a given service to a P_Key. As far as I know there is no service (in the IB service record sense) associated to IPoIB, is there? Michael> IPoverIB is required to inquire what groups are available Michael> and optionally set up event notification to be informed Michael> when groups are added for its particular service. 
This Michael> eliminates the need for local P_Key management. I don't see this requirement anywhere in the current IETF drafts, although I could be missing it. In any case this seems rather ugly, since the only way to get a list of IPoIB multicast groups seems to be to query for _all_ multicast groups, filter for those that match the IPoIB GID format, and then attempt to join to find out which can be used on each local port. Michael> In general, the IPoverIB driver should treat each new Michael> all-nodes multicast group with a unique P_Key as a Michael> virtual hot-plug event (this was our intent both within Michael> the IETF and in the IBTA). Hmm.. this view does not seem to match the wording of the current IPoIB drafts. For example: It is an implementation choice on how the P_Key and the scope bits related to the IPoIB subnet are determined by the implementation. These could be configuration parameters initialized by some means by the administrator. The methods employed by an implementation to determine the P_Key and scope bits are not specified by IPoIB. Thanks, Roland From halr at voltaire.com Mon Sep 27 09:34:23 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 27 Sep 2004 12:34:23 -0400 Subject: [openib-general] [PATCH] ib_mad: Receive path fixes In-Reply-To: <20040927090511.025c41da.mshefty@ichips.intel.com> References: <1096290027.1836.14.camel@localhost.localdomain> <20040927090511.025c41da.mshefty@ichips.intel.com> Message-ID: <1096302862.11222.79.camel@localhost.localdomain> On Mon, 2004-09-27 at 12:05, Sean Hefty wrote: > On Mon, 27 Sep 2004 09:00:27 -0400 > Hal Rosenstock wrote: > > > +union ib_mad_recv_wrid { > > + u64 wrid; > > + struct { > > + u32 index; > > + u32 qpn; > > + } wrid_field; > > +}; > > + > > If you accept the patch that separates QP 0/1 traffic from each other, > we don't need this, and it would allow for additional optimizations. 
I am getting to your patch but want to finish up what I am doing before I introduce additional variables into the debugging. That patch will also take some work to integrate. > Is there any benefit to having a single queue of data buffers for receives > that are posted to separate queue pairs? There are separate receive queues (per QP). There is one queue on the send side. This is per HCA port. > As the code is currently structured, an error on QP 1 will reset QP 0, > and vice versa. That could be fixed but I need to look at your changes in more detail before I would go down that path. > > struct ib_mad_private_header { > > struct ib_mad_recv_wc recv_wc; /* must be first member (for now > > !!!) */ > > I don't believe that recv_wc needs to be the first member anymore, > but that may depend on which patches were accepted. All patches with the exception of one have been accepted so far. The change that removes the requirement for struct ib_mad_recv_wc to be the first member is part of that pending patch. > > + u32 recv_wr_index[IB_MAD_QPS_SUPPORTED]; > > The access layer won't be posting receives on redirected QPs, > so I think we can eliminate the "_SUPPORTED" constant, > and just use "_CORE" instead. Right. I'll fix this. > For redirected QPs, the access layer should only need to perform segmentation > for the user, or reassembly received data. Right. Recall that redirection support, among other things, is not yet implemented.
-- Hal From halr at voltaire.com Mon Sep 27 09:43:06 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 27 Sep 2004 12:43:06 -0400 Subject: [openib-general] [PATCH] [TRIVIAL] ib_mad: Eliminate unneeded IB_MAD_QPS_SUPPORTED definition Message-ID: <1096303386.1836.85.camel@localhost.localdomain> ib_mad: Eliminate unneeded IB_MAD_QPS_SUPPORTED definition Index: ib_mad_priv.h =================================================================== --- ib_mad_priv.h (revision 892) +++ ib_mad_priv.h (working copy) @@ -62,7 +62,6 @@ #include #define IB_MAD_QPS_CORE 2 /* Always QP0 and QP1 as a minimum */ -#define IB_MAD_QPS_SUPPORTED 2 /* QP and CQ parameters */ #define IB_MAD_QP_SEND_SIZE 2048 @@ -135,7 +134,7 @@ struct list_head port_list; struct ib_device *device; int port_num; - struct ib_qp *qp[IB_MAD_QPS_SUPPORTED]; + struct ib_qp *qp[IB_MAD_QPS_CORE]; struct ib_cq *cq; struct ib_pd *pd; struct ib_mr *mr; @@ -149,9 +148,9 @@ int send_posted_mad_count; spinlock_t recv_list_lock; - struct list_head recv_posted_mad_list[IB_MAD_QPS_SUPPORTED]; - int recv_posted_mad_count[IB_MAD_QPS_SUPPORTED]; - u32 recv_wr_index[IB_MAD_QPS_SUPPORTED]; + struct list_head recv_posted_mad_list[IB_MAD_QPS_CORE]; + int recv_posted_mad_count[IB_MAD_QPS_CORE]; + u32 recv_wr_index[IB_MAD_QPS_CORE]; struct task_struct *mad_thread; struct ib_mad_thread_private mad_thread_private; Index: ib_mad.c =================================================================== --- ib_mad.c (revision 892) +++ ib_mad.c (working copy) @@ -1068,7 +1068,7 @@ int i; unsigned long flags; - for (i = 0; i < IB_MAD_QPS_SUPPORTED; i++) { + for (i = 0; i < IB_MAD_QPS_CORE; i++) { spin_lock_irqsave(&port_priv->recv_list_lock, flags); while (!list_empty(&port_priv->recv_posted_mad_list[i])) { @@ -1416,7 +1416,7 @@ INIT_LIST_HEAD(&port_priv->agent_list); INIT_LIST_HEAD(&port_priv->send_posted_mad_list); port_priv->send_posted_mad_count = 0; - for (i = 0; i < IB_MAD_QPS_SUPPORTED; i++) { + for (i = 0; i < IB_MAD_QPS_CORE; 
i++) { INIT_LIST_HEAD(&port_priv->recv_posted_mad_list[i]); port_priv->recv_posted_mad_count[i] = 0; port_priv->recv_wr_index[i] = 0; From yaronh at voltaire.com Mon Sep 27 09:51:15 2004 From: yaronh at voltaire.com (Yaron Haviv) Date: Mon, 27 Sep 2004 18:51:15 +0200 Subject: [openib-general] IPoIB Loading and Starting Message-ID: <35EA21F54A45CB47B879F21A91F4862F1DF650@taurus.voltaire.com> > -----Original Message----- > From: Roland Dreier [mailto:roland at topspin.com] > Sent: Monday, September 27, 2004 5:45 PM > To: Yaron Haviv > Cc: Michael Krause; openib-general at openib.org > Subject: Re: [openib-general] IPoIB Loading and Starting > > Roland> There is a small issue of how to name the interfaces to > Roland> solve but we should be able to work that out. > > Yaron> We can use the Linux VLAN convention of: IF_name.VLAN_tag > Yaron> And do for e.g. ipoib0.P_Key_value This way P_Key's can > Yaron> look familiar to IT people working with VLAN's And we also > Yaron> don't need a special place to store the P_Key value (since > Yaron> its part of the IF name) > > Yes, that is one idea that makes sense. Implementing it seems to > be a little ugly though since assigning the "ipoib0" part of the name > uniquely seems to require grunging through the list of network > devices. One option is to name the interface corresponding to P_Key > table index 0 without the P_Key (just "ipoib0"). > I would prefer to somehow use the P_Key_value, since it is more persistent than the index: an SM may choose to configure the indexes differently after a port restart, or when removing a partition and re-inserting it. Also, indexes may not be contiguous (there can be holes due to removed partitions). Basing the solution on the indexes may result in swapping the config (IP, mask, ...) of two interfaces. Yaron > By the way, the Linux 802.1q code supports several different naming > schemes (look for "vlan_name_type" in vlan.c).
> > Thanks, > Roland From halr at voltaire.com Mon Sep 27 10:06:52 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 27 Sep 2004 13:06:52 -0400 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <52d607fzpl.fsf@topspin.com> References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> <523c16l3wu.fsf@topspin.com> <6.1.2.0.2.20040927064203.01e60488@esmail.cup.hp.com> <52r7onhicd.fsf@topspin.com> <6.1.2.0.2.20040927081008.01da2090@esmail.cup.hp.com> <52oejrg1ux.fsf@topspin.com> <6.1.2.0.2.20040927085419.01f423c8@esmail.cup.hp.com> <52d607fzpl.fsf@topspin.com> Message-ID: <1096304811.11222.94.camel@localhost.localdomain> On Mon, 2004-09-27 at 12:23, Roland Dreier wrote: > In any case this seems rather ugly, > since the only way to get a list of IPoIB multicast groups seems to be > to query for _all_ multicast groups, filter for those that match the > IPoIB GID format, and then attempt to join to find out which can be > used on each local port. Yes, it is ugly but that would be the tradeoff against manual configuration. In terms of "all multicast groups", this would be "all multicast groups" for the PKeys configured on this port. In the SA GetTable, the PKey could be wildcarded and the SA would only return those groups for PKeys which are configured for the port. In any case, as this could change over time, either the multicast group traps need monitoring or this needs to be polled. In terms of IPoIB GID format, this would check for the IPv4 or IPv6 signature depending on the IP interface type. 
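To make the "check the signature" step concrete, here is a minimal sketch of classifying a 16-byte MGID. It uses the signature values given in this thread (0x401B for IPv4, 0x601B for IPv6) and assumes the layout described in the IPoIB drafts: byte 0 is 0xff, bytes 2-3 hold the signature and bytes 4-5 hold the P_Key. The function and enum names are hypothetical, and the byte offsets should be checked against the current I-D.

```c
#include <stdint.h>

enum ipoib_mgid_kind {
    MGID_NOT_IPOIB = 0,
    MGID_IPOIB_V4,
    MGID_IPOIB_V6
};

/* Hypothetical helper: decide whether an MGID carries an IPoIB signature. */
static enum ipoib_mgid_kind ipoib_mgid_classify(const uint8_t mgid[16])
{
    uint16_t sig;

    if (mgid[0] != 0xff)                /* not a multicast GID at all */
        return MGID_NOT_IPOIB;

    sig = (uint16_t)((mgid[2] << 8) | mgid[3]);
    if (sig == 0x401b)
        return MGID_IPOIB_V4;
    if (sig == 0x601b)
        return MGID_IPOIB_V6;
    return MGID_NOT_IPOIB;
}

/* Hypothetical helper: extract the P_Key embedded in an IPoIB MGID. */
static uint16_t ipoib_mgid_pkey(const uint8_t mgid[16])
{
    return (uint16_t)((mgid[4] << 8) | mgid[5]);
}
```

A driver that filtered an SA GetTable response this way would still need to compare the embedded P_Key against the local P_Key table before attempting a join.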
-- Hal From roland at topspin.com Mon Sep 27 10:13:44 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Sep 2004 10:13:44 -0700 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <1096304811.11222.94.camel@localhost.localdomain> (Hal Rosenstock's message of "Mon, 27 Sep 2004 13:06:52 -0400") References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> <523c16l3wu.fsf@topspin.com> <6.1.2.0.2.20040927064203.01e60488@esmail.cup.hp.com> <52r7onhicd.fsf@topspin.com> <6.1.2.0.2.20040927081008.01da2090@esmail.cup.hp.com> <52oejrg1ux.fsf@topspin.com> <6.1.2.0.2.20040927085419.01f423c8@esmail.cup.hp.com> <52d607fzpl.fsf@topspin.com> <1096304811.11222.94.camel@localhost.localdomain> Message-ID: <52wtyfeit3.fsf@topspin.com> Hal> In terms of "all multicast groups", this would be "all Hal> multicast groups" for the PKeys configured on this port. In Hal> the SA GetTable, the PKey could be wildcarded and the SA Hal> would only return those groups for PKeys which are configured Hal> for the port. I don't think this is the case. If I read section 15.4.1 of the IB spec correctly, the SA will return all multicast groups. Hal> In terms of IPoIB GID format, this would check for the IPv4 Hal> or IPv6 signature depending on the IP interface type. Not sure what this means ?? IP interfaces don't have a type that I know of. 
- Roland From halr at voltaire.com Mon Sep 27 10:37:28 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 27 Sep 2004 13:37:28 -0400 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <52wtyfeit3.fsf@topspin.com> References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> <523c16l3wu.fsf@topspin.com> <6.1.2.0.2.20040927064203.01e60488@esmail.cup.hp.com> <52r7onhicd.fsf@topspin.com> <6.1.2.0.2.20040927081008.01da2090@esmail.cup.hp.com> <52oejrg1ux.fsf@topspin.com> <6.1.2.0.2.20040927085419.01f423c8@esmail.cup.hp.com> <52d607fzpl.fsf@topspin.com> <1096304811.11222.94.camel@localhost.localdomain> <52wtyfeit3.fsf@topspin.com> Message-ID: <1096306647.1830.114.camel@localhost.localdomain> On Mon, 2004-09-27 at 13:13, Roland Dreier wrote: > Hal> In terms of "all multicast groups", this would be "all > Hal> multicast groups" for the PKeys configured on this port. In > Hal> the SA GetTable, the PKey could be wildcarded and the SA > Hal> would only return those groups for PKeys which are configured > Hal> for the port. > > I don't think this is the case. If I read section 15.4.1 of the IB > spec correctly, the SA will return all multicast groups. That's the section I was referring to too. Doesn't C15-0.2-1 restrict the SA GetTableResp to only include multicast groups for Pkeys on that port ? I am interpreting the subject node is the same as the requesting node in this case. What are you referring to ? > Hal> In terms of IPoIB GID format, this would check for the IPv4 > Hal> or IPv6 signature depending on the IP interface type. > > Not sure what this means ?? IP interfaces don't have a type that I > know of. Aren't there separate IP interfaces in Linux for either IPv4 or IPv6 ? That's what I meant by IP interface type. Maybe I'm not using the proper terminology. 
The IPoIB signatures appear in the MGID of the multicast group and are 0x401B for IPv4 and 0x601B for IPv6 per the IETF I-D. -- Hal From tduffy at sun.com Mon Sep 27 10:43:17 2004 From: tduffy at sun.com (Tom Duffy) Date: Mon, 27 Sep 2004 10:43:17 -0700 Subject: [openib-general] [PATCH] [TRIVIAL] ipoib: Fix compile problem with data path debug on In-Reply-To: <20040925210613.GB7717@mellanox.co.il> References: <1095973511.25438.4.camel@duffman> <20040925210613.GB7717@mellanox.co.il> Message-ID: <1096306997.9584.9.camel@duffman> On Sat, 2004-09-25 at 23:06 +0200, Michael S. Tsirkin wrote: > Tom, > Not that I advocate using MUAs that break tabs, but - > you do know you can patch -l to solve (at least check) tab/space issues, > don't you? I didn't. Thanks for the tip. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From tduffy at sun.com Mon Sep 27 11:18:46 2004 From: tduffy at sun.com (Tom Duffy) Date: Mon, 27 Sep 2004 11:18:46 -0700 Subject: [openib-general] Re: [openib-commits] r894 - gen2/branches/roland-merge/src/linux-kernel/infiniband/ulp/ipoib In-Reply-To: <20040927174406.D62AD2283D5@openib.ca.sandia.gov> References: <20040927174406.D62AD2283D5@openib.ca.sandia.gov> Message-ID: <1096309126.9584.16.camel@duffman> On Mon, 2004-09-27 at 10:44 -0700, roland at openib.org wrote: > Author: roland > Date: 2004-09-27 10:44:05 -0700 (Mon, 27 Sep 2004) > New Revision: 894 > > Modified: > gen2/branches/roland-merge/src/linux-kernel/infiniband/ulp/ipoib/ipoib.h > gen2/branches/roland-merge/src/linux-kernel/infiniband/ulp/ipoib/ipoib_ib.c > gen2/branches/roland-merge/src/linux-kernel/infiniband/ulp/ipoib/ipoib_main.c > gen2/branches/roland-merge/src/linux-kernel/infiniband/ulp/ipoib/ipoib_multicast.c > Log: > Get rid of multicast join thread for IPoIB (switch to IPoIB work queue) Doh. You beat me to the punch.
I was working on the same thing (although, I was trying to do it with a kthread). Good work. What do you think is the next step on the TODO that I could start working on? Don't want to step on your toes... -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From mshefty at ichips.intel.com Mon Sep 27 11:21:51 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 27 Sep 2004 11:21:51 -0700 Subject: [openib-general] [PATCH] fix to acquire spinlock Message-ID: <20040927112151.3718a9ca.mshefty@ichips.intel.com> Index: ib_mad.c =================================================================== --- ib_mad.c (revision 893) +++ ib_mad.c (working copy) @@ -342,7 +342,7 @@ ret = ib_post_send(mad_agent->qp, &wr, &bad_wr); if (ret) { /* Unlink from posted send MAD list */ - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + spin_lock_irqsave(&port_priv->send_list_lock, flags); list_del(&mad_send_wr->send_list); port_priv->send_posted_mad_count--; spin_unlock_irqrestore(&port_priv->send_list_lock, flags); From roland at topspin.com Mon Sep 27 12:15:08 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Sep 2004 12:15:08 -0700 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <1096306647.1830.114.camel@localhost.localdomain> (Hal Rosenstock's message of "Mon, 27 Sep 2004 13:37:28 -0400") References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> <523c16l3wu.fsf@topspin.com> <6.1.2.0.2.20040927064203.01e60488@esmail.cup.hp.com> <52r7onhicd.fsf@topspin.com> <6.1.2.0.2.20040927081008.01da2090@esmail.cup.hp.com> <52oejrg1ux.fsf@topspin.com> <6.1.2.0.2.20040927085419.01f423c8@esmail.cup.hp.com> <52d607fzpl.fsf@topspin.com> <1096304811.11222.94.camel@localhost.localdomain> <52wtyfeit3.fsf@topspin.com> 
<1096306647.1830.114.camel@localhost.localdomain> Message-ID: <52oejred6r.fsf@topspin.com> Hal> That's the section I was referring to too. Doesn't C15-0.2-1 Hal> restrict the SA GetTableResp to only include multicast groups Hal> for Pkeys on that port ? I am interpreting the subject node Hal> is the same as the requesting node in this case. What are you Hal> referring to ? Hmm, I don't have C15-0.2-1 in my spec. I was going by the following: C15-0.1.22: When a requester node requests data from the Subnet Administrator that would provide information about a subject node, the Subnet Administrator shall return only data providing information about subject nodes for which the requester shares a P_Key, with exceptions noted below in C15-0.1.23. which doesn't seem to restrict the MC groups that a query can return, as well as: C15-0.1.23: [...] MCMemberRecords shall always be provided with the PortGID, Join- State and ProxyJoin components set to 0, except for the case of a trusted request, in which case the actual component contents shall be provided. which seems to imply that MCMemberRecords will just have the subject node info zeroed out. Hal> Aren't there separate IP interfaces in Linux for either IPv4 Hal> or IPv6 ? That's what I meant by IP interface type. Maybe Hal> I'm not using the proper terminology. The IPoIB signatures Hal> appear in the MGID of the multicast group and are 0x401B for Hal> IPv4 and 0x601B for IPv6 per the IETF I-D. Nope, a network interface in Linux can carry all kinds of packets... IPv4, IPv6, IPX, Appletalk, etc. It can have any arbitrary collection of IPv4 and IPv6 addresses assigned at any time... 
- Roland From roland at topspin.com Mon Sep 27 12:17:48 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Sep 2004 12:17:48 -0700 Subject: [openib-general] Re: [openib-commits] r894 - gen2/branches/roland-merge/src/linux-kernel/infiniband/ulp/ipoib In-Reply-To: <1096309126.9584.16.camel@duffman> (Tom Duffy's message of "Mon, 27 Sep 2004 11:18:46 -0700") References: <20040927174406.D62AD2283D5@openib.ca.sandia.gov> <1096309126.9584.16.camel@duffman> Message-ID: <52k6ufed2b.fsf@topspin.com> Tom> Doh. You beat me to the punch. I was working on the same Tom> thing (although, I was trying to do it with a kthread). Sorry dude... Tom> What do you think is the next step on the TODO that I could Tom> start working on? Don't want to step on your toes... I think I've done all the straightforward work on IPoIB now. We can try to figure out how to make it a "native" driver now (ie use the full 20 byte HW address instead of hashing down to 6 bytes, etc). I had some inconclusive discussions on netdev at oss.sgi.com about this last week but I still don't know how to do it. - R. From mshefty at ichips.intel.com Mon Sep 27 12:56:13 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 27 Sep 2004 12:56:13 -0700 Subject: [openib-general] [PATCH] Add send queuing to MAD agent Message-ID: <20040927125613.47a91d34.mshefty@ichips.intel.com> This patch adds send queuing at the MAD agent level. The queuing will eventually be needed for request/response/timeout/RMPP/QP overflow/send optimization/unregistration purposes. The patch restructures part of the layering of the send side code for RMPP/QP overflow purposes, and fixes a bug where MADs could be posted on the QP outside of a lock, resulting in out of order completions. 
- Sean -- Index: access/ib_mad_priv.h =================================================================== --- access/ib_mad_priv.h (revision 893) +++ access/ib_mad_priv.h (working copy) @@ -105,6 +105,10 @@ struct ib_mad_agent agent; struct ib_mad_reg_req *reg_req; struct ib_mad_port_private *port_priv; + + spinlock_t send_list_lock; + struct list_head send_list; + atomic_t refcount; wait_queue_head_t wait; u8 rmpp_version; @@ -112,9 +116,11 @@ struct ib_mad_send_wr_private { struct list_head send_list; + struct list_head agent_send_list; struct ib_mad_agent *agent; u64 wr_id; /* client WRID */ int timeout_ms; + int is_active; }; struct ib_mad_mgmt_method_table { Index: access/ib_mad.c =================================================================== --- access/ib_mad.c (revision 893) +++ access/ib_mad.c (working copy) @@ -223,6 +223,8 @@ list_add_tail(&mad_agent_priv->agent_list, &port_priv->agent_list); spin_unlock_irqrestore(&port_priv->reg_lock, flags); + spin_lock_init(&mad_agent_priv->send_list_lock); + INIT_LIST_HEAD(&mad_agent_priv->send_list); atomic_set(&mad_agent_priv->refcount, 1); init_waitqueue_head(&mad_agent_priv->wait); mad_agent_priv->port_priv = port_priv; @@ -269,6 +271,35 @@ } EXPORT_SYMBOL(ib_unregister_mad_agent); +static int ib_send_mad(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_send_wr_private *mad_send_wr, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr) +{ + struct ib_mad_port_private *port_priv; + unsigned long flags; + int ret; + + port_priv = mad_agent_priv->port_priv; + + /* Replace user's WR ID with our own to find WR on completion. 
*/ + mad_send_wr->wr_id = send_wr->wr_id; + send_wr->wr_id = (unsigned long)mad_send_wr; + + spin_lock_irqsave(&port_priv->send_list_lock, flags); + ret = ib_post_send(mad_agent_priv->agent.qp, send_wr, bad_send_wr); + if (!ret) { + list_add_tail(&mad_send_wr->send_list, + &port_priv->send_posted_mad_list); + port_priv->send_posted_mad_count++; + } else { + printk(KERN_NOTICE "ib_post_send failed\n"); + *bad_send_wr = send_wr; + } + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + return ret; +} + /* * ib_post_send_mad - Posts MAD(s) to the send queue of the QP associated * with the registered client @@ -312,44 +343,35 @@ return -ENOMEM; } - /* Initialize MAD send WR tracking structure */ + /* Track sent MAD with agent. */ + spin_lock_irqsave(&mad_agent_priv->send_list_lock, flags); + list_add_tail(&mad_send_wr->agent_send_list, + &mad_agent_priv->send_list); + spin_unlock_irqrestore(&mad_agent_priv->send_list_lock, flags); + + /* Reference MAD agent until send completes. */ + atomic_inc(&mad_agent_priv->refcount); mad_send_wr->agent = mad_agent; - mad_send_wr->wr_id = cur_send_wr->wr_id; - /* Timeout valid only when MAD is a request !!! */ mad_send_wr->timeout_ms = cur_send_wr->wr.ud.timeout_ms; + mad_send_wr->is_active = 1; + wr = *cur_send_wr; wr.next = NULL; - wr.opcode = IB_WR_SEND; /* cur_send_wr->opcode ? */ - wr.wr_id = (unsigned long)mad_send_wr; - wr.sg_list = cur_send_wr->sg_list; - wr.num_sge = cur_send_wr->num_sge; - wr.wr.ud.remote_qpn = cur_send_wr->wr.ud.remote_qpn; - wr.wr.ud.remote_qkey = cur_send_wr->wr.ud.remote_qkey; - wr.wr.ud.pkey_index = cur_send_wr->wr.ud.pkey_index; - wr.wr.ud.ah = cur_send_wr->wr.ud.ah; - wr.send_flags = IB_SEND_SIGNALED; /* cur_send_wr->send_flags ? 
*/ - /* Link send WR into posted send MAD list */ - spin_lock_irqsave(&port_priv->send_list_lock, flags); - list_add_tail(&mad_send_wr->send_list, - &port_priv->send_posted_mad_list); - port_priv->send_posted_mad_count++; - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); - - /* Reference MAD agent until send completes. */ - atomic_inc(&mad_agent_priv->refcount); - - ret = ib_post_send(mad_agent->qp, &wr, &bad_wr); + ret = ib_send_mad(mad_agent_priv, mad_send_wr, &wr, &bad_wr); if (ret) { - /* Unlink from posted send MAD list */ - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); - list_del(&mad_send_wr->send_list); - port_priv->send_posted_mad_count--; - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + /* Handle QP overrun separately... -ENOMEM */ + + /* Fail send request */ + spin_lock_irqsave(&mad_agent_priv->send_list_lock, + flags); + list_del(&mad_send_wr->agent_send_list); + spin_unlock_irqrestore(&mad_agent_priv->send_list_lock, + flags); *bad_send_wr = cur_send_wr; if (atomic_dec_and_test(&mad_agent_priv->refcount)) wake_up(&mad_agent_priv->wait); - printk(KERN_NOTICE "ib_post_mad_send failed\n"); + printk(KERN_NOTICE "ib_send_mad failed\n"); return ret; } cur_send_wr= next_send_wr; @@ -786,57 +808,74 @@ return; } +/* + * Process a send work completion. + */ +static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_mad_agent_private *mad_agent_priv; + unsigned long flags; + + mad_agent_priv = container_of(mad_send_wr->agent, + struct ib_mad_agent_private, agent); + + /* Check whether timeout was requested !!! */ + mad_send_wr->is_active = 0; + + /* Handle RMPP... */ + + /* Remove send from MAD agent and notify client of completion. 
*/ + spin_lock_irqsave(&mad_agent_priv->send_list_lock, + flags); + list_del(&mad_send_wr->agent_send_list); + spin_unlock_irqrestore(&mad_agent_priv->send_list_lock, + flags); + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, mad_send_wc); + + /* Release reference taken when sending. */ + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + + kfree(mad_send_wr); +} + static void ib_mad_send_done_handler(struct ib_mad_port_private *port_priv, struct ib_wc *wc) { - struct ib_mad_send_wr_private *send_wr; - struct ib_mad_agent_private *mad_agent_priv; - unsigned long flags; + struct ib_mad_send_wr_private *mad_send_wr; + unsigned long flags; /* Completion corresponds to first entry on posted MAD send list */ spin_lock_irqsave(&port_priv->send_list_lock, flags); - if (!list_empty(&port_priv->send_posted_mad_list)) { - send_wr = list_entry(&port_priv->send_posted_mad_list, - struct ib_mad_send_wr_private, - send_list); - - if (send_wr->wr_id != wc->wr_id) { - printk(KERN_ERR "Send completion WR ID 0x%Lx doesn't match posted send WR ID 0x%Lx\n", wc->wr_id, send_wr->wr_id); - - goto error; - } - - mad_agent_priv = container_of(send_wr->agent, - struct ib_mad_agent_private, agent); - /* Check whether timeout was requested !!! 
*/ - - /* Remove from posted send MAD list */ - list_del(&send_wr->send_list); - port_priv->send_posted_mad_count--; + if (list_empty(&port_priv->send_posted_mad_list)) { + printk(KERN_ERR "Send completion WR ID 0x%Lx but send list " + "is empty\n", wc->wr_id); + goto error; + } - } else { - printk(KERN_ERR "Send completion WR ID 0x%Lx but send list is empty\n", wc->wr_id); + mad_send_wr = list_entry(&port_priv->send_posted_mad_list, + struct ib_mad_send_wr_private, + send_list); + if (mad_send_wr->wr_id != wc->wr_id) { + printk(KERN_ERR "Send completion WR ID 0x%Lx doesn't match " + "posted send WR ID 0x%Lx\n", wc->wr_id, mad_send_wr->wr_id); goto error; } + + /* Remove from posted send MAD list */ + list_del(&mad_send_wr->send_list); + port_priv->send_posted_mad_count--; spin_unlock_irqrestore(&port_priv->send_list_lock, flags); /* Restore client wr_id in WC */ - wc->wr_id = send_wr->wr_id; - - /* Invoke client send callback */ - send_wr->agent->send_handler(send_wr->agent, - (struct ib_mad_send_wc *)wc); + wc->wr_id = mad_send_wr->wr_id; - /* Release reference taken when sending. 
*/
-	if (atomic_dec_and_test(&mad_agent_priv->refcount))
-		wake_up(&mad_agent_priv->wait);
-
-	kfree(send_wr);
+	ib_mad_complete_send_wr(mad_send_wr, (struct ib_mad_send_wc*)wc);
 	return;

 error:
 	spin_unlock_irqrestore(&port_priv->send_list_lock, flags);
-	return;
 }

 /*

From mshefty at ichips.intel.com Mon Sep 27 15:10:34 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 27 Sep 2004 15:10:34 -0700
Subject: [openib-general] [PATCH] Add send queuing to MAD agent
In-Reply-To: <20040927125613.47a91d34.mshefty@ichips.intel.com>
References: <20040927125613.47a91d34.mshefty@ichips.intel.com>
Message-ID: <20040927151034.79b24435.mshefty@ichips.intel.com>

On Mon, 27 Sep 2004 12:56:13 -0700
Sean Hefty wrote:

>  struct ib_mad_send_wr_private {
>  	struct list_head send_list;
> +	struct list_head agent_send_list;
>  	struct ib_mad_agent *agent;
>  	u64 wr_id; /* client WRID */
>  	int timeout_ms;
> +	int is_active;

Looking ahead at RMPP, I think it makes more sense to change int is_active
to atomic_t refcount. Then each work request posted for a large transfer
would increment refcount, so we can track how many work requests are
outstanding for a given send.

- Sean

From tduffy at sun.com Mon Sep 27 15:33:59 2004
From: tduffy at sun.com (Tom Duffy)
Date: Mon, 27 Sep 2004 15:33:59 -0700
Subject: [openib-general] updated TODO list
In-Reply-To: <52k6ufed2b.fsf@topspin.com>
References: <20040927174406.D62AD2283D5@openib.ca.sandia.gov> <1096309126.9584.16.camel@duffman> <52k6ufed2b.fsf@topspin.com>
Message-ID: <1096324439.9584.33.camel@duffman>

On Mon, 2004-09-27 at 12:17 -0700, Roland Dreier wrote:
> I think I've done all the straightforward work on IPoIB now. We can
> try to figure out how to make it a "native" driver now (ie use the
> full 20 byte HW address instead of hashing down to 6 bytes, etc). I
> had some inconclusive discussions on netdev at oss.sgi.com about this
> last week but I still don't know how to do it.
Can we put together a TODO that includes everything we can think of that
needs to be done before we can send the code to lkml?

Can anything stay in /proc? Right now, I see:

/proc/infiniband/
|-- ipoib_arp_ib0
|-- ipoib_arp_ib1
|-- ipoib_mcast_ib0
|-- ipoib_mcast_ib1
|-- ipoib_vlan
|-- mad
|   `-- filter
|-- poll_counts
`-- tracelevel

with only ib_mthca and ib_ipoib loaded (and their dependencies).

Some others:
- port opensm to mthca
- remove the ts_ from the names of all the headers
- move to the new mad layer

what else?

-tduffy

From yaronh at voltaire.com Mon Sep 27 16:26:42 2004
From: yaronh at voltaire.com (Yaron Haviv)
Date: Tue, 28 Sep 2004 01:26:42 +0200
Subject: [openib-general] Re: [openib-commits] r894 -gen2/branches/roland-merge/src/linux-kernel/infiniband/ulp/ipoib
Message-ID: <35EA21F54A45CB47B879F21A91F4862F1DF65C@taurus.voltaire.com>

> -----Original Message-----
> From: openib-general-bounces at openib.org [mailto:openib-general-
> bounces at openib.org] On Behalf Of Roland Dreier
> Sent: Monday, September 27, 2004 9:18 PM
> To: Tom Duffy
> Cc: openib-general at openib.org
> Subject: Re: [openib-general] Re: [openib-commits] r894 -
> gen2/branches/roland-merge/src/linux-kernel/infiniband/ulp/ipoib
>
> Tom> Doh. You beat me to the punch. I was working on the same
> Tom> thing (although, I was trying to do it with a kthread).
>
> Sorry dude...
>
> Tom> What do you think is the next step on the TODO that I could
> Tom> start working on? Don't want to step on your toes...
>
> I think I've done all the straightforward work on IPoIB now. We can
> try to figure out how to make it a "native" driver now (ie use the
> full 20 byte HW address instead of hashing down to 6 bytes, etc).
I > had some inconclusive discussions on netdev at oss.sgi.com about this > last week but I still don't know how to do it. > Having a 20 byte HW address towards the upper stack may result in some unexpected behavior with different networking tools such as sniffers, etc' , a variety of DHCP servers and few other protocols that use the hardware addresses. I suggest we don't rush to incorporate the 20 byte support and think of more urgent matters, and in any case when we do get to it allow the user to configure the IPoIB to work in the 6 byte mode, to enable compatibility with those apps/protocols. Yaron From iod00d at hp.com Mon Sep 27 17:18:42 2004 From: iod00d at hp.com (Grant Grundler) Date: Mon, 27 Sep 2004 17:18:42 -0700 Subject: [openib-general] mthca startup problem In-Reply-To: <52fz53hica.fsf@topspin.com> References: <220970-220049221192831799@M2W066.mail2web.com> <52y8j3o1q1.fsf@topspin.com> <005101c4a0bf$d93b33f0$4302000a@Gripen> <52ekkumfyf.fsf@topspin.com> <003901c4a172$ec3a3d80$4302000a@Gripen> <52brful43m.fsf@topspin.com> <20040927044520.GG24541@cup.hp.com> <52fz53hica.fsf@topspin.com> Message-ID: <20040928001842.GE29681@cup.hp.com> On Mon, Sep 27, 2004 at 07:55:33AM -0700, Roland Dreier wrote: > Check Linus's changelogs -- I can find "fix build for gcc-2.95" > changesets even in the last month or two. Also Documentation/Changes > still lists gcc 2.95.3 as the minimal supported version. ok - I guess I'm living too far from the "real world"...both parisc port and ia64 debian ports have been using gcc 3.x for 2 over years now. > I can guarantee that if this code gets merged upstream with gcc 2.95 > problems, someone will complain and I'll have to fix it. So I might > as well avoid problems now. yeah - if someone is really using it, agreed. 
thanks, grant From roland at topspin.com Mon Sep 27 18:20:59 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Sep 2004 18:20:59 -0700 Subject: [openib-general] Re: [openib-commits] r894 -gen2/branches/roland-merge/src/linux-kernel/infiniband/ulp/ipoib In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F1DF65C@taurus.voltaire.com> (Yaron Haviv's message of "Tue, 28 Sep 2004 01:26:42 +0200") References: <35EA21F54A45CB47B879F21A91F4862F1DF65C@taurus.voltaire.com> Message-ID: <527jqfdw90.fsf@topspin.com> Yaron> Having a 20 byte HW address towards the upper stack may Yaron> result in some unexpected behavior with different Yaron> networking tools such as sniffers, etc' , a variety of DHCP Yaron> servers and few other protocols that use the hardware Yaron> addresses. Yaron> I suggest we don't rush to incorporate the 20 byte support Yaron> and think of more urgent matters, and in any case when we Yaron> do get to it allow the user to configure the IPoIB to work Yaron> in the 6 byte mode, to enable compatibility with those Yaron> apps/protocols. I think there's not much chance of an IPoIB driver with a "fake ethernet" layer being accepted in the mainline kernel. Thanks, Roland From roland at topspin.com Mon Sep 27 18:25:41 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Sep 2004 18:25:41 -0700 Subject: [openib-general] Re: updated TODO list In-Reply-To: <1096324439.9584.33.camel@duffman> (Tom Duffy's message of "Mon, 27 Sep 2004 15:33:59 -0700") References: <20040927174406.D62AD2283D5@openib.ca.sandia.gov> <1096309126.9584.16.camel@duffman> <52k6ufed2b.fsf@topspin.com> <1096324439.9584.33.camel@duffman> Message-ID: <523c13dw16.fsf@topspin.com> Tom> Can we put together a TODO that includes everything we can Tom> think of that needs to be done before we can send the code to Tom> lkml? 
At a high level:

- Finish low-level MAD support
- Implement API for SA path record and MC group queries
- Make Dave Miller et al happy with IPoIB driver
- Implement userspace MAD access for opensm ? (not strictly necessary)

Tom> Can anything stay in /proc? Right now, I see:

No, I don't think so. The IPoIB stuff should be cleaned up somehow
(eg using native 20-byte addresses makes the ARP stuff superfluous,
the VLAN stuff can be killed, and mcast maybe via sysfs). The mad/
directory goes away when we switch to the new MAD layer, and
poll_counts/tracelevel go away when we kill the legacy/ stuff.

Tom> - port opensm to mthca

Yup (really, "port opensm to new userspace API")

Tom> - remove the ts_ from the names of all the headers

That's an easy one once we see which headers survive.

Tom> - move to the new mad layer

Yep.

- Roland

From Tom.Duffy at Sun.COM Mon Sep 27 18:48:13 2004
From: Tom.Duffy at Sun.COM (Tom Duffy)
Date: Mon, 27 Sep 2004 18:48:13 -0700
Subject: [openib-general] Re: [openib-commits] r894 -gen2/branches/roland-merge/src/linux-kernel/infiniband/ulp/ipoib
In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F1DF65C@taurus.voltaire.com>
References: <35EA21F54A45CB47B879F21A91F4862F1DF65C@taurus.voltaire.com>
Message-ID: <4158C2DD.3050003@sun.com>

Yaron Haviv wrote:
> Having a 20 byte HW address towards the upper stack may result in some
> unexpected behavior with different networking tools such as sniffers,
> etc' , a variety of DHCP servers and few other protocols that use the
> hardware addresses.

Or just fix those apps...

-tduffy

From mst at mellanox.co.il Tue Sep 28 05:03:53 2004
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 28 Sep 2004 14:03:53 +0200
Subject: [openib-general] Re: IP over IB hardware address
In-Reply-To: <527jqfdw90.fsf@topspin.com>
References: <35EA21F54A45CB47B879F21A91F4862F1DF65C@taurus.voltaire.com> <527jqfdw90.fsf@topspin.com>
Message-ID: <20040928120353.GA7178@mellanox.co.il>

Quoting r.
Roland Dreier (roland at topspin.com) "Re: [openib-general] Re: [openib-commits] r894 -gen2/branches/roland-merge/src/linux-kernel/infiniband/ulp/ipoib": > Yaron> Having a 20 byte HW address towards the upper stack may > Yaron> result in some unexpected behavior with different > Yaron> networking tools such as sniffers, etc' , a variety of DHCP > Yaron> servers and few other protocols that use the hardware > Yaron> addresses. > > Yaron> I suggest we don't rush to incorporate the 20 byte support > Yaron> and think of more urgent matters, and in any case when we > Yaron> do get to it allow the user to configure the IPoIB to work > Yaron> in the 6 byte mode, to enable compatibility with those > Yaron> apps/protocols. > > I think there's not much chance of an IPoIB driver with a "fake > ethernet" layer being accepted in the mainline kernel. > > Thanks, > Roland I too think this needs to be addressed somehow. Existing code can cause failures in IP over IB if two guids hash to the same fake ethernet address. So if you are not willing to go to the full 20 byte hw address (or even if you want an option not to do it), you have to play with SA or find some other way to perform the mapping, properly. MST From halr at voltaire.com Tue Sep 28 06:00:03 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 28 Sep 2004 09:00:03 -0400 Subject: [openib-general] Re: [PATCH] Add send queuing to MAD agent In-Reply-To: <20040927125613.47a91d34.mshefty@ichips.intel.com> References: <20040927125613.47a91d34.mshefty@ichips.intel.com> Message-ID: <1096376403.1869.2.camel@localhost.localdomain> On Mon, 2004-09-27 at 15:56, Sean Hefty wrote: > This patch adds send queuing at the MAD agent level. > The queuing will eventually be needed for > request/response/timeout/RMPP/QP overflow/send optimization/unregistration > purposes. 
The patch restructures part of the layering of the send side > code for RMPP/QP overflow purposes, and fixes a bug where MADs could > be posted on the QP outside of a lock, resulting in out of order > completions. Thanks! Applied. Please verify that I made the correct changes to the ib_post_send_mad() routine as that part of the merge was not automatic. -- Hal From halr at voltaire.com Tue Sep 28 06:56:12 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 28 Sep 2004 09:56:12 -0400 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <52oejred6r.fsf@topspin.com> References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> <523c16l3wu.fsf@topspin.com> <6.1.2.0.2.20040927064203.01e60488@esmail.cup.hp.com> <52r7onhicd.fsf@topspin.com> <6.1.2.0.2.20040927081008.01da2090@esmail.cup.hp.com> <52oejrg1ux.fsf@topspin.com> <6.1.2.0.2.20040927085419.01f423c8@esmail.cup.hp.com> <52d607fzpl.fsf@topspin.com> <1096304811.11222.94.camel@localhost.localdomain> <52wtyfeit3.fsf@topspin.com> <1096306647.1830.114.camel@localhost.localdomain> <52oejred6r.fsf@topspin.com> Message-ID: <1096379771.3479.60.camel@localhost.localdomain> On Mon, 2004-09-27 at 15:15, Roland Dreier wrote: > Hal> That's the section I was referring to too. Doesn't C15-0.2-1 > Hal> restrict the SA GetTableResp to only include multicast groups > Hal> for Pkeys on that port ? I am interpreting the subject node > Hal> is the same as the requesting node in this case. What are you > Hal> referring to ? > > Hmm, I don't have C15-0.2-1 in my spec. This is from IBA 1.2. > I was going by the following: > > C15-0.1.22: When a requester node requests data from the Subnet > Administrator that would provide information about a subject node, > the Subnet Administrator shall return only data providing > information about subject nodes for which the requester shares a > P_Key, with exceptions noted below in C15-0.1.23. 
> > which doesn't seem to restrict the MC groups that a query can return, At IBA 1.2, C15-0.1.22 is obsolete and has been replaced by C15-0.2.1. C15-0.2.1: When a requester node sends a trusted request to SA, the re- quested data shall be returned. When a requester node sends a non- trusted request for data to SA that would provide information about a subject node, the SA shall return only data providing information about subject nodes for which the requester shares a P_Key, with exceptions noted below in C15-0.1.23. So this appears to be a 1.2/1.1 difference. With 1.1, the local node would need to filter for groups for which the port had full PKey. > as well as: > > C15-0.1.23: [...] MCMemberRecords shall always be provided with > the PortGID, Join- State and ProxyJoin components set to 0, except > for the case of a trusted request, in which case the actual > component contents shall be provided. > > which seems to imply that MCMemberRecords will just have the subject > node info zeroed out. I think that the PKey sharing is a first level check before this would occur in the SA response. > Hal> Aren't there separate IP interfaces in Linux for either IPv4 > Hal> or IPv6 ? That's what I meant by IP interface type. Maybe > Hal> I'm not using the proper terminology. The IPoIB signatures > Hal> appear in the MGID of the multicast group and are 0x401B for > Hal> IPv4 and 0x601B for IPv6 per the IETF I-D. > > Nope, a network interface in Linux can carry all kinds of > packets... IPv4, IPv6, IPX, Appletalk, etc. It can have any arbitrary > collection of IPv4 and IPv6 addresses assigned at any time... Aren't there separate network interfaces for IPv4 and IPv6 on top of some lower layer IB interface ? In any case, for an IB interface, One way IPoIB "auto" discovery could work is to bring up IPv4 and IPv6 depending on which groups were present for the groups pertaining to the port's (full) PKeys unless configured otherwise. 
In practicality, all multicast groups are likely IPoIB. The only issue is whether they are IPv4 or IPv6 groups and whether the groups are groups that the port can participate in (full PKey 1.2/1.1 difference). -- Hal From halr at voltaire.com Tue Sep 28 07:16:57 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 28 Sep 2004 10:16:57 -0400 Subject: [openib-general] Re: [openib-commits] r894 - gen2/branches/roland-merge/src/linux-kernel/infiniband/ulp/ipoib In-Reply-To: <52k6ufed2b.fsf@topspin.com> References: <20040927174406.D62AD2283D5@openib.ca.sandia.gov> <1096309126.9584.16.camel@duffman> <52k6ufed2b.fsf@topspin.com> Message-ID: <1096381016.1869.84.camel@localhost.localdomain> On Mon, 2004-09-27 at 15:17, Roland Dreier wrote: > I think I've done all the straightforward work on IPoIB now. We can > try to figure out how to make it a "native" driver now (ie use the > full 20 byte HW address instead of hashing down to 6 bytes, etc). I > had some inconclusive discussions on netdev at oss.sgi.com about this > last week but I still don't know how to do it. Can you elaborate on the issue ? The last I saw on this was when you wrote: > As far as I can tell, the only changes needed would be: > implement an ip_ib_mc_map() function and add > > case ARPHRD_INFINIBAND: > ip_ib_mc_map(addr, haddr); > return 0; > > to arp_mc_map() in net/ipv4/arp.c and to make the analogous addition > for ndisc_mc_map() in net/ipv6/ndisc.c. 
Here's a cut at the mapping:

|   8    |  4 |  4 |     16 bits     | 16 bits |      80 bits      |
+--------+----+----+-----------------+---------+-------------------+
|11111111|0001|scop|<IPoIB signature>|< P_Key >|      group ID     |
+--------+----+----+-----------------+---------+-------------------+

|   8    |  4 |  4 |     16 bits    | 16 bits | 48 bits  | 32 bits |
+--------+----+----+----------------+---------+----------+---------+
|11111111|0001|scop|0100000000011011|< P_Key >|00.......0|<all 1's>|
+--------+----+----+----------------+---------+----------+---------+

For IPv6 the lower 80-bit of the group ID is used directly in the
lower 80-bit of the MGID. For IPv4, the group ID is only 28-bit long
and the rest of the bits are filled with 0.

The ib mc_map functions would be defined in
include/net/[ip if_inet6].h. Here's a cut at the one for IPv4 (ip.h):

232,261d231
< /*
<  * Map a multicast IP onto multicast GID for type infiniband.
<  * PKey is added subsequent to this by the ipoib driver.
<  * ipoib driver joins broadcast group itself.
<  */
<
< static inline void ip_ib_mc_map(u32 addr, char *buf)
< {
< 	addr=ntohl(addr);
< 	buf[0]=0xFF;
< 	buf[1]=0x12; /* link local scope */
< 	buf[2]=0x40; /* IPv4 signature */
< 	buf[3]=0x1B;
< 	buf[4]=0;
< 	buf[5]=0;
< 	buf[6]=0;
< 	buf[7]=0;
< 	buf[8]=0;
< 	buf[9]=0;
< 	buf[10]=0;
< 	buf[11]=0;
< 	buf[15]=addr&0xFF;
< 	addr>>=8;
< 	buf[14]=addr&0xFF;
< 	addr>>=8;
< 	buf[13]=addr&0xFF;
< 	addr>>=8;
< 	buf[12]=addr&0x0F;
< }
<

The only issue I see is where the scope and PKey would come from to
form the MGID. The scope could default to link local (2). Couldn't the
PKey be applied by the driver as it knows the PKey for the IPoIB
interface ? Or is there a problem with delivering this to the proper
IPoIB interface ?
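
To make the open question about scope and PKey concrete, here is a
minimal user-space sketch of how the full MGID might be assembled once
the driver supplies both values. It is an illustration only, not part
of the posted patch: the helper name ipoib_ipv4_mgid(), its parameters,
and the IPOIB_MGID_LEN constant are all hypothetical, and the address
is assumed to be in host byte order already.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IPOIB_MGID_LEN 16	/* hypothetical name; a GID is 128 bits */

/*
 * Hypothetical helper (not from the patch above): build the 16-byte
 * multicast GID for an IPv4 group.
 *
 * addr  - IPv4 multicast address, host byte order (assumption)
 * pkey  - partition key of the IPoIB interface
 * scope - 4-bit multicast scope, 2 = link local
 */
static void ipoib_ipv4_mgid(uint32_t addr, uint16_t pkey, uint8_t scope,
			    uint8_t mgid[IPOIB_MGID_LEN])
{
	assert(mgid != NULL);
	memset(mgid, 0, IPOIB_MGID_LEN);	/* bytes 6..11 stay zero */

	mgid[0] = 0xFF;				/* multicast prefix */
	mgid[1] = 0x10 | (scope & 0x0F);	/* flags 0001, then scope */
	mgid[2] = 0x40;				/* IPv4 signature 0x401B */
	mgid[3] = 0x1B;
	mgid[4] = (uint8_t)(pkey >> 8);		/* P_Key, high byte first */
	mgid[5] = (uint8_t)(pkey & 0xFF);
	mgid[12] = (addr >> 24) & 0x0F;		/* only the low 28 bits */
	mgid[13] = (addr >> 16) & 0xFF;
	mgid[14] = (addr >> 8) & 0xFF;
	mgid[15] = addr & 0xFF;
}
```

With scope 2 this produces the same bytes 0-3 and 12-15 as
ip_ib_mc_map() above; the only difference is that bytes 4-5 carry the
P_Key instead of being left at zero for the driver to patch in later.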
-- Hal From halr at voltaire.com Tue Sep 28 08:12:23 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 28 Sep 2004 11:12:23 -0400 Subject: [openib-general] [PATCH] ib_mad.c: Fix list handling in ib_mad_recv_done_handler Message-ID: <1096384342.3479.91.camel@localhost.localdomain> ib_mad.c: Fix list handling in ib_mad_recv_done_handler With this patch, the receive side is working (with no clients registered) :-) Index: ib_mad.c =================================================================== --- ib_mad.c (revision 896) +++ ib_mad.c (working copy) @@ -706,6 +706,7 @@ static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, struct ib_wc *wc) { + struct ib_mad_recv_buf *rbuf; struct ib_mad_private *recv; union ib_mad_recv_wrid wrid; unsigned long flags; @@ -729,10 +730,13 @@ */ spin_lock_irqsave(&port_priv->recv_list_lock, flags); if (!list_empty(&port_priv->recv_posted_mad_list[qpn])) { - recv = list_entry(&port_priv->recv_posted_mad_list[qpn], - struct ib_mad_private, - header.recv_buf.list); - + rbuf = list_entry(&port_priv->recv_posted_mad_list[qpn], + struct ib_mad_recv_buf, + list); + rbuf = (struct ib_mad_recv_buf *)rbuf->list.next; + recv = (struct ib_mad_private *)((char *)rbuf - + sizeof(struct ib_mad_recv_wc)); + /* Remove from posted receive MAD list */ list_del(&recv->header.recv_buf.list); From roland at topspin.com Tue Sep 28 08:38:57 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Sep 2004 08:38:57 -0700 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <1096379771.3479.60.camel@localhost.localdomain> (Hal Rosenstock's message of "Tue, 28 Sep 2004 09:56:12 -0400") References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> <523c16l3wu.fsf@topspin.com> <6.1.2.0.2.20040927064203.01e60488@esmail.cup.hp.com> <52r7onhicd.fsf@topspin.com> <6.1.2.0.2.20040927081008.01da2090@esmail.cup.hp.com> <52oejrg1ux.fsf@topspin.com> 
<6.1.2.0.2.20040927085419.01f423c8@esmail.cup.hp.com> <52d607fzpl.fsf@topspin.com> <1096304811.11222.94.camel@localhost.localdomain> <52wtyfeit3.fsf@topspin.com> <1096306647.1830.114.camel@localhost.localdomain> <52oejred6r.fsf@topspin.com> <1096379771.3479.60.camel@localhost.localdomain> Message-ID: <52is9ycsj2.fsf@topspin.com> Hal> At IBA 1.2, C15-0.1.22 is obsolete and has been replaced by Hal> C15-0.2.1. C15-0.2.1: When a requester node sends a trusted Hal> request to SA, the requested data shall be returned. When a Hal> requester node sends a non-trusted request for data to SA Hal> that would provide information about a subject node, the SA Hal> shall return only data providing information about subject Hal> nodes for which the requester shares a P_Key, with exceptions Hal> noted below in C15-0.1.23. Hmm, this seems like the only difference from C15-0.1.22 is that it talks about trusted requests. C15-0.1.23 (below) still says that MCMemberRecords don't provide information about any subject nodes, so I guess the SM should not worry about P_Keys when returning the table of multicast groups. Roland> C15-0.1.23: [...] MCMemberRecords shall always be provided Roland> with the PortGID, Join- State and ProxyJoin components set Roland> to 0, except for the case of a trusted request, in which Roland> case the actual component contents shall be provided. 
- Roland From roland at topspin.com Tue Sep 28 08:39:12 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Sep 2004 08:39:12 -0700 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <1096379771.3479.60.camel@localhost.localdomain> (Hal Rosenstock's message of "Tue, 28 Sep 2004 09:56:12 -0400") References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> <523c16l3wu.fsf@topspin.com> <6.1.2.0.2.20040927064203.01e60488@esmail.cup.hp.com> <52r7onhicd.fsf@topspin.com> <6.1.2.0.2.20040927081008.01da2090@esmail.cup.hp.com> <52oejrg1ux.fsf@topspin.com> <6.1.2.0.2.20040927085419.01f423c8@esmail.cup.hp.com> <52d607fzpl.fsf@topspin.com> <1096304811.11222.94.camel@localhost.localdomain> <52wtyfeit3.fsf@topspin.com> <1096306647.1830.114.camel@localhost.localdomain> <52oejred6r.fsf@topspin.com> <1096379771.3479.60.camel@localhost.localdomain> Message-ID: <52ekkmcsin.fsf@topspin.com> Hal> Aren't there separate network interfaces for IPv4 and IPv6 on Hal> top of some lower layer IB interface ? Not that I know of. You can play with IPv6 on an ethernet interface to get a feel for how it works on Linux. Something like ifconfig eth0 inet6 add / should get you started. - Roland From roland at topspin.com Tue Sep 28 08:41:30 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Sep 2004 08:41:30 -0700 Subject: [openib-general] Re: [openib-commits] r894 - gen2/branches/roland-merge/src/linux-kernel/infiniband/ulp/ipoib In-Reply-To: <1096381016.1869.84.camel@localhost.localdomain> (Hal Rosenstock's message of "Tue, 28 Sep 2004 10:16:57 -0400") References: <20040927174406.D62AD2283D5@openib.ca.sandia.gov> <1096309126.9584.16.camel@duffman> <52k6ufed2b.fsf@topspin.com> <1096381016.1869.84.camel@localhost.localdomain> Message-ID: <52acvacset.fsf@topspin.com> Roland> I think I've done all the straightforward work on IPoIB Roland> now. 
We can try to figure out how to make it a "native" Roland> driver now (ie use the full 20 byte HW address instead of Roland> hashing down to 6 bytes, etc). I had some inconclusive Roland> discussions on netdev at oss.sgi.com about this last week but Roland> I still don't know how to do it. Hal> Can you elaborate on the issue ? You can look at the archives for netdev at oss.sgi.com (look for the "Advice needed on IP-over-InfiniBand driver" thread) to see the full exchange. The basic idea is that we need to teach the Linux networking core about the fact that IPoIB needs a second path record lookup after the ARP/ND, and that the layer 2 header is built by the HCA rather than the network driver. - Roland From mshefty at ichips.intel.com Tue Sep 28 09:52:55 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 28 Sep 2004 09:52:55 -0700 Subject: [openib-general] Re: updated TODO list In-Reply-To: <523c13dw16.fsf@topspin.com> References: <20040927174406.D62AD2283D5@openib.ca.sandia.gov> <1096309126.9584.16.camel@duffman> <52k6ufed2b.fsf@topspin.com> <1096324439.9584.33.camel@duffman> <523c13dw16.fsf@topspin.com> Message-ID: <20040928095255.5d677d80.mshefty@ichips.intel.com> On Mon, 27 Sep 2004 18:25:41 -0700 Roland Dreier wrote: > - Implement API for SA path record and MC group queries I spent a couple of days trying to define a basic query API for inclusion in the access layer, but eventually stopped. With the current MAD API, the benefits of having a generic query API (where the user specifies the method, attribute ID, attribute offset, and attribute) didn't seem worth it. It's easy enough for the user to just format a MAD with this information and send it. I think that there may be benefits to having high level query functionality. For example, a client could query using two LIDs. But this sort of functionality seemed outside the scope of the access layer. The management of multicast groups is a little different, but I haven't tried to address that. 
- Sean From roland at topspin.com Tue Sep 28 10:04:04 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Sep 2004 10:04:04 -0700 Subject: [openib-general] Re: updated TODO list In-Reply-To: <20040928095255.5d677d80.mshefty@ichips.intel.com> (Sean Hefty's message of "Tue, 28 Sep 2004 09:52:55 -0700") References: <20040927174406.D62AD2283D5@openib.ca.sandia.gov> <1096309126.9584.16.camel@duffman> <52k6ufed2b.fsf@topspin.com> <1096324439.9584.33.camel@duffman> <523c13dw16.fsf@topspin.com> <20040928095255.5d677d80.mshefty@ichips.intel.com> Message-ID: <52sm92ba0r.fsf@topspin.com> Roland> - Implement API for SA path record and MC group queries Sean> I spent a couple of days trying to define a basic query API Sean> for inclusion in the access layer, but eventually stopped. Sean> With the current MAD API, the benefits of having a generic Sean> query API (where the user specifies the method, attribute Sean> ID, attribute offset, and attribute) didn't seem worth it. Sean> It's easy enough for the user to just format a MAD with this Sean> information and send it. Yeah, I wasn't really talking about a generic query API. I was just talking about encapsulating the details of marshalling all the fields of eg. a path record query and then parsing the response. 
(And similarly for MCMemberRecords) - Roland From mshefty at ichips.intel.com Tue Sep 28 10:23:38 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 28 Sep 2004 10:23:38 -0700 Subject: [openib-general] [PATCH] ib_mad.c: Fix list handling in ib_mad_recv_done_handler In-Reply-To: <1096384342.3479.91.camel@localhost.localdomain> References: <1096384342.3479.91.camel@localhost.localdomain> Message-ID: <20040928102338.79718c05.mshefty@ichips.intel.com> On Tue, 28 Sep 2004 11:12:23 -0400 Hal Rosenstock wrote: > + rbuf = list_entry(&port_priv->recv_posted_mad_list[qpn], > + struct ib_mad_recv_buf, > + list); > + rbuf = (struct ib_mad_recv_buf *)rbuf->list.next; > + recv = (struct ib_mad_private *)((char *)rbuf - > + sizeof(struct > ib_mad_recv_wc)); > + Can you change this to use container_of, rather than assuming the position of the fields in the structure? I think this would work: ib_mad_private_header *mad_priv_hdr; ... mad_priv_hdr = container_of(rbuf, struct ib_mad_private_header, recv_buf); recv = container_of(mad_priv_hdr, struct ib_mad_private, header); From halr at voltaire.com Tue Sep 28 10:55:46 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 28 Sep 2004 13:55:46 -0400 Subject: [openib-general] [PATCH] ib_mad.c: Better implementation of obtaining receive buffer in ib_mad_recv_done_handler Message-ID: <1096394145.1869.99.camel@localhost.localdomain> Better implementation of obtaining receive buffer in ib_mad_recv_done_handler Index: ib_mad.c =================================================================== --- ib_mad.c (revision 898) +++ ib_mad.c (working copy) @@ -706,6 +707,7 @@ static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, struct ib_wc *wc) { + struct ib_mad_private_header *mad_priv_hdr; struct ib_mad_recv_buf *rbuf; struct ib_mad_private *recv; union ib_mad_recv_wrid wrid; @@ -734,8 +736,8 @@ struct ib_mad_recv_buf, list); rbuf = (struct ib_mad_recv_buf *)rbuf->list.next; - recv = (struct ib_mad_private 
*)((char *)rbuf - - sizeof(struct ib_mad_recv_wc)); + mad_priv_hdr = container_of(rbuf, struct ib_mad_private_header, recv_buf); + recv = container_of(mad_priv_hdr, struct ib_mad_private, header) ; /* Remove from posted receive MAD list */ list_del(&recv->header.recv_buf.list); From mshefty at ichips.intel.com Tue Sep 28 12:32:21 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 28 Sep 2004 12:32:21 -0700 Subject: [openib-general] [PATCH] cancel outstanding MADs when deregistering Message-ID: <20040928123221.45cde4ac.mshefty@ichips.intel.com> This patch should allow canceling of sent MADs when deregistration occurs. This seemed a little trickier (to keep simple anyway) than I thought at first, so comments are welcome. - Sean -- Index: access/ib_mad_priv.h =================================================================== --- access/ib_mad_priv.h (revision 899) +++ access/ib_mad_priv.h (working copy) @@ -120,7 +120,8 @@ struct ib_mad_agent *agent; u64 wr_id; /* client WRID */ int timeout_ms; - int is_active; + int refcount; + enum ib_wc_status status; }; struct ib_mad_mgmt_method_table { Index: access/ib_mad.c =================================================================== --- access/ib_mad.c (revision 899) +++ access/ib_mad.c (working copy) @@ -86,7 +86,7 @@ struct ib_qp *qp); static int ib_mad_post_receive_mads(struct ib_mad_port_private *priv); static inline u8 convert_mgmt_class(u8 mgmt_class); - +static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv); /* * ib_register_mad_agent - Register to send/receive MADs @@ -252,11 +252,11 @@ mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, agent); - /* Cleanup outstanding sends/pending receives for this agent !!! */ + /* Cleanup pending receives for this agent !!! 
*/ + cancel_mads(mad_agent_priv); spin_lock_irqsave(&mad_agent_priv->port_priv->reg_lock, flags); remove_mad_reg_req(mad_agent_priv); - /* Remove mad agent from port's agent list */ list_del(&mad_agent_priv->agent_list); spin_unlock_irqrestore(&mad_agent_priv->port_priv->reg_lock, flags); @@ -343,18 +343,21 @@ return -ENOMEM; } - /* Track sent MAD with agent. */ + mad_send_wr->agent = mad_agent; + mad_send_wr->timeout_ms = cur_send_wr->wr.ud.timeout_ms; + if (mad_send_wr->timeout_ms) + mad_send_wr->refcount = 2; + else + mad_send_wr->refcount = 1; + mad_send_wr->status = IB_WC_SUCCESS; + + /* Reference MAD agent until send completes. */ + atomic_inc(&mad_agent_priv->refcount); spin_lock_irqsave(&mad_agent_priv->send_list_lock, flags); list_add_tail(&mad_send_wr->agent_send_list, &mad_agent_priv->send_list); spin_unlock_irqrestore(&mad_agent_priv->send_list_lock, flags); - /* Reference MAD agent until send completes. */ - atomic_inc(&mad_agent_priv->refcount); - mad_send_wr->agent = mad_agent; - mad_send_wr->timeout_ms = cur_send_wr->wr.ud.timeout_ms; - mad_send_wr->is_active = 1; - wr = *cur_send_wr; wr.next = NULL; @@ -368,9 +371,11 @@ list_del(&mad_send_wr->agent_send_list); spin_unlock_irqrestore(&mad_agent_priv->send_list_lock, flags); + *bad_send_wr = cur_send_wr; if (atomic_dec_and_test(&mad_agent_priv->refcount)) wake_up(&mad_agent_priv->wait); + printk(KERN_NOTICE "ib_send_mad failed, ret = %d\n", ret); return ret; } @@ -826,20 +831,31 @@ mad_agent_priv = container_of(mad_send_wr->agent, struct ib_mad_agent_private, agent); - /* Check whether timeout was requested !!! */ - mad_send_wr->is_active = 0; + spin_lock_irqsave(&mad_agent_priv->send_list_lock, flags); + if (mad_send_wc->status != IB_WC_SUCCESS && + mad_send_wr->status == IB_WC_SUCCESS) { + + mad_send_wr->status = mad_send_wc->status; + if (mad_send_wr->timeout_ms) { + mad_send_wr->timeout_ms = 0; + mad_send_wr->refcount--; + } + } - /* Handle RMPP... 
*/ + if (--mad_send_wr->refcount > 0) { + spin_unlock_irqrestore(&mad_agent_priv->send_list_lock, flags); + return; + } /* Remove send from MAD agent and notify client of completion. */ - spin_lock_irqsave(&mad_agent_priv->send_list_lock, - flags); list_del(&mad_send_wr->agent_send_list); - spin_unlock_irqrestore(&mad_agent_priv->send_list_lock, - flags); + spin_unlock_irqrestore(&mad_agent_priv->send_list_lock, flags); + + if (mad_send_wr->status != IB_WC_SUCCESS ) + mad_send_wc->status = mad_send_wr->status; mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, mad_send_wc); - /* Release reference taken when sending. */ + /* Release reference on agent taken when sending. */ if (atomic_dec_and_test(&mad_agent_priv->refcount)) wake_up(&mad_agent_priv->wait); @@ -935,6 +951,55 @@ } } +static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv) +{ + unsigned long flags; + struct ib_mad_send_wr_private *mad_send_wr, *temp_mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + struct list_head cancel_list; + + INIT_LIST_HEAD(&cancel_list); + + spin_lock_irqsave(&mad_agent_priv->send_list_lock, flags); + list_for_each_entry_safe(mad_send_wr, temp_mad_send_wr, + &mad_agent_priv->send_list, agent_send_list) { + + if (mad_send_wr->status == IB_WC_SUCCESS) + mad_send_wr->status = IB_WC_WR_FLUSH_ERR; + + if (mad_send_wr->timeout_ms) { + mad_send_wr->timeout_ms = 0; + mad_send_wr->refcount--; + } + + if (mad_send_wr->refcount <= 0) { + list_del(&mad_send_wr->agent_send_list); + list_add_tail(&mad_send_wr->agent_send_list, + &cancel_list); + } + } + spin_unlock_irqrestore(&mad_agent_priv->send_list_lock, flags); + + /* Report all canceled requests. 
*/ + mad_send_wc.status = IB_WC_WR_FLUSH_ERR; + mad_send_wc.vendor_err = 0; + + list_for_each_entry_safe(mad_send_wr, temp_mad_send_wr, + &cancel_list, agent_send_list) { + + mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + list_del(&mad_send_wr->agent_send_list); + kfree(mad_send_wr); + + /* Release reference on agent taken when sending. */ + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + } +} + /* * IB MAD thread */ From halr at voltaire.com Tue Sep 28 12:35:39 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 28 Sep 2004 15:35:39 -0400 Subject: [openib-general] MAD layer needs for porting SA client code Message-ID: <1096400138.1863.139.camel@localhost.localdomain> Hi Roland, Just wanted to double check on the MAD layer needs for porting the current SA client code (for Get PathRecord and Set/Delete/Get MCMemberRecord for IPoIB). Is it correct that this code does not rely on request/response matching (and timeouts) currently ? If not, those aspects of the code do not need to be tested before declaring the MAD layer as usable. Thanks. -- Hal From roland at topspin.com Tue Sep 28 12:38:17 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Sep 2004 12:38:17 -0700 Subject: [openib-general] MAD layer needs for porting SA client code In-Reply-To: <1096400138.1863.139.camel@localhost.localdomain> (Hal Rosenstock's message of "Tue, 28 Sep 2004 15:35:39 -0400") References: <1096400138.1863.139.camel@localhost.localdomain> Message-ID: <52y8iu9oba.fsf@topspin.com> Hal> Just wanted to double check on the MAD layer needs for Hal> porting the current SA client code (for Get PathRecord and Hal> Set/Delete/Get MCMemberRecord for IPoIB). Is it correct that Hal> this code does not rely on request/response matching (and Hal> timeouts) currently ? 
If not, those aspects of the code do Hal> not need to be tested before declaring the MAD layer as Hal> usable. SA queries definitely rely on matching responses to requests and on timeouts. I don't see how we can sanely make SA queries otherwise. - Roland From halr at voltaire.com Tue Sep 28 12:50:21 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 28 Sep 2004 15:50:21 -0400 Subject: [openib-general] MAD layer needs for porting SA client code In-Reply-To: <52y8iu9oba.fsf@topspin.com> References: <1096400138.1863.139.camel@localhost.localdomain> <52y8iu9oba.fsf@topspin.com> Message-ID: <1096401020.1869.146.camel@localhost.localdomain> On Tue, 2004-09-28 at 15:38, Roland Dreier wrote: > Hal> Just wanted to double check on the MAD layer needs for > Hal> porting the current SA client code (for Get PathRecord and > Hal> Set/Delete/Get MCMemberRecord for IPoIB). Is it correct that > Hal> this code does not rely on request/response matching (and > Hal> timeouts) currently ? If not, those aspects of the code do > Hal> not need to be tested before declaring the MAD layer as > Hal> usable. > > SA queries definitely rely on matching responses to requests and on > timeouts. I don't see how we can sanely make SA queries otherwise. Yes, someone has to do it. For the current SA query code, is this done inside the MAD layer (rather than on top of it) ? I thought that might not be the case based on an email thread on this with Yaron but perhaps I don't recall the details correctly. 
-- Hal From roland at topspin.com Tue Sep 28 12:50:52 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Sep 2004 12:50:52 -0700 Subject: [openib-general] [PATCH] cancel outstanding MADs when deregistering In-Reply-To: <20040928123221.45cde4ac.mshefty@ichips.intel.com> (Sean Hefty's message of "Tue, 28 Sep 2004 12:32:21 -0700") References: <20040928123221.45cde4ac.mshefty@ichips.intel.com> Message-ID: <52u0ti9nqb.fsf@topspin.com> Sean> This patch should allow canceling of sent MADs when Sean> deregistration occurs. This seemed a little trickier (to Sean> keep simple anyway) than I thought at first, so comments are Sean> welcome. It looks OK for current functionality but I think it will have to change to support cancelling sends. (Cancelling sends is required for consumers that start a query with a long timeout and then want to unload or something like that). When someone asks to cancel a send you have to tell the consumer whether the send was canceled or had already finished (so that they know whether the resources have already been freed). You have to make sure that you don't say that the send wasn't canceled and then return with the send completion handler still running on some other CPU, because then the consumer will probably corrupt context. That means that you can't just remove things from your pending list like the code does now -- you have to leave them there and mark them "callback running" or something like that. It ends up being even more complicated unfortunately. 
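A rough model of the bookkeeping described above, as a single-threaded userspace sketch. It is not code from the patch or from ib_mad.c: the names (tracked_send, try_cancel, begin_completion, end_completion) and the three states are made up for illustration. The assumption is that a real implementation would perform these transitions under send_list_lock and would actually sleep in the "must wait" case until the handler has returned.

```c
#include <assert.h>

/* Hypothetical per-send states; none of these names exist in ib_mad.c. */
enum send_state {
	SEND_PENDING,          /* on the pending list, handler not started */
	SEND_CALLBACK_RUNNING, /* still on the list, handler executing */
	SEND_DONE              /* canceled, or handler has returned */
};

enum cancel_result {
	CANCEL_OK,        /* canceled; the consumer still owns its resources */
	CANCEL_MUST_WAIT, /* handler is running; cancel may not return yet */
	CANCEL_TOO_LATE   /* already completed; resources already released */
};

struct tracked_send {
	enum send_state state;
	int canceled;
};

/*
 * Completion path: mark the entry instead of unlinking it.
 * Returns 1 if the handler may run, 0 if the send was already canceled.
 */
static int begin_completion(struct tracked_send *s)
{
	if (s->state != SEND_PENDING)
		return 0;
	s->state = SEND_CALLBACK_RUNNING;
	return 1;
}

/* Called after the consumer's send handler has returned. */
static void end_completion(struct tracked_send *s)
{
	assert(s->state == SEND_CALLBACK_RUNNING);
	s->state = SEND_DONE;
}

/* Cancel path: the result tells the consumer who frees the context. */
static enum cancel_result try_cancel(struct tracked_send *s)
{
	switch (s->state) {
	case SEND_PENDING:
		s->canceled = 1;
		s->state = SEND_DONE;
		return CANCEL_OK;
	case SEND_CALLBACK_RUNNING:
		return CANCEL_MUST_WAIT;
	default:
		return CANCEL_TOO_LATE;
	}
}
```

The point of keeping the entry on the list while the handler runs is that try_cancel() can then distinguish "still running" from "finished", which is not possible if the completion path unlinks the entry before calling the handler.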
- Roland From roland at topspin.com Tue Sep 28 12:52:28 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Sep 2004 12:52:28 -0700 Subject: [openib-general] MAD layer needs for porting SA client code In-Reply-To: <1096401020.1869.146.camel@localhost.localdomain> (Hal Rosenstock's message of "Tue, 28 Sep 2004 15:50:21 -0400") References: <1096400138.1863.139.camel@localhost.localdomain> <52y8iu9oba.fsf@topspin.com> <1096401020.1869.146.camel@localhost.localdomain> Message-ID: <52pt469nnn.fsf@topspin.com> Hal> For the current SA query code, is this done inside the MAD Hal> layer (rather than on top of it) ? I thought that might not Hal> be the case based on an email thread on this with Yaron but Hal> perhaps I don't recall the details correctly. In my tree there is another layer ("client_query") on top of the MAD layer that handles this matching. However it doesn't make sense to carry that forward given the design of our current MAD layer. - Roland From halr at voltaire.com Tue Sep 28 13:04:56 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 28 Sep 2004 16:04:56 -0400 Subject: [openib-general] MAD layer needs for porting SA client code In-Reply-To: <52pt469nnn.fsf@topspin.com> References: <1096400138.1863.139.camel@localhost.localdomain> <52y8iu9oba.fsf@topspin.com> <1096401020.1869.146.camel@localhost.localdomain> <52pt469nnn.fsf@topspin.com> Message-ID: <1096401896.3479.156.camel@localhost.localdomain> On Tue, 2004-09-28 at 15:52, Roland Dreier wrote: > In my tree there is another layer ("client_query") on top of the MAD > layer that handles this matching. However it doesn't make sense to > carry that forward given the design of our current MAD layer. Got it. That's good to know. Thanks. Just want to understand one more thing about what you just wrote: Is it correct to say that it doesn't make sense to carry this layer forward given the design of our current MAD layer because it does support the function of request/response with timeout ? 
-- Hal From roland at topspin.com Tue Sep 28 13:10:50 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Sep 2004 13:10:50 -0700 Subject: [openib-general] MAD layer needs for porting SA client code In-Reply-To: <1096401896.3479.156.camel@localhost.localdomain> (Hal Rosenstock's message of "Tue, 28 Sep 2004 16:04:56 -0400") References: <1096400138.1863.139.camel@localhost.localdomain> <52y8iu9oba.fsf@topspin.com> <1096401020.1869.146.camel@localhost.localdomain> <52pt469nnn.fsf@topspin.com> <1096401896.3479.156.camel@localhost.localdomain> Message-ID: <52lleu9mt1.fsf@topspin.com> Hal> Just want to understand one more thing about what you just Hal> wrote: Is it correct to say that it doesn't make sense to Hal> carry this layer forward given the design of our current MAD Hal> layer because it does support the function of Hal> request/response with timeout ? Yes, exactly. During the discussion on the design of the MAD layer, I suggested having a thin layer that handled nothing beyond allowing multiple consumers to share QP 0/1 and then building things like request/response matching in layers on top of that. This was overwhelmingly rejected by Yaron and others, who said that all of these functions had to be in the core MAD layer. Given that we are proceeding with a design that puts all this functionality into the core MAD layer, it doesn't make sense for other code to duplicate the same functions. 
- Roland From krkumar at us.ibm.com Tue Sep 28 13:21:51 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Tue, 28 Sep 2004 13:21:51 -0700 (PDT) Subject: [openib-general] [PATCH] cancel outstanding MADs when deregistering In-Reply-To: <20040928123221.45cde4ac.mshefty@ichips.intel.com> Message-ID: Sean, I have a couple of questions regarding your patch, not real problems :-) In cancel_mads() : > if (mad_send_wr->refcount <= 0) { If there is no good reason for the refcount to drop below zero, it is better to put BUG_ON for such code to catch potential bugs much earlier, while keeping the check as "if (x == 0)", etc. Also, if timeout_ms is set, will those entries get removed from the list (since they have refcnt of two) ? Finally, do you want to wake up threads waiting on mad_agent_priv->wait once finally out of the loop or each time when the refcnt drops to zero ? If there is no reason to do so each time, you can do once you finish the cancel list. BTW, in this case you can also call wake_up_nr() if you keep track of the number of times you want to wake up based on the return value of atomic_dec_* call. thx, - KK From halr at voltaire.com Tue Sep 28 13:59:10 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 28 Sep 2004 16:59:10 -0400 Subject: [openib-general] [PATCH] ib_mad.c: Fix registration bugs Message-ID: <1096405150.1863.178.camel@localhost.localdomain> Fix registration bugs Registration and deregistration now appear to be working although this has not been extensively tested. 
Index: ib_mad.c =================================================================== --- ib_mad.c (revision 899) +++ ib_mad.c (working copy) @@ -203,6 +203,7 @@ /* Now, fill in the various structures */ memset(mad_agent_priv, 0, sizeof *mad_agent_priv); + mad_agent_priv->port_priv = port_priv; mad_agent_priv->reg_req = reg_req; mad_agent_priv->rmpp_version = rmpp_version; mad_agent_priv->agent.device = device; @@ -496,16 +497,16 @@ return 0; private = priv->port_priv; + mgmt_class = convert_mgmt_class(mad_reg_req->mgmt_class); class = &private->version[mad_reg_req->mgmt_class_version]; - mgmt_class = convert_mgmt_class(mad_reg_req->mgmt_class); if (!*class) { - /* Allocate management class table for "new" version */ + /* Allocate management class table for "new" class version */ *class = kmalloc(sizeof **class, GFP_KERNEL); if (!*class) { printk(KERN_ERR "No memory for ib_mad_mgmt_class_table\n"); goto error1; } - /* Clear management class table */ + /* Clear management class table for this class version */ for (i = 0; i < MAX_MGMT_CLASS; i++) { (*class)->method_table[i] = NULL; } From halr at voltaire.com Tue Sep 28 14:20:33 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 28 Sep 2004 17:20:33 -0400 Subject: [openib-general] Re: [PATCH] cancel outstanding MADs when deregistering In-Reply-To: <20040928123221.45cde4ac.mshefty@ichips.intel.com> References: <20040928123221.45cde4ac.mshefty@ichips.intel.com> Message-ID: <1096406433.1869.185.camel@localhost.localdomain> On Tue, 2004-09-28 at 15:32, Sean Hefty wrote: > This patch should allow canceling of sent MADs when deregistration > occurs. Thanks. Applied. -- Hal From roland at topspin.com Tue Sep 28 14:50:57 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Sep 2004 14:50:57 -0700 Subject: [openib-general] [PATCH] Fix MAD completion handling Message-ID: <52brfq9i66.fsf@topspin.com> While looking over Sean's changes, I noticed what look like a few bugs in the mad thread usage. 
It didn't seem like there was any way for the MAD thread to stop, and I think there are a few race conditions that could lead to lost wakeups. This patch tries to fix both of these problems. I didn't test this because I didn't feel like messing with the Makefile to get it to build in my environment. (It would be good to switch to a standard kbuild Makefile so things like cross-compiling and separate object directories work) Thanks, Roland Index: infiniband/access/ib_mad_priv.h =================================================================== --- infiniband/access/ib_mad_priv.h (revision 899) +++ infiniband/access/ib_mad_priv.h (working copy) @@ -131,11 +131,6 @@ struct ib_mad_mgmt_method_table *method_table[MAX_MGMT_CLASS]; }; -struct ib_mad_thread_private { - wait_queue_head_t wait; - atomic_t completion_event; -}; - struct ib_mad_port_private { struct list_head port_list; struct ib_device *device; @@ -159,7 +154,7 @@ u32 recv_wr_index[IB_MAD_QPS_CORE]; struct task_struct *mad_thread; - struct ib_mad_thread_private mad_thread_private; + int thread_wake; }; #endif /* __IB_MAD_PRIV_H__ */ Index: infiniband/access/ib_mad.c =================================================================== --- infiniband/access/ib_mad.c (revision 899) +++ infiniband/access/ib_mad.c (working copy) @@ -892,6 +892,8 @@ struct ib_wc wc; int err_status = 0; + ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); + while (ib_poll_cq(port_priv->cq, 1, &wc) == 1) { printk(KERN_DEBUG "Completion opcode 0x%x WRID 0x%Lx\n", wc.opcode, wc.wr_id); if (wc.status != IB_WC_SUCCESS) { @@ -928,11 +930,8 @@ } } - if (err_status) { + if (err_status) ib_mad_port_restart(port_priv); - } else { - ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); - } } /* @@ -941,23 +940,22 @@ static int ib_mad_thread(void *param) { struct ib_mad_port_private *port_priv = param; - struct ib_mad_thread_private *mad_thread_priv = &port_priv->mad_thread_private; - int ret; - while (1) { - while (!signal_pending(current)) { - ret 
= wait_event_interruptible(mad_thread_priv->wait, - atomic_read(&mad_thread_priv->completion_event) > 0); - atomic_set(&mad_thread_priv->completion_event, 0); - if (ret) { - printk(KERN_ERR "ib_mad thread exiting\n"); - return 0; - } + __set_current_state(TASK_RUNNING); - ib_mad_completion_handler(port_priv); + do { + port_priv->thread_wake = 0; + wmb(); - } - } + ib_mad_completion_handler(port_priv); + + set_current_state(TASK_INTERRUPTIBLE); + if (!port_priv->thread_wake) + schedule(); + __set_current_state(TASK_RUNNING); + } while (!kthread_should_stop()); + + return 0; } /* @@ -965,11 +963,8 @@ */ static int ib_mad_thread_init(struct ib_mad_port_private *port_priv) { - struct ib_mad_thread_private *mad_thread_priv = &port_priv->mad_thread_private; + port_priv->thread_wake = 0; - atomic_set(&mad_thread_priv->completion_event, 0); - init_waitqueue_head(&mad_thread_priv->wait); - port_priv->mad_thread = kthread_create(ib_mad_thread, port_priv, "ib_mad(%6s-%-2d)", @@ -978,27 +973,18 @@ if (IS_ERR(port_priv->mad_thread)) { printk(KERN_ERR "Couldn't start ib_mad thread for %s port %d\n", port_priv->device->name, port_priv->port_num); - return 1; + return PTR_ERR(port_priv->mad_thread); } - wake_up_process(port_priv->mad_thread); return 0; } -/* - * Stop the IB MAD thread - */ -static void ib_mad_thread_stop(struct ib_mad_port_private *port_priv) -{ - kthread_stop(port_priv->mad_thread); /* !!! 
*/ -} - static void ib_mad_thread_completion_handler(struct ib_cq *cq) { struct ib_mad_port_private *port_priv = cq->cq_context; - struct ib_mad_thread_private *mad_thread_priv = &port_priv->mad_thread_private; - atomic_inc(&mad_thread_priv->completion_event); - wake_up_interruptible(&mad_thread_priv->wait); + port_priv->thread_wake = 1; + wmb(); + wake_up_process(port_priv->mad_thread); } static int ib_mad_post_receive_mad(struct ib_mad_port_private *port_priv, @@ -1527,7 +1513,7 @@ spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); ib_mad_port_stop(port_priv); - ib_mad_thread_stop(port_priv); + kthread_stop(port_priv->mad_thread); ib_destroy_qp(port_priv->qp[1]); ib_destroy_qp(port_priv->qp[0]); ib_dereg_mr(port_priv->mr); From halr at voltaire.com Tue Sep 28 15:09:26 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 28 Sep 2004 18:09:26 -0400 Subject: [openib-general] Re: [PATCH] Fix MAD completion handling In-Reply-To: <52brfq9i66.fsf@topspin.com> References: <52brfq9i66.fsf@topspin.com> Message-ID: <1096409366.1863.193.camel@localhost.localdomain> On Tue, 2004-09-28 at 17:50, Roland Dreier wrote: > I didn't test this because I didn't feel like messing with the > Makefile to get it to build in my environment. That's understandable. I will test it. > (It would be good to > switch to a standard kbuild Makefile so things like cross-compiling > and separate object directories work) I have the changes for this but it requires some conditionalization to the core/Makefile which you previously objected to. Should I generate a patch for this ? 
-- Hal From roland at topspin.com Tue Sep 28 15:19:26 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Sep 2004 15:19:26 -0700 Subject: [openib-general] Re: [PATCH] Fix MAD completion handling In-Reply-To: <1096409366.1863.193.camel@localhost.localdomain> (Hal Rosenstock's message of "Tue, 28 Sep 2004 18:09:26 -0400") References: <52brfq9i66.fsf@topspin.com> <1096409366.1863.193.camel@localhost.localdomain> Message-ID: <523c129gup.fsf@topspin.com> Hal> I have the changes for this but it requires some Hal> conditionalization to the core/Makefile which you previously Hal> objected to. Should I generate a patch for this ? Switching the Makefile from what it is now to something that uses kbuild has to be an improvement. I'm not sure why touching core/Makefile is required -- I'm not planning on committing this to my branch until I'm ready to cut over completely, and you shouldn't need any conditionals on your branch. Thanks, Roland From halr at voltaire.com Tue Sep 28 15:33:40 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 28 Sep 2004 18:33:40 -0400 Subject: [openib-general] Re: [PATCH] Fix MAD completion handling In-Reply-To: <523c129gup.fsf@topspin.com> References: <52brfq9i66.fsf@topspin.com> <1096409366.1863.193.camel@localhost.localdomain> <523c129gup.fsf@topspin.com> Message-ID: <1096410820.1869.200.camel@localhost.localdomain> On Tue, 2004-09-28 at 18:19, Roland Dreier wrote: > Hal> I have the changes for this but it requires some > Hal> conditionalization to the core/Makefile which you previously > Hal> objected to. Should I generate a patch for this ? > > Switching the Makefile from what it is now to something that uses > kbuild has to be an improvement. I'm not sure why touching > core/Makefile is required -- I'm not planning on committing this to my > branch until I'm ready to cut over completely, and you shouldn't need > any conditionals on your branch. 
The conditional is to allow your build or a build using ib_mad as the access layer. Different parts of core are pulled in right now. It is transitional and would go away once the cutover occurs. What will happen with other ULPs (than IPoIB) once your cutover occurs ? -- Hal From timur.tabi at ammasso.com Tue Sep 28 15:34:54 2004 From: timur.tabi at ammasso.com (Timur Tabi) Date: Tue, 28 Sep 2004 17:34:54 -0500 Subject: [openib-general] get_user_pages() vs. sys_mlock() and 2.6 kernel In-Reply-To: <20040903160745.A6309@topspin.com> References: <4134FD31.4060301@ammasso.com> <20040903160745.A6309@topspin.com> Message-ID: <4159E70E.5040705@ammasso.com> Libor Michalek wrote: > I've seen the problem in test cases, so it definitely can happen in 2.4. > Looking at the 2.6 code the problem appears to be fixed, but I have not > had a chance to run tests to verify it. A good place to take a look if you > are interested is in launder_page() and try_to_unmap() in the kernel. I'm afraid it has not been fixed in 2.6. I just ran our memory locking tests, and they failed with get_user_pages running on Suse Linux 9.1 (kernel 2.6.4). The test app does this: 1) Calls our driver, which issues a get_user_pages() call for one page. 2) Using pgd/pmd/pte_offset, gets the physical address for the page. 3) Tries to allocate 1GB of memory (this system has 1GB of physical RAM). 4) Tries to get the physical address again. In step 4, the physical address is often zero, which means either pgd_offset or pmd_offset failed. This indicates the page was swapped out. I don't understand how this bug can continue to exist after all this time. get_user_pages() is supposed to lock the memory, because drivers use it for DMA'ing directly into user memory. 
-- Timur Tabi Staff Software Engineer timur.tabi at ammasso.com From tduffy at sun.com Tue Sep 28 15:50:06 2004 From: tduffy at sun.com (Tom Duffy) Date: Tue, 28 Sep 2004 15:50:06 -0700 Subject: [openib-general] Re: [PATCH] Fix MAD completion handling In-Reply-To: <1096410820.1869.200.camel@localhost.localdomain> References: <52brfq9i66.fsf@topspin.com> <1096409366.1863.193.camel@localhost.localdomain> <523c129gup.fsf@topspin.com> <1096410820.1869.200.camel@localhost.localdomain> Message-ID: <1096411807.25336.17.camel@duffman> On Tue, 2004-09-28 at 18:33 -0400, Hal Rosenstock wrote: > What will happen with other ULPs (than IPoIB) once your cutover occurs ? If they don't build either depend them on BROKEN or move them aside. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From krause at cup.hp.com Tue Sep 28 15:50:58 2004 From: krause at cup.hp.com (Michael Krause) Date: Tue, 28 Sep 2004 15:50:58 -0700 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <52d607fzpl.fsf@topspin.com> References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> <523c16l3wu.fsf@topspin.com> <6.1.2.0.2.20040927064203.01e60488@esmail.cup.hp.com> <52r7onhicd.fsf@topspin.com> <6.1.2.0.2.20040927081008.01da2090@esmail.cup.hp.com> <52oejrg1ux.fsf@topspin.com> <6.1.2.0.2.20040927085419.01f423c8@esmail.cup.hp.com> <52d607fzpl.fsf@topspin.com> Message-ID: <6.1.2.0.2.20040927135932.01e30340@esmail.cup.hp.com> At 09:23 AM 9/27/2004, Roland Dreier wrote: > Michael> The SM only knows what it configures in each port. The > Michael> SA is responsible for service management and it works > Michael> with the SM to map a given service to a P_Key. > >As far as I know there is no service (in the IB service record sense) >associated to IPoIB, is there? 
Additional services can be defined beyond the base IB service record. For IP, the following would solve the problem: - Have the SA define an IP service (this can be done for other protocols as well). This can be done using the Service ID annex specification (much of this annex is focused on an endnode but for this purpose, the SA can be viewed as a logical endnode). The SA registers the service providing a single point of management for the subnet. - Each endnode issues request targeting the IP service identifier for each P_Key that is configured in the port. The SA response determines whether the IP service is supported on this interface. - For each P_Key, the endnode probes for the "all nodes" multicast address and joins / creates accordingly. Given P_Key can come and go, the endnode can set up an event notification when the IP service is updated. This allows dynamic configuration without having an administrator interact with each node. The above approach is both scalable and relatively simple to implement and manage. > Michael> IPoverIB is required to inquire what groups are available > Michael> and optionally set up event notification to be informed > Michael> when groups are added for its particular service. This > Michael> eliminates the need for local P_Key management. > >I don't see this requirement anywhere in the current IETF drafts, >although I could be missing it. In any case this seems rather ugly, >since the only way to get a list of IPoIB multicast groups seems to be >to query for _all_ multicast groups, filter for those that match the >IPoIB GID format, and then attempt to join to find out which can be >used on each local port. > > Michael> In general, the IPoverIB driver should treat each new > Michael> all-nodes multicast group with a unique P_Key as a > Michael> virtual hot-plug event (this was our intent both within > Michael> the IETF and in the IBTA). > >Hmm.. this view does not seem to match the wording of the current IPoIB >drafts. 
For example: > > It is an implementation choice on how the P_Key and the scope > bits related to the IPoIB subnet are determined by the > implementation. These could be configuration parameters > initialized by some means by the administrator. > > The methods employed by an implementation to determine the > P_Key and scope bits are not specified by IPoIB. It was not specified due to the lack of standards for higher-level service management which partitioning is classified. Given there is an OpenSM effort in flight and the Service ID spec is already in existence, it isn't that tough to acquire an IP service ID (or any other protocol that one wants to support) and implement a solution along the lines that I describe above. This would lead to a more dynamic environment while reducing the impact to the administrator. Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From krkumar at us.ibm.com Tue Sep 28 15:42:04 2004 From: krkumar at us.ibm.com (Krishna Kumar) Date: Tue, 28 Sep 2004 15:42:04 -0700 (PDT) Subject: [openib-general] IB design docs/links [Was : MAD layer needs for porting SA client code] Message-ID: Hi all, > Yes, exactly. During the discussion on the design of the MAD layer, I Is there any written document with details of the design of MAD and other layers ? Or are these discussions completely online (I admit I missed them in the large volume of mails). If there are documents, I request that they be included in the toplevel directory of the submission tree (similar to Documentation directory) so that people like me who are new to this field can look at it to understand the code (and it will also be useful for anyone wanting to contribute in a particular area). Roland, what is the roland-merge directory exactly ? Is it the topspin stack that you are modifying to finally add to the openib-candidate directory ? Thanks, - KK > Yes, exactly. 
During the discussion on the design of the MAD layer, I > suggested having a thin layer that handled nothing beyond allowing > multiple consumers to share QP 0/1 and then building things like > request/response matching in layers on top of that. This was > overwhelmingly rejected by Yaron and others, who said that all of > these functions had to be in the core MAD layer. From tduffy at sun.com Tue Sep 28 16:04:18 2004 From: tduffy at sun.com (Tom Duffy) Date: Tue, 28 Sep 2004 16:04:18 -0700 Subject: [openib-general] Re: [PATCH] Fix MAD completion handling In-Reply-To: <1096410820.1869.200.camel@localhost.localdomain> References: <52brfq9i66.fsf@topspin.com> <1096409366.1863.193.camel@localhost.localdomain> <523c129gup.fsf@topspin.com> <1096410820.1869.200.camel@localhost.localdomain> Message-ID: <1096412658.25336.20.camel@duffman> On Tue, 2004-09-28 at 18:33 -0400, Hal Rosenstock wrote: > The conditional is to allow your build or a build using ib_mad as the > access layer. Different parts of core are pulled in right now. It is > transitional and would go away once the cutover occurs. Hal, maybe it would be useful if you posted instructions on how you set up your build so we could reproduce it. I would like to start playing around with the new MAD layer, etc. Thanks, -tduffy 
From halr at voltaire.com Tue Sep 28 16:06:22 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 28 Sep 2004 19:06:22 -0400 Subject: [openib-general] [PATCH] ib_mad: Fix ib_free_recv_mad Message-ID: <1096412782.1863.204.camel@localhost.localdomain> ib_mad: Fix ib_free_recv_mad Index: ib_mad.c =================================================================== --- ib_mad.c (revision 901) +++ ib_mad.c (working copy) @@ -394,8 +394,12 @@ void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc) { struct ib_mad_recv_buf *entry; - struct ib_mad_private *buffer = (struct ib_mad_private *)mad_recv_wc; + struct ib_mad_private_header *mad_priv_hdr; + struct ib_mad_private *buffer; + mad_priv_hdr = container_of(mad_recv_wc, struct ib_mad_private_header, recv_wc); + buffer = container_of(mad_priv_hdr, struct ib_mad_private, header); + /* * Walk receive buffer list associated with this WC * No need to remove them from list of receive buffers @@ -403,7 +407,8 @@ list_for_each_entry(entry, &mad_recv_wc->recv_buf->list, list) { /* Free previous receive buffer */ kmem_cache_free(ib_mad_cache, buffer); - buffer = (void *)entry - sizeof(struct ib_mad_private_header); + mad_priv_hdr = container_of(entry, struct ib_mad_private_header, recv_buf); + buffer = container_of(mad_priv_hdr, struct ib_mad_private, header); } /* Free last buffer */ kmem_cache_free(ib_mad_cache, buffer); @@ -768,7 +773,7 @@ recv->header.recv_wc.recv_buf = &recv->header.recv_buf; /* Setup MAD receive buffer */ - recv->header.recv_buf.list.next = NULL; /* Until RMPP implemented !!! */ + INIT_LIST_HEAD(&recv->header.recv_buf.list); /* More for RMPP!!! 
*/ recv->header.recv_buf.mad = (struct ib_mad *)&recv->mad; if (wc->wc_flags & IB_WC_GRH) { recv->header.recv_buf.grh = &recv->grh; From roland at topspin.com Tue Sep 28 16:13:49 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Sep 2004 16:13:49 -0700 Subject: [openib-general] Re: [PATCH] Fix MAD completion handling In-Reply-To: <1096411807.25336.17.camel@duffman> (Tom Duffy's message of "Tue, 28 Sep 2004 15:50:06 -0700") References: <52brfq9i66.fsf@topspin.com> <1096409366.1863.193.camel@localhost.localdomain> <523c129gup.fsf@topspin.com> <1096410820.1869.200.camel@localhost.localdomain> <1096411807.25336.17.camel@duffman> Message-ID: <52y8iu7zrm.fsf@topspin.com> Hal> What will happen with other ULPs (than IPoIB) once your Hal> cutover occurs ? Tom> If they don't build either depend them on BROKEN or move them Tom> aside. Exactly -- I was planning on marking them BROKEN. - R. From roland at topspin.com Tue Sep 28 16:29:21 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Sep 2004 16:29:21 -0700 Subject: [openib-general] get_user_pages() vs. sys_mlock() and 2.6 kernel In-Reply-To: <4159E70E.5040705@ammasso.com> (Timur Tabi's message of "Tue, 28 Sep 2004 17:34:54 -0500") References: <4134FD31.4060301@ammasso.com> <20040903160745.A6309@topspin.com> <4159E70E.5040705@ammasso.com> Message-ID: <52u0ti7z1q.fsf@topspin.com> Timur> I don't understand how this bug can continue to exist after Timur> all this time. get_user_pages() is supposed to lock the Timur> memory, because drivers use it for DMA'ing directly into Timur> user memory. 
I think you will get a resolution much faster if you A) post your test code so people can reproduce the bug B) post to the linux-kernel mailing list rather than openib-general Thanks, Roland From roland at topspin.com Tue Sep 28 16:31:28 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Sep 2004 16:31:28 -0700 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <6.1.2.0.2.20040927135932.01e30340@esmail.cup.hp.com> (Michael Krause's message of "Tue, 28 Sep 2004 15:50:58 -0700") References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> <523c16l3wu.fsf@topspin.com> <6.1.2.0.2.20040927064203.01e60488@esmail.cup.hp.com> <52r7onhicd.fsf@topspin.com> <6.1.2.0.2.20040927081008.01da2090@esmail.cup.hp.com> <52oejrg1ux.fsf@topspin.com> <6.1.2.0.2.20040927085419.01f423c8@esmail.cup.hp.com> <52d607fzpl.fsf@topspin.com> <6.1.2.0.2.20040927135932.01e30340@esmail.cup.hp.com> Message-ID: <52pt467yy7.fsf@topspin.com> Michael> It was not specified due to the lack of standards for Michael> higher-level service management which partitioning is Michael> classified. Given there is an OpenSM effort in flight Michael> and the Service ID spec is already in existence, it isn't Michael> that tough to acquire an IP service ID (or any other Michael> protocol that one wants to support) and implement a Michael> solution along the lines that I describe above. This Michael> would lead to a more dynamic environment while reducing Michael> the impact to the administrator. This scheme might indeed be reasonable. However, given the absence of an IBTA spec or IETF draft, I don't see how we can rely on it in our IPoIB driver right now. 
Thanks, Roland From roland at topspin.com Tue Sep 28 16:33:28 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Sep 2004 16:33:28 -0700 Subject: [openib-general] Re: IB design docs/links In-Reply-To: (Krishna Kumar's message of "Tue, 28 Sep 2004 15:42:04 -0700 (PDT)") References: Message-ID: <52lleu7yuv.fsf@topspin.com> Krishna> Is there any written document with details of the design Krishna> of MAD and other layers ? Or are these discussions Krishna> completely online (I admit I missed them in the large Krishna> volume of mails). If there are documents, I request that Krishna> they be included in the toplevel directory of the Krishna> submission tree (similar to Documentation directory) so Krishna> that people like me who are new to this field can look at Krishna> it to understand the code (and it will also be useful for Krishna> anyone wanting to contribute in a particular area). We are following the usual Linux development style: reach consensus on the mailing list and then implement the code. As such there is no documentation beyond the header files and the mailing list archive. Krishna> Roland, what is the roland-merge directory exactly ? Is Krishna> it the topspin stack that you are modifying to finally Krishna> add to the openib-candidate directory ? It is the tree where I work on my code. Some of the code originally came from the Topspin stack, while other code (eg the mthca driver) has been written from scratch for OpenIB. The ultimate disposition of this code is still to be determined. 
Thanks, Roland From tduffy at sun.com Tue Sep 28 16:46:20 2004 From: tduffy at sun.com (Tom Duffy) Date: Tue, 28 Sep 2004 16:46:20 -0700 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <52pt467yy7.fsf@topspin.com> References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> <523c16l3wu.fsf@topspin.com> <6.1.2.0.2.20040927064203.01e60488@esmail.cup.hp.com> <52r7onhicd.fsf@topspin.com> <6.1.2.0.2.20040927081008.01da2090@esmail.cup.hp.com> <52oejrg1ux.fsf@topspin.com> <6.1.2.0.2.20040927085419.01f423c8@esmail.cup.hp.com> <52d607fzpl.fsf@topspin.com> <6.1.2.0.2.20040927135932.01e30340@esmail.cup.hp.com> <52pt467yy7.fsf@topspin.com> Message-ID: <1096415180.25336.25.camel@duffman> On Tue, 2004-09-28 at 16:31 -0700, Roland Dreier wrote: > This scheme might indeed be reasonable. However, given the absence of > an IBTA spec or IETF draft, I don't see how we can rely on it in our > IPoIB driver right now. Embrace and Extend. -tduffy From libor at topspin.com Tue Sep 28 17:20:08 2004 From: libor at topspin.com (Libor Michalek) Date: Tue, 28 Sep 2004 17:20:08 -0700 Subject: [openib-general] get_user_pages() vs. sys_mlock() and 2.6 kernel In-Reply-To: <52u0ti7z1q.fsf@topspin.com>; from roland@topspin.com on Tue, Sep 28, 2004 at 04:29:21PM -0700 References: <4134FD31.4060301@ammasso.com> <20040903160745.A6309@topspin.com> <4159E70E.5040705@ammasso.com> <52u0ti7z1q.fsf@topspin.com> Message-ID: <20040928172008.A14183@topspin.com> On Tue, Sep 28, 2004 at 04:29:21PM -0700, Roland Dreier wrote: > Timur> I don't understand how this bug can continue to exist after > Timur> all this time. 
get_user_pages() is supposed to lock the > Timur> memory, because drivers use it for DMA'ing directly into > Timur> user memory. > > I think you will get a resolution much faster if you > > A) post your test code so people can reproduce the bug I'd be interested in seeing the test code as well. > B) post to the linux-kernel mailing list rather than openib-general However, I'd be interested if this list was CC'd, since it is very applicable to all the zero-copy userspace code. -Libor From tduffy at sun.com Tue Sep 28 17:27:43 2004 From: tduffy at sun.com (Tom Duffy) Date: Tue, 28 Sep 2004 17:27:43 -0700 Subject: [openib-general] get_user_pages() vs. sys_mlock() and 2.6 kernel In-Reply-To: <20040928172008.A14183@topspin.com> References: <4134FD31.4060301@ammasso.com> <20040903160745.A6309@topspin.com> <4159E70E.5040705@ammasso.com> <52u0ti7z1q.fsf@topspin.com> <20040928172008.A14183@topspin.com> Message-ID: <1096417663.25336.32.camel@duffman> On Tue, 2004-09-28 at 17:20 -0700, Libor Michalek wrote: > > B) post to the linux-kernel mailing list rather than openib-general > > However, I'd be interested if this list was CC'd, since it is very > applicable to all the zero-copy userspace code. Too late. He cross posted lkml, linux-mm, and kernelnewbies: > I was hoping that this bug would be fixed in the 2.6 kernels, but > apparently it hasn't been. > > Function get_user_pages() is supposed to lock user memory. However, > under extreme memory constraints, the kernel will swap out the > "locked" > memory. > > I have a test app which does this: > > 1) Calls our driver, which issues a get_user_pages() call for one > page. > 2) Calls our driver again to get the physical address of that page > (the > driver uses pgd/pmd/pte_offset). > 3) Tries allocate 1GB of memory (this system has 1GB of physical RAM). > 4) Tries to get the physical address again. > > In step 4, the physical address is usually zero, which means either > pgd_offset or pmd_offset failed. 
This indicates the page was swapped > out. > > I don't understand how this bug can continue to exist after all this > time. get_user_pages() is supposed to lock the memory, because > drivers > use it for DMA'ing directly into user memory. So far, Christoph Hellwig said: > get_user_pages locks the page in memory. It doesn't do anything about > ptes. And Dave Hansen responded: > You probably want mlock(2) to keep the kernel from messing with the > ptes > at all. But, you should probably really be thinking about why you're > accessing the page tables at all. I count *ONE* instance in drivers/ > where page tables are accessed directly. Not very helpful... -tduffy From halr at voltaire.com Tue Sep 28 17:53:15 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 28 Sep 2004 20:53:15 -0400 Subject: [openib-general] Re: [PATCH] Fix MAD completion handling In-Reply-To: <52brfq9i66.fsf@topspin.com> References: <52brfq9i66.fsf@topspin.com> Message-ID: <1096419194.1878.0.camel@localhost.localdomain> On Tue, 2004-09-28 at 17:50, Roland Dreier wrote: > While looking over Sean's changes, I noticed what look like a few bugs > in the mad thread usage. It didn't seem like there was any way for > the MAD thread to stop, and I think there are a few race conditions > that could lead to lost wakeups. This patch tries to fix both of > these problems. Thanks. Applied. 
-- Hal From halr at voltaire.com Tue Sep 28 18:16:24 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 28 Sep 2004 21:16:24 -0400 Subject: [openib-general] [PATCH] ib_mad: Add Linux 2.6 style Makefile Message-ID: <1096420583.1872.3.camel@localhost.localdomain> ib_mad: Add Linux 2.6 style Makefile Index: access/Makefile =================================================================== --- access/Makefile (revision 0) +++ access/Makefile (revision 0) @@ -0,0 +1,8 @@ +EXTRA_CFLAGS += -I. -Idrivers/infiniband/include + +obj-$(CONFIG_INFINIBAND_ACCESS_LAYER) += \ + ib_al.o + +ib_al-objs := \ + ib_mad.o + From tduffy at sun.com Tue Sep 28 18:16:46 2004 From: tduffy at sun.com (Tom Duffy) Date: Tue, 28 Sep 2004 18:16:46 -0700 Subject: [openib-general] static LID computation with TS_HOST_DRIVER Message-ID: <1096420606.25336.59.camel@duffman> [KERNEL_IB][ib_mad_static_compute_base][/build1/tduffy/openib-work/linux-2.6.9-rc2-openib/drivers/infiniband/core/mad_static.c:94]Couldn't find a suitable network device; setting lid_base to 1 I am trying to track down why this is happening, but it seems to pop up when the ib_mthca driver is loaded at boot time on my sparc64 box (haven't tested other architectures). I guess you need to have your ethernet driver loaded in and up with an IP address /before/ you load ib_mthca? What if this is not the case? Anyways, I thought the LID would be assigned by the SM... -tduffy 
From halr at voltaire.com Tue Sep 28 18:21:04 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 28 Sep 2004 21:21:04 -0400 Subject: [openib-general] [PATCH] Add access into build (Roland's branch) Message-ID: <1096420864.1872.6.camel@localhost.localdomain> Add access into build (Roland's branch) Index: src/linux-kernel/infiniband/Kconfig =================================================================== --- src/linux-kernel/infiniband/Kconfig (revision 904) +++ src/linux-kernel/infiniband/Kconfig (working copy) @@ -8,6 +8,15 @@ any protocols you wish to use as well as drivers for your InfiniBand hardware. +config INFINIBAND_ACCESS_LAYER + tristate "InfiniBand Access Layer" + depends on INFINIBAND + default n + ---help--- + InfiniBand Access Layer (AL) includes SMI (Subnet + Management Interface) and GSI (General Services + Interface). + config INFINIBAND_USER_CM tristate "Userspace CM" depends on INFINIBAND && INFINIBAND_MELLANOX_HCA Index: src/linux-kernel/infiniband/Makefile =================================================================== --- src/linux-kernel/infiniband/Makefile (revision 904) +++ src/linux-kernel/infiniband/Makefile (working copy) @@ -1 +1 @@ -obj-$(CONFIG_INFINIBAND) += legacy/ core/ ulp/ hw/ +obj-$(CONFIG_INFINIBAND) += legacy/ core/ access/ ulp/ hw/ From halr at voltaire.com Tue Sep 28 18:32:32 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 28 Sep 2004 21:32:32 -0400 Subject: [openib-general] Re: [PATCH] Fix MAD completion handling In-Reply-To: <1096412658.25336.20.camel@duffman> References: <52brfq9i66.fsf@topspin.com> <1096409366.1863.193.camel@localhost.localdomain> <523c129gup.fsf@topspin.com> <1096410820.1869.200.camel@localhost.localdomain> <1096412658.25336.20.camel@duffman> Message-ID: <1096421551.1878.14.camel@localhost.localdomain> On Tue, 2004-09-28 at 19:04, Tom Duffy wrote: > On Tue, 
2004-09-28 at 18:33 -0400, Hal Rosenstock wrote: > > The conditional is to allow your build or a build using ib_mad as the > > access layer. Different parts of core are pulled in right now. It is > > transitional and would go away once the cutover occurs. > > Hal, maybe it would be useful if you posted instructions on how you > setup your build so we could reproduce it. I would like to start > playing around with the new MAD layer, etc. The MAD layer is not quite ready to be used by others as yet. Send needs testing and SMI needs to be tested. At that point, it will be announced to the list. I'm hopeful that it won't be much longer. There will be a readme and a script for setup to link this into your Linux source tree. OK ? -- Hal From halr at voltaire.com Tue Sep 28 18:39:44 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 28 Sep 2004 21:39:44 -0400 Subject: [Fwd: [openib-general] [PATCH] Change SMI/GSI QP types to match QP index values] Message-ID: <1096421983.1872.17.camel@localhost.localdomain> Hi Roland, Any reason this patch can't be applied ? Thanks. 
-- Hal -----Forwarded Message----- From: Hal Rosenstock To: openib-general at openib.org Subject: [openib-general] [PATCH] Change SMI/GSI QP types to match QP index values Date: Fri, 24 Sep 2004 16:51:43 -0400 Change SMI/GSI QP types to match QP index values Roland's branch Index: ib_verbs.h =================================================================== --- ib_verbs.h (revision 880) +++ ib_verbs.h (working copy) @@ -346,11 +346,11 @@ }; enum ib_qp_type { + IB_QPT_SMI, /* SMI type = QP index 0 */ + IB_QPT_GSI, /* GSI type = QP index 1 */ IB_QPT_RC, IB_QPT_UC, IB_QPT_UD, - IB_QPT_SMI, - IB_QPT_GSI, IB_QPT_RAW_IPV6, IB_QPT_RAW_ETY }; From roland at topspin.com Tue Sep 28 18:56:55 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Sep 2004 18:56:55 -0700 Subject: [openib-general] static LID computation with TS_HOST_DRIVER In-Reply-To: <1096420606.25336.59.camel@duffman> (Tom Duffy's message of "Tue, 28 Sep 2004 18:16:46 -0700") References: <1096420606.25336.59.camel@duffman> Message-ID: <528yat96s8.fsf@topspin.com> Tom> [KERNEL_IB][ib_mad_static_compute_base][/build1/tduffy/openib-work/linux-2.6.9-rc2-openib/drivers/infiniband/core/mad_static.c:94]Couldn't Tom> find a suitable network device; setting lid_base to 1 Tom> I am trying to track down why this is happening, but it seems Tom> to pop up when the ib_mthca driver is loaded at boot time on Tom> my sparc64 box (haven't tested other architectures). Tom> I guess you need to have your ethernet driver loaded in and Tom> up with an IP address /before/ you load ib_mthca? What if Tom> this is not the case? Yeah, if it can't find a configured interface it bails out and picks 1. It's benign and it will go away when we switch MAD layers. 
In fact I'm going to disable it on my tree now... Tom> Anyways, I thought the LID would be assigned by the SM... Yeah, this is a hack to try and pick a LID that the SM won't change. Some applications create connections before the SM has discovered the node and don't want them broken when the SM does come around. - R. From roland at topspin.com Tue Sep 28 18:59:09 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Sep 2004 18:59:09 -0700 Subject: [openib-general] [PATCH] Add access into build (Roland's branch) In-Reply-To: <1096420864.1872.6.camel@localhost.localdomain> (Hal Rosenstock's message of "Tue, 28 Sep 2004 21:21:04 -0400") References: <1096420864.1872.6.camel@localhost.localdomain> Message-ID: <524qlh96oi.fsf@topspin.com> -obj-$(CONFIG_INFINIBAND) += legacy/ core/ ulp/ hw/ +obj-$(CONFIG_INFINIBAND) += legacy/ core/ access/ ulp/ hw/ This doesn't really make sense to me. Why do we need a core/ and an access/ directory? I prefer core/, since it matches drivers/usb/core, net/core and sound/core already in the kernel tree, and "access layer" is a bit of jargon that no one not familiar with the history of IB stacks is going to understand. However if we prefer access/ then I'll move everything from core/ there. - R. From roland at topspin.com Tue Sep 28 19:00:04 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Sep 2004 19:00:04 -0700 Subject: [openib-general] [PATCH] Add access into build (Roland's branch) In-Reply-To: <1096420864.1872.6.camel@localhost.localdomain> (Hal Rosenstock's message of "Tue, 28 Sep 2004 21:21:04 -0400") References: <1096420864.1872.6.camel@localhost.localdomain> Message-ID: <52zn397s2j.fsf@topspin.com> Oh yeah... is there any reason someone would want CONFIG_INFINIBAND but not CONFIG_INFINIBAND_ACCESS_LAYER? - R. 
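Editorial aside on the "Change SMI/GSI QP types to match QP index values" patch forwarded above: the change relies on C numbering enumerators from 0, so listing the SMI and GSI types first makes them equal to QP index 0 and 1. The following is only a minimal sketch of that idea, with explicit values and a compile-time check as one possible way to make the dependency on ordering harder to break by accident. The enum `qp_type_sketch`, the `QPT_*` names, and the helpers `qp_index_for_type` and `qp_type_self_check` are hypothetical stand-ins for illustration; they are not taken from ib_verbs.h.

```c
#include <assert.h>

/* Hypothetical stand-in for enum ib_qp_type; names are illustrative only. */
enum qp_type_sketch {
	QPT_SMI = 0,	/* explicit value: SMI type == QP index 0 */
	QPT_GSI = 1,	/* explicit value: GSI type == QP index 1 */
	QPT_RC,		/* remaining types continue from 2 */
	QPT_UC,
	QPT_UD
};

/*
 * Compile-time checks: the array size becomes -1 (a compile error)
 * if someone later reorders the enum and breaks the assumption.
 */
typedef char qpt_smi_is_qp0[(QPT_SMI == 0) ? 1 : -1];
typedef char qpt_gsi_is_qp1[(QPT_GSI == 1) ? 1 : -1];

/* Return the special QP index for SMI/GSI, or -1 for any other type. */
int qp_index_for_type(enum qp_type_sketch type)
{
	switch (type) {
	case QPT_SMI:
	case QPT_GSI:
		return (int)type;
	default:
		return -1;
	}
}

/* Runtime sanity check, e.g. callable from a test or an init path. */
void qp_type_self_check(void)
{
	assert(qp_index_for_type(QPT_SMI) == 0);
	assert(qp_index_for_type(QPT_GSI) == 1);
	assert(qp_index_for_type(QPT_UD) == -1);
}
```

With the values spelled out and checked, code that uses the QP type directly as a QP index fails to build, rather than misbehaving at run time, if the enum is reordered later.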
From roland at topspin.com Tue Sep 28 19:01:19 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Sep 2004 19:01:19 -0700 Subject: [Fwd: [openib-general] [PATCH] Change SMI/GSI QP types to match QP index values] In-Reply-To: <1096421983.1872.17.camel@localhost.localdomain> (Hal Rosenstock's message of "Tue, 28 Sep 2004 21:39:44 -0400") References: <1096421983.1872.17.camel@localhost.localdomain> Message-ID: <52vfdx7s0g.fsf@topspin.com> Hal> Hi Roland, Any reason this patch can't be applied ? Other than the fact that the whitespace got mangled, not really :) I don't like it (as I said before, it seems to be making a fragile change for minimal gain) but I'll apply it. - R. From Tom.Duffy at Sun.COM Tue Sep 28 19:13:07 2004 From: Tom.Duffy at Sun.COM (Tom Duffy) Date: Tue, 28 Sep 2004 19:13:07 -0700 Subject: [openib-general] Re: [PATCH] Fix MAD completion handling In-Reply-To: <1096421551.1878.14.camel@localhost.localdomain> References: <52brfq9i66.fsf@topspin.com> <1096409366.1863.193.camel@localhost.localdomain> <523c129gup.fsf@topspin.com> <1096410820.1869.200.camel@localhost.localdomain> <1096412658.25336.20.camel@duffman> <1096421551.1878.14.camel@localhost.localdomain> Message-ID: <415A1A33.4030107@sun.com> Hal Rosenstock wrote: > The MAD layer is not quite ready to be used by others as yet. Send needs > testing and SMI needs to be tested. At that point, it will be announced > to the list. I'm hopeful that it won't be much longer. > There will be a readme and a script for setup to link this into your > Linux source tree. OK ? Sounds good. 
-tduffy From tduffy at sun.com Tue Sep 28 19:32:05 2004 From: tduffy at sun.com (Tom Duffy) Date: Tue, 28 Sep 2004 19:32:05 -0700 Subject: [openib-general] [PATCH] Kill TS_HOST_DRIVER In-Reply-To: <528yat96s8.fsf@topspin.com> References: <1096420606.25336.59.camel@duffman> <528yat96s8.fsf@topspin.com> Message-ID: <20040929023205.GA5770@duffman.sfbay.sun.com> begin Roland Dreier's message dated Tue, Sep 28, 2004 at 06:56:55PM -0700: > In fact I'm going to disable it on my tree now... Ok, that pretty much does it for TS_HOST_DRIVER. Index: drivers/infiniband/ulp/dapl/Makefile =================================================================== --- drivers/infiniband/ulp/dapl/Makefile (revision 908) +++ drivers/infiniband/ulp/dapl/Makefile (working copy) @@ -3,7 +3,7 @@ -Idrivers/infiniband/ulp/ipoib \ -Idrivers/infiniband/hw/mellanox-hca/include \ -D__LINUX__ \ - -DTS_HOST_DRIVER -D_NO_DATA_PATH_TRACE + -D_NO_DATA_PATH_TRACE obj-$(CONFIG_INFINIBAND_UDAPL_HELPER) += ib_udapl.o Index: drivers/infiniband/ulp/ipoib/Makefile =================================================================== --- drivers/infiniband/ulp/ipoib/Makefile (revision 908) +++ drivers/infiniband/ulp/ipoib/Makefile (working copy) @@ -1,6 +1,6 @@ EXTRA_CFLAGS += \ -Idrivers/infiniband/include \ - -DTS_HOST_DRIVER -D_NO_DATA_PATH_TRACE + -D_NO_DATA_PATH_TRACE obj-$(CONFIG_INFINIBAND_IPOIB) += ib_ipoib.o ib_ip2pr.o Index: drivers/infiniband/ulp/srp/Makefile =================================================================== --- drivers/infiniband/ulp/srp/Makefile (revision 908) +++ drivers/infiniband/ulp/srp/Makefile (working copy) @@ -2,7 +2,7 @@ -Idrivers/infiniband/include \ -Idrivers/scsi \ -DRHAS_SCSI_API -DSUPPORT_SCSI_REV2 \ - -DTS_HOST_DRIVER -D_NO_DATA_PATH_TRACE + -D_NO_DATA_PATH_TRACE obj-$(CONFIG_INFINIBAND_SRP) += ib_srp.o Index: drivers/infiniband/ulp/sdp/Makefile =================================================================== --- drivers/infiniband/ulp/sdp/Makefile (revision 908) +++ 
drivers/infiniband/ulp/sdp/Makefile (working copy) @@ -1,7 +1,7 @@ EXTRA_CFLAGS += \ -Idrivers/infiniband/include \ -Idrivers/infiniband/ulp/ipoib \ - -DTS_HOST_DRIVER -DTS_USE_CM_API_V3 -D_NO_DATA_PATH_TRACE + -DTS_USE_CM_API_V3 -D_NO_DATA_PATH_TRACE obj-$(CONFIG_INFINIBAND_SDP) += ib_sdp.o Index: drivers/infiniband/legacy/poll_main.c =================================================================== --- drivers/infiniband/legacy/poll_main.c (revision 908) +++ drivers/infiniband/legacy/poll_main.c (working copy) @@ -45,12 +45,7 @@ MODULE_PARM(sleep, "i"); MODULE_PARM_DESC(sleep, "If non-zero, sleep one jiffy after each iteration"); -/* sleep == 0 isn't very nice on a host system :) */ -#ifdef TS_HOST_DRIVER static int sleep = 1; -#else -static int sleep = 0; -#endif struct tTS_KERNEL_POLL_HANDLE_STRUCT { tTS_KERNEL_POLL_FUNCTION function; Index: drivers/infiniband/legacy/Makefile =================================================================== --- drivers/infiniband/legacy/Makefile (revision 908) +++ drivers/infiniband/legacy/Makefile (working copy) @@ -1,6 +1,6 @@ EXTRA_CFLAGS += \ -Idrivers/infiniband/include \ - -DTS_HOST_DRIVER -D_NO_DATA_PATH_TRACE + -D_NO_DATA_PATH_TRACE obj-$(CONFIG_INFINIBAND) += ib_services.o ib_poll.o Index: drivers/infiniband/core/Makefile =================================================================== --- drivers/infiniband/core/Makefile (revision 908) +++ drivers/infiniband/core/Makefile (working copy) @@ -1,7 +1,7 @@ EXTRA_CFLAGS += \ -Idrivers/infiniband/include \ -Idrivers/infiniband/ulp/ipoib \ - -DTS_HOST_DRIVER -D_NO_DATA_PATH_TRACE + -D_NO_DATA_PATH_TRACE CFLAGS_useraccess_cm.o := \ -D__LINUX__ \ 
From krause at cup.hp.com Tue Sep 28 19:40:13 2004 From: krause at cup.hp.com (Michael Krause) Date: Tue, 28 Sep 2004 19:40:13 -0700 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <52pt467yy7.fsf@topspin.com> References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> <523c16l3wu.fsf@topspin.com> <6.1.2.0.2.20040927064203.01e60488@esmail.cup.hp.com> <52r7onhicd.fsf@topspin.com> <6.1.2.0.2.20040927081008.01da2090@esmail.cup.hp.com> <52oejrg1ux.fsf@topspin.com> <6.1.2.0.2.20040927085419.01f423c8@esmail.cup.hp.com> <52d607fzpl.fsf@topspin.com> <6.1.2.0.2.20040927135932.01e30340@esmail.cup.hp.com> <52pt467yy7.fsf@topspin.com> Message-ID: <6.1.2.0.2.20040928193055.0358e260@esmail.cup.hp.com> At 04:31 PM 9/28/2004, Roland Dreier wrote: > Michael> It was not specified due to the lack of standards for > Michael> higher-level service management which partitioning is > Michael> classified. Given there is an OpenSM effort in flight > Michael> and the Service ID spec is already in existence, it isn't > Michael> that tough to acquire an IP service ID (or any other > Michael> protocol that one wants to support) and implement a > Michael> solution along the lines that I describe above. This > Michael> would lead to a more dynamic environment while reducing > Michael> the impact to the administrator. > >This scheme might indeed be reasonable. However, given the absence of >an IBTA spec or IETF draft, I don't see how we can rely on it in our >IPoIB driver right now. The IBTA defined the specifications to establish standard wire protocols to discover services without having to track all potential services. 
The IETF does not address how P_Keys are managed only that the IP over IB component must follow a set of semantics / operations to enable IP communication; all IB-specific management issues are outside the scope of the drafts. As another person put it - embrace and extend. The approach being advocated is something that will be interoperable and provides a solution that can be easily incorporated into all IP over IB implementations. It also clarifies the ambiguous IB - admin management interactions for IP communications. Mike From roland at topspin.com Tue Sep 28 20:00:18 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Sep 2004 20:00:18 -0700 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <6.1.2.0.2.20040928193055.0358e260@esmail.cup.hp.com> (Michael Krause's message of "Tue, 28 Sep 2004 19:40:13 -0700") References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> <523c16l3wu.fsf@topspin.com> <6.1.2.0.2.20040927064203.01e60488@esmail.cup.hp.com> <52r7onhicd.fsf@topspin.com> <6.1.2.0.2.20040927081008.01da2090@esmail.cup.hp.com> <52oejrg1ux.fsf@topspin.com> <6.1.2.0.2.20040927085419.01f423c8@esmail.cup.hp.com> <52d607fzpl.fsf@topspin.com> <6.1.2.0.2.20040927135932.01e30340@esmail.cup.hp.com> <52pt467yy7.fsf@topspin.com> <6.1.2.0.2.20040928193055.0358e260@esmail.cup.hp.com> Message-ID: <52r7ol7pa5.fsf@topspin.com> Michael> The IBTA defined the specifications to establish standard Michael> wire protocols to discover services without having to Michael> track all potential services. The IETF does not address Michael> how P_Keys are managed only that the IP over IB component Michael> must follow a set of semantics / operations to enable IP Michael> communication; all IB-specific management issues are Michael> outside the scope of the drafts. 
Michael> As another person put it - embrace and extend. The Michael> approach being advocated is something that will be Michael> interoperable and provides a solution that can be easily Michael> incorporated into all IP over IB implementations. It Michael> also clarifies the ambiguous IB - admin management Michael> interactions for IP communications. Until an IPoIB service is mandated somewhere, IPoIB implementations need to work even on subnets where the service is not registered. Which means a bit of extra complexity to handle both the case where the service doesn't exist because the subnet has not implemented the management scheme and the case where the service doesn't exist because the administrator doesn't want to configure IPoIB for a given partition. In any case -- my plate is quite full at the moment. Any code to implement better IPoIB management will be welcomed but I won't be able to write it for quite some time. - Roland From halr at voltaire.com Tue Sep 28 20:10:13 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 28 Sep 2004 23:10:13 -0400 Subject: [openib-general] [PATCH] Add access into build (Roland's branch) In-Reply-To: <52zn397s2j.fsf@topspin.com> References: <1096420864.1872.6.camel@localhost.localdomain> <52zn397s2j.fsf@topspin.com> Message-ID: <1096427413.1872.19.camel@localhost.localdomain> On Tue, 2004-09-28 at 22:00, Roland Dreier wrote: > Oh yeah... is there any reason someone would want CONFIG_INFINIBAND > but not CONFIG_INFINIBAND_ACCESS_LAYER? Only temporarily. 
From halr at voltaire.com Tue Sep 28 20:13:40 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 28 Sep 2004 23:13:40 -0400 Subject: [openib-general] [PATCH] Add access into build (Roland's branch) In-Reply-To: <524qlh96oi.fsf@topspin.com> References: <1096420864.1872.6.camel@localhost.localdomain> <524qlh96oi.fsf@topspin.com> Message-ID: <1096427620.1872.24.camel@localhost.localdomain> On Tue, 2004-09-28 at 21:59, Roland Dreier wrote: > -obj-$(CONFIG_INFINIBAND) += legacy/ core/ ulp/ hw/ > +obj-$(CONFIG_INFINIBAND) += legacy/ core/ access/ ulp/ hw/ > > This doesn't really make sense to me. Why do we need a core/ and an > access/ directory? I prefer core/, since it matches drivers/usb/core, > net/core and sound/core already in the kernel tree, and "access layer" > is a bit of jargon that no one not familiar with the history of IB > stacks is going to understand. However if we prefer access/ then I'll > move everything from core/ there. I guess this is temporary too. I don't really care if it is core or access. I just chose access at the time as it was unclear how things would evolve. I'm OK with moving ib_mad (and ib_smi) into core when the time is right. -- Hal From roland at topspin.com Tue Sep 28 20:15:36 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Sep 2004 20:15:36 -0700 Subject: [openib-general] [PATCH] Kill TS_HOST_DRIVER In-Reply-To: <20040929023205.GA5770@duffman.sfbay.sun.com> (Tom Duffy's message of "Tue, 28 Sep 2004 19:32:05 -0700") References: <1096420606.25336.59.camel@duffman> <528yat96s8.fsf@topspin.com> <20040929023205.GA5770@duffman.sfbay.sun.com> Message-ID: <52fz517okn.fsf@topspin.com> cool, thanks... applied. From halr at voltaire.com Tue Sep 28 20:36:55 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 28 Sep 2004 23:36:55 -0400 Subject: [Fwd: Re: [openib-general] [PATCH] Add access into build (Roland's branch)] Message-ID: <1096429015.1878.32.camel@localhost.localdomain> Just one further thought on this. 
At the 9/9 OpenIB SWG Face to Face meeting, the following was discussed and agreed: Once there is a working MAD layer, there will be a new official gen2 branch with just phase 1 deliverables. Then stable and development branches. Development would add in CM, etc. Does this mean roland_branch is intended to be this branch ? -- Hal -----Forwarded Message----- From: Hal Rosenstock To: Roland Dreier Cc: openib-general at openib.org Subject: Re: [openib-general] [PATCH] Add access into build (Roland's branch) Date: Tue, 28 Sep 2004 23:13:40 -0400 On Tue, 2004-09-28 at 21:59, Roland Dreier wrote: > -obj-$(CONFIG_INFINIBAND) += legacy/ core/ ulp/ hw/ > +obj-$(CONFIG_INFINIBAND) += legacy/ core/ access/ ulp/ hw/ > > This doesn't really make sense to me. Why do we need a core/ and an > access/ directory? I prefer core/, since it matches drivers/usb/core, > net/core and sound/core already in the kernel tree, and "access layer" > is a bit of jargon that no one not familiar with the history of IB > stacks is going to understand. However if we prefer access/ then I'll > move everything from core/ there. I guess this is temporary too. I don't really care if it is core or access. I just chose access at the time as it was unclear how things would evolve. I'm OK with moving ib_mad (and ib_smi) into core when the time is right. 
-- Hal From Tom.Duffy at Sun.COM Tue Sep 28 21:02:14 2004 From: Tom.Duffy at Sun.COM (Tom Duffy) Date: Tue, 28 Sep 2004 21:02:14 -0700 Subject: [Fwd: Re: [openib-general] [PATCH] Add access into build (Roland's branch)] In-Reply-To: <1096429015.1878.32.camel@localhost.localdomain> References: <1096429015.1878.32.camel@localhost.localdomain> Message-ID: <415A33C6.5070109@sun.com> Hal Rosenstock wrote: > Does this mean roland_branch is intended to be this branch ? I don't think so. I thought we agreed we would move up mthca, ipoib, and new mad layer to gen2/ and that would be what we go forward to submit to Linus/Andrew. -tduffy From roland at topspin.com Tue Sep 28 21:32:38 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Sep 2004 21:32:38 -0700 Subject: [Fwd: Re: [openib-general] [PATCH] Add access into build (Roland's branch)] In-Reply-To: <1096429015.1878.32.camel@localhost.localdomain> (Hal Rosenstock's message of "Tue, 28 Sep 2004 23:36:55 -0400") References: <1096429015.1878.32.camel@localhost.localdomain> Message-ID: <52brfp7l09.fsf@topspin.com> Hal> Once there is a working MAD layer, there will be a new Hal> official gen2 branch with just phase 1 deliverables. Then Hal> stable and development branches. Development would add in CM, Hal> etc. Hal> Does this mean roland_branch is intended to be this branch ? I don't think so. I always intended roland-merge to be a branch where I could share my work. In fact it doesn't make much sense to me to add in hooks for compiling the new MAD code etc. to my branch. If you want to have a subversion tree for testing now I would suggest making a new branch (copy roland-merge to some new directory) and hack on that as much as you want -- delete the old MAD code, etc. 
- Roland From halr at voltaire.com Wed Sep 29 02:28:17 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 29 Sep 2004 05:28:17 -0400 Subject: [Fwd: Re: [openib-general] [PATCH] Add access into build (Roland's branch)] In-Reply-To: <52brfp7l09.fsf@topspin.com> References: <1096429015.1878.32.camel@localhost.localdomain> <52brfp7l09.fsf@topspin.com> Message-ID: <1096450096.1872.82.camel@localhost.localdomain> On Wed, 2004-09-29 at 00:32, Roland Dreier wrote: > Hal> Once there is a working MAD layer, there will be a new > Hal> official gen2 branch with just phase 1 deliverables. Then > Hal> stable and development branches. Development would add in CM, > Hal> etc. > > Hal> Does this mean roland_branch is intended to be this branch ? > > I don't think so. I always intended roland-merge to be a branch where > I could share my work. > > In fact it doesn't make much sense to me to add in hooks for compiling > the new MAD code etc. to my branch. The only reason I requested this was for the transition period, to minimize the duplication of changes, as I am presuming you will want to continue your branch. Maybe that is an incorrect assumption (or presumption). > If you want to have a subversion > tree for testing now I would suggest making a new branch (copy > roland-merge to some new directory) and hack on that as much as you > want -- delete the old MAD code, etc. I'm not averse to a new tree (in fact, openib-candidate would work as this branch if it gets filled in) but was trying to minimize the work until the phase 1 work was far enough along. I suspect that will take a little time. Who would be responsible for duplicating any mthca changes (and changes to other layers like IPoIB) ?
-- Hal From halr at voltaire.com Wed Sep 29 02:34:30 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 29 Sep 2004 05:34:30 -0400 Subject: [openib-general] [PATCH] Add SMI into access layer build Message-ID: <1096450470.1872.84.camel@localhost.localdomain> Add SMI into access layer build Index: Makefile =================================================================== --- Makefile (revision 911) +++ Makefile (working copy) @@ -4,5 +4,6 @@ ib_al.o ib_al-objs := \ - ib_mad.o + ib_mad.o \ + ib_smi.o From mst at mellanox.co.il Wed Sep 29 02:46:29 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 29 Sep 2004 11:46:29 +0200 Subject: [openib-general] get_user_pages() vs. sys_mlock() and 2.6 kernel In-Reply-To: <4159E70E.5040705@ammasso.com> References: <4134FD31.4060301@ammasso.com> <20040903160745.A6309@topspin.com> <4159E70E.5040705@ammasso.com> Message-ID: <20040929094629.GC15350@mellanox.co.il> Hello! Quoting r. Timur Tabi (timur.tabi at ammasso.com) "Re: [openib-general] get_user_pages() vs. sys_mlock() and 2.6 kernel": > Libor Michalek wrote: > > > I've seen the problem in test cases, so it definetly can happen in 2.4. > >Looking at the 2.6 code the problem appears to be fixed, but I have not > >had a chance to run tests to verify it. Good place to take look if you > >are interested is in launder_page() and try_to_unmap() in the kernel. > > ... > > I don't understand how this bug can continue to exist after all this > time. get_user_pages() is supposed to lock the memory, because drivers > use it for DMA'ing directly into user memory. > I think the reason is that linux currently supports zero-copy only for a very limited number of situations, not for a generic user-given memory buffer, so the problem does not occur. I think its also mostly for send (hardware reads the buffer), not receive. 
MST From halr at voltaire.com Wed Sep 29 03:11:31 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 29 Sep 2004 06:11:31 -0400 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <52is9ycsj2.fsf@topspin.com> References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> <523c16l3wu.fsf@topspin.com> <6.1.2.0.2.20040927064203.01e60488@esmail.cup.hp.com> <52r7onhicd.fsf@topspin.com> <6.1.2.0.2.20040927081008.01da2090@esmail.cup.hp.com> <52oejrg1ux.fsf@topspin.com> <6.1.2.0.2.20040927085419.01f423c8@esmail.cup.hp.com> <52d607fzpl.fsf@topspin.com> <1096304811.11222.94.camel@localhost.localdomain> <52wtyfeit3.fsf@topspin.com> <1096306647.1830.114.camel@localhost.localdomain> <52oejred6r.fsf@topspin.com> <1096379771.3479.60.camel@localhost.localdomain> <52is9ycsj2.fsf@topspin.com> Message-ID: <1096452691.7112.104.camel@localhost.localdomain> On Tue, 2004-09-28 at 11:38, Roland Dreier wrote: > Hal> At IBA 1.2, C15-0.1.22 is obsolete and has been replaced by > Hal> C15-0.2.1. C15-0.2.1: When a requester node sends a trusted > Hal> request to SA, the requested data shall be returned. When a > Hal> requester node sends a non-trusted request for data to SA > Hal> that would provide information about a subject node, the SA > Hal> shall return only data providing information about subject > Hal> nodes for which the requester shares a P_Key, with exceptions > Hal> noted below in C15-0.1.23. > > Hmm, this seems like the only difference from C15-0.1.22 is that it > talks about trusted requests. C15-0.1.23 (below) still says that > MCMemberRecords don't provide information about any subject nodes, so > I guess the SM should not worry about P_Keys when returning the table > of multicast groups. You are correct about the IBA 1.1/1.2 compliance difference (and the trusted request part is not relevant to this discussion). 
C15-0.1.22 does, however, talk about P_Key sharing, which is relevant to this (and consistent with 15.4.1 below). The intent is that the SA only return information relative to the partitions that the port was part of. The below is from IBA 1.1 (and 1.2 as well): 15.4.1 Restrictions on Access There are two types of access restrictions involved in SA: Authenticating the requester of information, and restricting the data that the requester is allowed to receive. These are discussed below. The SA access restrictions described here are based on partition membership, and are intended to implement this rule: If access to data is allowed by partition membership, that access is granted; but if it is disallowed, the requester should remain unaware of the existence of that information and of the network elements containing that information. Additionally, in no event is authentication information made available to untrusted requests. That a node's PortInfo:M_Key and PortInfo:M_KeyProtectBits may prohibit access to some data by SMPs without a valid M_Key has no bearing on SA access restrictions. > Roland> C15-0.1.23: [...] MCMemberRecords shall always be provided > Roland> with the PortGID, JoinState and ProxyJoin components set > Roland> to 0, except for the case of a trusted request, in which > Roland> case the actual component contents shall be provided. You omitted the following: C15-0.1.23: Subnet Administration shall follow the following additional rules concerning data access: Perhaps it is the language which is confusing and somewhat contradictory: C15-0.1.22 (IBA 1.1) and C15-0.2.1 (IBA 1.2) state "Subnet Administrator shall return only data providing information about subject nodes for which the requester shares a P_Key, with exceptions noted below in C15-0.1.23." whereas: C15-0.1.23 states "Subnet Administration shall follow the following additional rules concerning data access" and one of those rules concerns MCMemberRecord.
My interpretation is that the P_Key sharing is the first-level rule before the following is applied from C15-0.1.23: "MCMemberRecords shall always be provided with the PortGID, JoinState and ProxyJoin components set to 0, except for the case of a trusted request, in which case the actual component contents shall be provided." which is what you are citing to say you get all the multicast records. -- Hal From eitan at mellanox.co.il Wed Sep 29 04:22:58 2004 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 29 Sep 2004 13:22:58 +0200 Subject: [openib-general] static LID computation with TS_HOST_DRIVER Message-ID: <506C3D7B14CDD411A52C00025558DED6047EE7E5@mtlex01.yok.mtl.com> Hi, You might try using the osm tcl extension and running the static LID assignment flow. Please see the file osm/osmsh/osm_ref_flow.tcl for an example. Also, if you read the OpenSM manual we posted, you can find a description of the osm tcl extension API. EZ From David.Brean at Sun.COM Wed Sep 29 07:09:47 2004 From: David.Brean at Sun.COM (David M. Brean) Date: Wed, 29 Sep 2004 10:09:47 -0400 Subject: Fwd: Re: [openib-general] static LID computation with TS_HOST_DRIVER In-Reply-To: <7b3527db3b.7db3b7b352@bur-mail2.east.sun.com> References: <7b3527db3b.7db3b7b352@bur-mail2.east.sun.com> Message-ID: <415AC22B.3060506@sun.com> Hello, A couple of clarifying statements about this problem: 1) assignment of the IP address to the platform ethernet during boot has nothing to do with LID assignment. [If it does, why is there a relationship?] 2) in the version of the software Tom is using, the ib_mthca driver will pick a LID if it doesn't find an IB port in the ACTIVE (or ARMED) state. This is a hack and Roland has removed the hack from the latest version of the software.
If both (1) and (2) are true, is there still a plan to implement a mechanism where the applications can create connections before the SM assigns a LID and brings the port up to ACTIVE state? [I hope the response is "no".] -David Roland Dreier wrote: > Tom> [KERNEL_IB][ib_mad_static_compute_base][/build1/tduffy/openib-work/linux-2.6.9-rc2-openib/drivers/infiniband/core/mad_static.c:94]Couldn't > Tom> find a suitable network device; setting lid_base to 1 > > Tom> I am trying to track down why this is happening, but it seems > Tom> to pop up when the ib_mthca driver is loaded at boot time on > Tom> my sparc64 box (haven't tested other architectures). > > Tom> I guess you need to have your ethernet driver loaded in and > Tom> up with an IP address /before/ you load ib_mthca? What if > Tom> this is not the case? > >Yeah, if it can't find a configured interface it bails out and picks >1. It's benign and it will go away when we switch MAD layers. In >fact I'm going to disable it on my tree now... > > Tom> Anyways, I thought the LID would be assigned by the SM... > >Yeah, this is a hack to try and pick a LID that the SM won't change. >Some applications create connections before the SM has discovered the >node and don't want them broken when the SM does come around. > > - R. 
From krause at cup.hp.com Wed Sep 29 06:32:14 2004 From: krause at cup.hp.com (Michael Krause) Date: Wed, 29 Sep 2004 06:32:14 -0700 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <52r7ol7pa5.fsf@topspin.com> References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> <523c16l3wu.fsf@topspin.com> <6.1.2.0.2.20040927064203.01e60488@esmail.cup.hp.com> <52r7onhicd.fsf@topspin.com> <6.1.2.0.2.20040927081008.01da2090@esmail.cup.hp.com> <52oejrg1ux.fsf@topspin.com> <6.1.2.0.2.20040927085419.01f423c8@esmail.cup.hp.com> <52d607fzpl.fsf@topspin.com> <6.1.2.0.2.20040927135932.01e30340@esmail.cup.hp.com> <52pt467yy7.fsf@topspin.com> <6.1.2.0.2.20040928193055.0358e260@esmail.cup.hp.com> <52r7ol7pa5.fsf@topspin.com> Message-ID: <6.1.2.0.2.20040929062540.01e40080@esmail.cup.hp.com> At 08:00 PM 9/28/2004, you wrote: > Michael> The IBTA defined the specifications to establish standard > Michael> wire protocols to discover services without having to > Michael> track all potential services. The IETF does not address > Michael> how P_Keys are managed only that the IP over IB component > Michael> must follow a set of semantics / operations to enable IP > Michael> communication; all IB-specific management issues are > Michael> outside the scope of the drafts. > > Michael> As another person put it - embrace and extend. The > Michael> approach being advocated is something that will be > Michael> interoperable and provides a solution that can be easily > Michael> incorporated into all IP over IB implementations. It > Michael> also clarifies the ambiguous IB - admin management > Michael> interactions for IP communications.
> >Until an IPoIB service is mandated somewhere, IPoIB implementations >need to work even on subnets where the service is not registered. The goal here would be to establish a requirement that all implementations would use the process I've outlined here so the service is always registered. As such, there would be only one method and only one thing to implement in the future. The IBTA cannot mandate anything here as IP is not its charter to define. The IETF might be able to do something here but the best that might be done is perhaps an informational draft. Is this what is required to make forward progress here? >Which means a bit of extra complexity to handle both the case where >the service doesn't exist because the subnet has not implemented the >management scheme and the case where the service doesn't exist because >the administrator doesn't want to configure IPoIB for a given partition. IPoverIB implementations should not enable IP communication if the admin does not want it configured. The service lookup would occur per P_Key and only on those partitions that show the service registered would have IPoverIB enabled. The only thing the code does is examine whether the service is present or not for the P_Keys that are configured or when the service event handler informs the driver that the service has undergone change which may be to add or delete a partition. Mike From timur.tabi at ammasso.com Wed Sep 29 08:03:24 2004 From: timur.tabi at ammasso.com (Timur Tabi) Date: Wed, 29 Sep 2004 10:03:24 -0500 Subject: [openib-general] get_user_pages() vs. sys_mlock() and 2.6 kernel In-Reply-To: <20040929094629.GC15350@mellanox.co.il> References: <4134FD31.4060301@ammasso.com> <20040903160745.A6309@topspin.com> <4159E70E.5040705@ammasso.com> <20040929094629.GC15350@mellanox.co.il> Message-ID: <415ACEBC.2040606@ammasso.com> Michael S.
Tsirkin wrote: > I think the reason is that linux currently supports zero-copy only for > a very limited number of situations, not > for a generic user-given memory buffer, In my test code, I only grab one 4K page with get_user_pages(), so it's not like I'm stressing it out. > so the problem does not occur. > I think its also mostly for send (hardware reads the buffer), not > receive. What difference does that make? If the hardware is reading the buffer, the buffer better be there! If a page of the buffer gets swapped out, then the hardware will read junk. -- Timur Tabi Staff Software Engineer timur.tabi at ammasso.com From roland at topspin.com Wed Sep 29 08:03:53 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 29 Sep 2004 08:03:53 -0700 Subject: [openib-general] static LID computation with TS_HOST_DRIVER In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EE7E5@mtlex01.yok.mtl.com> (Eitan Zahavi's message of "Wed, 29 Sep 2004 13:22:58 +0200") References: <506C3D7B14CDD411A52C00025558DED6047EE7E5@mtlex01.yok.mtl.com> Message-ID: <52655x6rs6.fsf@topspin.com> Eitan> Hi , You might try using osm tcl extention and run static Eitan> lid assignment flow. Please see the file Eitan> osm/osmsh/osm_ref_flow.tcl for example. Also if you read Eitan> the OpenSM manual we posted you can find description of Osm Eitan> tcl extention API. That's great but if you read more about this "static LID" feature, the intention is for nodes to pick a LID _before_ the SM has discovered them. Thanks, Roland From roland at topspin.com Wed Sep 29 08:07:28 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 29 Sep 2004 08:07:28 -0700 Subject: Fwd: Re: [openib-general] static LID computation with TS_HOST_DRIVER In-Reply-To: <415AC22B.3060506@sun.com> (David M. 
Brean's message of "Wed, 29 Sep 2004 10:09:47 -0400") References: <7b3527db3b.7db3b7b352@bur-mail2.east.sun.com> <415AC22B.3060506@sun.com> Message-ID: <521xgl6rm7.fsf@topspin.com> David> 1) assignment of the IP address to the platform ethernet David> during boot has nothing to do with LID assignment. [If it David> does, why is there a relationship?] Using the first IP address we can find was a quick-and-dirty way to pick a LID that would probably be unique among different hosts. David> 2) in the version of the software Tom is using, the David> ib_mthca driver will pick a LID if it doesn't find an IB David> port in the ACTIVE (or ARMED) state. This is a hack and David> Roland has removed the hack from the latest version of the David> software. Actually mthca had nothing to do with the LID assignment. It was done by the MAD layer in device-independent code before it starts processing MADs. This means it always happens, since the port cannot possibly be active without processing SMPs. David> If both (1) and (2) are true, is there still a plan to David> implement a mechanism where the applications can create David> connections before the SM assigns a LID and brings the port David> up to ACTIVE state? [I hope the response is "no".] I guess it depends on whether supporting applications that want to do this becomes important or not. 
Thanks, Roland From roland at topspin.com Wed Sep 29 08:10:10 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 29 Sep 2004 08:10:10 -0700 Subject: [Fwd: Re: [openib-general] [PATCH] Add access into build (Roland's branch)] In-Reply-To: <1096450096.1872.82.camel@localhost.localdomain> (Hal Rosenstock's message of "Wed, 29 Sep 2004 05:28:17 -0400") References: <1096429015.1878.32.camel@localhost.localdomain> <52brfp7l09.fsf@topspin.com> <1096450096.1872.82.camel@localhost.localdomain> Message-ID: <52wtyd5cx9.fsf@topspin.com> Hal> The only reason I requested this was for the transition Hal> period to minimize the duplication of changes as I am Hal> presuming you will want to continue your branch. Maybe that Hal> is an incorrect assumption (or presumption). No, it is correct. I am planning on continuing to work on IPoIB to get it to a mergable state. Hal> I'm not adverse to a new tree (in fact, openib-candidate Hal> would work as a this branch if it gets filled in.) but was Hal> trying to minimize the work until the phase 1 work was far Hal> enough along. I suspect that will take a little time. Hal> Who would be responsible for duplicating any mthca changes Hal> (and other layers like IPoIB) ? I would assume the MAD integration branch would be fairly short-lived and wouldn't need many changes pulled in. Whoever owned the branch (you and/or Sean) would be responsible for merging whatever you want/need onto the branch (and conversely I would need to continue to merge anything relevant onto my branch). I am guessing that Sean and I will be responsible for the various pieces of the trunk that we eventually create. 
From roland at topspin.com Wed Sep 29 08:16:10 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 29 Sep 2004 08:16:10 -0700 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <6.1.2.0.2.20040929062540.01e40080@esmail.cup.hp.com> (Michael Krause's message of "Wed, 29 Sep 2004 06:32:14 -0700") References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> <523c16l3wu.fsf@topspin.com> <6.1.2.0.2.20040927064203.01e60488@esmail.cup.hp.com> <52r7onhicd.fsf@topspin.com> <6.1.2.0.2.20040927081008.01da2090@esmail.cup.hp.com> <52oejrg1ux.fsf@topspin.com> <6.1.2.0.2.20040927085419.01f423c8@esmail.cup.hp.com> <52d607fzpl.fsf@topspin.com> <6.1.2.0.2.20040927135932.01e30340@esmail.cup.hp.com> <52pt467yy7.fsf@topspin.com> <6.1.2.0.2.20040928193055.0358e260@esmail.cup.hp.com> <52r7ol7pa5.fsf@topspin.com> <6.1.2.0.2.20040929062540.01e40080@esmail.cup.hp.com> Message-ID: <52sm915cn9.fsf@topspin.com> Michael> The IETF might be able to do something here but the best Michael> that might be done is perhaps an informational draft. Is Michael> this what is required to make forward progress here? I think so. Without an IETF draft, there _will_ be implementations that do not create a service record, and therefore interoperable implementations will have to function both in subnets without service records and subnets with service records. However, reading back over this thread, I'm not clear on what purpose having a service record for IPoIB serves. Why can't an implementation just look for the IPoIB broadcast multicast groups for each P_Key to decide whether to use that P_Key? Thanks, Roland From David.Brean at Sun.COM Wed Sep 29 08:20:47 2004 From: David.Brean at Sun.COM (David M. 
Brean) Date: Wed, 29 Sep 2004 11:20:47 -0400 Subject: Fwd: Re: [openib-general] static LID computation with TS_HOST_DRIVER In-Reply-To: <521xgl6rm7.fsf@topspin.com> References: <7b3527db3b.7db3b7b352@bur-mail2.east.sun.com> <415AC22B.3060506@sun.com> <521xgl6rm7.fsf@topspin.com> Message-ID: <415AD2CF.3000900@sun.com> Ok. How does the port inform the SM that it has a "preferred" LID? -David Roland Dreier wrote: > David> 1) assignment of the IP address to the platform ethernet > David> during boot has nothing to do with LID assignment. [If it > David> does, why is there a relationship?] > >Using the first IP address we can find was a quick-and-dirty way to >pick a LID that would probably be unique among different hosts. > > David> 2) in the version of the software Tom is using, the > David> ib_mthca driver will pick a LID if it doesn't find an IB > David> port in the ACTIVE (or ARMED) state. This is a hack and > David> Roland has removed the hack from the latest version of the > David> software. > >Actually mthca had nothing to do with the LID assignment. It was done >by the MAD layer in device-independent code before it starts >processing MADs. This means it always happens, since the port cannot >possibly be active without processing SMPs. > > David> If both (1) and (2) are true, is there still a plan to > David> implement a mechanism where the applications can create > David> connections before the SM assigns a LID and brings the port > David> up to ACTIVE state? [I hope the response is "no".] > >I guess it depends on whether supporting applications that want to do >this becomes important or not. 
> >Thanks, > Roland From roland at topspin.com Wed Sep 29 08:24:21 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 29 Sep 2004 08:24:21 -0700 Subject: Fwd: Re: [openib-general] static LID computation with TS_HOST_DRIVER In-Reply-To: <415AD2CF.3000900@sun.com> (David M. Brean's message of "Wed, 29 Sep 2004 11:20:47 -0400") References: <7b3527db3b.7db3b7b352@bur-mail2.east.sun.com> <415AC22B.3060506@sun.com> <521xgl6rm7.fsf@topspin.com> <415AD2CF.3000900@sun.com> Message-ID: <52oejp5c9m.fsf@topspin.com> David> Ok. How does the port inform the SM that it has a David> "preferred" LID? The port will already have a LID assigned when the SM discovers it. My understanding is that the SM is "encouraged" to preserve a port's LID if it doesn't conflict with any other LIDs, and this is what we're relying on. - Roland From David.Brean at Sun.COM Wed Sep 29 08:49:46 2004 From: David.Brean at Sun.COM (David M. Brean) Date: Wed, 29 Sep 2004 11:49:46 -0400 Subject: Fwd: Re: [openib-general] static LID computation with TS_HOST_DRIVER In-Reply-To: <52oejp5c9m.fsf@topspin.com> References: <7b3527db3b.7db3b7b352@bur-mail2.east.sun.com> <415AC22B.3060506@sun.com> <521xgl6rm7.fsf@topspin.com> <415AD2CF.3000900@sun.com> <52oejp5c9m.fsf@topspin.com> Message-ID: <415AD99A.3060504@sun.com> The subnet management state is protected by an M_Key (see section 14.2.4). The M_Key is managed by the SM. The M_Key is not exposed via the Verbs or through SA queries [the exception being the trusted entity, but that entity is another SA or higher layer management application - so let's ignore for this situation.]
An endnode implementation should not allow any back-door mechanism that enables changing the subnet management state without the M_Key if the port is protected. Note, the M_Key and protection bits can be in persistent storage and preserved across port power cycles to eliminate the power-up exposure. So, without making subnet assumptions/restrictions, there is no reliable way for an IB client, like the local MAD layer, to specify the LID. LID assignment is SM policy and the SM may choose to preserve a port's LID; however, I don't think IB clients should depend on this behavior. -David Roland Dreier wrote: > David> Ok. How does the port inform the SM that it has a > David> "preferred" LID? > >The port will already have a LID assigned when the SM discovers it. >My understanding is that the SM is "encouraged" to preserve a port's >LID if it doesn't conflict with any other LIDs, and this is what we're >relying on. > > - Roland From parul_sunil at yahoo.com Wed Sep 29 09:13:30 2004 From: parul_sunil at yahoo.com (Parul Bhatt) Date: Wed, 29 Sep 2004 09:13:30 -0700 (PDT) Subject: [openib-general] VAPI source code Message-ID: <20040929161330.38959.qmail@web11509.mail.yahoo.com> Hello, Does anyone have simple VAPI PingPong and PingPing source code? We have an old version of the IB card that we want to test with this source code, and we want to understand the complete flow of the PingPong and PingPing functions. This would be a great help for us. We have the latest version of the perfmain package, but it covers too much functionality; our interest is in simple PingPong and PingPing type functions. We have PingPong and PingPing functions in MPI-I, and they are working fine. Our interest is in finding the best result in VAPI.
Thank you very much Parul Bhatt From mshefty at ichips.intel.com Wed Sep 29 09:22:00 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 29 Sep 2004 09:22:00 -0700 Subject: [openib-general] [PATCH] cancel outstanding MADs when deregistering In-Reply-To: <52u0ti9nqb.fsf@topspin.com> References: <20040928123221.45cde4ac.mshefty@ichips.intel.com> <52u0ti9nqb.fsf@topspin.com> Message-ID: <20040929092200.202a5837.mshefty@ichips.intel.com> On Tue, 28 Sep 2004 12:50:52 -0700 Roland Dreier wrote: > It looks OK for current functionality but I think it will have to > change to support cancelling sends. (Cancelling sends is required > for consumers that start a query with a long timeout and then want to > unload or something like that). I'm aware that there's currently no functionality for canceling a single MAD. This would require a new API, and possibly changes to the send MAD API. > When someone asks to cancel a send you have to tell the consumer > whether the send was canceled or had already finished (so that they > know whether the resources have already been freed). You have to make > sure that you don't say that the send wasn't canceled and then return > with the send completion handler still running on some other CPU, > because then the consumer will probably corrupt context. I am aware of this issue. My plan to handle it was to search for MADs to cancel, in case the cancel call came after the MAD had already been returned to the user. > That means that you can't just remove things from your pending list > like the code does now -- you have to leave them there and mark them > "callback running" or something like that. It ends up being even more > complicated unfortunately.
The cancel code that's there should work if deregistration is called at the same time that a send operation completes. A reference is taken on the send MAD work request structure while the work request is posted to the QP. A second reference is taken if the MAD has a timeout, meaning that a response is expected. The current cancel code (called only during deregistration) only releases the reference taken for MADs that have a timeout. If no other references remain on the send MAD work request, it is canceled immediately. Otherwise, it is marked as flushed, and will be completed once other references go away. (Other references would be for multiple work requests posted to the QP for RMPP.) As a last note, my intention is to have the code call back the user for every MAD that they've sent. From mshefty at ichips.intel.com Wed Sep 29 09:25:50 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 29 Sep 2004 09:25:50 -0700 Subject: [openib-general] [PATCH] cancel outstanding MADs when deregistering In-Reply-To: References: <20040928123221.45cde4ac.mshefty@ichips.intel.com> Message-ID: <20040929092550.349b086a.mshefty@ichips.intel.com> Thanks for the feedback. On Tue, 28 Sep 2004 13:21:51 -0700 (PDT) Krishna Kumar wrote: > In cancel_mads() : > > if (mad_send_wr->refcount <= 0) { > > If there is no good reason for the refcount to drop below zero, it is > better to put BUG_ON for such code to catch potential bugs much earlier, > while keeping the check as "if (x == 0)", etc. Can do. The refcount should never fall below 0. > Also, if timeout_ms is set, will those entries get removed from the > list (since they have refcnt of two) ? If the refcount is still two, they will not. The second refcount is released once the work request posted to the QP is completed. See my response to Roland for more details. Unfortunately, I was trying to add in smaller patches, so the code that does request/response matching and would use the timeout_ms value is not in there yet.
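The reference counting described in the reply to Roland can be sketched as a small model. This is only an illustration under assumptions, not the code from the patch: the names toy_send_wr, toy_post, toy_send_done, toy_cancel and toy_put are hypothetical, and a plain int stands in for whatever counter the real code uses.

```c
#include <assert.h>

/* Hypothetical model of a send MAD work request; not the patch's struct. */
struct toy_send_wr {
    int refcount;    /* 1 while posted to the QP, +1 if a response is expected */
    int timeout_ms;  /* nonzero means a response is expected */
    int flushed;     /* set when the request has been canceled */
    int completed;   /* set once the user callback would have run */
};

/* Drop one reference; complete the request when the last one goes away. */
static void toy_put(struct toy_send_wr *wr)
{
    assert(wr->refcount > 0);  /* the count should never fall below zero */
    if (--wr->refcount == 0)
        wr->completed = 1;
}

/* Post a send: one reference for the QP, one more if a timeout is set. */
static void toy_post(struct toy_send_wr *wr, int timeout_ms)
{
    wr->refcount = 1;
    wr->timeout_ms = timeout_ms;
    wr->flushed = 0;
    wr->completed = 0;
    if (timeout_ms)
        wr->refcount++;
}

/* The send completion from the QP releases the posting reference. */
static void toy_send_done(struct toy_send_wr *wr)
{
    toy_put(wr);
}

/* Cancel at deregistration: only the timeout reference is released. */
static void toy_cancel(struct toy_send_wr *wr)
{
    if (!wr->timeout_ms)
        return;            /* no response reference to release */
    wr->timeout_ms = 0;    /* make a repeated cancel a no-op */
    wr->flushed = 1;       /* completes now if this was the last reference */
    toy_put(wr);
}
```

In this model a cancel that races with the send completion is harmless in either order: whichever call drops the last reference is the one that completes the request.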
> Finally, do you want to wake up threads waiting on mad_agent_priv->wait > once finally out of the loop or each time when the refcnt drops to zero ? > If there is no reason to do so each time, you can do once you finish the > cancel list. BTW, in this case you can also call wake_up_nr() if you > keep track of the number of times you want to wake up based on the return > value of atomic_dec_* call. Actually, I shouldn't try to wake up the thread at all. The mad_agent_priv->refcount must be at least 1 at the end of this function. I'll fix. From mshefty at ichips.intel.com Wed Sep 29 09:30:45 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 29 Sep 2004 09:30:45 -0700 Subject: [openib-general] [PATCH] Fix MAD completion handling In-Reply-To: <52brfq9i66.fsf@topspin.com> References: <52brfq9i66.fsf@topspin.com> Message-ID: <20040929093045.7a3b154a.mshefty@ichips.intel.com> On Tue, 28 Sep 2004 14:50:57 -0700 Roland Dreier wrote: > @@ -892,6 +892,8 @@ > struct ib_wc wc; > int err_status = 0; > > + ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); > + > while (ib_poll_cq(port_priv->cq, 1, &wc) == 1) { > printk(KERN_DEBUG "Completion opcode 0x%x WRID 0x%Lx\n", wc.opcode, wc.wr_id); > if (wc.status != IB_WC_SUCCESS) { > @@ -928,11 +930,8 @@ > } > } > > - if (err_status) { > + if (err_status) > ib_mad_port_restart(port_priv); > - } else { > - ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); > - } Shouldn't we be able to keep the ib_req_notify_cq at the end of this function? If additional completions are left after polling, a second event should be generated. Or at least that's what I remember from our discussions about this...
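The ordering question raised by this patch can be illustrated with a toy model. This is only a sketch: toy_cq, toy_arm, toy_poll and toy_complete are hypothetical stand-ins rather than the ib_verbs calls, and the model assumes that arming the CQ produces an event only for a completion added after the arm call.

```c
#include <assert.h>

/* Hypothetical toy completion queue; not the ib_verbs API. */
struct toy_cq {
    int pending;  /* completions waiting to be polled */
    int armed;    /* 1 if the next new completion raises an event */
    int events;   /* events raised so far */
};

/* Assumption: arming only affects completions added afterwards. */
static void toy_arm(struct toy_cq *cq)
{
    cq->armed = 1;
}

/* Hardware adds a completion; an event fires only if the CQ is armed. */
static void toy_complete(struct toy_cq *cq)
{
    cq->pending++;
    if (cq->armed) {
        cq->armed = 0;
        cq->events++;
    }
}

/* Poll one completion; returns 1 if one was consumed, else 0. */
static int toy_poll(struct toy_cq *cq)
{
    assert(cq->pending >= 0);
    if (cq->pending == 0)
        return 0;
    cq->pending--;
    return 1;
}

/* Handler that drains, then arms; a completion slips in between. */
static struct toy_cq arm_after_drain(void)
{
    struct toy_cq cq = { 1, 0, 0 };

    while (toy_poll(&cq) == 1)
        ;
    toy_complete(&cq);  /* arrives after the last poll, before arming */
    toy_arm(&cq);
    return cq;
}

/* Handler that arms first, then drains; same late completion. */
static struct toy_cq arm_before_drain(void)
{
    struct toy_cq cq = { 1, 0, 0 };

    toy_arm(&cq);
    while (toy_poll(&cq) == 1)
        ;
    toy_complete(&cq);  /* arrives after the last poll */
    return cq;
}
```

Under that assumption, arm_after_drain() ends with one completion pending and no event to wake the handler, while arm_before_drain() ends with the same pending completion but with an event raised for it. Whether the real hardware and verbs layer behave like this model is exactly the point under discussion.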
From mshefty at ichips.intel.com Wed Sep 29 09:43:19 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 29 Sep 2004 09:43:19 -0700 Subject: [openib-general] [PATCH] Add access into build (Roland's branch) In-Reply-To: <524qlh96oi.fsf@topspin.com> References: <1096420864.1872.6.camel@localhost.localdomain> <524qlh96oi.fsf@topspin.com> Message-ID: <20040929094319.5279a35a.mshefty@ichips.intel.com> On Tue, 28 Sep 2004 18:59:09 -0700 Roland Dreier wrote: > -obj-$(CONFIG_INFINIBAND) += legacy/ core/ ulp/ hw/ > +obj-$(CONFIG_INFINIBAND) += legacy/ core/ access/ ulp/ hw/ > > This doesn't really make sense to me. Why do we need a core/ and an > access/ directory? I prefer core/, since it matches drivers/usb/core, > net/core and sound/core already in the kernel tree, and "access layer" > is a bit of jargon that no one not familiar with the history of IB > stacks is going to understand. However if we prefer access/ then I'll > move everything from core/ there. I think core makes more sense, but I don't really care when we move to that directory structure. Maybe when we pull together the code... From krause at cup.hp.com Wed Sep 29 09:37:13 2004 From: krause at cup.hp.com (Michael Krause) Date: Wed, 29 Sep 2004 09:37:13 -0700 Subject: Fwd: Re: [openib-general] static LID computation with TS_HOST_DRIVER In-Reply-To: <415AD99A.3060504@sun.com> References: <7b3527db3b.7db3b7b352@bur-mail2.east.sun.com> <415AC22B.3060506@sun.com> <521xgl6rm7.fsf@topspin.com> <415AD2CF.3000900@sun.com> <52oejp5c9m.fsf@topspin.com> <415AD99A.3060504@sun.com> Message-ID: <6.1.2.0.2.20040929093103.01fd5a60@esmail.cup.hp.com> At 08:49 AM 9/29/2004, David M. Brean wrote: >The subnet management state is protected by an M_Key (see section >14.2.4). The M_Key is managed by the SM. The M_Key is not exposed via >the Verbs or through SA queries [the exception being the trusted entity, >but that entity is another SA or higher layer management application - so >let's ignore for this situation.] 
An endnode implementation should not >allow any back-door mechanism that enables changing the subnet management >state without the M_Key if the port is protected. Note, the M_Key and >protection bits can be in persistent storage and preserved across port >power cycles to eliminate the power-up exposure. So, without making >subnet assumptions/restrictions, there is no reliable way for an IB >client, like the local MAD layer, to specify the LID. > >LID assignment is SM policy and the SM may choose to preserve a port's >LID, however, I don't think IB clients should depend on this behavior. It has been the clear intent of the IBTA that LID assignment be strictly done by the SM as centralized management is the operating paradigm defined. It has also been clear from the start that LID values are, by definition, dynamic and should never be preserved or relied upon by endnodes. There is no requirement or policy that a SM should attempt to preserve LID assignment. Given I authored addressing and other sections of the IB specs and chaired the workgroup responsible for the link wire protocols, I'm fairly confident that this is the intent of the IBTA. Mike >-David > >Roland Dreier wrote: > >> David> Ok. How does the port inform the SM that it has a >> David> "preferred" LID? >> >>The port will already have a LID assigned when the SM discovers it. >>My understanding is that the SM is "encouraged" to preserve a port's >>LID if it doesn't conflict with any other LIDs, and this is what we're >>relying on. 
>> >>- Roland >>_______________________________________________ >>openib-general mailing list >>openib-general at openib.org >>http://openib.org/mailman/listinfo/openib-general >> >>To unsubscribe, please visit >>http://openib.org/mailman/listinfo/openib-general >> > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From krause at cup.hp.com Wed Sep 29 09:48:21 2004 From: krause at cup.hp.com (Michael Krause) Date: Wed, 29 Sep 2004 09:48:21 -0700 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <52sm915cn9.fsf@topspin.com> References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> <523c16l3wu.fsf@topspin.com> <6.1.2.0.2.20040927064203.01e60488@esmail.cup.hp.com> <52r7onhicd.fsf@topspin.com> <6.1.2.0.2.20040927081008.01da2090@esmail.cup.hp.com> <52oejrg1ux.fsf@topspin.com> <6.1.2.0.2.20040927085419.01f423c8@esmail.cup.hp.com> <52d607fzpl.fsf@topspin.com> <6.1.2.0.2.20040927135932.01e30340@esmail.cup.hp.com> <52pt467yy7.fsf@topspin.com> <6.1.2.0.2.20040928193055.0358e260@esmail.cup.hp.com> <52r7ol7pa5.fsf@topspin.com> <6.1.2.0.2.20040929062540.01e40080@esmail.cup.hp.com> <52sm915cn9.fsf@topspin.com> Message-ID: <6.1.2.0.2.20040929092739.01ea7aa0@esmail.cup.hp.com> At 08:16 AM 9/29/2004, Roland Dreier wrote: > Michael> The IETF might be able to do something here but the best > Michael> that might be done is perhaps an informational draft. Is > Michael> this what is required to make forward progress here? > >I think so. 
Without an IETF draft, there _will_ be implementations >that do not create a service record, and therefore interoperable >implementations will have to function both in subnets without service >records and subnets with service records. I'll talk with Bill about getting a draft submitted if that is consensus and there is support for this approach. >However, reading back over this thread, I'm not clear on what purpose >having a service record for IPoIB serves. Why can't an implementation >just look for the IPoIB broadcast multicast groups for each P_Key to >decide whether to use that P_Key? Based on IETF discussions, our intent was: - For each partition that is enabled to support IP communication, the IP over IB implementation should join (create if the first) the associated "all nodes" multicast group. This is analogous to Ethernet VLAN usage model where if allowed to communicate, one does; hence, it isn't a decision. - When an endnode is enabled in the IB subnet and the IP over IB driver is configured, it can examine the configured P_Key and communicate with the SM/SA to determine what multicast groups are available. Based on this information, the endnode can request to join a multicast group thus enabling IP over IB to issue ARP / ND messages. - An endnode would then need to set up an event notification to understand when partitions were updated - add or deleted - for its local ports. The endnode would also need to know whether it is the last member of the multicast group as well. The method for an admin to indicate what partitions were configured is not defined by the IETF specs nor do I recall a method to state that a given IB multicast group is associated with IP and hence our discussions to date. My initial inquiry was in response to the requirement to use a tool. I view such a tool as unnecessary as well as non-scalable as the size of the cluster increases. Therefore, I have suggested a method that would not require a tool. 
I'm quite open to any approach as long as it avoids creating a tool as everything can be more easily integrated into an IP over IB driver which benefits the admin and the use of IB technology. Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From krause at cup.hp.com Wed Sep 29 09:50:06 2004 From: krause at cup.hp.com (Michael Krause) Date: Wed, 29 Sep 2004 09:50:06 -0700 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <1096452691.7112.104.camel@localhost.localdomain> References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> <523c16l3wu.fsf@topspin.com> <6.1.2.0.2.20040927064203.01e60488@esmail.cup.hp.com> <52r7onhicd.fsf@topspin.com> <6.1.2.0.2.20040927081008.01da2090@esmail.cup.hp.com> <52oejrg1ux.fsf@topspin.com> <6.1.2.0.2.20040927085419.01f423c8@esmail.cup.hp.com> <52d607fzpl.fsf@topspin.com> <1096304811.11222.94.camel@localhost.localdomain> <52wtyfeit3.fsf@topspin.com> <1096306647.1830.114.camel@localhost.localdomain> <52oejred6r.fsf@topspin.com> <1096379771.3479.60.camel@localhost.localdomain> <52is9ycsj2.fsf@topspin.com> <1096452691.7112.104.camel@localhost.localdomain> Message-ID: <6.1.2.0.2.20040929094842.01e1cc98@esmail.cup.hp.com> At 03:11 AM 9/29/2004, Hal Rosenstock wrote: >On Tue, 2004-09-28 at 11:38, Roland Dreier wrote: > > Hal> At IBA 1.2, C15-0.1.22 is obsolete and has been replaced by > > Hal> C15-0.2.1. C15-0.2.1: When a requester node sends a trusted > > Hal> request to SA, the requested data shall be returned. When a > > Hal> requester node sends a non-trusted request for data to SA > > Hal> that would provide information about a subject node, the SA > > Hal> shall return only data providing information about subject > > Hal> nodes for which the requester shares a P_Key, with exceptions > > Hal> noted below in C15-0.1.23. 
> > > > Hmm, this seems like the only difference from C15-0.1.22 is that it > > talks about trusted requests. C15-0.1.23 (below) still says that > > MCMemberRecords don't provide information about any subject nodes, so > > I guess the SM should not worry about P_Keys when returning the table > > of multicast groups. > >You are correct about the IBA 1.1/1.2 compliance difference (and the >trusted request part is not relevant to this discussion). > >C15-0.1.22 does, however, talk about PKey sharing which is relevant to >this (and consistent with 15.4.1 below). > >The intent is that the SA only return information relative to the >partitions that the port was part of. The below is from IBA 1.1 (and 1.2 >as well): > >15.4.1 Restrictions on Access >There are two types of access restrictions involved in SA: >Authenticating the requester of information, and restricting the data >that the requester is allowed to receive. These are discussed below. The >SA access restrictions described here are based on partition membership, >and are intended to implement this rule: If access to data is allowed by >partition membership, that access is granted; but if it is disallowed, >the requester should remain unaware of the existence of that information >and of the network elements containing that information. Additionally, >in no event is authentication information made available to untrusted >requests. That a node s PortInfo:M_Key and PortInfo:M_KeyProtectBits may >prohibit access to some data by SMPs without a valid M_Key has no >bearing on SA access restrictions. > > > Roland> C15-0.1.23: [...] MCMemberRecords shall always be provided > > Roland> with the PortGID, Join- State and ProxyJoin components set > > Roland> to 0, except for the case of a trusted request, in which > > Roland> case the actual component contents shall be provided. 
>
>You omitted the following:
>C15-0.1.23: Subnet Administration shall follow the following additional
>rules concerning data access:
>
>Perhaps it is the language which is confusing and somewhat
>contradictory:
>
>C15-0.1.22 (IBA 1.1) and C15-0.2.1 (IBA 1.2) state "Subnet Administrator
>shall return only data providing information about subject nodes for
>which the requester shares a P_Key, with exceptions noted below in
>C15-0.1.23:."
>
>whereas:
>C15-0.1.23 states "Subnet Administration shall follow the following
>additional rules concerning data access" and one of those rules concerns
>MCMemberRecord.
>
>My interpretation is that the Pkey sharing is the first level rule

P_Key sharing and access is the first rule - no endnode should be able to acquire information unless it is authorized.

Mike

>before the following is applied from C15-0.1.23:
>"MCMemberRecords shall always be provided with the PortGID, Join- State
>and ProxyJoin components set to 0, except for the case of a trusted
>request, in which case the actual component contents shall be provided."
>which is what you are citing to say you get all the multicast records.
>
>-- Hal

From bill at strahm.net Wed Sep 29 10:06:07 2004
From: bill at strahm.net (bill at strahm.net)
Date: Wed, 29 Sep 2004 10:06:07 -0700
Subject: [openib-general] IPoIB Loading and Starting
Message-ID: <20040929170607.10624.qmail@webmail-2-6.mesa1.secureserver.net>

This may have been answered by now,

But in the process of creating the broadcast group there are quite a few parameters that are better off not being left to the end client (MTU being a big one)

I can see this service record returning all of the needed parameters, therefore letting the central administrator have a central point of control

Bill

> -------- Original Message --------
> Subject: Re: [openib-general] IPoIB Loading and Starting
> From: "Roland Dreier"
> Date: Wed, September 29, 2004 8:16 am
> To: "Michael Krause"
> Cc: openib-general at openib.org
>
> Michael> The IETF might be able to do something here but the best
> Michael> that might be done is perhaps an informational draft. Is
> Michael> this what is required to make forward progress here?
>
> I think so. Without an IETF draft, there _will_ be implementations
> that do not create a service record, and therefore interoperable
> implementations will have to function both in subnets without service
> records and subnets with service records.
>
> However, reading back over this thread, I'm not clear on what purpose
> having a service record for IPoIB serves. Why can't an implementation
> just look for the IPoIB broadcast multicast groups for each P_Key to
> decide whether to use that P_Key?
> > Thanks, > Roland > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Wed Sep 29 10:14:16 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 29 Sep 2004 13:14:16 -0400 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <6.1.2.0.2.20040929062540.01e40080@esmail.cup.hp.com> References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> <523c16l3wu.fsf@topspin.com> <6.1.2.0.2.20040927064203.01e60488@esmail.cup.hp.com> <52r7onhicd.fsf@topspin.com> <6.1.2.0.2.20040927081008.01da2090@esmail.cup.hp.com> <52oejrg1ux.fsf@topspin.com> <6.1.2.0.2.20040927085419.01f423c8@esmail.cup.hp.com> <52d607fzpl.fsf@topspin.com> <6.1.2.0.2.20040927135932.01e30340@esmail.cup.hp.com> <52pt467yy7.fsf@topspin.com> <6.1.2.0.2.20040928193055.0358e260@esmail.cup.hp.com> <52r7ol7pa5.fsf@topspin.com> <6.1.2.0.2.20040929062540.01e40080@esmail.cup.hp.com> Message-ID: <1096478056.19157.8.camel@hpc-1> On Wed, 2004-09-29 at 09:32, Michael Krause wrote: > The IETF might be able to do something here but the best that might be > done is perhaps an informational draft. Is this what is required to > make forward progress here? While it is better than not specifying how this would work, unfortunately, as you are no doubt aware, informational RFCs don't carry the same weight as standards track RFCs :-( Why wouldn't this be made standards track ? -- Hal From mshefty at ichips.intel.com Wed Sep 29 10:18:53 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 29 Sep 2004 10:18:53 -0700 Subject: [openib-general] [PATCH] tweaks to canceling MAD based on feedback. Message-ID: <20040929101853.07414b6b.mshefty@ichips.intel.com> Patch based on feedback from Krishna for canceling a MAD. 
--
Index: access/ib_mad.c
===================================================================
--- access/ib_mad.c (revision 912)
+++ access/ib_mad.c (working copy)
@@ -1002,7 +1002,7 @@
 			mad_send_wr->refcount--;
 		}

-		if (mad_send_wr->refcount <= 0) {
+		if (mad_send_wr->refcount == 0) {
 			list_del(&mad_send_wr->agent_send_list);
 			list_add_tail(&mad_send_wr->agent_send_list,
 				      &cancel_list);
@@ -1024,9 +1024,7 @@
 		list_del(&mad_send_wr->agent_send_list);
 		kfree(mad_send_wr);

-		/* Release reference on agent taken when sending. */
-		if (atomic_dec_and_test(&mad_agent_priv->refcount))
-			wake_up(&mad_agent_priv->wait);
+		atomic_dec(&mad_agent_priv->refcount);
 	}
 }

From halr at voltaire.com Wed Sep 29 10:26:53 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Wed, 29 Sep 2004 13:26:53 -0400
Subject: [openib-general] IPoIB Loading and Starting
In-Reply-To: <6.1.2.0.2.20040929092739.01ea7aa0@esmail.cup.hp.com>
References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> <523c16l3wu.fsf@topspin.com> <6.1.2.0.2.20040927064203.01e60488@esmail.cup.hp.com> <52r7onhicd.fsf@topspin.com> <6.1.2.0.2.20040927081008.01da2090@esmail.cup.hp.com> <52oejrg1ux.fsf@topspin.com> <6.1.2.0.2.20040927085419.01f423c8@esmail.cup.hp.com> <52d607fzpl.fsf@topspin.com> <6.1.2.0.2.20040927135932.01e30340@esmail.cup.hp.com> <52pt467yy7.fsf@topspin.com> <6.1.2.0.2.20040928193055.0358e260@esmail.cup.hp.com> <52r7ol7pa5.fsf@topspin.com> <6.1.2.0.2.20040929062540.01e40080@esmail.cup.hp.com> <52sm915cn9.fsf@topspin.com> <6.1.2.0.2.20040929092739.01ea7aa0@esmail.cup.hp.com>
Message-ID: <1096478813.19157.21.camel@hpc-1>

On Wed, 2004-09-29 at 12:48, Michael Krause wrote:
> Based on IETF discussions, our intent was:
>
> - For each partition that is enabled to support IP communication, the
> IP over IB implementation should join (create if the first) the
> associated "all nodes" multicast group.
This is analogous to Ethernet > VLAN usage model where if allowed to communicate, one does; hence, it > isn't a decision. There is the "limited" broadcast group from which the parameters would be derived for joining the "all nodes" multicast group (224.0.0.1). > - When an endnode is enabled in the IB subnet and the IP over IB > driver is configured, it can examine the configured P_Key and > communicate with the SM/SA to determine what multicast groups are > available. Based on this information, the endnode can request to join > a multicast group thus enabling IP over IB to issue ARP / ND > messages. I don't understand the last sentence of this. For a partition that an IPoIB interface is on (which is one of the IB port's partitions), all the relevant multicast groups can be obtained from the SA but what does this have to do with enabling "ARP/ND". Doesn't the broadcast group creation/join take care of ARP ? > - An endnode would then need to set up an event notification to > understand when partitions were updated - add or deleted - for its > local ports. Unfortunately, there is no local partition table changed event defined in IBA. > The endnode would also need to know whether it is the last member of > the multicast group as well. Not sure why this is needed by the endnode. I presume you are referring to the last full member. The group is deleted when the last full member leaves the group. 
-- Hal

From halr at voltaire.com Wed Sep 29 10:40:30 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Wed, 29 Sep 2004 13:40:30 -0400
Subject: [openib-general] IPoIB Loading and Starting
In-Reply-To: <20040929170607.10624.qmail@webmail-2-6.mesa1.secureserver.net>
References: <20040929170607.10624.qmail@webmail-2-6.mesa1.secureserver.net>
Message-ID: <1096479629.19157.25.camel@hpc-1>

On Wed, 2004-09-29 at 13:06, bill at strahm.net wrote:
> This may have been answered by now,
>
> But in the process of creating the broadcast group there are quite a few
> parameters that are better off not being left to the end client (MTU
> being a big one)
>
> I can see this service record returning all of the needed parameters,
> therefore letting the central administrator have a central point of
> control

As SA ServiceRecords do not contain MTU or other parameters, there would need to be a standard encoding into the ServiceData fields for this. There would also likely need to be a service naming scheme for identifying which service record to find.

-- Hal

From roland at topspin.com Wed Sep 29 10:49:15 2004
From: roland at topspin.com (Roland Dreier)
Date: Wed, 29 Sep 2004 10:49:15 -0700
Subject: [openib-general] [PATCH] Fix MAD completion handling
In-Reply-To: <20040929093045.7a3b154a.mshefty@ichips.intel.com> (Sean Hefty's message of "Wed, 29 Sep 2004 09:30:45 -0700")
References: <52brfq9i66.fsf@topspin.com> <20040929093045.7a3b154a.mshefty@ichips.intel.com>
Message-ID: <528yat55k4.fsf@topspin.com>

    Sean> Shouldn't we be able to keep the ib_req_notify_cq at the end
    Sean> of this function? If additional completions are left after
    Sean> polling, a second event should be generated. Or at least
    Sean> that's what I remember from our discussions about this...

On Mellanox HCAs but not necessarily on spec-compliant HW...

 - R.
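
The ordering question in this exchange (request notification before or after draining the CQ) can be illustrated with a small model. This is only a sketch under stated assumptions: the toy_* names are hypothetical and are not the ib_verbs API, and the model assumes the spec-style behavior Roland alludes to, where arming a CQ that already holds completions does not by itself raise an event.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a completion queue. These names are hypothetical and are
 * not the ib_verbs API; only the arm/poll ordering is of interest. */
struct toy_cq {
	int pending;	/* completions waiting to be polled */
	bool armed;	/* set by a notify request, cleared when an event fires */
	int events;	/* completion events generated so far */
};

/* A new completion arrives; an event fires only if the CQ is armed. */
static void toy_cq_add_completion(struct toy_cq *cq)
{
	cq->pending++;
	if (cq->armed) {
		cq->armed = false;
		cq->events++;
	}
}

/* Assumed spec-style arming: completions already queued raise no event. */
static void toy_cq_req_notify(struct toy_cq *cq)
{
	cq->armed = true;
}

/* Remove one completion; returns 1 if one was polled, 0 if the CQ is empty. */
static int toy_cq_poll(struct toy_cq *cq)
{
	if (cq->pending == 0)
		return 0;
	cq->pending--;
	return 1;
}

/* Completions left queued with no event generated to report them. */
static int toy_stranded(const struct toy_cq *cq, int events_at_entry)
{
	return cq->events == events_at_entry ? cq->pending : 0;
}

/* Handler that drains first and arms last; 'late' completions arrive in
 * the window between the final poll and the notify request. */
static int toy_handler_arm_last(struct toy_cq *cq, int late)
{
	int events_at_entry = cq->events;

	assert(late >= 0);
	while (toy_cq_poll(cq))
		;
	while (late-- > 0)
		toy_cq_add_completion(cq);
	toy_cq_req_notify(cq);
	return toy_stranded(cq, events_at_entry);
}

/* Handler that arms first and then drains, the ordering used in the patch. */
static int toy_handler_arm_first(struct toy_cq *cq, int late)
{
	int events_at_entry = cq->events;

	assert(late >= 0);
	toy_cq_req_notify(cq);
	while (toy_cq_poll(cq))
		;
	while (late-- > 0)
		toy_cq_add_completion(cq);
	return toy_stranded(cq, events_at_entry);
}
```

In this model a completion that lands after the last poll is stranded when the handler arms last, while arming first turns it into a fresh event. Hardware that raises an event when armed with completions still queued would not show the difference, which appears to be the distinction being drawn above.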
From krause at cup.hp.com Wed Sep 29 10:46:39 2004 From: krause at cup.hp.com (Michael Krause) Date: Wed, 29 Sep 2004 10:46:39 -0700 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <1096478813.19157.21.camel@hpc-1> References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> <523c16l3wu.fsf@topspin.com> <6.1.2.0.2.20040927064203.01e60488@esmail.cup.hp.com> <52r7onhicd.fsf@topspin.com> <6.1.2.0.2.20040927081008.01da2090@esmail.cup.hp.com> <52oejrg1ux.fsf@topspin.com> <6.1.2.0.2.20040927085419.01f423c8@esmail.cup.hp.com> <52d607fzpl.fsf@topspin.com> <6.1.2.0.2.20040927135932.01e30340@esmail.cup.hp.com> <52pt467yy7.fsf@topspin.com> <6.1.2.0.2.20040928193055.0358e260@esmail.cup.hp.com> <52r7ol7pa5.fsf@topspin.com> <6.1.2.0.2.20040929062540.01e40080@esmail.cup.hp.com> <52sm915cn9.fsf@topspin.com> <6.1.2.0.2.20040929092739.01ea7aa0@esmail.cup.hp.com> <1096478813.19157.21.camel@hpc-1> Message-ID: <6.1.2.0.2.20040929104232.01f6f938@esmail.cup.hp.com> At 10:26 AM 9/29/2004, you wrote: >On Wed, 2004-09-29 at 12:48, Michael Krause wrote: > > Based on IETF discussions, our intent was: > > > > - For each partition that is enabled to support IP communication, the > > IP over IB implementation should join (create if the first) the > > associated "all nodes" multicast group. This is analogous to Ethernet > > VLAN usage model where if allowed to communicate, one does; hence, it > > isn't a decision. > >There is the "limited" broadcast group from which the parameters would >be derived for joining the "all nodes" multicast group (224.0.0.1). Correct. > > - When an endnode is enabled in the IB subnet and the IP over IB > > driver is configured, it can examine the configured P_Key and > > communicate with the SM/SA to determine what multicast groups are > > available. 
Based on this information, the endnode can request to join > > a multicast group thus enabling IP over IB to issue ARP / ND > > messages. > >I don't understand the last sentence of this. For a partition that an >IPoIB interface is on (which is one of the IB port's partitions), all >the relevant multicast groups can be obtained from the SA but what does >this have to do with enabling "ARP/ND". Doesn't the broadcast group >creation/join take care of ARP ? Apologies for my sentence structure. Yes. > > - An endnode would then need to set up an event notification to > > understand when partitions were updated - add or deleted - for its > > local ports. > >Unfortunately, there is no local partition table changed event defined in IBA. In what I was proposing, the change in IP service being provided for a given partition would result in a service event notification. You are correct that unless an endnode periodically examines its P_Key table per port for change, there is no method to know that an admin has effected a change in the partition space. The IP service with event notification would provide this state change as a service event. > > The endnode would also need to know whether it is the last member of > > the multicast group as well. > >Not sure why this is needed by the endnode. I presume you are referring >to the last full member. The group is deleted when the last full member >leaves the group. It isn't required that an endnode leave but if there is one around to listen, why remain in the multicast group. Mike -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From krause at cup.hp.com Wed Sep 29 10:48:09 2004 From: krause at cup.hp.com (Michael Krause) Date: Wed, 29 Sep 2004 10:48:09 -0700 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <1096478056.19157.8.camel@hpc-1> References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> <523c16l3wu.fsf@topspin.com> <6.1.2.0.2.20040927064203.01e60488@esmail.cup.hp.com> <52r7onhicd.fsf@topspin.com> <6.1.2.0.2.20040927081008.01da2090@esmail.cup.hp.com> <52oejrg1ux.fsf@topspin.com> <6.1.2.0.2.20040927085419.01f423c8@esmail.cup.hp.com> <52d607fzpl.fsf@topspin.com> <6.1.2.0.2.20040927135932.01e30340@esmail.cup.hp.com> <52pt467yy7.fsf@topspin.com> <6.1.2.0.2.20040928193055.0358e260@esmail.cup.hp.com> <52r7ol7pa5.fsf@topspin.com> <6.1.2.0.2.20040929062540.01e40080@esmail.cup.hp.com> <1096478056.19157.8.camel@hpc-1> Message-ID: <6.1.2.0.2.20040929104653.036712b0@esmail.cup.hp.com> At 10:14 AM 9/29/2004, you wrote: >On Wed, 2004-09-29 at 09:32, Michael Krause wrote: > > The IETF might be able to do something here but the best that might be > > done is perhaps an informational draft. Is this what is required to > > make forward progress here? > >While it is better than not specifying how this would work, >unfortunately, as you are no doubt aware, informational RFCs don't carry >the same weight as standards track RFCs :-( Why wouldn't this be made >standards track ? It could be if there is support; not clear if this would require a charter change as this now goes to how IP over IB interacts with the SA which is currently not specified. Mike -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From bill at strahm.net Wed Sep 29 10:59:22 2004 From: bill at strahm.net (bill at strahm.net) Date: Wed, 29 Sep 2004 10:59:22 -0700 Subject: [openib-general] IPoIB Loading and Starting Message-ID: <20040929175922.21632.qmail@webmail01.mesa1.secureserver.net> There would need to be a charter change to add this as a WG item. This is simply an e-mail between myself and Margarete (the Area Director) and is quite easy to do if the working group wants it done Bill > -------- Original Message -------- > Subject: Re: [openib-general] IPoIB Loading and Starting > From: "Michael Krause" > Date: Wed, September 29, 2004 10:48 am > To: openib-general at openib.org > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Wed Sep 29 11:09:02 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 29 Sep 2004 14:09:02 -0400 Subject: [openib-general] Re: [PATCH] tweaks to canceling MAD based on feedback. In-Reply-To: <20040929101853.07414b6b.mshefty@ichips.intel.com> References: <20040929101853.07414b6b.mshefty@ichips.intel.com> Message-ID: <1096481342.2322.10.camel@hpc-1> On Wed, 2004-09-29 at 13:18, Sean Hefty wrote: > Patch based on feedback from Krishna for canceling a MAD. Thanks. Applied. -- Hal From mshefty at ichips.intel.com Wed Sep 29 11:17:05 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 29 Sep 2004 11:17:05 -0700 Subject: [openib-general] [PATCH] ib_cancel_mad API Message-ID: <20040929111705.1711b347.mshefty@ichips.intel.com> Here's a patch for discussion for an API and implementation that should allow canceling a sent MAD. The primary use of the call is to cancel a MAD with a large timeout, but it should also allow cancel sending a large RMPP MAD. 
- Sean

--
Index: access/ib_mad.c
===================================================================
--- access/ib_mad.c (revision 913)
+++ access/ib_mad.c (working copy)
@@ -1028,6 +1028,54 @@
 	}
 }

+int ib_cancel_mad(struct ib_mad_agent *mad_agent,
+		  u64 wr_id)
+{
+	struct ib_mad_agent_private *mad_agent_priv;
+	struct ib_mad_send_wr_private *mad_send_wr;
+	struct ib_mad_send_wc mad_send_wc;
+	unsigned long flags;
+
+	mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private,
+				      agent);
+	spin_lock_irqsave(&mad_agent_priv->send_list_lock, flags);
+	list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list,
+			    agent_send_list) {
+		if (mad_send_wr->wr_id == wr_id)
+			goto found;
+	}
+	spin_unlock_irqrestore(&mad_agent_priv->send_list_lock, flags);
+	return -EINVAL;
+
+found:
+	if (mad_send_wr->status == IB_WC_SUCCESS)
+		mad_send_wr->status = IB_WC_WR_FLUSH_ERR;
+
+	if (mad_send_wr->timeout_ms) {
+		mad_send_wr->timeout_ms = 0;
+		mad_send_wr->refcount--;
+	}
+
+	if (mad_send_wr->refcount == 0) {
+		list_del(&mad_send_wr->agent_send_list);
+		spin_unlock_irqrestore(&mad_agent_priv->send_list_lock, flags);
+
+		mad_send_wc.status = IB_WC_WR_FLUSH_ERR;
+		mad_send_wc.vendor_err = 0;
+		mad_send_wc.wr_id = mad_send_wr->wr_id;
+		mad_agent_priv->agent.send_handler(&mad_agent_priv->agent,
+						   &mad_send_wc);
+
+		kfree(mad_send_wr);
+		if (atomic_dec_and_test(&mad_agent_priv->refcount))
+			wake_up(&mad_agent_priv->wait);
+		return 0;
+	}
+	spin_unlock_irqrestore(&mad_agent_priv->send_list_lock, flags);
+	return -EBUSY;
+}
+EXPORT_SYMBOL(ib_cancel_mad);
+
 /*
  * IB MAD thread
  */
Index: include/ib_mad.h
===================================================================
--- include/ib_mad.h (revision 913)
+++ include/ib_mad.h (working copy)
@@ -275,6 +275,17 @@
 void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc);

 /**
+ * ib_cancel_mad - Cancels an outstanding send MAD operation.
+ * @mad_agent - Specifies the registration associated with sent MAD.
+ * @wr_id - Indicates the work request identifier of the MAD to cancel.
+ *
+ * If the MAD is successfully canceled, it will be returned to the user through
+ * the corresponding ib_mad_send_handler.
+ */
+int ib_cancel_mad(struct ib_mad_agent *mad_agent,
+		  u64 wr_id);
+
+/**
  * ib_redirect_mad_qp - Registers a QP for MAD services.
  * @qp - Reference to a QP that requires MAD services.
  * @rmpp_version - If set, indicates that the client will send

From roland at topspin.com Wed Sep 29 11:21:24 2004
From: roland at topspin.com (Roland Dreier)
Date: Wed, 29 Sep 2004 11:21:24 -0700
Subject: [openib-general] [PATCH] ib_cancel_mad API
In-Reply-To: <20040929111705.1711b347.mshefty@ichips.intel.com> (Sean Hefty's message of "Wed, 29 Sep 2004 11:17:05 -0700")
References: <20040929111705.1711b347.mshefty@ichips.intel.com>
Message-ID: <52hdph3pi3.fsf@topspin.com>

+int ib_cancel_mad(struct ib_mad_agent *mad_agent,
+		  u64 wr_id)

Is wr_id enough to identify a sent MAD? I guess this imposes the requirement that consumers must not use duplicate wr_ids.

+{
+	struct ib_mad_agent_private *mad_agent_priv;
+	struct ib_mad_send_wr_private *mad_send_wr;
+	struct ib_mad_send_wc mad_send_wc;
+	unsigned long flags;
+
+	mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private,
+				      agent);
+	spin_lock_irqsave(&mad_agent_priv->send_list_lock, flags);
+	list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list,
+			    agent_send_list) {
+		if (mad_send_wr->wr_id == wr_id)
+			goto found;
+	}
+	spin_unlock_irqrestore(&mad_agent_priv->send_list_lock, flags);
+	return -EINVAL;

This is exactly the issue I was talking about last time. Since we remove the MAD from the send list before calling the consumer's send handler, it's entirely possible for ib_cancel_mad() to return -EINVAL with the send handler still running on another CPU ... oops.

 - R.
From mshefty at ichips.intel.com Wed Sep 29 11:25:07 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 29 Sep 2004 11:25:07 -0700 Subject: [openib-general] [PATCH] Fix MAD completion handling In-Reply-To: <528yat55k4.fsf@topspin.com> References: <52brfq9i66.fsf@topspin.com> <20040929093045.7a3b154a.mshefty@ichips.intel.com> <528yat55k4.fsf@topspin.com> Message-ID: <20040929112507.78b430dd.mshefty@ichips.intel.com> On Wed, 29 Sep 2004 10:49:15 -0700 Roland Dreier wrote: > Sean> Shouldn't we be able to keep the ib_req_notify_cq at the end > Sean> of this function? If additional completions are left after > Sean> polling, a second event should be generated. Or at least > Sean> that's what I remember from out discussions about this... > > On Mellanox HCAs but not necessarily on spec-compliant HW... I realize that, but I thought that's the behavior that we agreed that the API would have. I'm really fine either way, but want to be clear on the behavior. From mshefty at ichips.intel.com Wed Sep 29 11:33:38 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 29 Sep 2004 11:33:38 -0700 Subject: [openib-general] [PATCH] ib_cancel_mad API In-Reply-To: <52hdph3pi3.fsf@topspin.com> References: <20040929111705.1711b347.mshefty@ichips.intel.com> <52hdph3pi3.fsf@topspin.com> Message-ID: <20040929113338.5efa4eb2.mshefty@ichips.intel.com> On Wed, 29 Sep 2004 11:21:24 -0700 Roland Dreier wrote: > +int ib_cancel_mad(struct ib_mad_agent *mad_agent, > + u64 wr_id) > > Is wr_id enough to identify a sent MAD? I guess this imposes the > requirement that consumers must not use duplicate wr_ids. I didn't see any better alternative. But, yes, users wanting to be able to cancel MADs cannot use duplicate wr_ids. > This is exactly the issue I was talking about last time. 
Since we > remove the MAD from the send list before calling the consumer's send > handler, it's entirely possible for ib_cancel_mad() to return -EINVAL > with the send handler still running on another CPU ... oops. I'm not sure why this is an issue. The user receives exactly one callback for every sent MAD. Even if the MAD is found, the cancel operation will not complete until after all posted work requests have completed. From halr at voltaire.com Wed Sep 29 11:45:28 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 29 Sep 2004 14:45:28 -0400 Subject: [openib-general] [PATCH] ib_cancel_mad API In-Reply-To: <52hdph3pi3.fsf@topspin.com> References: <20040929111705.1711b347.mshefty@ichips.intel.com> <52hdph3pi3.fsf@topspin.com> Message-ID: <1096483528.2395.5.camel@hpc-1> On Wed, 2004-09-29 at 14:21, Roland Dreier wrote: > +int ib_cancel_mad(struct ib_mad_agent *mad_agent, > + u64 wr_id) > > Is wr_id enough to identify a sent MAD? I guess this imposes the > requirement that consumers must not use duplicate wr_ids. Isn't the requirement no duplicate wr_ids per mad agent ? Are there other fields you would propose ? Also, would a cancel all MADs be useful ? -- Hal From mshefty at ichips.intel.com Wed Sep 29 11:52:43 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 29 Sep 2004 11:52:43 -0700 Subject: [openib-general] [PATCH] ib_cancel_mad API In-Reply-To: <1096483528.2395.5.camel@hpc-1> References: <20040929111705.1711b347.mshefty@ichips.intel.com> <52hdph3pi3.fsf@topspin.com> <1096483528.2395.5.camel@hpc-1> Message-ID: <20040929115243.04886726.mshefty@ichips.intel.com> On Wed, 29 Sep 2004 14:45:28 -0400 Hal Rosenstock wrote: > > Is wr_id enough to identify a sent MAD? I guess this imposes the > > requirement that consumers must not use duplicate wr_ids. > > Isn't the requirement no duplicate wr_ids per mad agent ? Are there > other fields you would propose ? To cancel requests, I think that TID is sufficient. To allow canceling response MADs (e.g. 
large RMPP), you need something like: TID, SGID or SLID, and class. wr_id seemed simpler. > Also, would a cancel all MADs be useful ? Outside of deregistration, I doubt it. But we could add this call, and let deregistration call it, or force users to call it before deregistering. From mshefty at ichips.intel.com Wed Sep 29 11:56:58 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 29 Sep 2004 11:56:58 -0700 Subject: [openib-general] [PATCH] ib_cancel_mad API In-Reply-To: <20040929111705.1711b347.mshefty@ichips.intel.com> References: <20040929111705.1711b347.mshefty@ichips.intel.com> Message-ID: <20040929115658.25abc415.mshefty@ichips.intel.com> On Wed, 29 Sep 2004 11:17:05 -0700 Sean Hefty wrote: > + if (mad_send_wr->refcount == 0) { > + list_del(&mad_send_wr->agent_send_list); > + spin_unlock_irqrestore(&mad_agent_priv->send_list_lock, flags); > + > + mad_send_wc.status = IB_WC_WR_FLUSH_ERR; > + mad_send_wc.vendor_err = 0; > + mad_send_wc.wr_id = mad_send_wr->wr_id; > + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, > + &mad_send_wc); > + > + kfree(mad_send_wr); > + if (atomic_dec_and_test(&mad_agent_priv->refcount)) > + wake_up(&mad_agent_priv->wait); > + return 0; > + } > + spin_unlock_irqrestore(&mad_agent_priv->send_list_lock, flags); > + return -EBUSY; > +} I'm not sure that a client would care between a return code of 0, versus -EBUSY. Even if -EBUSY is returned, the canceled MAD could still have been returned to the user. 
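The return codes under discussion can be modelled outside the kernel. What follows is only a sketch under stated assumptions: `struct model_send` and `model_cancel()` are hypothetical names, a plain array stands in for the agent's send list, and the locking and the send_handler callback from the posted patch are left out. It shows the three outcomes a caller of the proposed ib_cancel_mad() might see when the lookup is keyed on wr_id alone.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one entry on a MAD agent's send list. */
struct model_send {
	uint64_t wr_id;   /* consumer-chosen work request ID */
	int refcount;     /* posted work requests not yet completed */
	int on_list;      /* nonzero while the entry is still on the list */
};

/*
 * Model of the cancel lookup (assumption: matching is by wr_id only).
 * Returns 0 if the send was found and could be retired at once,
 * -EBUSY if it was found but work requests are still posted, and
 * -EINVAL if no entry with that wr_id is on the list any more.
 */
static int model_cancel(struct model_send *sends, size_t n, uint64_t wr_id)
{
	for (size_t i = 0; i < n; i++) {
		struct model_send *s = &sends[i];

		if (!s->on_list || s->wr_id != wr_id)
			continue;
		if (s->refcount != 0)
			return -EBUSY;
		s->on_list = 0;   /* the real patch would call the send handler here */
		return 0;
	}
	return -EINVAL;
}
```

Because the first matching entry wins, the model also makes visible why duplicate wr_ids within one agent would be a problem for this interface.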
From bill at strahm.net Wed Sep 29 12:35:53 2004 From: bill at strahm.net (bill at strahm.net) Date: Wed, 29 Sep 2004 12:35:53 -0700 Subject: [openib-general] IPoIB Loading and Starting Message-ID: <20040929193553.31378.qmail@webmail02.mesa1.secureserver.net> > -------- Original Message -------- > Subject: RE: [openib-general] IPoIB Loading and Starting > From: "Hal Rosenstock" > Date: Wed, September 29, 2004 10:40 am > To: bill at strahm.net > Cc: "Roland Dreier" , openib-general at openib.org > > On Wed, 2004-09-29 at 13:06, bill at strahm.net wrote: > > This may have been answered by now, > > > > But in the process of creating the broadcast group there are quite a few > > parameters that are better off not being left to the end client (MTU > > being a big one) > > > > I can see this service record returning all of the needed parameters, > > therefore letting the central administrator have a central point of > > control > > As SA ServiceRecords do not contain MTU or other parameters, there would > need to be a standard encoding into the ServiceData fields for this. > There would also likely need to be a service naming scheme for > identifying which service record to find. > > -- Hal And that is what I would expect an IETF ID to specify - what the contents of the Service Record are (and how they map into the IPoIB Encap draft) - how to search for them correctly so that everyone can get the same answer, and how an administrator would set up an IPoIB subnet using the SA.
Bill From halr at voltaire.com Wed Sep 29 13:05:46 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 29 Sep 2004 16:05:46 -0400 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <6.1.2.0.2.20040929104232.01f6f938@esmail.cup.hp.com> References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> <523c16l3wu.fsf@topspin.com> <6.1.2.0.2.20040927064203.01e60488@esmail.cup.hp.com> <52r7onhicd.fsf@topspin.com> <6.1.2.0.2.20040927081008.01da2090@esmail.cup.hp.com> <52oejrg1ux.fsf@topspin.com> <6.1.2.0.2.20040927085419.01f423c8@esmail.cup.hp.com> <52d607fzpl.fsf@topspin.com> <6.1.2.0.2.20040927135932.01e30340@esmail.cup.hp.com> <52pt467yy7.fsf@topspin.com> <6.1.2.0.2.20040928193055.0358e260@esmail.cup.hp.com> <52r7ol7pa5.fsf@topspin.com> <6.1.2.0.2.20040929062540.01e40080@esmail.cup.hp.com> <52sm915cn9.fsf@topspin.com> <6.1.2.0.2.20040929092739.01ea7aa0@esmail.cup.hp.com> <1096478813.19157.21.camel@hpc-1> <6.1.2.0.2.20040929104232.01f6f938@esmail.cup.hp.com> Message-ID: <1096488346.2431.13.camel@hpc-1> On Wed, 2004-09-29 at 13:46, Michael Krause wrote: > In what I was proposing, the change in IP service being provided for a > given partition would result in a service event notification. You are > correct that unless an endnode periodically examines its P_Key table > per port for change, there is no method to know that an admin has > effected a change in the partition space. The IP service with event > notification would provide this state change as a service event. Are you saying that when a particular service record is created in the SA, an event is generated to a set of interested endnodes ? I don't think there is a way to do that. The only choice is for the endnode to continue to poll the service records based on the matching criteria which we would need to define (ServiceID or name perhaps). 
> It isn't required that an endnode leave but if there is one around to > listen, why remain in the multicast group. I don't think there is a way to tell a node is the last (full) member of the group. Anyhow, if this were done, how would that node know to rejoin once another node came along ? -- Hal From roland at topspin.com Wed Sep 29 13:12:36 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 29 Sep 2004 13:12:36 -0700 Subject: [openib-general] [PATCH] ib_cancel_mad API In-Reply-To: <20040929113338.5efa4eb2.mshefty@ichips.intel.com> (Sean Hefty's message of "Wed, 29 Sep 2004 11:33:38 -0700") References: <20040929111705.1711b347.mshefty@ichips.intel.com> <52hdph3pi3.fsf@topspin.com> <20040929113338.5efa4eb2.mshefty@ichips.intel.com> Message-ID: <52d6044yx7.fsf@topspin.com> Sean> I'm not sure why this is an issue. The user receives Sean> exactly one callback for every sent MAD. Even if the MAD is Sean> found, the cancel operation will not complete until after Sean> all posted work requests have completed. OK, think about the following scenario. We're in ib_mad_complete_send_wr(): /* Remove send from MAD agent and notify client of completion. */ list_del(&mad_send_wr->agent_send_list); spin_unlock_irqrestore(&mad_agent_priv->send_list_lock, flags); /* HERE ===> */ if (mad_send_wr->status != IB_WC_SUCCESS ) mad_send_wc->status = mad_send_wr->status; mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, mad_send_wc); and the kernel gets preempted or a long-running interrupt comes along where I've marked. 
Now, on another CPU (or after the preemption), a consumer calls ib_cancel_mad(), which does: spin_lock_irqsave(&mad_agent_priv->send_list_lock, flags); list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list, agent_send_list) { if (mad_send_wr->wr_id == wr_id) goto found; } spin_unlock_irqrestore(&mad_agent_priv->send_list_lock, flags); return -EINVAL; and doesn't find the work request, since ib_mad_complete_send_wr() has already removed it from the list, so it returns -EINVAL. The consumer says, "oh, OK, I have no pending requests so I can free my context." Then the first thread continues and proceeds to call the consumer's send_handler function. We can say that the consumer has to have reference counting or otherwise protect itself against this, but it makes more sense to me to avoid this sort of bug in common code rather than debugging every consumer... - R. From halr at voltaire.com Wed Sep 29 14:46:58 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 29 Sep 2004 17:46:58 -0400 Subject: [openib-general] Re: [PATCH] minor fix, formatting for MAD receive handling In-Reply-To: <20040929125532.355df4a4.mshefty@ichips.intel.com> References: <20040929125532.355df4a4.mshefty@ichips.intel.com> Message-ID: <1096494418.2476.9.camel@hpc-1> On Wed, 2004-09-29 at 15:55, Sean Hefty wrote: > This patch makes some minor updates to the MAD receive handling > routine. It removes a stack variable, reposts receives in all error > cases, and breaks long code lines. I'm sure that reposting receive buffers in the case of an invalid QP number makes sense but it doesn't make much of a difference as this is a pathological case anyhow and since convert_qpnum may go away in the long term so would this code. Thanks. Applied. 
-- Hal From halr at voltaire.com Wed Sep 29 14:58:41 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 29 Sep 2004 17:58:41 -0400 Subject: [openib-general] Naming police Message-ID: <1096495121.2429.2.camel@hpc-1> In ib_verbs.h, in ib_ah_attr struct, it is src_path_bits whereas in ib_wc struct, it is dlid_path_bits. Should the latter be dest_path_bits for naming consistency ? -- Hal From mshefty at ichips.intel.com Wed Sep 29 15:03:54 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 29 Sep 2004 15:03:54 -0700 Subject: [openib-general] [PATCH] ib_cancel_mad API In-Reply-To: <52d6044yx7.fsf@topspin.com> References: <20040929111705.1711b347.mshefty@ichips.intel.com> <52hdph3pi3.fsf@topspin.com> <20040929113338.5efa4eb2.mshefty@ichips.intel.com> <52d6044yx7.fsf@topspin.com> Message-ID: <20040929150354.7a6cd70f.mshefty@ichips.intel.com> On Wed, 29 Sep 2004 13:12:36 -0700 Roland Dreier wrote: > OK, think about the following scenario. We're in ib_mad_complete_send_wr(): I understand the scenario that you're describing. I just think that we've made it as easy as it can be for the client already. > and doesn't find the work request, since ib_mad_complete_send_wr() has > already removed it from the list, so it returns -EINVAL. > > The consumer says, "oh, OK, I have no pending requests so I can free > my context." Then the first thread continues and proceeds to call the > consumer's send_handler function. Currently, the consumer _only_ has to free their send context in their send MAD completion handler. No reference counting by the consumer is needed. And it doesn't matter if a send succeeds, times out, has an error, or is canceled. Clients that try to release their send contexts in both their send handlers and after calling cancel are asking for synchronization issues. If you're referring to a client's mad_agent context, I don't believe that we can support clients who free their context before deregistering (at least not in most usage cases).
- Sean From roland at topspin.com Wed Sep 29 15:18:12 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 29 Sep 2004 15:18:12 -0700 Subject: [openib-general] Naming police In-Reply-To: <1096495121.2429.2.camel@hpc-1> (Hal Rosenstock's message of "Wed, 29 Sep 2004 17:58:41 -0400") References: <1096495121.2429.2.camel@hpc-1> Message-ID: <52655w4t3v.fsf@topspin.com> Hal> In ib_verbs.h, in ib_ah_attr struct, it is src_path_bits Hal> whereas Hal> in ib_wc struct, it is dlid_path_bits Hal> Should the latter be dest_path_bits for naming consistency ? Maybe, although the two fields really are different -- in ib_ah_attr we are telling the HCA what value to use for the source LID in the message it sends. In ib_wc we are getting the value in the destination LID of the message we just received. - Roland From roland at topspin.com Wed Sep 29 15:20:20 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 29 Sep 2004 15:20:20 -0700 Subject: [openib-general] [PATCH] ib_cancel_mad API In-Reply-To: <20040929150354.7a6cd70f.mshefty@ichips.intel.com> (Sean Hefty's message of "Wed, 29 Sep 2004 15:03:54 -0700") References: <20040929111705.1711b347.mshefty@ichips.intel.com> <52hdph3pi3.fsf@topspin.com> <20040929113338.5efa4eb2.mshefty@ichips.intel.com> <52d6044yx7.fsf@topspin.com> <20040929150354.7a6cd70f.mshefty@ichips.intel.com> Message-ID: <521xgk4t0b.fsf@topspin.com> Sean> Currently, the consumer _only_ has to free their send Sean> context in their send MAD completion handler. No reference Sean> counting by the consumer is needed. And it doesn't matter Sean> if a send succeeds, timeouts, has an error, or is canceled. The consumer does need to do reference counting. If I cancel a MAD and get -EINVAL back, I might still get a completion later, so I have to have a count of the number of MADs outstanding and wait for that to go to zero. - R. 
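The consumer-side counting described above could look roughly like the following. This is a user-space sketch, not code from any posted patch: `struct consumer_ctx` and the `consumer_*` helpers are hypothetical names, C11 atomics stand in for the kernel's atomic_t, and the `freed` flag stands in for actually freeing the context.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-consumer state; not part of ib_mad.h. */
struct consumer_ctx {
	atomic_int outstanding;   /* sends posted but not yet completed */
	bool freed;               /* stand-in for freeing the context */
};

static void consumer_init(struct consumer_ctx *ctx)
{
	atomic_init(&ctx->outstanding, 0);
	ctx->freed = false;
}

/* Call once for every send MAD handed to the MAD layer. */
static void consumer_send_posted(struct consumer_ctx *ctx)
{
	atomic_fetch_add(&ctx->outstanding, 1);
}

/*
 * Call from the send handler for every completion, whatever its status
 * (success, timeout, error, or flushed by a cancel).
 * Returns true when this was the last outstanding send.
 */
static bool consumer_send_done(struct consumer_ctx *ctx)
{
	return atomic_fetch_sub(&ctx->outstanding, 1) == 1;
}

/*
 * Teardown path: the context may only go away once no completion can
 * still arrive, regardless of what the cancel call returned.
 */
static bool consumer_try_free(struct consumer_ctx *ctx)
{
	if (atomic_load(&ctx->outstanding) != 0)
		return false;
	ctx->freed = true;
	return true;
}
```

In kernel code the teardown path would presumably sleep until the count drops to zero rather than poll, but the rule is the same: the return value of the cancel call is not what decides when the context can be freed.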
From mshefty at ichips.intel.com Wed Sep 29 15:26:55 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 29 Sep 2004 15:26:55 -0700 Subject: [openib-general] [PATCH] ib_cancel_mad API In-Reply-To: <521xgk4t0b.fsf@topspin.com> References: <20040929111705.1711b347.mshefty@ichips.intel.com> <52hdph3pi3.fsf@topspin.com> <20040929113338.5efa4eb2.mshefty@ichips.intel.com> <52d6044yx7.fsf@topspin.com> <20040929150354.7a6cd70f.mshefty@ichips.intel.com> <521xgk4t0b.fsf@topspin.com> Message-ID: <20040929152655.5c37de79.mshefty@ichips.intel.com> On Wed, 29 Sep 2004 15:20:20 -0700 Roland Dreier wrote: > Sean> Currently, the consumer _only_ has to free their send > Sean> context in their send MAD completion handler. No reference > Sean> counting by the consumer is needed. And it doesn't matter > Sean> if a send succeeds, timeouts, has an error, or is canceled. > > The consumer does need to do reference counting. If I cancel a MAD > and get -EINVAL back, I might still get a completion later, so I have > to have a count of the number of MADs outstanding and wait for that to > go to zero. What is the client doing with the reference counting? When their send handler gets called, they free their send context. After they deregister, they free their mad_agent context. From sean.hefty at intel.com Wed Sep 29 15:47:27 2004 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 29 Sep 2004 15:47:27 -0700 Subject: [openib-general] Re: [PATCH] minor fix, formatting for MAD receive handling In-Reply-To: <1096494418.2476.9.camel@hpc-1> Message-ID: >I'm sure that reposting receive buffers in the case of an invalid QP >number makes sense but it doesn't make much of a difference as this is a >pathological case anyhow and since convert_qpnum may go away in the long >term so would this code. Thanks for applying this. 
What I'd really like to be able to do is pull the ib_mad_private receive buffer off the QP list before polling the CQ, to avoid having to copy the work completion structure. This would allow us to use pre-formatted ib_mad_recv_wc structures as well. Along those same lines, the send path can be optimized to avoid copying the work request structure for 256 byte MADs. From ftillier at infiniconsys.com Wed Sep 29 15:50:45 2004 From: ftillier at infiniconsys.com (Fab Tillier) Date: Wed, 29 Sep 2004 15:50:45 -0700 Subject: [openib-general] [PATCH] ib_cancel_mad API In-Reply-To: <20040929152655.5c37de79.mshefty@ichips.intel.com> Message-ID: <000001c4a676$c2ae4890$655aa8c0@infiniconsys.com> > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Wednesday, September 29, 2004 3:27 PM > > On Wed, 29 Sep 2004 15:20:20 -0700 > Roland Dreier wrote: > > > Sean> Currently, the consumer _only_ has to free their send > > Sean> context in their send MAD completion handler. No reference > > Sean> counting by the consumer is needed. And it doesn't matter > > Sean> if a send succeeds, timeouts, has an error, or is canceled. > > > > The consumer does need to do reference counting. If I cancel a MAD > > and get -EINVAL back, I might still get a completion later, so I have > > to have a count of the number of MADs outstanding and wait for that to > > go to zero. > > What is the client doing with the reference counting? When their send > handler gets called, they free their send context. After they deregister, > they free their mad_agent context. If deregistration is synchronous then as long as the MAD layer keeps reference counts for outstanding sends, the client does not need to. The client is guaranteed that their send callback will not be called after deregistration completes. As Sean mentioned, it's simpler for clients to know they will *always* get a send completion regardless of status. 
It allows them to do all the send completion processing in their handler, rather than having it split between the send handler and the cancel logic. In fact, I'd just rather remove the return value altogether - what can a client do with the return value that they wouldn't know from the status reported to the send handler? - Fab From mshefty at ichips.intel.com Wed Sep 29 16:08:56 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 29 Sep 2004 16:08:56 -0700 Subject: [openib-general] MAD request/response completion order Message-ID: <20040929160856.6f10d216.mshefty@ichips.intel.com> Does anyone have a preference which order request/response MADs complete? Sends first always? Receives first always? Whatever is convenient? - Sean -- From ftillier at infiniconsys.com Wed Sep 29 16:18:47 2004 From: ftillier at infiniconsys.com (Fab Tillier) Date: Wed, 29 Sep 2004 16:18:47 -0700 Subject: [openib-general] MAD request/response completion order In-Reply-To: <20040929160856.6f10d216.mshefty@ichips.intel.com> Message-ID: <000101c4a67a$acd2e5e0$655aa8c0@infiniconsys.com> > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Wednesday, September 29, 2004 4:09 PM > > Does anyone have a preference which order request/response MADs complete? > Sends first always? Receives first always? Whatever is convenient? > The only reason I can think of to report receives before sends is to support user-mode. If a receive can't be reported to a user-mode client, then the corresponding send should not carry a status value that would indicate that a receive was - it should be overridden to indicate something equivalent to a timeout since the receive will not be delivered. If we don't care about user-mode, then I think whatever is most convenient makes sense.
- Fab From yaronh at voltaire.com Wed Sep 29 16:24:55 2004 From: yaronh at voltaire.com (Yaron Haviv) Date: Thu, 30 Sep 2004 01:24:55 +0200 Subject: Fwd: Re: [openib-general] static LID computation withTS_HOST_DRIVER Message-ID: <35EA21F54A45CB47B879F21A91F4862F1DF719@taurus.voltaire.com> I agree with Dave that Static LID is problematic and we should think of other short and longer term alternatives for that (there are many cases where the SM may dictate a non-random LID allocation policy, e.g. LMC configuration changes, Subnet Merge, .. and the HCA is not aware of it). I believe that the need for it comes from applications that want to talk to some kind of a loopback adapter without depending on the port state or even before the port is up. A better solution that IBTA needs to look at is creating a well known Loopback LID value that apps use when they want to talk locally (like IP 127...) It may even be feasible to implement something on the existing HCA HW (by using one of the unused multicast LIDs and some firmware changes) Yaron > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Roland Dreier > Sent: Wednesday, September 29, 2004 5:24 PM > To: David M. Brean > Cc: openib-general at openib.org > Subject: Re: Fwd: Re: [openib-general] static LID computation > withTS_HOST_DRIVER > > David> Ok. How does the port inform the SM that it has a > David> "preferred" LID? > > The port will already have a LID assigned when the SM discovers it. > My understanding is that the SM is "encouraged" to preserve a port's > LID if it doesn't conflict with any other LIDs, and this is what we're > relying on.
> > - Roland From roland at topspin.com Wed Sep 29 16:54:18 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 29 Sep 2004 16:54:18 -0700 Subject: [openib-general] [PATCH] ib_cancel_mad API In-Reply-To: <20040929152655.5c37de79.mshefty@ichips.intel.com> (Sean Hefty's message of "Wed, 29 Sep 2004 15:26:55 -0700") References: <20040929111705.1711b347.mshefty@ichips.intel.com> <52hdph3pi3.fsf@topspin.com> <20040929113338.5efa4eb2.mshefty@ichips.intel.com> <52d6044yx7.fsf@topspin.com> <20040929150354.7a6cd70f.mshefty@ichips.intel.com> <521xgk4t0b.fsf@topspin.com> <20040929152655.5c37de79.mshefty@ichips.intel.com> Message-ID: <52u0tg3a39.fsf@topspin.com> Sean> What is the client doing with the reference counting? When Sean> their send handler gets called, they free their send Sean> context. After they deregister, they free their mad_agent Sean> context. OK, here's a realistic example. In IPoIB, I'll probably only create one mad_agent per HCA. Now suppose I want to remove a network interface, say because a P_Key is going away. I need to cancel any multicast queries that might be outstanding. With the current cancel API, if I don't reference count in the IPoIB driver, the race I described could cause me to get a multicast response after I've freed the network interface it corresponds to.
- Roland From roland at topspin.com Wed Sep 29 16:58:12 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 29 Sep 2004 16:58:12 -0700 Subject: [openib-general] [PATCH] ib_cancel_mad API In-Reply-To: <000001c4a676$c2ae4890$655aa8c0@infiniconsys.com> (Fab Tillier's message of "Wed, 29 Sep 2004 15:50:45 -0700") References: <000001c4a676$c2ae4890$655aa8c0@infiniconsys.com> Message-ID: <52pt4439wr.fsf@topspin.com> Fab> If deregistration is synchronous then as long as the MAD Fab> layer keeps reference counts for outstanding sends, the Fab> client does not need to. The client is guaranteed that their Fab> send callback will not be called after deregistration Fab> completes. I agree that deregistering an agent is fine with the current API. Fab> As Sean mentioned, it's simpler for clients to know they will Fab> *always* get a send completion regardless of status. It Fab> allows them to do all the send completion processing in their Fab> handler, rather than having it split between the send handler Fab> and the cancel logic. Fab> In fact, I'd just rather remove the return value all together Fab> - what can a client do with the return value that they Fab> wouldn't know from the status reported to the send handler? The problem with Sean's proposed API for canceling a single MAD send is that it's not synchronous. So clients have to wait for the callback of the send they want to cancel. I agree that as it stands the return value is not useful because no matter what it is, a completion may or may not come after the cancel call returns. 
- Roland From roland at topspin.com Wed Sep 29 17:09:23 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 29 Sep 2004 17:09:23 -0700 Subject: Fwd: Re: [openib-general] static LID computation withTS_HOST_DRIVER In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F1DF719@taurus.voltaire.com> (Yaron Haviv's message of "Thu, 30 Sep 2004 01:24:55 +0200") References: <35EA21F54A45CB47B879F21A91F4862F1DF719@taurus.voltaire.com> Message-ID: <52lles39e4.fsf@topspin.com> Yaron> A better solution that IBTA needs to look at is creating a Yaron> well known Loopback LID value that apps use when they want Yaron> to talk locally (like IP 127...) If/when IBTA specifies this and available hardware implements this, then this will be a great solution. In the meantime, I don't see a problem with having a mechanism to provide an initial value for PortInfo:LID _before_ the SM discovers the node. The SM must be able to deal with arbitrary values of this field, since it can't make any assumptions about the state of the subnet. I don't think the spec forbids having a mechanism to specify the initial value of PortInfo:LID. I agree that the spec places the SM under no obligation to preserve the LIDs it finds. However, if someone wants to run an application that requires loopback connections to be preserved, I see no problem with requiring an SM that provides an administrative mechanism for static LID assignment. The advantage of this scheme is that it requires no modifications to either the IB spec or existing IB hardware.
Thanks, Roland From ftillier at infiniconsys.com Wed Sep 29 17:15:43 2004 From: ftillier at infiniconsys.com (Fab Tillier) Date: Wed, 29 Sep 2004 17:15:43 -0700 Subject: [openib-general] [PATCH] ib_cancel_mad API In-Reply-To: <52pt4439wr.fsf@topspin.com> Message-ID: <000201c4a682$a0b2fae0$655aa8c0@infiniconsys.com> > From: Roland Dreier [mailto:roland at topspin.com] > Sent: Wednesday, September 29, 2004 4:58 PM > > Fab> If deregistration is synchronous then as long as the MAD > Fab> layer keeps reference counts for outstanding sends, the > Fab> client does not need to. The client is guaranteed that their > Fab> send callback will not be called after deregistration > Fab> completes. > > I agree that deregistering an agent is fine with the current API. > > Fab> As Sean mentioned, it's simpler for clients to know they will > Fab> *always* get a send completion regardless of status. It > Fab> allows them to do all the send completion processing in their > Fab> handler, rather than having it split between the send handler > Fab> and the cancel logic. > > Fab> In fact, I'd just rather remove the return value all together > Fab> - what can a client do with the return value that they > Fab> wouldn't know from the status reported to the send handler? > > The problem with Sean's proposed API for canceling a single MAD send > is that it's not synchronous. So clients have to wait for the > callback of the send they want to cancel. I agree that as it stands > the return value is not useful because no matter what it is, a > completion may or may not come after the cancel call returns. > I think as long as ib_cancel_mad can return before the corresponding send completes (i.e. return -EBUSY), you have this problem and client must provide their own synchronization, whether through reference counting or some other means. Returning -EBUSY from ib_cancel requires the caller to block until the send handler is invoked. 
This in turn means that there needs to be code so that the send handler can wakeup the canceling thread once the send is complete. I don't see the difference between such synchronization requirements to the client and reference counting. To solve this, you would need ib_cancel_mad to be a synchronous call that would block in the -EBUSY case (at which point the return value is also pointless). However, making this change requires all callers to be in a thread context suitable for blocking. I don't think we want to impose this sort of requirement for MAD cancellation. I think such a requirement is fine for MAD agent destruction though. So I see two options: 1. Implement reference counting to track your own sends if you plan on sharing a MAD agent. 2. Don't share MAD agents. - Fab From ftillier at infiniconsys.com Wed Sep 29 17:19:15 2004 From: ftillier at infiniconsys.com (Tillier, Fabian) Date: Wed, 29 Sep 2004 20:19:15 -0400 Subject: Fwd: Re: [openib-general] static LID computationwithTS_HOST_DRIVER Message-ID: <5D78D28F88822E4D8702BB9EEF1A436706240B@mercury.infiniconsys.com> > From: Roland Dreier [mailto:roland at topspin.com] > Sent: Wednesday, September 29, 2004 5:09 PM > > Yaron> A better solution that IBTA needs to look at is creating a > Yaron> well known Loopback LID value that apps use when they want > Yaron> to talk locally (like IP 127...) > > If/when IBTA specifies this and available hardware implements this, > then this will be a great solution. > > In the meantime, I don't see a problem with having a mechanism to > provide an initial value for PortInfo:LID _before_ the SM discovers > the node. The SM must be able to deal with arbitrary values of this > field, since it can't make any assumptions about the state of the > subnet. I don't think the spec forbids having a mechanism to specify > the initial value of PortInfo:LID. > > I agree that the spec places the SM under no obligation to preserve > the LIDs it finds. 
However, if someone wants to run an application > that requires loopback connections to be preserve, I see no problem > with requiring an SM that provides an administrative mechanism for > static LID assignment. > > The advantage of this scheme is that it requires no modifications to > either the IB spec or existing IB hardware. > To add to this, I think it is quite likely that an administrator would be setting such static LIDs - the IPoIB code won't just pick a LID out of the blue. This administrator is likely going to be the same administrator that configures the SM to use LMC > 1. I think we can trust that such an administrator will do what they can such that any statically assigned LIDs can be preserved with whatever SM settings they use. If they don't then they're a lousy administrator, and deserve to have their LIDs reassigned. - Fab From krause at cup.hp.com Wed Sep 29 17:38:21 2004 From: krause at cup.hp.com (Michael Krause) Date: Wed, 29 Sep 2004 17:38:21 -0700 Subject: Fwd: Re: [openib-general] static LID computation withTS_HOST_DRIVER In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F1DF719@taurus.voltaire.com > References: <35EA21F54A45CB47B879F21A91F4862F1DF719@taurus.voltaire.com> Message-ID: <6.1.2.0.2.20040929173614.03704ef0@esmail.cup.hp.com> At 04:24 PM 9/29/2004, you wrote: >I agree with Dave that Static LID is problematic and we should think of >other short and longer term alternative for that >(There are many cases where the SM may dictate a non random LID >allocation policy, E.g. LMC configuration changes, Subnet Merge, .. and >the HCA is not aware of it). > >I believe that the need for it comes from applications that want to talk >to some kind of a loop back adapter without depending on the port state >or even before the port is up. This is covered in the specification. >A better solution that IBTA needs to look at is creating a well known >Loopback LID value that apps use when they want to talk locally (like IP >127...) 
See section 17.3.1 version 1.1 page 919 of volume 1 which provides specific guidance on how loopback is implemented and what LID should be used. There is no reason for a loopback LID. This topic was debated and the spec reflects the outcome of the debate. Mike >It may even be feasible to implement something on the existing HCA HW >(by using one of the unused multicast LID's and some firmware changes) > >Yaron > > > -----Original Message----- > > From: openib-general-bounces at openib.org [mailto:openib-general- > > bounces at openib.org] On Behalf Of Roland Dreier > > Sent: Wednesday, September 29, 2004 5:24 PM > > To: David M. Brean > > Cc: openib-general at openib.org > > Subject: Re: Fwd: Re: [openib-general] static LID computation > > withTS_HOST_DRIVER > > > > David> Ok. How does the port inform the SM that it has a > > David> "preferred" LID? > > > > The port will already have a LID assigned when the SM discovers it. > > My understanding is that the SM is "encouraged" to preserve a port's > > LID if it doesn't conflict with any other LIDs, and this is what we're > > relying on. > > > > - Roland
From krause at cup.hp.com Wed Sep 29 17:49:58 2004 From: krause at cup.hp.com (Michael Krause) Date: Wed, 29 Sep 2004 17:49:58 -0700 Subject: Fwd: Re: [openib-general] static LID computation withTS_HOST_DRIVER In-Reply-To: <52lles39e4.fsf@topspin.com> References: <35EA21F54A45CB47B879F21A91F4862F1DF719@taurus.voltaire.com> <52lles39e4.fsf@topspin.com> Message-ID: <6.1.2.0.2.20040929174718.038a7150@esmail.cup.hp.com> At 05:09 PM 9/29/2004, Roland Dreier wrote: > Yaron> A better solution that IBTA needs to look at is creating a > Yaron> well known Loopback LID value that apps use when they want > Yaron> to talk locally (like IP 127...) > >If/when IBTA specifies this and available hardware implements this, >then this will be a great solution. As noted in my other response, the specs do not require modification to provide loopback service. >In the meantime, I don't see a problem with having a mechanism to provide >an initial value for PortInfo:LID _before_ the SM discovers the node. The >SM must be able to deal with arbitrary values of this field, since it >can't make any assumptions about the state of the subnet. I don't think >the spec forbids having a mechanism to specify the initial value of >PortInfo:LID. The SM is the only entity that is supposed to assign LID as well as the subnet prefix. The SM should not trust any CA / switch configuration if it has not configured it thus should wipe it out and replace it with what it deems best. As for the subnet merge problem, until the M_Key is sorted out, reassignment isn't an issue per se. Mike >I agree that the spec places the SM under no obligation to preserve the >LIDs it finds. However, if someone wants to run an application that >requires loopback connections to be preserve, I see no problem with >requiring an SM that provides an administrative mechanism for static LID >assignment. > >The advantage of this scheme is that it requires no modifications to >either the IB spec or existing IB hardware.
> >Thanks, > Roland From roland at topspin.com Wed Sep 29 18:10:38 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 29 Sep 2004 18:10:38 -0700 Subject: Fwd: Re: [openib-general] static LID computation withTS_HOST_DRIVER In-Reply-To: <6.1.2.0.2.20040929173614.03704ef0@esmail.cup.hp.com> (Michael Krause's message of "Wed, 29 Sep 2004 17:38:21 -0700") References: <35EA21F54A45CB47B879F21A91F4862F1DF719@taurus.voltaire.com> <6.1.2.0.2.20040929173614.03704ef0@esmail.cup.hp.com> Message-ID: <52hdpg36k1.fsf@topspin.com> Michael> See section 17.3.1 version 1.1 page 919 of volume 1 which Michael> provides specific guidance on how loopback is implemented Michael> and what LID should be used. There is no reason for a Michael> loopback LID. This topic was debated and the spec Michael> reflects the outcome of the debate. I think you missed the context of this discussion. There are applications that want to start up even if their node has not been discovered by the SM yet (and their local ports are in INIT or even DOWN). These applications have multiple processes that communicate via loopback RC QPs. If they use the current value of the node's LID (before the SM has finished discovery) and the SM then changes this LID, then the RC QPs have their connections broken and the application is not happy.
- Roland From roland at topspin.com Wed Sep 29 18:13:09 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 29 Sep 2004 18:13:09 -0700 Subject: Fwd: Re: [openib-general] static LID computation withTS_HOST_DRIVER In-Reply-To: <6.1.2.0.2.20040929174718.038a7150@esmail.cup.hp.com> (Michael Krause's message of "Wed, 29 Sep 2004 17:49:58 -0700") References: <35EA21F54A45CB47B879F21A91F4862F1DF719@taurus.voltaire.com> <52lles39e4.fsf@topspin.com> <6.1.2.0.2.20040929174718.038a7150@esmail.cup.hp.com> Message-ID: <52d60436fu.fsf@topspin.com> Michael> The SM is the only entity that is supposed to assign LID Michael> as well as the subnet prefix. The SM should not trust Michael> any CA / switch configuration if it has not configured it Michael> thus should wipe it out and replace it with what it deems Michael> best. I don't see anything in the spec that forbids a CA from having an arbitrary value in PortInfo:LID after initialization but before the SM discovery (please correct me if I missed something). I also don't see anything that forbids an SM implementation from providing a mechanism for preserving the LIDs it finds or administratively assigning LIDs. Of course none of this is required but I don't see a problem with allowing it. - Roland From roland at topspin.com Wed Sep 29 18:14:14 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 29 Sep 2004 18:14:14 -0700 Subject: [openib-general] [PATCH] ib_cancel_mad API In-Reply-To: <000201c4a682$a0b2fae0$655aa8c0@infiniconsys.com> (Fab Tillier's message of "Wed, 29 Sep 2004 17:15:43 -0700") References: <000201c4a682$a0b2fae0$655aa8c0@infiniconsys.com> Message-ID: <528yas36e1.fsf@topspin.com> Fab> I think as long as ib_cancel_mad can return before the Fab> corresponding send completes (i.e. return -EBUSY), you have Fab> this problem and client must provide their own Fab> synchronization, whether through reference counting or some Fab> other means. 
Fab> Returning -EBUSY from ib_cancel requires the caller to block Fab> until the send handler is invoked. This in turn means that Fab> there needs to be code so that the send handler can wakeup Fab> the canceling thread once the send is complete. I don't see Fab> the difference between such synchronization requirements to Fab> the client and reference counting. Fab> To solve this, you would need ib_cancel_mad to be a Fab> synchronous call that would block in the -EBUSY case (at Fab> which point the return value is also pointless). However, Fab> making this change requires all callers to be in a thread Fab> context suitable for blocking. I don't think we want to Fab> impose this sort of requirement for MAD cancellation. I Fab> think such a requirement is fine for MAD agent destruction Fab> though. That's fine, I can live with this. Then all we need to change is that ib_cancel_mad should not have a return value. - Roland From ftillier at infiniconsys.com Wed Sep 29 19:24:35 2004 From: ftillier at infiniconsys.com (Fab Tillier) Date: Wed, 29 Sep 2004 19:24:35 -0700 Subject: Fwd: Re: [openib-general] static LID computationwithTS_HOST_DRIVER In-Reply-To: <6.1.2.0.2.20040929174718.038a7150@esmail.cup.hp.com> Message-ID: <000301c4a694$a13badb0$655aa8c0@infiniconsys.com> > From: Michael Krause [mailto:krause at cup.hp.com] > Sent: Wednesday, September 29, 2004 5:50 PM > > The SM is the only entity that is supposed to assign LID as well as the > subnet prefix. The SM should not trust any CA / switch configuration if > it has not configured it thus should wipe it out and replace it with what > it deems best. As for the subnet merge problem, until the M_Key is sorted > out, reassignment isn't an issue per se. In the case where the SM crashes or is stopped and then restated and there is no failover SM, resetting all LIDs seems rather drastic. 
Even the case where the SM is stopped, upgraded, and then restarted need to account for situations where the fabric as configured by the previous SM, while fully functional, followed a different algorithm than the updated SM code. I don't see how an SM can distinguish between a LID assigned by a "micro-SM" embedded on every host system and one assigned by a previous incarnation. Resetting every assigned LID just because it can't be trusted would be quite disruptive IMO. If a CA/switch configuration does not cause problems, the SM should do its best to keep things from changing so as to minimize the impact of SM disruptions on overall fabric operation. - Fab From yaronh at voltaire.com Wed Sep 29 19:45:01 2004 From: yaronh at voltaire.com (Yaron Haviv) Date: Thu, 30 Sep 2004 04:45:01 +0200 Subject: Fwd: Re: [openib-general] static LID computationwithTS_HOST_DRIVER Message-ID: <35EA21F54A45CB47B879F21A91F4862F1DF71A@taurus.voltaire.com> > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Roland Dreier > Sent: Thursday, September 30, 2004 3:13 AM > To: Michael Krause > Cc: openib-general at openib.org > Subject: Re: Fwd: Re: [openib-general] static LID > computationwithTS_HOST_DRIVER > > Michael> The SM is the only entity that is supposed to assign LID > Michael> as well as the subnet prefix. The SM should not trust > Michael> any CA / switch configuration if it has not configured it > Michael> thus should wipe it out and replace it with what it deems > Michael> best. > > I don't see anything in the spec that forbids a CA from having an > arbitrary value in PortInfo:LID after initialization but before the SM > discovery (please correct me if I missed something). I also don't see > anything that forbids an SM implementation from providing a mechanism > for preserving the LIDs it finds or administratively assigning LIDs. 
> While I agree that other SM's in a recovery/merge phase should try and preserve the LID's I think a CA shouldn't just like it is not supposed to change its own P_Key table, and because it is not aware of the policy and/or the bigger picture. Applications should be designed to deal with LID changes or other RC connection failures. But any way out of curiosity how do you generate a unique LID (locally by the host) for every node in the fabric in a large fabric when the ports are down and the nodes don't talk to each other ? (I hope not through Ethernet :)) Or how do you anticipate the LMC value (LID spacing)? Yaron From roland at topspin.com Wed Sep 29 19:48:31 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 29 Sep 2004 19:48:31 -0700 Subject: Fwd: Re: [openib-general] static LID computationwithTS_HOST_DRIVER In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F1DF71A@taurus.voltaire.com> (Yaron Haviv's message of "Thu, 30 Sep 2004 04:45:01 +0200") References: <35EA21F54A45CB47B879F21A91F4862F1DF71A@taurus.voltaire.com> Message-ID: <52vfdw1ngg.fsf@topspin.com> Yaron> While I agree that other SM's in a recovery/merge phase Yaron> should try and preserve the LID's I think a CA shouldn't Yaron> just like it is not supposed to change its own P_Key table, I don't think the P_Key table is a good analogy. There is a very clear statement in the IB spec for what values should be in the P_Key table before the SM sets it. I don't know of any such statement for PortInfo:LID. Yaron> But any way out of curiosity how do you generate a unique Yaron> LID (locally by the host) for every node in the fabric in a Yaron> large fabric when the ports are down and the nodes don't Yaron> talk to each other ? (I hope not through Ethernet :)) Yaron> Or how do you anticipate the LMC value (LID spacing)? If the application cares about this, the administrator has to set things up so that it can work. Of course there are scenarios where this breaks. 
It's up to the user to avoid them if it matters. - Roland From sean.hefty at intel.com Wed Sep 29 22:21:30 2004 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 29 Sep 2004 22:21:30 -0700 Subject: [openib-general] [PATCH] ib_cancel_mad API In-Reply-To: <000201c4a682$a0b2fae0$655aa8c0@infiniconsys.com> Message-ID: >Returning -EBUSY from ib_cancel requires the caller to block until the send >handler is invoked. This in turn means that there needs to be code so that >the send handler can wakeup the canceling thread once the send is complete. >I don't see the difference between such synchronization requirements to the >client and reference counting. Note that returning -EBUSY from ib_cancel_mad would only indicate that a callback *might* be invoked. It could have already been called, which leads me to think that no return value would be better than one that a client tries to use. Also, MAD agents are registered per port, not per HCA, since an agent is tied to a specific QP. From roland at topspin.com Wed Sep 29 22:55:56 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 29 Sep 2004 22:55:56 -0700 Subject: [openib-general] [PATCH] ib_cancel_mad API In-Reply-To: (Sean Hefty's message of "Wed, 29 Sep 2004 22:21:30 -0700") References: Message-ID: <52is9w1es3.fsf@topspin.com> Sean> Note that returning -EBUSY from ib_cancel_mad would only Sean> indicate that a callback *might* be invoked. It could have Sean> already been called, which leads me to think that no return Sean> value would be better than one that a client tries to use. Agree... at least with the current implementation, the return value is worse than useless (since trying to use it will almost inevitably be buggy), so ib_cancel_mad should probably be a void function. - R. 
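For readers following the ib_cancel_mad thread, here is a minimal user-space sketch of the point Sean and Roland agree on above: the cancel call returns nothing, and the send handler is the only place where a client learns that a send has finished. Everything named demo_* is hypothetical and is not the openib code; locking, timeouts and the real work-completion structures are left out on purpose.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes, loosely modelled on IB_WC_SUCCESS and
 * IB_WC_WR_FLUSH_ERR. They are not the real enum values. */
enum demo_status { DEMO_SUCCESS, DEMO_FLUSH_ERR };

#define DEMO_MAX_SENDS 8

struct demo_send {
	uint64_t wr_id;
	int in_use;
};

struct demo_agent {
	struct demo_send sends[DEMO_MAX_SENDS];
	int outstanding;	/* client-side count, changed only on post and in the handler */
	int flushed;		/* completions reported with DEMO_FLUSH_ERR */
};

/* Send handler: the single place where the client retires a send. */
static void demo_send_handler(struct demo_agent *agent, uint64_t wr_id,
			      enum demo_status status)
{
	(void)wr_id;
	agent->outstanding--;
	if (status == DEMO_FLUSH_ERR)
		agent->flushed++;
}

/* Track a new send; returns 0 on success, -1 if the table is full. */
static int demo_post_send(struct demo_agent *agent, uint64_t wr_id)
{
	for (size_t i = 0; i < DEMO_MAX_SENDS; i++) {
		if (!agent->sends[i].in_use) {
			agent->sends[i].wr_id = wr_id;
			agent->sends[i].in_use = 1;
			agent->outstanding++;
			return 0;
		}
	}
	return -1;
}

/* Void cancel: if the send is still tracked, complete it through the
 * handler with a flush status. If it is unknown or already completed,
 * do nothing. There is no return value for the caller to misread. */
static void demo_cancel(struct demo_agent *agent, uint64_t wr_id)
{
	for (size_t i = 0; i < DEMO_MAX_SENDS; i++) {
		if (agent->sends[i].in_use && agent->sends[i].wr_id == wr_id) {
			agent->sends[i].in_use = 0;
			demo_send_handler(agent, wr_id, DEMO_FLUSH_ERR);
			return;
		}
	}
}
```

A client that posts two sends and cancels one of them sees exactly one flushed completion. Cancelling the same wr_id again, or one that was never posted, is a no-op, which is why the client keeps its own count rather than relying on a status from the cancel call.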
From halr at voltaire.com Thu Sep 30 06:56:20 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 30 Sep 2004 09:56:20 -0400 Subject: [openib-general] Re: [PATCH] minor fix, formatting for MAD receive handling In-Reply-To: References: Message-ID: <1096552579.1844.19.camel@localhost.localdomain> On Wed, 2004-09-29 at 18:47, Sean Hefty wrote: > What I'd really like to be able to do is pull the ib_mad_private > receive buffer off the QP list before polling the CQ, > to avoid having to copy the work completion structure. > This would allow us to use pre-formatted > ib_mad_recv_wc structures as well. This requires separate CQs for send and receive so the receive completions are separated out. I believe that change is part of your still pending patch which I have not forgotten and will get to. You would also need to know which QPN the receive completion is for. I forget whether that is part of the pending patch too. There are also the error cases to consider. > Along those same lines, the send path can be optimized to avoid > copying the work request structure for 256 byte MADs. I will look at the error cases for this too. I will look at doing these after I evaluate and integrate the pending patch. That will likely be next week as I am starting to test the send side right now. Once I get that a little further along, I will work on the pending patch, and then this (barring any other things that come up on the list). -- Hal From sean.hefty at intel.com Thu Sep 30 08:36:23 2004 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 30 Sep 2004 08:36:23 -0700 Subject: [openib-general] Re: [PATCH] minor fix, formatting for MAD receive handling In-Reply-To: <1096552579.1844.19.camel@localhost.localdomain> Message-ID: >On Wed, 2004-09-29 at 18:47, Sean Hefty wrote: >> What I'd really like to be able to do is pull the ib_mad_private >> receive buffer off the QP list before polling the CQ, >> to avoid having to copy the work completion structure. 
>> This would allow us to use pre-formatted >> ib_mad_recv_wc structures as well. > >This requires separate CQs for send and receive so the receive >completions are separated out. I believe that change is part of your >still pending patch which I have not forgotten and will get to. This is not part of the pending patch, but would be fairly easy to add to that patch. >You would also need to know which QPN the receive completion is for. >I forget whether that is part of the pending patch too. This is part of that patch. That patch provides each QP with its own CQ, so that the CQ context indicates the QP. >There are also the error cases to consider. Not all error handling is in the patch. What isn't handled yet is reposting or completing sends when a QP goes into the error state. >> Along those same lines, the send path can be optimized to avoid >> copying the work request structure for 256 byte MADs. > >I will look at the error cases for this too. Avoiding the copy isn't possible in all cases, but could be in some common ones. A copy would need to be made to handle the case for QP overrun. And if we want to repost all sends after an error occurs, then we may need a copy in all cases. We will also need to start thinking about how RMPP will fit into the picture as well. A MAD that results in posting multiple work requests makes error recovery slightly more difficult. 
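The remark above that giving each QP its own CQ lets the CQ context indicate the QP can be pictured with a small stand-alone model. This is only a sketch: the demo_* types are made up for illustration and stand in for the real CQ and QP objects. The only fact taken from IB itself is that the MAD QPs are QP0 and QP1.

```c
#include <stdint.h>

/* Hypothetical stand-ins for a QP and a CQ; not the ib_verbs.h types. */
struct demo_qp {
	uint32_t qp_num;		/* 0 for QP0 (SMI), 1 for QP1 (GSI) */
	unsigned long completions;	/* completions seen for this QP */
};

struct demo_cq {
	void *context;			/* opaque pointer handed back on completion */
};

/* With one CQ per QP, the QP itself can be stored as the CQ context. */
static void demo_cq_init(struct demo_cq *cq, struct demo_qp *qp)
{
	cq->context = qp;
}

/* Completion path: the QP is recovered from the CQ context alone, so the
 * handler does not need to inspect the work completion to learn the QPN. */
static uint32_t demo_handle_completion(struct demo_cq *cq)
{
	struct demo_qp *qp = cq->context;

	qp->completions++;
	return qp->qp_num;
}
```

With a single CQ shared by both QPs, the context could not carry this information, and the handler would have to learn the QP number some other way.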
From mshefty at ichips.intel.com Thu Sep 30 08:40:26 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 30 Sep 2004 08:40:26 -0700 Subject: [openib-general] MAD request/response completion order In-Reply-To: <000101c4a67a$acd2e5e0$655aa8c0@infiniconsys.com> References: <20040929160856.6f10d216.mshefty@ichips.intel.com> <000101c4a67a$acd2e5e0$655aa8c0@infiniconsys.com> Message-ID: <20040930084026.2721c89b.mshefty@ichips.intel.com> On Wed, 29 Sep 2004 16:18:47 -0700 "Fab Tillier" wrote: > > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > > Sent: Wednesday, September 29, 2004 4:09 PM > > > > Does anyone have a preference which order request/response MADs complete? > > Sends first always? Receives first always? Whatever is convenient? > > > > The only reason I can think of to report receives before sends is to support > user-mode. If a receive can't be reported to a user-mode client, then the > corresponding send should not carry a status value that would indicate that > a receive was - it should be over-ridden to indicate something equivalent to > a timeout since the receive will not be delivered. Looking at how the implementation is coming out, reporting receives before sends is a little easier than sends before receives. (A send may still have active work requests when a response is received.) So I will make this the behavior for the MAD layer. From mshefty at ichips.intel.com Thu Sep 30 09:55:09 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 30 Sep 2004 09:55:09 -0700 Subject: [openib-general] [PATCH] ib_cancel_mad API In-Reply-To: <20040929111705.1711b347.mshefty@ichips.intel.com> References: <20040929111705.1711b347.mshefty@ichips.intel.com> Message-ID: <20040930095509.509ceace.mshefty@ichips.intel.com> On Wed, 29 Sep 2004 11:17:05 -0700 Sean Hefty wrote: > Here's a patch for discussion for an API and implementation that should allow canceling a sent MAD. 
Patch is similar to previous patch, but removes the return code from
ib_cancel_mad.

- Sean

Index: access/ib_mad.c
===================================================================
--- access/ib_mad.c	(revision 915)
+++ access/ib_mad.c	(working copy)
@@ -1026,6 +1026,54 @@
 	}
 }
 
+void ib_cancel_mad(struct ib_mad_agent *mad_agent,
+		   u64 wr_id)
+{
+	struct ib_mad_agent_private *mad_agent_priv;
+	struct ib_mad_send_wr_private *mad_send_wr;
+	struct ib_mad_send_wc mad_send_wc;
+	unsigned long flags;
+
+	mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private,
+				      agent);
+	spin_lock_irqsave(&mad_agent_priv->send_list_lock, flags);
+	list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list,
+			    agent_send_list) {
+		if (mad_send_wr->wr_id == wr_id)
+			goto found;
+	}
+	spin_unlock_irqrestore(&mad_agent_priv->send_list_lock, flags);
+	return;
+
+found:
+	if (mad_send_wr->status == IB_WC_SUCCESS)
+		mad_send_wr->status = IB_WC_WR_FLUSH_ERR;
+
+	if (mad_send_wr->timeout_ms) {
+		mad_send_wr->timeout_ms = 0;
+		mad_send_wr->refcount--;
+	}
+
+	if (mad_send_wr->refcount != 0) {
+		spin_unlock_irqrestore(&mad_agent_priv->send_list_lock, flags);
+		return;
+	}
+
+	list_del(&mad_send_wr->agent_send_list);
+	spin_unlock_irqrestore(&mad_agent_priv->send_list_lock, flags);
+
+	mad_send_wc.status = IB_WC_WR_FLUSH_ERR;
+	mad_send_wc.vendor_err = 0;
+	mad_send_wc.wr_id = mad_send_wr->wr_id;
+	mad_agent_priv->agent.send_handler(&mad_agent_priv->agent,
+					   &mad_send_wc);
+
+	kfree(mad_send_wr);
+	if (atomic_dec_and_test(&mad_agent_priv->refcount))
+		wake_up(&mad_agent_priv->wait);
+}
+EXPORT_SYMBOL(ib_cancel_mad);
+
 /*
  * IB MAD thread
  */
Index: include/ib_mad.h
===================================================================
--- include/ib_mad.h	(revision 915)
+++ include/ib_mad.h	(working copy)
@@ -275,6 +275,17 @@
 void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc);
 
 /**
+ * ib_cancel_mad - Cancels an outstanding send MAD operation.
+ * @mad_agent - Specifies the registration associated with sent MAD.
+ * @wr_id - Indicates the work request identifier of the MAD to cancel.
+ *
+ * MADs will be returned to the user through the corresponding
+ * ib_mad_send_handler.
+ */
+void ib_cancel_mad(struct ib_mad_agent *mad_agent,
+		   u64 wr_id);
+
+/**
  * ib_redirect_mad_qp - Registers a QP for MAD services.
  * @qp - Reference to a QP that requires MAD services.
  * @rmpp_version - If set, indicates that the client will send

From ftillier at infiniconsys.com Thu Sep 30 10:06:36 2004
From: ftillier at infiniconsys.com (Fab Tillier)
Date: Thu, 30 Sep 2004 10:06:36 -0700
Subject: [openib-general] [PATCH] ib_cancel_mad API
In-Reply-To: <20040930095509.509ceace.mshefty@ichips.intel.com>
Message-ID: <000501c4a70f$d8b539c0$655aa8c0@infiniconsys.com>

> From: Sean Hefty [mailto:mshefty at ichips.intel.com]
> Sent: Thursday, September 30, 2004 9:55 AM
>
> On Wed, 29 Sep 2004 11:17:05 -0700
> Sean Hefty wrote:
>
> > Here's a patch for discussion for an API and implementation that should
> allow canceling a sent MAD.
>
> Patch is similar to previous patch, but removes the return code from
> ib_cancel_mad.
>
...
> +		if (mad_send_wr->wr_id == wr_id)
> +			goto found;
> +	}
> +	spin_unlock_irqrestore(&mad_agent_priv->send_list_lock, flags);
> +	return;
> +
> +found:

I find the "goto found" syntax ugly and confusing. It seems unnatural to
jump over the unlock like that. Just my personal opinion - I'd rather see:

+		if (mad_send_wr->wr_id != wr_id)
+			continue;

But that's just me.
- Fab From krause at cup.hp.com Thu Sep 30 10:34:29 2004 From: krause at cup.hp.com (Michael Krause) Date: Thu, 30 Sep 2004 10:34:29 -0700 Subject: Fwd: Re: [openib-general] static LID computationwithTS_HOST_DRIVER In-Reply-To: <000301c4a694$a13badb0$655aa8c0@infiniconsys.com> References: <6.1.2.0.2.20040929174718.038a7150@esmail.cup.hp.com> <000301c4a694$a13badb0$655aa8c0@infiniconsys.com> Message-ID: <6.1.2.0.2.20040930102857.01d38c80@esmail.cup.hp.com> At 07:24 PM 9/29/2004, Fab Tillier wrote: > > From: Michael Krause [mailto:krause at cup.hp.com] > > Sent: Wednesday, September 29, 2004 5:50 PM > > > > The SM is the only entity that is supposed to assign LID as well as the > > subnet prefix. The SM should not trust any CA / switch configuration if > > it has not configured it thus should wipe it out and replace it with what > > it deems best. As for the subnet merge problem, until the M_Key is sorted > > out, reassignment isn't an issue per se. > >In the case where the SM crashes or is stopped and then restated and there >is no failover SM, resetting all LIDs seems rather drastic. Even the case >where the SM is stopped, upgraded, and then restarted need to account for >situations where the fabric as configured by the previous SM, while fully >functional, followed a different algorithm than the updated SM code. I >don't see how an SM can distinguish between a LID assigned by a "micro-SM" >embedded on every host system and one assigned by a previous incarnation. >Resetting every assigned LID just because it can't be trusted would be quite >disruptive IMO. If a CA/switch configuration does not cause problems, the >SM should do its best to keep things from changing so as to minimize the >impact of SM disruptions on overall fabric operation. Examine the purpose and associated specification text regarding the M_Key. The SM can distinguish between a locally assigned value and one it assigned through the use of the M_Key. 
If the SM is also reasonably robust, it would also implement a SM database to understand what CA / switch exist in the fabric, how addressing was assigned, SL / VL arbitration, etc. The SM is supposed to be smart and thus should enable recovery. As for resetting because it cannot be trusted, that is exactly what the IBTA intended. If in doubt, then reset the values so that one does not violate the objective of partitioning and the defined trust domains. This is no different in desire than the growing use of 802.1x for Ethernet fabric login. Customers want to know that the components that are communicating have been properly identified and configured to communicate within a defined partition / trust domain. If any component is blindly trusted, then that puts the fabric and other components at risk. Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From krause at cup.hp.com Thu Sep 30 10:40:48 2004 From: krause at cup.hp.com (Michael Krause) Date: Thu, 30 Sep 2004 10:40:48 -0700 Subject: Fwd: Re: [openib-general] static LID computationwithTS_HOST_DRIVER In-Reply-To: <52vfdw1ngg.fsf@topspin.com> References: <35EA21F54A45CB47B879F21A91F4862F1DF71A@taurus.voltaire.com> <52vfdw1ngg.fsf@topspin.com> Message-ID: <6.1.2.0.2.20040930103540.01d2d908@esmail.cup.hp.com> At 07:48 PM 9/29/2004, Roland Dreier wrote: > Yaron> While I agree that other SM's in a recovery/merge phase > Yaron> should try and preserve the LID's I think a CA shouldn't > Yaron> just like it is not supposed to change its own P_Key table, > >I don't think the P_Key table is a good analogy. There is a very >clear statement in the IB spec for what values should be in the P_Key >table before the SM sets it. I don't know of any such statement for >PortInfo:LID. 
> > Yaron> But any way out of curiosity how do you generate a unique > Yaron> LID (locally by the host) for every node in the fabric in a > Yaron> large fabric when the ports are down and the nodes don't > Yaron> talk to each other ? (I hope not through Ethernet :)) > > Yaron> Or how do you anticipate the LMC value (LID spacing)? > >If the application cares about this, the administrator has to set >things up so that it can work. > >Of course there are scenarios where this breaks. It's up to the user >to avoid them if it matters. The whole point of a central management scheme was to simplify the management of the fabric and enable one to scale a solution across a large number of endnodes. A LID defines a path in the fabric not just an endnode. The LMC defines a range of LID assigned to a port and thus the number of paths supported within the fabric. This allows traffic to be routed through the fabric to meet specific QoS objectives or to avoid congestion on a given path under varying workloads. Attempting to just assign this at the endnode is simply wrong and certainly goes against the intention of the IBTA. If really want a static LID and its only purpose is loopback, then do this within the CI and don't mess with the port configuration. It is cleaner and will make life easier for the customers as there will not be conflicting policies nor a requirement that the customer has to know every nuance of a particular implementation. The goal is supposed to make IB easier to deploy for customers and not require them to be experts at all aspect of IB operation. Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From roland at topspin.com Thu Sep 30 11:41:07 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 30 Sep 2004 11:41:07 -0700 Subject: [openib-general] Re: Advice needed on IP-over-InfiniBand driver In-Reply-To: <20040927215244.697aaa02.davem@davemloft.net> (David S. 
Miller's message of "Mon, 27 Sep 2004 21:52:44 -0700") References: <52fz5esxx6.fsf@topspin.com> <20040919140133.60ea3fb3.davem@davemloft.net> <52r7onc8ev.fsf@topspin.com> <20040927215244.697aaa02.davem@davemloft.net> Message-ID: <521xgj1tx8.fsf@topspin.com> David> I think you might learn something by having a look at what David> net/atm/clip.c is doing, it creates it's own neighbour David> layer for CLIP ATM neighbours. It is in a similar boat to David> your IPoIB stuff. Thanks, this suggestion was very helpful. I think I'm making progress. Now I know my next question :) CLIP ATM is a little different from IPoIB in that it completely replaces the ARP layer with its own ARP daemon. For IPoIB I don't want to reinvent the ARP and ND code -- I just want to add a secondary lookup after the response comes back. I think I have an idea of how to do that and then stash the information in the struct neighbour, so that my hard_start_xmit method can get it from skb->dst (ala clip.c). However, it seems that broadcast ARP packets have skb->dst == NULL. Is it safe for me to assume that packets with skb->dst == NULL are broadcast packets? Will multicast packets have a non-NULL dst? Thanks, Roland From halr at voltaire.com Thu Sep 30 11:58:54 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 30 Sep 2004 14:58:54 -0400 Subject: [openib-general] mthca and DDR not hidden Message-ID: <1096570734.2393.9.camel@hpc-1> I have a question relative to sending UD transport. This may have been discussed before but I don't have an easy way to search for it right now... 
I attempted to send UD and got stopped by the following in mthca_av.c: int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, struct ib_ud_header *header) { if (ah->on_hca) return -EINVAL; on_hca is set due to the following in mthca_create_ah: if (!atomic_read(&pd->sqp_count) && !(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { index = mthca_alloc(&dev->av_table.alloc); /* fall back to allocate in host memory */ if (index == -1) goto host_alloc; av = kmalloc(sizeof *av, GFP_KERNEL); if (!av) goto host_alloc; ah->on_hca = 1; Is it a requirement to run the HCA with DDR hidden ? Is there a way to get the AH when the DDR is not hidden ? If so, should this support be added into mthca ? Thanks. -- Hal From roland at topspin.com Thu Sep 30 12:10:13 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 30 Sep 2004 12:10:13 -0700 Subject: [openib-general] mthca and DDR not hidden In-Reply-To: <1096570734.2393.9.camel@hpc-1> (Hal Rosenstock's message of "Thu, 30 Sep 2004 14:58:54 -0400") References: <1096570734.2393.9.camel@hpc-1> Message-ID: <52wtybzi7e.fsf@topspin.com> Hal> I attempted to send UD and got stopped by the following in mthca_av.c: Hal> int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, Hal> struct ib_ud_header *header) Hal> { Hal> if (ah->on_hca) Hal> return -EINVAL; Hal> on_hca is set due to the following in mthca_create_ah: Hal> if (!atomic_read(&pd->sqp_count) && Hal> !(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { [...] Hal> ah->on_hca = 1; Hal> Is it a requirement to run the HCA with DDR hidden ? Notice that on_hca won't be set if either the DDR is hidden _or_ the PD being used has special QPs attached. We only need to reread the AH for sends on QP0/QP1 (not general UD QPs), so this shouldn't be a problem in practice. Are you using the same PD for your QPs and AHs (they have to match for correct operation)? I run nearly all of my HCAs without the DDR hidden and it works fine. 
Hal> Is there a way to get the AH when the DDR is not hidden ? If so, should Hal> this support be added into mthca ? Yes, it's possible. However I don't think it's urgent because things should work fine as it stands now. The only issue seems to be that we may not generate the correct error when an AH from a different PD is used on QP0/QP1. - Roland From mshefty at ichips.intel.com Thu Sep 30 12:16:28 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 30 Sep 2004 12:16:28 -0700 Subject: [openib-general] [PATCH] request/response matching in MAD code Message-ID: <20040930121628.2e966a1f.mshefty@ichips.intel.com> The following patch should match response MADs with the corresponding request. A response without a matching request is discarded, and responses are reported before requests. Timeouts of request MADs are not yet handled. - Sean -- Index: access/ib_mad_priv.h =================================================================== --- access/ib_mad_priv.h (revision 915) +++ access/ib_mad_priv.h (working copy) @@ -119,6 +119,7 @@ struct list_head agent_send_list; struct ib_mad_agent *agent; u64 wr_id; /* client WRID */ + u64 tid; int timeout_ms; int refcount; enum ib_wc_status status; Index: access/ib_mad.c =================================================================== --- access/ib_mad.c (revision 915) +++ access/ib_mad.c (working copy) @@ -87,6 +87,8 @@ static int ib_mad_post_receive_mads(struct ib_mad_port_private *priv); static inline u8 convert_mgmt_class(u8 mgmt_class); static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv); +static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, + struct ib_mad_send_wc *mad_send_wc); /* * ib_register_mad_agent - Register to send/receive MADs @@ -344,6 +346,8 @@ return -ENOMEM; } + mad_send_wr->tid = ((struct ib_mad_hdr*)(unsigned long) + send_wr->sg_list->addr)->tid; mad_send_wr->agent = mad_agent; mad_send_wr->timeout_ms = cur_send_wr->wr.ud.timeout_ms; if 
(mad_send_wr->timeout_ms) @@ -740,6 +744,81 @@ return valid; } +/* + * Return start of fully reassembled MAD, or NULL, if MAD isn't assembled yet. + */ +static struct ib_mad_private* reassemble_recv(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_private *recv) +{ + /* Until we have RMPP, all receives are reassembled!... */ + return recv; +} + +static struct ib_mad_send_wr_private* +find_send_req(struct ib_mad_agent_private *mad_agent_priv, + u64 tid) +{ + struct ib_mad_send_wr_private *mad_send_wr; + + list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list, + agent_send_list) { + + if (mad_send_wr->tid == tid) { + /* Verify request is still valid. */ + if (mad_send_wr->status == IB_WC_SUCCESS && + mad_send_wr->timeout_ms) + return mad_send_wr; + else + return NULL; + } + } + return NULL; +} + +static void ib_mad_complete_recv(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_private *recv, + int solicited) +{ + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + unsigned long flags; + + /* Fully reassemble receive before processing. */ + recv = reassemble_recv(mad_agent_priv, recv); + if (!recv) + return; + + /* Complete corresponding request. */ + if (solicited) { + spin_lock_irqsave(&mad_agent_priv->send_list_lock, flags); + mad_send_wr = find_send_req(mad_agent_priv, + recv->mad.mad.mad_hdr.tid); + if (!mad_send_wr) { + spin_unlock_irqrestore(&mad_agent_priv->send_list_lock, + flags); + ib_free_recv_mad(&recv->header.recv_wc); + return; + } + mad_send_wr->timeout_ms = 0; + spin_unlock_irqrestore(&mad_agent_priv->send_list_lock, flags); + + /* Defined behavior is to complete response before request. 
*/ + mad_agent_priv->agent.recv_handler(&mad_agent_priv->agent, + &recv->header.recv_wc); + atomic_dec(&mad_agent_priv->refcount); + + mad_send_wc.status = IB_WC_SUCCESS; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = mad_send_wr->wr_id; + ib_mad_complete_send_wr(mad_send_wr, &mad_send_wc); + } else { + mad_agent_priv->agent.recv_handler(&mad_agent_priv->agent, + &recv->header.recv_wc); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + } +} + static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, struct ib_wc *wc) { @@ -797,17 +876,10 @@ /* Setup MAD receive work completion from "normal" work completion */ recv->header.recv_wc.wc = wc; - recv->header.recv_wc.mad_len = sizeof(struct ib_mad); /* Should this be based on wc->byte_len ? Also, RMPP !!! */ + recv->header.recv_wc.mad_len = sizeof(struct ib_mad); recv->header.recv_wc.recv_buf = &recv->header.recv_buf; - - /* Setup MAD receive buffer */ - INIT_LIST_HEAD(&recv->header.recv_buf.list); /* More for RMPP!!! */ recv->header.recv_buf.mad = (struct ib_mad *)&recv->mad; - if (wc->wc_flags & IB_WC_GRH) { - recv->header.recv_buf.grh = &recv->grh; - } else { - recv->header.recv_buf.grh = NULL; - } + recv->header.recv_buf.grh = &recv->grh; /* Validate MAD */ if (!validate_mad(recv->header.recv_buf.mad, qp_num)) @@ -820,21 +892,11 @@ solicited); if (!mad_agent) { spin_unlock_irqrestore(&port_priv->reg_lock, flags); - printk(KERN_ERR "No matching mad agent found for receive MAD\n"); + printk(KERN_NOTICE "No matching mad agent found for receive MAD\n"); } else { atomic_inc(&mad_agent->refcount); spin_unlock_irqrestore(&port_priv->reg_lock, flags); - if (solicited) { - /* Walk the send posted list to find the match !!! 
*/ - printk(KERN_DEBUG "Receive solicited MAD currently unsupported\n"); - } - - /* Invoke receive callback */ - mad_agent->agent.recv_handler(&mad_agent->agent, - &recv->header.recv_wc); - - if (atomic_dec_and_test(&mad_agent->refcount)) - wake_up(&mad_agent->wait); + ib_mad_complete_recv(mad_agent, recv, solicited); } ret: From mshefty at ichips.intel.com Thu Sep 30 12:24:35 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 30 Sep 2004 12:24:35 -0700 Subject: [openib-general] [PATCH] request/response matching in MAD code In-Reply-To: <20040930121628.2e966a1f.mshefty@ichips.intel.com> References: <20040930121628.2e966a1f.mshefty@ichips.intel.com> Message-ID: <20040930122435.4356f965.mshefty@ichips.intel.com> Some additional comments related to changes in the receive processing are below. > static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, > struct ib_wc *wc) > { > @@ -797,17 +876,10 @@ > - recv->header.recv_wc.mad_len = sizeof(struct ib_mad); /* Should this be based on wc->byte_len ? Also, RMPP !!! */ > + recv->header.recv_wc.mad_len = sizeof(struct ib_mad); I believe that this is correct, to avoid including the GRH size. For RMPP, I expect that the RMPP code will need to update the recv_wc.mad_len once all segments have been assembled. > - if (wc->wc_flags & IB_WC_GRH) { > - recv->header.recv_buf.grh = &recv->grh; > - } else { > - recv->header.recv_buf.grh = NULL; > - } > + recv->header.recv_buf.grh = &recv->grh; The API is defined such that the grh pointer is always valid; it's the data that may or may not be. I think this will make it easier for clients to repost receive buffers on redirected QPs. 
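A minimal sketch of what the "grh pointer is always valid" rule means for a client, assuming the IB_WC_GRH flag name and the wc/recv_buf/grh field names that appear in the patch above. The struct definitions below are simplified stand-ins rather than the real ib_verbs.h/ib_mad.h declarations, the flag value is made up for illustration, and recv_grh_if_valid() is a hypothetical helper, not part of the MAD API:

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel types discussed in the thread.
 * The flag value and struct layouts are assumptions for illustration. */
#define IB_WC_GRH 0x1

struct ib_grh { unsigned char raw[40]; };
struct ib_wc { int wc_flags; };
struct ib_mad_recv_buf { struct ib_grh *grh; };
struct ib_mad_recv_wc {
	struct ib_wc *wc;
	struct ib_mad_recv_buf *recv_buf;
};

/*
 * Hypothetical client-side helper: the grh pointer is always set, so it
 * cannot be used to tell whether a GRH arrived. Only the completion flag
 * says whether the bytes behind the pointer are meaningful.
 */
static const struct ib_grh *
recv_grh_if_valid(const struct ib_mad_recv_wc *recv_wc)
{
	if (recv_wc->wc->wc_flags & IB_WC_GRH)
		return recv_wc->recv_buf->grh;
	return NULL;	/* buffer exists, but its contents are undefined */
}
```

With this shape, a client that reposts the buffer can keep using recv_buf->grh unconditionally and only consults the flag when it wants to read the GRH contents.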
From mshefty at ichips.intel.com Thu Sep 30 12:36:14 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 30 Sep 2004 12:36:14 -0700 Subject: [openib-general] [PATCH] ib_cancel_mad API In-Reply-To: <000501c4a70f$d8b539c0$655aa8c0@infiniconsys.com> References: <20040930095509.509ceace.mshefty@ichips.intel.com> <000501c4a70f$d8b539c0$655aa8c0@infiniconsys.com> Message-ID: <20040930123614.4eb864a2.mshefty@ichips.intel.com> On Thu, 30 Sep 2004 10:06:36 -0700 "Fab Tillier" wrote: > I find the "goto found" syntax ugly and confusing. It seems unnatural to > jump over the unlock like that. Does this patch (version 3) seem more natural and less confusing to you? :) - Sean Index: access/ib_mad.c =================================================================== --- access/ib_mad.c (revision 915) +++ access/ib_mad.c (working copy) @@ -1026,6 +1026,65 @@ } } +static struct ib_mad_send_wr_private* +find_send_by_wr_id(struct ib_mad_agent_private *mad_agent_priv, + u64 wr_id) +{ + struct ib_mad_send_wr_private *mad_send_wr; + + list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list, + agent_send_list) { + if (mad_send_wr->wr_id == wr_id) + return mad_send_wr; + } + return NULL; +} + +void ib_cancel_mad(struct ib_mad_agent *mad_agent, + u64 wr_id) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + unsigned long flags; + + mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, + agent); + spin_lock_irqsave(&mad_agent_priv->send_list_lock, flags); + mad_send_wr = find_send_by_wr_id(mad_agent_priv, wr_id); + if (!mad_send_wr) { + spin_unlock_irqrestore(&mad_agent_priv->send_list_lock, flags); + return; + } + + if (mad_send_wr->status == IB_WC_SUCCESS) + mad_send_wr->status = IB_WC_WR_FLUSH_ERR; + + if (mad_send_wr->timeout_ms) { + mad_send_wr->timeout_ms = 0; + mad_send_wr->refcount--; + } + + if (mad_send_wr->refcount != 0) { + 
spin_unlock_irqrestore(&mad_agent_priv->send_list_lock, flags); + return; + } + + list_del(&mad_send_wr->agent_send_list); + spin_unlock_irqrestore(&mad_agent_priv->send_list_lock, flags); + + mad_send_wc.status = IB_WC_WR_FLUSH_ERR; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + kfree(mad_send_wr); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); +} +EXPORT_SYMBOL(ib_cancel_mad); + /* * IB MAD thread */ Index: include/ib_mad.h =================================================================== --- include/ib_mad.h (revision 915) +++ include/ib_mad.h (working copy) @@ -275,6 +275,17 @@ void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc); /** + * ib_cancel_mad - Cancels an outstanding send MAD operation. + * @mad_agent - Specifies the registration associated with sent MAD. + * @wr_id - Indicates the work request identifier of the MAD to cancel. + * + * MADs will be returned to the user through the corresponding + * ib_mad_send_handler. + */ +void ib_cancel_mad(struct ib_mad_agent *mad_agent, + u64 wr_id); + +/** * ib_redirect_mad_qp - Registers a QP for MAD services. * @qp - Reference to a QP that requires MAD services. 
* @rmpp_version - If set, indicates that the client will send From halr at voltaire.com Thu Sep 30 12:46:20 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 30 Sep 2004 15:46:20 -0400 Subject: [openib-general] mthca and DDR not hidden In-Reply-To: <52wtybzi7e.fsf@topspin.com> References: <1096570734.2393.9.camel@hpc-1> <52wtybzi7e.fsf@topspin.com> Message-ID: <1096573580.5392.5.camel@hpc-1> On Thu, 2004-09-30 at 15:10, Roland Dreier wrote: > Hal> I attempted to send UD and got stopped by the following in mthca_av.c: > > Hal> int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, > Hal> struct ib_ud_header *header) > Hal> { > Hal> if (ah->on_hca) > Hal> return -EINVAL; > > Hal> on_hca is set due to the following in mthca_create_ah: > > Hal> if (!atomic_read(&pd->sqp_count) && > Hal> !(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { > > [...] > > Hal> ah->on_hca = 1; > > Hal> Is it a requirement to run the HCA with DDR hidden ? > > Notice that on_hca won't be set if either the DDR is hidden _or_ the > PD being used has special QPs attached. We only need to reread the AH > for sends on QP0/QP1 (not general UD QPs), so this shouldn't be a > problem in practice. Are you using the same PD for your QPs and AHs > (they have to match for correct operation)? Oops. Looks like we need a way to expose the PD and MR to be able to do this. How about adding this into the mad_agent structure returned ? If that makes sense, I will generate the patch for this. > I run nearly all of my HCAs without the DDR hidden and it works fine. > > Hal> Is there a way to get the AH when the DDR is not hidden ? If so, should > Hal> this support be added into mthca ? > > Yes, it's possible. However I don't think it's urgent because things > should work fine as it stands now. Agreed (and there already is a TODO list item on DDR hidden relative to UD AHs (other than QP0/1)). 
> The only issue seems to be that we
> may not generate the correct error when an AH from a different PD is
> used on QP0/QP1.

What should be the correct error for this case ?

Thanks.

-- Hal

From roland at topspin.com Thu Sep 30 12:50:25 2004
From: roland at topspin.com (Roland Dreier)
Date: Thu, 30 Sep 2004 12:50:25 -0700
Subject: [openib-general] mthca and DDR not hidden
In-Reply-To: <1096573580.5392.5.camel@hpc-1> (Hal Rosenstock's message of "Thu, 30 Sep 2004 15:46:20 -0400")
References: <1096570734.2393.9.camel@hpc-1> <52wtybzi7e.fsf@topspin.com> <1096573580.5392.5.camel@hpc-1>
Message-ID: <52ekkjzgce.fsf@topspin.com>

    Hal> Oops. Looks like we need a way to expose the PD and MR to be
    Hal> able to do this. How about adding this into the mad_agent
    Hal> structure returned ? If that makes sense, I will generate the
    Hal> patch for this.

Yep, looks that way. I didn't notice at the time, but reusing the
ib_send_wr structure for ib_post_send_mad() implies that the consumer
is responsible for creating/destroying AHs, which definitely means the
consumer has to be able to see the PD with the special QPs.

struct ib_qp already has a struct ib_pd *, so maybe the consumer can
just use mad_agent->qp->pd?

 - R.

From mshefty at ichips.intel.com Thu Sep 30 12:54:14 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 30 Sep 2004 12:54:14 -0700
Subject: [openib-general] [PATCH] code/stack cleanup in find_mad_agent/validate_mad routines
Message-ID: <20040930125414.47bca27e.mshefty@ichips.intel.com>

Patch removes a couple of stack variables, eliminates gotos, and reformats lines over 80 columns.
- Sean -- Index: access/ib_mad.c =================================================================== --- access/ib_mad.c (revision 915) +++ access/ib_mad.c (working copy) @@ -670,7 +670,7 @@ struct ib_mad *mad, int solicited) { - struct ib_mad_agent_private *entry, *mad_agent = NULL; + struct ib_mad_agent_private *entry; struct ib_mad_mgmt_class_table *version; struct ib_mad_mgmt_method_table *class; u32 hi_tid; @@ -680,64 +680,52 @@ /* Routing is based on high 32 bits of transaction ID of MAD */ hi_tid = mad->mad_hdr.tid >> 32; list_for_each_entry(entry, &port_priv->agent_list, agent_list) { - if (entry->agent.hi_tid == hi_tid) { - mad_agent = entry; - break; - } - } - if (!mad_agent) { - printk(KERN_ERR "No client 0x%x for received MAD\n", - (u32)(mad->mad_hdr.tid >> 32)); - goto ret; + if (entry->agent.hi_tid == hi_tid) + return entry; } + printk(KERN_NOTICE "No client 0x%x for received MAD\n", hi_tid); } else { /* Routing is based on version, class, and method */ if (mad->mad_hdr.class_version >= MAX_MGMT_VERSION) { - printk(KERN_ERR "MAD received with unsupported class version %d\n", - mad->mad_hdr.class_version); - goto ret; + printk(KERN_ERR "MAD received with unsupported class " + "version %d\n", mad->mad_hdr.class_version); + return NULL; } version = port_priv->version[mad->mad_hdr.class_version]; if (!version) { - printk(KERN_ERR "MAD received for class version %d with no client\n", mad->mad_hdr.class_version); - goto ret; + printk(KERN_ERR "MAD received for class version %d " + "with no client\n", mad->mad_hdr.class_version); + return NULL; } - class = version->method_table[convert_mgmt_class(mad->mad_hdr.mgmt_class)]; + class = version->method_table[ + convert_mgmt_class(mad->mad_hdr.mgmt_class)]; if (!class) { - printk(KERN_ERR "MAD receive for class %d with no client\n", mad->mad_hdr.mgmt_class); - goto ret; + printk(KERN_ERR "MAD receive for class %d with no " + "client\n", mad->mad_hdr.mgmt_class); + return NULL; } - mad_agent = 
class->agent[mad->mad_hdr.method & ~IB_MGMT_METHOD_RESP]; + return class->agent[mad->mad_hdr.method & ~IB_MGMT_METHOD_RESP]; } - -ret: - return mad_agent; + return NULL; } static int validate_mad(struct ib_mad *mad, u32 qp_num) { - int valid = 0; - /* Make sure MAD base version is understood */ if (mad->mad_hdr.base_version != IB_MGMT_BASE_VERSION) { - printk(KERN_ERR "MAD received with unsupported base version %d\n", - mad->mad_hdr.base_version); - goto ret; + printk(KERN_ERR "MAD received with unsupported base " + "version %d\n", mad->mad_hdr.base_version); + return 0; } /* Filter SMI packets sent to other than QP0 */ if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED) || (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)) { - if (qp_num == 0) - valid = 1; + return (qp_num == 0); } else { /* Filter GSI packets sent to QP0 */ - if (qp_num != 0) - valid = 1; + return (qp_num != 0); } - -ret: - return valid; } static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, From halr at voltaire.com Thu Sep 30 12:59:47 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 30 Sep 2004 15:59:47 -0400 Subject: [openib-general] mthca and DDR not hidden In-Reply-To: <52ekkjzgce.fsf@topspin.com> References: <1096570734.2393.9.camel@hpc-1> <52wtybzi7e.fsf@topspin.com> <1096573580.5392.5.camel@hpc-1> <52ekkjzgce.fsf@topspin.com> Message-ID: <1096574387.2285.1.camel@hpc-1> On Thu, 2004-09-30 at 15:50, Roland Dreier wrote: > struct ib_qp already has a struct ib_pd *, so maybe the consumer can > just use mad_agent->qp->pd? Yup. He's already got a way to get the PD. Rather than the MR, I think just adding the lkey to the mad_agent structure will suffice. Do you agree ? 
-- Hal From roland at topspin.com Thu Sep 30 13:03:06 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 30 Sep 2004 13:03:06 -0700 Subject: [openib-general] mthca and DDR not hidden In-Reply-To: <1096574387.2285.1.camel@hpc-1> (Hal Rosenstock's message of "Thu, 30 Sep 2004 15:59:47 -0400") References: <1096570734.2393.9.camel@hpc-1> <52wtybzi7e.fsf@topspin.com> <1096573580.5392.5.camel@hpc-1> <52ekkjzgce.fsf@topspin.com> <1096574387.2285.1.camel@hpc-1> Message-ID: <52acv7zfr9.fsf@topspin.com> Hal> Yup. He's already got a way to get the PD. Rather than the Hal> MR, I think just adding the lkey to the mad_agent structure Hal> will suffice. Do you agree ? I guess so... it seems a little odd to make the consumer copy the L_Key from one place to another without knowing anything about the MR. I guess overloading the ib_send_wr struct the way we are forces these layering violations though. - R. From sean.hefty at intel.com Thu Sep 30 13:25:51 2004 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 30 Sep 2004 13:25:51 -0700 Subject: [openib-general] mthca and DDR not hidden In-Reply-To: <52acv7zfr9.fsf@topspin.com> Message-ID: > Hal> Yup. He's already got a way to get the PD. Rather than the > Hal> MR, I think just adding the lkey to the mad_agent structure > Hal> will suffice. Do you agree ? > >I guess so... it seems a little odd to make the consumer copy the >L_Key from one place to another without knowing anything about the MR. >I guess overloading the ib_send_wr struct the way we are forces these >layering violations though. The intent of using the ib_send_wr structure is to allow posting MADs directly onto the QP, without the access layer needing to translate from one work request type structure to another. If there's not a reasonably clean way to reuse the work request structure, we can create a new ib_send_mad_wr structure, push address handle creation into the MAD layer, or do something else. 
I need to reread through this mail thread, because I do not understand
what the issue is yet. What is a client of the MAD API missing in order
to post to the QP?

From roland at topspin.com Thu Sep 30 13:32:31 2004
From: roland at topspin.com (Roland Dreier)
Date: Thu, 30 Sep 2004 13:32:31 -0700
Subject: [openib-general] mthca and DDR not hidden
In-Reply-To: (Sean Hefty's message of "Thu, 30 Sep 2004 13:25:51 -0700")
References:
Message-ID: <52655vzee8.fsf@topspin.com>

    Sean> I need to reread through this mail thread, because I do not
    Sean> understand what the issue is yet. What is a client of the
    Sean> MAD API missing in order to post to the QP?

To post a send to a special QP through ib_send_wr, you need an AH to
specify the destination, and each gather entry must have an L_Key.
The underlying QP, the AH, and the MR for the L_Key all must be
attached to the same PD, which means that the consumer must use the PD
that the MAD layer used to create the QP to create its AH. Also, the
consumer must either create its own MR using that PD, or have access
to at least the L_Key for the MAD layer's MR.

 - R.

From halr at voltaire.com Thu Sep 30 13:40:05 2004
From: halr at voltaire.com (Hal Rosenstock)
Date: Thu, 30 Sep 2004 16:40:05 -0400
Subject: [openib-general] mthca and DDR not hidden
In-Reply-To: <52655vzee8.fsf@topspin.com>
References: <52655vzee8.fsf@topspin.com>
Message-ID: <1096576805.2885.2.camel@hpc-1>

On Thu, 2004-09-30 at 16:32, Roland Dreier wrote:
> Also, the
> consumer must either create its own MR using that PD, or have access
> to at least the L_Key for the MAD layer's MR.

That would be an acceptable solution too: to have the consumer create
the MR based on the MAD PD. Then the lkey doesn't need exposing, but
exposing and copying the lkey seems simpler to me.

-- Hal

From David.Brean at Sun.COM Thu Sep 30 13:39:04 2004
From: David.Brean at Sun.COM (David M. Brean)
Date: Thu, 30 Sep 2004 16:39:04 -0400
Subject: Fwd: Re: [openib-general] static LID computationwithTS_HOST_DRIVER
In-Reply-To: <52vfdw1ngg.fsf@topspin.com>
References: <35EA21F54A45CB47B879F21A91F4862F1DF71A@taurus.voltaire.com> <52vfdw1ngg.fsf@topspin.com>
Message-ID: <415C6EE8.4010203@sun.com>

I think Yaron's point is that the P_Key table can only be updated by a
management entity, such as the SM, that has the M_Key. The default
values in the P_Key table are established as a result of port
initialization - an external entity is not involved.

-David

Roland Dreier wrote:

> Yaron> While I agree that other SM's in a recovery/merge phase
> Yaron> should try and preserve the LID's I think a CA shouldn't
> Yaron> just like it is not supposed to change its own P_Key table,
>
>I don't think the P_Key table is a good analogy. There is a very
>clear statement in the IB spec for what values should be in the P_Key
>table before the SM sets it. I don't know of any such statement for
>PortInfo:LID.
>
>
>

From roland at topspin.com Thu Sep 30 13:48:52 2004
From: roland at topspin.com (Roland Dreier)
Date: Thu, 30 Sep 2004 13:48:52 -0700
Subject: Fwd: Re: [openib-general] static LID computationwithTS_HOST_DRIVER
In-Reply-To: <415C6EE8.4010203@sun.com> (David M. Brean's message of "Thu, 30 Sep 2004 16:39:04 -0400")
References: <35EA21F54A45CB47B879F21A91F4862F1DF71A@taurus.voltaire.com> <52vfdw1ngg.fsf@topspin.com> <415C6EE8.4010203@sun.com>
Message-ID: <521xgjzdmz.fsf@topspin.com>

    David> I think Yaron's point is that the P_Key table can only be
    David> updated by a management entity, such as the SM, that has
    David> the M_Key. The default values in the P_Key table are
    David> established as a result of port initialization - an
    David> external entity is not involved.

You can think about the initial value in PortInfo:LID in the same
way-- the result of device initialization before the link is brought up.
 - Roland

From David.Brean at Sun.COM Thu Sep 30 14:01:29 2004
From: David.Brean at Sun.COM (David M. Brean)
Date: Thu, 30 Sep 2004 17:01:29 -0400
Subject: Fwd: Re: [openib-general] static LID computationwithTS_HOST_DRIVER
In-Reply-To: <521xgjzdmz.fsf@topspin.com>
References: <35EA21F54A45CB47B879F21A91F4862F1DF71A@taurus.voltaire.com> <52vfdw1ngg.fsf@topspin.com> <415C6EE8.4010203@sun.com> <521xgjzdmz.fsf@topspin.com>
Message-ID: <415C7429.9080302@sun.com>

I'm describing what is in the current IBA. The IBA describes the
conditions where a P_Key value should be set into the P_Key table.
There is no similar description for LIDs in the IBA.

-David

Roland Dreier wrote:

> David> I think Yaron's point is that the P_Key table can only be
> David> updated by a management entity, such as the SM, that has
> David> the M_Key. The default values in the P_Key table are
> David> established as a result of port initialization - an
> David> external entity is not involved.
>
>You can think about the initial value in PortInfo:LID in the same
>way-- the result of device initialization before the link is brought up.
>
> - Roland
>
>

From mshefty at ichips.intel.com Thu Sep 30 14:04:29 2004
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 30 Sep 2004 14:04:29 -0700
Subject: [openib-general] mthca and DDR not hidden
In-Reply-To: <1096576805.2885.2.camel@hpc-1>
References: <52655vzee8.fsf@topspin.com> <1096576805.2885.2.camel@hpc-1>
Message-ID: <20040930140429.486106c3.mshefty@ichips.intel.com>

On Thu, 30 Sep 2004 16:40:05 -0400 Hal Rosenstock wrote:

> On Thu, 2004-09-30 at 16:32, Roland Dreier wrote:
> > Also, the
> > consumer must either create its own MR using that PD, or have access
> > to at least the L_Key for the MAD layer's MR.
>
> That would be an acceptable solution too: to have the consumer create
> the MR based on the MAD PD. Then the lkey doesn't need exposing, but
> exposing and copying the lkey seems simpler to me.

Okay - I think I got it now.
Some more thoughts on the topic:

It seems like having the consumer create the MR would be the more
generic solution, but the least efficient.

Exposing the MR implies that the user knows the access layer
implementation in order to get the right virtual address when posting
MADs. (The user must know that the MR applies to all of system memory.)
Giving out only the lkey seems less desirable, since the client must
also assume the starting virtual address.

Another alternative is for the access layer to set the lkey value (and
adjust the addr if needed) in the SGEs when posting. This seems
somewhat equivalent to the access layer performing the memory
registration on the user's behalf... (not recommending this, just
mentioning it)

From roland at topspin.com Thu Sep 30 14:07:52 2004
From: roland at topspin.com (Roland Dreier)
Date: Thu, 30 Sep 2004 14:07:52 -0700
Subject: Fwd: Re: [openib-general] static LID computationwithTS_HOST_DRIVER
In-Reply-To: <415C7429.9080302@sun.com> (David M. Brean's message of "Thu, 30 Sep 2004 17:01:29 -0400")
References: <35EA21F54A45CB47B879F21A91F4862F1DF71A@taurus.voltaire.com> <52vfdw1ngg.fsf@topspin.com> <415C6EE8.4010203@sun.com> <521xgjzdmz.fsf@topspin.com> <415C7429.9080302@sun.com>
Message-ID: <52wtybxy6v.fsf@topspin.com>

    David> I'm describing what is in the current IBA. The IBA
    David> describes the conditions where a P_Key value should be set
    David> into the P_Key table. There is no similar description for
    David> LIDs in the IBA.

Right, as I said before, that's what I thought (but I wasn't sure I
hadn't missed something). So the IBA doesn't make any statement about
what value PortInfo:LID should be initialized to. Presumably the IBA
doesn't require that PortInfo:LID be initialized to a random value
contained in an uninitialized memory location, so I don't see a
problem with the SMA initializing PortInfo:LID to an algorithmically
determined value as part of boot-up. And, unlike the P_Key table, the
IBA makes no statement about this initialization algorithm.

 - Roland

From dlstevens at us.ibm.com Thu Sep 30 14:21:20 2004
From: dlstevens at us.ibm.com (David Stevens)
Date: Thu, 30 Sep 2004 14:21:20 -0700
Subject: [openib-general] Re: Advice needed on IP-over-InfiniBand driver
In-Reply-To: <521xgj1tx8.fsf@topspin.com>
Message-ID:

> However, it seems that broadcast ARP packets have skb->dst == NULL.
> Is it safe for me to assume that packets with skb->dst == NULL are
> broadcast packets? Will multicast packets have a non-NULL dst?

I think it would be a mistake to use skb->dst as a flag for unicast or
not. Even if it is correct in all cases you care about now (I don't
know either way), it would be a hidden dependency with high potential
to break something eventually.

+-DLS

From roland at topspin.com Thu Sep 30 14:48:53 2004
From: roland at topspin.com (Roland Dreier)
Date: Thu, 30 Sep 2004 14:48:53 -0700
Subject: [openib-general] Re: Advice needed on IP-over-InfiniBand driver
In-Reply-To: (David Stevens's message of "Thu, 30 Sep 2004 14:21:20 -0700")
References:
Message-ID: <52sm8zxwai.fsf@topspin.com>

    David> I think it would be a mistake to use skb->dst as a flag for
    David> unicast or not. Even if it is correct in all cases you care
    David> about now (I don't know either way), it would be a hidden
    David> dependency with high potential to break something
    David> eventually.

That's kind of what I thought. But since my packets have no L2 header
in them, I don't know what hard_start_xmit can look at other than
skb->dst. I guess hard_header could put some info in skb->cb -- is cb
available for net device use between hard_header and hard_start_xmit,
or does someone else still own it?

Thanks,
  Roland

From David.Brean at Sun.COM Thu Sep 30 15:40:02 2004
From: David.Brean at Sun.COM (David M. Brean)
Date: Thu, 30 Sep 2004 18:40:02 -0400
Subject: Fwd: Re: [openib-general] static LID computation withTS_HOST_DRIVER
In-Reply-To: <52d60436fu.fsf@topspin.com>
References: <35EA21F54A45CB47B879F21A91F4862F1DF719@taurus.voltaire.com> <52lles39e4.fsf@topspin.com> <6.1.2.0.2.20040929174718.038a7150@esmail.cup.hp.com> <52d60436fu.fsf@topspin.com>
Message-ID: <415C8B42.1080304@sun.com>

The IBA provides two mechanisms for updating subnet management data:

1) through the verbs - see Modify HCA (section 11.2.1.3)
2) through Subnet management packets (SMPs) - see Subnet Management
Class (section 14.2)

The IBA only supports updating the LID via SMPs (#2 above) and an entity
using SMPs must have the M_Key. If that entity doesn't have the M_Key,
then it can't reliably change the LID.

In addition, the IBA allows an endnode to request, through the verbs
interface provided for the "node reinitialization" (see 14.4.4)
mechanism, that subnet management state, such as the LID, be preserved
when a port transitions through the DOWN state. However, the SM may
not honor that request so the endnode must handle that possibility
because LID assignment policy is owned by the SM. Furthermore, this
mechanism is used on ports that have previously been initialized by
the SM (maybe that's why it's called the reinitialization function :)).

Given the mechanisms in the specification, I think that it's possible
to have IB clients use loopback, even under the endnode power-up
scenario, while the port is not in the ACTIVE state and have them
continue without disruption when the port is made ACTIVE on the subnet
by the SM with use of the reinitialization mechanism. This is a very
useful mechanism for various failover situations.

There is no current IBA mechanism or protocol for an endnode to set
just the LID, even if it had the M_Key, and have the SM preserve that
value.
-David Roland Dreier wrote: >I don't see anything in the spec that forbids a CA from having an >arbitrary value in PortInfo:LID after initialization but before the SM >discovery (please correct me if I missed something). I also don't see >anything that forbids an SM implementation from providing a mechanism >for preserving the LIDs it finds or administratively assigning LIDs. > >Of course none of this is required but I don't see a problem with >allowing it. > From David.Brean at Sun.COM Thu Sep 30 15:48:56 2004 From: David.Brean at Sun.COM (David M. Brean) Date: Thu, 30 Sep 2004 18:48:56 -0400 Subject: Fwd: Re: [openib-general] static LID computation withTS_HOST_DRIVER In-Reply-To: <415C8B42.1080304@sun.com> References: <35EA21F54A45CB47B879F21A91F4862F1DF719@taurus.voltaire.com> <52lles39e4.fsf@topspin.com> <6.1.2.0.2.20040929174718.038a7150@esmail.cup.hp.com> <52d60436fu.fsf@topspin.com> <415C8B42.1080304@sun.com> Message-ID: <415C8D58.9020500@sun.com> David M. Brean wrote: > The IBA provides two mechanisms for updating subnet management data: > > 1) through the verbs - see Modify HCA (section 11.2.1.3) > 2) through Subnet management packets (SMPs) - see Subnet Management > Class (section 14.2) > > The IBA only supports updating the LID via SMPs (#2 above) and an entity > using SMPs must have the M_Key. If that entity doesn't have the M_Key, > then it can't reliably change the LID. > > In addition, the IBA allows an endnode to request, through the verbs > interface provided for the "node reinitialization" (see 14.4.4) > mechanism, that subnet management state, such as the LID, be > preserved, when a port transitions through the DOWN state. However, > the SM may not honor that request so the endnode must handle that > possibility because LID assignment policy is owned by the SM. > Furthermore, this mechanism is used on ports that have previously been > initialized by the SM (maybe that's why it's called the > reinitialization function :)). 
> > Given the mechanisms in the specification, I think that its possible > to have IB clients use loopback, even under the endnode power-up > scenario, while the port is not in the ACTIVE state and have them > continue without disruption when the port is made ACTIVE on the subnet > by the SM with use of the reinitialization mechanism. This is a very > useful mechanism for various failover situations. > Loopback with reinitialization properly applied.... -David > There is no current IBA mechanism or protocol for an endnode to set > just the LID, even if it had the M_Key, and have the SM preserve that > value. > > -David > > Roland Dreier wrote: > >> I don't see anything in the spec that forbids a CA from having an >> arbitrary value in PortInfo:LID after initialization but before the SM >> discovery (please correct me if I missed something). I also don't see >> anything that forbids an SM implementation from providing a mechanism >> for preserving the LIDs it finds or administratively assigning LIDs. >> >> Of course none of this is required but I don't see a problem with >> allowing it. 
>> > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From krause at cup.hp.com Thu Sep 30 16:04:16 2004 From: krause at cup.hp.com (Michael Krause) Date: Thu, 30 Sep 2004 16:04:16 -0700 Subject: [openib-general] IPoIB Loading and Starting In-Reply-To: <1096488346.2431.13.camel@hpc-1> References: <004d01c4a177$08257830$4302000a@Gripen> <4152DC88.7000700@sun.com> <6.1.2.0.2.20040923083322.01c5de90@esmail.cup.hp.com> <523c16l3wu.fsf@topspin.com> <6.1.2.0.2.20040927064203.01e60488@esmail.cup.hp.com> <52r7onhicd.fsf@topspin.com> <6.1.2.0.2.20040927081008.01da2090@esmail.cup.hp.com> <52oejrg1ux.fsf@topspin.com> <6.1.2.0.2.20040927085419.01f423c8@esmail.cup.hp.com> <52d607fzpl.fsf@topspin.com> <6.1.2.0.2.20040927135932.01e30340@esmail.cup.hp.com> <52pt467yy7.fsf@topspin.com> <6.1.2.0.2.20040928193055.0358e260@esmail.cup.hp.com> <52r7ol7pa5.fsf@topspin.com> <6.1.2.0.2.20040929062540.01e40080@esmail.cup.hp.com> <52sm915cn9.fsf@topspin.com> <6.1.2.0.2.20040929092739.01ea7aa0@esmail.cup.hp.com> <1096478813.19157.21.camel@hpc-1> <6.1.2.0.2.20040929104232.01f6f938@esmail.cup.hp.com> <1096488346.2431.13.camel@hpc-1> Message-ID: <6.1.2.0.2.20040930160125.01f0fd88@esmail.cup.hp.com> At 01:05 PM 9/29/2004, you wrote: >On Wed, 2004-09-29 at 13:46, Michael Krause wrote: > > In what I was proposing, the change in IP service being provided for a > > given partition would result in a service event notification. You are > > correct that unless an endnode periodically examines its P_Key table > > per port for change, there is no method to know that an admin has > > effected a change in the partition space. The IP service with event > > notification would provide this state change as a service event. 
> >Are you saying that when a particular service record is created in the >SA, an event is generated to a set of interested endnodes ? I don't >think there is a way to do that. The only choice is for the endnode >to continue to poll the service records based on the matching criteria >which we would need to define (ServiceID or name perhaps). A proposal would be reflected in an ECR that basically re-used the trap event notification method to return status indicating service state has changed. In this case, it would be the addition / deletion of a partition that supported the IP service. > > It isn't required that an endnode leave but if there is no one around to > > listen, why remain in the multicast group. > >I don't think there is a way to tell a node is the last (full) member of >the group. Anyhow, if this were done, how would that node know to rejoin >once another node came along ? The SM knows this and could simply delete the group. It isn't important to have happen as most fabrics will have at least one IP service partition running with multiple members. Mike From David.Brean at Sun.COM Thu Sep 30 16:10:44 2004 From: David.Brean at Sun.COM (David M. Brean) Date: Thu, 30 Sep 2004 19:10:44 -0400 Subject: Fwd: Re: [openib-general] static LID computationwithTS_HOST_DRIVER In-Reply-To: <52wtybxy6v.fsf@topspin.com> References: <35EA21F54A45CB47B879F21A91F4862F1DF71A@taurus.voltaire.com> <52vfdw1ngg.fsf@topspin.com> <415C6EE8.4010203@sun.com> <521xgjzdmz.fsf@topspin.com> <415C7429.9080302@sun.com> <52wtybxy6v.fsf@topspin.com> Message-ID: <415C9274.1060306@sun.com> The port may put any value in the LID during its initialization (any time before it starts responding to SMPs). However, there is nothing in the specification that says that the SM should use it. I sent another email with a more complete description. -David Roland Dreier wrote: > David> I'm describing what is in the current IBA.
The IBA > David> describes the conditions where a P_Key value should be set > David> into the P_Key table. There is no similar description for > David> LIDs in the IBA. > >Right, as I said before, that's what I thought (but I wasn't sure I >hadn't missed something). So the IBA doesn't make any statement about >what value PortInfo:LID should be initialized to. Presumably the IBA >doesn't require that PortInfo:LID be initialized to a random value >contained in an uninitialized memory location, so I don't see a >problem with the SMA initializing PortInfo:LID to an algorithmically >determined value as part of boot-up. And, unlike the P_Key table, the >IBA makes no statement about this initialization algorithm. > > - Roland From krause at cup.hp.com Thu Sep 30 16:29:09 2004 From: krause at cup.hp.com (Michael Krause) Date: Thu, 30 Sep 2004 16:29:09 -0700 Subject: Fwd: Re: [openib-general] static LID computation withTS_HOST_DRIVER In-Reply-To: <415C8B42.1080304@sun.com> References: <35EA21F54A45CB47B879F21A91F4862F1DF719@taurus.voltaire.com> <52lles39e4.fsf@topspin.com> <6.1.2.0.2.20040929174718.038a7150@esmail.cup.hp.com> <52d60436fu.fsf@topspin.com> <415C8B42.1080304@sun.com> Message-ID: <6.1.2.0.2.20040930162520.034d5220@esmail.cup.hp.com> At 03:40 PM 9/30/2004, David M. Brean wrote: >The IBA provides two mechanisms for updating subnet management data: > >1) through the verbs - see Modify HCA (section 11.2.1.3) >2) through Subnet management packets (SMPs) - see Subnet Management >Class (section 14.2) > >The IBA only supports updating the LID via SMPs (#2 above) and an entity >using SMPs must have the M_Key. If that entity doesn't have the M_Key, >then it can't reliably change the LID.
> >In addition, the IBA allows an endnode to request, through the verbs >interface provided for the "node reinitialization" (see 14.4.4) >mechanism, that subnet management state, such as the LID, be preserved, >when a port transitions through the DOWN state. However, the SM may not >honor that request so the endnode must handle that possibility because LID >assignment policy is owned by the SM. Furthermore, this mechanism is used >on ports that have previously been initialized by the SM (maybe that's why >it's called the reinitialization function :)). > >Given the mechanisms in the specification, I think that it's possible to >have IB clients use loopback, even under the endnode power-up scenario, >while the port is not in the ACTIVE state and have them continue without >disruption when the port is made ACTIVE on the subnet by the SM with use >of the reinitialization mechanism. This is a very useful mechanism for >various failover situations. This is a reasonable approach where the loopback LID being used is updated upon the port being initialized (akin to solving this in the CI but still allowing CM to work with a known LID). It avoids any complexity in the SM having to preserve a LID that may not be optimal or even unique within the subnet. Not sure this would work, but it seems to me that the APM mechanism could be used to configure a path with the newly configured LID and then transfer the connection to it. It may take a bit of work in the CM, as APM is nominally set up during these exchanges. >There is no current IBA mechanism or protocol for an endnode to set just >the LID, even if it had the M_Key, and have the SM preserve that value. Agreed. Mike >-David > >Roland Dreier wrote: >>I don't see anything in the spec that forbids a CA from having an >>arbitrary value in PortInfo:LID after initialization but before the SM >>discovery (please correct me if I missed something).
I also don't see >>anything that forbids an SM implementation from providing a mechanism >>for preserving the LIDs it finds or administratively assigning LIDs. >> >>Of course none of this is required but I don't see a problem with >>allowing it. From yaronh at voltaire.com Thu Sep 30 16:44:02 2004 From: yaronh at voltaire.com (Yaron Haviv) Date: Fri, 1 Oct 2004 01:44:02 +0200 Subject: Fwd: Re: [openib-general] static LID computationwithTS_HOST_DRIVER Message-ID: <35EA21F54A45CB47B879F21A91F4862F1DF751@taurus.voltaire.com> As I mentioned before, I think the best approach in the long run is to have a well-known loopback LID (one that stays as an alias even after the port has changed its LID, so as not to break apps), just like in any other stack. From a short piece of research I did once, I think it is possible to create one even in the current Mellanox HW by leveraging the multicast support with small firmware changes; maybe Mellanox can comment on that. It is also possible to leverage APM (plus the SMI port change events) if we don't want to deal with the HCAs. And in any case we want the apps to be able to recover gracefully from any RC failures (not just LID changes). Doing manual configuration on each host violates the whole idea of zero configuration and utility computing that we all advocate. Yaron ________________________________ From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Michael Krause Sent: Friday, October 01, 2004 1:29 AM To: openib-general at openib.org Subject: Re: Fwd: Re: [openib-general] static LID computationwithTS_HOST_DRIVER At 03:40 PM 9/30/2004, David M.
Brean wrote: The IBA provides two mechanisms for updating subnet management data: 1) through the verbs - see Modify HCA (section 11.2.1.3) 2) through Subnet management packets (SMPs) - see Subnet Management Class (section 14.2) The IBA only supports updating the LID via SMPs (#2 above) and an entity using SMPs must have the M_Key. If that entity doesn't have the M_Key, then it can't reliably change the LID. In addition, the IBA allows an endnode to request, through the verbs interface provided for the "node reinitialization" (see 14.4.4) mechanism, that subnet management state, such as the LID, be preserved, when a port transitions through the DOWN state. However, the SM may not honor that request so the endnode must handle that possibility because LID assignment policy is owned by the SM. Furthermore, this mechanism is used on ports that have previously been initialized by the SM (maybe that's why it's called the reinitialization function :)). Given the mechanisms in the specification, I think that it's possible to have IB clients use loopback, even under the endnode power-up scenario, while the port is not in the ACTIVE state and have them continue without disruption when the port is made ACTIVE on the subnet by the SM with use of the reinitialization mechanism. This is a very useful mechanism for various failover situations. This is a reasonable approach where the loopback LID being used is updated upon the port being initialized (akin to solving this in the CI but still allowing CM to work with a known LID). It avoids any complexity in the SM having to preserve a LID that may not be optimal or even unique within the subnet. Not sure this would work, but it seems to me that the APM mechanism could be used to configure a path with the newly configured LID and then transfer the connection to it. It may take a bit of work in the CM, as APM is nominally set up during these exchanges.
There is no current IBA mechanism or protocol for an endnode to set just the LID, even if it had the M_Key, and have the SM preserve that value. Agreed. Mike -David Roland Dreier wrote: I don't see anything in the spec that forbids a CA from having an arbitrary value in PortInfo:LID after initialization but before the SM discovery (please correct me if I missed something). I also don't see anything that forbids an SM implementation from providing a mechanism for preserving the LIDs it finds or administratively assigning LIDs. Of course none of this is required but I don't see a problem with allowing it.
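The LID-preservation idea debated in this thread can be illustrated with a small sketch. It is not taken from any real SM; it only assumes the IBA unicast LID range (0x0001 to 0xBFFF), and the function name `assign_lids` and the port GUID strings are hypothetical. The policy keeps whatever value it finds in a port's PortInfo:LID when that value is a valid, unused unicast LID, and otherwise assigns one itself, which is why an endnode cannot count on a preset value surviving.

```python
# Hypothetical sketch of an SM LID-assignment policy that preserves preset
# LIDs where possible. Names and data layout are illustrative assumptions,
# not part of the IBA or of any existing SM implementation.

UNICAST_LID_MIN = 0x0001  # lowest unicast LID
UNICAST_LID_MAX = 0xBFFF  # highest unicast LID


def is_unicast_lid(lid):
    """Return True if lid falls inside the unicast LID range."""
    return UNICAST_LID_MIN <= lid <= UNICAST_LID_MAX


def assign_lids(preset_lids):
    """Map port GUID -> assigned LID.

    preset_lids maps each discovered port's GUID to the value found in
    PortInfo:LID at discovery time. Ports are visited in the dict's
    iteration order, so the first port claiming a LID keeps it.
    """
    assigned = {}
    used = set()
    pending = []

    # First pass: honor every preset LID that is valid and not yet taken.
    for guid, lid in preset_lids.items():
        if is_unicast_lid(lid) and lid not in used:
            assigned[guid] = lid
            used.add(lid)
        else:
            pending.append(guid)

    # Second pass: give the remaining ports the lowest free unicast LIDs.
    next_lid = UNICAST_LID_MIN
    for guid in pending:
        while next_lid in used:
            next_lid += 1
        if next_lid > UNICAST_LID_MAX:
            raise RuntimeError("unicast LID space exhausted")
        assigned[guid] = next_lid
        used.add(next_lid)

    return assigned


if __name__ == "__main__":
    # Two ports preset to the same LID, one unset, one outside the range.
    ports = {"guid-a": 0x0010, "guid-b": 0x0010,
             "guid-c": 0x0000, "guid-d": 0xC001}
    for guid, lid in assign_lids(ports).items():
        print(f"{guid}: 0x{lid:04x}")
```

With the sample input only guid-a keeps its preset LID; the other three ports are renumbered, which is the case the endnode has to be prepared for when the SM does not honor its value.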