[openib-general] OpenSM work

Eitan Zahavi eitan at mellanox.co.il
Fri Apr 8 09:40:09 PDT 2005


Hi All,

FYI: Mellanox has been focusing on the following OpenSM development items for
the last few weeks:

1.	Stability testing over the IB management simulator:
a.	Randomly pick bad links with high packet drop statistics – success
is SUBNET UP
b.	Route using up/down algorithm – success is no credit loops

2.	Semi-static LID assignment:
a.	Developed an interface for persistent storage of arbitrary data. The
goal is to enable further development of an LDAP (à la Troy’s request) or SQL
module. Please see the attached osm_db.h.
b.	Developed a file-based implementation of osm_db.h.
c.	Modified osm_lid_mgr (the LID assignment algorithm) to use the LIDs stored
in the persistent storage. It handles all cases of a bad file and of new LIDs on
the fabric. The -r flag now lets OpenSM overwrite the known data. Persistent
GUID-to-LID data is kept even if the GUID disappears for a while. The code
also handles LID assignment for LMC > 0 better than the previous algorithm,
which assigned 2^LMC LIDs to every port, even to switch port 0. Now it
reserves only 1 LID for switch port 0.
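
To make the semi-static assignment in item 2 concrete, here is a minimal Python sketch of the idea. It is not the osm_lid_mgr code. The function name, the in-memory dict standing in for the osm_db.h storage, and the choice to store a (min_lid, max_lid) range per GUID are all assumptions made for illustration. It also assumes the usual InfiniBand rules that unicast LIDs run from 0x0001 to 0xBFFF and that a port's base LID is aligned to 2^LMC.

```python
from typing import Dict, Iterable, Tuple

MAX_UNICAST_LID = 0xBFFF  # top of the InfiniBand unicast LID range

LidRange = Tuple[int, int]  # (min_lid, max_lid), inclusive; hypothetical layout


def assign_lids(
    ports: Iterable[Tuple[int, bool]],
    lmc: int,
    persistent: Dict[int, LidRange],
) -> Dict[int, LidRange]:
    """Return {guid: (min_lid, max_lid)} for the ports seen in this sweep.

    ports      -- (port GUID, is_switch_port_0) pairs
    persistent -- stored GUID -> LID range map, updated in place; entries
                  for GUIDs that are currently absent are left untouched
    """
    ports = list(ports)
    used = set()
    result = {}
    pending = []

    def width(is_sp0: bool) -> int:
        # Switch port 0 gets a single LID; every other port gets 2^LMC.
        return 1 if is_sp0 else 1 << lmc

    # Pass 1: honor stored ranges that are still valid for this port.
    for guid, is_sp0 in ports:
        n = width(is_sp0)
        stored = persistent.get(guid)
        if stored is not None:
            lo, hi = stored
            block = range(lo, hi + 1)
            if (lo >= 1 and hi <= MAX_UNICAST_LID and hi - lo + 1 == n
                    and lo % n == 0 and used.isdisjoint(block)):
                used.update(block)
                result[guid] = stored
                continue
        pending.append((guid, n))  # bad or missing entry: assign afresh

    # Pass 2: keep the LIDs of GUIDs that disappeared reserved for them.
    present = {guid for guid, _ in ports}
    for guid, (lo, hi) in persistent.items():
        if guid not in present and 1 <= lo <= hi <= MAX_UNICAST_LID:
            used.update(range(lo, hi + 1))

    # Pass 3: give each remaining port the lowest free aligned block.
    for guid, n in pending:
        base = n  # lowest non-zero base aligned to n (n is 1 or 2^LMC)
        while base + n - 1 <= MAX_UNICAST_LID:
            block = range(base, base + n)
            if used.isdisjoint(block):
                break
            base += n
        else:
            raise RuntimeError("out of unicast LIDs")
        used.update(block)
        result[guid] = persistent[guid] = (base, base + n - 1)

    return result
```

Live ports with a valid stored range are served first, so a port that returns after an absence gets its old LIDs back as long as its entry is still in the store.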

3.	Unresponsive port:
a.	The phenomenon: a port does not respond to the SM during the
discovery stage. OpenSM cannot obtain enough data about the port, and thus
the port does not appear in the final database. Since OpenSM uses light sweeps
when no change is detected, it will not query the port again until a switch
either sets its “change bit” or sends a trap. So the unresponsive port will
never be polled again unless there is a heavy sweep.
b.	The solution: 
i.	During discovery, track ports (physical ports) whose logical link
state != DOWN but whose peer port on the other side of the link is not
known to the SM.
ii.	During a light sweep, not only scan the switches' “change bits” but
also test whether the port on the other side of each link tracked in (i) is
responding. If it is, issue a heavy sweep.
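
The two steps of the solution in item 3 can be sketched in Python as follows. This is only an illustration of the logic, not OpenSM code: the PhysPort type, both function names, and the peer_responds callback (standing in for whatever query the SM sends across the link) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional, Set

DOWN = "DOWN"


@dataclass(frozen=True)
class PhysPort:
    """A physical port, identified by its node GUID and port number."""
    node_guid: int
    port_num: int


def find_suspect_ports(
    link_state: Dict[PhysPort, str],
    peer: Dict[PhysPort, Optional[PhysPort]],
) -> Set[PhysPort]:
    """Step i: ports whose link is not DOWN but whose peer is unknown."""
    return {
        port
        for port, state in link_state.items()
        if state != DOWN and peer.get(port) is None
    }


def light_sweep_needs_heavy(
    switch_change_bits: Iterable[bool],
    suspects: Iterable[PhysPort],
    peer_responds: Callable[[PhysPort], bool],
) -> bool:
    """Step ii: return True when a heavy sweep should be issued."""
    # The existing trigger: some switch reports a change.
    if any(switch_change_bits):
        return True
    # The added trigger: a previously silent peer now answers.
    return any(peer_responds(port) for port in suspects)
```

Probing only the tracked ports keeps the light sweep cheap, since the extra work is proportional to the number of links with an unknown peer.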

4.	Head of Queue Life:
a.	Problem: in cases of PCI hardware failure, HCAs cannot complete RDMA
requests and lose all credits on their input ports (in other words, their
input buffers fill up). They then create back pressure on the fabric.
b.	Solution: use a short head-of-queue lifetime limit on every switch port
that drives an HCA.
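
As a rough illustration of item 4, the sketch below converts a HOQLife setting to a time limit and picks a shorter limit for switch ports that face an HCA. It assumes the PortInfo:HOQLife encoding as I understand it from the InfiniBand specification (limit = 4.096 us * 2^HOQLife, with values above 19 meaning no limit). The function names and the two example values (18 for switch-to-switch ports, 16 for HCA-facing ports) are hypothetical and are not taken from the OpenSM code.

```python
import math


def hoq_limit_us(hoq_life: int) -> float:
    """Head-of-queue time limit in microseconds for a 5-bit HOQLife value."""
    if not 0 <= hoq_life <= 31:
        raise ValueError("HOQLife is a 5-bit field")
    if hoq_life > 19:
        return math.inf  # assumed encoding: values above 19 disable the limit
    return 4.096 * (1 << hoq_life)


def pick_hoq_life(peer_is_hca: bool, fabric_value: int = 18,
                  hca_facing_value: int = 16) -> int:
    """Use the shorter limit on switch ports that drive an HCA.

    Both default values are illustrative only.
    """
    return hca_facing_value if peer_is_hca else fabric_value
```

With these example values, packets stuck behind a stalled HCA are dropped about four times sooner than packets on switch-to-switch links, which limits how far the back pressure spreads.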

5.	SA queries stress testing:
a.	We are exploring max performance of the SA and ways to improve it.

Eitan



Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL


-------------- next part --------------
A non-text attachment was scrubbed...
Name: osm_db.h
Type: application/octet-stream
Size: 11514 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20050408/91c96aae/attachment.obj>

