[openib-general] Unknown SMP Recv

Michael Arndt michael.arndt at informatik.tu-chemnitz.de
Wed Feb 14 17:34:32 PST 2007


Hi,

I applied your changes and they help in some cases, but there are still 
situations where umad_send returns with that error. I will try to describe 
such a situation:

(Node 1) -> (Node 2) -> (Node 3)

Node 1: sends 100 SubnGets to Node 3 (directed route [0][1][1])
Node 2: forwards the 100 SubnGets to Node 3 and the 100 SubnGetResps back to 
Node 1
Node 3: responds 100 times
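
As a rough illustration of the path notation, here is a minimal sketch of how 
an outbound directed-route SMP advances along [0][1][1]. It is a simplified 
model and not the router tool's code: next_out_port is a hypothetical helper, 
the 64-entry path length is taken from the directed-route SMP layout, and the 
return path and the direction bit are ignored.

```c
#include <assert.h>

#define PATH_LEN 64  /* a directed-route SMP carries 64 path entries */

/* Hypothetical helper: advance hop_ptr by one hop and return the port the
 * packet leaves on, or -1 once hop_ptr has reached hop_cnt (end of path). */
static int next_out_port(const unsigned char path[PATH_LEN],
                         unsigned hop_cnt, unsigned *hop_ptr)
{
    assert(hop_cnt < PATH_LEN);   /* path[hop_cnt] must be a valid entry */
    if (*hop_ptr >= hop_cnt)
        return -1;                /* destination reached, nothing to forward */
    (*hop_ptr)++;
    return path[*hop_ptr];        /* entry 0 is unused, entries start at 1 */
}
```

With path = {0, 1, 1} and hop_cnt = 2, the first call (on Node 1) returns 
port 1, the second call (on Node 2) returns port 1 again, and the third call 
(on Node 3) returns -1 because the end of the path has been reached.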

That works fine! Don't be surprised that Node 2 receives the packets; 
that is because I changed the SMI.

But if I now start the sender on Node 1 again, so that it sends another 100 
SubnGets, Node 2 produces umad_send errors. The error does not occur every 
time. The receives are always OK, and so are the packets.

Below I attach the main code of the router tool on Node 2. I also tried 
allocating a separate packet for every single receive and send, but that did 
not help either.

What about the size of the packet: could there be an error there?
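
To make the size question concrete, here is a minimal sketch of the buffer 
arithmetic the loop below relies on: one slot per port, each slot being a umad 
header followed by one MAD. The helper names are hypothetical, hdr_size stands 
in for the value returned by umad_size(), and MAD_SIZE assumes that 
IB_MAD_SIZE is 256 bytes.

```c
#include <assert.h>
#include <stddef.h>

#define MAD_SIZE 256  /* assumed value of IB_MAD_SIZE */

/* Size of one per-port slot: umad header followed by one MAD. */
static size_t slot_size(size_t hdr_size)
{
    return hdr_size + MAD_SIZE;
}

/* Byte offset of port i's slot in a buffer holding one slot per port. */
static size_t slot_offset(size_t hdr_size, size_t i)
{
    return i * slot_size(hdr_size);
}

/* Byte offset of port i's MAD payload, assuming it directly follows
 * that slot's header. */
static size_t mad_offset(size_t hdr_size, size_t i)
{
    return slot_offset(hdr_size, i) + hdr_size;
}

/* Nonzero if slot i lies entirely inside a buffer of n_ports slots. */
static int slot_in_bounds(size_t hdr_size, size_t i, size_t n_ports)
{
    assert(n_ports > 0);
    return slot_offset(hdr_size, i) + slot_size(hdr_size)
           <= n_ports * slot_size(hdr_size);
}
```

In this model the length handed to umad_recv and umad_send is the MAD length 
only (at most MAD_SIZE), while the pointer passed in is the start of the slot, 
header included.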

Thanks,
Michael

while(run){

  bcopy((char*)&fd_ports,(char*)&fd_ports_tmp,sizeof(fd_ports));

  activ = select(highest_fd+1, (fd_set*)&fd_ports_tmp, (fd_set*)0,
                 (fd_set*)0, (struct timeval*)0);

  if (activ < 0 ){
   if (run) printf("Error: select : %i\n",activ);
   run = 0;
  }
  else if (activ == 0) printf("Nothing to do\n");
  else {

   // ++ Alloc MAD ++
     //printf("... Alloc UMAD .......................");
     if (!(umad = umad_alloc(Port_ID_cnt, umad_size() + IB_MAD_SIZE))){
        printf("Error: umad_alloc\n");
        goto Exit;
     }
      //printf("done\n");

     // ++ Alloc SMP Pointer ++
     //printf("... Alloc SMP ........................");
     smp = (struct drsmp**) malloc(Port_ID_cnt * sizeof(struct drsmp*));
     for (i = 0; i < Port_ID_cnt; i++)
        smp[i] = (struct drsmp*) umad_get_mad(umad + (i * (umad_size() + IB_MAD_SIZE)));
     //printf("done\n");


   // ++ Check All Ports where something is to do ++
   for (i = 0; i < Port_ID_cnt; i++) {
    if (  (Port_ID[i] >= 0) && (Agent_ID[i] >= 0) &&
          (FD_ISSET(umad_get_fd(Port_ID[i]),(fd_set*)&fd_ports_tmp))) {

     smplength = IB_MAD_SIZE;
     packet_size = umad_size() + IB_MAD_SIZE;

     printf("... Recv Mad (Port: %i (ID:%i).....",i+1,Port_ID[i]);
     // ++ Receive ++
       if ((ret = umad_recv(Port_ID[i], umad + (i * packet_size),
                            &smplength, timeout_ms_r)) != Agent_ID[i]){
          printf("Error: umad_recv: %s ,Nr: %i\n", drmad_status_str(smp[i]),ret);
      if (optExitRecvFail) run = 0;
     }
     else {
      // ++ Drop Echo ++
      if (smp[i]->initial_path[1] != 0) {

       // ++ Keep TID in Mind with supporting turning algorithm ++
       if ( !( (smp[i]->initial_path[smp[i]->hop_ptr] == i+1)  &&
          (smp[i]->status & DIRECTION)        &&
          (smp[i]->hop_cnt == smp[i]->hop_ptr)     &&
          (smp[i]->initial_path[smp[i]->hop_ptr] != smp[i]->initial_path[smp[i]->hop_ptr - 1]) )
          &&
           ( (Agent_TIDs[i] == -1) || (Agent_TIDs[i] != (own_ntoh64(smp[i]->tid) >> 32))   )
        )
        Agent_TIDs[i] = smp[i]->tid;
       printf("TID: 0x%lx\n",own_ntoh64(Agent_TIDs[i]));

       // ++ Message Logging ++
       if (optMsgLog) {
        fprintf(MsgLogFile,"...............................................................................................\n");
        fprintf(MsgLogFile,"... Recv Mad (Port: %i (ID:%i)...............\n",i+1,Port_ID[i]);
        fprintf(MsgLogFile,"... Recv TID: 0x%lx \n",own_ntoh64(Agent_TIDs[i]));
        dump_dr_smp(smp[i], MsgLogFile);
       }

       // ++ Looking up the Out-Port ++
       Out_Port_index = routing(smp[i],Devices_Info,Devices_cnt);

       if ((Out_Port_index >= 0) && (Port_ID[Out_Port_index] >=0)){
        printf("... Send Mad (Port: %i (ID:%i).....",Out_Port_index+1,Port_ID[Out_Port_index]);

        // ++ Replace TID
        if (Agent_TIDs[Out_Port_index] != -1) smp[i]->tid = (uint64_t) Agent_TIDs[Out_Port_index];

        // ++ Sending ++
        //printf("%i\n",timeout_ms_s); //= (smp[i]->status & DIRECTION)? 0 : 200;
          if ((ret = umad_send(Port_ID[Out_Port_index], Agent_ID[Out_Port_index],
                               umad + (i * packet_size), smplength,
                               (smp[i]->status & DIRECTION)? 0 : timeout_ms_s, 3)) < 0){
           printf("Error: umad_send Nr: %i \n",ret);
         if (optExitSendFail) run = 0;
        }
          else printf("TID: 0x%lx \n",own_ntoh64(Agent_TIDs[Out_Port_index]));

        if (optMsgLog) {
         fprintf(MsgLogFile,"... Send TID: 0x%lx \n",own_ntoh64(Agent_TIDs[Out_Port_index]));
         fprintf(MsgLogFile,"... Send Mad (Port: %i (ID:%i)(%s)(%i)...............\n",
                 Out_Port_index+1,Port_ID[Out_Port_index],
                 (ret >= 0)?"OK":"Fail",(smp[i]->status & DIRECTION)? 0 : timeout_ms_s);
         fprintf(MsgLogFile,"...............................................................................................\n");
         fflush(MsgLogFile);
        }
        traversed++;
       }
      }
      else {
       printf("dropped, probably there is missing a response mad\n");
       dropped++;
      }
     }
    }
   }
   if (umad) umad_free(umad);
   free(smp);  // smp is allocated with malloc on every pass through this branch
  }
  printf("... Traversed Packets (%i)(%i) .............................\n",traversed,dropped);
 }







