Software-NAA API Documentation

This page documents the methods of a software-based NAA, which implements the server-side counterpart to clients invoking RPCs via the low-level or middleware API. In the NAAICE project, NAAs were realized using FPGAs, resulting in long development cycles. To facilitate testing and rapid prototyping, a software implementation was developed that replicates the FPGA-based NAA behavior.

The server also supports multiple connections, implemented using a master-worker concept. The master thread waits for incoming connections on a shared event channel and, once a connection has been successfully established, hands it over to the corresponding worker thread (see the figure below). Currently, the maximum number of parallel connections is set to the number of available cores.

../_images/multi_kernel_swnaa.png
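The slot bookkeeping behind such a master-worker scheme could look roughly like the following sketch. It is not taken from the library: `struct worker_pool`, `pool_init`, `pool_assign` and `pool_release` are hypothetical names, threads are left out entirely, and the 256-slot upper bound is only an assumption derived from the `uint8_t worker_id` parameter of `naaice_swnaa_init_worker`.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical bookkeeping for the master thread; not part of the NAAICE API. */
#define SKETCH_MAX_SLOTS 256   /* assumption: worker ids fit into a uint8_t */

struct worker_pool {
    size_t capacity;              /* maximum number of parallel connections */
    int busy[SKETCH_MAX_SLOTS];   /* 1 if the worker slot serves a connection */
};

/* Size the pool. A capacity of 0 means "use the number of online cores". */
void pool_init(struct worker_pool *pool, size_t capacity)
{
    if (capacity == 0) {
        long cores = sysconf(_SC_NPROCESSORS_ONLN);
        capacity = cores > 0 ? (size_t)cores : 1;
    }
    if (capacity > SKETCH_MAX_SLOTS)
        capacity = SKETCH_MAX_SLOTS;
    pool->capacity = capacity;
    for (size_t i = 0; i < SKETCH_MAX_SLOTS; i++)
        pool->busy[i] = 0;
}

/* Called by the master once a connection is established.
 * Returns the id of the worker that takes over, or -1 if all are busy. */
int pool_assign(struct worker_pool *pool)
{
    for (size_t i = 0; i < pool->capacity; i++) {
        if (!pool->busy[i]) {
            pool->busy[i] = 1;
            return (int)i;
        }
    }
    return -1;
}

/* Called when a worker has finished serving its connection. */
void pool_release(struct worker_pool *pool, int worker_id)
{
    if (worker_id >= 0 && (size_t)worker_id < pool->capacity)
        pool->busy[worker_id] = 0;
}
```

In a threaded server the `busy` array would additionally need a mutex or atomics, since workers release their slots concurrently with the master assigning new ones.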

As on the client side, the software NAA transitions through various states, which differ slightly from those of the client. In error-free operation, the following states are traversed:

../_images/states_server.png

Structs & Enums

The structs used are identical to those of the low-level API.

Functions

void *worker_procedure(void *args)
int naaice_swnaa_init_master(struct context **ctx, uint16_t local_cm_port)
int naaice_swnaa_init_worker(struct context **ctx, uint8_t worker_id)
int naaice_swnaa_init_communication_context(struct naaice_communication_context **comm_ctx)

Initialize a communication context structure.

This function initializes a communication context structure. The dummy software NAA reuses the communication context structure from the host-side AP1 implementation, but does not use all fields in the same way. In particular, the size and number of parameters are not known (and related fields are not populated) until the MRSP has completed.

Parameters:
  • comm_ctx – Pointer to a communication context structure to be initialized. The pointer must not reference an existing structure; the structure is allocated and returned by this function.

  • port – String specifying the connection port (e.g. "12345").

Returns:

0 on success, -1 on failure.

int naaice_swnaa_setup_connection(struct context *ctx)

Set up the software NAA connection.

Polls for and handles connection events until the connection setup is complete. Unlike the base naaice implementation, this function does not require handling address or route resolution events. It does, however, handle the “connection requests complete” event, which is not handled on the host side.

Parameters:

ctx – Pointer to the context structure describing the connection.

Returns:

0 on success, -1 on failure (e.g. due to timeout).

int naaice_swnaa_poll_and_handle_connection_event(struct context *ctx)

Poll for and handle a software NAA connection event.

Polls the RDMA event channel stored in the communication context for a connection event and handles it if one is received. This function delegates the actual work to the corresponding poll and handler functions.

Parameters:

ctx – Pointer to the context structure describing the connection.

Returns:

0 on success (regardless of whether an event was received), -1 on failure.

int naaice_swnaa_init_mrsp(struct naaice_communication_context *comm_ctx)

Initialize MRSP on the software NAA side.

Starts the MRSP on the NAA side by posting a receive for MRSP packets expected from the host.

Parameters:

comm_ctx – Pointer to the communication context structure describing the connection and associated memory regions.

Returns:

0 on success, -1 on failure.

int naaice_swnaa_post_recv_mrsp(struct naaice_communication_context *comm_ctx)

Post a receive request for an MRSP message.

Posts a receive request for an MRSP message. The receive request is added to the queue and specifies the memory region to be written to (the MRSP memory region).

Parameters:

comm_ctx – Pointer to the communication context structure describing the connection.

Returns:

0 on success, -1 on failure.

int naaice_swnaa_handle_work_completion(struct ibv_wc *wc, struct naaice_communication_context *comm_ctx)

Handle a single work completion.

Handles one work completion retrieved from the completion queue. Work completions represent memory region write operations from the host to the NAA or from the NAA to the host.

Parameters:
  • wc – Pointer to the work completion to be handled.

  • comm_ctx – Pointer to the communication context structure describing the connection.

Returns:

0 on success, -1 on failure.

int naaice_swnaa_send_message(struct naaice_communication_context *comm_ctx, enum message_id message_type, uint8_t errorcode)

Send an MRSP message to the remote peer.

Sends an MRSP packet to the remote peer using ibv_post_send() with opcode IBV_WR_SEND.

Parameters:
  • comm_ctx – Pointer to the communication context structure describing the connection.

  • message_type – Type of message to send. Must be one of: MSG_MR_ERR, MSG_MR_AAR, or MSG_MR_A.

  • errorcode – Error code included in the packet if message_type is MSG_MR_ERR. Unused for other message types.

Returns:

0 on success, -1 on failure.

int naaice_swnaa_post_recv_data(struct naaice_communication_context *comm_ctx)

Post a receive for a memory region write.

Posts a receive request for an RDMA memory region write. Only the final memory region write (the one with an immediate value) requires a posted receive. RDMA writes without an immediate occur without consuming a receive request from the queue.

The memory region specified in the receive request is the MRSP region. This is a placeholder; the actual write destination is determined by the sender.

Parameters:

comm_ctx – Pointer to the communication context structure describing the connection.

Returns:

0 on success, -1 on failure.

int naaice_swnaa_write_data(struct naaice_communication_context *comm_ctx, uint8_t errorcode)

Write a return memory region to the remote peer.

Writes the return memory region, indicated by comm_ctx->mr_return_idx, to the remote peer using ibv_post_send() with opcode IBV_WR_RDMA_WRITE_WITH_IMM. The immediate value signals whether an error occurred during computation (nonzero = error).

Parameters:
  • comm_ctx – Pointer to the communication context structure describing the connection.

  • errorcode – Error code sent as the immediate value. A nonzero value indicates that an error occurred during computation.

Returns:

0 on success, -1 on failure.

int naaice_swnaa_disconnect_and_cleanup(struct naaice_communication_context *comm_ctx)

Disconnect and clean up the software NAA connection.

Terminates the RDMA connection and frees all memory associated with the communication context.

Parameters:

comm_ctx – Pointer to the communication context structure describing the connection.

Returns:

0 on success, -1 on failure.

int naaice_swnaa_do_mrsp(struct naaice_communication_context *comm_ctx)

Execute MRSP logic in a blocking manner.

Performs all necessary MRSP processing in a blocking fashion, ensuring that the procedure completes before returning.

Parameters:

comm_ctx – Pointer to the communication context structure describing the connection.

Returns:

0 on success, -1 on failure.

int naaice_swnaa_receive_data_transfer(struct naaice_communication_context *comm_ctx)

Receive data transfer from the remote peer.

Handles receiving data from the remote peer in a blocking manner. Updates the communication context with information about the received data.

Parameters:

comm_ctx – Pointer to the communication context structure describing the connection.

Returns:

0 on success, -1 on failure.

int naaice_swnaa_do_data_transfer(struct naaice_communication_context *comm_ctx, uint8_t errorcode)

Perform the complete data transfer procedure.

Executes all steps of the data transfer in a blocking manner, including:

  • Receiving data from the remote peer

  • Waiting for the NAA computation to complete

  • Writing the return data back to the remote peer

Parameters:

comm_ctx – Pointer to the communication context structure describing the connection.

Returns:

0 on success, -1 on failure.
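The three steps listed above can be pictured as a simple sequence that stops at the first failure. The sketch below only illustrates that control flow; `struct sketch_transfer` and all `sketch_*` functions are hypothetical stand-ins that simulate the steps, not the library's implementation.

```c
#include <stdint.h>

/* Hypothetical record of what happened; not the library's context struct. */
struct sketch_transfer {
    int received;        /* 1 after the input data arrived */
    int computed;        /* 1 after the routine ran */
    uint32_t imm_sent;   /* immediate value sent with the return write */
    int fail_receive;    /* simulation only: force the receive step to fail */
};

/* Step 1: receive the data from the remote peer. */
int sketch_receive(struct sketch_transfer *t)
{
    if (t->fail_receive)
        return -1;
    t->received = 1;
    return 0;
}

/* Step 2: wait for the computation to complete. */
int sketch_compute(struct sketch_transfer *t)
{
    t->computed = 1;
    return 0;
}

/* Step 3: write the return data back; a nonzero immediate signals an error. */
int sketch_write_back(struct sketch_transfer *t, uint8_t errorcode)
{
    t->imm_sent = errorcode;
    return 0;
}

/* Run all steps in order; return -1 as soon as one of them fails. */
int sketch_do_data_transfer(struct sketch_transfer *t, uint8_t errorcode)
{
    if (sketch_receive(t) != 0)
        return -1;
    if (sketch_compute(t) != 0)
        return -1;
    return sketch_write_back(t, errorcode);
}
```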

int naaice_swnaa_poll_cq_nonblocking(struct naaice_communication_context *comm_ctx)

Poll the completion queue in a non-blocking manner.

Polls the completion queue for work completions and handles any that are received using naaice_swnaa_handle_work_completion(). Updates comm_ctx->state to reflect the current status of the NAA connection and routine.

Parameters:

comm_ctx – Pointer to the communication context structure describing the connection.

Returns:

0 on success (regardless of whether any work completions were received), -1 on failure.
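A blocking routine can be built on top of such a non-blocking poll by looping until the state changes. The following is a minimal sketch of that pattern under stated assumptions: the state names, `struct sketch_comm_ctx`, the simulated poll function and the poll budget are all hypothetical and do not reflect how the library's blocking functions are actually written.

```c
/* Hypothetical states; the real state names are defined by the library. */
enum sketch_state { SKETCH_MRSP_PENDING, SKETCH_MRSP_DONE, SKETCH_FAILED };

struct sketch_comm_ctx {
    enum sketch_state state;
    unsigned polls_until_done;   /* simulation only: empty polls before a completion */
};

/* Stand-in for a non-blocking poll: returns 0 whether or not a completion
 * was found, and advances ctx->state when one was. */
int sketch_poll_cq_nonblocking(struct sketch_comm_ctx *ctx)
{
    if (ctx->state != SKETCH_MRSP_PENDING)
        return 0;
    if (ctx->polls_until_done > 0) {
        ctx->polls_until_done--;     /* nothing completed on this poll */
        return 0;
    }
    ctx->state = SKETCH_MRSP_DONE;   /* completion handled, state advanced */
    return 0;
}

/* Blocking wrapper: poll until the state leaves PENDING or the budget runs out.
 * Returns 0 on success, -1 on poll failure, error state, or timeout. */
int sketch_do_mrsp(struct sketch_comm_ctx *ctx, unsigned max_polls)
{
    for (unsigned i = 0; i < max_polls; i++) {
        if (sketch_poll_cq_nonblocking(ctx) != 0)
            return -1;
        if (ctx->state == SKETCH_MRSP_DONE)
            return 0;
        if (ctx->state == SKETCH_FAILED)
            return -1;
    }
    return -1;   /* timed out */
}
```

Because the poll returns 0 even when nothing was received, the caller has to inspect the state rather than the return value to find out whether progress was made.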

MRSP Packet Handlers


Handlers for MRSP packets of type announce and announce-and-request. These functions process the contents of a received MRSP packet of the corresponding type and populate the relevant fields in the communication context.

Parameters:

comm_ctx – Pointer to the communication context structure describing the connection.

Returns:

0 on success, -1 on failure.

Functions

int naaice_swnaa_handle_mr_announce(struct naaice_communication_context *comm_ctx)

Handle MRSP announce packets.

int naaice_swnaa_handle_mr_announce_and_request(struct naaice_communication_context *comm_ctx)

Handle MRSP announce-and-request packets.

Event Handlers

These functions each handle a specific RDMA connection event of the software NAA. If the type of the provided event matches the event type handled by the function, the required logic is executed.

After handling an event, flags in the communication context are updated to reflect the current state of connection establishment.

The events handled by these functions, in order, are:

  • RDMA_CM_EVENT_CONNECT_REQUEST

  • RDMA_CM_EVENT_ESTABLISHED

The following events are handled by naaice_swnaa_handle_error():

  • RDMA_CM_EVENT_ADDR_ERROR

  • RDMA_CM_EVENT_ROUTE_ERROR

  • RDMA_CM_EVENT_CONNECT_ERROR

  • RDMA_CM_EVENT_UNREACHABLE

  • RDMA_CM_EVENT_REJECTED

  • RDMA_CM_EVENT_DEVICE_REMOVAL

  • RDMA_CM_EVENT_DISCONNECTED

Parameters:
  • comm_ctx – Pointer to the communication context structure describing the connection and maintaining connection state.

  • ev – Pointer to the RDMA CM event to be checked and, if applicable, handled.

Returns:

0 on success (either the event was handled successfully or it was not of the matching type), -1 on failure.

Functions

int naaice_swnaa_handle_connection_requests(struct context *ctx, struct rdma_cm_event *ev)

Handle RDMA_CM_EVENT_CONNECT_REQUEST events.

int naaice_swnaa_handle_connection_established(struct naaice_communication_context *comm_ctx, struct rdma_cm_event *ev)

Handle RDMA_CM_EVENT_ESTABLISHED events.

int naaice_swnaa_handle_error(struct naaice_communication_context *comm_ctx, struct rdma_cm_event *ev)

Handle connection error events.
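The convention that every handler returns 0 for events it is not responsible for makes it possible to pass each event through all handlers in turn. The sketch below illustrates this pattern only; the `sketch_*` types and the handler bodies are hypothetical stand-ins for the librdmacm types and the library's handlers, and the rules that an established event without a prior request fails and that a disconnect returns -1 are assumptions made for the example.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the librdmacm event types and the context. */
enum sketch_event_type {
    SKETCH_EV_CONNECT_REQUEST,
    SKETCH_EV_ESTABLISHED,
    SKETCH_EV_DISCONNECTED
};

struct sketch_event { enum sketch_event_type type; };

struct sketch_ctx {
    bool request_seen;   /* set once a connection request was handled */
    bool established;    /* set once the connection is established */
};

/* Each handler returns 0 if the event is not of its type or was handled
 * successfully, and -1 on failure. */
int handle_connection_request(struct sketch_ctx *ctx, const struct sketch_event *ev)
{
    if (ev->type != SKETCH_EV_CONNECT_REQUEST)
        return 0;
    ctx->request_seen = true;
    return 0;
}

int handle_connection_established(struct sketch_ctx *ctx, const struct sketch_event *ev)
{
    if (ev->type != SKETCH_EV_ESTABLISHED)
        return 0;
    if (!ctx->request_seen)
        return -1;   /* assumption: established without a prior request is an error */
    ctx->established = true;
    return 0;
}

int handle_error(struct sketch_ctx *ctx, const struct sketch_event *ev)
{
    (void)ctx;
    /* assumption: a disconnect is reported as a failure */
    return ev->type == SKETCH_EV_DISCONNECTED ? -1 : 0;
}

/* Pass one event through all handlers in order; stop at the first failure. */
int dispatch_event(struct sketch_ctx *ctx, const struct sketch_event *ev)
{
    if (handle_connection_request(ctx, ev) != 0)
        return -1;
    if (handle_connection_established(ctx, ev) != 0)
        return -1;
    if (handle_error(ctx, ev) != 0)
        return -1;
    return 0;
}
```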

Example

An example implementation of a software NAA can be found in `examples/naaice_server.c <https://github.com/naaice-greenhpc/naa-communication-library/blob/main/examples/naaice_server.c>`_.