NAAICE Middleware API Documentation

This document describes the API of the NAAICE middleware, which integrates a network-attached accelerator (NAA) into HPC data centers using RoCEv2, allowing for IP-based remote direct memory access (RDMA). To make this possible, a middleware/communication library was designed to achieve the following goals:

1. Easy integration into HPC applications

The communication details should be transparent to the user: an application developer should not have to deal with communication specifics, but should be able to concentrate on specifying the computation to be offloaded to the NAA. Communication details include, for example, a communication context containing ibverbs-specific structures such as InfiniBand queue pairs, or the details of memory region handling.

2. Fast adoption by the HPC community

The middleware library should be easy for the HPC community to understand and use. The popular and well-known Message Passing Interface (MPI) standard is therefore taken as the inspiration for the NAA middleware, which reproduces or reuses functionality and structures of the MPI standard. We assume that a middleware with analogies to MPI can be adopted more easily by the HPC community.

3. Ability for communication-computation overlap (CCO)

Instead of serializing communication and computation, the middleware aims to let both happen concurrently. This requires non-blocking communication calls. Non-blocking communication functions return before the actual communication, i.e. the data transfer, is done. The host node can then continue with other computation while the NIC carries on with the data transfer.

A PDF version of the middleware documentation can also be found here.

Structs & Enums

enum naa_error

Error codes returned by NAA routines.

Defines possible error values for RPCs or communication failures between host and NAA.

Values:

enumerator NAA_SUCCESS

Successful RPC.

enumerator SOCKET_UNAVAIL

Socket unavailable.

enumerator KERNEL_TIMEOUT

Kernel timed out (timeout definition TBD).
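When logging failures, it can help to map these codes to readable text. The following is a minimal sketch and not part of the middleware: the enum is a local stand-in that mirrors the three documented enumerators (their numeric values are an assumption), and naa_error_str is a hypothetical helper name. In real code the enum would come from naaice_ap2.h.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for the enum declared in naaice_ap2.h. Only the enumerator
 * names come from the documentation; the numeric values are assumed. */
enum naa_error { NAA_SUCCESS = 0, SOCKET_UNAVAIL, KERNEL_TIMEOUT };

/* Hypothetical helper: returns a static description for an error code. */
static const char *naa_error_str(enum naa_error err)
{
    switch (err) {
    case NAA_SUCCESS:    return "successful RPC";
    case SOCKET_UNAVAIL: return "socket unavailable";
    case KERNEL_TIMEOUT: return "kernel timed out";
    }
    /* Reached only for values outside the documented set. */
    return "unknown error";
}
```

The fallback string covers values outside the documented set, which may be useful if further error codes are added later.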

typedef struct naa_handle naa_handle

Represents a handle to an NAA session.

Holds information about an active NAA session, including the function code specifying the routine to execute and the associated low-level communication context.

typedef struct naa_status naa_status

Status information for an NAA session.

Holds the current state of the communication, any error codes returned by the NAA, and the number of bytes received so far.

struct naa_handle
#include <naaice_ap2.h>

Represents a handle to an NAA session.

Holds information about an active NAA session, including the function code specifying the routine to execute and the associated low-level communication context.

Public Members

naa_function_code_t function_code

Function code specifying the routine to be executed on the NAA.

struct naaice_communication_context *comm_ctx

Pointer to the communication context used for low-level API operations.

struct naa_status
#include <naaice_ap2.h>

Status information for an NAA session.

Holds the current state of the communication, any error codes returned by the NAA, and the number of bytes received so far.

Public Members

naaice_communication_state state

Current state of the communication session.

enum naa_error naa_error

Last error code returned by the NAA.

uint64_t bytes_received

Number of bytes received during this session.

struct naa_param_t
#include <naaice_ap2.h>

Represents a single parameter (input or output) for an NAA routine.

Holds information about the data region corresponding to a parameter, including its address, size, and whether it should be sent only once during the connection (e.g., for configuration data).

Public Members

void *addr

Pointer to the data region.

size_t size

Size of the data region, in bytes.

bool single_send

Indicates that the parameter should be sent only once. If true, the parameter is sent only during the first communication with the NAA routine (typically for configuration data).
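To illustrate the effect of single_send, the sketch below computes how many bytes would be transferred for a given invocation. It is an illustration under assumptions, not middleware code: the struct is a local stand-in that mirrors the three documented members, and bytes_to_send is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in mirroring the documented members of naa_param_t.
 * In real code this type comes from naaice_ap2.h. */
typedef struct {
    void  *addr;        /* pointer to the data region */
    size_t size;        /* size of the data region, in bytes */
    bool   single_send; /* send only during the first communication */
} naa_param_t;

/* Hypothetical helper: bytes transferred for one invocation.
 * Regions marked single_send count only on the first invocation. */
static size_t bytes_to_send(const naa_param_t *params, size_t count,
                            bool first_invoke)
{
    size_t total = 0;
    for (size_t i = 0; i < count; i++) {
        if (first_invoke || !params[i].single_send)
            total += params[i].size;
    }
    return total;
}
```

With one configuration region marked single_send and one data region that is not, the first invocation would transfer both regions and every later invocation only the data region.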

Functions

int naa_create(const naa_function_code_t function_code, naa_param_t *input_params, unsigned int input_amount, naa_param_t *output_params, unsigned int output_amount, naa_handle *handle)

Finds the IP address and socket ID of an NAA matching the required function code, prepares the connection, and registers and exchanges memory region information between the HPC node and the NAA.

The IP address and socket ID are already known to the HPC node. This information is retrieved from the resource management system (Slurm) when the Slurm job is created/deployed. The user knows the function code of the method/calculation to outsource to the NAA. The connection to the NAA is established with the connection establishment protocol from the InfiniBand standard. During connection preparation, the HPC node allocates buffers for memory regions and resolves the route to the NAA. After connection establishment, memory region information is exchanged between the HPC node and the NAA. The protocol for this was designed in NAAICE AP1.

This method registers the addresses of the parameters with ibverbs as memory regions, hiding the memory region semantics from the user. All memory regions, for both input and output parameters, are announced to the NAA during naa_create(). Therefore, memory regions cannot be changed from input to output between iterations. Currently, no example has been found where this is necessary. The handle object was previously returned by the library and includes information on how to connect to the right NAA. The resource management system provides the IP address and socket ID of the NAA.

Parameters:
  • function_code – Function code specifying the routine to execute on the NAA.

  • input_params – Array of naa_param_t structs representing input regions.

  • input_amount – Number of input memory regions.

  • output_params – Array of naa_param_t structs representing output regions.

  • output_amount – Number of output memory regions.

  • handle – Pointer to a naa_handle struct to be initialized for this session.

Returns:

int 0 if successful, -1 if an error occurred.

int naa_invoke(naa_handle *handle)

Sends input data to the peer and triggers the corresponding NAA routine.

Initiates the data transfer for the current session using the provided communication handle. Handles posting RDMA writes and waiting for the remote computation to complete.

Note

Data transfer is done with RDMA_WITH_IMM. If the transfer requires n > 1 operations, the first n − 1 are RDMA_WRITE operations. The last write operation is always RDMA_WITH_IMM, with the function code as the immediate data value. RDMA_WITH_IMM signals the end of the data transfer to the NAA and starts the calculation on the NAA (RPC start).
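The ordering described in this note can be sketched as follows. This only illustrates the sequencing and is not the middleware's implementation: write_kind and plan_writes are hypothetical names, and the enum is a local stand-in for the two operation types (with libibverbs these would correspond to the IBV_WR_RDMA_WRITE and IBV_WR_RDMA_WRITE_WITH_IMM opcodes).

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for the two write operations named in the note. */
enum write_kind { PLAIN_WRITE, WRITE_WITH_IMM };

/* Hypothetical helper: fills kinds[0..n-1] with the operation used for
 * each of the n transfers. The first n - 1 entries are plain writes and
 * the last one carries the immediate value (the function code). */
static void plan_writes(enum write_kind *kinds, size_t n)
{
    assert(n > 0); /* at least one operation is needed to start the RPC */
    for (size_t i = 0; i + 1 < n; i++)
        kinds[i] = PLAIN_WRITE;
    kinds[n - 1] = WRITE_WITH_IMM;
}
```

For a single transfer the plan consists of one write with immediate data, so the NAA is always notified by the final operation.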

Parameters:

handle – Pointer to a naa_handle created by naa_create.

Returns:

int 0 if successful, -1 if an error occurred.

int naa_test(naa_handle *handle, bool *flag, naa_status *status)

Checks, without blocking, whether a receive has completed.

Much like MPI_TEST, the naa_test call is non-blocking and polls the completion queue of the queue pair associated with the data transfer.

A call to naa_test returns flag = true if the operation identified by handle is complete. In that case, the status object is set to contain information on the completed operation. The call returns flag = false if the operation is not complete. In this case, the value of the status object is undefined.

Parameters:
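  • handle – Pointer to a naa_handle created by naa_create.

  • flag – Set to true if the operation identified by handle is complete, false otherwise.

  • status – Set to contain information on the completed operation if flag is true; undefined otherwise.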

Returns:

int
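The polling pattern that naa_test enables can be sketched without the middleware. Everything below is a stand-in written for illustration: stub_handle, stub_status, stub_naa_test and overlap_until_done are hypothetical names, the stub completes after a fixed number of polls, and a return value of 0 on success is an assumption, since the return value of naa_test is not specified above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins so the sketch compiles without naaice_ap2.h. */
typedef struct { int polls_left; } stub_handle;        /* not naa_handle */
typedef struct { uint64_t bytes_received; } stub_status;

/* Stub with the same shape as naa_test: reports completion only after
 * polls_left unsuccessful polls. Assumed to return 0 on success. */
static int stub_naa_test(stub_handle *h, bool *flag, stub_status *st)
{
    if (h->polls_left > 0) {
        h->polls_left--;
        *flag = false;          /* status is left untouched here */
        return 0;
    }
    *flag = true;
    st->bytes_received = 512;   /* arbitrary value for the sketch */
    return 0;
}

/* Polls until completion and counts the work units done in between.
 * Returns the number of work units, or -1 if a poll fails. */
static int overlap_until_done(stub_handle *h, stub_status *st)
{
    bool flag = false;
    int work_units = 0;
    for (;;) {
        if (stub_naa_test(h, &flag, st) != 0)
            return -1;
        if (flag)
            return work_units;
        work_units++;           /* placeholder for useful host computation */
    }
}
```

The status is read only after flag is true, matching the rule that its value is undefined while the operation is incomplete.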

int naa_wait(naa_handle *handle, naa_status *status)

Waits in blocking mode for a receive.

Much like MPI_WAIT, the naa_wait call is blocking and polls the completion queue of the queue pair associated with the data transfer. naa_wait returns when the data has been written back to the HPC node. The call returns with the information on the completed operation stored in the status variable.

Parameters:
  • handle – Communication handle created by naa_create.

  • status – Set to contain information on the completed operation.

Returns:

int

int naa_finalize(naa_handle *handle)

Terminates connection and cleans up the corresponding data structures.

Parameters:

handle – communication handle created by naa_create

Returns:

int 0 if successful, -1 if an error occurred.

Example

The following code shows the use of the middleware in a minimal example, in which two input regions are offloaded and the result is expected in a third output region. The example shows non-blocking waiting on the first invoke and blocking waiting on the second one.

Minimal NAAICE Middleware example
// All data is gathered and sent to the NAA in one go.
#include <stdbool.h>
#include <stdlib.h>
#include <naaice_ap2.h>
#define FNCODE_VEC_ADD 0

void *a, *b, *c;
a = calloc(64, sizeof(double));
b = calloc(64, sizeof(double));
c = calloc(64, sizeof(double));

// define input and output memory regions
naa_param_t input_params[2] = {{a, 64 * sizeof(double)},
                               {b, 64 * sizeof(double)}};
naa_param_t output_params[1] = {{c, 64 * sizeof(double)}};

naa_handle handle;

// establish connection (arrays decay to naa_param_t *, so no & is needed)
naa_create(FNCODE_VEC_ADD, input_params, 2, output_params, 1, &handle);

bool flag = false;
naa_status status;

// transfer data to NAA
naa_invoke(&handle);

// non-blocking check if results are received
naa_test(&handle, &flag, &status);
while (!flag) {
  do_other_work();
  naa_test(&handle, &flag, &status);
}

// process results
...

// set inputs with new data
set_inputs(a, 64, b, 64) ;

naa_invoke(&handle);

// block until the RPC has finished
naa_wait(&handle, &status);
process_results(c);

// finalize connection
naa_finalize(&handle);

Further examples of the middleware API can be found in examples/naaice_client_ap2.c.