All API

struct ncclConfig_t

Public Members

size_t size
unsigned int magic
unsigned int version
int blocking
struct ncclUniqueId

Public Members

char internal[NCCL_UNIQUE_ID_BYTES]
file nccl.h
#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>

Defines

NCCL_MAJOR
NCCL_MINOR
NCCL_PATCH
NCCL_SUFFIX
NCCL_VERSION_CODE
NCCL_VERSION(X, Y, Z)
RCCL_BFLOAT16
RCCL_GATHER_SCATTER
RCCL_ALLTOALLV
RCCL_MULTIRANKPERGPU
NCCL_UNIQUE_ID_BYTES
NCCL_CONFIG_INITIALIZER

Typedefs

typedef struct ncclComm *ncclComm_t

Opaque handle to communicator.

typedef int mscclAlgoHandle_t

Opaque handle to MSCCL algorithm.

Enums

enum ncclResult_t

Error type.

Values:

enumerator ncclSuccess
enumerator ncclUnhandledCudaError
enumerator ncclSystemError
enumerator ncclInternalError
enumerator ncclInvalidArgument
enumerator ncclInvalidUsage
enumerator ncclRemoteError
enumerator ncclInProgress
enumerator ncclNumResults
enum ncclRedOp_dummy_t

Reduction operation selector.

Values:

enumerator ncclNumOps_dummy
enum ncclRedOp_t

Values:

enumerator ncclSum
enumerator ncclProd
enumerator ncclMax
enumerator ncclMin
enumerator ncclAvg
enumerator ncclNumOps
enumerator ncclMaxRedOp
enum ncclDataType_t

Data types.

Values:

enumerator ncclInt8
enumerator ncclChar
enumerator ncclUint8
enumerator ncclInt32
enumerator ncclInt
enumerator ncclUint32
enumerator ncclInt64
enumerator ncclUint64
enumerator ncclFloat16
enumerator ncclHalf
enumerator ncclFloat32
enumerator ncclFloat
enumerator ncclFloat64
enumerator ncclDouble
enumerator ncclBfloat16
enumerator ncclNumTypes
enum ncclScalarResidence_t

Location and dereferencing logic for scalar arguments.

Values:

enumerator ncclScalarDevice
enumerator ncclScalarHostImmediate

Functions

ncclResult_t ncclGetVersion(int *version)

Return the NCCL_VERSION_CODE of the NCCL library in the supplied integer.

This integer is coded with the MAJOR, MINOR, and PATCH level of the NCCL library.
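
A minimal sketch (not part of the header) that compares the runtime library against the headers the application was built with; printf requires <stdio.h>:

    int version = 0;
    ncclGetVersion(&version);                 /* version code of the loaded library */
    printf("runtime NCCL_VERSION_CODE = %d, built against %d\n",
           version, NCCL_VERSION_CODE);       /* NCCL_VERSION_CODE is the compile-time value */
    if (version < NCCL_VERSION_CODE)
      printf("warning: runtime library is older than the build-time headers\n");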

ncclResult_t ncclGetUniqueId(ncclUniqueId *uniqueId)

Generates an ID for ncclCommInitRank.

Generates an ID to be used in ncclCommInitRank. ncclGetUniqueId should be called once, and the ID should be distributed to all ranks in the communicator before calling ncclCommInitRank.

Parameters

uniqueId[out] ncclUniqueId* pointer to the uniqueId to be populated

ncclResult_t ncclCommInitRankConfig(ncclComm_t *comm, int nranks, ncclUniqueId commId, int rank, ncclConfig_t *config)

Creates a new communicator (multi thread/process version) with a configuration set by the user.
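
A minimal sketch of non-blocking initialization, assuming nranks, commId, and rank are provided by the application's bootstrap:

    ncclConfig_t config = NCCL_CONFIG_INITIALIZER;   /* fills size/magic/version */
    config.blocking = 0;                             /* request a non-blocking communicator */

    ncclComm_t comm;
    ncclResult_t res = ncclCommInitRankConfig(&comm, nranks, commId, rank, &config);
    if (res == ncclInProgress) {                     /* non-blocking init may still be running */
      ncclResult_t state;
      do { ncclCommGetAsyncError(comm, &state); } while (state == ncclInProgress);
    }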

ncclResult_t ncclCommInitRank(ncclComm_t *comm, int nranks, ncclUniqueId commId, int rank)

Creates a new communicator (multi thread/process version).

rank must be between 0 and nranks-1 and unique within a communicator clique. Each rank is associated with a HIP device, which has to be set before calling ncclCommInitRank. ncclCommInitRank implicitly synchronizes with other ranks, so it must be called by different threads/processes or used within ncclGroupStart/ncclGroupEnd.

Parameters

comm[out] ncclComm_t* communicator struct pointer
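
A typical per-process setup, sketched under the assumption that rank, nranks, and localDevice come from the launcher and that the ID is distributed out of band (the MPI_Bcast call is illustrative and requires <mpi.h>; it is not part of this API):

    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);               /* one rank generates the ID ...  */
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);  /* ... and distributes it  */

    hipSetDevice(localDevice);                          /* HIP device must be set before init */
    ncclComm_t comm;
    ncclCommInitRank(&comm, nranks, id, rank);          /* collective across all ranks */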

ncclResult_t ncclCommInitRankMulti(ncclComm_t *comm, int nranks, ncclUniqueId commId, int rank, int virtualId)

Creates a new communicator (multi thread/process version) allowing multiple ranks per device.

rank must be between 0 and nranks-1 and unique within a communicator clique. Each rank is associated with a HIP device, which has to be set before calling ncclCommInitRankMulti. Since this version of the function allows multiple ranks to utilize the same HIP device, a unique virtualId per device has to be provided by each calling rank. ncclCommInitRankMulti implicitly synchronizes with other ranks, so it must be called by different threads/processes or used within ncclGroupStart/ncclGroupEnd.

Parameters

comm[out] ncclComm_t* communicator struct pointer

ncclResult_t ncclCommInitAll(ncclComm_t *comm, int ndev, const int *devlist)

Creates a clique of communicators (single process version).

This is a convenience function to create a single-process communicator clique. Returns an array of ndev newly initialized communicators in comm. comm should be pre-allocated with a size of at least ndev*sizeof(ncclComm_t). If devlist is NULL, the first ndev HIP devices are used. The order of devlist defines the user-order of devices within the communicator.
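
A single-process sketch using all visible HIP devices (error checking omitted):

    int ndev = 0;
    hipGetDeviceCount(&ndev);

    ncclComm_t* comms = (ncclComm_t*)malloc(ndev * sizeof(ncclComm_t));
    ncclCommInitAll(comms, ndev, NULL);        /* NULL devlist: use the first ndev HIP devices */

    /* ... launch collectives on the ndev communicators ... */

    for (int i = 0; i < ndev; ++i) ncclCommDestroy(comms[i]);
    free(comms);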

ncclResult_t ncclCommFinalize(ncclComm_t comm)

Finalize a communicator.

ncclCommFinalize flushes all issued communications and marks the communicator state as ncclInProgress. The state will change to ncclSuccess when the communicator is globally quiescent and its related resources are freed; calling ncclCommDestroy can then locally free the rest of the resources (e.g. the communicator itself) without blocking.
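
A sketch of the intended shutdown sequence (poll until the finalize completes, then destroy):

    ncclCommFinalize(comm);                    /* flush outstanding work; state becomes ncclInProgress */

    ncclResult_t state = ncclInProgress;
    do {
      ncclCommGetAsyncError(comm, &state);     /* poll until globally quiescent */
    } while (state == ncclInProgress);

    ncclCommDestroy(comm);                     /* now only frees local resources, without blocking */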

ncclResult_t ncclCommDestroy(ncclComm_t comm)

Frees local resources associated with communicator object.

ncclResult_t ncclCommAbort(ncclComm_t comm)

Frees resources associated with communicator object and aborts any operations that might still be running on the device.

const char *ncclGetErrorString(ncclResult_t result)

Returns a string for each error code.
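
A common application-side pattern (not part of this header) is to wrap calls in a checking macro built on ncclGetErrorString; requires <stdio.h> and <stdlib.h>:

    #define NCCLCHECK(cmd) do {                                   \
      ncclResult_t r = (cmd);                                     \
      if (r != ncclSuccess) {                                     \
        fprintf(stderr, "NCCL failure %s:%d '%s'\n",              \
                __FILE__, __LINE__, ncclGetErrorString(r));       \
        exit(EXIT_FAILURE);                                       \
      }                                                           \
    } while (0)

    /* usage: NCCLCHECK(ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum, comm, stream)); */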

const char *ncclGetLastError(ncclComm_t comm)

Returns a human-readable message of the last error that occurred. comm is currently unused and can be set to NULL.

ncclResult_t ncclCommGetAsyncError(ncclComm_t comm, ncclResult_t *asyncError)
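
Queries the progress and error state of the communicator's asynchronous operations. For non-blocking communicators, asyncError is set to ncclInProgress while issued operations are still being processed, and to ncclSuccess (or an error code) once they have completed.
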
ncclResult_t ncclCommCount(const ncclComm_t comm, int *count)

Gets the number of ranks in the communicator clique.

ncclResult_t ncclCommCuDevice(const ncclComm_t comm, int *device)

Returns the HIP device number associated with the communicator.

ncclResult_t ncclCommUserRank(const ncclComm_t comm, int *rank)

Returns the user-ordered “rank” associated with the communicator.

ncclResult_t ncclRedOpCreatePreMulSum(ncclRedOp_t *op, void *scalar, ncclDataType_t datatype, ncclScalarResidence_t residence, ncclComm_t comm)

Creates a new reduction operator which pre-multiplies input values by a given scalar locally before reducing them with peer values via summation. For use only with collectives launched against comm and datatype. The residence argument indicates how/when the memory pointed to by scalar will be dereferenced. Upon return, the newly created operator’s handle is stored in op.
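
For example, an averaging all-reduce can be sketched by pre-multiplying each input by 1/nranks; with ncclScalarHostImmediate the scalar is read from host memory by the create call itself:

    float scale = 1.0f / nranks;               /* scalar applied to every input element */
    ncclRedOp_t premulSum;
    ncclRedOpCreatePreMulSum(&premulSum, &scale, ncclFloat32,
                             ncclScalarHostImmediate, comm);

    ncclAllReduce(sendbuff, recvbuff, count, ncclFloat32, premulSum, comm, stream);

    ncclRedOpDestroy(premulSum, comm);         /* safe once the last call using the op has returned */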

ncclResult_t ncclRedOpDestroy(ncclRedOp_t op, ncclComm_t comm)

ncclRedOpDestroy

Destroys the reduction operator op. The operator must have been created by ncclRedOpCreatePreMulSum with the matching communicator comm. An operator may be destroyed as soon as the last NCCL function which is given that operator returns.

ncclResult_t ncclReduce(const void *sendbuff, void *recvbuff, size_t count, ncclDataType_t datatype, ncclRedOp_t op, int root, ncclComm_t comm, hipStream_t stream)

Reduce.

Reduces data arrays of length count in sendbuff into recvbuff using the op operation. recvbuff may be NULL on all ranks except the root. root is the rank (not the HIP device) where data will reside after the operation is complete.

In-place operation will happen if sendbuff == recvbuff.

ncclResult_t ncclBcast(void *buff, size_t count, ncclDataType_t datatype, int root, ncclComm_t comm, hipStream_t stream)

(deprecated) Broadcast (in-place)

Copies count values from root to all other devices. root is the rank (not the HIP device) where data resides before the operation is started.

This operation is implicitly in-place.

ncclResult_t ncclBroadcast(const void *sendbuff, void *recvbuff, size_t count, ncclDataType_t datatype, int root, ncclComm_t comm, hipStream_t stream)

Broadcast.

Copies count values from root to all other devices. root is the rank (not the HIP device) where data resides before the operation is started.

In-place operation will happen if sendbuff == recvbuff.

ncclResult_t ncclAllReduce(const void *sendbuff, void *recvbuff, size_t count, ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, hipStream_t stream)

All-Reduce.

Reduces data arrays of length count in sendbuff using op operation, and leaves identical copies of result on each recvbuff.

In-place operation will happen if sendbuff == recvbuff.
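
A minimal per-rank sketch (communicator setup as in ncclCommInitRank above; error checking omitted):

    const size_t count = 1 << 20;
    float *sendbuff, *recvbuff;
    hipMalloc((void**)&sendbuff, count * sizeof(float));
    hipMalloc((void**)&recvbuff, count * sizeof(float));
    hipStream_t stream;
    hipStreamCreate(&stream);

    /* ... fill sendbuff ... */

    ncclAllReduce(sendbuff, recvbuff, count, ncclFloat32, ncclSum, comm, stream);
    hipStreamSynchronize(stream);              /* wait for the collective issued on this stream */

    hipFree(sendbuff);
    hipFree(recvbuff);
    hipStreamDestroy(stream);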

ncclResult_t ncclReduceScatter(const void *sendbuff, void *recvbuff, size_t recvcount, ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, hipStream_t stream)

Reduce-Scatter.

Reduces data in sendbuff using op operation and leaves reduced result scattered over the devices so that recvbuff on rank i will contain the i-th block of the result. Assumes sendcount is equal to nranks*recvcount, which means that sendbuff should have a size of at least nranks*recvcount elements.

In-place operations will happen if recvbuff == sendbuff + rank * recvcount.
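
The in-place convention can be sketched as follows: sendbuff holds nranks contiguous blocks of recvcount elements, and each rank's receive area is its own block inside that buffer (rank, nranks, recvcount, comm, and stream are assumed to be set up already):

    float* sendbuff;
    hipMalloc((void**)&sendbuff, (size_t)nranks * recvcount * sizeof(float));
    /* ... fill all nranks*recvcount elements ... */

    float* recvbuff = sendbuff + (size_t)rank * recvcount;   /* this rank's block => in-place */
    ncclReduceScatter(sendbuff, recvbuff, recvcount, ncclFloat32, ncclSum, comm, stream);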

ncclResult_t ncclAllGather(const void *sendbuff, void *recvbuff, size_t sendcount, ncclDataType_t datatype, ncclComm_t comm, hipStream_t stream)

All-Gather.

Each device gathers sendcount values from other GPUs into recvbuff, receiving data from rank i at offset i*sendcount. Assumes recvcount is equal to nranks*sendcount, which means that recvbuff should have a size of at least nranks*sendcount elements.

In-place operations will happen if sendbuff == recvbuff + rank * sendcount.

ncclResult_t ncclSend(const void *sendbuff, size_t count, ncclDataType_t datatype, int peer, ncclComm_t comm, hipStream_t stream)

Send.

Send data from sendbuff to rank peer. Rank peer needs to call ncclRecv with the same datatype and the same count from this rank.

This operation is blocking for the GPU. If multiple ncclSend and ncclRecv operations need to progress concurrently to complete, they must be fused within a ncclGroupStart/ ncclGroupEnd section.

ncclResult_t ncclRecv(void *recvbuff, size_t count, ncclDataType_t datatype, int peer, ncclComm_t comm, hipStream_t stream)

Receive.

Receive data from rank peer into recvbuff. Rank peer needs to call ncclSend with the same datatype and the same count to this rank.

This operation is blocking for the GPU. If multiple ncclSend and ncclRecv operations need to progress concurrently to complete, they must be fused within a ncclGroupStart/ ncclGroupEnd section.
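
For example, a ring exchange where every rank sends to its right neighbor and receives from its left neighbor must fuse the two calls so they can progress concurrently; a minimal sketch:

    int next = (rank + 1) % nranks;
    int prev = (rank + nranks - 1) % nranks;

    ncclGroupStart();
    ncclSend(sendbuff, count, ncclFloat32, next, comm, stream);   /* to the right neighbor  */
    ncclRecv(recvbuff, count, ncclFloat32, prev, comm, stream);   /* from the left neighbor */
    ncclGroupEnd();
    hipStreamSynchronize(stream);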

ncclResult_t ncclGather(const void *sendbuff, void *recvbuff, size_t sendcount, ncclDataType_t datatype, int root, ncclComm_t comm, hipStream_t stream)

Gather.

Root device gathers sendcount values from other GPUs into recvbuff, receiving data from rank i at offset i*sendcount.

Assumes recvcount is equal to nranks*sendcount, which means that recvbuff should have a size of at least nranks*sendcount elements.

In-place operations will happen if sendbuff == recvbuff + rank * sendcount.

ncclResult_t ncclScatter(const void *sendbuff, void *recvbuff, size_t recvcount, ncclDataType_t datatype, int root, ncclComm_t comm, hipStream_t stream)

Scatter.

The root device scatters sendbuff over the devices so that recvbuff on rank i will contain the i-th block of the data on root.

Assumes sendcount is equal to nranks*recvcount, which means that sendbuff should have a size of at least nranks*recvcount elements.

In-place operations will happen if recvbuff == sendbuff + rank * recvcount.

ncclResult_t ncclAllToAll(const void *sendbuff, void *recvbuff, size_t count, ncclDataType_t datatype, ncclComm_t comm, hipStream_t stream)

All-To-All.

Device (i) sends the (j)th block of data to device (j), where it is placed as the (i)th block. Each block for sending/receiving has count elements, which means that recvbuff and sendbuff should have a size of at least nranks*count elements.

In-place operation will happen if sendbuff == recvbuff.

ncclResult_t ncclAllToAllv(const void *sendbuff, const size_t sendcounts[], const size_t sdispls[], void *recvbuff, const size_t recvcounts[], const size_t rdispls[], ncclDataType_t datatype, ncclComm_t comm, hipStream_t stream)

All-To-Allv.

Device (i) sends sendcounts[j] elements of data starting at offset sdispls[j] to device (j). At the same time, device (i) receives recvcounts[j] elements of data from device (j), which are placed at offset rdispls[j].

sendcounts, sdispls, recvcounts and rdispls are all measured in the units of datatype, not bytes.

In-place operation will happen if sendbuff == recvbuff.
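
A sketch with a simple uniform layout in which every rank exchanges blockSize elements with every peer; all names other than the API call are illustrative, and the count/displacement arrays are assumed here to be ordinary host arrays:

    size_t* sendcounts = (size_t*)malloc(nranks * sizeof(size_t));
    size_t* sdispls    = (size_t*)malloc(nranks * sizeof(size_t));
    size_t* recvcounts = (size_t*)malloc(nranks * sizeof(size_t));
    size_t* rdispls    = (size_t*)malloc(nranks * sizeof(size_t));
    for (int r = 0; r < nranks; ++r) {
      sendcounts[r] = blockSize;               /* elements sent to rank r          */
      sdispls[r]    = (size_t)r * blockSize;   /* offset of that block in sendbuff */
      recvcounts[r] = blockSize;               /* elements received from rank r    */
      rdispls[r]    = (size_t)r * blockSize;   /* offset of that block in recvbuff */
    }

    ncclAllToAllv(sendbuff, sendcounts, sdispls,
                  recvbuff, recvcounts, rdispls,
                  ncclFloat32, comm, stream);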

ncclResult_t mscclLoadAlgo(const char *mscclAlgoFilePath, mscclAlgoHandle_t *mscclAlgoHandle)

MSCCL Load Algorithm.

Loads the MSCCL algorithm file specified in mscclAlgoFilePath and returns its handle via mscclAlgoHandle. This API is expected to be called by the MSCCL scheduler rather than by end users.

ncclResult_t pmscclLoadAlgo(const char *mscclAlgoFilePath, mscclAlgoHandle_t *mscclAlgoHandle)
ncclResult_t mscclRunAlgo(const void *sendBuff, const size_t sendCounts[], const size_t sDisPls[], void *recvBuff, const size_t recvCounts[], const size_t rDisPls[], size_t count, ncclDataType_t dataType, int root, int peer, ncclRedOp_t op, mscclAlgoHandle_t mscclAlgoHandle, ncclComm_t comm, hipStream_t stream)

MSCCL Run Algorithm.

Runs the MSCCL algorithm specified by mscclAlgoHandle. The parameter list merges all possible parameters required by different operations, as this is a general-purpose API. This API is expected to be called by the MSCCL scheduler rather than by end users.

ncclResult_t pmscclRunAlgo(const void *sendBuff, const size_t sendCounts[], const size_t sDisPls[], void *recvBuff, const size_t recvCounts[], const size_t rDisPls[], size_t count, ncclDataType_t dataType, int root, int peer, ncclRedOp_t op, mscclAlgoHandle_t mscclAlgoHandle, ncclComm_t comm, hipStream_t stream)
ncclResult_t mscclUnloadAlgo(mscclAlgoHandle_t mscclAlgoHandle)

MSCCL Unload Algorithm.

Unloads the MSCCL algorithm previously loaded, identified by its handle. This API is expected to be called by the MSCCL scheduler rather than by end users.

ncclResult_t pmscclUnloadAlgo(mscclAlgoHandle_t mscclAlgoHandle)
ncclResult_t ncclGroupStart()

Group Start.

Start a group call. All calls to NCCL until ncclGroupEnd will be fused into a single NCCL operation. Nothing will be started on the HIP stream until ncclGroupEnd.

ncclResult_t ncclGroupEnd()

Group End.

End a group call. Start a fused NCCL operation consisting of all calls since ncclGroupStart. Operations on the HIP stream that depend on the NCCL operations need to be enqueued after ncclGroupEnd.
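
For example, a single thread driving several devices (as created by ncclCommInitAll) needs group semantics so that the per-device calls do not block each other; a minimal sketch:

    ncclGroupStart();
    for (int i = 0; i < ndev; ++i) {
      /* one call per device/communicator; nothing is launched until ncclGroupEnd */
      ncclAllReduce(sendbuff[i], recvbuff[i], count, ncclFloat32, ncclSum,
                    comms[i], streams[i]);
    }
    ncclGroupEnd();

    for (int i = 0; i < ndev; ++i) {
      hipSetDevice(i);
      hipStreamSynchronize(streams[i]);        /* the work on each stream completes after ncclGroupEnd */
    }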