API Reference

ACCL

class pyaccl.accl.ACCLArithConfig(uncompressed_elem_bytes, compressed_elem_bytes, elem_ratio_log, compressor_tdest, decompressor_tdest, arith_is_compressed, arith_tdest)

Bases: object

An arithmetic configuration for the ACCL hardware datapath. The datapath includes a variety of hardware functions for compression and reduction. Compression functions convert between compressed and uncompressed representations of data. Reduction functions apply arithmetic-logic functions to two buffers, producing a third. Each class of hardware function is identified by a unique ID.

Args:

uncompressed_elem_bytes (int): Number of bytes in the uncompressed datatype

compressed_elem_bytes (int): Number of bytes in the compressed datatype

elem_ratio_log (int): Log of number of uncompressed elements required to produce one compressed element

compressor_tdest (int): Hardware function ID of compressor

decompressor_tdest (int): Hardware function ID of decompressor

arith_is_compressed (bool): Indicate whether any arithmetic is to be performed on compressed or uncompressed data.

arith_tdest (list of int): List of hardware function IDs corresponding to the supported reduction operations
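To make the role of each argument concrete, here is a minimal sketch that mirrors the constructor with a plain dataclass. `ArithConfigSketch` and its `arith_elem_bytes` helper are hypothetical stand-ins, not part of pyaccl, and the numeric values (including the function IDs) are illustrative assumptions for a float32-to-float16 configuration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ArithConfigSketch:
    """Hypothetical stand-in mirroring the ACCLArithConfig constructor arguments."""
    uncompressed_elem_bytes: int
    compressed_elem_bytes: int
    elem_ratio_log: int
    compressor_tdest: int
    decompressor_tdest: int
    arith_is_compressed: bool
    arith_tdest: List[int]

    def arith_elem_bytes(self) -> int:
        # Width of the elements the reduction function operates on.
        if self.arith_is_compressed:
            return self.compressed_elem_bytes
        return self.uncompressed_elem_bytes


# Illustrative values only: float32 data compressed to float16,
# with arithmetic performed on the compressed representation.
fp32_to_fp16 = ArithConfigSketch(4, 2, 0, 0, 1, True, [4])
print(fp32_to_fp16.arith_elem_bytes())
```

With `arith_is_compressed` set, the sketch reports 2 bytes per element, i.e. the reduction would run on the compressed datatype.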

class pyaccl.accl.accl(nranks, local_rank, ranks=None, protocol=None, nbufs=16, bufsize=1024, arith_config={('float16', 'float16'): <pyaccl.accl.ACCLArithConfig object>, ('float32', 'float16'): <pyaccl.accl.ACCLArithConfig object>, ('float32', 'float32'): <pyaccl.accl.ACCLArithConfig object>, ('float64', 'float64'): <pyaccl.accl.ACCLArithConfig object>, ('int32', 'int32'): <pyaccl.accl.ACCLArithConfig object>, ('int64', 'int64'): <pyaccl.accl.ACCLArithConfig object>}, sim_mode=False, xclbin=None, sim_sock=None, board_idx=None, cclo_idx=0)

Bases: object

ACCL Python Driver

allgather(sbuf, rbuf, count, comm_id=0, from_fpga=False, to_fpga=False, compress_dtype=None, run_async=False)

Fused gather-bcast

Args:

sbuf (ACCLBuffer): Buffer from which to send data

rbuf (ACCLBuffer): Buffer into which to receive data

count (int): Number of elements to copy

comm_id (int, optional): Index in the internal communicator list. Defaults to 0 which is the global communicator.

from_fpga (bool, optional): Send without syncing sbuf first, assuming the data is already in FPGA memory. Defaults to False.

to_fpga (bool, optional): Return without syncing rbuf first, assuming the data is not required in host memory. Defaults to False.

compress_dtype (NumPy datatype, optional): A NumPy datatype to which the CCLO will compress the data before sending on the wire. Defaults to None.

run_async (bool, optional): Return handle to call instead of waiting for completion. Defaults to False.

Returns:

handle to Pynq call: When run_async is True, returns a handle to a pynq call, which can be waited on. Otherwise, returns None.
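The data movement of an allgather can be sketched in plain Python, without any hardware. `allgather_sim` is a hypothetical helper that assumes the conventional semantics: every rank contributes `count` elements and every rank receives the contributions of all ranks, concatenated in rank order.

```python
def allgather_sim(send_bufs, count):
    """Simulate allgather over a list of per-rank send buffers.

    Returns one receive buffer per rank; each holds the first `count`
    elements of every rank's send buffer, in rank order.
    """
    gathered = []
    for sbuf in send_bufs:
        gathered.extend(sbuf[:count])
    # Every rank receives its own copy of the gathered data.
    return [list(gathered) for _ in send_bufs]


print(allgather_sim([[1, 2], [3, 4], [5, 6]], 2))
```

Under this assumption, rbuf must have room for count times the number of ranks in the communicator.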

allocate(shape, dtype=<class 'numpy.float32'>, physical_address=None, prealloc=True)

Allocates an ACCLBuffer in the device memory associated with this CCLO instance.

Args:

shape (tuple): The shape of the desired buffer.

dtype (NumPy datatype, optional): Desired data type. Defaults to np.float32.

physical_address (int, optional): Physical address override. Defaults to None.

prealloc (bool, optional): Populate the device-side memory immediately upon buffer creation. Defaults to True.

Returns:

ACCLBuffer: Handle to the created ACCL buffer.

allreduce(sbuf, rbuf, count, func, comm_id=0, from_fpga=False, to_fpga=False, compress_dtype=None, run_async=False)

Fused reduce-bcast

Args:

sbuf (ACCLBuffer): Buffer from which to send data

rbuf (ACCLBuffer): Buffer into which to receive data

count (int): Number of elements to copy

func (int): Index of function to be applied from ACCLReduceFunctions

comm_id (int, optional): Index in the internal communicator list. Defaults to 0 which is the global communicator.

from_fpga (bool, optional): Send without syncing sbuf first, assuming the data is already in FPGA memory. Defaults to False.

to_fpga (bool, optional): Return without syncing rbuf first, assuming the data is not required in host memory. Defaults to False.

compress_dtype (NumPy datatype, optional): A NumPy datatype to which the CCLO will compress the data before sending on the wire. Defaults to None.

run_async (bool, optional): Return handle to call instead of waiting for completion. Defaults to False.

Returns:

handle to Pynq call: When run_async is True, returns a handle to a pynq call, which can be waited on. Otherwise, returns None.
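As a reference for the expected result, the following sketch simulates an allreduce with the SUM function on plain Python lists. `allreduce_sum_sim` is a hypothetical helper, not a pyaccl function; it only models the result each rank should observe, not how the CCLO computes it.

```python
def allreduce_sum_sim(send_bufs, count):
    """Simulate allreduce(SUM): reduce elementwise, then give every rank the result."""
    reduced = [sum(sbuf[i] for sbuf in send_bufs) for i in range(count)]
    # The broadcast half: each rank receives its own copy of the reduction.
    return [list(reduced) for _ in send_bufs]


print(allreduce_sum_sim([[1, 2, 3], [10, 20, 30]], 3))
```

A simulation like this can be useful for checking the contents of rbuf after a real call.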

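barrier(comm_id=0)

Blocks until all ACCL instances in the communicator have reached the barrier.

Args:

comm_id (int, optional): Index in the internal communicator list. Defaults to 0 which is the global communicator.
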
bcast(buf, count, root, comm_id=0, from_fpga=False, to_fpga=False, compress_dtype=None, run_async=False)

Broadcast data to all ACCL instances in a communicator

Args:

buf (ACCLBuffer): Buffer from which to send data, or into which to receive.

count (int): Number of elements to copy

root (int): Index of the root, i.e. the ACCL instance which sends from buf. All others receive into buf.

comm_id (int, optional): Index in the internal communicator list. Defaults to 0 which is the global communicator.

from_fpga (bool, optional): Send without syncing buf first, assuming the data is already in FPGA memory. Defaults to False.

to_fpga (bool, optional): Return without syncing buf first, assuming the data is not required in host memory. Defaults to False.

compress_dtype (NumPy datatype, optional): A NumPy datatype to which the CCLO will compress the data before sending on the wire. Defaults to None.

run_async (bool, optional): Return handle to call instead of waiting for completion. Defaults to False.

Returns:

handle to Pynq call: When run_async is True, returns a handle to a pynq call, which can be waited on. Otherwise, returns None.

combine(count, func, val1, val2, result, val1_from_fpga=False, val2_from_fpga=False, to_fpga=False, run_async=False)

Combine data from two buffers and put result in a third buffer

Args:

count (int): Number of elements to copy

func (int): Index of function to be applied from ACCLReduceFunctions

val1, val2 (ACCLBuffers): Operand buffers

result (ACCLBuffer): Result buffer

val1_from_fpga, val2_from_fpga (bool, optional): Combine without syncing operand buffers first, assuming the data is already in FPGA memory. Defaults to False.

to_fpga (bool, optional): Return without syncing result first, assuming the data is not required in host memory. Defaults to False.

run_async (bool, optional): Return handle to call instead of waiting for completion. Defaults to False.

Returns:

handle to Pynq call: When run_async is True, returns a handle to a pynq call, which can be waited on. Otherwise, returns None.
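The effect of combine with the SUM function can be modelled on host-side lists as below. `combine_sum_sim` is a hypothetical reference model that follows the argument order of the signature above; it does not call into pyaccl.

```python
def combine_sum_sim(count, val1, val2, result):
    """Write the elementwise sum of the first `count` operands into `result`."""
    for i in range(count):
        result[i] = val1[i] + val2[i]
    return result


out = [0, 0, 0, 0]
combine_sum_sim(3, [1, 2, 3, 4], [10, 20, 30, 40], out)
print(out)
```

Only the first `count` elements of the result are written; the remainder is left untouched in this model.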

copy(srcbuf, dstbuf, count, from_fpga=False, to_fpga=False, run_async=False)

Copy data between two buffers

Args:

srcbuf (ACCLBuffer): Buffer from which to send data

dstbuf (ACCLBuffer): Buffer into which to receive data

count (int): Number of elements to copy

from_fpga (bool, optional): Send without syncing srcbuf first, assuming the data is already in FPGA memory. Defaults to False.

to_fpga (bool, optional): Return without syncing dstbuf first, assuming the data is not required in host memory. Defaults to False.

run_async (bool, optional): Return handle to call instead of waiting for completion. Defaults to False.

Returns:

handle to Pynq call: When run_async is True, returns a handle to a pynq call, which can be waited on. Otherwise, returns None.

deinit()

De-initializes an ACCL instance, resetting the CCLO kernel and deallocating all internal buffers, but not buffers created by users with allocate()

gather(sbuf, rbuf, count, root, comm_id=0, from_fpga=False, to_fpga=False, compress_dtype=None, run_async=False)

Gathers data from all ACCL instances in a communicator

Args:

sbuf (ACCLBuffer): Buffer from which to send data

rbuf (ACCLBuffer): Buffer into which to receive data

count (int): Number of elements to copy

root (int): Index of the root, i.e. the ACCL instance which receives into rbuf. All others send from sbuf.

comm_id (int, optional): Index in the internal communicator list. Defaults to 0 which is the global communicator.

from_fpga (bool, optional): Send without syncing sbuf first, assuming the data is already in FPGA memory. Defaults to False.

to_fpga (bool, optional): Return without syncing rbuf first, assuming the data is not required in host memory. Defaults to False.

compress_dtype (NumPy datatype, optional): A NumPy datatype to which the CCLO will compress the data before sending on the wire. Defaults to None.

run_async (bool, optional): Return handle to call instead of waiting for completion. Defaults to False.

Returns:

handle to Pynq call: When run_async is True, returns a handle to a pynq call, which can be waited on. Otherwise, returns None.
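The following sketch models gather on plain lists, assuming the conventional semantics in which only the root's receive buffer is written and contributions are placed in rank order. `gather_sim` is a hypothetical helper.

```python
def gather_sim(send_bufs, recv_bufs, count, root):
    """Simulate gather: the root receives `count` elements from every rank."""
    gathered = []
    for sbuf in send_bufs:
        gathered.extend(sbuf[:count])
    # Only the root's receive buffer is modified.
    recv_bufs[root][:len(gathered)] = gathered
    return recv_bufs


print(gather_sim([[1, 2], [3, 4]], [[0] * 4, [0] * 4], 2, root=1))
```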

nop(run_async=False)

Calls the accelerator with no work. Useful for measuring call latency

Args:

run_async (bool, optional): Whether to execute asynchronously. Defaults to False.

Returns:

handle to Pynq call: When run_async is True, returns a handle to a pynq call, which can be waited on. Otherwise, returns None.
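A latency measurement built on a no-op call might look like the sketch below. `mean_call_latency` is a hypothetical helper; a do-nothing lambda stands in for the driver so the example runs without hardware. With a configured driver instance (here assumed to be named `accl_instance`), one would pass `accl_instance.nop` instead.

```python
import time


def mean_call_latency(call, repeats=100):
    """Return the mean wall-clock time in seconds of `call()` over `repeats` runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        call()
    return (time.perf_counter() - start) / repeats


# Stand-in for a real no-op call into the accelerator.
latency = mean_call_latency(lambda: None, repeats=1000)
print(f"mean latency: {latency * 1e6:.3f} us")
```

Averaging over many calls reduces the influence of timer resolution and one-off scheduling delays.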

recv(dstbuf, count, src, tag=4294967295, comm_id=0, to_fpga=False, compress_dtype=None, run_async=False)

Receive data from a remote ACCL instance

Args:

dstbuf (ACCLBuffer): Buffer into which to receive data

count (int): Number of elements to copy

src (int): Rank index of source, in the selected communicator

tag (int, optional): Message tag. Defaults to TAG_ANY (4294967295).

comm_id (int, optional): Index in the internal communicator list. Defaults to 0 which is the global communicator.

to_fpga (bool, optional): Return without syncing dstbuf first, assuming the data is not required in host memory. Defaults to False.

compress_dtype (NumPy datatype, optional): A NumPy datatype to which the CCLO will compress the data before sending on the wire. Defaults to None.

run_async (bool, optional): Return handle to call instead of waiting for completion. Defaults to False.

Returns:

handle to Pynq call: When run_async is True, returns a handle to a pynq call, which can be waited on. Otherwise, returns None.

reduce(sbuf, rbuf, count, root, func, comm_id=0, from_fpga=False, to_fpga=False, compress_dtype=None, run_async=False)

Combine data from multiple ACCL instances, using a reduction function

Args:

sbuf (ACCLBuffer): Buffer from which to send data

rbuf (ACCLBuffer): Buffer into which to receive data

count (int): Number of elements to copy

root (int): Index of the root, i.e. the ACCL instance which writes to rbuf. All others send from sbuf.

func (int): Index of function to be applied from ACCLReduceFunctions

comm_id (int, optional): Index in the internal communicator list. Defaults to 0 which is the global communicator.

from_fpga (bool, optional): Send without syncing sbuf first, assuming the data is already in FPGA memory. Defaults to False.

to_fpga (bool, optional): Return without syncing rbuf first, assuming the data is not required in host memory. Defaults to False.

compress_dtype (NumPy datatype, optional): A NumPy datatype to which the CCLO will compress the data before sending on the wire. Defaults to None.

run_async (bool, optional): Return handle to call instead of waiting for completion. Defaults to False.

Returns:

handle to Pynq call: When run_async is True, returns a handle to a pynq call, which can be waited on. Otherwise, returns None.

reduce_scatter(sbuf, rbuf, count, func, comm_id=0, from_fpga=False, to_fpga=False, compress_dtype=None, run_async=False)

Fused reduce-scatter

Args:

sbuf (ACCLBuffer): Buffer from which to send data

rbuf (ACCLBuffer): Buffer into which to receive data

count (int): Number of elements to copy

func (int): Index of function to be applied from ACCLReduceFunctions

comm_id (int, optional): Index in the internal communicator list. Defaults to 0 which is the global communicator.

from_fpga (bool, optional): Send without syncing sbuf first, assuming the data is already in FPGA memory. Defaults to False.

to_fpga (bool, optional): Return without syncing rbuf first, assuming the data is not required in host memory. Defaults to False.

compress_dtype (NumPy datatype, optional): A NumPy datatype to which the CCLO will compress the data before sending on the wire. Defaults to None.

run_async (bool, optional): Return handle to call instead of waiting for completion. Defaults to False.

Returns:

handle to Pynq call: When run_async is True, returns a handle to a pynq call, which can be waited on. Otherwise, returns None.
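A reduce-scatter can be modelled as an elementwise reduction followed by handing each rank one slice of the result. The sketch below assumes that `count` is the number of elements each rank receives, so each send buffer holds count times the number of ranks; this interpretation and the helper `reduce_scatter_sum_sim` are assumptions, not taken from the pyaccl source.

```python
def reduce_scatter_sum_sim(send_bufs, count):
    """Simulate reduce_scatter(SUM): rank r receives slice r of the reduced data."""
    nranks = len(send_bufs)
    reduced = [sum(sbuf[i] for sbuf in send_bufs) for i in range(nranks * count)]
    return [reduced[r * count:(r + 1) * count] for r in range(nranks)]


print(reduce_scatter_sum_sim([[1, 2, 3, 4], [10, 20, 30, 40]], 2))
```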

scatter(sbuf, rbuf, count, root, comm_id=0, from_fpga=False, to_fpga=False, compress_dtype=None, run_async=False)

Scatter data to all ACCL instances in a communicator

Args:

sbuf (ACCLBuffer): Buffer from which to send data

rbuf (ACCLBuffer): Buffer into which to receive data

count (int): Number of elements to copy

root (int): Index of the root, i.e. the ACCL instance which sends from sbuf. All others receive into rbuf.

comm_id (int, optional): Index in the internal communicator list. Defaults to 0 which is the global communicator.

from_fpga (bool, optional): Send without syncing sbuf first, assuming the data is already in FPGA memory. Defaults to False.

to_fpga (bool, optional): Return without syncing rbuf first, assuming the data is not required in host memory. Defaults to False.

compress_dtype (NumPy datatype, optional): A NumPy datatype to which the CCLO will compress the data before sending on the wire. Defaults to None.

run_async (bool, optional): Return handle to call instead of waiting for completion. Defaults to False.

Returns:

handle to Pynq call: When run_async is True, returns a handle to a pynq call, which can be waited on. Otherwise, returns None.
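Scatter is the inverse of gather: the root's send buffer is cut into consecutive chunks and rank r receives chunk r. The hypothetical `scatter_sim` below models this on a plain list, assuming `count` is the chunk size per rank.

```python
def scatter_sim(root_sbuf, nranks, count):
    """Simulate scatter: return the `count`-element chunk each rank receives."""
    return [root_sbuf[r * count:(r + 1) * count] for r in range(nranks)]


print(scatter_sim([1, 2, 3, 4, 5, 6], nranks=3, count=2))
```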

send(srcbuf, count, dst, tag=4294967295, comm_id=0, from_fpga=False, compress_dtype=None, stream_flags=ACCLStreamFlags.NO_STREAM, run_async=False)

Send data to a remote ACCL instance

Args:

srcbuf (ACCLBuffer): Buffer from which to send data

count (int): Number of elements to copy

dst (int): Rank index of destination, in the selected communicator

tag (int, optional): Message tag. Defaults to TAG_ANY (4294967295).

comm_id (int, optional): Index in the internal communicator list. Defaults to 0 which is the global communicator.

from_fpga (bool, optional): Send without syncing srcbuf first, assuming the data is already in FPGA memory. Defaults to False.

compress_dtype (NumPy datatype, optional): A NumPy datatype to which the CCLO will compress the data before sending on the wire. Defaults to None.

stream_flags (int, optional): Indicates streaming options. Defaults to ACCLStreamFlags.NO_STREAM.

run_async (bool, optional): Return handle to call instead of waiting for completion. Defaults to False.

Returns:

handle to Pynq call: When run_async is True, returns a handle to a pynq call, which can be waited on. Otherwise, returns None.
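The default tag value 4294967295 in the send and recv signatures is the all-ones 32-bit word. The sketch below shows one plausible matching rule, in which a receive posted with the wildcard accepts any message tag. The rule and the helper `tag_matches` are assumptions for illustration; the actual matching is done by the CCLO and is not specified in this reference.

```python
TAG_ANY = 0xFFFFFFFF  # 4294967295, the documented default for `tag`


def tag_matches(recv_tag, msg_tag):
    """Assumed rule: a wildcard receive matches anything, otherwise tags must be equal."""
    return recv_tag == TAG_ANY or recv_tag == msg_tag


print(TAG_ANY, tag_matches(TAG_ANY, 7), tag_matches(5, 7))
```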

split_communicator(indices)

Creates a new communicator from the global communicator by filtering it with a list of indices

Args:

indices (list of int): List of rank indices to include in the new communicator
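Filtering a communicator by indices amounts to selecting entries from the global rank list. In the sketch below the ranks are represented by placeholder strings, and `split_ranks_sim` is a hypothetical helper; how pyaccl represents ranks internally is not covered by this reference.

```python
def split_ranks_sim(global_ranks, indices):
    """Return the sub-list of ranks selected by `indices`, in the order given."""
    return [global_ranks[i] for i in indices]


global_ranks = ["rank0", "rank1", "rank2", "rank3"]
print(split_ranks_sim(global_ranks, [0, 2]))
```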

Buffer

ACCLBuffer objects are similar to Pynq buffers but may also interact with the ACCL emulator rather than just Alveo memory. Users should not create ACCL buffers explicitly but should instead utilize the allocate() function of a specific ACCL instance.

class pyaccl.buffer.ACCLBuffer(shape, dtype=<class 'numpy.float32'>, target=None, zmqsocket=None, physical_address=None, prealloc=True)

Bases: ndarray

property device_address

Get physical address in FPGA memory

Returns:

int: Physical address

sync_from_device()

Copy buffer data from the device to the host

sync_to_device()

Copy buffer data from the host to the device

Constants

class pyaccl.constants.ACCLCompressionFlags(value)

Bases: IntEnum

Compression flags

ETH_COMPRESSED = 8

Apply over-the-wire compression

NO_COMPRESSION = 0

No compression

OP0_COMPRESSED = 1

First input buffer is compressed

OP1_COMPRESSED = 2

Second input buffer is compressed

RES_COMPRESSED = 4

Result buffer is compressed
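The flag values are distinct bits, so several flags can be combined with a bitwise OR and recovered with a bitwise AND. The sketch below re-declares the values listed above in a local enum, `CompressionFlagsSketch`, so that it runs without pyaccl installed. The helper `set_flags` is hypothetical.

```python
from enum import IntEnum


class CompressionFlagsSketch(IntEnum):
    # Local copy of the documented values, so the example runs without pyaccl.
    NO_COMPRESSION = 0
    OP0_COMPRESSED = 1
    OP1_COMPRESSED = 2
    RES_COMPRESSED = 4
    ETH_COMPRESSED = 8


def set_flags(mask):
    """Return the names of all non-zero flags present in an integer mask."""
    return [f.name for f in CompressionFlagsSketch if f.value and mask & f.value]


mask = CompressionFlagsSketch.OP0_COMPRESSED | CompressionFlagsSketch.ETH_COMPRESSED
print(int(mask), set_flags(mask))
```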

class pyaccl.constants.ACCLReduceFunctions(value)

Bases: IntEnum

CCLO reduction functions

SUM = 0

Elementwise sum of vectors

class pyaccl.constants.ACCLStreamFlags(value)

Bases: IntEnum

Stream flags

NO_STREAM = 0

No streaming. All operands and results are in memory

OP0_STREAM = 1

The first operand is pulled from stream instead of memory

RES_STREAM = 2

The result is pushed to stream instead of memory

class pyaccl.constants.ErrorCode(value)

Bases: IntEnum

Error codes returned by CCLO kernel in FPGA

ARITH_ERROR = 524288
COLLECTIVE_NOT_IMPLEMENTED = 16384
COMPRESSION_ERROR = 4194304
CONFIG_SWITCH_ERROR = 256
DEQUEUE_BUFFER_SPARE_BUFFER_DMATAG_MISMATCH = 4096
DEQUEUE_BUFFER_SPARE_BUFFER_INDEX_ERROR = 8192
DEQUEUE_BUFFER_SPARE_BUFFER_STATUS_ERROR = 1024
DEQUEUE_BUFFER_TIMEOUT_ERROR = 512
DMA_DECODE_ERROR = 4
DMA_INTERNAL_ERROR = 2
DMA_MISMATCH_ERROR = 1
DMA_NOT_END_OF_PACKET_ERROR = 32
DMA_NOT_EXPECTED_BTT_ERROR = 64
DMA_NOT_OKAY_ERROR = 16
DMA_SIZE_ERROR = 262144
DMA_SLAVE_ERROR = 8
DMA_TAG_MISMATCH_ERROR = 67108864
DMA_TIMEOUT_ERROR = 128
KRNL_STS_COUNT_ERROR = 16777216
KRNL_TIMEOUT_STS_ERROR = 8388608
OPEN_CON_NOT_SUCCEEDED = 131072
OPEN_PORT_NOT_SUCCEEDED = 65536
PACK_SEQ_NUMBER_ERROR = 2097152
PACK_TIMEOUT_STS_ERROR = 1048576
RECEIVE_OFFCHIP_SPARE_BUFF_ID_NOT_VALID = 32768
RECEIVE_TIMEOUT_ERROR = 2048
SEGMENTER_EXPECTED_BTT_ERROR = 33554432
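Every error code listed above is a distinct power of two, which suggests that a return value may carry several errors at once as a bitmask. Under that assumption, a return code can be decoded as sketched below. `ErrorCodeSketch` re-declares only a subset of the documented values, and `decode_errors` is a hypothetical helper.

```python
from enum import IntEnum


class ErrorCodeSketch(IntEnum):
    # Subset of the documented values, copied so the example runs without pyaccl.
    DMA_MISMATCH_ERROR = 1
    DMA_INTERNAL_ERROR = 2
    DMA_DECODE_ERROR = 4
    DMA_TIMEOUT_ERROR = 128
    RECEIVE_TIMEOUT_ERROR = 2048
    COLLECTIVE_NOT_IMPLEMENTED = 16384


def decode_errors(retcode):
    """Return the names of all error bits set in `retcode`."""
    return [e.name for e in ErrorCodeSketch if retcode & e.value]


print(decode_errors(128 | 2048))
```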