API Reference
ACCL
- class pyaccl.accl.ACCLArithConfig(uncompressed_elem_bytes, compressed_elem_bytes, elem_ratio_log, compressor_tdest, decompressor_tdest, arith_is_compressed, arith_tdest)
Bases: object
An arithmetic configuration for the ACCL hardware datapath. The datapath includes a variety of hardware functions for compression and reduction. Compression functions convert between compressed and uncompressed representations of data. Reduction functions apply arithmetic-logic functions to two buffers, producing a third. Each class of hardware function is identified by a unique ID.
- Args:
uncompressed_elem_bytes (int): Number of bytes in the uncompressed datatype
compressed_elem_bytes (int): Number of bytes in the compressed datatype
elem_ratio_log (int): Log of number of uncompressed elements required to produce one compressed element
compressor_tdest (int): Hardware function ID of compressor
decompressor_tdest (int): Hardware function ID of decompressor
arith_is_compressed (bool): Indicate whether any arithmetic is to be performed on compressed or uncompressed data.
arith_tdest (list of int): List of hardware function IDs corresponding to the supported reduction operations
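The relationship between the size-related arguments above can be sketched with a small stand-in class. This is not the pyaccl implementation: `ArithConfigSketch` and its two helper methods are hypothetical, the example values are illustrative rather than the library defaults, and the sketch assumes that `elem_ratio_log` is a base-2 logarithm.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ArithConfigSketch:
    """Hypothetical stand-in that mirrors the ACCLArithConfig arguments."""
    uncompressed_elem_bytes: int
    compressed_elem_bytes: int
    elem_ratio_log: int
    compressor_tdest: int
    decompressor_tdest: int
    arith_is_compressed: bool
    arith_tdest: List[int]

    def uncompressed_nbytes(self, count: int) -> int:
        # Size of `count` elements in the uncompressed representation.
        return count * self.uncompressed_elem_bytes

    def compressed_nbytes(self, count: int) -> int:
        # Assumption: 2**elem_ratio_log uncompressed elements produce
        # one compressed element.
        return (count >> self.elem_ratio_log) * self.compressed_elem_bytes


# Illustrative values only: a 4-byte type compressed to a 2-byte type,
# with a one-to-one element ratio (elem_ratio_log = 0).
cfg = ArithConfigSketch(4, 2, 0, 0, 1, True, [0])
print(cfg.uncompressed_nbytes(1024), cfg.compressed_nbytes(1024))
```

Under these assumptions, 1024 elements occupy half as many bytes once compressed.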
- class pyaccl.accl.accl(nranks, local_rank, ranks=None, protocol=None, nbufs=16, bufsize=1024, arith_config={('float16', 'float16'): <pyaccl.accl.ACCLArithConfig object>, ('float32', 'float16'): <pyaccl.accl.ACCLArithConfig object>, ('float32', 'float32'): <pyaccl.accl.ACCLArithConfig object>, ('float64', 'float64'): <pyaccl.accl.ACCLArithConfig object>, ('int32', 'int32'): <pyaccl.accl.ACCLArithConfig object>, ('int64', 'int64'): <pyaccl.accl.ACCLArithConfig object>}, sim_mode=False, xclbin=None, sim_sock=None, board_idx=None, cclo_idx=0)
Bases: object
ACCL Python Driver
- allgather(sbuf, rbuf, count, comm_id=0, from_fpga=False, to_fpga=False, compress_dtype=None, run_async=False)
Fused gather-bcast
- Args:
sbuf (ACCLBuffer): Buffer from which to send data
rbuf (ACCLBuffer): Buffer into which to receive data
count (int): Number of elements to copy
comm_id (int, optional): Index in the internal communicator list. Defaults to 0 which is the global communicator.
from_fpga (bool, optional): Send without syncing sbuf first, assuming the data is already in FPGA memory. Defaults to False.
to_fpga (bool, optional): Return without syncing rbuf first, assuming the data is not required in host memory. Defaults to False.
compress_dtype (NumPy datatype, optional): A NumPy datatype to which the CCLO will compress the data before sending on the wire. Defaults to None.
run_async (bool, optional): Return handle to call instead of waiting for completion. Defaults to False.
- Returns:
handle to Pynq call: When run_async is True, returns a handle to a pynq call, which can be waited on. Otherwise, returns None.
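As a host-side reference for what "fused gather-bcast" means, the following sketch models allgather on plain Python lists, one list per rank. It does not call pyaccl. The helper names are hypothetical, and the rank-ordered layout of the result follows the usual MPI convention, which is an assumption here.

```python
def gather_model(send_bufs, count):
    """Concatenate the first `count` elements of every rank, in rank order."""
    gathered = []
    for rank_buf in send_bufs:
        gathered.extend(rank_buf[:count])
    return gathered


def bcast_model(data, nranks):
    """Give every rank its own copy of `data`."""
    return [list(data) for _ in range(nranks)]


def allgather_model(send_bufs, count):
    """Gather followed by broadcast: every rank ends up with all the data."""
    return bcast_model(gather_model(send_bufs, count), len(send_bufs))


# Three ranks, each contributing two elements.
send_bufs = [[0, 1], [10, 11], [20, 21]]
recv_bufs = allgather_model(send_bufs, count=2)
print(recv_bufs[0])
```

In this model each receive buffer must hold `count` elements per rank in the communicator.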
- allocate(shape, dtype=<class 'numpy.float32'>, physical_address=None, prealloc=True)
Allocates an ACCLBuffer in the device memory associated with this CCLO instance.
- Args:
shape (tuple): The shape of the desired buffer.
dtype (NumPy datatype, optional): Desired data type. Defaults to np.float32.
physical_address (int, optional): Physical address override. Defaults to None.
prealloc (bool, optional): Populate the device-side memory immediately upon buffer creation. Defaults to True.
- Returns:
ACCLBuffer: Handle to the created ACCL buffer.
- allreduce(sbuf, rbuf, count, func, comm_id=0, from_fpga=False, to_fpga=False, compress_dtype=None, run_async=False)
Fused reduce-bcast
- Args:
sbuf (ACCLBuffer): Buffer from which to send data
rbuf (ACCLBuffer): Buffer into which to receive data
count (int): Number of elements to copy
func (int): Index of function to be applied from ACCLReduceFunctions
comm_id (int, optional): Index in the internal communicator list. Defaults to 0 which is the global communicator.
from_fpga (bool, optional): Send without syncing sbuf first, assuming the data is already in FPGA memory. Defaults to False.
to_fpga (bool, optional): Return without syncing rbuf first, assuming the data is not required in host memory. Defaults to False.
compress_dtype (NumPy datatype, optional): A NumPy datatype to which the CCLO will compress the data before sending on the wire. Defaults to None.
run_async (bool, optional): Return handle to call instead of waiting for completion. Defaults to False.
- Returns:
handle to Pynq call: When run_async is True, returns a handle to a pynq call, which can be waited on. Otherwise, returns None.
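The "fused reduce-bcast" behaviour can be checked against a small host-side model. The sketch below works on plain Python lists and uses an elementwise sum, matching the SUM function listed under ACCLReduceFunctions. The function names are hypothetical and nothing here calls pyaccl.

```python
def reduce_sum_model(send_bufs, count):
    """Elementwise sum of the first `count` elements across all ranks."""
    return [sum(buf[i] for buf in send_bufs) for i in range(count)]


def allreduce_sum_model(send_bufs, count):
    """Reduce followed by broadcast: every rank receives the reduced vector."""
    reduced = reduce_sum_model(send_bufs, count)
    return [list(reduced) for _ in send_bufs]


# Three ranks, three elements each.
send_bufs = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
recv_bufs = allreduce_sum_model(send_bufs, count=3)
print(recv_bufs[0])
```

A model like this can serve as a golden reference when comparing results copied back from device buffers.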
- barrier(comm_id=0)
- bcast(buf, count, root, comm_id=0, from_fpga=False, to_fpga=False, compress_dtype=None, run_async=False)
Broadcast data to all ACCL instances in a communicator
- Args:
buf (ACCLBuffer): Buffer from which to send data, or into which to receive.
count (int): Number of elements to copy
root (int): Index of the root, i.e. the ACCL instance which sends from buf. All others receive into buf.
comm_id (int, optional): Index in the internal communicator list. Defaults to 0 which is the global communicator.
from_fpga (bool, optional): Send without syncing buf first, assuming the data is already in FPGA memory. Defaults to False.
to_fpga (bool, optional): Return without syncing buf first, assuming the data is not required in host memory. Defaults to False.
compress_dtype (NumPy datatype, optional): A NumPy datatype to which the CCLO will compress the data before sending on the wire. Defaults to None.
run_async (bool, optional): Return handle to call instead of waiting for completion. Defaults to False.
- Returns:
handle to Pynq call: When run_async is True, returns a handle to a pynq call, which can be waited on. Otherwise, returns None.
- combine(count, func, val1, val2, result, val1_from_fpga=False, val2_from_fpga=False, to_fpga=False, run_async=False)
Combine data from two buffers and put result in a third buffer
- Args:
count (int): Number of elements to copy
func (int): Index of function to be applied from ACCLReduceFunctions
val1, val2 (ACCLBuffers): Operand buffers
result (ACCLBuffer): Result buffer
val1_from_fpga, val2_from_fpga (bool, optional): Combine without syncing operand buffers first, assuming the data is already in FPGA memory. Defaults to False.
to_fpga (bool, optional): Return without syncing result first, assuming the data is not required in host memory. Defaults to False.
run_async (bool, optional): Return handle to call instead of waiting for completion. Defaults to False.
- Returns:
handle to Pynq call: When run_async is True, returns a handle to a pynq call, which can be waited on. Otherwise, returns None.
- copy(srcbuf, dstbuf, count, from_fpga=False, to_fpga=False, run_async=False)
Copy data between two buffers
- Args:
srcbuf (ACCLBuffer): Buffer from which to send data
dstbuf (ACCLBuffer): Buffer into which to receive data
count (int): Number of elements to copy
from_fpga (bool, optional): Send without syncing srcbuf first, assuming the data is already in FPGA memory. Defaults to False.
to_fpga (bool, optional): Return without syncing dstbuf first, assuming the data is not required in host memory. Defaults to False.
run_async (bool, optional): Return handle to call instead of waiting for completion. Defaults to False.
- Returns:
handle to Pynq call: When run_async is True, returns a handle to a pynq call, which can be waited on. Otherwise, returns None.
- deinit()
De-initializes an ACCL instance, resetting the CCLO kernel and deallocating all internal buffers, but not buffers created by users with allocate()
- gather(sbuf, rbuf, count, root, comm_id=0, from_fpga=False, to_fpga=False, compress_dtype=None, run_async=False)
Gathers data from all ACCL instances in a communicator
- Args:
sbuf (ACCLBuffer): Buffer from which to send data
rbuf (ACCLBuffer): Buffer into which to receive data
count (int): Number of elements to copy
root (int): Index of the root, i.e. the ACCL instance which receives into rbuf. All others send from sbuf.
comm_id (int, optional): Index in the internal communicator list. Defaults to 0 which is the global communicator.
from_fpga (bool, optional): Send without syncing sbuf first, assuming the data is already in FPGA memory. Defaults to False.
to_fpga (bool, optional): Return without syncing rbuf first, assuming the data is not required in host memory. Defaults to False.
compress_dtype (NumPy datatype, optional): A NumPy datatype to which the CCLO will compress the data before sending on the wire. Defaults to None.
run_async (bool, optional): Return handle to call instead of waiting for completion. Defaults to False.
- Returns:
handle to Pynq call: When run_async is True, returns a handle to a pynq call, which can be waited on. Otherwise, returns None.
- nop(run_async=False)
Calls the accelerator with no work. Useful for measuring call latency
- Args:
run_async (bool, optional): Whether to execute asynchronously. Defaults to False.
- Returns:
handle to Pynq call: When run_async is True, returns a handle to a pynq call, which can be waited on. Otherwise, returns None.
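Since nop() is intended for measuring call latency, a timing helper might look like the sketch below. The helper `measure_call_latency` and the stub class are hypothetical. The only driver call used is the synchronous `nop()` documented above, and the stub stands in for a real ACCL instance so that the example runs without hardware or the emulator.

```python
import statistics
import time


def measure_call_latency(accl_instance, repeats=100):
    """Return the median wall-clock time, in seconds, of a blocking nop()."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        accl_instance.nop()  # synchronous call with no work
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


class StubACCL:
    """Stand-in for an ACCL instance; only counts nop() calls."""

    def __init__(self):
        self.calls = 0

    def nop(self, run_async=False):
        self.calls += 1


stub = StubACCL()
latency = measure_call_latency(stub, repeats=10)
print(f"median latency: {latency * 1e6:.2f} us over {stub.calls} calls")
```

The median is used instead of the mean so that occasional slow calls do not dominate the figure.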
- recv(dstbuf, count, src, tag=4294967295, comm_id=0, to_fpga=False, compress_dtype=None, run_async=False)
Receive data from a remote ACCL instance
- Args:
dstbuf (ACCLBuffer): Buffer into which to receive data
count (int): Number of elements to copy
src (int): Rank index of source, in the selected communicator
tag (int, optional): Optional tag. Defaults to TAG_ANY.
comm_id (int, optional): Index in the internal communicator list. Defaults to 0 which is the global communicator.
to_fpga (bool, optional): Return without syncing dstbuf first, assuming the data is not required in host memory. Defaults to False.
compress_dtype (NumPy datatype, optional): A NumPy datatype to which the CCLO will compress the data before sending on the wire. Defaults to None.
run_async (bool, optional): Return handle to call instead of waiting for completion. Defaults to False.
- Returns:
handle to Pynq call: When run_async is True, returns a handle to a pynq call, which can be waited on. Otherwise, returns None.
- reduce(sbuf, rbuf, count, root, func, comm_id=0, from_fpga=False, to_fpga=False, compress_dtype=None, run_async=False)
Combine data from multiple ACCL instances, using a reduction function
- Args:
sbuf (ACCLBuffer): Buffer from which to send data
rbuf (ACCLBuffer): Buffer into which to receive data
count (int): Number of elements to copy
root (int): Index of the root, i.e. the ACCL instance which writes to rbuf. All others send from sbuf.
func (int): Index of function to be applied from ACCLReduceFunctions
comm_id (int, optional): Index in the internal communicator list. Defaults to 0 which is the global communicator.
from_fpga (bool, optional): Send without syncing sbuf first, assuming the data is already in FPGA memory. Defaults to False.
to_fpga (bool, optional): Return without syncing rbuf first, assuming the data is not required in host memory. Defaults to False.
compress_dtype (NumPy datatype, optional): A NumPy datatype to which the CCLO will compress the data before sending on the wire. Defaults to None.
run_async (bool, optional): Return handle to call instead of waiting for completion. Defaults to False.
- Returns:
handle to Pynq call: When run_async is True, returns a handle to a pynq call, which can be waited on. Otherwise, returns None.
- reduce_scatter(sbuf, rbuf, count, func, comm_id=0, from_fpga=False, to_fpga=False, compress_dtype=None, run_async=False)
Fused reduce-scatter
- Args:
sbuf (ACCLBuffer): Buffer from which to send data
rbuf (ACCLBuffer): Buffer into which to receive data
count (int): Number of elements to copy
func (int): Index of function to be applied from ACCLReduceFunctions
comm_id (int, optional): Index in the internal communicator list. Defaults to 0 which is the global communicator.
from_fpga (bool, optional): Send without syncing sbuf first, assuming the data is already in FPGA memory. Defaults to False.
to_fpga (bool, optional): Return without syncing rbuf first, assuming the data is not required in host memory. Defaults to False.
compress_dtype (NumPy datatype, optional): A NumPy datatype to which the CCLO will compress the data before sending on the wire. Defaults to None.
run_async (bool, optional): Return handle to call instead of waiting for completion. Defaults to False.
- Returns:
handle to Pynq call: When run_async is True, returns a handle to a pynq call, which can be waited on. Otherwise, returns None.
- scatter(sbuf, rbuf, count, root, comm_id=0, from_fpga=False, to_fpga=False, compress_dtype=None, run_async=False)
Scatter data to all ACCL instances in a communicator
- Args:
sbuf (ACCLBuffer): Buffer from which to send data
rbuf (ACCLBuffer): Buffer into which to receive data
count (int): Number of elements to copy
root (int): Index of the root, i.e. the ACCL instance which sends from sbuf. All others receive into rbuf.
comm_id (int, optional): Index in the internal communicator list. Defaults to 0 which is the global communicator.
from_fpga (bool, optional): Send without syncing sbuf first, assuming the data is already in FPGA memory. Defaults to False.
to_fpga (bool, optional): Return without syncing rbuf first, assuming the data is not required in host memory. Defaults to False.
compress_dtype (NumPy datatype, optional): A NumPy datatype to which the CCLO will compress the data before sending on the wire. Defaults to None.
run_async (bool, optional): Return handle to call instead of waiting for completion. Defaults to False.
- Returns:
handle to Pynq call: When run_async is True, returns a handle to a pynq call, which can be waited on. Otherwise, returns None.
- send(srcbuf, count, dst, tag=4294967295, comm_id=0, from_fpga=False, compress_dtype=None, stream_flags=ACCLStreamFlags.NO_STREAM, run_async=False)
Send data to a remote ACCL instance
- Args:
srcbuf (ACCLBuffer): Buffer from which to send data
count (int): Number of elements to copy
dst (int): Rank index of destination, in the selected communicator
tag (int, optional): Optional tag. Defaults to TAG_ANY.
comm_id (int, optional): Index in the internal communicator list. Defaults to 0 which is the global communicator.
from_fpga (bool, optional): Send without syncing srcbuf first, assuming the data is already in FPGA memory. Defaults to False.
compress_dtype (NumPy datatype, optional): A NumPy datatype to which the CCLO will compress the data before sending on the wire. Defaults to None.
stream_flags (int, optional): Indicates streaming options. Defaults to ACCLStreamFlags.NO_STREAM.
run_async (bool, optional): Return handle to call instead of waiting for completion. Defaults to False.
- Returns:
handle to Pynq call: When run_async is True, returns a handle to a pynq call, which can be waited on. Otherwise, returns None.
- split_communicator(indices)
Creates a new communicator from the global communicator by filtering it with a list of indices
- Args:
indices (list of int): List of rank indices to include in the new communicator
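Filtering the global communicator by a list of indices can be pictured with the sketch below. It is only a model: the function name `filter_communicator`, the dictionary layout of each rank entry, and the bounds check are assumptions made for illustration, not the driver's internal representation.

```python
def filter_communicator(global_ranks, indices):
    """Return the entries of `global_ranks` selected by `indices`, in order."""
    for i in indices:
        if not 0 <= i < len(global_ranks):
            raise IndexError(f"rank index {i} is outside the global communicator")
    return [global_ranks[i] for i in indices]


# Hypothetical rank descriptions; the real driver may store other fields.
global_ranks = [
    {"ip": "10.0.0.1", "port": 5000},
    {"ip": "10.0.0.2", "port": 5000},
    {"ip": "10.0.0.3", "port": 5000},
    {"ip": "10.0.0.4", "port": 5000},
]

sub_comm = filter_communicator(global_ranks, [0, 2])
print([rank["ip"] for rank in sub_comm])
```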
Buffer
ACCLBuffer objects are similar to Pynq buffers but may also interact
with the ACCL emulator rather than just Alveo memory. Users should not
create ACCL buffers explicitly but should instead utilize the allocate()
function of a specific ACCL instance.
- class pyaccl.buffer.ACCLBuffer(shape, dtype=<class 'numpy.float32'>, target=None, zmqsocket=None, physical_address=None, prealloc=True)
Bases: ndarray
- property device_address
Get physical address in FPGA memory
- Returns:
int: Physical address
- sync_from_device()
Copy buffer data from the device to the host
- sync_to_device()
Copy buffer data from the host to the device
Constants
- class pyaccl.constants.ACCLCompressionFlags(value)
Bases: IntEnum
Compression flags
- ETH_COMPRESSED = 8
Apply over-the-wire compression
- NO_COMPRESSION = 0
No compression
- OP0_COMPRESSED = 1
First input buffer is compressed
- OP1_COMPRESSED = 2
Second input buffer is compressed
- RES_COMPRESSED = 4
Result buffer is compressed
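The flag values above are distinct powers of two, which suggests they are meant to be combined with bitwise OR and tested with bitwise AND. The sketch below uses a local replica of the enum (named `CompressionFlags` here so the example runs without pyaccl). Whether the driver accepts combined values in any particular call is an assumption.

```python
from enum import IntEnum


class CompressionFlags(IntEnum):
    """Local replica of the documented ACCLCompressionFlags values."""
    NO_COMPRESSION = 0
    OP0_COMPRESSED = 1
    OP1_COMPRESSED = 2
    RES_COMPRESSED = 4
    ETH_COMPRESSED = 8


# Combine: first operand already compressed, plus over-the-wire compression.
flags = CompressionFlags.OP0_COMPRESSED | CompressionFlags.ETH_COMPRESSED

# Test individual bits with bitwise AND.
wire_compressed = bool(flags & CompressionFlags.ETH_COMPRESSED)
result_compressed = bool(flags & CompressionFlags.RES_COMPRESSED)
print(int(flags), wire_compressed, result_compressed)
```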
- class pyaccl.constants.ACCLReduceFunctions(value)
Bases: IntEnum
CCLO reduction functions
- SUM = 0
Elementwise sum of vectors
- class pyaccl.constants.ACCLStreamFlags(value)
Bases: IntEnum
Stream flags
- NO_STREAM = 0
No streaming. All operands and results are in memory
- OP0_STREAM = 1
The first operand is pulled from stream instead of memory
- RES_STREAM = 2
The result is pushed to stream instead of memory
- class pyaccl.constants.ErrorCode(value)
Bases: IntEnum
Error codes returned by CCLO kernel in FPGA
- ARITH_ERROR = 524288
- COLLECTIVE_NOT_IMPLEMENTED = 16384
- COMPRESSION_ERROR = 4194304
- CONFIG_SWITCH_ERROR = 256
- DEQUEUE_BUFFER_SPARE_BUFFER_DMATAG_MISMATCH = 4096
- DEQUEUE_BUFFER_SPARE_BUFFER_INDEX_ERROR = 8192
- DEQUEUE_BUFFER_SPARE_BUFFER_STATUS_ERROR = 1024
- DEQUEUE_BUFFER_TIMEOUT_ERROR = 512
- DMA_DECODE_ERROR = 4
- DMA_INTERNAL_ERROR = 2
- DMA_MISMATCH_ERROR = 1
- DMA_NOT_END_OF_PACKET_ERROR = 32
- DMA_NOT_EXPECTED_BTT_ERROR = 64
- DMA_NOT_OKAY_ERROR = 16
- DMA_SIZE_ERROR = 262144
- DMA_SLAVE_ERROR = 8
- DMA_TAG_MISMATCH_ERROR = 67108864
- DMA_TIMEOUT_ERROR = 128
- KRNL_STS_COUNT_ERROR = 16777216
- KRNL_TIMEOUT_STS_ERROR = 8388608
- OPEN_CON_NOT_SUCCEEDED = 131072
- OPEN_PORT_NOT_SUCCEEDED = 65536
- PACK_SEQ_NUMBER_ERROR = 2097152
- PACK_TIMEOUT_STS_ERROR = 1048576
- RECEIVE_OFFCHIP_SPARE_BUFF_ID_NOT_VALID = 32768
- RECEIVE_TIMEOUT_ERROR = 2048
- SEGMENTER_EXPECTED_BTT_ERROR = 33554432
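Every error code listed above is a distinct power of two, so a status word could carry several errors at once. The sketch below decodes such a word into error names. It assumes the kernel reports errors as a bitwise OR of these codes, which the listing suggests but does not state. `ErrorCodeSubset` is a local replica containing only a few of the documented values, and `decode_error_word` is a hypothetical helper.

```python
from enum import IntEnum


class ErrorCodeSubset(IntEnum):
    """Local replica of a few documented ErrorCode values."""
    DMA_MISMATCH_ERROR = 1
    DMA_TIMEOUT_ERROR = 128
    RECEIVE_TIMEOUT_ERROR = 2048
    COLLECTIVE_NOT_IMPLEMENTED = 16384
    COMPRESSION_ERROR = 4194304


def decode_error_word(word, codes=ErrorCodeSubset):
    """Return the names of all codes whose bit is set in `word`."""
    return [code.name for code in sorted(codes) if word & code]


# A status word with two error bits set (128 | 2048).
status = ErrorCodeSubset.DMA_TIMEOUT_ERROR | ErrorCodeSubset.RECEIVE_TIMEOUT_ERROR
print(decode_error_word(status))
```

With the real enum imported from pyaccl.constants, the same helper could be passed `ErrorCode` as the `codes` argument.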