- Introduction
- Schema-derived hardware streams
- Handshaking protocol
- Command stream
- Unlock stream
- Arrow data streams
If you are just getting started with Fletcher, you will use Fletchgen to automatically generate a design. Somewhere beneath several layers of magically generated hierarchies and structures, your kernel will reside. The kernel should be the only thing a hardware developer has to implement. To interface with Arrow data sets, your kernel has a number of hardware streams, which are derived from the Arrow schemas that you provided to Fletchgen. This guide explains what these hardware streams are. Their signals appear in the kernel template that Fletchgen generates.
For each Arrow Array (an Arrow Array is like a column in a tabular in-memory data set called an Arrow RecordBatch) that you want to read from, Fletchgen will provide you with at least three streams:
- Command stream.
- Unlock stream.
- Arrow data stream(s). The number of Arrow data streams will depend on the type of the Arrow field in your schema.
Operating the interface will generally consist of the following steps:
- You give a command to the interface over the command stream.
- The data stream will give you data, or you give data to the interface over the data stream.
- The unlock stream will signal you that all the data was transferred from/to the memory.
Very much simplified, with several fields omitted, this will look as follows for the "Hello, world!" example project that you can find here:
All Fletcher hardware streams follow a ready-valid handshaking protocol similar to AXI4. You can read about that protocol here. If you fail to adhere to this protocol, you will end up with erroneous designs. If you have never worked with a ready-valid handshaked streaming protocol before, it is highly recommended to read and fully understand the specification linked above.
Its handshaking characteristics are as follows:
- If and only if `valid` and `ready` are asserted in the same clock cycle, a transfer (handshake) is made.
- A producer of stream data may not wait for the `ready` signal to be asserted before asserting `valid`.
- A consumer of stream data may assert `ready` before `valid` is asserted.
- Once `valid` is asserted, it must remain asserted until handshaked, and the data may not change during this period.
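The four rules above can be illustrated with a small cycle-by-cycle software model. This is only a sketch in Python, not Fletcher code: the `Producer` class and the `simulate` function are hypothetical names introduced here, and the `ready` pattern stands in for an arbitrary consumer.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Producer:
    """Hypothetical producer that keeps valid and data stable until accepted."""
    items: List[int]
    index: int = 0

    @property
    def valid(self) -> bool:
        # valid depends only on having data, never on ready.
        return self.index < len(self.items)

    @property
    def data(self) -> Optional[int]:
        return self.items[self.index] if self.valid else None

    def clock(self, ready: bool) -> None:
        # Advance only after a handshake; otherwise hold the same data.
        if self.valid and ready:
            self.index += 1


def simulate(items, ready_pattern):
    """Return the (cycle, data) pairs that were handshaked."""
    producer = Producer(list(items))
    transfers = []
    for cycle, ready in enumerate(ready_pattern):
        if producer.valid and ready:  # both asserted in the same cycle
            transfers.append((cycle, producer.data))
        producer.clock(ready)
    return transfers


print(simulate([10, 20, 30], [False, True, True, False, True, True]))
# [(1, 10), (2, 20), (4, 30)]
```

Note how the first item stays on the stream during cycle 0 and the third item during cycle 3, because the consumer was not ready in those cycles.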
We will now go over each of the aforementioned streams and their function.
This is an output stream for your kernel.
The command stream (or `cmd` stream) is used to issue commands to the generated interface. This command contains a range of row indices that you would like to read from an Arrow Array.
It has the following fields:
Field | Description |
---|---|
tag | An identifier for your command (optional). |
firstIdx | The first row index you want to read. |
lastIdx | The last row index you want to read, plus one. |
This means that in the range `firstIdx ... lastIdx`, the last index is exclusive.
If you are sending a command to a RecordBatchWriter (i.e. you are writing stuff into Arrow format in memory through the generated interface) and you don't know how large your output is going to be, you can set lastIdx to zero.
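To make the exclusive upper bound concrete, here is a minimal Python sketch of a read command. The `Command` class and its method names are hypothetical and exist only for illustration; the model covers read commands and does not handle the `lastIdx` equal to zero case for writers of unknown size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    """Hypothetical software model of one command stream transfer."""
    first_idx: int
    last_idx: int  # exclusive: the last row index plus one
    tag: int = 0

    def num_rows(self) -> int:
        return self.last_idx - self.first_idx

    def rows(self) -> range:
        # Python ranges are also exclusive at the upper end.
        return range(self.first_idx, self.last_idx)


cmd = Command(first_idx=4, last_idx=8, tag=1)
print(cmd.num_rows())    # 4
print(list(cmd.rows()))  # [4, 5, 6, 7]
```

Reading a whole RecordBatch of N rows therefore uses `firstIdx` 0 and `lastIdx` N.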
This is an input stream for your kernel.
The unlock stream will transfer a single handshake for each command that was handshaked on the command stream. For commands to RecordBatchWriters (i.e. when you are writing data into the memory in the Arrow format through the generated interface) that means that all write buffers are emptied and the data should be in the memory. This stream has the following fields:
Field | Description |
---|---|
tag | The identifier of the handled command (optional). |
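Since every command handshake is eventually answered by exactly one unlock handshake, a kernel or testbench can keep track of outstanding commands. The following Python sketch shows one way to do that bookkeeping; `CommandTracker` is a hypothetical helper, and it deliberately makes no assumption about the order in which unlocks arrive.

```python
from collections import Counter


class CommandTracker:
    """Hypothetical bookkeeping: one unlock handshake per command handshake."""

    def __init__(self):
        self.outstanding = Counter()  # tag -> number of commands in flight

    def command_handshaked(self, tag=0):
        self.outstanding[tag] += 1

    def unlock_handshaked(self, tag=0):
        if self.outstanding[tag] == 0:
            raise ValueError(f"unlock for tag {tag} without a matching command")
        self.outstanding[tag] -= 1

    def idle(self):
        # True when every issued command has been unlocked.
        return sum(self.outstanding.values()) == 0


tracker = CommandTracker()
tracker.command_handshaked(tag=1)
tracker.command_handshaked(tag=2)
tracker.unlock_handshaked(tag=1)
print(tracker.idle())  # False, tag 2 is still in flight
tracker.unlock_handshaked(tag=2)
print(tracker.idle())  # True
```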
This is either an input stream or an output stream for your kernel, depending on the `fletcher_mode` metadata value of your schema (see Fletchgen). This simply depends on whether you are reading from or writing to an Arrow RecordBatch in memory.
The number of data streams depends on the Arrow type of the field that corresponds to the Arrow Array this stream was generated for.
Primitive (fixed-width) types (such as int32, float64, etc...) will generate a single stream. This stream will have the following fields:
Field | Description |
---|---|
dvalid | Signals whether this stream transfer contains any data (dvalid is high) or just control (dvalid is low). |
last | Signals that this is the last transfer of the command. |
count | The number of valid Arrow data elements in this transfer (depends on EPC). |
data | `count` Arrow data elements. |
By default, the number of Arrow data elements transferred (e.g. your int32's or float64's) is one per cycle. However, you can use the key-value metadata `{"fletcher_epc", "<epc>"}` on your Arrow field in your Arrow schema to transfer more elements-per-cycle (EPC). This is useful if you want to increase throughput.
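The effect of EPC on the `count` and `last` fields can be sketched in Python as follows. The function name `split_into_transfers` is hypothetical, and the model assumes (this is not stated in the guide) that every transfer is completely filled except possibly the last one. A real kernel should always honor `count` rather than rely on that assumption.

```python
def split_into_transfers(elements, epc=1):
    """Group elements into transfers of at most `epc` elements each.

    Assumption: all transfers are full except possibly the last one.
    """
    if epc < 1:
        raise ValueError("epc must be at least 1")
    transfers = []
    for start in range(0, len(elements), epc):
        chunk = elements[start:start + epc]
        transfers.append({
            "dvalid": True,
            "last": start + epc >= len(elements),  # final transfer of the command
            "count": len(chunk),
            "data": chunk,
        })
    return transfers


for transfer in split_into_transfers([1, 2, 3, 4, 5], epc=4):
    print(transfer)
# {'dvalid': True, 'last': False, 'count': 4, 'data': [1, 2, 3, 4]}
# {'dvalid': True, 'last': True, 'count': 1, 'data': [5]}
```

With an EPC of 4, five elements take two transfers instead of five, which is where the throughput gain comes from.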
Some Arrow types are nested, such as `utf8` strings and `binary` or any other `list<T>` (list of some other type), and `struct`.
In the case of Arrow `struct`, there is no parent stream. In terms of streams appearing on the interface, each child field of the `struct` is handled as if it were a separate Arrow Array.
In the case of Arrow `list`, where a RecordBatch element contains a variable-length piece of data, a "parent" stream (also called the length stream) transfers the length of each variable-length piece of data. The "child" stream of this list transfers the list elements.
For example, the `binary` type (which is basically a list of non-nullable bytes, sometimes called a blob in database land) is the same as a `list<byte>`, and we get the following streams and fields:
Length stream (parent):

Field | Description |
---|---|
dvalid | Signals whether this stream transfer contains any data (dvalid is high) or just control (dvalid is low). |
last | Signals that this is the last list length of the command. |
count | Not used, is always 1. |
length | The number of elements of this list that will appear on the child stream. |

Child stream (list elements):

Field | Description |
---|---|
dvalid | Signals whether this stream transfer contains any data (dvalid is high) or just control (dvalid is low). It may be, for example, that dvalid is low because we're transferring an empty list. In that case, last is asserted and count and bytes are don't cares. |
last | Signals that this transfer contains the last elements in the list. |
count | The number of valid Arrow data elements in this transfer, depends on EPC. |
bytes | `count` Arrow list elements (in this example: bytes). |
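The interplay between the length stream and the child stream can be sketched with a Python model that turns a few `binary` values into the transfers described above. The function `list_streams` is hypothetical. It assumes that transfers are filled up to EPC elements, and that the length stream reports an empty list as a normal data transfer with `length` 0; neither is confirmed by this guide.

```python
def list_streams(values, epc=1):
    """Model the length and child stream transfers for a list<byte> column."""
    length_stream = []
    child_stream = []
    for i, value in enumerate(values):
        length_stream.append({
            "dvalid": True,                  # assumption, also for empty lists
            "last": i == len(values) - 1,    # last list length of the command
            "count": 1,
            "length": len(value),
        })
        if len(value) == 0:
            # Empty list: control-only transfer; count and bytes are don't cares.
            child_stream.append(
                {"dvalid": False, "last": True, "count": None, "bytes": None})
            continue
        for start in range(0, len(value), epc):
            chunk = value[start:start + epc]
            child_stream.append({
                "dvalid": True,
                "last": start + epc >= len(value),  # last elements of this list
                "count": len(chunk),
                "bytes": bytes(chunk),
            })
    return length_stream, child_stream


lengths, children = list_streams([b"hi", b"", b"abc"], epc=2)
print([t["length"] for t in lengths])                # [2, 0, 3]
print([(t["bytes"], t["last"]) for t in children])
# [(b'hi', True), (None, True), (b'ab', False), (b'c', True)]
```

A consumer can thus delimit lists using `last` on the child stream alone, while the length stream tells it in advance how many elements to expect.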
(for advanced users only) (partially outdated)
The basic hardware primitives that Fletchgen instantiates in a large design are an ArrayReader or an ArrayWriter, depending on whether you want to read or write data from/to an Arrow Array in a RecordBatch. If you want to use these components directly, e.g. you're not using Fletchgen, then you can read the rest of this guide below.
For reading, because elements in an Arrow RecordBatch can be processed in parallel, you are free to implement, for example:
- One ArrayReader for each Arrow Array
- Multiple ArrayReaders for each Arrow Array
- One ArrayReader for just one of the Arrays in a Table
For writing, this depends on the data type. For primitive fixed-width types, it is possible to have multiple ArrayWriters for a single array. However, for variable-length types (for example, strings), this is currently not possible. Of course, you are free to build up the data set in parallel and merge the parts later on your host system, in software or in some other way.
To configure an ArrayReader/ArrayWriter, you must set the generics of the HDL component appropriately. Things like bus data width, bus address width, burst length, etc. should speak for themselves. However, one important generic is the configuration string.
The configuration string provided to an ArrayReader/Writer is somewhat equivalent to an Arrow Schema. It conveys the same information about the layout/structure of the Array in memory. There are some additional options to tweak internals (like FIFO depths), but we will ignore them for now.
ArrayConfig_pkg.vhd contains an in-depth guide on which entries of the config string are supported.
Make sure not to use any whitespace characters in the configuration such as spaces or newlines.
The following elements are supported:
- `prim(<width>)`
  - Any type of fixed-width `<width>`, such as ints, floats, doubles, bits, etc.
- `list(<A>)`
  - A list of any of the supported types.
- `struct(<A>,<B>)`
  - A structure of any of the supported types.
- `listprim(<width>;epc=N)`
  - A non-nullable list of non-nullable primitives, where you will receive N of these primitive elements per cycle at the output. `epc` is optional. Useful for UTF8 strings, for example.
- `null(<A>)`
  - To allow an element to be nullable.
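Because the configuration string is sensitive to whitespace and consists of nested elements, a quick sanity check before synthesis can save time. The Python sketch below is a hypothetical helper, not part of Fletcher. It only checks for whitespace and balanced parentheses and does not validate element names or options.

```python
def check_config_string(config):
    """Return a list of problems found in an ArrayReader/Writer config string."""
    problems = []
    if any(ch.isspace() for ch in config):
        problems.append("contains whitespace")
    depth = 0
    for ch in config:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # closing parenthesis without a matching opening one
                break
    if depth != 0:
        problems.append("unbalanced parentheses")
    return problems


print(check_config_string("struct(null(prim(32)),prim(64))"))  # []
print(check_config_string("list(list(prim(8))"))   # ['unbalanced parentheses']
print(check_config_string("prim( 32)"))            # ['contains whitespace']
```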
For example, if you have the Schema of a RecordBatch as follows:
    Schema {
      X: int32
      Y: string    // Using UTF8 characters
      Z: struct{
        A: int32   // Nullable
        B: double
      }
    }
For simplicity, assume all elements are non-nullable, except those of field A in the struct of field Z.
Suppose we would like to read from this RecordBatch in host memory. You can instantiate three ArrayReaders using the following three configuration strings:
- X: "prim(32)"
- Y: "listprim(8)"
- Z: "struct(null(prim(32)),prim(64))"
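The mapping from this schema to the three configuration strings can also be written down programmatically, which avoids typos in deeply nested strings. The helper functions below (`prim`, `listprim`, `struct`, `nullable`) are hypothetical Python names introduced for this sketch; they simply compose the element syntax listed earlier.

```python
def prim(width):
    return f"prim({width})"


def listprim(width, epc=None):
    # epc is optional in the config string syntax.
    options = "" if epc is None else f";epc={epc}"
    return f"listprim({width}{options})"


def struct(*children):
    # No spaces after the commas: whitespace is not allowed.
    return "struct(" + ",".join(children) + ")"


def nullable(child):
    return f"null({child})"


configs = {
    "X": prim(32),                                  # int32
    "Y": listprim(8),                               # UTF8 string
    "Z": struct(nullable(prim(32)), prim(64)),      # struct{A: int32?, B: double}
}
for name, config in configs.items():
    print(f"{name}: {config}")
# X: prim(32)
# Y: listprim(8)
# Z: struct(null(prim(32)),prim(64))
```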
After you set the ArrayReader/Writer configuration string, the hardware structure is generated.
Each ArrayReader/Writer has the following streams:
- From accelerator to ArrayReader/Writer:
  - Command (cmd): To issue commands to the ArrayReader/Writer.
- From ArrayReader/Writer to accelerator:
  - Unlock: To notify the accelerator the command has been executed.
An ArrayReader additionally has the following streams:
- From ArrayReader to host memory interface:
  - Bus read request (bus_rreq): To issue burst read requests to the host memory.
- From host memory interface to ArrayReader:
  - Bus read data (bus_rdat): To receive requested data from host memory.
- From ArrayReader to accelerator:
  - Data (out): Streams of the data type defined in the schema.
For ArrayWriters:
- From ArrayWriter to host memory interface:
  - Bus write request (bus_wreq): To issue burst write requests to the host memory.
  - Bus write data (bus_wdat): To stream write data to host memory.
- From accelerator to ArrayWriter:
  - Data (in): Streams of the data type defined in the schema.
The streams follow the ready/valid handshaking methodology similar to AXI4. This means a transaction is handshaked only when both the ready and valid signals are asserted in a single clock cycle. Furthermore, it means any producer of a stream may not wait for the ready signal to be asserted before asserting valid. However, a stream consumer may assert ready before any valid signal is asserted. For more detail, it is highly recommended to read the AXI4 protocol specification chapter on "Basic read and write transactions".
Here are some examples that show the layout of the output (out) or input (in) streams for specific schemas.
In these examples, whenever the (nested) field is nullable, the config string should be wrapped with null().
`prim(32)`, or `null(prim(32))` if it is nullable.
- data(32): (optional) '1' when this element is not null, '0' otherwise
- data(31 ... 0): Int32 element
- last: '1' when element is last in request, '0' otherwise
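As an illustration of this layout, the Python sketch below splits a 33-bit data vector of a nullable int32 into its "not null" bit and its value. The function name is hypothetical, and interpreting the low 32 bits as a signed two's-complement integer is an assumption based on the Arrow int32 type.

```python
def unpack_nullable_int32(data):
    """Split a 33-bit data vector into (not_null, value)."""
    not_null = (data >> 32) & 1          # data(32)
    raw = data & 0xFFFFFFFF              # data(31 ... 0)
    # Interpret the low 32 bits as a signed two's-complement integer.
    value = raw - (1 << 32) if raw & (1 << 31) else raw
    return bool(not_null), value


print(unpack_nullable_int32((1 << 32) | 0xFFFFFFFF))  # (True, -1)
print(unpack_nullable_int32(5))                       # (False, 5)
```

For a non-nullable `prim(32)`, the data vector is only 32 bits wide and there is no bit 32 to inspect.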
`list(prim(8))`, or `null(list(null(prim(8))))` if both the list and the char are nullable, or `null(list(prim(8)))` if only the list is nullable, or `list(null(prim(8)))` if only the char is nullable.
Assuming the null() examples are clear, we omit them in further examples.
- data(32): (optional) '1' when this list is not null, '0' otherwise
- data(31 ... 0): length of list of chars
- last: '1' when list is last in request, '0' otherwise
- data(8): (optional) '1' when this element is not null, '0' otherwise
- data(7 ... 0): Char element
- last: '1' when element is last in request, '0' otherwise
`list(list(prim(8)))`
- data(32): (optional) '1' when this list (of lists) is not null, '0' otherwise
- data(31 ... 0): length of list (of lists)
- last: '1' when list (of lists) is last in request, '0' otherwise
- data(32): (optional) '1' when this list (of bytes) is not null, '0' otherwise
- data(31 ... 0): length of list (of bytes)
- last: '1' when list (of bytes) is last in request, '0' otherwise
- data(8): (optional) '1' when this element is not null, '0' otherwise
- data(7 ... 0): Byte element
- last: '1' when element is last in request, '0' otherwise
struct(list(prim(8)),prim(32))
Note well: when any of the "not null" bits are unused, the bit indices shift accordingly.
- data(66): (optional) '1' when the whole struct is not null, '0' otherwise
- data(65): (optional) '1' when the list field in the struct is not null, '0' otherwise
- data(64 ... 33): length of list (of chars)
- data(32): (optional) '1' when the int32 field in the struct is not null, '0' otherwise
- data(31 ... 0): Int32 element
- last: '1' when struct is last in request, '0' otherwise
- data(8): (optional) '1' when this element is not null, '0' otherwise
- data(7 ... 0): Char element
- last: '1' when element is last in list, '0' otherwise
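How the bit indices shift when "not null" bits are unused can be worked out with a small Python sketch. The function `struct_layout` and its field names are hypothetical; the model packs the fields of `struct(list(prim(8)),prim(32))` from the least significant bit upward, in the order implied by the fully nullable layout listed above.

```python
def struct_layout(struct_nullable, list_nullable, int_nullable):
    """Return {field: (msb, lsb)} for the struct stream's data vector."""
    layout = {}
    pos = 0

    def take(name, width):
        nonlocal pos
        layout[name] = (pos + width - 1, pos)
        pos += width

    take("int32", 32)
    if int_nullable:
        take("int32_not_null", 1)
    take("list_length", 32)
    if list_nullable:
        take("list_not_null", 1)
    if struct_nullable:
        take("struct_not_null", 1)
    return layout


print(struct_layout(True, True, True))
# {'int32': (31, 0), 'int32_not_null': (32, 32), 'list_length': (64, 33),
#  'list_not_null': (65, 65), 'struct_not_null': (66, 66)}
print(struct_layout(False, False, False))
# {'int32': (31, 0), 'list_length': (63, 32)}
```

With nothing nullable, the list length moves down from `data(64 ... 33)` to `data(63 ... 32)`.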
list(struct(list(prim(8)),prim(32)))
- data(32): (optional) '1' when this list (of structs) is not null, '0' otherwise
- data(31 ... 0): length of list (of structs)
- last: '1' when list (of structs) is last in request, '0' otherwise
Note well: when any of the "not null" bits are unused, the bit indices shift accordingly.
- data(66): (optional) '1' when this struct is not null, '0' otherwise
- data(65): (optional) '1' when the list field in the struct is not null, '0' otherwise
- data(64 ... 33): length of list (of chars)
- data(32): (optional) '1' when the int32 field in the struct is not null, '0' otherwise
- data(31 ... 0): Int32 element
- last: '1' when struct is last in list (of structs), '0' otherwise
- data(8): (optional) '1' when this element is not null, '0' otherwise
- data(7 ... 0): Char element
- last: '1' when element is last in list, '0' otherwise
For more in-depth information, check out ArrayConfig.vhd