Explicit pointwise Conv1D implementation for "Latency" strategy #811
Conversation
@jmduarte I'm actually trying this out now, but I just realized it is in Vivado. Is it possible to update this to Vitis? I would be happy to contribute if you want!
pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[0], res_tmp[0], weights, biases);
pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[1], res_tmp[1], weights, biases);
if (CONFIG_T::reuse_factor > 2)
    pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[2], res_tmp[2], weights, biases);
if (CONFIG_T::reuse_factor > 3)
    pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[3], res_tmp[3], weights, biases);
if (CONFIG_T::reuse_factor > 4)
    pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[4], res_tmp[4], weights, biases);
if (CONFIG_T::reuse_factor > 5)
    pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[5], res_tmp[5], weights, biases);
if (CONFIG_T::reuse_factor > 6)
    pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[6], res_tmp[6], weights, biases);
if (CONFIG_T::reuse_factor > 7)
    pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[7], res_tmp[7], weights, biases);
if (CONFIG_T::reuse_factor > 8)
    pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[8], res_tmp[8], weights, biases);
if (CONFIG_T::reuse_factor > 9)
    pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[9], res_tmp[9], weights, biases);
if (CONFIG_T::reuse_factor > 10)
    pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[10], res_tmp[10], weights, biases);
if (CONFIG_T::reuse_factor > 11)
    pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[11], res_tmp[11], weights, biases);
if (CONFIG_T::reuse_factor > 12)
    pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[12], res_tmp[12], weights, biases);
if (CONFIG_T::reuse_factor > 13)
    pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[13], res_tmp[13], weights, biases);
if (CONFIG_T::reuse_factor > 14)
    pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[14], res_tmp[14], weights, biases);
if (CONFIG_T::reuse_factor > 15)
    pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[15], res_tmp[15], weights, biases);
if (CONFIG_T::reuse_factor > 16)
I'm wondering if there is a better way to do this ...
Yes, I think we can use the code-generation machinery like this:
hls4ml/hls4ml/backends/fpga/passes/codegen.py
Lines 6 to 51 in abaea98
class GenerateConvIm2col(OptimizerPass):
    '''Generates code for the im2col step of 1D/2D convolution'''

    def match(self, node):
        return isinstance(node, (Conv1D, Conv2D)) and node.model.config.get_config_value('IOType') == 'io_parallel'

    def transform(self, model, node):
        node_class = node.__class__.__name__
        if '1D' in node_class:
            self._generate_im2col_1d(node)
        elif '2D' in node_class:
            self._generate_im2col_2d(node)
        else:
            raise Exception(f'Cannot generate instructions for node {node.name} ({node_class})')

    def _generate_im2col_1d(self, node):
        code_str = node.model.config.backend.generate_conv1d_line_buffer_fn(
            node.get_attr('index'),
            node.get_attr('n_partitions'),
            node.get_input_variable().shape[0],
            node.get_input_variable().shape[1],
            kernel=node.get_attr('filt_width'),
            stride=node.get_attr('stride_width'),
            pad=(node.get_attr('pad_left'), node.get_attr('pad_right')),
        )

        node.set_attr('line_buffer_codegen', Source(code_str))

    def _generate_im2col_2d(self, node):
        code_str = node.model.config.backend.generate_conv2d_line_buffer_fn(
            node.get_attr('index'),
            node.get_attr('n_partitions'),
            node.get_input_variable().shape[0],
            node.get_input_variable().shape[1],
            node.get_input_variable().shape[2],
            kernel=(node.get_attr('filt_height'), node.get_attr('filt_width')),
            stride=(node.get_attr('stride_height'), node.get_attr('stride_width')),
            pad=(
                node.get_attr('pad_top'),
                node.get_attr('pad_bottom'),
                node.get_attr('pad_left'),
                node.get_attr('pad_right'),
            ),
        )

        node.set_attr('line_buffer_codegen', Source(code_str))
hls4ml/hls4ml/backends/fpga/fpga_backend.py
Lines 671 to 731 in abaea98
def generate_conv1d_line_buffer_fn(self, layer_idx, n_partitions, in_W, in_C, kernel=3, stride=1, pad=0, dilation=1):
    """Generate a C++ function that mimics the im2col algorithm. This function works for 1D convolution.

    The HLS compiler produces suboptimal designs for an im2col algorithm implementation, so the trick we use is
    to generate the result of the im2col transformation explicitly, instead of relying on loops. Since
    the result depends on the parameters of the convolution layer (the input size, the kernel size, stride, etc.),
    we need to do this for every convolution layer.

    Args:
        layer_idx (int): Index of layer ('index' attribute).
        n_partitions (int): Number of partitions to divide the input into.
            The pixels in each partition will be processed in parallel.
        in_W (int): Width of input.
        in_C (int): Number of channels.
        kernel (int, optional): Size of the kernel. Defaults to 3.
        stride (int, optional): Stride length. Defaults to 1.
        pad (int or Iterable, optional): Padding to apply. Defaults to 0.
            Specified as either a number or a list [left_pad, right_pad].
        dilation (int, optional): Dilation rate. Defaults to 1.

    Returns:
        str: Generated C++ function
    """
    if isinstance(pad, Iterable):
        pad_left = pad[0]
        pad_right = pad[1]
    else:
        pad_left = pad
        pad_right = pad

    im2col_matrix = self._compute_conv1d_im2col((in_W, in_C), kernel, stride, (pad_left, pad_right), dilation)

    generated_code = (
        "template<class data_T, typename CONFIG_T>\n"
        "class fill_buffer_{index} : public FillConv1DBuffer<data_T, CONFIG_T> {{\n"
        "  public:\n"
        "    static void fill_buffer(\n"
        "        data_T data[CONFIG_T::in_width * CONFIG_T::n_chan],\n"
        "        data_T buffer[CONFIG_T::n_pixels][CONFIG_T::filt_width * CONFIG_T::n_chan],\n"
        "        const unsigned partition\n"
        "    ) {{\n"
    ).format(index=layer_idx)
    indent = '    '

    for partition_idx, partition in enumerate(np.split(im2col_matrix, n_partitions)):
        generated_code += indent * 2 + f'if (partition == {partition_idx:>3}) {{\n'
        for pixel_idx, arr in enumerate(partition):
            buffer_stmts = []
            for j, v in enumerate(arr):
                if v == 0:
                    val = '0'
                else:
                    val = f'data[{int(v - 1)}]'
                buffer_stmts.append(f'buffer[{pixel_idx}][{j}] = {val:>10};')
            generated_code += indent * 3 + ' '.join(buffer_stmts) + '\n'
        generated_code += '\n' + indent * 2 + '}\n'

    generated_code += indent + '}\n'
    generated_code += '};\n'

    return generated_code
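For illustration, here is a minimal sketch (not part of this PR) of how the same machinery could emit the unrolled pointwise calls instead of hand-writing the `if (CONFIG_T::reuse_factor > N)` chain. The function name and the emitted wrapper name are hypothetical; a real implementation would live alongside generate_conv1d_line_buffer_fn in fpga_backend.py.

```python
def generate_pointwise_conv1d_split_fn(layer_idx, reuse_factor):
    """Sketch: generate a C++ wrapper calling pointwise_conv_1d_latency_cl once per RF slice."""
    indent = '    '
    generated_code = (
        "template<class data_T, class res_T, typename CONFIG_T>\n"
        f"void pointwise_conv_1d_latency_cl_split_{layer_idx}(\n"
        "    data_T data_tmp[CONFIG_T::reuse_factor][CONFIG_T::in_width * CONFIG_T::n_chan / CONFIG_T::reuse_factor],\n"
        "    res_T res_tmp[CONFIG_T::reuse_factor][CONFIG_T::out_width * CONFIG_T::n_filt / CONFIG_T::reuse_factor],\n"
        "    typename CONFIG_T::weight_t weights[CONFIG_T::n_chan * CONFIG_T::n_filt],\n"
        "    typename CONFIG_T::bias_t biases[CONFIG_T::n_filt]) {\n"
    )
    # Emit exactly RF calls, one per slice, instead of a hard-coded if-chain.
    for i in range(reuse_factor):
        generated_code += (
            indent
            + f"pointwise_conv_1d_latency_cl<data_T, res_T, CONFIG_T>(data_tmp[{i}], res_tmp[{i}], weights, biases);\n"
        )
    generated_code += "}\n"
    return generated_code


print(generate_pointwise_conv1d_split_fn(layer_idx=3, reuse_factor=4))
```

This is roughly the direction the follow-up codegen branch and #881 mentioned below take.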
@vloncar @Duchstf I started this branch to use the code generation machinery: jmduarte#20
direct diff w.r.t. main: main...jmduarte:split_pointwise_conv_by_rf_codegen
is this better than the current approach?
Superseded by #881
Description
This is mostly for discussion and to let others test it out, like @Duchstf. This PR adds an explicit pointwise Conv1D implementation, where the reuse factor (RF) is used to split the layer execution and reuse the existing module RF times.

Original pointwise Conv1D:
(in_width, n_chan) -> (in_width, n_filt)

This PR splits it into RF calls of:
(in_width/RF, n_chan) -> (in_width/RF, n_filt)
(in_width/RF, n_chan) -> (in_width/RF, n_filt)
(in_width/RF, n_chan) -> (in_width/RF, n_filt)
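As a small numerical sketch of the split (the sizes below are arbitrary and not taken from the PR or its tests):

```python
import numpy as np

# Arbitrary illustrative sizes.
in_width, n_chan, n_filt, RF = 100, 3, 5, 4
assert in_width % RF == 0  # current limitation of this PR

x = np.random.rand(in_width, n_chan)
w = np.random.rand(n_chan, n_filt)
b = np.random.rand(n_filt)

# Reference pointwise Conv1D: one matrix product over the full width.
y_full = x @ w + b  # shape (in_width, n_filt)

# Split execution: RF calls, each on an (in_width/RF, n_chan) slice,
# reusing the same weights/biases (i.e. the same HLS module) every time.
y_split = np.concatenate([xi @ w + b for xi in np.split(x, RF)])

assert np.allclose(y_full, y_split)
```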
The II (initiation interval) scales roughly as RF.

To turn it on, you have to configure ConvImplementation for the layer named <layer>.
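A minimal sketch of what that per-layer configuration might look like in Python. The layer name, sizes, and the 'Pointwise' option value are assumptions for illustration; see test/pytest/test_pointwiseconv.py for the authoritative usage. The Vivado backend is used since, per the discussion above, the implementation is not yet ported to Vitis.

```python
import hls4ml
from tensorflow import keras

# Toy model: a single pointwise (kernel_size=1) Conv1D layer; sizes are arbitrary.
keras_model = keras.Sequential(
    [keras.layers.Conv1D(5, kernel_size=1, input_shape=(100, 3), name='my_pointwise')]
)

config = hls4ml.utils.config_from_keras_model(keras_model, granularity='name')
config['Model']['Strategy'] = 'Latency'
# 'Pointwise' is assumed here to be the value this PR introduces for ConvImplementation.
config['LayerName']['my_pointwise']['ConvImplementation'] = 'Pointwise'
config['LayerName']['my_pointwise']['ReuseFactor'] = 4  # in_width (100) divisible by RF (4)

hls_model = hls4ml.converters.convert_from_keras_model(
    keras_model, hls_config=config, backend='Vivado', io_type='io_parallel', output_dir='my_prj'
)
```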
Limitations:
- in_width must be divisible by RF.
- The split is written out explicitly only up to RF = 120. Could be automated with code generation.

Type of change
Tests
See test/pytest/test_pointwiseconv.py
Checklist
I have run pre-commit on the files I edited or added.