Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPEC] DeformablePSROIPooling specification update #4282

Merged
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
83 changes: 61 additions & 22 deletions docs/ops/detection/DeformablePSROIPooling_1.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,70 +4,77 @@

**Category**: Object detection

**Short description**: *DeformablePSROIPooling* computes position-sensitive pooling on regions of interest specified by input.
**Short description**: *DeformablePSROIPooling* computes deformable position-sensitive pooling of regions of interest specified by input.

**Detailed description**: [Reference](https://arxiv.org/abs/1703.06211).

*DeformablePSROIPooling* operation takes two or three input tensors: with feature maps, with regions of interests (box coordinates) and an optional tensor with transformation values.
The box coordinates are specified as five element tuples: *[batch_id, x_1, y_1, x_2, y_2]* in absolute values.
*DeformablePSROIPooling* operation takes two or three input tensors: with position score maps, with regions of interests (ROI, box coordinates) and an optional tensor with transformation values (normalized offsets for ROI bins coordinates).
If only two inputs are provided, position sensitive pooling with regular ROI bins position is calculated (non-deformable).
If third input is provided, each bin position is transformed by adding corresponding offset to the bin left top corner coordinates. Third input values are usually calculated by regular position sensitive pooling layer, so non-deformable mode (DeformablePSROIPooling with two inputs).

The ROI coordinates are specified as five element tuples: `[batch_id, x_1, y_1, x_2, y_2]` in absolute values.


**Attributes**

* *output_dim*

* **Description**: *output_dim* is a pooled output channel number.
* **Description**: *output_dim* is the number of the output channels, size of output `C` dimension.
* **Range of values**: a positive integer
* **Type**: `int`
* **Default value**: None
* **Required**: *yes*

* *group_size*

* **Description**: *group_size* is the number of groups to encode position-sensitive score maps.
* **Range of values**: a positive integer
* **Type**: `int`
* **Default value**: 1
* **Required**: *no*

* *spatial_scale*

* **Description**: *spatial_scale* is a multiplicative spatial scale factor to translate ROI coordinates from their input scale to the scale used when pooling.
* **Description**: *spatial_scale* is a multiplicative spatial scale factor to translate ROI coordinates from their input original size to the pooling input. Ratio of the input score map size to the original image size.
* **Range of values**: a positive floating-point number
* **Type**: `float`
* **Default value**: None
* **Required**: *yes*

* *group_size*

* **Description**: *group_size* is the number of horizontal bins per row to divide single ROI area. Total number of bins can be calculated as `group_size*group_size`. It defines pooled width and height, so output `H_out` and `W_out` dimensions (always equal).
Square of the `group_size` is also the number to divide input channels `C_in` dimension and split it into `C_in \\ group_size*group_size` groups.
Each group corresponds to the exactly one output channel and ROI's bins are spread over input channel group members.

* **Range of values**: a positive integer
* **Type**: `int`
* **Default value**: 1
* **Required**: *no*

* *mode*
* **Description**: *mode* specifies mode for pooling.
* **Range of values**:
* *bilinear_deformable* - perform pooling with bilinear interpolation and deformable transformation
* *bilinear_deformable* - perform pooling with bilinear interpolation over single ROI bin. For each ROI bin average of his interpolated `spatial_bins_x*spatial_bins_y` sub-bins values is calculated.
* **Type**: string
* **Default value**: *bilinear_deformable*
* **Required**: *no*

* *spatial_bins_x*
* **Description**: *spatial_bins_x* specifies numbers of bins to divide the input feature maps over width.
* **Description**: *spatial_bins_x* specifies number of horizontal sub-bins (bilinear interpolation samples) to divide ROI single bin.
* **Range of values**: a positive integer
* **Type**: `int`
* **Default value**: 1
* **Required**: *no*

* *spatial_bins_y*
* **Description**: *spatial_bins_y* specifies numbers of bins to divide the input feature maps over height.
* **Description**: *spatial_bins_y* specifies number of vertical sub-bins (bilinear interpolation samples) to divide ROI single bin.
* **Range of values**: a positive integer
* **Type**: `int`
* **Default value**: 1
* **Required**: *no*

* *trans_std*
* **Description**: *trans_std* is the value that all transformation (offset) values are multiplied with.
* **Description**: *trans_std* is the value that all third input values(offests) are multiplied with to modulate the magnitude of the offsets.
* **Range of values**: floating point number
* **Type**: `float`
* **Default value**: 1
* **Required**: *no*

* *part_size*
* **Description**: *part_size* is the number of parts the output tensor spatial dimensions are divided into. Basically it is the height and width of the third input
* **Description**: *part_size* is the size of `H` and `W` dimensions of the third input (offsets). Basically it is the height and width of the third input
with transformation values.
* **Range of values**: positive integer number
* **Type**: `int`
Expand All @@ -76,18 +83,50 @@ The box coordinates are specified as five element tuples: *[batch_id, x_1, y_1,

**Inputs**:

* **1**: 4D input tensor with feature maps. Required.
* **1**: 4D input tensor of type *T* and shape `[N_in, C_in, H_in, W_in]` with position sensitive score maps. Required.

* **2**: 2D input tensor describing box consisting of five element tuples: `[batch_id, x_1, y_1, x_2, y_2]`. Required.
* **2**: 2D input tensor of type *T* and shape `[NUM_ROIS, 5]`. It contains a list of five element tuples describing a single ROI (region of interest): `[batch_id, x_1, y_1, x_2, y_2]`. Required.
Batch indices must be in the range of `[0, N_in-1]`.

* **3**: 4D input blob with transformation [values](https://arxiv.org/abs/1703.06211) (offsets). Optional.
* **3**: 4D input tensor of type *T* and shape `[NUM_ROIS, 2*NUM_CLASSES, group_size, group_size]` with transformation values. It contains normalized `[0, 1]` offsets for each ROI bin left top corner coordinates. Channel dimension is multiplied by `2` because of encoding two `(x, y)` coordinates. Optional.

**Outputs**:
* **1**: 4D output tensor of type *T* shape `[NUM_ROIS, output_dim, group_size, group_size]` with ROIs score maps.

**Types**:

* **1**: 4D output tensor with areas copied and interpolated from the 1st input tensor by coordinates of boxes from the 2nd input and transformed according to values from the 3rd input.
* *T*: Any floating point type.


**Example**

* Two inputs (without offsets)
```xml
<layer id="6" name="rfcn_cls_offset" type="DeformablePSROIPooling" version="opset1">
<data spatial_scale="0.0625" output_dim="882" group_size="3" mode="bilinear_deformable" spatial_bins_x="4" spatial_bins_y="4" pooled_width="3" pooled_height="3" trans_std="0.0" part_size="3"/>
<input>
<port id="0">
<dim>1</dim>
<dim>7938</dim>
<dim>63</dim>
<dim>38</dim>
</port>
<port id="1">
<dim>300</dim>
<dim>5</dim>
</port>
</input>
<output>
<port id="2" precision="FP32">
<dim>300</dim>
<dim>882</dim>
<dim>3</dim>
<dim>3</dim>
</port>
</output>
</layer>
```
* Three inputs (with offsets)
```xml
<layer ... type="DeformablePSROIPooling" ... >
<data group_size="7" mode="bilinear_deformable" output_dim="8" part_size="7" spatial_bins_x="4" spatial_bins_y="4" spatial_scale="0.0625" trans_std="0.1"/>
Expand Down