Updated readme with module file guidelines and example #151

Merged
merged 1 commit into from
Jun 10, 2022
44 changes: 38 additions & 6 deletions tfx_addons/feature_selection/README.md
@@ -19,13 +19,45 @@ Component
This project will allow the user to select different algorithms for performing feature selection on datasets artifacts in TFX pipelines.

## Project Implementation
Feature Selection Custom Component will be implemented as Python function-based component.
Implementation of the Feature Selection Custom Component can be done using the following approach:
The Feature Selection Custom Component is implemented as a Python function-based component.

Implementation of the Feature Selection Custom Component is done using the following approach:
- Get dataset artifact generated by ExampleGen
- Convert it into the format compatible with Scikit-Learn functions
- Perform univariate feature selection using parameters given by users
- Remove not selected features from the dataset
- Provide feature scores of the selected features as a custom artifact
- Convert it into the format compatible with Scikit-Learn functions (TFRecord to numpy dictionaries)
- Perform univariate feature selection with `SelectorFunc` specified in the module file
- Output the following two artifacts:
  - `updated_data`: Duplicate of the input `Example` artifact, but with updated URI and data values
  - `feature_selection`: Contains data about the feature selection process with the following values available:
    - `scores`: Metric scores from the selector
    - `p_values`: Calculated p-values from the selector
    - `selected_features`: List of selected columns after feature selection
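
The selection step and the three `feature_selection` values listed above can be sketched outside of TFX as follows. This is only an illustration, not the component's actual code: it assumes the data is already available as NumPy arrays (here the iris dataset bundled with scikit-learn stands in for the converted `ExampleGen` output), and the plain dictionary is a hypothetical stand-in for the artifact.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Stand-in for the converted ExampleGen output (assumption: plain NumPy arrays).
iris = load_iris()
feature_names = list(iris.feature_names)
X, y = iris.data, iris.target

# Univariate selection: keep the k features with the highest chi2 score.
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

# Boolean mask over the input columns; True marks a kept feature.
mask = selector.get_support()
selected_features = [name for name, keep in zip(feature_names, mask) if keep]

# Hypothetical dictionary mirroring the values of the `feature_selection` artifact.
feature_selection = {
    "scores": selector.scores_.tolist(),
    "p_values": selector.pvalues_.tolist(),
    "selected_features": selected_features,
}
```

`X_selected` holds only the kept columns, which corresponds to the data values written to `updated_data`.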

## Module file
#### Structure
The module file is required to have a structure with the following three values:
- `SELECTOR_PARAMS`: Parameters for `SelectorFunc`
- `TARGET_FEATURE`: The target feature in the dataset
- `SelectorFunc`: Univariate function for feature selection

#### Example module file
In the example below, we use sklearn functions directly for simplicity. You may define custom functions as long as the overall input/output structure stays the same.
``` python
from sklearn.feature_selection import SelectKBest as SelectorFunc
from sklearn.feature_selection import chi2

SELECTOR_PARAMS = {"score_func": chi2, "k": 2}
TARGET_FEATURE = 'species'
```
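
A minimal sketch of how the three module values might be checked and consumed is shown below. It assumes the selector is built by passing `SELECTOR_PARAMS` as keyword arguments to `SelectorFunc`; the helper names `missing_module_values` and `build_selector`, the `DummySelector` class, and the `SimpleNamespace` objects standing in for an imported module file are all hypothetical and not part of the component.

```python
from types import SimpleNamespace

# The three values the module file is required to define.
REQUIRED_NAMES = ("SELECTOR_PARAMS", "TARGET_FEATURE", "SelectorFunc")


def missing_module_values(module):
    """Return the required names that the module-like object does not define."""
    return [name for name in REQUIRED_NAMES if not hasattr(module, name)]


def build_selector(module):
    """Create the selector by passing SELECTOR_PARAMS as keyword arguments (assumption)."""
    missing = missing_module_values(module)
    if missing:
        raise ValueError(f"module file is missing: {', '.join(missing)}")
    return module.SelectorFunc(**module.SELECTOR_PARAMS)


class DummySelector:
    """Hypothetical custom selector that accepts the same keyword parameters."""

    def __init__(self, k):
        self.k = k


# Stand-ins for imported module files (hypothetical).
complete = SimpleNamespace(
    SELECTOR_PARAMS={"k": 2},
    TARGET_FEATURE="species",
    SelectorFunc=DummySelector,
)
incomplete = SimpleNamespace(TARGET_FEATURE="species")

selector = build_selector(complete)
missing = missing_module_values(incomplete)
```

Checking for the three names up front gives a clear error message when a module file is incomplete, instead of an `AttributeError` later in the pipeline run.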

## Example usage
You may use the feature selection component in a way similar to [StatisticsGen](https://www.tensorflow.org/tfx/guide/statsgen):
``` python
feature_selector = FeatureSelection(
    orig_examples=example_gen.outputs['examples'],
    module_file='example.modules.iris_module_file'
)
```


## Project Dependencies
The implementation will use the [Scikit-learn feature selection functions](https://scikit-learn.org/stable/modules/feature_selection.html)