diff --git a/tfx_addons/feature_selection/README.md b/tfx_addons/feature_selection/README.md
index 99a21e0d..e6253c71 100644
--- a/tfx_addons/feature_selection/README.md
+++ b/tfx_addons/feature_selection/README.md
@@ -19,13 +19,45 @@ Component
 This project will allow the user to select different algorithms for performing feature selection on datasets artifacts in TFX pipelines.
 
 ## Project Implementation
-Feature Selection Custom Component will be implemented as Python function-based component.
-Implementation of the Feature Selection Custom Component can be done using the following approach:
+The Feature Selection Custom Component is implemented as a Python function-based component.
+
+The implementation of the Feature Selection Custom Component uses the following approach:
 - Get dataset artifact generated by ExampleGen
-- Convert it into the format compatible with Scikit-Learn functions
-- Perform univariate feature selection using parameters given by users
-- Remove not selected features from the dataset
-- Provide feature scores of the selected features as a custom artifact
+- Convert it into a format compatible with Scikit-Learn functions (TFRecord to NumPy dictionaries)
+- Perform univariate feature selection with the `SelectorFunc` specified in the module file
+- Output the following two artifacts:
+  - `updated_data`: Duplicate of the input `Example` artifact, but with updated URI and data values
+  - `feature_selection`: Contains data about the feature selection process with the following values available:
+    - `scores`: Metric scores from the selector
+    - `p_values`: Calculated p-values from the selector
+    - `selected_features`: List of selected columns after feature selection
+
+## Module file
+#### Structure
+The module file is required to define the following three values:
+- `SELECTOR_PARAMS`: Parameters for `SelectorFunc`
+- `TARGET_FEATURE`: The target feature in the dataset
+- `SelectorFunc`: Univariate function for feature selection
+
+#### Example module file
+In the example below, we use sklearn functions directly for simplicity. You may define custom functions, provided the overall input/output structure stays the same.
+``` python
+from sklearn.feature_selection import SelectKBest as SelectorFunc
+from sklearn.feature_selection import chi2
+
+SELECTOR_PARAMS = {"score_func": chi2, "k": 2}
+TARGET_FEATURE = 'species'
+```
+
+## Example usage
+You may use the feature selection component in a way similar to [StatisticsGen](https://www.tensorflow.org/tfx/guide/statsgen):
+``` python
+feature_selector = FeatureSelection(
+    orig_examples=example_gen.outputs['examples'],
+    module_file='example.modules.iris_module_file'
+)
+```
+
 
 ## Project Dependencies
 The implementation will use the [Scikit-learn feature selection functions](https://scikit-learn.org/stable/modules/feature_selection.html)
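
The selection step described under Project Implementation (score each column against `TARGET_FEATURE`, keep the best ones, and report `scores` and `selected_features`) can be illustrated with a small standalone sketch. This is not the component's code: the `columns` dictionary, its made-up values, and the helpers `anova_f_score` and `select_k_best` are hypothetical names introduced here. A hand-written one-way ANOVA F-score (the same idea as scikit-learn's `f_classif`) stands in for `chi2` so that the sketch runs with only the standard library, and p-values are omitted because the standard library has no F-distribution.

```python
from collections import defaultdict

# Hypothetical stand-in for the intermediate "dictionary of columns" format:
# one list of values per column. The numbers are made up for illustration.
columns = {
    "petal_length": [1.4, 1.3, 1.5, 4.7, 4.5, 4.9],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.2, 3.1],
    "petal_width": [0.2, 0.2, 0.3, 1.4, 1.5, 1.5],
    "species": ["setosa"] * 3 + ["versicolor"] * 3,
}

# Same names as the module file values; here k is the only selector parameter.
TARGET_FEATURE = "species"
SELECTOR_PARAMS = {"k": 2}


def anova_f_score(values, labels):
    """One-way ANOVA F-statistic of a single feature against class labels."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)

    grand_mean = sum(values) / len(values)
    ss_between = 0.0  # spread of the class means around the grand mean
    ss_within = 0.0   # spread of the values around their own class mean
    for members in groups.values():
        mean = sum(members) / len(members)
        ss_between += len(members) * (mean - grand_mean) ** 2
        ss_within += sum((v - mean) ** 2 for v in members)

    if ss_within == 0:
        # No variation inside any class: avoid dividing by zero.
        return float("inf")
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def select_k_best(columns, target, k):
    """Score every non-target column and keep the k highest-scoring ones."""
    labels = columns[target]
    scores = {
        name: anova_f_score(values, labels)
        for name, values in columns.items()
        if name != target
    }
    keep = set(sorted(scores, key=scores.get, reverse=True)[:k])
    # Preserve the original column order in the reported selection.
    selected = [name for name in scores if name in keep]
    return scores, selected


scores, selected_features = select_k_best(columns, TARGET_FEATURE, **SELECTOR_PARAMS)

# Assumption: the updated data keeps the target column next to the selected features.
updated_data = {name: columns[name] for name in selected_features + [TARGET_FEATURE]}

print(selected_features)  # -> ['petal_length', 'petal_width']
```

With a scikit-learn selector such as `SelectKBest`, the corresponding values would typically be read from the fitted selector's `scores_` and `pvalues_` attributes and its `get_support()` mask.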