A Python package for streamlining the front end of the machine learning workflow.
Common to the front end of most machine learning pipelines are
exploratory data analysis (EDA) and feature preprocessing. EDA
facilitates a better understanding of the data being analyzed and
allows for targeted, more robust model development, while feature
imputation and preprocessing are requirements for many machine learning
algorithms. nursepy aims to streamline the front end of the machine
learning pipeline by generating descriptive summary tables and figures,
automating feature imputation, and automating preprocessing. Automated
detection of imputation and preprocessing methods has been implemented
to minimize time spent and to optimize the methods used. The functions
in nursepy were developed to provide useful and informative metrics
that are applicable to a wide array of datasets.
nursepy
was developed as part of DSCI 524 of the MDS program at
UBC.
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple nursepy
The package includes the following three functions:
Function | Input | Output | Description |
---|---|---|---|
eda | a pandas dataframe | a python dictionary | a dictionary containing a histogram and summary statistics for each column |
impute | a pandas dataframe | a pandas dataframe with imputed values | automatic imputation-method detection and user-defined imputation method selection |
preproc | a pandas dataframe | a pandas dataframe with preprocessed features | automatic feature-preprocessing detection and user-defined feature preprocessing |
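Each function can be imported directly from its own module:

from nursepy.eda import eda
from nursepy.impute import impute
from nursepy.preproc import preproc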
nursepy
was developed to closely align with scikit-learn; however, the
functions herein streamline and automate the front-end machine learning
pipeline for use with any machine learning package.

Dependencies:
- numpy==1.18.1
- pandas==0.25.3
- sklearn==0.0
- altair==3.2.0
- pytest==5.3.2
The eda()
function makes it easy to explore the data by providing both visual
insights and summary statistics for each column.
Note:
- The function should be run in JupyterLab (or another IDE) to view the output. If Jupyter Notebook is used, run alt.renderers.enable('notebook') first.
- Depending on the dataset, alt.data_transformers.disable_max_rows() may be required before running eda (see the setup sketch below).
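For example, a minimal setup cell covering both notes might look like this (each call is only needed in the situation described above):

import altair as alt

alt.renderers.enable('notebook')          # only when running in Jupyter Notebook
alt.data_transformers.disable_max_rows()  # only for datasets exceeding Altair's default row limit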
These are required imports:
from sklearn.datasets import load_wine
import pandas as pd
from nursepy.eda import eda
To see how this function works, we will load the wine dataset from
sklearn.
wine = load_wine()
data = pd.DataFrame(wine.data)
data.columns = wine.feature_names
After calling the eda()
function, we will get the following outputs:
eda_results = eda(data)
eda_results['stats']['magnesium']
## magnesium
## count 178.000000
## mean 99.741573
## std 14.282484
## min 70.000000
## 25% 88.000000
## 50% 98.000000
## 75% 107.000000
## max 162.000000
eda_results['histograms']['magnesium']
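The returned histogram renders inline in JupyterLab. Assuming it is an Altair Chart object (which the package's Altair dependency suggests), it can also be written to disk:

eda_results['histograms']['magnesium'].save('magnesium_histogram.html')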
The purpose of impute
is to automatically impute missing data by comparing the test scores of
models trained on data imputed with different methods (currently this
function only works on numeric data). A scikit-learn
RandomForestClassifier
or RandomForestRegressor
is used to
compute a score for each imputed data set.
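Conceptually, the scoring loop resembles the sketch below. This is an illustration only, not the package's actual implementation; the strategy names mirror the keys of the score dictionary described later.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def score_imputation(X_train, y_train, X_test, y_test, strategy):
    # Impute train and test with one strategy, then score a
    # RandomForestClassifier trained on the imputed training set.
    if strategy == 'feature_mean':
        means = X_train.mean()
        X_train, X_test = X_train.fillna(means), X_test.fillna(means)
    elif strategy == 'forward_fill':
        # Note: forward-filling can leave NaNs at the top of a column;
        # a real implementation has to handle that case as well.
        X_train, X_test = X_train.ffill(), X_test.ffill()
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train.values.ravel())
    return model.score(X_test, y_test)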
Required imports:
from nursepy.impute import impute
import numpy as np
import pandas as pd
To use impute
, we need a dataset with missing data. Also, this data
needs to be split into training and test sets. Let’s create our data:
# training features, with missing values inserted below
Xt = {'one': np.random.randn(10), 'two': np.random.randn(10),
      'three': np.random.randn(10), 'four': np.random.randn(10)}
yt_c = pd.DataFrame({'target': [1, 0, 1, 1, 0, 0, 0, 1, 0, 1]})
# assigning None into a float numpy array stores NaN
Xt['two'][3:5] = None
Xt['three'][7] = None
Xt['four'][1] = None
Xt['four'][3] = None
Xt = pd.DataFrame(Xt)

# validation features, built the same way
Xv = {'one': np.random.randn(10), 'two': np.random.randn(10),
      'three': np.random.randn(10), 'four': np.random.randn(10)}
yv_c = pd.DataFrame({'target': [1, 0, 0, 0, 1, 1, 1, 0, 0, 1]})
Xv['one'][2:4] = None
Xv['two'][2] = None
Xv['four'][1] = None
Xv['four'][4] = None
Xv = pd.DataFrame(Xv)
Now that that’s over with, let’s call impute
!
summary = impute(Xt, yt_c, Xv, yv_c, model_type='classification')
A call to impute
returns a dictionary with the following key-value pairs:
- “imputation_scores_”: a dictionary of the 6 imputation methods and their associated RandomForest scores:
  - remove_na
  - forward_fill
  - backward_fill
  - feature_mean
  - feature_median
  - feature_interpolate
- “missing_value_counts_”: a dictionary with features as keys and the number of missing values as values
- “missing_indeces_”: a dictionary with the indices of rows that contain missing data for the train and test sets
- “best_imputed_data_”: a dataframe with the imputed data of the imputation method with the best score
We can access the return objects from impute
by indexing their keys.
Let’s take a look at the RandomForestClassifier
scores for each of the
imputation methods:
pd.DataFrame({'Method': list(summary['imputation_scores_'].keys()),
'Score': list(summary['imputation_scores_'].values())})
## Method Score
## 0 remove_na 0.333333
## 1 forward_fill 0.400000
## 2 backward_fill 0.300000
## 3 feature_mean 0.300000
## 4 feature_median 0.300000
## 5 feature_interpolate 0.200000
The number of missing values in each column can be extracted the same way:
pd.DataFrame({'Feature': list(summary['missing_value_counts_'].keys()),
'Count': list(summary['missing_value_counts_'].values())})
## Feature Count
## 0 one 0
## 1 two 2
## 2 three 1
## 3 four 2
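The row indices that contained missing data can be inspected through the “missing_indeces_” entry in the same way:

summary['missing_indeces_']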
And finally, let’s extract the training data (index 0 for training and 1 for validation) imputed by the method with the best score:
summary['best_imputed_data_'][0]
## one two three four
## 0 -1.047538 0.042689 -0.247081 -0.363426
## 1 -0.304713 0.800541 -2.626776 -0.363426
## 2 0.327052 -0.755122 -0.505563 0.915825
## 3 0.679104 -0.755122 1.582215 0.915825
## 4 -0.179888 -0.755122 0.857867 1.016828
## 5 -0.085712 -0.774920 0.107280 1.354976
## 6 0.860473 0.029632 -0.979732 1.270861
## 7 0.217099 0.384991 -0.979732 -0.425301
## 8 0.191896 -1.207064 0.105697 0.742878
## 9 0.207372 -0.607005 1.752648 -0.216495
preproc
preprocesses data frames, including one-hot encoding, scaling,
imputation, and label encoding.
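For reference, the transformations that preproc automates correspond roughly to the scikit-learn sketch below. This illustrates the steps, not the package's internals; the column lists are placeholders you would supply.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_preprocessor(numeric_cols, categorical_cols):
    # Numeric features: impute missing values, then scale.
    numeric = Pipeline([('impute', SimpleImputer(strategy='median')),
                        ('scale', StandardScaler())])
    # Categorical features: one-hot encode.
    categorical = OneHotEncoder(handle_unknown='ignore')
    return ColumnTransformer([('num', numeric, numeric_cols),
                              ('cat', categorical, categorical_cols)])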
Required imports:
from nursepy.preproc import preproc
import pandas as pd
from sklearn.datasets import load_wine
Let’s load some data from sklearn:
wine = load_wine()
data = pd.DataFrame(wine.data)
data.columns = wine.feature_names
The output of preproc
is a tuple with the processed training and test
sets. Let’s visualize the preprocessed train set:
X_train_processed, X_test_processed = preproc(data)
X_train_processed
## alcohol malic_acid ash ... hue od280/od315_of_diluted_wines proline
## 0 14.23 1.71 2.43 ... 1.04 3.92 1065.0
## 1 13.20 1.78 2.14 ... 1.05 3.40 1050.0
## 2 13.16 2.36 2.67 ... 1.03 3.17 1185.0
## 3 14.37 1.95 2.50 ... 0.86 3.45 1480.0
## 4 13.24 2.59 2.87 ... 1.04 2.93 735.0
## .. ... ... ... ... ... ... ...
## 173 13.71 5.65 2.45 ... 0.64 1.74 740.0
## 174 13.40 3.91 2.48 ... 0.70 1.56 750.0
## 175 13.27 4.28 2.26 ... 0.59 1.56 835.0
## 176 13.17 2.59 2.37 ... 0.60 1.62 840.0
## 177 14.13 4.10 2.74 ... 0.61 1.60 560.0
##
## [178 rows x 13 columns]
The official documentation is hosted on Read the Docs: https://nursepy.readthedocs.io/en/latest/
This package was created with Cookiecutter and the UBC-MDS/cookiecutter-ubc-mds project template, modified from the pyOpenSci/cookiecutter-pyopensci project template and the audreyr/cookiecutter-pypackage.