ValueError instead of TypeError in Python 2.7 #10

Open
dburns7 opened this issue Feb 27, 2017 · 5 comments

Comments

@dburns7

dburns7 commented Feb 27, 2017

The try/except block starting at line 76 of datacleaner.py raises a ValueError, rather than the TypeError it catches, in Python 2.7 when the column is of type object (string). Since the Python 2.7 icon is displayed in the repo markdown, can you clarify which Python versions are supported?

@rhiever
Owner

rhiever commented Mar 8, 2017

Sounds like this is a bug. Would you be willing to write a patch for it?

@jmeguira

jmeguira commented Oct 5, 2018

Hi! Mind if I take a shot at this?

@rhiever
Owner

rhiever commented Oct 10, 2018

Please do! Probably the best starting point is to write a minimal example that reproduces the error; that example can then stand as our first unit test for this patch.
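A minimal sketch of what such a regression test might look like, assuming the eventual patch catches both exception types. `fill_missing` is a hypothetical stand-in for the NaN-filling step in datacleaner.py, not the project's actual function, and the test names are illustrative only.

```python
import numpy as np
import pandas as pd


def fill_missing(series):
    """Hypothetical helper: fill NaNs with the median, else with the mode."""
    try:
        return series.fillna(series.median())
    except (TypeError, ValueError):
        # Non-numeric column. Which of the two exceptions median() raises
        # is assumed to depend on the environment, so both are caught.
        most_frequent = series.mode()
        if len(most_frequent) > 0:
            return series.fillna(most_frequent.iloc[0])
        # No mode available: fall back to the nearest valid value.
        return series.bfill().ffill()


def test_object_column_falls_back_to_mode():
    column = pd.Series(['a', np.nan, 'c', 'a'], dtype=object)
    assert fill_missing(column).tolist() == ['a', 'a', 'c', 'a']


def test_numeric_column_uses_median():
    column = pd.Series([1.0, np.nan, 3.0])
    assert fill_missing(column).tolist() == [1.0, 2.0, 3.0]
```

Catching a tuple keeps the fallback path identical whichever exception is raised, so the same test should hold on every supported interpreter.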

@jmeguira

jmeguira commented Oct 15, 2018

So I took a look at this today and was having trouble reproducing the error. I've included my test script here. Maybe I'm misinterpreting the issue described above?

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

def autoclean(input_dataframe, drop_nans=False, copy=False, encoder=None,
              encoder_kwargs=None, ignore_update_check=False):
    """Performs a series of automated data cleaning transformations on the provided data set

    Parameters
    ----------
    input_dataframe: pandas.DataFrame
        Data set to clean
    drop_nans: bool
        Drop all rows that have a NaN in any column (default: False)
    copy: bool
        Make a copy of the data set (default: False)
    encoder: category_encoders transformer
        A valid category_encoders transformer, which is passed an inferred cols list (default: None, which falls back to LabelEncoder)
    encoder_kwargs: dict
        Keyword arguments passed to the encoder's constructor (default: None)
    ignore_update_check: bool
        Do not check for the latest version of datacleaner

    Returns
    -------
    output_dataframe: pandas.DataFrame
        Cleaned data set

    """
    '''global update_checked
    if ignore_update_check:
        update_checked = True

    if not update_checked:
        update_check('datacleaner', __version__)
        update_checked = True'''

    if copy:
        input_dataframe = input_dataframe.copy()

    if drop_nans:
        input_dataframe.dropna(inplace=True)

    if encoder_kwargs is None:
        encoder_kwargs = {}

    for column in input_dataframe.columns.values:
        # Replace NaNs with the median or mode of the column depending on the column type
        try:
            print('hit try block')
            input_dataframe[column].fillna(input_dataframe[column].median(), inplace=True)
        except TypeError:
            print('caught type error')
            most_frequent = input_dataframe[column].mode()
            # If the mode can't be computed, use the nearest valid value
            # See https://github.com/rhiever/datacleaner/issues/8
            if len(most_frequent) > 0:
                input_dataframe[column].fillna(input_dataframe[column].mode()[0], inplace=True)
            else:
                input_dataframe[column].fillna(method='bfill', inplace=True)
                input_dataframe[column].fillna(method='ffill', inplace=True)


        # Encode all strings with numerical equivalents
        if str(input_dataframe[column].values.dtype) == 'object':
            if encoder is not None:
                column_encoder = encoder(**encoder_kwargs).fit(input_dataframe[column].values)
            else:
                column_encoder = LabelEncoder().fit(input_dataframe[column].values)

            input_dataframe[column] = column_encoder.transform(input_dataframe[column].values)

    return input_dataframe


def test_type_error():
    d = {'A': ['a', np.nan, 'c'], 'B': [np.nan, 'e', np.nan]}
    df = pd.DataFrame(data=d)
    print(df)
    print(df['A'].dtypes)
    cleaned_data = autoclean(df)
    print(cleaned_data)

def main():
    test_type_error()

if __name__ == '__main__':
    main()

which outputs:

     A    B
0    a  NaN
1  NaN    e
2    c  NaN
object
hit try block
caught type error
hit try block
caught type error
   A  B
0  0  0
1  0  0
2  1  0
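Since the output above shows a TypeError being caught, the difference may come down to the environment. A small diagnostic sketch, assuming the exception class depends on the installed Python and pandas versions. `median_error_name` is a hypothetical helper written for this thread, not part of datacleaner.

```python
import platform

import numpy as np
import pandas as pd


def median_error_name(series):
    """Return the name of the exception median() raises, or None if it succeeds."""
    try:
        series.median()
    except Exception as exc:  # deliberately broad: this is only a diagnostic
        return type(exc).__name__
    return None


column = pd.Series(['a', np.nan, 'c'], dtype=object)
print('python', platform.python_version())
print('pandas', pd.__version__)
print('median() raised:', median_error_name(column))
```

Running this in both environments and comparing the three printed lines should show whether the two setups actually disagree on the exception type.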

@rhiever
Owner

rhiever commented Oct 19, 2018

ping @dburns7


3 participants