This repository has been archived by the owner on Dec 16, 2024. It is now read-only.

replace_values() throws TypeError: Invalid value 'ERROR:Unmapped - Not In Refset' for dtype Int64 #45

Closed
georgm8 opened this issue Mar 11, 2023 · 4 comments

Comments

Contributor

georgm8 commented Mar 11, 2023

The replace_values() function throws TypeError: Invalid value 'ERROR:Unmapped - Not In Refset' for dtype Int64. It tries to replace unmapped values in a column with the string held in the variable other, which fails when the pandas Series is not a string column.

A suggested quick fix is to cast the Series to str and to convert the dictionary keys to strings as well.

# emergency_care_features.py
import pandas as pd


def replace_values(
    data: pd.Series, replacements: dict, other: str = "ERROR:Unmapped - Not In Refset"
) -> pd.Series:
    # Values found in `replacements` are mapped to their category;
    # everything else is set to `other`.

    # Cast both the dictionary keys and the data to str so that the string
    # `other` is a valid value for the column, whatever its original dtype.
    replacements_str = {str(k): v for k, v in replacements.items()}
    data = data.astype(str)

    data_cat = (
        # Original: data.where(data.isin(replacements), other).replace(replacements).astype(str)
        data.where(data.isin(replacements_str), other).replace(replacements_str).astype(str)
    )

    return data_cat
Contributor Author

georgm8 commented Mar 11, 2023

Pull request #46

Member

vvcb commented Mar 13, 2023

Thanks for reporting this, @georgm8. The data expected in this column are SNOMED codes, which are integers rather than strings. feature_maps.py generates the map from the SNOMED codes (as integers) to the string categories.

Not sure why that error appears. Looking at this on my phone at the moment. Will check this evening and merge.

Regarding the SNOMED codes for missing data, I agree that they should go in feature_maps along with 0, which I think is already in there.

Member

vvcb commented Mar 13, 2023

@georgm8, I have rerun v0.3.1 on the LTH data and don't get this error. The following is the truncated output of good.dtypes after the first validation. There shouldn't really be any Int64 dtypes unless you are coercing columns into this in a previous step.
Is it possible that this was introduced to allow NaN in SNOMED columns, instead of assigning 0 or one of the allowed values for missing or unknown data?

Can you please check and close this issue if this explains it?

Also please see pandas-dev/pandas#45729.

column                    dtype
patient_id                int64
visit_id                  int64
townsend_score_quintile   int64
gender                    object
activage                  int64
ethnos                    object
accommodationstatus       int64
procodet                  object
edsitecode                object
eddepttype                object
edarrivalmode             int64
edattendcat               object
edattendsource            int64
edarrivaldatetime         datetime64[ns, UTC]
edwaittime                float64
edacuity                  int64
edchiefcomplaint          int64
edcomorb_01               int64
eddiag_NN                 int64
edentryseq_NN             int64
eddiagqual_NN             int64
edinvest_NN               int64
edtreat_NN                int64
timeined                  float64
disstatus                 int64
edattenddispatch          int64
edrefservice              int64
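
For anyone checking their own pipeline, here is a minimal sketch of how the nullable dtype can creep in and how to spot it. The two-column frame, its values, and the helper name nullable_int_columns are made up for illustration; only the column names are borrowed from the listing above.

```python
import pandas as pd

# Hypothetical data: small placeholder integers, not real SNOMED codes.
df = pd.DataFrame(
    {
        "edacuity": [1, 2, 3],               # complete column
        "edchiefcomplaint": [10, None, 30],  # one missing value
    }
)

# A complete integer column is inferred as int64; a gap forces float64 (NaN).
complete_dtype = df["edacuity"].dtype
inferred_dtype = df["edchiefcomplaint"].dtype
print(complete_dtype, inferred_dtype)

# Coercing to the nullable integer dtype keeps integers and stores the gap as <NA>.
df["edchiefcomplaint"] = df["edchiefcomplaint"].astype("Int64")


def nullable_int_columns(frame: pd.DataFrame) -> list:
    """Return the names of columns that use the nullable Int64 dtype."""
    return [col for col, dtype in frame.dtypes.items() if isinstance(dtype, pd.Int64Dtype)]


print(nullable_int_columns(df))
```

If the helper returns any SNOMED column, an earlier step is coercing it to Int64, which is the situation the error message points to.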

Contributor Author

georgm8 commented Mar 14, 2023

Thanks - you're absolutely right - I forgot to remove the Nullable Integer data type I was testing out earlier. No error with int64 data types. Closing.
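
For the record, a hedged sketch of a variant that tolerates both int64 and Int64 input without turning the dictionary keys into strings. This is not what the package does: replace_values_any_dtype is a hypothetical name, and the codes and categories below are placeholders rather than real SNOMED mappings. It casts the Series to object and looks each value up with dict.get, so anything not in the map, including missing values, falls back to other.

```python
import pandas as pd

OTHER = "ERROR:Unmapped - Not In Refset"


def replace_values_any_dtype(
    data: pd.Series, replacements: dict, other: str = OTHER
) -> pd.Series:
    # Cast to object first so a string fallback is valid whatever the input dtype.
    as_object = data.astype(object)
    # dict.get returns the mapped category, or `other` for unmapped and missing values.
    return as_object.map(lambda value: replacements.get(value, other)).astype(str)


# Placeholder mapping with integer keys, as in feature_maps.py (values are invented).
replacements = {1: "category_a", 2: "category_b"}

plain = pd.Series([1, 2, 99], dtype="int64")
nullable = pd.Series([1, None, 99], dtype="Int64")

plain_result = replace_values_any_dtype(plain, replacements).tolist()
nullable_result = replace_values_any_dtype(nullable, replacements).tolist()
print(plain_result)
print(nullable_result)
```

The trade-off is that a missing value is silently reported as unmapped, so adding explicit codes for missing data to feature_maps, as discussed above, may still be the cleaner route.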

2 participants