Less strict numerical type #466

Open
quancore opened this issue Apr 20, 2021 · 11 comments
Labels
enhancement New feature or request future Issues that should be tracked but not actioned yet. help wanted Extra attention is needed

Comments

@quancore

Is there any type that represents a numerical column (includes int, float etc.)?

@quancore quancore added the question Further information is requested label Apr 20, 2021
@cosmicBboy
Collaborator

cosmicBboy commented Apr 21, 2021

currently there's no way of specifying a "number" column, since pandera adheres to pandas data types (and Python in general doesn't have a generic number type). With @jeffzi's work on #369, though, you could make custom data types like this.

for now I'd recommend specifying a float since floats are a superset of integers.
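A small illustration of the float suggestion above, using plain pandas only. It is a sketch, not pandera behavior: it shows that ordinary integers survive a cast to float64 unchanged, and, as a caveat I'm adding, that float64 only represents every integer exactly up to 2**53, so very large integer IDs can be altered by the cast. The variable names are made up for the example.

```python
import pandas as pd

# Ordinary integers survive a round trip through float64 unchanged.
ints = pd.Series([1, 2, 3])
as_float = ints.astype("float64")
round_trip_ok = bool((as_float.astype("int64") == ints).all())

# Caveat: above 2**53, float64 can no longer represent every integer exactly.
big = pd.Series([2**53 + 1], dtype="int64")
big_round_trip = big.astype("float64").astype("int64")
precision_lost = bool((big_round_trip != big).any())

print(round_trip_ok, precision_lost)
```

For columns of counts or small IDs the float approach is harmless; for 64-bit identifiers it may be safer to keep an integer dtype.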

@cosmicBboy
Collaborator

cosmicBboy commented Apr 21, 2021

oh, I guess another way of doing this would be to specify pandas_dtype = None (the default) and then use a Check to validate a number type:

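import pandas as pd
import pandera as pa
from pandas.api.types import is_number

# Use a distinct variable name so the Check doesn't shadow the imported is_number helper.
is_number_check = pa.Check(lambda s: s.map(is_number), name="is_number")

schema = pa.DataFrameSchema({
    "column": pa.Column(checks=is_number_check)
})

schema(pd.DataFrame({"column": [1, 2, "a"]}))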

# Output
SchemaError: <Schema Column(name=column, type=None)> failed element-wise validator 0:
<Check is_number>
failure cases:
   index failure_case
0      2            a

@jeffzi
Collaborator

jeffzi commented Apr 21, 2021

although with @jeffzi's work on #369 you could make custom datatypes like this.

We could even have a built-in Number dtype. Coercion would output floats or ints depending on the actual values (same as pandas.to_numeric).
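For reference, this is how pandas.to_numeric picks its output dtype from the values. The snippet only demonstrates the existing pandas function; the Number dtype mentioned above is a proposal and is not shown here.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# All values are whole numbers, so the result gets an integer dtype.
whole = pd.to_numeric(pd.Series(["1", "2", "3"]))

# One fractional value is enough to make the whole result float.
fractional = pd.to_numeric(pd.Series(["1", "2.5", "3"]))

print(whole.dtype, is_integer_dtype(whole))
print(fractional.dtype, is_float_dtype(fractional))
```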

@quancore
Author

although with @jeffzi's work on #369 you could make custom datatypes like this.

We could even have a built-in Number dtype. Coercion would output floats or ints depending on the actual values (same as pandas.to_numeric).

I think we should add a built-in Number type that covers all kinds of integers and floats. We have huge datasets, and a check that maps over every element would not perform well. @cosmicBboy

@cosmicBboy
Collaborator

I think we should add a built-in Number type that includes all kinds of integers and floats

The higher-level data types are still TBD, but Number will most likely be one of them.

In the meantime, the more performant thing to do would be:

import pandas as pd
import pandera as pa
from pandas.api.types import is_numeric_dtype

# Series-level check: inspects the column's dtype once instead of mapping over every element.
is_number = pa.Check(is_numeric_dtype, name="is_number")
schema = pa.DataFrameSchema({"column": pa.Column(checks=is_number)})
schema(pd.DataFrame({"column": [1, 2, "a"]}))

# Output
SchemaError: <Schema Column(name=column, type=None)> failed series validator 0:
<Check is_number>

Note that the error message won't be as informative (no indication of which element caused the check to fail).
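To make the trade-off concrete, here is a pandas-only sketch of what the dtype-level test sees, plus one possible way to locate offending values afterwards without an element-wise Python function. The follow-up step with pandas.to_numeric(errors="coerce") is my own suggestion, not something pandera does, and it is slightly more permissive than is_number because numeric strings such as "3" would also coerce successfully.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

ints = pd.Series([1, 2, 3])
floats = pd.Series([1.0, 2.5])
mixed = pd.Series([1, 2, "a"])  # falls back to object dtype

# The dtype-level test is a single O(1) inspection per column.
results = [is_numeric_dtype(s) for s in (ints, floats, mixed)]

# Optional diagnosis step: values that cannot be coerced become NaN,
# which points at the rows that broke the numeric dtype.
coerced = pd.to_numeric(mixed, errors="coerce")
offending = mixed[coerced.isna() & mixed.notna()]

print(results)
print(offending)
```

One thing to keep in mind: a column stored with object dtype fails the dtype-level test even if every value in it happens to be a number.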

@quancore
Author

@cosmicBboy I propose adding the enhancement tag to this issue.

@cosmicBboy cosmicBboy added enhancement New feature or request help wanted Extra attention is needed and removed question Further information is requested labels May 2, 2021
@cosmicBboy
Collaborator

cosmicBboy commented May 2, 2021

adjusted the tags, PR is welcome after the fix for #369 is done

@quancore
Author

quancore commented May 3, 2021

@cosmicBboy If adding a Number type will take time, could you add a built-in check that is serializable and suitable for data synthesis?

@cosmicBboy
Collaborator

hey @quancore you can register checks into the pa.Check namespace with the extensions API. I'd recommend doing that, as I don't think it makes sense to temporarily add a built-in check for this type if there will be a first-class representation of it in the new type system.

Let me know if you need any help with the strategy implementation!

@cosmicBboy cosmicBboy added the future Issues that should be tracked but not actioned yet. label May 16, 2021
@fleimgruber
Contributor

After #369 and #559 what is the preferred solution here? Still #466 (comment) or #466 (comment)?

@smarie
Contributor

smarie commented Mar 19, 2024

The second (#466 (comment)) seems most efficient as it uses is_numeric_dtype (no element-wise check)
