
Raise an informative error message when object array has mixed types #4700

Merged: 25 commits merged from bugfix/mixed-int-str-data into pydata:main on Nov 28, 2023

Conversation

@andersy005 (Member, Author) commented Dec 16, 2020

Before

In [2]: data = np.array([["x", 1], ["y", 2]], dtype="object")

In [3]: xr.conventions._infer_dtype(data, 'test')
Out[3]: dtype('O')

As pointed out in #2620, this doesn't seem problematic until the user tries writing the xarray object to disk. This results in a very cryptic error message:

In [7]: ds.to_netcdf('test.nc', engine='netcdf4')
netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable.__setitem__()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable._put()

TypeError: expected bytes, int found
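Note that ds is not defined in the snippet above; a minimal reproducer would look something like the following (a sketch; the dataset, dimension, and file names are illustrative):

import numpy as np
import xarray as xr

# an object array mixing Python str and int
data = np.array([["x", 1], ["y", 2]], dtype="object")
ds = xr.Dataset({"test": (("dim0", "dim1"), data)})

# writing to disk is where the cryptic TypeError above surfaces
ds.to_netcdf("test.nc", engine="netcdf4")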

After

In [2]: data = np.array([["x", 1], ["y", 2]], dtype="object")

In [3]: xr.conventions._infer_dtype(data, 'test')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-addaab43c03a> in <module>
----> 1 xr.conventions._infer_dtype(data, 'test')

~/devel/pydata/xarray/xarray/conventions.py in _infer_dtype(array, name)
    142     native_dtypes = set(map(lambda x: type(x), array.flatten()))
    143     if len(native_dtypes) > 1:
--> 144         raise ValueError(
    145             "unable to infer dtype on variable {!r}; object array "
    146             "contains mixed native types: {}".format(

ValueError: unable to infer dtype on variable 'test'; object array contains mixed native types: str,int

During I/O, the user gets:

...
~/devel/pydata/xarray/xarray/conventions.py in ensure_dtype_not_object(var, name)
    223             data[missing] = fill_value
    224         else:
--> 225             data = _copy_with_dtype(data, dtype=_infer_dtype(data, name))
    226 
    227         assert data.dtype.kind != "O" or data.dtype.metadata

~/devel/pydata/xarray/xarray/conventions.py in _infer_dtype(array, name)
    142     native_dtypes = set(map(lambda x: type(x), array.flatten()))
    143     if len(native_dtypes) > 1:
--> 144         raise ValueError(
    145             "unable to infer dtype on variable {!r}; object array "
    146             "contains mixed native types: {}".format(

ValueError: unable to infer dtype on variable 'test'; object array contains mixed native types: str,int
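For reference, the new check visible in these tracebacks amounts to roughly the following (reconstructed from the traceback lines above; the exact formatting of the type names is an assumption based on the "str,int" in the message):

native_dtypes = set(map(lambda x: type(x), array.flatten()))
if len(native_dtypes) > 1:
    raise ValueError(
        "unable to infer dtype on variable {!r}; object array "
        "contains mixed native types: {}".format(
            name, ",".join(t.__name__ for t in native_dtypes)
        )
    )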

@andersy005 marked this pull request as ready for review on December 16, 2020 at 15:22

@max-sixty (Collaborator) left a comment

This looks good!

I ran a quick performance check; I'm not sure how large an impact this is in proportional terms, though.

Alternatives (not ideal ones) would be to wait until the main error is raised, or to only test a subset of the values.

In [6]: x = np.asarray(list("abcdefghijklmnopqrstuvwxyz"))

In [7]: x
Out[7]:
array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
       'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'],
      dtype='<U1')

In [14]: array = np.repeat(x, 5_000_000)

In [15]: %timeit set(np.vectorize(type, otypes=[object])(array.ravel()))
10.2 s ± 165 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@andersy005 (Member, Author) commented

Alternatives (not ideal ones) would be to wait until the main error is raised, or to only test a subset of the values.

I thought of taking a random sample from the array and checking the types on the sample only, but I wasn't confident about how representative that sample would be or how to deal with misleading, skewed samples. If anyone has thoughts on this, please let me know.

@mathause (Collaborator) commented

It took around 40 s for an array of 10**9 elements. That would be around 150 years of daily data (180 * 360 * 150 * 365). I am not sure, though, how much sense it makes to have such a large array with object dtype. Also, an array of this size is likely a dask array, and there is already a performance warning for that case. So I'd say go ahead:

"variable {} has data in the form of a dask array with "
"dtype=object, which means it is being loaded into memory "
"to determine a data type that can be safely stored on disk. "
"To avoid this, coerce this variable to a fixed-size dtype "

@andersy005 (Member, Author) commented

Also, an array of this size is likely a dask array, and there is already a performance warning for that case. So I'd say go ahead.

@mathause, just to make sure I am not misinterpreting your comment: is this a go-ahead for sampling the array to determine the types? :)

@mathause (Collaborator) commented

Yes, I'd say go ahead. (I just hope it's not too big of a performance hit for normal use cases.)

@andersy005 (Member, Author) commented Jan 22, 2021

Yes, I'd say go ahead. (I just hope it's not too big of a performance hit for normal use cases.)

@mathause, I am noticing a performance hit even with the sampling in place. Here's how I am doing the sampling:

sample_indices = np.random.choice(array.size, size=min(20, array.size), replace=False)
native_dtypes = set(np.vectorize(type, otypes=[object])(array.ravel()[sample_indices]))

and here's the code snippet I tested this on:

In [1]: import xarray as xr, numpy as np

In [2]: x = np.asarray(list("abcdefghijklmnopqrstuvwxyz"), dtype="object")

In [3]: array = np.repeat(x, 5_000_000)

In [4]: array.size
Out[4]: 130000000

In [5]: array.dtype
Out[5]: dtype('O')

Without sampling

In [6]: %timeit xr.conventions._infer_dtype(array, "test")
7.63 s ± 515 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With sampling

In [15]: %timeit xr.conventions._infer_dtype(array, "test")
8.31 s ± 395 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I could be wrong, but the sampling doesn't seem to be worth it.
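One possible contributor to that result, not raised in the thread: with the legacy numpy random API, np.random.choice with replace=False internally permutes the entire population, which is itself O(array.size). Sampling with replacement avoids that cost and is arguably good enough for a quick type check, since collisions among 20 draws from millions of elements are negligible. A sketch:

# with-replacement sampling costs O(sample size), not O(array.size)
sample_indices = np.random.randint(0, array.size, size=min(20, array.size))
native_dtypes = set(np.vectorize(type, otypes=[object])(array.ravel()[sample_indices]))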

@mathause (Collaborator) commented

No, I wouldn't subsample. By "normal use cases" I meant saving legitimate object arrays; I am not sure how often those occur in the wild.

@github-actions (Contributor) commented

Unit Test Results

6 files, 6 suites, 54m 27s duration
16 229 tests: 14 508 passed, 1 721 skipped, 0 failed
90 570 runs: 82 410 passed, 8 160 skipped, 0 failed

Results for commit 2107949.

@max-sixty (Collaborator) commented

No, I wouldn't subsample. By "normal use cases" I meant saving legitimate object arrays; I am not sure how often those occur in the wild.

I think this is most likely to happen with lots of strings that aren't using a numpy string type.

@andersy005 to the extent the perf difference is 7 seconds vs 8 seconds, I would definitely vote to merge. The case I was originally concerned about is that something suddenly takes 1000 seconds. If we're confident that that's not the case, I would also vote to merge...
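To illustrate the distinction with hypothetical data: the same strings stored at object dtype trigger the inference scan at write time, while a native numpy string dtype is written directly:

import numpy as np

obj_strings = np.array(["alice", "bob"], dtype=object)  # dtype('O'): scanned by _infer_dtype
np_strings = np.array(["alice", "bob"])                 # dtype('<U5'): no inference needed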

@andersy005 added the "plan to merge (final call for comments)" label and removed the "needs review" label on Nov 28, 2023
@andersy005 merged commit dc0931a into pydata:main on Nov 28, 2023 (26 of 27 checks passed)
@andersy005 deleted the bugfix/mixed-int-str-data branch on November 28, 2023 at 22:19
dcherian added a commit to rabernat/xarray that referenced this pull request Nov 29, 2023
* upstream/main:
  Raise an informative error message when object array has mixed types (pydata#4700)
  Start renaming `dims` to `dim` (pydata#8487)
  Reduce redundancy between namedarray and variable tests (pydata#8405)
  Fix Zarr region transpose (pydata#8484)
  Refine rolling_exp error messages (pydata#8485)
  Use numbagg for `ffill` by default (pydata#8389)
  Fix bug for categorical pandas index with categories with EA dtype (pydata#8481)
  Improve "variable not found" error message (pydata#8474)
  Add whatsnew for pydata#8475 (pydata#8478)
  Allow `rank` to run on dask arrays (pydata#8475)
  Fix mypy tests (pydata#8476)
  Use concise date format when plotting (pydata#8449)
  Fix `map_blocks` docs' formatting (pydata#8464)
  Consolidate `_get_alpha` func (pydata#8465)
Successfully merging this pull request may close this issue: Dataset.to_netcdf with mixed int and str in data