ArrowInvalid: cannot mix list and non-list, non-null values #8

Closed
Jack-Hayes opened this issue Oct 12, 2024 · 5 comments · Fixed by #14

Comments

@Jack-Hayes
Member

gf_tdx = coincident.search.search(dataset='tdx')
ArrowInvalid: cannot mix list and non-list, non-null values

I get this error when searching the entire dataset. I think we just need to update the STAC collection parsing to ensure consistent datatypes before converting to geopandas.

@Jack-Hayes
Member Author

Jack-Hayes commented Oct 16, 2024

Same for maxar

Error output:

File c:\Users\JackE\anaconda3\envs\JackConda\Lib\site-packages\coincident\search\stac.py:31, in to_geopandas(collection)
27 def to_geopandas(
28 collection: pystac.item_collection.ItemCollection,
29 ) -> gpd.GeoDataFrame:
30 """Convert returned from STAC API to geodataframe via arrow"""
---> 31 record_batch_reader = stac_geoparquet.arrow.parse_stac_items_to_arrow(collection)
32 gf = gpd.GeoDataFrame.from_arrow(record_batch_reader) # doesn't keep arrow dtypes
34 # Additional columns for convenience
35 # NOTE: these become entries under STAC properties
...
117 else:
118 array = pa.array(wkb_items)
--> 120 return cls(pa.RecordBatch.from_struct_array(array))

TypeError: Argument 'struct_array' has incorrect type (expected pyarrow.lib.StructArray, got pyarrow.lib.NullArray)

@Jack-Hayes Jack-Hayes changed the title TDX Search Consistent Dtypes -> gpd to_geopoandas Consistent Dtypes -> gpd Oct 16, 2024
@Jack-Hayes Jack-Hayes changed the title to_geopoandas Consistent Dtypes -> gpd to_geopoandas Consistent Dtypes Oct 16, 2024
@scottyhq
Member

Getting this when searching the entire dataset

It would be good to narrow in on a small AOI that reproduces this, if possible. I'll have to dig into this a bit; my guess is that the occasional weird STAC Item throws off the conversion:
https://github.com/stac-utils/stac-geoparquet/blob/4b00f5be649609a896242f391d6e9c56377c7f25/stac_geoparquet/arrow/_batch.py#L87
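
One way to hunt for such an item, as a minimal sketch: walk the raw item dictionaries and report every field that holds a list in some items and a non-list value in others. The helper name `find_mixed_list_fields`, the sample items, and the placement of `acquisitionInfo` under `properties` are assumptions for illustration; this is not part of coincident or stac-geoparquet.

```python
from collections import defaultdict


def find_mixed_list_fields(items):
    """Return {dotted_path: kinds} for fields that mix list and non-list values."""
    kinds = defaultdict(set)

    def walk(value, path):
        if isinstance(value, dict):
            # Descend into nested objects, building a dotted path.
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        elif value is None:
            return  # nulls do not conflict with any type
        elif isinstance(value, list):
            kinds[path].add("list")  # list contents are not inspected further
        else:
            kinds[path].add("scalar")

    for item in items:
        walk(item, "")
    return {path: found for path, found in kinds.items() if len(found) > 1}


# Hypothetical, heavily trimmed items; real items carry many more fields.
items = [
    {"id": "a", "properties": {"acquisitionInfo": {"polarisationList": {"polLayer": "HH"}}}},
    {"id": "b", "properties": {"acquisitionInfo": {"polarisationList": {"polLayer": ["HH", "HV"]}}}},
]

mixed = find_mixed_list_fields(items)
for path in sorted(mixed):
    print(path, sorted(mixed[path]))
```

Running this over the dictionaries of a search result (for example `[item.to_dict() for item in collection]` with pystac) would narrow the problem down to specific fields without needing a smaller AOI first.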

@scottyhq scottyhq self-assigned this Oct 18, 2024
@scottyhq scottyhq changed the title to_geopoandas Consistent Dtypes ArrowInvalid: cannot mix list and non-list, non-null values Oct 18, 2024
@scottyhq scottyhq added the bug Something isn't working label Oct 18, 2024
@Jack-Hayes
Member Author

Jack-Hayes commented Oct 21, 2024

@scottyhq I've been getting this result when searching the cop30 dataset at various spatial and temporal extents. Below is some code to reproduce it for one of the sites. I don't fully understand the error we're getting from from_dicts() here, and I'm not sure whether the module linked below is relevant to stac_geoparquet.arrow.parse_stac_items_to_arrow(collection):
https://github.com/stac-utils/stac-geoparquet/blob/4b00f5be649609a896242f391d6e9c56377c7f25/stac_geoparquet/arrow/_to_arrow.py

Code
import coincident
import geopandas as gpd

# random 3dep flight in CO
aoi = gpd.read_file('https://raw.githubusercontent.com/unitedstates/districts/refs/heads/gh-pages/states/CO/shape.geojson')
gf_3dep = coincident.search.search(
    dataset='3dep',
    intersects=aoi,
    datetime=['2018', '2019'],
)
site = gf_3dep[gf_3dep.workunit == "CO_Eastern_B1_2018"]

# failed attempt
# TypeError: Argument 'struct_array' has incorrect type (expected pyarrow.lib.StructArray, got pyarrow.lib.NullArray)
gf_cop = coincident.search.search(
    dataset='cop30',
    intersects=site.geometry.envelope,
    datetime=['2018-04-30', '2018-07-25'],
)

# another failed attempt
# TypeError: Argument 'struct_array' has incorrect type (expected pyarrow.lib.StructArray, got pyarrow.lib.NullArray)
gf_cop = coincident.search.search(
    dataset='cop30',
    intersects=site.geometry.envelope,
    datetime=['2018', '2019'],
)

@scottyhq
Member

scottyhq commented Oct 31, 2024

TypeError: Argument 'struct_array' has incorrect type (expected pyarrow.lib.StructArray, got pyarrow.lib.NullArray)

This happens when the search results are empty and we try to convert an empty list to a dataframe. In the case below, no results are returned because the cop30 DEM 'representative' timestamp is 2021-04-22. I'll make a change to raise a better error message if no results are found. For the cop30 search, just leave out the datetime constraints.

gf_cop = coincident.search.search(
    dataset='cop30',
    intersects=site.geometry.envelope,
    datetime=['2018-04-30', '2018-07-25'],
)
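
A guard of the kind described above might look like the following. This is only a sketch under stated assumptions: `require_results` is a hypothetical helper name and the message wording is invented, so it should not be read as the change that was actually merged.

```python
def require_results(items, dataset, datetime=None):
    """Raise a descriptive error instead of passing an empty result on to Arrow."""
    if len(items) == 0:
        msg = f"No results found for dataset={dataset!r}"
        if datetime is not None:
            msg += f" with datetime={datetime!r}"
        raise ValueError(msg + "; try relaxing the search constraints.")
    return items


# Hypothetical usage: an empty list stands in for an empty STAC search result.
try:
    require_results([], dataset="cop30", datetime=["2018-04-30", "2018-07-25"])
except ValueError as err:
    print(err)
```

Checking before the conversion step means the user sees which constraints produced nothing, rather than a TypeError raised deep inside the Arrow conversion.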

@scottyhq
Member

scottyhq commented Oct 31, 2024

gf_tdx = coincident.search.search(dataset='tdx')
ArrowInvalid: cannot mix list and non-list, non-null values

Two separate issues with this:
1. We should raise a warning if neither bbox nor intersects is passed, as the search will likely be slow.
2. The Arrow error took a while to track down. Unlike JSON, Arrow is sensitive to schema, so it complains if a column has a mix of list and non-list values. The TDX data has lots of nested metadata fields, and it looks like one of them has a mix of lists and non-lists. I don't think we care about all that nested metadata, so I will just drop it for the time being before converting to a geodataframe.

pd.json_normalize(gf.acquisitionInfo).T
#polarisationList.polLayer	HH	[HH, HV]	[HH, HV]	[HH, HV]	[HH, HV]
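
The two ways of handling such a field can be sketched in plain Python: drop the nested metadata entirely (the approach proposed above), or coerce every value to a list so the column has one consistent type. The helper names `drop_fields` and `as_list` and the sample records are assumptions for illustration, not coincident's implementation.

```python
def drop_fields(properties, names):
    """Return a copy of a properties dict without the named keys."""
    return {key: value for key, value in properties.items() if key not in names}


def as_list(value):
    """Wrap a scalar in a list; leave lists and None unchanged."""
    if value is None or isinstance(value, list):
        return value
    return [value]


# Hypothetical, trimmed records mirroring the mixed polLayer values shown above.
records = [
    {"id": "a", "acquisitionInfo": {"polarisationList": {"polLayer": "HH"}}},
    {"id": "b", "acquisitionInfo": {"polarisationList": {"polLayer": ["HH", "HV"]}}},
]

# Option 1: drop the nested metadata before conversion.
dropped = [drop_fields(record, {"acquisitionInfo"}) for record in records]

# Option 2: keep the field but make every value a list.
pol_layers = [
    as_list(record["acquisitionInfo"]["polarisationList"]["polLayer"])
    for record in records
]

print(dropped)
print(pol_layers)
```

Dropping is simpler and loses information nobody is using yet; coercing keeps the field available at the cost of knowing in advance which fields are inconsistent.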

@scottyhq scottyhq mentioned this issue Oct 31, 2024