to_dataframe (pandas) usage question #1534
Marinna,
You are correct. In the present release of Xarray, converting to a pandas
dataframe eagerly loads all of the data into memory as a regular pandas
object, giving up dask's parallel capabilities and potentially consuming
lots of memory. With chunked Xarray data, it would be preferable to
convert to a dask.dataframe rather than a regular pandas dataframe, which
would carry over some of the performance benefits.
This is a known issue:
#1462
With a solution in the works:
#1489
So hopefully a release of Xarray in the near future will have the feature
you seek.
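(For reference, in later xarray releases this landed as Dataset.to_dask_dataframe. A minimal sketch with made-up data, showing the lazy conversion versus the eager load:

import numpy as np
import xarray as xr

# Invented toy dataset standing in for a large chunked file.
ds = xr.Dataset({"v": ("time", np.arange(10.0))}).chunk({"time": 5})

ddf = ds.to_dask_dataframe()  # lazy dask.dataframe, keeps chunked execution
df = ddf.compute()            # only here is data loaded into a pandas frame

The point is that filtering and groupby-style operations on ddf stay lazy and parallel until you explicitly compute.)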
Alternatively, if you describe the filtering, masking, and other QA/QC that
you need to do in more detail, we may be able to help you accomplish this
entirely within Xarray.
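(As one concrete example of staying within xarray: masking can be done with DataArray.where, which plays the same role as a pandas mask. A minimal sketch, with an invented variable name and threshold:

import numpy as np
import xarray as xr

# Hypothetical QA/QC rule: flag velocities outside a plausible range.
ds = xr.Dataset(
    {"vel": ("time", np.array([0.1, 9.9, 0.2, -9.9]))},
    coords={"time": np.arange(4)},
)

# where() keeps values passing the test and replaces the rest with NaN.
clean = ds["vel"].where(np.abs(ds["vel"]) < 5.0)

Subsequent reductions like mean() skip the NaNs by default.)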
Good luck!
Ryan
On Mon, Aug 28, 2017 at 2:02 PM, Marinna Martini ***@***.***> wrote:
Apologies for what is probably a very newbie question:
If I convert such a large file to pandas using to_dataframe() to gain
access to more pandas methods, will I lose the speed and dask capability
that is so wonderful in xarray?
I have a very large netCDF file (3 GB with 3 Million data points of 1-2 Hz
ADCP data) that needs to be reduced to hourly or 10 min averages. xarray is
perfect for this. I am exploring resample and other methods. It is
amazingly fast doing this:
ds = xr.open_dataset('hugefile.nc')
ds_lp = ds.resample('H','time','mean')
And an offset of about half a day is introduced to the data. Probably user
error or due to filtering. To figure this out, I am looking at using
resample in pandas directly, or multi-indexing and reshaping using methods
that are not inherited from pandas by xarray, then back to xarray using
to_xarray. I will also need to be masking data (and other things pandas can
do) during a QA/QC process. It appears that pandas can do masking and
xarray does not inherit masking?
Am I understanding the relationship between xarray and pandas correctly?
Thanks,
Marinna
Many thanks, I will go learn about dask dataframes.
Hello Ryan,

I have read a bit about dask. Am I missing the pandas Panel analog in dask? My data is in netCDF4, and the files can have as many as 17 variables or more. It's not clear how to get this easily into dask. In pandas, I think the entire netCDF file equates to a Panel; a single variable would be a DataFrame. Rather than wandering around in the weeds, I could use a hint here. Do I really need to open the netCDF4 file, then iterate over my variables and deal them into a series of dask dataframes? That seems very un-pythonic. I tried this:

df = dd.read_hdf('reallybignetCDF4file.nc', key='/c')  # this does not work

Thanks,
@mmartini-usgs, an entire netCDF file (as long as it only has 1 group, which it most likely does if we're talking about standard atmospheric/oceanic data) would be the equivalent of an xarray Dataset.

To start with, you should read in your data using the chunks keyword to open_dataset:

ds = xr.open_dataset('hugefile.nc', chunks={<fill me in>})
ds_lp = ds.resample('H', 'time', 'mean')

You'd have to choose chunks based on the dimensions of your data. Like @rabernat previously mentioned, it's very likely you can perform your entire workflow within xarray without ever having to drop down to pandas; let us know if you can share more details.
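(To illustrate what chunking does, here is a sketch on an invented in-memory dataset; the dimension names 'time'/'bin' and the chunk size are assumptions, not taken from your file:

import numpy as np
import xarray as xr

# Stand-in for 'hugefile.nc': 1000 time steps of 10 depth bins.
ds = xr.Dataset({"u": (("time", "bin"), np.zeros((1000, 10)))})

# chunk() after opening has the same effect as chunks= in open_dataset:
# the variable becomes a lazy dask array split along 'time'.
dsc = ds.chunk({"time": 250})

Each 250-step chunk can then be processed in parallel and the whole array never has to fit in memory at once.)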
Many thanks! I will try this.

OK, since you asked for more details: I have used xarray resample successfully on a file with ~3 million single-ping ADCP ensembles, 17 variables with 1D and 2D data. Three lines of code in a handful of minutes to reduce that, on a middling laptop. Amazing.

Unintended behaviors from resample that I need to figure out:
On the menu next to learn/use:
Where is all this going? This is my learn-python project, so apologies for the non-pythonic approach. I also need to preserve backwards compatibility with existing code and conventions (EPIC, historically; CF and thredds, going forward). The project is here: https://github.com/mmartini-usgs/ADCPy
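(A side note on chasing unexpected resample time shifts: in pandas, the label and closed arguments control which edge of each averaging bin becomes the output timestamp, and a mismatched choice shifts the whole series by a bin width. A minimal sketch with made-up half-hourly data:

import numpy as np
import pandas as pd

idx = pd.date_range("2017-01-01", periods=6, freq="30min")
s = pd.Series(np.arange(6.0), index=idx)

left = s.resample("h", label="left").mean()    # bins stamped at their start
right = s.resample("h", label="right").mean()  # same means, stamped at the end

The two results contain identical means; only the timestamps differ, which can look like a data offset downstream.)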
@mmartini-usgs - Thanks for the questions. I'm going to close this now as it seems like you're up and going. In the future, we try to keep our "Usage Questions" on the xarray users google group or StackOverflow. Cheers!