Feature/simple cli for chunking local or remote NetCDF files #319
base: main
Conversation
I wonder, are you aware of pangeo-forge? It provides a recipe-runner abstraction for reading various xarray-supported file types and converting them for storage. That conversion can be via kerchunk to produce JSON files like you are doing. The target is mostly for automatic running of recipes on various cloud backends, so very large datasets; but you can execute a recipe locally in a way that is probably quite similar to the CLI here. If we decide to go ahead here, could we extend to multiple file types? This is one of kerchunk's great strengths. Each file type, of course, takes a different set of options and may have other semantic differences (a grib2 file produces a list of reference sets, for instance).
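For anyone reading along, a minimal sketch of that difference using kerchunk.grib2.scan_grib (the URL and options here are placeholders, not part of this PR):

# A GRIB2 scan yields a *list* of reference sets (roughly one per message
# group), whereas an HDF5/NetCDF scan yields a single reference dict.
from kerchunk.grib2 import scan_grib

reference_sets = scan_grib(
    "s3://some-bucket/forecast.grib2",      # hypothetical URL
    storage_options={"anon": True},
)
print(type(reference_sets), len(reference_sets))  # list of reference dicts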
Also, before I forget: the auto_dask function also does much of the job of automating scanning multiple files and combining the results in a single call (with parallelised tree reduction). Might be worth calling that rather than writing a class to do the same thing, however short that class may be.
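For context, a rough sketch of how auto_dask could replace the scan-then-combine class in this PR; the argument names reflect my reading of kerchunk.combine.auto_dask and should be checked, and the inputs are taken from the usage example further down:

from kerchunk.combine import auto_dask
from kerchunk.hdf import SingleHdf5ToZarr

urls = [
    "s3://era5-pds/2020/01/data/air_pressure_at_mean_sea_level.nc",
    "s3://era5-pds/2020/02/data/air_pressure_at_mean_sea_level.nc",
]

# Scans every input in parallel with dask, then combines the per-file
# references in a tree reduction of MultiZarrToZarr steps.
combined_refs = auto_dask(
    urls,
    single_driver=SingleHdf5ToZarr,
    single_kwargs={"inline_threshold": 100},
    mzz_kwargs={"concat_dims": ["time"]},
    n_batches=2,
    remote_protocol="s3",
    remote_options={"anon": True},
)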
I had a quick look at pangeo-forge before this PR, but it seems to me very cloud-oriented and a little bit "the big thing" for what I want. My use case was really to tackle the simple case, easy to demonstrate and explain, where everything goes well and there is no need to write any Python. I understood pangeo-forge's target was to cover all use cases (hence the need to write some Python in recipe.py) and to provide cloud-ready CI (which is a really, really great job!!!). Possible ways to go on:
Happy to get your view on this.
Thanks, I totally missed this function! I will definitely use it if we go ahead.
Currently I started with NetCDF files, but yes, I need to cover GRIB as well.
import logging  # needed for the logger below; shown here for completeness

from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

logger = logging.getLogger("kercli")
"kerchunk-cli-nc" is fine :)
@@ -31,6 +31,9 @@
            "FITSVarBintable = kerchunk.codecs:VarArrCodec",
            "record_member = kerchunk.codecs.RecordArrayMember",
        ],
        'console_scripts': [
            'kerchunk-nc = kerchunk.cli.chunk_nc:cli',
Let's just call it kerchunk, and either infer the file type from the URL extension or provide a --format option to select nc, the only one available right now.
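For illustration only, one way the extension-based inference could look (the helper name and mapping are hypothetical, not part of this PR):

import os

# Hypothetical mapping; only NetCDF4/HDF5 is supported by this PR so far.
KNOWN_EXTENSIONS = {"nc": "nc", "nc4": "nc", "h5": "nc", "hdf5": "nc"}

def infer_format(url, explicit=None):
    """Prefer an explicit --format value, otherwise guess from the extension."""
    if explicit:
        return explicit
    ext = os.path.splitext(url)[1].lstrip(".").lower()
    if ext not in KNOWN_EXTENSIONS:
        raise ValueError(
            f"Cannot infer file type from '{url}'; please pass --format explicitly"
        )
    return KNOWN_EXTENSIONS[ext]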
logger = logging.getLogger("kercli")


class NetcdfChunker:
Surely NetCDFKerchunker!
    value = json.loads(value)
    return value


@click.command()
We need a general description
@click.option("--input", "-i",
              help="Input file url, readable by fsspec", required=True,
              multiple=True)
@click.option("--input-format", default="nc")
Ah, I see you already thought of this - but we don't do anything with this argument, right? We should raise a useful message if anything other than "nc" is provided.
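A small sketch of that validation as a click callback (the callback itself is hypothetical; only the option name comes from this PR):

import click

def check_input_format(ctx, param, value):
    # Fail early with a clear message; only NetCDF ("nc") is implemented so far.
    if value != "nc":
        raise click.BadParameter(
            f"'{value}' is not supported yet; only 'nc' (NetCDF4/HDF5) is available."
        )
    return value

# Usage on the existing option:
# @click.option("--input-format", default="nc", callback=check_input_format)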
Would it make sense to make this a subcommand? Then there could be another subcommand for combining reference JSON files?
I have no preference between subcommands and passing extra arguments; it's just a matter of style.
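In case it helps the discussion, a rough sketch of the subcommand layout floated above (command and option names are illustrative only, not part of this PR):

import click

@click.group()
def cli():
    """Create and combine kerchunk reference sets."""

@cli.command()
@click.option("--input", "-i", "inputs", multiple=True, required=True,
              help="Input file url, readable by fsspec")
def scan(inputs):
    """Scan one or more files and write per-file reference JSON."""
    for url in inputs:
        click.echo(f"would scan {url}")

@cli.command()
@click.argument("references", nargs=-1)
def combine(references):
    """Combine previously written reference JSON files into one."""
    click.echo(f"would combine {len(references)} reference files")

if __name__ == "__main__":
    cli()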
Hello, what is the status of this "feature"?
Hello, currently I don't have time to work on this. Happy if someone wants to take over.
I am working out something over at https://github.com/NikosAlexandris/rekx.
Step by step, I have a very DRAFT version, without tests, at https://github.com/NikosAlexandris/rekx/tree/main/rekx, in a 'works-for-me' state. @martindurant any interest in seeing this grow?
I wouldn't use it personally, but it seems that some people here would, so I'd be happy to include something like this.
I am working on it: https://github.com/NikosAlexandris/rekx#examples -- these are just a small part of what rekx can do.
@NikosAlexandris, I see you've already put a decent amount of effort into it! I'd be happy to link to it from the kerchunk documentation or include it right here if you think it appropriate - whenever you reckon it's ready for a wider audience.
I'd appreciate some guidance on all matters about Kerchunk and, of course, I'd be grateful for suggestions to eventually make this effort meaningful outside my own needs. Some examples:
Maybe we can better shape it before asking for exposure? PS: A larger tutorial using SARAH3 products is on its way, thanks also to the good people in the German weather service (DWD) who actually produce these data.
@martindurant And of course, in case I wasn't clear: I'm fine with either scenario if this goes well, integrating it directly here or linking to it. Whatever works better.
I have a slight preference to integrate it into kerchunk, using the kerchunk command.
Would you prefer a rather clean Kerchunking interface (i.e. only the core commands), or is it okay to also keep all the extra commands?
Yes, I think it's fine to have all those commands - they can be helpful shortcuts in some places.
I am working on it. Ah, and testing... of course! This front needs some love. It would be good, however, to start a discussion on the integration (overall requirements, dependencies, things to do and things not to do) at some point and formalise the to-do tasks (?).
I am planning a kerchunk (virtual) get-together to discuss all manner of topics, and this would be a good one.
Since nothing exists yet, I am not too worried. Probably it's reasonable to add
I hope I can make it and join.
You are right; I will try to contribute useful things, and I expect to learn a lot from the interaction and the experience.
See https://discourse.pangeo.io/t/kerchunk-planning/4002/2 for the kerchunk planning thread.
Hello, thanks for this lib!
I ended up rewriting the scan and consolidate parts from your tutorial several times, so I thought this small CLI would be of interest when working outside a notebook! Happy to hear your view on this.
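Roughly, the CLI wraps the scan and consolidate steps from the kerchunk tutorial; a sketch of that flow for reference (the output path and the concat dimension are assumptions):

import json
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

urls = [
    "s3://era5-pds/2020/01/data/air_pressure_at_mean_sea_level.nc",
    "s3://era5-pds/2020/02/data/air_pressure_at_mean_sea_level.nc",
]

# Scan: one reference set per input file
single_refs = []
for url in urls:
    with fsspec.open(url, "rb", anon=True) as f:
        single_refs.append(SingleHdf5ToZarr(f, url).translate())

# Consolidate: combine the per-file references along time
mzz = MultiZarrToZarr(
    single_refs,
    remote_protocol="s3",
    remote_options={"anon": True},
    concat_dims=["time"],
)
with open("mydataset.json", "w") as out:
    json.dump(mzz.translate(), out)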
Usage example:
$ kerchunk-nc -i s3://era5-pds/2020/01/data/air_pressure_at_mean_sea_level.nc -i s3://era5-pds/2020/02/data/air_pressure_at_mean_sea_level.nc
INFO:kercli:Scanning s3://era5-pds/2020/01/data/air_pressure_at_mean_sea_level.nc ...
INFO:kercli:Scanning s3://era5-pds/2020/02/data/air_pressure_at_mean_sea_level.nc ...
INFO:kercli:Data loaded from json/mydataset : 2 found
INFO:kercli:Consolidating to zarr/mydataset.zarr ...
Will result in
Help looks like:
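And once references are written, a sketch of how they could be opened with xarray (the reference file name and storage options are assumptions about this CLI's output, not confirmed by the PR):

import xarray as xr

ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "mydataset.json",           # combined kerchunk reference file
            "remote_protocol": "s3",          # where the original NetCDF bytes live
            "remote_options": {"anon": True},
        },
    },
)
print(ds)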