Parallelize ERDDAP server queries #14
yeah, I just saw this from @ocefpaf

BTW, doing that on a single server will get some sysadmins mad at you! But doing this across different servers is fine.
For the first function, it should be fine, since we're just querying the Awesome ERDDAP server list with one query each. For the second, we randomize the dataset hits from the first query and iterate through them. There's a chance that they could all come from the same ERDDAP server. We could skip the randomize step and figure out a safe way to parallelize this result set instead? Basically, these few lines of code are where that happens.
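One possible shape for that safe parallelization, sketched below: group the dataset hits by server, then run the servers in parallel but each server's datasets in serial. The `fetch_one` callable and the `(server, dataset_id)` hit shape are hypothetical stand-ins for the real per-dataset query, not colocate's actual API.

```python
import threading
from collections import defaultdict

def fetch_grouped(hits, fetch_one):
    """Fetch dataset hits in parallel across servers, but serially
    within each server, so no single ERDDAP host sees concurrent requests.

    hits: iterable of (server, dataset_id) pairs (hypothetical shape)
    fetch_one: callable (server, dataset_id) -> result (hypothetical)
    """
    by_server = defaultdict(list)
    for server, dataset_id in hits:
        by_server[server].append(dataset_id)

    results, lock = [], threading.Lock()

    def worker(server, dataset_ids):
        for ds in dataset_ids:  # serial within one server
            res = fetch_one(server, ds)
            with lock:          # results list is shared across threads
                results.append(res)

    threads = [threading.Thread(target=worker, args=(s, ids))
               for s, ids in by_server.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

This removes the need for the randomize step: even if every hit comes from the same ERDDAP, that server only ever sees one request at a time.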
Venerated Sir,

@Rohan-cod I added a 'Student Test' to the description with some ideas for how to get started. Please take a look there.

Sure Sir. I will try my best to solve the test. It will help me get a deep understanding of the project and the codebase.
I have gone through the student tests of this issue and issue #25, and based on my understanding of the project so far, I have created a student proposal: https://docs.google.com/document/d/13Zx1hBk42hHPZuacIIlSohCtS7q-LrGcfvgVkd89lmI/edit?usp=sharing
An implementation of parallelized query has been added to erddapy, see ioos/erddapy#199. In the erddapy implementation, the 'standard' ERDDAP search is used per format_search_string(). This allows search by keyword, essentially. For colocate, however, we use the ERDDAP 'Advanced Search' to filter datasets (provided by the erddapy get_search_url() function) - implemented in colocate in the query() function by passing the
Let's use the erddapy implementation of parallelized search in multiple_server_search.py as a starting point and implement something similar for colocate. Then perhaps it can be pushed upstream to erddapy if appropriate. I also have the beginning of an implementation in this branch.
Hi Micah, thanks a lot for letting me know about the execution of the task. Finally one of the key tasks in colocate has been done. Although I would have loved to do it in GSoC instead, its execution gives me immense satisfaction. Just a gentle reminder: if we can collaborate on any projects in the future, let me know; I will always be up for it. With regards, Shivam Sundram
PR submitted to (mostly) resolve this: #26. There are still issues for someone to solve, however:
The function that is being parallelized should probably enforce the timeout. I suggest timeout_decorator, but there may be newer techniques out there.
My guess is that the latest ERDDAP servers will allow two concurrent connections, right? So there isn't much to be done in terms of making them parallel :-/
@ocefpaf I hoped you might have some ideas. This would be a good project for someone to tackle for OHW if anyone is interested. My initial approach was to use the library mentioned above. If that library supported timeouts natively in such a way, that would be great, because we could just set the timeout there. It got too complicated for me, so I quit at that point. Just adding this link for reference; this is where I discovered the expected behavior.
Yeah. That is why I suggest adding the timeout decorator in
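For reference, a stdlib-only alternative to timeout_decorator, using concurrent.futures to bound how long we wait on any single server call. Note the caveat: on timeout the worker thread is abandoned, not killed, so the underlying request may still complete in the background. The function name and None-on-timeout convention here are illustrative, not colocate's API.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FuturesTimeout

def call_with_timeout(fn, *args, timeout=30, **kwargs):
    """Run fn(*args, **kwargs), but give up waiting after `timeout` seconds.

    Returns fn's result, or None if the call did not finish in time.
    The worker thread is abandoned (not killed) on timeout.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout)
    except FuturesTimeout:
        return None  # caller treats None as "server did not respond in time"
    finally:
        # Don't block waiting for an abandoned worker to finish.
        pool.shutdown(wait=False)
```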
Project Description:
Note: this GSoC project should be done in combination with #25 to improve the way the colocate library interacts with ERDDAP servers generally. This issue aims to improve how efficiently colocate searches ERDDAP servers for relevant data and extracts a subset of points to plot on the map view. #25 focuses on generating ERDDAP API URLs to extract individual datasets from ERDDAP servers. Interested students should submit applications that consider both issues #14 and #25 together for an overall ERDDAP enhancement GSoC project.
Existing code in the erddap_query.py module:
Both of these tasks could be parallelized using Python libraries, for example:
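For instance, the first task (one search request per server in the Awesome ERDDAP list) could be fanned out with concurrent.futures from the standard library. This is a sketch, not the colocate implementation: `search_server` is a placeholder for the existing per-server query (e.g. query(url, **kw)).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def parallel_search(server_urls, search_server, max_workers=8):
    """Query every ERDDAP server in parallel, one request each,
    so no single server is hit concurrently.

    search_server: placeholder for the existing per-server query.
    Servers that raise (down, slow, malformed response) are skipped.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(search_server, url): url
                   for url in server_urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                results[url] = fut.result()
            except Exception:
                continue  # skip unreachable or failing servers
    return results
```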
Or maybe another technique? Implementing parallelization will improve response time in the notebook and/or help eventual development of a dashboard application (#24) out of the project.
Extra Credit/Part 2:
The parallelization of the ERDDAP search step in query(url, **kw) should be a fairly minimal effort, and might not constitute a full project in terms of time commitment. More difficult will be the second parallelization approach, for get_coordinates(df, **kw). This has to do with the variability of the potential results from the search step: the app queries all known public ERDDAP services, each of which can have an unknown number of datasets with unknown data density. It is very easy to overwhelm a single ERDDAP service with multiple parallel large data queries if a user searches too broadly or the results happen to include many datasets from a single ERDDAP.
An extension of this project could be to figure out how to change the UX of the app so that either:
a) the user interactively selects one or multiple datasets to display on the map together (rather than just displaying the first 10 of X number of unknown search results in random order as it does now) or
b) parallelizing display of X number of results, but only requesting data from a single ERDDAP server in serial, or implementing some other means of preventing excessive concurrent data requests to the same ERDDAP server, while still displaying an unknown number of dataset results from the search step in a DataShader map for visualization/preview
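Option (b) could be sketched with a per-host semaphore: downloads stay parallel overall, but no single ERDDAP host ever sees more than per_host concurrent requests. The class and names below are hypothetical, offered only to illustrate the idea.

```python
import threading
from collections import defaultdict
from urllib.parse import urlparse

class PerHostLimiter:
    """Cap concurrent requests per ERDDAP host while allowing
    parallelism across hosts (a sketch of option (b) above)."""

    def __init__(self, per_host=1):
        self._per_host = per_host
        self._lock = threading.Lock()   # guards the semaphore registry
        self._sems = {}

    def _sem(self, url):
        """Return the semaphore for this URL's host, creating it lazily."""
        host = urlparse(url).netloc
        with self._lock:
            if host not in self._sems:
                self._sems[host] = threading.Semaphore(self._per_host)
            return self._sems[host]

    def run(self, fetch, url):
        """Call fetch(url), blocking first if the host is already
        serving per_host concurrent requests."""
        with self._sem(url):
            return fetch(url)
```

Worker threads (one per dataset result) would route every request through `limiter.run(...)`, so broad searches that happen to hit one ERDDAP degrade to serial access for that host only.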
Expected Outcomes:
Faster results returned from users' filter queries to the ERDDAP server list in the 'awesome-erddap' repo, and the ability to plot with HoloViz/DataShader either the entire result set or a user-selected subset of results, without overwhelming ERDDAP servers.
Skills required:
Python programming, multi-threading
Difficulty:
Low/Medium
Student Test:
all_coords = erddap_query.get_coordinates(ui.df, **ui.kw)
runs the erddap_query.get_coordinates() function, randomizes the datasets matched above, and searches in serial for the first 10 datasets from any ERDDAP server that return coordinate values to plot. How could this code be improved to be faster, but not send more than one request to a single ERDDAP server at a time, assuming we kept it to only show the first 10 datasets from Cell 6 that return coordinates? Suggest some ideas in your proposal.

Mentor(s):
Micah Wengren @mwengren, Mathew Biddle @MathewBiddle