
Parallelize ERDDAP server queries #14

Open
mwengren opened this issue Aug 12, 2020 · 13 comments

@mwengren
Member

mwengren commented Aug 12, 2020

Project Description:

Note: this GSoC project should be done in combination with #25 to improve the way the colocate library interacts with ERDDAP servers generally. This issue aims to improve the efficiency of colocate's searches of ERDDAP servers for relevant data and its extraction of a subset of points to plot on the map view. #25 focuses on generating ERDDAP API URLs to extract individual datasets from ERDDAP servers. Interested students should submit applications that consider issues #14 and #25 together as one overall ERDDAP enhancement GSoC project.

Existing code in the erddap_query.py module:

  • iterates through a Pandas DataFrame of ERDDAP servers, searching each for datasets that match the standard_name, time period, and geographic bounding box submitted by the user in the notebook - see query(url, **kw)
  • queries the set of datasets returned from the above function for x/y positions of features within each matching dataset, to display them on a HoloViews/DataShader map - see get_coordinates(df, **kw)

Both of these tasks could be parallelized using Python libraries, for example:

from joblib import Parallel, delayed
import multiprocessing

Or maybe another technique? Implementing parallelization will improve response time in the notebook and would also benefit the eventual development of a dashboard application (#24) from this project.
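As one such "another technique", here is a minimal sketch using the stdlib concurrent.futures module instead of joblib. The function name search_one_server and the server URLs are hypothetical placeholders, not colocate's actual code:

```python
# Sketch: parallelize the per-server search loop with stdlib threads.
from concurrent.futures import ThreadPoolExecutor

def search_one_server(server):
    """Hypothetical stand-in for erddap_query.query(url, **kw) on one server.

    A real version would hit the server's search endpoint and return a
    DataFrame of matching datasets; failures are caught per server so one
    dead ERDDAP does not sink the whole batch.
    """
    try:
        return server, ["dataset_a", "dataset_b"]  # placeholder result
    except Exception:
        return server, None

servers = [
    "https://erddap-a.example.org/erddap",
    "https://erddap-b.example.org/erddap",
]
# ERDDAP queries are network-bound, so threads parallelize them well
# (the GIL is released while waiting on I/O).
with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(pool.map(search_one_server, servers))
```

Since each server's query is a function call, the same shape drops straight into joblib's Parallel/delayed if a process backend is preferred.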

Extra Credit/Part 2:

Parallelizing the ERDDAP search step in query(url, **kw) should be a fairly minimal effort and might not constitute a full project in terms of time commitment. Parallelizing get_coordinates(df, **kw) will be more difficult.

This has to do with the variability of the potential results from the search step: the app queries all known public ERDDAP servers, each of which can have an unknown number of datasets with unknown data density. If a user searches too broadly, or the results happen to include many datasets from a single ERDDAP, it is very easy to overwhelm that server with multiple parallel large data queries.

An extension of this project could be to figure out how to change the UX of the app so that either:

a) the user interactively selects one or more datasets to display on the map together (rather than displaying the first 10 of an unknown number of search results in random order, as the app does now), or
b) the app parallelizes display of the results, but requests data from any single ERDDAP server serially, or implements some other means of preventing excessive concurrent data requests to the same ERDDAP server, while still displaying an unknown number of dataset results from the search step on a DataShader map for visualization/preview
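A minimal sketch of option (b), parallel across servers but serial within each server, assuming a hypothetical per-dataset fetch function fetch_coords (not colocate's real API):

```python
# Sketch of option (b): one worker per server, serial requests within it.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def fetch_coords(server, dataset_id):
    """Hypothetical stand-in for the per-dataset coordinate request."""
    return f"coords of {dataset_id} from {server}"

def fetch_all(datasets, max_workers=8):
    """datasets: iterable of (server_url, dataset_id) pairs.

    Groups datasets by host server, then runs servers in parallel while each
    server's datasets are fetched in serial, so a single ERDDAP never sees
    more than one concurrent request from us.
    """
    by_server = defaultdict(list)
    for server, ds in datasets:
        by_server[server].append(ds)

    def one_server(item):
        server, ids = item
        # serial loop: at most one in-flight request per server
        return [(ds, fetch_coords(server, ds)) for ds in ids]

    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for chunk in pool.map(one_server, by_server.items()):
            results.extend(chunk)
    return results

hits = [("https://a.example/erddap", "ds1"),
        ("https://a.example/erddap", "ds2"),
        ("https://b.example/erddap", "ds3")]
coords = fetch_all(hits)
```

The grouping step is the whole trick: the degree of parallelism becomes the number of distinct servers in the result set, not the number of datasets.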

Expected Outcomes:
Faster results returned from users' filter queries to the ERDDAP server list in the 'awesome-erddap' repo, and the ability to plot either the entire result set or a user-selected subset with HoloViz/DataShader, without overwhelming ERDDAP servers.

Skills required:
Python programming, multi-threading

Difficulty:
Low/Medium

Student Test:

  • Fork the repository
  • Run the colocate-dev.ipynb in Binder to get an idea of how the ERDDAP query code works in Cell 6 (the classic Jupyter Notebook interface provides better debug output than JupyterLab).
  • Investigate adding code to functions such as run.ui_query() and erddap.query() to query ERDDAP servers in parallel rather than serially - this is Part 1 of this project and should help familiarize yourself with the code base.
  • Push your changes to your GitHub fork and post a link to run your modified colocate-dev.ipynb notebook in this issue with the parallelized ERDDAP search code working
  • For the Extra Credit/Part 2 of this project, investigate the debug output in Cell 8 of colocate-dev.ipynb. Currently, Cell 8's code all_coords = erddap_query.get_coordinates(ui.df, **ui.kw) runs the erddap_query.get_coordinates() function, which randomizes the datasets matched above and searches in serial for the first 10 datasets, from any ERDDAP, that return coordinate values to plot. How could this code be made faster without sending more than one request to a single ERDDAP server at a time, assuming we keep showing only the first 10 datasets from Cell 6 that return coordinates? Suggest some ideas in your proposal.

Mentor(s):
Micah Wengren @mwengren, Mathew Biddle @MathewBiddle

@MathewBiddle
Contributor

yeah, I just saw this from @ocefpaf

from joblib import Parallel, delayed
import multiprocessing

# Assumes `e` is a configured erddapy.ERDDAP client and `whoi_gliders` is a
# list of glider dataset IDs, both defined earlier in the session.


def request_whoi(dataset_id):
    # With joblib's default process backend, each worker gets its own copy
    # of `e`, so mutating it here is safe.
    e.constraints = None
    e.protocol = "tabledap"
    e.variables = ["longitude", "latitude", "temperature", "salinity"]
    e.dataset_id = dataset_id
    # Drop units in the first line and NaNs.
    df = e.to_pandas(response="csv", skiprows=(1,)).dropna()
    return (dataset_id, df)


num_cores = multiprocessing.cpu_count()
downloads = Parallel(n_jobs=num_cores)(
    delayed(request_whoi)(dataset_id) for dataset_id in whoi_gliders
)

dfs = {glider: df for (glider, df) in downloads}

@ocefpaf
Member

ocefpaf commented Aug 12, 2020

BTW, doing that on a single server will get some sysadmins mad at you! But doing this across different servers is fine.

@mwengren
Member Author

mwengren commented Aug 12, 2020

For the first function, it should be fine, since we're sending just one query to each server on the Awesome ERDDAP list.

For the second, we randomize the dataset hits from the first query and iterate through them. There's a chance that they all could come from the same ERDDAP server.

We could skip the randomize step and figure out a safe way to parallelize this result set instead?

Basically, these few lines of code are where that happens.

@mwengren mwengren added the GSoC Project ideas for Google Summer of Code label Feb 4, 2021
@Rohan-cod

Venerated Sir,
I hope you are safe and in good health in the wake of prevailing COVID-19.
My name is Rohan Gupta and I am a 3rd-year Computer Science undergraduate student at Shri Mata Vaishno Devi University. I have been working with Python and deep learning for a couple of years now and have in-depth knowledge of it. I look forward to contributing to this idea as part of this year's GSoC.
It would be a great assistance if you could suggest how to get started.
My LinkedIn profile: https://www.linkedin.com/in/rohang4837b4124/

@mwengren mwengren changed the title Parallelize ERDDAP server queries Parallelize ERDDAP server queries (GSoC Project 1) Mar 10, 2021
@mwengren
Member Author

@Rohan-cod I added a 'Student Test' to the description with some ideas for how to get started. Please take a look there.

@Rohan-cod

Sure Sir. I will try my best to solve the test. It will help get a deep understanding of the project and the codebase.

@RATED-R-SUNDRAM

RATED-R-SUNDRAM commented Mar 28, 2021

I have gone through the student tests for this issue and issue #25, and based on my understanding of the project I have created a student proposal: https://docs.google.com/document/d/13Zx1hBk42hHPZuacIIlSohCtS7q-LrGcfvgVkd89lmI/edit?usp=sharing

@mwengren
Member Author

mwengren commented Aug 2, 2021

An implementation of parallelized query has been added to erddapy, see ioos/erddapy#199.

In the erddapy implementation, the 'standard' ERDDAP search is used per format_search_string(). This allows search by keyword, essentially.

For colocate, however, we use the ERDDAP 'Advanced Search' (provided by the erddapy get_search_url() function) to filter datasets. colocate implements this in the query() function by passing the **kw parameter on to erddapy to get finer-grained results. Here's an example of the filter parameters we're currently passing via **kw:

{'min_lon': -72.706369,
 'max_lon': -64.17569700000001,
 'min_lat': 35.69571500000001,
 'max_lat': 41.28732200000002,
 'min_time': '2017-08-11T00:00:00Z',
 'max_time': '2019-12-31T00:00:00Z',
 'standard_name': 'sea_water_practical_salinity'}

Let's use the erddapy implementation of parallelized search in multiple_server_search.py as a starting point and implement something similar for colocate. Then perhaps it can be pushed upstream to erddapy if appropriate.

I also have the beginning of an implementation in this branch.
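For reference, here is a stdlib-only sketch of how those **kw filters might translate into an ERDDAP Advanced Search URL. In colocate this is handled by erddapy's get_search_url(), so the parameter-name mapping and server URL below are illustrative assumptions, not erddapy's real internals:

```python
# Sketch: build an ERDDAP Advanced Search URL from colocate-style **kw
# filters. The snake_case -> camelCase mapping is an assumption based on
# ERDDAP's documented advanced-search parameters.
from urllib.parse import urlencode

KW_TO_ERDDAP = {
    "min_lon": "minLon", "max_lon": "maxLon",
    "min_lat": "minLat", "max_lat": "maxLat",
    "min_time": "minTime", "max_time": "maxTime",
    "standard_name": "standard_name",
}

def advanced_search_url(server, **kw):
    base = server.rstrip("/") + "/search/advanced.csv"
    params = {"page": 1, "itemsPerPage": 1000, "searchFor": ""}
    params.update({KW_TO_ERDDAP[k]: v for k, v in kw.items()})
    return base + "?" + urlencode(params)

kw = {
    "min_lon": -72.706369,
    "max_lon": -64.175697,
    "min_lat": 35.695715,
    "max_lat": 41.287322,
    "min_time": "2017-08-11T00:00:00Z",
    "max_time": "2019-12-31T00:00:00Z",
    "standard_name": "sea_water_practical_salinity",
}
url = advanced_search_url("https://erddap.example.org/erddap", **kw)
```

Mapping this URL builder over the awesome-erddap server list is then the same fan-out problem as the keyword search erddapy already parallelizes.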

@mwengren mwengren added the ohw21 Issues to address in OceanHackWeek 21 label Aug 2, 2021
@RATED-R-SUNDRAM

RATED-R-SUNDRAM commented Aug 3, 2021 via email

@mwengren
Member Author

mwengren commented Aug 3, 2021

PR submitted to (mostly) resolve this: #26.

There are still issues for someone to solve, however:

  1. How to implement a timeout in joblib to prevent waiting for the 'slowest' of the ERDDAP server collection to respond (or timeout).
  2. Whether/how to parallelize the get_coordinates(df, **kw) function without sending too many requests to any one ERDDAP server.

@ocefpaf
Member

ocefpaf commented Aug 4, 2021

  1. How to implement a timeout in joblib to prevent waiting for the 'slowest' of the ERDDAP server collection to respond (or timeout).

The function that is being parallelized should probably have the timeout. I suggest the timeout_decorator package, but there may be newer techniques out there.

  2. Whether/how to parallelize the get_coordinates(df, **kw) function without sending too many requests to any one ERDDAP server.

My guess is that the latest ERDDAP servers will allow two concurrent connections, right? So there isn't much to be done in terms of making them parallel :-/

@mwengren
Member Author

mwengren commented Aug 4, 2021

@ocefpaf I hoped you might have some ideas. This would be a good project for someone to tackle for OHW if anyone is interested.

My initial approach of using the timeout parameter in joblib.Parallel - in my development branch here - fails because, if the timeout is triggered, the assignment to results on line 24 fails, even for Parallel jobs that completed successfully before the timeout was triggered.

If that library supported timeouts natively in such a way, that would be great, because we could just set the timeout to 30 seconds and ignore any servers that don't respond by then (similar to how erddap.com must do it). But perhaps it must be done by the implementing function like you mentioned.

It got too complicated for me, so I quit at that point.

Just adding this link for reference, this is where I discovered the expected behavior of the timeout parameter: https://coderzcolumn.com/tutorials/python/joblib-parallel-processing-in-python#5

@ocefpaf
Member

ocefpaf commented Aug 4, 2021

My initial approach to use the timeout parameter in joblib.Parallel - in my development branch here - fails because if the timeout is triggered, the assignment to results on line 24 fails, even for any Parallel jobs that have successfully completed before the timeout is triggered.

Yeah. That is why I suggested adding the timeout decorator in do_query(server) instead. It would probably require some "gracefully" failed result that we can use later in the list.
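A stdlib-only sketch of that "gracefully failed result" idea (timeout_decorator would work too): wrap the per-server query with a timeout so a slow server yields (server, None) instead of raising, and the caller still collects a complete results list. do_query here is a hypothetical slow stand-in, not the project's real function:

```python
# Sketch: per-call timeout with a graceful (server, None) failure result.
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def with_timeout(fn, timeout):
    def wrapped(server):
        pool = ThreadPoolExecutor(max_workers=1)
        try:
            return server, pool.submit(fn, server).result(timeout=timeout)
        except TimeoutError:
            return server, None  # graceful failure keeps its slot in the list
        finally:
            # Don't block on the abandoned worker; note the underlying
            # request still runs to completion in the background.
            pool.shutdown(wait=False)
    return wrapped

def do_query(server):
    """Hypothetical single-server search; 'slow' servers simulate a hang."""
    if "slow" in server:
        time.sleep(1.0)  # simulate an unresponsive ERDDAP
    return "ok"

safe_query = with_timeout(do_query, timeout=0.2)
results = [safe_query(s) for s in ("https://fast.example", "https://slow.example")]
```

Because every call returns a (server, result) pair, the wrapped function can be handed to joblib.Parallel as-is and timeouts no longer poison the whole results assignment.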

@mwengren mwengren removed the ohw21 Issues to address in OceanHackWeek 21 label Feb 8, 2023
@mwengren mwengren changed the title Parallelize ERDDAP server queries (GSoC Project 1) Parallelize ERDDAP server queries Feb 8, 2023
@mwengren mwengren removed the GSoC Project ideas for Google Summer of Code label Feb 8, 2023