
multi erddap search added #199

Merged · 14 commits · Jul 26, 2021

Conversation

callumrollo (Contributor)

First attempt at resolving #190

@callumrollo (Contributor Author)

ERDDAP.search_all_servers searches multiple servers for a user-supplied string. It returns a dataframe with the dataset name, ID, institution, and server URL: everything a user needs to make data queries against those datasets. The user can supply a list of server URLs; otherwise it tries all the servers in erddapy.servers.
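Sketched with hypothetical rows, the returned frame carries everything needed for follow-up queries:

```python
# Hypothetical rows with the columns described above; the real method
# builds a pandas DataFrame from live ERDDAP search responses.
rows = [
    {"Title": "Glider profiles", "Institution": "Example Institute",
     "Dataset ID": "gliders_1", "Server url": "https://a.example/erddap/"},
    {"Title": "Buoy winds", "Institution": "Example Agency",
     "Dataset ID": "buoys_2", "Server url": "https://b.example/erddap/"},
]

# A follow-up data query needs the server to talk to plus the
# dataset ID on that server.
targets = [(r["Server url"], r["Dataset ID"]) for r in rows]
```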

erddapy/erddapy.py (review thread: outdated, resolved)
    }
    num_cores = multiprocessing.cpu_count()
    returns = Parallel(n_jobs=num_cores)(
        delayed(parse_results)(url, key, protocol="tabledap")
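The snippet above fans one search task out per server via joblib.Parallel. A stdlib sketch of the same fan-out/collect pattern, with a stand-in fetcher (the real parse_results does network I/O and CSV parsing, and returns None on failure):

```python
from concurrent.futures import ThreadPoolExecutor


def parse_results(url, key, protocol="tabledap"):
    # Stand-in for the PR's fetch-and-parse function: returns a
    # {server_key: payload} mapping, or None when a server fails.
    return {key: f"{protocol} search of {url}"}


urls = {
    "serverA": "https://a.example/erddap/search/index.csv",
    "serverB": "https://b.example/erddap/search/index.csv",
}

# Fan one task out per server, collect the per-server results.
with ThreadPoolExecutor() as pool:
    returns = list(
        pool.map(lambda item: parse_results(item[1], item[0]), urls.items())
    )

# Merge, dropping servers that failed (None results).
results = {}
for r in returns:
    if r is not None:
        results.update(r)
```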
Contributor

Could protocol also be a kwarg? Or at least mentioned in the docstring? Right now you have to read the code to figure out that the function only searches for tabledap.
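A sketch of the suggested signature, with protocol surfaced as a documented kwarg (names follow the PR; the body is elided):

```python
from typing import Dict, Optional


def parse_results(
    url: str, key: str, protocol: str = "tabledap"
) -> Optional[Dict]:
    """Search one ERDDAP server for datasets serving the given protocol.

    Parameters
    ----------
    protocol : str, default "tabledap"
        ERDDAP protocol to filter on (e.g. "tabledap" or "griddap");
        rows for datasets that do not serve this protocol are dropped.
    """
    ...
```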

Contributor Author

That's a good point. protocol added as a kwarg in 4ab1e5a

erddapy/erddapy.py Outdated Show resolved Hide resolved
erddapy/erddapy.py Outdated Show resolved Hide resolved
@callumrollo (Contributor Author)

@abkfenris thank you for the review. I hope the subsequent commits have improved this PR. @ocefpaf could you look over this? I'm not sure why the pre-commit is failing again; it worked fine locally :/

@abkfenris (Contributor) left a comment

Nice! It looks a lot cleaner now.

I think there are still some ways it could be cleaned up even further, which would help testability and reusability. Extracting the data transformation and URL generation into their own functions would help with that.

For example, the current fetch_results() could become two functions that are much easier to test individually.

def parse_results(
    data: BinaryIO, url: str, key: str, protocol: str = "tabledap"
) -> Optional[Dict[str, DataFrame]]:
    """Parse one server's search-results CSV into a keyed DataFrame."""
    df = pd.read_csv(data)
    try:
        df.dropna(subset=[protocol], inplace=True)
    except KeyError:
        # This server's results have no column for the requested protocol.
        return None
    df["Server url"] = url.split("search")[0]
    return {key: df[["Title", "Institution", "Dataset ID", "Server url"]]}


def fetch_results(
    url: str, key: str, protocol: str = "tabledap"
) -> Optional[Dict[str, DataFrame]]:
    """Fetch search results from one server and hand them to the parser."""
    data = multi_urlopen(url)
    if data is None:
        return None
    return parse_results(data, url, key, protocol=protocol)

Then parse_results() can be tested by passing it concrete data, and fetch_results() can be tested with parse_results() mocked out, in addition to using a library like VCR.
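Concretely, a sketch of such a test, feeding the parser an in-memory CSV (column names follow the snippet above; the parse logic is inlined so the example is self-contained):

```python
import io

import pandas as pd


def parse_results(data, url, key, protocol="tabledap"):
    # Inlined copy of the parse step suggested above.
    df = pd.read_csv(data)
    try:
        df.dropna(subset=[protocol], inplace=True)
    except KeyError:
        return None
    df["Server url"] = url.split("search")[0]
    return {key: df[["Title", "Institution", "Dataset ID", "Server url"]]}


# Concrete, in-memory "server response": no network needed.
csv = io.BytesIO(
    b"Title,Institution,Dataset ID,tabledap\n"
    b"Gliders,Example Institute,gliders_1,https://a.example/tabledap/gliders_1\n"
)
out = parse_results(csv, "https://a.example/erddap/search?searchFor=gliders", "a")
```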

parse_results() could then also be used by a server-specific search method on erddapy.ERDDAP, or if I wanted to use httpx to make async requests instead. Similarly, generating the search URL could have multiple uses too.

It might make sense to pull these functions out into a separate file to keep things more organized.

erddapy/erddapy.py (review thread: outdated, resolved)
Comment on lines 147 to 153:

    df = pd.read_csv(data)
    try:
        df.dropna(subset=[protocol], inplace=True)
    except KeyError:
        return None
    df["Server url"] = url.split("search")[0]
    return {key: df[["Title", "Institution", "Dataset ID", "Server url"]]}
Contributor

I'd suggest splitting out the parsing of data from your data fetching function.

erddapy/erddapy.py (review thread: outdated, resolved)
@callumrollo (Contributor Author)

Thanks @abkfenris, I've refactored into several smaller functions and pulled them all out into a separate file. I'll start working on some tests, which should be easier with the more atomic structure you recommended.

@callumrollo (Contributor Author)

Am I on the right track with tests here? I'm not sure how to proceed with the more involved ones like parse_results. How would we mock data for this?
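One option is to patch the network-layer function and hand back canned bytes. A sketch of the pattern, using a stand-in namespace (a real test would patch the actual erddapy import path instead):

```python
import io
import types
from unittest import mock

# Stand-in for the module that owns multi_urlopen; in the real test
# suite you would patch the actual import path instead of this namespace.
url_handling = types.SimpleNamespace(multi_urlopen=lambda url: None)


def fetch_results(url):
    # Mirrors the PR's fetch step: bail out if the request failed.
    data = url_handling.multi_urlopen(url)
    return None if data is None else data.read()


canned = b"Title,Institution,Dataset ID,tabledap\nT,I,D,u\n"
with mock.patch.object(
    url_handling, "multi_urlopen", return_value=io.BytesIO(canned)
):
    fetched = fetch_results("https://a.example/erddap/search")

# Outside the patch, the original (failing) fetcher is restored.
unmocked = fetch_results("https://a.example/erddap/search")
```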

@@ -33,6 +33,23 @@ def urlopen(url: str, auth: Optional[tuple] = None, **kwargs: Dict) -> BinaryIO:

        return data


    def multi_urlopen(url: str) -> BinaryIO:
@ocefpaf (Member) commented on Jul 26, 2021

Let's "fold" this one into the canonical urlopen by making the latter a thin wrapper around this one. That will allow us to cache the results in that one.

Let's tackle this in another PR.
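That fold could look something like the following sketch (names, the cache size, and the stubbed fetch are assumptions; as noted above, the actual change was deferred to a later PR):

```python
import functools
import io

calls = []  # records real fetches, so the caching is observable


@functools.lru_cache(maxsize=256)
def _urlopen_cached(url: str) -> bytes:
    # The real implementation would perform the HTTP request here;
    # this stub just returns a canned payload and logs the call.
    calls.append(url)
    return b"Title,Institution,Dataset ID\n"


def urlopen(url: str) -> io.BytesIO:
    """Thin wrapper over the cached fetch, returning a fresh file object."""
    return io.BytesIO(_urlopen_cached(url))


first = urlopen("https://a.example/erddap/search")
second = urlopen("https://a.example/erddap/search")  # served from the cache
```

Returning a fresh BytesIO per call matters: callers can each read the payload from the start while the underlying bytes are fetched only once.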

erddapy/url_handling.py (review thread: outdated, resolved)
@ocefpaf ocefpaf merged commit 0bf5bcf into ioos:main Jul 26, 2021
@callumrollo (Contributor Author)

Thanks for the help @abkfenris and @ocefpaf! Can we close #32 now?
