
multi erddap search added #199

Merged · 14 commits · Jul 26, 2021

Conversation

callumrollo (Contributor)

First attempt at resolving #190

@callumrollo (Contributor Author)

ERDDAP.search_all_servers searches multiple servers for a user-supplied string. It returns a dataframe with the dataset name, ID, institution, and server URL: everything a user needs to make data queries against those datasets. The user can supply a list of server URLs; otherwise it tries all the servers in erddapy.servers.
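Sketched with hypothetical rows, the returned frame carries everything needed for follow-up queries:

```python
# Hypothetical rows with the columns described above; the real method
# builds a pandas DataFrame from live ERDDAP search responses.
rows = [
    {"Title": "Glider profiles", "Institution": "Example Institute",
     "Dataset ID": "gliders_1", "Server url": "https://a.example/erddap/"},
    {"Title": "Buoy winds", "Institution": "Example Agency",
     "Dataset ID": "buoys_2", "Server url": "https://b.example/erddap/"},
]

# A follow-up data query needs the server to talk to plus the
# dataset ID on that server.
targets = [(r["Server url"], r["Dataset ID"]) for r in rows]
```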

erddapy/erddapy.py (review thread: outdated, resolved)
    }
    num_cores = multiprocessing.cpu_count()
    returns = Parallel(n_jobs=num_cores)(
        delayed(parse_results)(url, key, protocol="tabledap")
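The snippet above fans one search task out per server via joblib.Parallel. A stdlib sketch of the same fan-out/collect pattern, with a stand-in fetcher (the real parse_results does network I/O and CSV parsing, and returns None on failure):

```python
from concurrent.futures import ThreadPoolExecutor


def parse_results(url, key, protocol="tabledap"):
    # Stand-in for the PR's fetch-and-parse function: returns a
    # {server_key: payload} mapping, or None when a server fails.
    return {key: f"{protocol} search of {url}"}


urls = {
    "serverA": "https://a.example/erddap/search/index.csv",
    "serverB": "https://b.example/erddap/search/index.csv",
}

# Fan one task out per server, collect the per-server results.
with ThreadPoolExecutor() as pool:
    returns = list(
        pool.map(lambda item: parse_results(item[1], item[0]), urls.items())
    )

# Merge, dropping servers that failed (None results).
results = {}
for r in returns:
    if r is not None:
        results.update(r)
```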
Contributor

Could protocol also be a kwarg? Or at least mentioned in the docstring? Right now you have to read the code to figure out that the function only searches for tabledap.
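A sketch of the suggested signature, with protocol surfaced as a documented kwarg (names follow the PR; the body is elided):

```python
from typing import Dict, Optional


def parse_results(
    url: str, key: str, protocol: str = "tabledap"
) -> Optional[Dict]:
    """Search one ERDDAP server for datasets serving the given protocol.

    Parameters
    ----------
    protocol : str, default "tabledap"
        ERDDAP protocol to filter on (e.g. "tabledap" or "griddap");
        rows for datasets that do not serve this protocol are dropped.
    """
    ...
```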

Contributor Author

That's a good point. protocol added as a kwarg in 4ab1e5a

erddapy/erddapy.py Outdated Show resolved Hide resolved
erddapy/erddapy.py Outdated Show resolved Hide resolved
@callumrollo (Contributor Author)

@abkfenris thank you for the review. I hope the subsequent commits have improved this PR. @ocefpaf could you look over this? I'm not sure why the pre-commit is failing again; it worked fine locally :/

@abkfenris (Contributor) left a comment

Nice! It looks a lot cleaner now.

I think there are still some ways it could be cleaned up even further, which would help testability and reusability. Extracting the data transformation and URL generation into their own functions would help with that.

For example, the current fetch_results() could become two functions that are much easier to test individually.

def parse_results(
    data: BinaryIO, url: str, key: str, protocol: str = "tabledap"
) -> Optional[Dict[str, DataFrame]]:
    """Parse one server's search-results CSV into a keyed DataFrame."""
    df = pd.read_csv(data)
    try:
        df.dropna(subset=[protocol], inplace=True)
    except KeyError:
        # This server's results have no column for the requested protocol.
        return None
    df["Server url"] = url.split("search")[0]
    return {key: df[["Title", "Institution", "Dataset ID", "Server url"]]}


def fetch_results(
    url: str, key: str, protocol: str = "tabledap"
) -> Optional[Dict[str, DataFrame]]:
    """Fetch search results from one server and hand them to the parser."""
    data = multi_urlopen(url)
    if data is None:
        return None
    return parse_results(data, url, key, protocol=protocol)

Then parse_results() can be tested by passing it concrete data, and fetch_results() can be tested with parse_results() mocked out, in addition to using a library like VCR.
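Concretely, a sketch of such a test, feeding the parser an in-memory CSV (column names follow the snippet above; the parse logic is inlined so the example is self-contained):

```python
import io

import pandas as pd


def parse_results(data, url, key, protocol="tabledap"):
    # Inlined copy of the parse step suggested above.
    df = pd.read_csv(data)
    try:
        df.dropna(subset=[protocol], inplace=True)
    except KeyError:
        return None
    df["Server url"] = url.split("search")[0]
    return {key: df[["Title", "Institution", "Dataset ID", "Server url"]]}


# Concrete, in-memory "server response": no network needed.
csv = io.BytesIO(
    b"Title,Institution,Dataset ID,tabledap\n"
    b"Gliders,Example Institute,gliders_1,https://a.example/tabledap/gliders_1\n"
)
out = parse_results(csv, "https://a.example/erddap/search?searchFor=gliders", "a")
```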

parse_results() could then also be used by a server-specific search method on erddapy.ERDDAP, or if I wanted to use httpx to make async requests instead. Similarly, generating the search URL could have multiple uses too.

It might make sense to pull these functions out into a separate file to keep things more organized.

erddapy/erddapy.py (review thread: outdated, resolved)
Comment on lines 147 to 153:

    df = pd.read_csv(data)
    try:
        df.dropna(subset=[protocol], inplace=True)
    except KeyError:
        return None
    df["Server url"] = url.split("search")[0]
    return {key: df[["Title", "Institution", "Dataset ID", "Server url"]]}
Contributor

I'd suggest splitting out the parsing of data from your data fetching function.

erddapy/erddapy.py (review thread: outdated, resolved)
@callumrollo (Contributor Author)

Thanks @abkfenris, I've refactored into several smaller functions and pulled them all out into a separate file. I'll start working on some tests, which should be easier with the more atomic structure you recommended.

@callumrollo (Contributor Author)

Am I on the right track with tests here? I'm not sure how to proceed with the more involved ones like parse_results. How would we mock data for this?
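One option is to patch the network-layer function and hand back canned bytes. A sketch of the pattern, using a stand-in namespace (a real test would patch the actual erddapy import path instead):

```python
import io
import types
from unittest import mock

# Stand-in for the module that owns multi_urlopen; in the real test
# suite you would patch the actual import path instead of this namespace.
url_handling = types.SimpleNamespace(multi_urlopen=lambda url: None)


def fetch_results(url):
    # Mirrors the PR's fetch step: bail out if the request failed.
    data = url_handling.multi_urlopen(url)
    return None if data is None else data.read()


canned = b"Title,Institution,Dataset ID,tabledap\nT,I,D,u\n"
with mock.patch.object(
    url_handling, "multi_urlopen", return_value=io.BytesIO(canned)
):
    fetched = fetch_results("https://a.example/erddap/search")

# Outside the patch, the original (failing) fetcher is restored.
unmocked = fetch_results("https://a.example/erddap/search")
```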

@@ -33,6 +33,23 @@ def urlopen(url: str, auth: Optional[tuple] = None, **kwargs: Dict) -> BinaryIO:

        return data


    def multi_urlopen(url: str) -> BinaryIO:
@ocefpaf (Member) commented on Jul 26, 2021

Let's "fold" this one into the canonical urlopen by making the latter a thin wrapper around this one. That will allow us to cache the results in that one.

Let's tackle this in another PR.
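That fold could look something like the following sketch (names, the cache size, and the stubbed fetch are assumptions; as noted above, the actual change was deferred to a later PR):

```python
import functools
import io

calls = []  # records real fetches, so the caching is observable


@functools.lru_cache(maxsize=256)
def _urlopen_cached(url: str) -> bytes:
    # The real implementation would perform the HTTP request here;
    # this stub just returns a canned payload and logs the call.
    calls.append(url)
    return b"Title,Institution,Dataset ID\n"


def urlopen(url: str) -> io.BytesIO:
    """Thin wrapper over the cached fetch, returning a fresh file object."""
    return io.BytesIO(_urlopen_cached(url))


first = urlopen("https://a.example/erddap/search")
second = urlopen("https://a.example/erddap/search")  # served from the cache
```

Returning a fresh BytesIO per call matters: callers can each read the payload from the start while the underlying bytes are fetched only once.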

erddapy/url_handling.py (review thread: outdated, resolved)
@ocefpaf ocefpaf merged commit 0bf5bcf into ioos:main Jul 26, 2021
@callumrollo (Contributor Author)

Thanks for the help @abkfenris and @ocefpaf! Can we close #32 now?
