multi erddap search added #199
Conversation
erddapy/erddapy.py (Outdated)

```python
    }
    num_cores = multiprocessing.cpu_count()
    returns = Parallel(n_jobs=num_cores)(
        delayed(parse_results)(url, key, protocol="tabledap")
```
Could `protocol` also be a kwarg? Or at least be mentioned in the docstring? Right now you have to read the code to figure out that the function only searches tabledap.
That's a good point. `protocol` added as a kwarg in 4ab1e5a.
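To illustrate what forwarding `protocol` as a kwarg looks like: the PR itself uses joblib's `Parallel`/`delayed` (see the diff above), but this standalone sketch swaps in the standard library's `ThreadPoolExecutor` and a stubbed `parse_results` so it runs without joblib. The function and server names are illustrative, not the PR's actual API.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_results(url, key, protocol="tabledap"):
    # Stub standing in for the real parser; the real implementation
    # fetches `url` and filters rows by `protocol`.
    return {key: f"{protocol} results from {url}"}


def search_servers(servers, protocol="tabledap"):
    """Search every server in parallel, forwarding the protocol kwarg."""
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(parse_results, url, key, protocol=protocol)
            for key, url in servers.items()
        ]
        results = [f.result() for f in futures]
    # Drop servers that returned no results.
    return [r for r in results if r is not None]
```

With `protocol` threaded through like this, a caller can search griddap datasets without touching the function body.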
@abkfenris thank you for the review. I hope the subsequent commits have improved this PR. @ocefpaf, could you look this over? I'm not sure why the pre-commit is failing again; it worked fine locally :/
Nice! It looks a lot cleaner now.

I think there are still some ways it could be cleaned up even further, which would help testability and reusability. Extracting the data transformation and the URL generation into their own functions would help. For example, the current `fetch_results()` could become two functions that are much easier to test individually:
```python
from typing import Dict, Optional

import pandas as pd
from pandas import DataFrame


def parse_results(
    data: bytes, url: str, key: str, protocol: str = "tabledap"
) -> Optional[Dict[str, DataFrame]]:
    df = pd.read_csv(data)
    try:
        df.dropna(subset=[protocol], inplace=True)
    except KeyError:
        return None
    df["Server url"] = url.split("search")[0]
    return {key: df[["Title", "Institution", "Dataset ID", "Server url"]]}


def fetch_results(
    url: str, key: str, protocol: str = "tabledap"
) -> Optional[Dict[str, DataFrame]]:
    """Parse search results from multiple servers."""
    data = multi_urlopen(url)
    if data is None:
        return None
    return parse_results(data, url, key, protocol=protocol)
```
Then `parse_results()` can be tested by passing it concrete data, and `fetch_results()` can be tested with `parse_results()` mocked out, in addition to using a library like VCR.

`parse_results()` could then also be used by a server-specific search method on `erddapy.ERDDAP`, or if I wanted to use httpx to make async requests instead. Similarly, generating the search URL could have multiple uses too.

It might make sense to pull these functions out into a separate file to keep things more organized.
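As a concrete illustration of "testing `parse_results()` by passing it concrete data": the sketch below redefines a standalone `parse_results` (with `url`/`key`/`protocol` as explicit parameters, per the review) and feeds it an in-memory CSV. The CSV contents, dataset names, and test name are all made up for the example.

```python
import io

import pandas as pd


def parse_results(data, url, key, protocol="tabledap"):
    """The reviewer's sketch, with url/key/protocol as explicit parameters."""
    df = pd.read_csv(data)
    try:
        df = df.dropna(subset=[protocol])
    except KeyError:  # server has no column for this protocol
        return None
    df["Server url"] = url.split("search")[0]
    return {key: df[["Title", "Institution", "Dataset ID", "Server url"]]}


def test_parse_results_drops_rows_without_protocol_url():
    csv = io.StringIO(
        "Title,Institution,Dataset ID,tabledap\n"
        "SST,NOAA,sst_daily,https://example.org/tabledap/sst_daily\n"
        "Winds,NOAA,winds_daily,\n"  # no tabledap URL, so the row is dropped
    )
    result = parse_results(csv, "https://example.org/search?q=sst", "example")
    df = result["example"]
    assert list(df["Dataset ID"]) == ["sst_daily"]
    assert (df["Server url"] == "https://example.org/").all()
```

No network, no mocks: the function is exercised end to end against data the test fully controls.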
erddapy/erddapy.py (Outdated)

```python
    df = pd.read_csv(data)
    try:
        df.dropna(subset=[protocol], inplace=True)
    except KeyError:
        return None
    df["Server url"] = url.split("search")[0]
    return {key: df[["Title", "Institution", "Dataset ID", "Server url"]]}
```
I'd suggest splitting out the parsing of data from your data fetching function.
Thanks @abkfenris, I've refactored into several smaller functions and pulled all of these out into a separate file. I'll start working on some tests, which should be easier with the more atomic structure you recommended.
Am I on the right track with the tests here? I'm not sure how to proceed with the more involved ones like …
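One way to make the network-touching functions testable without VCR or `unittest.mock` is dependency injection: let `fetch_results` accept its opener as a parameter. This is a sketch of that alternative technique, not the PR's actual code; the names mirror the PR, but the return shape here is simplified to plain bytes.

```python
from typing import Callable, Dict, Optional


def multi_urlopen(url: str) -> Optional[bytes]:
    # Placeholder for the real network call; never hit in the asserts below.
    raise NotImplementedError


def fetch_results(
    url: str,
    key: str,
    protocol: str = "tabledap",
    opener: Callable[[str], Optional[bytes]] = multi_urlopen,
) -> Optional[Dict[str, bytes]]:
    """Fetch search results; None means the server could not be reached."""
    data = opener(url)
    if data is None:
        return None
    # The real code would hand `data` to parse_results(); echoed for the sketch.
    return {key: data}


# An unreachable server yields None instead of raising.
assert fetch_results("https://example.org/search", "example", opener=lambda _: None) is None
# A fake opener stands in for the network entirely.
assert fetch_results("u", "k", opener=lambda _: b"csv bytes") == {"k": b"csv bytes"}
```

Tests then pass a lambda instead of patching module globals, which keeps them independent of where the function lives.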
```diff
@@ -33,6 +33,23 @@ def urlopen(url: str, auth: Optional[tuple] = None, **kwargs: Dict) -> BinaryIO:
     return data


+def multi_urlopen(url: str) -> BinaryIO:
```
Let's "fold" this one into the canonical `urlopen` by making the latter a thin wrapper around this one. That will allow us to cache the results there.

Let's tackle this in another PR.
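A minimal sketch of that wrapper-plus-cache idea, assuming `functools.lru_cache` as the caching layer (the actual PR may choose a different cache). The byte payload and the `fetch_log` bookkeeping exist only to make the caching observable; the real `multi_urlopen` would perform the network request.

```python
import functools

fetch_log = []  # records every real "network" fetch, for demonstration only


@functools.lru_cache(maxsize=None)
def multi_urlopen(url: str) -> bytes:
    """Stand-in for the uncached fetch; a real version would hit the network."""
    fetch_log.append(url)
    return b"payload for " + url.encode()


def urlopen(url: str, auth=None, **kwargs) -> bytes:
    """Canonical entry point, now a thin wrapper over the cached fetch."""
    # auth/kwargs are accepted for API compatibility; in this sketch only
    # the URL participates in the cache key.
    return multi_urlopen(url)
```

Repeated calls to `urlopen` with the same URL hit the cache, so only the first call reaches the underlying fetch.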
Co-authored-by: Filipe <[email protected]>
Thanks for the help @abkfenris and @ocefpaf! Can we close #32 now?
First attempt at resolving #190