-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
harmonize-wq #157
Comments
Hello there @jbousquin, thank you for submitting this issue--welcome to the pyOpenSci community! Just wanted to let you know we've seen your issue. The next step is for us to run some initial checks, we will give that first feedback soon. In the meantime, if you have any questions you can ask here or in our discourse. |
Editor in Chief checksHi there! Thank you for submitting your package for pyOpenSci Please check our Python packaging guide for more information on the elements
Editor commentsAs a Floridian, I do appreciate your tutorial locations 🐊 A few quick fixes:
In the meantime, I'll start hunting for an editor to facilitate a review for you! |
Thanks @isabelizimm - made those suggested changes on pyOpenSci-review branch. Let me know if there is anything else while we wait. |
No other tasks yet! That should be good to start. I think I've got an editor just about figured out, I will let you know for sure mid-next week. |
Update: @Batalex will be the editor for |
Hey @jbousquin, |
Thanks @Batalex. No questions so far, let me know if anything comes up. |
👋 Hi @rcaneill and @Jacqui-123! Thank you for volunteering to review for pyOpenSci! Please don't hesitate to introduce yourselves. @jbousquin, I am pleased to announce that we found our A-team to proceed with the review. Please fill out our pre-review surveyBefore beginning your review, please fill out our pre-review survey. This helps us improve all aspects of our review and better understand our community. No personal data will be shared from this survey - it will only be used in an aggregated format by our Executive Director to improve our processes and programs.
The following resources will help you complete your review:
Please get in touch with any questions or concerns! Your review is due: April 8thReviewers: @rcaneill, @Jacqui-123 |
I just filled the survey |
Hi @jbousquin I am happy to review this package and will start soon :) |
Thanks @rcaneill! Let me know as things come up :) |
Package ReviewPlease check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
DocumentationThe package includes all the following forms of documentation:
Readme file requirements
The README should include, from top to bottom:
NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the a badge for pyOpenSci peer-review will be provided upon acceptance.)
UsabilityReviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole.
Functionality
For packages also submitting to JOSS
Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted. The package contains a
Final approval (post-review)
Estimated hours spent reviewing: 8-10 Review Comments
|
Please find below a list of comments, with my own format (editor's privilege 🐈⬛ ) Praises
Typos
Nitpicks
Discussions
Suggestions
def set_basis(df, mask, basis, basis_col):
return df.assign(**{basis_col: np.where(mask, basis, np.nan)}) I find this implementation easier to read (but I understand that this is debatable), but it is also more efficient. I have noticed that you use this pattern quite a few time throughout the code base, so I figured this might interests you. Todos
Issues
General recommendationsCode quality is important in a public package. To do so, I would advise you to use both a linter and a formatter.
This is up to debate of course, some people might prefer one tool over another, but the point is that a project using such tools:
If you are ok with everything I said so far, I'd be happy to propose a PR to help you setup everything. |
I'll start addressing these on a pyOpenSciReview branch (I'll try to be better about merging to main so other reviewers aren't running into the same things). Will generate a issue task list w/ any that are more involved. Let me know if there is anything else that I should be doing for review/edit tracking. Would love a PR for black & ruff setup - have been running a linter and code analysis locally and definitely see the value for contributors/maintenance. Only concern is being able to easily ignore certain conventions when appropriate. |
@Batalex fixing issue (general): circular dependencies - will be a breaking change. To resolve I moved functions from harmonize, df_checks()/add_qa_flag() to clean, convert_unit_series() to convert and units_dimension() to wq_data (to become a method). These seemed as logical a place to find them as harmonize. Now importing specific functions from other modules where practical. This breaks docs - before addressing that I wanted to confirm this is what you had in mind? |
@jbousquin Based on a quick look through the PR, yes that's exactly what I had in mind |
Great package! I hope these comments are helpful. This was my first package review so please let me know if there is anything I missed or if I was misguided with any of my comments. Package ReviewPlease check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
DocumentationThe package includes all the following forms of documentation:
Readme file requirements
The README should include, from top to bottom:
NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the a badge for pyOpenSci peer-review will be provided upon acceptance.)
UsabilityReviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole.
Functionality (Skipped this)
For packages also submitting to JOSS
Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted. The package contains a
Final approval (post-review)
Estimated hours spent reviewing: approximately 8 Review Comments
|
Hello @Batalex, Apologies - I was hoping to have gotten the tutorials checked against changes and summaries of changes/responses copied over here before getting buried in field work in June. My intent is not to be non-transparent, just hopeful I would have had a chance to do those small tasks by now. Should see some movement this week. Thank you for your understanding. |
Hey @Jacque-123, 1. Harmonize_Pensacola.Rmd: Two PRs (67 & 78 from branch 62) were used to make these suggested updates to the example for setting up and running the python package in R. The second PR focused on CI/CD tests via git actions that will render the rmd to help ensure there are not errors. One of the runners generates an artifact for easier inspection (e.g., https://github.com/USEPA/harmonize-wq/actions/runs/9811884248/artifacts/1672085755). Hopefully that will help make it easier to identify any further text edit suggestions you have. 2. Harmonize_CapeCod_Simple.ipynb Generated an issue for this, hard to reproduce but I have a feeling it has to do with dependency management and how you installed the package. I'm hopeful changes to the pyproject file will fix it (USEPA/harmonize-wq@b125f65), but if not we can try to dig into this error a bit more. 3 usability: I suspect the issue here is having something that goes deeper than the function documentation, which is what the tutorials are meant to do. There are a lot of functions (50+), each should currently be documented in numpy style (with input/return parameter types/descriptions and examples). clean.add_qa_flag() is meant to be used by higher level functions to add QA_flags as the data are cleaned and harmonized, i.e., to make changes/assumptions that might have quality issues more transparent to the user and allow them to filter/remove on them if it doesn't meet their QA standards. Those higher-level functions should document the specific flag string used, e.g., basis.basis_from_unit() provides an example where speciation was updated by conflicting meta-data. However, the add_qa_flag() function is exposed to the user because we can't anticipate all data they may want to flag. The example shows a custom mask and flag text (it's a bit of a spam and eggs type example, simplified to show how it works) whereas examples in the tutorials are more 'real-world'. In e.g., Harmonize_Pensacola_Detailed.ipynb, code-block 11 we show how the docstring for harmonize_locations can be displayed, and that references how it implements 'QA_flag' to identify 'any row that has location based problems like limited decimal precision or an unknown input CRS'. In code-block 15 of the same demo we examine what QA_flags were assigned. In code-block 27-29 we look to this flag to help explain why ResultMeasure/MeasureUnitCode is NaN. There are several additional examples of this in that notebook and all of the detailed notebooks should follow a similar structure. Please let me know if any of the functions are missing documentation or examples in the docs, if you have suggestions for improving any of those descriptions/examples, if you have suggestions for improving the detailed tutorials to make the use of QA_flags clearer, etc. 4. I am curious to know if the package looks at or flags the different method detection limits (mdl) that different analytical laboratories often use, or if that is an issue with this dataset? I tend to run into this issue in my work but I don't typically work with EPA datasets. This is 100% the direction of some future feature adds. Specifically, 17 plans to address detection limits. It is a multi-part problem though. The existing function will pull in detection limits from that specific meta-data table, but then it needs to be compared against results to determine if the result value was under it and a QA_flag needs to be assigned. If the result value is under the limit there are several alternatives to estimate values statistically (user would have actively choose to alter results in this way but we could port the functionality from USEPA/EPATADA). However, as you've also identified the data-provider may have specified a method with a standard MDL, in which case the detection limit might not be in the meta-data table and might have to be inferred from those methods. Methods filtering 37 is the first step for that, where we start to develop a table/dict of standard methods and try to recognize them (a lot of differences in how they are entered). MDL could be associated with each as a col in that table/lookup or in a related table/lookup. |
@Batalex - weird I commented your responses a couple weeks ago, but just came back to make sure I hadn't missed anything from you and don't see that comment here... I'll try to re-create, mainly just copying over month old status from the repo (there is also follow-up on your draft PR that I'd written after as follow-up in case you didn't see it here) |
@Batalex If you would like additional links/line numbers just let me know: Typos Nitpicks nitpick (domain.py): there is no need for a raw string for TADA_DATA_URL Discussions Suggestions suggestion (domain.py): We could replace the following pattern for x in list(set(pandas_series)) by using the .unique method suggestion (domain.py, basic.py): out_col_lookup does not need to be a function. Same for all other functions returning a dict. If we make those simple module-level dicts, we can still list the sources in the module docstring. suggestion (convert.py): We could add "references" sections in the docstrings so that the sources are present in the website and not only in the source code. suggestion (basis.py, general): By using pandas' methods, we could streamline a little some operations. The choice is ultimately yours; I prefer using existing methods over rolling my own implementations, even if that means that other folks need to go to the documentation website to understand what is going on. I agree on using existing methods, I really tried to implement this suggestion but ran into issues. In the provided example if there are existing values in columns those need to be preserved. That can be done with an if/else. Additionally, numpy.where will coerce the other values (y) to the dtype which is problematic for nan. Do-able, but more complex than the current solution. Todos contributing.rst Issues General recommendations To summarize, working on implementing black. All the code changes are sitting on the pyOpenSci-review branch. It runs locally as suggested in your PR. I'm trying to get my head around pre-commits so that contributors will have style/format checks without having to run it locally. |
@rcaneill - Really appreciate your doing issues/PRs over on the repo (saves steps!). I think we resolved everything over there (leaving the citation issue open so it gets resolved after), but let me know if I missed anything from your review here. |
@jbousquin, here is some quick feedback.
I am not sure how using a raw string is relevant to the reasons you mentioned. Maybe we are not talking about the same thing: I am speaking about the
The idea would be to add the sources and any relevant information in the module docstring: constants.py """
Constants submodule.
References
-----------
Plank:
The NIST Reference on Constants, Units, and Uncertainty. [NIST](https://en.wikipedia.org/wiki/National_Institute_of_Standards_and_Technology). 20 May 2019.
"""
plank = 6.62607015e-34 Then you can access the source using
As for the rest of my original points, I am okay with the changes / reasons not to change. Nice job! |
@Jacqui-123, @rcaneill Were your concerns addressed? |
Doesn't need to be raw string (see Batalex pyOpenSci/software-submission#157 (comment))
@Batalex RE quick feedback: Ah! You really did mean it being raw string not it being a constant, resolved on branch (passing, will merge with the linting). docstrings for dict constants - what I was stuck on was what to document it as if module level (''Attributes'' for sphinx). I'm not sure how to do the child level of an attribute, e.g., Examples, but I'll play around with it. docstring at the variable I wasn't sure how to associate it (still not sure of that, but looking at the sphinx doc helped me understand it needed to be after), documented that way the child level works, but I see where it doesn't seem to be part of the module level help, and I'm not sure how you would get help to retrieve the variable level doc-string (will look into that if module level doesn't work out). |
@jbousquin Thanks so much for the detailed response to my review/comments. The changes look great, and I appreciate your explanations. @Batalex I don't have anything further to add but let me know if you need anything else. |
Perfect, I just need you to check the approval box in your review above. Thank you so much for contributing to this review! |
@Batalex RE:RE quick feedback: module level doc-strings are passing for both help() and docs. pre-commits are very close to working, just need ruff to see settings in pyproject.toml like it does when local. Tried a few things based on pre-commit issues but haven't solved it yet. Close to just writing them out in the config - but reluctant since that duplicates what is in the toml (more maintenance making sure they always match) |
@Batalex I am happy with the changes made / the answers when the authors disagreed with me |
* Implementing suggested ruff rules * isort * Fix whitespace (many of these were copied from docs example execution - need to confirm it passes docs tests) * Run test.yml on push to this branch * Whitespace * F401 (redundant alias) * Missed whitespace * First attempt w/ pre-commit * Fix indent * indent/drop name * Rename .pre-commit-config.yml to .pre-commit-config.yaml yAml * Update .pre-commit-config.yaml fix file structure * Reduce .pre-commit-config.yaml Reduce what files it is run on * Update domains.py Doesn't need to be raw string (see Batalex pyOpenSci/software-submission#157 (comment)) * Dict doc strings as module level attributes * Update to main (#88) * Update domains.py 'Field' -> 'Field***' * 62 r test ci (#86) Update test_r.yaml to install conda outside r, specifically miniforge, then run on env from setup with current package (vs pip installing main) * Update .pre-commit-config.yaml From issue: pass_filenames: false in the pre-commit config so that the file discovery is done by Ruff taking into account the includes and excludes configured by the user in their pyproject.toml * Update .pre-commit-config.yaml Try updating to patch version and specify config in args. * Update pyproject.toml try withouth 'docstring-code-format = true' as this may override other settings. * Update pyproject.toml Try to get pre-commit to see config * Update pyproject.toml Warning message, so it is getting these settings from the toml? * Update conf.py E501 * Update basis.py E501 * Update basis.py Moved constant doc-string to module level * Update clean.py E501 * Update convert.py E501 * Update conf.py lint/format edits * Update pyproject.toml Without single checking if double is default * Update pyproject.toml Will move to one or the other (likely default double for ease), but trying to post-pone to work through diff * lint/formating * linted * W293 * black format/lint * W605 - try pulling r str out of test doc-string and instead as a comment. Comment shouldn't cause problems but this one has in the past. * I001 (all whitespace except test_harmonize_WQP.py) * lint conf file * lint * Add white space between module doc-string and imports * Format: add whitespace after mod doc-string * Add assert for actual2 - where the characteristics specific function is used instead of the generic. * Resolved some E501 * Check if new line fails doctest * Revert to get doc-test passing * Spread out example df entry * Spread out dict read out to reduce line length. White space is already normalized for doc-test so this may pass. * Revert * Spread out building df for wq_dat.WQCharData example. * spread out example df for we_date.measure_mask() * Shotern len of dict for wq_data.replace_unit_str() & wq_data.apply_conversion() examples * Attempt to skip E501 on this line * skip rule on line * Last attempt to ignore line too long in docstrings (3) * Update pyproject.toml Drop single quote for lint * '' -> "" * Update test.yml Revert back to testing on main only
@Batalex - resolved ruff checks with pre-commits on PR 89, please let me know if there is anything unresolved from your review. Really happy getting lint/formatting as part of this workflow and thank you as the edits to the pyproject.toml in your draft PR helped immensely! |
🎉 Author Wrap Up TasksThere are a few things left to do to wrap up this submission:
It looks like you would like to submit this package to JOSS. Here are the next steps:
🎉 Congratulations! You are now published with both JOSS and pyOpenSci! 🎉 Editor Final Checks
If you have any feedback for us about the review process please feel free to share it here. We are always looking to improve our process and documentation in the peer-review-guide. |
Author Wrap Up TasksWill update as tasks to wrap up this submission are completed:
It looks like you would like to submit this package to JOSS. Here are the next steps:
|
hey team. has this package been accepted by JOSS? it looks like we might be able to update the labels, fill out the header and close it if that is the case. please let me know however. |
It looks to me like we just need the archive link, the version accepted and a label change! |
Yes - it cleared JOSS and citation info on the repo are updated. Let me know if you need any info/action from me. Thanks! |
Submitting Author: Justin Bousquin (@jbousquin)
All current maintainers: (@jbousquin)
Package Name: harmonize-wq
One-Line Description of Package: Standardize, clean, and wrangle Water Quality Portal data into more analytic-ready formats
Repository Link: https://github.com/USEPA/harmonize-wq
Version submitted: 0.4.0
Editor: @Batalex
Reviewer 1: @rcaneill
Reviewer 2: @Jacqui-123
Archive: https://doi.org/10.5281/zenodo.13356847
JOSS DOI:
Version accepted: 0.5.0
Date accepted (month/day/year): 08/10/2024
Code of Conduct & Commitment to Maintain Package
Description
The US EPA's Water Quality Portal (WQP) is a data warehouse that facilitates access to data stored in large water quality databases in a common format. There are tools to facilitate both publishing data to and retrieving data from WQP, harmonize-wq is focused on retrieved data (1) cleaning to ensure it meets the required quality standards, and (2) wrangling to get it in a more analytic-ready format. Although there are many examples where this has been done, standardized tools to perform this task could make it less time-intensive, more standardized, and more reproducible.
Scope
Please indicate which category or categories.
Check out our package scope page to learn more about our
scope. (If you are unsure of which category you fit, we suggest you make a pre-submission inquiry):
Domain Specific & Community Partnerships
Community Partnerships
If your package is associated with an
existing community please check below:
For all submissions, explain how the and why the package falls under the categories you indicated above. In your explanation, please address the following points (briefly, 1-2 sentences for each):
Who is the target audience and what are scientific applications of this package?
Water quality domain experts trying to synthesize available data in a stream, bay, estuary, etc.. More standardized data cleansing and wrangling allows outputs to be integrated into other tools in the water quality data pipeline, e.g., for integration into dashboards for visualization (Beck et al., 2021) or decision support tools (Booth et al., 2011).
Are there other Python packages that accomplish the same thing? If so, how does yours differ?
No python packages to my knowledge, there is in R: USEPA/TADA
If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or
@tag
the editor you contacted: Presubmission: harmonize-wq #132Technical checks
For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:
Publication Options
JOSS Checks
paper.md
matching JOSS's requirements with a high-level description in the package root or ininst/
.Note: JOSS accepts our review as theirs. You will NOT need to go through another full review. JOSS will only review your paper.md file. Be sure to link to this pyOpenSci issue when a JOSS issue is opened for your package. Also be sure to tell the JOSS editor that this is a pyOpenSci reviewed package once you reach this step.
Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?
This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.
Confirm each of the following by checking the box.
Please fill out our survey
submission and improve our peer review process. We will also ask our reviewers
and editors to fill this out.
P.S. Have feedback/comments about our review process? Leave a comment here
Editor and Review Templates
The editor template can be found here.
The review template can be found here.
Footnotes
Please fill out a pre-submission inquiry before submitting a data visualization package. ↩
The text was updated successfully, but these errors were encountered: