-
-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
restez -- submission #232
Comments
Editor checks:
Editor commentsThanks for your submission @DomBennett ! ── GP restez ───────
It is good practice to
✖ write unit tests for all functions, and all package code in general. 87% of code lines are
covered by test cases.
R/biomartr-tools.R:41:NA
R/biomartr-tools.R:42:NA
R/biomartr-tools.R:43:NA
R/biomartr-tools.R:44:NA
R/biomartr-tools.R:45:NA
... and 65 more lines
✖ avoid long code lines, it is bad for readability. Also, many people prefer editor windows
that are about 80 characters wide. Try make your lines shorter than 80 characters
R/rentrez-wrappers.R:26:1
tests/testthat/test-setup.R:43:1
✖ avoid 1:length(...), 1:nrow(...), 1:ncol(...), 1:NROW(...) and 1:NCOL(...) expressions.
They are error prone and result 1:0 if the expression on the right hand side is zero. Use seq_len()
or seq_along() instead.
R/gb-get-tools.R:55:19 Seeking reviewers now 🕐 Reviewers: |
@sckott I'd be happy to review this if you like. Right in my wheelhouse and looks like something I would make use of for my own research/teaching. |
@naupaka thanks for offering, that would be great, thanks, is 3 weeks okay? |
Yes that should work. 👍🏼 |
grerat, thanks |
Hi all! Find attached my review. Sorry for the delay. Great job overall @DomBennett! Note that it's my first time reviewing for rOpenSci, and partially for that reason, I think I tended to focus on higher level stuff. Please don't hesitate to ask for any clarification. -- Evan Package ReviewPlease check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.
DocumentationThe package includes all the following forms of documentation:
Functionality
Final approval (post-review)
Estimated hours spent reviewing: I estimate that I spent 13 hours on this review. Review CommentsGeneral CommentsOverall, this is a well thought out and documented package. I don't work that often with GenBank sequences myself, but based on discussion with others, it seems this package will usefully solve the problem of easily downloading and querying sequences from NCBI. In general, all described functionality in the package seemed to operate as intended. I particularly appreciated the pkgdown site and multiple vignettes. These will be great for new users and offer an easy way to communicate changes and new features as the package develops. I tried to conduct my review in the spirit of a "naive user". I first went through some of the code base checks recommended in the rOpenSci reviewing guide. From there, I worked through the package functionality in a way I expect most new users would: I went from the README to the "Get started" tutorial on through the other vignettes. I referenced the function descriptions and help files throughout. I think my comments represent mostly minor issues, but hopefully they will improve overall organization and usability of the package. I've arranged them below in rough order of importance. Specific CommentsExplain Maintenance of Multiple SQL Databases: Given the performance issues associated with querying large databases, users might reasonably want to have multiple local SQL databases containing data from different taxa, different sequence lengths, etc. I would guess that re-coding functions to allow for the assignment of different SQL database names within the same restez path would be a bit much at this point (but potentially a feature for future versions?). In lieu of that, I would simply recommend pointing out to users that setting different restez paths would be an easy way to maintain different databases containing different sets of information that could referenced from within one analysis script. This is a very simple solution, but it didn't occur to me until I spent a few hours with the package, so a new user would probably benefit from knowing that it's an option for maintaining multiple local databases. More Informative Error Messages When restez Path Not Set: You've done your due diligence in reminding users to run Change Don't Export Update Clarify SQL Database Size Requirements: I think there is some issue with the communication of expected database sizes when running
However, after download and database creation (which went fine),
So all told, my local, raw sequence files were smaller than expected (only 12 GB), but the SQL database was much larger than anticipated, resulting in a local database folder that was roughly twice what was expected (30 GB). Is this a peculiarity of this dataset? Something that went wrong with the database generation? Revise References To Defunct Functions: There were some confusing references to Differing Typo Fixes: Make Vignette Availability More Obvious: It's great that you put in the time to generate useful vignettes for this package. I'd recommend some subtle changes to the README to make this more apparent for users. Why not move the sentence "For more detailed tutorials, visit the restez website" under a new heading below "Quick Examples" called "Detailed Tutorials"? I think that would be preferable since then new users could first execute the "Quick Examples" code to see how setup, querying, and Entrez wrappers work. Then they could move on to more in-depth tutorials with the vignettes. I think the link in this case should also point directly to the articles page of the package website (rather than the website index, which just replicates the README page). Revise README Graphic: I appreciated the README file graphic that illustrated the overall package organization. I wonder if a couple of small changes could make this more comprehensive, however. First, would it make more sense for the white box for "restez/" to be moved above the blue box containing "downloads/" and "sql_db"? To me, this would be more accurate since "downloads/" and "sql_db" will be found within the "restez/" directory on a user's machine. Second, could you update the "Query" box to contain all the exported Better Organize Function Types (Families): This primarily applies to the Harmonize Vignette Workflows: This is minor, but I might consider reorganizing the multiple vignettes slightly such that they all follow the same conventions and workflow (i.e., all set restez paths to the user's desktop, all delete local GenBank databases following analysis to illustrate best practices, etc.). Performance Issues With Querying Large Databases: There may not be a good way around this, but I noticed that the In the course of discussing with @noamross, he suggested migrating to a MonetDB structure from SQL might improve these performance issues. I definitely don't think this is critical to package functionality at present, but just a consideration for future package development. roxygen2 Update: |
👋Hi all! Per Evan's comment above, I had suggested using |
Thanks for such a detailed review @eveskew! I really appreciate the hours you have put into trying and testing the package. This whole ROpenSci malarky is pretty nifty. @noamross, I will investigate using |
|
Correct. @DomBennett Each of those databases uses an entirely different format / schema for organizing the taxonomic information, so a query that works with, say, COL to return classification won't work with ITIS, etc.
On the user-facing side, |
@DomBennett Very cool package by the way, don't want to interfere with the reviews but looks very well implemented! In general I'm excited to see more packages that interact with local databases instead of needing millions of API calls. Completely minor note here, but the way you read in the compressed flat files into the database is pretty slick, though I'm pretty sure you could skip the |
thanks for the help carl |
Thanks @cboettig! It'd be cool if I wouldn't want to try anything at the moment though. I think simple and basal is best and I always fear the monolith programs. |
Right, as you probably know, NCBI taxa dump already includes a pretty nice list of mapping both NCBI's valid scientific name and recognized synonym scientific names to taxon ids (and even common names), so the NCBI maintainers clearly had queries by species/genus or even common name (including at common names that refer to higher taxonomic ranks -- e.g. "fishes") as valid. So the NCBI synonyms table probably covers much of the matching species in NCBI to species in other databases -- I suspect (but haven't tried yet, could be very wrong) that most unmatched names in the other databases would be from species that don't have any match in NCBI (either because a species group was later split or renamed in one database but not another, or because certain clades just aren't well covered in NCBI). In any event, there's already existing lists of mappings between recognized synonyms, not just from NCBI, but also at the actual identifier level, e.g. as provided by WikiData or EOL. So in principle this should make it pretty easy to crosswalk. A goal of |
@sckott (et al) sorry for the delay -- deadline blew right past. I finish up at ESA Friday and can get to it this weekend. |
Okay, thanks @naupaka |
@naupaka ping 😸 |
I think that's right that I'm getting a failure on that vignette too. @DomBennett could you address these few things soon so the reviewers can take another look |
Hmmm.... it looks like local installs of a package do not lead to automatic updates for dependencies: https://stackoverflow.com/questions/51032658/update-dependencies-when-installing-a-local-package
As for the vignette building, I've explicitly tried to avoid building them in the R package as the vignettes depend on a database of all rodents which I build in a separate script ( To hopefully rectify the issue, I have more closely followed the advice from the pkgdown documentation (https://pkgdown.r-lib.org/articles/pkgdown.html#articles): dropping vignette builder in the DESCRIPTION, removing vignette details in the article yaml's and, as already done, adding vignettes/ to .Rbuildignore. Does the Thanks! |
installs and checks fine for me now |
thanks for the reminder @DomBennett @eveskew @naupaka are you happy with the changes? if no response by mid next week I'll assume you're good |
Approved! Thanks again for your submission @DomBennett ! To-dos:
We've started putting together a bookdown with our best practice and tips, this chapter starts the 3d section that's about guidance for after onboarding. Please tell us what could be improved. The repo is at https://github.com/ropensci/dev_guide Are you interested in doing a blog post for our blog https://ropensci.org/blog/ ? either a short-form intro to it (https://ropensci.org/technotes/) or long-form post with more narrative about its development (https://ropensci.org/blog/). If so, we'll have our community manager @stefaniebutland get in touch with you on that |
Hi all! Sorry for the late response on this, but I've looked a bit more into the revised package. @DomBennett did a very thorough job in responding to my suggestions, and I really like some of the new package functionality (the improved However, I ran into a snag that I wanted to bring to Dom's attention before he migrates the package. Here's an issue I had with
I believe I tracked the issue down to a couple of lines having to do with the
you want
This change at least resulted in Sorry again to bring up an issue at this late hour. I know we should move along to get the package out to users, so if there are any package changes, I can be sure to review quickly and verify that Best, |
Thanks @eveskew! I also found this error as I was transferring to ROpenSci. You're fix does get rid of the error -- thanks! But it does indicate that my parsing of the GenBank release notes wasn't quite up to scratch! I've gone through and made some changes to my REGEX patterns to solve the issue. Because this could potentially be an issue every month when the latest GenBank version is released, I've also added a warning that will be raised if there are indications that the downloaded file information is not complete. I will push these changes once I have push access on the now transferred repo (@sckott access please!) The other change I'll want to make before I push to CRAN is to switch from subprocess to callr. This should reduce the amount of coding and shouldn't impact functionality. |
@DomBennett you have admin access |
@sckott I added the restez repo to the restez team. I think I need to be a member of that to push, right? At least, I don't seem to be able to add myself to that team. |
your approved now |
@DomBennett, just wanted to report that I cloned the rOpenSci version of the package, and all seems to be well now! I was able to download a sample of real data (phage sequences), create the database, and query it successfully using the suite of |
Thanks very much @eveskew! |
@DomBennett I just re-tried to set up the plant/fungal DB, and it hung on the
Is this the same error that you were talking about earlier? |
@naupaka Hmmm... looks like a new error I will investigate it! I have in the past had issues with the database but haven't always been able to recreate the error. You might find that re-running the Also note, I have just updated the (The database behaviour is not 100% predictable, at least for me. For example, I found I couldn't build a database on a USB stick, I should add that to the documentation.) |
Good to know. I will pull the new version and give it another go. |
Thanks! For clarity, I'm now testing the following commands: # devtools::install_github('ropensci/restez')
library(restez)
restez_path_set('.')
db_download(preselection = '7')
db_create() |
Hi @naupaka, The plant/fungal database seems to work for me -- see attached. Let me know if you're still having problems. plants_fungi_resetz_rsession.txt |
It's running now. I forgot to start in tmux the first time through and a broken ssh connection killed it. The database building step take many hours. So far so good (downloading worked without a hitch); I should know if it worked by the morning. |
Awesome! All works great for me now. Thanks for all your hard work on this, @DomBennett! Going to be a really helpful tool. |
@sckott Dumb question: Do I now submit the package and paper to JOSS? Or is it automated? Thanks! |
not a dumb question. You do have to submit it manually (maybe someday it will be automated).
|
Thanks for the helpful answer! I've now submitted the paper -- I was waiting for CRAN to host. |
cool, closing now, seems like JOSS and blog post are started |
Summary
The restez package downloads all or sections of GenBank and creates a local SQLite copy of the database for querying. The package comes with a series of useful functions for querying the database and is designed to work with rentrez.
https://github.com/AntonelliLab/restez
data retrieval, for users that wish to retrieve lots of sequence information and find NCBI's Entrez too slow.
Researchers wishing to perform any form of analysis with DNA sequence data. For my own purposes, I will use the package to retrieve large amounts of sequence data for phylogenetic analysis.
yours differ or meet our criteria for best-in-category?
Hajk-Georg Drost's Biomartr is similar and is in fact the inspiration for restez. It only allows users to download genome specific data, however, not GenBank sequences. From the ecologcial sciences perspective, genome data is just not nearly taxonomically representative enough yet for any questions concerning biodiversity.
NA
Requirements
Confirm each of the following by checking the box. This package:
Publication options
paper.md
matching JOSS's requirements with a high-level description in the package root or ininst/
.Detail
Does
R CMD check
(ordevtools::check()
) succeed? Paste and describe any errors or warnings:Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:
If this is a resubmission following rejection, please explain the change in circumstances:
If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:
I imagine the rentrez and biomartr developers to be good reviewers: dwinter and HajkD
The text was updated successfully, but these errors were encountered: