Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenOmics: Library for integration of multi-omics, annotation, and interaction data #31

Closed
18 of 22 tasks
JonnyTran opened this issue Jan 1, 2021 · 58 comments
Closed
18 of 22 tasks

Comments

@JonnyTran
Copy link

JonnyTran commented Jan 1, 2021

Submitting Author: Jonny Tran (@JonnyTran)
All current maintainers: @JonnyTran
Package Name: openomics
One-Line Description of Package: Library for integration of multi-omics, annotation, and interaction data
Repository Link: https://github.com/JonnyTran/OpenOmics
Version submitted: 0.8.4
Editor: @NickleDave
Reviewer 1: @gawbul
Reviewer 2: @ksielemann
Archive: DOI
JOSS DOI: DOI
Version accepted: v 0.8.8
Date accepted (month/day/year): 04/17/2021


Description

OpenOmics is a Python library to assist integration of heterogeneous multi-omics bioinformatics data. By providing an API of data manipulation tools as well as a web interface (WIP), OpenOmics facilitates the common coding tasks when preparing data for bioinformatics analysis. It features support for:

  • Genomics, Transcriptomics, Proteomics, and Clinical data.
  • Harmonization with 20+ popular annotation, interaction, and disease-association databases (e.g. GENCODE, Ensembl, RNA Central, BioGRID, DisGeNet etc.)

OpenOmics also has an efficient data pipeline that bridges the popular data manipulation Pandas library and Dask distributed processing to address the following use cases:

  • Provides a standard pipeline for dataset indexing, table joining and querying, which are transparent and customizable for end-users.
  • Efficient disk storage for large multi-omics dataset with Parquet data structures.
  • Multiple data types that supports both interactions and sequence data, and allows users to export to NetworkX graphs or down-stream machine learning.
  • An easy-to-use API that works seamlessly with external Galaxy tool interface or the built-in Dash web interface (WIP).

Scope

  • Please indicate which category or categories this package falls under:
    • Data retrieval
    • Data extraction
    • Data munging
    • Data deposition
    • Reproducibility
    • Geospatial
    • Education
    • Data visualization*

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see notes on categories of our guidebook.

  • Explain how the and why the package falls under these categories (briefly, 1-2 sentences):

OpenOmics' core functionalities are to provide a suite of tools for data preprocessing, data integration, and public database retrieval. Its main goal is to maximize the transparency and reproducibility in the process of multi-omics data integration.

  • Who is the target audience and what are scientific applications of this package?

OpenOmics' primary target audience are computational bioinformaticians, and the scientific application of this package is to provide scalable ad-hoc data-frame manipulation for multi-omics data integration in a reproducible manner. Also, we are currently developing an interactive web dashboard and interfaces to the Galaxy Tool Shed, disseminating the tool to biologists without a programming background.

  • Are there other Python packages that accomplish the same thing? If so, how does yours differ?

Existing PyPI Python packages within the scope of multi-omics data analysis are "pythomics" and "omics". Their functions appear to be lacking support for manipulation of integrated multi-omics dataset, retrieval of public databases, and extensible OOP design. OpenOmics aims to follow modern software best-practices and package publishing standards.

Aside from multi-omics integration tools, several specialized Python packages exists for single omics data, such as ScanPy's "AnnData" and "Loom" files. They provide an intuitive data structure for expression arrays and side annotations, and Loom file even allows for out-of-core data-frame processing. However, they don't yet provide mechanisms for multi-omics data integration, where each omics data may have overlapping samples or varying row/column sizes.

  • If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:

#30

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • has an OSI approved license.
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a vignette with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.

Publication options

JOSS Checks
  • The package has an obvious research application according to JOSS's definition in their submission requirements. Be aware that completing the pyOpenSci review process does not guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
  • The package is not a "minor utility" as defined by JOSS's submission requirements: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
  • The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
  • The package is deposited in a long-term repository with the DOI: 10.5281/zenodo.4441167

Note: Do not submit your package separately to JOSS

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

  • Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Code of conduct

P.S. *Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

@lwasser
Copy link
Member

lwasser commented Jan 11, 2021

welcome to pyopensci @JonnyTran someone will followup with you in the next week or so! Our progress is slow right now as we're working o n funding for this effort!! @NickleDave was this one you wanted to work on? you are doing a lot so please let me know if you have time (or if you don't that is understandable too!)

@NickleDave
Copy link
Contributor

@JonnyTran this looks perfect, thank you.
Sorry for not replying when you opened this issue.

Hey @lwasser yes I have this on my to-do list and will start looking for reviewers Wednesday

@lwasser
Copy link
Member

lwasser commented Jan 13, 2021

wonderful! @NickleDave thank you!

@NickleDave
Copy link
Contributor

NickleDave commented Jan 14, 2021

Hi all, just adding editor checks.

Thank you @JonnyTran for your detailed submission

Editor checks:

  • Fit: The package meets criteria for fit and overlap.
    • yes, tool for working with different types of -omics data
  • Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
    • yes, Travis CI
  • License: The package has an OSI accepted license
    • yes, MIT
  • Repository: The repository link resolves correctly
    • yup

submitter did not check "yes" for submitting to JOSS
-- edit 2021-01-22: reviewer will submit to JOSS. Has added DOI to repository in response to my initial comments

  • Archive (JOSS only, may be post-review): The repository DOI resolves correctly
  • Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

Editor comments

Overall:

  • perfect fit for PyOpenSci, definitely looks like it has the potential to be an incredibly useful data retrieval / extraction / munging tool that faciliates reproducible pipelines
    • many use cases e.g. buildng datasets for machine learning from multiple omics sources

Minor comments:
things I notice on first glance

  • no DOI for releases, e.g. using Zenodo integration. Good idea to have DOI to make releases citable, e.g. in papers to make explicit which version was used
  • no explicit link to docs page
    • this could be in README and / or in the "about" section -- GitHub lets you add a link there in addition to the text description
    • I did finally find the docs by clicking on the badge ... but if it were me I would have a big giant link point to the docs as obvious as possible
  • some info could be added to README

This is definitely an "elements of style" question / matter of personal preference, and not a thing that is directly addressed by our developer guide, but:
I really like the readthedocs suggestions for what to include in your README:
https://www.writethedocs.org/guide/writing/beginners-guide-to-docs/#readme
I don't see a lot of those things included in yours now
Don't get me wrong, It is definitely great that you have many concrete examples built around code snippets
but there's a trade-off involved in having many of them at the expense of having all the other information, e.g. more detailed install instructions, info on where/how to raise issues, license type, etc.
Maybe have one good code snippet + then link to the other examples via a link to the docs?
You don't want to miss getting all that other info to potential contributors / other developers who are kind of most likely to look at your GitHub README, IMHO
but for sure many successful projects that don't follow this pattern (see for example the tqdm README)


Reviewers: @gawbul @ksielemann
Due date: February 15, 2021

@NickleDave
Copy link
Contributor

started reaching out to reviewers, will update as soon as we hear back

@JonnyTran
Copy link
Author

Minor comments:

  • no DOI for releases, e.g. using Zenodo integration. Good idea to have DOI to make releases citable, e.g. in papers to make explicit which version was used
  • no explicit link to docs page

Thanks so much for the great suggestions, @NickleDave. I've addressed the "releases DOI" and "docs link" issues above, and I'm planning on updating the README file to be more attractive to potential contributors.

@JonnyTran
Copy link
Author

JonnyTran commented Jan 17, 2021

Hi @NickleDave and @lwasser, will it be possible for OpenOmics to make submission to JOSS at this stage of the review? I just changed my mind about this, but I already have a complete manuscript and I'm ready to provide the paper.md within a few days.

@NickleDave
Copy link
Contributor

NickleDave commented Jan 17, 2021

Hey @JonnyTran I think that could be okay. But first let me make sure I understand what you're asking.

Hi @NickleDave and @lwasser, will it be possible for OpenOmics to make submission to JOSS at this stage of the review?

Do you mean that you want to change "Publication Options" in your submission above, and check the box for "automatically submit to" JOSS?
If so, yes I think that's fine. @lwasser please confirm

I just changed my mind about this, but I already have a complete manuscript and I'm ready to provide the paper.md within a few days.

Do you mean that you have a complete, separate manuscript written about OpenOmics, in addition to the paper.md you would submit to JOSS?
E.g., like Physcraper
#26
which has a paper on biorxiv
https://www.biorxiv.org/content/10.1101/2020.09.15.299156v1

If that's the case, we should discuss more whether you want to submit to JOSS. We can perhaps tag some editors and ask if they can give us input. My impression is that JOSS is usually meant to provide a mechanism for getting publication credit for software in cases where the developer/maintainer can't easily publish a paper about it.
Although I think they might have changed some of the language in their submission guidelines about this.
See for example: https://joss.readthedocs.io/en/latest/submitting.html#co-publication-of-science-methods-and-software

so:
if you just want to go through PyOpenSci review and then submit to JOSS at the end, yes, totally fine assuming @lwasser agrees with me.
Anything else, we should probably discuss a little more first

@NickleDave
Copy link
Contributor

Hi @JonnyTran just want to follow up on this -- please let me know what you're thinking

I think I do have one potential reviewer and can move ahead with finding another whenever you're ready

@JonnyTran
Copy link
Author

Hi @NickleDave, sorry it took awhile to consult with my advisor.
Yes, I meant that I'd like to check the box on automatic submission, and no, that I have not submitted a separate manuscript elsewhere. My intention for the JOSS submission is to publish contributions on the technical software aspects that I have thus far. But in later months (after finishing the web-app features in openomics), I do plan on making another contribution to a bioinformatics journal on the scientific bioinformatics use-cases. So I think this would fall under the "co-publication".

Thank you for the clarifications!

@NickleDave
Copy link
Contributor

great, glad to hear it @JonnyTran
and I totally understand needing to consult with your advisor--didn't mean to rush you, just don't want this to fall of my to-do list

please do go ahead and edit your initial comment to check that automatic submission box, and make sure you address the to-do list in that section

I will continue with my part of the review process

@NickleDave
Copy link
Contributor

Hi again @JonnyTran excited to let you know that @gawbul and @ksielemann have both kindly accepted our invitation to review

@gawbul actually started PyOpenSci back in 2013 (see this ROpenSci blogpost) and develops related tools such as pyEnsemblREST

@ksielemann has significant experience with omics datasets and was recommended to us by @bpucker as developer of the tool QUOD (from their publication https://www.biorxiv.org/content/10.1101/2020.04.28.065714v1.abstract)

@gawbul and @ksielemann here are related links again for your convenience:
Our reviewers guide details what we look for in a package review, and includes links to sample reviews. Our standards are detailed in our packaging guide, and we provide a reviewer template for you to use. Please make sure you do not have a conflict of interest preventing you from reviewing this package. If you have questions or feedback, feel free to ask me here or by email, or post to the pyOpenSci forum.

I will update my editor checks above to add you both as reviewers, and set an initial due date of three weeks: February 15, 2021

@lwasser
Copy link
Member

lwasser commented Jan 25, 2021

this is so awesome!!

@gawbul hello again!! so great to see you here. We are moving forward with PyOS through the Sloan foundation (fingers crossed) as we briefly discussed forever ago. I'd love to see you continue to participate when you have time in whatever capacity you have time for!! :)

thank you all for this review!

@ksielemann
Copy link

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README

It is not completely clear what OpenOmics can be used for. An overview of the available functions and methods would be great: What exactly does OpenOmics do? What are specific usage examples (also e.g. after using OpenOmics: What's next?)?:

'OpenOmics facilitates the common coding tasks when preparing data for bioinformatics analysis.': For which bioinformatic analyses exactly?

  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally

'# Load each expression dataframe': additional ')' in lines should be removed! as current form results in an error

mRNA = MessengerRNA(data=folder_path+"LUAD__geneExp.txt", transpose=True, usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="gene_name")
Results in warnings (only in first time use):
'/homes/.local/lib/python3.7/site-packages/openomics/transcriptomics.py:95: FutureWarning:
read_table is deprecated, use read_csv instead.
/homes/.local/lib/python3.7/site-packages/openomics/transcriptomics.py:95: ParserWarning:
Falling back to the 'python' engine because the 'c' engine does not support sep=None with delim_whitespace=False; you can avoid this warning by specifying engine='python'.'

som = SomaticMutation(data=folder_path+"LUAD__somaticMutation_geneLevel.txt", transpose=True, usecols="GeneSymbol|TCGA", gene_index="gene_name")
Results in: 'KeyError: 'gene_name''. This should probably be 'gene_index="GeneSymbol"'.

luad_data.add_clinical_data(clinical_data=folder_path+"nationwidechildrens.org_clinical_patient_luad.txt")
Results in warning: '/homes/.local/lib/python3.7/site-packages/openomics/clinical.py:51: FutureWarning:
read_table is deprecated, use read_csv instead, passing sep='\t'.'

gencode = GENCODE(path="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/",file_resources={"long_noncoding_RNAs.gtf": "gencode.v32.long_noncoding_RNAs.gtf.gz","basic.annotation.gtf": "gencode.v32.basic.annotation.gtf.gz","lncRNA_transcripts.fa": "gencode.v32.lncRNA_transcripts.fa.gz","transcripts.fa": "gencode.v32.transcripts.fa.gz"},remove_version_num=True,npartitions=5)
Results in: _'AttributeError: 'io.TextIOWrapper' object has no attribute 'startswith''.

  • Function Documentation: for all user-facing functions

Please see above (in 'A statement of need').

  • Examples for all user-facing functions
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING.
  • Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements
The package meets the readme requirements below:

  • Package has a README.md file in the root directory.

The README should include, from top to bottom:

  • The package name
  • Badges for continuous integration and test coverage, a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the badge for pyOpenSci peer-review will be provided upon acceptance.)
  • Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.

Please see above (in 'A statement of need'). The goals could be communicated more specifically.

  • Installation instructions
  • Any additional setup required (authentication tokens, etc)
  • Brief demonstration usage
  • Direction to more detailed documentation (e.g. your documentation files or website).
  • If applicable, how the package compares to other similar packages and/or how it relates to other packages
  • Citation information

Citation information is missing at the end of the README.

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole.
Package structure should follow general community best-practices. In general please consider:

  • The documentation is easy to find and understand
  • The need for the package is clear

Please see above (in 'A statement of need'). Specific examples for bioinformatic analyses after the use of OpenOmics could be added.

  • All functions have documentation and associated examples for use

Please see above (in 'A statement of need'). An overview of all methods and functions of the package would be helpful.

Functionality

  • Installation: Installation succeeds as documented.

Installation with pip install openomics worked fine on the Linux system I am using. However, using my windows computer, I got the following error: error: Microsoft Visual C++ 14.0 is required

A list of dependencies/requirements in the README would be great.

from openomics import MultiOmics results in the "UserWarning: Tensorflow not installed; ParametricUMAP will be unavailable"

  • Functionality: Any functional claims of the software been confirmed.

Please see above.

  • Performance: Any performance claims of the software been confirmed.
  • Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.

Please see above.

  • Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
  • Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 5


Review Comments

The concept of OpenOmics seems interesting and useful for the integration of various omics datasets. Overall, the package has a clear documentation. However, there are still a few issues that should be addressed. Please see the points above and below.

  • Download of test data: Add code to download the data within a Python script so that the user does not have to download the whole repository or describe exactly how to download the test data. (Import TCGA LUAD data included in tests dataset (preprocessed from TCGA-Assembler) folder_path = "tests/data/TCGA_LUAD/" # Located at https://github.com/BioMeCIS-Lab/OpenOmics/tree/master/tests) is not sufficient.) Here is an example on how I did this:
#fetch data
import os
import tarfile
import urllib.request

def fetch_data(file_url, own_path, file_name):
    if not os.path.isdir(own_path):
        os.makedirs(own_path)
    own_file_path = os.path.join(own_path, file_name)
    urllib.request.urlretrieve(file_url, own_file_path)


FILE_NAMES = ["LUAD__geneExp.txt",
              "LUAD__miRNAExp__RPM.txt",
              "LUAD__protein_RPPA.txt",
              "LUAD__somaticMutation_geneLevel.txt",
              "TCGA-rnaexpr.tsv",
              "genome.wustl.edu_biospecimen_sample_luad.txt",
              "nationwidechildrens.org_clinical_drug_luad.txt",
              "nationwidechildrens.org_clinical_patient_luad.txt",
              "protein_RPPA.txt"]

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/BioMeCIS-Lab/OpenOmics/master/tests/data/TCGA_LUAD/"
OWN_PATH = os.path.join("data", "omics")

for file_name in FILE_NAMES:
  FILE_URL = DOWNLOAD_ROOT + file_name
  fetch_data(FILE_URL, OWN_PATH, file_name) 
  • In the ReadTheDocs documentation:

gtex = GTEx(path="https://storage.googleapis.com/gtex_analysis_v8/rna_seq_data/") Results in: OSError: Not enough free space in /homes/.astropy/cache/download/url to download a 3.6G file, only 2.7G left.

Is there a possibility to choose the directory in which the files should be downloaded? This would be great.

@JonnyTran
Copy link
Author

JonnyTran commented Feb 2, 2021

Thanks so much for the fantastic reviews @ksielemann! I can work on the revisions, which should be available in 2 weeks.

Download of test data. Add code to download the data within a Python script so that the user does not have to download the whole repository or describe exactly how to download the test data.

Thanks for pointing this out and providing the automated scripts. I initially placed the test data at tests/data/ for automated tests when you run pytest ./ at the root directory. You can actually copy the codes at tests/test_multiomics.py to load the -omics data.

gtex = GTEx(path="https://storage.googleapis.com/gtex_analysis_v8/rna_seq_data/") Results in: OSError: Not enough free space in /homes/.astropy/cache/download/url to download a 3.6G file, only 2.7G left.

I used the package astropy to automatically cache downloaded files. It defaults to saving the files at /homes/.astropy/cache/, and ideally it should be in one location for each user-session. But I can see how useful it is for the user to be able to make a setting to choose a directory of their choice - I will see about making an openomics configuration file at the user directory, located at ~/.openomics/conf.json. I've made an issue at JonnyTran/OpenOmics#112

@NickleDave
Copy link
Contributor

Just echoing @JonnyTran -- yes thank you for getting this detailed review back so quickly @ksielemann

Looks great to me. I will read in detail this weekend just to make sure I'm staying up to date with the review process

@gawbul
Copy link

gawbul commented Feb 8, 2021

Just working through my review at present. Should have it done by the end of the day. Apologies for the delay.

@gawbul
Copy link

gawbul commented Feb 8, 2021

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README

So the package states it is solving the problem of targeting various multi-omics datasets, though this is fairly broad, perhaps because the intention is to make the scope of the project broader in the future. It isn't clear what datasets and data formats are supported, however. Perhaps this isn't relevant for the README, but I feel it could be included in the linked documentation on Read the Docs. The target audience is implicitly defined and my assumption is that this would primarily be used by bioinformaticians, though perhaps this could be more explicit?

  • Installation instructions: for the development version of package and any non-standard dependencies in README

Installation instructions seemed clear and I attempted to install via pip, however initially received the following error:

Installing collected packages: MarkupSafe, Werkzeug, numpy, Jinja2, itsdangerous, zope.interface, zope.event, urllib3, threadpoolctl, scipy, retrying, PyYAML, pytz, python-dateutil, pyparsing, ptyprocess, llvmlite, joblib, idna, heapdict, greenlet, Flask, chardet, certifi, brotli, zict, typing-extensions, tornado, toolz, tblib, soupsieve, sortedcontainers, scikit-learn, requests, psutil, plotly, pillow, pexpect, patsy, pandas, packaging, numba, msgpack, locket, gevent, future, flask-compress, decorator, dask, dash-table, dash-renderer, dash-html-components, dash-core-components, colorlog, cloudpickle, xmltodict, xlsxwriter, xlrd, wrapt, wget, suds-jurko, statsmodels, requests-cache, pynndescent, pyerfa, pydot, partd, networkx, lxml, kiwisolver, grequests, fsspec, easydev, docopt, distributed, dash, cython, cycler, cachetools, bokeh, beautifulsoup4, appdirs, validators, umap-learn, typing, sqlalchemy, scikit-allel, rarfile, obonet, matplotlib, large-image, h5py, gunicorn, gtfparse, goatools, filetype, dash-daq, dash-bootstrap-components, bioservices, biopython, astropy, openomics
    Running setup.py install for retrying ... done
    Running setup.py install for llvmlite ... error
    ERROR: Command errored out with exit status 1:
     command: /Users/stephenmoss/.pyenv/versions/3.9.0/bin/python3.9 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/bf/cl_g6mhx7zd9_1mhhvppbzdh0000gn/T/pip-install-sm6pl876/llvmlite_0bff606e31a6496399a22ccfcec04d59/setup.py'"'"'; __file__='"'"'/private/var/folders/bf/cl_g6mhx7zd9_1mhhvppbzdh0000gn/T/pip-install-sm6pl876/llvmlite_0bff606e31a6496399a22ccfcec04d59/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/bf/cl_g6mhx7zd9_1mhhvppbzdh0000gn/T/pip-record-rgduko9u/install-record.txt --single-version-externally-managed --compile --install-headers /Users/stephenmoss/.pyenv/versions/3.9.0/include/python3.9/llvmlite
         cwd: /private/var/folders/bf/cl_g6mhx7zd9_1mhhvppbzdh0000gn/T/pip-install-sm6pl876/llvmlite_0bff606e31a6496399a22ccfcec04d59/
    Complete output (29 lines):
    running install
    running build
    got version from file /private/var/folders/bf/cl_g6mhx7zd9_1mhhvppbzdh0000gn/T/pip-install-sm6pl876/llvmlite_0bff606e31a6496399a22ccfcec04d59/llvmlite/_version.py {'version': '0.34.0', 'full': 'c5889c9e98c6b19d5d85ebdd982d64a03931f8e2'}
    running build_ext
    /Users/stephenmoss/.pyenv/versions/3.9.0/bin/python3.9 /private/var/folders/bf/cl_g6mhx7zd9_1mhhvppbzdh0000gn/T/pip-install-sm6pl876/llvmlite_0bff606e31a6496399a22ccfcec04d59/ffi/build.py
    LLVM version... Traceback (most recent call last):
      File "/private/var/folders/bf/cl_g6mhx7zd9_1mhhvppbzdh0000gn/T/pip-install-sm6pl876/llvmlite_0bff606e31a6496399a22ccfcec04d59/ffi/build.py", line 105, in main_posix
        out = subprocess.check_output([llvm_config, '--version'])
      File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/subprocess.py", line 420, in check_output
        return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
      File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/subprocess.py", line 501, in run
        with Popen(*popenargs, **kwargs) as process:
      File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/subprocess.py", line 947, in __init__
        self._execute_child(args, executable, preexec_fn, close_fds,
      File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/subprocess.py", line 1819, in _execute_child
        raise child_exception_type(errno_num, err_msg, err_filename)
    FileNotFoundError: [Errno 2] No such file or directory: 'llvm-config'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/private/var/folders/bf/cl_g6mhx7zd9_1mhhvppbzdh0000gn/T/pip-install-sm6pl876/llvmlite_0bff606e31a6496399a22ccfcec04d59/ffi/build.py", line 191, in <module>
        main()
      File "/private/var/folders/bf/cl_g6mhx7zd9_1mhhvppbzdh0000gn/T/pip-install-sm6pl876/llvmlite_0bff606e31a6496399a22ccfcec04d59/ffi/build.py", line 185, in main
        main_posix('osx', '.dylib')
      File "/private/var/folders/bf/cl_g6mhx7zd9_1mhhvppbzdh0000gn/T/pip-install-sm6pl876/llvmlite_0bff606e31a6496399a22ccfcec04d59/ffi/build.py", line 107, in main_posix
        raise RuntimeError("%s failed executing, please point LLVM_CONFIG "
    RuntimeError: llvm-config failed executing, please point LLVM_CONFIG to the path for llvm-config
    error: command '/Users/stephenmoss/.pyenv/versions/3.9.0/bin/python3.9' failed with exit code 1
    ----------------------------------------
ERROR: Command errored out with exit status 1: /Users/stephenmoss/.pyenv/versions/3.9.0/bin/python3.9 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/bf/cl_g6mhx7zd9_1mhhvppbzdh0000gn/T/pip-install-sm6pl876/llvmlite_0bff606e31a6496399a22ccfcec04d59/setup.py'"'"'; __file__='"'"'/private/var/folders/bf/cl_g6mhx7zd9_1mhhvppbzdh0000gn/T/pip-install-sm6pl876/llvmlite_0bff606e31a6496399a22ccfcec04d59/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/bf/cl_g6mhx7zd9_1mhhvppbzdh0000gn/T/pip-record-rgduko9u/install-record.txt --single-version-externally-managed --compile --install-headers /Users/stephenmoss/.pyenv/versions/3.9.0/include/python3.9/llvmlite Check the logs for full command output.

I needed to run the following to fix the issue:

brew install llvm@9
LLVM_CONFIG=/usr/local/opt/llvm@9/bin/llvm-config pip install openomics

I tried with brew install llvm (version 11.0.1 and it failed with RuntimeError: Building llvmlite requires LLVM 10.0.x or 9.0.x, got '11.0.1'. Be sure to set LLVM_CONFIG to the right executable path.

Perhaps an external dependency on this can be specified (required by llvmlite)? The assumption is that the end-user has a working python installation and relevant compilers etc. installed, though this doesn't seem to be specified anywhere?

  • Vignette(s) demonstrating major functionality that runs successfully locally

When trying to run an openomics_test.py file with the from openomics import MultiOmics statement I received the following:

Creating directory /Users/stephenmoss/Library/Application Support/bioservices
Matplotlib is building the font cache; this may take a moment.
Traceback (most recent call last):
  File "/Users/stephenmoss/Dropbox/Code/openomics_test.py", line 1, in <module>
    from openomics import MultiOmics
  File "/Users/stephenmoss/Dropbox/Code/OpenOmics/openomics/__init__.py", line 40, in <module>
    from .visualization import (
  File "/Users/stephenmoss/Dropbox/Code/OpenOmics/openomics/visualization/umap.py", line 3, in <module>
    import umap
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/umap/__init__.py", line 2, in <module>
    from .umap_ import UMAP
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/umap/umap_.py", line 47, in <module>
    from pynndescent import NNDescent
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/pynndescent/__init__.py", line 3, in <module>
    from .pynndescent_ import NNDescent, PyNNDescentTransformer
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/pynndescent/pynndescent_.py", line 21, in <module>
    import pynndescent.sparse as sparse
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/pynndescent/sparse.py", line 330, in <module>
    def sparse_alternative_jaccard(ind1, data1, ind2, data2):
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/decorators.py", line 218, in wrapper
    disp.compile(sig)
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/compiler_lock.py", line 32, in _acquire_compile_lock
    return func(*args, **kwargs)
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/dispatcher.py", line 819, in compile
    cres = self._compiler.compile(args, return_type)
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/dispatcher.py", line 82, in compile
    raise retval
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/dispatcher.py", line 92, in _compile_cached
    retval = self._compile_core(args, return_type)
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/dispatcher.py", line 105, in _compile_core
    cres = compiler.compile_extra(self.targetdescr.typing_context,
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/compiler.py", line 627, in compile_extra
    return pipeline.compile_extra(func)
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/compiler.py", line 363, in compile_extra
    return self._compile_bytecode()
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/compiler.py", line 425, in _compile_bytecode
    return self._compile_core()
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/compiler.py", line 405, in _compile_core
    raise e
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/compiler.py", line 396, in _compile_core
    pm.run(self.state)
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/compiler_machinery.py", line 341, in run
    raise patched_exception
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/compiler_machinery.py", line 332, in run
    self._runPass(idx, pass_inst, state)
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/compiler_lock.py", line 32, in _acquire_compile_lock
    return func(*args, **kwargs)
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/compiler_machinery.py", line 291, in _runPass
    mutated |= check(pss.run_pass, internal_state)
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/compiler_machinery.py", line 264, in check
    mangled = func(compiler_state)
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/typed_passes.py", line 92, in run_pass
    typemap, return_type, calltypes = type_inference_stage(
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/typed_passes.py", line 70, in type_inference_stage
    infer.propagate(raise_errors=raise_errors)
  File "/Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/typeinfer.py", line 1071, in propagate
    raise errors[0]
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Failed in nopython mode pipeline (step: nopython frontend)
Failed in nopython mode pipeline (step: nopython mode backend)
Failed in nopython mode pipeline (step: nopython mode backend)
Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<function make_quicksort_impl.<locals>.run_quicksort at 0x13e3a6940>) found for signature:

 >>> run_quicksort(array(int32, 1d, C))

There are 2 candidate implementations:
  - Of which 2 did not match due to:
  Overload in function 'register_jitable.<locals>.wrap.<locals>.ov_wrap': File: numba/core/extending.py: Line 150.
    With argument(s): '(array(int32, 1d, C))':
   Rejected as the implementation raised a specific error:
     UnsupportedError: Failed in nopython mode pipeline (step: analyzing bytecode)
   Use of unsupported opcode (LOAD_ASSERTION_ERROR) found

   File "../../../.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/misc/quicksort.py", line 180:
       def run_quicksort(A):
           <source elided>
               while high - low >= SMALL_QUICKSORT:
                   assert n < MAX_STACK
                   ^

  raised from /Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/core/byteflow.py:269

During: resolving callee type: Function(<function make_quicksort_impl.<locals>.run_quicksort at 0x13e3a6940>)
During: typing of call at /Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/np/arrayobj.py (5007)


File "../../../.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/np/arrayobj.py", line 5007:
    def array_sort_impl(arr):
        <source elided>
        # Note we clobber the return value
        sort_func(arr)
        ^

During: lowering "$14call_method.5 = call $12load_method.4(func=$12load_method.4, args=[], kws=(), vararg=None)" at /Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/numba/np/arrayobj.py (5017)
During: lowering "$8call_method.3 = call $4load_method.1(arr, func=$4load_method.1, args=[Var(arr, sparse.py:28)], kws=(), vararg=None)" at /Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/pynndescent/sparse.py (28)
During: resolving callee type: type(CPUDispatcher(<function arr_unique at 0x13e1d4280>))
During: typing of call at /Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/pynndescent/sparse.py (41)

During: resolving callee type: type(CPUDispatcher(<function arr_unique at 0x13e1d4280>))
During: typing of call at /Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/pynndescent/sparse.py (41)


File "../../../.pyenv/versions/3.9.0/lib/python3.9/site-packages/pynndescent/sparse.py", line 41:
def arr_union(ar1, ar2):
    <source elided>
    else:
        return arr_unique(np.concatenate((ar1, ar2)))
        ^

During: resolving callee type: type(CPUDispatcher(<function arr_union at 0x13e1d4820>))
During: typing of call at /Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/pynndescent/sparse.py (331)

During: resolving callee type: type(CPUDispatcher(<function arr_union at 0x13e1d4820>))
During: typing of call at /Users/stephenmoss/.pyenv/versions/3.9.0/lib/python3.9/site-packages/pynndescent/sparse.py (331)


File "../../../.pyenv/versions/3.9.0/lib/python3.9/site-packages/pynndescent/sparse.py", line 331:
def sparse_alternative_jaccard(ind1, data1, ind2, data2):
    num_non_zero = arr_union(ind1, ind2).shape[0]
    ^

This turned out to be an issue with python 3.9, which the package is supposed to support. I tried with python 3.8 instead, but it failed initially with the scipy install as it needed the BLAS and LAPACK libraries. I needed to install using:

brew install openblas lapack
LDFLAGS="-L/usr/local/opt/openblas/lib -L/usr/local/opt/lapack/lib" CPPFLAGS="-I/usr/local/opt/openblas/include -I/usr/local/opt/lapack/include" LLVM_CONFIG=/usr/local/opt/llvm@9/bin/llvm-config pip install openomics

Now running python openomics_test.py gives me only the following warning:

/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/umap/__init__.py:9: UserWarning: Tensorflow not installed; ParametricUMAP will be unavailable
  warn("Tensorflow not installed; ParametricUMAP will be unavailable")

Running the following rectifies this:

brew install libtensorflow
pip install tensorflow

I feel, therefore, a dependency on python 3.8 should be specified in the documentation and setup.py, as it looks like there are issues with python 3.9 at present. It would also be useful to include tensorflow in the list of package dependencies (i.e. requirements.txt) to avoid this warning. Using something like pipenv might be an ideal solution here? Though explicitly stating the external library dependencies for scipy would still be necessary.

When extending openomics_test.py to include the example for Load the multiomics: Gene Expression, MicroRNA expression lncRNA expression, Copy Number Variation, Somatic Mutation, DNA Methylation, and Protein Expression data, I get the following error running the first sample:

File "openomics_test.py", line 11
  usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="transcript_name")
  ^
IndentationError: unexpected indent

This is because there are close parentheses where there shouldn't be.

The example in the README should be the following (I have submitted a PR for this):

# Load each expression dataframe
mRNA = MessengerRNA(data=folder_path+"LUAD__geneExp.txt",
        transpose=True, usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="gene_name")
miRNA = MicroRNA(data=folder_path+"LUAD__miRNAExp__RPM.txt",
        transpose=True, usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="transcript_name")
lncRNA = LncRNA(data=folder_path+"TCGA-rnaexpr.tsv",
        transpose=True, usecols="Gene_ID|TCGA", gene_index="Gene_ID", gene_level="gene_id")
som = SomaticMutation(data=folder_path+"LUAD__somaticMutation_geneLevel.txt",
        transpose=True, usecols="GeneSymbol|TCGA", gene_index="gene_name")
pro = Protein(data=folder_path+"protein_RPPA.txt",
        transpose=True, usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="protein_name")

Running openomics_test.py now gives me the following warning for the MessengerRNA function:

/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/transcriptomics.py:95: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support sep=None with delim_whitespace=False; you can avoid this warning by specifying engine='python'.

I believe this can be rectified by updating transcriptomics.py here with:

df = pd.read_table(data, sep=None, engine='python')

The SomaticMutation function gives me the following error:

Traceback (most recent call last):
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'gene_name'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "openomics_test.py", line 14, in <module>
    som = SomaticMutation(data=folder_path+"LUAD__somaticMutation_geneLevel.txt",
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/genomics.py", line 22, in __init__
    super(SomaticMutation, self).__init__(data=data, transpose=transpose, gene_index=gene_index, usecols=usecols,
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/transcriptomics.py", line 50, in __init__
    self.expressions = self.preprocess_table(df, usecols=usecols, gene_index=gene_index, transposed=transpose,
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/transcriptomics.py", line 148, in preprocess_table
    df = df[df[gene_index] != '?']
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/pandas/core/frame.py", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: 'gene_name'

Using GeneSymbol instead of gene_name for the gene_index parameter in the vignette fixes this, e.g.:

som = SomaticMutation(data=folder_path+"LUAD__somaticMutation_geneLevel.txt",
        transpose=True, usecols="GeneSymbol|TCGA", gene_index="GeneSymbol")

Running openomics_test.py now gives me the following:

/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/transcriptomics.py:95: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support sep=None with delim_whitespace=False; you can avoid this warning by specifying engine='python'.
  df = pd.read_table(data, sep=None)
MessengerRNA (576, 20472) , indexed by: gene_name
MicroRNA (494, 1870) , indexed by: transcript_name
LncRNA (546, 12727) , indexed by: gene_id
SomaticMutation (889, 21070) , indexed by: GeneSymbol
Protein (364, 200) , indexed by: protein_name

This differs from the output in the README, however, which is:

PATIENTS (522, 5)
SAMPLES (1160, 6)
DRUGS (461, 4)
MessengerRNA (576, 20472)
SomaticMutation (587, 21070)
MicroRNA (494, 1870)
LncRNA (546, 12727)
Protein (364, 154)

Running the example under Annotate LncRNAs with GENCODE genomic annotations returns the following:

Downloading ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.long_noncoding_RNAs.gtf.gz
|========================================================================================================================================================================================================| 4.4M/4.4M (100.00%)         0s
Downloading ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.basic.annotation.gtf.gz
|========================================================================================================================================================================================================|  26M/ 26M (100.00%)         7s
Downloading ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.lncRNA_transcripts.fa.gz
|========================================================================================================================================================================================================|  14M/ 14M (100.00%)         3s
Downloading ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.transcripts.fa.gz
|========================================================================================================================================================================================================|  72M/ 72M (100.00%)        15s
INFO:root:<_io.TextIOWrapper name='/Users/stephenmoss/.astropy/cache/download/url/141581d04d4001254d07601dfa7d983b/contents' encoding='UTF-8'>
Traceback (most recent call last):
  File "openomics_test.py", line 34, in <module>
    gencode = GENCODE(path="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/",
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/database/sequence.py", line 67, in __init__
    super(GENCODE, self).__init__(path=path, file_resources=file_resources, col_rename=col_rename,
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/database/sequence.py", line 17, in __init__
    super(SequenceDataset, self).__init__(**kwargs)
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/database/base.py", line 39, in __init__
    self.data = self.load_dataframe(file_resources, npartitions=npartitions)
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/database/sequence.py", line 74, in load_dataframe
    df = read_gtf(file_resources[gtf_file], npartitions=npartitions)
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/utils/read_gtf.py", line 349, in read_gtf
    result_df = parse_gtf_and_expand_attributes(
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/utils/read_gtf.py", line 290, in parse_gtf_and_expand_attributes
    result = parse_gtf(
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/openomics/utils/read_gtf.py", line 195, in parse_gtf
    chunk_iterator = dd.read_table(
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/dask/dataframe/io/csv.py", line 659, in read
    return read_pandas(
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/dask/dataframe/io/csv.py", line 464, in read_pandas
    paths = get_fs_token_paths(urlpath, mode="rb", storage_options=storage_options)[
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/fsspec/core.py", line 619, in get_fs_token_paths
    path = cls._strip_protocol(urlpath)
  File "/Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/fsspec/implementations/local.py", line 147, in _strip_protocol
    if path.startswith("file://"):
AttributeError: '_io.TextIOWrapper' object has no attribute 'startswith'

I initially submitted an issue with fsspec here, but it turned out to be due to the gzip library wrapping the return object in an io.TextIOWrapper object? This is due to the fsspec package being unable to determine a path for the object and therefore returning the original object, which then fails downstream. See the issue for more details, but we somehow need to return a legitimate file object, instead of an io.TextIOWrapper object that has a resolvable path and therefore doesn't cause the failure in the fsspec package.

I was unable to continue with the rest of the vignettes as a result despite attempting some fixes locally.

  • Function Documentation: for all user-facing functions

Docstrings seem to be available throughout the codebase for all relevant functions, however, I found the documentation on Read the Docs to be lacking in comparison, particularly in providing detail of function parameters that are otherwise available in the docstrings.

  • Examples for all user-facing functions

Looking at the various classes and respective functions throughout the codebase, there doesn't seem to be a full coverage of of all functionality in the examples provided in the documentation. This would be too much for the README, I feel, but could certainly be represented on Read the Docs.

  • Community guidelines including contribution guidelines in the README or CONTRIBUTING.

There is no link to the contribution guidelines from the main README, which I feel would be beneficial, however contributing guidelines are available in https://github.com/BioMeCIS-Lab/OpenOmics/blob/master/CONTRIBUTING.rst and on Read the Docs.

Just being finicky, I personally would prefer a common format for the README and CONTRIBUTING documents etc. Either both in reStructuredText or both in Markdown. There seems to be an outdated README.rst that could probably be replaced with the current README.md?

  • Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements
The package meets the readme requirements below:

  • Package has a README.md file in the root directory.

The README should include, from top to bottom:

  • The package name
  • Badges for continuous integration and test coverage, a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the badge for pyOpenSci peer-review will be provided upon acceptance.)
  • Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
  • Installation instructions
  • Any additional setup required (authentication tokens, etc)

I feel some things are missing regarding setup information for certain platforms (i.e. I had issues on macOS). Use of pipenv may be beneficial for certain dependencies, i.e. those that generated warnings. Reproducibiltiy across platforms is always an issue, though perhaps a Docker image with all dependencies required could be made available for people to run their scripts?

  • Brief demonstration usage
  • Direction to more detailed documentation (e.g. your documentation files or website).
  • If applicable, how the package compares to other similar packages and/or how it relates to other packages

No comparisons are made to other packages. I feel it is similar in some ways to PyCogent, though this is no longer actively maintained, and perhaps even comparable in some ways to QIIME2?

  • Citation information

No citation information is provided in the README, though perhaps this can be made available if submitted to JOSS. It would also be great to see a DOI made available via GitHub's integration with Zenodo as mentioned by @NickleDave.

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole.
Package structure should follow general community best-practices. In general please consider:

One minor PR submitted as described above. I'll try see if I can spare some time to look into the gzip issue at some point too.

  • The documentation is easy to find and understand

Though could be extended as discussed above.

  • The need for the package is clear

This is relatively clear, but could be expanded on as above.

  • All functions have documentation and associated examples for use

As discussed above, there appear to be docstrings throughout the code, but this doesn't seem to be reflected in the README or online documentation, particularly for function parameters.

Functionality

  • Installation: Installation succeeds as documented.

I had several issues with installation as described above, though I only tested on a macOS system running Big Sur version 11.2. It appears there are issues with Python 3.9.x in particular, though I can't confirm whether this is platform specific. There were some failures in building scipy on Python 3.8.7 due to dependent libraries missing on my machine.

  • Functionality: Any functional claims of the software been confirmed.

The package goes a good way towards meeting the claims it reports to be developed for, though the difficulties in getting the examples to run means I was limited in my ability to fully assess them.

  • Performance: Any performance claims of the software been confirmed.

No performance claims were provided. On my 16" MacBook Pro with 2.3 GHz Octa-core Intel Core i9 and 32GB RAM, however, I felt the package was a little slow in loading it's dependencies on first run. Tests also took some time to complete.

  • Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.

Some tests are available and run as part of the Travis CI pipeline, though coverage isn't amazing and would benefit from additional work. Focusing on test driven development is a good way to ensure greater coverage. Running the tests locally took a long time, and returned various warnings and an error.

☁ OpenOmics [master] python -m pytest --cov=./
============================================================================================================ test session starts =============================================================================================================
platform darwin -- Python 3.8.7, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
rootdir: /Users/stephenmoss/Dropbox/Code/OpenOmics, configfile: setup.cfg
plugins: cov-2.11.1, dash-1.19.0
collected 35 items

tests/test_annotations.py .........                                                                                                                                                                                                    [ 25%]
tests/test_disease.py ..........                                                                               [ 54%]
tests/test_interaction.py .E.....                                                                              [ 74%]
tests/test_multiomics.py ...                                                                                   [ 82%]
tests/test_sequences.py ......                                                                                 [100%]

======================================================= ERRORS =======================================================
______________________________________ ERROR at setup of test_import_MiRTarBase ______________________________________

    @pytest.fixture
    def generate_MiRTarBase():
>       return MiRTarBase(path="/data/datasets/Bioinformatics_ExternalData/miRTarBase/", strip_mirna_name=True,
                          filters={"Species (Target Gene)": "Homo sapiens"})

tests/test_interaction.py:19:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
openomics/database/interaction.py:611: in __init__
    super(MiRTarBase, self).__init__(path=path, file_resources=file_resources,
openomics/database/interaction.py:40: in __init__
    self.validate_file_resources(path, file_resources, verbose=verbose)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <openomics.database.interaction.MiRTarBase object at 0x1aa5036d0>
path = '/data/datasets/Bioinformatics_ExternalData/miRTarBase/'
file_resources = {'miRTarBase_MTI.xlsx': '/data/datasets/Bioinformatics_ExternalData/miRTarBase/miRTarBase_MTI.xlsx'}
npartitions = None, verbose = False

    def validate_file_resources(self, path, file_resources, npartitions=None, verbose=False) -> None:
        """For each file in file_resources, fetch the file if path+file is a URL
        or load from disk if a local path. Additionally unzip or unrar if the
        file is compressed.

        Args:
            path (str): The folder or url path containing the data file
                resources. If url path, the files will be downloaded and cached
                to the user's home folder (at ~/.astropy/).
            file_resources (dict): default None, Used to list required files for
                preprocessing of the database. A dictionary where keys are
                required filenames and value are file paths. If None, then the
                class constructor should automatically build the required file
                resources dict.
            npartitions:
            verbose:
        """
        if validators.url(path):
            for filename, filepath in copy.copy(file_resources).items():
                data_file = get_pkg_data_filename(path, filepath,
                                                  verbose=verbose)  # Download file and replace the file_resource path
                filetype_ext = filetype.guess(data_file)

                # This null if-clause is needed incase when filetype_ext is None, causing the next clause to fail
                if filetype_ext is None:
                    file_resources[filename] = data_file

                elif filetype_ext.extension == 'gz':
                    file_resources[filename] = gzip.open(data_file, 'rt')

                elif filetype_ext.extension == 'zip':
                    zf = zipfile.ZipFile(data_file, 'r')

                    for subfile in zf.infolist():
                        if os.path.splitext(subfile.filename)[-1] == os.path.splitext(filename)[-1]: # If the file extension matches
                            file_resources[filename] = zf.open(subfile.filename, mode='r')

                elif filetype_ext.extension == 'rar':
                    rf = rarfile.RarFile(data_file, 'r')

                    for subfile in rf.infolist():
                        if os.path.splitext(subfile.filename)[-1] == os.path.splitext(filename)[-1]: # If the file extension matches
                            file_resources[filename] = rf.open(subfile.filename, mode='r')
                else:
                    file_resources[filename] = data_file

        elif os.path.isdir(path) and os.path.exists(path):
            for _, filepath in file_resources.items():
                if not os.path.exists(filepath):
                    raise IOError(filepath)
        else:
>           raise IOError(path)
E           OSError: /data/datasets/Bioinformatics_ExternalData/miRTarBase/

openomics/database/base.py:113: OSError
================================================== warnings summary ==================================================
../../../.pyenv/versions/3.8.7/lib/python3.8/site-packages/_pytest/config/__init__.py:1233
  /Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/site-packages/_pytest/config/__init__.py:1233: PytestConfigWarning: Unknown config option: collect_ignore

    self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")

tests/test_annotations.py: 6 warnings
tests/test_disease.py: 6 warnings
tests/test_interaction.py: 4 warnings
tests/test_multiomics.py: 3 warnings
tests/test_sequences.py: 4 warnings
  /Users/stephenmoss/Dropbox/Code/OpenOmics/openomics/transcriptomics.py:108: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support sep=None with delim_whitespace=False; you can avoid this warning by specifying engine='python'.
    df = pd.read_table(data, sep=None)

tests/test_annotations.py::test_import_GTEx
tests/test_annotations.py::test_GTEx_annotate
  /Users/stephenmoss/Dropbox/Code/OpenOmics/openomics/database/annotation.py:239: FutureWarning: The default value of regex will change from True to False in a future version.
    gene_exp_medians["Name"] = gene_exp_medians["Name"].str.replace("[.].*", "")

tests/test_disease.py::test_import_HMDD
tests/test_disease.py::test_annotate_HMDD
  /Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/encodings/unicode_escape.py:26: DeprecationWarning: invalid escape sequence '\ '
    return codecs.unicode_escape_decode(input, self.errors)[0]

tests/test_disease.py::test_import_HMDD
tests/test_disease.py::test_annotate_HMDD
  /Users/stephenmoss/.pyenv/versions/3.8.7/lib/python3.8/encodings/unicode_escape.py:26: DeprecationWarning: invalid escape sequence '\s'
    return codecs.unicode_escape_decode(input, self.errors)[0]

tests/test_interaction.py::test_import_LncRNA2Target
  /Users/stephenmoss/Dropbox/Code/OpenOmics/openomics/database/interaction.py:476: FutureWarning: Your version of xlrd is 1.2.0. In xlrd >= 2.0, only the xls format is supported. As a result, the openpyxl engine will be used if it is installed and the engine argument is not specified. Install openpyxl instead.
    table = pd.read_excel(file_resources["lncRNA_target_from_low_throughput_experiments.xlsx"])

tests/test_interaction.py::test_import_LncRNA2Target
  /Users/stephenmoss/Dropbox/Code/OpenOmics/openomics/database/interaction.py:480: FutureWarning: The default value of regex will change from True to False in a future version.
    table["Target_official_symbol"] = table["Target_official_symbol"].str.replace("(?i)(mir)", "hsa-mir-")

-- Docs: https://docs.pytest.org/en/stable/warnings.html

---------- coverage: platform darwin, python 3.8.7-final-0 -----------
Name                                       Stmts   Miss  Cover
--------------------------------------------------------------
openomics/__init__.py                         25      7    72%
openomics/clinical.py                         57     19    67%
openomics/database/__init__.py                 7      0   100%
openomics/database/annotation.py             197    110    44%
openomics/database/base.py                   140     37    74%
openomics/database/disease.py                 61      1    98%
openomics/database/interaction.py            330    196    41%
openomics/database/ontology.py               152     93    39%
openomics/database/sequence.py               128     66    48%
openomics/genomics.py                         26      8    69%
openomics/imageomics.py                       63     47    25%
openomics/multicohorts.py                      0      0   100%
openomics/multiomics.py                      111     66    41%
openomics/proteomics.py                       13      2    85%
openomics/transcriptomics.py                 111     32    71%
openomics/utils/GTF.py                        53     53     0%
openomics/utils/__init__.py                    0      0   100%
openomics/utils/df.py                         23     11    52%
openomics/utils/io.py                         40     19    52%
openomics/utils/read_gtf.py                  107     24    78%
openomics/visualization/__init__.py            1      0   100%
openomics/visualization/heatmat.py            11      8    27%
openomics/visualization/umap.py               29     24    17%
openomics_web/__init__.py                      0      0   100%
openomics_web/app.py                          69     69     0%
openomics_web/callbacks.py                     0      0   100%
openomics_web/layouts/__init__.py              0      0   100%
openomics_web/layouts/annotation_view.py       0      0   100%
openomics_web/layouts/app_layout.py            7      7     0%
openomics_web/layouts/clinical_view.py        10     10     0%
openomics_web/layouts/control_tabs.py          5      5     0%
openomics_web/layouts/datatable_view.py       28     28     0%
openomics_web/server.py                        2      2     0%
openomics_web/utils/__init__.py                0      0   100%
openomics_web/utils/io.py                     62     62     0%
openomics_web/utils/str_utils.py              25     25     0%
setup.py                                      44     44     0%
tests/__init__.py                              0      0   100%
tests/data/__init__.py                         0      0   100%
tests/data/test_dask_dataframes.py             0      0   100%
tests/test_annotations.py                     39      0   100%
tests/test_disease.py                         34      0   100%
tests/test_interaction.py                     20      1    95%
tests/test_multiomics.py                      46      1    98%
tests/test_sequences.py                       18      1    94%
--------------------------------------------------------------
TOTAL                                       2094   1078    49%

============================================== short test summary info ===============================================
ERROR tests/test_interaction.py::test_import_MiRTarBase - OSError: /data/datasets/Bioinformatics_ExternalData/miRTa...
================================ 34 passed, 32 warnings, 1 error in 662.47s (0:11:02) ================================

The main error seemed to be a missing dataset. On further inspection of the codebase it seems that the package is supposed to download the miRTarBase data (although it appears to have the version 7 release URL hardcoded when version 8 is now available). I wondered whether this was a permissions issue with not being able to create the /data/datasets/Bioinformatics_ExternalData/miRTarBase/ path on my system. I tried with sudo python -m pytest --cov=./ tests/test_interaction.py and got the same? I tried sudo mkdir -p /data/datasets/Bioinformatics_ExternalData/miRTarBase beforehand, which returned:

mkdir: /data/datasets/Bioinformatics_ExternalData/miRTarBase: Read-only file system

This is likely due to macOS system integrity protection.

However, it seems I am also unable to resolve http://mirtarbase.mbc.nctu.edu.tw/cache/download/7.0/. I believe the URL should actually be http://mirtarbase.cuhk.edu.cn/cache/download/7.0/ (or even http://mirtarbase.cuhk.edu.cn/cache/download/8.0/)? I manually updated to the working version 7.0 release and updated the path in test_interaction.py before running the following:

mkdir -p tests/data/datasets/Bioinformatics_ExternalData/miRTarBase
sudo python -m pytest --cov=./ tests/test_interaction.py

I still received the error, so something needs looking at in more detail here.

  • Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.

When raising a PR the build in Travis CI failed for all python versions. There is some work needed to get these issues resolved though I didn't inspect the output in detail.

#### For packages co-submitting to JOSS

- [ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

- [ ] A short summary describing the high-level functionality of the software
- [ ] Authors: A list of authors with their affiliations
- [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
- [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

A number of changes and bug fixes are required before I would recommend approving this package, but in general I feel it would be a great addition to pyOpenSci.

Estimated hours spent reviewing: 7


Review Comments

I would recommend looking through the Author's Guide in more detail, particularly the Tools for developers section. Using git pre-commit hooks for local development would be beneficial for both the author and any contributors and would enable the production of greater quality code. These can also be integrated with Travis CI to ensure any pull requests also meet the same requirements via automated testing against a variety of different Python versions.

I echo the comments of @ksielemann, in that the package seems very interesting and I can see it would have a broad range of applications.

Hopefully, we can work together to get the issues resolved and get this approved. Would be interested in seeing this in JOSS at some point too.

@gawbul
Copy link

gawbul commented Feb 8, 2021

@JonnyTran

Just in case it gets lost in the review, I submitted PRs for a couple of the issues I hit here JonnyTran/OpenOmics#103 and here JonnyTran/OpenOmics#105.

Also opened an issue for the problem with the gzip.open returning an io.TextIOWrapper object here JonnyTran/OpenOmics#104.

@NickleDave
Copy link
Contributor

NickleDave commented Feb 15, 2021

Thank you @gawbul for the very thorough review, really appreciate that you could get that back to us a week before the deadline, especially with other things you've been dealing with.

@JonnyTran just want to check where you are at with this. I know it might feel like a lot, and you could have other things going on.

Our guidelines suggest aiming for a two-week turnaround time after reviews are in. We definitely don't have to hold strictly to that especially if you have other obligations to deal with.

But please when you can just give me some idea of how you'll move forward. I would suggest converting reviewer comments into issues on OpenOmics and linking to them where you can. See for example issues on physcraper from their review: https://github.com/McTavishLab/physcraper/issues

@JonnyTran
Copy link
Author

JonnyTran commented May 4, 2021

Thanks so much for everyone's patience, help and support, @NickleDave, @lwasser, @gawbul, and @ksielemann !

Most of all thanks for making this a better software! I will work on making it more usable for all.

@lwasser
Copy link
Member

lwasser commented May 4, 2021

oh all - So the JOSS review process is simple in that they accept our review as theirs! @arfon please note that this package was submitted to the JOSS review process. Can we please fast track it given we have reviewed here on the pyopensci side of things. Please let us know what you need. @JonnyTran is there an open issue in JOSS right now? can you kindly reference it here if you haven't already.

I normally keep these reviews open here until the JOSS part is finished. When it is they will ask you to add the JOSS badge on your readme as well. thank you all for this!

@JonnyTran
Copy link
Author

Hi @lwasser , there isn't a JOSS issue yet, but I will tag this pyOpenSci issue once it is opened.

@lwasser
Copy link
Member

lwasser commented May 4, 2021

ahh ok perfect. Normally they just accept through our issue! let's wait for arfon to get back to us here to ensure that is still the best process! congratulations on being a part of the pyopensci ecosystem and thank you for your submission here!

@ksielemann
Copy link

@ksielemann @gawbul would you be okay with me adding you as contributors to the pyOpenSci site?
I can request your review on the PR when I do so

Sure, I am okay with this! Should I add myself to the file you mentioned above or do you prefer to add me to the contributor site yourself?

@arfon
Copy link

arfon commented May 5, 2021

Sorry for the delay folks. Things are now moving in openjournals/joss-reviews#3249

@NickleDave
Copy link
Contributor

NickleDave commented May 5, 2021

Sure, I am okay with this! Should I add myself to the file you mentioned above or do you prefer to add me to the contributor site yourself?

@ksielemann it would be excellent if you could add yourself with a PR

Sorry for the delay folks. Things are now moving in openjournals/joss-reviews#3249

Great thank you @arfon !!!

@arfon
Copy link

arfon commented May 9, 2021

This is now published in JOSS https://joss.theoj.org/papers/10.21105/joss.03249 🥳

@lwasser
Copy link
Member

lwasser commented May 10, 2021

yea!! thanks so much @arfon and thank you for the reviews @gawbul @ksielemann and the editor effort @NickleDave closing this!!

@lwasser
Copy link
Member

lwasser commented Sep 15, 2022

hey 👋 @JonnyTran @gawbul @ksielemann @NickleDave ! I hope that you are all well. I am reaching out here to all reviewers and maintainers about pyOpenSci now that i am working full time on the project (read more here). We have a survey that we'd like for you to fill out so we can:

🔗 HERE IS THE SURVEY LINK 🔗

  1. invite you to our slack channel to participate in our community (if you wish to join - no worries if that is not how you prefer to communicate / participate).
  2. Collect information from you about how we can improve our review process and also better serve maintainers.
    The survey should take about 10 minutes to complete depending upon how much you decide to write. This information will help us greatly as we make decisions about how pyOpenSci grows and serves the community. Thank you so much in advance for filling it out.

NOTE: this is different from the form designed for reviewers to sign up to review.
IMPORTANT: If there are other maintainers for this project, please ping them here and ask them to fill out the survey as well. It is important that we ensure packages are supported long term or sunsetted with sufficient communication to users. Thus we will check in with maintainers annually about maintenance.

Thank you in advance for doing this and supporting pyOpenSci.

@lwasser
Copy link
Member

lwasser commented Sep 28, 2022

Hi there 👋 @JonnyTran @ksielemann I know that everyone is super busy BUT if you have just 5-10 minutes to fill our our onboarding / feedback survey for this review i'd greatly appreciate it!! Many thanks in advance for your time. it really helps our organization! Jonny, if there are other maintainers involved in this package please ping them here as well. we would like to keep in touch with our maintainers to ensure the package is still being kept up to date AND to understand how we can better support you ! Many thanks in advance!!
🔗 HERE IS THE SURVEY LINK 🔗

@JonnyTran
Copy link
Author

Hi there wave @JonnyTran @ksielemann fill our our onboarding / feedback survey for this review i'd greatly appreciate it!! Many thanks in advance for your time. it really helps our organization! Jonny, if there are other maintainers involved in this package please ping them here as well. we would like to keep in touch with our maintainers to ensure the package is still being kept up to date AND to understand how we can better support you !

Thank you for the reminder! I've just submitted the survey. Looking forward to future contributions to PyOpenSci

@NickleDave
Copy link
Contributor

Thank you @JonnyTran, same!

@lwasser
Copy link
Member

lwasser commented Sep 28, 2022

you rock @JonnyTran !!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: joss-accepted
Development

No branches or pull requests

6 participants