Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Submission: pyb4model (Python) #28

Open
10 of 22 tasks
andrealee011 opened this issue Mar 17, 2020 · 4 comments
Open
10 of 22 tasks

Submission: pyb4model (Python) #28

andrealee011 opened this issue Mar 17, 2020 · 4 comments
Assignees

Comments

@andrealee011
Copy link

andrealee011 commented Mar 17, 2020

Submitting Author: Andrea Lee (@andrealee011 ), Jaekeun Lee (@agdal1125), Sakariya Aynashe (@eyrakas), Xinwen Wang (@xiw315 )
Package Name: pyb4model
One-Line Description of Package: A python package to preprocess data and conduct machine learning
Repository Link: pyb4model
Version submitted: v1.1.0
Editor: Varada Kolhatkar (@kvarada)
Reviewer 1: Cheng Min (@marvinmin)
Reviewer 2: Ke Xin Zhao (@Margaret8521)
Archive: TBD
Version accepted: TBD


Description

  • Include a brief paragraph describing what your package does:

Our package elegantly performs data pre-processing and machine learning in a fast and easy manner. With four separate functions that will come along with the pyb4model package, users will have greater flexibility in handling many different types of datasets. With the pyb4model package, users will be able to smoothly pre-process their data and conduct machine learning using their model of choice.

Scope

  • Please indicate which category or categories this package falls under:
    • Data retrieval
    • Data extraction
    • Data munging
    • Data deposition
    • Reproducibility
    • Geospatial
    • Education
    • Data visualization*

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.

  • Explain how the and why the package falls under these categories (briefly, 1-2 sentences):

Our package smoothly pre-processes data and conducts machine learning using the user's model of choice.

  • Who is the target audience and what are scientific applications of this package?

Our package smoothly pre-processes data for machine learning and provides wrapper functions of pre-existing machine learning libraries to reduce lines of code and promote efficiency.

  • Are there other Python packages that accomplish the same thing? If so, how does yours differ?

sklearn is probably the most widely used package for supervised learning problems in python. Although the library provides various model fitting and preprocessing features, programmers end up with writing the same line of code over and over again. Our pyb4model library provides a simple solution to this pain point: wrapper functions of sklearn and other primary libraries used for supervised learning to reduce lines of code and promote efficiency.

  • If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • has an OSI approved license
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a vignette with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.

Publication options

JOSS Checks
  • The package has an obvious research application according to JOSS's definition in their submission requirements. Be aware that completing the pyOpenSci review process does not guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
  • The package is not a "minor utility" as defined by JOSS's submission requirements: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
  • The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
  • The package is deposited in a long-term repository with the DOI:

Note: Do not submit your package separately to JOSS

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

  • Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Code of conduct

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

@andrealee011 andrealee011 assigned marvinmin and unassigned marvinmin Mar 17, 2020
@andrealee011 andrealee011 changed the title Submission: py4model Submission: py4model (Python) Mar 17, 2020
@andrealee011 andrealee011 changed the title Submission: py4model (Python) Submission: pyb4model (Python) Mar 17, 2020
@marvinmin
Copy link

marvinmin commented Mar 17, 2020

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all user-facing functions
  • Examples for all user-facing functions
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING.
  • Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements
The package meets the readme requirements below:

  • Package has a README.md file in the root directory.

The README should include, from top to bottom:

  • The package name
  • Badges for continuous integration and test coverage, the badge for pyOpenSci peer-review once it has started (see below), a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges, see this example, that one and that one. Such a table should be more wide than high.
  • Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
  • Installation instructions
  • Any additional setup required (authentication tokens, etc)
  • Brief demonstration usage
  • Direction to more detailed documentation (e.g. your documentation files or website).
  • If applicable, how the package compares to other similar packages and/or how it relates to other packages
  • Citation information

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software been confirmed.
  • Performance: Any performance claims of the software been confirmed.
  • Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
  • Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing:
2 hours

Review Comments

General Comments
I think you have done a good job with this package. It addresses a specific need. There are some issues about the package Documentation and Functionality.

  1. The badge for continuous integration in the README is missing. Would you please add this badge to the README?
  2. There is no citation nor credits information at the end of README. If there is any, please include it.
  3. The styles of docstrings for different functions do not consist. For example, in the docstring for the missing_val function, the style for parameters is
    image

But in the docstring for the ForSelect function, the style of parameters is
image

  1. The packages used in the examples of README are not fully included, which results in the failures of the examples for fit_and_report and ForSelect functions as shown in the following screenshots:

image
image
image

  1. Some variables used in the examples of README are not defined, which results in the failures of the examples for missing_val, ForSelect and feature_splitter functions as shown in the following screenshots:

image
image
image

  1. The raised error message does not consist with the code for function missing_val as shown in the following screenshot:

image

If it is possible, please address these issues and I'll review the package again. Thanks.

@Margaret8521
Copy link

Margaret8521 commented Mar 21, 2020

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all user-facing functions
  • Examples for all user-facing functions
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING.
  • Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements
The package meets the readme requirements below:

  • Package has a README.md file in the root directory.

The README should include, from top to bottom:

  • The package name
  • Badges for continuous integration and test coverage, the badge for pyOpenSci peer-review once it has started (see below), a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges, see this example, that one and that one. Such a table should be more wide than high.
  • Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
  • Installation instructions
  • Any additional setup required (authentication tokens, etc)
  • Brief demonstration usage
  • Direction to more detailed documentation (e.g. your documentation files or website).
  • If applicable, how the package compares to other similar packages and/or how it relates to other packages
  • Citation information

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software been confirmed.
  • Performance: Any performance claims of the software been confirmed.
  • Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
  • Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing:

3h

Review Comments

Compliments:

  • Very clear instruction and description, I really like the README.md file which is nice and clean.
  • It is very easy to install and it works well.
  • I like how you list the software dependencies in your project repo with their links.

Suggestions:

  1. It might be useful to add usage results in the README.md file. This will help users to see the results from your functions directly and enhance their understanding of your package.

  2. For function missing_val:

  • It seems like the function cannot deal with empty strings in a dataframe. delete works fine, but mean still contains NaN values, and knn throws an error. My suggestion here is to add a test to only include numbers in a dataframe such that strings or booleans are not allowed. Examples below:

Screen Shot 2020-03-21 at 3 54 18 PM

Screen Shot 2020-03-21 at 3 55 57 PM

  1. For function ForSelect:
  • The function produces repetitive text which is not necessary.

Screen Shot 2020-03-21 at 3 57 38 PM

  • In the example above, the minimum features selected should be 2, but the result only gives me one feature. I also tried with other datasets, the function always produced the same result no matter how I changed min_features and max_features.

Screen Shot 2020-03-21 at 4 00 28 PM

  • Also, the function does not seem to work with regression problems, it throws an error in the following example:

Screen Shot 2020-03-21 at 4 01 30 PM

Screen Shot 2020-03-21 at 4 02 57 PM

  1. For function feature_splitter:
  • It might be useful to produce the result as a dataframe and add labels to it. It will help users to know which columns are numerical and which columns are categorical.

  • Also, the function seems to split the columns according to whether they include strings or numbers instead of categorical or numerical features. For instance, in the df_breast dataset below, Class, malignancy_degree and avgerage_axillary_lymph_nodes are categorical features instead of numerical features. My suggestion is to make a note in the function documentation or/and README.md file so that users can be aware of it. See the example below:

Screen Shot 2020-03-21 at 4 05 30 PM

  1. Some minor things:
  • update the example results in your function documentation. For example in the missing_val function:

Screen Shot 2020-03-21 at 4 07 31 PM

  • typo (minimum features instead of mininum features) in pyb4model.py
  • add the breakdown of group members in the repo with their Github links.

Great work! I hope these suggestions help, it was fun trying out your package!

@andrealee011
Copy link
Author

andrealee011 commented Mar 26, 2020

Thank you for the feedback! It was extremely hepful! Because of time constraints, we were unable to address all your feedback. Here is a list of the changes we did make:

  • fixed our examples in the README to ensure they are working
  • fixed the error message in our missing_val function
  • added a build badge to the README
  • added test to our missing_val function to ensure strings and booleans are not allowed
  • edited docstrings for consistency
  • added output to our examples in the usage section of the README
  • improved our feature splitter function by indicating which of the output is numerical and which is categorical
  • fixed our forward selection function to remove the minimum number of features argument

Please find our new relase here: https://github.com/UBC-MDS/pyb4model/tree/v1.1.1

@eyrakas
Copy link

eyrakas commented Mar 26, 2020

Thank you @marvinmin and @Margaret8521 for your valuable feedback. We have made the changes outlined above by @andrealee011 and @agdal1125. I just wanted to add one more comment on the feature_splitter for the use of OHE. The intended use of the function was to help users visualize what kind of data they are dealing with, i.e. which ones are numerical and which ones are categorical. Kind of quick EDA for them. Not necessarily to do full data preprocessing. But we have changed its output into a data frame which is more presentable to users.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants