
[REVIEW]: sleev: An R Package for Semiparametric Likelihood Estimation with Errors in Variables #7320

Open
editorialbot opened this issue Oct 7, 2024 · 38 comments
Labels: C++ R review TeX Track: 5 (DSAIS) Data Science, Artificial Intelligence, and Machine Learning

@editorialbot
Collaborator

editorialbot commented Oct 7, 2024

Submitting author: @JiangmeiRubyXiong (Jiangmei Xiong)
Repository: https://github.com/dragontaoran/sleev
Branch with paper.md (empty if default branch): jossSubmission
Version: v1.0.3
Editor: @jbytecode
Reviewers: @alemermartinez, @aalfons
Archive: Pending

Status


Status badge code:

HTML: <a href="https://joss.theoj.org/papers/2dc012ed16ca93ac2f265dc6725f1466"><img src="https://joss.theoj.org/papers/2dc012ed16ca93ac2f265dc6725f1466/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/2dc012ed16ca93ac2f265dc6725f1466/status.svg)](https://joss.theoj.org/papers/2dc012ed16ca93ac2f265dc6725f1466)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@alemermartinez & @aalfons, your review will be checklist-based. Each of you will have a separate checklist that you should update when carrying out your review.
First of all, you need to run this command in a separate comment to create the checklist:

@editorialbot generate my checklist

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @jbytecode know.

Please start on your review when you are able, and be sure to complete it within the next six weeks at the very latest.

Checklists

📝 Checklist for @aalfons

📝 Checklist for @alemermartinez

@editorialbot editorialbot added C++ R review TeX Track: 5 (DSAIS) Data Science, Artificial Intelligence, and Machine Learning labels Oct 7, 2024
@editorialbot
Collaborator Author

Hello humans, I'm @editorialbot, a robot that can help you with some common editorial tasks.

For a list of things I can do to help you, just type:

@editorialbot commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@editorialbot generate pdf

@editorialbot
Collaborator Author

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

✅ OK DOIs

- 10.1002/sim.8799 is OK
- 10.1111/biom.13512 is OK
- 10.1017/CBO9780511618994 is OK
- 10.1080/01621459.2017.1295864 is OK
- 10.2307/2669386 is OK
- 10.1214/20-aoas1343 is OK

🟡 SKIP DOIs

- No DOI given, and none found for title: An empirical study for impacts of measurement erro...

❌ MISSING DOIs

- None

❌ INVALID DOIs

- None

@editorialbot
Collaborator Author

Software report:

github.com/AlDanial/cloc v 1.90  T=0.03 s (1498.9 files/s, 242374.7 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
R                               30            524           1164           2131
C++                              5            252            360           1379
Markdown                         2             80              0            375
TeX                              1              0              0             71
C/C++ Header                     1             19             67             22
YAML                             1              1              4             19
-------------------------------------------------------------------------------
SUM:                            40            876           1595           3997
-------------------------------------------------------------------------------

Commit count by author:

   112	Joey Sherrill
    26	Sarah Lotspeich
     9	JiangmeiRubyXiong
     5	Sarah
     3	Ran Tao
     3	Sarah Lotspeich (She/Her)
     2	Ruby XIONG
     2	dragontaoran
     1	Shawn Garbett

@editorialbot
Collaborator Author

Paper file info:

📄 Wordcount for paper.md is 2168

✅ The paper includes a Statement of need section

@editorialbot
Collaborator Author

License info:

🟡 License found: GNU General Public License v3.0 (Check here for OSI approval)

@editorialbot
Collaborator Author

👉📄 Download article proof 📄 View article proof on GitHub 📄 👈

@jbytecode

@editorialbot remind @aalfons in four weeks

@aalfons - You will get reminded by our bot automatically. As you emphasized in the pre-review issue, we expect your report in 8-10 weeks. Thank you in advance.

@editorialbot
Collaborator Author

Reminder set for @aalfons in four weeks

@jbytecode

Dear @alemermartinez & @aalfons, you can start your review by generating your task list; please just type

@editorialbot generate my checklist

Thank you in advance

@jbytecode

@editorialbot remind @alemermartinez in 15 days

@editorialbot
Collaborator Author

Reminder set for @alemermartinez in 15 days

@aalfons

aalfons commented Oct 8, 2024

Review checklist for @aalfons

Conflict of interest

  • I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

General checks

  • Repository: Is the source code for this software available at https://github.com/dragontaoran/sleev?
  • License: Does the repository contain a plain-text LICENSE or COPYING file with the contents of an OSI approved software license?
  • Contribution and authorship: Has the submitting author (@JiangmeiRubyXiong) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
  • Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines?
  • Data sharing: If the paper contains original data, data are accessible to the reviewers. If the paper contains no original data, please check this item.
  • Reproducibility: If the paper contains original results, results are entirely reproducible by reviewers. If the paper contains no original results, please check this item.
  • Human and animal research: If the paper contains original data or research on human subjects or animals, does it comply with JOSS's human participants research policy and/or animal research policy? If the paper contains no such data, please check this item.

Functionality

  • Installation: Does installation proceed as outlined in the documentation?
  • Functionality: Have the functional claims of the software been confirmed?
  • Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

  • A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
  • Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
  • Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
  • Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
  • Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

  • Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
  • A statement of need: Does the paper have a section titled 'Statement of need' that clearly states what problems the software is designed to solve, who the target audience is, and its relation to other work?
  • State of the field: Do the authors describe how this software compares to other commonly-used packages?
  • Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
  • References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

@editorialbot
Collaborator Author

👋 @alemermartinez, please update us on how your review is going (this is an automated reminder).

@alemermartinez

alemermartinez commented Oct 22, 2024

Review checklist for @alemermartinez

Conflict of interest

  • I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

General checks

  • Repository: Is the source code for this software available at https://github.com/dragontaoran/sleev?
  • License: Does the repository contain a plain-text LICENSE or COPYING file with the contents of an OSI approved software license?
  • Contribution and authorship: Has the submitting author (@JiangmeiRubyXiong) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
  • Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines?
  • Data sharing: If the paper contains original data, data are accessible to the reviewers. If the paper contains no original data, please check this item.
  • Reproducibility: If the paper contains original results, results are entirely reproducible by reviewers. If the paper contains no original results, please check this item.
  • Human and animal research: If the paper contains original data or research on human subjects or animals, does it comply with JOSS's human participants research policy and/or animal research policy? If the paper contains no such data, please check this item.

Functionality

  • Installation: Does installation proceed as outlined in the documentation?
  • Functionality: Have the functional claims of the software been confirmed?
  • Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

  • A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
  • Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
  • Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
  • Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
  • Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1. Contribute to the software 2. Report issues or problems with the software 3. Seek support

Software paper

  • Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
  • A statement of need: Does the paper have a section titled 'Statement of need' that clearly states what problems the software is designed to solve, who the target audience is, and its relation to other work?
  • State of the field: Do the authors describe how this software compares to other commonly-used packages?
  • Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
  • References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

@alemermartinez

@editorialbot commands

@editorialbot
Collaborator Author

Hello @alemermartinez, here are the things you can ask me to do:


# List all available commands
@editorialbot commands

# Get a list of all editors' GitHub handles
@editorialbot list editors

# Adds a checklist for the reviewer using this command
@editorialbot generate my checklist

# Set a value for branch
@editorialbot set joss-paper as branch

# Run checks and provide information on the repository and the paper file
@editorialbot check repository

# Check the references of the paper for missing DOIs
@editorialbot check references

# Generates the pdf paper
@editorialbot generate pdf

# Generates a LaTeX preprint file
@editorialbot generate preprint

# Get a link to the complete list of reviewers
@editorialbot list reviewers

@alemermartinez

Hi @jbytecode, I would like to ask for your guidance on how to communicate some issues I observed while reviewing the paper.

For instance, I believe that requiring users to manually select the B-splines basis, without providing a default option for the Bspline argument, makes the package less "user-friendly" than it could be. Including a default option for the univariate case (and for the bivariate setting, if possible) would be beneficial. In situations where computing k-fold cross-validation is computationally expensive, using a fixed number of B-splines as a starting point could be an effective approach. Additionally, it would be important to include some relevant methods to enhance usability.

The term “robust” has been used in the paper. I would appreciate some clarification on what “robust” refers to in this context.

Regarding related packages, are there any other commonly-used packages that estimate under these models, aside from the two developed by the author’s colleagues?

Additionally, I would suggest including system times in order to compare the different B-spline methods presented in the vignette.

I also found some typos, particularly in the mathematical equations. Some of these equations also extend beyond the right margin of the PDF file, affecting readability.

@jbytecode

@alemermartinez - Thank you for your review and suggestions. I think they're very clear and I hope @JiangmeiRubyXiong can handle most of them. You can also open pull requests that include your corrections on the manuscript and/or open issues that address your suggestions. Writing down your suggestions here is also a convenient way of reviewing. Thank you!

@JiangmeiRubyXiong - Could you please consider the suggestions above and update your status? Thank you in advance.

@alemermartinez

Thanks, @jbytecode. Below are the additional issues I found while reviewing the paper and vignette.

From the paper:

  • Line 26: The word “and” is missing.
  • Lines 45 and 54: The letter “beta” is missing in the PDF compilation.
  • Line 49: The symbol for independence might not be clear to the reader. I suggest replacing it with the word “independent”.
  • After equation (1): There is no mention in the main paper of the probability term that was left out.
  • Line 62: The equation exceeds the page margins and is unreadable.
  • Equation (2): The variable “s_n” is lowercase, but in the sum it appears in uppercase. Additionally, an “i” appears instead of a “k”, and some “k”s are missing.
  • Lines 115 and 117: The formulas exceed the right margin and can be broken before “degree=3”.
  • Lines 138 and 139: The formulas exceed the right margin as well.
  • Line 143: The output is missing.
  • Table 2: This table is not referenced in the text.

From the vignette:

  • The reference to Schumaker should be 2007, as this is the third edition (as mentioned in the paper), but it is cited as 1981 here.
  • Before Equation (1), the regression model is stated as Y=alpha + X * beta + epsilon. However, in the first paragraph, it states alpha + beta^T * X + epsilon. Is X the same in both cases? If so, a transpose might be missing.
  • Equation (3): There are two logarithms in the second sum.

@JiangmeiRubyXiong

Thanks @alemermartinez for catching all the typos and for the review! I will fix all the typos and discuss the major comments with my co-authors.

@editorialbot
Collaborator Author

👋 @aalfons, please update us on how your review is going (this is an automated reminder).

@jbytecode

May I have a status update from our reviewers? Thank you in advance.

@aalfons

aalfons commented Nov 18, 2024

@jbytecode - I've been extremely swamped in the past weeks with teaching tasks and deadlines for my own research. I won't have time to get around to the review before December, and I should have the review completed in the first half of December. Please note that this is still within the 8-10 weeks timeline that I communicated in the pre-review phase.

@jbytecode

@aalfons - The first half of December is perfectly fine with us. I was just checking in with everyone; thank you for your status update. Sorry for the bother.

@aalfons

aalfons commented Nov 18, 2024

@jbytecode - No worries and no need to apologize. I know it's important to check in every now and then in your role. (I hope my response didn't sound like I was bothered by the request. If it did, that was not my intention.)

@jbytecode

@aalfons - You are doing a job that you are not obligated to do, which is why I feel the need to be very polite. I know that was not your intention, and I sincerely thank you for the explanation.

@alemermartinez

Hi @jbytecode! In my case, I'm waiting for the authors to address the major comments I've raised.

@JiangmeiRubyXiong

@editorialbot set jossSubmission as branch

@editorialbot
Collaborator Author

Done! branch is now jossSubmission

@JiangmeiRubyXiong

Hi @alemermartinez @jbytecode, thank you for your patience. Here is our response to the reviewer comments so far. Note that we have moved the submission to a new branch. Feel free to let me know if you have any further questions!

Major Comments:

  1. For instance, I believe that requiring users to manually select the B-splines basis, without providing a default option for the B-spline argument, makes the package less "user-friendly" than it could be. Including a default option for the univariate case (and for the bivariate setting, if possible) would be beneficial.
    Response:
    We agree with the reviewer that the R package would be more user-friendly if there were a default pre-specified B-spline basis. However, we eventually decided not to incorporate a default specification because B-spline approximation is inherently a data-adaptive approach, and the proper choice of B-spline basis depends on the Phase I and II sample sizes and the number of continuous and discrete covariates (Tao et al. 2021). A pre-specified B-spline basis simply will not work for all scenarios, and we do not want to provide users with a default option that could fail in some scenarios. In the vignette that comes with the R package, we have provided detailed guidance on how one should construct the B-spline basis under different scenarios and the rationale for doing so. We believe that this information will better equip users with the insight and techniques needed to implement our methods in a valid and efficient manner.

  2. In situations where computing k-fold cross-validation is computationally expensive, using a fixed number of B-splines as a starting point could be an effective approach.
    Response:
    We totally agree with the reviewer that starting with a (properly chosen) fixed number of B-spline basis functions is computationally efficient and will likely result in valid inference in many scenarios. That being said, we provide the cross-validation function so that one has the option to choose the optimal B-spline basis in a more rigorous and data-driven manner. In fact, a good routine would be to start with a properly chosen fixed number of B-spline basis functions for exploratory analysis. If the results of the exploratory analysis look promising, one may then use cross-validation to either confirm or refine the initial choice of the B-splines.

  3. Additionally, it would be important to include some relevant methods to enhance usability.
    Response:
    Due to the page limit, we are unable to include more details in this manuscript. In the vignette that comes with the R package, we have included additional data analysis examples with multiple continuous covariates and binary outcomes.

  4. The term “robust” has been used in the paper. I would appreciate some clarification on what “robust” refers to in this context.
    Response:
    Certainly! “Robust” in this paper refers to the fact that the methods implemented in the R package model the error mechanisms nonparametrically and thus are robust to misspecification of the error models.

  5. Regarding related packages, are there any other commonly used packages that estimate under these models, aside from the two developed by the author’s colleagues?
    Response:
    We are not aware of other methods that can analyze two-phase studies with potentially correlated errors in the outcome and covariate data. The classical measurement error literature largely deals with scenarios where there are measurement errors in the outcome only or in the covariates only, but not both. In fact, this is a main selling point of our two recent methods publications (Tao et al. 2021; Lotspeich et al. 2022), whose methods are implemented in the R package sleev.

  6. Additionally, I would suggest including system times in order to compare the different B-spline methods presented in the vignette.
    Response:
    Certainly! System times are now included in both the main manuscript and the vignette that comes with the R package.

Minor comments:

  1. Line 26: The word “and” is missing.
    Thanks for spotting that. The word is now added.

  2. Lines: 45 and 54. The letter “beta” is missing in the PDF compilation.
    Thanks for noticing this. The letter is now added.

  3. Line 49: The symbol for independence might not be clear to the reader. I suggest replacing it with the word “independent”.
    Thanks for pointing out this ambiguity. It is now described as “We assume that W, U and \epsilon are independent”.

  4. After equation (1): There is no mention in the main paper of the probability term that was left out.
    Thanks for pointing this out. We agree that this could be confusing, and we added an explanation as follows: “….equation (1)…. where $P(\pmb{X}^*)$ is left out, because the error-prone covariates are fully observed and thus $P(\pmb{X}^*)$ can simply be estimated empirically.”

  5. Line 62: The equation exceeds the page margins and is unreadable.
    Thanks for noticing this. We now adjusted the equation so that it is within the page margin.

  6. Equation (2). The variable “s_n” is lowercase but in the sum it appears in uppercase. Additionally, an “i” appears instead of a “k” and some “k”’s are missing.
    Thanks for spotting this. The equation is now updated with the correct letter case for s_n. We did not find a discrepancy between “i” and “k” in the formula, but we added parentheses to avoid potential confusion.

  7. Lines 115 and 117: The formulas exceed the right margin and can be broken before “degree=3”.
    Thanks for noticing this. We now adjusted the equation so that it is within the page margin.

  8. Lines 138 and 139: The formulas exceed the right margin as well.
    Thanks for noticing this. We now adjusted the equation so that it is within the page margin.

  9. Lines 143: The output is missing.
    Thanks for noticing this. The output is now added.

  10. Table 2: This table is not referenced in the text.
    Thanks for catching this. Table 2 was supposed to be referenced in the paragraph where the mock data are introduced. It is now fixed and referenced at line 85.

  11. Vignette: The reference to Schumaker should be 2007, as this is the third edition (as mentioned in the paper), but it is cited as 1981 here.
    Thanks for pointing out this discrepancy. This reference is now fixed.

  12. Vignette: Before Equation (1), the regression model is stated as Y=alpha + X * beta + epsilon. However, in the first paragraph, it states alpha + beta^T * X + epsilon. Is X the same in both cases? If so, a transpose might be missing.
    Thanks for pointing out this inconsistency. The linear regression expressions are now all written as alpha + beta^T * X + epsilon.

  13. Vignette: Equation (3): There are two logarithms in the second sum.
    Thanks for picking up this typo. Equation (3) is now fixed.

References:

Lotspeich, S. C., B. E. Shepherd, G. Amorim, P. A. Shaw, and R. Tao. 2022. “Efficient Odds Ratio Estimation Under Two-Phase Sampling Using Error-Prone Data from a Multi-National HIV Research Cohort.” Biometrics 78 (4): 1674–85. https://doi.org/10.1111/biom.13512.

Tao, R., S. C. Lotspeich, G. Amorim, P. A. Shaw, and B. E. Shepherd. 2021. “Efficient Semiparametric Inference for Two-Phase Studies with Outcome and Covariate Measurement Errors.” Statistics in Medicine 40 (3): 725–38. https://doi.org/10.1002/sim.8799.

@alemermartinez

@editorialbot generate pdf

@editorialbot
Collaborator Author

👉📄 Download article proof 📄 View article proof on GitHub 📄 👈

@alemermartinez

Thanks, @JiangmeiRubyXiong, for addressing my comments. For clarification, I respond to each item below:
1. I understand your point. My suggestion was more about an “internal” procedure to select the B-spline basis. However, I realize this could require significantly more programming.
2. and 3. Agreed.
4. Regarding the explanation of what “robust” means in this context, my concern wasn’t about misunderstanding the concept. Rather, since “robust” can have different interpretations, I believe it would be helpful to explicitly clarify its meaning in the paper, perhaps with a statement similar to the one provided in the vignette.
5. Ok.
6. Excellent.
For Equation (2): Based on the result in Line 65, I still believe that w_i should be w_k in the first term of the two-term formula. Additionally, what do w and u without subscripts represent? Please compare this with Equation (2) in the vignette.
Lastly, I noticed two additional typos:
• Line 43: There is a missing space between the period and “Suppose.”
• At the end of Equation (1): There is a period at the end of the line.

@JiangmeiRubyXiong

Thanks @alemermartinez for the follow-up comments! Here is our response:

1. I understand your point. My suggestion was more about an “internal” procedure to select the B-spline basis. However, I realize this could require significantly more programming.
Thank you, we agree that an “internal” procedure to select the B-spline basis would be ideal. However, as the reviewer pointed out, it could be a nontrivial task and require a significant amount of programming to incorporate a “fully automatic” B-spline selection mechanism that is versatile enough to accommodate a wide range of scenarios with different numbers and types of covariates, phase one/two sample sizes, and regression models. While appealing, we consider such a major update to be outside the scope of the current paper/software, whose main goal is to provide, for the first time, user-friendly software that can perform efficient inference in two-phase measurement error studies. We will attempt a fully automatic B-spline selection mechanism in a future update of the package.

4. Regarding the explanation of what “robust” means in this context, my concern wasn’t about misunderstanding the concept. Rather, since “robust” can have different interpretations, I believe it would be helpful to explicitly clarify its meaning in the paper, perhaps with a statement similar to the one provided in the vignette.
Thank you for the clarification. We now modified our introduction of the SMLE at line 28 to clarify this:
"SMLE is an estimator that analyzes two-phase data that combines the error-prone data on all records with the validated data on a subsample. SMLE is highly efficient and robust because it utilizes all available data from both phases while not making any parametric assumption on the error model."

For Equation (2): Based on the result in Line 65, I still believe that w_i​ should be w_k​ in the first term of the two-term formula. Additionally, what do w and u without subscripts represent? Please compare this with Equation (2) in the vignette.
Thank you very much for pointing out this oversight. Equation (2) has been updated to use w_k and u_k. The unsubscripted w and u referred to the values of W and U in Equation (1); Equation (2) should instead use w_k and u_k, the distinct observed values of W and U, as defined in the paragraph before Equation (2).

Lastly, I noticed two additional typos:
• Line 43 (now line 45): There is a missing space between the period and “Suppose.”
• At the end of Equation (1): There is a period at the end of the line.

Thank you for pointing out these typos. They are now fixed in the manuscript.

@alemermartinez

Hi @jbytecode, I just finished the review. What should I do next?

@jbytecode

@alemermartinez - Thank you for completing your review. On your side, there is nothing to do for now.

@aalfons

aalfons commented Dec 21, 2024

@JiangmeiRubyXiong @jbytecode

First of all, my apologies for taking so long with the review. I do believe that the package implements important functionality that shows some promise, but in its current form the package seems unfinished and not ready for publication. I do emphasize the first part of the previous sentence, and I hope that you find the comments below useful for providing an updated version of the software and manuscript.

Here are my comments on the package:

  1. Functions in the package simply return lists. It's best practice for modeling functions to assign a class to the results and then provide print() and summary() methods to display the results, as well as coef(), fitted(), residuals(), and predict() methods to access or compute the corresponding information from the object. Some other methods might also be relevant, such as logLik(). Without adhering to these common design principles, I cannot recommend accepting the submission for publication. A minimal sketch of this pattern is included after this list.

  2. Why is a function to select the B-spline basis via cross-validation only available for the linear model, but not for the logistic model? I would assume that selecting an appropriate B-spline basis via cross-validation is also relevant for the logistic regression model.

  3. Arguments use a mix of different naming conventions: some use underscores for separation (Y_unval, X_unval, hn_scale), others do not (Bspline, noSE), and while most of them are in lower case, some are in upper case (TOL, MAX_ITER). I think some consistency in style would increase the usability of the functions. At least I kept getting things wrong when playing around with the package. I know that it is daunting to change the user interface, and there may be legitimate reasons not to do it. On the other hand, the package does not have any reverse dependencies at the moment, so there will never be a better time to implement such changes. For a transition period, you can keep dual arguments for some time (old name and new name), and give a warning when the old name of an argument is used (with a hint that this argument will be removed in the future).

  4. The behavior of functions linear2ph() and logistic2ph() seems different from common design principles for modeling functions in R. For instance, the coefficients component returned by these functions seems to contain a matrix of the coefficient estimates, standard errors, test statistics, and p-values. It's common that the model fitting function would only return the coefficient estimates, and that the summary() method would provide the inference. There can be legitimate reasons to deviate from this, for instance if the procedures for estimation and inference cannot easily be disentangled in the implementation. But then this difference from common design principles should clearly be addressed in the documentation so that the user knows what to expect. And even in that case, it would be beneficial to the user if the print(), summary(), and coef() methods still behave in the usual ways, i.e., similar to those for class "lm" as returned by function lm().

  5. The documentation of the returned object is also unclear. For instance, it is not mentioned that the output is a list (see also my above comment that this should be changed by assigning a class); the documentation jumps straight to describing the list components. Moreover, the description of the list components is also unclear. For instance, the description of the coefficients component simply states "Stores the analysis results." What are the "analysis results"? In what data type or class are they stored? Please provide complete information on the different list components and what data type or class is used to store them.

  6. The title of the help page for function cv_linear2ph() consists of two sentences and is identical to the Description section. Please provide proper (short) titles for all help files and make sure that they are substantially different from the Description section.

  7. None of the help files seem to contain a Details section. I think that providing more details on the estimators in the help files would be beneficial to the user. I do appreciate that there is a vignette and that the references to the papers are given in the help files, but I think that some basic details are also warranted in the help files themselves.

  8. I'm going to jump in on a discussion started by @alemermartinez. Thank you for explaining that it may not be possible to come up with a good B-spline basis that works well in all scenarios. However, in the paper you do state that the involvement of B-splines constitutes a difficulty of using the method in practice, but that your package "makes this method readily applicable for practitioners in a user-friendly way". I therefore think that it would be great if the package could provide some utility functions that make the use of B-splines a bit easier on the practitioner. Consider the case study in the manuscript, where you use separate B-spline bases for the two sex groups, which are then combined into one matrix in a specific way and added to the data frame containing the original information. There could be a utility function that does that for the user. For instance, the utility function could take the following arguments: a spline basis for each group, the original data set, and some way to identify which variable in the data set defines the groups. The utility function then returns the data set including the spline bases. Note that this function should work with an arbitrary number of groups, it should work if the data are not necessarily ordered according to the grouping variable, and the data should be returned in the original order of the observations. I can see that applications where different B-spline bases are needed for different groups are quite common, and such a function would greatly enhance usability (in line with your claims on usability) while making the handling of the different B-spline bases less error prone. A rough sketch of such a helper is included after this list.
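
To make points 1 and 4 more concrete, here is a rough sketch of what such methods could look like. The class name "linear2ph" and the component names (coefficients, se, converged) are placeholders of my own, not sleev's actual internals:

# Rough sketch only: wrap the fit in an S3 class and provide the usual methods.
as_linear2ph <- function(est, se, converged) {
  out <- list(coefficients = est,   # named numeric vector of estimates
              se = se,              # standard errors
              converged = converged)
  class(out) <- "linear2ph"
  out
}

print.linear2ph <- function(x, ...) {
  cat("Semiparametric two-phase linear regression\n\n")
  print(x$coefficients, ...)
  invisible(x)
}

coef.linear2ph <- function(object, ...) object$coefficients

summary.linear2ph <- function(object, ...) {
  z <- object$coefficients / object$se
  tab <- cbind(Estimate = object$coefficients, `Std. Error` = object$se,
               `z value` = z, `Pr(>|z|)` = 2 * pnorm(-abs(z)))
  out <- list(coefficients = tab, converged = object$converged)
  class(out) <- "summary.linear2ph"
  out
}

print.summary.linear2ph <- function(x, ...) {
  printCoefmat(x$coefficients, ...)
  invisible(x)
}

And for point 8, a possible shape for such a utility function. The name add_group_bsplines() and its arguments are hypothetical; a real implementation would need input checking and probably a per-group choice of df, which is kept constant here for simplicity:

# Hypothetical helper, not part of sleev: one B-spline basis per group level,
# each placed in its own block of columns, returned in the original row order.
add_group_bsplines <- function(data, x, group, df, degree = 3) {
  g <- factor(data[[group]])
  basis <- matrix(0, nrow = nrow(data), ncol = df * nlevels(g))
  for (j in seq_len(nlevels(g))) {
    rows <- which(g == levels(g)[j])
    cols <- (j - 1) * df + seq_len(df)
    basis[rows, cols] <- splines::bs(data[[x]][rows], df = df,
                                     degree = degree, intercept = TRUE)
  }
  colnames(basis) <- paste0("bs", seq_len(ncol(basis)))
  data.frame(data, basis)   # rows keep their original order
}

# e.g. (variable names taken from the manuscript example, df value arbitrary):
# data <- add_group_bsplines(mock.vccc, x = "VL_unval_l10", group = "Sex", df = 20)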

Major comments on the manuscript:

  1. I appreciate that there has already been a discussion with @alemermartinez regarding clarification of the term "robustness". In addition to clarifying the term in the manuscript, I think it would be beneficial if you could include a one- or two-sentence summary of what results you obtained in the referenced papers with respect to robustness. For instance, are there results for consistency under any type of error distribution, or theoretical bounds on the estimation error that hold for any error distribution? Or has robustness been studied empirically by means of simulation? If the latter, which error distributions have been studied? This would provide further details on the robustness of the method without requiring the reader to search for this information in the referenced papers.

  2. While I understand from the discussion with @alemermartinez that there are no other packages that implement these models, I would be surprised if there are no other packages that are in some form related to what this package is doing. While there may be no packages for such two-phase validated data, there are plenty of packages that address robust or nonparametric estimation of linear and logistic models, and there should also be packages that deal with measurement error/uncertainty in other ways. Hence I think it would be good if you include a paragraph (perhaps as a separate section) on related R packages and how they differ from sleev.

  3. In the example case study, why do you choose these specific transformations? Most notably, why is CD4 divided by 10 before taking the square root?

  4. The manuscript makes claims regarding computational performance compared to the packages logreg2ph and TwoPhaseReg. How large is the performance improvement? Can you please show this, either in the paper (if space allows) or somewhere else (for instance, the GitHub README)?

Minor comments on the manuscript:

  • Missing space between the two sentences on page 2, line 43.

  • Page 3, line 77: "The package can be installed and loaded from the R CRAN library or Github." First, the R in CRAN already stands for the R language, so "R CRAN" is redundant. Second, there is some confusion on terminology. In R, the term "library" refers to a path on a machine that contains installed packages. CRAN is a central repository of user-contributed packages, not a library. Third, packages are not loaded from CRAN or GitHub. They can be installed from CRAN or GitHub, but afterwards they are loaded from the local library. So this should be changed to something like "The package can be installed from CRAN or Github." You should also give the link to CRAN, not just to GitHub. A short sketch of both install commands is included after this list.

  • I think there is another missing space between the two sentences on page 4, line 134. More generally, @alemermartinez also found quite a few typos as well as notation and language issues, so please make sure to always carefully proofread and check your work.

  • Output from the example is not well formatted and difficult to read (page 4, lines 157-160). Please fix this. (And please use summary() to show the results once the requested changes to the package are implemented; see my comment above.)

  • Please add a reference for R. You can use citation() to find out how R wants to be cited.
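
For reference, the two install routes would look something like this (standard commands; the package name and GitHub path are taken from the submission details above):

# install the released version from CRAN
install.packages("sleev")

# or install the development version from GitHub (requires the remotes package)
remotes::install_github("dragontaoran/sleev")

library("sleev")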

Coding style of example in the manuscript and the GitHub README:

  • If one simply copies and pastes the example code from the paper and runs it in R, one gets error messages because the command to load package sleev is omitted. I suggest including library("sleev") in the example code. Note the quotes around the package name, cf. my next comment.
  • It's best practice to use data("mock.vccc") instead of data(mock.vccc), as the former highlights that mock.vccc is not (yet) an object that exists in the user's workspace.
  • Please always use appropriate spacing to make the code readable, for instance around operators or argument assignments, e.g., sum(mock.vccc$Sex == 0) / n or
Bspline0 <- splines::bs(x = mock.vccc$VL_unval_l10[mock.vccc$Sex == 0], df = sn_0, 
                        degree = 3, intercept = TRUE)
  • Using function cbind() in data <- data.frame(cbind(mock.vccc, Bspline)) is redundant, data <- data.frame(mock.vccc, Bspline) achieves the same result.

GitHub:

  • It seems that no LICENSE file is included in the GitHub repository, or am I overlooking it?

  • In the Install section of the README, you only mention how to install the package from GitHub. Please also include information on how to install the package from CRAN.

  • Please add community guidelines to the README that explain how third parties can 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support. Or have I overlooked this information?

All the best with making the necessary adjustments, and please let me know if anything in my comments is unclear.

@jbytecode

@JiangmeiRubyXiong - May I kindly have a status update regarding the reviewer's report, please?
