
Functionality #10

Closed
mdhaber opened this issue Mar 28, 2023 · 4 comments

Comments


mdhaber commented Mar 28, 2023

One of the items in the JOSS review criteria #1 is:

Functionality: Have the functional claims of the software been confirmed?

Many have, but I still don't understand some things.

  1. I haven't seen an interface to the goodness-of-fit tests.


Do I understand correctly that MAS does not report the results of these tests directly? Instead, after fitting the distributions to the data, several goodness-of-fit tests are run to determine which distribution is best? If so, how are the two-sample tests used? (e.g., generate random data from the fitted distribution and compare that against the real data?) How are the results of multiple tests combined to choose the best distribution? I just need a high-level overview, since I maintain implementations of these functions and wrote scipy.stats.goodness_of_fit.

  2. Exactly how do the data transforms work? For example, does the "expectation" transform simply subtract the sample mean from the sample?


  3. I haven't run across these three tests when working with MAS. How are they used/accessed?

@MrShoenel (Owner)

I will address some points here:

  1. The G-o-F tests are conducted during distribution fitting by the class StatisticalTest. The process of pre-generating densities then uses the most applicable test to select the best fit, and the web application shows the test statistic in its table. However, there is no other explicit interface to access all the tests or their results, or to let the user pick which test to use (that would be rather infeasible, because the densities need to be pre-generated). The reason is that the web application is intended to be free of any configuration, i.e., load a dataset and go. I would suggest the following: similar to the other tests (see point 3), I would add functionality to export the result of every test to a spreadsheet, which could then be included when generating one's own dataset (and perhaps its results could be summarized and added to the automatically generated report). Lastly, I would edit the web application to show which test was used and what the p-value was. This way, an interested user could connect the dots and evaluate all tests against the one chosen (see the sketch after this list). What do you think?
  2. That depends. The functionality is in https://github.com/MrShoenel/metrics-as-scores/blob/master/src/metrics_as_scores/distribution/distribution.py#LL846C9-L846C18. This is a technical detail that I feel would go too far to include in the paper.
  3. These three tests are conducted when generating one's own dataset. The results are then included in the report that is created and can be rendered with Quarto. The results themselves are not otherwise used directly within the application.
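
To make the flow in point 1 a bit more concrete, here is a minimal sketch of a fit-and-select loop, written directly against scipy.stats rather than the MAS-internal StatisticalTest class (whose API is not shown in this thread); the candidate list, the test, and the ranking criterion are illustrative assumptions, not the project's exact implementation.

```python
# Minimal sketch of the fit-and-select loop described in point 1, using
# scipy.stats directly instead of the MAS-internal StatisticalTest class
# (whose API is not shown here). Candidate list, test, and ranking
# criterion are illustrative assumptions, not the project's exact code.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.lognormal(mean=0.5, sigma=0.75, size=500)  # stand-in metric sample

candidates = [stats.norm, stats.lognorm, stats.gamma]

results = []
for dist in candidates:
    params = dist.fit(data)          # maximum-likelihood fit of the candidate
    frozen = dist(*params)
    # One-sample goodness-of-fit test of the data against the fitted CDF.
    statistic, p_value = stats.kstest(data, frozen.cdf)
    results.append((dist.name, statistic, p_value))

# Keep the candidate with the smallest test statistic as the "best" fit;
# the per-candidate results could also be exported to a spreadsheet/CSV
# for manual inspection, as suggested above.
best_name, best_stat, best_p = min(results, key=lambda r: r[1])
print(f"best fit: {best_name} (statistic={best_stat:.4f}, p={best_p:.4f})")
```

In the application itself, this selection happens once, when the densities are pre-generated, which is why exposing a per-test choice to the user is not feasible.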


mdhaber commented Apr 13, 2023

  1. If that's something you want to do, OK, but it's not necessarily required to address this issue. Either way, I think there needs to be some additional information in the article or documentation about how this works. You are welcome to cite a reference for the details.

So that I understand better, can you answer here the question about how the two-sample versions of tests are used? IIUC, in the context of fitting, you are interested in comparing a distribution to data, so the one-sample statistics seem applicable. I don't see the need to compare two samples against one another here.

Also,

How are the results of multiple tests combined to choose the best distribution?

IIUC, only the "most applicable" test is used to select the best distribution, so you don't need to combine data from multiple tests. But how is that determined?

  2. That is fine, but it seems like the paper is written as though the meaning of these non-parametric transforms will be understood by the reader without additional explanation. You are welcome to cite a reference instead of describing them in complete detail.

  3. I'm not familiar with Quarto, but I see that it is covered in the example. From a statistical perspective, some alarms go off when one provides the user with a menu of p-values to choose from, but I will consider this resolved from a software perspective.


MrShoenel commented Apr 21, 2023

  1. The results of all tests are now also exported as human-readable CSVs, which allows further manual investigation of each fit and test. I made this clear in the paper, also mentioning that the one-sample KS test is now used for continuous fits and the two-sample Epps-Singleton test for discrete fits (because, unlike KS, it is applicable to discrete samples; since we fitted a distribution, we take its PPF to obtain a second sample uniformly/deterministically, which is then used in the test; see the sketch after this list). The results of these two tests are used to select the best-fitting distribution when generating densities for the web application (that is why the application does not mention the test's name or p-value). So while there is no actual interface for exploring all the results, as that was not a planned functionality, one can do the manual inspection if desired.
  2. I chose to briefly describe all transforms in the paper.
  3. These three extra tests are conducted to better understand how the contexts (as a whole) differ from each other. That is more of a high-level result and is therefore rendered into the report when creating a dataset. There is no menu of p-values to choose from. Rather, using the default threshold of $\alpha=0.05$, we determine for each test whether the null hypothesis should be rejected. When aggregating the results of all tests (for example, testing the same quantity across all contexts), the report then summarizes these findings to give an idea of whether the context matters.
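
To illustrate point 1 above, here is a hedged sketch of the two tests using scipy.stats: the one-sample KS test for a continuous fit and the two-sample Epps-Singleton test for a discrete fit, with the second sample derived deterministically from the fitted distribution's PPF. The evenly spaced probability grid used to build that second sample is an assumption about how the derivation might look, not MAS's exact implementation.

```python
# Sketch of the test choice described in point 1: one-sample KS for a
# continuous fit, two-sample Epps-Singleton for a discrete fit. The second
# sample for Epps-Singleton is obtained deterministically from the fitted
# distribution's PPF; the evenly spaced probability grid below is an
# assumption, not MAS's exact code.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Continuous case: fit a distribution and run the one-sample KS test.
cont_data = rng.gamma(shape=2.0, scale=1.5, size=400)
a, loc, scale = stats.gamma.fit(cont_data)
ks_stat, ks_p = stats.kstest(cont_data, stats.gamma(a, loc, scale).cdf)

# Discrete case: KS is not applicable, so compare the data against a second,
# deterministic sample produced by the fitted distribution's PPF.
disc_data = rng.poisson(lam=3.2, size=400)
lam_hat = disc_data.mean()                                  # MLE of the Poisson rate
probs = (np.arange(disc_data.size) + 0.5) / disc_data.size  # uniform grid in (0, 1)
second_sample = stats.poisson(lam_hat).ppf(probs)
es_stat, es_p = stats.epps_singleton_2samp(disc_data, second_sample)

print(f"KS (continuous fit):           stat={ks_stat:.4f}, p={ks_p:.4f}")
print(f"Epps-Singleton (discrete fit): stat={es_stat:.4f}, p={es_p:.4f}")
```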

@MrShoenel (Owner)

It feels as if this issue can be closed. Please re-open if more changes are required.
