Comments on software #35

Open · 10 of 17 tasks
fmu2 opened this issue Jul 30, 2018 · 4 comments

fmu2 (Collaborator) commented Jul 30, 2018

  • Can we give a visual indication that there are three stages to advance through as the user selects data, trains models, and tests a model? Perhaps show these as tabs?
  • How does the user know that they should wait for training to finish? For a large dataset, will it look like the GUI hung?
  • I received this warning. It should be fixed in the next scikit-learn release, so we can ignore it for now: https://stackoverflow.com/questions/48687375/deprecation-error-in-sklearn-about-empty-array-without-any-empty-array-in-my-cod
  • We should start to look for functions and parts of the code that would support unit tests.
  • Can I save the best classifier, either the trained model or the description of its parameters? Can I save the table of models I tried and their performance? This may be out of scope for the initial software, but it is a potential v2 feature.
  • Can we have larger font sizes in the plots?
  • Is there a way for us to define jargon like classifier hyperparameters within the GUI? Or should we link to sklearn docs?
  • After we have the workshop guides ready, please document the code organization so that others could eventually help maintain it.
  • Should we display the class label (column name or valid values) in the Data Summary panel?
  • After the warning about too many samples for leave-one-out, the software still advanced to the next tab. Is that intended behavior? What is that sample threshold?
  • Is the Data Plot implemented or is it a placeholder?
  • In the iris dataset I had a classifier that reported a training AUROC of 1 in the table and figure legend, but the ROC plot was not a flat line. See the attached image.
  • Need better documentation or more robust file loading for the unlabeled data CSV file. Should we accept a file that has class labels and pop up a warning that the labels have been removed? (See the first sketch after this list.)
  • When saving the results, .csv should be appended to the end of the file name automatically if it isn't there already. (Also covered in the first sketch after this list.)
  • The default number of iterations for the neural network is too low. On toy dataset 1, it reaches the max iterations and fails training with default settings. (See the second sketch after this list.)
  • Implement plot figure saving.
  • Document expected warnings from the wrapper scripts.
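
A minimal sketch of the CSV handling suggested above, assuming pandas is available; the function names and the label column name are hypothetical, and the real application would pass in its own label column:

```python
import os
import warnings

import pandas as pd


def load_unlabeled_csv(path, label_column="class"):
    """Load an unlabeled CSV, dropping a label column if one is present."""
    data = pd.read_csv(path)
    if label_column in data.columns:
        # Warn rather than reject: accept the file, but tell the user
        # that the labels were removed.
        warnings.warn(
            "The file appears to contain class labels; "
            "the '{}' column has been removed.".format(label_column)
        )
        data = data.drop(columns=[label_column])
    return data


def ensure_csv_extension(filename):
    """Append .csv to a results file name if it is not there already."""
    if os.path.splitext(filename)[1].lower() != ".csv":
        filename += ".csv"
    return filename
```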
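On the neural network iterations item, a hedged sketch assuming the classifier is scikit-learn's MLPClassifier, whose default max_iter is 200; the raised value below is illustrative, not a tuned recommendation:

```python
from sklearn.neural_network import MLPClassifier

# The default max_iter=200 can stop the optimizer before convergence on
# some datasets, which raises a ConvergenceWarning and leaves the model
# undertrained. A larger cap gives training room to finish.
clf = MLPClassifier(max_iter=1000, random_state=0)
```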
agitter (Member) commented Jul 30, 2018

I converted the list to the GitHub checkbox format. We can check off the short-term issues and create new issues for specific items that take more than a few days or are longer-term efforts.

fmu2 (Collaborator, Author) commented Aug 6, 2018

Done:

  • added a title for each stage
  • increased font size from 6 to 7 in the plotting area
  • added links to sklearn docs on the 2nd page
  • sample threshold for leave-one-out: 50
  • Data Plot is only implemented for datasets with exactly two continuous features
  • metrics rounded to 3 decimal points instead of 2
  • training now runs in a new thread (terminating training halfway through is currently not supported); see the sketch below
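
A minimal sketch of the background-training pattern, assuming a PyQt5 GUI; the class, signal, and slot names are hypothetical, not the project's actual API:

```python
from PyQt5.QtCore import QThread, pyqtSignal


class TrainingThread(QThread):
    """Fit a model off the GUI thread so the window stays responsive."""

    finished_training = pyqtSignal(object)  # emits the fitted classifier

    def __init__(self, classifier, X, y, parent=None):
        super().__init__(parent)
        self.classifier = classifier
        self.X = X
        self.y = y

    def run(self):
        self.classifier.fit(self.X, self.y)
        self.finished_training.emit(self.classifier)


# In the main window, connect the signal to a slot that updates the GUI:
#   thread = TrainingThread(clf, X_train, y_train)
#   thread.finished_training.connect(self.on_training_done)
#   thread.start()
```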

To do:

  • allow user-supplied test set
  • save classifier parameters

agitter (Member) commented Aug 7, 2018

Excellent progress!

> allow user-supplied test set

This could wait for v2. It may be a less common use case.

For the documentation, I suggest documenting the code instead of writing a separate Markdown file. There are Python documentation conventions that can be used to generate documentation files from the code comments. sklearn is a great example of this because they have strong documentation. If you inspect their source code, for example the decision tree, you can see how the functions, arguments, examples, etc. are all documented in the code. They use Circle CI to automatically build and deploy the documentation, but we wouldn't need that complexity.

I believe that Sphinx is the underlying system that translates the comments to external documents. Let's explore that as an option for documentation.
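
For illustration, here is a short numpydoc-style docstring, the convention sklearn follows and one that Sphinx can render with the numpydoc or napoleon extension; the function itself is hypothetical:

```python
def evaluate_classifier(model, X, y, metric="accuracy"):
    """Evaluate a fitted classifier on held-out data.

    Parameters
    ----------
    model : object
        A fitted scikit-learn-style classifier exposing ``predict``.
    X : array-like of shape (n_samples, n_features)
        Feature matrix of the evaluation set.
    y : array-like of shape (n_samples,)
        True class labels.
    metric : str, default="accuracy"
        Name of the metric to compute.

    Returns
    -------
    score : float
        The computed metric value.
    """
    ...
```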

agitter (Member) commented Aug 26, 2018

I added some good suggestions from @csmagnano to the list above. Thank you for the testing and great feedback.

The wrapper script warnings he saw were:

```
ml4bio/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
ml4bio/lib/python3.5/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
  from numpy.core.umath_tests import inner1d
```

Both seem relatively harmless but will confuse beginners.
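
One way to keep these from confusing beginners is to filter the two specific messages at startup, before the affected packages are imported; a minimal sketch, with message patterns matching the warnings quoted above:

```python
import warnings

# Silence the two known-harmless warnings quoted above; everything else
# still surfaces. This must run before importing the numpy/sklearn
# modules that trigger them.
warnings.filterwarnings(
    "ignore", message="numpy.dtype size changed", category=RuntimeWarning
)
warnings.filterwarnings(
    "ignore",
    message="numpy.core.umath_tests is an internal NumPy module",
    category=DeprecationWarning,
)
```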

agitter transferred this issue from carpentries-incubator/ml4bio-workshop on Mar 4, 2021