
[FIX] Fixes for Tabular Regression #235

Merged: 18 commits into automl:development on Jun 16, 2021

Conversation

@ravinkohli (Contributor) commented on May 21, 2021:

This PR allows reproducibility in Tabular Regression, enables traditional methods for tabular regression, and adds tests for these.

Specifically, it makes the following changes:

  1. Sets the torch seed for the tabular regression pipeline (see the sketch after this list)
  2. Renames BaseClassifier in 'classifier_models' to BaseTraditionalLearner
  3. Refactors the traditional learners to remove duplicated code
  4. Adds score to TabularClassificationPipeline and improves the documentation for TabularRegressionPipeline
  5. Adds TraditionalTabularRegressionPipeline
  6. Adds tests for the traditional models
  7. Refactors the pipeline-score tests in test_tabular_classification and test_tabular_regression
  8. Adds TabularRegressionTask documentation to github.io
  9. Adds installation instructions for macOS

P.S. I couldn't think of better names, so please feel free to suggest some.
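For context, a minimal sketch of the seeding pattern that item 1 refers to; the function name and scope are illustrative, not this PR's exact code:

```python
import random

import numpy as np
import torch


def seed_everything(seed: int) -> None:
    """Seed the RNGs a tabular pipeline typically touches."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds the CPU generator
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)  # and every CUDA device
```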

@@ -74,8 +80,10 @@ def fit(self, X: Dict[str, Any], y: Any = None) -> autoPyTorchSetupComponent:

         # instantiate model
         self.model = self.build_model(input_shape=input_shape,
-                                       logger_port=X['logger_port'],
+                                       logger_port=X['logger_port'] if 'logger_port' in X else None,
                                        output_shape=output_shape)
Contributor:

I think we should always have the logger port here.

Contributor Author:

I think in one of the tests we didn't have it
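As an aside, the guard in the diff could also be written with dict.get; a tiny equivalent sketch (a standalone illustration, not this PR's code):

```python
from typing import Any, Dict, Optional


def resolve_logger_port(X: Dict[str, Any]) -> Optional[int]:
    # dict.get returns None when the key is absent, which is
    # equivalent to: X['logger_port'] if 'logger_port' in X else None
    return X.get('logger_port')
```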

assert 'val_preds' in model.fit_output.keys()
assert isinstance(model.fit_output['val_preds'], list)
assert len(model.fit_output['val_preds']) == len(fit_dictionary_tabular['val_indices'])
if model.model.is_classification:
Contributor:

Can you please add a unit test that makes sure that is_classification is set properly? I was not able to find where in the code we make sure that it is properly set up...
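For illustration, a minimal sketch of the kind of test being requested; the fixture names mirror the surrounding test snippets, and the dataset_properties lookup is an assumption:

```python
def test_is_classification_is_set(model, fit_dictionary_tabular):
    # The flag should agree with the task type the model was built for.
    task_type = fit_dictionary_tabular['dataset_properties']['task_type']
    expected = 'classification' in task_type
    assert model.model.is_classification == expected
```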

Contributor Author:

assert y_pred.shape[0] == len(fit_dictionary_tabular['val_indices'])
# Test that the classifier can score and
# that the result matches the stored results
score = model.score(fit_dictionary_tabular['X_train'][fit_dictionary_tabular['val_indices']],
Contributor:

Can you check the value of the score? I think this traditional classifier should achieve a pretty good score

Contributor Author:

Unfortunately, some of the classifiers fail to get a good score on some datasets; sometimes it's really low. In a later PR we can try to optimize the hyperparameters of the traditional classifiers so they score well in all scenarios, but for the purposes of this PR I feel it's fine.
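In the meantime, one possible compromise is a loose sanity check instead of a dataset-dependent threshold; a hypothetical helper, not this PR's code:

```python
import numpy as np


def check_score_is_sane(model, X_val, y_val) -> float:
    # Only assert that scoring runs and returns a finite number;
    # per-dataset accuracy thresholds are deferred to a later PR.
    score = model.score(X_val, y_val)
    assert np.isfinite(score)
    return score
```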

@franchuterivera (Contributor) left a comment:

Thanks a lot for the PR; with this we will be able to compare against other AutoML systems on regression.

Some minor questions/changes on this PR.

@ravinkohli changed the title from "Fixes for Tabular Regression" to "[FIX] Fixes for Tabular Regression" on May 28, 2021
@franchuterivera (Contributor) commented:

I just started running with this, but the first fix we need (noting it here so I do not forget) is to update https://github.com/automl/Auto-PyTorch/blob/refactor_development/MANIFEST.in with the new JSON files.

Also, I think we have to add the greedy portfolio there?
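For illustration, the kind of MANIFEST.in entry being suggested; the exact path to the new JSON files is an assumption:

```
# Hypothetical MANIFEST.in addition: ship the new JSON configs with the sdist.
recursive-include autoPyTorch/configs *.json
```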

@franchuterivera (Contributor) commented:

The only other question that I have is what should we do with the greedy portfolio for regression?

One possibility is to put in there the default configuration per neural network (the default per MLP, per shaped network, and so on). I see very good performance from the default configuration on Boston, but not so good from the other configurations, because the BO model has not yet learned what to do with them. The other option is to generate a portfolio for regression.

What do you think?

@ravinkohli (Contributor Author) commented:

> The only other question that I have is what should we do with the greedy portfolio for regression? [...] What do you think?

I think for now we can continue using the greedy-portfolio JSON configs, and when we set up the scripts to build the portfolio ourselves, we can build one for regression as well. However, since you are saying that the default configs give good results, we can compare them with the portfolio we have right now and use whichever gives the better performance boost.

results in a MemoryError.
y (np.ndarray):
Ground Truth labels
metric_name (str, default = 'r2'):
@nabenabe0928 (Collaborator) commented on Jun 7, 2021:

This is from sklearn, right? Can you add it?

Contributor Author:

No, these are our own metric names.

@@ -300,6 +299,7 @@ def test_pipeline_score(fit_dictionary_tabular_dummy):

     pipeline = TabularRegressionPipeline(
         dataset_properties=fit_dictionary_tabular_dummy['dataset_properties'],
+        random_state=1
Collaborator:

Why an integer?

Contributor Author:

Because we convert an integer seed into a RandomState instance with this line.
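A minimal sketch of the conversion being described, assuming it relies on sklearn's check_random_state (which normalizes None, an integer seed, or an existing RandomState):

```python
from sklearn.utils import check_random_state

# An integer seed becomes a np.random.RandomState instance, so all
# downstream components draw from one shared, reproducible stream.
random_state = check_random_state(1)
print(random_state.randint(0, 100))  # same draw on every run
```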

@@ -0,0 +1,70 @@
from typing import Any, Dict, Optional, Tuple, Type
Collaborator:

I will check later

@@ -0,0 +1,266 @@
import json
Collaborator:

I will check later

@@ -0,0 +1,366 @@
import logging.handlers
Collaborator:

I will check later

@@ -0,0 +1,185 @@
import warnings
Collaborator:

I will check later

@@ -0,0 +1,134 @@
import copy
Collaborator:

I will check later

codecov bot commented on Jun 9, 2021:

Codecov Report

Merging #235 (eb4b80e) into development (9a847e2) will increase coverage by 0.41%.
The diff coverage is 76.42%.


@@               Coverage Diff               @@
##           development     #235      +/-   ##
===============================================
+ Coverage        80.73%   81.14%   +0.41%     
===============================================
  Files              148      150       +2     
  Lines             8563     8559       -4     
  Branches          1323     1331       +8     
===============================================
+ Hits              6913     6945      +32     
+ Misses            1173     1131      -42     
- Partials           477      483       +6     
Impacted Files Coverage Δ
autoPyTorch/api/tabular_regression.py 96.87% <ø> (ø)
...h/pipeline/components/training/trainer/__init__.py 69.56% <0.00%> (-0.77%) ⬇️
autoPyTorch/utils/common.py 87.09% <ø> (+19.23%) ⬆️
...PyTorch/pipeline/traditional_tabular_regression.py 25.42% <25.42%> (ø)
autoPyTorch/evaluation/abstract_evaluator.py 75.89% <45.16%> (-2.28%) ⬇️
...tup/traditional_ml/traditional_learner/learners.py 82.97% <82.97%> (ø)
...line/components/setup/traditional_ml/base_model.py 73.61% <88.88%> (+0.59%) ⬆️
...tup/traditional_ml/traditional_learner/__init__.py 91.66% <90.00%> (ø)
...ml/traditional_learner/base_traditional_learner.py 94.80% <94.80%> (ø)
.../setup/traditional_ml/tabular_traditional_model.py 96.66% <96.66%> (ø)
... and 19 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9a847e2...eb4b80e.

@franchuterivera merged commit 3995391 into automl:development on Jun 16, 2021