Allow uploading only validation tables, closes #16
dunnkers committed Jun 12, 2021
1 parent 37a52d0 commit c1aa265
Showing 15 changed files with 132 additions and 75 deletions.
26 changes: 13 additions & 13 deletions README.md
@@ -18,26 +18,26 @@ Now, create a [wandb](https://wandb.ai/) account and login to the CLI. We are no

Run ANOVA F-Value on Iris dataset:
```shell
fseval dataset=iris estimator@pipeline.ranker=anova_f_value
fseval dataset=iris estimator@ranker=anova_f_value
```

## Supported Feature Rankers
A [collection](https://github.com/dunnkers/fseval/tree/master/fseval/conf/estimator) of feature rankers is already built in and can be used without further configuration. Others need their dependencies installed first. List of rankers:

| Ranker | Dependency | Command line argument
--- | --- | ---
[ANOVA F-Value](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif) | \<no dep\> | `estimator@pipeline.ranker=anova_f_value`
[Boruta](https://github.com/scikit-learn-contrib/boruta_py) | `pip install Boruta` | `estimator@pipeline.ranker=boruta`
[Chi2](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html) | \<no dep\> | `estimator@pipeline.ranker=chi2`
[Decision Tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) | \<no dep\> | `estimator@pipeline.ranker=decision_tree`
[FeatBoost](https://github.com/amjams/FeatBoost) | `pip install git+https://github.com/dunnkers/FeatBoost.git@support-cloning` (ℹ️) | `estimator@pipeline.ranker=featboost`
[MultiSURF](https://github.com/EpistasisLab/scikit-rebate) | `pip install skrebate` | `estimator@pipeline.ranker=multisurf`
[Mutual Info](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html) | \<no dep\> | `estimator@pipeline.ranker=mutual_info`
[ReliefF](https://github.com/EpistasisLab/scikit-rebate) | `pip install skrebate` | `estimator@pipeline.ranker=relieff`
[Stability Selection](https://github.com/scikit-learn-contrib/stability-selection) | `pip install git+https://github.com/dunnkers/stability-selection.git@master matplotlib` (ℹ️) | `estimator@pipeline.ranker=stability_selection`
[TabNet](https://github.com/dreamquark-ai/tabnet) | `pip install pytorch-tabnet` | `estimator@pipeline.ranker=tabnet`
[XGBoost](https://xgboost.readthedocs.io/) | `pip install xgboost` | `estimator@pipeline.ranker=xgb`
[Infinite Selection](https://github.com/giorgioroffo/Infinite-Feature-Selection) | `pip install git+https://github.com/dunnkers/infinite-selection.git@master` (ℹ️) | `estimator@pipeline.ranker=infinite_selection`
[ANOVA F-Value](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif) | \<no dep\> | `estimator@ranker=anova_f_value`
[Boruta](https://github.com/scikit-learn-contrib/boruta_py) | `pip install Boruta` | `estimator@ranker=boruta`
[Chi2](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html) | \<no dep\> | `estimator@ranker=chi2`
[Decision Tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) | \<no dep\> | `estimator@ranker=decision_tree`
[FeatBoost](https://github.com/amjams/FeatBoost) | `pip install git+https://github.com/dunnkers/FeatBoost.git@support-cloning` (ℹ️) | `estimator@ranker=featboost`
[MultiSURF](https://github.com/EpistasisLab/scikit-rebate) | `pip install skrebate` | `estimator@ranker=multisurf`
[Mutual Info](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html) | \<no dep\> | `estimator@ranker=mutual_info`
[ReliefF](https://github.com/EpistasisLab/scikit-rebate) | `pip install skrebate` | `estimator@ranker=relieff`
[Stability Selection](https://github.com/scikit-learn-contrib/stability-selection) | `pip install git+https://github.com/dunnkers/stability-selection.git@master matplotlib` (ℹ️) | `estimator@ranker=stability_selection`
[TabNet](https://github.com/dreamquark-ai/tabnet) | `pip install pytorch-tabnet` | `estimator@ranker=tabnet`
[XGBoost](https://xgboost.readthedocs.io/) | `pip install xgboost` | `estimator@ranker=xgb`
[Infinite Selection](https://github.com/giorgioroffo/Infinite-Feature-Selection) | `pip install git+https://github.com/dunnkers/infinite-selection.git@master` (ℹ️) | `estimator@ranker=infinite_selection`


ℹ️ This library was customized to make it compatible with the fseval pipeline.
5 changes: 5 additions & 0 deletions fseval/conf/backend/wandb.yaml
@@ -0,0 +1,5 @@
# @package _global_
defaults:
  - override /callbacks:
      - wandb
  - override /storage_provider: wandb
5 changes: 2 additions & 3 deletions fseval/conf/my_config.yaml
@@ -11,9 +11,8 @@ defaults:
  - pipeline/rank_and_validate
  - dataset: synclf_easy
  - cv: kfold
  - callbacks:
      - wandb
  - storage_provider: wandb
  - callbacks: []
  - storage_provider: null
  - override hydra/job_logging: colorlog
  - override hydra/hydra_logging: colorlog

3 changes: 3 additions & 0 deletions fseval/conf/pipeline/rank_and_validate.yaml
@@ -9,3 +9,6 @@ pipeline: rank-and-validate
n_bootstraps: 2
n_jobs: 1
all_features_to_select: range(1, min(50, p) + 1)
upload_ranking_scores: true
upload_validation_scores: true
upload_best_scores: true
18 changes: 0 additions & 18 deletions fseval/conf/preset/extra_validator.yaml

This file was deleted.

3 changes: 3 additions & 0 deletions fseval/conf/storage_provider/wandb.yaml
@@ -1,2 +1,5 @@
defaults:
  - base_wandb_storage_provider

_target_: fseval.storage_providers.WandbStorageProvider
resume: ${callbacks.wandb.resume}
30 changes: 27 additions & 3 deletions fseval/config.py
@@ -9,18 +9,39 @@
from fseval.pipeline.estimator import TaskedEstimatorConfig
from fseval.pipeline.resample import ResampleConfig

cs = ConfigStore.instance()


@dataclass
class StorageProviderConfig:
    _target_: str = MISSING
    local_dir: Optional[str] = None


@dataclass
class WandbStorageProviderConfig(StorageProviderConfig):
    resume: Optional[str] = None


cs.store(
group="storage_provider",
name="base_wandb_storage_provider",
node=WandbStorageProviderConfig,
)


@dataclass
class BaseConfig:
    dataset: DatasetConfig = MISSING
    cv: CrossValidatorConfig = MISSING
    callbacks: Dict[str, Any] = field(default_factory=dict)
    storage_provider: Any = field(
        default_factory=lambda: dict(_target_="fseval.types.AbstractStorageProvider")
    storage_provider: Optional[StorageProviderConfig] = field(
        default_factory=lambda: StorageProviderConfig(
            _target_="fseval.storage_providers.MockStorageProvider"
        )
    )


cs = ConfigStore.instance()
cs.store(name="base_config", node=BaseConfig)


@@ -36,6 +57,9 @@ class RankAndValidateConfig(BaseConfig):
    n_bootstraps: int = MISSING
    n_jobs: Optional[int] = MISSING
    all_features_to_select: str = MISSING
    upload_ranking_scores: bool = MISSING
    upload_validation_scores: bool = MISSING
    upload_best_scores: bool = MISSING


cs.store(name="base_rank_and_validate", node=RankAndValidateConfig)
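With the structured configs above registered in Hydra's `ConfigStore`, the selected `storage_provider` entry can be resolved into a concrete object at runtime. A minimal sketch of that resolution step, assuming the standard `hydra.utils.instantiate` mechanism (illustrative only, not part of this commit):

```python
# Illustrative sketch: Hydra resolves the `_target_` of the configured storage
# provider into an object. The default config points at the no-op mock provider.
from hydra.utils import instantiate
from omegaconf import OmegaConf

cfg = OmegaConf.create({"_target_": "fseval.storage_providers.MockStorageProvider"})
storage_provider = instantiate(cfg)
storage_provider.save_pickle("ranking.pickle", obj={})  # no-op for the mock provider
```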
8 changes: 4 additions & 4 deletions fseval/pipeline/estimator.py
@@ -41,10 +41,10 @@ class TaskedEstimatorConfig(EstimatorConfig):
    multioutput: Optional[bool] = False
    multioutput_only: Optional[bool] = False
    requires_positive_X: Optional[bool] = False
    estimates_feature_importances: Optional[bool] = False  # returns importance scores
    estimates_feature_support: Optional[bool] = False  # returns feature subset
    estimates_feature_ranking: Optional[bool] = False  # returns feature ranking
    estimates_target: Optional[bool] = False  # can predict target
    estimates_feature_importances: Optional[bool] = False
    estimates_feature_support: Optional[bool] = False
    estimates_feature_ranking: Optional[bool] = False
    estimates_target: Optional[bool] = False
    # runtime properties
    task: Task = II("dataset.task")
    is_multioutput_dataset: bool = II("dataset.multioutput")
4 changes: 0 additions & 4 deletions fseval/pipelines/_callback_collection.py
@@ -27,10 +27,6 @@ def on_config_update(self, config: Dict):
        for callback in self._iterator:
            callback.on_config_update(config)

    def on_log(self, msg: Any, *args: Any):
        for callback in self._iterator:
            callback.on_log(msg, *args)

    def on_metrics(self, metrics):
        for callback in self._iterator:
            callback.on_metrics(metrics)
5 changes: 3 additions & 2 deletions fseval/pipelines/_experiment.py
@@ -44,8 +44,9 @@ def _step_text(self, step_name, step_number, estimator):
        estimator_repr = self._get_estimator_repr(estimator)

        return lambda secs: (
            TerminalColor.yellow(f"{overrides_text}")
            + f"{estimator_repr} ... {step_name} "
            overrides_text
            + TerminalColor.yellow(f"{estimator_repr}")
            + f" ... {step_name} "
            + "in "
            + TerminalColor.cyan(f"{format_timespan(secs)} ")
            + TerminalColor.green("✓ ")
60 changes: 37 additions & 23 deletions fseval/pipelines/rank_and_validate/_components.py
@@ -5,7 +5,7 @@
import numpy as np
import pandas as pd
from fseval.callbacks import WandbCallback
from fseval.types import TerminalColor
from fseval.types import TerminalColor as tc
from omegaconf import MISSING
from sklearn.base import clone
from tqdm import tqdm
@@ -65,7 +65,7 @@ def score(self, X, y, **kwargs):
scores["bootstrap_state"] = self.bootstrap_state

self.logger.info(
f"scored bootstrap_state={self.bootstrap_state} " + TerminalColor.green("✓")
f"scored bootstrap_state={self.bootstrap_state} " + tc.green("✓")
)
return scores

@@ -127,7 +127,7 @@ def score(self, X, y, **kwargs):
        ##### Ranking scores - aggregation
        agg_ranking_scores = ranking_scores.agg(["mean", "std", "var", "min", "max"])
        # print scores
        self.logger.info(f"{self.ranker.name} ranking scores:")
        self.logger.info(f"{tc.yellow(self.ranker.name)} ranking scores:")
        print(agg_ranking_scores)
        # send metrics
        agg_ranking_scores = agg_ranking_scores.to_dict()
@@ -151,7 +151,7 @@ def score(self, X, y, **kwargs):
        self.callbacks.on_metrics(dict(validator=agg_feature_scores_dict))
        # print scores
        print()
        self.logger.info(f"{self.validator.name} validation scores:")
        self.logger.info(f"{tc.yellow(self.validator.name)} validation scores:")
        agg_val_scores = val_scores_per_feature.mean().drop(columns=["bootstrap_state"])
        print(agg_val_scores)

@@ -168,30 +168,17 @@ def score(self, X, y, **kwargs):
# summary
summary = dict(best=best)

##### Upload tables
wandb_callback = getattr(self.callbacks, "wandb", False)
if wandb_callback:
##### Upload tables
self.logger.info(f"Uploading tables to wandb...")
wandb_callback = cast(WandbCallback, wandb_callback)

### upload best scores
best_subset_prefixed = best_subset.add_prefix("validator.")
best_ranker_prefixed = best_ranker.add_prefix("ranker.")
best_scores = pd.concat([best_subset_prefixed, best_ranker_prefixed])
best_scores_df = pd.DataFrame([best_scores])
wandb_callback.upload_table(best_scores_df, "best_scores")

### upload ranking scores
wandb_callback.upload_table(ranking_scores.reset_index(), "ranking_scores")
### ranking scores
if wandb_callback and self.upload_ranking_scores:
self.logger.info(f"Uploading ranking scores...")

### upload validation scores
wandb_callback.upload_table(validation_scores, "validation_scores")

### upload mean validation scores
all_agg_val_scores = agg_val_scores.reset_index()
wandb_callback.upload_table(all_agg_val_scores, "validation_scores_mean")

### upload raw rankings
## upload raw rankings
# feature importances
if self.ranker.estimates_feature_importances:
importances_table = self._get_ranker_attribute_table(
@@ -212,6 +199,33 @@ def score(self, X, y, **kwargs):
"feature_ranking_", "feature_ranking"
)
wandb_callback.upload_table(ranking_table, "feature_ranking")
self.logger.info(f"Tables uploaded {TerminalColor.green('✓')}")

## upload ranking scores
wandb_callback.upload_table(ranking_scores.reset_index(), "ranking_scores")

### validation scores
if wandb_callback and self.upload_validation_scores:
self.logger.info(f"Uploading validation scores...")

## upload validation scores
wandb_callback.upload_table(validation_scores, "validation_scores")

## upload mean validation scores
all_agg_val_scores = agg_val_scores.reset_index()
wandb_callback.upload_table(all_agg_val_scores, "validation_scores_mean")

### upload best scores
if wandb_callback and self.upload_best_scores:
self.logger.info(f"Uploading best scores...")

## best ranker- and validation scores
best_subset_prefixed = best_subset.add_prefix("validator.")
best_ranker_prefixed = best_ranker.add_prefix("ranker.")
best_scores = pd.concat([best_subset_prefixed, best_ranker_prefixed])
best_scores_df = pd.DataFrame([best_scores])
wandb_callback.upload_table(best_scores_df, "best_scores")

if wandb_callback:
self.logger.info(f"Tables uploaded {tc.green('✓')}")

return summary
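Condensed, the new control flow gates each table family behind both the presence of a wandb callback and its own `upload_*` flag. A simplified sketch of that pattern (hypothetical helper and names, not the pipeline's actual structure):

```python
# Simplified sketch of the gating above (hypothetical helper, not part of the
# codebase): a table family is uploaded only when a wandb callback exists and
# its corresponding flag is enabled.
def upload_selected_tables(wandb_callback, pipeline, tables):
    if not wandb_callback:
        return
    if pipeline.upload_ranking_scores:
        wandb_callback.upload_table(tables["ranking_scores"], "ranking_scores")
    if pipeline.upload_validation_scores:
        wandb_callback.upload_table(tables["validation_scores"], "validation_scores")
        wandb_callback.upload_table(tables["validation_scores_mean"], "validation_scores_mean")
    if pipeline.upload_best_scores:
        wandb_callback.upload_table(tables["best_scores"], "best_scores")
```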
3 changes: 3 additions & 0 deletions fseval/pipelines/rank_and_validate/_config.py
@@ -24,6 +24,9 @@ class RankAndValidatePipeline(Pipeline):
    n_bootstraps: int = MISSING
    n_jobs: Optional[int] = MISSING
    all_features_to_select: str = MISSING
    upload_ranking_scores: bool = MISSING
    upload_validation_scores: bool = MISSING
    upload_best_scores: bool = MISSING

    def _get_config(self):
        return {
4 changes: 3 additions & 1 deletion fseval/pipelines/rank_and_validate/_ranking_validator.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@

import numpy as np
import pandas as pd
from fseval.types import IncompatibilityError
from fseval.types import IncompatibilityError, TerminalColor
from omegaconf import MISSING
from sklearn.metrics import accuracy_score, log_loss, r2_score

@@ -45,6 +45,8 @@ def prefit(self):
        self.ranker._load_cache(self._cache_filename, self.storage_provider)

    def fit(self, X, y):
        self.logger.info(f"fitting ranker: " + TerminalColor.yellow(self.ranker.name))

        super(RankingValidator, self).fit(X, y)

    def postfit(self):
21 changes: 20 additions & 1 deletion fseval/storage_providers/__init__.py
@@ -1,3 +1,22 @@
from typing import Any, Callable

from fseval.config import StorageProviderConfig

from .wandb import WandbStorageProvider

__all__ = ["WandbStorageProvider"]

class MockStorageProvider(StorageProviderConfig):
    def save(self, filename: str, writer: Callable, mode: str = "w"):
        ...

    def save_pickle(self, filename: str, obj: Any):
        ...

    def restore(self, filename: str, reader: Callable, mode: str = "r") -> Any:
        ...

    def restore_pickle(self, filename: str) -> Any:
        ...


__all__ = ["WandbStorageProvider", "MockStorageProvider"]
12 changes: 9 additions & 3 deletions fseval/types.py
Expand Up @@ -42,35 +42,41 @@ def get_data(self) -> Tuple[List, List]:


class Callback(ABC):
    @abstractmethod
    def on_begin(self, config: DictConfig):
        ...

    @abstractmethod
    def on_config_update(self, config: Dict):
        ...

    def on_log(self, msg: Any, *args: Any):
        ...

    @abstractmethod
    def on_metrics(self, metrics):
        ...

    @abstractmethod
    def on_summary(self, summary: Dict):
        ...

    @abstractmethod
    def on_end(self, exit_code: Optional[int] = None):
        ...


class AbstractStorageProvider(ABC):
    @abstractmethod
    def save(self, filename: str, writer: Callable, mode: str = "w"):
        ...

    @abstractmethod
    def save_pickle(self, filename: str, obj: Any):
        ...

    @abstractmethod
    def restore(self, filename: str, reader: Callable, mode: str = "r") -> Any:
        ...

    @abstractmethod
    def restore_pickle(self, filename: str) -> Any:
        ...

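The `Callback` interface now marks its hooks as `@abstractmethod`s, so a concrete callback has to implement each of them explicitly. A minimal sketch of such a callback (hypothetical class, not part of this commit):

```python
# Hypothetical example built against the Callback ABC shown above; every
# abstract hook gets a trivial implementation that just prints.
from typing import Dict, Optional

from omegaconf import DictConfig

from fseval.types import Callback


class PrintCallback(Callback):
    def on_begin(self, config: DictConfig):
        print("pipeline starting")

    def on_config_update(self, config: Dict):
        print("config updated:", config)

    def on_metrics(self, metrics):
        print("metrics:", metrics)

    def on_summary(self, summary: Dict):
        print("summary:", summary)

    def on_end(self, exit_code: Optional[int] = None):
        print("pipeline finished, exit code:", exit_code)
```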
