Support for serializing detectors with scikit-learn backends and/or models #642
Conversation
This reverts commit af3be06.
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## master #642 +/- ##
==========================================
+ Coverage 78.83% 79.06% +0.22%
==========================================
Files 123 126 +3
Lines 8747 8819 +72
==========================================
+ Hits 6896 6973 +77
+ Misses 1851 1846 -5
Flags with carried forward coverage won't be shown.
@@ -0,0 +1,7 @@
from alibi_detect.saving._sklearn.saving import save_model_config as save_model_config_sk
from alibi_detect.saving._sklearn.loading import load_model as load_model_sk
Added `__init__.py` here for consistency with the `_tensorflow` subpackage. Not particularly important since `_saving._sklearn` is private anyway...
@parametrize_with_cases("data", cases=ContinuousData.data_synthetic_nd, prefix='data_')
@parametrize('model', [classifier_model, xgb_classifier_model])
def test_save_model_sk(data, model, tmp_path):
We arguably could be more specific about what tests we keep in `saving/tests/test_saving.py` and what ones we keep here.

For now, I've adopted moving tests that are very backend specific to the backend saving subpackages (e.g. `_save_model_config` is tested in `saving/_sklearn/tests/test_saving_sk.py` when `model` is a sklearn or xgboost model). However, for the more generic functional tests such as `test_save_ksdrift`, I test them with both `backend`s in `saving/tests/test_saving.py` to minimise duplication. There are some other tests in `saving/tests/test_saving.py`, such as `test_save_kernel`, that we could probably think about moving to the backend-specific testing files in the future...

Also note, our more usual convention would be to test `saving._sklearn.save_model` here, instead of the parent function `saving._save_model_config`. I am testing `_save_model_config` here so that we also test the logic inside, which should call `saving._sklearn.save_model` if it recognises `model` to be a `sklearn.base.BaseEstimator`.
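As an aside, a self-contained sketch of the kind of round-trip check these tests perform (this uses `joblib` directly rather than the internal `_save_model_config`/`load_model` helpers, so it is illustrative only):

```python
# Illustrative only: a standalone round-trip check in the spirit of the tests
# described above. The real tests go through alibi-detect's internal helpers.
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def test_sklearn_model_roundtrip(tmp_path):
    X = np.random.default_rng(0).normal(size=(20, 4))
    y = np.array([0, 1] * 10)
    model = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

    # Save and reload with joblib, then check predictions are unchanged.
    path = tmp_path / "model.joblib"
    joblib.dump(model, path)
    loaded = joblib.load(path)
    np.testing.assert_array_equal(model.predict(X), loaded.predict(X))
```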
raise NotImplementedError('Loading preprocess_fn for PyTorch not yet supported.')
# device = cfg['device']  # TODO - device should be set already - check
# kwargs.update({'model': kwargs['model'].to(device)})  # TODO - need .to(device) here?
# kwargs.update({'device': device})
elif model is None:
Handle the `model is None` case separately here, so that we don't miss the `kwargs.pop('device')` when `model is None` (it used to be done inside `prep_model_and_emb_tf` when `backend == 'tensorflow'`, even if `model is None`). When we add pytorch support a bit more logic might be needed here to avoid popping `device` when the embedding is a pytorch model...
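A rough sketch of the branching described here (function and names are illustrative, not the actual library code):

```python
# Sketch of the described control flow: 'device' is popped for TensorFlow
# models and also when there is no model at all, so it never leaks into
# TensorFlow preprocess_fn kwargs. A future PyTorch branch would keep it.
def _resolve_preprocess_model(model, backend: str, kwargs: dict):
    if model is not None and backend == 'tensorflow':
        kwargs.pop('device', None)  # TF models ignore an explicit device kwarg
        return model
    elif model is None:
        kwargs.pop('device', None)  # still drop 'device' even without a model
        return None
    raise NotImplementedError(f"backend={backend!r} not handled in this sketch")
```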
Returns
-------
The loaded model.
"""

# Load model
flavour = cfg['flavour']
`ModelConfig` and `EmbeddingConfig` now contain a new field called `flavour`, which specifies whether the model is a `tensorflow`, `sklearn`, or `pytorch` model. Struggled to think of a good name for this and am open to suggestions!
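For illustration, a minimal pydantic sketch of what such a config field might look like (this is not the actual alibi-detect `ModelConfig`, just the shape of the idea):

```python
# Minimal sketch: a config model carrying its own 'flavour'.
from typing import Literal, Optional

from pydantic import BaseModel


class ModelConfigSketch(BaseModel):
    src: str
    flavour: Literal['tensorflow', 'sklearn', 'pytorch'] = 'tensorflow'
    layer: Optional[int] = None


cfg = ModelConfigSketch(src='model', flavour='sklearn')
```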
I think `flavour` is good and it maps reasonably well to the `mlflow` concept of `flavor`, which more people would be familiar with: https://www.mlflow.org/docs/latest/models.html#built-in-model-flavors

Slightly more thorny question would be if we stick with the British spelling... :)
I vote for `flavorflav` 🤣!
Yeah, boyeee!
return model


def _load_embedding_config(cfg: dict, backend: str) -> Callable:  # TODO: Could type return more tightly
def _load_embedding_config(cfg: dict) -> Callable:  # TODO: Could type return more tightly
No longer needs to be passed `backend` as `flavour` is contained in the `EmbeddingConfig`.

The ultimate goal here is to make these artefact configs self-contained. This will pave the way towards a more OOP approach! (@mauicv)
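As a rough illustration (placeholder function, not the library's `_load_embedding_config`), the dispatch can now key off the config itself:

```python
# Sketch of a self-contained dispatch: the artefact config carries its own
# 'flavour', so no separate backend argument is needed.
def _load_embedding_sketch(cfg: dict):
    flavour = cfg['flavour']  # e.g. 'tensorflow', 'sklearn' or 'pytorch'
    if flavour == 'tensorflow':
        ...  # would delegate to the TensorFlow-specific loader here
    elif flavour == 'pytorch':
        raise NotImplementedError('Loading pytorch embeddings is not yet supported.')
    else:
        raise ValueError(f'Unknown flavour: {flavour}')
```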
elif key[-1] == 'tokenizer':
    obj = _load_tokenizer_config(src)
elif key[-1] == 'optimizer':
    obj = _load_optimizer_config(src, backend)
elif key[-1] == 'preprocess_fn':
    obj = _load_preprocess_config(src, backend)
    obj = _load_preprocess_config(src)
`backend` not passed to some of these functions since the artefact configs themselves now contain `flavour`.
elif key[-1] == 'embedding':
    obj = _load_embedding_config(src, backend)
    obj = _load_embedding_config(src)
elif key[-1] == 'tokenizer':
    obj = _load_tokenizer_config(src)
elif key[-1] == 'optimizer':
    obj = _load_optimizer_config(src, backend)
Some of these functions still have a `backend` arg since we still want some of these objects' backends to be constrained by the detector `backend`. For example, it would not make sense to allow a pytorch optimizer to be loaded with a tensorflow backend detector.
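A hedged sketch of that kind of constraint (the mapping and names are illustrative assumptions, not the actual implementation):

```python
# Illustrative: check that an artefact's flavour is compatible with the
# detector backend before loading it.
_COMPATIBLE = {
    'tensorflow': {'tensorflow'},
    'pytorch': {'pytorch'},
    'keops': {'pytorch'},  # assumption: keops detectors use torch optimizers
}


def _check_optimizer_flavour(flavour: str, backend: str) -> None:
    if flavour not in _COMPATIBLE.get(backend, set()):
        raise ValueError(f"A {flavour} optimizer cannot be used with backend='{backend}'.")
```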
if backend != 'tensorflow':
    raise NotImplementedError("Currently, saving is only supported with backend='tensorflow'.")
backend = detector.meta.get('backend', None)
if backend not in (None, 'tensorflow', 'sklearn'):
`None` is now included since detectors such as `KSDrift` now have no `backend` in their config.
raise ValueError('A sklearn backend is not available for this model')
return model
class SupportedModelsType:
    """
Previously we typed `model` fields as:

`model: Optional[Callable] = None`

and then validated them with:

`_validate_model = validator('model', allow_reuse=True, pre=True)(validate_model)`

This did not work for sklearn models since `sklearn.base.BaseEstimator` is not a `Callable`. We could simply change to `model: Optional[Any] = None`. However, I think a more elegant solution is to use a pydantic custom data type. This can then be used directly as:

`model: Optional[SupportedModelsType] = None`
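For illustration, a pydantic (v1-style) custom type along these lines could look roughly like this (a sketch only; the real `SupportedModelsType` may differ):

```python
# Sketch of a pydantic v1 custom data type that accepts TensorFlow and
# sklearn (incl. xgboost sklearn-API) models and rejects anything else.
import tensorflow as tf
from sklearn.base import BaseEstimator


class SupportedModelsTypeSketch:
    @classmethod
    def __get_validators__(cls):
        yield cls.validate

    @classmethod
    def validate(cls, model):
        if isinstance(model, (tf.keras.Model, BaseEstimator)):
            return model
        raise TypeError(f'{type(model)} is not a supported model type.')
```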
Note that the:

raise ValueError('A TensorFlow backend is not available for this model')

has been changed to:

raise TypeError("`backend='tensorflow'` but the `model` doesn't appear to be a TensorFlow supported model.")

as I think the previous error message was misleading.
@@ -97,8 +97,6 @@ class DetectorConfig(CustomBaseModel):
"""
name: str
"Name of the detector e.g. `MMDDrift`."
backend: Literal['tensorflow', 'pytorch', 'sklearn', 'keops'] = 'tensorflow'
No `backend` in `DetectorConfig` as this is now specific to each detector.
@fixture
def encoder_model(backend, current_cases):
These fixtures have all been moved from `test_saving.py`, so that they can be reused in `_sklearn/tests/test_saving_sk.py` etc.
@@ -76,213 +72,6 @@
# TODO - future: Some of the fixtures can/should be moved elsewhere (i.e. if they can be recycled for use elsewhere)


@fixture
Moved to `models.py`.
@parametrize_with_cases("data", cases=ContinuousData.data_synthetic_nd, prefix='data_')
@parametrize('model', [encoder_model])
@parametrize('layer', [None, -1])
def test_save_model(data, model, layer, backend, tmp_path):
Moved to `test_saving_tf.py`.
@@ -23,3 +23,4 @@ tox>=3.21.0, <4.0.0 # used to generate licence info via `make licenses`
twine>3.2.0, <4.0.0 # 4.x causes deps clashes with testing/requirements.txt, as requires rich>=12.0.0 -> requires typing-extensions>=4.0.0 -> too high for spacy and thinc!
packaging>=19.0, <22.0 # Used to check scipy version for CVMDrift test. Can be removed once python 3.6 support dropped (and scipy lower bound >=1.7.0).
codecov>=2.0.15, <3.0.0
xgboost>=1.3.2, <2.0.0 # Install for use in testing since we support serialization of xgboost models under the sklearn API
Very likely outside the scope of this PR, but is it true that we only support the […]? Reason for asking is two-fold:
Mostly LGTM, I left some questions in places I'm unsure about.
alibi_detect/saving/schemas.py
@@ -117,9 +126,15 @@ class ModelConfig(CustomBaseModel):
.. code-block :: toml

[model]
flavour "tensorflow"
missing `=`?
@@ -69,7 +69,7 @@ def load_model(filepath: Union[str, os.PathLike],
return model


def prep_model_and_emb(model: Optional[Callable], emb: Optional[TransformerEmbedding]) -> Callable:
def prep_model_and_emb(model: Callable, emb: Optional[TransformerEmbedding]) -> Callable:
Maybe a distraction here, but what about the typing? Of course since this is `tensorflow` specific we wouldn't use the new `SupportedModelsType`, but that raises a question that maybe we should have a custom `tensorflow` type. Perhaps one for later...
@@ -78,7 +81,10 @@ def save_model_config(model: Callable,
if model is not None:
    filepath = base_path.joinpath(local_path)
    save_model(model, filepath=filepath, save_dir='model')
    cfg_model = {'src': local_path.joinpath('model')}
    cfg_model = {
        'flavour': 'tensorflow',
Likely another distraction/follow-up... But I would say it would be more robust to define an enum once and then use its values everywhere to avoid possible typos, e.g. as in https://github.com/SeldonIO/alibi/blob/fa034a2ebf7cd69c1796cf81ac6a2862a15f03b7/alibi/utils/frameworks.py#L4-L6
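For example (a sketch of the suggestion, not code from this PR):

```python
# Define the framework strings once and reuse them, e.g. via a str-valued Enum
# (similar in spirit to alibi's utils/frameworks.py).
from enum import Enum


class Framework(str, Enum):
    TENSORFLOW = 'tensorflow'
    PYTORCH = 'pytorch'
    SKLEARN = 'sklearn'
    KEOPS = 'keops'


cfg_model = {'flavour': Framework.TENSORFLOW.value, 'src': 'model'}
```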
A sanity check would be that grepping the code base for a string like `'tensorflow'` shouldn't bring up very many hits :)
Will open a follow-up PR for this.
backend = cfg.pop('backend')  # popping so that cfg left as kwargs + `name` when passed to _init_detector
if backend.lower() != 'tensorflow':
    raise NotImplementedError('Loading detectors with PyTorch, sklearn or keops backend is not yet supported.')
backend = cfg.get('backend', None)
No more popping?
We no longer need to, as pydantic now only populates the config with `backend` for detectors that have a `backend` kwarg (before, we added `backend` to all detector configs as it also represented the "flavour" of preprocessing).
if backend not in ('tensorflow', 'pytorch', 'keops'):
    pytest.skip("Detector doesn't have this backend")
Slightly confused about these statements, is this future-proofing if we had more backends? Also looking at the `backend` fixture it's only parametrized with `tensorflow` and `sklearn`, so how does it work for `keops`, `pytorch`?
Yes, future-proofing, but more in the sense of future-proofing in anticipation of serialization support being extended.

This file (`test_saving.py`) is currently only parametrized with `backend = param_fixture("backend", ['tensorflow', 'sklearn'])` because we currently only support saving with `tensorflow` and `sklearn`. As we add support for more backends, we should add them to this parametrization.

The `if backend not in ('tensorflow', 'pytorch', 'keops'):` is intended to skip the test if any other `backend` is passed to it (e.g. atm `backend=='sklearn'` is passed), because `MMDDrift` only has `'tensorflow'`, `'pytorch'` and `'keops'` backends. There are various ways we could do this but I thought the above logic was the most maintainable, because there is a one-to-one mapping between the contents of the tuple and the actual backends supported by the detector.
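Putting the pieces together, a simplified sketch of the pattern (fixture and test names are illustrative, not the actual tests):

```python
# Sketch of the parametrize-then-skip pattern described above.
import pytest
from pytest_cases import param_fixture

# Only backends with serialization support are parametrized here.
backend = param_fixture("backend", ['tensorflow', 'sklearn'])


def test_save_mmddrift_sketch(backend, tmp_path):
    # MMDDrift only has tensorflow, pytorch and keops backends, so skip the
    # test for any other parametrized backend (e.g. 'sklearn').
    if backend not in ('tensorflow', 'pytorch', 'keops'):
        pytest.skip("Detector doesn't have this backend")
    ...  # save/load round-trip for MMDDrift would go here
```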
filepath
    Saved model directory.
load_dir
    Name of saved model folder within the filepath directory.
Not quite sure of the difference between the args? Is `filepath` the saved detector directory and `load_dir` the location within the detector directory? Why not just have `filepath`?
Fair point. This is really only for consistency with the `_tensorflow.loading.load_model` function. It has these two separate kwargs for use in the legacy `load_detector` here:

alibi-detect/alibi_detect/saving/_tensorflow/loading.py, lines 275 to 276 in 6fa1708:

try:  # legacy load_model behaviour was to return None if not found. Now it raises error, hence need try-except.
    model = load_model(filepath, load_dir='encoder')
Happy to remove now for sklearn if you think best?
Removed in c3fd225 :)
%### PyTorch |
Is the % intentional?
Yes. It will be uncommented once we add pytorch serialization support.
LGTM!
This PR implements save/load support for sklearn (and xgboost) backends and models.

The primary addition is a `saving._sklearn` subpackage, containing `save_model` and `load_model` functions that use `joblib` to serialize/deserialize sklearn (and xgboost) models (a minimal sketch of the idea is shown below).

Additionally, since this PR introduces a new "flavour" of models/backends to serialize (in addition to the existing `tensorflow`), it includes a number of changes to better handle serialization of multiple flavours/models:

- The `backend` config field is no longer tied to the `model` (or `preprocess_fn`) type. This gives us the flexibility to allow mixing and matching of models in the future, e.g. a scikit-learn preprocessing model could be used with a `ClassifierDrift(..., backend='tensorflow')` detector (note: an sklearn `preprocess_drift` does not yet exist, just an example!). This change also means we no longer have the confusing side-effect of `backend='tensorflow'` being written to the config files of detectors without a backend, such as `KSDrift`. We can also be more granular when validating `backend` for different detectors.
- New `flavour` field in `ModelConfig` and `EmbeddingConfig`. The aim is to make these configs more self-contained, i.e. `model.flavour` tells us whether `model.src` refers to a tensorflow, pytorch or sklearn model, and we can therefore decide how to load it. Self-contained/atomic configs should help with the move to OOP serialization that @mauicv and I have been talking about...
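A minimal sketch of the joblib-based idea (illustrative only; the actual functions in `alibi_detect.saving._sklearn` may differ in signature and directory layout):

```python
# Sketch: serialize/deserialize sklearn (or xgboost sklearn-API) models
# with joblib under a given directory.
import os
from pathlib import Path
from typing import Union

import joblib
from sklearn.base import BaseEstimator


def save_model(model: BaseEstimator, filepath: Union[str, os.PathLike]) -> None:
    """Serialize a sklearn (or xgboost sklearn-API) model with joblib."""
    path = Path(filepath)
    path.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, path / 'model.joblib')


def load_model(filepath: Union[str, os.PathLike]) -> BaseEstimator:
    """Deserialize a model previously saved with save_model."""
    return joblib.load(Path(filepath) / 'model.joblib')
```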