Support for serializing detectors with scikit-learn backends and/or models #642

Merged (28 commits, Oct 12, 2022)
Conversation

@ascillitoe (Contributor) commented Sep 29, 2022

This PR implements save/load support for sklearn (and xgboost) backends and models.

The primary addition is a saving._sklearn subpackage, containing save_model and load_model functions that use joblib to serialize/deserialize sklearn (and xgboost) models.
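
For reference, a minimal sketch of what joblib-based serialization of this kind looks like (the file layout and function signatures here are illustrative assumptions, not the exact implementation):

import os
import joblib
from sklearn.base import BaseEstimator

def save_model(model: BaseEstimator, filepath: str, save_dir: str = 'model') -> None:
    # Serialize the estimator to <filepath>/<save_dir>/model.joblib
    model_dir = os.path.join(filepath, save_dir)
    os.makedirs(model_dir, exist_ok=True)
    joblib.dump(model, os.path.join(model_dir, 'model.joblib'))

def load_model(filepath: str) -> BaseEstimator:
    # Deserialize an estimator previously saved with save_model
    return joblib.load(os.path.join(filepath, 'model.joblib'))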

Additionally, since this PR introduces a new "flavour" of models/backends to serialize (in addition to the existing tensorflow), this PR includes a number of changes to better handle serialization of multiple flavours/models:

  1. The detector backend config field is no longer tied to the model (or preprocess_fn) type. This gives us the flexibility to mix and match models in the future, e.g. a scikit-learn preprocessing model could be used with a ClassifierDrift(..., backend='tensorflow') detector (note: an sklearn preprocess_drift does not yet exist, just an example!). This change also means we no longer have the confusing side-effect of backend='tensorflow' being written to the config files of detectors that have no backend, such as KSDrift. We can also be more granular when validating backend for different detectors.
  2. The above is achieved via a new flavour field in ModelConfig and EmbeddingConfig. The aim is to make these configs more self-contained i.e. model.flavour tells us whether model.src refers to a tensorflow, pytorch or sklearn model, and we can therefore decide how to load it. Self-contained/atomic configs should help with the move to OOP serialization that @mauicv and I have been talking about...
  3. The saving tests have been significantly refactored, with backend-specific tests being moved to the backend-specific saving subpackages.

@ascillitoe ascillitoe added WIP PR is a Work in Progress Type: Serialization Serialization proposals and changes labels Sep 29, 2022
@codecov-commenter commented Sep 29, 2022

Codecov Report

Merging #642 (c3fd225) into master (c6e78fc) will increase coverage by 0.22%.
The diff coverage is 91.33%.

@@            Coverage Diff             @@
##           master     #642      +/-   ##
==========================================
+ Coverage   78.83%   79.06%   +0.22%     
==========================================
  Files         123      126       +3     
  Lines        8747     8819      +72     
==========================================
+ Hits         6896     6973      +77     
+ Misses       1851     1846       -5     
Flag Coverage Δ
macos-latest-3.10 ?
ubuntu-latest-3.10 78.96% <91.33%> (+0.24%) ⬆️
ubuntu-latest-3.7 78.87% <91.12%> (+0.23%) ⬆️
ubuntu-latest-3.8 ?
ubuntu-latest-3.9 ?
windows-latest-3.9 76.12% <91.20%> (+0.26%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
alibi_detect/saving/_tensorflow/saving.py 81.95% <75.00%> (-0.14%) ⬇️
alibi_detect/saving/saving.py 86.44% <82.35%> (+0.19%) ⬆️
alibi_detect/saving/loading.py 90.45% <88.46%> (+0.61%) ⬆️
alibi_detect/saving/schemas.py 98.75% <91.11%> (-0.50%) ⬇️
alibi_detect/saving/_sklearn/__init__.py 100.00% <100.00%> (ø)
alibi_detect/saving/_sklearn/loading.py 100.00% <100.00%> (ø)
alibi_detect/saving/_sklearn/saving.py 100.00% <100.00%> (ø)
alibi_detect/saving/_tensorflow/loading.py 85.42% <100.00%> (+0.11%) ⬆️
alibi_detect/utils/_types.py 93.10% <0.00%> (-3.45%) ⬇️
alibi_detect/cd/base_online.py 88.05% <0.00%> (-3.15%) ⬇️
... and 3 more

@@ -0,0 +1,7 @@
from alibi_detect.saving._sklearn.saving import save_model_config as save_model_config_sk
from alibi_detect.saving._sklearn.loading import load_model as load_model_sk
@ascillitoe (Contributor, Author) commented Oct 5, 2022:

Added __init__.py here for consistency with the _tensorflow subpackage. Not particularly important since saving._sklearn is private anyway...


@parametrize_with_cases("data", cases=ContinuousData.data_synthetic_nd, prefix='data_')
@parametrize('model', [classifier_model, xgb_classifier_model])
def test_save_model_sk(data, model, tmp_path):
@ascillitoe (Contributor, Author) commented Oct 5, 2022:

We arguably could be more specific about which tests we keep in saving/tests/test_saving.py and which ones we keep here.

For now, I've adopted moving tests that are very backend-specific to the backend saving subpackages (e.g. _save_model_config is tested in saving/_sklearn/tests/test_saving_sk.py when model is a sklearn or xgboost model). However, for the more generic functional tests such as test_save_ksdrift, I test them with both backends in saving/tests/test_saving.py to minimise duplication. There are some other tests in saving/tests/test_saving.py, such as test_save_kernel, that we could probably think about moving to the backend-specific testing files in the future...

Also note, our more usual convention would be to test saving._sklearn.save_model here, instead of the parent function saving._save_model_config. I am testing _save_model_config here so that we also test the logic inside, which should call saving._sklearn.save_model if it recognises model to be a sklearn.base.BaseEstimator.

raise NotImplementedError('Loading preprocess_fn for PyTorch not yet supported.')
# device = cfg['device'] # TODO - device should be set already - check
# kwargs.update({'model': kwargs['model'].to(device)}) # TODO - need .to(device) here?
# kwargs.update({'device': device})
elif model is None:
@ascillitoe (Contributor, Author):

The model is None case is handled separately here, so that we don't miss the kwargs.pop('device') when model is None (it used to be done inside prep_model_and_emb_tf when backend == 'tensorflow', even if model was None). When we add pytorch support, a bit more logic might be needed here to avoid popping `device` when the embedding is a pytorch model...


Returns
-------
The loaded model.
"""

# Load model
flavour = cfg['flavour']
@ascillitoe (Contributor, Author):

ModelConfig and EmbeddingConfig now contain a new field called flavour, which specifies whether the model is a tensorflow, sklearn, or pytorch model. Struggled to think of a good name for this and am open to suggestions!
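
To illustrate the idea (a rough sketch only; the real function name and dispatch logic inside the loading module may differ):

from alibi_detect.saving._tensorflow.loading import load_model as load_model_tf
from alibi_detect.saving._sklearn.loading import load_model as load_model_sk

def _load_model(cfg: dict):
    # cfg is a resolved ModelConfig dict; dispatch on its 'flavour' field.
    flavour, src = cfg['flavour'], cfg['src']
    if flavour == 'tensorflow':
        return load_model_tf(src)
    elif flavour == 'sklearn':
        return load_model_sk(src)
    raise NotImplementedError(f'Loading of {flavour} models is not yet supported.')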

Contributor:

I think flavour is good and it maps reasonably well to the mlflow concept of flavor which more people would be familiar with: https://www.mlflow.org/docs/latest/models.html#built-in-model-flavors

Slightly more thorny question would be if we stick with the British spelling... :)

@mauicv (Collaborator) commented Oct 10, 2022:

I vote for flavorflav 🤣!

@ascillitoe (Contributor, Author):

Yeah, boyeee!


return model


def _load_embedding_config(cfg: dict, backend: str) -> Callable: # TODO: Could type return more tightly
def _load_embedding_config(cfg: dict) -> Callable: # TODO: Could type return more tightly
@ascillitoe (Contributor, Author):

backend no longer needs to be passed, as flavour is contained in the EmbeddingConfig.

The ultimate goal here is to make these artefact configs self-contained. This will pave the way towards a more OOP approach! (@mauicv )

elif key[-1] == 'tokenizer':
obj = _load_tokenizer_config(src)
elif key[-1] == 'optimizer':
obj = _load_optimizer_config(src, backend)
elif key[-1] == 'preprocess_fn':
obj = _load_preprocess_config(src, backend)
obj = _load_preprocess_config(src)
@ascillitoe (Contributor, Author):

backend is not passed to some of these functions, since the artefact configs themselves now contain flavour.

elif key[-1] == 'embedding':
obj = _load_embedding_config(src, backend)
obj = _load_embedding_config(src)
elif key[-1] == 'tokenizer':
obj = _load_tokenizer_config(src)
elif key[-1] == 'optimizer':
obj = _load_optimizer_config(src, backend)
@ascillitoe (Contributor, Author):

Some of these functions still have a backend arg since we still want some of these objects' backends to be constrained by the detector backend. For example, it would not make sense to allow a pytorch optimizer to be loaded with a tensorflow backend detector.

if backend != 'tensorflow':
raise NotImplementedError("Currently, saving is only supported with backend='tensorflow'.")
backend = detector.meta.get('backend', None)
if backend not in (None, 'tensorflow', 'sklearn'):
@ascillitoe (Contributor, Author):

None is now included since detectors such as KSDrift now have no backend in their config.

raise ValueError('A sklearn backend is not available for this model')
return model
class SupportedModelsType:
"""
@ascillitoe (Contributor, Author) commented Oct 5, 2022:

Previously we typed model fields as:

model: Optional[Callable] = None

and then validated them with:

_validate_model = validator('model', allow_reuse=True, pre=True)(validate_model)

This did not work for sklearn models since sklearn.base.BaseEstimator is not a Callable. We could simply change to model: Optional[Any] = None. However, I think a more elegant solution is to use a pydantic custom data type. This can then be used directly as:

model: Optional[SupportedModelsType] = None
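
A rough sketch of what such a custom data type can look like (pydantic v1 style; the exact set of supported model classes below is an illustrative assumption):

from typing import Any
from sklearn.base import BaseEstimator
import tensorflow as tf

SUPPORTED_MODELS = (BaseEstimator, tf.keras.Model)  # illustrative only

class SupportedModelsType:
    # Pydantic custom type: pydantic calls the validators yielded below
    # whenever a field annotated with this type is populated.
    @classmethod
    def __get_validators__(cls):
        yield cls.validate

    @classmethod
    def validate(cls, model: Any) -> Any:
        if not isinstance(model, SUPPORTED_MODELS):
            raise TypeError(f'{type(model)} is not a supported model type.')
        return model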

@ascillitoe (Contributor, Author):

Note that the:

raise ValueError('A TensorFlow backend is not available for this model')

has been changed to:

raise TypeError("`backend='tensorflow'` but the `model` doesn't appear to be a TensorFlow supported model.")

as I think the previous error message was misleading.

@@ -97,8 +97,6 @@ class DetectorConfig(CustomBaseModel):
"""
name: str
"Name of the detector e.g. `MMDDrift`."
backend: Literal['tensorflow', 'pytorch', 'sklearn', 'keops'] = 'tensorflow'
@ascillitoe (Contributor, Author):

There is no longer a backend field in DetectorConfig, as this is now specific to each detector.



@fixture
def encoder_model(backend, current_cases):
@ascillitoe (Contributor, Author):

These fixtures have all been moved from test_saving.py, so that they can be reused in _sklearn/tests/test_saving_sk.py etc.

@@ -76,213 +72,6 @@
# TODO - future: Some of the fixtures can/should be moved elsewhere (i.e. if they can be recycled for use elsewhere)


@fixture
@ascillitoe (Contributor, Author):

Moved to models.py.

@parametrize_with_cases("data", cases=ContinuousData.data_synthetic_nd, prefix='data_')
@parametrize('model', [encoder_model])
@parametrize('layer', [None, -1])
def test_save_model(data, model, layer, backend, tmp_path):
@ascillitoe (Contributor, Author):

Moved to test_saving_tf.py.

@@ -23,3 +23,4 @@ tox>=3.21.0, <4.0.0 # used to generate licence info via `make licenses`
twine>3.2.0, <4.0.0 # 4.x causes deps clashes with testing/requirements.txt, as requires rich>=12.0.0 -> requires typing-extensions>=4.0.0 -> too high for spacy and thinc!
packaging>=19.0, <22.0 # Used to check scipy version for CVMDrift test. Can be removed once python 3.6 support dropped (and scipy lower bound >=1.7.0).
codecov>=2.0.15, <3.0.0
xgboost>=1.3.2, <2.0.0 # Install for use in testing since we support serialization of xgboost models under the sklearn API
@ascillitoe (Contributor, Author):

xgboost needs to be installed for testing since an xgboost test model is defined and used in the new saving tests.

@ascillitoe ascillitoe removed the WIP PR is a Work in Progress label Oct 5, 2022
@jklaise (Contributor) commented Oct 7, 2022

Very likely outside the scope of this PR, but is it true that we only support the sklearn interface for xgboost models where they can be used in the library (ClassifierDrift being the only place?)?

Reason for asking is two-fold:

@jklaise jklaise self-requested a review October 7, 2022 14:07
@jklaise (Contributor) left a comment:

Mostly LGTM, I left some questions in places I'm unsure about.

@@ -117,9 +126,15 @@ class ModelConfig(CustomBaseModel):
.. code-block :: toml

[model]
flavour "tensorflow"
Contributor:

missing = ?

alibi_detect/saving/schemas.py (outdated; conversation resolved)
@@ -69,7 +69,7 @@ def load_model(filepath: Union[str, os.PathLike],
return model


def prep_model_and_emb(model: Optional[Callable], emb: Optional[TransformerEmbedding]) -> Callable:
def prep_model_and_emb(model: Callable, emb: Optional[TransformerEmbedding]) -> Callable:
Contributor:

Maybe a distraction here, but what about the typing? Of course, since this is tensorflow-specific we wouldn't use the new SupportedModelsType, but that raises the question of whether we should have a custom tensorflow type. Perhaps one for later...

@@ -78,7 +81,10 @@ def save_model_config(model: Callable,
if model is not None:
filepath = base_path.joinpath(local_path)
save_model(model, filepath=filepath, save_dir='model')
cfg_model = {'src': local_path.joinpath('model')}
cfg_model = {
'flavour': 'tensorflow',
Contributor:

Likely another distraction/follow-up... But I would say it would be more robust to define an enum once and then use its values everywhere to avoid possible typos, e.g. as in https://github.com/SeldonIO/alibi/blob/fa034a2ebf7cd69c1796cf81ac6a2862a15f03b7/alibi/utils/frameworks.py#L4-L6
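
For illustration, a sketch of that suggestion (the enum name and members below are assumptions, mirroring the linked alibi approach):

from enum import Enum

class Framework(str, Enum):
    TENSORFLOW = 'tensorflow'
    PYTORCH = 'pytorch'
    SKLEARN = 'sklearn'
    KEOPS = 'keops'

# e.g. cfg_model = {'flavour': Framework.TENSORFLOW.value, ...} rather than
# hard-coding the 'tensorflow' string at each call site.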

Contributor:

A sanity check would be that grepping the code base for a string like 'tensorflow' shouldn't bring up very many hits :)

@ascillitoe (Contributor, Author):

Will open a follow-up PR for this.

backend = cfg.pop('backend') # popping so that cfg left as kwargs + `name` when passed to _init_detector
if backend.lower() != 'tensorflow':
raise NotImplementedError('Loading detectors with PyTorch, sklearn or keops backend is not yet supported.')
backend = cfg.get('backend', None)
Contributor:

No more popping?

@ascillitoe (Contributor, Author) commented Oct 11, 2022:

We no longer need to, as pydantic now only populates the config with backend for detectors that have a backend kwarg (previously we added backend to all detector configs, since it also represented the "flavour" of preprocessing).


Comment on lines +205 to +206
if backend not in ('tensorflow', 'pytorch', 'keops'):
pytest.skip("Detector doesn't have this backend")
Contributor:

Slightly confused about these statements, is this future-proofing if we had more backends? Also looking at the backend fixture it's only parametrized with tensorflow and sklearn, so how does it work for keops, pytorch?

@ascillitoe (Contributor, Author):

Yes, future-proofing, in anticipation of serialization support being extended to more backends.

This file (test_saving.py) is currently only parametrized with backend = param_fixture("backend", ['tensorflow', 'sklearn']) because we currently only support saving with tensorflow and sklearn. As we add support for more backends, we should add them to this parametrization.

The if backend not in ('tensorflow', 'pytorch', 'keops'): is intended to skip the test if any other backend is passed to it (e.g. atm backend=='sklearn' is passed), because MMDDrift only has 'tensorflow', 'pytorch' and 'keops' backends. There are various ways we could do this but I thought the above logic was the most maintainable because there is a one-to-one mapping between the contents of the tuple and the actual backends supported by the detector.
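
A condensed sketch of the pattern described above (the test name and body are illustrative; param_fixture comes from pytest-cases):

import pytest
from pytest_cases import param_fixture

# Only backends with serialization support are parametrized; extend as support grows.
backend = param_fixture("backend", ['tensorflow', 'sklearn'])

def test_save_mmddrift(backend, tmp_path):
    # MMDDrift only supports these backends, so skip any other parametrized value.
    if backend not in ('tensorflow', 'pytorch', 'keops'):
        pytest.skip("Detector doesn't have this backend")
    # ... instantiate MMDDrift with `backend`, save and load it, then compare predictions ...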

@ascillitoe ascillitoe requested a review from mauicv October 10, 2022 08:22
filepath
Saved model directory.
load_dir
Name of saved model folder within the filepath directory.
Collaborator:

Not quite sure of the difference between the args? Is filepath the saved detector directory and the load_dir the location within the detector directory? Why not just have filepath?

@ascillitoe (Contributor, Author) commented Oct 11, 2022:

Fair point. This is really only for consistency with the _tensorflow.loading.load_model function. It has these two separate kwargs for use in the legacy load_detector here:

try: # legacy load_model behaviour was to return None if not found. Now it raises error, hence need try-except.
model = load_model(filepath, load_dir='encoder')

Happy to remove now for sklearn if you think best?

@ascillitoe (Contributor, Author):

Removed in c3fd225 :)



%### PyTorch
Collaborator:

Is the % intentional?

@ascillitoe (Contributor, Author):

Yes. It will be uncommented once we add pytorch serialization support.

@mauicv (Collaborator) left a comment:

LGTM!

@ascillitoe ascillitoe merged commit 1898ad2 into SeldonIO:master Oct 12, 2022
@ascillitoe ascillitoe added this to the v0.11.0 milestone Oct 19, 2022
Labels: Type: Serialization (Serialization proposals and changes)
Projects: None yet
4 participants