Support for serializing detectors with scikit-learn backends and/or models #642

Merged (28 commits, Oct 12, 2022)
Conversation

@ascillitoe (Contributor) commented Sep 29, 2022

This PR implements save/load support for sklearn (and xgboost) backends and models.

The primary addition is a saving._sklearn subpackage, containing save_model and load_model functions that use joblib to serialize/deserialize sklearn (and xgboost) models.
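
For reference, a minimal sketch of what joblib-based serialization of this kind looks like (the file layout and function signatures here are illustrative assumptions, not the exact implementation):

import os
import joblib
from sklearn.base import BaseEstimator

def save_model(model: BaseEstimator, filepath: str, save_dir: str = 'model') -> None:
    # Serialize the estimator to <filepath>/<save_dir>/model.joblib
    model_dir = os.path.join(filepath, save_dir)
    os.makedirs(model_dir, exist_ok=True)
    joblib.dump(model, os.path.join(model_dir, 'model.joblib'))

def load_model(filepath: str) -> BaseEstimator:
    # Deserialize an estimator previously saved with save_model
    return joblib.load(os.path.join(filepath, 'model.joblib'))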

Additionally, since this PR introduces a new "flavour" of models/backends to serialize (in addition to the existing tensorflow), this PR includes a number of changes to better handle serialization of multiple flavours/models:

  1. The detector backend config field is no longer tied to the model (or preprocess_fn) type. This gives us the flexibility to mix and match models in the future, e.g. a scikit-learn preprocessing model could be used with a ClassifierDrift(..., backend='tensorflow') detector (note: an sklearn preprocess_drift does not yet exist, just an example!). This change also means we no longer have the confusing side-effect of backend='tensorflow' being written to the config files of detectors that have no backend, such as KSDrift. We can also be more granular when validating backend for different detectors.
  2. The above is achieved via a new flavour field in ModelConfig and EmbeddingConfig. The aim is to make these configs more self-contained i.e. model.flavour tells us whether model.src refers to a tensorflow, pytorch or sklearn model, and we can therefore decide how to load it. Self-contained/atomic configs should help with the move to OOP serialization that @mauicv and I have been talking about...
  3. The saving tests have been significantly refactored, with backend-specific tests being moved to the backend-specific saving subpackages.

@ascillitoe ascillitoe added WIP PR is a Work in Progress Type: Serialization Serialization proposals and changes labels Sep 29, 2022
@codecov-commenter commented Sep 29, 2022

Codecov Report

Merging #642 (c3fd225) into master (c6e78fc) will increase coverage by 0.22%.
The diff coverage is 91.33%.

@@            Coverage Diff             @@
##           master     #642      +/-   ##
==========================================
+ Coverage   78.83%   79.06%   +0.22%     
==========================================
  Files         123      126       +3     
  Lines        8747     8819      +72     
==========================================
+ Hits         6896     6973      +77     
+ Misses       1851     1846       -5     
Flag Coverage Δ
macos-latest-3.10 ?
ubuntu-latest-3.10 78.96% <91.33%> (+0.24%) ⬆️
ubuntu-latest-3.7 78.87% <91.12%> (+0.23%) ⬆️
ubuntu-latest-3.8 ?
ubuntu-latest-3.9 ?
windows-latest-3.9 76.12% <91.20%> (+0.26%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
alibi_detect/saving/_tensorflow/saving.py 81.95% <75.00%> (-0.14%) ⬇️
alibi_detect/saving/saving.py 86.44% <82.35%> (+0.19%) ⬆️
alibi_detect/saving/loading.py 90.45% <88.46%> (+0.61%) ⬆️
alibi_detect/saving/schemas.py 98.75% <91.11%> (-0.50%) ⬇️
alibi_detect/saving/_sklearn/__init__.py 100.00% <100.00%> (ø)
alibi_detect/saving/_sklearn/loading.py 100.00% <100.00%> (ø)
alibi_detect/saving/_sklearn/saving.py 100.00% <100.00%> (ø)
alibi_detect/saving/_tensorflow/loading.py 85.42% <100.00%> (+0.11%) ⬆️
alibi_detect/utils/_types.py 93.10% <0.00%> (-3.45%) ⬇️
alibi_detect/cd/base_online.py 88.05% <0.00%> (-3.15%) ⬇️
... and 3 more

@@ -0,0 +1,7 @@
from alibi_detect.saving._sklearn.saving import save_model_config as save_model_config_sk
from alibi_detect.saving._sklearn.loading import load_model as load_model_sk
@ascillitoe (Contributor, Author) commented Oct 5, 2022:

Added __init__.py here for consistency with the _tensorflow subpackage. Not particularly important since saving._sklearn is private anyway...


@parametrize_with_cases("data", cases=ContinuousData.data_synthetic_nd, prefix='data_')
@parametrize('model', [classifier_model, xgb_classifier_model])
def test_save_model_sk(data, model, tmp_path):
@ascillitoe (Contributor, Author) commented Oct 5, 2022:

We arguably could be more specific about which tests we keep in saving/tests/test_saving.py and which ones we keep here.

For now, I've adopted moving tests that are very backend-specific to the backend saving subpackages (e.g. _save_model_config is tested in saving/_sklearn/tests/test_saving_sk.py when model is a sklearn or xgboost model). However, for the more generic functional tests such as test_save_ksdrift, I test them with both backends in saving/tests/test_saving.py to minimise duplication. There are some other tests in saving/tests/test_saving.py, such as test_save_kernel, that we could probably think about moving to the backend-specific testing files in the future...

Also note, our more usual convention would be to test saving._sklearn.save_model here, instead of the parent function saving._save_model_config. I am testing _save_model_config here so that we also test the logic inside, which should call saving._sklearn.save_model if it recognises model to be a sklearn.base.BaseEstimator.

raise NotImplementedError('Loading preprocess_fn for PyTorch not yet supported.')
# device = cfg['device'] # TODO - device should be set already - check
# kwargs.update({'model': kwargs['model'].to(device)}) # TODO - need .to(device) here?
# kwargs.update({'device': device})
elif model is None:
@ascillitoe (Contributor, Author):

The model is None case is handled separately here, so that we don't miss the kwargs.pop('device') when model is None (it used to be done inside prep_model_and_emb_tf when backend == 'tensorflow', even if model was None). When we add pytorch support, a bit more logic might be needed here to avoid popping `device` when the embedding is a pytorch model...


Returns
-------
The loaded model.
"""

# Load model
flavour = cfg['flavour']
@ascillitoe (Contributor, Author):

ModelConfig and EmbeddingConfig now contain a new field called flavour, which specifies whether the model is a tensorflow, sklearn, or pytorch model. Struggled to think of a good name for this and am open to suggestions!
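
To illustrate the idea (a rough sketch only; the real function name and dispatch logic inside the loading module may differ):

from alibi_detect.saving._tensorflow.loading import load_model as load_model_tf
from alibi_detect.saving._sklearn.loading import load_model as load_model_sk

def _load_model(cfg: dict):
    # cfg is a resolved ModelConfig dict; dispatch on its 'flavour' field.
    flavour, src = cfg['flavour'], cfg['src']
    if flavour == 'tensorflow':
        return load_model_tf(src)
    elif flavour == 'sklearn':
        return load_model_sk(src)
    raise NotImplementedError(f'Loading of {flavour} models is not yet supported.')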

Contributor:

I think flavour is good and it maps reasonably well to the mlflow concept of flavor which more people would be familiar with: https://www.mlflow.org/docs/latest/models.html#built-in-model-flavors

Slightly more thorny question would be if we stick with the British spelling... :)

@mauicv (Collaborator) commented Oct 10, 2022:

I vote for flavorflav 🤣!

@ascillitoe (Contributor, Author):

Yeah, boyeee!


return model


def _load_embedding_config(cfg: dict, backend: str) -> Callable: # TODO: Could type return more tightly
def _load_embedding_config(cfg: dict) -> Callable: # TODO: Could type return more tightly
@ascillitoe (Contributor, Author):

backend no longer needs to be passed, as flavour is contained in the EmbeddingConfig.

The ultimate goal here is to make these artefact configs self-contained. This will pave the way towards a more OOP approach! (@mauicv )

elif key[-1] == 'tokenizer':
obj = _load_tokenizer_config(src)
elif key[-1] == 'optimizer':
obj = _load_optimizer_config(src, backend)
elif key[-1] == 'preprocess_fn':
obj = _load_preprocess_config(src, backend)
obj = _load_preprocess_config(src)
@ascillitoe (Contributor, Author):

backend is not passed to some of these functions, since the artefact configs themselves now contain flavour.

elif key[-1] == 'embedding':
obj = _load_embedding_config(src, backend)
obj = _load_embedding_config(src)
elif key[-1] == 'tokenizer':
obj = _load_tokenizer_config(src)
elif key[-1] == 'optimizer':
obj = _load_optimizer_config(src, backend)
@ascillitoe (Contributor, Author):

Some of these functions still have a backend arg since we still want some of these objects' backends to be constrained by the detector backend. For example, it would not make sense to allow a pytorch optimizer to be loaded with a tensorflow backend detector.

if backend != 'tensorflow':
raise NotImplementedError("Currently, saving is only supported with backend='tensorflow'.")
backend = detector.meta.get('backend', None)
if backend not in (None, 'tensorflow', 'sklearn'):
@ascillitoe (Contributor, Author):

None is now included since detectors such as KSDrift now have no backend in their config.

raise ValueError('A sklearn backend is not available for this model')
return model
class SupportedModelsType:
"""
@ascillitoe (Contributor, Author) commented Oct 5, 2022:

Previously we typed model fields as:

model: Optional[Callable] = None

and then validated them with:

_validate_model = validator('model', allow_reuse=True, pre=True)(validate_model)

This did not work for sklearn models since sklearn.base.BaseEstimator is not a Callable. We could simply change to model: Optional[Any] = None. However, I think a more elegant solution is to use a pydantic custom data type. This can then be used directly as:

model: Optional[SupportedModelsType] = None
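
A rough sketch of what such a custom data type can look like (pydantic v1 style; the exact set of supported model classes below is an illustrative assumption):

from typing import Any
from sklearn.base import BaseEstimator
import tensorflow as tf

SUPPORTED_MODELS = (BaseEstimator, tf.keras.Model)  # illustrative only

class SupportedModelsType:
    # Pydantic custom type: pydantic calls the validators yielded below
    # whenever a field annotated with this type is populated.
    @classmethod
    def __get_validators__(cls):
        yield cls.validate

    @classmethod
    def validate(cls, model: Any) -> Any:
        if not isinstance(model, SUPPORTED_MODELS):
            raise TypeError(f'{type(model)} is not a supported model type.')
        return model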

@ascillitoe (Contributor, Author):

Note that the:

raise ValueError('A TensorFlow backend is not available for this model')

has been changed to:

raise TypeError("`backend='tensorflow'` but the `model` doesn't appear to be a TensorFlow supported model.")

as I think the previous error message was misleading.

@@ -97,8 +97,6 @@ class DetectorConfig(CustomBaseModel):
"""
name: str
"Name of the detector e.g. `MMDDrift`."
backend: Literal['tensorflow', 'pytorch', 'sklearn', 'keops'] = 'tensorflow'
@ascillitoe (Contributor, Author):

There is no longer a backend field in DetectorConfig, as this is now specific to each detector.



@fixture
def encoder_model(backend, current_cases):
@ascillitoe (Contributor, Author):

These fixtures have all been moved from test_saving.py, so that they can be reused in _sklearn/tests/test_saving_sk.py etc.

@@ -76,213 +72,6 @@
# TODO - future: Some of the fixtures can/should be moved elsewhere (i.e. if they can be recycled for use elsewhere)


@fixture
@ascillitoe (Contributor, Author):

Moved to models.py.

@parametrize_with_cases("data", cases=ContinuousData.data_synthetic_nd, prefix='data_')
@parametrize('model', [encoder_model])
@parametrize('layer', [None, -1])
def test_save_model(data, model, layer, backend, tmp_path):
@ascillitoe (Contributor, Author):

Moved to test_saving_tf.py.

@@ -23,3 +23,4 @@ tox>=3.21.0, <4.0.0 # used to generate licence info via `make licenses`
twine>3.2.0, <4.0.0 # 4.x causes deps clashes with testing/requirements.txt, as requires rich>=12.0.0 -> requires typing-extensions>=4.0.0 -> too high for spacy and thinc!
packaging>=19.0, <22.0 # Used to check scipy version for CVMDrift test. Can be removed once python 3.6 support dropped (and scipy lower bound >=1.7.0).
codecov>=2.0.15, <3.0.0
xgboost>=1.3.2, <2.0.0 # Install for use in testing since we support serialization of xgboost models under the sklearn API
@ascillitoe (Contributor, Author):

xgboost needs to be installed for testing since an xgboost test model is defined and used in the new saving tests.

@ascillitoe ascillitoe removed the WIP PR is a Work in Progress label Oct 5, 2022
@jklaise (Contributor) commented Oct 7, 2022

Very likely outside the scope of this PR, but is it true that we only support the sklearn interface for xgboost models where they can be used in the library (ClassifierDrift being the only place?)?

Reason for asking is two-fold:

@jklaise jklaise self-requested a review October 7, 2022 14:07
@jklaise (Contributor) left a comment:

Mostly LGTM, I left some questions in places I'm unsure about.

@@ -117,9 +126,15 @@ class ModelConfig(CustomBaseModel):
.. code-block :: toml

[model]
flavour "tensorflow"
Contributor:

missing = ?

alibi_detect/saving/schemas.py (outdated; conversation resolved)
@@ -69,7 +69,7 @@ def load_model(filepath: Union[str, os.PathLike],
return model


def prep_model_and_emb(model: Optional[Callable], emb: Optional[TransformerEmbedding]) -> Callable:
def prep_model_and_emb(model: Callable, emb: Optional[TransformerEmbedding]) -> Callable:
Contributor:

Maybe a distraction here, but what about the typing? Of course, since this is tensorflow-specific we wouldn't use the new SupportedModelsType, but that raises the question of whether we should have a custom tensorflow type. Perhaps one for later...

@@ -78,7 +81,10 @@ def save_model_config(model: Callable,
if model is not None:
filepath = base_path.joinpath(local_path)
save_model(model, filepath=filepath, save_dir='model')
cfg_model = {'src': local_path.joinpath('model')}
cfg_model = {
'flavour': 'tensorflow',
Contributor:

Likely another distraction/follow-up... But I would say it would be more robust to define an enum once and then use its values everywhere to avoid possible typos, e.g. as in https://github.com/SeldonIO/alibi/blob/fa034a2ebf7cd69c1796cf81ac6a2862a15f03b7/alibi/utils/frameworks.py#L4-L6
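
For illustration, a sketch of that suggestion (the enum name and members below are assumptions, mirroring the linked alibi approach):

from enum import Enum

class Framework(str, Enum):
    TENSORFLOW = 'tensorflow'
    PYTORCH = 'pytorch'
    SKLEARN = 'sklearn'
    KEOPS = 'keops'

# e.g. cfg_model = {'flavour': Framework.TENSORFLOW.value, ...} rather than
# hard-coding the 'tensorflow' string at each call site.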

Contributor:

A sanity check would be that grepping the code base for a string like 'tensorflow' shouldn't bring up very many hits :)

@ascillitoe (Contributor, Author):

Will open a follow-up PR for this.

backend = cfg.pop('backend') # popping so that cfg left as kwargs + `name` when passed to _init_detector
if backend.lower() != 'tensorflow':
raise NotImplementedError('Loading detectors with PyTorch, sklearn or keops backend is not yet supported.')
backend = cfg.get('backend', None)
Contributor:

No more popping?

@ascillitoe (Contributor, Author) commented Oct 11, 2022:

We no longer need to, as pydantic now only populates the config with backend for detectors that have a backend kwarg (previously we added backend to all detector configs, since it also represented the "flavour" of preprocessing).


Comment on lines +205 to +206
if backend not in ('tensorflow', 'pytorch', 'keops'):
pytest.skip("Detector doesn't have this backend")
Contributor:

Slightly confused about these statements, is this future-proofing if we had more backends? Also looking at the backend fixture it's only parametrized with tensorflow and sklearn, so how does it work for keops, pytorch?

@ascillitoe (Contributor, Author):

Yes, future-proofing, in anticipation of serialization support being extended to more backends.

This file (test_saving.py) is currently only parametrized with backend = param_fixture("backend", ['tensorflow', 'sklearn']) because we currently only support saving with tensorflow and sklearn. As we add support for more backends, we should add them to this parametrization.

The if backend not in ('tensorflow', 'pytorch', 'keops'): is intended to skip the test if any other backend is passed to it (e.g. atm backend=='sklearn' is passed), because MMDDrift only has 'tensorflow', 'pytorch' and 'keops' backends. There are various ways we could do this but I thought the above logic was the most maintainable because there is a one-to-one mapping between the contents of the tuple and the actual backends supported by the detector.
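
A condensed sketch of the pattern described above (the test name and body are illustrative; param_fixture comes from pytest-cases):

import pytest
from pytest_cases import param_fixture

# Only backends with serialization support are parametrized; extend as support grows.
backend = param_fixture("backend", ['tensorflow', 'sklearn'])

def test_save_mmddrift(backend, tmp_path):
    # MMDDrift only supports these backends, so skip any other parametrized value.
    if backend not in ('tensorflow', 'pytorch', 'keops'):
        pytest.skip("Detector doesn't have this backend")
    # ... instantiate MMDDrift with `backend`, save and load it, then compare predictions ...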

@ascillitoe ascillitoe requested a review from mauicv October 10, 2022 08:22
filepath
Saved model directory.
load_dir
Name of saved model folder within the filepath directory.
Collaborator:

Not quite sure of the difference between the args? Is filepath the saved detector directory and the load_dir the location within the detector directory? Why not just have filepath?

@ascillitoe (Contributor, Author) commented Oct 11, 2022:

Fair point. This is really only for consistency with the _tensorflow.loading.load_model function. It has these two separate kwargs for use in the legacy load_detector here:

try: # legacy load_model behaviour was to return None if not found. Now it raises error, hence need try-except.
model = load_model(filepath, load_dir='encoder')

Happy to remove now for sklearn if you think best?

@ascillitoe (Contributor, Author):

Removed in c3fd225 :)



%### PyTorch
Collaborator:

Is the % intentional?

@ascillitoe (Contributor, Author):

Yes. It will be uncommented once we add pytorch serialization support.

@mauicv (Collaborator) left a comment:

LGTM!

@ascillitoe ascillitoe merged commit 1898ad2 into SeldonIO:master Oct 12, 2022
@ascillitoe ascillitoe added this to the v0.11.0 milestone Oct 19, 2022
Labels: Type: Serialization (Serialization proposals and changes)
Projects: None yet
4 participants