Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add flytekit-omegaconf plugin #2299

Merged
merged 28 commits into from
Aug 1, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
28 commits
Select commit Hold shift + click to select a range
d0149a9
add flytekit-hydra
mg515 Mar 27, 2024
f65136a
fix small typo readme
mg515 Mar 27, 2024
76297da
ruff ruff
mg515 Mar 28, 2024
20e2281
lint more
mg515 Mar 28, 2024
9574b44
rename plugin into flytekit-omegaconf
mg515 Apr 4, 2024
a568a82
lint sort imports
mg515 Apr 8, 2024
61fcbab
use flytekit logger
mg515 Apr 9, 2024
4395540
use flytekit logger #2
mg515 Apr 9, 2024
8864f70
fix typing info in is_flatable
mg515 Apr 9, 2024
0785c83
use default_factory instead of mutable default value
mg515 Apr 9, 2024
167a28c
add python3.11 and python3.12 to setup.py
mg515 Apr 9, 2024
12a779e
make fmt
mg515 Apr 9, 2024
a783c8d
define error message only once
mg515 Apr 9, 2024
5abe25d
add docstring
mg515 Apr 12, 2024
2feac8e
remove GenericEnumTransformer and tests
mg515 May 12, 2024
debe2f7
fallback to TypeEngine.get_transformer(node_type) to find suitable tr…
mg515 May 12, 2024
4fdb7e5
explicit valueerrors instead of asserts
mg515 May 12, 2024
08f9a80
minor style improvements
mg515 May 12, 2024
40af6f3
remove obsolete warnings
mg515 Jul 30, 2024
9c6db69
import flytekit logger instead of instantiating our own
mg515 Jul 30, 2024
1e15009
docstrings in reST format
mg515 Jul 30, 2024
e6e1896
refactor transformer mode
mg515 Jul 30, 2024
f8e945c
improve docs
mg515 Jul 30, 2024
34fd0a4
refactor dictconfig class into smaller methods
mg515 Jul 30, 2024
212ee86
add unit tests for dictconfig transformer
mg515 Jul 30, 2024
0a745d0
refactor of parse_type_description()
mg515 Jul 30, 2024
75dabed
add omegaconf plugin to pythonbuild.yaml
mg515 Aug 1, 2024
257e32d
Merge remote-tracking branch 'origin' into plugin-flytekit-hydra
eapolinario Aug 1, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .github/workflows/pythonbuild.yml
Original file line number Diff line number Diff line change
Expand Up @@ -346,6 +346,7 @@ jobs:
# onnx-tensorflow needs a version of tensorflow that does not work with protobuf>4.
# The issue is being tracked on the tensorflow side in https://github.com/tensorflow/tensorflow/issues/53234#issuecomment-1330111693
# flytekit-onnx-tensorflow
- flytekit-omegaconf
- flytekit-openai
- flytekit-pandera
- flytekit-papermill
Expand Down
69 changes: 69 additions & 0 deletions plugins/flytekit-omegaconf/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@
# Flytekit OmegaConf Plugin

Flytekit python natively supports serialization of many data types for exchanging information between tasks.
The Flytekit OmegaConf Plugin extends these by the `DictConfig` type from the
[OmegaConf package](https://omegaconf.readthedocs.io/) as well as related types
that are being used by the [hydra package](https://hydra.cc/) for configuration management.

## Task example
```
from dataclasses import dataclass
import flytekitplugins.omegaconf # noqa F401
from flytekit import task, workflow
from omegaconf import DictConfig
@dataclass
class MySimpleConf:
_target_: str = "lightning_module.MyEncoderModule"
learning_rate: float = 0.0001
@task
def my_task(cfg: DictConfig) -> None:
fg91 marked this conversation as resolved.
Show resolved Hide resolved
print(f"Doing things with {cfg.learning_rate=}")
@workflow
def pipeline(cfg: DictConfig) -> None:
my_task(cfg=cfg)
if __name__ == "__main__":
from omegaconf import OmegaConf
cfg = OmegaConf.structured(MySimpleConf)
pipeline(cfg=cfg)
```

## Transformer configuration

The transformer can be set to one of three modes:

`Dataclass` - This mode should be used with a StructuredConfig and will reconstruct the config from the matching dataclass
during deserialisation in order to make typing information from the dataclass and continued validation thereof available.
This requires the dataclass definition to be available via python import in the Flyte execution environment in which
objects are (de-)serialised.

`DictConfig` - This mode will deserialize the config into a DictConfig object. In particular, dataclasses are translated
into DictConfig objects and only primitive types are being checked. The definition of underlying dataclasses for
structured configs is only required during the initial serialization for this mode.

`Auto` - This mode will try to deserialize according to the Dataclass mode and fall back to the DictConfig mode if the
dataclass definition is not available. This is the default mode.

You can set the transformer mode globally or for the current context only the following ways:
```python
from flytekitplugins.omegaconf import set_transformer_mode, set_local_transformer_mode, OmegaConfTransformerMode

# Set the global transformer mode using the new function
set_transformer_mode(OmegaConfTransformerMode.DictConfig)

# You can also the mode for the current context only
with set_local_transformer_mode(OmegaConfTransformerMode.Dataclass):
# This will use the Dataclass mode
pass
```

```note
Since the DictConfig is flattened and keys transformed into dot notation, the keys of the DictConfig must not contain
dots.
fg91 marked this conversation as resolved.
Show resolved Hide resolved
```
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
from contextlib import contextmanager

from flytekitplugins.omegaconf.config import OmegaConfTransformerMode
from flytekitplugins.omegaconf.dictconfig_transformer import DictConfigTransformer # noqa: F401
from flytekitplugins.omegaconf.listconfig_transformer import ListConfigTransformer # noqa: F401

_TRANSFORMER_MODE = OmegaConfTransformerMode.Auto


def set_transformer_mode(mode: OmegaConfTransformerMode) -> None:
"""Set the global serialization mode for OmegaConf objects."""
global _TRANSFORMER_MODE
_TRANSFORMER_MODE = mode


def get_transformer_mode() -> OmegaConfTransformerMode:
"""Get the global serialization mode for OmegaConf objects."""
return _TRANSFORMER_MODE


@contextmanager
def local_transformer_mode(mode: OmegaConfTransformerMode):
"""Context manager to set a local serialization mode for OmegaConf objects."""
global _TRANSFORMER_MODE
previous_mode = _TRANSFORMER_MODE
set_transformer_mode(mode)
try:
yield
finally:
set_transformer_mode(previous_mode)


__all__ = ["set_transformer_mode", "get_transformer_mode", "local_transformer_mode", "OmegaConfTransformerMode"]
15 changes: 15 additions & 0 deletions plugins/flytekit-omegaconf/flytekitplugins/omegaconf/config.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
from enum import Enum


class OmegaConfTransformerMode(Enum):
"""
Operation Mode indicating whether a (potentially unannotated) DictConfig object or a structured config using the
underlying dataclass is returned.
Note: We define a single shared config across all transformers as recursive calls should refer to the same config
Note: The latter requires the use of structured configs.
"""

DictConfig = "DictConfig"
DataClass = "DataClass"
Auto = "Auto"
Original file line number Diff line number Diff line change
@@ -0,0 +1,181 @@
import importlib
import re
from typing import Any, Dict, Type, TypeVar

import flatten_dict
import flytekitplugins.omegaconf
from flyteidl.core.literals_pb2 import Literal as PB_Literal
from flytekitplugins.omegaconf.config import OmegaConfTransformerMode
from flytekitplugins.omegaconf.type_information import extract_node_type
from google.protobuf.json_format import MessageToDict, ParseDict
from google.protobuf.struct_pb2 import Struct

import omegaconf
from flytekit import FlyteContext
from flytekit.core.type_engine import TypeTransformerFailedError
from flytekit.extend import TypeEngine, TypeTransformer
from flytekit.loggers import logger
from flytekit.models.literals import Literal, Scalar
from flytekit.models.types import LiteralType, SimpleType
from omegaconf import DictConfig, OmegaConf

T = TypeVar("T")
NoneType = type(None)


class DictConfigTransformer(TypeTransformer[DictConfig]):
def __init__(self):
"""Construct DictConfigTransformer."""
super().__init__(name="OmegaConf DictConfig", t=DictConfig)

def get_literal_type(self, t: Type[DictConfig]) -> LiteralType:
"""
Provide type hint for Flytekit type system.
To support the multivariate typing of nodes in a DictConfig, we encode them as binaries (no introspection)
with multiple files.
"""
return LiteralType(simple=SimpleType.STRUCT)

def to_literal(self, ctx: FlyteContext, python_val: T, python_type: Type[T], expected: LiteralType) -> Literal:
"""Convert from given python type object ``DictConfig`` to the Literal representation."""
check_if_valid_dictconfig(python_val)

base_config = OmegaConf.get_type(python_val)
type_map, value_map = extract_type_and_value_maps(ctx, python_val)
wrapper = create_struct(type_map, value_map, base_config)

return Literal(scalar=Scalar(generic=wrapper))

def to_python_value(self, ctx: FlyteContext, lv: Literal, expected_python_type: Type[DictConfig]) -> DictConfig:
"""Re-hydrate the custom object from Flyte Literal value."""
if lv and lv.scalar is not None:
nested_dict = flatten_dict.unflatten(MessageToDict(lv.scalar.generic), splitter="dot")
cfg_dict = {}
for key, type_desc in nested_dict["types"].items():
cfg_dict[key] = parse_node_value(ctx, key, type_desc, nested_dict)

return handle_base_dataclass(ctx, nested_dict, cfg_dict)
raise TypeTransformerFailedError(f"Cannot convert from {lv} to {expected_python_type}")


def is_flattenable(d: DictConfig) -> bool:
"""Check if a DictConfig can be properly flattened and unflattened, i.e. keys do not contain dots."""
return all(
isinstance(k, str) # keys are strings ...
and "." not in k # ... and do not contain dots
and (
OmegaConf.is_missing(d, k) # values are either MISSING ...
or not isinstance(d[k], DictConfig) # ... not nested Dictionaries ...
or is_flattenable(d[k])
) # or flattenable themselves
for k in d.keys()
)


def check_if_valid_dictconfig(python_val: DictConfig) -> None:
"""Validate the DictConfig to ensure it's serializable."""
if not isinstance(python_val, DictConfig):
raise ValueError(f"Invalid type {type(python_val)}, can only serialize DictConfigs")
if not is_flattenable(python_val):
raise ValueError(f"{python_val} cannot be flattened as it contains non-string keys or keys containing dots.")


def extract_type_and_value_maps(ctx: FlyteContext, python_val: DictConfig) -> (Dict[str, str], Dict[str, Any]):
"""Extract type and value maps from the DictConfig."""
type_map = {}
value_map = {}
for key in python_val.keys():
if OmegaConf.is_missing(python_val, key):
type_map[key] = "MISSING"
else:
node_type, type_name = extract_node_type(python_val, key)
type_map[key] = type_name

transformer = TypeEngine.get_transformer(node_type)
literal_type = transformer.get_literal_type(node_type)

value_map[key] = MessageToDict(
transformer.to_literal(ctx, python_val[key], node_type, literal_type).to_flyte_idl()
)
return type_map, value_map


def create_struct(type_map: Dict[str, str], value_map: Dict[str, Any], base_config: Type) -> Struct:
"""Create a protobuf Struct object from type and value maps."""
wrapper = Struct()
wrapper.update(
flatten_dict.flatten(
{
"types": type_map,
"values": value_map,
"base_dataclass": f"{base_config.__module__}.{base_config.__name__}",
},
reducer="dot",
keep_empty_types=(dict,),
)
)
return wrapper


def parse_type_description(type_desc: str) -> Type:
"""Parse the type description and return the corresponding type."""
generic_pattern = re.compile(r"(?P<type>[^\[\]]+)\[(?P<args>[^\[\]]+)\]")
match = generic_pattern.match(type_desc)

if match:
origin_type = match.group("type")
args = match.group("args").split(", ")

origin_module, origin_class = origin_type.rsplit(".", 1)
origin = importlib.import_module(origin_module).__getattribute__(origin_class)

sub_types = []
for arg in args:
if arg == "NoneType":
sub_types.append(type(None))
else:
module_name, class_name = arg.rsplit(".", 1)
sub_type = importlib.import_module(module_name).__getattribute__(class_name)
sub_types.append(sub_type)

if origin_class == "Optional":
return origin[sub_types[0]]
return origin[tuple(sub_types)]
else:
module_name, class_name = type_desc.rsplit(".", 1)
return importlib.import_module(module_name).__getattribute__(class_name)


def parse_node_value(ctx: FlyteContext, key: str, type_desc: str, nested_dict: Dict[str, Any]) -> Any:
"""Parse the node value from the nested dictionary."""
if type_desc == "MISSING":
return omegaconf.MISSING

node_type = parse_type_description(type_desc)
transformer = TypeEngine.get_transformer(node_type)
value_literal = Literal.from_flyte_idl(ParseDict(nested_dict["values"][key], PB_Literal()))
return transformer.to_python_value(ctx, value_literal, node_type)


def handle_base_dataclass(ctx: FlyteContext, nested_dict: Dict[str, Any], cfg_dict: Dict[str, Any]) -> DictConfig:
"""Handle the base dataclass and create the DictConfig."""
if (
nested_dict["base_dataclass"] != "builtins.dict"
and flytekitplugins.omegaconf.get_transformer_mode() != OmegaConfTransformerMode.DictConfig
):
# Explicitly instantiate dataclass and create DictConfig from there in order to have typing information
module_name, class_name = nested_dict["base_dataclass"].rsplit(".", 1)
try:
return OmegaConf.structured(importlib.import_module(module_name).__getattribute__(class_name)(**cfg_dict))
except (ModuleNotFoundError, AttributeError) as e:
logger.error(
f"Could not import module {module_name}. If you want to deserialise to DictConfig, "
f"set the mode to DictConfigTransformerMode.DictConfig."
)
if flytekitplugins.omegaconf.get_transformer_mode() == OmegaConfTransformerMode.DataClass:
raise e
return OmegaConf.create(cfg_dict)


TypeEngine.register(DictConfigTransformer())
Loading
Loading