Sktime semantic data types for time series & visions #194
Hey @fkiraly - thanks for reaching out! I haven't had a chance to fully grok everything sktime is doing, but I might be able to provide a few thoughts based on the high-level API and a quick read of the check and convert implementations. We have a data compression library named compressio, which we wrote to show how easy it is to build on type-based logic; it might be of some interest to you if you decide to use visions. From what I'm seeing, there are a few differences between visions and sktime which might inform the rest of the conversation.

Scitypes and Mtypes
Visions doesn't have a formal abstraction for the data container the way you've developed your notion of mtypes.

Type Implementation
In visions, each type is a class with a contains_op membership check and a set of relations to other types (the Float example below shows the full pattern).
Typesets - Graphs / Sets

For each typeset, we construct a directed dependency graph between types. The easiest way to think about this is through the lens of sets: taking a math example, the integers are contained in the floats, which are contained in the complex numbers. Type inference within a typeset is then a traversal of this graph. One nice thing you get in this construction is the ability to perform inference across many hops in the graph all in one go. For example, maybe you're passed a sequence of ordered YYYY-MM-DD strings with daily frequency. The end result would be a pandas.Period with freq='D' (this is just an example; I may be misusing pandas' implementation of Period). Because type dependency is explicitly encoded here, it's also okay for a sequence to belong to many types (meaning issues like this are not a concern either).
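To make the hops concrete in plain pandas (this only illustrates the inference path, not visions' internals):

import pandas as pd

raw = pd.Series(["2021-01-01", "2021-01-02", "2021-01-03"])  # ordered date strings
dt = pd.to_datetime(raw)         # hop 1: String -> Datetime
per = dt.dt.to_period("D")       # hop 2: Datetime -> Period with freq='D'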
All of this should be doable; most actions in visions end up in the traverse_graph function at some point. It tracks three pieces of information as it walks the graph: the data itself (converted along the way), the path of types traversed, and the shared state dictionary.
We would be very happy to help you guys in any way we can. Happy to hop on a Zoom call as well!
This is really hugely helpful @ieaves, thank you for putting in the time to elaborate on how a type system might be applicable in our case. 🙌 I've made a few attempts in the past to break time series down into smaller specialized objects and tried to build complexity back up again, but I hit a few dead ends. After what you've written, I think I can see how a type system might help here. I'd like to take another stab at the problem, and I'm wondering: what would you recommend as a good place to start when developing a type system?
Hey @lmmentel, I'm so glad! If you're thinking about using visions, the first thing to do is to write down the set of types you need for your problem space. Take a Table type as an example: you can tackle that type in a couple of ways - one would be separate implementations of each type method (contains_op and relations) for each container:

@Table.contains_op.register
def table_contains(series: pd.Series, state: dict) -> bool:
    return series.dtypes != "object"


PRIMITIVE_TYPES = (float, int, str)

@Table.contains_op.register
def table_contains(series: list, state: dict) -> bool:
    # a list of records counts as a Table when every value is a primitive
    for d in series:
        for key in d.keys():
            if not isinstance(d[key], PRIMITIVE_TYPES):
                return False
    state['thing you want to save'] = 'your information'
    return True

The annotation can be read as: when contains_op is handed this container (a pd.Series in the first registration, a list of records in the second), use this implementation.

A second approach might be more like what we did in our example ML typeset. In this, typesets are nested, one handling primitive types and a second capturing higher-order types.

Generally speaking, we advise using dispatch when your type defines an idea independent of its container. For example, latitude/longitude should be latitude/longitude regardless of whether I feed you the same sequence as a pd.Series, list, tuple, np.array, or something else altogether. You can see how we structured that idea in the backends section of our codebase.

If you can write down what semantic types are and the attributes you think they ought to have, things become relatively easy. You need a few basic concepts:
Relations have two methods. Let's say we have a relation between types A and B that defines a mapping from A to B: one method (the relationship) checks whether data belonging to A could be interpreted as B, and the other (the transformer) actually performs the conversion.
Bringing it together with the Float dtype:

class Float(VisionsBaseType):
    """**Float** implementation of :class:`visions.types.type.VisionsBaseType`.

    Examples:
        >>> import visions
        >>> x = [1.0, 2.5, 5.0]
        >>> x in visions.Float
        True
    """

    @staticmethod
    def get_relations() -> Sequence[TypeRelation]:
        relations = [
            IdentityRelation(Generic),
            InferenceRelation(String),
            InferenceRelation(Complex),
        ]
        return relations

    @staticmethod
    @multimethod
    def contains_op(item: Any, state: dict) -> bool:
        pass


@Float.register_relationship(Complex, pd.Series)
def complex_is_float(series: pd.Series, state: dict) -> bool:
    return all(np.imag(series.values) == 0)


@Float.register_transformer(Complex, pd.Series)
def complex_to_float(series: pd.Series, state: dict) -> pd.Series:
    return series.astype(float)

We've got a new class defining the Type, with its relations declared in get_relations.
You'll notice relations defined on a type map TO the type, not the other way around.
The transformer on an InferenceRelation is what actually changes the data (here, casting the Complex series to Float). Now it's just a process of bootstrapping your way up through each type you need. Once you've defined your types you can compose them together (literally, just addition, e.g. `Float + Integer -> Typeset([Float, Integer])`) and you're in business :).
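To make that concrete, a minimal usage sketch (assuming visions' typeset API - detect_type / infer_type - and that the typeset machinery supplies the Generic root):

import pandas as pd
from visions import Float, Integer, String

typeset = Float + Integer + String            # compose types into a typeset
series = pd.Series(["1.5", "2.0", "3.25"])    # strings that happen to look like floats

typeset.detect_type(series)  # -> String: what the data is right now
typeset.infer_type(series)   # -> Float: what it could become via inference relations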
Yes, @ieaves, thanks! Sorry for the late answer, I've taken some time to digest over the weekend. If I understand correctly what you are saying (please correct me if wrong):
I do agree with points 1 and 3; however, point 2 I think is not correct based on what you say? Since you have a type hierarchy, you could simply make mtypes subtypes of the scitype, no? You would have to designate some types as mtypes and some as scitypes, but types can have types too, so that would not be a problem. Do you agree or disagree? What would be nice imo:
Re call - that would be nice!
No, you're right, it definitely does. There's no intrinsic difference between machine types and scitypes within visions; they are each just different semantics. You could very easily have something like:

class PandasSeries(VisionsBaseType):
    @staticmethod
    def get_relations() -> Sequence[TypeRelation]:
        relations = [
            IdentityRelation(Generic),
        ]
        return relations

    @staticmethod
    @multimethod
    def contains_op(item: Any, state: dict) -> bool:
        return isinstance(item, pd.Series)


class NumpyArray(VisionsBaseType):
    @staticmethod
    def get_relations() -> Sequence[TypeRelation]:
        relations = [
            IdentityRelation(Generic),
        ]
        return relations

    @staticmethod
    @multimethod
    def contains_op(item: Any, state: dict) -> bool:
        return isinstance(item, np.ndarray)
It definitely does! Typesets -> a graph. Every relation requires two methods, one for validating whether a conversion is appropriate and another for actually performing the conversion (see the relationship/transformer pair in the Float example above). Taking the pd.Series / np.ndarray example above, you can implement conversions by adding an InferenceRelation into each type's get_relations and registering the corresponding operations:

@NumpyArray.register_relationship(PandasSeries, pd.Series)
def pandas_is_numpy(series: Any, state: dict) -> bool:
    return True

@NumpyArray.register_transformer(PandasSeries, pd.Series)
def pandas_to_numpy(series: Any, state: dict) -> np.ndarray:
    return np.asarray(series)  # coerce the Series to an ndarray

@PandasSeries.register_relationship(NumpyArray, np.ndarray)
def numpy_is_pandas(series: Any, state: dict) -> bool:
    return True

@PandasSeries.register_transformer(NumpyArray, np.ndarray)
def numpy_to_pandas(series: Any, state: dict) -> pd.Series:
    return pd.Series(series)
That's a really good idea! It wouldn't be particularly difficult to implement if there were interest. Maybe something to discuss on a call. If y'all want to get together and find a time that works on your end, just grab a time off my calendly!

EDIT: This runs some risk of overcomplicating things, but in the Pandas -> Numpy, Numpy -> Pandas example above, we normally exclude cyclic relations between types because we try to offer automatic type inference, and we can't resolve cycles automatically (it's like ping-ponging between types). That being said, if you aren't worried about fully automated inference, the cycles won't affect you; we would just add something like this for you:

def cast_along_path(series, graph, path, state=None):
    state = {} if state is None else state  # avoid sharing a mutable default dict
    base_type = path[0]
    for vision_type in path[1:]:
        relation = graph[base_type][vision_type]["relationship"]
        series = relation.transform(series, state)
        base_type = vision_type  # advance to the next hop along the path
    return series

Path is just a direction to travel through the graph; to go from PandasSeries to NumpyArray it would be the list [PandasSeries, NumpyArray].
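To make that runnable end to end, a self-contained toy (ToyRelation is a hypothetical stand-in for visions' TypeRelation edge objects):

import networkx as nx
import numpy as np
import pandas as pd

class ToyRelation:
    # mimics the .transform(series, state) interface cast_along_path expects
    def __init__(self, fn):
        self.fn = fn
    def transform(self, series, state):
        return self.fn(series, state)

graph = nx.DiGraph()
graph.add_edge("PandasSeries", "NumpyArray",
               relationship=ToyRelation(lambda s, _: np.asarray(s)))

series = pd.Series([1.0, 2.0, 3.0])
arr = cast_along_path(series, graph, ["PandasSeries", "NumpyArray"])
print(type(arr))  # <class 'numpy.ndarray'>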
Oh, that's neat! Related questions:
Yes, it's really just semantic sugar to make development easier, particularly when working with multiple backends.
I think the easiest way to accomplish this is to use the state dict. You might do something like:

@NumpyArray.register_transformer(PandasSeries, pd.Series)
def pandas_to_numpy(series: Any, state: dict) -> np.ndarray:
    state['index'] = series.index  # stash the index so the reverse trip can restore it
    return np.asarray(series)

@PandasSeries.register_transformer(NumpyArray, np.ndarray)
def numpy_to_pandas(series: Any, state: dict) -> pd.Series:
    return pd.Series(series, index=state.get('index', None))

That dictionary will be passed up and down the stack and can be recovered whenever needed.
We've solved this puzzle with the 'relationship' method, e.g.:

@Float.register_relationship(Complex, pd.Series)
def complex_is_float(series: pd.Series, state: dict) -> bool:
    return all(np.imag(series.values) == 0)

This way, edges along the graph carry explicit validation: it's not just that complex numbers can be coerced to floats (just drop the complex data) but that they should be, in some ontological sense. For us this was an advantage because it liberated us from requiring the user to specify what they wanted to cast to; instead, the deepest element in the tree was the best-specified type for the data. That being said, if you know what you want to cast to, then Dijkstra would give you the cast path, and you could use the cast_along_path helper above.
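For example, sticking with the toy graph from earlier (an unweighted shortest_path is Dijkstra with unit edge costs):

import networkx as nx

path = nx.shortest_path(graph, source="PandasSeries", target="NumpyArray")
result = cast_along_path(series, graph, path)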
Yes, it should be. If I'm understanding things correctly, this is equivalent to finding all connected nodes in the relation graph. There are two subgraphs we track on each Typeset - an "inferential" graph (changes the underlying data, e.g. Complex -> Float) and a "non-inferential" graph (no change to the underlying data, e.g. Object -> String). As I said, we are using networkx under the hood, so the list of scitypes would be generated as:

import networkx as nx
from_type = "MyType"
type_paths = list(nx.all_shortest_paths(graph, from_type))
type_enum = [path[-1] for path in type_paths] If we were to implement this for you it would look a bit like your_typeset = PandasSeries + NumpyArray # Any other types you wished to consider
your_typeset.accessible_types(PandasSeries)
# -> [NumpyArray]

your_typeset.path_to_type(PandasSeries, NumpyArray)  # only two types, so a single hop
# -> [PandasSeries, NumpyArray]

your_typeset.cast_to(numpy_array, PandasSeries)
# -> pandas_series, state  (i.e. automatically uses the shortest path to coerce
#    the initial numpy_array to a PandasSeries)

EDIT: I realized there's a mistake in some of the code snippets I provided (this is what I get for spitballing): the API requires you to return the state dictionary as well, so the registered operations should actually look like:

@NumpyArray.register_transformer(PandasSeries, pd.Series)
def pandas_to_numpy(series: Any, state: dict) -> Tuple[np.ndarray, dict]:
    state['index'] = series.index
    return np.asarray(series), state
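Presumably the reverse transformer gets the same correction; my sketch of the other half:

@PandasSeries.register_transformer(NumpyArray, np.ndarray)
def numpy_to_pandas(series: Any, state: dict) -> Tuple[pd.Series, dict]:
    return pd.Series(series, index=state.get('index', None)), state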
I've recently been made aware of this excellent and imo much needed library by @lmmentel. The reason is its similarity to the datatypes module of sktime, which introduces semantic typing for time series related data types - we distinguish "mtypes" (machine representations) and "scitypes" (scientific types, what visions calls semantic types). More details here as reference.

A few questions for the visions devs: would you be willing to take a look at the datatypes module and assess how similar it is to visions? If similar, we might be tempted to take a dependency on visions and contribute. Key features are mtype conversions, scitype inference, and checks that also return metadata (e.g., the number of time stamps in a series, which can be represented 4 different ways).