Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

versioning #386

Merged
merged 36 commits into from
Dec 6, 2023
Merged

versioning #386

merged 36 commits into from
Dec 6, 2023

Conversation

sdafni
Copy link
Collaborator

@sdafni sdafni commented Nov 12, 2023

TODO @kbolashev suggestions:

  1. currently there is not verifying of the select list. i.e it can include
  • duplication: (FIeld('x'), Field('x'))
  • contradiction: (FIeld('x', as_of_time=t1, alias='x_t1'), FIeld('x', as_of_time=t2, alias='x_t1')
  • missed aliases (FIeld('x', as_of_time=t1), FIeld('x', as_of_time=t2))
  1. if select +include_all is used, we'll get all columns. its implemented in client by adding all names. would be more efficient to do it in backend

  2. Guy suggests '*' as give me all, otherwise only selected

@sdafni sdafni marked this pull request as ready for review November 16, 2023 10:07
@sdafni sdafni requested a review from kbolashev November 16, 2023 10:08
dagshub/data_engine/model/datasource.py Outdated Show resolved Hide resolved
dagshub/data_engine/model/datasource.py Outdated Show resolved Hide resolved
tests/data_engine/test_querying.py Show resolved Hide resolved
tests/data_engine/test_querying.py Outdated Show resolved Hide resolved
dagshub/data_engine/model/datasource.py Outdated Show resolved Hide resolved
@sdafni sdafni requested a review from kbolashev November 22, 2023 12:51
Copy link
Member

@kbolashev kbolashev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't have a strong opinion on the naive/aware timezone thing, whether we should block them completely, or try to figure out a workaround. Asking in the internal discord.

It's still good to have the docstring adhere by best practices.

tests/data_engine/test_querying.py Outdated Show resolved Hide resolved
dagshub/data_engine/model/datasource.py Outdated Show resolved Hide resolved
dagshub/data_engine/datasets.py Outdated Show resolved Hide resolved
dagshub/data_engine/model/datasource.py Outdated Show resolved Hide resolved
dagshub/data_engine/model/datasource.py Outdated Show resolved Hide resolved
docs/data_engine.md Show resolved Hide resolved
docs/data_engine.md Outdated Show resolved Hide resolved
- both "x" and Field("x") can be used
- alias, as_of_time - are optional
- the list should make sense, i.e ```python.select(Field("x", as_of_time=t1), Field("x", as_of_time=t2))``` does not make sense since there is no alias to differentiate, the result will not reflect the intention. also ```python .select("x","x")```
- when no select list specified all datapoint enrichements are returned, else only those specified.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this answered my question. This needs to be written at the very beginning of the function doc basically.

Also we 100% should have a flag to include the regular fields along with the selected ones.
Especially considering the way we write the docs, people might assume that the selected fields will be added to the original ones. That's how it read to me

- the list should make sense, i.e ```python.select(Field("x", as_of_time=t1), Field("x", as_of_time=t2))``` does not make sense since there is no alias to differentiate, the result will not reflect the intention. also ```python .select("x","x")```
- when no select list specified all datapoint enrichements are returned, else only those specified.
##### Global as_of behavior:
- it applies to all entities unless otherwise specified, i.e if we use ```python Field("x", as_of_time=t1))``` than this t will take precedence over a t2 specified in .as_of(t2). the sensibility of the results is up to the caller. you could get datapoints that existed in t1 < t2 based on a condition applied on their enrichmnts in t2.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also needs some rewording probably.

Fixing typos:

Suggested change
- it applies to all entities unless otherwise specified, i.e if we use ```python Field("x", as_of_time=t1))``` than this t will take precedence over a t2 specified in .as_of(t2). the sensibility of the results is up to the caller. you could get datapoints that existed in t1 < t2 based on a condition applied on their enrichmnts in t2.
- it applies to all entities unless otherwise specified, i.e if we use ```python Field("x", as_of_time=t1))``` then this t will take precedence over a t2 specified in .as_of(t2). the sensibility of the results is up to the caller. you could get datapoints that existed in t1 < t2 based on a condition applied on their enrichments in t2.

Alternatively trying to reword it myself:

as_of specificity:

  • as_of filters defined inside of select() take complete precedence over the as_of defined on top of the query.

This means that a query:

q = ds.select("size", Field("size", as_of_time=t1, alias="size_t1"), "train")
      .as_of(t2)
      .all()

Will return fields size and train as they were as of t2, and size_t1 will have size as of t1.
This behavior is kept regardless of whethert1 came before or after t2

@sdafni sdafni changed the title versioning #1 versioning Nov 30, 2023
@sdafni sdafni merged commit d4aeb9d into master Dec 6, 2023
6 checks passed
@sdafni sdafni deleted the versioning branch December 6, 2023 13:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants