Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decomposing Join #648

Closed
jhellerstein opened this issue May 8, 2023 · 5 comments
Closed

Decomposing Join #648

jhellerstein opened this issue May 8, 2023 · 5 comments
Labels
hydroflow syntax Hydroflow's custom surface syntax lattices/properties

Comments

@jhellerstein
Copy link
Contributor

flowchart TD
    optimize["auto-optimize (rewrite) <tt>persist()/deltae()</tt> #347"]
    joinstate["split up <tt>join()</tt> state into <tt>persist()</tt>, etc. #347"]
    joinlattice["lattice types in <tt>join()</tt> (~#271)"]
    joinopt["auto-optimize <tt>join()</tt> state"]

    optimize --> joinopt
    joinstate --> joinlattice --> joinopt

    %% david["David dedalus optimizations?"]
    %% rewrite_api --> david
    %% optimize --> david
Loading
@jhellerstein jhellerstein added this to the 0.2 release milestone May 8, 2023
@MingweiSamuel MingweiSamuel added the hydroflow syntax Hydroflow's custom surface syntax label May 8, 2023
@zzlk
Copy link
Contributor

zzlk commented May 8, 2023

I think this issue hits at the overall question of like what are our edges? Currently they are essentially streams, they don't do deduping, they don't do re-ordering. So that means what does group_by/reduce look like? Currently because edges are just streams then it works with almost any closure you put in, but if we want to treat edges like sets or like lattices or like some other structure then that would greatly affect the implementation of join/groupby/reduce/many other operators.

@shadaj
Copy link
Member

shadaj commented May 14, 2023

Here's a brief summary of one take that definitely requires more discussion: edge types should always be streams by default and lattice types should fall out of applying operators.

For example, putting a stream through a shuffle() operator would produce a new stream, but now effectively with bag semantics. If you have a downstream reduce function that is known to have an associative + commutative function, then we know a safe rewrite can transform that to shuffle() -> fold_ac(...), and the shuffle can then be pushed through other operators and used to perform the aggregation with partitions, etc.

On the types side, we can focus on checking for determinism. For example, it would be a compile error if a shuffle()d stream goes into a fold (it must go into a fold_c).

@jhellerstein jhellerstein removed this from the 0.3 release milestone Jun 5, 2023
@jhellerstein
Copy link
Contributor Author

See #929 -- same issue

@MingweiSamuel
Copy link
Member

#1050 #1058

@MingweiSamuel
Copy link
Member

Closing this as we have lattice_bimorphism() operator - can make separate issues for improving that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hydroflow syntax Hydroflow's custom surface syntax lattices/properties
Projects
None yet
Development

No branches or pull requests

4 participants