Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make promote_typejoin recursive #26501

Closed
wants to merge 1 commit into from
Closed

Make promote_typejoin recursive #26501

wants to merge 1 commit into from

Conversation

bramtayl
Copy link
Contributor

No description provided.

@bramtayl
Copy link
Contributor Author

bramtayl commented Mar 17, 2018

So this hit a stackoverflow. But I think that I just revealed an existing bug in typejoin. There's a line that looks like this:

function typejoin(a, b)
    # some stuff
    a′ === a ? typejoin(a, b) : recur(a′, b)

If the a′ === a condition is met then promotion will always overflow. Apparently this condition had just never been met before...

@nalimilan
Copy link
Member

AFAICT this is what #25553 implemented before it was reverted by #26109.

@bramtayl
Copy link
Contributor Author

Oh ok should have read closer before reinventing the wheel.

@bramtayl
Copy link
Contributor Author

Is there an ELI5 of why #25553 was reverted? Maybe I'm being dense. Is it worth implementing for a custom named-tuple like type?

@vtjnash
Copy link
Member

vtjnash commented Mar 17, 2018

It's very slow (causes even worse performance than not having it), so it's misleading to suggest that you should structure code this way. It appears to me that, in general, for good performance you need to operate on data column-wise. Otherwise, trying to copy data by rows makes it too easy to run off the performance cliff and just have no way to recover.

@bramtayl
Copy link
Contributor Author

Well shoot if I knew that I should always work with data column-wise I would have stuck with R where everything is always vectorized in the first place.

@nalimilan
Copy link
Member

nalimilan commented Mar 17, 2018

It's very slow (causes even worse performance than not having it), so it's misleading to suggest that you should structure code this way. It appears to me that, in general, for good performance you need to operate on data column-wise. Otherwise, trying to copy data by rows makes it too easy to run off the performance cliff and just have no way to recover.

This statement mixes two different issues:

  1. How can we make tuples with small Union fields as fast as possible
  2. Is it more efficient to operate on vectors of tuples or on tuples of vectors

Regarding 1), I (and others apparently) haven't yet understood why it's the case, and (more importantly) why it's going to remain the case for the whole life of the 1.x series. Could you explain this in detail for non-compiler developers like us?

Regarding 2), I agree that storing data as vectors of tuples is generally a bad idea compared with storing them as tuples of vectors (EDIT: which doesn't mean that tuples shouldn't be used for a streaming API, as a temporary representation of rows e.g. from one input data frame to an output data frame). But it's not completely unreasonable either for small data sets, e.g. if you want to keep a handful of tuples of summary statistics computed using Query.jl based on a large data frame. Tuples are a convenient lightweight structure for many applications. Should we refuse to support this because we're afraid that people would misuse it?

@bramtayl
Copy link
Contributor Author

It seems what’s really misssing here is collect-like machinery that is able to take a generator which returns a tuple of values and populate a tuple of vectors?

@vtjnash
Copy link
Member

vtjnash commented Mar 17, 2018

small data sets

Simple types like Any will compile faster, so you'll get the answer faster, if you avoid specializing the code. Of course, if you don't have much data, basically anything will perform OK: Dict or even just Array might also perform just fine.

Could you explain this in detail for non-compiler developers like us?

Someone else is welcome to explore this, and optimize some of these cases during v1.x. But as the guy who said "we can have fast small unions" (and did so, mostly), I'm also acutely aware that it works best in cases where the union is of completely unrelated items (like Tuple and Nothing), and that the work does not generalize to container constructors (or any other large union, such as is represented by Tuple{Union...}). That would be a much more difficult problem. Afaik, supporting that would require a static type system. (Optimal performance currently depends heavily on creating all 2^n specializations – although as Jeff notes in https://discourse.julialang.org/t/missing-data-and-namedtuple-compatibility/8136/34 we won't actually need them – so we just instead make sure that inference won't do that work or compute a return type and instead defer that analysis for runtime).

@bramtayl
Copy link
Contributor Author

Ok, closing this for now; I've found a workaround #25924 (comment) and duplicate of #25553

@kshyatt kshyatt added the types and dispatch Types, subtyping and method dispatch label May 27, 2018
@bramtayl bramtayl closed this Jul 6, 2018
@bramtayl bramtayl mentioned this pull request Jul 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
types and dispatch Types, subtyping and method dispatch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants