Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(rust, python): Faster is_sorted when no flag set #9777

Merged
merged 2 commits into from
Jul 10, 2023

Conversation

magarick
Copy link
Contributor

@magarick magarick commented Jul 8, 2023

This will check sortedness without sorting the entire series even when no flag is set. Seems faster.

@github-actions github-actions bot added performance Performance issues or improvements python Related to Python Polars rust Related to Rust Polars labels Jul 8, 2023
Copy link
Member

@ritchie46 ritchie46 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice TODO hunt. :)

@ritchie46 ritchie46 merged commit 9962605 into pola-rs:main Jul 10, 2023
@magarick magarick deleted the is-sorted branch July 12, 2023 21:12
c-peters pushed a commit to c-peters/polars that referenced this pull request Jul 14, 2023
@s-banach
Copy link
Contributor

Does this break is_sorted on struct columns?

@magarick
Copy link
Contributor Author

Does this break is_sorted on struct columns?

Not that I know of. It seems fine. Did you see something?

@s-banach
Copy link
Contributor

In polars = 0.18.8, I get the following, maybe or maybe not related to this PR:

import polars as pl
df = pl.DataFrame({"x": [1, 2], "y": [3, 4]})
df.select(pl.struct("x", "y")).to_series().is_sorted()
PanicException: not implemented

@magarick
Copy link
Contributor Author

You're right. I tried with 0.18.7
I think I know what the issue is.

@magarick
Copy link
Contributor Author

Inequality operations aren't implemented for structs. It makes sense to have them. After all, if you can sort structs, you can compare them. But some care is required.
It seems like I could either

  1. Implement it at the ChunkedArray level, but this is a bad idea because it requires checking the fields and possibly returning an error if the structs aren't the same type internally. That would require changing return types and introduces higher level behavior than feels appropriate there.
  2. Implement it at the Series level. Those functions already return a PolarsResult and a little extra checking for structs, then building up an output seems fine.
  3. Special case it for is_sorted and check each field individually if the Series is a Struct.

I'm inclined toward option 2 but I'll defer to @ritchie46 since he knows the internals much better.

@ritchie46
Copy link
Member

We should do this in two steps. First fix the regression by pattern matching on structs (and maybe lists) as well. For those we can restore old behavior.

Next we can see if we can implement comparison on structs. This needs to be implemented on the StructChunked struct. A quick way to implement this is doing field wise comparison and then combine the boolean outputs with an AND operation. Later we can check if we need to optimize it further.

@s-banach
Copy link
Contributor

I thought structs were sorted in dictionary order,
eg (0, 1) < (1, 0),
so I don’t understand what you mean by combining the fieldwise comparisons with AND.

@ritchie46
Copy link
Member

I thought structs were sorted in dictionary order, eg (0, 1) < (1, 0), so I don’t understand what you mean by combining the fieldwise comparisons with AND.

I am talking about equality inly in this case. The other one needs row encoding. Probably that's the proper wat for all struct comparisons.

@magarick
Copy link
Contributor Author

Equality is already implemented for Structs, the issue is inequality checks, which aren't implemented in ChunkCompare. The only issue is defining the output when schemata don't match. You might have to change the associated type from BooleanChunked to PolarsResult<BooleanChunked> because non-comparable structs could either give
a. All false (probably not what you want)
b. All null (maybe what you want, but then you're conflating this case with comparing against a null value)
c. Raise an error

Cases:

  1. Names and types are equal, but ordering is different. Probably don't want to mess with reordering.
  2. Names are different but field types are all the same and in the same order. Maybe this is ok to compare, maybe not.
  3. Names are the same, types are the same. Do the comparison!
  4. Anything else: return all nulls or fail

Honestly, I think I could do the proper change in not much longer than special-casing a fix, but if you'd like time to think about and discuss how to handle the operation properly we can do it in two phases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance issues or improvements python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants