Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(rust, python): Improve cut and allow use in expressions #9580

Merged
merged 17 commits into from
Jul 5, 2023

Conversation

magarick
Copy link
Contributor

Cut can work directly on a series and return a series and it also works in df expressions. It no longer requires sorting and on first glance appears faster as well. This PR also fixes a bug where you could get incorrect results if the breaks weren't passed in sorted. The code is shorter too, but I've kept around the old version for now in case people still want it.

This is a WIP as I could use some feedback on if I did things the right way, but the code does appear to work fine.

@magarick
Copy link
Contributor Author

I need to fix up the tests but this does seem to work for the most part.
However, there's a problem with over expressions. If the global string cache isn't enabled, there won't be different categories for different groups when there should be. And if it is enabled, I get an error when trying to do qcut. Any help would be appreciated, since working correctly over groups is important.

@magarick magarick changed the title Improve cut and allow use in expressions feat(rust, python): Improve cut and allow use in expressions Jun 28, 2023
@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Jun 28, 2023
@magarick
Copy link
Contributor Author

magarick commented Jul 3, 2023

I'm using the new style expressions but still have an issue with "over". Any ideas for how to work around this would be helpful. One way that seems reasonable, but probably not ideal is, is to let the function return a series containing Strings and then convert that to a categorical in the function that the expression dispatches. But this feels like a hack and I'd like to understand what's going on.

@magarick
Copy link
Contributor Author

magarick commented Jul 3, 2023

Ok, it more or less works now. The only caveat is that if you use qcut in an over expression without specifying labels and the quantiles are different per group it ends up just taking labels from one of the groups :-( I'm not sure how to resolve this.

@@ -14,6 +14,7 @@ mod clip;
mod concat;
mod correlation;
mod cum;
mod cut;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you feature gate this functionality?

@@ -192,6 +193,17 @@ pub enum FunctionExpr {
method: correlation::CorrelationMethod,
ddof: u8,
},
Cut {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you feature gate this functionality?

labels: Option<Vec<String>>,
left_closed: bool,
},
QCut {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you feature gate this functionality?


use crate::series::ops::SeriesSealed;

pub trait CutSeries: SeriesSealed {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only dispatch via expression. Can make it simpler by just having a function that accepts a &Series and arguments?

@ritchie46
Copy link
Member

Ok, it more or less works now. The only caveat is that if you use qcut in an over expression without specifying labels and the quantiles are different per group it ends up just taking labels from one of the groups :-( I'm not sure how to resolve this.

I know how to fix that. Will do in a separate PR.

Thanks for the PR @magarick. Two points.

  • Can you feature gate the functionality? We want to keep the compile times as low as possible by default.
  • Can you simplify the trait by using a function? I want to make the code simpler, and prefer functions for that.

@magarick
Copy link
Contributor Author

magarick commented Jul 4, 2023

I feature gated and simplified dispatch. Less code is better code.
HOWEVER, I now have the same problem as when I used the string cache manually:
In the following example, everything but the last line works

import polars as pl
x = pl.Series(range(20))
r = pl.Series([pl.repeat('a', 10, eager = True), pl.repeat('b', 10, eager = True)]).explode()
df = pl.DataFrame(dict(x = x, g = r))

x.qcut([.5], series = True)

df.with_columns(pl.col('x').qcut([.5], labels = ["less", "more"]).over('g'))

df.with_columns(pl.col('x').qcut([.5]).over('g'))

The error is

thread '<unnamed>' panicked at 'called `Option::unwrap()` on a `None` value', /Users/josh/repos/polars/polars/polars-core/src/chunked_array/logical/categorical/builder.rs:107:42
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/josh/repos/polars/py-polars/polars/dataframe/frame.py", line 1397, in __repr__
    return self.__str__()
  File "/Users/josh/repos/polars/py-polars/polars/dataframe/frame.py", line 1394, in __str__
    return self._df.as_str()
pyo3_runtime.PanicException: called `Option::unwrap()` on a `None` value

I don't know if I'm building the categorical incorrectly or if I've misunderstood something in the code. It works when sharing labels, and it works when not in an over expression.

@magarick
Copy link
Contributor Author

magarick commented Jul 5, 2023

The problem seems to be related to how over maps values back to their original row. If I understand correctly, window functions first perform an aggregation, giving you a Series with a list for each group and then explodes and reshuffles the results. I've recreated it in a simpler form.

>>> pl.enable_string_cache(False)
>>> s1 = pl.Series(['a', 'b'], dtype=pl.Categorical)
>>> s2 = pl.Series(['c', 'd'], dtype=pl.Categorical)
>>> pl.DataFrame(dict(c = [s1, s2]))
shape: (2, 1)
┌────────────┐
│ c          │
│ ---        │
│ list[cat]  │
╞════════════╡
│ ["a", "b"] │
│ ["a", "b"] │
└────────────┘
>>>
>>> pl.enable_string_cache(True)
>>> s1 = pl.Series(['a', 'b'], dtype=pl.Categorical)
>>> s2 = pl.Series(['c', 'd'], dtype=pl.Categorical)
>>> pl.DataFrame(dict(c = [s1, s2]))
thread '<unnamed>' panicked at 'called `Option::unwrap()` on a `None` value', /Users/josh/repos/polars/polars/polars-core/src/chunked_array/logical/categorical/builder.rs:107:42
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/josh/repos/polars/py-polars/polars/dataframe/frame.py", line 1397, in __repr__
    return self.__str__()
  File "/Users/josh/repos/polars/py-polars/polars/dataframe/frame.py", line 1394, in __str__
    return self._df.as_str()
pyo3_runtime.PanicException: called `Option::unwrap()` on a `None` value

In the first case, levels get silently mis-mapped. The second has the same error as the qcut example.

@magarick
Copy link
Contributor Author

magarick commented Jul 5, 2023

In case it helps, I get the same error if I try pl.Series([s1, s2]).
And it seems like maybe this has something to do with NamedFrom, specifically for AsRef<[Series]>? and how it uses get_list_builder. I don't see any handling for categoricals or a ListCategoricalChunkedBuilder like with other data types. I don't know if this is even related but maybe? @ritchie46 any help would be appreciated since this seems to be getting deep into internals. If you're not sure I'll try again in the morning. Thanks.

@ritchie46
Copy link
Member

Thanks, I will take a look.

@ritchie46
Copy link
Member

Ok, I will see if I can fix the Series([s1, s2]) bug.

For this PR, I think we should raise if qcut is used in window expressions. There is a lot of complexity involved and I don't see how we can bubble up the data type properly in window functions.

Let's raise until we are a bit further in the core architecture and then we can maybe deal with it.

@ritchie46
Copy link
Member

Thanks @magarick. I hope to lift the restrictions regarding groupby/window functions soon!

@ritchie46 ritchie46 merged commit 9ff6908 into pola-rs:main Jul 5, 2023
@magarick magarick deleted the series-cut branch July 5, 2023 17:56
c-peters pushed a commit to c-peters/polars that referenced this pull request Jul 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants