feat(rust, python): Improve cut and allow use in expressions #9580
I need to fix up the tests, but this does seem to work for the most part.
I'm using the new-style expressions but still have an issue with "over". Any ideas for how to work around this would be helpful. One way that seems reasonable, but probably not ideal, is to let the function return a series containing Strings and then convert that to a categorical in the function that the expression dispatches. But this feels like a hack and I'd like to understand what's going on.
Ok, it more or less works now. The only caveat is that if you use
@@ -14,6 +14,7 @@ mod clip;
mod concat;
mod correlation;
mod cum;
mod cut;
Can you feature gate this functionality?
@@ -192,6 +193,17 @@ pub enum FunctionExpr {
        method: correlation::CorrelationMethod,
        ddof: u8,
    },
    Cut {
Can you feature gate this functionality?
        labels: Option<Vec<String>>,
        left_closed: bool,
    },
    QCut {
Can you feature gate this functionality?
use crate::series::ops::SeriesSealed;

pub trait CutSeries: SeriesSealed {
We only dispatch via expression. Can we make it simpler by just having a function that accepts a &Series and arguments?
I know how to fix that. Will do in a separate PR. Thanks for the PR @magarick. Two points.
I feature gated and simplified dispatch. Less code is better code.
import polars as pl
x = pl.Series(range(20))
r = pl.Series([pl.repeat('a', 10, eager = True), pl.repeat('b', 10, eager = True)]).explode()
df = pl.DataFrame(dict(x = x, g = r))
x.qcut([.5], series = True)
df.with_columns(pl.col('x').qcut([.5], labels = ["less", "more"]).over('g'))
df.with_columns(pl.col('x').qcut([.5]).over('g'))
The error is
I don't know if I'm building the categorical incorrectly or if I've misunderstood something in the code. It works when sharing labels, and it works when not in an over expression. |
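One plausible reading of why shared labels behave differently from default labels under `over` can be sketched in plain Python. This is a toy model using only the standard library, not Polars code: `qcut_median_labels` is a hypothetical stand-in for `qcut([.5])`, and the label format is an assumption. It shows only that per-group quantiles yield a different set of default labels in each group, while explicit labels are identical across groups. The thread does not confirm that this is the cause of the error.

```python
from statistics import median


def qcut_median_labels(values, labels=None):
    """Split values at their own median (hypothetical stand-in for qcut([.5]))."""
    m = median(values)
    if labels is not None:
        low, high = labels
    else:
        # Default labels are derived from the break, so they depend on the data.
        low, high = f"(-inf, {m}]", f"({m}, inf]"
    return [low if v <= m else high for v in values]


# Same shape as the example above: x = 0..19 split into groups 'a' and 'b'.
groups = {"a": list(range(10)), "b": list(range(10, 20))}

# Distinct levels produced inside each group.
default_levels = {
    g: sorted(set(qcut_median_labels(v))) for g, v in groups.items()
}
shared_levels = {
    g: sorted(set(qcut_median_labels(v, ("less", "more"))))
    for g, v in groups.items()
}

print(default_levels)  # each group has its own pair of labels
print(shared_levels)   # both groups use the same pair of labels
```

With default labels, the per-group results would have to be combined into a single categorical column whose groups carry different levels. With explicit labels, that problem does not arise.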
The problem seems to be related to how
>>> pl.enable_string_cache(False)
>>> s1 = pl.Series(['a', 'b'], dtype=pl.Categorical)
>>> s2 = pl.Series(['c', 'd'], dtype=pl.Categorical)
>>> pl.DataFrame(dict(c = [s1, s2]))
shape: (2, 1)
┌────────────┐
│ c │
│ --- │
│ list[cat] │
╞════════════╡
│ ["a", "b"] │
│ ["a", "b"] │
└────────────┘
>>>
>>> pl.enable_string_cache(True)
>>> s1 = pl.Series(['a', 'b'], dtype=pl.Categorical)
>>> s2 = pl.Series(['c', 'd'], dtype=pl.Categorical)
>>> pl.DataFrame(dict(c = [s1, s2]))
thread '<unnamed>' panicked at 'called `Option::unwrap()` on a `None` value', /Users/josh/repos/polars/polars/polars-core/src/chunked_array/logical/categorical/builder.rs:107:42
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/josh/repos/polars/py-polars/polars/dataframe/frame.py", line 1397, in __repr__
return self.__str__()
File "/Users/josh/repos/polars/py-polars/polars/dataframe/frame.py", line 1394, in __str__
return self._df.as_str()
pyo3_runtime.PanicException: called `Option::unwrap()` on a `None` value
In the first case, levels get silently mis-mapped. The second has the same error as the
In case it helps, I get the same error if I try
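The silent mis-mapping in the first REPL session can be reproduced with a toy model of dictionary-encoded categoricals. This is only a sketch, assuming each series carries its own local code-to-string dictionary. `encode`, `decode`, and `remap` are hypothetical helpers, not Polars internals, and the thread does not confirm that Polars fails in exactly this way.

```python
def encode(values):
    """Encode strings as (codes, categories) using a series-local dictionary."""
    categories = []
    codes = []
    for v in values:
        if v not in categories:
            categories.append(v)
        codes.append(categories.index(v))
    return codes, categories


def decode(codes, categories):
    """Map physical codes back to their string levels."""
    return [categories[c] for c in codes]


s1_codes, s1_cats = encode(["a", "b"])  # codes [0, 1], categories ['a', 'b']
s2_codes, s2_cats = encode(["c", "d"])  # codes [0, 1], categories ['c', 'd']

# Naive combine: reuse the raw codes but keep only the first dictionary.
naive = [decode(s1_codes, s1_cats), decode(s2_codes, s1_cats)]

# Safe combine: build one merged dictionary, then remap each series' codes.
merged = list(dict.fromkeys(s1_cats + s2_cats))


def remap(codes, categories):
    """Translate local codes into codes of the merged dictionary."""
    return [merged.index(categories[c]) for c in codes]


safe = [
    decode(remap(s1_codes, s1_cats), merged),
    decode(remap(s2_codes, s2_cats), merged),
]

print(naive)  # second row decodes with the wrong dictionary
print(safe)   # both rows keep their original levels
```

In this model, the naive combine reproduces the two identical rows shown in the first session, and remapping through a merged dictionary restores the original levels.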
Thanks, I will take a look.
Ok, I will see if I can fix the For this PR, I think we should raise if qcut is used in window expressions. There is a lot of complexity involved and I don't see how we can bubble up the data type properly in window functions. Let's raise until we are a bit further in the core architecture and then we can maybe deal with it.
Thanks @magarick. I hope to lift the restrictions regarding groupby/window functions soon!
Cut can now work directly on a series and return a series, and it also works in DataFrame expressions. It no longer requires sorting and, at first glance, appears faster as well. This PR also fixes a bug where you could get incorrect results if the breaks weren't passed in sorted order. The code is shorter too, but I've kept the old version around for now in case people still want it.
This is a WIP, as I could use some feedback on whether I did things the right way, but the code does appear to work fine.
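As a rough illustration of binning without sorting the data, the sketch below assigns each value to a bin with a binary search over the breaks. It is a minimal standard-library model, not the implementation in this PR: the `cut` function, its `left_closed` handling, and the label format are all assumptions made for the example. Sorting the breaks up front is one way to avoid wrong results when they are passed unsorted.

```python
from bisect import bisect_left, bisect_right
import math


def cut(values, breaks, left_closed=False):
    """Label each value with its bin; the values themselves are never sorted.

    Hypothetical helper for illustration only.
    """
    edges = sorted(breaks)  # guard against breaks passed in unsorted
    bounds = [-math.inf, *edges, math.inf]
    # bisect_left places a value equal to a break in the bin to its left,
    # giving right-closed bins (lo, hi]; bisect_right gives [lo, hi).
    locate = bisect_right if left_closed else bisect_left
    labels = []
    for v in values:
        i = locate(edges, v)
        lo, hi = bounds[i], bounds[i + 1]
        labels.append(f"[{lo}, {hi})" if left_closed else f"({lo}, {hi}]")
    return labels


# Breaks given out of order; the input order of the values is preserved.
print(cut([7, 0, 5, 3], [5, 0]))
print(cut([5], [0, 5], left_closed=True))
```

Each lookup costs a binary search over the breaks, so the data can stay in its original order and the output lines up with the input row by row.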