-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve regex filtering performance #5613
Comments
Also got a |
Hi @ritchie46, unfortunately, the data is private and cannot be shared. In general, though, here are some stats about the distribution of the lengths of the column, which are all Unicode strings: (
df
.select(pl.col("body").str.lengths())
.select([
pl.col("body").min().suffix("_min"),
pl.col("body").max().suffix("_max"),
pl.col("body").median().suffix("_median"),
pl.col("body").mean().suffix("_mean"),
pl.col("body").var().suffix("_var"),
pl.col("body").skew().suffix("_skewness"),
])
)
shape: (1, 6)
┌──────────┬──────────┬─────────────┬───────────┬──────────┬───────────────┐
│ body_min ┆ body_max ┆ body_median ┆ body_mean ┆ body_var ┆ body_skewness │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞══════════╪══════════╪═════════════╪═══════════╪══════════╪═══════════════╡
│ 1 ┆ 821896 ┆ 16.0 ┆ 35.556744 ┆ 1.5757e6 ┆ 304.124384 │
└──────────┴──────────┴─────────────┴───────────┴──────────┴───────────────┘ As you can see, it's a relatively very skewed distribution, but most of the strings are relatively small. I could maybe come up with a synthetic dataset using lorem ipsum that tries to mimic this distribution and formalize these comparisons. However, I don't see how that would change the results significantly. Also, I edited the OP above to clarify that after removing the logic for the |
Could you create a dummy dataset that shows the slowdown you describe? |
Hi @ritchie46, please see this notebook. It uses the Yelp reviews dataset. Note that you'll need a machine with at least 8GB of RAM (maybe even more because of the initial ndjson to Parquet conversion) to run it. I used an The results for the complex regex were:
You can see that DuckDB's compute time is the same regardless of how complex the regex is. Polars, in contrast, really starts to suffer with more complex regex. (On the plus side, for some reason, even the |
I don't have a yelp account. Can you compress the dataset and send it here? |
@ritchie46 I've attached a subset that shows the same behaviour import polars as pl
df = pl.read_csv("review.csv")
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Tristique risus nec feugiat in fermentum posuere urna nec. Nunc id cursus metus aliquam eleifend mi in nulla. Etiam sit amet nisl purus in mollis nunc sed id. Non arcu risus quis varius quam quisque. At in tellus integer feugiat. Magna ac placerat vestibulum lectus mauris. Convallis posuere morbi leo urna molestie at elementum eu facilisis. Libero enim sed faucibus turpis in eu mi. Est ultricies integer quis auctor elit. Gravida cum sociis natoque penatibus et. Dictumst vestibulum rhoncus est pellentesque elit. Pretium aenean pharetra magna ac placerat vestibulum lectus."
# Create 50+ patterns to match against
regex_list = list(zip(*[iter(text.split(" "))] * 2))
regex_list = [r"" + r"\s".join(x) + r"\b" for x in regex_list]
len(regex_list)
# Single regex
df.filter(pl.col("text").str.contains(regex_list[0]))
# Combine regexes
regex = "|".join(regex_list)
# Slower on combination
df.filter(pl.col("text").str.contains(regex)) |
@mjkanji Probably you are just paying the price for advanced unicode support in regex. # Creating a filter without `\b`.
In [22]: text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Tristique risus nec feu
...: giat in fermentum posuere urna nec. Nunc id cursus metus aliquam eleifend mi in nulla. Etiam sit amet nisl purus in mollis nunc sed id. Non arcu risus quis
...: varius quam quisque. At in tellus integer feugiat. Magna ac placerat vestibulum lectus mauris. Convallis posuere morbi leo urna molestie at elementum eu f
...: acilisis. Libero enim sed faucibus turpis in eu mi. Est ultricies integer quis auctor elit. Gravida cum sociis natoque penatibus et. Dictumst vestibulum rh
...: oncus est pellentesque elit. Pretium aenean pharetra magna ac placerat vestibulum lectus."
...:
...: # Create 50+ patterns to match against
...: regex_list2 = list(zip(*[iter(text.split(" "))] * 2))
...: regex_list2 = [r"\s".join(x) for x in regex_list2]
...: combined_regex2 = "|".join(regex_list2)
In [20]: %%timeit -n 1 -r 5
...: print(polars_single_filter(df, combined_regex).height)
...:
...:
255
255
255
255
255
27.4 s ± 1.61 s per loop (mean ± std. dev. of 5 runs, 1 loop each)
# Filter without "\b".
In [23]: %%timeit -n 1 -r 5
...: print(polars_single_filter(df, combined_regex2).height)
...:
...:
6786
6786
6786
6786
6786
554 ms ± 3.22 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
# Filter first without "\b" and filter those results again with "\b".
In [24]: %%timeit -n 1 -r 5
...: print(polars_multiple_filters(polars_multiple_filters(df, regex_list2), regex_list).height)
...:
...:
255
255
255
255
255
7.03 s ± 61.3 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) |
Your regex is also badly constructed as you add the If you write the regex as In [25]: text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Tristique risus nec feu
...: giat in fermentum posuere urna nec. Nunc id cursus metus aliquam eleifend mi in nulla. Etiam sit amet nisl purus in mollis nunc sed id. Non arcu risus quis
...: varius quam quisque. At in tellus integer feugiat. Magna ac placerat vestibulum lectus mauris. Convallis posuere morbi leo urna molestie at elementum eu f
...: acilisis. Libero enim sed faucibus turpis in eu mi. Est ultricies integer quis auctor elit. Gravida cum sociis natoque penatibus et. Dictumst vestibulum rh
...: oncus est pellentesque elit. Pretium aenean pharetra magna ac placerat vestibulum lectus."
...:
...: # Create 50+ patterns to match against
...: regex_list3 = list(zip(*[iter(text.split(" "))] * 2))
...: regex_list3 = [r"\s".join(x) for x in regex_list3]
...: combined_regex3 = r"\b(" + "|".join(regex_list3) + r")\b"
In [19]: %%timeit -n 1 -r 5
...: r2 = duckdb_filter(df_table, combined_regex)
...: print(r2.shape)
...: del r2
...:
...:
(255, 1)
(255, 1)
(255, 1)
(255, 1)
(255, 1)
7.78 s ± 233 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
In [27]: %%timeit -n 1 -r 5
...: r2 = duckdb_filter(df_table, combined_regex3)
...: print(r2.shape)
...: del r2
...:
...:
(255, 1)
(255, 1)
(255, 1)
(255, 1)
(255, 1)
7.44 s ± 9.38 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
In [28]: %%timeit -n 1 -r 5
...: print(polars_single_filter(df, combined_regex3).height)
...:
...:
255
255
255
255
255
4.67 s ± 105 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) |
I know it is just an example, but the current constructed regex is probably not what you want as there are When disabling unicode support for regex (ascii only), the speedup is 9x. In [43]: %%timeit -n 1 -r 5
...: print(polars_single_filter(df, combined_regex3.replace(".", "\.")).height)
...:
...:
255
255
255
255
255
4.66 s ± 113 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
In [42]: %%timeit -n 1 -r 5
...: print(polars_single_filter(df, r"(?-u)" + combined_regex3.replace(".", "\.")).height)
...:
...:
255
255
255
255
255
540 ms ± 11.9 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) Also "\b(foo1\sbar1|foo2\sbar|...)\b" might become even faster in the future: |
@ritchie46 You don't really need a Yelp account. The download page does ask for your name, email, and initials, but it's not a sign-up/sign in page. It's just a legal formality to indicate you've accepted their agreement for using the dataset and they say they do not store your email or use it contact you in any way.
@ghuls Point taken! I definitely need to get better at working with regex. Unlike your results, in my testing, though, Polars is 2x slower than DuckDB for # 8.73 s ± 13.6 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
r2 = duckdb_filter(df_table, combined_regex3)
print(r2.shape, end="\r")
del r2
# 18.9 s ± 66.3 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
%%timeit -n 1 -r 5
print(polars_single_filter(df, combined_regex3).height, end="\r") I tested again with an even more powerful machine (16vCPU) and Polars was able to beat DuckDB on the above now: 4s vs 7s. Another interesting tidbit: DuckDB actually gets worse if you remove the # 8.71 s ± 17.8 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
%%timeit -n 1 -r 5
r2 = duckdb_filter(df_table, combined_regex)
print(r2.shape, end="\r")
del r2
# 30.3 s ± 39.1 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
%%timeit -n 1 -r 5
r2 = duckdb_filter(df_table, combined_regex2)
print(r2.shape, end="\r")
del r2 Though, all of this does beg the question: why DuckDB is faster, even with my original, sub-optimal query? Is it somehow optimizing the And how is it able to do so with just a single core? For example, if I set # DuckDB for reference. Always single core
# 8.71 s ± 7.21 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
%%timeit -n 1 -r 5
r2 = duckdb_filter(df_table, combined_regex3)
print(r2.shape, end="\r")
del r2
# Polars with streaming=True. All cores
# 18.9 s ± 31.5 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
%%timeit -n 1 -r 5
print(polars_single_filter(df, combined_regex3, streaming=True).height, end="\r")
# Polars with streaming=False. Single core
# 41.1 s ± 55.4 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
%%timeit -n 1 -r 5
print(polars_single_filter(df, combined_regex3, streaming=False).height, end="\r") When you combine this with the fact that a more powerful machine was able to beat DuckDB simply because it had more cores, it leads me to believe that the regex engine underlying DuckDB is still more efficient (or is running some 'query plan optimizations' to the regex) because in an apples-to-apples single-core comparison, it does a better job on even the least optimized regex. |
For anyone just looking for the best solution (and not as interested in the minutia), the fastest solution was the multi-stage filter (first filter without # 2.53 s ± 42.5 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
%%timeit -n 1 -r 5
print(polars_single_filter(polars_single_filter(df, combined_regex2), combined_regex).height, end="\r") |
Different regex libraries have different performance and optimization stragegies. Rust regex library is Unicode aware (so "\s" contains unicode spaces (e.g. non-breaking spaces) and "\b" contains probably a much bigger set of characters than the regex library used by duckdb. If all your text is just ASCII, you can disable unicode support and see that Rust regex is x9 faster than with unicode support for this example. In [42]: %%timeit -n 1 -r 5
...: print(polars_single_filter(df, r"(?-u)" + combined_regex3.replace(".", "\.")).height)
...:
...:
255
255
255
255
255
540 ms ± 11.9 ms per loop (mean ± std. dev. of 5 runs, 1 loop each) |
Got it! Thank you for the detailed explanations. My data is Unicode (non-latin, emojis, etc.), but the keywords/phrases I'm looking for are ASCII and I don't think there would be many edge cases where a unicode character being treated as a word boundary would be a problem, so I think I'll switch unicode off. It is actually quite interesting to see the rather stark difference when using regex with unicode on/off even on the original, unoptimized pattern (1min 38s to 2s, a ~50x speedup!): # original pattern with unicode
# 1min 38s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
print(polars_single_filter(df, combined_regex).height)
# original pattern without unicode
# 2.18 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
print(polars_single_filter(df, r"(?-u)" + combined_regex).height)
# optimised pattern with unicode
# 18.6 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
print(polars_single_filter(df, combined_regex3).height)
# optimised pattern without unicode
# 2.15 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
print(polars_single_filter(df, r"(?-u)" + combined_regex3).height) |
To add some context here as the regex crate author: Unicode very often leads to worse performance on at least some dimension, but There are other competing factors in play here too (like literal optimizations), but the Unicode-aware And indeed, most regex engines make |
Problem description
While benchmarking Polars vs. DuckDB to find any rows that contain any of a set of keywords/phrases, I noticed that Polars'
str.contains(some_regex)
implementation becomes progressively worse for more complex patterns. By comparison, I also noticed DuckDB takes the same time to filter on a pattern regardless of how complex it is. I'm wondering if regex filters in Polars could be improved somehow (maybe by looking at how DuckDB handles it)?I also tried parallelizing the search so that each keyword in the complex pattern was handled separately (to leverage Polars' excellent multi-threading performance), but this was also significantly worse than DuckDB.
As an example,
DuckDB was able to handle even the more complex regex in ~2 seconds. And it's time to finish the filter actually didn't change whether it was a single keyword/phrase or all 50 merged into one.
Note that I also tried using
collect(allow_streaming=True)
with the lazy API but it made no difference in Option 1 and the workload was still single-threaded. Option 2 was parallelized, but I believe that's because of how the filter is set up and whetherallow_streaming
is used or not seems to have no impact on the performance.Also, while DuckDB is faster, it also consumed a significantly larger amount of memory (3-4GB more, in my case) than Polars on this task so maybe there's a tradeoff here. But if there's speed to be had on the table, it would be great to explore possible optimizations (or give users the option to trade faster processing for more RAM use).(Upon further inspection, DuckDB's excessive RAM use was coming from its inefficient implementation of lag/lead, compared to Polars' no-copy
shift
. Since this post is about regex filtering, I don't think that's relevant here.)See this SO post and specifically this comment and the ones that follow it for more details and concrete numbers.
Edit: Updated the numbers to focus solely on the regex filters (and remove the other parts of the transformation for both DuckDB and Polars). Also clarified that DuckDB's excessive memory use was resulting from those other parts of the transform, and not the filtering operation itself.
pl.show_versions()
The text was updated successfully, but these errors were encountered: