docs: Add alternatives.qmd #7333

NickCrews · 2023-10-11T20:33:03Z

Other libs such as polars have this and I think they are very useful:
https://pola-rs.github.io/polars/user-guide/misc/alternatives/


gforsyth

Can you elaborate a little more on how these are useful? I tend to view them (and I've definitely written them, I am not above the fray) the same way as I view benchmarks -- biased towards the library at hand.

I think they are usually trying to answer "why should I use this instead of ?" And I've asked myself that question a bunch, but I don't know if I would arrive at an answer based on an alternatives list like this.

I'm not against including comparisons, but I do think it comes with a certain maintenance burden if/when projects ask to be included or removed, or to quibble over a given characterization.

gforsyth · 2023-10-11T20:40:36Z

docs/concepts/alternatives.qmd

+| ------- | ---------------- | ---------- | ------------------------------------ | ---------------- | ----------- | ---------- | ----------- | ----------- | --------------- |
+| Ibis    | expression-based | Lazy       | ✅, plus some backends do additional | ✅               | 🟡          | 🟡         | 🟡          | 🟡          | 🟡              |
+| Pandas  | pandas           | Eager      | ❌                                   | Python object    | numpy/arrow | ❌         | ❌          | ❌          | optional >= 2.0 |
+| Dask DF | pandas           | Eager      | ❌                                   | Python object    | numpy/arrow | ✅         | ✅          | ✅          | ❌              |


dask is not eager

gforsyth · 2023-10-11T20:40:51Z

docs/concepts/alternatives.qmd

+| Ibis    | expression-based | Lazy       | ✅, plus some backends do additional | ✅               | 🟡          | 🟡         | 🟡          | 🟡          | 🟡              |
+| Pandas  | pandas           | Eager      | ❌                                   | Python object    | numpy/arrow | ❌         | ❌          | ❌          | optional >= 2.0 |
+| Dask DF | pandas           | Eager      | ❌                                   | Python object    | numpy/arrow | ✅         | ✅          | ✅          | ❌              |
+| Modin   | pandas           | Eager      | ❌                                   | Python object    | numpy/arrow | ✅         | ✅          | ✅          | ❌              |


I don't think modin is eager, either, but I'm not certain

gforsyth · 2023-10-11T20:43:13Z

docs/concepts/alternatives.qmd

+
+Some general summaries:
+
+Any *eager* library is going to be limited in its ability to optimize queries.


I think this needs to be reconsidered if DuckDB falls into the eager camp. They do not appear to be limited in their ability to optimize queries.

riiiggght, I'm splitting things up poorly, I meant for all eager imperative APIs
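To make the distinction in this thread concrete, here is a toy sketch in plain Python of why an eager, imperative API has less room to optimize than a lazy one. Everything in it (`CountingSource`, `eager_filter`, `eager_head`, `LazyQuery`) is a hypothetical name invented for illustration; it is not how Ibis, pandas, Dask, Modin, or DuckDB are implemented. The eager calls run one step at a time, so the filter scans every row before it knows a `head` follows. The lazy version only records the steps and can stop scanning as soon as enough rows match.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class CountingSource:
    """Hypothetical data source that counts how many rows were read."""

    def __init__(self, n: int):
        self.n = n
        self.rows_read = 0

    def scan(self):
        for i in range(self.n):
            self.rows_read += 1
            yield {"x": i}


def eager_filter(source: CountingSource, pred: Callable) -> list:
    # Eager step: executes immediately, with no knowledge of later steps.
    return [row for row in source.scan() if pred(row)]


def eager_head(rows: list, k: int) -> list:
    return rows[:k]


@dataclass(frozen=True)
class LazyQuery:
    """Toy plan that supports one filter followed by one head."""

    source: CountingSource
    pred: Optional[Callable] = None
    limit: Optional[int] = None

    def filter(self, pred: Callable) -> "LazyQuery":
        return LazyQuery(self.source, pred, self.limit)

    def head(self, k: int) -> "LazyQuery":
        return LazyQuery(self.source, self.pred, k)

    def execute(self) -> list:
        out = []
        if self.limit == 0:
            return out
        for row in self.source.scan():
            if self.pred is None or self.pred(row):
                out.append(row)
                # The whole plan is known here, so the scan can stop early.
                if self.limit is not None and len(out) >= self.limit:
                    break
        return out


def is_even(row: dict) -> bool:
    return row["x"] % 2 == 0


eager_source = CountingSource(1000)
eager_result = eager_head(eager_filter(eager_source, is_even), 3)

lazy_source = CountingSource(1000)
lazy_result = LazyQuery(lazy_source).filter(is_even).head(3).execute()

print(eager_source.rows_read, lazy_source.rows_read)
```

Both paths return the same three rows, but the eager path reads all 1000 rows while the lazy path stops after 5. This only illustrates the imperative, step-by-step case; as noted above, an engine can execute when asked and still optimize the query it was handed.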

NickCrews · 2023-10-11T21:48:38Z

Thanks for the review @gforsyth! I can make those changes, but first wanted to get the general direction right before I do more work.

I agree with what you are saying, I read them as biased too. But bias is still useful, I just use it as one signal among many. I tend to give a lot more weight in my decision to comparisons written by library authors than most comparisons written on medium.com by some 3rd party user. I would rather have a well-informed but biased review (that I know is biased) than a poorly informed, unbiased review.

For instance, after I read the polars writeup I learned about how vaex needs to translate csvs/parquet into a memory mapped file, and the IO drawbacks of OS-controlled paging. That is a significant drawback I was experiencing when I used vaex, but I had no idea why. Vaex didn't advertise this, and I'd never seen it anywhere else. So I don't read that polars writeup trying to learn about the benefits of polars, but I look for how much they are able to find the flaws with their competitors. If they mention a drawback that is relevant to me, that is super useful, and I am going to keep that in mind when I read that other lib's docs/examples. If they don't mention any drawbacks that I find compelling, then I feel a lot more safe about the competitor. :)

It also helps me understand the priorities of the lib. For example the polars comparison never mentions the multi-machine advantages of some other systems. That means that I bet polars doesn't support multiple machines.

> projects ask to be included/removed

lol I just did this to polars at pola-rs/polars#11670 :)

Is this a problem? Ask them to write the PR, and if it seems fair then accept it. Or reject it. I would rather add benefit to the thousands of users who read the comparison than be safe politically. We could add a disclaimer saying that these are our biased opinions, and please file a PR here if you see a problem? I think this would also help with the credibility of ibis, I don't want this to come off as "ibis is the best lib in all situations".

lostmygithubaccount · 2023-10-11T22:58:37Z

I have some mixed feelings I need to write up; I'm not against a page like this in general, but I do have concerns with this as-is

cpcloud · 2023-10-12T14:10:55Z

@NickCrews Thanks for the PR.

I think this is the wrong level of analysis for comparing $THINGs with Ibis.

Ibis is about bringing its API to bear on a given engine, and working with that engine in the best way possible.

If we're going to compare Ibis to anything it should be to the APIs of other DataFrame libraries, not the underlying engine and its capabilities.

Details like whether a system handles larger-than-memory data or whether a system is distributed should be part of Ibis's backend documentation and not on a page comparing Ibis with those systems.

Users should choose between engines that Ibis works with, not between Ibis and the engine.

Here's a shortlist of libraries that have a DataFrame API that I think might be worth comparing to Ibis:

- pandas-like APIs (pandas, dask, modin, koalas (pandas on pyspark))
- pyspark-like APIs (pyspark, snowpark, pystarburst)
- neither pandas-like nor pyspark-like APIs (polars, vaex)

NickCrews · 2023-10-12T15:33:29Z

Thanks for the feedback everyone. This is basically the same feedback that I just got from the maintainers of Polars on that linked PR too, so it seems like I'm the oddball out here :) With that in mind I realize that I might have the wrong idea here :)

I do want to push back a little though. When I'm deciding how to solve a problem, I look at all of these libraries holistically: their performance characteristics, their API, but also their stability, amount of development, amount of community, quality of documentation, extendability, etc. I find it very helpful whenever I find an article that talks about these things, and I was hoping that this could cover all those. When choosing between pandas and ibis, I bet most people would choose pandas if we didn't talk about performance and only talked about their API, because they are similar enough but pandas has such a larger community. The combo of features that ibis has and allows is what makes it great.

My personal timeline was I was using pandas, then that couldn't handle my dataset size, so I googled around and found vaex getting compared to it, so I tried that. But that still didn't work well enough, so then I did more and more searching and finally found ibis, which has been awesome. But like none of the comparisons I read included ibis! They had duckdb, but I didn't want to learn/write sql. What a shame I didn't find it earlier. I think we should start positioning ibis as a peer of these other systems because they solve the same problem, even if they aren't the same thing.

Close this out if you still don't find that convincing and I'll be satisfied since I'm so outvoted. Thanks!

cpcloud · 2023-10-12T15:46:49Z

@NickCrews I think your pushback is very valuable.

Looking at the problem holistically is critical, no argument there.

I find your path-to-ibis story very compelling, but I am struggling to come up with a way to communicate a general version of it that can be consumed in some kind of diagram, chart or table.

Definitely agree that talking about Ibis without an execution engine is not particularly compelling. Pandas-like isn't enough, but pandas-like plus $ENGINE is very compelling.

Perhaps there are two paths to explore:

1. API comparisons as I mention above, but with some additional emphasis on the fact that you get the performance of the underlying backend, and a table similar to the one you made except without the Ibis row.
2. I wonder if a narrative better communicates the sweet spot. Narratives can be compelling, especially if there's a non-obvious route to Ibis.

NickCrews · 2023-10-24T14:30:29Z

I think both of those changes sound great and can make those here! If I do that do you think you will want to merge it, or are other concerns still standing? I don't want to make the effort for nothing. Thanks!

docs: Add alternatives.qmd

2851f2c


gforsyth reviewed Oct 11, 2023


NickCrews closed this Dec 8, 2023
