docs: Add alternatives.qmd #7333

Closed
wants to merge 1 commit into from

Conversation

NickCrews
Contributor

Other libs such as polars have this and I think they are very useful:
https://pola-rs.github.io/polars/user-guide/misc/alternatives/

Member

@gforsyth gforsyth left a comment

Can you elaborate a little more on how these are useful? I tend to view them (and I've definitely written them, I am not above the fray) the same way as I view benchmarks -- biased towards the library at hand.

I think they are usually trying to answer "why should I use this instead of ?" And I've asked myself that question a bunch, but I don't know if I would arrive at an answer based on an alternatives list like this.

I'm not against including comparisons, but I do think it comes with a certain maintenance burden if/when projects ask to be included or removed, or to quibble over a given characterization.

| ------- | ---------------- | ---------- | ------------------------------------ | ---------------- | ----------- | ---------- | ----------- | ----------- | --------------- |
| Ibis | expression-based | Lazy | ✅, plus some backends do additional | ✅ | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 |
| Pandas | pandas | Eager | ❌ | Python object | numpy/arrow | ❌ | ❌ | ❌ | optional >= 2.0 |
| Dask DF | pandas | Eager | ❌ | Python object | numpy/arrow | ✅ | ✅ | ✅ | ❌ |
Member

dask is not eager

| Ibis | expression-based | Lazy | ✅, plus some backends do additional | ✅ | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 |
| Pandas | pandas | Eager | ❌ | Python object | numpy/arrow | ❌ | ❌ | ❌ | optional >= 2.0 |
| Dask DF | pandas | Eager | ❌ | Python object | numpy/arrow | ✅ | ✅ | ✅ | ❌ |
| Modin | pandas | Eager | ❌ | Python object | numpy/arrow | ✅ | ✅ | ✅ | ❌ |
Member

I don't think modin is eager, either, but I'm not certain


Some general summaries:

Any *eager* library is going to be limited in its ability to optimize queries.
Member

I think this needs to be reconsidered if DuckDB falls into the eager camp. They do not appear to be limited in their ability to optimize queries.

Contributor Author

riiiggght, I'm splitting things up poorly, I meant for all eager imperative APIs
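
To make the "eager imperative API" distinction above concrete, here is a minimal sketch. The `EagerTable` and `LazyTable` classes are hypothetical toys written for this illustration; they are not the API of pandas, Dask, DuckDB, Ibis, or any other library. The eager version runs each call immediately and so cannot reorder work, while the lazy version sees the whole plan and can move a filter ahead of a column computation that the filter does not depend on.

```python
# Toy classes for illustration only; not the API of any real library.


class EagerTable:
    """Executes each call immediately, so no step can see the steps after it."""

    def __init__(self, rows, rows_scanned=0):
        self.rows = list(rows)
        self.rows_scanned = rows_scanned  # total rows processed so far

    def with_column(self, name, fn):
        new_rows = [{**row, name: fn(row)} for row in self.rows]
        return EagerTable(new_rows, self.rows_scanned + len(self.rows))

    def filter(self, column, predicate):
        new_rows = [row for row in self.rows if predicate(row[column])]
        return EagerTable(new_rows, self.rows_scanned + len(self.rows))


class LazyTable:
    """Only records steps; the whole plan is visible before anything runs."""

    def __init__(self, rows, steps=()):
        self.rows = rows
        self.steps = tuple(steps)

    def with_column(self, name, fn):
        return LazyTable(self.rows, self.steps + (("with_column", name, fn),))

    def filter(self, column, predicate):
        return LazyTable(self.rows, self.steps + (("filter", column, predicate),))

    def optimized_steps(self):
        """Move each filter ahead of preceding with_column steps that do not
        define the column the filter reads."""
        steps = list(self.steps)
        for i in range(len(steps)):
            j = i
            while (
                j > 0
                and steps[j][0] == "filter"
                and steps[j - 1][0] == "with_column"
                and steps[j - 1][1] != steps[j][1]
            ):
                steps[j - 1], steps[j] = steps[j], steps[j - 1]
                j -= 1
        return steps

    def execute(self):
        rows, rows_scanned = list(self.rows), 0
        for kind, name, fn in self.optimized_steps():
            rows_scanned += len(rows)
            if kind == "filter":
                rows = [row for row in rows if fn(row[name])]
            else:
                rows = [{**row, name: fn(row)} for row in rows]
        return rows, rows_scanned


data = [{"x": i} for i in range(1000)]

eager = (
    EagerTable(data)
    .with_column("y", lambda r: r["x"] * 2)
    .filter("x", lambda x: x < 10)
)
lazy_rows, lazy_scanned = (
    LazyTable(data)
    .with_column("y", lambda r: r["x"] * 2)
    .filter("x", lambda x: x < 10)
    .execute()
)

print(eager.rows == lazy_rows)           # both approaches give the same rows
print(eager.rows_scanned, lazy_scanned)  # the lazy plan processes fewer rows
```

This only models one optimization (running a filter earlier), and it says nothing about engines that accept a whole query at once and optimize it internally, which is the DuckDB case raised above.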

@NickCrews
Contributor Author

NickCrews commented Oct 11, 2023

Thanks for the review @gforsyth! I can make those changes, but first wanted to get the general direction right before I do more work.

I agree with what you are saying, I read them as biased too. But bias is still useful, I just use it as one signal among many. I tend to give a lot more weight in my decision to comparisons written by library authors than most comparisons written on medium.com by some 3rd party user. I would rather have a well-informed but biased review (that I know is biased) than a poorly informed, unbiased review.

For instance, after I read the polars writeup I learned about how vaex needs to translate CSVs/parquet into a memory-mapped file, and the IO drawbacks of OS-controlled paging. That is a significant drawback I was experiencing when I used vaex, but I had no idea why. Vaex didn't advertise this, and I'd never seen it anywhere else. So I don't read that polars writeup trying to learn about the benefits of polars; I look for how well they are able to find the flaws in their competitors. If they mention a drawback that is relevant to me, that is super useful, and I am going to keep it in mind when I read that other lib's docs/examples. If they don't mention any drawbacks that I find compelling, then I feel a lot safer about the competitor. :)

It also helps me understand the priorities of the lib. For example, the polars comparison never mentions the multi-machine advantages of some other systems. That means that I bet polars doesn't support multiple machines.

> projects ask to be included/removed

lol I just did this to polars at pola-rs/polars#11670 :)

Is this a problem? Ask them to write the PR, and if it seems fair then accept it. Or reject it. I would rather add benefit to the thousands of users who read the comparison than be safe politically. We could add a disclaimer saying that these are our biased opinions, and please file a PR here if you see a problem? I think this would also help with the credibility of ibis, I don't want this to come off as "ibis is the best lib in all situations".

@lostmygithubaccount
Member

I have some mixed feelings I need to write up; I'm not against a page like this in general, but I do have concerns with this as-is

@cpcloud
Member

cpcloud commented Oct 12, 2023

@NickCrews Thanks for the PR.

I think this is the wrong level of analysis for comparing $THINGs with Ibis.

Ibis is about bringing its API to bear on a given engine, and working with that engine in the best way possible.

If we're going to compare Ibis to anything it should be to the APIs of other DataFrame libraries, not the underlying engine and its capabilities.

Details like whether a system handles larger-than-memory data or whether a system is distributed should be part of Ibis's backend documentation and not on a page comparing Ibis with those systems.

Users should choose between engines that Ibis works with, not between Ibis and the engine.

Here's a shortlist of libraries that have a DataFrame API that I think might be worth comparing to Ibis:

  • pandas-like APIs (pandas, dask, modin, koalas (pandas on pyspark))
  • pyspark-like APIs (pyspark, snowpark, pystarburst)
  • neither pandas-like nor pyspark-like APIs (polars, vaex)
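
The API-versus-engine split described above can be sketched in a few lines. This is a hypothetical toy, not Ibis code: `FilterExpr`, `run_in_python`, and `run_in_sqlite` are names invented for this illustration, and the standard-library `sqlite3` module stands in for "an engine". The point is only that one expression object can be handed to different engines, so the engine's characteristics (speed, larger-than-memory support, distribution) belong to the engine rather than to the expression API.

```python
import operator
import sqlite3
from dataclasses import dataclass

# Comparison operators the toy expression supports, keyed by their SQL spelling.
_OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}


@dataclass(frozen=True)
class FilterExpr:
    """Hypothetical engine-agnostic expression: rows of `table` where `column op value`."""

    table: str
    column: str
    op: str
    value: int


def run_in_python(expr, tables):
    """'Engine' 1: evaluate the expression over in-memory lists of dicts."""
    compare = _OPS[expr.op]
    return [row for row in tables[expr.table] if compare(row[expr.column], expr.value)]


def run_in_sqlite(expr, conn):
    """'Engine' 2: compile the same expression to SQL and let SQLite run it."""
    if expr.op not in _OPS:
        raise ValueError(f"unsupported operator: {expr.op}")
    sql = f'SELECT * FROM "{expr.table}" WHERE "{expr.column}" {expr.op} ?'
    cursor = conn.execute(sql, (expr.value,))
    names = [description[0] for description in cursor.description]
    return [dict(zip(names, row)) for row in cursor]


data = [
    {"name": "a", "size": 1},
    {"name": "b", "size": 5},
    {"name": "c", "size": 9},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT, size INTEGER)")
conn.executemany("INSERT INTO files VALUES (:name, :size)", data)

# One expression, two engines.
expr = FilterExpr(table="files", column="size", op=">", value=3)
python_result = run_in_python(expr, {"files": data})
sqlite_result = run_in_sqlite(expr, conn)


def by_name(row):
    return row["name"]


print(sorted(python_result, key=by_name) == sorted(sqlite_result, key=by_name))
```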

@NickCrews
Contributor Author

Thanks for the feedback, everyone. This is basically the same feedback that I just got from the maintainers of Polars on that linked PR too, so it seems like I'm the oddball out here :) With that in mind, I realize that I might have the wrong idea here :)

I do want to push back a little though. When I'm deciding how to solve a problem, I look at all of these libraries holistically: their performance characteristics, their API, but also their stability, amount of development, amount of community, quality of documentation, extendability, etc. I find it very helpful whenever I find an article that talks about these things, and I was hoping that this could cover all of those. When choosing between pandas and ibis, I bet most people would choose pandas if we didn't talk about performance and only talked about their APIs, because they are similar enough but pandas has such a larger community. The combo of features that ibis has and allows is what makes it great.

My personal timeline was: I was using pandas, then that couldn't handle my dataset size, so I googled around and found vaex getting compared to it, so I tried that. But that still didn't work well enough, so then I did more and more searching and finally found ibis, which has been awesome. But like none of the comparisons I read included ibis! They had duckdb, but I didn't want to learn/write SQL. What a shame I didn't find it earlier. I think we should start positioning ibis as a peer of these other systems because they solve the same problem, even if they aren't the same thing.

Close this out if you still don't find that convincing and I'll be satisfied since I'm so outvoted. Thanks!

@cpcloud
Member

cpcloud commented Oct 12, 2023

@NickCrews I think your pushback is very valuable.

Looking at the problem holistically is critical, no argument there.

I find your path-to-ibis story very compelling, but I am struggling to come up with a way to communicate a general version of it that can be consumed in some kind of diagram, chart or table.

Definitely agree that talking about Ibis without an execution engine is not particularly compelling. Pandas-like isn't enough, but pandas-like plus $ENGINE is very compelling.

Perhaps there are two paths to explore:

  1. API comparisons as I mention above, but with some additional emphasis on the fact that you get the performance of the underlying backend, and a table similar to the one you made except without the Ibis row.
  2. I wonder if a narrative better communicates the sweet spot. Narratives can be compelling, especially if there's a non-obvious route to Ibis.

@NickCrews
Contributor Author

I think both of those changes sound great, and I can make them here! If I do that, do you think you will want to merge it, or are other concerns still standing? I don't want to make the effort for nothing. Thanks!

@NickCrews NickCrews closed this Dec 8, 2023

4 participants