docs: blog for the 1 billion row challenge #8004

Merged: 13 commits, Jan 22, 2024

lostmygithubaccount
Member

Description of changes

work in progress, just throwing up the code

Issues closed

.agg(
min_temp=ibis._.temperature.min(),
mean_temp=ibis._.temperature.mean(),
max_temp=ibis._.temperature.max(),

Member

good place to show off selectors:

.agg(s.across(_.temp, {"min": _.min(), "mean": _.mean(), "max": _.max()}))

Member Author

not sure I've written this correctly but getting an error:

TypeError                                 Traceback (most recent call last)
/Users/cody/repos/ibis/docs/posts/1brc/index.qmd in line 6
      288 t = ibis.read_csv("1brc/data/measurements.txt", **kwargs)
      289 res = (
      290     t
      291     .group_by(ibis._.station)
      292     .agg(
----> 293         s.across(
      294             ibis._.temperature,
      295             {
      296                 "min": ibis._.min(),
      297                 "mean": ibis._.mean(),
      298                 "max": ibis._.max(),
      299             },
      300         )
      301     )
      302     .order_by(ibis._.station.desc())
      303 )
      305 res

File ~/repos/ibis/ibis/selectors.py:498, in across(selector, func, names)
    496 funcs = dict(func if isinstance(func, Mapping) else {None: func})
    497 if not isinstance(selector, Selector):
--> 498     selector = c(*util.promote_list(selector))
...
--> 396     names = frozenset(col if isinstance(col, str) else col.get_name() for col in names)
    398     def func(col: ir.Value) -> bool:
    399         schema = col.op().table.schema

TypeError: unhashable type: 'Deferred'
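
One plausible reading of this traceback, sketched in plain Python rather than with Ibis itself: the `frozenset(...)` call on line 396 needs hashable elements, and in Python a class that overrides `__eq__` without defining `__hash__` is unhashable. The `FakeDeferred` class below is hypothetical and only mimics that one property; whether Ibis's `Deferred` is unhashable for exactly this reason is an assumption.

```python
class FakeDeferred:
    """Hypothetical stand-in for a deferred expression (not the Ibis class)."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # Overriding __eq__ without defining __hash__ sets __hash__ to None,
        # which makes instances unhashable.
        return FakeDeferred(f"({self.name} == {other!r})")

    def get_name(self):
        # Like a deferred method call, this returns another deferred object
        # instead of a plain string.
        return FakeDeferred(f"{self.name}.get_name()")


def collect_names(cols):
    # Same shape as the frozenset(...) expression shown in the traceback.
    return frozenset(c if isinstance(c, str) else c.get_name() for c in cols)


# Plain strings take the isinstance(col, str) branch and hash fine.
names_from_strings = collect_names(["temperature"])

# A deferred-like object reaches frozenset() and raises TypeError.
try:
    collect_names([FakeDeferred("temperature")])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(names_from_strings)
print(error_message)
```

Read this way, the traceback suggests passing a column name string or a selector (rather than `ibis._.temperature`) as the first argument to `s.across`. That is an inference from the code visible in the traceback, not something confirmed in this thread.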

header=False,
columns={"station": "VARCHAR", "temperature": "DOUBLE"},
)
elif ibis.get_backend().name == "polars":

Member

Can you instead collect the kwargs into a dict and then have a single call to ibis.read_csv? It's pretty noisy with the repetition.

```

```{python}
t

Member

You're already showing t in the previous block of code.

@ncclementi
Contributor

Dropping a couple of links I found in the Python submissions that might help as a reference for what people did:

Here are the Dask and Spark solutions. I don't know if we want to get into trying to run those with Ibis, but just in case: gunnarmorling/1brc#450 (comment)

@cpcloud
Member

cpcloud commented Jan 17, 2024

ibis.read_csv ultimately calls pl.scan_csv if possible: https://github.com/ibis-project/ibis/blob/main/ibis/backends/polars/__init__.py#L169

separator=";",
has_header=False,
new_columns=["station", "temperature"],
schema={"station": pl.datatypes.Utf8, "temperature": pl.datatypes.Float64},

Member

Suggested change
- schema={"station": pl.datatypes.Utf8, "temperature": pl.datatypes.Float64},
+ schema={"station": pl.Utf8, "temperature": pl.Float64},

@ncclementi
Contributor

Interesting. I haven't used Polars much; I was just reading the thread on what people tried. It looks like Ritchie chimed in on that thread, explaining why read_csv is slower than scan_csv in their case:

If you read csv, you force polars to materialize the whole dataset at once. It is expected to be slower as you don't allow polars to determine how to process the query.

gunnarmorling/1brc#62 (reply in thread)
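
As a rough plain-Python analogy for the quote above (it does not show how Polars works internally): an eager approach builds the full list of rows before aggregating, while a streaming approach folds each row into running statistics as it is read. The function name `streaming_stats` and the sample rows are made up for illustration.

```python
from collections import defaultdict


def streaming_stats(lines):
    """Fold 'station;temperature' lines into per-station min/mean/max."""
    # Per station: [min, max, running sum, count]
    acc = defaultdict(lambda: [float("inf"), float("-inf"), 0.0, 0])
    for line in lines:  # works on any iterable, e.g. an open file
        station, _, value = line.rstrip("\n").partition(";")
        temp = float(value)
        entry = acc[station]
        entry[0] = min(entry[0], temp)
        entry[1] = max(entry[1], temp)
        entry[2] += temp
        entry[3] += 1
    return {
        station: {"min": lo, "mean": total / n, "max": hi}
        for station, (lo, hi, total, n) in acc.items()
    }


# Made-up sample rows in a "station;temperature" layout.
sample = ["Hamburg;12.0\n", "Lima;20.0\n", "Hamburg;8.0\n"]

# A generator yields one row at a time, so nothing is materialized up front.
result = streaming_stats(line for line in sample)
print(result)
```

Memory use here grows with the number of distinct stations rather than the number of rows, which is the kind of freedom a lazy scan leaves to the query engine.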

@lostmygithubaccount
Member Author

This code is so simple that I think we should try to show it across as many backends as possible: DuckDB, Polars, DataFusion, SQLite (?), Postgres, and ClickHouse should all work pretty easily? Then we can have a fancy title like "Using one Python dataframe API to take the billion row challenge with DuckDB, Polars, DataFusion, ClickHouse, SQLite, Postgres"

@cpcloud
Member

cpcloud commented Jan 17, 2024

I would show it across the columnar local backends, only to avoid having to write a bunch of ingestion code for backends that don't support read_csv (SQLite for example).

@lostmygithubaccount
Member Author

so far:

  • 27s for DuckDB
  • 290s for Polars, not sure what's going on there
  • 76s for DataFusion

@ncclementi
Contributor

@lostmygithubaccount what are the specs of the machine you are running this on? The Polars results reported here for an M1 with 32GB RAM are way different: https://github.com/ifnesi/1brc/tree/main#performance-on-a-macbook-pro-m1-32gb

@lostmygithubaccount
Member Author

also a macbook pro M1 32GB -- just pushed duckdb/polars/datafusion, need to be away for a bit, will try to finish this up later. unclear how to pass in the right kwargs for Clickhouse right now

}

# kwargs = duckdb_kwargs if ibis.get_backend().name == "duckdb" else polars_kwargs
match ibis.get_backend().name:

Member Author

I think this is my first use of a match statement

@lostmygithubaccount added the docs and docs-preview labels Jan 18, 2024
@ibis-docs-bot removed the docs-preview label Jan 18, 2024
@lostmygithubaccount marked this pull request as ready for review January 18, 2024 16:52

::: {.panel-tabset}

## DuckDb

Contributor

Suggested change
- ## DuckDb
+ ## DuckDB

Member Author

arggghhhh thanks...rerendering for another 20 minutes 😂

@ncclementi
Contributor

I think the post looks great, @lostmygithubaccount.

The last small comment I have is whether the conclusion section should come before the Bonus content. I think the Bonus is neat, but it feels disconnected from the rest, and people might skip it and not read the conclusion. But this is just an opinion, feel free to ignore.

@lostmygithubaccount
Member Author

I think this is good to merge today

Contributor

@ncclementi left a comment

LGTM

@cpcloud merged commit 141edea into ibis-project:main Jan 22, 2024
19 checks passed
@cpcloud added this to the 8.0 milestone Jan 22, 2024