
We're changing database #408

Closed
samuelcolvin opened this issue Aug 29, 2024 · 5 comments

Comments

@samuelcolvin
Member

samuelcolvin commented Aug 29, 2024

Rollout

We're gradually rolling out queries to the new database now. If you're affected, you'll see a banner like this:

[Screenshot 2024-09-18: the warning banner shown to affected users]

If you notice queries taking longer or returning errors or different results, please let us know below or contact us via email or Slack.

If you need to continue querying the old database, you can do so by right-clicking on your profile picture in the top right and setting the query engine to 'TS' (Timescale, the old database):

[Screenshot 2024-09-18: the query engine setting in the profile menu]

To get rid of the warning banner, set the query engine to 'TS' and then back to 'FF' (FusionFire, the new database) again.

We will be increasing the percentage of users whose default query engine is FF over time and monitoring the impact. We may decrease it again if we notice problems. If you set a query engine explicitly to either TS or FF, this won't affect you. Otherwise, your query engine may switch back and forth. For most users, there shouldn't be a noticeable difference.

Most queries should be faster with FF, especially if they aggregate lots of data over a long time period. If your dashboards were timing out before with TS, try FF. However, some specific queries that are very fast with TS are slower with FF. In particular, TS can look up trace and span IDs almost instantly without needing a specific time range. If you click a link to a trace/span ID in a table, it will open the live view with a time range of 30 days because it doesn't know any better. If that doesn't load, reduce the time range.
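To illustrate why the time range matters, here's a minimal sketch of picking a query window for a trace/span lookup. This is a hypothetical helper for illustration only (not part of the Logfire UI): if you know roughly when the trace happened, a narrow window around that timestamp is much cheaper for a time-partitioned engine to scan than the 30-day default.

```python
from datetime import datetime, timedelta, timezone


def trace_query_window(trace_time=None, default_days=30, narrow_hours=1):
    """Choose a (start, end) time range for a trace/span ID lookup.

    Hypothetical helper for illustration; not actual Logfire code.
    A time-range-partitioned engine like FF scans far less data when
    given a tight window around the trace's approximate timestamp.
    """
    if trace_time is None:
        # No timestamp hint: fall back to the wide default the UI uses.
        end = datetime.now(timezone.utc)
        return end - timedelta(days=default_days), end
    # Timestamp hint available: query a tight window around it.
    delta = timedelta(hours=narrow_hours)
    return trace_time - delta, trace_time + delta
```

For example, `trace_query_window(trace_time=some_timestamp)` yields a two-hour window instead of a 30-day one.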

Summary

We're changing the database that stores observability data in the Logfire platform from Timescale to a custom database built on Apache Datafusion.

This should bring big improvements in performance, but will lead to some SQL compatibility issues initially (details below).

Background

Timescale is great: it can be really performant when you know the kind of queries you regularly run (so you can set up continuous aggregates) and when you can enable its compression features (which both save money and make queries faster).

Unfortunately we can't use either of those features:

  • our users can query their data however they like using SQL, so continuous aggregates aren't that helpful
  • Timescale's compression features are incompatible with row-level permissions; in Timescale/PostgreSQL we have to use row-level permissions since we're running users' SQL directly against the database

Earlier this year, as the volume of data the Logfire platform received increased in the beta, these limitations became clearer and clearer.

The other, more fundamental, limitation of Timescale is its open/closed-source business model.

The ideal data architecture for us (and, I suspect, any analytics database) is separated storage and compute: data is stored in S3/GCS as Parquet (or equivalent), with an external index used by the query/compute nodes. Timescale has this, but it's completely closed source. So we could either get a scalable architecture but be forced to use their SaaS, or run Timescale ourselves as a traditional "coupled storage and compute" database.

For lots of companies either of those options would be satisfactory, but if Logfire scales as we hope, we'd be scuppered either way.

Datafusion

We settled on Datafusion as the foundation for our new database for a few reasons:

  1. It's completely open source so we can build the separated storage and compute solution we want
  2. It's all Rust, and quite a few of our team are comfortable writing Rust, so the database isn't just a black box: we can dive in and improve it as we wish (as an example, Datafusion didn't have JSON querying support until we implemented it in datafusion-functions-json). Since starting to use Datafusion, our team has contributed 20 or 30 pull requests to Datafusion and associated projects like arrow-rs and sqlparser-rs
  3. Datafusion is extremely extensible: we can adjust the SQL syntax and how queries are planned and run, and build indexes exactly as we need them
  4. Datafusion's SQL parser has pretty good compatibility with Postgres, and again, it's just Rust so we can improve it fairly easily
  5. The project is excellently run, part of Apache, leverages the Arrow/Parquet ecosystem, and is used by large organizations like InfluxDB, Apple and Nvidia

Transition

For the last couple of months we've been double-writing to Timescale and Fusionfire (our cringey internal name for the new datafusion-based database), working on improving reliability and performance of Fusionfire for all types of queries.

Fusionfire is now significantly (sometimes >10x) faster than Timescale for most queries. There are a few low-latency queries on very recent data that are still faster on Timescale; we're working on improving those.

Currently the live view, explore view, dashboards and alerts all use Timescale by default. You can try Fusionfire now for everything except alerts by right-clicking on your profile picture in the top right and selecting "FF" as the query engine.

In the next couple of weeks we'll migrate fully to Fusionfire and retire Timescale.

We're working hard to make Fusionfire more compatible with PostgreSQL (see apache/datafusion-sqlparser-rs#1398, apache/datafusion-sqlparser-rs#1394, apache/datafusion-sqlparser-rs#1360, apache/arrow-rs#6211, apache/datafusion#11896, apache/datafusion#11876, apache/datafusion#11849, apache/datafusion#11321, apache/arrow-rs#6319, apache/arrow-rs#6208, apache/arrow-rs#6197, apache/arrow-rs#6082, apache/datafusion#11307), but there are still a few expressions which currently don't run correctly (a lot related to intervals).

If you notice any other issues, please let us know on this issue or a new issue, and we'll let you know how quickly we can fix it.

@samuelcolvin samuelcolvin pinned this issue Aug 29, 2024
@samuelcolvin
Member Author

samuelcolvin commented Aug 30, 2024

Small update as I forgot to include this in the main issue:

We previously supported direct connection to the database using the PostgreSQL wire protocol, meaning you could connect with psql, pgcli or pandas, but also with BI tools that "talk Postgres" like Tableau, Google Looker Studio, Metabase, etc.

(Side note: it wasn't actually a direct connection, but rather a pg wire protocol proxy we wrote that checked the query AST for functions we didn't want called (like pg_sleep), managed authentication, then proxied the queries to Timescale.)
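For flavour, here's a much-simplified sketch of that kind of check. The real proxy inspected a parsed AST; this illustration (the function name and the blocked list are assumptions, not the actual proxy code) just scans the query text for calls to blocked functions:

```python
import re

# Illustrative subset of functions a proxy might refuse to forward.
BLOCKED_FUNCTIONS = {"pg_sleep", "pg_read_file", "pg_terminate_backend"}


def query_allowed(sql: str) -> bool:
    """Simplified stand-in for the proxy's AST check.

    The real proxy parsed the query into an AST before inspecting it;
    this sketch just looks for `name(` tokens and rejects the query if
    any name is on the blocklist.
    """
    for name in re.findall(r"([a-z_][a-z0-9_]*)\s*\(", sql.lower()):
        if name in BLOCKED_FUNCTIONS:
            return False
    return True
```

A token scan like this is far too crude for production (it can't tell a function call from a string literal), which is exactly why an AST check is worth the extra effort.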

We've had to temporarily switch this off while we migrate to Fusionfire.

Instead we're allowing users to query their data with SQL using an HTTP API (data can be returned as Arrow IPC, JSON or CSV), see #405 — this should be available to use in the next few days.
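As a rough sketch of what querying over HTTP can look like, using only the Python standard library. The endpoint URL, query parameter name and header names below are illustrative guesses, not the documented interface; see #405 for the actual API.

```python
import urllib.parse
import urllib.request


def build_query_request(sql: str, token: str, accept: str = "application/json"):
    """Build (but don't send) a request to a SQL-over-HTTP endpoint.

    The URL, `sql` parameter and headers are assumptions for
    illustration; check the API docs referenced in #405 for the
    real interface.
    """
    params = urllib.parse.urlencode({"sql": sql})
    # Hypothetical endpoint, for illustration only:
    url = f"https://logfire-api.example.com/v1/query?{params}"
    return urllib.request.Request(
        url,
        headers={"Authorization": token, "Accept": accept},
    )


req = build_query_request("SELECT count(*) FROM records", "my-read-token")
# Sending it would then be: urllib.request.urlopen(req)
```

Setting `Accept` to `text/csv` (or an Arrow IPC content type) would be the natural way to select the response format in a design like this.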

We aim to reimplement the PG wire protocol connections with Fusionfire in a few months; the hardest part will be getting the information schemas to exactly match Postgres so the very complex schema-introspection queries run by BI tools and pgcli work correctly. If you need this feature urgently, please let us know.

@baggiponte

Well, congratulations first of all! (Though I'd call it logfusion 🪵⚛️)

> but there are still a few expressions which currently don't run correctly (a lot related to intervals)

@MarcoGorelli should have worked on a lot of these features for Polars: I don't know if he can contribute, but he's a bit of a Time(zone) lord.

@frankie567

This comment was marked as off-topic.

@samuelcolvin
Member Author

Thanks for reporting @frankie567, I've moved that to #433.

@alexmojaki
Contributor

We've been fully switched to the new database for a while now.

@sydney-runkle sydney-runkle unpinned this issue Oct 15, 2024