
Dataframe API v2 p2: MVP implementation #7560

Merged: 9 commits into main, Oct 2, 2024

Conversation

@teh-cmc (Member) commented on Oct 1, 2024

A first implementation of the new dataframe APIs.
The name is now rather misleading, though: there isn't anything dataframe-y left in here; it is a row-based iterator with Rerun semantics baked in, driven by a sorted streaming join.

It is rather slow (related: #7558 (comment)), lacks many features, and is full of edge cases, but it works.
It does support dedupe-latest semantics (slowly), view contents and selections, chunk overlaps, and pagination (horribly, by virtue of implementing Iterator).
It does not yet support Clears, latest-at sparse-filling, PoVs, or index sampling.

Upcoming PRs will be all about fixing these shortcomings one by one.

It should look somewhat familiar:

// Set up a query engine on top of the store and a fresh query cache.
let query_cache = QueryCache::new(store);
let query_engine = QueryEngine {
    store,
    cache: &query_cache,
};

// Query everything matching the entity path filter on the given timeline,
// with no per-entity component selection (`None`), restricted to a time range.
let mut query = QueryExpression2::new(timeline);
query.view_contents = Some(
    query_engine
        .iter_entity_paths(&entity_path_filter)
        .map(|entity_path| (entity_path, None))
        .collect(),
);
query.filtered_index_range = Some(ResolvedTimeRange::new(time_from, time_to));
eprintln!("{query:#?}:");

// Execute the query and paginate through the resulting row batches.
let query_handle = query_engine.query(query.clone());
// eprintln!("{:#?}", query_handle.selected_contents());
for batch in query_handle.into_batch_iter().skip(offset).take(len) {
    eprintln!("{batch}");
}
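
To make "driven by a sorted streaming join" concrete, here is a minimal, self-contained sketch of the mechanism, using toy types (ComponentStream and next_row are illustrative names under assumed simplifications, not the actual implementation):

use std::collections::BTreeMap;

// One per-component stream of (time, value) pairs, pre-sorted by time.
struct ComponentStream {
    name: &'static str,
    rows: std::vec::IntoIter<(i64, String)>,
    peeked: Option<(i64, String)>,
}

impl ComponentStream {
    fn peek(&mut self) -> Option<&(i64, String)> {
        if self.peeked.is_none() {
            self.peeked = self.rows.next();
        }
        self.peeked.as_ref()
    }
}

// Yields one joined row per index value: the smallest time any stream is
// currently sitting on, with the value of every stream that has data at
// exactly that time (other streams contribute nothing to this row).
fn next_row(streams: &mut [ComponentStream]) -> Option<(i64, BTreeMap<&'static str, String>)> {
    let cur_time = streams
        .iter_mut()
        .filter_map(|s| s.peek().map(|(t, _)| *t))
        .min()?;

    let mut row = BTreeMap::new();
    for s in streams.iter_mut() {
        if s.peek().map_or(false, |(t, _)| *t == cur_time) {
            let (_, v) = s.peeked.take().unwrap();
            row.insert(s.name, v);
        }
    }
    Some((cur_time, row))
}

Each call picks the smallest time across all per-component cursors and advances exactly the cursors that match it, which is also why pagination falls out of simply implementing Iterator (the skip/take above).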

No tests yet; we won't add any until we are sure these are the semantics we will commit to.

Checklist

  • I have read and agree to the Contributor Guide and the Code of Conduct
  • I've included a screenshot or gif (if applicable)
  • I have tested the web demo (if applicable):
  • The PR title and labels are set such as to maximize their usefulness for the next release's CHANGELOG
  • If applicable, add a new check to the release checklist!
  • I have noted any breaking changes to the log API in CHANGELOG.md and the migration guide

To run all checks from main, comment on the PR with @rerun-bot full-check.

@teh-cmc added the "🔍 re_query" (affects re_query itself), "do-not-merge" (Do not merge this PR), and "include in changelog" labels on Oct 1, 2024
@teh-cmc marked this pull request as ready for review on October 1, 2024 15:28
Comment on lines 460 to 483
// The next index value to emit is the smallest index value that any
// per-component streaming cursor is currently sitting on.
let cur_index_value = streaming_state_per_component
    .values()
    // NOTE: We're purposefully ignoring RowId-related semantics here: we just want to know
    // the value we're looking for on the "main" index (dedupe semantics).
    .min_by_key(|streaming_state| streaming_state.index_value)
    .map(|streaming_state| streaming_state.index_value)?;
Member commented:

Rather than doing this on every call to next_row, I suspect it might be clearer to structure this whole thing as a two-phase process.

First, work just with the Timeline data from every view-relevant chunk to materialize a new column of sorted, unique TimeInt values (as an added benefit, this is the same input you'll want to feed into sampled_index_values() anyway). This could still be done incrementally, batch-wise, by only looking at overlapping chunks within some horizon.

Then, once we have the ability to iterate over batches of TimeInts, we iterate through them incrementally and look for the matching values from the relevant chunks, as you're doing below; that then becomes a common code path between this implementation and sampled_index_values() (a rough sketch of this shape follows the list below).

Additionally, my gut is that having batches of unique TimeInts in advance sets us up nicely for some future optimizations.

  • It lets us fairly easily parallelize the per-selected-column work. Each worker can independently yield a sequence of rows matching the requested sequence of TimeInts.
  • It lets us look ahead to check for matching runs in the given columns. Any time we have a matching run in a range with a single column (happy path) we can directly yield a slice of multiple rows from our column-generator.
  • Similarly, null runs can be quickly identified and generated when the last TimeInt in the requested batch is less than the next available TimeInt for the column.
  • The aggregator consuming from each of the parallel columns generators can then yield RecordBatches based on overlapping row-runs from the separate columns, which means in the happy path of dense non-overlapping chunks we return to getting nice contiguous slices again.
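
A rough sketch of that two-phase shape, assuming toy types (materialize_index and align_column are hypothetical names; real code would operate chunk-wise over Arrow data rather than plain Vecs):

// Phase 1: materialize the sorted, deduplicated index column out of the
// per-chunk time columns. (Illustrative sketch, not the actual Rerun code;
// in practice this would be done incrementally over a horizon of chunks.)
fn materialize_index(per_chunk_times: &[Vec<i64>]) -> Vec<i64> {
    let mut index: Vec<i64> = per_chunk_times.iter().flatten().copied().collect();
    index.sort_unstable();
    index.dedup();
    index
}

// Phase 2: for one selected column (sorted by time, already deduped), walk the
// shared index and yield the value at each index entry, or None where the
// column has no data. Each column can be processed independently, which is
// what makes the per-column work easy to parallelize.
fn align_column(index: &[i64], column: &[(i64, String)]) -> Vec<Option<String>> {
    let mut cursor = column.iter().peekable();
    index
        .iter()
        .map(|&t| {
            // Skip column entries strictly before the current index value.
            while cursor.peek().map_or(false, |(ct, _)| *ct < t) {
                cursor.next();
            }
            // Yield the value on an exact match, a null otherwise.
            match cursor.peek() {
                Some((ct, v)) if *ct == t => Some(v.clone()),
                _ => None,
            }
        })
        .collect()
}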

Member Author (@teh-cmc) replied:

We can do these improvements in follow-up PRs; let's focus on landing all the semantics first.

Member replied:

Agreed -- not a requested change, just an observation about the structure to keep in mind as you refactor in the direction of supporting sampled_index_values().

Base automatically changed from cmc/dataframev2_1_api_def to main October 2, 2024 10:07
@teh-cmc removed the "do-not-merge" (Do not merge this PR) label on Oct 2, 2024
@teh-cmc (Member Author) commented on Oct 2, 2024

We've integrated all of this into @abey79's work-in-progress dataframe-view -- everything works semantics-wise.

Next steps (future PRs):

  • implement all missing features
  • make it fast
  • minimal testing

@jleibs (Member) left a review comment:

🚀

@teh-cmc merged commit aab3ed9 into main on Oct 2, 2024
33 of 34 checks passed
@teh-cmc deleted the cmc/dataframev2_2_api_impl branch on October 2, 2024 15:01
@teh-cmc changed the title from "Dataframe API v2 #2: MVP implementation" to "Dataframe API v2 p2: MVP implementation" on Oct 3, 2024