[FEAT] Native Runner #3178
Conversation
CodSpeed Performance Report
Merging #3178 will degrade performance by 60.07%.
Benchmarks breakdown
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #3178 +/- ##
==========================================
- Coverage 79.13% 78.37% -0.77%
==========================================
Files 640 641 +1
Lines 77983 78938 +955
==========================================
+ Hits 61715 61868 +153
- Misses 16268 17070 +802
@colin-ho looks like this introduces a regression for the benchmark
+ output_file = self.fs.open_output_stream(self.full_path)
  return pacsv.CSVWriter(
-     self.full_path,
+     output_file,
The pacsv writer doesn't have a `filesystem` argument for us to pass a fs to; instead we have to open a stream with the fs's `open_output_stream` function and pass that stream into the writer.
io_runtime.spawn(async move {
    stream_scan_task(
        scan_task.clone(),
        io_stats.clone(),
        delete_map.clone(),
    ));
}
while let Some(result) = task_set.join_next().await {
    result.context(JoinSnafu)??;
}
Ok(())
},
self.name(),
);
delete_map.map(Arc::new),
maintain_order,
)
.await
})
}))
.buffered(self.num_parallel_tasks)
.map(|s| s?)
.try_flatten();
This is actually a pretty important fix because previously all the scan tasks were spawned on the execution_runtime, which is single threaded. I tested TPCH with num_files = 2 x num_cpus, and this change makes it much faster. On that note, we should make our benchmarks test not just 1 or 2 files, but also > num_cpu files.
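The fan-out plus `.buffered(num_parallel_tasks)` pattern can be sketched in Python with asyncio (a rough analogy only; the actual implementation is the Rust stream code above): at most `num_parallel_tasks` scan tasks run concurrently, no matter how many files there are.

```python
import asyncio

async def stream_scan_task(task_id: int) -> int:
    # Stand-in for reading one file; hypothetical work.
    await asyncio.sleep(0.01)
    return task_id

async def run_all(num_tasks: int, num_parallel_tasks: int) -> list:
    sem = asyncio.Semaphore(num_parallel_tasks)

    async def bounded(i: int) -> int:
        async with sem:  # at most num_parallel_tasks in flight at once
            return await stream_scan_task(i)

    return await asyncio.gather(*(bounded(i) for i in range(num_tasks)))

# e.g. 8 "files" but only 3 scanned concurrently
results = asyncio.run(run_all(num_tasks=8, num_parallel_tasks=3))
```

This also illustrates why benchmarks should include the `num_files > num_cpus` case: the scheduling behavior only shows up once tasks outnumber available parallelism.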
.await
})
}))
.buffered(self.num_parallel_tasks)
The CodSpeed test is still slower :(, still figuring out why.
@@ -27,16 +27,15 @@ def test_read_multi_parquet_from_s3_with_include_file_path_column(minio_io_confi
    with minio_create_bucket(minio_io_config, bucket_name=bucket_name):
        file_paths = (
            daft.from_pydict(data)
-            .into_partitions(3)
-            .write_parquet(f"s3://{bucket_name}", io_config=minio_io_config)
+            .write_parquet(f"s3://{bucket_name}", partition_cols=["a"], io_config=minio_io_config)
Using `partition_cols` instead of `into_partitions`, for swordfish.
@@ -810,6 +802,8 @@ jobs:
        source activate
        maturin develop
        pytest --doctest-modules --continue-on-collection-errors daft/dataframe/dataframe.py daft/expressions/expressions.py daft/convert.py daft/udf.py
+      env:
+        DAFT_RUNNER: py
Doc tests don't pass on swordfish because some of them use `into_partitions`.
@@ -129,12 +129,18 @@ impl StreamingSinkNode {
    match result {
        StreamingSinkOutput::NeedMoreInput(mp) => {
            if let Some(mp) = mp {
-                let _ = output_sender.send(mp).await;
+                if output_sender.send(mp).await.is_err() {
We should instead match on the exact error, like:
`if let Err(ErrorType) = output_sender.send(mp).await`
It's an error struct, so there's no pattern to match. And it always signifies channel closure. https://docs.rs/tokio/latest/tokio/sync/mpsc/error/struct.SendError.html
Makes swordfish a top level runner, via `set_runner_native()`. Additionally sets swordfish to be the default runner for development. This PR also contains some bug fixes and test changes, which I have left comments on.

Additionally, this PR refactors swordfish in two ways:
- Scan tasks are buffered via a `num_parallel_tasks` parameter that takes into account any pushed down limits.
- Adds an `is_err` check on the sender in parts of the code where we have a `while receiver.recv().await -> sender.send()` pattern, so that the loop breaks if the send fails. This is needed in cases when the consumer is done receiving data, such as in a Limit, or if the user is doing `iter(df)` and breaks out of the iteration, which will cause receivers to be dropped. As such, the senders should recognize this and drop as well.
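The receiver-drop behavior has a rough Python analogue (an analogy only; swordfish's channels are tokio mpsc channels in Rust): when the consumer breaks out of iteration early, the producer gets notified and stops producing, rather than running to completion for no one.

```python
def producer(n: int, produced: list):
    try:
        for i in range(n):
            produced.append(i)
            yield i  # "send" to the consumer
    except GeneratorExit:
        # The consumer stopped pulling (like a dropped receiver);
        # stop producing instead of looping to n.
        raise

produced = []
gen = producer(1_000_000, produced)
consumed = []
for v in gen:
    consumed.append(v)
    if len(consumed) == 3:
        break  # like iter(df) with an early break
gen.close()  # raises GeneratorExit inside producer at the suspended yield
```

In swordfish the equivalent signal is `sender.send(...).await` returning `Err`, which the worker now treats as "consumer is gone, shut down".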