Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] Native Runner #3178

Merged
merged 25 commits into from
Nov 7, 2024
Merged

[FEAT] Native Runner #3178

merged 25 commits into from
Nov 7, 2024

Conversation

colin-ho
Copy link
Contributor

@colin-ho colin-ho commented Nov 4, 2024

Makes swordfish a top level runner, set_runner_native(). Additionally sets swordfish to be the default runner for development.

This PR also contains some bug fixes and test changes, of which I have left comments for.

Additionally, this PR refactors swordfish in two ways:

  1. Buffers scan tasks based on a num_parallel_tasks parameter, that takes into account any pushed down limits.
  2. Adds an is_err check on the sender in parts of the code where we have a while receiver.recv.await -> sender.send pattern, such that it breaks out of the loop if the sender is dropped. This is needed in cases when the consumer is done receiving data, such as in a Limit, or if the user is doing iter(df) and breaks out of the iter, which will cause receivers to be dropped. As such, the senders should recognize this and drop as well.

@github-actions github-actions bot added documentation Improvements or additions to documentation enhancement New feature or request labels Nov 4, 2024
Copy link

github-actions bot commented Nov 4, 2024

Copy link

codspeed-hq bot commented Nov 4, 2024

CodSpeed Performance Report

Merging #3178 will degrade performances by 60.07%

Comparing colin/native-runner (485132b) with main (2b71ffb)

Summary

⚡ 3 improvements
❌ 1 regressions
✅ 13 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark main colin/native-runner Change
test_count[1 Small File] 4.2 ms 3.4 ms +21.59%
test_count[100 Small Files] 129.6 ms 69.3 ms +87.03%
test_iter_rows_first_row[100 Small Files] 9,448.6 ms 268.9 ms ×35
test_show[100 Small Files] 16.2 ms 40.6 ms -60.07%

Copy link

codecov bot commented Nov 4, 2024

Codecov Report

Attention: Patch coverage is 80.45455% with 43 lines in your changes missing coverage. Please review.

Project coverage is 78.37%. Comparing base (2b71ffb) to head (485132b).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
daft/context.py 40.00% 18 Missing ⚠️
src/daft-local-execution/src/dispatcher.rs 40.00% 9 Missing ⚠️
src/daft-local-execution/src/sources/empty_scan.rs 57.14% 3 Missing ⚠️
src/daft-local-execution/src/sources/in_memory.rs 66.66% 3 Missing ⚠️
daft/runners/native_runner.py 95.74% 2 Missing ⚠️
...-execution/src/intermediate_ops/intermediate_op.rs 66.66% 2 Missing ⚠️
...c/daft-local-execution/src/sinks/streaming_sink.rs 75.00% 2 Missing ⚠️
daft/io/writer.py 0.00% 1 Missing ⚠️
src/common/daft-config/src/lib.rs 0.00% 1 Missing ⚠️
src/daft-local-execution/src/run.rs 66.66% 1 Missing ⚠️
... and 1 more
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3178      +/-   ##
==========================================
- Coverage   79.13%   78.37%   -0.77%     
==========================================
  Files         640      641       +1     
  Lines       77983    78938     +955     
==========================================
+ Hits        61715    61868     +153     
- Misses      16268    17070     +802     
Files with missing lines Coverage Δ
daft/runners/pyrunner.py 86.34% <100.00%> (-1.29%) ⬇️
daft/runners/ray_runner.py 81.01% <100.00%> (+0.06%) ⬆️
daft/runners/runner.py 78.57% <100.00%> (+2.57%) ⬆️
src/daft-local-execution/src/pipeline.rs 95.23% <100.00%> (+0.06%) ⬆️
src/daft-local-execution/src/sources/scan_task.rs 75.55% <100.00%> (+3.05%) ⬆️
src/daft-local-plan/src/plan.rs 96.38% <100.00%> (+0.02%) ⬆️
src/daft-local-plan/src/translate.rs 93.63% <100.00%> (+0.04%) ⬆️
daft/io/writer.py 0.00% <0.00%> (ø)
src/common/daft-config/src/lib.rs 82.14% <0.00%> (-5.36%) ⬇️
src/daft-local-execution/src/run.rs 87.96% <66.66%> (-0.58%) ⬇️
... and 8 more

... and 6 files with indirect coverage changes

@samster25
Copy link
Member

@colin-ho looks like this introduces a regression for the benchmark test_show[100 Small Files] Can you take a look?

Comment on lines +178 to +180
output_file = self.fs.open_output_stream(self.full_path)
return pacsv.CSVWriter(
self.full_path,
output_file,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pacsv writer doesn't have a filesystem argument for us to pass in a fs to, instead we have to use the open_output_stream function on the fs, and pass it into the writer.

Comment on lines 97 to 109
io_runtime.spawn(async move {
stream_scan_task(
scan_task.clone(),
io_stats.clone(),
delete_map.clone(),
));
}
while let Some(result) = task_set.join_next().await {
result.context(JoinSnafu)??;
}
Ok(())
},
self.name(),
);
delete_map.map(Arc::new),
maintain_order,
)
.await
})
}))
.buffered(self.num_parallel_tasks)
.map(|s| s?)
.try_flatten();
Copy link
Contributor Author

@colin-ho colin-ho Nov 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually a pretty important fix because previously all the scan tasks were spawned on the execution_runtime, which is single threaded. I tested TPCH with num_files = 2 x num_cpus, and this change makes it much faster. On that note, we should make our benchmarks test not just 1 or 2 files, but also > num_cpu files.

.await
})
}))
.buffered(self.num_parallel_tasks)
Copy link
Contributor Author

@colin-ho colin-ho Nov 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The codspeed test is still slower :( , still figuring out why.

@@ -27,16 +27,15 @@ def test_read_multi_parquet_from_s3_with_include_file_path_column(minio_io_confi
with minio_create_bucket(minio_io_config, bucket_name=bucket_name):
file_paths = (
daft.from_pydict(data)
.into_partitions(3)
.write_parquet(f"s3://{bucket_name}", io_config=minio_io_config)
.write_parquet(f"s3://{bucket_name}", partition_cols=["a"], io_config=minio_io_config)
Copy link
Contributor Author

@colin-ho colin-ho Nov 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using partition_cols instead of into_partitions. For swordfish.

@colin-ho colin-ho requested a review from samster25 November 6, 2024 06:18
@colin-ho colin-ho marked this pull request as ready for review November 6, 2024 06:18
@@ -810,6 +802,8 @@ jobs:
source activate
maturin develop
pytest --doctest-modules --continue-on-collection-errors daft/dataframe/dataframe.py daft/expressions/expressions.py daft/convert.py daft/udf.py
env:
DAFT_RUNNER: py
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc tests don't pass on swordfish because some of them use into_partitions.

daft/context.py Outdated Show resolved Hide resolved
@@ -129,12 +129,18 @@ impl StreamingSinkNode {
match result {
StreamingSinkOutput::NeedMoreInput(mp) => {
if let Some(mp) = mp {
let _ = output_sender.send(mp).await;
if output_sender.send(mp).await.is_err() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should instead match on the exact error like:

if let Err(ErrorType) = output_sender.send(mp).await

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an error struct, so there's no pattern to match. And it always signifies channel closure. https://docs.rs/tokio/latest/tokio/sync/mpsc/error/struct.SendError.html

src/daft-local-execution/src/sources/scan_task.rs Outdated Show resolved Hide resolved
@colin-ho colin-ho enabled auto-merge (squash) November 7, 2024 21:41
@colin-ho colin-ho merged commit 6e28b3f into main Nov 7, 2024
42 of 43 checks passed
@colin-ho colin-ho deleted the colin/native-runner branch November 7, 2024 21:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants