Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] [CSV Reader] Bulk CSV reader + general CSV reader refactor #1614

Merged
merged 9 commits into from
Nov 21, 2023

Conversation

clarkzinzow
Copy link
Contributor

@clarkzinzow clarkzinzow commented Nov 16, 2023

This PR adds support for bulk CSV reading to the native CSV reader, and integrates bulk CSV reading with MicroPartition as the default reading path.

Driveby Refactors

  • Consolidated the execution-side configuration of CSV reading into 3 config classes: CsvConvertOptions, CsvParseOptions, and CsvReadOptions. This reduces the bloat of our execution-side code (and tests) by a good bit, and providing config objects that are transparently passed through the execution layer should make it easier to add more CSV configuration options in the future (less code to change). Note that these are currently not exposed to the query plan (logical or physical), although these might be moved into the FileFormatConfig enum once the old Table-based I/O path is removed.
  • CSV reading has been refactored into a pipeline of stream transformations.
  • A lot of repeated intermediate CSV reading code has been eliminated by making the Arrow2 CSV reader a trait object.
  • Stats gathered during schema inference are now bundled into a CsvReadStats struct.

@github-actions github-actions bot added the enhancement New feature or request label Nov 16, 2023
Copy link

codecov bot commented Nov 16, 2023

Codecov Report

Merging #1614 (16435d0) into main (06c2ccf) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1614      +/-   ##
==========================================
+ Coverage   85.00%   85.01%   +0.01%     
==========================================
  Files          55       55              
  Lines        5287     5291       +4     
==========================================
+ Hits         4494     4498       +4     
  Misses        793      793              
Files Coverage Δ
daft/logical/schema.py 91.50% <100.00%> (+0.08%) ⬆️
daft/table/micropartition.py 89.58% <100.00%> (ø)
daft/table/schema_inference.py 98.14% <100.00%> (ø)
daft/table/table.py 84.01% <100.00%> (ø)
daft/table/table_io.py 95.51% <100.00%> (+0.08%) ⬆️

src/daft-csv/src/lib.rs Show resolved Hide resolved
src/daft-csv/src/metadata.rs Outdated Show resolved Hide resolved
)
}

#[getter]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use the pyo3 get attribute instead i believe
https://pyo3.rs/v0.20.0/class?search=get

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You unfortunately cannot use the field-wise get attribute within the context of a cfg_attr on the struct, and we can't use the get_all workaround since some of the getters aren't trivial, so we need to explicitly define getters for each field.

PyO3/pyo3#1003

src/daft-csv/src/read.rs Show resolved Hide resolved
src/daft-csv/src/read.rs Show resolved Hide resolved
src/daft-csv/src/read.rs Outdated Show resolved Hide resolved
where
R: AsyncRead + Unpin + Send,
R: AsyncRead + Unpin + Send + 'static,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can try R: AsyncReader instead to drop the 'static requirement

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AsyncReader is a struct, not a trait, and we're bounding the async-readable type that the AsyncReader is instantiated with. Although I might be misunderstanding your suggestion here. https://docs.rs/arrow2/latest/arrow2/io/csv/read_async/struct.AsyncReader.html

We need the R: AsyncRead bound on the inner async-readable for constructing the AsyncReader (link) and for reading from the reader with read_rows (link).

src/daft-csv/src/read.rs Show resolved Hide resolved
src/daft-micropartition/src/micropartition.rs Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
@clarkzinzow clarkzinzow force-pushed the clark/micropartition-read-csv branch from 4838c9c to 89765af Compare November 21, 2023 20:31
@clarkzinzow clarkzinzow merged commit f289da1 into main Nov 21, 2023
39 checks passed
@clarkzinzow clarkzinzow deleted the clark/micropartition-read-csv branch November 21, 2023 22:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants