User Experience Punchlist #1072

Open
10 of 32 tasks
danking opened this issue Oct 17, 2024 · 0 comments
danking commented Oct 17, 2024

In Progress:

Short term:

  • Teach PyArray __getitem__, which just delegates to scalar_at.
  • Teach PyVortex to use parallelism during decompression.
  • Make reading a Vortex file into an Arrow Array as fast as Parquet (this at least partly needs to address the read/write disagreement on chunk size, see below; we sidestepped this by removing buffer sizes when implementing filter pushdown).
  • Make reading a Vortex file with Polars as fast as Parquet. See #1071, "Reading a Vortex array in Polars is slower than Parquet".
  • read should have high throughput on files written by write (currently, write does not enforce chunking whereas read does, which can degrade throughput on arrays for which slice is not free; again, sidestepped by removing buffer sizing from filter pushdown).
  • (docs) Write a comparison section which describes similarities and differences from other file formats.
  • Expose Vortex compute methods in Python API by way of new classes (e.g. Array.as_struct() and StructArray which permits column selection).
  • Teach Vortex IsNull & IsNotNull and plumb Substrait into them.
  • Do not expose modules with confusing names such as vortex.encoding.
  • Expose more functions on scalar values such as __eq__, array indexing, or getting a memoryview.
  • Teach PyVortex (really: Layout readers and writers) to read/write non-struct arrays.
  • Teach RecordBatchReader to read from multiple files.
  • Teach Polars to write Vortex files.
  • Teach DuckDB to write Vortex files.
  • Support multiple files and/or directories in the Vortex Dataset API.
  • Consider using the Rufo theme for the docs.
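
To illustrate the read/write chunk-size disagreement mentioned above, here is a minimal sketch in plain Python. It is not the Vortex API: rechunk is a hypothetical helper, and Python lists stand in for arrays. It regroups chunks as written into fixed-size read batches and counts how many times a written chunk has to be sliced, which is the cost that matters when slice is not free.

```python
from typing import List, Tuple


def rechunk(chunks: List[List[int]], size: int) -> Tuple[List[List[int]], int]:
    """Regroup `chunks` into batches of `size` rows (the last may be shorter).

    Returns the batches and the number of partial reads (slices) of a
    written chunk that were needed. Hypothetical helper for illustration.
    """
    batches: List[List[int]] = []
    pending: List[int] = []
    slice_count = 0
    for chunk in chunks:
        offset = 0
        while offset < len(chunk):
            # Take as many rows as fit in the current batch.
            take = min(size - len(pending), len(chunk) - offset)
            if take != len(chunk):
                # Only part of the written chunk is used: this is a slice.
                slice_count += 1
            pending.extend(chunk[offset:offset + take])
            offset += take
            if len(pending) == size:
                batches.append(pending)
                pending = []
    if pending:
        batches.append(pending)
    return batches, slice_count


# Writer chose chunk lengths 5, 3, 4; reader insists on batches of 4.
misaligned = [[0, 1, 2, 3, 4], [5, 6, 7], [8, 9, 10, 11]]
misaligned_batches, misaligned_slices = rechunk(misaligned, 4)

# Writer and reader agree on a chunk length of 4.
aligned = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
aligned_batches, aligned_slices = rechunk(aligned, 4)
```

Both inputs produce the same three batches, but only the misaligned layout needs slices. When the writer and reader agree on chunk boundaries, whole chunks pass through untouched.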

Long term:

  • For Torch, expose a method to read from a Vortex file directly into a mutable NumPy array. Torch does not support immutable NumPy arrays.
  • Reduce Vortex array metadata size. This primarily benefits very small datasets (e.g. PBI AirlineSentiment).
  • Implement a RecordBatchReader for Vortex arrays and Vortex files.
  • Implement a Pandas ExtensionArray which permits compute on the compressed array representations (thus avoiding the cost of decompression).
  • Integrate with DuckDB.
  • (docs) Finish specification and move into docs.
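
The Torch item above asks for reads that land in a mutable buffer. The following sketch shows the general idea using only the Python standard library, as an analogy: bytes and bytearray stand in for immutable and mutable arrays, and io.BytesIO stands in for a file. None of this is Vortex, NumPy, or Torch API.

```python
import io
import struct

# Stand-in for file contents: four little-endian int32 values.
payload = struct.pack("<4i", 10, 20, 30, 40)

# An immutable view, analogous to a read-only array returned by a reader.
readonly_view = memoryview(payload)
try:
    readonly_view[0] = 1
    write_failed = False
except TypeError:
    # Writing through a read-only buffer is rejected.
    write_failed = True

# A caller-owned, writable buffer that the reader fills in place,
# so no extra copy is needed to make the data mutable afterwards.
writable = bytearray(len(payload))
n_read = io.BytesIO(payload).readinto(writable)

# In-place mutation works because the destination buffer is writable.
struct.pack_into("<i", writable, 0, 99)
result = struct.unpack("<4i", writable)
```

The design point is that the caller allocates the destination and the reader writes into it, rather than the reader returning an immutable result that must then be copied.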

Complete:
