feat(DRAFT): Vendored vega_datasets demo #3631

Draft · wants to merge 31 commits into main

Conversation

@dangotbanned (Member) commented Oct 4, 2024

Related

Description

Early WIP

Providing a minimal, but up-to-date source for https://github.com/vega/vega-datasets

Notes

  • Investigating bundling metadata (22a5039), (1792340)
    • Depending on how well the compression scales, it might be reasonable to bundle this for some number of versions (see the sketch below)
    • Deliberately including redundant info early on; it can always be chipped away at later
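
As a rough, hypothetical way to check how that compression scales, the combined metadata for an increasing number of versions could be written to parquet and the on-disk sizes compared. The function below is only a sketch and assumes the per-version metadata is already available as polars DataFrames; neither the function name nor this workflow is from the PR itself:

```python
import os

import polars as pl


def bundled_size(frames: list[pl.DataFrame], path: str = "metadata.parquet") -> int:
    """Write the concatenated per-version metadata to parquet and return its size in bytes."""
    pl.concat(frames).write_parquet(path)
    return os.path.getsize(path)


# e.g. compare bundled_size(frames[:1]) against bundled_size(frames)
# to see how much each additional version really costs on disk.
```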

Outstanding issues

This is currently a low-priority item for me personally, so I'm keeping track of the longer-term issues to fix here

npm/vega-datasets does not have every version available at https://github.com/vega/vega-datasets/tags

Plan strategy for user-configurable dataset cache

  • Everything so far has been building the tools for a compact bundled index
    • 1, 2, 3, 4, 5
    • Refreshing the index would not be included in altair; each release would simply ship with the changes baked in
  • Trying to avoid bloating altair package size with datasets
  • User-facing
    • Goal of requesting each unique dataset version only once
      • The user cache would not need to be updated between altair versions
    • Some kind of opt-in config to say "store the datasets in this directory, please"
      • A basic solution would be defining an env variable like ALTAIR_DATASETS_DIR (see the sketch after this list)
      • When not provided, always perform remote requests
        • The motivation for users to enable caching is that it would be faster
  • There may be opportunities to reduce the cache footprint further
    • e.g. storing the .(csv|tsv|json) files as .parquet
  • Need to do more testing on this, though, to ensure that
    • the shape of each dataset is preserved
    • where relevant, intentional errors remain intact
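
A minimal sketch of the opt-in cache described above, using only the standard library. `ALTAIR_DATASETS_DIR` is taken from the description; the CDN URL, version pin, and function name are placeholders made up for illustration, not what this PR implements:

```python
from __future__ import annotations

import os
import urllib.request
from pathlib import Path

# Hypothetical: pin to a single vega-datasets release so every altair release
# ships with a fixed set of dataset versions baked in.
BASE_URL = "https://cdn.jsdelivr.net/npm/vega-datasets@2.9.0/data/"


def load_dataset(file_name: str) -> bytes:
    """Return the raw bytes for ``file_name`` (e.g. ``"cars.json"``).

    When ``ALTAIR_DATASETS_DIR`` is set, responses are cached there so each
    unique dataset version is requested at most once; otherwise every call
    performs a remote request.
    """
    cache_dir = os.environ.get("ALTAIR_DATASETS_DIR")
    target = Path(cache_dir, file_name) if cache_dir else None

    if target is not None and target.exists():
        return target.read_bytes()

    with urllib.request.urlopen(BASE_URL + file_name) as response:
        content = response.read()

    if target is not None:
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
    return content
```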
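
Separately, a rough sketch of the `.csv` → `.parquet` idea, assuming polars is available. The file names are invented, and the assertion is the kind of shape-preservation check mentioned above rather than anything this PR ships:

```python
import polars as pl

df = pl.read_csv("cars.csv")        # as downloaded
df.write_parquet("cars.parquet")    # what the cache would store

round_tripped = pl.read_parquet("cars.parquet")
# The smaller cache is only viable if the conversion is lossless in the ways that matter.
assert round_tripped.shape == df.shape
```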

  • Authentication is not required for these requests, but may be helpful to avoid rate limits
  • As an example, for comparing against the most recent version, I've added the 5 most recent
    • Basic mechanism for discovering new versions
    • Tries to minimise the number and total size of requests
  • Experimenting with querying the url cache w/ expressions
    • `metadata_full.parquet` stores **all known** file metadata
      • Roughly 3000 rows
      • Single release: **9kb** vs 46 releases: **21kb**
    • `GitHub.refresh()` maintains integrity in a safe manner
    • Still undecided exactly how this functionality should work
    • Need to resolve the `npm` tags != `gh` tags issue as well
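
To make "querying the url cache w/ expressions" concrete, here is a sketch of what such a query might look like with polars. The column names (`dataset_name`, `tag`, `url`) are guesses for illustration only, not the actual schema of `metadata_full.parquet`:

```python
import polars as pl

metadata = pl.read_parquet("metadata_full.parquet")

# Hypothetical query: the download URL for "cars" at the most recent tag.
url = (
    metadata.filter(pl.col("dataset_name") == "cars")
    .sort("tag", descending=True)
    .select("url")
    .head(1)
    .item()
)
```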