feat(DRAFT): Vendored vega_datasets demo #3631

Draft · wants to merge 31 commits into main

Conversation

@dangotbanned (Member) commented Oct 4, 2024

Related

Description

Early WIP

Providing a minimal, but up-to-date source for https://github.com/vega/vega-datasets

Notes

  • Investigating bundling metadata (22a5039), (1792340)
    • Depending on how well the compression scales, it might be reasonable to bundle this for some number of versions (see the sketch below)
    • Deliberately including redundant info early on; it can always be chipped away at later
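
As a rough, hypothetical way to check how that compression scales, the combined metadata for an increasing number of versions could be written to parquet and the on-disk sizes compared. The function below is only a sketch and assumes the per-version metadata is already available as polars DataFrames; neither the function name nor this workflow is from the PR itself:

```python
import os

import polars as pl


def bundled_size(frames: list[pl.DataFrame], path: str = "metadata.parquet") -> int:
    """Write the concatenated per-version metadata to parquet and return its size in bytes."""
    pl.concat(frames).write_parquet(path)
    return os.path.getsize(path)


# e.g. compare bundled_size(frames[:1]) against bundled_size(frames)
# to see how much each additional version really costs on disk.
```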

Outstanding issues

This is currently a low-priority item for me personally, so I'm keeping track of the longer-term issues to fix here

npm/vega-datasets does not have every version available at https://github.com/vega/vega-datasets/tags

Plan strategy for user-configurable dataset cache

  • Everything so far has been building the tools for a compact bundled index
    • 1, 2, 3, 4, 5
    • Refreshing the index would not be included in altair; each release would simply ship with the changes baked in
  • Trying to avoid bloating altair package size with datasets
  • User-facing
    • Goal of requesting each unique dataset version only once
      • The user cache would not need to be updated between altair versions
    • Some kind of opt-in config to say "store the datasets in this directory, please"
      • A basic solution would be defining an env variable like ALTAIR_DATASETS_DIR (see the sketch after this list)
      • When not provided, always perform remote requests
        • The motivation for users to enable caching is that it would be faster
  • There may be opportunities to reduce the cache footprint further
    • e.g. storing the .(csv|tsv|json) files as .parquet
  • Need to do more testing on this, though, to ensure that
    • the shape of each dataset is preserved
    • where relevant, intentional errors remain intact
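
A minimal sketch of the opt-in cache described above, using only the standard library. `ALTAIR_DATASETS_DIR` is taken from the description; the CDN URL, version pin, and function name are placeholders made up for illustration, not what this PR implements:

```python
from __future__ import annotations

import os
import urllib.request
from pathlib import Path

# Hypothetical: pin to a single vega-datasets release so every altair release
# ships with a fixed set of dataset versions baked in.
BASE_URL = "https://cdn.jsdelivr.net/npm/vega-datasets@2.9.0/data/"


def load_dataset(file_name: str) -> bytes:
    """Return the raw bytes for ``file_name`` (e.g. ``"cars.json"``).

    When ``ALTAIR_DATASETS_DIR`` is set, responses are cached there so each
    unique dataset version is requested at most once; otherwise every call
    performs a remote request.
    """
    cache_dir = os.environ.get("ALTAIR_DATASETS_DIR")
    target = Path(cache_dir, file_name) if cache_dir else None

    if target is not None and target.exists():
        return target.read_bytes()

    with urllib.request.urlopen(BASE_URL + file_name) as response:
        content = response.read()

    if target is not None:
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
    return content
```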
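
Separately, a rough sketch of the `.csv` → `.parquet` idea, assuming polars is available. The file names are invented, and the assertion is the kind of shape-preservation check mentioned above rather than anything this PR ships:

```python
import polars as pl

df = pl.read_csv("cars.csv")        # as downloaded
df.write_parquet("cars.parquet")    # what the cache would store

round_tripped = pl.read_parquet("cars.parquet")
# The smaller cache is only viable if the conversion is lossless in the ways that matter.
assert round_tripped.shape == df.shape
```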

  • Authentication is not required for these requests, but may be helpful to avoid rate limits
  • As an example, for comparing against the most recent version, I've added the 5 most recent
    • Basic mechanism for discovering new versions
    • Tries to minimise the number and total size of requests
  • Experimenting with querying the url cache w/ expressions
    • `metadata_full.parquet` stores **all known** file metadata
      • Roughly 3000 rows
      • Single release: **9kb** vs 46 releases: **21kb**
    • `GitHub.refresh()` maintains integrity in a safe manner
    • Still undecided exactly how this functionality should work
    • Need to resolve the `npm` tags != `gh` tags issue as well
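
To make "querying the url cache w/ expressions" concrete, here is a sketch of what such a query might look like with polars. The column names (`dataset_name`, `tag`, `url`) are guesses for illustration only, not the actual schema of `metadata_full.parquet`:

```python
import polars as pl

metadata = pl.read_parquet("metadata_full.parquet")

# Hypothetical query: the download URL for "cars" at the most recent tag.
url = (
    metadata.filter(pl.col("dataset_name") == "cars")
    .sort("tag", descending=True)
    .select("url")
    .head(1)
    .item()
)
```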