Refactor the internal _load_remote_dataset function to simplify datasets' definitions #2917

seisman · 2023-12-26T16:35:44Z

Each GMT remote dataset currently has to be defined like this:

    "earth_age": GMTRemoteDataset(
        title="seafloor age",
        name="seafloor_age",
        long_name="age of seafloor crust",
        units="Myr",
        extra_attributes={"horizontal_datum": "WGS84"},
        resolutions={
            "01d": Resolution(["gridline", "pixel"], False),
            "30m": Resolution(["gridline", "pixel"], False),
            "20m": Resolution(["gridline", "pixel"], False),
            "15m": Resolution(["gridline", "pixel"], False),
            "10m": Resolution(["gridline", "pixel"], False),
            "06m": Resolution(["gridline", "pixel"], False),
            "05m": Resolution(["gridline", "pixel"], True),
            "04m": Resolution(["gridline", "pixel"], True),
            "03m": Resolution(["gridline", "pixel"], True),
            "02m": Resolution(["gridline", "pixel"], True),
            "01m": Resolution(["gridline"], True),
        },
    ),

The resolutions property is a dictionary that maps each available resolution to its supported registrations and tile information. As you can see, ["gridline", "pixel"] is repeated many times, which makes the dictionary tedious to maintain.

Since most resolutions support both "gridline" and "pixel" registrations, it makes sense to give Resolution.registrations a default value of ["gridline", "pixel"]. Similarly, Resolution.tiled can default to False.
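As a rough sketch of what such defaults could look like, assuming Resolution is a typing.NamedTuple with exactly the two fields named above (the field types are a guess, not taken from the PR diff):

```python
from typing import NamedTuple


class Resolution(NamedTuple):
    """Registrations and tile information for one resolution (illustrative sketch)."""

    # Assumed default: most resolutions support both registrations.
    # The default list is shared between instances, so it should not be mutated.
    registrations: list[str] = ["gridline", "pixel"]
    # Assumed default: most resolutions are not tiled.
    tiled: bool = False


# Entries only need to spell out what differs from the defaults.
default_res = Resolution()
tiled_res = Resolution(tiled=True)
gridline_only = Resolution(registrations=["gridline"], tiled=True)

print(default_res)
print(tiled_res)
print(gridline_only)
```

With defaults like these, only the tiled resolutions and the gridline-only "01m" entry need explicit arguments.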

Then, the dataset definition can be written as:

    "earth_age": GMTRemoteDataset(
        title="seafloor age",
        name="seafloor_age",
        long_name="age of seafloor crust",
        units="Myr",
        extra_attributes={"horizontal_datum": "WGS84"},
        resolutions={
            "01d": Resolution(),
            "30m": Resolution(),
            "20m": Resolution(),
            "15m": Resolution(),
            "10m": Resolution(),
            "06m": Resolution(),
            "05m": Resolution(tiled=True),
            "04m": Resolution(tiled=True),
            "03m": Resolution(tiled=True),
            "02m": Resolution(tiled=True),
            "01m": Resolution(registrations=["gridline"], tiled=True),
        },
    ),

The new dataset definition looks more compact, but an entry like "01d": Resolution() looks odd, since it may be unclear what Resolution() means.

Thus, I added a new code property to the Resolution class, so that a Resolution object can be defined like:

    Resolution(code="01d", registrations=["gridline", "pixel"], tiled=False)

or, in the shortest form:

    Resolution("01d")

After the above changes, entries like "01d": Resolution("01d") still look odd, since the resolution code (e.g., 01d) appears twice. Thus, I changed resolutions from a dict to a list.

Here is the final version of the dataset definition:

    "earth_age": GMTRemoteDataset(
        title="seafloor age",
        name="seafloor_age",
        long_name="age of seafloor crust",
        units="Myr",
        extra_attributes={"horizontal_datum": "WGS84"},
        resolutions=[
            Resolution("01d"),
            Resolution("30m"),
            Resolution("20m"),
            Resolution("15m"),
            Resolution("10m"),
            Resolution("06m"),
            Resolution("05m", tiled=True),
            Resolution("04m", tiled=True),
            Resolution("03m", tiled=True),
            Resolution("02m", tiled=True),
            Resolution("01m", registrations=["gridline"], tiled=True),
        ],
    ),

which I think is more compact and more readable.


weiji14 · 2024-01-07T07:40:27Z

pygmt/datasets/load_remote_dataset.py

-    if resolution not in dataset.resolutions:
+    for res in dataset.resolutions:
+        if res.code == resolution:
+            valid_registrations = res.registrations
+            is_tiled = res.tiled
+            break
+    else:
        raise GMTInvalidInput(f"Invalid resolution '{resolution}'.")


This for-loop seems a lot less efficient than the previous if resolution not in dataset.resolutions dictionary-based lookup. I almost think we should go with the intermediate {"01d": Resolution(), ...} dictionary you mentioned at #2917 (comment), or come up with a better data structure than a list of Resolution NamedTuples. I almost feel like the datasets variable could be a nested JSON or a multi-level pandas.DataFrame object (but maybe that's overkill).

Alternatively, we could just keep the datasets variable as is, but bring in all the type hint stuff. The remote datasets aren't really updated that often, though I know you're working on those Moon/Venus/Mars PRs.

> This for-loop seems a lot more inefficient compared to the previous if resolution not in dataset.resolutions dictionary-based lookup.

Yes, it's almost 10 times slower.
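A comparison like this can be reproduced roughly along the following lines. This is only a sketch, not the benchmark run in this PR: the Resolution class is a stand-in NamedTuple, all entries use default flags, ValueError replaces GMTInvalidInput to keep the example self-contained, and find_in_list / find_in_dict are hypothetical helper names. The measured ratio will vary by machine and Python version.

```python
import timeit
from typing import NamedTuple


class Resolution(NamedTuple):
    """Stand-in for the Resolution class discussed in this PR (assumed fields)."""

    code: str
    registrations: list[str] = ["gridline", "pixel"]
    tiled: bool = False


codes = ["01d", "30m", "20m", "15m", "10m", "06m", "05m", "04m", "03m", "02m", "01m"]
as_list = [Resolution(code) for code in codes]
as_dict = {res.code: res for res in as_list}


def find_in_list(resolutions, code):
    """Linear scan with for/else, mirroring the structure of the diff above."""
    for res in resolutions:
        if res.code == code:
            found = res
            break
    else:  # the loop finished without a break: no entry matched
        raise ValueError(f"Invalid resolution '{code}'.")
    return found


def find_in_dict(resolutions, code):
    """Hash-based lookup, mirroring the original membership check."""
    if code not in resolutions:
        raise ValueError(f"Invalid resolution '{code}'.")
    return resolutions[code]


# "01m" is the last entry, i.e. the worst case for the linear scan.
t_list = timeit.timeit(lambda: find_in_list(as_list, "01m"), number=100_000)
t_dict = timeit.timeit(lambda: find_in_dict(as_dict, "01m"), number=100_000)
print(f"list scan: {t_list:.4f} s, dict lookup: {t_dict:.4f} s")
```

With only about a dozen resolutions per dataset, the absolute cost of either lookup is small; the difference matters mostly as a question of which data structure reads more naturally.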

> I almost think we should go with the intermediate {"01d": Resolution(), ...} dictionary you mentioned at #2917 (comment), or come up with a better data structure than a list of Resolution NamedTuples. I almost feel like that datasets variable could be a nested JSON or a multi-level pandas.DataFrame object (but maybe that's overkill).

What about a dictionary:

    {
        "01d": Resolution("01d"),
    }

It's more clear than:

    {
        "01d": Resolution(),
    }

> What about a dictionary:
>
>     {
>         "01d": Resolution("01d"),
>     }

Yes, that could work.
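One consequence of this form is that the code appears both as the dictionary key and inside the value, so the two could in principle drift apart. A minimal sketch of a consistency check, assuming Resolution is a NamedTuple with a code field (check_resolution_codes is a hypothetical helper, not part of PyGMT):

```python
from typing import NamedTuple


class Resolution(NamedTuple):
    """Stand-in for the Resolution class discussed in this PR (assumed fields)."""

    code: str
    registrations: list[str] = ["gridline", "pixel"]
    tiled: bool = False


# A shortened version of the "earth_age" resolutions, keyed by resolution code.
resolutions = {
    "01d": Resolution("01d"),
    "05m": Resolution("05m", tiled=True),
    "01m": Resolution("01m", registrations=["gridline"], tiled=True),
}


def check_resolution_codes(resolutions):
    """Return the keys whose Resolution.code does not match the key itself."""
    return [key for key, res in resolutions.items() if key != res.code]


mismatched = check_resolution_codes(resolutions)
print(mismatched)  # prints [] because every key matches its Resolution.code

# Lookup stays a plain dictionary access.
print(resolutions["05m"].tiled)  # prints True
```

A check like this could live in a unit test, so that a mistyped key or code is caught when a new dataset is added.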

Refactor the internal _load_remote_dataset function to simplify the d…

6eee89e

…ataset definitions

seisman added the maintenance and needs review labels Dec 26, 2023

seisman added 4 commits December 27, 2023 13:03

Update all datasets

dc70127

Add type hints

9b1f6f0

Fix

165cfbe

Merge branch 'main' into refactor/load_remote_dataset

01df476

weiji14 mentioned this pull request Dec 27, 2023

Mark unit tests with @pytest.mark.benchmark part 2 #2924

Merged

7 tasks

Fix

567a6d4

seisman added this to the 0.11.0 milestone Jan 1, 2024

seisman added 2 commits January 2, 2024 21:03

Merge branch 'main' into refactor/load_remote_dataset

f6c6905

Rewrap docstrings

704c029

seisman changed the title ~~POC: Refactor the internal _load_remote_dataset function to simplify datasets' definitions~~ Refactor the internal _load_remote_dataset function to simplify datasets' definitions Jan 2, 2024

seisman marked this pull request as ready for review January 2, 2024 13:08

seisman added 4 commits January 4, 2024 09:52

Merge branch 'main' into refactor/load_remote_dataset

91d00fb

Merge branch 'main' into refactor/load_remote_dataset

d353795

Merge branch 'main' into refactor/load_remote_dataset

a94fe5b

Temporarily run benchmarks in this PR

e5d43c4

seisman commented Jan 7, 2024

.github/workflows/benchmarks.yml (outdated)

weiji14 reviewed Jan 7, 2024

seisman added 6 commits January 7, 2024 17:20

Revert resolutions to a dict of Resolution object

eaab9ee

Merge branch 'main' into refactor/load_remote_dataset

666e7ea

Improve description of resolution code

2a784f6

Fix the description of resolutions

f9f09f3

Merge branch 'main' into refactor/load_remote_dataset

a32a201

Merge branch 'main' into refactor/load_remote_dataset

32f5af7

seisman requested a review from a team January 12, 2024 07:02

seisman added 2 commits January 12, 2024 15:02

Merge branch 'main' into refactor/load_remote_dataset

7e4d454

Merge branch 'main' into refactor/load_remote_dataset

2498217

seisman merged commit 0a3b46d into main Jan 14, 2024
8 of 16 checks passed

seisman deleted the refactor/load_remote_dataset branch January 14, 2024 08:43

seisman removed the needs review label Jan 14, 2024
