Refactor the internal _load_remote_dataset function to simplify datasets' definitions #2917

Merged
merged 20 commits into from
Jan 14, 2024

seisman
Member

@seisman seisman commented Dec 26, 2023

Each GMT remote dataset currently has to be defined like this:

    "earth_age": GMTRemoteDataset(
        title="seafloor age",
        name="seafloor_age",
        long_name="age of seafloor crust",
        units="Myr",
        extra_attributes={"horizontal_datum": "WGS84"},
        resolutions={
            "01d": Resolution(["gridline", "pixel"], False),
            "30m": Resolution(["gridline", "pixel"], False),
            "20m": Resolution(["gridline", "pixel"], False),
            "15m": Resolution(["gridline", "pixel"], False),
            "10m": Resolution(["gridline", "pixel"], False),
            "06m": Resolution(["gridline", "pixel"], False),
            "05m": Resolution(["gridline", "pixel"], True),
            "04m": Resolution(["gridline", "pixel"], True),
            "03m": Resolution(["gridline", "pixel"], True),
            "02m": Resolution(["gridline", "pixel"], True),
            "01m": Resolution(["gridline"], True),
        },
    ),

The resolutions property is a dictionary that maps each available resolution to its supported registrations and a flag indicating whether the grid is tiled. As you can see, ["gridline", "pixel"] is repeated on almost every line, which makes such a dictionary tedious to maintain.

Since most resolutions support both "gridline" and "pixel" registrations, it makes sense for Resolution.registrations to default to ["gridline", "pixel"]. Similarly, Resolution.tiled can default to False.
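As a rough illustration of the default values described above, here is a minimal sketch that assumes Resolution is a typing.NamedTuple. The class below is a stand-in, not the actual PyGMT definition, and it uses tuples instead of lists for registrations (an assumption made here so that the default value cannot be mutated by accident).

```python
from typing import NamedTuple


class Resolution(NamedTuple):
    """Illustrative stand-in: registrations and tile flag for one resolution."""

    # Default covers the common case where both registrations are available.
    registrations: tuple[str, ...] = ("gridline", "pixel")
    # Default covers the common case where the grid is not tiled.
    tiled: bool = False


# Only the values that differ from the defaults need to be spelled out.
resolutions = {
    "01d": Resolution(),
    "05m": Resolution(tiled=True),
    "01m": Resolution(registrations=("gridline",), tiled=True),
}

for code, res in resolutions.items():
    print(code, res.registrations, res.tiled)
```

With defaults in place, an entry only mentions what is unusual about it, which is the point of the proposal.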

Then, the dataset definition can be written as:

    "earth_age": GMTRemoteDataset(
        title="seafloor age",
        name="seafloor_age",
        long_name="age of seafloor crust",
        units="Myr",
        extra_attributes={"horizontal_datum": "WGS84"},
        resolutions={
            "01d": Resolution(),
            "30m": Resolution(),
            "20m": Resolution(),
            "15m": Resolution(),
            "10m": Resolution(),
            "06m": Resolution(),
            "05m": Resolution(tiled=True),
            "04m": Resolution(tiled=True),
            "03m": Resolution(tiled=True),
            "02m": Resolution(tiled=True),
            "01m": Resolution(registrations=["gridline"], tiled=True),
        },
    ),

The new dataset definition is more compact, but an entry like "01d": Resolution() looks odd, since it is unclear what Resolution() means on its own.

Thus, I added a new code property to the Resolution class, so that a Resolution object can be defined like:

    Resolution(code="01d", registrations=["gridline", "pixel"], tiled=False)

or, in the shortest form:

    Resolution("01d")
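To make the two spellings concrete, the sketch below assumes the same NamedTuple-based stand-in class with a required code field placed before the defaulted fields (a NamedTuple requires fields without defaults to come first). The tuple-valued registrations are again an assumption of this sketch, not the PR's actual code.

```python
from typing import NamedTuple


class Resolution(NamedTuple):
    """Illustrative stand-in with an explicit resolution code."""

    code: str  # required, e.g. "01d"
    registrations: tuple[str, ...] = ("gridline", "pixel")
    tiled: bool = False


# The fully spelled-out form and the shortest form describe the same object.
full = Resolution(code="01d", registrations=("gridline", "pixel"), tiled=False)
short = Resolution("01d")
print(full == short)

# A list of such objects is self-describing: no separate dictionary key is needed.
resolutions = [Resolution("01d"), Resolution("05m", tiled=True)]
print([res.code for res in resolutions])
```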

After the above changes, entries like "01d": Resolution("01d") still look odd, since the resolution code (e.g., 01d) appears twice. Thus, I changed resolutions from a dict to a list.

Here is the final version of the dataset definition:

    "earth_age": GMTRemoteDataset(
        title="seafloor age",
        name="seafloor_age",
        long_name="age of seafloor crust",
        units="Myr",
        extra_attributes={"horizontal_datum": "WGS84"},
        resolutions=[
            Resolution("01d"),
            Resolution("30m"),
            Resolution("20m"),
            Resolution("15m"),
            Resolution("10m"),
            Resolution("06m"),
            Resolution("05m", tiled=True),
            Resolution("04m", tiled=True),
            Resolution("03m", tiled=True),
            Resolution("02m", tiled=True),
            Resolution("01m", registrations=["gridline"], tiled=True),
        ],
    ),

which I think is more compact and more readable.

@seisman seisman added maintenance Boring but important stuff for the core devs needs review This PR has higher priority and needs review. labels Dec 26, 2023
@seisman seisman added this to the 0.11.0 milestone Jan 1, 2024
@seisman seisman changed the title from "POC: Refactor the internal _load_remote_dataset function to simplify datasets' definitions" to "Refactor the internal _load_remote_dataset function to simplify datasets' definitions" Jan 2, 2024
@seisman seisman marked this pull request as ready for review January 2, 2024 13:08
pygmt/datasets/load_remote_dataset.py (outdated)
Comment on lines 278 to 281
    -    if resolution not in dataset.resolutions:
    +    for res in dataset.resolutions:
    +        if res.code == resolution:
    +            valid_registrations = res.registrations
    +            is_tiled = res.tiled
    +            break
    +    else:
             raise GMTInvalidInput(f"Invalid resolution '{resolution}'.")
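For readers unfamiliar with the for/else construct in the snippet under review, the following self-contained sketch wraps the same loop in a hypothetical lookup function. The Resolution class and the GMTInvalidInput exception here are local stand-ins so the example runs without PyGMT installed.

```python
from typing import NamedTuple


class GMTInvalidInput(Exception):
    """Local stand-in for PyGMT's exception of the same name."""


class Resolution(NamedTuple):
    code: str
    registrations: tuple[str, ...] = ("gridline", "pixel")
    tiled: bool = False


def lookup(resolutions: list[Resolution], resolution: str) -> tuple[tuple[str, ...], bool]:
    """Hypothetical helper: linear scan over a list of Resolution objects."""
    for res in resolutions:
        if res.code == resolution:
            valid_registrations = res.registrations
            is_tiled = res.tiled
            break
    else:
        # Runs only if the loop finished without hitting "break",
        # i.e. no entry matched the requested resolution code.
        raise GMTInvalidInput(f"Invalid resolution '{resolution}'.")
    return valid_registrations, is_tiled


RESOLUTIONS = [Resolution("01d"), Resolution("05m", tiled=True)]
print(lookup(RESOLUTIONS, "05m"))
```

The scan visits entries one by one, so its cost grows with the number of resolutions, which is the concern raised in the review.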
Member

@weiji14 weiji14 Jan 7, 2024

This for-loop seems a lot less efficient than the previous if resolution not in dataset.resolutions dictionary-based lookup. I almost think we should go with the intermediate {"01d": Resolution(), ...} dictionary you mentioned at #2917 (comment), or come up with a better data structure than a list of Resolution NamedTuples. I almost feel like the datasets variable could be a nested JSON or a multi-level pandas.DataFrame object (but maybe that's overkill).

Member

Alternatively, we could just keep the datasets variable as is, but bring in all the type hint stuff. The remote datasets aren't really updated that often, though I know you're working on those Moon/Venus/Mars PRs.

Member Author

> This for-loop seems a lot less efficient than the previous if resolution not in dataset.resolutions dictionary-based lookup.

Yes, it's almost 10 times slower.
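One way such a comparison could be measured is sketched below with the standard-library timeit module. The stand-in class, the helper names scan and hashed, and the repetition count are all assumptions of this sketch. The ratio it prints depends on the machine and Python version, so no particular figure is claimed here.

```python
import timeit
from typing import NamedTuple, Optional


class Resolution(NamedTuple):
    code: str
    registrations: tuple[str, ...] = ("gridline", "pixel")
    tiled: bool = False


CODES = ["01d", "30m", "20m", "15m", "10m", "06m", "05m", "04m", "03m", "02m", "01m"]
as_list = [Resolution(code) for code in CODES]
as_dict = {res.code: res for res in as_list}


def scan(code: str) -> Optional[Resolution]:
    """Linear search through the list, as in the for-loop under review."""
    for res in as_list:
        if res.code == code:
            return res
    return None


def hashed(code: str) -> Optional[Resolution]:
    """Hash-based lookup in the dictionary."""
    return as_dict.get(code)


# Look up the last entry, which is the worst case for the linear scan.
t_list = timeit.timeit(lambda: scan("01m"), number=20_000)
t_dict = timeit.timeit(lambda: hashed("01m"), number=20_000)
print(f"list scan: {t_list:.4f} s, dict lookup: {t_dict:.4f} s")
```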

> I almost think we should go with the intermediate {"01d": Resolution(), ...} dictionary you mentioned at #2917 (comment), or come up with a better data structure than a list of Resolution NamedTuples. I almost feel like the datasets variable could be a nested JSON or a multi-level pandas.DataFrame object (but maybe that's overkill).

What about a dictionary:

{
    "01d": Resolution("01d"),
}

It's clearer than:

{
    "01d": Resolution(),
}

Member

> What about a dictionary:
>
>     {
>         "01d": Resolution("01d"),
>     }

Yes, that could work.
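A minimal sketch of the dictionary shape agreed on here, again using local stand-ins rather than the real PyGMT classes. Because the code now appears both as the key and inside the Resolution object, the hypothetical mismatched_keys helper below shows one way the two could be cross-checked; it is an illustration, not something taken from the PR.

```python
from typing import NamedTuple


class GMTInvalidInput(Exception):
    """Local stand-in for PyGMT's exception of the same name."""


class Resolution(NamedTuple):
    code: str
    registrations: tuple[str, ...] = ("gridline", "pixel")
    tiled: bool = False


resolutions = {
    "01d": Resolution("01d"),
    "05m": Resolution("05m", tiled=True),
    "01m": Resolution("01m", registrations=("gridline",), tiled=True),
}


def mismatched_keys(table: dict[str, Resolution]) -> list[str]:
    """Hypothetical check: keys whose Resolution.code differs from the key."""
    return [key for key, res in table.items() if key != res.code]


def lookup(table: dict[str, Resolution], resolution: str) -> tuple[tuple[str, ...], bool]:
    """Dictionary-based lookup with the same error as the original membership test."""
    if resolution not in table:
        raise GMTInvalidInput(f"Invalid resolution '{resolution}'.")
    res = table[resolution]
    return res.registrations, res.tiled


print(mismatched_keys(resolutions))
print(lookup(resolutions, "01m"))
```

This keeps the constant-time membership test while each entry stays readable on its own.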

@seisman seisman requested a review from a team January 12, 2024 07:02
@seisman seisman merged commit 0a3b46d into main Jan 14, 2024
8 of 16 checks passed
@seisman seisman deleted the refactor/load_remote_dataset branch January 14, 2024 08:43
@seisman seisman removed the needs review This PR has higher priority and needs review. label Jan 14, 2024
Labels
maintenance Boring but important stuff for the core devs
2 participants